Deployment issues and SC3
Jeremy Coles
GridPP Tier-2 Board and Deployment Board Glasgow, 1st June 2005
Current deployment issues
Main GridPP concerns:
• gLite migration, fabric management & future of YAIM
• dCache
• Data migration – classic SE to SRM SE
• Security
• Ganglia deployment
• Use of ticketing system
• Use of UK testzone

General:
• Jobs at sites – improving (nb. Freedom of Choice is coming!)
• Few general EGEE VOs supported at GridPP sites
2nd LCG Operations Workshop
• Took place in Bologna last week: http://infnforge.cnaf.infn.it/cdsagenda//fullAgenda.php?ida=a0517
• Covered the following areas:
– Daily operations
– Pre-production service
– gLite deployment and migration
– Future monitoring (metrics)
– Interoperation with OSG
– User support (Executive Support Committee!)
– VO management processes
– Fabric management
– Accounting (DGAS and APEL)
– Little on security! Romain presented potential tools.
LCG-2_4_0
[Chart: sites on LCG-2_4_0 (info-system based) vs. days since release, plotted against the plan]
CPUs: 2_4_0: 10642; 2_3_1: 912; 2_3_0: 2167
Version change in the last 100 days
[Chart: number of sites vs. days, for all sites in LCG-2; series: all, 2_4_0, 2_3_1, 2_3_0, other. "Other": sites on older versions or down]
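Charts like these are derived from periodic information-system queries. A minimal sketch of the counting step, assuming hypothetical (day, site, version) records rather than a live BDII query:

```python
from collections import Counter

# Hypothetical daily info-system snapshots: (day, site, version).
records = [
    (1, "site-a", "LCG-2_3_0"), (1, "site-b", "LCG-2_3_1"),
    (2, "site-a", "LCG-2_4_0"), (2, "site-b", "LCG-2_3_1"),
]

def sites_per_version(records, day):
    """Count how many sites report each release on a given day."""
    return Counter(version for d, _, version in records if d == day)

for day in (1, 2):
    print(day, dict(sites_per_version(records, day)))
```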
Regions with fewer than 5 sites are not shown.
[Charts: number of sites on LCG-2 over the last 100 days, per region: Canada, Russia, Italy, Germany/Switzerland, France, Asia Pacific, Northern, SW, Central, SE and UKI]
LCG-2_4_0
Lessons learned:
• Harder than expected (rate independent of packaging)
• Differences between regions --> ROCs matter
• Release definition is non-trivial with 3-month intervals
– Component dependencies: X without Y and V is useless…
• During certification we still find problems
• Both upgrades and installations from scratch are needed (time consuming)
• Test pilots for deployment are useful
• Early announcement of releases is useful
• We need to introduce “updates” via APT to fix bugs that show up during deployment
• Number of sites is the wrong metric for measuring success
– CPUs on the new release need to be tracked, not sites (see the sketch below)
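A minimal sketch of the CPU-weighted metric suggested in the last bullet. The release totals match the earlier slide (10642 / 912 / 2167 CPUs), but the site names and the per-site split are hypothetical:

```python
# Hypothetical info-system snapshot: (site, release, cpu_count).
snapshot = [
    ("site-a", "LCG-2_4_0", 800),
    ("site-b", "LCG-2_4_0", 9842),
    ("site-c", "LCG-2_3_1", 912),
    ("site-d", "LCG-2_3_0", 2167),
]

def adoption_by_cpu(snapshot, release):
    """Fraction of all CPUs running the given release."""
    total = sum(cpus for _, _, cpus in snapshot)
    on_release = sum(cpus for _, rel, cpus in snapshot if rel == release)
    return on_release / total

def adoption_by_site(snapshot, release):
    """Fraction of sites on the release (the misleading metric)."""
    return sum(1 for _, rel, _ in snapshot if rel == release) / len(snapshot)

print(f"by CPU:  {adoption_by_cpu(snapshot, 'LCG-2_4_0'):.0%}")   # ~78%
print(f"by site: {adoption_by_site(snapshot, 'LCG-2_4_0'):.0%}")  # 50%
```

The two numbers diverge whenever large sites upgrade early or late, which is exactly why site count alone misleads.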
The next release
• Why?
– SC3 is approaching and the needed components are not deployed at the sites
• What?
– File transfer service (will need VDT 1.2.2): servers for the Tier-1s and Tier-0, clients for the rest
– Improved monitoring sensors for GridFTP
– RFC proxy extension for VOMS
– New version of the GLUE schema (compatible)
– LFC production service
– Interoperability with GRID3/OSG
– User-level stdio monitoring (maybe later)
– Bug fixes…… as always
• When?
– Aimed at mid-June
• Who?
– Tier-1 centres and Tier-2 centres participating in SC3: as fast as possible
– Others: at their own pace
• Updated release (fixes from the 1st release) expected by July 1st.
Coexistence & Extended Pre-Production
[Diagram: a site running LCG and gLite services side by side: shared VOMS, myProxy and LFC; gLite FIREMAN, gLite SRM-SE, gLite WLM and the LCG RB; gLite-IO, gLite-CE and LCG CE; FTS; R-GMA; BD-II; DGAS/APEL; shared UIs and WNs (LCG + gLite)]
• Data from LCG is owned by VO and role; the gLite-IO service owns gLite data
• FTS for LCG uses the user proxy; gLite uses a service cert
• R-GMAs can be merged (security ON)
• CEs use the same batch system
• Independent information systems
• Catalogue and access control
Gradual Transition 1
[Diagram: shared VOMS, myProxy and LFC; gLite SRM-SE; gLite WLM and the LCG RB; gLite-CE and LCG CE; FTS; R-GMA; BD-II; DGAS/APEL; shared UIs and WNs]
• Optional additional WLM (gLite)
• Data management via LCG
• Optional DGAS accounting
• FTS for LCG uses the user proxy; gLite uses a service cert
• CEs use the same batch system
Gradual Transition 2
[Diagram: shared VOMS, myProxy and LFC; FIREMAN; gLite SRM-SE; gLite WLM; gLite-CE; FTS; R-GMA; BD-II; DGAS/APEL; shared UIs and WNs]
• LCG WLM removed
• Optional catalogue (FIREMAN)
• R-GMA in gLite mode
Gradual Transition 3
[Diagram: as in transition 2, with gLite-IO added alongside FTS]
• Adding gLite-IO: a second path to data
• Additional security model: data from LCG is owned by VO and role; the gLite-IO service owns gLite data
• Data migration phase
Gradual Transition 4
[Diagram: as in transition 3, with the BD-II being phased out]
• Finalize the switch to the new security model
• LFC is now a local catalogue under VO control
• BDII later replaced by R-GMA
Metrics - EGEE
• General agreement on the concept; detailed discussions on:
– Time windows: sliding windows (week, month, 3 months); see the sketch after this list
– Quantities to watch (RCs, ROCs, CICs, …): ROCs based on RCs; CICs based on services; release quality has to be measured
• To make progress: a workgroup to define quantities
– Organized by: Ognjen Prnjat ([email protected])
– Small (~5): Ognjen, Markus, Helene, Jeff T. and Jeremy
– Ognjen will collect input
– ROCs, CICs and the OMC have to agree on ONE set of quantities
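A minimal sketch of the sliding-window idea, computing a site's pass fraction over week/month/3-month windows; the daily pass/fail test data and its shape are hypothetical:

```python
# Hypothetical daily test results for one site: day -> passed (True/False).
results = {d: (d % 7 != 0) for d in range(1, 101)}

def availability(results, end_day, window_days):
    """Fraction of passing days in a sliding window ending at end_day."""
    days = [d for d in range(end_day - window_days + 1, end_day + 1)
            if d in results]
    return sum(results[d] for d in days) / len(days)

for window in (7, 30, 90):  # week, month, 3 months
    print(f"{window:3d}-day window: {availability(results, 100, window):.1%}")
```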
Operations summary
• CIC-on-duty is now well established
– COD is just 6 months old!
– Tools have evolved at a dramatic pace (portal, SFT, …) with many rapid iterations
– A truly distributed effort
• Integration of the new COD partner (Russia) went smoothly
• Tuning of procedures is an ongoing process: no dramatic changes (take resource size more into account)
Accounting
• Last November this was still an area of concern
• APEL is now well established
– Support for batch systems is improving
– Several privacy-related problems have been understood and solved
• gLite accounting: DGAS
– Some concerns about the amount of information published; can this be handled by proper authorization?
– Collaboration with APEL on batch sensors (BQS, Condor, …); DGAS agreed to provide them (a toy sensor sketch follows below)
– Will be introduced initially on a voluntary basis; sites will give feedback (including privacy issues)
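To illustrate what a batch sensor of the APEL/DGAS kind does, here is a toy sketch that turns a PBS-style job-end accounting line into a usage record. The line format and field names are simplified and illustrative, not the exact PBS format or the APEL/DGAS schema:

```python
import re

# Simplified PBS-style accounting line (illustrative, not the exact format):
# date;record_type;job_id;key=value ...
SAMPLE = ("06/01/2005 10:22:31;E;1234.ce.example.org;user=alice "
          "resources_used.cput=01:02:03 resources_used.walltime=02:00:00")

def hms_to_seconds(hms):
    h, m, s = (int(x) for x in hms.split(":"))
    return 3600 * h + 60 * m + s

def parse_end_record(line):
    """Extract a usage record from a job-end ('E') accounting line."""
    date, rtype, jobid, rest = line.split(";", 3)
    if rtype != "E":
        return None  # only completed jobs carry usage
    fields = dict(re.findall(r"(\S+)=(\S+)", rest))
    return {
        "job": jobid,
        "user": fields["user"],
        "cpu_s": hms_to_seconds(fields["resources_used.cput"]),
        "wall_s": hms_to_seconds(fields["resources_used.walltime"]),
    }

print(parse_end_record(SAMPLE))
```

A real sensor would tail the accounting log, normalise CPU time against a benchmark, and publish the records to the accounting service; the privacy concerns above are about which of these fields get published.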
Current deployment issues (recap)
Main GridPP concerns:
• gLite migration, fabric management & future of YAIM
• dCache
• Data migration – classic SE to SRM SE
• Security
• Ganglia deployment
• Use of ticketing system
• Use of UK testzone

General:
• Jobs at sites – improving (nb. Freedom of Choice is coming!)
• Few general EGEE VOs supported at GridPP sites
Freedom of choice - VO Page
Service Challenge 3
SC timelines
[Timeline 2005–2008: SC2 → SC3 → SC4 → LHC service operation; cosmics, first beams, first physics, full physics run]
• Jun05 – Technical Design Report
• Sep05 – SC3 service phase
• May06 – SC4 service phase
• Sep06 – Initial LHC service in stable operation
• Apr07 – LHC service commissioned
• SC2 – reliable data transfer (disk-network-disk): 5 Tier-1s, aggregate 500 MB/s sustained at CERN
• SC3 – reliable base service: most Tier-1s, some Tier-2s; basic experiment software chain; grid data throughput 500 MB/s, including mass storage (~25% of the nominal final throughput for the proton period; see the sanity check below)
• SC4 – all Tier-1s, major Tier-2s; capable of supporting the full experiment software chain, including analysis; sustain the nominal final grid data throughput
• LHC service in operation – September 2006; ramp up to full operational capacity by April 2007; capable of handling twice the nominal data throughput
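A quick sanity check on the timeline numbers: if SC3's 500 MB/s is ~25% of the nominal final throughput for the proton period, the implied nominal rate is ~2 GB/s, and "twice the nominal data throughput" for the commissioned service is ~4 GB/s:

```python
sc3_rate_mb_s = 500                    # SC3 grid data throughput target
nominal_mb_s = sc3_rate_mb_s / 0.25    # SC3 is ~25% of nominal
commissioned_mb_s = 2 * nominal_mb_s   # service must handle twice nominal
print(nominal_mb_s, commissioned_mb_s)  # 2000.0 4000.0 MB/s
```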
Service Challenge 3 - Phases
High-level view:
• Throughput phase
– 2 weeks sustained in July 2005 ("obvious target": the GDB of July 20th)
– Primary goals: 150 MB/s disk to disk to Tier-1s; 60 MB/s disk (T0) to tape (T1s); see the volume estimate after this list
– Secondary goals: include a few named T2 sites (T2 -> T1 transfers); encourage remaining T1s to start disk-to-disk transfers
• Service phase
– September to end of 2005
– Start with ALICE & CMS; add ATLAS and LHCb in October/November
– All offline use cases except analysis
– More components: WMS, VOMS, catalogues, experiment-specific solutions
– Implies a production setup (CE, SE, …)
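For scale, the primary throughput-phase goals imply data volumes of roughly the following order over the two sustained weeks (taking the stated rates at face value):

```python
seconds = 14 * 24 * 3600                # two sustained weeks
disk_tb = 150 * seconds / 1e6           # 150 MB/s disk-to-disk, in TB
tape_tb = 60 * seconds / 1e6            # 60 MB/s disk-to-tape, in TB
print(f"disk-disk: ~{disk_tb:.0f} TB, disk-tape: ~{tape_tb:.0f} TB")
# disk-disk: ~181 TB, disk-tape: ~73 TB
```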
SC implications
• SC3 will involve the Tier-1 sites (plus a few large Tier-2s) in July
– Must have the release to be used in SC3 available in mid-June
– Involved sites must upgrade for July
– Not reasonable to expect those sites to commit to other significant work (pre-production etc.) on that timescale
– T1s: ASCC, BNL, CCIN2P3, CNAF, FNAL, GridKA, NIKHEF/SARA, RAL and …
• Expect the SC3 release to include FTS, LFC and DPM, but otherwise be very similar to LCG-2.4.0
• September–December: experiment "production" verification of SC3 services; in parallel, set up for SC4
• Expect the "normal" support infrastructure (CICs, ROCs, GGUS) to support service-challenge usage
• Bio-med is also planning data challenges
– Must make sure these are all correctly scheduled
SC3 issues
• The Tier-1 network is being extensively re-configured. Tests showed up to 40% packet loss! Waiting for UKLight to be fixed. Not intending to use dual-homing, but dCache have provided a solution (on the impact of such loss rates, see the sketch below)
• The Lancaster link is up at the link level. What is the bandwidth of the Lancaster connection?
• Edinburgh: hardware problem with the RAID array to be used as the SE; IBM is investigating
• Lancaster has set up a test system and is now deploying more hardware
• Need clarification on the classification of volatile vs. permanent data in respect of Tier-2s
• The file transfer service should be ready now, but has problems with the client component
• RAL would like a longer period for testing tape than suggested in the SC3 plans
• There has been an issue with CMS preferring to use PhEDEx rather than FTS for transfers. We need to add to the plans a period for PhEDEx-only transfer tests
• The dCache mailing list is very active now. There have been problems with the installation scripts
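On the first bullet: the standard Mathis et al. approximation for TCP throughput, rate ≲ MSS / (RTT · √p), shows why 40% packet loss is fatal for bulk transfers. The MSS and RTT below are illustrative assumptions, not measured values:

```python
from math import sqrt

def tcp_throughput_mb_s(mss_bytes, rtt_s, loss):
    """Mathis et al. bound: throughput <= MSS / (RTT * sqrt(p))."""
    return mss_bytes / (rtt_s * sqrt(loss)) / 1e6

# Illustrative numbers: 1460-byte MSS, 10 ms RTT within the UK.
for loss in (0.0001, 0.01, 0.4):
    print(f"loss {loss:>7.2%}: <= {tcp_throughput_mb_s(1460, 0.010, loss):8.2f} MB/s")
# loss   0.01%: <=    14.60 MB/s
# loss   1.00%: <=     1.46 MB/s
# loss  40.00%: <=     0.23 MB/s
```

At 40% loss a single TCP stream is bounded well below 1 MB/s, hopeless against a 150 MB/s target, so the network must be fixed rather than worked around.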
SC3 issues continued
• We have questions about whether FTS uses SRM-put or SRM-cp
• From September onwards, the SC3 infrastructure is to provide a production-quality service for all experiments; remember the comments about UKLight being a research network: a risk!?
• Engagement with the experiments differs. Edinburgh needs a better relationship with LHCb
• There is an LCG workshop in mid-June where the experiment plans should be almost final!
• GridPP needs to do more load testing than is anticipated in SC3
• Planning for SC4 needs to start soon. Currently we are pushing dCache, but DPM is also supposed to be available
Imperial (London Tier-2)
• SRM/dCache status
– Production server installed: gfe02.hep.ph.ic.ac.uk (information provider still in development)
– 1.5 TB pool node added: RHEL 4, 64-bit system, installed using the dcache.org instructions (http://www.dcache.org/downloads/dCache-instructions.txt)
– Extra 1.5 TB ready to add when CMS is ready
– 6 TB being purchased; should be in place by the start of the Setup Phase
• CMS software
– Service node provided
– PhEDEx installed
– Confirmation on the FTS/PhEDEx issue sought
Edinburgh
Current LCG production setup:
• Compute Element (CE), Classic Storage Element (SE) and 3 Worker Nodes (2 machines, 3 CPUs), running LCG 2.4.0; monitoring takes place on the SE. About to add 2 Worker Nodes (2 CPUs in 1 machine), with a User Interface (UI) in testing. A 22 TB datastore is available.

Plans:
• £2000 available for 2 machines: one for dCache work and one to connect to EPCC's SAN (10 TB promised). Considering the procurement of more WNs, but have no clear requirements from LHCb.
Lancaster (current)
Lancaster (planned)
1. LightPath and terminal end-box installed.
2. Still require some hardware for our internal network topology.
3. Increase in storage to ~84 TB, possibly ~92 TB with a working resilient dCache from the CE.
Other areas…
JRA4 request
• We have some idea of the requirements from networking experts within JRA4
• Draft requirements document: https://edms.cern.ch/document/593620/1
• Draft use-case document: https://edms.cern.ch/document/591777/1
• We’re looking for more input from NOCs and GOCs
• If you have requirements, use cases or opinions on interfaces or needed metrics, please send them to us
• Even if you don’t have ideas at the moment but would like to be involved in the process, please get in contact
• Contact details are at the end of the talk
DTEAM discussion
• Review of team objectives – what is the team focus for the next 3 & 5 months?
• Communications with the experiments
• Using a project tool to work better as a team
• Metrics!!
• Review of plans and what needs to be done to keep them up to date, including GridPP challenges and SC4
• Web-page status
• Areas raised at the T2B and DB meetings
• Security challenge involvement
• Accounting – status and making further progress
• Libraries and understanding experiment needs
• Review of dCache efforts
• Address issues with quarterly reports & weekly reports
• Next release, test-zone and test-zone machines
• Data management – guidelines required
• Improving robustness
• GI – documentation (esp. releases), multi-Tier R-GMA, introducing new sites, LCFGng distribution (Kickstart & PXE boot…), jobs – how to get …