David Quarrie (LBNL): The ATLAS Experiment
Overview
• The ATLAS Detector and Physics Goals
• The ATLAS Collaboration and Management
• The Trigger and Data Acquisition System
• The Computing and Software
• Software Deployment and Production
• Reality Checks and Stress Testing
• Summary
LHC
• √s = 14 TeV (7 times higher than Tevatron/Fermilab)
  → search for new massive particles up to m ~ 5 TeV
• L_design = 10^34 cm^-2 s^-1 (>10^2 higher than Tevatron/Fermilab)
  → search for rare processes with small σ (N = Lσ)
• 27 km ring, used for the e+e- LEP machine in 1989-2000
• Start: Summer 2007

Experiments:
• ATLAS and CMS: pp, general purpose
• ALICE: heavy ions
• LHCb: pp, B-physics
The ATLAS physics goals
Search for the Standard Model Higgs boson over the range ~115 GeV < m_H < 1000 GeV
Search for physics beyond the SM (Supersymmetry, q/l compositeness, leptoquarks, W’/Z’, heavy q/l, Extra-dimensions, ….) up to the TeV-range
Precise measurements:
-- W mass
-- top mass, couplings and decay properties
-- Higgs mass, spin, couplings (if the Higgs is found)
-- B-physics (complementing LHCb): CP violation, rare decays, B0 oscillations
-- QCD jet cross-section and αs
-- etc.

Study of the phase transition at high density from hadronic matter to a plasma of deconfined quarks and gluons (complementing ALICE). The transition plasma → hadronic matter happened in the universe ~10^-5 s after the Big Bang.
Cross Sections and Production Rates
Rates for L = 10^34 cm^-2 s^-1 (LHC):
• Inelastic proton-proton reactions: 10^9 / s
• bb pairs: 5×10^6 / s
• tt pairs: 8 / s
• W → eν: 150 / s
• Z → ee: 15 / s
• Higgs (150 GeV): 0.2 / s
• Gluinos, squarks (1 TeV): 0.03 / s
LHC is a factory for: top-quarks, b-quarks, W, Z, ……. Higgs, ……
(The only problem: you have to detect them !)
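The rates above follow directly from N = Lσ. A minimal sketch of the arithmetic, where the cross-section values are illustrative assumptions chosen to reproduce the rates quoted on the slide, not official numbers:

```python
# Event rate N = L * sigma at the LHC design luminosity.
# Cross sections below are rough illustrative values, not official ones.

L = 1e34            # design luminosity, cm^-2 s^-1
MB = 1e-27          # 1 millibarn in cm^2
NB = 1e-33          # 1 nanobarn in cm^2
PB = 1e-36          # 1 picobarn in cm^2

def rate(sigma_cm2):
    """Events per second: N = L * sigma."""
    return L * sigma_cm2

inelastic = rate(100 * MB)   # ~100 mb inelastic pp cross section (assumed)
w_enu     = rate(15 * NB)    # sigma x BR for W -> e nu, ~15 nb (assumed)
higgs     = rate(20 * PB)    # ~20 pb for a 150 GeV Higgs (assumed)

print(f"inelastic: {inelastic:.0e}/s")   # ~1e9 per second
print(f"W -> e nu: {w_enu:.0f}/s")       # ~150 per second
print(f"Higgs:     {higgs:.1f}/s")       # ~0.2 per second
```

The nine orders of magnitude between the inelastic rate and the Higgs rate are exactly why the trigger system described later is needed.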
The Underground Cavern at Pit-1 for the ATLAS Detector
Length = 55 m, Width = 32 m, Height = 35 m
ATLAS
Length: ~46 m, Radius: ~12 m, Weight: ~7000 tons
~10^8 electronic channels, ~3000 km of cables
• Tracking (|η| < 2.5, B = 2 T):
  -- Si pixels and strips
  -- Transition Radiation Detector (e/π separation)
• Calorimetry (|η| < 5):
  -- EM: Pb-LAr
  -- HAD: Fe/scintillator (central), Cu/W-LAr (fwd)
• Muon Spectrometer (|η| < 2.7): air-core toroids with muon chambers
ATLAS superimposed on the 5 floors of building 40
H → ZZ(*) → 4ℓ (ℓ = e, µ)

[Feynman diagram: gg → H via a top-quark loop, H → Z Z(*) → four leptons (e, µ), reconstructing mZ]
“Gold-plated” channel for Higgs discovery at LHC
Simulation of a H → µµ ee event in ATLAS
Signal expected in ATLAS after 1 year of LHC operation
Physics example
Inner Detector (ID)
The Inner Detector (ID) is organized into four sub-systems:
• Pixels (0.8×10^8 channels)
• SemiConductor Tracker (SCT) (6×10^6 channels)
• Transition Radiation Tracker (TRT) (4×10^5 channels)
• Common ID items
Pixels
All FE chips have been delivered (all tested, showing a yield of 82%)
The sensor production is finished for 2 layers, and on time for 3 layers
The module production rate (with bump-bonding at two industrial vendors) has improved, and is on track for 3 layers in time
First completed disk (two layers of 24 modules each, with 2,200,000 channels of electronics)
ATLAS plans to have the Pixels operational for LHC start-up
The series production of final staves (barrel) and sectors (end-cap disks) has passed the 10% mark; this activity is now on the critical path of the Pixel project
Inner Detector Progress Summary
Pixels: Steady ‘on-schedule’ progress on all aspects of the sub-system for 3 layers
SCT: Module mounting (‘macro-assembly’) on the 4 barrel cylinders ongoing (the first two cylinders are finished and tested, and one is at CERN)
The module mounting progressing on the forward disks (the first 8 disks are completed)
We still have to recover from a problem with the LMTs (low-mass tapes for the services)
TRT: Barrel module mounting into support structure is completed
End-cap wheel production is now also smooth, and the stacking at CERN into the end-cap structures is progressing
TRT barrel support with all modules
First complete SCT barrel cylinder
LAr and Tile Calorimeters
Tile barrel
Tile extended barrel
LAr forward calorimeter (FCAL)
LAr hadronic end-cap (HEC)
LAr EM end-cap (EMEC)
LAr EM barrel
LAr EM Barrel Calorimeter and Solenoid Commissioning at the Surface
The barrel EM calorimeter is installed in the cryostat, and after insertion of the solenoid, the cold vessel was closed and welded
A successful complete cold test (with LAr) was made during summer 2004 in hall 180
At the end of October the cryostat was transported to the pit and lowered into the cavern
LAr barrel EM calorimeter after insertion into the cryostat
Solenoid just before insertion into the cryostat
The preparations for the installation of the fifth barrel toroid (BT) coil in the cavern are well-advanced
The warm structure components production is nearing completion, matching the required schedule
ATLAS Collaboration
34 Countries, 151 Institutions, 1770 Scientific Authors
Albany, Alberta, NIKHEF Amsterdam, Ankara, LAPP Annecy, Argonne NL, Arizona, UT Arlington, Athens, NTU Athens, Baku, IFAE Barcelona, Belgrade, Bergen, Berkeley LBL and UC, Bern, Birmingham, Bonn, Boston, Brandeis, Bratislava/SAS Kosice,
Brookhaven NL, Bucharest, Cambridge, Carleton, Casablanca/Rabat, CERN, Chinese Cluster, Chicago, Clermont-Ferrand, Columbia, NBI Copenhagen, Cosenza, INP Cracow, FPNT Cracow, Dortmund, JINR Dubna, Duke, Frascati, Freiburg, Geneva,
Genoa, Glasgow, LPSC Grenoble, Technion Haifa, Hampton, Harvard, Heidelberg, Hiroshima, Hiroshima IT, Indiana, Innsbruck, Iowa SU, Irvine UC, Istanbul Bogazici, KEK, Kobe, Kyoto, Kyoto UE, Lancaster, Lecce, Lisbon LIP, Liverpool, Ljubljana,
QMW London, RHBNC London, UC London, Lund, UA Madrid, Mainz, Manchester, Mannheim, CPPM Marseille, Massachusetts, MIT, Melbourne, Michigan, Michigan SU, Milano, Minsk NAS, Minsk NCPHEP, Montreal, FIAN Moscow, ITEP Moscow, MEPhI Moscow, MSU Moscow, Munich LMU, MPI Munich, Nagasaki IAS, Naples, Naruto UE, New Mexico, Nijmegen,
BINP Novosibirsk, Ohio SU, Okayama, Oklahoma, LAL Orsay, Oslo, Oxford, Paris VI and VII, Pavia, Pennsylvania, Pisa, Pittsburgh, CAS Prague, CU Prague, TU Prague, IHEP Protvino, Ritsumeikan, UFRJ Rio de Janeiro, Rochester, Rome I, Rome II, Rome III,
Rutherford Appleton Laboratory, DAPNIA Saclay, Santa Cruz UC, Sheffield, Shinshu, Siegen, Simon Fraser Burnaby, Southern Methodist Dallas, NPI Petersburg, Stockholm, KTH Stockholm, Stony Brook, Sydney, AS Taipei, Tbilisi, Tel Aviv,
Thessaloniki, Tokyo ICEPP, Tokyo MU, Tokyo UAT, Toronto, TRIUMF, Tsukuba, Tufts, Udine, Uppsala, Urbana UI, Valencia, UBC Vancouver, Victoria, Washington, Weizmann Rehovot, Wisconsin, Wuppertal, Yale, Yerevan
ATLAS Appointments (March 2005)

• ATLAS Plenary Meeting
• Collaboration Board (Chair: S. Bethke; Deputy: C. Oram)
  -- CB Chair Advisory Group
• Resources Review Board
• Spokesperson: P. Jenni (Deputies: F. Gianotti and S. Stapnes)
• Technical Co-ordinator: M. Nessi
• Resources Co-ordinator: M. Nordberg
• Executive Board
  -- Inner Detector: L. Rossi, K. Einsweiler, M. Tyndel, F. Dittus
  -- LAr Calorimeter: H. Oberlack, D. Fournier, J. Parsons
  -- Tile Calorimeter: B. Stanek
  -- Muon Instrumentation: G. Mikenberg, F. Taylor, S. Palestini
  -- Magnet System: H. ten Kate
  -- Trigger/DAQ: C. Bee, N. Ellis, L. Mapelli
  -- Computing Co-ordination: D. Barberis, D. Quarrie
  -- Electronics Co-ordination: P. Farthouat
  -- Physics Co-ordination: G. Polesello
  -- Additional Members: H. Gordon, A. Zaitsev
ATLAS Trigger

[Plot: event rate (Hz, from 10^-4 to 10^8) vs decision time, from the 25 ns bunch spacing and µs-scale pipelines through ms and seconds (on-line) out to hours and years (off-line reconstruction & analyses at the Tier-0/1/2 centers); physics processes from QED through W, Z, Z*, top to the Higgs span many orders of magnitude in rate.]

• Level-1 Trigger: 40 MHz input; hardware (ASIC, FPGA); massively parallel architecture; pipelines (~2 µs)
• Level-2 Trigger: ~75 kHz input; software PC farm; local reconstruction (~10 ms)
• Level-3 Trigger: ~1 kHz input; software PC farm; full reconstruction (~1 s)
ATLAS Trigger Hierarchy
• The ATLAS trigger comprises 3 levels
– LVL1
  • Custom electronics: ASICs, FPGAs
  • Max. latency 2.5 µs
  • Uses calorimeter and muon detector data
  • Reduces the interaction rate to 75 kHz
– LVL2
  • Software trigger based on a Linux PC farm (~500 dual CPUs)
  • Mean processing time ~10 ms
  • Uses selected data from all detectors (Regions of Interest indicated by LVL1)
  • Reduces the LVL1 rate to ~1 kHz
– Event Filter
  • Software trigger based on a Linux PC farm (~1600 dual CPUs)
  • Mean processing time ~1 s
  • Full event & calibration data available
  • Reduces the LVL2 rate to ~200 Hz
  • Note: a large fraction of the HLT processor cost has been deferred, so initial running will be with reduced computing capacity
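The quoted farm sizes are roughly consistent with the input rates and mean processing times. A back-of-envelope sketch, assuming each "dual CPU" box contributes two processing slots and ideal load balancing (both assumptions, not from the slides):

```python
# Check that the quoted HLT farm sizes can absorb the quoted input rates
# given the mean per-event processing times.
# Assumes 2 processing slots per "dual CPU" box and perfect load balancing.

def farm_capacity_hz(boxes, cpus_per_box, mean_time_s):
    """Events/s a farm can sustain if every CPU processes one event at a time."""
    return boxes * cpus_per_box / mean_time_s

lvl2_capacity = farm_capacity_hz(500, 2, 10e-3)   # ~500 dual-CPU PCs, ~10 ms/event
ef_capacity   = farm_capacity_hz(1600, 2, 1.0)    # ~1600 dual-CPU PCs, ~1 s/event

print(f"LVL2 capacity: {lvl2_capacity:.0f} Hz vs 75 kHz input")
print(f"EF capacity:   {ef_capacity:.0f} Hz vs 1-2 kHz input")
```

Both farms come out with headroom over their nominal input rates, which is what makes the deferral of part of the HLT processor purchase workable for initial low-luminosity running.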
ATLAS Trigger & DAQ Architecture
[Diagram: Trigger and Dataflow architecture. Detector front-end pipelines run at 40 MHz (~1 PB/s equivalent). LVL1 (specialized hardware: ASICs, FPGAs; 2.5 µs) accepts 75 kHz into the Read-Out Drivers (RODs), Read-Out Links and Read-Out Buffers (ROBs) of the Read-Out Sub-systems (ROS), at 120 GB/s. The RoI Builder (ROIB) and L2 Supervisor (L2SV) feed the L2 Processing Units (L2P), which request RoI data (1-2% of the event) over the L2 network; LVL2 (~10 ms) accepts ~2 kHz. The Dataflow Manager (DFM) and Sub-Farm Inputs (SFI) perform event building (~2+4 GB/s) over the Event Builder network, and the Event Filter processors (EFP, ~sec, ~4 GB/s in) accept ~0.2 kHz to the Sub-Farm Outputs (SFO), with ~300 MB/s to mass storage.]
ATLAS Three Level Trigger Architecture
• LVL1 (2.5 µs): decision made with coarse-granularity calorimeter data and muon trigger chamber data.
  • Buffering on detector
• LVL2 (~10 ms): uses Region of Interest data (ca. 2%) with full granularity and combines information from all detectors; performs fast rejection.
  • Buffering in ROBs
• Event Filter (~sec): refines the selection; can perform event reconstruction at full granularity using the latest alignment and calibration data.
  • Buffering in EB & EF
RoI Mechanism
LVL1 triggers on high-pT objects
• Calorimeter cells and muon chambers are used to find e/γ/τ/jet/µ candidates above thresholds

LVL2 uses the Regions of Interest identified by Level-1
• Local data reconstruction, analysis, and sub-detector matching of RoI data

The total amount of RoI data is minimal
• ~2% of the Level-1 throughput, but it has to be accessed at 75 kHz
[Event display: H → 2e + 2µ, with the 2e and 2µ Regions of Interest highlighted]
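The "~2% at 75 kHz" statement can be turned into a bandwidth figure using the 120 GB/s full-readout throughput quoted on the architecture slide. A sketch, assuming a uniform event size (the per-event size below is derived, not quoted):

```python
# RoI data volume: ~2% of the full Level-1 readout, requested at 75 kHz.
# The 120 GB/s full-readout figure comes from the TDAQ architecture slide;
# a uniform event size is an assumption made for this estimate.

lvl1_rate_hz  = 75e3
readout_gb_s  = 120.0                                 # full read-out at 75 kHz
event_size_mb = readout_gb_s * 1024 / lvl1_rate_hz    # ~1.6 MB/event
roi_fraction  = 0.02
roi_gb_s      = readout_gb_s * roi_fraction

print(f"mean event size ~{event_size_mb:.1f} MB")
print(f"RoI bandwidth  ~{roi_gb_s:.1f} GB/s")
```

The resulting ~2.4 GB/s is in the same ballpark as the "~2+4 GB/s" label on the dataflow diagram, i.e. the RoI mechanism cuts the LVL2 network load by roughly a factor of 50 relative to full readout.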
ATLAS Computing Characteristics
• Large, complex detector
  – ~10^8 channels
• Long lifetime
  – Project started in 1992, first data in 2007, last data 2027?
• 320 MB/s raw data rate (×2 for processed and simulated data)
  – ~3 PB/year raw data
• Large, geographically dispersed collaboration
  – 1770 people, 151 institutions, 34 countries
  – Many are, and most will become, software developers
  – Currently ~150 FTE in offline software (~400 people)
• Scale and complexity reflected in the software
  – ~1000 packages, ~7000 C++ classes, ~2M lines of code
  – ~70% of the code is algorithmic (written by physicists)
  – ~30% is infrastructure and framework (written by software engineers)
  – Provide robustness but plan for evolution
  – Requires enabling technologies
  – Requires management & coherency
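The ~3 PB/year figure follows from the 320 MB/s rate and the conventional "accelerator year" of roughly 10^7 live seconds (the live-time figure is an assumption, not from the slide):

```python
# Raw data volume per year from the sustained raw data rate.
# ~1e7 live seconds/year is a standard rule-of-thumb assumption.

raw_rate_mb_s = 320.0
live_seconds  = 1e7
raw_pb_year   = raw_rate_mb_s * live_seconds / 1e9   # MB -> PB (decimal units)

print(f"~{raw_pb_year:.1f} PB/year raw data")        # ~3.2 PB/year
```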
Software Methodology
• Object-Oriented using C++ as programming language– Some wrapped FORTRAN and Java– Python as interactive & configuration language
• Heavy use of components behind abstract interfaces– Support multiple implementations– Robustness & evolution– Decoupling of dependencies
• Lightweight development process
  – Emphasis on automation and feedback rather than a very formal process
  – A previous attempt at developing a software system failed due to a too-rigorous software process decoupled from the physicist developers
  – Make it easy for developers to do the “right thing”
  – Some requirements/design reviews
  – Just completing 10 sub-system reviews
    • 2 weeks each, 4-5 reviewers
    • Focus on the client viewpoint and experience from DC2 (see later)
    • Feedback into the planning process
Simulated Data Processing
• Used to design detectors and trigger and to estimate how well reconstruction is being performed– Comparison with “truth”
• Generators– Creation of particles following a theoretical prediction of physics
• Simulation– Tracking of particles through detector material and magnetic field– Scattering and decays of particles
• Pile-up [optional]
  – Addition of multiple interactions per beam crossing and cavern backgrounds (e.g. beam-gas, beam-halo interactions)
• Digitization
  – Folding in the detector response to create the electronic channel contents
  – Final data format identical to that actually produced by the data acquisition electronics (see later)
  – With the optional addition of “truth”
Primary Data Processing
• Raw data through Physics Analysis– Detector reconstruction
• Correction of non-linear detector & electronics response• Correction (and determination) of intra-detector mis-alignments• Local pattern recognition within sub-detector (e.g. track segment finding)
– Combined reconstruction• Combining results across detectors• Tentative particle identification & energy flow (e.g. jets)• Correction (and determination) of inter-detector mis-alignments
– Physics Analysis• Final particle identification • Physics hypotheses matching
• Online Trigger– Performance optimized reconstruction
• Online Monitoring & Calibration– Simplified detector performance monitoring– Determination of detector response & mis-alignments
Control Framework
• Capture common behaviour for HEP processing– Processing stages
• Generation, simulation, digitization, reconstruction, analysis– Online (trigger & monitoring) & offline
• The control framework steers a series of modules to perform transformations
  – Component based
  – Dynamically reconfigurable
• Although framework captures common behaviour, it’s important to make it as flexible and extensible as possible
• Blackboard model– Algorithms register and retrieve data on shared blackboard– Component decoupling
• Athena Framework common project with LHCb– Both shared and ATLAS-specific components
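The blackboard model above can be illustrated with a minimal sketch: algorithms communicate only through a shared store, never directly with each other, and the framework steers a configurable sequence of them. All names here (Blackboard, Algorithm, etc.) are illustrative, not Athena's real API:

```python
# Minimal sketch of a blackboard-style control framework.
# Toy names throughout; not the actual Athena/Gaudi interfaces.

class Blackboard:
    """Shared transient store: producers register data, consumers retrieve it."""
    def __init__(self):
        self._data = {}
    def record(self, key, obj):
        self._data[key] = obj
    def retrieve(self, key):
        return self._data[key]

class Algorithm:
    """One processing module; the framework calls execute() once per event."""
    def execute(self, store):
        raise NotImplementedError

class MakeHits(Algorithm):
    def execute(self, store):
        store.record("hits", [1.0, 2.5, 4.0])   # toy data

class SumHits(Algorithm):
    def execute(self, store):
        # Decoupled from MakeHits: only the blackboard key is shared.
        store.record("sum", sum(store.retrieve("hits")))

def run_event(sequence, store):
    """The framework steers a dynamically configurable sequence of modules."""
    for alg in sequence:
        alg.execute(store)

store = Blackboard()
run_event([MakeHits(), SumHits()], store)
print(store.retrieve("sum"))   # 7.5
```

The decoupling is the point: reordering, replacing, or inserting algorithms only requires changing the configured sequence, not any algorithm's code.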
Athena Object Diagram
[Diagram: the Application Manager drives a set of Algorithms that read and write the Transient Event Store (via the Event Data Service), the Transient Detector Store (via the Detector Data Service) and the Transient Histogram Store (via the Histogram Service). Each store is backed by a Persistency Service, with Converters moving data to and from data files. Common services include the Message Service, JobOptions Service, Particle Properties Service and other services.]
Athena Components
• Algorithms– Provide basic per-event processing– Share a common interface (state machine)
• Tools– More specialized but more flexible than Algorithms
• Data Stores (blackboards)– Data registered by one Algorithm/Tool can be retrieved by another– Multiple stores handle different lifetimes (per event, per job, etc.)
• Services– E.g. Scripting, Random Numbers, Histogramming
• Converters
  – Transform data from one representation to another (e.g. transient/persistent)
• Properties
  – Adjustable parameters of components
  – Can be modified at run-time to configure a job
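The common Algorithm interface and run-time Properties can be sketched as follows. The class and property names are hypothetical, chosen only to show the shape of the state machine and the override mechanism, not the real Athena classes:

```python
# Sketch of a component with run-time Properties and an Algorithm
# state-machine interface (initialize/execute/finalize). Toy names only.

class Component:
    """Every component exposes adjustable Properties, overridable per job."""
    default_properties = {}
    def __init__(self, **overrides):
        self.properties = dict(self.default_properties)
        self.properties.update(overrides)    # job-configuration overrides

class Alg(Component):
    """Algorithms share a common state machine called by the framework."""
    def initialize(self): pass
    def execute(self, event): raise NotImplementedError
    def finalize(self): pass

class PtFilter(Alg):
    default_properties = {"PtCut": 10.0}     # hypothetical property
    def execute(self, event):
        cut = self.properties["PtCut"]
        return [pt for pt in event if pt > cut]

alg = PtFilter(PtCut=25.0)                   # Property modified at configure time
alg.initialize()
print(alg.execute([5.0, 30.0, 50.0]))        # [30.0, 50.0]
alg.finalize()
```

Because every Algorithm obeys the same interface, the framework can drive any mix of them without knowing their internals, and all tuning happens through Properties rather than code changes.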
Data Access Model
• StoreGate provides the blackboard– Algorithms register data and downstream Algorithms retrieve – Multiple instances for different lifetimes– Manages transient/persistent conversion
• Handles user-defined types– Most objects (STL assignable) can be registered & retrieved– Keyed on (store, type, key) for multiple object instances– Optionally locks objects once registered to prevent modification– Provides iterators for wildcard retrieval
• Manages object ownership
• Flexible container management
  – Value containers (container owns objects)
  – View containers (support polymorphism)
• Inter-object links to support persistency– Support deferred access
• Referenced object isn’t read from disk until link is traversed
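The (store, type, key) addressing, locking, and wildcard retrieval described above can be sketched with a toy store; the class below is illustrative and not the real StoreGate interface:

```python
# Toy sketch of StoreGate-style keyed access: objects addressed by
# (type, key), locked after registration, with wildcard retrieval by type.

class Store:
    def __init__(self):
        self._objects = {}
        self._locked = set()
    def record(self, obj, key, lock=True):
        slot = (type(obj), key)
        if slot in self._locked:
            raise RuntimeError(f"{slot} is locked against modification")
        self._objects[slot] = obj
        if lock:
            self._locked.add(slot)       # prevent later modification
    def retrieve(self, cls, key):
        return self._objects[(cls, key)]
    def retrieve_all(self, cls):
        """Wildcard retrieval: iterate every registered object of one type."""
        return [(k, o) for (c, k), o in self._objects.items() if c is cls]

sg = Store()
sg.record([1, 2, 3], "TrackHits")        # two instances of the same type,
sg.record([9, 8], "CaloHits")            # distinguished by key
print(sg.retrieve(list, "TrackHits"))    # [1, 2, 3]
print(len(sg.retrieve_all(list)))        # 2
```

The real StoreGate adds type-safe templates, ownership management, and the deferred-read links mentioned above, but the addressing model is the same idea.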
Scripting
• Python used as both configuration and interactive scripting language
• Python bindings to C++ components are provided using an introspection dictionary with both C++ and Python APIs
  – The dictionary database is populated by parsing C++ header files using gccxml
  – The API is based on that proposed for the C++ language standard
  – The database and API are also used for persistifying data objects
• Athena jobs are configured by specifying the set of Algorithms & Services that are needed, as well as their Property overrides
  – A History service records the configuration and can be used for “playback”
• Initially Python was used for simple “data cards”; it is now being used as a true OO language in order to simplify the user interface
  – Python objects map onto a sequence of C++ components, not just one-to-one
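The one-to-many mapping in the last bullet can be sketched as follows; the configuration object and the stage names are invented for illustration, not real Athena configurables:

```python
# Toy sketch of a Python configuration object that expands into a
# sequence of underlying framework components (names are made up).

class RecoConfig:
    """One user-facing object standing for several concrete algorithms."""
    def __init__(self, name, **property_overrides):
        self.name = name
        self.overrides = property_overrides
    def expand(self):
        # A single "FullReco" object configures a whole chain of components.
        stages = ["Digitization", "Tracking", "Calorimetry"]
        return [f"{self.name}/{stage}" for stage in stages]

reco = RecoConfig("FullReco", OutputLevel="DEBUG")
print(reco.expand())   # ['FullReco/Digitization', 'FullReco/Tracking', 'FullReco/Calorimetry']
```

This is what simplifies the user interface: physicists manipulate a few high-level Python objects, while the framework still sees the full sequence of C++ components with their Property overrides.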
Code Repository
• CVS (Concurrent Versions System)
  – Subversion to be evaluated in the future
• Packages grouped hierarchically for management
  – Container packages correspond both to the CVS directory structure and to logical groupings
• Secure network access
• Extensive use of authorization for commit & tag access
• Package tags to create snapshots
• A set of tagged packages can be built into a release
• Dependencies between packages managed by the Code Management Tool (CMT)
  – Ensures packages are built in the correct sequence
  – Also specifies the components to be built (libraries, applications, etc.)
  – Also specifies dependencies on external packages
• ~40 external packages
  – Gaudi Framework (Athena kernel), LCG Apps Area, event generators, Java support, online common, misc.
Nightly Releases
• Complete software built every night on several platforms
  – Primarily Enterprise Linux 3 (RH7.3 support just terminated)
  – More platforms (including AMD64 and Mac OS X) underway
• Partial regression/unit tests
• Problem reports emailed automatically to developers
• 7 copies rotated, so each lasts one week
  – Allows more time for developers to fix problems
• Takes ~20 hours for a full release build (performed once per week)
  – Incremental builds are used on the other days
• Prototyping parallel builds
  – Package-level parallelism using the CMT build tool
    • Takes advantage of multi-CPU computers
  – File-level parallelism using the distcc compiler
    • Uses a master with several slaves
  – Initial testing shows a x3 speed-up using file-level parallelism only
• Decomposition into multiple projects underway
  – Reduced build time per project
  – Better control over dependencies
  – More complicated build management
Release Hierarchy
• Developer Releases– Every 3-4 weeks
– Subject to more management prior to build– Full regression tests
– Normally no attempt to fix problems after build is completed
• Production Releases– 2-3 times per year - synchronized with major milestones– Strict management (tag-approval) control
– Full regression tests– Iteration until immediate problems fixed
• Bug-fix Releases– In case of problems discovered after extended use
– Sometimes multiple bug-fix releases are necessary
Release Management: Tag Collector
• Web-based tool for specifying package versions within a release
• A release consists of a consistent set of packages & versions
• The Tag Collector manages access rights
• Auto-generates dependencies for container packages
  – Packages specifying a group of child packages (e.g. Reconstruction)
• Manages the release sequence for decomposition into projects
• Supports parallel development
  – Some development in the primary branch
  – Other development in the bug-fix branch
NICOS
• System to manage primarily the nightly builds
  – Also builds summary web pages for other release builds
• Performs CVS checkout, release builds, and submission of automated tests; parses logfiles for errors
  – Sends emails to developers if errors are detected
• Web-based browser to allow problems to be examined
• Generates a web page per release
Reality Checks
• Major milestones to test software and computer operations– Stress tests
• 1-2 per year• Data Challenges
– Production and processing of large simulated data samples
– Every 12-18 months
• Physics Workshops– Every 18-24 months– Major emphasis is exposure and feedback from physics community
• Test beams– Early use of offline software with real data using “vertical-slice” of
detectors and TDAQ hardware
Data Challenges
• ATLAS has had 3 Data Challenges so far
• Most recent (DC2) in the 2nd half of 2004
  – First large-scale use of the new C++ software
    • Full Athena-based processing chain
    • Geant4 simulation engine
    • New persistency mechanisms for event and time-varying data
  – Validate the computing model
  – Perform a 10% test of Tier-0 (descoped)
    • Pseudo-real-time first-pass processing of raw data
    • Original scale 10^7 events; descoped to 10^6 events because of delays
  – World-wide production
    • Using 3 Grid flavours (Grid3, LCG, NorduGrid)
Physics Workshops
• Every 18-24 months
• Rome Workshop held earlier this month
  – 450 physicists attended (~25% of ATLAS)
  – A reminder that the software is not the end product
• World-wide production
  – Used some of the lessons learned from DC2
  – Used the expected ATLAS turn-on detector configuration
  – 8×10^6 events processed
• Important feedback on software usability and technical performance as well as physics performance
• Some problems, but overall in pretty good shape
  – Software performance was a side comment to the physics talks, not a major limiting factor
Towards the complete experiment: ATLAS combined test beam in 2004
Full “vertical slice” of ATLAS tested on CERN H8 beam line May-November 2004
Geant4 simulation of the test-beam set-up (x, y, z axes indicated)
For the first time, all ATLAS sub-detectors were integrated and run together with common DAQ, “final” electronics, slow-control, etc. A lot of global operation experience was gained during the ~6 month run. Common ATLAS software was used to analyze the data.
Test Beam
• Vertical detector slice (every detector subsystem represented)
• Use of prototype TDAQ hardware & software
• Use of offline software in the trigger and for monitoring
  – Also online calibrations
• Test of the software’s ability to deal with non-standard geometries
  – Geometry versioning management
  – Non-vertex-pointing tracking
    • Important later for commissioning with cosmics
• Test of reconstruction in a non-standard magnetic field
• Exercise conditions database prototypes
• Exercise mis-alignment determination and correction
• Exercise data management software
• Exercise the development & release infrastructure
  – Rapid turn-around but also robust
The Computing Model

[Diagram: detector (~PB/s) → Event Builder (10 GB/s) → Event Filter (~7.5 MSI2k) → Tier-0 (~5 MSI2k, Castor MSS) at 320 MB/s, ~5 PB/year, no simulation at Tier-0. ~75 MB/s per Tier-1 for ATLAS over ≥622 Mb/s links to regional centres (US, French, Dutch, UK (RAL), ...); the 10 Tier-1s (~2 MSI2k and ~2 PB/year each, with MSS) reprocess data, house simulation and host group analysis. Tier-2 centres (~200 kSI2k and ~200 TB/year each; e.g. a Northern Tier, or Sheffield/Manchester/Liverpool/Lancaster at ~0.25 TIPS) connect over 100-1000 MB/s links, hold a physics data cache with the full AOD, TAG & relevant Physics Group summary data, do the bulk of simulation, and each support ~20 physicists working on one or more channels. Some data for calibration and monitoring flows out to the institutes, and calibrations flow back; analysis reaches desktops and workstations. PC (2004) = ~1 kSpecInt2k.]
Grid3 - participating sites (Sep 04)

[Map of participating sites, with service monitoring]

• 30 sites, multi-VO
• shared resources
• ~3000 CPUs (shared)
NorduGrid & Co. participating sites

 #   Site                         ~ # CPUs   ~ % Dedicated
 1   atlas.hpc.unimelb.edu.au          28        30%
 2   genghis.hpc.unimelb.edu.au        90        20%
 3   charm.hpc.unimelb.edu.au          20       100%
 4   lheppc10.unibe.ch                 12       100%
 5   lxsrv9.lrz-muenchen.de           234         5%
 6   atlas.fzk.de                     884         5%
 7   morpheus.dcgc.dk                  18       100%
 8   lscf.nbi.dk                       32        50%
 9   benedict.aau.dk                   46        90%
10   fe10.dcsc.sdu.dk                 644         1%
11   grid.uio.no                       40       100%
12   fire.ii.uib.no                    58        50%
13   grid.fi.uib.no                     4       100%
14   hypatia.uio.no                   100        60%
15   sigrid.lunarc.lu.se              100        30%
16   sg-access.pdc.kth.se             100        30%
17   hagrid.it.uu.se                  100        30%
18   bluesmoke.nsc.liu.se             100        30%
19   ingrid.hpc2n.umu.se              100        30%
20   farm.hep.lu.se                    60        60%
21   hive.unicc.chalmers.se           100        30%
22   brenta.ijs.si                     50       100%

Totals:
• 7 countries, 22 sites
• ~3000 CPUs (dedicated ~600)
• 7 Storage Services (in RLS), plus a few more storage facilities (~12 TB)
• ~1 FTE (1-3 persons) in charge of production
  – 2-3 executor instances
LCG Computing Resources: May 2005

[Map: countries providing resources and countries anticipating joining]

In LCG-2: 139 sites, 32 countries, ~14,000 CPUs, ~5 PB storage
Includes non-EGEE sites: 9 countries, 18 sites
The number of sites is already at the scale expected for LHC - it demonstrates the full complexity of operations
ATLAS Production system
[Diagram: a common production database (prodDB, with the AMI metadata interface) and the Don Quijote data management system (dms, with per-Grid RLS replica catalogues) sit above five supervisor instances (Windmill). Each supervisor drives an executor for one flavour - Lexor for LCG, Dulcinea for NorduGrid, Capone for Grid3, and an LSF executor for local batch - communicating via jabber and SOAP.]
Jobs on Grid3 (30 November 2004)

[Pie chart: ~93,000 jobs distributed over 19 sites: ANL_HEP, BNL_ATLAS, BU_ATLAS_Tier2, CalTech_PG, FNAL_CMS, IU_ATLAS_Tier2, OU_OSCER, PDSF, PSU_Grid3, Rice_Grid3, SMU_Physics_Cluster, UBuffalo_CCR, UCSanDiego_PG, UC_ATLAS_Tier2, UFlorida_PG, UM_ATLAS, UNM_HPC, UTA_dpcc, UWMadison; individual site shares range from ~0% to ~19%.]
Job Success Rate on GRID3
Month       Finished   Failed   Success Rate
July            8799     6676        57%
August         17083     9448        64%
September      17283     7717        69%
October        26600     5186        84%
November       21869     5038        81%
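The Success Rate column is simply finished / (finished + failed); a quick reproduction from the table's own numbers:

```python
# Reproduce the GRID3 success-rate column from the finished/failed counts.

months = {                     # month: (finished, failed)
    "July":      (8799, 6676),
    "August":    (17083, 9448),
    "September": (17283, 7717),
    "October":   (26600, 5186),
    "November":  (21869, 5038),
}

for month, (done, failed) in months.items():
    rate = done / (done + failed)
    print(f"{month:10s} {rate:.0%}")
```

The steady climb from 57% to above 80% is the effect of the retry and error-recovery improvements discussed on the feedback slide below.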
Jobs Total (30 November 2004)

[Pie chart: ~276,000 jobs distributed over 69 sites across the three Grid flavours - LCG sites (at.uibk, ca.alberta, ca.montreal, ca.toronto, ca.triumf, ch.cern, cz.cesnet, cz.golias, de.fzk, es.ifae, es.ific, es.uam, fr.in2p3, it.cnaf, it.lnf, it.lnl, it.mi, it.na, it.roma, it.to, jp.icepp, nl.nikhef, pl.zeus, tw.sinica, uk.cam, uk.lancs, uk.man, uk.pp.ic, uk.rl, uk.shef, uk.ucl), NorduGrid sites (au.melbourne, ch.unibe, de.fzk, de.lrz-muenchen, dk.aau, dk.dcgc, dk.nbi, dk.sdu, no.uib, no.grid.uio, no.hypatia.uio, se.hoc2n.umu, se.it.uu, se.lu, se.lunarc, se.nsc, se.pdc, se.unicc.chalmers, si.ijs) and Grid3 sites (ANL_HEP, BNL_ATLAS, BU_ATLAS_Tier2, CalTech_PG, FNAL_CMS, IU_ATLAS_Tier2, OU_OSCER, PDSF, PSU_Grid3, Rice_Grid3, SMU_Physics_Cluster, UBuffalo_CCR, UCSanDiego_PG, UC_ATLAS_Tier2, UFlorida_PG, UM_ATLAS, UNM_HPC, UTA_dpcc, UWMadison); no single site contributed more than ~6%.]
Production Efficiency
[Plot: per-task production efficiency, 0-100%, for task IDs 38181-38498]

Depends on many factors….
GRID3 was used for most of the testing for the Rome production
NG had a personnel change between DC2 and the Rome production
Feedback from Grid Deployment
• Simulation software very stable
  – E.g. no failures in 35k jobs over 3.5M events
• Major failure modes were access to input data files or failure to register output files
• Retry mechanisms put into place helped significantly
  – Some “good news, bad news” stories
• Error recovery is (obviously) harder than error detection
  – Production management components had to be redesigned in some places to provide adequate error recovery
• Still a very manpower-intensive activity
• The scale in number of sites/nodes needed for ATLAS turn-on has already been reached
• 2nd generation of production tools being worked on
• Next generation of Grid middleware also being deployed
Computing System Commissioning
• Starts early in 2006, through to experiment turn-on in mid 2007
• Detailed planning just started
• 8 major sub-system tests
  – Full software chain
  – Tier-0 scaling
    • Pseudo-real-time processing of data from the Event Filter
    • Goal is <5 day latency
  – Calibration & Alignment
  – Trigger Integration & Monitoring
  – Distributed Data Management
  – Distributed Physics Analysis
  – Distributed Production
  – TDAQ/Offline full chain
• Completion of these corresponds to ATLAS turn-on
CSC Acceptance Tests
• Detailed set of acceptance criteria for each test
• Incorporated into automated tests
• Establish functionality, technical performance and physics performance thresholds
• E.g. acceptance criteria for the Full Software Chain:
  – Validation of the output of each stage by the ability to read it at the next stage
  – Non-recoverable error rates
  – Event processing times
    • Nominal 100 kSI2k·sec/event for simulation (currently x2-8 slower, but development is ongoing to meet the goal)
    • Nominal 15 kSI2k·sec/event for reconstruction (currently x2 slower, and again a strategy is in place to meet the goal)
  – Memory usage (<1 GB)
  – Job startup time
  – Etc.
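The reconstruction budget matters because first-pass processing must keep up with the output of the Event Filter using the Tier-0 capacity quoted in the computing model. A rough cross-check, combining the ~200 Hz rate and ~5 MSI2k Tier-0 figures from earlier slides:

```python
# Why the 15 kSI2k.sec/event reconstruction budget matters:
# first-pass processing must keep pace with ~200 Hz out of the Event
# Filter using the ~5 MSI2k Tier-0 (figures from the computing model).

event_rate_hz  = 200.0
tier0_ksi2k    = 5000.0    # ~5 MSI2k
budget_ksi2k_s = 15.0      # nominal reconstruction cost per event

needed            = event_rate_hz * budget_ksi2k_s   # kSI2k at nominal speed
needed_if_2x_slow = 2 * needed                       # with the current x2 slowdown

print(f"nominal:   {needed:.0f} kSI2k of {tier0_ksi2k:.0f} available")
print(f"x2 slower: {needed_if_2x_slow:.0f} kSI2k")
```

At nominal speed reconstruction fits in the Tier-0 with headroom; at the current x2 slowdown it would exceed the full Tier-0 capacity, which is why meeting the budget is an explicit acceptance criterion.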
Main Concerns
• Ability to deal with a moving, inefficient detector
• Apply lessons learned from the Rome Workshop to physics analysis
• Performance
  – x2 improvement required on reconstruction
  – x4-8 on simulation
• Improving ease of use
  – Distributed user support
• Establishing Tier-0 production
• Grid production robustness
• Coping with parallel detector commissioning activities
• Establishing operations teams
  – Shift crews plus long-term management staff
• Migrating from a mode where the emphasis is on rapid software development to one where the emphasis is on robustness and validation
Overall summary installation schedule version 7.0(New baseline approved in the February 2005 ATLAS EB)
NERSC HENPC Group
• Mixture of staff scientists, computer software engineers and post-docs (11 in total)
  – Mainly with degrees in physics, but with subsequent training and experience in computer science and software engineering
• Provide computing systems for large HEP and Nuclear Science experiments
• Leadership and architectural roles as well as core development
• Generate an institutional knowledge base
• Leverage the coupling between NERSC and Physical Sciences at LBNL
• 5 current projects
  – ATLAS, BaBar, IceCube, Majorana, SNAP
• 5 members currently working on ATLAS (with a new post-doc hire soon)
  – Paolo Calafiura, (Chris Day), Charles Leggett, Wim Lavrijsen, Massimo Marino, David Quarrie, (Craig Tull)
HENPC Group ATLAS Responsibilities
• Software Project Management
• Chief Architect
• Core Services Management within the Software Project
• Athena Framework
• Data Access Model (StoreGate & EDM kernel)
• Scripting & Interactivity
• Aspects of introspection
• Histogram, N-tuple, History & IOV Services
• Pile-up & Event Mixing frameworks
• Aspects of the release build infrastructure
• Usability Task Force
• Tutorials & consultancy
• Performance & diagnostic tools
• Etc.
Summary
• The ATLAS experiment is highly complex
  – Multiple dimensions of scale
    • Large number of detector channels, high data rate
    • Size and geographical dispersion of the collaboration
    • Large developer base and large user base
• Long timescale
  – Plan for evolution
• Many problems are sociological rather than technical
  – Emphasis on enabling technologies and automated tests
• Good synergy between physicists, computer scientists and software engineers is essential
  – LBNL & NERSC are a good example of this
• Extensive stress tests are underway and planned prior to startup
  – Feedback from the most recent shows that we’re on track
• Aside: ATLAS will be my 8th experiment turn-on
  – Each one worse than the previous, despite the additional experience gained