Large Scale and Performance Tests of the ATLAS Online Software
CERN ATLAS TDAQ Online Software System
D.Burckhart-Chromek, I.Alexandrov, A.Amorim, E.Badescu, M.Caprini, M.Dobson, R.Hart, R.Jones, A.Kazarov, S.Kolos, V.Kotov, D.Liko, L.Lucio, L.Mapelli, M.Mineev, L.Moneta, M.Nassiakou, L.Pedro, A.Ribeiro, Y.Ryabov, D.Schweiger, I.Soloviev, H. Wolters
CHEP2001 Beijing China
CHEP2001 - Large Scale and Performance Tests of the ATLAS Online System - D. Burckhart-Chromek
Content
• The Online System in ATLAS TDAQ
• Testing in the Online System
• Aims of the Large Scale and Performance Tests
• Approach
• Test Series and their Setup
• Test Configurations
• Results
• Experience and tips for doing large scale tests
• Future tests and Conclusions
TDAQ System/Context
[Diagram: the ATLAS TDAQ system and its context. Blocks: Detector (~200 nodes), Detector Control System, LVL1 Trigger, Dataflow (~800 nodes: Readout System, Data Collection), High Level Trigger, Physics & Event Selection Architecture (PESA), Reconstruction Framework (Athena), Event Store, Offline Computing, Online Software. Labels: Detector Data, LVL1 Input, LVL1 Result, HLT Strategy, Algorithms, Selected Events.]
Online Software: Configuration, Run Control, Process Control, Inter Process Communication, Message Reporting, Info Service, Monitoring
Elements running the online software: Detector (~200 nodes), LVL1 Trigger, Dataflow (~800 nodes: Readout System, Data Collection), High Level Trigger
Aims of the Large Scale and Performance Tests
• Verify Scalability of the online system to a large configuration
• Study Interaction between the online components in a large configuration
• Measure Performance: take timing values of the various setup, run control transition and shutdown phases
• Understand System Limits: push the system to a very large size
• Perform selected Fault Tolerance tests
Testing in the Online System
Component Testing
• Formal Inspection of Components
• Unit Tests of Components
• Nightly Builds with component check
System Integration Testing
• Nightly Builds with basic check on integration
• Last Successful Nightly Build available to developers
Planned Public Releases
• 3-5 times a year
• Remote Test Centers to test the Pre-Release, retrieving the system from a tar file or from CD-ROM
Deployment
• Deployment in Test Beam Operation gives feedback
Approach for the Large Scale and Performance Tests
Test Preparation
• Test Plan prepared beforehand defines aims, scope, configurations, resources and describes the tests
Testware
• use of existing example programs for controllers and monitoring
• use of standard setup script
• utility scripts to establish the configuration, and to start/stop process manager daemons
• Functionality of other systems emulated where necessary
During the Tests
• automatically produced test results and log files
• immediate logging and follow up of issues found
• fixes and enhancements verified in the next iteration
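As an illustration of automatically produced timing results and log files, a minimal Python sketch follows. It is not the testware used in these tests: the `timed` helper, the `results` dictionary and the phase name are hypothetical, and the `time.sleep` call only stands in for a real setup or transition command.

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO)

# Collected timings, keyed by phase name (hypothetical structure).
results = {}


@contextmanager
def timed(phase):
    """Time the enclosed block, store the result and write it to the log."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        results[phase] = elapsed
        logging.info("%s took %.3f s", phase, elapsed)


# Placeholder for a real phase such as running the setup script.
with timed("setup"):
    time.sleep(0.01)
```

Recording the value in the `finally` clause means a phase that fails still leaves a timing entry and a log line for later problem tracing.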
DAQ Configuration
The ATLAS Detector
• Each sub-detector has a large number of readout nodes/crates
• The Online System Control Tree connects the sub-detectors
• The online system is responsible for: Configuration Database, Run Control, Process Management, Inter Process Communication, Message Reporting, Information Service, Monitoring
Control of a multi-detector system
The configuration database describes a partition:
• information on all processes and their relationships
• the run control hierarchy in the online system
• startup and shutdown dependencies
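The three kinds of information held for a partition can be pictured with a small Python sketch. This is an illustration under stated assumptions, not the ATLAS configuration database schema: the `Process` class, its field names, the host names and the rule that shutdown reverses startup are all hypothetical.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import List, Optional


@dataclass
class Process:
    """One process entry of a partition (field names are illustrative)."""
    name: str
    host: str
    parent: Optional[str] = None  # run control parent, None for the root
    starts_after: List[str] = field(default_factory=list)  # startup dependencies


def startup_order(processes):
    """Return process names so that each one follows its dependencies."""
    graph = {p.name: set(p.starts_after) for p in processes}
    return list(TopologicalSorter(graph).static_order())


def shutdown_order(processes):
    """Assumption: shutdown simply reverses the startup order."""
    return list(reversed(startup_order(processes)))


partition = [
    Process("server", "pc00"),
    Process("root_ctrl", "pc00", starts_after=["server"]),
    Process("det_ctrl", "pc01", parent="root_ctrl", starts_after=["root_ctrl"]),
    Process("crate_ctrl", "pc02", parent="det_ctrl", starts_after=["det_ctrl"]),
]

print(startup_order(partition))
print(shutdown_order(partition))
```

A topological sort raises an error on circular dependencies, which is one way a dependency problem in a configuration could surface before any process is started.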
Test Set-Up
Hardware and Network
• 6 test series on 3 test clusters, 2 days - 1 week
• 16, 65, 112 PCs, Linux 6.1, 400-733 MHz, 128-512 MB
• AFS, NFS, local network
Base Partition
• 10 independent partitions created
• 11 PCs per partition
• one Detector Controller; read out crates are linked to a detector controller
• per crate/node: one run controller, one monitoring sampler, one process manager daemon
Test:
• run each base partition separately
• run base partitions in parallel
Test Configuration - 3 Level Partitions
Separate Partitions are combined
• 10 configurations built from the base partitions
• up to 10 base partitions + 1 root controller + 1 monitoring factory
• one monitoring sampler per crate controller
• up to 112 PCs in a 3-level hierarchy
Test:
• run the 10 test partitions sequentially
[Diagram: example for 112 nodes. Root Controller, Detector controller, 10 crate controllers, Monitoring factory.]
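The size of the combined configurations can be checked with a short Python sketch. It assumes that each base partition contributes one detector controller PC and 10 crate controller PCs (11 PCs in total), and that the root controller and the monitoring factory each occupy one further PC; the function and node names are hypothetical.

```python
def build_control_tree(n_base, crates_per_base=10):
    """Return {controller: [children]} for a root/detector/crate hierarchy."""
    tree = {"root": []}
    for b in range(n_base):
        det = f"det{b}"
        tree["root"].append(det)
        tree[det] = [f"det{b}_crate{c}" for c in range(crates_per_base)]
    return tree


def controller_count(tree):
    """Root controller plus every controller listed as a child."""
    return 1 + sum(len(children) for children in tree.values())


def depth(tree, node="root"):
    """Number of levels in the control tree below and including node."""
    children = tree.get(node, [])
    return 1 + max((depth(tree, c) for c in children), default=0)


def pc_count(n_base, crates_per_base=10):
    """Assumed: 11 PCs per base partition, plus root and monitoring factory."""
    return n_base * (1 + crates_per_base) + 2


tree = build_control_tree(10)
print(controller_count(tree), depth(tree), pc_count(10))
```

With 10 base partitions this gives 111 controllers in a 3-level tree on 112 PCs, in line with the figures quoted on the slides.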
Nested Partitions in Configuration data file
See the contribution to this conference: ATLAS DAQ Configuration Databases, by Igor Soloviev
100 controller partition
Timing tests: Logical View of Transitions
[Diagram: states closed, absent, initial, loaded, configured, running. Transitions: setup, boot, shutdown, close; cold start / cold stop; luke warm start / luke warm stop; warm start / warm stop.]
• infrastructure: server processes created and stopped by play_daq
• supervisor: processes started and stopped
• supervisor & controller communication
• communication with all controllers
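The state names in the diagram can be arranged into a small Python sketch that counts how many single state transitions each kind of start involves. The ordering of the states and the starting state of each kind of start are assumptions read off the diagram labels (cold start from initial, luke warm start from loaded, warm start from configured); they are not taken from the run control code.

```python
# States in the order they appear in the diagram (assumed to be a linear chain).
STATES = ["closed", "absent", "initial", "loaded", "configured", "running"]

# Assumed starting state for each kind of start; all of them end in "running".
START_FROM = {
    "cold": "initial",
    "luke warm": "loaded",
    "warm": "configured",
}


def start_path(kind):
    """States visited by the given kind of start, first to last."""
    return STATES[STATES.index(START_FROM[kind]):]


def single_transitions(kind):
    """Number of single state transitions needed to reach 'running'."""
    return len(start_path(kind)) - 1


for kind in START_FROM:
    print(kind, start_path(kind), single_transitions(kind))
```

Under these assumptions a cold start consists of three single state transitions and a warm start of one.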
Setup/Boot/Shutdown/Close: IT-Cluster
[Plot: timing of the setup, boot, shutdown and close phases. Annotations: Slow increase with larger configuration; Constant; Expected increase with number of processes; Dependency problem discovered.]
Scalability and Performance: RC state transitions, IT-Cluster
[Plot: timing of run control state transitions. Labels: Single state transition; Single state transition plus 1s; 3 state transitions. Annotation: Heavy load of communication.]
Results in numbers
For the large test partitions
• ~340 processes were running on 112 PCs: 111 controllers, 100 monitoring samplers, 112 pmg daemons, ~10 servers, 1 monitoring factory
• ~850 entries in the database data file (250 sw, 600 hw)
First large scale test:
• 45 issues found (bugs, problems, improvement suggestions)
• effort of 52 working days (8 h each) over an elapsed time of 3 weeks of test preparation and 1 week of testing, excluding analysis, for 1-3 testers
• tons of log files
Following iterations:
• re-use original test plan and add brief update
• preparation time reduced radically to ~2-3 days
• test runs mostly done automatically
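The process total for the largest partition can be reproduced from the configuration rules with a few lines of Python. The function is a sketch: it assumes one process manager daemon per PC, one monitoring sampler per crate controller, and exactly 10 servers, where the slides quote approximately 10.

```python
def process_counts(n_base=10, crates_per_base=10, servers=10):
    """Process counts per category for a combined 3-level partition."""
    crate_controllers = n_base * crates_per_base
    pcs = n_base * (1 + crates_per_base) + 2  # plus root and monitoring factory
    return {
        "controllers": 1 + n_base + crate_controllers,  # root + detector + crate
        "monitoring samplers": crate_controllers,       # one per crate controller
        "pmg daemons": pcs,                              # assumed one per PC
        "servers": servers,                              # quoted as ~10
        "monitoring factory": 1,
    }


counts = process_counts()
total = sum(counts.values())
print(counts, total)
```

The sum comes to 334 processes, consistent with the quoted figure of about 340 given that the number of servers is approximate.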
Experience and Tips
Preparation
• Require Unit Tests of components
• Prepare a detailed Test Plan beforehand
• Run large Scale Tests on a tested and frozen release
• Foresee expandable, flexible configuration and test infrastructure
• Encourage precise information logging for problem tracing
Organization
• Store the testware in the software repository
• Run the testware regularly/automatically to verify it is up to date
• Re-use test items like test structure, testware, scripts, checklists
Network
• Use NFS not AFS
• Run on isolated network & monitor activity
Conclusions and Future
• The online system can run a partition consisting of > 100 PCs
• The online system can run partitions in parallel
• Scalability tests spot problems that cannot be seen in any other way
• Shielding from the CERN network has a very positive effect
• The behaviour of a 4-level hierarchy is very similar to that of a 3-level hierarchy
• Very large scale Stress Tests help in studying process communication
Future
• Run a basic integration test at each successful nightly build
• Repeat Tests on a regular basis for each major release, building on existing material
• Push the scale further to uncover new effects
• Automate the tests further
• Gradually include more SW items and components from other systems
Many thanks to CMS and to CERN/IT
for giving us access to their PC clusters