Large Scale and Performance Tests of the ATLAS Online Software

CERN ATLAS TDAQ Online Software System

D.Burckhart-Chromek, I.Alexandrov, A.Amorim, E.Badescu, M.Caprini, M.Dobson, R.Hart, R.Jones, A.Kazarov, S.Kolos, V.Kotov, D.Liko, L.Lucio, L.Mapelli, M.Mineev, L.Moneta, M.Nassiakou, L.Pedro, A.Ribeiro, Y.Ryabov, D.Schweiger, I.Soloviev, H. Wolters

CHEP2001 Beijing China



CHEP2001 - Large Scale and Performance Tests of the ATLAS Online System - D. Burckhart-Chromek

Content

• The Online System in ATLAS TDAQ
• Testing in the Online System
• Aims of the Large Scale and Performance Tests
• Approach
• Test Series and their Setup
• Test Configurations
• Results
• Experience and tips for doing large scale tests
• Future tests and Conclusions


TDAQ System/Context

[Diagram: the TDAQ system and its context. Recoverable labels:]

• Detector: ~200 nodes
• LVL1 Trigger
• Dataflow: ~800 nodes (Readout System, Data Collection)
• High Level Trigger
• Detector Control System
• Physics & Event Selection Architecture (PESA)
• Reconstruction Framework (Athena)
• Event Store
• Offline Computing
• Online Software: Configuration, Run Control, Process Control, Inter Process Communication, Message Reporting, Info Service, Monitoring
• Data labels: Detector Data, LVL1 Input, LVL1 Result, HLT Strategy, Algorithms, Selected Events

Elements running the online software: Detector (~200 nodes), LVL1 Trigger, Dataflow (~800 nodes: Readout System, Data Collection), High Level Trigger


Aims of the Large Scale and Performance Tests

• Verify Scalability of the online system to a large configuration
• Study Interaction between the online components in a large configuration
• Measure Performance: take timing values of the various setup, run control transition and shutdown phases
• Understand System Limits: push the system to a very large size
• Perform selected Fault Tolerance tests


Testing in the Online System

Component Testing
• Formal Inspection of Components
• Unit Tests of Components
• Nightly Builds with component check

System Integration Testing
• Nightly Builds with basic check on integration
• Last Successful Nightly Build available to developers

Planned Public Releases
• 3-5 times a year
• Remote Test Centers test the Pre-Release, retrieving the system from a tar file or from CD-ROM

Deployment
• Deployment in Test Beam Operation gives feedback


Approach for the Large Scale and Performance Tests

Test Preparation
• Test Plan prepared beforehand defines aims, scope, configurations, resources and describes the tests

Testware
• use of existing example programs for controllers and monitoring
• use of standard setup script
• utility scripts to establish the configuration, and to start/stop process manager daemons
• functionality of other systems emulated where necessary

During the Tests
• automatically produced test results and log files
• immediate logging and follow up of issues found
• fixes and enhancements verified in the next iteration


DAQ Configuration

The ATLAS Detector

• Each sub-detector has a large number of readout nodes/crates
• The Online System Control Tree connects the sub-detectors
• Online system is responsible for: Configuration Database, Run Control, Process Management, Inter Process Communication, Message Reporting, Information Service, Monitoring

Control of a multi-detector system

The configuration database describes a partition:
• information on all processes and their relationships
• the run control hierarchy in the online system
• startup and shutdown dependencies
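The partition description above can be sketched as a small data structure. This is an illustration only: the class names, field names and process names below are hypothetical and do not reflect the actual ATLAS configuration database schema. The sketch records processes, their run control parent and their startup dependencies, and derives a start order from the dependencies. Taking the shutdown order as the reverse of the start order is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Optional


@dataclass
class Process:
    """One process in a partition (hypothetical fields, for illustration)."""
    name: str
    host: str
    parent: Optional[str] = None  # run control parent, if any
    starts_after: List[str] = field(default_factory=list)  # startup dependencies


@dataclass
class Partition:
    """A named set of processes with hierarchy and dependencies."""
    name: str
    processes: Dict[str, Process] = field(default_factory=dict)

    def add(self, proc: Process) -> None:
        self.processes[proc.name] = proc

    def children(self, controller: str) -> List[str]:
        """Names of the processes whose run control parent is `controller`."""
        return sorted(p.name for p in self.processes.values()
                      if p.parent == controller)

    def startup_order(self) -> List[str]:
        """Order in which every process starts after its dependencies."""
        graph = {p.name: set(p.starts_after) for p in self.processes.values()}
        return list(TopologicalSorter(graph).static_order())

    def shutdown_order(self) -> List[str]:
        """Assumption of this sketch: shut down in reverse start order."""
        return list(reversed(self.startup_order()))


# Hypothetical example partition: one server, one detector controller,
# two crate controllers.
demo = Partition("demo")
demo.add(Process("server", "pc01"))
demo.add(Process("detector_ctrl", "pc01", starts_after=["server"]))
demo.add(Process("crate_ctrl_1", "pc02", parent="detector_ctrl",
                 starts_after=["detector_ctrl"]))
demo.add(Process("crate_ctrl_2", "pc03", parent="detector_ctrl",
                 starts_after=["detector_ctrl"]))
```

A topological sort is used here because it fails loudly (with `graphlib.CycleError`) on circular dependencies, which is one simple way to catch a dependency problem in a configuration before anything is started.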


Test Set-Up

Hardware and Network
• 6 test series on 3 test clusters, 2 days - 1 week
• 16, 65, 112 PCs, Linux 6.1, 400-733 MHz, 128-512 MB
• afs, nfs, local network

Base Partition
• 10 independent partitions created
• 11 PCs per partition
• read out crates are linked to a detector controller
• per crate/node: one run controller, one monitoring sampler, one process manager daemon

Test:
• run each base partition separately
• run base partitions in parallel

[Diagram label: Detector Controller]


Test Configuration - 3 Level Partitions

Separate Partitions are combined
• 10 configurations built from the base partitions
• up to 10 base partitions + 1 root controller + 1 monitoring factory
• one monitoring sampler per crate controller
• up to 112 PCs in a 3-level hierarchy

Test: run the 10 test partitions sequentially

Example for 112 nodes
[Diagram labels: Root Controller, Detector controller, 10 crate controllers, Monitoring factory]
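The way the combined configurations grow can be sketched as follows. The function and key names are hypothetical, and the sketch assumes one PC per controller plus one PC for the monitoring factory, which matches the 112-node example (10 × 11 + 1 + 1) but is not stated explicitly for the smaller configurations.

```python
def build_configuration(n_base, crates_per_base=10):
    """Build a 3-level tree: root -> detector controllers -> crate controllers.

    Returns a nested dict; all names are illustrative only.
    """
    return {
        "root_controller": {
            f"detector_controller_{d}": [
                f"crate_controller_{d}_{c}"
                for c in range(1, crates_per_base + 1)
            ]
            for d in range(1, n_base + 1)
        },
        "monitoring_factory": {},
    }


def count_samplers(config):
    """One monitoring sampler per crate controller."""
    return sum(len(crates) for crates in config["root_controller"].values())


def count_pcs(config):
    """Assumes one PC per controller and one for the monitoring factory."""
    detectors = config["root_controller"]
    n_crates = sum(len(crates) for crates in detectors.values())
    return 1 + len(detectors) + n_crates + 1  # root + detectors + crates + factory


# PCs needed for the 10 configurations (1 to 10 base partitions combined).
sizes = [count_pcs(build_configuration(n)) for n in range(1, 11)]
```

Under these assumptions the configuration size follows 11·n + 2 for n base partitions, so the series steps up in increments of 11 PCs to the full 112.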


Nested Partitions in Configuration data file


See contribution for this conference: Atlas DAQ Configuration Databases by Igor Soloviev


100 controller partition


Timing tests: Logical View of Transitions

[State diagram. Recoverable labels:]

• States: closed, absent, initial, loaded, configured, running
• Transitions: setup / close, boot / shutdown
• Timed sequences: cold start / cold stop, luke warm start / luke warm stop, warm start / warm stop
• infrastructure: server processes created and stopped by play_daq
• supervisor: processes started and stopped
• supervisor & controller communication
• communication with all controllers
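A minimal sketch of how such transition timings could be collected over a controller tree is given below. It is an in-process simulation only: the real controllers are separate processes on separate PCs that communicate over the network, and none of the class or function names here come from the online software. The state names are taken from the diagram; propagating a transition to the children before the parent changes state is an assumption of this sketch.

```python
import time

# Run control state names as they appear in the diagram.
STATES = ["initial", "loaded", "configured", "running"]


class Controller:
    """Hypothetical stand-in for a run controller with child controllers."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.state = "initial"

    def go_to(self, target):
        """Move the whole subtree to `target` (children first: an assumption)."""
        for child in self.children:
            child.go_to(target)
        self.state = target

    def count(self):
        """Number of controllers in this subtree, including this one."""
        return 1 + sum(child.count() for child in self.children)

    def all_in_state(self, state):
        return self.state == state and all(
            child.all_in_state(state) for child in self.children)


def build_tree(n_detectors, n_crates):
    """Root controller -> detector controllers -> crate controllers."""
    detectors = [
        Controller(f"detector_{d}",
                   [Controller(f"crate_{d}_{c}") for c in range(n_crates)])
        for d in range(n_detectors)
    ]
    return Controller("root", detectors)


def timed_transitions(root, targets):
    """Drive the tree through `targets`, returning seconds per transition."""
    timings = {}
    for target in targets:
        start = time.perf_counter()
        root.go_to(target)
        timings[target] = time.perf_counter() - start
    return timings


root = build_tree(n_detectors=10, n_crates=10)
timings = timed_transitions(root, STATES[1:])
```

Timing each transition at the root, from command to completion of the whole subtree, mirrors the idea of measuring the system-wide cost of a state change rather than the cost inside a single controller.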


Setup/Boot/Shutdown/Close, IT-Cluster

[Plot. Annotations:]
• Slow increase with larger configuration
• Constant
• Expected increase with number of processes
• Dependency problem discovered


Scalability and Performance: RC state transitions, IT-Cluster

[Plot. Curves and annotations:]
• Single state transition
• Single state transition plus 1s
• 3 state transitions
• Heavy load of communication


Results in numbers

For the large test partitions
• ~340 processes were running on 112 PCs: 111 controllers, 100 monitoring samplers, 112 pmg daemons, ~10 servers, 1 monitoring factory
• ~850 entries in the database data file (250 sw, 600 hw)

First large scale test
• 45 issues found (bugs, problems, improvement suggestions)
• effort of 52 days (in equivalent 8h working days) over an elapsed time of 3 weeks of test preparation and 1 week of testing, excluding analysis, for 1-3 testers
• tons of log files

Following iterations
• re-use original test plan and add brief update
• preparation time reduced radically to ~2-3 days
• test runs mostly done automatically
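As a cross-check, the process count for the large partition can be rebuilt from the hierarchy. The breakdown of the 111 controllers into root, detector and crate controllers is inferred from the 3-level configuration, and the server count is only given as approximate, so the total is approximate as well.

```python
# Hierarchy of the largest configuration (10 base partitions, 10 crates each).
N_BASE = 10
CRATES_PER_BASE = 10

process_counts = {
    # 1 root + 10 detector controllers + 100 crate controllers
    "controllers": 1 + N_BASE + N_BASE * CRATES_PER_BASE,
    # one monitoring sampler per crate controller
    "monitoring_samplers": N_BASE * CRATES_PER_BASE,
    # one pmg daemon per PC, 112 PCs
    "pmg_daemons": 112,
    # quoted as "~10", so this entry is approximate
    "servers": 10,
    "monitoring_factory": 1,
}

total_processes = sum(process_counts.values())
```

The sum comes to 334, which is consistent with the quoted figure of ~340 processes given the approximate server count.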


Experience and Tips

Preparation
• Require Unit Tests of components
• Prepare a detailed Test Plan beforehand
• Run large Scale Tests on a tested and frozen release
• Foresee expandable, flexible configuration and test infrastructure
• Encourage precise information logging for problem tracing

Organization
• Store the testware in the software repository
• Run the testware regularly/automatically to verify it is up to date
• Re-use test items like test structure, testware, scripts, checklists

Network
• Use NFS not AFS
• Run on isolated network & monitor activity


Conclusions and Future

• The online system can run a partition consisting of > 100 PCs
• The online system can run partitions in parallel
• Scalability tests spot problems you can’t see in another manner
• Shielding from the CERN network has a very positive effect
• The behavior of a 4-level hierarchy is very similar to that of a 3-level one
• Very large scale Stress Tests help in studying process communication

Future
• Run basic integration test at each successful nightly build
• Repeat Tests on a regular basis for each major release, building on existing material
• Push scale further to uncover new effects
• Automate the tests further
• Gradually include more SW items and components from other systems

Many thanks to CMS and to CERN/IT for giving us access to their PC clusters