First operational experience with the CMS Run Control System
Hannes Sakulin, CERN/PH, on behalf of the CMS DAQ group
17th IEEE Real Time Conference, 24-28 May 2010, Lisbon, Portugal
The Compact Muon Solenoid Experiment
[Detector drawing: drift-tube chambers, cathode strip chambers, resistive plate chambers, iron yoke, 4 T superconducting coil, silicon strip and silicon pixel trackers, electromagnetic calorimeter, hadronic calorimeter]
• LHC: p-p collisions at E_CM = 14 TeV (2010: 7 TeV) and heavy ions; bunch-crossing frequency 40 MHz
• CMS: multi-purpose detector with a broad physics programme; 55 million readout channels
CMS Trigger and DAQ design
• First Level Trigger (hardware): up to 100 kHz
• Central DAQ: builds events at 100 kHz, 100 GB/s; two stages; 8 independent event builder / filter slices
• High Level Trigger running on the filter farm: ~700 PCs, ~6000 cores
• In total, around 10000 applications to control
[Diagram: frontend readout links feeding the two-stage event builder and the filter farm]
CMS Control Systems
[Diagram: the Run Control System (Java, web technologies) steers the DAQ and Trigger branches, with per-sub-system nodes (ECAL, Tracker, ...) and the DAQ slices; the Trigger Supervisor (XDAQ, C++) controls the front-end drivers and the First Level Trigger; data flow from the front-end electronics into the central DAQ & High Level Trigger farm]
• Detector Control System (DCS), built on PVSS (Siemens ETM) and SMI (State Management Interface): per-sub-system trees (Tracker, ECAL, ...) control low voltage, high voltage, gas and the magnet
CMS Run Control System
• Run Control World (Java, web technologies): defines the control structure
  – Function Manager: a node in the run control tree; defines a state machine & parameters; user function managers are dynamically loaded into the web application (see the sketch below)
  – Run Control Web Application: Apache Tomcat servlet container; Java Server Pages, tag libraries, web services (WSDL, Axis, SOAP)
  – GUI in a web browser: HTML, CSS, JavaScript, AJAX
• XDAQ World (C++, XML, SOAP): XDAQ applications control the hardware and the data flow
  – XDAQ is the framework of the CMS online software; it provides hardware access, transport protocols, services, etc.
  – ~10000 applications to control
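To make the Function Manager concept concrete, here is a minimal sketch of what a user function manager might look like. Every class and method name (FunctionManagerBase, StateTransitionEvent, ChildResourceProxy, ...) is invented for illustration; the real RCMS framework API is not shown in this talk.

```java
// Hypothetical sketch only: a user function manager declaring its state
// machine and parameters, and forwarding a command to its children.
public class MySubsystemFunctionManager extends FunctionManagerBase {

    @Override
    public void init() {
        // States and transitions this node understands.
        addState("Halted");
        addState("Configured");
        addState("Running");
        addTransition("Halted", "Configure", "Configured");
        addTransition("Configured", "Start", "Running");
        addTransition("Running", "Stop", "Configured");

        // Parameters exported to the parent node and to the GUI.
        declareParameter("RUN_NUMBER", Integer.class);
        declareParameter("RUN_KEY", String.class);
    }

    // Event handler invoked by the state-machine engine on "Configure".
    public void onConfigure(StateTransitionEvent event) {
        // Forward the command, together with the parameter set, to the
        // child resources (XDAQ applications, child function managers).
        for (ChildResourceProxy child : getChildren()) {
            child.execute("Configure", getParameterSet());
        }
    }
}
```

A function manager packaged like this would be loaded dynamically, by URL, into the Run Control web application, as described above.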
Function Manager Framework
[Diagram: inside a function manager, an event processor drives a state-machine engine built from a state-machine definition; user-supplied event handlers react to state-machine callbacks and operate on a parameter set. A servlet / web service interface connects to the parent function manager and to the GUI, with asynchronous notifications flowing back up (sketched below). Child resource proxies connect downwards: to child Run Control function managers via web service, to XDAQ applications (lifecycle + configuration, command, parameter, monitor; state, errors and parameters are reported back; processes are started through JobControl), and to the Detector Control System via the PSX servlet. A legend distinguishes custom code from framework code.]
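The notification path in this diagram can be pictured with a small, self-contained Java sketch: child resource proxies enqueue asynchronous state notifications, and a dedicated thread dispatches them to the function manager's event handlers. The classes below are invented for illustration, not the framework's actual ones.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Toy event processor: decouples the threads receiving notifications
// from the thread running the state-machine callbacks.
public class EventProcessor implements Runnable {
    public interface EventHandler { void onEvent(String source, String newState); }

    private final BlockingQueue<String[]> queue = new LinkedBlockingQueue<>();
    private final EventHandler handler;

    public EventProcessor(EventHandler handler) { this.handler = handler; }

    /** Called by a child resource proxy when an asynchronous notification arrives. */
    public void enqueue(String source, String newState) {
        queue.offer(new String[] { source, newState });
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                String[] event = queue.take(); // block until the next notification
                handler.onEvent(event[0], event[1]);
            }
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // shut down cleanly
        }
    }
}
```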
Function Manager Framework and external services
[Same diagram, extended with the services the framework relies on: the Resource Service DB delivers the configuration of function managers and XDAQ applications, and the DAQ Structure DB describes the DAQ hardware structure; run conditions are written to the Run Info DB; logs are gathered by a Log Collector; monitoring data and errors flow to the XDAQ Monitoring & Alarming System.]
Entire DAQ System Structure is Configurable
• Control structure: which function managers to load (by URL), their parameters, their child nodes
• Configuration of the XDAQ executives (XML): libraries to be loaded; applications (e.g. builder unit, filter unit) & parameters; network connections; collaborating applications
• Both are stored, with versioning, in the Resource Service database and accessed through an API; the configuration data flow as XML to the Job Control services, which start the processes, and commands travel over SOAP (a sketch of this flow follows below)
• High-level tools generate the configurations
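A hypothetical sketch of this flow in Java; every class name below is invented, and only the flow itself (versioned configuration from the Resource Service database, processes started by Job Control, XML delivered over SOAP) is from the slide.

```java
// Illustrative only: fetch a versioned configuration and deploy it.
public class ConfigurationFlow {
    private final ResourceServiceClient resourceService; // assumed DB-backed API
    private final JobControlClient jobControl;           // assumed process starter
    private final SoapClient soap;                       // assumed SOAP transport

    public ConfigurationFlow(ResourceServiceClient rs, JobControlClient jc, SoapClient sc) {
        this.resourceService = rs;
        this.jobControl = jc;
        this.soap = sc;
    }

    public void deploy(String configKey, int version) {
        Configuration cfg = resourceService.load(configKey, version);
        for (ExecutiveDescriptor exec : cfg.executives()) {
            // Start the XDAQ executive on its host ...
            jobControl.start(exec.host(), exec.commandLine());
            // ... then ship its XML (libraries, applications, parameters,
            // network connections) in a SOAP message.
            soap.send(exec.endpoint(), "Configure", exec.xml());
        }
    }
}
```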
CMS Control Tree
[Diagram: the GUI (web browser) sits on top of the Level-0 node, which controls Level-1 nodes for DAQ, Trigger and each sub-detector (ECAL, Tracker, DT, RPC, ...). Under the DAQ node: TTS (Trigger Throttling System), FB (FED Builder) and one node per slice (Slice 0 ... Slice 7), each with FB, RB (Readout Builder) and HLT (High Level Trigger). Under the sub-detector nodes: FEC (frontend controller) and FED (frontend driver) nodes. A toy version of this tree is spelled out below.]
• Level-0: control and parameterization of the run
• Level-1: common state machine and parameters
• Level-2 down to Level-n: sub-system specific
• The framework and the top-level run control are developed by the central team; sub-system run control is developed by the sub-system teams
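The node names in the following self-contained toy model come from the diagram; the Node class itself is invented for illustration (the real tree is defined in the Resource Service database).

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the CMS control tree sketched above.
final class Node {
    final String name;
    final List<Node> children = new ArrayList<>();
    Node(String name) { this.name = name; }
    Node addChild(String childName) {
        Node child = new Node(childName);
        children.add(child);
        return child;
    }
}

public class ControlTree {
    public static Node build() {
        Node level0 = new Node("Level-0");
        Node daq = level0.addChild("DAQ");          // Level-1
        daq.addChild("TTS");                        // Trigger Throttling System
        daq.addChild("FB");                         // FED Builder
        for (int i = 0; i < 8; i++) {               // 8 event builder / filter slices
            Node slice = daq.addChild("Slice " + i);
            slice.addChild("FB");
            slice.addChild("RB");                   // Readout Builder
            slice.addChild("HLT");                  // High Level Trigger
        }
        level0.addChild("Trigger");                 // Level-1
        for (String sub : new String[] { "ECAL", "Tracker", "DT", "RPC" }) {
            Node subsystem = level0.addChild(sub);  // Level-1, one per sub-detector
            subsystem.addChild("FEC");              // frontend controller
            subsystem.addChild("FED");              // frontend driver
        }
        return level0;
    }
}
```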
RCMS Level-1 State Machine (simplified)
[State diagram: Created, Halted, Pre-Configured, Configured, Running, Paused, Error; a Java sketch of this model follows below]
• Creation: load & start the Level-1 function managers
• Initialization: start further levels of function managers; start all XDAQ processes on the cluster
• New: Pre-Configuration (trigger only, a few seconds): sets up the clock and the periodic timing signals
• Configuration: load the configuration from the database; configure hardware and applications
• Start run
• Pause / Resume: pauses / resumes the trigger (and the trackers, which may need to change settings)
• Stop run
• Halt
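The simplified model can be captured in a few lines of self-contained Java. This is only an illustration of the transitions listed above, not the RCMS implementation; in particular, the exact set of allowed transitions is an assumption based on this slide.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

enum RunState { CREATED, HALTED, PRE_CONFIGURED, CONFIGURED, RUNNING, PAUSED, ERROR }

final class Level1StateMachine {
    // Allowed transitions, read off the (simplified) diagram above.
    private static final Map<RunState, Set<RunState>> ALLOWED = new EnumMap<>(RunState.class);
    static {
        ALLOWED.put(RunState.CREATED,        EnumSet.of(RunState.HALTED));
        ALLOWED.put(RunState.HALTED,         EnumSet.of(RunState.PRE_CONFIGURED, RunState.CONFIGURED));
        ALLOWED.put(RunState.PRE_CONFIGURED, EnumSet.of(RunState.CONFIGURED));
        ALLOWED.put(RunState.CONFIGURED,     EnumSet.of(RunState.RUNNING, RunState.HALTED));
        ALLOWED.put(RunState.RUNNING,        EnumSet.of(RunState.PAUSED, RunState.CONFIGURED, RunState.HALTED));
        ALLOWED.put(RunState.PAUSED,         EnumSet.of(RunState.RUNNING, RunState.HALTED));
        ALLOWED.put(RunState.ERROR,          EnumSet.of(RunState.HALTED));
    }

    private RunState current = RunState.CREATED;

    /** Move to the target state, or fall into ERROR on an illegal request. */
    synchronized void transitionTo(RunState target) {
        Set<RunState> allowed = ALLOWED.getOrDefault(current, EnumSet.noneOf(RunState.class));
        if (!allowed.contains(target)) {
            RunState from = current;
            current = RunState.ERROR;
            throw new IllegalStateException("illegal transition " + from + " -> " + target);
        }
        current = target;
    }
}
```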
Top-Level Run Control (Level-0)
• Central point of control; global state machine
• Level-0 allows the configuration to be parameterized:
  – sub-system run key (e.g. the level of zero suppression)
  – First Level Trigger key / High Level Trigger key
  – clock source (LHC / local)
Masking of components
• Level-0 allows components to be masked out (see the sketch below):
  – remove/add sub-systems from control and readout
  – remove/add detector partitions
  – remove/add individual frontend drivers (masking their connection to the readout (SLINK) and to the Trigger Throttling System)
  – mask out DAQ slices (one slice = 1/8 of the central DAQ)
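As an illustration of the two preceding slides, here is a hypothetical Level-0 session API. Every name below is invented; only the concepts (run keys, trigger keys, clock source, FED and slice masks) are from the talk, and the concrete key names and FED number are made up.

```java
// Hypothetical sketch only: parameterizing and masking before Configure.
Level0Session session = level0.newSession();

// Parameterize the configuration.
session.setRunKey("Tracker", "ZERO_SUPPRESSION_LOW"); // sub-system run key
session.setTriggerKeys("l1_key_example", "hlt_key_example");
session.setClockSource(ClockSource.LHC);              // LHC or local clock

// Mask out components: a sub-system, one frontend driver, one DAQ slice.
session.maskSubsystem("DT");
session.maskFed(712);     // removes the FED from the SLINK readout and the TTS
session.maskDaqSlice(3);  // run with 7 of the 8 event builder / filter slices

session.configure();
```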
Commissioning and First Operation with the LHC
Commissioning and First Operation
• Independent parallel commissioning of sub-detectors: Mini DAQ setups allow for standalone operation (next slide)
Mini DAQ (“partitioning”)
• Dedicated small DAQ setups for most sub-systems
• Low bandwidth, but sufficient for most tests
• Mini DAQ may be used in parallel to the global runs
[Diagram: a Global Run tree (Level-0 over global DAQ, global Trigger, Slice 0 ... Slice 7, Tracker, ...) next to a MiniDAQ Run tree (Level-0 over MiniDAQ, a sub-detector such as ECAL or DT, and a Local Trigger Controller or the Global Trigger); MiniDAQ runs were heavily used in the commissioning phase]
Commissioning and First Operation (continued)
• Run start time at the end of 2008: globally 8.5 minutes; central DAQ: 5 minutes (cold start)
Optimization of run startup time
• Globally:
  – optimized the global state model (pre-configuration)
  – provided tools for the parallelization of user code (parameter handling)
  – sub-system specific performance improvements
• Central DAQ:
  – developed a tool to analyze log files and plot timelines of all operations
  – distributed central DAQ control over 5 Apache Tomcat servers (previously 1)
  – reduced message traffic between Run Control and the XDAQ applications by combining commands and parameters into a single message (illustrated below)
  – new startup method for the High Level Trigger processes on multi-core machines: initialize and configure a mother process, then fork the child processes; reduced memory footprint due to copy-on-write
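The message-coalescing idea can be illustrated with the standard SAAJ API (javax.xml.soap): a single SOAP message carries both the state-machine command and its parameters, instead of one call per item. The message schema used below (element and attribute names, namespace) is invented for illustration.

```java
import java.util.Map;
import javax.xml.namespace.QName;
import javax.xml.soap.MessageFactory;
import javax.xml.soap.SOAPBodyElement;
import javax.xml.soap.SOAPMessage;

public final class CombinedCommand {
    /** Build one SOAP message carrying a command and its parameter set. */
    public static SOAPMessage build(String command, Map<String, String> parameters)
            throws Exception {
        SOAPMessage message = MessageFactory.newInstance().createMessage();
        SOAPBodyElement cmd = message.getSOAPBody()
                .addBodyElement(new QName("urn:rcms-example", command));
        for (Map.Entry<String, String> p : parameters.entrySet()) {
            cmd.addChildElement("parameter")               // hypothetical schema
               .addAttribute(new QName("name"), p.getKey())
               .addTextNode(p.getValue());
        }
        message.saveChanges();
        return message;
    }
}
```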
Run Start timing (May 2010)
• Globally 4 ¼ minutes; central DAQ: 1 ¼ minutes (Initialize, Configure, Start)
• Configuration time is now dominated by the frontend configuration (Tracker)
• Pause/Resume is 7x faster than Stop/Start
[Chart: startup timelines per sub-system; time in seconds]
Commissioning and First Operation (continued)
• Run start time now: globally < 4 ¼ minutes; central DAQ: 1 ¼ minutes
• Initially some stability issues; solved by debugging user code (thread leaks)
• Recovery from sub-system faults: individual sub-systems can be controlled from the top-level control node; fast masking / unmasking of components (requires only a partial re-configuration)
• Operator efficiency: operation is complex
  – sub-system inter-dependencies when configuring partially
  – dependencies on internal & external parameters
  – procedures to follow (e.g. a clock change)
• Operators are no longer DAQ experts but colleagues from the entire collaboration; built-in cross-checks guide the operator (next slide)
Built-in cross-checks
• Built-in cross-checks guide the shifter; they indicate the sub-systems to re-configure if
  – a parameter is changed in the GUI,
  – a sub-system / FED is added or removed, or
  – external parameters change
• They enforce the correct order of re-configuration
• They enforce a re-configuration of CMS if the clock source changed or the LHC has been unstable
• Result: improved operator efficiency (a sketch follows below)
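A self-contained sketch of such a cross-check; the API is invented, but the rules mirror the bullets above: parameter and mask changes flag sub-systems for re-configuration, and a clock-source change blocks the run until all of CMS is re-configured.

```java
import java.util.HashSet;
import java.util.Set;

public final class CrossChecks {
    private final Set<String> needsReconfigure = new HashSet<>();
    private boolean clockSourceChanged = false;

    public void onParameterChanged(String subsystem) {
        needsReconfigure.add(subsystem);
    }

    public void onFedMaskChanged(String subsystem) {
        needsReconfigure.add(subsystem);
        needsReconfigure.add("DAQ"); // the readout structure changed as well
    }

    public void onClockSourceChanged() {
        clockSourceChanged = true;   // forces a re-configuration of all of CMS
    }

    /** Called before Start: block the run and tell the shifter why. */
    public void assertReadyToStart() {
        if (clockSourceChanged) {
            throw new IllegalStateException("Clock source changed: re-configure CMS");
        }
        if (!needsReconfigure.isEmpty()) {
            throw new IllegalStateException("Re-configure first: " + needsReconfigure);
        }
    }
}
```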
Operation with the LHC
• Cosmic run: (1) bring the detector into the desired state (Detector Control System); (2) start data acquisition (Run Control System)
• LHC: detector state and DAQ state depend on the LHC; we want to keep the DAQ going before beams are stable, to ensure that we are ready
[Timeline over the LHC dipole current: the LHC clock is stable only outside the ramp; during the ramp, clock variations may unlock some links in the trigger; the tracking-detector high voltage is only ramped up when beams are stable (detector safety)]
Integration with DCS & automatic actions
• To keep the DAQ going, Run Control needs to be aware of the LHC and detector states
• The top-level control node is notified about changes and propagates them to the concerned systems (trigger + trackers):
  – the trigger masks channels while the LHC is ramping
  – the Silicon Strip Tracker masks its payload when running with HV off (noise)
  – the Silicon Pixel Tracker reduces gains when running with HV off (high currents)
• The top-level control node triggers an automatic pause/resume when relevant DCS / LHC states change during a run (a sketch follows below)
[Diagram: the Run Control System (Level-0 over DAQ, Tracker, ...) is connected to the Detector Control System (per-sub-system DCS trees) and to the LHC through PSX, the PVSS SOAP eXchange XDAQ service]
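A hypothetical sketch of such an automatic action handler in the top-level node. The enum values and method names are invented; the reactions mirror the bullets above and the timeline on the next slide, and the assumption that the run is paused around the trigger-mask change is mine.

```java
// Illustrative only: react to LHC / DCS state changes arriving via PSX.
public class AutomaticActions {
    enum LhcState { RAMP_START, RAMP_DONE, STABLE_BEAMS, BEAM_DUMP }

    private final Level0Controller level0; // assumed handle to the Level-0 node

    public AutomaticActions(Level0Controller level0) { this.level0 = level0; }

    /** Callback invoked when PSX reports a new LHC / DCS state. */
    public void onLhcStateChange(LhcState state) {
        switch (state) {
            case RAMP_START:   // clock variations may unlock trigger links
                level0.pauseRun();
                level0.send("Trigger", "MaskSensitiveChannels");
                level0.resumeRun();
                break;
            case RAMP_DONE:
                level0.pauseRun();
                level0.send("Trigger", "UnmaskSensitiveChannels");
                level0.resumeRun();
                break;
            case STABLE_BEAMS: // DCS ramps the tracker HV up
                level0.send("Tracker", "EnablePayload");
                break;
            case BEAM_DUMP:    // HV goes down again: protect against noise
                level0.send("Tracker", "DisablePayload");
                break;
        }
    }
}
```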
Automatic actions
[Timeline over the LHC dipole current: at the start of the ramp, sensitive trigger channels are masked; when the ramp is done, they are unmasked. When beams are stable, the tracker HV is ramped up; with tracker HV on, the payload is enabled, thresholds are lowered and the HV state is logged in the data. At beam stop, the tracker HV is ramped down; in a CMS run with tracker HV off, the payload is disabled, thresholds are raised and the HV state is again logged in the data.]
Observations
• Standardizing the experiment's software is important for long-term maintenance
  – almost successful, considering the size of the collaboration
  – the Run Control framework was available early in the development of the experiment's software (2003) and was adopted by all sub-systems, but some sub-systems built their own framework underneath
• Ease of use becomes more and more important: Run Control / DAQ is now operated by members of the entire CMS collaboration
• Running with high live-time: > 95% so far for stable-beam periods in 2010
Observations – Web Technology
• Operations
  – typical advantages of a web application: multiple clients, remote login
  – stability of the server (Apache Tomcat + Run Control web application) is very good: it runs for weeks
  – stability of the GUI depends on third-party products (the browser): behavior changes from one release to the next; not a big problem, as the GUI can be restarted without affecting the run
• Development
  – knowledge of Java and the Run Control framework is sufficient for basic function managers; the web-based GUI & web technologies are handled by the framework
  – development of complex GUIs such as the top-level control node is more difficult: many technologies need to be mastered; modern web toolkits are not yet used by Run Control
Summary & Outlook
• The CMS Run Control System is based on Java & web technologies
• Good stability
• Top-level control node optimized for efficiency: flexible operation of individual sub-systems; built-in cross-checks to guide the operator; automatic actions triggered by detector and LHC state
• High CMS data-taking efficiency: live-time > 95%
• Next developments: further improve fault tolerance; automatic recovery procedures; auto-pilot
[Event display: a candidate event]