Fault Tolerance and Adaptation in Large Scale, Heterogeneous, Soft Real-Time Systems
RTES Collaboration (NSF ITR grant ACI-0121658)
Paul Sheldon, Vanderbilt University
16 March 2005, CMS Week
Introduction
- The Problem
- Goals and Deliverables
- Demonstration system
- Comments
A 20 TeraHz Real-Time System: the BTeV Trigger (some similarities to CMS)
- Input: 800 GB/s (2.5 MHz)
- Level 1:
  - Lvl 1 processing: 190 µs per event, at the 396 ns crossing rate
  - 528 “8 GHz” G5 CPUs (factor of 50 event reduction)
  - high-performance interconnects
- Level 2/3:
  - Lvl 2 processing: 5 ms (factor of 10 event reduction)
  - Lvl 3 processing: 135 ms (factor of 2 event reduction)
  - 1536 “12 GHz” CPUs, commodity networking
- Output: 200 MB/s (4 kHz) = 1-2 Petabytes/year
The Problem: Early Project Review
“Given the very complex nature of this system, where thousands of events are simultaneously and asynchronously cooking, issues of data integrity, robustness, and monitoring are critically important and have the capacity to cripple a design if not dealt with at the outset…” “BTeV [needs to] supply the necessary level of ‘self-awareness’ in the trigger system.” [June 2000 Project Review]
Chap. 9 & Appendix E, DAQ/HLT TDR
“…demonstrate that the current system and its associated protocols are such that all transient faults are handled in real time and that, after relatively short periods of time, the system resumes its correct operation.”
“The implementation of the established fault-tolerant protocols is tested by using prototype software and hardware developed in the context of the CMS-DAQ R&D. This is the step where fault-injection tests are carried out.”
“[Fault injection] is particularly important because fault-handling code is exercised only when a fault occurs. Given that the occurrence of faults should be the exception, rather than the rule, such code is therefore either not tested, or is only minimally tested. To establish the correctness and scaling of the fault-handling mechanisms in the design, fault injection at the level of the prototype Event Builder and in a simulation of the full CMS system is used to establish the correctness of the actual implementation.”
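In the spirit of that recommendation, here is a minimal Python sketch of process-level fault injection. It is illustrative only, not the CMS or RTES framework: the `is_healthy` callback and the choice of signals are assumptions.

```python
import os
import random
import signal
import time

FAULT_SIGNALS = {
    "crash": signal.SIGKILL,   # abrupt termination
    "hang": signal.SIGSTOP,    # process stops responding
}

def inject_fault(pid, fault):
    """Inject a single fault into the target process."""
    os.kill(pid, FAULT_SIGNALS[fault])

def run_campaign(worker_pids, is_healthy, trials=100, deadline_s=5.0):
    """Randomly fault workers; count cases the system fails to recover.

    is_healthy(pid) is a hypothetical callback that asks the supervisor
    whether the faulted worker has been restored.
    """
    unrecovered = 0
    for _ in range(trials):
        pid = random.choice(worker_pids)
        inject_fault(pid, random.choice(list(FAULT_SIGNALS)))
        time.sleep(deadline_s)          # give fault handling time to act
        if not is_healthy(pid):
            unrecovered += 1
    return unrecovered
```

The point of such a campaign is exactly the one the TDR makes: fault-handling code paths are never exercised by normal running, so they must be driven deliberately.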
For Lack of a Better Name: RTES, the Real Time Embedded Systems Group
A collaboration of five institutions: University of Illinois, University of Pittsburgh, Syracuse University, Vanderbilt University (PI), and Fermilab.
NSF ITR grant ACI-0121658 funds computer scientists and electrical engineers with expertise in:
- high-performance, real-time system software and hardware
- reliability and fault tolerance
- system specification, generation, and modeling tools
[Photo: RTES graduate students]
RTES Goals
High availability: a fault-handling infrastructure capable of
- accurately identifying problems (where, what, and why)
- compensating for problems (shifting load, changing thresholds)
- automated recovery procedures (restart / reconfiguration)
- accurate accounting
- extensibility (capturing new detection/recovery procedures)
- policy-driven monitoring and control
Dynamic reconfiguration: adjust to potentially changing resources.
RTES Goals (continued)
- Faults must be detected/corrected ASAP, semi-autonomously, with as little human intervention as possible: distributed & hierarchical monitoring and control.
- Life-cycle maintainability and evolvability, to deal with new algorithms, new hardware, and new versions of the OS.
The RTES Solution
[Architecture diagram. Design side: a Configuration, Design and Analysis environment models algorithms, fault behavior, and resources; analysis covers performance, diagnosability, and reliability, with synthesis feeding the runtime system and feedback flowing back, via an Experiment Control Interface, into modeling and reconfiguration. Runtime side: a Global Fault Manager and Global Operations Manager sit above Region Fault and Region Operations Managers, which oversee Local Fault and Local Operations Managers running over ARMOR on each node (ARMOR/Linux on the L2/3 CISC/RISC farm, ARMOR/DSP on the L1 farm) alongside the trigger algorithms. Nodes are joined by logical control and data networks. The system spans hard real time (L1/DSP) to soft real time (L2/3).]
RTES Deliverables
A hierarchical fault management system and toolkit:
- Model Integrated Computing [Vanderbilt]: GME (Generic Modeling Environment) system modeling tools and application-specific “graphic languages” for modeling system configuration, messaging, fault behaviors, user interface, etc.
- ARMORs (Adaptive, Reconfigurable, and Mobile Objects for Reliability) [Illinois]: a robust framework for detecting and reacting to faults in processes.
- VLAs (Very Lightweight Agents for limited-resource environments) [Syracuse and Pittsburgh]: to monitor/mitigate at every level (DSPs, supervisory nodes, the Linux farm, etc.).
GME: Configuration through Modeling
A multi-aspect tool, with separate views of:
- hardware: components and physical connectivity
- executables: configuration and logical connectivity
- fault-handling behavior, using hierarchical state machines
Model interpreters can generate the system image at the code-fragment level (for fault handling), plus download scripts and configuration.
Modeling “languages” are application-specific: shapes, properties, associations, and constraints appropriate to the application/context (system model, messaging, fault mitigation, GUI, etc.). A sketch of the interpreter idea follows.
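As a rough illustration of what a model interpreter does, the toy Python sketch below walks a system model and emits a launch script per node. GME’s real interpreters are written against its own API; the `NodeModel` structure and executable names here are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class NodeModel:
    """One node as it might appear in a system-configuration model."""
    name: str
    kind: str                                  # "worker", "regional", "global"
    executables: list = field(default_factory=list)

def interpret(model):
    """Walk the model and emit a download/launch script per node."""
    for node in model:
        lines = [f"# generated for {node.name} ({node.kind})"]
        lines += [f"start {exe}" for exe in node.executables]
        yield node.name, "\n".join(lines)

# Example model, loosely following the demo's worker/global layers:
model = [
    NodeModel("btrigw101", "worker", ["armor_exec", "vla", "filter_app"]),
    NodeModel("boulder", "global", ["armor_global", "run_control"]),
]
for name, script in interpret(model):
    print(script)
```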
Modeling Environment
[Screenshots of the GME modeling environment: fault handling, process dataflow, and hardware configuration views.]
ARMOR
Adaptive, Reconfigurable, and Mobile Objects for Reliability:
- multithreaded processes composed of replaceable building blocks
- provide error detection and recovery services to user applications
A hierarchy of ARMOR processes forms the runtime environment: system management, error detection, and error recovery services are distributed across ARMOR processes, and the ARMOR runtime environment is itself self-checking.
Three-tiered ARMOR support of user applications:
- completely transparent, external support
- enhancement of standard libraries
- instrumentation with the ARMOR API
ARMOR Scalable Design
ARMOR processes are designed to be reconfigurable:
- The internal architecture is structured around event-driven modules called elements.
- Elements provide the functionality of the runtime environment, error-detection capabilities, and recovery policies.
- Deployed ARMOR processes contain only the elements necessary for the required error detection and recovery services.
ARMOR processes are resilient to errors, leveraging multiple detection and recovery mechanisms:
- internal self-checking mechanisms to prevent failures and limit error propagation
- state protected through checkpointing
- detection of and recovery from errors
The ARMOR runtime environment is fault tolerant and scalable: 1-node, 2-node, and N-node configurations. The element idea is sketched below.
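The sketch below shows the element idea in miniature: an event-driven core that routes events to pluggable detection/recovery modules, so a deployed process carries only what it needs. The element and event names are hypothetical, not the actual ARMOR API.

```python
from collections import defaultdict

class Microkernel:
    """Routes incoming events to whichever elements subscribed to them."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def register(self, event_type, element):
        self._subscribers[event_type].append(element)

    def dispatch(self, event_type, payload):
        for element in self._subscribers[event_type]:
            element(payload)

# Two toy elements; a deployed ARMOR would carry only the ones it needs.
def crash_recovery_element(payload):
    print(f"restarting crashed application {payload['app']}")

def memory_leak_element(payload):
    if payload["rss_mb"] > payload["limit_mb"]:
        print(f"killing leaking application {payload['app']}")

kernel = Microkernel()
kernel.register("app_crashed", crash_recovery_element)
kernel.register("memory_report", memory_leak_element)
kernel.dispatch("memory_report", {"app": "filter1", "rss_mb": 900, "limit_mb": 512})
```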
Execution ARMOR in a Worker
[Diagram: a Worker node hosts Filter 1, Filter 2, and an Event Builder under an Exec ARMOR. The ARMOR microkernel carries infrastructure elements (message table, named pipe, message routing, process management, application-ID management, crash detection, Elvin/ARMOR message converter, hang detection) and custom elements (node status report, filter crash report, bad data report, execution time report, memory leak report).]
Very Lightweight Agents
- Minimal footprint
- Platform independence: employable everywhere in the system!
- Monitor hardware and software
- Handle fault detection and communications with higher-level entities
[Diagram: on each Level 2/3 farm node and L2/L3 manager node (Linux), a VLA sits between the physics application and the OS kernel/hardware, talking to the rest of the system through a network API.]
A sketch of the agent loop follows.
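The Python sketch below captures the VLA role on a Linux node: a tiny periodic monitor that reads one local health indicator and reports anomalies upward. The `report` hook and message name are hypothetical; resident size is read from `/proc`, so this assumes Linux.

```python
import os
import time

PAGE_MB = os.sysconf("SC_PAGE_SIZE") / 2**20

def resident_mb(pid):
    """Resident set size of a process in MB, read from /proc (Linux)."""
    with open(f"/proc/{pid}/statm") as f:
        return int(f.read().split()[1]) * PAGE_MB

def vla_loop(pid, report, rss_limit_mb=512.0, period_s=1.0):
    """Poll one local indicator and report anomalies upward.

    report(kind, **data) is a hypothetical hook that forwards the
    observation to the local ARMOR.
    """
    while True:
        rss = resident_mb(pid)
        if rss > rss_limit_mb:
            report("memory_report", pid=pid, rss_mb=rss)
        time.sleep(period_s)
```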
Demonstration System
Demonstrate fault mitigation and recovery in a “large” (64-node) cluster:
- 54 worker nodes (108 CPUs) divided into 9 “regions”
- 9 regional managers
- 1 global manager
Distributed and hierarchical monitoring and error handling. Inject errors and watch for the appropriate behavior.
Real-Time and Embedded Technology & Applications Symposium (IEEE), San Francisco, March 7-10, 2005.
The Demonstration System Architecture
[Diagram: a Global manager (RC, ARMOR) oversees Regional managers (RC, ARMOR), each of which oversees Worker nodes (RC, VLA, ARMOR, FilterApp); a DataSource (file reader) feeds the workers. The node “iron” runs Ganglia and bridges the public and private networks; “boulder” hosts the Elvin server; laptops run Matlab GUIs over Elvin.]
Prototype Trigger Farms at Fermilab
[Photos: the L1 trigger farm (16-node Apple G5 farm with Infiniband switch) and the L2/3 trigger workers in racks F1-F5.]
L2/3 Prototype Farm Components
Using PCs from old PC farms at Fermilab:
- 3 dual-CPU manager PCs: boulder (1 GHz P3, meant for data server), iron (2.8 GHz P4: httpd, BB, and Ganglia gmetad), slate (500 MHz P3: httpd, BB)
- Managers have a private network through a 1 Gbps link (bouldert, iront, slatet)
- 15 dual-CPU (500 MHz P3) workers (btrigw2xx)
- 84 dual-CPU (1 GHz P3) workers (btrigw1xx)
No plans to add more, but we may replace them with faster ones. Ideal for RTES: 11 workers already have problems! A heterogeneous mix of aging systems!
Error Types
- “Bad event”: corrupt event data is detected by the filter application, and a message is sent for logging/counting.
- Kill a filter application: ARMOR detects its absence and restarts it.
- Hang a filter application: a time-out message from the event source triggers a kill; the filter is restarted by its ARMOR.
- Exponential slowing of a filter: detected by the VLA, which notifies ARMOR; ARMOR kills and restarts the filter application.
…Error Types
- Increase in memory usage (memory leak): detected by the VLA, which notifies ARMOR; ARMOR kills and restarts the filter application.
- Regional growth in filter execution time (all or many filter processes slow): the regional ARMOR takes corrective action, increasing thresholds in the filters.
A sketch of trend-based detection follows.
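One plausible way to detect the “slowing” and “leak” cases is to watch for sustained growth in a stream of samples (per-event execution times, resident memory). The Python sketch below is one such detector; the window size and growth threshold are assumptions, not RTES parameters.

```python
from collections import deque

class TrendDetector:
    """Flags sustained growth in a sample stream (execution time, RSS)."""
    def __init__(self, window=50, growth_limit=1.05):
        self.samples = deque(maxlen=window)
        self.growth_limit = growth_limit   # tolerated new/old average ratio

    def add(self, value):
        """Return True once the recent half outgrows the older half."""
        self.samples.append(value)
        if len(self.samples) < self.samples.maxlen:
            return False                   # not enough history yet
        half = self.samples.maxlen // 2
        old = sum(list(self.samples)[:half]) / half
        new = sum(list(self.samples)[half:]) / (self.samples.maxlen - half)
        return new > old * self.growth_limit

# Feed per-event execution times; on detection the VLA would notify
# ARMOR, which kills and restarts the filter application.
detector = TrendDetector()
flagged = [t for t in (1.0 + 0.01 * i for i in range(100)) if detector.add(t)]
```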
Near-Term Goals
- Run the 65-node demo 24/6/365; use it to debug and shake down the tools (ARMORs, VLAs, modeling); improve/update; use it as a running experiment!
- Measure/characterize the performance of the tools.
- Improve the monitoring interface: the system info provided to the human operator.
- More realistic faults, and a more complete set.
- Filter feedback.
Comments
This is an integrated approach, from hardware to physics applications.
- Standardization of resource monitoring, management, and error reporting, plus integration of recovery procedures, can make operating the system more efficient and make it easier to comprehend and extend.
- There are real-time constraints that must be met: scheduling and deadlines, and numerous detection and recovery actions.
- The product of this research will automatically handle simple problems that occur frequently, and will be as smart as the detection/recovery modules plugged into it.
Comments (continued)
The product can lead to better or increased:
- system uptime, by compensating for problems or predicting them instead of pausing or stopping the experiment
- resource utilization: the system will use the resources it needs
- understanding of the operating characteristics of the software
- ability to debug and diagnose difficult problems
Further Information
- General information about RTES: www-btev.fnal.gov/public/hep/detector/rtes/
- General information about BTeV: www-btev.fnal.gov/
- Information about GME and the Vanderbilt ISIS group: www.isis.vanderbilt.edu/
Further Information (continued)
- Information about ARMOR technology: www.crhc.uiuc.edu/DEPEND/projectsARMORs.htm
- Talks from our workshops: false2002.vanderbilt.edu/ and www.ecs.syr.edu/faculty/oh/FALSE2005/
- Wiki (internal, today): whcdf03.fnal.gov/BTeV-wiki/DemoSystem2004
- Elvin publish/subscribe networking: www.mantara.com
Backup Slides
The Problem
Chapter 9 & Appendix E of the CMS DAQ/HLT TDR provide an excellent description.
Monitoring and fault tolerance/mitigation are crucial: in a cluster of this size, processes and daemons are constantly hanging/failing, becoming corrupted, etc.
Software reliability & performance depend on:
- physics detector-machine performance
- program testing procedures, implementation, and design quality
- behavior of the electronics (front-end and within the trigger)
Hardware failures will occur! One to a few per week… maybe more?
ARMOR System: Basic Configuration
- Fault Tolerant Manager (FTM): the highest-ranking manager in the system.
- Heartbeat ARMOR: detects and recovers from FTM failures.
- Daemons: detect ARMOR crash and hang failures.
- ARMOR processes: provide a hierarchy of error detection and recovery; ARMORs are protected through checkpointing and internal self-checking.
- Execution ARMOR: oversees an application process (e.g., the various Trigger Supervisors/Monitors).
[Diagram: the FTM, Heartbeat ARMOR, and Exec ARMOR each run beside a local daemon on their node; the Exec ARMOR supervises the application process; the nodes are joined by the network. The heartbeat relationship is sketched below.]
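The heartbeat relationship between the FTM and the Heartbeat ARMOR can be pictured as below: one side emits periodic beats, the other declares failure after a missed deadline and triggers recovery. This is a generic sketch; the timeout value and `on_failure` action are assumptions.

```python
import time

class HeartbeatMonitor:
    """Declares failure when heartbeats stop arriving before the deadline."""
    def __init__(self, timeout_s, on_failure):
        self.timeout_s = timeout_s
        self.on_failure = on_failure
        self.last_beat = time.monotonic()

    def beat(self):
        """Call on every heartbeat message from the monitored process."""
        self.last_beat = time.monotonic()

    def poll(self):
        """Call periodically; trigger recovery if the beats have stopped."""
        if time.monotonic() - self.last_beat > self.timeout_s:
            self.on_failure()
            self.last_beat = time.monotonic()  # avoid re-firing during restart

monitor = HeartbeatMonitor(timeout_s=3.0,
                           on_failure=lambda: print("restarting FTM"))
```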
ARMOR: Internal Structure
ARMOR: Adaptive, Reconfigurable, and Mobile Objects for Reliability.
[Diagram: on Node 1, a Local Manager ARMOR built on the ARMOR microkernel carries elements for TCP connection management, named-pipe management, process management, and detection policy; on Node 2, an Execution Controller (microkernel plus process management, network, and recovery-policy elements) supervises the trigger application; local daemons on each node and remote daemons on Node 3 tie the nodes together.]
The Demonstration System Components
- Matlab as the GUI engine; the GUI is defined by GME models.
- Elvin publish/subscribe networking (everywhere); messages are defined by GME models.
- RunControl (RC) state machines, defined by GME models.
- ARMORs, with custom elements defined by GME models.
- FilterApp, DataSource: actual physics trigger code; a file reader supplies physics/simulation data to the FilterApp.
- Demo “faults” are encoded onto Source-Worker data messages for execution on the Worker, as sketched below.
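A minimal sketch of that fault-on-data-message transport: a fault command piggybacks on an ordinary Source-to-Worker message and is acted on when the message arrives. The message layout, field names, and `process_event` hook are hypothetical.

```python
import json
import os
import signal
import time

def make_message(event_data, fault=None):
    """Piggyback an optional fault command on an ordinary data message."""
    header = json.dumps({"size": len(event_data), "fault": fault})
    return header.encode() + b"\n" + event_data

def handle_message(raw, process_event):
    """Worker side: honor an embedded fault command, then process the event."""
    header_line, event_data = raw.split(b"\n", 1)
    fault = json.loads(header_line)["fault"]
    if fault == "crash":
        os.kill(os.getpid(), signal.SIGKILL)   # simulate a filter crash
    elif fault == "hang":
        time.sleep(10**9)                      # simulate a hung filter
    process_event(event_data)                  # normal trigger processing
```

Encoding faults in the data stream lets the same injection reach any worker through the normal event path, which is what makes the demo's error scenarios repeatable.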