A Proactive Resiliency Approach for Large-Scale HPC Systems
System Research Team
Presented by Geoffroy Vallee
Oak Ridge National Laboratory
Welcome to HPCVirt 2009
Goal of the Presentation
Can we anticipate failures and avoid their impact on application execution?
Introduction
Traditional fault tolerance policies in HPC systems: reactive policies
Other approach: proactive fault tolerance
Two critical capabilities make proactive FT successful:
Failure prediction: anomaly detection
Application migration: proactive policy
Testing / Experimentation
Is proactive fault tolerance the solution?
Failure Detection & Prediction
System monitoring: live monitoring; study of non-intrusive monitoring techniques; postmortem failure analysis
System log analysis: live analysis for failure prediction; postmortem analysis
Anomaly analysis: collaboration with George Ostrouchov; statistical tool for anomaly detection
Anomaly Detection
Anomaly Analyzer (George Ostrouchov)
Ability to view groups of components as statistical distributions: identify anomalous components; identify anomalous time periods
Based on numeric data, with no expert knowledge for grouping
Scalable approach: only statistical properties of simple summaries
Power from examination of high-dimensional relationships
Visualization utility used to explore data
Implementation uses: the R project for statistical computing; the GGobi visualization tool for high-dimensional data exploration
With good failure data, could be used for failure prediction
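The idea of treating a group of components as a statistical distribution and flagging the members that stand out can be illustrated with a minimal sketch. This is not the Anomaly Analyzer itself (which the slides describe as R-based and working on high-dimensional data); it swaps in a much simpler one-dimensional technique, the modified z-score based on the median absolute deviation. The function name, the threshold of 3.5, and the temperature values are all assumptions made for illustration.

```python
from statistics import median


def anomalous_nodes(summaries, threshold=3.5):
    """Return the nodes whose summary value lies far from the group median.

    Uses the modified z-score built on the median absolute deviation (MAD),
    which is less distorted by the outliers it is trying to find than a
    mean/standard-deviation score would be.
    """
    values = list(summaries.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # at least half the nodes share one value; no usable spread
    return sorted(
        node
        for node, value in summaries.items()
        if 0.6745 * abs(value - med) / mad > threshold
    )


# Hypothetical hourly mean CPU temperatures (Celsius), one value per node.
cpu_temp = {
    "node1": 41.0,
    "node2": 42.5,
    "node3": 40.5,
    "node4": 43.0,
    "node5": 71.0,
}
print(anomalous_nodes(cpu_temp))
```

Running the same check per hour instead of per node would give a rough analogue of "anomalous time periods"; as the slides note, interpreting what a flagged node means still takes an administrator.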
Anomaly Detection Prototype
Monitoring / data collection
Prototype developed using XTORC
Ganglia monitoring system: standard metrics, e.g., memory/CPU utilization; lm_sensors data, e.g., CPU/motherboard temperature
Leveraged RRD reader from Ovis v1.1.1
Proactive Fault Tolerance Mechanisms
Goal: move the application away from the component that is about to fail (migration; pause/unpause)
Major proactive FT mechanisms: process-level migration; virtual machine migration
In our context: we do not care about the underlying mechanism; we can easily switch between solutions
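One way to picture "do not care about the underlying mechanism" is a small interface that policy code calls without knowing which backend sits behind it. The sketch below is purely illustrative: the class names, the `migrate` signature, and the node and job names are hypothetical and are not taken from the framework described in these slides. The backends only return a description string where a real one would drive a process-level or hypervisor-level tool.

```python
from abc import ABC, abstractmethod


class MigrationMechanism(ABC):
    """Hypothetical interface hiding how a job is actually moved."""

    @abstractmethod
    def migrate(self, job: str, source: str, target: str) -> str:
        """Move `job` from `source` to `target`; return a description."""


class ProcessLevelMigration(MigrationMechanism):
    def migrate(self, job, source, target):
        # A real backend would drive a process-level checkpoint/restart tool.
        return f"process-level migration of {job}: {source} -> {target}"


class VirtualMachineMigration(MigrationMechanism):
    def migrate(self, job, source, target):
        # A real backend would ask the hypervisor to move the guest.
        return f"VM migration of {job}: {source} -> {target}"


def evacuate(mechanism: MigrationMechanism, job: str, failing: str, spare: str) -> str:
    """Policy code: depends only on the interface, never on the backend."""
    return mechanism.migrate(job, failing, spare)


# The same policy call works unchanged with either backend.
for backend in (ProcessLevelMigration(), VirtualMachineMigration()):
    print(evacuate(backend, "job42", "node13", "node30"))
```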
System and application resilience
What policy to use for proactive FT?
Modular framework: virtual machine checkpoint/restart and migration; process-level checkpoint/restart and migration; implementation of new policies via our SDK
Feedback loop
Policy simulator: eases the initial phase of study of new policies; results match experimental virtualization results
Type 1 Feedback-Loop Control Architecture
Alert-driven coverage: basic failures
No evaluation of application health history or context: prone to false positives; prone to false negatives; prone to miss the real-time window; prone to decrease application health through migration; no correlation of health context or history
Type 2 Feedback-Loop Control Architecture
Trend-driven coverage: basic failures; fewer false positives/negatives
No evaluation of application reliability: prone to miss the real-time window; prone to decrease application health through migration; no correlation of health context or history
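The difference between alert-driven (Type 1) and trend-driven (Type 2) triggers can be sketched in a few lines. This is an illustration under stated assumptions, not the control architecture from the slides: the function names, the temperature limit, the look-ahead horizon, and the sample series are all hypothetical. The alert-driven check looks only at the latest sample, while the trend-driven check fits a least-squares line to recent samples and projects it forward.

```python
def alert_driven(samples, limit):
    """Type 1 style: react only once the latest sample crosses the limit."""
    return samples[-1] >= limit


def trend_driven(samples, limit, horizon):
    """Type 2 style: fit a least-squares line to the samples and react if
    the value projected `horizon` steps ahead would cross the limit."""
    n = len(samples)
    if n < 2:
        return alert_driven(samples, limit)  # no trend to fit yet
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    slope = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples)) / sum(
        (i - mean_x) ** 2 for i in range(n)
    )
    return samples[-1] + slope * horizon >= limit


# Hypothetical CPU temperature samples (Celsius) and an 80-degree limit.
rising = [60.0, 64.0, 68.0, 72.0, 76.0]
steady = [50.0, 51.0, 50.0, 51.0, 50.0]

print(alert_driven(rising, limit=80.0))             # limit not reached yet
print(trend_driven(rising, limit=80.0, horizon=3))  # projected to cross it
print(trend_driven(steady, limit=80.0, horizon=3))  # flat series, no trigger
```

Neither function looks at application health or reliability history, which is the limitation the Type 1 and Type 2 slides point out.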
Type 3 Feedback-Loop Control Architecture
Reliability-driven coverage: basic and correlated failures; fewer false positives/negatives; able to maintain the real-time window; does not decrease application health through migration; correlation of short-term health context and history
No correlation of long-term health context or history: unable to match system and application reliability patterns
Type 4 Feedback-Loop Control Architecture
Reliability-driven coverage of failures and anomalies: basic and correlated failures, anomaly detection; less prone to false positives; less prone to false negatives; able to maintain the real-time window; does not decrease application health through migration; correlation of short- and long-term health context and history
Testing and Experimentation
How to evaluate a failure prediction mechanism? Failure injection; anomaly detection
How to evaluate the impact of a given proactive policy? Simulation; experimentation
Fault Injection / Testing
First purpose: testing our research
Inject failures at different levels: system, OS, application
Framework for fault injection: controller (analyzer, detector & injector); target system & user-level targets
Testing of failure prediction/detection mechanisms
Mimic behavior of other systems: “replay” a failure sequence on another system; based on system logs, we can evaluate the impact of different policies
Fault Injection
Example faults/errors: bit-flips (CPU registers/memory); memory errors (memory corruptions/leaks); disk faults (read/write errors); network faults (packet loss, etc.)
Important characteristics: representative failures (fidelity); transparency and low overhead; detection and injection are linked
Existing work: techniques (hardware vs. software); software FI can leverage performance/debug hardware; not many publicly available tools
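A bit-flip fault is simple to state concretely. The sketch below flips a single bit inside a user-level byte buffer; it is a toy stand-in for the register- and memory-level injection mentioned in the slides, which needs hardware or OS support. The function names and the sample buffer are assumptions made for illustration.

```python
import random


def flip_bit(buffer: bytearray, bit_index: int) -> None:
    """Flip one bit of `buffer` in place (bit 0 is the LSB of byte 0)."""
    byte_index, offset = divmod(bit_index, 8)
    buffer[byte_index] ^= 1 << offset


def inject_random_bit_flip(buffer: bytearray, rng: random.Random) -> int:
    """Flip one randomly chosen bit and return its index, so that a
    detector can later be scored against what was actually injected."""
    bit_index = rng.randrange(len(buffer) * 8)
    flip_bit(buffer, bit_index)
    return bit_index


data = bytearray(b"HPC")
flip_bit(data, 0)  # 'H' is 0x48; flipping its lowest bit gives 0x49, 'I'
print(data)

rng = random.Random(2009)  # fixed seed keeps the experiment repeatable
injected_at = inject_random_bit_flip(data, rng)
print(injected_at, data)
```

Returning the injected position reflects the point that detection and injection are linked: evaluating a detector requires a record of exactly which fault was introduced.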
Simulator
System-log based: currently based on LLNL ASCI White
Evaluate the impact of: alternate policies; system/FT mechanism parameters (e.g., checkpoint cost)
Enable studies & evaluation of different configurations before actual deployment
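As a rough picture of what a log-driven policy simulation does, the sketch below replays a list of failure timestamps against a periodic checkpoint policy and reports the resulting completion time. It is a deliberately simplified model, not the simulator described in these slides: the function name and parameters are hypothetical, the failure time is made up rather than taken from the ASCI White logs, and failures that fall inside a restart period are ignored.

```python
def simulate(work, interval, ckpt_cost, restart_cost, failures):
    """Return the wall-clock time needed to finish `work` units of compute.

    A checkpoint costing `ckpt_cost` is taken after every `interval` units
    of work (the final segment needs none). A failure discards the segment
    in progress and costs `restart_cost` before work resumes from the last
    checkpoint. `failures` holds wall-clock failure times in sorted order.
    """
    pending = list(failures)
    now = 0.0
    saved = 0.0  # work protected by the most recent checkpoint
    while saved < work:
        segment = min(interval, work - saved)
        is_last = saved + segment >= work
        end = now + segment + (0.0 if is_last else ckpt_cost)
        # Simplification: failures that occurred during a restart are dropped.
        while pending and pending[0] < now:
            pending.pop(0)
        if pending and pending[0] < end:
            now = pending.pop(0) + restart_cost  # lose the current segment
        else:
            now = end
            saved += segment
    return now


# Hypothetical scenario: 100 time units of work and one failure at t = 30.
for interval in (25.0, 50.0):
    finish = simulate(100.0, interval, ckpt_cost=1.0, restart_cost=2.0, failures=[30.0])
    print(f"interval={interval}: finished at t={finish}")
```

Sweeping `interval` or `ckpt_cost` over a recorded failure trace is the kind of configuration study the slide describes, done before anything is deployed.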
Anomaly Detection: Experimentation on “XTORC”
Hardware
Compute nodes: ~45-60 (P4 @ 2 GHz)
Head node: 1 (P4 @ 1.7 GHz)
Service/log server: 1 (P4 @ 1.8 GHz)
Network: 100 Mb Ethernet
Software
Operating systems span Red Hat 9, Fedora Core 4 & 5
RH9: node53
FC4: node4, 58, 59, 60
FC5: node1-3, 5-52, 61
RH9 is Linux 2.4; FC4/5 is Linux 2.6
NFS exports ‘/home’
XTORC Idle 48-hr Results
Data classified and grouped automatically; however, those results were manually interpreted (admin & statistician)
Observations
Node 0 is the most different from the rest, particularly hours 13, 37, 46, and 47. This is the head node, where most services are running.
Node 53 runs the older Red Hat 9 (all others run Fedora Core 4/5).
It turned out that nodes 12, 31, 39, 43, and 63 were all down.
Node 13 … and particularly its hour 47!
Node 30 hour 7 … ?
Node 1 & Node 5 … ?
Three groups emerged in data clustering: 1. temperature/memory related, 2. CPU related, 3. I/O related
Anomaly Detection - Next Steps
Data: reduce overhead in data gathering; monitor more fields; investigate methods to aid data interpretation; identify significant fields for given workloads; heterogeneous nodes
Different workloads: base (no/low work); loaded (benchmark/app work); loaded + fault injection
Working toward links between anomalies and failures
Prototypes - Overview
Proactive & reactive fault tolerance: process level (BLCR + LAM-MPI); virtual machine level (Xen + any kind of MPI implementation)
Detection: monitoring framework based on Ganglia; anomaly detection tool
Simulator: system-log based; enables customization of policies and system/application parameters
Is proactive the answer?
Most of the time, prediction accuracy is not good enough, so we may lose all the benefit of proactive FT
No “one-size-fits-all” solution
Combination of different policies: “holistic” fault tolerance; example: decrease the checkpoint frequency by combining proactive and reactive FT policies
Optimization of existing policies: leverage existing techniques/policies; tuning; customization
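The checkpoint-frequency example can be made concrete with a back-of-the-envelope calculation. The sketch below uses Young's first-order approximation for the checkpoint interval, sqrt(2 x checkpoint cost x MTBF), and then assumes that a fraction `recall` of failures is predicted and avoided proactively, so that only the remaining failures drive reactive checkpointing. That assumption, the function names, and the numbers are illustrative and not results from this work; false alarms and migration cost are ignored entirely.

```python
from math import sqrt


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order approximation of the checkpoint interval."""
    return sqrt(2.0 * checkpoint_cost * mtbf)


def interval_with_prediction(checkpoint_cost, mtbf, recall):
    """Checkpoint interval when a fraction `recall` of failures is avoided
    proactively (assumption: unpredicted failures arrive with a mean time
    between failures of mtbf / (1 - recall))."""
    if not 0.0 <= recall < 1.0:
        raise ValueError("recall must be in [0, 1)")
    effective_mtbf = mtbf / (1.0 - recall)
    return young_interval(checkpoint_cost, effective_mtbf)


# Hypothetical numbers: 60 s checkpoints, one failure per day on average.
cost, mtbf = 60.0, 86400.0
reactive_only = young_interval(cost, mtbf)
combined = interval_with_prediction(cost, mtbf, recall=0.75)
print(f"reactive only: checkpoint every {reactive_only:.0f} s")
print(f"with 75% of failures avoided proactively: every {combined:.0f} s")
```

Under these assumptions, avoiding three quarters of the failures doubles the interval between checkpoints. With poor prediction accuracy the gain shrinks accordingly, which is the caveat raised at the top of this slide.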
Resource: http://www.csm.ornl.gov/srt/
Contacts Geoffroy Vallee <[email protected]>
Performance Prediction
Important variance between different runs of the same experiment
Only a few studies address the problem: “system noise”; critical to scale up; scientists want strict answers
What are the problems? Lack of tools? VMMs are too big/complex? Not enough VMM-bypass/optimization?
Fault Tolerance Mechanisms
FT mechanisms are not yet mainstream (out-of-the-box), but different solutions are starting to become available (BLCR, Xen, etc.)
Support of as many mechanisms as possible
Reactive FT mechanisms: process-level checkpoint/restart; virtual machine checkpoint/restart
Proactive FT mechanisms: process-level migration; virtual machine migration
Existing System Level Fault Injection
Virtual machines
FAUmachine. Pro: focused on FI & experiments, code available. Con: older project, lots of dependencies, slow.
FI-QEMU (patch). Pro: works with the ‘qemu’ emulator, code available. Con: patch for ARM architecture, limited capabilities.
Operating system
Linux (>= 2.6.20). Pro: extensible, kernel & user-level targets, maintained by the Linux community. Con: immature, focused on testing Linux.
Future Work
Implementation of the RAS framework
Ultimately have an “end-to-end” solution for system resilience: from initial studies based on the simulator; to deployment and testing on computing platforms; using different low-level mechanisms (process-level versus virtual machine-level mechanisms); adapting the policies to both the platform and the applications