Taming the Beast - Some Thoughts On Exascale Resiliency

Dr. Peter Tröger, Senior Researcher, Operating Systems and Middleware Group, Hasso Plattner Institute, University of Potsdam

Dr. Peter Tröger | HPCS 2013, Helsinki

The Exascale Beast

-Projections look similar

-Millions of nodes

-Billions of concurrent activities

-Faults in the order of minutes

-Checkpointing in the order of hours

-Logic soft errors and silent data corruption

-Software faults everywhere

-Power wall (20 MW) limits redundancy

-Resiliency for Exascale is a multi-dimensional beast
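To make the mismatch between fault intervals and checkpoint duration concrete, here is a minimal sketch. It assumes exponentially distributed times between faults, and the numbers (one fault every 5 minutes, a 60-minute checkpoint) are illustrative values only, not measurements from the talk.

```python
import math


def p_checkpoint_completes(checkpoint_minutes: float, mtbf_minutes: float) -> float:
    """Probability that no fault occurs while one checkpoint is written.

    Assumes exponentially distributed times between faults
    (a modelling assumption, not a claim from the slides).
    """
    return math.exp(-checkpoint_minutes / mtbf_minutes)


# Illustrative values only: a fault every 5 minutes on average,
# and a checkpoint that takes 60 minutes to write.
p = p_checkpoint_completes(checkpoint_minutes=60.0, mtbf_minutes=5.0)
print(f"P(checkpoint finishes without a fault) = {p:.2e}")
```

Under these assumptions almost no checkpoint would ever complete, which is why plain checkpoint and restart is not expected to carry over unchanged.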

[C. Engelmann]


You are not alone ...



Dependability

-Umbrella term for operational requirements on a system

-„Trustworthiness of a computer system such that reliance can be placed on the service it delivers to the user“ [Laprie]

-Resiliency solution: A fault tolerance solution that allows the visibility of faults on uncommitted data [Bottomley]

[Figure: Dependability Research. Axes: system quality over time. Labels: mission-critical single systems, large-scale distributed systems, hybrid systems; hardware solutions, software solutions, combined solutions.]


Fault Tolerance [Hanmer]

-Error recovery: Roll-forward, rollback, retry, failover, ...

-Checkpointing is a tool to implement rollback

-Spatial redundancy is a tool to implement failover

-Error mitigation: Marked data (IEEE 754-1985 NaN), error correcting codes, algorithm-based fault tolerance (ABFT), ...
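The relation between checkpointing and rollback can be sketched as follows. This is a minimal in-memory illustration, not an HPC checkpointing library: `TransientFault`, `run_with_rollback` and the demo steps are hypothetical names, and the checkpoint is just a deep copy of a Python dictionary.

```python
import copy


class TransientFault(Exception):
    """Hypothetical error type: a detected error that rollback can recover from."""


def run_with_rollback(state, steps, max_retries=3):
    """Run each step; on a detected error, restore the last checkpoint and retry."""
    for step in steps:
        checkpoint = copy.deepcopy(state)          # checkpoint before the step
        for _attempt in range(max_retries + 1):
            try:
                step(state)                        # may modify state in place
                break                              # step completed, move on
            except TransientFault:
                state = copy.deepcopy(checkpoint)  # rollback to the checkpoint
        else:
            raise RuntimeError("step failed after all retries")
    return state


calls = {"n": 0}


def add_one(state):
    state["total"] += 1


def flaky_add_ten(state):
    """Fails on its first call, after a partial update that must not survive."""
    state["partial"] = True
    calls["n"] += 1
    if calls["n"] == 1:
        raise TransientFault()
    del state["partial"]
    state["total"] += 10


result = run_with_rollback({"total": 0}, [add_one, flaky_add_ten])
print(result)
```

The partial update made before the fault is discarded by the rollback, so only committed results remain in the final state.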

[Figure: Fault tolerance state model. Labels: normal operation, fault activation, latent fault, error, error detection, error mitigation, error recovery, failure.]


System View

[Figure: System layers: Hardware, Operating System / Hypervisor, Virtual Machine, Operating System, Cluster Framework, MPI, Library / Framework, Application. Side labels: Execution environment, Programming model.]

What is the best location for fault tolerance?


The Fancy Approach

-New resilient programming models + unreliable execution environments

-Power and cooling are hard enough

-Delegate the fault tolerance problem to the application

-Processes need to fail-fast, applications can then apply their own resiliency scheme

-Examples: Application-level checkpointing, User-Level Failure Mitigation (ULFM), new message passing facilities (e.g. Erlang)

-This demands active participation by HPC users

-Old codes are most likely to break completely



The Traditional Approach

-New resilient execution environments + established programming models

-Hardware level: Apart from interconnects, typically too costly; adds power demands

-OS / Cluster level: Page checkpointing, virtual clusters, ...

-MPI level: Coordinated checkpointing, message logging, data reliability, automatic path migration, FT-MPI, ...

-Seems to be preferred

-Old codes should be adaptable

-But which layer is the right one?


Coverage vs. Overhead

-Migration object moved between failover units at one system layer

-System layer as containment barrier

-Coverage of the layer

-Fault model from available data

-Monitoring granularity may prevent fault detection for lower levels

-Overhead of the layer

-Migration object granularity

-Prediction quality (from data) influences false migration percentage



Selection of a System Layer

-Strategy for identification of the right layer

-Choose a fault model; this excludes all lower system layers

-Determine average time ∆F between error and failure

-Find highest system layer with average migration time below ∆F

-Reduce remaining candidates by prediction quality and overhead



Selection of a System Layer

1.Choose a fault model, exclude all lower layers

2.Determine average time ∆F between error and failure

3.Find layers with average failover time below ∆F

4.Filter candidates by detection quality, failover speed and redundancy overhead
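The four steps can be sketched as a simple filter over candidate layers. Everything concrete here is an assumption for illustration: the `Layer` fields, the numeric values, the reading of step 1 as "drop every layer below the one the fault model addresses", and the ranking by overhead in step 4.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    level: int                # 0 = hardware, higher = closer to the application
    failover_time_s: float    # average failover time of this layer
    detection_quality: float  # 0..1, share of relevant errors detected
    overhead: float           # relative redundancy overhead


def select_layers(layers, fault_level, delta_f_s, min_detection=0.0):
    """Return candidate layers, cheapest first."""
    # Step 1: the chosen fault model excludes all lower layers.
    candidates = [layer for layer in layers if layer.level >= fault_level]
    # Steps 2 and 3: delta_f_s is the average time between error and failure;
    # keep only layers that can fail over faster than that.
    candidates = [layer for layer in candidates if layer.failover_time_s < delta_f_s]
    # Step 4: filter by detection quality, then rank by overhead and speed.
    candidates = [layer for layer in candidates if layer.detection_quality >= min_detection]
    return sorted(candidates, key=lambda layer: (layer.overhead, layer.failover_time_s))


# Made-up numbers, for illustration only.
layers = [
    Layer("Hardware", 0, 0.001, 0.9, 2.0),
    Layer("Virtual Machine", 1, 45.0, 0.7, 0.3),
    Layer("Operating System", 2, 10.0, 0.6, 0.2),
    Layer("MPI", 4, 20.0, 0.8, 0.1),
    Layer("Application", 6, 5.0, 0.4, 0.05),
]

chosen = select_layers(layers, fault_level=1, delta_f_s=30.0, min_detection=0.5)
print([layer.name for layer in chosen])
```

With these values the hardware layer is excluded by the fault model, the virtual machine layer is too slow, and the application layer detects too little, leaving MPI and the operating system as candidates.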

[Figure: System layers: Hardware, Virtual Machine, Operating System, Cluster Framework, MPI, Library / Framework, Application. Side label: Execution environment.]


The Fault Model

-Error detection scheme

-Most effective fault tolerance scheme

-Sphere of redundancy

-Testing procedures

[Laprie & Kanoun]

„One of the main problem is that it difficult to get fault details on very recent machines and to anticipate what kind of faults are likely to happen at Exascale.“ [Cappello]

„It is important to note that the number of failures with undetermined root cause is significant. [...] hardware and software are among the largest contributors to failures.“ [Schröder, Gibson]


The Exascale Dilemma

-Everybody agrees that Exascale resiliency is a problem.

-The budget for solving this problem is very limited.

-Performance is the top priority.

-Your HPC supplier has this problem too.

-The key issue is uncertainty.

-Fault model, failure modes and rates, error propagation paths, ...

-But many people still try to find 'the' right answer.

-Based on incomplete knowledge.

-Let's give up on that.



Our Proposal: Embrace the Uncertainty

-Create novel ways to deal with partial system knowledge

-Create incomplete fault models and start to use them

-Perform partial dependability assessments of designs

-Something in between purely qualitative and purely quantitative

-Perform this iteratively

-Focus on relative comparisons

-Redundancy approach A is better than B

-Avoid the numbers game (e.g. MTTI)

-Make uncertainty explicit

http://amybrucker.com


Example 1: Anomaly Signals

-Extended version of anomaly detection approach by Oliner et al. (2010)

-Monitoring on different system levels has incompatible metrics

-Each error situation can best be identified by only one of the system layers

-Idea: Normalize and correlate health indicators across all system levels
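The normalization and correlation idea can be sketched with z-scores and Pearson correlation. This is a swapped-in textbook technique chosen for illustration, not necessarily the method used in the talk or in the cited work; the metric names, sample values and the threshold of 2.0 are made up.

```python
from statistics import mean, pstdev


def normalize(series):
    """Z-score: express a raw metric as unit-free deviation from its own mean."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]


def correlate(a, b):
    """Pearson correlation of two equally long series, computed from z-scores."""
    za, zb = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(za, zb)) / len(za)


def anomaly_signals(indicators, threshold=2.0):
    """For each time step, list the layers whose indicator deviates strongly."""
    normalized = {layer: normalize(series) for layer, series in indicators.items()}
    steps = len(next(iter(normalized.values())))
    return [
        sorted(layer for layer, z in normalized.items() if abs(z[t]) > threshold)
        for t in range(steps)
    ]


# Made-up samples with incompatible units: degrees Celsius, page faults
# per second, and message latency in microseconds.
indicators = {
    "hardware": [60, 61, 60, 62, 61, 60, 61, 80],
    "os": [100, 102, 101, 99, 100, 101, 99, 98],
    "mpi": [5, 5, 6, 5, 5, 6, 5, 30],
}

signals = anomaly_signals(indicators)
print(signals[-1])
print(round(correlate(indicators["hardware"], indicators["mpi"]), 2))
```

After normalization the hardware and MPI indicators flag the same time step, and their high correlation suggests a shared cause that neither layer's raw metric would reveal on its own.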

[Figure: System layers: Hardware, Virtual Machine, Operating System, Cluster Framework, MPI, Library / Framework, Application.]


Example 2: FuzzTrees

-Dependability modeling in failure space is widely established

-Fault trees, attack trees, FMEA

-Describe potential failure modes and how they may occur

-Current approaches assume a fixed design and well-known fault probabilities

-FuzzTrees: Extended version of fault tree analysis

-Make uncertainty about system configuration explicit

-Make uncertainty about failure rates explicit

-Still get some answers

[Figure: FuzzTree for the top event Server Failure. Events: Primary CPU Failure (p=0.08 ± 0.008), Secondary CPU Failure (p=0.08 ± 0.008), Power Unit Failure (p=0.15 ± 0.05; N: 4-5, k: N-2), RAID 0 Failure or RAID 1 Failure, each over Disc Failure (p=0.12 ± 0.01, #2).]

-'Hello World' example

-Optional spare processor for failover

-Choice of RAID level

-Choice of power supply redundancy

-Consideration of cost factor
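A rough sketch of how such a model can still yield answers: enumerate the design choices and bound the top-event probability for each. Several things here are assumptions, not taken from the slides: the tree structure (Server Failure as an OR over CPU, power and storage), independence of all events, reading k/N as "fails if at least k of N units fail", and the use of plain interval bounds in place of whatever uncertainty calculus FuzzTrees actually apply. Only the probabilities and the N and k values come from the figure.

```python
from itertools import product
from math import comb

# Basic-event probabilities from the slide, as (nominal, uncertainty).
P_CPU = (0.08, 0.008)
P_PSU = (0.15, 0.05)
P_DISC = (0.12, 0.01)


def or_gate(*ps):
    """Output fails if at least one independent input fails."""
    ok = 1.0
    for p in ps:
        ok *= 1.0 - p
    return 1.0 - ok


def and_gate(*ps):
    """Output fails only if all independent inputs fail."""
    fail = 1.0
    for p in ps:
        fail *= p
    return fail


def voting_gate(k, n, p):
    """Output fails if at least k of n identical independent inputs fail."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def server_failure(p_cpu, p_psu, p_disc, spare_cpu, n_psu, raid):
    """Assumed structure: Server Failure = OR(CPU, power, storage)."""
    cpu = and_gate(p_cpu, p_cpu) if spare_cpu else p_cpu
    power = voting_gate(n_psu - 2, n_psu, p_psu)  # k = N-2, as in the figure
    discs = or_gate(p_disc, p_disc) if raid == 0 else and_gate(p_disc, p_disc)
    return or_gate(cpu, power, discs)


def failure_interval(spare_cpu, n_psu, raid):
    """Lower and upper bound; valid because every gate is monotone in its inputs."""
    events = (P_CPU, P_PSU, P_DISC)
    low = server_failure(*(p - d for p, d in events), spare_cpu, n_psu, raid)
    high = server_failure(*(p + d for p, d in events), spare_cpu, n_psu, raid)
    return low, high


def clearly_better(a, b):
    """Design a beats design b if a's worst case is below b's best case."""
    return a[1] < b[0]


# Configurations as (spare CPU, number of power units N, RAID level).
results = {cfg: failure_interval(*cfg) for cfg in product([False, True], [4, 5], [0, 1])}
for (spare_cpu, n_psu, raid), (low, high) in sorted(results.items(), key=lambda kv: kv[1][1]):
    print(f"spare_cpu={spare_cpu!s:5} N={n_psu} RAID {raid}: [{low:.3f}, {high:.3f}]")
```

When two intervals do not overlap, one configuration can be called better than the other without knowing the exact failure rates, which is the kind of relative comparison the approach aims for.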

[Figure: FuzzEd editor view (fuzztrees.net/editor/43, 11.02.13), "Server Failure (1xCPU, N=4, RAID 0)", together with further Server Failure configurations derived from the FuzzTree. Each combines Primary CPU Failure, Disc Failure (#2) under RAID 0 or RAID 1, and Power Unit Failure with #4 (k/N: 2-4) or #5 (k/N: 3-5).]

FuzzTrees

Online editor available at http://www.fuzztrees.net


Summary

-Exascale resiliency is about uncertainty

-Everything fails in completely unforeseeable ways

-Reactive fault tolerance does not scale

-There is more monitoring data than smart mining approaches

-Surprisingly, industry agrees with most of this ...

-Promising new research directions

-Imprecise health indication with automated correlation

-Imprecise dependability modeling
