Taming the Beast - Some Thoughts On Exascale Resiliency

Taming the Beast - Some Thoughts On Exascale Resiliency Dr. Peter Tröger, Senior Researcher Operating Systems and Middleware Group Hasso Plattner Institute University of Potsdam

Keynote at the International Conference on High Performance Computing and Simulation 2013

Transcript of Taming the Beast - Some Thoughts On Exascale Resiliency

Page 1: Taming the Beast - Some Thoughts On Exascale Resiliency

Taming the Beast - Some Thoughts On Exascale Resiliency

Dr. Peter Tröger, Senior Researcher Operating Systems and Middleware Group Hasso Plattner Institute University of Potsdam

Page 2: Taming the Beast - Some Thoughts On Exascale Resiliency

Dr. Peter Tröger | HPCS 2013, Helsinki

The Exascale Beast

-Projections look similar

-Millions of nodes

-Billions of concurrent activities

-Faults on the order of minutes

-Checkpointing on the order of hours

-Logic soft errors and silent data corruption

-Software faults everywhere

-Power wall (20MW) for redundancy

-Resiliency for Exascale is a multi-dimensional beast
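
To illustrate why faults on the order of minutes and checkpoints on the order of hours do not fit together, the following sketch applies Young's first-order approximation for the checkpoint interval, tau = sqrt(2 * delta * M), where delta is the time to write one checkpoint and M is the mean time between failures. The waste model (checkpoint overhead plus expected rework) is the usual first-order estimate and ignores restart time; all numbers in the usage part are hypothetical and not taken from the talk.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order checkpoint interval: sqrt(2 * delta * M)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def wasted_fraction(checkpoint_cost, mtbf):
    """First-order share of machine time lost to checkpointing and rework.

    delta / tau is the checkpoint overhead, tau / (2 * M) the expected
    recomputation after a failure. The result is capped at 1.0, which
    means no forward progress at all.
    """
    tau = young_interval(checkpoint_cost, mtbf)
    waste = checkpoint_cost / tau + tau / (2.0 * mtbf)
    return min(1.0, waste)


if __name__ == "__main__":
    # Hypothetical scenarios, all times in minutes.
    scenarios = [
        ("checkpoint 2 min, MTBF 100 min", 2.0, 100.0),
        ("checkpoint 60 min, MTBF 10 min", 60.0, 10.0),
    ]
    for label, cost, mtbf in scenarios:
        print(label, "-> wasted fraction:", round(wasted_fraction(cost, mtbf), 3))
```

Under these assumptions, a checkpoint that takes longer than the time between faults leaves no room for useful work, which is the mismatch the bullet points describe.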

[C. Engelmann]

Page 3: Taming the Beast - Some Thoughts On Exascale Resiliency

[C. Engelmann]

Page 4: Taming the Beast - Some Thoughts On Exascale Resiliency

You are not alone ...

Page 5: Taming the Beast - Some Thoughts On Exascale Resiliency

Dependability

-Umbrella term for operational requirements on a system

-„Trustworthiness of a computer system such that reliance can be placed on the service it delivers to the user“ [Laprie]

-Resiliency solution: A fault tolerance solution that allows the visibility of faults on uncommitted data [Bottomley]

System Quality

Page 6: Taming the Beast - Some Thoughts On Exascale Resiliency

Dependability Research

[Figure: dependability research over time. Labels: Mission-Critical Single Systems, Large-Scale Distributed Systems, Hybrid Systems; Hardware Solutions, Software Solutions, Combined Solutions; axis: Time]

Page 7: Taming the Beast - Some Thoughts On Exascale Resiliency

Fault Tolerance [Hanmer]

-Error recovery: Roll-forward, rollback, retry, failover, ...

-Checkpointing is a tool to implement rollback

-Spatial redundancy is a tool to implement failover

-Error mitigation: Marked data (IEEE 754-1985 NaN), error correcting codes, algorithm-based fault tolerance, ...
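
The relation between checkpointing and rollback can be sketched in a few lines. This is a minimal illustration, not a method from the talk: `TransientError`, `run_with_rollback` and `make_flaky_increment` are hypothetical names, the state is a plain dictionary, and error detection is simulated by raising an exception.

```python
import copy


class TransientError(Exception):
    """Hypothetical signal raised when an error has been detected."""


def run_with_rollback(state, steps, max_retries=3):
    """Run steps in order; on a detected error, roll back and retry.

    A checkpoint (deep copy of the state) is taken before each step.
    If a step raises TransientError, the state is restored from that
    checkpoint, which is the rollback, and the step is retried.
    """
    for step in steps:
        checkpoint = copy.deepcopy(state)
        for _attempt in range(max_retries + 1):
            try:
                step(state)
                break
            except TransientError:
                # Rollback: discard partial updates made by the failed step.
                state.clear()
                state.update(copy.deepcopy(checkpoint))
        else:
            # Rollback did not help; another recovery scheme is needed.
            raise RuntimeError("step kept failing after rollback and retry")
    return state


def make_flaky_increment(failures):
    """Build a step that corrupts the state 'failures' times before succeeding."""
    remaining = {"n": failures}

    def step(state):
        state["x"] += 1  # partial update happens before the error shows up
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise TransientError("detected error")

    return step


if __name__ == "__main__":
    final = run_with_rollback({"x": 0}, [make_flaky_increment(2), make_flaky_increment(0)])
    print(final)
```

Failover would replace the retry on the same unit with a restart on a spare unit, which is where spatial redundancy comes in.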

[Figure: fault tolerance model. Labels: Normal Operation, Latent Fault, Fault Activation, Error, Error Detection, Error Mitigation, Error Recovery, Failure]

Page 8: Taming the Beast - Some Thoughts On Exascale Resiliency

System View

System layers:

-Hardware

-Operating System / Hypervisor

-Virtual Machine

-Operating System

-Cluster Framework

-MPI

-Library / Framework

-Application

[Figure side labels: Execution environment, Programming model]

What is the best location for fault tolerance?

Page 9: Taming the Beast - Some Thoughts On Exascale Resiliency

The Fancy Approach

-New resilient programming models + unreliable execution environments

-Power and cooling is hard enough

-Delegate the fault tolerance problem to the application

-Processes need to fail-fast, applications can then apply their own resiliency scheme

-Examples: Application-level checkpointing, User-Level Failure Mitigation (ULFM), new message passing facilities (e.g. Erlang)

-This demands active participation by HPC users

-Old codes are most likely to break completely
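
A minimal sketch of the fail-fast idea, assuming a hypothetical application that distributes work chunks over nodes: the worker stops immediately when its node reports an error, and the application applies its own resiliency scheme by reassigning the chunk. The names `ProcessFailure`, `worker` and `supervise` and the health flags are illustrative only; this is neither ULFM nor Erlang code.

```python
class ProcessFailure(Exception):
    """Hypothetical fail-fast signal: stop instead of computing on bad state."""


def worker(chunk, healthy):
    """Process one chunk; fail fast if the node is known to be unhealthy."""
    if not healthy:
        raise ProcessFailure("node reported an error; failing fast")
    return sum(chunk)


def supervise(chunks, nodes):
    """Application-level resiliency: reassign work when a worker fails fast.

    'nodes' maps a node name to a health flag (an assumption made for
    this sketch). A failed node is excluded from all further work.
    """
    results = []
    alive = list(nodes)
    for chunk in chunks:
        while alive:
            node = alive[0]
            try:
                results.append(worker(chunk, nodes[node]))
                break
            except ProcessFailure:
                alive.pop(0)  # exclude the failed node, try the next one
        else:
            raise RuntimeError("no healthy nodes left")
    return results


if __name__ == "__main__":
    print(supervise([[1, 2], [3, 4]], {"n0": False, "n1": True}))
```

The point of the sketch is the division of labor: the execution environment only has to report the failure quickly, while the recovery policy lives in application code.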

Page 10: Taming the Beast - Some Thoughts On Exascale Resiliency

The Traditional Approach

-New resilient execution environments + established programming models

-Hardware level: Apart from interconnects, typically too costly; adds power demands

-OS / Cluster level: Page checkpointing, virtual clusters, ...

-MPI level: Coordinated checkpointing, message logging, data reliability, automatic path migration, FT-MPI, ...

-Seems to be preferred

-Old codes should be adoptable

-But which layer is the right one?

Page 11: Taming the Beast - Some Thoughts On Exascale Resiliency

Coverage vs. Overhead

-Migration object moved between failover units at one system layer

-System layer as containment barrier

-Coverage of the layer

-Fault model from available data

-Monitoring granularity may prevent fault detection for lower levels

-Overhead of the layer

-Migration object granularity

-Prediction quality (from data) influences false migration percentage
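
The two overhead factors can be made concrete with a small sketch. It assumes a deliberately simple model that is not from the talk: migration time is object size divided by available bandwidth, and every false alarm of the failure predictor triggers one unnecessary migration. All function and parameter names are hypothetical.

```python
def migration_time(object_size_gb, bandwidth_gb_per_s):
    """Time to move one migration object, assuming size / bandwidth."""
    return object_size_gb / bandwidth_gb_per_s


def false_migration_share(true_positives, false_positives):
    """Share of triggered migrations that were not needed (1 - precision)."""
    triggered = true_positives + false_positives
    if triggered == 0:
        return 0.0
    return false_positives / triggered


def wasted_migration_time(false_positives, seconds_per_migration):
    """Total time spent on migrations caused by false alarms."""
    return false_positives * seconds_per_migration


if __name__ == "__main__":
    # Hypothetical numbers: an 8 GB virtual machine image over a 2 GB/s link,
    # and a predictor with 80 correct and 20 false failure predictions.
    per_move = migration_time(8.0, 2.0)
    print("seconds per migration:", per_move)
    print("false migration share:", false_migration_share(80, 20))
    print("wasted seconds:", wasted_migration_time(20, per_move))
```

A coarser migration object raises the cost of every false alarm, so prediction quality matters more at layers that move large objects.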

Page 12: Taming the Beast - Some Thoughts On Exascale Resiliency

Selection of a System Layer

-Strategy for identification of the right layer

-Choose a fault model - excludes all lower system layers

-Determine average time ∆F between error and failure

-Find highest system layer with average migration time below ∆F

-Reduce remaining candidates by prediction quality and overhead

Page 13: Taming the Beast - Some Thoughts On Exascale Resiliency

Selection of a System Layer

1.Choose a fault model, exclude all lower layers

2.Determine average time ∆F between error and failure

3.Find layers with average failover time below ∆F

4.Filter candidates by detection quality, failover speed and redundancy overhead
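
The four steps above can be sketched as a small filter over layer profiles. This is one possible reading of the strategy, not the author's implementation: the `LayerProfile` fields, the ranking order in step 4 and all numbers in the usage part are assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean

# Layer order as listed on the "System View" slide.
LAYERS = [
    "Hardware",
    "Operating System / Hypervisor",
    "Virtual Machine",
    "Operating System",
    "Cluster Framework",
    "MPI",
    "Library / Framework",
    "Application",
]


@dataclass
class LayerProfile:
    name: str
    failover_time: float        # average failover time in seconds
    detection_quality: float    # 0..1, higher is better
    redundancy_overhead: float  # extra resources as a fraction


def select_layers(profiles, fault_layer, error_to_failure_times):
    """Return candidate layers for fault tolerance, best first."""
    # Step 1: the fault model excludes all layers below fault_layer.
    lowest = LAYERS.index(fault_layer)
    candidates = [p for p in profiles if LAYERS.index(p.name) >= lowest]
    # Step 2: average time between error and failure (delta F).
    delta_f = mean(error_to_failure_times)
    # Step 3: keep layers whose average failover time is below delta F.
    candidates = [p for p in candidates if p.failover_time < delta_f]
    # Step 4: rank by detection quality, failover speed, redundancy overhead.
    return sorted(
        candidates,
        key=lambda p: (-p.detection_quality, p.failover_time, p.redundancy_overhead),
    )


if __name__ == "__main__":
    # Hypothetical profiles and measurements.
    profiles = [
        LayerProfile("Hardware", 0.001, 0.9, 1.0),
        LayerProfile("Virtual Machine", 20.0, 0.6, 0.3),
        LayerProfile("MPI", 5.0, 0.8, 0.2),
        LayerProfile("Application", 2.0, 0.7, 0.1),
    ]
    chosen = select_layers(profiles, "Virtual Machine", [8.0, 12.0, 10.0])
    print([p.name for p in chosen])
```

How the three criteria in step 4 are weighted against each other is a design decision; the lexicographic order used here is only the simplest option.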

[Figure: system layer stack with Hardware, Operating System / Hypervisor, Virtual Machine, Operating System, Cluster Framework, MPI, Library / Framework, Application; side label: Execution environment]

Page 14: Taming the Beast - Some Thoughts On Exascale Resiliency

The Fault Model

-Error detection scheme

-Most effective fault tolerance scheme

-Sphere of redundancy

-Testing procedures

[Figure: Fault Model] [Laprie & Kanoun]

Page 15: Taming the Beast - Some Thoughts On Exascale Resiliency

„One of the main problems is that it is difficult to get fault details on very recent machines and to anticipate what kind of faults are likely to happen at Exascale.“

[Cappello]

Page 16: Taming the Beast - Some Thoughts On Exascale Resiliency

„It is important to note that the number of failures with undetermined root cause is significant. [...] hardware and software are among the largest contributors to failures.“

[Schröder, Gibson]

Page 17: Taming the Beast - Some Thoughts On Exascale Resiliency

The Exascale Dilemma

-Everybody agrees that Exascale resiliency is a problem.

-The budget for solving this problem is very limited.

-Performance is the top priority.

-Your HPC supplier has this problem too.

-The key issue is uncertainty.

-Fault model, failure modes and rates, error propagation paths, ...

-But many people still try to find 'the' right answer.

-Based on incomplete knowledge.

-Let's give up on that.

System Quality

Page 18: Taming the Beast - Some Thoughts On Exascale Resiliency

Our Proposal: Embrace the Uncertainty

-Create novel ways to deal with partial system knowledge

-Create incomplete fault models and start to use them

-Perform partial dependability assessments of designs

-Something in between purely qualitative and purely quantitative

-Perform this iteratively

-Focus on relative comparisons

-Redundancy approach A is better than B

-Avoid the numbers game (e.g. MTTI)

-Make uncertainty explicit

[Image source: http://amybrucker.com/]

Page 19: Taming the Beast - Some Thoughts On Exascale Resiliency

Example 1: Anomaly Signals

-Extended version of the anomaly detection approach by Oliner et al. (2010)

-Monitoring on different system levels has incompatible metrics

-Each error situation can best be identified by only one of the system layers

-Idea: Normalize and correlate health indicators across all system levels
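
The idea of normalizing and correlating health indicators can be sketched with z-scores and the Pearson correlation. This is a generic illustration and not the anomaly signal algorithm referenced above; the two indicator series in the usage part (corrected memory errors at the hardware level, message retransmissions at the MPI level) are hypothetical.

```python
from statistics import mean, pstdev


def normalize(series):
    """Turn a raw metric into z-scores so that layers become comparable."""
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]  # a constant signal carries no anomaly
    return [(x - mu) / sigma for x in series]


def correlation(a, b):
    """Pearson correlation of two equally long series via their z-scores."""
    za, zb = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(za, zb)) / len(za)


if __name__ == "__main__":
    # Hypothetical samples taken at the same points in time.
    corrected_memory_errors = [0, 1, 0, 7, 9, 1]   # hardware level, count
    mpi_retransmissions = [2, 3, 2, 30, 41, 4]     # MPI level, count
    print(round(correlation(corrected_memory_errors, mpi_retransmissions), 3))
```

After normalization the units of the original metrics no longer matter, which is what makes indicators from different system levels comparable at all.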

[Figure: health indicators across the system layers: Hardware, Operating System / Hypervisor, Virtual Machine, Operating System, Cluster Framework, MPI, Library / Framework, Application]

Page 20: Taming the Beast - Some Thoughts On Exascale Resiliency

Example 2: FuzzTrees

-Dependability modeling in failure space is widely established

-Fault trees, attack trees, FMEA

-Describe potential failure modes and how they may occur

-Current approaches assume a fixed design and well-known fault probabilities

-FuzzTrees: Extended version of fault tree analysis

-Make uncertainty about system configuration explicit

-Make uncertainty about failure rates explicit

-Still get some answers

Page 21: Taming the Beast - Some Thoughts On Exascale Resiliency

[Figure: FuzzTree with top event "Server Failure"]

-Primary CPU Failure: p=0.08 ± 0.008

-Secondary CPU Failure: p=0.08 ± 0.008

-Power Unit Failure: p=0.15 ± 0.05, voting gate with N: 4-5, k: N-2

-RAID 0 Failure / RAID 1 Failure, each over Disc Failure: p=0.12 ± 0.01, #2

-'Hello World' example

-Optional spare processor for failover

-Choice of RAID level

-Choice of power supply redundancy

-Consideration of cost factor
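
A rough sketch of how such a tree can still yield answers under uncertainty: every design choice is enumerated, and each failure probability is propagated as an interval. Several assumptions go beyond the slide and are made only for illustration: basic events are independent, the top event is an OR over CPU, power and storage, the voting gate fires when at least k = N-2 power units fail, RAID 0 fails with any disc and RAID 1 only with both, and plain interval bounds stand in for whatever uncertainty representation the FuzzTrees tool uses internally. The cost factor is not modeled.

```python
from itertools import product
from math import comb


def at_least_k_of_n(k, n, p):
    """Probability that at least k of n independent units fail."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def server_failure(cpus, n_power, raid, p_cpu, p_power, p_disc):
    """Top event probability for one configuration and one set of rates."""
    cpu = p_cpu**cpus  # with a spare CPU, both have to fail
    power = at_least_k_of_n(n_power - 2, n_power, p_power)  # k = N-2
    storage = 1 - (1 - p_disc) ** 2 if raid == 0 else p_disc**2  # 2 discs
    return 1 - (1 - cpu) * (1 - power) * (1 - storage)  # OR gate


def failure_interval(cpus, n_power, raid):
    """Lower and upper bound of the top event probability.

    The tree is monotone in every basic event probability, so using
    all lower bounds and all upper bounds gives the interval ends.
    """
    lo = server_failure(cpus, n_power, raid, 0.08 - 0.008, 0.15 - 0.05, 0.12 - 0.01)
    hi = server_failure(cpus, n_power, raid, 0.08 + 0.008, 0.15 + 0.05, 0.12 + 0.01)
    return lo, hi


def all_configurations():
    """Enumerate the choices: 1 or 2 CPUs, N = 4 or 5, RAID 0 or 1."""
    return {
        (cpus, n, raid): failure_interval(cpus, n, raid)
        for cpus, n, raid in product((1, 2), (4, 5), (0, 1))
    }


if __name__ == "__main__":
    for (cpus, n, raid), (lo, hi) in sorted(all_configurations().items()):
        print(f"{cpus}xCPU, N={n}, RAID {raid}: [{lo:.4f}, {hi:.4f}]")
```

Under these assumptions the intervals allow relative statements without exact numbers: for the same CPU and power choices, the RAID 1 interval lies entirely below the RAID 0 interval.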

FuzzTrees

Resulting configurations of "Server Failure" (FuzzEd, 11.02.13):

-1xCPU, N=4, RAID 0: Power Unit Failure #4, k/N: 2-4; Disc Failure #2 (fuzztrees.net/editor/43)

-1xCPU, N=4, RAID 1: Power Unit Failure #4, k/N: 2-4; Disc Failure #2 (fuzztrees.net/editor/42)

-1xCPU, N=5, RAID 0: Power Unit Failure #5, k/N: 3-5; Disc Failure #2 (fuzztrees.net/editor/48)

-1xCPU, N=5, RAID 1: Power Unit Failure #5, k/N: 3-5; Disc Failure #2 (fuzztrees.net/editor/49)

-2xCPU, N=4, RAID 0: Power Unit Failure #4, k/N: 2-4; Disc Failure #2 (fuzztrees.net/editor/44)

-2xCPU, N=4, RAID 1: Power Unit Failure #4, k/N: 2-4; Disc Failure #2 (fuzztrees.net/editor/45)

-2xCPU, N=5, RAID 0: Power Unit Failure #5, k/N: 3-5; Disc Failure #2 (fuzztrees.net/editor/47)

-2xCPU, N=5, RAID 1: Power Unit Failure #5, k/N: 3-5; Disc Failure #2 (fuzztrees.net/editor/46)

Page 22: Taming the Beast - Some Thoughts On Exascale Resiliency

Online editor available at www.fuzztrees.net

Page 23: Taming the Beast - Some Thoughts On Exascale Resiliency

Online editor available at www.fuzztrees.net

Page 24: Taming the Beast - Some Thoughts On Exascale Resiliency

Summary

-Exascale resiliency is about uncertainty

-Everything fails in completely unforeseeable ways

-Reactive fault tolerance does not scale

-There is more monitoring data than smart mining approaches

-Surprisingly, industry agrees to most of this ...

-Promising new research directions

-Imprecise health indication with automated correlation

-Imprecise dependability modeling
