Mitigation of Failures in High Performance Computing via Runtime Techniques

Xiang Ni

Advisor: Laxmikant Kale

Ph.D. Preliminary Examination

© Xiang Ni, Mitigating Failures in HPC

Table of Contents

• Motivations

• Protection for Hard Errors

• Detection and Correction of Silent Data Corruptions

• Memory Limitations



Fault Tolerance is Everywhere

• We use multiple locks to protect our house

• We set as many alarms as possible

• In games or movies, we can restart to make progress

• But not with your traditional HPC applications


Failures are Rising in HPC

• Titan at Oak Ridge National Lab: 2nd largest supercomputer in the world

• More than 17 PFlops with 560,640 cores

• 1.61 failures/day (2014)

[Figure: failure categories on Titan, including memory errors, machine check exceptions, and voltage faults.]

• Exascale machines that are 100 times more powerful will face more obstacles.

• Commercial vendors are unlikely to address resilience issues for the HPC market.


Failures May Be Silent

• Common sources of soft errors

• Particle-induced single event upset

• Manufacturing fault

• Data corruption: you may or may not know

Shrinking chip size

• More energy efficient

• Higher soft error rate


Memory is Limited

[Figure: bytes-to-flops ratio (0 to 0.25) versus peak performance in TFlops (log scale, 10 to 100000) for Tianhe, Titan, Mira, Stampede, Edison, Hopper, Kraken, Intrepid, BlueGene/L, Jaguar (XT4), and Red Storm.]


Memory is Limited

          Read Latency   Write Latency
DDR3      0.01 μs        0.01 μs
PCM       0.05 μs        1 μs
NAND      10 μs          100 μs
Disk      1000 μs        1000 μs

It is promising to use new types of memory in HPC.


Thesis Focus

❖ Develop runtime techniques to protect HPC applications from failures

Hard errors: reducing checkpoint and restart overhead

Soft errors: detecting and correcting silent data corruptions

Lack of memory: how to utilize NVRAM for checkpointing and application execution?



Part 1 Protection for Hard Errors



Double In-memory Checkpointing

❖ Charm++

❖ SCR

❖ FTI

The application resumes computation only after all nodes have successfully saved their checkpoints on their buddy nodes.
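The buddy scheme described above can be illustrated with a small in-process model. This is only a sketch under stated assumptions: the `Node` class, the `buddy_of` rule (next rank, wrapping around), and the `recover` routine are hypothetical names chosen for illustration and are not the Charm++, SCR, or FTI APIs.

```python
import copy


class Node:
    """A toy compute node holding application state and two checkpoint slots."""

    def __init__(self, rank, state):
        self.rank = rank
        self.state = state
        self.local_ckpt = None   # copy of this node's own state
        self.buddy_ckpt = None   # copy of the state of the node it backs up


def buddy_of(rank, num_nodes):
    # Assumption: the buddy is simply the next rank, wrapping around.
    return (rank + 1) % num_nodes


def blocking_checkpoint(nodes):
    """Store every node's state locally and on its buddy before returning."""
    for node in nodes:
        node.local_ckpt = copy.deepcopy(node.state)
        buddy = nodes[buddy_of(node.rank, len(nodes))]
        buddy.buddy_ckpt = copy.deepcopy(node.state)
    # Only after this loop finishes may the application resume.


def recover(nodes, failed_rank):
    """Rebuild a failed node from its buddy and roll every node back."""
    buddy = nodes[buddy_of(failed_rank, len(nodes))]
    replacement = Node(failed_rank, copy.deepcopy(buddy.buddy_ckpt))
    replacement.local_ckpt = copy.deepcopy(buddy.buddy_ckpt)
    nodes[failed_rank] = replacement
    for node in nodes:
        node.state = copy.deepcopy(node.local_ckpt)
    # The replacement must again hold the checkpoint of the node it backs up.
    predecessor = nodes[(failed_rank - 1) % len(nodes)]
    replacement.buddy_ckpt = copy.deepcopy(predecessor.local_ckpt)


nodes = [Node(rank, {"x": rank}) for rank in range(4)]
blocking_checkpoint(nodes)
for node in nodes:
    node.state["x"] += 100   # work done after the checkpoint
recover(nodes, failed_rank=2)  # node 2 is lost; everyone rolls back
```

Because each checkpoint exists in two places, losing a single node never loses the only copy of its state.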


Limitation of Checkpoint/Restart

[Figure: relative increase (0 to 60) in memory size and network bandwidth by year, 2008 to 2020.]

✤ Increase in memory size per year: 41%

✤ Increase in network bandwidth per year: 26%
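To see why the two growth rates above matter, the following sketch compounds them over several years. It assumes, for illustration only, that checkpoint size scales with memory size and that transfer time is size divided by bandwidth; the function name is hypothetical.

```python
MEMORY_GROWTH = 0.41     # yearly increase in memory size (from the slide)
BANDWIDTH_GROWTH = 0.26  # yearly increase in network bandwidth (from the slide)


def relative_transfer_time(years):
    """Checkpoint transfer time after `years`, relative to year 0.

    Assumes checkpoint size tracks memory size and that transfer time is
    checkpoint size divided by network bandwidth.
    """
    memory = (1 + MEMORY_GROWTH) ** years
    bandwidth = (1 + BANDWIDTH_GROWTH) ** years
    return memory / bandwidth


for years in (0, 4, 8, 12):
    print(f"after {years:2d} years: {relative_transfer_time(years):.2f}x")
```

Under these assumptions the time to ship a full-memory checkpoint over the network grows by roughly 12% per year, which is the gap that motivates overlapping checkpoint transfer with computation.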


Semi-Blocking Checkpoint

[Figure: timelines for Node 1 and Node 2 showing the barrier, the point where the local checkpoint is done, and the point where the remote checkpoint is done.]

✤ Resume computation as soon as each node stores its own checkpoint (local checkpoint).

✤ Interleave the transmission of the checkpoint to buddy with application execution (remote checkpoint).
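The two steps above can be sketched with a background thread standing in for the network transfer. This is a minimal illustration, not the runtime's implementation: `semi_blocking_checkpoint`, `fake_send`, and `buddy_memory` are hypothetical names, and a real runtime would interleave message sends with application tasks rather than use a Python thread.

```python
import copy
import threading


def semi_blocking_checkpoint(state, send_to_buddy):
    """Take a local checkpoint, then ship it to the buddy in the background.

    Returns the local checkpoint and the thread performing the transfer.
    """
    local_ckpt = copy.deepcopy(state)  # blocking part: local checkpoint
    sender = threading.Thread(target=send_to_buddy, args=(local_ckpt,))
    sender.start()                     # remote part overlaps with computation
    return local_ckpt, sender


# Stand-in for the buddy node's memory and the network transfer.
buddy_memory = {}


def fake_send(ckpt):
    buddy_memory["ckpt"] = copy.deepcopy(ckpt)


state = {"iteration": 10, "data": [1.0, 2.0, 3.0]}
local_ckpt, sender = semi_blocking_checkpoint(state, fake_send)

state["iteration"] += 1  # the application continues while the transfer runs
sender.join()            # remote checkpoint done
```

The transfer reads from the private copy `local_ckpt`, so the application can keep mutating `state` without corrupting the checkpoint in flight.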


Minimize Checkpoint Interference



Optimistic Scheduling

[Figure: interference (s) and benefit (%) of optimistic scheduling as a function of overlap (s).]


Single Checkpoint Overhead

[Figures: checkpoint overhead (s) versus number of cores (128, 256, 512, 1024) for blocking and semi-blocking checkpoints; two panels.]


Automatic Checkpoint

❖ Optimal Checkpoint Interval

✓ Traditionally, applications checkpoint at a fixed interval.

✓ If failures follow a Poisson process: Daly's model.

✓ What if failures follow a Weibull process, with an increasing or decreasing failure rate?

✓ How can applications automatically take preventative checkpoints based on failure prediction?

It is too expensive for HPC applications to synchronize very often to take a consistent checkpoint.

Without synchronization, applications may hang due to an inconsistent checkpoint.
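As a worked example of the fixed-interval case, the sketch below evaluates Young's first-order formula and Daly's higher-order estimate of the optimal compute time between checkpoints. The MTBF is derived from the 1.61 failures/day quoted earlier for Titan; the 180 s checkpoint cost is an assumed value chosen only for illustration.

```python
import math


def young_interval(delta, mtbf):
    """Young's first-order optimum: sqrt(2 * delta * M)."""
    return math.sqrt(2.0 * delta * mtbf)


def daly_interval(delta, mtbf):
    """Daly's higher-order estimate of the optimal compute time per checkpoint.

    delta: time to write one checkpoint (s)
    mtbf:  mean time between failures M (s)
    """
    if delta >= 2.0 * mtbf:
        return mtbf
    x = delta / (2.0 * mtbf)
    return young_interval(delta, mtbf) * (1.0 + math.sqrt(x) / 3.0 + x / 9.0) - delta


SECONDS_PER_DAY = 86400.0
mtbf = SECONDS_PER_DAY / 1.61  # 1.61 failures/day from the Titan slide
delta = 180.0                  # assumed checkpoint cost in seconds

print(f"MTBF:  {mtbf / 3600:.1f} hours")
print(f"Young: checkpoint every {young_interval(delta, mtbf) / 60:.1f} minutes")
print(f"Daly:  checkpoint every {daly_interval(delta, mtbf) / 60:.1f} minutes")
```

Both formulas assume exponentially distributed failures, so they do not cover the Weibull case or prediction-driven checkpoints raised in the questions above.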


Automatic Checkpoint Decision



Adapting to Failures



Part 2 Detection and Correction of Silent Data Corruptions


Motivation

[Figure: utilization (0 to 1) and vulnerability as a function of the number of sockets (4K to 1M) and the soft error rate per socket (1 to 10000 FIT); three panels.]


Replication Enhanced Checkpointing

[Figure: timelines of replica 1 and replica 2 from job start through times T1, T2, and T3, showing application execution, checkpoint, and recovery phases.]

• The replicas transfer checkpoints for soft error detection.

• When a soft error is detected, both replicas roll back.

• A hard error is detected by replica 2, which sends its checkpoints to replica 1 for recovery.
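The detection and recovery rules on this slide can be sketched as follows. The replicas are modeled as plain dictionaries and all function names are hypothetical; this illustrates the idea of comparing checkpoints across replicas, not the actual runtime code.

```python
import copy


def detect_sdc(ckpt_a, ckpt_b):
    """Return True when the two replicas' checkpoints disagree."""
    return ckpt_a != ckpt_b


def checkpoint_step(replica_a, replica_b, last_good):
    """Compare replica states; commit on a match, roll both back otherwise.

    Returns (corruption_detected, checkpoint_to_keep).
    """
    if detect_sdc(replica_a["state"], replica_b["state"]):
        # Soft error: neither copy can be trusted, so both replicas roll back.
        replica_a["state"] = copy.deepcopy(last_good)
        replica_b["state"] = copy.deepcopy(last_good)
        return True, last_good
    return False, copy.deepcopy(replica_a["state"])


def recover_from_hard_error(crashed, healthy_ckpt):
    """The healthy replica's checkpoint re-seeds the crashed replica."""
    crashed["state"] = copy.deepcopy(healthy_ckpt)


r1 = {"state": [1.0, 2.0, 3.0]}
r2 = {"state": [1.0, 2.0, 3.0]}
corrupted, good = checkpoint_step(r1, r2, last_good=[0.0, 0.0, 0.0])

r2["state"][1] = 2.0000001  # simulate a silent corruption in replica 2
corrupted_again, kept = checkpoint_step(r1, r2, last_good=good)
```

With only two replicas a mismatch reveals that an error occurred but not which replica is wrong, which is why both roll back.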


Different Ways to Restart from Hard Errors

[Figure: three panels plotting progress against time for Replica 1 and Replica 2 when replica 2 crashes under periodic checkpointing.]

• Replica 2 crashes and recovers using the previous checkpoint.

• Medium: replica 1 detects the crash of replica 2 and checkpoints; replica 2 recovers using the most recent checkpoint from replica 1.

• Replica 2 crashes and waits for replica 1 to make the next periodic checkpoint, then recovers.


Utilization vs. Vulnerability

[Figure: probability of undetected SDC (0 to 0.3) versus number of sockets per replica (1K to 256K) for Weak δ = 180 s and Medium δ = 180 s.]

[Figure: utilization (0.3 to 0.5) versus number of sockets per replica (1K to 256K) for Strong δ = 15 s, Weak δ = 180 s, Medium δ = 180 s, and Strong δ = 180 s.]

Base performance

[Figure: checkpoint time (s) versus number of cores per replica (1k to 16k).]


Optimization: Topology Aware Mapping

[Figure: number of inter-replica messages (0 to 4) between Replica 1 nodes and Replica 2 nodes under (a) default mapping, (b) column mapping, and (c) mixed mapping.]


Optimization: Checksum

• Transfer a single-integer checksum instead of the whole checkpoint

• Floating-point round-off error
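The checksum idea can be sketched as below. CRC32 is used here only as an assumption for illustration, since the slide does not name the checksum algorithm, and `checkpoint_checksum` is a hypothetical helper. The last comparison shows how round-off interacts with a bitwise checksum: two values that differ only in the final bits produce different checksums.

```python
import struct
import zlib


def checkpoint_checksum(values):
    """CRC32 over the raw bytes of a list of doubles: one integer per checkpoint."""
    raw = struct.pack(f"<{len(values)}d", *values)
    return zlib.crc32(raw)


replica_1 = [1.0, 2.5, 0.3]
replica_2 = [1.0, 2.5, 0.3]
rounded = [1.0, 2.5, 0.1 + 0.2]  # differs from 0.3 only by round-off

same = checkpoint_checksum(replica_1) == checkpoint_checksum(replica_2)
differs = checkpoint_checksum(replica_1) != checkpoint_checksum(rounded)
print("identical checkpoints match:", same)
print("round-off changes the checksum:", differs)
```

Each replica sends one integer instead of its full checkpoint, so the comparison cost no longer grows with checkpoint size. The trade-off is that replicas whose results legitimately differ by round-off would be flagged as mismatched.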

[Figure: time (s) at 1k and 64k cores for the default, mixed, column, and checksum configurations, broken down into local checkpoint, checkpoint transfer, and comparison.]


Experimental Results: Checkpoint

[Figures: checkpoint time (s) at 1k and 64k cores, including the checkpoint transfer component; two panels with time axes up to 3 s and up to 0.05 s.]

Benchmark   Description                                         Configuration per core   Memory Pressure   Runtime
Jacobi3D    7-point stencil                                     64*64*128                High              AMPI
LeanMD      Short-range non-bonded force calculation in NAMD    4000 atoms               Low               Charm++


Experimental Results: Restart

[Figures: restart time (s) at 1k and 64k cores for strong, medium (column), medium (mixed), and medium (default), broken down into checkpoint transfer and reconstruction; two panels with time axes up to 2.5 s and up to 0.5 s.]


Checkpoint Overhead

[Figures: overhead per replica (%) at 1k, 4k, and 16k cores for the strong, medium, and weak schemes with the default, default+checksum, column, and column+checksum configurations; two panels with axes up to 3% and up to 0.5%.]


Part 3 Memory Limitations



Checkpointing to SSD

• Hard to fit application data and checkpoints in memory at the same time

• Full SSD strategy: store both local and remote checkpoints in SSD

• Half SSD strategy: store only remote checkpoint in SSD
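The difference between the two strategies comes down to where each checkpoint kind is placed. The sketch below models that choice with a directory standing in for the SSD; `CheckpointStore` and its methods are hypothetical names, and `pickle` is used only as a convenient serializer for the illustration.

```python
import os
import pickle
import tempfile


class CheckpointStore:
    """Keeps local and remote checkpoints in memory or in an SSD-backed directory."""

    def __init__(self, strategy, ssd_dir):
        if strategy not in ("full", "half"):
            raise ValueError("strategy must be 'full' or 'half'")
        self.strategy = strategy
        self.ssd_dir = ssd_dir
        self.memory = {}

    def _on_ssd(self, kind):
        # full: both kinds go to SSD; half: only the remote checkpoint does.
        return self.strategy == "full" or kind == "remote"

    def _path(self, kind):
        return os.path.join(self.ssd_dir, f"{kind}.ckpt")

    def save(self, kind, data):
        if self._on_ssd(kind):
            with open(self._path(kind), "wb") as f:
                pickle.dump(data, f)
        else:
            self.memory[kind] = pickle.dumps(data)

    def load(self, kind):
        if self._on_ssd(kind):
            with open(self._path(kind), "rb") as f:
                return pickle.load(f)
        return pickle.loads(self.memory[kind])


with tempfile.TemporaryDirectory() as ssd_dir:
    store = CheckpointStore("half", ssd_dir)
    store.save("local", {"owner": "A", "data": [1, 2, 3]})
    store.save("remote", {"owner": "buddy", "data": [4, 5]})
    restored_local = store.load("local")
    restored_remote = store.load("remote")
    remote_on_ssd = os.path.exists(os.path.join(ssd_dir, "remote.ckpt"))
    local_on_ssd = os.path.exists(os.path.join(ssd_dir, "local.ckpt"))
```

The half strategy keeps a node's own checkpoint in memory and moves only the buddy's copy to SSD, while the full strategy moves both.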

[Figure: objects, local checkpoints, and remote checkpoints on Node A, Node B, and Node C. B is the buddy of A.]


Asynchronous Checkpointing



Checkpoint/Restart on SSD

[Figure: timing penalty (s) versus checkpoint size per node (0.45, 1.34, and 2.23 GB) for half-aio, full-aio, half-sio, and full-sio.]

[Figure: restart time (s) versus checkpoint size per node (0.45, 1.34, and 2.23 GB) for in-memory, half-aio, and full-aio.]
