Mitigation of Failures in High Performance Computing via Runtime Techniques

Xiang Ni
Advisor: Laxmikant Kale

Ph.D Preliminary Examination © Xiang Ni Mitigating Failures in HPC

Table of Contents

• Motivations

• Protection for Hard Errors

• Detection and Correction of Silent Data Corruptions

• Memory Limitations


Fault Tolerance is Everywhere

• We use multiple locks to protect our house
• We set as many alarms as possible
• In games or movies, we can restart to make progress
• But not with your traditional HPC applications

Failures are Rising in HPC

• Titan: 2nd largest in the world at Oak Ridge National Lab
• More than 17 PFlops with 560640 cores

[Figure: 1.61 failures/day (2014); failure categories shown: memory errors, machine check exception, voltage fault]

Exascale machines that are 100 times more powerful will face more obstacles.
Commercial vendors are unlikely to address resilience issues for the HPC market.

Failures May Be Silent

• Common sources of soft errors
• Particle-induced single event upset
• Manufacturing fault
• Data corruption: you may or may not know

Shrinking chip size
• More energy efficient
• Higher soft error rate

Memory is Limited

[Figure: bytes-to-flops ratio (0 to 0.25) versus TFlops (axis ticks 10, 1000, 100000) for Tianhe, Titan, Mira, Stampede, Edison, Hopper, Kraken, Intrepid, BlueGene/L, Jaguar (XT4), and Red Storm]

        Read Latency   Write Latency
DDR3    0.01 μs        0.01 μs
PCM     0.05 μs        1 μs
NAND    10 μs          100 μs
DISK    1000 μs        1000 μs

It is promising to use new types of memory in HPC.

Thesis Focus

❖ Develop runtime techniques to protect HPC applications from failures

Hard errors: reducing checkpoint and restart overhead

Soft errors: detecting and correcting silent data corruptions

Lack of memory: how to utilize NVRAM for checkpointing and application execution?


Part 1 Protection for Hard Errors

Double In-memory Checkpointing

❖ Charm++
❖ SCR
❖ FTI

The application resumes computation after all nodes have successfully saved their checkpoints on their buddy nodes.
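The buddy scheme named on this slide can be sketched as a small single-process simulation. This is only an illustration under stated assumptions: the `Node` class, the "next rank" pairing rule in `buddy_of`, and the recovery steps are hypothetical and are not taken from Charm++, SCR, or FTI.

```python
import copy


class Node:
    """A simulated node: application state plus two checkpoint slots."""

    def __init__(self, rank, state):
        self.rank = rank
        self.state = state
        self.local_ckpt = None   # copy of this node's own state
        self.buddy_ckpt = None   # copy of the state of the node it is buddy for


def buddy_of(rank, num_nodes):
    # Hypothetical pairing rule: the next rank, wrapping around.
    return (rank + 1) % num_nodes


def checkpoint_all(nodes):
    """Blocking double checkpoint: returns only after every copy is stored."""
    n = len(nodes)
    for node in nodes:
        snapshot = copy.deepcopy(node.state)
        node.local_ckpt = snapshot
        nodes[buddy_of(node.rank, n)].buddy_ckpt = copy.deepcopy(snapshot)


def recover(nodes, failed_rank):
    """Rebuild a failed node from its buddy and roll the others back."""
    n = len(nodes)
    buddy = nodes[buddy_of(failed_rank, n)]
    restored = Node(failed_rank, copy.deepcopy(buddy.buddy_ckpt))
    restored.local_ckpt = copy.deepcopy(buddy.buddy_ckpt)
    # The failed node also held its predecessor's remote copy; re-create it.
    pred = nodes[(failed_rank - 1) % n]
    restored.buddy_ckpt = copy.deepcopy(pred.local_ckpt)
    nodes[failed_rank] = restored
    # Surviving nodes roll back to their own local checkpoints.
    for node in nodes:
        if node.rank != failed_rank:
            node.state = copy.deepcopy(node.local_ckpt)
    return nodes
```

Because every checkpoint exists in two places, the loss of any single node leaves one usable copy of each node's state.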

Limitation of Checkpoint/Restart

[Figure: relative increase (0 to 60) of memory size and network bandwidth by year, 2008 to 2020]

✤ Increase in memory size per year: 41%
✤ Increase in network bandwidth per year: 26%
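To see why the gap between these two growth rates matters, the rates can be compounded. The sketch below assumes that checkpoint size tracks memory size and that transfer time is size divided by bandwidth; both are simplifying assumptions, and the function names are ours.

```python
MEMORY_RATE = 0.41      # from the slide: memory size grows 41% per year
BANDWIDTH_RATE = 0.26   # from the slide: network bandwidth grows 26% per year


def relative_increase(annual_rate, years):
    """Compound growth factor after `years` at a fixed annual rate."""
    return (1.0 + annual_rate) ** years


def transfer_time_growth(years):
    """Growth of (checkpoint size / bandwidth), assuming size tracks memory."""
    return relative_increase(MEMORY_RATE, years) / relative_increase(BANDWIDTH_RATE, years)
```

Under these assumptions the time to ship a checkpoint to a buddy grows by roughly 12% per year, which compounds to almost a factor of four over the 12 years spanned by the plot.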

Semi-Blocking Checkpoint

[Figure: timeline for NODE 1 and NODE 2 marking the barrier, "local checkpoint done", and "remote checkpoint done", with intervals labeled α, β, and φ]

✤ Resume computation as soon as each node stores its own checkpoint (local checkpoint).
✤ Interleave the transmission of the checkpoint to buddy with application execution (remote checkpoint).
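The two bullets above can be sketched with a background thread. This is a minimal illustration, not the runtime's implementation: `SemiBlockingCheckpointer` and the `send_to_buddy` callable are hypothetical names, and a real system would interleave network sends with computation rather than use a Python thread.

```python
import copy
import threading


class SemiBlockingCheckpointer:
    """Local checkpoint is synchronous; the copy to the buddy runs in the background."""

    def __init__(self, send_to_buddy):
        # send_to_buddy is a hypothetical callable standing in for the network transfer.
        self._send = send_to_buddy
        self._transfer = None
        self.local_ckpt = None

    def checkpoint(self, state):
        # Never overlap two remote transfers.
        self.wait_remote()
        # Blocking part: store the local checkpoint.
        self.local_ckpt = copy.deepcopy(state)
        # Non-blocking part: ship the stored copy while the application continues.
        self._transfer = threading.Thread(target=self._send, args=(self.local_ckpt,))
        self._transfer.start()

    def wait_remote(self):
        """Block until the remote checkpoint is done (a no-op if none is in flight)."""
        if self._transfer is not None:
            self._transfer.join()
            self._transfer = None
```

The application may mutate its state immediately after `checkpoint` returns, because the transfer works on the stored copy rather than on live data.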

Minimize Checkpoint Interference

[Figure: animation over two rows of steps numbered 1 2 3 4]

Optimistic Scheduling

[Figure: interference (s) and benefit (%) of optimistic scheduling versus overlap (s)]

Single Checkpoint Overhead

[Figure: two panels of checkpoint overhead (s) versus number of cores (128, 256, 512, 1024), comparing blocking checkpoint and semi-blocking checkpoint]

Automatic Checkpoint

❖ Optimal Checkpoint Interval

✓ Traditionally, applications checkpoint at a fixed interval.
✓ If failures follow a Poisson process, use Daly's model.
✓ What if failures follow a Weibull process, with an increasing or decreasing failure rate?
✓ How can applications automatically take preventive checkpoints based on failure prediction?

It is too expensive for HPC applications to synchronize very often for a consistent checkpoint.

Otherwise applications may hang due to an inconsistent checkpoint.
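The formula shown on the slide did not survive extraction. As a reference point, the sketch below implements Daly's higher-order estimate of the optimum checkpoint interval in the form it is commonly cited, alongside Young's simpler first-order estimate. It assumes exponentially distributed failures; the function and parameter names are ours.

```python
import math


def young_interval(delta, mtbf):
    """Young's first-order estimate: sqrt(2 * delta * M)."""
    return math.sqrt(2.0 * delta * mtbf)


def daly_interval(delta, mtbf):
    """Daly's higher-order estimate of the compute time between checkpoints.

    delta: time to write one checkpoint.
    mtbf:  mean time between failures, in the same units as delta.
    """
    if delta >= 2.0 * mtbf:
        # Checkpoints are so costly relative to the MTBF that the estimate saturates.
        return mtbf
    x = delta / (2.0 * mtbf)
    return math.sqrt(2.0 * delta * mtbf) * (1.0 + math.sqrt(x) / 3.0 + x / 9.0) - delta
```

When the checkpoint cost is small relative to the MTBF, the two estimates differ by roughly the checkpoint cost itself. Neither applies directly to the Weibull case raised above.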


Automatic Checkpoint Decision


Adapting to Failures


Part 2 Detection and Correction of Silent Data Corruptions

MOTIVATION

[Figure: utilization (0 to 1) and vulnerability (0 to 1) as functions of number of sockets (4K to 1M) and soft error rate per socket (1 to 10000 FIT)]

Replication Enhanced Checkpointing

[Figure: timeline of replica 1 and replica 2 from job start through T1, T2, and T3; legend: application execution, checkpoint, recovery]

• Replicas transfer checkpoints for soft error detection
• Hard error: detected by replica 2; replica 2 sends checkpoints to replica 1 for recovery
• Soft error detected: both replicas roll back
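The soft-error path in this scheme can be sketched as a comparison of the two replicas' checkpoints at each checkpoint period. This is a simplified illustration: checkpoints are modeled as flat lists of numbers, and `compare_checkpoints` and `checkpoint_step` are hypothetical names, not the runtime's API.

```python
def compare_checkpoints(ckpt_a, ckpt_b):
    """Return the indices at which two replicas' checkpoints disagree."""
    if len(ckpt_a) != len(ckpt_b):
        raise ValueError("replica checkpoints must have the same length")
    return [i for i, (a, b) in enumerate(zip(ckpt_a, ckpt_b)) if a != b]


def checkpoint_step(replica1_state, replica2_state, last_good):
    """One checkpoint period of a two-replica scheme.

    Returns (state1, state2, new_last_good, sdc_detected).
    """
    mismatches = compare_checkpoints(replica1_state, replica2_state)
    if mismatches:
        # With two replicas we cannot tell which copy is corrupted,
        # so both roll back to the last verified checkpoint.
        return list(last_good), list(last_good), last_good, True
    verified = list(replica1_state)
    return replica1_state, replica2_state, verified, False
```

Two replicas are enough to detect a mismatch but not to vote on the correct value, which is why both roll back.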

Different Ways to Restart from Hard Errors

[Figure: three progress-versus-time panels for Replica 1 and Replica 2 under periodic checkpointing, each showing a crash of replica 2]

• Replica 2 crashes and recovers using the previous checkpoint.
• MEDIUM: replica 1 detects the crash of replica 2 and checkpoints; replica 2 recovers using the most recent checkpoint from replica 1.
• Replica 2 crashes, waits for replica 1 to make the next periodic checkpoint, and then recovers.

Utilization vs. Vulnerability

[Figure, panel 1: probability of undetected SDC (0 to 0.3) versus number of sockets per replica (1K to 256K) for the weak and medium schemes with δ = 15s and δ = 180s]
[Figure, panel 2: utilization (0.3 to 0.5) versus number of sockets per replica (1K to 256K) for the weak, medium, and strong schemes with δ = 15s and δ = 180s]

Base performance

[Figure: checkpoint time (s), 0 to 3, versus number of cores per replica (1k to 16k)]

Optimization: Topology Aware Mapping

[Figure: grids of replica 1 nodes and replica 2 nodes under (a) Default-mapping, (b) Column-mapping, and (c) Mixed-mapping, with links labeled by number of inter-replica messages (0 to 4)]

Optimization: Checksum

• Transfer a checksum of 1 integer instead of the whole checkpoint
• Floating point round-off error
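A minimal sketch of the checksum idea follows, assuming a checkpoint is a sequence of doubles. It uses CRC-32 from Python's `zlib` purely as a stand-in; the slides do not say which checksum the runtime computes. The example also shows one reading of the round-off bullet: two replicas that sum the same numbers in a different order can disagree bit for bit even when no corruption occurred.

```python
import struct
import zlib


def checkpoint_checksum(values):
    """Pack doubles into bytes and reduce them to one 32-bit integer."""
    payload = struct.pack(f"<{len(values)}d", *values)
    return zlib.crc32(payload)


def replicas_agree(values_a, values_b):
    """Compare two replicas by exchanging only their checksums."""
    return checkpoint_checksum(values_a) == checkpoint_checksum(values_b)


# Same mathematical sum, different evaluation order on each replica.
sum_on_replica1 = (0.1 + 0.2) + 0.3
sum_on_replica2 = 0.1 + (0.2 + 0.3)
false_alarm = not replicas_agree([sum_on_replica1], [sum_on_replica2])
```

Here `false_alarm` is true: a bitwise checksum flags the round-off difference as a mismatch, so this optimization fits best when both replicas execute floating-point operations in the same order.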

[Figure: time (s) versus number of cores per replica (1k, 64k) for the checksum, column, mixed, and default schemes, split into local checkpoint, checkpoint transfer, and comparison]

Experimental Results: Checkpoint

[Figure: two panels of time (s) versus number of cores per replica (1k, 64k) for the checksum, column, mixed, and default schemes, split into local checkpoint, checkpoint transfer, and comparison]

Benchmark   Description                                        Configuration per core   Memory Pressure   Runtime
Jacobi3D    7-point stencil                                    64*64*128                High              AMPI
LeanMD      Short-range non-bonded force calculation in NAMD   4000 atoms               Low               Charm++

Experimental Results: Restart

[Figure: two panels of time (s) versus number of cores per replica (1k, 64k) for strong, medium (column), medium (mixed), and medium (default), split into checkpoint transfer and reconstruction]

Checkpoint Overhead

[Figure: two panels of overhead per replica (%) versus number of sockets per replica (1k, 4k, 16k) for the strong, medium, and weak schemes under default, default+checksum, column, and column+checksum]

Part 3 Memory Limitations


Checkpointing to SSD

• Hard to fit application data and checkpoints in memory at the same time

• Full SSD strategy: store both local and remote checkpoints in SSD

• Half SSD strategy: store only remote checkpoint in SSD
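The trade-off between the strategies above can be made concrete with a small footprint calculation. The sketch assumes that a node's local checkpoint and the remote checkpoint it holds for its buddy are the same size; the function name, the strategy labels, and the numbers in the usage note are illustrative only.

```python
def memory_footprint(app_data_gb, ckpt_gb, strategy):
    """Per-node (DRAM, SSD) usage in GB for one checkpointing strategy.

    Each node holds its own (local) checkpoint and its buddy's (remote) checkpoint.
    """
    placement = {
        "in-memory": (2, 0),  # both checkpoints in DRAM
        "half-ssd": (1, 1),   # local in DRAM, remote on SSD
        "full-ssd": (0, 2),   # both checkpoints on SSD
    }
    in_dram, on_ssd = placement[strategy]
    return app_data_gb + in_dram * ckpt_gb, on_ssd * ckpt_gb
```

For example, with 8 GB of application data and 2 GB checkpoints, the DRAM needed drops from 12 GB with both checkpoints in memory to 10 GB with the half SSD strategy and 8 GB with the full SSD strategy.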

[Figure: objects, local checkpoints, and remote checkpoints on Node A, Node B, and Node C; B is the buddy of A]

Asynchronous Checkpointing


Checkpoint/Restart on SSD

[Figure, panel 1: timing penalty (s) versus checkpoint size per node (0.45, 1.34, 2.23 GB) for half-aio, full-aio, half-sio, and full-sio]
[Figure, panel 2: restart time (s) versus checkpoint size per node (0.45, 1.34, 2.23 GB) for in-memory, half-aio, and full-aio]