Mitigation of Failures in High Performance Computing
via Runtime Techniques
Xiang Ni
Advisor: Laxmikant Kale
Ph.D Preliminary Examination © Xiang Ni Mitigating Failures in HPC
Table of Contents
• Motivations
• Protection for Hard Errors
• Detection and Correction of Silent Data Corruptions
• Memory Limitations
Fault Tolerance is Everywhere
• We use multiple locks to protect our house.
• We set as many alarms as possible.
• In games or movies, we can restart to make progress.
• But not with traditional HPC applications: they need fault tolerance.
Failures are Rising in HPC
• Titan: 2nd largest supercomputer in the world, at Oak Ridge National Lab
• More than 17 PFlops with 560,640 cores
• 1.61 failures/day (2014): memory errors, machine check exceptions, voltage faults
• Exascale machines, 100 times more powerful, will face even more obstacles.
• Commercial vendors are unlikely to address resilience issues for the HPC market.
Failures may be Silent
• Common sources of soft errors:
  • Particle-induced single event upsets
  • Manufacturing faults
• Data corruption: you may or may not know it happened
• Shrinking chip size:
  • More energy efficient
  • Higher soft error rate
Memory is Limited
[Figure: Bytes-to-Flops ratio (0 to 0.25) vs. TFlops (10 to 100,000) for Tianhe, Titan, Mira, Stampede, Edison, Hopper, Kraken, Intrepid, BlueGene/L, Jaguar (XT4), and Red Storm.]
Memory is Limited

          Read Latency   Write Latency
DDR3      0.01 μs        0.01 μs
PCM       0.05 μs        1 μs
NAND      10 μs          100 μs
DISK      1000 μs        1000 μs
It is promising to use new types of memory in HPC.
Ph.D Preliminary Examination © Xiang Ni Mitigating Failures in HPC
Thesis Focus
❖ Develop runtime techniques to protect HPC applications from failures
Hard errors: reducing checkpoint and restart overhead
Soft errors: detecting and correcting silent data corruptions
Lack of memory: how to utilize NVRAM for checkpointing and application execution?
Part 1 Protection for Hard Errors
Double In-memory Checkpointing
❖ Each node's checkpoint is kept in memory, both on the node itself and on a buddy node.
❖ Used by Charm++, SCR, and FTI.
The application resumes computation after all the nodes have successfully saved their checkpoints on their buddy nodes.
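The buddy scheme above can be sketched in a few lines; this is a minimal illustration assuming a simple next-rank buddy assignment (`Node`, `checkpoint`, and `recover` are hypothetical names, not the Charm++/SCR/FTI APIs):

```python
import copy

class Node:
    """Sketch of double in-memory checkpointing: each node keeps its own
    checkpoint in memory plus a copy of its buddy's checkpoint."""
    def __init__(self, rank):
        self.rank = rank
        self.state = {"iteration": 0, "data": [0.0] * 4}
        self.local_ckpt = None   # this node's own checkpoint
        self.buddy_ckpt = None   # copy of the buddy's checkpoint

def checkpoint(nodes):
    # Phase 1: every node snapshots its own state.
    for n in nodes:
        n.local_ckpt = copy.deepcopy(n.state)
    # Phase 2: every node sends its checkpoint to its buddy (next rank).
    for n in nodes:
        buddy = nodes[(n.rank + 1) % len(nodes)]
        buddy.buddy_ckpt = copy.deepcopy(n.local_ckpt)
    # Only after all buddies hold a copy does computation resume.

def recover(nodes, failed_rank):
    # Restore the failed node's state from the buddy that holds its copy.
    buddy = nodes[(failed_rank + 1) % len(nodes)]
    nodes[failed_rank].state = copy.deepcopy(buddy.buddy_ckpt)

nodes = [Node(r) for r in range(4)]
nodes[0].state["iteration"] = 7
checkpoint(nodes)
nodes[0].state = {}   # node 0 "fails"
recover(nodes, 0)     # its buddy restores the checkpoint
```

With two in-memory copies, a single-node failure can always be repaired from the surviving buddy, at the cost of doubled checkpoint memory.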
Limitation of Checkpoint/Restart
[Figure: relative increase (up to 60x) in memory size and network bandwidth, 2008 to 2020.]
✤ Increase in memory size per year: 41%
✤ Increase in network bandwidth per year: 26%
Semi-Blocking Checkpoint
[Diagram: NODE 1 and NODE 2 timelines showing the barrier, "local checkpoint done", and "remote checkpoint done" points.]
✤ Resume computation as soon as each node stores its own checkpoint (local checkpoint).
✤ Interleave the transmission of the checkpoint to buddy with application execution (remote checkpoint).
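The two ideas above can be sketched with a background transfer thread; a minimal illustration, where `send_to_buddy` stands in for the real network transfer:

```python
import copy
import threading

def semi_blocking_checkpoint(state, send_to_buddy):
    """Sketch of a semi-blocking checkpoint: block only for the local
    snapshot, then stream it to the buddy node on a background thread
    while the application keeps computing."""
    local_ckpt = copy.deepcopy(state)   # blocking local checkpoint
    done = threading.Event()

    def remote_transfer():
        send_to_buddy(local_ckpt)       # overlapped with execution
        done.set()

    threading.Thread(target=remote_transfer, daemon=True).start()
    return local_ckpt, done             # computation resumes here

received = []
ckpt, done = semi_blocking_checkpoint({"step": 3}, received.append)
# ... application work proceeds here while the transfer completes ...
done.wait(timeout=5)
```

The application is only blocked for the in-memory copy; the expensive network transfer to the buddy overlaps with useful computation.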
Minimize Checkpoint Interference
[Animation: four nodes (1, 2, 3, 4) interleaving remote checkpoint transfer with application execution.]
Optimistic Scheduling
[Figure: interference (s) and benefit (%) vs. overlap (s) for the optimistic scheduling strategy.]
Single Checkpoint Overhead
[Figures: checkpoint overhead (s) vs. number of cores (128 to 1024), comparing blocking and semi-blocking checkpoint.]
Automatic Checkpoint
❖ Optimal Checkpoint Interval
✓ Traditionally, applications checkpoint at a fixed interval.
✓ If the MTBF follows a Poisson process, Daly's model gives the optimal interval.
✓ What if the MTBF follows a Weibull process, with an increasing or decreasing failure rate?
✓ How can applications automatically take a preventative checkpoint based on failure prediction?
It is too expensive for HPC applications to synchronize very often for a consistent checkpoint.
Otherwise, applications may hang due to an inconsistent checkpoint.
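For reference, the commonly quoted first-order form of Daly's model can be computed directly; a sketch, assuming checkpoint cost δ and MTBF M with δ < M/2:

```python
import math

def daly_interval(delta, mtbf):
    """First-order approximation of Daly's optimal checkpoint interval:
    T_opt = sqrt(2 * delta * M) - delta, valid for delta < M / 2.
    delta: time to take one checkpoint (s); mtbf: mean time between failures (s)."""
    assert delta < mtbf / 2, "approximation only holds for delta < MTBF/2"
    return math.sqrt(2 * delta * mtbf) - delta

# e.g. a 60 s checkpoint on a machine with a 24 h MTBF
# gives an interval of roughly 53 minutes between checkpoints.
interval = daly_interval(60.0, 24 * 3600.0)
```

Checkpointing more often than this wastes time on checkpoint overhead; less often, on lost work after failures. Note this closed form assumes Poisson (exponential) failures, which is exactly why the Weibull case above needs a different treatment.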
Part 2 Detection and Correction of
Silent Data Corruptions
Motivation
[Figure: utilization (0 to 1) vs. number of sockets (4K to 1M) and soft error rate per socket (1 to 10,000 FIT), with vulnerability shown on a 0 to 1 scale.]
Replication Enhanced Checkpointing
[Timeline, from job start through T1, T2, T3: replicas 1 and 2 transfer checkpoints to each other for soft error detection; a hard error is detected by replica 2, which sends its checkpoint to replica 1 for recovery; when a soft error is detected, both replicas roll back. Phases shown: application execution, checkpoint, recovery.]
Different Ways to Restart from Hard Errors
• With periodic checkpointing, replica 2 crashes and recovers using the previous checkpoint.
• MEDIUM: replica 1 detects the crash of replica 2 and checkpoints immediately; replica 2 recovers using the most recent checkpoint from replica 1.
• Alternatively, replica 2 waits for replica 1 to make the next periodic checkpoint and recovers from it.
Utilization vs. Vulnerability
✤ [Figure: probability of undetected SDC vs. number of sockets per replica (1K to 256K) for Weak and Medium protocols at δ = 15 s and δ = 180 s.]
✤ [Figure: utilization vs. number of sockets per replica for Weak, Medium, and Strong protocols at δ = 15 s and δ = 180 s.]
Base performance
[Figure: checkpoint time (s) vs. number of cores per replica (1k to 16k).]
Optimization: Topology Aware Mapping
[Diagram: Replica 1 and Replica 2 nodes under (a) default-mapping, (b) column-mapping, and (c) mixed-mapping, labeled with the number of inter-replica messages (0 to 4) per node.]
Optimization: Checksum
• Transfer a one-integer checksum instead of the whole checkpoint
• Caveat: floating-point round-off error
[Figure: time (s) for local checkpoint, comparison, and checkpoint transfer at 1k and 64k cores per replica, under default, mixed, column, and checksum schemes.]
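A minimal sketch of the idea: fold the checkpoint into one integer and compare only that across replicas. The particular checksum function here (a 64-bit sum of the IEEE-754 words) is an illustrative assumption, not necessarily the one from the talk:

```python
import struct

def checkpoint_checksum(values):
    """Fold the checkpoint's raw bytes into a single 64-bit integer so
    that replicas exchange one number instead of whole checkpoints."""
    total = 0
    for v in values:
        # Reinterpret each double as a 64-bit word and accumulate mod 2^64.
        (word,) = struct.unpack("<Q", struct.pack("<d", v))
        total = (total + word) & 0xFFFFFFFFFFFFFFFF
    return total

# Replicas compare checksums: a mismatch flags a possible silent corruption.
a = checkpoint_checksum([1.0, 2.0, 3.0])
b = checkpoint_checksum([1.0, 2.0, 3.0])
c = checkpoint_checksum([1.0, 2.0000001, 3.0])
```

Because the comparison is bit-exact, legitimate floating-point round-off differences between replicas would also be flagged as mismatches, which is the round-off caveat noted on this slide.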
Experimental Results: Checkpoint
[Figures: time (s) for local checkpoint, comparison, and checkpoint transfer at 1k and 64k cores per replica, under default, mixed, column, and checksum schemes, for the two benchmarks.]
Benchmark   Description                                        Configuration per core   Memory Pressure   Runtime
Jacobi3D    7-point stencil                                    64*64*128                High              AMPI
LeanMD      Short-range non-bonded force calculation in NAMD   4000 atoms               Low               Charm++
Experimental Results: Restart
[Figures: restart time (s), split into checkpoint transfer and reconstruction, at 1k and 64k cores per replica, for strong and for medium (column, mixed, default) protocols.]
Checkpoint Overhead
[Figures: checkpoint overhead per replica (%) vs. number of sockets per replica (1k to 16k), for strong, medium, and weak protocols under default, default+checksum, column, and column+checksum schemes.]
Checkpointing to SSD
• Hard to fit application data and checkpoints in memory at the same time
• Full SSD strategy: store both local and remote checkpoints in SSD
• Half SSD strategy: store only remote checkpoint in SSD
[Diagram: Nodes A, B, and C, each holding objects, a local checkpoint, and a remote checkpoint; B is the buddy of A.]
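The two strategies can be sketched as a small placement policy; the class name and file layout here are hypothetical illustrations, not the actual implementation:

```python
import os
import tempfile

class SSDCheckpointStore:
    """Sketch of the two SSD placement strategies above.
    'full': both the local and the remote (buddy's) checkpoint go to SSD.
    'half': the local copy stays in memory; only the remote copy goes to SSD."""
    def __init__(self, strategy, ssd_dir):
        assert strategy in ("full", "half")
        self.strategy, self.ssd_dir = strategy, ssd_dir
        self.local_in_memory = {}

    def save_local(self, name, data):
        if self.strategy == "full":
            # Full SSD: the node's own checkpoint also goes to SSD.
            with open(os.path.join(self.ssd_dir, name + ".local"), "wb") as f:
                f.write(data)
        else:
            # Half SSD: the node's own checkpoint stays in memory.
            self.local_in_memory[name] = data

    def save_remote(self, name, data):
        # Both strategies put the buddy's checkpoint on SSD to free memory.
        with open(os.path.join(self.ssd_dir, name + ".remote"), "wb") as f:
            f.write(data)

ssd = tempfile.mkdtemp()
half = SSDCheckpointStore("half", ssd)
half.save_local("iter10", b"state")    # stays in memory
half.save_remote("iter10", b"buddy")   # goes to SSD
```

The half strategy keeps restarts fast (the local copy is already in memory) while still halving the memory pressure from checkpoints; the full strategy frees the most memory at the cost of SSD latency on every checkpoint and restart.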
Asynchronous Checkpointing
Checkpoint/Restart on SSD
[Figures: timing penalty (s) and restart time (s) vs. checkpoint size per node (0.45 to 2.23 GB) for half-aio, full-aio, half-sio, full-sio, and in-memory schemes.]