A Job Pause Service under LAM/MPI+BLCR
for Transparent Fault Tolerance
Chao Wang, Frank Mueller
North Carolina State University
Christian Engelmann, Stephen L. Scott
Oak Ridge National Laboratory
Outline
Problem vs. Our Solution
Overview of LAM/MPI and BLCR
Our Design and Implementation
Experimental Framework
Performance Evaluation
Related Work
Conclusion
Problem Statement
- Trends in HPC: high-end systems with thousands of processors
  - Increased probability of node failure: MTTF becomes shorter
- MPI is widely accepted in scientific computing, and extensions to MPI for fault tolerance exist, but:
  - No fault-recovery method in the MPI standard
  - Cannot dynamically add/delete nodes transparently at runtime
  - Must reboot the LAM RTE
  - Must restart the entire job
    - Inefficient if only one or a few nodes fail
    - Staging overhead
    - Requeuing penalty
Today's MPI Job C/R

[Diagram: lamboot starts the RTE on nodes n0-n2 and mpirun launches the ranks, with coordinated checkpoints taken periodically. On a node failure, the LAM RTE must be rebooted and the entire job restarted as a new run on n0-n2.]
Our Solution - Job-pause Service
- Integrate group communication
  - Add/delete nodes
  - Detect node failures automatically
- Processes on live nodes remain active (roll back to the last checkpoint)
- Only processes on failed nodes are dynamically replaced by spares, resumed from the last checkpoint
- Hence:
  - No restart of the entire job
  - No staging overhead
  - No job requeue penalty
  - No LAM RTE reboot
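The recovery decision above can be sketched as a small pure function (a model of the policy with hypothetical names, not LAM/MPI code): given the rank-to-node map, the set of failed nodes, and the spare pool, live ranks are paused and rolled back in place while ranks on failed nodes are reassigned to spares.

```python
def plan_recovery(rank_to_node, failed_nodes, spares):
    """Split ranks into 'pause' (live, roll back in place) and
    'migrate' (reassigned from a failed node to a spare node)."""
    spares = list(spares)
    pause, migrate = [], {}
    for rank, node in sorted(rank_to_node.items()):
        if node in failed_nodes:
            if not spares:
                raise RuntimeError("no spare node available for rank %d" % rank)
            migrate[rank] = spares.pop(0)  # restart this rank from its checkpoint on a spare
        else:
            pause.append(rank)             # this rank stays put and rolls back to the checkpoint
    return pause, migrate
```

For the scenario in the slides (ranks 0-2 on n0-n2, n1 fails, spare n3), `plan_recovery({0: "n0", 1: "n1", 2: "n2"}, {"n1"}, ["n3"])` pauses ranks 0 and 2 and migrates rank 1 to n3.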
[Diagram: old approach vs. new approach. Old: on failure, the LAM RTE is rebooted and the entire job is restarted as a new run on n0-n2, resuming from the last coordinated checkpoint. New: on failure of n1, the ranks on live nodes n0 and n2 pause and roll back to the last coordinated checkpoint, while the rank on the failed node migrates to spare node n3.]
LAM-MPI Overview
- Modular, component-based architecture with two major layers:
  - Daemon-based RTE: lamd
  - MPI layer: C/R "plugs in" to the MPI SSI framework (coordinated C/R with BLCR support)

[Diagram: a 2-node MPI job. On each node, the MPI application sits above the MPI API, the MPI SSI framework (with its CR and RPI components), the LAM layer, and the operating system. mpirun and the lamd daemons on Node 0 and Node 1 communicate over an out-of-band channel; the MPI processes communicate over a TCP socket.]

RTE: Run-time Environment; SSI: System Services Interface; RPI: Request Progression Interface
BLCR Overview
- Process-level C/R facility: checkpoints a single MPI application process
- Kernel-based: saves/restores most or all process resources
- Implemented as a Linux kernel module, which allows upgrades and bug fixes without a reboot
- Provides hooks used for distributed C/R of LAM-MPI jobs
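The hook mechanism can be modeled as a small callback registry (a Python sketch; the real BLCR interface is a C library plus kernel module, and the names below are illustrative): higher layers such as an MPI library register handlers that run before a checkpoint is taken and after a restart.

```python
class CheckpointHooks:
    """Toy model of C/R hooks: registered handlers let higher layers
    quiesce their own state before the save and rebuild it after the
    restore -- the role BLCR's callbacks play for LAM-MPI."""
    def __init__(self):
        self._handlers = []

    def register(self, on_checkpoint, on_restart):
        self._handlers.append((on_checkpoint, on_restart))

    def checkpoint(self, state):
        for on_ckpt, _ in self._handlers:
            on_ckpt(state)          # e.g., drain in-flight MPI messages
        return dict(state)          # the "saved image"

    def restart(self, image):
        state = dict(image)
        for _, on_restart in self._handlers:
            on_restart(state)       # e.g., re-establish connections
        return state
```

This mirrors the layering in the slides: the process-level facility handles the save/restore itself, while the hooks give the distributed layer its chance to coordinate.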
Our Design & Implementation: LAM/MPI
- Decentralized, scalable membership and failure detector (ICS '06)
  - Radix-tree scalability
  - Dynamically detects node failures
  - NEW: integrated into lamd
- NEW: decentralized scheduler
  - Integrated into lamd
  - Periodic coordinated checkpointing
  - On node failure, triggers: 1. process migration (failed nodes); 2. job-pause (operational nodes)

[Diagram: the 2-node job from the LAM-MPI overview, with a new membershipd and schedulerd component added inside each lamd; mpirun and the lamd daemons share the out-of-band channel, and the MPI processes share the TCP socket.]
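The failure-detection idea can be illustrated with a minimal heartbeat check (hypothetical timeout values, and a flat node list for brevity; the actual ICS '06 protocol organizes members in a radix tree for scalability):

```python
def detect_failures(last_heartbeat, now, timeout):
    """Return the set of nodes whose last heartbeat is older than
    `timeout` seconds, i.e., the nodes presumed failed."""
    return {node for node, t in last_heartbeat.items() if now - t > timeout}
```

For example, with heartbeats last seen at t=9.0 (n0), t=3.0 (n1), and t=8.5 (n2), a check at t=10.0 with a 5-second timeout flags only n1; the scheduler would then trigger migration for n1's ranks and job-pause for the rest.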
New Job Pause Mechanism: LAM/MPI & BLCR
- Operational nodes: pause
  - BLCR: reuse the existing processes, restoring part of each process's state from the checkpoint
  - LAM: reuse existing connections
- Failed nodes: migrate
  - Restart on a new (spare) node from the checkpoint file on the parallel file system
  - Connect with the paused tasks

[Diagram: two live nodes hold paused MPI processes linked by an existing connection. The MPI process on the failed node is migrated, via its checkpoint file on the parallel file system, to a spare node, which establishes new connections to the paused processes.]
New Job Pause Mechanism: BLCR

Sequence of the pause operation (kernel-side activity shown as dashed lines/boxes in the original figure):
1. The application registers a threaded callback, which spawns a callback thread (handler_thr); the application threads keep running normally.
2. The callback thread blocks in the kernel (in ioctl()).
3. The pause utility calls ioctl(), which unblocks the callback thread; it runs the handler functions and signals the other threads.
4. Each thread receives the signal, runs its handlers, completes its callback, and enters the kernel via ioctl(), where all threads block.
5. NEW: all threads restore part of their state: the first thread restores shared resources, then, after a barrier, every thread restores its own registers and signals.
6. The checkpoint is marked complete, and the threads continue normal execution, running regular application code from the restored state.

A callback kernel thread coordinates the user command process and the application process.
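Step 5 above can be sketched with a thread barrier (a Python stand-in for the kernel-side synchronization; `shared_restore` and `per_thread_restore` are placeholders for restoring shared resources and per-thread registers/signals):

```python
import threading

def pause_restore(num_threads, shared_restore, per_thread_restore):
    """Model of step 5: threads rendezvous at a barrier; one elected
    thread restores shared resources, then, after a second barrier,
    every thread restores its own per-thread state."""
    barrier = threading.Barrier(num_threads)
    done, lock = [], threading.Lock()

    def worker(tid):
        idx = barrier.wait()      # all threads quiesce here
        if idx == 0:              # Barrier.wait() elects exactly one thread
            shared_restore()
        barrier.wait()            # shared state is ready before per-thread state
        per_thread_restore(tid)
        with lock:
            done.append(tid)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(done)
```

The second barrier is the essential ordering constraint: no thread may restore its registers/signals until the shared resources they depend on have been rebuilt.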
Process Migration: LAM/MPI
- Change the addressing information of the migrated process
  - In the process itself
  - In all other processes
- Use node id (not IP address) for addressing information
- No change to BLCR is needed for process migration
- Addressing information is updated at run time:
  1. The migrated process tells the coordinator (mpirun) its new location ("I am on n3")
  2. The coordinator broadcasts the new location ("He is on n3")
  3. All processes update their process lists
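The three-step update can be modeled as a tiny coordinator/peer exchange (illustrative classes, not the LAM source): the migrated process reports its new node id to the coordinator, which broadcasts it so every process patches its local process list.

```python
class Process:
    def __init__(self, rank, proc_list):
        self.rank = rank
        self.proc_list = dict(proc_list)  # rank -> node id

    def on_location_update(self, rank, node):
        self.proc_list[rank] = node       # step 3: patch the local process list

class Coordinator:
    """Plays the role of mpirun for location updates."""
    def __init__(self, processes):
        self.processes = processes

    def report_location(self, rank, node):  # step 1: "I am on n3"
        for p in self.processes:            # step 2: broadcast the new location
            p.on_location_update(rank, node)
```

After `Coordinator.report_location(1, "n3")`, every process's list maps rank 1 to n3 while all other entries are untouched, which is why node ids (rather than IP addresses) make the update a single table patch.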
Experimental Framework
- Experiments conducted on:
  - Opt cluster: 16 nodes, each with dual-core dual Opteron 265 processors, 1 Gbps Ethernet
  - Fedora Core 5 Linux x86_64
  - LAM/MPI + BLCR with our extensions
- Benchmarks: NAS V3.2.1 (MPI version)
  - Each run 5 times; results report the average
  - Class C (large problem size) used
  - BT, CG, EP, FT, LU, MG and SP benchmarks (IS runs are too short)
Relative Overhead (Single Checkpoint)

[Stacked bar chart, 0-100%: execution time vs. checkpoint overhead for BT, CG, EP, FT, LU, MG and SP on 4, 8/9 and 16 nodes.]

- Checkpoint overhead < 10%, except for FT and MG (explained later)
Absolute Overhead (Single Checkpoint)

[Bar chart, seconds: checkpoint overhead for BT, CG, EP, FT, LU, MG and SP on 4, 8/9 and 16 nodes; one outlier bar is labeled 223.668 seconds.]

- Checkpoints are short: ~10 seconds
- Checkpoint times increase linearly with checkpoint file size
- EP: small, constant checkpoint file size; increasing communication overhead
- Except FT (explained next)
Analysis of Outliers

Size of checkpoint files [MB]:

Nodes   BT      CG      EP    FT       LU      MG      SP
4       406.9   250.88  1.33  1841.02  185.51  619.46  355.27
8       186.68  127.17  1.33  920.82   99.59   310.36  170.47
16      111.12  63.51   1.33  460.73   52.61   157.31  100.39

[Bar charts, on 4, 8 and 16 nodes: execution time in seconds with no checkpoint vs. a single checkpoint for BT, CG, EP, FT, LU, MG and SP.]

- MG: large checkpoint files, but short overall execution time
- FT: large checkpoint files cause thrashing/swapping (a BLCR problem)
Job Migration Overhead

[Bar chart, on 16 nodes, 0-10 seconds: "Job Pause and Migrate" vs. "LAM Reboot" vs. "Job Restart" for BT, CG, EP, FT, LU, MG and SP.]

- 69.6% lower than job restart + LAM reboot
- No LAM reboot
- No requeue penalty
- Less staging overhead
- Transparent continuation of execution
Related Work
FT: reactive approaches
- Transparent
  - Checkpoint/restart
    - LAM/MPI with BLCR [S. Sankaran et al., LACSI '03]
    - Process migration by scanning and updating checkpoint files [J. Cao, Y. Li and M. Guo, ICPADS 2005]: still requires restart of the entire job
    - CoCheck [G. Stellner, IPPS '96]
  - Log-based (message logging + temporal ordering)
    - MPICH-V [G. Bosilca et al., Supercomputing 2002]
- Non-transparent
  - Explicit invocation of checkpoint routines
    - LA-MPI [R. T. Aulwes et al., IPDPS 2004]
    - FT-MPI [G. E. Fagg and J. J. Dongarra, 2000]
Conclusion
- Job-pause service for fault tolerance in HPC
  - Design is generic for any MPI implementation / process-level C/R
  - Implemented over LAM-MPI with BLCR
- Decentralized, scalable P2P membership protocol and scheduler
- High-performance job-pause for operational nodes; process migration for failed nodes
- Completely transparent
- Low overhead: 69.6% lower than job restart + LAM reboot
  - No job requeue overhead
  - Less staging cost
  - No LAM reboot
- Suitable for proactive fault tolerance with diskless migration