
A Job Pause Service under LAM/MPI+BLCR

for Transparent Fault Tolerance

Chao Wang, Frank Mueller

North Carolina State University

Christian Engelmann, Stephen L. Scott

Oak Ridge National Laboratory


Outline

Problem vs. Our Solution

Overview of LAM/MPI and BLCR

Our Design and Implementation

Experimental Framework

Performance Evaluation

Related Work

Conclusion


Problem Statement

Trends in HPC: high-end systems with thousands of processors
— Increased probability of node failure: MTTF becomes shorter

MPI widely accepted in scientific computing
— But no fault-recovery method in the MPI standard

Extensions to MPI for FT exist, but…
— Cannot dynamically add/delete nodes transparently at runtime
— Must reboot the LAM RTE
— Must restart the entire job
  - Inefficient if only one or a few nodes fail
  - Staging overhead
— Requeuing penalty

Today’s MPI Job C/R

[Diagram: lamboot starts the LAM RTE on n0–n2; mpirun runs MPI processes 0–2 with coordinated checkpoints; after a node failure, the LAM RTE is rebooted and the entire job is restarted from the last checkpoint.]

Our Solution - Job-pause Service

Integrate group communication
— Add/delete nodes
— Detect node failures automatically

Processes on live nodes remain active (roll back to the last checkpoint)

Only processes on failed nodes are dynamically replaced by spares, resumed from the last checkpoint

Hence:
— no restart of the entire job
— no staging overhead
— no job requeue penalty
— no LAM RTE reboot

[Diagram: Old approach: after a failure, LAM reboot on n0–n2 and restart of all processes 0–2 from the last coordinated checkpoint. New approach: MPI job on n0–n3 with coordinated checkpoints; after a failure, live processes pause and the failed process migrates to spare node n3, then execution continues.]


LAM/MPI Overview

Modular, component-based architecture
— 2 major layers: a daemon-based RTE (lamd) and the MPI layer
— C/R “plugs in” to the MPI SSI framework: coordinated C/R with BLCR support

[Diagram, 2-node MPI job: on each node an MPI application process runs alongside lamd; mpirun coordinates over an out-of-band communication channel between the lamds, while the applications communicate over TCP sockets. Software stack: User Application → MPI API → MPI SSI framework (CR, RPI) → LAM layer → Operating System.]

RTE: Run-Time Environment
SSI: System Services Interface
RPI: Request Progression Interface


BLCR Overview

Process-level C/R facility for a single MPI application process

Kernel-based: saves/restores most or all process resources

Implemented as a Linux kernel module
— allows upgrades & bug fixes w/o reboot

Provides hooks used for distributed C/R of LAM/MPI jobs (callback sketch below)
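The user-level side of these hooks is BLCR's libcr callback interface. The following is a minimal sketch of that documented C API (cr_init, cr_register_callback, cr_checkpoint); the quiesce/resume comments are placeholders, not the actual LAM/MPI C/R module code:

/* Minimal sketch of BLCR's libcr callback interface.
 * Build (with BLCR installed): gcc app.c -lcr */
#include <stdio.h>
#include <libcr.h>

/* Runs around each checkpoint request, on BLCR's callback thread. */
static int ckpt_callback(void *arg)
{
    (void)arg;
    /* placeholder: an MPI library would quiesce communication here */
    int rc = cr_checkpoint(CR_CHECKPOINT_READY); /* kernel writes the image */
    if (rc > 0) {
        /* resuming from a checkpoint file (the restart path) */
    } else if (rc == 0) {
        /* continuing after a successful checkpoint */
    }
    return 0;
}

int main(void)
{
    cr_client_id_t id = cr_init();
    if (id < 0) {
        fprintf(stderr, "BLCR kernel support not available\n");
        return 1;
    }
    /* threaded callback: BLCR runs it on a dedicated callback thread */
    cr_register_callback(ckpt_callback, NULL, CR_THREAD_CONTEXT);
    /* ... regular application / MPI code ... */
    return 0;
}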


Our Design & Implementation – LAM/MPI

Decentralized, scalable membership and failure detector (ICS’06)
— Radix tree for scalability
— Dynamically detects node failures
— NEW: integrated into lamd

NEW: Decentralized scheduler
— Integrated into lamd
— Periodic coordinated checkpointing
— Node failure triggers (see the sketch below):
  1. process migration (failed nodes)
  2. job-pause (operational nodes)

[Diagram, 2-node MPI job as before, with two new daemons, membershipd and schedulerd, added inside each lamd; mpirun communicates over the out-of-band channel and TCP sockets.]
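To make the failure trigger concrete, here is a hypothetical, self-contained sketch; every name in it (struct proc, classify, the action enum) is invented for illustration and is not lamd's internal API:

/* Hypothetical sketch: classify each MPI process after a node failure.
 * Processes on the failed node migrate to a spare; all others are
 * paused and rolled back to the last coordinated checkpoint. */
#include <stdio.h>

enum action { PAUSE_ROLLBACK, MIGRATE_TO_SPARE };
struct proc { int rank; int node; };

static enum action classify(const struct proc *p, int failed_node)
{
    return (p->node == failed_node) ? MIGRATE_TO_SPARE : PAUSE_ROLLBACK;
}

int main(void)
{
    struct proc procs[] = { {0, 0}, {1, 1}, {2, 2} };
    int failed_node = 1, spare_node = 3;

    for (int i = 0; i < 3; i++) {
        if (classify(&procs[i], failed_node) == MIGRATE_TO_SPARE) {
            procs[i].node = spare_node; /* restart on spare from checkpoint */
            printf("rank %d: migrate to n%d\n", procs[i].rank, procs[i].node);
        } else {
            printf("rank %d: pause, roll back to last checkpoint\n",
                   procs[i].rank);
        }
    }
    return 0;
}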


New Job Pause Mechanism – LAM/MPI & BLCR

Operational nodes: Pause
— BLCR: reuses the existing processes
— restores part of the process state from the checkpoint
— LAM: reuses the existing connections

Failed nodes: Migrate
— Restart on a spare node from the checkpoint file
— Connect with the paused tasks

[Diagram: paused MPI processes on live nodes keep their existing connections; the MPI process of the failed node is restored from the parallel file system onto a spare node (process migration), and new connections link it to the paused processes.]


New Job Pause Mechanism - BLCR

(Kernel-level activity shown as dashed lines/boxes in the original figure; the callback kernel thread coordinates the user command process and the application process.)

1. The application registers a threaded callback; BLCR spawns the callback thread
2. The callback thread blocks in the kernel
3. The pause utility calls ioctl(); the callback thread unblocks, the other threads receive a signal and run their handler functions
4. All threads complete their callbacks and enter the kernel
5. NEW: all threads restore part of their state: after a first barrier, the first thread restores the shared resources; after a second barrier, each thread restores its own registers/signals (illustrated below)
6. All threads run regular application code from the restored state

[Diagram: timelines of handler_thr, thread1 and thread2 through the pause request, handler execution, ioctl(), the two barriers and register/signal restoration, ending with the checkpoint marked complete.]
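The two-barrier restore in step 5 can be illustrated with an ordinary pthreads program. This is a user-space simulation of the protocol only, not BLCR kernel code; the printf calls stand in for the real restore work:

/* Simulation of the two-barrier restore protocol (step 5).
 * Build: gcc -pthread pause_sim.c */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 3
static pthread_barrier_t barrier;

static void *pause_restore(void *arg)
{
    long tid = (long)arg;

    /* barrier 1: stand-in for "all threads have entered the kernel" */
    int rc = pthread_barrier_wait(&barrier);

    /* exactly one thread restores the shared process state */
    if (rc == PTHREAD_BARRIER_SERIAL_THREAD)
        printf("thread %ld: restoring shared resources\n", tid);

    /* barrier 2: shared state must be complete before private state */
    pthread_barrier_wait(&barrier);

    /* every thread restores its own registers and signal state */
    printf("thread %ld: restoring registers/signals\n", tid);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, pause_restore, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}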

Process Migration – LAM/MPI

Change the addressing information of the migrated process
— in the process itself
— in all other processes

Use a node id (not an IP address) in the addressing information

No change to BLCR is needed for process migration

Addressing information updated at run time (sketch below):
1. The migrated process tells the coordinator (mpirun) its new location
2. The coordinator broadcasts the new location
3. All processes update their process lists

[Diagram: the process on failed node n1 migrates to n3 and reports “I am on n3” to mpirun, which broadcasts “He is on n3” to all processes on n0–n3.]
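A hypothetical sketch of step 3 follows; proc_addr and update_location are invented names (LAM's real per-process entries carry more fields). Because peers are addressed by node id rather than by IP, each process only patches one integer in its table:

/* Hypothetical sketch: apply a "rank r is now on node n" update
 * broadcast by mpirun to the local process list. */
#include <stdio.h>

struct proc_addr { int grank; int node; }; /* global rank -> node id */

static void update_location(struct proc_addr *list, int nprocs,
                            int rank, int new_node)
{
    for (int i = 0; i < nprocs; i++)
        if (list[i].grank == rank)
            list[i].node = new_node; /* swap node id; no re-addressing by IP */
}

int main(void)
{
    struct proc_addr list[] = { {0, 0}, {1, 1}, {2, 2} };
    update_location(list, 3, 1, 3); /* rank 1 migrated from n1 to n3 */
    for (int i = 0; i < 3; i++)
        printf("rank %d on n%d\n", list[i].grank, list[i].node);
    return 0;
}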


Experimental Framework

Experiments conducted on
— Opt cluster: 16 nodes, 2-core dual Opteron 265, 1 Gbps Ethernet
— Fedora Core 5 Linux x86_64
— LAM/MPI + BLCR with our extensions

Benchmarks
— NAS Parallel Benchmarks V3.2.1 (MPI version)
  – Run 5 times; average reported
  – Class C (large problem size) used
  – BT, CG, EP, FT, LU, MG and SP benchmarks
  – IS omitted: its runs are too short

Relative Overhead (Single Checkpoint)

[Chart: share of total execution time (0–100%) split into Execution Time vs. Checkpoint Overhead for BT, CG, EP, FT, LU, MG and SP on 4, 8/9 and 16 nodes.]

Checkpoint overhead < 10%

Except FT, MG (explained later)

Absolute Overhead (Single Checkpoint)

[Chart: absolute checkpoint overhead in seconds (0–30 s scale) for BT, CG, EP, FT, LU, MG and SP on 4, 8/9 and 16 nodes; one FT bar is off the scale at 223.668 s.]

Most checkpoints are short: ~10 s

Checkpoint times increase linearly with checkpoint file size (rough model below)
— EP: small, constant checkpoint file size; its overhead comes from increased communication
— Except FT (explained next)
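The linear trend matches a simple storage-bound model (our assumption, not a measurement from the talk): with checkpoint image size S and effective write bandwidth B,

\[ t_{\text{ckpt}} \approx \frac{S}{B}, \qquad \text{e.g.}\quad \frac{500\,\mathrm{MB}}{50\,\mathrm{MB/s}} = 10\,\mathrm{s}, \]

where 50 MB/s is an assumed effective bandwidth chosen to be consistent with the ~10 s checkpoints above; EP and FT deviate because communication and swapping, respectively, dominate their overhead.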


Analysis of Outliers

Size of checkpoint files [MB]:

Node #   BT       CG      EP    FT       LU      MG      SP
4        406.94   250.88  1.33  1841.02  185.51  619.46  355.27
8        186.68   127.17  1.33   920.82   99.59  310.36  170.47
16       111.12    63.5   1.33   460.73   52.61  157.31  100.39

[Charts, on 4, 8 and 16 nodes: total execution time in seconds with No Checkpoint vs. Single Checkpoint for BT, CG, EP, FT, LU, MG and SP.]

MG: large checkpoint files, but short overall execution time

FT: large checkpoint files cause thrashing/swapping (a BLCR problem)


Job Migration Overhead

69.6% lower overhead than job restart + LAM reboot

No LAM reboot

No requeue penalty

Transparent continuation of execution
— Less staging overhead

[Chart, on 16 nodes: overhead in seconds (0–10 s) of Job Pause and Migrate vs. LAM Reboot plus Job Restart for BT, CG, EP, FT, LU, MG and SP.]


Related Work

FT – reactive approaches

Transparent
— Checkpoint/restart
  – LAM/MPI w/ BLCR [S. Sankaran et al., LACSI ’03]
  – Process migration via scanning & updating checkpoint files [J. Cao, Y. Li and M. Guo, ICPADS ’05]: still requires a restart of the entire job
  – CoCheck [G. Stellner, IPPS ’96]
— Log-based (message logging + temporal ordering)
  – MPICH-V [G. Bosilca et al., Supercomputing 2002]

Non-transparent
— Explicit invocation of checkpoint routines
  – LA-MPI [R. T. Aulwes et al., IPDPS 2004]
  – FT-MPI [G. E. Fagg and J. J. Dongarra, 2000]


Conclusion

Job-Pause for fault tolerance in HPC

Design is generic: any MPI implementation / process-level C/R

Implemented over LAM/MPI with BLCR

Decentralized, scalable P2P membership protocol & scheduler

High-performance job-pause for operational nodes

Process migration for failed nodes

Completely transparent

Low overhead: 69.6% lower than job restart + LAM reboot
— No job requeue overhead
— Less staging cost
— No LAM reboot

Suitable for proactive fault tolerance with diskless migration


Questions?

Thank you!