Fault Tolerant MPI

Fault Tolerant MPI. Anthony Skjellum*$, Yoginder Dandass$, Pirabhu Raman*. MPI Software Technology, Inc*; Mississippi State University$. FALSE2002 Workshop, November 14, 2002.


Transcript of Fault Tolerant MPI

Page 1: Fault Tolerant MPI

Fault Tolerant MPI

Anthony Skjellum*$, Yoginder Dandass$, Pirabhu Raman*

MPI Software Technology, Inc*

Mississippi State University$

FALSE2002 Workshop, November 14, 2002

Page 2: Fault Tolerant MPI

Outline

Motivations
Strategy
Audience
Technical Approaches
Summary and Conclusions

Page 3: Fault Tolerant MPI

Motivations for MPI/FT

Well-written and well-tested legacy MPI applications will abort, hang, or die more often in harsh or long-running environments because of extraneously introduced errors.

Parallel computations are fragile at present
There is apparent demand for recovery of running parallel applications
Learn how "fault tolerant" we can make MPI programs and implementations without abandoning this programming model

Page 4: Fault Tolerant MPI

Strategy

Build on MPI/Pro, a commercial MPI
Support extant MPI applications
Define application requirements/subdomains
Do a very good job for the Master/Slave model first
Offer higher-availability services
Harden transports
Work with OEMs to offer more recoverable services
Use specialized parallel computational models to enhance effective coverage
Exploit third-party checkpoint/restart, nMR, etc.
Exploit Gossip for detection

Page 5: Fault Tolerant MPI

Audience

High-scale, higher-reliability users
Low-scale, extremely high-reliability users (nMR involved for some nodes)
Users of clusters for production runs
Users of embedded multicomputers
Space-based and highly embedded settings
Grid-based MPI applications

Page 6: Fault Tolerant MPI

Detection and Recovery from Extraneously Induced Errors

[Diagram: layered view of detection and recovery. The application sits on MPI, which sits on the network, drivers, and NIC, with the OS, run-time, and monitors alongside. Detection mechanisms per layer: ABFT/aBFT at the application, MPI sanity checks, network sanity checks, and watchdog/BIT/other at the system level. Recovery: a recovery process, guided by application execution model specifics, through which the application recovers.]

Page 7: Fault Tolerant MPI

Coarse-Grain Detection and Recovery (adequate if no SEU)

[Flowchart: mpirun -np NP my.app.
  No error: my.app finishes, mpirun finishes (success).
  MPI-lib error: my.app aborts, mpirun finishes (failure).
  Process dies: my.app hangs, mpirun hangs (failure).
Detection/recovery loop: aborted run? if yes, send it to ground; hung job? if yes, abort my.app, otherwise continue waiting.]

Page 8: Fault Tolerant MPI

Example: NIC Errors

[Diagram: rank r0's send buffer passes through MPI at user level to the NIC at device level, across the network to rank r1's NIC, and back up through MPI to the receive buffer.]

The NIC has the 2nd-highest SEU strike rate, after the main CPU.
Legacy MPI applications will be run in simplex mode.

Page 9: Fault Tolerant MPI

"Obligations" of a Fault-Tolerant MPI

Ensure reliability of data transfer at the MPI level
Build reliable header fields
Detect process failures
Transient error detection and handling
Checkpointing support
Two-way negotiation with scheduler and checkpointing components
Sanity checking of MPI applications and underlying resources (non-local)
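One way to meet the reliable-transfer and reliable-header obligations is to carry a length and checksum in every message header and verify them on receipt. A minimal sketch using CRC-32 (the framing format here is illustrative, not MPI/FT's wire format):

```python
import struct
import zlib

def frame(payload: bytes) -> bytes:
    """Prepend a header of (length, CRC-32) to the payload."""
    return struct.pack("!II", len(payload), zlib.crc32(payload)) + payload

def unframe(message: bytes) -> bytes:
    """Verify length and CRC; raise if the message was corrupted in transit."""
    length, crc = struct.unpack("!II", message[:8])
    payload = message[8:]
    if len(payload) != length or zlib.crc32(payload) != crc:
        raise ValueError("corrupted MPI message detected")
    return payload
```

The CRC-vs-nMR measurements later in the deck (slides 31-33) compare exactly this kind of per-message checksum against time-based replication.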

Page 10: Fault Tolerant MPI

Low-Level Detection Strategy for Errors and Dead Processes

[Flowchart: initiate device-level communication.
  On low-level success: return MPI_SUCCESS.
  On a timeout: ask the SCT "is the peer alive?"; if not, trigger an event, otherwise reset the timeout and retry.
  On any other error: ask the EH "is this recoverable?"; if not, trigger an event, otherwise reset the timeout and retry.]

EH: Error Handler. SCT: Self-Checking Thread.
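The flowchart reduces to a small retry loop. In this sketch (names and return values are illustrative stand-ins, not MPI/FT's internals), transport() performs the device-level communication, peer_alive() stands in for the SCT query, and recoverable() for the EH query:

```python
def send_with_detection(transport, peer_alive, recoverable, max_attempts=3):
    """Retry on recoverable errors and timeouts; trigger an event when
    the peer is dead or the error is unrecoverable."""
    for _ in range(max_attempts):
        status = transport()          # device-level communication
        if status == "ok":
            return "MPI_SUCCESS"
        if status == "timeout":
            if not peer_alive():      # ask SCT: is peer alive?
                return "EVENT_TRIGGERED"
        elif not recoverable(status): # ask EH: recoverable?
            return "EVENT_TRIGGERED"
        # otherwise reset the timeout and retry
    return "EVENT_TRIGGERED"
```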

Page 11: Fault Tolerant MPI

Overview of MPI/FT Models

Model Name | Application Style | MPI Support    | nMR               | Cp/Recov App | Cp/Recov Sys
MFT-I      | SPMD              | With MPI-1.2   | No ranks nMR      | Yes          | Yes
MFT-II     | SPMD              | With MPI-1.2   | Several ranks nMR | Yes          | Yes
--         | SPMD              | No MPI         | No ranks nMR      | Yes          |
--         | SPMD              | No MPI         | Several ranks nMR | Yes          |
MFT-IIIs   | Master/Slave      | With MPI-1.2   | Rank 0 nMR        | Yes          | Yes
MFT-IIIm   | Master/Slave      | With MPI-1.2   | Several ranks nMR | Yes          | Yes
MFT-IVs    | Master/Slave      | With MPI-2 DPM | Rank 0 nMR        | Yes          | Yes
MFT-IVm    | Master/Slave      | With MPI-2 DPM | Several ranks nMR | Yes          | Yes
--         | Master/Slave      | No MPI         | No ranks nMR      | Yes          |
--         | Master/Slave      | No MPI         | Several ranks nMR | Yes          |

Page 12: Fault Tolerant MPI

Design Choices

Message replication (nMR to nMR): replicated ranks send and receive messages independently of each other.

Message replication (nMR to simplex): one copy of the replicated rank acts as the message conduit.

[Diagrams: a triply replicated Rank 0 (nMR rank) communicating with a non-nMR Rank 1, and a triply replicated Rank 0 communicating with a triply replicated Rank 1, under each of the two replication schemes.]

Page 13: Fault Tolerant MPI

Parallel nMR Advantages

[Diagram: three replicas A, B, and C of a 4-rank job (n = 3; np = 4), each holding ranks 0-3.]

Voting on messages only (not on each word of state)
Local errors remain local
Requires two failures to fail (e.g., A0 and C0)
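Per-message voting across the n replicas can be sketched as a simple majority function (illustrative; not MPI/FT's voter). With n = 3, it tolerates one bad copy, and it fails exactly when two replicas of the same rank disagree with the third, matching the "requires two failures to fail" property:

```python
from collections import Counter

def vote(replica_messages):
    """Majority vote over the n copies of one message produced by
    replicated ranks; voting is per message, not per word of state."""
    payload, count = Counter(replica_messages).most_common(1)[0]
    if count <= len(replica_messages) // 2:
        raise RuntimeError("no majority: too many replica failures")
    return payload
```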

Page 14: Fault Tolerant MPI

MFT-IIIs: Master/Slave with MPI-1.2

Master (rank 0) is nMR
MPI_COMM_WORLD is maintained in nMR
MPI-1.2 only
  Application cannot actively manage processes
  Only middleware can restart processes
Pros:
  Supports send/receive MPI-1.2
  Minimizes excess messaging
  Largely MPI-application transparent
  Quick recovery possible
  ABFT-based process recovery assumed
Cons:
  Scales to O(10) ranks only
  Voting still limited
  Application explicitly fault-aware

[Diagram: replicated Rank 0 (master) and slave ranks 1 through n, showing the logical flow of MPI messages versus the actual flow, e.g., a message from 1 to 0 and a message from 0 to 2.]

Page 15: Fault Tolerant MPI

MFT-IVs: Master/Slave with MPI-2

Master (rank 0) is nMR
MPI_COMM_WORLD is maintained in nMR
MPI_COMM_SPAWN()
  Application can actively restart processes
Pros:
  Supports send/receive MPI-1.2 + DPM
  Minimizes excess messaging
  Largely MPI-application transparent
  Quick recovery possible, simpler than MFT-IIIs
  ABFT-based process recovery assumed
Cons:
  Scales to O(10) ranks only
  Voting still limited
  Application explicitly fault-aware

[Diagram: as for MFT-IIIs: replicated Rank 0 and slave ranks 1 through n, with logical versus actual message flow.]

Page 16: Fault Tolerant MPI

Checkpointing the Master for Recovery

Master checkpoints
Voting on master liveness
Master failure detected:
  The lowest-rank slave restarts the master from checkpointed data
  Any of the slaves could promote and assume the role of master
  Peer-liveness knowledge is required to decide the lowest rank
Pros:
  Recovery independent of the number of faults
  No additional resources
Cons:
  Checkpointing further reduces scalability
  Recovery time depends on the checkpointing frequency

[Diagram: rank 0 (master) exchanges MPI messages with slave ranks 1 through n and writes checkpointing data to a storage medium.]

Page 17: Fault Tolerant MPI

Checkpointing Slaves for Recovery (Speculative)

Slaves checkpoint periodically at a low frequency
Prob. of failure of a slave > prob. of failure of the master
Master failure detected:
  Recovered from data checkpointed at the various slaves
  Peer-liveness knowledge is required to decide the lowest rank
Pros:
  Checkpointing overhead of the master eliminated
  Aids in faster recovery of slaves
Cons:
  Increase in master recovery time
  Increase in overhead due to checkpointing at all the slaves

Slaves are stateless, so checkpointing slaves doesn't help in any way in restarting the slaves
Checkpointing at all the slaves could be really expensive
Instead of checkpointing, slaves could return the results to the master

[Diagram: slave ranks 1 through n each write checkpointing data to a storage medium (SM); MPI messages flow between rank 0 and the slaves.]

Page 18: Fault Tolerant MPI

Adaptive Checkpointing and nMR of the Master for Recovery

Start with 'n' replicates
Initial checkpointing calls generate no-ops
Slaves track the liveness of the master and the replicates
Failure of the last replicate initiates checkpointing
Pros:
  Tolerates 'n' faults with negligible recovery time
  Subsequent faults can still be recovered
Cons:
  Increase in overhead of tracking the replicates

[Diagram: replicated Rank 0 with slave ranks 1 through n, showing logical versus actual message flow and checkpointing data written to a storage medium.]
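The adaptive policy can be sketched as a small state machine (the class and method names are illustrative, not MPI/FT's API): while replicates of the master survive, checkpoint calls are no-ops; the failure of the last replicate switches checkpointing on, so subsequent faults can still be recovered.

```python
class AdaptiveMasterPolicy:
    """While replicates remain, checkpointing calls generate no-ops;
    failure of the last replicate initiates real checkpointing."""
    def __init__(self, n_replicates):
        self.live_replicates = n_replicates
        self.checkpointing_enabled = False

    def replicate_failed(self):
        self.live_replicates -= 1
        if self.live_replicates == 0:
            self.checkpointing_enabled = True  # last replicate gone

    def checkpoint(self, save_state):
        if self.checkpointing_enabled:
            save_state()
            return True
        return False  # no-op: a replicate still covers the master
```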

Page 19: Fault Tolerant MPI

Self-Checking Threads
(scaling beyond O(10) nodes can be considered)

Invoked by the MPI library:
  Checks whether peers are alive
  Checks for network sanity
  Server to coordinator queries
  Exploits timeouts
  Periodic execution, no polling
  Provides a heartbeat across the application
  Can check internal MPI state

Queries by the coordinator:
  Vote on communicator state
  Check buffers
  Check queues for aging
  Check local program state
  Invoked periodically
  Invoked when suspicion arises

Page 20: Fault Tolerant MPI

MPI/FT SCT Support Levels

"I": Simple, non-portable; uses internal state of MPI and/or the system
"II": Simple, portable; exploits threads and PMPI_ or the PERUSE API
"III": Complex state checks, portable; exploits queue interfaces
All of the above

Page 21: Fault Tolerant MPI

MPI/FT Coordinator

Spawned by mpirun or similarly
Closely coupled with ("friend" of) the application's MPI library
User transparent
Periodically collects status information from the application
Can kill and restart the application or individual ranks
Preferably implemented using MPI-2
We'd like to replicate and distribute this functionality

Page 22: Fault Tolerant MPI

Use of Gossip in MPI/FT

Applications in Models III and IV assume a star topology
Gossip requires a virtual all-to-all topology
The data network may be distinct from the control network
Gossip provides:
  A potentially scalable and fully distributed scheme for failure detection and notification with reduced overhead
  Notification of failures in the form of a broadcast

Page 23: Fault Tolerant MPI

Gossip-based Failure Detection

[Diagram: each node maintains a table of (node, heartbeat) entries plus a suspect vector S. When a gossip message arrives, entries are compared heartbeat by heartbeat and the larger (fresher) value is kept: heartbeat 0 < 2 means no update; heartbeat 3 > 0 and heartbeat 4 > 2 mean update. A node whose entry has not been updated for Tcleanup = 5 * Tgossip cycles is suspected dead.]
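The heartbeat-merge and suspicion rules from the figure can be sketched as follows (a minimal sketch: the Tcleanup = 5 * Tgossip relation is from the slide; the dict representation and function names are illustrative):

```python
T_GOSSIP = 1               # cycles per gossip round (illustrative unit)
T_CLEANUP = 5 * T_GOSSIP   # cycles without news before suspicion

def merge_heartbeats(local, remote):
    """On receiving a gossip message, keep the larger heartbeat per
    node: a larger value is fresher (3 > 0: update; 0 < 2: no update)."""
    return {node: max(local.get(node, 0), remote.get(node, 0))
            for node in set(local) | set(remote)}

def suspect_vector(heartbeats, clock):
    """Suspect any node whose heartbeat lags the local clock by more
    than T_CLEANUP cycles."""
    return {node for node, hb in heartbeats.items()
            if clock - hb > T_CLEANUP}
```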

Page 24: Fault Tolerant MPI

Consensus About Failure

[Diagram: suspect matrices at nodes 1 and 2, one row per node's suspect vector over nodes 1-3. After the matrices are merged at node 1, both live nodes suspect node 3; the live list L becomes (1, 1, 0) and node 3 is declared dead.]
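The consensus step in the figure can be sketched as follows (illustrative; node indices are 0-based here, so the slide's node 3 is index 2): after merging suspect matrices, a node is declared dead once every other live node suspects it.

```python
def nodes_with_consensus(suspect_matrix, live):
    """suspect_matrix[i][j] == 1 means node i suspects node j.
    Declare node j dead (drop it from the live list) when all other
    live nodes suspect it in the merged matrix."""
    n = len(suspect_matrix)
    return [j for j in range(n)
            if all(suspect_matrix[i][j] == 1
                   for i in range(n) if live[i] and i != j)]
```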

Page 25: Fault Tolerant MPI

Issues with Gossip - I

After node a fails:
  If node b, the node that arrives at consensus on node a's failure last (the notification broadcaster), also fails before the broadcast:
    Gossiping continues until another node, c, suspects that node b has failed
    Node c broadcasts the failure notification of node a
    Eventually node b is also determined to have failed

Page 26: Fault Tolerant MPI

Issues with Gossip - II

If the control and data networks are separate:
  MPI progress threads monitor the status of the data network
  Failure of the link to the master is indicated when communication operations time out
  Gossip monitors the status of the control network
The progress threads communicate the suspected status of the master node to the gossip thread
Gossip incorporates this information into its own failure-detection mechanism

Page 27: Fault Tolerant MPI

Issues with Recovery

If a network failure causes the partitioning of processes:
  Two or more isolated groups may form that communicate within themselves
  Each group assumes that the other processes have failed and attempts recovery
  Only the group that can reach the checkpoint data is allowed to initiate recovery and proceed
    Recovering when multiple groups can access the checkpoint data is under investigation
  If only nMR is used, the group with the master is allowed to proceed
    Recovering when the replicated master processes are split between groups is under investigation

Page 28: Fault Tolerant MPI

Shifted APIs

Try to "morally conserve" the MPI standard
A timeout parameter added to messaging calls to control the behavior of individual MPI calls:
  Modify existing MPI calls
  Add new calls with the added functionality to support the idea
A callback function added to MPI calls (for error handling):
  Modify existing MPI calls
  Add new calls with the added functionality
Support in-band or out-of-band error management made explicit to the application
Runs in concert with MPI_ERRORS_RETURN
Offers the opportunity to give hints as well, where meaningful
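A "shifted" call might look like the following (hypothetical signature; `shifted_send`, its arguments, and the error codes are illustrative, not the actual MPI/FT API): the standard send arguments plus a timeout and an optional error callback, with errors returned rather than aborting the job, in the spirit of MPI_ERRORS_RETURN.

```python
MPI_SUCCESS = 0
MPI_ERR_TIMEOUT = 1  # hypothetical error class for the added timeout

def shifted_send(send, payload, dest, timeout_s, on_error=None):
    """Hypothetical 'shifted' MPI_Send: standard arguments plus a
    timeout and an optional error-handling callback."""
    try:
        send(payload, dest, timeout_s)  # underlying timed transport
        return MPI_SUCCESS
    except TimeoutError as err:
        if on_error is not None:
            on_error(err)               # explicit error-management hook
        return MPI_ERR_TIMEOUT
```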

Page 29: Fault Tolerant MPI

Application-Based Checkpoint

A point of synchronization for a cohort of processes
Minimal fault tolerance could be applied only at such checkpoints
Defines the "save state" or "restart data" needed to resume
Common practice in parallel CFD and other MPI codes, because of the reality of failures
Essentially gets no special help from the system
Look to parallel I/O (MPI-2) for improvement
  Why? Minimum complexity of I/O + feasible

Page 30: Fault Tolerant MPI

In Situ Checkpoint Options

Checkpoint to bulk memory
Checkpoint to flash
Checkpoint to other distributed RAM
Other choices?
Whether these are useful depends on the error model

Page 31: Fault Tolerant MPI

Early Results with Hardening Transport: CRC vs. Time-Based nMR

[Chart: total time vs. message size (32 bytes to 512 KB) for no CRC, CRC, and 3MR, using MPI/Pro (version 1.6.1-1tv).]

[Chart: MPI/Pro time ratios (crc/nocrc and 3mr/nocrc) vs. message size, normalized against baseline performance.]

Page 32: Fault Tolerant MPI

Early Results, II

[Chart: total time vs. message size for no CRC, CRC, and 3MR, using MPICH (version 1.2.1).]

[Chart: MPICH time ratios against baseline MPI/Pro timings (nocrc_mpich/nocrc_mpipro, crc_mpich/nocrc_mpipro, 3mr_mpich/nocrc_mpipro) vs. message size.]

Page 33: Fault Tolerant MPI

Early Results, III: Time-Based nMR with MPI/Pro

[Chart: total time for 10,000 runs vs. message size for various nMR levels (3MR through 9MR).]

[Chart: MPI/Pro time ratios for the various nMR levels against baseline (3mr/nocrc through 9mr/nocrc) vs. message size.]

Page 34: Fault Tolerant MPI

Other Possible Models

Master/Slave was considered before
Broadcast/Reduce data-parallel applications
Independent processing + corner turns
Ring computing
Pipeline bi-partite computing
General MPI-1 models (all-to-all)
Idea: trade generality for coverage

Page 35: Fault Tolerant MPI

What About Receiver-Based Models?

Should we offer, instead of or in addition to MPI/Pro, a receiver-based model?
Utilize publish/subscribe semantics
Bulletin boards? Tagged messages, etc.
Try to get rid of the single point of failure this way
Sounds good; can it be done?
Will anything like an MPI code work?
Does anyone code this way!? (e.g., JavaSpaces, Linda, military embedded distributed computing)

Page 36: Fault Tolerant MPI

Plans for the Upcoming 12 Months

Continue implementation of MPI/FT to support applications in simplex mode
Remove single points of failure for master/slave
Support for multi-SPMD models
Explore additional application-relevant models
Performance studies
Integrate fully with the Gossip protocol for detection

Page 37: Fault Tolerant MPI

Summary & Conclusions

MPI/FT = MPI/Pro + one or more availability enhancements
Fault-tolerance concerns lead to new MPI implementations
Support for simplex, parallel nMR, and/or mixed mode
nMR is not scalable
Both time-based nMR and CRC (the choice depends on message size and the MPI implementation) - can do now
Self-checking threads - can do now
Coordinator (execution models) - can do very soon
Gossip for detection - can do, need to integrate
Shifted APIs/callbacks - easy to do, but will people use them?
Early results with CRC vs. nMR over a TCP/IP cluster shown

Page 38: Fault Tolerant MPI

Related Work

G. Stellner (CoCheck, 1996): checkpointing that works with MPI and Condor
M. Hayden (The Ensemble System, 1997): next-generation Horus communication toolkit
Evripidou et al. (A Portable Fault Tolerant Scheme for MPI, 1998): redundant-processes approach to masking failed nodes
A. Agbaria and R. Friedman (Starfish, 1999): event bus; works in a specialized language; related to Ensemble
G.F. Fagg and J.J. Dongarra (FT-MPI, 2000): growing/shrinking communicators in response to node failures; memory-based checkpointing; reversing calculation?
G. Bosilca et al. (MPICH-V, 2002; new, to be presented at SC2002): Grid-based modifications to MPICH for "volatile nodes"; automated checkpoint, rollback, and message logging
Also, substantial literature related to ISIS/HORUS (Dr. Birman et al. at Cornell) that is interesting for distributed computing