
Selective Recovery From Failures In A Task Parallel Programming Model

James Dinan*, Sriram Krishnamoorthy#, Arjun Singri*, P. Sadayappan*

*The Ohio State University
#Pacific Northwest National Laboratory

1

Faults at Scale

Future systems built from a large number of components
MTBF inversely proportional to #components

Faults will be frequent

Checkpoint-restart too expensive with numerous faults
Strain on system components, notably the file system

Assumption of fault-free operation infeasible

Applications need to think about faults

2

Programming Models

SPMD ties computation to a process
Fixed machine model
Applications need to change with major architectural shifts

Fault handling involves non-local design changes
Rely on p processes: what if one goes away?

Message-passing makes it harder
Consistent cuts are challenging
Message logging, etc. is expensive

Fault management requires a lot of user involvement

3

Problem Statement

Fault management framework

Minimize user effort

Components
Data state: application data, communication operations
Control state: what work is each process doing?

Continue to completion despite faults

4

Approach

One-sided communication model
Easy to derive consistent cuts

Task parallel control model
Computation decoupled from processes

User specifies computation
Collection of tasks on global data

Runtime schedules computation
Load balancing
Fault management

5

Global Arrays (GA)

PGAS family: UPC (C), CAF (Fortran), Titanium (Java), GA (library)
Aggregate memory from multiple nodes into a global address space

Data access via one-sided get(..), put(..), acc(..) operations (see the sketch below)
Programmer controls data distribution and locality

Fully interoperable with MPI and ARMCI
Support for higher-level collectives – DGEMM, etc.
Widely used – chemistry, sub-surface transport, bioinformatics, CFD
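For concreteness, here is a minimal sketch of this style of one-sided access with the GA C bindings (the array dimensions, MA scratch sizes, and the accessed block are arbitrary; error handling is omitted):

```c
#include <mpi.h>
#include <ga.h>
#include <macdecls.h>                   /* MA allocator used internally by GA */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    GA_Initialize();
    MA_init(C_DBL, 1000000, 1000000);   /* local scratch; sizes are illustrative */

    int dims[2]  = {1024, 1024};
    int chunk[2] = {-1, -1};            /* let GA choose the distribution */
    int g_x = NGA_Create(C_DBL, 2, dims, "X", chunk);

    /* One-sided access: any process can read or write any block,
       with no matching receive or tag matching on the owner. */
    double buf[8][8];
    int lo[2] = {0, 0}, hi[2] = {7, 7}, ld[1] = {8};
    NGA_Get(g_x, lo, hi, buf, ld);      /* fetch block X[0..7][0..7] */
    NGA_Put(g_x, lo, hi, buf, ld);      /* write the block back      */

    GA_Sync();                          /* collective sync point     */
    GA_Destroy(g_x);
    GA_Terminate();
    MPI_Finalize();
    return 0;
}
```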

6

GA Memory Model

[Figure: the partitioned global address space. Each process Proc0 … Procn contributes a shared region and retains a private region; a global array X[M][M][N] is accessed in blocks such as X[1..9][1..9][1..9]]

Remote memory access
Dominant communication in GA programs
Destination known in advance
No receive operation or tag matching
Remote progress: ensure overlap

Atomics and collectives
Blocking
Few outstanding at any time

7

Saving Data State

Data state = communication state + memory state

Communication state
“Flush” pending RMA operations (single call; sketch below)
Save atomic and collective ops (small state)

Memory state
Force other processes to flush their pending ops

Used in virtualized execution of GA apps (Comp. Frontiers’09)

Also enables pre-emptive migration
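As a sketch of the flush step: ARMCI's ARMCI_AllFence completes all outstanding one-sided operations issued by the calling process, and pairing it with a barrier forces every process's pending operations to drain before the memory state is captured. The checkpoint/migration write itself is only indicated by a hypothetical helper:

```c
#include <mpi.h>
#include <armci.h>

/* Hypothetical helper: persist (or migrate) this process's portion of the
   global address space. Not part of GA/ARMCI. */
extern void save_local_memory_state(void);

/* Called collectively by all processes. */
void save_data_state(void)
{
    /* Communication state: complete all outstanding one-sided (RMA)
       operations issued by this process with a single call. */
    ARMCI_AllFence();

    /* Since every process fences and then synchronizes, no update to
       local memory is still in flight when we proceed. */
    MPI_Barrier(MPI_COMM_WORLD);

    /* Memory state: the local portion of the global address space is now
       consistent and can be saved or migrated pre-emptively. */
    save_local_memory_state();
}
```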

8

9

The Asynchronous Gap

The PGAS memory model simplifies managing data

Computation model is still regular, process-centric SPMD

Irregularity in the data can lead to load imbalance

Extend the PGAS model to bridge the asynchronous gap
Dynamic, irregular view of the computation
Runtime system should perform load balancing
Allow for computation movement to exploit locality

[Figure: a get(…) of block X[1..9][1..9][1..9] from the global array X[M][M][N]]

Control State – Task Model

Express computation as a collection of tasks
Tasks operate on data stored in Global Arrays
Executed in collective task parallel phases

Runtime system manages task execution

10

[Figure: execution alternates between SPMD phases and task parallel phases; each task parallel phase ends with termination detection]

11

Task Model

• Inputs: Global data, Immediates, CLOs
• Outputs: Global data, CLOs, Child tasks
• Strict dependence: Only parent → child (for now)
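A hypothetical task descriptor capturing these inputs and outputs might look as follows; the names and layout are illustrative, not the actual Scioto structures:

```c
/* Illustrative task descriptor (not the actual Scioto representation). */
typedef struct {
    void (*fcn)(void *task);     /* task body f(...) run by the runtime      */

    /* Inputs */
    int    in_ga;                /* handle of the global array read, e.g. Y  */
    int    in_lo[3], in_hi[3];   /* input patch, e.g. Y[0]                   */
    double immediate;            /* immediate argument, e.g. the constant 5  */
    int    clo_id;               /* id of a common local object (CLO1)       */

    /* Outputs (child tasks are enqueued by the body itself) */
    int    out_ga;               /* global array written, e.g. X             */
    int    out_lo[3], out_hi[3]; /* output patch, e.g. X[1]                  */
} task_desc_t;
```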

[Figure: a task with body f(...), inputs 5, Y[0], … and a CLO (CLO1), and output X[1]; the arrays X[0..N], Y[0..N] and the CLOs live in the shared and private regions of the partitioned global address space across Proc0 … Procn]

12

Scioto Programming Interface

High level interface: shared global task collection

Low level interface: set of distributed task queues
Queues are prioritized by affinity
Use the work-first principle (LIFO)
Load balancing via work stealing (FIFO)
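A minimal, local-view sketch of that queue discipline: the owner pushes and pops at the tail (LIFO, work first) while thieves take from the head (FIFO, the oldest and typically largest work). Synchronization and the ARMCI-based remote steal path are omitted, and the names are illustrative:

```c
typedef struct { int task_id; } task_t;

typedef struct {
    task_t *buf;            /* storage provided by the runtime       */
    int     head, tail;     /* valid entries are buf[head .. tail-1] */
} task_queue_t;

/* Owner adds newly created work at the tail. */
static void q_push_local(task_queue_t *q, task_t t) { q->buf[q->tail++] = t; }

/* Owner executes from the tail: work-first, LIFO order. */
static int q_pop_local(task_queue_t *q, task_t *t)
{
    if (q->tail == q->head) return 0;
    *t = q->buf[--q->tail];
    return 1;
}

/* A thief takes from the head: FIFO order, leaving recent work local. */
static int q_steal(task_queue_t *q, task_t *t)
{
    if (q->tail == q->head) return 0;
    *t = q->buf[q->head++];
    return 1;
}
```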

13

Work Stealing Runtime System

ARMCI task queue on each processor
Steals don't interrupt the remote process

When a process runs out of work
Select a victim at random and steal work from it

Scaled to 8192 cores (SC’09)

Communication Markers

Communication initiated by a failed process

Handling partial completions
Get(), Put() are idempotent – ignore
Acc() is non-idempotent

Mark beginning and end of acc() ops (sketch below)

Overhead
Memory usage – proportional to # tasks
Communication – additional small messages
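A sketch of how markers could bracket the non-idempotent accumulate: the flags are assumed to live in an integer global array g_marks with one 'started'/'contributed' pair per output block, and GA_Init_fence/GA_Fence are used to wait for remote completion. The layout and helper name are illustrative, not the paper's exact implementation:

```c
#include <ga.h>

/* g_marks: 1-D C_INT global array; entries 2*blk and 2*blk+1 hold the
   "started" and "contributed" flags for output block blk. */
void marked_acc(int g_x, int g_marks, int blk,
                int lo[], int hi[], double *buf, int ld[])
{
    double one = 1.0;
    int    flag = 1;
    int    mlo[1], mhi[1], mld[1] = {1};

    /* Mark "started" and make it visible before issuing the update. */
    GA_Init_fence();
    mlo[0] = mhi[0] = 2 * blk;
    NGA_Put(g_marks, mlo, mhi, &flag, mld);
    GA_Fence();

    /* The non-idempotent accumulate itself. */
    GA_Init_fence();
    NGA_Acc(g_x, lo, hi, buf, ld, &one);
    GA_Fence();                      /* wait for remote completion */

    /* Only now is it safe to mark the block as "contributed". */
    mlo[0] = mhi[0] = 2 * blk + 1;
    NGA_Put(g_marks, mlo, mhi, &flag, mld);
}
```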

14

Fault Tolerant Task Pool

15

Re-execute incomplete tasks until a round completes without failures
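A minimal sketch of this loop; tc_process, detect_failures, recover_corrupted_data, and tc_refill_incomplete are hypothetical placeholders for the runtime's internals, not actual Scioto or GA calls:

```c
/* Hypothetical runtime hooks (placeholders only). */
typedef struct tc tc_t;
extern void tc_process(tc_t *tc);           /* one collective task-parallel round  */
extern int  detect_failures(void);          /* failures observed during the round  */
extern void recover_corrupted_data(void);   /* flush comm., restore affected data  */
extern void tc_refill_incomplete(tc_t *tc); /* re-add tasks not marked contributed */

void fault_tolerant_task_pool(tc_t *tc)
{
    int failures;
    do {
        tc_process(tc);
        failures = detect_failures();
        if (failures) {
            recover_corrupted_data();
            tc_refill_incomplete(tc);
        }
    } while (failures);   /* stop only after a failure-free round */
}
```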

Task Execution

16

Update result only if it has not already been modified
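A sketch of that guard, reusing the hypothetical marker layout from the sketch above: a re-executed task reads the 'contributed' flag for its output block and contributes only if the flag is still clear (the real runtime may order this differently):

```c
#include <ga.h>

/* From the earlier marker sketch. */
void marked_acc(int g_x, int g_marks, int blk,
                int lo[], int hi[], double *buf, int ld[]);

void maybe_contribute(int g_x, int g_marks, int blk,
                      int lo[], int hi[], double *buf, int ld[])
{
    int contributed = 0;
    int mlo[1], mhi[1], mld[1] = {1};

    /* Read the "contributed" flag for this output block. */
    mlo[0] = mhi[0] = 2 * blk + 1;
    NGA_Get(g_marks, mlo, mhi, &contributed, mld);

    /* Update the result only if no earlier execution already did. */
    if (!contributed)
        marked_acc(g_x, g_marks, blk, lo, hi, buf, ld);
}
```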

Detecting Incomplete Communication

Data with ‘started’ set but not ‘contributed’

Approach 1: “Naïve” scheme
Check all markers for any that remain ‘started’
Not scalable

Approach 2: “Home-based” scheme
Invert the task-to-data mapping
Distributed meta-data check + all-to-all
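One way the home-based check could look, sketched with plain MPI collectives: each process scans only the markers for the blocks it homes, then MPI_Allgather/MPI_Allgatherv (standing in for the all-to-all step) give every process the global list of incomplete blocks. started/contributed represent the locally homed marker slices, and all names are illustrative:

```c
#include <mpi.h>
#include <stdlib.h>

/* Returns the total number of incomplete blocks; their ids are written to
   global_ids, which the caller sizes for the worst case. */
int find_incomplete_blocks(const int *started, const int *contributed,
                           int nlocal, int first_blk,
                           int *global_ids, MPI_Comm comm)
{
    int nproc;
    MPI_Comm_size(comm, &nproc);

    /* Distributed meta-data check: scan only the locally homed blocks. */
    int *local_ids = malloc((nlocal > 0 ? nlocal : 1) * sizeof(int));
    int  nfound = 0;
    for (int i = 0; i < nlocal; i++)
        if (started[i] && !contributed[i])
            local_ids[nfound++] = first_blk + i;

    /* Exchange counts, then the incomplete block ids themselves. */
    int *counts = malloc(nproc * sizeof(int));
    int *displs = malloc(nproc * sizeof(int));
    MPI_Allgather(&nfound, 1, MPI_INT, counts, 1, MPI_INT, comm);

    int total = 0;
    for (int p = 0; p < nproc; p++) { displs[p] = total; total += counts[p]; }

    MPI_Allgatherv(local_ids, nfound, MPI_INT,
                   global_ids, counts, displs, MPI_INT, comm);

    free(local_ids); free(counts); free(displs);
    return total;
}
```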

17

Algorithm Characteristics

Tolerance to arbitrary number of failures

Low overhead in the absence of failures
Small messages for markers
Can be optimized through pre-issue/speculation

Space overhead proportional to task pool size
Storage for markers

Recovery cost proportional to #failures
Redo work to produce the data held by failed processes

18

Bounding Cascading Failures

A process with “corrupted” data
Incomplete comm. from a failed process

Marking it as failed -> cascading failures

A process with “corrupted” data
Flushes its communication; then recovers its data

Each task computes only a few data blocks
Each process: pending comm. to a few blocks at a time
Total recovery cost:

Data in failed processes + a small additional number

19

Experimental Setup

Linux cluster

Each node
Dual quad-core 2.5 GHz Opterons
24 GB RAM

InfiniBand interconnection network

Self-Consistent Field (SCF) kernel – 48 Be atoms

Worst case fault – at the end of a task pool

20

Cost of Failure – Strong Scaling

21

#tasks re-executed goes down as the process count increases

Worst Case Failure Cost

22

Relative Performance

23

Less than 10% cost for one worst case fault

Related Work

Checkpoint-restart
Continues to handle the SPMD portion of an app
Finer-grain recoverability using our approach

BOINC – client-server
CilkNOW – single-assignment form
Linda – requires transactions
CHARM++
Processor virtualization based
Needs message logging

Efforts on fault-tolerant runtimes
Complement this work

24

Conclusions

Fault tolerance through
PGAS memory model
Task parallel computation model

Fine-grain recoverability through markers

Cost of failure proportional to #failures

Demonstrated low cost recovery for an SCF kernel

25

Thank You!

26