Application Level Fault Tolerance and Detection

Principal Investigators: C. Mani Krishna, Israel Koren. Graduate Students: Diganta, Eric, Janhavi, Osman, Vijay. Architecture and Real-Time Systems (ARTS) Lab, Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA 01003.


Page 1: Application Level Fault Tolerance and Detection

Principal Investigators: C. Mani Krishna, Israel Koren

Graduate Students: Diganta, Eric, Janhavi, Osman, Vijay

Architecture and Real-Time Systems (ARTS) Lab, Department of Electrical and Computer Engineering

University of Massachusetts, Amherst, MA 01003

Page 2: Application Level Fault Tolerance and Detection

What is ALFTD?

ALFTD stands for Application Level Fault Tolerance and Detection.

ALFTD complements existing system- or algorithm-level fault tolerance by leveraging information available only at the application level.

Using such application-level semantic information significantly reduces the overall cost of providing fault tolerance.

ALFTD may be used alone or to supplement other fault detection schemes.

ALFTD is scalable.

Error overhead can be traded off against the time overhead invested in fault tolerance.

Page 3: Application Level Fault Tolerance and Detection

ALFTD Overview

ALFTD allows the system to survive both data faults and system (instruction/hardware) faults.

System faults cause a process to eventually cease functioning.

Data faults cause a process to continue running with incorrect results.

ALFTD has been implemented in OTIS to determine its feasibility as a fault detection and tolerance method for REE applications.

OTIS has two sets of related output data: temperature and emissivity.

Experiments have focused mostly on the temperature output.

Page 4: Application Level Fault Tolerance and Detection

OTIS Structure

(Diagram: one master process M and three slave processes S, with an OUTPUT file.)

1. MPI starts

2. MPI starts slave and master processes

3. Master sends tasks

4. Slave calculations

5. Slave output to file

Page 5: Application Level Fault Tolerance and Detection

OTIS’ Work Distribution

OTIS’ dynamic workload distribution allows it to compensate for system faults.

Work originally partitioned for a failed processor is instead taken by the remaining processes.

OTIS does not compensate for data faults.

As long as the work is completed, there is no measure of correctness.

OTIS does not consider deadline repercussions.
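
The dynamic distribution described on this slide can be illustrated with a shared work queue. The following is a minimal single-process simulation in Python, not OTIS's actual MPI code; the function name `distribute_rows`, the worker labels, and the one-row-per-request policy are all assumptions made for illustration.

```python
from collections import deque

def distribute_rows(num_rows, workers, failed_workers=frozenset()):
    """Simulate dynamic row distribution among workers.

    Hypothetical model: a failed worker (system fault) never requests
    work, so its share is picked up by the remaining workers.
    """
    alive = [w for w in workers if w not in failed_workers]
    if not alive:
        raise RuntimeError("no surviving workers")
    pending = deque(range(num_rows))   # rows not yet handed out
    assignment = {w: [] for w in alive}
    while pending:
        for w in alive:                # each idle worker asks for one row
            if not pending:
                break
            assignment[w].append(pending.popleft())
    return assignment

# S2 fails; its rows end up with S1 and S3.
result = distribute_rows(6, ["S1", "S2", "S3"], failed_workers={"S2"})
print(result)
```

Every row is still computed, but nothing in this scheme checks whether the values are correct, which is the gap the slide points out for data faults.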

Page 6: Application Level Fault Tolerance and Detection


OTIS Fault Cases

Page 7: Application Level Fault Tolerance and Detection

ALFTD OTIS Structure

(Diagram: one master process M and three slave nodes labeled S2P1, S1P3 and S3P2, with an OUTPUT file.)

1. MPI starts

2. MPI starts slave and master processes, primary and secondary

3. Master sends tasks

4. Slave calculations

5. Slave output to file?

Page 8: Application Level Fault Tolerance and Detection

Secondaries in OTIS

The secondary required for ALFTD is implemented to be functionally similar to the primary.

Secondary scaling occurs through resolution reduction.

OTIS’ “natural” data input exhibits spatial locality.

Points not directly calculated can be estimated by interpolating between calculated points.

Secondary processes have been tested at 20%-50% of the primary calculation overhead.

While 50% affords better quality, 20% has less overhead.
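
Resolution reduction with interpolation can be sketched as follows. This is an illustrative Python example, not the OTIS secondary: `secondary_row`, `expensive`, and the choice of linear interpolation along a single row are assumptions, and the slides do not say which interpolation method was used.

```python
def secondary_row(compute, width, step):
    """Compute every `step`-th pixel of a row and interpolate the rest.

    `compute(x)` stands in for the expensive per-pixel calculation
    (hypothetical). A step of 2 evaluates roughly half of the pixels.
    Returns the approximated row and the number of pixels computed.
    """
    xs = list(range(0, width, step))
    if xs[-1] != width - 1:
        xs.append(width - 1)           # always compute the last pixel
    row = [None] * width
    for x in xs:
        row[x] = compute(x)            # directly calculated points
    for left, right in zip(xs, xs[1:]):
        for x in range(left + 1, right):
            t = (x - left) / (right - left)
            row[x] = (1 - t) * row[left] + t * row[right]
    return row, len(xs)

def expensive(x):
    # Placeholder for the real per-pixel calculation: a smooth ramp.
    return 300.0 + 0.5 * x

approx, evaluated = secondary_row(expensive, width=9, step=4)
print(evaluated, approx)
```

The approximation is only as good as the spatial locality of the data: smooth regions interpolate well, while sharp features between calculated points are lost.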

Page 9: Application Level Fault Tolerance and Detection

Example of Secondary Resolution

(ALFTD compensation for 10 rows in a sample dataset)

(Figure panels: 100%, 50%, 33% and 25% secondary resolution.)

Page 10: Application Level Fault Tolerance and Detection


ALFTD Benefit

Page 11: Application Level Fault Tolerance and Detection


ALFTD Benefit (cont’d)

Page 12: Application Level Fault Tolerance and Detection

Fault Detection

When to run the secondary, and when to use the secondary output, is determined by output filters.

Output filters are created to check for application-specific trends in data.

Aberrations from normal data characteristics can be considered the product of potentially faulty processes.

OTIS relies on natural temperature characteristics to detect potentially faulty data.

Spatial locality: temperature changes gradually over small areas.

Absolute bounds: temperature should not exceed certain values.
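
The two temperature filters named on this slide can be sketched as a per-row check. This is a minimal Python illustration under stated assumptions: the function name `suspect_pixels`, the neighbor-difference test for spatial locality, and the threshold values are all hypothetical and are not taken from OTIS.

```python
def suspect_pixels(row, lower, upper, max_step):
    """Return (index, reason) pairs for values that violate a filter.

    lower/upper: assumed absolute temperature bounds.
    max_step: assumed largest plausible change between adjacent pixels.
    """
    suspects = []
    for i, value in enumerate(row):
        if not (lower <= value <= upper):
            suspects.append((i, "absolute bounds"))
        elif i > 0 and abs(value - row[i - 1]) > max_step:
            suspects.append((i, "spatial locality"))
    return suspects

# One corrupted value in an otherwise smooth row (made-up numbers).
row = [290.0, 290.5, 291.0, 950.0, 291.5, 292.0]
flags = suspect_pixels(row, lower=200.0, upper=400.0, max_step=5.0)
print(flags)
```

Because the locality test compares each value with its left neighbor, the pixel immediately after an outlier is flagged as well. That is conservative: a flagged row is only a candidate for checking, not proof of a fault.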

Page 13: Application Level Fault Tolerance and Detection

Data Sets

Three data sets were chosen for their interesting characteristics:

“Blob”: broad, unchanging areas with dark spots

“Stripe”: relatively undynamic except for one “stripe”

“Spots”: turbulent spots may defy “spatial locality” predictions

Page 14: Application Level Fault Tolerance and Detection


Data Frequency (Values)

Page 15: Application Level Fault Tolerance and Detection


Data Frequency (Spatial Locality)

Page 16: Application Level Fault Tolerance and Detection

Validation Through Secondaries

When the primary deadline is hit, rows are re-delegated to the secondaries if (and only if) one of the following holds:

The primary has returned results for that row that are suspected to be faulty. The secondary results can be used to decide whether the results are indeed faulty.

A particular row was never successfully calculated. The secondary results can be immediately used in place of the missing primary results.
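
The re-delegation rule on this slide amounts to a small decision per row. The sketch below is a hypothetical Python rendering: the function name `rows_for_secondary`, the dictionary of returned rows, and the `"verify"`/`"missing"` labels are assumptions introduced here.

```python
def rows_for_secondary(num_rows, primary_results, suspected):
    """Pick rows to re-delegate when the primary deadline is hit.

    primary_results: dict mapping row index -> data returned by a primary.
    suspected: set of rows whose primary output tripped an output filter.
    Returns a dict mapping row index -> reason for running the secondary.
    """
    plan = {}
    for r in range(num_rows):
        if r not in primary_results:
            plan[r] = "missing"    # secondary output replaces the row
        elif r in suspected:
            plan[r] = "verify"     # secondary output helps judge the primary
    return plan

# Row 2 never arrived; row 1 arrived but looks suspicious.
plan = rows_for_secondary(4, {0: [290.0], 1: [950.0], 3: [291.0]}, {1})
print(plan)
```

Rows that arrived and passed the filters are never sent to a secondary, which is what keeps the added overhead below that of full redundancy.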

Page 17: Application Level Fault Tolerance and Detection

Validation Through Secondaries (cont’d)

After the secondary has been run to verify a primary’s results, the “better” data is chosen according to the following logic grid:

                        Primary: Faultless   Primary: Ambiguous   Primary: Faulty
Secondary: Faultless    Primary              Secondary            Secondary
Secondary: Ambiguous    Primary              Primary              Secondary
Secondary: Faulty       Primary              Primary              Primary*
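
The logic grid, read with primary states as columns and secondary states as rows, can be encoded as a lookup table. This Python sketch uses hypothetical names (`CHOICE`, `choose_output`) and lowercase state strings; the meaning of the asterisk on the faulty/faulty cell is not given in the transcript, so it is only noted in a comment.

```python
# Keys are (primary state, secondary state); values name the output kept.
CHOICE = {
    ("faultless", "faultless"): "primary",
    ("ambiguous", "faultless"): "secondary",
    ("faulty", "faultless"): "secondary",
    ("faultless", "ambiguous"): "primary",
    ("ambiguous", "ambiguous"): "primary",
    ("faulty", "ambiguous"): "secondary",
    ("faultless", "faulty"): "primary",
    ("ambiguous", "faulty"): "primary",
    ("faulty", "faulty"): "primary",   # marked with * on the slide
}

def choose_output(primary_state, secondary_state):
    """Return which result ('primary' or 'secondary') to keep for a row."""
    return CHOICE[(primary_state, secondary_state)]

print(choose_output("faulty", "ambiguous"))
```

The table keeps the primary by default and switches to the secondary only when the secondary is rated strictly better than the primary.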

Page 18: Application Level Fault Tolerance and Detection

Fault Tolerance Results: “Spots”

Fault Tolerance with injected faults in “Spots”

Page 19: Application Level Fault Tolerance and Detection

Fault Tolerance Results: “Spots” (cont’d)

(Figure panels: fault-free output; faulty output; ALFTD-corrected faulty output at 25%, 33% and 50% ALFTD computation overhead.)

Page 20: Application Level Fault Tolerance and Detection

Fault Tolerance Results: “Spots” (cont’d)

Difference plots: faulty output versus faultless output

(Figure panels: no ALFTD; 25%, 33% and 50% ALFTD computation overhead. Scale runs from no error to max error.)

Page 21: Application Level Fault Tolerance and Detection

Fault Tolerance Results: “Blob”

Fault Tolerance with injected faults in “Blob”

Page 22: Application Level Fault Tolerance and Detection

Fault Tolerance Results: “Blob” (cont’d)

(Figure panels: fault-free output; faulty output; ALFTD-corrected faulty output at 25%, 33% and 50% ALFTD computation overhead.)

Page 23: Application Level Fault Tolerance and Detection

Fault Tolerance Results: “Blob” (cont’d)

Difference plots: faulty output versus faultless output

(Figure panels: no ALFTD; 25%, 33% and 50% ALFTD computation overhead. Scale runs from no error to max error.)

Page 24: Application Level Fault Tolerance and Detection

Fault Tolerance Results: “Stripe”

Fault Tolerance with injected faults in “Stripe”

Page 25: Application Level Fault Tolerance and Detection

Fault Tolerance Results: “Stripe” (cont’d)

(Figure panels: fault-free output; faulty output; ALFTD-corrected faulty output at 25%, 33% and 50% ALFTD computation overhead.)

Page 26: Application Level Fault Tolerance and Detection

Fault Tolerance Results: “Stripe” (cont’d)

Difference plots: faulty output versus faultless output

(Figure panels: no ALFTD; 25%, 33% and 50% ALFTD computation overhead. Scale runs from no error to max error.)

Page 27: Application Level Fault Tolerance and Detection

Emissivity Data

Emissivity is loosely proportional to temperature data.

Emissivity exhibits spatial locality.

Emissivity has natural bounds of expected data:

Natural metal: ~0.5

Rock: ~0.8 to ~0.95

Vegetation, water: ~1.0

Values below 0.5 or above 1.0 are faulty.

Page 28: Application Level Fault Tolerance and Detection

Emissivity Data (cont’d)

Emissivity does not exhibit the same data “closeness” as temperature output.

This makes it very difficult to distinguish faulty from non-faulty data.

Luckily, faults present in temperature output are easily detected, and reflect faults in emissivity output.

Emissivity does not have per-pixel independence of calculation.

Dependence on the correctness of neighboring pixels makes resolution reduction a viable, but not the best, method for secondary reduction.

Page 29: Application Level Fault Tolerance and Detection


Data Frequency (Emissivity Values)

Page 30: Application Level Fault Tolerance and Detection

Conclusion

ALFTD has already been shown to be a worthwhile alternative to full redundancy.

Improvements on the scheme will increase fault coverage and decrease secondary calculation overhead in both the emissivity and temperature outputs.

OTIS, as a general matrix-based, master/slave program, is a springboard to other, similar programs (e.g., NGST).

ALFTD as a fault-detection scheme will continue to be effective in programs which exhibit “natural” output.

Page 31: Application Level Fault Tolerance and Detection


Thank You!

Page 32: Application Level Fault Tolerance and Detection

Relative Error Calculation

Error in OTIS output is calculated relative to a faultless “template”.

The average relative error is the average of the relative errors over the entire output.

Faulty value = f(x,y)

Faultless value = F(x,y)

\[ \mathrm{Error} = \frac{\operatorname{abs}[f(x,y) - F(x,y)]}{F(x,y)} \]
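
The average relative error defined on this slide can be computed as in the following Python sketch. The function name `average_relative_error` and the list-of-rows data layout are assumptions; the code also assumes every template value F(x,y) is positive, as is the case for temperatures on an absolute scale.

```python
def average_relative_error(faulty, template):
    """Average of abs(f - F) / F over all pixels.

    faulty:   rows of f(x, y), the output under test.
    template: rows of F(x, y), the faultless reference (assumed > 0).
    """
    total = 0.0
    count = 0
    for f_row, F_row in zip(faulty, template):
        for f, F in zip(f_row, F_row):
            total += abs(f - F) / F
            count += 1
    return total / count

# Made-up 2x2 example: one pixel is 10% too high, the rest match.
template = [[300.0, 300.0], [290.0, 290.0]]
faulty = [[300.0, 330.0], [290.0, 290.0]]
err = average_relative_error(faulty, template)
print(err)
```

Averaging over the whole output means a few badly corrupted pixels and many slightly perturbed pixels can yield the same score, so the difference plots remain useful alongside this single number.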