A Software Layer for Disk Fault Injection Jake Adriaens Dan Gibson CS 736 Spring 2005 Instructor: Remzi Arpaci-Dusseau

A Software Layer for Disk Fault Injection

Jake Adriaens

Dan Gibson

CS 736 Spring 2005

Instructor: Remzi Arpaci-Dusseau


Outline

1. Introduction, Motivation, & Challenges

2. Related Work

3. Implementation Details & IDE Driver

4. Fault Model

5. Methods & Evaluation

6. Summary


Overview - 1

Software system for modeling IDE disk faults in an x86/Linux-based computer

Modification to IDE driver for read/write event interception


Overview - 2

Disk faults described at a high level

Faults passed to kernel-level module

On read/write event:

– IDE driver calls kernel module to perform request modification

– Before write event, module may modify data to be written

– After read event, module may modify data read from disk


Motivation – Why purposely cause disk failures?

Commodity HW (and SW!) fails, usually at unexpected times

– Causing failures at expected times can help improve fault tolerance measures

Can be used to determine fault tolerance of systems

– Various flavors of RAID need fault injection


Motivation

Faults can happen at the worst time

– In the middle of a PowerPoint presentation…


Challenges

Drivers are typically written with reliability in mind

– May have error detection / correction measures

– Should these be removed? Fooled? Applauded?

Low-level drivers critically affect performance and stability of the system

– Disk faults need not be “stable,” but shouldn’t have unusual “side effects”


Challenges

Failure models difficult to justify

– Disk manufacturers don’t offer details on how/why their disks fail

– Failstop model is widely used: models complete, detected disk failure

– Other models must be chosen generally to account for many different disks, controllers, etc.


Outline

1. Introduction, Motivation, & Challenges

2. Related Work

3. Implementation Details & IDE Driver

4. Fault Model

5. Methods & Evaluation

6. Summary


Related Work

Software fault injection

– Huang et al. (and many others) use software fault injection for modifying cached web pages (ACM/ProcWWW)

– Jarboui et al. inject software faults into the Linux kernel and observe system behavior

– Nagaraja et al. inject faults into cluster-based systems


Related Work

Disk Faults, Modeling, Detection

– Kaaniche et al. inject disk faults to study RAID behavior

– Kari et al. present fault detection and diagnosis techniques (separate studies)

– Various other RAID and/or FS papers use some form of fault injection to model failures


Related Work

Hardware Fault Injection


Outline

1. Introduction, Motivation, & Challenges

2. Related Work

3. Implementation Details & IDE Driver

4. Fault Model

5. Methods & Evaluation

6. Summary


Implementation

Core components

– User-level parser

– In-kernel injection module

– In-driver upcalls

– System calls

Added ~20 lines to IDE driver code

Kernel module is demand-loaded, ~250 lines in size

2 system calls, inject_fault and getdrivesize, ~120 lines


Implementation – User-level Console

Used for fault definition

– Console interface for fault definition

– Processes batch files

– Checks faults for validity: sector ranges, probability, etc. (more later)

– Passes faults to kernel module
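
The validity check can be illustrated with a small user-space sketch. The slides only say that sector ranges and probability are checked, so the field names (`kind`, `first_sector`, `last_sector`, `probability`) and the `validate_fault` helper below are hypothetical and are not the console's actual interface.

```python
from dataclasses import dataclass

# Fault type names as listed in the fault model slides.
FAULT_KINDS = {"sectorfail", "sectorro", "sectorwrong",
               "transaddr", "transdata", "failstop"}

@dataclass
class Fault:
    # Hypothetical fields; the real fault description format is not shown.
    kind: str
    first_sector: int
    last_sector: int
    probability: float

def validate_fault(fault, drive_sectors):
    """Return a list of problems; an empty list means the fault looks valid."""
    problems = []
    if fault.kind not in FAULT_KINDS:
        problems.append(f"unknown fault type: {fault.kind}")
    if not (0 <= fault.first_sector <= fault.last_sector < drive_sectors):
        problems.append("sector range is reversed or outside the drive")
    if not (0.0 <= fault.probability <= 1.0):
        problems.append("probability must be in [0, 1]")
    return problems
```

A fault that passes this kind of check would then be handed to the kernel module; one that fails would be rejected at the console.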


Implementation – IDE Driver Modification

Added “upcalls” to injection module

– Pass I/O requests to module for modification

– Provide callback service on I/O completion

Added special-purpose code for certain fault models

– Failstop model requires in-driver actions


Implementation – Kernel Module

Receives fault lists from user-level console

Called by IDE driver to perform insertion when:

– LBA sector (SCSI-like) becomes known – sector may be modified

– Write is initiated – data to be written may be modified

– Read completes – data may be modified before returning control to I/O initiator
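
The three call points above can be modeled in user space as follows. This is only a sketch: `InjectionHooks`, `ToyDriver`, and `ZeroOnRead` are hypothetical names, and the real upcalls live in the C IDE driver rather than in Python.

```python
class InjectionHooks:
    """Hypothetical stand-in for the injection module; the defaults change nothing."""

    def on_sector_known(self, sector):
        return sector            # may return a different sector

    def on_write(self, sector, data):
        return data              # may return modified data to be written

    def on_read_complete(self, sector, data):
        return data              # may return modified data to the I/O initiator


class ToyDriver:
    """Toy driver that makes the three upcalls around an in-memory 'disk'."""

    def __init__(self, hooks, sectors=16, sector_size=4):
        self.hooks = hooks
        self.disk = [bytes(sector_size) for _ in range(sectors)]

    def write(self, sector, data):
        sector = self.hooks.on_sector_known(sector)
        self.disk[sector] = self.hooks.on_write(sector, data)

    def read(self, sector):
        sector = self.hooks.on_sector_known(sector)
        return self.hooks.on_read_complete(sector, self.disk[sector])


class ZeroOnRead(InjectionHooks):
    """Example fault: reads of sector 3 return zeros; stored data is untouched."""

    def on_read_complete(self, sector, data):
        return bytes(len(data)) if sector == 3 else data
```

Keeping the hooks as identity functions by default mirrors the idea that requests outside any fault definition pass through unmodified.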


Implementation – System Calls

Added two system calls

– inject_faults(): used to pass fault definitions to kernel module from user space

– getsectors(): used to determine raw sector ranges of IDE devices by name (there are other ways to do this)


Implementation

[Figure: request flow diagram. Labels: Faults Defined, Faults Injected, Disk Request, I/O Initiated, Upcall, Modified Request, Bus Traffic, I/O Returns, Control Returns]


IDE Driver (2.4.26 Linux Kernel)

Important structures

– struct request: information about an IDE request (READ / WRITE, number of sectors, etc.)

– struct ide_drive_s (ide_drive_t): information about a drive (drive name, e.g. “hdc”; sizing/addressing information; etc.)


IDE Driver (2.4.26 Linux Kernel)

Functions

ide_do_rw_disk (3 versions)

– Common choke-point for reads & writes

– Many other similar functions, only this one in use

– Two versions, swapped by preprocessor directives (one for DMA, one for PIO)


Outline

1. Introduction, Motivation, & Challenges

2. Related Work

3. Implementation Details

4. Fault Model

5. Methods & Evaluation

6. Summary


Failure Model

Models selected to represent “generic IDE” disk

– No modeling of specific failures (e.g. Western Digital’s “classic” servo malfunction)

– Models based on ranges of affected logical sectors (à la SCSI)


Failure Model – Fault Types

sectorfail

– Models inability of a given sector (block) or sector range to store data reliably

– Excited on read of sector

– Data read is permuted in some way: randomized, set to specific value, added to offset, or shifted by one or more bytes
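
The four data permutations can be sketched as small functions over a sector's bytes. The function names are hypothetical, and two details are assumptions the slides do not settle: the offset is added per byte modulo 256, and the shift fills vacated bytes with zeros.

```python
import random

def randomize(data, rng):
    """Replace every byte with a random one (rng is a random.Random)."""
    return bytes(rng.randrange(256) for _ in data)

def set_to_value(data, value):
    """Overwrite every byte with one fixed value (0..255)."""
    return bytes([value]) * len(data)

def add_offset(data, offset):
    """Add an offset to each byte, wrapping modulo 256 (assumed)."""
    return bytes((b + offset) % 256 for b in data)

def shift_bytes(data, n):
    """Shift right by n bytes, zero-filling the front (assumed); length is kept."""
    return (bytes(n) + data)[:len(data)]
```

For example, `shift_bytes(b"abcd", 1)` yields `b"\x00abc"`, so a reader of the faulty sector sees plausible but misaligned data.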


Failure Model – Fault Types

sectorro

– Writes to block have no effect on stored value

– Excited on writes to sector: write requests ignored

sectorwrong

– Traffic to a given block is directed to a different block

– Excited on reads & writes: address permuted, similarly to data
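
A toy in-memory disk makes the difference between these two faults concrete. `ToyDisk` and its attributes are hypothetical; checking the read-only set against the requested (not the redirected) sector is also an assumption.

```python
class ToyDisk:
    """In-memory disk with sectorro-like and sectorwrong-like behavior."""

    def __init__(self, sectors=8):
        self.store = [b""] * sectors
        self.read_only = set()   # sectors under a sectorro fault
        self.redirect = {}       # sectorwrong: requested sector -> actual sector

    def _resolve(self, sector):
        # sectorwrong: traffic for one sector is sent to another.
        return self.redirect.get(sector, sector)

    def write(self, sector, data):
        if sector in self.read_only:
            return               # sectorro: write silently ignored
        self.store[self._resolve(sector)] = data

    def read(self, sector):
        return self.store[self._resolve(sector)]
```

Note that neither fault reports an error to the caller: the write simply has no effect, or the request lands on a different sector.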


Failure Model – Fault Types

transaddr

– Sector number wrong for first fault excitation, but right for all others

– Excited on reads & writes: sector permuted as in sectorwrong

transdata

– Data is wrong for first fault excitation: data permuted as in sectorfail
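
The "first excitation only" behavior of the transient faults can be sketched as a one-shot wrapper around any permutation. `TransientFault` is a hypothetical name, and the permutations used in the usage lines are arbitrary examples.

```python
class TransientFault:
    """Applies `permute` the first time it is excited, then becomes a no-op."""

    def __init__(self, permute):
        self.permute = permute
        self.fired = False

    def apply(self, value):
        if self.fired:
            return value         # later excitations see the correct value
        self.fired = True
        return self.permute(value)


# transdata-like: data is wrong once, then correct.
data_fault = TransientFault(lambda data: data[::-1])
# transaddr-like: sector number is wrong once, then correct.
addr_fault = TransientFault(lambda sector: sector + 1)
```

The same wrapper serves both fault types because the only difference is whether the permuted value is the data or the sector number.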


Failure Model – Fault Types

failstop

– Drive is totally unresponsive: performs no reads or writes

– Differs from traditional failstop in that our failstop is invisible

– Drive does not report any errors, simply fails to perform reads or writes to any sector


Outline

1. Introduction, Motivation, & Challenges

2. Related Work

3. Implementation Details

4. Fault Model

5. Methods & Evaluation

6. Summary


Verification of Faults (?)

Faults excited and observed by microbenchmarks tailored to individual fault types

Techniques similar to latent fault detection (Kari et al., and other studies)

Verification of faults is fault-specific


Verification - sectorfail

Corrupts data when read from disk

1. Write known data to disk; observe location using printk statement

2. Inject sectorfail fault at location of file on disk.

3. Unmount/remount FS (flush cache)

4. Attempt to read faulty file (with cat)
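
The reason for step 3 can be shown with a toy model of a cached disk: until the cache is flushed, the read never reaches the faulty path. `CachedDisk` is hypothetical, and zeroing the data is just one of the possible sectorfail permutations.

```python
class CachedDisk:
    """Toy disk with a write-through cache and a sectorfail-like fault set."""

    def __init__(self):
        self.disk = {}
        self.cache = {}
        self.failed = set()      # sectors with an injected fault

    def write(self, sector, data):
        self.cache[sector] = data
        self.disk[sector] = data

    def read(self, sector):
        if sector in self.cache:
            return self.cache[sector]      # served from cache, fault not excited
        data = self.disk[sector]
        if sector in self.failed:
            data = bytes(len(data))        # assumed permutation: set to zeros
        self.cache[sector] = data
        return data

    def flush_cache(self):
        # Stands in for the unmount/remount step.
        self.cache.clear()
```

Reading before the flush returns the known data even though the fault is installed; only the read after the flush shows the corruption.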


Verification - sectorro

Ignores writes to a given location

1. Write known data to disk

2. Inject sectorro fault

3. Flush file cache

4. Write different data to same location

5. Flush file cache

6. Read from disk; expect the data written in (1)


Verification - sectorwrong

Changes address (sector) to another sector number

1. Write known data to disk

2. Flush file cache

3. Inject sectorwrong fault—redirect to known location

4. Read from file – observe data from other sector


Verification - transdata

Data modified after read, but only the first time

1. Verify sectorfail functionality

2. Flush file cache

3. Re-read, expect correct data


Verification - transaddr

Sector number modified before reads & writes

1. Verify sectorwrong functionality

2. Flush file cache

3. Repeat read, expect correct data


Verification - failstop

Easy!

1. Install failstop fault

2. Attempt to access any portion of affected drive

3. Expect bad things

– Usually causes kernel panic


Evaluation

Execution time overhead of injection SW

– Overhead << standard dev. of runtime for unaffected regions of disk space

– Overhead << standard dev. of runtime for affected regions

– Averaged over 250 accesses

                     Avg. (ms)   Std. Dev.
No injection         3.025       0.075
Unaffected region    3.020       0.076
Affected region
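
As a quick arithmetic check of the "overhead << standard dev." claim, the two complete rows of the table can be compared directly. The variable names are illustrative, and the affected-region row is left out because its values are not given.

```python
# Values taken from the table above (milliseconds).
no_injection_avg, no_injection_std = 3.025, 0.075
unaffected_avg, unaffected_std = 3.020, 0.076

# Difference between the two averages, and its size relative to the spread.
diff = abs(unaffected_avg - no_injection_avg)
ratio = diff / no_injection_std

print(f"difference = {diff:.3f} ms, {ratio:.1%} of the no-injection std. dev.")
```

The averages differ by about 0.005 ms, well under a tenth of the 0.075 ms standard deviation, so any injection overhead on unaffected regions is lost in the run-to-run variation.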


Outline

1. Introduction, Motivation, & Challenges

2. Related Work

3. Implementation Details

4. Fault Model

5. Methods & Evaluation

6. Summary


Summary

Present five new failure models for disk accesses, and the ability to inject them

Verified fault manifestation

– Did not verify potential side effects ?

Fault injection has no noticeable effect on access times

– Small SW overhead much smaller than access time to physical device