Characterization of Pathological Behavior

http://www.ices.cmu.edu/ballista

Philip Koopman - (412) 268-5225

Dan Siewiorek - (412) 268-2570

(and more than a dozen other contributors)

Goals
• Detect pathological patterns for fault prognosis
• Develop fault propagation models
• Develop statistical identification and stochastic characterization of pathological phenomena

Outline
• Definitions
• Digital Hardware Prediction
• Digital Software Characterization
• Research Challenges

Definitions: Cause-Effect Sequence and Duration
FAULT - incorrect state of hardware/software caused by component failure, environment, operator errors, or incorrect design
ERROR - manifestation of a fault within a program or data structure
FAILURE - service deviates from specified service due to an error
DURATION
• Permanent - continuous and stable due to hardware failure; repair by replacement
• Intermittent - occasionally present due to unstable hardware or varying hardware/software state; repair by replacement
• Transient - resulting from design errors or temporary environmental conditions; not repairable by replacement

CMU Andrew File Server Study
Configuration
• 13 SUN II Workstations with 68010 processor
• 4 Fujitsu Eagle Disk Drives
Observations
• 21 Workstation Years
Frequency of events
• Permanent Failures 29
• Intermittent Faults 610
• Transient Faults 446
• System Crashes 298
Mean Time To
• Permanent Failures 6552 hours
• Intermittent Faults 58 hours
• Transient Faults 354 hours
• System Crash 689 hours

Some Interesting Numbers
Permanent Outages/Total Crashes = 0.1
Intermittent Faults/Permanent Failures = 21
• Thus first symptom appears over 1200 hours prior to repair
(Crashes - Permanent)/Total Faults = 0.255
14/29 failures had three or fewer error log entries
• 8/29 had no error log entries
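The ratios above can be recomputed from the event counts and mean times reported for the Andrew File Server study. The C sketch below does that arithmetic; the struct and function names are illustrative, and reading "Total Faults" as intermittent plus transient faults is an assumption (it is the reading that reproduces 0.255).

```c
#include <assert.h>

/* Event counts from the Andrew File Server study slide. */
typedef struct {
    int permanent_failures;      /* 29  */
    int intermittent_faults;     /* 610 */
    int transient_faults;        /* 446 */
    int system_crashes;          /* 298 */
    double intermittent_mean_h;  /* 58 hours, mean time to an intermittent fault */
} study_counts;

/* Permanent outages as a fraction of all crashes. */
double permanent_per_crash(const study_counts *s)
{
    assert(s->system_crashes > 0);
    return (double) s->permanent_failures / s->system_crashes;
}

/* Intermittent faults observed per permanent failure. */
double intermittent_per_permanent(const study_counts *s)
{
    assert(s->permanent_failures > 0);
    return (double) s->intermittent_faults / s->permanent_failures;
}

/* Rough warning time: faults per failure times the mean time between them. */
double warning_hours(const study_counts *s)
{
    return intermittent_per_permanent(s) * s->intermittent_mean_h;
}

/* Crashes not explained by permanent failures, per fault.
 * Assumption: "Total Faults" = intermittent + transient. */
double other_crashes_per_fault(const study_counts *s)
{
    int faults = s->intermittent_faults + s->transient_faults;
    assert(faults > 0);
    return (double) (s->system_crashes - s->permanent_failures) / faults;
}
```

With the slide's numbers (29, 610, 446, 298, and 58 hours) these evaluate to roughly 0.097, 21.0, 1220 hours, and 0.255.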


Harbinger Detection of Anomalies


Digital Hardware Prediction

Measurement and Prediction Module
• History Collection -- calculation and reporting of system availability
• Future Prediction -- failure prediction of system devices
[Figure: block diagram with boxes labeled Operating System, Measurement & Prediction Module (History Collection, Future Prediction), and User Application Program]

History Collection
This module consists of:
• Crash Monitor - monitors system state
• Calculator - computes average uptime and the average of the uptime fraction, which gives availability:
$\text{Availability} = \frac{\text{uptime}}{\text{uptime} + \text{downtime}}$
[Figure: block diagram with boxes labeled Operating System, Crash Monitor, Files of system state info, Uptime (fraction) Calculator, Files of uptime (fraction) information, and User Application Program]
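A minimal C sketch of what the Calculator might compute, assuming each recorded cycle carries one uptime and one downtime value in minutes. The `cycle` type and the function names are hypothetical, and taking "average of fraction" to mean the mean of the per-cycle fractions is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* One up/down cycle as recorded by the crash monitor, in minutes. */
typedef struct {
    double uptime_min;
    double downtime_min;
} cycle;

/* Availability of a single cycle: uptime / (uptime + downtime). */
double availability(double uptime, double downtime)
{
    assert(uptime >= 0.0 && downtime >= 0.0 && uptime + downtime > 0.0);
    return uptime / (uptime + downtime);
}

/* Mean uptime over n recorded cycles. */
double average_uptime(const cycle *cycles, size_t n)
{
    assert(cycles != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += cycles[i].uptime_min;
    return sum / (double) n;
}

/* Mean of the per-cycle uptime fractions (assumed reading of
 * "average of fraction"). */
double average_fraction(const cycle *cycles, size_t n)
{
    assert(cycles != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += availability(cycles[i].uptime_min, cycles[i].downtime_min);
    return sum / (double) n;
}
```

For instance, a cycle with 600 minutes up and 13 minutes down has an availability of about 0.979.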

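Crash Monitor
• Periodically samples system state (interval = 5 min)
• Records changes of system state: up -> down (crash), down -> up (reboot)
Average uptime example:
• uptime' = t2 - t1 = 600 min
• downtime' = t3 - t2 = 13 min
• real uptime: within one sampling interval (5 min) of the measured 600 min
[Figure: timeline of system state (up/down) with crash and reboot marked and sample times t1, t2, t3]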
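The measurement error introduced by periodic sampling can be made explicit. The C sketch below assumes t1 is the start of the up period, t2 is the last sample at which the system was seen up, and t3 is the time of the reboot; that interpretation, and all names in the code, are assumptions rather than details given on the slide.

```c
#include <assert.h>

typedef struct {
    double uptime_est_min;    /* uptime'   = t2 - t1 */
    double downtime_est_min;  /* downtime' = t3 - t2 */
    double uptime_max_min;    /* upper bound on the real uptime */
} cycle_estimate;

/* Estimate one up/down cycle from sampled timestamps (in minutes).
 * The crash happened after the last "up" sample at t2 but before the
 * next sample would have been taken, so the real uptime lies between
 * uptime' and uptime' + interval. */
cycle_estimate estimate_cycle(double t1, double t2, double t3,
                              double interval_min)
{
    assert(t1 <= t2 && t2 <= t3 && interval_min > 0.0);
    cycle_estimate e;
    e.uptime_est_min = t2 - t1;
    e.downtime_est_min = t3 - t2;
    e.uptime_max_min = e.uptime_est_min + interval_min;
    return e;
}
```

Calling `estimate_cycle(0.0, 600.0, 613.0, 5.0)` gives an estimated uptime of 600 min, an estimated downtime of 13 min, and a real uptime of at most 605 min. Under the same assumption, the estimated downtime overstates the real downtime by the same slack.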

Preliminary Experiment Data (cont.)
[Figure: cumulative daily availability report for an NT system over a 5-month period; x-axis: time/date; y-axis: availability number, ticks 0 to 120]

Future Prediction
This module generates device failure warning information.
• Sys-log Monitor: monitors new entries by checking the system event log periodically.
• DFT (Dispersion Frame Technique) Engine: applies the DFT heuristic and issues the corresponding device failure warning if its rules are satisfied.
[Figure: block diagram with boxes labeled Operating System, Error Log, Sys-log Monitor, DFT Engine, Files of device failure warning, and User Application Program]

Principle from observation: periods of increasingly unreliable behavior precede catastrophic failure.
[Figure: error log entries filtered by event type (disk, mem, CPU) and plotted against time, with the disk repair, memory board repair, and CPU repair marked]
Error entry example (the original figure calls out its type and time fields):
DISK:9/180445/563692570/829000:errmsg:xylg:syc:cmd6:reset failed (drive not ready) blk 0
Based on this observation, the DFT heuristic was derived to detect non-monotonically decreasing inter-arrival times.

How DFT Works
Example rule: if a sliding window of 1/2 of the current error interval, applied twice in succession, covers 3 errors in the future each time, issue a warning.
[Figure: timeline t showing the last 5 errors of the same type (disk), labeled i-4, i-3, i-2, i-1, i, followed by the warning]
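One possible reading of the example rule, sketched in C as an offline replay over a recorded, time-sorted log of errors of a single type. The window length is half the interval between error i-1 and error i; it is applied starting at error i and again at error i+1, and the rule fires when each application covers at least 3 later errors. Where exactly the window is anchored and whether the anchoring error itself counts are assumptions of this sketch, not details stated on the slide; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Count errors strictly after times[anchor] that fall within `window`
 * time units of it. `times` must be sorted in ascending order. */
static size_t errors_in_window(const double *times, size_t n,
                               size_t anchor, double window)
{
    size_t count = 0;
    for (size_t j = anchor + 1;
         j < n && times[j] - times[anchor] <= window; ++j)
        ++count;
    return count;
}

/* Returns 1 if the example rule fires for the interval ending at
 * error i, and 0 otherwise (including when there is too little data). */
int dft_rule_fires(const double *times, size_t n, size_t i)
{
    assert(times != NULL);
    if (i == 0 || i + 1 >= n)
        return 0;
    double window = (times[i] - times[i - 1]) / 2.0;
    return errors_in_window(times, n, i, window) >= 3 &&
           errors_in_window(times, n, i + 1, window) >= 3;
}
```

With timestamps 0, 100, 110, 120, 130, 140, 150 and i = 1, the window is 50 time units and both applications cover at least 3 errors, so the rule fires. With evenly spaced timestamps 0, 100, 200, ... it does not, which matches the intent of flagging shrinking inter-arrival times.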


Digital Software Characterization

Where We Started: Component Wrapping
• Improve Commercial Off-The-Shelf (COTS) software robustness

Exception Handling: The Basis for Error Detection
Exception handling is an important part of dependable systems
• Responding to unexpected operating conditions
• Tolerating activation of latent design defects
Robustness testing can help evaluate software dependability
• Reaction to exceptional situations (current results)
• Reaction to overloads and software “aging” (future results)
• First big objective: measure exception handling robustness
  – Apply to operating systems
  – Apply to other applications
It’s difficult to improve something you can’t measure … so let’s figure out how to measure robustness!

Measurement Part 1: Software Testing
SW Testing requires:            Ballista uses:
• Test case                     “Bad” value combinations
• Module under test             Module under Test
• Oracle (a “specification”)    Watchdog timer/core dumps
[Figure: INPUT SPACE (valid inputs, invalid inputs) feeds the MODULE UNDER TEST, which maps to the RESPONSE SPACE (specified behavior: should work, should return error; undefined; robust operation; reproducible failure; unreproducible failure)]

Ballista: Scalable Test Generation
API: write(int filedes, const void *buffer, size_t nbytes)
TESTING OBJECTS and their TEST VALUES
• FILE DESCRIPTOR TEST OBJECT: FD_CLOSED, FD_OPEN_READ, FD_OPEN_WRITE, FD_DELETED, FD_NOEXIST, FD_EMPTY_FILE, FD_PAST_END, FD_BEFORE_BEG, FD_PIPE_IN, FD_PIPE_OUT, FD_PIPE_IN_BLOCK, FD_PIPE_OUT_BLOCK, FD_TERM, FD_SHM_READ, FD_SHM_RW, FD_MAXINT, FD_NEG_ONE
• MEMORY BUFFER TEST OBJECT: BUF_SMALL_1, BUF_MED_PAGESIZE, BUF_LARGE_512MB, BUF_XLARGE_1GB, BUF_HUGE_2GB, BUF_MAXULONG_SIZE, BUF_64K, BUF_END_MED, BUF_FAR_PAST, BUF_ODD_ADDR, BUF_FREED, BUF_CODE, BUF_16, BUF_NEG_ONE, BUF_NULL
• SIZE TEST OBJECT: SIZE_1, SIZE_16, SIZE_PAGE, SIZE_PAGEx16, SIZE_PAGEx16plus1, SIZE_MAXINT, SIZE_MININT, SIZE_ZERO, SIZE_NEG
TEST CASE: write(FD_OPEN_RD, BUFF_NULL, SIZE_16)
Ballista combines test values to generate test cases
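Combining test values into test cases amounts to enumerating the cross product of the per-parameter value pools. The C sketch below shows one way to do that by treating a test-case number as a mixed-radix index; it is not Ballista's actual generator, and the `test_object` type, function names, and `MAX_PARAMS` limit are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PARAMS 8   /* arbitrary limit for this sketch */

/* A test object: the pool of named test values for one parameter type. */
typedef struct {
    const char *const *values;
    size_t count;
} test_object;

/* Number of test cases = product of the pool sizes. */
size_t count_test_cases(const test_object *params, size_t nparams)
{
    size_t total = 1;
    for (size_t p = 0; p < nparams; ++p)
        total *= params[p].count;
    return total;
}

/* Append src to the string in dst; -1 if it would overflow dstlen bytes. */
static int append(char *dst, size_t dstlen, const char *src)
{
    size_t used = strlen(dst), need = strlen(src);
    if (used + need + 1 > dstlen)
        return -1;
    memcpy(dst + used, src, need + 1);
    return 0;
}

/* Format test case number `index` (0-based) as "func(V0, V1, ...)".
 * The index is decoded as a mixed-radix number in which the last
 * parameter varies fastest. Returns 0 on success, -1 on error. */
int format_test_case(const char *func, const test_object *params,
                     size_t nparams, size_t index,
                     char *out, size_t outlen)
{
    size_t pick[MAX_PARAMS];
    assert(func != NULL && out != NULL);
    if (outlen == 0 || nparams > MAX_PARAMS ||
        index >= count_test_cases(params, nparams))
        return -1;
    for (size_t p = nparams; p-- > 0;) {
        pick[p] = index % params[p].count;
        index /= params[p].count;
    }
    out[0] = '\0';
    if (append(out, outlen, func) || append(out, outlen, "("))
        return -1;
    for (size_t p = 0; p < nparams; ++p) {
        if (p > 0 && append(out, outlen, ", "))
            return -1;
        if (append(out, outlen, params[p].values[pick[p]]))
            return -1;
    }
    return append(out, outlen, ")");
}
```

Because the pools belong to data types rather than to functions, the same pool can be reused for every function that takes a parameter of that type, which is the sense in which the approach scales.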

Ballista: “High Level” + “Repeatable”
High level testing is done using API to perform fault injection
• Send exceptional values into a system through the API
  – Requires no modification to code -- only linkable object files needed
  – Can be used with any function that takes a parameter list
• Direct testing instead of middleware injection simplifies usage
Each test is a specific function call with a specific set of parameters
• System state initialized & cleaned up for each single-call test
• Combinations of valid and invalid parameters tried in turn
• A “simplistic” model, but it does in fact work...
Early results were encouraging:
• Found a significant percentage of functions with robustness failures
• Crashed systems from user mode
The testing object-based approach scales!

CRASH Robustness Testing Result Categories
Catastrophic
• Computer crashes/panics, requiring a reboot
• e.g., Irix 6.2: munmap(malloc((1<<30)+1), ((1<<31)-1));
• e.g., DUNIX 4.0D: mprotect(malloc((1 << 29)+1), 65537, 0);
Restart
• Benchmark process hangs, requiring restart
Abort
• Benchmark process aborts (e.g., “core dump”)
Silent
• No error code generated, when one should have been (e.g., de-referencing null pointer produces no error)
Hindering
• Incorrect error code generated
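Two of these categories, Abort and Restart, can be observed from outside the process under test. The C sketch below runs a single call in a child process on a POSIX system, uses `alarm()` as a watchdog timer, and classifies the outcome from the child's exit status. It is an illustration under those assumptions, not Ballista's harness: the type and function names are hypothetical, Catastrophic failures take down the machine and so cannot be reported this way, and Silent and Hindering failures need knowledge of the expected error code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef enum {
    RESULT_PASS,          /* call returned; no Abort or Restart observed */
    RESULT_ABORT,         /* child was killed by a signal (e.g., SIGSEGV) */
    RESULT_RESTART,       /* child hung until the watchdog fired */
    RESULT_HARNESS_ERROR  /* fork() or waitpid() itself failed */
} test_result;

/* Run one call in a fresh child process and classify what happened. */
test_result run_single_test(void (*call)(void), unsigned timeout_s)
{
    assert(call != NULL && timeout_s > 0);
    pid_t pid = fork();
    if (pid < 0)
        return RESULT_HARNESS_ERROR;
    if (pid == 0) {
        alarm(timeout_s);  /* watchdog: default SIGALRM action kills the child */
        call();
        _exit(0);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return RESULT_HARNESS_ERROR;
    if (WIFSIGNALED(status))
        return WTERMSIG(status) == SIGALRM ? RESULT_RESTART : RESULT_ABORT;
    return RESULT_PASS;
}

/* Hypothetical stand-ins for a module under test. */
void example_ok(void) { }
void example_abort(void) { abort(); }
void example_hang(void) { for (;;) pause(); }
```

Running each call in its own child process also gives every single-call test a fresh process state, so one test's damage does not leak into the next.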


Digital Unix 4.0 Results

Comparing Fifteen POSIX Operating Systems
[Figure: Ballista Robustness Tests for 233 POSIX Function Calls; bar chart of normalized failure rate (0% to 25%), split into Abort Failures and Restart Failures, for AIX 4.1, FreeBSD 2.2.5, HP-UX 9.05, HP-UX 10.20, Irix 5.3, Irix 6.2, Linux 2.0.18, LynxOS 2.4.0, NetBSD 1.3, OSF1 3.2, OSF1 4.0, QNX 4.22, QNX 4.24, SunOS 4.1.3, and SunOS 5.5; some bars are annotated "1 Catastrophic" or "2 Catastrophics"]


Failure Rates By POSIX Fn/Call Category

C Library Is A Potential Robustness Bottleneck
[Figure: Portions of Failure Rates Due To System/C-Library; bar chart of normalized failure rate (0% to 25%), split into C Library and System Calls, for AIX 4.1, FreeBSD 2.2.5, HP-UX 9.05, HP-UX 10.20, Irix 5.3, Irix 6.2, Linux 2.0.18, LynxOS 2.4.0, NetBSD 1.3, OSF1 3.2, OSF1 4.0, QNX 4.22, QNX 4.24, SunOS 4.1.3, and SunOS 5.5; one bar is annotated "2 Catastrophics"]


Failure Rates by Function Group

Technology Transfer
Original project sponsor: DARPA
• Sponsored technology transfer projects for:
  – Trident Submarine navigation system (U.S. Navy)
  – Defense Modeling & Simulation Office HLA system
Industrial sponsors are continuing the work
• Cisco – Network switching infrastructure
• ABB – Industrial automation framework
• Emerson – Windows CE testing
• AT&T – CORBA testing
• ADtranz – (defining project)
• Microsoft – Windows 2000 testing
Other users include
• Rockwell, Motorola, and, potentially, some POSIX OS developers

Specifying A Test (web/demo interface)
• Simple demo interface; real interface has a few more steps...

Viewing Results
• Each robustness failure is one test case (one set of parameters)

“Bug Report” program creation
• Reproduces failure in isolation (>99% effective in practice)

/* Ballista single test case Sun Jun 13 14:11:06 1999
 * fopen(FNAME_NEG, STR_EMPTY) */
...
const char *str_empty = "";
...
param0 = (char *) -1;
str_ptr = (char *) malloc (strlen (str_empty) + 1);
strcpy (str_ptr, str_empty);
param1 = str_ptr;
...
fopen (param0, param1);


Research Challenges

Research Challenges
• Ballista provides a small, discrete state-space for software components
• Challenge is to create models of inter-module relations and workload statistics to create predictions
• Create discrete simulations using model and probabilities as input parameters
• Validation of model at a high level of abstraction through experimentation on testbed
• Optimize cost/performance

Contributors
What does it take to do this sort of research?
• A legacy of 15 years of previous Carnegie Mellon work to build upon
  – But, sometimes it takes that long just to understand the real problems!
• Ballista: 3.5 years and about $1.6 Million spent to date
Students: Meredith Beveridge, John Devale, Kim Fernsler, David Guttendorf, Geoff Hendrey, Nathan Kropp, Jiantao Pan, Charles Shelton, Ying Shi, Asad Zaidi
Faculty & Staff: Kobey DeVale, Phil Koopman, Roy Maxion, Dan Siewiorek
