Lecture 3: Random Testing
Today

Random testing
• Start off with a practical look, and some useful ideas to get you started on the project: random testing for file systems
• Then take a deeper look at the notion of feedback and why it is useful: a method for testing OO systems from ICSE a couple of years ago
• Then back out to take a look at the general idea of random testing, if time permits
A Little Background

Random testing
• Generate program inputs at random
• Drawn from some (possibly changing) probability distribution
• "Throw darts at the state space, without drawing a bullseye"
• May generate the same test (or equivalent tests) many times
• Will perform operations no sane human would ever perform
A Somewhat Random Tester (Last Week)

#define N 5 /* 5 is "big enough"? */

int testFind() {
  int a[N];
  int p, i;
  for (p = 0; p < N; p++) {
    random_assign(a, N);            /* fill a[] with random values */
    a[p] = 3;                       /* plant the target value at index p */
    for (i = p + 1; i < N; i++) {   /* clear any 3s after p, so p is the last */
      if (a[i] == 3)
        a[i] = a[i] - 1;
    }
    printf("TEST: findLast({");
    print_array(a, N);
    printf("}, %d, 3)", N);
    assert(findLast(a, N, 3) == p);
  }
  return 0;
}
A Considerably More Random Tester

#define N 50 /* 50 is "big enough"? */

int testFind() {
  int a[N];
  int p, x, n, i, j;
  for (i = 0; i < NUM_TESTS; i++) {
    pick(n, 0, N);                  /* array length */
    pick(x, INT_MIN, INT_MAX);      /* target value: full int range */
    pick(p, -1, n - 1);             /* last position of x; -1 = not present */
    random_assign(a, n);
    if (p != -1) {
      a[p] = x;
    }
    for (j = p + 1; j < n; j++) {   /* ensure no x after position p */
      if (a[j] == x)
        a[j] = a[j] - 1;
    }
    printf("TEST: findLast({");
    print_array(a, n);
    printf("}, %d, %d) with item at %d", n, x, p);
    assert(findLast(a, n, x) == p);
  }
  return 0;
}
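The slides never show pick, random_assign, or print_array. Here is a minimal sketch of what such helpers might look like; the names come from the slides, but the bodies (including the value range used in random_assign) are assumptions:

#include <stdio.h>
#include <stdlib.h>

/* pick(var, lo, hi): set var to a roughly uniform value in [lo, hi].
 * A macro, so it can assign to the caller's variable as the slides do. */
#define pick(var, lo, hi) \
  ((var) = (int)((double)(lo) + rand() / (RAND_MAX + 1.0) * \
                 ((double)(hi) - (double)(lo) + 1.0)))

/* Fill the first n slots of a[] with random ints (range is arbitrary). */
static void random_assign(int a[], int n) {
  for (int i = 0; i < n; i++)
    pick(a[i], -1000, 1000);
}

/* Print a[0..n-1] as a comma-separated list. */
static void print_array(int a[], int n) {
  for (int i = 0; i < n; i++)
    printf(i ? ",%d" : "%d", a[i]);
}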
Fuzz Testing

One night (it was a dark and stormy night) in 1990, Bart Miller (U. Wisc.) was logged in over dialup
• There was a lot of line noise due to the storm
• His shell and editors kept crashing
• This gave him an idea…
Fuzz Testing

Bart Miller et al., "An Empirical Study of the Reliability of UNIX Utilities"
• Idea: feed "fuzz" (streams of pure randomness, noise from /dev/urandom pretty much) to OS & utility code
• Watch it break!
• In 1990, could crash 25-33% of utilities
• Reports every few years since then
• Some of the bugs are the same ones in common security exploits (particularly buffer overruns)
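In the spirit of Miller's experiment, a minimal fuzz driver is easy to write. The sketch below simply pipes bytes from /dev/urandom into a target command's stdin and reports how the target exits; the default target and the 100 KB budget are arbitrary choices, not part of the original study:

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    const char *target = (argc > 1) ? argv[1] : "cat >/dev/null";
    signal(SIGPIPE, SIG_IGN);               /* survive the target dying early */
    FILE *rnd = fopen("/dev/urandom", "rb");
    FILE *out = popen(target, "w");
    if (!rnd || !out) { perror("setup"); return 1; }

    for (long i = 0; i < 100000; i++) {     /* 100 KB of pure noise */
        int c = fgetc(rnd);
        if (fputc(c, out) == EOF) break;    /* target closed its stdin */
    }
    fclose(rnd);
    int status = pclose(out);               /* nonzero often means a crash */
    printf("target exited with status %d\n", status);
    return 0;
}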
Random Testing for Good & Evil

Fuzzers
• Tools that send malformed/random input to a program and hope to crash it or find a security hole
• Firefox is internally using random testing to find (security) problems
• One developer I know says they aren't publishing much because it would be too useful to the bad guys
• Fuzzing is useful for finding bugs to protect programs ("white hat" work)
• But also for finding bugs to hack into systems ("black hat")!
The Problem at JPL

Testing is the net that JPL uses to catch software errors before they show up in mission operation
• Last line of defense – if a bug gets through, it can mean mission failure

Traditional software testing nets have big holes
The Problem at JPL

Most mission testing is integration testing of nominal scenarios:
• Very thorough checks that when expected things happen, other expected things happen – including fault protection (expected unexpected things)
• Unfortunately, when the unexpected unexpected happens…
The Problem at JPL

Nominal (or stress) integration testing relies on expensive and slow radiation-hardened flight hardware
• Lots of competition for limited computational resources
• Computationally infeasible to make use of statistical approaches, such as random testing
Building Better Nets

Thorough file system testing is a pilot effort to improve software testing at JPL
• Reduce bugs found at the final system I&T level – or in operation – by more effective early use of computational power on core modules of flight software
• Exploit models and reference implementations to reduce developer & tester effort
Flash File System Testing

We (LaRS) are developing a file system for mission use (NVFS)
• A key JPL mission component
• Problems with previous file systems used in missions (MER flash anomaly, others I can't tell you about here)
• If bugs in our code show up in flight, JPL loses, science loses, etc.

High reliability is critical:
• Must preserve integrity of data
  • in presence of arbitrary system resets
  • in presence of hardware failures

How do we thoroughly test such a module?
Quick Primer: NAND Flash

Before we continue, the tested system in a bit more detail.

Flash memory is a set of blocks
• A block is a set of pages
• A page can be written once; read many times
• A page must be erased before it can be re-written
• The erase unit is a full block of pages

[Figure: a block of pages; each PAGE WRITE consumes a free page and obsoletes old data, leaving "dirty" pages; after more writes, a BLOCK ERASE reclaims the whole block. Legend: used page, free page, "dirty" page.]
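To make the write-once/erase-by-block rules concrete, here is a minimal sketch of a NAND flash model in C. All names and sizes are illustrative, not the actual NVFS simulation layer:

#include <string.h>

#define BLOCKS 8
#define PAGES_PER_BLOCK 4
#define PAGE_BYTES 256

typedef enum { FREE, USED, DIRTY } page_state;

/* Static arrays zero-initialize, so every page starts FREE. */
static page_state state[BLOCKS][PAGES_PER_BLOCK];
static unsigned char data[BLOCKS][PAGES_PER_BLOCK][PAGE_BYTES];

/* Write a page: only legal if it has not been written since the last erase. */
int page_write(int b, int p, const unsigned char *buf) {
    if (state[b][p] != FREE)
        return -1;                      /* write-once violation */
    memcpy(data[b][p], buf, PAGE_BYTES);
    state[b][p] = USED;
    return 0;
}

/* Mark a page obsolete, e.g., because its data was rewritten elsewhere. */
void page_obsolete(int b, int p) {
    if (state[b][p] == USED)
        state[b][p] = DIRTY;
}

/* Erase: the erase unit is the full block, never a single page. */
void block_erase(int b) {
    for (int p = 0; p < PAGES_PER_BLOCK; p++)
        state[b][p] = FREE;
}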
The Goals

Randomize early testing (since it is not possible to be exhaustive)
• We don't know where the bugs are

[Figure: nominal scenario tests vs. randomized testing, as coverage of the behavior space.]
Random testing

• Simulated flash hardware layer allows random fault injection
• Most development/early testing can be done on workstations
• Lots of available compute power – can cover many system behaviors
• Will stress software in ways nominal testing will not
The Goals

Automate early testing
• Run tests all the time, in the background, while continuing development efforts

Automate test evaluation
• Using reference systems for fault detection and diagnosis
• Automated test minimization techniques to speed debugging and increase regression test effectiveness

Automate fault injection
• Simulate hardware failures in a controlled test environment
The Goals

Make use of desktop hardware for early testing – vs. expensive (sloooow) flight hardware testbeds
• Many faults can be exposed without full bit-level hardware simulation
Traditional Testing

Limited, fixed unit tests by developers

Nominal scenarios on hardware testbeds
• Small number of scenarios, due to limited resources
• Test engineers inspect results manually
• Limited fault injection capability (reset means manually hitting the "red button")

[Figure: one test engineer; a day of testing covers a handful of scenarios.]
Random Testing

Millions of operations and scenarios, automatically generated

Run on fast & inexpensive workstations

Results checked automatically by a reference oracle

Hardware simulation for fault injection and reset simulation

[Figure: a day (& night) of testing, multiplied by ×100,000.]
Differential Testing

How can we tell if a test succeeds?
• POSIX standard for file system operations
• IEEE-produced, ANSI/ISO-recognized standard for file systems
• Defines operations and what they should do/return, including nominal and fault behavior

POSIX operation               Result
mkdir("/eng", …)              SUCCESS
mkdir("/data", …)             SUCCESS
creat("/data/image01", …)     SUCCESS
creat("/eng/fsw/code", …)     ENOENT
mkdir("/data/telemetry", …)   SUCCESS
unlink("/data/image01")       SUCCESS

[Figure: the resulting file system tree – / contains /eng and /data; /data contains /telemetry; image01 was created and then unlinked.]
Differential Testing

How can we tell if a test succeeds?
• The POSIX standard specifies (mostly) what correct behavior is
• We have heavily tested implementations of the POSIX standard in every flavor of UNIX, readily available to us
• We can use UNIX file systems (ext3fs, tmpfs, etc.) as reference systems to verify the correct behavior of flash
• The first differential approach (published) was McKeeman's testing for compilers
Random Differential Testing

[Flowchart: the test loop.]
• Choose a (POSIX) operation F
• Perform F on NVFS; perform F on the reference (if applicable)
• Compare return values; compare error codes; compare file systems; check invariants
• (Inject a fault?) and repeat
Testing a File System

Use a simulation layer to imitate flash hardware, operating at RAM-disk speed
• I.e., much faster than the real flight hardware
• Making large-scale random testing possible

The simulation layer provides the same interface as the real hardware driver

The simulation layer provides the ability to inject faults: bad blocks, system resets, read failures (see the sketch below)
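A sketch of what such a fault-injecting simulation layer might look like, building on the flash model sketched earlier. The trap counter, bad-block table, and all names are illustrative assumptions:

#include <setjmp.h>

#define BLOCKS 8

static jmp_buf reset_point;           /* the harness resumes here on "reset" */
static long writes_until_reset = -1;  /* -1: no reset scheduled */
static int bad_block[BLOCKS];         /* nonzero: writes to that block fail */

/* Stand-in for the real driver's page write. */
static int real_page_write(int b, int p, const unsigned char *buf) {
    (void)b; (void)p; (void)buf;
    return 0;
}

void schedule_reset(long after_writes) { writes_until_reset = after_writes; }

/* Same interface as the real hardware driver, plus injected faults. */
int sim_page_write(int b, int p, const unsigned char *buf) {
    if (writes_until_reset >= 0 && writes_until_reset-- == 0)
        longjmp(reset_point, 1);      /* power loss: this write never lands */
    if (bad_block[b])
        return -1;                    /* injected hardware write failure */
    return real_page_write(b, p, buf);
}

/* Harness shape (sketch):
 *   if (setjmp(reset_point)) remount_and_compare_contents();
 *   else { schedule_reset(random_count()); run_random_operations(); }
 */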
Random Differential Testing

Choose file system operations randomly
• Include standard POSIX calls + other operations (mount, unmount, format)
• Bias choice by a (coarse) model of file system contents, but allow failing operations
• Akin to randomized testing with feedback (Pacheco et al., ICSE 07)

Perform on both systems:

fs_fd = nvfs_creat("/dp/images/img019", ctime);
ref_fd = creat("/dp/images/img019", …);

Compare return values. Compare error codes. Compare file systems. Check invariants. (A sketch of one such step follows.)
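One differential step might look like the following sketch. nvfs_creat, nvfs_errno, and compare_file_systems stand in for the real NVFS interfaces, and a real harness would point the reference creat at a separate scratch directory:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>

extern int nvfs_creat(const char *path, long ctime);  /* system under test */
extern int nvfs_errno(void);                          /* its last error code */
extern int compare_file_systems(void);                /* walk & diff both trees */

int differential_creat(const char *path, long ctime) {
    int fs_ret  = nvfs_creat(path, ctime);
    int fs_err  = nvfs_errno();
    int ref_ret = creat(path, 0644);                  /* reference: POSIX */
    int ref_err = errno;

    /* Compare success/failure, and on failure the error codes. */
    if ((fs_ret < 0) != (ref_ret < 0)) {
        printf("DIVERGENCE on creat(%s): ret %d vs %d\n", path, fs_ret, ref_ret);
        return 0;
    }
    if (fs_ret < 0 && fs_err != ref_err) {
        printf("DIVERGENCE on creat(%s): errno %d vs %d\n", path, fs_err, ref_err);
        return 0;
    }
    return compare_file_systems();                    /* deep structural check */
}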
Feedback: How to Pick a Path Name

Full random generator: picks a path of length up to n, from fixed components, e.g.:
• /alpha/beta/beta/gamma/alpha

History-based generator: picks a random path from a list of all paths that have ever been created
• With some probability of adding an extra random component

(A sketch follows the next slide.)
Feedback: How to Pick a Path Name

[Flowchart: full random? If yes, pick a length n and append n random components. If no, pick a path from the history (e.g., /delta/delta/alpha, /beta/beta/alpha) and optionally append one random component. Return the chosen path, e.g., /alpha/alpha/beta/gamma/delta/delta/gamma/beta…]

Tune P(full random) to balance the chance of useful operations with the ability to catch unlikely faults.

Note that no operation should ever succeed on a path that can't be produced from history plus one extra component.
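Here is a sketch of the history-based generator with feedback described on these two slides. The component names match the examples, but the probabilities, limits, and data structures are illustrative assumptions:

#include <stdlib.h>
#include <string.h>

#define MAX_PATH 256
#define MAX_HISTORY 1024

static const char *components[] = { "alpha", "beta", "gamma", "delta" };
static char history[MAX_HISTORY][MAX_PATH];  /* every path ever created */
static int history_len;

static void append_random_component(char *path) {
    int n = (int)(sizeof components / sizeof components[0]);
    if (strlen(path) + 8 >= MAX_PATH) return;   /* crude length guard */
    strcat(path, "/");
    strcat(path, components[rand() % n]);
}

void pick_path(char *out, double p_full_random, double p_extra) {
    out[0] = '\0';
    if (history_len == 0 || (double)rand() / RAND_MAX < p_full_random) {
        int n = 1 + rand() % 5;                 /* full random: up to 5 parts */
        for (int i = 0; i < n; i++)
            append_random_component(out);
    } else {
        strcpy(out, history[rand() % history_len]);  /* feedback: reuse */
        if ((double)rand() / RAND_MAX < p_extra)
            append_random_component(out);       /* …plus one fresh component */
    }
}

/* Call after a creat/mkdir succeeds, so future operations can find the path. */
void record_path(const char *path) {
    if (history_len < MAX_HISTORY)
        strcpy(history[history_len++], path);
}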
Fault Injection Example: Reset

A test with a random reset scheduled:

fs_fd = nvfs_creat("/dp/images/img019", ctime);
ref_fd = creat("/dp/images/img019", …);
Compare return values. Compare error codes. Compare file systems. Check invariants.

fs_ret = nvfs_mkdir("/dp/images/old", ctime);

[Flowchart: did a reset occur during the operation?
• No: perform ref_ret = mkdir("/dp/images/old", …); then, as with no resets, compare return values, error codes, and file systems, and check invariants.
• Yes: restart/remount NVFS, then compare file system contents. Do they match? If yes, the reset took place after the commit; if no, the reset occurred before the commit.]
Stress Testing

Bugs live in the corner cases, i.e.:
• File system is (running) out of space
• High rate of bad blocks

Use a small virtual flash device to test for these conditions: 6-13 blocks, 4 pages per block, 200-400 bytes per page

[Figure legend: used page, free page, dirty page, bad block.]
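A sketch of how a test run might randomize such a small geometry, using the ranges quoted above (the struct and function names are assumptions):

#include <stdlib.h>

struct flash_geometry {
    int blocks;           /* 6..13 */
    int pages_per_block;  /* fixed at 4 */
    int bytes_per_page;   /* 200..400 */
};

/* Pick a deliberately tiny geometry so out-of-space and bad-block
 * corner cases are reached often within a single test run. */
struct flash_geometry random_small_geometry(void) {
    struct flash_geometry g;
    g.blocks          = 6 + rand() % 8;      /* 6..13 */
    g.pages_per_block = 4;
    g.bytes_per_page  = 200 + rand() % 201;  /* 200..400 */
    return g;
}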
Part of a Typical Random Test

5::  (creat /gamma) = 0 *success*
6::  (rename /gamma /gamma) *EBUSY*
7::  (rename /gamma /gamma) *EBUSY*
8::  (truncate /gamma offset 373) *EOPNOTSUPP*
9::  (rmdir /gamma) *ENOTDIR*
10:: (unlink /gamma) *success*
11:: (open /gamma O_RDWR(2)) *ENOENT*
12:: (open /gamma O_RDWR|O_APPEND(1026)) *ENOENT*
13:: (open /gamma O_RDONLY|O_CREAT|O_EXCL) *success*
14:: (rmdir /gamma) *ENOTDIR*
15:: (creat /alpha) = 2 *success*
16:: (idle compact 0 0) *success*
17:: (idle compact 0 1) *success*
18:: (read 0 (399 bytes) /gamma) *EBADF*
19:: (rmdir /gamma) *ENOTDIR*
20:: (write 0 479 /gamma) Wrote 479 bytes to FLASH
. . .
*********************************************
Scheduling reset in 1...
*********************************************
195:: (rename /delta/gamma/alpha /gamma) *ENOENT*
196:: (read -9999 400 /delta/gamma/alpha) *EBADF*
197:: (creat /delta/gamma/delta)
write of page 7 block 1 failed on reset trap
*********************************************
Reset event took place during this operation.
*********************************************
(mount) fs Block 4 bad -- hardware memory
*success*
*ENOSPC*
Note: Not comparing results/error codes due to reset.
Clearing file descriptors and open directories...
198:: (write -9999 320 /delta/gamma/delta) *EBADF*
199:: (rmdir /delta) *EROFS*

Even with some feedback, we get lots of redundant and "pointless" operations.

But many errors involve operations that should fail but succeed, so it is hard to filter out the rest in order to improve test efficiency: baby with the bathwater.
Difficulties

The reference is not "perfect": there are cases where Linux/Solaris file systems return a poor (but POSIX-compliant) choice of error code

Special efforts to test operations that are not in the reference system – such as bad block management

Sometimes we don't want POSIX: we eventually decided that on a spacecraft, using creat to destroy existing files is bad
Test Strategies

Overnight/daily runs of long sequences of tests
• Range through random seeds (e.g., from 1 to 1,000,000)
• When tests fail, add one representative for each suspected cause to the regressions
• Vary test configurations (an art, not a science, alas)
• Test length varies – an interesting question: how much does it matter?
Run Length and Effectiveness

[Charts: three slides of plots relating test run length to bug-finding effectiveness.]
Test Strategies

Good news:
• Finds lots of bugs, very quickly

Bad news:
• Randomness means potentially long test cases to examine, and thousands of variations of the same error
Test Case Minimization

Solution: automatic minimization of test cases as they are generated

Minimized test case: a subset of the original sequence of operations such that
• The test case still fails
• Removing any one operation makes the test case successful

Typical improvement: an order of magnitude or greater reduction in the length of a test case
• Highly effective technique, essential for quick debugging
Test Case Minimization

Based on Zeller's delta-debugging tools
• Automated debugging state of the art
• A set of Python scripts, easily modified to automatically minimize tests in different settings
• Requires that you be able to
  • Play back test cases and determine success or failure automatically
  • Define the subsets of a test case – provide a test case decomposition
• We'll cover delta-debugging and variants in depth later
Test Case Minimization

Based on a clever modification of a "binary search" strategy

[Diagram: the original test case is repeatedly split – try the first half, then the second half; if neither fails alone, try larger complements such as the first or last three-fourths – recursing into whichever piece still fails.]
Test Case Minimization

One problem
• Sometimes every large test case contains an embedded version of a small test case that fails for a different reason
• When you delta-debug, these small cases dominate
• Our solution: only consider a test failing (when minimizing) if the last operation is the same
• The heuristic seems to work very well in practice

fd = creat("foo")
write(fd, 128)
unlink("foo")
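For concreteness, here is a sketch of the simplest form of this minimization: repeatedly try deleting single operations, keeping a deletion only when the shorter test still fails. Delta debugging's binary-search strategy (previous slide) makes this much faster in practice. op_t and run_test are stand-ins for the real harness, and run_test is assumed to report failure only when the replay fails at the same final operation, implementing the heuristic above:

#include <string.h>

#define MAX_OPS 1024   /* assumed bound on test length */

typedef struct { char text[64]; } op_t;

/* Nonzero iff replaying ops[0..n-1] fails at the same last operation. */
extern int run_test(const op_t *ops, int n);

int minimize(op_t *ops, int n) {
    int changed = 1;
    while (changed) {                       /* repeat until 1-minimal */
        changed = 0;
        for (int i = 0; i < n - 1; i++) {   /* never drop the last op */
            op_t trial[MAX_OPS];
            memcpy(trial, ops, i * sizeof(op_t));
            memcpy(trial + i, ops + i + 1, (n - 1 - i) * sizeof(op_t));
            if (run_test(trial, n - 1)) {   /* still fails: keep deletion */
                memcpy(ops, trial, (n - 1) * sizeof(op_t));
                n--;
                changed = 1;
            }
        }
    }
    return n;   /* minimized length */
}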
Test Case Minimization

We'll revisit Zeller's delta-debugging when we cover debugging

http://www.st.cs.uni-sb.de/dd
• Check it out if you want to get started
• Could be useful now, for your tests
Some Results: Tests

Over two hundred minimized regression test cases
• No failures over these tests for the latest version of the file system
• Success on ~2,000,000 new randomized tests
• Can continue testing: why stop?
  • Background task on a compute server…
  • The low cost of testing means there's no real reason to stop looking for rare glitches
Some Results: Coverage

80-85% typical statement coverage

Hand inspection shows that uncovered code is either:
• Extremely defensive coding to handle (non-provably) impossible conditions – coverage here would indicate a bug in the file system…
• Cases intentionally not checked, to improve test efficiency: null pointers, invalid filename characters – can (statically) show these do not change (or depend on) file system state
Getting Code Coverage

Can just use gcov
• Free tool available for use with gcc
• Compile the program with extra flags: -fprofile-arcs -ftest-coverage
• After all (or each) test case finishes, run: gcov -o object-files-location source-files

Will produce some output & some files
• Output gives coverage %s per file
• And you get an annotated copy of the source
Code Not Covered by Tests

Defensive coding – if this runs, we've found a fault:

 531914:  780:  if (!FS_ASSERT((dp->type & FS_G) == FS_G))
  #####:  781:  { fs_handle_condition(dp->type);
  #####:  782:    FS_SET_ERR(EEASSERT);
      -:  783:  }

Trivial parameter checks – this one is a bit more subtle:

15007634: 1844:  if (want < 0 || b_in == NULL)
   #####: 1845:  { fs_i_release_access(Lp);
   #####: 1846:    FS_SET_ERR(EINVAL);
   #####: 1847:    return FS_ERROR;
       -: 1848:  }

("#####" indicates code not covered by the tests – 0 executions.)
Don't Use Random Testing for Everything!

Why not test handing read a null pointer?
• Because (assuming the code is correct) it guarantees some portion of test operations will not induce failure
• But if the code is incorrect, it's easier and more efficient to write a single test
• The file system state doesn't have any impact (we hope!) on whether there is a null check for the buffer passed to read

But we have to remember to actually do these non-random fixed tests, or we may miss critical, easy-to-find bugs!
Some Results: After >10^9 POSIX Ops.

[Chart: "Results of the Last 11 Weeks of Random Testing" (full single-partition testing only, i.e., 1 of 3 categories tested – the other categories are similar). Number of operations per week on a log scale (1 to 10^10), weeks 1-11. Series: # reset traps set, # bad block traps set, # defect reports filed, and # operations in test runs (approx); annotations distinguish runs with potential defects, all test runs (estimated), and defect reports filed.]
Some Results: Defect Tracking

[Chart: source code size & test code size (KNCSL), cumulative cyclomatic complexity (summed over all functions), and defect reports & cumulative defect reports, on a log scale (1 to 100,000) over weeks 0-25. Series: fs_lib test support code, fs_lib source code, defect reports filed, cumulative defect reports filed, cumulative cyclomatic complexity, and code for random testing.]
The Real Results: The Bugs

POSIX divergences:
• Early testing exposed numerous incorrect choices of POSIX error code – easily resolved, mostly low impact

Fault interactions:
• A large number of cases involved hardware failure interactions – failure to track the bad block list properly, for the most part
The Real Results: The Bugs

File system integrity/functionality losses:
• A substantial number of errors discovered involved low-probability, very high impact scenarios
  • Complete loss of file system contents
  • Loss of file contents (for a file not involved in an operation, in some cases)
  • Null pointer dereference
  • Inability to unmount the file system (!)
  • Failed assertions on global invariants
  • Undead files – I thought I killed that!

Our version of feedback really helps with finding these.
The Real Results: The Bugs

We believe it is extremely unlikely that traditional testing procedures would have exposed several of these errors
• They probably would have missed a lot of the POSIX and hardware fault errors too, but those aren't as important

Backing out to a larger perspective: why were we able to find them? (The big question)
Why We Found the Bugs (We Think!)

Design for testability:
• Scalable down to small flash systems
• Very heavy use of assertions and invariant checks
• Chose system behavior to make the system predictable (thus testable)

Performed millions of different automated tests, thanks to randomization with feedback + a powerful oracle (differential testing)
Why We Found the Bugs

Need both:
• If only nominal scenarios are executed, design for testability
  • can't take advantage of small configurations
  • gives less chance to exercise assertions
• Large-scale random testing is less effective if
  • you can't scale down the hardware
  • there are no sanity checks
  • system unpredictability makes it difficult to use a reference oracle
Reusing the Test Framework

Internal development efforts at JPL:
• RAMFS
  • Use of code instrumentation for "hardware simulation" (memory is the hardware)
• NVDS: a low-level storage module
• Adaptations for new flash hardware/MSL

Request from a Discovery-class NASA mission – used to perform acceptance testing on a (non-POSIX) flight file system
• Exposed serious undetected errors
Inheriting Test Code

RAMFS: a RAM file system with reliability across warm resets
• Used the same test framework and reference file system as for flash
• Unable to inject faults through a custom driver layer – "write" is C assignment or memcpy
• Used automatic code instrumentation to simulate arbitrary system resets
• Add a potential longjmp escape at each write to global memory (everything but stack vars)
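A sketch of the instrumentation idea: route every write to global memory through a macro that may longjmp back to the harness, simulating a warm reset at exactly that point. The macro, trap probability, and harness shape are illustrative; the real system inserted these escapes by automatic instrumentation:

#include <setjmp.h>
#include <stdlib.h>

static jmp_buf reset_env;
static int reset_armed;

/* With small probability, "reset" instead of completing the store. */
#define GLOBAL_WRITE(lval, rval)                                  \
    do {                                                          \
        if (reset_armed && rand() % 10000 == 0)                   \
            longjmp(reset_env, 1);  /* the store never happens */ \
        (lval) = (rval);                                          \
    } while (0)

static int fs_generation;   /* example piece of global state */

static void do_operation(void) {
    GLOBAL_WRITE(fs_generation, fs_generation + 1);
}

int main(void) {
    if (setjmp(reset_env) == 0) {
        reset_armed = 1;
        for (int i = 0; i < 1000000; i++)
            do_operation();
    } else {
        /* simulated warm reset: re-run recovery and check invariants here */
    }
    return 0;
}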
Testing an Externally Developed FS

Like JPL, the contractor decided that past mission flash file systems were inadequate
• Developed a new "highly-reliable" flash file system

JPL management wanted to get a better feel for the quality of this system
• JPL mission management knew of LaRS work

The contractor stated that the development process followed and previous testing were first-rate
• "One of our best developers" (true)
• "Following our best process" (probably true)
  • CMMI Level 3
  • For mission critical software
• "Ready to fly"
Testing an Externally Developed FS

Nonetheless, JPL management requested that LaRS perform additional acceptance testing with our random test methods
• Validate effectiveness as a highly reliable file system for mission data
• Evaluate risk to mission
• Improve quality of the file system
Performing the Testing

LaRS received an Interface Control Document (ICD) and an executable
• But no source, requirements, or design documents, due to IP concerns
• Black box! Or at least a very gray box

Prior to receiving the executable, our queries about behavior described in the ICD resulted in two Software Change Requests for serious flaws
• A tester's job may begin before even receiving code: thinking about how to test a system can expose faults
• Good case for Beizer's levels – we certainly didn't actually execute any code to find those problems

Testing began early January; the report was delivered February 13
Test Results

Exposed 16 previously undetected errors – 14 were fixed
• Each error had potential for file system corruption or loss of functionality
• Delivered a C program with a minimal test case for each error to ease diagnosis
• 8 new releases to correct errors, sent shortly after our test cases arrived
• Final version successfully executed hundreds of thousands of operations
• Modified the tester to avoid remaining problems that were not fixed

Useful idea: have an automatic tester generate stand-alone program test cases automatically: very helpful for sending to developers, whether in-house or outside – and it ensures the bug isn't in the test framework!
Sample error

Reset during close can cause fatal file system corruption and crash
• A reset after the 2nd write while closing a file can produce system corruption. When the system is next mounted, this results in a crash with a segmentation fault

[Figure: open file, then close, which performs two writes to the flash storage device; a system reboot before the second page can be written leaves the device corrupted. Restart system, mount file system… CRASH!]
Reset Testing

Discussion with the developer revealed that extensive reset testing had been done
• This means that the testing had been better than is typical, but it still covered only a few scenarios
• Should still be considered incomplete

[Figure: a handful of fixed test scenarios; try a reset at each point and check the result.]
Reset Testing

Random testing can try thousands of scenarios with resets at random points

More important: it can also vary flash contents & operations

Hypothesis: it is more important to vary the states in which a reset takes place extensively than to exhaustively check all placements for a reset in a limited set of scenarios
Test Results

Reported on remaining major vulnerabilities:
• High mission risk for use of the rename operation – can destroy file system contents if used on a full volume
• Contractor hardware model may not reflect actual hardware behavior
• The 2 unfixed errors (design flaws) prevented significant testing on a full or nearly full file system
Test Results

Test efforts were well received by the project and by the contractor development team
• Reliability was improved – corrections for many errors, and more information about remaining risks
• LaRS team was invited to attend the code and device driver review
Principles Used

• Random testing (with feedback)
• Test automation
• Hardware simulation & fault injection
• Use of a well-tested reference implementation as oracle (differential testing)
• Automatic test minimization (delta-debugging)
• Design for testability
  • Assertions
  • Downward scalability (small model property)
  • Preference for predictability
Synopsis

Random testing is sometimes a powerful method and could likely be applied more broadly in other missions
• Already applied to four file system-related development efforts
• Part or all of this approach is applicable to other critical components (esp. with better models to use as references)
Ongoing Work

Used the framework / hardware simulation / reference in model checking of the storage system

Developing hybrid methods combining model checking, constraint solving, and random testing
• State spaces are still too large
• Sound abstractions are very difficult to devise

Theorem proving efforts on the design proved very labor-intensive, even with insights from early efforts & the layered design
Challenge for "Formal Verification"

Traditionally:
• "Testing is good for the 1-in-10^3 bugs"
• "For 1-in-10^7 bugs, you need model checking or the like"

We've found such low-probability errors (checksum overlap with reset partway through memcpy in RAMFS)
• (Also found some bugs with model checking we did not find with random testing)