Lecture 3: Random Testing
Today

Random testing
• Start off with a practical look, and some useful ideas to get you started on the project: random testing for file systems
• Then take a deeper look at the notion of feedback and why it is useful: a method for testing OO systems from ICSE a couple of years ago
• Then back out to take a look at the general idea of random testing, if time permits
A Little Background

Random testing
• Generate program inputs at random
• Drawn from some (possibly changing) probability distribution
• "Throw darts at the state space, without drawing a bullseye"
• May generate the same test (or equivalent tests) many times
• Will perform operations no sane human would ever perform
A Somewhat Random Tester (Last Week)

#define N 5 /* 5 is "big enough"? */

int testFind() {
  int a[N];
  int p, i;
  for (p = 0; p < N; p++) {
    random_assign(a, N);            /* fill a[] with random values */
    a[p] = 3;                       /* plant the target value at index p */
    for (i = p + 1; i < N; i++) {   /* clear any 3s after p, so p is the last */
      if (a[i] == 3)
        a[i] = a[i] - 1;
    }
    printf("TEST: findLast({");
    print_array(a, N);
    printf("}, %d, 3)", N);
    assert(findLast(a, N, 3) == p);
  }
  return 0;
}
A Considerably More Random Tester

#define N 50 /* 50 is "big enough"? */

int testFind() {
  int a[N];
  int p, x, n, i, j;
  for (i = 0; i < NUM_TESTS; i++) {
    pick(n, 0, N);                  /* array length */
    pick(x, INT_MIN, INT_MAX);      /* target value: full int range */
    pick(p, -1, n - 1);             /* last position of x; -1 = not present */
    random_assign(a, n);
    if (p != -1) {
      a[p] = x;
    }
    for (j = p + 1; j < n; j++) {   /* ensure no x after position p */
      if (a[j] == x)
        a[j] = a[j] - 1;
    }
    printf("TEST: findLast({");
    print_array(a, n);
    printf("}, %d, %d) with item at %d", n, x, p);
    assert(findLast(a, n, x) == p);
  }
  return 0;
}
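The slides never show pick, random_assign, or print_array. Here is a minimal sketch of what such helpers might look like; the names come from the slides, but the bodies (including the value range used in random_assign) are assumptions:

#include <stdio.h>
#include <stdlib.h>

/* pick(var, lo, hi): set var to a roughly uniform value in [lo, hi].
 * A macro, so it can assign to the caller's variable as the slides do. */
#define pick(var, lo, hi) \
  ((var) = (int)((double)(lo) + rand() / (RAND_MAX + 1.0) * \
                 ((double)(hi) - (double)(lo) + 1.0)))

/* Fill the first n slots of a[] with random ints (range is arbitrary). */
static void random_assign(int a[], int n) {
  for (int i = 0; i < n; i++)
    pick(a[i], -1000, 1000);
}

/* Print a[0..n-1] as a comma-separated list. */
static void print_array(int a[], int n) {
  for (int i = 0; i < n; i++)
    printf(i ? ",%d" : "%d", a[i]);
}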
Fuzz Testing

One night (it was a dark and stormy night) in 1990, Bart Miller (U. Wisc.) was logged in over dialup
• There was a lot of line noise due to the storm
• His shell and editors kept crashing
• This gave him an idea…
Fuzz Testing

Bart Miller et al., "An Empirical Study of the Reliability of UNIX Utilities"
• Idea: feed "fuzz" (streams of pure randomness, noise from /dev/urandom pretty much) to OS & utility code
• Watch it break!
• In 1990, could crash 25-33% of utilities
• Reports every few years since then
• Some of the bugs are the same ones in common security exploits (particularly buffer overruns)
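In the spirit of Miller's experiment, a minimal fuzz driver is easy to write. The sketch below simply pipes bytes from /dev/urandom into a target command's stdin and reports how the target exits; the default target and the 100 KB budget are arbitrary choices, not part of the original study:

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    const char *target = (argc > 1) ? argv[1] : "cat >/dev/null";
    signal(SIGPIPE, SIG_IGN);               /* survive the target dying early */
    FILE *rnd = fopen("/dev/urandom", "rb");
    FILE *out = popen(target, "w");
    if (!rnd || !out) { perror("setup"); return 1; }

    for (long i = 0; i < 100000; i++) {     /* 100 KB of pure noise */
        int c = fgetc(rnd);
        if (fputc(c, out) == EOF) break;    /* target closed its stdin */
    }
    fclose(rnd);
    int status = pclose(out);               /* nonzero often means a crash */
    printf("target exited with status %d\n", status);
    return 0;
}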
Random Testing for Good & Evil

Fuzzers
• Tools that send malformed/random input to a program and hope to crash it or find a security hole
• Firefox is internally using random testing to find (security) problems
• One developer I know says they aren't publishing much because it would be too useful to the bad guys
• Fuzzing is useful for finding bugs to protect programs ("white hat" work)
• But also for finding bugs to hack into systems ("black hat")!
The Problem at JPL

Testing is the net that JPL uses to catch software errors before they show up in mission operation
• Last line of defense – if a bug gets through, it can mean mission failure

Traditional software testing nets have big holes
The Problem at JPL

Most mission testing is integration testing of nominal scenarios:
• Very thorough checks that when expected things happen, other expected things happen – including fault protection (expected unexpected things)
• Unfortunately, when the unexpected unexpected happens…
The Problem at JPL

Nominal (or stress) integration testing relies on expensive and slow radiation-hardened flight hardware
• Lots of competition for limited computational resources
• Computationally infeasible to make use of statistical approaches, such as random testing
Building Better Nets

Thorough file system testing is a pilot effort to improve software testing at JPL
• Reduce bugs found at the final system I&T level – or in operation – by more effective early use of computational power on core modules of flight software
• Exploit models and reference implementations to reduce developer & tester effort
Flash File System Testing

We (LaRS) are developing a file system for mission use (NVFS)
• A key JPL mission component
• Problems with previous file systems used in missions (MER flash anomaly, others I can't tell you about here)
• If bugs in our code show up in flight, JPL loses, science loses, etc.

High reliability is critical:
• Must preserve integrity of data
  • in presence of arbitrary system resets
  • in presence of hardware failures

How do we thoroughly test such a module?
Quick Primer: NAND Flash

Before we continue, the tested system in a bit more detail.

Flash memory is a set of blocks
• A block is a set of pages
• A page can be written once; read many times
• A page must be erased before it can be re-written
• The erase unit is a full block of pages

[Figure: a block of pages; each PAGE WRITE consumes a free page and obsoletes old data, leaving "dirty" pages; after more writes, a BLOCK ERASE reclaims the whole block. Legend: used page, free page, "dirty" page.]
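To make the write-once/erase-by-block rules concrete, here is a minimal sketch of a NAND flash model in C. All names and sizes are illustrative, not the actual NVFS simulation layer:

#include <string.h>

#define BLOCKS 8
#define PAGES_PER_BLOCK 4
#define PAGE_BYTES 256

typedef enum { FREE, USED, DIRTY } page_state;

/* Static arrays zero-initialize, so every page starts FREE. */
static page_state state[BLOCKS][PAGES_PER_BLOCK];
static unsigned char data[BLOCKS][PAGES_PER_BLOCK][PAGE_BYTES];

/* Write a page: only legal if it has not been written since the last erase. */
int page_write(int b, int p, const unsigned char *buf) {
    if (state[b][p] != FREE)
        return -1;                      /* write-once violation */
    memcpy(data[b][p], buf, PAGE_BYTES);
    state[b][p] = USED;
    return 0;
}

/* Mark a page obsolete, e.g., because its data was rewritten elsewhere. */
void page_obsolete(int b, int p) {
    if (state[b][p] == USED)
        state[b][p] = DIRTY;
}

/* Erase: the erase unit is the full block, never a single page. */
void block_erase(int b) {
    for (int p = 0; p < PAGES_PER_BLOCK; p++)
        state[b][p] = FREE;
}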
The Goals

Randomize early testing (since it is not possible to be exhaustive)
• We don't know where the bugs are

[Figure: nominal scenario tests vs. randomized testing, as coverage of the behavior space.]
Random testing

• Simulated flash hardware layer allows random fault injection
• Most development/early testing can be done on workstations
• Lots of available compute power – can cover many system behaviors
• Will stress software in ways nominal testing will not
The Goals

Automate early testing
• Run tests all the time, in the background, while continuing development efforts

Automate test evaluation
• Using reference systems for fault detection and diagnosis
• Automated test minimization techniques to speed debugging and increase regression test effectiveness

Automate fault injection
• Simulate hardware failures in a controlled test environment
The Goals

Make use of desktop hardware for early testing – vs. expensive (sloooow) flight hardware testbeds
• Many faults can be exposed without full bit-level hardware simulation
Traditional Testing

Limited, fixed unit tests by developers

Nominal scenarios on hardware testbeds
• Small number of scenarios, due to limited resources
• Test engineers inspect results manually
• Limited fault injection capability (reset means manually hitting the "red button")

[Figure: one test engineer; a day of testing covers a handful of scenarios.]
Random Testing

Millions of operations and scenarios, automatically generated

Run on fast & inexpensive workstations

Results checked automatically by a reference oracle

Hardware simulation for fault injection and reset simulation

[Figure: a day (& night) of testing, multiplied by ×100,000.]
Differential Testing

How can we tell if a test succeeds?
• POSIX standard for file system operations
• IEEE-produced, ANSI/ISO-recognized standard for file systems
• Defines operations and what they should do/return, including nominal and fault behavior

POSIX operation               Result
mkdir("/eng", …)              SUCCESS
mkdir("/data", …)             SUCCESS
creat("/data/image01", …)     SUCCESS
creat("/eng/fsw/code", …)     ENOENT
mkdir("/data/telemetry", …)   SUCCESS
unlink("/data/image01")       SUCCESS

[Figure: the resulting file system tree – / contains /eng and /data; /data contains /telemetry; image01 was created and then unlinked.]
Differential Testing

How can we tell if a test succeeds?
• The POSIX standard specifies (mostly) what correct behavior is
• We have heavily tested implementations of the POSIX standard in every flavor of UNIX, readily available to us
• We can use UNIX file systems (ext3fs, tmpfs, etc.) as reference systems to verify the correct behavior of flash
• The first differential approach (published) was McKeeman's testing for compilers
Random Differential Testing

[Flowchart: the test loop.]
• Choose a (POSIX) operation F
• Perform F on NVFS; perform F on the reference (if applicable)
• Compare return values; compare error codes; compare file systems; check invariants
• (Inject a fault?) and repeat
Testing a File System

Use a simulation layer to imitate flash hardware, operating at RAM-disk speed
• I.e., much faster than the real flight hardware
• Making large-scale random testing possible

The simulation layer provides the same interface as the real hardware driver

The simulation layer provides the ability to inject faults: bad blocks, system resets, read failures (see the sketch below)
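A sketch of what such a fault-injecting simulation layer might look like, building on the flash model sketched earlier. The trap counter, bad-block table, and all names are illustrative assumptions:

#include <setjmp.h>

#define BLOCKS 8

static jmp_buf reset_point;           /* the harness resumes here on "reset" */
static long writes_until_reset = -1;  /* -1: no reset scheduled */
static int bad_block[BLOCKS];         /* nonzero: writes to that block fail */

/* Stand-in for the real driver's page write. */
static int real_page_write(int b, int p, const unsigned char *buf) {
    (void)b; (void)p; (void)buf;
    return 0;
}

void schedule_reset(long after_writes) { writes_until_reset = after_writes; }

/* Same interface as the real hardware driver, plus injected faults. */
int sim_page_write(int b, int p, const unsigned char *buf) {
    if (writes_until_reset >= 0 && writes_until_reset-- == 0)
        longjmp(reset_point, 1);      /* power loss: this write never lands */
    if (bad_block[b])
        return -1;                    /* injected hardware write failure */
    return real_page_write(b, p, buf);
}

/* Harness shape (sketch):
 *   if (setjmp(reset_point)) remount_and_compare_contents();
 *   else { schedule_reset(random_count()); run_random_operations(); }
 */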
Random Differential Testing

Choose file system operations randomly
• Include standard POSIX calls + other operations (mount, unmount, format)
• Bias choice by a (coarse) model of file system contents, but allow failing operations
• Akin to randomized testing with feedback (Pacheco et al., ICSE 07)

Perform on both systems:

fs_fd = nvfs_creat("/dp/images/img019", ctime);
ref_fd = creat("/dp/images/img019", …);

Compare return values. Compare error codes. Compare file systems. Check invariants. (A sketch of one such step follows.)
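One differential step might look like the following sketch. nvfs_creat, nvfs_errno, and compare_file_systems stand in for the real NVFS interfaces, and a real harness would point the reference creat at a separate scratch directory:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>

extern int nvfs_creat(const char *path, long ctime);  /* system under test */
extern int nvfs_errno(void);                          /* its last error code */
extern int compare_file_systems(void);                /* walk & diff both trees */

int differential_creat(const char *path, long ctime) {
    int fs_ret  = nvfs_creat(path, ctime);
    int fs_err  = nvfs_errno();
    int ref_ret = creat(path, 0644);                  /* reference: POSIX */
    int ref_err = errno;

    /* Compare success/failure, and on failure the error codes. */
    if ((fs_ret < 0) != (ref_ret < 0)) {
        printf("DIVERGENCE on creat(%s): ret %d vs %d\n", path, fs_ret, ref_ret);
        return 0;
    }
    if (fs_ret < 0 && fs_err != ref_err) {
        printf("DIVERGENCE on creat(%s): errno %d vs %d\n", path, fs_err, ref_err);
        return 0;
    }
    return compare_file_systems();                    /* deep structural check */
}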
Feedback: How to Pick a Path Name

Full random generator: picks a path of length up to n, from fixed components, e.g.:
• /alpha/beta/beta/gamma/alpha

History-based generator: picks a random path from a list of all paths that have ever been created
• With some probability of adding an extra random component

(A sketch follows the next slide.)
Feedback: How to Pick a Path Name

[Flowchart: full random? If yes, pick a length n and append n random components. If no, pick a path from the history (e.g., /delta/delta/alpha, /beta/beta/alpha) and optionally append one random component. Return the chosen path, e.g., /alpha/alpha/beta/gamma/delta/delta/gamma/beta…]

Tune P(full random) to balance the chance of useful operations with the ability to catch unlikely faults.

Note that no operation should ever succeed on a path that can't be produced from history plus one extra component.
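Here is a sketch of the history-based generator with feedback described on these two slides. The component names match the examples, but the probabilities, limits, and data structures are illustrative assumptions:

#include <stdlib.h>
#include <string.h>

#define MAX_PATH 256
#define MAX_HISTORY 1024

static const char *components[] = { "alpha", "beta", "gamma", "delta" };
static char history[MAX_HISTORY][MAX_PATH];  /* every path ever created */
static int history_len;

static void append_random_component(char *path) {
    int n = (int)(sizeof components / sizeof components[0]);
    if (strlen(path) + 8 >= MAX_PATH) return;   /* crude length guard */
    strcat(path, "/");
    strcat(path, components[rand() % n]);
}

void pick_path(char *out, double p_full_random, double p_extra) {
    out[0] = '\0';
    if (history_len == 0 || (double)rand() / RAND_MAX < p_full_random) {
        int n = 1 + rand() % 5;                 /* full random: up to 5 parts */
        for (int i = 0; i < n; i++)
            append_random_component(out);
    } else {
        strcpy(out, history[rand() % history_len]);  /* feedback: reuse */
        if ((double)rand() / RAND_MAX < p_extra)
            append_random_component(out);       /* …plus one fresh component */
    }
}

/* Call after a creat/mkdir succeeds, so future operations can find the path. */
void record_path(const char *path) {
    if (history_len < MAX_HISTORY)
        strcpy(history[history_len++], path);
}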
Fault Injection Example: Reset

A test with a random reset scheduled:

fs_fd = nvfs_creat("/dp/images/img019", ctime);
ref_fd = creat("/dp/images/img019", …);
Compare return values. Compare error codes. Compare file systems. Check invariants.

fs_ret = nvfs_mkdir("/dp/images/old", ctime);

[Flowchart: did a reset occur during the operation?
• No: perform ref_ret = mkdir("/dp/images/old", …); then, as with no resets, compare return values, error codes, and file systems, and check invariants.
• Yes: restart/remount NVFS, then compare file system contents. Do they match? If yes, the reset took place after the commit; if no, the reset occurred before the commit.]
Stress Testing

Bugs live in the corner cases, i.e.:
• File system is (running) out of space
• High rate of bad blocks

Use a small virtual flash device to test for these conditions: 6-13 blocks, 4 pages per block, 200-400 bytes per page

[Figure legend: used page, free page, dirty page, bad block.]
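A sketch of how a test run might randomize such a small geometry, using the ranges quoted above (the struct and function names are assumptions):

#include <stdlib.h>

struct flash_geometry {
    int blocks;           /* 6..13 */
    int pages_per_block;  /* fixed at 4 */
    int bytes_per_page;   /* 200..400 */
};

/* Pick a deliberately tiny geometry so out-of-space and bad-block
 * corner cases are reached often within a single test run. */
struct flash_geometry random_small_geometry(void) {
    struct flash_geometry g;
    g.blocks          = 6 + rand() % 8;      /* 6..13 */
    g.pages_per_block = 4;
    g.bytes_per_page  = 200 + rand() % 201;  /* 200..400 */
    return g;
}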
Part of a Typical Random Test

5::  (creat /gamma) = 0 *success*
6::  (rename /gamma /gamma) *EBUSY*
7::  (rename /gamma /gamma) *EBUSY*
8::  (truncate /gamma offset 373) *EOPNOTSUPP*
9::  (rmdir /gamma) *ENOTDIR*
10:: (unlink /gamma) *success*
11:: (open /gamma O_RDWR(2)) *ENOENT*
12:: (open /gamma O_RDWR|O_APPEND(1026)) *ENOENT*
13:: (open /gamma O_RDONLY|O_CREAT|O_EXCL) *success*
14:: (rmdir /gamma) *ENOTDIR*
15:: (creat /alpha) = 2 *success*
16:: (idle compact 0 0) *success*
17:: (idle compact 0 1) *success*
18:: (read 0 (399 bytes) /gamma) *EBADF*
19:: (rmdir /gamma) *ENOTDIR*
20:: (write 0 479 /gamma) Wrote 479 bytes to FLASH
. . .
*********************************************
Scheduling reset in 1...
*********************************************
195:: (rename /delta/gamma/alpha /gamma) *ENOENT*
196:: (read -9999 400 /delta/gamma/alpha) *EBADF*
197:: (creat /delta/gamma/delta)
write of page 7 block 1 failed on reset trap
*********************************************
Reset event took place during this operation.
*********************************************
(mount) fs Block 4 bad -- hardware memory
*success*
*ENOSPC*
Note: Not comparing results/error codes due to reset.
Clearing file descriptors and open directories...
198:: (write -9999 320 /delta/gamma/delta) *EBADF*
199:: (rmdir /delta) *EROFS*

Even with some feedback, we get lots of redundant and "pointless" operations.

But many errors involve operations that should fail but succeed, so it is hard to filter out the rest in order to improve test efficiency: baby with the bathwater.
Difficulties

The reference is not "perfect": there are cases where Linux/Solaris file systems return a poor (but POSIX-compliant) choice of error code

Special efforts to test operations that are not in the reference system – such as bad block management

Sometimes we don't want POSIX: we eventually decided that on a spacecraft, using creat to destroy existing files is bad
Test Strategies

Overnight/daily runs of long sequences of tests
• Range through random seeds (e.g., from 1 to 1,000,000)
• When tests fail, add one representative for each suspected cause to the regressions
• Vary test configurations (an art, not a science, alas)
• Test length varies – an interesting question: how much does it matter?
Run Length and Effectiveness

[Charts: three slides of plots relating test run length to bug-finding effectiveness.]
Test Strategies

Good news:
• Finds lots of bugs, very quickly

Bad news:
• Randomness means potentially long test cases to examine, and thousands of variations of the same error
Test Case Minimization

Solution: automatic minimization of test cases as they are generated

Minimized test case: a subset of the original sequence of operations such that
• The test case still fails
• Removing any one operation makes the test case successful

Typical improvement: an order of magnitude or greater reduction in the length of a test case
• Highly effective technique, essential for quick debugging
Test Case Minimization

Based on Zeller's delta-debugging tools
• Automated debugging state of the art
• A set of Python scripts, easily modified to automatically minimize tests in different settings
• Requires that you be able to
  • Play back test cases and determine success or failure automatically
  • Define the subsets of a test case – provide a test case decomposition
• We'll cover delta-debugging and variants in depth later
Test Case Minimization

Based on a clever modification of a "binary search" strategy

[Diagram: the original test case is repeatedly split – try the first half, then the second half; if neither fails alone, try larger complements such as the first or last three-fourths – recursing into whichever piece still fails.]
Test Case Minimization

One problem
• Sometimes every large test case contains an embedded version of a small test case that fails for a different reason
• When you delta-debug, these small cases dominate
• Our solution: only consider a test failing (when minimizing) if the last operation is the same
• The heuristic seems to work very well in practice

fd = creat("foo")
write(fd, 128)
unlink("foo")
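For concreteness, here is a sketch of the simplest form of this minimization: repeatedly try deleting single operations, keeping a deletion only when the shorter test still fails. Delta debugging's binary-search strategy (previous slide) makes this much faster in practice. op_t and run_test are stand-ins for the real harness, and run_test is assumed to report failure only when the replay fails at the same final operation, implementing the heuristic above:

#include <string.h>

#define MAX_OPS 1024   /* assumed bound on test length */

typedef struct { char text[64]; } op_t;

/* Nonzero iff replaying ops[0..n-1] fails at the same last operation. */
extern int run_test(const op_t *ops, int n);

int minimize(op_t *ops, int n) {
    int changed = 1;
    while (changed) {                       /* repeat until 1-minimal */
        changed = 0;
        for (int i = 0; i < n - 1; i++) {   /* never drop the last op */
            op_t trial[MAX_OPS];
            memcpy(trial, ops, i * sizeof(op_t));
            memcpy(trial + i, ops + i + 1, (n - 1 - i) * sizeof(op_t));
            if (run_test(trial, n - 1)) {   /* still fails: keep deletion */
                memcpy(ops, trial, (n - 1) * sizeof(op_t));
                n--;
                changed = 1;
            }
        }
    }
    return n;   /* minimized length */
}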
Test Case Minimization

We'll revisit Zeller's delta-debugging when we cover debugging

http://www.st.cs.uni-sb.de/dd
• Check it out if you want to get started
• Could be useful now, for your tests
Some Results: Tests

Over two hundred minimized regression test cases
• No failures over these tests for the latest version of the file system
• Success on ~2,000,000 new randomized tests
• Can continue testing: why stop?
  • Background task on a compute server…
  • The low cost of testing means there's no real reason to stop looking for rare glitches
Some Results: Coverage

80-85% typical statement coverage

Hand inspection shows that uncovered code is either:
• Extremely defensive coding to handle (non-provably) impossible conditions – coverage here would indicate a bug in the file system…
• Cases intentionally not checked, to improve test efficiency: null pointers, invalid filename characters – can (statically) show these do not change (or depend on) file system state
Getting Code Coverage

Can just use gcov
• Free tool available for use with gcc
• Compile the program with extra flags: -fprofile-arcs -ftest-coverage
• After all (or each) test case finishes, run: gcov -o object-files-location source-files

Will produce some output & some files
• Output gives coverage %s per file
• And you get an annotated copy of the source
Code Not Covered by Tests

Defensive coding – if this runs, we've found a fault:

 531914:  780:  if (!FS_ASSERT((dp->type & FS_G) == FS_G))
  #####:  781:  { fs_handle_condition(dp->type);
  #####:  782:    FS_SET_ERR(EEASSERT);
      -:  783:  }

Trivial parameter checks – this one is a bit more subtle:

15007634: 1844:  if (want < 0 || b_in == NULL)
   #####: 1845:  { fs_i_release_access(Lp);
   #####: 1846:    FS_SET_ERR(EINVAL);
   #####: 1847:    return FS_ERROR;
       -: 1848:  }

("#####" indicates code not covered by the tests – 0 executions.)
Don't Use Random Testing for Everything!

Why not test handing read a null pointer?
• Because (assuming the code is correct) it guarantees some portion of test operations will not induce failure
• But if the code is incorrect, it's easier and more efficient to write a single test
• The file system state doesn't have any impact (we hope!) on whether there is a null check for the buffer passed to read

But we have to remember to actually do these non-random fixed tests, or we may miss critical, easy-to-find bugs!
Some Results: After >10^9 POSIX Ops.

[Chart: "Results of the Last 11 Weeks of Random Testing" (full single-partition testing only, i.e., 1 of 3 categories tested – the other categories are similar). Number of operations per week on a log scale (1 to 10^10), weeks 1-11. Series: # reset traps set, # bad block traps set, # defect reports filed, and # operations in test runs (approx); annotations distinguish runs with potential defects, all test runs (estimated), and defect reports filed.]
Some Results: Defect Tracking

[Chart: source code size & test code size (KNCSL), cumulative cyclomatic complexity (summed over all functions), and defect reports & cumulative defect reports, on a log scale (1 to 100,000) over weeks 0-25. Series: fs_lib test support code, fs_lib source code, defect reports filed, cumulative defect reports filed, cumulative cyclomatic complexity, and code for random testing.]
The Real Results: The Bugs

POSIX divergences:
• Early testing exposed numerous incorrect choices of POSIX error code – easily resolved, mostly low impact

Fault interactions:
• A large number of cases involved hardware failure interactions – failure to track the bad block list properly, for the most part
The Real Results: The Bugs

File system integrity/functionality losses:
• A substantial number of errors discovered involved low-probability, very high impact scenarios
  • Complete loss of file system contents
  • Loss of file contents (for a file not involved in an operation, in some cases)
  • Null pointer dereference
  • Inability to unmount the file system (!)
  • Failed assertions on global invariants
  • Undead files – I thought I killed that!

Our version of feedback really helps with finding these.
The Real Results: The Bugs

We believe it is extremely unlikely that traditional testing procedures would have exposed several of these errors
• They probably would have missed a lot of the POSIX and hardware fault errors too, but those aren't as important

Backing out to a larger perspective: why were we able to find them? (The big question)
Why We Found the Bugs (We Think!)

Design for testability:
• Scalable down to small flash systems
• Very heavy use of assertions and invariant checks
• Chose system behavior to make the system predictable (thus testable)

Performed millions of different automated tests, thanks to randomization with feedback + a powerful oracle (differential testing)
Why We Found the Bugs

Need both:
• If only nominal scenarios are executed, design for testability
  • can't take advantage of small configurations
  • gives less chance to exercise assertions
• Large-scale random testing is less effective if
  • you can't scale down the hardware
  • there are no sanity checks
  • system unpredictability makes it difficult to use a reference oracle
Reusing the Test Framework

Internal development efforts at JPL:
• RAMFS
  • Use of code instrumentation for "hardware simulation" (memory is the hardware)
• NVDS: a low-level storage module
• Adaptations for new flash hardware/MSL

Request from a Discovery-class NASA mission – used to perform acceptance testing on a (non-POSIX) flight file system
• Exposed serious undetected errors
Inheriting Test Code

RAMFS: a RAM file system with reliability across warm resets
• Used the same test framework and reference file system as for flash
• Unable to inject faults through a custom driver layer – "write" is C assignment or memcpy
• Used automatic code instrumentation to simulate arbitrary system resets
• Add a potential longjmp escape at each write to global memory (everything but stack vars)
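A sketch of the instrumentation idea: route every write to global memory through a macro that may longjmp back to the harness, simulating a warm reset at exactly that point. The macro, trap probability, and harness shape are illustrative; the real system inserted these escapes by automatic instrumentation:

#include <setjmp.h>
#include <stdlib.h>

static jmp_buf reset_env;
static int reset_armed;

/* With small probability, "reset" instead of completing the store. */
#define GLOBAL_WRITE(lval, rval)                                  \
    do {                                                          \
        if (reset_armed && rand() % 10000 == 0)                   \
            longjmp(reset_env, 1);  /* the store never happens */ \
        (lval) = (rval);                                          \
    } while (0)

static int fs_generation;   /* example piece of global state */

static void do_operation(void) {
    GLOBAL_WRITE(fs_generation, fs_generation + 1);
}

int main(void) {
    if (setjmp(reset_env) == 0) {
        reset_armed = 1;
        for (int i = 0; i < 1000000; i++)
            do_operation();
    } else {
        /* simulated warm reset: re-run recovery and check invariants here */
    }
    return 0;
}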
Testing an Externally Developed FS

Like JPL, the contractor decided that past mission flash file systems were inadequate
• Developed a new "highly-reliable" flash file system

JPL management wanted to get a better feel for the quality of this system
• JPL mission management knew of LaRS work

The contractor stated that the development process followed and previous testing were first-rate
• "One of our best developers" (true)
• "Following our best process" (probably true)
  • CMMI Level 3
  • For mission critical software
• "Ready to fly"
Testing an Externally Developed FS

Nonetheless, JPL management requested that LaRS perform additional acceptance testing with our random test methods
• Validate effectiveness as a highly reliable file system for mission data
• Evaluate risk to mission
• Improve quality of the file system
Performing the Testing

LaRS received an Interface Control Document (ICD) and an executable
• But no source, requirements, or design documents, due to IP concerns
• Black box! Or at least a very gray box

Prior to receiving the executable, our queries about behavior described in the ICD resulted in two Software Change Requests for serious flaws
• A tester's job may begin before even receiving code: thinking about how to test a system can expose faults
• Good case for Beizer's levels – we certainly didn't actually execute any code to find those problems

Testing began early January; the report was delivered February 13
Test Results

Exposed 16 previously undetected errors – 14 were fixed
• Each error had potential for file system corruption or loss of functionality
• Delivered a C program with a minimal test case for each error to ease diagnosis
• 8 new releases to correct errors, sent shortly after our test cases arrived
• Final version successfully executed hundreds of thousands of operations
• Modified the tester to avoid remaining problems that were not fixed

Useful idea: have an automatic tester generate stand-alone program test cases automatically: very helpful for sending to developers, whether in-house or outside – and it ensures the bug isn't in the test framework!
Sample error

Reset during close can cause fatal file system corruption and crash
• A reset after the 2nd write while closing a file can produce system corruption. When the system is next mounted, this results in a crash with a segmentation fault

[Figure: open file, then close, which performs two writes to the flash storage device; a system reboot before the second page can be written leaves the device corrupted. Restart system, mount file system… CRASH!]
Reset Testing

Discussion with the developer revealed that extensive reset testing had been done
• This means that the testing had been better than is typical, but it still covered only a few scenarios
• Should still be considered incomplete

[Figure: a handful of fixed test scenarios; try a reset at each point and check the result.]
Reset Testing

Random testing can try thousands of scenarios with resets at random points

More important: it can also vary flash contents & operations

Hypothesis: it is more important to vary the states in which a reset takes place extensively than to exhaustively check all placements for a reset in a limited set of scenarios
Test Results

Reported on remaining major vulnerabilities:
• High mission risk for use of the rename operation – can destroy file system contents if used on a full volume
• Contractor hardware model may not reflect actual hardware behavior
• The 2 unfixed errors (design flaws) prevented significant testing on a full or nearly full file system
Test Results

Test efforts were well received by the project and by the contractor development team
• Reliability was improved – corrections for many errors, and more information about remaining risks
• LaRS team was invited to attend the code and device driver review
Principles Used

• Random testing (with feedback)
• Test automation
• Hardware simulation & fault injection
• Use of a well-tested reference implementation as oracle (differential testing)
• Automatic test minimization (delta-debugging)
• Design for testability
  • Assertions
  • Downward scalability (small model property)
  • Preference for predictability
Synopsis

Random testing is sometimes a powerful method and could likely be applied more broadly in other missions
• Already applied to four file system-related development efforts
• Part or all of this approach is applicable to other critical components (esp. with better models to use as references)
Ongoing Work

Used the framework / hardware simulation / reference in model checking of the storage system

Developing hybrid methods combining model checking, constraint solving, and random testing
• State spaces are still too large
• Sound abstractions are very difficult to devise

Theorem proving efforts on the design proved very labor-intensive, even with insights from early efforts & the layered design
Challenge for "Formal Verification"

Traditionally:
• "Testing is good for the 1-in-10^3 bugs"
• "For 1-in-10^7 bugs, you need model checking or the like"

We've found such low-probability errors (checksum overlap with reset partway through memcpy in RAMFS)
• (Also found some bugs with model checking we did not find with random testing)