Page 1

Black Box Software Testing Fall 2004

PART 16 -- ALTERNATIVES TO GUI TEST AUTOMATION

by

Cem Kaner, J.D., Ph.D.
Professor of Software Engineering
Florida Institute of Technology

and

James Bach
Principal, Satisfice Inc.

Copyright (c) Cem Kaner & James Bach, 2000-2004. This work is licensed under the Creative Commons Attribution-ShareAlike License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/2.0/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.

These notes are partially based on research that was supported by NSF Grant EIA-0113539 ITR/SY+PE: "Improving the Education of Software Testers." Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Page 2

Black Box Software Testing

Alternatives to GUI-Based Automated Regression Testing

• Several of these slides were developed by Doug Hoffman or in co-authorship with Doug Hoffman for a course that we co-taught on software test automation.

• Many of the ideas in this presentation were presented and refined in Los Altos Workshops on Software Testing.

• LAWST 5 focused on oracles. Participants were Chris Agruss, James Bach, Jack Falk, David Gelperin, Elisabeth Hendrickson, Doug Hoffman, Bob Johnson, Cem Kaner, Brian Lawrence, Noel Nyman, Jeff Payne, Johanna Rothman, Melora Svoboda, Loretta Suzuki, and Ned Young.

• LAWST 1-3 focused on several aspects of automated testing. Participants were Chris Agruss, Tom Arnold, Richard Bender, James Bach, Jim Brooks, Karla Fisher, Chip Groder, Elizabeth Hendrickson, Doug Hoffman, Keith W. Hooper, III, Bob Johnson, Cem Kaner, Brian Lawrence, Tom Lindemuth, Brian Marick, Thanga Meenakshi, Noel Nyman, Jeffery E. Payne, Bret Pettichord, Drew Pritsker, Johanna Rothman, Jane Stepak, Melora Svoboda, Jeremy White, and Rodney Wilson.

• I'm indebted to James Whittaker, James Tierney, Harry Robinson, and Noel Nyman for additional explanations of stochastic testing.

Page 3

What is automation design?

• Determine the goals of the automation

• Determine the capabilities needed to achieve those goals

• Select automation components

• Set relationships between components

• Identify locations of components and events

• Sequence test events

• Evaluate and report results of test events.

Page 4

Issues faced in a typical automated test

– What is being tested?

– How is the test set up?

– Where are the inputs coming from?

– What is being checked?

– Where are the expected results?

– How do you know pass or fail?

Page 5

Automated software test functions

– Automated test case/data generation
– Test case design from requirements or code
– Selection of test cases
– Able to run two or more specified test cases
– Able to run a subset of all the automated test cases
– No intervention needed after launching tests
– Automatically sets-up and/or records relevant test environment
– Runs test cases
– Captures relevant results
– Compares actual with expected results
– Reports analysis of pass/fail

Page 6

Characteristics of “fully automated” tests

• A set of tests is defined and will be run together.

• No intervention needed after launching tests.

• Automatically sets-up and/or records relevant test environment.

• Obtains input from existing data files, random generation, or another defined source.

• Runs test exercise.

• Captures relevant results.

• Evaluates actual against expected results.

• Reports analysis of pass/fail.

Not all automation is full automation. Partial automation can be very useful.

Page 7

Capabilities of automation tools

• Automated test tools combine a variety of capabilities. For example, GUI regression tools provide:
  – capture/replay for easy manual creation of tests
  – execution of test scripts
  – recording of test events
  – comparison of the test results with expected results
  – reporting of test results

• Some GUI tools provide additional capabilities, but no tool does everything well.

Page 8

Capabilities of automation tools

• Here are examples of automated test tool capabilities:
  – Analyze source code for bugs
  – Design test cases
  – Create test cases (from requirements or code)
  – Generate test data
  – Ease manual creation of test cases
  – Ease creation/management of traceability matrix
  – Manage testware environment
  – Select tests to be run
  – Execute test scripts
  – Record test events
  – Measure software responses to tests (Discovery Functions)
  – Determine expected results of tests (Reference Functions)
  – Evaluate test results (Evaluation Functions)
  – Report and analyze results

Page 9

Improve testability by providing diagnostic support

• Hardware integrity tests. Example: power supply deterioration can look like irreproducible, buggy behavior.

• Database integrity. Ongoing tests for database corruption, making corruption quickly visible to the tester.

• Code integrity. Quick check (such as checksum) to see whether part of the code was overwritten in memory.

• Memory integrity. Check for wild pointers, other corruption.

• Resource usage reports. Check for memory leaks, stack leaks, etc.

• Event logs. See reports of suspicious behavior. Normally requires collaboration with programmers.

• Wrappers. Layer of indirection surrounding a called function or object. The automator can detect and modify incoming and outgoing messages, forcing or detecting states and data values of interest.
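For illustration, here is a minimal sketch of the wrapper idea in Python, assuming a hypothetical called function (transfer_funds) and a forced return value; a real wrapper would sit between the SUT and the library or object it calls.

    import functools, logging

    logging.basicConfig(level=logging.INFO)

    def wrap(func, force_result=None):
        """Layer of indirection around a called function: log incoming and
        outgoing messages, and optionally force a value of interest."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            logging.info("CALL %s args=%r kwargs=%r", func.__name__, args, kwargs)
            result = func(*args, **kwargs)
            if force_result is not None:
                logging.info("forcing result %r (actual was %r)", force_result, result)
                result = force_result      # drive the caller into a state of interest
            logging.info("RETURN %s -> %r", func.__name__, result)
            return result
        return wrapper

    def transfer_funds(amount):            # hypothetical stand-in for a real called function
        return amount >= 0

    transfer_funds = wrap(transfer_funds)  # the SUT now sees the wrapped version
    transfer_funds(100)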

Page 10

GUI regression is just a special case

• Source of test cases
  – Old

• Size of test pool
  – Small

• Serial dependence among tests
  – Independent

• Evaluation strategy
  – Comparison to saved result

Page 11

GUI regression is just a special case

• Source of test cases
  – Old
  – Intentionally new
  – Random new

• Size of test pool
  – Small
  – Large
  – Exhaustive

• Serial dependence among tests
  – Independent
  – Sequence is relevant

Page 12

GUI regression is just a special case

• Evaluation strategy
  – Comparison to saved result
  – Comparison to an oracle
  – Comparison to a computational or logical model
  – Comparison to a heuristic prediction. (NOTE: All oracles are heuristic.)
  – Crash
  – Diagnostic
  – State model

Page 13

A different special case: Exhaustive testing

• MASPAR functions: square root tests
  – 32-bit arithmetic, built-in square root
    • 2^32 tests (4,294,967,296)
    • 6 minutes to run the tests
    • Much longer to run the oracle
    • Discovered 2 errors that were not associated with any boundary (a bit was mis-set, and in two cases, this affected the final result).
  – 64-bit arithmetic?
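For illustration, a minimal sketch of the exhaustive approach in Python, assuming a hypothetical 32-bit square-root routine (sut_sqrt) and a slower reference oracle; a full 2^32 sweep like the MASPAR run takes serious time on ordinary hardware.

    import math

    def sut_sqrt(x):
        """Hypothetical stand-in for the 32-bit square-root function under test."""
        return math.isqrt(x)

    def oracle_sqrt(x):
        """Slower reference oracle: the largest integer r with r*r <= x."""
        r = int(math.sqrt(x))
        while r * r > x:
            r -= 1
        while (r + 1) * (r + 1) <= x:
            r += 1
        return r

    failures = []
    for x in range(2 ** 32):               # exhaustive: every 32-bit input (slow in pure Python)
        if sut_sqrt(x) != oracle_sqrt(x):
            failures.append(x)

    print(len(failures), "mismatches")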

Page 14

A different special case: Exhaustive testing

• MASPAR functions: square root tests
  – Source of test cases
    • Intentionally new
  – Size of test pool
    • Exhaustive
  – Evaluation strategy
    • Comparison to an oracle
  – Serial dependence among tests
    • Independent

Page 15

Random testing: Independent and stochastic approaches

• Random Testing
  – Random (or statistical or stochastic) testing involves generating test cases using a random number generator. Because they are random, the individual test cases are not optimized against any particular risk. The power of the method comes from running large samples of test cases.

• Stochastic Testing
  – A stochastic process involves a series of random events over time
    • The stock market is an example
    • The program typically passes the individual tests: The goal is to see whether it can pass a large series of the individual tests.

• Independent Testing
  – Our interest is in each test individually; the test before and the test after don't matter.

Page 16

Independent random tests: Function equivalence testing

• Hypothetical case: Arithmetic in Excel
  – Suppose we had a pool of functions that worked well in the previous version.
• For individual functions, generate a random number and take the function (e.g. log) in Excel 97 and Excel 2000.
  – Spot check results (e.g. 10 cases across the series)
• Build a model to combine functions into expressions
  – Generate and compare expressions
  – Spot check results
  – For an academic example of this, see the Final Exam for Software Testing 2 (fall, 2003) at http://blackbox.cs.fit.edu/blog/kaner/archives/000008.html
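For illustration, a minimal sketch of this style of function equivalence test in Python, assuming hypothetical log_v1 and log_v2 functions standing in for the same function in the trusted old version and in the new version.

    import math, random

    def log_v1(x):
        return math.log(x)                  # stand-in for the function in the trusted old version

    def log_v2(x):
        return math.log(x)                  # stand-in for the same function in the new version

    random.seed(1)                          # record the seed so any failure can be replayed
    for _ in range(1_000_000):
        x = random.uniform(1e-300, 1e300)
        old, new = log_v1(x), log_v2(x)
        # Exact equality is too strict for floating point, so compare within a tolerance.
        if not math.isclose(old, new, rel_tol=1e-12):
            print("MISMATCH for x =", x, ":", old, "vs", new)
            break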

Page 17

Independent random tests: Function equivalence testing

• Hypothetical case: Arithmetic in Excel
  – Source of test cases
    • Random new
  – Size of test pool
    • Large
  – Evaluation strategy
    • Comparison to an oracle
  – Serial dependence among tests
    • Independent

Page 18

Comparison functions

• Parallel function (Oracle)
  – Previous version
  – Competitor
  – Standard function
  – Custom model

• Computational or logical model
  – Inverse function
    • mathematical inverse
    • operational inverse (e.g. split a merged table)
  – Useful mathematical rules (e.g. sin^2(x) + cos^2(x) = 1)
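For illustration, a minimal sketch of inverse-function and mathematical-rule oracles in Python; the sut_sin, sut_cos, and sut_sqrt functions are hypothetical stand-ins for the implementations under test.

    import math, random

    def sut_sin(x): return math.sin(x)      # hypothetical stand-ins for the implementations under test
    def sut_cos(x): return math.cos(x)
    def sut_sqrt(x): return math.sqrt(x)

    random.seed(42)
    for _ in range(100_000):
        x = random.uniform(0.0, 1e6)

        # Mathematical-rule oracle: sin^2(x) + cos^2(x) should equal 1.
        if not math.isclose(sut_sin(x) ** 2 + sut_cos(x) ** 2, 1.0, rel_tol=1e-9):
            print("Pythagorean identity failed at x =", x)

        # Inverse-function oracle: squaring the square root should give x back.
        if not math.isclose(sut_sqrt(x) ** 2, x, rel_tol=1e-9):
            print("sqrt inverse check failed at x =", x)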

Page 19

Oracles: Challenges

• Incomplete information from oracle
  – May be more than one oracle for SUT
  – Inputs may affect more than one oracle

• Accuracy of information from oracle
  – Close correspondence makes common mode faults likely
  – Independence is necessary:
    • algorithms
    • sub-programs and libraries
    • system platform
    • operating environment

Page 20

Oracles: Challenges

• Close correspondence reduces maintainability
• Must maintain currency of oracle through changes in the SUT
• Oracle may become as complex as SUT
• More complex oracles make more errors
• Speed of predictions
• Usability of results

Page 21

Heuristic oracles

• Heuristics are rules of thumb that support but do not mandate a given conclusion. We have partial information that will support a probabilistic evaluation. This won't tell you that the program works correctly, but it can tell you that the program is broken. This can be a cheap way to spot errors early in testing.

• Example:
  – History of transactions: Almost all transactions came from New York last year.
  – Today, 90% of transactions are from Wyoming. Why? Probably (but not necessarily) the system is running amok.
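For illustration, a minimal sketch of such a heuristic check in Python, assuming a hypothetical transaction feed and last year's distribution as the baseline; a large shift is flagged for investigation, not automatically reported as a bug.

    from collections import Counter

    # Hypothetical baseline: share of last year's transactions by state.
    baseline = {"NY": 0.97, "WY": 0.01, "other": 0.02}

    def heuristic_check(transactions, threshold=0.30):
        """Flag any state whose share of today's traffic drifts far from history."""
        counts = Counter(t["state"] for t in transactions)
        total = sum(counts.values())
        for state, count in counts.items():
            share = count / total
            expected = baseline.get(state, baseline["other"])
            if abs(share - expected) > threshold:
                print(f"SUSPICIOUS: {state} is {share:.0%} of traffic, expected about {expected:.0%}")

    # Hypothetical day of data: 90% Wyoming trips the check above.
    todays_transactions = [{"state": "WY"}] * 90 + [{"state": "NY"}] * 10
    heuristic_check(todays_transactions)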

Page 22

Stochastic test: Dumb monkeys

• Dumb Monkey
  – Random sequence of events
  – Continue through crash (Executive Monkey)
  – Continue until crash or a diagnostic event occurs. The diagnostic is based on knowledge of the system, not on internals of the code. (Example: button push doesn't push—this is system-level, not application level.)
  – (name coined by Noel Nyman)
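For illustration, a minimal sketch of a dumb monkey in Python, assuming a hypothetical app object with a list of buttons and a simple system-level diagnostic; the loop runs until the SUT throws (the crash) or the diagnostic fails.

    import random

    class HypotheticalApp:
        """Stand-in for the system under test."""
        buttons = ["open", "save", "copy", "paste", "undo"]
        def press(self, button):
            pass                            # a real monkey would drive the real UI or API here
        def responsive(self):
            return True                     # system-level diagnostic: does a button push still push?

    def dumb_monkey(app, steps=1_000_000, seed=7):
        rng = random.Random(seed)           # a saved seed makes the run reproducible
        for step in range(steps):
            button = rng.choice(app.buttons)
            try:
                app.press(button)
            except Exception as crash:      # an Executive Monkey would ignore even this
                print(f"Crash at step {step} pressing {button!r}: {crash}")
                return
            if not app.responsive():
                print(f"Diagnostic failure at step {step} after pressing {button!r}")
                return

    dumb_monkey(HypotheticalApp())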

Page 23

Stochastic test: Dumb monkeys

• Dumb Monkey
  – Source of test cases
    • Random new
  – Size of test pool
    • Large
  – Evaluation strategy
    • Crash or Diagnostics
  – Serial dependence among tests
    • Sequence is relevant

Page 24

Stochastic test using diagnostics

• Telephone Sequential Dependency
  – Symptoms were random, seemingly irreproducible crashes at a beta site
  – All of the individual functions worked
  – We had tested all lines and branches.
  – Testing was done using a simulator that created long chains of random events. The diagnostics in this case were assert failures that printed out on log files.

Page 25

Stochastic test using diagnostics

• Telephone Sequential Dependency
  – Source of test cases
    • Random new
  – Size of test pool
    • Large
  – Evaluation strategy
    • Diagnostics
  – Serial dependence among tests
    • Sequence is relevant

Page 26

Stochastic test: State-model-based

• Testing Based on a State Model
  – For any state, you can list the actions the user can take, and the results of each action (what new state, and what can indicate that we transitioned to the correct new state).
  – Randomly run the tests and check expected against actual transitions.
  – See www.geocities.com/model_based_testing/online_papers.htm
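For illustration, a minimal sketch of a state-model-based random walk in Python; the phone-like model and the HypotheticalPhone SUT are invented stand-ins, and a real harness would query the actual system for its state.

    import random

    # The tester's model: for each state, the available actions and the expected next state.
    model = {
        "Idle":      {"incoming_call": "Ringing"},
        "Ringing":   {"answer": "Connected", "caller_hangs_up": "Idle"},
        "Connected": {"hold": "On Hold", "hang_up": "Idle"},
        "On Hold":   {"retrieve": "Connected", "caller_hangs_up": "Idle"},
    }

    class HypotheticalPhone:
        """Stand-in for the system under test; assumed to report its own state."""
        def __init__(self):
            self.state = "Idle"
        def do(self, action):
            self.state = model[self.state][action]   # a real SUT would do real work here
        def current_state(self):
            return self.state

    def random_walk(sut, steps=100_000, seed=3):
        rng = random.Random(seed)
        state = sut.current_state()
        for step in range(steps):
            action = rng.choice(list(model[state]))
            expected = model[state][action]
            sut.do(action)
            actual = sut.current_state()
            if actual != expected:
                print(f"Step {step}: {state} --{action}--> {actual}, expected {expected}")
                return
            state = actual

    random_walk(HypotheticalPhone())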

Page 27

Stochastic test: State-model based

• Testing Based on a State Model
  – Source of test cases
    • Random new
  – Size of test pool
    • Large, medium or small (different substrategies)
  – Evaluation strategy
    • State model or crash
  – Serial dependence among tests
    • Sequence is relevant

Page 28

Stochastic test: Saved-tests based

• Testing with Sequence of Passed Tests
  – Collect a large set of regression tests, edit them so that they don't reset system state.
  – Randomly run the tests in a long series and check expected against actual results.
  – Will sometimes see failures even though all of the tests are passed individually.

Page 29

Stochastic test: Saved-tests based

• Testing with Sequence of Passed Tests
  – Source of test cases
    • Old
  – Size of test pool
    • Large
  – Evaluation strategy
    • Saved results or Crash or Diagnostics
  – Serial dependence among tests
    • Sequence is relevant

Page 30

Another approach to evaluating strategies for automation

What characteristics of the
  – goal of testing
  – level of testing (e.g. API, unit, system)
  – software under test
  – environment
  – generator
  – reference function
  – evaluation function
  – users
  – risks
would support, counter-indicate, or drive you toward a strategy?
  – consistency evaluation
  – small sample, pre-specified values
  – exhaustive sample
  – random (aka statistical)
  – heuristic analysis of a large set
  – embedded, self-verifying data
  – state-model-based testing

Page 31

High Volume Test Automation

Keynote Address
STAR East
International Conference on Software Testing Analysis & Review
Orlando, Florida, May 20, 2004

Cem Kaner, Professor of Software Engineering

Walter P. Bond, Associate Professor of Computer Science

Pat McGee, Doctoral Student (Computer Science)

Florida Institute of Technology

Page 32

Acknowledgements

• Many of the ideas in this presentation were initially jointly developed with Doug Hoffman, as we developed a course on test automation architecture, and in the Los Altos Workshops on Software Testing (LAWST) and the Austin Workshop on Test Automation (AWTA).

  – LAWST 5 focused on oracles. Participants were Chris Agruss, James Bach, Jack Falk, David Gelperin, Elisabeth Hendrickson, Doug Hoffman, Bob Johnson, Cem Kaner, Brian Lawrence, Noel Nyman, Jeff Payne, Johanna Rothman, Melora Svoboda, Loretta Suzuki, and Ned Young.

  – LAWST 1-3 focused on several aspects of automated testing. Participants were Chris Agruss, Tom Arnold, Richard Bender, James Bach, Jim Brooks, Karla Fisher, Chip Groder, Elizabeth Hendrickson, Doug Hoffman, Keith W. Hooper, III, Bob Johnson, Cem Kaner, Brian Lawrence, Tom Lindemuth, Brian Marick, Thanga Meenakshi, Noel Nyman, Jeffery E. Payne, Bret Pettichord, Drew Pritsker, Johanna Rothman, Jane Stepak, Melora Svoboda, Jeremy White, and Rodney Wilson.

  – AWTA also reviewed and discussed several strategies of test automation. Participants in the first meeting were Chris Agruss, Robyn Brilliant, Harvey Deutsch, Allen Johnson, Cem Kaner, Brian Lawrence, Barton Layne, Chang Lui, Jamie Mitchell, Noel Nyman, Barindralal Pal, Bret Pettichord, Christiano Plini, Cynthia Sadler, and Beth Schmitz.

• We're indebted to Hans Buwalda, Elizabeth Hendrickson, Noel Nyman, Pat Schroeder, Harry Robinson, James Tierney, & James Whittaker for additional explanations of test architecture and stochastic testing.

• We also appreciate the assistance and hospitality of "Mentsville," a well-known and well-respected, but can't-be-named-here, manufacturer of mass-market devices that have complex firmware. Mentsville opened its records to us, providing us with details about a testing practice (Extended Random Regression testing) that's been evolving at the company since 1990.

• Finally, we thank Alan Jorgensen for explaining hostile data stream testing to us and providing equipment and training for us to use to extend his results.

Page 33

Typical Testing Tasks

• Analyze product & its risks
  – market
  – benefits & features
  – review source code
  – platform & associated software
• Develop testing strategy
  – pick key techniques
  – prioritize testing foci
• Design tests
  – select key test ideas
  – create test for the idea
• Run test first time (often by hand)
• Evaluate results
  – Report bug if test fails
• Keep archival records
  – trace tests back to specs
• Manage testware environment
• If we create regression tests:
  – Capture or code steps once test passes
  – Save "good" result
  – Document test / file
  – Execute the test
• Evaluate result
  – Report failure or
  – Maintain test case

Page 34

Automating Testing

• No testing tool covers this range of tasks
• We should understand that
  – "Automated testing" doesn't mean automated testing
  – "Automated testing" means Computer-Assisted Testing

Page 35

Automated GUI-Level Regression Testing

• Re-use old tests using tools like Mercury, Silk, Robot
• Low power
• High maintenance cost
• Significant inertia

INERTIA: The resistance to change that our development process builds into the project.

Page 36

The Critical Problem of Regression Testing

• Very few tests
• We are driven by the politics of scarcity:
  – too many potential tests
  – not enough time
• Every test is lovingly crafted, or should be, because we need to maximize the value of each test.

What if we could create, execute, and evaluate scrillions of tests? Would that change our strategy?

Page 37

Case Study: Extended Random Regression

• Welcome to "Mentsville", a household-name manufacturer, widely respected for product quality, who chooses to remain anonymous.
• Mentsville applies a wide range of tests to their products, including unit-level tests and system-level regression tests.
  – We estimate > 100,000 regression tests in the "active" library
• Extended Random Regression (ERR)
  – Tests taken from the pool of tests the program has passed in this build
  – The tests sampled are run in random order until the software under test fails (e.g. crash)
  – These tests add nothing to typical measures of coverage.
  – Should we expect these to find bugs?
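For illustration, a minimal sketch of Extended Random Regression in Python, assuming a hypothetical pool of already-passed tests (callables that raise on failure); tests run in random order, with no environment reset between them, until something breaks.

    import random

    def extended_random_regression(passed_tests, max_runs=1_000_000, seed=11):
        """Re-run already-passed regression tests in random order until the SUT fails."""
        rng = random.Random(seed)           # record the seed so the failing sequence can be replayed
        history = []
        for run in range(max_runs):
            test = rng.choice(passed_tests)
            history.append(test.__name__)
            try:
                test()                      # no environment reset between tests, so long-fuse bugs accumulate
            except Exception as failure:
                print(f"Failure after {run + 1} runs, in {test.__name__}: {failure}")
                return history              # the sequence itself is key troubleshooting data
        print("No failure observed")
        return history

    # Hypothetical pool: each entry is a regression test the build has already passed.
    def test_open_file(): pass
    def test_save_file(): pass
    def test_print_report(): pass

    extended_random_regression([test_open_file, test_save_file, test_print_report])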

Page 38

Extended Random Regression Testing

• Typical defects found include timing problems, memory corruption (including stack corruption), and memory leaks.
• Recent release: 293 reported failures exposed 74 distinct bugs, including 14 showstoppers.
• Mentsville's assessment is that ERR exposes problems that can't be found in less expensive ways.
  – troubleshooting of these failures can be very difficult and very expensive
  – wouldn't want to use ERR for basic functional bugs or simple memory leaks--too expensive.
• ERR has gradually become one of the fundamental techniques relied on by Mentsville
  – gates release from one milestone level to the next.

Page 39

Implications of ERR for Reliability Models

• Most models of software reliability make several common assumptions, including:
  – Every fault (perhaps, within a given severity class) has the same chance of being encountered as every other fault.
  – Probability of fault detection in a given period of time is directly related to the number of faults left in the program.
  (Source (example): Farr (1995), "Software Reliability Modeling Survey," in Lyu (ed.), Software Reliability Engineering.)
• Additionally, the following ideas are foreign to most models:
  a) There are different kinds of faults (different detection probabilities)
  b) There are different kinds of tests (different exposure probabilities)
  c) The power of one type of test can diminish over time, without a correlated loss of power of some other type of test.
  d) The probability of exposing a given kind of fault depends in large part on which type of test you're using.

ERR demonstrates (d), which implies (a) and (c).

Page 40

Summary So Far

• Traditional test techniques tie us to a small number of tests.
• Extended Random Regression exposes bugs the traditional techniques probably won't find.
• The results of Extended Random Regression provide another illustration of the weakness of current models of software reliability.

Page 41

Plan for the HVAT Research Project

• Capture an industry experience. We capture information to understand the technique, how it was used, the overall pattern of results, the technique user's beliefs about the types of errors it's effective at exposing, and some of its limitations. This is enough information to be useful, but not enough for a publishable case study. For that, we'd need more details about the corporation, project and results, and permission to publish details the company might consider proprietary.

• Create an open source, vendor-independent test tool that lets us do the same type of testing as the company did. Rather than merely describing the tool in a case study report, we will provide any interested person with a copy of it.

• Apply the tool to one, or preferably a few, open source product(s) in development. The industry experience shapes our work but our primary publication is a detailed description of the tool we built and the results we obtained, including the software under test (object and source), the project's development methods and lifecycle, errors found, and the project bug database, which includes bugs discovered using other methods.

• Evaluate the results in terms of what they teach us about software reliability modeling. Results we've seen so far pose difficulties for several popular models. We hope to develop a usable modification or replacement.

• Develop instructional materials to support learning about the test techniques and the assumptions and robustness of the current reliability models. This includes lecture notes, video lectures and demonstrations, exercises for the test tools, and a simulator for studying the reliability models, with notes and lectures, all freely downloadable from www.testingeducation.org.

Page 42

Ten Examples of HVAT

1. Extended random regression testing
2. Function equivalence testing (comparison to a reference function)
3. Comparison to a computational or logical model
4. Comparison to a heuristic prediction, such as prior behavior
5. Simulator with probes
6. State-transition testing without a state model (dumb monkeys)
7. State-transition testing using a state model (terminate on failure rather than after achieving some coverage criterion)
8. Functional testing in the presence of background load
9. Hostile data stream testing
10. Random inputs to protocol checkers

Page 43

A Structure for Thinking about HVAT

• INPUTS
  – What is the source for our inputs? How do we choose input values for the test?
  – ("Input" includes the full set of conditions of the test)
• OUTPUTS
  – What outputs will we observe?
• EVALUATION
  – How do we tell whether the program passed or failed?
• EXPLICIT MODEL?
  – Is our testing guided by any explicit model of the software, the user, the process being automated, or any other attribute of the system?
• WHAT ARE WE MISSING?
  – The test highlights some problems but will hide others.
• SEQUENCE OF TESTS
  – Does / should any aspect of test N+1 depend on test N?
• THEORY OF ERROR
  – What types of errors are we hoping to find with these tests?
• TROUBLESHOOTING SUPPORT
  – What data are stored? How else is troubleshooting made easier?
• BASIS FOR IMPROVING TESTS?
• HOW TO MEASURE PROGRESS?
  – How much, and how much is enough?
• MAINTENANCE LOAD / INERTIA?
  – Impact of / on change to the SUT
• CONTEXTS
  – When is this useful?

Page 44

Mentsville ERR and the Structure

• INPUTS:
  – taken from existing regression tests, which were designed under a wide range of criteria
• OUTPUTS
  – Mentsville: few of interest other than diagnostics
  – Others: whatever outputs were interesting to the regression testers, plus diagnostics
• EVALUATION STRATEGY
  – Mentsville: run until crash or other obvious failure
  – Others: run until crash or until mismatch between program behavior and prior results or model predictions
• EXPLICIT MODEL?
  – None
• WHAT ARE WE MISSING?
  – Mentsville: Anything that doesn't cause a crash
• SEQUENCE OF TESTS
  – ERR sequencing is random
• THEORY OF ERROR
  – bugs not easily detected by the regression tests: long-fuse bugs, such as memory corruption, memory leaks, timing errors
• TROUBLESHOOTING SUPPORT
  – diagnostics log, showing state of system before and after tests

Page 45

NEXT: Function Equivalence Testing

• Example from Florida Tech's Testing 2 final exam last fall:
  – Use test driven development to create a test tool that will test the Open Office spreadsheet by comparing it with Excel
  – (We used the COM interface for Excel and an equivalent interface for OO, and drove the API-level tests with a program written in Ruby, a simple scripting language)
  – Pick 10 functions in OO (and Excel). For each function:
    • Generate random input to the function
    • Compare the OO evaluation and Excel's
    • Continue until you find errors or are satisfied of the equivalence of the two functions.
  – Now test expressions that combine several of the tested functions
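For illustration, a minimal sketch of the expression step in Python; the oo_eval and excel_eval stand-ins simply evaluate the expression locally, where the real exercise would drive OpenOffice and Excel through their APIs.

    import random

    def oo_eval(expr):
        return eval(expr)                   # stand-in for evaluating the expression in OpenOffice
    def excel_eval(expr):
        return eval(expr)                   # stand-in for evaluating the same expression in Excel

    def random_expression(rng, depth=2):
        """Combine already-tested functions (here min, max, abs) into a random expression."""
        if depth == 0:
            return str(rng.randint(-1000, 1000))
        fn = rng.choice(["min", "max", "abs"])
        n_args = 1 if fn == "abs" else rng.randint(2, 4)
        args = ", ".join(random_expression(rng, depth - 1) for _ in range(n_args))
        return f"{fn}({args})"

    rng = random.Random(2004)
    for _ in range(10_000):
        expr = random_expression(rng)
        if oo_eval(expr) != excel_eval(expr):
            print("Implementations disagree on", expr)
            break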

Page 46

Function Equivalence Testing

• INPUTS:
  – Random
• OUTPUTS
  – We compare output with the output from a reference function. In practice, we also independently check a small sample of calculations for plausibility
• EVALUATION STRATEGY
  – Output fails to match, or fails to match within delta, or testing stops from crash or other obvious misbehavior.
• EXPLICIT MODEL?
  – The reference function is, in relevant respects, equivalent to the software under test.
  – If we combine functions (testing expressions rather than single functions), we need a grammar or other basis for describing combinations.
• WHAT ARE WE MISSING?
  – Anything that the reference function can't generate
• SEQUENCE OF TESTS
  – Tests are typically independent
• THEORY OF ERROR
  – Incorrect data processing / storage / calculation
• TROUBLESHOOTING SUPPORT
  – Inputs saved
• BASIS FOR IMPROVING TESTS?

Page 47

Oracle comparisons are heuristic: We compare only a few result attributes

[Diagram, modified from notes by Doug Hoffman: the same intended test inputs, additional precondition data, precondition program state, and environmental inputs feed both the System Under Test and the Test Oracle. Each produces test results consisting of postcondition data, postcondition program state, and environmental results, and the two sets of results are compared.]

Page 48

What is this technique useful for?

• Hoffman's MASPAR Square Root bug
• Pentium FDIV bug

Page 49

Summary So Far

• Traditional test techniques tie us to a small number of tests.
• Extended Random Regression exposes bugs the traditional techniques probably won't find.
• The results of Extended Random Regression provide another illustration of the weakness of current models of software reliability.
• ERR is just one example of a class of high volume tests
• High volume tests are useful for:
  – exposing delayed-effect bugs
  – automating tedious comparisons, for any testing task that can be turned into tedious comparisons
• Test oracles are useful, but incomplete.
  – If we rely on them too heavily, we'll miss bugs

Page 50

Hostile Data Stream Testing

• Pioneered by Alan Jorgensen (FIT, recently retired)
• Take a "good" file in a standard format (e.g. PDF)
  – corrupt it by substituting one string (such as a really, really huge string) for a much shorter one in the file
  – feed it to the application under test
  – Can we overflow a buffer?
• Corrupt the "good" file in thousands of different ways, trying to distress the application under test each time.
• Jorgensen and his students showed serious security problems in some products, primarily using brute force techniques.
• The method seems appropriate for application of genetic algorithms or other AI to optimize the search.
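For illustration, a minimal sketch of this kind of file-corruption test in Python; the base file name and the viewer_under_test command line are hypothetical, and a crash or hang surfaces as an exception from the launcher.

    import random, subprocess

    BASE_FILE = "good_sample.pdf"           # hypothetical known-good file

    def open_in_viewer(path, timeout=30):
        """Hypothetical launcher for the application under test; a crash or hang
        surfaces as an exception (non-zero exit or timeout)."""
        subprocess.run(["viewer_under_test", path], check=True, timeout=timeout)

    def hostile_data_stream_test(trials=10_000, seed=13):
        rng = random.Random(seed)
        good = open(BASE_FILE, "rb").read()
        for trial in range(trials):
            # Replace one short slice of the good file with a really, really huge string.
            start = rng.randrange(len(good))
            length = rng.randint(1, 16)
            corrupted = good[:start] + b"A" * 100_000 + good[start + length:]
            path = f"corrupted_{trial}.pdf"
            with open(path, "wb") as f:
                f.write(corrupted)
            try:
                open_in_viewer(path)
            except Exception as failure:
                print(f"Trial {trial}: viewer failed on mutation at offset {start}: {failure}")

    hostile_data_stream_test()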

Page 51

Hostile Data Stream and HVAT

• INPUTS:
  – A series of random mutations of the base file
• OUTPUTS
  – Simple version--not of much interest
• EVALUATION STRATEGY
  – Run until crash, then investigate
• EXPLICIT MODEL?
  – None
• WHAT ARE WE MISSING?
  – Data corruption, display corruption, anything that doesn't stop us from further testing
• SEQUENCE OF TESTS
  – Independent selection (without repetition). No serial dependence.
• THEORY OF ERROR
  – What types of errors are we hoping to find with these tests?
• TROUBLESHOOTING SUPPORT
  – What data are stored? How else is troubleshooting made easier?
• BASIS FOR IMPROVING TESTS?
  – Simple version: hand-tuned
  – Seemingly obvious candidate for GA's and other AI

Page 52

What does this one have to do with reliability models?

Maybe nothing, in the traditional reliability sense. The question addressed by this technique is not how the program will fail in normal use, but how it fares in the face of determined attack.

Page 53

Phone System: Simulator with Probes

Telenova Station Set 1. Integrated voice and data. 108 voice features, 110 data features. 1985.

Page 54

Simulator with Probes

[Figure: the Telenova station set, with its context-sensitive display, 10-deep hold queue, and 10-deep wait queue.]

Page 55

Simulator with Probes

The bug that triggered the simulation looked like this:
• Beta customer (a stock broker) reported random failures
  – Could be frequent at peak times
  – An individual phone would crash and reboot, with other phones crashing while the first was rebooting
  – On a particularly busy day, service was disrupted all (East Coast) afternoon
• We were mystified:
  – All individual functions worked
  – We had tested all lines and branches.
• Ultimately, we found the bug in the hold queue
  – Up to 10 calls on hold, each adds record to the stack
  – Initially, checked stack whenever call was added or removed, but this took too much system time
  – Stack has room for 20 calls (just in case)
  – Stack reset (forced to zero) when we knew it should be empty
  – The error handling made it almost impossible for us to detect the problem in the lab. Because we couldn't put more than 10 calls on the stack (unless we knew the magic error), we couldn't get to 21 calls to cause the stack overflow.

Page 56

Simulator with Probes

[Simplified state diagram of a call: Idle, Ringing, Connected, On Hold, with "Caller hung up" and "You hung up" transitions back to Idle.]

Page 57

Simulator with Probes

[Same simplified state diagram: Idle, Ringing, Connected, On Hold, Caller hung up, You hung up.]

Cleaned up everything but the stack. Failure was invisible until crash. From there, held calls were hold-forwarded to other phones, causing a rotating outage.

Page 58

Simulator with Probes

Having found and fixed the hold-stack bug, should we assume that we've taken care of the problem, or that if there is one long-sequence bug, there will be more?

Hmmm… If you kill a cockroach in your kitchen, do you assume you've killed the last bug? Or do you call the exterminator?

Page 59

Simulator with Probes

• Telenova (*) created a simulator
  – generated long chains of random events, emulating input to the system's 100 phones
  – could be biased, to generate more holds, more forwards, more conferences, etc.
• Programmers added probes (non-crashing asserts that sent alerts to a printed log) selectively
  – can't probe everything b/c of timing impact
• After each run, programmers and testers tried to replicate failures, fix anything that triggered a message. After several runs, the logs ran almost clean.
• At that point, shift focus to next group of features.
• Exposed lots of bugs

(*) By the time this was implemented, I had joined Electronic Arts.
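For illustration, a minimal sketch of a probe in that spirit: a non-crashing assert that logs a violation and lets the system keep running. The hold-queue function shown is hypothetical.

    import logging

    logging.basicConfig(filename="probes.log", level=logging.WARNING)

    def probe(condition, message, **context):
        """Non-crashing assert: log the violation and keep running."""
        if not condition:
            logging.warning("PROBE FAILED: %s | context=%r", message, context)

    # Hypothetical use inside hold-queue code like the bug described earlier:
    def put_call_on_hold(hold_stack, call, capacity=10):
        probe(len(hold_stack) < capacity,
              "hold stack already at capacity before push",
              depth=len(hold_stack), call=call)
        hold_stack.append(call)

    hold_stack = []
    for call_id in range(12):               # drive past the limit so the probe fires
        put_call_on_hold(hold_stack, call_id)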

Page 60

Simulator with Probes

• INPUTS:
  – Random, but with biasable transition probabilities.
• OUTPUTS
  – Log messages generated by the probes. These contained some troubleshooting information (whatever the programmer chose to include).
• EVALUATION STRATEGY
  – Read the log, treat any event leading to a log message as an error.
• EXPLICIT MODEL?
  – At any given state, the simulator knows what the SUT's options are, but it doesn't verify the predicted state against actual state.
• WHAT ARE WE MISSING?
  – Any behavior other than log
• SEQUENCE OF TESTS
  – Ongoing sequence, never reset.
• THEORY OF ERROR
  – Long-sequence errors (stack overflow, memory corruption, memory leak, race conditions, resource deadlocks)
• TROUBLESHOOTING SUPPORT
  – Log messages
• BASIS FOR IMPROVING TESTS?
  – Clean up logs after each run by eliminating false alarms and fixing bugs. Add more tests and log details for hard-to-repro errors

Page 61

Summary

• Traditional test techniques tie us to a small number of tests.
• Extended random regression and long simulations expose bugs the traditional techniques probably won't find.
• Extended random regression and simulations using probes provide another illustration of the weakness of current models of software reliability.
• ERR is just one example of a class of high volume tests
• High volume tests are useful for:
  – exposing delayed-effect bugs
    • embedded software
    • life-critical software
    • military applications
    • operating systems
    • anything that isn't routinely rebooted
  – automating tedious comparisons, for any testing task that can be turned into tedious comparisons
• Test oracles are incomplete.
  – If we rely on them too heavily, we'll miss bugs

Page 62

Where We're Headed

1. Enable the adoption and practice of this technique
  – Find and describe compelling applications (motivate adoption)
  – Build an understanding of these as a class, with differing characteristics
    • vary the characteristics to apply to a new situation
    • further our understanding of the relationship between context and the test technique characteristics
  – Create usable examples:
    • free software, readable, sample code
    • applied well to an open source program
2. Critique and/or fix the reliability models

Page 63

Two More Examples

• We don't have time to discuss these in the talk
• These just provide a few more illustrations that you might work through in your spare time.

Page 64

Here are two more examples. We don't have enough time for these in this talk, but they are in use in several communities.

Page 65

State Transition Testing

• State transition testing is stochastic. It helps to distinguish between independent random tests and stochastic tests.
• Random Testing
  – Random (or statistical or stochastic) testing involves generating test cases using a random number generator. Individual test cases are not optimized against any particular risk. The power of the method comes from running large samples of test cases.
• Independent Random Testing
  – Our interest is in each test individually; the test before and the test after don't matter.
• Stochastic Testing
  – A stochastic process involves a series of random events over time
    • The stock market is an example
    • The program may pass individual tests when run in isolation: The goal is to see whether it can pass a large series of the individual tests.

Page 66

State Transition Tests Without a State Model: Dumb Monkeys

• Phrase coined by Noel Nyman. Many prior uses (UNIX kernel, Lisa, etc.)
• Generate a long sequence of random inputs driving the program from state to state, but without a state model that allows you to check whether the program has hit the correct next state.
  – Executive Monkey: (dumbest of dumb monkeys) Press buttons randomly until the program crashes.
  – Clever Monkey: No state model, but knows other attributes of the software or system under test and tests against those:
    • Continues until crash or a diagnostic event occurs. The diagnostic is based on knowledge of the system, not on internals of the code. (Example: button push doesn't push—this is system-level, not application level.)
    • Simulator-with-probes is a clever monkey

• Nyman, N. (1998), "Application Testing with Dumb Monkeys," STAR West.
• Nyman, N., "In Defense of Monkey Testing," http://www.softtest.org/sigs/material/nnyman2.htm

Page 67

Dumb Monkey

• INPUTS:
  – Random generation.
  – Some commands or parts of system may be blocked (e.g. format disk)
• OUTPUTS
  – May ignore all output (executive monkey) or all but the predicted output.
• EVALUATION STRATEGY
  – Crash, other blocking failure, or mismatch to a specific prediction or reference function.
• EXPLICIT MODEL?
  – None
• WHAT ARE WE MISSING?
  – Most output. In practice, dumb monkeys often lose power quickly (i.e. the program can pass it even though it is still full of bugs).
• SEQUENCE OF TESTS
  – Ongoing sequence, never reset
• THEORY OF ERROR
  – Long-sequence bugs
  – Specific predictions if some aspects of SUT are explicitly predicted
• TROUBLESHOOTING SUPPORT
  – Random number generator's seed, for reproduction.
• BASIS FOR IMPROVING TESTS?

Page 68

State Transitions: State Models (Smart Monkeys)

• For any state, you can list the actions the user can take, and the results of each action (what new state, and what can indicate that we transitioned to the correct new state).
• Randomly run the tests and check expected against actual transitions.
• See www.geocities.com/model_based_testing/online_papers.htm
• The most common state model approach seems to drive to a level of coverage, using the Chinese Postman or another algorithm to achieve all sequences of length N. (A lot of work along these lines at Florida Tech)
  – The high volume approach runs sequences until failure appears or the tester is satisfied that no failure will be exposed.
• Coverage-oriented testing fails to account for the problems associated with multiple runs of a given feature or combination.

• Al-Ghafees, M. A. (2001). Markov Chain-based Test Data Adequacy Criteria. Unpublished Ph.D., Florida Institute of Technology, Melbourne, FL. Summary at http://ecommerce.lebow.drexel.edu/eli/2002Proceedings/papers/AlGha180Marko.pdf
• Robinson, H. (1999a), "Finite State Model-Based Testing on a Shoestring," STAR Conference West. Available at www.geocities.com/model_based_testing/shoestring.htm.
• Robinson, H. (1999b), "Graph Theory Techniques in Model-Based Testing," International Conference on Testing Computer Software. Available at www.geocities.com/model_based_testing/model-based.htm.
• Whittaker, J. (1997), "Stochastic Software Testing," Annals of Software Engineering, 4, 115-131.

Page 69

State-Model Based Testing

• INPUTS:
  – Random, but guided or constrained by a state model
• OUTPUTS
  – The state model predicts values for one or more reference variables that tell us whether we reached the expected state.
• EVALUATION STRATEGY
  – Crash or other obvious failure.
  – Compare to prediction from state model.
• EXPLICIT MODEL?
  – Detailed state model or simplified model: operational modes.
• WHAT ARE WE MISSING?
  – The test highlights some relationships and hides others.
• SEQUENCE OF TESTS
  – Does any aspect of test N+1 depend on test N?
• THEORY OF ERROR
  – Transitions from one state to another are improperly coded
  – Transitions from one state to another are poorly thought out (we see these at test design time, rather than in execution)
• TROUBLESHOOTING SUPPORT
  – What data are stored? How else is troubleshooting made easier?
• BASIS FOR IMPROVING TESTS?