TestRank: Eliminating Waste from Test-Driven Development


Page 1: TestRank: Eliminating Waste from Test-Driven Development

TestRank: Eliminating Waste from

Test-Driven Development

Presented by Hagai Cibulski, Tel Aviv University

Advanced Software Tools Research Seminar 2010

Page 2: TestRank: Eliminating Waste from Test-Driven Development

Elevator Pitch

TDD bottleneck: repeated runs of an ever-growing test suite

Lower productivity; casualness in following TDD

Loss of quality

TestRank finds the appropriate tests to run after a given code edit

Run only a fraction of the tests in each cycle: eliminate waste while keeping a high bug-detection rate

Page 3: TestRank: Eliminating Waste from Test-Driven Development

Agenda

In this talk we will:

learn Test-Driven Development (TDD) in two minutes

observe insights into the nature of TDD tests

identify the TDD bottleneck

define the Regression Test Selection (RTS) problem for the TDD context

review past work on RTS

see alternative Program Analysis techniques: Dynamic PA and Natural Language PA

present TestRank – an RTS tool for TDD

Page 4: TestRank: Eliminating Waste from Test-Driven Development

Test-Driven Development

Agile software development methodology

Short development iterations

Pre-written test cases define functionality

Each iteration: code to pass that iteration's tests

Page 5: TestRank: Eliminating Waste from Test-Driven Development

Test-Driven Development Cycle

Repeat:

Add a test

Run tests and see the new one fail

Write some code

Run tests and see them succeed

Refactor code

Run tests and see them succeed
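
To make the cycle concrete, here is a minimal JUnit 4 sketch of one red-green step (the Calculator class and test are hypothetical, used only for illustration):

    import static org.junit.Assert.assertEquals;
    import org.junit.Test;

    // Red step: write this test first; it fails (it does not even compile)
    // until Calculator.add exists.
    public class CalculatorTest {
        @Test
        public void addsTwoNumbers() {
            assertEquals(5, new Calculator().add(2, 3));
        }
    }

    // Green step: write just enough production code to make the test pass,
    // then refactor with the test as a safety net.
    class Calculator {
        int add(int a, int b) {
            return a + b;
        }
    }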

Page 6: TestRank: Eliminating Waste from Test-Driven Development

TDD Tests - Observations

TDD tests define functionality

TDD code is highly factored

Therefore:

A single test may cross multiple units of code

A single code unit implements functionalities defined in multiple tests

Page 7: TestRank: Eliminating Waste from Test-Driven Development

Test Suite - Observations

Tests are added over time: 5 developers × 1 test a day × 240 days = 1,200 tests

1,200 tests × 200 ms = 4 minutes

Integrated into nightly/integration builds: committed changes are covered nightly/continuously

Integrated into the team developers' IDEs: programmers can run isolated tests quickly

Page 8: TestRank: Eliminating Waste from Test-Driven Development

The Motivation: Early Detection of Software Bugs

A developer edits a block of code

Using strict unit tests as a safety net: finding the unit test to run is straightforward (1-1 or 1-n)

Using TDD functional tests as a safety net: finding the tests to run? (n-n)

Where is the code-test correlation?

Must we run the entire test suite? That might not be cost effective

Delay running the entire test suite? That delays the detection of software bugs

Bugs become harder to diagnose the further the symptom is removed from the cause

Page 9: TestRank: Eliminating Waste from Test-Driven Development

TestRank Problem Definition

Given:

P – program under test

T – test suite (assuming all tests in T pass on P)

Q – query about location (method) L

Find: a ranking t1, t2, …, tn

s.t. if a change in L causes test ti to fail, i is minimal

Application: select the top (e.g. 20%) of the ranking

Goal: achieve (1 − ε) bug detection, s.t. ε is minimal
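
As an illustration of the application step, a small sketch of selecting the top fraction of a ranking (names are hypothetical, not TestRank's actual API):

    import java.util.List;

    // Given a ranking of tests for the edited location, run only the top fraction
    // (e.g. 20%) immediately; the remaining tests still run in the nightly build.
    class TopFractionSelector {
        static List<String> selectTop(List<String> rankedTests, double fraction) {
            int k = (int) Math.ceil(rankedTests.size() * fraction);
            return rankedTests.subList(0, Math.min(Math.max(k, 1), rankedTests.size()));
        }
    }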

Page 10: TestRank: Eliminating Waste from Test-Driven Development

TestRank – Application

Rank the tests such that running the top 20% ranked tests will reveal a failing test with 80% probability

20% is our promise to the developer

80% is justified assuming eventually all tests will be run

Usually a new bug's first chance of being detected is when all the tests are run (typically on the nightly build)

Don't waste time reconciling which (or whose) coding changes are responsible for new bugs

The bugs never get checked into the master source

Page 11: TestRank: Eliminating Waste from Test-Driven Development

Related Work

Past Work on Test Suite Optimization

Test Selection:

Lower total cost by selecting an appropriate subset of the existing test suite based on information about the program, modified version, and test suite

Usually conservative ("safe") analyses

Test Prioritization: schedule test cases in an order that increases the rate of fault detection (e.g. by decreasing coverage delta)

Page 12: TestRank: Eliminating Waste from Test-Driven Development

TestTube: a system for selective regression testing

Chen, Rosenblum and Vo, 1994

"Safe" RTS: identify all global entities that test t covers

Assumes a deterministic system

Coarse level of granularity – C functions

Instrumentation: t → {fi}; closure(f) = global vars, types, and macros used by f

A test case t in T is selected for retesting P' if:

diff(P,P') ∩ closure(trace(t)) ≠ ∅ (see the sketch at the end of this slide)

Reduction of 50%+ in the number of test cases – but only in the case of "feature functions"…

Nondeterministic version with the "transitive closure" technique: 0% reduction for "core functions"
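
A schematic sketch of that selection rule (entity names are placeholders, not TestTube's actual data structures):

    import java.util.Collections;
    import java.util.Set;

    class SafeSelection {
        // A test t is re-run on P' iff the entities in its traced closure intersect
        // the entities changed between P and P': diff(P,P') ∩ closure(trace(t)) ≠ ∅.
        static boolean shouldRerun(Set<String> changedEntities, Set<String> closureOfTrace) {
            return !Collections.disjoint(changedEntities, closureOfTrace);
        }
    }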

Page 13: TestRank: Eliminating Waste from Test-Driven Development

DejaVu: a safe, efficient regression test selection technique

Rothermel and Harrold, 1997

Conservative + granularity at statement level, improving precision

Control-flow based:

CFG for each procedure

Instrumentation: t → {e}

Simultaneous DFS on G, G' for each procedure and its modified version in P, P'

A test case t in T is selected for retesting P' if its execution trace contains a "dangerous" edge

A lot of work goes into calculating diff(P,P')

Might be too expensive to be used on large systems

Results: Two studies found average reduction of 44.4% and 95%

Page 14: TestRank: Eliminating Waste from Test-Driven Development

DejaVOO - Two Phase Technique

A comparative study [Bible, Rothermel & Rosenblum, 2001] found that TestTube/DejaVu exhibit a trade-off of efficiency versus precision

Analysis time + test-execution time ≈ constant

Scaling Regression Testing to Large Software Systems

Orso, Shi, and Harrold (DejaVu), 2004

JBoss = 1 MLOC

Efficient approach: selected too many tests

Precise approach: analysis took too much time

In each case: analysis + execution > naïve retest-all

Implementing a technique for Java programs that is "safe", precise, and yet scales to large systems

Phase #1: fast, high-level analysis to identify the parts of the system that may be affected by the changes

Phase #2: low-level analysis of these parts to perform precise test selection

Page 15: TestRank: Eliminating Waste from Test-Driven Development

DejaVOO Results

Considerable increase in efficiency

Same precision

Page 16: TestRank: Eliminating Waste from Test-Driven Development

Standard RTS vs. RTS for TDD

TestTube / DejaVu (standard RTS):

Testing phase; the input is a version to be tested – after everyone has checked in

Conservative; TestTube: low precision, DejaVu: higher precision (still, "safety" is overrated)

TestRank (RTS for TDD):

Implementation phase (TDD); looks at a single block of code – before check-in

High precision is the goal (sometimes not reporting a test may be OK)

Page 17: TestRank: Eliminating Waste from Test-Driven Development

Commercial/Free Tools

Google Testar: selective testing tool for Java

Works with JUnit; records coverage by instrumenting bytecode

Clover's "Test Optimization": a coverage tool with a new test optimization feature

Speeds up CI builds; leverages "per-test" coverage data for selective testing

JUnitMax by Kent Beck: a continuous test runner for Eclipse

Supports test prioritization to encourage fast failures: run short tests first, run recently failed (and newly written) tests first

JTestMe: another selective testing tool for Java; uses AspectJ, method-level coverage

Infinitest: a continuous test runner for JUnit tests

Whenever you make a change, Infinitest runs tests for you. It selects tests intelligently, and runs the ones you need. Uses static analysis; will not work with dynamic/reflection-based invocations

Page 18: TestRank: Eliminating Waste from Test-Driven Development

CodePsychologist

Locating Regression Bugs. Nir, Tyszberowicz and Yehudai, 2007

Same problem in reverse

Given a checkpoint C that failed, and the source code S of the AUT, find the places (changes) in the code S that cause C to fail

System testing

UI level

Using script/manual

Checkpoint C defined at UI level

Page 19: TestRank: Eliminating Waste from Test-Driven Development

CodePsychologist – Code Lines Affinity

Checkpoint: Select "clerk 1" from the clerk tree (clerk number 2). Go to the next clerk. The next clerk is "clerk 3".

Page 20: TestRank: Eliminating Waste from Test-Driven Development

CodePsychologist – The Affinity Problem

Example: the word group {rain, green, red, coat} has a higher affinity to {red, flower, white, black, cloud} than the word group {train, table, love} does.

Page 21: TestRank: Eliminating Waste from Test-Driven Development

CodePsychologist – Words Affinity

Taxonomy of words: a graph where each node represents a synonym set

WordNet: An electronic lexical database, 1998

WordNet-based semantic similarity measurement. Simpson & Dao, 2005

Page 22: TestRank: Eliminating Waste from Test-Driven Development

CodePsychologist – Word Groups Affinity

Page 23: TestRank: Eliminating Waste from Test-Driven Development

TestRank Marketecture

[Architecture diagram: the program under test P and test suite T feed the Dynamic & Static Analyses, which produce Correlation Scores and a Locator; a query Q (file:line) goes to the Query Engine, which returns a ranking t1, t2, t3, …]

Page 24: TestRank: Eliminating Waste from Test-Driven Development

TestRank - Preprocessing Phase

Pre-compute test/unit correlation during a run of the test suite by tracing the tests through the production code

AspectJ

Use coverage as a basic soundness filter

Collect dynamic metrics: control flow, data flow

Look for natural-language clues in the source text (WordNet)
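
A sketch of how such tracing could look with AspectJ (annotation style). This is not the actual TestRank aspect; the package pattern and the TraceStore class are assumptions for illustration:

    import org.aspectj.lang.JoinPoint;
    import org.aspectj.lang.annotation.Aspect;
    import org.aspectj.lang.annotation.Before;
    import org.aspectj.lang.annotation.Pointcut;
    import java.util.HashMap;
    import java.util.Map;

    // Records, for every production method executed while a JUnit test runs,
    // how many times that test reached it.
    @Aspect
    public class TestTracer {
        private String currentTest;

        @Pointcut("execution(@org.junit.Test * *(..))")
        public void testMethod() {}

        @Pointcut("execution(* com.example.app..*.*(..))")  // hypothetical package
        public void productionMethod() {}

        @Before("testMethod()")
        public void rememberTest(JoinPoint jp) {
            currentTest = jp.getSignature().toLongString();
        }

        @Before("productionMethod() && cflow(testMethod())")
        public void countExecution(JoinPoint jp) {
            TraceStore.count(currentTest, jp.getSignature().toLongString());
        }
    }

    // Minimal in-memory store keyed by (test, method) pair; hypothetical.
    class TraceStore {
        static final Map<String, Integer> counts = new HashMap<>();
        static void count(String test, String method) {
            counts.merge(test + " -> " + method, 1, Integer::sum);
        }
    }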

Page 25: TestRank: Eliminating Waste from Test-Driven Development

TestRank – Online Phase

Use correlation data during code editing to expose to the developer a list of tests which might conflict with the block of code currently being edited

Sorted in descending order of correlation

Developers can run just the specific functional tests

Page 26: TestRank: Eliminating Waste from Test-Driven Development

Dynamic PA

Execution Count Predictor: how many times was this method called during the execution stemming from test t?

Call Count Predictor: how many distinct calls to this method during the execution stemming from test t?

Normalize to [0, 1]: score = c / (c + 1)
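
A small sketch of these two count-based predictors (names and data structures are hypothetical); both normalize a raw count c for a (test, method) pair via c / (c + 1):

    import java.util.HashSet;
    import java.util.List;

    class CountPredictors {
        // Execution Count: total number of times the method ran under test t.
        static double executionCountScore(List<String> callSitesObserved) {
            int c = callSitesObserved.size();
            return c / (c + 1.0);
        }

        // Call Count: number of distinct call sites that invoked the method under test t.
        static double callCountScore(List<String> callSitesObserved) {
            int c = new HashSet<>(callSitesObserved).size();
            return c / (c + 1.0);
        }
    }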

Page 27: TestRank: Eliminating Waste from Test-Driven Development

More Dynamic PA – The Stack

Two Stack Frames Count Predictor: how many distinct configurations of the calling frame and the frame before that on the call stack?

Stack Count Predictor: how many distinct call-stack configurations?

Stack Depth Sum Predictor: sum the inverse depth of the call stack at each execution of this method stemming from test t.
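
A sketch of the Stack Depth Sum predictor just described (class name hypothetical): executions close to the test itself contribute more than deeply nested ones.

    import java.util.List;

    class StackDepthSumPredictor {
        static double score(List<Integer> stackDepthsObserved) {
            double sum = 0.0;
            for (int depth : stackDepthsObserved) {
                sum += 1.0 / depth;   // depth >= 1 at any traced execution
            }
            return sum;
        }
    }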

Page 28: TestRank: Eliminating Waste from Test-Driven Development

Dynamic PA – Data Flow

Value Propagation Predictor: compare values of simple-typed arguments (and the return value) between those flowing out of the test and those reaching the method under test.

Size of intersection between the two sets of values.

For each test/method pair, find the maximum intersection m.

score = m / (m+1)
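
A sketch of this predictor (class name hypothetical): intersect the two value sets and normalize the intersection size m via m / (m + 1).

    import java.util.HashSet;
    import java.util.Set;

    class ValuePropagationPredictor {
        static double score(Set<Object> valuesFromTest, Set<Object> valuesAtMethod) {
            Set<Object> common = new HashSet<>(valuesFromTest);
            common.retainAll(valuesAtMethod);
            int m = common.size();
            return m / (m + 1.0);
        }
    }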

Page 29: TestRank: Eliminating Waste from Test-Driven Development

Natural Language PA

Adapted CodePsychologist Algorithm.

Coverage soundness filter (execution count > 0).

For each test/method pair, look for:

Similar methodName()

"Similar literals"

// Similar comments

Similar words extracted from meaningfulIdentifierNames

Page 30: TestRank: Eliminating Waste from Test-Driven Development

NL analysis

During tracing, build a SourceElementLocator: fileName × beginLine → ElementInfo{signature, begin, end, WordGroup}

For each source file, extract words and literals and map them by line numbers.

Literals are whole identifiers, strings and numbers

Words are extracted from identifiers, e.g. assuming_namingConventions → {assuming, naming, conventions}

For each method:

include the comments before the method

collect the group of words and literals mapped to line numbers between the beginning and the end of the method
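
A sketch of the word-extraction step described above (class name hypothetical): split an identifier on underscores and camel-case boundaries into lower-case words.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Locale;

    class IdentifierSplitter {
        static List<String> words(String identifier) {
            List<String> out = new ArrayList<>();
            // Split on "_" and on a lower-to-upper case transition.
            for (String part : identifier.split("_|(?<=[a-z0-9])(?=[A-Z])")) {
                if (!part.isEmpty()) {
                    out.add(part.toLowerCase(Locale.ROOT));
                }
            }
            return out;
        }
    }
    // Example: words("assuming_namingConventions") -> [assuming, naming, conventions]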

Page 31: TestRank: Eliminating Waste from Test-Driven Development

NLPA – Word Group Affinity

For each test/method pair (t, m), locate the two code elements and get two word groups wg(t), wg(m)

Calculate GrpAff(wg(t), wg(m)) using the adapted CodePsychologist algorithm:

Separate words from literals, compute GrpAff for each type separately and take the average affinity.

Filter out the 15% most common words in the text.

Page 32: TestRank: Eliminating Waste from Test-Driven Development

TF-IDF

Term Frequency × Inverse Document Frequency

Balances the relative frequency of the word in a particular method with its overall frequency

w occurs n_w,p times in a method p, and there are a total of N_p terms in the method

w occurs in d_w methods, and there are a total of D methods in the traces

tfidf(w, p) = tf(w, p) × idf(w) = (n_w,p / N_p) × log(D / d_w)
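
A sketch of that weighting (the data structures are assumptions, not TestRank's actual ones): termCountsInMethod maps each word to n_w,p for one method p, totalMethods is D, and methodsContainingWord maps each word to d_w.

    import java.util.Map;

    class TfIdf {
        static double weight(String w, Map<String, Integer> termCountsInMethod,
                             int totalMethods, Map<String, Integer> methodsContainingWord) {
            int nwp = termCountsInMethod.getOrDefault(w, 0);
            int np = termCountsInMethod.values().stream().mapToInt(Integer::intValue).sum();
            int dw = Math.max(1, methodsContainingWord.getOrDefault(w, 1)); // avoid division by zero
            double tf = (np == 0) ? 0.0 : (double) nwp / np;
            double idf = Math.log((double) totalMethods / dw);
            return tf * idf;
        }
    }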

Page 33: TestRank: Eliminating Waste from Test-Driven Development

NLPA – Weighted Group Affinity

AsyGrpAff'(A, B) = (1/n) · Σ_{1 ≤ i ≤ n} [ max{ WrdAff(a_i, b_j) | 1 ≤ j ≤ m } · tfidf²(a_i, A) · factor(a_i) ]  (*)

(*) Words appearing in the method name are given a ×10 weight factor
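
A sketch of this weighted asymmetric group affinity (names hypothetical). The word affinity, tf-idf and method-name factor are passed in as functions because their implementations (WordNet similarity, tf-idf tables) are not shown here:

    import java.util.List;
    import java.util.function.BiFunction;
    import java.util.function.ToDoubleFunction;

    class WeightedGroupAffinity {
        static double asyGrpAff(List<String> a, List<String> b,
                                BiFunction<String, String, Double> wrdAff,
                                ToDoubleFunction<String> tfidf,
                                ToDoubleFunction<String> factor) {
            double sum = 0.0;
            for (String ai : a) {
                double best = 0.0;
                for (String bj : b) {
                    best = Math.max(best, wrdAff.apply(ai, bj));   // max over b_j
                }
                double t = tfidf.applyAsDouble(ai);
                sum += best * t * t * factor.applyAsDouble(ai);    // tfidf squared, name factor
            }
            return a.isEmpty() ? 0.0 : sum / a.size();             // 1/n normalization
        }
    }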

Page 34: TestRank: Eliminating Waste from Test-Driven Development

Synthetic Experiment – Code Base

Log4J

Apache’s open source logging project

33.3KLOC

8.4K statements

252 test methods

Used CoreTestSuite = 201 test methods

1,061 actual test/method pairs traced

Page 35: TestRank: Eliminating Waste from Test-Driven Development

Synthetic Experiment – Performance

CPU: Intel Core2-6320 1.86 GHz; RAM: 2 GB

Preprocessing:

Dynamic PA ≈ 6 sec

Natural Language PA ≈ another 12 sec

Creates two database files:

affinities.ser ~1.1 MB

testrank.ser ~2 MB

Query < 1 sec

Page 36: TestRank: Eliminating Waste from Test-Driven Development

Synthetic Experiment – Method

identified "core methods" covered by 20-30 tests each.manually mutated four methods in order to get a test failing.got ten test failures

getLoggerRepository {testTrigger, testIt}setDateFormat {testSetDateFormatNull, testSetDateFormatNullString}getRenderedMessage {testFormat, testFormatWithException…. 3 more}getLogger {testIt}

LogManager.getLoggerRepository covered by 30 testsPlanted Bug: Removed the “if” condition:// if (repositorySelector == null) { repositorySelector = new DefaultRepositorySelector(new NOPLoggerRepository()); guard = null; LogLog.error("LogMananger.repositorySelector was null likely due to error in class reloading.");// } return repositorySelector.getLoggerRepository();

actual:Errors:

SMTPAppenderTest.testTriggerFailures:

TelnetAppenderTest.testIt

Page 37: TestRank: Eliminating Waste from Test-Driven Development

Synthetic Experiment – Method (2)

pairs (mi, ti) of mutated method and actual failing test

e.g. input file (descriptor of 3 such pairs)

actual_1.txt:

LogManager.java:174

LoggerRepository org.apache.log4j.LogManager.getLoggerRepository()

void org.apache.log4j.net.SMTPAppenderTest.testTrigger()

void org.apache.log4j.net.TelnetAppenderTest.testIt()

Page 38: TestRank: Eliminating Waste from Test-Driven Development

Synthetic Experiment – Method (3)

Reverted all mutations back to the original code.

Ran TestRank preprocessing.

For each pair (m_i, t_i), ran a query on m_i and compared t_i to each TestRank predictor's ranking (t_i1, t_i2, …, t_im).

For predictor p, let the actual failed test's relative rank be RR_p = j/m, where i_j = i.
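
A sketch of this relative-rank metric (class name hypothetical): the position of the actual failing test within a predictor's ranking, divided by the number of ranked tests.

    import java.util.List;

    class RelativeRank {
        static double of(List<String> ranking, String actualFailingTest) {
            int j = ranking.indexOf(actualFailingTest) + 1;   // 1-based rank position
            return (double) j / ranking.size();               // RR_p = j / m
        }
    }
    // Example: if the failing test is ranked 3rd out of 30 covering tests, RR = 0.1,
    // i.e. running only the top 10% of the ranking would already catch the bug.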

Page 39: TestRank: Eliminating Waste from Test-Driven Development

Synthetic Experiment – Predictors Results

[Table: for each planted bug #1-#4, the rank of the actual failing test under "safe" RTS and under each heuristic – Dynamic (Execution Count, Call Count, 2 Stack Frames, Stack Count, Stack Depth Sum, Value Propagation), NL (Affinity) and Meta (Simple Average).]

Different heuristics predicted different failures

Best heuristics: Value Propagation, Affinity

Stack Depth Sum was very good on 4 experiments, and among the worst on the other 6

Worst heuristic: Execution Count

(*) Bug #3 caused five tests to fail

Page 40: TestRank: Eliminating Waste from Test-Driven Development

Synthetic Experiment – Predictors Statistics

Improvement vs. "Safe" RTS

Statistic        | Execution Count | Call Count | 2 Stack Frames | Stack Count | Stack Depth Sum | Value Propagation | Affinity | Simple Average
Average          | 45.8%           | 28.8%      | 31.2%          | 32.6%       | 42.9%           | 30.6%             | 18.6%    | 37.9%
Median           | 43.3%           | 23.5%      | 23.8%          | 26.7%       | 52.4%           | 26.7%             | 10%      | 28.6%
80th percentile  | 61.9%           | 28.6%      | 33.3%          | 38.1%       | 57.1%           | 38.1%             | 33.3%    | 56.7%

Affinity is best on Average and Median

Call Count is best on 80% percentile

Worst heuristics: Execution Count, Stack Depth Sum

Simple Average is a bad meta heuristic

Page 41: TestRank: Eliminating Waste from Test-Driven Development

Conclusions

TDD is powerful, but over time it introduces waste in the retesting phase

"Safe" RTS techniques are too conservative for TDD (and are not really safe…)

Our technique makes it possible to find and run the tests most relevant to a given code change

Dynamic and natural-language analyses are key

Developers can run the relevant tests and avoid wasting time on running the irrelevant ones

Eliminate waste from the TDD cycle while maintaining a high bug detection rate

Makes it easy to practice TDD rigorously

Page 42: TestRank: Eliminating Waste from Test-Driven Development

(Near) Future Work

We are currently working on:

Affinity propagation through the call tree

Meta heuristics: weighted average

Use experimental results as training data?

Self weighting heuristics

Further validation

Page 43: TestRank: Eliminating Waste from Test-Driven Development

Future Work

Reinforcement learning: strengthen correlation for true positives and weaken it for false positives; interactive confirm/deny

Add annotations/tagging: finer granularity, greater precision

String edit distance between literals

Consider external resources: changes in files such as XML and properties

Combine global ranking: test cyclomatic complexity / test code size

Use timing of tests for cost-effective ranking (short tests rank higher)

Selection should have good total coverage

Handle multiple edits

Integration with Eclipse and JUnit

Changes filtering: comments, refactoring, dead code

Combine static analysis (combine existing tools)

Page 44: TestRank: Eliminating Waste from Test-Driven Development

Further Applications of Code/Test Correlation

Assist code comprehension: what does this code do?

Assist test maintenance: what is the sensitivity/impact of this code? What tests to change?

Find regressions to known past bugs: related bug descriptions in the bug tracking system

Reverse applications:

Find regression cause: test fails → where to fix (CodePsychologist++)

Find bug cause: find the code relevant to a bug description in the bug tracking system

TDD implementation assist: spec (test) change → where to implement

Page 45: TestRank: Eliminating Waste from Test-Driven Development

Questions?

Page 46: TestRank: Eliminating Waste from Test-Driven Development

Thank You

Page 47: TestRank: Eliminating Waste from Test-Driven Development
Page 48: TestRank: Eliminating Waste from Test-Driven Development

How Often?

Quote from the JUnit FAQ: http://junit.sourceforge.net/doc/faq/faq.htm

How often should I run my tests?

Run all your unit tests as often as possible, ideally every time the code is changed. Make sure all your unit tests always run at 100%. Frequent testing gives you confidence that your changes didn't break anything and generally lowers the stress of programming in the dark.

For larger systems, you may just run specific test suites that are relevant to the code you're working on.

Run all your acceptance, integration, stress, and unit tests at least once per day (or night).

Page 49: TestRank: Eliminating Waste from Test-Driven Development

How much time?

We posted a question on stackoverflow.com: http://stackoverflow.com/questions/1066415/how-much-time-do-you-spend-running-regression-tests-on-your-ide

How much time do you spend running regression tests on your IDE, i.e. before check-in?

In most cases these tests will run in < 10 seconds. To run the complete test suite I rely on the Hudson Continuous Integration server... (within an hour).

sometimes I run a battery of tests which takes an hour to finish, and is still far from providing complete coverage.

My current project has a suite of unit tests that take less than 6 seconds to run and a suite of system tests that take about a minute to run.

I would generally run all my tests once per day or so, as, in one job, I had about 1200 unit tests.

Page 50: TestRank: Eliminating Waste from Test-Driven Development

Assumptions

Baseline - All tests in T pass on P

Change is localized to a single method

We currently ignore some possible inputs:

Source control history

Test results history

Test durations

Recently failed/ added/changed tests

Page 51: TestRank: Eliminating Waste from Test-Driven Development

DejaVu Results

"Siemens study" [Hutchins 1994]: a set of 7 small, nontrivial, real C programs

141-512 LOC, 8-21 procedures, 132 faulty versions, 1000-5500 tests

44.4% average reduction

"Player": a worker handling one player in the Internet game 'Empire'

766 procedures, 50 KLOC, 5 versions, 1000 tests (same command with different parameters)

95% average reduction!