Transferring Software Testing Tools to Practice (AST 2017 Keynote)
Tao Xie, University of Illinois at Urbana-Champaign
[email protected] http://taoxie.cs.illinois.edu/
In collaboration with colleagues from Microsoft Research, Tencent, and Salesforce
(Automated) Test Generation
• Human: expensive, incomplete, …
• Brute force: pairwise, predefined data, etc.
• Tool automation!!
…
Getting Real to Produce Practice Impact
Making real impact
Building real technologies
Solving real problems
Software testing tools are naturally tied to software development practice
Dynamic Symbolic Execution
[DART: Godefroid et al. PLDI’05]

Code to generate inputs for:

  void CoverMe(int[] a) {
    if (a == null) return;
    if (a.Length > 0)
      if (a[0] == 1234567890)
        throw new Exception("bug");
  }

Loop: choose the next path, execute & monitor to collect the observed constraints, negate a condition, and solve the resulting constraints to obtain the next input. Done when there is no path left.

  Data     Observed constraints                         Negated condition (constraints to solve)
  null     a==null                                      a!=null
  {}       a!=null && !(a.Length>0)                     a!=null && a.Length>0
  {0}      a!=null && a.Length>0 && a[0]!=1234567890    a!=null && a.Length>0 && a[0]==1234567890
  {123…}   a!=null && a.Length>0 && a[0]==1234567890    Done: there is no path left.
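The loop above can be mimicked in a short sketch. This is a Python stand-in for the C# example; the hand-rolled `solve` table is an assumed simplification that returns the input a solver would produce for each negated condition, whereas a real DSE engine (DART, Pex) records symbolic constraints and calls an SMT solver.

```python
# Toy sketch of the DSE loop on CoverMe. The "solver" is a lookup table
# standing in for a real constraint solver -- illustration only.

def cover_me(a):
    """Instrumented version of CoverMe: returns (path constraints, bug?)."""
    if a is None:
        return ["a==null"], False
    if len(a) > 0:
        if a[0] == 1234567890:
            return ["a!=null", "a.Length>0", "a[0]==1234567890"], True
        return ["a!=null", "a.Length>0", "a[0]!=1234567890"], False
    return ["a!=null", "!(a.Length>0)"], False

# Stand-in solver: maps the last (to-be-negated) condition to a new input.
solve = {
    "a==null": [],                 # negate a==null       -> any non-null array
    "!(a.Length>0)": [0],          # negate !(a.Length>0) -> non-empty array
    "a[0]!=1234567890": [1234567890],
}

explored, a = [], None             # start with a default input
while True:
    path, bug = cover_me(a)        # execute & monitor
    explored.append((a, " && ".join(path), bug))
    if bug or path[-1] not in solve:
        break                      # bug reached, or no path left
    a = solve[path[-1]]            # negate last condition and "solve"

for inp, constraints, bug in explored:
    print(inp, "|", constraints, "| bug:", bug)
```

Four rounds suffice here, mirroring the table: null, {}, {0}, and finally an array whose first element triggers the bug.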
Microsoft Research Automated Test Generation Tool: Pex & Relatives
• Pex (released in May 2008)
• 30K downloads after 20 months
• Active user community: 1.4K forum posts during ~3 years
• Shipped with Visual Studio 2015 as IntelliTest
• Moles (released in Sept 2009)
  • Shipped with Visual Studio 2012 as Fakes
  • “Provide Microsoft Fakes w/ all Visual Studio editions” got 1.5K community votes
http://research.microsoft.com/pex/
Explosion of Search Space
There are decision procedures for individual path conditions, but…
• The number of potential paths grows exponentially with the number of branches
• Reachable code is not known initially
• Without guidance, the same loop might be unfolded forever
Fitnex search strategy [Xie et al. DSN’09]
http://taoxie.cs.illinois.edu/publications/dsn09-fitnex.pdf
DSE Example
public bool TestLoop(int x, int[] y) {
if (x == 90) {
for (int i = 0; i < y.Length; i++)
if (y[i] == 15)
x++;
if (x == 110)
return true;
}
return false;
}
TestLoop(0, {0})
Path condition: !(x == 90)
↓ New path condition: (x == 90)
↓ New test input: TestLoop(90, {0})
DSE Example
public bool TestLoop(int x, int[] y) {
if (x == 90) {
for (int i = 0; i < y.Length; i++)
if (y[i] == 15)
x++;
if (x == 110)
return true;
}
return false;
}
TestLoop(90, {0})
Path condition: (x == 90) && !(y[0] == 15) && !(x == 110)
↓ New path condition: (x == 90) && (y[0] == 15)
↓ New test input: TestLoop(90, {15})
Challenge in DSE
public bool TestLoop(int x, int[] y) {
if (x == 90) {
for (int i = 0; i < y.Length; i++)
if (y[i] == 15)
x++;
if (x == 110)
return true;
}
return false;
}
TestLoop(90, {15})
Path condition: (x == 90) && (y[0] == 15) && !(x+1 == 110)
↓ New path condition: (x == 90) && (y[0] == 15) && (x+1 == 110)
↓ New test input: No solution!?
A Closer Look
public bool TestLoop(int x, int[] y) {
if (x == 90) {
for (int i = 0; i < y.Length; i++)
if (y[i] == 15)
x++;
if (x == 110)
return true;
}
return false;
}
TestLoop(90, {15})
Path condition: (x == 90) && (y[0] == 15) && (0 < y.Length) && !(1 < y.Length) && !(x+1 == 110)
↓ New path condition: (x == 90) && (y[0] == 15) && (0 < y.Length) && (1 < y.Length)
→ Expand array size
A Closer Look
public bool TestLoop(int x, int[] y) {
if (x == 90) {
for (int i = 0; i < y.Length; i++)
if (y[i] == 15)
x++;
if (x == 110)
return true;
}
return false;
}
TestLoop(90, {15})
We can have infinite paths!
Manual analysis shows at least 20 loop iterations are needed to cover the target branch.
Exploring all paths up to 20 loop iterations is infeasible: 2^20 paths.
Fitnex: Fitness-Guided Exploration
public bool TestLoop(int x, int[] y) {
if (x == 90) {
for (int i = 0; i < y.Length; i++)
if (y[i] == 15)
x++;
if (x == 110)
return true;
}
return false;
}
Key observations, with respect to the coverage target:
• Not all paths are equally promising for branch-node flipping
• Not all branch nodes are equally promising to flip
Our solution:
• Prefer to flip branch nodes on the most promising paths
• Prefer to flip the most promising branch nodes on paths
• Use a fitness function to measure how “promising” each is
TestLoop(90, {15, 0})
TestLoop(90, {15, 15})
[Xie et al. DSN 2009]
http://taoxie.cs.illinois.edu/publications/dsn09-fitnex.pdf
Fitness Function
• The fitness function computes a fitness value: the distance between the current state and the goal state
• The search tries to minimize the fitness value
[Tracey et al. 98, Liu at al. 05, …]
Fitness Function for (x == 110)
public bool TestLoop(int x, int[] y) {
if (x == 90) {
for (int i = 0; i < y.Length; i++)
if (y[i] == 15)
x++;
if (x == 110)
return true;
}
return false;
}
Fitness function: |110 – x |
Compute Fitness Values for Paths
public bool TestLoop(int x, int[] y) {
if (x == 90) {
for (int i = 0; i < y.Length; i++)
if (y[i] == 15)
x++;
if (x == 110)
return true;
}
return false;
}
(x, y)                        Fitness value
(90, {0})                     20
(90, {15})                    19
(90, {15, 0})                 19
(90, {15, 15})                18
(90, {15, 15, 0})             18
(90, {15, 15, 15})            17
(90, {15, 15, 15, 0})         17
(90, {15, 15, 15, 15})        16
(90, {15, 15, 15, 15, 0})     16
(90, {15, 15, 15, 15, 15})    15
…
Fitness function: |110 – x|
Give preference to flipping paths with better fitness values. We still need to address which branch node to flip on each path …
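The fitness values in the table can be reproduced with a short sketch (a Python stand-in for the C# TestLoop; the loop body mirrors the example above):

```python
# Compute the fitness value |110 - x| at the end of the path taken by
# TestLoop(x, y): each 15 encountered in y increments x by one.

def fitness(x, y):
    """Distance from the coverage target (x == 110) after the loop."""
    if x != 90:
        return None          # target branch unreachable on this path
    for yi in y:
        if yi == 15:
            x += 1
    return abs(110 - x)

# Reproduce the table: appending another 15 moves x one step closer to 110.
for y in ([0], [15], [15, 0], [15, 15], [15, 15, 15]):
    print((90, y), fitness(90, y))
```

With twenty 15s in y the fitness reaches 0, i.e., the target branch is covered.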
Compute Fitness Gains for Branches
public bool TestLoop(int x, int[] y) {
if (x == 90) {
for (int i = 0; i < y.Length; i++)
if (y[i] == 15)
x++;
if (x == 110)
return true;
}
return false;
}
(x, y)                        Flipped     Fitness value
(90, {0})                                 20
(90, {15})                    flip b4     19
(90, {15, 0})                 flip b2     19
(90, {15, 15})                flip b4     18
(90, {15, 15, 0})             flip b2     18
(90, {15, 15, 15})            flip b4     17
(90, {15, 15, 15, 0})         flip b2     17
(90, {15, 15, 15, 15})        flip b4     16
(90, {15, 15, 15, 15, 0})     flip b2     16
(90, {15, 15, 15, 15, 15})    flip b4     15
…
Fitness function: |110 – x|
Branch b1: i < y.Length
Branch b2: i >= y.Length
Branch b3: y[i] == 15
Branch b4: y[i] != 15
• Flipping branch b4 (b3) gives us an average fitness gain (loss) of 1 (-1)
• Flipping branch b2 (b1) gives us an average fitness gain (loss) of 0
Compute Fitness Gain for Branches cont.
• For a flipped node leading to fitness value Fnew, find the old fitness value Fold before flipping
• Assign fitness gain (Fold – Fnew) to the branch of the flipped node
• Assign fitness gain (Fnew – Fold) to the other (sibling) branch of the flipped node
• Compute the average fitness gain for each branch over time
Search Frontier
• Each branch-node candidate for flipping is prioritized based on its composite fitness value:
  (fitness value of node – fitness gain of its branch)
• Select first the candidate with the best composite fitness value
To avoid local optima or biases, the fitness-guided strategy is integrated with Pex’s fairness search strategies.
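The frontier bookkeeping can be sketched as follows. Names and data structures here are illustrative assumptions, not Pex’s internals, and the fairness strategies noted above are omitted:

```python
# Prioritize branch-node candidates by composite fitness:
#   node's fitness value - average fitness gain of its branch.
from collections import defaultdict

gains = defaultdict(list)        # branch id -> observed fitness gains

def record_flip(branch, f_old, f_new, sibling):
    """Credit (Fold - Fnew) to the flipped node's branch and
    (Fnew - Fold) to its sibling branch."""
    gains[branch].append(f_old - f_new)
    gains[sibling].append(f_new - f_old)

def avg_gain(branch):
    g = gains[branch]
    return sum(g) / len(g) if g else 0.0

def pick_next(frontier):
    """frontier: list of (node fitness value, branch id).
    The lowest composite fitness value wins."""
    return min(frontier, key=lambda nb: nb[0] - avg_gain(nb[1]))

# Mirroring the slides: flipping b4 gained 1 each time, flipping b2 gained 0.
record_flip("b4", 20, 19, "b3")
record_flip("b2", 19, 19, "b1")
record_flip("b4", 19, 18, "b3")

best = pick_next([(18, "b4"), (18, "b2")])
print(best)   # b4 wins: same node fitness, higher average gain
```

At equal node fitness, the branch with the better history of gains (b4) is flipped first, which is exactly why Fitnex escapes the 2^20-path blowup.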
Microsoft Research Automated Test Generation Tool: Pex & Relatives
• Pex (released in May 2008)
• 30K downloads after 20 months
• Active user community: 1.4K forum posts during ~3 years
• Shipped with Visual Studio 2015 as IntelliTest
• Moles (released in Sept 2009)
  • Shipped with Visual Studio 2012 as Fakes
  • “Provide Microsoft Fakes w/ all Visual Studio editions” got 1.5K community votes
http://research.microsoft.com/pex/
What went on behind the scenes to build a user base?
See more in the ASE 2014 Experience Report: http://taoxie.cs.illinois.edu/publications/ase14-pexexperiences.pdf
Lesson 1. Evolving Vision
Parameterized unit tests supported by Pex:

  void TestAdd(ArrayList a, object o) {
    Assume.IsTrue(a != null);
    int i = a.Count;
    a.Add(o);
    Assert.IsTrue(a[i] == o);
  }
Moles/Fakes
IntelliTest
Pex4Fun/Code Hunt
• Surrounding (Moles/Fakes)
• Retargeting (Pex4Fun/Code Hunt)
• Simplifying (IntelliTest)
Lesson 2. Landing the First Customer
• Developer/manager: “Who took a dependency on your tool?”
• Pex team: “Do you want to be the first?”
• Developer/manager: “I love your tool but no.”
• Micro perspective: tool adoption by (mass) target users
• Macro perspective: tool shipping with Visual Studio
Lesson 2. Landing the First Customer (cont.)
• Tackle real-world challenges
  • Demo Pex on real-world cases (e.g., ResourceReader) beyond textbook examples
  • Demo Moles to address important scenarios well (e.g., unit testing SharePoint code)
• Address technical/non-technical barriers for tech adoption in industry
  • Offer a tool license not prohibiting commercial use
• Incremental shipping
  • Ship experimental reduced versions and gather feedback
  • Find early adopters
• Provide quantitative info (reflecting the tool’s importance or benefit)
  • Not all downloads are equal! (e.g., those from Fortune 500 companies)
Lesson 3. Human Factors – Generated Data Consumed by Humans
• Developer: “Code Digger generates a lot of “\0” strings as input. I can’t find a way to create such a string via my own C# code. Could anyone show me a C# snippet? I meant a zero-terminated string.”
• Pex team: “In C#, a \0 in a string does not mean zero-termination. It’s just another character in the string (a very simple character where all bits are zero), and you can create it exactly as Pex shows the value: “\0”.”
• Developer: “Your tool generated “\0””
• Pex team: “What did you expect?”
• Developer: “Marc.”
…
Lesson 3. Human Factors – Generated Names Consumed by Humans
• Developer: “Your tool generated a test called Capitalize01. I don’t like it.”
• Pex team: “What did you expect?”
• Developer: “Capitalize_Should_Fail_When_Value_Is_Null.”
Lesson 3. Human Factors – Generated Results Consumed by Humans
Object Creation messages suppressed (e.g., Covana by Xiao et al. [ICSE’11])
Exception Tree View
Exploration Tree View
Exploration Results View
Lesson 4. Misconceptions
• Someone advertises: “Simply one mouse click and then everything would work just perfectly”
  • Often need environment isolation w/ Moles/Fakes
• “One mouse click, and a test generation tool would detect all or most kinds of faults in the code under test”
  • Developer: “Your tool only finds null references.”
  • Pex team: “Did you write any assertions?”
  • Developer: “Assertion???”
• “I do not need test generation; I already practice unit testing (and/or TDD). Test generation does not fit into the TDD process”
Lesson 5. Embracing Feedback
Gathered feedback from target tool users:
• Directly, e.g., via
  • MSDN Pex forum, tech support, outreach to MS engineers and .NET user groups, outreach to external early adopters
• Indirectly, e.g., via
  • Interactions with the Visual Studio team (a tool vendor to its huge user base)
• Lack of basic test isolation in practice => Moles
  • Our suggestion of refactoring code for testability faced strong resistance in practice
• Observation at the Agile 2008 conference
  • Large focus on mock objects and tool support for mock objects
Feedback
• Early drops on the VS Code Gallery of the Pex extension, and the Code Digger extension for Visual Studio 2013
• Visual Studio MVP community
• Internal dogfooding by teams within Microsoft
• UserVoice feedback (> 20 ideas)
• Stack Overflow: active forum with questions tagged “Pex” or “IntelliTest”
• Facebook: https://www.facebook.com/PexMoles/
• Twitter: https://twitter.com/pexandmoles
Gathering Feedback
Item                                                 Votes
Add support for NUnit and xUnit.net                  304
Add support for VB.NET                               336
Make it available with Visual Studio Professional    319
Enable IntelliTest for 64-bit projects                72
Pex to IntelliTest - Adds and Cuts
• Additions
  • Externalizing test framework support
• Cuts
  • Visual Basic support
  • Command-line support
  • FixIt
  • Code Contracts integration
  • Pex Explorer
  • Reporting
From http://bbcode.codeplex.com/
NUnit Extension – 37,378 installs
xUnit.net Extension – 17,438 installs
Pex to IntelliTest - Shipped!
(as of May 19, 2017)
Collaboration with Academia
• Win-win collaboration model
  • Win (industry lab): longer-term research innovation, manpower, research impact, …
  • Win (university): powerful infrastructure, relevant/important problems in practice, both research and industry impact, …
• Hosting academic visitors
  • Faculty visits, e.g., Fitnex [Xie et al. DSN’09], Pex4Fun [Tillmann et al. ICSE’13 SEE]
  • Student internships, e.g., FloPSy [Lakhotia et al. ICTSS’10], DyGen [Thummalapenta et al. TAP’10]
http://research.microsoft.com/pex/community.aspx#publications
Engaging Broader Academic Communities
• Academic research inspiring internal technology development
  • Reggae [Li et al. ASE’09] → Rex [Veanes et al. ICST’10]
  • MSeqGen [Thummalapenta et al. FSE’09] → DyGen [Thummalapenta et al. TAP’10]
• …
• Academic research exploring longer-term research frontiers• DySy [Csallner et al. ICSE’08]
• Seeker [Thummalapenta et al., OOPSLA’11]
• Covana [Xiao et al. ICSE’11]
• SEViz [Honfi et al. ICST’15]
• Pex + Code Contracts [Christakis et al. ICSE’16]
• …
http://research.microsoft.com/pex/community.aspx#publications
Going from Pex to Coding Duels
Secret implementation:

  class Secret {
    public static int Puzzle(int x) {
      if (x <= 0) return 1;
      return x * Puzzle(x - 1);
    }
  }

Player implementation:

  class Player {
    public static int Puzzle(int x) {
      return x;
    }
  }

Test:

  class Test {
    public static void Driver(int x) {
      if (Secret.Puzzle(x) != Player.Puzzle(x))
        throw new Exception("Mismatch");
    }
  }

Winning condition: Secret implementation behavior == Player implementation behavior
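The duel can be mimicked in a few lines (a Python stand-in for the C# driver; in Code Hunt itself, test generation hunts for mismatching inputs automatically):

```python
# Python stand-in for the Coding Duel driver: the player wins when no input
# makes the two implementations disagree.

def secret_puzzle(x):
    return 1 if x <= 0 else x * secret_puzzle(x - 1)   # factorial

def player_puzzle(x):
    return x                                           # player's first guess

def driver(x):
    if secret_puzzle(x) != player_puzzle(x):
        raise Exception("Mismatch")

driver(1)              # agrees: both return 1
try:
    driver(3)          # secret returns 6, player returns 3
    print("behaviors match")
except Exception:
    print("Mismatch found -- refine the player implementation")
```

The generated mismatching input (here x = 3) is exactly the feedback a player iterates on.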
About Code Hunt
Blogs and sites (May 15, 2014; April 29, 2015):
• www.codehunt.com
• research.microsoft.com/codehunt
• research.microsoft.com/codehuntcommunity
• Data site on GitHub
Powerful and versatile platform for coding as a game
Built on the symbolic execution of Pex
Addressing different audiences – students, developers, researchers
Data available in the cloud
Unique in working from unit tests, not specifications
Open sourced data available for analysis
Over 350,000 players as of August 2016 (since mid 2014)
www.codehunt.com
It’s a game!
1. Iterative gameplay
2. Adaptive
3. Personalized
4. No cheating
5. Clear winning criterion
Score is based on:
• how many puzzles solved,
• how well solved, and
• when solved
(Gameplay loop: code ↔ test cases)
Lesson 6: Following the Data
• Java support is provided by a source-to-source translator
  • We watched which features players used and what errors they made, to concentrate translation efforts for maximum effect
• The bank of over 400 puzzles records difficulty levels
  • These are updated by crowdsourcing users’ attempts
• The vast number of attempts at solving puzzles gives reliable data as to where programmers have difficulty – see the open-sourced data
  • For ImCupSept2: 57 users x 24 puzzles x approx. 10 tries = about 13,000 programs
ICSE 2015 JSEET: http://taoxie.cs.illinois.edu/publications/icse15jseet-codehunt.pdf
ICSE 2013 SEE: http://taoxie.cs.illinois.edu/publications/icse13see-pex4fun.pdf
Test Generation for Mobile Apps
When Monkey and WeChat Meet …
FSE 2016 Industry Track: http://taoxie.cs.illinois.edu/publications/fse16industry-wechat.pdf
ICSE 2017 SEIP Track: http://taoxie.cs.illinois.edu/publications/icse17seip-wechat.pdf
Motivation
Choudhary et al. [ASE’15]: Do we have good-enough tools to test Android apps?
• Evaluated six research tools and Monkey on 68 open-source apps
• The Monkey tool outperformed all six research tools
Their study can be further extended:
• No industrial-strength Android app was studied
• No demonstration of whether and how techniques can further improve Monkey under industrial settings
Challenges on Testing Industrial Mobile Apps
• Code coverage measurement: requires the app’s source code; an industrial-strength Android app can hit the 64K method-reference limit during instrumentation
• Applicability: scalability on testing apps with large codebases; OS compatibility of the testing tool
WeChat Overview
WeChat = WhatsApp + Facebook + Instagram + PayPal + Uber + …
• 846 million monthly active users
• Daily numbers: dozens of billions of messages sent, hundreds of millions of photos uploaded, hundreds of millions of payment transactions executed
• WeChat backend: 2K+ microservices running on 40K+ servers; 10M queries per second during Chinese New Year’s Eve
• Large codebase in WeChat Android
Monkey: Experimental Setup
• Set Monkey to fire random events every 500 milliseconds
• Run Monkey on WeChat 5 times independently
• Run Monkey for 18 hours each time (2 hours without login)
Evaluation metrics:
• Line coverage
• Activity coverage
Monkey: Coverage Result Findings
Finding 1: Monkey has low line coverage (19.5%) and low activity coverage (10.3%).
Finding 2: Monkey allocates a lopsided distribution of exploration time on each activity.
(Chart: coverage percentage (0–35%) vs. exploration time (1–18 hours), showing line coverage and activity coverage for Monkey; the point of manual login is marked.)
Monkey: Exploration time challenges
• Widget obliviousness: it is difficult to generate events at small-sized GUI elements
• State obliviousness: Monkey explores the same two activities (SelectContactUI and ContactLabelUI) repeatedly, bouncing between them without contributing to new code coverage
New Approach
Design goals:
• Have direct access to UI elements on the activity under test
• Allocate more exploration time towards new GUI states
Techniques:
• Use the UIAutomator framework to gain the UI layout tree
• Abstract GUI states to guide the firing of state-changing events
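One way to realize the second technique can be sketched as follows. This is an illustrative Python sketch: the field names and the exact abstraction function are assumptions, since the deck does not specify WeChat’s implementation.

```python
# Sketch of GUI-state abstraction for guided exploration: abstract each UI
# layout tree (as a UIAutomator dump would provide) down to the activity name
# plus the set of widget classes and ids, ignoring text and positions, and
# spend exploration budget only on states not seen before.

def abstract_state(activity, widgets):
    """Abstract a layout tree: keep structure, drop volatile content."""
    return (activity, frozenset((w["class"], w["id"]) for w in widgets))

seen = set()

def should_explore(activity, widgets):
    """Fire state-changing events only for new abstract GUI states."""
    state = abstract_state(activity, widgets)
    if state in seen:
        return False
    seen.add(state)
    return True

ui = [{"class": "Button", "id": "send"}, {"class": "EditText", "id": "msg"}]
print(should_explore("ChattingUI", ui))   # new state: explore
print(should_explore("ChattingUI", ui))   # already seen: skip
```

Skipping already-seen abstract states is what stops the back-and-forth between SelectContactUI and ContactLabelUI observed with Monkey.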
New Approach: Coverage result
The new approach covers 11.1 percentage points more lines and 18.4 percentage points more activities than Monkey does!
Categorization of Not-covered Activities
• Dead activity examples: unreleased features, activities for older devices
• Insufficient account state examples: requiring financial information, account history, enabled features
Example Not-covered Activities
Default-disabled features of WeChat:
• Activity for searching saved favorite history
• Activity for showing details of a search result
Substring Hole Analysis
• Substring hole: a shared substring in the names of a set of not-covered activities
• Identified “wallet”-related activities as not covered
• Manual testing with wallet-related tests reduced the “wallet” substring hole from 85.9% (61/71) to about 22.5% (16/71)
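The substring-hole idea can be sketched as grouping not-covered activity names by a shared substring and reporting the miss ratio. The activity names below are made up for illustration; the slide’s real numbers come from WeChat’s 71 wallet-related activities.

```python
# Substring hole analysis sketch: for a substring such as "wallet", the hole
# is the fraction of activities containing it that remain not covered.

def substring_hole(substring, all_activities, covered):
    related = [a for a in all_activities if substring in a.lower()]
    missed = [a for a in related if a not in covered]
    return len(missed), len(related)

activities = ["WalletPayUI", "WalletHistoryUI", "WalletOfflineCoinPurseUI",
              "ChattingUI", "SettingsUI"]
covered = {"ChattingUI", "SettingsUI", "WalletPayUI"}

missed, total = substring_hole("wallet", activities, covered)
print(f"wallet hole: {missed}/{total} = {missed / total:.1%}")
```

A large hole for one substring points testers at a whole feature family (here, wallet) rather than at individual activities.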
Learning for Test Prioritization: An Industrial Case Study
Benjamin Busjaeger, Salesforce
Tao Xie, University of Illinois at Urbana-Champaign
FSE 2016 Industry Track
http://taoxie.cs.illinois.edu/publications/fse16industry-learning.pdf
Large Scale Continuous Integration
• 2K+ modules
• 600K+ tests
• 2K+ engineers
• 500+ changes/day
• Main repository: 350K+ source files
Motivation
Before commit:
• Objective: reject faulty changes
• Current: run hand-picked tests
• Problem: too complex for humans
• Desired: select top-k tests likely to fail

After commit:
• Objective: detect test failures as quickly as possible
• Current: parallelize (8,000 concurrent VMs) & batch (1–1,000 changes per run)
• Problem: feedback time constrained by capacity; batching complicates fault assignment
• Desired: prioritize all tests by likelihood of failure
Insight: Need Multiple Existing Techniques
● Heterogeneous languages: Java, PL/SQL, JavaScript, Apex, etc.
● Non-code artifacts: metadata and configuration
● New/recently-failed tests are more likely to fail
→ Techniques needed:
• Test code coverage of the change
• Textual similarity between test and change
• Test age and recent failures
Insight: Need Scalable Techniques
● Complex integration tests: concurrent execution
● Large data volume: 500+ changes, 50M+ test runs, 20TB+ results per day
→ Our approach: periodically collect coarse-grained measurements (test code coverage of the change, textual similarity between test and change, test age and recent failures) and feed them to a model for learning from past test results
New Approach: Learning for Test Prioritization
(Diagram: a change plus three features (test code coverage of the change, textual similarity between test and change, test age and recent failures) feeds a learned model that outputs a test ranking.)
→ Implementation currently in pilot use at Salesforce
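A minimal sketch of the idea follows. The hand-rolled linear scorer, its weights, and the data are all illustrative assumptions; the actual system learns a ranking model from past test results rather than using fixed weights.

```python
# Rank tests for a change by a score combining the three feature families
# named above: coverage of the change, textual similarity, and failure history.

def score(test, change_tokens, weights=(0.5, 0.3, 0.2)):
    w_cov, w_sim, w_hist = weights
    coverage = test["covers_changed_code"]                # 0 or 1
    overlap = len(set(test["tokens"]) & change_tokens)
    similarity = overlap / max(len(change_tokens), 1)     # crude token overlap
    history = test["recent_failure_rate"]                 # in [0, 1]
    return w_cov * coverage + w_sim * similarity + w_hist * history

change_tokens = {"payment", "invoice", "tax"}
tests = [
    {"name": "t_payment", "covers_changed_code": 1,
     "tokens": ["payment", "invoice"], "recent_failure_rate": 0.2},
    {"name": "t_ui", "covers_changed_code": 0,
     "tokens": ["button"], "recent_failure_rate": 0.0},
    {"name": "t_flaky", "covers_changed_code": 0,
     "tokens": ["tax"], "recent_failure_rate": 0.9},
]

ranking = sorted(tests, key=lambda t: score(t, change_tokens), reverse=True)
print([t["name"] for t in ranking])
```

Before commit, the top-k of this ranking is the selected test set; after commit, the full ranking orders execution.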
Empirical Study: Setup
● Test results of 45K tests over a ~3-month period
● In this period, 711 changes with ≥ 1 failure
  • 440 for training
  • 271 for testing
● Collected once for each test:
  • Test code coverage
  • Test text content
● Collected continuously:
  • Test age and recent failures
Results: Test Selection (Before Commit)
● The new approach achieves the highest average recall at all cutoff points
  • 50% of failures detected within the top 0.2% of tests
  • 75% of failures detected within the top 3% of tests
Results: Test Prioritization (After Commit)
● The new approach achieves the highest APFD with the least variance
  • Median: 99%
  • Average: 85%
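APFD (Average Percentage of Faults Detected) is the standard prioritization metric used above. A short sketch of its computation, on made-up example data:

```python
# APFD = 1 - (sum of 1-based positions of the first failing test per fault)
#            / (n * m)  +  1 / (2n)
# where n = number of tests in the prioritized order, m = number of faults.

def apfd(order, fault_to_tests):
    """order: prioritized list of test names.
    fault_to_tests: fault -> set of tests that detect it."""
    n, m = len(order), len(fault_to_tests)
    pos = {t: i + 1 for i, t in enumerate(order)}
    first = [min(pos[t] for t in tests) for tests in fault_to_tests.values()]
    return 1 - sum(first) / (n * m) + 1 / (2 * n)

faults = {"f1": {"t3"}, "f2": {"t1", "t4"}}
good = apfd(["t3", "t1", "t2", "t4"], faults)   # failures surface early
bad = apfd(["t2", "t4", "t1", "t3"], faults)    # failures surface late
print(round(good, 3), round(bad, 3))
```

A good ordering pushes failure-revealing tests to the front, so APFD approaches 1; a bad ordering leaves them at the tail.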
Summary: Learning for Test Prioritization
● Main insights gained in conducting test prioritization in industry
● Novel learning-based approach to test prioritization
● Implementation currently in pilot use at Salesforce
● Empirical evaluation using a large Salesforce dataset
Summary
• Pex practice impact by surrounding, retargeting, simplifying
• Lessons in transferring tools to practice:
  1. Evolving vision
  2. Landing your first customer
  3. Human factors
  4. Misconceptions
  5. Embracing feedback
  6. Following the data
• Collaboration/engagement with academia
• Educational impact and lessons learned
• Important to research on testing industrial apps (e.g., WeChat), beyond open-source ones
http://research.microsoft.com/pex/
http://www.codehunt.com/
Getting Real to Produce Practice Impact
Making real impact
Building real technologies
Solving real problems
Software testing/analysis tools are naturally tied to software development practice
Thank you! Questions?
This material is based upon work supported by the Maryland Procurement Office under Contract No. H98230-14-C-0141. This work is also supported in part by National Science Foundation under grants no. CCF-1409423, CNS-1434582, CNS-1513939, CNS-1564274.