Can we induce change with what we measure?
Data-driven software engineering @Microsoft
Michaela Greiler
• How can we optimize the testing process?
• Do code reviews make a difference?
• Is coding velocity and quality always a tradeoff?
• What’s the optimal way to organize work on a large team?
MSR Redmond/TSE: Michaela Greiler, Jacek Czerwonka, Wolfram Schulte, Suresh Thummalapenta
MSR Redmond: Christian Bird, Kathryn McKinley, Nachi Nagappan, Thomas Zimmermann
MSR Cambridge: Brendan Murphy, Kim Herzig
Code coverage trigger of check-ins
[Chart, Nov 2010 – Oct 2013: % completely covered / % somewhat covered / % not covered]
Reviewer recommendation: Does experience matter?
Can we change with what we can measure?
Michaela Greiler
YES
YES
that’s the danger!
What is measured?
[Bar chart: number of bugs per engineer (Carl, Lisa, Rob, Danny)]
What is changed?
[Bar chart: number of bugs per engineer (Carl, Lisa, Rob, Danny), alongside code quality]
SOCIO-TECHNICAL CONGRUENCE “Design and programming are human activities; forget that and all is lost” – Bjarne Stroustrup
So should we go without any measurements?
Data Collection – Interpretation – Usage
Lessons learned
No
Garbage!
• What is CodeMine? What data does CodeMine have?
GQM vs. opportunistic data collection
• Easily available ≠ what’s needed
• Determine the needed data
• Find proxy measures if needed
• Know the analysis before collecting the data
Otherwise, data is not usable for the intended purpose
• Goal – Question – Metric
• Check for completeness, cleanness/noise, and usefulness
• Data background
• How was the data generated?
• Why was it generated?
• Who consumes the data?
• What about outliers?
• How was the data processed?
Interpretation needs domain knowledge
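The Goal–Question–Metric idea above (decide the analysis before collecting data) can be sketched as a small data structure. The goal, questions, and metrics here are hypothetical examples, not the actual plan used at Microsoft:

```python
# A minimal, hypothetical Goal-Question-Metric (GQM) plan.
# Goal, questions, and metrics are illustrative only.
gqm_plan = {
    "goal": "Reduce the cost of testing without sacrificing quality",
    "questions": {
        "How often do tests find real code issues?": [
            "true-positive rate per test suite",
        ],
        "How often do tests raise false alarms?": [
            "false-alarm rate per test suite",
            "triage time per false alarm",
        ],
    },
}

def metrics_needed(plan):
    """List every metric the plan requires, so data collection
    can be planned before any data is gathered."""
    return [m for ms in plan["questions"].values() for m in ms]
```

Working backwards from the goal like this makes it explicit which data is actually needed, instead of starting from whatever data is easily available.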
Tools, processes, practices and policies.
Release schedule (milestones M1, M2, Beta over time).
Engineers: What roles exist? Who does what? Responsibilities?
Organization of code bases.
Team structure and culture.
You cannot compare 1:1
Engineers want to understand the nitty-gritty
• How do you calculate the recommended reviewers?
• Why was that person recommended?
• Why is Lisa not recommended?
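The slides do not give the actual recommendation algorithm. In the spirit of “simplicity first”, a common, explainable baseline is to rank reviewers by how many past edits they made to the files touched by the review; all names and edit records below are hypothetical:

```python
from collections import Counter

def recommend_reviewers(change_history, touched_files, top_n=2):
    """Rank candidate reviewers by how many past edits they made
    to the files touched by the current review.
    change_history: list of (author, file) past-edit records."""
    touched = set(touched_files)
    scores = Counter(author for author, f in change_history if f in touched)
    return [author for author, _ in scores.most_common(top_n)]

# Hypothetical edit history.
history = [
    ("Lisa", "parser.c"), ("Lisa", "parser.c"),
    ("Rob", "parser.c"), ("Carl", "ui.c"),
]
recommend_reviewers(history, ["parser.c"])  # -> ["Lisa", "Rob"]
```

Because the score is just an edit count per person, the tool can answer “why was that person recommended?” and “why is Lisa not recommended?” directly from the counts.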
Simplicity first
Files without bugs: main contributor made > 50% of all edits
Files with bugs: main contributor made < 60% of all edits
Ownership metric: proportion of all edits made by the contributor with the most edits
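The ownership metric defined above is straightforward to compute; the per-file edit counts here are made up for illustration:

```python
def ownership(edits_per_contributor):
    """Proportion of all edits made by the contributor with the
    most edits (the file's 'main contributor')."""
    total = sum(edits_per_contributor.values())
    return max(edits_per_contributor.values()) / total

# Hypothetical edit counts for two files.
ownership({"Lisa": 8, "Rob": 1, "Carl": 1})  # 0.8 -> strong ownership
ownership({"Lisa": 2, "Rob": 2, "Carl": 1})  # 0.4 -> weak ownership
```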
Reporting vs. prediction
Comprehension vs. automation
If you can do it with a decision tree… do it…
Iterative process with very close involvement of product teams and domain experts.
It’s a dialog; it’s a back and forth.
Mixed Method Research
is a research approach or methodology
• for questions that call for real-life contextual understandings;
• employing rigorous quantitative research assessing the magnitude and frequency of constructs; and
• rigorous qualitative research exploring the meaning and understanding of constructs.
DR. MARGARET-ANNE STOREY
Professor of Computer Science University of Victoria
All methods are inherently flawed!
Generalizability
Precision Realism
DR. ARIE VAN DEURSEN
Professor of Software Engineering Delft University of Technology
Foundations of Mixed Methods Research
Designing Social Inquiry
Qualitative Research: Mixed Method Research
• Interviews
• Observations
• Focus groups
• Contextual Inquiry
• Grounded Theory
• …
A Grounded Theory Study
Systematic procedure to discover a theory from (qualitative) data
S. Adolph, W. Hall, Ph. Kruchten. Using Grounded theory to study the experience of software development. Empirical Software Engineering, 2011.
B. Glaser and J. Holton. Remodeling grounded theory. Forum Qualitative Res., 2004.
Glaser and Strauss
Deductive versus inductive
A deductive approach is concerned with developing a hypothesis (or hypotheses) based on existing theory, and then designing a research strategy to test the hypothesis (Wilson, 2010, p. 7).
An inductive approach starts with observations. Theories emerge towards the end of the research as a result of careful examination of patterns in the observations (Goddard and Melville, 2004).
Deductive: Theory → Hypotheses → Observation → Confirm/Reject
Inductive: Observation → Patterns → Theory
All models are wrong but some are useful. (George E. P. Box)
Theo: Test Effectiveness Optimization from History
Kim Herzig*, Michaela Greiler+, Jacek Czerwonka+, Brendan Murphy*
*Microsoft Research, Cambridge; +Microsoft Corporation, US
Improving Development Processes
Product / Service
Legacy changes
New product features
Technology changes
Development Environment
Speed ($), cost, and quality / risk
(should be well balanced)
Microsoft aims for shorter release cycles
Empirical data to support & drive decisions
• Speed up development processes (e.g. code velocity)
• More frequent releases
• Maintaining / increasing product quality
Joint effort by MSR & product teams
• MSR Cambridge: Brendan Murphy, Kim Herzig
• TSE Redmond: Jacek Czerwonka, Michaela Greiler
• MSR Redmond: Tom Zimmermann, Chris Bird, Nachi Nagappan
• Windows, Windows Phone, Office, Dynamics product teams
Software Testing for Windows
Winmain (main branch)
Quality gate (system testing)
Quality gate (system & component testing)
Quality gate (component testing)
time
Development branch
Multiple area branches
Multiple component branches
Software testing is very expensive
• Thousands of test suites and millions of test cases executed
• On different branches, architectures, languages, etc.
• We tend to repeat the same tests over and over again
• Too many false alarms (failures due to test and infrastructure issues)
• Each test failure slows down product development
• Aims to find code issues as early as possible
• At the cost of slower product development
Actual problem
Current process aims for maximal protection
{Simplified illustration}
Software Testing for Office
Dev Inner Loop
BVT and CVT on main
Dog food
Different:
• Branching structure
• Development process
• Testing process
• Release schedules
• …
{Simplified illustration}
Goal
Reduce the number of test executions …
… without sacrificing code quality
Dynamic, self-adaptive optimization model
Solution
Reduce the number of test executions …
• Run every test at least once before integrating code change into main branch (e.g., winmain).
• We eventually find all code issues but take risk of finding them later (on higher level branches).
… without sacrificing code quality
High cost, unknown value ($$$$$)
High cost, low value ($$$$)
Low cost, low value ($)
Low cost, good value ($$)
How likely is a test to 1) raise a false alarm or 2) find a code issue?
Analyze historic data:
• Test events
• Builds
• Code integrations
Analyze past test results:
• Passing tests, false alarms, detected code issues
Bug finding capabilities change with context
Solution
Using cost function to model risk.
$Cost_{Execution} > Cost_{Skip}$ ? suspend : execute test

$Cost_{Execution} = Cost_{Machine/Time} \cdot Time_{Execution} + \text{“cost of potential false alarm”}$
$= Cost_{Machine/Time} \cdot Time_{Execution} + Prob_{FP} \cdot Cost_{Developer/Time} \cdot Time_{Triage}$

$Cost_{Skip} = \text{“potential cost of finding a defect later”}$
$= Prob_{TP} \cdot Cost_{Developer/Time} \cdot Time_{Freeze(branch)} \cdot \#Developers_{Branch}$
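The skip-or-execute decision above can be sketched directly in code. The structure follows the cost function on this slide; all rates and cost figures in the example calls are placeholder numbers, not Microsoft’s actual values:

```python
def should_skip(prob_fp, prob_tp,
                machine_cost_per_hour, exec_hours,
                dev_cost_per_hour, triage_hours,
                freeze_hours, developers_on_branch):
    """Suspend a test when executing it is expected to cost more
    than the risk taken by skipping it."""
    # Machine time plus the expected cost of triaging a false alarm.
    cost_execution = (machine_cost_per_hour * exec_hours
                      + prob_fp * dev_cost_per_hour * triage_hours)
    # Expected cost of finding the defect later, while the branch
    # is frozen and its developers are blocked.
    cost_skip = (prob_tp * dev_cost_per_hour
                 * freeze_hours * developers_on_branch)
    return cost_execution > cost_skip

# A noisy test that almost never finds real defects: skip it.
should_skip(prob_fp=0.3, prob_tp=0.0005,
            machine_cost_per_hour=1, exec_hours=2,
            dev_cost_per_hour=100, triage_hours=1,
            freeze_hours=8, developers_on_branch=50)   # True

# A reliable, high-value test: keep executing it.
should_skip(prob_fp=0.01, prob_tp=0.1,
            machine_cost_per_hour=1, exec_hours=2,
            dev_cost_per_hour=100, triage_hours=1,
            freeze_hours=8, developers_on_branch=50)   # False
```

Note how the number of developers on the branch scales the skip cost: skipping a test is far riskier on a heavily shared branch than on a small component branch.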
Test
Cost to run a test.
Value of output.
Current Results
Simulated on Windows 8.1 development period (BVT only)
Dynamic, Self-Adaptive
Decision points are connected to each other
Skipping tests influences the risk factors of higher-level branches
We re-enable tests if code quality drops (e.g. different milestone)
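The re-enabling rule can be sketched as a simple feedback check; the tolerance factor and the defect counts in the example are hypothetical, not values from the actual system:

```python
def reenable_skipped_tests(escaped_defects_recent, baseline_defects,
                           tolerance=1.5):
    """Re-enable previously skipped tests when recently escaped
    defects rise noticeably above the baseline (e.g. at a new
    milestone), i.e. when code quality appears to be dropping."""
    return escaped_defects_recent > tolerance * baseline_defects

reenable_skipped_tests(escaped_defects_recent=12, baseline_defects=5)  # True
reenable_skipped_tests(escaped_defects_recent=6, baseline_defects=5)   # False
```

This is what makes the model self-adaptive: skipping decisions are continuously revisited against observed quality rather than fixed once.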
[Chart: relative test reduction rate (0%–70%) over time during Windows 8.1 development, following an initial training period]
Training period
Bug Finding Performance of Tests
How many test executions fail?
[Chart: number of failed test executions vs. total test executions, per branch level]
How many of the failed test executions result in bug reports?
[Chart: failed executions split into false positives (FP), test-unspecific true positives (TP), and test-specific true positives (TP), per branch level]
Impact on Development Process
Secondary improvements
• Machine setup: we may lower the number of machines allocated to the testing process
• Developer satisfaction: removing false test failures increases confidence in the testing process
…hard to estimate speed improvement through simulation
“We used the data […] to cut a bunch of bad content and are running a much leaner BVT system […] we’re panning out to scale about 4x and run in well under 2 hours” (Jason Means, Windows BVT PM)
Michaela Greiler (@mgreiler)
www.michaelagreiler.com
http://research.microsoft.com/en-us/projects/tse/