Posted on 21-Jan-2018
Empirical Evaluations in Software Engineering Research: A Personal Perspective
Ahmed E. Hassan, Canada Research Chair in Software Analytics, BlackBerry Ultra-Large-Scale Systems Chair
School of Computing, Queen’s University Kingston, Canada
Today’s goal:
- Give concrete ideas
- Inspire change
Kingston Home of the 1,000 Islands
Intelligence throughout the lifetime of a software system, from inception to production. http://sail.cs.queensu.ca
SAIL’s new home: a 5,000 sq. ft. historical building (1851)
~15 years of Mining Software Repositories
http://msrconf.org
Early MSR Days
Early Day Miner
MSR Today: Easy Peasy!
Data track: GHTorrent. Lots of data!
Then
Now
Thanks to a large-scale community effort!
Yet low practitioner interest!!
Even though our friendly reviewers are pleased!
Why?!?
We continue to raise the bar for studies
Empirical evaluations continue to mature and evolve
No validation → Single System → Multiple Systems → GitHub Scale → + Practitioner Feedback
Are things this bad? What can we do to prevent this?
We are killing research areas
Software Visualization
Program Understanding
We are following industry instead of leading!!
Web Apps (e.g., AJAX), Data Analytics, Scripting, Infrastructure, Mobile Apps, IoT
We are doing this to ourselves!!
- Practitioner buy-in
- More studies (N+1)
- Replication data
- Not Soft. Eng!
Not Software Engineering!
Mobile Apps, Logs, Build Systems, Load Testing, NoSQL Systems
“I found the overall presentation disjointed…. This needs to focus more on the IR issues and less on web analysis.” (SIGIR rejection, ’98)
More studies (N+1)
We can early-detect brain cancer!!!
How about skin/lung cancer?!!!
The “N+1” critique encourages:
- Case-study fishing
- Shallow analysis
- Generalization of approaches instead of generalization of results
Encourage Meta Analysis
Individual studies: N = 15..241. Total N = 2,659.
More is less!
Practitioner buy-in
Smoking kills!
What do smokers think?!
54.9% of the time, patients did not receive the recommended care for their condition [New England Journal of Medicine, 2003]
Less than half of developers have a CS degree [Stack Overflow survey, 2016]
“A couple billion lines of code later: static checking in the real world” (Dawson Engler)
Do practitioners agree with each other?
Practitioners do not agree with each other

[Study-design figure: Presentation → Survey (93 stakeholders) → Semi-structured interviews (15 key engineers, drawn from the interviewees’ list) → Implications → Validation survey and interviews (25 senior engineers) → Verified implications]
As low as 68% agreement, even within the same system
Replication data is not sufficient!
MSR Today: Easy Peasy!
Limited knowledge of machine learning is a serious risk
A Simple Example: Logistic Regression

Empirical hypothesis testing (is complex code more risky?):

Bugs ~ f(LOC, CC_max)
NOT about predicting bugs
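A minimal sketch of this inferential use of logistic regression, on synthetic data. The metric names (LOC, CC_max) come from the slide; the data-generating process, sample size, and coefficients are invented for illustration. The point is to read off the sign of the complexity coefficient, not a prediction score:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic project data: LOC = lines of code, CC_max = max cyclomatic
# complexity. Complexity grows with size; both raise defect-proneness.
n = 2000
loc = rng.lognormal(4.5, 0.8, n)
cc_max = 1 + 0.02 * loc + rng.normal(0, 2, n)
true_logit = -4 + 0.004 * loc + 0.15 * cc_max
y = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(float)

# Standardize predictors and add an intercept column.
X = np.column_stack([np.ones(n),
                     (loc - loc.mean()) / loc.std(),
                     (cc_max - cc_max.mean()) / cc_max.std()])

# Fit logistic regression by Newton's method (IRLS).
beta = np.zeros(3)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    hess = (X * (p * (1 - p))[:, None]).T @ X
    beta += np.linalg.solve(hess, X.T @ (y - p))

# The question is inferential: is CC_max positively associated with
# defect-proneness? Check the sign (and, in practice, its significance).
print(dict(zip(["intercept", "LOC", "CC_max"], beta.round(2))))
```

In a real study one would report confidence intervals or p-values for the coefficients; the sketch stops at the fitted signs.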
Including correlated metrics

Model M1: Bugs ~ f(CC_max, CC_avg, PAR_max, FOUT_max)
Model M2: Bugs ~ f(CC_avg, CC_max, PAR_max, FOUT_max)
60% of 2000-2011 studies do not handle correlated metrics [Shihab 2012]
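One standard way to catch this pitfall before modelling is a collinearity check such as the variance inflation factor. A sketch on synthetic data (metric names from the slide; the correlation between CC_avg and CC_max is built in by construction):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical metrics: CC_avg and CC_max strongly correlated by design.
n = 500
cc_avg = rng.normal(5, 2, n)
cc_max = 2 * cc_avg + rng.normal(0, 1, n)
par_max = rng.normal(4, 1, n)
fout_max = rng.normal(10, 3, n)
metrics = np.column_stack([cc_max, cc_avg, par_max, fout_max])
names = ["CC_max", "CC_avg", "PAR_max", "FOUT_max"]

# Variance inflation factor: regress each metric on the others;
# VIF = 1 / (1 - R^2). A common rule of thumb flags VIF > 5 (or 10).
def vif(X, j):
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    resid = X[:, j] - A @ np.linalg.lstsq(A, X[:, j], rcond=None)[0]
    r2 = 1 - resid.var() / X[:, j].var()
    return 1 / (1 - r2)

for j, name in enumerate(names):
    print(f"{name:9s} VIF = {vif(metrics, j):.1f}")
```

Here CC_max and CC_avg blow past the threshold while PAR_max and FOUT_max do not, which is exactly why swapping their order (M1 vs. M2) reshuffles their apparent importance.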
Using F-measure in studies

Threshold = 0.2 vs. Threshold = 0.5 vs. Threshold = 0.8
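The pitfall is that the F-measure is not a property of the model alone: it depends on the probability threshold chosen to binarize the scores. A sketch on synthetic, imbalanced scores (the class sizes and score distributions are invented):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical classifier scores: defective modules tend to score higher.
n_clean, n_bug = 900, 100            # imbalanced, as defect data usually is
scores = np.concatenate([rng.beta(2, 5, n_clean), rng.beta(5, 2, n_bug)])
labels = np.concatenate([np.zeros(n_clean), np.ones(n_bug)])

def f_measure(threshold):
    pred = scores >= threshold
    tp = np.sum(pred & (labels == 1))
    precision = tp / max(pred.sum(), 1)
    recall = tp / labels.sum()
    return 2 * precision * recall / max(precision + recall, 1e-9)

# Same model, same data: the reported F-measure swings with the threshold.
for t in (0.2, 0.5, 0.8):
    print(f"threshold={t}: F = {f_measure(t):.2f}")
```

A study that reports a single F-measure without stating (and justifying) its threshold is reporting an arbitrary point on this curve.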
Resampling imbalanced data

Variable importance: original vs. under-sampled data
Not including control metrics

Model M1: Bugs ~ f(CC_max, PAR_max, FOUT_max)
Model M2: Bugs ~ f(TLOC, CC_max, PAR_max, FOUT_max)
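A sketch of why the size control matters, on synthetic data (metric names from the slide; the data-generating model, where bugs are driven mostly by size and complexity merely tracks size, is invented): without TLOC in the model, CC_max soaks up an effect that actually belongs to size.

```python
import numpy as np

rng = np.random.default_rng(4)

# Bugs driven mostly by size (TLOC); CC_max is correlated with size.
n = 1000
tloc = rng.normal(0, 1, n)                  # standardized total lines of code
cc_max = 0.8 * tloc + rng.normal(0, 0.6, n)
bugs = 1.0 * tloc + 0.1 * cc_max + rng.normal(0, 1, n)

def coefs(X, y):
    A = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0][1:]

m1 = coefs(np.column_stack([cc_max]), bugs)        # no control metric
m2 = coefs(np.column_stack([tloc, cc_max]), bugs)  # TLOC as control

print("CC_max effect without TLOC control:", round(m1[0], 2))
print("CC_max effect with    TLOC control:", round(m2[-1], 2))
```

The apparent complexity effect shrinks to nearly its true (small) value once size is controlled for, which is the whole argument for M2 over M1.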
Using 10-fold cross validation

[Figure: performance estimates compared against unseen data (Unseen100, Unseen25) for several validation schemes: out-of-sample bootstrap, 100-repetition out-of-sample bootstrap, 10-fold cross-validation, and 10×10 cross-validation]
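A sketch of the two estimators being compared, on a small synthetic sample with a deliberately simple nearest-centroid learner (the data, learner, and sizes are invented; this is not the setup behind the figure). The out-of-sample bootstrap trains on a with-replacement sample and evaluates on the rows left out of it:

```python
import numpy as np

rng = np.random.default_rng(5)

# Small sample, as in many defect datasets.
n = 100
x = rng.normal(0, 1, (n, 2))
y = (x[:, 0] + rng.normal(0, 1, n) > 1).astype(int)

# A deliberately simple learner: nearest class centroid.
def fit_predict(xtr, ytr, xte):
    c0, c1 = xtr[ytr == 0].mean(0), xtr[ytr == 1].mean(0)
    return (np.linalg.norm(xte - c1, axis=1) <
            np.linalg.norm(xte - c0, axis=1)).astype(int)

def cv10(x, y):
    idx = rng.permutation(len(y))
    accs = []
    for fold in np.array_split(idx, 10):
        tr = np.setdiff1d(idx, fold)
        accs.append((fit_predict(x[tr], y[tr], x[fold]) == y[fold]).mean())
    return np.array(accs)

def oob_bootstrap(x, y, reps=100):
    accs = []
    for _ in range(reps):
        tr = rng.integers(0, len(y), len(y))       # sample with replacement
        te = np.setdiff1d(np.arange(len(y)), tr)   # out-of-sample rows
        accs.append((fit_predict(x[tr], y[tr], x[te]) == y[te]).mean())
    return np.array(accs)

print("10-fold CV:          %.2f +/- %.2f" % (cv10(x, y).mean(),
                                              cv10(x, y).std()))
print("OOB bootstrap (100): %.2f +/- %.2f" % (oob_bootstrap(x, y).mean(),
                                              oob_bootstrap(x, y).std()))
```

On samples this small, a single 10-fold split rests on test folds of ten rows each, which is why repeated schemes (100-repetition bootstrap, 10×10 CV) give more stable estimates.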
Using default ANOVA

Model M1: Bugs ~ f(TLOC, …)
Model M2: Bugs ~ f(…, TLOC)
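The default ANOVA in R (`anova()`) is Type I, i.e. sequential: each term is credited with the drop in residual sum of squares when it enters, in the order the formula lists the terms. With correlated metrics, reordering the formula reshuffles the credit. A numpy sketch of sequential sums of squares on synthetic data (the metrics and effects are invented):

```python
import numpy as np

rng = np.random.default_rng(6)

# Correlated predictors, as software metrics usually are.
n = 400
tloc = rng.normal(0, 1, n)
cc_max = 0.7 * tloc + rng.normal(0, 0.7, n)
bugs = 1.0 * tloc + 0.5 * cc_max + rng.normal(0, 1, n)

def rss(y, cols):
    A = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return ((y - A @ beta) ** 2).sum()

# Type I (sequential) ANOVA: credit each term with the RSS drop
# when it is added, in the order the model lists the terms.
def type1_ss(y, cols):
    ss, prev = [], rss(y, [])
    for k in range(1, len(cols) + 1):
        cur = rss(y, cols[:k])
        ss.append(prev - cur)
        prev = cur
    return ss

ss_first, _ = type1_ss(bugs, [tloc, cc_max])   # M1: TLOC listed first
_, ss_last = type1_ss(bugs, [cc_max, tloc])    # M2: TLOC listed last

print("SS credited to TLOC when listed first:", round(ss_first, 1))
print("SS credited to TLOC when listed last: ", round(ss_last, 1))
```

An order-independent alternative for main effects is Type II ANOVA (e.g. `car::Anova` in R), which tests each term against the model containing all the others.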
Pitfalls in Analytical Modelling

Joint work with Kla Tantithamthavorn
Open Scripts
- Encourages community mentorship
- Works even for industrial data
- Ensures research transparency
Where do we go from here?
• Trailblazing research should be encouraged and defended
• Inaccurate analysis is a serious and growing threat
• Deeper analysis is more valuable than large-scale studies
• Generalization is neither feasible nor desired by practitioners
• Industry is a partner, not an idol
Parting thoughts
http://sail.cs.queensu.ca
Think Different!