Empirical Evaluations in Software Engineering Research: A Personal Perspective

Posted on 21-Jan-2018



1

Empirical Evaluations in Software Engineering Research: A Personal Perspective

Ahmed E. Hassan
Canada Research Chair in Software Analytics; BlackBerry Ultra Large Scale Systems Chair

School of Computing, Queen’s University Kingston, Canada

2

Today’s goal:

Give concrete ideas. Inspire change.

Kingston Home of the 1,000 Islands

3


4

5

Intelligence throughout the lifetime of a software system

from inception to production http://sail.cs.queensu.ca

6

SAIL’s new home: a 5,000 sq. ft. historical building (1851)

~15 years of Mining Software Repositories

7

http://msrconf.org

Early MSR Days

8

9

Early Day Miner

MSR Today: Easy Peasy!

10

11

Data track GHTorrent

Lots of data!

Then

12

Now

13

Thanks to a large-scale community effort!

14

Yet low practitioner interest!!

15

Even though our friendly reviewers are pleased!

Why?!?

16


We continue to raise the bar for studies

Empirical evaluations continue to mature and evolve

17


No validation → Single System → Multiple Systems → GitHub Scale → + Practitioner Feedback

18


Are things this bad? What can we do to prevent this?

We are killing research areas

19

Software Visualization

Program Understanding

We are following industry instead of leading!!

Web Apps (e.g., AJAX), Data Analytics, Scripting, Infrastructure, Mobile Apps, IoT

We are doing this to ourselves!!

21

22


Practitioner buy-in

More studies (N+1)

Replication data

Not Soft. Eng!

23


Not Software Engineering

24


Mobile Apps, Logs, Build Systems, Load Testing, NoSQL Systems

25


“I found the overall presentation disjointed… This needs to focus more on the IR issues and less on web analysis.” (SIGIR rejection, ’98)

26



We can early-detect brain cancer!!!

28

How about skin/lung cancer?!!!

The “N+1” Critique encourages

29

Case Study Fishing

Shallow Analysis

Generalization of approaches instead of generalization of results

30

Encourage Meta Analysis

31


Per-study N = 15 to 241; total N = 2,659

Encourage Meta Analysis

32
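The pooling idea can be sketched as a fixed-effect, inverse-variance meta-analysis. The effect sizes and sample sizes below are made up for illustration, and weighting each study by N (i.e., approximating its variance as 1/N) is a deliberate simplification:

```python
import math

# Hypothetical effect sizes and sample sizes from replicated studies
# (per-study N ranging 15..241); all numbers are invented for illustration.
effects = [0.42, 0.31, 0.55, 0.18, 0.47]
sizes = [15, 60, 102, 241, 88]

# Fixed-effect, inverse-variance pooling, approximating each study's
# variance as 1/N (a simplification): bigger studies weigh more.
weights = sizes
pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
se = 1 / math.sqrt(sum(weights))
print(round(pooled, 3), round(se, 3))  # one pooled estimate, not N isolated ones
```

The payoff is a single pooled estimate with an uncertainty interval, rather than N+1 disconnected case studies.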


1972

More is less!

33

34



Smoking kills!

36

What do smokers think?!

37

54.9% of the time patients did not receive the recommended care for their condition

[New England Journal of Medicine 2003]

Less than half of developers have a CS degree [Stack Overflow survey 2016]

38

“A couple billion lines of code later: static checking in the real world” Dawson Engler

39


Do practitioners agree with each other?

41

Practitioners do not agree with each other

42

[Study design: presentation → survey (93 stakeholders) → semi-structured interviews (15 key engineers) → interviewees’ list → validation survey interviews (25 senior engineers) → implications → verified implications]

As low as 68% agreement

43


even within the same system

44



Replication data is not sufficient!

MSR Today: Easy Peasy!

46

Limited knowledge of machine learning is a serious risk

47

A Simple Example


Logistic Regression

Empirical Hypothesis Testing (is complex code more risky?)

49

Bugs ~ f(LOC, CC_max)

NOT about predicting bugs
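A minimal sketch of this style of hypothesis testing on synthetic data (the data-generating process, sample sizes, and coefficients below are all assumed for illustration): fit a logistic model of bugs against size and complexity by plain gradient ascent, then read off the sign and size of the coefficients rather than a prediction score:

```python
import math
import random

random.seed(42)

# Synthetic data (assumed process for illustration): larger files are
# buggier; max cyclomatic complexity has no independent effect.
n = 500
loc = [random.uniform(10, 1000) for _ in range(n)]
cc_max = [random.uniform(1, 30) for _ in range(n)]
bugs = [1 if random.random() < 1 / (1 + math.exp(-(-3 + 0.005 * l))) else 0
        for l in loc]

def zscore(xs):
    # Standardize so one learning rate suits both predictors.
    m = sum(xs) / len(xs)
    s = (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - m) / s for x in xs]

x1, x2 = zscore(loc), zscore(cc_max)

# Plain gradient ascent on the logistic log-likelihood.
b0 = b1 = b2 = 0.0
for _ in range(2000):
    g0 = g1 = g2 = 0.0
    for xa, xb, t in zip(x1, x2, bugs):
        p = 1 / (1 + math.exp(-(b0 + b1 * xa + b2 * xb)))
        g0 += t - p
        g1 += (t - p) * xa
        g2 += (t - p) * xb
    b0 += 0.1 * g0 / n
    b1 += 0.1 * g1 / n
    b2 += 0.1 * g2 / n

# The question is the sign/size of b1 (LOC), not predictive accuracy.
print(round(b1, 2), round(b2, 2))
```

Here the LOC coefficient comes back clearly positive while the CC_max coefficient hovers near zero, which is the hypothesis-testing reading of the model.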

Including correlated metrics

50

Model M1: Bugs ~ f(CC_max, CC_avg, PAR_max, FOUT_max)
Model M2: Bugs ~ f(CC_avg, CC_max, PAR_max, FOUT_max)


60% of 2000-2011 studies do not handle correlated metrics [Shihab 2012]
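One simple guard, sketched on synthetic metrics (the metric values are invented, and the 0.7 cut-off is one common convention, not a law): measure pairwise correlation before modelling and keep only one metric of any highly correlated pair:

```python
import random

random.seed(1)

# Synthetic metrics (values assumed for illustration): CC_avg and CC_max
# are strongly related by construction, as they tend to be in real systems.
n = 200
cc_avg = [random.uniform(1, 10) for _ in range(n)]
cc_max = [3 * a + random.gauss(0, 2) for a in cc_avg]

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(cc_avg, cc_max)
# Rule of thumb: |r| >= 0.7 means drop one of the pair before modelling,
# otherwise coefficients and variable importance become unstable.
print(round(r, 2), "drop one" if abs(r) >= 0.7 else "keep both")
```

With both metrics left in, the model's coefficients depend on the order the metrics are listed, which is exactly the M1-vs-M2 instability on this slide.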

Using F-measure in studies

52

Threshold = 0.2 Threshold = 0.5 Threshold = 0.8
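The threshold sensitivity can be seen in a toy example (the probabilities and labels below are made up): the same classifier yields noticeably different F-measures depending on the chosen probability cutoff, so a single F-measure at an arbitrary threshold is a fragile evaluation:

```python
# Toy predicted defect probabilities and true labels (invented data).
probs = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.25, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

def f_measure(threshold):
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(p and t for p, t in zip(preds, labels))
    fp = sum(p and not t for p, t in zip(preds, labels))
    fn = sum(not p and t for p, t in zip(preds, labels))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

for t in (0.2, 0.5, 0.8):
    print(t, round(f_measure(t), 2))  # 0.2 -> 0.73, 0.5 -> 0.75, 0.8 -> 0.67
```

Threshold-free measures (e.g., ranking-based ones like AUC) avoid baking this arbitrary choice into the reported result.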

Resampling imbalanced data

53


Resampling imbalanced data

54


Variable importance: original vs. under-sampled data

55


Not including control metrics

56

Model M1: Bugs ~ f(      CC_max, PAR_max, FOUT_max)
Model M2: Bugs ~ f(TLOC, CC_max, PAR_max, FOUT_max)

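The danger of omitting a size control can be sketched on synthetic data (the data-generating process is assumed for illustration: bugs grow only with size, while complexity merely grows with size too). Without the control, complexity looks important; once the linear effect of TLOC is removed, the apparent effect vanishes:

```python
import random

random.seed(3)

# Assumed process (illustration only): bugs depend only on size (TLOC);
# complexity (CC_max) also grows with size but has no effect of its own.
n = 300
tloc = [random.uniform(50, 5000) for _ in range(n)]
cc = [t / 200 + random.gauss(0, 2) for t in tloc]
bugs = [t / 1000 + random.gauss(0, 0.5) for t in tloc]

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def residualize(ys, xs):
    # Strip the linear effect of xs out of ys (simple OLS residuals).
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]

raw = pearson(cc, bugs)                            # complexity looks "important"
controlled = pearson(residualize(cc, tloc), bugs)  # effect gone once size is held fixed
print(round(raw, 2), round(controlled, 2))
```

This is the M1-vs-M2 contrast above: a model without TLOC silently credits the size effect to whatever correlates with size.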

58

Using 10-fold cross validation

[Charts comparing model validation schemes on unseen data: out-of-sample bootstrap, out-of-sample bootstrap (100 repetitions), 10-fold cross validation, 10x10 cross validation]
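One out-of-sample bootstrap iteration can be sketched as follows (the data set size is illustrative): train on a with-replacement sample, evaluate on the out-of-bag rows, and repeat many times (e.g., 100) to obtain a distribution of performance estimates rather than the single numbers a lone 10-fold split yields:

```python
import random

random.seed(11)

# One out-of-sample bootstrap iteration on a toy data set of 100 rows.
n = 100
train_idx = [random.randrange(n) for _ in range(n)]  # sample with replacement
in_bag = set(train_idx)
oob_idx = [i for i in range(n) if i not in in_bag]   # held-out test rows

# On average about 36.8% of rows land out-of-bag in each iteration.
print(len(in_bag), len(oob_idx))
```

Repeating this split-fit-evaluate loop gives stable estimates and honest variance; a single 10-fold run gives neither.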

Using default ANOVA

62

Model M1: Bugs ~ f(TLOC, …)
Model M2: Bugs ~ f(…, TLOC)

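The order dependence of default (Type I, sequential) ANOVA can be demonstrated on synthetic, correlated predictors (all numbers below are invented): sequential sums of squares credit the variance shared by TLOC and CC_max to whichever term is listed first, so the two orderings tell different stories:

```python
import random

random.seed(5)

# Invented, correlated predictors: CC_max tracks TLOC; bugs depend on both.
n = 200
tloc = [random.gauss(0, 1) for _ in range(n)]
cc = [0.8 * t + random.gauss(0, 0.6) for t in tloc]
bugs = [0.7 * t + 0.3 * c + random.gauss(0, 1) for t, c in zip(tloc, cc)]

def center(v):
    m = sum(v) / len(v)
    return [x - m for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ss_one(y, x):
    # Sum of squares explained by a single centred predictor.
    return dot(x, y) ** 2 / dot(x, x)

def ss_two(y, x1, x2):
    # Sum of squares explained jointly (solve the 2x2 normal equations).
    a, b, c = dot(x1, x1), dot(x1, x2), dot(x2, x2)
    p, q = dot(x1, y), dot(x2, y)
    det = a * c - b * b
    b1, b2 = (c * p - b * q) / det, (a * q - b * p) / det
    return b1 * p + b2 * q

y, x_tloc, x_cc = center(bugs), center(tloc), center(cc)
total = ss_two(y, x_tloc, x_cc)
cc_first = ss_one(y, x_cc)           # CC_max entered first: gets the shared credit
cc_last = total - ss_one(y, x_tloc)  # CC_max entered last: only its unique share
print(round(cc_first, 1), round(cc_last, 1))
```

Order-independent alternatives (e.g., Type II sums of squares) report each term's unique contribution regardless of how the formula is written.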

Pitfalls in Analytical Modelling

Joint work with Chakkrit (Kla) Tantithamthavorn

65

Open Scripts

67

Encourages community mentorship

Works even for industrial data

Ensures research transparency

68



69

Where do we go from here?

• Trailblazing research should be encouraged and defended

• Inaccurate analysis is a serious and growing threat

• Deeper analysis is more valuable than large scale studies

• Generalization is neither feasible nor desired by practitioners

• Industry is a partner, not an idol

Parting thoughts

70

http://sail.cs.queensu.ca

Think Different!