Empirical Evaluations in Software Engineering Research: A Personal Perspective

Posted on 21-Jan-2018



1

Empirical Evaluations in Software Engineering Research: A Personal Perspective

Ahmed E. Hassan
Canada Research Chair in Software Analytics; BlackBerry Ultra Large Scale Systems Chair

School of Computing, Queen’s University Kingston, Canada

2

Today’s goal:

Give concrete ideas. Inspire change.

Kingston Home of the 1,000 Islands

3


4

5

Intelligence throughout the lifetime of a software system

from inception to production http://sail.cs.queensu.ca

6

SAIL’s new home: a 5,000 sq. ft. historical building (1851)

~15 years of Mining Software Repositories

7

http://msrconf.org

Early MSR Days

8

9

Early Day Miner

MSR Today: Easy Peasy!

10

11

Data track GHTorrent

Lots of data!

Then

12

Now

13

Thanks to a large-scale community effort!

14

Yet low practitioner interest!!

15

Even though our friendly reviewers are pleased!

Why?!?

16


We continue to raise the bar for studies

Empirical evaluations continue to mature and evolve

17


No validation → Single System → Multiple Systems → GitHub Scale → + Practitioner Feedback

18


Are things this bad? What can we do to prevent this?

We are killing research areas

19

Software Visualization

Program Understanding

We are following industry instead of leading!!

Web Apps (e.g., AJAX), Data Analytics, Scripting, Infrastructure, Mobile Apps, IoT

We are doing this to ourselves!!

21

22


Practitioner buy-in

More studies (N+1)

Replication data

Not Soft. Eng!

23


Not Software Engineering

24


Mobile Apps, Logs, Build Systems, Load Testing, NoSQL Systems

25


“I found the overall presentation disjointed… This needs to focus more on the IR issues and less on web analysis.” (SIGIR rejection, ’98)

26



We can early-detect brain cancer!!!

28

How about skin/lung cancer?!!!

The “N+1” Critique encourages

29

Case Study Fishing

Shallow Analysis

Generalization of approaches instead of generalization of results

30

Encourage Meta Analysis

31


Per-study N = 15 to 241; total N = 2,659

Encourage Meta Analysis

32
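The pooling idea can be sketched as a fixed-effect, inverse-variance meta-analysis. The effect sizes and sample sizes below are made up for illustration, and weighting each study by N (i.e., approximating its variance as 1/N) is a deliberate simplification:

```python
import math

# Hypothetical effect sizes and sample sizes from replicated studies
# (per-study N ranging 15..241); all numbers are invented for illustration.
effects = [0.42, 0.31, 0.55, 0.18, 0.47]
sizes = [15, 60, 102, 241, 88]

# Fixed-effect, inverse-variance pooling, approximating each study's
# variance as 1/N (a simplification): bigger studies weigh more.
weights = sizes
pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
se = 1 / math.sqrt(sum(weights))
print(round(pooled, 3), round(se, 3))  # one pooled estimate, not N isolated ones
```

The payoff is a single pooled estimate with an uncertainty interval, rather than N+1 disconnected case studies.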


1972

More is less!

33

34



Smoking kills!

36

What do smokers think?!

37

54.9% of the time patients did not receive the recommended care for their condition

[New England Journal of Medicine 2003]

Less than half of developers have a CS degree [Stack Overflow survey 2016]

38

“A couple billion lines of code later: static checking in the real world” Dawson Engler

39


Do practitioners agree with each other?

41

Practitioners do not agree with each other

42

[Study design: presentation → survey (93 stakeholders) → semi-structured interviews (15 key engineers) → interviewees’ list → validation survey interviews (25 senior engineers) → implications → verified implications]

As low as 68% agreement

43


even within the same system

44



Replication data is not sufficient!

MSR Today: Easy Peasy!

46

Limited knowledge of machine learning is a serious risk

47

A Simple Example


Logistic Regression

Empirical Hypothesis Testing (is complex code more risky?)

49

Bugs ~ f(LOC, CC_max)

NOT about predicting bugs
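A minimal sketch of this style of hypothesis testing on synthetic data (the data-generating process, sample sizes, and coefficients below are all assumed for illustration): fit a logistic model of bugs against size and complexity by plain gradient ascent, then read off the sign and size of the coefficients rather than a prediction score:

```python
import math
import random

random.seed(42)

# Synthetic data (assumed process for illustration): larger files are
# buggier; max cyclomatic complexity has no independent effect.
n = 500
loc = [random.uniform(10, 1000) for _ in range(n)]
cc_max = [random.uniform(1, 30) for _ in range(n)]
bugs = [1 if random.random() < 1 / (1 + math.exp(-(-3 + 0.005 * l))) else 0
        for l in loc]

def zscore(xs):
    # Standardize so one learning rate suits both predictors.
    m = sum(xs) / len(xs)
    s = (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - m) / s for x in xs]

x1, x2 = zscore(loc), zscore(cc_max)

# Plain gradient ascent on the logistic log-likelihood.
b0 = b1 = b2 = 0.0
for _ in range(2000):
    g0 = g1 = g2 = 0.0
    for xa, xb, t in zip(x1, x2, bugs):
        p = 1 / (1 + math.exp(-(b0 + b1 * xa + b2 * xb)))
        g0 += t - p
        g1 += (t - p) * xa
        g2 += (t - p) * xb
    b0 += 0.1 * g0 / n
    b1 += 0.1 * g1 / n
    b2 += 0.1 * g2 / n

# The question is the sign/size of b1 (LOC), not predictive accuracy.
print(round(b1, 2), round(b2, 2))
```

Here the LOC coefficient comes back clearly positive while the CC_max coefficient hovers near zero, which is the hypothesis-testing reading of the model.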

Including correlated metrics

50

Model M1: Bugs ~ f(CC_max, CC_avg, PAR_max, FOUT_max)
Model M2: Bugs ~ f(CC_avg, CC_max, PAR_max, FOUT_max)


60% of 2000-2011 studies do not handle correlated metrics [Shihab 2012]
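One simple guard, sketched on synthetic metrics (the metric values are invented, and the 0.7 cut-off is one common convention, not a law): measure pairwise correlation before modelling and keep only one metric of any highly correlated pair:

```python
import random

random.seed(1)

# Synthetic metrics (values assumed for illustration): CC_avg and CC_max
# are strongly related by construction, as they tend to be in real systems.
n = 200
cc_avg = [random.uniform(1, 10) for _ in range(n)]
cc_max = [3 * a + random.gauss(0, 2) for a in cc_avg]

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(cc_avg, cc_max)
# Rule of thumb: |r| >= 0.7 means drop one of the pair before modelling,
# otherwise coefficients and variable importance become unstable.
print(round(r, 2), "drop one" if abs(r) >= 0.7 else "keep both")
```

With both metrics left in, the model's coefficients depend on the order the metrics are listed, which is exactly the M1-vs-M2 instability on this slide.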

Using F-measure in studies

52

Threshold = 0.2 Threshold = 0.5 Threshold = 0.8
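The threshold sensitivity can be seen in a toy example (the probabilities and labels below are made up): the same classifier yields noticeably different F-measures depending on the chosen probability cutoff, so a single F-measure at an arbitrary threshold is a fragile evaluation:

```python
# Toy predicted defect probabilities and true labels (invented data).
probs = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.25, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

def f_measure(threshold):
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(p and t for p, t in zip(preds, labels))
    fp = sum(p and not t for p, t in zip(preds, labels))
    fn = sum(not p and t for p, t in zip(preds, labels))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

for t in (0.2, 0.5, 0.8):
    print(t, round(f_measure(t), 2))  # 0.2 -> 0.73, 0.5 -> 0.75, 0.8 -> 0.67
```

Threshold-free measures (e.g., ranking-based ones like AUC) avoid baking this arbitrary choice into the reported result.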

Resampling imbalanced data

53


Resampling imbalanced data

54


Variable importance: original vs. under-sampled data

55


Not including control metrics

56

Model M1: Bugs ~ f(      CC_max, PAR_max, FOUT_max)
Model M2: Bugs ~ f(TLOC, CC_max, PAR_max, FOUT_max)

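The danger of omitting a size control can be sketched on synthetic data (the data-generating process is assumed for illustration: bugs grow only with size, while complexity merely grows with size too). Without the control, complexity looks important; once the linear effect of TLOC is removed, the apparent effect vanishes:

```python
import random

random.seed(3)

# Assumed process (illustration only): bugs depend only on size (TLOC);
# complexity (CC_max) also grows with size but has no effect of its own.
n = 300
tloc = [random.uniform(50, 5000) for _ in range(n)]
cc = [t / 200 + random.gauss(0, 2) for t in tloc]
bugs = [t / 1000 + random.gauss(0, 0.5) for t in tloc]

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def residualize(ys, xs):
    # Strip the linear effect of xs out of ys (simple OLS residuals).
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]

raw = pearson(cc, bugs)                            # complexity looks "important"
controlled = pearson(residualize(cc, tloc), bugs)  # effect gone once size is held fixed
print(round(raw, 2), round(controlled, 2))
```

This is the M1-vs-M2 contrast above: a model without TLOC silently credits the size effect to whatever correlates with size.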

58

Using 10-fold cross validation

[Charts comparing model validation schemes on unseen data: out-of-sample bootstrap, out-of-sample bootstrap (100 repetitions), 10-fold cross validation, 10x10 cross validation]
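One out-of-sample bootstrap iteration can be sketched as follows (the data set size is illustrative): train on a with-replacement sample, evaluate on the out-of-bag rows, and repeat many times (e.g., 100) to obtain a distribution of performance estimates rather than the single numbers a lone 10-fold split yields:

```python
import random

random.seed(11)

# One out-of-sample bootstrap iteration on a toy data set of 100 rows.
n = 100
train_idx = [random.randrange(n) for _ in range(n)]  # sample with replacement
in_bag = set(train_idx)
oob_idx = [i for i in range(n) if i not in in_bag]   # held-out test rows

# On average about 36.8% of rows land out-of-bag in each iteration.
print(len(in_bag), len(oob_idx))
```

Repeating this split-fit-evaluate loop gives stable estimates and honest variance; a single 10-fold run gives neither.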

Using default ANOVA

62

Model M1: Bugs ~ f(TLOC, …)
Model M2: Bugs ~ f(…, TLOC)

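The order dependence of default (Type I, sequential) ANOVA can be demonstrated on synthetic, correlated predictors (all numbers below are invented): sequential sums of squares credit the variance shared by TLOC and CC_max to whichever term is listed first, so the two orderings tell different stories:

```python
import random

random.seed(5)

# Invented, correlated predictors: CC_max tracks TLOC; bugs depend on both.
n = 200
tloc = [random.gauss(0, 1) for _ in range(n)]
cc = [0.8 * t + random.gauss(0, 0.6) for t in tloc]
bugs = [0.7 * t + 0.3 * c + random.gauss(0, 1) for t, c in zip(tloc, cc)]

def center(v):
    m = sum(v) / len(v)
    return [x - m for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ss_one(y, x):
    # Sum of squares explained by a single centred predictor.
    return dot(x, y) ** 2 / dot(x, x)

def ss_two(y, x1, x2):
    # Sum of squares explained jointly (solve the 2x2 normal equations).
    a, b, c = dot(x1, x1), dot(x1, x2), dot(x2, x2)
    p, q = dot(x1, y), dot(x2, y)
    det = a * c - b * b
    b1, b2 = (c * p - b * q) / det, (a * q - b * p) / det
    return b1 * p + b2 * q

y, x_tloc, x_cc = center(bugs), center(tloc), center(cc)
total = ss_two(y, x_tloc, x_cc)
cc_first = ss_one(y, x_cc)           # CC_max entered first: gets the shared credit
cc_last = total - ss_one(y, x_tloc)  # CC_max entered last: only its unique share
print(round(cc_first, 1), round(cc_last, 1))
```

Order-independent alternatives (e.g., Type II sums of squares) report each term's unique contribution regardless of how the formula is written.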

Pitfalls in Analytical Modelling

Joint work with Chakkrit (Kla) Tantithamthavorn

65

Open Scripts

67

Encourages community mentorship

Works even for industrial data

Ensures research transparency

68



69

Where do we go from here?

• Trailblazing research should be encouraged and defended

• Inaccurate analysis is a serious and growing threat

• Deeper analysis is more valuable than large scale studies

• Generalization is neither feasible nor desired by practitioners

• Industry is a partner, not an idol

Parting thoughts

70

http://sail.cs.queensu.ca

Think Different!