
LILIANE SHEYLA DA SILVA FONSECA

AN INSTRUMENT FOR REVIEWING THE COMPLETENESS OF EXPERIMENTAL PLANS FOR CONTROLLED EXPERIMENTS USING HUMAN SUBJECTS IN SOFTWARE ENGINEERING

Federal University of Pernambuco


www.cin.ufpe.br/~posgraduacao

RECIFE
2016


Liliane Sheyla da Silva Fonseca

AN INSTRUMENT FOR REVIEWING THE COMPLETENESS OF EXPERIMENTAL PLANS FOR CONTROLLED EXPERIMENTS USING HUMAN SUBJECTS IN SOFTWARE ENGINEERING

A Ph.D. thesis presented to the Center for Informatics of the Federal University of Pernambuco in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science.

Advisor: Sergio Castelo Branco Soares
Co-Advisor: Carolyn Seaman

RECIFE
2016


Cataloging in Publication

Librarian: Monick Raquel Silvestre da S. Portes, CRB4-1217

F676i Fonseca, Liliane Sheyla da Silva

An instrument for reviewing the completeness of experimental plans for controlled experiments using human subjects in software engineering / Liliane Sheyla da Silva Fonseca. – 2016.

459 leaves: ill., fig., tab.
Advisor: Sérgio Castelo Branco Soares.
Thesis (Doctorate) – Universidade Federal de Pernambuco, CIn, Computer Science, Recife, 2016.
Includes references and appendices.

1. Software engineering. 2. Experimental software engineering. 3. Controlled experiments. 4. Human factors. I. Soares, Sérgio Castelo Branco (advisor). II. Title.

005.1 CDD (23rd ed.) UFPE-MEI 2017-43


Liliane Sheyla da Silva Fonseca

An instrument for reviewing the completeness of experimental plans for controlled experiments using human subjects in Software Engineering

Thesis presented to the Graduate Program in Computer Science of the Universidade Federal de Pernambuco in partial fulfillment of the requirements for the degree of Doctor in Computer Science.

Approved on: December 21, 2016

Prof. Sergio Castelo Branco Soares
Thesis Advisor

EXAMINATION COMMITTEE

Prof. Dr. Hermano Perrelli de Moura
Centro de Informática/UFPE

Prof. Dra. Renata Maria Cardoso Rodrigues de Souza
Centro de Informática/UFPE

Prof. Dr. Andre Luis de Medeiros Santos
Centro de Informática/UFPE

Prof. Dr. Rafael Prikladnicki
Departamento de Fundamentos da Computação/PUCRS

Prof. Dr. Eduardo Henrique da Silva Aranha
Departamento de Informática e Matemática Aplicada/UFRN


I dedicate this thesis to all my family, friends, and professors who gave me the necessary support to get here.


Abstract

It is widely accepted in software engineering that a well-made experimental plan is a recipe for a successful experiment, helping experimenters avoid interference during the experiment. Although a number of tools are available to help researchers write experiment reports for scientific publication, few studies focus on how to assess study protocols with respect to completeness and scientific quality. Designing controlled experiments using human subjects therefore remains a challenge for many experimenters in software engineering, because a large variety of factors must be present in the plan to avoid introducing bias into the experiment. The main aim of this thesis is to define an instrument that helps experimenters, especially beginners, review their experimental plans and assess whether they have produced a plan that is complete and includes all possible factors to minimize bias and other issues. The instrument is a checklist whose design is based on experimental best practices and on experts' experience in planning and conducting controlled experiments using human subjects. To collect the best practices, a systematic mapping study was conducted to identify the support mechanisms (processes, tools, guidelines, among others) used to plan and conduct empirical studies in the empirical software engineering community, and an informal literature review was carried out to find which support mechanisms are generally used in other fields. Moreover, we performed a qualitative study to understand how empirical software engineering experts plan their experiments. The instrument was evaluated through four empirical studies, each exploring a different perspective, with software engineering researchers at different levels of experience. The instrument was assessed with respect to the usefulness of its items, inter-rater agreement, inter-rater reliability, and criterion validity, using a fully crossed design. Two controlled experiments were performed to assess whether using the instrument reduces the chance of forgetting to include something important during the experiment planning phase, compared with ad hoc practices. Additionally, the acceptance of the instrument was assessed in all four studies. In total, 35 participants took part in the four different kinds of assessment of the instrument. In the first study, 75.76% of the items were judged useful by two experts; the remaining items were discussed and adjusted. The second study revealed that using the instrument helped beginners assess experimental plans in the same way as the experts. We found a strong correlation between the overall completeness scores of the experimental plans and the recommendation on whether the experiment should proceed and whether it is likely to be successful. In Studies 3 and 4, the proportion of correct items found by participants using the instrument was greater than that found by participants using ad hoc practices. The instrument had high acceptance among participants. Although the results are positive, more assessments in different settings are required to generalize them. Using the instrument helps experimenters, especially beginners, review the key factors included in the experimental plan, thus contributing to reducing potential confounding factors in the experiment. Reviewing an experimental plan is not a direct evaluation of the quality of the experiment itself, but it allows changes to be made to improve the experiment before it is performed.

Keywords: Experimental Software Engineering. Controlled Experiments. Participants. Human factors.


Resumo

It is commonly accepted in the software engineering community that well-designed experimental plans are recipes for successful experiments, because experimental plans help experimenters avoid interference during the execution of experiments. However, although tools are available to help researchers report their experiments for scientific publication, few studies aim to assess study protocols with respect to completeness and scientific quality. Planning controlled experiments with participants has thus been a challenge for many experimenters in software engineering, owing to the large variety of factors that must be present in an experimental plan to avoid introducing bias into controlled experiments. The main objective of this doctoral thesis is to define an instrument that helps experimenters, especially inexperienced ones, review their experimental plans in order to assess whether they have produced a complete plan that includes all possible factors to minimize bias and problems. The instrument is a checklist based on experimental best practices and on the experience of experimental software engineering experts in planning and conducting controlled experiments with people. To collect the best practices, a systematic mapping study was carried out to identify the support mechanisms (processes, tools, guides, among others) used to plan and conduct empirical studies in the software engineering community, and a literature review was carried out to identify support mechanisms generally used in other areas. In addition, a qualitative study was performed to understand how experimental software engineering experts plan their experiments. The instrument was evaluated through four studies, each explored from different perspectives by software engineering researchers at different levels of experience. The instrument was evaluated with respect to the usefulness of its items, inter-rater agreement and reliability, and criterion validity. Two controlled experiments were carried out to assess whether using the instrument can reduce the chance of forgetting something important during the experiment planning phase, compared with the practices commonly used by researchers. In addition, the four studies evaluated the acceptance of the instrument for reviewing experimental plans of controlled experiments with participants. In total, 35 participants evaluated the instrument through four different types of objectives. In the first study, 75.76% of the items were judged useful by the two experts involved; the remaining items were discussed and adjusted. The second study revealed that using the instrument helped beginners evaluate experimental plans in the same way as the experts. The results showed a strong correlation between the overall completeness scores of the experimental plans and the recommendations on whether the experiment should proceed and the likelihood of its success. In Studies 3 and 4, the proportion of correct items found by participants using the instrument was significantly greater than the results obtained with the practices commonly used by participants. The instrument had high acceptance among the participants. Although the results are positive, further evaluation studies, including other settings, are necessary for the results to be generalized. The use of the instrument by experimenters, especially beginners, supports the review of the main factors that must be included in the experimental plan, thus contributing to reducing potential confounding factors in the experiment. Reviewing an experimental plan is not a direct evaluation of the quality of the experiment, but it allows changes to the experiment to be made before it is actually executed.

Keywords: Experimental Software Engineering. Controlled Experiments. Participants. Human Factors.


List of Figures

2.1 Overview of the experiment process

3.1 Results of search and selection process
3.2 Distribution of full papers by year
3.3 Most active authors
3.4 Geographic distribution
3.5 Institution distribution
3.6 Profile page of the mechanism SM08
3.7 Source area of the support mechanisms for experiments
3.8 Source area of the support mechanisms for general empirical research
3.9 Experiments evolution
3.10 Not identified studies evolution

4.1 Conceptual model about what experts actually do when they design their experiments

4.2 Ways of learning about experiments
4.3 Intersection of reported results – Problems, Mistakes and Gaps

5.1 Experimental Process

6.1 Overview of the four evaluation studies
6.2 Versions of the Instrument
6.3 Example of the Experimental Website Layout
6.4 Bland-Altman Plot - Beginners
6.5 Linear regression - Beginners
6.6 Bland-Altman Plot - Experts
6.7 Linear regression - Experts
6.8 Items Identified Correctly, Raw Data, Study 3
6.9 Study 3 - Success Rate Comparison
6.10 Items Identified Correctly, Raw Data, Study 4
6.11 Study 4 - Success Rate Comparison
6.12 Scatterplots for the FP and IA by expertise
6.13 Scatterplots for the PU and PEOU by expertise

D.1 Abstract - Study 1
D.2 Instructions - Study 1
D.3 Option 1 - Study 1
D.4 Option 2 - Study 1
D.5 Instrument - Study 1
D.6 Feedback - Study 1
D.7 Abstract - Study 2
D.8 Instructions - Study 2
D.9 Dry Run - Study 2
D.10 Study 2
D.11 Experimental Plans - Study 2
D.12 Instrument - Study 2
D.13 Feedback - Study 2
D.14 Welcome - Study 3
D.15 Agenda - Study 3
D.16 Dry Run Treatment 1 - Study 3
D.17 Dry Run Treatment 2 - Study 3
D.18 Treatment 1 A - Study 3
D.19 Treatment 1 B - Study 3
D.20 Treatment 2 A - Study 3
D.21 Treatment 2 B - Study 3
D.22 Experimental Plans - Study 3
D.23 Feedback - Study 3
D.24 Welcome - Study 4 A
D.25 Welcome - Study 4 B
D.26 Instructions - Study 4 A
D.27 Instructions - Study 4 B
D.28 Treatment 1 A - Study 4
D.29 Treatment 2 A - Study 4
D.30 Treatment 1 B - Study 4
D.31 Treatment 2 B - Study 4
D.32 Experimental Plans - Study 4 A
D.33 Experimental Plans - Study 4 B

E.1 Assessment of the Experimental Plans by Researchers - Raw Data Study 2
E.2 Completeness Score from Researchers - Raw Data Study 2
E.3 Agreement - Raw Data Study 2
E.4 Raw Data Study 3
E.5 Items Identified Correctly - Raw Data Study 3
E.6 Raw Data Study 4
E.7 Items Identified Correctly - Raw Data Study 4

F.1 Bland-Altman Plot - Beginners
F.2 Linear regression - Beginners
F.3 Bland-Altman Plot - Experts
F.4 Linear regression - Experts
F.5 Study 2 - Instrument's Acceptance: Fitness for Purpose
F.6 Study 2 - Instrument's Acceptance: Item's Appropriateness
F.7 Study 3 - Instrument's Acceptance: Fitness for Purpose
F.8 Study 3 - Instrument's Acceptance: Item's Appropriateness
F.9 Study 4 - Instrument's Acceptance: Fitness for Purpose
F.10 Study 4 - Instrument's Acceptance: Item's Appropriateness
F.11 Study 2 - Instrument's Acceptance: Perceived usefulness
F.12 Study 3 - Instrument's Acceptance: Perceived usefulness
F.13 Study 4 - Instrument's Acceptance: Perceived usefulness
F.14 Study 2 - Instrument's Acceptance: Perceived ease of use
F.15 Study 3 - Instrument's Acceptance: Perceived ease of use
F.16 Study 4 - Instrument's Acceptance: Perceived ease of use


List of Tables

2.1 Type of Bias [39]

3.1 Data extraction instrument
3.2 The most cited support mechanisms
3.3 Mechanisms used to support activities of the experimental process
3.4 General mechanisms used to support activities of the experimental process

4.1 Research Question 1 (RQ1): What do experimental experts actually do when they design their experiments?
4.2 Research Question 2 (RQ2): What kinds of problems/traps do the experts fall into?
4.3 Research Question 3 (RQ3): How do experts currently learn about experiment planning?
4.4 Research Question 4 (RQ4): What gaps do experts have in their knowledge?
4.5 RQ1 - What do experimental experts actually do when they design their experiments? - Interview Questions
4.6 RQ2 - What kinds of problems/traps do the experts fall into? - Interview Questions
4.7 RQ3 - How do experts currently learn about experiment planning? - Interview Questions
4.8 RQ4 - What gaps do experts have in their knowledge? - Interview Questions
4.9 Order of the interview questions
4.10 Schedule of the Interviews
4.11 List of Open Codes
4.12 Experimental plan elements from Experts
4.13 Support Mechanisms cited by the Interviewees
4.14 Things that experts would like to have versus Studies towards these suggestions

5.1 Guidelines Sources
5.2 Checklists Sources
5.3 Human factors and ethical concerns guidelines sources
5.4 Guidelines for conducting and reporting experiments in software engineering
5.5 Experimental process steps in order proposed by authors
5.6 Experimental process phases of the instrument
5.7 Classification and sub-classification of the experimental process phases
5.8 Classification and sub-classification of the checklist items


6.1 Experimental Materials for the four studies
6.2 Demographic Questionnaire
6.3 Demographic Information from Study 1
6.4 Demographic Information from Study 2
6.5 Demographic Information from Study 3
6.6 Demographic Information from Study 4
6.7 Schedule of the Instrument Validation 1
6.8 Overview of collected data from Study 1
6.9 Schedule of the Instrument Validation 2
6.10 Completeness Score Definition
6.11 Completeness scores from the researchers
6.12 Descriptive statistics of the completeness scores of the experimental plans between researchers with similar expertise
6.13 Average Deviation Indices
6.14 Inter-rater reliability of the instrument
6.15 Criterion Validity Values
6.16 Experiment Design
6.17 Co-located Controlled Experiment Schedule
6.18 Sample Random Selection
6.19 Schedule of the Instrument Validation 3
6.20 Experimental Plan 1
6.21 Study 3 - Experimental Plan 2
6.22 Remote Controlled Experiment Schedule
6.23 Schedule of the Instrument Validation 4
6.24 Study 4 - Experimental Plan 1
6.25 Study 4 - Experimental Plan 2
6.26 Descriptive Statistics for the appropriateness: FP and IA variables
6.27 Descriptive Statistics for perceived ease of use and usefulness variables
6.28 Fitness for purpose Values
6.29 Item's Appropriateness Values
6.30 Fitness for purpose Values
6.31 Item's Appropriateness Values
6.32 Fitness for purpose Values
6.33 Item's Appropriateness Values
6.34 Fitness for purpose Values
6.35 Item's Appropriateness Values
6.36 Perceived Usefulness Values
6.37 Perceived Usefulness Values
6.38 Perceived Usefulness Values
6.39 Perceived Usefulness Values
6.40 Perceived ease of use Values
6.41 Perceived ease of use Values
6.42 Perceived ease of use Values
6.43 Perceived ease of use Values

E.1 Raw Data Study 1: Values from Instrument's Acceptance: Fitness for Purpose
E.2 Raw Data Study 1: Values from Instrument's Acceptance: Item's Appropriateness
E.3 Raw Data Study 1: Values from Instrument's Acceptance: Perceived usefulness
E.4 Raw Data Study 1: Values from Instrument's Acceptance: Perceived ease of use
E.5 Instrument Validation 1
E.6 Comments on items that the participants had problems understanding
E.7 Raw Data Study 2: Values from Instrument's Acceptance: Fitness for Purpose
E.8 Raw Data Study 2: Values from Instrument's Acceptance: Item's Appropriateness
E.9 Raw Data Study 2: Values from Instrument's Acceptance: Perceived usefulness
E.10 Raw Data Study 2: Values from Instrument's Acceptance: Perceived ease of use

F.1 Average Deviation Indices
F.2 Inter-rater reliability of the instrument
F.3 Criterion Validity Values

H.1 Schedule of the Study
H.2 Schedule of the Study


Contents

1 Introduction
  1.1 Problem and Motivation
  1.2 Study Goals and Research Questions
  1.3 Research Strategy
  1.4 Contributions
  1.5 Thesis Structure

2 Background
  2.1 Software Engineering
  2.2 Evidence-Based Software Engineering
  2.3 Empirical Studies in Software Engineering
    2.3.1 Controlled Experiments in Software Engineering
    2.3.2 Experiments in other fields
  2.4 Important Human-Related Factors in Experimental Research and Ethical Issues
  2.5 Experimental Plans
    2.5.1 Importance of Experimental Plans
    2.5.2 Quality Assessment in Experimental Plans and Experiment Reports
  2.6 Chapter Summary

3 Support Mechanisms to Conduct Experiments in Software Engineering: a Systematic Mapping Study
  3.1 Method
    3.1.1 Research Question
    3.1.2 Search Strategy
    3.1.3 Data Extraction
    3.1.4 Data Analysis
  3.2 Results
    3.2.1 Overview of the Systematic Mapping Study
    3.2.2 Catalog of Support Mechanisms
      3.2.2.1 The Most Used Support Mechanisms
      3.2.2.2 Main Mechanisms for Conducting Experimental Activities
      3.2.2.3 General Support Mechanisms
      3.2.2.4 Overview of Experimental Software Engineering Field
  3.3 Discussion
  3.4 Threats to Validity
  3.5 Chapter Summary

4 Qualitative Interview Study
  4.1 Introduction
  4.2 Methodology
    4.2.1 Research Questions
    4.2.2 Interview Design
    4.2.3 Procedures
      4.2.3.1 IRB Application
    4.2.4 Sample
    4.2.5 Process of Consent
    4.2.6 Piloting
    4.2.7 Data Collection
    4.2.8 Data Analysis
  4.3 Results and Principal Findings
    4.3.1 RQ1: What do experimental experts actually do when they design their experiments?
      4.3.1.1 Brainstorming about experiment rationales
      4.3.1.2 Writing and updating experimental plans
      4.3.1.3 Revising the experimental plan
      4.3.1.4 Using Support Mechanisms
      4.3.1.5 Meetings with external researchers
      4.3.1.6 Running pilots
      4.3.1.7 Discussing results from pilot
    4.3.2 RQ2: What kinds of problems/traps do the experts fall into?
      4.3.2.1 Common mistakes and traps
      4.3.2.2 Problems and Difficulties
    4.3.3 RQ3: How do experts learn about experiments?
      4.3.3.1 Reading
      4.3.3.2 Listening and Speaking
      4.3.3.3 Writing
      4.3.3.4 Observing
      4.3.3.5 Practicing
    4.3.4 RQ4: What gaps do experts have in their knowledge?
      4.3.4.1 Gaps in experts' knowledge
      4.3.4.2 Gaps in empirical literature
      4.3.4.3 Supports that experts would like to have
    4.3.5 Perceptions of experiment planning in empirical software engineering
  4.4 Discussion
  4.5 Trustworthiness of this Study
  4.6 Chapter Summary

5 Instrument Development
  5.1 Definitions
  5.2 Instrument Development Methodology
    5.2.1 Data collection from empirical checklists and guidelines, and experts' experience
    5.2.2 Standardizing experimental process phases and classifying items by phases
    5.2.3 Grouping related data within each experimental phase
    5.2.4 Formulating the instrument items and formulating recommendations
    5.2.5 Pre-validation of the instrument items
  5.3 Instrument Specification
    5.3.1 Objects of Interest
    5.3.2 Raters of Interest
    5.3.3 Instrument items
  5.4 Chapter Summary

6 Instrument Evaluation
  6.1 Research Approach
    6.1.1 The Experimental Website
    6.1.2 Demographic Questionnaire
    6.1.3 Demographic Questionnaire Results
  6.2 Study 1: Instrument Validation
    6.2.1 Study Goals
    6.2.2 Study Design
      6.2.2.1 Participants
      6.2.2.2 Objects
      6.2.2.3 Procedure
    6.2.3 Data Collection
    6.2.4 Data Analysis
    6.2.5 Results
    6.2.6 Summary Study 1
  6.3 Study 2: Instrument Validation
    6.3.1 Study Goals
    6.3.2 Study Design
      6.3.2.1 Participants
      6.3.2.2 Objects
      6.3.2.3 Procedure
    6.3.3 Data Collection
    6.3.4 Data Analysis
      6.3.4.1 Inter-Rater Agreement
      6.3.4.2 Inter-Rater Reliability
      6.3.4.3 Criterion Validity
    6.3.5 Results
      6.3.5.1 Completeness Scores from Researchers
      6.3.5.2 Difference between the completeness mean scores from Beginner and Expert Researchers
      6.3.5.3 Inter-Rater Agreement between raters with similar expertise: beginner and expert researchers
      6.3.5.4 Inter-Rater Agreement among Four Researchers
      6.3.5.5 Inter-Rater Reliability
      6.3.5.6 Criterion Validity
    6.3.6 Summary Study 2
  6.4 Study 3: Co-located Controlled Experiment
    6.4.1 Experiment Definition
      6.4.1.1 Global Goal
      6.4.1.2 Study Goal
      6.4.1.3 Research Questions
      6.4.1.4 Measurement Goal
      6.4.1.5 Metrics
    6.4.2 Planning
      6.4.2.1 Context selection
      6.4.2.2 Hypotheses formulation
      6.4.2.3 Variables selection
      6.4.2.4 Participants
      6.4.2.5 Experiment Design
    6.4.3 Data Analysis
    6.4.4 Results
    6.4.5 Summary Study 3
  6.5 Study 4: Remote Controlled Experiment
    6.5.1 Participants
    6.5.2 Setting
    6.5.3 Procedure
    6.5.4 Data Analysis
    6.5.5 Results
    6.5.6 Summary Study 4
  6.6 Instrument's Acceptance
    6.6.1 Instrument's Acceptance Results
      6.6.1.1 Instrument's Acceptance: Overview of Results
      6.6.1.2 Appropriateness - To what extent do evaluators believe that the instrument is appropriate for reviewing experimental plans for controlled experiments using participants in Software Engineering?
      6.6.1.3 Perceived usefulness - To what extent do evaluators believe that using the instrument would enhance their performance in planning Software Engineering controlled experiments with participants?
      6.6.1.4 Perceived ease of use - To what extent do evaluators believe that using the instrument would be free of effort?
      6.6.1.5 Qualitative Analysis - Open Questions
  6.7 Threats to Validity
    6.7.1 Internal Validity
    6.7.2 External Validity
    6.7.3 Construct Validity
    6.7.4 Conclusion Validity
  6.8 Discussion
  6.9 Final Version of the Instrument
  6.10 Chapter Summary

7 Related Work
  7.1 Related Work – Systematic Mapping Study (Chapter 3)
  7.2 Related Work – Qualitative Interview Study (Chapter 4)
  7.3 Related Work – Proposed Instrument (Chapter 5)
  7.4 Summary Chapter

8 Conclusions
  8.1 Answers to the Research Questions of the Thesis Research
  8.2 Contributions
  8.3 Study Limitations
    8.3.1 Systematic Mapping Study
    8.3.2 Qualitative Study
    8.3.3 Instrument
    8.3.4 Instrument Evaluation
  8.4 Current Activities
  8.5 Future Work

References


Appendix

Appendix A Support Mechanisms Reference
  A.1 Support Mechanisms
  A.2 Primary Studies

Appendix B Initial Version of the Proposed Instrument
  B.1 Category 1: Stating the Goals
  B.2 Category 2: Hypotheses, Variables, and Measurements
  B.3 Category 3: Participants
  B.4 Category 4: Experimental Materials and Tasks
  B.5 Category 5: Experimental Design
  B.6 Category 6: Procedure
  B.7 Category 7: Data Collection and Data Analysis
  B.8 Category 8: Threats to Validity
  B.9 Category 9: Document

Appendix C Checklists and Guidelines Items
  C.1 Checklists Items
    C.1.1 Classification: Goal Definition
    C.1.2 Classification: Research Question
    C.1.3 Classification: Metrics and Measurement
    C.1.4 Classification: Context Selection
    C.1.5 Classification: Hypotheses Formulation
    C.1.6 Classification: Parameters and Variables
    C.1.7 Classification: Participants
    C.1.8 Classification: Group Assignment
    C.1.9 Classification: Experimental Materials and Tasks
    C.1.10 Classification: Experimental Design
    C.1.11 Classification: Procedure
    C.1.12 Classification: Data Collection
    C.1.13 Classification: Analysis Procedures
    C.1.14 Classification: Threats to Validity
    C.1.15 Classification: Document Structure
  C.2 Guidelines Items
    C.2.1 Classification: Goal Definition
    C.2.2 Classification: Research Questions
    C.2.3 Classification: Metrics and Measurements
    C.2.4 Classification: Context Selection and Hypotheses Formulation




1 Introduction

This chapter presents an overview of the thesis. The research problem and motivation are presented in Section 1.1. Section 1.2 then describes the study goals and research questions, and the research strategy is presented in Section 1.3. The contributions of the thesis are summarized in Section 1.4. Finally, the structure of the thesis is described in Section 1.5.

1.1 Problem and Motivation

Since the 1980s the need for experimentation has been recognized in software engineering [1], and experimentation has become widely accepted as a way to assess technologies (process models, methods, techniques, tools, languages, etc.) and human-computer interactions [2], [3], [4], [5], [6], [7], [8], [9], [10], [11]. Likewise, quality assessment of experiments and controlled experiments has often been carried out in systematic literature reviews, replications, and meta-analyses in software engineering, using checklists and scales to evaluate experimental results published in different venues, in order to build a reliable empirical software engineering body of knowledge [12]. Instruments used to assess controlled experiments cannot measure the quality of the experiment itself, but they can assess the quality of experiment reports. However, many researchers have found controlled experiment reports in software engineering to be of poor quality. Surprisingly, this problem involves not only the omission of important details because of restricted space, but also mistakes in the choice of the experimental design, the experimental setting, and the population sample, among others [5], [13], [14]. One reason is that researchers still have difficulty planning and performing controlled experiments using human subjects in software engineering [14], [5], [15], [16], [13], because it is hard to manage the large variety of factors involved in experiment planning. Also, many researchers have seen experiments as too risky and too difficult to plan and conduct because, besides being expensive to run, they might lead to inconclusive or negative results [17].

In the experimentation field, it is generally accepted that experiment planning is one of the most important phases of the experimentation process because experimental plans are recipes for experiments [18]. A thorough experimental plan contributes to increasing the internal validity of the experiment and helps to minimize bias [19]. Experimental plans are the output of the planning phase; they outline all factors relating to the experiment, including its goal as well as methodological, bureaucratic, time, financial, and ethical aspects, potential limitations, and expected results. In addition, experimental plans or protocols of controlled experiments are expected to provide all the information necessary to conduct and replicate the study and to integrate it into the empirical software engineering body of knowledge [20]. Careful examination of these issues at the planning phase allows a thorough and informed preparation that increases the likelihood of a successful controlled experiment with meaningful results.

Although the literature [21], [22], [23], [24], [20], [17] presents important resources for planning and conducting experiments and writing experimental reports, few of them focus on how to review experimental plans with respect to completeness and scientific quality. To carry out a controlled experiment, an experimental plan is mandatory. However, experience has shown that many experimental plans are poorly written, resulting in failures during the execution of the controlled experiment. It is also difficult for researchers to find relevant information, because the factors that should be contained in the experimental plan are spread across different sources. There is no unified collection of experimental design considerations that the community can use to support the review of experimental plans. Such a mechanism would be especially useful for beginners, who are planning their first experiments and might not be aware of some planning issues.

In addition, there is a gap between what the literature says about how to plan experiments and what experts actually do when they design their experiments. As with experiment reporting, instruments for assessing the completeness of an experimental plan can be used to prevent problems that would otherwise only be identified during execution. It is also important to consider disciplines other than software engineering: many disciplines have been carrying out experimentation much longer than software engineering has (e.g., medicine, education, the social sciences, and psychology). Evidence of this is found in the study published by Kitchenham [25], in which experimental studies from other fields scored better on Kitchenham's quality scale than experimental studies from software engineering.

Although many tasks are associated with experiment plans, such as reviewing related work, formulating theory to motivate and shape the research questions, planning the collection of qualitative data during the study to help explain the experimental findings, and assessing the completeness of the plan, this last task is, according to the qualitative study presented in Chapter 4, one of the most common issues faced by researchers, because a poor-quality plan may increase the cost of the experiment and may also generate invalid experiment results. Moreover, forgetting important elements and making changes to the controlled experiment after the planning phase is complete make the experiment difficult to control. An instrument for reviewing these factors should therefore help researchers remember the important elements that should be considered in the planning phase.

The main objective of this thesis is to define and evaluate an instrument for reviewing the completeness of experimental plans for controlled experiments using human subjects in software engineering, based on empirical software engineering guidelines and also on literature from external fields. The development of the instrument also takes into account the experience of experts in planning controlled experiments. Besides this, the instrument gathers several support mechanisms scattered across the literature and helps inexperienced researchers review their controlled experiment plans. Using the instrument helps researchers, especially beginners, check whether the key factors are included in the experimental plan and contributes to reducing potential confounding factors in the experiment.

Assessing the completeness of an experimental plan is not a direct evaluation of the controlled experiment itself; in other words, we cannot determine whether a controlled experiment will be successful just by reviewing its plan. However, assessing the completeness of an experimental plan reduces the cost and risk of problems in the experiment, because an improved plan allows researchers to take maximum advantage of their resources, especially participants. Applying the instrument during the planning phase allows changes that improve the experiment as much as possible before it is run, thereby reducing the number of failures during execution.

For clarification, the instrument is based on the following definition of a controlled experiment by Wohlin [21]: an experiment (or controlled experiment) in software engineering is an empirical enquiry that manipulates one factor or variable of the studied setting. Based on randomization, different treatments are applied to or by different subjects, while other variables are kept constant, and the effects on outcome variables are measured. In human-oriented experiments, humans apply different treatments to objects, while in technology-oriented experiments, different technical treatments are applied to different objects.

Throughout this thesis, we will use the terms experiments and controlled experiments interchangeably to refer to experiments based on randomization.
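To make the randomization in this definition concrete, the following is a minimal sketch, in Python, of how human subjects could be randomly assigned to two treatments while keeping group sizes balanced. The subject labels are hypothetical, and the treatment names merely echo the comparison made later in Studies 3 and 4 (the instrument versus ad hoc practices); the thesis does not prescribe any particular implementation.

    # A minimal sketch of randomized assignment of subjects to treatments.
    # Subject labels and treatment names are illustrative assumptions only.
    import random

    subjects = ["S1", "S2", "S3", "S4", "S5", "S6"]
    treatments = ["instrument", "ad hoc practices"]

    random.seed(42)            # fixed seed so the example is reproducible
    random.shuffle(subjects)   # randomize the order of the subjects

    # Alternate the shuffled subjects across treatments to balance group sizes.
    assignment = {s: treatments[i % len(treatments)] for i, s in enumerate(subjects)}
    for subject in sorted(assignment):
        print(subject, "->", assignment[subject])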

1.2 Study Goals and Research Questions

The main goal of this research is to propose and evaluate an instrument to support experimenters in reviewing experimental plans for controlled experiments using human subjects in software engineering. It provides a checklist for reviewing the completeness of an experimental plan. To achieve the main goal, the following objectives were defined:

1. To provide a complete checklist for reviewing all the different aspects of an experimental plan for a controlled experiment.

2. To guide reviewers regarding the completeness of experimental plans.


3. To assist researchers in saving resources by preventing mistakes in experimental plans during the planning phase.

4. To develop an instrument which is appropriate, useful, and easy to use, helping especially novice researchers in experimental software engineering to review their plans for experiments using human subjects.

Based on the research problem described before, three research questions were defined:

1. What are the most commonly adopted mechanisms to conduct experiments in software engineering research?

2. What do experimental software engineering experts actually do when they design their experiments?

3. How can experiment planning be supported to help researchers, especially beginners, review their experiment plans for completeness?

1.3 Research Strategy

In order to achieve the objectives of this research, five research steps were performed, as follows:

Step 1: Systematic Mapping Study: To map information about support mechanisms (methodologies, processes, guides, tools, techniques, and best practices) used to conduct experiments in software engineering.

The systematic mapping study is part of a larger project of our research group at UFPE. It was performed to identify the main mechanisms used to support software engineering experiments. It included all full papers published at the international conferences EASE (Evaluation and Assessment in Software Engineering) and ESEM (Empirical Software Engineering and Measurement), and in the journal ESEJ (Empirical Software Engineering Journal), since their first editions.

Although the systematic mapping study conducted by our research group includes not only experiments but also other kinds of empirical studies in software engineering, we analyzed and reported results regarding software engineering experiments because of the scope of our Ph.D. research.

During this step, we provided a catalog of support mechanisms to conduct experiments in software engineering, including:

• The most used support mechanisms for conducting experiments: This result presents the description of the most cited support mechanisms.


• Main mechanisms for conducting experiments: This result describes the main mechanisms specific to experiments.

• General support mechanisms for conducting experiments: This result presents support mechanisms that, although they do not address a specific empirical strategy, support certain experiment activities, such as statistical analysis and qualitative analysis.

• Overview of the experimental software engineering field: This result presents the source areas of the support mechanisms for conducting experiments, the evolution of experiments over the years, and the concerns about "not identified" empirical strategies in studies published in EASE, ESEM, and ESEJ.

For more details, see Chapter 3.

Step 2: Qualitative study for understanding how software engineering researchers plan their experiments

We used qualitative analysis methods from the grounded theory approach to analyze data from interviews with 11 experienced experimental software engineering researchers in order to understand how they plan their experiments. This study addressed four main topics: what software engineering researchers actually do when they design experiments, what kinds of problems or traps they fall into, how they currently learn about experiments, and what gaps they have in their knowledge. During this step, we identified:

• What experimental experts actually do when they design their experiments.

• What kinds of problems/traps the experts fall into.

• How experts currently learn about experiments.

• What gaps experts have in their knowledge.

For more details, see Chapter 4.

Step 3: Developing the instrument:

We collected the key elements from the empirical guidelines and checklists identified previously (Steps 1 and 2) to build the proposed instrument. This step aims to define the instrument for assessing the completeness of a controlled experiment plan. It contains important factors of an experimental plan based on empirical best practices and experts' experiences. In addition, each element is associated with recommendations, which are supported by evidence from the literature. The instrument was developed in four steps (illustrated by the sketch after the list below):


• Merging all found data into one list.

• Classifying data into dimensions according to the sections of experiment planning.

• Selecting items by grouping similar items within each category.

• Associating recommendations with each item.
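
The sketch below (our own illustration; the item texts, dimension names, and the crude text-normalization heuristic are all hypothetical) conveys the spirit of the first three steps: merging items gathered from several sources into dimensions while dropping near-duplicates.

```python
# Hypothetical checklist items gathered from different guidelines (merging step)
collected = [
    ("Are the hypotheses formally stated?", "Hypotheses"),
    ("Are hypotheses stated formally?", "Hypotheses"),  # near-duplicate
    ("Is the sample size justified?", "Participants"),
    ("Are participants assigned to treatments at random?", "Design"),
]

def normalize(text):
    """Crude key for spotting near-duplicate items: bag of content words."""
    words = text.lower().replace("?", "").split()
    return frozenset(w for w in words if w not in {"the", "a", "are", "is"})

# Classify into dimensions (classifying step) and drop duplicates (grouping step)
checklist, seen = {}, set()
for item, dimension in collected:
    key = (dimension, normalize(item))
    if key not in seen:
        seen.add(key)
        checklist.setdefault(dimension, []).append(item)

for dimension, items in checklist.items():
    print(dimension, items)
```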

For more details, see Chapter 5.

Step 4: Assessing the instrument: The instrument was assessed through four studies, and all data analysis was reviewed by two researchers who were not involved in this research. The studies are described as follows:

• Study 1: Analyzing which of the instrument's checklist items high-level experts in experimental software engineering find useful, which ones they do not find useful, and which ones they have trouble understanding. In addition, we assessed the instrument's acceptance.

• Study 2: Analyzing the agreement, reliability, criterion validity, and acceptance of the instrument among expert and beginner researchers in experimental software engineering (a typical agreement computation is sketched after this list).

• Study 3: Performing a co-located controlled experiment using postgraduate students to assess whether the usage of the instrument can reduce the chance of forgetting to include something important during the experiment planning phase, compared to the usage of ad hoc practices.

• Study 4: Performing a remote controlled experiment using postgraduate students to assess whether the usage of the instrument can reduce the chance of forgetting to include something important during the experiment planning phase, compared to the usage of ad hoc practices. We also assessed the instrument's acceptance. Study 4 uses the same study design as Study 3, but it was performed in a virtual environment, while Study 3 was performed in a laboratory.
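
As an illustration of the kind of agreement analysis Study 2 refers to, the sketch below (our own; Cohen's kappa is one common chance-corrected agreement statistic, and the reviewer judgments shown are hypothetical) computes agreement between two raters judging the same items.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Expected agreement if both raters labeled items independently at random
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two reviewers judging whether 8 checklist items are covered in a plan
reviewer_1 = ["yes", "yes", "no", "yes", "no", "yes", "no", "no"]
reviewer_2 = ["yes", "no", "no", "yes", "no", "yes", "yes", "no"]
print(round(cohen_kappa(reviewer_1, reviewer_2), 3))  # prints 0.5
```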

Step 5: Adjusting the instrument: We compiled the results of the instrument assessment and made the suggested adjustments and improvements.

For more details about Steps 4 and 5, see Chapter 6.

1.4 Contributions

This section summarizes the contributions of this study. As mentioned before, the main goal of this research is to provide a supportive instrument that helps experimenters review their experiment plans, assessing whether they have produced a complete experiment plan based on experimental best practices and experts' experience in planning controlled experiments using human subjects. Thus, we expect that this instrument can be adopted to check the completeness of experiment plans and thereby improve their quality. The main contributions of this research can be summarized as:

1. A catalog of the main support mechanisms for planning and conducting experiments in software engineering.

2. An analysis of what experimental software engineering experts actually do when they design their experiments.

3. An instrument for reviewing the completeness of experiment plans.

4. Assessment of the proposed instrument.

So far, the results have been published at ESEM 2014 [26] and EASE 2015 [27].

• Support mechanisms to conduct empirical studies in software engineering. In Proceedings of the 8th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM '14). ACM, New York, NY, USA.

• Support Mechanisms to Conduct Empirical Studies in Software Engineering: a Systematic Mapping Study. In Proceedings of the 19th International Conference on Evaluation and Assessment in Software Engineering (EASE '15), Nanjing, China. ACM, New York, NY, USA.

1.5 Thesis Structure

This document is organized as follows:

• Chapter 2 reviews fundamental concepts used throughout this study.

• Chapter 3 presents a systematic mapping study to identify support mechanisms to conduct experiments in software engineering.

• Chapter 4 presents a qualitative study for understanding how software engineering researchers plan their experiments.

• Chapter 5 presents the development of the proposed instrument.

• Chapter 6 describes the evaluation of the proposed instrument.

• Chapter 7 presents the related work.

• Chapter 8 presents the conclusions.


2 Background

The need to perform empirical studies in software engineering is not new; it has been a topic of great interest for many years. In 1986, Basili [1] described this need, which motivated many initiatives to improve and disseminate the adoption of empirical strategies in software engineering in the following decades, including environments, guidelines, methodologies, tools, and other mechanisms to support the execution of empirical studies, allowing researchers to evaluate technology proposals and research results in software engineering [28], [3], [29], [30], [24], [11], [31], [21], [23]. However, researchers have faced problems in conducting controlled experiments, especially ones that involve human participants, because human aspects increase the cost of experimentation and make it difficult to achieve statistical significance [32].

In this chapter, we present a brief but essential background for the reader to understand the context in which this research is situated. In Section 2.1, we present software engineering concepts and their relationship to the usage of empirical studies. In Section 2.2, we describe the importance of evidence-based software engineering, which is widely used by the software engineering community through systematic literature reviews, systematic mapping studies, and meta-analysis. Section 2.3 presents a brief overview of the state of the art of empirical studies in software engineering, especially controlled experiments in software engineering (Section 2.3.1) and experiments in other fields (Section 2.3.2). Section 2.4 presents important human-related factors in experimental research and ethical issues. Section 2.5 describes the importance of experimental plans for the quality of experiments (Section 2.5.1) and the quality assessment of experimental plans in software engineering (Section 2.5.2). Finally, the summary of this chapter is given in Section 2.6.

2.1 Software Engineering

Software engineering strives to come as close as possible to traditional engineering, with clear methods and standards. It is necessary to use a well-defined, goal-oriented process that accumulates past experience and turns it into knowledge. In software engineering, discipline and organization are important to ensure the success of the applications developed.

Some software engineering definitions from the literature are described below:

• The Institute of Electrical and Electronics Engineers (IEEE) [33]:

"(1) The application of a systematic, disciplined, quantifiable approach to the development, operation, and maintenance of software; that is, the application of engineering to software. (2) The study of approaches as in (1)."

• Fritz Bauer [34]:

"The establishment and use of sound engineering principles in order to obtain economically software that is reliable and works efficiently on real machines."

• Ian Sommerville [35]:

"Software Engineering is concerned with all aspects of software production from the early stages of system specification through to maintaining the system after it has gone into use."

• Pierre Bourque and Richard E. Fairley, SWEBOK [36]:

ISO/IEC/IEEE Systems and Software Engineering Vocabulary (SEVOCAB) defines software engineering as "the application of a systematic, disciplined, quantifiable approach to the development, operation, and maintenance of software; that is, the application of engineering to software." 1

Software engineering research focuses on investigating real-world events and is geared towards developing new mechanisms (technologies, processes, methods, techniques, languages, etc.) to support software activities, aiming to improve the quality of software products and increase productivity in the development process [37]. Therefore, software engineering research is concerned with investigating how these mechanisms actually work, understanding their limits, and proposing solutions. In this context, empirical methods provide consistent ways to validate software engineering phenomena, generating more accurate evidence and facilitating the transfer of new technologies to industry [11]. Also, software engineering shares characteristics with the social and behavioral sciences, because human factors can have a major impact on the software development process and on the quality of the software produced. Despite the efforts to systematize the activities of software production, software engineering is still dependent on human influence [38]. As a result, software engineering has faced several problems related to human factors, and empirical studies are important tools for investigating them in the software engineering process.

1 See www.computer.org/sevocab.

Page 31: LILIANE SHEYLA DA SILVA FONSECA · 2019. 10. 25. · Liliane Sheyla da Silva Fonseca AN INSTRUMENT FOR REVIEWING THE COMPLETENESS OF EXPERIMENTAL PLANS FOR CONTROLLED EXPERIMENTS

2.2. EVIDENCE-BASED SOFTWARE ENGINEERING 33

2.2 Evidence-Based Software Engineering

Considering that science is cumulative, software engineering can benefit from approaches that aggregate and synthesize existing empirical findings in order to obtain an overall perspective on a research topic that individual studies could not otherwise provide. In this context, we can highlight the field of evidence-based software engineering, which allows systematically collecting and evaluating all the available evidence about a specific software engineering phenomenon [11]. According to Kitchenham [39], the evidence-based approach seeks to offer different ways to group evidence from research, integrate it with human practice and values, and then apply it to the process of decision-making in the context of software development and maintenance. Thus, the evidence-based paradigm becomes an important tool for adding and building knowledge in software engineering. According to Kitchenham and Charters [40], the three main methods in evidence-based software engineering are:

Systematic Literature Review is a secondary study that uses a reliable, accurate, and auditable methodology to identify and analyze all available evidence related to a research topic. The systematic literature review is considered an impartial and repeatable method.

Systematic Mapping Study is a type of systematic literature review that is used when the research questions are broader. It aims to obtain an overview of an investigation area. Systematic mapping studies are more exploratory in character, while systematic literature reviews aim to identify, evaluate, and understand a specific field of study.

Meta-analysis is a secondary research method that seeks to synthesize information from primary studies in order to obtain more consistent aggregate results through the use of quantitative statistical techniques. Meta-analysis is typically used in systematic literature reviews in which data from several studies are integrated.
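
To make the aggregation idea concrete, a common fixed-effect approach (a standard statistical formulation, not a formula taken from this thesis) weights each study's effect estimate by the inverse of its variance:

\[
\bar{\theta} = \frac{\sum_{i=1}^{k} w_i \, \hat{\theta}_i}{\sum_{i=1}^{k} w_i},
\qquad
w_i = \frac{1}{\operatorname{Var}(\hat{\theta}_i)},
\]

where $\hat{\theta}_i$ is the effect size estimated by primary study $i$ and $k$ is the number of studies; more precise studies therefore receive larger weights $w_i$.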

These three methods (systematic literature review, systematic mapping study, and meta-analysis) are secondary research methods that collect and analyze primary studies, such as experiments, case studies, surveys, ethnographies, and action research, from the literature. Empirical software engineering is the branch of software engineering dedicated to conducting and analyzing such studies. Section 2.3 describes empirical studies in software engineering.

2.3 Empirical Studies in Software Engineering

Empirical software engineering research seeks to explore, describe, predict, and explain natural, social, or cognitive phenomena by using evidence based on observation or experience [11]. In addition, empirical studies allow us to assess software engineering technologies and determine which kinds of tools and techniques are more effective. Therefore, it is possible to build a knowledge base to support decision-making by researchers and practitioners [14]. Empirical research involves knowledge acquisition through empirical methods in order to explore and evaluate natural phenomena based on evidence obtained by systematic observation or experiment, rather than by deductive logic [11].

According to Easterbrook et al. [22], an empirical method is a set of organized principles whereby empirical data are collected and analyzed. An empirical study allows the evaluation of the studied activities in a systematic, quantifiable, and controlled way [29].

Easterbrook et al. [22] present five classes of empirical strategies:

• Controlled Experiments (including Quasi-Experiments)

A controlled experiment is an investigation of a testable hypothesis where one or more independent variables are manipulated to measure their effect on one or more dependent variables. A precondition for conducting an experiment is a clear hypothesis. The hypothesis (and the theory from which it is drawn) guides all steps of the experimental design, including deciding which variables to include in the study and how to measure them.

• Case Studies (both exploratory and confirmatory)

Yin [41] introduces the case study as "an empirical inquiry that investigates a contemporary phenomenon within its real-life context, especially when the boundaries between phenomenon and context are not clearly evident." Case studies offer in-depth understanding of how and why certain phenomena occur, and can reveal the mechanisms by which cause-effect relationships occur. A precondition for conducting a case study is a clear research question concerned with how or why certain phenomena occur.

• Survey Research

Survey research is used to identify the characteristics of a broad population of individuals. It is most closely associated with the use of questionnaires for data collection. However, survey research can also be conducted using structured interviews or data-logging techniques. The defining characteristic of survey research is the selection of a representative sample from a well-defined population, and the data analysis techniques used to generalize from that sample to the population, usually to answer questions. A precondition for conducting survey research is a clear research question that asks about the nature of a particular target population.

• Ethnographies

Ethnography is a form of research focusing on the sociology of meaning through field observation. The goal is to study a community of people to understand how the members of that community make sense of their social interactions. For software engineering, ethnography can help to understand how technical communities build a culture of practices and communication strategies that enables them to perform technical work collaboratively. The preconditions for an ethnographic study include a research question that focuses on the cultural practices of a particular community, and access to members of that community.

• Action Research

In action research, researchers attempt to solve a real-world problem while simultaneously studying the experience of solving it. While most empirical research methods attempt to observe the world as it currently exists, action researchers aim to intervene in the studied situations for the explicit purpose of improving them. A precondition for action research is a problem owner willing to collaborate both to identify a problem and to engage in an effort to solve it.

In addition to the previously cited methods, Easterbrook et al. [22] cite an approach termed mixed-methods research, where researchers adopt more than one empirical strategy in their investigation, such as a case study combined with a survey. Other research methods applicable in ESE research are discussed in the literature [15], [42], [43]: field study, simulation, grounded theory, and correlational analysis, to name a few. The choice of empirical strategy depends on the prerequisites for the investigation, the purpose of the study, the available resources, and how we would like to collect the data. Easterbrook et al. [22] provide more advice on the selection of research strategies.

Despite the optimistic prospects related to empirical software engineering, some studies show that empirical validation in software engineering research is scarce, hindering its progress as a science and delaying the adoption of new technologies [30], [9]. Furthermore, Juristo [23] highlights some arguments traditionally used to dismiss the usefulness of experimental studies, among others: software developers are not trained in the importance and meaning of the scientific method; software developers are unable to easily understand how to analyze the data of an experiment, or how it was analyzed by others, because they lack training; an immense number of variables influences software development; and empirical studies conducted to check the ideas of others are often not published. In this context, Sjoberg [11] identified a discrepancy between the number of experiments and the number of new technologies and approaches developed. According to Wohlin and Aurum [44], some factors make empirical research in software engineering particularly challenging. For example, in addition to studying technology usage, it is also necessary to investigate the social and cognitive processes that are involved in human activities [22]. Identifying the proper approaches and methods to apply in different contexts is another challenge in software engineering research. In this sense, it is crucial that researchers take well-founded decisions when choosing the right methods and references to perform a given empirical study [44].

Empirical software engineering has become an indispensable instrument for scientific advancement in software engineering, allowing us to identify accurate and reliable evidence about a given technology [30]. Therefore, empirical studies have been gaining credibility in scientific research, building knowledge to support decision-making.

In order to improve the quality of software engineering research and increase the usage of empirical strategies, it is important to spread knowledge about the mechanisms used to support researchers in their studies. In this sense, our research group carried out a systematic mapping study according to the guidelines defined by Kitchenham and Charters [40] (see Chapter 3). The main goal is to provide a catalog of the mechanisms used as references to support the planning and execution of empirical software engineering research. We focused our investigation on well-established venues of the empirical software engineering community: the International Conference on Evaluation and Assessment in Software Engineering (EASE), the International Symposium on Empirical Software Engineering and Measurement (ESEM), and the Empirical Software Engineering Journal (ESEJ). We did not consider other important venues, as we believe that studies published in the chosen venues reflect the spectrum of the ESE community and give a relevant view of the support mechanisms adopted in this area, since empirical software engineering is their main focus. The systematic mapping study aimed to identify support mechanisms, which we define as methods, processes, guidelines, tools, and other resources used to plan, conduct, and analyze empirical studies. Among our results, we observed that the three most used mechanisms, namely Wohlin et al. [29], a guideline for performing experiments in software engineering, Yin [41], a guideline for performing case studies, and Kitchenham and Charters [40], a guideline for performing systematic literature reviews in software engineering, correspond to the most adopted strategies: experiments and case studies among primary research methods, and systematic literature reviews among secondary research methods, respectively.

Furthermore, in order to identify existing guidelines and checklists in the empirical literature, including the main quality checklists and scales for assessing experiment designs or experiment reports from software engineering and the other fields mentioned in this chapter, we performed an informal literature survey. We took into account the following considerations:

1. Not just collecting guidelines and checklists from software engineering, but also from other fields. Many disciplines have been doing experimentation much longer than software engineering has, such as medicine, education, the social sciences, and psychology. Because of this, there are many important studies that should be read by anyone wanting to contribute to the study or specification of empirical methods.

2. Not just collecting published guidelines and checklists for designing experiments, but also for reporting them. Because quality assessment of primary studies is usually performed in systematic literature reviews, we searched for quality instruments for assessing controlled experiments in them. Report checklists usually contain the key factors that should also appear in the experimental plan. Although quality checklists assess the quality of published experiments, in the instrument development we selected only items that refer to elements of experiment planning.

Page 35: LILIANE SHEYLA DA SILVA FONSECA · 2019. 10. 25. · Liliane Sheyla da Silva Fonseca AN INSTRUMENT FOR REVIEWING THE COMPLETENESS OF EXPERIMENTAL PLANS FOR CONTROLLED EXPERIMENTS

2.3. EMPIRICAL STUDIES IN SOFTWARE ENGINEERING 37

3. Not just collecting guidelines and checklists for experiments, but also for other kinds of primary studies, such as case studies. Several checklists include a set of items to evaluate the quality of primary studies (e.g., experiments, case studies) in systematic literature reviews, and these items can be suitable for different kinds of primary studies at the same time. As a result, we also collected checklists addressed to other primary studies.

2.3.1 Controlled Experiments in Software Engineering

This research considers the definition of controlled experiment from Wohlin et al. [21]:

• Controlled experiment in software engineering:

"Experiment (or controlled experiment) in software engineering is an empirical enquiry that manipulates one factor or variable of the studied setting. Based in randomization, different treatments are applied to or by different subjects, while keeping other variables constant, and measuring the effects on outcome variables. In human-oriented experiments, humans apply different treatments to objects, while in technology-oriented experiments, different technical treatments are applied to different objects."

The experiment is the classical scientific method for identifying cause-effect relationships [14]. Experiments, or controlled experiments, are carried out when we want to control the situation and manipulate behavior directly, precisely, and systematically [21]. Controlled experiments involve at least one treatment, an outcome measure, units of assignment, and some comparison from which change can be inferred and attributed to the treatment [11]. Randomized (or true) experiments are characterized by the initial random assignment of subjects to experimental groups in order to infer treatment-caused change. Quasi-experiments also have treatments, outcome measures, and experimental units, but do not use random assignment to create the comparisons from which treatment-caused change is inferred.
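
As a minimal illustration of the random assignment that separates true experiments from quasi-experiments, the sketch below (our own; the subject identifiers and treatment names are hypothetical) shuffles the subject pool once and deals it into balanced treatment groups:

```python
import random

def randomly_assign(subjects, treatments, seed=42):
    """Randomly assign subjects to treatments in (near-)balanced groups."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    shuffled = subjects[:]
    rng.shuffle(shuffled)
    # Deal the shuffled subjects round-robin across the treatment groups
    return {t: shuffled[i::len(treatments)] for i, t in enumerate(treatments)}

subjects = ["S01", "S02", "S03", "S04", "S05", "S06", "S07", "S08"]
groups = randomly_assign(subjects, ["inspection_method_A", "inspection_method_B"])
for treatment, group in groups.items():
    print(treatment, group)
```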

Experiments are human-oriented or technology-oriented. In human-oriented experiments, humans apply different treatments to objects; for example, two inspection methods are applied to two pieces of code. In technology-oriented experiments, typically different tools are applied to different objects; for example, two test case generation tools are applied to the same programs. Human-oriented experiments allow less control than technology-oriented ones, since humans behave differently on different occasions, while tools (mostly) are deterministic [21].

In this research we focus on human-oriented experiments. In Section 2.4, we describe some important human-related factors in experimental research and ethical guidelines for conducting research that involves human subjects.

Another important aspect to consider is experiment replication. The replication of an experiment involves repeating the investigation under similar conditions. Replication helps us to find out how much confidence can be placed in the results of an experiment.

Pfleeger [45] divides the experiment process into six steps: conception, design, preparation, execution, analysis, and dissemination and decision-making. Juristo and Moreno [23] divide experimentation into the following activities: definition of the objectives of the experimentation, design of the experiments, execution of the experiments, and analysis of the results/data collected from the experiments. Wohlin et al. [21] present the experimentation process in five steps: experiment scoping, experiment planning, experiment operation, analysis and interpretation, and presentation and package. Figure 2.1 illustrates the merge of the experiment process overviews from the three groups of authors mentioned above; each step is summarized below.

Figure 2.1: Overview of the experiment process

1. Study Definition: In this step, the objective and goals of the experiment must be defined. The GQM template [46] is suggested for formulating the goal (analyze <object of study> for the purpose of <purpose> with respect to <quality focus> from the point of view of <perspective> in the context of <context>).

2. Planning: The objective of this step is to plan the entire experiment. The researchers must define the formal hypotheses, the variables, the measurement scales, where the experiment will be performed, what data will be collected, the experimental design, how the data will be analyzed, what experimental materials will be used (including instructions and guidelines), what the objects are, and who and how many the participants are, among other things.

3. Preparation: The objective of this step is to produce and prepare all the material, such as questionnaires, interview protocols, websites, tools, and documents, that is required to conduct the experiment according to the experimental plan. Usually, a pilot study is performed to find and adjust any deficiencies in the material prepared in the planning step.

4. Execution: The goal of this step is to carry out the experiment according to the experimental plan and to collect data.

5. Analysis and Interpretation: The objective of this step is to analyze and interpret the collected data to answer the goals of the experiment. The data analysis follows the methods and statistical tests defined in the planning step (see the sketch after this list).

6. Presentation and Package: In this last step, the findings of the experiment are reported to the external community.
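
To make the analysis step concrete, the following sketch (our illustration; the outcome data, the test choice, and the significance level are hypothetical and would be pre-specified in the planning phase) compares two treatment groups with an independent-samples t-test:

```python
from scipy import stats

# Hypothetical outcome measure (e.g., defects found) per participant per group
group_a = [12, 15, 11, 14, 13, 16, 12, 15]
group_b = [10, 11, 9, 12, 10, 13, 11, 10]

# Two-sided independent-samples t-test, as might be defined in the plan
t_stat, p_value = stats.ttest_ind(group_a, group_b)

alpha = 0.05  # significance level fixed during planning, not after seeing data
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```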

2.3.2 Experiments in other fields

The experimental literature from other fields has influenced the experimental software engineering area. For example, the most used guidelines for performing experiments in software engineering, Wohlin et al. [29] and Juristo and Moreno [30], based their threats-to-validity sections on studies from the social sciences, such as Judd [47] and Campbell and Stanley [48], and their experiment design sections on studies from the statistics field, such as Montgomery [49] and Box and Hunter [50]. In addition, Kitchenham et al. [24] derived widely used guidelines for performing empirical research in software engineering from medical research. In this section, we do not intend to discuss experiments in other fields exhaustively, since the number of examples is enormous. However, we intend to give a brief overview of how experiments work in other fields and to give some suggestions of experimental readings from other areas.

Experiments take different forms depending on the area in which they are carried out. For example, in the physical and engineering fields, experiments are widely used to test the hypotheses of new theories under specific conditions. Replication is an important step of their experimentation processes: it is common to replicate experiments several times under the same conditions in order to obtain the same results in each replication. Random assignment, however, is usually not common in these sciences.

In the medical field, experimental medical science performs experiments as clinical trials, where experimental units, usually individual human beings, are randomly assigned to a treatment, and one or more outcomes are assessed [51].

Experimental psychology is focused on the analysis of human behavior. Therefore, the ethical and methodological issue of treating human research participants fairly is one of the main concerns in the area. Although random selection of subjects is not always performed, randomization is widely used in psychology experiments to increase the representativeness of samples. The crucial randomization procedure is the random assignment of participants to test conditions [52].

In the social sciences, the external validity of experiments is one of the main concerns, because it is difficult to generalize experimental results to groups that were not included in the study. Although this problem is not particular to social science experiments, in facing it the social sciences developed relevant work on threats to validity, addressing the main threats in experimental and quasi-experimental designs, such as Campbell and Stanley [48], Judd [47], Cook and Campbell [53], and Shadish et al. [54].

Although replication is important and is commonly performed in experimental medical science, psychology, the social sciences, and education, these fields do not run series of replications of each experiment as in the empirical physical and engineering fields; instead, they commonly perform several studies that are assembled through secondary research methods such as systematic literature reviews, systematic mapping studies, and meta-analysis.

The statistics field has also contributed to experimental software engineering. Montgomery [49] and Box and Hunter [50] are classic books on experimental design and analysis, which provide extensive information on statistics and ways to help researchers design and analyze experiments to improve their quality, efficiency, and performance.

Because experimental software engineering derives from other experimental fields, there are many important studies that should be read by anyone wanting to contribute to the study or specification of experimental methods in software engineering. For more details on threats to validity, we suggest the textbooks by Shadish et al. [54], Cook and Campbell [53], Judd [47], and Campbell and Stanley [48]; for more details on the methodological characteristics of experiments, see Martin [52], [55], [56], Cox [57], Greenwood [58], [59], Slavin [60], and Light et al. [61].

2.4 Important Human-Related Factors in Experimental Research and Ethical Issues

Experimental studies seem appropriate to promote the advancement of software engineering, especially considering human factors, because, unlike physics, most of the technologies and theories in software engineering are human-based, and so variation in human ability tends to obscure experimental effects. One of the many problems that arise in experimental software engineering research is the difficulty of assessing a new product without involving humans. In human-based experiments, humans are directly linked to the application of different treatments to objects. This scenario reduces the control of the experiment because of human aspects such as the different levels of participants' expertise, the ability of humans to learn over time, the tendency of participants to guess what the experimenter expects, and the particular motivations of subjects for participating in experiments, among others. These and other influences and threats need to be carefully thought through and addressed during experiment planning; for example, how participants are recruited and treated, in order to avoid bias in the study and to assure the protection of participants from undue harm. Language and cultural differences should also be considered [62].


In addition, any empirical research activity that involves human participants must consider ethical aspects [21]. Like software engineering, other fields are also concerned with human behavior. For example, psychology has human research ethics codes, which present ethical principles for conducting research with human participants [63], [64]. In medicine, Johns Hopkins University has a guideline for training individuals who will participate in some aspect of a human subject research interaction or intervention [65]. The social sciences also have procedures to protect human participants in their studies [66]. Human factors in computer science have been a concern for some time [67], [9], [68]. In software engineering, Singer and Vinson [69], [70] started the discussion about ethical issues in empirical studies. Recent studies have also drawn attention to the lack of practical methodological guidance for designing experiments with participants [17] and to the consideration of human factors in empirical software engineering [71].

There are four principles common to codes of ethics [72], [63], [64], [65], [66]: informed consent, beneficence, confidentiality, and scientific value.

Informed consent: Before the study starts, human subjects must have enough information about the study, and they have the right to choose whether or not they want to participate. Informed consent has four main elements: disclosure, comprehension, voluntariness, and decision.

• Disclosure: The investigator must provide human subjects with the information needed to decide whether to participate, including the purpose of the research, the procedure, risks and benefits, the voluntary nature of participation, and an offer to answer questions.

• Comprehension: The investigator must present the information so that subjects are able to make a rational, informed decision, especially in cases involving impaired subjects or children.

• Voluntariness: Participation must be free of coercion and undue influence, and subjects must have the option to withdraw from the study at any time without any harm to themselves.

• Decision: Subjects must give active consent to their decision to participate in the study.

Beneficence: The benefits of the study must outweigh its risks for individual subjects, groups of subjects, and organizations. Investigators should minimize harms, including tedium, loss of dignity, stress, and financial losses. In software engineering, breaches of confidentiality are a typical risk.

Confidentiality: Human subjects have the right to expect that any information they share with researchers will remain confidential, covering data privacy, data anonymity, and anonymity of participation. When data is examined, the investigator must not reveal the identity of participants and organizations. Also, investigators must disclose the level of anonymity before the study starts.


Scientific value: The study should be methodologically valid; that is, its results should reflect reality.

For more details, Vinson and Singer [72] provide a practical guideline for conducting ethical research involving humans.

2.5 Experimental Plans

This section presents the importance of experimental plans for the quality of experiments and the quality assessment of experimental plans.

2.5.1 Importance of Experimental Plans

An experimental plan is a document where experimenters define specific procedures and directions to be used in conducting an experiment. In other words, an experimental plan is a protocol that describes, step by step, the important elements of the experiment. It is used to carry out the experiment and analyze its results.

In a cooking recipe, the measures and items are detailed, and the order and way in which the ingredients will be used are carefully described. For example, if someone wants to bake a cake, they read the recipe and do what the recipe specifies by following the directions carefully. They do not have to ask any questions, and the recipe should be precise and exact. However, if a step in the recipe is missing, someone else will have a hard time figuring out what to do. In addition, the recipe should contain information or advice about temperature and ways to proceed. Similarly to recipes for a cake, experimental plans are used as a recipe for the experiment [18]. An experimental plan should provide all the information that is necessary to conduct and replicate the study and integrate it into the empirical software engineering body of knowledge [20]. Experimental plans should contain the essential elements for carrying out the experiment [73], [21], [23], [24], [20].

A plan may start with the goal to accomplish, the question the experimenter is trying to answer. Then it requires thinking about the type of data the experimenter wants to collect, including which people to talk to and which software project to look at. Issues such as the kind and size of the sample, the type of treatments, the metrics to calculate, and the statistical tests to use are probably the most important considerations, because the experimenter has to answer some complex questions. The plan should also include the rationale behind every decision. For example, suppose that three different statistical methods are possible but the experimenter chooses only one; the experimental plan should explain why the other two were not chosen. The experimental plan should also include who is involved, the kind of assignments, the timeline, any resources that are necessary, and the threats to validity.
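
As a minimal sketch of this idea (ours, not the instrument proposed in this thesis; the element names are illustrative), an experimental plan can be treated as a structured document whose completeness is checked against a list of required elements:

```python
REQUIRED_ELEMENTS = [
    "goal", "research_questions", "hypotheses", "variables", "treatments",
    "participants", "design", "instrumentation", "analysis_procedure",
    "threats_to_validity", "schedule", "rationale",
]

# A (deliberately incomplete) hypothetical plan for a two-treatment experiment
plan = {
    "goal": "Compare inspection methods A and B for defect detection",
    "hypotheses": "H0: no difference in defects found; H1: A > B",
    "participants": "24 graduate students, convenience sample",
    "design": "Between-subjects, random assignment, one factor, two treatments",
    "analysis_procedure": "Independent-samples t-test, alpha = 0.05",
}

# Flag every required element that is absent or empty
missing = [e for e in REQUIRED_ELEMENTS if not plan.get(e)]
print("Missing elements:", missing)
```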


Making certain decisions before an experiment is conducted, and documenting those decisions in an experimental plan, can minimize costs, improve internal validity, and reduce bias. In addition, it is quite difficult to assess the quality of an experiment by assessing its report. Access to experimental plans allows researchers to assess the internal validity of a study, facilitating the selection criteria for systematic reviews or meta-analyses [39], [24]. Internal validity (e.g., how adequately the experiment is planned, executed, and analyzed) reflects the degree to which the experimental results are accurate, and accuracy is determined by the minimization of the risk of bias [19]. The way to plan a good experiment is to make design decisions that reduce bias as much as possible before the experiment is done. Bias is a tendency to produce results that depart systematically from the "true" results [39]; that is, flaws in planning, conducting, and analyzing experiments lead to bias [74]. Therefore, unbiased results are internally valid [39]. The literature from medicine [75], [76], [77] presents different kinds of bias. Kitchenham et al. [39] adapted these definitions of bias to software engineering; see Table 2.1.

Table 2.1: Type of Bias [39]

    Type                                Definition
    Selection bias (Allocation bias)    Systematic difference between comparison groups with respect to treatment.
    Performance bias                    Systematic difference in the conduct of comparison groups apart from the treatment being evaluated.
    Measurement bias (Detection bias)   Systematic difference between the groups in how outcomes are ascertained.
    Attrition bias (Exclusion bias)     Systematic differences between comparison groups in terms of withdrawals or exclusions of participants from the study sample.

Many researchers face difficulties in producing experimental plans in software engineering. In the qualitative study described in Chapter 4, experts reported that some of them do not know when they have a complete experimental plan, that is, when their experimental plan is ready for running the experiment. Another difficulty, usually faced by beginners planning their first experiments, is their lack of experience. Poor mentoring, or a lack of mentoring by someone who has done experiments before, is also a difficulty faced by novice researchers. In addition, because of the limited time available to plan experiments, some inexperienced researchers start executing an experiment without thinking carefully about design issues such as human factors, the choice of representative materials and participants, the experimental design, and procedures, among others, thus introducing bias into the experimental plan. For example, choosing participants who are not suitable for the experimental tasks can generate false (positive or negative) results; that is, the obtained results are not real because the experimental tasks were not performed by the most appropriate human subjects. As another example, mistakes in randomizing and properly assigning participants to treatments can generate inconclusive or invalid results. These factors can complicate the execution of controlled experiments involving human subjects even more, because of the variety of ethical concerns and research biases involved.

Although there are many problems associated with experimental plans, assessing their completeness is one of the most common issues faced by researchers, because an incomplete plan may increase the cost of the experiment and may also generate invalid experiment results. Also, forgetting important elements and making changes in the controlled experiment after completing the planning phase make the experiment difficult to control. It is also important to avoid planning experiments that are bigger and more expensive than necessary to achieve their goals; instead, experimenters should seek the best trade-off between achieving the goals of the experiment and minimizing costs and bias.

2.5.2 Quality Assessment in Experimental Plans and Experiment Reports

It is important not to confuse experimental plans with experiment reports. They serve different purposes, each with its own importance. While experimental plans are documents that explain, step by step, what will be executed in the experiment, experiment reports are documents that describe how the experiment was conducted. Experimental plans are recipes for experiments and are completed before the experiment occurs, whereas experiment reports are written after the experiment. They are thus produced in different phases, although experiment reports should use information from experimental plans.

Quality refers to the extent to which the design, conduct, and analysis of the primary studies are likely to prevent systematic errors or bias [74]. Assessing the quality of experiments only from experiment reports generates several problems, because journal articles and, in particular, conference papers rarely provide enough detail about the methods used, due to limitations of space in journal volumes and conference proceedings. Although Shull and co-authors published two papers, in 2002 [78] and 2004 [79], that described a solution to this problem of space when reporting experiments in journal articles, there is a risk that what is being assessed is the quality of reporting rather than the quality of the research [12]. However, assessing the quality of experiments also from experimental plans could improve the quality assessment of experiments involving human subjects in software engineering, because we can reduce bias before the experiment is carried out, thus minimizing its cost and increasing its quality. Also, in this phase, problems detected in experimental plans can be solved and adjusted without major issues.

There are many quality assessment tools used in empirical fields to assess the quality of primary studies in systematic reviews, replications, and meta-analyses through experiment reports [12]. We highlight some of those widely used in software engineering, including Dyba and Dingsoyr [80], Dieste et al. [19], Jedlitschka et al. [81], Kitchenham et al. [82], Kitchenham et al. [40], and Kitchenham et al. [83] for experiment reports, Host and Runeson [84] for case study reports, and Wieringa [85] for experimental and case study research designs and reports. From other fields, we highlight Jadad et al. [86], one of the most used scales for assessing randomized controlled trials in medicine. Other tools are also widely used to assess the quality of study reports in medicine [87], [88], [89], [90], [91], [92], [93], [94], [95], [96], [56], education [97], and psychology [98]. Other checklists are focused on assessing the statistical content, such as Gardner et al. [99] and Jeffers [100]. In addition, CASP (Critical Appraisal Skills Programme) 1 presents a set of eight checklists for appraising different types of health research, including systematic reviews, randomized controlled trials, cohort studies, case control studies, economic evaluations, diagnostic studies, qualitative studies, and clinical prediction rules.

However, there are currently no mechanisms for assessing or reviewing experimental plans in software engineering. The closest ones are the instruments available for assessing experiment reports, but they are not suitable because there are specific issues related to the planning stage that they do not directly address. Nevertheless, the experimental guidelines, tools, and checklists for assessing the quality of primary studies mentioned above can be used to build a set of important elements that should be contained in experimental plans.

2.6 Chapter Summary

In this chapter we discussed important concepts in the software engineering area, including evidence-based software engineering, empirical studies, and especially controlled experiments. Several relevant studies from software engineering as well as from other fields, such as medicine, psychology, and education, were discussed. We also discussed the importance of assessing experimental plans in order to improve the quality of experiments, and we addressed the difference between experimental plans and experiment reports in relation to the quality assessment of experiments. With this knowledge, the reader will be able to understand the context in which this research is situated. In the next chapter, a systematic mapping study is presented to identify support mechanisms used to conduct experiments.

1 http://www.casp-uk.net/checklists


3 Support Mechanisms to Conduct Experiments in Software Engineering: a Systematic Mapping Study

This systematic mapping study is part of a larger project of our experimental research group at UFPE. We performed a systematic mapping study to identify the support mechanisms (processes, tools, guidelines, etc.) used to plan and conduct empirical studies in the empirical software engineering community. It included all full papers published at EASE, ESEM, and ESEJ since their first editions. We analyzed 1,098 papers, among primary, secondary, and tertiary studies, and found a total of 362 support mechanisms. The initial results from this systematic mapping study were published at ESEM 2014 [26] and EASE 2015 [27]. We updated the systematic mapping study with papers published in 2014 and 2015, in order to identify the newest mechanisms and to keep the research data current. The updated results were submitted to the Empirical Software Engineering Journal. One of the main contributions of this systematic mapping study update was to provide a catalog of support mechanisms available to the software engineering community interested in empirical studies. With the catalog, it is possible to know which resources are being used as references to plan and support different kinds of empirical studies and in which contexts they are applied. For example, it is possible to select the mechanisms used to design controlled experiments, to perform qualitative analysis, to define research goals, or to plan research replications. The catalog is available at https://goo.gl/nOj4Tu. In this study, we consider a support mechanism to be any resource used as a reference in the planning or execution of an empirical study. Overall, a mechanism can be characterized as a book, a scientific paper, or software. Thus, we consider as resources any mechanisms used to support empirical research in software engineering, such as processes, guidelines, tools, and others. We performed a systematic mapping study instead of a systematic literature review because of the exploratory nature of our research.

Although our systematic mapping study collected support mechanisms for conducting empirical studies in general, this Ph.D. thesis is focused on experimental software engineering; that is, we are interested in investigating the experimental field regarding the support mechanisms used to conduct experiments. In this chapter, we report the part of the systematic mapping study results regarding experiments and general mechanisms that support experimental activities. The complete results of our systematic mapping study can be seen in previously published studies [26], [27] and, hopefully soon, in the Empirical Software Engineering Journal. This chapter presents results regarding research question 1 of this thesis.

This chapter is organized as follows: Section 3.1 describes the research method of the systematic mapping study, including research questions, search strategy, data extraction, and data analysis. Section 3.2 reports the results, which answer our first research question. Section 3.3 discusses the findings, and Section 3.4 presents the threats to validity of this study. Finally, Section 3.5 presents the chapter summary.

3.1 Method

The systematic mapping study procedure was divided into four steps: research questions, search strategy, data extraction, and data analysis. The protocol used to guide this research was based on the guidelines defined by Kitchenham and Charters [40].

3.1.1 Research Question

The research question was defined based on the scope of this Ph.D. research, in order to provide an overview of the support mechanisms for experiments reported in EASE, ESEM, and ESEJ.

! Which are the most commonly adopted mechanisms to support experiment planningin software engineering research?

3.1.2 Search Strategy

Because the systematic mapping study intends exclusively to analyze the articles published in the empirical software engineering community, we decided to adopt only manual searches in the venue databases. Thus, it was not necessary to use search strings. We decided not to include other sources in order not to compromise the methodological rigor and the quality of the data extraction and analysis, since over a thousand articles were collected from EASE, ESEM, and ESEJ.

Initially, the studies were collected from the website of each venue: EASE (http://goo.gl/SOcPOj), ESEM (http://goo.gl/q4hgfF), and ESEJ (http://goo.gl/nQchhp). Some studies were collected from scientific digital libraries (IEEE, ACM, Springer Link, Science Direct, and Scopus). We collected all full papers published in EASE between 1997 and 2015, in ESEM between 2002 and 2015, and in ESEJ between 1996 and 2015. As a matter of fact, regarding ESEM, from 2002 to 2006 we collected studies from ISESE (International Symposium on Empirical Software Engineering), which joined with METRICS (International Software Metrics Symposium) in 2007, resulting in the ESEM conference.

Considering that this study intends to analyze some aspects of the empirical software engineering community, almost all papers were included; however, we applied three exclusion criteria: (1) short papers, (2) non-technical studies (tutorials, keynotes, industrial presentations, etc.), and (3) duplicate papers. We decided to exclude short papers and non-technical studies because they, in general, present in-progress research. Besides, those studies may either not follow a strict empirical strategy or not have enough space for an in-depth specification of their strategy. In cases of duplicate articles, we excluded the older and/or less complete version, unless it had additional data.

3.1.3 Data Extraction

After collecting and selecting the studies from EASE, ESEM, and ESEJ, we initiated the data extraction process. Eight researchers participated in this process: four PhD and four MSc students. Data extraction is an error-prone activity, so in order to avoid mis-extraction, each paper was read by at least two researchers. The involved researchers were divided into pairs, each comprising one PhD and one MSc student.

Before the actual data extraction, we performed an extraction pilot. Ten papers were randomly selected from the study set, and all participants performed the data extraction on these papers. After that, a meeting was organized in order to resolve conflicts and mitigate mistakes. This pilot was necessary to calibrate the extraction instrument, to reinforce the extraction strategies, and to avoid misunderstandings among the participants. For example, in some articles, divergences in the definition of the type of empirical strategy were observed. The pilot extraction also allowed us to analyze the effort necessary to evaluate each paper, which made it possible to plan the data extraction steps.

We used a spreadsheet (see Table 3.1) for extracting the data. During this process, papers were analyzed focusing on the abstract, introduction, methodology, results, and conclusion. In some cases, a meticulous reading of the paper was necessary. This process was organized in cycles in order to mitigate errors. Each cycle lasted two weeks, in which each pair was responsible for analyzing twenty papers. By the end of each cycle, the teams compiled the results. In addition, after each cycle, a meeting with all participants was organized to resolve the remaining conflicts, evaluate the current state, and plan the next steps of the extraction process. A senior researcher supervised the process, which lasted six months.

In order to avoid subjectivity in the data extraction, we extracted the information exactly as the authors mentioned it in the paper. In principle, all conflicts were discussed and resolved internally by the pairs. However, if there was no consensus, these conflicts were discussed with all participants in the general meetings. Due to all the actions performed to avoid mistakes, we had a high level of agreement between researchers. Just a few divergences needed to be discussed in general meetings.

To formalize the accuracy and reliability of the extracted data, we assessed the agreement level between the researchers in each pair through the Kappa coefficient of agreement [101]. To evaluate the concordance of each pair, we analyzed the data extracted by each researcher, considering only the data related to the identified mechanisms and the empirical strategy type. Analyzing the reliability values of all pairs together, the Kappa coefficient of agreement was 0.687. According to the interpretation suggested by Landis and Koch [102], our extraction process reached the second highest level of agreement ("substantial"), which suggests that the results of this study are reliable.
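To make the agreement analysis concrete, the following sketch computes Cohen's Kappa for one hypothetical pair of extractors and maps the value onto the Landis and Koch scale; the label lists are invented for illustration and are not our actual extraction data.

```python
# A minimal sketch of the pairwise agreement analysis using Cohen's Kappa.
# The two label lists below are invented for illustration; our actual
# analysis used the classifications extracted by each researcher in a pair.
from sklearn.metrics import cohen_kappa_score

rater_phd = ["Experiment", "Case Study", "Survey", "Experiment",
             "Not Identified", "Experiment"]
rater_msc = ["Experiment", "Case Study", "Survey", "Case Study",
             "Not Identified", "Experiment"]

kappa = cohen_kappa_score(rater_phd, rater_msc)

# Interpretation scale suggested by Landis and Koch [102]
scale = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
         (0.80, "substantial"), (1.00, "almost perfect")]
level = next((name for upper, name in scale if kappa <= upper), "poor")
print(f"Kappa = {kappa:.3f} ({level} agreement)")
```

For a value such as our 0.687, this mapping yields "substantial", the second highest level in the scale.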

In the final stage of extraction, two researchers were responsible for integrating the final spreadsheets from all teams. Data standardization was needed because, even with an extraction pilot to avoid misunderstandings, we found deviations from the expected results, for instance, in the bibliographical reference format of the found mechanisms. Besides, we found some information out of the standard in author and institution names. The result of this process was serialized in a spreadsheet with all the data extracted in this systematic mapping study.

The instrument used for data extraction is described in Table 3.1. Each research question motivates different aspects of the data extraction. In particular, the instrument was a spreadsheet, where each field represents a piece of information that had to be extracted. In order to organize the data extraction, each article received a unique identifier (PS01 to PS1098). Besides, the support mechanisms also received unique identifiers: SM01, SM02, and so on.

We consider a support mechanism to be any resource cited as a reference to plan and guide empirical strategies, as discussed in Section 3.1.1. In this sense, we did not count mechanisms used only for the specific domain of the paper's topic. For instance, consider a study that performs a case study in an agile project [103], using one guideline to support the case study and another to support the agile methodology. In this example, we only extracted the case study guideline as a mechanism, since the guideline for the agile methodology is specific to the study's domain.

Table 3.1: Data extraction instrument

Information | Description
General Information | Title; Authors; Research Institution; Year; Publication Vehicle
Support Mechanism (for each one) | Bibliographical Reference; Mechanism Domain
Empirical Method | Type: Experiment, Case Study, Survey, Ethnography, Action Research, Systematic Literature Study, Mixed Methods, Others, or Not Identified

As discussed in Section 3.1.3, all extracted pieces of information had to correspond strictly to the authors' words in the paper. Therefore, the empirical strategy classification was defined based only on the paper's content. The mechanism domain refers to the empirical activities in which it was applied, for instance, experiment planning, qualitative data analysis, data collection, and statistical analysis.

The empirical strategy classification adopted in this work is provided by Easterbrook et al. [22]. These authors define an empirical strategy as a set of organizing principles around which empirical data are collected and analyzed. Besides, we considered systematic literature studies as empirical strategies. Studies that adopted more than one empirical research strategy were classified as “Mixed Methods” [22].

An important observation is the definition of “Others” and “Not Identified” in the classification presented in Table 3.1. We classified a study as “Not Identified” when the authors did not state the empirical strategy employed. An example of “Not Identified” is Jørgensen's work [104], in which the author performs an analysis of a dataset but there is no explicit reference in the paper to which empirical strategy was adopted. Studies that specify an empirical strategy that does not match any of the strategies presented in Table 3.1 were classified as “Others”. Some examples of empirical methods cited in the papers classified as “Others” are: focus group, grounded theory, cross validation, and qualitative study.
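The classification rules above can be read as a small decision procedure. The sketch below encodes them; the set of known strategies mirrors Table 3.1, and the function name is ours, purely illustrative.

```python
# A sketch of the classification rules described above. The set of known
# strategies mirrors Table 3.1; the function name is ours, for illustration.
KNOWN_STRATEGIES = {
    "Experiment", "Case Study", "Survey", "Ethnography", "Action Research",
    "Systematic Literature Study",
}

def classify(stated_strategies):
    """Classify a paper from the strategies its authors explicitly state."""
    if not stated_strategies:       # no empirical strategy stated at all
        return "Not Identified"
    if len(stated_strategies) > 1:  # more than one strategy adopted
        return "Mixed Methods"
    strategy = stated_strategies[0]
    # e.g. focus group, grounded theory, cross validation, qualitative study
    return strategy if strategy in KNOWN_STRATEGIES else "Others"

print(classify([]))                        # Not Identified
print(classify(["Focus Group"]))           # Others
print(classify(["Experiment", "Survey"]))  # Mixed Methods
```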

3.1.4 Data Analysis

In this step, the collected data were organized into descriptive statistics, i.e., tables and graphs, which allow better visualization and analysis. Since the amount of extracted information was large, we developed a tool to automate the processing of the data from the spreadsheets. This tool performed the counting and graph plotting. The source code of our tool is available online (https://goo.gl/r2sdjT).
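As an illustration of the kind of summarization the tool automated, the sketch below counts studies per empirical strategy from the extraction spreadsheet and plots the result; the file name and column names are hypothetical placeholders, not the actual schema of our spreadsheet.

```python
# A minimal sketch of the summarization our tool automated. The CSV file
# name and the column names are hypothetical placeholders.
import pandas as pd
import matplotlib.pyplot as plt

extraction = pd.read_csv("extraction_spreadsheet.csv")  # one row per paper

# Count the papers per empirical strategy and the most cited mechanisms
by_strategy = extraction["empirical_strategy"].value_counts()
by_mechanism = extraction["support_mechanism_id"].value_counts()
print(by_strategy)
print(by_mechanism.head(10))

# Plot the distribution of empirical strategies
by_strategy.plot(kind="bar", title="Studies per empirical strategy")
plt.tight_layout()
plt.savefig("strategies.png")
```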

3.2 Results

In this section, we present the main results of this research. Section 3.2.1 presents an overview of the 1,098 studies analyzed, and Section 3.2.2 outlines the catalog of support mechanisms for experiments.

3.2.1 Overview of the Systematic Mapping Study

In this section, we present the results of the search and selection process, the distribution of full papers by year, the most active authors, and the geographic and institutional distributions, regarding the entire systematic mapping study.


Figure 3.1: Results of search and selection process

Figure 3.1 shows how many papers were collected and selected. As mentioned in Section 3.1.2, we gathered the studies from the conference websites and also from some search engines (IEEE, ACM, Springer Link, etc.). When a paper was not available for download, we contacted the authors. However, even with these efforts, we were not able to retrieve 15 papers: 14 studies from early editions of EASE and one study from ESEM.

As shown in Figure 3.1, 1,634 studies were collected from the chosen scientific venues. We applied the exclusion criteria (Section 3.1.2); therefore, 311 short papers, 207 non-technical studies, and 18 duplicated studies were excluded. In general, the duplicated studies were the ones approved at the EASE or ESEM venues whose later versions were published in ESEJ, which are the ones we included. Finally, 1,098 studies remained (67% of the collected papers): 257 (23%) from EASE, 420 (38%) from ESEM, and 421 (39%) from ESEJ. The metadata of each of the 1,098 studies (title, authors, year, publication vehicle, and research institution) are available at https://goo.gl/Z918SA.
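The selection funnel in Figure 3.1 can be verified with simple arithmetic; the sketch below reproduces the counts quoted above.

```python
# Arithmetic check of the selection funnel reported above.
collected = 1634
excluded = {"short papers": 311, "non-technical": 207, "duplicates": 18}

remaining = collected - sum(excluded.values())
assert remaining == 1098  # the number of selected studies

# Share of each venue among the selected studies
venues = {"EASE": 257, "ESEM": 420, "ESEJ": 421}
assert sum(venues.values()) == remaining
for venue, count in venues.items():
    print(f"{venue}: {count} studies ({count / remaining:.1%})")
```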

Figure 3.2: Distribution of full papers by year


Figure 3.2 presents the distribution of the studies by publication year. All selected studies were published between 1996 and 2015, and 70% (771 of 1,098 studies) were published in the last ten years, which evidences a growing interest in empirical software engineering research in the last decade. Another fact that confirms the increasing importance of empirical studies in software engineering is that the last two years (2014 and 2015) had the highest numbers of publications, 115 and 92 studies, respectively.

Analyzing the publications of each vehicle individually, we can observe some facts. Since its first edition, EASE has shown a mild oscillation in its growth, with a mean of 14 papers per year. On the other hand, ESEM has had a larger number of publications since its beginning; in the last ten editions of ESEM, we can observe a mean of 30 papers per year. Finally, considering ESEJ, in the last decade its number of publications has been almost constant and high, with a mean of 29 publications per year. We can highlight the last two editions of ESEJ, with 53 and 52 publications, respectively.

Figure 3.3: Most active authors

In our mapping, we identified 1,987 authors that had at least one study published at these sources. About 61% of them (1,217 authors) published at least two papers. Figure 3.3 presents the authors that had the most studies published at those venues. Claes Wohlin, Barbara Kitchenham, and Emilia Mendes are the most active authors; they published 37, 36, and 36 studies, respectively.


Figure 3.4: Geographic distribution

We also performed a geographic analysis of our data, through the countries of the institutions (Figure 3.4). Our findings show that 53 different countries were involved. A large share of the studies (494 studies, 45%) were developed in collaboration among two or more researchers from institutions located in different countries. Most of the studies come from the United States (260 publications) and the United Kingdom (146 publications). Considering the size of the American software engineering industry and software engineering research community, this high concentration of research is expected.

Figure 3.5: Institution distribution

Figure 3.5 depicts the institutions that contributed the most published studies to the empirical software engineering community. The most productive institutions were the Blekinge Institute of Technology (48 studies) and Lund University (41 studies), both located in Sweden. A surprising finding was that the USA has only one institution in the top five, despite being the country with the highest number of empirical studies. This emphasizes a distribution of the empirical studies among the USA's institutions: in addition to California (33 studies) and Maryland (30 studies), there are several other institutions with substantial participation.


Finally, we can note some lessons learned from our systematic mapping study. As mentioned in Section 3.1.3, during the data extraction process we performed a pilot study. This procedure allowed us to calibrate the extraction instrument, reinforce the extraction strategies, and avoid misunderstandings among the participants. In fact, this is a procedure that we strongly suggest be performed in any kind of systematic literature study, since it mitigates bias among the participants early. We can also highlight that the high level of agreement between researchers (Section 3.1.3) may be due to our decision to extract the information exactly as mentioned by the papers' authors. The experience gained in this research also suggests that, in research involving many articles and researchers, performing the extraction process in cycles allows a constant realignment of the understanding of the research information and most likely mitigates possible bias in the data extraction.

3.2.2 Catalog of Support Mechanisms

This section presents the catalog of support mechanisms. Through our complete systematic mapping study, we identified 362 mechanisms used as references to plan and conduct empirical strategies in software engineering. As noted in Section 3.1.3, all mechanisms received a unique identifier, assigned according to the number of citations. For instance, SM01 (Support Mechanism 01) stands for the most cited resource. Thus, throughout this section, the support mechanisms will be cited by their IDs. The catalog of support mechanisms is available through a website (https://goo.gl/nOj4Tu). The mechanisms can be accessed according to the empirical activities in which they are applied. For example, it is possible to select the mechanisms used to design a controlled experiment; if the researcher selects this option, the catalog shows a list of the mechanisms that were used as references to plan and design experimental research. If a researcher needs references for specific empirical activities, such as planning research replications or performing qualitative data analysis, these can also be queried through the catalog. The categories of empirical activities are: Design of Research Method (experiment, case study, survey, ethnography, action research, systematic study, field study, meta-analysis, or grounded theory), Multi-Method Approach, Data Collection, Qualitative Data Analysis, Statistical Analysis, Research Validation, Research Replication, Quality Evaluation, Report Research Results, and Research Goals Definition.
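To illustrate how such a catalog query works, the sketch below filters a small in-memory version of the catalog by empirical activity; the record schema is our own simplification of the website's fields, with two of the identified mechanisms as sample data.

```python
# A sketch of the catalog query described above, over a simplified
# in-memory representation. The record schema is our own illustration
# of the fields shown on the catalog website.
catalog = [
    {"id": "SM01",
     "title": "Experimentation in software engineering: an introduction",
     "activities": {"Design of Research Method", "Data Collection",
                    "Qualitative Data Analysis", "Research Validation",
                    "Research Replication"}},
    {"id": "SM09",
     "title": "Qualitative methods in empirical studies of software engineering",
     "activities": {"Qualitative Data Analysis"}},
]

def mechanisms_for(activity):
    """Return the IDs of the mechanisms that support a given activity."""
    return [m["id"] for m in catalog if activity in m["activities"]]

print(mechanisms_for("Qualitative Data Analysis"))  # ['SM01', 'SM09']
print(mechanisms_for("Research Replication"))       # ['SM01']
```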

As an example, Figure 3.6 shows the profile page of the mechanism SM08. We can see that the catalog comprises general information about each support mechanism (title, description, author, year, and place of publication). Besides, the IDs of the papers that cited the mechanism are provided, as well as the empirical methods in which it has been used. The catalog also describes the empirical activities that the mechanism supports, its format (book, scientific article, or software), and the origin area of the mechanism (software engineering, healthcare, social science, education, business, statistics, or multidiscipline). Finally, a link to access the mechanism is provided.


Figure 3.6: Profile page of the mechanism SM08

In the following sections, we present a subset of these resources for experiments. Initially, we present the most cited support mechanisms that can be used in experiments (Section 3.2.2.1). Then, we present the main mechanisms for conducting experimental activities (Section 3.2.2.2). In Section 3.2.2.3, we present general support mechanisms that, although they do not address a specific empirical method, can be used in experimental studies, such as mechanisms for data analysis, qualitative research, and threats to validity. Finally, Section 3.2.2.4 presents an overview of the experimental software engineering field, including the source areas of the support mechanisms for conducting experiments, the evolution of experiments over the years, and the concerns about “not identified” empirical strategies in studies. The references of the support mechanisms presented in this chapter and the list of the 393 experimental studies are presented in Appendix A.

3.2.2.1 The Most Used Support Mechanisms

Regarding the complete systematic mapping study, experiment was the empirical strategy with the largest number of specific resources to guide its activities. We identified 43 mechanisms to support experiments, and we collected 393 studies classified as controlled experiments (133 studies), experiments (245 studies), and quasi-experiments (15 studies). This distinction exists because we decided to classify each paper based on the authors' specification (Section 3.1.3). In fact, for Wohlin et al. [29], experiment and controlled experiment are synonyms, and a quasi-experiment has all the same elements as an experiment; however, it typically lacks random selection and/or random assignment of participants [29]. We also analyzed 201 mechanisms that support empirical research in general. Although these resources do not specifically address experiments, they are used to design and execute empirical methods in general, and some of them are widely used in experimental software engineering.

Table 3.2 presents the most cited support mechanisms used as references for experiments. We also included mechanisms that address general empirical research because they can be applied to experiments. The first column presents the mechanism ID, the second shows the mechanism's reference, the third presents the count of studies that cited the mechanism, the fourth shows which empirical activities the mechanism aims to support, and the last column presents the empirical strategy in which the mechanism is applied. We then describe the most used support mechanisms in experimental software engineering.

Table 3.2: The most cited support mechanisms

SM ID | Support Mechanism Reference | Number of Citations | Supported Activities | Empirical Strategy
SM01 | Wohlin et al. (2000) [29] | 138 studies | Design of Experiment; Data Collection; Qualitative Analysis; Research Validation; Research Replication | Experiment
SM04 | Basili et al. (1994) [46] | 37 studies | Research Goals Definition; Quantitative Analysis; Research Validation | General Empirical Research
SM05 | Kitchenham et al. (2002) [24] | 34 studies | Design of Research Methods (General); Data Collection; Qualitative Analysis; Statistical Analysis; Research Validation | General Empirical Research
SM06 | Strauss et al. (1998) [105] | 33 studies | Design of Grounded Theory; Data Collection; Qualitative Analysis | General Empirical Research
SM07 | Robson (2002) [38] | 32 studies | Design of Research Methods (General); Data Collection; Qualitative Analysis; Research Validation | General Empirical Research
SM08 | Juristo and Moreno (2001) [30] | 28 studies | Design of Experiment | Experiment
SM09 | Seaman (1999) [106] | 25 studies | Qualitative Data Analysis | General Empirical Research
SM10 | Basili et al. (1999) [107] | 24 studies | Design of Experiment; Research Goals Definition; Research Validation | Experiment

SM01 (“Experimentation in software engineering: an introduction”)

The most cited mechanism in the empirical software engineering community was SM01 [29], cited by 138 studies, around eight citations per year. It is a book that provides a background on the theories and methods used in software engineering experimentation. This mechanism details the five experiment steps: scoping, planning, execution, analysis, and result presentation. It also provides practical examples. In general, this mechanism was used as a reference for defining experiments. In 28 studies, it was explicitly used as a reference for the definition of validity threats (for example, PS310 and PS559). Besides, it was applied as a guide to general activities of empirical studies, such as study replication (for example, PS710 and PS70), descriptive data analysis (for example, PS611), and statistical analysis (for example, PS308 and PS309). In this sense, despite being a guideline for experiments in software engineering, it also provides suggestions that can be used by any empirical study, such as case studies, surveys, and systematic studies. Considering the last five years, this mechanism was cited by 52 studies, with 22 citations in 2014 and 15 in 2015. Also, considering the first two years after the publication of this book (2001 and 2002), it was cited by 14 studies. These numbers show that this mechanism has been an important reference for the empirical software engineering community since its publication. In the following pieces of text, we report some excerpts collected from studies that mention SM01:

! PS611: “The experiment data are analyzed with descriptive analysis and statistical tests [29]. The Mann–Whitney test is used”. In this study, the mechanism SM01 was used as a reference for experimental data analysis.

! PS705: “In this section, a controlled experiment is described that took place at the University of Alberta, Canada. This experiment follows the well-known experimentation process proposed by [29]” ... “In this section we present threats to the validity of the study in accordance with the standard classification [29]”.

SM04 (“The Goal Question Metric Approach”)

The mechanism SM04 [46] specifies a paradigm based on metrics to define and evaluate a set of operational goals, called Goal-Question-Metric (GQM). This systematic approach was used to define and evaluate goals in empirical research. Besides, SM04 was applied for tailoring and integrating goals with software engineering products, based upon the specific needs of a project, as referenced, for example, in the studies PS164, PS230, and PS310. This mechanism also supports the analysis and validation of research conclusions, considering the defined goals. Among the 37 studies that cited this reference, the majority performed an experiment (18 studies) or a case study (nine studies). As with the mechanism SM02, this resource was published decades ago and is still adopted in many empirical studies (18 citations in the last five years). The following presents example evidence collected from studies that cited it:

! PS844: “In this paper, we present three controlled experiments that analyze whether and how background colors can improve the readability of preprocessor directives” ... “we describe each experiment using the goal-question-metric approach”.

! PS849: “In the context of our case study” ... “We describe our study following the Goal-Question-Metrics paradigm [46], which includes goals, quality focus, and context”.


SM05 (“Preliminary Guidelines for Empirical Research in Software Engineering”)

The mechanism SM05 [24] presents a set of guidelines to support software engineering researchers in designing, conducting, and evaluating empirical studies. Among the 34 studies that cited this reference, the majority (80%, 28 studies) used this resource to plan experiments. However, it was also adopted in other empirical strategies: case study (PS264 and PS647), survey (PS466), and systematic literature review (PS251 and PS674). Considering the last five years, this guide was cited by 11 studies. In the following, we show examples from studies that cited SM05:

! PS103: “In order to empirically assess the usefulness of ALSAF, we decided to carry out an empirical research project consisting of a set of experiments. Our research project was developed based on the guidelines provided in [40], [22]”. The reference [22] refers to the mechanism SM05, which was used for defining a set of experiments.

! PS265: “We considered the guidelines for empirical research by Kitchenham et al. [24] when planning and reporting this experiment”.

SM06 (“Basics of Qualitative Research: Techniques and Procedures for Developing Grounded Theory”)

This resource presents techniques and methods to help researchers perform qualitative analysis and data synthesis, and build theories from the research results. The focus of this approach is the grounded theory method, which is intended to collect, organize, and analyze data systematically (for example, PS73, PS319, and PS358). This reference was normally adopted in case studies (ten studies) and surveys (six studies). It was cited by 33 studies, of which 19 were published between 2011 and 2015. As with other resources already mentioned, this mechanism is not a recent reference, but it is among the most recently cited resources. An excerpt from a study that cited SM06 [105] is:

! PS645: “The interviews were analyzed according to principles from grounded theory [105], to ensure that the interviewees' opinions were conveyed systematically”.

SM07 (“Real World Research”)

This mechanism provides a guideline that includes the steps needed to carry out applied research. It brings together materials and approaches from different disciplines, valuing both quantitative and qualitative approaches, as well as their combination in multiple-method designs. It was normally used as a reference for data collection and analysis, for example, in PS37, PS192, PS238, and PS318. In some other studies, this mechanism was a reference for defining threats to validity, for example, in PS17, PS52, and PS568. This mechanism was cited by 32 studies. Considering the recent period (2011-2015), it was cited by ten studies. Thus, this resource is still widely used as a reference in empirical research. The following presents evidence collected from studies that cited the SM07 mechanism [38]:


! PS375: “An interview guide was developed by the researchers and designed to cover the area of interest and answer the research questions, supporting a semi-structured interview style, as described in Robson [38]. The strategy for selecting interviewees can be described as heterogeneous and purposive, see Robson [38]. The threats to construct validity can be expressed as respondent biases and researcher biases that were mitigated here by utilizing three different strategies discussed in Robson [38]”.

! PS793: “Various data collection methods were applied as proposed in the case-study method [41]. These were dependent on the case-specific deployment process of each company and included such methods as observation, semi-structured group and individual interviews [38]”.

SM08 (“Basics of Software Engineering Experimentation”)

SM08 (Juristo and Moreno, 2001) [30] presents a guideline for performing software engineering experiments. It reports many concepts related to empirical research in software engineering. Researchers follow the recommendations provided by this resource to plan and conduct software engineering experiments, for example, in PS337, PS640, and PS701. This mechanism was cited by 28 studies, with an average rate of two citations per year. Considering the last five years, it was cited by 11 studies. Since its publication, this resource has been among the most used by researchers that need to execute an experiment, controlled experiment, or quasi-experiment. Below are some examples from studies that used SM08:

! PS452: “We first present an overview of the original experiment followed by the detailed design and execution of the internal replication. We adopted the guidelines proposed by Wohlin et al. [29] and Juristo and Moreno [30]”.

! PS506: “In this section we describe the planning, design and execution of our experiment following the guidelines of well-known books on experimental software engineering [30], [29]”. The reference [30] refers to the mechanism SM08, which was used for defining the experiment.

SM09 (“Qualitative methods in empirical studies of software engineering”)

SM09 [106] is the tenth most cited mechanism, with 25 citations. It presents several qualitative methods for data collection and analysis. The author describes these methods in terms of how to incorporate them into software engineering empirical studies, in particular how to combine them with quantitative methods. It was mostly used in case studies (ten studies) and surveys (seven studies). We show the following evidence from studies that used this reference:

! PS624: “To facilitate analysis of the collected feedbacks we used the coding process described by Seaman (Seaman, 1999)”.


! PS647: “In this paper we investigate the experiences from integrating agile teams in traditional project management models. Two cases are studied, both within large system development companies” ... “We present and motivate a feasible approach for this investigation, using qualitative research methodology [106]”.

SM10 (“Building Knowledge through Families of Experiments”)

The SM10 mechanism [107] is a framework for organizing sets of related experiments in order to build up a complete picture from results across a wide range of contexts. It is organized around the GQM paradigm [46]. It was used in experiment studies, mainly for defining the goals and objectives of the experiment (for example, PS458, PS662, and PS803). It was also applied to the threats to validity of the research (for example, PS30 and PS450) and to experiment replication (for example, PS711). This resource was cited by 24 articles. Taking into account recent years (2011-2015), it was used as a reference in eight studies. We show evidence from a study that used SM10 below:

! PS789: “We conducted a family of three experiments to compare program comprehension between DSLs and GPLs” ... “To avoid questionable results and to enable the replication of this research, we have followed published guidelines on conducting a family of experiments [107]” ... “Before the experiments started, some basic rules were defined to rigorously prepare the experiment environment [107]”.

3.2.2.2 Main Mechanisms for Conducting Experimental Activities

Table 3.3 presents the mechanisms cited in published experiments and empirical research to support activities of the experimental process, including definition of research goals, data collection, design of experiments, research validation, reporting of research results, and research replication.


Table 3.3: Mechanisms used to support activities of the experimental process

Activity | Purpose | Experiment | Empirical Research | Total Number of Mechanisms
Definition of research goal | Defining and Evaluating Goals | SM10 | SM04; SM38; SM57; SM147; SM199 | 6
Data Collection | Sample Definition | SM236; SM146; SM129 | - | 3
Data Collection | Techniques and Methodologies | SM01; SM157; SM268 | SM48; SM49; SM264; SM229; SM138; SM143; SM189 | 10
Data Collection | Tools | SM50 | SM108; SM213 | 3
Data Collection | Bibliography Management | - | SM201; SM124; SM316 | 3
Data Collection | Questionnaires/Interviews | - | SM327; SM97; SM317 | 3
Data Collection | Qualitative Research | - | SM281 | 1
Design of Experiments | Design of experiments | SM185; SM18; SM01; SM236; SM171; SM146; SM129; SM121; SM92; SM70; SM46; SM42; SM12; SM10; SM08; SM17; SM36; SM96; SM120; SM126; SM130; SM139; SM182; SM197; SM217; SM227; SM265; SM266; SM331; SM335 | SM05; SM07; SM60; SM61; SM81; SM154; SM275; SM322; SM339; SM360 | 40
Research Validation | Threats to Validity | SM01; SM18; SM33; SM42; SM12; SM10 | SM04; SM05; SM07; SM28; SM58; SM64; SM77; SM121; SM176; SM242; SM322; SM329; SM338; SM339; SM360 | 21
Report Research Result | Report Research Result | SM33; SM70; SM46 | SM116 | 4
Experiment Replication | Experiment Replication | SM01; SM70; SM78; SM100; SM121; SM131; SM224; SM225; SM171 | SM125; SM134; SM170; SM285; SM354 | 14

In the following, we present the main mechanisms cited as references for conducting experiments in software engineering.

As we can see in Table 3.2, among the most cited support mechanisms, three are specific to guiding experiments in software engineering (SM01, SM08, and SM10). Another widely used reference for experiments is SM12 [48], cited by 23 studies. It presents a guideline for designing experimental and quasi-experimental research, and it is largely used for threats to validity. In Table 3.3, we also identified a web tool (SM50) [108] to support experiment activities in the software engineering context; it can be used to conduct large controlled experiments in industry. The mechanisms SM224 [109], SM100 [110], and SM171 [111] present suggestions for experiment replications in software engineering. SM70 [112] presents a replication approach for empirical software engineering research; it was used by two experiments (PS30 and PS518). SM131 [78] and SM78 [113] discuss aspects of experiment replications in software engineering, while the resource SM157 [114] provides suggestions for collecting feedback during software engineering experiments. SM01 is also used as a reference for defining threats to research validity; it was explicitly cited for this purpose in 40 studies, among experiments, surveys, case studies, and systematic studies. The mechanisms SM12 [115] and SM18 [53] were also cited for the validation of research.

3.2.2.3 General Support Mechanisms

In Table 3.4, we present some general support mechanisms that, although they do not specifically address experiments, support some experimental activities, such as statistical analysis, qualitative analysis, quality evaluation, and multi-method approaches.

Table 3.4: General mechanisms used to support activities of the experimental process

Activity | Purpose | Experiment | Empirical Research | Total Number of Mechanisms
Statistical Analysis | Statistical Analysis | SM92 | SM15; SM16; SM25; SM27; SM37; SM39; SM40; SM41; SM52; SM55; SM62; SM63; SM68; SM69; SM72; SM74; SM80; SM81; SM83; SM102; SM132; SM136; SM145; SM148; SM149; SM150; SM155; SM156; SM158; SM159; SM160; SM161; SM162; SM164; SM168; SM169 | 37
Qualitative Data Analysis | Qualitative Data Analysis | SM18; SM216; SM269; SM284 | SM04; SM05; SM07; SM09; SM20; SM26; SM30; SM34; SM45; SM54; SM59; SM65; SM66; SM85; SM91; SM93; SM95; SM97; SM101; SM104; SM106; SM107; SM109; SM111; SM113; SM128; SM166; SM167; SM183; SM184; SM188; SM194; SM202; SM228; SM235; SM237; SM240; SM257; SM261; SM274; SM275; SM281; SM288; SM295; SM296; SM305; SM308; SM309; SM334; SM343; SM344; SM345; SM352; SM356; SM357; SM358; SM359 | 61
Qualitative Data Analysis | Grounded Theory | - | SM06; SM11; SM99; SM105; SM163; SM304; SM307 | 7
Quality Evaluation | Quality Evaluation | - | SM87; SM122; SM211 | 3
Multi-Method Approach | Multi-Method Approach | SM01; SM70 | SM350; SM351 | 4

In the following, we present the most relevant general mechanisms cited as references for conducting experiments in software engineering.

Statistical Data Analysis: 37 mechanisms from experiments and empirical research support statistical analysis activities. Among the most cited mechanisms to support statistical analysis, SM16 [116] presents statistical techniques to perform data analysis, and the mechanism SM37 [117] provides statistical principles to conduct experimental research. SM15 [118] presents another widely used statistical model to perform data analysis, cited by 19 studies. The resource SM41 [119] is a book for engineering and the sciences which emphasizes modern statistical methodology and data analysis. R1 (SM25) is an environment for statistical computing and graphics. Considering the last five years, we can also highlight the mechanism SM55 [120], which provides a practical guide for using statistical tests to assess algorithms in software engineering. Other references can be found in the complete mechanisms catalog.

1http://www.r-project.org
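As a concrete illustration of the analyses these statistical mechanisms support, the sketch below applies the Mann–Whitney U test, the non-parametric test mentioned in the PS611 excerpt in Section 3.2.2.1, to two invented samples; the data and group labels are ours, purely for illustration.

```python
# A sketch of the kind of statistical test these mechanisms recommend:
# a Mann-Whitney U test comparing two independent groups. The samples
# below (task-completion times, in minutes) are invented for illustration.
from scipy.stats import mannwhitneyu

group_a = [12.4, 15.1, 9.8, 11.0, 14.3, 10.7, 13.5]
group_b = [16.2, 18.9, 14.8, 17.5, 15.9, 19.3, 16.7]

# Non-parametric test: no normality assumption about the samples
stat, p_value = mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: the two groups differ significantly")
else:
    print("Cannot reject H0 at the 0.05 level")
```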


Qualitative Data Analysis: Table 3.2 presents three mechanisms to support qualitative analysis (SM06, SM07, and SM09). Besides, the catalog contains other specific references for this purpose. SM20 [121], cited in 13 studies, provides guidelines to perform data analysis and synthesis. SM66 [122] recommends steps for qualitative analysis in software engineering using thematic synthesis. Some mechanisms describe guidelines for the grounded theory method (SM06 [105]; SM11 [123]; SM99 [124]; SM105 [125]; and SM307 [126]).

Quality Evaluation: Some resources were explicitly cited as references to perform quality assessment in empirical studies, such as SM87 [80], SM122 [82], and SM211 [83].

Multi-Method Approach: A multi-method approach is a strategy used by researchers when they combine more than one specific empirical study type. Some resources were cited as references for applying the multi-method approach, such as SM01, SM70, SM350, and SM351.

3.2.2.4 Overview of Experimental Software Engineering Field

This section presents the source areas of the support mechanisms for conducting experiments, the evolution of experiments over the years, and the concerns about “not identified” empirical strategies in the studies published in EASE, ESEM, and ESEJ.

Source Areas of the Support Mechanisms for Conducting Experiments

Analyzing only the studies that performed an experiment, Figure 3.7 presents the distribution of the support mechanisms by origin area.

Figure 3.7: Source area of the support mechanisms for experiments

Among the 43 identified support mechanisms specific to conducting experiments, 29 mechanisms (67%) are specific to software engineering and 14 (33%) are mechanisms adapted from other scientific areas. We can highlight that most of the adapted mechanisms come from social science (8 resources, 19%), multidisciplinary sources (5 resources, 12%), and business (1 resource, 2%). Also, of the 393 studies that conducted an experiment, 116 studies (30% of the experiment studies) do not cite any mechanism.


Analyzing the mechanisms that were cited to design and execute empirical methods in general, but that do not address a specific empirical strategy, Figure 3.8 presents the distribution of these support mechanisms by origin area.

Figure 3.8: Source area of the support mechanisms for general empirical research

Among the 201 identified support mechanisms for conducting general empirical research, 51 mechanisms (25.4%) are from software engineering and 150 (74.6%) are mechanisms from other scientific areas. We can highlight that most of these mechanisms are adapted from statistics (77 resources, 38.3%) and from multidisciplinary sources and the social sciences (30 resources each, 14.9%).

Evolution of Experiments over the Years

Figure 3.9 presents the evolution of software engineering experiments over the years.

Figure 3.9: Experiments evolution

Since the initial editions, experiments have been one of the most used empirical strategies. Considering the last ten years, we can observe in Figure 3.9 an average of 26 experiments published per year. Except for EASE 1998, all editions of the three venues published at least one experiment.

Concerns about “Not Identified” Empirical Strategies in Studies

As a side effect of this study, we also report the progress over the years of the “Not Identified” studies. One of the main reasons for this concern is that more experiments could have been included in the results of this systematic mapping if the authors had identified their studies as experiments.

Figure 3.10 presents the variation over the years of the studies in which the authors did not report the empirical strategy used. The average rate of these studies over the years is almost constant (around 19% per year). However, in the last two years, we can notice that this rate decreased: 14% in 2014 and 12% in 2015. In addition, between 2010 and 2014, there were 65 Not Identified studies (15% of the studies published in this period).

Figure 3.10: Not identified studies evolution

The “Not Identified” studies are present in almost all editions of all venues. Of all the collected full papers, in 201 studies (18%) we did not identify a strategy. As we observed in Figure 3.10, this kind of study is not increasing over the years; rather, it is slowly decreasing. In spite of all efforts to evolve empirical software engineering, we consider these results as evidence of misunderstandings in the usage of empirical software engineering methodologies. Besides, it points out that the community still accepts research that does not clarify the empirical strategies and references used. This could be mitigated by improving and disseminating knowledge about the empirical strategies and their support mechanisms.

3.3 Discussion

Regarding the mechanisms applied in software engineering experiments, we found satisfactory results. As mentioned in previous sections, experiment is the most commonly adopted empirical strategy, with 393 studies (35.8% of the total). This group comprises experiments (245 studies), controlled experiments (133 studies), and quasi-experiments (15 studies). Including the studies that adopted an experiment in mixed-methods research, the number of experiment studies increases to 414 (37.7% of the total). Forty-three resources are specific to supporting software engineering experiments; Table 3.2 presents three of the most used mechanisms specific to experiments. A further 201 mechanisms, although not specific to experiments, support some experimental activities, including data analysis, design, data collection, and threats to validity, among others.

We also analyzed whether the support mechanisms identified in experimental studies are specific to software engineering or come from other scientific areas. As stated in Section 3.2.2.4, 67% of the mechanisms cited in experiment studies are from software engineering, which suggests that experimental software engineering researchers use the mechanisms developed by the software engineering community. This is a positive point because software engineering is a recent research area compared to other engineering fields.

However, we found a considerable number of experimental studies that do not cite any reference used to plan and conduct the experiment (30% of the experiments). Some may argue that these studies do not cite their references because those references are common sense in the community. However, Jedlitschka and Pfahl [20] state that the quality of an experiment report decreases when it does not cite the adopted guidelines, tools, etc.

A web catalog contains the results regarding experiments as well as other kinds of empirical studies; the experimental resources are a subset of it. The catalog may support the decision-making process regarding which support mechanisms to use. It is especially useful for novice and less experienced researchers in the software engineering area, giving them access to the most appropriate approaches to conduct research and consequently further popularizing the adoption of experiments. Besides defining a catalog of support mechanisms, we can point out some useful findings for the experimental software engineering community: the number of published experiment studies in EASE, ESEM, and ESEJ has increased, especially in the latest editions; the most used support mechanism to conduct experiments in software engineering is Wohlin et al. (2000) [29]; and experiment is the most adopted empirical strategy.

As a side result, we identified a high number of studies wherein the authors did not state the empirical strategy used. Regarding the analyzed vehicles, we noticed a slight difference in quality when considering the number of studies that explicitly use empirical methods and cite references to guide their research. The journal (ESEJ) has the highest proportion of research contributions that make use of strategies and guides for empirical studies. This result can be attributed to the rigor required by the journal, as well as to the greater maturity of the research submitted to it, which in general was published previously at a conference.

Finally, we believe that the results of this research contribute towards providing a better understanding of the landscape of resources that can support experimental studies in software engineering. Thus, we expect to contribute to fostering the adoption of experiments and increasing the quality of the research conducted within the software engineering area. This is relevant for industrial practice as well as for academic research.

3.4 Threats to Validity

This section discusses the threats to the validity of our systematic mapping study and the actions taken to mitigate those threats. According to Sjøberg et al. [14], the main threats to the validity of a systematic literature review are: i) bias in the paper selection; ii) low accuracy in data extraction; and iii) mistakes in data classification and synthesis.

Bias in the study selection does not represent a threat to our work, since we included all full papers published in EASE, ESEM, and ESEJ; this mitigates the risk that relevant studies within this scope were not included. As mentioned in Section 3.1.2, we decided not to include automatic searches, since our goal was to give a relevant view of the support mechanisms adopted in the empirical software engineering area. Manual searches in other vehicles and automated searches can be planned as extensions of this research.

Due to the large number of studies to be analyzed, one possible threat to this work is inaccuracy in data extraction. To mitigate this threat, we defined a structured form for the extraction of evidence. Besides, at least two researchers extracted each piece of information (Section 3.1.3). We also performed an extraction pilot to avoid misunderstandings.

Another threat regards the data analysis, since there was a large amount of information. We mitigated this threat by developing a tool to automate summarization and counting (Section 3.1.4). We conducted a review strategy in order to evaluate whether the information presented by the tool was accurate: two researchers performed a manual summarization and counting on part of the information, and we found no disagreement when comparing the manual results against the tool's.

Another threat in our study is the fact that subjectivity during the data extraction could generate mistakes, especially in the classification of mechanisms and empirical strategies. In order to mitigate this threat, we extracted the information exactly as the authors mentioned it in the paper. These strategies can also facilitate the replication and verification of our study.

3.5 Chapter Summary

In this chapter, we presented the results of the systematic mapping study for identifying the support mechanisms used by the empirical software engineering community. The complete systematic mapping study identified mechanisms to support empirical studies, including experiments (controlled experiments, experiments, and quasi-experiments), case studies, ethnography, and action research. However, because our scope is focused on experiments, we only reported mechanisms that support experimental activities. We presented the most used support mechanisms, the main mechanisms for conducting experiments, and general support mechanisms. Also, we presented an overview of the experimental software engineering field over the years in studies published in the main empirical software engineering venues. The results of our systematic mapping study regarding software engineering experiments suggest an increasing usage of experiments in software engineering over the years. However, the evidence shows that a high number of studies did not cite the references used to plan and conduct their empirical research. In this sense, the catalog of support mechanisms, which identifies the activities that each mechanism supports and examples where it was used, is a major contribution of this work to the software engineering community interested in empirical studies. In the next chapter, we present the qualitative interview study conducted to understand how experienced empirical software engineering researchers plan their experiments.


4 Qualitative Interview Study

In this chapter, we describe the qualitative interview study performed to understand how experienced empirical software engineering researchers plan their experiments. The chapter is organized into the following six sections: Section 4.1 presents the introduction to the study; Section 4.2 reports the research methodology; Section 4.3 describes the results and principal findings; Section 4.4 presents the discussion; Section 4.5 addresses the trustworthiness of the qualitative interview study; and Section 4.6 presents the summary of this chapter.

4.1 Introduction

Experimental planning is a process that specifies how experimenters will carry out their experiments. It involves determining under exactly what conditions the experiment is to be conducted, which variables can affect the experiment, who is going to participate in the study, how many times it is to be repeated, and so on. Well-made experimental planning minimizes costs and bias [30]. Although the planning process iterates until a complete experiment design is ready [21], some researchers, especially beginners, are not aware of when it is ready to run. Once an experiment is started, the lack of important elements in the experimental plan and wrong decisions made by the experimenters can affect the experiment results, thus leading to inconclusive or negative results [17].

Although the literature specifies some guidelines for conducting experiments in software engineering [30], [21], [24], [20], [17], few focus on a practical perspective from experienced empirical software engineering researchers regarding how they actually plan their experiments.

Qualitative study methods came from the social sciences to help researchers understand human behavior [127], and they have been used in software engineering through action research, case study research, ethnography, and grounded theory. Seaman [106] presents several research methods for collecting and analyzing qualitative data, and describes how these qualitative methods can be used in empirical studies of software engineering.

Because our target was to extract information from experts, a qualitative study using semi-structured interviews was the appropriate study method. We conducted 11 interviews, which were audio recorded and transcribed. To perform the data analysis, we chose open and axial coding from grounded theory because they allowed us to associate codes with quotations and analyze them through the relationships between the merged codes [105]. The goal of this study is to understand what experimental experts actually do when they design their experiments, what kinds of problems or traps they fall into, how they currently learn about research methods, and what gaps they have in their knowledge. The result of this study is one of the empirical groundings of the proposed instrument.

4.2 Methodology

In this research, the focus is to understand the experiment planning process from the point of view of experienced empirical software engineering researchers, regarding how they actually plan their experiments in practice. To do this, we used the GQM goal template [46] to structure the following general goal of the study:

Analyze the process of planning experiments
For the purpose of improving the controlled experiment planning process and understanding how experts plan their experiments
With respect to resources for and problems with planning controlled experiments using subjects
From the viewpoint of experts in empirical software engineering
In the context of researchers in empirical software engineering
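
To make the template's slot structure concrete, the sketch below models a GQM goal as a simple data structure whose fields mirror the five slots above. This is an illustrative aid only, not tooling used in this study, and all names in it are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class GQMGoal:
    """A goal expressed in the five slots of the GQM goal template [46]."""
    analyze: str               # the object of study
    for_the_purpose_of: str    # the purpose
    with_respect_to: str       # the quality focus
    from_the_viewpoint_of: str # the perspective
    in_the_context_of: str     # the environment

# The general goal of this study, expressed in the template.
study_goal = GQMGoal(
    analyze="the process of planning experiments",
    for_the_purpose_of="improving the controlled experiment planning process",
    with_respect_to="resources for and problems with planning controlled experiments",
    from_the_viewpoint_of="experts in empirical software engineering",
    in_the_context_of="researchers in empirical software engineering",
)
```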

4.2.1 Research Questions

From the general goal, we specified four sub-goals, which are also our research questions, and we applied the GQM template to each one. Tables 4.1, 4.2, 4.3, and 4.4 present the research questions structured in the GQM goal template.

Table 4.1: Research Question 1 (RQ1): What do experimental experts actually do when they design their experiments?

    Analyze                   The experiment planning process
    For the purpose of        Characterizing/understanding
    With respect to           Actual process / how experiment planning is done
    From the viewpoint of     Experimental experts
    In the following context  Software engineering


Table 4.2: Research Question 2 (RQ2): What kinds of problems/traps do the experts fall into?

    Analyze                   The problems in experimental planning
    For the purpose of        Detecting mistakes and finding traps
    With respect to           Actual process / the common failures in experimental planning
    From the viewpoint of     Experimental experts
    In the following context  Software engineering

Table 4.3: Research Question 3 (RQ3): How do experts currently learn about experiment planning?

    Analyze                   Experiment planning
    For the purpose of        Collecting advice/suggestions about how experts learn about planning experiments
    With respect to           Learning about planning experiments
    From the viewpoint of     Experimental experts
    In the following context  Software engineering

Table 4.4: Research Question 4 (RQ4): What gaps do experts have in their knowledge?

    Analyze                   Knowledge of experts
    For the purpose of        Finding gaps in the literature
    With respect to           Literature and planning structure
    From the viewpoint of     Experimental experts
    In the following context  Software engineering

4.2.2 Interview Design

Tables 4.5, 4.6, 4.7, and 4.8 illustrate the translation of the research questions into interview questions. The questions were developed, discussed with a researcher mentor, piloted, and adjusted until the final version of the questionnaire was reached. Nine interview questions were generated for RQ1, six for RQ2, four for RQ3, and five for RQ4.


Table 4.5: RQ1 - Research Question: What do experimental experts actually do when they design their experiments? - Interview Questions

Q1  Research sub-question: What is an experimental plan and what are the main elements that should be contained in it?
    Interview question: From your perspective, what is an experimental plan and what are the main elements that should be contained in it?

Q2  Research sub-question: Which written definitions do ESE researchers use for planning experiments?
    Interview question: Do you use any written experimental process or guidelines to plan experiments? If so, which ones: ( ) No ( ) Wohlin ( ) Juristo ( ) Kitchenham ( ) Other____________

Q3  Research sub-question: What are the guidelines for planning experiments available in ESE?
    Interview question: What other guidelines for planning experiments do you know about?

Q4  Research sub-question: How do ESE researchers plan experiments?
    Interview question: How do you plan experiments?

Q5  Research sub-question: What are the characteristics of an experimental plan that help to optimize the effectiveness of an experiment?
    Interview question: What are the characteristics of an experimental plan that you think are most important to help optimize the effectiveness of an experiment?

Q6  Research sub-question: What are the acceptance criteria for the quality of the experiment planning?
    Interview question: In your opinion, what are the acceptance criteria for the quality of the experiment plan?

Q7  Research sub-question: How effective is it to plan an experiment?
    Interview question: Does planning an experiment make the experiment more effective (as compared to doing an experiment with no planning)? Why?

Q8  Research sub-question: Is the time to plan an experiment relevant? If yes, what is the average time required to plan an experiment?
    Interview question: Does it make sense to track the amount of time it takes to plan an experiment? If yes, what is the average time required to plan an experiment?

Q9  Research sub-question: How do the experts know that an experimental plan is complete or of good quality?
    Interview question: How do you assess or know that your experimental plan is complete or of good quality?


Table 4.6: RQ2 - Research Question: What kinds of problems/traps do the experts fall into? - Interview Questions

Q1  Research sub-question: What are the main difficulties faced by experts when they plan experiments?
    Interview question: In your experience, what are the main difficulties you face when you are doing experiment planning?

Q2  Research sub-question: Which sections of the experimental plan do the experts have problems defining, and why?
    Interview question: Which sections of the experiment plan do you have problems defining, and why?

Q3  Research sub-question: What are the common traps the experts usually fall into?
    Interview question: What are the common traps you usually fall into?

Q4  Research sub-question: Can poor knowledge of statistics influence the quality of experiment planning?
    Interview question: Do you think having poor knowledge of statistics can influence the quality of experiment planning?

Q5  Research sub-question: Which kinds of mistakes are likely to be found in experimental plans?
    Interview question: Which kinds of mistakes are likely to be found in experimental plans?

Q6  Research sub-question: What are the most common problems when someone doesn't plan their experiments correctly?
    Interview question: In your perspective, what are the most common problems when someone doesn't plan their experiments correctly?

Table 4.7: RQ3 - Research Question: How do experts currently learn about experiment planning? - Interview Questions

Q1  Research sub-question: Are current experimental processes a useful and efficient way to train new researchers?
    Interview question: In your opinion, is the way that the ESE community plans experiments a useful and efficient way to train new researchers?

Q2  Research sub-question: What are the best methods to teach/learn how to plan an experiment?
    Interview question: In your opinion, what are the best methods to teach/learn how to do experiment planning?

Q3  Research sub-question: Can a practical approach encourage inexperienced researchers to learn how to do experiment planning?
    Interview question: What other types of guidance or tools would be useful to help inexperienced researchers learn how to do experiment planning?

Q4  Research sub-question: How do experts currently learn about research methods?
    Interview question: How do you currently learn about research methods?


Table 4.8: RQ4 - Research Question: What gaps do experts have in their knowledge? - Interview Questions

Q1  Research sub-question: What are the gaps in experts' knowledge about planning experiments?
    Interview question: What are the gaps in your knowledge about planning experiments?

Q2  Research sub-question: Regarding the literature, which are the main critical gaps in experimental planning?
    Interview question: Regarding the literature, which are the main critical gaps in experiment planning?

Q3  Research sub-question: What could be improved in experimental planning?
    Interview question: In your opinion, what could be improved in the literature about experiment planning?

Q4  Research sub-question: How important is it to carry out good experimental planning, and why?
    Interview question: In your opinion, how important is it to carry out good experiment planning, and why?

Q5  Research sub-question: Do some ESE researchers neglect experiment planning?
    Interview question: Do you think that some ESE researchers neglect experiment planning? What is usually left out?

The questions above were ordered so that the interview would flow well. At the end of the interview, the interviewees had the opportunity to express any additional thoughts, comments, or lessons learned from their experience in experiment planning. The sequence of the questions is presented in Table 4.9.


Table 4.9: Order of the interview questions

1.  From your perspective, what is an experimental plan and what are the main elements that should be contained in it? (GQM 1)
2.  Do you use any written experimental process or guidelines to plan experiments? If so, which ones: ( ) No ( ) Wohlin ( ) Juristo ( ) Kitchenham ( ) Other____________ (GQM 1)
3.  What other guidelines for planning experiments do you know about? (GQM 1)
4.  What other types of guidance or tools would be useful to help inexperienced researchers learn how to do experiment planning? (GQM 1)
5.  How do you plan experiments? (GQM 3)
6.  What are the characteristics of an experimental plan that you think are most important to help optimize the effectiveness of an experiment? (GQM 1)
7.  In your opinion, what are the acceptance criteria for the quality of the experiment plan? (GQM 1)
8.  Does planning an experiment make the experiment more effective (as compared to doing an experiment with no planning)? Why? (GQM 1)
9.  Does it make sense to track the amount of time it takes to plan an experiment? If yes, what is the average time required to plan an experiment? (GQM 1)
10. How do you assess or know that your experimental plan is complete or of good quality? (GQM 1)
11. In your experience, what are the main difficulties you face when you are doing experiment planning? (GQM 2)
12. Which sections of the experiment plan do you have problems defining, and why? (GQM 2)
13. In your perspective, what are the most common problems when someone doesn't plan their experiments correctly? (GQM 2)
14. What are the common traps you usually fall into? (GQM 2)
15. Do you think having poor knowledge of statistics can influence the quality of experiment planning? (GQM 2)
16. Which kinds of mistakes are likely to be found in experimental plans? (GQM 2)
17. In your opinion, is the way that the ESE community plans experiments a useful and efficient way to train new researchers? (GQM 3)
18. In your opinion, what are the best methods to teach/learn how to do experiment planning? (GQM 3)
19. How do you currently learn about research methods? (GQM 3)
20. What are the gaps in your knowledge about planning experiments? (GQM 4)
21. Regarding the literature, which are the main critical gaps in experiment planning? (GQM 4)
22. In your opinion, what could be improved in the literature about experiment planning? (GQM 4)
23. In your opinion, how important is it to carry out good experiment planning, and why? (GQM 4)
24. Do you think that some ESE researchers neglect experiment planning? What is usually left out? (GQM 4)
25. Are there any thoughts, comments, suggestions, or lessons learned from your experiences with experiment planning that you would like to share?

4.2.3 Procedures

The semi-structured questionnaire presented above was prepared to collect and analyze data about the process of planning experiments. Each interview was designed to last around 60 minutes. This limit was respected in all interviews, and it was not necessary to stop any interview because the time was up.

4.2.3.1 IRB Application

U.S. universities have an Institutional Review Board (IRB) responsible for ensuring that human subjects are treated ethically in research. Because the investigators of this study were affiliated with UMBC, a university in the U.S.A., the study was required to follow procedures and federal regulations regarding the protection of human subjects. All researchers involved in the design and conduct of studies with human subjects must complete an approved training program prior to conducting the research.

The training must be repeated every five years. The instructions can be found at https://www.citiprogram.org/. The author acquired the certification on August 25, 2015.

After the principal investigators completed the Collaborative Institutional Training Initiative (CITI) training program, they chose the most appropriate type of review for the study. Depending on the level of risk to the subjects, a study proposal falls into one of three types of review: exempt review, expedited review, or full committee review (for more information, see the UMBC website http://research.umbc.edu/). Our research fits the expedited review process, in which the potential risks of the research must not be greater than minimal and must fall into at least one of the expedited categories defined by the federal regulations, which in our case were:

- Collection of data from voice, video, digital, or image recordings made for research purposes.

- Research on individual or group characteristics or behavior, or research employing survey, interview, oral history, focus group, program evaluation, human factors evaluation, or quality assurance methodologies.

After defining the most appropriate type of review process for the study, the following documents and forms were prepared for the expedited approval of the use of human participants:

1. Adult consent
2. Waiver of informed consent
3. One-paragraph abstract describing the protocol
4. Investigator vita
5. Interview questionnaire
6. Recruitment letters

The protocol form and the documents were submitted electronically to [email protected]. The expedited review process lasted two weeks before approval. Only after the approval of the IRB committee were we able to start interacting with the potential interviewees. We sent recruitment letters and the informed consent by e-mail to the possible participants in the study.

4.2.4 Sample

The participants of this study are experts in empirical software engineering. They were selected based on their experience in conducting experiments in software engineering: they have relevant studies published in the empirical field and more than ten years of experience. Our goal was to interview 20 researchers from several countries, such as Brazil, the United States, Canada, Germany, England, and Spain. An invitation e-mail was sent to the possible interviewees. 11 out of the 20 experts were available to participate in this study: three from the U.S.A., one from England, one from Spain, and six from Brazil. The six participants from Brazil were interviewed in Portuguese; the remaining interviews were conducted in English. Ten interviews were conducted remotely via Skype, while one was performed face to face at UMBC because of the availability of the interviewee.

4.2.5 Process of Consent

The informed consent was developed to provide information about the purpose of the research, the procedures, risks and benefits, and confidentiality. It was sent as an attachment to the invitation e-mail to the potential interviewees, who were asked to review it before the interview. At the beginning of each Skype interview, the interviewer verbally confirmed with the participant whether they consented or not, and the participant could ask any questions. The interviews only proceeded after the consent of the interviewed experts. The participants' verbal consent was documented by recording the date and time of consent in the table of participant names and codes. The benefit of participating in this study was the opportunity to contribute to the development of the field of empirical software engineering; the participants were not paid for their contribution. The main potential risk to participants was fatigue during the interviews; however, the participants had the freedom to take a break at any time, or even to end their participation in the study. The questions asked were not sensitive or personal, and no risk of embarrassment or loss of reputation was expected.

4.2.6 Piloting

A pilot study was performed to test the materials and procedures, checking for potential problems and verifying the fluidity of the interview. We performed two pilots. The first was conducted in Portuguese with a Ph.D. student in empirical software engineering from a Brazilian university (UFPE). The second was conducted in English with a senior researcher in empirical software engineering from a U.S. university (UMBC). After the pilots, the participants gave feedback about the content of the questionnaire and about the fluidity, tone, and body language of the interviewer. In addition, we verified the timing, the recording resources, and the environment of the interview. We concluded that the study design was suitable for the goals of the study.

4.2.7 Data Collection

The data collection was split into two phases: interviews with Brazilian experts, and interviews with experts of other nationalities, including the U.S.A., England, and Spain. We performed 11 interviews from 20 invitations. Table 4.10 shows the schedule of the interviews. The interviews were transcribed in the language used during the interview; that is, interviews from Phase 1 were transcribed in Portuguese and interviews from Phase 2 were transcribed in English.

Table 4.10: Schedule of the Interviews

Phase 1
    P_1   November 9, 2015
    P_2   November 16, 2015
    P_3   November 23, 2015
    P_4   December 2, 2015
    P_5   December 2, 2015
    P_6   December 3, 2015

Phase 2
    P_7   February 1, 2016
    P_8   February 3, 2016
    P_9   February 23, 2016
    P_10  March 3, 2016
    P_11  March 21, 2016

The interviews were conducted via Skype and were audio recorded. Regarding the confidentiality of the data, a code was associated with each research participant to protect personal privacy. Only one table contains both names and codes, and this table is kept in electronic form only on a password-protected computer. The analysis of the data used only codes, no names, and the results are published without reference to any participant's name. The original recordings and data files were stored on a secured shared drive accessible only to the investigators.

4.2.8 Data Analysis

The Brazilian interviews were carried out in Portuguese and their transcripts were produced in Portuguese; the interviews from the other countries were conducted in English and their transcripts were produced in English. The data was transcribed by the principal investigator of this study, and all transcripts were double-checked by two reviewers: a Portuguese native speaker for the Portuguese transcripts, and an English native speaker for the English transcripts. The reviewers were the advisors of this research. They randomly picked pieces of the transcripts and checked them by listening to the audio while reading the transcriptions. Because the results of this research are published in English, we made a free translation of the quotes from the Brazilian experts used in the results section. To assure credibility, the translations were checked by another researcher.

To perform the data analysis, we chose open and axial coding from Grounded Theory [105]. Because it was not our goal to create a theory from this study, we did not use selective coding. Open coding involves the analysis and categorization of information about the phenomenon of interest. In this phase, there was an intensive reading of the transcripts. At the beginning of the data analysis process, some seed codes were created in brainstorming meetings; then, relevant words, phrases, sentences, and sections were labeled with the previously created codes. During the open coding process, however, new codes were created from relevant ideas: for example, something that was repeated several times, something that surprised the coder, something the interviewee explicitly stated was important, or some other reason the coders considered relevant. The list of codes therefore grew during open coding. The codes were grouped into five categories, which correspond to the research questions of this study, plus an added general category containing general codes. The final list of open codes includes 21 codes; the codes and categories can be seen in Table 4.11. An Excel spreadsheet was used to facilitate the open coding phase. We performed the open coding analysis line by line, highlighting what was important, and did a second round of open coding to ensure we had not missed any important categories. At the end of the open coding phase, we classified the transcripts, that is, we grouped selected quotes into relevant factors for each question to be analyzed in the axial coding phase.

In the axial coding phase, we identified relationships among the data that had been broken into pieces by the open coding process. We also performed this phase within the Excel spreadsheet. Additionally, memos were an important part of the analysis process; we wrote a series of them to capture thoughts and connections.
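
Purely as an illustration of the bookkeeping that open and axial coding imply (the study itself used an Excel spreadsheet, not a script), the sketch below groups coded quotes by open code and counts how often two codes label the same quote, a simple proxy for the relationships examined during axial coding. All records and code names here are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical open-coding records: (participant, quote, assigned open codes).
coded_quotes = [
    ("P_1", "We always run a pilot before the real study.",
     ["Experiment design process", "Advice/tips"]),
    ("P_2", "Students often forget confounding factors.",
     ["Common mistakes and traps", "Problems and difficulties"]),
    ("P_3", "I read similar studies and their lessons learned sections.",
     ["Ways of learning", "Support mechanisms used"]),
]

# Open coding bookkeeping: collect the quotes labeled with each code.
quotes_by_code = defaultdict(list)
for participant, quote, codes in coded_quotes:
    for code in codes:
        quotes_by_code[code].append((participant, quote))

# A crude axial-coding aid: count how often two codes co-occur on one quote.
cooccurrence = defaultdict(int)
for _, _, codes in coded_quotes:
    for pair in combinations(sorted(codes), 2):
        cooccurrence[pair] += 1

for (code_a, code_b), count in sorted(cooccurrence.items()):
    print(f"{code_a} <-> {code_b}: {count}")
```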


Table 4.11: List of Open Codes

What do experimental experts actually do when they design their experiments?
    What they do; Experiment design process; Support mechanisms used; Importance of experiment planning; Perception from the empirical software engineering community; Acceptance criteria; Elements; Time; Experimental plan quality; Experimental plan definition

What kinds of problems/traps do the experts fall into?
    Common mistakes and traps; Problems and difficulties; Statistics

How do experts currently learn about experiments?
    Ways of learning

Gaps in experts' knowledge
    Gaps in literature; Something that experts miss

General codes
    Advice/tips; Feedback; Comments; Example

4.3 Results and Principal Findings

In this section, the results from the interviews and the principal findings are presented, organized around the major quotes from our analysis. Additionally, we present a conceptual model of what experts actually do when they design their experiments; an overview of the ways of learning, organized into skills such as reading, listening, speaking, observing, practicing, and writing; the common mistakes and traps experienced by the experts; and the main problems and difficulties they face. Finally, we present the gaps reported in the experts' knowledge and in the empirical literature, and the supports they would like to have.


4.3.1 RQ1: What do experimental experts actually do when they design their experiments?

The participants described several distinct pathways for experiment planning; however, the activities they practice are almost the same. We grouped these activities and developed a conceptual model that captures all the variations the participants say they perform when they design their experiments. These activities are: brainstorming about experiment rationales; using support mechanisms; writing and updating an experimental plan; revising the experimental plan; meeting with external researchers; running pilots; and discussing results from pilots. An overview of the conceptual model depicting these activities is provided in Figure 4.1.

Figure 4.1: Conceptual model about what experts actually do when they design their experiments

4.3.1.1 Brainstorming about experiment rationales

For the experts, brainstorming about experiment rationales is the first important step in planning experiments. Before they start to plan, they hold multiple meetings to discuss all the design decisions they will be making. These decisions are described as follows:

"We discuss all the design decisions that we will be making like who should wecontact, who should we talked to, what data shall we collect and what problemswill that be? What kind of work through, kind of the planning phase. That may takemultiple meetings but,usually,we try to reach the point where everybody involved inthe research project is happy with the plan and the assignment."


Another participant provided more detail about the specific topics that need to be addressed in these initial brainstorming meetings:

"I'll start with here is an idea for how I'm gonna run the study and I sort of go back and think like what are the potential confounds or the potential risks in the study or what could go wrong and sort of thinking through these risks, and thinking about what's the results I want, what's the research questions I wanna answer I gotta just sort of keep iterating sort of through the design of the study."

The participants also indicated that this initial brainstorming contributes important elements to the experimental plans:

"...what the goal is to accomplish, what question that I’m trying to answer. ...thinking about the type of data that I want to gather and collect... what people arewe going to talk to, what software project are we going to look at... some of issuesof kind of sample is, what type of treatments we’re going to use, what type of metricswe’re going to calculates, what type of statistical tests we’re going to use, and. . . ...how can metrics that I can calculate the measures or main analyses, how can thataccurately represent, what I’m trying to capture the question, and then the typesof analyses you do, the types of reporting you’d include, and then I guess in theplan also include alternatives, so like, I might, let’s say I a pick three differencesstatistical methods to use, I might choose one, but in plan would be good to say whyI didn’t choose other two..."

During the meetings, the experts also ask themselves about important elements of the experiment. Table 4.12 presents some of these elements.


Table 4.12: Experimental plan elements from Experts

Goals
    Exp_Item_1   What I want to learn.
    Exp_Item_2   A couple of paragraphs that describe an abstract of the study.
    Exp_Item_3   The goal of the research questions.
    Exp_Item_4   The definition of the experiment.

Research Questions
    Exp_Item_5   What questions we are trying to answer.
    Exp_Item_6   What the context of the research questions is.
    Exp_Item_7   What the relevance of the research questions is.

Metrics
    Exp_Item_8   How we are going to measure the metrics that we define for the dependent variables.
    Exp_Item_9   What the measures are.
    Exp_Item_10  What the metrics are like.
    Exp_Item_11  What type of metrics we are going to calculate.

Context Selection
    Exp_Item_12  Where the experiment will be performed.

Hypotheses
    Exp_Item_13  What the research hypotheses are.
    Exp_Item_14  What the statistical hypotheses are.

Variables
    Exp_Item_15  What the independent, dependent, and contextual variables are.
    Exp_Item_16  How many variables the experiment needs.
    Exp_Item_17  Which factors affect my experiment.
    Exp_Item_18  How many factors the experiment should have.

Participants
    Exp_Item_19  How many participants will be involved in the experiment.
    Exp_Item_20  Who the participants are (participant information).
    Exp_Item_21  What knowledge these participants need to have to be able to do the tasks (procedures for identifying and selecting subjects); whether we have the right participants; what the inclusion criteria are regarding what the participants need to know; what type of experience the participants of the experiment should have.
    Exp_Item_22  What the demographic data planning is.
    Exp_Item_23  What the kind of sample is.
    Exp_Item_24  How it can accurately represent.
    Exp_Item_25  Power analysis: how many subjects we need in order to find statistical results (see the sketch after this table).

Assignment
    Exp_Item_26  How we assign specific subjects to specific treatments.
    Exp_Item_27  In what order we will assign the subjects and treatments.

Experimental materials
    Exp_Item_28  What software project we are going to look at.
    Exp_Item_29  What artifacts we are going to use in terms of objects / experimental objects definition.
    Exp_Item_30  Preparing the experimental material that will be used by participants during the data collection.
    Exp_Item_31  How many computers we have available and for how long.
    Exp_Item_32  Which materials we will use for the experiment: programs, specifications, code, whatever we need for running the experiment itself.
    Exp_Item_33  What forms (questionnaires/interviews) we need for the participants to follow.

Tasks
    Exp_Item_34  How long the tasks should be.
    Exp_Item_35  What information I should give participants before they start the tasks.
    Exp_Item_36  How many tasks we need for this experiment.
    Exp_Item_37  What the tasks are.
    Exp_Item_38  How much training the participants are going to need to be able to use the tools that you are working with.
    Exp_Item_39  Whether the training addresses the issues that they might potentially have in using the tool.
    Exp_Item_40  Whether the tasks are relevant to what you are looking at.
    Exp_Item_41  How long participants are going to take when they are doing the tasks.
    Exp_Item_42  What the scope is that you are going to do.

Experimental Design (procedure)
    Exp_Item_43  Detailed description of the experimental design.

Experimental Design (treatment)
    Exp_Item_44  What type of treatments we are going to use.
    Exp_Item_45  How we should organize the groups.
    Exp_Item_46  How many groups of participants the experiment should have.

Experimental Design
    Exp_Item_47  Which design we should follow for running the experiment.

Schedule
    Exp_Item_48  What the schedule is in which the experiment will be run.
    Exp_Item_49  How many hours/days we need to run the experiment.
    Exp_Item_50  How we have to organize these days.
    Exp_Item_51  Which things we will cover every day.

Piloting
    Exp_Item_52  What the pilot will be like.

Data Collection
    Exp_Item_52  What type of data I want to gather and collect.
    Exp_Item_53  How we will collect the data.

Analyses Procedure
    Exp_Item_54  What types of analyses we will do.
    Exp_Item_55  How to analyze the experiment data.
    Exp_Item_56  What statistical tests we are going to use and why.

Threats to Validity
    Exp_Item_57  What the threats to validity of the experiment are (internal, external, construct, and conclusion).

Debriefing
    Exp_Item_58  Preparing interviews or surveys after the tasks in order to collect more data.

Others
    Exp_Item_59  Cost of the experiment.
    Exp_Item_60  Quality.
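
Exp_Item_25 in the table above refers to power analysis: estimating how many subjects are needed before statistical results can be expected. As a hedged illustration only (no interviewee reported using this particular tool), the sketch below computes an a priori sample size with the statsmodels Python library; the effect size, significance level, and power values are assumptions chosen for the example.

```python
# A minimal a priori power analysis sketch (Exp_Item_25), assuming a
# two-group between-subjects design compared with an independent-samples t-test.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.5,  # assumed medium effect size (Cohen's d)
    alpha=0.05,       # significance level
    power=0.8,        # desired statistical power
)
print(f"Subjects needed per group: {n_per_group:.0f}")  # about 64 per group
```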


4.3.1.2 Writing and updating experimental plans

According to the interviewees, it is hard to achieve a relevant result without planning an experiment. They emphasized that it is important to have a well-written experimental plan.

"It is critical you have a document reporting all the details of the experiment."

Regarding how they document experimental plans, there is no unanimity. Some of them follow the sequence suggested by the empirical literature:

“I try to follow the sequence suggested by empirical literature.”

Other experts, however, do not follow the experimental process prescribed by such guidelines:

“I’m aware a lot of those guidelines that exist but I don’t really follow a set processas prescribed by them.”

However, all of the participants agree that the rationales of the experiment discussed in the meetings should be documented. This prevents the experimenter from becoming "lost in the middle" of the experiment and from being surprised by issues that can appear during its execution.

".. so, for design is important that not only the final design of experiment is in theexperimental plan, but should have also been an explanation of how to get the designand why you have chosen that design and not a different one. So I think it is animportant issue in an experimental plan."

4.3.1.3 Revising the experimental plan

Revising experimental plans was considered by several participants an important activity for finding mistakes in the experimental plan. One participant said that in this activity he revises different kinds of issues, including risks, participants, and training, among others.

" there is all sort of risks we can look at , you can look at , do you have the rightparticipants? , what knowledge do this participants need to know to be able to dotasks you have, are there enclosing criteria that you need to sort of reduce thoserisks, how much training are the participants going to need to be able to use thetools that you are working with, does that training sort of address the issues thatthey might potentially have in using the tool."

Most of them point out completeness as one of the acceptance criteria for the quality of the experimental plan.

"So the quality is the completeness, I think, as long as everything is in there, I thinkthe quality of the experimental should be good."


"Its completeness, its coverage regarding the phenomena which is being observed."

Additionally, they consider completeness to be the set of information that should be contained in the experimental plan so that the experiment execution is not negatively impacted.

"It should be complete. All the information should be in the experimental plan.”

Another factor that they frequently review is the rationale for the decisions they have made.

"I think going through each part of the research plan and making sure that there’s areason why we are taking each step, so there’s some justification, some rationale,for each step in the experimental plan."

4.3.1.4 Using Support Mechanisms

Some of the participants reported that they use support mechanisms to help them write experimental plans and plan the statistical analysis.

"I use literature when I do experimental planning."

" I think, so let’s go read papers or books that talk about that type of analysis andoutside of that should be part of my experimental plan."

Not all interviewed experts plan their experiments using the empirical literature, since some rely on their own personal experience; still, all of them agreed that the empirical literature is an important support, especially for beginner researchers.

" When we plan the experiments themselves, we do not follow any kind of formalguidelines, because at least the guidelines that we know are related to more to thereporting experimental resource."

One participant said that he usually follows guidelines to report or replicate experiments, but does not follow guidelines to plan them:

"When we are reporting experiments, we follow Jedlitschka guidelines, if you arereporting for example replication, we use Jeffrey Carver guidelines, but when weare doing the plan itself we do not follow any kind of guidelines. "

Another interesting finding is that most of the experts reported that they read similar studies, especially the lessons learned sections, to try not to make the same mistakes reported in previous studies.

"I'll look at how other researchers made their experiments, how they were organized, what was going wrong and right, the lessons learned section."


As mentioned before, not all of the experts use support mechanisms to guide their experiments; some perform the experimental activities based on their own experience. However, even in cases where they no longer use any support mechanism, we asked them to suggest good experimental materials for beginner researchers. Therefore, Table 4.13 is a mix of mechanisms that they habitually use and mechanisms that they have used in the past and have suggested to their students and interns.

Table 4.13: Support Mechanisms cited by the Interviewees

SMI_01  Wohlin, C. et al. (2012) [29], [21]. Experimentation in Software Engineering.
SMI_02  Juristo, N. and Moreno, A.M. (2010) [30], [23]. Basics of Software Engineering Experimentation.
SMI_03  Jedlitschka, A. et al. (2008) [81]. Reporting Experiments in Software Engineering.
SMI_04  Kitchenham, B. et al. (2002) [24]. Preliminary Guidelines for Empirical Research in Software Engineering.
SMI_05  Pfleeger, S.L. (1995) [45]. Experimental Design and Analysis in Software Engineering.
SMI_06  Zelkowitz, M.V. et al. (2003) [43]. Experimental Validation of New Software Technology.
SMI_07  Basili, V.R. et al. (1994) [46]. Goal Question Metric Paradigm.
SMI_08  Seaman, C.B. (1999) [106]. Qualitative Methods in Empirical Studies of Software Engineering.
SMI_09  Shull, F. et al. (2007) [128]. Guide to Advanced Empirical Software Engineering.
SMI_10  Shadish, W.R. et al. (2002) [54]. Experimental and Quasi-Experimental Designs for Generalized Causal Inference.
SMI_11  Cook, T.D. and Campbell, D.T. (1979) [53]. Quasi-Experimentation: Design & Analysis Issues for Field Settings.
SMI_12  Travassos, G.H. et al. (2002) [129]. Experimental Software Engineering - An Introduction (in Portuguese).
SMI_13  Ardelin Neto, A. and Conte, T.U. [130]. Identifying Threats to Validity and Control Actions in the Planning Stages of Controlled Experiments.
SMI_14  Lopes, V.P. and Travassos, G.H. (2009) [131]. Knowledge Repository Structure of an Experimental Software Engineering Environment.
SMI_15  Freire, M.A. et al. [132]. A Model-Driven Approach to Specifying and Monitoring Controlled Experiments in Software Engineering.
SMI_16  Travassos, G.H. et al. (2008) [31]. An Environment to Support Large Scale Experimentation in Software Engineering.
SMI_17  Arisholm, E. et al. [108]. A Web-based Support Environment for Software Engineering Experiments.
SMI_18  Carver, J. et al. (2014) [133]. Replications of Software Engineering Experiments.
SMI_19  Minitab, Inc. (2010) [134]. Minitab 17 Statistical Software.
SMI_20  SAS Institute Inc. (2007) [135]. JMP Statistics Tool.
SMI_21  R Development Core Team (2008) [136]. R: A Language and Environment for Statistical Computing.
SMI_22  IBM Corp. (2015) [137]. IBM SPSS Statistics.

SMI_01, SMI_02, and SMI_04 are the books most cited as guidelines by the interviewed experts; they reported that SMI_04 provides more of a checklist of things that should be carried out in experiments than the first two books. SMI_09 is a book cited as a resource that covers the common difficulties and challenges faced by software engineering researchers who want to design their empirical studies, providing guidance and information on research methods and techniques. SMI_05, SMI_06, and SMI_12 are books cited by the experts that provide general guidance for researchers. SMI_07 is a template widely used by the experts to structure the goal of an experiment, and SMI_08 is cited as a support for qualitative analysis. SMI_10 and SMI_11 guide them in the design of experiments and quasi-experiments. SMI_13 is a tool suggested to support the identification of threats to validity and of actions to control them in the planning stages of controlled experiments. SMI_14, SMI_15, SMI_16, and SMI_17 are proposed environments to support the conduct of experiments. SMI_19, SMI_20, SMI_21, and SMI_22 are statistics tools that the experts use to analyze quantitative data. SMI_03 is a guideline for reporting experiments in software engineering, and SMI_18 is cited as a guide for experiment replications.

4.3.1.5 Meetings with external researchers

The experts reported the practice of having their experimental plans reviewed by experienced external researchers who are not involved in the experiment planning. They have also presented their planning in workshops with graduate students, who are able to think critically and make relevant comments. The goal of these meetings is to "try to identify threats to validity, and things that could go wrong". As one participant noted, the advantage of this practice is simply "more people thinking."

4.3.1.6 Running pilots

Running pilots, even at a small scale, was cited as a good practice by the interviewed experts. They think that running a pilot is essential, especially for controlled experiments using human subjects. One participant said that running a pilot is an important step before the experiment execution:

"So I think piloting probably is the biggest and the most important step for me."

Also, during the pilot, experimenters can see whether the risks they imagined in the initial meetings are actually going to be a big issue. One expert said that he observes the consequences of the identified risks, while another observes how everything behaves:

"I think a lot of that again is trying to identify what are the risks in terms of notgetting results that speak to your research question and have you sort of piloting toiteratively check whether those risks are actually going to be a big issue or not."

" given what I want to learn. Here is what I want the participants to be doing. I’llrun a pilot study. It is that actually sort of, what’s happening, and based on whatthey are actually doing. I’m actually sort of seeing the activities that they want tosee that will help me sort of address the research questions."

When asked to give an example of mistakes they had committed, one expert answered:

"About mistakes: Not documenting properly the design, that was an importantmistake that we made. Another one was, that I mentioned about which rule we werediscussing or the design we have two choices. First we thought ok, this is the best

Page 86: LILIANE SHEYLA DA SILVA FONSECA · 2019. 10. 25. · Liliane Sheyla da Silva Fonseca AN INSTRUMENT FOR REVIEWING THE COMPLETENESS OF EXPERIMENTAL PLANS FOR CONTROLLED EXPERIMENTS

4.3. RESULTS AND PRINCIPAL FINDINGS 88

option but then we thought we could not make that choice because we need too manysubjects, too many experimental subjects, so we decided very fast we change thedesign and because we run a pilot we discovery that the change we make to decidewas wrong."

This example illustrates that writing a good experimental plan and running a pilot are valuable activities when researchers plan experiments.

Because experiment planning is an iterative process, it is difficult to say that an experimenter has a complete experimental plan until they run the experiment. Even so, running a pilot is a good practice for revealing many problems, including confusing experimental materials, procedures, tasks, questionnaires, and instructions. It is important to run the pilot under conditions as close as possible to those in which the experiment will be conducted, including the environment, participants, and materials, among others.

"Although there is never really a point where a researcher can say I have a completeplan for the experiment formulated until I actually run the experiment becauseis kind of always this iterative process so for me a lot of it is doing things likefine-tuning the tasks."

4.3.1.7 Discussing results from pilots

The experts reported that they discuss the results from pilots with their research group, because at this stage they are already aware of most of the problems that might arise. Depending on the pilot results, they decide what to do next: in some cases they hold new brainstorming meetings; in other cases, they simply update their experimental plan and run the experiment, accepting the risks, since they already know most of the threats to validity to which the experiment is exposed. They also said that at this stage, because they already know the trade-offs of running the experiment, it is easier to make changes to the experimental plan than during the experiment execution.

4.3.2 RQ2: What kinds of problems/traps do the experts fall into?

In this section, we present results regarding the problems and traps that the interviewed experts have experienced, including common mistakes and traps (Section 4.3.2.1) and problems and difficulties (Section 4.3.2.2) reported from their own work, from their students, or from external reviews in which they have participated. Although common traps and mistakes sometimes overlap with problems and difficulties, in order to better organize this section we distinguish them as follows: mistakes and traps are issues that happened without the experts expecting to face them, things they imagined they were doing correctly but later realized were mistakes or traps they fell into; problems and difficulties are issues that they know they have when planning experiments.


4.3.2.1 Common mistakes and traps

Although each experiment has its own particular traps, the expert participants reported some common mistakes and traps they have experienced, including choosing an unsuitable design, performing flawed statistical analysis, defining variables and confusing variable scales, forgetting about external factors related to the experiment, having bias or noise in the data, and not writing experimental plans well, among others.

Choosing the design most suitable for the experiment data analysis was a major mistake cited by several participants. For some, the most complicated part is deciding on the design, including how to assign experimental subjects to tasks.

In research question 1, in the section on running pilots (Section 4.3.1.6), the fourth quote, about changing the experimental design quickly and discovering through the pilot that this change was a wrong decision, describes a situation that occurred with a participant regarding the choice of design; the mistake was only discovered when they ran a pilot.

This example also makes clear the importance of running pilots before experiments are conducted: if the experts had not performed a pilot, they could have generated invalid results.

Another common mistake linked to the design is the statistical analysis. One participant said that not thinking about how to analyze the data before the data collection is one of the main mistakes researchers commit:

"We plan the whole experiment, we plan data collection, but we do not think about how I will analyze the data. So when we start to analyze the collected data, we do not analyze what we should actually analyze to answer the research questions. We analyze the data we have. So, it's a reverse engineering. 'I have this data. So, which are the options to analyze the data that I have?'. Then I analyze the data 'no matter what' based on data that we have instead of planning how to collect from the analysis of what I want to do."

Also, some of them emphasized that they have seen mistakes in experiment reports involving variable definitions, variable scales, and statistical tests, among others.

Another common trap cited by the experts was incompleteness. Having an incomplete experimental plan is a kind of mistake very commonly made by their students, because their inexperience makes them forget important elements that should be contained in the experimental plan. In addition, some elements are not trivial, such as external factors and confounding factors from participants, which are not easy to measure.

" Forgetting about external factors. Such as, the usual one is the experience of thesubjects, and what researchers usually do is try to measure this factor and try tobalance it but, in practice it’s impossible to perfectly measure the experience or thebrain of the subject.”


For other experts, including bias or noise in the data is another critical trap. Some experts said that researchers usually introduce bias unintentionally in several ways: some have found bias in group assignment, others in how the materials are used and in the order in which they are offered, and others have found bias in training. They are concerned about bias because it can lead to incorrect inferences in the study. One participant said:

" If you have a really large data set, you want to make sure that is not biased orhave noise in it, the systematic noise, but could cause you to make an incorrectinference."

Another mistake for which we found strong evidence in the results was not writing the experimental plan well. The experts said that many problems and mistakes occur because researchers do not plan their experiments correctly. One expert recognized the importance of a well-written experimental plan with respect to documenting the design: "Not documenting properly the design was an important mistake that we made." Still regarding the importance of recording design decisions, another participant gave us an example of the importance of a well-written experimental plan:

"So, when you do not document things that may happen to you that in the trainingmaybe, does not occur properly, is not properly done and then you have a lot inchanges, do you know? Problems than go after in the other, because then data maynot be reliable, your conclusions might not be reliable, you may have certain threatsto the validity of the results, because of not doing that. So, another example I gave ifyou do not properly document your design then your analysis of the data, maybe youwill do an incorrect analysis and then again your conclusions will not be valid, ok?So I see the main problem when we run the experiments when we analyze the data,so. It could be those tasks, kind of perform properly and you did not do correctly theplan."

Additionally, the experts reported mistakes and traps regarding not defining the research questions well, not properly setting the scope of the experiment so as to find the right balance in the amount of knowledge that one study can create, and the preparation of qualitative data collection, including having suitable interview questions. Although there is a vast range of traps and mistakes that an experimenter can make, one expert expressed the idea that two factors moderate them: the amount of risk in the study design, and prior experience with similar studies.

"I think the amount of traps that you're gonna fall into and the seriousness you have to sort of take them in sort of planning out and addressing naturally depends just on how much risks there is in the study design and to some maybe there are novelty in what you're doing, is it similar to a study that you've done before, a fairly standard study design or is it something that's different in some point and aspect that requires a little more thought and attention to make that work well."

4.3.2.2 Problems and Difficulties

In the same way as the subsection above, the experts were asked about the problems and difficulties they face when planning experiments, including having representative data, doing qualitative data analysis, finding participants, choosing the most suitable design, having problems with statistics, selecting the reference to be used as a guideline, and fitting the ideal experiment to the real experimental conditions.

Not having representative material, tasks, and data collection are problems usually faced by some of them. This also includes not having a representative participant sample, which leads to problems in generalizing the results. One expert told us that some studies performed by his company may not be representative of an open source environment:

"I think the biggest one is, I think this is an external validity [threat], which is thethreat of not being representative. I worked at <name of researcher’s company> .We do a lot of studies of product groups here at <name of researcher’s company>.That may not be representative of an open source product. "

Another common problem reported by some of them is preparing qualitative material and doing qualitative data analysis. One expert reported that he has faced problems in properly preparing the interview questions: sometimes he forgets to include important questions and has to return to the interviewees to ask about those issues, which is not a comfortable situation.

"Often, in the qualitative part, some questions are not so nice, that is, those questionnaires that we normally use to see the qualitative part, I think that is a point that something is not realized by us and we just realized it when we analyze the data. For example, because the guy said it here, we realize sometimes that we have to try to talk with the guy again to see if we understand better what he said."

Another common problem they reported is the difficulty of finding human subjects who genuinely want to participate in the controlled experiment. They said it is difficult to find people who are suitable for the tasks and also want to collaborate with the experiment. One participant's quote represents this concern:

"Everything is important in human centered experiments, but I’d say one of the mosttroublesome things for me has been recruiting participants who will enter into thespirit and they will. . . you’ll just end up using students because they are available.It can be quite hard to motivate students to really understand about experiments."


Choosing the most suitable design is also among the common mistakes and traps that the experts have experienced, and some of them emphasized that the design of the experiment is still one of the hardest parts of the planning.

" I’ll say again and again for me the most complicated part is to decide the design.how to organize the sections, how to assign experimental subjects to tasks and howmany experimental sections we will have, because we typically have... we typicallywork... if you will work between subjects experiment design that’s easier, but wetypically work because we don’t have too many subjects."

Another participant highlighted the problem of making mistakes in the experiment design:

" Because you cannot make mistakes, if you make mistakes in other parts you canalways fix it but, if you going to make the mistake when you design it and then runit and then you cannot come back and run it again. So because of this you mightat the most crucial part, then I need a lot of time and you need to ... The hard partagain is to balance the different threats to validity."

Another problem linked to experiment design is having problems with statistics. One participant stated that some students still have these problems when they analyze data.

"sometimes, poor analysis in terms of statistical analysis of the results and the datacollection "

Some of the participants also reported that their students have asked them:

" I have more than one reference to guide experiments. Can I use any reference?"

Usually, researchers who are planning their first experiments follow the guidelines used by their advisors, but they do not know why they follow that specific guideline and not another.

Also, the experts are concerned that several researchers, especially beginners, try to plan the ideal experiment, which most of the time differs greatly from the real experimental conditions they have. Assessing the trade-offs and completeness of experiments has been a challenge for them.

One expert recalled that in a similar situation he gave the following advice to his student:

"Although the other plan is better, it requires more people and materials. I take intoaccount the plan viability."

As a result, the expert thinks it is important to strike a balance between having a perfect plan and having a plan that fulfills their needs. Furthermore, experts face difficulties in planning experiments because experiment planning is an iterative process. Therefore, although they come up with an initial design that they think is the best one, it is common for them to make changes that will affect the initial design.


4.3.3 RQ3: How do experts learn about experiments?

This section presents results about ways of learning to plan and conduct experiments. The experts talked about how they learned in the past and how they continue to learn. Their answers are also a collection of advice for beginner researchers who are starting to learn about experiments. Figure 4.2 illustrates the ways of learning cited by experts, from processing empirical information through reading empirical literature, building understanding through discussions of papers, to applying the knowledge acquired by doing pilots and experiments. Indeed, this scheme can be used not only for experiments, but also for other empirical studies. The experts’ ways of learning include reading empirical literature, reading similar studies, seeing examples, reading empirical literature from other fields, receiving feedback from reviewers and experienced researchers, discussing papers, exchanging ideas with other researchers, observing other researchers design experiments, volunteering and participating in pilots and experiments, being involved with other fields of empirical research, doing pilots and experiments, helping other researchers to plan and conduct experiments, writing papers and technical reports, attending forums, having a good mentor, and taking courses and classes in software engineering as well as in other fields.

Figure 4.2: Ways of learning about experiments

We grouped the learning suggestions given by experts into six learning skills: learning through reading, listening, speaking, observing, practicing, and writing. Sometimes the ways of learning are not bound to a single skill; however, this does not become an obstacle to learning.


4.3.3.1 Reading

For experts, reading is the most common and most independent way of learning. Some of them reported that they have read a vast number of papers from the empirical literature, as in the quote below:

" I think in my experience, one of the best ways to learn about experimental planningis just to read a bunch of research and learn how people do things, how peopletackle challenges and also make judgment calls about what you like and what youdon’t like."

Others focus on reading similar studies, as one of the experts expressed:

"It’s useful to have looked at similar studies just to give you a sense for what normalstudy designs look like in the area that you’re considering doing a study in and it’suseful to take a look at guideline papers to just in terms of knowing about what areaspects of the study design."

Others highlighted the importance of learning through examples of similar experiments that were performed previously. A participant reported that presenting concrete examples of well done studies is the ideal scenario for students to learn how to do experiments. He said:

" The perfect scenario of learning experiments is if the empirical classes were repletewith examples of experiments... I think what enriching the learning process areexamples. You see how experiments are well-designed and well-conducted and usethem as examples."

Some of them also reported that they have read empirical literature from fields other than software engineering in order to learn how to plan and conduct experiments in other contexts. When one of the participants was asked how he learned about experiments, he answered as follows:

"Usually by the literature, research, papers, books (which are few), conferences andalso looking for experimentation in other areas that have already done experimentslonger than software engineering. You learn from other contexts in which you arenot inserted; seeking in other contexts. Sometimes you get examples in medicine."

According to the evidence below, some participants also indicated that another useful way of learning about experiments is receiving feedback from reviewers and experienced researchers.

"being in touch with the empirical community, receiving feedback from paper re-views..."


4.3.3.2 Listening and Speaking

Participating in paper discussions and exchanging ideas with other researchers are two helpful ways of learning about experiments. These practices were reported by some experts. One of them illustrated how he has done this with his students:

" I say take a look at these papers, list the things that you think they did well, as wellas the things that you think they didn’t do well and let’s discuss them. "

Another one reported the importance of exchanging ideas with other researchers in different ways, described as follows:

"Studying, being in touch with the empirical community, receiving feedback frompaper reviews, reading papers, attending to forums of empirical discussions, partici-pating in refresher courses."

4.3.3.3 Writing

Writing papers and technical reports is a good learning practice for two main reasons: first, for the researchers themselves, who can find in these documents all the decisions made in the experiment; and second, for other researchers who may want to continue the study later. As one participant suggested: "You have to take notes of everything and keep all the information," and another one stated:

"Particularly, I prefer to have more documented information even though I do notuse them immediately than 5 years later another researcher want to continue mystudy, and there is no documented information of an experiment that was done andit could be useful for them."

4.3.3.4 Observing

Observing experienced researchers design experiments is another efficient way of starting to learn about experiments described by some of the experts. One of them described the process of observation:

" First of all, it is just listen and see how people who have the knowledge plan anexperiment, maybe after two or three times you see them, maybe you can start doingthat."

Also, some experts guide their students to volunteer and participate in pilots and experiments, as stated below by one participant:

"the best way to teach students is participating. If they participate and writeexperiments, they will learn a lot."


However, observing is not just for beginner researchers. Some experts themselves also use observation as a way of learning about empirical studies. One expert stated that he is involved with other fields of empirical research in order to learn how they do empirical studies.

" I actually learned how to do empirical studies by going to a different field andlearning how they did it. I found that really useful. I think that’s probably atypical,though. I don’t think most people do that. I found it quite useful."

4.3.3.5 Practicing

Doing pilots and experiments is considered by all experts a good practice for mastering experiments. "You can not learn without doing an experiment. You can not learn just reading about them," said one expert. Another made this point more strongly when he said "definitely the best way to learn is of course a piloting." After talking about the whole planning process, another participant emphasized his routine of running a pilot before conducting a study.

"...given what I want to learn, here is what I want the participants to be doing, I’llrun a pilot study. "

Another way of learning about experiments in practice is helping other researchers to plan and conduct experiments. One expert encourages his students to take advantage of being in an experimental research group, where opportunities to help colleagues design experiments are very common.

" By practicing. The best way to learn is by doing the task or if you refer for exampleto Ph.D. students, the best way to learn would be he or she attends meetings wherepeople talk about designing others’ experiments. So definitely the best way to learn."

Under the observing skill, we reported the observation of empirical studies in fields other than software engineering as a good practice adopted by some of the expert interviewees. Linked to this practice, they also pointed out that they usually take courses and classes not only in the software engineering field but also in other fields. One expert stated that when he was a graduate student he took quantitative courses in the sociology field, as quoted below:

"When I was in grad school I actually took courses in sociology. Quantitativecourses in sociology, as a grad student, because they have a longer history of doingexperimental design, doing statistics, and things, and a rigorous and well acceptedway."

Because learning is an arduous process, some participants think that having a good mentor is another efficient way of learning.


" I think the mentoring process that happens is really the most important step there, Imean having somebody that knows study design and has run studies like that beforeis really the most important thing in training."

Also, some of them noted that the experience of working with somebody who has done experiments before is a key component of learning how to do studies well. The quote below expresses the idea that although having a mentor does not prevent researchers from making mistakes, mentoring is really important.

" Having a good mentor to give you feedback as you’re going through and actuallydesigning a plan, you’re inevitably going to make some mistakes and just havesomebody that can give you feedback on that is really important."

4.3.4 RQ4: What gaps do experts have in their knowledge?

In this section, the participants reported gaps in their own knowledge and gaps that they have found in the empirical literature, as well as things that they have not been able to find but would find useful. We split this section into three subsections: (1) gaps in the experts’ knowledge, (2) gaps in the empirical literature, and (3) supports that experts would like to have.

4.3.4.1 Gaps in experts’ knowledge

The most commonly reported gaps in experts’ knowledge were doing statistical analysis, choosing the most suitable experiment design, and doing qualitative analysis.

For most participants, statistical analysis was reported as a complex part because software engineering researchers have not been well trained in statistics. This is also noticeable because it is easy to find issues in published experiments. One participant said:

" So, definitely, for me at least statistically is the part that is most complex becauseyou think that things are solved, the things are clear, but still do. Go to papers andfind contradictory information."

They have also faced gaps in choosing the experiment design, because many aspects must be considered in that choice. One expert said:

"The design, the design is worst part in the planning."

However, there is some variation here, because a few other experts reported that they have faced challenges in terms of how to do qualitative data collection and qualitative data analysis. One participant said:

“I feel like I have a pretty decent handle on quantitative stuff but all aspects of qualitative, I still need to improve.”


4.3.4.2 Gaps in empirical literature

For some experts, statistics comes up again as a gap in the software engineering empirical literature. They think that statistics books from the mathematics field are difficult for software engineering researchers to handle, and that the statistics sections in software engineering books are not clear. One expert gave his opinion about the statistical part of software engineering books:

"The statistical part, the software engineering books do not bring it very clearly."

Another gap reported by some participants is the lack of publications on research planning rationales. They think that it would be useful to have experiment protocols available so that researchers can see the rationales behind the decisions made in experiment design. One expert expressed his thoughts about that as follows:

" I think what we need is a systematic compilation of past studies but, that clearlyfocuses on why people took different decisions. Because all the books that reportexperiments like, this a good example, but there is no discussion about why someonedid like this and someone else did like that. So somehow differences can be good.What is now, yes, we have a set of experiments but the rationale behind the decisionare not well explained. I think the victim here is the rationale that is behind thedecisions. I think this is what is missing now from the literature."

In the same way, the lack of publication of negative results is also seen as a gap in the empirical literature. One expert expressed his desire to see negative results being published. He argues that his knowledge cannot be built only from positive results.

"I really miss, for example, the availability of negative results. My knowledge cannot be built only from positive results, it has to be built from negative also."

Additionally, because developing interesting research questions is a good starting point for obtaining interesting results, some experts think it is important to have resources that address how to formulate research questions, while others have missed the presence of well defined empirical standards in software engineering.

4.3.4.3 Supports that experts would like to have

The participants also indicated supports that they lack in empirical software engineering. Some of the supports were suggested for their own needs as experienced researchers, while others were suggested for use by inexperienced researchers, or anyone who is planning and conducting their first experiments. Some would like to have a well defined taxonomy to support empirical studies.


"I suppose if you had some kind of taxonomy of different types of research method-ologies, or experimental methodologies and the common ways that people do thingsin the common traps that people fall into."

Others would like to have tools with which experimenters could simulate an experiment.

"If there were tools where he (experimenters) could simulate an experiment and theywould simulate such participants and generate a result, and it would enable him tomake an analysis. I think it would be an interesting thing for learning."

Another one would like to have a repository with a series of examples of experiments, and a single place where they could find various kinds of information about experiments.

"A series of examples."

"I would like to have a single place where I could find various information aboutexperiments. Today, if I want, I have to get three, four, five books, from two, three,four researchers. I could have, like, a repository... I think that literature itself hasall the information we need, the problem I do not think are in the literature, I thinkit’s in the access we have this information and how these information is available."

Some would like to have a catalog of experiments.

"I think a catalog would be very useful."

Others would like to have a checklist that helps researchers understand when a plan is complete, which could include a set of steps that directs beginner experimenters in planning experiments, or something that clarifies the planning sections from the empirical literature, including the necessary elements that an experimental plan should contain.

"Having a checklist that can help him (researcher) to understand if that plan orexperiment was completed."

Lastly, some of them would like to have a tool that, given a set of characteristics of the experiment, returns a set of examples of similar experiments.

"I would try to develop an environment or something that understands what oneresearcher wants and it comes back with a list of a set of examples, things that havebeen done. So if I want to do a qualitative research, the environment would offerexamples of protocols of qualitative research studies, how the data were analyzed,and how the data were collected."


4.3.5 Perceptions of experiment planning in empirical software engineering

When asked what an experimental plan is and what main elements it should contain, the experts defined experimental plans as an iterative protocol, a design of the study, where everything important is reported.

One expert said "the experimental plan should contain everything related to the experiment", which means it should contain all the necessary elements, materials, procedures, and decisions to investigate a target phenomenon, including guiding the experimenters through all the different steps of the experimental design.

Ideally, they see experimental plans as something that, in the case of another researcher wanting to run the same experiment, they will understand and will be able to use to replicate the experiment. Additionally, all of them see experiment planning as extremely important. One participant stated that if experimenters do not have a good plan, they are not going to conduct a successful experiment, which could lead to inconclusive results. For example, one of them compared the activity of a conventional engineer with the activity of a software engineer:

"For an engineer, it is impossible to do engineering without planning....Planningis essential to develop any product. An experiment is a product, it is a target to beachieved."

Some of them emphasized that it does not matter how expert a researcher may be: building an ad hoc experimental plan raises a lot of issues because of the large number of decisions that must be made.

The experts further stated that not planning experiments increases the number of threats to validity in them. However, some of them argue that researchers are too concerned about performing "perfect" studies. One said:

“Everybody is always concerned about doing studies because it’s perceived to be a lot of effort to run one of these studies and plan one of these studies and I think perhaps rather than thinking about how do I run the perfect study, how much work do I really need to do to plan a really really great study, it would be better if people just started running studies even if they’re not perfect, even if they’re simpler.”

Another expert also thinks that "reviewers are too rigorous in terms of researchers having to follow exactly the instructions outlined in guidelines"; they do not consider the importance of an experiment even though it has limitations. Another expert confirmed the thoughts of the previous experts. He said:

"I think the basic problem today is that perhaps that everybody sees studies asjust being really really hard to run and currently that’s just also reaction where

Page 99: LILIANE SHEYLA DA SILVA FONSECA · 2019. 10. 25. · Liliane Sheyla da Silva Fonseca AN INSTRUMENT FOR REVIEWING THE COMPLETENESS OF EXPERIMENTAL PLANS FOR CONTROLLED EXPERIMENTS

4.3. RESULTS AND PRINCIPAL FINDINGS 101

everybody that’s reviewing studies and program committees like always wants thestudies to be perfect and thinking less about, is this study methodologically perfectand more about what have you really learned by running the study even if it haslimitations, even if like there’s questions about external validity in terms of do theresults really generalize. "

Regarding the time spent planning an experiment, the vast majority of the participants think that it does not make sense to track the time spent planning experiments unless researchers want to report it. What is important is to plan correctly; sometimes this can take days, weeks, months, or even years.

The experts were also asked for their opinions about the empirical software engineering community, regarding whether experiment planning is neglected or left out by software engineering researchers. One of the experts thinks that everybody does some planning, although how much depends on each individual case:

" Some do a really good job. Some do not do a very comprehensive job. I thinkeverybody does some planning at some level but it may be some aren’t very completeand sometimes they are incorrect. I have read papers where it’s clear the peopledidn’t think things through. It’s really on a case-by-case basis. I mean, again, wecould be improving. Some neglect it, some don’t."

In general, they recognized the existence of problems in empirical software engineering, as one expert said:

“I do not think that it is neglecting, it is just a matter of you making a mistake”

Also, the term “neglect” was seen as too strong. Another expert said:

" I don’t think they neglect. I think it’s a complex problem. I think that with theabsence of standards and for other reasons we cannot be sure of its quality."

Although they pointed out several issues in experiment reports, such as the immaturity of the research questions, hypotheses, statistical analysis, and experiment design, among others, they recognized that the experimental software engineering community is progressing.

“I think the empirical software engineering community is otherwise doing not too bad a job and has a very good handle on it.”

Furthermore, encouraging researchers to run experiments, even on a small scale, would make experimental software engineering more mature.

"I think just to be able to run more studies especially smaller studies would be a bigstep for our field because there’s so much we don’t know and the more studies wecan run, the more opportunities we have to learn more about software engineering.”


4.4 Discussion

This study is the first to use qualitative analysis to explore what experts actually do when they design their experiments, the problems and traps that they have already experienced, how they teach and learn about experiments, and what gaps they have in their knowledge. Additionally, we also addressed their opinions on the experiment planning phase regarding its definition, relevance, and quality issues.

We identified that experts design their experiments by carrying out six activities: (1) brainstorming experiment rationales, (2) writing and updating the experimental plan, (3) revising the experimental plan, (4) holding meetings with external researchers, (5) running pilots, and (6) discussing results. From that, we developed a conceptual model, Figure 4.1, to illustrate the experiment planning activities performed by experts, in order to help inexperienced researchers plan their experiments. In the empirical software engineering literature, the experiment planning process follows a set of important steps such as context and variables definition, sample selection, hypothesis formulation, choice of design type, instrumentation, and validity evaluation [21]. Our results show that although the experts include these steps in their experiment planning, they also carry out activities that allow them to critically and efficiently think, write, review, run (even on a small scale), and discuss the experiment that they are developing. Therefore, the findings are an addition to the experiment planning process found in the empirical literature, because these activities can help researchers to plan their controlled experiments efficiently.

Regarding ways of learning, the experts reported classical ways of learning such as taking courses, reading and discussing empirical materials, writing papers, and being in touch with the software engineering community. Beyond these, however, they highlighted the process of observing, practicing, and having a good mentor as useful and efficient ways to train new researchers. They also highlighted the use of examples when teaching as well as learning, and the execution of pilots, which builds confidence because this practice makes them solve different kinds of problems and consequently learn more about the phenomena they are studying. Figure 4.2 illustrates the set of ways of learning about experiments reported by experts. These findings are a valuable contribution for teachers as well as for students of empirical software engineering courses. For teachers, they mean demonstrating how to carry out experiments through different kinds of examples, encouraging the discussion and presentation of experimental plans, and encouraging the publication not only of experiment reports but also of experimental plans in order to build a body of knowledge. This should include lessons learned from successful and unsuccessful plans, which would be a great step for empirical software engineering education. For students, the findings mean being involved in experiments through planning and conducting their own as well as their colleagues’ experiments, that is, learning from their mistakes and the mistakes of others. Furthermore, publishing negative results with the lessons learned, and publishing research planning rationales, could also be an efficient way of learning how to conduct experiments, or perhaps we should say, how not to conduct experiments.

Regarding, first, the problems and difficulties, second, the mistakes and traps, and third, the gaps both in the literature and in the experts’ knowledge, we also illustrated the main findings in a diagram (see Figure 4.3). We observed that some reported results are related to more than one class of issues. For example, we noted that issues such as choosing a suitable design, doing statistical analysis, preparing and analyzing qualitative methods, and finding incompleteness in experimental plans were cited in all three classifications above.

Figure 4.3: Intersection of reported results – Problems, Mistakes and Gaps.

It might also be noted that the completeness of experimental plans is considered an important acceptance quality criterion. Incompleteness was reported by experts as a recurring problem, especially for novice researchers, because they forget common important elements and make mistakes in experimental plans and reports. In order to avoid this, the experts believe that experimental plan reviews are extremely important. Usually they review their experimental plans within their research group and through meetings with external researchers whenever possible. Although the experts agree that the point at which researchers can say that they have a complete plan is when they actually run the experiment, revising the experimental plan is good practice to minimize bias, mistakes, and missing elements. This revision is usually performed before they run a pilot, in order to take better advantage of pilot resources.

On this issue, some experts pointed out that having a checklist which helps inexperienced researchers to remember missing elements, identify bias, and find mistakes would be a good support mechanism to avoid the basic mistakes found in experimental plans and reports today.


In addition, although we did not directly ask them about things that they would like to have in the empirical software engineering field, some experts said what they wanted when they reported the gaps, mistakes, and problems they have experienced, namely: a consensual, well defined taxonomy; a tool which, given a set of characteristics of the experiment, returns a set of examples from similar experiments; a repository of experiments; well defined standards in experimental software engineering; a catalog of experiments; experiment simulation tools; and checklists for checking the completeness of experimental plans.

Some initiatives addressing some of these things can be seen in the software engineering literature. Table 4.14 presents the things suggested by some experts, along with research that is focused on these issues. For example, some researchers have been working towards simulation tools [138], [31], tools for conducting controlled experiments in software engineering [139], [132], a repository of experiments [131], and a glossary of terms started by the ISERN (International Software Engineering Research Network) 1 community in 1998. Regarding catalogs of experiments, our empirical research group at UFPE conducted a systematic mapping study [26], [27], one of whose contributions was to provide a catalog of support mechanisms available to the software engineering community interested in empirical studies. This contribution can be seen in Chapter 3.

1 http://lens-ese.cos.ufrj.br/wikiese


Table 4.14: Things that experts would like to have versus studies in the literature towards these suggestions

Suggestion | Studies in the literature towards the suggestion
Simulation tools | Reporting guidelines for simulation [138]; an experimentation environment to support large-scale experimentation [31]
Tools for conducting controlled experiments in software engineering | A framework which includes a set of integrated tools to support experiments in software engineering [139]; a model-driven approach to specifying and monitoring controlled experiments [132]
Repository of experiments | Knowledge repository structure of an experimental software engineering environment [131]
Consensual and well defined experimental software engineering taxonomy | A glossary of terms started by the ISERN (International Software Engineering Research Network) community
Catalogs of experiments | A catalog of available support mechanisms for empirical studies, described in Chapter 3, based on a systematic mapping study [26], [27]

Although there are contributions in the areas described above, regarding checklists to confirm the completeness of experimental plans, we identified only checklists for experiment reports in the software engineering literature.

As a result, the evidence of the importance of review, of completeness as an important acceptance quality criterion for experimental plans, of the frequent omission of important elements in experimental plans, and of the experts’ suggestion of having a checklist to support experimental plans, motivated us to develop an instrument for reviewing the completeness of experimental plans for controlled experiments using human subjects in Software Engineering. The proposed instrument definition and its evaluation are described in Chapters 5 and 6, respectively. The development of the proposed instrument addresses some of the problems and mistakes reported by experts, including avoiding bias and mistakes in the experiment, preventing the omission of important elements (including external factors such as human aspects and ethics), encouraging researchers to include the rationales for their decisions, and making researchers consider whether they really have relevant research questions, the right participants, representative data, and the most suitable experiment design, and whether their ideal experiment fits the real experimental conditions they have.


Finally, although planning and conducting controlled experiments requires a lot of effort, we have learned from the experts that, rather than having a perfect study as a target, it would be better if researchers just started performing experiments, even if they are simpler and not perfect. We have come to the conclusion that the basic problem in doing experimentation in software engineering today is that everybody sees experiments as very difficult to run and, currently, researchers who are reviewing studies want them to be perfect. It is important that we think less about whether a study is methodologically perfect and more about what we have really learned by carrying out experiments, even if they have limitations and there are questions about external validity in terms of the generalization of the results. We agree that if researchers were able to run more studies, especially smaller ones, it would be a big step for our field, because there is so much we do not know: the more studies we can run, the more opportunities we have to learn about software engineering experiments.

4.5 Trustworthiness of this Study

We addressed several criteria for ensuring trustworthiness in this qualitative interview study, including credibility, dependability, and transferability.

Regarding credibility in terms of research bias, we were not interested in fitting our data into any particular conclusion. On the contrary, we wanted to conduct a well designed qualitative study which captures how experts actually do their experiments in practice, and to present the results to the empirical software engineering community. The participants were selected based on their experience in conducting experiments in software engineering and their relevant publications in the experimental software engineering field; after that, they showed their willingness to be interviewed. We were careful to prepare the qualitative interview questions using the GQM template in order to take maximum advantage of this interview opportunity. We were also careful to avoid leading questions that might cause embarrassment or harm the participants’ reputations. In terms of data collection and data checking, all interviews were audio-recorded and transcribed in order to capture the participants’ contributions accurately. All transcripts were checked by another researcher who was a speaker of the language that the participants used in the interviews, that is, Portuguese interviews were checked by a Brazilian researcher, and English interviews by an American researcher. In terms of additional peer review, the quotes used in the results from Brazilian experts were translated and reviewed by another researcher who has language skills in both Portuguese and English.
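
For reference, the standard GQM goal template mentioned above structures a study goal along five dimensions. The instantiation below is a hypothetical illustration of how the goal of this interview study could be phrased, not the exact wording used:

    Analyze                the experiment planning practices of expert researchers
    for the purpose of     characterization
    with respect to        planning activities, common problems, and ways of learning
    from the viewpoint of  the empirical software engineering researcher
    in the context of      controlled experiments with human subjects.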

Regarding dependability in terms of consistency, we recorded each step on spreadsheets in case someone else wants to audit the findings; these records can be used as audit trails by external reviewers. In terms of coding and recoding, we coded and recoded the data several times, both during the open coding and the axial coding phases, in which we compared the various sets for completeness and consistency of data.

Regarding transferability, we provided descriptions of the context in which the research was performed.

4.6 Chapter Summary

This study aimed to offer insights into how experienced software engineering researchers actually plan their experiments. Interviews were conducted to understand how they deal with experimental issues in a practical way. As a result, we developed a conceptual model of how they actually plan experiments, including the common activities which help them to achieve satisfactory results. We have also provided an overview of the ways in which experts learn about empirical studies. We presented the list of common problems that they have faced, the mistakes that they have already experienced, and the gaps both in their knowledge and in the empirical literature, as well as things that they would like to have in the empirical software engineering field. Finally, we presented the experts’ opinions on experiment planning in empirical software engineering. In the next chapter we present the proposed instrument, describing in detail the steps taken to achieve the goal of this thesis.


5 Instrument Development

In this chapter, we propose an instrument for reviewing the completeness of experimental plans for controlled experiments using human subjects in the area of software engineering. As mentioned before, the goal of this research is to facilitate the identification of conceptual, methodological, and ethical problems, among other issues, in experimental plans, and to allow adjustments and improvements before a controlled experiment is carried out. It is also necessary to emphasize that the instrument does not intend to replace the experimental software engineering best practices consolidated by the empirical community, but rather aims to help inexperienced researchers to review their experimental plans more easily by using existing mechanisms in a better way.

Figure 5.1: Experimental Process

Figure 5.1 illustrates the complete experimental process, modeled using BPMN (Business Process Model and Notation) [140]. The proposed instrument should be used at the diamond, an important control point in the experimental process where the experimental plan should be assessed. At this point the researchers should decide whether the experimental plan should be revised or whether they should proceed using that experimental plan. The experimental plan can be reviewed by the researcher who developed it, or by another researcher who is not involved in the study.
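
To make the role of this control point concrete, the loop below is a minimal Python sketch of the review gate, using hypothetical names (plan, checklist, revise); it illustrates the decision logic at the diamond in Figure 5.1, and is not part of the instrument itself:

    # Minimal sketch of the review gate (diamond) in Figure 5.1; all names are hypothetical.
    def review_gate(plan, checklist, revise):
        """Assess an experimental plan until no checklist item remains unsatisfied."""
        while True:
            open_issues = [item for item in checklist if not item.satisfied_by(plan)]
            if not open_issues:
                return plan                   # proceed: run the experiment with this plan
            plan = revise(plan, open_issues)  # revise the plan and assess it again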


This chapter is organized as follows. In Section 5.1, we first define the terms used by the instrument and its context of usage. In Section 5.2, we explain the instrument development methodology, and in Section 5.3, we present the instrument specification. The initial version of the proposed instrument is presented in Appendix B. After the development process described in this chapter, the initial version of the proposed instrument was assessed from different perspectives and by empirical software engineering researchers at different levels of experience, as described in Chapter 6. During and after the evaluation, modifications were applied to the instrument; the final version of the instrument is also included in Chapter 6. Finally, Section 5.4 presents the chapter summary.

5.1 Definitions

Before presenting the instrument, this section delimits the scope of this research. The instrument is a checklist whose focus is on reviewing the completeness of experimental plans for controlled experiments using human subjects in software engineering. In this section, we explain the definitions of these terms for the reader’s better understanding of the development and usage of the instrument.

Definition of Checklist

Checklists are instruments based on quality items and are not scored numerically. This type of instrument is generally composed of a sizeable number of quality-related questions (items) with yes/no answers, e.g., Is there a well-defined question? Are the results generalizable to the setting of interest in the review? [19]. Although there are other approaches [74], [76] for checking how adequately an experiment is planned, such as quality scales, we chose the checklist approach rather than quality scales because we are not interested in giving scores to the items, for two reasons: first, because the items tend to be subjective, and second, because the relevance of the items can differ from experiment to experiment.

Furthermore, the usage of checklists for reviewing experimental plans allows researchers to go through their own experimental plans directly and systematically, to avoid mistakes and be reminded about important missing elements.

Although checklists and quality scales have not been thoroughly evaluated for their ability to effectively assess the quality of experiments [74], there are many quality assessments described in the literature that use them in several fields, such as software engineering [12], [39], medicine [76], [86], [87], [95], [91], [90], social sciences [53], psychology [98], and education [97].
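
As a concrete illustration of the yes/no, non-scored nature of such items, an individual checklist item can be represented as in the following Python sketch (a hypothetical structure for illustration, not the actual instrument):

    # Hypothetical representation of a yes/no checklist item; no numeric score is kept.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class ChecklistItem:
        question: str            # e.g., "Is there a well-defined research question?"
        answered_yes: bool = False
        note: str = ""           # optional reviewer remark on what is missing

    def open_items(items: List[ChecklistItem]) -> List[ChecklistItem]:
        """Items still answered 'no': candidate gaps or omissions in the plan."""
        return [item for item in items if not item.answered_yes]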

Definition of Reviewing

Reviewing is a relevant activity to catch mistakes, omissions, and missing elements; it increases the chances of identifying issues. The proposed instrument is a tool for reviewing the key elements in experimental plans, which allows them to be reviewed by their authors or by someone else with the purpose of improving their contents before the experiments are conducted.


Definition of Completeness

Completeness is not just one of the quality attributes of a well-made experimental plan, but also the quality acceptance criterion most cited by experienced empirical software engineering researchers (as found in the qualitative interview study described in Chapter 4). Experimental plan completeness concerns the presence of key elements that are likely to reduce bias and increase internal validity.

Definition of Experimental Plans

As mentioned before in Chapter 2, an experimental plan is a document where experimenters define the study plan, including the specific procedures and techniques to be used in conducting an experiment.

Definition of Controlled Experiments

This research uses the definition of controlled experiments given by Wohlin et al. [21]. They state:

"Experiment (or controlled experiment) in software engineering is an empirical enquiry that manipulates one factor or variable of the studied setting. Based in randomization, different treatments are applied to or by different subjects, while keeping other variables constant, and measuring the effects on outcome variables. In human-oriented experiments, humans apply different treatments to objects, while in technology-oriented experiments, different technical treatments are applied to different objects." [21]

Definition of Human Subjects

According to the Code of Federal Regulations 1 of the United States Department of Health and Human Services, a human subject is defined as a living individual about whom a research investigator (whether a professional or a student) obtains (1) data through intervention or interaction with the individual, or (2) identifiable private information, which includes a subject’s opinion on a given topic.

Software engineering is a research field largely based on human activities and interactions, because software engineering technologies are developed to be used by people. However, human interactions make experiments difficult to plan and evaluate, because it is not possible to generate accurate models of human behavior, as in mathematics and physics [21]. The large number of human variables present in an experiment can generate threats to the validity of the results. The main difficulties with experiments with human subjects are finding appropriate participants and managing research participants, including protecting their rights and informing them of their rights [62].

The proposed instrument is focused on helping experimenters to remember important issues and on suggesting guidelines to protect their human subjects from any form of harm, thus contributing to increasing the reliability of research results.

1 Public Welfare Protection of Human Subjects, 45 C.F.R. § 46.102 (2015)


5.2 Instrument Development Methodology

The instrument was developed by adapting the methodology of Host and Runeson (2007) [84] to our context. All steps were supervised by another researcher in order to check the reliability of the data.

The instrument was derived in five steps:

Step 1. Data collection from existing empirical checklists and guidelines, and experts’ experience.
Step 2. Standardizing experimental process phases and classifying items by phases.
Step 3. Grouping related data within each experimental phase.
Step 4. Formulating the instrument items and associating recommendations with each item.
Step 5. Pre-validation of the instrument items.

Each step is explained in the following sections.

5.2.1 Data collection from empirical checklists and guidelines, and experts’ experience

To develop the instrument, we used the results of the systematic mapping study (see Chapter 3) to collect the support mechanisms used in the experimental software engineering community to design and conduct experimental activities. We also used the ad hoc literature review (see Chapter 2) to complement the collection with empirical guidelines from other fields, to collect the most common checklists used in software engineering and other fields, and to identify important human-related factors in empirical research and ethical guidelines. Another important input was the qualitative study (see Chapter 4) that we conducted to collect experts’ experience as the basis for organizing the items and recommendations. As a result, the proposed instrument was developed through the collection and integration of existing checklists, experts’ experience in conducting controlled experiments, and guidelines for conducting empirical research in software engineering and other fields. In addition, we included guidelines for reporting controlled experiments, as reporting guidelines often discuss the key elements in experiment planning as well as the relevant factors to consider related to human subjects. Reporting guidelines also often discuss ethical concerns.

The data collected were organized in individual spreadsheets, one for each kind of source, namely guidelines, checklists, experts’ experience, human factors in empirical research, and ethical concerns. We began with existing guidelines and checklists, and added data from the other categories later, in Step 4, to help fill gaps and provide input into the “things to consider” and recommendations for each item as necessary. During the analysis of these sources, we focused on the experimental goal setting and experiment planning phases. Redundancies were inevitable at this stage of the analysis, so we tried to collect all the different perspectives from our sources in order to narrow down the items, in a subsequent stage, to the main components of the experiment planning phase. Because checklists are composed of items, we included all the items found in the checklists we examined, to be able to perform the process of narrowing down and removing duplication later. From the guidelines, we directly extracted the potential items. The resulting list contains information about the data source, the experimental process phase to which the data is related, data descriptions, and recommendations, at least in the cases where the sources provided all this information (otherwise we used N/A for the missing information). The complete data collection list resulting from this step is in Appendix C.
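
To illustrate the structure of these spreadsheet entries, each collected item can be seen as a record with the four fields just mentioned; the Python sketch below is a hypothetical rendering of that structure, not the actual spreadsheet format used:

    # Hypothetical record for one entry in the data collection spreadsheets.
    from dataclasses import dataclass

    @dataclass
    class CollectedItem:
        source: str                  # e.g., a checklist or guideline ID such as "CSE_7"
        phase: str                   # experimental process phase the item relates to
        description: str             # the candidate item extracted from the source
        recommendation: str = "N/A"  # "N/A" when the source provides no recommendation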

Information about the sources we used in this step can be found in Table 5.1 for guidelines, Table 5.2 for checklists, and Table 5.3 for human factors and ethical concerns sources.

Table 5.1: Guidelines Sources

ID | Domain | Sub-Domain | Reference | Purpose
GSE_1 | SE | Empirical research | Wohlin 2000 [29] | It gives a well-structured and easy-to-understand introduction to experimentation for software engineers.
GSE_1 | SE | Empirical research | Wohlin 2012 [21] | It is the update of Wohlin 2000 [29].
GSE_2 | SE | Controlled experiment | Juristo & Moreno 2001 [30] | It describes in-depth methods for advanced experiment designs and data analysis, with examples from software engineering. The authors provide a list of “most important points to be documented for each phase” in the form of “questions to be answered by the experimental documentation”.
GSE_2 | SE | Controlled experiment | Juristo & Moreno 2010 [23] | It is the update of Juristo & Moreno 2001 [30].
GSE_3 | SE | Experiment | Pfleeger 1995 [45] | This study presents the key activities necessary for designing and analyzing an experiment in software engineering.
GSE_4 | SE | Controlled experiment | Jedlitschka 2008 [81] | It is a unification of a set of guidelines for reporting experiments in software engineering, and the last version of the guidelines for reporting experiments in SE by Jedlitschka.
GSE_4 | SE | Controlled experiment | Jedlitschka 2005a [141] | Jedlitschka et al. presented a first version of a guideline for reporting controlled experiments (2005a) during a workshop on empirical software engineering (Jedlitschka, 2005).
GSE_4 | SE | Controlled experiment | Jedlitschka 2005 [20] | Description in the row above.
GSE_4 | SE | Controlled experiment | Jedlitschka 2005b [142] | Feedback from the workshop participants, as well as from peer reviews, was incorporated into a second version of the guideline (2005b).
GSE_4 | SE | Controlled experiment | Jedlitschka 2006 [143] | In parallel, the guideline was evaluated by means of a perspective-based inspection approach (Kitchenham et al., 2006). This evaluation highlighted 42 issues where the guideline would benefit from amendment or clarification, and eight defects. The feedback from the perspective-based inspection and discussions with its authors led to a second iteration of the guideline, where the amendments were incorporated where appropriate and the defects were removed (Jedlitschka and Ciolkowski, 2006).
GSE_4 | SE | Controlled experiment | Jedlitschka 2007 [144] | Additional feedback from individual researchers was also incorporated (Jedlitschka et al., 2007). This is a preliminary version of a chapter in Shull, F., Singer, J., and Sjøberg, D.I. (eds.), Advanced Topics in Empirical Software Engineering, Springer, 2007.
GSE_5 | SE | Controlled experiment | Ko et al. 2015 [17] | It is a practical guide to controlled experiments of software engineering tools with human participants.
GO_1 | Education | Experiment | McGowan 2011 [145] | Planning a comparative experiment in educational settings.
GO_2 | Education | | McCall 1923 [146] | It is the first book on educational experimentation published in the USA. It presents concrete examples of experimental problems, and its purpose is to present the methodology of educational experimentation in a practical form.
GO_3 | Psychology | | Martin 2008 [147] | This book guides students through the experimentation process in a step-by-step manner, teaching them how to design, execute, interpret, and report on simple psychology experiments.
GO_4 | Psychology | | Robson 1994 [55] | It describes how to design, carry out, analyze, and interpret simple psychological experiments. It shows ways of finding the statistical test appropriate to the problem, the reasoning behind the choice, and how to present the results as simply and clearly as possible.
GO_5 | Statistics | | Oehlert 2010 [148] | It is a book which proposes a course in the design and analysis of experiments.
GO_6 | Statistics | | Montgomery 2013 [149] | This book helps in the design and analysis of experiments.


Table 5.2: Checklist Sources

ID | Domain | Sub-Domain | Reference | Purpose | No. Items | Recommendations

CSE_1 | SE | Experiments | Dieste et al. 2011 [19] | Quality assessment instrument to determine the quality of experiments; a quality scale used to study the correlation between quality (understood as internal validity) and bias. | 11 | Yes
CSE_2 | SE | Primary studies | Dyba et al. 2008 [12] | Quality criteria for assessing primary studies in SE. | 11 | No
CSE_3 | SE | Case study | Host and Runeson 2007 [84] | Checklists for supporting researchers and reviewers in conducting and reviewing case studies. Researcher's checklist. | 38 | No
CSE_3 | SE | Case study | Host and Runeson 2007 [84] | Reviewer's checklist. | 12 | No
CSE_4 | SE | Experiments | Jedlitschka et al. 2008 [81] | A unification of a set of guidelines for reporting experiments in software engineering; the latest version of Jedlitschka's guidelines for reporting experiments in SE. | 42 | No
CSE_5 | SE | Experiments | Kitchenham and Charters 2007 [40] | Presents a summary of quality checklists for quantitative studies: "For quantitative studies we have accumulated a list of questions from [150], [151], [152], [76] and [153] and organised them with respect to study stage and study type. We do not suggest that anyone uses all the questions. Researchers should adopt Fink's suggestion [151], which is to review the list of questions in the context of their own study and select those quality evaluation questions that are most appropriate for their specific research questions." | 50 | No
CSE_6 | SE | Experiments | Kitchenham et al. 2010 [83] | A quality checklist with nine questions, whose purpose is to assess whether the quality of published human-centric software engineering experiments has been improving. | 9 | No
CSE_7 | SE | Experiments | Kitchenham 2009 [82] | A generic quality checklist for experiments, built from Kitchenham and Charters 2007 [40]. | 9 | Yes
CSE_8 | SE | Empirical studies | Kitchenham 2002 [24] | A preliminary set of research guidelines aimed at stimulating discussion among software researchers, based on a review of research guidelines developed for medical researchers and on the authors' own experience in doing and reviewing software engineering research. The guidelines are intended to assist researchers, reviewers, and meta-analysts in designing, conducting, and evaluating empirical studies. Although it is a set of guidelines, it also works as a checklist. | 36 | Yes
CSE_9 | SE | Experimental and case study | Wieringa 2012 [85] | A unified checklist for empirical research that identifies commonalities and differences between experimental and case study research. | 40 | No
CO_1 | Medicine | - | Begg 1996 [92] | CONSORT checklist for randomized controlled trials. | 21 | No
CO_2 | Medicine | Physical therapy; pain research; randomized clinical trials | Jadad 1996 [86] | Describes the development of an instrument to assess the quality of reports of randomized clinical trials in pain research, and its use to determine the effect of rater blinding on the assessments of quality. | 3 | No
CO_3 | Medicine | Randomized controlled trials | Moher 2010 [88] | The CONSORT group (Consolidated Standards of Reporting Trials) is a group of scientists and editors in medical research that aims to improve the reporting of RCTs in medical research. | 37 | No
CO_3 | Medicine | Randomized controlled trials | Schulz 2010 [89] | Organised a CONSORT Group meeting to update the 2001 statement; provides guidance for reporting all randomised controlled trials. | 37 | No
CO_4 | Psychology | Randomised controlled trials | NCBI 2012 [98] | Quality checklist templates for clinical studies and reviews. | 14 | Yes
CO_5 | Education | - | CEBP 2010 [97] | Checklist for reviewing a randomized controlled trial of a social program or project, to assess whether it produced valid evidence. | 17 | Yes
CO_6 | Medicine | - | Badgley 1961 [93] | Checklist for assessing research methods reports. | 5 | No
CO_7 | Medicine | Pharmacology | Bland 1985 [94] | Checklist for assessing whether clinical trial evidence about new drugs is statistically adequate. | 18 | No
CO_8 | Medicine | - | Gardner 1989 [99] | Checklist for assessing the statistical content of medical studies. | 26 | -
CO_9 | Medicine | - | Greenhalgh 2006 [56] | Checklist for the methods section of a paper. | 14 | No
CO_9 | Medicine | - | Greenhalgh 2006 [56] | Checklist for the statistical aspects of a paper. | 16 | No
CO_10 | Medicine | - | Greenhalgh 2005 [96] | Quality checklist for experimental (randomised and non-randomised controlled trial) designs, modified from the Cochrane EPOC checklist. | 12 | No
CO_11 | Health | Systematic reviews | CASP_RCT 2013 | Randomised Controlled Trials Checklist. | 11 | Yes
CO_12 | Health | Systematic reviews | CASP_QR 2013 | Qualitative Research Checklist. | 10 | Yes
CO_13 | Medicine | - | Down 1998 [95] | Checklist for measuring study quality in randomised and non-randomised studies of health care interventions. | 27 | Yes
CO_14 | Medicine | Systematic reviews | Zaza 2000 [91] | Checklist from medicine. | 15 | Yes
CO_15 | Medicine | Primary studies | Cochrane 2002 [90] | Provides a guide to reviewers about the type of relevant information that could be extracted from primary studies. | 7 | Yes
CO_16 | Medicine | - | EPHPP 2010 [154] | Quality assessment tool for quantitative studies. | 18 | No
CO_17 | Ecology | - | Jeffers 1984 [100] | Statistical checklist. | 74 | No


Table 5.3: Human Factors and Ethical Concerns Guidelines Sources

ID | Reference | Purpose
HF_1 | Vinson and Singer (2008) [72] | A practical guide to ethical research involving humans.
HF_2 | Singer and Vinson (2002) [70] | Ethical issues in empirical studies of software engineering.
HF_3 | Shneiderman (1980) [72] | Software Psychology: Human Factors in Computer and Information Systems.
HF_4 | Lazar et al. (2010) [62] | Research methods in human-computer interaction; variation in language and cultural differences between participants in experiments.
HF_5 | American Psychological Association (2010) [64] | Ethical Principles of Psychologists and Code of Conduct.
HF_6 | The British Psychological Society (2014) [63] | BPS Code of Human Research Ethics.
HF_7 | Johns Hopkins University (2010) [65] | JHSPH human subjects research ethics field training guide.
HF_8 | Garza (1991) [66] | The Touchy Ethics of Corporate Anthropology.

Table 5.1 shows information about the 11 guidelines we analyzed, five from software engineering and six from other fields, including education, psychology, and statistics. A total of 190 potential items were collected from these guidelines. There are other relevant guidelines that we studied but that did not yield items, such as Moroe [155], Greenwood [58], Cox et al. [57], Campbell and Stanley [48], Cook et al. [53], Judd et al. [47], Fleiss [156], and Shadish et al. [54].

A total of 26 checklists were analyzed (see Table 5.2). Nine of them are from the software engineering area and seventeen from other fields, including medicine, health, psychology, education, and ecology. We highlight the reference CSE_8 in the table: it is both a guideline and a checklist. In order to avoid redundancies, we classified it as a checklist. A total of 603 items were collected from the checklist sources. Of these, 258 items were from software engineering checklists, 101 of which were classified as not applicable, thus leaving 157 items from the software engineering field. Similarly, 355 items were found in checklists from other fields, 185 of which were classified as not applicable, thus leaving 160 items. As a result, 317 items from checklists were classified as potential instrument items and are included in Appendix C.
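As a cross-check, the arithmetic behind these tallies can be summarized in a few lines. The Python sketch below is ours and purely illustrative; only the numbers come from the text, and the variable names are not part of the study:

# Illustrative bookkeeping of the checklist-item tallies quoted above.
se_items, se_not_applicable = 258, 101        # software engineering checklists
other_items, other_not_applicable = 355, 185  # checklists from other fields

se_kept = se_items - se_not_applicable            # 157 items kept
other_kept = other_items - other_not_applicable   # 160 items kept

assert se_items + other_items == 603   # all items collected from checklists
assert se_kept + other_kept == 317     # potential instrument items (Appendix C)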

Table 5.3 presents information about the sources we used to identify relevant human factors in empirical research and ethical guidelines. These sources did not yield additional items, but helped us develop the supplemental information (e.g., "things to consider" and recommendations) that is part of the resulting instrument.


5.2.2 Standardizing experimental process phases and classifying items by phases

The goal of this step is to classify all collected data according to the phases of the experimental process. Because the phases of the experimental process differ from one set of experiment planning guidelines to another, this section describes how these phases were standardized for use in this research, creating a common standard against which the items from checklists, guidelines, and experts' experience could be classified. We organized this section into three steps: data collection, experimental process phase identification, and experimental process phase standardization. The results of this task can be seen in Table 5.7.

Step 1. Data Collection

This step aims to identify existing guidelines for conducting and reporting experiments in software engineering. Table 5.4 shows the main guidelines for conducting and reporting experiments in software engineering, extracted from Table 5.1. The first column of the table lists the Reference, which refers to the ID from Table 5.1 for each source. The second column describes the purpose of the guideline. The third column reports the phases of experimentation the guideline addresses.

Table 5.4: Guidelines for conducting and reporting experiments in software engineering

Reference | Author/Year | Purpose | Phases of study
GSE_1 | Wohlin 2000 [29]; Wohlin 2012 [21] | Empirical Research | All
GSE_2 | Juristo 2001 [30]; Juristo 2010 [23] | Controlled Experiment | All
GSE_3 | Pfleeger 1995 [45] | Experiment | All
GSE_4 | Jedlitschka 2005 [20]; Jedlitschka 2008 [81] | Controlled Experiment | Reporting
GSE_5 | Ko et al. 2015 [17] | Controlled Experiment | All
GSE_6 | Kitchenham 2002 [24] | Empirical Research | All

Step 2. Experimental process phases identification

Because the guidelines for conducting and reporting experiments in the software engineering domain have different views of the experimental process, their steps are summarized in Table 5.5. Although Jedlitschka (2008) [81] is an update of Jedlitschka (2005) [20], we included both versions because they present relevant differences in the structure of the experimental process that we decided to analyze.


Table 5.5: Experimental process steps in the order proposed by the authors

GSE_1 (2000; 2012) [29] [21]:
- Goal definition
- Context selection
- Hypothesis formulation
- Variables selection
- Selection of subjects
- Choice of design type
- Instrumentation
- Validity evaluation

GSE_2 (2001; 2010) [30] [23]:
- Step 0. Goals of the experiment and the hypotheses
- Step 1. Identify the factors
- Step 2. Identify the response variables
- Step 3. Identify the parameters
- Step 4. Identify the blocking variables
- Step 5. Determine the number of replications
- Step 6. Select the experimental design
- Step 7. Select the experimental objects
- Step 8. Select the experimental subjects

GSE_3 (1995) [45]:
- Goals
- Hypothesis
- Experimental objects/experimental units
- Experimental subjects
- Control object
- Response or dependent variables and state or independent variables
- Procedure
- Experimental design
- Analysis procedure

GSE_4 (2005) [20]:
- Objectives/research questions
- Hypotheses
- Parameters
- Variables
- Design
- Subjects/participants
- Objects
- Instrumentation
- Data collection procedure
- Analysis procedure
- Validity evaluation

GSE_4 (2008) [81]:
- Goals
- Experimental materials
- Tasks
- Hypotheses
- Parameters
- Variables
- Experiment design
- Procedure
- Analysis procedure
- Validity evaluation

GSE_5 (2015) [17]:
- Goals
- Participants*
- Recruitment
- Selection
- Consent
- Procedure
- Demographic measurements
- Group assignment
- Training
- Tasks
- Outcome measurements
- Debrief and compensate

GSE_6 (2002) [24]:
- Research question
- D1: Identify the population from which the subjects and objects are drawn.
- D2: Define the process by which the subjects and objects were selected.
- D3: Define the process by which subjects and objects are assigned to treatments.
- D4: Restrict yourself to simple study designs or, at least, to designs that are fully analyzed in the statistical literature. If you are not using a well-documented design and analysis method, you should consult a statistician to see whether yours is the most effective design for what you want to accomplish.
- D5: Define the experimental unit.
- D6: For formal experiments, perform a pre-experiment or precalculation to identify or estimate the minimum required sample size.
- D7: Use appropriate levels of blinding.
- D8: If you cannot avoid evaluating your own work, then make explicit any vested interests (including your sources of support) and report what you have done to minimize bias.
- D9: Avoid the use of controls unless you are sure the control situation can be unambiguously defined.
- D10: Fully define all treatments (interventions).
- D11: Justify the choice of outcome measures in terms of their relevance to the objectives of the empirical study.

* The study previously defined the experiment design, and it stated that every participant received exactly the same materials, instructions, tasks, and environment, except for which tool they received.

Step 3. Experimental process phases standardization

Based on the guidelines for conducting experiments in software engineering, we named the experimental phases to be used in the classification of the instrument's items, as shown in Table 5.6. The names we chose, based on merging the experimental phases, are shown in the first column, and the phases from the source guidelines are shown in the subsequent columns. Furthermore, in some cases it was necessary to make a phase more specific, so we created a sub-classification of some experimental process phases. We also added the Document Structure classification, a general classification for any element that specifies something that should be contained in the experimental plan. It was created during the classification and sub-classification steps, when we realized that some items did not fit in any of the other phases. However, those items are important because they refer to general, relevant concerns that should be considered by the researcher. The final result of the classification of the experimental process can be seen in Table 5.7.
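Conceptually, this standardization is a mapping from each source guideline's phase names to the unified phase names. The Python sketch below is ours, not an artifact of the thesis; the entries are a small excerpt of Table 5.6, and the fallback mirrors the role of the Document Structure bucket described above:

# A few (source guideline, original phase) -> unified phase mappings,
# excerpted from Table 5.6 for illustration.
STANDARD_PHASE = {
    ("GSE_1", "Goal Definition"): "Goal definition",
    ("GSE_2", "Step 0. Goals of the experiment and the hypotheses"): "Goal definition",
    ("GSE_1", "Choice of design type"): "Experiment Design",
    ("GSE_5", "Recruitment"): "Participants",
    ("GSE_6", "D7: Use appropriate levels of blinding."): "Participants",
}

def standardize(source_id: str, phase: str) -> str:
    # Items that fit no specific phase fall into the general
    # Document Structure classification.
    return STANDARD_PHASE.get((source_id, phase), "Document Structure")

assert standardize("GSE_5", "Recruitment") == "Participants"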


Table 5.6: Experimental process phases of the instrument

- Goal definition: GSE_1 (Goal definition); GSE_2 (Step 0, goals of the experiment and the hypotheses); GSE_3 (Goals); GSE_4 2005 (Objectives/research questions); GSE_4 2008 (Goals); GSE_5 (Goals; Research questions); GSE_6 (Research question).

- Metrics and Measurements: GSE_5 (Outcome measurements); GSE_6 (D11, justify the choice of outcome measures in terms of their relevance to the objectives of the empirical study).

- Context Selection: GSE_1 (Context selection).

- Hypotheses formulation: GSE_1 (Hypothesis formulation); GSE_3 (Hypothesis); GSE_4 2005 (Hypotheses); GSE_4 2008 (Hypotheses).

- Parameters and Variables Selection: GSE_1 (Variables selection); GSE_2 (Steps 1-4, identify the factors, the response variables, the parameters, and the blocking variables); GSE_3 (Response or dependent variables and state or independent variables); GSE_4 2005 (Parameters; Variables); GSE_4 2008 (Parameters; Variables).

- Participants: GSE_1 (Selection of subjects); GSE_2 (Step 8, select the experimental subjects); GSE_3 (Experimental subjects); GSE_4 2005 (Subjects/participants); GSE_5 (Participants; Recruitment; Selection; Consent); GSE_6 (D1, identify the population from which the subjects and objects are drawn; D2, define the process by which the subjects and objects were selected; D6, perform a pre-experiment or precalculation to identify or estimate the minimum required sample size; D7, use appropriate levels of blinding).

- Group assignment: GSE_5 (Group assignment); GSE_6 (D3, define the process by which subjects and objects are assigned to treatments).

- Experimental Material: GSE_2 (Step 7, select the experimental objects); GSE_3 (Experimental objects/experimental units; Control object); GSE_4 2005 (Objects); GSE_4 2008 (Experimental materials); GSE_5 (Demographic measurements); GSE_6 (D2; D5, define the experimental unit; D10, fully define all treatments/interventions).

- Tasks: GSE_4 2008 (Tasks); GSE_5 (Tasks).

- Experiment Design: GSE_1 (Choice of design type); GSE_2 (Step 6, select the experimental design); GSE_3 (Experimental design); GSE_4 2005 (Design); GSE_4 2008 (Experiment design); GSE_6 (D4, restrict yourself to simple study designs or, at least, to designs that are fully analyzed in the statistical literature; if you are not using a well-documented design and analysis method, consult a statistician to see whether yours is the most effective design for what you want to accomplish).

- Procedure: GSE_1 (Instrumentation); GSE_2 (Step 5, determine the number of replications); GSE_3 (Procedure); GSE_4 2005 (Instrumentation); GSE_4 2008 (Procedure); GSE_5 (Procedure; Training); GSE_6 (D8, if you cannot avoid evaluating your own work, make explicit any vested interests, including your sources of support, and report what you have done to minimize bias; D9, avoid the use of controls unless you are sure the control situation can be unambiguously defined).

- Data Collection: GSE_4 2005 (Data collection procedure).

- Analysis procedure: GSE_3 (Analysis procedure); GSE_4 2005 (Analysis procedure); GSE_4 2008 (Analysis procedure).

- Threats to Validity: GSE_1 (Validity evaluation); GSE_4 2005 (Validity evaluation); GSE_4 2008 (Validity evaluation).


Table 5.7: Classification and sub-classification of the experimental process phases

Experimental Process Phases:
- Goal definition
- Research Question
- Metrics and Measurement
- Context selection
- Hypothesis formulation
- Parameters and Variables
- Participants
  * Recruiting and selecting
  * Sampling
  * Consent and Ethics
  * Blinding and concealment
  * General
- Group assignment
- Experimental Material
  * Objects
  * Instruments
  * Technology Information
- Experiment Design
  * Experimental Design
  * Procedure
  * Treatments
- Procedure
- Data Collection
- Analysis Procedure
- Threats to Validity
- Document Structure

Once we had developed a standardized set of experimental process phases, we placed each item we had identified (as described in Section 4.1) into one of the standardized phases. At the same time, we examined each item to make sure it was applicable to our scope. Because we had collected data from sources about empirical research in general, and from sources describing other types of primary studies such as case studies, some data were not suitable for defining and designing a controlled experiment. Therefore, these items were classified as not applicable. Some of the items we had identified were also not directly relevant to experimental plans; for example, some sources were related to reporting. However, we did not discard these potential items because, in the subsequent steps, we wanted the opportunity to change them slightly to make them applicable to the planning phase. After classifying all items according to the standardized experimental process phases, we discarded the 286 items that were classified as not applicable. This considerably reduced the number of items, from 793 to 507.
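Continuing the illustrative bookkeeping from the previous subsection (again ours, with only the numbers taken from the text), the reduction works out as follows:

# The full pool combines the 603 checklist items with the 190 guideline
# items; classification then drops the 286 items marked not applicable.
checklist_items, guideline_items, not_applicable = 603, 190, 286

pool_before = checklist_items + guideline_items  # 793 collected items
pool_after = pool_before - not_applicable        # 507 remaining items
assert (pool_before, pool_after) == (793, 507)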


5.2.3 Grouping related data within each experimental phase

The goal of this step is to group similar data according to their experimental process phase. The main purpose of this step is to serve as input for formulating the instrument items. Table 5.8 presents the number of items for each classification and sub-classification, for checklists and guidelines respectively.

Table 5.8: Classification and sub-classification of the checklist items

Classification | Sub-classification | Checklist Items | Guideline Items
Goal definition | - | 10 | 11
Research Question | - | 6 | 4
Metrics and Measurement | - | 19 | 15
Context selection | - | 6 | 2
Hypothesis formulation | - | 7 | 7
Parameters and Variables | - | 9 | 22
Participants | Recruiting and selecting | 11 | 15
Participants | Sampling | 28 | 14
Participants | Consent and Ethics | 8 | 12
Participants | Blinding and concealment | 11 | 2
Participants | General | 1 | 1
Group assignment | - | 48 | 9
Experimental Material | Objects | 2 | 10
Experimental Material | Instruments | 5 | 3
Experimental Material | Technology Information | 3 | 7
Tasks | - | 1 | 7
Experiment Design | Experimental Design | 10 | 10
Experiment Design | Procedure | 2 | 3
Experiment Design | Treatments | 9 | 7
Procedure | - | 16 | 22
Data Collection | - | 25 | 2
Analysis Procedure | - | 52 | 8
Threats to Validity | - | 24 | 5
Document Structure | - | 4 | 0
Total items | - | 317 | 190
N/A | - | 286 | -
Total items + N/A | - | 603 | 190


5.2.4 Formulating the instrument items and recommendations

The potential items were formulated, and each item was associated with relevant "things to consider" and recommendations, supported by evidence from the literature, including ethical concerns and human factors associated with empirical research, and by the contributions of empirical software engineering experts.

As a result of this step, we formulated 34 potential instrument items. The number of potential items decreased sharply, from 507 to 34, because some items were merged to build a single potential instrument item and others were dropped because they targeted the same goal.

5.2.5 Pre-validation of the instrument items

Before the proposed instrument was assessed by researchers who were not directly involved in this study (see Chapter 6), I held meetings with one of my research advisors to critically discuss the validity of the proposed instrument items. The proposed instrument was pre-evaluated in order to review the completeness and relevance of the items. Out of the 34 items, one was considered redundant; we also found inconsistent and unclear items. All of these problems were resolved before the instrument validation by other researchers. The resulting instrument contains 33 items.

5.3 Instrument Specification

In this section, we present the instrument resulting from the design process.

5.3.1 Objects of Interest

Our instrument supports completeness reviews of plans for controlled experiments using human participants. It is designed to be used after experimenters plan their controlled experiments, in order to check the plans' completeness.

5.3.2 Raters of Interest

The instrument was designed to be used by the software engineering researchers who planned the experiments. It can also be used by an independent rater who was not involved in the research.


5.3.3 Instrument items

In this section we present the defined items that support experimental plan review. The items are based on: i) existing sources found through a systematic mapping study and a literature review, and ii) experts' expertise in planning controlled experiments. This instrument is designed to support software engineering researchers in the review of plans for controlled experiments using human subjects in software engineering. It includes 33 items and is organized into nine categories drawn from the experimental process phases described in Section 5.2.2. This is not the final version of the instrument, because the assessments described in Chapter 6 resulted in modifications. The instrument was implemented with the SurveyGizmo tool (https://app.surveygizmo.com/). Each item in the instrument also includes a set of things to consider, in the form of questions that lead experimenters and reviewers of experiments to think systematically about the important components that should be contained in the experimental plan. Each item is associated with a response scale, "Yes", "Partially", "No", or "Not Applicable", defined as follows (a structural sketch in Python is given after the list):

- "Yes" means the experimental plan cannot be improved any further for that item.

- "Partially" means that the experimental plan describes something, but it could be improved.

- "No" means the item is not described at all.

- "Not Applicable" means that the item is not applicable to the experimental plan.
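The sketch below, which is ours rather than the SurveyGizmo implementation, shows how an item, its supporting material, and the response scale fit together. The item text is quoted from the instrument (see Table 6.8); the "things to consider" entry is an abbreviated illustration:

from dataclasses import dataclass, field
from typing import Dict, List

# The four-point response scale described above.
RESPONSE_SCALE = ("Yes", "Partially", "No", "Not Applicable")

@dataclass
class InstrumentItem:
    number: int
    category: str  # one of the nine experimental process phase categories
    question: str
    things_to_consider: List[str] = field(default_factory=list)
    recommendations: List[str] = field(default_factory=list)

item_1 = InstrumentItem(
    number=1,
    category="Goal definition",
    question="Are the aims clearly and precisely stated?",
    things_to_consider=["Whether the study goals are well defined in the plan."],
)

def record_answer(item: InstrumentItem, response: str) -> Dict[str, object]:
    """Validate a rater's response against the scale and package it."""
    if response not in RESPONSE_SCALE:
        raise ValueError(f"response must be one of {RESPONSE_SCALE}")
    return {"item": item.number, "response": response}

print(record_answer(item_1, "Partially"))  # {'item': 1, 'response': 'Partially'}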

5.4 Chapter Summary

The process of building an instrument for reviewing the completeness of experimental plans for controlled experiments using human subjects in the area of software engineering has been discussed in this chapter. We described the steps taken to carry out this research, including important definitions for the reader to understand the proposed instrument development, the instrument development methodology, and the instrument specification. An initial version of the instrument, containing 33 items, is presented in Appendix B. Each item is phrased as a question and contains a set of things to consider and recommendations to help the experimenter review the experimental plan. However, this was not the final version of the instrument, as the next chapter discusses the results obtained through four evaluation studies.



6 Instrument Evaluation

The research carried out in this thesis yielded an instrument for reviewing the completeness of experimental plans for controlled experiments using human subjects in software engineering, which was proposed and described in the previous chapter. The purpose of this chapter is to report on the evaluation of the proposed instrument. We assessed the instrument through four different studies, which together evaluate it from the distinct perspectives of beginner as well as expert researchers in experimental software engineering. The items of the instrument assessed in this chapter can be seen in Section 5.3.3 of Chapter 5.

In Section 6.1, we describe the research approach, which contains the information common to the four studies, including the experimental website, the demographic questionnaire, and its results for each study. Then, in Sections 6.2, 6.3, 6.4, and 6.5, we describe the study design, data analysis, and results for studies 1, 2, 3, and 4, respectively. The instrument's acceptance and the corresponding results for each study are presented in Section 6.6. The threats to validity are described in Section 6.7. Then, we discuss the results of the four studies in Section 6.8. In Section 6.9, we present the final version of the instrument. Finally, the chapter summary is presented in Section 6.10.

6.1 Research Approach

In this section, we describe the research approach common to all four evaluation studies. Each one was planned to explore different goals as well as different perspectives of software engineering researchers, including beginners and experts.

Figure 6.1 illustrates the main characteristics of the four evaluation studies.


Figure 6.1: Overview of the four evaluation studies

The first study counted on the assessment of experts in experimental software engineering; the second study counted on the assessment of both beginners and experts; and the third and fourth evaluation studies focused on the participation of post graduate students who had been planning their first controlled experiments. Furthermore, the proposed instrument was assessed through two different approaches. In the first, the instrument was assessed from a formative evaluation perspective, whose goal was to improve the instrument. In the second, the instrument was assessed from a summative perspective, whose goal was to check the effectiveness of the instrument. Based on both the formative and summative evaluation results, the proposed instrument was modified and improved. The formative assessment consists of studies 1 and 2, and the summative assessment consists of studies 3 and 4.

Each study is described as follows:

- Study 1: Analyzing which of the instrument's checklist items high-level experts in experimental software engineering find useful, which ones they do not find useful, and which ones they have trouble understanding. In addition, we assessed the instrument regarding its acceptance. See Section 6.2.

- Study 2: Analyzing the agreement, reliability, criterion validity, and acceptance of the instrument by expert and beginner researchers in experimental software engineering. See Section 6.3.

- Study 3: Performing a co-located controlled experiment with post graduate students to assess whether the usage of the instrument can reduce the chance of forgetting to include something important during the experiment planning phase, compared with the usage of ad hoc practices. We also assessed the instrument's acceptance. See Section 6.4.

- Study 4: Performing a remote controlled experiment with post graduate students to assess whether the usage of the instrument can reduce the chance of forgetting to include something important during the experiment planning phase, compared with the usage of ad hoc practices. We also assessed the instrument's acceptance. Study 4 has the same study design as study 3, but differs in that it was performed in a virtual environment, while study 3 was co-located. See Section 6.5.

Although each evaluation study targets distinct goals, together they contribute towards assessing the proposed instrument from different perspectives. Figure 6.2 illustrates how the versions of the instrument evolved. First, version V0 of the instrument was assessed through sandboxes by the investigating researchers; redundancies were identified and resolved. Second, version V1 of the instrument was assessed by a participant of evaluation study 1. V1 should have been assessed by the two participants at the same time; however, because of unforeseen problems affecting one of the participants, they assessed the instrument on different dates. Therefore, we decided to analyze the data after the data collection of the second participant, and the results of study 1 were not incorporated into the instrument until version VF (see the modifications in Sections 6.2.4 and 6.2.5). V1 was also assessed through evaluation study 2, and modifications were implemented before evaluation studies 3 and 4 were executed. Version V2 of the instrument was assessed through studies 3 and 4 (see the modifications from studies 2, 3, and 4 in Section 6.6.1.5).

Figure 6.2: Versions of the Instrument

We performed the different studies possibly with some of the participants' colleagues or members of their research groups. So, in order to avoid introducing bias into the studies, we requested that the participants agree not to exchange any information with anyone else about their assessment, the instrument, or any other details of the study they were performing. Although the data collection was not anonymous, all information was treated confidentially.


In the following subsections we present details that are common to all four studies. Section 6.1.1 presents the experimental website and its content, which was used by all participants. Then, Section 6.1.2 presents the rationale for the demographic questionnaire, and the demographic results from each study are presented in Section 6.1.3.

The details of each study are presented in Sections 6.2 through 6.5. In addition, all the studies assessed the proposed instrument regarding its acceptance. Section 6.6 describes this evaluation, and the acceptance results from each study are presented in Section 6.6.1.

6.1.1 The Experimental Website

All four studies were performed through a website. Even though the participants in study 3 were in a laboratory, all data were still collected through experimental websites. We used Google Sites to build the assessment environments. Although each website had its particularities regarding the instructions and materials for each study, they all followed the same structure. Basically, the websites included a page with an abstract, which gives participants an overview of the instrument; a page with study instructions and a dry-run; a page with general information about the proposed instrument; a link to the proposed instrument, which was implemented with the SurveyGizmo tool; a page with the experimental plans (except in study 1, where participants assessed the instrument by itself); and, finally, a page with feedback information and an external link to the instrument's acceptance survey, which was also implemented with the SurveyGizmo tool. Figure 6.3 shows an example of the website layout. All the information about the content of the websites for the four studies can be seen in Appendix D.

Figure 6.3: Example of the Experimental Website Layout

The researchers found all materials needed for the study on the website, including the study instructions, the proposed instrument, the three experimental plans, the demographic questionnaire, where the participants provided information about their experience in conducting experiments with human participants, and the instrument's acceptance questionnaire. The three experimental plans used in studies 2, 3, and 4 were all written in Portuguese.

Table 6.1: Experimental materials for the four studies

Experimental Material / Description | Study 1 | Study 2 | Study 3 | Study 4
The experimental website | X | X | X | X
Instructions | X | X | X | X
The experimental plans | - | X | X | X
Questionnaire about the instrument's acceptance, implemented with the SurveyGizmo tool | X | X | X | X
Demographic questionnaire, implemented with the SurveyGizmo tool, where the subjects provide information about their experience in conducting experiments with human participants | X | X | X | X
The proposed instrument, implemented with the SurveyGizmo tool | X | X | X | X
Form for the participants to record mistakes and missing elements found during the analysis performed without the proposed instrument, implemented with the SurveyGizmo tool | - | - | X | X
Form for the participants to record the assessment of each experimental plan, implemented with the SurveyGizmo tool | - | X | - | -
Form for the participants to assess each item of the instrument, implemented with the SurveyGizmo tool | X | - | - | -

6.1.2 Demographic Questionnaire

We collected demographic information from all the participants in the studies. We focused on identifying the current position of the participants, although we already knew their profiles. Another important piece of data collected was how long they had been experimental researchers in software engineering; this question helped us classify their experience. For the same purpose, we also asked them how many experiments they had planned or helped to plan. In addition, for participants whose native language was not English, we asked for their level of English reading comprehension. This was necessary because the proposed instrument and the instructions were written in English. Although gender and age are common questions in demographic questionnaires, in our case these factors do not impact the results of the usage of the instrument. Appendix N presents the demographic questionnaire as implemented with the SurveyGizmo tool. Table 6.2 presents the demographic questionnaire questions.

Page 125: LILIANE SHEYLA DA SILVA FONSECA · 2019. 10. 25. · Liliane Sheyla da Silva Fonseca AN INSTRUMENT FOR REVIEWING THE COMPLETENESS OF EXPERIMENTAL PLANS FOR CONTROLLED EXPERIMENTS

6.1. RESEARCH APPROACH 127

Table 6.2: Demographic Questionnaire

1) What is your current position?
   a) Master degree; b) PhD student; c) Postdoc; d) Senior researcher; e) Professor; f) Other

2) How long have you been an experimental researcher in Software Engineering?
   a) Less than 2 years; b) 2 or more years and less than 5 years; c) 5 or more years and less than 10 years; d) 10 or more years

3) How many experiments have you helped to plan?
   a) Less than 2; b) 2 or more and less than 5; c) 5 or more and less than 10; d) 10 or more

4) Which is your English reading comprehension proficiency?
   a) Very low proficiency; b) Low proficiency; c) Moderate proficiency; d) High proficiency; e) Very high proficiency

6.1.3 Demographic Questionnaire Results

This section presents the demographic results for the four evaluation studies.

Two high-level experts in experimental software engineering participated in study 1. Both are software engineering professors at U.S. universities, and both have been experimental researchers in software engineering for more than a decade. Regarding how many experiments they have helped to plan, one has helped to plan 10 or more, while the other has helped to plan between 5 and 10. See Table 6.3.

Table 6.3: Demographic Information from Study 1

Question | Expert 1 | Expert 2
What is your current position? | Professor | Professor
How long have you been an experimental researcher in Software Engineering? | 10 or more years | 10 or more years
How many experiments have you helped to plan? | 10 or more | 5 or more and less than 10

Four Brazilian experimental software engineering researchers participated in study 2: a pair of beginners and a pair of experts in experimental software engineering. See Table 6.4.

Page 126: LILIANE SHEYLA DA SILVA FONSECA · 2019. 10. 25. · Liliane Sheyla da Silva Fonseca AN INSTRUMENT FOR REVIEWING THE COMPLETENESS OF EXPERIMENTAL PLANS FOR CONTROLLED EXPERIMENTS

6.1. RESEARCH APPROACH 128

Table 6.4: Demographic Information from Study 2

Question (Value | Percentage | Count)

1) What is your current position?
   PhD student | 50.0% | 2
   Professor | 50.0% | 2

2) How long have you been an experimental researcher in Software Engineering?
   2 or more years and less than 5 years | 50.0% | 2
   5 or more years and less than 10 years | 50.0% | 2

3) How many experiments have you helped to plan?
   2 or more and less than 5 | 50.0% | 2
   5 or more and less than 10 | 50.0% | 2

4) Which is your English reading comprehension proficiency?
   Very high reading comprehension proficiency | 100.0% | 4

A total of seven post graduate students participated in study 3. See Table 6.5.

Table 6.5: Demographic Information from Study 3

Question (Value | Percentage | Count)

1) What is your current position?
   Master degree | 14.3% | 1
   PhD student | 85.7% | 6

2) How long have you been an experimental researcher in Software Engineering?
   Less than 2 years | 28.6% | 2
   2 or more years and less than 5 years | 71.4% | 5

3) How many experiments have you helped to plan?
   Less than 2 | 57.1% | 4
   2 or more and less than 5 | 42.9% | 3

4) Which is your English reading comprehension proficiency?
   Moderate reading comprehension proficiency | 14.3% | 1
   High reading comprehension proficiency | 42.9% | 3
   Very high reading comprehension proficiency | 42.9% | 3

Study 4 counted on the participation of 22 Brazilian researchers. See Table 6.6.

Page 127: LILIANE SHEYLA DA SILVA FONSECA · 2019. 10. 25. · Liliane Sheyla da Silva Fonseca AN INSTRUMENT FOR REVIEWING THE COMPLETENESS OF EXPERIMENTAL PLANS FOR CONTROLLED EXPERIMENTS

6.2. STUDY 1: INSTRUMENT VALIDATION 129

Table 6.6: Demographic Information from Study 4

Question (Value | Percentage | Count)

1) What is your current position?
   PhD student | 50% | 11
   Master student | 13.64% | 3
   Master degree | 18.18% | 4
   Ph.D. degree | 13.64% | 3
   Other | 4.55% | 1

2) How long have you been an experimental researcher in Software Engineering?
   Less than 2 years | 59.09% | 13
   2 or more years and less than 5 years | 36.36% | 8
   10 or more years | 4.55% | 1

3) How many experiments have you helped to plan?
   Less than 2 | 36.36% | 8
   2 or more and less than 5 | 59.09% | 13
   5 or more and less than 10 | 4.55% | 1

4) Which is your English reading comprehension proficiency?
   Moderate reading comprehension proficiency | 45.45% | 10
   High reading comprehension proficiency | 40.91% | 9
   Very high reading comprehension proficiency | 13.64% | 3

A total of 35 participants were involved in the assessment of the proposed instrument. Seven had a Ph.D. degree, 19 were Ph.D. students, 5 had a Master's degree, 3 were Master's students, and 1 had dropped out of a Master's course. Regarding how long they had been experimental researchers in software engineering, 15 had fewer than 2 years of experience, 15 between 2 and 5 years, 2 between 5 and 10 years, and 3 more than 10 years. Regarding how many experiments they had helped to plan, 12 answered fewer than 2, 18 between 2 and 5, 4 between 5 and 10, and 1 more than 10. The Brazilian participants answered a question about their English reading comprehension level: 11 had moderate, 12 high, and 10 very high English reading proficiency. Because the participants of study 1 are professors at U.S. universities, the question regarding their English reading comprehension level was not required.

6.2 Study 1: Instrument Validation

6.2.1 Study Goals

The objective of this assessment is to find out what experienced experimental researchers think about the proposed instrument: which checklist items they find useful and which ones they have trouble understanding.


6.2.2 Study Design

6.2.2.1 Participants

Two professors from U.S. universities, whose research focuses on empirical software engineering, were invited to participate voluntarily in this study. Although they had participated in the previous investigation of how software engineering researchers plan their experiments (described in Chapter 4), they were not involved in developing the proposed instrument. They have been experimental researchers in software engineering for more than ten years, during which they have been planning experiments, teaching experimentation, and helping other researchers to plan experiments. See Table 6.3 for more details about their demographic information.

6.2.2.2 Objects

This study focuses on assessing the instrument by itself. The evaluators checked which items they found useful and which ones they had trouble understanding. The study design did not involve the use of experimental plans as objects.

6.2.2.3 Procedure

We ran sandboxes and pilots of this study to check whether the instructions and materials were understandable and unambiguous. A protocol for developing a list of possible questions that the participants might ask during data collection was created in order not to introduce bias into the study results; the protocol is described in Appendix K. The participants were invited by e-mail, and the whole evaluation was performed remotely. The participants used an experimental website, where they could find all materials needed for the assessment, including the instructions, the instrument, and the instrument's acceptance questionnaire (see more details about the experimental website in Section 6.1.1). Both professors independently evaluated the proposed instrument over the course of one week, and they committed to not exchanging information about the instrument. During the assessment they could contact the first author of this research with any questions or confusions. The schedule of study 1 is presented in Table 6.7, and the experimental material used in this study is described in Appendix G.


Table 6.7: Schedule of instrument validation study 1

Study Phase | Activity | Date | Duration | Environment
Invitations | Send e-mail to pilot participants | May 2, 2016 | - | e-mail
Invitations | Send e-mail to study participants | May 2, 2016 | - | e-mail
Pilot | Sandbox | April 25, 2016 | 1 hour | Website
Pilot | Pilot | May 30, 2016 | 1 hour | Website
Pilot | Adjustment | May 31 to June 4, 2016 | 5 days | Website
Instructions | Send e-mail with instructions to participants | June 5, 2016 | - | e-mail
Data Collection | Participant 1 | June 5 to June 11, 2016 | 7 days | Website
Data Collection | Participant 2 | June 26 to July 2, 2016 | 7 days | Website
Data Analysis | Data analysis | July 3 to July 5, 2016 | 3 days | -
Adjustment | Adjustment | July 6 to July 9, 2016 | 3 days | -

6.2.3 Data Collection

In this study, we collected data from the two participants with the SurveyGizmo tool, through the experimental website described in Section 6.1.1. We collected what each participant thought about each item of the instrument using the following scale:

[ ] I find the item useful.
[ ] I do not find the item useful.
[ ] I have trouble understanding the item. (Please specify why.)

We also collected the instrument's acceptance; for more details, see Section 6.6.

6.2.4 Data Analysis

The data were analyzed qualitatively. We tabulated the results, as shown in Table 6.8, to identify items that were problematic. Then we examined the comments for those items qualitatively to gain insight into improvements to the instrument.

6.2.5 Results

Twenty-five out of the 33 items achieved 100% agreement. The remaining eight items were discussed by the author of the instrument with one of the research mentors of this work. Table 6.8 presents a general overview of the data collected.


Table 6.8: Overview of collected data from Study 1

Items 2, 3, 6, 7, 8, 9, 10, 12, 13, 16, 17, 18, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29, 30, 31, and 33:
- I find the item useful: 2 (100%). No comments.

Item 1: Are the aims clearly and precisely stated?
- I find the item useful: 1 (50%); I do not find the item useful: 1 (50%). No comments.

Item 24: Is there an adequate description of the context in which the experiment will be carried out?
- I find the item useful: 1 (50%); I do not find the item useful: 1 (50%). No comments.

Item 4: Do the objectives of the experiment satisfy ethical concerns?
- I find the item useful: 1 (50%); I have trouble understanding the item: 1 (50%). Comment: "The term 'all relevant information' is misleading. The problem is in understanding what 'all' are."

Item 5: Are the hypotheses of the research clearly described, and are they related to the research goals?
- I find the item useful: 1 (50%); I have trouble understanding the item: 1 (50%). Comment: "The problem I always have is that the hypotheses are described before the variables are discussed. Therefore, I cannot use the variables in my hypotheses. This makes the hypotheses not formal."

Item 11: Is a demographic questionnaire planned to collect information from participants?
- I find the item useful: 1 (50%); I have trouble understanding the item: 1 (50%). Comment: "I believe demographic is too general. You could provide a set of fields used to capture demographic information. If these fields will be standardized then results aggregation will be much easier, reliable, and powerful."

Item 14: Are ethical issues addressed properly (personal intentions, integrity issues, consent, review board approval)?
- I find the item useful: 1 (50%); I have trouble understanding the item: 1 (50%). Comment: "The impact of this field is probably overestimated. If you need to cut somewhere, this is a good place."

Item 15: Do the experimenters describe how participation will be motivated?
- I find the item useful: 1 (50%); I have trouble understanding the item: 1 (50%). Comment: "I believe this overlaps with a previous field."

Item 32: Do the experimenters identify and discuss threats to validity, study limitations, and potential biases or confounders that may influence the experiment results?
- I find the item useful: 1 (50%); I have trouble understanding the item: 1 (50%). Comment: "As I said in the interview, it is important to report the trade-offs. Authors should report the rationale of their decisions in terms of how they balanced different threats to validity."

Items 1 and 24

Regarding items 1 and 24, one of the appraisers assessed them as "I do not find the item useful". However, we decided to keep them in the instrument for the following reasons:

- Item 1: Although study definition is a step performed before experiment planning, it is essential that the study goals be well defined in experimental plans.

- Item 24: It is important to include in experimental plans the context in which the experiment will be carried out, because several factors and threats to validity are related to the place where the experiment will be performed. It is also important to be aware of whether the location of the experiment execution is representative of the goals of the study.

With respect to items 4, 5, 11, 14, 15, and 32, one of the participants (the same one in each case) assessed each of them as "I have trouble understanding the item" and gave explanatory comments.

Item 4

Comment: "The term 'all relevant information' is misleading. The problem is in understanding what 'all' are."

He was referring to the first thing to consider in item 4: "The participants have access to all relevant information about the study, before making their decision to participate or not."

Discussion result: We agreed with his comment, and we changed the sentence as follows: "The participants have access to all the information they need to make an informed decision about whether or not to participate."

Item 5

Comment: "The problem I always have is that the hypotheses are described before the variables are discussed. Therefore, I cannot use the variables in my hypotheses. This makes the hypotheses not formal."

Discussion result: Although we understand and agree with the comment, we do not think it is relevant to the instrument, because the order of the items in the checklist does not require the researcher to follow that order. So, we decided not to make any changes to the instrument.

Item 11

Comment: "I believe demographic is too general. You could provide a set of fields used to capture demographic information. If these fields will be standardized then results aggregation will be much easier, reliable, and powerful."

Discussion result: Because the relevant demographic data change from experiment to experiment, depending on the target of the study, instead of providing a set of fields to capture demographic information we decided to put a definition of demographic data under the things to consider. Although from our point of view the term "demographic" is well understood, beginners may misunderstand it. As a result, we included the following definition: "Demographic data are characteristics and attributes of a population, such as the age, gender, and income of the people within the population."

Item 14

Comment: "The impact of this field is probably overestimated. If you need to cut somewhere, this is a good place."

Discussion result: Items 4 and 14 were identified as redundant. As a result, we combined these items.

Item 15

Comment: "I believe this overlaps with a previous field."

Discussion result: Items 15 and 10 were also identified as redundant. As a result, we combined them.

Item 32

Comment: "As I said in the interview, it is important to report the trade-offs. Authors should report the rationale of their decisions in terms of how they balanced different threats to validity."

Discussion result: We added another thing to consider, suggesting that the researcher explain and justify the trade-offs: "Whether the experimenters report the rationale of their decisions in terms of how they balanced different threats to validity."

The final instrument contains 31 items.

6.2.6 Summary of Study 1

In the first study, the results of assessing the instrument's items revealed that 75.76% of the checklist items were judged useful by both raters. However, one rater had trouble understanding six items: 4, 5, 11, 14, 15, and 32. Each one was discussed, and improvements were made to the proposed instrument. Item 32, regarding threats to validity, was widely discussed because of its importance in research studies: all studies, no matter how well they are planned, have threats to validity, and the rationale behind the experimenters' decisions, in terms of how they balance those threats, should be reported in experimental plans. One rater assessed the instrument at the beginning of the studies, and the other rater assessed it after study 4.

6.3 Study 2: Instrument Validation

This study reports on the instrument's appraisal by experimental software engineering researchers at two levels of experience, experts and beginners, with respect to inter-rater agreement, inter-rater reliability, criterion validity, and the instrument's acceptance. The design and results of the instrument's acceptance evaluation can be seen in Section 6.6.

6.3.1 Study Goals

The goal of this assessment is to improve the instrument through feedback from researchers in experimental software engineering. We evaluated the instrument's inter-rater agreement, inter-rater reliability, and criterion validity. Therefore, we defined the following research questions for these validity criteria (research questions 1, 2, 3, and 4):


- Inter-rater agreement: RQ1 - To what extent do raters score experimental plans using the instrument in a similar manner?

- Inter-rater reliability: RQ2 - To what extent do raters rank experimental plans using the instrument in a similar manner?

- Criterion validity:

  1. RQ3 - To what extent do a rater's ratings correlate with the rater's opinion about whether the experiment should proceed with the experimental plan?

  2. RQ4 - To what extent do a rater's ratings correlate with the rater's opinion about whether the experiment is likely to be successful if it proceeds with the plan?

We investigated these questions in the scenario of researchers analyzing whether a controlled experiment using human participants should proceed with a given experimental plan.
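To make RQ1 and RQ2 concrete, the sketch below shows one standard way to operationalize item-level agreement and plan-level rank reliability. It is ours, not the thesis's analysis script; it assumes SciPy is available, and the scores are invented for illustration:

from itertools import combinations
from scipy.stats import spearmanr

# RQ1 (agreement): exact percent agreement between two raters' item-level
# answers on one plan, on the instrument's Yes/Partially/No/NA scale.
rater_a = ["Yes", "Partially", "No", "Yes"]
rater_b = ["Yes", "Partially", "Yes", "Yes"]
agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)  # 0.75

# RQ2 (reliability): do raters rank the three plans similarly? Each rater's
# total score per plan is compared pairwise with Spearman's rho.
scores = {
    "beginner_1": [28, 19, 24],
    "beginner_2": [27, 17, 25],
    "expert_1": [30, 20, 22],
    "expert_2": [26, 18, 23],
}
rhos = [spearmanr(a, b)[0] for a, b in combinations(scores.values(), 2)]
print(f"agreement: {agreement:.2f}; mean pairwise rho: {sum(rhos)/len(rhos):.2f}")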

6.3.2 Study Design

The proposed instrument was assessed through a fully crossed design, which means that each evaluator individually assessed the complete sample of three experimental plans.
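In other words, every rater-plan combination is observed, as the short sketch below illustrates (ours; the rater and plan labels are placeholders, not the study's identifiers):

from itertools import product

raters = ["beginner_1", "beginner_2", "expert_1", "expert_2"]
plans = ["plan_A", "plan_B", "plan_C"]

# Fully crossed: every rater assesses every plan.
assessments = list(product(raters, plans))
assert len(assessments) == len(raters) * len(plans)  # 4 x 3 = 12 assessments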

6.3.2.1 Participants

The population of this study is experimental software engineering researchers with experience in controlled experiments using human subjects. We selected four of them by convenience: two beginner researchers and two expert researchers in conducting controlled experiments with human participants. We sent an invitation letter to the participants by e-mail (see Appendix H). All participants have good knowledge of experimental software engineering. Both beginner researchers are Ph.D. students in experimental software engineering. Both experts are professors who hold Ph.D. degrees in experimental software engineering. Section 6.1.3 presents the demographic information of the participants in this study.

6.3.2.2 Objects

Because there is no repository of experimental plans in software engineering, we collected three experimental plans from post graduate students in a course on experimentation in Brazil, in which they had learned how to plan and conduct controlled experiments using human subjects (see Appendix I). The experimental plans were selected as described in the protocol for selecting experimental plans (see Appendix J), according to the following criteria:


- It must be the plan of a controlled experiment. That is, it should be a document written before the experiment was run, not an experiment report or a document written after the experiment has finished;

- It should involve human participants;

- It should assign the subjects randomly.

6.3.2.3 Procedure

The participants were asked to assess three experimental plans using the proposed instrument over one week. The duration of each evaluation was not limited. While they were completing their assessments, they could contact the investigators at any time with any questions or points of confusion.

1. Pre Pilot

We performed sandbox pilots. Sandbox pilots are pre-pilots in which the researchers themselves are the participants [17]. They helped us to discover problems before the pilot participants did, increasing the possibility that the pilot participants would find other, more significant mistakes in the experimental plans.

2. Pilot

We performed a pilot of this study to check whether the instructions and experimental materials were understandable and unambiguous. We collected information from two Ph.D. students in experimental software engineering who were not participants in study 2. After the pilot, the instructions were adjusted in order to provide clearer and more complete information for the participants. A protocol for developing a list of possible questions to be answered during data collection was created so as not to introduce bias in the study results. The protocol is described in Appendix K.

3. Training and Dry run

All participants were trained in how to use the proposed instrument. They read the instructions sent by e-mail and applied the instrument to an experimental plan for practice. That experimental plan was not included in the experimental object sample. They had one day to resolve any confusion and ask the investigators any questions by e-mail or Skype. All confusions and questions were discussed and resolved before they started the study.

4. Study Execution

The participants received an e-mail with instructions (see Appendix F). They found all materials needed for the study through the experimental website described in Section 6.1.1. All participants assessed the three experimental plans using the instrument during one week. The participants assessed the experimental plans in any order and submitted each assessment immediately after its completion. Although the responses were not anonymous, all submitted information was treated confidentially. The experimental material used in this study is described in Appendix H. In addition, the participants assessed the instrument's acceptance as described in Section 6.6 and reported the demographic information described in Section 6.1.2.

5. Schedule of the Study

Table 6.9 shows the schedule of the instrument validation study 2.

Table 6.9: Schedule of the Instrument Validation 2

Study Phase     | Activity                          | Date                      | Duration  | Environment
Invitations     | Send e-mail to pilot participants | May 1, 2016               | —         | e-mail
Invitations     | Send e-mail to study participants | May 1, 2016               | —         | e-mail
Pilot           | Sandbox                           | April 25, 2016            | 1 hour    | Website
Pilot           | Pilot                             | May 30, 2016              | 1 day     | Website
Pilot           | Adjustment                        | May 31 to June 4, 2016    | 5 days    | Website
Instructions    | Send e-mail with instructions     | June 5, 2016              | 1 day     | e-mail
Training        | Training and dry run instructions | June 6, 2016              | 30 min    | Skype
Training        | Dry run                           | June 6 to June 7, 2016    | 2 days    | Website
Training        | Questions & answers               | June 7, 2016              | As needed | Skype
Data Collection | Data collection                   | June 8 to June 15, 2016   | 7 days    | Website
Data Analysis   | Data analysis                     | June 16 to June 17, 2016  | 2 days    | —
Adjustment      | Adjustment                        | June 18, 2016             | 1 day     | —

6. Drop outs

There were no dropouts: all four participants who started the study finished its execution.


6.3.3 Data collection

We collected data with the SurveyGizmo tool from the four participants through the experimental website described in Section 6.1.1. The collected data were the results of the assessments of three experimental plans using the instrument. There were a total of 12 assessments, with 4 participants each assessing 3 experimental plans.

6.3.4 Data analysis

The data were quantitatively analyzed by the first author using IBM SPSS Statistics 23 [137] and R [136]. Two coders, who are not otherwise involved in this research, checked and repeated all analyses. They are Ph.D. students in Computer Science. We used measures based on [157], [158], [159].

To analyze the data, we used the participants' completeness scores for the experimental plans (EP_1, EP_2, and EP_3). Because there is no standard for assessing experimental plan completeness, in our study the completeness score of an experimental plan is defined as in Table 6.10. The overall score is defined as the sum of the item scores: each item scoring "Yes" gets 1 point, "Partially" 0.5 points, and "No" 0 points. The number of total items is defined as the total number of the instrument's items minus the total number of "Not applicable" items (which may vary from one assessment to another).

Table 6.10: Completeness Score Definition

Completeness = Overall Score / Total Items
where
Overall Score = SUM(Yes × 1 + Partially × 0.5 + No × 0)
Total Items = 33 items − (number of N/A items)
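For concreteness, the score in Table 6.10 can be computed in R (the language used for the analyses in this chapter) as in the following sketch; the vector of item answers is a hypothetical example, not data from the studies.

# Completeness of one assessment: each of the 33 items is answered
# "Yes", "Partially", "No", or "NA" (not applicable).
completeness <- function(answers) {
  overall <- sum(answers == "Yes") + 0.5 * sum(answers == "Partially")
  total   <- 33 - sum(answers == "NA")  # N/A items are excluded
  overall / total
}

# Hypothetical assessment: 20 Yes, 6 Partially, 5 No, 2 N/A
example <- c(rep("Yes", 20), rep("Partially", 6), rep("No", 5), rep("NA", 2))
completeness(example)  # (20 + 3) / 31 = 0.742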

We analyzed whether there is a statistically significant difference between the mean scores from beginner and expert researchers.

We reported descriptive statistics of the relevant variables, including mean (M), standard deviation (SD), median (Mdn), minimum (Min), and maximum (Max).

We tested whether the variables are normally distributed using the Shapiro-Wilk test. If a variable is normally distributed, we use the independent t-test. If a variable is not normally distributed, we use the one-sample Wilcoxon signed-rank test to determine whether there is a significant difference between the two measurements.

6.3.4.1 Inter-rater Agreement

- RQ1: To what extent do raters score experimental plans using the instrument in a similar manner?


We analyzed inter-rater agreement between raters with similar expertise and among all four raters.

1. Inter-rater agreement between raters with similar expertise

We analyzed inter-rater agreement on the completeness score between researchers with similar expertise, beginners and experts. The Bland-Altman method was used to determine the level of agreement between the two measurements. It calculates the mean of the differences between two evaluators, and the 95% limits of agreement as the mean of the differences ± 1.96 SD. If the participants agree, the mean of the differences should be close to zero and present no systematic variation.
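As an illustration, the Bland-Altman quantities can be computed in R as sketched below; the two score vectors are placeholders standing for the completeness scores of a pair of raters (one value per experimental plan).

rater1 <- c(0.50, 0.28, 0.47)            # placeholder completeness scores
rater2 <- c(0.68, 0.28, 0.37)

d     <- rater1 - rater2                 # differences per plan
bias  <- mean(d)                         # mean of the differences
upper <- bias + 1.96 * sd(d)             # 95% upper limit of agreement
lower <- bias - 1.96 * sd(d)             # 95% lower limit of agreement

plot((rater1 + rater2) / 2, d,           # Bland-Altman plot
     xlab = "Mean of the two raters", ylab = "Difference")
abline(h = c(bias, upper, lower), lty = c(1, 2, 2))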

2. Inter-rater agreement among the four researchers

In addition, we used the average deviation (AD) to analyze the inter-rater agreement among the four raters. The average deviation is one of various indices of variability used to characterize the dispersion among the measures in a given population. In our study, we assess the agreement achieved by the four participants on the overall completeness score of each experimental plan.

In order to calculate the difference between two raters' scores on a given item, we coded the assessment values as Yes = 0, Partially = 1, and No = 2, and gave N/A a distance of 3 from the value assigned by the rater with similar expertise. Therefore, N/A was treated as a special case as follows:

- if the rater with similar expertise assigned Yes, then Yes = 0 and N/A = 3;

- if the rater with similar expertise assigned Partially, then Partially = 1 and N/A = 4;

- if the rater with similar expertise assigned No, then No = 2 and N/A = 5.

AD indices are defined as follows:

AD_{Mdn}(j) = \frac{1}{K} \sum_{k=1}^{K} |x_{jk} - Mdn_j|

AD_{Mdn} = \frac{1}{J} \sum_{j=1}^{J} AD_{Mdn}(j)

where j = 1 to J items, k = 1 to K raters, x_{jk} is the kth rater's score on item j, and Mdn_j is the item median over raters. Values of AD_{Mdn} < 0.50 are considered to represent acceptable agreement [160]. We calculated AD_{Mdn} for each experimental plan.


6.3.4.2 Inter-Rater Reliability

- RQ2: To what extent do raters rank experimental plans using the instrument in a similar manner?

Intraclass correlation coefficients (ICCs) were used to determine inter-rater reliability. We analyzed ICC coefficients for the completeness score using a two-way random-effects model. ICC values below 0 indicate a lack of reliability, 0.01 to 0.20 slight, 0.21 to 0.40 fair, 0.41 to 0.60 moderate, 0.61 to 0.80 substantial, and 0.81 to 1 almost perfect reliability [161]. We aimed to obtain at least moderate reliability for each experimental plan.
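A minimal R sketch of this computation, assuming the irr package is available, is shown below; the score matrix is taken from Table 6.11 (plans in rows, raters in columns), and the choice of the agreement-type ICC is an assumption, since the thesis does not state it here.

library(irr)  # provides icc(); assumed to be installed

# Completeness scores: one row per experimental plan, one column per rater
scores <- matrix(c(0.50, 0.68, 0.53, 0.50,
                   0.28, 0.28, 0.29, 0.27,
                   0.47, 0.37, 0.67, 0.62),
                 nrow = 3, byrow = TRUE)

# Two-way random-effects model, as stated above
icc(scores, model = "twoway", type = "agreement", unit = "single")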

6.3.4.3 Criterion Validity

- RQ3: To what extent do a rater's ratings correlate with the rater's opinion about whether the experiment should proceed with the experimental plan?

- RQ4: To what extent do a rater's ratings correlate with the rater's opinion about whether the experiment is likely to be successful if it proceeds with the plan?

Kendall's tau-b rank correlation coefficients for each experimental plan were used to estimate whether the overall completeness scores (A) of the experimental plans are associated with the recommendations:

- whether the experiment should proceed (B);

- whether, if the experiment proceeds, it is likely to be successful (C).

Absolute values closer to 0 indicate little or no association; the closer the absolute value of the coefficient is to 1, the stronger the relationship between the variables being correlated. Values below 0.40 are considered weak, 0.41 to 0.69 moderate, 0.70 to 0.89 strong, and 0.90 to 1 perfect correlation. We expected a strong relationship between the recommendations and the overall completeness score, and therefore moderate to strong correlations.

6.3.5 Results

6.3.5.1 Completeness Scores from Researchers

Table 6.11 shows the completeness scores and completeness mean scores from the participants, and Table 6.12 shows descriptive statistics, including mean (M), standard deviation (SD), minimum (Min), and maximum (Max), of the completeness scores of the experimental plans for participants with similar expertise. The completeness scores ranged from 0.27 to 0.68 out of a maximum of 1. The raw data is described in Appendix E.


Table 6.11: Completeness scores from the researchers

               | EP_1               | EP_2               | EP_3
               | O_Sc   NA   Comp   | O_Sc   NA   Comp   | O_Sc   NA   Comp
Beginner 1     | 16.5   0    0.50   | 8.5    3    0.28   | 15     1    0.47
Beginner 2     | 21     2    0.68   | 8.5    3    0.28   | 11.5   2    0.37
Expert 1       | 17.5   0    0.53   | 9.5    0    0.29   | 22     0    0.67
Expert 2       | 15.5   2    0.50   | 9      0    0.27   | 20.5   0    0.62
Beginner means |             0.59   |             0.28   |             0.42
Expert means   |             0.52   |             0.28   |             0.65

O_Sc: Overall Score; NA: Not Applicable items; Comp: Completeness

Table 6.12: Descriptive statistics of the completeness scores of the experimental plans between researchers with similar expertise

Researchers | Minimum | Maximum | Mean   | Std. Deviation
Beginners   | 0.28    | 0.68    | 0.4300 | 0.15336
Experts     | 0.27    | 0.67    | 0.4800 | 0.16661

6.3.5.2 Difference between the completeness mean scores from Beginner and Expert Researchers

This section describes the detailed steps we went through to determine whether there is a statistically significant difference between the means from beginners and experts.

1. We tested whether the variables, the means of the completeness scores of EP_1, EP_2, and EP_3 from beginners and experts, are normally distributed using the Shapiro-Wilk test.

Hypothesis testing

H0: the population is normally distributed.
H1: the population is not normally distributed.
Alpha level: 0.05
The p-value must be less than the chosen alpha level for the null hypothesis to be rejected.

Interpretation

The p-values (0.8934 and 0.6753) are greater than the chosen alpha level (0.05), so the null hypothesis that the data came from a normally distributed population cannot be rejected.

2. We tested the equality of variances using Levene's test (F) because we did not reject the null hypothesis that the data came from a normal distribution.


Hypothesis testing
H0: The population variances are equal.
H1: The population variances are not equal.
Alpha level: 0.05
The p-value must be less than the chosen alpha level for the null hypothesis to be rejected.

Interpretation

The p-value (0.8124) is greater than the chosen alpha level (0.05), so the null hypothesis of equal variances cannot be rejected. We conclude that there is no evidence of a difference between the variances in the population.

3. We used the parametric t-test to test whether there is a statistically significant difference between the means from beginners and experts.

Hypothesis testing
H0: There is no statistically significant difference between the means.
H1: There is a statistically significant difference between the means.
Alpha level: 0.05

Interpretation

The p-value (0.7238) is greater than the chosen alpha level (0.05), so the null hypothesis cannot be rejected; we found no difference between the means. The mean from beginners is 0.43 and from experts is 0.48. We conclude that, using the proposed instrument, beginners assessed the completeness of the experimental plans in a similar manner to the experts in experimental software engineering.
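This sequence of tests can be reproduced in R roughly as sketched below, using the six completeness scores per group from Table 6.11 (cf. the descriptive statistics in Table 6.12); leveneTest() comes from the car package, and the exact data layout used in the thesis is an assumption of this sketch.

library(car)  # for leveneTest(); assumed to be installed

beginners <- c(0.50, 0.68, 0.28, 0.28, 0.47, 0.37)  # from Table 6.11
experts   <- c(0.53, 0.50, 0.29, 0.27, 0.67, 0.62)

shapiro.test(beginners)  # normality check for one group
shapiro.test(experts)    # normality check for the other group

scores <- c(beginners, experts)
group  <- factor(rep(c("beginner", "expert"), each = 6))
leveneTest(scores, group)                      # equality of variances

t.test(beginners, experts, var.equal = TRUE)   # independent t-test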

6.3.5.3 Inter-Rater Agreement between Raters with Similar Expertise: Beginner and Expert Researchers

This section presents the inter-rater agreement between beginner and expert researchers.

1. Beginners

This section shows the detailed steps we went through to determine whether there is proportional bias in the agreement measurement, that is, whether there is a level of agreement between the two beginner researchers.

(a) Data
B1 = (0.5, 0.28, 0.47)
B2 = (0.68, 0.28, 0.37)
Mean B1B2 = (0.59, 0.28, 0.42)


(b) Calculating the difference between the beginner researchers' scores

(c) Testing whether the completeness score differences of the two beginner researchers are normally distributed using the Shapiro-Wilk test.

i. Hypothesis testing
H0: The population is normally distributed.
H1: The population is not normally distributed.
The p-value must be less than the chosen alpha level for the null hypothesis to be rejected.

ii. Interpretation
The p-value (0.6878) is greater than the chosen alpha level (0.05), so the null hypothesis that the data came from a normally distributed population cannot be rejected. An assumption of the Bland-Altman limits of agreement is that the differences are normally distributed.

(d) Determining whether there is a significant difference between the two measurements

i. Hypothesis testing
H0: There is no statistically significant difference between the two measurements.
H1: There is a statistically significant difference between the two measurements.

ii. Interpretation
The p-value (0.7757) is greater than the chosen alpha level (0.05), so the null hypothesis cannot be rejected. We conclude that there is no statistically significant difference between the two measurements; we cannot assert that the two measurements differ.

(e) Constructing the Bland-Altman plot

i. Constructing a basic scatterplot
ii. Calculating the upper and lower limits

Upper limit = Mean + (SD * 1.96) = (-0.02667) + (0.2781) = 0.25
Lower limit = Mean - (SD * 1.96) = (-0.02667) - (0.2781) = -0.30

iii. Bland-Altman plot


See Figure 6.4.

Figure 6.4: Bland-Altman Plot - Beginners

(f) Linear regression

i. Hypothesis testing
H0: There is a level of agreement between the two measurements.
H1: There is no level of agreement between the two measurements.

ii. Interpretation
The p-value (0.527) is greater than the chosen alpha level (0.05), so the null hypothesis cannot be rejected. We cannot assert that there is a difference between the two measurements. See Figure 6.5.

Figure 6.5: Linear regression - Beginners

Regarding the completeness score, the mean of the differences between the beginner researchers is not statistically significantly different from zero (M = -0.02666667, SD = 0.141892, t = -0.32552, p-value = 0.7757). Also, we cannot assert that there is a difference between the two measurements (p-value = 0.527).

2. Experts

This section shows the detailed steps we went through to determine whether there is proportional bias in the agreement measurement, that is, whether there is a level of agreement between the two expert researchers.

(a) Data
E1 = (0.53, 0.29, 0.67)
E2 = (0.50, 0.27, 0.62)
Mean E1E2 = (0.52, 0.28, 0.65)

Page 144: LILIANE SHEYLA DA SILVA FONSECA · 2019. 10. 25. · Liliane Sheyla da Silva Fonseca AN INSTRUMENT FOR REVIEWING THE COMPLETENESS OF EXPERIMENTAL PLANS FOR CONTROLLED EXPERIMENTS

6.3. STUDY 2: INSTRUMENT VALIDATION 146

(b) Calculating the difference between the expert researchers' scores

(c) Testing whether the completeness score differences of the two expert researchers are normally distributed using the Shapiro-Wilk test.

i. Hypothesis testing
H0: The population is normally distributed.
H1: The population is not normally distributed.
The p-value must be less than the chosen alpha level for the null hypothesis to be rejected.

ii. Interpretation
The p-value (0.6369) is greater than the chosen alpha level (0.05), so the null hypothesis that the data came from a normally distributed population cannot be rejected. An assumption of the Bland-Altman limits of agreement is that the differences are normally distributed.

(d) Determining whether there is a significant difference between the two measurements

i. Hypothesis testing
H0: There is no statistically significant difference between the two measurements.
H1: There is a statistically significant difference between the two measurements.

ii. Interpretation
The p-value (0.06341) is greater than the chosen alpha level (0.05), so the null hypothesis cannot be rejected. We conclude that there is no statistically significant difference between the two measurements; we cannot assert that the two measurements differ.

(e) Constructing the Bland-Altman plot

i. Calculating the upper and lower limits
Upper limit = Mean + (SD * 1.96) = (0.033) + (0.03) = 0.063
Lower limit = Mean - (SD * 1.96) = (0.033) - (0.03) = 0.003

ii. Bland-Altman plot
See Figure 6.6.

Page 145: LILIANE SHEYLA DA SILVA FONSECA · 2019. 10. 25. · Liliane Sheyla da Silva Fonseca AN INSTRUMENT FOR REVIEWING THE COMPLETENESS OF EXPERIMENTAL PLANS FOR CONTROLLED EXPERIMENTS

6.3. STUDY 2: INSTRUMENT VALIDATION 147

Figure 6.6: Bland-Altman Plot - Experts

(f) Linear regression

i. Hypothesis testing
H0: There is a level of agreement between the two measurements.
H1: There is no level of agreement between the two measurements.

ii. Interpretation
The p-value (0.226) is greater than the chosen alpha level (0.05), so the null hypothesis cannot be rejected. We cannot assert that there is a difference between the two measurements. See Figure 6.7.

Page 146: LILIANE SHEYLA DA SILVA FONSECA · 2019. 10. 25. · Liliane Sheyla da Silva Fonseca AN INSTRUMENT FOR REVIEWING THE COMPLETENESS OF EXPERIMENTAL PLANS FOR CONTROLLED EXPERIMENTS

6.3. STUDY 2: INSTRUMENT VALIDATION 148

Figure 6.7: Linear regression - Experts

Regarding the completeness score, the mean of the differences between the expert researchers is not statistically significantly different from zero (M = 0.033, SD = 0.01527, t = 3.7796, p-value = 0.06341). Also, we cannot assert that there is a difference between the two measurements (p-value = 0.226).

6.3.5.4 Inter-Rater Agreement among Four Researchers

Table 6.13 shows the average deviation (AD_Mdn) indices achieved by the four researchers appraising each experimental plan on the completeness score.

Table 6.13: Average Deviation Indices

Experimental Plan | AD(Mdn)
EP_1              | 0.4886364
EP_2              | 0.4583333
EP_3              | 0.4166667

Page 147: LILIANE SHEYLA DA SILVA FONSECA · 2019. 10. 25. · Liliane Sheyla da Silva Fonseca AN INSTRUMENT FOR REVIEWING THE COMPLETENESS OF EXPERIMENTAL PLANS FOR CONTROLLED EXPERIMENTS

6.3. STUDY 2: INSTRUMENT VALIDATION 149

In order to calculate the average deviation (AD_Mdn) among the four researchers, we created a function in R (see Appendix F); a reconstruction is sketched below.
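The sketch below is a plausible reconstruction of that function from the AD_Mdn definition in Section 6.3.4.1, with item scores coded Yes = 0, Partially = 1, No = 2; the N/A special-case coding is omitted and the example matrix is hypothetical, so this is not the appendix code itself.

# Average deviation from the item median (AD_Mdn).
# 'ratings' is an items x raters matrix of numeric codes.
ad_mdn <- function(ratings) {
  per_item <- apply(ratings, 1, function(item) {
    mean(abs(item - median(item)))  # AD_Mdn(j) for item j
  })
  mean(per_item)                    # average over the J items
}

# Hypothetical 3-item, 4-rater example
r <- matrix(c(0, 0, 1, 0,
              1, 2, 1, 1,
              0, 1, 2, 2), nrow = 3, byrow = TRUE)
ad_mdn(r)  # 0.4167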

The four researchers achieved acceptable agreement (AD_Mdn < 0.50) on the overall completeness score. The contents of the input data files (e.g., Agree_EP_1.csv) are presented in Appendix E.

6.3.5.5 Inter-Rater Reliability

Table 6.14 shows the intraclass correlation coefficients (ICCs) for the completeness score for the two beginners, the two experts, and all researchers together.

Table 6.14: Inter-rater reliability of the instrument

Researchers | ICC Value | Interpretation
Beginners   | 0.791     | Substantial
Experts     | 0.998     | Almost perfect
All         | 0.875     | Almost perfect

Within the pairs of researchers with similar expertise, the instrument has substantial reliability (0.791) for the overall completeness scores of the beginner researchers and almost perfect reliability (0.998) for the expert researchers.

Considering the four raters together, the instrument has almost perfect reliability (ICC = 0.875) for the overall completeness score of the experimental plans, which means that researchers, both beginners and experts, ranked the experimental plans in a similar manner.

6.3.5.6 Criterion Validity

Table 6.15 shows the values given by the participants for the overall completeness score (A), whether the experiment should proceed (B), and whether, if the experiment proceeds, it is likely to be successful (C).

Page 148: LILIANE SHEYLA DA SILVA FONSECA · 2019. 10. 25. · Liliane Sheyla da Silva Fonseca AN INSTRUMENT FOR REVIEWING THE COMPLETENESS OF EXPERIMENTAL PLANS FOR CONTROLLED EXPERIMENTS

6.3. STUDY 2: INSTRUMENT VALIDATION 150

Table 6.15: Criterion Validity Values

           | EP_1             | EP_2             | EP_3
           | A     B*   C*    | A     B*   C*    | A     B*   C*
Beginner 1 | 0.50  3    3     | 0.28  2    2     | 0.47  3    3
Beginner 2 | 0.68  4    4     | 0.28  2    2     | 0.37  3    3
Expert 1   | 0.53  3    3     | 0.29  2    2     | 0.67  3    3
Expert 2   | 0.50  3    3     | 0.27  2    2     | 0.62  4    4
Mean       | 0.55  3.25 3.25  | 0.28  2    2     | 0.53  3.25 3.25

* Five-point rating scale from 1 (Strongly disagree) to 5 (Strongly agree)
A = Overall completeness score
B = Should the experiment proceed?
C = If the experiment proceeds, is it likely to be successful?

In order to calculate the correlation coefficients, we used an R script (sketched below). We calculated correlation coefficients between the mean scores of A and B, and of A and C. Because the values given by the participants for variables B and C were the same, the results observed for variable B are the same as for C. We found a strong correlation between the overall completeness scores and the recommendation of whether the experiment should proceed (τ_B = 0.816, p < 0.2), and between the overall completeness scores and the recommendation of whether, if the experiment proceeds, it is likely to be successful (τ_C = 0.816, p < 0.2).
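The original script is in the appendices; a minimal reconstruction from the means in Table 6.15 is sketched below. With these values, the tie between the EP_1 and EP_3 means makes tau-b the relevant statistic, and cor.test() reproduces the reported τ ≈ 0.816.

A <- c(0.55, 0.28, 0.53)  # mean overall completeness score per plan (Table 6.15)
B <- c(3.25, 2.00, 3.25)  # mean rating: should the experiment proceed?
C <- c(3.25, 2.00, 3.25)  # mean rating: likely to be successful if it proceeds?

cor.test(A, B, method = "kendall")  # tau-b; ties trigger a normal approximation
cor.test(A, C, method = "kendall")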

6.3.6 Summary Study 2

In the second study, we analyzed the level of agreement on the overall completeness score of the experimental plans between homogeneous groups of raters, namely beginner and expert researchers. In both cases, we cannot assert that there is a difference between the two measurements; the results showed no disagreement between the measurements. We also analyzed the measurements among the four researchers: the inter-rater agreement among them reached an acceptable level on the overall completeness score. The completeness mean scores from beginner and expert researchers likewise indicate no difference between the means.

Regarding inter-rater reliability, the results of study 2 indicate that the instrument has substantial reliability (ICC = 0.791) for the overall completeness scores of beginner researchers and almost perfect reliability (ICC = 0.998) for expert researchers. For the four researchers together, the results also indicate almost perfect reliability (ICC = 0.875).

Finally, in study 2 we found a strong correlation between the overall completeness scores and the suggestions of whether the experiment should proceed and, if it proceeds, whether it is likely to be successful, which indicates a positive relation between these variables and supports the efficacy of the instrument.


6.4 Study 3: Co-located Controlled Experiment

This experimental study aims to assess the usage of the proposed instrument through the execution of a controlled experiment. Of the four assessment studies performed in this research, it is the only one performed in a laboratory instead of remotely. This study was planned based on experimental software engineering practices [17], [21], [23], [24].

6.4.1 Experiment Definition

We have structured the objective of this study using the GQM template [46] to collect and analyze suitable metrics to assess the proposed instrument.

6.4.1.1 Global Goal

The main goal of this study is to analyze whether the instrument can reduce the chance of forgetting to include something important during the experiment planning phase.

6.4.1.2 Study Goal

The study goal is described as follows:
Analyze the experimental plan review instrument
for the purpose of evaluating its effect on reviewing experimental plans
with respect to how well the instrument helps to find missing elements in experimental plans, in comparison to doing a review without the instrument
from the point of view of post graduate students in Computer Science with knowledge in Experimental Software Engineering
in the context of experimental plans for controlled experiments using human subjects in Software Engineering.

6.4.1.3 Research Questions

The following research question is defined for this study:
RQ1: Can the usage of the proposed instrument reduce the chance of forgetting to include something important during the experiment planning phase, compared to the usage of ad hoc practices?

6.4.1.4 Measurement Goal

To answer RQ1, we collected the following metric:

- The number of items identified correctly in the experimental plan reviews (the concept of "items identified correctly" is presented in Section 6.4.1.5).


6.4.1.5 Metrics

For RQ1:

- Data representing the number of correct items identified in the experimental plan reviews.

To assess the differences between the usage of the proposed instrument and the usage of ad hoc practices, we chose the number of items identified correctly in the reviews compared to the reference model, because we believe that the proposed instrument may increase the chance of the experimenter remembering important factors in the experiment plan while still in the planning phase. We define ad hoc practices as the current method used by the participants when they review experiment plans.

Definition of "items identified correctly"We consider items identified correctly by a list of mistakes and miss elements that should

be contained in the experimental plans produced by post graduate students. The list of mistakeswas built by the professor of the experimental software engineering course which the experimen-tal plans came from. We called this list as reference model. The reference model is specific toeach experimental plan being reviewed, it is not a list of general kinds of errors and omissions.The process of developing the reference models is described in the Appendix L. The referencemodel checklist was used to compare with the results of mistakes found by the participants inthe experiment execution.

6.4.2 Planning

6.4.2.1 Context selection

We executed the experiment in an academic context with post graduate students in software engineering, who reviewed experimental plans using the proposed instrument and ad hoc practices.

6.4.2.2 Hypotheses formulation

We collected metrics to compare the completeness of the experimental plan reviews. Before setting the hypotheses, we introduce some symbols to represent the metrics collected and analyzed in this study. These symbols are used throughout this section.

The formal (statistical) hypotheses are built as follows:


- P: The mean of the proportions of correct items identified by the participants. Each proportion is given by "the total number of correct items found by the subject" divided by "the total number of items in the reference model".

- T: The usage of the proposed instrument.

- U: The usage of ad hoc practices.

This metric presents the following variations:

1. Experimental group

(a) P_T - The mean of the proportions of correct items identified by the participants using the proposed instrument.

2. Control group

(a) P_U - The mean of the proportions of correct items identified by the participants using ad hoc practices.

The following hypotheses definitions use the symbols described above.

- Null hypothesis: The main hypothesis is the null hypothesis, which states that there is no difference between using the proposed instrument and ad hoc practices. This study attempts to reject this hypothesis.

Null hypothesis (H0): The data defined by the metrics above are equal using the proposed instrument (T) or ad hoc practices (U). That means:

H0: Participants applying the proposed instrument have the same mean of proportions of correct items identified as the participants using ad hoc practices.

- H0: P_T = P_U

In addition, alternative hypotheses are defined to be accepted when the corresponding null hypothesis is rejected.

Alternative hypotheses

- Alternative hypothesis H1: The data defined by the metrics above using the proposed instrument (T) are smaller than the data collected using ad hoc practices (U). That means:

H1: Participants applying the proposed instrument have a smaller mean of proportions of correct items identified than the participants using ad hoc practices.

- H1: P_T < P_U


- Alternative hypothesis H2: The data defined by the metrics above using the proposed instrument (T) are greater than the data collected using ad hoc practices (U). That means:

H2: Participants applying the proposed instrument have a greater mean of proportions of correct items identified than the participants using ad hoc practices.

- H2: P_T > P_U

6.4.2.3 Variables selection

This section defines the dependent and independent variables of the experiment.

1. Independent variables (predictor variables)

(a) Experience of the participants

(b) Experimental plan to be reviewed

Although we selected post graduate students in computer science with knowledge in experimental software engineering, each one has different skills that can directly influence our metrics. To try to control this variable, we applied a demographic questionnaire to the subjects. The other independent variable is the experimental plan to be reviewed. To control this variable, all participants in the experiment review all the experimental plans.

2. Dependent variable (response variable)

(a) Completeness - the mean of the proportions of correct items identified by the participants.

The experimental plan review completeness was represented by the comparison between the number of items found by the participants and the total number of items contained in the reference model. The process of analyzing the participants' results against the reference models is described in Appendix L.

6.4.2.4 Participants

The study population comprises 115 post graduate students who were enrolled in and completed the experimental software engineering courses in seven classes between 2010.1 and 2015.2 at CIn-UFPE. We sent an e-mail inviting them to voluntarily participate in the instrument assessment (see Appendix M). Nine of them were available to perform the co-located experiment, while 28 were available to participate remotely (see Section 6.5). One Ph.D. student in experimental software engineering joined the co-located experiment team. Although he was not officially enrolled in the course, he attended it as a volunteer and has strong knowledge of the experimental software engineering field. Therefore, a total of ten students were planned for the co-located experiment. Of the ten, seven attended the co-located experiment. The participants were randomly distributed into the groups and participated voluntarily.

6.4.2.5 Experiment Design

An experiment consists of a set of tests of the treatments. To obtain the maximum value from an experiment, the set of tests is planned and designed carefully. The design of the experiment describes how these tests are organized and how they will be executed. The following sections describe how this experiment was designed.

Treatment

In this experiment, we want to control two factors: the approach used and the variety of experimental plans to be reviewed. In statistics, a full factorial experiment is an experiment whose design consists of two or more factors, each with discrete levels, and whose participants take on all possible combinations of these levels across all factors. A full factorial design may also be called a fully crossed design. Such an experiment allows us to study the effect of each factor on the dependent variable, as well as the effects of interactions between factors on the dependent variable [50]. We chose a 2×2 factorial design because it gives each factor two levels, resulting in four treatment combinations in total. The first factor, A, is the approach used, and the second factor, B, is the variety of experimental plans to be reviewed. Their levels are, respectively, the usage of ad hoc practices or of the proposed instrument, and Experimental Plan 1 (EP_1) or Experimental Plan 2 (EP_2). First, we randomly assigned four subjects to Group A to perform two treatments with one level each (ad hoc practices and EP_1), and the three remaining subjects were assigned to Group B to perform ad hoc practices and EP_2. Then the groups were inverted: Group B applied the proposed instrument to EP_1, and Group A applied the proposed instrument to EP_2 (see Table 6.16). This way, we ensured that all subjects were exposed to both approaches and both experimental plans.

Table 6.16: Experiment Design

EP variety* | Ad hoc practices | The proposed instrument
EP_1        | Group A          | Group B
EP_2        | Group B          | Group A

* The variety of the experimental plans that will be reviewed.
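As a side note, the four treatment combinations of a fully crossed design can be enumerated mechanically, e.g. in R:

# All combinations of the two factors (full factorial / fully crossed design)
expand.grid(approach = c("ad hoc practices", "proposed instrument"),
            plan     = c("EP_1", "EP_2"))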

This design follows the three principles of experiment design: replication, randomization, and local control. Replication, because the experimenter can replicate the design. Randomization, because the participants were randomly assigned into two groups. Finally, the design has local control because of the two known factors.

Pilot Experiment

The pilot was performed by two Ph.D. students who were involved neither in the real experiment nor in this research. We did not use the results from the pilot because its goal was just to give us experience of how participants would behave, how the metrics would be collected, and how the data would be analyzed. From the pilot, we solved some issues before the real experiment was run. The pilot lasted three hours. A protocol for developing a list of possible questions to be answered during data collection was created so as not to introduce bias in the study results. The protocol is described in Appendix K.

Experimental Object

The experimental objects were experimental plans from post graduate students of the Experimental Software Engineering course at UFPE. Appendix I describes the process for selecting the experimental plans. An empirical software engineering expert, the professor of that course, reviewed three experimental plans. He introduced some mistakes and removed some key elements from these three experimental plans. These reviews resulted in lists of mistakes and missing elements that should be found in them. We call these lists the reference models. Each experimental plan has a respective reference model that was used for comparison with the participants' results. In addition, the professor seeded the experimental plans with these defects without the knowledge of the experimenter (the author of the instrument), with the purpose of avoiding bias in the results of the experiment. One of the experimental plans was used in the training session, and the other two in the real experiment.

Experimental Materials

- Website containing all experimental materials. See Appendix D.

- The experimental plans. See Appendix I.

- Questionnaire about the instrument's acceptance, implemented with the SurveyGizmo tool. See Appendix N.

- Demographic questionnaire, implemented with the SurveyGizmo tool, where the subjects provide information about their experience in conducting experiments with human participants. See Appendix N.

- The proposed instrument, implemented with the SurveyGizmo tool. See Appendix M.

- Form for the participants to record mistakes and missing elements found during the analysis without the proposed instrument. See Appendix M.

Training Session and Dry Run

All participants took part in the training session. First, they performed a dry run using ad hoc practices for reviewing an experimental plan. Then, they performed a dry run applying the proposed instrument to an experimental plan. The experimental plan used in the dry run activity was not included in the object sample. In both cases, the participants collected metrics. They were free to ask any questions and resolve any confusion. After they had finished the activities, they submitted their results through online forms in the virtual environment. All questions were discussed and resolved before the session ended. We did not use these results in the data analysis because the goal of the dry run was for the participants to understand the experiment procedure while practicing it. We used these data only to check whether the metrics were properly collected. The training and dry run lasted 30 minutes per treatment.

Setting

The experiment occurred in a laboratory at CIn-UFPE on Monday, June 20, 2016. Each participant used one computer. Because all the experimental materials were in a virtual environment, participants were allowed to bring their own laptops.

Tasks

The participants had to review experimental plans to find mistakes or missing elements in them. First, they submitted the issues found using whatever material (book, guideline, tool) they normally use to review experimental plans. Then, they submitted the issues found using the proposed instrument.

Procedure

The experiment lasted four hours, as shown in Table 6.17. The participants took their places in a computer laboratory at CIn-UFPE, one computer per participant. The experiment procedure was supervised by a Ph.D. student in Experimental Software Engineering who was not involved in the research. The role of the supervisor was to prevent the experimenter from introducing bias into the data collection.

The experiment was split into two sessions, treatment 1 and treatment 2. In the first 15 minutes, the participants received an experiment overview. Second, they received training and executed the dry run for 30 minutes, focused on treatment 1, where they applied ad hoc practices for reviewing experimental plans. During the training and dry run, they were free to ask any questions. We reserved an extra ten minutes after the training and dry run for the participants to ask more questions and resolve any confusion. Before treatment 1 started, we gave a five-minute break.

The supervisor of the experiment randomly assigned the participants into two groups. The names of the participants were listed in alphabetical order in a spreadsheet. He drew a random sample for Group A using the SPSS tool. The names not randomly selected into Group A were put into Group B. Table 6.18 presents the random selection of the sample into groups.
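An equivalent of this random draw (done with SPSS in the study) can be sketched in R; the participant labels follow Table 6.18, and the seed is arbitrary.

set.seed(2016)                             # arbitrary, for reproducibility
participants <- paste0("P", 1:7)           # alphabetically ordered list
group_a <- sample(participants, size = 4)  # random draw for Group A
group_b <- setdiff(participants, group_a)  # the rest go to Group B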

Treatment 1 was performed for 40 minutes. Then, the participants had a 20-minute coffee break. After they returned from the break, they received training and executed the dry run for treatment 2 for 30 minutes. In treatment 2, they used the proposed instrument to find issues in the experimental plan. As in treatment 1, the participants were free to ask any questions, and an extra ten minutes was reserved at the end of the training and dry run. Although we did not exceed the reserved time in any treatment, if the participants had remained confused, we would have given them more time until they were comfortable using the respective treatment. After a five-minute break, they applied treatment 2 for 40 minutes. The remaining 35 minutes were used for the participants to answer the instrument's acceptance and demographic questionnaires. To submit all tasks in the experiment, they used the website. The execution of the experiment followed the schedule described in Table 6.17. The schedule of the instrument validation study 3 can be seen in Table 6.19.

Table 6.17: Co-located Controlled Experiment Schedule

Time      | Duration   | Activity
1:00-1:15 | 15 minutes | Experiment overview
1:15-1:45 | 30 minutes | Dry run - Treatment 1
1:45-1:55 | 10 minutes | Questions and answers
1:55-2:00 | 5 minutes  | Break
2:00-2:40 | 40 minutes | Applying treatment 1
2:40-3:00 | 20 minutes | Break
3:00-3:30 | 30 minutes | Dry run - Treatment 2
3:30-3:40 | 10 minutes | Questions and answers
3:40-3:45 | 5 minutes  | Break
3:45-4:25 | 40 minutes | Applying treatment 2
4:25-5:00 | 35 minutes | Instrument's acceptance

Table 6.18: Sample Random Selection

Group A | Group B
P5      | P6
P3      | P4
P1      | P2
P7      |


Table 6.19: Schedule of the Instrument Validation 3

Study Phase     | Activity                          | Date                     | Duration
Invitations     | Send e-mail to pilot participants | May 2, 2016              | —
Invitations     | Send e-mail to study participants | May 31, 2016             | —
Pilot           | Sandbox                           | June 16, 2016            | 1 hour
Pilot           | Pilot                             | June 17, 2016            | 1 hour
Pilot           | Adjustment                        | June 18, 2016            | 1 day
Instructions    | Training and dry run 1            | June 20, 2016            | 20 minutes
Instructions    | Training and dry run 2            | June 20, 2016            | 20 minutes
Data Collection | Treatment 1                       | June 20, 2016            | 40 minutes
Data Collection | Treatment 2                       | June 20, 2016            | 40 minutes
Data Analysis   | Data analysis                     | June 21 to June 28, 2016 | 8 days
Adjustment      | Adjustment                        | June 29 to July 1, 2016  | 3 days

Drop outs

There were no dropouts in the experiment: all seven participants who started the experiment finished its execution. Ten participants were planned for the experiment; however, three of them could not attend the co-located controlled experiment, so we allocated them to the remote experiment.

6.4.3 Data Analysis

The data were quantitatively analyzed by the instrument's first author using IBM SPSS Statistics 23 [137] and R [136]. Two coders, who are not involved in this research, checked and repeated all analyses. They are Ph.D. students in computer science.

The experimenter (the author of the proposed instrument) was not involved in the analysis of the items identified correctly, in order not to introduce bias into the results. Therefore, the two coders were also in charge of assessing the answers given by the participants. The analysis process is described in Appendix L. The professor created two reference models, one for experimental plan 1 and the other for experimental plan 2. Reference model 1, for experimental plan 1, contains 23 issues, and reference model 2, for experimental plan 2, contains 26 issues. The coders judged each item from the participants' lists as Yes, Partially, or No. We considered the proportion of items identified correctly as below:

P = Sum(Yes + Partially) / Total number of known issues in the reference model
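In R, this proportion can be computed as in the sketch below; the vector of coder judgments is a hypothetical example.

# Proportion of items identified correctly by one participant.
# 'judgments' holds the coders' verdict for each reported item
# ("Yes", "Partially", or "No"); 'total' is the number of issues
# in the reference model (23 for EP_1, 26 for EP_2).
proportion_correct <- function(judgments, total) {
  sum(judgments %in% c("Yes", "Partially")) / total
}

# Hypothetical participant: 3 correct, 1 partially correct, 2 wrong
proportion_correct(c("Yes", "Yes", "Yes", "Partially", "No", "No"), total = 23)  # 4/23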

We used the Shapiro-Wilk test to check whether the variables are normally distributed, and the Wilcoxon test to analyze whether there is a statistically significant difference between the variables P_U and P_T.


6.4.4 Results

Tables 6.20 and 6.21 show the results of the coders' analysis, performed according to the process described in Appendix L. The raw data is presented in Appendix E.

Table 6.20: Study 3 - Experimental Plan 1 (23 issues)

Treatment | Participant ID | P
P_U       | ID_1           | 4/23
P_U       | ID_3           | 3/23
P_U       | ID_5           | 4/23
P_U       | ID_7           | 4/23
P_T       | ID_9           | 23/23
P_T       | ID_11          | 22/23
P_T       | ID_13          | 22/23

Table 6.21: Study 3 - Experimental Plan 2 (26 issues)

Treatment | Participant ID | P
P_U       | ID_2           | 5/26
P_U       | ID_4           | 6/26
P_U       | ID_6           | 6/26
P_T       | ID_1           | 20/26
P_T       | ID_3           | 20/26
P_T       | ID_5           | 17/26
P_T       | ID_7           | 22/26

Figure 6.8 presents the raw data of this study, which include the number of reported items that the coders judged as correct (Yes), partially correct (Partially), and wrongly reported mistakes or missing elements (No).

Figure 6.8: Items Identified Correctly Raw Data Study 3

In the following, we determine whether there is a statistically significant difference between the proportions of items identified correctly in P_U and P_T.

1. We tested whether the variables, the proportions of items identified correctly in EP_1 and EP_2 from the participants' lists, are normally distributed using the Shapiro-Wilk test.

Hypothesis testing

H0: the population is normally distributed.
H1: the population is not normally distributed.
Alpha level: 0.05
The p-value must be less than the chosen alpha level for the null hypothesis to be rejected.

> PU=c(0.17,0.13,0.17,0.17,0.19,0.23,0.23)

Interpretation

The p-value (0.2917) is greater than the chosen alpha level (0.05), so the null hypothesis that the data came from a normally distributed population cannot be rejected.

> PT=c(1,0.96,0.96,0.77,0.77,0.65,0.85)

Interpretation

The p-value (0.4666) is greater than the chosen alpha level (0.05), so the null hypothesis that the data came from a normally distributed population cannot be rejected.

2. We tested the equality of variances using Levene's test (F) because we did not reject the null hypothesis that the data came from a normal distribution.

Hypothesis testing
H0: The population variances are equal.
H1: The population variances are not equal.
Alpha level: 0.05

The p-value must be less than the chosen alpha level for the null hypothesis to be rejected.

Interpretation

The p-value (0.006805) is smaller than the chosen alpha level (0.05), so the null hypothesis of equal variances is rejected. We conclude that the variances in the population differ.

3. We used the parametric t-test to test whether there is a statistically significant difference between the variables P_U and P_T.

Hypothesis testing
H0: There is no statistically significant difference between the variables P_U and P_T.
H1: There is a statistically significant difference between the variables P_U and P_T.

The p-value must be less than the chosen alpha level for the null hypothesis to be rejected.


P-value < 0.05

Interpretation

The null hypothesis is rejected because the p-value (3.632e-06) is less than the chosen alpha level (0.05). We found a statistically significant difference between the variables P_U and P_T.

The mean of the proportions of correct items in P_T is greater than in P_U.

Figure 6.9 shows the comparison of the success rates between the usage of ad hoc practices and of the proposed instrument. The mean proportion of correct items in P_T (0.8514286) is greater than in P_U (0.1842857). We conclude that the proportion of correct items identified by the participants using the proposed instrument (P_T) is greater than the mean proportion of correct items identified by the participants using ad hoc practices (P_U).

Figure 6.9: Study 3- Success Rate Comparison
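This comparison can be reproduced in R from the P_U and P_T vectors listed above; the thesis does not state which t-test variant was run, so the use of R's default (Welch's t-test, which does not assume equal variances and is consistent with the Levene result) is an assumption of this sketch.

PU <- c(0.17, 0.13, 0.17, 0.17, 0.19, 0.23, 0.23)  # ad hoc practices
PT <- c(1.00, 0.96, 0.96, 0.77, 0.77, 0.65, 0.85)  # proposed instrument

shapiro.test(PU)  # p = 0.2917: normality not rejected
shapiro.test(PT)  # p = 0.4666: normality not rejected

t.test(PT, PU)    # Welch's t-test (R default, var.equal = FALSE)

mean(PT)          # 0.8514...
mean(PU)          # 0.1843...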

6.4.5 Summary Study 3

In the first part of the experiment, the participants applied ad hoc practices to identify mistakes and missing elements in an experimental plan. They correctly reported, on average, 18% of the total mistakes and missing elements, a low percentage. Unlike ad hoc practices, the usage of the proposed instrument increased this number almost fivefold: the success rate was 86%. The proportion of correct items identified using ad hoc practices ranged from 0.13 to 0.23 out of a maximum of 1, while using the instrument it ranged from 0.65 to 1. Although we cannot generalize the results, they indicate that using the checklist for reviewing experimental plans is an efficient way to find mistakes and problems compared with ad hoc practice.


6.5 Study 4: Remote Controlled Experiment

This study has the same experimental design as Study 3. However, the two studies differ in the number of participants, setting, procedure, and schedule.

6.5.1 Participants

The study population comprises the 115 post graduate students who were enrolled in and completed the experimental software engineering courses in seven classes between 2010.1 and 2015.2 at CIn-UFPE, as described in Section 6.4.2.4. 28 participants were available to participate remotely in this study. The three subjects who could not attend the co-located experiment were also invited to perform this study. The 31 participants were randomly assigned into two groups, one with 16 participants and the other with 15. The random assignment process was the same as in Study 3. Although 31 participants started the assessment, only 22 completed the study. We did not use the data of any of the nine participants who did not complete the tasks. As a result, 13 participants remained in one group and 9 in the other. Although the number of participants in each group was not the same, this did not cause the groups to be unbalanced, because each treatment was applied by the same number of participants, that is, 22 participants per treatment.

6.5.2 Setting

This study was performed remotely, and all experimental materials were available on the website.

6.5.3 Procedure

The participants performed the experiment remotely over the course of one week. The time set for each treatment was the same as in the co-located experiment, 40 minutes. The schedule of the remote study is shown in Table 6.22.

The experiment was split into two sessions, treatment 1 and treatment 2. The participants received instructions and an experiment overview. They did not execute a dry run because we wanted to observe how accurate the set of instructions was and how the participants used the instrument without performing a previous dry run. However, they were free to ask any questions. We randomly assigned the 31 subjects into two groups, one for each combination of levels (see Table 6.16). Their names were listed in alphabetical order in an SPSS spreadsheet. We drew a random sample for Group A using the SPSS tool. The names not randomly selected into Group A were put into Group B: 16 participants were assigned to Group A and 15 to Group B. As in Study 3, we used the same experimental design. Each group had access to a specific website for its respective treatment.


After completing treatments 1 and 2, the participants answered the instrument's acceptance and demographic questionnaires. The experimental material is described in Appendix M.

To submit all tasks in the experiment, they used the website. The execution of the experiment followed the schedule described in Table 6.22. The schedule of the instrument validation study 4 can be seen in Table 6.23.

Table 6.22: Remote Controlled Experiment Schedule

Date          | Activity        | Duration  | Environment
June 20, 2016 | Instructions    | As needed | e-mail and Skype
June 21, 2016 | Data collection | 7 days    | Website

Table 6.23: Schedule of the Instrument Validation 4

Study Phase     | Activity                          | Date                     | Duration
Invitations     | Send e-mail to pilot participants | May 2, 2016              | —
Invitations     | Send e-mail to study participants | May 31, 2016             | —
Pilot           | Sandbox                           | June 16, 2016            | 1 hour
Pilot           | Pilot                             | June 17, 2016            | 1 hour
Pilot           | Adjustment                        | June 18, 2016            | 1 day
Instructions    | Instructions                      | June 19, 2016            | As needed
Data Collection | Data collection                   | June 21, 2016            | 7 days
Data Analysis   | Data analysis                     | June 28 to July 5, 2016  | 8 days
Adjustment      | Adjustment                        | July 6 to July 8, 2016   | 3 days

Drop outs

A total of nine participants dropped out of the experiment. Five of them did not even start the experiment, while two did not complete the treatments. The data of two other participants were discarded because the coders observed the following problems:

• There was no answer at all to treatment 1.

• All answers to treatment 2 were marked as "Yes", and the participant spent only a few seconds on each page, an insufficient time even to read the items.

6.5.4 Data Analysis

The data analysis regarding the items identified correctly and the instrument's acceptance was performed as described in Study 3.


6.5.5 Results

Tables 6.24 and 6.25 show the results of the coders' analysis, which followed the process described in Appendix L. The proportion of items identified correctly is the number of items the coders judged correct or partially correct, divided by the total number of issues in the plan.
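As a minimal illustration of this computation in R (the counts are hypothetical, not taken from the tables):

> n_correct <- 10; n_partial <- 3; n_issues <- 23  # hypothetical counts for one participant
> (n_correct + n_partial) / n_issues               # proportion of items identified correctly
[1] 0.5652174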

Table 6.24: Study 4 - Experimental Plan 1 (23 issues)

Treatment | Participant ID | P
PU        | ID_1           | 1/23
PU        | ID_2           | 5/23
PU        | ID_3           | 2/23
PU        | ID_4           | 0/23
PU        | ID_5           | 5/23
PU        | ID_6           | 3/23
PU        | ID_7           | 2/23
PU        | ID_8           | 5/23
PU        | ID_9           | 4/23
PU        | ID_10          | 9/23
PU        | ID_11          | 0/23
PU        | ID_12          | 0/23
PU        | ID_14          | 2/23
PU        | ID_17          | 5/23
PT        | ID_24          | 13/23
PT        | ID_25          | 13/23
PT        | ID_27          | 13/23
PT        | ID_28          | 17/23
PT        | ID_29          | 19/23
PT        | ID_30          | 21/23
PT        | ID_31          | 21/23
PT        | ID_32          | 12/23

Table 6.25: Study 4 - Experimental Plan 2 (26 issues)

Treatment | Participant ID | P
PU        | ID_15          | 1/26
PU        | ID_16          | 2/26
PU        | ID_18          | 4/26
PU        | ID_19          | 3/26
PU        | ID_20          | 2/26
PU        | ID_21          | 0/26
PU        | ID_22          | 2/26
PU        | ID_23          | 2/26
PT        | ID_33          | 20/26
PT        | ID_34          | 14/26
PT        | ID_35          | 13/26
PT        | ID_36          | 21/26
PT        | ID_37          | 15/26
PT        | ID_38          | 16/26
PT        | ID_39          | 24/26
PT        | ID_40          | 26/26
PT        | ID_41          | 25/26
PT        | ID_42          | 10/26
PT        | ID_43          | 2/26
PT        | ID_44          | 21/26
PT        | ID_45          | 12/26
PT        | ID_26          | 6/26

Figure 6.10 presents the raw data of this study, which include the number of reported items that the coders judged as correct (Yes), partially correct (Partially), and wrongly reported mistakes or missing elements (No).


Figure 6.10: Items Identified Correctly - Raw Data, Study 4

In the following, we determine whether there is a statistically significant difference between the proportions of items identified correctly in PU and PT.

1. We tested whether the variables (the proportions of items identified correctly for EP_1 and EP_2, over the lists of participants) are normally distributed, using the Shapiro-Wilk test.

(a) Hypothesis testing

H0: the population is normally distributed
H1: the population is not normally distributed
Alpha level: 0.05
The p-value must be less than the chosen alpha level for the null hypothesis to be rejected.

> PU = c(0.04, 0.21, 0.08, 0, 0.22, 0.13, 0.08, 0.21, 0.17, 0.39, 0, 0, 0.08, 0.21, 0.04, 0.08, 0.15, 0.12, 0.08, 0, 0.08, 0.08)
> shapiro.test(PU)  # Shapiro-Wilk normality test; reported p-value: 0.01499

(b) Interpretation

The p-value (0.01499) is less than the chosen alpha level (0.05), so the null hypothesis is rejected and there is evidence that the tested data (PU) are not from a normally distributed population.

> PT = c(0.57, 0.57, 0.57, 0.74, 0.83, 0.91, 0.91, 0.52, 0.77, 0.54, 0.50, 0.81, 0.58, 0.62, 0.92, 1, 0.96, 0.38, 0.08, 0.81, 0.46, 0.23)
> shapiro.test(PT)  # Shapiro-Wilk normality test; reported p-value: 0.2979

(c) Interpretation


The p-value (0.2979) is greater than the chosen alpha level (0.05), so the null hypothesis that the data came from a normally distributed population cannot be rejected.

2. We used a non-parametric test (Wilcoxon) to check whether there is a statistically significant difference between the variables PU and PT (a sketch of the corresponding R call is shown below).

(a) Hypothesis testing

H0: There is no statistically significant difference between the variables PU and PT.
H1: There is a statistically significant difference between the variables PU and PT.
The p-value must be less than the chosen alpha level for the null hypothesis to be rejected: p-value < 0.05.

(b) Interpretation

The null hypothesis is that there is no statistically significant difference between the variables PU and PT. As the p-value turns out to be 8.888e-08, which is less than the 0.05 significance level, we reject the null hypothesis. Therefore, we found a statistically significant difference between the variables PU and PT.

The mean proportion of correct items in PT is greater than in PU.

We concluded that the proportion of correct items identified by the participants using the proposed instrument (PT) is greater than the proportion of correct items identified by the participants using ad hoc practices (PU).

We found a statistically significant difference between the variables PU and PT: we rejected the null hypothesis because the p-value turns out to be 8.888e-08, which is less than the 0.05 significance level. Figure 6.11 shows the comparison of success rates between the usage of ad hoc practices and the proposed instrument. The mean proportion of correct items in PT (0.6490909) is greater than in PU (0.1113636). We concluded that the proportion of correct items identified by the participants using the proposed instrument (PT) is greater than the proportion identified by the participants using ad hoc practices (PU).
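For reference, a minimal sketch of the test call in R, using the PU and PT vectors defined above. The thesis does not state whether the paired or unpaired form of the Wilcoxon test was run, so both calls are shown as assumptions:

> wilcox.test(PU, PT)                 # unpaired form (Mann-Whitney), two-sided
> wilcox.test(PU, PT, paired = TRUE)  # paired form, since each subject yields one PU and one PT score
> mean(PT); mean(PU)                  # 0.6490909 and 0.1113636, the means reported above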


Figure 6.11: Study 4 - Success Rate Comparison

6.5.6 Summary Study 4

In the first part of the experiment, the participants applied ad hoc practices to identify mistakes and missing elements in an experimental plan. On average, they correctly identified 11% of the total mistakes and missing elements. Using the proposed instrument increased this number roughly six-fold, that is, the success rate was 65%. The proportion of correct items identified using ad hoc practices ranged from 0.00 to 0.39 of a maximum of 1, while using the instrument it ranged from 0.08 to 1. Although we cannot generalize the results, they indicate that using the checklist for reviewing experimental plans is a more efficient way to find mistakes and problems than ad hoc practice.
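The roughly six-fold increase follows directly from the two reported means; a one-line check in R with the PU and PT vectors defined earlier:

> mean(PT) / mean(PU)  # ratio of the mean success rates
[1] 5.828571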

6.6 Instrument’s Acceptance

The instrument's acceptance survey was implemented with the SurveyGizmo tool¹ (see Appendix N). We analyzed the instrument's acceptance through the following three research questions:

• Appropriateness - To what extent do evaluators believe that the instrument is appropriate for reviewing experimental plans for controlled experiments using participants in Software Engineering?

• Perceived usefulness - To what extent do evaluators believe that using the instrument would enhance their performance in planning Software Engineering controlled experiments with participants?

• Perceived ease of use - To what extent do evaluators believe that using the instrument would be free of effort?

¹ https://app.surveygizmo.com/


All participants from the four studies assessed the instrument regarding its acceptance after they participated in their respective study. In order to collect their perceptions regarding the appropriateness of the instrument, we split it into two parts: the instrument's fitness for purpose (four questions) and the item's appropriateness (five questions). Therefore, to assess the instrument's appropriateness, we asked the participants to what degree they agreed with statements about the instrument's fitness for purpose and the item's appropriateness. To measure perceived usefulness, we tailored the questions from the Technology Acceptance Model (TAM) [162]. Perceived usefulness and ease of use were measured using five statements each. The response scale was a five-point rating scale from 1 (Strongly disagree) to 5 (Strongly agree). The participants also answered five open questions giving feedback on the instrument, which was qualitatively analyzed through meetings. The 24 sentences in total are presented in the bullets below.

Appropriateness: (A) Fitness for Purpose + (B) Item's Appropriateness

(A) Fitness for Purpose - measured using four statements:

• (1) The instrument supports experimenters in assessing the completeness of the experimental plan.

• (2) The instrument identifies potential biases that were not identified at the beginning.

• (3) The instrument is useful for inexperienced experimenters.

• (4) The instrument is of value to experienced experimenters.

(B) Item's Appropriateness - measured using five statements:

• (5) I find the questions of the instrument adequate.

• (6) I find the recommendations (things to consider and hints) helpful.

• (7) I find the set of questions to be complete.

• (8) I find the number of questions adequate.

• (9) I find the order of questions adequate.

Perceived Usefulness - measured using five statements:

• (10) Using the instrument would give me greater control over my experimental planning.

• (11) The instrument would help me to complete my review in a reasonable amount of time.

• (12) The instrument supports critical aspects of my experimental plan.


• (13) I find that this instrument would be useful for reviewing experimental plans.

• (14) I would recommend this instrument to my colleagues and friends.

Ease of use - measured using five statements:

• (15) The instructions for using the instrument are clear.

• (16) It is easy for me to remember how to use the instrument.

• (17) The instrument provides helpful guidance in reviewing an experiment.

• (18) I find the instrument easy to use.

• (19) I am able to efficiently complete my review using this instrument.

Feedback about the instrument

We qualitatively analyzed the instrument through five open questions:

• (20) From your perspective, is the proposed instrument an effective support for reviewing an experimental plan for completeness? Why or why not?

• (21) In your opinion, is the instrument easy to use for reviewing an experimental plan?

• (22) In your opinion, does the instrument lack any important component?

• (23) Do you have any improvements to suggest for this instrument?

• (24) Are there any comments or suggestions about the instrument that you would like to share?

For each study, we analyzed the internal consistency of each variable mentioned above (fitness for purpose, item's appropriateness, perceived usefulness, and ease of use), except for study 1, because of its low number of participants. Internal consistency measures how closely related a set of items is as a group measuring the target variable. We used Cronbach's alpha to measure internal consistency; an acceptable reliability coefficient is higher than 0.70 [163].
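As an illustration, a minimal sketch of this computation in R with the psych package (the data frame and the responses in it are hypothetical, not the study data):

> library(psych)
> # hypothetical responses: rows = participants, columns = the four fitness-for-purpose items
> fp <- data.frame(Item_1 = c(5, 4, 5, 4),
+                  Item_2 = c(4, 4, 5, 3),
+                  Item_3 = c(5, 5, 4, 4),
+                  Item_4 = c(4, 5, 4, 4))
> alpha(fp)  # reports raw and standardized Cronbach's alpha for the item set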

6.6.1 Instrument’s Acceptance Results

In this section, we describe the instrument's acceptance results according to the research questions regarding (a) appropriateness (fitness for purpose and item's appropriateness) (see Section 6.6.1.2), (b) perceived usefulness (see Section 6.6.1.3), (c) perceived ease of use (see Section 6.6.1.4), and (d) the qualitative analysis of the instrument's feedback for each study (see Section 6.6.1.5). In Section 6.6.1.1, we present an overview of the results of the four evaluation studies. The raw data of the studies are presented in Appendix E.


6.6.1.1 Instrument's Acceptance: Overview of Results

In this section, we discuss the appropriateness of the instrument in terms of the mean fitness for purpose (FP) and item's appropriateness (IA) scores of experts and beginners, and the relation between the mean values of perceived ease of use (PEOU) and perceived usefulness (PU). The scores of each subject were averaged over the items belonging to each goal: fitness for purpose (four items), item's appropriateness (five items), perceived usefulness (five items), and perceived ease of use (five items).
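A minimal sketch of this per-goal averaging in R (the survey data frame, its column names, and the random responses are hypothetical, not the study data):

> # hypothetical data frame: one row per subject, columns Item_1 ... Item_9 on a 1-5 scale
> survey <- as.data.frame(matrix(sample(1:5, 9 * 6, replace = TRUE), ncol = 9,
+                                dimnames = list(NULL, paste0("Item_", 1:9))))
> FP <- rowMeans(survey[, paste0("Item_", 1:4)])  # mean fitness-for-purpose score per subject
> IA <- rowMeans(survey[, paste0("Item_", 5:9)])  # mean item's-appropriateness score per subject
> plot(FP, IA)                                    # scatterplot analogous to Figure 6.12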

Figure 6.12 presents scatterplots for fitness for purpose and item's appropriateness, which illustrate the relationship between the mean values of these two variables.

Figure 6.12: Scatterplots for the FP and IA by expertise

34 out of 35 participants (expert and beginner researchers) believe that the instrument is appropriate for reviewing experimental plans for controlled experiments using participants in software engineering (mean score >= 3). Six out of 35 participants found the number of questions inadequate (score < 3). Although they think the instrument is too long, it needs to be extensive because of its coverage of important aspects of experimental plans. Table 6.26 shows descriptive statistics for these variables.

Table 6.26: Descriptive statistics for the appropriateness variables: FP and IA

Statistic          | FP   | IA
Mean               | 4.27 | 4.03
Standard Deviation | 0.48 | 0.53

Figure 6.13 presents scatterplots for perceived usefulness and perceived ease of use, which illustrate the relationship between the mean values of these two variables.


Figure 6.13: Scatterplots for the PU and PEOU by expertise

33 out of 35 participants (expert and beginner researchers) perceived that the instrument would enhance their performance in planning software engineering controlled experiments with participants and that using it would be free of effort (mean score > 3). However, two beginners from study 4 did not find the instructions for using the instrument clear. One of them did not find the instrument easy to use, and another did not think it was easy to remember how to use it. Table 6.27 shows descriptive statistics for these variables.

Table 6.27: Descriptive statistics for the perceived usefulness and ease of use variables

Statistic          | PU   | PEOU
Mean               | 4.29 | 4.14
Standard Deviation | 0.53 | 0.55

The following sections discuss the instrument's acceptance results in detail for each evaluation study.

6.6.1.2 Appropriateness - To what extent do evaluators believe that the instrument is appropriate for reviewing experimental plans for controlled experiments using participants in Software Engineering?

As described previously, appropriateness is composed of two parts: fitness for purpose and item's appropriateness. Because study 1 had just two participants, we did not calculate the internal consistency of the items.

1. Study 1 Results

All items measuring appropriateness indicated agreement, to varying degrees, that the instrument is fit for its intended purpose. Item 1 (The instrument supports experimenters in assessing the completeness of the experimental plan) received the strongest support, while the response to Item 2 (The instrument identifies potential biases that were not identified at the beginning) was the weakest. With the exception of Item 8, all the items measuring appropriateness were nearly uniformly positive.


The results for Item 8, however, show a disagreement between the two participants on whether the length of the survey was appropriate. See Tables 6.28 and 6.29 for the appropriateness values.

Table 6.28: Fitness for purpose Values

Item   | Value                      | Percentage | Count
Item_1 | Strongly agree             | 100.0%     | 2
Item_2 | Neither agree nor disagree | 50.0%      | 1
Item_2 | Agree                      | 50.0%      | 1
Item_3 | Agree                      | 50.0%      | 1
Item_3 | Strongly agree             | 50.0%      | 1
Item_4 | Agree                      | 50.0%      | 1
Item_4 | Strongly agree             | 50.0%      | 1

Table 6.29: Item's Appropriateness Values

Item   | Value          | Percentage | Count
Item_5 | Agree          | 100.0%     | 2
Item_6 | Agree          | 100.0%     | 2
Item_7 | Agree          | 100.0%     | 2
Item_8 | Disagree       | 50.0%      | 1
Item_8 | Agree          | 50.0%      | 1
Item_9 | Agree          | 50.0%      | 1
Item_9 | Strongly agree | 50.0%      | 1

2. Study 2 Results

Almost all items were consistently positive. Tables 6.30 and 6.31 indicate agreement regarding the appropriateness of the instrument. Items 1 (The instrument supports experimenters in assessing the completeness of the experimental plan) and 5 (I find the questions of the instrument adequate) presented the highest percentages and Item 8 (I find the number of questions adequate) presented the lowest; in addition, for Items 6 (I find the recommendations (things to consider and hints) helpful) and 7 (I find the set of questions to be complete), one of the participants disagreed.

Table 6.30: Fitness for purpose Values

Item   | Value                      | Percentage | Count
Item_1 | Agree                      | 50.0%      | 2
Item_1 | Strongly agree             | 50.0%      | 2
Item_2 | Neither agree nor disagree | 25.0%      | 1
Item_2 | Agree                      | 25.0%      | 1
Item_2 | Strongly agree             | 50.0%      | 2
Item_3 | Agree                      | 25.0%      | 1
Item_3 | Strongly agree             | 75.0%      | 3
Item_4 | Neither agree nor disagree | 25.0%      | 1
Item_4 | Agree                      | 50.0%      | 2
Item_4 | Strongly agree             | 25.0%      | 1


Table 6.31: Item's Appropriateness Values

Item   | Value                      | Percentage | Count
Item_5 | Agree                      | 50.0%      | 2
Item_5 | Strongly agree             | 50.0%      | 2
Item_6 | Disagree                   | 25.0%      | 1
Item_6 | Agree                      | 25.0%      | 1
Item_6 | Strongly agree             | 50.0%      | 2
Item_7 | Disagree                   | 25.0%      | 1
Item_7 | Agree                      | 25.0%      | 1
Item_7 | Strongly agree             | 50.0%      | 2
Item_8 | Neither agree nor disagree | 50.0%      | 2
Item_8 | Agree                      | 25.0%      | 1
Item_8 | Strongly agree             | 25.0%      | 1
Item_9 | Agree                      | 75.0%      | 3
Item_9 | Strongly agree             | 25.0%      | 1

The alpha coefficient for fitness for purpose (four items) is 0.824 and for item's appropriateness (five items) is 0.829. Because both results are higher than 0.70, the appropriateness (fitness for purpose + item's appropriateness) also has high internal consistency. Figures F.5 and F.6 present the data analysis of the internal consistency of fitness for purpose and item's appropriateness, respectively.

3. Study 3 Results

All items measuring fitness for purpose received a strongly positive evaluation from all participants. Table 6.32 presents the values for fitness for purpose. With respect to item's appropriateness, the majority of items had positive feedback. However, for items 6 and 8, 1 out of 7 participants disagreed on whether the hints are valuable and whether the length of the instrument was adequate. See the complete percentage values in Table 6.33.


Table 6.32: Fitness for purpose Values

Item   | Value                      | Percentage | Count
Item_1 | Agree                      | 42.9%      | 3
Item_1 | Strongly agree             | 57.1%      | 4
Item_2 | Neither agree nor disagree | 14.3%      | 1
Item_2 | Agree                      | 57.1%      | 4
Item_2 | Strongly agree             | 28.6%      | 2
Item_3 | Agree                      | 57.1%      | 4
Item_3 | Strongly agree             | 42.9%      | 3
Item_4 | Neither agree nor disagree | 14.3%      | 1
Item_4 | Agree                      | 57.1%      | 4
Item_4 | Strongly agree             | 28.6%      | 2

Table 6.33: Item's Appropriateness Values

Item   | Value                      | Percentage | Count
Item_5 | Agree                      | 71.4%      | 5
Item_5 | Strongly agree             | 28.6%      | 2
Item_6 | Disagree                   | 25.0%      | 1
Item_6 | Agree                      | 42.9%      | 3
Item_6 | Strongly agree             | 42.9%      | 3
Item_7 | Neither agree nor disagree | 14.3%      | 1
Item_7 | Agree                      | 71.4%      | 5
Item_7 | Strongly agree             | 14.3%      | 1
Item_8 | Disagree                   | 14.3%      | 1
Item_8 | Agree                      | 85.7%      | 6
Item_9 | Disagree                   | 14.3%      | 1
Item_9 | Agree                      | 71.4%      | 5
Item_9 | Strongly agree             | 14.3%      | 1

The alpha coefficient for fitness for purpose (four items) is 0.76444 and for item's appropriateness (five items) is 0.877809, suggesting that the items have relatively high internal consistency. Figures F.7 and F.8 present the data analysis of the internal consistency of fitness for purpose and item's appropriateness, respectively.

4. Study 4 Results

All items measuring the appropriateness of the instrument indicated agreement that the instrument is suitable for its purpose. Item 8 was the weakest item, with 4 out of 22 participants disagreeing on whether the quantity of questions in the instrument was appropriate. See Tables 6.34 and 6.35 for the appropriateness values.


Table 6.34: Fitness for purpose Values

Item   | Value                      | Percentage | Count
Item_1 | Neither agree nor disagree | 4.5%       | 1
Item_1 | Agree                      | 54.5%      | 12
Item_1 | Strongly agree             | 40.9%      | 9
Item_2 | Disagree                   | 4.5%       | 1
Item_2 | Neither agree nor disagree | 9.1%       | 2
Item_2 | Agree                      | 54.5%      | 12
Item_2 | Strongly agree             | 31.8%      | 7
Item_3 | Neither agree nor disagree | 9.1%       | 2
Item_3 | Agree                      | 50.0%      | 11
Item_3 | Strongly agree             | 40.9%      | 9
Item_4 | Disagree                   | 4.5%       | 1
Item_4 | Neither agree nor disagree | 9.1%       | 2
Item_4 | Agree                      | 59.1%      | 13
Item_4 | Strongly agree             | 27.3%      | 6

Table 6.35: Item's Appropriateness Values

Item   | Value                      | Percentage | Count
Item_5 | Disagree                   | 4.5%       | 1
Item_5 | Neither agree nor disagree | 9.1%       | 2
Item_5 | Agree                      | 50.0%      | 11
Item_5 | Strongly agree             | 36.4%      | 8
Item_6 | Disagree                   | 13.6%      | 3
Item_6 | Agree                      | 45.5%      | 10
Item_6 | Strongly agree             | 40.9%      | 9
Item_7 | Neither agree nor disagree | 9.1%       | 2
Item_7 | Agree                      | 77.3%      | 17
Item_7 | Strongly agree             | 13.6%      | 3
Item_8 | Disagree                   | 18.2%      | 4
Item_8 | Neither agree nor disagree | 18.2%      | 4
Item_8 | Agree                      | 50.0%      | 11
Item_8 | Strongly agree             | 13.6%      | 3
Item_9 | Neither agree nor disagree | 9.1%       | 1
Item_9 | Agree                      | 68.2%      | 15
Item_9 | Strongly agree             | 22.7%      | 5

The alpha coefficient for fitness for purpose (four items) is 0.708 and for item's appropriateness (five items) is 0.70, which means the set of items has high internal consistency. Figures F.9 and F.10 present the data analysis of the internal consistency of fitness for purpose and item's appropriateness, respectively.

6.6.1.3 Perceived usefulness - To what extent do evaluators believe that using the instrument would enhance their performance in planning Software Engineering controlled experiments with participants?

1. Study 1 Results

All items regarding the perceived usefulness of the instrument were rated positively by the two participants, with the exception of Item 11, where one of the raters disagreed that the instrument would help him to complete his review in a reasonable amount of time. Item 10 (Using the instrument would give me greater control over my experimental planning) was highly rated by both participants. Item 13 (I find that this instrument would be useful for reviewing experimental plans) was also positively rated. For more details, see Table 6.36.

Table 6.36: Perceived Usefulness Values

Item    | Value                      | Percentage | Count
Item_10 | Strongly agree             | 100.0%     | 2
Item_11 | Disagree                   | 50.0%      | 1
Item_11 | Strongly agree             | 50.0%      | 1
Item_12 | Neither agree nor disagree | 50.0%      | 1
Item_12 | Strongly agree             | 50.0%      | 1
Item_13 | Agree                      | 50.0%      | 1
Item_13 | Strongly agree             | 50.0%      | 1
Item_14 | Neither agree nor disagree | 50.0%      | 1
Item_14 | Strongly agree             | 50.0%      | 1

2. Study 2 Results

All items were uniformly rated positively by the four participants. Item 13 (I find that this instrument would be useful for reviewing experimental plans) presented the highest percentage, with 3 out of 4 participants strongly agreeing, and Item 11 (The instrument would help me to complete my review in a reasonable amount of time) presented the lowest. Table 6.37 presents the instrument's acceptance percentage values for perceived usefulness.

Table 6.37: Perceived Usefulness Values

Item    | Value                      | Percentage | Count
Item_10 | Agree                      | 75.0%      | 3
Item_10 | Strongly agree             | 25.0%      | 1
Item_11 | Neither agree nor disagree | 25.0%      | 1
Item_11 | Agree                      | 25.0%      | 1
Item_11 | Strongly agree             | 50.0%      | 2
Item_12 | Agree                      | 50.0%      | 2
Item_12 | Strongly agree             | 50.0%      | 2
Item_13 | Agree                      | 25.0%      | 1
Item_13 | Strongly agree             | 75.0%      | 3
Item_14 | Agree                      | 50.0%      | 2
Item_14 | Strongly agree             | 50.0%      | 2

The alpha coefficient for perceived usefulness (five items) is 0.784, which means the items have relatively high internal consistency. Figure F.11 presents the data analysis of the internal consistency of perceived usefulness.

3. Study 3 Results


Similarly to the results of study 2, perceived usefulness presented uniformly positive results. The highest rated item was Item 13 (I find that this instrument would be useful for reviewing experimental plans), while the lowest was Item 11 (The instrument would help me to complete my review in a reasonable amount of time). The participants did not disagree with any items regarding perceived usefulness. Table 6.38 presents the instrument's acceptance percentage values for perceived usefulness.

Table 6.38: Perceived Usefulness Values

Item    | Value                      | Percentage | Count
Item_10 | Agree                      | 42.9%      | 3
Item_10 | Strongly agree             | 57.1%      | 4
Item_11 | Neither agree nor disagree | 14.3%      | 1
Item_11 | Agree                      | 57.1%      | 4
Item_11 | Strongly agree             | 28.6%      | 2
Item_12 | Agree                      | 85.7%      | 6
Item_12 | Strongly agree             | 14.3%      | 1
Item_13 | Agree                      | 28.6%      | 2
Item_13 | Strongly agree             | 71.4%      | 5
Item_14 | Agree                      | 57.1%      | 4
Item_14 | Strongly agree             | 42.9%      | 3

The alpha coefficient for perceived usefulness (five items) is 0.762987, which means the items have high internal consistency. Figure F.11 presents the data analysis of the internal consistency of perceived usefulness.

4. Study 4 Results

All items were positively rated, with the exception of Item 13, where 1 out of 22 participants disagreed that the instrument is useful for reviewing experimental plans. Item 14 (I would recommend this instrument to my colleagues and friends) presented the highest percentage, and Item 12 (The instrument supports critical aspects of my experimental plan) presented the lowest. Table 6.39 presents the complete results.

Table 6.39: Perceived Usefulness Values

Item    | Value                      | Percentage | Count
Item_10 | Neither agree nor disagree | 18.2%      | 4
Item_10 | Agree                      | 40.9%      | 9
Item_10 | Strongly agree             | 40.9%      | 9
Item_11 | Neither agree nor disagree | 9.1%       | 2
Item_11 | Agree                      | 59.1%      | 13
Item_11 | Strongly agree             | 31.8%      | 7
Item_12 | Neither agree nor disagree | 27.3%      | 6
Item_12 | Agree                      | 40.9%      | 9
Item_12 | Strongly agree             | 31.8%      | 7
Item_13 | Disagree                   | 4.5%       | 1
Item_13 | Agree                      | 54.5%      | 12
Item_13 | Strongly agree             | 40.9%      | 9
Item_14 | Neither agree nor disagree | 9.1%       | 2
Item_14 | Agree                      | 50.0%      | 11
Item_14 | Strongly agree             | 40.9%      | 9


The alpha coefficient for perceived usefulness (five items) is 0.853, which also means high internal consistency. Figure F.13 presents the data analysis of the internal consistency of perceived usefulness.

6.6.1.4 Perceived ease of use - To what extent do evaluators believe that using the instrument would be free of effort?

1. Study 1 Results

Regarding perceived ease of use, the participants fully agreed that the instrument provides helpful guidance in reviewing an experiment and that they are able to efficiently complete their review using the proposed instrument. See the complete results in Table 6.40. No items were rated "Strongly disagree".

Table 6.40: Perceived ease of use Values

Item    | Value                      | Percentage | Count
Item_15 | Agree                      | 50.0%      | 1
Item_15 | Strongly agree             | 50.0%      | 1
Item_16 | Agree                      | 50.0%      | 1
Item_16 | Neither agree nor disagree | 50.0%      | 1
Item_17 | Agree                      | 100.0%     | 2
Item_18 | Neither agree nor disagree | 50.0%      | 1
Item_18 | Agree                      | 50.0%      | 1
Item_19 | Agree                      | 100.0%     | 2

2. Study 2 Results

All items regarding the perceived ease of use of the instrument presented high values. Item 16 (It is easy for me to remember how to use the instrument) was the strongest, while Item 15 (The instructions for using the instrument are clear) was the weakest. Table 6.41 presents the instrument's acceptance percentage values for perceived ease of use.

Table 6.41: Perceived ease of use Values

Item    | Value                      | Percentage | Count
Item_15 | Neither agree nor disagree | 25.0%      | 1
Item_15 | Agree                      | 50.0%      | 2
Item_15 | Strongly agree             | 25.0%      | 1
Item_16 | Agree                      | 50.0%      | 2
Item_16 | Strongly agree             | 50.0%      | 2
Item_17 | Agree                      | 50.0%      | 2
Item_17 | Strongly agree             | 50.0%      | 2
Item_18 | Agree                      | 100.0%     | 4
Item_19 | Agree                      | 75.0%      | 3
Item_19 | Strongly agree             | 25.0%      | 1

The alpha coefficient for perceived ease of use (five items) is 0.904, suggesting that the items have relatively high internal consistency. Figure F.14 presents the data analysis of the internal consistency of perceived ease of use.


3. Study 3 Results

The items related to perceived ease of use were positively rated. Items 15 and 16 presented the highest percentages. One participant neither agreed nor disagreed about whether he finds the instrument easy to use. Table 6.42 presents the values of the assessment.

Table 6.42: Perceived ease of use Values

Item    | Value                      | Percentage | Count
Item_15 | Agree                      | 57.1%      | 4
Item_15 | Strongly agree             | 42.9%      | 3
Item_16 | Agree                      | 57.1%      | 4
Item_16 | Strongly agree             | 42.9%      | 3
Item_17 | Agree                      | 85.7%      | 6
Item_17 | Strongly agree             | 14.3%      | 1
Item_18 | Neither agree nor disagree | 14.3%      | 1
Item_18 | Agree                      | 57.1%      | 4
Item_18 | Strongly agree             | 28.6%      | 2
Item_19 | Agree                      | 71.4%      | 5
Item_19 | Strongly agree             | 28.6%      | 2

The alpha coefficient for perceived ease of use (five items) is 0.8634021, suggesting that the items have relatively high internal consistency. Figure F.15 presents the data analysis of the internal consistency of perceived ease of use.

4. Study 4 Results

All items were rated satisfactorily. However, two participants disagreed that the instructions for using the instrument are clear; one disagreed that it is easy to remember how to use the instrument, and one did not find the instrument easy to use. Table 6.43 presents the complete results regarding the perceived ease of use of the instrument.


Table 6.43: Perceived ease of use Values

Item    | Value                      | Percentage | Count
Item_15 | Disagree                   | 9.1%       | 2
Item_15 | Neither agree nor disagree | 4.5%       | 1
Item_15 | Agree                      | 59.1%      | 13
Item_15 | Strongly agree             | 27.3%      | 6
Item_16 | Disagree                   | 4.5%       | 1
Item_16 | Neither agree nor disagree | 9.1%       | 2
Item_16 | Agree                      | 63.6%      | 14
Item_16 | Strongly agree             | 22.7%      | 5
Item_17 | Neither agree nor disagree | 9.1%       | 2
Item_17 | Agree                      | 59.1%      | 13
Item_17 | Strongly agree             | 31.8%      | 7
Item_18 | Disagree                   | 4.5%       | 1
Item_18 | Neither agree nor disagree | 13.6%      | 3
Item_18 | Agree                      | 59.1%      | 13
Item_18 | Strongly agree             | 22.7%      | 5
Item_19 | Neither agree nor disagree | 9.1%       | 2
Item_19 | Agree                      | 63.6%      | 14
Item_19 | Strongly agree             | 27.3%      | 6

The alpha coefficient for perceived ease of use (five items) is 0.930, suggesting that the items have relatively high internal consistency. Figure F.16 presents the data analysis of the internal consistency of perceived ease of use.

6.6.1.5 Qualitative Analysis - Open Questions

The feedback about the instrument, provided through the open questions, was discussed in meetings involving the investigators.

1. Study 1 Results

Both participants agreed that the proposed instrument is an effective support for reviewing an experimental plan for completeness, although it is extensive. Although they think the instrument can be exhaustive in its coverage of experimental design issues, it is at the same time effective in assessing a plan for completeness. The experts agreed that the instrument is easy to use and is especially useful for beginners who are designing their first studies, because they might not be aware of some of the issues that the instrument highlights. However, they emphasized that beginners should keep in mind the trade-offs of each experimental plan, and should not concentrate their efforts only on the mechanics, but should also spend time formulating theory, shaping the research questions, and collecting qualitative data during the study to help explain the experimental findings. Both of them think that the instrument does not lack any important component. Regarding the initiative to build the proposed instrument, one of them said:

"I think building a unified collection of experimental design considerations is a very valuable service for the community. I think the challenge is making sure such considerations don't become overly intimidating or burdensome."

Regarding the improvement of the instrument, one of them suggested, as future work, building a preamble that discusses cost-benefit trade-offs in the context of experimental design, which would help remind experimenters of the time and cost of designing, conducting, and analyzing an experiment.

Study 1 also helped us identify some redundancies in the instrument, because the raters assessed each of its items. The redundancies were identified and the instrument was adjusted.

2. Study 2 Results

All participants considered the proposed instrument an effective and easy-to-use support for reviewing an experimental plan for completeness. They emphasized that the instrument highlights key factors of an experimental plan, helping experimenters remember the important elements that compose an experiment plan. One beginner and one expert researcher thought the threats to validity item lacked details, such as actions to control threats to validity. As a result, we included the item "Whether the experimenters reported actions to control each described threat to validity". Because threats to validity are an essential but also extensive section of experiment planning, we additionally added some mechanisms, as references in the proposed instrument, to support researchers in addressing this concern. Also, some of the participants suggested that each thing to consider should be associated with the response scale "Yes", "Partially", "No", or "Not Applicable" and a review comment box. We decided not to include the response scale for each "thing to consider", but we did include a review comment box for each item. Finally, they suggested adding a hyperlink to the citations. The instrument was adjusted and the new version was applied in studies 3 and 4.

3. Study 3 Results

All participants agreed that the proposed instrument is an effective support for reviewing an experimental plan for completeness, because it helps them remember, as a checklist, important elements of the experimental plan, and it systematically covers each area of the experiment plan. Although some of them think the instrument is an extensive checklist, all subjects agreed that the proposed instrument is easy to use. They did not notice any missing components in the instrument. They suggested the creation of a short version of the instrument, and they also reported that some recommendations of the instrument seem repetitive.

Regarding the creation of a short version of the instrument, we did not apply this suggestion because, although the instrument is extensive, it contains important things to consider for reviewing experimental plans. However, if researchers are comfortable with experimental plan elements, they can consider only the items, without the things to consider, although we suggest that they review all instrument items.

4. Study 4 Results

All participants agreed that the proposed instrument is an effective support for reviewing an experimental plan for completeness because the instrument supports the experiment review. It was strongly emphasized that the instrument provides very useful guidance for beginner experimenters. The participants agreed that the instrument is easy to use, although some said that the checklist is too long. The majority of the participants did not think that the instrument lacks any important items, but one of them suggested more details regarding the threats to validity item. Similarly to study 3, some participants in study 4 suggested the creation of a short version of the instrument, and of a glossary containing some important experimental planning terms, such as treatment, and dependent and independent variables, among others. However, a glossary of terms for experimental software engineering was started by the ISERN (International Software Engineering Research Network) community in 1998. The content of this repository comprises definitions of terms usually used in the Experimental Software Engineering field. It can be accessed at http://lens-ese.cos.ufrj.br/wikiese.

6.7 Threats to Validity

This section discusses the most relevant threats to validity for the four studies. We break them down into four main categories: internal, external, construct, and conclusion validity.

6.7.1 Internal Validity

• Instrumentation

Because instrumentation can cause issues related to instructions or experimental materials, we wrote the instructions carefully to guide all participants in the four studies in a similar way. We performed sandboxes and pilots to check whether there were problems in the instructions and experimental materials on the experimental websites. In addition, during the co-located study (Study 3) in the laboratory, a supervisor was present to answer participants' questions about the mechanics of the instrument. The supervisor followed a protocol to answer the questions consistently, to avoid bias in data collection.


In addition, to prevent the participants from changing their behavior in some way during the study, we planned not to overburden them in the assessment. In studies 1, 2, and 4, we limited the evaluation to no more than two hours, during which time the participants could start, stop, and restart at any time over one week. In study 3, we gave breaks between the sessions.

• Maturation

To decrease maturation effects, it is important to vary the order of the treatments, because the participants might get better at reviewing from one experimental round to another simply by getting practice. In studies 3 and 4 it was not possible to vary the order of the treatments, because once the participants had been exposed to the instrument, they were likely to remember many of its items even after it was taken away; a participant could not really do an ad hoc review after using the checklist once. As a result, this threat remains, and should be mitigated in future studies.

• Individual differences among the participants

In study 2, we had beginner and expert participants. In order to deal with individual differences between participants, they were divided into distinct pairs of beginners and experts. Also, the sample was homogeneously selected, that is, the pairs had the same background.

• The language of the experimental materials

To address the threat posed by the language of the experimental materials (English), which was different from the native language of the participants (except in study 1, in which the participants were native English speakers), we applied a demographic questionnaire asking them the level of their English reading comprehension, on a scale from very low to very high proficiency. We considered data only from participants with at least a moderate level of reading comprehension proficiency in English.

• Blinding

In studies 3 and 4, because we compared ad hoc practices against the proposed instrument, we could not mask the treatments: the participants knew which treatment they were receiving. However, they were advised of the importance of reporting real and coherent data during the experiment. To further address this threat, the experimenter (the author of the instrument) was not aware of which mistakes the professor had included in the experimental plans. As a result, we could observe to what extent the instrument identified those issues.


• Subjective analysis

Because the analysis of the correct issues reported by the participants is subjective, in studies 3 and 4 the experimenter (the author of this thesis) was not involved in that data collection, in order to avoid bias in the results of the experiments. Instead, two researchers who were not involved in the research calculated how many correct issues were reported by each participant.

• Object learning effect

Because the participants had to apply both treatments in studies 3 and 4, we tried to avoid the object learning effect by introducing a different experimental plan in each treatment.

6.7.2 External Validity

• Representativeness of subjects

Although the sample was selected by convenience, it is representative of the target population because we chose participants with different levels of experience (beginner and expert researchers) in planning experiments. However, the results cannot be generalized because of the small sample size.

• Representativeness of objects

In study 2, we used a total of three experimental plans, which are representative of the object of interest; these were assessed with the instrument. However, they were selected from the same experimental software engineering course, which means the authors of the experimental plans learned how to plan experiments from the same professor at the same university using the same materials. In studies 3 and 4, one experimental plan was used for each treatment. Therefore, new studies should be conducted with more experimental plans from different places, including heterogeneous scenarios.

In addition, the development of the experimental objects used in studies 3 and 4, including the reference models, was not entirely independent, since it was done by one of the researchers of this work. So the errors the researcher introduced could have been biased towards things that the checklist was designed to cover. As a result, future studies should use experimental objects that are developed independently of the developers of the instrument.

• Situational effects

Except in study 3, which was performed in the laboratory, the participants performed studies 1, 2, and 4 in their own environments, which were close to the environment where they might actually use the instrument. We did not have control over whether the participants masked the results or randomly checked off the boxes in the proposed instrument and the acceptance questionnaire in order to complete the tasks quickly. To address this issue, study 3 gave us a baseline of time because it was monitored in a lab environment. Also, both the instrument and the instrument's acceptance questionnaire had a time resource that counted the time spent on questions. As a result, we did not find systematic bias in the results.

6.7.3 Construct Validity

• Mono-operation bias

In study 1, the participants assessed the instrument by itself and did not use it to actually do a review. Also, in study 2, the usage of only a single treatment might have introduced mono-operation bias in the results, because the participants did not have a point of comparison; they just used the proposed instrument.

• Anonymous participation

In all four studies, even though the assessment was not anonymous, the participants were aware that all data would be treated confidentially.

6.7.4 Conclusion Validity

• Small sample size

The sample size is statistically small. However, this is a common issue in software engineering assessments. Therefore, we consider the results only as indicators; they cannot be generalized.

6.8 Discussion

In total, we had 35 subjects, ranging from beginners to senior researchers in experimental software engineering, who participated in four different kinds of assessment of the instrument.

In study 1, the goal was to expose the proposed instrument to two experienced experimental software engineering researchers for them to report what they thought about each item of the instrument: which ones they found useful and which ones they had trouble understanding. The experience of the two raters was crucial in this study because they provided important considerations about the items. Of the 33 items, they fully agreed on 25. One of the participants thought that items 1 and 24 were not useful, while the other had trouble understanding six items: 4, 5, 11, 14, 15, and 32. The eight remaining items were discussed by the investigators. We decided to keep items 1, 5, and 24; regarding items 4 and 15, the redundancies were removed. Also, items 4, 11, and 32 were adjusted in order to make them clearer to future users of the instrument.

In study 2, we assessed the proposed instrument from the perspective of two beginner and two expert experimental software engineering researchers. This study focused on evaluating the proposed instrument regarding inter-rater agreement, reliability, and criterion validity. Regarding inter-rater agreement, we analyzed to what extent raters score experimental plans using the instrument in a similar manner. We concluded that there were no significant differences between the completeness score means, regardless of the composition of the groups of raters, in other words, among the homogeneous groups of raters (beginners and experts) and among the four raters. This was a relevant result because it indicates that the usage of the instrument helped beginners to review experimental plans in a manner similar to experts. However, we are aware that this can vary from beginner to beginner, because participant factors, such as a beginner's or an expert's knowledge and experience, could influence the results. With regard to inter-rater reliability, we analyzed to what extent raters rank experimental plans using the instrument in a similar manner. The inter-rater reliability of the instrument for the group of experts (ICC = 0.998) and among the four raters (ICC = 0.875) was almost perfect, while the group of beginner raters showed substantial reliability (ICC = 0.791). The reliability among the four raters indicates that beginners ranked the experimental plans in a manner similar to the experts, which is a positive result. The criterion validity of the instrument was supported in that the mean overall completeness scores of the experimental plans strongly correlate with the recommendation on whether the experiment should proceed (τB = 0.816, p < 0.2) and with the recommendation that, if the experiment proceeds, it is likely to be successful (τC = 0.816, p < 0.2). This is also a positive indication. However, it is important to test other hypotheses that can contribute to increasing the validity of our instrument.
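For reference, minimal sketches of these computations in R, using the irr package's icc function and base R's cor.test. The ratings and scores below are hypothetical, and the ICC model and type actually used in the study are not stated, so those arguments are assumptions:

> library(irr)
> # hypothetical ratings: rows = experimental plans, columns = raters' completeness scores
> ratings <- matrix(c(30, 31, 29, 32,
+                     18, 20, 19, 21,
+                     25, 24, 26, 25), nrow = 3, byrow = TRUE)
> icc(ratings, model = "twoway", type = "agreement", unit = "single")
> # criterion validity: Kendall correlation between completeness scores and recommendations
> cor.test(c(30, 18, 25), c(3, 1, 2), method = "kendall")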

In studies 3 and 4, we had the same research question: we analyzed whether the usage of the proposed instrument can reduce the chance of forgetting to include something important during the experiment planning phase, compared to the usage of ad hoc practices. However, we varied some characteristics of the experiments. In study 3, we had seven participants, while in study 4 we had 22. Study 3 was performed co-located in the laboratory, which increased the control of the experiment, while study 4 was carried out remotely. In both studies, the proportion of correct items found by participants using the instrument was greater than by participants using ad hoc practices. The difference between the studies was the degree of control of the assessment, because study 3 was performed in the laboratory, while study 4 was carried out in the participants' own environments (home, work, among others). Similarly to study 3, at the beginning of study 4 the participants applied ad hoc practices to identify mistakes and missing elements in an experimental plan. They correctly identified an average of 11% of the total mistakes and missing elements. This result is lower than the result obtained in study 3 (18%). The usage of the instrument increased the number of correct items identified more than five-fold compared with the usage of ad hoc practices in both studies. In study 4, the minimum proportion of correct items identified using the instrument (0.08) was lower than the minimum proportion of correct items identified using ad hoc practices (0.13) in study 3. However, this was an isolated case, because the majority of participants (19 out of 22) that used the instrument found more than 50% of the mistakes in the experimental plans (23 issues for EP1 and 26 issues for EP2). As in study 3, although we cannot generalize the results, they indicate that using the checklist for reviewing experimental plans is a more efficient way to find mistakes and problems than ad hoc practice.

Although all the subjects who participated in studies 3 and 4 had taken a course in experimental software engineering, the results of both studies were significantly better using the instrument than using ad hoc practices. It is important to emphasize that it is not enough to simply teach people experimental software engineering, because even people who were trained still did a significantly better job with the instrument than without it. However, these studies should be replicated with a broad range of experimental plans from different sources, and also with more subjects who have learned how to plan experiments in different experimentation courses.

At the end of the studies, all the participants assessed the instrument's acceptance. In Section 6.6.1.1, we presented an overview of the results of the four studies regarding the instrument's acceptance. The participants perceived the instrument as appropriate, useful, and easy to use. We conclude that the instrument achieved high acceptance among the 35 subjects who participated in the studies.

The participants emphasized that the instrument is most useful for beginner researchers, which confirms its purpose. The major problem of the proposed instrument reported by the participants was its length. However, although the instrument is exhaustive in its coverage of experimental design issues, it is effective in assessing a plan for completeness.

Moreover, it is important to highlight that some of the questions (e.g., is the number of participants adequate?) must not push experimenters towards ever bigger and more expensive experiments. Instead, the purpose of this kind of question is to make experimenters reflect on future analysis issues. Therefore, so that the proposed instrument does not become an obstacle to experimenters, especially beginners, who might want to run only perfect experiments because of the large quantity of mechanics involved in experiment planning, we added the advice below to the last item of the instrument (Category: Document), based on the results of the quality study reported in Chapter 4, which encourages experimenters to run experiments even if they are not perfect:

Instead of targeting a perfect study, it would be better if researchers just started running studies even if they are not perfect, or even if they are simpler. The basic problem in doing experimentation in software engineering today is that everybody sees experiments as very difficult to run, and currently, researchers who are reviewing studies want the studies to be perfect. It is important that we think less about whether the study is methodologically perfect, and more about what we have really learned by running the study, even if it has limitations, and even if there are questions about external validity in terms of the generalization of the results. If researchers were able to run more studies, especially smaller ones, it would be a big step for our field because there is so much we do not know. The more studies we can run, the more opportunities we have to learn about software engineering experiments.

The results of all four studies were positive. However, it is necessary to perform more assessments to confirm the previous results and to generalize them for software engineering because, as mentioned earlier, our sample size is small and our results are indications, which means we cannot generalize them. To overcome the limitations described in Section 6.7, it is important to replicate these studies using: (1) a broader range of experimental plans in terms of variety, quality, quantity, and source (from different classes and projects); and (2) a broader range of raters with different kinds of expertise in experimental software engineering. The results may then reflect a general trend towards the acceptance of the instrument.

6.9 Final Version of the Instrument

Based on each assessment study, including the feedback regarding the instrument's acceptance described in the previous sections, the instrument was adjusted. Redundancies were removed, and relevant definitions and suggestions were included to clarify the items of the instrument. The final version of the instrument was reduced from 33 to 31 checklist items. Below, we present the final version of the instrument:

[The final version of the instrument, the complete 31-item checklist, is presented here over several pages in the original document.]


6.10 Chapter Summary

In this chapter, we discussed the results of the instrument assessments. Each study had different specific goals and approaches (formative or summative) and was conducted from distinct perspectives (beginners and experts). We performed two experimental studies and two controlled experiments, including a co-located and a remote experiment. The instrument's acceptance was assessed in each study regarding appropriateness, usefulness, and ease of use. All studies presented a high percentage of acceptance of the instrument. A total of 35 participants were involved in these studies. The results of the initial assessments indicate that the instrument is a good starting point to support software engineering researchers when they are reviewing experimental plans in software engineering. In the next chapter, some related work is discussed together with relevant differences from our research.


7 Related Work

This chapter presents related work concerning the contributions of this thesis and the state of the art in the empirical literature. Some works inspired and guided our research, while other studies, although similar to what we performed, did not completely satisfy our goals. The following sections present the related work regarding the systematic mapping study (See Section 7.1), the qualitative study (See Section 7.2), and the proposed instrument (See Section 7.3). Finally, the chapter summary is presented in Section 7.4.

7.1 Related Work – Systematic Mapping Study (Chapter 3)

In Chapter 3, we presented a systematic mapping study we performed in order to identify support mechanisms used to conduct experiments. Almeida et al. [164] presented a related study in which they conducted a systematic mapping to collect mechanisms to guide empirical studies in software engineering. Their study focuses on selecting studies in empirical software engineering that defined a guide to empirical studies. Their systematic mapping study was performed using automated search engines from five digital libraries (IEEE, ACM, Science Direct, Scopus, and EI Compendex), and manual searches on relevant journals and conferences (EASE, ESEM, and ESEJ). In their work, 23 guides proposed to support empirical methods in software engineering were identified. The main reason for this small number is that the study only searched for guides defined specifically to support software engineering. Our systematic mapping study is similar to theirs; however, our research is not focused on support mechanisms defined within software engineering, but on all mechanisms used as references for empirical methods in software engineering, including those from other areas. Besides, our research analyzed the studies that used these mechanisms, while Almeida et al.'s work analyzed the studies that defined these mechanisms. Therefore, we provide an overall view of the main references used by the empirical software engineering community.

Another study that collects information about support mechanisms is that of Marshall and Brereton [165]. They proposed a web-based catalog to help reviewers find appropriate systematic literature review tools based on their particular needs. Similar to our on-line catalog, their catalog comprises a list of support mechanisms classified according to the empirical activities they support (Study Selection, Quality Assessment, Data Extraction, Meta-Analysis, and others). However, their catalog focuses on resources specific to systematic literature reviews. They identified 71 automated tools and 23 other resources, including guidelines, checklists, and reporting standards. In this sense, our catalog has greater coverage, since it provides 362 mechanisms, including references for the empirical strategies most commonly applied in empirical software engineering (experiment, case study, survey, systematic literature review, and others).

7.2 Related Work – Qualitative Interview Study (Chapter 4)

In Chapter 4, we presented a qualitative interview study that we carried out in order to understand how experienced researchers plan and conduct their experiments. Although we did not find any study directly related to this one, the work produced by Ko et al. [17] motivated us to work on a practical perspective of planning controlled experiments with human subjects, given the difficulty researchers have in dealing with issues regarding human factors in experiments. Ko et al. [17] present a practical perspective on guiding controlled experiments of software engineering tools with human participants. They presented the problem that, although controlled experiments have been widely adopted in software engineering research, many software engineering researchers have seen controlled experiments involving human participants as too risky and too difficult to conduct. Our study tries to fill some of these gaps because it is the first study that presents what researchers actually do when they design experiments, the problems and mistakes they have experienced, how they learn about empirical methods, and the gaps they still have in their knowledge.

In addition, because there is a significant loss of information when textual data is quantified, and because experiences taken from experts cannot be appropriately described through statistics and other quantitative methods, some ancillary works that address the qualitative approach helped us to generate a well-grounded understanding of the phenomenon under study. Seaman [166] presents several qualitative methods for data collection and analysis, describing how qualitative methods might be integrated into empirical studies of software engineering and how they might be combined with quantitative methods; throughout her study, she presents examples of the use of qualitative methods in software engineering studies. Another study used was Strauss and Corbin [105], which describes the grounded theory method and focuses on systematizing qualitative data collection and analysis. We used this work to support the analysis of data based on coding, including open coding and axial coding.


7.3 Related Work – Proposed Instrument (Chapter 5)

In Chapter 5, we presented the proposed instrument for reviewing the completeness of experimental plans for controlled experiments using human subjects in the area of software engineering. Although there has been a significant amount of research focused on helping researchers to design, conduct, and report their controlled experiments, which builds an important methodological base in the empirical literature [29], [30], [24], [128], it focuses on important but general issues regarding designing experiments [17]. Few studies, if any, focus on guiding experimenters on how to review software engineering experimental plans for controlled experiments using participants, with respect to completeness and scientific quality. As a result, we identified checklists and guidelines from software engineering and other fields which assist empirical researchers with designing and reporting experiments and evaluating scientific publications. Our research is directly related to them because the proposed instrument was developed based on their items and recommendations.

Regarding empirical software engineering guidelines, Wohlin et al. [29], Juristo and Moreno [23], Kitchenham et al. [24], and Pfleeger [45] are a set of software engineering empirical guidelines which present key activities necessary for designing and analyzing experiments in software engineering. While Juristo and Moreno [23] and Wohlin et al. [21] are textbooks, Pfleeger [45] and Kitchenham et al. [24] are scientific papers. Both groups of studies address experimental methodology in empirical software engineering, and they were used as a basis for building the proposed instrument for reviewing experimental plans. Wohlin et al. [29] is the most cited reference for guiding controlled experiments in software engineering. It gives a well-structured introduction to experimentation for software engineers, while Juristo and Moreno [23] describe in-depth methods for advanced experiment designs and data analysis, with examples from software engineering. They provide a list of the most important points to be documented for each phase, in the form of questions to be answered by the experimental documentation. Pfleeger [45] presents key activities necessary for designing and analyzing an experiment in software engineering. Kitchenham et al. [24] is another work which presents a set of research guidelines aimed at stimulating discussion among software researchers. They based their guidelines on reviews of research guidelines from medical science, and they also included their own experience in doing and reviewing software engineering research. These guidelines are intended to assist researchers in designing, conducting, and evaluating empirical studies. The study includes a checklist with 36 items.

Some related works are explicit guidelines on reporting experiments. Basili et al. [1] suggested a framework for experimentation that provides a structure for presenting experiments. Singer [18] described how to use the American Psychological Association (APA) style guidelines for publishing experimental results in software engineering. Jedlitschka [20] described a survey of the most relevant studies on reporting guidelines and suggested a unified standard for reporting controlled experiments. This study contains a quick-reference checklist which includes 42 items.


Jedlitschka's proposal is similar to the APA standards [18]. However, this study presents more detailed and specific elements for experiments in software engineering. The first version of this study was published in 2005 [20], [141], and feedback from the workshop participants, as well as from peer reviews, was incorporated into a second version of the guideline [142]. In parallel, the guideline was evaluated by Kitchenham et al. (2006) [167]. This evaluation highlighted 42 issues and eight mistakes. The feedback from Kitchenham et al. [167] led to a second version of the guideline by Jedlitschka and Ciolkowski [143], where, if appropriate, the adjustments were applied and problems were solved. In 2007, additional feedback from individual researchers was also incorporated in Jedlitschka et al. (2007) [144]. The current version of the guideline was published in 2008 [81]. Wohlin et al. [21], Juristo et al. [23], and Kitchenham et al. [24] also included guidelines on how to document experiments.

In relation to empirical quality checklists in software engineering, our work is also directly related to Dieste et al. [19], Kitchenham et al. [24], Dyba and Dingsoyr [12], Kitchenham et al. [40], Kitchenham et al. [82], and Kitchenham et al. [83]. These checklists provide lists of items used to assess primary study reports. As a result, we took into account the most prominent items related to experimental planning. Dieste et al. [19] present a quality scale for experiments based on the methodological recommendations reported in Kitchenham et al. [24]. They analyzed the quality of 28 experiments in software engineering regarding the measurement of experimental bias. The checklist contains 11 items organized into five dimensions: experimental context, experimental design, analysis, interpretation of results, and presentation of results. Another instrument analyzed was the set of quality criteria for assessing primary studies in software engineering proposed by Dyba and Dingsoyr [12]. The checklist was designed to assess the quality of primary studies in their systematic review of empirical studies of agile software development. It contains eleven criteria, which were based on CASP (Critical Appraisal Skills Programme) 1 and on principles of good practice for conducting empirical research in software engineering proposed by Kitchenham et al. [24]. In a similar way, Kitchenham and Charters [40] designed a study which presents summary quality checklists for quantitative studies, including 50 questions. They accumulated a list of questions from other studies such as Crombie et al. [150], Fink et al. [151], Greenhalgh et al. [152], Khan et al. [76], and Petticrew et al. [153]. Kitchenham and Charters' checklists [40] are organized in terms of study stage and study type. The authors did not suggest that researchers use all the proposed questions; rather, researchers should review the list of questions in the context of their own study and select the quality evaluation questions that are most appropriate for their specific research questions. In 2009, Kitchenham et al. [82] published a generic quality checklist for experiments, including nine items based on their previous work. In 2010, Kitchenham et al. [83] published a quality checklist with nine questions, whose purpose was to assess whether the quality of published human-centric software engineering experiments was improving. Moreover, we considered the studies of Host and Runeson [84] and Wieringa [85].

1 http://www.casp-uk.net/checklists


Host and Runeson [84] proposed checklists which support researchers and reviewers in conducting and reviewing case studies in software engineering. Having identified the need for such checklists, they developed them based on nine sources on case studies using a systematic qualitative approach; 103 items were identified, then classified and validated. After validation, because the first version of the checklist was considered too extensive for reviewing purposes, it was split into two checklists, one for design and another for review. As a result, the researcher's checklist contains 38 items, and the reviewer's checklist contains 12 items. Wieringa [85] suggested a unified checklist for observational and experimental research in software engineering based on several published checklists in software engineering and other fields. The checklist includes 40 items. This study was evaluated by Nelly et al. [168], which resulted in a list of suggestions and improvements for the checklist, including clarifying questions and concepts and reducing the checklist, among others.

Empirical communities other than software engineering widely use tools in the form of checklists to appraise the quality of their controlled experiment reports. In medicine, controlled experiments are also called randomized controlled trials. From the medical community, Jadad et al. [86], CASP 1, and the CONSORT group [88], [89] are commonly used tools for assessing the quality of randomized controlled trials. Jadad et al. [86] is one of the best known and most used scales for assessing the quality of randomized controlled trials, and it has been used to determine the effect of rater blinding on quality assessments. The scale measures the likelihood of bias in pain research reports, and it addresses three kinds of bias: randomization, blinding, and withdrawals and dropouts. The scale contains three items: (1) was the study described as randomized (including the use of words such as randomly, random, and randomization)?; (2) was the study described as double blind?; and (3) was there a description of withdrawals and dropouts? CASP offers critical appraisal tools in the form of checklists to help researchers check health research for trustworthiness, results, and relevance. In this research, we used two kinds of CASP 1 tools: one for appraising reports of randomized controlled trials, which contains 11 questions, and another for appraising reports of qualitative studies, which contains 10 questions. Three broad issues are considered in the CASP tools: are the results of the trial/review valid?; what are the results?; and will the results help locally? These tools help researchers think about these issues systematically. Another widely used tool in the medical community comes from the CONSORT (Consolidated Standards of Reporting Trials) group, a group of scientists and editors in medical research that aims to improve the reporting of randomized clinical trials. The tool provides guidance for reporting all randomized controlled trials [88], [89]. The checklist contains 37 items.

We also used other checklists from the medical field, including a checklist for assessing research methods, which contains 5 items [93]; Begg et al. [92]; a checklist for measuring study quality in randomised and non-randomised studies of health care interventions, with 27 items [95]; a data collection instrument that assesses potential threats to the validity of a study, with 15 items [91]; a checklist that provides a guide to reviewers about the type of relevant information that could be extracted from primary studies, with 7 items [90]; a quality assessment tool for quantitative studies, with 18 items [154]; a checklist for the methods section of a paper and a checklist for the statistical aspects of a paper, with 14 and 16 items respectively [56]; a quality checklist for experimental (randomised and non-randomised controlled trial) designs, with 12 items [96]; a checklist for assessing the statistical adequacy of published papers, with 18 items [94]; and checklists for assessing the statistical content of medical studies, with 26 items [99]. We also included in our research checklists from other fields such as psychology, education, and ecology. From the empirical psychology field, we used the quality checklist templates for clinical studies and reviews, with 14 items [98]. From the education area, we used the checklist for reviewing a randomized controlled trial of a social program or project to assess whether it produced valid evidence, with 17 items [97]. From the Institute of Terrestrial Ecology, we used a statistical checklist with 74 items [100].

We also examined some studies that present principles for conducting research with human subjects. We followed guidelines on human factors in empirical software engineering, such as Vinson and Singer [72], [69], [70], which raise ethical issues in empirical studies of software engineering, as well as Lazar et al. [62], the British Psychological Society [63], the American Psychological Association [64], Johns Hopkins [65], and Garza [66], which address and present codes of human research ethics in psychology, medicine, and the social sciences.

7.4 Chapter Summary

The main objective of this chapter is to provide brief information about the related work regarding each study carried out. Sections 7.1, 7.2, and 7.3 presented related work regarding the systematic mapping study, the qualitative study, and the proposed instrument, respectively. Each related work offered guidance to our research, whether explicitly described in the studies, in terms of recommendations based on the experience of the authors, or through empirical evaluation results.


8 Conclusions

This chapter presents the conclusions of this research, including answers to the research questions (Section 8.1), the main contributions (Section 8.2), the study limitations (Section 8.3), current activities (Section 8.4), and future work based on the gaps found in the results (Section 8.5).

8.1 Answers to the Research Questions of this Thesis

This section answers the Research Questions (RQs) raised by this thesis.

! RQ1: Which are the most commonly adopted mechanisms to conduct experiments in software engineering research?

The goal of this research question was to map support mechanisms used by researchers. Within this context, we identified the most cited support mechanisms in experimental software engineering in the most well-known venues of the empirical software engineering community (EASE, ESEM, and ESEJ), and the general mechanisms used to support experimental activities. The answer to this research question can be seen in Chapter 3.

! RQ2: What do experimental software engineering experts actually do when they design their experiments?

We collected data from experienced experimental software engineering researchers to explore what they actually do when they design experiments, what kinds of problems/traps they fall into, how they currently learn about research methods, and what gaps they have in their knowledge. The answer to this research question can be seen in Chapter 4.

! RQ3: How can experiment planning be supported to positively help researchers, especially beginners, to review their experiment plans for completeness?

By answering this research question, we verified how well-defined the proposed instrument is to support reviewing experiment plans for completeness, through four different studies, which together contribute to evaluating the instrument from the distinct perspectives of beginner as well as expert researchers. The result can be seen in Chapter 6.

Although the instrument was designed for reviewing the completeness of experimental plans for controlled experiments using human subjects in the area of software engineering, it can also be used to guide researchers during experimental planning, and by researchers from other fields.

8.2 Contributions

We present the main contributions of this Ph.D. thesis as follows:

1. A catalog of the main support mechanisms for planning and conducting experiments in software engineering. The catalog is available at https://goo.gl/nOj4Tu. See Chapter 3.

2. Analysis of what experimental software engineering experts actually do when they design their experiments. See Chapter 4.

3. An instrument for reviewing the completeness of experiment plans. See Chapter 5.

4. Assessment of the proposed instrument. See Chapter 6.

8.3 Study Limitations

The following sections discuss limitations of our Ph.D. thesis research.

8.3.1 Systematic Mapping Study

The threats to validity of this study are reported in Section 3.4 in Chapter 3. Another important limitation relates to studies not being available for download. The majority of these studies were published in early EASE editions, so the EASE 2001 website (http://www.scm.keele.ac.uk/ease/ease2001) displays the following message: “. . . we are unable to make all files available for copyright reasons, additionally not all authors in the early years of EASE could make electronic copies of their papers available. . . ”. When a paper was not available, we adopted the strategy of sending an email directly to the authors of the article. However, even with these efforts, we were not able to retrieve 15 papers: 14 papers from the EASE conference and one paper from ESEM, while in ESEJ all studies were available. We believe that our conclusions are not invalidated by this limitation, since such studies correspond to only 1.4% (15 studies) of all candidate studies (some of which could have been excluded anyway) for our systematic mapping study, and they are from the first editions of the venues considered.


8.3.2 Qualitative Study

The qualitative study has two limitations. First, although the conceptual model of how experts actually plan their experiments is a strong result, these findings should be complemented with an empirical investigation to determine the extent to which these activities extend to experimental planning performed by inexperienced researchers. The findings of the qualitative study were not empirically assessed.

Second, more interviews should be conducted with other researchers from other companies and universities in order to try to understand how the empirical research communities in software engineering, as well as in other fields, carry out experiments in practice.

8.3.3 Instrument

Because the process of reviewing experimental plans is a subjective activity, the proposed instrument is also subjective. Although we have developed a set of items and things to consider in order to guide authors and reviewers of experimental plans, the instrument also depends on the reviewers, which means different reviewers produce different reviews. However, the proposed instrument seeks to raise the most relevant issues for reviewers to think carefully about the experiment before it starts.

Another limitation is the fact that the instrument is not yet automated. During the evaluation, the instrument was implemented through the SurveyGizmo tool 1. Although this implementation was suitable for the instrument assessment, it is not the best option for researchers using the instrument. First, SurveyGizmo is an online survey software tool. Second, its usage requires a license. Therefore, we are currently working towards a collaborative platform for reviewing the completeness of experimental plans for controlled experiments using human subjects in software engineering.

8.3.4 Instrument Evaluation

Section 6.7 in Chapter 6 presents in detail the threats to validity regarding the four evaluation studies. However, we highlight two of them that limit this study, namely, the small sample size and the representativeness of objects.

First, although a small number of participants is a common issue in software engineering assessments, it means the results cannot be generalized. Therefore, we consider the results only as indicators.

Second, the objects used in studies 2, 3, and 4 were selected from the same experimental software engineering course, which means the authors of the experimental plans learned how to plan experiments from the same professor at the same university using the same materials.

1 https://app.surveygizmo.com/


In study 1, we gave participants two appraisal options, so they could choose what kind of study they preferred to accomplish: (1) assess the instrument by itself, or (2) assess the instrument using an experimental plan they already had. However, none chose option 2, which would have produced a different perspective compared to the other three studies, since it would have given participants the opportunity to assess the instrument in another scenario, using their own experimental plan during the assessment instead of somebody else's.

8.4 Current Activities

As mentioned in a previous section, we are working towards a web-based collaborative platform for reviewing the completeness of experimental plans for controlled experiments using human subjects in software engineering. The proposal of this work is to automate the instrument that was developed to help experimenters review their experimental plan, assess whether it is complete, and include all possible factors to minimize bias and maximize internal validity. This platform will combine the proposed instrument with suggestions of mechanisms, identified in the systematic mapping study, that support experimental activities. It will also permit the development of a repository of research plans for experiments in software engineering. This work is being developed in collaboration between the author of this thesis and a senior computer science student at CIn UFPE, as partial fulfillment of the requirements for his degree of Bachelor in Computer Science.

We describe the main functionalities of the web-based collaborative platform as follows. All users have to access the platform with a valid login and password. After login, the experimental plans and experimental plan reviews associated with each user are displayed. A user can have profiles such as author, collaborator, reviewer, visitor, and administrator of the experimental plans and their reviews. The profiles are not mutually exclusive; that is, the same user can be associated with one or more profiles.

The web-based collaborative platform must allow the author of the experimental plan to include, edit, and remove experimental plan information, such as "Name of the Experiment Plan", "Name of the authors", "Name of the reviewers", "Date of the experimental plan", and "Public or Private view".

Also, the tool must provide some information, including "Initial date of the experimental plan review", "Status indicating where reviewers stopped the review", "The current score of the completeness of the experiment plan (sum of yes=1, no=0, and partially=0.5)", "The current percentage of the experimental plan review", and "Access to the review report, which contains suggestions and gaps found in the experimental plan review". This information should be displayed at the top of the dashboard.

The collaborative platform must allow the upload/download of experimental plans, and it must be online in order to facilitate collaboration between researchers and raters around the world.

The tool must display a set of questions/items to review the completeness of the experimental plan. Each question/item must contain a checkbox (yes, partially, no) indicating the presence or absence of that question/item in the experimental plan, and a box for reviewers to input comments and suggestions. For example, reviewers can mention the missing elements and their considerations about that item.

The questions/items have recommendations/suggestions/hints from empirical software engineering best practices to advise experimenters in case they have doubts related to the questions/items. To access these recommendations, the tool must provide a checkbox "See advice". If the experimenter checks the box, a pop-up must display the suggestions for that item. Also, the platform must provide access to a catalog of mechanisms to support each kind of experimental activity.

During the review process, the tool must track the progress of the review. A progress bar is related to the completeness of the categories of the experiment plan, which are: (1) stating the goals; (2) hypotheses, variables, and measurements; (3) participants; (4) experimental materials and tasks; (5) experimental design; (6) procedure; (7) data collection and data analysis; (8) threats to validity; and (9) document. The tool must provide the current score of the completeness of the experiment plan (sum of yes=1, no=0, and partially=0.5).

At the end of the review process, the tool should display the score of the experimental plan and the advice for questions/items which were assigned "no" or "partially". The reviewing process does not need to follow a sequential order; that is, the reviewer can review different categories of the experimental plan in any order.

The tool must generate an experimental plan review report based on the gaps (questions/items checked as "no" or "partially") identified during the review process. This report should contain the completeness score and the advice for the authors to revise their experimental plan.
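To make the scoring and reporting rules described above concrete, the sketch below shows one possible implementation. It is only an illustration of the scheme adopted in this section (yes = 1, partially = 0.5, no = 0, plus a gap report listing the items answered "no" or "partially"); the language (TypeScript) and all type and function names are our own choices for this sketch and do not correspond to the platform's actual code.

// Minimal sketch of the platform's scoring and gap-report rules.
// All names here are hypothetical illustrations, not the real API.
type Answer = "yes" | "partially" | "no";

interface ReviewItem {
  category: string;  // e.g. "participants" or "experimental design"
  question: string;  // the checklist question/item text
  answer: Answer;    // the reviewer's checkbox
  advice: string;    // the best-practice hint shown under "See advice"
}

// Score contribution of one answer: yes = 1, partially = 0.5, no = 0.
function weight(a: Answer): number {
  return a === "yes" ? 1 : a === "partially" ? 0.5 : 0;
}

// Current completeness score of the plan: the sum of the item weights.
function completenessScore(items: ReviewItem[]): number {
  return items.reduce((sum, item) => sum + weight(item.answer), 0);
}

// Review progress as a percentage of the maximum possible score.
function completenessPercentage(items: ReviewItem[]): number {
  return items.length === 0 ? 0 : (100 * completenessScore(items)) / items.length;
}

// Per-category progress, which would drive one progress bar for each
// of the nine categories of the experiment plan.
function categoryProgress(items: ReviewItem[]): Map<string, number> {
  const byCategory = new Map<string, ReviewItem[]>();
  for (const item of items) {
    const group = byCategory.get(item.category) ?? [];
    group.push(item);
    byCategory.set(item.category, group);
  }
  const progress = new Map<string, number>();
  for (const [category, group] of byCategory) {
    progress.set(category, completenessPercentage(group));
  }
  return progress;
}

// Gap report: every item marked "no" or "partially", kept together with
// its advice, so that authors know exactly what to revise in the plan.
function gapReport(items: ReviewItem[]): ReviewItem[] {
  return items.filter((item) => item.answer !== "yes");
}

Under this sketch, the review report is simply the output of gapReport together with completenessScore, and the same functions, applied per category, yield the progress tracking required during the review.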

The tool must store previous experimental plan reviews, (auto) saving the status (stage and score) where the experimenter stopped. The category and question/item where the reviewer stopped reviewing should be highlighted (in color).

8.5 Future Work

We present some future work that can be developed based on the research conducted in this thesis, as follows:

! Updating the systematic mapping study by including more general software engineering venues in order to compare our results with other software engineering communities.

! Performing a study to assess whether authors properly classified the empirical strategies they used, in order to obtain a more accurate classification. In this sense, it is interesting to evaluate the studies that used the term "experiment" but in fact did not conduct an experiment, for example, studies that only performed a dataset analysis.

! Conducting a similar qualitative study with a greater number of researchers from other companies and universities in order to try to understand how the empirical research communities in software engineering, as well as in other fields, carry out experiments in practice.

! Developing a web-based collaborative platform for reviewing the completeness of experimental plans for controlled experiments using human subjects in software engineering, based on the proposed instrument.

! Creating a repository of experimental plans for performing experiments in software engineering.

! Developing instruments for reviewing the completeness of study plans for other kinds of empirical studies.

! Performing additional assessments using a broader range of experimental plans and raters with different kinds of expertise, in order to confirm previous results and generalize them for software engineering.

! Building a preamble that discusses cost-benefit trade-offs in the context of experimental design, which would help remind experimenters of the time and cost of designing, conducting, and analyzing an experiment.


References

[1] Basili V. R., Selby R. W., and Hutchens D. H. Experimentation in software engineering. IEEE Trans. Softw. Eng., 12(7):733–743, July 1986.

[2] Basili V. R. The experimental paradigm in software engineering. In Proceedings of the International Workshop on Experimental Software Engineering Issues: Critical Assessment and Future Directions, pages 3–12, London, UK, 1993. Springer-Verlag.

[3] Basili V. R. The role of experimentation in software engineering: Past, current, and future. In Proceedings of the 18th International Conference on Software Engineering, ICSE '96, pages 442–449, Washington, DC, USA, 1996. IEEE Computer Society.

[4] Basili V. R. The role of controlled experiments in software engineering research. In Proceedings of the 2006 International Conference on Empirical Software Engineering Issues: Critical Assessment and Future Directions, pages 33–37, Berlin, Heidelberg, 2007. Springer-Verlag.

[5] Zannier C., Melnik G., and Maurer F. On the success of empirical studies in the International Conference on Software Engineering. In Proceedings of the 28th International Conference on Software Engineering, ICSE '06, pages 341–350, New York, NY, USA, 2006. ACM.

[6] Rombach H. D., Basili V. R., and Selby R. W., editors. Proceedings of the International Workshop on Experimental Software Engineering Issues: Critical Assessment and Future Directions, London, UK, 1993. Springer-Verlag.

[7] Fenton N. How effective are software engineering methods? J. Syst. Softw., 22(2):141–146, August 1993.

[8] Tichy W. F., Lukowicz P., Prechelt L., and Heinz E. A. Experimental evaluation in computer science: A quantitative study. Computer, 1995.

[9] Tichy W. F. Should computer scientists experiment more? Computer, 31(5):32–40, May 1998.

[10] Lott C. and Rombach H. D. Repeatable software engineering experiments for comparing defect-detection techniques. Empirical Software Engineering, 1:241–277, 1996.

[11] Sjøberg D. I. K., Dybå T., and Jorgensen M. The future of empirical methods in software engineering research. In 2007 Future of Software Engineering, FOSE '07, pages 358–378, Washington, DC, USA, 2007. IEEE Computer Society.

[12] Dybå T. and Dingsøyr T. Empirical studies of agile software development: A systematic review. Information and Software Technology, 50(9–10):833–859, 2008.

[13] Kampenes V. B., Dybå T., Hannay J. E., and Sjøberg D. I. K. A systematic review of effect size in software engineering experiments. Information and Software Technology, 49(11–12):1073–1086, 2007.


[14] Sjøberg D. I. K., Hannay J. E., Hansen O., By Kampenes V., Karahasanovic A., Liborg N., and Rekdal A. C. A survey of controlled experiments in software engineering. IEEE Trans. Softw. Eng., 31(9):733–753, September 2005.

[15] Glass R. L., Vessey I., and Ramesh V. Research in software engineering: an analysis of the literature. Information and Software Technology, pages 491–506, 2002.

[16] Hannay J. E., Sjøberg D. I. K., and Dybå T. A systematic review of theory use in software engineering experiments. IEEE Trans. Softw. Eng., 33(2):87–107, February 2007.

[17] Ko A. J., LaToza T. D., and Burnett M. M. A practical guide to controlled experiments of software engineering tools with human participants. Empirical Software Engineering, 20(1):110–141, 2015.

[18] Singer J. Using the American Psychological Association (APA) style guidelines to report experimental results. In Workshop on Empirical Studies in Software Maintenance (WSESE 1999), pages 71–75, Oxford, England, September 1999.

[19] Dieste O., Grimán A., Juristo N., and Saxena H. Quantitative determination of the relationship between internal validity and bias in software engineering experiments: Consequences for systematic literature reviews. In Proceedings of the 2011 International Symposium on Empirical Software Engineering and Measurement, ESEM '11, pages 285–294, Washington, DC, USA, 2011. IEEE Computer Society.

[20] Jedlitschka A. Minutes from third international workshop on empirical software engineering "Guidelines for empirical work in software engineering". International Symposium in Empirical Software Engineering, June 2005.

[21] Wohlin C., Runeson P., Höst M., Ohlsson M. C., Regnell B., and Wesslén A. Experimentation in Software Engineering. Springer, 2012.

[22] Easterbrook S., Singer J., Storey M. A., and Damian D. Selecting empirical methods for software engineering research. In Guide to Advanced Empirical Software Engineering. Springer, 2008.

[23] Juristo N. and Moreno A. M. Basics of Software Engineering Experimentation. Springer Publishing Company, Incorporated, 1st edition, 2010.

[24] Kitchenham B. A., Pfleeger S. L., Pickard L. M., Jones P. W., Hoaglin D. C., Emam K. E., and Rosenberg J. Preliminary guidelines for empirical research in software engineering. IEEE Trans. Softw. Eng., 28(8):721–734, August 2002.

[25] Kitchenham B., Sjøberg D. I. K., Dybå T., Brereton P., Budgen D., Höst M., and Runeson P. Trends in the quality of human-centric software engineering experiments – a quasi-experiment. IEEE Trans. Softw. Eng., 39(7):1002–1017, July 2013.

[26] Borges A., Ferreira W., Barreiros E., Almeida A., Fonseca L., Teixeira E., Silva D., Alencar A., and Soares S. Support mechanisms to conduct empirical studies in software engineering. In Proceedings of the 8th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM '14, pages 50:1–50:4, New York, NY, USA, 2014. ACM.


[27] Borges A., Ferreira W., Barreiros E., Almeida A., Fonseca L., Teixeira E., Silva D., Alencar A., and Soares S. Support mechanisms to conduct empirical studies in software engineering: A systematic mapping study. In Proceedings of the 19th International Conference on Evaluation and Assessment in Software Engineering, EASE '15, pages 22:1–22:14, New York, NY, USA, 2015. ACM.

[28] Kitchenham B., Pickard L., and Pfleeger S. L. Case studies for method and tool evaluation. IEEE Softw., 12(4):52–62, July 1995.

[29] Wohlin C., Runeson P., Höst M., Ohlsson M. C., Regnell B., and Wesslén A. Experimentation in Software Engineering — An Introduction. Kluwer Academic Publishers, 2000.

[30] Juristo N. and Moreno A. M. Basics of Software Engineering Experimentation. Kluwer Academic Publishers, Boston, 2001.

[31] Travassos G. H., Santos P. S. M. S., Mian P. G., Neto A. C. D., and Biolchini J. An environment to support large scale experimentation in software engineering. In Proceedings of the 13th IEEE International Conference on Engineering of Complex Computer Systems, pages 193–202. IEEE Computer Society, 2008.

[32] Basili V. R., Shull F., and Lanubile F. Using experiments to build a body of knowledge. In Proceedings of the Third International Andrei Ershov Memorial Conference on Perspectives of System Informatics, PSI '99, pages 265–282, London, UK, 2000. Springer-Verlag.

[33] IEEE. IEEE standard glossary of software engineering terminology, 1990.

[34] Naur P. and Randell B. Software Engineering: Report of a Conference Sponsored by the NATO Science Committee. Brussels, Scientific Affairs Division, NATO, Garmisch, Germany, October 1969.

[35] Sommerville I. Software Engineering. Addison-Wesley, 9th edition, 2007.

[36] Bourque P. and Fairley R. E., editors. Guide to the Software Engineering Body of Knowledge - SWEBOK. IEEE Computer Society, 2014.

[37] Weyuker E. J. Empirical software engineering research - the good, the bad, the ugly. 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, pages 1–9, 2011.

[38] Robson C. Real World Research - A Resource for Social Scientists and Practitioner-Researchers. Blackwell Publishing, 2nd edition, 2002.

[39] Kitchenham B. Procedures for performing systematic reviews. Technical report, Keele University and NICTA, 2004.

[40] Kitchenham B. and Charters S. Guidelines for performing systematic literature reviews in software engineering. Technical report, Keele University and University of Durham, 2007.

[41] Yin R. K. Case Study Research: Design and Methods. Sage, Thousand Oaks, CA, 2003.


[42] Hofer A. and Tichy W. Status of Empirical Research in Software Engineering. Springer Verlag, 2007.

[43] Zelkowitz M. V., Wallace D. R., and Binkley D. W. Experimental validation of new software technology. In Natalia Juristo and Ana M. Moreno, editors, Lecture Notes on Empirical Software Engineering, pages 229–263. World Scientific Publishing Co., Inc., River Edge, NJ, USA, 2003.

[44] Wohlin C. and Aurum A. Towards a decision-making structure for selecting a research design in empirical software engineering. Empirical Software Engineering, 20(6):1427–1455, 2015.

[45] Pfleeger S. L. Experimental design and analysis in software engineering. Annals of Software Engineering, 1:219–253, 1995.

[46] Basili V. R., Caldiera G., and Rombach H. D. Goal question metric paradigm. In Encyclopedia of Software Engineering, volume 1, pages 528–532. John Wiley and Sons Inc., 1994.

[47] Judd C. M. Research Methods in Social Relations. Harcourt Brace Jovanovich, sixth edition, 1991.

[48] Campbell D. T. and Stanley J. C. Experimental and Quasi-Experimental Designs for Research. Rand McNally College Publishing, 1963.

[49] Montgomery D. C. Design and Analysis of Experiments. John Wiley & Sons, New Jersey, 1997.

[50] Box G., Hunter W., and Hunter J. Statistics for Experimenters: An Introduction to Design, Data Analysis and Model Building. John Wiley & Sons, New York, 1978.

[51] Holland P. W. Statistics and causal inference. Journal of the American Statistical Association, pages 945–960, July 1986.

[52] Martin D. W. Doing psychology experiments. Brooks/Cole, Pacific Grove, 4th edition, 1996.

[53] Cook T. D. and Campbell D. T. Quasi-Experimentation: Design & Analysis Issues for Field Settings. Houghton Mifflin Company, Boston, 1979.

[54] Shadish W. R., Cook T. D., and Campbell D. T. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Houghton Mifflin, Boston, 2002.

[55] Robson C. Experiment design and statistics in psychology. Penguin Psychology, 1994.

[56] Greenhalgh T. How to Read a Paper: The Basics of Evidence-Based Medicine. BMJ Publishing Group, London, 3rd edition, 2006.

[57] Cox D. R. Planning Experiments. John Wiley & Sons, New York, 1958.

[58] Greenwood E. Experimental sociology. Octagon Books, New York, 1945.

[59] Chapin F. S. Experimental designs in sociological research. Harper and Row, New York, 1947.


[60] Slavin R. E. Research Methods in Education: A Practical Guide. Englewood Cliffs, NJ: Prentice-Hall, 1984.

[61] Light R. J., Singer J. D., and Willett J. B. Design: Planning Research on Higher Education. Cambridge, MA: Harvard, 2009.

[62] Lazar J., Feng J. H., and Hochheiser H. Research Methods in Human-Computer Interaction. Wiley Publishing, 2010.

[63] The British Psychological Society. Code of human research ethics, 2014.

[64] American Psychological Association. Ethical principles of psychologists and code of conduct, 2010.

[65] Johns Hopkins University. JHSPH human subjects research ethics field training guide, 2010.

[66] Garza C. E. The touchy ethics of corporate anthropology, 1991.

[67] Shneiderman B. Software Psychology: Human Factors in Computer and Information Systems (Winthrop Computer Systems Series). Winthrop Publishers, 1980.

[68] Weinberg G. M. The Psychology of Computer Programming. John Wiley & Sons, Inc., New York, NY, USA, 1985.

[69] Singer J. and Vinson N. G. Why and how research ethics matters to you. Yes, you!, 2001.

[70] Singer J. and Vinson N. G. Ethical issues in empirical studies of software engineering. IEEE Trans. Softw. Eng., 28(12):1171–1180, December 2002.

[71] Hanenberg S. Faith, hope, and love: An essay on software science's neglect of human factors. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '10, pages 933–946, New York, NY, USA, 2010. ACM.

[72] Vinson N. G. and Singer J. A practical guide to ethical research involving humans. Springer, 2008.

[73] Harris P. Designing and reporting experiments in psychology. Open University Press, 1995.

[74] Higgins J. P. T. and Green S. Cochrane Handbook for Systematic Reviews of Interventions. The Cochrane Collaboration, 5th edition, February 2008.

[75] Australian National Health and Medical Research Council. How to review the evidence: systematic identification and review of the scientific literature. National Health and Medical Research Council, 2000.

[76] University of York, NHS Centre for Reviews and Dissemination. Undertaking systematic reviews of research on effectiveness: CRD's guidance for those carrying out or commissioning reviews. CRD report. NHS Centre for Reviews and Dissemination, University of York, 2001.

[77] Cochrane Collaboration. Cochrane reviewers' handbook, December 2003.


[78] Shull F., Basili V., Carver J., Maldonado J. C., Travassos G. H., Mendonça M., and Fabbri S. Replicating software engineering experiments: Addressing the tacit knowledge problem. In Proceedings of the 2002 International Symposium on Empirical Software Engineering, ISESE '02, pages 7–, Washington, DC, USA, 2002. IEEE Computer Society.

[79] Shull F., Mendonça M. G., Basili V., Carver J., Maldonado J. C., Fabbri S., Travassos G. H., and Ferreira M. C. Knowledge-sharing issues in experimental software engineering. Empirical Softw. Engg., 9(1-2):111–137, March 2004.

[80] Dybå T. and Dingsøyr T. Strength of evidence in systematic reviews in software engineering. In Proceedings of the Second ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM '08, pages 178–187, New York, NY, USA, 2008. ACM.

[81] Jedlitschka A., Ciolkowski M., and Pfahl D. Reporting Experiments in Software Engineering. Springer, 2008.

[82] Kitchenham B. A., Brereton O. P., Budgen D., and Li Z. An evaluation of quality checklist proposals: A participant-observer case study. In Proceedings of the 13th International Conference on Evaluation and Assessment in Software Engineering, EASE '09, pages 55–64, Swinton, UK, 2009. British Computer Society.

[83] Kitchenham B., Sjøberg D. I. K., Brereton O. P., Dybå T., Höst M., Pfahl D., and Runeson P. Can we evaluate the quality of software engineering experiments? In Proceedings of the 2010 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM '10, pages 2:1–2:8, New York, NY, USA, 2010. ACM.

[84] Höst M. and Runeson P. Checklists for software engineering case study research. In Proceedings of the First International Symposium on Empirical Software Engineering and Measurement, ESEM '07, pages 479–481. IEEE Computer Society, September 2007.

[85] Wieringa R. A unified checklist for observational and experimental research in software engineering (version 1), March 2012.

[86] Jadad A. R., Moore R. A., Carroll D., Jenkinson C., Reynolds D. J., Gavaghan D. J., and McQuay H. J. Assessing the quality of reports of randomized clinical trials: is blinding necessary? Controlled Clinical Trials, 17(1):1–12, February 1996.

[87] Moher D., Schulz K. F., and Altman D. G. The CONSORT statement: revised recommendations for improving the quality of reports of parallel group randomized trials. BMC Med Res Methodol., 357:1191–1194, April 2001.

[88] Moher D., Hopewell S., Schulz K. F., Montori V., Gotzche P., Devereaux P., Elbourne D., Egger M., and Altman D., for the CONSORT Group. CONSORT 2010 explanation and elaboration: updated guidelines for reporting parallel group randomised trials. British Medical Journal, 340, 2010.

[89] Schulz K. and Altman D. CONSORT 2010 statement: updated guidelines for reporting parallel group randomised trials. Annals of Internal Medicine, 152(11):1–7, June 2010.


[90] Cochrane Effective Practice and Organisation of Care Review Group. Data collection checklist, June 2002. Technical report.

[91] Zaza S., Wright-De Agüero L. K., Briss P. A., Truman B. I., Hopkins D. P., Hennessy M. H., Sosin D. M., Anderson L., Carande-Kulis V. G., Teutsch S. M., and Pappaioanou M. Data collection instrument and procedure for systematic reviews in the guide to community preventive services. Am J Prev Med, 18:44–74, 2000.

[92] Begg C., Cho M., Eastwood S., Horton R., Moher D., Olkin I., Pitkin R., Rennie D., Schulz K., Simel D., and Stroup D. Improving the quality of reporting of randomized controlled trials: the CONSORT statement. JAMA, 276(8):637–639, August 1996.

[93] Badgley R. F. An assessment of research methods reported in 103 scientific articles from two Canadian medical journals. Canadian Medical Association Journal, 85(5):246–250, 1961.

[94] Bland J. M., Jones D. R., Bennett S., Cook D. G., Haines A. P., and MacFarlane A. J. Is the clinical trial evidence about new drugs statistically adequate? British Journal of Clinical Pharmacology, 19(2):155–160, 1985.

[95] Downs S. H. and Black N. The feasibility of creating a checklist for assessment of the methodological quality both of the randomised and non-randomised studies of health care interventions. J Epidemiol Community Health, 52:377–384, 1998.

[96] Greenhalgh T., Robert G., Macfarlane F., Bate P., and Kyriakidou O. Diffusion of innovations in service organizations: systematic review and recommendations. British Medical Journal (Clinical research ed), 2005.

[97] Coalition for Evidence-Based Policy. Checklist for reviewing a randomized controlled trial of a social program or project, to assess whether it produced valid evidence. http://coalition4evidence.org/wp-content/uploads/2010/02/Checklist-For-Reviewing-a-RCT-Jan10.pdf, February 2010.

[98] National Collaborating Centre for Mental Health (UK). Self-Harm: Longer-Term Management. Number (NICE Clinical Guidelines, No. 133) in Appendix 11, Quality checklist templates for clinical studies and reviews. Leicester (UK): British Psychological Society, 2012.

[99] Gardner M. J., Machin D., and Campbell M. J. Use of checklists in assessing the statistical content of medical studies. British Medical Journal (Clinical research ed), 296(6523):810–812, 1989.

[100] Jeffers J. N. R. Design of experiments, 1978. Statistical Checklist, 1.

[101] Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, April 1960.

[102] Landis J. R. and Koch G. G. The measurement of observer agreement for categorical data. Biometrics, 33(1), 1977.


[103] Pikkarainen M., Salo O., Kuusela R., and Abrahamsson P. Strengths and barriers behind the successful agile deployment - insights from the three software intensive companies in Finland. Empirical Softw. Engg., 17(6):675–702, December 2012.

[104] Jørgensen M. A strong focus on low price when selecting software providers increases the likelihood of failure in software outsourcing projects. In Proceedings of the 17th International Conference on Evaluation and Assessment in Software Engineering, EASE '13, pages 220–227, New York, NY, USA, 2013. ACM.

[105] Strauss A. L. and Corbin J. Basics of Qualitative Research: Techniques and Procedures for Developing Grounded Theory. Sage, Thousand Oaks, 2nd edition, 1998.

[106] Seaman C. B. Qualitative methods in empirical studies of software engineering. IEEE Trans. Softw. Eng., 25(4):557–572, July 1999.

[107] Basili V. R., Shull F., and Lanubile F. Building knowledge through families of experiments. IEEE Trans. Softw. Eng., 25(4):456–473, July 1999.

[108] Arisholm E., Sjøberg D. I. K., Carelius G. J., and Lindsjørn Y. A web-based support environment for software engineering experiments. Nordic J. of Computing, 9(3):231–247, September 2002.

[109] Brooks A., Daly J., Miller J., Roper M., and Wood M. Replication of experimental results in software engineering. Technical report, 1996.

[110] Juristo N. and Vegas S. Using differences among replications of software engineering experiments to gain knowledge. In Proceedings of the 2009 3rd International Symposium on Empirical Software Engineering and Measurement, ESEM '09, pages 356–366, Washington, DC, USA, October 2009. IEEE Computer Society.

[111] Carver J. C. Towards reporting guidelines for experimental replications: A proposal, 2010.

[112] Daly J. Replication and a Multi-Method Approach to Empirical Software Engineering Research. PhD thesis, Department of Computer Science, University of Strathclyde, 1996.

[113] Shull F. J., Carver J. C., Vegas S., and Juristo N. The role of replications in empirical software engineering. Empirical Softw. Engg., 13(2):211–218, April 2008.

[114] Karahasanović A., Anda B., Arisholm E., Hove S. E., Jørgensen M., Sjøberg D. I., and Welland R. Collecting feedback during software engineering experiments. Empirical Softw. Eng., 10(2):113–147, April 2005.

[115] Campbell D. T. and Stanley J. C. Experimental and Quasi-Experimental Designs for Research. Rand McNally College Publishing, Chicago, 1967.

[116] Siegel S. and Castellan N. J. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York, USA, 1956.

[117] Winer B. J., Brown D. R., and Michels K. M. Statistical Principles in Experimental Design. McGraw-Hill, New York, 3rd edition, 1991.


[118] Cohen J. Statistical Power Analysis for the Behavioral Sciences. Routledge, 2nd edition, July 1988.

[119] Devore J. L. and Farnum N. Applied Statistics for Engineers and Scientists. Duxbury, 1988.

[120] Arcuri A. and Briand L. A practical guide for using statistical tests to assess randomized algorithms in software engineering. In Proceedings of the 33rd International Conference on Software Engineering, ICSE '11, pages 1–10, New York, NY, USA, 2011. ACM.

[121] Miles M. B. and Huberman A. M. Qualitative Data Analysis: An Expanded Sourcebook. Sage Publications, Thousand Oaks, 2nd edition, 1994.

[122] Cruzes D. S. and Dybå T. Recommended steps for thematic synthesis in software engineering. In Proceedings of the 2011 International Symposium on Empirical Software Engineering and Measurement, ESEM '11, pages 275–284, Washington, DC, USA, 2011. IEEE Computer Society.

[123] Glaser B. and Strauss A. The Discovery of Grounded Theory: Strategies for Qualitative Research. Transaction Books, 2009.

[124] Glaser B. G. Basics of Grounded Theory Analysis: Emergence vs. Forcing. Sociology Press, Mill Valley, 1992.

[125] Charmaz K. Constructing Grounded Theory: A Practical Guide Through Qualitative Analysis. Sage, Thousand Oaks, 2006.

[126] Bryant A. and Charmaz K. The SAGE Handbook of Grounded Theory. SAGE, 2007.

[127] Denzin N. K. and Lincoln Y. S. The SAGE Handbook of Qualitative Research. SAGE Publications, 2011.

[128] Shull F., Singer J., and Sjøberg D. I. K. Guide to Advanced Empirical Software Engineering. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2007.

[129] Travassos G., Gurov D., and Amaral E. Experimental software engineering - an introduction (in Portuguese). SE Technical Report ES-590/02-Abril, COPPE/UFRJ, Brazil, 2002.

[130] Ardelin Neto A. and Conte T. U. Identifying threats to validity and control actions in the planning stages of controlled experiments. In Proceedings of the 26th International Conference on Software Engineering and Knowledge Engineering, SEKE 2014, pages 256–261, 2014.

[131] Lopes V. P. and Travassos G. H. Knowledge repository structure of an experimental software engineering environment. Brazilian Symposium on Software Engineering, 2009.

[132] Freire M. A., Accioly P., Sizilio G., Campos E., Kulesza U., Aranha E., and Borba P. A model-driven approach to specifying and monitoring controlled experiments in software engineering. In Product-Focused Software Process Improvement - 14th International Conference, PROFES 2013, Paphos, Cyprus, June 12-14, 2013, Proceedings, pages 65–79, 2013.

[133] Carver J. C., Juristo N., Baldassarre M. T., and Vegas S. Replications of software engineering experiments. Empirical Software Engineering, 19:267–276, 2014.


[134] Minitab Inc. Minitab 17 statistical software, 2010.

[135] SAS Institute Inc. JMP statistics tool, 2007.

[136] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2008. ISBN 3-900051-07-0.

[137] IBM Corp. IBM SPSS Statistics, 2015.

[138] França B. B. and Travassos G. H. Experimentation with dynamic simulation models in software engineering: Planning and reporting guidelines. Empirical Softw. Engg., 21(3):1302–1345, June 2016.

[139] Hochstein L., Nakamura T., Shull F., Zazworka N., Basili V. R., and Zelkowitz M. V. An environment for conducting families of software engineering experiments. Advances in Computers, volume 74. Elsevier, 2008.

[140] OMG. Business Process Model and Notation (BPMN) version 2.0, January 2011.

[141] Jedlitschka A. and Pfahl D. Reporting guidelines for controlled experiments in software engineering. International Symposium on Empirical Software Engineering, 2005.

[142] Jedlitschka A. and Pfahl D. Reporting guidelines for controlled experiments in software engineering. In Proceedings of the ACM/IEEE International Symposium on Empirical Software Engineering, pages 95–104, November 2005.

[143] Jedlitschka A. and Ciolkowski M. Reporting guidelines for controlled experiments in software engineering, 2006.

[144] Jedlitschka A., Ciolkowski M., and Pfahl D. Reporting guidelines for controlled experiments in software engineering, 2007.

[145] McGowan H. M. Planning a comparative experiment in educational settings. Journal of Statistics Education, 19(2), 2011.

[146] McCall W. A. How to Experiment in Education. New York, 1923.

[147] Martin D. W. Doing Psychology Experiments. Pacific Grove, 7th edition, 2008.

[148] Oehlert G. W. A First Course in Design and Analysis of Experiments. W. H. Freeman, New York, 2010.

[149] Montgomery D. C. Design and Analysis of Experiments. John Wiley and Sons, 8th edition, 2013.

[150] Crombie I. K. The Pocket Guide to Critical Appraisal. BMJ Books, 1996.

[151] Fink A. Conducting research literature reviews, 2005.

[152] Greenhalgh T. How to Read a Paper: The Basics of Evidence-Based Medicine. BMJ Books, 2000.

[153] Petticrew M. and Roberts H. Systematic Reviews in the Social Sciences: A Practical Guide. Blackwell Publishing, 2005.


[154] Effective Public Health Practice Project. Quality assessment tool for quantitative studies, 2010.

[155] Monroe S. W. and Engelhart M. D. Experimental Research in Education. Bureau of Educational Research, College of Education, 1930.

[156] Fleiss J. L. The Design and Analysis of Clinical Experiments. Wiley, New York, 1999.

[157] Edenborough R. Using Psychometrics: A Practical Guide to Testing and Assessment. Kogan Page, 2nd edition, 2000.

[158] LeBreton J. The restriction of variance hypothesis and interrater reliability and agreement: Are ratings from multiple sources really dissimilar? Organ. Res. Meth., 6(1):80–128, 2008.

[159] Shrout P. E. and Fleiss J. L. Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420–428, 1979.

[160] Burke J. and Dunlap W. Estimating interrater agreement with the average deviation index: A user's guide. Organizational Research Methods, 5(2):159–172, 2002.

[161] Dancey C. and Reidy J. Statistics Without Maths for Psychology: Using SPSS for Windows. Prentice Hall, 2004.

[162] Davis F. D. Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS Q., 13(3):319–340, September 1989.

[163] DeVellis R. Scale Development: Theory and Applications. Sage Publications, 3rd edition, 2012.

[164] Almeida A., Barreiros E., Saraiva J., and Soares S. Mechanisms to guide empirical studies in software engineering: A mapping study (in Portuguese). In Proceedings of the 8th Experimental Software Engineering Latin American Workshop, Rio de Janeiro, Brazil, 2011.

[165] Marshall C. and Brereton P. Tools to support systematic literature reviews in software engineering: A mapping study. In ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, pages 296–299, 2013.

[166] Seaman C. B. Qualitative Methods. Springer, 2008.

[167] Kitchenham B., Al-Khilidar H., Babar M. A., Berry M., Cox K., Keung J., Kurniawati F., Staples M., Zhang H., and Zhu L. Evaluating guidelines for empirical software engineering studies. In Proceedings of the 2006 ACM/IEEE International Symposium on Empirical Software Engineering, ISESE '06, pages 38–47, New York, NY, USA, 2006. ACM.

[168] Condori-Fernandez N., Wieringa R., Daneva M., Mutschler B., and Pastor O. An experimental evaluation of a unified checklist for designing and reporting empirical research in software engineering, May 2012.

[169] Basili V. R. and Rombach H. D. The TAME project: Towards improvement-oriented software environments. IEEE Trans. Software Eng., 14:758–773, 1988.

[170] Dybå T., Kampenes V. B., and Sjøberg D. I. K. A systematic review of statistical power in software engineering experiments. Information & Software Technology, 48(8):745–755, 2006.

[171] Breaugh J. A. Effect size estimation: factors to consider and mistakes to avoid. J Manag, 29(1):79–97, 2003.


Appendix


A Support Mechanisms Reference

A.1 Support Mechanisms


A.2 Primary Studies


B Initial version of the Proposed Instrument

Sections B.1 through B.9 describe category 1 – Stating the Goals, category 2 – Hypotheses, Variables, and Measurements, category 3 – Participants, category 4 – Experimental Materials and Tasks, category 5 – Experimental Design, category 6 – Procedure, category 7 – Data Collection and Data Analysis, category 8 – Threats to Validity, and category 9 – Document, respectively.

B.1 Category 1: Stating the Goals

This category contains four items, which allow researchers to review their experimental plan regarding the study goals, the research questions, the choice of the controlled experiment as the most appropriate research technique, and ethical concerns.

Item 1: Are the aims clearly and precisely stated?
Consider whether the experiment's goals describe:
• A clear purpose.
• Specific objectives.
• The reasons for undertaking the experiment, clearly and explicitly stated.
Hint
One way to define the experiment goal is to use the GQM template. The purpose of a goal definition template is to ensure that important aspects of an experiment are defined before the planning and execution take place. By defining the goal of the experiment according to this template, the foundation is properly laid [21]. The goal template is [169]:
Analyze <Object(s) of study>
for the purpose of <Purpose>
with respect to their <Quality focus>
from the point of view of the <Perspective>
in the context of <Context>.
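As a purely illustrative instantiation of this template (the techniques and context named here are hypothetical, not taken from a real study), a goal could read:

    Analyze unit testing techniques A and B
    for the purpose of comparison
    with respect to their fault detection effectiveness
    from the point of view of the researcher
    in the context of graduate students testing a small Java system.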


Item 2: Are the research questions linked to research goals and clearly defined?
Consider:
• Context of the research questions.
• Relevance of the research questions.

Item 3: Based on the research goals, is a controlled experiment the most appropriate research technique to use?
Consider:
• Execution control.
• Measurement control.
• Investigation cost.
• Possibility to replicate.
Hint
Wohlin [21] presents a comparison of empirical strategies in Chapter 2, Section 5 (pg. 18), and Easterbrook 2008 presents an overview of the factors that should be involved in selecting an appropriate research technique for software engineering research.

Item 4: Do the objectives of the experiment satisfy ethical concerns?
Consider whether:
• The participants have access to all information they need about the study, before making their decision to participate or not.
• The experiment has scientific value.
• The experimenters maintain confidentiality of data.
• The beneficence outweighs the risks.

B.2 Category 2: Hypotheses, Variables, and Measurements

This group includes four items, which review the relationship between the hypotheses, variables, and measurements and the research goals.

Item 5: Are the hypotheses of the research clearly described and are they related to the research goals?
Considerations:
• The objective should be translated into a formal hypothesis.
• For each goal stated in the research objective, the null hypotheses and their corresponding alternative hypotheses should be described.
• The description of both null and alternative hypotheses should be as formal as possible.
• Any statistical hypotheses should be described.


• The main hypotheses should be explicitly separated from ancillary hypotheses and exploratory analyses. In the case of ancillary hypotheses, a hierarchical system is appropriate.
• Hypotheses need to state the treatments and the control conditions.
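For instance (an illustrative formulation, continuing the hypothetical goal sketched in Section B.1), null and alternative hypotheses for comparing the mean fault detection effectiveness μ of two techniques A and B could be stated as:

    H0: μ_A = μ_B (techniques A and B do not differ in fault detection effectiveness)
    H1: μ_A ≠ μ_B (techniques A and B differ in fault detection effectiveness)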

Item 6: Have experimenters defined the variables or attributes to be measured?
Consider whether:
• Independent variables, dependent variables (response variables), factors, parameters, and contextual variables are described.
Hint
Dependent variables need to be defined and justified in terms of their relevance to the goals listed in the Research Objectives [81].

Item 7: Do the research measures allow the questions to be answered?
Consider whether:
• The measures are the most relevant ones for answering the research question.
• The outcome measures are meaningful and relevant to the objectives of the experiment.
• The experimenters justify the choice of outcome measures in terms of their relevance to the objectives of the experiment.

Item 8: Are the outcome measures valid and clearly described?
Consider whether:
• The type of metrics that will be calculated is described.
• It is described how the measures will be obtained.
• Criteria for measuring and judging effects are defined.
• The experimenters describe how they are going to measure the metric that they define for the dependent variables (response variables).
• A valid and reliable method is used to determine the outcome measures.
Hint
These are indicators of valid measures [82]:
• The measures are plausible measures of the construct they are meant to represent.
• The measures are direct measures of well-defined concepts.
• The measurement scales are respected (e.g. categorical measures are not treated as ordinal or interval).
• The data collection process is defined and appropriate.


B.3 Category 3: Participants

This section covers concerns about the human subjects, from the recruitment strategy and process, through collecting important information from participants, the population from which participants are drawn, and the sample size, to how participants are treated. It contains eight items.

Item 9: Is the recruitment strategy appropriate to the aims of the research?
Consider whether:
• Recruitment materials are appropriate, such as e-mail, poster, or advertisement.
• It is specified if the experiment will use in-person or remote participants.
• What level of compensation will be offered to the participants, such as payments, no payments (the participants will be volunteers), or grades.
• Cultural differences can affect the recruiting strategy.
Hint
In case of recruiting remote participants, consider whether the fact that participants may mask the results is controlled for [17]. The identification of an appropriately general group of participants is always a challenge. Appropriate recruiting methods can help, but there are no guarantees. Despite your best efforts to find a representative population, you always face the possibility that your group of participants is insufficiently representative in a way that was unanticipated. As this bias is always possible, it is best to explicitly state what steps you have taken to account for potentially confounding variables and to be cautious when making claims about your results [62].

Item 10: Is the recruitment process clearly described?
Consider whether the recruitment process is specified in terms of:
• Who to recruit, including eligibility criteria for participants.
• Dates defining the periods for recruitment.
• How many participants will be recruited.
• Descriptions of the study participants in terms of, for example, SE experience, type (student, practitioner, consultant), nationality, task experience, and other relevant variables.
• What knowledge/experience the participants need to have to be able to do the experimental tasks.

Item 11: Is a demographic questionnaire planned to collect information from participants?
Consider whether:
• The experimenters describe when the demographic data will be collected.
• How the demographic data will be used is described.
Hint
Surveys and interviews are common ways of collecting and measuring demographic variables. This data can be gathered before or after a task, or even as part of testing a potential participant against inclusion criteria [17].

Item 12: Does the researcher define the population from which participants are drawn?
Consider whether:
• The experimenters carefully define the population about which they are seeking to make inferences from the results of the experiment.

Item 13: Is the sample well described?
Consider whether the sample is described in terms of:
• The sample size is justified.
• The study has an adequate sample size (one large enough) to detect meaningful effects of the intervention.
Hint
A more principled way to decide how many participants to recruit is to do a prospective power analysis (e.g., Dybå et al. [170]), which gives an estimate of the sample size required to detect a difference between experimental conditions. The exact calculation of this estimate depends on the specific statistical test used to compare the groups, but it generally requires three inputs: (1) the expected effect size (the difference in the outcome variable between groups), (2) the expected variation in this outcome measurement, and (3) the Type I error rate α (typically .05 in software engineering). The first two must be estimated. One source of these estimates is data from previous experiments on the tool or even pilot studies. There are also standard approaches for estimating sample size and effect size such as Cohen's d, odds ratio, and Abelson's Causal Efficacy Ratio. Breaugh [171] provides an accessible introduction to these topics [17]. A minimal sketch of such a calculation is given after this item.
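As a minimal sketch of such a prospective power analysis, assuming a two-group design analyzed with an independent-samples t-test and the Python statsmodels library (the effect size value is a hypothetical estimate, e.g., taken from a pilot study):

    # Prospective power analysis for a two-group experiment (illustrative values).
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    # Inputs: expected effect size (Cohen's d, a hypothetical estimate here),
    # the Type I error rate alpha, and the desired statistical power.
    n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8,
                                       alternative='two-sided')
    print(f"Participants needed per group: {n_per_group:.1f}")  # about 64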

Item 14: Are ethical issues addressed properly (personal intentions, integrity issues, consent, review board approval)?
Consider whether:
• The integrity of individuals/organizations is taken into account.
• There are sufficient details of how the research will be explained to participants to assess whether ethical standards will be maintained.
• The experimenters have discussed issues raised by the study (e.g. issues around informed consent or confidentiality, or how they have handled the effects of the study on the participants) during and after the study.
• The study design approval has been sought from the ethics committee at the institution.
Hint
Experimenters should describe the relationship between themselves and participants and whether that relationship has been adequately considered. The experimenters should critically examine their own role, potential bias, and influence during formulation of the research questions and data collection, including sample recruitment and choice of location. Also, the experimenters should describe how they will respond to events during the study [80].

Item 15: Do the experimenters describe how participation will be motivated?
Consider whether the experimenters describe:
• The motivation for the participants to participate in the study.
• Whether they are voluntary participants, or whether they will receive payments, educational credits, or another kind of advantage.
Hint
A description of the motivation for the participants to participate is mandatory. For instance, it should be stated whether the participants were paid and if so, how much, or whether they earned educational credits for taking part in the experiment [81].

Item 16: Is the debriefing of participants clearly defined?
Consider if the experimenters describe how they will [17]:
• Explain to the participant what the study is investigating.
• Explain why the study is important to conduct.
• Explain how the participant's data will be used to investigate the question.
• Explain the correct solutions to the tasks.
• Provide contact information so that the participants can ask further questions if they want to know more about the research.
• Provide instructions about information that participants should not share with others. For example, if one is recruiting from a student population, students might share the answers to the tasks with their friends, who may later participate in the study.
Hint
• After a participant has completed the tasks, it is common practice in human subjects research to debrief the participant about the study. Debriefing can also be an opportunity to get speculative feedback from participants about how they felt about the tool.
• If participants did not use the experimental treatment, it may be instructive for them to try it and provide feedback.
• Participants should not leave a study feeling as if they "failed," especially when tasks may have been designed to ensure that not every participant would succeed. Many ethicists feel that this is a necessary part of research with human participants [17].

B.4 Category 4: Experimental Materials and Tasks

This set of items is focused on materials and tasks which should be used in the controlled experiment. It includes three items.


Item 17: Do the experimenters clearly describe what instruments, materials, technology, and tools will be used and how?
Consider how well the experimenters have described:
• Which objects are selected and why.
• Experimental materials, raw data, programs, artifacts, specifications, code, whatever they need for running the experiment itself.
• Technology information to reproduce the experiment.
• Description of any tools that need to be purchased and any training required. Instructions must be written out or recorded properly.
• Forms and questionnaires for the participants to fill out, interview materials.
• All characteristics of the experimental material that might have an impact on the results.
• Experimental infrastructure that will be used by participants during the data collection, such as how many computers they will need and for how long.
Hint
All experimental materials and equipment should be described. For example, if the study involves a questionnaire, questions should be described, as should any other characteristics of the questionnaire [81].

Item 18: Are the tasks that will be performed by the participants described in detail?
Consider whether the following are fully described:
• The tasks to be performed by subjects.
• The scope of the tasks.
• The feature coverage of the new technology the tasks will exploit.
• Whether the experiment will be performed in a physical or virtual location.
• The origin of the tasks.
• The task duration, e.g., unlimited time to work on a task (allowing either the participant or the experimenter to decide when the task is complete) or a time limit.
• Task difficulty.
• Number of tasks needed for the experiment.
• The information the participants will receive before they start the tasks.
• How much training the participants will receive to be able to use the new technology.
• The rationale behind the selection of roles, artefacts, viewpoints, etc.
• Whether the description includes precisely what happens to the participants from the moment they arrive to the moment they leave.
• The procedures to prevent or minimize any potential risk or discomfort, for example, fatigue during the experiment, cultural differences, or language.
• If the experiment is a replication, the discussion of the adjustments and their rationales.

Item 19: Do the experimenters define success with respect to the experimental tasks and how success will be measured?
Consider whether the following are described adequately:
• The goal state that a participant must reach to be counted as successful.
• A method for determining when a goal state has been reached.
• A method of communicating the goal to participants that all participants will interpret similarly.

B.5 Category 5: Experimental Design

The experimental design category contains four items. It makes researchers check whether the chosen experiment design is the most appropriate, whether the treatments are well defined, whether the randomization is well described, and whether an appropriate blinding procedure should be applied to reduce bias.

Item 20: Is the experiment design the most appropriate?
Consider:
• Whether the research design is appropriate to address the aims of the research.
• General design principles: randomization, blocking, and balancing.
• Whether the experiment design is the most appropriate for the variables involved.
• Whether there is a justification for why the experimenters have chosen their experimental design.
Hint
• The choice of design should involve consideration of sample size (number of replicates), selection of a suitable run order for the experimental trials, and determination of whether or not blocking or other randomization restrictions are involved [149].
• For the description of the experimental design in the experimental plan, it is important that not only the final design of the experiment is included; there should also be an explanation of how the design was arrived at and why the experimenters have chosen that design and not a different one.

Item 21: Are the treatments well defined?
Consider:
• Whether the experimental treatments have been defined sufficiently precisely for them to be applied correctly by the experimenter or by those wishing to repeat the experiment.
• How realistic the treatments are.
• Alternatives or levels and treatment definitions.
• Whether the experiment is a within- or between-subjects design, or a mixed factors design, with a description of each of the levels of the independent variables.
• Whether there is a control group with which to compare treatments.
• Whether all treatment groups (including any control groups) are planned to be treated equivalently during the preparation for and conduct of the experiment.

Item 22: Do the experimenters define the process by which they will apply the treatment to objects and subjects (e.g. randomization)?
Consider whether:
• Treatments are randomly allocated.
• Participants are appropriately allocated to treatments given the number of participants and the overall experimental design.
• All measures for randomization are described, especially the random allocation of participants to treatments (one possible procedure is sketched below).
• The number of and relationships among subjects, objects, and variables are carefully described in the experimental plan.
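A minimal sketch of how such an allocation procedure could be made explicit and reproducible, assuming a simple balanced random assignment of a participant list to two treatments (the participant IDs, group names, and seed are illustrative):

    # Balanced random allocation of participants to two treatments (illustrative).
    import random

    participants = [f"P{i:02d}" for i in range(1, 21)]  # hypothetical participant IDs
    rng = random.Random(42)  # fixed seed so the allocation can be reproduced
    rng.shuffle(participants)

    half = len(participants) // 2
    allocation = {"treatment": participants[:half], "control": participants[half:]}
    for group, members in allocation.items():
        print(group, members)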

Item 23: Is an appropriate blinding procedure used (e.g. blind allocation of materials, blind marking)?
Consider whether:
• Lack of blinding could introduce bias.
• Investigators will be kept 'blind' to participants' exposure to the intervention.
• Investigators will be kept 'blind' to other important confounding and prognostic factors.
• The study participants will be aware of the research question.
• For any kind of blinding (e.g., blind allocation), the details are provided.

B.6 Category 6: Procedure

Four items are provided to review the procedure section, including an adequate description of the controlled experiment context, training, a pilot, and the timeline of the experiment.

Item 24: Is there an adequate description of the context in which the experiment will be carried out?
Consider whether:
• Where the experiment should be executed is described.
• The environment (location) of the experiment is representative for the study's objectives.

Item 25: Is there a description of any training that will be provided?
Consider:
• The description of training provided to the participants.
• Whether experimenters will provide training on how to use the new technology.
• Terminology of the new technology.
• The design of the programs they will work with during tasks.
• The decision of what to teach and what to have participants learn during the tasks.
Hint
The study should provide a way to teach the concepts and skills quickly and effectively and devise a way to ensure that the participants have successfully learned the material [17].

Item 26: Is a pilot described?
Hint
• Designing a study with human participants is necessarily an iterative process. Running an experiment for the first time, like testing software for the first time, will reveal a range of problems, which might include confusing study materials, bugs in the tool, confusion about the tasks, and unanticipated choices made by participants.
• Sandbox pilots and analytical evaluation are good options for pre-pilots because they are easy to schedule and can reveal problems with the experiment without the trouble of recruiting outsiders. Ko et al. [17] bring interesting tips about pilots and pre-pilots.
• If possible, a pilot of the experiment on a small set of people may be useful, so that you are sure that the plan is complete and the instructions understandable [45].

Item 27: Do experimenters describe the schedule in which the experiment will be run?
Consider:
• How many hours/days the experiment will be run.
• How experimenters have organized these days.
• Which activities experimenters will cover each day.
• The schedule for the experiment, and how long the experiment will take on each day.
• What events will happen during the experiments, in what order, and with what timing.
• How many times the experiment will be repeated.

B.7 Category 7: Data Collection and Data Analysis

This category includes four items regarding the data collection and analysis procedures. This section also presents concerns about the statistical methods.

Item 28: Are the data collection procedures well described?
Consider whether:
• The data collection is planned in a way that addresses the research issue.
• The data collection methods are adequately described.
• The data collection procedures are sufficient for the purpose (data sources, collection, storage, validation).
• Any quality control method that will be used to ensure completeness and accuracy of data collection is described.
Hint
Details of the data collection method have to be described, including when the data will be collected, by whom, and with what kind of support (e.g., tool). Any type of transformation of the data (e.g., marking "true" defects in defect lists) and training provided for such should also be described [81].

Item 29: Are the analysis procedures clearly described?
Consider:
• How experimenters are going to analyze the data they will obtain.
• The description of the analysis procedure, detailing which methods will be used to test the hypotheses in analyzing the data.
• The types of analysis.

Item 30: Are the statistical methods described?
Consider:
• The statistical context and methods applied.
• Whether the statistical methods are appropriate for the study design.
• The rationale and justification for the statistical methods (a sketch of one such rationale follows below).
• If the results were not analyzed statistically, whether statistical analysis could have provided additional descriptive and analytical insight.
• Whether references are cited for all statistical procedures used.
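As a minimal sketch of how such a rationale can be made concrete in an analysis script, assuming two independent samples and the Python SciPy library (the data values and the normality threshold are illustrative):

    # Select a parametric or nonparametric test based on normality checks
    # of each group (illustrative data).
    from scipy import stats

    group_a = [12.1, 14.3, 11.8, 13.5, 12.9, 15.0, 13.2, 12.4]
    group_b = [10.2, 11.1, 9.8, 12.0, 10.7, 11.5, 10.1, 9.9]

    _, p_a = stats.shapiro(group_a)
    _, p_b = stats.shapiro(group_b)

    if p_a > 0.05 and p_b > 0.05:
        # Both samples look approximately normal: independent-samples t-test.
        result = stats.ttest_ind(group_a, group_b)
    else:
        # Otherwise fall back to the nonparametric Mann-Whitney U test.
        result = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')
    print(result)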

Item 31: How precise is the estimate of the treatment effect?
Consider:
• How experimenters will interpret the possible outcomes of the experiment.
• The confidence limits (a sketch of one such computation follows below).
• Whether potential confounders are adequately controlled for in the analysis.
• How it is ensured that the data do not violate the assumptions of the tests used on them.
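A minimal sketch of reporting such precision, computing a 95% confidence interval for the difference of two group means under a normality and equal-variance assumption (the data values are the same hypothetical ones used above):

    # 95% confidence interval for the difference of two independent means
    # (illustrative; assumes roughly normal data and equal variances).
    import numpy as np
    from scipy import stats

    group_a = np.array([12.1, 14.3, 11.8, 13.5, 12.9, 15.0, 13.2, 12.4])
    group_b = np.array([10.2, 11.1, 9.8, 12.0, 10.7, 11.5, 10.1, 9.9])

    diff = group_a.mean() - group_b.mean()
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance and standard error of the difference.
    sp2 = ((n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1)) / (n_a + n_b - 2)
    se = np.sqrt(sp2 * (1 / n_a + 1 / n_b))
    t_crit = stats.t.ppf(0.975, df=n_a + n_b - 2)
    print(f"Effect: {diff:.2f}, 95% CI: [{diff - t_crit * se:.2f}, {diff + t_crit * se:.2f}]")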

B.8 Category 8: Threats to Validity

This section consists of one item about threats to validity. This item helps researchers to check whether the experimental plan describes threats to validity, study limitations, and potential biases or confounders that may influence the experiment results.

Item 32: Do the experimenters identify and discuss threats to validity, study limitations, and potential biases or confounders that may influence the experiment results?
Consider:
• Whether mention is made of the threats to validity in the experimental plan and also how these threats can affect the results and findings.
• Whether the experimenters discuss the limitations of their study.
• Whether the experimenters discuss potential experiment bias.
• Whether the experimenters report the rationale of their decisions in terms of how they balanced different threats to validity.
Hint
A fundamental question concerning results from an experiment is how valid the results are. It is important to consider the question of validity already in the planning phase in order to plan for adequate validity of the experiment results. Adequate validity means that the results should be valid for the population of interest [21].

B.9 Category 9: Document

This category also contains one item. It is focused on the general writing of the experimental plan regarding its suitability for its audience and its readability.

Item 33: Is the experimental plan suitable for its audience, easy to read, and well structured?
Consider whether:
• The terms are defined in such a way that it is possible to replicate the study.
• The experiment addresses a clearly focused issue.


C Checklists and Guidelines Items

C.1 Checklists Items

C.1.1 Classification: Goal Definition


C.1.2 Classification: Research Question


C.1.3 Classification: Metrics and Measurement


C.1.4 Classification: Context Selection

C.1.5 Classification: Hypotheses Formulation


C.1.6 Classification: Parameters and Variables

C.1.7 Classification: Participants


C.1.8 Classification: Group Assignment

C.1.9 Classification: Experimental Materials and Tasks


C.1.10 Classification: Experimental Design

C.1.11 Classification: Procedure


C.1.12 Classification: Data Collection


C.1.13 Classification: Analysis Procedures


C.1.14 Classification: Threats to Validity

C.1.15 Classification: Document Structure

C.2 Guidelines Items

C.2.1 Classification: Goal Definition


C.2.2 Classification: Research Questions

C.2.3 Classification: Metrics and Measurements


C.2.4 Classification: Context Selection and Hypotheses Formulation

C.2.5 Classification: Parameters and Variables


C.2.6 Classification: Participants


C.2.7 Classification: Group Assignment


C.2.8 Classification: Experimental Materials

C.2.9 Classification: Tasks


C.2.10 Classification: Experiment Design


C.2.11 Classification: Procedure

C.2.12 Classification: Data Collection and Analysis Procedure


C.2.13 Classification: Threats to Validity


D Experimental Websites

D.1 Experimental Website - Study 1

Figure D.1: Abstract - Study 1

Figure D.2: Instructions - Study 1


Figure D.3: Option 1 - Study 1

Figure D.4: Option 2 - Study 1


Figure D.5: Instrument - Study 1

Figure D.6: Feedback - Study 1

D.2 Experimental Website - Study 2

Figure D.7: Abstract - Study 2


Figure D.8: Instructions - Study 2

Figure D.9: Dry Run - Study 2


Figure D.10: Study 2

Figure D.11: Experimental Plans - Study 2


Figure D.12: Instrument - Study 2

Figure D.13: Feedback - Study 2

D.3 Experimental Website - Study 3

Figure D.14: Welcome - Study 3


Figure D.15: Agenda - Study 3

Figure D.16: Dry Run Treatment 1 - Study 3


Figure D.17: Dry Run Treatment 2 - Study 3

Figure D.18: Treatment 1 A - Study 3


Figure D.19: Treatment 1 B - Study 3

Figure D.20: Treatment 2 A - Study 3


Figure D.21: Treatment 2 B - Study 3

Figure D.22: Experimental Plans - Study 3

Figure D.23: Feedback - Study 3


D.4 Experimental Website - Study 4

Figure D.24: Welcome - Study 4 A

Figure D.25: Welcome - Study 4 B

Figure D.26: Instructions - Study 4 A


Figure D.27: Instructions - Study 4 B

Figure D.28: Treatment 1 A - Study 4


Figure D.29: Treatment 2 A - Study 4

Figure D.30: Treatment 1 B - Study 4


Figure D.31: Treatment 2 B - Study 4

Figure D.32: Experimental Plans - Study 4 A

Figure D.33: Experimental Plans - Study 4 B


E Raw Data

E.1 Study 1

E.1.1 Instrument’s acceptance Raw Data - Study 1

Tables E.1, E.2, E.3, and E.4 show the raw data from the researchers in Study 1.

Table E.1: Raw Data Study 1: Values from Instrument’s Acceptance: Fitness for Purpose

Items                                                               P1   P2
1. The instrument supports experimenters in assessing the
   completeness of the experimental plan.                            5    5
2. The instrument identifies potential biases that were not
   identified at the beginning.                                      3    4
3. The instrument is useful to inexperienced experimenters.          5    4
4. The instrument is of value to experienced experimenters.          5    4

Table E.2: Raw Data Study 1: Values from Instrument’s Acceptance: Item’s Appropriateness

Items                                                               P1   P2
5. I find the questions of the instrument adequate.                  4    4
6. I find the recommendations (things to consider and hints)
   helpful.                                                          4    4
7. I find the set of questions to be complete.                       4    4
8. I find the number of questions adequate.                          4    2
9. I find the order of questions adequate.                           5    4


Table E.3: Raw Data Study 1: Values from Instrument’s Acceptance: Perceived Usefulness

Items                                                               P1   P2
10. Using the instrument would give me greater control over
    my experimental planning.                                        5    5
11. The instrument would help me to complete my review in
    a reasonable amount of time.                                     5    2
12. The instrument supports critical aspects of my experimental
    plan.                                                            5    3
13. I find that this instrument would be useful for reviewing
    experimental plans.                                              5    4
14. I would recommend this instrument to my colleagues and
    friends.                                                         5    3

Table E.4: Raw Data Study 1: Values from Instrument’s Acceptance: Perceived Ease of Use

Items                                                               P1   P2
15. The instructions for using the instrument are clear.             5    4
16. It is easy for me to remember how to use the instrument.         3    4
17. The instrument provides helpful guidance in reviewing an
    experiment.                                                      4    3
18. I find the instrument easy to use.                               3    4
19. I am able to efficiently complete my review using this
    instrument.                                                      4    4

E.1.2 Raw Data - Study 1

Table E.5 presents the data collected in the assessment of Study 1. We also collected comments on the items that the participants had trouble understanding; these comments are shown in Table E.6.


Table E.5: Instrument Validation 1

Participant   Rater 1                         Rater 2
Option        Option 1                        Option 1
Item 1        I do not find the item useful   I find the item useful
Item 2        I find the item useful          I find the item useful
Item 3        I find the item useful          I find the item useful
Item 4        I find the item useful          I have trouble understanding
Item 5        I find the item useful          I have trouble understanding
Item 6        I find the item useful          I find the item useful
Item 7        I find the item useful          I find the item useful
Item 8        I find the item useful          I find the item useful
Item 9        I find the item useful          I find the item useful
Item 10       I find the item useful          I find the item useful
Item 11       I find the item useful          I have trouble understanding
Item 12       I find the item useful          I find the item useful
Item 13       I find the item useful          I find the item useful
Item 14       I find the item useful          I have trouble understanding
Item 15       I find the item useful          I have trouble understanding
Item 16       I find the item useful          I find the item useful
Item 17       I find the item useful          I find the item useful
Item 18       I find the item useful          I find the item useful
Item 19       I find the item useful          I find the item useful
Item 20       I find the item useful          I find the item useful
Item 21       I find the item useful          I find the item useful
Item 22       I find the item useful          I find the item useful
Item 23       I find the item useful          I find the item useful
Item 24       I do not find the item useful   I find the item useful
Item 25       I find the item useful          I find the item useful
Item 26       I find the item useful          I find the item useful
Item 27       I find the item useful          I find the item useful
Item 28       I find the item useful          I find the item useful
Item 29       I find the item useful          I find the item useful
Item 30       I find the item useful          I find the item useful
Item 31       I find the item useful          I find the item useful
Item 32       I find the item useful          I have trouble understanding
Item 33       I find the item useful          I find the item useful


Table E.6: Comments on items that participants had trouble understanding

Items     Comments
Item 4    The term "all relevant information" is misleading. The problem is in understanding what "all" covers.
Item 5    The problem I always have is that the hypotheses are described before the variables are discussed. Therefore, I cannot use the variables in my hypotheses. This makes the hypotheses not formal.
Item 11   I believe "demographic" is too general. You could provide a set of fields used to capture demographic information. If these fields were standardized, then results aggregation would be much easier, more reliable, and more powerful.
Item 14   The impact of this field is probably overestimated. If you need to cut somewhere, this is a good place.
Item 15   I believe this overlaps with a previous field.
Item 32   As I said in the interview, it is important to report the tradeoffs. Authors should report the rationale of their decisions in terms of how they balanced different threats to validity.

E.2 Study 2

E.2.1 Instrument’s acceptance Raw Data - Study 2

Tables E.7, E.8, E.9, and E.10 show the raw data from the researchers in Study 2.


Table E.7: Raw Data Study 2: Values from Instrument’s Acceptance: Fitness for Purpose

Items                                                               B1   B2   E1   E2
1. The instrument supports experimenters in assessing the
   completeness of the experimental plan.                            5    4    4    5
2. The instrument identifies potential biases that were not
   identified at the beginning.                                      5    3    4    5
3. The instrument is useful to inexperienced experimenters.          5    5    4    5
4. The instrument is of value to experienced experimenters.          5    4    3    4

Table E.8: Raw Data Study 2: Values from Instrument’s Acceptance: Item’s Appropriateness

Items                                                               B1   B2   E1   E2
5. I find the questions of the instrument adequate.                  4    4    5    5
6. I find the recommendations (things to consider and hints)
   helpful.                                                          5    2    4    5
7. I find the set of questions to be complete.                       5    2    4    5
8. I find the number of questions adequate.                          4    3    3    5
9. I find the order of questions adequate.                           5    4    4    4


Table E.9: Raw Data Study 2: Values from Instrument’s Acceptance: Perceived Usefulness

Items                                                               B1   B2   E1   E2
10. Using the instrument would give me greater control over
    my experimental planning.                                        5    4    4    4
11. The instrument would help me to complete my review in a
    reasonable amount of time.                                       4    5    3    5
12. The instrument supports critical aspects of my experimental
    plan.                                                            5    4    4    5
13. I find that this instrument would be useful for reviewing
    experimental plans.                                              5    5    4    5
14. I would recommend this instrument to my colleagues and
    friends.                                                         5    4    4    5

Table E.10: Raw Data Study 2: Values from Instrument’s Acceptance: Perceived Ease of Use

Items                                                               B1   B2   E1   E2
15. The instructions for using the instrument are clear.             5    3    4    4
16. It is easy for me to remember how to use the instrument.         5    4    4    5
17. The instrument provides helpful guidance in reviewing an
    experiment.                                                      5    4    4    5
18. I find the instrument easy to use.                               4    4    4    4
19. I am able to efficiently complete my review using this
    instrument.                                                      5    4    4    4

E.2.2 Assessment of the Experimental Plans by Researchers - Raw Data

Figure E.1: Assessment of the Experimental Plans by Researchers - Raw Data Study 2

E.2.3 Completeness Score from Researchers - Raw Data

To calculate the overall completeness scores of the experimental plans, the answers were coded as Yes = 1, Partially = 0.5, and No = 0.
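For illustration, this coding rule can be applied in R as follows (a minimal sketch; the answers vector is hypothetical, and the use of the mean of the coded items as the overall score is an assumption made for this example):

Script in R
# Coding rule: Yes = 1, Partially = 0.5, No = 0.
codes <- c(Yes = 1, Partially = 0.5, No = 0)
# Hypothetical answers to four checklist items (not actual study data):
answers <- c("Yes", "Partially", "No", "Yes")
# Assumed aggregation: overall completeness score as the mean of the coded items.
score <- mean(codes[answers])
score   # 0.625 for this example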


Figure E.2: Completeness Score from Researchers - Raw Data Study 2

E.2.4 Inter-Rater Agreement - Raw Data

Figure E.3: Agreement - Raw Data Study 2


E.3 Study 3

E.3.1 Instrument’s acceptance Raw Data - Study 3

Figure E.4 shows the raw data of Study 3.

Figure E.4: Raw Data Study 3

E.3.2 Items Identified Correctly Raw Data - Study 3

Figure E.5: Items Identified Correctly Raw Data Study 3

E.4 Study 4

E.4.1 Instrument’s acceptance Raw Data - Study 4

Figure E.6 shows the raw data of Study 4.


Figure E.6: Raw Data Study 4

E.4.2 Items Identified Correctly Raw Data - Study 4

Figure E.7: Items Identified Correctly Raw Data Study 4


F Data Analysis - Scripts

F.1 Scripts - Study 2

F.1.0.1 Difference between the completeness mean scores from Beginner and Expert Researchers

This section describes in detail the steps we went through to determine whether there is a statistically significant difference between the means from beginners and experts.

1. We tested whether the variables (the means of the completeness scores for EP_1, EP_2, and EP_3 given by beginners and experts) are normally distributed, using the Shapiro-Wilk test.

Hypothesis testing

H0: the population is normally distributed
H1: the population is not normally distributed
Alpha level: 0.05
The p-value must be less than the chosen alpha level for the null hypothesis to be rejected.

Script in R

> b = c(0.59, 0.28, 0.42)
> e = c(0.52, 0.28, 0.65)
> shapiro.test(b)

        Shapiro-Wilk normality test

data:  b
W = 0.99689, p-value = 0.8934

> shapiro.test(e)

        Shapiro-Wilk normality test

data:  e
W = 0.97138, p-value = 0.6753


Interpretation

The p-values (0.8934 and 0.6753) are greater than the chosen alpha level (0.05), so the null hypothesis that the data came from a normally distributed population cannot be rejected.

2. We tested the equality of variances using the F test to compare two variances, because we did not reject the null hypothesis that the data came from a normal distribution.

Hypothesis testing

H0: The population variances are equal.
H1: The population variances are not equal.
Alpha level: 0.05
The p-value must be less than the chosen alpha level for the null hypothesis to be rejected.

Script in R
> var.test(b, e)

        F test to compare two variances

data:  b and e
F = 0.68401, num df = 2, denom df = 2, p-value = 0.8124
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.01753875 26.67644276
sample estimates:
ratio of variances
         0.6840114

Interpretation

The p-value (0.8124) is greater than the chosen alpha level (0.05), so the null hypothesis of equal variances cannot be rejected. We conclude that there is no evidence of a difference between the population variances.

3. We used the parametric t-test (assuming equal variances) to test whether there is a statistically significant difference between the means from beginners and experts.

Hypothesis testing

H0: There is no statistically significant difference between the means.
H1: There is a statistically significant difference between the means.
Alpha level: 0.05

Script in R


> t.test(b, e, var.equal = TRUE)

        Two Sample t-test

data:  b and e
t = -0.37924, df = 4, p-value = 0.7238
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.4437945  0.3371278
sample estimates:
mean of x mean of y
0.4300000 0.4833333

Interpretation

The p-value (0.7238) is greater than the chosen alpha level (0.05), so the null hypothesis cannot be rejected. We conclude that there is no difference between the means: the mean for beginners is 0.43 and the mean for experts is 0.48. Using the proposed instrument, beginners therefore assessed the completeness of the experimental plans in a manner similar to experts in Experimental Software Engineering.

F.1.0.2 Inter-rater agreement between raters with similar expertise: beginner and expert researchers

This section presents the inter-rater agreement between beginner and expert researchers.

1. Beginners

This section shows the detailed steps we went through to determine whether there is proportional bias in the agreement measurement, that is, whether there is a level of agreement between the two beginner researchers.

(a) Data

B1 = (0.50, 0.28, 0.47)
B2 = (0.68, 0.28, 0.37)
Mean B1B2 = (0.59, 0.28, 0.42)

(b) Calculating the differences between the beginner researchers' scores

Script in R
> diff = c(B1 - B2)
> diff
[1] -0.18  0.00  0.10


(c) Testing whether the completeness score differences of the two beginner researchers are normally distributed, using the Shapiro-Wilk test.

i. Hypothesis testing
H0: The population is normally distributed.
H1: The population is not normally distributed.
The p-value must be less than the chosen alpha level for the null hypothesis to be rejected.

ii. Script in R
> diff = c(-0.18, 0.00, 0.10)
> shapiro.test(diff)

        Shapiro-Wilk normality test

data:  diff
W = 0.97351, p-value = 0.6878

iii. Interpretation
The p-value (0.6878) is greater than the chosen alpha level (0.05), so the null hypothesis that the data came from a normally distributed population cannot be rejected. An assumption of the Bland-Altman limits of agreement is that the differences are normally distributed.

(d) Determining whether there is a significant difference between the two measurements

i. Hypothesis testing
H0: The mean of the differences between the two measurements is equal to zero.
H1: The mean of the differences between the two measurements is not equal to zero.

ii. Script in R
> t.test(diff)

        One Sample t-test

data:  diff
t = -0.32552, df = 2, p-value = 0.7757
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -0.3791459  0.3258125
sample estimates:
  mean of x
-0.02666667


> sd(diff)
[1] 0.141892

iii. Interpretation
The p-value (0.7757) is greater than the chosen alpha level (0.05), so the null hypothesis cannot be rejected. We conclude that there is no statistically significant difference between the two measurements; therefore, there is a certain level of agreement between them.

(e) Constructing the Bland-Altman plot

i. Constructing a basic scatterplot

Script in SPSS
GRAPH
  /SCATTERPLOT(BIVAR)=Mean WITH diff
  /MISSING=LISTWISE.

ii. Calculating the upper and lower limits
Upper limit = Mean + (SD * 1.96) = (-0.02667) + (0.2781) = 0.25
Lower limit = Mean - (SD * 1.96) = (-0.02667) - (0.2781) = -0.30

iii. Bland-Altman plot: see Figure F.1.


Figure F.1: Bland Altman Plot - Beginners
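The limits above can also be reproduced in R (a minimal sketch using base-R graphics; the thesis used SPSS for the scatterplot, so this is an illustrative alternative rather than the original script):

Script in R
# Bland-Altman computation for the beginner raters' completeness scores.
B1 <- c(0.50, 0.28, 0.47)
B2 <- c(0.68, 0.28, 0.37)
d <- B1 - B2                   # differences between the two raters
m <- (B1 + B2) / 2             # means of the two raters
bias  <- mean(d)               # about -0.027
upper <- bias + 1.96 * sd(d)   # about  0.25
lower <- bias - 1.96 * sd(d)   # about -0.30
plot(m, d, xlab = "Mean of the two raters", ylab = "Difference (B1 - B2)",
     ylim = range(c(d, upper, lower)))
abline(h = c(bias, upper, lower), lty = c(1, 2, 2))   # bias and 95% limits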

(f) Linear regression

i. Hypothesis testing
H0: There is a level of agreement between the two measurements.
H1: There is no level of agreement between the two measurements.

ii. Script in SPSS
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT diff
  /METHOD=ENTER Mean.

iii. Interpretation
The p-value (0.527) is greater than the chosen alpha level (0.05), so the null hypothesis cannot be rejected. We conclude that there is a certain level of agreement between these two measurements. See Figure F.2.


Figure F.2: Linear regression - Beginners
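The same proportional-bias check can be run in R by regressing the differences on the means (an illustrative sketch; the thesis ran this step in SPSS):

Script in R
# Using d (differences) and m (means) from the Bland-Altman sketch above:
fit <- lm(d ~ m)   # the slope tests whether differences depend on magnitude
summary(fit)       # a non-significant slope indicates no proportional bias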

Regarding the completeness score, the mean of the differences between the beginner researchers is not statistically significantly different from zero (M = -0.0267, SD = 0.1419, t = -0.32552, p-value = 0.7757). The regression also indicates a certain level of agreement between the two measurements (p-value = 0.527).

2. Experts

This section shows the detailed steps we went through to determine whether there is proportional bias in the agreement measurement, that is, whether there is a level of agreement between the two expert researchers.

(a) Data

E1 = (0.53, 0.29, 0.67)
E2 = (0.50, 0.27, 0.62)
Mean E1E2 = (0.515, 0.28, 0.645)


(b) Calculating the differences between the expert researchers' scores

Script in R
> diff = c(E1 - E2)
> diff
[1] 0.03 0.02 0.05

(c) Testing whether the completeness score differences of the two expert researchers are normally distributed, using the Shapiro-Wilk test.

i. Hypothesis testing
H0: The population is normally distributed.
H1: The population is not normally distributed.
The p-value must be less than the chosen alpha level for the null hypothesis to be rejected.

ii. Script in R
> diff = c(0.03, 0.02, 0.05)
> shapiro.test(diff)

        Shapiro-Wilk normality test

data:  diff
W = 0.96429, p-value = 0.6369

iii. Interpretation
The p-value (0.6369) is greater than the chosen alpha level (0.05), so the null hypothesis that the data came from a normally distributed population cannot be rejected. An assumption of the Bland-Altman limits of agreement is that the differences are normally distributed.

(d) Determining whether there is a significant difference between the two measurements

i. Hypothesis testing
H0: The mean of the differences between the two measurements is equal to zero.
H1: The mean of the differences between the two measurements is not equal to zero.

ii. Script in R
> t.test(diff)

        One Sample t-test

data:  diff
t = 3.7796, df = 2, p-value = 0.06341
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -0.004612497  0.071279164
sample estimates:
 mean of x
0.03333333

> sd(diff)
[1] 0.01527525

iii. Interpretation
The p-value (0.06341) is greater than the chosen alpha level (0.05), so the null hypothesis cannot be rejected. We conclude that there is no statistically significant difference between the two measurements; therefore, there is a certain level of agreement between them.

(e) Constructing the Bland-Altman plot

i. Constructing a basic scatterplot

Script in SPSS
GRAPH
  /SCATTERPLOT(BIVAR)=Mean WITH diff
  /MISSING=LISTWISE.

ii. Calculating the upper and lower limits
Upper limit = Mean + (SD * 1.96) = (0.033) + (0.030) = 0.063
Lower limit = Mean - (SD * 1.96) = (0.033) - (0.030) = 0.003

iii. Bland-Altman plot: see Figure F.3.


Figure F.3: Bland Altman Plot - Experts

(f) Linear regression

i. Hypothesis testing
H0: There is a level of agreement between the two measurements.
H1: There is no level of agreement between the two measurements.

ii. Script in SPSS
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT diff
  /METHOD=ENTER Mean.

iii. Interpretation
The p-value (0.226) is greater than the chosen alpha level (0.05), so the null hypothesis cannot be rejected. We conclude that there is a certain level of agreement between these two measurements. See Figure F.4.


Figure F.4: Linear regression - Experts

Regarding the completeness score, the mean of the differences between the expert researchers is not statistically significantly different from zero (M = 0.033, SD = 0.0153, t = 3.7796, p-value = 0.06341). The regression also indicates a certain level of agreement between the two measurements (p-value = 0.226).

F.1.0.3 Inter-Rater Agreement among Four Researchers

Table F.1 shows the average deviation (AD_Mdn) indices achieved by the four researchers appraising each experimental plan on the completeness score.

Table F.1: Average Deviation Indices

Experimental Plan   AD(Mdn)
EP_1                0.4886364
EP_2                0.4583333
EP_3                0.4166667
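In formula form, for a J x K matrix of scores (rows are the rated units, columns are the raters), the average deviation computed by the R function presented below is

$$\mathrm{AD} = \frac{1}{J} \sum_{j=1}^{J} \frac{1}{K} \sum_{i=1}^{K} \left| x_{ji} - \bar{x}_{j} \right|,$$

where $x_{ji}$ is rater $i$'s score on unit $j$ and $\bar{x}_{j}$ is the mean of row $j$. (This restates the code for clarity; note that the function uses the row mean, not the median, as the central tendency.)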


To calculate the average deviation (AD_Mdn) among the four researchers, we created a function in R. The script is presented below:

Average Deviation Function - Script in R
AD <- function(dados) {
  J <- nrow(dados)   # number of rows (rated units)
  K <- ncol(dados)   # number of columns (raters)
  ad <- 0
  for (j in 1:J) {
    ad_j <- 0
    for (i in 1:K) {
      # absolute deviation of each rating from its row mean
      ad_j <- ad_j + abs(dados[j, i] - mean(dados[j, ]))
    }
    ad_j <- ad_j / K
    ad <- ad + ad_j
  }
  ad / J   # average deviation over all rows
}

We then ran the function in R to calculate the average deviation (AD_Mdn). The script is shown below.

Average Deviation (AD_Mdn) - Script in R
> data = data.matrix(read.csv("/Users/lilianefonseca/Desktop/Agree_EP_1.csv", header = F))
> aad = AD(data)
> aad
       V1
0.4886364

> data = data.matrix(read.csv("/Users/lilianefonseca/Desktop/Agree_EP_2.csv", header = F))
> aad = AD(data)
> aad
       V1
0.4583333

> data = data.matrix(read.csv("/Users/lilianefonseca/Desktop/Agree_EP_3.csv", header = F))
> aad = AD(data)
> aad
       V1
0.4166667


The four researchers achieved acceptable agreement (AD_Mdn < 0.50) on the overall completeness score. The contents of the files (e.g., "/Users/lilianefonseca/Desktop/Agree_EP_1.csv") are presented in Appendix E.

F.1.0.4 Inter-Rater Reliability

Table F.2 shows the intraclass correlation coefficients (ICCs) for the completeness score for the two beginners, the two experts, and all four researchers together.

Table F.2: Inter-rater reliability of the instrument

Researchers   ICC Value   Interpretation
Beginners     0.791       Substantial
Experts       0.998       Almost Perfect
All           0.875       Almost Perfect

For the pairs of researchers with similar expertise, the instrument has substantial reliability (0.791) for the overall completeness scores given by beginner researchers and almost perfect reliability (0.998) for those given by expert researchers.

Considering the four raters together, the instrument has almost perfect reliability (ICC = 0.875) for the overall completeness score of the experimental plans, which means that researchers, both beginners and experts, ranked the experimental plans in a similar manner.
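For reference, ICCs of this kind can be computed in R with the icc() function from the irr package (a sketch under assumptions: the thesis does not state which tool or which ICC model and type were used, so the two-way, agreement, single-rater variant below is illustrative):

Script in R
# install.packages("irr")   # if the package is not yet installed
library(irr)
# Completeness scores for the three plans (rows) by the two beginners (columns).
scores <- cbind(B1 = c(0.50, 0.28, 0.47),
                B2 = c(0.68, 0.28, 0.37))
icc(scores, model = "twoway", type = "agreement", unit = "single")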

F.1.0.5 Criterion Validity

Table F.3 shows the values given by the participants regarding the overall completeness score (A), whether the experiment should proceed (B), and whether, if the experiment proceeds, it is likely to be successful (C).

Table F.3: Criterion Validity Values

                   EP_1               EP_2               EP_3
               A     B*    C*     A     B*    C*     A     B*    C*
Beginner 1   0.50    3     3    0.28    2     2    0.47    3     3
Beginner 2   0.68    4     4    0.28    2     2    0.37    3     3
Expert 1     0.53    3     3    0.29    2     2    0.67    3     3
Expert 2     0.50    3     3    0.27    2     2    0.62    4     4
Mean         0.55   3.25  3.25  0.28    2     2    0.53   3.25  3.00

* Five-point rating scale from 1 (Strongly disagree) to 5 (Strongly agree)
A = Overall completeness score
B = Should the experiment proceed?
C = If the experiment proceeds, is it likely to be successful?


To calculate the correlation coefficients, we used the R script below:

Script in R
> ep <- c(0.55, 0.28, 0.53)
> rec <- c(3.25, 2, 3.25)
> a <- cbind(ep, rec)
> a
       ep  rec
[1,] 0.55 3.25
[2,] 0.28 2.00
[3,] 0.53 3.25
> cor(a, method = "kendall", use = "pairwise")
           ep       rec
ep  1.0000000 0.8164966
rec 0.8164966 1.0000000

We calculated correlation coefficients between the mean scores of A and B, and between those of A and C. Because the values given by the participants for variables B and C were the same, the results observed for variable B are the same as for C. We found a strong correlation between the overall completeness scores and the recommendation on whether the experiment should proceed (τB = 0.816, p < 0.2), and between the overall completeness scores and the recommendation on whether, if the experiment proceeds, it is likely to be successful (τC = 0.816, p < 0.2).

F.2 Scripts - Study 3

In the following, we describe in detail the steps we went through to determine whether there is a statistically significant difference between the proportions of items identified correctly from PU and PT.

1. We tested whether the variables (the proportions of items identified correctly for EP_1 and EP_2 from the participants' lists) are normally distributed, using the Shapiro-Wilk test.

Hypothesis testing

H0: the population is normally distributed
H1: the population is not normally distributed
Alpha level: 0.05
The p-value must be less than the chosen alpha level for the null hypothesis to be rejected.


Script in R

> PU = c(0.17, 0.13, 0.17, 0.17, 0.19, 0.23, 0.23)
> shapiro.test(PU)

        Shapiro-Wilk normality test

data:  PU
W = 0.89318, p-value = 0.2917

Interpretation

The p-value (0.2917) is greater than the chosen alpha level (0.05), so the null hypothesis that the data came from a normally distributed population cannot be rejected.

> PT = c(1, 0.96, 0.96, 0.77, 0.77, 0.65, 0.85)
> shapiro.test(PT)

        Shapiro-Wilk normality test

data:  PT
W = 0.91964, p-value = 0.4666

Interpretation

The p-value (0.4666) is greater than the chosen alpha level (0.05), so the null hypothesis that the data came from a normally distributed population cannot be rejected.

2. We tested the equality of variances using the F test to compare two variances, because we did not reject the null hypothesis that the data came from a normal distribution.

Hypothesis testing

H0: The population variances are equal.
H1: The population variances are not equal.
Alpha level: 0.05
The p-value must be less than the chosen alpha level for the null hypothesis to be rejected.

Script in R

> PU = c(0.17, 0.13, 0.17, 0.17, 0.19, 0.23, 0.23)
> PT = c(1, 0.96, 0.96, 0.77, 0.77, 0.65, 0.85)
> var.test(PU, PT)


        F test to compare two variances

data:  PU and PT
F = 0.078116, num df = 6, denom df = 6, p-value = 0.006805
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.01342256 0.45461625
sample estimates:
ratio of variances
        0.07811603

Interpretation

The p-value (0.006805) is smaller than the chosen alpha level (0.05), so the null hypothesis of equal variances is rejected. We conclude that the population variances are different.

3. We used Welch's t-test, which does not assume equal variances, to test whether there is a statistically significant difference between the variables PU and PT.

Hypothesis testing

H0: There is no statistically significant difference between the variables PU and PT.
H1: There is a statistically significant difference between the variables PU and PT.
Alpha level: 0.05
The p-value must be less than the chosen alpha level for the null hypothesis to be rejected.

Script in R

> t.test(PU, PT, var.equal = FALSE)

        Welch Two Sample t-test

data:  PU and PT
t = -13.202, df = 6.9317, p-value = 3.632e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.7868772 -0.5474085
sample estimates:
mean of x mean of y
0.1842857 0.8514286

Interpretation

The null hypothesis is rejected because the p-value (3.632e-06) is less than the chosen alpha level (0.05). We found a statistically significant difference between the variables PU and PT: the mean proportion of correctly identified items in PT is greater than in PU.

F.3 Scripts - Study 4

In the following, we describe in detail the steps we went through to determine whether there is a statistically significant difference between the proportions of items identified correctly from PU and PT.

1. We tested whether the variables (the proportions of items identified correctly for EP_1 and EP_2 from the participants' lists) are normally distributed, using the Shapiro-Wilk test.

Hypothesis testing

H0: the population is normally distributed
H1: the population is not normally distributed
Alpha level: 0.05
The p-value must be less than the chosen alpha level for the null hypothesis to be rejected.

Script in R
> PU = c(0.04, 0.21, 0.08, 0, 0.22, 0.13, 0.08, 0.21, 0.17, 0.39, 0, 0, 0.08, 0.21, 0.04, 0.08, 0.15, 0.12, 0.08, 0, 0.08, 0.08)
> shapiro.test(PU)

        Shapiro-Wilk normality test

data:  PU
W = 0.88488, p-value = 0.01499

Interpretation

The p-value (0.01499) is less than the chosen alpha level (0.05), so the null hypothesis is rejected; there is evidence that the tested data (PU) did not come from a normally distributed population.

> PT = c(0.57, 0.57, 0.57, 0.74, 0.83, 0.91, 0.91, 0.52, 0.77, 0.54, 0.50, 0.81, 0.58, 0.62, 0.92, 1, 0.96, 0.38, 0.08, 0.81, 0.46, 0.23)
> shapiro.test(PT)

        Shapiro-Wilk normality test

data:  PT
W = 0.94873, p-value = 0.2979

Interpretation

The p-value (0.2979) is greater than the chosen alpha level (0.05), so the null hypothesis that the data came from a normally distributed population cannot be rejected.

2. We used a non-parametric test (the Wilcoxon rank-sum test) to test whether there is a statistically significant difference between the variables PU and PT, because PU is not normally distributed.

Hypothesis testing

H0: There is no statistically significant difference between the variables PU and PT.
H1: There is a statistically significant difference between the variables PU and PT.
Alpha level: 0.05
The p-value must be less than the chosen alpha level for the null hypothesis to be rejected.

Script in R

> wilcox.test(PU, PT)

        Wilcoxon rank sum test with continuity correction

data:  PU and PT
W = 14.5, p-value = 8.888e-08
alternative hypothesis: true location shift is not equal to 0
Warning message:
In wilcox.test.default(PU, PT) : cannot compute exact p-value with ties

> mean(PU)
[1] 0.1113636
> mean(PT)
[1] 0.6490909

Interpretation

The null hypothesis is rejected because the p-value (8.888e-08) is less than the chosen alpha level (0.05). We found a statistically significant difference between the variables PU and PT: the mean proportion of correctly identified items in PT (0.649) is greater than in PU (0.111).


F.4 Scripts - Instrument’s acceptance

F.4.1 Appropriateness - To what extent do evaluators believe that the instrument is appropriate for reviewing experimental plans for controlled experiments using participants in Software Engineering?

Study 2 Results

Figure F.5: Study 2 - Instrument’s Acceptance: Fitness for Purpose


Figure F.6: Study 2 - Instrument’s Acceptance: Item’s Appropriateness

Study 3 Results


Figure F.7: Study 3 - Instrument’s Acceptance: Fitness for Purpose

Figure F.8: Study 3 - Instrument’s Acceptance: Item’s Appropriateness

Study 4 Results


Figure F.9: Study 4 - Instrument’s Acceptance: Fitness for Purpose

Figure F.10: Study 4 - Instrument’s Acceptance: Item’s Appropriateness


F.4.2 Perceived usefulness - To what extent do evaluators believe that using the instrument would enhance their performance in planning Software Engineering controlled experiments with participants?

Study 2 Results

Figure F.11: Study 2 - Instrument’s Acceptance: Perceived usefulness

Study 3 Results


Figure F.12: Study 3 - Instrument’s Acceptance: Perceived usefulness

Study 4 Results

Figure F.13: Study 4 - Instrument’s Acceptance: Perceived usefulness


F.4.3 Perceived ease of use - To what extent do evaluators believe that using the instrument would be free of effort?

Study 2 Results

Figure F.14: Study 2 - Instrument’s Acceptance: Perceived ease of use

Study 3 Results


Figure F.15: Study 3 - Instrument’s Acceptance: Perceived ease of use

Study 4 Results

Figure F.16: Study 4 - Instrument’s Acceptance: Perceived ease of use


G Material from the Instrument Validation 1

In this appendix, we present the material from the instrument validation 1 described in Chapter 6. In detail, we present the invitation letter, the instruction letter, and the instrument used in Study 1.

G.1 Letters

G.1.1 Invitation Letter

Hello <participant name>,

A couple of months ago, we had the pleasure of your participation in our study about what you actually do when you design experiments, and what kinds of problems or traps you fall into. I interviewed you on <date>. Thank you again for your participation in that study. Based on the results of that study, we are designing an instrument for helping researchers to review experimental plans for controlled experiments using human subjects in software engineering. So now, we are inviting you to participate in the assessment of this instrument. This assessment is part of a formative evaluation, the goal of which is to improve the instrument. Based on the results of this evaluation, we will modify and adjust the instrument.

If you agree, we intend to send the instrument to you by e-mail on June 5, 2016, and you will have 1 week to assess it. The assessment can be done anywhere you choose. To assess the instrument, you will have two options: 1) assessing the instrument just by reading and reviewing it, or 2) using the instrument to review an actual experimental plan you have on hand. In both cases, you will check the items in the instrument that you find useful, and which ones you had trouble understanding. We expect that this assessment exercise will take no more than 2 hours.

We hope that you can participate in this study and contribute to this assessment. Please reply to let us know that you are willing to participate. Thank you in advance.

Best regards,
Liliane Fonseca


G.1.2 Instruction Letter

Dear participant,

Thank you very much for your willingness to participate in this assessment. The objective of the assessment is to find out what empirical researchers think about the proposed instrument regarding which checklist items they find useful and which ones they have trouble understanding. Also, we want to assess the instrument's acceptance regarding its appropriateness, usefulness, and ease of use.

On this website, https://goo.gl/GsNr1M, you will find all materials needed for the study, including the instructions, the instrument, and an instrument acceptance questionnaire. While you are completing your assessment, please feel free to contact me at [email protected] with any questions or confusions.

Please return your evaluation by Saturday June 11, 2016.

In order to avoid introducing bias in future studies that we plan to perform soon, possibly with some of your colleagues or members of your research group, we request that you commit to not exchange any information with anyone else about your assessment, the instrument, or any other details of this study.

All information will be treated confidentially.

Best regards,


G.2 Instrument validation 1


H Material from the Instrument Validation 2

In this appendix, we present the material from the instrument validation 2 described in Chapter 6. In detail, we present the invitation letter, the instruction letter, and the instrument used in Study 2.

H.1 Letters

H.1.1 Invitation Letter

Hello <participant name>,

We are designing an instrument for helping researchers to review experimental plans for controlled experiments using human subjects in software engineering. We are inviting you to participate in the assessment of this instrument. This assessment is part of a formative evaluation, the goal of which is to improve the instrument. Based on the results of this evaluation, we will modify and adjust the instrument. If you agree, we intend to start the assessment on June 5, 2016. The assessment will be run remotely, and it will follow the schedule in Table H.1.

Table H.1: Schedule of the Study

Date           Activity                             Duration    Environment
June 5, 2016   Receive e-mail with instructions     --------    e-mail and Website
June 6, 2016   Training and dry-run instructions    30 min      Skype
June 7, 2016   Questions and Answers                As needed   Skype
June 8, 2016   Data collection                      7 days      Website

Basically, you will apply the instrument to three experimental plans (written in Portuguese) from graduate students in an Experimental Software Engineering course in Brazil, where they have learned how to plan and conduct controlled experiments using human subjects. In addition, you will answer a questionnaire about the instrument's acceptance. We hope that you can participate in this study and contribute to this assessment. Please reply to let us know that you are willing to participate.

Best regards,


Liliane Fonseca

H.1.2 Instruction Letter

Dear participant,

Thank you very much for your willingness to participate in this assessment. The objective of this assessment is to evaluate the instrument with respect to inter-rater agreement, inter-rater reliability, criterion validity, and the instrument's acceptance regarding its appropriateness, usefulness, and ease of use.

On this website, https://goo.gl/cPYB1f, you will find all materials needed for the study, including the instructions, the instrument, and an instrument acceptance questionnaire. While you are completing your assessment, please feel free to contact me at [email protected] with any questions or confusions. The assessment will be run remotely, and it will follow the schedule in Table H.2.

Table H.2: Schedule of the Study

Date           Activity                             Duration    Environment
June 5, 2016   Receive e-mail with instructions     --------    e-mail and Website
June 6, 2016   Training and dry-run instructions    30 min      Skype
June 7, 2016   Questions and Answers                As needed   Skype
June 8, 2016   Data collection                      7 days      Website

Please reply to schedule the time of the training and dry-run instructions, which should take place tomorrow (Monday June 6, 2016) by Skype.

The final deadline of this assessment is Wednesday June 15, 2016.

In order to avoid introducing bias in future studies that we plan to perform soon, possibly with some of your colleagues or members of your research group, we request that you commit to not exchange any information with anyone else about your assessment, the instrument, or any other details of this study.

All information you submit will be treated confidentially. Again, thank you for your participation.

Best regards,
Liliane Fonseca


H.2 Instrument validation 2


I Experimental Plans

The experimental plans used as objects in Studies 2, 3, and 4 are presented below.


I.1 Experimental Plan 0


I.2 Experimental Plan 1


I.3 Experimental Plan 2


I.4 Experimental Plan 3


Appendix J Protocol for Selecting Experimental Plans

This document describes the procedures for selecting the experimental plans used as experimental objects in Instrument Validations 2, 3, and 4. The protocol was reviewed and approved by two expert researchers in Experimental Software Engineering, and it describes the process of selecting suitable experimental plans in order to minimize problems during the execution of the studies.

1. Experimental plans data collection

E-mails were sent to expert researchers and professors in Experimental Software Engineering, asking them for samples of written plans for controlled experiments using human subjects.

The experimental plans were selected according to the following criteria (a minimal sketch of this screening filter follows the list):

- It must be a plan for a controlled experiment; that is, a document written before the experiment was run, not an experiment report or any document written after the experiment had finished;

- It should involve human participants;

- It should assign the subjects randomly.
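
To make the screening concrete, the sketch below encodes the three criteria as a simple filter over candidate documents. It is illustrative only: the protocol was applied manually, and the field names used here (written_before_execution, has_human_participants, random_assignment) are hypothetical.

```python
# Illustrative sketch only: the three selection criteria above expressed as a
# filter over candidate documents. Field names are hypothetical.
from dataclasses import dataclass

@dataclass
class CandidateDocument:
    title: str
    written_before_execution: bool  # a plan, not a post-hoc report
    has_human_participants: bool
    random_assignment: bool

def is_eligible_plan(doc: CandidateDocument) -> bool:
    """A document is kept only if it satisfies all three criteria."""
    return (doc.written_before_execution
            and doc.has_human_participants
            and doc.random_assignment)

candidates = [
    CandidateDocument("Plan A", True, True, True),
    CandidateDocument("Report B", False, True, True),   # rejected: a report
    CandidateDocument("Plan C", True, False, True),     # rejected: no humans
]
print([d.title for d in candidates if is_eligible_plan(d)])  # ['Plan A']
```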

2. Choosing suitable experimental plans

After the e-mails were sent, some materials were received and analyzed against the criteria described above. A set of 48 experimental plans, produced by graduate students in 2011, 2012, and 2013 in an Experimental Software Engineering course at UFPE, was chosen as potential experimental objects for the studies. The course was based on best practices of Experimental Software Engineering, drawn from sources such as:

(a) Books

i. Natalia Juristo and Ana M. Moreno. Basics of Software Engineering Experimentation. Kluwer Academic Publishers, 2001.


ii. C. Wohlin et al. Experimentation in Software Engineering: An Introduction. Kluwer Academic Publishers, 2000.

(b) Papers

i. S. Pfleeger. Design and Analysis in Software Engineering, Part 1: The Language of Case Studies and Formal Experiments. Software Engineering Notes, 19(4):16-20, October 1994.

ii. S. Dehnadi and R. Bornat. The Camel has Two Humps.

iii. S. Surakka. What subjects and skills are important for software developers? Communications of the ACM, 50(1):73-78, January 2007.

iv. V. Basili, R. Selby, and D. Hutchens. Experimentation in Software Engineering. IEEE Transactions on Software Engineering, 12(7):733-743, July 1986.

(c) Other materials

i. Example of an experiment

ii. G. Travassos et al. Introdução à Engenharia de Software Experimental (Introduction to Experimental Software Engineering). Technical Report ES-590/02, COPPE/UFRJ, April 2002.

3. Choosing four experimental plans concealed from the author of the proposed instrument

The researcher who sent us the 48 experimental plans applied a second set of criteria to pick four of them, without the author of the proposed instrument knowing which ones were picked. Three experimental plans were used in Instrument Validations 3 and 4 (one in the dry run and two in the experiment execution), and four in Study 2 (one in the dry run and three in the experiment execution).
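
A minimal sketch of this concealed selection step is given below. It assumes the independent researcher draws the four plans uniformly at random from the pool of 48 and hands them over relabeled; the protocol only requires that the choice be hidden from the instrument's author, so the random draw and the identifiers here are illustrative assumptions.

```python
# Illustrative sketch of the concealed selection step. The independent
# researcher draws four plans; the instrument's author never sees which
# original identifiers were drawn. random.sample is an assumption made for
# illustration; the protocol only requires concealment.
import random

pool = [f"plan_{i:02d}" for i in range(1, 49)]  # the 48 candidate plans
picked = random.sample(pool, k=4)               # known only to the selector

# Hand over only relabeled copies (Experimental Plan 0..3), so the author
# cannot map the chosen objects back to the original pool.
handover = {f"Experimental Plan {j}": plan for j, plan in enumerate(picked)}
print(sorted(handover))  # the author sees only the relabeled names
```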


Appendix K A protocol for developing a list of possible questions to be answered during data collection

This protocol describes the process for developing a list of possible questions from participants during data collection, with the purpose of minimizing bias from the influence of the author of the proposed instrument during the data collection phase.

K.1 Before Pilot

Creating a prior list of questions

The author of the proposed instrument made a list of probable questions about the experimental studies (1, 2, 3, and 4) and wrote down the possible answers. The author tried, as much as possible, to think about her answers ahead of time and to use them in the pilot study.

Result of this activity:
List of questions A1 - Prior list of questions made by the author of the proposed instrument.

K.2 During Pilot

Creating a list of questions from the pilot's participants

The pilot's participants were invited to ask questions about the data collection process, and they asked as many as they could. The author of the proposed instrument used the pilot to see what kinds of questions and concerns the participants of the study would have. The purpose of this activity was for the author of the instrument to answer questions in a way that helped participants with the mechanics of data collection, but not with the content of the instrument.

Result of this activity:


List of questions A2 - List of questions asked by the pilot's participants.

K.3 After Pilot

Grouping the lists of questions

The experimenter reviewed lists A1 and A2 and grouped the questions into categories.

Result of this activity:
List of questions A3 - Lists of questions A1 and A2 grouped into categories.
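
As a rough illustration of this grouping step, the sketch below merges two question lists and buckets them into categories by keyword. The category names, keywords, and questions are invented for the example; the actual grouping was done manually by the experimenter.

```python
# Illustrative sketch: merge lists A1 and A2 and group the questions into
# categories. All categories, keywords, and questions are invented.
from collections import defaultdict

list_a1 = ["How do I submit my list of mistakes?",
           "How long do I have for the review task?"]
list_a2 = ["Where do I upload the spreadsheet?",
           "Can I ask about the meaning of an instrument item?"]

keywords = {"submission": ("submit", "upload"),
            "timing": ("long", "time"),
            "instrument content": ("instrument",)}

grouped = defaultdict(list)  # the merged, categorized result plays the role of A3
for question in list_a1 + list_a2:
    lowered = question.lower()
    category = next((name for name, kws in keywords.items()
                     if any(kw in lowered for kw in kws)), "other")
    grouped[category].append(question)

for category, questions in grouped.items():
    print(category, "->", questions)
```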

K.4 Review of the list of questions

List A3 was reviewed and some adjustments were made.

Result of this activity:
List of questions A4 - Revised and adjusted list of questions.

K.5 Simulation of answering questions

The author of the proposed instrument rehearsed the list of questions in order to answer them consistently for all participants in the data collection process. Two researchers pretended to be participants and tested the answers given by the author of the instrument.


Appendix L Process for creating and using the reference models

L.1 Creating reference models

This document describes the process for creating and using the reference models in Instrument Validations 3 and 4, described in Chapter 6. In addition, it explains how the coders determined whether an item found by a participant was the same as an item in the reference model. This process aims to make the usage of the reference models reliable and to minimize bias arising from the influence of the author of the instrument. We define a reference model as the list of mistakes and missing elements in an experimental plan. Each list of mistakes was built by an expert in Experimental Software Engineering, the professor of the Experimental Software Engineering course from which the experimental plans come. Each reference model is related to one experimental plan. A total of three reference models were produced: one used in the dry run and two used in the data collection of Studies 3 and 4, in Chapter 6.

L.2 Using reference models

The coders received 14 lists of mistakes (seven for Experimental Plan 1 and seven for Experimental Plan 2) from the seven participants of Instrument Validations 3 and 4 in Chapter 6.

The data analysis was performed by two coders, both Ph.D. students, each with more than two years of experience in Experimental Software Engineering research. They individually compared the mistakes and missing elements reported by the participants with the mistakes and missing elements in the reference model. A mistake or missing element found by a participant during the experiment execution was considered correct if it matched an item identified by the professor.

Because the lists produced by the participants were free-text transcriptions rather than numbered items, we created some rules and considerations for the comparison process. Although some items were obviously identical or almost the same, in other cases the coders had to decide during the data analysis which items were the same and which were not.

When a participant reported a mistake that was not in the reference model, the coders inserted the item into the reference model. They were advised that, during the analysis process, it was important to check their lists with each other in order to resolve conflicting items. When they could not reach an agreement, the conflict was resolved by the professor. The professor (the creator of the reference model) decided whether the item found by the participant was a real problem and whether it should be part of the reference model; if so, the coders added the item to the reference model. In this way, the development of the reference model was iterative, and the final model became larger than the initial one.

In addition, the coders calculated the number of items identified correctly against the current reference model. All steps of the analysis were recorded in spreadsheets by the coders. At the end of the analysis, the coders sent the results spreadsheet to the experimenter by e-mail.
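
The counting step can be pictured with the short sketch below: each participant item carries the coders' decision about which reference item (if any) it matches, new genuine problems confirmed by the professor are appended to the model, and the score is the number of distinct reference items the participant hit. All identifiers are hypothetical, and the real comparison was a manual judgment recorded in spreadsheets, not automated matching.

```python
# Illustrative sketch of the scoring against the reference model. The
# matching itself was a manual coder judgment; here it is represented as an
# explicit mapping from participant items to reference-model item ids.
# All identifiers are hypothetical.

reference_model = {
    "R1": "Hypotheses are not stated formally",
    "R2": "No justification for the chosen design",
    "R3": "Threats to validity section is missing",
}

# Coders' decisions: participant item -> matched reference id (None = no match).
coder_decisions = {
    "hypotheses not formalized": "R1",
    "missing threats to validity": "R3",
    "sample size not justified": None,  # candidate new item
}

# A reported problem with no match is taken to the professor; if confirmed,
# it is added to the model, so the model grows iteratively.
professor_confirms = {"sample size not justified": True}
for item, ref_id in list(coder_decisions.items()):
    if ref_id is None and professor_confirms.get(item):
        new_id = f"R{len(reference_model) + 1}"
        reference_model[new_id] = item
        coder_decisions[item] = new_id

# Items identified correctly = distinct reference items the participant hit.
correct = {ref_id for ref_id in coder_decisions.values() if ref_id is not None}
print(len(correct), "of", len(reference_model))  # 3 of 4
```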

This analysis process aimed at giving the coders autonomy to make decisions without the influence of the author of the instrument, whose role was only that of a facilitator.


Appendix M Material from the Instrument Validation 3 and 4

In this appendix, we present the material from Instrument Validations 3 and 4, described in Chapter 6. In detail, we present the invitation letters, the instruction letters, and the instrument used in Studies 3 and 4. The invitation and instruction letters were originally written in Portuguese and are presented here in English translation.

M.1 Letters

M.1.1 Invitation Letter - Study 3

Hello <participant's name>,

We will be running an experiment on the review of experimental plans. The experiment will take place on June 20, from 1 p.m. to 5 p.m., at CIn. Can I count on your presence?

Thank you very much,
Liliane Fonseca

M.1.2 Invitation Letter - Study 4

Hello <participant's name>, how are you?

I am Liliane, a Ph.D. student of Sergio Soares in the area of Experimental Software Engineering. We are inviting you to take part in one of our evaluations of the instrument we are developing to support the review of experimental plans. Sergio suggested you as a participant because you took the Experimental Software Engineering course in <year.semester>. The evaluation is quite simple, will be carried out remotely, and takes approximately 90 minutes. Would you like to participate in this evaluation? If so, I will send you the instructions on June 20. You will have 7 days to complete the study.

We look forward to your reply.
Thank you very much, Liliane Fonseca


M.1.3 Instruction Letter - Study 3

Hi <participant's name>, how are you?

Once again, thank you very much for your willingness to participate in the study.

The experiment will take place next Monday, June 20, from 1 p.m. to 5 p.m., in undergraduate laboratory 4 (LabG4) at CIn. If you have trouble finding the laboratory, you can contact me at <phone number>. The laboratory will be identified with paper signs reading "EXPERIMENTO - LILIANE FONSECA".

It is very important that we start the experiment without delays, so we ask you to be at the laboratory 10 minutes in advance.

We would like you to bring the material you normally use when planning your controlled experiments. This may include printed books, digital books, or online research material.

The laboratory will have computers available, but if you prefer, feel free to bring your own notebook.

If you have any questions, you can contact me.
Best regards,
Liliane Fonseca

M.1.4 Instruction Letter - Study 4

Hello <participant's name>,

Once again, thank you very much for accepting the invitation to participate in this experiment.

The goal of our study is to analyze whether the instrument we are developing can help researchers (especially novices) to review experimental plans in Software Engineering. The review instrument can be used both to review your own experimental plans and to support the review of experimental plans written by others.

You will be invited to access a website, where you will find all the information needed for the evaluation.

It is important that you read the instructions carefully before performing the activities, and that you contact me immediately if you have any questions.

In this study you will be exposed to two treatments.

First, you will review an experimental plan and identify mistakes and elements that should be contained in that plan. In this activity you may consult any material that helps you with the review. Perform this activity in at most 40 minutes. Right after finishing, you must send an e-mail to [email protected] reporting that you have completed the activity, in order to receive access to Treatment 2.

Next, you will use the instrument we developed to identify mistakes and elements in another experimental plan, similar to the first one. Perform this activity in at most 40 minutes.


When you finish, you must answer a questionnaire evaluating the proposed instrument.

In order to avoid bias in future studies that we are planning to run soon, possibly with some of your colleagues or members of your research group, we ask you to commit to not exchanging any information with anyone about your evaluation, the instrument, or any detail of this study.

All the information you submit will be treated confidentially.

M.2 Instrument validation 3 and 4

M.2.1 Treatment 1


M.2.2 Treatment 2


Appendix N Instrument's Acceptance and Demographic Questionnaire

This questionnaire consists of two parts. The first part focuses on your perceptions regarding the instrument's fitness for purpose (4 questions), the items' appropriateness (5 questions), perceived usefulness (5 questions), and perceived ease of use (5 questions), and it closes with feedback about the instrument (5 open questions). The second part aims to capture your experience in experiment planning (3 questions).
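
As a compact summary, the sketch below captures the composition of the questionnaire described above as a small data structure; it only tallies the question counts and is not taken from the questionnaire materials themselves.

```python
# Illustrative sketch: the structure of the two-part questionnaire as
# described above (counts only; the wording of the items is not shown here).
questionnaire = {
    "Part 1: Instrument's Acceptance": {
        "fitness for purpose": 4,
        "items' appropriateness": 5,
        "perceived usefulness": 5,
        "perceived ease of use": 5,
        "feedback (open questions)": 5,
    },
    "Part 2: Demographic": {
        "experience in experiment planning": 3,
    },
}

total = sum(n for part in questionnaire.values() for n in part.values())
print(total, "questions in total")  # 27 questions in total
```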

N.1 Instrument’s Acceptance Questionnaire


N.2 Demographic Questionnaire


C.2.5 Classification: Parameters and Variables . . . 334
C.2.6 Classification: Participants . . . 335
C.2.7 Classification: Group Assignment . . . 336
C.2.8 Classification: Experimental Materials . . . 337
C.2.9 Classification: Tasks . . . 337
C.2.10 Classification: Experiment Design . . . 338
C.2.11 Classification: Procedure . . . 339
C.2.12 Classification: Data Collection and Analysis Procedure . . . 339
C.2.13 Classification: Threats to Validity . . . 340

Appendix D Experimental Websites 341
D.1 Experimental Website - Study 1 . . . 341
D.2 Experimental Website - Study 2 . . . 343
D.3 Experimental Website - Study 3 . . . 346
D.4 Experimental Website - Study 4 . . . 351

Appendix E Raw Data 355
E.1 Study 1 . . . 355
E.1.1 Instrument's acceptance Raw Data - Study 1 . . . 355
E.1.2 Raw Data - Study 1 . . . 356
E.2 Study 2 . . . 358
E.2.1 Instrument's acceptance Raw Data - Study 2 . . . 358
E.2.2 Assessment of the Experimental Plans by Researchers - Raw Data . . . 360
E.2.3 Completeness Score from Researchers - Raw Data . . . 360
E.2.4 Inter-Rater Agreement - Raw Data . . . 361
E.3 Study 3 . . . 362
E.3.1 Instrument's acceptance Raw Data - Study 3 . . . 362
E.3.2 Items Identified Correctly Raw Data - Study 3 . . . 362
E.4 Study 4 . . . 362
E.4.1 Instrument's acceptance Raw Data - Study 4 . . . 362
E.4.2 Items Identified Correctly Raw Data - Study 4 . . . 363

Appendix F Data Analysis - Scripts 364
F.1 Scripts - Study 2 . . . 364
F.1.0.1 Difference between the completeness mean scores from Beginner and Expert Researchers . . . 364
F.1.0.2 Inter-Rater agreement between raters with similar expertise: beginner and expert researchers . . . 366
F.1.0.3 Inter-Rater Agreement among Four Researchers . . . 374
F.1.0.4 Inter-Rater Reliability . . . 376
F.1.0.5 Criterion Validity . . . 376
F.2 Scripts - Study 3 . . . 377
F.3 Scripts - Study 4 . . . 380
F.4 Scripts - Instrument's acceptance . . . 382
F.4.1 Appropriateness - To what extent do evaluators believe that the instrument is appropriate for reviewing experimental plans for controlled experiments using participants in Software Engineering? . . . 382
F.4.2 Perceived usefulness - To what extent do evaluators believe that using the instrument would enhance their performance in planning Software Engineering controlled experiments with participants? . . . 386
F.4.3 Perceived ease of use - To what extent do evaluators believe that using the instrument would be free of effort? . . . 388

Appendix G Material from the Instrument Validation 1 390
G.1 Letters . . . 390
G.1.1 Invitation Letter . . . 390
G.1.2 Instruction Letter . . . 391
G.2 Instrument validation 1 . . . 392

Appendix H Material from the Instrument Validation 2 401
H.1 Letters . . . 401
H.1.1 Invitation Letter . . . 401
H.1.2 Instruction Letter . . . 402
H.2 Instrument validation 2 . . . 403

Appendix I Experimental Plans 413
I.1 Experimental Plan 0 . . . 414
I.2 Experimental Plan 1 . . . 417
I.3 Experimental Plan 2 . . . 425
I.4 Experimental Plan 3 . . . 430

Appendix J Protocol for Selecting Experimental Plans 436

Appendix K A protocol for developing a list of possible questions to be answered during data collection 438
K.1 Before Pilot . . . 438
K.2 During Pilot . . . 438
K.3 After Pilot . . . 439
K.4 Review of the list of questions . . . 439
K.5 Simulation of answering questions . . . 439

Appendix L Process for creating and using the reference models 440
L.1 Creating reference models . . . 440
L.2 Using reference models . . . 440

Appendix M Material from the Instrument Validation 3 and 4 442
M.1 Letters . . . 442
M.1.1 Invitation Letter - Study 3 . . . 442
M.1.2 Invitation Letter - Study 4 . . . 442
M.1.3 Instruction Letter - Study 3 . . . 443
M.1.4 Instruction Letter - Study 4 . . . 443
M.2 Instrument validation 3 and 4 . . . 444
M.2.1 Treatment 1 . . . 444
M.2.2 Treatment 2 . . . 446

Appendix N Instrument's Acceptance and Demographic Questionnaire 456
N.1 Instrument's Acceptance Questionnaire . . . 456
N.2 Demographic Questionnaire . . . 459