SPICE: An Empiricist's Perspective

KHALED EL EMAM

DENNIS R. GOLDENSON

International Software Engineering Research Network Technical Report ISERN-98-03

(this report was originally written in 1995)


Abstract

The SPICE project aims to deliver an international standard for software process assessment by the end of 1996. As part of this project there is an empirical trials phase whose purpose is to ascertain the effectiveness of the prospective SPICE standard. Two of the objectives of the trials phase are: (a) to determine the extent to which SPICE-conformant assessments are repeatable (i.e., reliability), and (b) to determine the extent to which SPICE-conformant assessments are really measuring best software process practices (i.e., validity). This paper introduces the theoretical foundations for evaluating the reliability and validity of measurement, suggests some empirical research methods for investigating them in SPICE, and discusses the constraints and limitations of these methods within the context of the SPICE project.

1 Introduction

Over the last two years there has been an on-going effort at developing an international standard for software process assessment. This effort is known as the SPICE (Software Process Improvement and Capability dEtermination) project. A prime motivation for developing this prospective standard has been the perceived need for an internationally recognized software process assessment framework that pulls together the existing public and proprietary methods [53]. Overviews of this project have been presented by various members of the SPICE team [21][22][23][41][52].

One important question that ought to be asked about such a prospective standard is: "does it embody sound software engineering practices?" This question reflects a more general concern among some researchers that existing software engineering standards lack an empirical basis demonstrating that they indeed represent "good" practices. For instance, it has been noted that [55] "standards have codified approaches whose effectiveness has not been rigorously and scientifically demonstrated. Rather, we have too often relied on anecdote, 'gut feeling,' the opinions of experts, or even flawed research," and [54] "many corporate, national and international standards are based on conventional wisdom [as opposed to empirical evidence]." Similar arguments are made in [28][29][30].

To address such shortcomings in previous standardization efforts, the SPICE project includes an empirical trials phase. While ideally an accumulation of empirical evidence ought to precede a standardization effort, inclusion of the trials phase in the SPICE project is still a considerable improvement over previous software engineering standardization efforts.

SPICE essentially defines requirements for a measurement procedure. In other scientific disciplines (such as educational research, psychometrics, and econometrics), measurement procedures are expected to exhibit high reliability and high validity. Thus, two of the main objectives of the SPICE trials phase are: (a) to determine the extent to which SPICE-conformant assessments^1 are repeatable (i.e., their reliability), and (b) to determine the extent to which SPICE-conformant assessments are really measuring best software process practices (i.e., their validity).

Khaled El Emam, Fraunhofer Institute for Experimental Software Engineering

Dennis R. Goldenson, Software Engineering Institute, Carnegie Mellon University

1 In SPICE, explicit conformance criteria are defined. Thus, in this paper when we write about SPICE-conformant assessments, we refer to assessments that satisfy the conformance criteria.

This work was done while El Emam was at CRIM, Montreal. The views stated in this paper are those of the authors and do not necessarily reflect the policies or positions of CRIM, Fraunhofer IESE, the Software Engineering Institute, Carnegie Mellon University, or their funding agencies.

In this paper we first introduce the theoretical foundations for evaluating the reliability and validity of measurement procedures in general. We then suggest some empirical research methods for evaluating the reliability and validity of SPICE, and discuss the constraints and limitations of these methods within the context of the SPICE project.

The main contributions of this paper are: (a) to highlight the importance of the trials in ascertaining the effectiveness of SPICE, at least on scientific grounds, (b) to raise awareness of some important empirical questions that ought to be addressed by any software process assessment method, and (c) to present some theoretical concepts and empirical research methods for evaluating the reliability and validity of software process assessments.

In the remainder of this paper, each of the SPICE trials' reliability and validity objectives is discussed in a section of its own (sections 2 and 3 respectively). This is followed in section 4 by a discussion of the limitations and necessary tradeoffs in conducting the SPICE trials. Section 5 concludes the paper with a summary of the main points.

2 Reliability

2.1 The Importance of Reliability

A SPICE-conformant assessment is a measurement procedure. For any measurement procedure, reliability is of fundamental concern. Reliability concerns the extent to which random error is present in a measurement procedure.

Within the context of software process assessments, concern with reliability is not unique to SPICE. For example, Card broached reliability issues while discussing the repeatability of CMM-based Software Capability Evaluations [12].

Ideally, SPICE-conformant assessments should exhibit high reliability. This means that random error is minimal and that assessment scores are consistent and repeatable.

The extent of reliability of SPICE-conformant assessments and the sources of random measurement error can be ascertained empirically. Such empirical studies are being conducted during the trials phase of the SPICE standardization project.

Reliability studies in the SPICE trials are not simply an academic exercise. These studies are being designed to reflect the types of decisions that will be made in practice based on the outcomes of SPICE-conformant assessments.

Two common types of decisions that will be made based on SPICE-conformant assessment outcomes are considered here. These decisions are represented by the following two scenarios: (a) a contract award scenario where a contractor has to select between competing suppliers, and (b) a self-improvement scenario where an organization identifies areas for improvement and tracks improvement progress.

Assume Figure 1 shows the profiles^2 of two organizations, A and B, and that P and Q are two different processes being assessed. Due to random measurement error, the scores obtained for each process are only one of the many possible scores that would be obtained had the organization been repeatedly assessed^3. While the obtained scores for organization B are in general higher than those for organization A, this may be an artifact of chance. Without consideration of random measurement error, organization A may be unfairly penalized in a contract award situation.

Figure 1: Example hypothetical assessment scores with confidence intervals.

2 For simplicity, the diagram in Figure 1 does not illustrate profiles in the same way as intended in the prospective SPICE standard. This, however, does not result in any loss of generality in the discussion.

3 The confidence intervals are established by adding and subtracting, for instance, one standard error of measurement to/from the obtained score after it is transformed to a z-score. The standard error of measurement can be calculated from the reliability estimate (see [51]).
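To make footnote 3 concrete, the following minimal Python sketch (our illustration, with invented numbers rather than SPICE data) derives the standard error of measurement from a reliability estimate using the standard formula SEM = s * sqrt(1 - r) from classical test theory [51], and forms a one-SEM band around a standardized score.

    import math

    def standard_error_of_measurement(std_dev: float, reliability: float) -> float:
        # Classical test theory: SEM = s * sqrt(1 - r).
        return std_dev * math.sqrt(1.0 - reliability)

    # Hypothetical values: obtained score 3.2, mean 2.5, standard deviation 0.9 across
    # organizations, and an estimated reliability of 0.80 for the assessment scores.
    obtained, mean, s, r = 3.2, 2.5, 0.9, 0.80
    z = (obtained - mean) / s                          # obtained score as a z-score
    band = standard_error_of_measurement(1.0, r)       # SEM on the z-score scale
    print(f"z = {z:.2f}, interval = [{z - band:.2f}, {z + band:.2f}]")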

Turning to a self-improvement scenario, assume that Figure 1 shows the profiles of one organization at two points in time, A and B. At time A, it may seem that the score for process Q is much lower than that for process P. Thus, the organization would be tempted to pour resources into improvements in process Q. However, without consideration of random measurement error, one cannot have high confidence about the extent to which the difference between the P and Q scores is an artifact of chance. Furthermore, at time B, it may seem that the organization has improved. However, without consideration of random measurement error, one cannot have high confidence about the extent to which the differences between the A and B scores (for processes P and Q) are artifacts of chance.

The above examples highlight the importance of evaluating the extent of reliability of SPICE-conformant assessments. Furthermore, to facilitate improved decision making based on such assessments, the sources of error ought to be identified and, where possible, reduced.

Another benefit of evaluating reliability is that one can determine how to increase the reliability of SPICE-conformant assessments. For example, one can estimate what the reliability coefficient would be if the length of an instrument were increased, or if more assessment teams conducted the assessments concurrently and aggregated their findings. This information would be useful for improving the prospective SPICE standard before standardization, as well as for providing directions to users of the eventual standard about how to conduct more reliable assessments if the context demands it.
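The standard tool for such projections is the Spearman-Brown prophecy formula (see [51]). The following minimal Python sketch is our illustration with a hypothetical starting reliability; it is not a procedure prescribed by the SPICE documents.

    def spearman_brown(reliability: float, length_factor: float) -> float:
        # Predicted reliability when an instrument is lengthened (or shortened) by the
        # given factor, assuming added items are comparable to the originals:
        # r_k = k * r / (1 + (k - 1) * r)
        k, r = length_factor, reliability
        return (k * r) / (1.0 + (k - 1.0) * r)

    # Hypothetical example: doubling the length of an instrument with reliability 0.70.
    print(round(spearman_brown(0.70, 2.0), 3))   # prints 0.824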

In the appendix of this paper we have included some general guidelines for increasing the reliability of software process assessments. These guidelines should be useful for others who are developing or using software process assessment methods.

The remainder of this section includes an overview of reliability theory and methods for reliability estimation, and a discussion of their application in estimating reliability in the SPICE trials.

2.2 Reliability Theory and Methods

Two theoretical frameworks for ascertaining the extent of reliability are presented below: (a) the classical test theory framework, and (b) the generalizability theory framework. Associated with each theoretical framework are a number of empirical research methods that can be applied.

2.2.1 Classical Test Theory Framework

Classical test theory states that an observed score consists of two additive components, a true score and an error: X = T + E. Thus, X would be the score obtained in a SPICE-conformant assessment, T is the mean of the theoretical distribution of X scores that would be found in repeated assessments of the same organization^4, and E is the error component. The reliability of measurement is defined as the ratio of true score variance to observed score variance.
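To make this definition concrete, the following minimal Python sketch (entirely simulated data, not SPICE results) generates observed scores as a true score plus random error, and compares the theoretical reliability Var(T)/Var(X) with an estimate obtained by correlating two parallel measurements.

    import numpy as np

    rng = np.random.default_rng(0)
    n_orgs = 1000

    true = rng.normal(loc=3.0, scale=1.0, size=n_orgs)    # T: simulated true capability scores
    x1 = true + rng.normal(scale=0.5, size=n_orgs)        # X = T + E, first "assessment"
    x2 = true + rng.normal(scale=0.5, size=n_orgs)        # a parallel second "assessment"

    theoretical = true.var(ddof=1) / x1.var(ddof=1)       # Var(T) / Var(X)
    parallel_forms = np.corrcoef(x1, x2)[0, 1]            # correlation between parallel forms
    print(f"theoretical ~ {theoretical:.2f}, parallel-forms estimate ~ {parallel_forms:.2f}")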

There are four methods for estimating reliability under this framework. All four methods attempt to determine the proportion of variance in a measurement scale that is systematic. The different methods can be classified by the number of different assessment procedures necessary and the number of different assessment occasions necessary. This classification is depicted in Figure 2. These methods are briefly described below (for more details see [45]):

1. Test-Retest Method
This is the simplest method for estimating reliability. In the SPICE context, one would have to assess each organization's capability at two points in time using the same assessment procedure (i.e., the same instrument, the same assessors, and the same assessment process). Reliability would be estimated by the correlation between the scores obtained on the two assessments.

Figure 2: A classification of classical reliability estimation methods.

4 In practice the true score can never really be known, since it is generally not possible to obtain a large number of repeated assessments of the same organization. If one is willing to make some assumptions (e.g., an assumption of linearity), however, point estimates of true scores can be computed from observed scores [45].

2. Alternative Forms Method
Instead of using the same assessment procedure on two occasions, the alternative forms method stipulates that two alternative assessment procedures be used. This can be achieved, for example, by using two different instruments or by having two alternative, but equally qualified, assessment teams. This method can be characterized either as immediate (where the two assessments are concurrent in time) or delayed (where the two assessments are separated in time). The correlation coefficient (or some other measure of association) is then used as an estimate of the reliability of either of the alternative forms.

3. Split-Halves Method
With the split-halves method, the total number of items in an assessment instrument is divided into two halves and the half-instruments are correlated to get an estimate of reliability. The halves can be considered as approximations to alternative forms. A correction must be applied to the correlation coefficient, though, since that coefficient gives the reliability of each half only. One such correction is known as the Spearman-Brown prophecy formula [51].

4. Internal Consistency Methods
With methods falling under this heading, one examines the covariance among all the items in an assessment instrument. By far the most commonly used internal consistency estimate is the Cronbach alpha coefficient [16].
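As a concrete illustration of the internal consistency approach, the sketch below computes Cronbach's alpha from a small item-response matrix. The data and the 0-3 item scoring are invented for illustration and are not taken from any SPICE instrument.

    import numpy as np

    def cronbach_alpha(items: np.ndarray) -> float:
        # alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1)
        total_var = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

    # Hypothetical ratings of five organizations on four instrument questions (0-3 scale).
    scores = np.array([[3, 2, 3, 3],
                       [1, 1, 2, 1],
                       [2, 2, 2, 3],
                       [0, 1, 1, 0],
                       [3, 3, 2, 3]])
    print(round(cronbach_alpha(scores), 2))   # about 0.92 for these invented data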

Since there exists more than one classical method for estimating reliability, a relevant question is: "which method(s) are most commonly reported in the literature?" If we take the field of Management Information Systems (MIS) as a reference discipline (in the sense that MIS researchers are also concerned with software processes, their measurement, and their improvement), then some general statements can be made about the perceived relative importance of the different methods.

In MIS, researchers developing instruments for measuring software processes and their outcomes tend to report the Cronbach alpha coefficient most frequently [61]. Furthermore, some researchers consider the Cronbach alpha coefficient to be the most important [58].

Examples of instruments with reported Cronbach alpha coefficients are those for measuring user information satisfaction [38][62], user involvement [5][2], and perceived ease of use and usefulness of software [18][1]. Recently, however, reliability estimates using other methods have also been reported, for example, test-retest reliability for a user information satisfaction instrument [32] and for a user involvement instrument [63].

In software engineering, the few studies that consider reliability report the Cronbach alpha coefficient. Examples are the reliability estimates for a requirements engineering success instrument [25], for an organizational maturity instrument [26], and for the level 2 and 3 questions of the preliminary version of the SEI maturity questionnaire [37].

2.2.2 Generalizability Theory Framework

The different classical methods for estimating reliability presented above vary in the factors that they subsume under error variance. Figure 3 summarizes these factors. This means that the use of different classical methods will yield different estimates of reliability.

Generalizability theory [17], however, allows one to explicitly consider multiple sources of error simultaneously and estimate their relative contributions. In the context of SPICE, the theory would be concerned with the accuracy of generalizing from an organization's obtained score on a SPICE-conformant assessment to the average score that the organization would have received under all possible conditions of assessment (e.g., using all SPICE-conformant instruments, all SPICE-conformant assessment teams, all assessment team sizes, etc.). This average score is referred to as the universe score. All possible conditions of assessment are referred to as the universe of assessments. A set of measurement conditions is called a facet. Facets relevant to SPICE include the instrument used, the assessment team, and the assessment team size.

Generalizability theory uses the factorial analysis of variance (ANOVA) [50] to partition an organization's assessment score into an effect for the universe score, an effect for each facet or source of error, an effect for each of their combinations, and other "random" error. This can be contrasted with simple ANOVA, which is more analogous to the classical test theory framework. With simple ANOVA the variance is partitioned into "between" and "within". The former is thought of as systematic variance, or signal. The latter is thought of as random error, or noise. In the classical test theory framework one similarly partitions the total variance into true score and error score.

Suppose, for the purpose of illustration, that one facet is considered, namely the assessment instrument. Further, suppose that in the trials phase two instruments are used and N organizations are assessed using each of the two instruments. In this case, one intends to generalize from the two SPICE-conformant instruments to all other SPICE-conformant instruments. This design is represented in Figure 4. The results of this study would be analyzed as a two-way ANOVA with one observation per cell (e.g., see [50]). The above example could be extended to have multiple facets (i.e., to account for multiple sources of error such as instruments and assessors).
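As an illustration of how such a one-facet (organization x instrument) design could be analyzed, the Python sketch below estimates variance components from a two-way ANOVA with one observation per cell and forms a generalizability coefficient for a score averaged over the instruments. The data, and the choice to average over two instruments, are hypothetical assumptions made only for this example.

    import numpy as np

    # Hypothetical scores: rows are organizations, columns are two SPICE-conformant instruments.
    scores = np.array([[2.0, 2.4],
                       [3.1, 3.3],
                       [1.2, 1.8],
                       [4.0, 3.6],
                       [2.7, 2.5]])
    n_p, n_i = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)
    col_means = scores.mean(axis=0)

    # Mean squares for a two-way ANOVA with one observation per cell.
    ms_p = n_i * ((row_means - grand) ** 2).sum() / (n_p - 1)        # organizations
    ms_i = n_p * ((col_means - grand) ** 2).sum() / (n_i - 1)        # instruments
    resid = scores - row_means[:, None] - col_means[None, :] + grand
    ms_res = (resid ** 2).sum() / ((n_p - 1) * (n_i - 1))            # interaction plus error

    # Variance component estimates and a relative generalizability coefficient.
    var_res = ms_res
    var_i = max((ms_i - ms_res) / n_p, 0.0)
    var_p = max((ms_p - ms_res) / n_i, 0.0)
    g_coeff = var_p / (var_p + var_res / n_i)
    print(f"var_p={var_p:.3f}, var_i={var_i:.3f}, var_res={var_res:.3f}, g={g_coeff:.2f}")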

2.2.3 Other Methods

Methods based on the above two frameworks are not the only ones that can be used in reliability studies. However, they are the most well developed in the literature and the most commonly used. Other methods that may provide useful information as to the consistency and repeatability of SPICE-conformant assessments include the calculation of proximity or similarity measures of SPICE-conformant assessment profiles (produced by different assessment teams), and indices of agreement [31][24].

2.3 Application to SPICE

A number of reliability studies are being planned during the SPICE trials. One primary purpose of these studies is to estimate the reliability of SPICE-conformant assessments. Evidence of good reliability would commonly be taken to be coefficients of at least 0.8, and preferably 0.9 or higher.

Given the complexity of an intervention such as a SPICE-conformant assessment (or a software process assessment in general), only a subset of the methods presented earlier is feasible. The most feasible classical approaches for assessing reliability are: alternative forms (immediate) with different assessment teams or different assessment instruments, split-halves, and possibly internal consistency.

Figure 4: A basic one-facet design. (The design is a table with organizations 1, 2, 3, 4, 5, ..., N as rows and Instrument 1 and Instrument 2 as columns.)

Figure 3(a): Definition of some sources of error.

Source of Error: Different Occasions
Description: Assessment scores may differ across time. Instability of assessment scores may be due to temporary circumstances and/or actual change.

Source of Error: Different Assessors
Description: Assessment scores may differ across assessors (or assessment teams). Lack of repeatability of assessment scores may be due to the subjectivity in the evaluations and judgement of particular assessors (i.e., do different assessors make the same judgements about an organization's processes?).

Source of Error: Different Instrument Contents
Description: Assessment scores may differ across instruments. Lack of equivalence of instruments may be due to the questions in different instruments not being constructed according to the same content specifications (i.e., do different instruments have questions that cover the same content domain?).

Source of Error: Within Instrument Contents
Description: Responses to different questions or subsets of questions within the same instrument may differ amongst themselves. One reason for these differences is that questions or subsets of questions may not have been constructed to the same or to consistent content specifications. Regardless of their content, questions may be formulated poorly, may be difficult to understand, may not be interpreted consistently, etc.

Figure 3(b): Sources of error accounted for by the classical reliability estimation methods. (Figure not reproduced here.)

The alternative forms method with different assessment teams concerns the issue of inter-assessor reliability. This source of error is perhaps the one of most concern to the software process community. There are two general approaches that can be utilized to investigate this kind of error.

The first approach would be in a 'lab' setting. Fortunately, lab settings exist whenever a number of assessors take SPICE assessment courses or briefings as part of their qualification or training. Realistic case studies could be given to the assessors, and they would be requested to provide their ratings. Individual (or team) ratings would be compared across assessors (or teams) to determine reliability.

The second approach would be in a 'field' setting. For example, assessment teams could be divided into two groups. Each group would perform its ratings independently and subsequently meet to arrive at a consensus on the final ratings. The independent ratings would be compared to determine the reliability coefficient.

The alternative forms method with different assessment instruments accounts for different instruments' content as a source of error. Approaches similar to those described above for alternative forms with different assessment teams could be followed.

The split-halves method can be applied by having a number of trial assessments use an assessment instrument that is somewhat longer than usual. This instrument would subsequently be divided into two halves, and the reliability coefficients for each half and for the total instrument computed.

One difficulty with the split-halves method, however, is that the reliability estimate depends on the way the instrument is divided into two halves. For example, for a 10-question instrument there are 126 possible different splits [9], and hence 126 different split-halves reliability estimates. The most common procedure is to take the even-numbered items of an instrument as one part and the odd-numbered ones as the second part.
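The following minimal sketch (with invented 0/1 item scores; the odd/even split and the Spearman-Brown correction follow the description above) illustrates the split-halves procedure and, in passing, verifies the count of possible splits for a 10-question instrument.

    import math
    import numpy as np

    # Number of ways to split 10 items into two unordered halves of 5: C(10,5)/2 = 126.
    print(math.comb(10, 5) // 2)

    # Hypothetical responses of six organizations to a 10-question instrument (1 = practice present).
    responses = np.array([[1, 1, 1, 1, 1, 1, 0, 1, 1, 1],
                          [0, 0, 1, 0, 0, 1, 0, 0, 0, 1],
                          [1, 0, 1, 1, 0, 1, 1, 0, 1, 0],
                          [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
                          [1, 1, 0, 1, 1, 1, 1, 1, 0, 1],
                          [0, 1, 0, 0, 1, 0, 0, 1, 1, 0]])

    odd_half = responses[:, 0::2].sum(axis=1)    # questions 1, 3, 5, 7, 9
    even_half = responses[:, 1::2].sum(axis=1)   # questions 2, 4, 6, 8, 10
    r_half = np.corrcoef(odd_half, even_half)[0, 1]

    # Spearman-Brown correction gives the reliability of the full-length instrument.
    r_full = 2 * r_half / (1 + r_half)
    print(f"half-test r = {r_half:.2f}, corrected split-half reliability = {r_full:.2f}")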

An internal consistency method can be applied by computing the Cronbach alpha coefficient from the results of several SPICE-conformant assessments. To allow internal consistency methods to be used, these assessments would at least have to use the same instrument (or overlapping instruments) and follow the same assessment process. In the context of the SPICE trials, however, different participating organizations are expected to have different priorities, and would therefore prefer to be assessed against different dimensions of capability (e.g., engineering processes vs. support processes). This means that using the same instrument or overlapping instruments for a sufficiently large number of trial assessments may be difficult.

If one were to adopt the generalizability theory framework, then multiple sources of error could be investigated simultaneously. The approaches that would be utilized under the generalizability theory framework are similar to those described above for the alternative forms method. The difference would be in explicitly accounting for multiple sources of error in the same study.

The test-retest and the delayed alternative forms methods account for different occasions as a source of error. In the context of SPICE, there are at least three important difficulties in conducting studies that account for different occasions as a source of error.

The first difficulty is the expense of conducting assessments at more than one point in time. Given that prior experience has identified the costs of process assessments as a concern [8][39], the costs of repeated assessments would be perceived as substantial. It is already difficult enough to find sponsorship for a single assessment.

Second, it is not obvious that a low reliability coefficient obtained using a test-retest or delayed alternative forms method indicates low reliability. A likely explanation for a low coefficient is that the organization's software process capability has changed between the two assessment occasions. For instance, the initial assessment and its results might sensitize an organization to specific weaknesses and prompt it to initiate an improvement effort that influences the results of subsequent assessments.

Third, carry-over effects between assessments may lead to an over-estimate of reliability. For instance, the reliability coefficient can be artificially inflated by memory effects. Examples of memory effects are assessees knowing the 'right' answers that they have learned from previous assessments, and assessors remembering responses from previous assessments and, deliberately or otherwise, repeating them in an attempt to maintain the consistency of results.


3 Validity

3.1 The Importance of Validity

Validity of measurement is defined as the extent to which a measurement procedure measures what it purports to measure [40]. During the process of validating a measurement procedure, one attempts to collect evidence to support the types of inferences that are to be drawn from measurement scores [15]. In the context of SPICE, concern with validity is epitomized by the question: "are SPICE-conformant assessments really measuring best software process practices?"

Validity is related to reliability in the sense that reliability is a necessary but insufficient condition for validity. The differences between reliability and validity are illustrated below by way of two examples.

For example, assume one seeks to measure intelligence by having children throw stones as far as they can. The distance the stones are thrown on one occasion might correlate highly with how far they are thrown on another occasion. Thus, being repeatable, the stone-throwing measurement procedure would be highly reliable. However, the distance that stones are thrown would not be considered by most observers to be a valid measure of intelligence.

As another example, consider a car's fuel gauge that systematically shows ten liters higher than the actual level of fuel in the gas tank. If repeated readings of the fuel level are taken under the same conditions, the gauge will yield consistent (and hence reliable) measurements. However, the gauge does not give a valid measure of the fuel level in the gas tank.

Investigating the validity of SPICE-conformant assessments during the SPICE trials is an important objective. This importance becomes clear when one considers that the empirical evidence supporting the efficacy of many of the practices codified in SPICE is far from overwhelming, or even convincing, on scientific grounds^5.

For SPICE, there are three types of validity that are of interest: content validity, criterion-related validity, and construct validity. There is greater concern with criterion-related validity in the software process community, so greater emphasis will be placed on it in the ensuing discussion.

3.2 Content Validity

Content validity is defined as the representativeness or sampling adequacy of the content of a measuring instrument [40]. Ensuring content validity depends largely on expert judgement.

In the context of SPICE, expert judgement would ensure that SPICE-conformant measurement procedures are at least perceived to measure best software process practices. In order to explain content validation for SPICE, it is necessary to first briefly overview one of the core SPICE documents^6, the Baseline Practices Guide (BPG).

The purpose of the BPG is to "document the set of practices essential to good management of software engineering" [59]. The practices in the BPG are categorized into one of two groups: the Base Practices and the Generic Practices.

A base practice is defined as "a software engineering or management activity that addresses the purpose of a particular process, and thus belongs to it. Consistently performing the base practices associated with a process, and improving how they perform, will help in consistently achieving its purpose" [59]. An example of a process is Develop System Requirements and Design. Base practices that belong to this process include: Specify System Requirements, Describe System Architecture, and Determine Release Strategy.

A generic practice is defined as "an implementation or institutionalization practice (activity) that enhances the capability to perform a process" [59]. Generic practices are grouped into Common Features. An example of a Common Feature is Disciplined Performance. A generic practice that belongs to this Common Feature stipulates that data on the performance of the process must be recorded.

The BPG defines the content domain of best software process practices through the Base and Generic Practices. It is hypothesized that this content domain is applicable across software organizations of different sizes, in different industrial sectors, and following different software development life cycles. The BPG has been reviewed by experts from industry and academe to ensure that it adequately covers the content domain, and it has been revised accordingly. Furthermore, coverage of the BPG is also being evaluated during actual trial applications of SPICE. The trial applications are conducted by selected experienced assessors who will provide the coverage feedback.

5 The lack of empirical studies, and hence empirical evidence, is not unique to SPICE, but is a general characteristic of software engineering [6] and computer science [46].

6 The names of some of the SPICE documents may be changed before the completion of standardization. The names referred to here are those that are currently used.

All SPICE-conformant assessments must be based on a set of practices that at a minimum include those defined in the BPG, or in a conformant variant of the BPG, for the processes assessed. Given the extensive 'armchair'-based and real-application-based reviews of the BPG, one can be confident that the BPG provides reasonable coverage of the "best software process practices" domain.

To further ensure content validity, it is necessary that all assessment instruments include questions that adequately sample from the content domain. Another SPICE document, the Assessment Instrument, prescribes guidelines for creating SPICE-conformant assessment instruments. Furthermore, an exemplar assessment instrument is scheduled for development as part of the SPICE project. Both of these will undergo the same validation procedure as the BPG to ensure that SPICE-conformant assessment instruments have a high level of content validity.

3.3 Criterion-Related Validity

With criterion-related validity one attempts to determine the magnitude of the relationship (using a correlation coefficient or some other measure of association) between the score an organization obtains in a SPICE-conformant assessment and some other criterion. Two criteria of interest to SPICE are: (a) performance measures (for example, number of software defects post-release, productivity, cost per line of code, user satisfaction, etc.), and (b) other measures of software process capability (for example, those based on the SEI's CMM or Bootstrap). These two criteria differentiate between two types of criterion-related validity, respectively: (a) predictive validity, and (b) concurrent validity.
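As a minimal illustration of the kind of analysis involved (with invented numbers, and with Spearman rank correlation chosen by us as one reasonable measure of association rather than anything prescribed by SPICE), the sketch below relates hypothetical capability scores to a hypothetical performance criterion.

    from scipy.stats import spearmanr

    # Hypothetical data for eight organizations: a capability score from an assessment
    # and post-release defect density (defects per KLOC) as a performance criterion.
    capability = [1.0, 1.5, 2.0, 2.0, 3.0, 3.5, 4.0, 4.5]
    defect_density = [6.1, 5.8, 4.9, 5.2, 3.0, 2.7, 1.9, 1.5]

    rho, p_value = spearmanr(capability, defect_density)
    print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
    # A strong negative association would count as initial evidence of predictive validity;
    # the absence of an association is much harder to interpret, as discussed in the text.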

3.3.1 Predictive Validity

Perhaps the most important type of validity of concern to the software process community is predictive validity. For instance, in the context of CMM-based assessments, Hersh [36] states "despite our own firm belief in process improvement and our intuitive expectation that substantial returns will result from moving up the SEI scale - we still can't prove it" (although there is now some initial evidence [33]). Also, Fenton [27] notes that evaluating the validity of the SEI's process maturity scheme is a key contemporary research issue.

There have been some recent efforts at empirically investigating the predictive validity of software process assessments, including those based on the CMM, such as [33][35]. The SPICE trials are attempting to build upon this previous work. However, we face serious methodological difficulties and constraints on available resources.

Performance measures used in validity studies can be objective or subjective^7. An example of an objective measure is lines of code per hour. An example of a subjective measure is user satisfaction.

Ideally, performance measures would be gathered consistently across all organizations involved in a validity study. All measures would be defined in the same way (e.g., line of code counting procedures), and measurement instruments would be sufficiently generic to be applicable to all organizations and their businesses. While we think it is tractable, this requirement will present considerable difficulty in the SPICE trials, especially since the trials are being conducted globally.

If a strong relationship is found between SPICE-conformant assessment scores and performance measures, then this would provide strong initial evidence supporting SPICE validity. However, if weak or no relationships are identified, then we have an interpretation problem. Finding no empirical relationship can be interpreted in at least three ways: (a) the empirical study was flawed, (b) the hypothesized relationship between the assessment score and the performance measure is wrong, and/or (c) SPICE-conformant assessments do not measure best software process practices. Thus, if no relationships are identified, then great caution should be taken in drawing conclusions from a predictive validity study.

Another problem with predictive validity studies is identifying appropriate measures of performance. Two types of measures can be discerned: (a) project effectiveness measures, and (b) organizational effectiveness measures.

7 The classification of metrics as either objective or subjective is common in software engineering; for example, see [7]. This classification is therefore used here.

Project effectiveness measures evaluate the outcomes of a single project or a phase of a single project. Organizational effectiveness measures evaluate the performance of a whole software organization. Possible measures of both types are presented below.

A recent review by Krasner [43] of the payoff from software process improvement identifies a number of possible project effectiveness measures. These include productivity, cost of rework, defect density, early defect detection rate, number of defects discovered by the customer, cost per line of code, and predictability of costs and schedules. Another study, conducted by the SEI [35], utilized similar measures. Rozum [57] also reviews a number of possible measures, including mean time between failures and availability.

A difficulty with attempting to collect performance data is that projects that have low scores on SPICE-conformant assessments generally will not collect or maintain such data, and hence they would have to be excluded from a validity study. This would reduce the variation in the performance measure, and thus artificially reduce the validity coefficients.

An alternative approach is to collect subjective data at the time a validity study is conducted. A recent study investigating the relationship between software processes and project performance [19] utilized Likert-type scales to measure perceived software quality and perceptions of meeting (schedule and budget) targets. A survey investigating the relationship between process maturity and performance utilized Likert-type scales to measure product quality, productivity, meeting schedule and budget targets, staff morale, and perceptions of customer satisfaction [33]. Another study, investigating the relationship between organizational maturity and requirements engineering success [26], utilized a requirements engineering success instrument [25]. Davis [18] has developed an instrument to measure the perceived usefulness and ease of use of software. Rozum [57] considers customer satisfaction, and Krishnan [44], in his study of the relationship between software product and service characteristics and customer satisfaction, measured customer satisfaction using a five-point ordinal scale. Where the domain of analysis is business information systems, commonly used measures of Information Systems success include system usage (empirical studies where system usage was measured include [56][60]) and user satisfaction [20].

An obvious approach for measuring organizational effectiveness is to aggregate project effectiveness measures for a representative sample of projects in a software organization. Alternatively, where the domain of analysis is business information systems, a number of possible organizational effectiveness measures can be used. For example, Mahmood and Soon [47] have defined and operationalized a set of strategic variables that are potentially affected by Information Technology (IT) and can be used to evaluate the impact of IT on an entire organization. More commonly used measures are those of overall use of Information Systems (e.g., see [5]) and of user information satisfaction with the Information Systems function [4][38]. However, measures of organizational effectiveness are relatively controversial. Miller [49] notes that "It appears futile to search for a precise measure or set of measures of [Information Systems] effectiveness that will be common across all organizations. Criteria for effectiveness in a single organization can be expected to vary with changing value structures, levels in the organization and phases in organizational growth".

3.3.2 Concurrent Validity

Concurrent validity is of interest if we want to answer the question: "are scores obtained from SPICE-conformant assessments related to the scores obtained using other software process assessment methods?" According to common expectations, SPICE scores should be highly related to the scores on other assessment methods.

This expectation could be tested by first identifying organizations that have recently undergone, say, a Bootstrap-based assessment. Subsequently they would undergo a SPICE-conformant assessment. The correlation coefficient (or some other measure of association) would be computed between the two scores. If the magnitude of the association is found to be high, then this provides evidence that SPICE-conformant assessments are measuring the same thing as the other method(s).

If the association is high, and if one makes the assumption that the other, non-SPICE methods are measuring best software process practices, then one has strong evidence for the validity of SPICE. However, given that there is little scientific evidence to that effect (i.e., that the other methods are indeed measuring best software process practices), this approach to validation would not be too convincing.

Furthermore, a concurrent validity study would face the same difficulties as a test-retest study. These difficulties were mentioned in section 2: high expense, difficulty in interpreting low validity coefficients, and carry-over effects.

3.4 Construct Validity

Construct validity^8 is an operational concept that asks whether two or more SPICE-conformant assessments measure the same concept (best software process practices), and whether they can differentiate between the different dimensions of that concept (e.g., best engineering practices, best customer-supplier practices, etc.). These distinctions are often referred to as convergent validity and discriminant validity respectively [11].

Convergent and discriminant validities can be evaluated using the MultiTrait-MultiMethod (MTMM) matrix [11]. The basic idea behind the MTMM approach is to assess several organizations on two or more dimensions using two or more different methods. The obtained data are analyzed in a matrix.

An MTMM matrix is shown in Figure 5. In this matrix there are two measurement methods (e.g., two SPICE-conformant assessment methods) and three dimensions (e.g., A: best engineering practices, B: best customer-supplier practices, and C: best support practices).

An MTMM matrix has three types of coefficients: reliability coefficients, convergent validity coefficients, and discriminant validity coefficients. Reliability coefficients have been discussed earlier in this paper.

Convergent validity coefficients are correlations between measures of the same dimension using different measurement methods. Evidence of high convergent validity exists when these correlations are high. Convergent validity indicates that the scores are less likely to be artifacts of the chosen measurement procedure.

8 The exact definition of construct validity and the procedures for construct validation tend to vary across the literature. For instance, Bagozzi [3] considers providing evidence of what we have defined here as reliability to be part of construct validation. The way we have defined and grouped concepts and methods in this paper, however, is the one more commonly found in the literature.

Figure 5: The MultiTrait-MultiMethod matrix approach for evaluating construct validity.

Discriminant validity coefficients can be either the correlations between measures of different dimensions using the same method of measurement (referred to as heterotrait-monomethod coefficients), or correlations between different dimensions using different methods of measurement (referred to as heterotrait-heteromethod coefficients). Evidence of high discriminant validity exists when these correlations are lower than the reliability and convergent validity coefficients. Discriminant validity indicates that the methods of measurement can differentiate between the different dimensions.
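The sketch below uses simulated scores, with two methods and three dimensions mirroring the Figure 5 example, to show how the three families of MTMM coefficients could be computed and compared; the labels and the data-generation scheme are illustrative assumptions only.

    import numpy as np

    rng = np.random.default_rng(2)
    n_orgs = 30
    dims = ["A", "B", "C"]          # e.g., engineering, customer-supplier, support practices
    methods = ["M1", "M2"]          # e.g., two SPICE-conformant assessment methods

    # Simulated "true" dimension scores plus method-specific measurement error.
    true_scores = rng.normal(size=(n_orgs, len(dims)))
    measures = {(m, d): true_scores[:, j] + rng.normal(scale=0.5, size=n_orgs)
                for m in methods for j, d in enumerate(dims)}

    def corr(a, b):
        return np.corrcoef(measures[a], measures[b])[0, 1]

    # Convergent validity: same dimension, different methods.
    convergent = {d: corr(("M1", d), ("M2", d)) for d in dims}
    # Heterotrait-monomethod: different dimensions, same method.
    hetero_mono = {(m, d1, d2): corr((m, d1), (m, d2))
                   for m in methods for d1 in dims for d2 in dims if d1 < d2}
    # Heterotrait-heteromethod: different dimensions, different methods.
    hetero_hetero = {(d1, d2): corr(("M1", d1), ("M2", d2))
                     for d1 in dims for d2 in dims if d1 != d2}

    print("convergent:", {d: round(v, 2) for d, v in convergent.items()})
    print("heterotrait-monomethod A-B (M1):", round(hetero_mono[("M1", "A", "B")], 2))
    print("heterotrait-heteromethod A-B:", round(hetero_hetero[("A", "B")], 2))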

In the context of SPICE, an MTMM approach would necessitate multiple assessments of the same organization. As mentioned earlier, it is already difficult to find sponsorship for a single assessment. Furthermore, several organizations would have to be assessed on the same dimensions. This may be difficult since the priorities of participating organizations are likely to differ.

Another approach for evaluating convergent and discriminant validities is through factor analysis [51]. Factor analysis is a multivariate technique for 'clustering' variables. These variables would be the questions on multiple dimensions. If the emerging 'clusters' match the dimensions, then this is evidence of construct validity. The logic behind using factor analysis is that variation among a number of questions that form a cluster can be attributed to variation among organizations on one common underlying factor (e.g., best customer-supplier practices).
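A minimal sketch of this idea, using simulated questionnaire responses and scikit-learn's FactorAnalysis as one possible implementation (the number of factors, the loading structure, and the data are assumptions made only for illustration):

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(3)
    n_orgs, n_items_per_dim, n_dims = 60, 4, 3

    # Simulate three underlying dimensions, each driving four questionnaire items plus noise.
    latent = rng.normal(size=(n_orgs, n_dims))
    loadings = np.zeros((n_dims, n_dims * n_items_per_dim))
    for d in range(n_dims):
        loadings[d, d * n_items_per_dim:(d + 1) * n_items_per_dim] = 0.9
    items = latent @ loadings + rng.normal(scale=0.4, size=(n_orgs, n_dims * n_items_per_dim))

    fa = FactorAnalysis(n_components=n_dims, random_state=0)
    fa.fit(items)
    # Each item should load mainly on one factor; clusters of items matching the intended
    # dimensions would be taken as (initial) evidence of construct validity.
    print(np.round(fa.components_, 2))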

4 Discussion

While we have presented some realistic approaches for conducting reliability and validity studies, it must be noted that there exist limitations on the SPICE trials, and that some tradeoffs are necessary in the design and conduct of the studies. Limitations and tradeoffs are not unique to SPICE, but are of equal concern to any scientist conducting empirical research in software engineering.

4.1 Limitations

Participation in the SPICE trials by experienced assessors and by organizations is on a voluntary basis. Although those participating are expected to gain many benefits from their efforts, it would not be prudent during the trials planning to assume a sufficiently large number of participants. Thus, it is assumed that sample sizes will probably be small, at least for the initial phases of the SPICE trials.

When reliability and validity coefficients are estimated from small samples, sampling errors are relatively large. This means that the statistical power^9 of the inferential procedures used to analyze the data from the studies is likely to be substantially reduced.

One approach to increasing statistical power in such a situation is to use parametric as opposed to non-parametric tests, since these are, in general, more powerful [42]. However, parametric tests assume specific distributions in the data. Given the nature of software process assessment data, it is expected that they will not be 'well-distributed'. Cohen [14], however, suggests the use of non-parametric tests only under extreme violations of the assumptions of the parametric tests. Some of these assumptions can be tested once more SPICE trials data are collected.
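To give a feel for the small-sample problem, the following sketch (a simulation under assumed conditions, not an analysis of SPICE data) estimates the power of a simple two-sided Pearson correlation test when the true association is moderately strong and only fifteen organizations are available.

    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(4)

    def estimated_power(true_rho: float, n: int, alpha: float = 0.05, trials: int = 2000) -> float:
        # Monte Carlo estimate of the power of a two-sided Pearson correlation test.
        cov = [[1.0, true_rho], [true_rho, 1.0]]
        rejections = 0
        for _ in range(trials):
            x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
            _, p = pearsonr(x, y)
            rejections += p < alpha
        return rejections / trials

    # With a true correlation of 0.5 and 15 organizations, the estimated power falls well
    # below the conventional 0.8 target, illustrating the limitation discussed above.
    print(round(estimated_power(0.5, 15), 2))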

4.2 Tradeoffs

As is common in many empirical research studies, a necessary tradeoff exists in the selection of a particular empirical research strategy (e.g., field experiments vs. surveys vs. laboratory experiments, etc.). McGrath [48] makes the point clearly: "all research strategies are 'bad' (in the sense of having serious methodological limitations); none of them are 'good' (in the sense of being even relatively unflawed). So, methodological discussions should not waste time arguing about which is the right strategy, or the best one; they are all poor in an absolute sense".

One possible approach for alleviating such concerns is to follow a multimethod empirical research strategy. The logic of the multimethod strategy is [10] "to attack a research problem with an arsenal of methods that have nonoverlapping weaknesses in addition to their complementary strengths". This strategy effectively addresses monomethod bias in the results of a study. A multimethod strategy is being considered as part of the SPICE trials planning, although that strategy is constrained by limited resources.

9 Statistical power is defined as the probability that a statistical test will correctly reject the null hypothesis.


Rarely does a single research project satisfy a complete multimethod strategy, but the collective results of an emerging discipline can draw on multiple methods. This is an important point that requires strong emphasis. Software engineering researchers should conduct their own empirical studies, using different methods, on the reliability and validity of SPICE-conformant assessments subsequent to the SPICE trials. It is through such external studies that one can gain greater confidence in the results of the SPICE trials. Furthermore, diversity in the empirical research methods used within a scientific discipline is itself a sign of the discipline's maturity [13].

5 Conclusions

In this paper we have presented some important theoretical and practical considerations pertinent to the trials phase of the SPICE project. These considerations are driven by two of the objectives of the trials: to determine (a) the reliability, and (b) the validity of SPICE-conformant assessments. Of course, these considerations and the questions they raise are equally relevant to other software process assessment methods in existence today.

For the reliability objective, we presented two theoretical frameworks for estimating reliability: (a) the classical test theory framework, and (b) the generalizability theory framework. For the validity objective, we presented three types of validation approaches: (a) content validation, (b) criterion-related validation, and (c) construct validation. In addition, we discussed reliability estimation and validation methods and their applicability in the context of the SPICE trials.

We have also attempted to paint a realistic picture of the constraints and limitations of the SPICE trials. These constraints and limitations are important because they will shape what eventually emerges from the SPICE trials, and the degree of confidence one can place in their results.

The SPICE trials will not provide "conclusive proof" of the extent of reliability and validity of SPICE-conformant assessments. The trials can provide some initial evidence based on well designed and executed applied research. The burden is upon the software engineering research community to replicate and extend the trials' studies, and to support or disconfirm their findings.

Acknowledgements

The authors wish to thank Jerome Pesant and members of the Software Engineering Laboratory at McGill University for their comments on an earlier version of this paper.

References

[1] D. Adams, R. Nelson, and P. Todd: "Perceived usefulness, ease of use, and usage of information technology: A replication". In MIS Quarterly, pages 227-247, June 1992.

[2] K. Amoako-Gyampah and K. White: "User involvement and user satisfaction: An exploratory contingency model". In Information and Management, 25:1-10, 1993.

[3] R. Bagozzi: Causal Models in Marketing, John Wiley, 1980.

[4] J. Bailey and S. Pearson: "Development of a tool for measuring and analyzing computer user satisfaction". In Management Science, 29(5):530-545, May 1983.

[5] J. Baroudi, M. Olson, and B. Ives: "An empirical study of the impact of user involvement on system usage and information satisfaction". In Communications of the ACM, 29(3):232-238, March 1986.

[6] V. Basili: "The experimental paradigm in software engineering". In Experimental Software Engineering Issues: Critical Assessment and Future Directions, H. D. Rombach, V. Basili, and R. Selby (eds.), Springer-Verlag, 1993.

[7] V. Basili and D. Rombach: "The TAME Project: Towards Improvement-Oriented Software Environments". In IEEE Transactions on Software Engineering, 14(6):758-773, June 1988.

[8] J. Besselman, P. Byrnes, C. Lin, M. Paulk, and R. Puranik: "Software Capability Evaluations: Experiences from the field". In SEI Technical Review, 1993.

[9] G. Bohrnstedt: "Reliability and validity assessment in attitude measurement". In Attitude Measurement, G. Summers (ed.), Rand McNally and Company, 1970.

[10] J. Brewer and A. Hunter: Multimethod Research: A Synthesis of Styles, Sage Publications, 1989.

[11] D. Campbell and D. Fiske: "Convergent and discriminant validation by the multitrait-multimethod matrix". In Psychological Bulletin, pages 81-105, March 1959.

[12] D. Card: "Capability evaluations rated highly variable". In IEEE Software, pages 105-106, September 1992.

[13] M. Cheon, V. Grover, and R. Sabherwal: "The evolution of empirical research in IS: A study in IS maturity". In Information and Management, 24:107-119, 1993.

[14] J. Cohen: "Some statistical issues in psychological research". In Handbook of Clinical Psychology, B. Wolman (ed.), McGraw-Hill, 1965.

[15] L. Cronbach: "Test validation". In Educational Measurement, R. Thorndike (ed.), American Council on Education, 1971.

[16] L. Cronbach: "Coefficient alpha and the internal consistency of tests". In Psychometrika, pages 297-334, September 1951.

[17] L. Cronbach, G. Gleser, H. Nanda, and N. Rajaratnam: The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles, John Wiley, 1972.

[18] F. Davis: "Perceived usefulness, perceived ease of use, and user acceptance of information technology". In MIS Quarterly, pages 319-340, September 1989.

[19] C. Deephouse, D. Goldenson, M. Kellner, and T. Mukhopadhyay: "The effects of software processes on meeting targets and quality". In Proceedings of the Hawaiian International Conference on Systems Sciences, vol. 4, pages 710-719, January 1995.

[20] W. Doll and G. Torkzadeh: "The measurement of end-user computing satisfaction". In MIS Quarterly, pages 259-274, June 1988.

[21] A. Dorling: "SPICE: Software Process Improvement and Capability dEtermination". In Information and Software Technology, 35(6/7):404-406, June/July 1993.

[22] J-N Drouin: "Software quality - An international concern". In Software Process, Quality & ISO 9000, 3(8):1-4, August 1994.

[23] J-N Drouin: "The SPICE project: An overview". In Software Process Newsletter, IEEE Computer Society, No. 2, pages 8-9, Winter 1995.

[24] G. Dunn: Design and Analysis of Reliability Studies: The Statistical Evaluation of Measurement Errors, Oxford University Press, 1989.

[25] K. El Emam and N. H. Madhavji: "Measuring the success of requirements engineering processes". In Proceedings of the Second IEEE International Symposium on Requirements Engineering, pages 204-211, 1995.

[26] K. El Emam and N. H. Madhavji: "The reliability of measuring organizational maturity". Submitted for publication, 1995.

[27] N. Fenton: "Objectives and context of measurement/experimentation". In Experimental Software Engineering Issues: Critical Assessment and Future Directions, H. D. Rombach, V. Basili, and R. Selby (eds.), Springer-Verlag, 1993.

[28] N. Fenton, B. Littlewood, and S. Page: "Evaluating software engineering standards and methods". In Software Engineering: A European Perspective, R. Thayer and A. McGettrick (eds.), IEEE Computer Society Press, pages 463-470, 1993.

[29] N. Fenton and S. Page: "Towards the evaluation of software engineering standards". In Proceedings of the Software Engineering Standards Symposium, pages 100-107, 1993.

[30] N. Fenton, S-L Pfleeger, S. Page, and J. Thornton: "The SMARTIE standards evaluation methodology". Technical Report (available from the Centre for Software Reliability, City University), 1994.

[31] T. Frick and M. Semmel: "Observer agreement and reliabilities of classroom observational measures". In Review of Educational Research, 48:157-184, 1978.

[32] D. Galletta and A. Lederer: "Some cautions on the measurement of user information satisfaction". In Decision Sciences, 20:419-438, 1989.

[33] D. Goldenson and J. Herbsleb: "What happens after the appraisal? A survey of process improvement efforts". Paper presented at the 1995 SEPG Conference, 1995.

[34] J. Guilford: Psychometric Methods, McGraw-Hill, 1954.

[35] J. Herbsleb, A. Carleton, J. Rozum, J. Siegel, and D. Zubrow: "Benefits of CMM-Based Software Process Improvement: Initial Results". Technical Report, CMU/SEI-94-TR-13, Software Engineering Institute, August 1994.

[36] A. Hersh: "Where's the Return on Process Improvement?". In IEEE Software, page 12, July 1993.

[37] W. Humphrey and B. Curtis: "Comments on 'A Critical Look'". In IEEE Software, pages 42-46, July 1991.

[38] B. Ives, M. Olson, and J. Baroudi: "The measurement of user information satisfaction". In Communications of the ACM, 26(10):785-793, 1983.

[39] Japan SC7 WG10 SPICE Committee: "Report of Japanese trial process assessment by SPICE method". A SPICE Project Report, 1994.

[40] F. Kerlinger: Foundations of Behavioral Research, Holt, Rinehart, and Winston, 1986.

[41] M. Konrad: "On the horizon: An international standard for software process improvement". In Software Process Improvement Forum, pages 6-8, September/October 1994.

[42] H. Kraemer and S. Thiemann: How Many Subjects? Statistical Power Analysis in Research, Sage Publications, 1987.

[43] H. Krasner: "The Payoff for Software Process Improvement (SPI): What it is and how to get it". In Software Process Newsletter, IEEE Computer Society, No. 1, pages 3-8, September 1994.

[44] M. Krishnan: "Software product and service design for customer satisfaction: An empirical analysis". Technical Report, Graduate School of Industrial Administration, Carnegie Mellon University, 1993.

[45] F. Lord and M. Novick: Statistical Theories of Mental Test Scores, Addison-Wesley, 1968.

[46] P. Lukowicz, E. Heinz, L. Prechelt, and W. Tichy: "Experimental evaluation in computer science: A quantitative study". Technical Report 17/94, Department of Informatics, University of Karlsruhe, 1994.

[47] M. Mahmood and S. Soon: "A comprehensive model for measuring the potential impact of information technology on organizational strategic variables". In Decision Sciences, 22(4):869-897, 1991.

[48] J. McGrath: "Dilemmatics: The study of research choices and dilemmas". In Judgement Calls in Research, J. McGrath, J. Martin, and R. Kulka (eds.), Sage Publications, 1982.

[49] J. Miller: "Information systems effectiveness: The fit between business needs and system capabilities". In Proceedings of the 10th International Conference on Information Systems, pages 273-288, 1989.

[50] J. Neter, W. Wasserman, and M. Kutner: Applied Linear Statistical Models: Regression, Analysis of Variance, and Experimental Designs, Irwin, 1990.

[51] J. Nunnally: Psychometric Theory, McGraw-Hill, 1978.

[52] M. Paulk and M. Konrad: "Measuring process capability versus organizational process maturity". In Proceedings of the 4th International Conference on Software Quality, 1994.

[53] M. Paulk and M. Konrad: "ISO seeks to harmonize numerous global efforts in software process management". In Computer, pages 68-70, April 1994.

[54] S-L Pfleeger: "The language of case studies and formal experiments". In Software Engineering Notes, pages 16-20, October 1994.

[55] S-L Pfleeger, N. Fenton and S. Page: "Evaluating software engineering standards". In Computer, pages 71-79, September 1994.

[56] D. Robey: "User attitudes and management information system use". In Academy of Management Journal, 22(3):527-538, September 1979.

[57] J. Rozum: "Concepts on measuring the benefits of software process improvements". Technical Report, CMU/SEI-93-TR-009, Software Engineering Institute, 1993.

[58] V. Sethi and W. King: "Construct measurement in information systems research: An illustration in strategic systems". In Decision Sciences, 22:455-472, 1991.

[59] The SPICE Project: Baseline Practices Guide (draft), 1994.

[60] A. Srinivasan: "Alternative measures of system effectiveness: Associations and implications". In MIS Quarterly, 9(3):243-253, September 1985.

[61] A. Subramanian and S. Nilakanta: "Measurement: A blueprint for theory-building in MIS". In Information and Management, 26:13-20, 1994.

[62] P. Tait and I. Vessey: "The effect of user involvement on system success: A contingency approach". In MIS Quarterly, pages 91-108, March 1988.

[63] G. Torkzadeh and W. Doll: "The test-retest reliability of user involvement instruments". In Information and Management, 26:21-31, 1994.

Appendix

This appendix provides some general guidelines for increasing the reliability of software process assessment methods. These guidelines are intended for those developing and/or using assessment methods and/or frameworks.

1 Standardize Assessment Procedures
The procedures used for an assessment must be standardized, and individual assessments must follow them closely to ensure consistency. In the case of assessment instruments, instructions concerning their purpose and how to determine responses and judge scores should be provided. In the case of interviews, the conduct of the interviews (e.g., assurance of confidentiality and the type and scope of evidence inspected) should be defined.

2 Training of Assessors
Assessors should be trained in the assessment procedure and should have extensive experience with software development and maintenance. Furthermore, there should be consistency in the qualifications of the assessors following a particular assessment procedure.

3 Increasing the Length of the Assessment Instrument
Reliability estimates utilize the assessment scores. The more questions asked about the capability of an organization, the more likely it is that the reliability estimates will increase. Of course, if the added questions have nothing to do with maturity, then increasing the length of the instrument may reduce reliability. However, it is assumed that added questions are chosen as carefully as the original questions and that they will not reduce the average inter-item correlations.

4 Defined Sampling Criteria
In assessment procedures where a sample of an organization's projects is assessed, and these projects are used as an indicator of overall organizational capability, specific sampling criteria should be specified. These sampling criteria should be applied consistently in all assessments claiming to follow a particular assessment procedure.

5 Using Multiple Point Scales
Determining the number of points on a scale involves a tradeoff between losing some of the discriminative power of which the assessors are capable (with too few points) and having a scale that is too fine, and hence beyond the assessors' powers of discrimination (with too many points). In general, it has been found that reliability increases as the number of points increases from 2 to 7, after which it tends to level off [34][51].

6 Having a Validation Cycle
Such a cycle involves validating the information that the assessors have initially gathered. This may involve corroborative interviews and (further) inspections of documents. This would seem to be more important where the assessors are external to the assessed organization and when there is a danger of misrepresentation by the assessees.