
IEEE TRANSACTIONS ON RELIABILITY, VOL. 56, NO. 2, JUNE 2007

A Multi-Objective Software Quality Classification Model Using Genetic Programming

    Taghi M. Khoshgoftaar, Member, IEEE, and Yi Liu, Member, IEEE

Abstract: A key factor in the success of a software project is achieving the best-possible software reliability within the allotted time and budget. Classification models which provide a risk-based software quality prediction, such as fault-prone and not fault-prone, are effective in providing a focused software quality assurance endeavor. However, their usefulness largely depends on whether all the predicted fault-prone modules can be inspected or improved by the allocated software quality-improvement resources, and on the project-specific costs of misclassifications. Therefore, a practical goal of calibrating classification models is to lower the expected cost of misclassification while providing a cost-effective use of the available software quality-improvement resources.

This paper presents a genetic programming-based decision tree model which facilitates a multi-objective optimization in the context of the software quality classification problem. The first objective is to minimize the Modified Expected Cost of Misclassification, which is our recently proposed goal-oriented measure for selecting and evaluating classification models. The second objective is to optimize the number of predicted fault-prone modules such that it is equal to the number of modules which can be inspected by the allocated resources. Some commonly used classification techniques, such as logistic regression, decision trees, and analogy-based reasoning, are not suited for directly optimizing multi-objective criteria. In contrast, genetic programming is particularly suited for the multi-objective optimization problem. An empirical case study of a real-world industrial software system demonstrates the promising results, and the usefulness of the proposed model.

Index Terms: Cost of misclassification, genetic programming, multi-objective optimization, software faults, software metrics, software quality estimation.

NOTATION

Type I error    an error of misclassifying a nfp module as a fp module
Type II error   an error of misclassifying a fp module as a nfp module
R, or Red       a fp module
G, or Green     a nfp module
C_I             cost of a Type I error
C_II            cost of a Type II error
c               cost ratio of a Type II error over a Type I error, i.e., c = C_II / C_I
N_I             # of Type I errors
N_II            # of Type II errors
N_max           the maximum number of modules which can be tested
N_i             # of modules classified as members of class i, where i can be either R, or G
N_{i,j}         # of modules of class i predicted as members of class j, where i, and j, can be either R, or G
N               # of the modules in the data set
x               a vector of independent variables
f_i             an objective function
k               number of objective functions to be considered
x*              a Pareto-optimum
q               size of the Pareto-optima set
F               fitness value for each individual
g               generation
P_g             the Pareto-optima set in the g-th generation
P*              the set of Pareto-optima solutions
f               fitness function
λ               the ratio of N_max over N
p               the ratio of N_R over N
# of times the source file was inspected prior to the system test release
# of lines for the source file prior to the coding phase
# of lines of code for the source file prior to the system test release
LOCA            # of lines of commented code for the source file prior to the coding phase
# of lines of commented code for the source file prior to the system test release
the ratio of … over …

Manuscript received August 21, 2003; revised December 18, 2003 and April 8, 2004; accepted April 30, 2004. This work was supported in part by the NSF grant CCR-9970893. Associate Editor: M. Xie.

T. M. Khoshgoftaar is with the Empirical Software Engineering Laboratory, Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, FL 33431 USA (e-mail: [email protected]).

Y. Liu is with Georgia College & State University, Milledgeville, GA 31061 USA (e-mail: [email protected]).

Digital Object Identifier 10.1109/TR.2007.896763

ACRONYM¹

    ECM expected cost of misclassification

    fp fault-prone

    GP genetic programming

    MECM modified expected cost of misclassification

    nfp not fault-prone

    SQA software quality assurance

    SQC software quality classification

    STGP strongly typed genetic programming

¹The singular and plural of an acronym are always spelled the same.


    I. INTRODUCTION

IN THE context of software quality and reliability improvement endeavors, the software quality assurance (SQA) team is often faced with the difficult task of working with finite and limited resources. The team aims to make the best of available software inspection and quality improvement resources. A logical approach for achieving cost-effective software quality improvement is a risk-based targeting of software modules depending on their predicted quality [1]. Software metrics-based quality classification models have proven their usefulness for timely software quality improvements [2]-[4].

Typically, software modules are categorized by a classification model into two risk-based groups, such as fault-prone (fp), and not fault-prone (nfp) [5], [6]. A classification model is calibrated, or fitted, using software metrics and fault data collected from a previously developed system release, or a similar software project. Subsequently, the fitted classification model can be applied to estimate the quality of modules which are currently under development. However, existing software quality classification (SQC) techniques do not consider the limited resource-availability factor during their model-calibration process.

The usefulness of an SQC model depends on whether enough resources are available to inspect all modules predicted as fp. For example, if a model predicts 30% of the modules as fp, but available resources can only inspect 10% of the modules, then how does one choose which of the predicted fp modules to inspect? It is therefore important that a resource-based SQC model is calibrated². Moreover, because it is unrealistic to obtain a model which yields perfect classification, i.e., one in which all predicted fp modules are actually fp, a classification technique should aim to minimize the expected cost of misclassification in the context of the software system and application domain [7]. In addition to resource usage and the misclassification costs, other factors also need to be considered, such as model simplicity, model interpretation, etc.

We have mentioned that, to calibrate a goal- and objective-oriented classification model, multiple criteria have to be optimized. However, commonly used classification techniques, such as logistic regression [8] and decision trees [5], cannot simultaneously attain a multi-objective optimization during their modeling process. Modeling complicated engineering problems, such as software quality prediction, with traditional mathematical optimization methods is practically infeasible [9].

This paper presents a genetic programming (GP)-based decision tree model that is capable of obtaining a multi-objective optimization. As a member of the evolutionary computational methods [10]-[13], GP has been explored to solve some multi-objective optimization problems [14], [15]. This study is a continuation of our recent research efforts with GP-based software quality classification models [16], [17], and focuses on the simultaneous optimization of decision trees with respect to the following two factors:

²Some classifiers, such as count models, provide a probability that a module has a given number of faults. However, they are not suited for a multi-objective optimization problem such as that being addressed in this paper.

1) the Modified Expected Cost of Misclassification (MECM) [18] (see Section II-A), and

2) the number of predicted fp modules, such that it is equal to the number which can be inspected by the allocated software quality improvement resources.

Although MECM contains a provision to penalize models whose number of fp modules is greater than that which can be tested by the available resources, GP can still find several models which have very similar, or the same, MECM values, but a different number of predicted fp modules. Hence, given two such models, GP will give preference, based on the second objective, to the one which provides approximately the same number of fp modules as that required by the available resources. Moreover, to accelerate the run times of GP, and to yield a relatively simple model, a third optimization factor, i.e., minimizing the size of the decision tree, is introduced in our GP-based decision tree modeling process. A method based on GP for finding the solutions in the Pareto set [9] is proposed when building the classification model.

Assessing the usefulness of a classification model based solely on its misclassification error rates, i.e., Type I and Type II, is inappropriate because of the disparate misclassification costs associated with the individual error types.³ In an earlier study [19], we investigated the application of the Expected Cost of Misclassification (ECM) as a singular unified model-evaluation measure which can be used to incorporate the project-specific misclassification costs. Though ECM-based model evaluation demonstrates effective results, it does not reflect the performance of the model in the context of allocated resources. More specifically, such an approach is based on the assumption that the project has enough resources to inspect all the modules predicted as fp. To overcome this limitation of ECM, in a recent study we proposed an improved version called MECM [18], which facilitates achieving resource-based SQC models for a given resource allocation.

The basic functionality of MECM is that it penalizes a classification model, in terms of the costs of misclassifications, if the model predicts more fp modules than can be inspected by the allocated resources. Therefore, at the time of model calibration, the SQA team can provide information regarding how many modules can be inspected and improved with the allotted resources, and the approximate costs of misclassifications. Estimating the actual costs of misclassifications at the time of modeling is a difficult problem. However, an educated approximation is usually made based on heuristic software engineering knowledge gained from previously developed similar software projects. Based on the project-specific knowledge of available resources, and the costs of misclassifications, the MECM measure can be used to yield a resource-based SQC model. Consequently, the best possible and practical usage of the available resources can be achieved.
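The equation-level details of MECM are given in [18] and in Section II, which is not fully reproduced in this transcript. Purely as a conceptual sketch, under our own naming and assumptions, the Python fragment below illustrates the kind of resource-aware penalty described above: an ECM-style average misclassification cost, plus an extra charge whenever the model predicts more fp modules than the N_max that the allocated resources can inspect.

```python
def ecm(n_type1, n_type2, c_type1, c_type2, n_modules):
    # ECM-style measure: average misclassification cost per module,
    # weighting Type I and Type II error counts by their costs.
    return (c_type1 * n_type1 + c_type2 * n_type2) / n_modules

def mecm_like(n_type1, n_type2, c_type1, c_type2, n_modules,
              n_predicted_fp, n_max):
    # Conceptual resource-aware variant (NOT the published MECM formula):
    # each predicted-fp module beyond N_max draws an additional cost
    # penalty, because it cannot be inspected with the allocated resources.
    excess = max(0, n_predicted_fp - n_max)
    return (ecm(n_type1, n_type2, c_type1, c_type2, n_modules)
            + c_type1 * excess / n_modules)
```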

The empirical case study used to illustrate the GP-based multi-objective SQC model consists of software metrics and fault data collected from two embedded software applications from the wireless telecommunications industry. Other systems were also

³A Type I error occurs when a nfp module is misclassified as fp, whereas a Type II error occurs when a fp module is misclassified as nfp.


This factor corresponds to a decrease in the expected cost of misclassification. The appropriate MECM for this case is given by (4).

III. GENETIC PROGRAMMING

An inherent advantage of GP is that it can evolve a solution automatically from the training data [10], and does not require an assumption regarding the mathematical model of the structure, or the size, of the decision tree-based solution. The evolutionary process of GP attempts to imitate the Darwinian principle of survival of the fittest individuals. The fitness value of an individual is an indicator of its quality and, hence, provides a probability of which individuals will be selected for mating, and the subsequent reproduction of the next generation.

We note that an in-depth discussion of GP is avoided here; however, some key elements of GP are briefly explained [10]. GP uses the basic units of the associated problem to assemble each individual. The basic units may include function sets, and terminal sets. The typical structure of each individual can be seen as a tree-shaped structure, and each individual may be unique. There are three main operators for GP: reproduction, crossover, and mutation. Each uses random processing on one or more individuals. These three operators are used by GP for a fitness-based evolution. The fitness factor is a measure, used by GP during its simulated evolution, of how well a program (individual) has learned to yield the correct output based on the given inputs.

Decision tree-based classification models have been recognized as useful tools for data mining purposes [21]. This study is a continuation of our previous efforts [16], [17], in which an approach for building GP-based decision tree models for classifying software modules as either fp or nfp was presented. The commonly used standard GP requires that the function set and the terminal set have closure properties, i.e., all functions in the function set must accept all kinds of data types and data values as function arguments [22]. However, this requirement does not guarantee that GP will generate a useful decision tree model.

Montana recognized this problem, and proposed the Strongly Typed Genetic Programming (STGP) approach [23]. It relaxes the closure property requirement of standard GP by introducing additional criteria for genetic operations. More specifically, given a precise description of the permissible data types for function arguments, STGP will only generate individuals based on the constraint that the arguments of all functions are of the correct type [23]. We only discuss the modifications made to standard GP in order to build a decision tree model for software quality classification purposes.

1) Constraint: Each function and terminal is assigned a type. Different types may not cross over, or mutate, under certain problem-specific constraints. A leaf node is a function which returns the class membership of a module, and the decision nodes (non-leaf nodes) are simple logical equations which return either true, or false. Hence, only constants and independent variables appear in the decision nodes. The function which returns the class membership of a module is not used in the internal nodes. Moreover, a root node can be either a leaf node, or an internal node.

2) Crossover: In this stage, additional limitations are applied to the genetic operation. In the context of GP-based decision trees, we define the type of a subtree as the type of its root node. Therefore, when two subtrees are selected for crossover, they are required to have the same type, so that a proper decision tree is generated (see the sketch after this list).

3) Mutation: If a subtree is selected for mutation purposes, the replacement tree must have the same type, or at least a similar type. A subtree of a similar type is one which, when used as a replacement, i.e., mutation, yields a new tree which is a permissible, or proper, decision tree. For example, the leaf node is a function which returns the class membership of a module, and can be replaced by a new subtree whose root node is a logical equation.
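A minimal Python sketch of the type constraint during crossover follows. The node classes, function names, and subtree-picking logic are our own illustrative assumptions; this is not the lilgp implementation used in the study.

```python
import random

class Leaf:
    """Terminal: returns the class membership (fp or nfp) of a module."""
    def __init__(self, label):
        self.label = label

class Test:
    """Decision node: a simple logical equation on one software metric."""
    def __init__(self, metric, threshold, left, right):
        self.metric, self.threshold = metric, threshold
        self.left, self.right = left, right

def slots(node):
    """Enumerate (parent, attribute) positions of all replaceable subtrees."""
    found = []
    if isinstance(node, Test):
        for attr in ("left", "right"):
            found.append((node, attr))
            found.extend(slots(getattr(node, attr)))
    return found

def typed_crossover(a, b, rng=random):
    """Swap two randomly chosen subtrees only when their root nodes have
    the same type, so both offspring remain proper decision trees."""
    slots_a, slots_b = slots(a), slots(b)
    if not slots_a or not slots_b:          # single-leaf trees: nothing to swap
        return a, b
    pa, sa = rng.choice(slots_a)
    pb, sb = rng.choice(slots_b)
    ta, tb = getattr(pa, sa), getattr(pb, sb)
    if type(ta) is type(tb):                # the STGP type constraint
        setattr(pa, sa, tb)
        setattr(pb, sb, ta)
    return a, b
```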

    A. Multi-Objective Optimization

A multi-objective optimization solution usually aims at obtaining a set of Pareto-optima [9], which represents feasible solutions for which the solution to a single objective cannot be improved without sacrificing the solution for one or more of the other criteria. Let x be a vector of independent variables, f_i(x) be an objective function, and k be the number of objective functions to be considered for optimization. Assuming all objectives are to be minimized, a vector x* is a Pareto-optimum iff there is no vector x which exists with the characteristics

$$f_i(\mathbf{x}) \leq f_i(\mathbf{x}^*) \;\; \forall i \in \{1,\ldots,k\}, \quad \text{and} \quad f_j(\mathbf{x}) < f_j(\mathbf{x}^*) \;\; \text{for at least one } j.$$

In the case of all non-Pareto-optima vectors, the solution for at least one objective function, f_j, can be improved without sacrificing the solutions for any of the other objective functions. The most frequently used methods for generating Pareto-optima are based on the notion of replacing the multi-objective problem with a parameterized scalar problem [14]. Typically, by varying the value of each parameter for each objective, it is possible to generate all, or parts, of the Pareto-optima set.
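As a concrete rendering of this definition (a minimal sketch; the function and variable names are ours), the following Python routine extracts the non-dominated subset from a list of candidate objective vectors, all objectives being minimized:

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (minimization):
    a is no worse than b in every objective, and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and \
           any(x < y for x, y in zip(a, b))

def pareto_set(solutions):
    """Return the Pareto-optimal members of `solutions`, where each entry
    is a tuple of objective values, e.g. (MECM, resource_penalty, size)."""
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions if t is not s)]

# Hypothetical candidates: (MECM, resource penalty, tree size).
models = [(10.0, 0.1, 7), (12.0, 0.0, 5), (10.0, 0.1, 9), (15.0, 0.4, 4)]
print(pareto_set(models))   # (10.0, 0.1, 9) is dominated and dropped
```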

The multi-objective optimization problem addressed in this study includes three objectives:

1) to minimize the modified expected cost of misclassification (MECM);

2) to obtain the number of modules predicted as fp such that it is equal to the number of modules which can be inspected with the available software quality improvement resources; and

3) to minimize the size of the decision tree model.

The third objective addresses the important issue of simplicity, and ease in model interpretation and comprehension. In addition, limiting the size of the decision tree assists in the acceleration of GP run times.

The optimization method adopted in our study is based on the Non-dominated Sorting Genetic Algorithm, as proposed by Srinivas and Deb [24]. Because minimizing the MECM value is the most important objective, sorting by the first objective ensures that it is the least likely to be violated. On the other hand, the third objective, i.e., minimizing the size of the decision tree, is the one most likely to be violated during the process of each run. Let q be the size of the Pareto-optima set, F be the fitness


value for each individual, g be the generation, P_g be the Pareto-optima set in the g-th generation, and P* be the set of Pareto-optima solutions we are interested in after the given run. At the beginning of a run, g = 1, F = 1, and P* is an empty set.

Each GP run consists of the following steps:

Step 1: Sort the population according to the increasing values of the first objective. If several individuals have the same values, then order them successively by the increasing values of the second, and third, objectives.

Step 2: Select all non-dominated individuals from the population, based on their sorted order. Assign F to the fitness of the non-dominated individuals. Then increase F by 1.

Step 3: The individuals which were selected and assigned a fitness in the previous step are now ignored. Repeat the process of selecting the non-dominated individuals by returning to Step 2. If all individuals have already been selected, then continue to Step 4.

Step 4: Save the first q individuals to P_g as the Pareto-optima set of the current generation, and compare each solution in P_g with each solution of P*. The first q individuals among P_g and P* are saved into P* again.

When the computation is completed, each individual will be selected to breed according to the probability defined by the fitness value obtained from the above steps. Subsequent to the last generation, P* will represent the best models of the run.
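Reusing dominates() from the earlier Pareto sketch, the following Python fragment is our reconstruction of Steps 1-3: a lexicographic sort on the three objectives, followed by repeatedly peeling off the non-dominated layer and assigning it the current fitness value F. The names and data layout are assumptions for illustration.

```python
def rank_population(population, objectives):
    """Assign layer-based fitness values per Steps 1-3 (a reconstruction).

    `objectives(ind)` returns (MECM, resource_penalty, tree_size), all to
    be minimized. Individuals in the first non-dominated layer get F = 1,
    the next layer F = 2, and so on; a lower F means a better layer."""
    # Step 1: sort by the first objective, ties broken by the second, third.
    remaining = sorted(population, key=objectives)
    fitness, F = {}, 1
    while remaining:
        # Step 2: take the currently non-dominated individuals.
        layer = [s for s in remaining
                 if not any(dominates(objectives(t), objectives(s))
                            for t in remaining if t is not s)]
        for s in layer:
            fitness[id(s)] = F
        # Step 3: ignore the assigned individuals and repeat.
        remaining = [s for s in remaining if id(s) not in fitness]
        F += 1
    return fitness
```

Step 4 would then merge the first q individuals of the generation's Pareto set P_g into the retained set P*.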

    B. Fitness Functions

1) Minimizing MECM: A lower MECM value represents a preferred classification model. Because misclassification costs are specific to the software project and the development organization, the SQA team can realize an educated approximation of the cost ratio, i.e., c = C_II / C_I, based on heuristic software engineering knowledge gained from previously developed similar software projects. The applied penalty of misclassification is defined as follows: if a nfp module is misclassified as fp, a penalty of C_I will be applied to the fitness of that particular classification model. By the same token, if a fp module is misclassified as nfp, a penalty of C_II will be applied to the respective fitness value. The fitness function for this objective is given by (5).

2) Resource availability: The aim here is to penalize a model if it predicts a surplus (or deficit) of fp modules relative to the maximum number, i.e., N_max, that can be inspected or tested by the available resources. Therefore, if the total number of modules predicted as fp is equal to N_max, then the fitness function is equated to zero, implying that there is no penalty. Otherwise, the fitness function is given by (6), where p = N_R / N, and λ = N_max / N. A lower value of (6) implies a better performance with regard to the second objective.

3) Size of decision tree: We define the size of a decision tree by its number of nodes; the smaller the size, the better the fitness for the third objective. We select a threshold value of five, implying that if the tree size is less than five, then the fitness of that specific classification tree is five. The minimum size of the tree was empirically set to five nodes in order to prevent any loss of diversity in the GP population.
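Equations (5) and (6) are not reproduced in this transcript, so the sketch below only mirrors the penalty logic described in items 1)-3), with an assumed normalization in the resource term; it is not the published formulation.

```python
def fitness_mecm(n_type1, n_type2, c_type1, c_type2):
    """Objective 1 (cf. (5)): add C_I for each nfp module misclassified
    as fp, and C_II for each fp module misclassified as nfp."""
    return c_type1 * n_type1 + c_type2 * n_type2

def fitness_resource(n_predicted_fp, n_max):
    """Objective 2 (cf. (6)): no penalty when exactly N_max modules are
    predicted as fp; otherwise penalize the relative mismatch (assumed)."""
    if n_predicted_fp == n_max:
        return 0.0
    return abs(n_predicted_fp - n_max) / n_max

def fitness_size(n_nodes, floor=5):
    """Objective 3: tree size in nodes, floored at five to avoid losing
    diversity in the GP population."""
    return max(n_nodes, floor)
```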

IV. EMPIRICAL CASE STUDY

    A. System Description

The case study involved data collection efforts from two large Windows-based embedded system applications used for customizing the configuration of wireless telecommunications products. The two C++ applications provide similar functionality, and contain common source code. The primary difference is the type of wireless product that each supports. Both systems comprised over 1400 source code files, and each contained several million lines of code. Software metrics were obtained by observing the configuration management systems, while the problem reporting systems tracked and recorded the status of different problems. Information, such as how many times a source file was inspected prior to system tests, was recorded. The obtained software metrics reflected aspects of source files, and therefore, a software module for these systems consisted of a source file.

The fault data collected represent the faults discovered during system tests. Upon preprocessing and cleaning the software data, 1211 modules remained, and were used for model calibration. Data preprocessing primarily included the removal of observations with missing information or incomplete data. The decision to remove certain modules from the data set was based on our discussions with the development team. Among the 1211 modules considered for modeling, over 66% (809) were observed to have no faults, while the remaining 402 modules had one or more faults.

The five software metrics used for this case study are: the number of times the source file was inspected prior to the system test release; the number of lines for the source file prior to the coding phase; the number of lines of code for the source file prior to the system test release; the number of lines of commented code for the source file prior to the coding phase (LOCA); and the number of lines of commented code for the source file prior to the system test release. The available data collection tools determined the number and selection of the metrics. The product metrics used are statement metrics for the source files. They primarily indicate the number of lines of source code prior to the coding phase (i.e., auto-generated code), and just before system tests. The process metric, i.e., the inspection count, was obtained from the problem reporting systems.

The module-quality metric, i.e., the dependent variable, is the number of faults observed during system test. The SQC models are dependent on the chosen threshold value (of the quality metric), which identifies modules as either fp or nfp. The main stimulus for selecting the appropriate threshold value is to build the most


useful and system-relevant SQC model possible. Therefore, the selection of the threshold value is usually dependent on the software quality improvement needs of the development team. Because software metrics data from subsequent releases were not available, an impartial data splitting was applied to the data set to obtain the fit and test data sets. Consequently, the fit, and test, data sets had 807, and 404, modules, respectively.

The classification models for this case study are calibrated to classify a software module as either fp or nfp based on a threshold of two faults; i.e., if a module has two or more faults, it is categorized as fp, and as nfp otherwise. According to the selected threshold value, the fit data set has 632 nfp modules, and 175 fp modules, whereas the test data set has 317 nfp modules, and 87 fp modules. The selection of a threshold value for the number of faults is specific to a given software project.
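As a small illustration of the labeling and splitting described above (the paper's impartial splitting procedure is not specified in detail here, so a seeded random split with the stated sizes is assumed):

```python
import random

def label(fault_count, threshold=2):
    """Two-fault threshold: a module with two or more faults is fp."""
    return "fp" if fault_count >= threshold else "nfp"

def impartial_split(modules, fit_size=807, seed=0):
    """Shuffle the cleaned modules and split into fit (807) and test (404)."""
    pool = list(modules)
    random.Random(seed).shuffle(pool)
    return pool[:fit_size], pool[fit_size:]

# Hypothetical usage: modules as (metrics, fault_count) records.
modules = [({"LOCA": 4}, 3), ({"LOCA": 1}, 0)] * 606   # placeholder data
fit, test = impartial_split(modules)
fit_labels = [label(faults) for _, faults in fit]
```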

    B. Empirical Settings

The modeling tool used for our empirical studies is lilgp (version 1.01), developed by D. Zongker and B. Punch of Michigan State University. It is implemented in C, and is based on the LISP works of J. Koza [22]. When applying lilgp to a GP application, each individual is organized as a decision tree in which a node is a C function pointer. The execution speed of lilgp is faster than that of interpreted LISP.

The below-mentioned modeling methodology is adopted for calibrating software quality classification models with a multi-objective optimization. The procedure is based on the given pair of inputs: 1) how many modules can be inspected or tested according to the available resources, and 2) the project-specific cost ratio.

1) Divide the modules in the fit and test data sets into two classes, i.e., fp and nfp, according to the chosen threshold value mentioned in the previous section.

2) Build a GP-based multi-objective classification model according to the procedure discussed in Sections III, III-A, and III-B.

3) Compute the quality-of-fit performance of the classification model based on the quality estimation of the modules in the fit data set, the number of predicted fp modules, and the size of the decision tree. The accuracy of a model indicates how many of the actual fp modules are predicted as fp according to the allocated resources, i.e., (7), where λ is the ratio of the number of modules which can be inspected by the available resources to the total number of modules in the respective data set. The equation represents the inspection efficiency for the given amount of resources.

4) Apply the classification model to the test data set to evaluate its predictive performance. Validating the model on an independent data set can provide an indication regarding the accuracy of the classification model if it were applied to a currently under-development system release (or similar software project) with known software metrics, but unknown quality data.

TABLE I
PARAMETERS FOR GP

The independent variables are the five software metrics described earlier, whereas the dependent variable is the class membership based on the number of faults observed during system test. Because the actual cost of misclassifications is usually unknown at the time of modeling, it is beneficial to the project management team to calibrate classification models for a range of cost ratios which are likely to suit the project's needs. For example, we consider different values for the cost ratio, denoted as c. The cost ratios considered in our study are 2, 5, 10, 17.5, 25, and 50. Based on our empirical software engineering knowledge, this set of values covers a broad spectrum of possible cost ratios. However, from a practical software engineering point of view, the cost ratios of 2 and 5 are very unlikely for the embedded systems being modeled in this study.

In each generation, our GP process will automatically select the five best models (parameter output.bestn) from the Pareto set it finds. If the Pareto set contains more than 5 models according to its non-domination order, the models will be sorted by using the relative importance of the objectives. The first five (from the top) models, which have lower MECM values, will be presented at the end of the GP run. We performed 20 runs for each combination of cost ratio c, and available resources λ; thus, 100 models were recorded for each combination of c, and λ. The best decision tree (lowest MECM value for fit data) which appears during the 20 runs is selected as the preferred classification model for the given combination.

In the context of GP-based modeling, certain parameters have to be assigned for a given set of input conditions. In our study, the parameters that were used for each value of c are listed in Table I. Some of the basic parameters, such as the depth (init.depth)


of the initial GP population, and the method (init.method) for generating the initial population, were assigned the default values provided by John Koza. Other parameters, such as population size (pop_size), maximum number of generations (max_generations), maximum depth of the population (max_depth), crossover rate, reproduction rate, and mutation rate, were empirically varied during our study. The selection method for the three genetic operators is fitness-proportional selection. The function set only contains four functions (C1, and C2, represent the nfp, and fp classes, respectively) which are essential for building a binary decision tree model.
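For concreteness, a hypothetical lilgp-style input file is sketched below. Only the parameter names quoted in this section (pop_size, max_generations, max_depth, init.depth, init.method, output.bestn) come from the paper; the values shown, and the breed-phase lines for the crossover, reproduction, and mutation rates, are our assumptions about lilgp's input format, not the settings of Table I.

```
# Hypothetical lilgp parameter file (illustrative values only).
pop_size = 500
max_generations = 100
init.method = half_and_half
init.depth = 2-6
max_depth = 17
output.bestn = 5
breed_phases = 3
breed[1].operator = crossover, select=fitness     # fitness-proportional
breed[1].rate = 0.80
breed[2].operator = reproduction, select=fitness
breed[2].rate = 0.10
breed[3].operator = mutation, select=fitness
breed[3].rate = 0.10
```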

We observed that our GP process was not sensitive to changes in the above-mentioned parameters, as long as a reasonable range of values was used, such as a population size of 500, a number of generations of 100, and a mutation rate of 0.0. We note that a detailed analytic study to find an optimal combination of the GP parameters is out of scope for this study.

    V. DISCUSSION

The performance of the preferred classification models for the cost ratios of 2, and 10, are presented in Tables II and III, respectively. The results for the other four cost ratios are not presented, due to the similarity of the empirical conclusions, and paper-size considerations. The first column, λ, indicates the fraction of the total number of modules which can be inspected or tested according to the available software quality improvement resources. The second, and third, columns indicate the Type I, and Type II, misclassification error rates, respectively. The fourth, and fifth, columns respectively indicate the number of modules predicted as fp, and the number of modules that can be inspected, i.e., N_max. The sixth column indicates the modified expected cost of misclassification values for the respective classification models. The last column represents the performance accuracy of the classification model as defined by (7), i.e., the percentage of the predicted fp modules which are actually fp.

As an illustration, consider the classification model shown in the first row of Table II, i.e., for c = 2, and λ = 0.05. In the case of the fit data set, the model predicts 41 modules as fp. When comparing this number to N_max = 40, we observe that the model performs exceptionally well. Moreover, the performance of the model is perfect, implying that all modules predicted as fp are actually fp. Hence, we see that the quality-of-fit of this model is excellent. The Type II error rate is very large because the model is forced to optimize the predicted number of fp modules to 40, implying that the remaining actual fp modules are misclassified. Let us now examine the predictive capability of this model, i.e., its performance on the test data set. We observe that 23 modules are classified as fp, as compared to N_max = 20, implying the model performs very well. In addition, the performance implies that over 90% of the predicted fp modules are actually fp.

TABLE II
CLASSIFICATION MODELS FOR c = 2

Fig. 1. GP decision tree model for c = 10, and λ = 0.40.

An example binary decision tree model obtained for the case study is presented in Fig. 1. The model corresponds to a cost ratio of 10, and a resource availability of 0.40. The leaf nodes are labeled as either fp, or nfp. A non-leaf tree node shows a specific software metric, and its threshold, which is used to identify the subsequent traversal path. For example, the root node indicates that if LOCA (the number of lines of comments for the module prior to the coding phase) is greater than 3, then the right subtree


of the root node is traversed, where other conditions are tested; otherwise, the module is classified as nfp.

TABLE III
CLASSIFICATION MODELS FOR c = 10
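Fig. 1 itself is not reproduced in this transcript. As an illustration of how such a decision tree classifies a module, the Python sketch below hard-codes the quoted root test (LOCA > 3); the right-subtree conditions and the other metric names are hypothetical placeholders, not the published model.

```python
def classify(module):
    """Traverse a Fig. 1-style GP decision tree for one module.

    `module` maps metric names to values. Only the LOCA > 3 root test is
    taken from the text; the remaining tests are invented examples."""
    if module["LOCA"] > 3:
        # Right subtree: further (hypothetical) conditions are tested.
        if module["INSP"] >= 2 and module["LOC_ST"] > 500:
            return "fp"
        return "nfp"
    return "nfp"    # otherwise the module is classified as nfp

print(classify({"LOCA": 5, "INSP": 3, "LOC_ST": 800}))   # -> fp
```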

The performance details of the GP model for the fit and test data sets can be observed in Table III. The table indicates that as λ increases, the accuracy performance decreases. Moreover, as λ increases, we observe an inverse relationship between the Type I, and Type II, error rates; i.e., as λ increases, the Type I error rate increases while the Type II error rate decreases. For the different λ values, the number of modules predicted as fp by the model is very close to the number of modules which can be inspected, suggesting an optimized solution with respect to resource utilization. Similar observations were also made from the performances for the other cost ratios. An intuitive analysis of why the performance decreases with λ is now presented.

Assume that modules are scattered across a spectrum with respect to their predicted class, such that the left end represents the least faulty software module, while the right end represents the most faulty module. Based on this assumption, the actual fp modules are likely to be concentrated towards the right; the actual nfp modules are likely to be concentrated towards the left; and the middle portion of the spectrum will be comprised of an intermixed collection of fp and nfp modules. It is at this middle portion of the spectrum that the misclassification errors are more likely to occur, primarily because of data points that do not follow the general trend of the data set. Therefore, for inspection purposes, for a given amount of resources, the needed modules would be picked starting from the right end of the spectrum.

When λ = 0.05, it was observed that, for both the fit and test data sets, the accuracy performance was very good, i.e., 90% to 100%, across all the cost ratios. This is analogous to selecting 5% of the modules starting from the right end of the spectrum, which are more likely to be actually fp. On the other hand, at the largest λ value considered, we observed that the performance was only about 40% to 48% for all the cost ratios. The performance reduction is intuitive because, when a larger percentage of modules can be inspected, many nfp modules will invariably be flagged as fp (as per the spectrum analogy presented earlier), thus lowering the performance of the model, and correspondingly increasing the Type I error rate. The relationship of a decrease in accuracy with an increase in the Type I error rate is seen in the tables. Moreover, when comparing the MECM values of the quality-of-fit (fit data), and predictive-quality (test data) performances of the different classification models, we note that the respective models do not demonstrate over-fitting tendencies, and generally maintain the achieved quality-of-fit performance.

In a related empirical case study of the software system presented in Section IV-A, we applied our GP-based classification technique to build classification models which predict modules as either change-prone, or not change-prone. The module-quality metric (dependent variable) in that study was the number of lines of code churn, which was defined as the summation of the number of lines added, deleted, or modified during system test. The empirical observations and conclusions made from that study were similar to those presented in this paper. Due to the similarity of results, and paper-size concerns, we have not included those results in this paper.

    VI. CONCLUSION

This study presents a genetic programming-based multi-objective optimization modeling technique for calibrating a goal-oriented software quality classification model geared toward cost-effective resource utilization. Using case studies of two wireless configuration applications, software quality classification models are calibrated. The effective multi-objective optimization capability of GP is demonstrated through a representative case study in which models were calibrated to predict software modules as either fault-prone, or not fault-prone.

Multi-objective optimization is often a practical need in many real-world problems, such as software quality estimation modeling. An advantage of GP is that it does not require extensive mathematical assumptions about the size, and structure, of the optimization problem. In software engineering, the importance of making the best of the limited software inspection and testing resources is well founded. In addition to effective resource utilization, the project-specific costs of misclassifications also affect the usefulness of classification models.

Building on our previously developed GP-based software quality classification modeling technique, this study focuses on calibrating models which optimize the following criteria
