IEEE TRANSACTIONS ON RELIABILITY, VOL. 56, NO. 2, JUNE 2007 237
A Multi-Objective Software Quality Classification Model Using Genetic Programming
Taghi M. Khoshgoftaar, Member, IEEE, and Yi Liu, Member, IEEE
Abstract: A key factor in the success of a software project is achieving the best-possible software reliability within the allotted time and budget. Classification models which provide a risk-based software quality prediction, such as fault-prone and not fault-prone, are effective in providing a focused software quality assurance endeavor. However, their usefulness largely depends on whether all the predicted fault-prone modules can be inspected or improved by the allocated software quality-improvement resources, and on the project-specific costs of misclassifications. Therefore, a practical goal of calibrating classification models is to lower the expected cost of misclassification while providing a cost-effective use of the available software quality-improvement resources.
This paper presents a genetic programming-based decision tree model which facilitates a multi-objective optimization in the context of the software quality classification problem. The first objective is to minimize the Modified Expected Cost of Misclassification, which is our recently proposed goal-oriented measure for selecting and evaluating classification models. The second objective is to optimize the number of predicted fault-prone modules such that it is equal to the number of modules which can be inspected by the allocated resources. Some commonly used classification techniques, such as logistic regression, decision trees, and analogy-based reasoning, are not suited for directly optimizing multi-objective criteria. In contrast, genetic programming is particularly suited for the multi-objective optimization problem. An empirical case study of a real-world industrial software system demonstrates the promising results, and the usefulness of the proposed model.
Index Terms: Cost of misclassification, genetic programming, multi-objective optimization, software faults, software metrics, software quality estimation.
NOTATION
Type I error an error of misclassifying an nfp module as an fp module
Type II error an error of misclassifying an fp module as an nfp module
R, or Red fp module
G, or Green nfp module
cost of a Type I error
cost of a Type II error
cost ratio of a Type II error over a Type I error
# of Type I errors
Manuscript received August 21, 2003; revised December 18, 2003 and April 8, 2004; accepted April 30, 2004. This work was supported in part by the NSF grant CCR-9970893. Associate Editor: M. Xie.
T. M. Khoshgoftaar is with the Empirical Software Engineering Laboratory, Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, FL 33431 USA (e-mail: [email protected]).
Y. Liu is with Georgia College & State University, Milledgeville, GA 31061 USA (e-mail: [email protected]).
Digital Object Identifier 10.1109/TR.2007.896763
# of Type II errors
the maximum number of modules which can be tested
# of modules classified as members of a given class, where the class can be either R, or G
# of modules of one class predicted as members of another class, where each class can be either R, or G
# of modules in the data set
a vector of independent variables
an objective function
number of objective functions to be considered
a Pareto-optimum
size of the Pareto-optima set
fitness value for each individual
generation
the Pareto-optima set in a given generation
the set of Pareto-optima solutions
fitness function
the ratio of the cost of a Type II error over the cost of a Type I error
the ratio of the number of modules which can be inspected by the available resources over the total number of modules
# of times the source file was inspected prior to the system test release
# of lines for the source file prior to the coding phase
# of lines of code for the source file prior to the system test release
# of lines of commented code for the source file prior to the coding phase
# of lines of commented code for the source file prior to the system test release
ACRONYM1
ECM expected cost of misclassification
fp fault-prone
GP genetic programming
MECM modified expected cost of misclassification
nfp not fault-prone
SQA software quality assurance
SQC software quality classification
STGP strongly typed genetic programming
1The singular and plural of an acronym are always spelled the same.
0018-9529/$25.00 2007 IEEE
I. INTRODUCTION
In the context of software quality and reliability improvement endeavors, the software quality assurance (SQA) team is often faced with the difficult task of working with finite and limited resources. The team aims to make the best of available software inspection and quality improvement resources. A logical approach for achieving cost-effective software quality improvement is a risk-based targeting of software modules depending on their predicted quality [1]. Software metrics-based quality classification models have proven their usefulness for timely software quality improvements [2]–[4].
Typically, software modules are categorized by a classification model into two risk-based groups, such as fault-prone (fp), and not fault-prone (nfp) [5], [6]. A classification model is calibrated or fitted using software metrics, and fault data collected from a previously developed system release, or similar software project. Subsequently, the fitted classification model can be applied to estimate the quality of modules which are currently under development. However, existing software quality classification (SQC) techniques do not consider the limited resource-availability factor during their model-calibration process.
The usefulness of an SQC model depends on whether enough resources are available to inspect all modules predicted as fp. For example, if a model predicts 30% of the modules as fp, but available resources can only inspect 10% of the modules, then how does one choose which of the predicted fp modules to inspect? It is therefore important that a resource-based SQC model is calibrated.2 Moreover, because it is unrealistic to obtain a model which yields perfect classification, i.e., all predicted fp modules are actually fp, a classification technique should aim to minimize the expected cost of misclassification in the context of the software system and application domain [7]. In addition to resource usage and the misclassification costs, other factors also need to be considered, such as model simplicity, model interpretation, etc.
We have mentioned that, to calibrate a goal- and objective-oriented classification model, multiple criteria have to be optimized. However, commonly used classification techniques, such as logistic regression [8] and decision trees [5], cannot simultaneously attain a multi-objective optimization during their modeling process. Modeling complicated engineering problems, such as software quality prediction, with traditional mathematical optimization methods is practically infeasible [9].
This paper presents a genetic programming (GP)-based decision tree model that is capable of obtaining a multi-objective optimization. As a member of the evolutionary computational methods [10]–[13], GP has been explored to solve some multi-objective optimization problems [14], [15]. This study is a continuation of our recent research efforts with GP-based software quality classification models [16], [17], and focuses on the simultaneous optimization of decision trees with respect to the following two factors:
2Some classifiers, such as count models, provide a probability that a module has a given number of faults. However, they are not suited for a multi-objective optimization problem such as that being addressed in this paper.
1) Modified Expected Cost of Misclassification (MECM) [18] (see Section II-A), and
2) number of predicted fp modules, such that it is equal to the number which can be inspected by the allocated software quality improvement resources.
Although MECM contains a provision to penalize the models whose number of fp modules is greater than that which can be tested by the available resources, GP can still find several models which have very similar, or the same, MECM values, but have a different number of predicted fp modules. Hence, given two such models, GP will give preference, based on the second objective, to the one which provides approximately the same number of fp modules as that required by available resources. Moreover, to accelerate the run times of GP, and yield a relatively simple model, a third optimization factor, i.e., minimizing the size of the decision tree, is introduced in our GP-based decision tree modeling process. A method based on GP for finding the solutions in the Pareto set [9] is proposed when building the classification model.
Assessing the usefulness of a classification model based solely on its misclassification error rates, i.e., Type I and Type II, is inappropriate because of the disparate misclassification costs associated with the individual error types.3 In an earlier study [19], we investigated the application of the Expected Cost of Misclassification (ECM) as a singular unified model-evaluation measure which can be used to incorporate the project-specific misclassification costs. Though ECM-based model evaluation demonstrates effective results, it does not reflect the performance of the model in the context of allocated resources. More specifically, such an approach is based on the assumption that the project has enough resources to inspect all the modules predicted as fp. To overcome this limitation of ECM, in a recent study we proposed an improved version called MECM [18], which facilitates achieving resource-based SQC models for a given resource allocation.
The basic functionality of MECM is that it penalizes a classification model, in terms of the costs of misclassifications, if the model predicts more fp modules than the number which can be inspected by the allocated resources. Therefore, at the time of model calibration, the SQA team can provide information regarding how many modules can be inspected and improved with the allotted resources, and the approximate costs of misclassifications. Estimating the actual costs of misclassifications at the time of modeling is a difficult problem. However, an educated approximation is usually made based on heuristic software engineering knowledge gained from previously developed similar software projects. Based on the project-specific knowledge of available resources, and the costs of misclassifications, the MECM measure can be used to yield a resource-based SQC model. Consequently, the best possible and practical usage of the available resources can be achieved.
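The ECM measure of [19], on which MECM builds, is in essence a cost-weighted average of the two misclassification error counts. The Python sketch below illustrates only that underlying idea; the exact MECM penalty formula of [18] is not reproduced here, and the function and argument names are illustrative, not the authors' notation.

```python
def ecm(n_type1, n_type2, cost1, cost2, n_total):
    """Expected cost of misclassification, sketched per [19]:
    a cost-weighted average of Type I and Type II error counts
    over all modules in the data set."""
    return (cost1 * n_type1 + cost2 * n_type2) / n_total

# Example: 10 Type I errors, 4 Type II errors, a cost ratio of 10
# (only the ratio matters, so cost1 is set to 1), 400 modules.
print(ecm(10, 4, 1, 10, 400))  # -> 0.125
```

Because only the cost ratio is usually approximated in practice, setting the Type I cost to 1 and the Type II cost to the ratio is a convenient normalization.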
The empirical case study used to illustrate the GP-based multi-objective SQC model consists of software metrics, and fault data collected from two embedded software applications from the wireless telecommunications industry. Other systems were also
3A Type I error occurs when an nfp module is misclassified as fp, whereas a Type II error occurs when an fp module is misclassified as nfp.
This factor corresponds to a decrease in the expected cost
of misclassification.
The appropriate MECM for this case is given by
(4)
III. GENETIC PROGRAMMING
An inherent advantage of GP is that it can evolve a solution automatically from the training data [10], and does not require an assumption regarding the mathematical model of the structure or the size of the decision tree-based solution. The evolutionary process of GP attempts to imitate the Darwinian principle of survival of the fittest individuals. The fitness value of an individual is an indicator of its quality, and hence, provides a probability of which individuals can be selected for mating, and the subsequent reproduction of the next generation.
We note that an in-depth discussion of GP is avoided. However, some key elements of GP are briefly explained [10]. GP uses the basic units of the associated problem to assemble each individual. The basic units may include function sets, and terminal sets. The typical structure of each individual can be seen as a tree-shaped structure, and each individual may be unique. There are three main operators for GP: reproduction, crossover, and mutation. Each uses random processing on one or more individuals. These three operators are used by GP for a fitness-based evolution. The fitness factor is a measure used by GP during its simulated evolution of how well a program (individual) has learned to yield the correct output based on the given inputs.
Decision tree-based classification models have been recognized as useful tools for data mining purposes [21]. This study is a continuation of our previous efforts [16], [17] in which an approach for building GP-based decision tree models for classifying software modules either as fp or nfp was presented. The commonly used standard GP requires that the function set, and the terminal set have closure properties, i.e., all functions in the function set must accept all kinds of data types and data values as function arguments [22]. However, this requirement does not guarantee that GP will generate a useful decision tree model.
Montana recognized this problem, and proposed the Strongly Typed Genetic Programming (STGP) approach [23]. It relaxes the closure property requirement of standard GP by introducing additional criteria for genetic operations. More specifically, given a precise description of the permissible data types for function arguments, STGP will only generate individuals based on the constraint that the arguments of all functions are of the correct type [23]. We only discuss the modifications made to standard GP in order to build a decision tree model for software quality classification purposes.
1) Constraint: Each function and terminal is assigned a type. Different types may not crossover or mutate under certain problem-specific constraints. The leaf node is a function which returns the class membership of a module, and the decision nodes (non-leaf nodes) are simple logical equations which return either true, or false. Hence, only constants and independent variables appear in the decision nodes. The function which returns the class membership of a module is not used in the internal nodes. Moreover, a root node can be either a leaf node, or an internal node.
2) Crossover: In this stage, additional limitations are applied to the genetic operation. In the context of GP-based decision trees, we define that the type of a subtree is the type of its root node. Therefore, when two subtrees are selected for crossover, they are required to have the same type so that a proper decision tree is generated.
3) Mutation: If a subtree is selected for mutation purposes, the replacement tree must have the same type, or at least a similar type. A subtree of a similar type is one which, when used as a replacement, i.e., mutation, yields a new tree which is a permissible or proper decision tree. For example, the leaf node is a function which returns the class membership of a module, and can be replaced by a new subtree whose root node is a logical equation.
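The typing constraint above can be illustrated with a small sketch. The node representation and names below are hypothetical, not taken from lilgp or [23]: 'bool' marks decision (internal) nodes and 'class' marks leaf nodes returning fp/nfp membership.

```python
class Node:
    """Hypothetical decision tree node: carries a type tag,
    'bool' for decision nodes or 'class' for leaf nodes."""
    def __init__(self, ntype, children=()):
        self.ntype = ntype
        self.children = list(children)

def crossover_allowed(subtree_a, subtree_b):
    """STGP-style constraint: the type of a subtree is the type
    of its root node, and two subtrees may be swapped in a
    crossover only when their types match."""
    return subtree_a.ntype == subtree_b.ntype

leaf = Node('class')                       # returns fp/nfp membership
decision = Node('bool', [leaf, Node('class')])
print(crossover_allowed(leaf, decision))       # False: class vs bool
print(crossover_allowed(leaf, Node('class')))  # True: same type
```

A mutation check would use the same test, extended to accept a "similar" type whenever the replacement still yields a permissible tree.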
A. Multi-Objective Optimization
A multi-objective optimization solution usually aims at obtaining a set of Pareto optima [9], which represents feasible solutions where the solution for a single objective cannot be improved without sacrificing the solution for one (or more) other criteria.
Let x be a vector of independent variables, f_i(x) be an objective function, and k be the number of objective functions to be considered for optimization. For a minimization problem, a vector x* is a Pareto optimum iff there is no vector x which exists with the characteristics
f_i(x) <= f_i(x*) for all i in {1, ..., k}, and f_j(x) < f_j(x*) for at least one j.
In the case of all non-Pareto-optimal vectors, the solution for at least one objective function can be improved without sacrificing the solutions for any of the other objective functions. The most frequently used methods for generating Pareto optima are based on the notion of replacing the multi-objective problem with a parameterized scalar problem [14]. Typically, by varying the value of each parameter for each objective, it is possible to generate all or parts of the Pareto-optima set.
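For a minimization problem, the dominance test and the extraction of the non-dominated set can be sketched as follows (function names are illustrative):

```python
def dominates(f_a, f_b):
    """f_a dominates f_b (minimization): no worse on every
    objective, and strictly better on at least one."""
    return (all(a <= b for a, b in zip(f_a, f_b))
            and any(a < b for a, b in zip(f_a, f_b)))

def pareto_front(points):
    """Keep only the objective vectors not dominated by any
    other vector in the collection."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

pts = [(1, 5), (2, 2), (3, 1), (3, 3)]
print(pareto_front(pts))  # -> [(1, 5), (2, 2), (3, 1)]
```

Here (3, 3) is dominated by (2, 2), while the remaining three vectors trade one objective against the other and are therefore all Pareto-optimal.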
The multi-objective optimization problem addressed in this study includes three objectives:
1) to minimize the modified expected cost of misclassification (MECM),
2) to obtain the number of modules predicted as fp such that it is equal to the number of modules which can be inspected with the available software quality improvement resources, and
3) to minimize the size of the decision tree model.
The third objective addresses the important issue of simplicity, and ease in model interpretation and comprehension. In addition, limiting the size of the decision tree assists in the acceleration of GP run times.
The optimization method adopted in our study is based on the Non-dominated Sorting Genetic Algorithm, as proposed by Srinivas and Deb [24]. Because minimizing the MECM value is the most important objective, sorting by the first objective ensures that it is the least likely to be violated. On the other hand, the third objective, i.e., minimizing the size of the decision tree, is the one most likely to be violated during the process of each run. Let n be the size of the Pareto-optima set, r be the fitness
value for each individual, g be the generation, P_g be the Pareto-optima set in the g-th generation, and P* be the set of Pareto-optima solutions we are interested in after the given run. At the beginning of a run, r = 1, g = 0, and P* is an empty set.
Each GP run consists of the following steps:
Step 1: Sort the population according to the increasing values of the first objective. If several individuals have the same values, then order them successively by the increasing values of the second, and third objectives.
Step 2: Select all non-dominated individuals from the population based on their sorted order. Assign r to the fitness of the non-dominated individuals. Then increase r by 1.
Step 3: The individuals which were selected and assigned a fitness in the previous step are ignored. Repeat the process of selecting the non-dominated individuals, and return to Step 2. If all individuals have already been selected, then continue to Step 4.
Step 4: Save the first n individuals to P_g as the Pareto-optima set of the current generation, and compare each solution in P_g with each solution of P*. The first n individuals among P_g and P* are saved into P* again.
When the computation is completed, each individual will be selected to breed according to the probability defined by the fitness value obtained from the above steps. Subsequent to the last generation, P* will represent the best models of the run.
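Steps 2 and 3 above amount to peeling off successive non-dominated fronts and assigning each front an increasing fitness rank. A simplified sketch of that loop (ignoring the Step 1 pre-sorting and the Step 4 archive merge; all names are illustrative):

```python
def rank_fronts(population, objectives):
    """Assign each individual the rank of its non-dominated
    front: front 1 gets fitness 1, front 2 gets fitness 2, etc.
    `objectives` maps individual -> objective vector (minimized)."""
    def dominates(a, b):
        return (all(x <= y for x, y in zip(a, b))
                and any(x < y for x, y in zip(a, b)))
    remaining, fitness, rank = list(population), {}, 1
    while remaining:
        # Current front: individuals not dominated by any survivor.
        front = [p for p in remaining
                 if not any(dominates(objectives[q], objectives[p])
                            for q in remaining if q != p)]
        for p in front:
            fitness[p] = rank
        remaining = [p for p in remaining if p not in front]
        rank += 1
    return fitness

objs = {'A': (1, 5, 4), 'B': (2, 2, 2), 'C': (3, 3, 3)}
print(rank_fronts(['A', 'B', 'C'], objs))  # -> {'A': 1, 'B': 1, 'C': 2}
```

Model C is dominated by B on all three objectives and so falls into the second front, while A and B trade objectives against each other and share the first front.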
B. Fitness Functions
1) Minimizing MECM: A lower MECM value represents a preferred classification model. Because misclassification costs are specific to the software project and the development organization, the SQA team can realize an educated approximation of the cost ratio based on heuristic software engineering knowledge gained from previously developed similar software projects. The applied penalty of misclassification is defined as follows: if an nfp module is misclassified as fp, a penalty equal to the cost of a Type I error will be applied to the fitness of that particular classification model. By the same token, if an fp module is misclassified as nfp, a penalty equal to the cost of a Type II error will be applied to the respective fitness value. The fitness function for this objective is given by
(5)
2) Resource availability: The aim here is to penalize a model if it predicts a surplus (or deficient) number of fp modules relative to the maximum number that can be inspected or tested by the available resources. Therefore, if the total number of modules predicted as fp is equal to that maximum number, then the fitness function is equated to zero, implying that there is no penalty. Otherwise, the fitness function is given by
(6)
A lower value of (6) implies a better performance in regard to the second objective.
3) Size of decision tree: We define the size of a decision tree by its number of nodes; the smaller the size, the better is the fitness for the third objective. We select a threshold value of five, implying that if the tree size is less than five, then the fitness of that specific classification tree is five. The minimum size of the tree was empirically set to five nodes in order to prevent any loss of diversity in the GP population.
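The second and third fitness functions can be sketched as follows. The body of (6) is not reproduced here, so the absolute-deviation penalty below is an assumed stand-in rather than the authors' exact formula; the size floor of five follows the text.

```python
def resource_fitness(n_predicted_fp, capacity):
    """Objective 2 sketch (assumed form, not Eq. (6) itself):
    zero penalty when the number of predicted fp modules matches
    the number inspectable with available resources, growing with
    any surplus or deficit otherwise."""
    return abs(n_predicted_fp - capacity)

def size_fitness(tree_size, floor=5):
    """Objective 3: fitness is the node count, clamped at a floor
    of five nodes to preserve diversity in the GP population."""
    return max(tree_size, floor)

print(resource_fitness(41, 40))  # -> 1 (one surplus fp prediction)
print(size_fitness(3))           # -> 5 (below the floor)
```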
IV. EMPIRICAL CASE STUDY
A. System Description
The case study involved data collection efforts from two large Windows-based embedded system applications used for customizing the configuration of wireless telecommunications products. The two C++ applications provide similar functionality, and contain common source code. The primary difference is the type of wireless product that each supports. Both systems comprised over 1400 source code files, and contained several million lines of code each. Software metrics were obtained by observing the configuration management systems, while the problem reporting systems tracked and recorded the status of different problems. Information, such as how many times a source file was inspected prior to system tests, was recorded. The obtained software metrics reflected aspects of source files, and therefore, a software module for the systems consisted of a source file.
The fault data collected represent the faults discovered during system tests. Upon preprocessing and cleaning the software data, 1211 modules remained, and were used for model calibration. Data preprocessing primarily included the removal of observations with missing information or incomplete data. The decision to remove certain modules from the data set was based on our discussion with the development team. Among the 1211 modules considered for modeling, over 66% (809) were observed to have no faults, while the remaining 402 modules had one or more faults.
The five software metrics used for this case study are: the number of times the source file was inspected prior to the system test release; the number of lines for the source file prior to the coding phase; the number of lines of code for the source file prior to the system test release; the number of lines of commented code for the source file prior to the coding phase; and the number of lines of commented code for the source file prior to the system test release. The available data collection tools determined the number and selection of the metrics. The product metrics used are statement metrics for the source files. They primarily indicate the number of lines of source code prior to the coding phase (i.e., auto-generated code), and just before system tests. The process metric, i.e., the number of inspections, was obtained from the problem reporting systems.
The module-quality metric, i.e., the dependent variable, is the number of faults observed during system test. The SQC models are dependent on the chosen threshold value (of the quality metric), which identifies modules as either fp or nfp. The main stimulus for selecting the appropriate threshold value is to build the most
useful and system-relevant SQC model possible. Therefore, the selection of the threshold value is usually dependent on the software quality improvement needs of the development team. Because software metrics data from subsequent releases were not available, an impartial data splitting was applied to the data set to obtain the fit and test data sets. Consequently, the fit, and test data sets had 807, and 404 modules, respectively.
The classification models for this case study are calibrated to classify a software module as either fp or nfp, based on a threshold of two faults; i.e., if a module has two or more faults, it is categorized as fp, and as nfp otherwise. According to the selected threshold value, the fit data set has 632 nfp modules, and 175 fp modules, whereas the test data set has 317 nfp modules, and 87 fp modules. The selection of a threshold value for the number of faults is specific to a given software project.
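The threshold-based labeling described above can be sketched as follows; the function name is illustrative.

```python
def label_modules(fault_counts, threshold=2):
    """Label each module fp if it has `threshold` or more faults
    observed during system test, and nfp otherwise (the threshold
    used in this case study is two faults)."""
    return ['fp' if faults >= threshold else 'nfp'
            for faults in fault_counts]

print(label_modules([0, 1, 2, 5, 0]))
# -> ['nfp', 'nfp', 'fp', 'fp', 'nfp']
```

With the study's threshold of two, the 807-module fit data set splits into 632 nfp and 175 fp modules, and the 404-module test data set into 317 nfp and 87 fp modules.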
B. Empirical Settings
The modeling tool used for our empirical studies is lilgp (version 1.01), developed by D. Zongker and B. Punch of Michigan State University. It is implemented in C, and is based on the LISP works of J. Koza [22]. When applying lilgp to a GP application, each individual is organized as a decision tree in which a node is a C function pointer. The execution speed of lilgp is faster than that of interpreted LISP.
The below-mentioned modeling methodology is adopted for calibrating software quality classification models with a multi-objective optimization. The procedure is based on the given pair of inputs of 1) how many modules can be inspected or tested according to the available resources, and 2) the project-specific cost ratio.
1) Divide the modules in the fit and test data sets into two classes, i.e., fp and nfp, according to the chosen threshold value mentioned in the previous section.
2) Build a GP-based multi-objective classification model according to the procedure discussed in Sections III, III-A, and III-B.
3) Compute the quality-of-fit performance of the classification model based on the quality estimation of the modules in the fit data set, the number of predicted fp modules, and the size of the decision tree. The accuracy of a model indicates how many of the actual fp modules are predicted as fp according to the allocated resources, i.e.,
(7)
where the resource-availability parameter is the ratio of the number of modules which can be inspected by the available resources to the total number of modules in the respective data set. The equation represents the inspection efficiency for the given amount of resources.
4) Apply the classification model to the test data set to evaluate its predictive performance. Validating the model on an independent data set can provide an indication regarding the accuracy of the classification model if it were applied to a currently under-development system release (or similar software project) with known software metrics, but unknown quality data.
TABLE I
PARAMETERS FOR GP
The independent variables are the five software metrics described earlier, whereas the dependent variable is the class membership based on the number of faults observed during system test. Because the actual cost of misclassifications is usually unknown at the time of modeling, it is beneficial for the project management team to calibrate classification models for a range of cost ratios which are likely to suit the project's needs. The cost ratios considered in our study are 2, 5, 10, 17.5, 25, and 50. Based on our empirical software engineering knowledge, this set of values covers a broad spectrum of possible cost ratios. However, from a practical software engineering point of view, the cost ratios of 2 and 5 are very unlikely for the embedded systems being modeled in this study.
In each generation, our GP process will automatically select the five best models (parameter output.bestn) from the Pareto set it finds. If the Pareto set contains more than 5 models according to its non-domination order, the models will be sorted by using the relative importance of the objectives. The first five (from the top) models, which have lower MECM values, will be presented at the end of the GP run. We performed 20 runs for each combination of cost ratio, and available resources; thus, 100 models were recorded for each combination. The best decision tree (lowest MECM value for fit data) which appears during the 20 runs is selected as the preferred classification model for the given combination.
In the context of GP-based modeling, certain parameters have to be assigned for a given set of input conditions. In our study, the parameters that were used for each value of the cost ratio are listed in Table I. Some of the basic parameters, such as the depth (init.depth)
of the initial GP population, and the method (init.method) for generating the initial population, were assigned default values provided by John Koza. Other parameters, such as population size (pop_size), maximum number of generations (max_generations), maximum depth of the population (max_depth), crossover rate, reproduction rate, and mutation rate, were empirically varied during our study. The selection method for the three genetic operators is fitness-proportional selection. The function set only contains four functions (C1, and C2 represent the nfp, and fp classes, respectively) which are essential for building a binary decision tree model.
We observed that our GP process was not sensitive to the change of the above-mentioned parameters as long as a reasonable range of values was used, such as population size 500, number of generations 100, and mutation rate 0.0. We note that a detailed analytic study to find an optimal combination of the GP parameters is out of scope for this study.
V. DISCUSSION
The performance of the preferred classification models for the cost ratios of 2, and 10 are presented in Tables II and III, respectively. The results for the other four cost ratios are not presented due to similarity of empirical conclusions, and paper size considerations. The first column indicates the fraction of the total number of modules which can be inspected or tested according to the available software quality improvement resources. The second, and third columns indicate the Type I, and Type II misclassification error rates, respectively. The fourth, and fifth columns respectively indicate the number of modules predicted as fp, and the number of modules that can be inspected. The sixth column indicates the modified expected cost of misclassification values for the respective classification models. The last column represents the performance accuracy of the classification model as defined by (7), i.e., the percentage of the predicted fp modules which are actually fp.
As an illustration, consider the classification model shown in the first row of Table II. In the case of the fit data set, the model predicts 41 modules as fp. When comparing this number to the number of modules which can be inspected, we observe that the model performs exceptionally well. Moreover, the performance of the model is perfect, implying that all modules predicted as fp are actually fp. Hence, we see that the quality-of-fit of this model is excellent. The Type II error rate is very large because the model is forced to optimize the predicted number of fp modules to 40, implying that the remaining actual fp modules are misclassified. Let us now examine the predictive capability of this model, i.e., its performance on the test data set. We observe that 23 modules are classified as fp as compared to the number which can be inspected, implying the model performs very well. In addition, the performance implies that over 90% of predicted fp modules are actually fp.
An example binary decision tree model obtained for the case study is presented in Fig. 1. The model corresponds to a cost ratio of 10, and resource availability of 0.40. The leaf nodes are labeled as either fp, or nfp. A non-leaf tree node shows a specific software metric, and its threshold, which is used to identify the
TABLE II
CLASSIFICATION MODELS FOR A COST RATIO OF 2
Fig. 1. GP decision tree model for a cost ratio of 10, and resource availability of 0.40.
subsequent traversal path. For example, the root node indicates that if LOCA (the number of lines of commented code for the module prior to the coding phase) is greater than 3, then the right subtree
TABLE III
CLASSIFICATION MODELS FOR A COST RATIO OF 10
of the root node is traversed, where other conditions are tested; otherwise, the module is classified as nfp.
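Traversal of such a binary decision tree can be sketched as follows. Only the root test (LOCA greater than 3) is taken from the text; the right subtree shown, including the LOC metric and its threshold, is hypothetical, since the full tree of Fig. 1 is not reproduced here.

```python
# Hypothetical tree in the shape of Fig. 1: internal nodes hold a
# metric and threshold; leaves hold the class label.
tree = {'metric': 'LOCA', 'threshold': 3,
        'left': 'nfp',                              # LOCA <= 3 -> nfp
        'right': {'metric': 'LOC', 'threshold': 100,  # assumed subtree
                  'left': 'nfp', 'right': 'fp'}}

def classify(node, module):
    """Walk the tree: go right when the module's metric exceeds
    the node threshold, left otherwise, until a leaf is reached."""
    while isinstance(node, dict):
        branch = ('right' if module[node['metric']] > node['threshold']
                  else 'left')
        node = node[branch]
    return node

print(classify(tree, {'LOCA': 2, 'LOC': 500}))   # -> 'nfp'
print(classify(tree, {'LOCA': 10, 'LOC': 500}))  # -> 'fp'
```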
The performance details of the GP model for the fit & test
data sets can be observed in Table III. The table indicates that
as the resource availability increases, the performance decreases.
Moreover, as the resource availability increases, we observe an
inverse relationship between the Type I, and Type II error rates;
i.e., the Type I error rate increases while the Type II error rate
decreases. For the different resource-availability values, the
number of modules predicted as fp by the model is very close
to the number of modules which can be inspected, suggesting
an optimized solution with respect to resource utilization.
Similar observations were also made from the performances
for the other cost ratios. An intuitive analysis of why the
performance decreases with increasing resource availability is
now presented.
Assume that modules are scattered across a spectrum with
respect to their predicted class, such that the left end represents
the least faulty software module, while the right end represents
the most faulty module. Based on this assumption, the actual fp
modules are likely to be concentrated towards the right; the
actual nfp modules are likely to be concentrated towards the left;
and the middle portion of the spectrum will be comprised of an
intermixed collection of fp & nfp modules. It is at this middle
portion of the spectrum that the misclassification errors are more
likely to occur, primarily because of data points that do not
follow the general trend of the data set. Therefore, for inspection
purposes for a given amount of resources, the needed modules
would be picked starting from the right end of the spectrum.
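The selection strategy implied by the spectrum analogy can be sketched as follows (a hedged illustration, not the paper's procedure; names are illustrative): modules are ranked by a predicted risk score, and modules are picked from the "right end" until the inspection budget, expressed as a resource-availability fraction of all modules, is exhausted:

```python
def select_for_inspection(risk_scores, availability):
    """Return the indices of the riskiest modules, where the number
    selected is the availability fraction of the total module count."""
    budget = int(round(availability * len(risk_scores)))
    # Rank module indices from most to least risky.
    ranked = sorted(range(len(risk_scores)),
                    key=lambda i: risk_scores[i], reverse=True)
    return ranked[:budget]
```

With a small availability, the selected modules sit deep in the fp-dominated right end of the spectrum; as availability grows, the selection reaches into the intermixed middle, which is why misclassifications become more likely.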
When the resource availability was 0.05, it was observed that
for both the fit, and test data sets, the performance was very
good, i.e., 90% to 100%, across all the cost ratios. This is
analogous to selecting 5% of the modules starting from the
right end of the spectrum, which are more likely to be actually
fp. On the other hand, when a much larger fraction of the
modules could be inspected, we observed that the performance
was only about 40% to 48% for all the cost ratios. The
performance reduction is intuitive because, when a larger
percentage of modules can be inspected, many nfp modules
will invariably be flagged as fp (as per the spectrum analogy
presented earlier), thus lowering the performance of the model,
and correspondingly increasing the Type I error rate. The
relationship between a decrease in performance and an increase
in the Type I error rate is seen in the tables. Moreover,
when comparing the MECM values of the quality-of-fit (fit data),
and predictive-quality (test data) performances of the different
classification models, we note that the respective models do not
demonstrate over-fitting tendencies, and generally maintain the
achieved quality-of-fit performance.
In a related empirical case study of the software system
presented in Section IV-A, we applied our GP-based classifi-
cation technique to build classification models which predict
modules as either change-prone, or not change-prone. The
module-quality metric (dependent variable) in that study was
the number of lines of code churn, which was defined as the
summation of the number of lines added, deleted, or modified
during system test. Empirical observations & conclusions made
from the study were similar to those presented in this paper.
Due to the similarity of results, and paper-size concerns, we
have not included those results in this paper.
VI. CONCLUSION
This study presents a genetic programming-based multi-ob-
jective optimization modeling technique for calibrating a goal-
oriented software quality classification model geared toward a
cost-effective resource utilization. Using case studies of two
wireless configuration applications, software quality classifica-
tion models are calibrated. The effective multi-objective opti-
mization capability of GP
is demonstrated through a representative case study in which models were calibrated to predict
software modules as either fault-prone or not fault-prone.
Multi-objective optimization is often a practical need in many
real-world problems, such as software quality estimation mod-
eling. An advantage of GP is that it does not require extensive
mathematical assumptions about the size, and structure of the
optimization problem. In software engineering, the importance
of making the best of the limited software inspection & testing
resources is well founded. In addition to effective resource uti-
lization, the project-specific costs of misclassifications also af-
fect the usefulness of classification models.
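The two objectives discussed above could be combined in a GP fitness evaluation along the following lines. This is a hedged sketch only: the cost expression and the penalty weight `w` are illustrative stand-ins, not the paper's exact MECM formulation:

```python
def fitness(actual, predicted, cost_ratio, inspectable, w=1.0):
    """Lower is better: a cost-ratio-weighted misclassification term,
    plus a penalty for deviating from the inspectable-module count."""
    # Type II errors (fp misclassified as nfp) weighted by the cost ratio.
    type2 = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    # Type I errors (nfp misclassified as fp) at unit cost.
    type1 = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    cost = (cost_ratio * type2 + type1) / len(actual)
    # Second objective: predicted-fp count should match the number of
    # modules the allocated resources can inspect.
    n_fp_predicted = sum(predicted)
    penalty = w * abs(n_fp_predicted - inspectable) / len(actual)
    return cost + penalty
```

A selection pressure of this shape would explain the empirical observation that the calibrated models predict a number of fp modules very close to the number which can be inspected.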
Building on our previously developed GP-based software
quality classification modeling technique, this study focuses on calibrating models which optimize the following criteria