
IEEE TRANSACTIONS ON RELIABILITY, VOL. 56, NO. 2, JUNE 2007

A Multi-Objective Software Quality Classification Model Using Genetic Programming

    Taghi M. Khoshgoftaar, Member, IEEE, and Yi Liu, Member, IEEE

Abstract: A key factor in the success of a software project is achieving the best-possible software reliability within the allotted time and budget. Classification models which provide a risk-based software quality prediction, such as fault-prone and not fault-prone, are effective in providing a focused software quality assurance endeavor. However, their usefulness largely depends on whether all the predicted fault-prone modules can be inspected or improved by the allocated software quality-improvement resources, and on the project-specific costs of misclassifications. Therefore, a practical goal of calibrating classification models is to lower the expected cost of misclassification while providing a cost-effective use of the available software quality-improvement resources.

This paper presents a genetic programming-based decision tree model which facilitates a multi-objective optimization in the context of the software quality classification problem. The first objective is to minimize the Modified Expected Cost of Misclassification, which is our recently proposed goal-oriented measure for selecting and evaluating classification models. The second objective is to optimize the number of predicted fault-prone modules such that it is equal to the number of modules which can be inspected by the allocated resources. Some commonly used classification techniques, such as logistic regression, decision trees, and analogy-based reasoning, are not suited for directly optimizing multi-objective criteria. In contrast, genetic programming is particularly suited for the multi-objective optimization problem. An empirical case study of a real-world industrial software system demonstrates the promising results, and the usefulness of the proposed model.

Index Terms: Cost of misclassification, genetic programming, multi-objective optimization, software faults, software metrics, software quality estimation.

NOTATION

Type I error    an error of misclassifying a nfp module as a fp module
Type II error   an error of misclassifying a fp module as a nfp module
R, or Red       a fp module
G, or Green     a nfp module
C_I             cost of a Type I error
C_II            cost of a Type II error
c               cost ratio of a Type II error over a Type I error, i.e., c = C_II / C_I
N_I             # of Type I errors
N_II            # of Type II errors
N_max           the maximum number of modules which can be tested
N_i             # of modules classified as members of class i, where i can be either R, or G
N_{i,j}         # of modules of class i predicted as members of class j, where i, and j, can be either R, or G
N               # of the modules in the data set
x               a vector of independent variables
f_i             an objective function
k               number of objective functions to be considered
x*              a Pareto-optimum
q               size of the Pareto-optima set
F               fitness value for each individual
g               generation
P_g             the Pareto-optima set in the g-th generation
P*              the set of Pareto-optima solutions
f               fitness function
λ               the ratio of N_max over N
p               the ratio of N_R over N
# of times the source file was inspected prior to the system test release
# of lines for the source file prior to the coding phase
# of lines of code for the source file prior to the system test release
LOCA            # of lines of commented code for the source file prior to the coding phase
# of lines of commented code for the source file prior to the system test release
the ratio of … over …

Manuscript received August 21, 2003; revised December 18, 2003 and April 8, 2004; accepted April 30, 2004. This work was supported in part by the NSF grant CCR-9970893. Associate Editor: M. Xie.

T. M. Khoshgoftaar is with the Empirical Software Engineering Laboratory, Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, FL 33431 USA (e-mail: [email protected]).

Y. Liu is with Georgia College & State University, Milledgeville, GA 31061 USA (e-mail: [email protected]).

Digital Object Identifier 10.1109/TR.2007.896763

ACRONYM¹

    ECM expected cost of misclassification

    fp fault-prone

    GP genetic programming

    MECM modified expected cost of misclassification

    nfp not fault-prone

    SQA software quality assurance

    SQC software quality classification

    STGP strongly typed genetic programming

¹The singular and plural of an acronym are always spelled the same.


    I. INTRODUCTION

IN THE context of software quality and reliability improvement endeavors, the software quality assurance (SQA) team is often faced with the difficult task of working with finite and limited resources. The team aims to make the best of available software inspection and quality improvement resources. A logical approach for achieving cost-effective software quality improvement is a risk-based targeting of software modules depending on their predicted quality [1]. Software metrics-based quality classification models have proven their usefulness for timely software quality improvements [2]-[4].

Typically, software modules are categorized by a classification model into two risk-based groups, such as fault-prone (fp), and not fault-prone (nfp) [5], [6]. A classification model is calibrated, or fitted, using software metrics and fault data collected from a previously developed system release, or a similar software project. Subsequently, the fitted classification model can be applied to estimate the quality of modules which are currently under development. However, existing software quality classification (SQC) techniques do not consider the limited resource-availability factor during their model-calibration process.

The usefulness of an SQC model depends on whether enough resources are available to inspect all modules predicted as fp. For example, if a model predicts 30% of the modules as fp, but available resources can only inspect 10% of the modules, then how does one choose which of the predicted fp modules to inspect? It is therefore important that a resource-based SQC model is calibrated². Moreover, because it is unrealistic to obtain a model which yields perfect classification, i.e., one in which all predicted fp modules are actually fp, a classification technique should aim to minimize the expected cost of misclassification in the context of the software system and application domain [7]. In addition to resource usage and the misclassification costs, other factors also need to be considered, such as model simplicity, model interpretation, etc.

We have mentioned that, to calibrate a goal- and objective-oriented classification model, multiple criteria have to be optimized. However, commonly used classification techniques, such as logistic regression [8] and decision trees [5], cannot simultaneously attain a multi-objective optimization during their modeling process. Modeling complicated engineering problems, such as software quality prediction, with traditional mathematical optimization methods is practically infeasible [9].

This paper presents a genetic programming (GP)-based decision tree model that is capable of obtaining a multi-objective optimization. As a member of the evolutionary computational methods [10]-[13], GP has been explored to solve some multi-objective optimization problems [14], [15]. This study is a continuation of our recent research efforts with GP-based software quality classification models [16], [17], and focuses on the simultaneous optimization of decision trees with respect to the following two factors:

²Some classifiers, such as count models, provide a probability that a module has a given number of faults. However, they are not suited for a multi-objective optimization problem such as that being addressed in this paper.

1) the Modified Expected Cost of Misclassification (MECM) [18] (see Section II-A), and

2) the number of predicted fp modules, such that it is equal to the number which can be inspected by the allocated software quality improvement resources.

Although MECM contains a provision to penalize models whose number of fp modules is greater than that which can be tested by the available resources, GP can still find several models which have very similar, or the same, MECM values, but a different number of predicted fp modules. Hence, given two such models, GP will give preference, based on the second objective, to the one which provides approximately the same number of fp modules as that required by the available resources. Moreover, to accelerate the run times of GP, and to yield a relatively simple model, a third optimization factor, i.e., minimizing the size of the decision tree, is introduced in our GP-based decision tree modeling process. A method based on GP for finding the solutions in the Pareto set [9] is proposed when building the classification model.

Assessing the usefulness of a classification model based solely on its misclassification error rates, i.e., Type I and Type II, is inappropriate because of the disparate misclassification costs associated with the individual error types.³ In an earlier study [19], we investigated the application of the Expected Cost of Misclassification (ECM) as a singular unified model-evaluation measure which can be used to incorporate the project-specific misclassification costs. Though ECM-based model evaluation demonstrates effective results, it does not reflect the performance of the model in the context of allocated resources. More specifically, such an approach is based on the assumption that the project has enough resources to inspect all the modules predicted as fp. To overcome this limitation of ECM, in a recent study we proposed an improved version called MECM [18], which facilitates achieving resource-based SQC models for a given resource allocation.

The basic functionality of MECM is that it penalizes a classification model, in terms of the costs of misclassifications, if the model predicts more fp modules than can be inspected by the allocated resources. Therefore, at the time of model calibration, the SQA team can provide information regarding how many modules can be inspected and improved with the allotted resources, and the approximate costs of misclassifications. Estimating the actual costs of misclassifications at the time of modeling is a difficult problem. However, an educated approximation is usually made based on heuristic software engineering knowledge gained from previously developed similar software projects. Based on the project-specific knowledge of available resources, and the costs of misclassifications, the MECM measure can be used to yield a resource-based SQC model. Consequently, the best possible and practical usage of the available resources can be achieved.
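The equation-level details of MECM are given in [18] and in Section II, which is not fully reproduced in this transcript. Purely as a conceptual sketch, under our own naming and assumptions, the Python fragment below illustrates the kind of resource-aware penalty described above: an ECM-style average misclassification cost, plus an extra charge whenever the model predicts more fp modules than the N_max that the allocated resources can inspect.

```python
def ecm(n_type1, n_type2, c_type1, c_type2, n_modules):
    # ECM-style measure: average misclassification cost per module,
    # weighting Type I and Type II error counts by their costs.
    return (c_type1 * n_type1 + c_type2 * n_type2) / n_modules

def mecm_like(n_type1, n_type2, c_type1, c_type2, n_modules,
              n_predicted_fp, n_max):
    # Conceptual resource-aware variant (NOT the published MECM formula):
    # each predicted-fp module beyond N_max draws an additional cost
    # penalty, because it cannot be inspected with the allocated resources.
    excess = max(0, n_predicted_fp - n_max)
    return (ecm(n_type1, n_type2, c_type1, c_type2, n_modules)
            + c_type1 * excess / n_modules)
```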

The empirical case study used to illustrate the GP-based multi-objective SQC model consists of software metrics and fault data collected from two embedded software applications from the wireless telecommunications industry. Other systems were also

³A Type I error occurs when a nfp module is misclassified as fp, whereas a Type II error occurs when a fp module is misclassified as nfp.


This factor corresponds to a decrease in the expected cost of misclassification. The appropriate MECM for this case is given by (4).

III. GENETIC PROGRAMMING

An inherent advantage of GP is that it can evolve a solution automatically from the training data [10], and does not require an assumption regarding the mathematical model of the structure, or the size, of the decision tree-based solution. The evolutionary process of GP attempts to imitate the Darwinian principle of survival of the fittest individuals. The fitness value of an individual is an indicator of its quality and, hence, provides a probability of which individuals will be selected for mating, and the subsequent reproduction of the next generation.

We note that an in-depth discussion of GP is avoided here; however, some key elements of GP are briefly explained [10]. GP uses the basic units of the associated problem to assemble each individual. The basic units may include function sets, and terminal sets. The typical structure of each individual can be seen as a tree-shaped structure, and each individual may be unique. There are three main operators for GP: reproduction, crossover, and mutation. Each uses random processing on one or more individuals. These three operators are used by GP for a fitness-based evolution. The fitness factor is a measure, used by GP during its simulated evolution, of how well a program (individual) has learned to yield the correct output based on the given inputs.

Decision tree-based classification models have been recognized as useful tools for data mining purposes [21]. This study is a continuation of our previous efforts [16], [17], in which an approach for building GP-based decision tree models for classifying software modules as either fp or nfp was presented. The commonly used standard GP requires that the function set and the terminal set have closure properties, i.e., all functions in the function set must accept all kinds of data types and data values as function arguments [22]. However, this requirement does not guarantee that GP will generate a useful decision tree model.

Montana recognized this problem, and proposed the Strongly Typed Genetic Programming (STGP) approach [23]. It relaxes the closure property requirement of standard GP by introducing additional criteria for genetic operations. More specifically, given a precise description of the permissible data types for function arguments, STGP will only generate individuals based on the constraint that the arguments of all functions are of the correct type [23]. We only discuss the modifications made to standard GP in order to build a decision tree model for software quality classification purposes.

1) Constraint: Each function and terminal is assigned a type. Different types may not cross over, or mutate, under certain problem-specific constraints. A leaf node is a function which returns the class membership of a module, and the decision nodes (non-leaf nodes) are simple logical equations which return either true, or false. Hence, only constants and independent variables appear in the decision nodes. The function which returns the class membership of a module is not used in the internal nodes. Moreover, a root node can be either a leaf node, or an internal node.

2) Crossover: In this stage, additional limitations are applied to the genetic operation. In the context of GP-based decision trees, we define the type of a subtree as the type of its root node. Therefore, when two subtrees are selected for crossover, they are required to have the same type, so that a proper decision tree is generated (see the sketch after this list).

3) Mutation: If a subtree is selected for mutation purposes, the replacement tree must have the same type, or at least a similar type. A subtree of a similar type is one which, when used as a replacement, i.e., mutation, yields a new tree which is a permissible, or proper, decision tree. For example, the leaf node is a function which returns the class membership of a module, and can be replaced by a new subtree whose root node is a logical equation.
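A minimal Python sketch of the type constraint during crossover follows. The node classes, function names, and subtree-picking logic are our own illustrative assumptions; this is not the lilgp implementation used in the study.

```python
import random

class Leaf:
    """Terminal: returns the class membership (fp or nfp) of a module."""
    def __init__(self, label):
        self.label = label

class Test:
    """Decision node: a simple logical equation on one software metric."""
    def __init__(self, metric, threshold, left, right):
        self.metric, self.threshold = metric, threshold
        self.left, self.right = left, right

def slots(node):
    """Enumerate (parent, attribute) positions of all replaceable subtrees."""
    found = []
    if isinstance(node, Test):
        for attr in ("left", "right"):
            found.append((node, attr))
            found.extend(slots(getattr(node, attr)))
    return found

def typed_crossover(a, b, rng=random):
    """Swap two randomly chosen subtrees only when their root nodes have
    the same type, so both offspring remain proper decision trees."""
    slots_a, slots_b = slots(a), slots(b)
    if not slots_a or not slots_b:          # single-leaf trees: nothing to swap
        return a, b
    pa, sa = rng.choice(slots_a)
    pb, sb = rng.choice(slots_b)
    ta, tb = getattr(pa, sa), getattr(pb, sb)
    if type(ta) is type(tb):                # the STGP type constraint
        setattr(pa, sa, tb)
        setattr(pb, sb, ta)
    return a, b
```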

    A. Multi-Objective Optimization

A multi-objective optimization solution usually aims at obtaining a set of Pareto-optima [9], which represents feasible solutions for which the solution to a single objective cannot be improved without sacrificing the solution for one or more of the other criteria. Let x be a vector of independent variables, f_i(x) be an objective function, and k be the number of objective functions to be considered for optimization. Assuming all objectives are to be minimized, a vector x* is a Pareto-optimum iff there is no vector x which exists with the characteristics

$$f_i(\mathbf{x}) \leq f_i(\mathbf{x}^*) \;\; \forall i \in \{1,\ldots,k\}, \quad \text{and} \quad f_j(\mathbf{x}) < f_j(\mathbf{x}^*) \;\; \text{for at least one } j.$$

In the case of all non-Pareto-optima vectors, the solution for at least one objective function, f_j, can be improved without sacrificing the solutions for any of the other objective functions. The most frequently used methods for generating Pareto-optima are based on the notion of replacing the multi-objective problem with a parameterized scalar problem [14]. Typically, by varying the value of each parameter for each objective, it is possible to generate all, or parts, of the Pareto-optima set.
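As a concrete rendering of this definition (a minimal sketch; the function and variable names are ours), the following Python routine extracts the non-dominated subset from a list of candidate objective vectors, all objectives being minimized:

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (minimization):
    a is no worse than b in every objective, and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and \
           any(x < y for x, y in zip(a, b))

def pareto_set(solutions):
    """Return the Pareto-optimal members of `solutions`, where each entry
    is a tuple of objective values, e.g. (MECM, resource_penalty, size)."""
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions if t is not s)]

# Hypothetical candidates: (MECM, resource penalty, tree size).
models = [(10.0, 0.1, 7), (12.0, 0.0, 5), (10.0, 0.1, 9), (15.0, 0.4, 4)]
print(pareto_set(models))   # (10.0, 0.1, 9) is dominated and dropped
```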

The multi-objective optimization problem addressed in this study includes three objectives:

1) to minimize the modified expected cost of misclassification (MECM);

2) to obtain the number of modules predicted as fp such that it is equal to the number of modules which can be inspected with the available software quality improvement resources; and

3) to minimize the size of the decision tree model.

The third objective addresses the important issue of simplicity, and ease in model interpretation and comprehension. In addition, limiting the size of the decision tree assists in the acceleration of GP run times.

The optimization method adopted in our study is based on the Non-dominated Sorting Genetic Algorithm, as proposed by Srinivas and Deb [24]. Because minimizing the MECM value is the most important objective, sorting by the first objective ensures that it is the least likely to be violated. On the other hand, the third objective, i.e., minimizing the size of the decision tree, is the one most likely to be violated during the process of each run. Let q be the size of the Pareto-optima set, F be the fitness


value for each individual, g be the generation, P_g be the Pareto-optima set in the g-th generation, and P* be the set of Pareto-optima solutions we are interested in after the given run. At the beginning of a run, g = 1, F = 1, and P* is an empty set.

Each GP run consists of the following steps:

Step 1: Sort the population according to the increasing values of the first objective. If several individuals have the same values, then order them successively by the increasing values of the second, and third, objectives.

Step 2: Select all non-dominated individuals from the population, based on their sorted order. Assign F to the fitness of the non-dominated individuals. Then increase F by 1.

Step 3: The individuals which were selected and assigned a fitness in the previous step are now ignored. Repeat the process of selecting the non-dominated individuals by returning to Step 2. If all individuals have already been selected, then continue to Step 4.

Step 4: Save the first q individuals to P_g as the Pareto-optima set of the current generation, and compare each solution in P_g with each solution of P*. The first q individuals among P_g and P* are saved into P* again.

When the computation is completed, each individual will be selected to breed according to the probability defined by the fitness value obtained from the above steps. Subsequent to the last generation, P* will represent the best models of the run.
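Reusing dominates() from the earlier Pareto sketch, the following Python fragment is our reconstruction of Steps 1-3: a lexicographic sort on the three objectives, followed by repeatedly peeling off the non-dominated layer and assigning it the current fitness value F. The names and data layout are assumptions for illustration.

```python
def rank_population(population, objectives):
    """Assign layer-based fitness values per Steps 1-3 (a reconstruction).

    `objectives(ind)` returns (MECM, resource_penalty, tree_size), all to
    be minimized. Individuals in the first non-dominated layer get F = 1,
    the next layer F = 2, and so on; a lower F means a better layer."""
    # Step 1: sort by the first objective, ties broken by the second, third.
    remaining = sorted(population, key=objectives)
    fitness, F = {}, 1
    while remaining:
        # Step 2: take the currently non-dominated individuals.
        layer = [s for s in remaining
                 if not any(dominates(objectives(t), objectives(s))
                            for t in remaining if t is not s)]
        for s in layer:
            fitness[id(s)] = F
        # Step 3: ignore the assigned individuals and repeat.
        remaining = [s for s in remaining if id(s) not in fitness]
        F += 1
    return fitness
```

Step 4 would then merge the first q individuals of the generation's Pareto set P_g into the retained set P*.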

    B. Fitness Functions

1) Minimizing MECM: A lower MECM value represents a preferred classification model. Because misclassification costs are specific to the software project and the development organization, the SQA team can realize an educated approximation of the cost ratio, i.e., c = C_II / C_I, based on heuristic software engineering knowledge gained from previously developed similar software projects. The applied penalty of misclassification is defined as follows: if a nfp module is misclassified as fp, a penalty of C_I will be applied to the fitness of that particular classification model. By the same token, if a fp module is misclassified as nfp, a penalty of C_II will be applied to the respective fitness value. The fitness function for this objective is given by (5).

2) Resource availability: The aim here is to penalize a model if it predicts a surplus (or deficit) of fp modules relative to the maximum number, i.e., N_max, that can be inspected or tested by the available resources. Therefore, if the total number of modules predicted as fp is equal to N_max, then the fitness function is equated to zero, implying that there is no penalty. Otherwise, the fitness function is given by (6), where p = N_R / N, and λ = N_max / N. A lower value of (6) implies a better performance with regard to the second objective.

3) Size of decision tree: We define the size of a decision tree by its number of nodes; the smaller the size, the better the fitness for the third objective. We select a threshold value of five, implying that if the tree size is less than five, then the fitness of that specific classification tree is five. The minimum size of the tree was empirically set to five nodes in order to prevent any loss of diversity in the GP population.
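Equations (5) and (6) are not reproduced in this transcript, so the sketch below only mirrors the penalty logic described in items 1)-3), with an assumed normalization in the resource term; it is not the published formulation.

```python
def fitness_mecm(n_type1, n_type2, c_type1, c_type2):
    """Objective 1 (cf. (5)): add C_I for each nfp module misclassified
    as fp, and C_II for each fp module misclassified as nfp."""
    return c_type1 * n_type1 + c_type2 * n_type2

def fitness_resource(n_predicted_fp, n_max):
    """Objective 2 (cf. (6)): no penalty when exactly N_max modules are
    predicted as fp; otherwise penalize the relative mismatch (assumed)."""
    if n_predicted_fp == n_max:
        return 0.0
    return abs(n_predicted_fp - n_max) / n_max

def fitness_size(n_nodes, floor=5):
    """Objective 3: tree size in nodes, floored at five to avoid losing
    diversity in the GP population."""
    return max(n_nodes, floor)
```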

IV. EMPIRICAL CASE STUDY

    A. System Description

The case study involved data collection efforts from two large Windows-based embedded system applications used for customizing the configuration of wireless telecommunications products. The two C++ applications provide similar functionality, and contain common source code. The primary difference is the type of wireless product that each supports. Both systems comprised over 1400 source code files, and each contained several million lines of code. Software metrics were obtained by observing the configuration management systems, while the problem reporting systems tracked and recorded the status of different problems. Information, such as how many times a source file was inspected prior to system tests, was recorded. The obtained software metrics reflected aspects of source files, and therefore, a software module for these systems consisted of a source file.

The fault data collected represent the faults discovered during system tests. Upon preprocessing and cleaning the software data, 1211 modules remained, and were used for model calibration. Data preprocessing primarily included the removal of observations with missing information or incomplete data. The decision to remove certain modules from the data set was based on our discussions with the development team. Among the 1211 modules considered for modeling, over 66% (809) were observed to have no faults, while the remaining 402 modules had one or more faults.

The five software metrics used for this case study are: the number of times the source file was inspected prior to the system test release; the number of lines for the source file prior to the coding phase; the number of lines of code for the source file prior to the system test release; the number of lines of commented code for the source file prior to the coding phase (LOCA); and the number of lines of commented code for the source file prior to the system test release. The available data collection tools determined the number and selection of the metrics. The product metrics used are statement metrics for the source files. They primarily indicate the number of lines of source code prior to the coding phase (i.e., auto-generated code), and just before system tests. The process metric, i.e., the inspection count, was obtained from the problem reporting systems.

The module-quality metric, i.e., the dependent variable, is the number of faults observed during system test. The SQC models are dependent on the chosen threshold value (of the quality metric), which identifies modules as either fp or nfp. The main stimulus for selecting the appropriate threshold value is to build the most


useful and system-relevant SQC model possible. Therefore, the selection of the threshold value is usually dependent on the software quality improvement needs of the development team. Because software metrics data from subsequent releases were not available, an impartial data splitting was applied to the data set to obtain the fit and test data sets. Consequently, the fit, and test, data sets had 807, and 404, modules, respectively.

The classification models for this case study are calibrated to classify a software module as either fp or nfp based on a threshold of two faults; i.e., if a module has two or more faults, it is categorized as fp, and as nfp otherwise. According to the selected threshold value, the fit data set has 632 nfp modules, and 175 fp modules, whereas the test data set has 317 nfp modules, and 87 fp modules. The selection of a threshold value for the number of faults is specific to a given software project.
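As a small illustration of the labeling and splitting described above (the paper's impartial splitting procedure is not specified in detail here, so a seeded random split with the stated sizes is assumed):

```python
import random

def label(fault_count, threshold=2):
    """Two-fault threshold: a module with two or more faults is fp."""
    return "fp" if fault_count >= threshold else "nfp"

def impartial_split(modules, fit_size=807, seed=0):
    """Shuffle the cleaned modules and split into fit (807) and test (404)."""
    pool = list(modules)
    random.Random(seed).shuffle(pool)
    return pool[:fit_size], pool[fit_size:]

# Hypothetical usage: modules as (metrics, fault_count) records.
modules = [({"LOCA": 4}, 3), ({"LOCA": 1}, 0)] * 606   # placeholder data
fit, test = impartial_split(modules)
fit_labels = [label(faults) for _, faults in fit]
```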

    B. Empirical Settings

The modeling tool used for our empirical studies is lilgp (version 1.01), developed by D. Zongker and B. Punch of Michigan State University. It is implemented in C, and is based on the LISP works of J. Koza [22]. When applying lilgp to a GP application, each individual is organized as a decision tree in which a node is a C function pointer. The execution speed of lilgp is faster than that of interpreted LISP.

The below-mentioned modeling methodology is adopted for calibrating software quality classification models with a multi-objective optimization. The procedure is based on the given pair of inputs: 1) how many modules can be inspected or tested according to the available resources, and 2) the project-specific cost ratio.

1) Divide the modules in the fit and test data sets into two classes, i.e., fp and nfp, according to the chosen threshold value mentioned in the previous section.

2) Build a GP-based multi-objective classification model according to the procedure discussed in Sections III, III-A, and III-B.

3) Compute the quality-of-fit performance of the classification model based on the quality estimation of the modules in the fit data set, the number of predicted fp modules, and the size of the decision tree. The accuracy of a model indicates how many of the actual fp modules are predicted as fp according to the allocated resources, i.e., (7), where λ is the ratio of the number of modules which can be inspected by the available resources to the total number of modules in the respective data set. The equation represents the inspection efficiency for the given amount of resources.

4) Apply the classification model to the test data set to evaluate its predictive performance. Validating the model on an independent data set can provide an indication regarding the accuracy of the classification model if it were applied to a currently under-development system release (or similar software project) with known software metrics, but unknown quality data.

TABLE I
PARAMETERS FOR GP

The independent variables are the five software metrics described earlier, whereas the dependent variable is the class membership based on the number of faults observed during system test. Because the actual cost of misclassifications is usually unknown at the time of modeling, it is beneficial to the project management team to calibrate classification models for a range of cost ratios which are likely to suit the project's needs. For example, we consider different values for the cost ratio, denoted as c. The cost ratios considered in our study are 2, 5, 10, 17.5, 25, and 50. Based on our empirical software engineering knowledge, this set of values covers a broad spectrum of possible cost ratios. However, from a practical software engineering point of view, the cost ratios of 2 and 5 are very unlikely for the embedded systems being modeled in this study.

In each generation, our GP process will automatically select the five best models (parameter output.bestn) from the Pareto set it finds. If the Pareto set contains more than 5 models according to its non-domination order, the models will be sorted by using the relative importance of the objectives. The first five (from the top) models, which have lower MECM values, will be presented at the end of the GP run. We performed 20 runs for each combination of cost ratio c, and available resources λ; thus, 100 models were recorded for each combination of c, and λ. The best decision tree (lowest MECM value for fit data) which appears during the 20 runs is selected as the preferred classification model for the given combination.

In the context of GP-based modeling, certain parameters have to be assigned for a given set of input conditions. In our study, the parameters that were used for each value of c are listed in Table I. Some of the basic parameters, such as the depth (init.depth)


of the initial GP population, and the method (init.method) for generating the initial population, were assigned the default values provided by John Koza. Other parameters, such as population size (pop_size), maximum number of generations (max_generations), maximum depth of the population (max_depth), crossover rate, reproduction rate, and mutation rate, were empirically varied during our study. The selection method for the three genetic operators is fitness-proportional selection. The function set only contains four functions (C1, and C2, represent the nfp, and fp classes, respectively) which are essential for building a binary decision tree model.
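For concreteness, a hypothetical lilgp-style input file is sketched below. Only the parameter names quoted in this section (pop_size, max_generations, max_depth, init.depth, init.method, output.bestn) come from the paper; the values shown, and the breed-phase lines for the crossover, reproduction, and mutation rates, are our assumptions about lilgp's input format, not the settings of Table I.

```
# Hypothetical lilgp parameter file (illustrative values only).
pop_size = 500
max_generations = 100
init.method = half_and_half
init.depth = 2-6
max_depth = 17
output.bestn = 5
breed_phases = 3
breed[1].operator = crossover, select=fitness     # fitness-proportional
breed[1].rate = 0.80
breed[2].operator = reproduction, select=fitness
breed[2].rate = 0.10
breed[3].operator = mutation, select=fitness
breed[3].rate = 0.10
```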

We observed that our GP process was not sensitive to changes in the above-mentioned parameters, as long as a reasonable range of values was used, such as a population size of 500, a number of generations of 100, and a mutation rate of 0.0. We note that a detailed analytic study to find an optimal combination of the GP parameters is out of scope for this study.

    V. DISCUSSION

The performance of the preferred classification models for the cost ratios of 2, and 10, are presented in Tables II and III, respectively. The results for the other four cost ratios are not presented, due to the similarity of the empirical conclusions, and paper-size considerations. The first column, λ, indicates the fraction of the total number of modules which can be inspected or tested according to the available software quality improvement resources. The second, and third, columns indicate the Type I, and Type II, misclassification error rates, respectively. The fourth, and fifth, columns respectively indicate the number of modules predicted as fp, and the number of modules that can be inspected, i.e., N_max. The sixth column indicates the modified expected cost of misclassification values for the respective classification models. The last column represents the performance accuracy of the classification model as defined by (7), i.e., the percentage of the predicted fp modules which are actually fp.

As an illustration, consider the classification model shown in the first row of Table II, i.e., for c = 2, and λ = 0.05. In the case of the fit data set, the model predicts 41 modules as fp. When comparing this number to N_max = 40, we observe that the model performs exceptionally well. Moreover, the performance of the model is perfect, implying that all modules predicted as fp are actually fp. Hence, we see that the quality-of-fit of this model is excellent. The Type II error rate is very large because the model is forced to optimize the predicted number of fp modules to 40, implying that the remaining actual fp modules are misclassified. Let us now examine the predictive capability of this model, i.e., its performance on the test data set. We observe that 23 modules are classified as fp, as compared to N_max = 20, implying the model performs very well. In addition, the performance implies that over 90% of the predicted fp modules are actually fp.

TABLE II
CLASSIFICATION MODELS FOR c = 2

Fig. 1. GP decision tree model for c = 10, and λ = 0.40.

An example binary decision tree model obtained for the case study is presented in Fig. 1. The model corresponds to a cost ratio of 10, and a resource availability of 0.40. The leaf nodes are labeled as either fp, or nfp. A non-leaf tree node shows a specific software metric, and its threshold, which is used to identify the subsequent traversal path. For example, the root node indicates that if LOCA (the number of lines of comments for the module prior to the coding phase) is greater than 3, then the right subtree


of the root node is traversed, where other conditions are tested; otherwise, the module is classified as nfp.

TABLE III
CLASSIFICATION MODELS FOR c = 10
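Fig. 1 itself is not reproduced in this transcript. As an illustration of how such a decision tree classifies a module, the Python sketch below hard-codes the quoted root test (LOCA > 3); the right-subtree conditions and the other metric names are hypothetical placeholders, not the published model.

```python
def classify(module):
    """Traverse a Fig. 1-style GP decision tree for one module.

    `module` maps metric names to values. Only the LOCA > 3 root test is
    taken from the text; the remaining tests are invented examples."""
    if module["LOCA"] > 3:
        # Right subtree: further (hypothetical) conditions are tested.
        if module["INSP"] >= 2 and module["LOC_ST"] > 500:
            return "fp"
        return "nfp"
    return "nfp"    # otherwise the module is classified as nfp

print(classify({"LOCA": 5, "INSP": 3, "LOC_ST": 800}))   # -> fp
```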

The performance details of the GP model for the fit and test data sets can be observed in Table III. The table indicates that as λ increases, the accuracy performance decreases. Moreover, as λ increases, we observe an inverse relationship between the Type I, and Type II, error rates; i.e., as λ increases, the Type I error rate increases while the Type II error rate decreases. For the different λ values, the number of modules predicted as fp by the model is very close to the number of modules which can be inspected, suggesting an optimized solution with respect to resource utilization. Similar observations were also made from the performances for the other cost ratios. An intuitive analysis of why the performance decreases with λ is now presented.

Assume that modules are scattered across a spectrum with respect to their predicted class, such that the left end represents the least faulty software module, while the right end represents the most faulty module. Based on this assumption, the actual fp modules are likely to be concentrated towards the right; the actual nfp modules are likely to be concentrated towards the left; and the middle portion of the spectrum will be comprised of an intermixed collection of fp and nfp modules. It is at this middle portion of the spectrum that the misclassification errors are more likely to occur, primarily because of data points that do not follow the general trend of the data set. Therefore, for inspection purposes, for a given amount of resources, the needed modules would be picked starting from the right end of the spectrum.

When λ = 0.05, it was observed that, for both the fit and test data sets, the accuracy performance was very good, i.e., 90% to 100%, across all the cost ratios. This is analogous to selecting 5% of the modules starting from the right end of the spectrum, which are more likely to be actually fp. On the other hand, at the largest λ value considered, we observed that the performance was only about 40% to 48% for all the cost ratios. The performance reduction is intuitive because, when a larger percentage of modules can be inspected, many nfp modules will invariably be flagged as fp (as per the spectrum analogy presented earlier), thus lowering the performance of the model, and correspondingly increasing the Type I error rate. The relationship of a decrease in accuracy with an increase in the Type I error rate is seen in the tables. Moreover, when comparing the MECM values of the quality-of-fit (fit data), and predictive-quality (test data) performances of the different classification models, we note that the respective models do not demonstrate over-fitting tendencies, and generally maintain the achieved quality-of-fit performance.

In a related empirical case study of the software system presented in Section IV-A, we applied our GP-based classification technique to build classification models which predict modules as either change-prone, or not change-prone. The module-quality metric (dependent variable) in that study was the number of lines of code churn, which was defined as the summation of the number of lines added, deleted, or modified during system test. The empirical observations and conclusions made from that study were similar to those presented in this paper. Due to the similarity of results, and paper-size concerns, we have not included those results in this paper.

    VI. CONCLUSION

This study presents a genetic programming-based multi-objective optimization modeling technique for calibrating a goal-oriented software quality classification model geared toward cost-effective resource utilization. Using case studies of two wireless configuration applications, software quality classification models are calibrated. The effective multi-objective optimization capability of GP is demonstrated through a representative case study in which models were calibrated to predict software modules as either fault-prone, or not fault-prone.

Multi-objective optimization is often a practical need in many real-world problems, such as software quality estimation modeling. An advantage of GP is that it does not require extensive mathematical assumptions about the size, and structure, of the optimization problem. In software engineering, the importance of making the best of the limited software inspection and testing resources is well founded. In addition to effective resource utilization, the project-specific costs of misclassifications also affect the usefulness of classification models.

Building on our previously developed GP-based software quality classification modeling technique, this study focuses on calibrating models which optimize the following criteria
