[IEEE 2011 International Conference on Wavelet Analysis and Pattern Recognition (ICWAPR), Guilin, 10-13 July 2011]
978-1-4577-0282-2/11/$26.00 © 2011 IEEE

CLASSIFYING FEATURE DESCRIPTION FOR SOFTWARE DEFECT PREDICTION

LING-FENG ZHANG, ZHAO-WEI SHANG

College of Computer Science, Chongqing University, Chongqing 400030, China E-MAIL: [email protected], [email protected]

Abstract: To overcome the limitations of the numeric feature description of software modules in software defect prediction, we propose a novel module description technology which employs classifying features, rather than numeric features, to describe software modules. Firstly, we construct an independent classifier on each software metric. Then the classification results on each feature are used to represent every module. We apply two different feature classifier algorithms (based on the mean criterion and the minimum error rate criterion, respectively) to obtain the classifying feature description of software modules. With the proposed description technology, the discrimination of each metric is enlarged distinctly. Classifying feature description is also simpler than numeric description, which accelerates prediction model learning and reduces the storage space of massive data sets. Experimental results on four NASA data sets (CM1, KC1, KC2 and PC1) demonstrate the effectiveness of classifying feature description, and show that our algorithms can significantly improve the performance of software defect prediction.

Keywords: Feature classifier description; binary classification; software defect prediction

1. Introduction

As software systems grow in size and complexity, it becomes increasingly difficult to maintain the reliability of software products. Software defects are usually the major factor influencing software reliability. The majority of a system’s faults, over 80%, exist in about 20% of its modules, which is known as the “80:20” rule [1]. Thus, the ability to estimate which modules are faulty is extremely important for minimizing cost and improving the effectiveness of the software testing process. The early prediction of the fault-proneness of modules also allows software developers to allocate limited resources to those defect-prone modules, such that highly reliable software can be produced on time and within budget [2].

Software fault-proneness is estimated based on software metrics, which provide quantitative descriptions of software modules. A number of studies provide empirical evidence that correlations exist between some software metrics and fault-proneness [3]. Using these metric features, software defect prediction is usually viewed as a binary classification task, which classifies software modules into fault-prone (fp) and non-fault-prone (nfp). Many machine learning and statistical techniques have been applied to construct prediction models based on the measurement of static code attributes [4][5], for example Discriminant Analysis, Logistic Regression, Regression Trees, Nearest Neighbor (NN), Random Forest, Bayes, Artificial Neural Networks and Support Vector Machines (SVM).

The common point of previous studies is that each module is represented by the numeric software metrics directly; the attribute types of each metric are real, categorical and integral. We call this numeric feature description. It is also the common description technology used in pattern recognition and machine learning. Nevertheless, due to the complex relationships between software metrics and fault-proneness, this kind of representation limits the classification effectiveness of each software metric. Indeed, if we treat each software metric individually, the values of the metric lack discrimination. Take McCabe’s EV(g), a frequently-used software metric, for example. Figure 1 shows the value distribution of this metric in the CM1 data set.

Figure 1. Distribution of McCabe’s EV(g) metric in the CM1 data set. The numeric values of the metric are divided into 10 intervals ranging from 1 to 30. The vertical axis shows the proportion of each class’s values that fall within each interval.


From Figure 1 we can see that the values of the metric overlap in each interval, and the distributions of the two classes are roughly the same. That is, under numeric description this metric contains little classification information. The same phenomenon also occurs in the other software metrics.

To increase the classification effectiveness of each feature, we propose a novel feature description technology named classifying feature description. Firstly, we construct an independent classifier on each software metric. Then the classification results on each feature, called classifying features, are used to represent the software modules. In this paper, we obtain the classifying feature description of software modules by two different feature classifier algorithms, based on the mean criterion and the minimum error rate criterion, respectively. Using classifying features rather than numeric features brings the following advantages:

(1) Classifiers on each feature expand the classification effectiveness of each software metric, and may obtain higher classification accuracy. Classifying feature description does not change the feature dimension of each module; as a consequence, standard machine learning prediction techniques, originally designed for numeric features, remain applicable in the classifying feature space.

(2) Classifying features are simpler, which accelerates model learning on the software defect data sets. Software defect prediction is a standard binary classification problem; thus the classification result of each feature classifier is also a binary value, which makes further operations more convenient.

(3) Compared with traditional numeric feature description, classifying feature description occupies less storage space. The binary classifying feature description of a software module can be viewed as a sparse representation of the numeric features, which makes the storage of massive data sets feasible.
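To make the storage claim concrete, here is a minimal sketch (not from the paper) using NumPy's `packbits`; the data set size is hypothetical:

```python
import numpy as np

# Hypothetical data set: 10,000 modules, 21 binary classifying features.
rng = np.random.default_rng(0)
cf = rng.integers(0, 2, size=(10_000, 21), dtype=np.uint8)

# Numeric metrics are commonly stored as 64-bit floats: 8 bytes per value.
numeric_bytes = cf.astype(np.float64).nbytes

# Binary classifying features can be packed to 1 bit per value
# (each row of 21 bits is padded up to 3 bytes).
packed = np.packbits(cf, axis=1)
print(numeric_bytes, packed.nbytes)  # 1680000 30000
```

Even with per-row byte padding, the packed representation is 56 times smaller here; storing the unpacked 0/1 values as `uint8` would still be an 8-fold saving.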

The remainder of this paper is organized as follows. Section 2 introduces the learning model based on numeric software metrics for software defect prediction. The proposed classifying feature description is described in Section 3. The feature classifier algorithms are discussed in Section 4. Section 5 presents the experiments, in which the performance of classifying feature description is tested in detail. Finally, conclusions are presented in Section 6.

2. Software defect prediction learning model

A number of studies provide empirical evidence that correlations exist between some software metrics and fault-proneness. Thus, we can mathematically describe the software defect prediction model as a binary classification task.

Let us assume the training set $S_{tr} = \{X, Y\} = \{(x_i, y_i)\}_{i=1}^{N}$, with $N \in \mathbb{N}^+$, $x_i \in \mathbb{R}^d$ and $y_i \in \{0, 1\}$. Each instance $x_i$ is represented by $\{x_{i1}, x_{i2}, \ldots, x_{id}\}$, where $x_{ij}$ is the representation of $x_i$ in the $j$-th feature. $x_i$ can be viewed as a point in the $d$-dimensional feature space $\mathbb{R}^d$. $y_i$ denotes the class label associated with $x_i$, fault-prone or non-fault-prone.

The attribute types of each feature $F_i = \{x_{1i}, x_{2i}, \ldots, x_{ni}\}$ can be real, categorical or integral. Traditionally, prediction models are constructed on the feature space $\mathbb{R}^d$ directly. For a test module $x' = (x'_1, x'_2, \ldots, x'_d)$, we obtain the prediction result $y'$ (fault-prone or non-fault-prone) from the model determined by the training set. Figure 2(a) shows the traditional learning process of the model.

The success of prediction model learning relies mainly on the technology used to represent the modules, as well as on the prediction model operating on the training set. Various aspects of prediction models have been studied based on machine learning strategies. However, the equally important representation technology is mostly ignored in the existing literature. In fact, a suitable software module description is the basis of a successful prediction model.

To increase the classification effectiveness of each metric, we propose a novel module representation technology called classifying feature, which is obtained by the feature classifiers constructed on each software metric. The implementation process of the software defect prediction model based on classifying features is shown in Figure 2(b).

Figure 2. Software defect prediction learning model. (a) The traditional model is trained on the numeric training set $S_{tr} = \{X, Y\}$ and predicts $y'$ for a test module $x'$. (b) The proposed model applies the feature classifiers $h(x)$ to obtain the classifying feature training set $S_{tr}^C = \{X^C, Y\}$ and predicts $y'^c$ for the transformed test module $x'^c$.


Using the feature classifiers, both the training module set and the test modules are described by classifying features before a prediction model is built.

3. Classifying feature description

Let $S_{tr}^C = \{X^C, Y\} = \{(x_i^c, y_i)\}_{i=1}^{N}$ denote the classifying feature training set, with $N \in \mathbb{N}^+$, $x_i^c \in \mathbb{R}'^d$ and $y_i \in \{0, 1\}$. Each instance $x_i^c = [x_{i1}^c, x_{i2}^c, \ldots, x_{id}^c]$ is a point in the $d$-dimensional binary classifying feature space $\mathbb{R}'^d$. $y_i$ denotes the class label associated with $x_i^c$, corresponding to the label of $x_i$. The attributes of each classifying feature $CF_i = \{x_{1i}^c, x_{2i}^c, \ldots, x_{ni}^c\}$ take values in $\{0, 1\}$. Let $h(x) = \{h_1(x), h_2(x), \ldots, h_d(x)\}$ denote the feature classifiers on the features. The classification results of $h(x)$ are equal to the class labels of the original training set, which can be described as:

$$h(x): x \to \{0, 1\} \qquad (1)$$

In other words, $h(x)$ represents the mapping $\Phi$ from the numeric feature space $\mathbb{R}^d$ to the classifying feature space $\mathbb{R}'^d$:

$$\Phi: \mathbb{R}^d \xrightarrow{h(x)} \mathbb{R}'^d \qquad (2)$$

According to the above definition, each instance of the training set has a simpler description in which all classifying features are represented by 0 and 1. Figure 3 shows the implementation process of classifying feature description. The key idea is that the classifiers on each feature help enlarge the discrimination of each metric between the two classes.

Figure 3. Classifying feature description

Figure 4 shows an ideal classification example under numeric feature description and classifying feature description. In this example, the two classes are represented by stars and diamonds. Each module has two metric features, x and y, which are normalized between 0 and 1. Figure 4(a) shows the distribution of the numeric features in the 2D feature space, where the two classes are linearly separable. Now, we construct an independent classifier on each feature; a simple threshold classifier is chosen in this example, with 0.5 as the threshold of both features. If the value of a feature is greater than the threshold, the classification result is 1; if the value is smaller, the classification result is 0. We can then obtain the classifying feature description of each instance from the binary classification results. Figure 4(b) shows the distribution of the classifying features of the two classes.

Figure 4. Ideal classification problem with numeric features (a) and classifying features (b)

In this example, we apply a feature threshold classifier to obtain the binary classifying feature representation of each metric. Under the classifying feature description, the modules in the two classes are represented as (0,1) and (1,0). From Figure 4 we can see that the classifying feature representation significantly increases the average distance and margin between the two classes. Also, the simple description by 0 and 1 reduces the storage space of the data and simplifies further operations.
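The toy example above can be sketched in a few lines of Python (the point coordinates are illustrative, not taken from the paper; only the separability and the 0.5 thresholds follow the text):

```python
# Two linearly separable classes in [0, 1]^2, one threshold classifier
# (t = 0.5) per feature, as in the Figure 4 example.
stars    = [(0.1, 0.8), (0.2, 0.9), (0.3, 0.7)]   # class shown as stars
diamonds = [(0.8, 0.2), (0.9, 0.1), (0.7, 0.3)]   # class shown as diamonds

def classifying_feature(point, thresholds=(0.5, 0.5)):
    # Per-feature binary result: 1 if the value exceeds the threshold, else 0.
    return tuple(int(v > t) for v, t in zip(point, thresholds))

print({classifying_feature(p) for p in stars})      # {(0, 1)}
print({classifying_feature(p) for p in diamonds})   # {(1, 0)}
```

Every star collapses to the single point (0,1) and every diamond to (1,0), which is exactly the margin-widening effect the paper describes.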

4. Feature classifier algorithm

In a binary classification task, the classification result of each feature classifier is 0 or 1, so all classifying features are represented by 0 and 1. The aim of the feature classifiers is to expand the difference between the two classes using a simple classification rule. In this paper, each feature classifier is defined by a measure on the value of $x_{ij}$, which can be as simple as a threshold classifier. The classifier $h_j(x)$ is defined as follows:

$$h_j(x_{ij}) = \begin{cases} 1 & \text{if } x_{ij} > threshold_j \\ 0 & \text{else} \end{cases} \qquad (3)$$

Let $T = \{t_1, t_2, \ldots, t_d\}$ denote the threshold set of the features, determined by the training set. The classification results of each feature classifier are used as the new features of each module. In this way, all classifying features are represented by 0 and 1. The classifying feature of each module can be defined as:


$$x_{ij}^c = \begin{cases} 1 & \text{if } x_{ij} > t_j \\ 0 & \text{else} \end{cases} \qquad (4)$$

From the thresholds obtained on the training set, we can get the classifying feature description of a numeric test sample. Taking $x' = (x'_1, x'_2, \ldots, x'_d)$ for example, the classifying feature description $x'^c$ is obtained by:

$$x_j'^c = \begin{cases} 1 & \text{if } x'_j > t_j \\ 0 & \text{else} \end{cases} \qquad (5)$$

Learning a good threshold plays a crucial role in each feature classifier. In this paper, the optimal threshold of each attribute is determined by two different criteria: the mean criterion and the minimum error rate criterion.

4.1. Mean criterion

This criterion assumes that the values of each feature in the two classes obey a uniform distribution, so the mean of a feature can represent its values. We then choose the midpoint between the two class means of each feature as the classification threshold. Let $X^+ = \{x_i^+, i = 1, \ldots, n^+\}$ and $X^- = \{x_i^-, i = 1, \ldots, n^-\}$ denote the fault-prone and non-fault-prone subsets, respectively. The mean of each feature in the two classes is calculated by:

$$m_i^+ = \frac{1}{n^+} \sum_{j=1}^{n^+} x_{ji}^+, \qquad m_i^- = \frac{1}{n^-} \sum_{j=1}^{n^-} x_{ji}^- \qquad (6)$$

The threshold of the mean criterion is defined as:

$$t_i = \frac{m_i^+ + m_i^-}{2} \qquad (7)$$

The pseudo-code of the feature classifier algorithm based on mean criterion is listed in Figure 5.

Feature Classifier Algorithm 1
Input:  X  /* training set */
        Y  /* class labels */
Variables:
        x_ij  /* the j-th feature of the i-th instance */
        y_i   /* the class label of the i-th instance */
Output: T    /* thresholds */
        X^T  /* classifying feature training set */
BEGIN
1. for i = 1 to d do
2.    Calculate m_i^+ and m_i^-.
3.    t_i = (m_i^+ + m_i^-) / 2
4.    x_ij^T = 1 if x_ij > t_i, else 0
5. end for
END

Figure 5. The pseudo code of feature classifier algorithm based on mean criterion
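Algorithm 1 can be sketched in Python as follows (an illustrative implementation, not the authors' code; the function names and the small example data are our own):

```python
import numpy as np

def mean_criterion_thresholds(X, y):
    """Algorithm 1: per-feature threshold = midpoint of the two class means."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    m_pos = X[y == 1].mean(axis=0)   # feature means over fault-prone modules
    m_neg = X[y == 0].mean(axis=0)   # feature means over non-fault-prone modules
    return (m_pos + m_neg) / 2.0

def to_classifying_features(X, t):
    # Eq. (4): 1 if the metric value exceeds that feature's threshold, else 0.
    return (np.asarray(X, dtype=float) > t).astype(np.uint8)

X = [[1, 10], [2, 20], [8, 30], [9, 40]]
y = [0, 0, 1, 1]
t = mean_criterion_thresholds(X, y)      # thresholds [5.0, 25.0]
print(to_classifying_features(X, t))     # rows: [0 0], [0 0], [1 1], [1 1]
```

The same `to_classifying_features` call, with thresholds fixed from the training set, also transforms test modules as in Eq. (5).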

4.2. Minimum error rate criterion

This criterion is based on a complete search, which ensures that the minimum number of instances in the training set is misclassified. The threshold $t_i$ of a feature is chosen from all the feasible interval values. Firstly, the values of $F_i$ are sorted from small to large, represented as $F_i^* = \{x_1^*, x_2^*, \ldots, x_n^*\}$. Then, each interval value of the sorted feature is taken as a candidate threshold $threshold_{ij}$, and an error rate $Error_{ij}$ is calculated for it. The threshold that leads to the minimum error rate is chosen for the feature, which is described as:

$$t_i = threshold_{ij^*}, \quad j^* = \arg\min_j Error_{ij} \qquad (8)$$

The pseudo-code of the feature classifier algorithm based on minimum error rate criterion is listed in Figure 6.

Feature Classifier Algorithm 2
Input:  X  /* training set */
        Y  /* class labels */
Variables:
        x_ij  /* the j-th feature of the i-th instance */
        y_i   /* the class label of the i-th instance */
Output: T    /* thresholds */
        X^T  /* classifying feature training set */
BEGIN
1.  for i = 1 to d do
2.     F_i^* = {x_1^*, x_2^*, ..., x_n^*} = sort(F_i, ascending)
3.     for j = 1 to n do
4.        if j = 1
5.           threshold_ij = x_1^* - ε
6.        else if j = n
7.           threshold_ij = x_n^* + ε
8.        else
9.           threshold_ij = 0.5 * (x_j^* + x_{j+1}^*)
10.       endif
11.       Error1 = (num(x^+ < threshold_ij) + num(x^- > threshold_ij)) / n
12.       Error2 = (num(x^+ > threshold_ij) + num(x^- < threshold_ij)) / n
13.       Error_ij = min(Error1, Error2)
14.    endfor
15.    t_i = threshold_ij*, where j* = argmin_j Error_ij
16.    x_ij^T = 1 if x_ij > t_i, else 0
17. endfor
END

Figure 6. The pseudo code of the feature classifier algorithm based on the minimum error rate criterion
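Algorithm 2 for a single feature can be sketched as follows (illustrative only; the candidate-threshold enumeration and tie handling are our own choices, and the example data is hypothetical):

```python
import numpy as np

def min_error_threshold(values, y, eps=1e-6):
    """Exhaustive search over interval midpoints, as in Algorithm 2."""
    v = np.asarray(values, dtype=float)
    y = np.asarray(y)
    s = np.sort(v)
    n = len(s)
    # Candidates: below the minimum, between neighbours, above the maximum.
    candidates = ([s[0] - eps]
                  + [0.5 * (s[j] + s[j + 1]) for j in range(n - 1)]
                  + [s[-1] + eps])
    best_t, best_err = candidates[0], np.inf
    for t in candidates:
        # Either polarity may be the better one; keep the smaller error (step 13).
        err1 = (np.sum((v <= t) & (y == 1)) + np.sum((v > t) & (y == 0))) / n
        err2 = (np.sum((v > t) & (y == 1)) + np.sum((v <= t) & (y == 0))) / n
        err = min(err1, err2)
        if err < best_err:
            best_t, best_err = err and best_t or t, err  # keep first minimum
            best_t = t
    return best_t, best_err

values = [1, 2, 3, 8, 9, 10]
y      = [0, 0, 0, 1, 1, 1]
t, err = min_error_threshold(values, y)
print(t, err)   # 5.5 0.0
```

For this separable toy feature, the midpoint 5.5 between the two class ranges yields a zero training error rate, matching the behaviour described for Figure 7(b).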

With the above, we have constructed classifiers on the features and obtained the classifying feature description of the software modules by the two feature classifier algorithms. Figure 7 shows the distribution of the classifying feature of McCabe’s EV(g) in the CM1 data set.

Figure 7. Distribution of the classifying feature of McCabe’s EV(g): (a) based on the mean criterion; (b) based on the minimum error rate criterion.

Compared with the distribution of the numeric feature in Figure 1, the classifying features of the metric are much more separable between the two classes. In particular, under the description based on the minimum error rate criterion, if we treat this metric as the single feature of the modules, only a few modules are misclassified. The key idea of classifying feature description is to improve the classification performance of the features through the classifiers constructed on them.

5. Experiments

In this section, we evaluate the effectiveness of the proposed Classifying Feature (CF) description and the two feature classifier algorithms, denoted CF1 (based on the mean criterion) and CF2 (based on the minimum error rate criterion). The experiments are conducted on four benchmark data sets (KC2, KC1, CM1 and PC1), which are publicly accessible from the NASA IV&V Facility Metrics Data Program. Each data set contains twenty-one metrics as features and an associated Boolean dependent variable, fault-prone or non-fault-prone. The performance of software defect prediction is typically evaluated using a confusion matrix, which is shown in Table 1. In this section we use the commonly used performance measures: accuracy, precision, recall and F-measure. The prediction results of three algorithms (NN, Bayes and SVM) on the four data sets are shown in Tables 2-5 (%).

Table 1. Confusion matrix

                          Predicted Non-Fault-Prone   Predicted Fault-Prone
Actual Non-Fault-Prone    TN (True Negative)          FP (False Positive)
Actual Fault-Prone        FN (False Negative)         TP (True Positive)
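The four measures used in Tables 2-5 follow directly from the confusion matrix; a minimal sketch (the counts below are illustrative, not taken from the paper):

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F-measure from confusion-matrix counts."""
    accuracy  = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

# Illustrative counts, reported as percentages like Tables 2-5:
print([round(100 * m, 2) for m in metrics(tp=40, fp=10, fn=20, tn=130)])
# [85.0, 80.0, 66.67, 72.73]
```

The F-measure is the harmonic mean of precision and recall, which is why it is the most informative single column when the two move in opposite directions (as with SVM's high recall but low precision in Table 2).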

Table 2. Prediction results on CM1 dataset

Accuracy Precision Recall F-measure

NN 62.2530 63.7050 62.1667 62.1868

NN+CF1 69.8051 73.7753 65.9792 68.4305

NN+CF2 71.2872 70.7661 77.1458 73.1453

Bayes 65.6131 76.3280 48.9375 56.0973

Bayes+CF1 67.2842 75.2083 54.7083 62.2218

Bayes+CF2 69.9048 69.4953 76.0208 71.8738

SVM 57.4464 50.0080 88.1667 63.3539

SVM+CF1 70.7560 70.6232 74.6875 71.8267

SVM+CF2 70.5476 70.4383 75.3958 72.0681

Table 3. Prediction results on KC1 dataset

Accuracy Precision Recall F-measure

NN 71.7895 73.8791 70.3292 71.9415

NN+CF1 73.1457 77.7793 67.8858 72.3474

NN+CF2 75.9856 75.0230 80.6224 77.4708

Bayes 65.6968 82.3197 43.0195 56.1950

Bayes+CF1 70.0498 78.9030 57.7469 66.5471

Bayes+CF2 73.8061 73.8372 76.7078 74.9661

SVM 69.8030 63.8792 98.1430 77.2561

SVM+CF1 74.0659 75.4972 74.3673 74.7933

SVM+CF2 74.3654 72.2117 82.4023 76.7917

Table 4. Prediction results on KC2 dataset

Accuracy Precision Recall F-measure

NN 69.1420 70.9515 67.7644 68.9117

NN+CF1 78.0460 82.2119 72.9567 76.9966


NN+CF2 76.4909 76.6820 77.9327 76.9494

Bayes 69.6679 87.0080 47.6442 60.7960

Bayes+CF1 75.8396 85.2068 63.9183 72.6060

Bayes+CF2 77.8562 78.6577 78.1490 78.1509

SVM 63.1687 58.5960 95.6010 72.5795

SVM+CF1 77.5370 78.4068 77.5721 77.7266

SVM+CF2 77.7838 76.8082 81.5385 78.8313

Table 5. Prediction results on PC1 dataset

Accuracy Precision Recall F-measure

NN 62.1241 63.4266 62.2556 62.3820

NN+CF1 67.1761 72.3669 59.2199 64.0422

NN+CF2 73.9615 74.8950 74.0414 73.9790

Bayes 65.0083 81.9888 39.6992 51.8326

Bayes+CF1 65.4569 74.1550 50.2726 59.2221

Bayes+CF2 67.6919 68.9138 67.4530 67.4687

SVM 58.6220 87.0081 34.6992 36.2836

SVM+CF1 65.7736 67.9103 63.4868 64.8424

SVM+CF2 69.3031 68.7413 73.9944 70.7430

From Table 2 to Table 5, on all four data sets classifying feature description achieves the highest results in both accuracy and F-measure under all three classifiers. In precision, except for Bayes, classifying feature description performs better than the numeric description. In recall, classifying feature achieves higher prediction results with NN and Bayes (with SVM, higher on PC1 only). Compared to numeric feature description, the two feature classifier algorithms provide improvements in most of the measures. To confirm this, we compute the average prediction results over the four data sets, which are shown in Figure 8.

From Figure 8, it can be observed that classifying feature description outperforms numeric feature description in accuracy under all three classifiers, by margins from 3.16% (Bayes, CF1) to 10.14% (SVM, CF2). The F-measure considers the harmonic mean of precision and recall; the feature classifier algorithms achieve a higher F-measure than the baselines, with significant gains from 4.10% (NN, CF1) to 16.88% (Bayes, CF2).

6. Conclusions

This paper proposed a novel feature description method, called classifying feature, for software modules in software defect prediction. The main advantage of this description in comparison to traditional numeric metrics is that the classification effectiveness of each metric is improved. For future work, we will investigate the applicability of the classifying feature description to other domains and generalize it to multi-class classification problems.

Figure 8. Average results on the four datasets: (a) accuracy; (b) precision; (c) recall; (d) F-measure.

Acknowledgements

This work is supported by Projects No. CDJXS10182216 and No. CDJRC10180009 of the Fundamental Research Funds for the Central Universities, and by Project No. CSTC2010BB2217 of the Natural Science Foundation of Chongqing.

References

[1] Gondra, I., “Applying machine learning to software fault-proneness prediction”, The Journal of Systems and Software, Vol. 81, pp. 186-195, 2008.

[2] Zheng, J., “Cost-sensitive boosting neural networks for software defect prediction”, Expert Systems with Applications, Vol. 37, pp. 4537-4543, 2010.

[3] Gill, G. and Kemerer, C., “Cyclomatic complexity density and software maintenance productivity”, IEEE Transactions on Software Engineering, Vol. 17, No. 12, pp. 1284-1288, 1991.

[4] Guo, L., Ma, Y., Cukic, B. and Singh, H., “Robust Prediction of Fault-Proneness by Random Forests”, Proceedings of the 15th International Symposium on Software Reliability Engineering(ISSRE’04), pp. 417-428, 2004.

[5] Guang-jie, L. and Wen-yong, W., “Research on an educational software defect prediction model based on SVM”, Entertainment for Education. Digital Techniques and Systems, Vol. 6249, pp. 215-222, 201
