
A comparative analysis of classification algorithms in data mining for accuracy, speed and robustness

Neslihan Dogan • Zuhal Tanrikulu

N. Dogan (✉), PricewaterhouseCoopers LLP, London, UK. e-mail: [email protected]
Z. Tanrikulu, Department of Management Information Systems, Bogazici University, Istanbul, Turkey. e-mail: [email protected]

Published online: 12 August 2012
© Springer Science+Business Media, LLC 2012
Inf Technol Manag (2013) 14:105–124. DOI 10.1007/s10799-012-0135-8

Abstract  Classification algorithms are the most commonly used data mining models, widely applied to extract valuable knowledge from huge amounts of data. The criteria used to evaluate classifiers are mostly accuracy, computational complexity, robustness, scalability, integration, comprehensibility, stability, and interestingness. This study compares classification algorithm accuracies, speed (CPU time consumed) and robustness for various datasets and implementation techniques. The data miner selects the model mainly with respect to classification accuracy; therefore, the performance of each classifier plays a crucial role in selection. Complexity is mostly dominated by the time required for classification, so the CPU time consumed by each classifier is used as the complexity measure here. The study first discusses the application of certain classification models to multiple datasets in three stages: first, implementing the algorithms on the original datasets; second, implementing the algorithms on the same datasets where continuous variables are discretised; and third, implementing the algorithms on the same datasets where principal component analysis is applied. The accuracies and the speed of the results are then compared. The relationship between dataset characteristics and implementation attributes on the one hand and accuracy and CPU time on the other is also examined and discussed. Moreover, a regression model is introduced to show the overall effect of dataset and implementation conditions on classifier accuracy and CPU time. Finally, the study addresses the robustness of the classifiers, measured by repetitive experiments on both noisy and cleaned datasets.

Keywords  Classification · Data mining · Mining methods and algorithms

1 Introduction

Classification or prediction tasks are the most widely used types of data mining. Classification algorithms are supervised methods that look for and discover the hidden associations between the target class and the independent variables [45]. Supervised learning algorithms allow tags to be assigned to the observations, so that unobserved data can be categorized based on the training data [25]. The task, model structure, score function, search method, and data management method are the main components of each algorithm [26]. Image and pattern recognition, medical diagnosis, loan approval, fault detection, and financial trend analysis are among the best-known examples of classification tasks [20].

Before utilizing a model produced by a classification algorithm, that model is assessed with respect to some specific criteria. The model will probably result in certain errors; therefore, the data miner should take this possibility into account when selecting a model [17]. Accuracy, or the percentage of instances that are correctly classified by the model, is the most commonly used decision criterion in most model assessments [25].

However, other criteria can also be used to compare and evaluate the models. Berson et al. [8] define these assessment concepts as accuracy, explanation, and integration abilities.

Maimon and Rokach [45] introduce these comparison criteria as the generalization error of the model; the computational complexity, or the amount of CPU time consumed by the inducer; the comprehensibility, or ability to understand the model; the scalability, or ability to run efficiently on larger databases; the robustness, or ability to handle missing or noisy data; the stability, or ability to produce repeatable results on different datasets; and lastly, the interestingness, or the ability of the classifier to generate valid and new knowledge.

Before implementing classification algorithms, it is recommended that incomplete, noisy, or inconsistent datasets be preprocessed to make the knowledge discovery process easier and of higher quality. The best-known steps in this process are summarization, cleaning, integration and transformation, data and dimensionality reduction, and discretisation [25]. Discretisation and dimension reduction lie within the scope of this study.

Data discretisation techniques can be used to reduce the number of values for a given continuous variable by splitting the range of that variable into intervals. Binning, for example, is a discretisation technique in which the variable is split into a particular number of bins. Dimension reduction is another preprocessing technique used to obtain a reduced dataset that still represents the original dataset. The most commonly used dimension reduction technique is principal component analysis (PCA). "PCA searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k ≤ n. The original data are thus projected onto a smaller space" [25].
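For readers who want to experiment with these two preprocessing steps, the sketch below shows the general idea in Python with pandas and scikit-learn; the paper itself uses WEKA and SPSS for this, and the toy data, column names and the choice of k here are illustrative assumptions rather than anything taken from the study.

```python
# Minimal sketch of the two preprocessing ideas above, under the assumptions stated in
# the text: binning a continuous variable into intervals, and PCA as a projection of
# n-dimensional data onto k orthogonal component vectors (k <= n).
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("abcde"))  # toy data, n = 5

# Discretisation by binning: replace a continuous column with a small set of intervals.
data["a_binned"] = pd.cut(data["a"], bins=4, labels=["b1", "b2", "b3", "b4"])

# Dimension reduction: project the numeric columns onto k = 2 orthogonal directions.
pca = PCA(n_components=2).fit(data[list("abcde")])
projected = pca.transform(data[list("abcde")])
print(pca.components_.shape, projected.shape)  # (k, n) directions, (n_rows, k) scores
```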

As data volumes increase in real life today, it is becoming harder to make valuable and significant decisions from that growing data. In such situations, data mining is commonly used to extract the concealed knowledge from large amounts of data [25]. The predictive power of data mining classification algorithms has been appealing for many years. Numerous studies have thus concentrated on proposing new classification models, comparing the existing models, or identifying important factors that affect a model's performance.

Quinlan states that it is not an easy task to determine that one algorithm is always superior to others, and he links the capabilities of models to task dependency. His study compares decision tree and network algorithms and concludes that parallel-type problems are not common and that sequential-type problems are not suited to back-propagation [53]. In an additional study, algorithms such as LARCKDNF, IEKDNF, LARC, BPRC and IE were compared on three tasks, and different results were found for each one [33]. Hacker and Ahn conducted another comparative experiment that focused on eliciting user preferences; they compared many methods and recommended a new classifier called relative SVM, which outperformed the others [24]. Further, Putten et al. [51] compared the AIRS algorithm to other algorithms and found no significant evidence that it consistently outperformed them. Another study implemented Naïve Bayesian, a decision tree, KNN, NN and M5 to predict the lifetime of metallic components and stated that methods dealing directly with continuous variables perform better [23]. In a comparative paper, the authors selected 16 model selection schemes and 58 benchmark datasets. Their paper first indicated the rationale and complexity of the schemes as a reference guide; secondly, it provided a bias-variance analysis of each scheme's classification performance; and lastly, it determined the effective schemes that can meet practical requirements [65]. In another study, the same authors point to useful data mining implications in trying to understand whether meaningful relationships can be found in soil profile data from different regions. They used data collected from the WA Department of Agriculture and Food (AGRIC) soils database and compared those data mining methods to existing statistical methods [4]. The importance of feature selection is emphasized in a study by He et al. [27], where decision tree and regression methods are applied to breastfeeding survey data. In another comparative study, the authors compared 22 decision tree, nine statistical, and two neural network algorithms on 32 datasets with respect to classification accuracy, training time, and number of leaves (in decision tree algorithms). They concluded that the algorithm called POLYCLASS ranked top in terms of accuracy (mean error rate and mean rank of error rate), the second best algorithm was logistic regression, and QUEST was the most accurate decision tree algorithm. In terms of training duration, the statistical algorithms needed relatively longer training times [43]. A variety of techniques can be used for the detection of network intrusions, ranging from traditional statistical methods to data mining approaches. In another comparative study, the authors compared rough sets, neural networks and inductive learning methods for detecting network intrusion. The results indicate that the data mining method and the data proportion have a significant impact on classification accuracy: rough sets achieved better accuracy, and balanced data proportions performed better than unbalanced ones [66]. Moreover, the authors of the 'Top 10 algorithms' book presented C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naïve Bayes, and CART; provided a description of the algorithms; discussed their impact; and reviewed current and further research on them [64].

Jamain and Hand conducted an interesting study collecting and comparing the comparative studies of classification methods. They claimed that most results are inconclusive and limited. The authors reviewed the literature and created a dataset of 5807 results; they also presented a method to assess the overall methods [31]. In another study, Kim et al. [37] proposed two methods of their own, per-document text normalization and feature weighting, so as to eliminate the deficiencies of the Naïve Bayesian algorithm in text classification tasks. They claimed their methods perform well with respect to many novel methods, such as SVM. Su and Hsiao's study concerned the Multiclass Mahalanobis-Taguchi System (MMTS), developed to combine feature selection and classification. This system uses a reduced model measurement scale and examples to calculate a minimum weighted distance. The authors also compared their system to other well-known algorithms and results in terms of accuracy and feature selection. In addition, they implemented the system in a real-life case and pointed to its practicality [57].

In retrospect, some researchers have attempted to show the importance of datasets in determining classifications. A crucial point has also been raised about the danger of using a single dataset for performance comparison, and tests have been carried out for dynamic modifications of penalty and network architectures [28]. A similar finding is also stated by Keogh and Kasetty [35], since the performance results of learning algorithms are expected to deviate across different datasets; their study discusses data and implementation bias for time series datasets. Brazdil et al. [11] presented a meta-learning method to help the process of algorithm selection. They used the k-Nearest Neighbor algorithm to identify the datasets that are most similar to the one at hand, and the performance and speed of the candidate algorithms on those datasets were used to populate a ranking to be provided to the user. The distance between datasets is based on a small set of data characteristics representing properties that affect the performance of the learning algorithms; some of those characteristics were the number of examples, the proportion of symbolic attributes and outliers, and the entropy of classes. In another interesting study, the authors tried to discover the similarities among classification algorithms, and among datasets, on the basis of error measures. The most frequent finding was that the described areas were characterized by rather concentrated distributions of descriptors such as data availability, problem dimensionality (number of examples, ratio of examples to attributes, etc.), class distribution or information content [34].

The importance of implementation settings while running algorithms has also been underlined by some authors in the literature. For example, Keogh et al. [36] emphasize the importance of implementation details, such as parameter selection in algorithms, and claim that algorithms should have few or no parameters. Pitt and Nayak point to one factor that affects accuracy in their study: "The use of feature reduction algorithms on a large population survey database has shown that the use of the subset and attribute evaluation methods mostly results in an improvement in accuracy despite a reduction in the number of attributes" [50]. Finally, Howley et al. [30] studied the effects of data preprocessing steps on classifier accuracies and compared the results of classifiers where no preprocessing step was applied with those where additional techniques, such as normalization or PCA, were applied.

The literature also contains many studies that describe practical implementations of data mining algorithms. Maindonald points to the difficulties and complexities of comparing algorithms and underlines the fact that users who are more experienced working with a specific model will tend to produce the best results with that model; therefore, published performance results are very broad indicators and dependent on datasets. Moreover, he points to the insufficiency of datasets spanning several years for determining changes in algorithm performance [46]. Ko and Osei-Bryson tried to show the significance of Information Technology (IT) investments for healthcare industry productivity in research that utilizes data mining implementations of Regression and MARS (Multivariate Adaptive Regression Splines) algorithms. They discovered that the relationship between IT spending and productivity is considerably complex and dependent on such conditions as the amounts invested in IT Stock, Non-IT Labor, Non-IT Capital and Time [38]. The research of Abbasi and Chen then pays attention to the categorization of fake escrow websites, which are hard to distinguish from legitimate ones. They assess the effectiveness of various techniques, such as support vector machines, neural networks, decision trees, Naïve Bayes, and principal component analysis. A support vector machine classifier, when combined with an extended feature set, differentiates fake pages from real ones with 90% accuracy for pages and 96% accuracy for sites. The authors also claim that an extended set of fraud features is required because of the large range of fraud strategies [1]. Another interesting paper reveals the effectiveness of data mining techniques in predicting fall-related injuries from electronic medical records. Chiarini et al. [16] applied both unsupervised (entropy) and supervised (information gain) text mining techniques. Text mining alone was deemed useful for identifying fall-related injuries. They then used the terms created by both schemes for clustering and logistic regression and concluded that information gain outperformed entropy for both clustering and logistic regression. Also, clustering on entropy-based terms did not produce viable results; however, even entropy performed well for the classification task and categorized cases correctly when data was not available.

As seen in this literature review, the data mining community is very interested in comparing different classification algorithms. For example, Dogan and Tanrikulu proposed a comparative framework for evaluating classifier accuracies. They claim that classifier accuracies are not the same for every dataset, and performance is, therefore, significantly affected by such dataset characteristics as variable types or the number of instances [19]. The present study is likewise concerned with classification performance and the factors that can affect accuracy, applying new perspectives as well as other quality indicators, such as classifier speed and robustness, to determine classifier quality.

2 Research questions

This study aims to compare the classification algorithm accuracies, speed and robustness with respect to various datasets and implementation techniques. The research questions for the study are as follows:

1. When implemented with different techniques, does the performance of classifiers deviate significantly on multiple datasets?
2. Is the performance of classifiers significantly affected by dataset characteristics?
3. Does binning the continuous variables in the dataset into discrete intervals affect classifier accuracy?
4. Does applying principal component analysis in the dataset affect the classifier accuracy?
5. Based on the empirical results of this study (applying classifiers on various datasets with different implementation techniques), what is the overall effect of dataset and implementation attributes on the accuracy of the classification algorithm?
6. Implemented using different techniques, does the speed (consumed CPU time in seconds) of classifiers deviate significantly on multiple datasets?
7. Based on the empirical results of this study (applying classifiers on various datasets with different implementation techniques), what is the overall effect of dataset and implementation attributes on speed for the classification algorithm?
8. Does the speed of the algorithm significantly affect the accuracy of the classification algorithm?
9. Do the abilities of classifiers to handle missing or noisy data differ (their robustness)?

3 Method

Figure 1 shows the methodological framework maintained during the research study. Throughout the study, accuracy is referred to as the performance of the classifiers. In the implementation phase, ten sample datasets were used because the research study is interested in applying the algorithms to multiple datasets. Fourteen classification algorithms, described in Table 1, were selected for implementation on the experimental datasets. WEKA (Waikato Environment for Knowledge Analysis), a preferred suite of machine learning software, was used here as a tool to run the AIRS2P, C4.5, CSCA, IBk, Logistics, LogitBoost, MLP, MLVQ, Naïve Bayesian and SVM algorithms. SPSS (Statistical Package for the Social Sciences) was used as a tool to run the CART, Ex-CHAID and QUEST algorithms, since these algorithms are available in SPSS. Finally, the Rosetta tool was used to run the RSES algorithm. Data preprocessing steps were also applied to the sample datasets, and the results of those implementations were then tabulated. Afterwards, a descriptive analysis and a One-way Anova test were carried out to answer the first and sixth research questions. Further, correlation analysis was conducted to answer the second, third and fourth research questions. Lastly, regression models were built to deal with the fifth and seventh research questions. To learn whether there is a relationship between accuracy and the speed of the classifier, correlation analysis was applied to answer question eight. To answer the ninth research question, the algorithms were implemented on a noisy dataset before and after data cleaning, and the results containing the variances were then tabulated to determine the robustness of the classifiers.

(Fig. 1 Methodological framework: dataset collection and algorithm selection; implementations in the basic, after-discretisation, after-PCA and noisy/cleaned settings; descriptive analysis, One-way Anova, correlation and regression analyses producing the statistical results for accuracy, speed and robustness.)

3.1 Algorithms

Fourteen classification algorithms representing the different types of classification models (decision trees, neural networks, immune systems, probabilistic models, etc.) were selected from the many existing classification algorithms that fell within the scope of this study.

The selected algorithms were the AIRS2P and CSCA artificial immune recognition system algorithms; the Ex-CHAID, C4.5, CART and QUEST decision tree algorithms; the IBk algorithm; the Logistics algorithm; the LogitBoost boosting algorithm; the MLP and MLVQ neural network algorithms; the Naïve Bayesian algorithm; the RSES rough sets algorithm; and the Support Vector Machine algorithm. Table 1 lists the algorithms used in the study along with the tool they were implemented in, the category they belong to, and the people who introduced each classifier.

Similar to the human natural immune system, which differentiates and recalls intruders, the AIRS algorithm is a cluster-based approach that learns the structure of the data and performs a k-nearest neighbor search. AIRS2 and AIRS2P are extensions of the existing AIRS algorithm with some technical differences [51, 59–61]. Another artificial immune system technique, inspired by the functioning of the clonal selection theory of acquired immunity, is the Clonalg algorithm. It is inspired by the "maintenance of a specific memory set, selection and cloning of most stimulated antibodies, death of non-stimulated antibodies, affinity maturation (mutation), re-selection of clones proportional to affinity with antigen, and the generation and maintenance of diversity" [14]. A variant implementation of Clonalg is called the Clonal Selection Classifier Algorithm (CSCA), and it aims to maximize classification accuracy and minimize the misclassification rate [3, 13].

In decision tree algorithms, the classification procedure is condensed into a tree. After the model is constructed, it is applied to the entire database [20]. The J48 algorithm is a version of an earlier algorithm developed by J. Ross Quinlan, namely the very popular C4.5. C4.5 employs two pruning methods: the first is known as sub-tree replacement, and the second as sub-tree raising [52, 56]. CART (Classification and Regression Trees) is another decision tree classifier that uses binary splits, first grows and then prunes the tree, uses the Gini index as its splitting criterion, and surrogates missing values [12, 63]. The CHAID algorithm cultivates the tree by locating the optimal splits until the stopping criterion is met with respect to Chi-square tests [17]. CHAID can deal with missing values, and the outputs of the target function are discrete [47]. The splitting and stopping steps contained in the Exhaustive CHAID algorithm proposed in 1991 are the same as those in CHAID, while the merging step uses an exhaustive search process to merge any similar pairs until a single pair remains. However, in large datasets with many continuous predictor variables, this modified CHAID algorithm may require significant computing time [29]. The Ex-CHAID (Exhaustive CHAID) algorithm was introduced by Biggs in 1991. This method, like CHAID, chooses the best partition on the basis of statistical significance and uses the Bonferroni inequality to calculate that significance. Unlike CHAID, this algorithm does not support simple partitions (low values of k), nor does it discriminate against free-type (no restriction on the order of values) predictor variables with many categories [9]. QUEST (Quick, Unbiased and Efficient Statistical Tree) is a binary-split decision tree algorithm used for classification and data mining [55]. The objective of QUEST is similar to that of the CART algorithm; it uses an unbiased variable selection technique by default, uses imputation instead of surrogate splits to deal with missing values, and can easily handle categorical predictor variables with many categories [44]. Badulescu [5] points to the difficulty of selecting the best attribute when splitting the decision tree in the model induction phase; he compared the performance of 29 different splitting measures and claims that the FSS Naïve Bayesian measure splits the attributes best.

The IBk (Instance-based k-nearest Neighbors) algorithm is an alternative version of the k-nearest neighbor algorithm used for k-nearest neighbor classification. It is a framework that creates classification predictions by using only specific instances; the IBk algorithm does not keep a set of abstractions derived from instances. This method uses and extends the nearest neighbor algorithm. However, the plain nearest neighbor approach requires a larger storage space; by using an instance-based algorithm, this storage need can be significantly reduced, while learning and accuracy rates do not decrease much [2].
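As a rough illustration of the instance-based idea, the sketch below uses scikit-learn's KNeighborsClassifier as a stand-in for WEKA's IBk; the dataset, split and value of k are illustrative assumptions and this is not one of the experiments run in the paper.

```python
# Minimal sketch: an instance-based classifier "trains" by storing the instances and
# classifies a new case from its nearest stored neighbours.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)   # k chosen only for illustration
knn.fit(X_train, y_train)                   # fitting = storing the training instances
print(f"held-out accuracy: {100 * knn.score(X_test, y_test):.1f}%")
```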

The Multinomial Logistic Regression Model utilizes independent variables to predict the probability of events by fitting the data to a logistic curve [26]. It is developed by using ridge estimators in logistic regression to improve the parameter estimates and to diminish the error made by further predictions [42].

LogitBoost is a boosting algorithm formulated by Jerome Friedman, Trevor Hastie, and Robert Tibshirani. It casts the AdaBoost algorithm into a statistical framework: specifically, if AdaBoost is regarded as a generalized additive model and the cost function of logistic regression is applied, the LogitBoost algorithm can be derived. LogitBoost can be seen as a convex optimization that minimizes the logistic loss [22].

Table 1 The algorithms used for the experiments

Algorithm name | Acronym | Category | Year | Introduced by | Implementation tool
Artificial Immune Recognition Systems v2 Parallel | AIRS2P | Artificial immune systems classification | 2005 | Watkins [60] | WEKA v. 3.6.4
C4.5 Decision Tree Revision 8 | C4.5/J48 | Decision tree classification | 1993 | Quinlan [52] | WEKA v. 3.6.4
Classification and Regression Trees | CART | Decision tree classification | 1984 | Breiman et al. [12] | SPSS v. 17.0
Clonal Selection Classification Algorithm | CSCA | Artificial immune systems classification | 2005 | Brownlee [13] | WEKA v. 3.6.4
Exhaustive Chi-Square Automatic Interaction Detection | Ex-CHAID | Decision tree classification | 1991 | Biggs et al. [9] | SPSS v. 17.0
Instance-based k-nearest Neighbors | IBk | k-nearest neighbors (k-NN) classification | 1991 | Aha et al. [2] | WEKA v. 3.6.4
Multinomial Logistic Regression Model | Logistic | Regression-based classification | 1992 | Le Cessie and van Houwelingen [42] | WEKA v. 3.6.4
Logistic Boost | LogitBoost | Additive logistic regression-based boosting | 2000 | Friedman et al. [22] | WEKA v. 3.6.4
Multi-Layered Perceptron | MLP | Neural network classification | 1986 | Rumelhart et al. [54] | WEKA v. 3.6.4
Multi-Pass Learning Vector Quantization | MLVQ | Neural network classification | 1990 | Kohonen [39–41] | WEKA v. 3.6.4
Naïve Bayesian Classification | Naïve Bayes | Bayesian classification | 1995 | John and Langley [32] | WEKA v. 3.6.4
Quick, Unbiased, Efficient, Statistical Tree | QUEST | Decision tree classification | 1997 | Loh and Shih [44] | SPSS v. 17.0
Rough Set Exploration System | RSES | Rough set approach (RSA) | 2000 | Prof. Andrzej Skowron's R&D team [7] | Rosetta v. 1.4.41
Support Vector Machines (Library for SVM) | SVM (LibSVM) | Support vector classification | 1992 | Boser et al. [10] | WEKA v. 3.6.4

The Multilayer Perceptron (MLP) is a type of artificial neural network algorithm that takes the human brain as its modeling inspiration [17, 54]. It provides a generic model for learning real, discrete, and vector target values. The learned model is difficult to interpret, and training times may be extensive [47].

MLVQ (Multi-Pass LVQ) is the recommended practice for the LVQ (Learning Vector Quantization) algorithm, "where a quick rough pass is made on the model using OLVQ1 and then a long fine tuning pass is made on the model with any of LVQ1, LVQ2.1 or LVQ3". The purpose of the algorithm is to estimate the distribution of a class using a reduced number of codebook vectors, while trying to minimize classification errors. The algorithm belongs to the neural network class of learning algorithms; however, it works significantly differently from conventional feed-forward networks like Back Propagation [62].

The Naïve Bayesian algorithm frames the classification problem in probabilistic terms and provides statistical methods to categorize instances with respect to their probabilities [17, 32].

Another algorithm used in this study is the Rough Set Exploration System (RSES). The RSES algorithm was created by the research team supervised by Professor Andrzej Skowron [6, 7]. A rough set, first introduced by Zdzisław I. Pawlak, is a formal approximation of a conventional set in terms of a pair of sets that give the lower and the upper approximations of the original set. In the standard version of rough set theory, the lower- and upper-approximation sets are conventional (crisp) sets, while in other variations the approximating sets may be fuzzy sets [48, 49].
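The lower/upper approximation idea behind rough sets can be made concrete with a small sketch; the toy objects, attributes and target set below are invented for illustration, and this is not RSES or Rosetta code.

```python
# Minimal sketch of rough-set approximations: objects are grouped into indiscernibility
# classes by their attribute values; the lower approximation keeps classes fully inside
# the target set, the upper approximation keeps classes that overlap it.
from collections import defaultdict

records = {                                   # object id -> attribute values (toy data)
    "o1": {"colour": "red", "size": "big"},
    "o2": {"colour": "red", "size": "big"},
    "o3": {"colour": "blue", "size": "big"},
    "o4": {"colour": "blue", "size": "small"},
}
target = {"o1", "o3"}                         # the set to approximate

def rough_approximations(records, attrs, target):
    classes = defaultdict(set)
    for obj, values in records.items():
        classes[tuple(values[a] for a in attrs)].add(obj)
    lower, upper = set(), set()
    for block in classes.values():
        if block <= target:                   # certainly in the set
            lower |= block
        if block & target:                    # possibly in the set
            upper |= block
    return lower, upper

print(rough_approximations(records, ["colour", "size"], target))
# lower = {'o3'}, upper = {'o1', 'o2', 'o3'}
```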

A support vector machine (SVM) algorithm, introduced by Boser, Guyon and Vapnik in 1992, analyses data and recognizes patterns used for classification and regression analysis. It builds a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can then be used for classification, regression, or other tasks. As expected, an effective division is achieved by the hyperplane with the largest distance to the nearest training data points of any class (the functional margin), since in general the larger the margin, the lower the generalization error of the classifier [10, 18]. The WLSVM (Weka Library for SVM) can be viewed as an implementation of LibSVM running under the Weka environment [21]; the LibSVM package is an efficient software tool for building SVM classifiers [15].

Each algorithm can utilize both numerical and categorical variables as inputs. In addition, the classifiers can handle target classes with more than two class types. Algorithms can also be referred to as classifiers or models.

3.2 Data collection

All sample datasets were collected from the UCI Machine Learning Repository [58]. The experimental datasets were: Acute, Breast Cancer, CPU, Credits, Iris, Letters, Pittsburg, Red Wine, Segment, Wine All and White Wine. Table 2 summarizes the attributes of each of the datasets.

3.3 Implementation

Before the implementation of the algorithms, the datasets were first cleaned of missing, noisy and incorrect data. Missing data were removed from the datasets, which were then also cleaned of noisy data. Unnecessary space characters and other spelling mistakes in the datasets were also corrected.

Table 2 Dataset characteristics used for experiments

Dataset name | Number of variables | Number of nominal variables | Number of numerical variables | Target class types | Number of instances
Acute | 7 | 6 | 1 | 2 | 120
Breast cancer | 9 | 0 | 9 | 2 | 684
CPU | 6 | 0 | 6 | 3 | 210
Credits | 15 | 9 | 6 | 2 | 653
Iris | 4 | 0 | 4 | 3 | 150
Letters | 16 | 0 | 16 | 26 | 20,000
Pittsburg | 12 | 10 | 2 | 6 | 92
Red wine | 11 | 0 | 11 | 6 | 1,599
Segment | 19 | 0 | 19 | 7 | 1,500
Wine all | 11 | 0 | 11 | 7 | 6,497
White wine | 11 | 0 | 11 | 7 | 4,898

Ten datasets (Acute, Breast Cancer, CPU, Credits, Iris, Letters, Red Wine, Segment, Wine All and White Wine) were used to run the 14 classification algorithms (AIRS2P, C4.5, CART, CSCA, Ex-CHAID, IBk, Logistics, LogitBoost, MLP, MLVQ, Naïve Bayesian, QUEST, RSES, and SVM). For all the algorithms, tenfold cross-validation was implemented on the same datasets. This stage of the experiment, referred to as "basic implementation", resulted in 140 (10 datasets × 14 algorithms) rows of accuracy and speed values, respectively.
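The basic implementation stage can be pictured with the sketch below, which loops classifiers over datasets with tenfold cross-validation and records accuracy and CPU time per trial. It uses scikit-learn stand-ins and built-in toy datasets purely for illustration; the paper's actual runs were done in WEKA, SPSS and Rosetta on the UCI files, so the numbers will differ.

```python
# Minimal sketch of the experimental grid: for each (dataset, algorithm) pair, run
# tenfold cross-validation and record mean accuracy and the CPU time consumed.
import time
import pandas as pd
from sklearn.datasets import load_iris, load_wine        # stand-ins for the UCI files
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB                # rough analogue of Naive Bayes
from sklearn.neighbors import KNeighborsClassifier        # rough analogue of IBk
from sklearn.tree import DecisionTreeClassifier           # rough analogue of C4.5/CART

datasets = {"iris": load_iris(return_X_y=True), "wine": load_wine(return_X_y=True)}
classifiers = {"IBk-like": KNeighborsClassifier(),
               "tree": DecisionTreeClassifier(random_state=0),
               "naive_bayes": GaussianNB()}

rows = []
for dname, (X, y) in datasets.items():
    for cname, clf in classifiers.items():
        start = time.process_time()                        # CPU time, not wall time
        scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
        rows.append({"dataset": dname, "algorithm": cname,
                     "accuracy": 100 * scores.mean(),
                     "cpu_time_s": time.process_time() - start})

print(pd.DataFrame(rows))
```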

After the basic implementation phase, all continuous variables in the datasets were binned into intervals with respect to ±1 standard deviation and saved as new variables. Again, the fourteen algorithms were implemented on the preprocessed datasets comprising the discretised variables. The second stage of the experiment, referred to as "after discretisation", resulted in another 140 rows of accuracy and speed values (10 datasets × 14 algorithms).
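The paper does not spell out the exact cut points, so the sketch below is one plausible reading of binning "within ±1 standard deviation": four intervals bounded by the mean and by the mean plus or minus one standard deviation. The file and column names are assumptions.

```python
# Minimal sketch of the "after discretisation" preprocessing, assuming cut points at
# the mean and at mean +/- 1 standard deviation (an interpretation, not the paper's code).
import pandas as pd

def bin_by_std(series: pd.Series) -> pd.Series:
    mu, sigma = series.mean(), series.std()
    edges = [float("-inf"), mu - sigma, mu, mu + sigma, float("inf")]
    return pd.cut(series, bins=edges, labels=["low", "below_mean", "above_mean", "high"])

df = pd.read_csv("winequality-red.csv", sep=";")     # hypothetical local copy of a UCI file
df["alcohol_discrete"] = bin_by_std(df["alcohol"])   # saved as a new, discrete variable
print(df["alcohol_discrete"].value_counts())
```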

Following the second stage, principal component analysis was conducted on each dataset. Components with eigenvalues over 1.0 were retained as components and saved as new variables. Again, the fourteen algorithms were implemented on the preprocessed datasets made up of the principal components. The third stage of the experiment, referred to as "after PCA", resulted in another 140 rows of accuracy and speed values (10 datasets × 14 algorithms).
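A rough equivalent of this stage, keeping only the components whose eigenvalues exceed 1.0, could look as follows; scikit-learn is used here instead of SPSS, and the standardization step and file name are assumptions rather than the authors' procedure.

```python
# Minimal sketch of the "after PCA" stage: standardize, run PCA, keep the components
# with eigenvalues above 1.0, and use those component scores as the new variables.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("winequality-red.csv", sep=";")              # hypothetical UCI file copy
X = StandardScaler().fit_transform(df.drop(columns="quality"))

pca = PCA().fit(X)
n_keep = int((pca.explained_variance_ > 1.0).sum())            # eigenvalue > 1.0 rule
components = PCA(n_components=n_keep).fit_transform(X)
print(n_keep, pca.explained_variance_ratio_[:n_keep].sum())    # components kept, cum. var.
```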

In total, a dataset called "results" was obtained from the 420 rows of accuracy and speed values derived from the three stages of the experiment. This 420-row dataset was used to answer the first eight research questions.

In order to answer the final research question, the 'Breast Cancer', 'Credits' and 'Pittsburg' datasets were selected, and the algorithms were run on them before and after cleaning of noisy instances. These results were then compared.

3.3.1 Software tools and hardware configuration

WEKA version 3.6.4, SPSS version 17.0, and Rosetta version 1.4.41 were the main tools used to run the selected algorithms. In addition, SPSS was used to conduct all the statistical tests, such as correlation, regression, Anova, and descriptive tests. For this study, the default parameter settings of WEKA, Rosetta, and SPSS were used for the algorithms. Environmental factors are also important for the nature of these experiments. All data mining algorithms and statistical tests were conducted on a personal computer with the following configuration: Microsoft XP Professional Operating System with Service Pack 3, 2 × 2.10 GHz CPU, 3 GB RAM, and a 150 GB hard disk. Table 1 shows the list of algorithms used in the study along with the tool in which they were implemented, the category they belong to, the year each was introduced, and finally the people who introduced the algorithms.

4 Results

In this section, the performance and speed results of each algorithm for each case are discussed, and the research questions are answered accordingly.

The percentage of instances that are correctly classified gives the accuracy [20], which is referred to as the performance of the classifiers throughout the study. The costs of a wrong assignment, in other words misclassification costs, are not within the scope of this study. Han and Kamber [25] describe speed as 'the computational costs involved in generating and using the given classifier or predictor'. In this study, speed refers to the CPU time consumed by the classifiers during model building, in other words, the time observed in seconds to generate the classifier. The higher the time spent during modeling, the slower the classifier.

4.1 Research Question 1: When implemented with different techniques, does the performance of classifiers deviate significantly on multiple datasets?

Based on the findings of the empirical study, the same classifier is not the best choice for all datasets, nor does the chosen one always outperform the other classifiers. For each dataset, the best predictive classifier is identified for each stage of the experiment. As Dogan and Tanrikulu [19] also claim in their study, no classifier can be seen to outperform the other classifiers on every dataset.

According to Table 3, the overall best accuracy obtained is 100 percent on the 'Acute' dataset. The classifiers producing that rate of accuracy are AIRS2P, C4.5, CSCA, IBk, LogitBoost, Logistics, MLP, MLVQ, Naïve Bayesian, RSES, and SVM. The 'Acute' dataset seems to be the easiest one to predict; the only algorithms that could not achieve that level of accuracy were the CART, Ex-CHAID, and QUEST algorithms. Table 4 shows the ten worst prediction results, with MLVQ, QUEST, Ex-CHAID, and CART as the algorithms with the lowest prediction abilities. The 'Letters' dataset, with the highest number of instances and variables, seems to have been the hardest dataset to predict, and mostly the decision tree algorithms had the lowest prediction values.

Table 5 displays the detailed accuracy results for the best cases of each dataset in the basic implementation step. MLP produced the best accuracy for the 'Acute', 'CPU', 'Iris' and 'Segment' datasets. IBk had the best performance for the 'Acute', 'Red Wine' and 'White Wine' datasets. Logistics had the best performance for the 'Acute' and 'Wine All' datasets. AIRS2P, C4.5, CSCA, LogitBoost, and SVM had the best accuracy only for the 'Acute' dataset. RSES produced the best result only for the 'Letters' dataset, and CART had the best accuracy for the 'Credits' dataset. Interestingly, Ex-CHAID, Naïve Bayesian, and QUEST never produced the best result for any dataset. This result may mean that they cannot handle continuous variables and dense dimensionality as well as MLP, IBk, or Logistics can. Since 'Letters' is the most complicated dataset, RSES demonstrated a very powerful prediction for this complex dataset. The accuracy of MLVQ on the 'Letters' dataset is very low, even lower than random guessing (1/26 = 3.85 %). This must be due to the fact that not all class types are represented in the data file: 10 out of 16 letter types are never represented in the data file, which causes a data bias and apparently reduces the learning power of the MLVQ algorithm. The accuracy of MLVQ is dependent on the class distribution in the training dataset, since a good distribution of samples will ensure building useful models.

Table 6 displays the detailed accuracy results for the best cases for each dataset after the discretisation step. IBk had the best accuracy for the 'Acute', 'Wine All', 'Wine Red' and 'Wine White' datasets. RSES had the best accuracy for the 'Acute', 'Letters', and 'Segment' datasets. Logistics had the best performance for the 'Acute' and 'Iris' datasets. MLP had the best value for the 'Acute' and 'CPU' datasets. SVM predicted the 'Acute' and 'Breast Cancer' datasets best. The CART and Ex-CHAID algorithms had the best performance for the 'Credits' dataset, and finally C4.5, CSCA, AIRS2P, LogitBoost, and MLVQ had the best accuracy only for the 'Acute' dataset. When the continuous variables were binned into intervals, IBk and RSES began to predict best more often, which may reflect their ability to handle discrete values better. However, Naïve Bayesian and QUEST still could not predict as well as the other algorithms.

Table 7 displays the detailed accuracy results for the best cases for each dataset after the PCA step. IBk had the best performance for the 'Acute', 'Segment', 'Wine All' and 'Wine Red' datasets. MLP had the best accuracy for the 'Acute', 'CPU' and 'Iris' datasets. C4.5 had the best accuracy for the 'Acute' and 'Wine White' datasets. Logistics had the best value for the 'Acute' and 'Credits' datasets. LogitBoost had the best accuracy for the 'Acute' and 'Breast Cancer' datasets. RSES predicted the 'Acute' and 'Letters' datasets best. AIRS2P, CSCA, MLVQ, Naïve Bayesian, and SVM could best predict only the 'Acute' dataset. After the PCA application, Naïve Bayesian and C4.5 started to predict better, and thus they may be better at handling a smaller amount of data and fewer dimensions. CART, Ex-CHAID, and QUEST did not predict best on any dataset after component analysis.

Table 3 Overall best accuracy results

Dataset | Algorithm | Integers binned | PCA | Accuracy
Acute | AIRS2P, C4.5, CSCA, IBk, LogitBoost, Logistics, MLP, MLVQ, RSES, SVM | No | N | 100
Acute | AIRS2P, C4.5, CSCA, IBk, LogitBoost, Logistics, MLP, MLVQ, RSES, SVM | Yes | N | 100
Acute | AIRS2P, C4.5, CSCA, IBk, LogitBoost, Logistics, MLP, MLVQ, Naïve Bayes, RSES, SVM | Yes | Y | 100

Table 4 Overall worst accuracy results

Dataset | Algorithm | Integers binned | PCA | Accuracy
Letters | CART | Yes | N | 35.3
Iris | EXCHAID, QUEST | Yes | Y | 33.3
Letters | QUEST | Yes | N | 31.1
Letters | CART | Yes | Y | 31.0
Letters | QUEST | Yes | Y | 27.6
Letters | QUEST | No | N | 23.7
Letters | MLVQ | Yes | Y | 3.85
Letters | MLVQ | Yes | N | 3.80
Letters | MLVQ | No | N | 3.78

Table 5 Best accuracy results for each dataset/basic implementations

Dataset | Algorithm | Accuracy
Acute | AIRS2P, C4.5, CSCA, IBk, LogitBoost, Logistics, MLP, MLVQ, SVM | 100
Breast cancer | MLVQ | 97.2
CPU | MLP | 97.2
Credits | CART | 87.4
Iris | MLP | 97.3
Letters | RSES | 100
Segment | MLP | 96.7
Wine all | Logistics | 77.4
Wine red | IBk | 64.7
Wine white | IBk | 65.3

Table 6 Best accuracy results for each dataset/after discretisation

Dataset | Algorithm | Accuracy
Acute | AIRS2P, C4.5, CSCA, IBk, LogitBoost, Logistics, MLP, MLVQ, RSES, SVM | 100.0
Breast cancer | SVM | 97.1
CPU | MLP | 97.2
Credits | CART, Ex-CHAID | 86.4
Iris | Logistics | 90.7
Letters | RSES | 98.5
Segment | RSES | 89.9
Wine all | IBk | 61.6
Wine red | IBk | 63.0
Wine white | IBk | 61.7

The performance variable was also binned into intervals labelled LOW, MIDDLE, GOOD, and VERY GOOD with respect to ±1 standard deviation. Table 8 shows the distribution of each classifier across those performance intervals over all stages of the experiment. The distributions of algorithms such as C4.5 and IBk fall mostly in the 'Very Good' interval; AIRS2P, Logistics and MLP are other 'Good' ones. IBk was the only classifier with no result in the 'Low' interval.

The basic concern of the first research question was to determine whether the classifiers have significantly different accuracies on multiple datasets. Even though the results show that none of the classifiers is dominant and that different classifiers predict better in different circumstances, a One-way Anova test can help to better visualize the differences between classifier accuracies. Table 9 shows that the mean prediction abilities of the classifiers on the original datasets do not differ significantly from one another, and Fig. 2 illustrates this finding. According to Fig. 2, the best classification mean belongs to IBk, followed by the other well-performing classifiers Logistics, C4.5 and MLP. The worst mean value belongs to MLVQ, followed by the other poorly performing classifiers QUEST and RSES.

Table 9 also shows that the mean prediction abilities of the classifiers after discretisation still do not differ significantly from one another, and Fig. 3 illustrates this finding. According to Fig. 3, the top classifiers are RSES, IBk, C4.5 and SVM; the worst classifiers are QUEST, Ex-CHAID and CART. In the second stage of the experiment, a general tendency for performance to drop is observable; however, CSCA, MLVQ, RSES and SVM show a tendency to increase, with RSES and SVM demonstrating a sharp increase in performance.

Table 7 Best accuracy results for each dataset/after PCA

Dataset | Algorithm | Accuracy
Acute | AIRS2P, C4.5, CSCA, IBk, LogitBoost, Logistics, MLP, MLVQ, Naïve Bayes, RSES, SVM | 100
Breast cancer | LogitBoost | 97.5
CPU | MLP | 91.4
Credits | Logistics | 82.5
Iris | MLP | 86.7
Letters | RSES | 98.5
Segment | IBk | 85.9
Wine all | IBk | 58.0
Wine red | IBk | 60.0
Wine white | C4.5 | 51.9

Table 8 Overall distributions of classifiers across performance intervals

Algorithm | Low | Middle | Good | Very good | Total
AIRS2P | 8 | 2 | 12 | 8 | 30
C4.5 | 2 | 8 | 10 | 10 | 30
CART | 7 | 10 | 12 | 1 | 30
CSCA | 8 | 5 | 10 | 7 | 30
EXCHAID | 7 | 10 | 12 | 1 | 30
IBk | 0 | 10 | 10 | 10 | 30
Logistics | 6 | 5 | 9 | 10 | 30
LogitBoost | 7 | 5 | 10 | 8 | 30
MLP | 4 | 7 | 9 | 10 | 30
MLVQ | 10 | 5 | 8 | 7 | 30
Naïve Bayes | 6 | 7 | 10 | 7 | 30
QUEST | 12 | 6 | 11 | 1 | 30
RSES | 8 | 6 | 7 | 9 | 30
SVM | 2 | 11 | 10 | 7 | 30
Total | 87 | 97 | 140 | 96 | 420

Table 9 Anova accuracy results

Stage | Source | Sum of squares | df | Mean square | F | Sig.
All trials | Between groups | 11,404.3 | 13.0 | 877.3 | 2.2 | .009
All trials | Within groups | 162,245.5 | 406.0 | 399.6 | |
Basic implementations | Between groups | 6,426.6 | 13.0 | 494.4 | 1.2 | .307
Basic implementations | Within groups | 53,099.2 | 126.0 | 421.4 | |
After discretisation | Between groups | 3,870.1 | 13.0 | 297.7 | .8 | .648
After discretisation | Within groups | 46,269.2 | 126.0 | 367.2 | |
After PCA | Between groups | 4,081.5 | 13.0 | 314.0 | .7 | .768
After PCA | Within groups | 57,154.7 | 126.0 | 453.6 | |


Table 9 shows that the mean prediction abilities of the classifiers after PCA do not differ significantly from each other, since the classifier means move much closer to one another, and Fig. 4 illustrates this finding. According to Fig. 4, the top classifiers are IBk, C4.5, and SVM; the worst classifiers are QUEST, Ex-CHAID, and CART. In the third stage of the experiment, a general tendency for performance to drop is still observable. Although the claim was that no classifier can outperform the others on every dataset, across all these experimental trials IBk showed the best performance on average over all the datasets.

As can be expected, when all 420 trials are taken into account, the means for the algorithms differ significantly. Figure 5 supports this finding, revealing the best classifiers over all the trials to have been IBk, C4.5 and MLP; Logistics and LogitBoost also predicted well. However, QUEST, MLVQ, Ex-CHAID, and CART had the lowest predictive power.

(Fig. 2 Anova mean plots/basic stage/accuracy. Fig. 3 Anova mean plots/discretisation stage/accuracy. Fig. 4 Anova mean plots/PCA stage/accuracy. Fig. 5 Anova mean plots/all trials/accuracy)
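The One-way Anova comparison summarized in Table 9 and in Figs. 2-5 can be reproduced in outline with standard statistical libraries. The sketch below assumes a hypothetical results.csv export of the 420-row results dataset with 'algorithm' and 'accuracy' columns; it is not the authors' SPSS run and will not reproduce their exact figures.

```python
# Minimal sketch of a one-way ANOVA across classifiers: one group of accuracy values
# per algorithm, then a single F test on the group means.
import pandas as pd
from scipy import stats

results = pd.read_csv("results.csv")                      # hypothetical results export

groups = [grp["accuracy"].values for _, grp in results.groupby("algorithm")]
f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")             # p < .05 -> means differ
```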

4.2 Research Question 2: Is the performance of classifiers significantly affected by dataset characteristics?

Once all of the iterations were completed in the implementation step, a dataset of 420 rows was obtained, containing the combinations of datasets, algorithms, and discretisation and PCA application options, with 13 columns for the variables.

The first 11 fields, except for the 'Trial ID' in Table 10, were set as input variables: 'dataset name', 'algorithm name', 'integer variables binned', 'number of principal components', 'PCA applied', 'percentage of cumulative var. obtained in PCA', 'number of variables', 'number of nominal variables', 'number of numerical variables', 'number of target class types', and 'number of instances'. The last fields in Table 10 show the performance and CPU time variables, which were set as the dependent variables. Since the second research question concerns dataset characteristics, the independent variables were defined from dataset attributes, such as the number of variables, number of nominal variables, number of numerical variables, number of target class types, and number of instances.

On the newly created dataset, referred to as the Results dataset, correlation analysis can be conducted to determine whether any of the input variables significantly affected the performance results. In order to conduct the correlation analysis, all variables were first coded as numerical variables, and Z-score normalization was applied to them. SPSS was also used for the implementation.
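In outline, the same correlation step can be scripted as below; this is a sketch assuming the hypothetical results.csv export and column names used earlier, with SciPy in place of SPSS.

```python
# Minimal sketch of the correlation analysis: z-score each coded variable and compute
# the Pearson correlation of each dataset characteristic with accuracy.
import pandas as pd
from scipy import stats

results = pd.read_csv("results.csv")                      # hypothetical results export

characteristics = ["n_variables", "n_nominal_variables", "n_numerical_variables",
                   "n_target_class_types", "n_instances"]

for col in characteristics:
    z_x = stats.zscore(results[col])
    z_y = stats.zscore(results["accuracy"])
    r, p = stats.pearsonr(z_x, z_y)                       # z-scoring leaves r unchanged
    print(f"{col:>25}: r = {r:+.3f}, p = {p:.4f}")
```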

According to Table 11, all the independent variables were found to be significantly correlated with the dependent variable, the performance of the classifier. Based on these results, the number of variables in the dataset (-.290 Pearson value), the number of nominal variables (.280 Pearson value), the number of numerical variables (-.417 Pearson value), the number of target class types (-.334 Pearson value) and the number of instances (-.456 Pearson value) were all found to coincide with classifier performance. As a result, the answer to the second question indicates that all dataset characteristics can affect classifier performance. In other words, the number of instances, nominal or numerical variables, and target classes in a dataset makes a significant difference in terms of the classifier success rate.

Apparently, proper preparation of the data is a key factor in any classification problem. The data must be properly cleansed to eliminate inconsistencies and support the requirements of the mining application. Additionally, most algorithms require some form of data transformation, such as generalization or normalization, in order to reduce the classification difficulty. Obviously, the dataset becomes more complex for the classifier to make a prediction on as the numbers of instances, classes and variables (attributes) grow. In most classification and regression problems, if the training and test data are prepared by the tool, normalization methods are applied to the attributes in order to reduce the cardinality and increase the discriminating power of the algorithm.

4.3 Research Question 3: Does binning the continuous variables in the dataset into discrete intervals affect classifier accuracy?

According to Table 12, the input variable 'continuous variables binned into intervals' was not found to be significantly correlated with the dependent variable, the performance of the classifier (-.058 Pearson value). As a result, the answer to the third question indicates that discretisation of the continuous variables in the dataset does not affect classifier performance.

Further Table 13 shows that the performance means of

instances where continuous variables have been discretised

and instances where continuous variables have not been

discretised are not found to be significantly different with

Table 10 Excerpts from the results dataset

Trial ID 1 … 88 … 420

Dataset name Acute Iris White

wine

Algorithm name AIRS2P CSCA SVM

Implementation attributes

Integer variables binned No Yes Yes

No of principal components 0 1 5

PCA applied N Y Y

Percentage of cumulative

var. obtained in PCA

0 70.99 72.186

Dataset attributes

No of variables 7 4 11

No of nominal variables 6 0 0

No of numerical variables 1 4 11

No of target class types 2 3 7

No of instances 120 150 4898

Class

Accuracy (%) 100 84 49.392

CPU time (s) 0.13 0.05 10.02

Table 11 Correlations between dataset characteristics and accuracy

Dataset characteristic          Pearson correlation   Sig. (2-tailed)
Number of variables             -.290**               .000
Number of nominal variables     .280**                .000
Number of numerical variables   -.417**               .000
Number of target class types    -.334**               .000
Number of instances             -.456**               .000

N = 420; ** significant correlation



4.4 Research Question 4: Does applying principal

component analysis (PCA) in the dataset affect

the classifier accuracy?

According to Table 14, the input variables ‘number of

principal components’ (-.289 Pearson value), ‘percentage

of cumulative variance obtained in PCA’ (-.125 Pearson

value) and ‘PCA applied’ (-.126 Pearson value) were

found to be significantly correlated to the dependent vari-

able, the performance of the classifier. As a result, the

answer to the fourth question is that applying PCA in the

dataset can affect classifier performance. Because principal components act like variables in the dataset, they have a similar effect on accuracy: as the number of components increases, the accuracy can get worse. Additionally, applying PCA to a dataset reduces its information content, so the classifier prediction accuracy can be lower.

On the other hand, Table 15 shows that the performance means of trials to which PCA was not applied and of those to which it was applied were found to be significantly different according to the one-way ANOVA test (significance .010).
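A minimal sketch of the PCA step, assuming scikit-learn and synthetic data (the study reports keeping components up to roughly 70 % cumulative variance, which the n_components argument below mimics):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 150 instances with 11 numeric attributes.
X = np.random.default_rng(1).normal(size=(150, 11))
X_std = StandardScaler().fit_transform(X)   # standardise before extracting components

pca = PCA(n_components=0.70)                # keep enough components for 70 % cumulative variance
X_reduced = pca.fit_transform(X_std)
print(pca.n_components_, pca.explained_variance_ratio_.sum())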

4.5 Research Question 5: Based on the empirical

results of this study (applying classifiers on various

datasets with different implementation techniques),

what is the overall effect of dataset

and implementation attributes on the accuracy

of the classification algorithm?

Since there is a Results dataset that contains the algorithm,

dataset, and implementation specific attributes in Table 10,

it is possible to use these variables in a regression model

and determine their causal effects on the dependent per-

formance variable. After finding the correlations between

some of the selected independent and dependent perfor-

mance variables in the previous research questions, a regression model was developed to answer the fifth research question. Based on the regression results, it was possible to build a model that demonstrates how these factors relate to the performance result. Equation (1) shows the

regression function for modeling the performance of a

classifier.

Performance = -.552 × (number of principal components)
              + .189 × (% of cumulative variance obtained in PCA)
              - .099 × (number of variables)
              + .223 × (number of nominal variables)
              + .117 × (number of target class types)
              - .367 × (number of instances)
              + .006 × (discretisation)
              + .155 × (PCA application)                                  (1)

The purpose of conducting a regression is to understand whether the coefficients on the independent variables actually differ from zero; in other words, whether the independent variables have an observable effect on the dependent variable.

Table 12 Correlations between accuracy and discretisation

Implementation attribute      Pearson correlation   Sig. (2-tailed)
Continuous variables binned   -.058                 .235

N = 420

Table 13 Anova results of accuracy and discretisation

Accuracy         Sum of squares   df    Mean square   F       Sig.
Between groups   584.950          1     584.950       1.413   .235
Within groups    173,064.824      418   414.031
Total            173,649.774      419

Table 14 Correlations between PCA and accuracy

PCA attribute                                       Pearson correlation   Sig. (2-tailed)
Number of principal components                      -.289**               .000
Percentage of cumulative variance obtained in PCA   -.125**               .010
PCA applied                                         -.126**               .010

N = 420; ** significant correlation

Table 15 Anova results of accuracy and PCA

Accuracy         Sum of squares   df    Mean square   F       Sig.
Between groups   2,743.139        1     2,743.139     6.709   .010
Within groups    170,906.635      418   408.868
Total            173,649.774      419


If the coefficients differ from zero, the null hypothesis (that the dependent variable is not affected by the independent variables) may be rejected [29]. Based on the regression function in Eq. (1), some of the independent variables were found to affect the dependent performance variable. As a result, the number of

principal components, number of instances, and number

of variables all had a negative effect on performance. On

the other hand, the number of nominal variables, the

percentage of cumulative variance obtained in PCA, the

PCA application and the number of target class types all

had a positive effect on the performance.

However, within a 95 % confidence interval, the p val-

ues in Table 16 must be close to or lower than 0.05 in order

to be accepted as significant enough. With respect to

p values (sig. column), the effect of the number of principal

components, the number of nominal variables and the

number of instances on performance is said to be more

certain. These findings are also in line with the research question 2 findings, where we claimed that a high number of variables and instances increases the classification difficulty and impacts the algorithm's discriminating power.
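To illustrate how a standardized regression of this kind can be re-estimated outside SPSS, the sketch below fits an ordinary least squares model on z-scored variables so that the fitted weights are comparable to the beta coefficients in Table 16; the file and column names are hypothetical placeholders.

import pandas as pd
import statsmodels.api as sm
from scipy import stats

# Hypothetical results table; one row per trial, as in Table 10.
results = pd.read_csv("results.csv")
predictors = ["n_components", "pca_cum_variance", "n_variables", "n_nominal",
              "n_classes", "n_instances", "discretised", "pca_applied"]

X = results[predictors].apply(stats.zscore)   # standardize so coefficients are beta weights
y = stats.zscore(results["accuracy"])

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.summary())                        # coefficients, t values and p values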

4.6 Research Question 6: Implemented using different

techniques, does the speed of classifiers deviate

significantly on multiple datasets?

The basic concern of the sixth research question is to

determine whether the classifiers have significantly differ-

ent CPU times on multiple datasets. A one-way ANOVA test helps to visualize the differences between classifier speeds. Table 17 shows that the mean CPU times of the classifiers on the original datasets do not differ significantly from each other (sig. 0.447), as Fig. 6 illustrates. According to Fig. 6, the slowest classifier is CSCA; MLP is slightly slower than the rest of the classifiers, and the remaining classifiers are all very fast.

Table 17 shows that the difference in the mean CPU times of the classifiers after discretisation is not significant, and Fig. 7 illustrates this finding. According to Fig. 7, an overall increase in the time spent is observable for all classifiers. The slowest classifier is still CSCA; the rest of the classifiers display similarly low CPU times.

Table 16 Regression results/accuracy

Model                                               Unstd. B   SE     Std. Beta   t        Sig.
(Constant)                                          .000       .040               .000     1.000
Number of principal components                      -.552      .090   -.552       -6.152   .000
Percentage of cumulative variance obtained in PCA   .189       .315   .189        .598     .550
Number of variables                                 -.099      .055   -.099       -1.809   .071
Number of nominal variables                         .223       .054   .223        4.095    .000
Number of target class types                        .117       .105   .117        1.113    .266
Number of instances                                 -.367      .097   -.367       -3.784   .000
Discretisation                                      .006       .046   .006        .138     .890
PCA application                                     .155       .317   .155        .488     .626

Table 17 Anova CPU time results

                        Sum of squares   df    Mean square   F       Sig.
All trials
  Between groups        3.450E+08        13    2.654E+07     3.075   .000
  Within groups         3.504E+09        406   8.631E+07
Basic implementations
  Between groups        1.936E+07        13    1.489E+06     1.009   .447
  Within groups         1.860E+08        126   1.476E+06
After discretisation
  Between groups        2.328E+08        13    1.791E+07     1.116   .352
  Within groups         2.007E+09        125   1.605E+07
After PCA
  Between groups        1.576E+08        13    1.212E+07     1.232   .265
  Within groups         1.240E+09        126   9.843E+06


In the second stage of the experiment, a general tendency for the CPU time to increase is observable.

Table 17 shows that the mean CPU times of the classifiers after PCA also do not differ significantly from each other, and Fig. 8 illustrates this finding. According to Fig. 8, the slowest classifier is still CSCA, and the other classifiers have lower CPU times. In the third stage of the experiment, a general tendency for the consumed CPU time to increase is again observable.

As expected, when all 420 trials are taken into account, the mean CPU times of the algorithms differ significantly. Figure 9 shows this finding and reveals that the slowest classifier is always CSCA. Different implementation techniques significantly change the overall picture for classifier CPU time.
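A minimal sketch of such a one-way ANOVA over classifier CPU times, assuming SciPy and a hypothetical results.csv with 'algorithm' and 'cpu_time' columns:

import pandas as pd
from scipy import stats

results = pd.read_csv("results.csv")

# One group of CPU-time measurements per classification algorithm.
groups = [grp["cpu_time"].to_numpy() for _, grp in results.groupby("algorithm")]

f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")   # p > 0.05: group means do not differ significantly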

4.7 Research Question 7: Based on the empirical

results of this study (applying classifiers on various

datasets with different implementation techniques),

what is the overall effect of dataset

and implementation attributes on speed (consumed

CPU time in seconds) for the classification

algorithm?

A regression model was therefore developed to answer the seventh research question. According to the regression results in Table 18, it is possible to build a model showing the factors that jointly affect the speed.

Within a 95 % confidence interval, p values in Table 18

should be close to or lower than 0.05 in order to be

accepted as significant enough. With respect to p values

(sig. column), only the effect of the number of instances on

actual CPU time is more certain. In other words, as the number of instances increases, the time to build the model will take longer.


Fig. 6 Anova mean plots/basic implementations/CPU time

Fig. 7 Anova mean plots/after discretisations/CPU time

Fig. 8 Anova mean plots/after PCA/CPU time

Fig. 9 Anova mean plots/all trials/CPU time


This is an expected outcome of the experiments because, for most of the algorithms, it is known

that more time and resources are required as the number of

samples increases. For example, in one study of decision

trees, moving from 10 to 100 K cases increased CPU time

on a PC from 1.4 to 61 s, a factor of 44. The time required

for rule sets, however, increased from 32 to 9,715 s, a

factor of 300. As another example, the KNN algorithm needs N × k comparisons. The number of iterations required for

convergence varies and may depend on N, but initially, this

algorithm can be considered linear in the dataset size [64].

A negative coefficient here implies a reduction in CPU time, that is, less time needed to build the model, whereas a positive coefficient implies an increase in the time needed to build the model.
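For reference, a small sketch of how the consumed CPU time of a model-building step can be measured; the classifier and dataset below are placeholders, not the study's own implementations.

import time
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

start = time.process_time()                 # CPU time rather than wall-clock time
clf.fit(X, y)
cpu_seconds = time.process_time() - start
print(f"Training CPU time: {cpu_seconds:.4f} s")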

4.8 Research Question 8: Does the speed

of the algorithm (CPU time consumed in seconds)

significantly affect the accuracy

of the classification algorithm?

Research Question 8 seeks an answer to the effects of the

speed of a classifier on the accuracy of that classifier. In

other words, the question asks whether the accuracy rate of an algorithm changes according to whether it classifies faster or slower. To answer this question, a

correlation analysis was conducted in SPSS by analyzing

the relation between the ‘Accuracy’ and ‘CPU time’

variables. Based on the results shown in Table 19 (Pear-

son value -0.041), we cannot say that there is a signifi-

cant correlation between performance and the speed of a

classifier; therefore we cannot make an assumption that

the accuracy of an algorithm is reduced if it takes too

much or even less time to complete the prediction

process.

4.9 Research Question 9: Do the abilities of classifiers

to handle missing or noisy data differ?

To answer this last research question, ‘Breast Cancer’ (0

nominal variables, 9 numerical variables, 2 target classes,

684 instances), ‘Credits’ (9 nominal variables, 6 numerical

variables, 2 target classes, 653 instances) and ‘Pittsburg

Bridges’ (10 nominal variables, 2 numerical variables, 6

target classes, 92 instances) datasets were selected for the

experiments. These datasets have some noise, i.e., missing

values. First, all 14 algorithms with tenfold cross-valida-

tion were applied to the datasets, and accuracy results were

tabulated. Then, the missing values were cleaned from the

datasets, and the same algorithms were implemented once

again. Comparing the results of the two steps indicates which algorithms are more robust.
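A minimal sketch of this before/after comparison, assuming scikit-learn, a single placeholder classifier, and a hypothetical breast_cancer.csv file with a 'target' column; the study itself ran all 14 algorithms with tenfold cross-validation in their respective tools.

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("breast_cancer.csv")            # contains some missing values (noise)
clf = DecisionTreeClassifier(random_state=0)

def cv_accuracy(frame):
    # Crude placeholder handling of any remaining missing values before fitting.
    X = frame.drop(columns="target").fillna(-1)
    y = frame["target"]
    return cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()

acc_noisy = cv_accuracy(df)                      # noisy dataset, missing values still present
acc_clean = cv_accuracy(df.dropna())             # missing values cleaned
print(f"before cleaning: {acc_noisy:.3f}  after cleaning: {acc_clean:.3f}")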

Table 20 shows the performance of classifiers before a

missing value analysis and accuracies after the missing

values are cleaned. Based on Table 20, the accuracy of

classifiers does change after the missing values are cleaned.

Based on the average variance in the bottom line of

Table 20, the ‘Pittsburg’ dataset had the highest average

deviation value (3.77). This result is a point to be consid-

ered. The most outstanding feature of the ‘Pittsburg’

dataset is that it has the lowest number of instances and the highest number of target class types, thus indicating a

difficult classification problem. The proportion of missing

values to the total number of instances is high compared to

the other two datasets; therefore, the effect of missing data

Table 18 Regression results/CPU time

Model                                               Unstd. B   SE        Std. Beta   t      Sig.
(Constant)                                          310.0      143.8                 2.2    .032
Number of principal components                      -132.8     321.0     .0          -.4    .679
Percentage of cumulative variance obtained in PCA   217        1,128.7   .1          .2     .848
Number of variables                                 -81.8      195.2     .0          -.4    .675
Number of nominal variables                         33.0       194.7     .0          .2     .865
Number of target class types                        219.1      377.0     .1          .6     .562
Number of instances                                 680.1      346.7     .2          2.0    .050
Integers are binned                                 143.5      166.3     .0          .9     .389
PCA applied                                         -164.2     1,135.2   -.1         -.1    .885

Table 19 Accuracy and CPU time correlation

                        Pearson correlation   Sig. (2-tailed)   N
Accuracy vs. CPU time   -.041                 .400              420


on the results is also higher. In other words, algorithms

classify more robustly on larger datasets.

The highest deviation in accuracy after the missing

values were cleaned was obtained for the RSES algorithm

(3.8), the least robust classifier based on the experiments.

The algorithm least affected by the missing values seems to have been CART (0.70), which handled them best. The second most robust algorithm seems to have been Ex-CHAID (0.9); the decision tree algorithms thus appear to be relatively robust.

However, we did not observe any other common pattern

or behavior; thus, more experiments should be conducted to draw firmer conclusions about classifier robustness on noisy datasets.

5 Discussion

The first discussion point for the study is the significant

differences between algorithm success rates. According to

the accuracy results, none of the algorithms outperformed

the others on every dataset; therefore, no algorithm dominantly predicts best in all domains, and data miners should also consider dataset bias. The accuracy distributions of algorithms such as C4.5 and IBk fall mostly in the ‘Very Good’ interval; AIRS2P, MLP and Logistics are the other

‘Good’ algorithms. IBk is the only classifier with no result

in the ‘Low’ interval.

When all trials are taken into account, the means of

algorithms will differ significantly. The best classifiers out

of all the trials were IBk, C4.5, and MLP. Logistics and

LogitBoost also predicted well. However, QUEST, MLVQ,

Ex-CHAID, and CART had the lowest predictive power.

The nature of each experiment stage helped us explain

the differences between algorithm performances. For

example, in the basic implementations step, the best clas-

sification means belonged to IBk, C4.5, Logistics, and

MLP; the worst mean values belonged to MLVQ, QUEST,

and RSES. This result means that these classifiers cannot

handle continuous variables and dense dimensionality as

well as MLP, IBk, C4.5, or Logistics can. When the con-

tinuous variables were binned into intervals, a general

tendency of the performance to drop was observable;

however, CSCA, MLVQ, RSES, and SVM showed a ten-

dency to increase. RSES and SVM demonstrated a sharp

increase in performance. This finding reflects their better ability to handle discrete values. After PCA appli-

cation, a general tendency of the performance to drop was

still observable. Thus, data analysts should be aware that

some data preprocessing attempts may reduce accuracy for

some classifiers. Although it was claimed that a classifier

cannot outperform the others in every dataset, across all these trials IBk showed the best average performance over all the datasets.

Another debate in this study is about success rates and

possible affecting factors. Based on the correlation analy-

sis, all dataset characteristics and PCA do affect the clas-

sifier accuracy. This finding makes sense because PCA effectively reduces the number of variables in the dataset. However, discretisation has no significant effect on the success rates; in other words, having continuous or discrete values does not change the accuracy significantly.

Table 20 Robustness comparisons

                  Breast cancer                    Credits                          Pittsburg bridges                Avg. Var.
Algorithm         Before MVA   After MVA   Var.    Before MVA   After MVA   Var.    Before MVA   After MVA   Var.
AIRS2P            96.1         96.8        0.6     84.3         84.2        0.1     52.8         59.7        6.9    2.5
C4.5              94.6         96.0        1.4     86.1         85.3        0.8     63.2         67.3        4.1    2.1
CART              92.4         92.7        0.3     87.1         87.4        0.3     40.7         39.1        1.6    0.7
CSCA              96.7         96.3        0.4     65.0         65.4        0.4     31.5         33.6        2.1    1.0
Ex-CHAID          92.7         93.0        0.3     87.2         86.4        0.8     40.7         39.1        1.6    0.9
IBk               95.1         95.7        0.6     81.2         81.2        0       58.4         60.8        2.4    1.0
Logistics         96.6         96.8        0.2     85.2         86.1        0.9     55.6         60.8        5.2    2.1
LogitBoost        95.7         95.3        0.4     84.9         86.5        1.6     67.9         70.6        2.7    1.6
MLP               95.3         96.0        0.7     82.8         82.7        0.1     65           70.6        5.6    2.1
MLVQ              96.5         97.2        0.7     68.5         67.8        0.7     45.2         51          5.8    2.4
Naïve Bayesian    96.0         96.3        0.3     77.7         78.3        0.6     66.9         64.1        2.8    1.2
QUEST             91.0         91.2        0.2     83.6         84.8        1.2     40.7         39.1        1.6    1.0
RSES              94.5         96.0        1.5     38.8         38.7        0.1     33.7         43.6        9.9    3.8
SVM               66.3         68.8        2.5     54.8         54.7        0.1     53.7         53.2        0.5    1.0
Avg. Var.                                  0.72                             0.55                             3.77


We claimed that a high number of variables and instances increases the classification difficulty and impacts the algorithm's discriminating power.

The third discussion that the paper focuses on is the

classifier speed. Based on the Anova mean plots, the CSCA

algorithm has always been found to be the most time-

consuming algorithm; the rest are of a similar speed. This

part of the study offers a clear sense of model development

times, since the data analyst can indeed understand that

more training time will be required with CSCA based on

these findings.

Based on the findings of this study, a regression model

was also built to show the overall effects of dataset and

implementation variables on the performance and speed of a

classifier. Obviously, other factors may also influence the

accuracy or speed of a classifier. Thus, the set of input variables for the regression functions should be extended in the future. In this study, they are based on dataset attributes and the application of discretisation and PCA; it is suggested that additional quality factors be included in future studies.

Lastly, the robustness comparisons were conducted on

multiple datasets before and after the missing values were

cleaned. The results show that the accuracy of algorithms is

not dramatically reduced when noise is also included;

however, CART was the most robust one compared to

RSES as the least robust, and the effect of missing data on accuracy was higher on smaller datasets.

Another question raised in this study seeks to find an

answer to the trade-off dilemma between the success rate and

the CPU time consumed by the algorithms. In other words,

does having a longer or shorter classification time increase or

decrease accuracy? Based on the correlation

studies, no significant relationship between speed and suc-

cess rate was found. We cannot claim that algorithms that

take a longer time will decrease or increase accuracy results.

In this study, the algorithms were compared to each

other based on their total success rates at prediction;

however, the evaluations discussed so far here do not take

into account the cost of making wrong decisions, wrong

classifications. Optimizing classification rate without con-

sidering the cost of the errors could lead to different results.

If the costs are known, however, they can be incorporated

into a financial analysis of the decision-making process.

Even though the costs of the errors can be different,

evaluation by classification accuracy tacitly assumes equal

error costs. Taking the cost matrix into account, however,

replaces the success rate with the average cost per decision.

This is how a classifier, built without taking costs into

consideration, can be used to make predictions that are

sensitive to the specific cost matrix. An alternative is to

take the cost matrix into account during the training process and ignore costs at prediction time [63]. Because the

misclassification costs can be different for each dataset

depending on the nature of the data and target classes to be

predicted, in real life situations these factors should also be

taken into account while conducting a data mining practice.
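To make the idea of average cost per decision concrete, the sketch below evaluates a set of predictions against an assumed 2x2 cost matrix; the labels and the cost values are purely illustrative.

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])

# Assumed cost matrix: rows are true classes, columns are predicted classes.
cost = np.array([[0.0, 1.0],    # false positive costs 1 unit
                 [5.0, 0.0]])   # false negative costs 5 units (domain dependent)

cm = confusion_matrix(y_true, y_pred)        # rows: true class, columns: predicted class
avg_cost = (cm * cost).sum() / cm.sum()
accuracy = np.trace(cm) / cm.sum()
print(f"accuracy = {accuracy:.3f}, average cost per decision = {avg_cost:.3f}")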

6 Conclusion

Classification algorithms have been very frequently used

by the data mining community, and their prediction abili-

ties or complexities have been often discussed. When

implemented efficiently and correctly, data mining systems

can be crucial in many areas, such as customer relationship

management, fraud detection, credit evaluation, risk eval-

uations, medical treatment, and disease detection. There-

fore, quality criteria like accuracy and speed do play a

crucial role in data mining projects when selecting a proper

classifier.

In this study, the AIRS2P, C4.5, CART, CSCA, Ex-

CHAID, IBk, Logistics, LogitBoost, MLP, MLVQ, Naïve

Bayesian, QUEST, RSES, and SVM classification algo-

rithms were implemented on ten different datasets. The

factors affecting classification algorithm performance and

speed were underscored based on the empirical results of

the difference test, correlation, and regression studies. The

fact that dataset characteristics and implementation details

may influence the accuracy of an algorithm cannot be

denied. In summary, all dataset characteristics and PCA

applications were found to affect the success rate signifi-

cantly, but not the discretisation. Speed or the amount of

CPU time consumed by the classifier has been another

concern. CSCA was found to be the most time-consuming algorithm, and the number of instances in a dataset can be claimed to increase that time significantly.

On the other hand, misclassification costs should be

applied and studied in further experiments, because dif-

ferent results can be obtained in real life applications if the

costs for errors are closely considered. Additionally, future

studies should be conducted on larger datasets to observe algorithm behavior and scalability.

Maimon and Rokach claim that ‘‘no induction algorithm

can be best in all possible domains’’, and they introduce the

concept of ‘‘No Free Lunch Theorem’’. This concept states

that ‘‘if one inducer is better than another in some domains,

then there are necessarily other domains in which this

relationship is reversed’’ [45]. Data miners face the

dilemma of determining which classifier to use and that

situation gets harder when other criteria, such as compre-

hensibility or complexity, are also a concern. It is not an

easy task to decide which classifier to use in a data mining

problem; however, this study shows the importance of

model selection and explains that a single algorithm and data preprocessing technique is not always the best choice for all datasets. As a result, the nature of the data determines


which classification algorithm will provide the best solu-

tion to a given problem. The algorithm can differ with

respect to accuracy, time to completion, and comprehen-

sibility. In practice, it will make sense to develop several

models for each algorithm, select the best model for each

algorithm, and then choose the best of those for

implementation.

The scope of the selected algorithms studied here

included a large portion of modern algorithms that are not easy to find in the existing literature. This paper

studied a broad range of well-known traditional algorithms

and recent popular algorithms with a combination of dif-

ferent implementation settings (such as discretisation,

principal component analysis) on multiple datasets. Addi-

tionally, we believe that this paper statistically demon-

strates the relationships between the success rate of the

classifiers and the dataset attributes as well as other settings

(such as discretisation and PCA). The business and aca-

demic community should take these results into consider-

ation, since establishing a knowledge discovery process on

the same algorithm with the same implementation details

may not always be either certain or efficient. The utmost

attention should be paid to model assessment and the

selection phase in an iterative manner, because any dif-

ference in dataset characteristics or preprocessing tech-

niques can affect a model’s accuracy or speed. Hence

switching to another classifier or changing the prepro-

cessing technique may be a better decision. The regression

models developed in this study also offer some hints to data

miners about the expected accuracy and speed of classifiers

based on either a given dataset or its preprocessing

characteristics.

References

1. Abbasi A, Chen H (2009) A comparison of fraud cues and

classification methods for fake escrow website detection. Inf

Technol Manage 10(2–3):83–101

2. Aha DW, Kibler D, Albert MK (1991) Instance-based learning

algorithms. Mach Learn 6(1):37–66

3. Al-Sheshtawi KA, Abdul-Kader HM, Ismail NA (2010) Artificial

immune clonal selection classification algorithms for classifying

malware and benign processes using API call sequences. Int J

Comput Sci Netw Secur 10(4):31–39

4. Armstrong LJ, Diepeveen D, Maddern R (2007) The application

of data mining techniques to characterize agricultural soil pro-

files. In: Proceedings of the 6th Australasian conference on data

mining and analytics (AusDM’07), Gold Coast, Australia,

pp 81–95

5. Badulescu LA (2007) The choice of the best attribute selection

measure in decision tree induction. Ann Univ Craiova Math

Comp Sci Ser 34(1):88–93

6. Bazan JG (1998) A comparison of dynamic and non-dynamic

rough set methods for extracting laws from decision tables.

In: Polkowski L, Skowron A (eds) Rough sets in knowledge

discovery: methodology and applications. Physica-Verlag, Hei-

delberg, pp 321–365

7. Bazan JG, Szczuka M (2001) RSES and RSESlib - A collection

of tools for rough set computations. In: Ziarko W, Yao Y (eds)

Proceedings of the 2nd international conference on rough sets and

current trends in computing (RSCTC’2000). Lecture Notes in

Artificial Intelligence 2005. Springer, Berlin, pp 106–113

8. Berson A, Smith S, Thearling K (2000) Building data mining

applications for CRM. McGraw Hill, USA

9. Biggs D, Ville B, Suen E (1991) A method of choosing multiway

partitions for classification and decision trees. J Appl Stat 18(1):

49–62

10. Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for

optimal margin classifiers. In: Proceedings of the 5th annual work-

shop on computational learning theory, Pittsburg, USA,

pp 144–152

11. Brazdil PB, Soares C, Costa JP (2003) Ranking learning algo-

rithms: using IBL and meta-learning on accuracy and time

results. Mach Learn 50(3):251–277

12. Breiman L, Friedman JH, Olshen R, Stone CJ (1984) Classifi-

cation and regression tree. Wadsworth & Brooks/Cole Advanced

Books & Software, Pacific Grove

13. Brownlee J (2005) Clonal selection theory & CLONALG: the

clonal selection classification algorithm (CSCA). Technical

Report 2-02, Centre for Intelligent Systems and Complex Pro-

cesses (CISCP), Swinburne University of Technology (SUT),

Australia. Available via http://citeseerx.ist.psu.edu/viewdoc/

download?doi=10.1.1.70.2821&rep=rep1&type=pdf. Accessed

on 15 August 2011

14. Castro LN, Zuben FJ (2002) Learning and optimization using the

clonal selection principle. IEEE Trans Evol Computat 6(3):

239–251

15. Chang CC, Lin CJ (2011) LIBSVM: a library for support vector

machines. ACM Trans Intell Syst Technol 2(3):27:1–27. Soft-

ware available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Accessed on 29 August 2011

16. Chiarini TM, Berndt JD, Luther SL, Foulis PR, French DD

(2009) Identifying fall-related injuries: text mining the electronic

medical record. Inf Technol Manage 10(4):253–265

17. Cios KJ, Pedrycz W, Swiniarski RW, Kurgan LA (2007) Data

mining: a knowledge discovery approach. Springer, USA

18. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn

20(3):273–297

19. Dogan N, Tanrikulu Z (2010) A comparative framework for

evaluating classification algorithms. In: Proceedings of WCE

2010: international data mining and knowledge engineering 1,

London, UK, pp 309–314

20. Dunham MH (2002) Data mining: introductory and advanced

topics. Prentice Hall, New Jersey

21. EL-Manzalawy Y, Honavar V (2005) WLSVM: integrating Lib-

SVM into weka environment. Software available at http://www.

cs.iastate.edu/~yasser/wlsvm. Accessed on 29 August 2011

22. Friedman J, Hastie T, Tibshirani R (2000) Additive logistic

regression: a statistical view of boosting. Ann Stat 28(2):337–407

23. Ge E, Nayak R, Xu Y, Li Y (2006) Data mining for lifetime

prediction of metallic components. In: Proceedings of the 5th

Australasian data mining conference (AusDM2006), Sydney,

Australia, pp 75–81

24. Hacker S, Ahn LV (2009) Matchin: eliciting user preferences

with an online game. In: Proceedings of the 27th international

conference on human factors in computing systems (CHI’09),

pp 1207–1216

25. Han J, Kamber M (2006) Data mining concepts and techniques,

2nd edn. Morgan Kaufmann, USA

26. Hand D, Mannila H, Smyth P (2001) Principles of data mining.

The MIT Press, Cambridge


27. He H, Jin H, Chen J, McAullay D, Li J, Fallon T (2006) Analysis

of breast feeding data using data mining methods. In: Proceedings

of the 5th Australasian data mining conference (AusDM2006),

Sydney, Australia, pp 47–53

28. Hergert F, Finnoff W, Zimmermann HG (1995) Improving model

selection by dynamic regularization methods. In: Petsche T,

Hanson SJ, Shavlik J (eds) Computational learning theory and

natural learning systems: selecting good models. MIT Press,

Cambridge, pp 323–343

29. Hill T, Lewicki P (2007) STATISTICS: methods and applica-

tions. StatSoft, Tulsa

30. Howley T, Madden MG, OConnell ML, Ryder AG (2006) The effect

of principal component analysis on machine learning accuracy with

high dimensional spectral data. Knowl-Based Syst 19(5):363–370

31. Jamain A, Hand DJ (2008) Mining supervised classification per-

formance studies: a meta-analytic investigation. J Classif 25(1):

87–112

32. John GH, Langley P (1995) Estimating continuous distributions

in Bayesian classifiers. In: Besnard P, Hanks S (eds) Proceedings

of the 17th conference on uncertainty in artificial intelligence.

Morgan Kaufmann, USA, pp 338–345

33. Kaelbling LP (1994) Associative methods in reinforcement

learning: an empirical study. In: Hanson SJ, Petsche T, Kearns M,

Rivest RL (eds) Computational learning theory and natural

learning systems: intersection between theory and experiment.

MIT Press, Cambridge, pp 133–153

34. Kalousis A, Gama J, Hilario M (2004) On data and algorithms:

understanding inductive performance. Mach Learn 54(3):275–312

35. Keogh E, Kasetty S (2002) On the need for time series data

mining benchmarks: a survey and empirical demonstration. Data

Min Knowl Disc 7(4):349–371

36. Keogh E, Stefano L, Ratanamahatana CA (2004) Towards

parameter-free data mining. In: Kim W, Kohavi R, Gehrke J,

DuMouchel W (eds) Proceedings of the 10th ACM SIGKDD

international conference on knowledge discovery and data min-

ing. Seattle, Washington, pp 206–215

37. Kim SB, Han KS, Rim HC, Myaeng SH (2006) Some effective

techniques for Naive Bayes text classification. IEEE Trans Knowl

Data Eng 18(11):1457–1466

38. Ko M, Osei-Bryson KM (2008) Reexamining the impact of

information technology investment on productivity using

regression tree and multivariate adaptive regression splines

(MARS). Inf Technol Manage 9(4):285–299

39. Kohonen T (1990a) Improved versions of learning vector quan-

tization. In: Proceedings of the international joint conference on

neural networks I, San Diego, pp 545–550

40. Kohonen T (1990) The self-organizing map. Proc IEEE 78(9):

1464–1480

41. Kohonen T, Hynninen J, Kangas J, Laaksonen J, Torkkola K

(1995) LVQ-PK: the learning vector quantization package ver-

sion 3.1. Technical Report. Helsinki University of Technology

Laboratory of Computer and Information Science, Finland.

Available via http://www.cis.hut.fi/research/lvq_pak/lvq_doc.txt.

Accessed on 27 March 2012

42. Le Cessie S, van Houwelingen JC (1992) Ridge estimators in

logistic regression. Appl Stat 41(1):191–201

43. Lim T-S, Loh W-Y, Shih Y-S (2000) A comparison of prediction

accuracy, complexity, and training time of thirty-three old and

new classification algorithms. Mach Learn 40(3):203–228

44. Loh WY, Shih YS (1997) Split selection methods for classifica-

tion trees. Stat Sin 7(4):815–840

45. Maimon O, Rokach L (2010) The data mining and knowledge

discovery handbook. Springer, Berlin, 2nd edn

46. Maindonald J (2006) Data mining methodological weaknesses

and suggested fixes. In: Proceedings of the 5th Australasian data

mining conference (AusDM2006), Sydney, Australia, pp 9–16

47. Mitchell TM (1997) Machine learning. McGraw Hill, USA

48. Pawlak Z (1982) Rough sets. Int J Parallel Prog 11(5):341–356

49. Pawlak Z (1991) Rough sets: theoretical aspects of reasoning

about data. Kluwer, Dordrecht

50. Pitt E, Nayak R (2007) The use of various data mining and

feature selection methods in the analysis of a population survey

dataset. In: Ong KL, Li W, Gao J (eds) Proceedings of the 2nd

international workshop on integrating artificial intelligence and

data mining (AIDM 2007) CRPIT 87. Gold Coast, Australia,

pp 87–97

51. Putten P, Lingjun M, Kok JN (2008) Profiling novel classification

algorithms: artificial immune system. In: Proceedings of the 7th

IEEE international conference on cybernetic intelligent systems

(CIS 2008), London, UK, pp 1–6

52. Quinlan JR (1993) C4.5: programs for machine learning. Morgan

Kaufmann, USA

53. Quinlan JR (1994) Comparing connectionist and symbolic

learning methods. In: Hanson SJ, Drastal GA, Rivest RL (eds)

Computational learning theory and natural learning systems:

constraints and prospect. MIT Press, Cambridge, pp 445–456

54. Rumelhart DE, Hinton GE, Williams RJ (1986) Learning internal

representations by error propagation. In: Rumelhart DE, McC-

lelland JL, and the PDP research group (eds) Parallel distributed

processing: explorations in the microstructure of cognition 1:

foundations. MIT Press, Cambridge, pp 318–362

55. Shih YS (2004) QUEST user manual. Department of Mathe-

matics, National Chung Cheng University, Taiwan. Available via

http://www.stat.wisc.edu/~loh/treeprogs/quest/questman.pdf.

Accessed on 15 August 2011

56. SPSS (2012) CHAID and exhaustive CHAID algorithms. Available

via ftp://ftp.software.ibm.com/software/analytics/spss/support/

Stats/Docs/Statistics/Algorithms/13.0/TREE-CHAID.pdf. Acces-

sed on 9 April 2012

57. Su CT, Hsiao YH (2009) Multiclass MTS for simultaneous fea-

ture selection and classification. IEEE Trans Knowl Data Eng

21(2):192–204

58. UCI Machine Learning Repository (2010) Available via http://

archive.ics.uci.edu/ml. Accessed on 10 December 2010

59. Watkins AB (2001) AIRS: a resource limited artificial immune

classifier. M.Sc. Thesis, Mississippi State University Department

of Computer Science, USA. Available via http://www.cse.msstate.edu/~andrew/research/publications/watkins_thesis.pdf. Acc-

essed on 15 August 2011

60. Watkins AB (2005) Exploiting immunological metaphors in the

development of serial, parallel and distributed learning algorithms.

PhD Thesis, University of Kent, Canterbury, UK Available via

http://www.cse.msstate.edu/~andrew/research/publications/

watkins_phd_dissertation.pdf. Accessed on 15 August 2011

61. Watkins AB, Timmis J, Boggess L (2004) Artificial immune

recognition system (AIRS): an immune-inspired supervised

learning algorithm. Genet Program Evolvable Mach 5(3):291–

317

62. WEKA (2011) Classification algorithms. Available via http://

wekaclassalgos.sourceforge.net/. Accessed on 15 August 2011

63. Witten IH, Frank E (2005) Data mining: practical machine

learning tools and techniques, 2nd edn. Morgan Kaufmann, USA

64. Wu X, Kumar V (2009) The top ten algorithms in data mining.

Chapman & Hall/CRC Press, Taylor & Francis Group, USA

65. Yang Y, Webb GI, Cerquides J, Korb KB, Boughton J, Ting KM

(2007) To select or to weigh: a comparative study of linear

combination schemes for superparent-one-dependence estima-

tors. IEEE Trans Knowl Data Eng 19(12):1652–1664

66. Zhu D, Premkumar G, Zhang X, Chu C-H (2001) Data mining for

network intrusion detection: a comparison of alternative methods.

Decis Sci 32(4):635–660
