Model Selection Criterions as Data Mining Algorithms’ Selector
The Selection of Data Mining Algorithms through Model Selection Criterions
Dost Muhammad Khan1, Nawaz Mohamudally2
1Assistant Professor, Department of Computer Science & IT, The Islamia University of Bahawalpur, PAKISTAN & PhD Student, School of Innovative Technologies & Engineering, University of Technology, Mauritius (UTM), MAURITIUS
2Associate Professor & Manager, Consultancy & Technology Transfer Centre, University of Technology, Mauritius (UTM), MAURITIUS
Abstract
The selection criterion plays a vital role in choosing the right model for the right dataset. It is a gauge for determining whether a model under-fits or over-fits the dataset. Both under-fitting and over-fitting are errors that lead to vague or ambiguous knowledge being extracted from the dataset, and hence need to be addressed properly. The data is used either to predict future behavior or to describe patterns in an understandable form within the discovered process. The major issue is how to avoid these problems. There are different approaches to avoiding over- and under-fitting, namely model selection, jittering, weight decay, early stopping and Bayesian estimation. We discuss only the model selection criteria in this paper. Furthermore, we focus on how the value of a model selection criterion is used to map the appropriate data mining algorithm to the dataset.
Keywords: AIC, BIC, Overfitting, Underfitting, Model Selection
1. Introduction
The purpose of model selection is to identify the model that best fits the available dataset, with model complexity being corrected or penalized. There are two main issues in data mining: the first is bad or wrong data, and the second is controlling the model capacity, making sure it is neither so small that we miss useful and exploitable patterns, nor so large that we confuse pattern with noise. The concepts of over-fitting and under-fitting are therefore important in data mining. Over- and under-fitting are due to missing, noisy, inconsistent and redundant values and to the number of attributes in a dataset. We can avoid these problems by using one of these techniques: apply upper or lower threshold values, remove attributes below a threshold value, and remove noise and redundant attributes. The best solution to these problems is to use plenty of training data and to make neither too many nor too few assumptions. The other possible solutions are model selection, jittering, weight decay, early stopping and Bayesian estimation. The model selection criterion is discussed in this paper.
There exist models for the selection of data mining algorithms, such as the VC (Vapnik-Chervonenkis) dimension, AIC (Akaike Information Criterion), BIC (Bayesian Information Criterion), SRMVC (Structural Risk Minimization with VC dimension), CV (Cross-validation), the Deviance Information Criterion, the Hannan-Quinn Information Criterion, the Jensen-Shannon Divergence, the Kullback-Leibler Divergence and many more. We select only two model selection criteria, AIC and BIC. A model is better than another model if it has a smaller AIC or BIC value. Both AIC and BIC have solid theoretical foundations: the Kullback-Leibler distance in information theory (for AIC), and the integrated likelihood in Bayesian theory (for BIC). If the complexity of the true model does not increase with the size of the dataset, BIC is the preferred criterion; otherwise AIC is preferred. Since selecting the number of parameters and the number of attributes is a model selection problem, one has to take care of these important aspects of a dataset. Using too many parameters can fit the data perfectly, but may be over-fitting. Using too few parameters may not fit the dataset at all, thus under-fitting. This shows the importance of the parameters and the observed data in a given dataset. Variable selection by AIC or BIC provides an answer to this problem. We illustrate the importance of comparing different models with different numbers of parameters by using AIC and BIC.
The goal of this paper is to draw a comparison between the two selected model selection criteria, namely AIC and BIC, and to map the appropriate data mining algorithms to a particular dataset, i.e. the right algorithm for the dataset. The idea of model selection using AIC or BIC has also been applied recently to epidemiology, microarray data analysis, and DNA sequence analysis [21][22][23][24][25].
JOURNAL OF COMPUTING, VOLUME 4, ISSUE 3, MARCH 2012, ISSN 2151-9617
https://sites.google.com/site/journalofcomputing
WWW.JOURNALOFCOMPUTING.ORG 102
The rest of the paper is organized as follows: section 2 discusses the model selection criteria AIC and BIC, and section 3 is about the methodology. In section 4 the results are discussed, and finally the conclusion is drawn in section 5.
2. Model Selection Criteria
A brief introduction to AIC and BIC is given below:
1. Akaike Information Criterion: The AIC is a criterion for model selection, developed by Hirotsugu Akaike in 1974, under the name of Akaike Information Criterion. The AIC is based on information theory. Suppose that the data is generated by some unknown process f. We consider two candidate models to represent f: g1 and g2. If we knew f, then we could find the information lost from using g1 to represent f by calculating the Kullback-Leibler divergence, DKL(f, g1); similarly, the information lost from using g2 to represent f would be found by calculating DKL(f, g2). We would then choose the candidate model that minimized the information loss. The AIC can tell nothing about how well a model fits the data in an absolute sense. If all the candidate models fit poorly, AIC will not give any warning. In other words, AIC is a trade-off between the accuracy and the complexity of the model [7][9][10][15][17][18][20]. The AIC is defined below:

AIC = 2k − 2 log(likelihood)

where 2k is the term for the number of parameters and log(likelihood) is the log of the likelihood. The term −2 log(likelihood), whose value gradually approaches 0 as the number of parameters increases towards a perfect fit, is also called the Model Accuracy. Therefore, AIC is:

AIC = No. of Parameters + Model Accuracy
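The definition above can be computed directly. This is a minimal sketch, not taken from the paper; the log-likelihood values and parameter counts are made up purely for illustration:

```python
def aic(log_likelihood, k):
    """Akaike Information Criterion: AIC = 2k - 2*log(likelihood)."""
    return 2 * k - 2 * log_likelihood

# Two hypothetical fits of the same data (illustrative values only):
print(aic(log_likelihood=-120.5, k=3))  # simpler model -> 247.0
print(aic(log_likelihood=-118.9, k=7))  # better fit, but more parameters:
                                        # larger AIC, so the simpler model wins
```

The smaller AIC wins even though the larger model fits slightly better, because the 2k penalty outweighs the improvement in log-likelihood.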
2. Bayesian Information Criterion: The BIC is a criterion for the selection of a model among a class of models with different numbers of parameters. When estimating model parameters using maximum likelihood estimation, it is possible to increase the likelihood by adding parameters, which may result in over-fitting. BIC resolves this problem by introducing a penalty term for the number of parameters in the model. This penalty is larger in the BIC than in the related AIC. BIC is widely used for model identification in time series and linear regression. The main characteristics of BIC are: it measures the efficiency of the parameterized model in terms of predicting the data; it penalizes the complexity of the model, where complexity refers to the number of parameters in the model; it is exactly equal to the minimum description length criterion but with negative sign; and it is closely related to other likelihood criteria such as AIC [4][8][12][13][16][19]. The formula for BIC is given below:

BIC = −2 log(likelihood) + k log(n)

where k is the number of parameters and n is the sample size, i.e. the number of datapoints of the given dataset. The term −2 log(likelihood), whose value gradually approaches 0 with the increase in the number of parameters, is also known as the Model Accuracy, and k log(n) is the model size. Therefore, BIC is:

BIC = Model Size + Model Accuracy
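The BIC formula can be sketched the same way; the input values below are hypothetical, with n = 261 borrowed from the 'cars' sample size used later in the paper:

```python
import math

def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: BIC = k*log(n) - 2*log(likelihood)."""
    return k * math.log(n) - 2 * log_likelihood

# With the same hypothetical fit, BIC's penalty k*log(n) exceeds AIC's 2k
# whenever log(n) > 2, i.e. for any sample larger than about 7 datapoints.
print(bic(log_likelihood=-120.5, k=3, n=261))
```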
3. Methodology
Suppose there is a sample X = {x1, x2, ..., xn} of n observations, coming from a distribution with an unknown probability density function p(X|θ), where p(X|θ) is called a parametric model, in which all the parameters are in finite-dimensional parameter spaces. These parameters are collected together to form a single m-dimensional parameter vector θ = (θ1, θ2, ..., θm). To use the method of maximum likelihood, one first specifies the joint density function for all observations. The joint density function of the given observations is given below:

p(X|θ) = p(x1, x2, ..., xn|θ) = p(x1|θ) · p(x2|θ) ··· p(xn|θ)
where the observed values x1, x2, ..., xn are fixed parameters of this function and θ is the function's variable, allowed to vary freely. From this point of view this distribution function is called the likelihood:

likelihood(θ|x1, x2, ..., xn) = p(x1, x2, ..., xn|θ) = ∏_{i=1..n} p(xi|θ)

It is more convenient to work with the logarithm of the likelihood function, called the log-likelihood, as shown below:

log(likelihood(θ|x1, x2, ..., xn)) = Σ_{i=1..n} log(p(xi|θ))

and the average log-likelihood is (1/n) log(likelihood).
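The log-likelihood above can be evaluated once a density p(x|θ) is chosen. The sketch below assumes a Gaussian density, which the paper does not specify; the sample values are invented purely to make the computation concrete:

```python
import math

def log_likelihood(xs, mu, sigma):
    """Sum of log p(x_i | mu, sigma) for an i.i.d. Gaussian sample."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in xs
    )

xs = [4.8, 5.1, 5.0, 4.9, 5.2]                 # illustrative sample
mu_hat = sum(xs) / len(xs)                     # MLE of the mean = sample mean
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in xs) / len(xs))  # MLE of sigma

print(log_likelihood(xs, mu_hat, sigma_hat))
# Any other mu gives a smaller log-likelihood, e.g.:
print(log_likelihood(xs, mu_hat + 0.5, sigma_hat))
```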
The method of maximum likelihood was first proposed by the English statistician and population geneticist R. A. Fisher. The maximum likelihood method finds the estimate of a parameter that maximizes the probability of observing the data given a specific model for the data. The idea behind maximum likelihood parameter estimation is to determine the parameters that maximize the probability (likelihood) of the sample data. From a statistical point of view, the method of maximum likelihood is considered to be robust and yields estimators with good statistical properties. It is a flexible method and can be applied to most models and to different types of data. Although the methodology for maximum likelihood estimation is simple, the implementation is mathematically intense [1][2][3][5][6][11][14].
We select a dataset 'cars', which is about the different models of brands from different countries. The number of attributes is 9, the number of datapoints, or records, i.e. the sample size, is 261, and the number of parameters, which in this dataset is the brands from 3 countries (14 from the US, 10 from Europe and 6 from Japan), is 30. The number of records from the US is 62.45%, from Europe 18.00% and from Japan 19.54% of the whole dataset. The number of parameters/brands from the US is 46.67%, from Europe 33.33% and from Japan 16.67% of the total number of brands or parameters. We use the stepwise variable selection method, starting with one variable and then adding or removing a variable if the value of AIC or BIC is reduced. Stepwise selection is a locally optimal procedure and is tested with different starting sets of parameters so that the optimization is not carried to the extreme.
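Forward stepwise selection driven by AIC can be sketched as below. This is an assumption-laden illustration, not the paper's procedure: it fits ordinary least squares with Gaussian errors (the paper does not state the underlying model), and the synthetic data and informative columns (1 and 3) are invented:

```python
import numpy as np

def aic_linear(y, X):
    """AIC of an OLS fit with Gaussian errors; k counts coefficients + sigma."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    log_lik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    k = X.shape[1] + 1
    return 2 * k - 2 * log_lik

def forward_stepwise(y, X_full):
    """Greedily add the column that most reduces AIC; stop when none does."""
    n, p = X_full.shape
    chosen, current = [], np.ones((n, 1))      # start from an intercept only
    best_aic = aic_linear(y, current)
    improved = True
    while improved:
        improved = False
        for j in [c for c in range(p) if c not in chosen]:
            cand = np.hstack([current, X_full[:, [j]]])
            a = aic_linear(y, cand)
            if a < best_aic:
                best_aic, best_j = a, j
                improved = True
        if improved:
            chosen.append(best_j)
            current = np.hstack([current, X_full[:, [best_j]]])
    return chosen, best_aic

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 1] - 2 * X[:, 3] + rng.normal(scale=0.5, size=100)
sel, score = forward_stepwise(y, X)
print(sorted(sel))   # the informative columns 1 and 3 should appear
```

Running the search with different starting sets, as the paper suggests, guards against the greedy procedure settling into a poor local optimum.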
The following steps explain the computation of the values of AIC and BIC:
Step 1: Calculate the maximum likelihood of the dataset
The likelihood function is simply the joint probability of observing the data. The joint probability is

L(θ|x1, x2, ..., xn) = p(x1, x2, ..., xn|θ) = ∏_{i=1..n} p(xi|θ)

Taking the log of this value gives the value of the model accuracy, which is shown below:

Model Accuracy = −2 log(likelihood)

Step 2: Compute the Model Size
The formula to calculate the model size is Model Size = k log(n), where k is the number of parameters and n is the number of datapoints.
Step 3: Compute the Minimum Description Length (MDL)

MDL Score = Model Size + Model Accuracy

The Minimum Description Length (MDL) is also referred to as BIC (Bayesian Information Criterion). Therefore, here we can say that AIC and BIC are:

AIC = No. of Parameters + Model Accuracy
BIC = Model Size + Model Accuracy

The model with the smallest value performs better than the others; therefore, smallest is best [1].
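The three steps can be traced with stand-in numbers. The log-likelihood here is hypothetical; k = 30 and n = 261 are the parameter count and sample size quoted for the 'cars' dataset:

```python
import math

log_likelihood = -250.0   # Step 1: log of the maximized likelihood (hypothetical)
k, n = 30, 261            # parameters and datapoints of the 'cars' dataset

model_accuracy = -2 * log_likelihood     # Step 1: Model Accuracy = -2*log(likelihood)
model_size = k * math.log(n)             # Step 2: Model Size = k*log(n)
mdl_score = model_size + model_accuracy  # Step 3: MDL score, i.e. BIC
aic = 2 * k + model_accuracy             # AIC = No. of Parameters + Model Accuracy

print(round(aic, 2), round(mdl_score, 2))
```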
4. Results and Discussion
Case 1: We compute the values of AIC and BIC with different sample sizes, i.e. small, medium and large, and with different numbers of parameters, i.e. minimum 2 and maximum 9, where the small sample size is 50, medium is 100 and large is 400. In case 1, the number of attributes, or the observed data, is 9. Table 1 shows the values of AIC and BIC with respect to the different numbers of parameters, when the sample size n is 50.
Table 1 Model Selection with n=50
No. of Parameters BIC AIC
2 15.02 13.10
3 32.30 29.44
4 54.47 50.64
5 76.87 72.09
6 100.49 94.75
7 124.95 118.26
8 150.18 142.53
9 176.10 167.49
Table 1 shows that as the number of parameters increases, the values of AIC and BIC increase for the small sample size. At the beginning the gap between the two values is small, but as the number of parameters increases, the difference between the values of AIC and BIC also becomes large. For this sample dataset AIC is the best selection due to its lower value for each number of parameters. The graph between the value of AIC and the value of BIC when the sample size is 50 is shown in figure 1 below.
Figure 1 A Comparison of AIC and BIC when Sample Size=50
The graph in figure 1 shows that as the number of parameters increases, the values of AIC and BIC increase when the sample size is small, i.e. the dataset has 50 datapoints. The value of AIC remains less than the value of BIC from beginning to end. At the beginning the gap between the two lines is minute; as the number of parameters increases the gap also increases, and at the end of the graph it is clearly visible. So AIC is the preferred criterion for this dataset because of its smaller value.
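The widening gap follows directly from the two penalty terms: BIC charges k·log(n) while AIC charges 2k, so the difference k·(log(n) − 2) grows with both the number of parameters and the sample size. A small sketch using the raw penalty formulas (not the paper's exact parameter counting, so the numbers will not match the tables exactly):

```python
import math

def penalty_gap(k, n):
    """BIC penalty minus AIC penalty: k*log(n) - 2k = k*(log(n) - 2)."""
    return k * (math.log(n) - 2)

for n in (50, 100, 400):
    print(n, [round(penalty_gap(k, n), 2) for k in (2, 5, 9)])
# The gap grows both with more parameters and with a larger sample size,
# matching the qualitative pattern in figures 1-3.
```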
Table 2 shows the values of AIC and BIC with respect to the different numbers of parameters, when the sample size n is 100.
Table 2 Model Selection with n=100
No. of Parameters BIC AIC
2 15.71 13.10
3 33.29 29.38
4 56.50 51.29
5 78.83 72.32
6 102.16 94.35
JOURNAL OF COMPUTING, VOLUME 4, ISSUE 3, MARCH 2012, ISSN 2151-9617
https://sites.google.com/site/journalofcomputing
WWW.JOURNALOFCOMPUTING.ORG 105
8/2/2019 Model Selection Criterions as Data Mining Algorithms’ Selector The Selection of Data Mining Algorithms through Mo…
http://slidepdf.com/reader/full/model-selection-criterions-as-data-mining-algorithms-selector-the-selection 5/13
7 127.67 118.55
8 153.18 142.76
9 179.07 167.35
Table 2 shows that as the number of parameters increases, the values of AIC and BIC increase for the medium sample size. We notice that as the sample size changes from small to medium, the values of AIC and BIC also change. For this sample dataset AIC is the best selection due to its lower value for each number of parameters. The graph between the value of AIC and the value of BIC when the sample size is 100 is shown in figure 2 below.
Figure 2 A Comparison of AIC and BIC when Sample Size=100
The graph in figure 2 shows that as the number of parameters increases, the values of AIC and BIC increase when the sample size is medium, i.e. the dataset has 100 datapoints. The line showing the value of AIC remains below the value of BIC from beginning to end. At the beginning the gap between the two lines is minute; as the number of parameters increases the gap also increases, and at the end of the graph it is clearly visible. Another point in this graph is that the gap between the two lines is wider compared to the graph in figure 1, i.e. as the sample size increases, the difference between the two values increases. Again, for this sample dataset AIC is the best selection.
Table 3 shows the values of AIC and BIC with respect to the different numbers of parameters, when the sample size n is 400.
Table 3 Model Selection with n=400
No. of Parameters BIC AIC
2 17.06 13.09
3 35.33 29.37
4 59.06 51.12
5 82.92 72.99
6 107.33 95.42
7 133.37 119.48
8 159.05 143.16
9 185.17 167.30
Table 3 shows that as the number of parameters increases, the values of AIC and BIC increase for the large sample size. We observe that as the sample size changes from medium to large, the values of AIC and BIC also change. For this sample dataset AIC is the best selection due to its lower value for each number of parameters. The graph between the value of AIC and the value of BIC when the sample size is 400 is shown in figure 3 below.
Figure 3 A Comparison of AIC and BIC when Sample Size=400
The graph in figure 3 shows that as the number of parameters increases, the values of AIC and BIC increase when the sample size is large, i.e. the dataset has 400 datapoints. The line showing the value of AIC remains below the value of BIC from beginning to end. At the beginning the gap between the two lines is minute; as the number of parameters increases the gap also increases, and at the end of the graph it is clearly visible. Another point in this graph is that the gap between the two lines is wider compared to the graphs in figures 1 and 2, i.e. as the sample size increases, the difference between the two values increases. Again, for this sample dataset AIC is the best selection. We conclude from this case that there is no problem of over-fitting in this dataset.
Case 2: In this case the number of attributes, or the observed data, is the same as in case 1, i.e. 9. We compute the values of AIC and BIC on the large sample size of 400 but increase the number of parameters from 10 to 24, i.e. minimum 10 and maximum 24. As the number of parameters increases, the values of both selection criteria increase. When the number of parameters reaches 24, the values of AIC and BIC are infinity, which shows that the dataset is over-fitted, although the number of observed data in this case is not large. The dataset can produce knowledge with up to 23 parameters; of course, that number is again high, but the values of the selection criteria are not infinite. This also proves that the number of parameters is non-trivial for any dataset. If the user does not take care of the parameters, it is difficult to extract knowledge from the given dataset. Table 4 shows the values of AIC and BIC with respect to the different numbers of parameters, when the sample size n is 400.
Table 4 Model Selection with n=400
No. of Parameters BIC AIC
10 251.18 231.32
11 290.03 268.19
12 328.85 305.03
13 359.69 333.88
14 399.57 371.77
15 442.01 412.23
16 473.04 441.27
17 517.13 483.38
18 555.81 520.06
19 601.18 563.46
20 647.36 607.64
21 692.97 651.27
22 742.68 698.99
23 786.41 740.74
24 Infinity Infinity
Table 4 shows that as the number of parameters increases, the values of AIC and BIC increase when the sample size is large, i.e. the dataset has 400 datapoints. At the beginning the gap between the two values is minute, but as the number of parameters increases the gap also increases, and at the end it reaches a maximum difference. The values of AIC and BIC are undetermined from 24 parameters onwards. For this sample dataset AIC is the best selection. The upper threshold value of this model is 23 and the lower threshold value is 2; therefore, the range is between 2 and 23. A value outside this range will create the problem of over- or under-fitting.
Case 3: We take another case in which the number of parameters is small, the observed data is large and the sample size is large. We observe that there is no great difference between the values of AIC and BIC. The choice for this dataset is AIC. The results are shown in table 5 below.
Table 5 Model Selection with n=600
No. of Parameters BIC AIC No. of Attributes
5 522.10 511.12 39
Table 5 shows that with the increase in the number of attributes and the sample size of the dataset, there is no problem of over-fitting for this dataset, and the value of AIC remains small compared to BIC; therefore, AIC is the right choice for this dataset.
The second example of this case is one where the number of parameters is large, the number of observed data is large and the sample size is medium. We notice that the values of AIC and BIC are infinity, or undetermined, which proves that the dataset is over-fitted, although the sample size is medium. The user has to modify the dataset in order to extract knowledge. The results are demonstrated in table 6 below.
Table 6 Model Selection with n=150
No. of Parameters BIC AIC No. of Attributes
20 Infinity Infinity 72
Table 6 shows that with the increase in the number of attributes and the number of parameters, the dataset becomes over-fitted and the values of both AIC and BIC are infinity.
The third example (DNA dataset) of this case is one where the number of parameters is small, the number of observed data is large and the sample size is very large. We notice that there is no great difference between the values of AIC and BIC, but the value of AIC is still small compared to BIC. Although the observed data in this dataset is in binary (0, 1) format, the values of both model selection criteria are still computable. The results are demonstrated in table 7 below.
Table 7 Model Selection with n=2000
No. of Parameters BIC AIC No. of Attributes
3 648.7343 640.333 180
The result of all the cases discussed above is that if the number of observed data or the number of parameters is too small, the dataset is under-fitted. Over- and under-fitting are noise in the data which must be removed in order to extract useful knowledge from the dataset. We conclude that the number of parameters and the number of attributes play a vital role in a dataset becoming over- or under-fitted; therefore, in order to avoid these problems, the user must set up the datasets carefully.
Case 4: The model selection criterion AIC (Akaike Information Criterion) is used to map the appropriate algorithms to a particular dataset, in order to extract the knowledge. We select three data mining algorithms, namely K-means, C4.5 and Data Visualization, and choose five different datasets: 'Iris', 'Diabetes', 'Breastcancer', 'DNA' and 'Cars'. The sample size of each dataset is different. We have to select the appropriate algorithms for each dataset, i.e. the best and most suitable algorithm for a dataset. The values of AIC of these datasets are computed and shown in table 8.
Table 8 The Value of AIC of Datasets
No. of Parameters No. of Attributes Value of AIC Sample Size Dataset
2 9 21.60 788 Diabetes
2 10 24.62 233 Breastcancer
3 5 26.77 150 Iris
3 180 922.48 2000 DNA
23 9 1054.75 400 Cars
Table 8 shows that the values of AIC of the datasets 'Diabetes', 'Breastcancer' and 'Iris' are small compared to those of the datasets 'DNA' and 'Cars'. We also notice in this table that although the number of attributes of the dataset 'Cars' is small, the value of AIC is high, which is due to the high number of parameters. Similarly, in the case of the dataset 'DNA', the number of parameters is small but again the value of AIC is high, which is due to the high number of attributes. The conclusion is that if either the number of attributes or the number of parameters is large, the value of AIC will be high, which shows that the dataset is not suitable for the extraction of knowledge and requires cleansing.
The computational and storage complexities of selected data mining algorithms are shown in table 9.
Table 9 The Complexities of Data Mining Algorithms
Data Mining Algorithm Computational Complexity Storage Complexity
K-means O(nkl) O(n+k)
C4.5 O(n.m) O(n)
Data Visualization (2D Graph) O(d.n) O(n)
Table 9 shows the computational and storage complexities of the K-means, C4.5 and Data Visualization data mining algorithms, where 'n' is the sample size, 'm' is the number of attributes, 'k' is the number of clusters, 'l' is the number of iterations and 'd' is the dimension (in our case it is 2). We take the 'log' of the computational complexities of these algorithms; the value gradually approaches zero with the decrease in the number of parameters. An example explains the use of the 'log': an input dataset containing 10 items takes one second to process, a dataset containing 100 items takes two seconds, and a dataset containing 1000 items takes three seconds. This makes the use of the 'log' extremely efficient when dealing with large datasets. There are some other uses of taking the 'log' of a value: the 'log' is taken if the transformed data comes closer to satisfying the assumptions of the statistical model; to analyze exponential processes, because the 'log' function is the inverse of the exponential function; to measure the pH or acidity of a chemical solution; to measure the intensity of an earthquake on the Richter scale; to model many natural processes with a statistical model; and, here, to compare the computational complexity of a data mining algorithm with the value of the model selection criterion AIC. This will help to select the right algorithm for the given dataset.
Table 10 The value of AIC of Dataset 'Iris' and Complexities of Algorithms
Iris/Data Mining Algorithms Log of Complexities of Algorithms Value of AIC
K-means 19.39 26.77
C4.5 16.78 26.77
Data Visualization 15.46 26.77
Table 10 shows the relationship between the computational complexity of the data mining algorithms and the value of the model selection criterion AIC for the dataset 'Iris'. The value of AIC of the dataset is 26.77; the log of the computational complexity of K-means is 19.39, which is close to the value of AIC. The log of the computational complexity of C4.5 is 16.78, which is close to the value of AIC, and similarly the log of the computational complexity of Data Visualization is 15.46, which is again close to the value of AIC. The result of this table is that the logs of the computational complexities of all the data mining algorithms are close to the value of AIC, so these algorithms are the right choice for the dataset 'Iris'. It is clear from the table that the K-means algorithm is the first choice for the dataset 'Iris', then C4.5 and finally Data Visualization. This is further illustrated in figure 4.
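The mapping mechanic can be sketched as follows. The complexity formulas come from Table 9, and n and m are the 'Iris' sample size and attribute count; the cluster count k, iteration count l and the use of the natural log are assumptions of this sketch, so the numbers will not reproduce Table 10 exactly:

```python
import math

n, m, k, l, d = 150, 5, 3, 100, 2   # 'Iris' size/attributes; k, l, d assumed
aic_iris = 26.77                    # value of AIC for 'Iris' from Table 8

complexities = {
    "K-means": n * k * l,           # O(nkl)
    "C4.5": n * m,                  # O(n.m)
    "Data Visualization": d * n,    # O(d.n)
}

# Rank the algorithms by how close log(complexity) is to the dataset's AIC.
ranked = sorted(complexities, key=lambda a: abs(math.log(complexities[a]) - aic_iris))
for name in ranked:
    print(name, round(math.log(complexities[name]), 2))
```

Under these assumed constants, K-means comes out closest to the AIC value, matching the paper's ordering of first choices for 'Iris'.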
Figure 4. The Graph between the value of AIC & Complexities of DM Algorithms
The graph in figure 4 is a comparison between the computational complexities of the data mining algorithms and the value of the model selection criterion AIC for the dataset 'Iris'. The graph shows that the values of the computational complexities of the data mining algorithms are close to the value of AIC; therefore, these algorithms are the best choice for the dataset 'Iris'. The match between the values is not perfect, but there is still not a huge difference.
Table 11 The value of AIC of Dataset ‘Breastcancer’ and Complexities of Algorithms
Breastcancer/Data Mining Algorithms Log of Complexities of Algorithms Value of AIC
K-means 20.06 24.62
C4.5 18.41 24.62
Data Visualization 16.73 24.62
Table 11 shows the relationship between the computational complexity of the data mining algorithms and the value of the model selection criterion AIC for the dataset 'Breastcancer'. The value of AIC of the dataset is 24.62; the log of the computational complexity of K-means is 20.06, which is close to the value of AIC. The log of the computational complexity of C4.5 is 18.41, which is close to the value of AIC, and similarly the log of the computational complexity of Data Visualization is 16.73, which is again close to the value of AIC. The result of this table is that the logs of the computational complexities of all the data mining algorithms are close to the value of AIC, so these algorithms are the right choice for the dataset 'Breastcancer'. It is clear from the table that the K-means algorithm is the first choice for the dataset 'Breastcancer', then C4.5 and finally Data Visualization. This is illustrated in figure 5.
Figure 5. The Graph between the value of AIC & Complexities of DM Algorithms
The graph in figure 5 is a comparison between the computational complexities of the data mining algorithms and the value of the model selection criterion AIC for the dataset 'Breastcancer'. The graph shows that the values of the computational complexities of the data mining algorithms are close to the value of AIC; therefore, these algorithms are the best choice for the dataset 'Breastcancer'. The match between the values is not perfect, but the difference is still not huge.
Table 12 The value of AIC of Dataset ‘Diabetes’ and Complexities of Algorithms
Diabetes/Data Mining Algorithms Log of Complexities of Algorithms Value of AIC
K-means 23.57 21.60
C4.5 22.41 21.60
Data Visualization 20.24 21.60
Table 12 shows the relationship between the computational complexity of the data mining algorithms and the value of the model selection criterion AIC for the dataset 'Diabetes'. The value of AIC of the dataset is 21.60; the log of the complexity of K-means is 23.57, which is greater than the value of AIC but still close to it. The log of the complexity of C4.5 is 22.41, which is greater than the value of AIC, but the difference is not large, and similarly the log of the complexity of Data Visualization is 20.24, which is almost equal to the value of AIC. The result of this table is that the logs of the complexities of all the data mining algorithms are close to the value of AIC, so these algorithms are the right choice for the dataset 'Diabetes'. Figure 6 illustrates this comparison.
[Bar chart, dataset 'Diabetes': x-axis: data mining algorithms (K-means, C4.5, Data Visualization); y-axis: value of AIC & complexities of algorithms, 18.00 to 24.00; series: Complexities, AIC.]
Figure 6. The Graph between the value of AIC & Complexities of DM Algorithms
The graph in figure 6 compares the computational complexities of the data mining algorithms with the
value of the model selection criterion AIC for the dataset 'Diabetes'. The graph shows that the values of the
two data mining algorithms K-means and C4.5 are greater than the value of AIC, while the value of Data
Visualization is less than the value of AIC; in each case, however, the two values are very close to each
other, so these algorithms are the best choice for the dataset 'Diabetes'. The differences between the values
are small compared to figures 4 and 5, although the log complexities of K-means and C4.5 exceed the AIC
value of the dataset.
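The mapping used above, choosing the algorithms whose log complexity lies closest to the dataset's AIC value, can be expressed as a short ranking routine. The following is a minimal sketch of the idea, not the authors' implementation; the numeric values are taken from Table 12.

```python
# Log values of computational complexity for the 'Diabetes' dataset (Table 12)
# and the dataset's AIC value.
AIC_DIABETES = 21.60
LOG_COMPLEXITY = {"K-means": 23.57, "C4.5": 22.41, "Data Visualization": 20.24}

def rank_by_closeness(aic, log_complexities):
    """Order algorithm names by |log complexity - AIC|, closest first."""
    return sorted(log_complexities, key=lambda name: abs(log_complexities[name] - aic))

ranking = rank_by_closeness(AIC_DIABETES, LOG_COMPLEXITY)
```

On these numbers C4.5 is the closest match (difference 0.81), followed by Data Visualization (1.36) and K-means (1.97); all three differences are small, which is consistent with the table's conclusion that every algorithm is acceptable for 'Diabetes'.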
Table 13 The value of AIC of Dataset ‘DNA’ and Complexities of Algorithms
DNA/Data Mining Algorithms Log of Complexities of Algorithms Value of AIC
K-means 26.84 922.48
C4.5 29.42 922.48
Data Visualization 22.93 922.48
Table 13 shows the relationship between the computational complexities of the data mining algorithms and the value of
model selection criterion AIC for the dataset 'DNA'. The value of AIC of the dataset is 922.48, while the log
value of the complexity of K-means is 26.84, so there is a huge difference between the two values. The log
value of the complexity of C4.5 is 29.42, again a big difference, and similarly the log value of the
complexity of Data Visualization is 22.93, where the gap is also very large. The result of this table shows
that the log values of the complexities of the data mining algorithms are nowhere near the value of AIC, so
these algorithms are not suitable for the dataset 'DNA'. In other words, the selected data mining algorithms
are not the right choice for this dataset. In order to make the dataset 'DNA' usable, the number of attributes
of the dataset should be reduced. This is further illustrated in figure 7.
[Bar chart, dataset 'DNA': x-axis: data mining algorithms (K-means, C4.5, Data Visualization); y-axis: value of AIC & complexities of algorithms, 0.00 to 1000.00; series: Complexities, AIC.]
Figure 7. The Graph between the value of AIC & Complexities of DM Algorithms
The graph in figure 7 compares the computational complexities of the data mining algorithms with the
value of the model selection criterion AIC for the dataset 'DNA'. The graph shows a huge difference
between the values of computational complexity of the data mining algorithms and the value of AIC;
therefore, these algorithms are not suitable for the dataset 'DNA'. In this case the values are not even
comparable because the difference is so large.
Table 14 The value of AIC of Dataset ‘Cars’ and Complexities of Algorithms
Cars/Data Mining Algorithms Log of Complexities of Algorithms Value of AIC
K-means 25.21 1054.75
C4.5 20.46 1054.75
Data Visualization 18.29 1054.75
Table 14 shows the relationship between the computational complexities of the data mining algorithms and the value of
model selection criterion AIC for the dataset 'Cars'. The value of AIC of the dataset is 1054.75 and the log
value of the complexity of K-means is 25.21, which is far from the value of AIC. The log value of the
complexity of C4.5 is 20.46, again a huge difference, and similarly the log value of the complexity of Data
Visualization is 18.29, once more an enormous gap. The result of this table shows that the log values of the
complexities of the data mining algorithms are nowhere near the value of AIC, so these algorithms are not
suitable for the dataset 'Cars'. In other words, the selected data mining algorithms are not the right choice
for this dataset. In order to make the dataset 'Cars' usable, the number of parameters of the dataset should be
reduced. Figure 8 illustrates this comparison.
[Bar chart, dataset 'Cars': x-axis: data mining algorithms (K-means, C4.5, Data Visualization); y-axis: values of AIC & computational complexity, 0.00 to 800.00; series: Computational Complexity, AIC.]
Figure 8. The Graph between the value of AIC & Complexities of DM Algorithms
The graph in figure 8 compares the computational complexities of the data mining algorithms with the
value of the model selection criterion AIC for the dataset 'Cars'. The graph shows an enormous difference
between the log values of the complexities of the data mining algorithms and the value of AIC; therefore,
these algorithms are not suitable for the dataset 'Cars'. Again, the values are not even comparable because
the difference is so large.
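One way to make the judgment "no comparison between the values" concrete is to express the gap as a percentage of the AIC value and compare it against the threshold of 40 that the paper recommends in its conclusion. The formula below is our assumption about how that percentage difference is measured, since the paper does not spell it out; the numbers come from Tables 12, 13 and 14.

```python
# Sketch of the suitability check, assuming the percentage difference is
# measured relative to the AIC value of the dataset.
def percent_diff(log_complexity, aic):
    return abs(log_complexity - aic) / aic * 100.0

def is_suitable(log_complexity, aic, threshold=40.0):
    """An algorithm is suitable when its log complexity is within `threshold`
    percent of the dataset's AIC value."""
    return percent_diff(log_complexity, aic) <= threshold

diabetes_ok = is_suitable(23.57, 21.60)    # K-means on 'Diabetes' (Table 12)
dna_ok = is_suitable(26.84, 922.48)        # K-means on 'DNA' (Table 13)
cars_ok = is_suitable(25.21, 1054.75)      # K-means on 'Cars' (Table 14)
```

Under this reading, 'Diabetes' passes easily (about a 9% gap), while 'DNA' and 'Cars' fail with gaps of roughly 97%, matching the verdicts reached from figures 6 to 8.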
5. Conclusion
This paper discusses the non-trivial and important role of the parameters and observed data in a given dataset.
If a dataset has too few parameters or too little observed data, it is difficult to produce knowledge from it;
such a dataset is called under-fitted. If a dataset has a large number of parameters or a large amount of
observed data, it is again difficult to handle these parameters and to produce knowledge; such a dataset is
called over-fitted. Over-fitting and under-fitting are errors in the dataset which show that the dataset is not
properly cleansed. The conclusion is that a middle range, i.e. a number of parameters neither too small nor
too large, is required to extract knowledge from a dataset. In order to determine whether a dataset is over-
fitted or under-fitted, one has to use a model selection criterion. In this research paper we use two model
selection criteria, AIC and BIC. These criteria are tested over small, medium and large sample sizes and a
comparison is drawn. AIC performs better than BIC for all sample sizes, and as the sample size increases
from small to large, the difference between the values of AIC and BIC increases. Therefore, we opt for AIC
as the selection criterion for our proposed model. Model selection is not hypothesis testing; it does not
conclude whether a model is wrong, but explores and ranks the alternative models. In this paper we map the
value of the model selection criterion AIC of a dataset to the computational complexities of the data mining
algorithms K-means, C4.5 and Data Visualization, which helps to select the appropriate data mining
algorithm(s) for a particular dataset. We test the use of AIC for algorithm selection over five datasets,
namely 'Iris', 'Breastcancer', 'Diabetes', 'DNA' and 'Cars'. The number of parameters, the sample size and the
amount of observed data are different for each dataset. Bar graphs are plotted to compare the values of AIC
of these datasets with the computational complexities of the data mining algorithms. The conclusion is that
the datasets 'Iris', 'Breastcancer' and 'Diabetes' are suitable for knowledge extraction and show no problem
of over- or under-fitting. On the other hand, the datasets 'DNA' and 'Cars' are not suitable for knowledge
extraction due to the over-fitting problem. These datasets require further cleansing, i.e. reducing the number
of parameters of the dataset 'Cars' and the number of attributes of the dataset 'DNA'. We also recommend a
threshold value for the selection of a data mining algorithm: if the percentage difference between the value
of AIC and the complexities is within 40, then the algorithm(s) is suitable for that dataset; otherwise the
dataset requires cleansing, i.e. a reduction in the number of parameters or attributes. To conclude, in this
paper we use the model selection criterion AIC to avoid the over- and under-fitting problems of a dataset
and map its value to the complexities of the data mining algorithms, which helps to select the right
algorithm for the right dataset. The results are encouraging and satisfactory. Furthermore, the number of
parameters of a dataset can also be used as the number of clusters (the value of 'k') for the K-means
clustering algorithm.
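For reference, the two criteria used throughout the paper have the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of parameters, n the sample size and L the maximised likelihood. The small sketch below is ours, not the authors' code; it shows why the gap between the two criteria grows with sample size, as observed in the comparison above.

```python
import math

def aic(k, log_likelihood):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_likelihood

def bic(k, n, log_likelihood):
    """Bayesian Information Criterion: k ln n - 2 ln L."""
    return k * math.log(n) - 2 * log_likelihood

# For the same fitted model, BIC - AIC = k * (ln n - 2), so the gap between
# the two criteria widens as the sample size n increases.
gap_small = bic(3, 50, -10.0) - aic(3, -10.0)
gap_large = bic(3, 5000, -10.0) - aic(3, -10.0)
```

Because BIC's penalty term k ln n exceeds AIC's penalty 2k once n is larger than about 7, BIC penalizes additional parameters more heavily on all but the smallest samples.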
Acknowledgement
The authors are thankful to The Islamia University of Bahawalpur, Pakistan, for providing financial
assistance to carry out this research activity under HEC project 6467/F-II.
References
[1] URL: http://www.doc.ic.ac.uk/~dfg/ProbabilisticInference/IDAPILecture08.pdf, 2011.
[2] Aldrich, John, "R. A. Fisher and the making of maximum likelihood 1912–1922", Statistical Science
12 (3): 162–176. doi:10.1214/ss/1030037906. MR1617519, 1997.
[3] Andersen, Erling B., "Asymptotic Properties of Conditional Maximum Likelihood Estimators", Journal
of the Royal Statistical Society B 32, 283–301, 1970.
[4] Andersen, Erling B., "Discrete Statistical Models with Social Science Applications", North Holland,
1980.
[5] Basu, Debabrata, "Statistical Information and Likelihood: A Collection of Critical Essays", J.K. Ghosh,
editor, Lecture Notes in Statistics Volume 45, Springer-Verlag, 1988.
[6] Le Cam, Lucien, "Maximum likelihood — an introduction", ISI Review 58 (2): 153–171, 1990.
[7] Burnham, Kenneth P., Anderson, David R., "Model Selection and Multi-model Inference: a Practical
Information-theoretic Approach", 2nd edition, Springer, ISBN: 0-387-95364-7, 2002.
[8] Brockwell, P.J., and Davis, R.A., "Time Series: Theory and Methods", 2nd ed., Springer, 2009.
[9] Akaike, Hirotugu, "A new look at the statistical model identification", IEEE Transactions on Automatic
Control 19 (6): 716–723. doi:10.1109/TAC.1974.1100705. MR0423716, 1974.
[10] Weakliem, David L., "A Critique of the Bayesian Information Criterion for Model Selection",
University of Connecticut, Sociological Methods & Research, vol. 27, no. 3, 359–397, 1999.
[11] Cavanaugh, Joseph E., "Statistics and Actuarial Science", The University of Iowa, URL:
http://myweb.uiowa.edu/cavaaugh/ms_lec_6_ho.pdf, 2009.
[12] Liddle, A.R., "Information criteria for astrophysical model selection",
http://xxx.adelaide.edu.au/PS_cache/astro-ph/pdf/0701/0701113v2.pdf.
[13] Ernest, S. et al., "How to be a Bayesian in SAS: model selection uncertainty in PROC LOGISTIC and
PROC GENMOD", Harvard Medical School, Harvard Pilgrim Health Care, Boston, MA, 2010.
[14] In Jae Myung, "Tutorial on maximum likelihood estimation", Department of Psychology, Ohio State
University, USA, Journal of Mathematical Psychology 47 (2003), pp. 90–100.
[15] Guyon, Isabelle, "A practical guide to model selection", ClopiNet, Berkeley, CA 94708, USA, 2010.
[16] Cherkassky, Vladimir, "Comparison of model selection methods for regression", Dept. Electrical &
Computer Eng., University of Minnesota, 2010.
[17] Schwarz, G., "Estimating the dimension of a model", Ann. Stat. 6: 461–464, 1978.
[18] Burnham, K.P., Anderson, D.R., "Model Selection and Inference", Springer, 1998.
[19] Parzen, E., Tanabe, K., Kitagawa, G., "Selected Papers of Hirotugu Akaike", Springer, 1998.
[20] Li, W., "DNA segmentation as a model selection process", Proc. RECOMB'01, in press, 2001.
[21] Li, W., "New criteria for segmenting DNA sequences", submitted, 2001.
[22] Li, W., Sherriff, A., Liu, X., "Assessing risk factors of human complex diseases by Akaike and
Bayesian information criteria (abstract)", Am J Hum Genet 67(Suppl): S222, 2000.
[23] Li, W., Yang, Y., "How many genes are needed for a discriminant microarray data analysis", Proc.
CAMDA'00, in press, 2001.
[24] Li, W., Yang, Y., Edington, J., Haghighi, F., "Determining the number of genes needed for cancer
classification using microarray data", submitted, 2001.
[25] Li, Wentian, Nyholt, Dale R., "Marker Selection by AIC and BIC", Laboratory of Statistical Genetics,
The Rockefeller University, New York, NY, 2010.