Optimal Feature Selection for Cluster Based Ensemble Classifier Using
Meta-Heuristic Function for Medical Disease Data Classification for
Symptom Prediction
Deepesh Yadav, Department of Computer Science & Engineering,
Samrat Ashok Technological Institute (Polytechnic) College, Vidisha
Abstract. The diversity and applicability of data mining increase day by day in the field of medical
science for the prediction of disease symptoms. Data mining provides many techniques for mining
data in several fields, such as association rule mining, clustering, classification, and emerging
techniques such as ensemble classification. An ensemble classifier increases the classification rate and
improves on the majority voting of individual classification algorithms such as KNN, decision trees,
and support vector machines. A new paradigm of ensemble classifier is the cluster-oriented ensemble
technique for the classification of data.
This research paper applies a classification process based on classifier selection to medical disease
data and proposes a clustering-based classifier selection method. In this method, many clusters are
selected for the ensemble process. The average performance of each classifier on the selected clusters
is then calculated, and the classifier with the best average performance is chosen to classify the given
data. In the computation of the average performance, a weighted average is used, with weight values
calculated according to the distances between the given data and each selected cluster. There are
generally two types of multiple-classifier combination: multiple-classifier selection and
multiple-classifier fusion. Multiple-classifier selection assumes that each classifier has expertise in
some local regions of the feature space and attempts to find which classifier has the highest local
accuracy in the vicinity of an unknown test sample; this classifier is then nominated to make the final
decision of the system.
Performance of a classifier is frequently the most important aspect of its value and is measured using
a variety of well-known methods and metrics. The interpretability of a classifier, on the other hand, is
often treated as less important or even neglected. However, it is vital for the users of the classifier:
they trust a classifier more if they can understand how it works, and additional knowledge about the
relations in the observed data can be extracted from an interpretable classifier. Consequently, some
older methods focus on the knowledge embodied in learned classifiers or on transforming
non-interpretable classifiers into human-readable structures. There is a lack of algorithms that treat the
accuracy and the interpretability of classifiers as equally significant, converting the problem of
constructing a classifier into a heuristic optimization problem. Such algorithms are especially
important in domains where some parts of the attribute space can be classified with high accuracy by
an interpretable classifier while other parts require non-interpretable classifiers to achieve the required
classification accuracy.
The process of combining different clustering outputs (cluster ensemble or clustering aggregation)
has emerged as an alternative approach for improving the quality of the results of clustering
algorithms, building on the success of the combination of supervised classifiers. Given a set of objects,
a cluster ensemble method consists of two principal steps: generation, the creation of a set of partitions
of these objects, and the consensus function, in which a new partition integrating all partitions obtained
in the generation step is computed. Over the past years, many clustering ensemble techniques have
been proposed, resulting in new ways to face the problem together with new fields of application for
these techniques. Besides the presentation of the main methods, the introduction of a taxonomy of the
different tendencies and critical comparisons among the methods are really important in order to give
a practical application to a survey. Thus, due to the importance that clustering ensembles have gained
in cluster analysis, we have made a critical study of the different approaches and the existing methods.
Feature selection is used to select a subset of relevant features from the dataset in order to build robust
classification models. Classification accuracy is improved by removing the most irrelevant and
redundant features from the dataset. An ensemble model is proposed for improving classification
accuracy by combining the predictions of multiple classifiers. In this research paper we use a
cluster-based ensemble classifier. The performance of each classifier and of the ensemble model is
evaluated using statistical measures such as accuracy, specificity, and sensitivity.
Classification of medical data is an important task in the prediction of any disease and even helps
doctors in their diagnosis decisions. A cluster-oriented ensemble classifier generates a set of classifiers
instead of a single classifier for the classification of a new object, in the hope that combining the
answers of multiple classifiers results in better performance.
We demonstrate the algorithmic use of the classification technique by extending SVM, one of the most
popular binary classification algorithms. From the studies above, the key to improving a
cluster-oriented classifier is to improve binary classification. In the final part of this paper, we include
an empirical evaluation that aims at understanding binary classification better in the context of
ensemble learning.
Keywords: classifier and cluster ensembles, genetic algorithms, mining drifting data streams, data
streams
I. INTRODUCTION
Clustering and classification play an important role in the data mining and machine learning
paradigms, and ensemble classifiers offer a clear advantage over binary and conventional classifiers.
Prototype classification combines two or more methods of the same nature, and prototype
classification of data takes the form of a cluster-oriented ensemble classifier. The need for online
transaction of data motivates stream classification, which saves computation time and network
storage. For medical disease data classification, various machine learning techniques are used, such
as clustering, classification, and neural networks. In the classification of a growing data stream, either
the short-term or the long-term behavior of the stream may be more significant, and it often cannot be
known a priori which one is more important. We must decide the window or horizon of the training
data to use so as to obtain the best classification accuracy; for the proper selection of this window and
horizon, an ensemble classifier is used with the support of a clustering technique. In ensemble
methods, the main strategy is to maintain a dynamic set of classifiers: when a decrease in performance
is observed, new classifiers are fused into the ensemble while old and poorly performing ones are
removed. For classification, the decisions of the classifiers in the ensemble are combined, usually with
a cluster scheme [10].
The advantage of cluster ensembles over single classifiers in the data stream classification problem
has been proved empirically and theoretically [1, 2]. However, few ensemble methods have been
designed to take into consideration the problem of recurring contexts [3, 4]. Specifically, in problems
where concepts re-occur, models of the ensemble should be maintained in memory even if they do not
perform well on the latest batch of data. Moreover, every classifier should be specialized in a unique
concept, meaning that it should be trained on data belonging to this concept and used for classifying
similar data. In [5, 7], a methodology is described that identifies concepts by grouping classifiers of
similar performance on specific time intervals; clusters are then assigned to classifiers according to
performance on the latest batch of data, and predictions are made by weighted averaging. Although
this strategy fits very well with the recurring-contexts problem, it has an offline step for the discovery
of concepts that is not suitable for data streams. In particular, this framework will probably be
inaccurate for concepts that did not appear in the training set.
Real-world datasets must be classified even when features from different classes overlap. Learning
the class borders between overlapping class features is a hard problem: extreme training of the base
classifiers will fit the decision border accurately but results in overfitting and thus misclassifies test
instances, while learning generalized boundaries avoids overfitting at the cost of always misclassifying
some overlapping features. This problem of learning the class boundaries of overlapping features is
inherent in all the base classifiers and is propagated to the decision fusion stage even when the base
classifier errors are uncorrelated. Clustering comes in at this point. Clustering is the process of
partitioning a dataset into multiple groups where each group contains data points that are very close
in Euclidean space; the clusters therefore have well-defined and easy-to-learn boundaries. Assume that
the patterns are labeled with their cluster number [8]. If the base classifiers are trained on this modified
dataset, they will learn the cluster boundaries, and because these boundaries are well defined and easy
to learn, the base classifiers can learn them with high accuracy. Clusters can still contain overlapping
features from multiple classes. The cluster-oriented ensemble classifier performs a great role in
classification, but the selection process for the clusters is not well defined; this shortcoming is removed
here by ant colony optimization, a well-known multi-objective meta-heuristic used for feature
optimization.
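For illustration, the following minimal Python sketch (using scikit-learn) relabels each training pattern with its k-means cluster number so that a base classifier learns the cluster boundaries instead of the overlapping class borders; the dataset, the learner, and the value of k are assumptions for demonstration, not the exact configuration evaluated later in this paper.

```python
# Minimal sketch: a base classifier trained on cluster labels learns the
# well-defined cluster boundaries instead of overlapping class borders.
# Dataset, learner, and k are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Partition the feature space into k groups with easy-to-learn boundaries.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# Relabel every pattern with its cluster number and train a base classifier
# on the modified data set; it now learns the cluster boundaries.
base = DecisionTreeClassifier(random_state=0).fit(X, km.labels_)

# Each cluster may still contain points from several classes, so a later
# fusion step must map cluster decisions to class decisions.
print("cluster-boundary training accuracy:", base.score(X, km.labels_))
```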
The rest of this paper is organized as follows. Section 2 describes previous work; Section 3 presents
the proposed methodology for the cluster ensemble technique; Section 4 presents the experimental
results; and Section 5 concludes the paper.
II. RELATED WORK
Medical diagnosis is a challenging paradigm in that complex relationships can exist among the
diagnostic features that map to a resultant diagnosis of the disease condition. In some cases the state
of the disease condition itself can be marked by stages in which the diagnostic symptoms or signs are
subtle or differ from other stages of the disease. This means that there is often not a clean mapping
between the diagnostic features and the diagnosis. Combining clusterings is a more challenging task
than combining supervised classifications: in the absence of labeled training data, we face a difficult
correspondence problem between cluster labels in different partitions of an ensemble. Recent studies
have demonstrated that a consensus clustering can be found using graph-based, statistical, or
information-theoretic methods without explicitly solving the label correspondence problem, and other
empirical consensus functions have also been considered. However, the problem of consensus
clustering is known to be NP-complete. Clustering has provided a widely used mechanism for
organizing data into similar groupings, and its usage has also been extended to classifiers and detection
systems in order to improve detection and provide greater classification accuracy. Stability of a
clustering algorithm with respect to small perturbations of the data and to different initializations is a
desirable quality of the algorithm. Cluster ensembles, on the other hand, enforce and exploit some
instability so that the ensemble comprises diverse clusterings; although built upon unstable
components, the ensemble is expected to be more accurate and robust than the individual clustering
method. Stability-based validity indices are not bound to the clustering method used for partitioning
the perturbed data and, more importantly, they make no implicit assumption about the clusters' shape
and size. This makes stability-based indices well suited to cluster ensembles, since the main claim of
cluster ensembles is exactly that the obtained clusters can be of any shape and size. The problem here
is that the assumption that stability corresponds to high accuracy may not always hold.
In recent years, various consensus functions have been defined and coupled with heuristic algorithms
for maximizing them. Heuristic clustering ensemble algorithms are commonly based on three main
approaches: instance-based, cluster-based, and hybrid clustering ensembles. Instance-based methods
require a notion of distance that possibly employs information available from the ensemble to group
the data objects. Cluster-based methods follow the principle of "clustering clusters": a clustering
algorithm is applied to the set of all the clusters produced by the clustering solutions in the ensemble
in order to compute a set of meta-clusters, and finally the consensus partition is computed by assigning
each data object to the meta-cluster that maximizes some assignment criterion (e.g., majority voting).
Hybrid methods attempt to combine ideas from both instance-based and cluster-based approaches. A
common limitation of all of the above is that most consensus functions proposed in the literature treat
the various clustering solutions in the ensemble equally. This is clearly a weak assumption for a
number of reasons; for instance, an ensemble may be comprised of very different clusterings, and
clusters that are somehow correlated with each other may appear in distinct clusterings of the
ensemble. As a consequence, treating the constituent solutions of an ensemble equally and averaging
over them to extract the consensus partition may not be effective.
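As a concrete illustration of the instance-based family, the sketch below builds a co-association matrix from several base partitions and clusters it to obtain a consensus partition; this is a generic construction under assumed parameters, not a specific method from the works surveyed above.

```python
# Sketch of an instance-based consensus function: build a co-association
# matrix from m base partitions and cluster it to obtain the consensus.
# Parameters are illustrative; requires scikit-learn >= 1.2 for metric=.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans, AgglomerativeClustering

X, _ = load_breast_cancer(return_X_y=True)

# Generation step: m partitions from k-means with different initializations.
m, k = 10, 3
partitions = [KMeans(n_clusters=k, n_init=1, random_state=s).fit_predict(X)
              for s in range(m)]

# Co-association: fraction of partitions in which two objects co-cluster.
n = X.shape[0]
co = np.zeros((n, n))
for part in partitions:
    co += (part[:, None] == part[None, :])
co /= m

# Consensus step: cluster the objects using 1 - co as a distance matrix.
consensus = AgglomerativeClustering(
    n_clusters=k, metric="precomputed", linkage="average"
).fit_predict(1.0 - co)
print("consensus cluster sizes:", np.bincount(consensus))
```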
2.1 Previous Work Done
There has been a lot of work in the field of clustering and ensemble classification, but problems remain
in dealing with huge amounts of data. Huge data volumes and drifting concepts are not unknown to
the data mining community; one of the goals of traditional data mining algorithms is to mine models
from large databases.
Brijesh Verma and Ashfaqur Rahman, in "Cluster-Oriented Ensemble Classifier: Impact of
Multi-cluster Characterization on Ensemble Classifier Learning", present a novel cluster-oriented
ensemble classifier [1]. It is based on the idea that cluster boundaries are learned by the base classifiers
and cluster confidences are mapped to class decisions with the help of a fusion classifier. In this
approach an ensemble classifier is constructed using a set of base classifiers which learn the class
boundaries separately over the patterns. Learning class boundaries between overlapping classes is
difficult, and the problem remains inherent in all the base classifiers, so the concept of clustering
comes into play. Clustering is the process of separating an item set into multiple groups. It is assumed
that if the patterns are labeled with their cluster number and the base classifiers are trained on the
modified dataset, the base classifiers will learn the cluster boundaries. To obtain better ensemble
accuracy, the patterns are partitioned into multiple clusters and the cluster decisions produced by the
base classifiers are combined into a class decision by a fusion classifier.
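A hedged sketch of this two-stage idea follows (base classifiers learn cluster boundaries, a fusion classifier maps cluster confidences to class decisions); the specific learners and parameters below are illustrative assumptions, not the exact models of [1].

```python
# Two-stage sketch: a base classifier outputs cluster confidences and a
# fusion classifier maps them to class decisions. Learner choices are
# illustrative, not the exact models of [1].
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

# Stage 1: label training patterns with their cluster number and train a
# base classifier to produce per-cluster confidences.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(Xtr)
base = KNeighborsClassifier(n_neighbors=7).fit(Xtr, km.labels_)

# Stage 2: the fusion classifier maps cluster confidences to class decisions.
fusion = LogisticRegression(max_iter=1000).fit(base.predict_proba(Xtr), ytr)

print("ensemble test accuracy:",
      fusion.score(base.predict_proba(Xte), yte))
```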
Anne-Laure Bianne-Bernard, Farès Menasri, Rami Al-Hajj Mohamad, Chafic Mokbel, Christopher
Kermorvant, and Laurence Likforman-Sulem present a study of building an efficient word recognition
system resulting from the combination of three handwriting recognizers [12]. The main component of
this combined system is an HMM-based recognizer which considers dynamic and contextual
information for a better modeling of writing units. Popular applications of handwriting recognition
include bank check processing, reading of mailed envelopes, and handwritten text recognition in
documents and videos, for which different systems have been successfully developed. The scope of
recognition can be extended to reading letters sent to companies or public offices, since there is a
demand to sort, search, and automatically answer mail based on document content. However, to read
most words of a handwritten letter, recognition systems must cope with the inherent handwriting
variability and deal with a large number of word and character models. Stochastic modeling such as
Hidden Markov Models (HMMs) has proven effective for modeling handwriting; HMMs are effective
for modeling unconstrained words since they can cope with nonlinear distortions. HMM systems can
be divided into two types depending on their strategy: holistic or analytical. The aim of that study is
to build an efficient word recognition system from the combination of three HMM-based recognizers.
The first system is a context-independent HMM-based system which includes dynamic information.
The second recognizer is an original sliding-window system which aims at enhancing accuracy and
reducing decoding time. The third introduces a novel approach to building efficient context-dependent
word models within the HMM framework; the key features of this approach are the use of dynamic
features and state-based clustering.
Leo Breiman introduced bagging predictors [3] as a method for generating multiple versions of a
predictor and using these to obtain an aggregated predictor. The aggregation averages over the versions
when predicting a numerical outcome and takes a plurality vote when predicting a class. The multiple
versions are formed by making bootstrap replicates of the learning set and using these as new learning
sets. Tests on real and simulated datasets using classification and regression trees and subset selection
in linear regression show that bagging can give substantial gains in accuracy.
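Bagging is easy to state in code. The following sketch assumes decision trees as the base predictor and a plurality vote for classification:

```python
# Bagging sketch: bootstrap replicates of the learning set, one tree per
# replicate, plurality vote for the class. Parameters are illustrative.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n = len(Xtr)
models = []
for _ in range(25):
    idx = rng.integers(0, n, size=n)            # one bootstrap replicate
    models.append(DecisionTreeClassifier().fit(Xtr[idx], ytr[idx]))

# Plurality vote over the predictor versions.
votes = np.stack([mdl.predict(Xte) for mdl in models])
y_hat = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print("bagged accuracy:", (y_hat == yte).mean())
```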
Nandita Tripathi, Stefan Wermter, Chihli Hung, and Michael Oakes present a technique using the
maximum significance value to detect a semantic subspace [7]. Subspace detection and processing is
receiving more attention nowadays as a method to speed up search and reduce processing overload.
Subspace learning algorithms try to detect low-dimensional subspaces in the data which minimize the
intra-class separation while maximizing the inter-class separation. Subspace learning methods are
therefore increasingly being researched and applied to web document classification, image
recognition, and data clustering. This work is an effort to explore semantic subspace learning with the
overall objective of improving document retrieval in a vast document space.
III. PROPOSED METHODOLOGY FOR CLUSTER ENSEMBLE TECHNIQUE
Optimal cluster selection is an important part of a cluster-oriented classifier. The cluster-oriented
ensemble classifier (COEC) is an efficient process for the classification of a dataset, but it produces
only somewhat better results in comparison with plain clustering methods when loss of data is taken
into consideration. As the scale of the dataset increases, the complexity of classification also increases,
and it becomes difficult to select the optimal number of clusters for an individual classifier of the
dataset. We introduce a new feature subset selection method for finding the similarity matrix for
clustering without alteration of the cluster-oriented classifier. The proposed feature subset selection
method is based on ant colony optimization, a very well-known meta-heuristic for finding similarity
in data. In this method we introduce a continuity of ants: similar features are followed along the trail,
and dissimilar features are collected into the next node; in this process the ACO finds an optimal
selection of the feature subset. Suppose the ants find similar features along a continuous route; every
ant compares the property value of its feature against the initial feature set. When deciding which
cluster to select, two factors should be considered: the importance degree and the easiness degree of
the cluster. While walking, ants deposit pheromone on the ground according to the importance of the
outlier and follow, with some probability, pheromone previously laid by other ants and the easiness
degree of the noise. In the proposed work we discuss the COEC model, ant colony optimization, and
the binary support vector machine classifier.
3.1 Proposed Approach For Cluster Ensemble Classifier
The proposed clustering ensemble method is based on ant colony optimization, where the consensus
clustering is obtained by selecting an optimal cluster criterion function using an ant colony algorithm.
The method uses a metric between clusterings based on the agreement between partitions, and it uses
a class-level method to solve the label correspondence problem. The search capabilities of ant colony
algorithms make it possible to explore partitions that are not easy to find by other methods. A drawback
of these algorithms, however, is that a solution is better only in comparison to another; such an
algorithm has no concept of an optimal solution and no way to test whether a solution is optimal. The
selection function combines partitions obtained by locally adaptive clustering (AECC) algorithms.
When an AECC algorithm is applied to a set of objects X, it outputs a partition P = {C1, C2, ..., Cq},
which can also be identified by two sets {c1, ..., cq} and {w1, ..., wq}, where ci and wi are the centroid
and the confidence associated with cluster Ci, respectively. The AECC algorithms are designed to
work with numerical data, i.e., the method assumes that the object representation in the dataset is made
up of numerical features: X = {x1, ..., xn}, with xj ∈ R, j = 1, ..., n; also ci ∈ R and wi ∈ R,
i = 1, ..., k. The set of partitions P = {P1, P2, ..., Pm} is generated by applying the AECC algorithms
m times with different parameter initializations. The ant colony algorithm uses the ensemble to choose
the pheromone update parameters (increment and decrement of the pheromone deposited over an
interval) that control the sensitivity value. For a given cluster ensemble, the problem of optimal cluster
selection can be stated as follows: given the set F = {f1, ..., fn} of cluster indexes of the data points,
find an optimal subset S ⊂ F consisting of m < n cluster indexes such that the classification accuracy
is maximized. The cluster index selection representation exploited by the artificial ants includes the
following:
1. The data points that constitute the cluster index set F = {f1, ..., fn}.
2. The distribution of the cluster index space over na ants.
3. τi, the intensity of the pheromone trail associated with cluster index fi, which reflects prior
knowledge about the importance of fi.
4. For each ant j, a list Rj that contains the selected optimal cluster subset.
An AECC evaluation measure estimates the overall performance of a subset as well as the local
importance of a cluster: a classification algorithm is used to estimate the performance of the optimal
cluster selection, while the local importance of a given cluster is measured using a correlation-based
evaluation function, which is a filter evaluation function. In the first iteration, each ant randomly
chooses a cluster index of a classifier. Only the best subsets, k < na, are used to update the pheromone
trail and influence the optimal subsets of the next move. In the second and following moves, each ant
starts with m - p clusters randomly chosen from the previously selected k best subsets, where p is a
number between 1 and m - 1. In this way, the clusters that constitute the best subsets have a higher
chance of being present in the subsets of the next iteration, while it still remains possible for each ant
to consider other cluster indexes. For a given ant j, those cluster indexes are the ones that achieve the
best compromise between pheromone trail and local importance with respect to Sj, where Sj is the
subset of clusters that have already been selected by ant j. The Updated Selection Measure (USM)
used for this purpose is defined as

USMj(fi) = [τi]^α [LIj(fi)]^β / Σ_{fg ∉ Sj} [τg]^α [LIj(fg)]^β,  fi ∉ Sj,   (1)

where LIj(fi) is the local importance of cluster index fi given the subset Sj, and the parameters α and
β control the effect of the pheromone trail intensity and the local cluster importance, respectively.
LIj(fi) is measured using the correlation measure and is defined as

LIj(fi) = CiR / ( (1/|Sj|) Σ_{fs ∈ Sj} Cis ),   (2)

where CiR is the absolute value of the correlation between cluster index fi and the response (class)
variable, and Cis is the absolute value of the inter-correlation between cluster index fi and a cluster
index fs that belongs to Sj.
Below are the steps of the algorithm:
1. Initialization:
• Set τi = cc and Δτi = 0, where cc is a constant and Δτi is the amount of change in the pheromone
trail quantity for cluster index fi.
• Assign the maximum number of moves.
• Assign k, where the k best subsets will influence the subsets of the next iteration.
• Assign p, where m - p is the number of cluster indexes each ant starts with in the second and
following moves.
2. If this is the first iteration:
• For j = 1 to na, randomly assign a subset of classifiers to Sj.
• Go to step 4.
3. Select the remaining p cluster indexes for each ant: for mm = m - p + 1 to m and for j = 1 to na,
given subset Sj, choose the cluster index fi that maximizes USMj(fi). Merge any duplicated subsets
with randomly chosen indexes.
4. Evaluate the selected indexes of each ant using the chosen classification algorithm: for j = 1 to na,
estimate the error Ej of the classification results obtained by classifying with the optimal clusters of
Sj. Sort the subsets according to their error, update the minimum error (if achieved by any ant in this
iteration), and store the corresponding subset of clusters.
5. Using the subsets of the k best ants, update the pheromone trail intensity for j = 1 to k:

τi ← (1 - ρ) τi + Δτi,

where ρ is a constant that represents the evaporation of pheromone trails.
6. If the number of moves is less than the maximum number of moves, or the desired error E has not
been achieved, initialize the subsets for the next iteration and go to step 3: for j = 1 to na, from the
selected clusters of the k best ants, randomly produce an (m - p)-cluster subset for ant j, to be used in
the next iteration, and store it in Sj.
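The following Python sketch follows the steps above in simplified form. The correlation-based local importance, the pheromone update rule, and all parameter values are illustrative assumptions rather than the exact implementation evaluated in Section 4, and a plain KNN cross-validation error stands in for the AECC evaluation measure:

```python
# Simplified ACO subset selection following the steps above. The local
# importance (Eq. 2), pheromone update, and all parameter values are
# illustrative assumptions; KNN cross-validation error stands in for the
# AECC evaluation measure.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
n_feat = X.shape[1]
n_ants, m, k_best, p = 10, 8, 3, 4           # assumed parameters
alpha, beta, rho, cc = 1.0, 1.0, 0.2, 1.0
rng = np.random.default_rng(0)

# C_iR: |correlation| of each index with the class; C_is: inter-correlations.
CiR = np.abs(np.array([np.corrcoef(X[:, i], y)[0, 1] for i in range(n_feat)]))
Cis = np.abs(np.corrcoef(X, rowvar=False))

tau = np.full(n_feat, cc)                    # pheromone trail intensities

def local_importance(i, subset):
    """Eq. (2): class correlation over mean inter-correlation with subset."""
    if not subset:
        return CiR[i]
    return CiR[i] / (Cis[i, sorted(subset)].mean() + 1e-9)

def subset_error(subset):
    knn = KNeighborsClassifier(n_neighbors=5)
    return 1.0 - cross_val_score(knn, X[:, sorted(subset)], y, cv=3).mean()

best_subset, best_err = set(), np.inf
subsets = [set(rng.choice(n_feat, size=m, replace=False).tolist())
           for _ in range(n_ants)]
for move in range(15):
    ranked = sorted((subset_error(s), sorted(s)) for s in subsets)
    if ranked[0][0] < best_err:
        best_err, best_subset = ranked[0][0], set(ranked[0][1])
    tau *= (1.0 - rho)                       # evaporation
    for err, s in ranked[:k_best]:           # deposit from the k best ants
        for i in s:
            tau[i] += 1.0 - err
    # Each ant restarts with m-p indexes drawn from the k best subsets,
    # then greedily adds the remaining p indexes by the USM of Eq. (1).
    pool = sorted({i for _, s in ranked[:k_best] for i in s})
    subsets = []
    for _ in range(n_ants):
        take = min(m - p, len(pool))
        s = set(rng.choice(pool, size=take, replace=False).tolist())
        while len(s) < m:
            cand = [i for i in range(n_feat) if i not in s]
            usm = np.array([tau[i] ** alpha * local_importance(i, s) ** beta
                            for i in cand])
            s.add(cand[int(np.argmax(usm))])
        subsets.append(s)

print("best subset:", sorted(best_subset), "cv error: %.3f" % best_err)
```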
Figure 1 shows the proposed model of optimal cluster selection and classification.
IV. EXPERIMENTAL RESULTS
The proposed method was implemented in MATLAB 7.8.0 and tested with well-known datasets from
the UCI machine learning repository. In this work I measured the classification accuracy, mean
absolute error, and execution time of the classification method. To evaluate these performance
parameters I used six datasets from the UCI machine learning repository [6]: Wisconsin breast cancer
(original), thyroid, heart disease, hepatitis, liver, and diabetes. Of these six datasets, two (the cancer
and heart datasets) are small and the remaining four are large.
4.1 Description of Datasets
Six standard medical diagnosis datasets from the UCI Machine Learning Repository [5] were used for
this experimental evaluation: the Wisconsin breast cancer dataset, the Cleveland heart disease dataset,
the Pima Indian diabetes dataset, the hepatitis dataset, the liver dataset, and the thyroid dataset. Of the
699 cases in the breast cancer dataset, 16 records containing missing attributes were omitted, leaving
683 for use. In all cases, the dataset was randomly split into a training set comprising approximately
70% of the data and an evaluation set consisting of the remaining 30%.
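As an illustration of this preparation protocol, the sketch below loads the raw UCI file (the file location and column layout are assumptions about the repository's copy of the data), drops the 16 records with missing attributes, and performs the 70/30 split:

```python
# Sketch of the data preparation: drop records with missing attributes,
# then a random 70/30 train/evaluation split. The UCI file location and
# column layout are assumptions about the repository's copy of the data.
import pandas as pd
from sklearn.model_selection import train_test_split

url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "breast-cancer-wisconsin/breast-cancer-wisconsin.data")
df = pd.read_csv(url, header=None, na_values="?")

df = df.dropna()               # 699 cases -> 683 after removing 16 records
X = df.iloc[:, 1:-1].values    # drop the sample-code column and the label
y = df.iloc[:, -1].values      # class: 2 = benign, 4 = malignant

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.30, random_state=0)
print(Xtr.shape, Xte.shape)    # approximately 70% / 30%
```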
4.2 Performance Parameters
4.2.1 Classification Accuracy
I used the six medical datasets described above for testing the accuracy of the proposed method; the
classification accuracy on all six datasets is increased significantly by the proposed method.

Accuracy rate = (total no. of correctly classified instances / total no. of instances) × 100 … (a)
4.2.2 Mean Absolute Error Rate
In classification, the mean absolute error (MAE) measures how close the predictions are to the
eventual outcomes. It is given by

MAE = (1/n) Σ_{i=1..n} |p_i - y_i| … (b)

where p_i is the prediction and y_i the true value; as the name suggests, the mean absolute error is an
average of the absolute errors |p_i - y_i|. Note that alternative formulations may include relative
frequencies as weight factors.
4.2.3 Relative Mean Error Rate
The relative error is given by

Relative error = (absolute error / accepted value) × 100 … (c)
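The three measures (a)-(c) can be computed directly, as in the following sketch on made-up prediction and truth vectors:

```python
# Direct computation of measures (a)-(c) on made-up example vectors.
import numpy as np

y_true = np.array([2, 4, 2, 2, 4, 4, 2, 4])   # illustrative true labels
y_pred = np.array([2, 4, 2, 4, 4, 2, 2, 4])   # illustrative predictions

# (a) Accuracy rate = correctly classified instances / total instances * 100
accuracy = (y_pred == y_true).mean() * 100

# (b) Mean absolute error: average of |p_i - y_i|
mae = np.abs(y_pred - y_true).mean()

# (c) Relative error = absolute error / accepted value * 100
relative = np.abs(y_pred - y_true).sum() / np.abs(y_true).sum() * 100

print(f"accuracy = {accuracy:.2f}%  MAE = {mae:.3f}  relative = {relative:.2f}%")
```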
4.3 Experimental Process of Ensemble Clustering and Classification (KNN)
In this section we describe the result analysis of ensemble clustering and classification. KNN is used
as the base classifier and k-means as the clustering technique; ant colony optimization is used for the
selection of the optimal clusters. The number of clusters is 5.
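A hedged sketch of this configuration follows (k-means with 5 clusters, one KNN base classifier per cluster, and the distance-weighted combination described in the abstract); all parameter values are illustrative assumptions:

```python
# Sketch: k-means with 5 clusters, one KNN base classifier per cluster,
# and distance-weighted combination of the per-cluster class probabilities.
# Dataset and parameters are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(Xtr)
experts = []
for c in range(5):
    mask = km.labels_ == c
    n_nb = int(min(5, mask.sum()))            # guard very small clusters
    experts.append(KNeighborsClassifier(n_neighbors=n_nb).fit(Xtr[mask], ytr[mask]))

# Weight each expert by inverse distance from the test sample to its centre.
d = np.linalg.norm(Xte[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
w = 1.0 / (d + 1e-9)
w /= w.sum(axis=1, keepdims=True)

proba = np.zeros((len(Xte), 2))
for c, expert in enumerate(experts):
    part = np.zeros((len(Xte), 2))
    part[:, expert.classes_] = expert.predict_proba(Xte)  # align class columns
    proba += w[:, [c]] * part
print("weighted-ensemble accuracy:", (proba.argmax(axis=1) == yte).mean())
```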
Figure 4.1 shows the classification result map for a fixed number of clusters (5) with KNN as the base
classifier. Figure 4.2 shows the classification result map for 5 clusters and the KNN base classifier
with optimal selection of the number of classifiers.
Table 4.1 shows the experimental result analysis for the different datasets (for each dataset, the first
row gives the fixed-cluster result and the second row the result with optimal cluster selection).
Dataset     Method            MAE     MRE     Accuracy   Execution time
cancer      fixed clusters    5.45    12.39   82.85      12
            optimal clusters  3.725   11.76   82.85      13.58
diabetes    fixed clusters    6.00    14.56   87.95      13.99
            optimal clusters  5.0     16.93   87.95      13.51
thyroid     fixed clusters    4.5     12.6    92.95      13.79
            optimal clusters  3.5     14.81   92.95      14.29
liver       fixed clusters    4.0     17.9    93.95      14.24
            optimal clusters  3.0     13.81   93.95      14.05
hepatitis   fixed clusters    3.89    11.6    96.35      14.71
            optimal clusters  2.89    12.18   96.35      14.74
heart       fixed clusters    1.56    10.34   97.45      14.21
            optimal clusters  2.01    9.78    97.67      13.99
Figure 4.3 shows the comparative result analysis of the fixed-cluster technique and the optimal cluster
selection for classification; the result graph shows that optimal cluster selection improves on a fixed
number of clusters. Figure 4.4 shows the same comparison for the error measures; optimal cluster
selection reduces the values of MAE and MRE.
4.4 Experimental Process of Ensemble Clustering and Classification (SVM)
In this section we describe the result analysis of ensemble clustering and classification with SVM as
the base classifier and k-means as the clustering technique; ant colony optimization is used for the
selection of the optimal clusters. The number of clusters is 6.
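Swapping the base classifier for an SVM is a small change to the previous sketch; the variant below (again with illustrative parameters) fits one RBF-SVM per cluster, guarding against clusters that contain a single class:

```python
# SVM variant: 6 k-means clusters, one RBF-SVM per cluster
# (probability=True enables predict_proba for the weighted fusion).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

km = KMeans(n_clusters=6, n_init=10, random_state=0).fit(Xtr)
experts = {}
for c in range(6):
    mask = km.labels_ == c
    if np.unique(ytr[mask]).size > 1:         # SVC needs at least two classes
        experts[c] = SVC(kernel="rbf", probability=True).fit(Xtr[mask], ytr[mask])
print({c: round(e.score(Xte, yte), 3) for c, e in experts.items()})
```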
Figure 4.5 shows the classification map result for a fixed number of clusters (6) with the support vector
machine classifier. The ACO process improves the selection of the optimal number of clusters over a
fixed number of clusters.
Table 4.2 shows the experimental result analysis for the different datasets (first row of each pair: fixed
number of clusters; second row: optimal cluster selection).
Dataset     Method            MAE     MRE     Accuracy   Execution time
cancer      fixed clusters    4.55    10.39   86.85      10
            optimal clusters  3.725   9.76    90.85      12.45
diabetes    fixed clusters    5.23    8.56    87.95      13.99
            optimal clusters  4.67    7.93    88.95      13.51
thyroid     fixed clusters    4.5     5.6     94.95      13.79
            optimal clusters  3.5     4.81    95.95      14.29
liver       fixed clusters    4.0     7.94    94.95      14.24
            optimal clusters  3.0     6.81    96.95      14.05
hepatitis   fixed clusters    3.89    7.6     97.35      14.71
            optimal clusters  2.89    5.18    96.35      14.74
heart       fixed clusters    1.56    4.34    97.45      14.21
            optimal clusters  2.01    5.78    98.67      13.99
Figure 4.6 shows the ensemble result analysis for optimal cluster selection with the support vector
machine classifier; the results with optimal selection improve on those with a fixed number of
classifiers. Figure 4.7 shows the comparative MAE and MRE analysis for all six datasets.
4.5 Comparative Result Analysis as the Number of Base Classifiers Is Changed in the Ensemble
Table 4.3 Performance evaluation of the cancer dataset for all classification methods.

Number of          Accuracy (KNN / SVM / SVM-ANT)   Error (KNN / SVM / SVM-ANT)
base classifiers
3                  81.54 / 86.64 / 95.64            2.85 / 4.00 / 2.50
4                  81.54 / 86.64 / 95.64            2.61 / 4.00 / 2.50
5                  82.02 / 86.74 / 96.54            3.12 / 4.05 / 2.61
Table 4.4 Performance evaluation of the heart dataset for all classification methods.

Number of          Accuracy (KNN / SVM / SVM-ANT)   Error (KNN / SVM / SVM-ANT)
base classifiers
3                  80.90 / 86.00 / 95.00            2.75 / 4.00 / 2.50
4                  80.90 / 86.00 / 95.00            2.50 / 4.00 / 2.50
5                  80.90 / 86.00 / 95.00            2.37 / 3.00 / 1.50
Table 4.5 Performance evaluation of the liver dataset for all classification methods.

Number of          Accuracy (KNN / SVM / SVM-ANT)   Error (KNN / SVM / SVM-ANT)
base classifiers
3                  82.12 / 87.32 / 96.98            2.94 / 4.02 / 2.57
4                  83.22 / 88.33 / 97.32            2.70 / 4.56 / 2.85
5                  84.22 / 87.43 / 97.98            2.56 / 4.87 / 1.50
Table 4.6 Performance evaluation of the diabetes dataset for all classification methods.

Number of          Accuracy (KNN / SVM / SVM-ANT)   Error (KNN / SVM / SVM-ANT)
base classifiers
3                  82.58 / 88.59 / 96.95            3.72 / 5.00 / 3.50
4                  82.65 / 89.56 / 97.05            3.15 / 4.00 / 2.50
5                  80.85 / 87.95 / 98.59            2.86 / 4.00 / 2.50
Table 4.7 Performance evaluation of the hepatitis dataset for all classification methods.

Number of          Accuracy (KNN / SVM / SVM-ANT)   Error (KNN / SVM / SVM-ANT)
base classifiers
3                  82.67 / 87.73 / 96.86            3.68 / 5.00 / 3.50
4                  81.70 / 88.86 / 97.66            3.12 / 4.00 / 2.50
5                  82.00 / 88.68 / 97.99            3.68 / 5.00 / 3.50
Table 4.8 Performance evaluation of the thyroid dataset for all classification methods.

Number of          Accuracy (KNN / SVM / SVM-ANT)   Error (KNN / SVM / SVM-ANT)
base classifiers
3                  87.08 / 92.18 / 98.18            4.56 / 6.00 / 4.50
4                  87.08 / 92.18 / 99.68            3.92 / 5.00 / 3.50
5                  87.08 / 92.18 / 99.07            3.53 / 5.00 / 3.50
4.6 Performance Evaluation of All Six Medical Datasets as the Number of Base Classifiers Is Changed
Figure 4.8 shows the comparative accuracy and error as the number of base classifiers is changed
during ensemble cluster classification; among all base classifiers, the support vector machine performs
best. This result shows the cancer data classification for the ensemble process.
Figure 4.9 shows the same comparison for the heart dataset.
Figure 4.10 shows the same comparison for the diabetes dataset.
Figure 4.11 shows the same comparison for the hepatitis dataset.
Figure 4.12 shows the same comparison for the liver dataset.
Figure 4.13 shows the same comparison for the thyroid dataset.
V. CONCLUSIONS
This research paper proposed optimal feature selection of clusters for an ensemble classifier. The
proposed ensemble classifier classifies medical disease data such as cancer, liver problems, heart
disease, and so on, and it classifies data better in comparison with the conventional cluster ensemble
technique. For the selection of the optimal number of clusters, an ant colony optimization algorithm
is used; ant colony optimization is a multi-objective meta-heuristic that gives the optimal number of
clusters for the ensembling process.
For classification purposes two base classifiers, KNN and SVM, are used. The task of clustering is
performed by a fixed clustering technique, the k-means algorithm.
Optimal selection of the cluster ensemble is an improved method for increasing the robustness,
stability, and accuracy of unsupervised classification solutions. So far, numerous works have
contributed to finding consensus clusterings.
We compared the classification accuracy of the conventional and the proposed ensemble classification
algorithms on the same datasets, investigating benchmark datasets from the UCI repository, medical
disease data, and real-world datasets. The average and the maximum classification accuracy computed
by the conventional and proposed ensembles on the same datasets were compared, and the comparison
shows that the average and maximum classification accuracy achieved by the proposed ensembles is
better on more datasets than that computed by the conventional ensembles. The improvement rates of
the average and maximum classification accuracy of the proposed ensembles over the conventional
ones were also presented; they show that the average classification accuracy of the proposed ensembles
is usually better than that of the conventional ones. Therefore, using ant colony optimization in
classification ensembles can improve the overall classification accuracy.
REFERENCES
1. Brijesh Verma and Ashfaqur Rahman, "Cluster-Oriented Ensemble Classifier: Impact of Multi-cluster Characterization on Ensemble Classifier Learning", IEEE Transactions on Knowledge and Data Engineering, 2012.
2. Nayer M. Wanas, Rozita A. Dara and Mohamed S. Kamel, "Adaptive fusion and co-operative training for classifier ensembles", Pattern Analysis and Machine Intelligence Lab, University of Waterloo, 2006.
3. Giorgio Fumera, Fabio Roli and Alessandra Serrau, "A Theoretical Analysis of Bagging as a Linear Combination of Classifiers", IEEE Transactions, 2008.
4. Albert Hung-Ren Ko and Robert Sabourin, "The Implication of Data Diversity for a Classifier-free Ensemble Selection in Random Subspaces", IEEE Transactions, 2008.
5. Juan J. Rodríguez and Jesús Maudes, "Boosting recombined weak classifiers", ScienceDirect, 2007.
6. Oriol Pujol and David Masip, "Geometry-Based Ensembles: Toward a Structural Characterization of the Classification Boundary", IEEE Transactions, 2009.
7. Nandita Tripathi, Stefan Wermter, Chihli Hung and Michael Oakes, "Semantic Subspace Learning with Conditional Significance Vectors", IEEE Transactions, 2010.
8. Valerio Grossi and Alessandro Sperduti, "Kernel-Based Selective Ensemble Learning for Streams of Trees", Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, 2010.
9. Seyda Ertekin, Jian Huang, Léon Bottou and C. Lee Giles, "Learning on the Border", 2007.
10. Piyasak Jeatrakul, Kok Wai Wong and Chun Che Fung, "Data Cleaning for Classification Using Misclassification Analysis", 2010.
11. Amal S. Ghanem, Svetha Venkatesh and Geoff West, "Multi-Class Pattern Classification in Imbalanced Data", International Conference on Pattern Recognition, 2010.
12. Guoqiang Peter Zhang, "Neural Networks for Classification: A Survey", IEEE Transactions, 2000.
13. UCI Machine Learning Database, http://archive.ics.uci.edu/ml/, Feb. 2010.