Analysis to find the most effective features to predict breast cancer: A data mining approach
Fariba Tat1, Sahar Azadi Ghale Taki2, Sadjaad Ozgoli3, Mahdi Sojoodi5
Electrical and Computer Engineering
Tarbiat Modares University
Tehran, Iran
[email protected], [email protected]

Mohammad Esmaeil Akbari4
Chairman of Cancer Research Center
Prof. of Department of Surgery, Shahid Beheshti University of Medical Sciences
Tehran, Iran
Abstract— This paper presents a hybrid intelligence model that combines cluster analysis techniques with feature selection for analyzing clinical cancer diagnoses. Our model provides the option of selecting a subset of salient features for clustering, in contrast to most existing models, which use all of the features to perform clustering. In particular, we study methods that select salient features to identify clusters, using a comparison of coincident quantitative measurements. When applied to benchmark breast cancer datasets, experimental results indicate that our method outperforms several benchmark filter- and wrapper-based methods in selecting features used to discover natural clusters, maximizing the between-cluster scatter and minimizing the within-cluster scatter toward a satisfactory clustering quality. The experimental dataset is based on data gathered in a hospital in Tehran.
Keywords—cluster analysis; cancer modeling; feature selection; cancer diagnosis.
I. INTRODUCTION
Breast cancer is the second leading cause of death worldwide after cardiovascular diseases. Health professionals seek suitable treatment and quality of care for this group of patients, and survival prediction is important for both physicians and patients in choosing the best course of management. Today, diagnosing a disease is a vital task in medicine: a correct diagnosis must be reached with the help of clinical examination and investigations. Computer-based decision support systems can play an important role in accurate diagnosis and cost-effective treatment. Most hospitals hold huge amounts of patient data that are rarely used to support clinical diagnosis. Why can we not use these data in clinical diagnosis and patient management? Is it possible to build an area-specific prediction system for a particular disease using data mining techniques [1]?

Data mining is the computational process of discovering patterns in large data sets, involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems [2]. Essentially, data mining is concerned with processing data and identifying patterns and trends in information; in other words, it means collecting and processing data systematically using computer-based programs, and subsequently building disease prediction or patient management aids. Data mining principles have been known for many years, but with the advent of information technology they have become even more prevalent. Data mining is not only about the database software in use: it can be performed with comparatively modest database systems and simple tools, including ones you create and write yourself, or with off-the-shelf software packages. Complex data mining benefits from past experience and from algorithms defined in existing software and packages, with certain tools gaining a greater affinity or reputation for particular techniques [3]. The technique is routinely used in a large number of fields, including engineering, medicine, crime analysis, expert prediction, Web mining, and mobile computing [4].

Machine learning [5], [6] is concerned with the design and development of algorithms that allow computers to evolve behaviors learned from databases, automatically recognize complex patterns, and make intelligent decisions based on data. However, the massive volume of available data poses a major obstacle to discovering patterns. Feature selection attempts to select a subset of attributes, for instance based on information gain [7]. Classification assigns a given set of input data to one of several categories [8].
Data analysis procedures can be dichotomized as either exploratory or confirmatory, based on the availability of appropriate models for the data source, but a key element in both types of procedures (whether for hypothesis formation or decision-making) is the grouping, or classification, of measurements based on either (i) goodness-of-fit to a postulated model, or (ii) natural groupings (clustering) revealed through analysis. Cluster analysis is the organization of a collection of patterns (usually represented as a vector of measurements, or a point in a multidimensional space) into clusters based on similarity. Intuitively, patterns within a valid cluster are more similar to each other than they are to a pattern belonging to a different cluster. It is important to understand the difference between clustering (unsupervised classification) and discriminant analysis (supervised classification). In supervised classification, we are provided with a collection of labeled (pre-classified) patterns; the problem is to label a newly encountered, yet unlabeled, pattern. Typically, the given labeled (training) patterns are used to learn descriptions of classes, which in turn are used to label a new pattern. In the case of clustering, the problem is to group a given collection of unlabeled patterns into meaningful clusters. In a sense, labels are associated with clusters also, but these category labels are data driven; that is, they are obtained solely from the data. Clustering is useful in several exploratory pattern-analysis, grouping, decision-making, and machine-learning situations, including data mining, document retrieval, image segmentation, and pattern classification [9].

Scientific Cooperations International Workshops on Electrical and Computer Engineering Subfields, 22-23 August 2014, Koc University, ISTANBUL/TURKEY
In [10], the performance of supervised learning classifiers such as Naïve Bayes, SVM with RBF kernel, RBF neural networks, decision trees (J48), and simple CART is compared to find the best classifier on breast cancer datasets (WBC and Breast Tissue). The experimental results show that the SVM-RBF kernel is more accurate than the other classifiers, scoring 96.84% accuracy on WBC and 99.00% on Breast Tissue. In [11], the performance of C4.5, Naïve Bayes, Support Vector Machine (SVM), and K-Nearest Neighbor (K-NN) is compared to find the best classifier on WBC; SVM proves to be the most accurate, with 96.99% accuracy. In [12], the performance of a decision tree classifier (CART) with and without feature selection is compared on the Breast Cancer, WBC, and WDBC datasets. CART achieves 69.23% accuracy on the Breast Cancer dataset without feature selection, 94.84% on WBC, and 92.97% on WDBC. With feature selection (Principal Components Attribute Eval), it scores 70.63% on Breast Cancer, 96.99% on WBC, and 92.09% on WDBC; with feature selection (ChiSquared Attribute Eval), it scores 69.23% on Breast Cancer, 94.56% on WBC, and 92.61% on WDBC. In [13], the C4.5 decision tree method obtained 94.74% accuracy using 10-fold cross-validation on the WDBC dataset. In [14], a neural network classifier is used on the WPBC dataset, achieving 70.725% accuracy. In [15], a hybrid method is proposed that enhances classification accuracy on the WDBC dataset to 95.96% with 10-fold cross-validation. In [16], linear discriminant analysis obtained 96.8% accuracy on WDBC. In [17], 95.06% accuracy was obtained with neuro-fuzzy techniques on WDBC. In [18], 95.57% accuracy was obtained with a supervised fuzzy clustering technique on WDBC.
The major contributions of the current work are twofold. First, we have developed a K-means variant that can incorporate background knowledge in the form of instance-level constraints, thus demonstrating that this approach is not limited to a single clustering algorithm. In particular, we have presented our modifications to the K-means algorithm and have demonstrated its performance on six data sets.
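The paper does not reproduce the modified assignment step of its constrained K-means variant. The sketch below illustrates a COP-KMeans-style assignment in the spirit of [19]: each point is assigned to the nearest cluster that does not violate a must-link or cannot-link constraint. Function names and the nearest-first search are our own illustration, not the authors' exact algorithm.

```python
import random

def violates(point_idx, cluster_id, assignment, must_link, cannot_link):
    """Check whether assigning point_idx to cluster_id breaks any
    instance-level constraint (COP-KMeans style)."""
    for a, b in must_link:
        other = b if a == point_idx else (a if b == point_idx else None)
        # a must-link partner already placed in a different cluster is a violation
        if other is not None and assignment.get(other) not in (None, cluster_id):
            return True
    for a, b in cannot_link:
        other = b if a == point_idx else (a if b == point_idx else None)
        # a cannot-link partner in the same cluster is a violation
        if other is not None and assignment.get(other) == cluster_id:
            return True
    return False

def constrained_kmeans(points, k, must_link=(), cannot_link=(), iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = {}
    for _ in range(iters):
        assignment = {}
        for i, p in enumerate(points):
            # try clusters from nearest to farthest, skipping constraint violations
            order = sorted(range(k), key=lambda c: sum(
                (pi - ci) ** 2 for pi, ci in zip(p, centroids[c])))
            for c in order:
                if not violates(i, c, assignment, must_link, cannot_link):
                    assignment[i] = c
                    break
            else:
                raise ValueError("no consistent assignment for point %d" % i)
        # recompute each centroid as the mean of its assigned points
        for c in range(k):
            members = [points[i] for i, ci in assignment.items() if ci == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return assignment, centroids
```

A cannot-link constraint between points in visually overlapping regions is enough to pull them into separate clusters even when plain K-means would merge them.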
Second, we have used the best features for classification, allowing us to predict the time of cancer recurrence, which is very important for preventing death due to cancer.

In the next section, we provide some background on the clustering and classification algorithms used, namely K-means and SVM. Next, we describe our methods and present results in Sections III and IV. Finally, Section V summarizes our contributions [19].
II. CLUSTERING ALGORITHMS
To evaluate the effectiveness of nonlinear dimensionality reduction (NDR) for capturing cluster structures, two classical clustering algorithms, hierarchical clustering (HC) and K-means, are applied to the reduced feature spaces. These two algorithms are representative of two widely used clustering approaches. HC uses agglomerative and divisive strategies and divides the data into a sequence of nested partitions, where partitions at one level are joined as clusters at the next level; the number of clusters can be read off at a particular level as required. K-means, by contrast, is one of the most widely used center-based clustering algorithms, using a partitioning strategy to assign objects to a fixed number of clusters. The algorithm regards the data vectors as a point set in a high-dimensional space. Given the desired number of clusters, K-means randomly selects centroid points for each cluster and allocates each data point to the cluster whose centroid is nearest. After the necessary optimization steps, K-means can generate a good clustering solution [20], [21].
A. K-means
Clustering is an important and popular technique in data mining. It partitions a set of objects such that objects in the same cluster are more similar to one another than to objects in different clusters, according to certain predefined criteria. K-means is a simple yet efficient method for data clustering. However, K-means has a tendency to converge to local optima and depends on the initial values of the cluster centers. Many heuristic algorithms have been introduced to overcome this local-optima problem, but these algorithms also suffer from several shortcomings [22]. In [22], an efficient hybrid evolutionary data clustering algorithm, referred to as K-MCI, is presented, combining K-means with modified cohort intelligence. The algorithm was tested on several standard data sets from the UCI Machine Learning Repository, and its performance was compared with other well-known algorithms such as K-means, K-means++, cohort intelligence (CI), modified cohort intelligence (MCI), genetic algorithm (GA), simulated annealing (SA), tabu search (TS), ant colony optimization (ACO), honeybee mating optimization (HBMO), and particle swarm optimization (PSO). The simulation results are very promising in terms of solution quality and convergence speed.
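The sensitivity of K-means to its initial centers, described above, is commonly mitigated by running the algorithm from several random initializations and keeping the run with the lowest within-cluster scatter. The sketch below illustrates this simple restart strategy; it is not the K-MCI algorithm of [22], and all names are our own.

```python
import random

def kmeans(points, k, iters=30, seed=0):
    """Plain Lloyd's K-means on tuples; returns (centroids, within-cluster SSE)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to the nearest centroid (squared Euclidean)
            c = min(range(k), key=lambda j: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[c].append(p)
        # recompute centroids; keep the old one if a cluster empties out
        centroids = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl
                     else centroids[j] for j, cl in enumerate(clusters)]
    sse = sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
              for p in points)
    return centroids, sse

def kmeans_restarts(points, k, restarts=10):
    # keep the run with the lowest within-cluster scatter (SSE)
    return min((kmeans(points, k, seed=s) for s in range(restarts)),
               key=lambda r: r[1])
```

Minimizing the within-cluster SSE across restarts is exactly the "minimizing the within-cluster scatter" criterion mentioned in the abstract.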
B. SVM
based on extracted tumor features. Feature extraction and selection are critical to the quality of classifiers built through data mining methods. In [23], a hybrid of K-means and support vector machine (K-SVM) algorithms is developed to extract useful information and diagnose tumors. The K-means algorithm is used to recognize the hidden patterns of the benign and malignant tumors separately. The membership of each tumor in these patterns is calculated and treated as a new feature in the training model. A support vector machine (SVM) is then trained to obtain a new classifier for differentiating incoming tumors. Based on 10-fold cross-validation, the methodology improves the accuracy to 97.38% when tested on the Wisconsin Diagnostic Breast Cancer (WDBC) data set from the University of California, Irvine machine learning repository. Six abstract tumor features are extracted from the 32 original features for the training phase. The results not only illustrate the capability of the approach for breast cancer diagnosis, but also show time savings during the training phase. Physicians can also benefit from the mined abstract tumor features by better understanding the properties of different types of tumors [23].
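The distinctive step of the K-SVM pipeline above is turning cluster membership into new features. The toy sketch below uses one centroid per class, whereas [23] runs K-means within each class to find several hidden patterns; the function names and the inverse-distance membership weighting are our own illustration, and the resulting features would then be fed to any SVM implementation.

```python
import math

def centroids_by_class(X, y):
    """One centroid per class label (a simplification of the per-class
    K-means step in the K-SVM approach of [23])."""
    cents = {}
    for label in set(y):
        members = [x for x, yl in zip(X, y) if yl == label]
        cents[label] = tuple(sum(col) / len(members) for col in zip(*members))
    return cents

def membership_features(x, cents):
    """Soft membership of sample x in each class pattern: inverse-distance
    weights to the class centroids, normalized to sum to 1."""
    d = {lab: math.dist(x, c) + 1e-9 for lab, c in cents.items()}
    inv = {lab: 1.0 / dv for lab, dv in d.items()}
    total = sum(inv.values())
    return [inv[lab] / total for lab in sorted(inv)]
```

A sample sitting on a class centroid gets a membership near 1.0 for that class, which is what makes these derived features easy for the downstream SVM to separate.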
III. METHODS
In this paper, we investigate two data mining techniques, clustering and classification, and use them to predict the survivability rate on our breast cancer data set, seeking the technique most suitable for predicting the cancer survivability rate [24]. The 24 of the 76 available features used in our study are listed in Table I.
TABLE I. FEATURES

Parameter  Explanation                             Parameter  Explanation
F          Family history                          T          Size of the original tumor
MS         Marital status                          N          Lymph nodes
S          Smoking                                 N+         Cancer has been found in the lymph nodes
C          Childbirth                              STAGE      A number on a scale of 0 through IV
P          Pregnancy                               PATH       The type of pathology
TA         Type of abortion                        GRADE      Grading is a way of classifying cancer cells
NA         Number of abortions                     LVI        Lymphovascular invasion
B          Breastfeeding                           ER         Estrogen receptor
H          Hormones (estrogen and progesterone)    PR         Progesterone receptor
DH         Duration of hormone use                 HER2       Human epidermal growth factor receptor 2
CT         Computed tomography                     P53        Tumor protein p53
RT         Radiotherapy                            KI67       Antigen identified by monoclonal antibody
In order to convert the data gathered in a hospital in Tehran into numerical format, a coding scheme is used. The coding of variables describing general patient information is depicted in Table II, and the coding of cancer-related variables in Table III.
TABLE II. CODING OF VARIABLES DESCRIBING THE GENERAL PATIENT

Gender:     Female 0; Male 1
Education:  Collegiate 1; Diploma 2; School 3; Guidance 4; Illiterate 5
Type AB:    Criminal 1; Medical 2; C/M 3; Unknown 6; No 9
FH:         First degree 1; Second degree 2; Yes/unknown 3; Unknown 6; No 9
Smoking:    Yes 1; Yes/no 2; Unknown 6; No 9
Fat:        Yes 1; Yes/no 2; Unknown 6; No 9
Married:    Single 1; Married 2; Divorced 3; Widowed 4; Unknown 6
Menopause:  Hysterectomy 1; Natural 2; No 9
TABLE III. CODING OF CANCER VARIABLES

Surgery:  BCS 1; MRM 2; BCS/MRM 3; Unknown 6
Armpit:   AXLND/padding 1; AXLND/drainage 2; SLN/drainage 3; SLN 4; AXLND 5; Unknown 6; SLN/AXLND 7; Padding 8; No 9; Drainage 10; AXLND/SLN/padding 11; SLN/padding 12
CT:       Yes 1; Neo 2; Unknown 6; No 9
H.Name:   Tamoxifen 1; Letrozole 2; Aromasin 3; Tamox/letrozol 4; Tamox/aromysin 5; Unknown 6; Tamox/decapept 7; Herceptin 8; No 9
Path:     IDC 1; DCIS 2; IDC/DCIS 3; ILC 4; ILC/LCIS 5; Unknown 6; IDC/ILC 7; LCIS 8; ILC/DCIS 10
In addition, the coding depicted in Table IV is used to encode the result of the test on each feature.
TABLE IV. COMMON CODES

1  Yes / positive
9  No / negative
6  Unknown
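The common codes above map naturally onto a small lookup table. The sketch below shows one way such an encoder might look; the field names in the usage example ("ER", "LVI", "P53") are taken from Table I, but the function names and the fallback-to-unknown behavior are our own illustration, not the paper's actual preprocessing code.

```python
# Hypothetical encoder for the common codes of Table IV:
# 1 = yes/positive, 9 = no/negative, 6 = unknown.
COMMON_CODES = {"yes": 1, "positive": 1, "no": 9, "negative": 9, "unknown": 6}

def encode_record(record):
    """Map each free-text field value to its numeric code; anything
    unrecognized is treated as unknown (6)."""
    return {field: COMMON_CODES.get(str(value).strip().lower(), 6)
            for field, value in record.items()}
```

For example, `encode_record({"ER": "Positive", "LVI": "no", "P53": "?"})` codes the estrogen-receptor result as 1, LVI as 9, and the unreadable P53 entry as 6.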
In the next section the results of clustering and classification will be discussed.
IV. CLUSTERING AND CLASSIFICATION RESULTS
We applied clustering with three clusters to the dataset, selecting a subset of features. The clustering shows that some features, such as P53, are especially effective for predicting breast cancer. In this study, the models were evaluated based on the accuracy measures discussed above (classification accuracy, sensitivity, and specificity). The results were achieved using 74 features for each model and are based on the average results obtained from the test dataset. Of the 1621 patients, 983 were chosen as training data and clustered into three clusters; the results are given in Table V. The classification algorithm uses LVI as the label and predicts cancer more precisely; the resulting accuracy numbers are also listed in Table V. Simulation results were obtained using RapidMiner.
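The three evaluation measures used here are standard; for concreteness, the following sketch computes them from binary label vectors (our own helper functions, not part of the RapidMiner workflow used in the experiments).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def metrics(y_true, y_pred):
    """Classification accuracy, sensitivity (recall on positives),
    and specificity (recall on negatives)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return accuracy, sensitivity, specificity
```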
TABLE V. RESULTS

Clustering (label: recurrence)
  Training data:  91.5%
  Test data:      89.7%

LVI is shown to be the best feature for predicting the time of recurrence.

Classification (label: LVI)
  Training data:  97%
  Test data:      91%
V. CONCLUSIONS AND FUTURE DIRECTIONS
Feature selection is one of the most effective methods for enhancing data representation and improving performance with respect to specified criteria, e.g., generalization classification accuracy. In the literature, many studies select a subset of salient features using supervised rather than unsupervised learning. When class labels are absent during training, feature selection must be performed in an unsupervised manner, yet this setting is rarely studied. The objective of this study is to select salient features that can be used to identify interesting clusters in the analysis of cancer diagnosis. Specifically, we highlight three qualitative principles that help users analyze clinical cancer diagnoses using clusters resulting from a subset of salient features. First, the clusters built from a subset of salient features are more practical and interpretable than those built from all of the features, which include noise. Second, the clustering results provide clinicians with an understanding of the context of clinical cancer diagnoses. Finally, a search for relevant records based on the clusters obtained when noisy features are ignored is more efficient. These three principles rely on the discovery of natural clusters using salient features and are applicable only to unsupervised learning. To demonstrate the usefulness of these three qualitative principles, we use coincident quantitative measurements to analyze the salient features for discovering clusters. The experiments on the cancer (Diagnostic) and cancer (Original) datasets demonstrate that the approach is effective in selecting salient features to discover natural clusters. Based on a performance evaluation using well-known validation measures from statistical modeling and cluster analysis, our analysis provides an interesting perspective on feature selection for discovering clusters.
REFERENCES
[1] M. C. Tayade and M. P. M. Karandikar, "Role of data mining techniques in healthcare sector in India," Sch. J. Appl. Med. Sci. (SJAMS), vol. 1, no. 3, pp. 158–160, 2013.
[2] S. Chakrabarti, M. Ester, U. Fayyad, J. Gehrke, J. Han, S. Morishita, G. Piatetsky-Shapiro, and W. Wang, "Data mining curriculum: A proposal (version 0.91)," 2004.
[3] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.
[4] M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms. John Wiley & Sons, 2011.
[5] T. M. Mitchell, "Machine learning and data mining," Commun. ACM, vol. 42, no. 11, pp. 30–36, 1999.
[6] S. B. Kotsiantis, "Supervised machine learning: A review of classification techniques," Informatica, vol. 31, no. 3, 2007.
[7] S. G. Jacob and R. G. Ramani, "Efficient classifier for classification of prognostic breast cancer data through data mining techniques," in Proc. World Congress on Engineering and Computer Science, 2012, vol. 1.
[8] X. Wu and V. Kumar, The Top Ten Algorithms in Data Mining. CRC Press, 2009.
[9] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: A review," ACM Comput. Surv., vol. 31, no. 3, pp. 264–323, 1999.
[10] S. Aruna, D. S. Rajagopalan, and L. V. Nandakishore, "Knowledge based analysis of various statistical tools in detecting breast cancer," Comput. Sci. Inf. Technol. (CS & IT), vol. 2, pp. 37–45, 2011.
[11] A. Christobel and Y. Sivaprakasam, "An empirical comparison of data mining classification methods," Int. J. Comput. Inf. Syst., vol. 3, 2011.
[12] D. Lavanya and K. U. Rani, "Analysis of feature selection with classification: Breast cancer datasets," Indian J. Comput. Sci. Eng. (IJCSE), 2011.
[13] E. Osuna, R. Freund, and F. Girosi, "Training support vector machines: An application to face detection," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 1997, pp. 130–136.
[14] V. N. Chunekar and H. P. Ambulgekar, "Approach of neural network to diagnose breast cancer on three different data sets," in Proc. Int. Conf. Advances in Recent Technologies in Communication and Computing (ARTCom '09), 2009, pp. 893–895.
[15] D. Lavanya and K. Usha Rani, "Ensemble decision tree classifier for breast cancer data," Int. J. Inf. Technol. Converg. Serv., vol. 2, no. 1, 2012.
[16] B. Šter and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods," 1996.
[17] T. Joachims, "Transductive inference for text classification using support vector machines," in Proc. ICML, 1999, vol. 99, pp. 200–209.
[18] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers," Pattern Recognit. Lett., vol. 24, no. 14, pp. 2195–2207, 2003.
[19] K. Wagstaff, C. Cardie, S. Rogers, and S. Schrödl, "Constrained k-means clustering with background knowledge," in Proc. ICML, 2001, vol. 1, pp. 577–584.
[20] J. Shi and Z. Luo, "Nonlinear dimensionality reduction of gene expression data for visualization and clustering analysis of cancer tissue samples," Comput. Biol. Med., vol. 40, no. 8, pp. 723–732, Aug. 2010.
[21] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, no. 5500, pp. 2319–2323, 2000.
[22] G. Krishnasamy, A. J. Kulkarni, and R. Paramesran, "A hybrid approach for data clustering based on modified cohort intelligence and K-means," Expert Syst. Appl., vol. 41, no. 13, pp. 6009–6016, Oct. 2014.
[23] B. Zheng, S. W. Yoon, and S. S. Lam, "Breast cancer diagnosis based on feature extraction using a hybrid of K-means and support vector machine algorithms," Expert Syst. Appl., vol. 41, no. 4, part 1, pp. 1476–1482, Mar. 2014.
[24] V. Chaurasia and S. Pal, "Data mining techniques: To predict and resolve breast cancer survivability," IJCSMC, 2014.