International Journal of Computer Engineering and Applications, Volume XI, Issue IX, September 17, www.ijcea.com ISSN 2321-3469
A COMPARATIVE STUDY OF CLASSIFICATION VIA
CLUSTERING WITH K-MEANS AND J48 ALGORITHMS
Dhyan Chandra Yadav
E-mail: [email protected]
Department of Computer Application
ABSTRACT:
This paper proposes a classification via clustering approach to predict consumer disputes over bank loan claims on the basis of bureau data. The proposed classification via clustering approach can obtain similar accuracy to traditional classification algorithms. Experiments were carried out using real data from a bureau report in which each case is marked as disputed, YES or NO. We mainly analyze and compare the meta-classifier algorithm “classification via clustering” with the J48 classifier and K-Means.
Keywords: Classification via clustering, prediction, J48 algorithm, K-Means, Bureau report, Weka.
[1] INTRODUCTION
One of the biggest problems with credit cards is that it is easy to forget to make a payment. This
can be especially true if you are going through a major change in life. Your credit will not be
damaged severely if you realize that you missed a payment before your next due date. Your
credit card company may even waive the late fee if you have not been habitually late. You can
also prevent damage to your credit score by making up the payment before it is 30 days past
due. If you have a calendar on your computer or smartphone, put your due date on it as a
recurring event to help make sure you do not miss any payments in the future. When you write
a letter, however, it ensures your rights will be
protected. Credit reporting companies must investigate your dispute, forward all documents to
the furnisher, and report the results back to you unless they determine your claim is frivolous. If
the consumer reporting company or furnisher determines that your dispute is frivolous, it can
choose not to investigate the dispute so long as it sends you a notice within five days saying
that it has made such a determination. If the furnisher corrects your information after your
dispute, it must notify all of the credit reporting companies it sent the inaccurate information to,
so they can update their reports with the correct information [1].
J. Han and M. Kamber describe classification techniques in data mining. Classification predicts
categorical class labels: it builds a model from a training set with known class labels and then
uses that model to classify newly available data. The term covers any context in which a
decision or forecast is made on the basis of currently available information, and a classification
procedure is a recognized method for repeatedly making such decisions in new situations. Here
we assume that the problem concerns the construction of a procedure that will be applied to a
continuing sequence of cases, in which each new case must be assigned to one of a set of
predefined classes on the basis of its observed features. Creating a classification procedure from
a set of data for which the exact classes are known in advance is termed pattern recognition or
supervised learning. Contexts in which a classification task is fundamental include, for example,
assigning individuals to a credit status on the basis of financial and other personal information,
and the initial diagnosis of a patient’s disease in order to select immediate treatment while
awaiting definitive test results. Some of the most critical problems arising in science, industry
and commerce can be cast as classification or decision problems.
J. Han and M. Kamber also describe clustering. It can be considered the most important
unsupervised learning problem; like every other problem of this kind, it deals with finding
structure in a collection of unlabeled data. A loose definition of clustering is “the process of
organizing objects into groups whose members are similar in some way”. A cluster is therefore
a collection of objects that are “similar” to one another and “dissimilar” to the objects
belonging to other clusters [2].
We can show this with a simple graphical example as:
Fig.1. Visualized Instances by Classification via Clustering algorithm
J. R Quinlan and H. Ian describe J48, an open source Java implementation of the C4.5 decision
tree algorithm. C4.5 is an algorithm, developed by Ross Quinlan, that generates a decision tree;
it is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can
be used for classification, and for this reason C4.5 is often referred to as a statistical classifier.
The authors of the Weka machine learning software described the C4.5 algorithm as "a landmark
decision tree program that is probably the machine learning workhorse most widely used in
practice to date" [3].
Fig.2. Visualized Instances by J48 Classifier algorithm
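As a rough illustration of how C4.5 chooses a split, the sketch below computes the gain ratio of a nominal attribute, the criterion C4.5 uses to rank candidate splits. It is a minimal sketch, not Weka's J48 code: it ignores numeric attributes, missing values and pruning, and the function names and toy records are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of nominal values."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain_ratio(values, labels):
    """Gain ratio obtained by splitting `labels` on the nominal attribute `values`."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Information gain: entropy before the split minus weighted entropy after it.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information penalizes attributes with many distinct values.
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0

# Hypothetical toy records (not taken from the paper's dataset).
timely = ["Yes", "Yes", "No", "No", "Yes", "No"]
dispute = ["NO", "NO", "YES", "YES", "NO", "YES"]
ratio = gain_ratio(timely, dispute)  # the attribute separates the classes perfectly
```

In a full tree builder the attribute with the highest gain ratio would be chosen at each node and the procedure repeated on each branch.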
J Han and M Kamber also describe K-means, a partitional clustering method that is widely
used in industry. K-means is the most commonly used partitional clustering algorithm because
it is easy to implement and is among the most efficient in terms of execution time [4].
Fig.3. Visualized Instances by K-Means Clustering algorithm
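The K-means procedure can be sketched in a few lines. The version below is a minimal illustration of the standard assign-and-update loop, not Weka's clusterer: it assumes numeric attributes, squared Euclidean distance and, for reproducibility, seeds the centroids with the first k points. The function names and toy points are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100):
    """Return (centroids, assignment, sum of squared errors, iterations used)."""
    centroids = [tuple(p) for p in points[:k]]  # assumption: seed with the first k points
    assignment = None
    for iteration in range(1, max_iter + 1):
        # Assignment step: attach each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the algorithm has converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    sse = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))
    return centroids, assignment, sse, iteration

# Hypothetical two-dimensional toy data with two obvious groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.0, 2.0), (9.0, 8.0), (2.0, 1.0), (8.0, 9.0)]
centroids, assignment, sse, iterations = kmeans(points, k=2)
```

The returned sum of squared errors and iteration count correspond to the kind of quantities a clustering tool reports for K-means.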
[2] RELATED WORKS:
H, K observed that credit card fraud occurs on such a large scale that the related attributes
cannot easily be detected and predicted. A data mining classifier tool that uses neural network
algorithms can help prevent fraudsters from misusing credit cards: the system predicts the
probability of fraud on an account by comparing the current transactions with the previous
activity of each holder [5].
D C Yadav and S Pal reported that the J48, ID3 and Naïve Bayes data mining algorithms
give very accurate results in software error detection. Correctly classified instances are
reported both as a count and as a percentage, while the kappa statistic, mean absolute error
and root mean square error are reported as numeric values only. ID3 and J48 each took 0.2
seconds to build the model, and the test mode was 10-fold cross-validation. Weka compared
the classifiers on the given instances by accuracy and prediction rate: J48 classified 100% of
the instances correctly without error, Naïve Bayes also classified 100% correctly but with some
error, and ID3 classified 95% correctly, so the authors concluded that J48 is the most accurate
of the three algorithms [6].
D C Yadav and R Kumar reported that association algorithms give very accurate results
when finding frequent patterns and relationships between data objects, and that the Apriori,
Predictive Apriori and Filtered Associator algorithms yield the confidence and support
percentages of data objects. These algorithms can therefore be used in other domains to bring
out interesting relationships among the underlying data [7].
D C Yadav and R Kumar compared the performance of three major clustering algorithms:
K-Means, hierarchical clustering and density-based clustering. Using a clustering tool, the
authors found that K-Means performs better than hierarchical clustering and the Make Density
Based clustering algorithm, although all of the algorithms show some ambiguity when noisy
data are clustered [8].
R Sukanya and K Prabha discussed the use of back-propagation neural networks and
support vector machines for rainfall prediction. An ANN improves the efficiency of rainfall
prediction by analyzing historical and current facts to make accurate predictions about the
future [9].
R S, S M, N E, S P and V Kirand discussed segregating fraudulent warranty claims from
a huge volume of warranty data using pattern recognition and clustering. A survey of the
automotive industry shows that up to 10% of warranty costs are related to warranty claims
fraud, costing manufacturers several billion dollars. The existing methods to detect warranty
fraud are very complex and expensive because they deal with inaccurate and vague data,
causing manufacturers to bear excessive costs [10].
D C Yadav noted that in the statistical analysis of binary classification, the F1 score is a
measure of a test's accuracy. It considers both the precision and the recall of the test to
compute the score. In this analysis the author computed the F1 score for several data mining
classifier algorithms and chose the ID3 Tree as the best classifier to apply to the selected
datasets, because the ID3 Tree has the highest F1 score and takes less time to build a
model [11].
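For reference, the F1 score is the harmonic mean of precision and recall. The short sketch below computes it from confusion-matrix counts; the function name and the counts are hypothetical and are not taken from [11].

```python
def f1_score(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall) for the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0  # convention when there are no true positives
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 40 true positives, 10 false positives, 20 false negatives.
score = f1_score(tp=40, fp=10, fn=20)  # precision 0.8, recall 2/3
```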
D C Yadav noted that the Matthews correlation coefficient (MCC) is used in machine
learning as a measure of the quality of binary (two-class) classifiers. It takes into account true
and false positives and negatives and is generally regarded as a balanced measure that can
be used even if the classes are of very different sizes. The MCC is in essence a correlation
coefficient between the observed and predicted binary classifications. The author computed it
for several data mining classifier algorithms and found the ID3 Tree to be the best classifier
to apply to the selected datasets, because the ID3 Tree has the highest MCC value and takes
the least time, 0.00 seconds, to build a model [12].
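The MCC can be computed directly from the four cells of a binary confusion matrix. The sketch below uses hypothetical counts (not those of [12]) and returns 0 when a row or column of the matrix is empty, which is a common convention rather than part of the definition.

```python
import math

def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn))."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention when any marginal total is zero
    return (tp * tn - fp * fn) / denominator

# Hypothetical counts for illustration only.
mcc = matthews_corrcoef(tp=40, tn=30, fp=10, fn=20)
```

A value of +1 indicates perfect prediction, 0 no better than chance, and -1 total disagreement.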
D C Yadav noted that the informedness of a prediction method, as captured by a
contingency matrix, is defined as the probability that the prediction method will make a
correct decision as opposed to guessing, and is calculated using the bookmaker algorithm.
The author generated this measure with the LAD Tree, ID3 and J48 data mining algorithms
and found ID3 to be the best classifier to apply to the selected datasets [13].
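In the binary case, informedness reduces to sensitivity plus specificity minus one. The sketch below covers only that two-class case, not the general multi-class bookmaker calculation, and the counts are hypothetical rather than taken from [13].

```python
def informedness(tp, tn, fp, fn):
    """Binary informedness: true positive rate + true negative rate - 1."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity + specificity - 1.0

# Hypothetical counts for illustration only.
value = informedness(tp=40, tn=30, fp=10, fn=20)  # 2/3 + 3/4 - 1
```

A value of 0 corresponds to guessing and a value of 1 to a decision that is always correct.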
D C Yadav noted that FDR-controlling procedures provide less stringent control of Type I
errors than familywise error rate controlling procedures. In this analysis the author chose
the ID3 Tree as the best data mining classifier algorithm to apply to the selected datasets,
because the ID3 Tree takes the minimum time to build a model [14].
D C Yadav analyzed overall performance on the basis of dependent variables. The
classifiers predict categorical class labels: they build a model from the training set and the
values of the class label attribute, and use the model to classify new data. The author
compared AD Tree, LAD Tree, J48 and Naïve Bayes on correctly and incorrectly classified
instances together with the kappa statistic, and chose the LAD Tree as the best data mining
classifier algorithm to apply to the selected datasets, because the LAD Tree has the highest
rate of correctly classified instances, 83.333%, and the minimum number of unclassified
instances, 0.00. The LAD Tree also has the highest value, 0.3576, of the accuracy metric [15].
T, R and Liu presented a framework for fraud detection based on security systems and
case-based reasoning. First, a set of normal and fraud cases is built
from labeled data. Then, the primary detectors are made with random or genetic algorithms.
Then, negative selection and clonal selection operations are applied on primary detectors in
order to obtain a set of detectors with different algorithms that can detect a variety of frauds
[16].
M, S, B and Saira noted that many of the fraud detection systems presented so far use
data mining and neural network approaches, while no system combining anomaly detection,
misuse detection and decision making had yet been used for fraud detection in credit cards.
They then proposed a system that uses a Hidden Markov Model to detect fraudulent
transactions [17].
A John analyzed a hybrid feature selection and anomaly detection algorithm for detecting
fraud in credit cards. The author notes that fraud detection on the internet must be done
online and immediately. Because card holders use their credit cards in a fixed pattern, this
pattern can be extracted from 1 or 2 years of a holder's usual legal activity. The pattern is
then compared with the card holder's current usage, and if the two are not similar the
activity is considered illegal. Neural networks were used to learn pattern detection in the
model in this study [18].
A P, M K and A N discuss that data mining, as one of the most efficient tools of data
analysis, has attracted the attention of many people. Its techniques and algorithms are used
in various fields such as customer relationship management, fraud management and detection,
medical sciences and sport. Owing to the large volume of data in banks, data mining has had
many applications in financial and monetary affairs. Credit risk management, fraud detection,
money laundering detection, customer relationship management and banking service quality
management are some examples of data mining applications in banks [19].
In this work, we propose to use a meta-classifier that uses a clusterer for classification,
based on the assumption that each cluster corresponds to a class. First, the consumer claim
usage and interaction data have to be collected and preprocessed. Then an optional attribute
selection step can be applied, in order either to select only a group of attributes/variables or
to use all those available. Second, we analyze and compare the meta-classifier algorithm
“classification via clustering” with the J48 classifier and K-Means.
[3] METHODOLOGY:
Our research approach is to use Classification via Clustering, J48 and K-Means on the
consumer claim data set. The research methodology is divided into 5 steps to achieve the
desired results:
Step 1: Prepare the data and specify the source of the data.
Step 2: Select the specific data and transform it into a different format with Weka.
Step 3: Implement the data mining algorithms and check all the relevant disputes.
Step 4: Decide whether a dispute is present in the data. If a dispute is present, proceed
further; otherwise stop. We classify the relevant disputes using Classification via Clustering,
J48 and K-Means.
Step 5: At the end, the results are displayed and evaluated.
We now study consumer complaints about different types of bank loans and the related
disputes.
Table.1. Representation of Computational Variables of Consumer Claim
Property                       Description
Source                         https://catalog.data.gov/dataset. Consumer Financial Protection Bureau (CFPB), a U.S. Government Agency.
Complaint Type                 Consumer Complaint Database attributes for financial problems (bank, lender, company, etc.)
Sample Size                    500 in total: 100 consumer dispute and 400 non-dispute
Dependent Variables
Dispute(YES)                   Consumer complaint disputed
Dispute(NO)                    Consumer complaint not disputed
Field name                     Description
Date received                  The date the CFPB received the complaint.
Tags                           Data that supports easier searching and sorting of complaints submitted by or on behalf of consumers.
Date sent to company           The date the CFPB sent the complaint to the company.
Company response to consumer   This is how the company responded. For example, “Closed with explanation.”
Timely response?               Whether the company gave a timely response. For example, “Yes” or “No.”
1. Data Preparation:
One of the biggest problems with credit cards is that it is easy to forget to make a payment. This
can be especially true if you are going through a major change in life. Your credit will not be
damaged severely if you realize that you missed a payment before your next due date. Your
credit card company may even waive the late fee if you have not been habitually late. You can
also prevent damage to your credit score by making up the payment before it is 30 days past
due. If you have a calendar on your computer or smartphone, put your due date on it as a
recurring event to help make sure you do not miss any payments in the future. When you write
a letter, however, it ensures your rights will be protected.
The Consumer Complaint Database is a collection of complaints on a range of consumer
financial products and services, sent to companies for response. We don’t verify all the facts
alleged in these complaints, but we take steps to confirm a commercial relationship between the
consumer and the company.
2. Data Selection and transform:
The database generally updates daily, and contains certain information for each complaint,
including the source of the complaint, the date of submission, and the company the complaint
was sent to for response. The database also includes information about the actions taken by the
company in response to the complaint, such as, whether the company’s response was timely
and how the company responded. Companies also have the option to select a public response.
Company level information should be considered in context of company size. Data from those
complaints helps us understand the financial marketplace and protect consumers.
Fig.4. Representation of Instances by Classification via Clustering algorithm
Fig.5. Representation of Instances by J48 algorithm
Fig.6. Representation of Instances by K-Means algorithm
After data selection we transform the data with the Weka data mining tool. Each of these
three algorithms reports its own details, such as accuracy by class, time to build a model and
a stratified cross-validation summary. The confusion matrix describes the correctly and
incorrectly classified instances with respect to each class.
3. Implementation:
In this step, we choose the J48 classification algorithm for its accuracy on the dataset and
investigate the classifier's accuracy further. Weka is a data mining tool. It is a simple tool for
classifying data of various types, and it provides a graphical user interface to the user. To
perform the clustering we used the PROMISE data repository, which provides past project data
for analysis. With the help of figures we show the working of the various algorithms used in
Weka. Weka is a suitable tool for data mining applications. Clustering is a main task of
exploratory data mining and a common technique for statistical data analysis used in many
fields, including machine learning. We use the Weka data mining tool for this purpose because
it provides a better interface to the user than the other data mining tools.
4. Result and Discussion:
All our experiments were performed using Weka and the previously described consumer claim
dataset. To test the accuracy of the obtained classification models we used 10-fold
cross-validation. All classifiers in Weka work in the same way under cross-validation: the
model is built using just the instances in the training folds. The classification via clustering
approach is based on the "clusters to classes" evaluation routine in the cluster evaluation code,
which finds a minimum-error mapping of clusters to classes.
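The idea of a minimum-error mapping of clusters to classes can be illustrated with a small brute-force search. This is a sketch under stated assumptions, not Weka's routine: it assumes there are no more clusters than classes, tries every one-to-one assignment of clusters to classes, and keeps the one with the fewest misclassified instances. The function name and toy data are hypothetical.

```python
from itertools import permutations

def min_error_mapping(cluster_ids, class_labels):
    """Return (mapping from cluster to class, number of errors) with the fewest errors."""
    clusters = sorted(set(cluster_ids))
    classes = sorted(set(class_labels))
    best_mapping, best_errors = None, len(class_labels) + 1
    # Assumption: len(clusters) <= len(classes), so each cluster gets its own class.
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        errors = sum(mapping[c] != y for c, y in zip(cluster_ids, class_labels))
        if errors < best_errors:
            best_mapping, best_errors = mapping, errors
    return best_mapping, best_errors

# Hypothetical cluster assignments and true dispute labels for eight complaints.
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 0]
labels = ["NO", "NO", "NO", "YES", "YES", "NO", "YES", "YES"]
mapping, errors = min_error_mapping(cluster_ids, labels)
accuracy = 1 - errors / len(labels)
```

Once the mapping is fixed, a new instance is classified by assigning it to its nearest cluster and reading off that cluster's class.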
Table.2. Representation Compute Instances for Classifier Algorithms
Algorithm                      Accuracy (%)  Kappa  RMSE  RAE (%)  RRSE (%)  Time (s)  No. of Iterations  Sum of Square Error
Classification via Clustering  65.4          0.02   0.23  102.63   143.72    0.00      3                  1714
J48                            79.2          0.00   0.33  99.34    99.99     0.05      -                  -
K-Means                        -             -      -     -        -         0.00      3                  1610
In the first experiment, we executed the clustering algorithms provided by Weka for
classification via clustering, using 500 instances of consumer claims with the available
attributes. Second, we analyzed and compared the meta-classifier algorithm “classification
via clustering” with the J48 classifier and K-Means.
From Table 2 above it is clear that:
For classification, J48 gives a more accurate result (79.2% against 65.4%) and lower
relative errors (RAE 99.34 against 102.63, RRSE 99.99 against 143.72) than classification via
clustering.
For clustering, K-Means gives a lower sum of squared errors (1610 against 1714) than
classification via clustering.
Finally, we find that for classification and clustering respectively, J48 and K-Means perform
better than the classification via clustering algorithm.
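When reading accuracy alongside kappa, it may help to recall how the class balance of the sample (100 disputed and 400 non-disputed complaints) affects both. The sketch below evaluates a hypothetical baseline that always predicts NO; it is an illustration only and does not reconstruct the confusion matrices behind Table 2, which are not given here.

```python
def accuracy_and_kappa(actual, predicted):
    """Observed accuracy and Cohen's kappa for two equal-length label lists."""
    n = len(actual)
    labels = set(actual) | set(predicted)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    # Agreement expected by chance, from the marginal label frequencies.
    expected = sum((actual.count(l) / n) * (predicted.count(l) / n) for l in labels)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 0.0
    return observed, kappa

# Class balance taken from Table 1; the always-NO predictor is hypothetical.
actual = ["YES"] * 100 + ["NO"] * 400
predicted = ["NO"] * 500
acc, kappa = accuracy_and_kappa(actual, predicted)
```

Under these assumptions the baseline reaches 80% accuracy with a kappa of 0, so accuracy figures on this sample are best interpreted together with kappa.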
5. CONCLUSION
Yes, dispute participation in the bank loan claim bureau report was a good predictor
of the dispute. Another advantage of classification models based on mapping clusters to
classes is that they are very simple and interpretable to analysts. In the dispute YES case
here, analysts only have to examine the cluster centroids to identify the consumers active in
bank claims. We find that for classification and clustering respectively, J48 and K-Means
perform better than the classification via clustering algorithm.
REFERENCES
[1] https://catalog.data.gov/dataset.
[2] J. Han and M. Kamber, “Data Mining: Concepts and Techniques”, Elsevier, 2011.
[3] Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.
[4] J Han and M Kamber, “Data Mining: Concepts and Techniques”, Morgan Kaufmann
Publishers, second Edition, (2006).
[5] H, K. (Ed.) “Detecting Payment Card Fraud with Neural Networks”, Singapore: World
Scientific, 2000.
[6] D C Yadav and S Pal “An Integration of Clustering and Classification Technique in
Software Error Detection” African Journal of Computing & ICT, Vol. 8,No.2,June 2015 ,
ISSN 2006-1781.
[7] D C Yadav and R Kumar “Generate Identical Rules by the Different Algorithms for
Detection of Risk In Software Project Development” Shodh Prerak A Multidisciplinary
Quarterly International Refereed Journal, Volume V, Issue 2, April 2015.ISSN 2231-413X.
[8] D C Yadav and R Kumar “A Comparative Study of Software Bug by Clustering
Algorithms” Annals of Multi-Disciplinary Research A Quarterly Inter National Refereed
Research Journal, Volume V, Issue 5, Dec.2015.ISSN 2249-8893.
[9] R Sukanya and K Prabha “Comparative Analysis for Prediction of Rainfall using Data
Mining Techniques with Artificial Neural Network” Volume-5 , Issue-6 , Page no. 288-292,
Jun 30, 2017.
[10] R S, S M, N E, S P and S V Kirand “Modelling an Optimized Warranty Analysis
methodology for fleet industry using data mining clustering methodologies with Fraud
detection mechanism using pattern recognition on hybrid analytic approach” ,Fourth
International Conference on Recent Trends in Computer Science & Engineering. Chennai,
Tamil Nadu, India Procedia Computer Science ,Volume 87, 2016, Pages 322-327.
[11] D C Yadav “Analysis of Bug Accuracy through F1-Score by Data Mining Classifier
Algorithms”, Annals of Multi-Disciplinary Research A Quarterly International Refereed
Journal, Volume VII, Issue 2, June 2017.ISSN 2249-8893.
[12] D C Yadav “Analysis of Bug through Mathews Correlation Coefficient with Confusion
Matrix” ,Sodha Pravaha A Multidisciplinary Refereed Research Journal , Vol. VII, Issue
3, July 2017,ISSN 2231-4113.
[13] D C Yadav “To Create Correct Decision by Informedness of Bug Prediction”, Shodh
Prerak A Multidisciplinary Quarterly International Refereed Research Journal, Volume
VII, Issue 2, April 2017. ISSN 2231-413X.
[14] D C Yadav “Controlling the Procedures through False Discovery Rate Defect in Bug
Analysis”, Sodha Pravaha A Multidisciplinary Refereed Research Journal , Vol.VII, Issue
2, April 2017,ISSN 2231-4113.
[15] D C Yadav “Measurement of Gap between Software Bug Classes by Data Mining
Classifier Algorithms” ,Vaichariki A Multidisciplinary Refereed International Research
Journal , Vol.VII, Issue 2, June 2017,ISSN 2249-8907.
[16] Tue, Ren, Liu “Artificial Immune System for Fraud Detection” ,IEEE, vol. 2, pp. 1407-
1411, 2004.
[17] M, S, B and Saira “A Defense Mechanism For Credit Card Fraud Detection”,
International Journal on Cryptography and Information Security (IJCIS), 2012,pp. 89-100.
[18] A John “Data Mining Application for Cyber Credit-Card Fraud Detection System”,
Springer- Verlag Berlin Heidelberg, 2013, pp. 218–228.
[19] A P, M K and A N “Fraud detection in E-banking by using the hybrid feature selection
and evolutionary algorithms”, IJCSNS International Journal of Computer Science and
Network Security, VOL.17 No.8, August 2017.