International Journal of Computer Engineering and Applications, Volume XI, Issue IX, September 17, www.ijcea.com ISSN 2321-3469

Dhyan Chandra Yadav 1

A COMPARATIVE STUDY OF CLASSIFICATION VIA CLUSTERING WITH K-MEANS AND J48 ALGORITHMS

Dhyan Chandra Yadav

E-mail: [email protected]

Department of Computer Application

ABSTRACT:

This paper proposes a classification via clustering approach to predict consumer claims on bank loans on the basis of bureau data. The proposed classification via clustering approach can obtain similar accuracy to traditional classification algorithms. Experiments were carried out using real data from a bureau report in which each case is labeled as disputed YES or NO. In this paper we mainly analyze and compare the meta-classifier algorithm “classification via clustering” with the J48 classifier and K-Means.

Keywords: Classification via clustering, prediction, J48 algorithms, K-Means, Bureau report, Weka.

[1] INTRODUCTION

One of the biggest problems with credit cards is that it is easy to forget to make a payment. This can be especially true if you are going through a major change in life. Your credit will not be damaged severely if you realize that you missed the payment before your next due date. Your credit card company may even waive the late fee if you have not been habitually late. You can also prevent any damage to your credit score by making up the payment before it is 30 days past due. If you have a calendar on your computer or smartphone, you should put your due date on it as a recurring event to help make sure you do not miss any payments in the future. When you write a letter, however, it ensures that your rights will be protected. Credit reporting companies must investigate your dispute, forward all documents to the furnisher, and report the results back to you unless they determine your claim is frivolous. If the consumer reporting company or furnisher determines that your dispute is frivolous, it can choose not to investigate the dispute so long as it sends you a notice within five days saying that it has made such a determination. If the furnisher corrects your information after your dispute, it must notify all of the credit reporting companies it sent the inaccurate information to, so they can update their reports with the correct information [1].

J. Han and M. Kamber describe classification techniques in data mining. Classification can be used to predict categorical class labels: it builds a model from a training set and its class labels, and the model can then be used to classify newly available data. The term could cover any context in which some decision or forecast is made on the basis of presently available information. A classification procedure is a recognized method for repeatedly making such decisions in new situations. Here we assume that the problem concerns the construction of a procedure that will be applied to a continuing sequence of cases, in which each new case must be assigned to one of a set of predefined classes on the basis of observed features of the data. The creation of a classification procedure from a set of data for which the exact classes are known in advance is termed pattern recognition or supervised learning. Contexts in which a classification task is fundamental include, for example, assigning individuals to credit status on the basis of financial and other personal information, and the initial diagnosis of a patient’s disease in order to select immediate treatment while awaiting perfect test results. Some of the most critical problems arising in science, industry and commerce can be regarded as classification or decision problems.

J. Han and M. Kamber also describe clustering. It can be considered the most important unsupervised learning problem; like every other problem of this kind, it deals with finding structure in a collection of unlabeled data. A loose definition of clustering could be “the process of organizing objects into groups whose members are similar in some way”. A cluster is therefore a collection of objects which are “similar” to one another and “dissimilar” to the objects belonging to other clusters [2].

We can show this with a simple graphical example as:

Fig.1. Visualized Instances by Classification via Clustering algorithm



J. R. Quinlan and H. Ian describe J48 as an open-source Java implementation of the C4.5 decision tree algorithm. C4.5 is an algorithm, developed by Ross Quinlan, that is used to generate a decision tree. It is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason C4.5 is often referred to as a statistical classifier. The authors of the Weka machine learning software described the C4.5 algorithm as "a landmark decision tree program that is probably the machine learning workhorse most widely used in practice to date" [3].
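As an illustration of how a C4.5-style tree ranks candidate splits, the following minimal sketch computes the gain ratio (information gain divided by split information) of a nominal attribute. It is not the J48 source code; the function names and the toy attribute values are hypothetical and chosen only for this example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of nominal values."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(values, labels):
    """Gain ratio obtained by splitting `labels` on the nominal attribute `values`."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Weighted entropy of the class labels after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information penalizes attributes with many distinct values.
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: a "timely response" attribute against the dispute label.
timely = ["Yes", "Yes", "No", "No"]
dispute = ["NO", "NO", "YES", "YES"]
print(gain_ratio(timely, dispute))
```

An attribute that takes a single value for every instance has zero split information, so the sketch returns 0.0 for it rather than dividing by zero.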

Fig.2. Visualized Instances by J48 Classifier algorithm

J. Han and M. Kamber describe K-means, a partitional clustering method that is widely used in industry. The K-means algorithm is the most commonly used partitional clustering algorithm because it can be easily implemented and is the most efficient one in terms of execution time [4].
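The ease of implementation mentioned above can be seen in a minimal sketch of the standard K-means iteration (assign each point to its nearest centroid, then move each centroid to the mean of its members). This is not Weka's implementation: the initial centroids are passed in by the caller, and the function names and toy points are assumptions made for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centroids, max_iter=100):
    """Return (centroids, assignments, sum of squared errors, update iterations)."""
    centroids = [tuple(c) for c in initial_centroids]
    assignments = []
    iterations = 0
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_assignments = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the algorithm has converged
        assignments = new_assignments
        iterations += 1
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                centroids[k] = tuple(sum(col) / len(members) for col in zip(*members))
    sse = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignments))
    return centroids, assignments, sse, iterations


# Hypothetical toy data with two well-separated groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, assignments, sse, iterations = kmeans(points, [(0, 0), (10, 10)])
print(centroids, assignments, sse, iterations)
```

The returned sum of squared errors is the within-cluster quantity that clustering tools commonly report; lower values indicate tighter clusters for the same number of clusters.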


Fig.3. Visualized Instances by K-Means Clustering algorithm

[2] RELATED WORKS:

H, K observed that credit card fraud occurs on a very large scale, so the related attributes cannot easily be detected and predicted. A data mining classifier based on neural network algorithms is therefore used to prevent fraudsters from misusing credit cards. This system predicts the probability of fraud on an account by comparing the current transactions with the previous activity of each holder [5].

D C Yadav and S Pal discussed that classifier algorithms provide very accurate results in software error detection using the J48, ID3 and Naïve Bayes data mining algorithms. Correctly classified instances are reported both as a count and as a percentage, whereas the kappa statistic, mean absolute error and root mean square error are reported as numeric values only. For ID3 and J48 the time taken to build the model was 0.2 seconds, and the test mode was 10-fold cross-validation. Weka compares the classifiers on the given instances by accuracy and prediction rate: J48 classified 100% of the instances correctly without error, Naïve Bayes also classified 100% correctly but with some error, and ID3 classified 95% correctly, so J48 was the most accurate of the three algorithms [6].

D C Yadav and R Kumar discussed that association algorithms provide very accurate results in finding frequent items and relationships between data objects, and in finding the confidence and support percentages of data objects, with the help of the Apriori, Predictive Apriori and Filtered Associator algorithms. These algorithms can therefore be used in other domains to bring out interesting relationships among the data present at the source [7].
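The support and confidence measures mentioned above can be illustrated with a short sketch. It uses the usual definitions (support as the fraction of transactions containing an itemset, confidence as the support of the whole rule divided by the support of its antecedent); the function names and the toy transactions are hypothetical and are not taken from the cited study.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante > 0 else 0.0


# Hypothetical toy transactions.
transactions = [{"a", "b"}, {"a"}, {"a", "b", "c"}, {"c"}]
print(support(transactions, {"a", "b"}))
print(confidence(transactions, {"a"}, {"b"}))
```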

D C Yadav and R Kumar discussed three major clustering algorithms, K-Means, hierarchical clustering and density-based clustering, and compared their performance. The authors compared them using a clustering tool and found that K-Means performs better than hierarchical clustering and the Make Density Based algorithm, while noting that all of the algorithms show some ambiguity on noisy data when clustered [8].



R Sukanya and K Prabha discussed the use of a back-propagation neural network and a support vector machine for rainfall prediction. The ANN improves the efficiency of rainfall prediction by analyzing historical and current facts to make accurate predictions about the future [9].

R S, S M, N E, S P and V Kirand discussed the analysis of a huge volume of warranty data in order to segregate fraudulent warranty claims using pattern recognition and clustering. A survey of the automotive industry shows that up to 10% of warranty costs are related to warranty claims fraud, costing manufacturers several billions of dollars. The existing methods to detect warranty fraud are very complex and expensive because they deal with inaccurate and vague data, causing manufacturers to bear excessive costs [10].

D C Yadav analyzed the F1 score, which in the statistical analysis of binary classification is a measure of a test's accuracy. It considers both the precision and the recall of the test to compute the score. In this analysis the author computed the F1 score for several data mining classifier algorithms and chose the ID3 tree as the best classifier for the selected datasets, because the ID3 tree had the highest F1 score and took the least time to build a model [11].
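For reference, the F1 score is the harmonic mean of precision and recall. The following minimal sketch computes it from confusion-matrix counts; the function name and the example counts are hypothetical and do not come from the cited study.

```python
def f1_score(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: precision 0.8 and recall 0.5.
print(f1_score(tp=8, fp=2, fn=8))
```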

D C Yadav analyzed the Matthews correlation coefficient (MCC), which is used in machine learning as a measure of the quality of binary (two-class) classifiers. It takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes. The MCC is in essence a correlation coefficient between the observed and predicted binary classifications. The author computed it for several data mining classifier algorithms and found the ID3 tree to be the best classifier for the selected datasets, because the ID3 tree had the highest MCC value and the minimum time to build a model (0.00 seconds) [12].
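The MCC can be computed directly from the four confusion-matrix counts. The sketch below uses the standard formula and, as an assumption of this example, returns 0.0 when the denominator is zero; the function name and the example counts are hypothetical.

```python
from math import sqrt


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: an undefined ratio is reported as 0.0.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Hypothetical classifier that labels every instance of a 100/400 dataset as negative:
# its accuracy is 80%, yet its MCC is 0.0 under the convention above.
print(mcc(tp=0, tn=400, fp=0, fn=100))
```

The example shows why the measure is considered balanced: a classifier that ignores the minority class obtains a high accuracy but no credit from the MCC.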

D C Yadav analyzed the informedness of a prediction method. As captured by a contingency matrix, informedness is defined as the probability that the prediction method will make a correct decision as opposed to guessing, and it is calculated using the bookmaker algorithm. The measure was generated with the LAD Tree, ID3 and J48 data mining algorithms, and ID3 was found to be the best classifier for the selected datasets [13].
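For a two-class problem, informedness reduces to sensitivity plus specificity minus one. The following minimal sketch computes it from a contingency matrix under that two-class assumption; the function name and the counts are hypothetical.

```python
def informedness(tp, tn, fp, fn):
    """Two-class informedness: sensitivity + specificity - 1."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity + specificity - 1.0


# Hypothetical counts: sensitivity 0.5 and specificity 0.75.
print(informedness(tp=50, tn=300, fp=100, fn=50))
```

A value of 0 corresponds to guessing, and a value of 1 to a prediction method that is always correct.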

D C Yadav analyzed false discovery rate (FDR) controlling procedures, which provide less stringent control of Type I errors compared to class-wise errors. In this analysis the ID3 tree was chosen as the best data mining classifier algorithm for the selected datasets, because the ID3 tree took the minimum time to build a model [14].

D C Yadav based the whole analysis on dependent variables for overall performance. A classifier predicts categorical class labels: it builds a model from the training set and the values of the class label attribute, and uses the model to classify new data. The author compared AD Tree, LAD Tree, J48 and Naïve Bayes on correctly and incorrectly classified instances together with the kappa statistic, and chose the LAD Tree as the best data mining classifier algorithm for the selected datasets, because the LAD Tree had the highest proportion of correctly classified instances (83.333%) and the minimum number of unclassified instances (0.00). The LAD Tree also had the highest value, 0.3576, of the metric for accuracy [15].

T, R and Liu presented a framework for fraud detection based on security systems and case-based reasoning. First, a set of normal and fraud cases is built from labeled data. Then the primary detectors are generated with random or genetic algorithms. Finally, negative selection and clonal selection operations are applied to the primary detectors in order to obtain a set of detectors with different algorithms that can detect a variety of frauds [16].

M, S, B and Saira discussed that many of the fraud detection systems presented so far have used data mining and neural network approaches, while no system combining anomaly detection, misuse detection and decision making had been used for fraud detection in credit cards. They therefore proposed a system that uses a Hidden Markov Model to detect fraudulent transactions [17].

A John analyzed a hybrid feature selection and anomaly detection algorithm for detecting fraud in credit cards. The author noted that fraud detection on the internet must be done online and immediately. Since the use of a credit card by its holder follows a fixed pattern, this pattern can be extracted from the usual legal activity of the card holder over 1 or 2 years. The pattern is then compared with the card holder's current usage, and if the two are not similar, the activity is considered illegal. Neural networks were used to learn the pattern detection in the model in this study [18].

A P, M K and A N discussed that data mining, as one of the most efficient tools of data analysis, has attracted the attention of many people. Its techniques and algorithms are used in various fields such as customer relationship management, fraud management and detection, medical sciences and sport. Owing to the large amount of data in banks, data mining has had many applications in financial and monetary affairs so far. Credit risk management, fraud detection, money laundering, customer relationship management and banking services quality management are some examples of data mining applications in banks [19].

In this work, we propose to use a meta-classifier that uses clustering for classification, based on the assumption that each cluster corresponds to a class. First, the consumer claim usage and interaction data have to be collected and preprocessed. Then an optional attribute selection process can be applied in order to select only a group of attributes/variables, or all available attributes can be used. Second, we analyze and compare the meta-classifier algorithm “classification via clustering” with the J48 classifier and K-Means.

[3] METHODOLOGY:

Our research approach is to use Classification via Clustering, J48 and K-Means on the consumer claim data set. The research methodology is divided into 5 steps to achieve the desired results:

Step 1: Prepare the data and specify the source of the data.

Step 2: Select the specific data and transform it into a different format with Weka.

Step 3: Implement the data mining algorithms and check all the relevant disputes.

Step 4: A decision is taken on the presence of a dispute in the source data. If a dispute is present we proceed further; otherwise the process stops. We classify the relevant disputes using Classification via Clustering, J48 and K-Means.

Step 5: Finally, the results are displayed and evaluated.

Table.1. Representation of Computational Variables of Consumer Claim

Property | Description
Source | https://catalog.data.gov/dataset. Consumer Financial Protection Bureau, a U.S. Government Agency.
Complaint Type | Consumer Complaint Database attributes for financial problems (bank, lender, company, etc.)
Sample Size | 500 total: 100 consumer disputes and 400 non-disputes
Dependent Variables |
Dispute(YES) | Consumer complaint disputed
Dispute(NO) | Consumer complaint not disputed

Field name | Description
Date received | The date the CFPB received the complaint.
Tags | Data that supports easier searching and sorting of complaints submitted by or on behalf of consumers.
Date sent to company | The date the CFPB sent the complaint to the company.
Company response to consumer | This is how the company responded. For example, “Closed with explanation.”
Timely response? | Whether the company gave a timely response. For example, “Yes” or “No.”

We now study consumer complaints about different types of bank loans and the related disputes.

1. Data Preparation:

One of the biggest problems with credit cards is that it is easy to forget to make a payment. This can be especially true if you are going through a major change in life. Your credit will not be damaged severely if you realize that you missed the payment before your next due date. Your credit card company may even waive the late fee if you have not been habitually late. You can also prevent any damage to your credit score by making up the payment before it is 30 days past due. If you have a calendar on your computer or smartphone, you should put your due date on it as a recurring event to help make sure you do not miss any payments in the future. When you write a letter, however, it ensures that your rights will be protected.

The Consumer Complaint Database is a collection of complaints on a range of consumer financial products and services, sent to companies for response. The CFPB does not verify all the facts alleged in these complaints, but it takes steps to confirm a commercial relationship between the consumer and the company.

2. Data Selection and Transformation:

The database is generally updated daily and contains certain information for each complaint, including the source of the complaint, the date of submission, and the company the complaint was sent to for response. The database also includes information about the actions taken by the company in response to the complaint, such as whether the company’s response was timely and how the company responded. Companies also have the option to select a public response. Company-level information should be considered in the context of company size. Data from these complaints helps in understanding the financial marketplace and protecting consumers.

Fig.4. Representation of Instances by Classification via Clustering algorithm



Fig.5. Representation of Instances by J48 algorithm

Fig.6. Representation of Instances by K-Means algorithm

After data selection we transform the data with the Weka data mining tool. Each of these three algorithms reports its own details: accuracy by class, time taken to build the model, and a stratified cross-validation summary. The confusion matrix describes the correctly and incorrectly classified instances with respect to each class.
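To make the role of the confusion matrix concrete, the following minimal sketch derives the accuracy and Cohen's kappa statistic from it. The function name and the example matrix are hypothetical; the matrix does not reproduce any result of this study.

```python
def accuracy_and_kappa(matrix):
    """matrix[i][j] = number of instances of actual class i predicted as class j."""
    total = sum(sum(row) for row in matrix)
    # Observed agreement: proportion of instances on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance from the row and column totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected) if expected != 1 else 0.0
    return observed, kappa


# Hypothetical two-class confusion matrix (rows: actual, columns: predicted).
print(accuracy_and_kappa([[40, 10], [20, 30]]))
```

A kappa value close to zero indicates that the agreement between predicted and actual classes is little better than chance, even when the accuracy itself looks high.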



3. Implementation:

In this step we choose the classification algorithm, J48, for the purpose of accuracy on the dataset, and investigate the classifier's performance in terms of accuracy. Weka is a data mining tool. It is a simple tool for classifying data of various types, and it provides a graphical user interface to the user. To perform the clustering we used the PROMISE data repository, which provides past project data for analysis. With the help of figures we show the working of the various algorithms used in Weka. Weka is a suitable tool for data mining applications. Clustering is a main task of exploratory data mining and a common technique for statistical data analysis used in many fields, including machine learning. We use the Weka data mining tool for this purpose because it provides a better interface to the user than other data mining tools.

4. Result and Discussion:

All our experiments were performed using Weka and the previously described consumer claim dataset. In order to test the accuracy of the obtained classification models we used the 10-fold cross-validation method. All classifiers in Weka work in the same way under cross-validation: the model is built using just the instances in the training folds. The classification via clustering approach is based on the "clusters to classes" evaluation routine in the cluster evaluation code, which finds a minimum-error mapping of clusters to classes.
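The idea of a minimum-error mapping of clusters to classes can be illustrated with the following minimal sketch. It is not Weka's code: it assumes that each class may be assigned to at most one cluster and searches all assignments exhaustively, which is only practical for a handful of clusters. The function name and the toy labels are hypothetical.

```python
from itertools import permutations


def map_clusters_to_classes(cluster_ids, labels):
    """Return (mapping, errors) for the cluster-to-class assignment with fewest errors."""
    clusters = sorted(set(cluster_ids))
    classes = sorted(set(labels))
    # counts[c][y] = number of instances in cluster c that carry class label y.
    counts = {c: {y: 0 for y in classes} for c in clusters}
    for c, y in zip(cluster_ids, labels):
        counts[c][y] += 1
    # Pad with None so that surplus clusters can stay unmapped.
    candidates = classes + [None] * max(0, len(clusters) - len(classes))
    best_mapping, best_errors = None, len(labels) + 1
    for perm in permutations(candidates, len(clusters)):
        correct = sum(counts[c][y] for c, y in zip(clusters, perm) if y is not None)
        errors = len(labels) - correct
        if errors < best_errors:
            best_mapping, best_errors = dict(zip(clusters, perm)), errors
    return best_mapping, best_errors


# Hypothetical toy data: cluster assignments and true dispute labels.
cluster_ids = [0, 0, 0, 1, 1, 1]
labels = ["NO", "NO", "YES", "YES", "YES", "NO"]
print(map_clusters_to_classes(cluster_ids, labels))
```

Once the mapping is fixed, a new instance is classified by finding its cluster and returning the class mapped to that cluster.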

Table.2. Representation of Computed Instances for Classifier Algorithms

Algorithms | Accuracy (%) | Kappa | RMSE | RAE (%) | RRSE (%) | Time (s) | No. of Iterations | Sum of Squared Error
Classification via Clustering | 65.4 | 0.02 | 0.23 | 102.63 | 143.72 | 0.00 | 3 | 1714
J48 | 79.2 | 0.00 | 0.33 | 99.34 | 99.99 | 0.05 | - | -
K-Means | - | - | - | - | - | 0.00 | 3 | 1610
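The relative absolute error (RAE) and root relative squared error (RRSE) columns compare a model's errors with those of a baseline that always predicts the mean. The sketch below uses the common textbook definitions with the mean of the actual values as the baseline; the exact baseline used by Weka for class predictions may differ, so this is an assumption, and the function name and values are hypothetical.

```python
from math import sqrt


def relative_errors(actual, predicted):
    """Return (RAE, RRSE) in percent, relative to always predicting the mean."""
    mean_actual = sum(actual) / len(actual)
    abs_error = sum(abs(p - a) for a, p in zip(actual, predicted))
    abs_baseline = sum(abs(a - mean_actual) for a in actual)
    sq_error = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    sq_baseline = sum((a - mean_actual) ** 2 for a in actual)
    rae = 100.0 * abs_error / abs_baseline
    rrse = 100.0 * sqrt(sq_error / sq_baseline)
    return rae, rrse


# Hypothetical 0/1 targets; predicting the mean everywhere reproduces the baseline.
print(relative_errors([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5]))
```

Under these definitions a value near 100 means that the model is barely better than the baseline, and a value above 100 means that it is worse.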

In the first experiment, we executed the clustering algorithms provided by Weka for classification via clustering, using 500 instances of consumer claims with all available attributes. Second, we analyzed and compared the meta-classifier algorithm “classification via clustering” with the J48 classifier and K-Means.

From Table 2 it is clear that:

For classification, J48 gives a more accurate result (79.2%) and lower relative errors (RAE 99.34 and RRSE 99.99) than classification via clustering (65.4%, RAE 102.63 and RRSE 143.72).

For clustering, K-Means gives a lower sum of squared error (1610) than classification via clustering (1714).

Overall, for classification and clustering respectively, J48 and K-Means perform better than the classification via clustering algorithm.

5. CONCLUSION

Dispute participation in the bank loan claims of the bureau report was a good predictor of the dispute. Another advantage of classification models based on mapping clusters to classes is that they are very simple and interpretable for analysts. In the case of dispute YES here, analysts only have to examine the cluster centroids to identify the consumers who are active in bank claims. We find that, for classification and clustering respectively, J48 and K-Means perform better than the classification via clustering algorithm.

REFERENCES

[1] https://catalog.data.gov/dataset.

[2] J. Han and M. Kamber, “Data Mining: Concepts and Techniques”, Elsevier, 2011.

[3] J. R. Quinlan, “C4.5: Programs for Machine Learning”, Morgan Kaufmann Publishers, 1993.

[4] J. Han and M. Kamber, “Data Mining: Concepts and Techniques”, Morgan Kaufmann Publishers, Second Edition, 2006.

[5] H, K. (Ed.), “Detecting Payment Card Fraud with Neural Networks”, Singapore: World Scientific, 2000.

[6] D C Yadav and S Pal, “An Integration of Clustering and Classification Technique in Software Error Detection”, African Journal of Computing & ICT, Vol. 8, No. 2, June 2015, ISSN 2006-1781.

[7] D C Yadav and R Kumar, “Generate Identical Rules by the Different Algorithms for Detection of Risk In Software Project Development”, Shodh Prerak A Multidisciplinary Quarterly International Refereed Journal, Volume V, Issue 2, April 2015, ISSN 2231-413X.

[8] D C Yadav and R Kumar, “A Comparative Study of Software Bug by Clustering Algorithms”, Annals of Multi-Disciplinary Research A Quarterly International Refereed Research Journal, Volume V, Issue 5, Dec. 2015, ISSN 2249-8893.

[9] R Sukanya and K Prabha, “Comparative Analysis for Prediction of Rainfall using Data Mining Techniques with Artificial Neural Network”, Volume 5, Issue 6, pp. 288-292, Jun 30, 2017.

[10] R S, S M, N E, S P and S V Kirand, “Modelling an Optimized Warranty Analysis methodology for fleet industry using data mining clustering methodologies with Fraud detection mechanism using pattern recognition on hybrid analytic approach”, Fourth International Conference on Recent Trends in Computer Science & Engineering, Chennai, Tamil Nadu, India, Procedia Computer Science, Volume 87, 2016, pp. 322-327.

[11] D C Yadav, “Analysis of Bug Accuracy through F1-Score by Data Mining Classifier Algorithms”, Annals of Multi-Disciplinary Research A Quarterly International Refereed Journal, Volume VII, Issue 2, June 2017, ISSN 2249-8893.

[12] D C Yadav, “Analysis of Bug through Mathews Correlation Coefficient with Confusion Matrix”, Sodha Pravaha A Multidisciplinary Refereed Research Journal, Vol. VII, Issue 3, July 2017, ISSN 2231-4113.

[13] D C Yadav, “To Create Correct Decision by Informedness of Bug Prediction”, Shodh Prerak A Multidisciplinary Quarterly International Refereed Research Journal, Volume VII, Issue 2, April 2017, ISSN 2231-413X.

[14] D C Yadav, “Controlling the Procedures through False Discovery Rate Defect in Bug Analysis”, Sodha Pravaha A Multidisciplinary Refereed Research Journal, Vol. VII, Issue 2, April 2017, ISSN 2231-4113.

[15] D C Yadav, “Measurement of Gap between Software Bug Classes by Data Mining Classifier Algorithms”, Vaichariki A Multidisciplinary Refereed International Research Journal, Vol. VII, Issue 2, June 2017, ISSN 2249-8907.

[16] Tue, Ren, Liu, “Artificial Immune System for Fraud Detection”, IEEE, vol. 2, pp. 1407-1411, 2004.

[17] M, S, B and Saira, “A Defense Mechanism For Credit Card Fraud Detection”, International Journal on Cryptography and Information Security (IJCIS), 2012, pp. 89-100.

[18] A John, “Data Mining Application for Cyber Credit-Card Fraud Detection System”, Springer-Verlag Berlin Heidelberg, 2013, pp. 218–228.

[19] A P, M K and A N, “Fraud detection in E-banking by using the hybrid feature selection and evolutionary algorithms”, IJCSNS International Journal of Computer Science and Network Security, Vol. 17, No. 8, August 2017.
