SolvingMisclassificationoftheCreditCardImbalanceProblem ...imbalance status, their work provides the...

Research ArticleSolving Misclassification of the Credit Card Imbalance ProblemUsing Near Miss

Nhlakanipho Michael Mqadi1 Nalindren Naicker 1 and Timothy Adeliyi 2

1ICT and Society Research Group Information Systems Durban University of Technology Durban 4001 South Africa2ICT and Society Research Group Information Technology Durban University of Technology Durban 4001 South Africa

Correspondence should be addressed to Nalindren Naicker nalindrenndutacza

Received 30 May 2021 Accepted 2 July 2021 Published 20 July 2021

Academic Editor Jude Hemanth

Copyright copy 2021 Nhlakanipho Michael Mqadi et al (is is an open access article distributed under the Creative CommonsAttribution License which permits unrestricted use distribution and reproduction in anymedium provided the original work isproperly cited

In ordinary credit card datasets there are far fewer fraudulent transactions than ordinary transactions In dealing with the creditcard imbalance problem the ideal solution must have low bias and low variance (e paper aims to provide an in-depth ex-perimental investigation of the effect of using a hybrid data-point approach to resolve the class misclassification problem inimbalanced credit card datasets (e goal of the research was to use a novel technique to manage unbalanced datasets to improvethe effectiveness of machine learning algorithms in detecting fraud or anomalous patterns in huge volumes of financial transactionrecords where the class distribution was imbalanced (e paper proposed using random forest and a hybrid data-point approachcombining feature selection with Near Miss-based undersampling technique We assessed the proposed method on two im-balanced credit card datasets namely the European Credit Card dataset and the UCI Credit Card dataset (e experimentalresults were reported using performance matrices We compared the classification results of logistic regression support vectormachine decision tree and random forest before and after using our approach (e findings showed that the proposed approachimproved the predictive accuracy of the logistic regression support vector machine decision tree and random forest algorithmsin credit card datasets Furthermore we found that out of the four algorithms the random forest produced the best results

1 Introduction

(e South African Banking Risk Information Centre(SABRIC) presented its annual crime data for 2019 whichshowed that online banking fraud incidences climbed by20 between 2018 and 2019 [1](ese statistics revealed thatcard fraud occurs in different forms namely ldquowithout thecardrdquo ldquowhen the card is lostrdquo ldquowhen the card is stolenrdquo andldquowhen the card is not receivedrdquo ldquoWithout the card fraudrdquo iswhen the fraudulent transactions occur without the consentof the owner and while the physical card is in the ownerrsquospossession [1 2] ldquoWhen the card is lostrdquo fraud is defined as afraud committed when the valid cardholder is not in pos-session of the card and transactions are made on the cardFurthermore ldquowhen the card is stolenrdquo fraud is when thefraud is committed by the person who is not the rightfulowner of the card ldquoWhen the card is not receivedrdquo fraud

occurs when legitimately issued cards are intercepted beforethey reach their intended recipients [2] (e cards aresubsequently used fraudulently by impostors who haveintercepted them Card fraud transactions stored by fi-nancial issuers are very small compared to legitimatetransactions which results in a high imbalance credit carddataset [3] (e situation in which the dominant classes havea significant advantage over the minority classes is referredto as imbalanced data An imbalance credit card datasetrefers to a class distribution in which the bulk of validtransactions recorded outnumber the minority fraudulenttransactions [4] (e imbalance problems cause the machinelearning classification solutions to be partial towards themajority class and produce a prediction with a high mis-classification rate Failure to deal with imbalanced datajeopardizes the machine learning systemrsquos integrity andprediction ability which can have a significant cost impact

HindawiMathematical Problems in EngineeringVolume 2021 Article ID 7194728 16 pageshttpsdoiorg10115520217194728

[5] Learning algorithms operate on the assumption that thedata is evenly distributed hence imbalanced data is ac-knowledged as part of the fundamental issues in the field ofdata analytics and data science [6]

(e science of building and implementing algorithmsthat can learn patterns from previous events is known asmachine learning [7] (e machine learning classifiers canbe trained by continually feeding input data and assessingtheir performances A machine learning classification so-lution employs sophisticated algorithms that loop over bigdatasets and evaluate data patterns [8] In machinelearning the ideal solution must have low bias and shouldaccurately model the true relationship of positive andnegative classes Machine learning classifiers tend to per-form well with a specific dataset that has been manipulatedto suit the classifier [9] (e use of one dataset tends to havea bias as the data could be manipulated to support theclassification solution (is is commonly referred to asoverfitting in machine learning (e ideal binary classifi-cation solutions should have low variability by producingconsistent predictions across different datasets [10] (egoal of this work was to conduct a thorough investigation ofthe impact of employing a hybrid data-point strategy tohandle the misclassification problem in credit card datasetsthat were imbalanced Oversampling undersampling andfeature selection are examples of strategies for resamplingdata used in dealing with imbalanced classes at the data-point level [11] (e data-point level technique accordingto Sotiris Dimitris and Panayiotis in [12] includes datainterventions to lessen the impact of imbalanced datasetsand it is flexible enough to proceed with modern classifierssuch as logistic regression decision trees and supportvector machines

We present in this paper a hybrid method that amal-gamates the advantages of the random forest and a hybriddata-point technique to deal with the problem of imbalancelearning in credit card fraud Random forest used for pre-diction has the advantage of being able to manage datasetswith several predictor variables We further combinedfeature selection using correlation coefficients in order tomake it easier for machine learning to classify with NearMiss-based undersampling technique Using well-knownperformance metrics the model outperformed other rec-ognised models

2 Related Works

Machine learning models work well when the datasetcontains evenly distributed classes known as a balanceddataset [13] Pes in [14] looked at the efficacy of hybridlearning procedures that combine dimensionality reductionand ways for dealing with class imbalance (e researchcombines univariate and multivariate feature selectionstrategies with cost-sensitive classification and sampling-based class balance methods [14](e dominance of blendedlearning strategies presented a dependable choice for re-current random sampling and investigations proved thathybrid learning strategies outperformed feature selectionsolely for unbalanced datasets [14] Several studies analyzed

and compared existing financial fraud detection algorithmsin order to find the most effective strategy [9 15 16] Usingthe confusion matrix Zhou and Liu in [16] discovered thatthe random forest model outperformed logistic regressionand decision tree based on accuracy precision and recallmatrices Albashrawi et al in [9] found that the logisticregression model is the superior data mining tool fordetecting financial fraud

A paper by Minku et al in [17] looked into the scenarioof classes progressively appearing or disappearing (eClass-Based Ensemble for Class Evolution (CBCE) wassuggested as a class-based ensemble technique CBCE canquickly adjust to class evolution by keeping a base learner foreach class and constantly updating the basic learners withnew data To solve the dynamic class imbalance probleminduced by the steady growth of classes the study developeda novel undersampling strategy for the base learners (eempirical investigations according to Minku et al in [17]revealed the efficiency of CBCE in various class evolutionscenarios when compared to the existing class evolutionadaptation method CBCE responded well to all three sce-narios of class evolution as compared to previous ap-proaches (ie occurrence disappearance and reoccurrenceof classes) (e empirical analysis confirmed under-samplingrsquos dependability and CBCE demonstrated that itbeats other recognised class evolution adaptation algo-rithms not only in terms of the ability to adjust to variedevolution scenarios but also in terms of overall classificationperformance Two learning algorithms were proposed byWang et al in [18] Undersampling-based Online Bagging(UOB) and Oversampling-based Online Bagging (OOB)devised an ensemble approach to overcome the class im-balance in real-time using time-decayed and resamplingmetrics (e study also focused on the performance of OOBand UOBs resampling strategies in both static and dynamicdata streams to see how they could be improved In terms ofdata distributions imbalance rates and changes in classimbalance status their work provides the first compre-hensive examination of class imbalance in data streamsAccording to the findings UOB is superior at detectingminority class cases in static data streams while OOB isbetter at resisting fluctuations in class imbalance status (esupply of data was discovered to be a crucial elementimpacting their performance and more research wasrequired

Liu and Wu experimented with two strategies to avoidthe drawbacks of employing undersampling to deal withclass imbalance When undersampling is used most ma-jority classes are disregarded which is a flaw As a resultEasy-Ensemble and Balance-Cascade were proposed in thestudy Easy-Ensemble breaks the majority class into nu-merous smaller chunks then uses each chunk to train thelearner independently and finally all of the learnersrsquo outputsare combined [19] Balance-Cascade employs a sequentialtraining strategy in which the majority classrsquos properlyclassified instances are excluded from further evaluation inthe next series [19] According to the data the Easy-En-semble and the Balance-Cascade had higher G-meanF-measure and AUC values than other existing techniques

2 Mathematical Problems in Engineering

Many studies have been conducted on class disparitynonetheless the efficacy of most existing technologies indetecting credit card fraud is still far from optimal (e goalof this research paper was to see how employing randomforest and a hybrid data-point strategy integrating featureselection and Near Miss may help enhance the classificationperformance of two credit card datasets Near Miss is anundersampling technique that aims to stabilize class dis-tribution by randomly deleting majority class examples [20]

In general four techniques handling the problem of classimbalance have been proposed in the literature Ensembleapproaches algorithm approaches cost-sensitive ap-proaches and data-level approaches are examples of thesemethodologies In the algorithmic technique learning al-gorithms that are supervised are designed to favour theinstances of the minority class (e most often used data-level methods rebalance the imbalanced dataset By estab-lishing misclassification costs cost-sensitive algorithmssolve the data imbalance problem Undersampling andlearning that are cost sensitive bagging and undersamplingboosting and resampling are some of the tactics used inensemble learning approaches In addition to these methodshybrid approaches such as UnderBagging OverBagging andSMOTEBoost combine undersampling and oversamplingmethods [21]

In undersampling findings the most suitable repre-sentation is important for the accurate prediction of thesupervised learning algorithms on the imbalanced datasetClustering provides a useful representation of the majorityclass in a class imbalance problem To deal with unevenlearning Onan in [21] employed a consensus clustering-based undersampling method He employed k-modes k-means k-means++ self-organizing maps and the DIANAmethod as well as their combinations (e data were cat-egorised using five supervised learning algorithms supportvector machines logistic regression naive Bayes randomforests and the k-nearest neighbour algorithm as well asthree ensemble learner methods AdaBoost Bagging and therandom subspace algorithm (e clustering undersamplingstrategy produced the best prediction results [21]

Onan and Korukoglu in [22] introduced an ensembletechnique to sentiment classification feature selection (eproposed aggregationmodel aggregates the lists from severalfeature selection methods utilizing a genetic algorithm-based rank aggregation (e selection methods used werefilter-based (is method was efficient and outperformedindividual filter-based feature selection methods In anothersentiment analysis grouping study by Onan in [23] lin-guistic inquiry and word count were used to extract psy-cholinguistic features from text documents Four supervisedlearning algorithms and three ensemble learning methodswere used for the classification (e datasets containedpositive negative and neutral tweets 10-fold cross valida-tion was employed

Borah and Gupta in [24] suggested a robust twinbounded support vector machine technique based on thetruncated loss function to overcome the imbalance problem(e total error of the classes was scaled based on the numberof samples in each class to implement cost-sensitive learning

In resolving the problem of class imbalance Gupta andRichhariya in [25] presented entropy-based fuzzy leastsquares support vector machine and entropy-based fuzzyleast squares twin support vector machine Fuzzy mem-bership was calculated on entropy values of samples Inanother study by Gupta et al in [26] a new method wasreferred to as fuzzy Lagrangian twin parametric-marginsupport vector machine which used fuzzy membershipvalues in decision learning to handle outlier points Hazarikaand Gupta in [27] used a support vector machine based ondensity weight to handle the imbalance of classes problem Aweight matrix was used to reduce the effect of the binaryclass imbalance

3 Materials and Methods

31ampeData-Point Approach (e data-point approach wasused to investigate the class imbalance problem (e studyproposed a 2-step hybrid data-point approach (e first stepwas using feature selection after data preprocessing and thenundersampling with Near Miss to resample the data Featureselection is the process of selecting those features that mostcontributed to the prediction variable or intended output[4]

32 Feature Selection Feature selection was used as a stepfollowing preprocessing before the learning occurred Toovercome the drawbacks of an imbalanced distribution andimprove the efficiency of classifiers feature selection is usedto choose appropriate variables We performed feature se-lection using correlation coefficients which is a filter-basedfeature selection method that removes duplicate featureshence choosing the most relevant features (e feature se-lection was then utilized to determine which features wereindependent and which were dependent (e independentfeatures were recorded in theX variable while the dependentfeatures were saved separately on the Y variable (e Yvariable included the indicator of whether the transactionwas normal (labeled as 0) or fraudulent (labeled as 1) whichwas the variable we were seeking to forecast In this studythe class imbalance was investigated in the context of abinary (two-class) classification problem with class 0 rep-resenting the majority and class 1 representing the minority

33 Near Miss-Based Undersampling (e technique ofbalancing the class distribution for a classification datasetwith a skewed class distribution is known as undersampling[28 29] To balance the class distribution undersamplingremoves the training dataset examples which pertain to themajority class such as reducing the skew from a 1 100 to a1 10 1 2 or even a 1 1 class distribution To evaluate theinfluence of the data-point method this paper used anundersampling strategy based on the Near Miss methodNear Miss was chosen based on its advantages to provide amore robust and fair class distribution boundary which wasfound to improve the performance of classifies for detectionin large-scale imbalanced datasets [30 31] (e experimentused an imbalance-learn library to call a class to perform

Mathematical Problems in Engineering 3

undersampling based on the Near Miss technique [7] (eNear Miss method was manipulated by passing parametersthat are to meet the desired results (e Near Miss techniquehas three versions namely [32]

(1) NearMiss-1 finds the test data with the least averagedistance to the negative classrsquos nearest samples

(2) NearMiss-2 chooses the positive samples with theshortest average distance to the negative classrsquosfarthest samples

(3) NearMiss-3 is a two-step procedure First thenearest neighbours of each negative sample shall bepreserved (e positive samples are then chosenbased on the average distance between them andtheir nearest neighbours

(e presence of noise can affect NearMiss-1 whenundersampling a specific class It means that samples fromthe desired class will be chosen in the vicinity of thesesamples [33] However in most cases samples around thelimits will be chosen [34] Because NearMiss-2 focuses onthe farthest samples rather than the closest it will not havethis effect (e presence of noise can also be changed bysampling especially when there are marginal outliers Be-cause of the first-step sample selection NearMiss-3 will beless influenced by noise [35]

(e following table is a snippet of parameters that wereused to instantiate the Near Miss technique (e chosenvariation for this study was the NearMiss-2 version afterexecuting multiple iterations using all the three differentversions to select the most suitable version for the credit carddataset A uniform experiment was conducted on bothdatasets to ensure a fair cross-comparison

Table 1 is a snippet of parameters that were used toinstantiate the Near Miss technique

Table 1 provides a list of all the parameters and theirassociated values which were passed when instantiating theNear Miss method using an API call on the imbalance-learnlibrary Performance of the Near Miss method was opti-mized using parameter tuning which was achieved bychanging the default parameters for the version N neigh-bours and N neighboursrsquo ver3 parameters

34 Design of Study (e experimental method was used toexamine the effect of using a hybrid data-point approach tosolve the misclassification problem created by imbalanceddatasets (e hybrid data-point technique was used on twoimbalanced credit card datasets (is study investigated theundersampling technique instead of the oversamplingtechnique because it balances the data by reducing themajority class (erefore undersampling avoids cloning thesensitive financial data which means that only the authenticfinancial records were used during the experiment [36]

A lot of pieces of literature support undersampling forexample a study byWest and Bhattacharya in [3] found thatundersampling gives better performance when the majorityclass highly outweighs the minority class A cross-com-parison amongst the two datasets was conducted to deter-mine whether Near Miss-based undersampling could cater

for distinct credit card datasets (e two datasets werecollected from Kaggle a public dataset source (httpswwwkagglecommlg-ulbcreditcardfraudhome) [3738] (edatasets were considered because they are labeled highlyunbalanced and handy for the researcher because they arefreely accessible making them more suited to the researchrsquosneeds and budget (e study applied a supervised learningstrategy that used the classification technique Supervisedlearning gives powerful capabilities for using machinelanguage to classify and handle data [39]

To infer a learning method supervised learning wasemployed with labeled data which was a dataset that had beenclassified(e datasets were used as the basis for predicting theclassification of other unlabeled data using machine learningalgorithms (e classification strategies utilized in the trialswere those that focused on assessing data and recognisingpatterns to anticipate a qualitative response [40] During theexperiment the classification algorithms were used to distin-guish between the legitimate and fraudulent classes

(e experiment was executed in four stages preteststage treatment stage posttest stage and review stageDuring the pretest the original dataset was fed into themachine learning classifiers and classification algorithmswere used to train and test the predictive accuracy of theclassifier Each dataset was fed in the ML classifier using the3-step loop training testing and prediction (e data-pointlevel approach methods were applied to the dataset duringthe treatment stage of the experiment to offset the areaaffected by class imbalance (e study investigated the hy-brid technique to determine the resampling strategy thatyields the best results (e resultant dataset from eachprocedure was the stagersquos output

In the posttest stage the resultant dataset was taken andagain fed into the classifiers Stages two and three were aniterative process the aim was to solve the misclassificationproblem created by imbalanced data (erefore an in-depthreview and analysis of accuracy for each result were con-ducted after each iteration to optimize the process for betteraccuracy Lastly the review stage carried out a compre-hensive review of the performance of each algorithm forboth the pretest and posttest results (en a cross-com-parison of the two datasets was performed to determine thebest performing algorithm for both datasets

(is supervised machine learning study was carried outwith the help of Google Colab and the Python programminglanguage Python is suited for this study because it providesconcise and human-readable code as well as an extensive

Table 1 Near Miss method call parameters

Parameter ValueSampling strategy AutoReturn indices FalseRandom state NoneVersion 2N neighbours 3N neighbours ver3 3N jobs 1Ratio None


choice of libraries and frameworks for implementing ma-chine learning algorithms reducing development time [37](e code was performed on the Google Colab notebookwhich runs on a Google browser and executes code onGooglersquos cloud servers using the power of Google hardwaresuch as GPUs and Tensor Processing Units (TPUs) [38] Ahigh level hybrid data-point approach is presented in Al-gorithm 1

35 Datasets (e first dataset included transactions fromEuropean cardholders with 492 fraudulent activities out of atotal of 284807 activities Only 0173 percent of all trans-actions in the sample were from the minority class whichwere reported as real fraud incidents

Fraudcases fraud

instance sizelowast 100

492284807

lowast 100 0173

(1)

Figure 1 shows a class distribution of the imbalancedEuropean Credit Card dataset

Figure 1 shows the bar graph representation of twoclasses found in the European cardholderrsquos transactions (ex-axes represent the class which indicates either normal orfraud (e y-axes represent the frequency of occurrence foreach class (e short blue bar that is hardly visible shows thefraudulent transactions which was the minority class (efigure shows a graphical representation of the imbalanceratio where the minority class accounts for 0173 of thetotal dataset containing 284807 transactions(e dataset has31 characteristics Due to confidentiality concerns theprimary components V1 V2 and up to V28 were translatedusing Principal Component Analysis (PCA) the only fea-tures not converted using PCA were ldquoamountrdquo ldquotimerdquo and

ldquoclassrdquo (e 0 numeric value indicates a normal transactionand 1 indicates fraud in the ldquoclassrdquo feature [14]

(e second dataset called the UCI Credit Card datasetwhich spanned from April 2005 to September 2005 com-prises data on default payments demographic variablescredit data payment history and bill statements for creditcard clients in Taiwan [38] (e dataset is imbalanced andcontains 30000 instances whereby there are 6636 positivecases

Fraudcases fraud


663630000lowast 100 2212

(2)

Figure 2 shows a class distribution for the imbalance UCICredit Card datasets (e minority class caters for 2212 ofthe total distribution containing 30000 instances

Figure 2 shows the UCI Credit Card dataset classdistribution

(e short blue bar is the minority class that caters for2212 of the dataset and represents the credit card de-faulters (e longer blue bar shows the normal transactionswhich is the majority class (e UCI Credit Card dataset has24 numeric attributes which makes the dataset suitable for aclassification problem An attribute called ldquodefaultpay-mentnextmonthrdquo contained the values of either 0 or 1 (eldquo0rdquo represents a legitimate case and the value of ldquo1rdquo rep-resents the fraudulent case [38]

(ere were no unimportant values or misplaced columnsin any of the datasets that were validated To better un-derstand the data an exploratory data analysis was un-dertaken After that we used a class from the sklearnpackage to execute the train-test-split function to split thedata into a training and testing set with a 70 30 ratio [41]

X_train X_test y_train y_test train_test_split(X Y test_size 030) (3)

(e dependent variable Y independent variable X andtest size are all accepted by the train-test-split function (etest-size option indicates the split ratio of the datasetrsquosoriginal size indicating that 30 of the dataset was used totest the model and 70 of the dataset was used to train themodel (e experimentrsquos next step was to create and trainour classifiers To create the classifiers we employed each ofthe chosen algorithms After that each classifier was fittedusing the x-train and y-train training data (e x-test datawas then utilized to try to predict the y-test variable(e nextsection discusses the algorithms and classifiers in moredetail

36 Classification Algorithms For the experiment logisticregression support vector machine (SVM) decision treeand random forest algorithms were chosen (e literaturerevealed that decision tree logistic regression randomforest and SVM algorithms are the leading classical state-of-the-art detection algorithms [42ndash44] (e algorithms were

used to train and validate the fraud detection model fol-lowing the train test and predict technique [45]

37 Performance Metrics (e measurement matrices usedto evaluate the accuracy of the performance are the preci-sion recall F1-score average precision (AP) and confusionmatrix [14 46] Precision is a metric that assesses a modelrsquosability to forecast positive classifications [47ndash49] Pre-cision =TP(TP+ FP) ldquoWhen the actual outcome is positiverecall describes how well the model predicts the positiveclassrdquo [50] Recall = TP(TP+ FN) Askari and Hussain in[48] claimed that utilizing both recall and precision toquantify the prediction powers of the model is beneficial Anin-depth review and analysis of accuracy was conductedusing the following evaluation matrix false-positiverate = FPFP +TN true-positive rate = TPTP+ FN true-negative rate =TNTN+FP and false-negative rate = FNFN+TP A precision-recall curve is a plot of the precision(y-axis) and the recall (x-axis) for different thresholds [51]


Step 1 beginStep 2 for i= 1 to k do begin

r= calculate correcoeff (n)End

Step 3 data-point level approachmdashNear Miss undersamplingFind the distances between all instances of the majority class and the instances of the minority class(e majority class is to be undersampled(en n instances of the majority class that have the smallest distances to those in the minority class are selectedIf there are k instances in the minority class the nearest method will result in klowast n instances of the majority class

Step 4 train test splitmdashsplit the data into a training set and a testing set using a (70 30) split ratioStep 5 model predictionmdashfor random forest modelTrain the model by fitting the training setModel evaluation (predict values for the testing set)

Step 6 outputAnalyze using performance metrics

Step 7 end

ALGORITHM 1 Hybrid data-point approach algorithm

0

50000

100000

150000

200000

250000

Class

Freq

uenc

y

Normal Fraud

Figure 1 European Credit Card dataset

0

5000

10000

15000

20000

Class

Freq

uenc

y

Normal Fraud

Figure 2 UCI Credit Card dataset distribution


(e F1-score is the testrsquos precise measurement Whencomputing the F1-score both the precision and recall scoresare taken into account ldquoA confusion matrix is a table thatshows how well a classification model works on a set of testdata with known true valuesrdquo [52]

4 Presentation of Results

(is section presents a detailed report comparison anddiscussion of the results for both the European Credit Carddataset and the UCI Credit Card dataset (e performancemetrics used to evaluate the accuracy of the performance areprecision recall F1-score average precision (AP) andconfusion matrix (e results are shown for both the neg-ative class (N) and the positive class (P)

41 Pretreatment Test Results After samples of the originaldatasets were split into training datasets and testing datasetsusing a 70 30 ratio the testing dataset was fed into themachine learning classifiers using each of the four algo-rithms that have been mentioned above to train and test thepredictive performance of the classifier

Table 2 shows the European Credit Card dataset resultsof the classification results before using undersampling

(e testing dataset for the European Credit Card datasetenclosed a sample size of 8545 cases and the UCI CreditCard dataset contained a sample size of 900 cases Accordingto the classification report there was a 100 accuracy fromall the classifiers with the European Credit Card datasetwhich is highly misleading Looking only at the accuracyscore with imbalance datasets does not reflect the trueoutcome of the classification Focusing on the EuropeanCredit Card dataset classification we can observe that forthe SVM classifier there was a high bias towards the negativeclasses

All 8545 cases were flagged as legitimate transactionsthis is because there were only 17 fraudulent transactions inthe testing dataset (e logistic regression performed betterthan the SVM and the classifier was biased but looking at theprecision recall and F1-score some positive classes wereable to be classified (e F1-score verifies that the test wasnot accurate(e report does not tell us if the positive classesidentified were true positives or false positives even thoughthe recall score indicates that there was a great deal ofmisclassification False positives and false negatives are themost commonmisclassification problems which means thateven though the classifier has 100 accuracy and can predictboth positive and negatives classes it fails to produce asuccessful prediction A similar observation is seen on thedecision tree and random forest although the random forestperformed much better compared to all the other threeclassifiers

Table 3 shows the UCI Credit Card dataset results of theclassification results before undersampling was used

(e classification report on the UCI Credit Card datasetshows similar results (e SVM classifier was 100 biased asseen with the European Credit Card dataset (e UCI CreditCard testing datasets have a lower imbalance ratio there

were 202 positive cases out of the total sample size (eaccuracy recorded was 78 which is far less than the idealfor a binary classification solution (erefore without evenconsidering the bias and misclassification problem theaccuracy score alone shows that the SVM classifier is notconsistent across multiple datasets (e logistic regressionhad an accuracy score of 78 which is the same as the SVMclassifier

(e major difference is the precision score which was100 for the logistic regression implying that the classifierwas able to predict all the positive classes (erefore we lookat the recall score of 1 and based on this value we canconclude that the classifier was poor when the actual out-come was positive which means that there were a lot of falsepositives and false negatives Based on the precision scorewe can conclude that the classifier is unbiased but theprediction was able to eliminate false positives and falsenegatives (e decision tree was the least effective in terms ofthe accuracy score which was 72(e precision recall andF1-score were all 37 for the positive class (e randomforest continued to lead with an accuracy score of 81 (eprecision was 63 Recall and F1-score show that nearly halfof the predicted was false positives(e initial finding revealsthat there was a bias towards predicting the majority classrepresenting normal transactions

411 ampe Confusion Matrix ldquo(e confusion matrix tableprovides a mapping of the rate of true negative (TN) truepositive (TP) false negative (FN) and false positive (FP)rdquo[53 54] (e following tables provide the results for eachalgorithm on the original dataset after using undersampling(e confusion matrix table is useful to quantify the numberof misclassifications for both the negative and positiveclasses [55] (e total sample size used during testing is thesum of TN FN TP and FP as per the blueprint of theconfusion matrix (e confusion matrix also helps under-stand if the classification was biased [56] (e initial findingreveals that there was a prejudice towards predicting themajority class representing normal transactions

412 Import from sklearnmetrics (e confusion matrixclass was introduced from sklearn using the snippet ldquofromsklearnmetrics Import Confusion_matrixrdquo and given thatthe dataset was labeled for both datasets the parameters thatindicate both class 0 and class 1 were already defined andduring data preprocessing the parameter was stored in aprediction variable Y

Table 4 shows the confusion matrix table (s) blueprint(e blueprint was used to present the classification results

413 ampe Confusion Matrix without UndersamplingTable 5 shows the SVM confusion matrix results beforeundersampling was used to handle class imbalance

(e findings show that the classification was 100 biasedto the majority class for both datasets All the cases werepredicted to be legitimate even though there was a total of 17and 202 positive cases in both samples respectively


Table 6 shows the logistic regression confusion matrixresults before undersampling was used to handle classimbalance

(e results show that the classifier was both biased andhighly inaccurate For example out of a testing sample of 900cases for the UCI Credit Card dataset 94 of negative caseswere correctly classified and only 37 of positive cases werecorrectly classified (e European cardholdersrsquo transactionsdataset had a testing sample of 8545 transactions 999 ofnegative cases were correctly classified and 47 of thepositive cases were correctly classified

Table 7 shows the decision tree confusion matrix re-sults before undersampling was used to handle classimbalance

(e UCI dataset testing sample contained 698 negativecases and 202 positive cases (e total number of casespredicted as negative equals 700 and 200 for the positivecases Looking at the prediction we can assume that themodel was accurate However the confusionmatrix revealedthat 128 of the 700 cases were falsely classified and 126 of the200 were falsely classified A similar observation is madewith the European cardholdersrsquo transactions dataset

Table 2 Performance of imbalance European Credit Card dataset

Classifier Measure N P

SVM

Precision 100 000Recall 100 000

F1-score 100 000Accuracy 100

Average precision 000

Logistic regression




Decision tree




Random forest




Table 3 Performance of imbalance UCI Credit Card dataset

Measure N P

SVM




Logistic regression




Decision tree




Random forest





(erefore even though there was minimum bias with thedecision tree the model was highly inaccurate

Table 8 contains the random forest confusion matrixresults before undersampling

(e confusion matrix for the random forest was bothbiased and highly inaccurate For example out of a testingsample of 900 cases for the UCI Credit Card dataset 94 ofnegative cases were correctly classified and only 36 ofpositive cases were correctly classified (e Europeancardholdersrsquo transactions dataset had a testing sample of8545 transactions 999 of negative cases were correctlyclassified and 53 of positive cases were correctly classified

42 Posttreatment Test after Undersampling (e next phaseof the experiment was to apply the data-point level approachmethods on the dataset whereby to counteract the effect ofthe class imbalance undersampling was applied (e NearMiss technique was used to undersample the majority in-stances and made them equal to the minority class (e classwith majority has been decreased to the total number ofrecords in the minority class resulting in an equal number ofrecords for both classes (e treatment stage was an iterativeprocess the aim was to solve the problem of imbalanceddata therefore an in-depth review and analysis were con-ducted after each iteration to optimize the process

Table 9 shows the European Credit Card results for theclassification of the imbalanced datasets before applicationof the undersampling with Near Miss technique

(e dataset was balanced with a subset containing asample size of 98 instances evenly distributed between thetwo classes namely normal and fraudulent transactions(e accuracy score for the SVM classifier decreased from100 to 073 However the ability to predict positive classesimproved and the precision score for the positive classincreased from 000 to 100 a 100 improvement (e recallscore increased from 000 to 047 an improvement of 47which means that the SVM classifier could predict true

positives after undersampling with Near Miss even thoughthe percentage achieved is not ideal (e F1-score also in-creased from 000 to 064 and the improvement verifies theaccuracy of the test (e logistic regression reported anaccuracy score of 90 which is a decrease of 10 comparedto the results achieved before undersampling However theaverage precision increased from 048 to 087 which is anincrease of 39

(e increase in average precision reveals that eventhough accuracy decreased the overall predictive accuracyincreased (e increase in predictive accuracy is observed bythe increase in precision recall and F1-score for positiveclasses Precision increased from 057 to a decent 093 recallincreased from 047 to 087 and the F1-score increased from052 to 090 for the positive class (e negative class per-formed fairly well too even though the initial 100 accuracywas not achieved and the classifier was not biased on eitherclass (e precision was 088 the recall was 093 and the F1-score was 090 for the negative class (e random forestclassification was similar to the logistic regression whichalso reported an accuracy of 90 (e precision was 083 forthe negative class and 100 for the positive class (e recallwas 100 for the negative class and 080 for the positive class(e F1-score was 091 for the negative class and 089 for thepositive class

(e random forest performed better than all otherclassifiers before using undersampling but was closelymatched by the decision tree in second place However thedecision tress surpassed the random forest and gave the bestresults after undersampling with Near Miss (e decisiontree maintained an accuracy score of 100 and the averageprecision increased from 28 to 100 (e precision recalland F1-score for both the negative and positive classes wereimpressive 100 Based on these results the classificationreport of the European Credit Card dataset after under-sampling with Near Miss to solve the imbalance problemshowed a significant improvement in the ability to predictfraudulent transactions

Table 4 Confusion matrix table (s) blueprint

Predicted 0 Predicted 1Actual 0 TN FPActual 1 FN TP

Table 5 Confusion matrix of the SVM classifier

European Credit Card dataset UCI Credit Card datasetSVM Predicted 0 Predicted 1 SVM Predicted 0 Predicted 1Actual 0 8528 0 Actual-0 698 0Actual 1 17 0 Actual-1 202 0

Table 6 Confusion matrix of the logistic regression classifier

European Credit Card dataset UCI Credit Card datasetLR Predicted 0 Predicted 1 LR Predicted 0 Predicted 1Actual 0 8520 8 Actual 0 572 126Actual 1 9 8 Actual 1 128 74



(e SVM reported an accuracy score of 85 which is anincrease of 7 compared to the accuracy achieved beforeundersampling (e ability to predict the positive classimproved as the average precision increased from 022 to084 an improvement of 62 (e logistic regression ac-curacy decreased from 078 to 073 However the averageprecision improved from 036 to 079 (ese results showthat the logistic regression improved its ability to predictpositive classes (e decision tree reported an improvedaccuracy of 85 and the accuracy increased from 072 to085

(e average precision also increased from 028 to 081 animprovement of 53 (e random forest reported an ac-curacy score of 89 which was the highest out of the fourclassifiers (e average precision also increased from 037 to086 an improvement of 49 All the classifiers reportedimproved precision recall and F1-score after usingundersampling (e classification report for the UCI Credit

Card dataset revealed that there was an overall improvementin the ability to predict positive classes

421 ampe Confusion Matrix with the Data-Point ApproachTable 11 contains the SVM confusion matrix after under-sampling with Near Miss

Even though some confusion level still exists the effect ofNear Miss was observed on both datasets (e ability topredict positive cases improved by 46 on the EuropeanCredit Card dataset and improved by 73 on the UCI CreditCard dataset (e SVM confusion matrix showed im-provement in the ability to predict positive classes

Table 12 shows the confusion matrix of the logistic re-gression after undersampling with Near Miss

(ere was 100 predictive accuracy for negative casesand 87 for positive cases on the European cardholdersrsquotransactions (e UCI Credit Card dataset had an accuracyof 80 for negative classes and 66 for positive classes (econfusion matrix for the logistic regression model also

Table 7 Confusion matrix of the decision tree classifier

European Credit Card dataset UCI Credit Card datasetDT Predicted 0 Predicted 1 DT Predicted 0 Predicted 1Actual 0 8520 8 Actual 0 572 126Actual 1 9 8 Actual 1 128 74

Table 8 Confusion matrix of the random forest classifier

European Credit Card dataset UCI Credit Card datasetRF Predicted 0 Predicted 1 RF Predicted 0 Predicted 1Actual 0 8527 1 Actual 0 656 42Actual 1 8 9 Actual 1 129 73

Table 9 Performance of the European Credit Card dataset


SVM




Logistic regression




Decision tree




Random forest





shows that the Near Miss technique worked well for bothdatasets

Table 13 contains the decision tree confusion matrixafter undersampling with Near Miss

(ere was no confusion with 100 accuracy for bothclasses on the European cardholdersrsquo transactions dataset(at means the ability to predict positive classes improvedby 47 after undersampling with Near Miss (ereforeusing the Near Miss technique with the decision tree pro-duced the best results with the European cardholdersrsquotransactions dataset (ere was 85 accuracy for negativeclasses and 86 accuracy for positive classes on the UCICredit Card dataset

Table 14 shows that there was a predictive accuracy of100 on the European cardholdersrsquo transactions dataset and92 on the UCI Credit Card dataset for negative casesrespectively (ere was a predictive accuracy of 80 and86 respectively on both datasets for positive cases (erandom forest also performed well

43 ampe Precision-Recall Curve (e prediction score wasused to calculate the average precision (AP) At eachthreshold the weighted mean of precisions achieved withthe increase in recall from the preceding threshold used asthe weight is how AP summarizes a precision-recall curve[55]

AP 1113944n

lowast Rn minus Rnminus1( 1113857Pn (4)

(e average precision is calculated using the method abovewhere Pn and Rn are the precision and recall at the nththreshold respectively and precision and recall are always inthe range of zero to one As a result AP falls between 0 and 1AP is a metric used to quantify the accuracy of a classifier thecloser the number is to 1 the more accurate the classifier is A

precision-recall (P minus R) curve is a graph comparing precision(y-axis) with recall (x-axis) for various thresholds In cir-cumstances where the distribution between the two classes isunbalanced using both recall and precision to measure themodelrsquos prediction powers is beneficial [56]

(e following graphs represent the P minus R curves for therandom forest classifier on both datasets namely the Eu-ropean Credit Card dataset and the UCI Credit Card dataset(e P minus R curve was only presented to the best performingalgorithm for further analysis (e goal was to see if the P minus

R curve was pointing towards the chartrsquos upper right corner(e higher the quality is the closer the curve comes to thevalue of one in the upper right corner

431 ampe Precision-Recall Curve without Near MissFigure 3 shows the European Credit Card dataset precision-recall curve for random forest before the data-pointapproach

(e random forest precision-recall curve for the Euro-pean Credit Card dataset starts straight across the highestpoint and halfway through gradually start curving towardsthe lower right corner (e average precision was 066

Figure 4 shows the UCI Credit Card dataset precision-recall curve for random forest before the data-point approach

(e random forest P-R curve for the UCI Credit Carddataset gradually leaned towards the lower right corner fromthe beginning(e average precision was 037 and this can beobserved on the P-R curve (e performance was better onEuropean Credit Card dataset but was not consistent acrossboth datasets However both the above results show poorquality in the ability to predict positive classes for bothdatasets (e P-R curve is a simple way to analyze the qualityof a classifier without having to perform complex analysis(e next step was to apply the data-point approach andobserve the change in quality



SVM




Logistic regression




Decision tree




Random forest





432ampe Precision-Recall Curve with Near Miss (e figuresbelow show the precision-recall curve after treatment usingfeature selection with the Near Miss-based undersamplingtechnique was applied A P minus R curve is a brilliant way to seea graphical representation of a classifierrsquos quality (e P minus R

curves show the improvement in the quality of the classifiersafter using the data-point approach

Figure 5 shows the European Credit Card dataset pre-cision-recall curve for random forest before the data-pointapproach

Figure 5 shows the random forest P minus R curve on theEuropean Credit Card dataset (e classifier improved by33 as the average precision increased from 066 and 100indicated by the straight line on the value of 1 across the y-axis

Figure 6 shows the UCI Credit Card dataset precision-recall curve for random forest before the data-pointapproach

Figure 6 shows the random forest P minus R curve on theUCI Credit Card dataset (e curve starts straight on the

Table 11 Confusion matrix of the SVM classifier with Near Miss

European Credit Card dataset UCI Credit Card datasetSVM Predicted 0 Predicted 1 SVM Predicted 0 Predicted 1Actual 0 15 0 Actual 0 191 6Actual 1 8 7 Actual 1 56 153

Table 12 Logistic regression confusion matrix after undersampling with Near Miss


Table 13 Confusion matrix of the decision tree after undersampling with Near Miss


Table 14 Random forest confusion matrix after undersampling with Near Miss


00

02

04

06

08

10

12

00 02 04 06 08 10

Prec

ision

Recall

Random forest classifier (AP = 066)

Figure 3 European Credit Card dataset (RF) precision-recall curve


value of 1 on the y-axis moving across the x-axis and endsby a gentle fall while leaning towards the upper right corner(e average precision increased from 028 to 081 Both theresults indicate great quality

A P minus R curve that is a straight line on the y-axis value of1 across the x-axis such as Figure 5 of the random forestwith the European Credit Card dataset represents the bestpossible quality A P minus R curve that is leaning more towardsthe upper right corner is also a sign that the classifier hasgood quality such as Figure 6 on the UCI Credit Carddatasets

5 Conclusions

All the algorithms scored an average score of 100 for le-gitimate cases with the European cardholderrsquos credit cardtransactions dataset (D1) and an average score of 087 withthe UCI Credit Card dataset (D2) for the precision recalland F1-score (ese results indicate that the majority classwas dominant due to the imbalance level and the challengeis successfully anticipating the minority class

Recording an average precision score of 077 and anaverage recall score of 045 the random forest model was the

00 02 04 06 08 10Pr

ecisi

onRecall

00

02

04

06

08

12

10


Figure 4 UCI Credit Card dataset (RF) precision-recall curve

00 02 04 06 08 10Recall


00

02

04

06

08

10

12

Prec

ision

Figure 5 European Credit Card dataset (RF with Near Miss) precision-recall curve

00 02 04 06 08 10Recall


Prec

ision

00

02

04

06

08

12

10

Figure 6 UCI Credit Card dataset (RF with Near Miss) precision-recall curve


best performer for detecting minority classes in the weightedaverage classification report with both original datasetsHowever comparing both precisions and recall scores showsthat the model did not perform well (e combined cal-culated average precision of 043 was used to further validatethe model indicating that it was not generating optimalresults and that additional treatment was required In bothdatasets the SVMmodel performed the worst with accuracyand recall scores of 000 Due to the uneven class distri-bution the SVM model was biased and utterly failed toidentify minority classes with a score of 000

(e average precision score for the positive class im-proved by 98 for SVM 495 for decision tree 195 forrandom forest and 55 for logistic regression after utilizingundersampling with the Near Miss approach (e recallscore for the positive class shows that that the strength ofidentifying true positive (which are actually fraudulentcases) improved by 60 for SVM 515 for logistic re-gression 51 for the decision tree and 385 for randomforest and improved their ability to identify true positive(fraudulent cases) by 60 for SVM 515 for logistic re-gression 51 for the decision tree and 385 for randomforest F1-score improved by 735 for SVM 525 forlogistic regression 505 for decision tree and 325 forrandom forest in the positive class according to the findingsWhen the capacity to detect affirmative classes was im-proved the F1-score improved as well After using the data-point approach the predicting accuracy improved for all thealgorithms on both datasets Using a determined averagescore of accuracy recall and F1-score for each classifier therandom forest method is the leading algorithm Orderedfrom best to worst the performance of the machine learningtechniques were as follows random forest decision treelogistic regression and SVM

(e findings reveal that when the data is significantlyskewed the model has difficulty detecting fraudulenttransactions (ere was a considerable improvement inthe capacity to forecast positive classes after applying thehybrid data-point strategy combining feature selectionand the Near Miss-based undersampling technique Basedon the findings the hybrid data-point approach improvedthe predictive accuracy of all the four algorithms used inthis study However even though there was a significantimprovement on all classification algorithms the resultsrevealed that the proposed method with the random forestalgorithm produced the best performance on the twocredit card datasets

(e findings of this study can be used in future researchto look at developing and deploying a real-time system thatcan detect fraud while the transaction is taking place

Data Availability

(e data on credit card fraud are available online at httpswwwkagglecommlg-ulbcreditcardfraudhome

Conflicts of Interest

(e authors declare that there are no conflicts of interest

Acknowledgments

(e authors acknowledge the Durban University of Tech-nology for making funding opportunities and materials forexperiments available for this research project

References

[1] Sabric Annual Crime Stats 2019 httpswwwsabriccozamedia-and-newspress-releasessabric-annual-crime-stats-2019

[2] K Randhawa C K Loo M Seera C P Lim and A K NandildquoCredit card fraud detection using AdaBoost and majorityvotingrdquo IEEE Access vol 6 no 1 pp 14277ndash14284 2018

[3] J West and M Bhattacharya ldquoSome experimental issues infinancial fraud miningrdquo Procedia Computer Science vol 80no 1 pp 1734ndash1744 2016

[4] MWasikowski and X-w Chen ldquoCombating the small sampleclass imbalance problem using feature selectionrdquo IEEETransactions on Knowledge and Data Engineering vol 22no 10 pp 1388ndash1400 2010

[5] W Lin Z Wu L Lin A Wen and J Li ldquoAn ensemblerandom forest algorithm for insurance big data analysisrdquoIEEE Access vol 5 pp 16568ndash16575 2017

[6] E M Hassib A I El-Desouky E-S M El-Kenawy andS M El-Ghamrawy ldquoAn imbalanced big data miningframework for improving optimization algorithms perfor-mancerdquo IEEE Access vol 7 no 1 pp 170774ndash170795 2019

[7] D Chen X-J Wang C Zhou and B Wang ldquo(e distance-based balancing ensemble method for data with a high im-balance ratiordquo IEEE Access vol 7 no 1 pp 68940ndash689562019

[8] S A Shevchik F Saeidi B Meylan and K Wasmer ldquoPre-diction of failure in lubricated surfaces using acoustic time-frequency features and random forest algorithmrdquo IEEETransactions on Industrial Informatics vol 13 no 4pp 1541ndash1553 2017

[9] I Sadgali N Sael and F Benabbou ldquoPerformance of machinelearning techniques in the detection of financial fraudsrdquoProcedia Computer Science vol 148 no 1 pp 45ndash54 2019

[10] A Adedoyin ldquoPredicting fraud in mobile money transferrdquoIEEE Transactions On Knowledge and Data Engineeringvol 28 no 6 pp 1ndash203 2016

[11] T Hasanin T M Khoshgoftaar J Leevy and N SeliyaldquoInvestigating random undersampling and feature selectionon bioinformatics big datardquo in Proceedings of the 2019 IEEEFifth International Conference on Big Data Computing Serviceand Applications (Big Data Service) pp 346ndash356 NewarkCA USA November 2019

[12] K Sotiris K Dimitris and P Panayiotis ldquoHandling imbal-anced datasets a reviewrdquo GESTS International Transactionson Computer Science and Engineering vol 30 no 1 pp 1ndash122016

[13] T M Khoshgoftaar J Van Hulse and A Napolitano ldquoSu-pervised neural network modeling an empirical investigationinto learning from imbalanced data with labeling errorsrdquoIEEE Transactions on Neural Networks vol 21 no 5pp 813ndash830 2010

[14] B Pes ldquoLearning from high-dimensional biomedical datasetsthe issue of class imbalancerdquo IEEE Access vol 8 no 1pp 13527ndash13540 2020

[15] G Ditzler and R Polikar ldquoIncremental learning of conceptdrift from streaming imbalanced datardquo IEEE Transactions On


Knowledge and Data Engineering vol 25 no 10 pp 2283ndash2301 2013

[16] Z Zhou and X Liu ldquoTraining cost-sensitive neural networkswith methods addressing the class imbalance problemrdquo IEEETransactions on Knowledge and Data Engineering vol 18no 1 pp 63ndash77 2006

[17] L L Minku S Wang and X Yao ldquoOnline ensemble learningof data streams with gradually evolved classesrdquo IEEETransactions on Knowledge and Data Engineering vol 28no 6 pp 1532ndash1545 2016

[18] S Wang L L Minku and X Yao ldquoResampling-based en-semble methods for online class imbalance learningrdquo IEEETransactions on Knowledge and Data Engineering vol 27no 5 pp 1356ndash1368 2015

[19] X Liu J Wu and Z Zhou ldquoExploratory undersampling forclass-imbalance learningrdquo IEEE Transactions on SystemsMan and Cybernetics Part B (Cybernetics) vol 39 no 2pp 539ndash550 2009

[20] C Jiang J Song G Liu L Zheng and W Luan ldquoCredit cardfraud detection a novel approach using aggregation strategyand feedback mechanismrdquo IEEE Internet of ampings Journalvol 5 no 5 pp 3637ndash3647 2018

[21] A Onan ldquoConsensus clustering-based undersampling ap-proach to imbalanced learningrdquo Scientific Programmingvol 2019 Article ID 5901087 2019

[22] A Onan and S Korukoglu ldquoA feature selection model basedon genetic rank aggregation for text sentiment classificationrdquoJournal of Information Science vol 43 no 1 pp 25ndash38 2017

[23] A Onan ldquoSentiment analysis on Twitter based on ensemble ofpsychological and linguistic feature setsrdquo Balkan Journal ofElectrical and Computer Engineering vol 6 no 2 pp 69ndash772018

[24] P Borah and D Gupta ldquoRobust twin bounded support vectormachines for outliers and imbalanced datardquo Applied Intelli-gence vol 51 no 3 pp 1ndash30 2021

[25] D Gupta and B Richhariya ldquoEntropy based fuzzy leastsquares twin support vector machine for class imbalancelearningrdquo Applied Intelligence vol 48 no 11 pp 4212ndash42312018

[26] D Gupta P Borah andM Prasad ldquoA fuzzy based Lagrangiantwin parametric-margin support vector machine(FLTPMSVM)rdquo in Proceedings of the 2017 IEEE symposiumseries on computational intelligence (SSCI) pp 1ndash7 IEEEHonolulu HI USA November 2017

[27] B B Hazarika and D Gupta ldquoDensity-weighted supportvector machines for binary class imbalance learningrdquo NeuralComputing and Applications vol 33 pp 1ndash19 2020

[28] X Zhang C Zhu H Wu Z Liu and Y Xu ldquoAn imbalancecompensation framework for background subtractionrdquo IEEETransactions on Multimedia vol 19 no 11 pp 2425ndash24382017

[29] C Seiffert T M Khoshgoftaar J Van Hulse andA Napolitano ldquoRUSBoost a hybrid approach to alleviatingclass imbalancerdquo IEEE Transactions on Systems Man andCybernetics-Part A Systems and Humans vol 40 no 1pp 185ndash197 2010

[30] L Bao C Juan J Li and Y Zhang ldquoBoosted Near-missunder-sampling on SVM ensembles for concept detection inlarge-scale imbalanced datasetsrdquo Neurocomputing vol 172no 1 pp 198ndash206 2016

[31] M Peng Q Zhang X Xing et al ldquoTrainable undersamplingfor class-imbalance learningrdquo Proceedings of the AAAIConference on Artificial Intelligence vol 33 no 01pp 4707ndash4714 2019

[32] Imbalanced-learn 2020 httpsimbalanced-learnreadthedocsioenstable

[33] L Zheng G Liu C Yan and C Jiang ldquoTransaction frauddetection based on total order relation and behavior diver-sityrdquo IEEE Transactions on Computational Social Systemsvol 5 no 3 pp 796ndash806 2018

[34] S Patil V Nemade and P K Soni ldquoPredictive modelling forcredit card fraud detection using data analyticsrdquo ProcediaComputer Science vol 132 no 1 pp 385ndash395 2018

[35] A Tarjo and N Herawati ldquoApplication of beneish M-scoremodels and data mining to detect financial fraudrdquo Procedia-Social and Behavioral Sciences vol 211 no 1 pp 924ndash9302015

[36] A Somasundaran and U S Reddy ldquoData imbalance effectsand solutions for classification of large and highly imbalanceddatardquo Proceedings of the 1st International Conference onResearch in Engineering Computers and Technology vol 25no 10 pp 28ndash34 2016

[37] Google Colaboratory Frequently Asked Questions 2019httpsresearchgooglecomcolaboratoryfaqhtml

[38] Scikit Learn 2020 httpsscikit-learnorgstablesupervised_learninghtml

[39] M Albashrawi ldquoDetecting financial fraud using data miningtechniques a decade review from 2004 to 2015rdquo Journal ofData Science vol 14 no 1 pp 553ndash570 2016

[40] G Baader and H Krcmar ldquoReducing false positives in frauddetection combining the red flag approach with processminingrdquo International Journal of Accounting InformationSystems vol 31 no 1 pp 1ndash16 2018

[41] C-T Su and Y-H Hsiao ldquoAn evaluation of the robustness ofMTS for imbalanced datardquo IEEE Transactions On Knowledgeand Data Engineering vol 19 no 10 pp 1321ndash1332 2007

[42] R Batuwita and V Palade ldquoFSVM-CIL fuzzy support vectormachines for class imbalance learningrdquo IEEE Transactions onFuzzy Systems vol 18 no 3 pp 558ndash571 2010

[43] S Subudhi and S Panigrahi ldquoQuarter-sphere support vectormachine for fraud detection in mobile telecommunicationnetworksrdquo Procedia Computer Science vol 48 no 1pp 353ndash359 2015

[44] Z Liu T Wen W Sun and Q Zhang ldquoSemi-supervised self-training feature weighted clustering decision tree and randomforestrdquo IEEE Access vol 8 pp 128337ndash128348 2020

[45] E Scornet ldquoRandom forests and Kernel methodsrdquo IEEETransactions on Information ampeory vol 62 no 3pp 1485ndash1500 2016

[46] J T Raj ldquoWhat to do when your classification data is im-balancedrdquo 2019 httpstowardsdatasciencecomwhat-to-do-when-your-classification-dataset-is-imbalanced-6af031b12a36

[47] A Shen R Tong and Y Deng ldquoApplication of classificationmodels on credit card fraud detectionrdquo in Proceedings of the2007 International Conference on Service Systems and ServiceManagement pp 1ndash4 Chengdu China June 2007

[48] S M S Askari and M A Hussain ldquoCredit card fraud de-tection using fuzzy ID3rdquo in Proceedings of the 2017 Inter-national Conference on Computing Communication andAutomation (ICCCA) pp 446ndash452 Noida India May 2017

[49] J Li H He and L Li ldquoCGAN-MBL for reliability assessmentwith imbalanced transmission gear datardquo IEEE Transactionson Instrumentation and Measurement vol 68 no 9pp 3173ndash3183 2019

[50] C-C Lin D-J Deng C-H Kuo and L Chen ldquoConcept driftdetection and adaption in big imbalance industrial IoT datausing an ensemble learningmethod of offline classifiersrdquo IEEEAccess vol 7 no 1 pp 56198ndash56207 2019


[51] J Shao ldquoLinear model selection by cross-validationrdquo Journalof the American Statistical Association vol 88 no 422pp 486ndash494 1993

[52] P Zhang ldquoOn the distributional properties of model selectioncriteriardquo Journal of the American Statistical Associationvol 87 no 419 pp 732ndash737 1992

[53] M Kubat and S Matwin ldquoAddressing the curse of imbalancedtraining sets one-sided selectionrdquo ICML vol 97 no 1pp 179ndash186 1997

[54] A Rashad S Riaz and L Jiao ldquoSemi-supervised deep fuzzyC-mean clustering for imbalanced multi-class classificationrdquoIEEE Access vol 7 no 1 pp 28100ndash28112 2019

[55] J Wei Z Lu K Qiu P Li and H Sun ldquoPredicting drug Risklevel from adverse drug reactions using SMOTE and machinelearning approachesrdquo IEEE Access vol 8 no 1pp 185761ndash185775 2020

[56] R Yao J Li M Hui L Bai and Q Wu ldquoFeature selectionbased on random forest for partial discharges characteristicsetrdquo IEEE Access vol 8 Article ID 159151 2020


[5] Learning algorithms operate on the assumption that thedata is evenly distributed hence imbalanced data is ac-knowledged as part of the fundamental issues in the field ofdata analytics and data science [6]

(e science of building and implementing algorithmsthat can learn patterns from previous events is known asmachine learning [7] (e machine learning classifiers canbe trained by continually feeding input data and assessingtheir performances A machine learning classification so-lution employs sophisticated algorithms that loop over bigdatasets and evaluate data patterns [8] In machinelearning the ideal solution must have low bias and shouldaccurately model the true relationship of positive andnegative classes Machine learning classifiers tend to per-form well with a specific dataset that has been manipulatedto suit the classifier [9] (e use of one dataset tends to havea bias as the data could be manipulated to support theclassification solution (is is commonly referred to asoverfitting in machine learning (e ideal binary classifi-cation solutions should have low variability by producingconsistent predictions across different datasets [10] (egoal of this work was to conduct a thorough investigation ofthe impact of employing a hybrid data-point strategy tohandle the misclassification problem in credit card datasetsthat were imbalanced Oversampling undersampling andfeature selection are examples of strategies for resamplingdata used in dealing with imbalanced classes at the data-point level [11] (e data-point level technique accordingto Sotiris Dimitris and Panayiotis in [12] includes datainterventions to lessen the impact of imbalanced datasetsand it is flexible enough to proceed with modern classifierssuch as logistic regression decision trees and supportvector machines

We present in this paper a hybrid method that amal-gamates the advantages of the random forest and a hybriddata-point technique to deal with the problem of imbalancelearning in credit card fraud Random forest used for pre-diction has the advantage of being able to manage datasetswith several predictor variables We further combinedfeature selection using correlation coefficients in order tomake it easier for machine learning to classify with NearMiss-based undersampling technique Using well-knownperformance metrics the model outperformed other rec-ognised models

2 Related Works

Machine learning models work well when the datasetcontains evenly distributed classes known as a balanceddataset [13] Pes in [14] looked at the efficacy of hybridlearning procedures that combine dimensionality reductionand ways for dealing with class imbalance (e researchcombines univariate and multivariate feature selectionstrategies with cost-sensitive classification and sampling-based class balance methods [14](e dominance of blendedlearning strategies presented a dependable choice for re-current random sampling and investigations proved thathybrid learning strategies outperformed feature selectionsolely for unbalanced datasets [14] Several studies analyzed

and compared existing financial fraud detection algorithmsin order to find the most effective strategy [9 15 16] Usingthe confusion matrix Zhou and Liu in [16] discovered thatthe random forest model outperformed logistic regressionand decision tree based on accuracy precision and recallmatrices Albashrawi et al in [9] found that the logisticregression model is the superior data mining tool fordetecting financial fraud

A paper by Minku et al in [17] looked into the scenarioof classes progressively appearing or disappearing (eClass-Based Ensemble for Class Evolution (CBCE) wassuggested as a class-based ensemble technique CBCE canquickly adjust to class evolution by keeping a base learner foreach class and constantly updating the basic learners withnew data To solve the dynamic class imbalance probleminduced by the steady growth of classes the study developeda novel undersampling strategy for the base learners (eempirical investigations according to Minku et al in [17]revealed the efficiency of CBCE in various class evolutionscenarios when compared to the existing class evolutionadaptation method CBCE responded well to all three sce-narios of class evolution as compared to previous ap-proaches (ie occurrence disappearance and reoccurrenceof classes) (e empirical analysis confirmed under-samplingrsquos dependability and CBCE demonstrated that itbeats other recognised class evolution adaptation algo-rithms not only in terms of the ability to adjust to variedevolution scenarios but also in terms of overall classificationperformance Two learning algorithms were proposed byWang et al in [18] Undersampling-based Online Bagging(UOB) and Oversampling-based Online Bagging (OOB)devised an ensemble approach to overcome the class im-balance in real-time using time-decayed and resamplingmetrics (e study also focused on the performance of OOBand UOBs resampling strategies in both static and dynamicdata streams to see how they could be improved In terms ofdata distributions imbalance rates and changes in classimbalance status their work provides the first compre-hensive examination of class imbalance in data streamsAccording to the findings UOB is superior at detectingminority class cases in static data streams while OOB isbetter at resisting fluctuations in class imbalance status (esupply of data was discovered to be a crucial elementimpacting their performance and more research wasrequired

Liu and Wu experimented with two strategies to avoidthe drawbacks of employing undersampling to deal withclass imbalance When undersampling is used most ma-jority classes are disregarded which is a flaw As a resultEasy-Ensemble and Balance-Cascade were proposed in thestudy Easy-Ensemble breaks the majority class into nu-merous smaller chunks then uses each chunk to train thelearner independently and finally all of the learnersrsquo outputsare combined [19] Balance-Cascade employs a sequentialtraining strategy in which the majority classrsquos properlyclassified instances are excluded from further evaluation inthe next series [19] According to the data the Easy-En-semble and the Balance-Cascade had higher G-meanF-measure and AUC values than other existing techniques

































Fraudcases fraud


492284807

lowast 100 0173

(1)





Fraudcases fraud


663630000lowast 100 2212

(2)
















Step 7 end


0

50000

100000

150000

200000

250000

Class

Freq

uenc

y

Normal Fraud


0

5000

10000

15000

20000

Class

Freq

uenc

y

Normal Fraud


























SVM




Logistic regression




Decision tree




Random forest





Measure N P

SVM




Logistic regression




Decision tree




Random forest



































SVM




Logistic regression




Decision tree




Random forest










AP 1113944n












SVM




Logistic regression




Decision tree




Random forest



















00

02

04

06

08

10

12

00 02 04 06 08 10

Prec

ision

Recall






5 Conclusions



00 02 04 06 08 10Pr

ecisi

onRecall

00

02

04

06

08

12

10



00 02 04 06 08 10Recall


00

02

04

06

08

10

12

Prec

ision


00 02 04 06 08 10Recall


Prec

ision

00

02

04

06

08

12

10







Data Availability




Acknowledgments


References




























































































Fraudcases fraud


492284807

lowast 100 0173

(1)





Fraudcases fraud


663630000lowast 100 2212

(2)
















Step 7 end


0

50000

100000

150000

200000

250000

Class

Freq

uenc

y

Normal Fraud


0

5000

10000

15000

20000

Class

Freq

uenc

y

Normal Fraud


























SVM




Logistic regression




Decision tree




Random forest





Measure N P

SVM




Logistic regression




Decision tree




Random forest



































SVM




Logistic regression




Decision tree




Random forest










AP 1113944n












SVM




Logistic regression




Decision tree




Random forest



















00

02

04

06

08

10

12

00 02 04 06 08 10

Prec

ision

Recall






5 Conclusions



00 02 04 06 08 10Pr

ecisi

onRecall

00

02

04

06

08

12

10



00 02 04 06 08 10Recall


00

02

04

06

08

10

12

Prec

ision


00 02 04 06 08 10Recall


Prec

ision

00

02

04

06

08

12

10







Data Availability




Acknowledgments


References


































































Step 7 end


0

50000

100000

150000

200000

250000

Class

Freq

uenc

y

Normal Fraud


0

5000

10000

15000

20000

Class

Freq

uenc

y

Normal Fraud


























SVM




Logistic regression




Decision tree




Random forest





Measure N P

SVM




Logistic regression




Decision tree




Random forest



































SVM




Logistic regression




Decision tree




Random forest










AP 1113944n












SVM




Logistic regression




Decision tree




Random forest



















00

02

04

06

08

10

12

00 02 04 06 08 10

Prec

ision

Recall






5 Conclusions



00 02 04 06 08 10Pr

ecisi

onRecall

00

02

04

06

08

12

10



00 02 04 06 08 10Recall


00

02

04

06

08

10

12

Prec

ision


00 02 04 06 08 10Recall


Prec

ision

00

02

04

06

08

12

10







Data Availability




Acknowledgments


References




















































































SVM




Logistic regression




Decision tree




Random forest





Measure N P

SVM




Logistic regression




Decision tree




Random forest



































SVM




Logistic regression




Decision tree




Random forest










AP 1113944n












SVM




Logistic regression




Decision tree




Random forest



















00

02

04

06

08

10

12

00 02 04 06 08 10

Prec

ision

Recall






5 Conclusions



00 02 04 06 08 10Pr

ecisi

onRecall

00

02

04

06

08

12

10



00 02 04 06 08 10Recall


00

02

04

06

08

10

12

Prec

ision


00 02 04 06 08 10Recall


Prec

ision

00

02

04

06

08

12

10







Data Availability




Acknowledgments


References



























































































SVM




Logistic regression




Decision tree




Random forest










AP 1113944n












SVM




Logistic regression




Decision tree




Random forest



















00

02

04

06

08

10

12

00 02 04 06 08 10

Prec

ision

Recall






5 Conclusions



00 02 04 06 08 10Pr

ecisi

onRecall

00

02

04

06

08

12

10



00 02 04 06 08 10Recall


00

02

04

06

08

10

12

Prec

ision


00 02 04 06 08 10Recall


Prec

ision

00

02

04

06

08

12

10







Data Availability




Acknowledgments


References


































































AP 1113944n












SVM




Logistic regression




Decision tree




Random forest



















00

02

04

06

08

10

12

00 02 04 06 08 10

Prec

ision

Recall






5 Conclusions



00 02 04 06 08 10Pr

ecisi

onRecall

00

02

04

06

08

12

10



00 02 04 06 08 10Recall


00

02

04

06

08

10

12

Prec

ision


00 02 04 06 08 10Recall


Prec

ision

00

02

04

06

08

12

10







Data Availability




Acknowledgments


References











































































00

02

04

06

08

10

12

00 02 04 06 08 10

Prec

ision

Recall






5 Conclusions



00 02 04 06 08 10Pr

ecisi

onRecall

00

02

04

06

08

12

10



00 02 04 06 08 10Recall


00

02

04

06

08

10

12

Prec

ision


00 02 04 06 08 10Recall


Prec

ision

00

02

04

06

08

12

10







Data Availability




Acknowledgments


References































































5 Conclusions



00 02 04 06 08 10Pr

ecisi

onRecall

00

02

04

06

08

12

10



00 02 04 06 08 10Recall


00

02

04

06

08

10

12

Prec

ision


00 02 04 06 08 10Recall


Prec

ision

00

02

04

06

08

12

10







Data Availability




Acknowledgments


References

































































Data Availability




Acknowledgments


References





























































SolvingMisclassificationoftheCreditCardImbalanceProblem ...imbalance status, their work provides the...

Documents

Transcript of SolvingMisclassificationoftheCreditCardImbalanceProblem ...imbalance status, their work provides the...