
Ph.D. Thesis

Credit Scoring Method and System Development for

Imbalanced Datasets

A thesis submitted in fulfillment of the requirements for the degree of

DOCTOR OF SCIENCE AND TECHNOLOGY

By

Xiying Hao

Graduate School of Symbiotic Systems Science and Technology

Fukushima University

2014


ACKNOWLEDGEMENTS

This thesis would not have been possible without the support and guidance of a number of important people, whom I would like to take this opportunity to acknowledge and thank.

First of all, I would like to thank my professor, Yanwen Dong, who has been of unwavering support throughout my time at Fukushima University. Without his tutorials and expert knowledge in the field of credit risk modeling, I could not have achieved the work conducted in this thesis.

Special thanks also to Professor Katsushige Fujimoto and Professor Shoichi Nakamura for their helpful discussions and valuable review comments.

I would also like to thank my friends and my parents for their helpful suggestions and

encouragement.


CONTENTS

ACKNOWLEDGEMENTS ...................................................................................................... i

CHAPTER 1 INTRODUCTION ........................................................................................... 1

CHAPTER 2 LITERATURE REVIEW ............................................................................... 7

2.1 Credit Scoring .................................................................................................................. 8

2.1.1 Credit scoring and methodologies ............................................................................. 8

2.1.2 Current research ...................................................................................................... 11

2.2 Class Imbalance Problem ............................................................................................... 14

2.2.1 The problem of imbalanced datasets ....................................................................... 14

2.2.2 Methods for dealing with imbalanced data sets ...................................................... 17

2.3 Issues and Aims of This Study ....................................................................................... 20

CHAPTER 3 METHODOLOGIES ..................................................................................... 22

3.1 Classification Techniques .............................................................................................. 23

3.1.1 k-nearest neighbors (k-NN) ..................................................................................... 23

3.1.2 k-means algorithms ................................................................................................. 24

3.1.3 Decision tree (C4.5) ................................................................................................ 24

3.1.4 Artificial neural networks ........................................................................................ 24

3.2 Ensemble Learning ......................................................................................................... 26

3.2.1 Bagging ................................................................................................................... 26

3.2.2 Boosting .................................................................................................................. 26

3.2.3 Stacking ................................................................................................................... 26

3.2.4 Random forests ........................................................................................................ 27

3.3 Learning from Class Imbalance Data Sets ..................................................................... 27

3.3.1 Sampling Methods ................................................................................................... 27

3.3.2 Cost-sensitive learning ............................................................................................ 29

CHAPTER 4 EVALUATION MEASURES ....................................................................... 31

4.1 Sensitivity, Specificity and Geometric Mean ................................................................. 33

4.2 Type I and Type II errors ............................................................................................... 33

4.3 Integrated Performance Measures .................................................................................. 33

CHAPTER 5 DATA SETS ................................................................................................... 35


5.1 Credit Datasets in a Small Company.............................................................................. 36

5.1.1 Credit assessment problem ...................................................................................... 36

5.1.2 Features of the customers ........................................................................................ 36

5.1.3 Data sets summary .................................................................................................. 37

5.2 German Credit Data Set ................................................................................................. 38

CHAPTER 6 A TWO-STAGE DATA RESAMPLING METHOD FOR CREDIT

SCORING ....................................................................................................... 40

6.1 Background and Purpose of This Study ......................................................................... 41

6.2 System Design ................................................................................................................ 43

6.2.1 Scheme system ........................................................................................................ 44

6.2.2 Training data generating .......................................................................................... 44

6.2.3 Learning and classification ........................................................................ 45

6.3 The Application for a Real Credit Scoring Problem ...................................................... 47

6.4 Performance Comparison ............................................................................................... 48

6.5 Concluding Remarks ...................................................................................................... 52

CHAPTER 7 AN ADAPTIVE AND HIERARCHICAL SYSTEM FOR CREDIT

SCORING ....................................................................................................... 54

7.1 The Purpose of This Study ............................................................................................. 55

7.2 The Concept and Scheme of The System ....................................................................... 55

7.3 Systematic Constructing Procedures .............................................................................. 57

7.4 Application to Practical Problem ................................................................................... 58

7.5 System Performance and Discussion ............................................................................. 59

7.5.1 Ability for classification .......................................................................................... 59

7.5.2 Ability for prediction ............................................................................. 60

7.6 Comparison with Other Methods ................................................................................... 61

7.6.1 Comparison with neural networks and decision tree .............................................. 61

7.6.2 Comparison with the parallel ensemble system ...................................................... 62

7.6.3 Type I and Type II errors ........................................................................................ 63

7.6.4 Comparison with other methods ............................................................................. 64

7.7 Concluding Remarks and Discussion ............................................................................. 65


CHAPTER 8 AN INVESTIGATION INTO THE RELATIONSHIP BETWEEN

CLASSIFICATION PERFORMANCE AND DEGREE OF IMBALANCE

.......................................................................................................................... 67

8.1 Background and Aims of This Study ............................................................................. 68

8.2 Experimental Design ...................................................................................................... 70

8.2.1 Training data set generating ...................................................................... 70

8.2.2 Selection of classification techniques ..................................................... 71

8.2.3 Parameter tuning .................................................................................... 72

8.2.4 Statistical comparison of classifiers .......................................................... 73

8.3 Experiment Results and Discussion ............................................................................... 74

8.4 Concluding Remarks ...................................................................................................... 77

CHAPTER 9 CONCLUSIONS ............................................................................................ 79

REFERENCES ....................................................................................................................... 82


CHAPTER 1

INTRODUCTION


Introduction

In today’s increasingly competitive business environment, all companies are exposed to different kinds of risk, but the most challenging risk, one that can cause a company to fail, is

credit risk. Credit risk is most simply defined as the potential that a counterparty will fail to

meet its obligations in accordance with agreed terms. Because there are many types of

counterparties -- from individuals to sovereign governments -- and many different types of

obligations -- from auto loans to derivatives transactions -- credit risk takes many forms.

Recently, credit scoring has emerged as a leading method to assess credit risk. The main

idea of credit scoring is to accurately and efficiently quantify the level of credit risk associated

with the counterparties. The credit scoring model’s objective is to predict future behavior in

terms of credit risk by relying on past experience of counterparties with similar characteristics.

The level of credit risk of a counterparty is associated with the probability that it will fail to

meet its obligations. The main task of a credit scoring model is to discriminate between counterparties that default and those that do not. Discrimination ability is the key indicator of a model's success: the higher the discrimination power, the more precise the credit scoring model will be.

A wide range of classification techniques has already been proposed in the credit scoring

literature, including statistical techniques, such as linear discriminant analysis and logistic

regression, and non-parametric models, such as k-nearest neighbors and decision trees. But there are still several issues to be addressed.

(1) Availability of models

Current models and methods are applied mainly to financial institutions, and little of the literature focuses on small and medium enterprise credit scoring. However, small and medium enterprises (SMEs) play an important role in the economies of many countries all over the world, yet they are financially weak and easily affected, even bankrupted, by their


partners and the partners' good or bad financial status. Monitoring SME counterparts/customers is therefore a new challenge for us.

(2) Class imbalance problem

In the credit scoring context, imbalanced data sets frequently occur when there are significantly fewer training instances of one class compared to the other classes, making that class harder to learn correctly. Most importantly, this minority class is usually the one of highest interest, because in real business it represents the actual loss. It is therefore especially important to identify the minority class in order to minimize credit risk.

(3) Strong relationship between model performance and data set characteristics

There are often conflicting opinions when comparing the conclusions of studies promoting

different techniques. That is because many empirical studies only evaluate a small number of

classification techniques on a single credit scoring data set. The data sets used in these

empirical studies are also often far smaller and less imbalanced than those data sets used in

practice. Hence, the issue of which classification techniques to use for credit scoring,

particularly with extremely imbalanced data sets, remains a challenging problem.

In this thesis, we will address these three problems and make some contributions as

follows:

(1) Different from existing research, our study contributes to the literature on small

and medium enterprise credit scoring.

The difficulty of credit assessment for small and medium enterprises is that they cannot obtain financial data or other such information from their counterparts/customers. Hence, in this thesis we propose new approaches that assess customers' credit based only on daily transaction data. These characteristic data can be extracted from the database of a small-business management information system. The proposed approaches are suitable for many organizations whose customers do not disclose their financial data, and they are also very easy to incorporate into existing information systems. Because they rely only on daily transaction data, our approaches can be used in a wide variety of firms.

(2) For the issue of imbalanced credit scoring data sets, the aim of our study is to

improve the ability for identifying the minority class.

Most learning algorithms obtain a high predictive accuracy over the majority class, but

predict poorly over the minority class. Furthermore, the examples in the minority class can be treated as noise and might be completely ignored by the classifier. How to improve the classification performance on the minority class has therefore become a new challenge. This thesis


presents a two-stage data resampling method and an adaptive and hierarchical system to improve performance on the minority class. As a unique resampling method, we use the k-means algorithm to perform under-sampling on the majority class of customers. Then, in order to avoid losing information, we introduce a pre-classification step to pick up customers of the majority class whose information could not be reflected in the previous under-sampling result. The adaptive and hierarchical system, in turn, can choose the best method adaptively based on the accuracy of identifying customers at every credit score.

(3) Carry out an investigation into the relationship between classification performance and degree of imbalance.

Many studies indicate that the class imbalance problem is actually a relative problem that depends on the degree of class imbalance; however, how the performance of classification techniques is affected by different degrees of class imbalance has not been discussed. In this study, our focus is on the performance of classification techniques on data sets with different degrees of imbalance. We set out to compare several techniques at varying degrees of class imbalance and, after comparing their effectiveness, to find the most appropriate one under different scenarios. For this study, the class of bad observations in each of the training data sets was artificially reduced so as to create different class imbalances. The class (good/bad) distributions ranged from 70/30 to 99/1.
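To make this procedure concrete, the following minimal Python sketch shows one way a training set with a target good/bad distribution could be derived by sub-sampling the bad class; the helper function, the column name "label" and the class values "good"/"bad" are illustrative assumptions rather than the exact code used in this study.

```python
# Illustrative sketch only: keep all good observations and sub-sample the bad
# ones so that bad observations form roughly `bad_fraction` of the result.
import pandas as pd

def make_imbalanced(df: pd.DataFrame, bad_fraction: float, seed: int = 0) -> pd.DataFrame:
    good = df[df["label"] == "good"]
    bad = df[df["label"] == "bad"]
    # number of bad rows needed for the target good/bad distribution
    n_bad = int(round(len(good) * bad_fraction / (1.0 - bad_fraction)))
    kept_bad = bad.sample(n=min(n_bad, len(bad)), random_state=seed)
    return pd.concat([good, kept_bad]).sample(frac=1.0, random_state=seed)  # shuffle rows

# Example: distributions 70/30, 80/20, 90/10, 95/5 and 99/1
# for frac in [0.30, 0.20, 0.10, 0.05, 0.01]:
#     train_frac = make_imbalanced(train_df, frac)
```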

The research objectives are addressed in nine chapters, with the current chapter presenting an introduction to this research. The following is a general overview of the structure of the thesis.

In Chapter 2, a review of the literature topics related to credit scoring will be given. Section

2.1 focuses on introducing some basic theory behind credit scoring. In this section, current

applications of techniques in credit scoring models are also presented. In Section 2.2, the issue of imbalanced credit scoring data sets is examined and reviewed. Finally, based on the problems identified in Sections 2.1 and 2.2, the major research aims and contributions of this thesis are given.

In Chapter 3, a brief explanation of each of the techniques applied in this thesis is presented, with citations given for their full derivations. We have summarized these methods and classified

them into three categories: classification techniques, such as k-nearest neighbors and k-means

algorithms, ensemble methods, such as bagging and boosting, and methods for dealing with

class imbalance problems, such as sampling and cost-sensitive learning.

In Chapter 4, we review several metrics that are commonly used to assess classifier

performance. The most commonly used metric is the overall classification rate (i.e., accuracy). However,


on an imbalanced data set, the overall classification rate is no longer a suitable metric, since

the minority class has less effect on accuracy as compared to the majority class. Therefore,

other metrics have been developed, such as sensitivity, geometric mean and so on.
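As a brief illustration of the metrics named above, the following sketch computes sensitivity, specificity and the geometric mean from a binary confusion matrix with scikit-learn; it is a minimal example, not code taken from this thesis, and the 0/1 label coding is an assumption.

```python
# Minimal sketch: sensitivity, specificity and geometric mean for binary labels
# coded as 0 (majority/good) and 1 (minority/bad).
import numpy as np
from sklearn.metrics import confusion_matrix

def imbalance_metrics(y_true, y_pred):
    # For labels ordered as [0, 1], ravel() returns tn, fp, fn, tp in that order.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)            # true positive rate on the minority class
    specificity = tn / (tn + fp)            # true negative rate on the majority class
    g_mean = np.sqrt(sensitivity * specificity)
    return sensitivity, specificity, g_mean
```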

In Chapter 5, the two data sets used in our study are described in detail. A widely used academic data set (the German credit data), obtained from the UCI Repository of Machine Learning Databases, is adopted in Chapter 8. The other data sets were collected from a small company whose main business is selling school uniforms and accessories at wholesale. The company has 20 employees, and its annual sales are about 600 million Japanese yen.

In Chapter 6, we present a new approach which uses both the k-nearest neighbor (k-NN) algorithm and the random forest method to deal with imbalanced data sets in small-business credit assessment. Two types of classifiers are designed. The first is called a preliminary classifier; it is constructed using a k-means clustering algorithm based on the test data in order to preserve as much useful information about majority-class customers as possible. The second classifier is constructed using the random forest method; it is used to reclassify customers that were predicted to belong to the non-majority class in the preliminary classification, to improve the classification performance on the minority class.
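The following Python sketch illustrates only the general two-stage idea described above, under several simplifying assumptions: how the cluster-based preliminary step decides which cases look like majority-class customers, the distance cut-off, and all parameter values are illustrative and are not the exact procedure developed in Chapter 6.

```python
# Hypothetical sketch of the two-stage idea: a cluster-based preliminary step
# followed by a random forest that re-examines the cases not assigned to the
# majority class. Thresholds and parameters are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

def two_stage_predict(X_train, y_train, X_new, n_clusters=10, majority=0):
    # Stage 1: cluster the majority class; a new point close to a majority
    # centroid is provisionally labelled as majority.
    X_maj = X_train[y_train == majority]
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_maj)
    cutoff = np.percentile(km.transform(X_maj).min(axis=1), 90)
    provisional_majority = km.transform(X_new).min(axis=1) <= cutoff

    # Stage 2: a random forest reclassifies the remaining customers to sharpen
    # detection of the minority class.
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
    y_pred = np.full(len(X_new), majority)
    y_pred[~provisional_majority] = rf.predict(X_new[~provisional_majority])
    return y_pred
```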

In Chapter 7, we propose an adaptive and hierarchical system to solve the credit assessment problem, with emphasis on improving the accuracy of identifying the minority class. The proposed system can choose the best method adaptively from neural networks and decision trees based on the accuracy of identifying customers at every credit score. The performance and effectiveness of the proposed system are demonstrated by applying it to the real problems of the company.

In Chapter 8, we set out to compare several techniques that can be used in the analysis of

imbalanced credit scoring data sets. In a credit scoring context, imbalanced data sets occur when the number of examples in one class significantly outnumbers the number of examples in the other class. However, some techniques may not be able to adequately cope with these imbalanced data sets. Therefore, the objective is to compare the performance of a variety of techniques over differing class distributions.

In Chapter 9, we present the conclusions that can be drawn from the research undertaken in this thesis.


The relationships between the chapters are summarized in Figure 1.1, which arranges Chapters 1 through 9 as a flow chart with groupings labeled theories introduction, data description, methods for SME credit scoring, and methods for the general theory.

Figure 1.1 The structure of the thesis


CHAPTER 2

LITERATURE REVIEW


Literature Review

2.1 Credit scoring

2.1.1 Credit scoring and methodologies

Credit scoring is a quantitative method to evaluate the credit risk of counterparties. Almost every day, individuals' and companies' records of past borrowing and repayment are stored and analyzed. This information is used to estimate the probability of default, bankruptcy or fraud associated with a company or an individual. When assessing risk, we can roughly summarize the different kinds of scoring, according to context, as follows [1]:

Application scoring: It refers to the assessment of the creditworthiness of new applicants. It quantifies the risks associated with credit requests by evaluating the social,

demographic, financial, and other data collected at the time of the application.

Behavioral scoring: It involves principles that are similar to application scoring, with

the difference that it refers to existing customers. As a consequence, the analyst already

has evidence of the borrower’s behavior with the lender. Behavioral scoring models

analyze the consumers’ behavioral patterns to support dynamic portfolio management

processes.

Collection scoring: It is used to divide customers with different levels of insolvency into

groups, separating those who require more decisive actions from those who don’t need to

be attended to immediately. These models are distinguished according to the degree of

delinquency (early, middle, late recovery) and allow a better management of delinquent

customers, from the first signs of delinquency (30–60 days) to subsequent phases and

debt write-off.

Fraud detection: Fraud scoring models rank applicants according to the relative likelihood that an application is fraudulent.


A graphical conceptual framework, shown in Figure 2.1, is used for classifying credit scoring. The framework was designed based on a review of current research and books in the credit scoring area [2].

As shown in Figure 2.1, the given framework consists of two levels.

The first level includes three types of credit scoring problems: enterprise credit scoring, individual credit scoring, and small and medium enterprise (SME) credit scoring.

Individual (consumer) credit score: It scores a person's credit using variables such as applicant age, marital status and income, and can also include credit bureau variables.

Enterprise credit score: The enterprise score is derived using audited financial account variables and other internal or external, industry or credit bureau variables.

SME credit score: For SMEs, and especially for small companies, financial accounts are not reliable, since it is up to the owner to withdraw or retain cash. There are also other issues; for example, small companies are affected by their partners, whose good or bad financial status affects them, so monitoring SME counterparts is another way of scoring them [2]. In fact, small businesses have a major and growing share of the world economy, so SME scoring is a major issue and is the one investigated in this thesis.

The second level comprises three types of solutions plus variable selection; they are presented below.

Figure 2.1 Classification framework for intelligent techniques in credit scoring (first level: individual, enterprise and SME credit scoring; second level: single classifier, ensemble learning, hybrid approach and variable selection).


Single classifier: Credit scoring is a classification problem that mainly classifies applicants as good or bad. The tested models are mainly statistical methods and artificial intelligence techniques.

Hybrid approaches: The main idea behind hybrid approaches is that different methods have different strengths and weaknesses; combining them to some extent allows one method's strengths to cover another's weaknesses. There are five different hybrid methods [3].

- Hybrid Algorithms (HA): In this kind of system, two or more intelligent algorithms are tightly integrated in order to form a new classification device.

- Clustering and Classificatory devices (CC): These hybrid methods preprocess the

financial information on the failed and non-failed firms and identify groups based on

similarities. The grouping information is used in the subsequent estimation of a

classification model.

① Classification + Clustering

Clustering is an unsupervised learning technique and it cannot distinguish data

accurately like supervised techniques. Therefore, a classifier can be trained first,

and its output is used as the input for the cluster to improve the clustering results.

In the case of credit scoring, one can cluster good applicants in different groups.

② Clustering + Classification

In this approach, a clustering technique is applied first in order to detect and filter outliers. The remaining, unfiltered data are then used to train the classifier, which may improve the classification result (a minimal sketch of this approach is given after this list).

③ Classification + Classification

In this approach, the aim of the first classifier is to ‘pre-process’ the data set for

data reduction. That is, the correctly classified data by the first classifier are

collected and used to train the second classifier. It is assumed that for a new

testing set, the second classifier could provide better classification results than

single classifiers trained by the original datasets.

④ Clustering + Clustering

For the combination of two clustering techniques, the first cluster is also used for

data reduction. The correctly clustered data by the first cluster are used to train

the second cluster. Finally, for a new testing set, it is assumed that the second

cluster could provide better results.


Variable selection: Selecting appropriate and more predictive variables is fundamental for credit scoring [4]. Variable selection is the process of selecting the best predictive subset of variables from the original set of variables in a dataset [5]. Methods for selecting variables include stepwise regression, factor analysis, and partial least squares.

Ensemble learning: Ensemble learning aggregates the predictions made by multiple classifiers to improve the overall accuracy. Ensemble methods construct a set of classifiers from the training data and predict the classes of test samples by combining the predictions of these classifiers [6]. Types of ensembles include bagging, boosting and stacking.
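As noted under the "Clustering + Classification" item above, the following is a minimal, hypothetical sketch of that hybrid: a clustering step filters likely outliers before a single classifier is trained. The 95th-percentile cut-off and all parameter values are illustrative assumptions, not a prescription from the cited literature.

```python
# Illustrative "clustering + classification" hybrid: k-means finds points far
# from their cluster centre, these are filtered out as outliers, and a decision
# tree is then trained on the remaining data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

def cluster_then_classify(X, y, n_clusters=5, keep_quantile=95):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    # distance of each sample to the centre of its own cluster
    dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    keep = dist <= np.percentile(dist, keep_quantile)   # drop the most distant samples
    return DecisionTreeClassifier(random_state=0).fit(X[keep], y[keep])
```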

2.1.2 Current research

In this section, a review of the current applications of techniques in a credit risk modeling

environment will be given. The ideas already present in the literature will be explored with the aim of highlighting potential gaps that further research could fill. Table 2.1 provides a

selection of techniques currently applied in a credit scoring context.

Table 2.1 Credit scoring techniques and their application.

Statistical methods
  Linear Discriminant Analysis (LDA): Altman [7], Baesens et al. [8], Desai et al. [9], Karels and Prakash [10], Reichert et al. [11], West [12], Yobas et al. [13]
  Logistic Regression (LOG): Arminger et al. [14], Baesens et al. [8], Desai et al. [9], Steenackers and Goovaerts [15], West [12], Wiginton [16]
  Quadratic Discriminant Analysis (QDA): Altman [7], Baesens et al. [8]
  Multivariate Adaptive Regression Splines (MARS): Friedman [17]

Artificial intelligence techniques
  Neural Networks (NNs): Altman [18], Arminger et al. [14], Baesens et al. [8], Desai et al. [9], West [12], Yobas et al. [13]
  Decision Tree (DT): Arminger et al. [14], Baesens et al. [8], West [12], Yobas et al. [13], Hung and Chen [19]
  Support Vector Machines (SVM, LS-SVM, etc.): Baesens et al. [8], Schebesch and Stecking [20]
  Case-Based Reasoning (CBR): Buta [21], Shin and Han [22], Dong [23]

Hybrid approach
  Hybrid Algorithms (HA): Piramuthu [24], Tseng, Lin and Wang [25]
  Clustering and Classificatory devices (CC): Rafiei, Manzati and Bostanian [26]

Variable selection
  Stepwise regression, Factor analysis and partial least squares: Tsai [27], Danenas et al. [28]

Ensemble learning
  Bagging, Boosting, Stacking: Kim and Kang [29], Tsai and Wu [30]


(1) Statistical methods

Many researchers have developed a variety of traditional statistical methods for credit

scoring, with utilization of linear discriminant analysis (LDA) and logistic regression (LOG)

being the two most commonly used statistical techniques in building credit scoring models.

However, Karels and Prakash [31] and Reichert et al. [32] pointed out that the application of linear discriminant analysis (LDA) has often been challenged owing to the categorical nature of the credit data, which conflicts with its distributional assumptions, and the fact that the covariance matrices of the good and

bad credit classes are unlikely to be equal.

In addition to the linear discriminant analysis (LDA) approach, logistic regression (LOG) is

another commonly used alternative to conduct credit scoring tasks. Logistic regression is a

model used to predict the probability of occurrence of an event. It makes use of several predictor variables that may be either numerical or categorical. Basically, the logistic regression model first appeared as a technique for predicting binary outcomes. Logistic regression does not require the multivariate normality assumption; however, it assumes that the dependent variable is related to a linear combination of the independent variables through the exponent of the logistic function. Thomas [33] and West [12] indicated that both linear discriminant analysis (LDA) and logistic regression (LOG) are intended for the case where the underlying relationships between variables are linear, and hence they are reported to lack sufficient accuracy for credit scoring.
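For illustration, a logistic-regression scorecard of the kind discussed above can be expressed in a few lines with scikit-learn; the feature names and pipeline choices below are assumptions, not the models used in the cited studies.

```python
# Minimal sketch: logistic regression estimates the probability of default as a
# logistic function of a linear combination of the (standardized) predictors.
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X: numeric applicant features (e.g. age, income, outstanding balance); y: 1 = default
scorecard = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
# scorecard.fit(X_train, y_train)
# pd_estimates = scorecard.predict_proba(X_test)[:, 1]   # estimated probability of default
```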

Friedman [17] reported that multivariate adaptive regression splines (MARS) is another

commonly discussed classification technique. MARS is widely accepted by researchers for

the following reasons. Firstly, MARS is capable of modeling complex nonlinear relationships

among variables without strong model assumptions. Secondly, MARS can capture the relative

importance of independent variables to the dependent variable when many potential

independent variables are considered. Thirdly, the training process of MARS is simple and

hence can save lots of model building time, especially when the amount of data is huge.

Finally, the resulting MARS model can be more easily interpreted than those of other classification techniques, which gives it important managerial and explanatory implications and can help in making appropriate decisions.

(2) Artificial intelligence techniques

Recent studies have revealed that emerging artificial intelligence techniques, such as

decision tree (DT), support vector machine (SVM), genetic algorithm (GA) and artificial


neural networks (ANNs), are advantageous over statistical models and optimization techniques for

credit risk evaluation. In contrast with statistical methods, artificial intelligence methods do

not assume certain data distributions. These methods automatically extract knowledge from

training samples. According to previous studies, artificial intelligence methods are superior to

statistical methods in dealing with corporate credit risk evaluation problems, especially for

nonlinear pattern classification. The application of the aforementioned techniques has been investigated in several works. Baesens et al. [8] conducted a benchmarking study of 17 different classification techniques on eight different real-life credit datasets. They used SVM and least squares SVM (LS-SVM) with linear and Radial Basis Function (RBF) kernels and adopted a grid search mechanism to tune the hyperparameters. Their experimental results indicated that SVM has the highest average ranking on performance. Schebesch and Stecking [20] used a standard SVM with linear and RBF kernels for applicant credit scoring and used a linear-kernel-based SVM to divide a set of labeled credit applicants into subsets of typical and critical patterns, which can be used for rejecting applicants. In [34] SVMs were used for

bankruptcy prediction and better accuracy was generated by SVM when compared to other

methods. Gestel et al. [35] used LS-SVM for credit rating of banks, and compared the results

with ordinary least squares, logistic regression (LR) and multilayer perceptron (MLP). Min et

al. [36] proposed methods for improving SVM performance in two aspects: feature subset

selection and parameter optimization. Abdou et al. [37] investigated the ability of neural

networks (NNs), such as probabilistic neural networks (PNN) and multi-layer feed-forward

nets, and traditional techniques such as discriminant analysis, probit analysis and logistic

regression (LR) in evaluating credit risk in Egyptian banks by applying credit scoring models.

The results of their investigation showed that neural network (NN) models have a more accurate classification rate in comparison with other techniques. Pang and Gong [38] applied the C5.0 algorithm to credit risk. They stated that the decision tree (DT) is a good technique for this kind of problem.
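In the spirit of the grid-search tuning of RBF-kernel SVMs described above, a minimal scikit-learn sketch might look as follows; the parameter grid, cross-validation setting and scoring metric are illustrative assumptions rather than those used in the cited studies.

```python
# Sketch: RBF-kernel SVM with a grid search over the hyperparameters C and gamma.
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [1e-3, 1e-2, 1e-1, 1]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
# search.fit(X_train, y_train)
# best_svm = search.best_estimator_
```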

(3) Ensemble learning

Ensemble learning is a machine learning paradigm where multiple learners are trained to

solve the same problem [39]. In contrast to ordinary machine learning approaches that try to

learn one hypothesis from the training data, ensemble methods try to construct a set of

hypotheses and combine them. The learners that compose an ensemble are usually called base learners.


One of the earliest studies on ensemble learning is Dasarathy and Sheela’s research [40],

which discusses partitioning the feature space using two or more classifiers. In 1990, Hansen

and Salamon showed that the generalization performance of an ANN can be improved using an ensemble of similarly configured ANNs [41]. Schapire then proved that a strong classifier in the probably approximately correct (PAC) sense can be generated by combining weak classifiers through boosting [42], the predecessor of the family of AdaBoost algorithms.

Since these seminal works, studies in ensemble learning have expanded rapidly, appearing

often in the literature under many creative names and ideas [39].

The generalization ability of an ensemble is usually much stronger than that of a single

learner, which makes ensemble methods very attractive [43]. In practice, to achieve a good

ensemble, two necessary conditions should be satisfied: accuracy and diversity [44].
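As a small illustration of the bagging and boosting strategies mentioned above (both are revisited in Chapter 3), the following sketch builds the two classic ensembles with scikit-learn; the base learners and parameter values are illustrative assumptions.

```python
# Bagging trains base learners on bootstrap samples of the training data;
# boosting (AdaBoost) re-weights the training examples after each round.
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0)
# bagging.fit(X_train, y_train)
# boosting.fit(X_train, y_train)
```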

(4) Hybrid methods

At present, hybrid models that synthesize the advantages of various methods have become hot research topics. However, there is no clear consensus on how to classify hybrid models. Generally, the classification is made according to the different methods used in the feature selection and classification stages. Based on this idea, Tsai & Chen [3] divided them into four types: clustering + classification, classification + classification, clustering + clustering and classification + clustering. They compared four kinds of classification techniques (C4.5, naive Bayes, logistic regression, artificial neural networks) as well as two kinds of clustering methods (k-means and the expectation-maximization (EM) algorithm). The results showed that EM + LR, LR + ANNs, EM + EM and LR + EM are the optimal combinations for the above four types, respectively.

In recent years, the imbalanced learning problem has received considerable attention in the credit scoring context. In 2005, the Basel Committee on Banking Supervision [45] highlighted the fact that calculations based on historical data for very low-risk assets may “not be sufficiently reliable” for estimating the probability of default. The reason is that, as there are few defaulted observations, the resulting estimate is likely to be inaccurate. Therefore, there is a need for a better understanding of the appropriate modeling techniques for data sets that display a limited number of defaulted observations.

The class imbalance problem is examined in further detail in the next section.

2.2 Class imbalance problem

2.2.1 The problem of imbalanced datasets


In a data set with the class imbalance problem, the most obvious characteristic is the

skewed data distribution between classes. However, theoretical and experimental studies presented in Refs. [46], [47] and [48] indicate that a skewed data distribution is not the only factor that influences the modelling of a classifier capable of identifying rare events. Other influential factors include lack of data and concept complexity.

(1) Imbalanced class distribution

The imbalance degree of a class distribution can be denoted by the ratio of the sample size

of the minority class to that of the majority class. In practical applications, the ratio can be as

drastic as 1:100, 1:1000, or even larger [49]. In Ref. [48], research was conducted to explore

the relationship between the class distribution of a training data set and the classification

performances of decision trees. Their study indicates that a relatively balanced distribution

usually attains a better result. However, the degree of imbalance at which the class distribution deteriorates classification performance cannot be stated explicitly, since other factors such

as sample size and separability also affect performance. In some applications, a ratio as low as

1:35 can make some methods inadequate for building a good model, but in some other cases,

1:10 is tough to deal with [50].

(2) Lack of data

One of the primary problems when learning with imbalanced data sets is the associated lack of data, where the number of samples is small [47]. In a given classification task, the size of the data set plays an important role in building a good classifier. A lack of examples, therefore, makes it difficult to uncover regularities within the small classes. Figure 2.2 illustrates an example of the problem that can be caused by lack of data. Figure 2.2(a) shows the decision boundary (dashed line) obtained when using sufficient data for training, whereas Figure 2.2(b)

shows the result when using a small number of samples. When there is sufficient data, the

estimated decision boundary (dashed line) approximates well the true decision boundary

(solid line); whereas, if there is a lack of data, the estimated decision boundary can be very far

from the true boundary. In fact, it has been shown that as the size of training set increases, the

error rate caused by imbalanced training data decreases [46]. Weiss and Provost conducted

experiments on twenty-six data sets taken from the UCI repository to investigate the

relationship between the degree of class imbalance and training set sizes [48]. They showed

that when more training data become available, the classifiers are less sensitive to the level of

imbalance between classes. This suggests that with sufficient amount of training data, the

classification system may not be affected by the high imbalance ratio.


(3) Concept complexity

Concept complexity is an important factor in a classifier's ability to deal with imbalanced problems. Concept complexity in data corresponds to the level of separability of the classes within the data. Japkowicz and Stephen reported that for simple data sets that are linearly separable (as Figure 2.3 shows), classifier performance is not susceptible to any amount of imbalance

[46].

Indeed, as the degree of data complexity increases, the class imbalance factor starts

impacting the classifier generalization ability. High complexity refers to inseparable data sets

with highly overlapped classes, complex boundaries and high noise level. When samples of

different classes overlap in the feature space, finding the optimum class boundary becomes

hard (see Figure 2.4). In fact, most accuracy-driven algorithms bias toward the majority class.

That is, they improve the overall accuracy by assigning the overlapped area to the majority

class, and ignore or treat the minority class as noise.

Figure 2.2 The effect of lack of data on the class imbalance problem; the solid line represents the true decision boundary and the dashed line represents the estimated decision boundary.

Figure 2.3 Linearly separable data


The class imbalance problem is more significant when the data sets have a high level of

noise. Noise in data sets can emerge from various sources, such as data samples that are poorly acquired or incorrectly labeled, or extracted features that are not sufficient for classification. It is known that noisy data affect many machine learning algorithms; however, Weiss showed that noise has an even more serious impact when learning with imbalanced data. The problem occurs

when samples from the minority class are mistakenly included in the training data for the

majority class, and vice versa. For the majority class it takes only a few noise samples to

influence the learned sub-concept. For a given data set that is complex and imbalanced, the

challenge is how to train a classifier that correctly recognizes samples of different classes with

high accuracy.

2.2.2 Methods for dealing with imbalanced credit scoring data sets

A wide range of different classification techniques for scoring credit data sets has been

proposed in the literature, a non-exhaustive list of which was provided earlier. In addition,

some benchmarking studies have been undertaken to empirically compare the performance of

these various techniques [8], but they did not focus specifically on how these techniques compare on heavily imbalanced data sets, or to what extent any such comparison is affected

by the issue of class imbalance. For example, in Baesens et al. [8], seventeen techniques

including both well known techniques such as logistic regression and discriminant analysis

and more advanced techniques such as least square support vector machines were compared

on eight real-life credit scoring data sets. Although more complicated techniques such as

radial basis function least square support vector machines (RBF LS-SVM) and neural

networks (NN) yielded good performances in terms of the area under the ROC curve (AUC),

simpler linear classifiers such as linear discriminant analysis (LDA) and logistic regression

(LOG) also gave very good performances.

Figure 2.4 Overlapping data


However, there are often conflicting opinions when comparing the conclusions of studies

promoting differing techniques. For example, Yobas et al. [13] found that linear discriminant

analysis (LDA) outperformed neural networks in the prediction of loan default, whereas Desai

et al. [9] reported that neural networks actually perform significantly better than LDA.

Furthermore, many empirical studies only evaluate a small number of classification

techniques on a single credit scoring data set. The data sets used in these empirical studies are

also often far smaller and less imbalanced than those data sets used in practice. Hence, the

issue of which classification technique to use for credit scoring, particularly with a small

number of bad observations, remains a challenging problem.

In more recent work on the effects of class distribution on the prediction of the probability

of default (PD), Crone and Finlay [51] found that under-sampled data sets are inferior to

unbalanced and oversampled data sets. However it was also found that the larger the sample

size used, the less significant the differences between the methods of balancing were. Their

study also incorporated the use of a variety of data mining techniques, including logistic

regression, classification and regression trees, linear discriminant analysis and neural

networks. From the application of these techniques over a variety of class balances it was

found that logistic regression was the least sensitive to balancing. This piece of work is

thorough in its empirical design; however, it does not assess more novel machine learning

techniques in the estimation of default.

In Yao [52], hybrid SVM-based credit scoring models are constructed to evaluate an applicant's score from the applicant's input features. The paper shows the implications of

using machine learning based techniques in a credit scoring context on two widely used credit

scoring data sets (Australian credit and German credit) and compares the accuracy of this

model against other techniques (LDA, logistic regression and NN). Their findings suggest that

the SVM hybrid classifier has the best scoring capability when compared to traditional

techniques. Although this is a non-exhaustive study with a bias towards the use of RBF-

SVMs, it gives a clear basis for the hypothetical use of SVMs in a credit scoring context.

In Kennedy [53], the suitability of one-class and supervised two-class classification

algorithms as a solution to the low-default portfolio problem is evaluated. This study

compares a variety of well established credit scoring techniques (e.g. LDA, LOG and k-

nearest neighbor) against the use of a linear kernel SVM. Nine banking data sets are utilized

and class imbalance is artificially created by removing 10% of the defaulting observations

from the training set after each run. The only issue with this process is that the data sets are


comparatively small in size (ranging from 125 to 5397 observations), which leads this author to believe that k-fold cross-validation would have been more appropriate considering the size of the data sets after a training, validation and test set split is made. As more class imbalance is

induced it is shown that logistic regression performs significantly better than Lin-SVM, QDC

(Quadratic Discriminant Classifier) and k-NN. It is also shown that over-sampling produces

no overall improvement to the best performing two-class classifiers. The findings in this paper

lead into the work that will be conducted in this thesis, as several similar techniques and

datasets will be employed, alongside the determination of classifier performance on

imbalanced data sets.

The topic of which good/bad distribution is the most appropriate in classifying a data set

has been discussed in some detail in the machine learning and data mining literature. In Weiss

and Provost [48] it was found that the naturally occurring class distribution in the twenty-six data sets examined often did not produce the best-performing classifiers. More specifically,

based on the AUC measure (which was preferred over the use of the error rate), it was shown

that the optimal class distribution should contain between 50% and 90% minority class

examples within the training set. Alternatively, a progressive adaptive sampling strategy for

selecting the optimal class distribution is proposed in Provost et al. [54]. Whilst this method

of class adjustment can be very effective for large data sets, with an adequate number of

observations in the minority class of defaulters, in some imbalanced data sets there are only a

very small number of loan defaults to begin with.

Various kinds of techniques have been compared in the literature to try and ascertain the

most effective way of overcoming a large class imbalance. Chawla et al. [55] proposed the Synthetic Minority Over-sampling Technique (SMOTE), which was applied to example data

sets in fraud, telecommunications management, and detection of oil spills in satellite images.

In Japkowicz [56] over-sampling and downsizing were compared to the author’s own method

of “learning by recognition” in order to determine the most effective techniques. The findings,

however, were inconclusive but demonstrated that both over-sampling the minority class and

downsizing the majority class can be very effective. Subsequently Batista [57] identified ten

alternative techniques to deal with class imbalances and trialled them on thirteen data sets.

The techniques chosen included a variety of under-sampling and over-sampling methods.

Their findings suggested that generally over-sampling methods provide more accurate results

than under-sampling methods. Also, a combination of either SMOTE and Tomek links or SMOTE and ENN (a nearest-neighbor cleaning rule) was proposed.
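For reference, the resampling combinations discussed above are available in the imbalanced-learn package; the snippet below is a minimal usage sketch, and the parameter choices are illustrative assumptions.

```python
# SMOTE over-sampling, and SMOTE combined with Tomek-link cleaning.
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTETomek

# X, y: an imbalanced training set
# X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)
# X_st, y_st = SMOTETomek(random_state=0).fit_resample(X, y)
```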


2.3 Issues and the aims of this study

Although a lot of significant classification methods can be used to assess credit risk, there

are still several issues to be addressed.

(1) According to the study of Sadatrasoul [58], current credit scoring techniques are mostly applied to individual credit scoring, and there is inadequate research on enterprise and small and medium-sized company (SME) credit scoring; only 2% of current research addresses SME credit scoring.

(2) The stacking strategy of ensemble learning, which is based on different kinds of classification algorithms, not only inherits advantages from the different classifiers but also inevitably suffers from their disadvantages (Hung et al. [59]; Witten & Frank [60]). Therefore, the performance of this strategy is not always better than that of an individual classifier. On the other hand, most hybrid models that use different classifiers are structured in parallel, based on a voting strategy.

(3) Data sampling is an approach for producing a more balanced learning data set. As under-sampling extracts a smaller set of majority instances, some information about the majority class will be lost. It is also very difficult to determine the correct distribution for a learning algorithm, or the appropriate re-sampling strategy that avoids losing information about the majority class under under-sampling and over-fitting the minority class under over-sampling.

(4) In the literature, data sets that can be considered very low risk, or imbalanced data sets, have received relatively little attention, in particular with regard to which techniques are most appropriate for scoring them (Benjamin et al. [61]). The

underlying problem with imbalanced data sets is that there are significantly fewer

training instances of one class compared to other classes. A large class imbalance is

therefore present which some techniques may not be able to successfully handle

(Benjamin et al. [61]). In a recent FSA publication regarding conservative estimation

of imbalanced data sets, regulatory concerns were raised about whether companies can

adequately assess the risk of imbalanced credit scoring data sets. A wide range of

classification techniques has already been proposed in the credit scoring literature. But

it is currently unclear from the literature which techniques are the most appropriate for

improving discrimination for imbalanced credit scoring data sets.

In order to address the problems described above, the contributions of this thesis are

Page 26: Credit Scoring Method and System Development for Imbalanced … · 3.1.1 k-nearest neighbors (k-NN) ... a brief explanation of each of the techniques applied in this thesis is present

21

organized as follows:

(1) Because there is little literature on SME credit scoring, our study mainly focuses on a small company. Two novel systems are proposed to solve a real small-business credit assessment problem based on feature data such as sales, payments from customers, and so on.

(2) Different from existing hybrid approaches structured in parallel, we propose an adaptive and hierarchical system that can inherit the advantages and avoid the disadvantages of different classification techniques.

(3) When using under-sampling methods to re-sample instances of the majority class, it is

unavoidable that some useful information of the majority class is lost. In order to avoid

this information loss, we propose a two-stage data re-sampling method to reduce the

sample size of the majority class.

(4) We address the issue of imbalanced data sets. Whereas other studies have benchmarked several scoring techniques, in our study we explicitly look at the problem of having to build models on potentially highly imbalanced data sets. One group of data sets is collected from a small company in which the number of insolvent customers is much lower than the number of healthy ones. The other data sets, based on the German credit data, are created with a wide range of class distributions by altering the percentage of bad observations in the original training data.


CHAPTER 3

METHODOLOGIES


Methodologies

3.1 Classification Techniques

3.1.1 k-nearest neighbors

One common classification scheme based on the use of distance measures is that of the k-

nearest neighbor. The k-nearest neighbor technique assumes that the entire sampling set

includes not only the data in the set, but also the desired classification for each item. When a

classification is to be made for a new item, its distance to each item in the sampling set must

be computed. Only the k closest entries in the sampling set are considered further. The new item is then assigned to the class that contains the most items among these k closest items.

Figure 3.1 shows an example of a 5-NN classifier which consists of three categories w1, w2

and w3. xu is the new unlabeled input data point to be classified in the testing stage.

According to Figure 3.1, the value of parameter k is 5 and the Euclidean distance formula

has been used to calculate the distance between the training data points and the testing data

point xu. Among the five nearest neighbors of xu, four belong to category w1 and the other one belongs to category w3. Hence, xu is classified into category w1 by the k-nearest neighbor

(k-NN) classifier [62].

Figure 3.1 Feature space of a three-dimensional 5-NN classifier.
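To make the classification rule above concrete, the following minimal Python sketch (NumPy only; the function name and the toy data are illustrative and not taken from this thesis) classifies a new point by a majority vote among its k nearest training points under the Euclidean distance.

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_new, k=5):
        """Classify x_new by majority vote among its k nearest training points."""
        # Euclidean distance from x_new to every training point
        dists = np.linalg.norm(X_train - x_new, axis=1)
        # indices of the k closest training points
        nearest = np.argsort(dists)[:k]
        # majority vote over their class labels
        return Counter(y_train[nearest]).most_common(1)[0][0]

    # toy example with three categories labelled 1, 2 and 3 (cf. w1, w2, w3)
    X_train = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.9, 1.0],
                        [1.0, 0.8], [0.8, 0.9], [0.5, 0.45], [0.05, 0.2]])
    y_train = np.array([1, 1, 1, 2, 2, 2, 3, 1])
    print(knn_predict(X_train, y_train, np.array([0.1, 0.15]), k=5))

In this toy call four of the five nearest neighbors carry label 1, so the new point is assigned to class 1, mirroring the 4-to-1 vote of Figure 3.1.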


3.1.2 k-means algorithm

k-means is one of the most popular clustering algorithms. The user needs to specify the number k of clusters in advance. The algorithm randomly selects k objects as the initial cluster means or centers, and works towards optimizing the square error criterion function, which is defined as:

E = \sum_{i=1}^{k} \sum_{x \in C_i} \| x - mc_i \|^2        (3.1)

where mci is the mean of cluster Ci (i=1,2,…,k) and x represents a sample.

The main steps of the k-means algorithm are as follows (a short illustrative sketch is given after the list):
(1) Assign initial means mci (i = 1, 2, …, k).
(2) Assign each data object x to the cluster Ci with the closest mean.
(3) Compute a new mean for each cluster.
(4) Iterate until the criterion function converges; that is, there are no more new assignments.
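The sketch below follows these four steps literally (NumPy only; the function and variable names are illustrative, and the data matrix X is assumed to be a floating-point array): every sample is assigned to the closest mean, the means are recomputed, and the loop stops when no assignment changes.

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        """Basic k-means: returns the k cluster means and each sample's cluster index."""
        rng = np.random.default_rng(seed)
        # (1) pick k random samples as the initial cluster means
        means = X[rng.choice(len(X), size=k, replace=False)].copy()
        labels = np.full(len(X), -1)
        for _ in range(max_iter):
            # (2) assign every sample to the cluster whose mean is closest
            dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)
            if np.array_equal(new_labels, labels):
                break                                  # (4) converged: no new assignments
            labels = new_labels
            # (3) recompute the mean of every non-empty cluster
            for i in range(k):
                if np.any(labels == i):
                    means[i] = X[labels == i].mean(axis=0)
        return means, labels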

The k-means algorithm has the advantages of fast clustering and easy implementation. However, the number k of clusters must be fixed in advance, and the initial cluster centers are chosen stochastically, which may make the results unstable. Hence, improving the quality and stability of the cluster analysis is of high value.

3.1.3 Decision tree (C4.5)

A decision tree consists of internal nodes that specify tests on individual input variables or

attributes that split the data into smaller subsets, and a series of leaf nodes assigning a class to

each of the observations in the resulting segments. For our study, we chose the popular

decision tree classifier C4.5, which builds decision trees using the concept of information

entropy [63]. The entropy of a sample S of classified observations is given by

Entropy(S) = -p_1 \log_2(p_1) - p_0 \log_2(p_0)        (3.2)

where p1(p0) are the proportions of the class values 1(0) in the sample S, respectively. C4.5

examines the normalised information gain (entropy difference) that results from choosing an

attribute for splitting the data. The attribute with the highest normalised information gain is the one used to make the decision. The algorithm then recurses on the smaller subsets.
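As a small illustration of Eq. (3.2) and of the gain criterion (the labels and the candidate split below are invented for the example), the snippet computes the entropy of a binary-labelled sample and the information gain produced by splitting it; C4.5 additionally normalises this gain by the split information, a step omitted here for brevity.

    import numpy as np

    def entropy(y):
        """Entropy of a binary-labelled sample (Eq. 3.2)."""
        p1 = np.mean(y)                       # proportion of class 1
        p0 = 1.0 - p1                         # proportion of class 0
        return -sum(p * np.log2(p) for p in (p0, p1) if p > 0)

    def information_gain(y, mask):
        """Entropy reduction obtained by splitting the sample y with a boolean mask."""
        n, n_left = len(y), mask.sum()
        child = (n_left / n) * entropy(y[mask]) + ((n - n_left) / n) * entropy(y[~mask])
        return entropy(y) - child

    y = np.array([1, 1, 1, 0, 0, 1, 0, 1])                                  # class labels
    split = np.array([True, True, True, False, False, True, False, False])  # candidate test
    print(entropy(y), information_gain(y, split))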

3.1.4 Artificial neural network

An artificial neural network (ANN) is a system based on the operation of biological neural networks; in other words, it is an emulation of a biological neural system. The key element of this

paradigm is the novel structure of the information processing system [64]. It is composed of a

large number of highly interconnected processing elements working in unison to solve

specific problems.

The most common type of neural network consists of three layers of units: an input layer, a hidden layer, and an output layer. It is called the multilayer perceptron (MLP) [65]. A layer of

“input” units is connected to a layer of “hidden” units, which is connected to a layer of

“output” units. The activity of each input layer represents the raw information that is fed into

the network. The activity of each hidden unit is determined by the activities of the input units

and the weights on the connections between the input and the hidden units. The behavior of

the output units depends on the activity of the hidden units and the weights between the

hidden and output units. Figure 3.2 shows an example of three-layer neural network including

input, output, and one hidden layer.

Advantages of neural networks include their strong learning ability and the fact that they make no assumptions about the relationships between input variables. However, they also have some drawbacks. A major disadvantage of neural networks lies in their poor understandability. Because of their “black box” nature, it is very difficult to extract an explicit knowledge representation from an ANN. A second problem is how to design and optimize the network topology, which is a very complex experimental process.

Figure 3.2 A three-layer neural network (input layer, one hidden layer and output layer).
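For reference, a three-layer network of the kind shown in Figure 3.2 can be assembled in a few lines with scikit-learn, as sketched below; the hidden-layer size and the toy data are placeholders rather than settings used in this thesis.

    from sklearn.neural_network import MLPClassifier

    # input layer -> one hidden layer of 10 units -> output layer
    clf = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
    X = [[0.0, 0.2], [0.1, 0.1], [0.9, 0.8], [1.0, 0.9]]   # toy inputs
    y = [0, 0, 1, 1]                                       # toy class labels
    clf.fit(X, y)
    print(clf.predict([[0.95, 0.85]]))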


3.2 Ensemble learning

An ensemble of classifiers is a collection of several classifiers whose individual decisions

are combined in some way to classify the test examples [66]. It is known that an ensemble

often shows much better performance than the single classifiers that make it up.

3.2.1 Bagging

Bagging, short for bootstrap aggregating, is considered one of the earliest ensemble schemes [67]. Bagging is intuitive but powerful, especially when the data size is limited. Bagging

generates a series of training subsets by random sampling with replacement from the original

training set. Then the different classifiers are trained by the same classification algorithm with

different training subsets. When a certain number of classifiers are generated, these

individuals are combined by the majority voting scheme. Given a testing instance, different

outputs will be given from the trained classifiers, and the majority will be considered as the

final decision.
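The following sketch illustrates the scheme (it is illustrative only: scikit-learn decision trees are used as the base classifier, and X and y are assumed to be NumPy arrays with non-negative integer class labels): each classifier is trained on a bootstrap sample drawn with replacement, and the predictions are combined by majority voting.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def bagging_fit(X, y, n_estimators=10, seed=0):
        """Train n_estimators base classifiers, each on a bootstrap sample of (X, y)."""
        rng = np.random.default_rng(seed)
        n = len(X)
        models = []
        for _ in range(n_estimators):
            idx = rng.integers(0, n, size=n)          # random sampling with replacement
            models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
        return models

    def bagging_predict(models, X):
        """Combine the individual outputs by the majority voting scheme."""
        votes = np.array([m.predict(X) for m in models])      # shape (n_models, n_samples)
        return np.array([np.bincount(col).argmax() for col in votes.T])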

3.2.2 Boosting

The AdaBoost family of algorithms, also known as boosting, is another category of

powerful ensemble methods [68]. It explicitly alters the distribution of the training data fed to each individual classifier by assigning a weight to every training sample. Initially the weights

are uniform for all the training samples. During the boosting procedure, they are adjusted after

the training of each classifier is completed. For misclassified samples, the weights are

increased, while for correctly classified samples they are decreased. The final ensemble is

constructed by combining individual classifiers according to their own accuracies.
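The weight-update just described can be written compactly as below (a simplified sketch of the discrete AdaBoost update; the function name and the toy labels are illustrative): the weights of misclassified samples are increased, those of correctly classified samples are decreased, and the weights are then renormalised.

    import numpy as np

    def adaboost_reweight(weights, y_true, y_pred):
        """One boosting round: update sample weights from the classifier's errors."""
        miss = (y_true != y_pred)
        err = np.sum(weights[miss]) / np.sum(weights)         # weighted error rate
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))   # this classifier's vote weight
        new_w = weights * np.exp(alpha * np.where(miss, 1.0, -1.0))
        return new_w / new_w.sum(), alpha

    weights = np.ones(6) / 6                                  # initially uniform
    y_true = np.array([1, 0, 1, 1, 0, 0])
    y_pred = np.array([1, 0, 0, 1, 0, 1])                     # two misclassified samples
    weights, alpha = adaboost_reweight(weights, y_true, y_pred)
    print(weights, alpha)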

3.2.3 Stacking

Stacking is another popular ensemble learning strategy: a general method of using a high-level learner to combine lower-level base learners to achieve greater predictive accuracy [69].

It builds an ensemble by using different classification algorithms. The simplest way to

combine classification results from different classifiers is by voting. However, this policy may

inherit advantages from some classifiers and disadvantages from other classifiers

simultaneously.


3.2.4 Random forests

Random forests are defined as a group of un-pruned classification or regression trees,

trained on bootstrap samples of the training data using random feature selection in the process

of tree generation. After a large number of trees have been generated, each tree votes for the

most popular class. These tree voting procedures are collectively defined as random forests. A

more detailed explanation of how to train a random forest can be found in Breiman [70]. For

the random forests classification technique two parameters require tuning. These are the

number of trees and the number of attributes used to grow each tree.
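In scikit-learn, for example, these two tuning parameters correspond to n_estimators (the number of trees) and max_features (the number of attributes tried at each split); the values below are placeholders only.

    from sklearn.ensemble import RandomForestClassifier

    # the two knobs to tune: number of trees and attributes considered at each split
    rf = RandomForestClassifier(n_estimators=100, max_features=4, random_state=0)
    # rf.fit(X_train, y_train) and rf.predict(X_test) would then train and apply the
    # forest; X_train, y_train and X_test are assumed to be provided elsewhere.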

3.3 Learning from class imbalance data sets

A number of solutions to the class imbalance problem were previously proposed both at the

data and algorithmic levels [71]. At the data level, these solutions include many different

forms of re-sampling such as random over-sampling with replacement, random under-

sampling, directed over-sampling (in which no new examples are created, but the choice of

samples to replace is informed rather than random), directed under-sampling (where, again,

the choice of examples to eliminate is informed), over-sampling with informed generation of

new samples, and combinations of the above techniques. At the algorithmic level, solutions

include adjusting the costs of the various classes so as to counter the class imbalance,

adjusting the probabilistic estimate at the tree leaf (when working with decision trees),

adjusting the decision threshold, and recognition-based (i.e., learning from one class) rather

than discrimination-based (two class) learning.

The most effective techniques for dealing with imbalanced data sets include sampling and cost-sensitive learning.

3.3.1 Sampling Methods

An easy data level method for balancing the classes consists of re-sampling the original

data set, either by over-sampling the minority class or by under-sampling the majority class,

until the classes are approximately equally represented. Both strategies can be applied in any

learning system, since they act as a preprocessing phase, allowing the learning system to

receive the training instances as if they belonged to a well-balanced data set. Thus, any bias of

the system towards the majority class due to the different proportion of examples per class

would be expected to be suppressed.


Hulse et al. [72] suggest that the utility of the re-sampling methods depends on a number of

factors, including the ratio between positive and negative examples, other characteristics of the

data, and the nature of the classifier. However, re-sampling methods have shown important

drawbacks. Under-sampling may throw out potentially useful data, while over-sampling

artificially increases the size of the data set and consequently, worsens the computational

burden of the learning algorithm.

(1) Over-sampling

The simplest method to increase the size of the minority class is random over-sampling, that is, a non-heuristic method that balances the class distribution through the random replication of positive examples. Nevertheless, since this method replicates existing examples in the minority class, overfitting is more likely to occur.

(2) Under-sampling

Under-sampling is an efficient method for class-imbalance learning. This method uses a subset of the majority class to train the classifier. Since many majority class examples are ignored, the training set becomes more balanced and the training process becomes faster. The most common preprocessing technique is random majority under-sampling (RUS), in which instances of the majority class are randomly discarded from the data set. The main drawback of under-sampling is that potentially useful information contained in the ignored examples is neglected.

(3) Advanced sampling

Although sampling methods are widely used for tackling class imbalance problems, there is

no established way to determine the suitable class distribution for a given data set. The

optimal class distribution depends on the performance measures and varies from one data set

to another. Recent variants of over-sampling and under-sampling overcome some of the

weaknesses. Among them, one popular over-sampling approach is SMOTE (Synthetic

Minority Over-sampling Technique), which adds information to the training set by

introducing new, non-replicated minority class examples.

SMOTE is an intelligent over-sampling method. In this approach, the minority class is

over-sampled by taking each minority class sample and introducing synthetic examples along

the line segments joining any/all of the k minority class nearest neighbors. Depending upon

the amount of over-sampling required, neighbors from the k nearest neighbors are randomly

chosen. This process is illustrated in Figure 3.3, where xi is the selected point, xi1 to xi4 are some of its selected nearest neighbors, and r1 to r4 are the synthetic data points created by the randomized interpolation [73].

This method is investigated for C4.5 and gives better results than random over-sampling.

By interpolating the minority class examples with new data, the within class imbalance is

reduced and C4.5 achieves a better generalization of the minority class, opposed to the

specialization effect obtained by randomly replicating the minority class examples.
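A minimal sketch of this interpolation step is given below (NumPy only, with a brute-force neighbor search; it illustrates the idea rather than reproducing the reference SMOTE implementation, and X_min is an assumed array holding only the minority class samples).

    import numpy as np

    def smote_sample(X_min, k=5, n_new=100, seed=0):
        """Create n_new synthetic minority samples by interpolating towards
        randomly chosen k-nearest minority neighbors."""
        rng = np.random.default_rng(seed)
        synthetic = []
        for _ in range(n_new):
            i = rng.integers(len(X_min))                 # pick a minority sample xi
            d = np.linalg.norm(X_min - X_min[i], axis=1)
            neighbors = np.argsort(d)[1:k + 1]           # its k nearest minority neighbors
            j = rng.choice(neighbors)                    # choose one neighbor at random
            gap = rng.random()                           # random point on the segment xi -> xj
            synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
        return np.array(synthetic)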

3.3.2 Cost-sensitive learning: C4.5 decision tree

At the algorithmic level, solutions include adjusting the costs of the various classes so as to

counter the class imbalance, adjusting the probabilistic estimate at the tree leaf (when working

with decision trees), adjusting the decision threshold, and recognition-based (i.e., learning

from one class) rather than discrimination-based (two class) learning.

Cost-sensitive learning is a type of learning in data mining that takes misclassification costs (and possibly other types of cost) into consideration. There are many ways to implement cost-sensitive learning; in Haibo He [74] they are categorized into three classes: the first class of techniques applies misclassification costs to the data set as a form of data space weighting, the second class applies cost-minimizing techniques to the combination schemes of ensemble methods, and the last class incorporates cost-sensitive features directly into classification paradigms to essentially fit the cost-sensitive framework into these classifiers.

Cost can be incorporated into the decision tree classification algorithm, one of the most widely used and simplest classifiers, in various ways: first, cost can be applied to adjust the decision threshold; second, cost can be used in splitting attribute selection during decision tree construction; and third, cost-sensitive pruning schemes can be applied to the tree.

Figure 3.3 An illustration of how to create the synthetic data points in the SMOTE algorithm.

In this study, we make use of the cost-sensitive C4.5 decision tree (C4.5CS) proposed by Ting (2002) [75]. This method changes the class distribution such that the induced tree is in favor of the class with a high weight/cost and is less likely to commit errors with a high cost. Specifically, the computation of the split criterion of C4.5 (normalized information gain) is modified to take into account the a priori probability according to the number of samples for each class.

C4.5CS modifies the weight of an instance proportional to the cost of misclassifying the

class to which the instance belonged, leaving the sum of all training instance weights still

equal to N. Let C(j) be the cost of misclassifying a class j instance; the weight of a class j

instance can be computed as :

w(j) = \frac{C(j)\,N}{\sum_i C(i)\,N_i}        (3.3)

such that the sum of all instance weights is \sum_j w(j)\,N_j = N.

The standard greedy divide-and-conquer procedure for inducing minimum error trees can

then be used without modification, except that Wj(t) is used instead of Nj(t) (number of

instances of class j) in the computation of the test selection criterion in the tree growing

process and the error estimation in the pruning process. That Wj(t) is the result of weighting

the initial number of instances from a class with the weight computed in Eq.(1) Wj(t)=w(j)·

Nj(t) Thus, both processes are affected due to this change.

C4.5CS also introduces another optional modification that alters the usual classification

process after creating the decision tree. Instead of classifying using the minimum error criteria,

it is advisable to classify using the expected misclassification cost in the last part of the

classification procedure. The expected misclassification cost for predicting class i with respect

to the instance x is given by

EC_i(x) \propto \sum_j W_j(t(x))\,cost(j, i)        (3.4)

where t(x) is the leaf of the tree that instance x falls into and Wj(t) is the total weight of the class j training instances in node t.
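As a small numerical illustration of Eq. (3.3) (the class counts and costs below are invented for the example), the following snippet computes the per-class weights and checks that the weighted instance counts still sum to N.

    import numpy as np

    def c45cs_weights(counts, costs):
        """Per-class instance weights w(j) = C(j)*N / sum_i C(i)*N_i (Eq. 3.3)."""
        counts = np.asarray(counts, dtype=float)   # N_j: number of instances of class j
        costs = np.asarray(costs, dtype=float)     # C(j): cost of misclassifying class j
        N = counts.sum()
        return costs * N / np.sum(costs * counts)

    counts = [950, 50]            # imbalanced data: majority vs. minority class
    costs = [1.0, 10.0]           # misclassifying the minority class is 10 times costlier
    w = c45cs_weights(counts, costs)
    print(w, np.dot(w, counts))   # the weighted counts still sum to N = 1000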


CHAPTER 4

EVALUATION MEASURES


Evaluation measures

Evaluation measures play a crucial role in both assessing the classification performance and

guiding the classifier modeling. Traditionally, accuracy is the most commonly used measure

for these purposes.

Accuracy = \frac{TP + TN}{TP + FP + FN + TN}        (4.1)

However, for classification with the class imbalance problem, accuracy is no longer a

proper measure since the minority class has very little impact on accuracy as compared to the

majority class. For example, in a problem where a minority class is represented by only 1% of

the training data, a simple strategy can be to predict the majority class label for every example.

It can achieve a high accuracy of 99%. However, this measurement is meaningless to some

applications where the learning concern is the identification of the minority cases. Therefore,

other metrics have been developed to assess classifier performance for imbalanced datasets. A

variety of common metrics are defined based on the confusion matrix. A two-by-two

confusion matrix is shown in Table 4.1.

The four counts, which constitute a confusion matrix (as seen in Table 4.1) for binary

classification are: the number of correctly recognized positive class examples (true positives),

the number of correctly recognized examples that belong to the negative class (true negatives),

and examples that either were incorrectly assigned to the positive class (false positives) or that

were not recognized as positive class examples (false negatives).

Table 4.1 Confusion matrix for performance evaluation.

                                     Predicted class (expectation)
                                     Positive                Negative
Actual class       Positive         True positive (TP)      False negative (FN)
(observation)      Negative         False positive (FP)     True negative (TN)


Among the various evaluation criteria, the measures most relevant to imbalanced data are sensitivity, specificity, the geometric mean (G-mean), the ROC curve, AUC and MCC. These metrics share a commonality in that they are all class-independent measures.

4.1 Sensitivity, Specificity and Geometric mean

These measures are utilized when performance of both classes is concerned and expected to

be high simultaneously. The geometric mean (G-mean) metric was suggested in Kubat and

Matwin [76] and has been used by several researchers for evaluating classifiers on imbalanced

data sets [77] [78]. G-mean indicates the balance between classification performance on the

majority and minority class. This metric takes into account both the sensitivity, (the accuracy

on the positive examples) and the specificity (the accuracy on the negative examples):

Sensitivity = \frac{TP}{TP + FN}        (4.2)

Specificity = \frac{TN}{TN + FP}        (4.3)

G-mean = \sqrt{Sensitivity \times Specificity}        (4.4)

4.2 Type I and Type II errors

Ideally, a perfect system would be described as having 100% sensitivity and 100% specificity. However, two types of errors, type I and type II errors, often occur.

Type I error =1-Specificity (4.5)

Type II error =1-Sensitivity (4.6)

The type I error is the rate at which a model incorrectly classifies insolvent customers as healthy ones. When this happens, the company is exposed to high credit risk. From a theoretical point of view, it is better to utilize classification models with a lower type I error. In contrast, the type II error is the rate at which healthy customers are classified as insolvent ones. In practice it is also of great importance to achieve an appropriate balance between type I and type II errors so as not to lose potentially healthy customers.

4.3 Integrated performance measures

(1) ROC and AUC.


The receiver operating characteristic (ROC) and the area under the ROC curve (AUC) are

two most common measures for assessing the overall classification performance [79]. The

ROC is a graph showing the relationship between benefits (correct detection rate or true

positive rate) and costs (false detection rate or false positive rate) as the decision threshold

varies. The ROC curve shows that for any classifier, the true positive rate cannot increase

without also increasing the false positive rate.

A ROC curve gives a visual indication if a classifier is superior to another classifier, over a

wide range of operating points. However, a single metric is sometimes preferred when

comparing different classifiers. The area under the ROC curve (AUC) is employed to

summarize the performance of a classifier into a single metric. The AUC does not place more

weight on one class over another. The larger the AUC, the better is the classifier performance.

It can be defined as the arithmetic average of the mean predictions for each class.

AUC = \frac{Sensitivity + Specificity}{2}        (4.7)

(2) MCC

The Matthews correlation coefficient [20] is used in machine learning as a measure of the

quality of binary (two-class) classifications. It takes into account true and false positives and

negatives and is generally regarded as a balanced measure which can be used even if the

classes are of very different sizes.

MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}        (4.8)

If any of the four sums in the denominator is zero, the denominator can be arbitrarily set to one, which results in an MCC of zero. There are situations, however, in which the MCC is not a reliable performance measure. For instance, the MCC will be relatively high in cases where a classification model gives very few or no false positives but at the same time very few true positives.
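All of the measures used in this chapter can be computed directly from the four confusion-matrix counts, as in the sketch below (the counts passed in the example call are arbitrary).

    import math

    def imbalance_metrics(tp, fn, fp, tn):
        """Evaluation measures of Eqs. (4.1)-(4.8) from the confusion-matrix counts."""
        accuracy = (tp + tn) / (tp + fn + fp + tn)
        sensitivity = tp / (tp + fn)                  # accuracy on the positive examples
        specificity = tn / (tn + fp)                  # accuracy on the negative examples
        g_mean = math.sqrt(sensitivity * specificity)
        type_i = 1.0 - specificity
        type_ii = 1.0 - sensitivity
        auc = (sensitivity + specificity) / 2.0       # the approximation of Eq. (4.7)
        denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
        return dict(accuracy=accuracy, sensitivity=sensitivity, specificity=specificity,
                    g_mean=g_mean, type_i=type_i, type_ii=type_ii, auc=auc, mcc=mcc)

    print(imbalance_metrics(tp=90, fn=10, fp=30, tn=870))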


CHAPTER 5

DATA SETS


Data sets

5.1 The credit datasets in a small company

5.1.1 Credit assessment problem

The credit datasets are available from a small company where the main business is selling

school uniforms and accessories at wholesale. There are 20 employees in the company, and annual sales are about 600 million Japanese yen. Orders come from about 800 customers, and

these customers are classified into three types: retailers, schools and others, as shown in Table

5.1.

The customers’ credit has been assessed through a four-grade credit score:

Score of one: a healthy customer for which all orders are accepted.

Score of two: a customer for which orders are accepted and limited to a given amount.

Score of three: a customer for which orders are accepted only in a cash sale.

Score of four: an insolvent customer for which all orders are rejected.

5.1.2 Features of the customers

For the company, most of the customers are very small businesses that do not disclose financial information, and it is almost impossible to obtain their financial data. It is also frequently difficult to ask an agency to evaluate customers' credit due to a limited budget.

Table 5.1 Types of customers

Type      Description
Retailer  Co-ops or retailers to whom products are usually sold on credit.
School    Nominal customers used to handle the sales made directly to the students of each school at the beginning of a school year.
Other     Nominal customers used to handle over-the-counter sales or orders coming from the sales team, students' circles or clubs, and any other associations.


For these reasons, we collected the following seven features from daily transactions, which can be used by small businesses to assess customers' credit.

Type of customers.

Average amount of overdue payment in the year considered.

Maximum overdue days for all overdue payments in the year considered.

Number of times that overdue payment occurs in the year considered.

Total sales in the year considered.

Rate of the average amount of overdue payment of the total sales.

Number of transaction months in which any order from the customer is fulfilled in the

year considered.

This characteristic data can be extracted from the database of small-business management

information systems.

5.1.3 Data sets summary

We collected the data from the financial years 2001 to 2003 and summarized the distribution of customers for each credit score, as Table 5.2 shows.

From Table 5.2, it can be seen that the customers in the financial years 2001 and 2002 with a score of two or three each accounted for only 0.4% of the total, and the customers in the financial year 2003 with a score of two accounted for only 0.2% of the total and those with a score of three for 0.6%, much less than the customers with a score of one. According to previous studies, the traditional methods used for credit scoring do not work well to identify the customers with a score of two, three or four in the minority class.

Table 5.2 Number of customers

Credit score   2001 financial year   2002 financial year   2003 financial year
1              474 (95.2%)           469 (95.1%)           450 (96.4%)
2                2 ( 0.4%)             2 ( 0.4%)             1 ( 0.2%)
3                2 ( 0.4%)             2 ( 0.4%)             3 ( 0.6%)
4               20 ( 4.0%)            20 ( 4.1%)            13 ( 2.8%)
Total          498 (100%)            493 (100%)            467 (100%)


5.2 German credit data set

The other data set chosen for our study is the German credit data set. This is an open data set that is available from the UCI Repository (http://www.ics.uci.edu/~mlearn/MLRepository.html) and has been used in many previous studies as a benchmarking problem.

The German Credit data set contains observations on 30 variables for 1000 past applicants

for credit. Each applicant was rated as “good credit” (700 cases) or “bad credit” (300 cases)

(encoded as 1 and 0 respectively in the Response variable). All the variables are explained in

Table 5.3.

(Note: The original data set had a number of categorical variables, some of which have been

transformed into a series of binary variables so that they can be appropriately handled by our

study).

Table 5.3 Variables for the German Credit data

Var. #  Variable name      Description                                    Type / values
1.      OBS#               Observation No.                                Categorical
2.      CHK_ACCT           Checking account status                        Categorical (0: < 0 DM; 1: 0 < ... < 200 DM; 2: >= 200 DM; 3: no checking account)
3.      DURATION           Duration of credit in months                   Numerical
4.      HISTORY            Credit history                                 Categorical (0: no credits taken; 1: all credits at this bank paid back duly; 2: existing credits paid back duly till now; 3: delay in paying off in the past; 4: critical account)
5.      NEW_CAR            Purpose of credit: car (new)                   Binary (0: No, 1: Yes)
6.      USED_CAR           Purpose of credit: car (used)                  Binary (0: No, 1: Yes)
7.      FURNITURE          Purpose of credit: furniture/equipment         Binary (0: No, 1: Yes)
8.      RADIO/TV           Purpose of credit: radio/television            Binary (0: No, 1: Yes)
9.      EDUCATION          Purpose of credit: education                   Binary (0: No, 1: Yes)
10.     RETRAINING         Purpose of credit: retraining                  Binary (0: No, 1: Yes)
11.     AMOUNT             Credit amount                                  Numerical
12.     SAV_ACCT           Average balance in savings account             Categorical (0: < 100 DM; 1: 100 <= ... < 500 DM; 2: 500 <= ... < 1000 DM; 3: >= 1000 DM; 4: unknown/no savings account)
13.     EMPLOYMENT         Present employment since                       Categorical (0: unemployed; 1: < 1 year; 2: 1 <= ... < 4 years; 3: 4 <= ... < 7 years; 4: >= 7 years)
14.     INSTALL_RATE       Installment rate as % of disposable income     Numerical
15.     MALE_DIV           Applicant is male and divorced                 Binary (0: No, 1: Yes)
16.     MALE_SINGLE        Applicant is male and single                   Binary (0: No, 1: Yes)
17.     MALE_MAR_WID       Applicant is male and married or a widower     Binary (0: No, 1: Yes)
18.     CO-APPLICANT       Application has a co-applicant                 Binary (0: No, 1: Yes)
19.     GUARANTOR          Applicant has a guarantor                      Binary (0: No, 1: Yes)
20.     PRESENT_RESIDENT   Present resident since (years)                 Categorical (0: <= 1 year; 1: 1 < ... <= 2 years; 2: 2 < ... <= 3 years; 3: > 4 years)
21.     REAL_ESTATE        Applicant owns real estate                     Binary (0: No, 1: Yes)
22.     PROP_UNKN_NONE     Applicant owns no property (or unknown)        Binary (0: No, 1: Yes)
23.     AGE                Age in years                                   Numerical
24.     OTHER_INSTALL      Applicant has other installment plan credit    Binary (0: No, 1: Yes)
25.     RENT               Applicant rents                                Binary (0: No, 1: Yes)
26.     OWN_RES            Applicant owns residence                       Binary (0: No, 1: Yes)
27.     NUM_CREDITS        Number of existing credits at this bank        Numerical
28.     JOB                Nature of job                                  Categorical (0: unemployed/unskilled, non-resident; 1: unskilled, resident; 2: skilled employee/official; 3: management/self-employed/highly qualified employee/officer)
29.     NUM_DEPENDENTS     Number of people for whom liable to provide maintenance   Numerical
30.     TELEPHONE          Applicant has phone in his or her name         Binary (0: No, 1: Yes)
31.     FOREIGN            Foreign worker                                 Binary (0: No, 1: Yes)
32.     RESPONSE           Credit rating is good                          Binary (0: No, 1: Yes)


CHAPTER 6

A TWO-STAGE DATA RESAMPLING METHOD FOR CREDIT

SCORING


A two-stage data resampling method

6.1 Background and purpose of this study

Classification with imbalanced data sets has posed a serious challenge for credit scoring researchers in recent years. The main handicap is that the number of insolvent customers is much smaller than the number of healthy ones. As a result, the classifier tends to favor the healthy customers of the majority class. In other words, healthy customers can be over-learned by the model and therefore be identified with high accuracy, whereas insolvent customers of the minority class cannot be identified correctly. However, in real business it is more important to identify insolvent customers in order to minimize credit risk. Thus, improving the classification performance on insolvent customers in the minority class became a new challenge for us.

Several researchers have tried to address these problems over past decades. In general,

there are two approaches used to tackle the problem of extremely imbalanced data.

(1) Data Sampling

The training samples are modified in such a way as to produce a more balanced class

distribution that allow classifiers to perform in a similar manner to standard classification.

Typical sampling methods include over-sampling and under-sampling [80] [81] that modify

the prior probability of the majority and minority class in the training set to obtain a more

balanced number of instances in each class.

The under-sampling method extracts a smaller set of majority instances while preserving all

the minority instances. This method is suitable for large-scale application where the number

of majority samples is huge and lessening the training instances reduces the training time and

makes the learning problem more tractable. However, one problem associated with under-sampling techniques is that we may lose information when discarding instances.

In contrast to under-sampling, the over-sampling method increases the number of minority

instances by over-sampling them. The advantage is that no information is lost from the


training samples because all instances are employed. However, the minority instances are over-represented in the training set, which moreover increases the training time.

(2) Algorithmic Modification

This approach is oriented towards the adaptation of base learning methods to be more

attuned to class imbalance data [82]. Substantial work has gone into making individual

algorithms cost-sensitive. Cost-sensitive approaches assign a high cost to misclassification of

the minority class, and try to minimize the overall cost [83][84]. Cost-sensitive learning plays

an important role in real-world data mining applications. Turney [85] provided a

comprehensive survey of a large variety of different types of costs in data mining and

machine learning, including misclassification costs, data acquisition costs, active learning

costs, computation costs, human-computer interaction costs, and so on. The misclassification

cost is singled out as the most important cost, and it has also been the most studied in recent

years.

Although much research about the class imbalance problem has been reported, some

challenging problems still remain.

(1) Data sampling is an approach to produce a more balanced learning data set. As the under-sampling method extracts a smaller set of majority instances, some information of the majority class is lost. Furthermore, it is very difficult to determine the correct class distribution for a learning algorithm, or an appropriate resampling strategy that avoids losing information of the majority class under under-sampling and over-representing the minority class under over-sampling.

(2) Previous research focuses on either resampling techniques or algorithmic modifications.

However, the effectiveness of any learning algorithm is influenced by the construction

method of the learning data set, so it is necessary to consider learning algorithms and

resampling techniques simultaneously.

(3) Most papers published so far have used some benchmark data sets to confirm the

effectiveness: there are very few real-world applications that have been reported. As

algorithms which are effective for benchmark data sets are not necessarily effective in

real-world applications, it is important to make an attempt to solve practical class

imbalance problems and provide some insights or experiences about solving real-world

problems.


This study aims to solve a real small-business credit assessment problem and make some

new contributions for dealing with class imbalanced data sets from the following three

viewpoints.

(1) When using under-sampling methods to resample instances of the majority class, it is

unavoidable that some useful information of the majority class is lost. In order to avoid

this information loss, we propose a two-stage data resampling method to reduce the

sample size of the majority class whose information cannot be reflected in the under-

sampling results.

(2) Instead of focusing on either resampling techniques or algorithmic modifications, we

try to propose a new learning approach of performing algorithmic modification and

data resampling at the same time. That is, we use k-means algorithms and the k-nearest

neighbor method for resampling class imbalanced data sets and generate two training

data sets. Meanwhile, we classify healthy and insolvent customers through a hybrid

method of the k-nearest neighbor and random forest methods.

(3) This study has dealt with a credit scoring problem in a small-scale student dress

wholesale company and proposed some new approaches to assess the customers’ credit

only based on characteristic data that can be easily retrieved from daily transaction data

[86] [87]. These approaches are suitable to be applied to many organizations where the

customers do not disclose their financial data and have an advantage of a lower cost

data collection compared to other methods. However, we are having difficulties in

improving the accuracy of identifying insolvent customers. The emphasis of this study is to provide a new approach based on class imbalance learning and to construct a system that identifies the insolvent customers of the minority class with as high accuracy as possible.

6.2 System design

Here, we propose a two-stage data resampling method to generate two balanced training

data sets. Similar to other under-sampling methods, at the first stage we perform under-

sampling through clustering the majority class of customers using a k-means algorithm. At the

second stage, in order to avoid information loss, we execute a pre-classification to pick up

customers of the majority class whose information cannot be reflected in the under-sampling

results of the first stage.


6.2.1 Scheme system

The proposed approach classifies imbalanced data sets in two steps, as Figure 6.1 shows. The first step is to generate two training data sets and the second is to construct two classifiers based on these training data sets to classify new customers.

6.2.2 Training data generating

Let T= {S, M} be an imbalanced data set, where S= {s1, s2, …, sL} is the set of customers in

the minority class and M={m1, m2 ,…, mN} is the set of customers in the majority class. L and

N are the numbers of customers in the minority and majority classes, respectively; in addition, L < N.

[First stage]

(1) For the customer mi (i =1, 2, …, N) of the majority class, we use the k-means algorithm

to generate k cluster means or centers. These k cluster means are defined as the seeds

of the majority class and put into set E.

(2) Combining the customers belonging to minority class S and set E, we generate a new

training set T1: T1= {S, E}.

Through the operation of the first stage, the N customers of the majority class are clustered into k clusters. We can choose an appropriate k close to the size of the minority class S, so that the training set T1 is a well-balanced one.

[Second stage]
(1) Based on T1 = {S, E}, we can classify the customers of the majority class M using the

1-nearest neighbor algorithm. If a customer mj of the majority class is classified wrongly into

minority class S, then we put it into set H.

(2) Combining the customers of the minority class and set H, another training set T2 can be generated as T2 = {S, H}.

Figure 6.1 System scheme (generate two training data sets by k-means and k-NN; learning and classification by two classifiers).

The training data generating process is shown in Figure 6.2.

As the first stage aims at generating a balanced data set through under-sampling of the

majority class, it can discard data potentially important for the classification process. Hence,

we perform the operation of the second stage so the customers of the majority class whose

information cannot be reflected in the training set T1 will be picked up again in training set T2.

It is clear that our approach can not only perform under-sampling but also avoid information loss.
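Assuming the majority and minority customers are given as NumPy feature arrays M and S, the two-stage procedure can be sketched as follows (scikit-learn's KMeans and a brute-force 1-nearest-neighbor search stand in for the clustering and pre-classification steps; all names and the label coding, 1 for the minority class and 0 for the majority class, are illustrative).

    import numpy as np
    from sklearn.cluster import KMeans

    def two_stage_resampling(M, S, k):
        """Return the two training sets T1 = {S, E} and T2 = {S, H} described above."""
        # First stage: the k cluster means of the majority class are its seeds E
        E = KMeans(n_clusters=k, n_init=10, random_state=0).fit(M).cluster_centers_
        X1 = np.vstack([S, E])                        # training set T1
        y1 = np.array([1] * len(S) + [0] * len(E))    # 1 = minority, 0 = majority

        # Second stage: 1-NN on T1; majority samples misclassified as minority form H
        H = []
        for m in M:
            nearest = np.argmin(np.linalg.norm(X1 - m, axis=1))
            if y1[nearest] == 1:                      # wrongly classified into the minority class
                H.append(m)
        H = np.array(H) if H else np.empty((0, M.shape[1]))
        X2 = np.vstack([S, H])                        # training set T2
        y2 = np.array([1] * len(S) + [0] * len(H))
        return (X1, y1), (X2, y2)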

6.2.3 Learning and classification

As shown in Figure 6.3, the learning and classification are performed as follows.

[Step 1]

Firstly, based on training data set T1, we construct a preliminary classifier C1 where the

1-NN (k-nearest neighbor, k=1) algorithm was used as the classification method.

[Step 2]

Based on training set T2 and using the random forest method, we construct the second

classifier C2.

[Step 3]

When a new customer is given, the preliminary classifier C1 is first applied to classify it. If the customer is classified into the majority class, then it is determined to be a healthy customer. On the contrary, if the customer is classified into the non-majority (minority) class, the second classifier C2 is applied to reclassify it, so as to decide finally whether it belongs to the minority class (an insolvent customer) or to the majority class.

Figure 6.2 Generating the training data (first stage: k-means on the majority class M yields the seeds E and training set T1; second stage: 1-nearest-neighbor pre-classification yields H, the customers of M misclassified into S, and training set T2).
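Under the same illustrative label coding (0 = majority/healthy, 1 = minority/insolvent), Steps 1 to 3 can be sketched with scikit-learn classifiers as follows; the 1-NN and random forest settings mirror those named in the text, but the code itself is only a sketch built on the assumed training sets T1 and T2.

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.ensemble import RandomForestClassifier

    def fit_cascade(T1, T2):
        """C1: 1-NN trained on T1; C2: random forest trained on T2."""
        (X1, y1), (X2, y2) = T1, T2
        c1 = KNeighborsClassifier(n_neighbors=1).fit(X1, y1)
        c2 = RandomForestClassifier(n_estimators=10, max_features=4,
                                    random_state=0).fit(X2, y2)   # 10 trees, 4 features per node
        return c1, c2

    def predict_cascade(c1, c2, x):
        """Step 3: C1 decides first; only suspected minority cases are re-checked by C2."""
        if c1.predict([x])[0] == 0:       # classified into the majority class -> healthy
            return 0
        return c2.predict([x])[0]         # final decision by the second classifier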

Compared to other research, the proposed approach constructs and applies two types of

classifiers C1 and C2, and has the following characteristics.

(1) Since the insolvent customers belonging to the minority class S have been included in

both training data sets T1 and T2, and these two training data sets are balanced ones,

insolvent customers can be well represented in the two classifiers C1 and C2.

Furthermore, the second classifier C2 used the random forest method as the base

classifier, and the random forest method has been reported as being able to train the

imbalanced data effectively [88]. For these reasons, the performance of classifying

customers of the minority class can be improved.

(2) As the k cluster means of the majority class were included in the training data set T1,

new customers that are near to a seed of the majority class can be classified correctly

through preliminary classifier C1. In addition to this, other customers that could not be

represented in the seeds of the majority class have been included in the training data set T2, and new customers that are not near to a seed of the majority class can be classified correctly through the second classifier C2. Therefore, the customers of the majority class can be expected to be identified with high accuracy.

Figure 6.3 Proposed approach (a new customer is first classified by the preliminary classifier C1, a 1-nearest-neighbor classifier trained on T1; if it is not assigned to the majority class, the second classifier C2, a random forest trained on T2, makes the final decision).

6.3 The application for a real credit scoring problem

In order to confirm the effectiveness of the proposed approach and give a real-world

application, this study applies the proposed approach to the credit scoring problem in a small-

scale student dress wholesale company.

As the original training data, we collected the feature data from the customers of the company in the 2001 financial year and then generated the two training data sets T1 and T2 from these data. When we used the k-means algorithm to generate the seeds of the majority class and training data set T1, k was set to 4.

Training data set T1 was used to construct the preliminary classifier C1, for which the 1-NN (k-nearest neighbor, k = 1) algorithm was used. Training data set T2 was used to construct the second classifier C2, for which the number of trees was set to 10 and the number of features selected at each node to 4.

As new customers, we chose every customer in the financial years 2002 and 2003, and decided a new credit score by applying the proposed approach shown in Figure 6.3. These new credit scores were compared with those given by the financial managers of the company. The predicted results for the 2002 customers are shown in Table 6.1.

Table 6.1 Prediction results for 2002 customers.

Number of customers                     Credit score provided by our approach
Credit scores given by the                1      2      3      4     Hit rate
financial managers               1      467      1      0      1     99.6%
                                 2        0      2      0      0    100.0%
                                 3        0      0      2      0    100.0%
                                 4        2      0      0     18     90.0%

From Table 6.1, the predictions for the customers with scores of two and three are 100% in agreement with the judgments of the financial managers of the company. For the customers with a score of four, 18 of 20 are correctly predicted. According to these results, it is clear that our system has a very high ability to classify the minority class.

The predicted results for the 2003 financial year are shown in Table 6.2.


Table 6.2 Prediction results for 2003 customers.

Number of customers                     Credit score provided by our approach
Credit scores given by the                1      2      3      4     Hit rate
financial managers               1      444      6      0      0     99.0%
                                 2        0      1      0      0    100.0%
                                 3        0      2      1      0     33.0%
                                 4        6      0      0      7     54.0%

From Table 6.2, the credit scores of healthy customers (score = 1) provided by the system are 99% in agreement with the judgments of the financial managers of the company. The hit rate of the customers with a score of 3 is 33% and the hit rate of the customers with a score of 4 is 54%.

6.4 Performance comparison

To clarify the performance and effectiveness of our approach, we compare the proposed

approach with the k-nearest neighbor algorithm and random forest method.

As described above, we also chose every customer in the 2001 financial year as the training data and constructed two single classifiers using the k-nearest neighbor algorithm and the random forest method, respectively. The k-nearest neighbor algorithm was applied with k = 1 (1-NN) using the Weka IBk classifier [88]. For the random forest method, the number of trees was set to 10 and the number of features selected at each node to four, both in the proposed method and in the single random forest.

The data of every customer in the financial years 2002 and 2003 were used as the test data, new credit scores were decided by these two single classifiers respectively, and the scores were then compared with those given by the financial managers of the company. The comparison results are shown in Table 6.3 and Table 6.4.

Table 6.3 Classification results of the 2002 financial year using 1-NN and random forest.

Number of customers                  Credit score given by 1-NN    Credit score given by RF
Credit score given by                   1        >1                   1        >1
the financial managers        1       463         6                 469         0
                             >1        13        11                   6        18


Table 6.4 Classification results of the 2003 financial year using 1-NN and random forest.

Number of customers                  Credit score given by 1-NN    Credit score given by RF
Credit score given by                   1        >1                   1        >1
the financial managers        1       444         6                 447         3
                             >1         8         9                  13         4

(1) Comparison of Specificity

A main purpose of our study is to improve the performance in identifying insolvent customers of the minority class. Thus the specificity, which relates to the ability to identify the minority class, has been assessed based on the results shown in Table 6.3 and Table 6.4; the comparison results are shown in Figure 6.4 and Figure 6.5.

As Figure 6.4 shows, the specificity obtained by the proposed approach is 92% and it

performs better than the random forest (75%) and k-nearest neighbor (46%) methods. It is clear that the performance of classifying customers of the minority class was improved significantly by our approach.

Figure 6.4 Comparison of specificity in the financial year of 2002 (k-nearest neighbor: 0.46; random forest: 0.75; proposed method: 0.92).

Figure 6.5 Comparison of specificity in the financial year of 2003 (k-nearest neighbor: 0.53; random forest: 0.24; proposed method: 0.53).

As Figure 6.5 shows, the specificity obtained by the proposed approach and by the single k-nearest neighbor are the same (53%); however, both perform better than the random forest (24%).

(2) Comparison of Type I and Type II errors

In the field of credit scoring, type I and type II errors are very important criteria for

evaluating the performance of credit scoring models. The type I and type II errors of the proposed approach, the k-nearest neighbor algorithm and the random forest can be calculated based on Tables 6.3 and 6.4; the comparison results for the financial year of 2002 are shown in Table 6.5.

Table 6.5 Comparison of type I and type II errors in financial year of 2002.

Methods Type I Type II

k-nearest neighbor 54.0% 1.3%

Random forest 25.0% 0.0%

Proposed method 8.3% 0.4%

From Table 6.5, the following points are clarified:

The type II errors range from 0 to 1.3%, and all three methods showed a very low error in identifying healthy customers of the majority class. This result arises because the number of healthy customers is very large and their features can be learnt sufficiently by the models.

Among the three methods compared here, the proposed approach provided the lowest type I error (8.3%) and showed a big difference with respect to the k-nearest neighbor (54%) and random forest (25%) methods. This shows that the proposed approach is superior to the single classifiers using the k-nearest neighbor algorithm and the random forest method in controlling type I errors.

Unlike the single classifiers based on the k-nearest neighbor algorithm and the random forest method, the proposed approach can control type I and type II errors at the same time. In other words, it not only correctly identifies healthy customers of the majority class, but also classifies customers of the minority class with a very low error rate.

In the same way, the type I and type II errors for the financial year of 2003 can be calculated based on Table 6.4. The comparison results for the financial year of 2003 are shown in Table 6.6.


Table 6.6 Comparison of type I and type II errors in financial year of 2003.

Methods Type I Type II

k-nearest neighbor 47.1% 1.3%

Random forest 76.5% 0.7%

Proposed method 47.1% 1.3%

As shown in Table 6.6, the k-nearest neighbor and the proposed method gave the lowest type I error for classification and prediction, but the random forest gave a comparatively high type I error. These high type I errors arose from the fact that the number of insolvent customers is much smaller than that of healthy ones, and thus the models are over-learned on healthy customers.

(3) Comparison of integrated performance

In order to clarify the integrated performance of the proposed approach, the G-mean, AUC and MCC are calculated and compared with those of the k-nearest neighbor algorithm and the random forest method. The comparison results are shown in Figure 6.6 and Figure 6.7.

From Figure 6.6, it is obvious that our method outperformed the k-nearest neighbor algorithm and the random forest method in all three integrated performance measures G-mean, AUC and MCC, with values of more than 91%. This result proves that our method has a higher ability to identify both healthy customers of the majority class and insolvent customers of the minority class than the 1-NN and random forest methods.

Figure 6.6 Comparison of G-mean, AUC and MCC in the financial year of 2002 (G-mean/AUC/MCC: k-nearest neighbor 67%/72%/53%; random forest 87%/88%/86%; proposed method 96%/96%/91%).


From Figure 6.7, it is obvious that our method outperformed the random forest method in all three integrated performance measures G-mean, AUC and MCC, and has the same performance as the k-nearest neighbor.

Figure 6.7 Comparison of G-mean, AUC and MCC in the financial year of 2003 (G-mean/AUC/MCC: k-nearest neighbor 72%/76%/55%; random forest 48%/61%/35%; proposed method 72%/76%/55%).

6.5 Concluding Remarks

Prior to this study, we proposed some new approaches using statistical methods and case-based reasoning (CBR) to deal with the customers' credit assessment problem in this company [86][87]. Furthermore, in order to consider class imbalance problems and improve the accuracy of identifying insolvent customers, we also proposed a credit assessment system using bagging [89][90].

Although insolvent customers could be identified with very high accuracy, we have to improve our study further in order to find more effective and more efficient methods to solve the customers' credit assessment problem based on less characteristic data. According to this motivation, this chapter makes an attempt to solve the customers' credit assessment problem by proposing a new approach based on class imbalance learning. Different from the existing techniques, we mainly make the following two contributions.

(1) As a unique resampling method, we used the k-means algorithm to perform under-sampling on the majority class of customers. Then, in order to avoid losing information, we introduced a pre-classification to pick up those customers of the majority class whose information could not be reflected in the previous under-sampling result. As a result, we generated two training data sets.

(2) The proposed method was applied to solve a credit assessment problem in a small

company. As demonstrated by the practical credit scoring problems of the company, it

was clarified that the proposed approach can identify insolvent customers more

effectively than single classifiers based only on either the k-nearest neighbor algorithm

or random forest method.

Through the discussion of this study, it was confirmed that approaches and methods for dealing with class-imbalanced data sets can be applied to solve practical credit assessment problems. However, it is also important to decide the system's structure and parameters so as to obtain good performance. As our approach assesses small business credit scores based only on daily transaction data, it has the advantage over other approaches using financial data in that it can be applied to assess a wide variety of businesses.


CHAPTER 7

AN ADAPTIVE AND HIERARCHICAL SYSTEM FOR CREDIT

SCORING


An adaptive and hierarchical system

7.1 The purpose of this study

As stated in chapter 6, the proposed system can identify insolvent customers more effectively than single classifiers based only on either the k-nearest neighbor algorithm or the random forest method. However, it remains a challenging issue to identify insolvent customers with as high accuracy as possible.

In this chapter, we again deal with a credit scoring problem in a small-scale student dress wholesale company and aim at proposing an adaptive and hierarchical system to solve the credit assessment problem; the emphasis is put on identifying insolvent customers with as high accuracy as possible. The proposed system can adaptively choose the best method from neural networks and decision trees, based on the accuracy of identifying the customers of each credit score. The performance and effectiveness of the proposed system have been demonstrated by applying it to the real problems of the company.

7.2 The concept and scheme of the system

Although a great many models and methods for credit assessment have been published so far, each method has a different ability when identifying customers with different credit scores. That is, a method may correctly identify the healthy customers but, on the other hand, fail to identify any insolvent ones. It is therefore reasonable to group the customers according to their credit scores and to choose the best method to identify the customers of each group.

Meanwhile, the performance of most credit scoring systems depends on the type and quantity of the data needed for decision making. It is meaningful to extract the most important variables or features according to the type of problem and/or customers. Furthermore, we have found that the key feature data for identifying healthy customers differ from those for identifying


insolvent ones. However, almost all models or methods reported so far apply the same data set and assign the same weight to each feature when assessing the credit scores of all customers. It can therefore be expected that accuracy will improve if different weights are assigned to the feature data when identifying customers belonging to different groups.

Based on these considerations, we propose an adaptive and hierarchical system for small businesses' credit assessment, as shown in Figure 7.1. The key points of this system are as follows:

(1) The customers are divided into m groups according to their credit scores, where m is the number of grades of the credit score.

(2) The credit score of a customer is assessed through a hierarchical system of m-1 layers, in which the classifiers are arranged hierarchically. In each layer, only one group of customers is to be identified.

(3) In each layer, two kinds of classifiers are provided: one is a neural network classifier and the other is a decision tree classifier. One of these two classifiers is chosen adaptively, according to the expected probability or correctness of identification, to identify one group of customers.

(4) A tuning algorithm is proposed in the following section to decide the best sequence, i.e. the layer in which each group of customers should be identified, so as to achieve the highest correctness.

Figure 7.1 The adaptive and hierarchical system (the training data set drives a hierarchy of layers; in each layer either a neural network (NN) or a decision tree (DT) is chosen to assign one of the credit scores s1, s2, ..., sm-1, sm to the target customer).


The proposed system of Figure 7.1 is similar to ensemble learning systems in that it uses

multiple classifiers. However, it differs from ensemble learning systems mainly in the

following aspects:

(1) When constructing an ensemble learning system, the number of classifiers is a parameter to be decided optimally. In our proposed system, the number of classifiers is defined by the number of customer groups and equals m-1.

(2) Typical bagging or stacking ensemble systems are constructed through a parallel scheme where multiple classifiers of the same type are arranged in parallel form. As shown in Figure 7.1, the proposed system has a hierarchical structure where m-1 classifiers are arranged hierarchically.

(3) Each classifier in ensemble systems is used to classify all groups of customers, but in

our proposed system, one classifier is used to identify only one group of customers,

and therefore it is not necessary to introduce any weighting algorithm or meta-level

classifier to combine the predictions from an ensemble of diverse classifiers.

7.3 Systematic Constructing Procedure

In order to identify all the groups of the customers with the highest accuracy, the system

structure should be decided optimally.

Let the customers' credit be assessed by an m-grade score s = 1, 2, ..., m, and let the group of customers with score s be denoted by C_s (s = 1, 2, ..., m). Then we can construct an (m-1)-layer system through the following procedure, where at layer L the group of customers with score s_L (L = 1, 2, ..., m-1) is to be identified.

[Step 1] Set the layer number as L = 1 and the learning data set as C = C_1 ∪ C_2 ∪ ... ∪ C_m.

[Step 2] Using the learning data set C, execute an appropriate learning program to train the neural network and construct the decision tree. After the learning, the classification accuracy for identifying the customers of group C_t (t ∈ {s_L, s_{L+1}, ..., s_m}) is calculated as

    R_t = (the number of customers of group C_t which are also classified into group C_t) / (the total number of customers of group C_t)

To distinguish the accuracy obtained by the neural network from that obtained by the decision tree, we denote the R_t obtained by the neural network (NN) as R_t^NN and that obtained by the decision tree (DT) as R_t^DT.

[Step 3] Find the best accuracies R^NN and R^DT as

    R^NN = max{ R_t^NN , t ∈ {s_L, s_{L+1}, ..., s_m} }    (7.1)

    R^DT = max{ R_t^DT , t ∈ {s_L, s_{L+1}, ..., s_m} }    (7.2)

If R^NN ≥ R^DT, the neural network is chosen as the classifier of layer L and, meanwhile, the customers' group C_{s_L} is chosen as the group to be identified in layer L, where

    s_L = arg max_t { R_t^NN , t ∈ {s_L, s_{L+1}, ..., s_m} }    (7.3)

Otherwise, the decision tree is chosen as the classifier of layer L and, in the same way as above, the customers of group C_{s_L} are identified, where

    s_L = arg max_t { R_t^DT , t ∈ {s_L, s_{L+1}, ..., s_m} }    (7.4)

[Step 4] If L = m-1, the procedure is finished. Otherwise, set C = C − C_{s_L}, let L = L+1 and go back to Step 2.
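The following is a minimal sketch of this constructing procedure, under the assumption that scikit-learn's MLPClassifier and DecisionTreeClassifier stand in for the back-propagation network and J4.8 decision tree used later in this chapter; X (a NumPy feature matrix), y (integer credit scores 1..m) and the per-score accuracies computed on the training data are hypothetical illustrations of the quantities R_t defined above.

    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.tree import DecisionTreeClassifier

    def per_score_accuracy(clf, X, y, scores):
        """R_t: fraction of customers of score t that the classifier also assigns to t."""
        pred = clf.predict(X)
        return {t: np.mean(pred[y == t] == t) for t in scores}

    def build_hierarchy(X, y, m):
        """Return a list of (chosen classifier, score identified at that layer)."""
        layers, remaining = [], list(range(1, m + 1))
        mask = np.ones(len(y), dtype=bool)              # customers still to be identified
        for _ in range(m - 1):                          # m-1 layers
            nn = MLPClassifier(max_iter=2000).fit(X[mask], y[mask])
            dt = DecisionTreeClassifier().fit(X[mask], y[mask])
            r_nn = per_score_accuracy(nn, X[mask], y[mask], remaining)
            r_dt = per_score_accuracy(dt, X[mask], y[mask], remaining)
            # Step 3: pick the classifier/score pair with the highest accuracy.
            if max(r_nn.values()) >= max(r_dt.values()):
                clf, s_L = nn, max(r_nn, key=r_nn.get)
            else:
                clf, s_L = dt, max(r_dt, key=r_dt.get)
            layers.append((clf, s_L))
            remaining.remove(s_L)
            mask &= (y != s_L)                          # Step 4: C = C - C_{s_L}
        return layers, remaining[0]                     # the last remaining score

    def predict_score(layers, last_score, x):
        """Pass a target customer down the hierarchy, as in Figure 7.1."""
        for clf, s in layers:
            if clf.predict(x.reshape(1, -1))[0] == s:
                return s
        return last_score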

7.4 Application to practical problem

To investigate the performance and effectiveness of the proposed system, we apply it to the real credit assessment problem in the small company.

According to the procedure described in section 7.3, in the first layer we choose every customer of the 2001 financial year as the training data set and then calculate the classification accuracy through a neural network (NN) and a decision tree (DT). Because there are many decision tree algorithms, here we use J4.8, which is a modification of C4.5 revision 8 [87]. For the neural network (NN), we use a back-propagation neural network (BPN); the structure of the BPN is 7, 6 and 4 units in its input, hidden and output layers, respectively. The result is shown in Table 7.1.

Table 7.1 The classification accuracy at the first layer.

  Layers   Credit scores   NN      DT
  L=1      1               99.4%   99.8%
           2               50.0%   0.0%
           3               50.0%   0.0%
           4               85.0%   60.0%

From Table 7.1, we can see that the maximum accuracy is 99.8%, obtained by the decision tree. So the decision tree is chosen at the first layer of the hierarchical system. If a sample of the target customers is classified by the decision tree as score 1, this is taken as the final result, owing to the maximum accuracy. Otherwise, the remaining samples are passed to the next layer.


At the second layer, we find the maximum accuracy only among scores 2, 3 and 4, through the neural network and the decision tree. The result is shown in Table 7.2.

Table 7.2 The classification accuracy at the second layer.

  Layers   Credit scores   NN       DT
  L=2      2               100.0%   100.0%
           3               100.0%   50.0%
           4               100.0%   95.0%

From Table 7.2, we can see that the accuracies obtained by the neural network are all 100%, so we selected the neural network at the second layer to identify score 2.

In the same way, according to Table 7.3, we use the neural network at the third layer to identify the customers with scores of 3 and 4. The classification accuracy at the third layer is shown in Table 7.3.

Table 7.3 The classification accuracy at the third layer.

  Layers   Credit scores   NN       DT
  L=3      3               100.0%   0.0%
           4               100.0%   100.0%

According to the procedure described above, the classifier in each layer can be decided, as shown in Figure 7.2.

Figure 7.2 Classifier in each layer of the adaptive and hierarchical system (layer 1: DT identifies score 1; layer 2: NN identifies score 2; layer 3: NN separates scores 3 and 4).

7.5 System performance and discussion

7.5.1 Ability for classification

First, we choose every customer of the 2001 financial year as the target customer and decide its new credit score by applying the proposed system. These new credit scores provided by the system are compared with those given by the financial managers of the company. The comparison results are shown in Table 7.4.

Table 7.4 Classification results by the system.

  Number of customers                        Credit score provided by system
                                             1     2     3     4     Hit rate
  Credit score given by        1             473   0     0     0     99.8%
  the financial managers       2             0     2     0     0     100%
                               3             0     0     2     0     100%
                               4             0     0     0     20    100%

From Table 7.4, the credit scores for insolvent customers (score = 2, 3, 4) provided by the system are in 100% agreement with the judgments of the financial managers of the company. According to this result, our system has a very high ability to classify the insolvent customers.

7.5.2 Ability for prediction

As target customers, the feature data of 493 customers in the 2002 financial year and 467 customers in the 2003 financial year were collected. For every customer of the 2002 and 2003 financial years, a new credit score is predicted by the system based on the cases of the 2001 financial year. Furthermore, these prediction results are also compared with the credit scores given by the financial managers of the company, and the hit rates of prediction are summarized in Table 7.5 and Table 7.6.

Table 7.5 Prediction results for 2002's customers.

  Number of customers                        Credit score provided by system
                                             1     2     3     4     Hit rate
  Credit score given by        1             469   0     0     24    100%
  the financial managers       2             0     2     0     0     100%
                               3             0     0     2     0     100%
                               4             1     0     0     19    95%

From Table 7.5, the hit rates are in more than 95% agreement with the judgments of the financial managers of the company. According to this result, the proposed system has the ability to predict not only the insolvent customers but also the healthy ones.


Table 7.6 Prediction results for 2003's customers.

  Number of customers                        Credit score provided by system
                                             1     2     3     4     Hit rate
  Credit score given by        1             426   0     0     24    94.6%
  the financial managers       2             0     1     0     0     100%
                               3             2     0     1     0     33%
                               4             0     0     0     13    100%

From Table 7.6, the credit scores of healthy customers (score = 1) provided by the system are in more than 94% agreement with the judgments of the financial managers of the company. Although the hit rate for healthy customers is about 5.4 percentage points lower than that for 2002 and the hit rate for customers with a score of 3 is only 33%, the overall prediction performance is still acceptable.

7.6 Comparison with other methods

7.6.1 Comparison with neural network and decision tree

In order to test the performance of our system, we compared it with single classifiers based only on a neural network or a decision tree. As above, we first chose every customer of the 2001 financial year as the training data set to construct the single classifiers; then every customer of the 2002 and 2003 financial years was used as the target customer, and new credit scores were decided by the single classifiers and the proposed system respectively. The hit rates are shown in Figure 7.3 and Figure 7.4.

Figure 7.3 The hit rates for 2002's customers (scores 1-4: NN 99%, 0%, 0%, 65%; DT 100%, 0%, 0%, 50%; proposed system 100%, 100%, 100%, 95%).


From Figure 7.3 and Figure 7.4, it is obvious that:

(1) For the customers of score = 1, both the single classifiers based only on a neural network or a decision tree and the proposed system have a very high hit rate; even a single classifier can perform well when the number of customers is large.

(2) Because the number of customers of score = 2 and score = 3 is very small, the single classifiers based only on a neural network or a decision tree could not identify these customers. The proposed system, however, showed hit rates of 100% and 33% and outperforms the single classifiers using neural networks or decision trees.

7.6.2 Comparison with parallel ensemble system

To compare the proposed system with ensemble systems of multiple classifiers, here a

parallel ensemble system is constructed, as shown in Figure 7.5.

Figure 7.4 The hit rates for 2003's customers (scores 1-4: NN 95%, 0%, 0%, 46%; DT 93%, 0%, 0%, 24%; proposed system 95%, 100%, 33%, 100%).

Figure 7.5 The parallel ensemble system with voting (bootstrapped data sets D1, D2, ..., Dn each train a neural network and a decision tree in parallel; the single classifiers' outputs are combined by majority voting into the final result).


Firstly, we use the bagging technique to make several data sets, and for each data set two kinds of single classifier, based on a neural network and a decision tree, are placed in parallel. As majority voting is the most commonly used method for combining different classifiers, we also combine the classification results of the single classifiers by majority voting. In our study, the number of single classifiers ranges from 5 to 15, and the best results are selected to compare with the proposed system.
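A minimal sketch of such a parallel ensemble is given below, assuming scikit-learn's MLPClassifier and DecisionTreeClassifier as stand-ins for the thesis's neural network and decision tree; n_datasets and the integer-coded credit scores are hypothetical parameters of this illustration.

    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.tree import DecisionTreeClassifier

    def bagged_nn_dt_ensemble(X, y, n_datasets=10, random_state=0):
        """Train an NN and a DT on each bootstrapped data set D1..Dn."""
        rng = np.random.default_rng(random_state)
        members = []
        for _ in range(n_datasets):
            idx = rng.integers(0, len(y), size=len(y))   # bootstrap replica
            members.append(MLPClassifier(max_iter=2000).fit(X[idx], y[idx]))
            members.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return members

    def majority_vote(members, X_new):
        """Combine the single classifiers' predictions by majority voting
        (assumes non-negative integer class labels, e.g. credit scores 1..4)."""
        votes = np.stack([clf.predict(X_new) for clf in members])   # (2*n, n_samples)
        return np.array([np.bincount(col).argmax() for col in votes.T])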

After every single classifier was trained using the customers' data of the 2001 financial year, the new credit scores of the customers in the 2002 and 2003 financial years were predicted by the parallel ensemble system of Figure 7.5, and the hit rates were compared with those of the proposed system. The comparison result is shown in Table 7.7.

Table 7.7 The comparison with the parallel ensemble system.

  The hit rates for 2002's customers
  Method                     score=1   score=2   score=3   score=4
  Parallel ensemble system   96.1%     100%      100%      90%
  Proposed system            100%      100%      100%      95%

  The hit rates for 2003's customers
  Method                     score=1   score=2   score=3   score=4
  Parallel ensemble system   94.2%     100%      33%       62%
  Proposed system            94.6%     100%      33%       100%

From Table 7.7, it is obvious that:

(1) When the number of single classifiers increases, the parallel ensemble system can also give the same high accuracy as the proposed system in identifying the insolvent customers of score 2 and score 3. Meanwhile, the proposed system outperformed the parallel ensemble system in identifying the insolvent customers of score 4.

(2) While the proposed system has only three classifiers, the parallel ensemble system consists of 5 to 15 single classifiers, and it is necessary to decide the optimal number of single classifiers. From this viewpoint, the proposed system is very simple and has an advantage over the parallel ensemble system.

7.6.3 Type I and Type II errors

According to previous studies, most of them only examine the average prediction performance of their models. However, from Table 7.5 and Table 7.6, we have found that there are usually two types of errors in the prediction results.


Type I error: it represents an actual bankrupt firm classified as non-bankrupt.

Type II error: it represents an actual non-bankrupt firm classified as bankrupt.
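For clarity, the two error rates can be computed directly from true and predicted labels; the following minimal sketch uses NumPy with hypothetical label arrays (1 = bankrupt/insolvent, 0 = non-bankrupt/healthy) and is not part of the original system.

    import numpy as np

    # Hypothetical labels: 1 = bankrupt (insolvent), 0 = non-bankrupt (healthy).
    y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
    y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 1])

    # Type I error: actual bankrupt firms classified as non-bankrupt.
    type_I = np.mean(y_pred[y_true == 1] == 0)
    # Type II error: actual non-bankrupt firms classified as bankrupt.
    type_II = np.mean(y_pred[y_true == 0] == 1)

    print(f"Type I error = {type_I:.1%}, Type II error = {type_II:.1%}")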

Because type I errors represent real losses, we should improve our models or methods so as to identify insolvent customers more accurately. Here we compare the results provided by our system with those of the single classifiers and of the multiple classifiers that combine neural networks and decision trees by voting, examine their Type I and Type II errors, and show them in Table 7.8.

Table 7.8 shows that the proposed system has lower errors than the other methods. In other words, the proposed system has higher performance and effectiveness in controlling type I and type II errors.

7.6.4 Comparison with other methods

We further compared the hit rates for classifying the 2002 customers with those of the following three approaches:

・ CBR system: a case-based reasoning system developed by Dong [86].

・ CBR+Bagging: a credit assessment system using a hybrid method of bagging and case-based reasoning [90].

・ TDR: the two-stage data resampling method proposed in chapter 6 of this thesis [91].

The comparison result is shown in Table 7.9.

Table 7.8 Type I and Type II errors.

  Method            2002's customers      2003's customers
                    Type I    Type II     Type I    Type II
  Neural Network    30.0%     1.0%        46.0%     5.3%
  Decision Tree     50.0%     0.0%        58.0%     6.8%
  Voting            8.0%      3.8%        41.0%     5.8%
  Proposed System   4.2%      0.0%        11.7%     5.3%

Table 7.9 Comparing hit rates for 2002 customers.

  Customers'       Hit rate
  credit score     CBR       CBR+Bagging   TDR       Proposed approach
  1                98.9%     94.4%         99.6%     100.0%
  2                50.0%     100.0%        100.0%    100.0%
  3                100.0%    100.0%        100.0%    100.0%
  4                95.0%     90.0%         90.0%     95.0%


From Table 7.9, it is obvious that:

(1) Compared with the CBR system developed by Dong [85], the proposed approach showed higher accuracy in classifying customers with scores of one or two. Although the accuracy of identifying customers with a score of four is 95.0%, the same as for the CBR system, the proposed approach can classify the customers without using their former credit scores. That is, the proposed approach uses less characteristic data than the CBR system.

(2) Compared with the hybrid method of bagging and CBR (CBR+Bagging) [90], the proposed approach showed higher accuracy in classifying customers with scores of one and four, and the same accuracy in classifying customers with scores of two and three. Moreover, when using the CBR+Bagging approach, we have to build 10 bootstrapped replicas and decide on an appropriate sampling method, whereas the proposed approach has a very simple structure.

(3) The proposed approach showed the same accuracy as the two-stage data resampling method (TDR) [91] in identifying customers with scores of two or three, while its accuracies in identifying customers with scores of one and four are higher than those of TDR. The higher accuracies depend mainly on the adoption of the two neural networks.

7.7 Concluding Remarks

In this chapter, we again dealt with the customers' credit scoring problems in a small company and intended to assess the customers' credit based only on daily transaction data such as sales, payments by customers, the amount of overdue payments, etc. The emphasis has been put on how to improve the accuracy of identifying insolvent customers. An adaptive and hierarchical system was proposed, in which the best method is chosen adaptively from a neural network and a decision tree at each level. It is similar to ensemble learning systems in that it uses multiple classifiers, but it differs from ensemble learning systems mainly in that:

(1) The number of classifiers is decided by the number of customer groups and is usually smaller than in ensemble learning systems.

(2) While typical bagging or stacking ensemble systems are constructed through a parallel scheme, the proposed system has a hierarchical form where m-1 classifiers are arranged hierarchically.


(3) In the proposed system, one classifier is used to identify only one group of customers

and it is not necessary to introduce any weighting algorithm or meta-level classifier to

combine the predictions from an ensemble of diverse classifiers.

The performance and effectiveness have been confirmed by applying the system to the real problems of the company. The experimental results showed that the system can identify insolvent customers more effectively than single classifiers based only on a neural network or a decision tree. The system also has a higher ability to identify insolvent customers than the parallel ensemble system based on neural networks and decision trees.


CHAPTER 8

AN INVESTIGATION INTO THE RELATIONSHIP BETWEEN

CLASSIFICATION PERFORMANCE AND DEGREE OF

IMBALANCE


An investigation into the relationship between

classification performance and degree of imbalance

8.1 Background and aims of this study

During recent years, the class imbalance problem has received a high attention. A number

of solutions to the class imbalance problem were previously proposed both at the data and

algorithmic levels. As summarized by López et al. [92], these solutions can be categorized

into three major groups:

(1) Data sampling: at the data level, these solutions include many different forms of re-

sampling such as random over-sampling with replacement, random under-sampling,

directed over-sampling (in which no new instances are created, but the choice of

samples to replace is informed rather than random), directed under-sampling (where,

again, the choice of instances to eliminate is informed), over-sampling with informed

generation of new samples, and combinations of the above techniques [93].

(2) Algorithmic modification: at the algorithmic level, this procedure is oriented towards

the adaptation of base learning methods, including both standard learning algorithms

and ensemble techniques, to be more attuned to class imbalance issues.

(3) Cost-sensitive learning: this type of solution incorporates approaches at the data level, at the algorithmic level, or at both levels combined, considering higher costs for the misclassification of instances of the positive class with respect to the negative class, and therefore trying to minimize the higher-cost errors.

Some experimental studies have been carried out to compare the effectiveness of the

methods previously proposed to deal with the class imbalance problem [46][48]. Meanwhile,

the nature (concept complexity, size of the training set and class imbalance level, etc.) of the

class imbalance problem has been investigated by several researchers. Japkowicz and Stephen

[46] have argued that the class imbalance problem is a relative problem that depends on 1) the


degree of class imbalance; 2) the complexity of the concept represented by the data; 3) the overall size of the training set; and 4) the classifier involved. They also found that the higher the degree of class imbalance, the higher the complexity of the concept and the smaller the overall size of the training set, the greater the effect of class imbalance on classifiers sensitive to the problem. Furthermore, several researchers have considered how class distribution affects classifier performance. López et al. [92] have made a detailed analysis and study of the data-intrinsic characteristics and given a brief description of how they affect the performance of classification algorithms. Although some researchers have argued that the level of class imbalance has an effect on classifiers' performance [93-96], López et al. [92] pointed out that the imbalance ratio by itself does not have the most significant effect on the classifiers' performance, but that there are other issues that must be taken into account.

In this study, we carry out an experimental study to investigate the effect of different levels

of imbalanced class distribution on the performance of three techniques often used for solving

the class imbalance problem: Synthetic Minority Over-sampling TEchnique (SMOTE), cost-

sensitive learning and ensemble learning. The purpose is to clarify the most effective

technique with different degrees of class imbalance. Our research differs from the others and

makes the following contributions:

(1) Although some researchers have studied the effects of different levels of imbalanced class distribution on classifiers' performance, their emphasis has been put on how to determine the best class distribution or decide the most appropriate sampling algorithm for selecting training instances for a particular learning method such as bagging, cost-sensitive learning, fuzzy classifiers and decision trees [93-95]. Here, we aim at investigating the behavior and performance of three techniques often used for solving the class imbalance problem under varying levels of class imbalance, and the emphasis is put on deciding the most effective technique.

(2) Almost all of the previous studies used several different kinds of benchmark data sets to investigate the relationship between the level of class imbalance and classifiers' performance. However, if the data sets are changed, the degree of class imbalance as well as other characteristics change, and it is therefore difficult to distinguish the effect of the degree of class imbalance from that caused by other characteristics. Here, we design an experiment that generates training data sets with different levels of imbalanced class distribution from one original data set.


8.2 Experimental Design

8.2.1 Training data set generating

This study aims at investigating how the degree of class imbalance affects the performance

of various classification techniques, while eliminating the influence of other factors as much

as possible. We selected the German credit data as the original training data set, which is a widely used academic data set available from the UCI Machine Learning Repository. This data set has two classes, "good" and "bad" (credits), 7 numerical attributes and 13 categorical attributes; 700 instances belong to the "good" class and 300 instances belong to the "bad" class.

At first, we divide the German credit data set into two parts, the training data and the test data; the training data consists of two-thirds of the instances and the test set of the remaining one-third. The training data and the test data are randomly resampled from the original German credit data set while the ratio of bad/good instances is kept at 3/7. Furthermore, the test data remain unchanged throughout the experiment.

Then, as shown in Figure 8.1, we fixed the imbalance ratio of bad/good instances at IR = 30/70, 20/80, 15/85, 12/88, 10/90, 3/97, 2/98 and 1/99 respectively, and resampled from the original training data set. As a result, eight groups of training data, G30/70, G20/80, G15/85, G12/88, G10/90, G3/97, G2/98 and G1/99, were generated; each group contains thirteen training data sets with the same IR.
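The resampling scheme of Figure 8.1 can be sketched as follows; this is only an illustration assuming the original training data are available as a NumPy feature matrix X_train and binary labels y_train with 1 = bad, and the per-set sample size n_total is a hypothetical parameter.

    import numpy as np

    def make_group(X, y, bad_frac, n_sets=13, n_total=600, seed=0):
        """Resample the original training data into n_sets data sets with a
        fixed bad/good ratio (e.g. bad_frac=0.12 for the group G12/88)."""
        rng = np.random.default_rng(seed)
        bad_idx, good_idx = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
        n_bad = int(round(n_total * bad_frac))
        group = []
        for _ in range(n_sets):
            idx = np.concatenate([
                rng.choice(bad_idx, n_bad, replace=True),
                rng.choice(good_idx, n_total - n_bad, replace=True),
            ])
            rng.shuffle(idx)
            group.append((X[idx], y[idx]))
        return group

    # The eight groups G30/70 ... G1/99:
    ratios = [0.30, 0.20, 0.15, 0.12, 0.10, 0.03, 0.02, 0.01]
    # groups = {r: make_group(X_train, y_train, r) for r in ratios}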


8.2.2 Selection of classification techniques

Current approaches for dealing with the problem of imbalanced data sets fall into two major categories: data sampling and algorithmic modification. Nevertheless, there is no fully exhaustive comparison between those models. In order to analyze the data sampling methodologies against the cost-sensitive learning approach, we compare the Synthetic Minority Over-sampling TEchnique (SMOTE) with the cost-sensitive C4.5 decision tree. In addition, we also include in the comparison a hybrid procedure that combines the base classifier with boosting. Hence, the SMOTE technique, cost-sensitive learning techniques and ensemble learning techniques have been selected.

As shown in Table 8.1, we will use the C4.5 decision tree as the base classifier [63]. This is because, firstly, C4.5 has been widely used to deal with imbalanced data sets [78][98] and, secondly, it has been included as one of the top ten data mining algorithms [99]. Combined with the base classifier C4.5, three classification methods, C4.5+SMOTE, C4.5+Boost and C4.5+CS, will be compared in our experimental study.

Figure 8.1 Setting up training data sets (the original training data set is resampled with IR = 30/70, 20/80, 15/85, 12/88, 10/90, 3/97, 2/98 and 1/99, producing the eight groups of training data G30/70 to G1/99, each consisting of thirteen data sets).


Table 8.1 The classification techniques used in the experimental study.

  Acronym       Base classifier   Combined algorithm        Algorithm description
  C4.5+SMOTE    C4.5              SMOTE                     The base classifier C4.5 applied to a data set
                                                            preprocessed with the SMOTE algorithm.
  C4.5+Boost    C4.5              AdaBoost                  The base classifier C4.5 combined with the
                                                            boosting algorithm.
  C4.5+CS       C4.5              Cost-sensitive learning   The cost-sensitive learning algorithm introduced
                                                            into the base classifier C4.5 (cost-sensitive
                                                            C4.5 decision tree).

8.2.3 Parameter tuning

The confidence level for the pruning strategy of C4.5 was 0.25. The tree was built using the

Weka [88] package.

The 5-nearest-neighbors scheme was applied to generate synthetic training data in SMOTE. The over-sampling rate was set to 50%, 100%, 200%, 300%, 400% or 500%, and the most appropriate value was selected for each data set based on validation set performance. As with the C4.5 algorithm, SMOTE was also run in Weka.

For the boosting classifier, the number of iterations was varied in the range [10, 50, 100, 250, 500, 1000], and we also selected the most appropriate value to compare with the other methods.

Furthermore, we have to identify the misclassification costs associated with the positive and negative classes for the cost-sensitive learning version. If we misclassify a positive sample as a negative one, the associated misclassification cost is the IR of the data set (C(1,0) = IR), which is defined as

    IR = (the percentage of the good class) / (the percentage of the bad class)    (8.1)

and the value of IR in each data set is presented in Table 8.2, where we list the number of the data set, the class distribution and the IR.

If we misclassify a negative sample as a positive one, the associated cost is 1 (C(0,1) = 1). The cost of classifying correctly is 0 (C(1,1) = C(0,0) = 0), because guessing the correct class should not penalize the built model.


Table 8.2 The value of IR in each data set.

  Data No.   Class distribution (bad/good)   IR
  1          30/70                           2.33
  2          20/80                           4
  3          15/85                           5.7
  4          12/88                           7.3
  5          10/90                           9
  6          3/97                            32.5
  7          2/98                            49
  8          1/99                            99
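The four classifiers of Table 8.1 can be approximated outside Weka. The sketch below is only an analogy, assuming scikit-learn and imbalanced-learn: a CART decision tree stands in for C4.5/J48, AdaBoostClassifier for the boosting variant, and a class-weighted tree (with the weight of the bad class set to the IR of Eq. (8.1)) for the cost-sensitive C4.5; X_train, y_train, X_test, y_test are hypothetical arrays with label 1 for the bad class.

    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import roc_auc_score

    def build_classifiers(ir):
        """ir = percentage of the good class / percentage of the bad class, Eq. (8.1)."""
        return {
            "C4.5": DecisionTreeClassifier(),                    # stand-in for C4.5/J48
            "C4.5+SMOTE": Pipeline([("smote", SMOTE(k_neighbors=5)),   # over-sampling amount tunable via sampling_strategy
                                    ("tree", DecisionTreeClassifier())]),
            "C4.5+Boost": AdaBoostClassifier(n_estimators=100),  # boosting; iteration count tunable as in Section 8.2.3
            "C4.5+CS": DecisionTreeClassifier(class_weight={1: ir, 0: 1.0}),  # misclassifying "bad" costs IR
        }

    def evaluate_auc(classifiers, X_train, y_train, X_test, y_test):
        aucs = {}
        for name, clf in classifiers.items():
            clf.fit(X_train, y_train)
            aucs[name] = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
        return aucs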

8.2.4 Statistical comparison of classifiers

Since this study compares the three methods according to their performance in classifying eight groups of training data sets, a statistical analysis needs to be carried out in order to find significant differences among the obtained results. Here, we use Friedman's test to compare the AUCs obtained by the methods; it is a well-known non-parametric statistical test for multiple comparisons [100].

The Friedman test statistic is based on the average ranked (AR) performance of the classification techniques on each data set. Let D be the number of data sets used in the study, K be the total number of classifiers and r_i^j be the rank of classifier j on data set i; then the average rank of classifier j is calculated as

    AR_j = (1/D) * sum_{i=1}^{D} r_i^j    (8.2)

The test statistic is given by

    chi_F^2 = (12D / (K(K+1))) * [ sum_{j=1}^{K} AR_j^2 - K(K+1)^2 / 4 ]    (8.3)

chi_F^2 is distributed according to the Chi-square distribution with K-1 degrees of freedom. If the value of chi_F^2 is large enough, then the null hypothesis that there is no difference between the techniques can be rejected. The Friedman statistic is well suited to this type of data analysis as it is less susceptible to outliers [100].

Furthermore, we consider the average ranking of the classification methods in order to

show how good a method is with respect to classification of imbalanced data sets. This

ranking is obtained by assigning a position to each method depending on its performance for

each data set. The method which achieves the best performance (AUC value) in a specific

data set will have the first rank (value 1); then, the method with the second best performance


is assigned rank 2, and so forth.
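As a reference, Eqs. (8.2) and (8.3) and the average ranking can be reproduced in a few lines; the sketch below assumes NumPy/SciPy and a small hypothetical AUC matrix (rows = data sets, columns = classifiers) and is not the thesis's own tooling.

    import numpy as np
    from scipy.stats import rankdata, friedmanchisquare

    # auc[i, j] = AUC of classifier j on data set i (D data sets x K classifiers); hypothetical values.
    auc = np.array([[0.70, 0.74, 0.78, 0.73],
                    [0.66, 0.72, 0.75, 0.70],
                    [0.69, 0.73, 0.74, 0.71]])
    D, K = auc.shape

    ranks = rankdata(-auc, axis=1)          # rank 1 = best AUC on each data set (ties get average ranks)
    AR = ranks.mean(axis=0)                 # Eq. (8.2)

    chi2_F = 12 * D / (K * (K + 1)) * (np.sum(AR ** 2) - K * (K + 1) ** 2 / 4)   # Eq. (8.3)
    print("Average ranks:", AR, " Friedman statistic:", chi2_F)

    # The same test via SciPy (one argument per classifier column):
    stat, p = friedmanchisquare(*(auc[:, j] for j in range(K)))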

8.3 Experimental results and discussion

Table 8.3 shows the results obtained by applying the four methods, C4.5, C4.5+SMOTE, C4.5+Boost and C4.5+CS, to train classifiers on the data sets from the eight groups of training data and then evaluating the AUC of these classifiers on the test data [101]. For each level of imbalance, the Friedman test statistic and the corresponding p-value are calculated and also shown in Table 8.3. The average rank (AR) of the four methods on each group of training data is given in Table 8.3, and their comparison is shown in Figure 8.2; the best (lowest) AR in each group indicates the best-performing method.

From Table 8.3 and Figure 8.2, it is clear that:

Among the eight groups of training data sets, the minimum Friedman test statistic is 19.36 and the corresponding p-value is 0.023%. It is therefore clear that there are significant differences in the classification performance of the four methods.

As all of the Friedman test statistics correspond to very low p-values (p < 0.001), the classification performance of the four methods C4.5, C4.5+SMOTE, C4.5+Boost and C4.5+CS varies significantly with the degree of class imbalance, and there is no method that is always effective at all degrees of class imbalance.

The base classifier C4.5 gave the lowest AUCs at all degrees of class imbalance. In other words, combining a base classifier with SMOTE, cost-sensitive learning or ensemble learning methods is a very effective approach to the class imbalance problem.

When training classifiers on the data sets from groups G30/70 (bad/good=30/70) and G20/80 (bad/good=20/80), C4.5+Boost was the best-performing classification method, with AR values of 1.23 and 1.54. The next best-performing classifier was C4.5+SMOTE.


Table 8.3 Comparison of the four classifiers on the eight groups of training data: Friedman test statistic and average rank (AR) of each method, computed from the AUCs obtained on the thirteen data sets of each group (rank 1 is the best AUC on a data set, so the lowest AR marks the best-performing method in a group).

  Training data group        Friedman test statistic   AR: C4.5   C4.5+SMOTE   C4.5+Boost   C4.5+CS
  G30/70 (bad/good=30/70)    26.82 (p<0.001)           3.46       2.00         1.23         3.31
  G20/80 (bad/good=20/80)    27.83 (p<0.001)           3.85       1.70         1.54         2.92
  G15/85 (bad/good=15/85)    19.98 (p<0.001)           3.85       1.77         2.31         2.08
  G12/88 (bad/good=12/88)    21.76 (p<0.001)           3.54       1.23         2.62         3.31
  G10/90 (bad/good=10/90)    26.82 (p<0.001)           4.00       1.69         2.54         1.77
  G3/97  (bad/good=3/97)     34.57 (p<0.001)           3.92       1.69         3.08         1.31
  G2/98  (bad/good=2/98)     24.72 (p<0.001)           3.92       1.92         2.46         1.54
  G1/99  (bad/good=1/99)     19.36 (p<0.001)           3.62       1.77         2.54         1.69


When the training data sets came from groups G15/85 (bad/good=15/85), G12/88 (bad/good=12/88) and G10/90 (bad/good=10/90), the best average rank (AR) was given by the C4.5+SMOTE classifier.

When the imbalance degree increased further and the training data groups became G3/97 (bad/good=3/97), G2/98 (bad/good=2/98) and G1/99 (bad/good=1/99), the effect of cost-sensitive learning gradually became remarkable and, as a result, C4.5+CS provided the best average rank across these three groups of data sets.

In addition to the comparison of the average ranks described above, a comparison can also be made from the viewpoint of the AUC values. Figure 8.3 shows the average of the AUCs on each group of training data.

Figure 8.3 Average of AUCs on eight groups of training data (average AUC of C4.5, C4.5+SMOTE, C4.5+Boost and C4.5+CS for groups G30/70 through G1/99).

From Figure 8.3, it can be observed that:

As the degree of class imbalance increases, the average of the AUCs decreases almost monotonically for all four methods. Compared with the AUC on the training data group G30/70, the average of the AUCs on the training data group G1/99 is reduced by 20%-26%. It is clear that the degree of imbalance has a strong effect on the performance of the classification methods.

Comparing the AUCs obtained by C4.5+SMOTE, C4.5+Boost and C4.5+CS, the range (maximum minus minimum) of the average AUC on the same group of training data is 4%-8% of the maximum. Therefore, the difference in the average AUCs obtained by the three methods is very small.

Figure 8.2 AR comparison on eight groups of training data (average ranks of C4.5, C4.5+SMOTE, C4.5+Boost and C4.5+CS for groups G30/70 through G1/99, as listed in Table 8.3).

According to these observations, it has been clarified that:

(1) The degree of class imbalance significantly affects the performance of the four classification methods C4.5, C4.5+SMOTE, C4.5+Boost and C4.5+CS. When the degree of class imbalance increases, i.e. the imbalance ratio decreases, the performance of the classification methods deteriorates markedly. Meanwhile, there is no method that is always effective at all degrees of class imbalance.

(2) Although the difference in the average AUCs is very small, the most effective classification method changes with the degree of class imbalance. When the degree of class imbalance is low (imbalance ratio 30/70 or 20/80), C4.5+Boost is the best-performing classification technique. At the middle level of class imbalance, where the imbalance ratio is reduced to 15/85, 12/88 and 10/90, the C4.5+SMOTE classifier outperforms the other methods. Moreover, when the imbalance degree is high and the imbalance ratio is reduced to 3/97, 2/98 and 1/99, the cost-sensitive classifier C4.5+CS provides the best performance.

8.4 Concluding remarks

In our study, we carried out an experimental study to investigate the effect of the degree of class imbalance on the performance of three representative techniques often used for solving the class imbalance problem: SMOTE, cost-sensitive learning and ensemble learning (boosting). In order to isolate the effect of the degree of class imbalance and exclude the


influence of other characteristics, we designed an experiment that generates training data sets with different levels of imbalanced class distribution from one original data set. Through our experiment, it is clear that the degree of imbalance has a strong effect on the performance of the classification methods, and that their performance deteriorates markedly as the degree of class imbalance increases.

It was also clarified that there is no method that is always effective at all degrees of class imbalance. The cost-sensitive learning based classifier performs well at a very high degree of class imbalance (imbalance ratio 3/97, 2/98 and 1/99), the SMOTE classifier works well when the class imbalance is at a middle level (imbalance ratio 15/85, 12/88 and 10/90), and the boosting method outperforms the others at a low degree of class imbalance (imbalance ratio 30/70 or 20/80). According to these results, we can conclude that selecting the appropriate classification technique is very important in dealing with class imbalance problems.


CHAPTER 9

CONCLUSIONS


Conclusions

In this PhD thesis, we mainly addressed three issues relating to credit scoring: building credit scoring methods and systems for imbalanced credit scoring data sets, especially extremely imbalanced data sets; contributing to the literature on small and medium enterprise credit scoring; and investigating the relationship between classification performance and the degree of imbalance.

In this chapter, we present the conclusions that can be drawn from the research undertaken in this thesis.

The main contributions of the thesis are summarized below:

(1) In this thesis, some effective approaches have been proposed to assess the credit of small and medium enterprises.

The two systems are suitable for application in many organizations whose customers do not disclose their financial data, and they are also very easy to incorporate into existing information systems. Furthermore, the proposed systems are more valuable and can be applied more widely than other existing credit assessment systems.

(2) For the class imbalance problem, the emphasis has been put on how to improve the ability to identify the minority class.

The two-stage resampling method uses the k-means algorithm to perform under-sampling on the majority class of customers. In order to avoid losing information, we introduced a pre-classification to pick up those customers of the majority class whose information could not be reflected in the previous under-sampling result.

The adaptive and hierarchical system can adaptively choose the best method from neural networks and decision trees based on the accuracy of identifying the customers of each credit score.


The experimental results showed that the performance of classifying the minority class was improved significantly by the two proposed approaches.

(3) We carried out an investigation into the relationship between classification performance and the degree of imbalance.

Techniques including the single C4.5 classifier and hybrid techniques based on boosting, the Synthetic Minority Over-sampling Technique and cost-sensitive learning algorithms were used in the analysis of different degrees of class imbalance. The results help to select the appropriate methods for problems with different degrees of imbalance. We also find that, when faced with a large class imbalance, cost-sensitive learning performs very well.


References

[1] Van Gestel, T. and B. Baesens, Credit Risk Management: Oxford University Press.

[2] Edelman, D.B. and J.N. Crook, Credit Scoring and its Applications. Society for Industrial

Mathematics: Philadelphia, 2002.

[3] Tsai, C.F. and M.-L. Chen, Credit Rating by Hybrid Machine Learning Techniques.

Applied Soft Computing, 10 (2), 374-380, 2010.

[4] Leung, K, Cheong, F, Cheong, C, O’ Farrell, S and Tissington, R, A Comparison of

Variable Selection Techniques for Credit Scoring. In Proceedings of the 7th

International

Conference on Computational Intelligence in Economics and Finance, Taiwan,

December 5-7, 2008.

[5] Cios, K.J., et al., Data mining methods for knowledge discovery. Kluwer Academic

Publishers, 1998.

[6] Tan, P.N., Steinbach, M. and Kumar, V., Introduction to data mining. Pearson Addison

Wesley Boston, 2006.

[7] Altman E., Financial Ratios, Discriminant Analysis and Prediction of Corporate

Bankruptcy. Journal of Finance, 23 (4), 589-609, 1968.

[8] Baesens B., Van Gestel, T., Viaene, S., Stepanova, M., Suykens, J. and Vanthienen, J.,

Benchmarking State of Art Classification Algorithms for Credit Scoring. Journal of the

Operational Research Society, 54 (6), 627-635, 2003.

[9] Desai, V.S., Crook, J.N., Overstreet Jr G.A., A Comparison of Neural Networks and

Linear Scoring Models in the Credit Union Environment. European Journal of Operation

Research Society, 95 (1), 24-37, 1996.

[10] Karels, G.V., and Prakash, A.J., Multivariate Normality and Forecasting of Business

Bankruptcy. Journal of Business Finance and Accounting, 14 (4), 573-593,1987.

[11] Reichert, A.K., Cho, C.-C and Wagner, G.M., An Examination of the Conceptual Issues

Involved in Developing Credit Scoring models. Journal of Business and Economic

Statistics, 1 (2), 101-114, 1983.

[12] West, D., Neural Network Credit Scoring models. Computers & Operations Research,

27(11-12), 1131-1152, 2000.

[13] Yobas, M.B., Crook, J.N. and Ross, P., Credit Scoring using Neural and Evolutionary

Techniques. IMA Journal of Management Mathematics, 11 (2), 111-125, 2000.


[14] Arminger,G., Enache, D and Bonne, T., Analyzing Credit Risk Data: A Comparison of

Logistic Discrimination, Classification Tree Analysis, and Feedforward Networks.

Computational Statistics, 12 (2), 293-310, 1997.

[15] Steenackers, A. and Goovaerts M.J., A Credit Scoring Model for Personal Loans.

Insurance: Mathematics and Economics, 8 (1), 31-34, 1989.

[16] Wiginton, J.C., A Note on the Comparison of Logit and Discriminant Models of

Consumer Credit Behavior. Journal of Financial and Quantitative Analysis, 15, 757-770,

1980.

[17] Friedman, J.H., Multivariate Adaptive Regression Splines. The Annals of Statistics, 19 (1),

1-67,1991.

[18] Altman, E., Corporate Distress Diagnosis: Comparisons Using Linear Discriminant

Analysis and Neural Networks. Journal of Banking & Finance, 18 (3), 505-529, 1994.

[19] Hung, C. and Chen, J.H., A Selective Ensemble Based on Expected Probabilities for

Bankruptcy Prediction. Expert System with Applications, 36(3), 5297-5303, 2009.

[20] Schebesch, K.B., and Stecking, R., Support Vector Machines for Classifying and

Describing Credit Applicants: Detecting Typical and Critical Regions. The Journal of the

Operational Research Society, 56 (9), 1082-1088, 2005.

[21] Buta, Mining for financial knowledge with CBR. AI Expert, 9(2), 34-41,1994.

[22] Shin, K.S., and Han, I., A Case-Based Approach Using Inductive Indexing for Corpoate

Bond Rating. Decision Support System, 32(1), 41-52, 2001.

[23] Yanwen, D., Development of a Customer Credit Evaluation System via Case-based

Reasoning Approach. Asia-Pacific Journal of Industrial Management, 1, 1-7, 2008.

[24] Piramuthu, S., Financial Credit Risk Evaluation with Neural and Neurofuzzy Systems.

European Journal of Operational Research, 112, 310-321, 1999.

[25] Tseng, F.M., Lin, L., A Quadratic Interval Logit Model for Forecasting Bankruptcy.

Omega, 33 (1), 85-91, 2005.

[26] Rafiei, F.M., Manzari, S.M., Bostanian, S., Financial Health Prediction Models Using

Artificial Neural Networks, Genetic Algorithms and Multivariate Discriminant Analysis:

Iranian Evidence. Expert Systems with Applications, 38(8), 10210-10217, 2011.

[27] Tsai, C.F., Feature selection in bankruptcy prediction. Knowledge-Based System, 22 (2),

120-127, 2009.

[28] Danenas, P., Garsva, G., Guda, S., Credit risk evaluation model development using

support vector based classifiers. Procedia Computer Science, 4, 1699-1707, 2011.


[29] Kim, M.J. and Kang, D.K., Ensemble with Neural Networks for Bankruptcy Prediction. Expert Systems with Applications, 37 (4), 3373-3379, 2010.

[30] Tsai, C.F. and Wu, J.W., Using Neural Network Ensembles for Bankruptcy Prediction and Credit Scoring. Expert Systems with Applications, 34 (4), 2639-2649, 2008.

[31] Karels, G. and Prakash, A., Multivariate Normality and Forecasting of Business Bankruptcy. Journal of Business Finance and Accounting, 14 (4), 573-593, 1987.

[32] Reichert, A.K., Cho, C.-C. and Wagner, G.M., An Examination of the Conceptual Issues Involved in Developing Credit-Scoring Models. Journal of Business and Economic Statistics, 1 (2), 101-114, 1983.

[33] Thomas, L.C., A Survey of Credit and Behavioral Scoring: Forecasting Financial Risks of Lending to Customers. International Journal of Forecasting, 16, 147-172, 2000.

[34] Shin, K.S., Lee, T.S. and Kim, H., An Application of Support Vector Machines in Bankruptcy Prediction Model. Expert Systems with Applications, 28, 127-135, 2005.

[35] Van Gestel, T., Baesens, B., Suykens, J.A., Van den Poel, D., Baestaens, D.E. and Willekens, B., Bayesian Kernel Based Classification for Financial Distress Detection. European Journal of Operational Research, 172, 979-1003, 2004.

[36] Min, S.H., Lee, J. and Han, I., Hybrid Genetic Algorithms and Support Vector Machines for Bankruptcy Prediction. Expert Systems with Applications, 31, 652-660, 2006.

[37] Abdou, H., Pointon, J. and Elmasry, A., Neural Nets Versus Conventional Techniques in Credit Scoring in Egyptian Banking. Expert Systems with Applications, 35 (3), 1275-1292, 2008.

[38] Pang, S. and Gong, J., Classification Algorithms and Application on Individual Credit Evaluation of Banks. Systems Engineering Theory & Practice, 29 (12), 94-104, 2009.

[39] Polikar, R., Ensemble Based Systems in Decision Making. IEEE Circuits and Systems Magazine, 6 (3), 21-45, 2006.

[40] Dasarathy, B.V. and Sheela, B.V., Composite Classifier System Design: Concepts and Methodology. Proceedings of the IEEE, 67 (5), 708-713, 1979.

[41] Hansen, L.K. and Salamon, P., Neural Network Ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12 (10), 993-1001, 1990.

[42] Schapire, R.E., The Strength of Weak Learnability. Machine Learning, 5 (2), 197-227, 1990.

[43] Dietterich, T.G., Machine Learning Research: Four Current Directions. AI Magazine, 18 (4), 97-136, 1997.


[44] Windeatt, T. and Ardeshir, G., Decision Tree Simplification for Classifier Ensembles. International Journal of Pattern Recognition and Artificial Intelligence, 18 (5), 749-776, 2004.

[45] Basel Committee on Banking Supervision, Basel Committee Newsletter No. 6: Validation of Low-Default Portfolios in the Basel II Framework. Technical Report, Bank for International Settlements.

[46] Japkowicz, N. and Stephen, S., The Class Imbalance Problem: A Systematic Study. Intelligent Data Analysis, 6 (5), 429-449, 2002.

[47] Weiss, G., Mining with Rarity: A Unifying Framework. SIGKDD Explorations Special Issue on Learning from Imbalanced Datasets, 6 (1), 7-19, 2004.

[48] Weiss, G. and Provost, F.J., Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction. Journal of Artificial Intelligence Research, 19, 315-354, 2003.

[49] Chawla, N.V., Japkowicz, N. and Kotcz, A., Editorial: Special Issue on Learning from Imbalanced Data Sets. SIGKDD Explorations, 6 (1), 1-6, 2004.

[50] Joshi, M.V., Learning Classifier Models for Predicting Rare Phenomena. Ph.D. Thesis, University of Minnesota, Twin Cities, MN, USA, 2002.

[51] Crone, S.F. and Finlay, S., Instance Sampling in Credit Scoring: An Empirical Study of Sample Size and Balancing. International Journal of Forecasting, 28, 224-238, 2012.

[52] Yao, P., Hybrid Classifier Using Neighborhood Rough Set and SVM for Credit Scoring. International Conference on Business Intelligence and Financial Engineering, 138-142, 2009.

[53] Kennedy, K., et al., Using Semi-Supervised Classifiers for Credit Scoring. Journal of the Operational Research Society, 64, 513-529, 2013.

[54] Provost, F., Jensen, D. and Oates, T., Efficient Progressive Sampling. In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining, 23-32, 1999.

[55] Chawla, N.V., et al., SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321-357, 2002.

[56] Japkowicz, N., Learning from Imbalanced Data Sets: A Comparison of Various Strategies. AAAI Workshop on Learning from Imbalanced Data Sets, 6, 10-15, 2000.

[57] Batista, G., A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data. ACM SIGKDD Explorations Newsletter, 6 (1), 20-29, 2004.


[58] Sadatrasoul, S.M., et al., Credit Scoring in Banks and Financial Institutions via Data Mining Techniques: A Literature Review. Journal of Artificial Intelligence and Data Mining, 1 (2), 119-129, 2013.

[59] Hung, C., Chen, J.-H. and Wermter, S., Hybrid Probability-Based Ensemble for Bankruptcy Prediction. In Proceedings of the International Conference on Business and Information, July 11-13, Tokyo, Japan.

[60] Witten, I.H. and Frank, E., Data Mining. Morgan Kaufmann Publishers: Elsevier, 2005.

[61] Benjamin, N., et al., Low Default Portfolios: A Proposal for Conservative Estimation of Default Probabilities. Discussion Paper, Financial Services Authority, 2006.

[62] Osuna, R.G., Lecture Notes CS 790: Introduction to Pattern Recognition. Wright State University, Dayton, Ohio, USA, 2002.

[63] Quinlan, J.R., C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, California, 1993.

[64] Haykin, S., Neural Networks: A Comprehensive Foundation. Prentice Hall, New Jersey, 1999.

[65] West, D., Dellana, S. and Qian, J., Neural Network Ensemble Strategies for Financial Decision Applications. Computers & Operations Research, 32 (10), 2543-2559, 2005.

[66] Skurichina, M. and Duin, R.P.W., Bagging, Boosting and the Random Subspace Method for Linear Classifiers. Pattern Analysis and Applications, 5 (2), 121-135, 2002.

[67] Breiman, L., Bagging Predictors. Machine Learning, 24 (2), 123-140, 1996.

[68] Freund, Y. and Schapire, R.E., A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences, 55 (1), 119-139, 1997.

[69] Wolpert, D.H., Stacked Generalization. Neural Networks, 5 (2), 241-259, 1992.

[70] Breiman, L., Random Forests. Machine Learning, 45 (1), 5-32, 2001.

[71] Chawla, N.V., Japkowicz, N. and Kolcz, A., Editorial: Special Issue on Learning from Imbalanced Data Sets. ACM SIGKDD Explorations Newsletter, 6 (1), 1-6, 2004.

[72] Van Hulse, J., Khoshgoftaar, T. and Napolitano, A., Experimental Perspectives on Learning from Imbalanced Data. In Proceedings of the 24th International Conference on Machine Learning, 935-942, 2007.

[73] López, V., Fernández, A., Moreno-Torres, J.G. and Herrera, F., Analysis of Preprocessing vs. Cost-Sensitive Learning for Imbalanced Classification: Open Problems on Intrinsic Data Characteristics. Expert Systems with Applications, 39 (7), 6585-6608, 2012.


[74] He, H. and Garcia, E.A., Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering, 21 (9), 1263-1284, 2009.

[75] Ting, K.M., An Instance-Weighting Method to Induce Cost-Sensitive Trees. IEEE Transactions on Knowledge and Data Engineering, 14 (3), 659-665, 2002.

[76] Kubat, M. and Matwin, S., Addressing the Curse of Imbalanced Training Sets: One-Sided Selection. In Proceedings of the Fourteenth International Conference on Machine Learning, 179-186, 1997.

[77] Ertekin, S., Huang, J., Bottou, L. and Giles, C.L., Learning on the Border: Active Learning in Imbalanced Data Classification. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, 127-136, Portugal, 2007.

[78] Su, C.T. and Hsiao, Y.H., An Evaluation of the Robustness of MTS for Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering, 19 (10), 1321-1332, 2007.

[79] Weiss, G.M., Mining with Rarity: A Unifying Framework. SIGKDD Explorations Newsletter, 6 (1), 7-19, 2004.

[80] Lu, Y., Guo, H. and Feldkamp, L., Robust Neural Learning from Unbalanced Data Samples. In IEEE International Joint Conference on Neural Networks, IEEE World Congress on Computational Intelligence, 3, 1816-1821, 1998.

[81] Yoon, K. and Kwek, S., A Data Reduction Approach for Resolving the Imbalanced Data Issue in Functional Genomics. Neural Computing and Applications, 16 (3), 295-306, 2007.

[82] Zadrozny, B. and Elkan, C., Learning and Making Decisions When Costs and Probabilities are Both Unknown. In Proceedings of the 7th International Conference on Knowledge Discovery and Data Mining, 204-213, 2001.

[83] Domingos, P., MetaCost: A General Method for Making Classifiers Cost-Sensitive. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 155-164, 1999.

[84] Pazzani, M., Murphy, P., Ali, K., Hume, T. and Brunk, C., Reducing Misclassification Costs. In Proceedings of the 11th International Conference on Machine Learning, 217-225, 1994.

[85] Turney, P., Types of Cost in Inductive Concept Learning. In Proceedings of the Workshop on Cost-Sensitive Learning at the Seventeenth International Conference on Machine Learning, 15-21, 2000.


[86] Dong, Y., Development of a Customer Credit Evaluation System via Case-based Reasoning Approach. Asia-Pacific Journal of Industrial Management, 1 (1), 2008.

[87] Dong, Y., Hao, X. and Yu, C., Comparison of Statistical and Artificial Intelligence Methodologies in Small-Businesses' Credit Assessment Based on Daily Transaction Data. ICIC Express Letters, An International Journal of Research and Surveys, 5 (5), 1725-1730, 2011.

[88] Witten, I.H. and Frank, E., Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, 2005.

[89] Dong, Y., Application of Bagging for Solving Small-Businesses' Credit Assessment Problems Based on Daily Transaction Data. China Management Information, 115-119, 2009.

[90] Dong, Y., Application of Hybrid Method of Bagging and Case-Based Reasoning to Solve Small-Businesses' Credit Assessment Problems. Information, 14 (2), 399-409, 2011.

[91] Hao, X. and Dong, Y., A New Approach Based on Class Imbalance Learning for Small-Business Credit Assessment. Journal of Japan Industrial Management Association, 64 (2E), 325-335, 2013.

[92] López, V., Fernández, A., García, S., Palade, V. and Herrera, F., An Insight into Classification with Imbalanced Data: Empirical Results and Current Trends on Using Data Intrinsic Characteristics. Information Sciences, 250, 113-141, 2013.

[93] Visa, S. and Ralescu, A., Issues in Mining Imbalanced Data Sets: A Review Paper. In Proceedings of the Sixteenth Midwest Artificial Intelligence and Cognitive Science Conference, 67-73, 2005.

[94] Visa, S. and Ralescu, A., The Effect of Imbalanced Data Class Distribution on Fuzzy Classifiers: An Experimental Study. IEEE International Conference on Fuzzy Systems, FUZZ-IEEE 2005, 749-754, 2005.

[95] Chen, N., Chen, A. and Ribeiro, B., Influence of Class Distribution on Cost-Sensitive Learning: A Case Study of Bankruptcy Analysis. Intelligent Data Analysis, 17 (3), 423-437, 2013.

[96] Liang, G., Zhu, X. and Zhang, C., The Effect of Varying Levels of Class Distribution on Bagging for Different Algorithms: An Empirical Study. International Journal of Machine Learning and Cybernetics, 5 (1), 63-71, 2014.

[97] Kotsiantis, S., Kanellopoulos, D. and Pintelas, P., Handling Imbalanced Datasets: A Review. GESTS International Transactions on Computer Science and Engineering, 30 (1), 25-36, 2006.

[98] García, S., Fernández, A., Luengo, J. and Herrera, F., A Study of Statistical Techniques and Performance Measures for Genetics-Based Machine Learning: Accuracy and Interpretability. Soft Computing, 13 (10), 959-977, 2009.

[99] Wu, X. and Kumar, V., The Top Ten Algorithms in Data Mining. Data Mining and Knowledge Discovery Series, Chapman and Hall/CRC Press, 2009.

[100] Friedman, M., A Comparison of Alternative Tests of Significance for the Problem of m Rankings. Annals of Mathematical Statistics, 11 (1), 86-92, 1940.

[101] Hao, X., Dong, Y. and Wu, S., Dealing with Severely Imbalanced Credit Scoring Dataset. Proceedings of 2012 Asian Conference of Management Science & Applications (ACMSA2012), 69-74, 2012.