FORECASTING BANKRUPTCY PREDICTION USING CONDITIONAL RANDOM FIELDS

Bo Wang

Student ID: 01501409

Promotor: Prof. Dr. Dries Benoit

Tutor(s): Ir. Wai Kit Tsang

A dissertation submitted to Ghent University in partial fulfilment of the requirements for the degree of

Master of Science in Statistical Data Analysis.

Academic year: 2017 - 2018

The author and promoter give permission to consult this master dissertation and to copy it

or parts of it for personal use. Every other use falls under the restrictions of the copyright,

in particular concerning the obligation to mention explicitly the source when using results

of this master dissertation.

Gent August 14, 2018

The promotor, The author,

Prof. Dr. Dries Benoit Bo Wang

Abstract

The aim of this thesis is to predict whether a Belgian company will go bankrupt in the next financial year (a classification problem). We focus on small and medium-sized enterprises. Traditional business failure prediction models include only financial ratios as predictors and treat companies as isolated entities. In this research, networks of Belgian companies are created by linking companies through shared board members. This thesis includes network statistics as predictors and evaluates the performance of Conditional Random Fields (CRF) in Business Failure Prediction. Board networks are particularly interesting because information is transferred from company to company through these linked boards, so there might be a relationship between a board's connections and the financial status of the linked companies. We hope that the CRF model performs better than commonly used analytical techniques (for example logistic regression and decision trees), which do not consider the influence of shared board members.

Contents

Abstract
1. Introduction
1.1 Business failure predictions on small and medium-sized enterprises
1.2 Conditional Random Fields in Business Failure Prediction
1.3 Organization of this work
2. Data and Methods
2.1 Dataset and variables
2.2 Data exploration and visualization
2.3 Conditional random fields
2.3.1 Fundamentals of Conditional Random Fields
2.3.2 Representation
2.3.3 Inference
2.3.4 Learning
2.4 Benchmark Model
2.4.1 Logistic regression
2.4.2 Decision Tree and Random Forest
2.5 Performance metrics
2.5.1 Accuracy, precision, recall and F-measure
2.5.2 AUC
2.6 Dummy data generation
3. Results
3.1 Apply CRF model on dummy data
3.2 Apply CRF on real company dataset
3.2.1 On original company data
3.2.2 On up-sampling company data
4. Conclusions and future work
References
Appendix A: Dataset description

1. Introduction

1.1 Business failure predictions on small and medium-sized enterprises

Small and medium-sized enterprises (SMEs) play an important role in the European

economy. “About 99.8% of enterprises which operated in the EU-28 non-financial business

sector in 2016 were SMEs. These SMEs employed 93 million people, accounting for 67 %

of total employment in the EU-28 non-financial business sector, and generating 57 % of

value added in the EU-28 non-financial business sector” [Muller et al., 2016]. The

European definition of SMEs is formulated as: "The category of micro, small and medium-

sized enterprises (SMEs) is made up of enterprises which employ fewer than 250 persons

and which have an annual turnover not exceeding 50 million euro, and/or an annual balance

sheet total not exceeding 43 million euro" [Commission et al., 2003].

A major research domain within corporate finance is to build Business Failure Prediction

(BFP) models that accurately predict the future status of companies [Tkac and Verner,

2016]. Indeed, BFP models can help stakeholders take precautions and potentially prevent

business failures. BFP models could be interesting for investors, managers,

holding/collaborating companies and the government.

1.2 Conditional Random Fields in Business Failure Prediction.

In classic statistical BFP models, businesses are classified as either failing or non-failing, which is a binary classification problem. One category of explanatory variables that has been widely used as predictors in business failure classification, and that is also used in this thesis, consists of accounting-based measures, also known as financial ratios [Baysinger and Butler, 1985]. The reason is that financial ratios can be viewed as an overall summary of a business's historical performance. Bankrupt companies normally perform badly and therefore have poor financial ratios.

Based on financial ratios as predictors, different techniques and methods have been applied to BFP, from classical univariate/multivariate statistical analysis to advanced machine learning methods. Table 1.1 gives a short summary of the most important and popular techniques that have been applied in the BFP field.

Table 1.1 Short list of the techniques that have been applied in the BFP field.

| Technique | Summary |
| --- | --- |
| Univariate analysis [Beaver 1966] | The very first study of BFP. Applied a binary classification model with a variety of financial ratios. |
| Multiple discriminant analysis (MDA) [Altman 1968] | Extends the BFP problem from the univariate to the multivariate setting. Assumes normality and equal covariances. |
| Probit analysis [Zopounidis et al., 1999] | Probit models assume a cumulative normal distribution. |
| Logit analysis [Doumpos et al., 1999] | Logit models assume a logistic distribution. |
| Self-organizing maps [Huysmans et al., 2006] | Unsupervised machine learning technique with powerful visualization capabilities. |
| Support vector machines [Min et al., 2006] | Supervised machine learning technique. Less interpretable than MDA. |
| Probabilistic neural networks [Ravi et al., 2008] | Feedforward neural network that has been widely used in classification and pattern recognition problems. |
| Artificial neural networks [Chen et al., 2009] | Currently a popular technique for BFP. Suffers from limited interpretability and transparency. |
| Hybrid systems, ensemble methods [Lin et al., 2012] | Aggregate multiple prediction techniques to improve accuracy. |

The approaches listed above, however, treat companies as isolated entities: they ignore the links between companies and do not consider any influence that companies have on each other. In this thesis, we want to explore that property. Our reasoning is that, in reality, all businesses form an ecosystem. Companies may be linked by the same business category, country, investors, size, etc., and the performance of one company may influence that of others through those links. A particularly interesting and frequently observed phenomenon is interlocking: a member of the board of directors of one company is often also a member of the board of directors of another company. Previous research has shown that the financial status of linked companies might be mutually influenced, since resources and information can be transferred between the linked companies through these boards [Tobback et al., 2016]. In this thesis, we link companies by their shared board members to form a network and try to use the board network information for business failure prediction.

Taking the board network into account to predict business failure is basically a problem of predicting many variables that depend on each other as well as on other observed variables. The conditional random fields (CRF) technique is particularly suitable for this task. “CRF is a graphical model which can take neighboring samples or context into account in order to predict a label” [Sutton et al., 2012]. In this thesis, CRF is applied to build BFP models that take the network into account. The purpose is to evaluate whether taking the board network into account adds value to bankruptcy prediction. The performance of CRF is evaluated and compared with several benchmark models.

1.3 Organization of this work

In the remainder of this work, chapter 2 starts with data exploration and visualization of the company dataset. We then describe how to apply CRF in the business failure prediction context, including model representation, inference and parameter learning. We also describe the benchmark models and the performance metrics used to evaluate the models, followed by a short section on dummy data generation and the up-sampling method. In chapter 3, we present the results of the CRF model on both the dummy and the real company dataset. In the concluding chapter 4, we discuss the obtained results, several open issues and recommendations for further research.

2. Data and Methods

2.1 Dataset and variables

The dataset used in this thesis contains the financial information of both active and failing Belgian companies for the years 2011-2017, obtained from the Orbis database of Bureau Van Dijk (BvD) in the thesis work of Domien Van Damme. Only small and medium-sized companies are selected. We refer to the thesis work of Domien Van Damme for the detailed data selection procedure and criteria [Van Damme et al., 2017].

The raw data file contains variables for each company for each year. The variables include yearly financial information (revenues, stock turnover, current ratios, etc.) and non-financial information such as the company name, the industrial classification and the unique ID of each board director. We refer to Table A.0.3 for the detailed definition of each variable.

For each company, the business status in each year is also given in the raw data file. The business statuses are divided into 12 categories. We refer to Table A.0.6 for the full list of possible business statuses. Following the suggestions in the thesis work of Domien Van Damme, and to obtain a dataset of companies that are as similar as possible in accounting and legal terms, companies with the business status “Bankruptcy” or “Dissolved (bankruptcy)” are considered failed companies, and companies with the business status “Active” are considered active companies. All other types of business status are ignored.
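The labelling rule above can be sketched in a few lines of Python. This is a minimal illustration, not the preprocessing code used for this thesis: the function name `status_label` and the constant `FAILED_STATUSES` are hypothetical, the status strings are the ones quoted above, and the coding 1 = bankrupt, 0 = active follows the convention used in Section 2.3.2.

```python
# Hypothetical helper: maps a raw business status to a binary label.
FAILED_STATUSES = {"Bankruptcy", "Dissolved (bankruptcy)"}


def status_label(status):
    """Return 1 for a failed company, 0 for an active one, None if ignored."""
    if status in FAILED_STATUSES:
        return 1
    if status == "Active":
        return 0
    # Every other business status is dropped from the dataset.
    return None


raw_statuses = ["Active", "Bankruptcy", "In liquidation", "Dissolved (bankruptcy)"]
labels = [status_label(s) for s in raw_statuses]
kept = [label for label in labels if label is not None]
```

The status "In liquidation" in the usage lines is only an illustrative placeholder for "any other status"; the full list is given in Table A.0.6.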

To predict whether a company goes bankrupt, the variables in the raw data file need to be transformed into predictors. Based on the available variables, 30 predictors were created: 9 continuous and 21 discrete. We refer to Table A.0.3 for the transformation method of each variable and to Table A.0.4 for an overview of all the predictors used in BFP modeling in this research. The most important variable transformation steps and their motivations are listed below:

• The variable “date of incorporation” is the date of formation of the company. It

cannot be directly used as a predictor. However, since the failed firms could be

significantly younger than active firms, we transformed it into the age variable,

which is the difference between the year of the observation and the year of

incorporation.

• The variable “Category of the company” was transformed into a binary variable, with 0 standing for a small-sized company and 1 for a medium-sized company.

• Some variables have many missing values. Figure 2.1 shows the percentage of missing values of each variable over the whole dataset. Previous literature shows that failing companies tend not to share information. Therefore, variables with more than 30% missing values were converted into binary dummies, indicating whether the information was given or missing.

• Unrelated variables such as “company names” and “Last avail. Year” are deleted because they are irrelevant to the business failure prediction problem and thus cannot be used as predictors.

• As will be seen in Section 2.3, we build the CRF model on log-linear features: all numerical variables will be exponentiated. To avoid numerical overflow, all numerical variables in the raw data are rescaled to the range -1 to 1.
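Three of the transformations listed above (company age, missing-value dummies and rescaling) can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the code used for this thesis: the function names are hypothetical, the dummy is coded 1 when the information was given and 0 when it is missing (the text does not fix the direction), and min-max scaling is assumed as the rescaling method because the text only specifies the target range -1 to 1.

```python
def company_age(observation_year, incorporation_year):
    """Age predictor: year of the observation minus year of incorporation."""
    return observation_year - incorporation_year


def missing_indicator(values):
    """Binary dummy per observation: 1 if a value was reported, 0 if missing.

    Missing values are represented by None (an assumption of this sketch).
    """
    return [0 if v is None else 1 for v in values]


def rescale_to_unit_range(values):
    """Min-max rescale the observed values to [-1, 1]; None stays None."""
    observed = [v for v in values if v is not None]
    lo, hi = min(observed), max(observed)
    if hi == lo:
        # A constant column carries no information; map it to 0.
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else 2.0 * (v - lo) / (hi - lo) - 1.0
            for v in values]


revenues = [0.0, 5.0, None, 10.0]
print(missing_indicator(revenues))
print(rescale_to_unit_range(revenues))
```

Keeping every rescaled value inside [-1, 1] bounds the exponent of the log-linear factors, which is the overflow concern raised in the last bullet.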

The raw data file, which has the data structure shown in Table A.0.1, is converted into a basetable per year (cross-sectional) with the structure company ID, predictors and status, as shown in Table A.0.2.

Figure 2.1 Variables missing value percentage

2.2 Data exploration and visualization

First, it is interesting to know how many directors a board typically contains in Belgium. Figure 2.2 shows the number of companies per board size. Companies with a small board of fewer than 10 members are by far the most frequent (94.5%). There are, however, 4 companies with more than 100 board members.

Figure 2.2 Company counts with different board size

Figure 2.3 shows the number of directors per number of seats. There are 242960 directors in total. Most directors (83.4%) are involved with only 1 company. There are 40253 directors (16.6%) connected with 2 or more companies. Those 40253 directors and the related companies form the Belgian company board networks. It is quite remarkable that several directors are connected with a large number of companies; one of them is connected with 168 companies. How can this be interpreted? “In Belgium, the independent supervisory board often consists of labor- or governmental organizations and institutional investors. These institutional investors are typically connected to many companies.” [Van Damme et al., 2017].
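The board network described here can be assembled from (director, company) seat records. The sketch below is an illustration only: the function name `build_board_network` and the record layout are assumptions, not the format of the Orbis data. Two companies are linked whenever at least one director holds a seat in both.

```python
from collections import defaultdict
from itertools import combinations


def build_board_network(seats):
    """Return the set of undirected company links implied by shared directors.

    seats: iterable of (director_id, company_id) pairs (hypothetical layout).
    Each link is a tuple (company_a, company_b) with company_a < company_b.
    """
    companies_by_director = defaultdict(set)
    for director_id, company_id in seats:
        companies_by_director[director_id].add(company_id)

    links = set()
    for companies in companies_by_director.values():
        # A director with k seats links every pair among the k companies.
        for a, b in combinations(sorted(companies), 2):
            links.add((a, b))
    return links


seats = [("d1", "A"), ("d1", "B"), ("d2", "B"), ("d2", "C"), ("d3", "D")]
print(sorted(build_board_network(seats)))
```

In this toy input, director d3 holds a single seat, so company D stays isolated, which mirrors the 83.4% of directors who do not contribute any link.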

Figure 2.3 Count of directors with different number of seats

One of our research goals is to check whether board members have an influence on whether or not a company will fail. Figure 2.4 shows that most directors (93.8%) are not connected to failed companies. Among the directors who are connected to a failed company, most (93.7%) are connected to only one failed company. However, it also happens quite often that a director is connected to multiple bankrupt companies. In one extreme case, a director is connected to 12 bankrupt companies.

Figure 2.4 Count of directors linked to different numbers of bankrupt companies

Figure 2.5 shows the yearly number of bankrupt versus active companies. The percentage of bankrupt companies is small, which is good news for the economy but may be bad news for model training because the classes are highly unbalanced.

Figure 2.5 Status counts of companies per year.

2.3 Conditional random fields

Before we start building the CRF model, it should be pointed out that the companies in our dataset are linked in very complex ways: many different network topologies exist in our dataset. Table 2.1 gives an overview of the different types of network topology found in the dataset, with schematic examples. In general, graph topologies can be split into two categories: singly-connected graphs (trees) and multiply-connected graphs (loopy). A graph is singly-connected if there is only one path from any node a to any other node b. A graph is multiply-connected if it is not singly-connected. Thus, in Table 2.1, the linear chain and star topologies are singly-connected graphs and the rest are multiply-connected graphs.
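The distinction between singly-connected and multiply-connected graphs can be checked mechanically. The sketch below uses a standard graph fact: an undirected graph has exactly one path between every pair of nodes if and only if it is connected and has one edge fewer than it has nodes. The function name `is_singly_connected` is hypothetical, and the inputs are assumed to describe one network component.

```python
from collections import defaultdict


def is_singly_connected(nodes, edges):
    """True if the undirected graph is a tree (exactly one path per node pair)."""
    nodes = list(nodes)
    edge_set = {frozenset(e) for e in edges}  # drop duplicates and direction
    if any(len(e) != 2 for e in edge_set):
        return False  # a self-loop is already a cycle
    if len(edge_set) != len(nodes) - 1:
        return False  # too many edges (a loop) or too few (disconnected)

    neighbors = defaultdict(set)
    for edge in edge_set:
        a, b = tuple(edge)
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Depth-first search from an arbitrary node to test connectivity.
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        for nxt in neighbors[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == len(nodes)


print(is_singly_connected([1, 2, 3], [(1, 2), (2, 3)]))          # linear chain
print(is_singly_connected([1, 2, 3], [(1, 2), (2, 3), (3, 1)]))  # triangle
```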

In this thesis, we limit ourselves to modeling only the linear-chain network topology (specifically, double-linked companies), based on the motivations listed below:

1. The linear-chain topology is not only the simplest but also represents the largest part (80.03%) of the structures in our dataset.

2. For singly-connected graphs, there exist efficient algorithms such as belief propagation that scale linearly with the number of nodes in the graph. Although approximate inference algorithms such as loopy-cut algorithms exist for multiply-connected models, they are in general computationally inefficient.

3. The parameters trained on the double-linked linear-chain model can also be used as clique templates to do inference on other types of singly-connected graphs, such as the star topology.

Table 2.1 Network topologies found in the dataset. Nodes represent companies; edges mean that a link (shared board member) exists between companies.

Topology Example

Linear chain

Triangle

Star

Fully connected

Ring

Complex network structure

In the section below, after the general introduction of the Fundamentals of Conditional

Random Fields, we explain in detail how to apply the conditional random fields framework

to the problem of business failure prediction.

2.3.1 Fundamentals of Conditional Random Fields

Let X be the set of input variables, which are always observed, and Y the set of output variables that we wish to predict. The formal definition of general conditional random fields is given below:

“Let G be a factor graph over variables X and Y. Then (X, Y) is a conditional random

fields if, for any value x of X, the distribution 𝒑(𝒚|𝒙) factorizes according to G” [Sutton

et al., 2012].

In formulas, if $F = \{\Psi_a\}$ is the set of factors in G, then the conditional distribution for a CRF is

$$p(y|x) = \frac{1}{Z} \prod_{a=1}^{A} \Psi_a(y_a, x_a) \tag{2.1}$$

Here, $Z$ is the normalization constant that makes the distribution sum to one:

$$Z = \sum_{y} \prod_{a=1}^{A} \Psi_a(y_a, x_a) \tag{2.2}$$

The factor $\Psi_a(y_a, x_a)$ is often parameterized in a log-linear representation:

$$\Psi_a(y_a, x_a; \theta_a) = \exp\left\{ \sum_{k=1}^{K} \theta_{ak} f_{ak}(y_a, x_a) \right\} \tag{2.3}$$

with $f_{ak}: Val(y_a, x_a) \to \mathbb{R}$ the feature function and $\theta_{ak}$ the weight of that feature function.

Putting together (2.1), (2.2) and (2.3), the conditional distribution for a CRF with log-linear factors can be written as:

$$p(y|x) = \frac{1}{Z} \prod_{\Psi_a \in F} \exp\left\{ \sum_{k=1}^{K} \theta_{ak} f_{ak}(y_a, x_a) \right\} \tag{2.4}$$

Here, the normalization constant $Z$ is:

$$Z = \sum_{y} \prod_{\Psi_a \in F} \exp\left\{ \sum_{k=1}^{K} \theta_{ak} f_{ak}(y_a, x_a) \right\} \tag{2.5}$$
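To make equations (2.4) and (2.5) concrete, the following sketch evaluates a tiny log-linear CRF by explicit enumeration. It is an illustration under assumptions, not the implementation used in this thesis: the outputs are binary, the inputs x are treated as fixed and already folded into the factors, and the two example factors with their weights are invented for the demonstration.

```python
import math
from itertools import product


def conditional_distribution(log_factors, n_outputs):
    """Return p(y|x) for every binary assignment y, following (2.4) and (2.5).

    log_factors: list of functions; each maps an assignment y (tuple of 0/1)
    to the exponent sum_k theta_ak * f_ak(y_a, x_a) of one factor.
    """
    unnormalized = {}
    for y in product((0, 1), repeat=n_outputs):
        # Product of exp{...} over factors equals exp of the summed exponents.
        unnormalized[y] = math.exp(sum(f(y) for f in log_factors))
    z = sum(unnormalized.values())  # normalization constant Z of (2.5)
    return {y: value / z for y, value in unnormalized.items()}


# Invented example: one factor rewards y_1 = 1, another rewards equal labels.
log_factors = [
    lambda y: 1.0 * (y[0] == 1),
    lambda y: 0.5 * (y[0] == y[1]),
]
p = conditional_distribution(log_factors, n_outputs=2)
for y, prob in sorted(p.items()):
    print(y, round(prob, 4))
```

The loop over all assignments is what makes this approach exponential in the number of outputs, which motivates the message passing algorithm of Section 2.3.3.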

The CRF representation above emphasizes that each factor has its own set of weights. Quite often, however, different factors in G share the same feature functions and the same parameter values. In this case, clique templates can be created to simplify the model by partitioning the factors of G into $C = \{C_1, C_2, \dots, C_P\}$, where each $C_p$ is a clique template. A CRF that uses clique templates can be written as

$$p(y|x) = \frac{1}{Z} \prod_{C_p \in C} \prod_{\Psi_c \in C_p} \Psi_c(y_c, x_c; \theta_p) \tag{2.6}$$

Each templated factor is parameterized in log-linear form as

$$\Psi_c(y_c, x_c; \theta_p) = \exp\left\{ \sum_{k=1}^{K(p)} \theta_{pk} f_{pk}(y_c, x_c) \right\} \tag{2.7}$$

The normalization function is

$$Z = \sum_{y} \prod_{C_p \in C} \prod_{\Psi_c \in C_p} \Psi_c(y_c, x_c; \theta_p) \tag{2.8}$$

In general, the complete CRF algorithm has three aspects:

• Graph model representation: once the graph representation is properly defined, we also define the factors, the clique templates and the features, from which the conditional distribution of the CRF model (2.6) is defined.

• Model learning: also known as model training or parameter estimation. In this phase the parameter vector is learned from the data, so that the features are weighted.

• Inference: refers both to the task of computing the marginal distributions of $p(y|x)$ and to the task of computing the most likely labeling of y given x: $y^* = \arg\max_y p(y|x)$. Notice that the inference and learning procedures are often closely coupled, because learning usually calls the inference procedure as a subroutine.

To apply CRF in the context of business failure prediction, three tasks corresponding to the three CRF aspects mentioned above need to be carried out:

Task 1. Build a graphical network model to represent the problem: the network defines the factors, which incorporate features of the financial ratios and take the links between companies into account.

Task 2. Define an inference algorithm to perform inference in the constructed network.

Task 3. Use the inference algorithm to learn the optimal feature weights from the data and then to find the best business status assignment for every company.

2.3.2 Representation

Suppose we have n companies that are linked by shared board members. There are two sets of variables in the model we will build: $Y_i$ and $X_i$ for i = 1,...,n. $X_i$ is a vector-valued variable that corresponds to the features of the i-th company; these variables are always observed. $Y_i$ is the business status (1 for bankruptcy and 0 for active) assigned to the i-th company; these are the hidden variables that we want to predict. With CRF we seek to model P(Y|X), the conditional distribution over business statuses given the observed financial variables, and to find the assignment to the $Y_i$ variables that correctly describes the business status of the companies described by the $X_i$ variables.

We will use a linear-chain Markov network to model the distribution over the $Y_i$ variables, given the variables $X_i$. Figure 2.6 is an example of a linear-chain Markov network over 3 linked companies.

In this example, we have singleton factors $\Psi_i^C(Y_i, X_i)$ that represent how likely a given company i is to be bankrupt, and pairwise factors $\Psi_i^P(Y_i, Y_{i+1})$ that represent the interactions between adjacent companies i and i+1.

With these two types of factors, since X is always observed, the CRF allows us to model a conditional distribution P(Y|X) that takes into account not only the influence of the observed financial features on the business status of a single company, but also higher-order dependencies between linked companies (for example, a bankrupt company may be likely to be linked to another bankrupt company).

Figure 2.6 Markov network over 3 linked companies.

We now formally define the factors and the features involved in them.

Singleton Factors:

In its simplest form, a model may contain only the factors $\Psi^C$ for singleton features. There are n such factors, one for every company; the scope of the factor for the i-th company is $\{Y_i, X_i\}$. We call them singleton factors because $X_i$ is always observed, so these factors essentially operate on single companies. This model is shown in Figure 2.7. We use the model with only the singleton factors as the baseline model. When more complex factors are added to the model, the baseline model can be used to evaluate the improvement.

Figure 2.7 Baseline model containing only the factors for singleton features $\Psi^C$

Two types of features are involved in the singleton factors:

• $f^C_{i,c}(Y_i) = 1\{Y_i = c\}$, an indicator for $Y_i = 0$ or $Y_i = 1$, which operates on the hidden variable of a single company. These features are used to encode the individual probability that $Y_i = 0$ or $Y_i = 1$.

• $f^C_{i,j,c,d}(Y_i, x_{ij}) = x_{ij} 1\{Y_i = c\}$, an indicator for $Y_i = c$, $x_{ij} = d$, which operates on a hidden state $c \in \{0, 1\}$ and the financial ratio $x_{ij}$ associated with that state for a single company. These features are used to encode the individual probability that $Y_i = c$ given $x_i$.

Pairwise Factor:

The model that uses only the singleton factors is an entirely valid, though simplistic, Markov network. The issue with the singleton factors is that they do not consider any interactions between companies with shared board members. To improve the model, we introduce n-1 pairwise factors $\Psi^P(Y_i, Y_{i+1})$ for i = 1,...,n-1 to represent these interactions. This gives us the Markov network shown in Figure 2.8.

Figure 2.8 Linear chained CRF with singleton factors and pairwise factors.

The intuition behind these pairwise factors is as follows. Suppose two companies are linked by a shared board member. In isolation, we can only predict the status of each company based on its own financial ratios. Suppose, however, that the singleton factor for the first company assigns a very high score to bankruptcy, i.e., we are fairly certain that the first company goes bankrupt. Suppose, in addition, that the odds of seeing two linked bankrupt companies are much higher than those of seeing a bankrupt company linked to an active one. Then we can say with high probability that the second company is bankrupt too, even if its singleton factor assigns nearly equal scores to bankruptcy and active.

The features involved in the pairwise factors are:

• $f^P_{i,c,d}(Y_i, Y_{i+1}) = 1\{Y_i = c, Y_{i+1} = d\}$, which operates on the hidden states of a pair of linked companies. Since a company has only two possible states, the pairwise factor involves 4 features in total: $1\{Y_i = 0, Y_{i+1} = 0\}$, $1\{Y_i = 0, Y_{i+1} = 1\}$, $1\{Y_i = 1, Y_{i+1} = 0\}$, $1\{Y_i = 1, Y_{i+1} = 1\}$.

So far, we have defined two types of factors and the associated features, so the CRF model in (2.4) can be defined. Before learning the weights of each feature from the training dataset, an “inference engine” needs to be defined first.
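The singleton and pairwise features defined in this section combine into a single unnormalized log-score for a chain of linked companies. The sketch below is a hedged illustration: the function name `chain_log_score`, the argument layout and the numeric weights are hypothetical, and the weights are assumed to be shared across positions in the chain (a clique template in the sense of Section 2.3.1).

```python
def chain_log_score(y, x, bias, ratio_weights, pair_weights):
    """Sum of weighted features for one assignment y of a linear chain.

    y:             tuple of states, 1 = bankrupt, 0 = active
    x:             x[i][j] is financial ratio j of company i
    bias:          bias[c] weights the indicator 1{Y_i = c}
    ratio_weights: ratio_weights[c][j] weights x_ij * 1{Y_i = c}
    pair_weights:  pair_weights[c][d] weights 1{Y_i = c, Y_{i+1} = d}
    """
    score = 0.0
    for i, state in enumerate(y):
        # Singleton features of company i.
        score += bias[state]
        score += sum(w * x_ij for w, x_ij in zip(ratio_weights[state], x[i]))
    for left, right in zip(y, y[1:]):
        # Pairwise feature of each adjacent pair of linked companies.
        score += pair_weights[left][right]
    return score


# Invented weights for two linked companies with one ratio each.
x = [[0.5], [-0.5]]
bias = [0.0, 1.0]
ratio_weights = [[0.0], [2.0]]
pair_weights = [[0.0, 0.0], [0.0, 3.0]]
print(chain_log_score((1, 1), x, bias, ratio_weights, pair_weights))
```

Exponentiating this score gives the product of factors in (2.4) before normalization; dividing by Z, the sum of those exponentials over all assignments, yields P(Y|X).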

2.3.3 Inference

Inference corresponds to using the distribution to answer questions about the environment. With the factors and parameters of the CRF model (2.5) defined, it is possible to use “brute force” to find the MAP assignment to the variables. The brute-force approach, however, cannot handle large networks because its running time is proportional to the number of entries in the joint distribution over the entire network. For example, assume 10 companies are linked together by shared board members, which is not a rare case in our dataset. Brute-force inference then needs to enumerate $2^{10} = 1024$ combinations to find the MAP assignment.

Other forms of exact inference can still be very efficient. In this thesis, we implement belief propagation, specifically the clique tree message passing algorithm, as an exact inference engine.
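The cost of brute-force MAP inference can be seen in a short sketch. The code is illustrative only: the function name `brute_force_map` and the score tables are hypothetical, and the scores are log-potentials, so they are added rather than multiplied.

```python
from itertools import product


def brute_force_map(node_scores, pair_scores):
    """Enumerate every assignment of a binary chain and keep the best one.

    node_scores[i][c]: log-potential of company i being in state c
    pair_scores[c][d]: log-potential of adjacent companies in states (c, d)
    Returns (best_assignment, number_of_assignments_enumerated).
    """
    n = len(node_scores)
    best, best_score, enumerated = None, float("-inf"), 0
    for y in product((0, 1), repeat=n):
        enumerated += 1
        score = sum(node_scores[i][y[i]] for i in range(n))
        score += sum(pair_scores[a][b] for a, b in zip(y, y[1:]))
        if score > best_score:
            best, best_score = y, score
    return best, enumerated


# Ten linked companies: only the first has evidence pointing to bankruptcy,
# and linked companies are rewarded for sharing the same state.
node_scores = [[0.0, 2.0]] + [[0.0, 0.0] for _ in range(9)]
pair_scores = [[1.0, 0.0], [0.0, 1.0]]
assignment, enumerated = brute_force_map(node_scores, pair_scores)
print(assignment, enumerated)
```

With these invented scores the evidence on the first company propagates along the whole chain, and the number of enumerated assignments doubles with every additional company.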

Algorithm: Clique tree message passing [Koller and Friedman, 2009].

1. Construct a clique tree from a given set of factors $\Psi$.

2. Assign each factor $\psi_k \in \Psi$ to a clique $C_{\alpha(k)}$ such that $Scope[\psi_k] \subseteq C_{\alpha(k)}$, where $\alpha(k)$ returns the index of the clique to which $\psi_k$ is assigned.

3. Compute the initial potentials $\phi_i(C_i) = \prod_{k: \alpha(k) = i} \psi_k$.

4. Designate an arbitrary clique as the root, and pass messages $\delta$ upwards from the leaves towards the root clique.

5. Pass messages from the root down towards the leaves.

6. Compute the beliefs for each clique: $\beta_i(C_i) = \phi_i \times \prod_{k \in N_i} \delta_{k \to i}$.

As an example, consider the following network with 6 linked companies:

Figure 2.9 Chained Markov network with 6 linked Company

The following clique tree can be created from the list of factors corresponding to this

network:

Figure 2.10 Clique tree for the network

Next, we assign each of the original factors to a clique to initialize the clique potentials in this tree. The clique with variables 1 and 2 is arbitrarily chosen as the root. Afterwards, message passing starts from the leaves up to the root, and then goes down from the root to the leaves. The red arrows show the messages passed from the leaves to the root, and the blue arrows show the messages passed from the root to the leaves. This clique tree has 5 cliques in total, so 2*(5-1) = 8 messages suffice to correctly compute all beliefs. Finally, we can use the calibrated clique beliefs to answer probabilistic queries on the original network.
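For a linear chain, two-pass message passing reduces to the familiar forward-backward scheme. The sketch below is a simplified stand-in for the clique tree implementation described above, not a copy of it: messages are passed between neighbouring variables rather than between cliques, the function name `chain_marginals` is hypothetical, and the potentials are assumed to be strictly positive.

```python
def chain_marginals(node_pot, pair_pot):
    """Exact marginals P(Y_i | x) on a binary chain via two message sweeps.

    node_pot[i][c]: singleton potential of company i in state c
    pair_pot[c][d]: pairwise potential of adjacent states (c, d)
    """
    n = len(node_pot)
    fwd = [[1.0, 1.0] for _ in range(n)]  # message arriving from the left
    bwd = [[1.0, 1.0] for _ in range(n)]  # message arriving from the right

    for i in range(1, n):  # left-to-right sweep
        for d in (0, 1):
            fwd[i][d] = sum(fwd[i - 1][c] * node_pot[i - 1][c] * pair_pot[c][d]
                            for c in (0, 1))
    for i in range(n - 2, -1, -1):  # right-to-left sweep
        for c in (0, 1):
            bwd[i][c] = sum(pair_pot[c][d] * node_pot[i + 1][d] * bwd[i + 1][d]
                            for d in (0, 1))

    marginals = []
    for i in range(n):
        # Belief = local potential times all incoming messages, then normalize.
        belief = [fwd[i][c] * node_pot[i][c] * bwd[i][c] for c in (0, 1)]
        total = sum(belief)
        marginals.append([b / total for b in belief])
    return marginals


node_pot = [[1.0, 3.0], [2.0, 1.0], [1.0, 1.0], [0.5, 2.0]]
pair_pot = [[2.0, 1.0], [1.0, 4.0]]
for i, m in enumerate(chain_marginals(node_pot, pair_pot)):
    print(i, [round(v, 4) for v in m])
```

Each sweep visits every link once, so the work grows linearly with the number of companies in the chain instead of exponentially. For long chains the messages would need rescaling or a log-domain formulation to avoid underflow, which this sketch omits.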

2.3.4 Learning

We have now constructed a Markov network for the task of business failure prediction and defined an inference engine for the network. What is still lacking are the weights of each feature, which need to be learned from data. We learn those parameters using maximum likelihood estimation:

Given a set of M training examples, $D = \{(x[m], y[m])\}_{m=1}^{M}$, we want to find the $\theta^*$ that maximizes the likelihood of the observed data:

$$\theta^* = \arg\max_\theta L(\theta : D) = \arg\max_\theta \prod_{m=1}^{M} P(y[m] | x[m]; \theta) \tag{2.2}$$

In this thesis, we use the stochastic gradient descent algorithm to learn the parameters from the training data. The algorithm is described below:

Algorithm: Stochastic gradient descent

for k = 1 to max iterations:

    Pick an arbitrary training example (x[m], y[m]), then update

    $$\theta := \theta - \alpha_k \nabla_\theta \left[ -\log P(y[m] | x[m]; \theta) \right]$$

    where the learning rate is $\alpha_k = \frac{0.1}{1 + \sqrt{k}}$

From the algorithm above, the essential task is: for a given data instance (x, Y) and a parameter setting $\theta$, we need to compute the cost function (negative log-likelihood) and the gradient of that cost with respect to the parameters. To avoid overfitting, an L2-regularization penalty on the parameter values is added to the negative log-likelihood. Thus the function we seek to minimize is:

$$nll(x, Y, \theta) \equiv \log(Z_x(\theta)) - \sum_{i=1}^{k} \theta_i f_i(Y, x) + \frac{\lambda}{2} \sum_{i=1}^{k} \theta_i^2 \tag{2.3}$$

The partial derivatives of this function have an elegant form [Koller and Friedman, 2009]:

$$\frac{\partial}{\partial \theta_i} nll(x, Y, \theta) = E_\theta[f_i] - E_D[f_i] + \lambda \theta_i \tag{2.4}$$

The derivative contains two expectations: $E_\theta[f_i]$, the expectation of the feature values with respect to the model parameters, and $E_D[f_i]$, the expectation of the feature values with respect to the given data instance $D \equiv (x, Y)$.

$$E_\theta[f_i] = \sum_{Y'} P(Y' | x; \theta) f_i(Y', x) \tag{2.5}$$

$$E_D[f_i] = f_i(Y, x) \tag{2.6}$$

In (2.5), we sum over all possible assignments to the Y variables in the scope of the feature $f_i$. Since each feature has only a small number of Y variables in its scope (in our case, at most 2, for the features in the pairwise factors), this sum is tractable. Unfortunately, computing the conditional probability $P(Y' \mid x; \theta)$ for each assignment requires performing inference for the data instance x. The inference subroutine therefore needs to be called in every iteration of the learning loop.
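To make (2.3) to (2.6) concrete, the sketch below evaluates the regularized negative log-likelihood and its gradient for a single pair of linked companies. It is an illustration under stated assumptions: the two features in `features` are hypothetical stand-ins for the singleton and pairwise features of the thesis, and the partition function is obtained by brute-force enumeration of the four joint assignments, not by the clique tree inference engine described earlier.

```python
import itertools
import math


def features(y, x):
    """Hypothetical feature vector for a pair of linked companies.

    f[0]: singleton feature, sum of x_j over companies labelled bankrupt (y_j = 1)
    f[1]: pairwise feature, 1 if both companies have the same status
    """
    f0 = sum(xj for yj, xj in zip(y, x) if yj == 1)
    f1 = 1.0 if y[0] == y[1] else 0.0
    return [f0, f1]


def score(theta, y, x):
    """Unnormalized log-score sum_i theta_i * f_i(Y, x)."""
    return sum(t * f for t, f in zip(theta, features(y, x)))


def nll_and_grad(theta, y, x, lam):
    """Return the cost (2.3) and its gradient (2.4) for one data instance."""
    assignments = list(itertools.product([0, 1], repeat=len(x)))
    scores = [score(theta, a, x) for a in assignments]
    # log-sum-exp for a numerically stable log Z_x(theta)
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    probs = [math.exp(s - log_z) for s in scores]  # P(Y' | x; theta)

    nll = log_z - score(theta, y, x) + 0.5 * lam * sum(t * t for t in theta)

    f_data = features(y, x)  # E_D[f_i], equation (2.6)
    grad = []
    for i in range(len(theta)):
        # E_theta[f_i], equation (2.5)
        e_model = sum(p * features(a, x)[i] for p, a in zip(probs, assignments))
        grad.append(e_model - f_data[i] + lam * theta[i])
    return nll, grad


cost, gradient = nll_and_grad([0.3, -0.2], (1, 0), (0.5, -1.2), lam=0.1)
```

A finite-difference comparison of `cost` against `gradient` is a convenient way to check such an implementation before plugging it into the learning loop.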

2.4 Benchmark Model

2.4.1 Logistic regression

Logistic regression (LR) is a generalized linear model for predicting the probability of a binary outcome. The hypothesis function of logistic regression is:

$$h(\mathbf{x}) = g(\boldsymbol{\theta}^{T} \mathbf{x}), \qquad g(z) = \frac{1}{1 + e^{-z}} \qquad (2.7)$$

Here $\mathbf{x}$ is the input predictor vector, $\boldsymbol{\theta}$ the parameters (the weights of the predictors), $h(\mathbf{x})$ the output and $g(z)$ the sigmoid function, plotted in Figure 2.11. The sigmoid function is an S-shaped curve that maps any real input in $(-\infty, +\infty)$ to an output in $(0, 1)$. This property is important for binary classification, because it lets us model the probability of $y = 1$ given $\mathbf{x}$ and $\boldsymbol{\theta}$ as:

$$P(y = 1 \mid \mathbf{x}; \boldsymbol{\theta}) = g(\boldsymbol{\theta}^{T} \mathbf{x}) = \frac{1}{1 + e^{-\boldsymbol{\theta}^{T} \mathbf{x}}} \qquad (2.8)$$

Figure 2.11 sigmoid function


The decision function can then be:

$$y = 1 \quad \text{if } P(y = 1 \mid \mathbf{x}; \boldsymbol{\theta}) > \text{threshold value} \qquad (2.9)$$

Although 0.5 is the usual threshold value, a different threshold can be chosen in practice: if precision is more important, a larger threshold should be set; if recall of the positive class is more important, a smaller threshold may be chosen.
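The effect of the threshold in (2.9) can be seen in a short sketch. The weights and the input vector below are made-up numbers chosen for illustration, and the helper names are not taken from the thesis code.

```python
import math


def sigmoid(z):
    """Numerically stable sigmoid g(z) = 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(theta, x):
    """P(y = 1 | x; theta) as in equation (2.8)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def predict_label(theta, x, threshold=0.5):
    """Decision rule (2.9): predict 1 when the probability exceeds the threshold."""
    return 1 if predict_proba(theta, x) > threshold else 0


theta = [-1.0, 2.0]   # hypothetical weights (first entry acts as intercept)
x = [1.0, 0.8]        # hypothetical input, first entry is the constant 1
p = predict_proba(theta, x)                           # roughly 0.65
label_default = predict_label(theta, x)               # threshold 0.5
label_strict = predict_label(theta, x, threshold=0.7)
```

With the default threshold the company is labelled positive, while the stricter threshold of 0.7 turns the same score into a negative prediction, which trades recall for precision.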

Training a logistic regression model amounts to learning the parameters $\boldsymbol{\theta}$ from data. Simple and popular algorithms such as gradient descent or stochastic gradient descent can efficiently learn the parameters by maximum likelihood estimation.

Because LR is easy to use and the learned weights are easy to interpret, we choose LR as a benchmark model in this thesis.

2.4.2 Decision Tree and Random Forest

Decision trees (DT) are a type of model used for both classification and regression. A tree asks a sequence of questions that lead from the root node down a particular path to a leaf, which gives the answer. Each node applies an “if this, then that” condition, and the path ultimately yields a specific result. Figure 2.12 illustrates this with an example tree that decides whether or not to play tennis.

Figure 2.12 Decision tree example

The problem with a DT is that it is not very robust to changes in the data: a small change in the data can sometimes cause a large change in the final predictions [Dudoit et al., 2002]. A random forest (RF) is a modeling technique that is much more robust than a single decision tree. RF is an ensemble method, that is, it aggregates many models in order to limit overfitting (the error due to variance) and thereby yield more reliable results. In the case of RF, a whole forest of randomized, weakly correlated decision trees is built and their predictions are combined to obtain better predictive results. RF will also be used as a benchmark model in this thesis.


2.5 Performance metrics

2.5.1 Accuracy, precision, recall and F-measure

In a binary classification problem, the class of a sample can only be 0 (negative) or 1 (positive). When a classification model is used for prediction, a sample with true class 0 may be predicted as either 0 or 1, and the same holds for a sample with true class 1. These 4 combinations (true positives TP, false positives FP, true negatives TN and false negatives FN) form the confusion matrix, which is given in Figure 2.13. The 4 performance metrics discussed in this section are all based on the confusion matrix, as shown in (2.10), (2.11), (2.12) and (2.13).

Figure 2.13 Confusion matrix

$$Accuracy = \frac{TP + TN}{TP + FP + TN + FN} \qquad (2.10)$$

$$Precision = \frac{TP}{TP + FP} \qquad (2.11)$$

$$Recall = \frac{TP}{TP + FN} \qquad (2.12)$$

$$F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \qquad (2.13)$$

Accuracy - Accuracy is simply the ratio of correctly predicted observations to the total number of observations. Accuracy is a good performance measure for symmetric datasets, where the numbers of false positives and false negatives are almost the same. With an unbalanced dataset, accuracy may mislead: on a dataset with 99% negative cases, a classifier that blindly labels every sample as negative achieves 99% accuracy.


Precision - Precision measures how many of the predicted positives are truly positive. When the cost of a false positive is high, precision is a good metric to consider. In spam detection, a false positive is a legitimate email that has been classified as spam; the user may lose important information if the precision of the classifier is not very high.

Recall - Recall measures how many of the truly positive cases the model captures. Recall is useful when a false negative carries a high cost. In disease screening, if a sick person is labeled as healthy after a test (false negative), the cost can be extremely high when the disease is serious. In company bankruptcy prediction, a company that goes bankrupt but was predicted to stay active may cause high losses to investors.

F1 score - The F1 score is the harmonic mean of precision and recall. Precision and recall normally trade off against each other: higher precision is usually associated with lower recall. The F1 score ranges between 0 and 1 and equals 1 only when precision and recall are both 1, so the higher the F1 score, the better. The F1 score is useful when we need to strike a balance between precision and recall.
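The four metrics in (2.10) to (2.13) follow directly from the confusion-matrix counts. The sketch below is illustrative (the function names are assumptions) and reproduces the 99%-negative example from the accuracy paragraph: a classifier that labels everything as negative.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 as in equations (2.10)-(2.13)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    nan = float("nan")
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else nan  # undefined without predicted positives
    recall = tp / (tp + fn) if (tp + fn) > 0 else nan
    if precision + recall > 0:  # comparison is False when either value is NaN
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = nan
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# 1 positive and 99 negatives; the classifier predicts negative for every sample.
y_true = [1] * 1 + [0] * 99
y_pred = [0] * 100
result = classification_metrics(y_true, y_pred)
```

Here the accuracy is 0.99 although no positive case is found: recall is 0 and precision and F1 are undefined (NaN), because there are no predicted positives.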

2.5.2 AUC

AUC is another popular performance metric in binary classification. The problem with the 4 performance metrics discussed in 2.5.1 is that they depend strongly on the threshold value chosen for classifying the samples: the outputs of the classification model are probabilities or scores, which need to be transformed into class 0 or 1. For each threshold value, the numbers in the confusion matrix change, and so do the performance metrics derived from them. The main benefit of using AUC as a performance metric is that it is threshold independent.

AUC stands for Area Under the Curve, here the area under the ROC (Receiver Operating Characteristic) curve. The x-axis of the ROC curve is the false positive rate (FPR) and the y-axis is the true positive rate (TPR), defined in (2.14) and (2.15) respectively.

$$FPR = \frac{FP}{FP + TN} \qquad (2.14)$$

$$TPR = \frac{TP}{TP + FN} \qquad (2.15)$$

The ROC plot shows the trade-off between the TPR and the FPR for each threshold value. The aggregate performance across the entire range of trade-offs, which is the area under the ROC curve, therefore describes the general performance of the classifier. The AUC is calculated as follows:

$$AUC = \int_{0}^{1} \frac{TP}{TP + FN} \, d\!\left(\frac{FP}{FP + TN}\right) \qquad (2.16)$$

Interpretation of the AUC is easy: the higher the AUC, the better the performance, with 0.50 corresponding to a classifier that labels the samples at random and 1.00 denoting perfect performance.
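A simple way to compute the AUC without drawing the ROC curve uses its rank interpretation: the area under the ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses this pairwise formulation, which is chosen here for clarity and is not necessarily how the AUC was computed in the thesis; the scores are made-up numbers.

```python
def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
model_scores = [0.1, 0.4, 0.35, 0.8]
auc_value = auc(labels, model_scores)             # 3 of 4 pairs are ordered correctly
auc_constant = auc(labels, [0.5, 0.5, 0.5, 0.5])  # uninformative scores
```

Because only the ordering of the scores matters, the result does not depend on any classification threshold.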

2.6 Dummy data generation

After implementing the CRF business failure prediction model in MATLAB, we first applied the model to dummy data of triple-linked companies, as schematically shown in Figure 2.14, in order to debug the code and to characterize the model performance. The dummy data have the same number of continuous and discrete variables as the real data. In addition, we intentionally give linked companies the same business status (bankruptcy or active) with a certain probability, called the link strength, which is explained later.

Figure 2.14 Dummy triple linked company data

The procedure to generate the dummy linked companies is schematically shown in Figure 2.15.


Figure 2.15 Procedure to generate triple linked dummy company data.

Step1: With the help of the scikit-learn package in Python, we first generated a random 2-class classification dataset with 20 discrete variables and 9 continuous variables (using the function make_classification), which is the same number of predictors as in the real dataset. Among those 29 variables, 4 are informative (the number 4 was picked arbitrarily by the author) and the rest are redundant. Each class is assigned 50% of the samples (balanced classes).

Step2: Based on the class of each sample, split the initial batch generated in step 1 into an active company batch (where the class is 0) and a bankruptcy company batch (where the class is 1).

Step3: Form a triple of linked companies by “tossing a coin” 3 times:

• For the 1st toss, if head, randomly draw a company from the bankruptcy batch; if tail, randomly draw a company from the active batch.

• For the 2nd and the 3rd toss, if head, draw a company from the same batch as the previously drawn company; if tail, draw a company from the other batch.

• Table 2.2 shows the class of each company in the generated triple for every outcome of the 3 coin tosses.

Step4: Repeat step 3 until either the bankruptcy batch or the active batch is empty, or until the number of linked companies has reached the desired number.


Table 2.2 The generated linked dummy company status corresponding to the coin toss results.

Coin toss results    Linked company status
HHH                  111
HTT                  101
HTH                  100
HHT                  110
TTT                  010
THT                  001
THH                  000
TTH                  011
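The mapping in Table 2.2 follows mechanically from the rules in step 3. The short sketch below (the function name `statuses_from_tosses` is introduced only for this illustration) derives the status string of a triple from a toss sequence, with 1 for bankruptcy and 0 for active.

```python
def statuses_from_tosses(tosses):
    """Map a coin toss sequence such as 'HTH' to the linked company statuses.

    1st toss: H -> bankruptcy batch (1), T -> active batch (0).
    Later tosses: H -> same batch as the previous company, T -> the other batch.
    """
    statuses = [1 if tosses[0] == "H" else 0]
    for toss in tosses[1:]:
        previous = statuses[-1]
        statuses.append(previous if toss == "H" else 1 - previous)
    return "".join(str(s) for s in statuses)


toss_sequences = ["HHH", "HTT", "HTH", "HHT", "TTT", "THT", "THH", "TTH"]
table_2_2 = {tosses: statuses_from_tosses(tosses) for tosses in toss_sequences}
```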

Based on the data generation procedure described above, three types of dummy data were generated:

1. Dummy linked companies with different link strength:

The link strength is the probability that a company has the same business status (bankruptcy or active) as its linked company. To generate linked companies with a given link strength, the 2nd and 3rd coin tosses need to be biased. For example, to generate linked companies with link strength 0.7, after the 1st company has been drawn with the 1st coin toss, the 2nd and 3rd coin tosses are biased to land heads with probability 0.7.

3 batches of dummy data with link strength 0.5, 0.7 and 0.9 were generated. Link strength 0.5 means that bankrupt and active companies are linked at random. Link strength 0.7 means that linked companies have the same business status with a probability of 0.7. Each of the 3 batches is split into a training (70%) and a test (30%) dataset for model learning and testing. If the CRF model is implemented correctly, we expect it to perform better on the dummy data batches with higher link strength.

2. Dummy linked companies with different class imbalance levels:

Since the real dataset is highly unbalanced (very few bankruptcies are observed), it is interesting to study the influence of class imbalance on the model performance. To generate dummy linked companies with different class imbalance levels, we bias the first coin toss and keep the link strength high at 0.9. Dummy data are generated with the probability of heads on the first coin toss set to 0.5 (fair coin), 0.2, 0.01 and 0.001. The lower this probability, the fewer positive (bankruptcy) cases are observed.
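The two biases described above can be simulated as follows. This is a simplified sketch: it samples only the status pattern of each triple and does not draw actual companies from the two batches, and the function and parameter names (`sample_triple_statuses`, `link_strength`, `p_bankrupt_first`) are assumptions made for illustration.

```python
import random


def sample_triple_statuses(link_strength, p_bankrupt_first, rng):
    """Sample the statuses (1 = bankruptcy, 0 = active) of one linked triple.

    p_bankrupt_first: probability of heads on the 1st toss (class balance).
    link_strength: probability of heads on the 2nd and 3rd toss, that is,
    the probability that a company has the same status as the previous one.
    """
    statuses = [1 if rng.random() < p_bankrupt_first else 0]
    for _ in range(2):
        previous = statuses[-1]
        same = rng.random() < link_strength
        statuses.append(previous if same else 1 - previous)
    return statuses


def empirical_summary(link_strength, p_bankrupt_first, n_triples=20000, seed=0):
    """Return (share of links with equal status, share of bankrupt first companies)."""
    rng = random.Random(seed)
    same_links = 0
    first_bankrupt = 0
    for _ in range(n_triples):
        s = sample_triple_statuses(link_strength, p_bankrupt_first, rng)
        first_bankrupt += s[0]
        same_links += int(s[0] == s[1]) + int(s[1] == s[2])
    return same_links / (2 * n_triples), first_bankrupt / n_triples


same_share, first_share = empirical_summary(link_strength=0.9, p_bankrupt_first=0.2)
```

With enough triples, `same_share` approaches the chosen link strength and `first_share` approaches the bias of the first toss.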


3. Dummy data with different y-flip ratio:

To add different levels of noise to the data, a certain percentage of the companies’ statuses (y-values) is randomly flipped after the 1st step and before the 2nd step of the dummy data generation procedure. We generated 3 batches of dummy data in which 0.01%, 10% and 30% of the companies had their y-value flipped (the y-flip ratio). The higher the y-flip ratio, the higher the noise level.
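Label flipping of this kind can be sketched as below. The sketch assumes that an exact share of the labels is flipped (rounded to the nearest integer count); the thesis does not state whether an exact count or an independent per-sample probability was used, and `flip_labels` is a name introduced here.

```python
import random


def flip_labels(labels, flip_ratio, seed=0):
    """Return a copy of the 0/1 labels with a share `flip_ratio` of them flipped."""
    rng = random.Random(seed)
    n_flip = round(flip_ratio * len(labels))
    flip_positions = rng.sample(range(len(labels)), n_flip)  # distinct positions
    flipped = list(labels)
    for i in flip_positions:
        flipped[i] = 1 - flipped[i]
    return flipped


clean_labels = [0, 1] * 500                     # 1000 balanced dummy labels
noisy_labels = flip_labels(clean_labels, 0.10)  # 10% y-flip ratio
n_changed = sum(1 for a, b in zip(clean_labels, noisy_labels) if a != b)
```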


3. Results

3.1 Apply CRF model on dummy data

Each generated dummy dataset was split into a training and a test dataset with a 70%-30% ratio. After the parameters were learned on the training dataset, the learned model was applied to the test dataset and its performance evaluated.

As mentioned before, accuracy, precision, recall and F1-score depend strongly on the threshold used to classify the company status (in this thesis we always use 0.5 as the threshold), so we base the evaluation of the model performance on AUC.

Model performance on dummy data with different link strength

Table 3.1 and Figure 3.1 show the CRF model performance on the dummy data with different link strength. Two types of models are applied: one with only the singleton factors (S model) and one with both singleton and pairwise factors (S+P model). The S model is simply a log-linear regression model, and its performance acts as the baseline.

Table 3.1 Model performance on dummy data with different link strength

Link strength Model AUC Accuracy Precision Recall F1-score

0.5 S 0.872 0.797 0.797 0.789 0.793

0.5 S+P 0.872 0.797 0.799 0.786 0.793

0.7 S 0.870 0.797 0.822 0.781 0.800

0.7 S+P 0.882 0.803 0.824 0.795 0.809

0.9 S 0.867 0.790 0.775 0.787 0.782

0.9 S+P 0.939 0.864 0.856 0.861 0.859

Figure 3.1 AUC performance on dummy data with different link strength


As the link strength increases, the AUC of the S model stays almost the same, which is as expected, since the singleton factors do not consider the dependence between linked companies. With a link strength of 0.5, the S model and the S+P model perform similarly, which again is expected: companies with different business statuses are linked at random, so the statuses of linked companies are independent and the pairwise factors have no predictive power. With link strength 0.7 and 0.9, the S+P model outperforms the S model in AUC (0.882 versus 0.870 and 0.939 versus 0.867, respectively). The improvement over the baseline (S model) is larger at link strength 0.9 than at 0.7. This shows the effectiveness of our model and serves as a proof of concept for applying the CRF in the linked company context: if the business statuses of companies linked by a shared board member are dependent, we may be able to get better performance with the CRF model than with a model that does not consider the link.

Model performance on dummy data with different unbalanced class levels.

Table 3.2 and Figure 3.2 show the model performance on dummy data with different class balance levels. Recall that the lower the 1st coin toss bias, the fewer positive (bankruptcy) cases are observed and the higher the class imbalance.

Table 3.2 Model performance on dummy data with different first coin toss bias.

Link strength  Model  First coin toss bias  AUC    Accuracy  Precision  Recall  F1-score
0.9            S      50%                   0.867  0.790     0.775      0.787   0.782
0.9            S+P    50%                   0.939  0.864     0.856      0.861   0.859
0.9            S      20%                   0.849  0.824     0.662      0.679   0.670
0.9            S+P    20%                   0.912  0.870     0.835      0.628   0.716
0.9            S      1%                    0.823  0.911     0.474      0.381   0.422
0.9            S+P    1%                    0.862  0.928     0.702      0.280   0.4
0.9            S      0.1%                  0.804  0.910     0.643      0.257   0.367
0.9            S+P    0.1%                  0.833  0.914     0.789      0.214   0.337


Figure 3.2 AUC performance on dummy data with different first coin toss bias

The performance of both the S and the S+P model decreases when applied to unbalanced data: the AUC keeps decreasing as the data become more unbalanced. However, because of the high link strength, the S+P model still outperforms the S model.

Model performance on dummy data with the different y-flip ratio (noise).

Table 3.3 and Figure 3.3 show the model performance on dummy data with different y-flip ratios. The higher the y-flip ratio, the larger the noise. The performance of both the S and the S+P model decreases as the noise level increases; however, due to the high link strength, the S+P model again outperforms the S model.

Table 3.3 Model performance on dummy data with the different y-flip ratio.

Link strength  Factor  y-flip ratio  AUC    Accuracy  Precision  Recall  F1-score
0.9            S       0.01%         0.867  0.790     0.775      0.787   0.782
0.9            S+P     0.01%         0.939  0.864     0.856      0.861   0.859
0.9            S       10%           0.846  0.767     0.767      0.795   0.781
0.9            S+P     10%           0.92   0.843     0.819      0.898   0.857
0.9            S       30%           0.783  0.722     0.723      0.684   0.703
0.9            S+P     30%           0.862  0.778     0.765      0.774   0.771


Figure 3.3 AUC performance on dummy data with the different y-flip ratio

3.2 Apply CRF on real company dataset.

3.2.1 On original company data

We first consider the companies that have no missing values in the year 2013. This gives us 28174 companies in total, of which 27946 are active and 228 (1.02%) are bankrupt, from which 42178 linked company pairs are generated. These company pairs are split into a training and a test dataset with a 70%-30% ratio. After the parameters are trained, the results on the test dataset are shown in Table 3.4.

Table 3.4 Model performance on original company dataset (year 2013)

Factor AUC Accuracy Precision Recall F1-score

S 0.797 0.995 NaN 0 NaN

S+P 0.798 0.995 NaN 0 NaN

For the S model, the AUC is 0.797. The precision and F1-score are NaN because, at the default decision threshold of 0.5, every company is predicted as active. The S+P model barely improves the AUC. One possible reason is the high imbalance in the data: Table 3.5 shows the number and percentage of company pairs for each combination of statuses, and there are only 228 (1.02%) bankrupt companies in our dataset. With the stochastic gradient descent method, the bankruptcy cases are too few to train the model effectively. A 2nd possible reason why the S+P model does not show a great improvement over the S model is the oversimplification of the company connections: in chapter 2, we decided to limit ourselves to modeling only the linear-chained network topology (specifically, pairs of linked companies). In other words, we made a strong simplification of the company link structure, while about 20% of the network structures do not have a linear-chained topology. Graphical models are essentially graph-based representations of factorization assumptions about a distribution, and these factorizations are typically equivalent to independence statements among the variables in the distribution. If the graph cannot represent, or at least closely approximate, the true connections, it is hard to expect the graphical model to perform much better than the baseline model.

Table 3.5 Percentage of 2-linked company status of the year 2013

              A survival        A bankruptcy
B survival    41830 (99.17%)    159 (0.38%)
B bankruptcy  159 (0.38%)       30 (0.07%)

To solve the imbalanced data problem, we try to train the model by up-sampling the data.

3.2.2 On up-sampling company data

We again consider the data of the year 2013. After splitting the data into a training and a test dataset, we perform upsampling on the training dataset only. We upsample only the linked pairs in which at least one company is bankrupt. After upsampling, the proportions of the different linked company statuses are as shown in Table 3.6.

Table 3.6 Percentage of 2-linked company status of the year 2013 (up-sampling data)


              A survival       A bankruptcy
B survival    29305 (50%)      14985 (25.5%)
B bankruptcy  12150 (20.6%)    2430 (3.9%)
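One simple way to up-sample the training pairs is to replicate every pair that contains at least one bankrupt company a fixed number of times. The sketch below illustrates this idea only; the thesis does not describe the exact up-sampling scheme, so the replication approach, the `factor` parameter and the data layout (a list of `(pair_id, (y_a, y_b))` tuples) are assumptions.

```python
def upsample_bankrupt_pairs(pairs, factor):
    """Replicate pairs with at least one bankruptcy `factor` times.

    pairs: list of (pair_id, (y_a, y_b)) with statuses coded as 0/1.
    Pairs in which both companies are active are kept once.
    """
    upsampled = []
    for pair in pairs:
        _, (y_a, y_b) = pair
        copies = factor if (y_a == 1 or y_b == 1) else 1
        upsampled.extend([pair] * copies)
    return upsampled


# Hypothetical training pairs: two active-active pairs, two with a bankruptcy.
training_pairs = [("p1", (0, 0)), ("p2", (0, 0)), ("p3", (1, 0)), ("p4", (1, 1))]
upsampled_pairs = upsample_bankrupt_pairs(training_pairs, factor=3)
n_with_bankruptcy = sum(1 for _, (a, b) in upsampled_pairs if a == 1 or b == 1)
```

Only the training pairs are passed through such a function; the test dataset keeps its original class proportions, as described above.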

The trained model is then applied to the test dataset; the performance is shown in Table 3.7:

Table 3.7 Model performance on upsampling company dataset (year 2013)

Factor AUC Accuracy Precision Recall F1-score

S 0.836 0.569 0.009 0.912 0.017

S+P 0.84 0.573 0.0086 0.913 0.017

Compared with the model trained on the original data, the AUC improves clearly (from about 0.80 to about 0.84); in addition, the S+P model shows a small improvement over the S model. There is, however, an issue with training and testing the model on data of the same year: some companies are present in both the training and the test dataset; in other words, there is data leakage. The model learned on the 2013 data was therefore also applied to the data of the other years. Table 3.8 shows the performance:


Table 3.8 Test results on the data of the other years for the model trained on the year 2013 data.

Factor  AUC 2010  AUC 2011  AUC 2012  AUC 2014  AUC 2015  AUC 2016
S       0.6871    0.7076    0.6137    0.6       0.6968    0.6316
S+P     0.6739    0.7034    0.6077    0.593     0.6888    0.6176

Unfortunately, this time the S+P model underperforms the baseline model. Table 3.9 shows the number and percentage of 2-linked company status combinations in the year 2014: among the linked pairs in which at least one company is bankrupt, the proportion in which the other company is also bankrupt decreases to 3.33% (10/(145+145+10)). This means that the model trained on the 2013 data does not transfer well to 2014 because the link strength has changed.

Table 3.9 Percentage of 2-linked company status of the year 2014


              A survival        A bankruptcy
B survival    41130 (99.28%)    145 (0.35%)
B bankruptcy  145 (0.35%)       10 (0.024%)
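The 3.33% figure quoted for 2014 can be reproduced from the counts in Table 3.9. The helper below is a small illustrative calculation; the function name is introduced here and is not taken from the thesis code.

```python
def both_bankrupt_share(n_only_a, n_only_b, n_both):
    """Share of links with two bankruptcies among links with at least one."""
    return n_both / (n_only_a + n_only_b + n_both)


# Counts from Table 3.9 (year 2014): 145 + 145 links with exactly one
# bankruptcy and 10 links in which both companies are bankrupt.
share_2014 = both_bankrupt_share(145, 145, 10)
share_2014_percent = round(100 * share_2014, 2)
```

Note that this quantity is a proportion; odds in the strict sense would divide the 10 links by the 290 links with exactly one bankruptcy instead.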

Table 3.10 shows the odds of bankruptcy-bankruptcy company pairs among the links with at least one bankruptcy (computed as the share of such links in which both companies are bankrupt); these odds change from year to year. They can be viewed as the “link strength”, or dependency, between the companies in a pair. If the link dependency shows no stable “pattern” over the years, the parameters of the features in the pairwise factors that were trained on the data of one year cannot be applied to another year with a different link strength.

Table 3.10 Odds of bankruptcy-bankruptcy among links with at least one bankruptcy

2010 2011 2012 2013 2014 2015 2016

8.6% 10.2% 10.3% 8.9% 3.3% 2.34% 7.9%

Table 3.11 shows the performance of logistic regression (LR) and random forest (RF) on the original company dataset of each year. Compared with Table 3.8, RF outperforms the CRF model in every year, while LR does so only in some years (it scores lower than the CRF model in 2015 and 2016). Overall, RF performs better than LR.

Table 3.11 Benchmark Model Performance on each year

2010 2011 2012 2013 2014 2015 2016

AUC of LR 0.72 0.71 0.65 0.71 0.69 0.59 0.51

AUC of RF 0.74 0.76 0.71 0.77 0.82 0.77 0.68


4. Conclusions and future work

4.1 Conclusion

The aim of this thesis is to predict whether a Belgian company will go bankrupt in the next financial year (a classification problem). Traditional business failure prediction models include only financial ratios as predictors and treat companies as isolated entities: they ignore the links between companies and do not consider any influence companies may have on each other. In reality, all businesses form an ecosystem. Companies may be linked to each other through shared board members. Board networks are particularly interesting because information is transferred from company to company through these linked boards. There might be a relationship between a board's connections and the financial status of the linked companies. In this thesis, we explore the added value of taking the board network into account in bankruptcy prediction and evaluate the performance of Conditional Random Fields (CRF) in business failure prediction.

Specifically, we designed, implemented and trained a CRF model to predict the business failure of companies with shared board members. A linear-chained Markov network was used as the graph representation of the CRF model. The clique tree message passing algorithm was used to build the exact inference engine, and the stochastic gradient descent algorithm was applied to learn the parameters from the data. Data on Belgian SMEs from 2010 to 2016 were used to train and evaluate the model. The performance of the CRF was also compared with several benchmark models.

Two types of CRF model were built: one with only the singleton factors taken into account (S model), which acts as the baseline model, and one with both singleton and pairwise factors taken into account (S+P model).

The model was first trained and evaluated on generated dummy datasets to simulate the model performance under different circumstances. On the dummy data, the model worked as expected. As the company link strength increases, the performance of the S+P model increases while the performance of the S model stays the same; with a link strength higher than 0.5, the S+P model outperforms the S model. These results support the correctness of the model implementation and provide a proof of concept for applying the CRF in the linked company context. It is also shown that both unbalanced data and noise in the data decrease the model performance.

The CRF model was then trained and evaluated on the real company dataset. Both the S and the S+P model show some predictive power (with AUC ~0.8). However, compared with the baseline model, the S+P model shows little improvement. We suspect there are two reasons. The first may be the highly unbalanced dataset: on the up-sampled dataset for the year 2013, the S+P model does show some (although still small) improvement over the S model, with the AUC improving from 0.836 to 0.84. The second reason could be the oversimplification of the true company connections by the linear-chained graph representation, since about 20% of the network structures in the dataset do not have a linear-chained topology.

The model trained on the up-sampled dataset of the year 2013 was then applied to the data of the other years (Table 3.8). Unfortunately, this time the S+P model even underperformed the baseline model, which shows the limited ability of the model to generalize. It is found that the odds of bankruptcy-bankruptcy company pairs are not constant but change considerably from year to year. Since the link dependency shows no stable “pattern”, a model trained on the data of one year cannot reliably predict bankruptcy in another year with a different link strength.

In conclusion, the linear-chained CRF model we built performs poorly in predicting the business failure of companies with shared board members. With the dataset used in this thesis, taking the board network into account shows limited added value for bankruptcy prediction.

4.2 Limitation and future work

1. This study only considers the linear-chained company structure. Many other structures occur in network topologies, such as ring, star and densely connected topologies. It would be interesting to evaluate the CRF model on these different network topologies.

2. The exact belief propagation inference algorithm used in this thesis only applies to tree topologies and does not work for complex topologies with cyclic network structures, for which approximate inference algorithms can be applied.

3. As a learning exercise, this thesis did not use any external CRF package but implemented the full model in MATLAB without much attention to efficiency. The time needed to run the model is therefore long. To increase the efficiency, the model would need to be carefully implemented in a language like C/C++.


References

[Muller et al., 2015] Muller, P., Caliandro, C., Peycheva, V., Gagliardi, D., Marzocchi, C.,

Ramlogan, R., and Cox, D. (2015). Annual report on european smes. Performance review.

The European Commission Publication Office.

[Commission et al., 2003] Commission, E. U. et al. (2003). Commission recommendation

of 6 may 2003 concerning the definition of micro, small and medium-sized enterprises.

Official Journal of the European Union, 46:36-41.

[Tkac and Verner, 2016] Tkac, M. and Verner, R. (2016). Artificial neural networks in

business: Two decades of research. Applied Soft Computing, 38:788-804.

[Baysinger and Butler, 1985] Baysinger, B. D. and Butler, H. N. (1985). Corporate

governance and the board of directors: Performance effects of changes in board

composition. Journal of Law, Economics, & Organization, 1(1):101-124.

[Tobback et al., 2016] Tobback, E., Moeyersoms, J., Stankova, M., Martens, D., et al.

(2016). Bankruptcy prediction for smes using relational data. Technical report.

[Sutton et al., 2012] Sutton, C., McCallum, A., et al. (2012). An introduction to conditional

random fields. Foundations and Trends in Machine Learning, 4(4):267-373.

[Van Damme et al., 2017] Domien, Van Damme, et al. (2017) Conditional Random Fields

For Bankruptcy Prediction.

[Dudoit et al., 2002] Dudoit, S., Fridlyand, J., and Speed, T. P. (2002). Comparison of

discrimination methods for the classification of tumors using gene expression data. Journal

of the American statistical association, 97(457):77-87.

[Koller and Friedman, 2009] Koller, D. and Friedman, N. (2009). Probabilistic graphical

models: principles and techniques. MIT press.

[Beaver, 1966] Beaver, W. H. (1966). Financial ratios as predictors of failure. Journal of

accounting research, pages 71-111.

[Altman, 1968] Altman, E. I. (1968). Financial ratios, discriminant analysis and the

prediction of corporate bankruptcy. The journal of finance, 23(4):589-609.

[Theil and Theil, 1971] Theil, H. and Theil, H. (1971). Principles of econometrics.

Technical report.

[Hung et al., 1998] Hung, H. et al. (1998). A typology of the theories of the roles of

governing boards. Corporate governance, 6(2):101-111.

[Doumpos and Zopounidis, 1999] Doumpos, M. and Zopounidis, C. (1999). A multicriteria discrimination method for the prediction of financial distress: The case of Greece. Multinational Finance Journal, 3(2):71.

[Huysmans et al., 2006] Huysmans, J., Baesens, B., Vanthienen, J., and Van Gestel, T.

(2006). Failure prediction with self-organizing maps. Expert Systems with Applications,

30(3):479-487.

[Min et al., 2006] Min, S.-H., Lee, J., and Han, I. (2006). Hybrid genetic algorithms and

support vector machines for bankruptcy prediction. Expert systems with applications,

31(3):652-660.

[Ravi et al., 2008] Ravi, V., Kurniawan, H., Thai, P. N. K., and Kumar, P. R. (2008). Soft computing system for bank performance prediction. Applied soft computing, 8(1):305-315.

[Chen et al., 2009] Chen, H.-J., Huang, S.-Y., and Kuo, C.-L. (2009). Using the artificial

neural network to predict fraud litigation: Some empirical evidence from emerging

markets. Expert Systems with Applications, 36(2):1478-1484.

[Lin et al., 2012] Lin, W.-Y., Hu, Y.-H., and Tsai, C.-F. (2012). Machine learning in

financial crisis prediction: a survey. IEEE Transactions on Systems, Man, and Cybernetics,

Part C (Applications and Reviews), 42(4):421-436.


Appendix A: Dataset description

Table A.0.1 Raw data structure

BvD ID Number | DM UCI (Unique Contact Identifier) | Status | Status Date | Variables 2010 | Variables 2011 | Variables 2012 | Variables 2013 | Variables 2014 | Variables 2015 | Variables 2016
Company 1 | Board 1 ID number | Status 1 | Status 1 date | Variables company 1 | Variables company 1 | Variables company 1 | Variables company 1 | Variables company 1 | Variables company 1 | Variables company 1
 | number | | | Variables company 1 | Variables company 1 | Variables company 1 | Variables company 1 | Variables company 1 | Variables company 1 | Variables company 1
Company 1 | NA | Status 3 | Status 3 date | Variables company 1 | Variables company 1 | Variables company 1 | Variables company 1 | Variables company 1 | Variables company 1 | Variables company 1
 | number | | | Variables company 2 | Variables company 2 | Variables company 2 | Variables company 2 | Variables company 2 | Variables company 2 | Variables company 2
 | number | | | Variables company 2 | Variables company 2 | Variables company 2 | Variables company 2 | Variables company 2 | Variables company 2 | Variables company 2
 | number | NA | NA | Variables company 2 | Variables company 2 | Variables company 2 | Variables company 2 | Variables company 2 | Variables company 2 | Variables company 2
 | number | NA | NA | Variables company 2 | Variables company 2 | Variables company 2 | Variables company 2 | Variables company 2 | Variables company 2 | Variables company 2


Table A.0.2 Basetable structure in year 2010

BvD ID Number Predictors Status

Company1 Predictors 2010 Status 2010





Table A.0.3 Definition of variables in raw data.

Variable | Definition | Treatment
ROCE using P/L before tax | (Profit (Loss) before tax + Interest Paid) / (Shareholders Funds + Non-Current Liabilities) * 100 | Convert to binary
ROCE using Net income | (Net income for period + Interest Paid) / (Shareholders Funds + Non-Current Liabilities) * 100 | Convert to binary
Profit per employee th USD | Profit before tax / Employees | Convert to binary
Profit margin | (Profit before tax / Operating revenue) * 100 | Convert to binary
EBITDA margin | (EBITDA / Operating revenue) * 100 | Convert to binary
EBIT margin | (EBIT / Operating revenue) * 100 | Convert to binary
Cash flow / Operating revenue | (Cash flow / Operating revenue) * 100 | Convert to binary
Net assets turnover | Operating revenue / (Shareholders funds + Non current liabilities) | Convert to binary
Interest cover | Operating profit / Interest paid | Convert to binary
Stock turnover | Operating revenue / Stocks | Convert to binary
Collection period days | (Debtors / Operating revenue) * 360 | Convert to binary
Credit period days | (Creditors / Operating revenue) * 360 | Convert to binary
Shareholders liquidity ratio | Shareholders funds / Non current liabilities | Convert to binary
Solvency ratio (Liability based) | (Shareholders funds / (Non-current liabilities + Current liabilities)) * 100 | Convert to binary
Operating revenue per employee th USD | Operating revenue / Employees | Convert to binary
Costs of employees / Operating revenue | (Cost of employees / Operating revenue) * 100 | Convert to binary
Shareholders funds per employee th USD | Shareholders funds / Employees | Convert to binary
Average cost of employee th USD | Cost of employees / Employees | Convert to binary
ROE using P/L before tax | (Profit (Loss) before tax / Shareholders funds) * 100 | Standardization
ROE using Net income | (Net income for period / Shareholders funds) * 100 | Standardization
ROA using P/L before tax | (Profit (Loss) before tax / Total Assets) * 100 | Standardization
ROA using Net income | (Net income for period / Total Assets) * 100 | Standardization
Current ratio | Current assets / Current liabilities | Rescale
Liquidity ratio | (Current assets - Stocks) / Current liabilities | Rescale
Solvency ratio (Asset based) | (Shareholders funds / Total assets) * 100 | Standardization
Gearing | ((Non current liabilities + Loans) / Shareholders funds) * 100 | Rescale
Category of the company | Medium-sized or small-sized company | Convert to binary
NACE Rev. 2 main section | Industry classification code, refer to Table A.0.5 for meaning of code | Convert to category
Date of incorporation | Date of formation of the company | Convert into age and then rescale
BvD ID number | Unique identifier of the company | Keep
DM UCI (Unique Contact Identifier) | Unique identifier of the company board ID number. | Consider in network
Company name | Name of the company | Deleted

Status Legal status of the company, refer to Error! Reference source not found. for

all the available status

Convert to binary

Status date Date of status update Deleted
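The treatments named in Table A.0.3 can be illustrated with a minimal sketch. It is not the code used for this thesis, and it rests on three assumptions: "Convert to binary" for a financial ratio is read as a missing-value indicator (in line with the NA_dummy predictors of Table A.0.4), "Standardization" is read as a z-score, and "Rescale" is read as min-max scaling to [0, 1]. The function names are hypothetical.

```python
from statistics import mean, pstdev


def na_dummy(values):
    """Assumed reading of 'Convert to binary': 1 if the value is missing, else 0."""
    return [1 if v is None else 0 for v in values]


def standardize(values):
    """Assumed reading of 'Standardization': z-score on the observed values.

    Missing values (None) stay None. At least one observed value is required.
    """
    observed = [v for v in values if v is not None]
    mu, sigma = mean(observed), pstdev(observed)
    if sigma == 0:
        # A constant column carries no spread; map every observed value to 0.
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else (v - mu) / sigma for v in values]


def rescale(values):
    """Assumed reading of 'Rescale': min-max scaling of observed values to [0, 1]."""
    observed = [v for v in values if v is not None]
    lo, hi = min(observed), max(observed)
    if hi == lo:
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else (v - lo) / (hi - lo) for v in values]


# Made-up values for three companies, for illustration only.
profit_margin = [5.0, None, -2.0]
current_ratio = [1.0, 2.0, 3.0]

print(na_dummy(profit_margin))
print(rescale(current_ratio))
print(standardize(current_ratio))
```

Whether the scaling parameters are estimated per year or over the whole sample is a separate choice that this sketch leaves open.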


Table A.0.4 List of predictors

Variable | Meaning
ROE using P/L before tax | (Profit (Loss) before tax / Shareholders funds) * 100
ROE using Net income | (Net income for period / Shareholders funds) * 100
ROA using P/L before tax | (Profit (Loss) before tax / Total assets) * 100
ROA using Net income | (Net income for period / Total assets) * 100
Current ratio | Current assets / Current liabilities
Liquidity ratio | (Current assets - Stocks) / Current liabilities
Solvency ratio (Asset based) | (Shareholders funds / Total assets) * 100
Gearing | ((Non-current liabilities + Loans) / Shareholders funds) * 100
Age | Year - year of incorporation
NACE Rev. 2 main section | 16-level discrete variable indicating industry classification
Category of the company | Binary variable: 1 if the company is medium-sized, 0 if small-sized
NA_dummy_ROCE using P/L before tax | Binary variable indicating whether the corresponding variable has the value NA
NA_dummy_ROCE using Net income | Binary variable indicating whether the corresponding variable has the value NA
NA_dummy_Profit per employee th USD | Binary variable indicating whether the corresponding variable has the value NA
NA_dummy_Profit margin | Binary variable indicating whether the corresponding variable has the value NA
NA_dummy_EBITDA margin | Binary variable indicating whether the corresponding variable has the value NA
NA_dummy_EBIT margin | Binary variable indicating whether the corresponding variable has the value NA
NA_dummy_Cash flow / Operating revenue | Binary variable indicating whether the corresponding variable has the value NA
NA_dummy_Net assets turnover | Binary variable indicating whether the corresponding variable has the value NA
NA_dummy_Interest cover | Binary variable indicating whether the corresponding variable has the value NA
NA_dummy_Stock turnover | Binary variable indicating whether the corresponding variable has the value NA
NA_dummy_Collection period days | Binary variable indicating whether the corresponding variable has the value NA
NA_dummy_Credit period days | Binary variable indicating whether the corresponding variable has the value NA
NA_dummy_Shareholders liquidity ratio | Binary variable indicating whether the corresponding variable has the value NA
NA_dummy_Solvency ratio (Liability based) | Binary variable indicating whether the corresponding variable has the value NA
NA_dummy_Operating revenue per employee th USD | Binary variable indicating whether the corresponding variable has the value NA
NA_dummy_Costs of employees / Operating revenue | Binary variable indicating whether the corresponding variable has the value NA
NA_dummy_Shareholders funds per employee th USD | Binary variable indicating whether the corresponding variable has the value NA
NA_dummy_Average cost of employee th USD | Binary variable indicating whether the corresponding variable has the value NA
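As a worked illustration of some of the formulas in Table A.0.4, the sketch below computes the current ratio, the liquidity ratio (current assets minus stocks, over current liabilities), the asset-based solvency ratio, gearing and age for a single made-up company. The class name, field names and figures are hypothetical and serve only to make the arithmetic concrete; in the thesis data these ratios are supplied ready-made.

```python
from dataclasses import dataclass


@dataclass
class BalanceSheet:
    """Hypothetical container for the balance-sheet items used below."""
    current_assets: float
    stocks: float
    current_liabilities: float
    non_current_liabilities: float
    loans: float
    shareholders_funds: float
    total_assets: float


def balance_sheet_ratios(b):
    """Compute four predictors of Table A.0.4 from raw balance-sheet items."""
    return {
        "Current ratio": b.current_assets / b.current_liabilities,
        "Liquidity ratio": (b.current_assets - b.stocks) / b.current_liabilities,
        "Solvency ratio (Asset based)": 100 * b.shareholders_funds / b.total_assets,
        "Gearing": 100 * (b.non_current_liabilities + b.loans) / b.shareholders_funds,
    }


def company_age(year, year_of_incorporation):
    """Age = year - year of incorporation."""
    return year - year_of_incorporation


# Made-up figures for one company.
example = BalanceSheet(
    current_assets=200.0,
    stocks=50.0,
    current_liabilities=100.0,
    non_current_liabilities=300.0,
    loans=100.0,
    shareholders_funds=400.0,
    total_assets=1000.0,
)

for name, value in balance_sheet_ratios(example).items():
    print(name, value)
print("Age", company_age(2010, 1995))
```

A denominator of zero (for instance no current liabilities) raises a ZeroDivisionError here; in practice such cases would presumably appear as missing values.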

Table A.0.5 Codes representing the industry classifications

Code | Industry classification
A | Agriculture, forestry and fishing
B | Mining and quarrying
C | Manufacturing
D | Electricity, gas, steam and air conditioning supply
E | Water supply; sewerage, waste management and remediation activities
F | Construction
G | Wholesale and retail trade; repair of motor vehicles and motorcycles
H | Transportation and storage
I | Accommodation and food service activities
J | Information and communication
M | Professional, scientific and technical activities
N | Administrative and support service activities
R | Arts, entertainment and recreation
S | Other service activities
T | Activities of households as employers
U | Activities of extraterritorial organizations and bodies

Table A.0.6 Overview of the possible statuses

Status | Meaning | Model value
Active | The company is active | 0
Active (insolvency proceedings) | The debtor is unable to pay his debts | Removed
Active (rescue plan) | Business rescue plan: proceedings to facilitate the rehabilitation of a company that is financially distressed | Removed
Bankruptcy | Legally declared inability of a company to pay its creditors. The company no longer exists because it has ceased its activities since it is in the process of bankruptcy. | 1
Dissolved (bankruptcy) | The company no longer exists as a legal entity, because it has ceased its activities since it is in the process of bankruptcy. | 1
Dissolved (demerger) | The company no longer exists as a legal entity; the reason for this is a demerger (the company has been split) | Removed
Dissolved (liquidation) | The company no longer exists because it has ceased its activities, since it is in the process of liquidation. | Removed
Dissolved (merger or take-over) | The company no longer exists as a legal entity because the company has been included in a merger. | Removed
In liquidation | The company is in the process of liquidation | Removed
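The mapping of Table A.0.6 from legal status to the binary model value can be written down as a small lookup. The sketch below is only an illustration with hypothetical names: it returns 0 or 1 for the statuses kept in the model and None for statuses whose observations are removed.

```python
# Statuses that receive a model value in Table A.0.6.
STATUS_TO_TARGET = {
    "Active": 0,
    "Bankruptcy": 1,
    "Dissolved (bankruptcy)": 1,
}

# Statuses marked "Removed" in Table A.0.6.
REMOVED_STATUSES = {
    "Active (insolvency proceedings)",
    "Active (rescue plan)",
    "Dissolved (demerger)",
    "Dissolved (liquidation)",
    "Dissolved (merger or take-over)",
    "In liquidation",
}


def encode_status(status):
    """Return 0 or 1 for modelled statuses, None for removed ones."""
    if status in STATUS_TO_TARGET:
        return STATUS_TO_TARGET[status]
    if status in REMOVED_STATUSES:
        return None
    raise ValueError(f"Unknown status: {status!r}")


# Made-up list of statuses; removed observations are filtered out.
statuses = ["Active", "In liquidation", "Bankruptcy", "Dissolved (bankruptcy)"]
targets = [encode_status(s) for s in statuses]
kept = [t for t in targets if t is not None]
print(kept)
```

Raising an error for an unlisted status is a design choice of this sketch, so that a status missing from the table is noticed rather than silently dropped.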
