FORECASTING BANKRUPTCY PREDICTION USING CONDITIONAL RANDOM FIELDS
Bo Wang
Student ID: 01501409
Promotor: Prof. Dr. Dries Benoit
Tutor(s): Ir. Wai Kit Tsang
A dissertation submitted to Ghent University in partial fulfilment of the requirements for the degree of
Master of Science in Statistical Data Analysis.
Academic year: 2017 - 2018
The author and promotor give permission to consult this master dissertation and to copy it
or parts of it for personal use. Every other use falls under the restrictions of copyright,
in particular concerning the obligation to explicitly mention the source when using results
of this master dissertation.
Ghent, August 14, 2018
The promotor, The author,
Prof. Dr. Dries Benoit Bo Wang
Abstract
The aim of this thesis is to predict whether a Belgian company will go bankrupt in the next
financial year, a binary classification problem. We focus on small and medium-sized enterprises.
Traditional business failure prediction models include only financial ratios as predictors and
treat companies as isolated entities. In this research, networks of Belgian companies are
created by linking companies through shared board members. This thesis includes the network
statistics as predictors and evaluates the performance of Conditional Random Fields (CRF) in
Business Failure Prediction. Board networks are particularly interesting because information is
transferred from company to company through the linked boards, so there may be a relationship
between a board's connections and the financial status of the linked companies. We hope the CRF
model outperforms the commonly used analytical techniques (logistic regression and decision
trees, for example), which do not consider the influence of shared board members.
Contents
Abstract
1. Introduction
1.1 Business failure predictions on small and medium-sized enterprises
1.2 Conditional Random Fields in Business Failure Prediction
1.3 Organization of this work
2. Data and Methods
2.1 Dataset and variables
2.2 Data exploration and visualization
2.3 Conditional random fields
2.3.1 Fundamentals of Conditional Random Fields
2.3.2 Representation
2.3.3 Inference
2.3.4 Learning
2.4 Benchmark Model
2.4.1 Logistic regression
2.4.2 Decision Tree and Random Forest
2.5 Performance metrics
2.5.1 Accuracy, precision, recall and F-measure
2.5.2 AUC
2.6 Dummy data generation
3. Results
3.1 Apply CRF model on dummy data
3.2 Apply CRF on real company dataset
3.2.1 On original company data
3.2.2 On up-sampling company data
4. Conclusions and future work
References
Appendix A: Dataset description
1. Introduction
1.1 Business failure predictions on small and medium-sized
enterprises
Small and medium-sized enterprises (SMEs) play an important role in the European
economy. “About 99.8% of enterprises which operated in the EU-28 non-financial business
sector in 2016 were SMEs. These SMEs employed 93 million people, accounting for 67 %
of total employment in the EU-28 non-financial business sector, and generating 57 % of
value added in the EU-28 non-financial business sector” [Muller et al., 2016]. The
European definition of SMEs is formulated as: "The category of micro, small and medium-
sized enterprises (SMEs) is made up of enterprises which employ fewer than 250 persons
and which have an annual turnover not exceeding 50 million euro, and/or an annual balance
sheet total not exceeding 43 million euro" [Commission et al., 2003].
A major research domain within corporate finance is to build Business Failure Prediction
(BFP) models that accurately predict the future status of companies [Tkac and Verner,
2016]. Indeed, BFP models can help stakeholders take precautions and potentially prevent
business failures. BFP models could be interesting for investors, managers,
holding/collaborating companies and the government.
1.2 Conditional Random Fields in Business Failure Prediction.
In classic statistical BFP models, businesses are classified as either failing or non-failing,
which is a binary classification problem. One category of explanatory variables that is
popularly used as predictors in business failure classification, and which is also used in this
thesis, is accounting-based measures, also known as financial ratios [Baysinger and
Butler, 1985]. The reason is that financial ratios can be viewed as an overall summary of a
business's historical performance: bankrupt companies normally perform badly and therefore
have poor financial ratios.
Using financial ratios as predictors, different types of techniques and methods have been
applied to BFP, from classical univariate/multivariate statistical analysis to advanced
machine learning methods. Table 1.1 gives a short summary of the most important and
popular techniques that have been applied in the BFP field.
Table 1.1 Short list of techniques applied in the BFP field.

Technique | Summary
Univariate analysis [Beaver, 1966] | The very first BFP study. Applied a binary classification model with a variety of financial ratios.
Multiple discriminant analysis [Altman, 1968] | Extended BFP from the univariate to the multivariate setting. Assumes normality and equal covariance matrices.
Probit analysis [Zopounidis et al., 1999] | Probit models assume a cumulative normal distribution.
Logit analysis [Doumpos et al., 1999] | Logit models assume a logistic distribution.
Self-organizing maps [Huysmans et al., 2006] | Unsupervised machine learning technique with powerful visualization capabilities.
Support vector machines [Min et al., 2006] | Supervised machine learning technique. Less interpretable than MDA.
Probabilistic neural networks [Ravi et al., 2008] | Feedforward neural networks widely used in classification and pattern recognition problems.
Artificial neural networks [Chen et al., 2009] | Currently a popular technique for BFP. Suffers from limited interpretability and transparency.
Hybrid systems, ensemble methods [Lin et al., 2012] | Aggregate multiple prediction techniques to improve accuracy.
The approaches and methods listed above, however, treat companies as isolated, self-contained
entities: they ignore linkages and do not consider any influence between different companies.
In this thesis, we want to exploit exactly that property. Our reasoning is that, in reality,
businesses form an ecosystem. Companies may be linked to each other by business category,
country, investors, size, etc., and the performance of companies may influence each other
through those links. A particularly interesting phenomenon, called interlocking, can often be
observed: a member of the board of directors of one company is also a member of the board of
directors of another company. Previous research has shown that the financial statuses of such
linked companies may influence each other, since resources and information can be transferred
between the linked companies through these boards [Tobback et al., 2016]. In this thesis, we
link companies by their shared board members to form a network and try to utilize the board
network information for business failure prediction.
Taking the board network into account to predict business failure is essentially a problem
of predicting many variables that depend on each other as well as on other observed
variables. The conditional random fields (CRF) technique is particularly suitable for
this task. “CRF is a graphical model which can take neighboring samples or context into
account in order to predict a label” [Sutton et al., 2012]. In this thesis, CRF is
applied to build BFP models that take the network into account. The purpose is to evaluate
whether there is added value in taking the board network into account for bankruptcy
prediction. The performance of CRF is evaluated and compared with several benchmark models.
1.3 Organization of this work
In the remainder of this work, chapter 2 starts with data exploration and visualization of
the company dataset. We then describe how to apply CRF in the business failure
prediction context, including model representation, inference and parameter learning. We
also describe the benchmark models and the performance metrics used to evaluate the models,
followed by a short section on dummy data generation and the up-sampling method. In chapter
3, we present the results of the CRF model on both the dummy and the real company dataset. In
the concluding chapter 4, we discuss the obtained results, several open issues and
recommendations for further research.
2. Data and Methods
2.1 Dataset and variables
The dataset used in this thesis contains financial information on both active and failed
Belgian companies in the years 2011-2017, obtained from the Orbis database of
Bureau Van Dijk (BvD) in the thesis work of Domien Van Damme. Only medium and
small-sized companies are selected; refer to the thesis of Domien Van Damme for the
detailed data selection procedure and criteria [Van Damme et al., 2017].
The raw data file contains variables for each company and each year. The variables include
yearly financial information (revenues, stock turnover, current ratios, etc.) and non-
financial information such as company names, industrial classification, and a unique ID
for each board director. We refer to Table A.0.3 for the detailed definition of each variable.
For each company, the business status of each year is also given in the raw data file. The
business statuses are divided into 12 categories; we refer to Table A.0.6 for the full list of
possible business statuses. Following the suggestions in the thesis work of Domien Van
Damme, to obtain a dataset of companies that are as similar as possible in accounting
and legal terms, companies with the business status “Bankruptcy” or “Dissolved
(bankruptcy)” are considered failed companies, and those with the status “Active” are
considered active companies. All other business statuses are ignored.
To predict whether a company goes bankrupt or not, the variables in the raw data file need to
be transformed into predictors. Based on the available variables, 30 predictors were
created: 9 continuous and 21 discrete. Refer to Table A.0.3 for the
transformation method of each variable and to Table A.0.4 for an overview of all the
predictors used in BFP modeling in this research. The most important variable
transformation steps and their motivations are listed below:
• The variable “date of incorporation” is the date of formation of the company. It
cannot be used directly as a predictor. However, since failed firms may be
significantly younger than active firms, we transformed it into an age variable,
defined as the difference between the year of the observation and the year of
incorporation.
• The variable “Category of the company” was transformed into a binary variable,
with 0 standing for a small-sized company and 1 for a medium-sized company.
• Some variables have many missing values. Figure 2.1 shows the missing value
percentage of each variable over the whole dataset. Since previous literature shows that
failing companies tend not to share information, variables with more than 30%
missing values were converted into binary dummies indicating whether the information
was given or missing.
• Unrelated variables like “company names” and “Last avail. year” are deleted because
they are irrelevant to the business failure prediction problem and thus cannot be used
as predictors.
• As will be seen later in this chapter, we build the CRF model on log-linear
features, so all numerical variables are exponentiated. To avoid numerical
overflow, all numerical variables in the raw data are rescaled into the range -1 to 1.
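As an illustration, the transformations above can be sketched in plain Python on a few hypothetical records (the field names and values below are ours for illustration, not the Orbis schema):

```python
# Hypothetical raw records; field names are illustrative, not the thesis schema.
raw = [
    {"year": 2015, "incorporated": 2001, "category": "small",  "current_ratio": 1.2},
    {"year": 2015, "incorporated": 2014, "category": "medium", "current_ratio": None},
    {"year": 2016, "incorporated": 1990, "category": "small",  "current_ratio": 0.4},
]

# Age = observation year minus year of incorporation.
ages = [r["year"] - r["incorporated"] for r in raw]
# Company category as a 0/1 dummy (0 = small, 1 = medium).
medium = [1 if r["category"] == "medium" else 0 for r in raw]
# Sparsely reported variables become a "was it reported?" dummy.
ratio_given = [0 if r["current_ratio"] is None else 1 for r in raw]

def rescale(values):
    """Map a numeric column linearly into [-1, 1] to keep exp() well-behaved."""
    lo, hi = min(values), max(values)
    return [2 * (v - lo) / (hi - lo) - 1 for v in values]

ages_scaled = rescale(ages)
```

In practice the minimum and maximum used for rescaling should be taken from the training split only, so that the test data does not leak into the transformation.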
The raw data file, with the structure shown in Table A.0.1, is converted into a basetable
per year (cross-sectional), with the structure company ID, predictors, status, as shown
in Table A.0.2.
Figure 2.1 Variables missing value percentage
2.2 Data exploration and visualization
First, it is interesting to know how many directors a Belgian board typically contains.
Figure 2.2 shows the company counts for different board sizes. Companies with a small board
of fewer than 10 members are by far the most frequent (94.5%). There are, however, 4
companies with more than 100 board members.
Figure 2.2 Company counts for different board sizes
Figure 2.3 shows the count of directors with different numbers of seats. There are 242960
directors in total. Most directors (83.4%) are involved with only one company. The remaining
40253 directors (16.6%) are connected with two or more companies; these directors
and their related companies form the Belgian company board networks. It is quite
remarkable that several directors are connected with a large number of companies; one of
them is connected with 168 companies. How can this be interpreted? “In Belgium, the
independent supervisory board often consists of labor- or governmental organizations and
institutional investors. These institutional investors are typically connected to many
companies.” [Van Damme et al., 2017].
Figure 2.3 Count of directors with different numbers of seats
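The seat counts behind Figures 2.2-2.3, and the links they induce, can be computed in a few lines of Python. The memberships below are invented for illustration; the real list comes from the director IDs in the Orbis extract:

```python
from collections import Counter

# Hypothetical (company, director) board memberships.
seats = [("A", "d1"), ("A", "d2"), ("B", "d2"), ("C", "d3"), ("C", "d2")]

# Seats per director: directors with >= 2 seats create links between companies.
seats_per_director = Counter(d for _, d in seats)
linking = {d for d, n in seats_per_director.items() if n >= 2}

# Board size per company.
board_size = Counter(c for c, _ in seats)

# Companies sharing a director are connected in the board network.
links = {(ca, cb)
         for ca, da in seats for cb, db in seats
         if da == db and ca < cb}
```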
One of our research goals is to check whether board members have an influence on whether
or not a company will fail. Figure 2.4 shows that most directors (93.8%) are not connected
to any failed company. Among the directors who are connected to a failed company, most
(93.7%) are connected to only one. However, it also happens quite often that a director is
connected to multiple bankrupt companies; in one extreme case, a director is connected to
12 bankrupt companies.
Figure 2.4 Count of directors linked to different numbers of bankrupt companies
Figure 2.5 shows the number of bankrupt versus active companies per year. The percentage of
bankrupt companies is small, which is good news for the economy but bad news for model
training due to the highly unbalanced classes.
Figure 2.5 Status counts of companies per year.
2.3 Conditional random fields
Before we start building the CRF model, it must be pointed out that the companies in our
dataset are linked in very complex ways: many different network topologies exist in the
dataset. Table 2.1 gives an overview of the types of network topology found in the
dataset, with schematic examples. In general, the graph topologies can be split into two
categories: singly-connected graphs (trees) and multiply-connected (loopy) graphs. A graph is
singly-connected if there is exactly one path from any node a to any other node b; it is
multiply-connected if it is not singly-connected. Thus, in Table 2.1, the linear
chain and star topologies are singly-connected graphs and the rest are multiply-connected
graphs.
In this thesis, we limit ourselves to modeling only the linear-chain network
topology (specifically, pairs of linked companies), for the reasons listed below:
1. The linear chain topology is not only the simplest, but also represents the largest part
(80.03%) of the structures in our dataset.
2. For singly-connected graphs, there exist efficient exact algorithms, such as belief
propagation, that scale linearly with the number of nodes in the graph. Although approximate
inference algorithms (e.g., loopy belief propagation) exist for multiply-connected models,
they are in general computationally inefficient.
3. The parameters trained on the pairwise linear chain model can also be used
as clique templates to do inference on other types of singly-connected graphs, such as
the star topology.
Table 2.1 Network topologies found in the dataset. Nodes represent companies; edges
indicate linkages (shared board members) between companies
Topology Example
Linear chain
Triangle
Star
Fully connected
Ring
Complex network structure
In the sections below, after a general introduction to the fundamentals of Conditional
Random Fields, we explain in detail how to apply the conditional random fields framework
to the problem of business failure prediction.
2.3.1 Fundamentals of Conditional Random Fields
Let X be the set of input variables, which are always observed, and Y the set of output
variables that we wish to predict. The formal definition of general conditional
random fields is given below:
“Let G be a factor graph over variables X and Y. Then (X, Y) is a conditional random
field if, for any value x of X, the distribution p(y|x) factorizes according to G” [Sutton
et al., 2012].
In formulas, if $F = \{\Psi_a\}$ is the set of factors in G, then the conditional distribution for a
CRF is

$$p(y|x) = \frac{1}{Z} \prod_{a=1}^{A} \Psi_a(y_a, x_a) \qquad (2.1)$$

with Z the normalization constant that makes the distribution sum to one:

$$Z = \sum_{y} \prod_{a=1}^{A} \Psi_a(y_a, x_a) \qquad (2.2)$$
The factor $\Psi_a(y_a, x_a)$ is often parameterized in a log-linear representation:

$$\Psi_a(y_a, x_a; \theta_a) = \exp\left\{ \sum_{k=1}^{K} \theta_{ak} f_{ak}(y_a, x_a) \right\} \qquad (2.3)$$

with $f_{ak}(y_a, x_a): Val(y_a, x_a) \to \mathbb{R}$ the feature functions and $\theta_a$ the weights of the feature functions.
Putting together (2.1), (2.2) and (2.3), the conditional distribution for a CRF with log-linear factors can be written as:

$$p(y|x) = \frac{1}{Z} \prod_{\Psi_a \in F} \exp\left\{ \sum_{k=1}^{K} \theta_{ak} f_{ak}(y_a, x_a) \right\} \qquad (2.4)$$

where the normalization constant Z is:

$$Z = \sum_{y} \prod_{\Psi_a \in F} \exp\left\{ \sum_{k=1}^{K} \theta_{ak} f_{ak}(y_a, x_a) \right\} \qquad (2.5)$$
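To make equations (2.4)-(2.5) concrete, the following sketch evaluates a toy log-linear CRF over two binary labels by direct enumeration. The feature functions and weights here are invented for illustration; the real model uses the features defined in Section 2.3.2 with learned weights:

```python
import math
from itertools import product

def factor(theta, features):
    """One log-linear factor, Eq. (2.3): exp(sum_k theta_k * f_k)."""
    return math.exp(sum(t * f for t, f in zip(theta, features)))

def p_y_given_x(y, x, theta_single, theta_pair):
    def score(labels):
        s = 1.0
        for yi, xi in zip(labels, x):            # singleton factors
            s *= factor(theta_single, [yi, yi * xi])
        for yi, yj in zip(labels, labels[1:]):   # pairwise factors
            s *= factor(theta_pair, [1.0 if yi == yj else 0.0])
        return s
    Z = sum(score(yp) for yp in product((0, 1), repeat=len(y)))   # Eq. (2.5)
    return score(y) / Z                                           # Eq. (2.4)

p = p_y_given_x((1, 1), x=(0.8, -0.2), theta_single=[0.5, 1.0], theta_pair=[0.7])
```

Enumeration over all $2^n$ labelings is only feasible for tiny chains; the inference section below replaces it with message passing.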
The CRF representation above emphasizes that each factor has its own set of weights. In
practice, however, different factors in G often share the same feature functions and
the same parameter values. In this case, clique templates can be created to simplify
the model by partitioning the factors of G into $C = \{C_1, C_2, \ldots, C_P\}$, where each $C_p$ is a clique
template. A CRF that uses clique templates can be written as

$$p(y|x) = \frac{1}{Z} \prod_{C_p \in C} \prod_{\Psi_c \in C_p} \Psi_c(y_c, x_c; \theta_p) \qquad (2.6)$$

where each templated factor is parameterized in log-linear representation as

$$\Psi_c(y_c, x_c; \theta_p) = \exp\left\{ \sum_{k=1}^{K(p)} \theta_{pk} f_{pk}(y_c, x_c) \right\} \qquad (2.7)$$

and the normalization function is

$$Z = \sum_{y} \prod_{C_p \in C} \prod_{\Psi_c \in C_p} \Psi_c(y_c, x_c; \theta_p) \qquad (2.8)$$
In general, the complete CRF algorithm has three aspects:
• Graphical model representation: with the graph representation properly defined, we
also define the factors, the clique templates and the features, from which the
conditional distribution of the CRF model (2.6) follows.
• Model learning: also known as model training or parameter estimation. In this
phase the parameter vector is learned from the data, i.e., the features are weighted.
• Inference: this refers both to computing the marginal distributions of
$p(y|x)$ and to computing the most likely labeling $y^* = \arg\max_y p(y|x)$. Notice that the
inference and learning procedures are often closely coupled, because learning usually
calls the inference procedure as a subroutine.
To apply CRF in the context of company business failure prediction, three tasks
corresponding to the three CRF aspects mentioned above need to be carried out:
Task 1. Build a graphical network model to represent the concept: the network defines
factors that incorporate financial-ratio features and account for the links between
companies.
Task 2. Define an inference algorithm to perform inference in the constructed network.
Task 3. Use the inference algorithm to learn the optimal feature weights from the data and
then find the best business status assignment for every company.
2.3.2 Representation
Suppose we have n companies linked by shared board members. There are two sets
of variables in the model we will build: Yi and Xi for i = 1,...,n. Xi is a vector-valued variable
that corresponds to the features of the i-th company; these variables are always observed. Yi
is the business status (1 for bankrupt and 0 for active) assigned to the i-th company;
these are the hidden variables that we want to predict. With CRF we seek to model P(Y|X),
the conditional distribution over business statuses given the observed financial variables, and
to find the assignment to the Yi variables that correctly describes the company business
statuses given the Xi variables.
We will use a linear-chain Markov network to model the distribution over the Yi
variables, given the variables Xi. Figure 2.6 shows an example of a linear-chain Markov
network over 3 linked companies.
In this example, we have singleton factors $\Psi^C_i(Y_i, X_i)$ that represent how likely a given
company i is to go bankrupt, and pairwise factors $\Psi^P_i(Y_i, Y_{i+1})$ that represent the interactions
between adjacent pairs of companies i and i+1.
With these two types of factors, since X is always observed, the CRF allows us to model the
conditional distribution P(Y|X) in a way that takes into account not only the influence of the
observed financial features on a single company's business status, but also higher-order
dependencies between the linked companies (for example, a bankrupt company may be likely
to be linked to another bankrupt company).
Figure 2.6 Markov network over 3 linked companies.
Now we formally define the factors and the features involved in them.
Singleton factors:
In the simplest case, a model may contain only the singleton factors $\Psi^C$. There
are n such factors, one for every company, where the factor for the i-th company has scope
$\{Y_i, X_i\}$; we call them singleton factors because $X_i$ is always observed, so these factors
essentially operate on single companies. This model is shown in Figure 2.7. We use the model
with only the singleton factors as the baseline model: by adding more complex factors, the
baseline can be used to evaluate the improvement we make.
Figure 2.7 Baseline model containing only the singleton factors $\Psi^C$
Two types of features are involved in the singleton factors:
• $f^C_{i,c}(Y_i) = 1_{\{Y_i = c\}}$, an indicator for Yi = 0 or Yi = 1, which operates on the hidden
variable of a single company. These features encode the individual
probability that Yi = 0 or Yi = 1.
• $f^C_{i,j,c}(Y_i, x_{ij}) = x_{ij} \cdot 1_{\{Y_i = c\}}$, which operates on a hidden
state and the financial ratio $x_{ij}$ associated with that state $c \in \{0, 1\}$ of a single
company. These features encode the individual probability that Yi = c
given xi.
Pairwise factors:
The model which uses only the singleton factors is an entirely valid, though simplistic,
Markov network. The issue with the singleton factors is that they do not consider any
interactions between companies with shared board members. To improve the model, we
introduce n-1 pairwise factors $\Psi^P(Y_i, Y_{i+1})$ for i = 1,...,n-1 to represent these interactions.
This gives us the Markov network model shown in Figure 2.8.
Figure 2.8 Linear chained CRF with singleton factors and pairwise factors.
The intuition behind these pairwise factors is as follows. Suppose two companies are
linked by a shared board member. In isolation, we can only predict each company's status
from its own financial ratios. Suppose, however, that the singleton factor for the first
company assigns a very high score to bankruptcy, i.e., we are fairly certain that the first
company goes bankrupt. In addition, suppose that the odds of seeing two bankrupt
companies are much higher than the odds of seeing a bankrupt company linked to an active one.
Then it can be said with high probability that the second company is bankrupt, even if
the singleton factor assigns the second company nearly equal scores for bankrupt versus
active.
The features involved in the pairwise factors are:
• $f^P_{i,c,d}(Y_i, Y_{i+1}) = 1_{\{Y_i = c,\, Y_{i+1} = d\}}$, which operates on the hidden states of a pair of
linked companies. Since each company has only two states, the pairwise factor
involves four features in total: $1_{\{Y_i=0, Y_{i+1}=0\}}$, $1_{\{Y_i=0, Y_{i+1}=1\}}$, $1_{\{Y_i=1, Y_{i+1}=0\}}$,
$1_{\{Y_i=1, Y_{i+1}=1\}}$.
With these two factor types and their associated features defined, the CRF model in (2.4) is
fully specified. Before learning the weights of the features from the training dataset,
an “inference engine” needs to be defined first.
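Written out as code, the indicator features of this section look as follows; this is a sketch in which the indices are passed as plain arguments rather than templated over the graph:

```python
def f_label(y_i, c):
    """Singleton bias feature: 1{Y_i = c}."""
    return 1.0 if y_i == c else 0.0

def f_ratio(y_i, x_ij, c):
    """Singleton observation feature: x_ij * 1{Y_i = c}."""
    return x_ij if y_i == c else 0.0

def f_pair(y_i, y_next, c, d):
    """Pairwise feature: 1{Y_i = c, Y_{i+1} = d}; four of these for binary labels."""
    return 1.0 if (y_i, y_next) == (c, d) else 0.0
```

Because the same feature definitions are shared by every company and every linked pair, they form exactly the two clique templates of equation (2.6).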
2.3.3 Inference
Inference corresponds to using the distribution to answer questions about the environment.
With the factors and parameters of the CRF model (2.4) defined, it is possible to use
brute force to find the MAP assignment over the factors. The brute-force approach, however,
cannot handle large networks, because its running time is proportional to the number of
entries in the joint distribution over the entire network. For example, assume 10 companies
are linked together by a shared board member, which is not a rare case in our dataset:
brute-force inference needs to enumerate 2^10 = 1024 combinations to find the MAP assignment.
Other forms of exact inference can still be very efficient. In this thesis, we
implement belief propagation, specifically the clique tree message passing algorithm, as an
exact inference engine.
Algorithm: Clique tree message passing [Koller and Friedman, 2009].
1. Construct a clique tree from a given set of factors $\Psi$.
2. Assign each factor $\psi_k \in \Psi$ to a clique $C_{\alpha(k)}$ such that $Scope[\psi_k] \subseteq C_{\alpha(k)}$, where $\alpha(k)$ returns
the index of the clique to which $\psi_k$ is assigned.
3. Compute the initial potentials $\phi_i(C_i) = \prod_{k:\alpha(k)=i} \psi_k$.
4. Designate an arbitrary clique as the root, and pass messages $\delta$ upwards from the leaves
towards the root clique.
5. Pass messages from the root down towards the leaves.
6. Compute the beliefs for each clique: $\beta_i(C_i) = \phi_i \times \prod_{k \in N_i} \delta_{k \to i}$.
As an example, consider the following network with 6 linked companies:
Figure 2.9 Chained Markov network with 6 linked Company
The following clique tree can be created from the list of factors corresponding to this
network:
Figure 2.10 Clique tree for the network
Next, we assign each of the original factors to a clique to initialize the clique potentials in
this tree. The clique containing variables 1 and 2 was arbitrarily chosen as the root. Afterwards,
message passing starts from the leaves up to the root, and then down from the root to the
leaves. The red arrows show the messages passed from the leaves to the root, and the blue
arrows show the messages passed from the root to the leaves. In this clique tree there are
5 cliques, so 2*(5-1) = 8 messages suffice to correctly compute all beliefs. Finally,
we can use the calibrated clique beliefs to answer probabilistic queries on the original
network.
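For the linear chains we model, the upward and downward passes reduce to the classic forward-backward recursion. The sketch below computes exact marginals on a 3-node binary chain; the potential values are illustrative numbers, not learned weights:

```python
def chain_marginals(singleton, pairwise):
    """singleton[i][y]: node potential; pairwise[y][y']: shared edge potential."""
    n = len(singleton)
    fwd = [[1.0, 1.0] for _ in range(n)]   # messages passed left-to-right
    bwd = [[1.0, 1.0] for _ in range(n)]   # messages passed right-to-left
    for i in range(1, n):
        for y in (0, 1):
            fwd[i][y] = sum(singleton[i - 1][yp] * fwd[i - 1][yp] * pairwise[yp][y]
                            for yp in (0, 1))
    for i in range(n - 2, -1, -1):
        for y in (0, 1):
            bwd[i][y] = sum(singleton[i + 1][yn] * bwd[i + 1][yn] * pairwise[y][yn]
                            for yn in (0, 1))
    # Belief = local potential times both incoming messages, then normalize.
    marg = []
    for i in range(n):
        b = [singleton[i][y] * fwd[i][y] * bwd[i][y] for y in (0, 1)]
        z = sum(b)
        marg.append([v / z for v in b])
    return marg

marg = chain_marginals(
    singleton=[[2.0, 1.0], [1.0, 1.0], [1.0, 3.0]],
    pairwise=[[1.5, 0.5], [0.5, 1.5]],   # agreeing neighbors score higher
)
```

Each node's belief uses only its local potential and two incoming messages, so the cost grows linearly with chain length, in contrast to the exponential brute-force enumeration.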
2.3.4 Learning
We have now constructed a Markov network for the task of business failure prediction and
defined an inference engine for the network. What is still lacking are the weights of the
features, which need to be learned from data. We learn those parameters using maximum
likelihood estimation:
Given a set of M training examples $D = \{(x[m], y[m])\}_{m=1}^{M}$, we want to find the $\theta^*$ that
maximizes the likelihood of the observed data:

$$\theta^* = \arg\max_\theta L(\theta : D) = \arg\max_\theta \prod_{m=1}^{M} P(y[m] \mid x[m]; \theta) \qquad (2.9)$$
In this thesis, we use the stochastic gradient descent algorithm to learn the parameters from
the training data. The algorithm is described below:
Algorithm: Stochastic gradient descent
for k = 1 to max iterations:
    Pick an arbitrary training example (x[m], y[m]), then update
    $\theta := \theta - \alpha_k \nabla_\theta \left[ -\log P(y[m] \mid x[m]; \theta) \right]$
    where the learning rate $\alpha_k = \frac{0.1}{1 + \sqrt{k}}$
From the algorithm above, the essential task is: for a given data instance (x, Y) and a
parameter setting $\theta$, compute the cost function (the negative log-likelihood) and
the gradient of the parameters with respect to that cost. To avoid overfitting, an
L2-regularization penalty on the parameter values is added to the negative log-likelihood.
Thus the function we seek to minimize is:

$$nll(x, Y, \theta) \equiv \log(Z_x(\theta)) - \sum_{i=1}^{k} \theta_i f_i(Y, x) + \frac{\lambda}{2} \sum_{i=1}^{k} \theta_i^2 \qquad (2.10)$$
(2.3)
The partial derivatives of this function have an elegant form [Koller and Friedman, 2009]:

$$\frac{\partial}{\partial \theta_i} \, nll(x, Y, \theta) = E_\theta[f_i] - E_D[f_i] + \lambda\theta_i \qquad (2.11)$$

In the derivative there are two expectations: $E_\theta[f_i]$, the expectation of the feature values with
respect to the model parameters, and $E_D[f_i]$, the expectation of the feature values with
respect to the given data instance $D \equiv (x, Y)$:

$$E_\theta[f_i] = \sum_{Y'} P(Y' \mid x; \theta)\, f_i(Y', x) \qquad (2.12)$$

$$E_D[f_i] = f_i(Y, x) \qquad (2.13)$$

In (2.12), we sum over all possible assignments to the Y variables in the scope of the
feature $f_i$. Since each feature has a small number of Y variables in its scope (in our case
at most 2, for the features involved in the pairwise factors), this sum is tractable.
Unfortunately, computing the conditional probability $P(Y' \mid x; \theta)$ for each assignment
requires performing inference for the data instance x. Thus the inference subroutine needs
to be called repeatedly in every iteration of the training loop.
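The update above can be sketched end-to-end for a one-node toy model, where enumeration over the two labels stands in for the inference subroutine; the features, weights and data point are invented for illustration:

```python
import math

def features(y, x):
    return [float(y), y * x]     # toy bias and observation features

def nll_gradient(x, y_obs, theta, lam):
    """Gradient of the regularized nll: E_theta[f_i] - E_D[f_i] + lambda*theta_i."""
    scores = {y: math.exp(sum(t * f for t, f in zip(theta, features(y, x))))
              for y in (0, 1)}
    Z = sum(scores.values())
    e_model = [sum(scores[y] / Z * features(y, x)[i] for y in (0, 1))
               for i in range(len(theta))]          # model expectation (inference)
    e_data = features(y_obs, x)                     # empirical expectation
    return [em - ed + lam * t for em, ed, t in zip(e_model, e_data, theta)]

def sgd_step(x, y_obs, theta, lam, k):
    alpha = 0.1 / (1 + math.sqrt(k))                # the learning-rate schedule above
    return [t - alpha * g
            for t, g in zip(theta, nll_gradient(x, y_obs, theta, lam))]

theta = sgd_step(x=0.5, y_obs=1, theta=[0.0, 0.0], lam=0.1, k=1)
```

For the chain models of this thesis, the enumeration inside `nll_gradient` is replaced by a call to the clique tree inference engine.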
2.4 Benchmark Model
2.4.1 Logistic regression
Logistic Regression (LR) is a type of generalized linear model for predicting the probability
of a binary classification problem. The hypothesis function of logistic regression is:
ℎ(𝒙) = 𝑔(𝜽𝑡𝒙), 𝑔(𝑧) = 1
1 + 𝑒−𝑧 (2.7)
with x the input predictor vector, 𝜽 the parameters (the predictor weights), h(x) the output,
and g(z) the sigmoid function, plotted in Figure 2.11. The sigmoid is an S-shaped curve
that maps inputs from (−∞, +∞) to outputs in (0, 1). This property is essential for binary
classification because we can interpret the output directly as the probability of y = 1
given x and 𝜽:
𝑃(𝑦 = 1|𝒙; 𝜽) = 𝑔(𝜽𝑡𝒙) = 1/(1 + 𝑒^(−𝜽𝑡𝒙))   (2.8)
Figure 2.11 sigmoid function
The decision rule is then:
𝑦 = 1 if 𝑃(𝑦 = 1|𝒙; 𝜽) > threshold value   (2.9)
Although 0.5 is the usual threshold, in practice other values can be chosen: if avoiding
false positives is more important, a larger threshold should be set; if recall of the
positive class is more important, a smaller threshold may be chosen.
Training a logistic regression model amounts to learning the parameters 𝜃 from data.
Simple, popular algorithms such as gradient descent or stochastic gradient descent can
efficiently learn the parameters by maximum likelihood estimation.
Because LR is easy to use and its learned weights are easy to interpret, we choose LR as
a benchmark model in this thesis.
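As a minimal illustration of (2.7)-(2.9), the hypothesis function and decision rule can be written as follows; this is a sketch, with all names chosen for illustration.

```python
import math

def sigmoid(z):
    """The logistic function g(z) of (2.7)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(theta, x):
    """P(y = 1 | x; theta) as in (2.8): the sigmoid of the linear score."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def predict(theta, x, threshold=0.5):
    """The decision rule (2.9): label 1 when the probability exceeds
    the threshold, 0 otherwise."""
    return 1 if predict_proba(theta, x) > threshold else 0
```

Raising the threshold makes the classifier more conservative about predicting the positive class, which is the trade-off discussed above.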
2.4.2 Decision Tree and Random Forest
Decision trees (DT) are models used for both classification and regression. A tree answers
sequential questions that send us from the root node down a certain route to a leaf holding
the answer. Each node applies an "if this then that" condition, ultimately yielding a
specific prediction. This is easy to see in Figure 2.12, which maps out an example of
deciding whether or not to play tennis.
Figure 2.12 Decision tree example
The problem with a DT is that it is not very robust to changes in the data: a small change
in the data can sometimes cause a large change in the final predictions [Dudoit et al., 2002].
Random forests (RF) are much more robust than a single decision tree. RF is an ensemble
method: it aggregates many models to limit overfitting and to reduce the error due to
variance, and therefore yields more reliable results. Specifically, RF grows an entire
forest of random, largely uncorrelated decision trees to obtain better predictive results.
RF will also be used as a benchmark model in this thesis.
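The robustness gained by aggregating uncorrelated trees can be illustrated with a small Monte-Carlo sketch (pure Python; the accuracy figure and the independence assumption are illustrative only, not taken from the thesis): if each tree alone is right 70% of the time and the trees err independently, the majority vote of 25 trees is right far more often.

```python
import random

def majority_vote(votes):
    """A random forest predicts the class chosen by most of its trees."""
    return 1 if sum(votes) * 2 > len(votes) else 0

def ensemble_accuracy(n_trees, p_correct, n_trials=2000, seed=0):
    """Monte-Carlo estimate of how often the majority vote of `n_trees`
    independent classifiers, each individually correct with probability
    `p_correct`, gives the right answer.  A vote of 1 means 'correct'
    here, so the majority vote is 1 exactly when the ensemble is right."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        votes = [1 if rng.random() < p_correct else 0 for _ in range(n_trees)]
        hits += majority_vote(votes)
    return hits / n_trials
```

In practice the trees of a forest are only partially decorrelated (via bootstrapping and random feature selection), so the real gain is smaller than this idealized simulation suggests.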
2.5 Performance metrics
2.5.1 Accuracy, precision, recall and F-measure
In a binary classification problem, the class of a sample is either 0 (negative) or 1
(positive). When we use a classification model for prediction, a sample with true class 0
may be predicted as either 0 or 1, and likewise for a sample with true class 1. These four
combinations form the confusion matrix, shown in Figure 2.13. The four performance metrics
discussed in this section are all derived from the confusion matrix, as shown in
(2.10)-(2.13).
Figure 2.13 Confusion matrix
𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦 = (𝑇𝑃 + 𝑇𝑁)/(𝑇𝑃 + 𝐹𝑃 + 𝑇𝑁 + 𝐹𝑁)   (2.10)
𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 = 𝑇𝑃/(𝑇𝑃 + 𝐹𝑃)   (2.11)
𝑅𝑒𝑐𝑎𝑙𝑙 = 𝑇𝑃/(𝑇𝑃 + 𝐹𝑁)   (2.12)
𝐹1 = 2 · 𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 · 𝑅𝑒𝑐𝑎𝑙𝑙/(𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 + 𝑅𝑒𝑐𝑎𝑙𝑙)   (2.13)
Accuracy - Accuracy is simply the ratio of correctly predicted observations to the total
number of observations. It is a useful measure for balanced datasets, where the numbers of
false positives and false negatives are similar. On an unbalanced dataset, accuracy can
mislead: with a dataset that has 99% negative cases, a classifier that blindly labels every
sample as negative still achieves 99% accuracy.
Precision - Precision measures how many of the predicted positives are true positives.
When the cost of a false positive is high, precision is a good metric to consider. In spam
email detection, a false positive means a legitimate email is classified as spam; the user
may lose important information if the precision of the classifier is not very high.
Recall - Recall measures how many of the truly positive cases the model captures. Recall
is useful when there is a high cost associated with false negatives. In disease screening,
if a sick person is labeled as healthy (a false negative), the cost can be extremely high
when the disease is serious. In company bankruptcy prediction, a bankrupt company predicted
as non-bankrupt may result in large losses for investors.
F1 Score - The F1 score is the harmonic mean of precision and recall. Precision and recall
normally trade off against each other: higher precision is usually associated with lower
recall. The F1 score ranges between 0 and 1 and equals 1 only when precision and recall
are both 1; thus, the higher the F1 score, the better. The F1 score is useful when we need
to balance precision and recall.
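The four metrics (2.10)-(2.13) can be computed directly from the confusion matrix counts; a minimal sketch (function and argument names are illustrative):

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from the confusion matrix,
    following (2.10)-(2.13).  Assumes at least one predicted positive
    and one actual positive, so the denominators are non-zero."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```

Note that when the model predicts no positives at all (as happens later on the real, unbalanced dataset), TP + FP = 0 and precision is undefined, which is why NaN values appear in some of the result tables.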
2.5.2 AUC
AUC is another popular performance metric for binary classification. The problem with the
four metrics discussed in 2.5.1 is that they depend strongly on the threshold chosen for
classifying the samples: the predictions of a classification model are probabilities or
scores, which must be converted into class 0 or 1. For each threshold value, the numbers
in the confusion matrix change, and the metrics derived from those numbers change with
them. The main benefit of the AUC as a performance metric is that it is
threshold-independent.
AUC stands for Area Under the Curve: the area under the ROC (Receiver Operating
Characteristic) curve. The x-axis of the ROC is the false positive rate (FPR) and the
y-axis is the true positive rate (TPR), defined in (2.14) and (2.15).
𝐹𝑃𝑅 = 𝐹𝑃/(𝐹𝑃 + 𝑇𝑁)   (2.14)
𝑇𝑃𝑅 = 𝑇𝑃/(𝑇𝑃 + 𝐹𝑁)   (2.15)
The ROC plot shows the trade-off between TPR and FPR at every threshold value. The
aggregate performance across the entire range of trade-offs, which is the area under the
ROC curve, therefore describes the overall performance of the classifier. The AUC is
calculated as follows:
𝐴𝑈𝐶 = ∫₀¹ 𝑇𝑃/(𝑇𝑃 + 𝐹𝑁) 𝑑(𝐹𝑃/(𝐹𝑃 + 𝑇𝑁))   (2.16)
Interpretation of the AUC is straightforward: the higher the AUC, the better the
performance, with 0.50 indicating a classifier that labels samples at random and 1.00
denoting perfect performance.
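As an illustration, the AUC can also be computed from its rank interpretation, which is equivalent to the area under the ROC curve in (2.16) and makes the threshold-independence explicit; a small sketch (names are illustrative):

```python
def auc(scores, labels):
    """Threshold-free AUC via its rank interpretation: the probability
    that a randomly chosen positive sample is scored above a randomly
    chosen negative sample, with ties counting one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A classifier that ranks every positive above every negative gets AUC 1.0; one that gives every sample the same score gets 0.5, matching the random baseline described above.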
2.6 Dummy data generation
After implementing the CRF business failure prediction model in MATLAB, we first applied
the model to dummy triple-linked company data, shown schematically in Figure 2.14, in
order to debug the code and characterize the model's behavior. The dummy data have the
same numbers of continuous and discrete variables as the real data. In addition, we
intentionally make linked companies share the same business status (bankrupt or active)
with a certain probability, named the link strength, which is explained below.
Figure 2.14 Dummy triple linked company data
The procedure to generate the dummy linked companies is shown schematically in Figure
2.15.
Figure 2.15 Procedure to generate triple linked dummy company data.
Step 1: With the scikit-learn package in Python, we first generated a batch of random
2-class classification data with 20 discrete and 9 continuous variables (using the
function make_classification), matching the number of predictors in the real dataset.
Among those 29 variables, 4 are informative (the number 4 was picked arbitrarily by the
author) and the rest are redundant. The classes are balanced, with 50% of the samples
assigned to each.
Step 2: Based on the class of each sample, split the initial batch generated in Step 1
into an active company batch (class 0) and a bankruptcy company batch (class 1).
Step 3: Form a triple of linked companies by "tossing a coin" 3 times:
• For the 1st toss: if heads, draw a company from the bankruptcy batch; if tails, draw a
company from the active batch.
• For the 2nd and 3rd tosses: if heads, draw a company from the same batch as the
previously drawn company; if tails, draw from the other batch.
• Table 2.2 lists the generated triple-linked company classes corresponding to the
results of the 3 coin tosses.
Step 4: Repeat Step 3 until either the bankruptcy or the active batch is empty, or the
number of linked companies reaches the desired number.
Table 2.2 The generated linked dummy company status corresponds to the coin toss results.
Coin toss results Linked company status
HHH 111
HTT 101
HTH 100
HHT 110
TTT 010
THT 001
THH 000
TTH 011
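Step 3 of the procedure can be sketched as follows. The sketch generates statuses only (the real procedure also draws the feature vectors from the scikit-learn batches); the function and parameter names are the author's own, and the `link_strength` bias is the one introduced in the next subsection.

```python
import random

def triple_link(rng, link_strength=0.5, p_bankrupt=0.5):
    """Generate the statuses (1 = bankrupt, 0 = active) of one triple of
    linked dummy companies via three coin tosses: the first toss picks
    the first company's batch with probability `p_bankrupt`; each later
    toss keeps the previous status with probability `link_strength`."""
    status = [1 if rng.random() < p_bankrupt else 0]
    for _ in range(2):
        same = rng.random() < link_strength
        status.append(status[-1] if same else 1 - status[-1])
    return status
```

With `link_strength=0.5` the tosses are fair and neighbors agree at random, reproducing Table 2.2's eight equally likely outcomes; with a biased coin the generated statuses mimic the dependency between companies linked by a shared board.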
Based on the data generation procedure described above, three types of dummy data were
generated:
1. Dummy linked companies with different link strengths:
The link strength is the probability that a company has the same business status (bankrupt
or active) as its linked company. To generate linked companies with a given link strength,
the 2nd and 3rd coin tosses need to be biased. For example, for a link strength of 0.7,
after the 1st company is drawn according to the 1st coin toss, the 2nd and 3rd tosses are
biased to land heads with probability 0.7.
Three batches of dummy data with link strengths 0.5, 0.7 and 0.9 were generated. A link
strength of 0.5 means bankrupt and active companies are linked at random; a link strength
of 0.7 means the linked companies have the same business status with probability 0.7. Each
of the 3 batches is split into a training set (70%) and a test set (30%) for model
learning and evaluation. If the CRF model is implemented correctly, we expect better
performance on the dummy data batches with higher link strength.
2. Dummy linked companies with different class imbalance levels:
Since the real dataset is highly unbalanced (very few bankruptcies are observed), it is
interesting to study the influence of class imbalance on the model performance. To
generate dummy linked companies with different imbalance levels, we bias the first coin
toss while keeping the link strength high at 0.9. Dummy data were generated with the first
coin biased to land heads with probability 0.5 (fair coin), 0.2, 0.01 and 0.001. The lower
this probability, the fewer positive (bankruptcy) cases are observed.
3. Dummy data with different y-flip ratios:
To add different levels of noise to the data, a certain percentage of the companies'
statuses (y-values) are randomly flipped between steps 1 and 2 of the generation
procedure. We generated 3 batches of dummy data with 0.01%, 10% and 30% of the companies'
y-values flipped (the y-flip ratio). The higher the y-flip ratio, the higher the noise
level.
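The label-flipping step can be sketched as follows (an illustrative sketch; names are the author's own):

```python
import random

def flip_labels(labels, flip_ratio, seed=0):
    """Inject label noise: flip a `flip_ratio` fraction of the 0/1
    statuses, chosen uniformly at random without replacement, as done
    between steps 1 and 2 of the generation procedure."""
    rng = random.Random(seed)
    labels = list(labels)           # leave the caller's list untouched
    n_flip = round(flip_ratio * len(labels))
    for i in rng.sample(range(len(labels)), n_flip):
        labels[i] = 1 - labels[i]
    return labels
```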
3. Results
3.1 Apply CRF model on dummy data
Each generated dummy dataset was split into a training and a test set with 70%-30%
proportions. After learning the parameters on the training set, the learned model is
applied to the test set and its performance is evaluated.
As mentioned before, since accuracy, precision, recall and F1-score depend strongly on the
threshold used to classify the company status (in this thesis we always use 0.5 as the
threshold), we rely mainly on the AUC to evaluate the model performance.
Model performance on dummy data with different link strengths
Table 3.1 and Figure 3.1 show the CRF model performance on the dummy data with different
link strengths. Two types of models are applied: one with only the singleton factors
(S model) and one with both singleton and pairwise factors (S+P model). The S model is
simply a log-linear regression model, and its performance acts as the baseline.
Table 3.1 Model performance on dummy data with different link strength
Link strength Model AUC Accuracy Precision Recall F1-score
0.5 S 0.872 0.797 0.797 0.789 0.793
0.5 S+P 0.872 0.797 0.799 0.786 0.793
0.7 S 0.870 0.797 0.822 0.781 0.800
0.7 S+P 0.882 0.803 0.824 0.795 0.809
0.9 S 0.867 0.790 0.775 0.787 0.782
0.9 S+P 0.939 0.864 0.856 0.861 0.859
Figure 3.1 AUC performance on dummy data with different link strength
With increasing link strength, the AUC of the S models stays almost the same, as expected,
since the singleton factors do not consider the dependence between linked companies. At a
link strength of 0.5, the S and S+P models perform similarly, which is also expected: the
companies with different business statuses are linked at random, so the statuses of linked
companies are independent and the pairwise factors have no predictive power. At link
strengths of 0.7 and 0.9, the AUC of the S+P model clearly exceeds that of the S model,
and the improvement over the baseline is larger at 0.9 than at 0.7. This demonstrates the
effectiveness of our model and supports the motivation for applying the CRF in the
linked-company context: if the business statuses of companies linked by a shared board
member are dependent, the CRF model can outperform a model that does not consider the
link.
Model performance on dummy data with different class imbalance levels
Table 3.2 and Figure 3.2 show the model performance on dummy data with different class
imbalance levels. Recall that the lower the bias of the 1st coin toss, the fewer positive
(bankruptcy) cases are observed and the higher the class imbalance level.
Table 3.2 Model performance on dummy data with different first coin toss bias.
Link strength Model First coin toss bias AUC Accuracy Precision Recall F1-score
0.9 S 50% 0.867 0.790 0.775 0.787 0.782
0.9 S+P 50% 0.939 0.864 0.856 0.861 0.859
0.9 S 20% 0.849 0.824 0.662 0.679 0.670
0.9 S+P 20% 0.912 0.870 0.835 0.628 0.716
0.9 S 1% 0.823 0.911 0.474 0.381 0.422
0.9 S+P 1% 0.862 0.928 0.702 0.280 0.400
0.9 S 0.1% 0.804 0.910 0.643 0.257 0.367
0.9 S+P 0.1% 0.833 0.914 0.789 0.214 0.337
Figure 3.2 AUC performance on dummy data with different first coin toss bias
The performance of both the S and S+P models decreases on the unbalanced data, and the
AUC keeps dropping as the data become more unbalanced. However, because of the high link
strength, the S+P model still outperforms the S model.
Model performance on dummy data with different y-flip ratios (noise)
Table 3.3 and Figure 3.3 show the model performance on dummy data with different y-flip
ratios. The higher the y-flip ratio, the larger the noise. The performance of both the S
and S+P models decreases with increasing noise level; however, due to the high link
strength, the S+P model again outperforms the S model.
Table 3.3 Model performance on dummy data with the different y-flip ratio.
Link strength Factor y-flip ratio AUC Accuracy Precision Recall F1-score
0.9 S 0.01% 0.867 0.790 0.775 0.787 0.782
0.9 S+P 0.01% 0.939 0.864 0.856 0.861 0.859
0.9 S 10% 0.846 0.767 0.767 0.795 0.781
0.9 S+P 10% 0.920 0.843 0.819 0.898 0.857
0.9 S 30% 0.783 0.722 0.723 0.684 0.703
0.9 S+P 30% 0.862 0.778 0.765 0.774 0.771
Figure 3.3 AUC performance on dummy data with the different y-flip ratio
3.2 Apply CRF on real company dataset.
3.2.1 On original company data
We first consider the companies that have no missing values in the year 2013. This gives
us 28174 companies in total, of which 27946 are active and 228 (1.02%) bankrupt; from
these, 42178 linked company pairs are generated. The company pairs are split into training
and test sets with 70%-30% proportions. After training the parameters, the results on the
test set are shown in Table 3.4.
Table 3.4 Model performance on original company dataset (year 2013)
Factor AUC Accuracy Precision Recall F1-score
S 0.797 0.995 NaN 0 NaN
S+P 0.798 0.995 NaN 0 NaN
For the S model, the AUC is 0.797. The precision and F1-score are NaN because, at the
default decision threshold of 0.5, all companies are predicted as active. The S+P model
barely improved the AUC, which might be due to the high imbalance in the data: Table 3.5
shows the percentage of company pairs with each combination of linked statuses. There are
only 228 (1.02%) bankrupt companies in our dataset, so with the stochastic gradient
descent method there are too few bankruptcy cases to train the model effectively. A second
possible reason that the S+P model shows little improvement over the S model is the
oversimplification of the company connections: in chapter 2, we decided to model only the
linear-chain (specifically, doubly linked company) network topology. In other words, we
made a strong simplifying assumption about the company link structure, while about 20% of
the network structures do not have a linear-chain topology. Graphical models are
essentially graph-based representations of factorization assumptions about a distribution,
and these factorizations are typically equivalent to independence statements among the
variables. If the graph representation cannot represent, or at least closely approximate,
the true connections, the graphical model can hardly be expected to outperform the
baseline model.
Table 3.5 Percentage of 2-linked company status of the year 2013
A Survival A Bankruptcy
B survival 41830 (99.17 %) 159 (0.38%)
B bankruptcy 159 (0.38%) 30 (0.07%)
To address the imbalanced data problem, we next try to train the model on up-sampled data.
3.2.2 On up-sampled company data
We again consider the data from the year 2013. After splitting the data into training and
test sets, we up-sample the training set. We up-sample only the linked pairs that contain
at least one bankrupt company. The resulting proportions of the different linked statuses
are shown in Table 3.6.
Table 3.6 Percentage of 2-linked company status of the year 2013 (up-sampling data)
A Survival A Bankruptcy
B survival 29305 (50 %) 14985 (25.5%)
B bankruptcy 12150 (20.6%) 2430 (3.9%)
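The up-sampling idea can be sketched generically as follows (names are illustrative; as described above, the thesis up-samples only the training split, never the test set, and only the pairs containing at least one bankruptcy):

```python
import random

def upsample(rows, is_minority, factor, seed=0):
    """Up-sample by appending `factor` * len(minority) extra copies of
    minority rows, drawn with replacement.  `rows` would be the linked
    company pairs of the training set and `is_minority` would flag the
    pairs with at least one bankruptcy."""
    rng = random.Random(seed)
    minority = [r for r in rows if is_minority(r)]
    extra = [rng.choice(minority) for _ in range(factor * len(minority))]
    return rows + extra
```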
The trained model is then applied to the test set, with the performance shown in Table
3.7:
Table 3.7 Model performance on upsampling company dataset (year 2013)
Factor AUC Accuracy Precision Recall F1-score
S 0.836 0.569 0.009 0.912 0.017
S+P 0.84 0.573 0.0086 0.913 0.017
Compared with the model trained on the original data, the AUC improves considerably; in
addition, the S+P model shows a small improvement over the S model. There is, however, an
issue with training and testing the model on data from the same year: some companies
appear in both the training and the test set, in other words, there is data leakage. The
model learned on the 2013 data was therefore also applied to the data of the other years;
Table 3.8 shows the performance:
Table 3.8 Test results on the data of other years for the model trained on the year 2013 data.
Factor AUC 2010 AUC 2011 AUC 2012 AUC 2014 AUC 2015 AUC 2016
S 0.6871 0.7076 0.6137 0.600 0.6968 0.6316
S+P 0.6739 0.7034 0.6077 0.593 0.6888 0.6176
Unfortunately, this time the S+P model underperforms the baseline model. Table 3.9 shows
the percentages of 2-linked company statuses for the year 2014: among the linked pairs in
which one company is bankrupt, the probability that the other is also bankrupt decreases
to 3.33% (10/(145+145+10)). This means that the model trained on the 2013 data does not
really apply to 2014, due to the change in link strength.
Table 3.9 Percentage of 2-linked company status of the year 2014
A Survival A Bankruptcy
B survival 41130 (99.28 %) 145 (0.35%)
B bankruptcy 145 (0.35%) 10 (0.024%)
Table 3.10 shows the odds of bankruptcy-bankruptcy company pairs among the links with at
least one bankruptcy; the odds change from year to year. Since these odds can be viewed as
the "link strength", or dependency, between the company pairs, and there is no stable
link-dependency pattern, the parameters of the pairwise-factor features trained on the
data of one year cannot be applied to another year with a different link strength.
Table 3.10 Odds of bankruptcy-bankruptcy among links with at least one bankruptcy
2010 2011 2012 2013 2014 2015 2016
8.6% 10.2% 10.3% 8.9% 3.3% 2.34% 7.9%
Table 3.11 shows the performance of logistic regression (LR) and random forest (RF) on
the original company datasets of the different years. Compared with Table 3.8, both RF and
LR outperform the CRF model. Overall, RF performs better than LR.
Table 3.11 Benchmark Model Performance on each year
2010 2011 2012 2013 2014 2015 2016
AUC of LR 0.72 0.71 0.65 0.71 0.69 0.59 0.51
AUC of RF 0.74 0.76 0.71 0.77 0.82 0.77 0.68
4. Conclusions and future work
4.1 Conclusion
The aim of this thesis is to predict whether a Belgian company will go bankrupt in the
next financial year (a classification problem). Traditional business failure prediction
models include only financial ratios as predictors and treat companies as isolated
entities: they ignore the linkage between companies and do not consider any influence the
different companies exert on each other. In reality, all businesses form an ecosystem, and
companies may be linked to each other through shared board members. Board networks are
particularly interesting because information is transferred from company to company
through these linked boards, and there might be a relationship between a board's
connections and the financial status of the linked companies. In this thesis, we explore
the added value of taking the board network into account for bankruptcy prediction and
evaluate the performance of Conditional Random Fields (CRF) in business failure
prediction.
Specifically, we designed, implemented and trained a CRF model to predict the business
failure of companies with shared board members. A linear-chain Markov network was used as
the graph representation of the CRF. The clique tree message passing algorithm was used to
build an exact inference engine, and the stochastic gradient descent algorithm was applied
to learn the parameters from the data. Belgian SME data from 2010 to 2016 were used to
train and evaluate the model, and the performance of the CRF was compared with several
benchmark models.
Two types of CRF model were built: one with only the singleton factors (S model), which
acts as the baseline, and one with both singleton and pairwise factors (S+P model).
The model was first trained and evaluated on the generated dummy datasets to study its
performance under different circumstances. On the dummy data, the model worked as
expected: with increasing company link strength, the performance of the S+P model
increased while that of the S model stayed the same, and for link strengths above 0.5 the
S+P model outperformed the S model. These results confirm the correctness of the
implementation and support the motivation for applying the CRF in the linked-company
context. It was also shown that both class imbalance and label noise decrease the model's
performance.
The CRF model was then trained and evaluated on the real company dataset. Both the S and
S+P models show some predictive power (AUC ~0.8). However, compared with the baseline
model, the S+P model shows little improvement, for which we suspect two reasons. The first
might be the highly unbalanced dataset: on the up-sampled dataset for the year 2013, the
S+P model does show some (although still not very significant) improvement over the S
model, with the AUC rising from 0.836 to 0.84. The second could be the oversimplification
of the true company connections by the linear-chain graph representation, since about 20%
of the network structures in the dataset do not have a linear-chain topology.
The model trained on the up-sampled dataset of the year 2013 was then applied to the data
of the other years. Unfortunately, this time the S+P model even underperformed the
baseline model, which shows the limited generalizability of the model. The odds of
bankruptcy-bankruptcy company pairs turn out not to be constant but to change considerably
from year to year. Since there is no stable link-dependency "pattern", a model trained on
the data of one year cannot predict the bankruptcies of another year with a different link
strength.
In conclusion, the linear-chain CRF model we built shows poor performance in predicting
the business failure of companies with shared board members, and with the dataset used in
this thesis, taking the board network into account adds only limited value to bankruptcy
prediction.
4.2 Limitation and future work
1. This study only considers the linear-chain company structure. Many other network
topologies exist, such as ring, star and densely connected structures. It would be
interesting to evaluate the CRF model on these topologies as well.
2. The exact belief propagation inference algorithm used in this thesis applies only to
tree topologies and does not work for complex topologies with cyclic network structures;
in such cases, approximate inference algorithms can be applied.
3. As a learning exercise, this thesis did not use any external CRF package but
implemented the full model in MATLAB, without much regard for efficiency; as a result,
running the model takes a long time. To increase efficiency, the model would need to be
carefully re-implemented in a language such as C/C++.
References
[Muller et al., 2015] Muller, P., Caliandro, C., Peycheva, V., Gagliardi, D., Marzocchi, C.,
Ramlogan, R., and Cox, D. (2015). Annual report on European SMEs. Performance review.
The European Commission Publication Office.
[Commission et al., 2003] European Commission (2003). Commission recommendation
of 6 May 2003 concerning the definition of micro, small and medium-sized enterprises.
Official Journal of the European Union, 46:36-41.
[Tkac and Verner, 2016] Tkac, M. and Verner, R. (2016). Artificial neural networks in
business: Two decades of research. Applied Soft Computing, 38:788-804.
[Baysinger and Butler, 1985] Baysinger, B. D. and Butler, H. N. (1985). Corporate
governance and the board of directors: Performance effects of changes in board
composition. Journal of Law, Economics, & Organization, 1(1):101-124.
[Tobback et al., 2016] Tobback, E., Moeyersoms, J., Stankova, M., Martens, D., et al.
(2016). Bankruptcy prediction for SMEs using relational data. Technical report.
[Sutton et al., 2012] Sutton, C., McCallum, A., et al. (2012). An introduction to conditional
random fields. Foundations and Trends in Machine Learning, 4(4):267-373.
[Van Damme et al., 2017] Domien, Van Damme, et al. (2017) Conditional Random Fields
For Bankruptcy Prediction.
[Dudoit et al., 2002] Dudoit, S., Fridlyand, J., and Speed, T. P. (2002). Comparison of
discrimination methods for the classification of tumors using gene expression data. Journal
of the American statistical association, 97(457):77-87.
[Koller and Friedman, 2009] Koller, D. and Friedman, N. (2009). Probabilistic graphical
models: principles and techniques. MIT press.
[Beaver, 1966] Beaver, W. H. (1966). Financial ratios as predictors of failure. Journal of
accounting research, pages 71-111.
[Altman, 1968] Altman, E. I. (1968). Financial ratios, discriminant analysis and the
prediction of corporate bankruptcy. The journal of finance, 23(4):589-609.
[Theil and Theil, 1971] Theil, H. and Theil, H. (1971). Principles of econometrics.
Technical report.
[Hung et al., 1998] Hung, H. et al. (1998). A typology of the theories of the roles of
governing boards. Corporate governance, 6(2):101-111.
[Doumpos and Zopounidis, 1999] Doumpos, M. and Zopounidis, C. (1999). A multicriteria
discrimination method for the prediction of financial distress: The case of Greece.
Multinational Finance Journal, 3(2):71.
[Huysmans et al., 2006] Huysmans, J., Baesens, B., Vanthienen, J., and Van Gestel, T.
(2006). Failure prediction with self-organizing maps. Expert Systems with Applications,
30(3):479-487.
[Min et al., 2006] Min, S.-H., Lee, J., and Han, I. (2006). Hybrid genetic algorithms and
support vector machines for bankruptcy prediction. Expert systems with applications,
31(3):652-660.
[Ravi et al., 2008] Ravi, V., Kurniawan, H., Thai, P. N. K., and Kumar, P. R. (2008). Soft
computing system for bank performance prediction. Applied soft computing, 8(1):305-
315.
[Chen et al., 2009] Chen, H.-J., Huang, S.-Y., and Kuo, C.-L. (2009). Using the artificial
neural network to predict fraud litigation: Some empirical evidence from emerging
markets. Expert Systems with Applications, 36(2):1478-1484.
[Lin et al., 2012] Lin, W.-Y., Hu, Y.-H., and Tsai, C.-F. (2012). Machine learning in
financial crisis prediction: a survey. IEEE Transactions on Systems, Man, and Cybernetics,
Part C (Applications and Reviews), 42(4):421-436.
Appendix A: Dataset description
Table A.0.1 Raw data structure
BvD ID Number | DM UCI (Unique Contact Identifier) | Status | Status date | Variables 2010 | Variables 2011 | Variables 2012 | Variables 2013 | Variables 2014 | Variables 2015 | Variables 2016
Company 1 | Board 1 ID number | Status 1 | Status 1 date | Variables company 1 (one column per year, 2010-2016)
Company 1 | Board 2 ID number | Status 2 | Status 2 date | Variables company 1 (one column per year)
Company 1 | NA | Status 3 | Status 3 date | Variables company 1 (one column per year)
Company 2 | Board 1 ID number | Status 1 | Status 1 date | Variables company 2 (one column per year)
Company 2 | Board 2 ID number | Status 2 | Status 2 date | Variables company 2 (one column per year)
Company 2 | Board 3 ID number | NA | NA | Variables company 2 (one column per year)
Company 2 | Board 4 ID number | NA | NA | Variables company 2 (one column per year)
Table A.0.2 Basetable structure in year 2010
BvD ID Number Predictors Status
Company1 Predictors 2010 Status 2010
Company2 Predictors 2010 Status 2010
Company3 Predictors 2010 Status 2010
Company4 Predictors 2010 Status 2010
Table A.0.3 Definition of variables in the raw data.

Variable | Definition | Treatment
ROCE using P/L before tax | (Profit (Loss) before tax + Interest paid) / (Shareholders funds + Non-current liabilities) * 100 | Convert to binary
ROCE using Net income | (Net income for period + Interest paid) / (Shareholders funds + Non-current liabilities) * 100 | Convert to binary
Profit per employee (th USD) | Profit before tax / Employees | Convert to binary
Profit margin | (Profit before tax / Operating revenue) * 100 | Convert to binary
EBITDA margin | (EBITDA / Operating revenue) * 100 | Convert to binary
EBIT margin | (EBIT / Operating revenue) * 100 | Convert to binary
Cash flow / Operating revenue | (Cash flow / Operating revenue) * 100 | Convert to binary
Net assets turnover | Operating revenue / (Shareholders funds + Non-current liabilities) | Convert to binary
Interest cover | Operating profit / Interest paid | Convert to binary
Stock turnover | Operating revenue / Stocks | Convert to binary
Collection period (days) | (Debtors / Operating revenue) * 360 | Convert to binary
Credit period (days) | (Creditors / Operating revenue) * 360 | Convert to binary
Shareholders liquidity ratio | Shareholders funds / Non-current liabilities | Convert to binary
Solvency ratio (liability based) | (Shareholders funds / (Non-current liabilities + Current liabilities)) * 100 | Convert to binary
Operating revenue per employee (th USD) | Operating revenue / Employees | Convert to binary
Costs of employees / Operating revenue | (Costs of employees / Operating revenue) * 100 | Convert to binary
Shareholders funds per employee (th USD) | Shareholders funds / Employees | Convert to binary
Average cost of employee (th USD) | Costs of employees / Employees | Convert to binary
ROE using P/L before tax | (Profit (Loss) before tax / Shareholders funds) * 100 | Standardization
ROE using Net income | (Net income for period / Shareholders funds) * 100 | Standardization
ROA using P/L before tax | (Profit (Loss) before tax / Total assets) * 100 | Standardization
ROA using Net income | (Net income for period / Total assets) * 100 | Standardization
Current ratio | Current assets / Current liabilities | Rescale
Liquidity ratio | (Current assets - Stocks) / Current liabilities | Rescale
Solvency ratio (asset based) | (Shareholders funds / Total assets) * 100 | Standardization
Gearing | ((Non-current liabilities + Loans) / Shareholders funds) * 100 | Rescale
Category of the company | Medium-sized or small company | Convert to binary
NACE Rev. 2 main section | Industry classification code; see Table A.0.5 for the meaning of each code | Convert to category
Date of incorporation | Date of formation of the company | Convert to age, then rescale
BvD ID number | Unique identifier of the company | Keep
DM UCI (Unique Contact Identifier) | Unique identifier of a board member of the company | Used in network construction
Company name | Name of the company | Deleted
Status | Legal status of the company; see Table A.0.6 for all available statuses | Convert to binary
Status date | Date of the status update | Deleted
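The three treatments in the rightmost column can be sketched in Python as follows. This is a minimal illustration only: the zero threshold used for binarization is an assumption for the sake of the example, not a choice documented in the thesis.

```python
# Sketch of the three variable treatments listed in Table A.0.3.
# The binarization threshold (0.0) is an illustrative assumption.
from statistics import mean, stdev

def to_binary(values, threshold=0.0):
    """Convert a ratio to binary: 1 if above the (assumed) threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]

def standardize(values):
    """Z-score standardization: subtract the mean, divide by the std deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def rescale(values):
    """Min-max rescaling to the [0, 1] interval."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]
```

For example, a profit margin series of [-1.0, 2.0] becomes [0, 1] after binarization, while a current ratio series of [0.0, 5.0, 10.0] rescales to [0.0, 0.5, 1.0].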
Table A.0.4 List of predictors

Variable | Meaning
ROE using P/L before tax | (Profit (Loss) before tax / Shareholders funds) * 100
ROE using Net income | (Net income for period / Shareholders funds) * 100
ROA using P/L before tax | (Profit (Loss) before tax / Total assets) * 100
ROA using Net income | (Net income for period / Total assets) * 100
Current ratio | Current assets / Current liabilities
Liquidity ratio | (Current assets - Stocks) / Current liabilities
Solvency ratio (asset based) | (Shareholders funds / Total assets) * 100
Gearing | ((Non-current liabilities + Loans) / Shareholders funds) * 100
Age | Current year - year of incorporation
NACE Rev. 2 main section | 16-level discrete variable indicating the industry classification
Category of the company | Binary variable: 1 if the company is medium-sized, 0 if small
NA_dummy_ROCE using P/L before tax | Binary variable indicating whether the corresponding variable is NA
NA_dummy_ROCE using Net income | Binary variable indicating whether the corresponding variable is NA
NA_dummy_Profit per employee th USD | Binary variable indicating whether the corresponding variable is NA
NA_dummy_Profit margin | Binary variable indicating whether the corresponding variable is NA
NA_dummy_EBITDA margin | Binary variable indicating whether the corresponding variable is NA
NA_dummy_EBIT margin | Binary variable indicating whether the corresponding variable is NA
NA_dummy_Cash flow / Operating revenue | Binary variable indicating whether the corresponding variable is NA
NA_dummy_Net assets turnover | Binary variable indicating whether the corresponding variable is NA
NA_dummy_Interest cover | Binary variable indicating whether the corresponding variable is NA
NA_dummy_Stock turnover | Binary variable indicating whether the corresponding variable is NA
NA_dummy_Collection period days | Binary variable indicating whether the corresponding variable is NA
NA_dummy_Credit period days | Binary variable indicating whether the corresponding variable is NA
NA_dummy_Shareholders liquidity ratio | Binary variable indicating whether the corresponding variable is NA
NA_dummy_Solvency ratio (Liability based) | Binary variable indicating whether the corresponding variable is NA
NA_dummy_Operating revenue per employee th USD | Binary variable indicating whether the corresponding variable is NA
NA_dummy_Costs of employees / Operating revenue | Binary variable indicating whether the corresponding variable is NA
NA_dummy_Shareholders funds per employee th USD | Binary variable indicating whether the corresponding variable is NA
NA_dummy_Average cost of employee th USD | Binary variable indicating whether the corresponding variable is NA
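The NA-dummy construction above can be sketched as follows: for each ratio that may be missing, a companion indicator is added that equals 1 when the value is NA. The function name and the two variable names in the usage example are illustrative, not taken from the thesis code.

```python
# Sketch of the NA-dummy construction behind the NA_dummy_* predictors.
# Missing values are represented here as None; the helper name is illustrative.
def add_na_dummies(record, variables):
    """Return a copy of `record` with an NA_dummy_<var> flag for each variable."""
    out = dict(record)
    for var in variables:
        out["NA_dummy_" + var] = 1 if record.get(var) is None else 0
    return out

row = {"Profit margin": None, "EBIT margin": 4.2}
encoded = add_na_dummies(row, ["Profit margin", "EBIT margin"])
# encoded now also contains "NA_dummy_Profit margin" = 1
# and "NA_dummy_EBIT margin" = 0
```

Keeping the dummy alongside the (binarized) ratio lets the model use missingness itself as a signal, which matters here because financially distressed companies often stop reporting some figures.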
Table A.0.5 Codes representing the different industry classifications

A | Agriculture, forestry and fishing
B | Mining and quarrying
C | Manufacturing
D | Electricity, gas, steam and air conditioning supply
E | Water supply; sewerage, waste management and remediation activities
F | Construction
G | Wholesale and retail trade; repair of motor vehicles and motorcycles
H | Transportation and storage
I | Accommodation and food service activities
J | Information and communication
M | Professional, scientific and technical activities
N | Administrative and support service activities
R | Arts, entertainment and recreation
S | Other service activities
T | Activities of households as employers
U | Activities of extraterritorial organizations and bodies
Table A.0.6 Overview of the possible statuses

Status | Meaning | Model value
Active | The company is active | 0
Active (insolvency proceedings) | The debtor is unable to pay its debts | Removed
Active (rescue plan) | Business rescue plan: proceedings to facilitate the rehabilitation of a company that is financially distressed | Removed
Bankruptcy | Legally declared inability of a company to pay its creditors; the company no longer exists because it has ceased its activities while in the process of bankruptcy | 1
Dissolved (bankruptcy) | The company no longer exists as a legal entity because it has ceased its activities while in the process of bankruptcy | 1
Dissolved (demerger) | The company no longer exists as a legal entity because it has been split in a demerger | Removed
Dissolved (liquidation) | The company no longer exists because it has ceased its activities while in the process of liquidation | Removed
Dissolved (merger or take-over) | The company no longer exists as a legal entity because it has been included in a merger | Removed
In liquidation | The company is in the process of liquidation | Removed
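The status-to-target mapping in Table A.0.6 can be sketched as follows: statuses with model value 0 or 1 define the binary target, and all other statuses are dropped from the sample. The dictionary and function name are illustrative; the status strings follow the table.

```python
# Sketch of the label construction from Table A.0.6: 0 = active,
# 1 = bankrupt; statuses marked "Removed" are excluded from the sample.
STATUS_TO_LABEL = {
    "Active": 0,
    "Bankruptcy": 1,
    "Dissolved (bankruptcy)": 1,
}

def label_companies(statuses):
    """Keep only companies whose status maps to a model value (0 or 1)."""
    return [STATUS_TO_LABEL[s] for s in statuses if s in STATUS_TO_LABEL]
```

For example, the statuses ["Active", "In liquidation", "Bankruptcy"] yield the labels [0, 1], with the "In liquidation" company removed.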