Post on 04-Jul-2020
CDS Project Report
Title: Credit risk assessment of mid-size corporates for default prediction
Dated: 5 December 2016
Submitted by:
Group 11 -> Abhilash Fulkar 16BM6JP01
Apoorv Agrawal 16BM6JP09
Dhrubajyoti Dutta 16BM6JP15
Neha Singh Lohchubh 16BM6JP30
As part of the ‘Computing in Data Sciences’ course in the ISI semester of PGDBA, we were required to do a
project that combined everything we had learned during the semester and applied it to a real-world
business problem. Pondering over the different business problems around us, we were particularly
intrigued by the common yet complex issue of managing credit risk in corporate financial transactions.
Credit risk, also known as counterparty risk or default risk, is a measure of the likelihood that a
borrower or counterparty will fail to meet its financial obligations and default on loans, bonds or
leases at the time of repayment, thereby causing large losses for the lender. Credit risk has
always been of paramount importance to lenders, from both financial and non-financial
institutions, who want to limit their exposure in adverse situations. However, in the aftermath of the 2008
recession, effective assessment and management of credit risk has gained unparalleled
relevance, making it an interesting field to explore. That said, the premise of credit risk assessment
differs with the type of transaction. For this project, we deal with the field of supply
chain financing and evaluate the creditworthiness of the mid-size corporates involved in the transactions.
Supply chain financing (SCF) is a business and financing operation that brings together buyers (mid-size
corporates), their suppliers (small to medium scale enterprises) and a financing institution. Under SCF, the
financing institution provides short-term credit to the suppliers on behalf of the corporates against
approved payables. This enables the suppliers to maintain the cash flow they need to run their
business smoothly and complete orders. It also gives
corporates the flexibility to make payments as per their business cycle.
Fig 1. Working of Supply Chain financing
However, consider the scenario in which, due to financial stress, the corporate is unable to make the required
payments to the financing institution, leading to a default. In such a case, the
financing institution suffers huge losses, which underlines the need for adequate credit risk assessment of the
corporates before entering a transaction with them.
Fig 2. Default scenario in Supply Chain financing
To minimize their risk, several financing institutions extend collateralized credit to the suppliers. But
because these enterprises are small or medium-sized, they have a hard time arranging the required
collateral, which often leaves them short of liquidity. To address such situations, some financing institutions
offer short-term non-collateralized loans. One such budding financing institution is MUSK
technologies. It is a 9-month-old startup working in the field of SCF with an aim to simplify it and extend
its reach by providing non-collateralized loans to SMEs.
The credit risk in the above scenario stems from a complex interaction of many
underlying factors. Macro-economic developments in the corporate sector, together with short-term and
long-term business practices, predominantly determine the creditworthiness of a mid-size corporate. In addition,
the risk arising from collusion between supplier and corporate, and from forged transactions and
invoices, also has to be accounted for when deciding whether or not to extend credit to the
supplier. Factoring in the contribution of so many different factors makes credit risk assessment of
corporates a challenging task. This is primarily the business problem that MUSK technologies
faces. For our project, we decided to collaborate with them to partially solve this problem by modelling
the credit risk as a function of the past and projected financial health of the corporate in question.
Fig 3. Model development Process
Problem Analysis and Data Collection:
Since MUSK technologies is a relatively new firm, we did not have a baseline from which to analyze the indicators
that affect the creditworthiness of a corporate. Nor did we have any current transaction history
of the corporates that could signal the trend of their present growth and economic health. In addition,
as most of their loans are yet to mature, we did not have reference data that we could use to
train or test our model. Given these limitations, we were faced with the mammoth task of
identifying auxiliary indicators that could help us estimate the creditworthiness of the corporates with
reasonable accuracy.
Upon research, we found that, like financial instruments (such as bonds and derivatives),
companies also have ratings indicative of their creditworthiness issued by neutral third parties. In the Indian
context, agencies like CRISIL, ONICRA and ICRA offer these. However, like other financial
information, these ratings are proprietary, and hence ratings for very few companies are available in the public
domain. We decided to collect whatever information we could gather on the companies and their ratings
from these websites by means of web scraping. For this purpose, we used Python packages such as Selenium
and Beautiful Soup. We managed to collectively scrape data on 12000 traded and non-traded companies.
The data was parsed into the schema Company | Sector | Credit Type | Rating and stored in a CSV file. To
augment this credit rating data, we decided to try another measure by leveraging the relation between
spreads (on bond yields and credit default swaps for a corporate) and the credit risk of the corporate.
The bond yield spread or credit default swap spread is inversely related to the market's assessment of
the creditworthiness of the corporate: a high spread indicates that the market has low
confidence in the corporate's capability to repay its obligations. Unfortunately, we could not find any
public source that could provide us with this information in a meaningful way.
Stages shown in Fig 3: Problem Analysis, Data Collection, Data Cleaning, Data Analysis, Model Fitting, Model Evaluation
Fig 4. Collection of credit ratings
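The storage step described above (Company | Sector | Credit Type | Rating written to a CSV file) can be
sketched roughly as follows. This is a minimal illustration that assumes the scraped records are already
available as Python dictionaries; the function name write_ratings and the sample rows are hypothetical, and
only the four-column schema comes from the report.

```python
import csv
import io

FIELDS = ["Company", "Sector", "Credit Type", "Rating"]

def write_ratings(records, handle):
    """Write rating records to an open text handle using the report's four-column schema."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for record in records:
        # Strip stray whitespace that often survives HTML parsing.
        writer.writerow({field: str(record.get(field, "")).strip() for field in FIELDS})

# Placeholder rows for illustration only, not real scraped data.
sample = [
    {"Company": " Example Textiles Ltd ", "Sector": "Textiles",
     "Credit Type": "Long Term", "Rating": "BBB+"},
    {"Company": "Example Steel Ltd", "Sector": "Metals",
     "Credit Type": "Short Term", "Rating": "A2"},
]

buffer = io.StringIO()
write_ratings(sample, buffer)
csv_text = buffer.getvalue()
```

Writing to an in-memory buffer keeps the sketch self-contained; in practice the handle would be a file opened
with newline="" as the csv module recommends.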
The next step was to figure out how to use these ratings to assess the credit risk of companies. We wanted
some observable indicators that could be related to this data and thus help us construct a model around
them. In our quest, we came across the PD model, which is a very simple way to determine credit risk.
It works by applying logistic regression to certain financial ratios, treating them as explanatory variables.
The output is a quantity between 0 and 1, termed the probability of default. As the name suggests, the higher
the value, the higher the chance that the corporate in question will default. We decided to combine
the financial data used by the PD model with our credit ratings, treating the ratings as the response variable.
Once we have the predicted credit ratings for the test data, we would map them to the respective probabilities of
default as per the standard conversion given by global rating firms like S&P and Moody’s.
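To make the PD model concrete, the logistic transformation it relies on can be sketched as below. The
coefficient values, the intercept and the two ratio names are hypothetical and chosen only so that the
example runs; they are not estimates from our data.

```python
import math

def probability_of_default(ratios, coefficients, intercept):
    """Logistic model: PD = 1 / (1 + exp(-(intercept + sum of coefficient * ratio)))."""
    score = intercept + sum(coefficients[name] * value for name, value in ratios.items())
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical coefficients, for illustration only.
coefficients = {"total_debt_equity": 0.8, "current_ratio": -0.5}
intercept = -2.0

healthy = probability_of_default(
    {"total_debt_equity": 0.5, "current_ratio": 2.0}, coefficients, intercept)
stressed = probability_of_default(
    {"total_debt_equity": 3.0, "current_ratio": 0.6}, coefficients, intercept)
```

With these made-up numbers the highly leveraged, illiquid company receives the higher probability of
default, which is the qualitative behaviour the PD model is meant to capture.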
To complete our dataset, we had to obtain financial information, on different financial parameters, for the
companies whose ratings we had. For this purpose, we scraped data off MoneyControl. We had a
major surprise when we could find financial information for just 380 companies on this website. This
was because MoneyControl has information on publicly traded companies only. We tried other
sources, such as the Ministry of Corporate Affairs, for financial information on non-public companies, but
the efforts proved futile. Thus, we decided to go ahead with our dataset of just 380 companies.
Data Cleaning:
The dataset derived in the previous step was merged from several sources without any processing done
on it while scraping. Consequently, it suffered from issues like missing values and inconsistency.
To address the missing values, we used the missmap function (Amelia package) to visualize
their density. The plot revealed that missing values were concentrated in a few columns, which were further found
to be redundant with respect to the information they stored. Hence, we decided to remove these
columns. Apart from this, as the companies varied in size and annual turnover, their
values on different financial parameters showed correspondingly high dispersion. Thus, before we could use
this information, we had to normalize it to reveal meaningful trends. To do so, we adopted the standard
normalization technique of transforming each value x as:
$$z = \frac{x - \mu}{\sigma}$$
where: μ : Mean value of x’s
σ : Standard deviation of x’s
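A minimal sketch of this normalization for a single column is given below. The input numbers are
placeholders, and the choice of the population standard deviation is an assumption (R's scale function, for
instance, uses the sample standard deviation instead).

```python
import statistics

def standardize(values):
    """Return the z-scores (x - mean) / sd for one column of values."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population SD; statistics.stdev gives the sample SD
    if sigma == 0:
        raise ValueError("column is constant; z-scores are undefined")
    return [(x - mu) / sigma for x in values]

# Placeholder numbers for illustration only.
enterprise_value = [120.0, 450.0, 80.0, 2300.0, 310.0]
z = standardize(enterprise_value)
```

After the transformation every column has mean 0 and standard deviation 1, so large and small companies
become comparable on the same scale.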
In addition, we observed that the credit rating, which is a factor variable, was represented inconsistently
because it was sourced from different agencies: similar information appeared in different forms. It also had too
many levels, indicating excessive granularity. As we did not have much data, we could not effectively
differentiate between adjacent levels. Therefore, we decided to compact the ratings into 8 primary levels
that sufficiently represent different degrees of credit risk (1 being bad and 8 being good).
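The compaction of ratings can be illustrated with the sketch below. The report does not list the 8 levels
actually used, so the mapping here is an assumption based on the usual long-term letter grades (AAA down to
D); the function name collapse_rating is also hypothetical, and agency-specific formats such as bracketed
prefixes are not handled.

```python
# Assumed mapping (not taken from the report): 1 = worst, 8 = best.
LEVELS = {"D": 1, "C": 2, "B": 3, "BB": 4, "BBB": 5, "A": 6, "AA": 7, "AAA": 8}

def collapse_rating(raw):
    """Map a rating string such as 'CRISIL BBB+' to a level between 1 and 8."""
    token = raw.strip().upper().split()[-1]   # drop a space-separated agency prefix, if any
    grade = token.rstrip("+-")                # drop '+' / '-' modifiers
    if grade not in LEVELS:
        raise ValueError(f"unrecognised rating: {raw!r}")
    return LEVELS[grade]
```

For example, under this assumed mapping "CRISIL BBB+" and "BBB-" both collapse to level 5.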
After data cleaning, we were left with 36 financial parameters to be used as explanatory variables for
prediction. (Refer Appendix I for list of parameters)
Data Analysis:
After ensuring consistency of the data, we began a preliminary analysis in the hope of
revealing relevant insights or consequential irregularities. Careful inspection unearthed outliers in quite a
few columns, which can bias the regression coefficients. For example, given below are two such
columns (Enterprise Value & Inventory Turnover Ratio) with outliers:
Fig 5. Summary statistics of two columns with outliers
Since we already had very few rows, we could not simply remove all the rows with outliers. Thus began
an iterative process of judging the significance of the columns with outliers in the prediction of credit ratings.
For cases where we found that the effect of a column on the prediction was trivial, we simply ignored the outliers,
whereas for the columns that contributed significantly to the prediction, we removed the outlier rows.
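The report does not state the rule used to flag a value as an outlier. One common convention, assumed here
purely for illustration, is the 1.5 × IQR rule; the data values below are placeholders.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Placeholder turnover ratios for illustration only; 95 is a planted outlier.
inventory_turnover = [10, 12, 11, 13, 12, 11, 95]
flagged = iqr_outliers(inventory_turnover)
```

Rows flagged this way would then be dropped only for columns that matter to the prediction, in line with
the iterative process described above.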
Moving ahead, we decided to reduce the number of variables in our dataset by studying the relationships
between them both intuitively and analytically. We used the cor function in R along with different
plots and our knowledge of various financial ratios to optimize our variable set. For example,
Current ratio, Quick ratio and Cash ratio are all liquidity ratios and measure the
same quantity, i.e. the liquidity of a company. Thus, they carry redundant information, and having them
all is an overhead we can easily do away with. Further, we assumed that if a variable is strongly correlated
with several other variables, then it must be significant and the other variables depend on it. Thus, we
retained this variable and eliminated the rest, such that no unique information is lost.
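The heuristic described above (keep the variable that is strongly correlated with several others and drop
the variables it is correlated with) can be sketched as follows. The 0.9 threshold, the function names and the
toy columns are assumptions made for illustration; the actual selection also relied on plots and domain
judgement.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def prune_correlated(columns, threshold=0.9):
    """Keep the most-connected variables and drop those strongly correlated with them."""
    names = list(columns)
    strong = {a: {b for b in names
                  if b != a and abs(pearson(columns[a], columns[b])) >= threshold}
              for a in names}
    kept, dropped = [], set()
    # Visit variables with the most strong correlations first (ties keep input order).
    for name in sorted(names, key=lambda n: len(strong[n]), reverse=True):
        if name not in dropped:
            kept.append(name)
            dropped |= strong[name]
    return kept

# Toy columns for illustration only: three liquidity ratios and one leverage ratio.
columns = {
    "current_ratio": [1.0, 1.5, 2.0, 2.5, 3.0],
    "quick_ratio": [0.8, 1.2, 1.7, 2.1, 2.6],
    "cash_ratio": [0.3, 0.5, 0.7, 0.9, 1.2],
    "total_debt_equity": [2.0, 0.5, 1.8, 0.4, 1.1],
}
selected = prune_correlated(columns)
```

In this toy example the three liquidity ratios move together, so only one of them survives alongside the
leverage ratio.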
The above procedure yielded a final set of 15 financial parameters that were used subsequently to fit
different models. (Refer Appendix II for list of reduced variable set)
Fig 6. Correlation plot of the reduced variable set
With this, our preliminary exploratory analysis of the data was complete. Before fitting different
statistical models to our data, we partitioned it in an 80:20 ratio to form the training and test sets.
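A minimal sketch of such an 80:20 partition is shown below. The fixed seed and the function name are
assumptions added for reproducibility of the example; the row identifiers simply stand in for the 380
companies.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(380))            # stand-in identifiers for the 380 companies
train, test = split_train_test(rows)
```

With 380 rows this gives 304 training rows and 76 test rows.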
Model Fitting, Motivation and Evaluation:
Conscious of the fact that we had very few observations to train our model and a large set of
correlated parameters, we decided to favor simple models over complex ones. A major issue looming over
the modelling process with such a small dataset is overfitting. Thus, a major challenge for us was to fit
a model to our dataset while ensuring that it is not an overfit. In such a scenario, the ordinary least squares
method for linear regression is expected to be inadequate for both interpretation and prediction.
Hence, we decided to go for regularization techniques in the form of Ridge and Lasso regression, which
not only minimize the residual sum of squares but also penalize large coefficients. The penalty
accepts a small increase in bias in exchange for a reduction in variance, thereby providing better accuracy at
reduced complexity. Between Ridge and Lasso, we chose Lasso regression as it also performs feature selection by
subjecting the regression coefficients to an L1 penalty, as opposed to the L2 penalty of Ridge regression.
Thus, coefficients for less prominent features are reduced to zero, resulting in a smaller variable set.
However, upon further research, we observed that Lasso regression tends to select only one feature from
a group of features with high pairwise correlations, without any consideration of which
feature is being selected[1]. As a result, it lacks the ability to reveal grouping information.
To overcome this drawback, we decided to implement another regularization technique called
Elastic Net, which is somewhat similar to Lasso but can select groups of correlated variables. Its penalty is a
convex combination of the Lasso and Ridge penalty terms, in which
the tuning parameter α balances the proportion of Lasso and Ridge penalty desired[1].
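One common way to write the Elastic Net optimization problem is given below. It follows the
parameterization used by the glmnet package, which is an assumption here; reference [1] states the same idea
in a constrained form with a differently scaled α.

```latex
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta}\;
  \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
  + \lambda\Bigl[\alpha\,\lVert\beta\rVert_1 + \tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\Bigr]
```

In this form λ ≥ 0 sets the overall strength of the penalty, α = 1 gives the Lasso and α = 0 gives Ridge
regression.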
Fig 7. Code snippet for implementation of Elastic net
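As a rough illustration of how an Elastic Net fit works internally, the sketch below implements coordinate
descent for the glmnet-style objective (1/2n)·RSS + λ[α‖β‖₁ + (1−α)/2·‖β‖₂²]. It is not the code used in the
project: the function names, the toy data and the assumption that all columns and the response are already
centred (so that no intercept is needed) are ours.

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator used for the L1 part of the penalty."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def elastic_net(X, y, lam, alpha, n_iter=200):
    """Coordinate descent for (1/2n)*RSS + lam*(alpha*|b|_1 + (1-alpha)/2*|b|_2^2).

    X is a list of rows; its columns and y are assumed centred, so no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlate column j with the residual of the fit that leaves column j out.
            rho = 0.0
            for i in range(n):
                fit_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_without_j)
            rho /= n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam * alpha) / (scale + lam * (1 - alpha))
    return beta

# Toy centred data for illustration only: y depends on the first column alone.
X = [[-2.0, 1.0], [-1.0, -1.0], [0.0, 0.5], [1.0, -1.5], [2.0, 1.0]]
y = [2.0 * row[0] for row in X]

unpenalised = elastic_net(X, y, lam=0.0, alpha=0.5)   # no penalty: least-squares fit
lasso_like = elastic_net(X, y, lam=0.5, alpha=1.0)    # pure L1 penalty shrinks the slope
heavy = elastic_net(X, y, lam=100.0, alpha=1.0)       # very large penalty zeroes everything
```

The three calls show the effect of the penalty: without it the least-squares slope is recovered, a moderate
L1 penalty shrinks the slope while keeping the irrelevant coefficient at zero, and a very large penalty
removes both variables.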
As already mentioned, our dataset had outliers, so we decided to try different
regression techniques to further neutralize their effect. These regression methods were Quantile
Regression and Robust Regression.
Quantile regression models a quantile (here the median) of the dependent variable as a function of the
independent variables, rather than the traditional choice of the mean of the dependent variable. For the
median, the optimal predictor is the conditional median med(y|x). The quantile regression estimator for
quantile q minimizes an objective function that weights positive and negative residuals asymmetrically[2].
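In its standard form, which we assume matches the one in reference [2], the objective function for quantile
q can be written as:

```latex
Q(\beta_q) = \sum_{i:\, y_i \ge x_i^{\top}\beta_q} q\,\bigl\lvert y_i - x_i^{\top}\beta_q \bigr\rvert
           + \sum_{i:\, y_i < x_i^{\top}\beta_q} (1-q)\,\bigl\lvert y_i - x_i^{\top}\beta_q \bigr\rvert
```

For q = 0.5 both groups of residuals receive the same weight, and the estimator reduces to least absolute
deviations, i.e. median regression.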
Robust Regression, on the other hand, aims to mitigate the effect of outliers. This method down-weights
the outliers depending on how far they are from the best-fit line and iteratively re-fits the model until
convergence is achieved. Thus, robust regression, as the name suggests, largely ignores these outliers and
follows the true trend of the majority of the data. The most commonly used method of robust regression
is M-estimation. With M-estimation, the coefficient estimates are determined by minimizing, over all
coefficients, an objective function[3] built from a function ρ that determines the contribution of each
residual to the objective.
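Written out, the M-estimation objective takes the general form below. The Huber function is shown as one
widely used choice of ρ; the report does not say which ρ was used, so that part is an assumption.

```latex
\hat{b} = \arg\min_{b} \sum_{i=1}^{n} \rho(e_i)
        = \arg\min_{b} \sum_{i=1}^{n} \rho\bigl(y_i - x_i^{\top} b\bigr),
\qquad
\rho_{\text{Huber}}(e) =
\begin{cases}
  \tfrac{1}{2}e^{2}, & \lvert e \rvert \le k \\
  k\lvert e \rvert - \tfrac{1}{2}k^{2}, & \lvert e \rvert > k
\end{cases}
```

Choosing ρ(e) = e² recovers ordinary least squares, while the Huber function grows only linearly for large
residuals and therefore limits the influence of outliers.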
Fig 8. Code snippet for implementation of quantile and robust regression
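The two ideas above can be illustrated with the small sketch below. It is not the project code: the
function names and toy ratings are hypothetical, and the Huber weight with tuning constant k = 1.345 is a
common default (applied to residuals already divided by a robust scale estimate) that we assume for
illustration.

```python
def pinball_loss(residuals, q):
    """Quantile-regression objective: weight q for residuals >= 0 and (1 - q) otherwise."""
    return sum(q * r if r >= 0 else (q - 1) * r for r in residuals)

def huber_weight(residual, k=1.345):
    """Weight used when re-fitting under Huber's rho: 1 within [-k, k], k/|residual| outside."""
    return 1.0 if abs(residual) <= k else k / abs(residual)

def best_constant(values, q):
    """Constant prediction, chosen among the observed values, with the lowest pinball loss."""
    return min(values, key=lambda c: pinball_loss([v - c for v in values], q))

ratings = [3, 4, 4, 5, 8]          # toy values with one unusually high rating
median_fit = best_constant(ratings, 0.5)
```

For q = 0.5 the best constant is the median of the toy ratings, which is unaffected by the single large
value, and the Huber weight shrinks towards zero as a residual moves further from the fit.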
Apart from the above-mentioned models, Random Forest was also implemented. This technique is usually
employed to reduce the variance of the model by growing a number of trees,
with a random subset of variables selected as candidates for the split at every node. As a result,
Random Forests are fairly resistant to overfitting, although the randomness they introduce means that
prediction accuracy can vary slightly from run to run.
Fig 9. Code snippet for implementation of Random Forest
All the above models were fitted using k-fold cross-validation with k equal to 10. The final predictions were
judged against the test dataset based on RMSE. The results obtained for these techniques are as
follows:
Modelling technique | RMSE
Random Forest | 1.51
Quantile Regression | 2.33
Robust Regression | 2.39
Elastic Net | 2.42
Fig 10. Results for different modelling techniques
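The evaluation procedure (10-fold cross-validation on the training data, RMSE on held-out predictions) can be
sketched as below. The helper names are hypothetical, and the fold assignment assumes the rows have
already been shuffled; 304 is simply 80% of the 380 companies.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def k_fold_indices(n_rows, k=10):
    """Yield (train_idx, validation_idx) pairs; fold sizes differ by at most one row."""
    indices = list(range(n_rows))
    for fold in range(k):
        validation = indices[fold::k]
        held_out = set(validation)
        train = [i for i in indices if i not in held_out]
        yield train, validation

folds = list(k_fold_indices(304, k=10))
```

Each row appears in exactly one validation fold, so every training observation is used once for validation
and nine times for fitting.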
Once we have the predicted credit ratings, we can map them against the required probabilities of default
as per the standard transition matrix issued by any rating agency like CRISIL, S&P or Moody’s.
References:
[1] http://users.stat.umn.edu/~zouxx019/Papers/elasticnet.pdf
[2] http://fmwww.bc.edu/EC-C/S2013/823/EC823.S2013.nn04.slides.pdf
[3] http://users.stat.umn.edu/~sandy/courses/8053/handouts/robust.pdf
Appendix I:
List of parameters post Data Cleaning [36] –
Earnings Retention Ratio
Current Ratio
Price Net Operating Revenue
PBIT Margin
Diluted EPS
Return on Networth Equity
Cash EPS
PBIT Share
Earnings Yield
PBT Margin
EV EBITDA
Enterprise Value
Book Value ExclRevalReserve Share
Book Value InclRevalReserve Share
Asset Turnover Ratio
Total Debt Equity
Return on Capital Employed
Retention Ratios
PBDIT Share
Revenue from Operations Share
Net Profit Share
Net Profit Margin
l_Year
Basic EPS
Company
EV Net Operating Revenue
Return on Assets
Quick Ratio
Name
Inventory Turnover Ratio
PBT Share
Cash Earnings Retention Ratio
MarketCap Net Operating Revenue
PBDIT Margin
Price BV
Appendix II:
List of parameters post Exploratory Data Analysis [15] –
Current Ratio
Price Net Operating Revenue
Return on Operating Revenue
Return on Networth Equity
Cash EPS
Earnings Yield
EV EBITDA
Enterprise Value
Asset Turnover Ratio
Total Debt Equity
Return on Capital Employed
Revenue from Operations Share
Inventory Turnover Ratio
Cash Earnings Retention Ratio
Price BV