CDS Project Report Title: Credit risk assessment of mid-size … · 2016-12-07 · CDS Project...

CDS Project Report

Title: Credit risk assessment of mid-size corporates for default prediction

Dated: 5 December 2016

Submitted by:

Group 11 -> Abhilash Fulkar 16BM6JP01

Apoorv Agrawal 16BM6JP09

Dhrubajyoti Dutta 16BM6JP15

Neha Singh Lohchubh 16BM6JP30

As part of the ‘Computing in Data Sciences’ course in the ISI semester of PGDBA, we were required to undertake a project that combined what we had learned during the semester and applied it to a real-world business problem. Considering the business problems around us, we were particularly intrigued by the common yet complex issue of managing credit risk in corporate financial transactions.

Credit risk, also known as counterparty risk or default risk, is the likelihood that a borrower or counterparty will fail to meet its financial obligations and default on loans, bonds or leases at the time of repayment, leading to losses for the lender. Credit risk has therefore always been of paramount importance to lenders, in both financial and non-financial institutions, who need to limit their exposure in adverse situations. In the aftermath of the 2008 recession, effective assessment and management of credit risk has gained unparalleled relevance, making it an interesting field to explore. The premise of credit risk assessment differs with the type of transaction. For this project, we deal with supply chain financing and evaluate the creditworthiness of the mid-size corporates involved in these transactions.

Supply chain financing (SCF) is a business and financing operation that brings together buyers (mid-size corporates), their suppliers (small to medium scale enterprises) and a financing institution. Under SCF, the financing institution provides short-term credit to the suppliers on behalf of the corporates against approved payables. This enables the suppliers to maintain cash flows and run their business smoothly through to completion of orders. At the same time, it gives corporates the flexibility to make payments as per their business cycle.

Fig 1. Working of Supply Chain financing

However, consider the scenario in which, due to financial stress, the corporate is unable to make the required payments to the financing institution, leading to default. In such a case, the financing institution suffers heavy losses, which underlines the need for adequate credit risk assessment of the corporates before entering a transaction with them.

Fig 2. Default scenario in Supply Chain financing

To minimize their risk, several financing institutions extend only collateralized credit to the suppliers. But because these enterprises are small or medium in size, they have a hard time arranging the required collateral, which often leaves them short of liquidity. To address such situations, some financing institutions offer short-term non-collateralized loans. One such budding financing institution is MUSK technologies, a 9-month-old startup working in the field of SCF with the aim of simplifying it and extending its reach by providing non-collateralized loans to SMEs.

The credit risk in the above scenario stems from a complex interaction of many underlying factors. Macro-economic developments in the corporate sector, together with short-term and long-term business practices, predominantly guide the creditworthiness of a mid-size corporate. In addition, the risk arising from collusion between supplier and corporate, and from forged transactions and invoices, must be accounted for when deciding whether to extend credit to the supplier. Factoring in the contribution of so many different factors makes credit risk assessment of corporates a challenging task. This is the primary business problem that MUSK technologies faces. For our project, we decided to collaborate with them to partially solve this problem by modelling credit risk as a function of the past and projected financial health of the corporate in question.

Fig 3. Model development Process

Problem Analysis and Data Collection:

Since MUSK technologies is a relatively new firm, we did not have a baseline for analyzing the indicators that affect the creditworthiness of a corporate. Neither did we have any current transaction history of the corporates that could signal their present growth trend and economic health. In addition, as most of the firm's loans are yet to mature, we did not have reference data on which to train or test our model. Given these limitations, we were faced with the mammoth task of identifying auxiliary indicators that could help us estimate the creditworthiness of the corporates with reasonable accuracy.

Upon research, we found that, like financial instruments such as bonds and derivatives, companies also have ratings indicative of their creditworthiness, issued by neutral third parties. In the Indian context, agencies like CRISIL, ONICRA and ICRA offer these. However, like other financial information, the ratings are proprietary, and hence ratings for very few companies are available in the public domain. We decided to collect whatever information we could for the companies and their ratings from these agencies' websites by means of web scraping. For this purpose, we used Python packages such as Selenium and Beautiful Soup. Altogether we managed to scrape data on 12000 traded and non-traded companies. The data was parsed into the schema Company | Sector | Credit Type | Rating and stored in a CSV file. To augment this credit rating data, we decided to try another measure that leverages the relation between spreads (on bond yields and credit default swaps for a corporate) and the credit risk of the corporate. The bond yield spread or credit default swap spread is inversely related to the market's assessment of the creditworthiness of the corporate, so a high spread indicates that the market has low confidence in the corporate's capability to repay its obligations. Unfortunately, we could not find any public source that provided this information in a usable form.

Steps shown in Fig 3: Problem Analysis -> Data Collection -> Data Cleaning -> Data Analysis -> Model Fitting -> Model Evaluation

Fig 4. Collection of credit ratings
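The storage step for the scraped ratings can be sketched as follows. This is a minimal illustration, not the scraper used in the project: only the four column names come from the schema described above, while the function name `write_ratings` and the sample rows are placeholders invented for the example.

```python
import csv
import io

# Column names taken from the schema described in the text.
FIELDS = ["Company", "Sector", "Credit Type", "Rating"]

def write_ratings(records, handle):
    """Write scraped rating records (dicts keyed by FIELDS) as CSV."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for record in records:
        # Strip stray whitespace that scraped page text often carries.
        writer.writerow({key: str(record[key]).strip() for key in FIELDS})

# Placeholder rows for illustration only; not real scraped data.
sample = [
    {"Company": "Example Corp A", "Sector": "Textiles",
     "Credit Type": "Long Term", "Rating": " BBB+ "},
    {"Company": "Example Corp B", "Sector": "Chemicals",
     "Credit Type": "Short Term", "Rating": "A1"},
]

buffer = io.StringIO()          # stands in for an open CSV file
write_ratings(sample, buffer)
csv_text = buffer.getvalue()
```

Writing through `csv.DictWriter` keeps the column order fixed even if a scraped record carries its keys in a different order.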

The next step was to figure out how to use these ratings to assess the credit risk of companies. We wanted observable indicators that could be related to the ratings and thus help us construct a model around them. In our search, we came across the PD model, a very simple way to determine credit risk. It works by applying logistic regression to certain financial ratios, treating them as explanatory variables. The output is a quantity between 0 and 1, termed the probability of default. As the name suggests, the higher the value, the higher the chance that the corporate in question will default. We decided to combine the financial data used by the PD model with our credit ratings, and to predict the ratings as the response variable. Once we have the predicted credit ratings for the test data, we map them to the respective probabilities of default as per the standard conversion given by global rating firms like S&P and Moody’s.
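To make the PD model concrete, the sketch below shows how a logistic function turns a weighted sum of financial ratios into a value between 0 and 1. The coefficients, the intercept and the two example companies are hypothetical numbers chosen only to show the mechanics; they are not estimates from our data.

```python
import math

def probability_of_default(ratios, coefficients, intercept):
    """Logistic model: PD = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    score = intercept + sum(b * x for b, x in zip(coefficients, ratios))
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical weights on [current ratio, debt-to-equity ratio].
coefficients = [-1.2, 0.8]
healthy = [2.0, 0.5]    # high liquidity, low leverage
stressed = [0.6, 3.0]   # low liquidity, high leverage

pd_healthy = probability_of_default(healthy, coefficients, intercept=-1.0)
pd_stressed = probability_of_default(stressed, coefficients, intercept=-1.0)
```

With these made-up weights, the stressed company receives the higher probability of default, which is the behaviour the PD model relies on.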

To complete our dataset, we had to obtain financial information, across different financial parameters, for the companies whose ratings we had. For this purpose, we scraped data from MoneyControl. To our surprise, we could find financial information for just 380 companies on this website, because MoneyControl has information on publicly traded companies only. We tried other sources, such as the Ministry of Corporate Affairs, for financial information on non-public companies, but the efforts proved futile. Thus, we decided to go ahead with our dataset of just 380 companies.

Data Cleaning:

The dataset derived in the previous step was merged from several sources without any processing during scraping. Consequently, it suffered from issues like missing values and inconsistency. To address the missing values, we used the missmap function (Amelia package) to visualize their density. The plot revealed that missing values were concentrated in a few columns, which were further found to be redundant with respect to the information they stored. Hence, we decided to remove these columns. Apart from this, as the companies varied in size and annual turnover, their values on different financial parameters showed correspondingly high dispersion. Thus, before we could use this information, we had to normalize it to reveal meaningful trends. To do so, we adopted the standard normalization technique of transforming each value x as:

$$ z = \frac{x - \mu}{\sigma} $$

where:
μ : mean value of the x’s
σ : standard deviation of the x’s
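A minimal sketch of this standardization for a single column is given below. It assumes the population standard deviation; a sample standard deviation (dividing by n - 1) would scale every value by the same constant factor, so the choice does not change the shape of the column. The input numbers are illustrative only.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return (x - mean) / standard deviation for each x in the column."""
    mu = mean(values)
    sigma = pstdev(values)      # population standard deviation (assumption)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]

# Illustrative values with mean 5 and population standard deviation 2.
column = [2, 4, 4, 4, 5, 5, 7, 9]
z_scores = standardize(column)
```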

In addition, we observed that the credit ratings, a factor variable, were represented inconsistently because they were sourced from different agencies, so similar information appeared under different labels. The variable also had too many levels, indicating excessive granularity. As we did not have much data, we could not effectively differentiate between adjacent levels. Therefore, we decided to compact the ratings into 8 primary levels to sufficiently represent different degrees of credit risk (1 being bad and 8 being good).
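One way such a compaction could be done is sketched below. The mapping itself is an assumption made for illustration: it treats the eight broad long-term letter grades as the eight levels and assumes symbols of the form "CRISIL AA+" or "[ICRA]BBB-". The exact mapping used in the project is not reproduced here.

```python
# Hypothetical mapping: eight broad letter grades, 1 = worst, 8 = best.
BUCKETS = {"D": 1, "C": 2, "B": 3, "BB": 4, "BBB": 5, "A": 6, "AA": 7, "AAA": 8}

def compact_rating(symbol):
    """Drop the agency prefix and +/- modifiers, then look up the level."""
    cleaned = symbol.strip().upper()
    # Turn "[ICRA]AA" into "ICRA AA" so the grade is the last token.
    cleaned = cleaned.replace("[", "").replace("]", " ")
    grade = cleaned.split()[-1].rstrip("+-")
    return BUCKETS[grade]       # raises KeyError for unrecognised symbols

levels = [compact_rating(s) for s in ["CRISIL AA+", "[ICRA]BBB-", "AAA", " d "]]
```

Letting unknown symbols raise an error, rather than silently assigning a default level, makes representation mismatches between agencies visible during cleaning.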

After data cleaning, we were left with 36 financial parameters to be used as explanatory variables for

prediction. (Refer Appendix I for list of parameters)

Data Analysis:

After ensuring consistency of the data, we began a preliminary analysis in the hope of revealing relevant insights or consequential irregularities. Careful inspection unearthed outliers in quite a few columns, which can bias regression coefficients. For example, given below are two such columns (Enterprise Value & Inventory Turnover Ratio) with outliers:

Fig 5. Summary statistics of two columns with outliers

Since we already had very few rows, we could not simply remove all the rows with outliers. Thus began an iterative process of judging the significance of the columns containing outliers in the prediction of credit ratings. Where the effect of a column on the prediction was trivial, we simply ignored its outliers, whereas for columns that contributed significantly to the prediction, we removed the outlier rows.
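The rows to remove first have to be flagged. The sketch below uses the common 1.5 x IQR rule as an assumed flagging criterion; the report's own inspection was based on summary statistics, and the rule, the function name and the sample values here are illustrative.

```python
from statistics import quantiles

def iqr_outlier_flags(values, k=1.5):
    """Flag values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)      # first, second, third quartile
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x < lower or x > upper for x in values]

# Illustrative column with one extreme value.
turnover = [10, 12, 11, 13, 12, 11, 10, 500]
flags = iqr_outlier_flags(turnover)
```

Rows flagged in a column that matters for the prediction would then be dropped, while flags in unimportant columns would be ignored, as described above.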

Moving ahead, we decided to reduce the number of variables in our dataset by studying the relationships between them, both intuitively and analytically. We used the cor function in R, along with different plots and our knowledge of various financial ratios, to optimize our variable set. For example, Current ratio, Quick ratio and Cash ratio are all liquidity ratios and measure the same quantity, i.e. the liquidity of a company. They therefore carry redundant information, and keeping all of them is an overhead we can easily do away with. Further, we assumed that if a variable is strongly correlated with several other variables, then it is significant and the other variables depend on it. Thus, we retain this variable and eliminate the rest, so that no unique information is lost.
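The retain-one, drop-the-rest heuristic can be sketched as a greedy procedure: repeatedly keep the variable with the most strongly correlated partners and discard those partners. The 0.9 threshold, the alphabetical tie-break, the function names and the toy columns are assumptions for illustration. In practice this step was guided by judgement as much as by numbers.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def reduce_variables(columns, threshold=0.9):
    """columns: dict of name -> values. Returns the names to keep."""
    names = list(columns)
    # For each variable, the set of others it is strongly correlated with.
    links = {a: {b for b in names if b != a and
                 abs(pearson(columns[a], columns[b])) > threshold}
             for a in names}
    remaining, kept = set(names), []
    while remaining:
        # Keep the variable with the most remaining partners (ties: alphabetical).
        hub = max(sorted(remaining), key=lambda v: len(links[v] & remaining))
        kept.append(hub)
        remaining -= links[hub] | {hub}
    return kept

# Toy data: three liquidity ratios that move together, one leverage ratio.
columns = {
    "current_ratio": [1.0, 1.5, 2.0, 2.5, 3.0],
    "quick_ratio":   [0.8, 1.2, 1.7, 2.1, 2.6],
    "cash_ratio":    [0.3, 0.5, 0.7, 0.9, 1.2],
    "debt_equity":   [2.0, 0.5, 1.8, 0.7, 1.1],
}
kept = reduce_variables(columns)
```

On the toy data only one of the three liquidity ratios survives alongside the leverage ratio, mirroring the liquidity example in the text.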

The above procedure yielded a final set of 15 financial parameters that were used subsequently to fit

different models. (Refer Appendix II for list of reduced variable set)

Fig 6. Correlation plot of the reduced variable set

With this, our preliminary exploratory analysis of the data was complete. Before fitting different statistical models to our data, we partitioned it in an 80:20 ratio to form the training and test sets.
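A minimal sketch of such a random 80:20 partition follows. The seed and the function name are illustrative, and the 380-row input simply reuses the dataset size mentioned earlier; the row count after outlier removal may have been smaller.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Randomly hold out a fraction of the rows as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)    # fixed seed for repeatability
    n_test = round(len(rows) * test_fraction)
    test_idx = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test

rows = list(range(380))                     # stand-in for 380 company records
train_rows, test_rows = train_test_split(rows)
```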

Model Fitting, Motivation and Evaluation:

Conscious of the fact that we had very few observations to train our model and a large set of correlated parameters, we decided to favor simple models over complex ones. A major issue with such a small dataset is overfitting, so our main challenge was to fit a model that does not overfit. In such a scenario, the ordinary least squares method for linear regression is expected to be inadequate for both interpretation and prediction. Hence, we decided to use regularization techniques in the form of Ridge and Lasso regression, which not only minimize the residual sum of squares but also penalize large coefficients. The penalty trades a small increase in squared bias for a reduction in variance, thereby providing better accuracy at reduced complexity. Between Ridge and Lasso, we chose Lasso regression as it also performs feature selection, by subjecting the regression coefficients to an L1 penalty as opposed to the L2 penalty of Ridge regression. Thus, coefficients for less prominent features are reduced to zero, resulting in a smaller variable set. However, upon further research, we observed that Lasso regression tends to select only one feature from a group of features with high pairwise correlations, without regard to which feature is selected [1]. As a result, it lacks the ability to reveal grouping information.
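Why the L1 penalty produces exact zeros while the L2 penalty only shrinks can be seen in the simplest one-coefficient case. The sketch below assumes a single unpenalized estimate z and the objectives stated in the docstrings; the input values and the penalty weight 0.5 are illustrative.

```python
def lasso_shrink(z, lam):
    """Minimizer of 0.5*(z - b)**2 + lam*abs(b): soft-thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0                  # small estimates are set exactly to zero

def ridge_shrink(z, lam):
    """Minimizer of 0.5*(z - b)**2 + 0.5*lam*b**2: proportional shrinkage."""
    return z / (1.0 + lam)

unpenalized = [3.0, -0.4, 0.2, -2.5]
lasso_coefs = [lasso_shrink(z, 0.5) for z in unpenalized]
ridge_coefs = [ridge_shrink(z, 0.5) for z in unpenalized]
```

The two small estimates become exactly zero under the L1 rule but stay non-zero under the L2 rule, which is the feature selection behaviour described above.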

To overcome this drawback, we decided to implement another regularization technique called Elastic Net, which is similar to Lasso but can select groups of correlated variables. Its penalty is a convex combination of the Lasso and Ridge penalty terms. Therefore, the equivalent optimization problem is transformed as below:
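Following the convention of [1], and assuming a response vector y, a predictor matrix X, a coefficient vector β and an overall penalty weight λ (symbols not defined elsewhere in this report), the penalized form of the problem can be written as shown here. In this convention α = 0 gives the Lasso and α = 1 gives Ridge; some software, such as the glmnet package, uses the opposite convention in which α weights the L1 term.

```latex
\hat{\beta} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2
  + \lambda \left[ (1 - \alpha)\, \lVert \beta \rVert_1
  + \alpha\, \lVert \beta \rVert_2^2 \right],
  \qquad 0 \le \alpha \le 1
```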

Where: α is the tuning parameter that balances the proportion of Lasso and Ridge penalty desired[1].

Fig 7. Code snippet for implementation of Elastic net

As already mentioned, our dataset had outliers, so we decided to try other regression techniques to further neutralize their effect. These methods were Quantile Regression and Robust Regression.

Quantile regression models a quantile (here the median) of the dependent variable as a function of the independent variables, rather than the traditional choice of the mean of the dependent variable. For the median, the optimal predictor is the conditional median med(y|x). The quantile regression estimator for quantile q minimizes the objective function [2]:
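In its standard form, with x_i the vector of explanatory variables for observation i and β_q the coefficient vector for quantile q (notation assumed here), this objective is:

```latex
Q(\beta_q) = \sum_{i:\, y_i \ge x_i'\beta_q} q \,\lvert y_i - x_i'\beta_q \rvert
  \; + \sum_{i:\, y_i < x_i'\beta_q} (1 - q)\, \lvert y_i - x_i'\beta_q \rvert
```

For q = 0.5 both sums carry the same weight, so the objective is proportional to the sum of absolute residuals, whose minimizer is the conditional median.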

Robust regression, on the other hand, aims to mitigate the effect of outliers directly. This method down-weights the outliers depending on how far they are from the best-fit line and iteratively re-fits the model until convergence is achieved. Thus, robust regression, as the name suggests, largely ignores these outliers and follows the trend of the majority of the data. The most commonly used method of robust regression is M-estimation. With M-estimation, the coefficient estimates are determined by minimizing a particular objective function over all coefficients [3], as given below:
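In the usual notation, with e_i the residual of observation i, x_i its vector of explanatory variables and b the coefficient vector (symbols assumed here), the M-estimation objective is:

```latex
\sum_{i=1}^{n} \rho(e_i) = \sum_{i=1}^{n} \rho\!\left( y_i - x_i' b \right)
```

Choosing ρ(e) = e² recovers ordinary least squares, while choices that grow more slowly for large residuals reduce the influence of outliers.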

Where: ρ is the function that determines the contribution of each residual to the objective function.

Fig 8. Code snippet for implementation of quantile and robust regression
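The down-weight and re-fit loop behind M-estimation can be illustrated on the simplest possible model, a single location parameter (a robust "mean"), rather than a full regression. The sketch below is not the code shown in Fig 8. It assumes Huber weights with the commonly used tuning constant 1.345 and a scale estimated from the median absolute deviation, and the data are made up.

```python
from statistics import median

def huber_weight(residual, k):
    """Huber weight: 1 inside [-k, k], k / |residual| outside."""
    size = abs(residual)
    return 1.0 if size <= k else k / size

def huber_location(values, k=1.345, tol=1e-8, max_iter=100):
    """Robust location estimate via iteratively reweighted averaging."""
    estimate = median(values)
    # Scale from the median absolute deviation, rescaled for normal data.
    mad = median(abs(v - estimate) for v in values)
    scale = mad / 0.6745 if mad > 0 else 1.0
    for _ in range(max_iter):
        weights = [huber_weight((v - estimate) / scale, k) for v in values]
        updated = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        if abs(updated - estimate) < tol:   # converged
            return updated
        estimate = updated
    return estimate

data = [10, 11, 12, 11, 10, 12, 11, 500]    # one gross outlier
robust_center = huber_location(data)
plain_mean = sum(data) / len(data)
```

The ordinary mean is dragged far towards the outlier, whereas the reweighted estimate stays close to the bulk of the data. A robust regression applies the same reweighting to the residuals of the fitted line.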

Apart from the above-mentioned models, Random Forest was also implemented. This technique reduces the variance of the model by growing a number of trees, with a random subset of variables selected as candidates for the split at every node, and averaging their predictions. As a result, Random Forests are relatively resistant to overfitting, although the randomness involved means that prediction accuracy can vary slightly between runs.

Fig 9. Code snippet for implementation of Random Forest

All the above models were fitted using k-fold cross-validation with k equal to 10. The final predictions were judged against the test dataset based on root mean squared error (RMSE). The results obtained for these techniques are as follows:

Modelling technique      RMSE
Random Forest            1.51
Quantile Regression      2.33
Robust Regression        2.39
Elastic Net              2.42

Fig 10. Results for different modelling techniques
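The evaluation metric and the fold construction can be sketched as follows. The function names, the interleaved fold assignment and the 304-row training size are illustrative assumptions, not a reproduction of the fitting code.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted ratings."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def k_fold_indices(n_rows, k=10):
    """Yield (train_indices, validation_indices) for each of k folds."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n_rows) if i not in held]
        yield train, held_out

example_error = rmse([8, 6, 4], [6, 6, 6])      # ratings on the 1-8 scale
folds = list(k_fold_indices(304, k=10))
```

Since the ratings lie on a 1 to 8 scale, an RMSE of 1.51 corresponds to a typical error of about one and a half rating levels.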

Once we have the predicted credit ratings, we can map them against the required probabilities of default

as per the standard transition matrix issued by any rating agency like CRISIL, S&P or Moody’s.

References:

[1] http://users.stat.umn.edu/~zouxx019/Papers/elasticnet.pdf

[2] http://fmwww.bc.edu/EC-C/S2013/823/EC823.S2013.nn04.slides.pdf

[3] http://users.stat.umn.edu/~sandy/courses/8053/handouts/robust.pdf

Appendix I:

List of parameters post Data Cleaning [36] –

Earnings Retention Ratio

Current Ratio

Price Net Operating Revenue

PBIT Margin

Diluted EPS

Return on Networth Equity

Cash EPS

PBIT Share

Earnings Yield

PBT Margin

EV EBITDA

Enterprise Value

Book Value ExclRevalReserve Share

Book Value InclRevalReserve Share

Asset Turnover Ratio

Total Debt Equity

Return on Capital Employed

Retention Ratios

PBDIT Share

Revenue from Operations Share

Net Profit Share

Net Profit Margin

l_Year

Basic EPS

Company

EV Net Operating Revenue

Return on Assets

Quick Ratio

Name

Inventory Turnover Ratio

PBT Share

Cash Earnings Retention Ratio

MarketCap Net Operating Revenue

PBDIT Margin

Price BV

Appendix II:

List of parameters post Exploratory Data Analysis [15] –

Current Ratio

Price Net Operating Revenue

Return on Operating Revenue

Return on Networth Equity

Cash EPS

Earnings Yield

EV EBITDA

Enterprise Value

Asset Turnover Ratio

Total Debt Equity

Return on Capital Employed

Revenue from Operations Share

Inventory Turnover Ratio

Cash Earnings Retention Ratio

Price BV
