Final SAS Day 2015 Poster

1
Introduction Cluster Analysis and the Cost of Simplicity Supervised vs. Unsupervised Discretization Britney Cook and Reuben Hilliard Department of Statistics and Analytical Sciences Advisor: Dr. Jennifer Priestley Procedure After having cleaned all of the data, the next step was the transformation of the variables. For the user-defined transformations an equal widths logic was used for defining the ranks. This was done by looking at a histogram of each variable and evaluating the range and distribution of the data. Where the majority of the observations lied, the ranks were assigned equal values and where it tailed off, the ranks were assigned increasingly larger values to compensate for the distribution of the data. Figure 1 is a plot of the user-defined ranks for the variable BRAVGMOS, or the average number of months per bank revolving account open. The exact values for each rank can be found in Table 1. For the SAS-defined transformations an equal frequencies logic was used for defining the ranks. This was automated using the PROC RANK procedure, after specifying the desired number of ranks. The number of ranks needed to be high enough so meaningful information was not masked, but not too high that it became cumbersome. Ten ranks seemed appropriate. Figure 2 is a plot of the SAS-defined ranks for the variable BRAVGMOS. The exact values for each rank can be found in Table 2. Wherever the difference in spread between ranks was insignificant, determined by a flat line (user-defined) or a t-test (SAS-defined), the two ranks were collapsed into one, which is why in Figure 2 there are only 9 ranks. After finalizing the ranks, they were given ordinal values resulting in the ordinal transformation for the user and SAS-defined procedures. From the ordinal transformations, the odds and log of odds transformations were created. After undergoing all of the variable transformations for both the supervised and unsupervised methods, the 441 variables were reduced to the top 50 most significant in predicting the dependent variable GOODBAD. Then for each of the six transformations, the top five variables for that transformation were set aside for cluster analysis. PROC LOGIT was first used to determine the profit per 1,000 of our original prediction model, prior to clustering. Using these same variables, PROC CLUSTER was then used to generate three criteria: Cubic Clustering Criterion (CCC), Pseudo-F and Pseudo T-Squared, which helped determine the optimal number of clusters. Where there was a peak in the CCC, a peak in Pseudo-F and a dip in Pseudo T- Squared, the optimal number of clusters would be found. It can be seen in Figure 3 that these three events occur at five clusters. Next, PROC FASTCLUS was used to cluster the data into five groups, followed by PROC REPORT which generated the profitability per 1,000 for each of the five clusters determined. Using the clusters constructed, a transactional dataset was merged in using a left join and MATCHKEY, a customer identifier. The top 20 SIC codes from each cluster were then collected and used to determine spending behaviors of the customers that comprised each cluster, shown in Table 5. Next these trends were compared across clusters so that a categorization of customer characteristics for both profitable and Non- profitable clusters could be made. There were a couple of additional procedures that were implemented to see if some variables, much like spending characteristics, varied across clusters. One was a two-way frequency table, Table 4, and the other a heat map, Figure 5. The results of these procedures can be found in the following section. Results Figure 4, along with Figures 6, 7 & 8, clearly indicate that unsupervised, or the most mathematically optimal approach, results in models that profit more than those that utilized a supervised approach. The log of odds transformation using the unsupervised approach resulted in the most profitable top cluster at $222,078.40 per 1,000 customers. This was significantly more than the model with no clustering which profited only $78,690.72 per 1,000, shown in Table 3. After evaluating customer characteristics for each cluster in the most profitable transformation, it was discovered that clusters containing customers who use their credit cards for automated cash disbursement and liquor store purchases profit $70,000-$150,000 less (about half) than those that contained customers who do not. Also, customers in the lowest profitable cluster used their credit card inside service stations 40% more often than customers in the highest profitable cluster. Customers in the highest profitable cluster were also found to use their credit card more than 50% more frequently at dine-in than at fast-food restaurants, versus customers in the lowest profitable cluster which were found to use their credit card the same amount of times at both. In Table 4 it can be see that the least profitable cluster, cluster 5, had the highest percent TRATE3, or the number of accounts 60 days past due. Meaning that cluster 5 had the highest number of people who had either 1, 2, 3 or 4 accounts past due. In Figure 5 it was found that for the top two most profitable clusters, a higher BNKINQ2, or number of banking inquires in the past 6 months, resulted in a predicted probability of GOODBAD as low as 0.2. For the bottom two least profitable clusters, a higher BNKINQ2 resulted in a predicted probability of GOODBAD only as low as 0.5. In other words, the number of inquiries had little effect on the predicted probability of default for customers in these profitable clusters. Conclusion Upon discovering that the most complex transformation, using the unsupervised approach was the most profitable, the question that needed to be answered next was whether or not the ability to explain the variables was more important than having the highest possible profit. To answer this question an evaluation of the next highest profit and the transformation from which it was derived would be analyzed to see if the tradeoff was worth it. In Figure 4, the second highest profiting transformation was the ordinal, using the unsupervised approach. This transformation was the most simple and easy to explain of all the transformations, but had about $19,000 less profit in its top cluster and about $122,000 less in overall profit. This is the cost of simplicity and is sometimes not a very easy call to make. Are having variables that can be explained in simple terms really worth $122,000? When a decision comes with such a high cost, it would be best to consult the client and see what is most important to their organization. In our case however, there is no client, and thus the decision was made to keep the complex model in favor of the most profitable outcome, $759,363.34 per 5,000 customers. Sample Code Table 1: User-Defined Ranks for BRAVGMOS Table 2: SAS-Defined Ranks for BRAVGMOS Figure 2: Plot of SAS-Defined Ranks for BRAVGMOS Figure 1: Plot of User-Defined Ranks for BRAVGMOS When it comes to defining variable transformations there are two general approaches that can be utilized: supervised (user-defined) and unsupervised (SAS-defined). It would seem that the most mathematically optimal method, unsupervised, produces models that profit more than those based on variables built using a supervised approach, which allows room for human interpretation and error. However, the most profitable transformations can often result in a cumbersome output of variables. This leaves one wondering, should the ability to explain variables be sacrificed for the most profitable form? If the answer is no, where then should the line be drawn? This concept, known as the cost of simplicity, is one of heavy debate and will be addressed within this presentation. Another matter that will be looked into is the difference in profit between the supervised and unsupervised methods. In a binary classification model using logistic regression, 50 of 441 variables were found to be significant in generating a credit risk score, labeled GOODBAD; this variable had a value of either 0 (good) or 1 (bad). It was of this 50 variable pool that the variables used in this analysis were retrieved. A comparison of profits for both supervised and unsupervised would be investigated for three different transformations: ordinal, odds and log of odds. The differences between these methods and transformation are defined in the following procedure. Table 3: Profitability per 1,000 for Model and Top Cluster Figure 4: Stacked Bar Plot of Profit by Transformation Grouped by Cluster Figure 6: Bubble Plot of Cluster Size by Cluster Profit (Ordinal) Figure 7: Bubble Plot of Cluster Size by Cluster Profit (Odds) Figure 8: Bubble Plot of Cluster Size by Cluster Profit (Log of Odds) Teal = Supervised Olive = Unsupervised Yellow = Supervised Red = Unsupervised Green = Supervised Purple = Unsupervised Figure 3: Cluster Analysis for the Log of Odds (Unsupervised) Table 6: Cluster Profitability for Log of Odds (Unsupervised) Table 5: SIC Code Comparison of Clusters for Log of Odds (Unsupervised) Figure 5: Heat Map of BNKINQ2 by Cluster for Log of Odds (Unsupervised) Table 4: Two-Way Frequency Table of TRATE3 by Cluster for Log of Odds (Unsupervised) Transforma)on Model Top Cluster Ordinal (S) $106,873.47 $183,952.57 Ordinal (U) $108,180.97 $203,738.01 Odds (S) $69,964.89 $173,323.94 Odds (U) $65,514.19 $166,386.46 Log of Odds (S) $72,017.65 $204,782.40 Log of Odds (U) $78,690.72 $222,078.40 S = Supervised U = Unsupervised TRATE3 Cluster 0 1 2 3 4 Total 1 31856 4.23 7627 5.78 6761 0.90 16.19 5.18 2133 0.28 5.11 4.63 744 0.10 1.78 4.09 271 0.04 0.65 3.62 41765 5.55 2 173373 23.02 76.50 31.48 34990 4.65 15.44 26.81 11548 1.53 5.10 25.05 4670 0.62 2.06 25.69 2060 0.27 0.91 27.53 226641 30.10 3 65042 8.64 83.07 11.81 9736 1.29 12.43 7.46 2491 0.33 3.18 5.40 768 0.10 0.98 4.23 264 0.04 0.34 3.53 78301 10.40 4 59655 7.92 72.69 10.83 14824 1.97 18.06 11.36 5033 0.67 6.13 10.92 1855 0.25 2.26 10.21 696 0.09 0.85 9.30 82063 10.90 5 220809 29.32 68.1 40.09 64207 8.53 19.80 49.19 24896 3.31 7.68 54.00 10140 1.35 3.13 55.78 4194 0.56 1.29 56.01 324243 43.06 Total 550735 73.14 130518 17.33 46101 6.12 18177 2.41 7482 0.99 753013 100.00 Frequency Percent Row Percent Column Percent Average Independent Average Dependent Rank Frequency 1 114440 10 0.25848 2 129842 20 0.20406 3 254353 31 0.181142 4 123412 44 0.17609 5 129910 51 0.16705 6 114318 59 0.16102 7 133391 69 0.15858 8 125048 81 0.15309 9 130715 111 0.12426 Average Independent Average Dependent Rank Frequency 1 176173 13 0.24165 2 167186 24 0.18893 3 829725 58 0.16499 4 82345 120 0.11585 Figures 4, 5, 6, 7, and 8 were all generated using SAS® Visual Analytics

Transcript of Final SAS Day 2015 Poster

Page 1: Final SAS Day 2015 Poster

Introduction

Cluster Analysis and the Cost of Simplicity Supervised vs. Unsupervised Discretization

Britney Cook and Reuben Hilliard Department of Statistics and Analytical Sciences

Advisor: Dr. Jennifer Priestley

Procedure

After having cleaned all of the data, the next step was the transformation of the variables. For the user-defined transformations an equal widths logic was used for defining the ranks. This was done by looking at a histogram of each variable and evaluating the range and distribution of the data. Where the majority of the observations lied, the ranks were assigned equal values and where it tailed off, the ranks were assigned increasingly larger values to compensate for the distribution of the data. Figure 1 is a plot of the user-defined ranks for the variable BRAVGMOS, or the average number of months per bank revolving account open. The exact values for each rank can be found in Table 1. For the SAS-defined transformations an equal frequencies logic was used for defining the ranks. This was automated using the PROC RANK procedure, after specifying the desired number of ranks. The number of ranks needed to be high enough so meaningful information was not masked, but not too high that it became cumbersome. Ten ranks seemed appropriate. Figure 2 is a plot of the SAS-defined ranks for the variable BRAVGMOS. The exact values for each rank can be found in Table 2. Wherever the difference in spread between ranks was insignificant, determined by a flat line (user-defined) or a t-test (SAS-defined), the two ranks were collapsed into one, which is why in Figure 2 there are only 9 ranks. After finalizing the ranks, they were given ordinal values resulting in the ordinal transformation for the user and SAS-defined procedures. From the ordinal transformations, the odds and log of odds transformations were created. After undergoing all of the variable transformations for both the supervised and unsupervised methods, the 441 variables were reduced to the top 50 most significant in predicting the dependent variable GOODBAD. Then for each of the six transformations, the top five variables for that transformation were set aside for cluster analysis. PROC LOGIT was first used to determine the profit per 1,000 of our original prediction model, prior to clustering. Using these same variables, PROC CLUSTER was then used to generate three criteria: Cubic Clustering Criterion (CCC), Pseudo-F and Pseudo T-Squared, which helped determine the optimal number of clusters. Where there was a peak in the CCC, a peak in Pseudo-F and a dip in Pseudo T-Squared, the optimal number of clusters would be found. It can be seen in Figure 3 that these three events occur at five clusters. Next, PROC FASTCLUS was used to cluster the data into five groups, followed by PROC REPORT which generated the profitability per 1,000 for each of the five clusters determined. Using the clusters constructed, a transactional dataset was merged in using a left join and MATCHKEY, a customer identifier. The top 20 SIC codes from each cluster were then collected and used to determine spending behaviors of the customers that comprised each cluster, shown in Table 5. Next these trends were compared across clusters so that a categorization of customer characteristics for both profitable and Non-profitable clusters could be made. There were a couple of additional procedures that were implemented to see if some variables, much like spending characteristics, varied across clusters. One was a two-way frequency table, Table 4, and the other a heat map, Figure 5. The results of these procedures can be found in the following section.

Results Figure 4, along with Figures 6, 7 & 8, clearly indicate that unsupervised, or the most mathematically optimal approach, results in models that profit more than those that utilized a supervised approach. The log of odds transformation using the unsupervised approach resulted in the most profitable top cluster at $222,078.40 per 1,000 customers. This was significantly more than the model with no clustering which profited only $78,690.72 per 1,000, shown in Table 3. After evaluating customer characteristics for each cluster in the most profitable transformation, it was discovered that clusters containing customers who use their credit cards for automated cash disbursement and liquor store purchases profit $70,000-$150,000 less (about half) than those that contained customers who do not. Also, customers in the lowest profitable cluster used their credit card inside service stations 40% more often than customers in the highest profitable cluster. Customers in the highest profitable cluster were also found to use their credit card more than 50% more frequently at dine-in than at fast-food restaurants, versus customers in the lowest profitable cluster which were found to use their credit card the same amount of times at both. In Table 4 it can be see that the least profitable cluster, cluster 5, had the highest percent TRATE3, or the number of accounts 60 days past due. Meaning that cluster 5 had the highest number of people who had either 1, 2, 3 or 4 accounts past due. In Figure 5 it was found that for the top two most profitable clusters, a higher BNKINQ2, or number of banking inquires in the past 6 months, resulted in a predicted probability of GOODBAD as low as 0.2. For the bottom two least profitable clusters, a higher BNKINQ2 resulted in a predicted probability of GOODBAD only as low as 0.5. In other words, the number of inquiries had little effect on the predicted probability of default for customers in these profitable clusters.

Conclusion

Upon discovering that the most complex transformation, using the unsupervised approach was the most profitable, the question that needed to be answered next was whether or not the ability to explain the variables was more important than having the highest possible profit. To answer this question an evaluation of the next highest profit and the transformation from which it was derived would be analyzed to see if the tradeoff was worth it. In Figure 4, the second highest profiting transformation was the ordinal, using the unsupervised approach. This transformation was the most simple and easy to explain of all the transformations, but had about $19,000 less profit in its top cluster and about $122,000 less in overall profit. This is the cost of simplicity and is sometimes not a very easy call to make. Are having variables that can be explained in simple terms really worth $122,000? When a decision comes with such a high cost, it would be best to consult the client and see what is most important to their organization. In our case however, there is no client, and thus the decision was made to keep the complex model in favor of the most profitable outcome, $759,363.34 per 5,000 customers.

Sample Code

Table 1: User-Defined Ranks for BRAVGMOS

Table 2: SAS-Defined Ranks for BRAVGMOS

Figure 2: Plot of SAS-Defined Ranks for BRAVGMOS

Figure 1: Plot of User-Defined Ranks for BRAVGMOS

When it comes to defining variable transformations there are two general approaches that can be utilized: supervised (user-defined) and unsupervised (SAS-defined). It would seem that the most mathematically optimal method, unsupervised, produces models that profit more than those based on variables built using a supervised approach, which allows room for human interpretation and error. However, the most profitable transformations can often result in a cumbersome output of variables. This leaves one wondering, should the ability to explain variables be sacrificed for the most profitable form? If the answer is no, where then should the line be drawn? This concept, known as the cost of simplicity, is one of heavy debate and will be addressed within this presentation. Another matter that will be looked into is the difference in profit between the supervised and unsupervised methods. In a binary classification model using logistic regression, 50 of 441 variables were found to be significant in generating a credit risk score, labeled GOODBAD; this variable had a value of either 0 (good) or 1 (bad). It was of this 50 variable pool that the variables used in this analysis were retrieved. A comparison of profits for both supervised and unsupervised would be investigated for three different transformations: ordinal, odds and log of odds. The differences between these methods and transformation are defined in the following procedure.

Table 3: Profitability per 1,000 for Model and Top Cluster

Figure 4: Stacked Bar Plot of Profit by Transformation Grouped by Cluster

Figure 6: Bubble Plot of Cluster Size by Cluster Profit (Ordinal)

Figure 7: Bubble Plot of Cluster Size by Cluster Profit (Odds)

Figure 8: Bubble Plot of Cluster Size by Cluster Profit (Log of Odds)

Teal = Supervised Olive = Unsupervised Yellow = Supervised Red = Unsupervised Green = Supervised Purple = Unsupervised

Figure 3: Cluster Analysis for the Log of Odds (Unsupervised)

Table 6: Cluster Profitability for Log of Odds (Unsupervised)

Table 5: SIC Code Comparison of Clusters for Log of Odds (Unsupervised)

Figure 5: Heat Map of BNKINQ2 by Cluster for Log of Odds (Unsupervised)

Table 4: Two-Way Frequency Table of TRATE3 by Cluster for Log of Odds

(Unsupervised)

Transforma)on   Model   Top  Cluster  

Ordinal  (S)    $106,873.47      $183,952.57    

Ordinal  (U)    $108,180.97      $203,738.01    

Odds  (S)    $69,964.89      $173,323.94    

Odds  (U)    $65,514.19      $166,386.46    

Log  of  Odds  (S)    $72,017.65      $204,782.40    

Log  of  Odds  (U)    $78,690.72      $222,078.40    

S  =  Supervised          U  =  Unsupervised  

    TRATE3  Cluster   0   1   2   3   4   Total  

1   31856  4.23  7627  5.78  

6761  0.90  16.19  5.18  

2133  0.28  5.11  4.63  

744  0.10  1.78  4.09  

271  0.04  0.65  3.62  

41765  5.55  

2   173373  23.02  76.50  31.48  

34990  4.65  15.44  26.81  

11548  1.53  5.10  25.05  

4670  0.62  2.06  25.69  

2060  0.27  0.91  27.53  

226641  30.10  

3   65042  8.64  83.07  11.81  

9736  1.29  12.43  7.46  

2491  0.33  3.18  5.40  

768  0.10  0.98  4.23  

264  0.04  0.34  3.53  

78301  10.40  

4   59655  7.92  72.69  10.83  

14824  1.97  18.06  11.36  

5033  0.67  6.13  10.92  

1855  0.25  2.26  10.21  

696  0.09  0.85  9.30  

82063  10.90  

5   220809  29.32  68.1  40.09  

64207  8.53  19.80  49.19  

24896  3.31  7.68  54.00  

10140  1.35  3.13  55.78  

4194  0.56  1.29  56.01  

324243  43.06  

Total  550735  73.14  

130518  17.33  

46101  6.12  

18177  2.41  

7482  0.99  

753013  100.00  

Frequency        Percent        Row  Percent        Column  Percent  

Average  Independent  

Average  Dependent  Rank   Frequency  

1   114440   10   0.25848  2   129842   20   0.20406  3   254353   31   0.181142  4   123412   44   0.17609  5   129910   51   0.16705  6   114318   59   0.16102  7   133391   69   0.15858  8   125048   81   0.15309  9   130715   111   0.12426  

Average  Independent  

Average  Dependent  Rank   Frequency  

1   176173   13   0.24165  

2   167186   24   0.18893  

3   829725   58   0.16499  

4   82345   120   0.11585  

Figures 4, 5, 6, 7, and 8 were all generated using SAS® Visual Analytics