Cost-Benefit Analysis Using Data-Driven Misclassification Costs


Description

Used rebalancing and misclassification costs on CART (Classification and Regression Tree) models. Misclassification costs were data-driven, using the mean contribution amount to a direct-mailing campaign for a non-profit. Software used was SPSS Modeler.



Stat 529: Current Issues in Data Mining
Assignment 1: Cost-Benefit Analysis Using Data-Driven Misclassification Costs
Kevin Bahr

Use the PVA training-training and PVA training-test data sets. (This is so that we all use the same training/test partition.) Table 1 briefly describes and compares the training and test data sets used in the analysis. The percentage of contributors differs by 0.21% and the average contribution amount differs by $0.58.

Table 1 - Summary of training and testing data sets

Dataset                                 Training      Test
# Records                                136,560    45,219
# of Contributors                          6,997     2,220
% Contributors                             5.12%     4.91%
Average Contribution of Contributors      $15.50    $16.08

In your classification models, don't use contribution amount as a predictor. Make sure #_Recent_Gifts and Amount_Last_Gift are ordinal, not nominal or continuous. #_Gifts_to_Card_Promos should be continuous, while Contributor (the target) should be a flag. Figure 1 shows contribution amount being excluded from the CART model; this is the same for all CART models. Figure 2 shows Amount_Last_Gift and #_Recent_Gifts both set as ordinal, #_Gifts_to_Card_Promos set as continuous, and Contributor set as a flag with its role set to target.

Figure 1 - CART Contribution_Amount exclusion

Figure 2 - Type node settings


In the CART node settings, do the following:
• Basics: uncheck "Prune tree to avoid overfitting."
• Advanced: set "Overfit prevention set" to 0%.

Figures 3 and 4 show the two CART node settings. On the left, "Prune tree to avoid overfitting" is unchecked; on the right, "Overfit prevention set" is 0.

Figure 4 - CART settings - Advanced

1. Perform EDA. Comment on each.

a. Provide normalized distributions of #_Recent_Gifts and Amount_Last_Gift with Contributor overlay.

b. Provide normalized and non-normalized histograms of #_Gifts_to_Card_Promos, with Contributor overlay.

Figure 6 – Distribution of Amt_Last_Gift

Figure 3 - CART settings - basics

Figure 5 - Distribution of #_Recent_Gifts


Figure 5 is the normalized distribution of #_Recent_Gifts. There is a positive relationship between number of recent gifts and Contributor: the higher the number of recent gifts, the more likely the recipient was a contributor to the campaign. Figure 6 is the normalized distribution of Amt_Last_Gift. The letters correspond to the following values:

• D = $5.00 - $9.99
• E = $10.00 - $14.99
• F = $15.00 - $24.99
• G = $25.00 and above

There is an inverse relationship between amount of last gift and Contributor: the recipient of the mailing is more likely to be a contributor if the amount of their last gift is lower. There were no gifts below $5.00. It is not known why, but perhaps adding an option for gifts below $5.00 could increase the number of contributors, given the inverse relationship.

Figure 8 - #_Gifts_to_Card_Promotion Normalized

Figures 7 and 8 show the histogram and the normalized histogram of the #_Gifts_to_Card_Promotion variable. This variable represents the number of times the contributor has responded positively to similar promotions in the past. The count decreases overall as the variable increases in value, but the proportion of Contributors rises as the variable increases.
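The proportions behind such a normalized histogram can be computed directly outside SPSS Modeler. Below is an illustrative Python sketch with made-up toy values (not the PVA data); `normalized_overlay` is a hypothetical helper, not part of any library.

```python
from collections import Counter

def normalized_overlay(values, flags):
    """For each distinct value, return the proportion of records with
    flag == 1 -- the Contributor overlay of a normalized histogram."""
    totals, positives = Counter(), Counter()
    for v, f in zip(values, flags):
        totals[v] += 1
        positives[v] += f
    return {v: positives[v] / totals[v] for v in totals}

# Toy data mimicking #_Gifts_to_Card_Promos: overall counts fall as the
# value rises, but the Contributor proportion rises with it.
vals = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3]
flag = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
props = normalized_overlay(vals, flag)  # proportion of Contributors per value
```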

2. Calculate FP cost and FN cost. For FN cost, use only positive records.

From the assignment instructions we learn that it costs 68 cents to contact a donor. We take the false positive (FP) cost to be this $0.68, since that is the cost of contacting a potential donor. Additionally, we subtract $0.68 from the contribution amount when calculating profit.

We average the contribution amounts of the 6,997 contributors in the training set to calculate the false negative (FN) cost. The average of their contributions is $15.50.
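In code, this derivation amounts to one mean over the positive records. A minimal Python sketch with toy amounts (on the real PVA training data, the mean works out to $15.50); `misclassification_costs` is my own naming:

```python
MAIL_COST = 0.68  # cost to contact one potential donor (FP cost)

def misclassification_costs(contributions):
    """Return (fp_cost, fn_cost). `contributions` holds the contribution
    amounts of the positive (Contributor = 1) records only; the FN cost
    is their mean gift."""
    fn_cost = sum(contributions) / len(contributions)
    return MAIL_COST, fn_cost

# Toy example, not the real data:
fp, fn = misclassification_costs([10.0, 15.0, 21.5])
```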

Figure 7 - #_Gifts_to_Card_Promotion histogram


We thus have the FP cost of $0.68 and FN cost of $15.50.

We can then construct matrices of direct and adjusted costs for reference. Tables 2 through 5 display these matrices.

Table 2 - Matrix of direct costs

                      Predicted Category
Actual Category          0                  1
       0           Cost_TN = $0       Cost_FP = $0.68
       1           Cost_FN = $0       Cost_TP = -$15.50

Table 2 shows a false positive cost of $0.68 and a true positive cost of -$15.50.

Table 3 - Adjusted cost matrix

                      Predicted Category
Actual Category          0                  1
       0                 0            Cost_FP,adj = $0.68
       1        Cost_FN,adj = $15.50        0

Table 3 adjusts the true positive cost to 0 and sets the false negative cost to $15.50. This table is used to enter the costs in the SPSS Modeler CART node.

Table 4 - Adjusted cost matrix, FP = $1

                      Predicted Category
Actual Category          0                  1
       0                 0            Cost_FP,adj = $1
       1        Cost_FN,adj = $22.80        0

Table 4 adjusts the false positive cost to $1. The false negative cost is correspondingly adjusted to $22.80.

Table 5 - Adjusted cost matrix, FN = $1

                      Predicted Category
Actual Category          0                  1
       0                 0            Cost_FP,adj = $0.05
       1        Cost_FN,adj = $1            0

Table 5 adjusts the false negative cost to $1. The false positive cost is correspondingly adjusted to $0.05.
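Tables 4 and 5 are the same adjusted matrix rescaled so that one cost is pinned at $1; only the ratio between the two costs matters to the classifier. A minimal sketch (the `pin` argument is my own naming):

```python
def scale_costs(fp_adj, fn_adj, pin="FP"):
    """Rescale an adjusted (FP, FN) cost pair so the pinned cost is $1.
    The classifier only sees the ratio fn_adj / fp_adj, so this changes
    presentation, not behavior."""
    divisor = fp_adj if pin == "FP" else fn_adj
    return fp_adj / divisor, fn_adj / divisor

fp1, fn1 = scale_costs(0.68, 15.50, pin="FP")  # Table 4: FN becomes 15.50/0.68
fp2, fn2 = scale_costs(0.68, 15.50, pin="FN")  # Table 5: FP becomes 0.68/15.50
```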


3. Apply CART to the Train-train set with the following settings:
   a. No balancing, no misclassification costs
   b. 10-1 balancing, no misclassification costs (Balance node: Factor: 10.0, Condition: Contributor = 1)
   c. No balancing, misclassification costs
   d. 10-1 balancing, misclassification costs
   e. Some other combination of balancing and costs, of your choice (provide rationale)

- The resampling ratio is Cost_FN,adj / Cost_FP,adj.

Figure 9 shows the Balance and CART nodes for Models A through E. Each Balance node has the settings Factor: 10.0 and Condition: Contributor = 1. In each CART node, "Prune tree to avoid overfitting" was unchecked and "Overfit prevention set" was set to 0. Rebalancing (Balance node: Factor: 22.8, Condition: Contributor = 1) was used on its own for Model E. The rationale is that Cost_FN,adj > Cost_FP,adj, so the number of records with positive responses is multiplied by b, where b is the resampling ratio (Cost_FN,adj / Cost_FP,adj). The value of b is 15.50 / 0.68 = 22.8. No misclassification costs were used together with this Balance node.
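The Balance node's effect can be sketched by replicating the positive rows directly; `rebalance` is a hypothetical stand-in, simplified to an integer factor (Modeler's Balance node also accepts fractional factors such as 22.8):

```python
def rebalance(rows, factor):
    """Replicate positive rows `factor` times -- the analogue of a
    Balance node with Condition: Contributor = 1. Each row is a
    (flag, record) pair."""
    out = []
    for flag, row in rows:
        out.extend([(flag, row)] * (factor if flag == 1 else 1))
    return out

# One positive row among three, boosted 10-to-1 as in Models B and D:
data = [(1, "rec_a"), (0, "rec_b"), (0, "rec_c")]
boosted = rebalance(data, 10)
```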

4. Evaluate Models a-e on the training-test data set. Provide evaluation measures for each of Models a-e, similar to Table 16.15 in the chapter. For comparison, include the "Send-to-everyone" baseline model in this table. Use three tables, each similar to Table 16.15, containing (i) Models a and b, (ii) Models c and d, and (iii) Model e and the baseline model.

The overall model cost was calculated in two similar ways. The average version multiplies the number of positive responses by the mean contribution amount ($15.50) minus the cost of mail ($0.68). The actual version sums, over the positive responses, each row's contribution amount minus the cost of mail ($0.68).
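A minimal sketch of the two calculations, reporting each as a (negative) cost so that gains show up as negative numbers; the constants are the $15.50 mean gift and $0.68 mailing cost derived above, and the function names are my own:

```python
MAIL_COST = 0.68
MEAN_GIFT = 15.50  # mean contribution among contributors

def overall_cost_average(n_positive):
    """Average version: each positive response is credited with the
    mean gift minus the mailing cost."""
    return -n_positive * (MEAN_GIFT - MAIL_COST)

def overall_cost_actual(contributions):
    """Actual version: each positive response's real contribution
    amount, minus the mailing cost."""
    return -sum(c - MAIL_COST for c in contributions)

# Toy example with two positive responses:
avg = overall_cost_average(2)            # ~ -29.64
act = overall_cost_actual([10.0, 20.0])  # ~ -28.64
```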

Figure 9 - CART Models


For positive responses, the model overall and actual costs are:

    Overall model cost (average) = (Number of positive responses) × ($15.50 − $0.68)

    Overall model cost (actual) = Σ from i = 1 to n of (ContributionAmount_i − $0.68)

where n is the number of positive responses, $15.50 is the mean contribution amount (the FN cost), and $0.68 is the cost of mailing (the FP cost).

Table 6 - Contingency table, Model A

                  Predicted Category
Actual Category        0         1
       0          42,999         0
       1           2,220         0

Model A had no balancing and no misclassification costs. Consequently, without guidance from costs or from balancing toward positive responses, the model predicts no Contributors. The error rate is low, but revenue is low as well.

Table 7 - Contingency table, Model B

                  Predicted Category
Actual Category        0         1
       0          40,668     2,331
       1           1,960       260

Model B balanced the data with 10 times the number of contributors. The model was able to predict 260 contributors correctly.
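The evaluation measures reported in Table 8 can be recomputed from a contingency table; a sketch using Model B's counts from Table 7 (the function and dictionary keys are my own naming):

```python
def evaluation_measures(tn, fp, fn, tp):
    """Standard measures from a 2x2 contingency table
    (rows = actual class, columns = predicted class)."""
    total = tn + fp + fn + tp
    return {
        "accuracy":    (tn + tp) / total,
        "error_rate":  (fp + fn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Model B's counts from Table 7:
m = evaluation_measures(tn=40668, fp=2331, fn=1960, tp=260)
# round(m["accuracy"], 3) == 0.905; round(m["sensitivity"], 3) == 0.117
```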


Table 8 - Evaluation measures, Models A and B

Evaluation Measure                         Model A       Model B
Accuracy                                     0.951         0.905
Overall Error Rate                           0.049         0.095
Sensitivity                                  0             0.117
False Negative Rate                          1             0.883
Specificity                                  1             0.946
False Positive Rate                          0             0.054
Proportion of True Positives                 0             0.100
Proportion of False Positives                0             0.899
Proportion of True Negatives                 0.951         0.954
Proportion of False Negatives                0.049         0.054
Overall Model Cost Average                  $0            -$2,268.12
Overall Model Cost Actual                   $0             $255.89
Revenue per Contributor (avg | actual)      $0             $1.02 | -$0.12

Model A has a lower error rate than Model B, while Model B realizes an overall average model cost of -$2,268.12. The accuracy of Model A is 0.951, but its model cost (and gain) is $0.

Table 9 - Contingency table, Model C

                  Predicted Category
Actual Category        0         1
       0          18,589    24,410
       1             654     1,566

Model C has no balancing but includes misclassification costs. The model has a richer mix of true positives than the previous two models.

Table 10 - Contingency table, Model D

                  Predicted Category
Actual Category        0         1
       0               0    42,999
       1               0     2,220

Model D is balanced with 10 times the number of contributors, and misclassification costs are applied as well. The balancing and misclassification costs cause the model to predict every row as a contributor. This model is the opposite of Model A, where no contributors were predicted.


Table 11 - Evaluation measures, Models C and D

Evaluation Measure                         Model C         Model D
Accuracy                                     0.446           0.049
Overall Error Rate                           0.554           0.950
Sensitivity                                  0.705           1
False Negative Rate                          0.295           0
Specificity                                  0.432           0
False Positive Rate                          0.568           1
Proportion of True Positives                 0.060           0.049
Proportion of False Positives                0.939           0.950
Proportion of True Negatives                 0.966           0
Proportion of False Negatives                0.033           0
Overall Model Cost Average                  -$6,609.32      -$3,661.10
Overall Model Cost Actual                   -$3,648.32      -$4,968.10
Revenue per Contributor (avg | actual)      $2.98 | $1.64   $1.65 | $2.24

Table 11 compares Model C and Model D. Model C outperforms on average overall model cost, while Model D outperforms on actual overall model cost. Adding misclassification costs has helped align the CART model with the business costs and gains.

Table 12 - Contingency table, Model E

                  Predicted Category
Actual Category        0         1
       0          16,853    26,146
       1             542     1,678

Model E is balanced with 22.8 times the number of contributors. There is a competitive mix of true positives compared with Model C.

Table 13 - Contingency table, baseline model

                  Predicted Category
Actual Category        0         1
       0               0    42,999
       1               0     2,220

Table 13 shows the baseline model, which has the same outcome as model D.
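The revenue-per-contributor figures in Table 11 appear to be the overall model cost with its sign flipped, spread over the 2,220 actual contributors in the test set. A sketch under that assumption (the function name is mine, and the interpretation is inferred, not stated in the assignment):

```python
N_CONTRIBUTORS = 2220  # actual contributors in the test set

def revenue_per_contributor(overall_cost, n=N_CONTRIBUTORS):
    """Flip the sign of the (negative) overall model cost and divide
    by the number of actual contributors. Assumption: this is how the
    revenue-per-contributor rows were derived."""
    return -overall_cost / n

rc = revenue_per_contributor(-6609.32)  # Model C average; rounds to 2.98
```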


Table 14 - Evaluation measures, Models E and baseline

Evaluation Measure                         Model E         Baseline Model
Accuracy                                     0.409           0.049
Overall Error Rate                           0.590           0.950
Sensitivity                                  0.755           1
False Negative Rate                          0.244           0
Specificity                                  0.391           0
False Positive Rate                          0.609           1
Proportion of True Positives                 0.060           0.049
Proportion of False Positives                0.939           0.950
Proportion of True Negatives                 0.968           0
Proportion of False Negatives                0.031           0
Overall Model Cost Average                  -$7,088.68      -$3,661.10
Overall Model Cost Actual                   -$3,050.68      -$4,968.10
Revenue per Contributor (avg | actual)      $3.19 | $1.37   $1.65 | $2.24

Model E outperformed the baseline model in average overall model cost and has the best average overall model cost of all the models. Its overall average model cost is -$7,088.68 and its revenue per contributor is $3.19.

5. Thoroughly discuss your results from 4. Especially (though not exclusively) discuss how using misclassification costs improved the bottom line.

Table 15 compares each model's best overall cost and accuracy.

Table 15 - Model costs and accuracy

Model        Best Overall Cost     Accuracy
A                 $0                 0.951
B             -$2,268.12             0.905
C             -$6,609.32             0.446
D             -$4,968.10             0.049
E             -$7,088.68             0.409
Baseline      -$4,968.10             0.049

Model A is the most accurate model but performed the worst on cost. Model D and the baseline model were the least accurate, yet still outperformed Model A by $4,968.10. Model E performed the best, earning $7,088.68 in contributions.


Model C is also worth noting. Model C uses only misclassification costs and still performs very well, underperforming Model E by just $479.36. To summarize, using only misclassification costs increases revenue by $6,609.32 (Model C), while using only balancing increases revenue by $7,088.68 (Model E).

6. Explain your best CART model, the most important predictors, splits, and so on. Draw parallels between model results and EDA.

The best CART model is Model E, which was rebalanced (Balance node: Factor: 22.8, Condition: Contributor = 1), resulting in an overall cost of -$7,088.68. The most important predictors for Model E are shown in Figure 6.

Figure 6 - Most important predictors

Amt_Last_Gift, #_Recent_Gifts, and #_Gifts_to_Card_Promos are the three most important predictors of Model E.


Figure 7 - Splits for Model E

Figure 7 displays the most important splits. Amount of last gift is the first split: the values D and E contain a higher proportion of Contributors than the F and G split. This relates directly to Figure 6, the histogram of amount of last gift, which showed an inverse relationship, with the proportion of Contributors falling as the amount of last gift increases. Two other important splits are #_Recent_Gifts and #_Gifts_to_Card_Promos. Figures 5, 7, and 8 all show positive relationships between these variables and Contributor, which correlates with the splits shown in Figure 7, where the ratio of Contributors to non-contributors is higher at higher values.