Data Analytics Project


IOE 373: Data Processing

Final Project Fall 2014

Professor Luis Garcia-Guzman

Authors:

Benjamin Bennet

Kevin Dulic

Maria Renee Simon


INTRODUCTION

The purpose of this analysis is to model the customer behaviors that affect the likelihood of mediation/arbitration lawsuits, to determine the most significant factors behind these cases, and finally to give a recommendation based on the findings of the data. The tasks required to complete this analysis were: create an Access database from the Excel data; create new input variables within the data; write SQL queries that summarize the data for analysis; cleanse the data, omitting any records that appear invalid (no valid Customer ID, or over 50 vehicle purchases); link the query to an Excel spreadsheet using VBA; generate pivot tables to summarize the frequency of Mediation/Arbitration cases; partition the data into a training set and a validation set with a 50-50 split; perform a logistic regression on the training set to identify significant factors; validate the results using the validation set; and interpret the results to make a recommendation for reducing the number of mediation/arbitration lawsuits. The methods used were Microsoft Access SQL to build and query the database, Excel VBA to organize the data into pivot tables and to partition it into training and validation sets, and Minitab to perform stepwise logistic regression and create the contingency tables used in the analysis.

METHODOLOGY

The purpose of this analysis is to model customer behaviors that affect the likelihood of mediation/arbitration lawsuits. To do so, the data went through the data mining procedure SEMMA: Sample, Explore, Modify, Model, and Assess.

First, the given data was prepared and cleansed so that it could later be loaded into an analytical program. Using Microsoft Office Access, all data was aggregated by customer ID, and new variables that appeared significant to the analysis were created. These new variables include: recency, longevity, number of vehicles purchased, number of passenger cars, number of purchases, number of leases, number of cases, maximum and average case duration, number of complaints, a "goodwill" indicator, and average dealer score. All variables were grouped by customer. The analysis focuses on individual customers and small businesses; therefore, it was decided to limit the analysis to customers that did not exceed 50 vehicle purchases. The data was then examined further to catch errors and invalid customer IDs.
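The Access SQL itself is not reproduced in this report. As a rough, hypothetical sketch of the same per-customer aggregation and cleansing steps, the pandas fragment below shows the idea; all file and column names are assumptions, since the actual schema is not shown.

```python
import pandas as pd

# Hypothetical file and column names; the actual Access schema is not shown.
sales = pd.read_excel("vehicle_sales.xlsx")

# Cleansing: omit records with no valid Customer ID.
sales = sales[sales["CustomerID"].notna()]

# Aggregate by customer, mirroring a few of the new variables listed above.
by_customer = sales.groupby("CustomerID").agg(
    Num_Vehicles=("VIN", "count"),
    Num_PassCar=("VehicleType", lambda v: (v == "Passenger").sum()),
    Num_Purchases=("TransType", lambda t: (t == "Purchase").sum()),
    Num_Leases=("TransType", lambda t: (t == "Lease").sum()),
    Avg_DealerScore=("DealerScore", "mean"),
)

# Cleansing: limit to customers with no more than 50 vehicle purchases.
by_customer = by_customer[by_customer["Num_Vehicles"] <= 50]
```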

After the exploration process was complete and error-prone data had been modified, the final data table was obtained. Next, the data was transferred to Microsoft Office Excel, where the information could be visually analyzed using pivot tables that clearly summarized the frequency of Mediation/Arbitration cases.
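A minimal sketch of the equivalent frequency summary, assuming a by_customer table like the one above; "Goodwill" and "Had_Case" are stand-in field names, since the report's actual pivot fields are not shown.

```python
# Frequency of Mediation/Arbitration cases, analogous to the Excel pivot
# table. "Goodwill" and "Had_Case" are stand-in column names.
case_freq = pd.crosstab(by_customer["Goodwill"], by_customer["Had_Case"])
print(case_freq)
```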

The last stage was the modeling of the data, for which the analytical tool Minitab was used. The data was partitioned in Excel into a 50-50 split, one set for training and one for validation, and then transferred to Minitab. Afterwards, a stepwise logistic regression model was run on the training set to determine which factors were significant to the analysis; insignificant variables were eliminated by the stepwise procedure. The logistic regression model obtained was then run on the validation set in order to observe the accuracy of the model, and contingency tables were created to summarize the data and draw conclusions about the analysis.
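The stepwise search was performed inside Minitab; as a hypothetical Python equivalent of the partition-and-fit step (statsmodels has no built-in stepwise search, and the "Lawsuit" column and predictor list are stand-ins for the report's variables):

```python
import numpy as np
import statsmodels.api as sm

# 50-50 random partition into training and validation sets.
rng = np.random.default_rng(seed=0)
mask = rng.random(len(by_customer)) < 0.5
train, valid = by_customer[mask], by_customer[~mask]

# Logistic regression of lawsuit occurrence on the candidate predictors.
predictors = ["Recency_Months", "Longevity_Months", "Num_New", "Num_Purchases"]
model = sm.Logit(train["Lawsuit"], sm.add_constant(train[predictors])).fit()
print(model.summary())
```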

RESULTS

Below is a summary of the results from performing the logistic regression in Minitab. The continuous variables represented in the table were determined to be significant, with the exception of Num_VLK_Veh and Avg_DealerScore. This can be seen in Table 1.


Table 1: P-Value Summary for Variables in Logistic Model

The regression produced the equation given in Equation 1. Y' represents the linear input for P(1), where P(1) is the probability of a lawsuit given the continuous input variables.

Equation 1: Logistic Regression Equation with P(1) = Probability of Lawsuit
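The equation follows the standard logistic form, with Y' a linear combination of the ten continuous inputs x_1 through x_10 and the fitted coefficients reported in Table 2:

```latex
P(1) = \frac{e^{Y'}}{1 + e^{Y'}}, \qquad
Y' = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_{10} x_{10}
```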

Table 2 gives the coefficients of the linear equation Y’ in the regression equation.

Table 2: Coefficients for Continuous Variables in Equation 1


Table 3 represents the odds ratios for the continuous predictor variables.

Table 3: Calculated Odds Ratio for Continuous Predictors

Table 4 shows the results of the contingency table for predicting a lawsuit within the training set.

Table 4: Contingency Table for Training Set

Table 5 shows the results of the contingency table for predicting a lawsuit within the validation set, using the model fitted to the training set to predict the number of lawsuits.

Table 5: Contingency Table for Validation Set


To visualize the data from the final Access data table, a pivot table was created in Excel using VBA. This table is given in Table 6.

Table 6: Pivot Table showing frequency of Arbitration / Mediation Cases

CONCLUSIONS

After performing the stepwise logistic regression analysis, the significant factors were determined to be Recency_Months, Longevity_Months, Percent_VLK, Percent_PassCar, Num_New, Num_Purchases, Num_Complaints, and Goodwill. This can be seen by studying the P-Value for each variable: variables with P-Values of 0.05 or greater can be excluded, and in this case Num_VLK_Veh and Avg_DealerScore are the only variables in the model that are not significant, with P-Values of 0.147 and 0.082, respectively. The P-Value statistic for each variable in the model is summarized in Table 1.

Additional statistical information from Minitab, such as the odds ratios, gives further insight into the data. The odds ratio is a measure of how much a one-unit change in an input affects the odds of event 1 (a lawsuit occurring). If the odds ratio is above 1, an increase of one unit in the input of interest increases the likelihood of the event occurring; similarly, if the odds ratio is smaller than 1, the likelihood of the event is decreased. Minitab calculates the odds ratio for each variable in the model, and they are summarized in Table 3. From this table, it can be determined that increases in Recency_Months, Longevity_Months, Num_VLK_Veh, and Avg_DealerScore have very little marginal effect on the probability of a lawsuit, because the odds ratios for those variables are close to 1. A unit increase in Num_New or Num_Purchases multiplies the odds of a lawsuit by 1.146 and 1.117, respectively, while a unit increase in Percent_VLK, Percent_PassCar, Num_Complaints, or Goodwill multiplies the odds by 0.68, 0.82, 0.76, and 0.87, respectively, decreasing them.
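In a logistic model, the odds ratio for a predictor equals exp(beta). A small illustration using the two positive odds ratios above; the coefficients here are back-calculated from the stated ratios, not taken from Table 2.

```python
import math

# Coefficients back-calculated from the stated odds ratios (exp(beta) = OR).
beta_num_new = math.log(1.146)        # Num_New
beta_num_purchases = math.log(1.117)  # Num_Purchases

# A one-unit increase multiplies the odds of a lawsuit by the odds ratio.
print(math.exp(beta_num_new))         # 1.146
print(math.exp(beta_num_purchases))   # 1.117
# Effects compound multiplicatively: three more new vehicles gives 1.146**3.
print(math.exp(3 * beta_num_new))     # ~1.505
```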

Finally, Minitab provides the regression equation for the training set, as seen in Equation 1. The probability of an event (lawsuit) is given by P(1), where Y' is the linear equation with input variables Recency_Months, Longevity_Months, Num_VLK_Veh, Percent_VLK, Percent_PassCar, Num_New, Num_Purchases, Num_Complaints, Goodwill, and Avg_DealerScore. This equation is very useful for predicting the probability of a lawsuit given these inputs. For most applications, the probability threshold for an event occurring is P = 0.5; therefore, when calculating a probability for given inputs, if P > 0.5 the event is judged likely to happen. In the case of this analysis, if P(1) > 0.5 then a lawsuit is likely to occur.
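A minimal sketch of that decision rule; the Y' value below is made up for illustration, since in practice it would come from applying Table 2's coefficients to a customer's inputs.

```python
import math

def lawsuit_probability(y_prime: float) -> float:
    """Logistic transform of the linear predictor Y' into P(1)."""
    return math.exp(y_prime) / (1.0 + math.exp(y_prime))

# Classify with the P = 0.5 threshold described above.
y_prime = 0.3  # illustrative value only
print(lawsuit_probability(y_prime))        # ~0.574
print(lawsuit_probability(y_prime) > 0.5)  # True -> lawsuit predicted likely
```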

Using the regression equation to calculate the probabilities for the validation set helps determine how closely the training set predicts the validation set. As seen in Tables 4 and 5, the contingency tables for the training set and the validation set are very similar, indicating that the regression techniques used were valid. A predictive model is considered good if the outcomes calculated from the training set approximately match the actual events in the validation set; this is a common data mining practice and should always be done to verify a model's effectiveness. However, the number of predicted lawsuits differs significantly from the number of actual lawsuits, indicating an inaccurate predictive model.
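Continuing the hypothetical regression sketch from the Methodology section, a contingency table like Tables 4 and 5 can be produced by cross-tabulating actual against predicted outcomes on the validation set:

```python
# Predicted vs. actual lawsuits on the validation set, analogous to Table 5.
# Reuses model, valid, and predictors from the earlier sketch.
valid_pred = model.predict(sm.add_constant(valid[predictors])) > 0.5
print(pd.crosstab(valid["Lawsuit"], valid_pred,
                  rownames=["Actual"], colnames=["Predicted"]))
```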

From this analysis, it is determined that the most significant factors positively correlated with a lawsuit are Num_New and Num_Purchases, while the most significant factors negatively correlated with a lawsuit are Percent_VLK, Percent_PassCar, Num_Complaints, and Goodwill. From a practical perspective, this means that the customer least likely to file a lawsuit is one with low Num_New and Num_Purchases values and high Percent_VLK, Percent_PassCar, Num_Complaints, and Goodwill values. It is recommended that the company focus on doing business with this type of customer profile to limit the number of lawsuits.

During our analysis, some unexpected results were produced by the Minitab logistic regression. Further inspection of the data could lead to a more accurate predictive model; the current model does not accurately predict the likelihood of a lawsuit, possibly due to errors in our aggregate functions. The regression techniques used in the analysis are still considered valid and should be repeated once the data is corrected.