Data Analytics Final

Predicting if a car purchased at auction is a lemon

Barbara Hsieh | Julian Phillips | Kit Kwan | Pravish Kappakadavath


Executive Summary

Automobile dealerships purchase thousands of cars at auction every year to replenish inventory, and attached to each of these purchases is the risk of procuring a “kick”, a car which has been tampered with and cannot be resold. The purchase of a kick results in a loss to the dealer, which incurs sunk costs from the initial vehicle purchase and the subsequent cost of transporting the vehicle to its lot.

The objective of this project was to build a model to help dealers avoid purchasing lemons at auction. To do this, we first needed to identify the attributes of lemon cars that should be avoided. Our data set contains a list of variables associated with cars that ended up being lemons in the past and with those that were good buys. Using these variables, we built logistic regression and decision tree models to find the variables that best separate bad buys from good buys.

Our final dataset consisted of 51,285 records and 31 variables and was split in half to create a training set (25,642 records) and testing set (25,643 records). Preliminary findings, based on business assumptions and analysis of the dataset using scatter plots and graphs, revealed six variables that have the highest impact on whether a vehicle purchased at auction would be a good or bad buy. The six variables are as follows:

1. Vehicle Age
2. VehOdo
3. MMRAcquisitionRetailCleanPrice
4. MMRCurrentRetailAveragePrice
5. VehBCost
6. Whether Wheel info is null or not

Looking at this initial set of variables, our hypothesis was that there is a relationship between variables such as the vehicle’s age and auction price and whether the vehicle is a bad buy or not. After multiple iterations of decision trees and logistic regressions using both existing variables and over 10 enriched variables, our best and 2nd best models were both logistic regressions and consisted of the following variables:

Vehicle Age
VehOdo
MMRCurrentRetailCleanPrice
VehBCost (Purchase price of the vehicle)
Warranty Cost
Purchase Qtr (1,3)
Southern States
GM
Diff Retail Average

A summary of the insight from our best model is that an older car with a high odometer reading, low warranty cost, and low initial purchase price, but a very high current retail price for a car in perfect condition, is very likely to be a lemon. Additionally, some surprising insights emerged from our enriched variables: if the car is not a GM car and was purchased during the 1st or 3rd quarter in the southern states, there was an increased chance of it being a lemon. From these insights, we can now tell dealers how to avoid lemon cars with an accuracy of about 81%.



Table of Contents

Introduction
Problem
Methodology
Data - Extract, Transform, Load
Data - Enrichment
Analysis
Performance Measure
Business Insights
Improvements
Conclusion
Bibliography
Appendix



Introduction

One of the most significant sources of inventory for automobile dealers is auctions. Dealers rely on auctions to replenish their inventory and often attend several auctions a month, crossing state lines to take advantage of better market conditions for specific makes and models. For example, a dealer in Florida may want to purchase convertibles in Wisconsin, where the demand for convertibles, and consequently the price, is lower. The dealer then sells these vehicles in-state to generate higher profit margins. Each year 9 million cars are sold at auction, and dealers represent a large percentage of the buyers at these events. A major challenge that dealerships face when purchasing used cars at auction is determining vehicle quality and, most importantly, avoiding the purchase of “kicks”: automobiles procured at auction which turn out to have serious issues and cannot be sold to customers. Vehicles are often kicked as a result of tampered odometers, mechanical issues, or title transfer issues. The combination of low resale value and high transaction cost results in an enormous loss to the dealer, which incurs sunk costs from vehicle transport and throw-away repair work and also suffers market losses due to junk inventory.

All that said, there is no question that being able to reduce the risk of purchasing kicks can provide a great deal of value to dealers. By avoiding the rising cost of lemons, dealers are not only able to reduce costs, but can also provide their customers with the best selection of inventory.

Problem

Hypothesis

Primary

H0: There is no relationship between the dependent variable (is bad buy) and any of its predictor variables.

Ha: There is a relationship between the dependent variable (is bad buy) and at least one of its predictor variables.

If the p-value of the overall model test of the logistic regression is lower than .05, the model is deemed statistically significant, and we reject the null hypothesis and conclude that there is a relationship between the Y-variable and at least one of its predictors.

Secondary

H0: There is no significant relationship between the variables that reflect the physical condition of a car at auction and the dependent variable (is bad buy).

Ha: There is a significant relationship between the variables that reflect the physical condition of a car at auction and the dependent variable (is bad buy).

If the p-values of the independent variables vehicle age, vehicle odometer reading and acquisition cost at time of purchase are less than .05, these variables are deemed statistically significant and help account for the variation in the dependent variable (is bad buy).

Methodology

Statistical Analysis: Using the sample of historical sales data available on good and bad cars, the null hypotheses can be rejected or retained. For this analysis, a significance level of .05 will be used as the threshold. In the testing, if any variable is identified to have a p-value of less than .05, the primary null hypothesis will be rejected, and we will conclude that the dependent variable, is bad buy, has a relationship with at least one of its predictor variables.

Data Mining Models: For this analysis we built decision tree and logistic regression models to test the hypotheses. The decision tree was chosen because it is easily explained in plain English and provides an interface in which the significant variables can be identified. With these variables we then build a logistic regression model, which we expect to provide better results because it is built on a best fit approach.

Data Sampling: Sampling the data to create training and validation data will be done by splitting the dataset into two equal halves. To randomize the data we will add a random variable and sort the data based on the random variable. Training data will be used to create the model and validation data will be used to validate the model.
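The sampling step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the report's split was done in JMP, the function name split_in_half and the fixed seed are hypothetical, and the integer records merely stand in for the 51,285 cleaned rows.

```python
import random


def split_in_half(records, seed=0):
    """Randomize the order with a random sort key, then cut the list in half."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the split is repeatable
    # Attach a random number to every record: the added "random variable".
    keyed = [(rng.random(), record) for record in records]
    # Sorting on the random number puts the records in random order.
    keyed.sort(key=lambda pair: pair[0])
    shuffled = [record for _, record in keyed]
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Stand-ins for the 51,285 rows of the final dataset.
records = list(range(51285))
training, validation = split_in_half(records)
print(len(training), len(validation))  # 25642 25643
```

Because 51,285 is odd, the two halves differ by one record, which matches the 25,642 and 25,643 records reported for the training and testing sets.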

Missing Data: Wherever we identify missing data for an attribute, we will fill in the missing values, delete the affected rows, or remove the column altogether, depending on the type of data and the proportion that is missing. For example, for a numeric variable such as Acquisition Average Price we will fill in the average value. For an attribute such as Trim, which is qualitative categorical data for which a numerical average cannot be used, we may either drop the column or remove the rows where data is missing, depending on the significance of the variable.
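The three ways of handling missing data named above (fill with the average, delete rows, remove the column) can be illustrated as follows. This is a minimal sketch, not the JMP procedure used in the project: the helper names are hypothetical, the three sample rows are made up, and missing values are assumed to be represented as None.

```python
def fill_with_mean(rows, column):
    """Replace missing (None) values in a numeric column with the column mean."""
    present = [row[column] for row in rows if row[column] is not None]
    mean = sum(present) / len(present)
    return [
        dict(row, **{column: mean if row[column] is None else row[column]})
        for row in rows
    ]


def drop_rows_missing(rows, column):
    """Keep only the rows that have a value in the given column."""
    return [row for row in rows if row[column] is not None]


def drop_column(rows, column):
    """Remove a column from every row."""
    return [{k: v for k, v in row.items() if k != column} for row in rows]


# Made-up sample rows; None marks a missing value.
rows = [
    {"MMRAcquisitionAuctionAveragePrice": 6000.0, "Transmission": "AUTO", "Trim": "LX"},
    {"MMRAcquisitionAuctionAveragePrice": None, "Transmission": "MANUAL", "Trim": None},
    {"MMRAcquisitionAuctionAveragePrice": 8000.0, "Transmission": None, "Trim": "SE"},
]

rows = fill_with_mean(rows, "MMRAcquisitionAuctionAveragePrice")  # numeric: use average
rows = drop_rows_missing(rows, "Transmission")                    # few missing: drop rows
rows = drop_column(rows, "Trim")                                  # low value: drop column
print(rows)
```

Which of the three treatments a column receives is a judgment call based on its data type, the share of missing values, and its expected business value.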

Data - Extract, Transform, Load

The dataset that we extracted from Kaggle originally consisted of 72,983 tuples and 37 variables, but after we reviewed the dataset, both numbers decreased. Considering the size and variety of the data, we chose to focus only on records related to cars, defined as non-trucks and non-sports utility vehicles (SUVs). This cut the dataset down to 51,291 tuples. Further removal of rows and variables with missing data resulted in a final data set imported into JMP that contained 51,285 tuples and 31 variables. We split the dataset in half and used 25,642 tuples as our training set and 25,643 tuples as our testing set.

Below is a list of variables that were removed or ignored. Where a variable had missing data, we provide the count of missing values and the method we used to deal with them. Variables for which we do not specify a method were qualitative variables that, based on our business assumptions, did not provide value to our model. We retained these variables in the import to JMP. Although they were ignored in our models, we deemed them potential variables that may warrant further analysis in the future. If they are to be used in future analysis, the dataset would need to be pared down further to accommodate them.

Trim: 1,591
o Does not have high business value or statistical value (Removed column)
Transmission: 6 (Removed rows)
Wheel Type ID: 2,235
Wheel Type: 2,235
MMRAcquisitionAuctionAveragePrice: 437 (Used Average)
MMRAcquisitionAuctionCleanPrice: 437 (Used Average)
MMRAcquisitionRetailAveragePrice: 437 (Used Average)
MMRAcquisitonRetailCleanPrice: 437 (Used Average)
MMRCurrentAuctionAveragePrice: 418 (Used Average)
MMRCurrentAuctionCleanPrice: 418 (Used Average)
MMRCurrentRetailAveragePrice: 418 (Used Average)
MMRCurrentRetailCleanPrice: 418 (Used Average)
Prime Unit: 48,741
o Too many missing data to be useful (Removed column)
AUCGUART: 48,741
o Too many missing data to be useful (Removed column)

Data - Enrichment

An initial logistic regression model revealed the following variables to be significantly important:

Vehicle Age
VehOdo
MMRAcquisitionRetailCleanPrice
MMRCurrentRetailAveragePrice
VehBCost
Whether wheel info is null or not

The scatterplots in the Appendix show blue dots as lemons and red dots as good cars. There is no apparent relationship between the variables MMRAcquisitonRetailCleanPrice and MMRCurrentRetailAveragePrice and whether a car will be a lemon or not. There does, however, seem to be a connection between lemons and the variables VehBCost, Vehicle Age, VehOdo, and whether the wheel column is null or not. Even though the wheel column’s missing data appears to be an important factor in determining whether a car is a lemon or not, we do not know the reason for the missing data. Therefore, in the analysis below we chose not to move forward with using the missing wheel data as a predictor.

Even though the Acquisition and Retail Average Price variables did not show a relationship in the scatterplots, the initial logistic regression indicates that they are important variables. Therefore we decided to enrich those variables by calculating the difference between the current and acquisition prices for the retail and auction price variables. Other variables, such as the quarter in which the car was purchased and the area in which it was purchased, were added based on the business insight that the selection of cars available can differ between dealers and sales periods. Average miles per year was added on the assumption that it would give more insight into the usage of the car than the total odometer reading.

In addition to creating variables from business insights, we also tried to convert some of the qualitative variables, such as model, make, and nationality, into quantitative variables. Those columns were not only qualitative but also contained so many distinct values that they would add noise to the model rather than help us make sense of the data, so we created categories within each. For example, we grouped the Nationality variable into American versus Non-American, represented by the values 1 and 0, respectively. The model and make variables were too complex to decipher, but we were able to split the information and group by whether the car was a 2 Door or Not, represented by the values 1 and 0, respectively. Based on the scatterplot, there does seem to be a minor pattern between lemon cars and which of the top 3 American brands they fall under. Based on this we ran a test and created a variable called GM or Not (1,0). After converting most of the potentially useful qualitative variables, we began our analysis.
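Several of the enriched variables described in this section can be derived with simple row-level calculations. The sketch below is illustrative only and rests on assumptions the report does not state: the output names (DiffRetailAverage, PurchaseQtr13, MilesPerYear, American, GM, SouthernState) are hypothetical, the sign of the price difference is assumed to be current minus acquisition, the category values "AMERICAN" and "GM" are assumed spellings, a vehicle age of 0 is treated as 1 year, and the SOUTHERN_STATES set is a made-up placeholder because the report does not list which states were counted.

```python
from datetime import date

# Assumption: placeholder list; the report does not say which states were used.
SOUTHERN_STATES = {"TX", "FL", "GA", "AL", "MS", "LA", "SC", "NC", "TN"}


def enrich(row):
    """Return a copy of the row with illustrative enriched variables added."""
    quarter = (row["PurchDate"].month - 1) // 3 + 1  # calendar quarter 1..4
    age_years = max(row["VehicleAge"], 1)            # assumption: avoid dividing by 0
    return {
        **row,
        # Assumed sign convention: current price minus acquisition price.
        "DiffRetailAverage": row["MMRCurrentRetailAveragePrice"]
        - row["MMRAcquisitionRetailAveragePrice"],
        "PurchaseQtr13": 1 if quarter in (1, 3) else 0,
        "MilesPerYear": row["VehOdo"] / age_years,
        "American": 1 if row["Nationality"] == "AMERICAN" else 0,
        "GM": 1 if row["TopThreeAmericanName"] == "GM" else 0,
        "SouthernState": 1 if row["VNST"] in SOUTHERN_STATES else 0,
    }


# Made-up example record using the dataset's field names.
example = enrich({
    "PurchDate": date(2010, 8, 16),
    "VehicleAge": 4,
    "VehOdo": 60000,
    "MMRAcquisitionRetailAveragePrice": 9000.0,
    "MMRCurrentRetailAveragePrice": 8500.0,
    "Nationality": "AMERICAN",
    "TopThreeAmericanName": "GM",
    "VNST": "TX",
})
print(example["DiffRetailAverage"], example["PurchaseQtr13"], example["MilesPerYear"])
```

Encoding each grouping as a 1/0 indicator keeps the high-variety qualitative columns out of the model while preserving the pattern of interest.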

Analysis

Analysis Process: Tools Used and Further Analysis

The logistic regression printouts and confusion matrices for our best and second best models are shown in the Appendix. We started the analysis with the initial logistic regression model, which included the indicator for missing wheel info. The training profit was $283,557 and the testing profit was $282,531, but as explained above, the missing wheel info is an unexplainable predictor. As shown in the Appendix, the profits from the best and 2nd best models are more than $10,000 lower than those of the initial model, but the predictors used are variables that can be counted on to be provided consistently whenever a dealer purchases a car. We also used a decision tree model, but its profit and accuracy rate were lower than those of the two logistic regression models shown in the Appendix.

From the best and 2nd best model we found the most significant variables from both existing and enriched variables are:

Vehicle Age
VehOdo
MMRCurrentRetailCleanPrice
VehBCost
Warranty Cost
Purchase Qtr (1,3)
Southern States
GM
Diff Retail Average

The difference between the best and 2nd best models was the removal of Diff Retail Average (the difference between the acquisition and current Retail Average Price) and the addition of Warranty Cost. The lift was minimal, but both accuracy and profit increased. Although the best model performed better based on our metrics, the variables Warranty Cost, Southern States and MMRCurrentRetailCleanPrice became insignificant according to the logistic regression printout.

From the variables shown above, it can be seen that even though only 4 of the 10 enriched variables added were significant, they represented half of the variables used in the 2nd best model and slightly less than half in the best model. Our business insight that sales period and purchase location are possible factors was supported. Variables enriched based on information collected from the scatterplots, such as Brand (i.e. the GM variable) and differences in current and acquisition price, also played an important role. The interesting feedback from the model is that the overall odometer reading proves to be more important than average miles per year, which differed from our initial assumptions.

Knowing these variables as key predictors the next steps will be to improve, collect, or add outside data to improve our models. Steps we plan to take will be detailed in the improvement section.

Performance Measure

To determine profit, the upper limit of cars purchased from a pool of 43,481 was set to 100. With this upper limit established, our best model revealed a 90.906% propensity of a car being good and a 9.094% propensity of a car being bad, and generated a profit of $270,991 for the training set. The accuracy rate was 80.82% for the training set, with a true positive rate of 30.51% and a false positive rate of 12.87%. The true negative rate was 87.13%.

The testing set generated a slightly smaller profit, $262,449. The propensity of a car being good in the testing set was 89.905%, and the propensity of a car being bad was higher, at 10.095%. The total accuracy rate fell to 80.31%, with the true positive rate falling to 30.10%. That said, the false positive rate improved slightly, dropping to 12.65%, and the true negative rate rose to 87.35%.
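The rates quoted for the best model can be recomputed from the confusion matrix counts listed in the Appendix. The sketch below assumes that the rows of those matrices are actual outcomes, the columns are predictions, and a bad car is the "positive" class; the function and argument names are hypothetical.

```python
def rates(good_as_good, good_as_bad, bad_as_good, bad_as_bad):
    """Compute accuracy and error rates, treating "bad car" as the positive class."""
    total = good_as_good + good_as_bad + bad_as_good + bad_as_bad
    actual_good = good_as_good + good_as_bad
    actual_bad = bad_as_good + bad_as_bad
    return {
        "accuracy": (good_as_good + bad_as_bad) / total,
        "true_positive_rate": bad_as_bad / actual_bad,      # sensitivity
        "false_positive_rate": good_as_bad / actual_good,
        "true_negative_rate": good_as_good / actual_good,   # specificity
    }


# Best model counts from the Appendix (actual good/bad by predicted good/bad).
training = rates(19763, 2919, 1977, 868)
testing = rates(19549, 2830, 2195, 945)

for name, result in (("training", training), ("testing", testing)):
    print(name, {metric: round(100 * value, 2) for metric, value in result.items()})
```

Under these assumptions the training counts give 80.82% accuracy, a 30.51% true positive rate and a 12.87% false positive rate, and the testing counts give 80.31%, 30.10% and 12.65%, in agreement with the reported figures.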

Furthermore, our overall model test is statistically significant as evidenced by the p value of <.0001. All the variables called out in previous sections are also listed below, along with their statistically significant p values.

Vehicle Age: <0.0001
VehOdo: <0.0001
VehBCost: <0.0001
Purchased Qtr 1,3 [0]: 0.0086
GM (Yes, No)[0]: <0.0001



Please refer to the confusion matrices and model test charts provided in the Appendix for a detailed breakdown of the results information.

Business Insights

Logistic Regression is an extremely practical business tool because it determines the probability of a binary outcome and reveals the most significant independent variables in a data set that account for the variation in the dependent variable. As an auto dealer purchasing cars at auction, one can extrapolate the probability of a car being a lemon through the logistic regression equation. The values of the various parameters utilized in our regression model (vehicle age, vehicle odometer reading, etc.) can be ascertained from the particular car being auctioned, and by inserting these values into the logistic regression equation one can calculate the log of odds of a car being a lemon. The log of odds can be directly converted into odds, which can consequently be converted into a probability. Therefore, by gauging the likelihood that a car will be a lemon, auto dealers can make informed decisions when purchasing cars at auction and reduce the chance of buying a lemon, thereby mitigating loss and increasing profitability.
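The conversion from the logistic regression equation to a probability can be sketched as follows. The intercept and coefficients here are hypothetical placeholders chosen only to show the arithmetic; they are not the fitted values from our JMP output, and the function name is likewise an assumption.

```python
import math


def lemon_probability(intercept, coefficients, car):
    """Return (log odds, odds, probability) of a car being a lemon."""
    # Linear predictor: intercept plus coefficient * value for each variable.
    log_odds = intercept + sum(
        coefficients[name] * car[name] for name in coefficients
    )
    odds = math.exp(log_odds)         # log of odds -> odds
    probability = odds / (1 + odds)   # odds -> probability
    return log_odds, odds, probability


# Hypothetical coefficients for illustration only (NOT the fitted model).
intercept = -3.0
coefficients = {"VehicleAge": 0.25, "VehOdo": 0.00001, "VehBCost": -0.0001}

# Values a dealer could read off a particular car at auction (made up).
car = {"VehicleAge": 5, "VehOdo": 80000, "VehBCost": 6000}

log_odds, odds, probability = lemon_probability(intercept, coefficients, car)
print(f"log odds={log_odds:.2f}, odds={odds:.3f}, probability={probability:.3f}")
```

A dealer would substitute the fitted coefficients from the model printout and compare the resulting probability against a chosen cutoff before bidding.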

Prior to our analysis, our team expected the logistic regression model to confirm that certain variables are highly influential on the dependent variable, is bad buy. Variables such as the vehicle age, vehicle odometer reading, and acquisition cost at time of purchase were recognized as highly significant, with each parameter possessing a p-value less than .05. The classification of these variables as significant was to be expected considering that all of these characteristics reflect the physical condition of the vehicle. Variables that were unexpectedly revealed as significant include whether a car is manufactured by GM and whether it was purchased in the first or third quarter of the calendar year. JMP’s profiler function illustrates that the propensity for a car at auction to be a lemon decreases if it falls under either of those categories, which leads us to hypothesize that GM possesses superior manufacturing capabilities and that premium automobiles are released seasonally.

Another insight from our analysis was how accurate logistic regression models can be in predicting a binary outcome. The accuracy rate of our logistic regression model in predicting whether a car at auction will be a lemon was about 81% (80.82% on the training set and 80.31% on the testing set). For a business professional considering which classification method or other analytical tool to use to derive business insights from a data set, the accuracy rate achieved in this analysis demonstrates the effectiveness of logistic regression.

Improvements

Enhance the dataset to add data points such as fuel economy, cost of ownership, previous ownership such as first purchase, zip code etc.

Add third party datasets which will complement the existing dataset to provide:
o Statistics on reliability of specific models in the car dataset - http://carsoninfo.net/Updated2010CarBrandReliabilityGradePointAverages.aspx
o Sales and market share information of specific models by year; more common models on the road will be an easier sell, and family car models may be an easier sell compared to sports models
o Economic and demographic data of previous ownership - http://factfinder2.census.gov/faces/nav/jsf/pages/index.xhtml. For example, we could add this data by zip code and analyze the impact on the prediction variable.

Use a more complete dataset, for example one with the missing information filled in for wheel type, prime unit, AUCGUART (guarantee level provided by auction for the vehicle: green, yellow, red), and various price details

Use a more robust tool such as R for the data modelling part

Conclusion

Looking back at the process of taking our original data set of close to 73,000 tuples and scrubbing it down to about 51,000 tuples, there are many steps we can take to improve our results. As mentioned above, we could bring in external data on previous owners. We could build the model on car models rather than car types, and even add the specs of different car models. Even though it may not be a perfect model, our best model as of today is able to tell us the significant variables that differentiate good buys from bad buys. The insight that each pre-owned car dealer should take away is the confirmation that greater age and odometer reading, combined with a low initial purchase price, are good indicators of a lemon car. More importantly, dealers should pay attention to the new discovery that the make, purchase date, purchase location, and the car’s current retail price in perfect condition are also influential factors. Given the assumptions that the cost of every pre-owned car a dealer purchases is $10,217 and that the salvage value of a lemon is a quarter (¼) of its cost, we were able to make a $270,991 profit for the dealers if the number of cars purchased per year is capped at 100. This is a 7% lift compared to the baseline profit, alongside an accuracy rate of about 81%. Dealers can improve their return on investment using our model and continue to improve it using our recommended improvements.

Bibliography

1. https://www.kaggle.com/c/DontGetKicked
2. https://www.kaggle.com/c/DontGetKicked/data
3. http://www.irs.gov/Businesses/Small-Businesses-&-Self-Employed/Retail-Industry-ATG-Chapter-3-Examination-Techniques-for-Specific-Industries-Independent-Used-Automobile-Dealerships
4. http://www.naaa.com/about_us/all_about_auctions/chapter2.html

Appendix

Qualitative Variables

Field                 Definition
RefID                 Unique (sequential) number assigned to vehicles
IsBadBuy              Identifies if the kicked vehicle was an avoidable purchase. 1 = bad buy. 0 = good buy.
PurchDate             The date the vehicle was purchased at auction
Auction               Auction provider at which the vehicle was purchased
VehYear               The manufacturer’s year of the vehicle
Make                  Vehicle Manufacturer
Model                 Vehicle Model
Trim                  Vehicle Trim Level
SubModel              Vehicle Submodel
Color                 Vehicle Color
Transmission          Vehicle’s transmission type (Automatic, Manual)
WheelTypeID           The type ID of the vehicle wheel
WheelType             The vehicle wheel type description (Alloy, Covers)
Nationality           The manufacturer’s country
Size                  The size category of the vehicle (Compact, SUV, etc.)
TopThreeAmericanName  Identifies if the manufacturer is one of the top three American manufacturers
BYRNO                 Unique number assigned to the buyer that purchased the vehicle
VNZIP                 Zip code where the car was purchased
VNST                  State where the car was purchased
IsOnlineSale          Identifies if the vehicle was originally purchased online

Quantitative Variables

Field                              Definition
VehicleAge                         The years elapsed since the manufacturer’s year
VehOdo                             The vehicle’s odometer reading
MMRAcquisitionAuctionAveragePrice  Acquisition price for this vehicle in average condition at time of purchase
MMRAcquisitionAuctionCleanPrice    Acquisition price for this vehicle in the above average condition at time of purchase
MMRAcquisitionRetailAveragePrice   Acquisition price for this vehicle in the retail market in average condition at time of purchase
MMRAcquisitionRetailCleanPrice     Acquisition price for this vehicle in the retail market in above average condition at time of purchase
MMRCurrentAuctionAveragePrice      Acquisition price for this vehicle in average condition as of current day
MMRCurrentAuctionCleanPrice        Acquisition price for this vehicle in the above average condition as of current day
MMRCurrentRetailAveragePrice       Acquisition price for this vehicle in the retail market in average condition as of current day
MMRCurrentRetailCleanPrice         Acquisition price for this vehicle in the retail market in above average condition as of current day
VehBCost                           Acquisition cost paid for the vehicle at time of purchase
WarrantyCost                       Warranty price (term = 36 months and mileage = 36K)


Best Model

Training

                 Predicted
            Good Car   Bad Car     Total
Good Car       19763      2919     22682
Bad Car         1977       868      2845
Total          21740      3787     25527

Predicted number of good cars = 43481
Upper limit for cars purchased = 100
Actual number of cars purchased = 100
Propensity of good cars = 90.906%
Propensity of bad cars = 9.094%
Total Profit = $270,991

Other Metrics
Accuracy %                         80.82%
True Positive Rate                 30.51%
False Positive Rate                12.87%
Sensitivity (True Positive Rate)   30.51%
Specificity (True Negative Rate)   87.13%

Testing

                 Predicted
            Good Car   Bad Car     Total
Good Car       19549      2830     22379
Bad Car         2195       945      3140
Total          21744      3775     25519

Predicted number of good cars = 43489
Upper limit for cars purchased = 100
Actual number of cars purchased = 100
Propensity of good cars = 89.905%
Propensity of bad cars = 10.095%
Total Profit = $262,449

Other Metrics
Accuracy %                         80.31%
True Positive Rate                 30.10%
False Positive Rate                12.65%
Sensitivity (True Positive Rate)   30.10%
Specificity (True Negative Rate)   87.35%


2nd Best Model

Training

                 Predicted
            Good Car   Bad Car     Total
Good Car       19755      2927     22682
Bad Car         1985       860      2845
Total          21740      3787     25527

Other Metrics
Accuracy %                         80.76%
True Positive Rate                 30.23%
False Positive Rate                12.90%
Sensitivity (True Positive Rate)   30.23%
Specificity (True Negative Rate)   87.10%

Testing

                 Predicted
            Good Car   Bad Car     Total
Good Car       19545      2834     22379
Bad Car         2192       948      3140
Total          21737      3782     25519

Other Metrics
Accuracy %                         80.30%
True Positive Rate                 30.19%
False Positive Rate                12.66%
Sensitivity (True Positive Rate)   30.19%
Specificity (True Negative Rate)   87.34%

