Raji Balasubramaniyan, Senior Data Scientist, Manheim at MLconf ATL - 9/18/15
Leveraging Machine Learning Techniques for the Vehicle Auction Industry
Raji Balasubramaniyan, PhD
Senior Data Scientist, Manheim, Inc.
Manheim | Proprietary and Confidential 1
Overview
• Automobile auction
– Manheim
• Introduce the ML use cases
– Churn rate
– Recommendations
– Forecasting
• How to approach a problem?
– Tools and algorithms used
• Q&A
Manheim, Inc.: Automobile auction
Manheim provides auction services for the physical sale of automobiles, as well as online tools that connect wholesale vehicle buyers and sellers.
Leader in the wholesale vehicle auction industry: 85% of vehicle auction business happens at Manheim.
We have over 100 locations across the US and Canada.
About 15 million cars go through auction every year.
ML use case 1: Predicting Churn rate
• What is churn?
– Churn rate refers to the proportion of members who leave during a given time period
• Motto: Make the customer happy
– If the customer is happy, he/she won't churn
• Why is it important?
– It helps us predict and analyze the parameters that drive customers away, so the sales force team can focus on those parameters and coach the customer
Predicting Churn rate: The approach
• Step 1
– Create profiles for current and cancelled members by collecting their behavior data for the last 6 months
• Activity, transactions, messages, response time, etc.
• Step 2
– Segment the customers according to their behavior
• Unsupervised clustering
• Step 3
– For every segment, perform supervised learning to select the parameters that distinguish current members from cancelled members
• Logistic regression, neural net
• Step 4
– Include sentiment analysis to add another score
Algorithms: Unsupervised K-means clustering
• Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector consisting of one member's parameters, k-means clustering aims to partition the n observations into k (≤ n) sets S = {Successful Seller, Successful Buyer, Buyer at risk, Seller at risk, Undecided} so as to minimize the within-cluster sum of squares (WCSS).
In other words, its objective is to find:
$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$
• where μi is the mean of points in Si.
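The WCSS objective described above is usually minimized with Lloyd's algorithm, which alternates an assignment step and a mean-update step. The following is a minimal pure-Python sketch, not the implementation used in the talk; the two-feature dealer profiles (activity, transactions) and the choice of k = 2 are hypothetical values chosen only for illustration.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=50, seed=0):
    """Lloyd's algorithm: returns (centroids, labels) for a list of d-dimensional points."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest mean.
        new_labels = [
            min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the means are stable too
        labels = new_labels
        # Update step: each mean becomes the average of its assigned points.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # keep the old centroid if a cluster went empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

def wcss(points, centroids, labels):
    """Within-cluster sum of squares, the quantity k-means tries to minimize."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))

# Hypothetical dealer profiles: (activity score, number of transactions).
points = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.2), (8.0, 9.0), (9.0, 8.5), (8.5, 9.5)]
centroids, labels = kmeans(points, k=2)
print(labels, wcss(points, centroids, labels))
```

In practice the features would be scaled first, since k-means is sensitive to the units of each parameter.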
Algorithms: Logistic regression
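If the log-odds of P, the probability of churn, are viewed as a linear combination of explanatory variables, then the logistic regression function can be written as
$$P = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \alpha_1 + \dots + \beta_n \alpha_n)}}$$
where α1…αn are the parameters influencing churn and β0…βn are the fitted coefficients.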
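As a rough illustration of how such a model could be fitted, the sketch below keeps the slide's α1…αn for the explanatory variables and assumes coefficient names `beta0` and `betas`, which are not in the original. It uses plain batch gradient descent on the log-loss; the two normalized features and the six training rows are hypothetical.

```python
import math

def churn_probability(alphas, beta0, betas):
    """Logistic function of a linear combination of the explanatory variables."""
    z = beta0 + sum(b * a for b, a in zip(betas, alphas))
    return 1.0 / (1.0 + math.exp(-z))

def fit(rows, labels, lr=0.1, epochs=2000):
    """Batch gradient descent on the log-loss; labels are 1 (cancelled) or 0 (member)."""
    n_features = len(rows[0])
    beta0, betas = 0.0, [0.0] * n_features
    for _ in range(epochs):
        grad0, grads = 0.0, [0.0] * n_features
        for alphas, y in zip(rows, labels):
            err = churn_probability(alphas, beta0, betas) - y
            grad0 += err
            for j, a in enumerate(alphas):
                grads[j] += err * a
        # Move each coefficient against its average gradient.
        beta0 -= lr * grad0 / len(rows)
        betas = [b - lr * g / len(rows) for b, g in zip(betas, grads)]
    return beta0, betas

# Hypothetical, normalized features per dealer: (activity, responsiveness).
rows = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.6], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]]
labels = [0, 0, 0, 1, 1, 1]  # 1 = cancelled member
beta0, betas = fit(rows, labels)
print(churn_probability([0.1, 0.1], beta0, betas))  # inactive dealer
print(churn_probability([0.9, 0.9], beta0, betas))  # active dealer
```

The sign and size of each fitted coefficient indicate which parameters push a dealer toward cancelling, which is the selection the approach relies on.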
Algorithms: Neural net
Given the specific task of assigning a user to one of 5 groups, learning means using a set of factors to find the f* ∈ F that solves the task in some optimal sense.
Our training data consists of N dealers from each of the 5 groups.
Inputs and weights:
x1: Activity (weight w1)
x2: Number of messages (weight w2)
x3: Response time (weight w3)
xn: etc. (weight wn)
Output
Our cost function is the mean-squared error, which tries to minimize the average squared error between the network's output and the target value.
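A minimal sketch of the forward pass and the mean-squared-error cost follows. It assumes a single layer with one sigmoid output unit per group and a one-hot target; the slide does not give the actual architecture, so the layer shape, the bias terms, and the input values are all assumptions made for illustration.

```python
import math

GROUPS = ["Successful Seller", "Successful Buyer", "Buyer at risk",
          "Seller at risk", "Undecided"]

def forward(x, weights, biases):
    """One sigmoid unit per group: output_j = sigmoid(sum_i w_ji * x_i + b_j)."""
    return [
        1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(row, x)) + b)))
        for row, b in zip(weights, biases)
    ]

def mse(outputs, targets):
    """Average squared error between the network's output and the target value."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(outputs)

# Hypothetical normalized inputs: x1 activity, x2 number of messages, x3 response time.
x = [0.2, 0.1, 0.9]
weights = [[0.0, 0.0, 0.0] for _ in GROUPS]  # untrained network: all weights zero
biases = [0.0] * len(GROUPS)
target = [0, 0, 1, 0, 0]                     # one-hot target: "Buyer at risk"

outputs = forward(x, weights, biases)
cost = mse(outputs, target)
predicted_group = GROUPS[outputs.index(max(outputs))]
print(outputs, cost)
```

Training would adjust `weights` and `biases` to reduce `cost` over all N dealers per group, for example by gradient descent.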
Algorithms: Sentiment analysis
Sentiment analysis refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials. We used a Naïve Bayes model.
We have two training groups, G = {'Cancel', 'Member'}, and D = messages.
Example terms tk = {"like", "love", "hate", "bad", "worst", "interesting-to-me", "not-interesting-to-me", …, k terms}
The goal is to find the best group for a message D, the maximum a posteriori (MAP) group Gmap:
$$G_{map} = \underset{G}{\arg\max}\; P(G \mid D) = \underset{G}{\arg\max}\; P(G) \prod_{t_k \in D} P(t_k \mid G)$$
tk is a term; Dm is the set of messages from 'Members'; Dmk is the subset of Dm that contains tk; Dc is the set of messages from 'Cancelled Members'; Dck is the subset of Dc that contains tk.
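The sets defined above suggest estimating each term probability from document counts, for example P(tk | Member) ≈ |Dmk| / |Dm|. The sketch below is one way to do that, using the Bernoulli variant of Naïve Bayes with add-one smoothing; the variant, the smoothing, and the six training messages are assumptions, since the talk does not specify them.

```python
import math

def train(messages_by_group, vocabulary):
    """Estimate P(G) and P(t_k | G) from document counts, with add-one smoothing."""
    total = sum(len(msgs) for msgs in messages_by_group.values())
    model = {}
    for group, msgs in messages_by_group.items():
        docs = [set(m.lower().split()) for m in msgs]
        prior = len(docs) / total
        # Numerator counts the group's messages that contain the term (|D_gk|).
        term_prob = {
            t: (sum(t in d for d in docs) + 1) / (len(docs) + 2)
            for t in vocabulary
        }
        model[group] = (prior, term_prob)
    return model

def classify(message, model, vocabulary):
    """Return the MAP group, comparing log-posteriors to avoid underflow."""
    words = set(message.lower().split())

    def log_posterior(group):
        prior, term_prob = model[group]
        score = math.log(prior)
        for t in vocabulary:
            p = term_prob[t]
            # Bernoulli model: absent terms contribute (1 - p).
            score += math.log(p if t in words else 1.0 - p)
        return score

    return max(model, key=log_posterior)

# Hypothetical training messages for the two groups.
messages_by_group = {
    "Member": ["I love this auction", "like the new listings", "love the service"],
    "Cancel": ["worst experience", "bad service hate it", "I hate the fees"],
}
vocabulary = ["like", "love", "hate", "bad", "worst"]
model = train(messages_by_group, vocabulary)
print(classify("hate the bad fees", model, vocabulary))
print(classify("love it", model, vocabulary))
```

The log-posterior gap between the two groups could serve as the additional sentiment score mentioned in Step 4 of the approach.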
The Result
• Every dealer will be assigned to a group
• Each dealer will have 3 different health scores (health score = 1 − churn rate)
– 0-30 days health score (calculated using the last 30 days of data)
– 30-60 days health score (calculated using data from 30-60 days ago)
– 60+ days health score (calculated using data from 60-120 days ago)
• The sales force will be alerted when a successful user falls into a risk category. They will look into the parameter that pushed the user into the risk category
– Example: less activity in the last 30 days
• The marketing team will take the risk-category users and aim promotion schemes at them
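The health-score bookkeeping can be sketched in a few lines. The example assumes a churn probability has already been predicted for each window; the `risk_threshold` of 0.5 and the sample probabilities are hypothetical and are not values from the talk.

```python
WINDOWS = ["0-30 days", "30-60 days", "60+ days"]

def health_scores(churn_by_window):
    """Health score = 1 - churn rate, computed separately for each window."""
    return {w: 1.0 - churn_by_window[w] for w in WINDOWS}

def needs_attention(scores, risk_threshold=0.5):
    """True when a dealer who looked healthy in an older window is now below the threshold."""
    recent = scores["0-30 days"]
    older = max(scores["30-60 days"], scores["60+ days"])
    return recent < risk_threshold <= older

# Hypothetical predicted churn probabilities for one dealer.
churn = {"0-30 days": 0.7, "30-60 days": 0.3, "60+ days": 0.2}
scores = health_scores(churn)
print(scores, needs_attention(scores))
```

A dealer flagged this way is the case the sales force would be alerted to, since the drop happened in the most recent window.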
ML use case 2: Recommendation
What is a recommendation system?
Recommender systems are a subclass of information filtering systems that seek to predict the 'rating' or 'preference' that a user would give to an item.
Goal
Suggest relevant content to the users
Recommendation: The Approach
• Step 1
– Segment customers according to their transaction patterns
• Step 2
– For every segment, create a user profile per customer
• Step 3
– Match the user profile with vehicle profiles and arrive at a matching score
• Step 4
– Rank the relevant content
• Step 5
– Combine profile matching and ranking to provide recommendations
The approach: Segment the customers
Segment the customers according to their behavior
• Franchise dealer, Independent, Wholesaler
K-means or any other clustering technique could be used for this purpose.
Our objective is to find the best group that every dealer belongs to:
$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$
where μi is the mean of points in Si, and S = {different customer segments}.
The approach: Creating user profile and matching
• Create user profiles by collecting each dealer's transaction pattern over a period of time
• For every user profile, perform vehicle filtering using content-based collaborative filtering
– User-item collaborative filtering: relevant content recommendation
• Customers who bought car X also bought car Y
– 2010 Honda Accord vs. 2010 Toyota Camry
– User-user collaborative filtering: you may also like these
• Dealer A and Dealer B: how closely do their profiles match?
A similarity or co-rating matrix is used to arrive at relevant content-matching correlations
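Both filtering ideas above can be sketched with simple set operations. The example below builds a co-occurrence count for "customers who bought X also bought Y" and uses cosine similarity on binary purchase vectors for dealer-to-dealer matching; the purchase histories and the choice of cosine similarity are assumptions for illustration only.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories per dealer.
purchases = {
    "dealer_a": {"2010 Honda Accord", "2010 Toyota Camry", "2012 Ford F-150"},
    "dealer_b": {"2010 Honda Accord", "2010 Toyota Camry"},
    "dealer_c": {"2012 Ford F-150", "2011 Chevrolet Silverado"},
}

def co_occurrence(purchases):
    """Count how many dealers bought each ordered pair of vehicles."""
    counts = Counter()
    for items in purchases.values():
        for x, y in combinations(sorted(items), 2):
            counts[(x, y)] += 1
            counts[(y, x)] += 1
    return counts

def also_bought(item, counts, top_n=3):
    """Vehicles most often bought together with `item`, most frequent first."""
    scored = [(y, c) for (x, y), c in counts.items() if x == item]
    ranked = sorted(scored, key=lambda pair: (-pair[1], pair[0]))
    return [y for y, _ in ranked[:top_n]]

def user_similarity(a, b):
    """Cosine similarity between two dealers' binary purchase vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

counts = co_occurrence(purchases)
print(also_bought("2010 Honda Accord", counts))
print(user_similarity(purchases["dealer_a"], purchases["dealer_b"]))
```

With real data the co-occurrence counts would normally be normalized by item popularity so that very common vehicles do not dominate every recommendation.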
The approach: Ranking scores using regression
Customer need score
Once we have filtered the vehicle profiles that are relevant to the user, we rank/sort the vehicles according to some goal so that the most relevant content appears on top.
• Example: Suggest items that make more profit for the customers in the retail market; in this case the regression goal is profit.
The regression inputs α1…αn can be the buying price from auction, the retail selling price, the detailing work done on the cars, etc.
Result: Suggest relevant cars to the dealers when they log in to the site
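The ranking step can be sketched as scoring each filtered vehicle with a regression model and sorting by the predicted value. The sketch assumes a linear model whose coefficients have already been fitted; the coefficient values, feature names, and vehicle figures are hypothetical, and the talk does not state which regression form was used.

```python
def predicted_profit(vehicle, coefficients, intercept=0.0):
    """Linear regression prediction: intercept + sum(coefficient * feature value)."""
    return intercept + sum(coefficients[name] * vehicle[name] for name in coefficients)

def rank(vehicles, coefficients, intercept=0.0):
    """Sort vehicles so the highest predicted profit comes first."""
    return sorted(
        vehicles,
        key=lambda v: predicted_profit(v, coefficients, intercept),
        reverse=True,
    )

# Hypothetical fitted coefficients (profit = retail price - auction price - detailing cost).
coefficients = {"retail_price": 1.0, "auction_price": -1.0, "detailing_cost": -1.0}

# Hypothetical vehicles that survived the profile-matching filter.
vehicles = [
    {"name": "2010 Honda Accord", "retail_price": 12000, "auction_price": 10000, "detailing_cost": 300},
    {"name": "2010 Toyota Camry", "retail_price": 11500, "auction_price": 9000, "detailing_cost": 200},
    {"name": "2012 Ford F-150", "retail_price": 20000, "auction_price": 19000, "detailing_cost": 500},
]

ranked_names = [v["name"] for v in rank(vehicles, coefficients)]
print(ranked_names)
```

Swapping the regression target (for example, expected days to sale instead of profit) changes the ranking without touching the filtering step.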
ML use case 3: Forecasting
• How many transactions is a buyer going to make in the next few weeks?
– Given the past year's transaction history for a buyer, how many cars will the dealer buy in the next few auctions or online?
– Which year, make and model will the dealer buy?
– In which auction or region will the dealer buy?
• How many users are going to churn in the next few months?
– How many will move from the risk category to the successful category?
– How many will move to the risk category?
– How many non-active users will move to the active category?
Synopsis: Time series and ARIMA
A time series can be viewed as a combination of signal and noise. The signal could follow different patterns, and it could also have a seasonal component:
• Mean reversion
– The series will tend to move back to the mean over time
• Sinusoidal oscillation
• Etc.
An ARIMA model can be viewed as a "filter" that tries to separate the signal from the noise; the signal is then extrapolated into the future to obtain forecasts.
ARIMA models are, in theory, the most general class of models for forecasting a time series that can be made stationary by differencing.
The Approach: ARIMA
We use an autoregressive integrated moving average (ARIMA) model to calculate the forecast.
A non-seasonal ARIMA model is classified as an "ARIMA(p,d,q)" model, where:
• p is the number of autoregressive terms
• d is the number of non-seasonal differences needed for stationarity
• q is the number of moving average terms
A seasonal ARIMA model is classified as an ARIMA(p,d,q)x(P,D,Q) model, where:
• P is the number of seasonal autoregressive (SAR) terms
• D is the number of seasonal differences
• Q is the number of seasonal moving average (SMA) terms
According to the signal type, we developed an automatic forecast-parameter selection algorithm that tries different p, P, d, D and q, Q values and selects the combination with the lowest RMSE using the 80-20 rule.
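The selection loop can be sketched as follows, assuming the 80-20 rule means fitting on the first 80% of the series and measuring RMSE on the last 20%. To stay self-contained, the sketch is restricted to non-seasonal ARIMA(p, d, 0) models fitted by least squares with NumPy; it does not search q or the seasonal P, D, Q terms that the talk's algorithm covers, and the weekly purchase counts are hypothetical. A full search would typically rely on a library such as statsmodels.

```python
import itertools
import numpy as np

def fit_ar(series, p):
    """Least-squares fit of y_t = c + phi_1*y_{t-1} + ... + phi_p*y_{t-p}."""
    if p == 0:
        return np.array([series.mean()])
    rows = [series[t - p:t][::-1] for t in range(p, len(series))]
    X = np.column_stack([np.ones(len(rows)), np.array(rows)])
    coef, *_ = np.linalg.lstsq(X, series[p:], rcond=None)
    return coef

def forecast(train, p, d, steps):
    """Forecast `steps` values from an ARIMA(p, d, 0) model fitted on `train`."""
    level = np.asarray(train, dtype=float)
    tails = []  # last value at each differencing level, needed to undo the differencing
    for _ in range(d):
        tails.append(level[-1])
        level = np.diff(level)
    coef = fit_ar(level, p)
    history = list(level)
    preds = []
    for _ in range(steps):
        lags = history[-p:][::-1] if p else []  # most recent value first
        nxt = coef[0] + float(np.dot(coef[1:], lags))
        history.append(nxt)
        preds.append(nxt)
    out = np.array(preds)
    for last in reversed(tails):  # integrate back to the original scale
        out = last + np.cumsum(out)
    return out

def select_order(series, p_values=(0, 1, 2, 3), d_values=(0, 1, 2)):
    """Return (rmse, p, d) with the lowest hold-out RMSE on an 80/20 split."""
    series = np.asarray(series, dtype=float)
    split = int(len(series) * 0.8)
    train, test = series[:split], series[split:]
    best = None
    for p, d in itertools.product(p_values, d_values):
        preds = forecast(train, p, d, len(test))
        rmse = float(np.sqrt(np.mean((preds - test) ** 2)))
        if best is None or rmse < best[0]:
            best = (rmse, p, d)
    return best

# Hypothetical weekly purchase counts with a steady upward trend.
weekly_purchases = [5 + 2 * t for t in range(50)]
best = select_order(weekly_purchases)
print(best)
```

Because the hold-out set is the most recent 20% of the series, the split respects time order, which matters for forecasting: a random split would leak future values into training.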
One Example
Summary
• We used various ML techniques and implemented them for vehicle auction industry use cases.
• The choice of algorithm determines the success of the results; depending on the use case, various algorithms can be used.
• Extracting, cleaning and normalizing the data form the crucial layer in determining the success of a use case.
Acknowledgement
• Dr. Stephane Pinel
• Sonar Team
• Manheim
Q&A