Practical Predictive Analytics Karim Maarouf, Senior Data Scientist - Teradata Egypt
Cairo’s Data Science Community Meetup
January 9th 2015
• Introduction and Motivation
• Business Understanding of Churn
• Data Preparation
• Modeling
• Evaluation
• Retention
• Other Topics in Churn
Agenda
© 2014 Teradata
How can we help your business become more data driven?
Enabling data-driven business
Corporate Vision
Providing the world’s best analytic data solutions to drive competitive advantage for our customers
Mission
• Organizations must move from information and hindsight to optimization and foresight.
Gartner: Use Data More Proactively!
Predictive Analytics in Marketing
Detect customers at risk of default Detect instances of fraud
Detect customers that are at risk of churn and successfully retain them
Forecast projected net profit from a customer
Determine which customers are likely to buy which products. Recommend products
accordingly
Direct Marketing
Retention
Customer Lifetime Value
Risk
• Follow the standard CRISP-DM
Methodology
• Egyptian Telecommunications Market:
– Saturated market: mobile phone penetration is at 111%
– Majority (>95%) pre-paid
– Customers are price sensitive and easily switch between different providers
– Multi-SIM penetration is high (roughly 50%)
• Typical Pre-paid Customer Lifecycle
– Active customers can make and receive calls and other activities as long as they have enough credit
– Customers who are inactive (no activities or recharges) for an extended period of time (usually 90 days) are suspended.
– Customers who fail to recharge their line during suspension (usually 1 month) are disconnected. This event is considered churn.
• Churn by definition is a customer cancelling or deciding not to renew a service. For pre-paid subscribers the target event is actually inactivity.
Business Understanding Churn for a Telecommunications Provider
• Churn rate = number of customers churned / total number of customers (calculated over a certain period of time)
• Typical monthly churn rates are anywhere between 3 and 8%
• Annual churn rate = 1 – Annual retention rate
= 1 – [(1 – monthly churn rate) ^ 12]
(A common rough approximation: annual churn rate ≈ monthly churn rate * 12)
• Assume monthly churn rate = 5%
Customers remaining at the end of one year = (1 – 0.05)^12 = 54%
Annual churn rate = 1 – [(1 – 0.05)^12] = 46%
This means by the end of the year the provider has lost almost half of the customers it started out with!
Business Understanding Churn for a Telecommunications Provider
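The arithmetic above can be checked with a few lines of Python (a minimal sketch; the 5% monthly churn rate is the slide's example figure):

```python
# Annual churn from a constant monthly churn rate (the slide's example: 5%).
monthly_churn = 0.05

retained_after_year = (1 - monthly_churn) ** 12   # annual retention rate
annual_churn = 1 - retained_after_year

print(round(retained_after_year, 2))  # 0.54 -> ~54% of customers remain
print(round(annual_churn, 2))         # 0.46 -> ~46% churned over the year
```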
• Create classification models to predict which customers are most likely to churn and gain insights about customer churn from these models.
Modeling Approach
[Diagram omitted: a churn model is trained on historical data labeled churners / non-churners, then used to score current, unclassified customers.]
• Model scoring results in a score between 0 and 1 for each customer indicating the risk this customer will churn.
• Order customers by descending churn score and divide them into equal sized bins.
• Measure model accuracy overall and for the top X bins
• Target customers in the top X bins depending on model accuracy and budget.
Modeling Approach Scoring Customers
Bin   Percentile (churn score)
10    90% – 100%
9     80% – 90%
…     …
1     0% – 10%

Top bin: the 10% of customers with the highest churn scores.
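The binning scheme above can be sketched as follows; customer IDs and scores are made up for illustration:

```python
# Hypothetical churn scores in [0, 1]; customer names are made up.
scores = {f"cust_{i}": i / 100 for i in range(100)}

# Rank customers by descending churn score and cut into 10 equal-sized bins;
# bin 10 holds the 10% of customers with the highest scores.
ranked = sorted(scores, key=scores.get, reverse=True)
bin_size = len(ranked) // 10
bins = {10 - b: ranked[b * bin_size:(b + 1) * bin_size] for b in range(10)}

print(len(bins[10]))  # 10
print(bins[10][0])    # cust_99 (the highest-scoring customer)
```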
• Churn modeling is a type of rare event modeling
– Special considerations when sampling and when evaluating model accuracy
• Churn prediction is time sensitive.
Customer A:
Predicting inactivity for customer A is trivial. However, the prediction is too late.
Customer B:
Predicting inactivity for customer B is more actionable. However, it may be too late.
Customer C:
This is the ideal case. Proactively detect inactivity to be able to retain the customer before it is too late.
Special Considerations
[Timeline diagrams omitted: each shows the customer's last activity date relative to today.]
• Observation Period: This is the period used to study the historical behavior of the customers.
• Latency Period: This is usually a 1 week or 2 week gap between the historical period and when the churn event starts to take place. It is used to simulate the time needed to build an analytical dataset (ADS) on the observation period, score all the customers and launch retention campaigns.
• Target Period: This is when the churn event takes place. In our case this is the start of the inactivity period.
Modeling Approach Time Periods
Timeline: Observation Period (12W), then Latency Period (1W), then Target Period (8W).
• Customers in a pre-paid multi-SIM market tend to become inactive; they either churn completely or, in most cases, the line becomes a secondary line.
• The target is therefore defined as a consecutive period of inactive days starting from the first week of the target period as opposed to a single event of churn
• Conduct a pre-analysis to find the tipping point. After how many consecutive days of inactivity will it be highly unlikely that a customer will return?
• The choice of target definition depends on how aggressive we want to be in targeting churners.
Define the Target
Timeline: Observation Period (12W), Latency Period (1W), Target Period (8W), with the first week of the target period highlighted.
Find the consecutive inactivity period that starts in that first week and leads to dormancy.
[Chart omitted: x-axis Number of Consecutive Days of Inactivity (0–60), y-axis Population %, separating customers who continue to 60 days of inactivity from those who reactivate before 60 days.]
Example: 50% of the customers who are inactive for 10 or more days will continue inactivity to 60 days.
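The tipping-point pre-analysis can be sketched like this, using a synthetic list of per-customer consecutive-inactivity counts (the real analysis would run over the full customer base):

```python
# Synthetic per-customer counts of consecutive inactive days (illustration only).
inactivity_days = [3, 5, 8, 12, 15, 25, 60, 60, 60, 60]

def continue_rate(days, threshold, horizon=60):
    """Of customers inactive >= threshold days, what share stay inactive to the horizon?"""
    at_risk = [d for d in days if d >= threshold]
    if not at_risk:
        return None
    return sum(d >= horizon for d in at_risk) / len(at_risk)

# With this toy data, 4 of the 7 customers past 10 days continue to 60 days.
print(round(continue_rate(inactivity_days, 10), 2))  # 0.57
```

Sweeping the threshold over a range of values reveals the tipping point the slide describes.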
• Simplest way is to evaluate based on overall accuracy.
– Percentage of correct predictions = (TN+TP) / Total Count
– Problem in rare event modeling
Assume churn rate = 5%
Assume we predict all cases as non-churn
Overall accuracy = (95 + 0) / 100 = 95%!!
Evaluation Overall Accuracy
                   Predicted
                   Non-Churn         Churn
Actual Non-Churn   True Negatives    False Positives
Actual Churn       False Negatives   True Positives

                   Predicted
                   Non-Churn   Churn
Actual Non-Churn   95          0
Actual Churn       5           0
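The accuracy paradox in the table above can be reproduced directly:

```python
# 100 customers, 5% churn; a "model" that predicts non-churn for everyone.
actual = ["churn"] * 5 + ["non-churn"] * 95
predicted = ["non-churn"] * 100

tp = sum(a == "churn" and p == "churn" for a, p in zip(actual, predicted))
tn = sum(a == "non-churn" and p == "non-churn" for a, p in zip(actual, predicted))

accuracy = (tn + tp) / len(actual)
print(accuracy)  # 0.95 -- high accuracy while detecting zero churners
```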
• Need to answer two questions
– When a customer is predicted as a churner, what is the likelihood that this prediction is correct? In other words, what percentage of customers predicted as churners are in fact churners? This is precision:
Precision = TP / Total predicted as churn = TP / (TP + FP)
– What percentage of actual churners are detected as churners? This is recall:
Recall = TP / Total actual churners = TP / (TP + FN)
Evaluation Precision and Recall
                   Predicted
                   Non-Churn         Churn
Actual Non-Churn   True Negatives    False Positives
Actual Churn       False Negatives   True Positives
• Extreme Case (High Precision - Low Recall)
• Extreme Case (High Recall - Low Precision)
• Need to successfully balance precision and recall
Evaluation Precision and Recall
High Precision - Low Recall:
                Predicted
                Non-C   Churn
Actual Non-C    95      0
Actual Churn    4       1
Precision = 1/1 = 100%, Recall = 1/5 = 20%

High Recall - Low Precision:
                Predicted
                Non-C   Churn
Actual Non-C    50      45
Actual Churn    0       5
Precision = 5/50 = 10%, Recall = 5/5 = 100%

Balanced:
                Predicted
                Non-C   Churn
Actual Non-C    93      2
Actual Churn    1       4
Precision = 4/6 = 67%, Recall = 4/5 = 80%
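A quick check of the balanced case above:

```python
# Balanced example from this slide: TN=93, FP=2, FN=1, TP=4.
tn, fp, fn, tp = 93, 2, 1, 4

precision = tp / (tp + fp)  # share of predicted churners that really churned
recall = tp / (tp + fn)     # share of real churners that were detected

print(round(precision, 2))  # 0.67
print(round(recall, 2))     # 0.8
```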
• How to determine if precision is good enough?
• Assume churn rate is 5%
• A random model (selecting churners at random) will have a precision of 5%
• Lift specifies how much better the predictive model is doing than the random model
Lift = Precision / Churn Rate
Evaluation Lift
Case A
Assume
Churn rate = 5%
Precision = 50%
Lift = 50/5 = 10
Max Lift = 100/5 = 20
Case B
Assume
Churn rate = 10%
Precision = 60%
Lift = 60/10 = 6
Max Lift = 100/10 = 10
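Both cases follow from the lift formula:

```python
def lift(precision, churn_rate):
    # How many times better than random targeting (whose precision = churn rate)?
    return precision / churn_rate

print(round(lift(0.50, 0.05), 1))  # Case A: 10.0
print(round(lift(1.00, 0.05), 1))  # Case A max lift: 20.0
print(round(lift(0.60, 0.10), 1))  # Case B: 6.0
print(round(lift(1.00, 0.10), 1))  # Case B max lift: 10.0
```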
• An Analytical Dataset (ADS) captures all relevant attributes which are necessary to predict future churn.
• Variables in an ADS cover different subject areas related to the business problem.
• The same variable is normally repeated for different time periods throughout the observation period.
– Example:
- Num_Calls_W1 Num_Calls_W2 Num_Calls_M1 Num_Calls_M2
- Num_Services_Used_M1 Num_Services_Used_M2
- Num_Complaints_M1 Num_Complaints_M2
Data Understanding and Preparation ADS and Core Variables
Subject areas: Usage, Revenue, Recharges, Roaming and International, Services, Calling Circle, Network Experience, Call Center Complaints, Status, Call Gaps, Tariff, Loyalty Points, Competition.
• Evolution Variables
– Compare a variable over time
– Important to compare current behavior to customer’s normal behavior
– Usually used on active days, revenue, call gaps, number of recharges, total usage, etc. variables.
– Example: compare number of active days during last month of the observation period to the first month of the observation period.
– Num_Active_Days_EV = (Num_Active_Days_M3 − Num_Active_Days_M1) / Num_Active_Days_M1
= Num_Active_Days_M3 / Num_Active_Days_M1 − 1
– M1 = 20, M3 = 30 → EV = 0.5: number of active days increased by 50%
– M1 = 30, M3 = 15 → EV = −0.5: number of active days decreased by 50%
Data Understanding and Preparation Derived Variables
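The evolution-variable formula above reduces to a one-liner:

```python
def evolution(m1, m3):
    # Relative change between the last (M3) and first (M1) month of observation.
    return (m3 - m1) / m1  # equivalently: m3 / m1 - 1

print(evolution(20, 30))  # 0.5  -> active days up 50%
print(evolution(30, 15))  # -0.5 -> active days down 50%
```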
• Ratio Variables
– Ratio between on-net and off-net usage.
– Ratio between calls to top 10 calling circle and total calls.
– Ratio between usage during business hours, night, etc. to total usage.
– Ratio between number of small, medium, etc. recharges to total number of recharges.
• Combine different types of derived variables
– Example: Evolution of num_small_recharges_Ratio variable
Data Understanding and Preparation Derived Variables
• The following are some important transformations that should be applied to the data before modeling:
• Missing Value Replacement: replace missing values with zero, or with the mean or mode of the variable (depending on the meaning of the variable itself).
• Outlier Replacement: outliers are extreme values in a sample that skew the distribution of a variable. The simplest way to detect and cap them is the z-score.
• Normalization: transform skewed variables toward a normal distribution using a log transform, a square root, or even binning.
Data Understanding and Preparation Variable Transformation
[Histograms omitted: a skewed variable before and after a log transform.]
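A minimal sketch of the three transformations on a synthetic column (mean imputation, 2-standard-deviation capping, and log1p; the thresholds are illustrative choices, not prescribed by the slides):

```python
import math

# Synthetic column: one missing value and one extreme value.
values = [2.0, 3.0, None, 4.0, 50.0]

# 1. Missing value replacement: here, with the mean of the observed values.
observed = [v for v in values if v is not None]
mean = sum(observed) / len(observed)
filled = [mean if v is None else v for v in values]

# 2. Outlier capping via z-score: clamp values beyond 2 standard deviations.
mu = sum(filled) / len(filled)
sd = math.sqrt(sum((v - mu) ** 2 for v in filled) / len(filled))
capped = [min(max(v, mu - 2 * sd), mu + 2 * sd) for v in filled]

# 3. Normalization of a skewed variable: log1p keeps zeros well-defined.
logged = [math.log1p(v) for v in capped]

print(mean)         # 14.75
print(len(logged))  # 5 -- one transformed value per input
```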
• Customer data used for modeling is divided into training, validation and test sets. A common division is 60% / 20% / 20%.
– Training data: Data used for learning i.e. training the model.
– Validation data: Data used to verify the accuracy of the learned model. The purpose of a validation data set is to avoid over-fitting by measuring the accuracy of the model on data it has not seen before. If a model is accurate on the training data but significantly less accurate on the validation data, then this model has over-fit the training data and needs to be revised.
– Test data: Data used to test the final selected model. Measures of the model’s accuracy are calculated using this data set.
• In rare event modeling it is a good idea to increase the ratio of target cases in the training data
Modeling Samples
Training Data
Non-Churners 95%
Churners 5%
Up-sample churners or down-sample non-churners →
Non-Churners 50%
Churners 50%
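A sketch of the split and rebalancing, using a synthetic customer list and Python's random module (the 60/20/20 split and 50/50 target mix follow the slide; seed and population size are arbitrary):

```python
import random

random.seed(0)  # arbitrary seed for reproducibility

# Synthetic base: 1000 customers, 5% churners.
customers = ["churn"] * 50 + ["non-churn"] * 950
random.shuffle(customers)

# 60% / 20% / 20% split into training, validation and test sets.
n = len(customers)
i1, i2 = n * 60 // 100, n * 80 // 100
train, valid, test = customers[:i1], customers[i1:i2], customers[i2:]

# Rebalance the training set only: down-sample non-churners to a 50/50 mix.
churners = [c for c in train if c == "churn"]
non_churners = [c for c in train if c == "non-churn"]
balanced = churners + random.sample(non_churners, len(churners))

print(len(train), len(valid), len(test))        # 600 200 200
print(balanced.count("churn") / len(balanced))  # 0.5
```

Validation and test sets keep the natural churn rate so that accuracy estimates stay realistic.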
• Generally don’t build one model for all customers
• Need to divide customers into several populations that exhibit similar churn behavior
• The simplest way is to look at churn rate among different groups of customers
– Example: Assume overall churn rate is 5%
Modeling Populations
Value Segment Churn Rate
High 3%
Medium 5%
Low 10%
Tenure Churn Rate
> 2 yrs 3%
1yr – 2 yrs 6%
< 1 yr 15%
Separating the low value segment gives a natural lift = 10/5 = 2.
Separating the < 1 yr tenure group gives a natural lift = 15/5 = 3.
• Collinearity between two variables means they have a linear relationship; in the perfect case, you can predict one exactly from the other.
• Having collinear variables doesn't necessarily hurt prediction accuracy. However, it makes it difficult to distinguish the individual effects of these variables.
• In regression, correlation is often used interchangeably with collinearity.
• When two variables exhibit dependence, they are said to be correlated:
– When the variables increase together, the correlation is positive.
– When one variable increases as the other decreases, and vice versa, the correlation is negative.
• Use the Pearson correlation coefficient to measure the correlation between numeric variables
Modeling Removing Collinearity
• One solution to eliminating collinearity is to drop correlated variables.
• This works since many correlated variables in our problem describe more or less the same thing.
• Dropping correlated variables will also ensure that the list of top contributing predictors is not dominated by variables from only one or two subject areas
• Example: the following variables usually exhibit high correlation; simply keep one of them:
– Number of voice calls
– Duration of voice calls
– Revenue of voice calls
– Number of distinct people called
• Build a correlation matrix to help in dropping variables. Example:
Modeling Removing Collinearity
        Var 1   Var 2   Var 3
Var 1   1       0.3     0.6
Var 2   0.3     1       0.2
Var 3   0.6     0.2     1
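The drop-correlated-variables step can be sketched as a greedy pass over pairwise Pearson correlations; column names and values here are synthetic, and the 0.9 threshold is an assumed cut-off:

```python
import statistics

# Synthetic columns: call_duration tracks num_calls almost exactly.
data = {
    "num_calls":      [10, 20, 30, 40, 50],
    "call_duration":  [12, 21, 33, 39, 52],
    "num_complaints": [3, 1, 4, 1, 5],
}

def pearson(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Greedy pass: keep a variable only if it is not highly correlated
# (|r| >= 0.9, an assumed threshold) with any variable already kept.
kept = []
for name in data:
    if all(abs(pearson(data[name], data[k])) < 0.9 for k in kept):
        kept.append(name)

print(kept)  # ['num_calls', 'num_complaints'] -- call_duration dropped
```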
• Dimensionality reduction (i.e. reducing the number of predictor variables) helps in:
– Simplifying models and making them easier to interpret
– Removing collinearity
– Reducing over-fitting
• Use any of the following techniques to reduce the number of predictor variables:
• Principal Components Analysis (PCA): transform the predictors into a set of linearly uncorrelated variables called principal components. The top principal component has the largest possible variance.
• Feature Selection: eliminate predictors that are either irrelevant or redundant.
• Use a simple decision tree: Train a simple decision tree using all predictor variables and neglect all but the top X variables. Focus only on these top X variables moving forward
Modeling Dimensionality Reduction
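A pure-Python sketch of PCA's core idea on two synthetic features: power iteration on the covariance matrix finds the top principal component and its variance share:

```python
# Synthetic data: feature 2 is roughly 2 x feature 1, so one principal
# component should capture nearly all of the variance.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.1]

mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
dx = [x - mx for x in xs]
dy = [y - my for y in ys]

# 2x2 covariance matrix entries (unnormalized sums are fine for ratios).
cxx = sum(d * d for d in dx)
cyy = sum(d * d for d in dy)
cxy = sum(a * b for a, b in zip(dx, dy))

# Power iteration: repeatedly applying the matrix converges on the top eigenvector.
v = (1.0, 0.0)
for _ in range(100):
    w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
    norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
    v = (w[0] / norm, w[1] / norm)

# Rayleigh quotient gives the top eigenvalue; the trace is the total variance.
top_eigenvalue = cxx * v[0] ** 2 + 2 * cxy * v[0] * v[1] + cyy * v[1] ** 2
explained = top_eigenvalue / (cxx + cyy)
print(explained > 0.99)  # True: one component captures nearly all variance
```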
• Logistic regression is a statistical method for analyzing a dataset in which there are one or more predictor variables and a target variable that has one of two outcomes.
• Find the best fitting model to represent the relationship between the predictors and the target.
• It is a form of the generalized linear model that uses a logit (log odds) link function:
logit(p) = ln(p / (1 − p)) = b0 + b1·x1 + … + bn·xn
– p is the probability of the occurrence of the target event (churn)
– x1 to xn are the predictor variables and b0 to bn are the coefficients
– If the probability of churn is 0.25, the odds of churn are 0.25/0.75 = 1 to 3 and the logit = ln(1/3) ≈ −1.1
Modeling Logistic Regression
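The slide's worked numbers can be verified directly:

```python
import math

p = 0.25                # probability of churn
odds = p / (1 - p)      # 0.25 / 0.75 = 1 to 3
logit = math.log(odds)  # ln(1/3)

print(round(odds, 3))   # 0.333
print(round(logit, 1))  # -1.1

# Inverting the logit link (the logistic function) recovers the probability.
print(round(1 / (1 + math.exp(-logit)), 2))  # 0.25
```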
• A decision tree in predictive modeling maps observations about predictor variables to values for the target variable by following a sequence of rules.
• Decision trees classify instances by traversing the tree starting from the root to a leaf node with the decision.
• When building a decision tree, each split is made on the variable that provides the most information gain.
• Example: Decision Tree for ‘play tennis’
Modeling Decision Trees
• Build a model using the training data set.
• Use the validation data set to determine the best performing model.
• After selecting the final model test it on the test data set.
• Evaluating the accuracy of a model:
– Order customers by descending churn score and divide them into equal sized bins.
– Examine cumulative precision, recall and lift in the first 2 or 3 bins
Example*:
* Figures just for illustration. Assume churn rate is 10%
Evaluation
Bin   Percentile    Precision   Recall   Lift
10    90% – 100%    80%         50%      8
9     80% – 90%     60%         65%      6
8     70% – 80%     50%         75%      5
• Cumulative gains are used as a measure of recall for portions of the population (usually cumulated over the top bins)
• What percentage of the real churners are detected by the model?
• Explanation:
– If you target the top 2 bins (top 20 %), you capture 50% of the real churners.
Evaluation Gains Chart
[Gains chart omitted: x-axis Population % (0–100), y-axis Target %, with curves for the random model, the actual model, and a perfect model.]
• A lift chart specifies how much better the predictive model is doing than the random model for portions of the population.
• How much better is the model performing than a random model?
• Explanation:
– In the top bin (top 10%) the model’s precision is 3 times the random model’s precision.
Evaluation Lift Chart
[Lift chart omitted: x-axis Population % (0–100), y-axis Lift, with curves for the random model, the actual model, and a perfect model.]
• Use the list of top contributing variables to reveal insights about the churn problem
• Logistic Regression Example:
logit(p) = b0 + b1·calling_circle_size + b2·ratio_small_recharges_ev + b3·num_services_active
• A one-unit increase in calling_circle_size is associated with a change of b1 in the log odds of churn (an increase or decrease depending on its sign), holding the other predictors constant
• Decision Tree Example:
• Derive business rules by traversing the tree
Modeling Interpretation
[Decision tree diagram omitted: it splits on Calling Circle Size < 2, then on Ratio_small_recharges_EV <= 0 and Num_services_active > 3, with Y/N branches ending in Churn or Non-Churn leaves.]
• Churn detection is only half the problem!
• Need to successfully retain customers.
• Make sure detection is not too late and action is timely (within the latency period).
• Use insights from models to determine possible churn reasons:
– Better tariff plans or offers from competitors
– Network quality issues
– Unresolved complaints
• Tailor retention offers based on customer segments and interests
Retention
[Timeline omitted: Observation, Latency and Target periods, marking the first day to take action. Score customers and send out retention campaigns before the latency period is over.]
• Important to have a churn program
– Proactive methods: different predictive models for different customer segments and subject areas.
– Reactive methods: churn triggers (customers inactive for an abnormally extended period of time).
• Build separate models based on different subject areas. Give more weight to variables being overshadowed by stronger ones in the primary model.
– Example: build a model using only network variables and check the overlap in churners detected.
Other Topics in Churn
[Venn diagram omitted: within the set of all churners, the churners detected by the primary model and by the secondary model overlap only partially.]
Thank You