
Bagging and Boosting Classification Trees to Predict Churn.

Insights from the US Telecom Industry

Forthcoming, Journal of Marketing Research

Joint work with Christophe Croux

Aurélie Lemmens

The 2002 Churn Tournament organised by Teradata Center for CRM

at Duke University

Churn means defecting from a company, i.e. a customer taking his or her business elsewhere

Customer database from an anonymous U.S. wireless telecom

company

Challenge: predicting churn in order to design targeted retention strategies (Bolton et al. 2000, Ganesh et al. 2000, Shaffer and Zhang 2002)

Details can be found in Neslin et al. (2004)

The Context

The US Wireless Telecom market (2004)

182.1 million subscribers

Leader in market share: Cingular Wireless

26.9% of total market volume

turnover US$19.4 billion / net income US$201 million

Other major players: AT&T, Verizon, Sprint and Nextel

Mergers & Acquisitions : Cingular with AT&T Wireless & Sprint

with Nextel

The Context (cont’d)

Churn

High churn rates: 2.6% a month

Causes: increased competition, lack of differentiation,

market saturation

Cost: replacing a lost customer costs $300 to $700 in sales support, marketing, advertising, etc.

Targeted retention strategies

The Context (cont’d)

Formulation of the Churn Problem

Churn as a Classification issue:

Classify a customer i characterized by K variables xi = (xi1, xi2, …, xiK) as

Churner: yi = +1

Non-churner: yi = -1

Churn is the binary response variable to predict: yi = f(xi)

Choice of the binary choice model f(·)?

Classification Models in Marketing

Simple binary logit choice model (e.g. Andrews et al. 2002)

Models allowing for the heterogeneity in consumers’ response:

Finite mixture model (e.g. Wedel and Kamakura 2000)

Hierarchical Bayes model (e.g. Yang and Allenby 2003)

Non-parametric choice models:

Decision trees, neural nets (e.g. Thieme et al. 2000; West et al. 1997)

Bagging (Breiman 1996), Boosting (Freund and Schapire 1996),

Stochastic gradient boosting (Friedman 2002)


Mostly ignored in the marketing literature

S.G.B. won the Tournament (Cardell, from Salford Systems)

Decision Trees for Churn

Example:

[Tree diagram: customers are split on the change in consumption (< 0.5 vs ≥ 0.5), the number of customer care calls (< 3 vs ≥ 3), age (< 26, 26 to under 55, ≥ 55) and handset price (< $150 vs ≥ $150); each leaf classifies the customer as a churner (Yes) or a non-churner (No).]
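Such a tree can be grown directly from the data. Below is a minimal sketch in Python, assuming scikit-learn and NumPy; the simulated churn_X and churn_y are placeholders for the predictors and the +1/-1 churn labels, not the Teradata variables.

# Minimal sketch of a churn classification tree (assumes scikit-learn and NumPy).
# churn_X stands in for the customer predictors, churn_y for the +1/-1 churn labels.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
churn_X = rng.normal(size=(1000, 4))                      # placeholder predictors
churn_y = np.where(churn_X[:, 0] + rng.normal(size=1000) > 1.0, 1, -1)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(churn_X, churn_y)

# Print the learned splits, analogous to the consumption / care-calls / age / price splits above.
print(export_text(tree, feature_names=[f"x{j}" for j in range(4)]))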

Bagging and Boosting

Machine Learning Algorithms

Principle: classifier aggregation (Breiman, 1996)

Tree-based method (e.g. Currim et al. 1988)

Bagging: Bootstrap AGGregatING

[Diagram: from the calibration sample Z = {(xi, yi)}, i = 1, …, N, random bootstrap samples Z*_1, Z*_2, … are drawn; a base classifier (e.g. a tree) is estimated on each, giving score functions f̂*_1(x), f̂*_2(x), f̂*_3(x), …, f̂*_B(x), which are then aggregated.]

Churn propensity score: f̂_bag(x) = (1/B) Σ_{b=1}^{B} f̂*_b(x)

Churn classification: ĉ_bag(x) = sign(f̂_bag(x))

Bagging

Let the calibration sample be Z = {(x1, y1), …, (xi, yi), …, (xN, yN)}

B bootstrap samples Z*_b, b = 1, 2, …, B, are drawn from Z

From each Z*_b, a base classifier (e.g. tree) is estimated, giving B score functions:
f̂*_1(x), …, f̂*_b(x), …, f̂*_B(x)

The final classifier is obtained by averaging the scores:
f̂_bag(x) = (1/B) Σ_{b=1}^{B} f̂*_b(x)

The classification rule is carried out via ĉ_bag(x) = sign(f̂_bag(x))
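A minimal sketch of this scheme, assuming NumPy and scikit-learn; regression trees fitted to the +1/-1 labels are used here as one simple way to obtain real-valued base scores, and X_cal, y_cal, X_val are hypothetical calibration and validation arrays (this is not the authors' original implementation).

# Bagging sketch: B bootstrap samples, one tree per sample, scores averaged
# (assumes NumPy and scikit-learn).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_scores(X, y, X_new, B=100, seed=0):
    """f_bag(x) for the rows of X_new; y is coded +1 (churner) / -1 (non-churner)."""
    rng = np.random.default_rng(seed)
    N = len(y)
    scores = np.zeros(len(X_new))
    for b in range(B):
        idx = rng.integers(0, N, size=N)              # bootstrap sample Z*_b
        base = DecisionTreeRegressor(max_depth=5, random_state=b)
        base.fit(X[idx], y[idx])                      # base score function f*_b
        scores += base.predict(X_new)
    return scores / B                                 # f_bag(x) = (1/B) * sum_b f*_b(x)

# Classification rule: c_bag(x) = sign(f_bag(x))
# churn_pred = np.sign(bagging_scores(X_cal, y_cal, X_val))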

Stochastic Gradient Boosting

Winner of the Teradata Churn Modeling Tournament
(Cardell, Golovnya and Steinberg, Salford Systems)

Data adaptively resampled:

• Previously misclassified observations receive higher weights

• Previously well-classified observations receive lower weights
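Stochastic gradient boosting is available off the shelf; a minimal sketch with scikit-learn's GradientBoostingClassifier is given below, where the subsample option provides the stochastic part (a random subsample of the calibration data at each iteration, as in Friedman 2002). X_cal, y_cal, X_val are hypothetical data arrays, and this is not the Salford Systems implementation used in the tournament.

# Stochastic gradient boosting sketch (assumes scikit-learn).
from sklearn.ensemble import GradientBoostingClassifier

sgb = GradientBoostingClassifier(
    n_estimators=100,    # boosting iterations
    learning_rate=0.1,
    max_depth=3,
    subsample=0.5,       # each tree is fitted on a random half of the calibration sample
    random_state=0,
)
# sgb.fit(X_cal, y_cal)                        # balanced calibration sample
# churn_score = sgb.decision_function(X_val)   # higher score = higher churn propensity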

Data

[Diagram: customers observed over time; the customer base is split into a calibration sample and a hold-out validation sample.]

Calibration sample: balanced sample, N = 51,306, equal proportion of churners = 50%

Validation hold-out sample: proportional sample, N = 100,462, real-life proportion of churners = 1.8%

Each customer i is described by 46 predictors xi = (xi1, …, xi46) and the churn label yi:

Behavioral predictors, e.g. the average monthly minutes of use

Company interaction variables, e.g. mean unrounded minutes of customer care calls

Customer demographics, e.g. the number of adults in the household
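A minimal sketch of how such a balanced calibration sample can be drawn from a customer base in which churners are rare, assuming pandas; the data frames and the churn column are hypothetical placeholders.

# Balanced vs. proportional samples (assumes pandas; 'churn' is a hypothetical
# column coded +1 for churners and -1 for non-churners).
import pandas as pd

def balanced_sample(df, label_col="churn", seed=0):
    """Undersample non-churners so that churners make up 50% of the sample."""
    churners = df[df[label_col] == 1]
    non_churners = df[df[label_col] == -1].sample(n=len(churners), random_state=seed)
    return pd.concat([churners, non_churners]).sample(frac=1.0, random_state=seed)

# calibration = balanced_sample(calibration_period_customers)   # ~50% churners
# validation  = validation_period_customers                     # real-life ~1.8% churners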

Research Questions

Do bagging and boosting provide better results than other benchmarks?

What are the financial gains to be expected from this improvement?

What are the most relevant churn drivers or triggers that marketers could watch for?

How can estimated scores obtained from a balanced calibration sample be corrected when predicting rare events like churn?

Comparing Error Rates…

Model*                          Validated Error Rate**
Binary Logit Model              0.400
Bagging (tree-based)            0.374
Stochastic Gradient Boosting    0.460

* Model estimated on the balanced calibration sample
** Error rates computed on the hold-out proportional validation sample

Bias due to Balanced Sampling

Overestimation of the number of churners

Several bias correction methods exist (see e.g. Cosslett 1993;

Donkers et al. 2003; Franses and Paap 2001, p.73-75; Imbens and Lancaster

1996; King and Zeng 2001a,b; Scott and Wild 1997).

However, most are dedicated to traditional models (e.g. logit).

We discuss two corrections for bagging and boosting.

The Bias Correction Methods

The weighting correction:

Based on marketers’ prior beliefs about the churn rate, i.e. the

proportion of churners among their customers, we attach weights

to observations of a balanced calibration sample.

The intercept correction:

Take a non-zero cut-off value τB such that the proportion of

predicted churners in the calibration sample equals the actual a

priori proportion of churners.
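Both corrections are straightforward to apply to bagged or boosted scores. A minimal sketch follows, assuming NumPy; pi_prior stands for the marketers' prior churn rate (e.g. 0.018), and the weights shown are the usual population-share over sample-share weights for a 50/50 balanced sample, used here as an illustrative assumption.

# Sketch of the two bias corrections for a balanced calibration sample (assumes NumPy).
import numpy as np

def weighting_correction(y, pi_prior):
    """Observation weights: churners (y = +1) weighted by pi_prior / 0.5,
    non-churners by (1 - pi_prior) / 0.5."""
    return np.where(y == 1, pi_prior / 0.5, (1.0 - pi_prior) / 0.5)

def intercept_correction_cutoff(calibration_scores, pi_prior):
    """Cut-off tau_B such that the share of predicted churners equals pi_prior."""
    return np.quantile(calibration_scores, 1.0 - pi_prior)

# tau_B = intercept_correction_cutoff(f_bag_calibration, pi_prior=0.018)
# churn_pred = np.sign(f_bag_validation - tau_B)     # c_bag(x) = sign(f_bag(x) - tau_B)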

Bagging

Let the calibration sample be Z = {(x1, y1), …, (xi, yi), …, (xN, yN)}

B bootstrap samples Z*_b, b = 1, 2, …, B

From each Z*_b, a base classifier (e.g. tree) is estimated, giving B score functions:
f̂*_1(x), …, f̂*_b(x), …, f̂*_B(x)

The final classifier is obtained by averaging the scores:
f̂_bag(x) = (1/B) Σ_{b=1}^{B} f̂*_b(x)

The classification is carried out via ĉ_bag(x) = sign(f̂_bag(x) − τB)

Assessing the Best Bias Correction…

Validated Error Rates** by bias correction

Model*                     No correction   Intercept   Weighting
Binary logit model         0.400           0.035       0.018
Bagging (tree-based)       0.374           0.034       0.025
S.G. boosting              0.460           0.034       0.018

* Model estimated on the balanced calibration sample
** Error rates computed on the hold-out proportional validation sample

The Top-Decile Lift

Focuses on the most critical group of customers regarding their churn risk: the ideal segment for targeting a retention marketing campaign, i.e. the top 10% riskiest customers.

[Diagram: customers ranked by their risk to churn, with the top 10% highlighted.]

With π̂_10% = the proportion of churners in this risky segment
and π̂ = the proportion of churners in the whole validation set:

Top-decile lift = π̂_10% / π̂
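A minimal sketch of the top-decile lift computation, assuming NumPy; the score and label arrays are hypothetical placeholders.

# Top-decile lift on the validation sample (assumes NumPy).
import numpy as np

def top_decile_lift(scores, y, decile=0.10):
    """scores: predicted churn propensities; y: +1 churner / -1 non-churner."""
    n_top = int(np.ceil(decile * len(y)))
    top = np.argsort(scores)[::-1][:n_top]     # the 10% riskiest customers
    pi_top = np.mean(y[top] == 1)              # churner share in the risky segment
    pi_all = np.mean(y == 1)                   # churner share in the whole validation set
    return pi_top / pi_all

# lift = top_decile_lift(f_bag_validation, y_validation)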

Financial Gains: Neslin et al. (2004)

N : customer base of the company

α : percentage of targeted customers (here, 10%)

ΔTop decile : increase in top-decile lift

γ : success rate of the incentive among the churners

LVC : lifetime value of a customer (Gupta, Lehmann and Stuart 2004)

δ : incentive cost per customer

ψ : success rate of the incentive among the non-churners.

π̂ : real-life proportion of churners in the customer base

Gain = N · α · ΔTop decile · π̂ · γ · LVC
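A minimal sketch of this gain formula in plain Python; the example call uses the parameter values given later in the deck, with the 1.8% churn rate taken as π̂, so the result is close to (but, because of rounding, not identical to) the figures reported there.

# Financial gain formula: Gain = N * alpha * delta_top_decile * pi_hat * gamma * LVC.
def retention_gain(N, alpha, delta_top_decile, gamma, lvc, pi_hat=0.018):
    return N * alpha * delta_top_decile * pi_hat * gamma * lvc

# Bagging vs. logit: delta top-decile lift = 2.246 - 1.775 = 0.471
print(retention_gain(N=5_000_000, alpha=0.10, delta_top_decile=0.471, gamma=0.30, lvc=2500))
# roughly $3.2 million of additional gains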

Top-Decile Lift with Intercept Correction

[Figure: top-decile lift (from about 1.6 to 2.6) as a function of the number of iterations (0 to 100) for bagging, stochastic gradient boosting and the binary logit model; bagging and boosting improve the top-decile lift by about +26% over the logit benchmark.]

* Model estimated on the balanced sample, and lift computed on the validation sample.

Validated** Top-Decile Lift

Model*                          No / Intercept correction   Weighting correction
Binary logit model              1.775                       1.764
Bagging (tree-based)            2.246                       1.549
Stochastic gradient boosting    2.290                       1.632

* Model estimated on the balanced calibration sample
** Lifts computed on the hold-out proportional validation sample

Financial Gains

If we consider

N : customer base of 5,000,000 customers

α : 10% of targeted customers

γ : 30% success rate of the incentive among the churners

LVC : $2,500 lifetime value of a customer

δ : $50 incentive cost per customer

ψ : 50% success rate of the incentive among the non-churners

π̂ : 1.8% real-life proportion of churners

Gain = N · α · ΔTop decile · π̂ · γ · LVC

Financial Gains

Additional financial gains that we may expect from a retention marketing campaign targeted using the scores predicted by bagging instead of the logit model:

ΔTop decile : 0.471 (= 2.246 – 1.775)

Gain = + $3,214,800

Additional financial gains that we may expect from a retention marketing campaign targeted using the scores predicted by bagging instead of a random selection:

ΔTop decile : 1.246 (= 2.246 – 1.000)

Gain = + $8,550,000

Most Important Churn Triggers

Bagging

Partial Dependence Plots

[Figure: partial dependence plots from the bagged trees. Left panel: probability to churn (roughly 48% to 62%) as a function of the change in monthly minutes of use (about -1000 to 2000). Right panel: probability to churn (roughly 44% to 56%) as a function of equipment days (0 to 1500). A further partial dependence plot shows the probability to churn on a 49% to 51% scale.]
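Partial dependence curves like these can be computed for any fitted score function by sweeping one predictor over a grid and averaging the predictions over the sample. A minimal sketch assuming NumPy; model_score, X_cal and the column index are hypothetical placeholders.

# Partial dependence sketch (assumes NumPy): average predicted churn score as a
# function of predictor j, with the other predictors kept at their observed values.
import numpy as np

def partial_dependence(score_fn, X, j, grid):
    values = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, j] = v                 # set predictor j to the grid value for everyone
        values.append(np.mean(score_fn(X_mod)))
    return np.array(values)

# grid = np.linspace(-1000, 2000, 50)            # e.g. change in monthly minutes of use
# curve = partial_dependence(model_score, X_cal, j=0, grid=grid)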

Conclusions: Main Findings

1. Bagging and S.G. boosting are substantially better

classifiers than the binary logit choice model

Improvement of 26% for the top-decile lift,

Good diagnostic measures offering face validity,

Interesting insights about potential churn drivers,

Bagging is conceptually simple and easy to implement.

2. Intercept correction constitutes an appropriate bias correction for bagging when using a balanced sampling scheme.

Thanks for your attention

From Profit to Financial Gains

Gain = Profit_1 − Profit_2
     = N · α · (λ̂_1 − λ̂_2) · γ · LVC
     = N · α · ΔTop decile · π̂ · γ · LVC

where Profit_j is the profit of a campaign that targets the top decile selected by model j and consists of:

the LVC of the churners who do not churn (i.e. who are retained by the incentive),

minus the incentive cost δ for the churners retained and for the non-churners targeted who accept the incentive,

minus the contact cost c,

and where λ̂_j is the proportion of churners in the targeted decile, so that Top decile lift_j = λ̂_j / π̂.