Behavior-Based Predictive Models
Uploaded by liuwensui. Category: Technology.
Copyright © 2008, SAS Institute Inc. All rights reserved.
Disclaimer
Any opinions, advice, statements, or other information or content expressed or made in the following presentation are those of the presenter and do not necessarily state or reflect the positions or opinions of JPMorgan Chase, its affiliates or subsidiaries.
Special Thanks to
- SAS and Dr. Jerry Oglesby
- The ChoicePoint Precision Marketing (CPPM) analytics team, which supported me in finishing this work but no longer exists.
Rules:
1. During the talk, stop me at any time if you have a question.
2. After the talk, you are welcome to discuss with me offline.
Introduction
Most popular Data Mining models:
Logistic regression and its variants (NNets, SVM, ...)
2-state Assumption => Bernoulli Outcome
Predict the presence of certain behaviors (response, ...)
Major Limitation:
- Ignores Frequency and Severity given the presence of a behavior
Ex. 1st-time Auto Claim => Bad Luck => Normal
2 or More Claims => Bad Habit => Risky
- Consequence of the 2-state assumption: rank-orders head count but not dollar amounts
Current Status
Most efforts focusing on relationship exploration:
- NNets, SVM, GAM, CART, … …
Overlooked: Definition of the Left-hand Side (the outcome):
- Binary outcome is derived from, but over-simplifies, behaviors
- Why not model behaviors directly using Count Models?
Is anything lost by dropping the 2-state assumption (Logistic Regression)?
Law of Small Numbers:
Binomial (N, p) ≈ Poisson (Np) as N -> ∞ and p -> 0 with Np held fixed
=> Prob (Y = 1|Y~Bern.) ≈ Prob (Y ≥ 1|Y~Pois.) - Shown later!
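The Law of Small Numbers can be checked numerically. The following is a minimal Python sketch (not the SAS used in the talk) that compares a Binomial(N, p) pmf with a Poisson(Np) pmf while N grows and Np stays fixed. The value `lam = 0.5` and the helper names are illustrative assumptions, not taken from the slides.

```python
import math

def binom_pmf(k, n, p):
    """P(Y = k) for Y ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(Y = k) for Y ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 0.5  # hypothetical value of N * p, held fixed while N grows
gaps = {}
for n in (10, 100, 10_000):
    p = lam / n
    # Largest pointwise difference between the two pmfs over k = 0..5
    gaps[n] = max(abs(binom_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(6))
    print(f"N = {n:>6}: max |Binomial - Poisson| = {gaps[n]:.6f}")
```

The gap shrinks as N increases, which is the sense in which a rare-event binary outcome and a Poisson count carry nearly the same information about presence.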
Genuine Count Model
Starting Point: basic Poisson Model
$$f(Y_i \mid X_i) = \frac{\exp(-\lambda_i)\,\lambda_i^{Y_i}}{Y_i!}, \quad \text{where } \lambda_i = \exp(X_i \beta)$$
Major Drawback:
Strong Assumption of Equi-Dispersion => Mean = Variance
Real-world Data => Over-Dispersion
- Excess Zeroes: Majority with 0 delinquency in Credit Card
- Long Right Tail: Severely sick patients in Insurance
(Annotation on the formula: Exp(X_i β) captures observed heterogeneity.)
Major Alternative
Negative Binomial Model (continuous mixture):
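$$\lambda_i = \exp(X_i \beta + \varepsilon_i) = \exp(X_i \beta)\,\exp(\varepsilon_i), \quad \text{where } \exp(\varepsilon_i) \sim \mathrm{Gamma}(\alpha^{-1}, \alpha^{-1})$$
E(Y|X) = λ and Var(Y|X) = λ + α λ² > λ => Problem Solved !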
Potential Limitation: 1-Process Assumption
- Lack of flexibility for heterogeneous population
- Lack of intuitive interpretation on excess zeroes
- Lack of insight for customer segmentation
(Annotations on the formula: Exp(X_i β) carries the observed heterogeneity; the Gamma-distributed term Exp(ε_i) carries the unobserved heterogeneity.)
Composite Models
Main Assumption => Multiple(2+) Components
- Data governed by multiple processes
Ex. Insurance claimant might behave differently after 1st claim.
Models covered:
- Hurdle Model (Mullahy 1986)
- Zero-Inflated Poisson Model (Lambert 1992)
- Latent Class Poisson Model (Wedel 1993)
Additional Benefit:
- Segmentation by behavior and/or characteristics
An Application
Credit Card Data used in Econometric Analysis (Greene 1992)
Outcome: # of 60-day Delinquencies in payment
Predictors:
AGE       Age in years as of November, 1989
INCOME    Self-reported income, in $10,000s
AVGEXP    Average monthly credit card expense
EXP_INC   Average monthly credit card expense / average monthly income
MAJOR     Binary indicator of whether applicant has a major credit card
OWNRENT   Binary indicator of whether applicant owns their home
DEPNDT    Number of dependents
INC_PER   Monthly income divided by 1 + DEPNDT
SELFEMPL  Binary indicator of whether the applicant is self-employed
ACTIVE    Number of active credit card accounts
CUR_ADD   Number of months living at current address
Data Summary
Variable        Mean  Std.Dev.       Min       Max
MAJORDRG      0.4564    1.3453      0.00     14.00
AGE          33.2131   10.1428      0.17     83.50
INCOME        3.3654    1.6939      0.21     13.50
EXP_INC       0.0687    0.0947      0.00      0.91
AVGEXP      185.0570  272.2190      0.00   3100.00
MAJOR         0.8173    0.3866      0.00      1.00
OWNRENT       0.4405    0.4966      0.00      1.00
DEPNDT        0.9939    1.2478      0.00      6.00
INC_PER       2.1556    1.3635      0.07     11.00
SELFEMPL      0.0690    0.2535      0.00      1.00
ACTIVE        6.9970    6.3058      0.00     46.00
CUR_ADD      55.2676   66.2717      0.00    540.00
For the outcome (MAJORDRG), Variance ≈ 4 times Mean
EDA on Outcome
[Figure: histogram of the outcome by number of delinquencies (0 to 10+); vertical axis from 0% to 80%.]
1. 80% of cardholders have 0 delinquencies.
2. Large dispersion with a long right tail.
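The over-dispersion in the outcome can be read straight from the MAJORDRG row of the data summary. The short Python sketch below (not the SAS used in the talk) computes the variance-to-mean ratio and the share of zeroes a Poisson with that mean would imply. It treats the whole portfolio as one homogeneous Poisson with no predictors, which is a simplifying assumption made only for illustration.

```python
import math

# Mean and standard deviation of MAJORDRG, copied from the data summary
mean, std_dev = 0.4564, 1.3453

variance = std_dev ** 2
dispersion_ratio = variance / mean     # equals 1 under equi-dispersion
poisson_zero_share = math.exp(-mean)   # P(Y = 0) for a Poisson with this mean

print(f"Variance / Mean        = {dispersion_ratio:.2f}")
print(f"Poisson-implied zeroes = {poisson_zero_share:.1%}")
```

A Poisson with this mean implies noticeably fewer zeroes than the roughly 80% observed, which is the excess-zero problem the later models address.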
Traditional Modeling Practice
Logistic Regression based on 2-State assumption:
Define Y = 0 if MajorDrg = 0 and Y = 1 otherwise
Fit a logistic regression with 0/1 Bernoulli outcome
proc logistic data = credit;
  /* event = '1' models Prob(Y = 1); by default PROC LOGISTIC models the lower level, Y = 0 */
  model Y(event = '1') = < PREDICTORS > ;
run;
Can’t differentiate between 1 delinquency and 3 delinquencies
Able to capture head counts but not dollar amounts
Standard Count Data Model
Basic Poisson Model => Not Sufficient for data with 80% Zeroes
Negative Binomial Model:
proc genmod data = credit;
  /* Y here is the delinquency count (MajorDrg), not the 0/1 flag used for the logistic model */
  model Y = < PREDICTORS > / dist = NB link = log ;
run;
Goodness-of-Fit: Both portfolio level and account level
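$$f(Y_i \mid X_i) = \frac{\Gamma(Y_i + \alpha^{-1})}{\Gamma(Y_i + 1)\,\Gamma(\alpha^{-1})} \left(\frac{\alpha^{-1}}{\alpha^{-1} + \lambda_i}\right)^{\alpha^{-1}} \left(\frac{\lambda_i}{\alpha^{-1} + \lambda_i}\right)^{Y_i}$$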
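As a sanity check on the Negative Binomial density, here is a minimal Python sketch (not the SAS used in the talk) of the standard NB pmf with mean λ and dispersion α. It confirms numerically that the probabilities sum to 1, that the mean is λ, and that the variance is λ + αλ². The value α = 3.5161 is the dispersion estimate reported on the NB Output slide; λ = 0.4564 is the sample mean of MAJORDRG, used here as a stand-in for one account's mean (an assumption for illustration only).

```python
import math

def nb_pmf(y, lam, alpha):
    """Negative Binomial P(Y = y) with mean lam and dispersion alpha."""
    r = 1.0 / alpha  # alpha^{-1}, the Gamma shape parameter
    log_p = (math.lgamma(y + r) - math.lgamma(y + 1) - math.lgamma(r)
             + r * math.log(r / (r + lam))
             + y * math.log(lam / (r + lam)))
    return math.exp(log_p)

lam, alpha = 0.4564, 3.5161  # illustrative mean; alpha from the NB output

# Truncate the infinite sums at 400; the remaining tail mass is negligible
probs = [nb_pmf(y, lam, alpha) for y in range(400)]
total = sum(probs)
nb_mean = sum(y * p for y, p in enumerate(probs))
nb_var = sum((y - nb_mean) ** 2 * p for y, p in enumerate(probs))

print(f"sum = {total:.6f}, mean = {nb_mean:.4f}, variance = {nb_var:.4f}")
print(f"lam + alpha * lam^2 = {lam + alpha * lam ** 2:.4f}")
```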
NB Output
Parameter Estimate Standard Error t Value Pr > |t|
B2_Intercept -1.7324 0.3771 -4.59 <.0001
B2_Age 0.004542 0.008946 0.51 0.6118
B2_Income -0.06657 0.08815 -0.76 0.4503
B2_Exp_inc -7.7236 2.6413 -2.92 0.0035
B2_Avgexp -0.0001 0.000832 -0.12 0.9031
B2_Ownrent -0.7795 0.1727 -4.51 <.0001
B2_Selfempl -0.07347 0.2802 -0.26 0.7932
B2_Depndt 0.1932 0.1215 1.59 0.112
B2_Inc_per 0.1292 0.1191 1.08 0.2783
B2_Cur_add 0.002557 0.001195 2.14 0.0326
B2_Major 0.02561 0.1893 0.14 0.8924
B2_Active 0.1152 0.0141 8.17 <.0001
alpha 3.5161 0.4046 8.69 <.0001
NB Portfolio Prediction
[Figure: distribution of MajorDrg vs. NB prediction by number of delinquencies (0 to 10+); vertical axis from 0% to 80%.]
How to Score
Count Model Scoring Scheme:
1. Model Development
2. Prob(Y=0), Prob(Y=1), Prob(Y=2), Prob(Y=3), Prob(Y=4), ...
3. Define Good / Bad. Ex: Bad = 1 when Y >= 2
4. Prob(Good) = Prob(Y=0 or 1), Prob(Bad) = 1 - Prob(Y=0 or 1)
Logit Model Scoring Scheme:
1. Define Good / Bad. Ex: Bad = 1 when Y >= 2
2. Model Development
3. Prob(Good) and Prob(Bad)
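The count-model scoring scheme can be sketched in a few lines of Python (not the SAS used in the talk). One fitted count distribution yields a bad-probability for any cutoff, whereas a logit model has to be rebuilt for each definition of Bad. The Poisson pmf and the mean `lam = 0.8` are illustrative assumptions; any fitted count pmf could be passed in instead.

```python
import math

def poisson_pmf(k, lam):
    """P(Y = k) for Y ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def prob_bad(count_pmf, bad_threshold):
    """Prob(Bad) = Prob(Y >= bad_threshold) = 1 - sum of Prob(Y = k) for k below the threshold."""
    return 1.0 - sum(count_pmf(k) for k in range(bad_threshold))

lam = 0.8  # hypothetical predicted mean for one account

# One model, three scores: Bad defined as 1+, 2+ or 3+ delinquencies
scores = {t: prob_bad(lambda k: poisson_pmf(k, lam), t) for t in (1, 2, 3)}
for threshold, score in scores.items():
    print(f"Prob(Y >= {threshold}) = {score:.4f}")
```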
NB Account Prediction
[Figure: cumulative % of bads (vertical axis) vs. cumulative % of population (horizontal axis). Series: NB score for 1+ delinquencies, NB score for 2+ delinquencies, NB score for 3+ delinquencies, logistic regression score.]
Hurdle Model
Two-Component Assumption:
- Zero counts determined by Binomial distribution
- Positive counts governed by Zero-Truncated Poisson distribution
2-Group Segmentation:
- Group without delinquency
- Group with delinquency
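$$f(Y_i \mid X_i) = \begin{cases} \theta_i & \text{for } Y_i = 0 \\ \dfrac{(1 - \theta_i)\,\exp(-\lambda_i)\,\lambda_i^{Y_i}}{\left[1 - \exp(-\lambda_i)\right] Y_i!} & \text{for } Y_i > 0 \end{cases}$$
where θ_i is the logit probability of a zero count and λ_i = Exp(X_i β) is the Poisson mean.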
Hurdle Model in SAS
proc nlmixed data = data;
  parms b0 = 0 b1 = 0 ... a0 = 0 a1 = 0 ...;
  /* Poisson mean for the positive counts */
  xb = b0 + b1 * INCOME + ...;
  mu = exp(xb);
  /* Logit linear predictor for the probability of zero */
  xa = a0 + a1 * INCOME + ...;
  /* Probability for Zero */
  if y = 0 then p = exp(xa) / (1 + exp(xa));
  /* Probability for Zero-Truncated Poisson */
  else p = (1 - exp(xa) / (1 + exp(xa))) / (1 - exp(-mu)) * (exp(-mu) * mu ** y / fact(y));
  ll = log(p);
  model y ~ general(ll);
run;
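The two branches of the hurdle likelihood can be mirrored in a small Python sketch (not the SAS used in the talk) to confirm that they form a proper distribution. The linear predictor values 1.4 and 0.3 are hypothetical, standing in for `xa` and `xb` of a single account.

```python
import math

def poisson_pmf(k, mu):
    """P(Y = k) for Y ~ Poisson(mu)."""
    return math.exp(-mu) * mu**k / math.factorial(k)

def hurdle_pmf(y, p_zero, mu):
    """Hurdle probability: logit part for zero, zero-truncated Poisson for y > 0."""
    if y == 0:
        return p_zero
    truncated = poisson_pmf(y, mu) / (1.0 - math.exp(-mu))  # renormalized over y >= 1
    return (1.0 - p_zero) * truncated

xa, xb = 1.4, 0.3                            # hypothetical linear predictors
p_zero = math.exp(xa) / (1 + math.exp(xa))   # same expression as in the NLMIXED code
mu = math.exp(xb)

total = sum(hurdle_pmf(y, p_zero, mu) for y in range(60))
print(f"P(Y = 0) = {hurdle_pmf(0, p_zero, mu):.4f}, sum over y = {total:.6f}")
```

Because the truncated Poisson is renormalized over positive counts, the zero probability is governed entirely by the logit part.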
Hurdle Output
Logit Component                     Truncated Poisson Component
Drivers for Presence of Delinquency Drivers for Severity of Delinquency
Parameter     Estimate  Pr > |t|    Parameter     Estimate  Pr > |t|
B1_Intercept      1.92    <.0001    B2_Intercept      0.51      0.04
B1_Age            0.00      0.64    B2_Age           -0.01      0.34
B1_Income         0.01      0.95    B2_Income        -0.18      0.01
B1_Exp_inc        6.71      0.00    B2_Exp_inc      -13.81      0.00
B1_Avgexp         0.00      0.47    B2_Avgexp         0.00      0.97
B1_Ownrent        0.71    <.0001    B2_Ownrent       -0.41      0.00
B1_Selfempl      -0.06      0.81    B2_Selfempl      -0.05      0.80
B1_Depndt        -0.07      0.56    B2_Depndt         0.28      0.00
B1_Inc_per       -0.04      0.69    B2_Inc_per        0.25      0.00
B1_Cur_add        0.00      0.00    B2_Cur_add        0.00      0.70
B1_Major          0.18      0.35    B2_Major          0.22      0.09
B1_Active        -0.10    <.0001    B2_Active         0.04    <.0001
Hurdle Portfolio Prediction
[Figure: distribution of MajorDrg vs. Hurdle prediction and Truncated Poisson prediction by number of delinquencies (0 to 10+); vertical axis from 0% to 80%. Annotations: Un-normalized Truncated Poisson Distribution; Composite Distribution.]
Hurdle Segmentation
[Figure: distribution of delinquencies (0 to 10+) split into the No Delinquency segment (80%) and the Delinquency segment (20%); vertical axis from 0% to 80%.]
1. Segmentation Model:
Logistic Model separates BLUE from RED
2. Severity Model:
Truncated Poisson predicts severity of RED
Hurdle Account Prediction
[Figure: cumulative % of bads (vertical axis) vs. cumulative % of population (horizontal axis). Series: HDL score for 1+ delinquencies, HDL score for 2+ delinquencies, HDL score for 3+ delinquencies, logistic regression score.]
Zero-Inflated Poisson Model
Two-Component Assumption:
- Part of zeroes determined by Binomial distribution
- Rest of zeroes together with positive counts determined by standard Poisson distribution
2-Group Segmentation:
- Group without delinquency risk
- Group with delinquency risk
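$$f(Y_i \mid X_i) = \begin{cases} \omega_i + (1 - \omega_i)\,\exp(-\lambda_i) & \text{for } Y_i = 0 \\ \dfrac{(1 - \omega_i)\,\exp(-\lambda_i)\,\lambda_i^{Y_i}}{Y_i!} & \text{for } Y_i > 0 \end{cases}$$
where ω_i is the logit probability of belonging to the group without delinquency risk and λ_i = Exp(X_i β) is the Poisson mean.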
ZIP Model in SAS
proc nlmixed data = data;
  parms b0 = 0 b1 = 0 ... a0 = 0 a1 = 0 ...;
  /* Poisson mean for the group with delinquency risk */
  xb = b0 + b1 * INCOME + ...;
  mu = exp(xb);
  /* Logit linear predictor for the group without delinquency risk */
  xa = a0 + a1 * INCOME + ...;
  /* Probability for zero */
  if y = 0 then p = exp(xa) / (1 + exp(xa)) + (1 - exp(xa) / (1 + exp(xa))) * exp(-mu);
  /* Probability for Poisson after excluding zero */
  else p = (1 - exp(xa) / (1 + exp(xa))) * (exp(-mu) * mu ** y / fact(y));
  ll = log(p);
  model y ~ general(ll);
run;
ZIP Output
Logit Component                     Poisson Component
Drivers for Existence of Risk       Drivers for Severity of Risk
Parameter     Estimate  Pr > |t|    Parameter     Estimate  Pr > |t|
B1_Intercept      1.61      0.00    B2_Intercept      0.45      0.08
B1_Age           -0.01      0.31    B2_Age           -0.01      0.26
B1_Income        -0.12      0.31    B2_Income        -0.18      0.00
B1_Exp_inc        4.65      0.23    B2_Exp_inc       -7.26      0.01
B1_Avgexp         0.00      0.18    B2_Avgexp         0.00      0.52
B1_Ownrent        0.54      0.01    B2_Ownrent       -0.46      0.00
B1_Selfempl       0.01      0.98    B2_Selfempl       0.01      0.94
B1_Depndt         0.14      0.33    B2_Depndt         0.29      0.00
B1_Inc_per        0.17      0.24    B2_Inc_per        0.26      0.00
B1_Cur_add        0.00      0.00    B2_Cur_add        0.00      0.88
B1_Major          0.36      0.13    B2_Major          0.23      0.08
B1_Active        -0.09    <.0001    B2_Active         0.04    <.0001
ZIP Portfolio Prediction
[Figure: distribution of MajorDrg vs. ZIP prediction and Poisson prediction by number of delinquencies (0 to 10+); vertical axis from 0% to 80%. Annotations: Un-normalized Poisson Distribution; Composite Distribution.]
ZIP Segmentation
[Figure: distribution of delinquencies (0 to 10+) split into the No Delinquency Segment (72%) and the Potential Delinquency Segment (28%); vertical axis from 0% to 80%.]
Same outcome (zero delinquencies) but different risk implications:
1. Blue (72%): Established, free from financial risk
2. Red (8%): Vulnerable, might deteriorate in bad times
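The split of the zeroes into two groups follows from the ZIP zero probability. The Python sketch below (not the SAS used in the talk) reproduces the arithmetic at the portfolio level: 72% of accounts carry no risk, and the remaining 28% follow a Poisson whose own zero probability leaves 8% of all accounts at zero. Treating the portfolio as a single homogeneous account, and backing the Poisson mean out of the 72% and 8% figures, are assumptions made only for illustration; the fitted model varies both parts by account.

```python
import math

def zip_pmf(y, omega, mu):
    """Zero-Inflated Poisson probability with no-risk share omega and Poisson mean mu."""
    poisson = math.exp(-mu) * mu**y / math.factorial(y)
    if y == 0:
        return omega + (1.0 - omega) * poisson
    return (1.0 - omega) * poisson

omega = 0.72           # share in the segment without delinquency risk (from the slide)
at_risk_zeroes = 0.08  # share of all accounts that are at risk but still show zero

# Back out the Poisson mean consistent with those two shares (illustrative assumption)
mu = -math.log(at_risk_zeroes / (1.0 - omega))

p_zero = zip_pmf(0, omega, mu)               # total share of zeroes
share_no_risk_given_zero = omega / p_zero    # posterior: a zero belongs to the no-risk group
total = sum(zip_pmf(y, omega, mu) for y in range(60))

print(f"P(Y = 0) = {p_zero:.2f}, P(no risk | Y = 0) = {share_no_risk_given_zero:.2f}")
```

Under these assumptions, nine in ten zero-delinquency accounts sit in the no-risk segment, and the rest are zeroes that the Poisson component happens to produce.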
ZIP Account Prediction
[Figure: cumulative % of bads (vertical axis) vs. cumulative % of population (horizontal axis). Series: ZIP score for 1+ delinquencies, ZIP score for 2+ delinquencies, ZIP score for 3+ delinquencies, logistic regression score.]
Latent Class Poisson Model
General S-Component Assumption for S>= 2:
- Avoid sharp dichotomization
- Each case drawn from an unobserved Poisson component with different parameter
- S is determined by AIC / BIC
Segmentation assumed S = 2:
- Group with low risk
- Group with high risk
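$$f(Y_i \mid X_i) = \sum_{s=1}^{S} p_s \, \frac{\exp(-\lambda_{i|s})\,\lambda_{i|s}^{Y_i}}{Y_i!}$$
where p_s is the prior probability of component s and λ_{i|s} is the Poisson mean of case i in component s.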
LCP Model in SAS
proc nlmixed data = data;
  parms a0 = 0 ... b0 = 1 ...
        prior1 = 0 to 1 by 0.1;
  /* Probability of LC component 1 */
  xa = a0 + a1 * INCOME + ...; ma = exp(xa);
  pa = exp(-ma) * ma ** y / fact(y);
  /* Probability of LC component 2 */
  xb = b0 + b1 * INCOME + ...; mb = exp(xb);
  pb = exp(-mb) * mb ** y / fact(y);
  /* Mixture of the two components, weighted by the prior probability */
  p = prior1 * pa + (1 - prior1) * pb;
  ll = log(p);
  model y ~ general(ll);
run;
LCP Output
Latent Poisson Component 1          Latent Poisson Component 2
Drivers for Low Risk                Drivers for High Risk
Parameter     Estimate  Pr > |t|    Parameter     Estimate  Pr > |t|
B1_Intercept     -1.82    <.0001    B2_Intercept      0.31      0.42
B1_Age            0.00      0.92    B2_Age            0.00      0.74
B1_Income        -0.10      0.38    B2_Income        -0.17      0.06
B1_Exp_inc      -31.40      0.00    B2_Exp_inc       -4.73      0.06
B1_Avgexp         0.00      0.00    B2_Avgexp         0.00      0.29
B1_Ownrent       -0.97      0.00    B2_Ownrent       -0.47      0.00
B1_Selfempl       0.34      0.21    B2_Selfempl       0.40      0.37
B1_Depndt         0.10      0.55    B2_Depndt         0.27      0.06
B1_Inc_per        0.05      0.73    B2_Inc_per        0.23      0.05
B1_Cur_add        0.00    <.0001    B2_Cur_add        0.00      0.03
B1_Major         -0.27      0.30    B2_Major          0.19      0.24
B1_Active         0.09    <.0001    B2_Active         0.07    <.0001
LCP Portfolio Prediction
[Figure: distribution of MajorDrg vs. LC prediction, Low-Mean Poisson prediction and High-Mean Poisson prediction by number of delinquencies (0 to 10+); vertical axis from 0% to 80%. Annotations: Poisson Distribution of High Mean; Composite Distribution; Poisson Distribution of Low Mean.]
LCP Segmentation
[Figure: distribution of delinquencies (0 to 10+) split into the Low Risk Segment (87%) and the High Risk Segment (13%); vertical axis from 0% to 80%.]
$$\mathrm{Prob}(s \mid Y_i, X_i) = \frac{p_s\, f(Y_i \mid \lambda_{i|s})}{\sum_{s'=1}^{2} p_{s'}\, f(Y_i \mid \lambda_{i|s'})}$$
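The posterior class membership used for segmentation is a direct application of Bayes' rule. The Python sketch below (not the SAS used in the talk) computes it for a two-class Poisson mixture. The prior of 0.87 for the low-risk class is the Prior_Prob reported in the parameter comparison; the class means 0.2 and 2.5 are hypothetical values chosen only to illustrate the calculation.

```python
import math

def poisson_pmf(k, lam):
    """P(Y = k) for Y ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def posterior_class_probs(y, priors, lams):
    """Prob(s | Y = y) for each latent class s, by Bayes' rule."""
    joint = [p * poisson_pmf(y, lam) for p, lam in zip(priors, lams)]
    total = sum(joint)
    return [j / total for j in joint]

priors = [0.87, 0.13]  # low-risk and high-risk priors (0.87 from the slides)
lams = [0.2, 2.5]      # hypothetical class-specific Poisson means

post_zero = posterior_class_probs(0, priors, lams)  # account with no delinquency
post_four = posterior_class_probs(4, priors, lams)  # account with 4 delinquencies

print(f"Y = 0: P(low risk) = {post_zero[0]:.3f}, P(high risk) = {post_zero[1]:.3f}")
print(f"Y = 4: P(low risk) = {post_four[0]:.3f}, P(high risk) = {post_four[1]:.3f}")
```

Assigning each account to the class with the larger posterior gives the low-risk and high-risk segments without imposing a sharp cutoff on the count itself.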
LCP Account Prediction
[Figure: cumulative % of bads (vertical axis) vs. cumulative % of population (horizontal axis). Series: LC score for 1+ delinquencies, LC score for 2+ delinquencies, LC score for 3+ delinquencies, logistic regression score.]
~ 5% benefit at high-risk zone
Parameter Comparison
Parameters    Neg Binomial  Hurdle Logit Hurdle Trunc.     ZIP Logit   ZIP Poisson   LCP Class 1   LCP Class 2
Intercept            -1.73          1.92          0.51          1.61          0.45         -1.82          0.31
Age                   0.00          0.00         -0.01         -0.01         -0.01          0.00          0.00
Income               -0.07          0.01         -0.18         -0.12         -0.18         -0.10         -0.17
Exp_inc              -7.72          6.71        -13.81          4.65         -7.26        -31.40         -4.73
Avgexp                0.00          0.00          0.00          0.00          0.00          0.00          0.00
Ownrent              -0.78          0.71         -0.41          0.54         -0.46         -0.97         -0.47
Selfempl             -0.07         -0.06         -0.05          0.01          0.01          0.34          0.40
Depndt                0.19         -0.07          0.28          0.14          0.29          0.10          0.27
Inc_per               0.13         -0.04          0.25          0.17          0.26          0.05          0.23
Cur_add               0.00          0.00          0.00          0.00          0.00          0.00          0.00
Major                 0.03          0.18          0.22          0.36          0.23         -0.27          0.19
Active                0.12         -0.10          0.04         -0.09          0.04          0.09          0.07
Dispersion            3.52
Prior_Prob                                                                                  0.87
In Hurdle / ZIP, the 1st set of BETAs explains why an account is delinquent and the 2nd set explains how many delinquencies there will be.
Prediction Comparison
Overall, NB model fits the best
Hurdle / ZIP works better in excess zeroes
In Cherry-picking, all are comparable to Logistic regression
Implied Models: Hurdle NB / Zero-Inflated NB / Latent Class NB ?
Outcome  Observed      NB  Hurdle     ZIP     LCP
0            1060    1058    1060    1059    1045
1             137     144      98     105     155
2              50      52      65      72      49
3              24      24      38      42      26
4              17      13      21      22      16
5              11       8      10      10      10
6               5       5       5       5       6
7               6       3       3       2       4
8               0       2       1       1       2
9               2       2       1       1       1
10+             7       8      17       1       4
Total        1319    1319    1319    1319    1319
Model Comparison
Statistical Consideration:
Better Statistics, More Parsimonious => NB
Business Consideration:
Better Interpretation, More Insight => Hurdle / ZIP / LCP
Statistics             NB    Hurdle       ZIP       LCP
Log Likelihood    -982.40  -1007.00  -1018.30   -986.00
# of Parameters        13        24        24        25
AIC               1979.80   2040.00   2062.60   1999.00
BIC               2005.36   2088.89   2111.49   2050.01
Vuong Test                    -3.75     -2.11     -0.40
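For readers who want to recompute the information criteria, the Python sketch below (not the SAS used in the talk) applies the textbook definitions AIC = -2 logL + 2k and BIC = -2 logL + k ln(n) to the log likelihoods and parameter counts in the table, with n = 1319 taken from the total in the prediction comparison. The values these formulas produce differ from the AIC and BIC figures shown in the table, which may follow a different penalty convention; the ordering of the four models is the same either way.

```python
import math

n_obs = 1319  # number of accounts, from the prediction comparison total

# (log likelihood, number of parameters), copied from the statistics table
models = {
    "NB": (-982.40, 13),
    "Hurdle": (-1007.00, 24),
    "ZIP": (-1018.30, 24),
    "LCP": (-986.00, 25),
}

def aic(log_lik, k):
    """Textbook Akaike information criterion."""
    return -2.0 * log_lik + 2.0 * k

def bic(log_lik, k, n):
    """Textbook Bayesian (Schwarz) information criterion."""
    return -2.0 * log_lik + k * math.log(n)

table = {name: (aic(ll, k), bic(ll, k, n_obs)) for name, (ll, k) in models.items()}
rank_by_aic = sorted(table, key=lambda name: table[name][0])
rank_by_bic = sorted(table, key=lambda name: table[name][1])

for name in rank_by_aic:
    print(f"{name:<7} AIC = {table[name][0]:8.2f}  BIC = {table[name][1]:8.2f}")
```

Lower is better for both criteria, so NB ranks first on statistical grounds, in line with the slide's conclusion.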