© Deloitte Consulting, 2004
Model Validation and Bootstrapping
James Guszcza, FCAS, MAAA
CAS Predictive Modeling Seminar
Chicago
October, 2004
Agenda
The Problem of Model Validation
Use of Out-of-Sample Data
Lift Curves & Gains Charts
Cross-Validation
CV example: Pruning Decision Trees
Bootstrapping
The Problem of Model Validation
Why We All Need Validation
1. Business reasons: we need to choose the best model, measure the accuracy/power of the selected model, and, ideally, measure the ROI of the modeling project.
2. Statistical reasons: model-building techniques are inherently designed to minimize "loss" or "bias". To an extent, a model will always fit "noise" as well as "signal". If you fit a number of models on a given dataset and simply choose the "best" one, its apparent performance will likely be overly "optimistic".
Some Definitions
Target variable Y: what we are trying to predict. Examples: profitability (loss ratio, LTV), retention, ...
Predictive variables {X1, X2, ..., XN}: "covariates" used to make predictions. Examples: policy age, credit, # vehicles, ...
Predictive model Y = f(X1, X2, ..., XN): a "scoring engine" that estimates the unknown value Y based on the known values {Xi}.
The Problem of Overfitting
Left to their own devices, modeling techniques will “overfit” the data.
Classic example: multiple regression. Every time you add a variable to the regression, the model's R² goes up.
Naïve interpretation: every additional predictive variable helps explain yet more of the target's variance. But that can't be true!
Left to its own devices, multiple regression will fit too many patterns.
One reason why modeling requires subject-matter expertise.
The Perils of Optimism
Error on the dataset used to fit the model can be misleading: it doesn't predict future performance.
Too much complexity can diminish a model's accuracy on future data.
Sometimes called the Bias-Variance Tradeoff.
[Figure: Training vs Test Error. Prediction error vs. complexity (# nnet cycles) for train and test data; training error keeps falling with complexity while test error eventually rises, moving from high bias/low variance (left) to low bias/high variance (right).]
The Bias-Variance Tradeoff
A complex model has low "bias": the model fit is good, i.e., the fitted value is close to the data's expected value.
But it has high "variance": the model is more likely to make a wrong prediction on new data.
Bias alone is not the name of the game.
[Figure: Training vs Test Error plot, repeated from the previous slide.]
The Bias-Variance Tradeoff
The tradeoff is quite generic; complexity takes a different form in each technique:
Regression: # variables
Decision trees: size of tree
Neural nets: # nodes, # training cycles
MARS: # basis functions
[Figure: Training vs Test Error plot, repeated from the previous slide.]
Curb Your Enthusiasm
In multiple regression, use adjusted R² rather than simple R². A "penalty" is built into adjusted R²: each additional variable raises R² but also increases the penalty, so the net effect on adjusted R² can be positive or negative. Adjusted R² attempts to estimate prediction error on fresh data.
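(For reference, not from the original slides: one standard form of the penalty is adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors; the (n − 1)/(n − p − 1) factor grows with each added variable.)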
This is one instance of a general idea: we need to find ways of measuring and controlling a technique's propensity to fit every pattern in sight.
How to Curb Your Enthusiasm
1. Adopt goodness-of-fit measures that penalize model complexity. No hold-out data needed: adjusted R², Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC).
2. Or... use out-of-sample data! Rely more on the data, less on penalized likelihood. AIC and the other penalized measures try to approximate the use of out-of-sample data to measure prediction error.
Using Out-of-Sample Data
Holdout Data
Lift Curves & Gains Charts
Validation Data
Cross-Validation
Out-of-Sample Data
Simplest idea: divide the data into 2 pieces.
Training data: data used to fit the model.
Test data: "fresh" data used to evaluate the model.
The test data contains the actual target value Y and the model prediction Y*.
We can find clever ways of displaying the relation between Y and Y*: lift curves, gains charts, ROC curves, ...
Lift Curves
Sort the test data by Y* (score) and break it into 10 equal pieces.
Best "decile": lowest scores, hence lowest loss ratio (LR).
Worst "decile": highest scores, hence highest LR.
The difference between them is the "lift".
Lift measures segmentation power and the ROI of the modeling project.
[Figure: lift curve on test data. LR relativity by decile (1-10); the plotted decile relativities range from about -40% in the best deciles to about +50% in the worst.]
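As a concrete illustration (not from the original presentation), here is a minimal Python sketch of how such a decile lift chart can be computed on holdout data; the column names score, premium, and loss are hypothetical:

```python
import pandas as pd

def decile_lift(test: pd.DataFrame) -> pd.Series:
    """LR relativity by model-score decile on holdout data.

    Assumes hypothetical columns: 'score' (model prediction),
    'premium', and 'loss'.
    """
    df = test.copy()
    # Rank by score and cut into 10 equal-size groups (deciles)
    df["decile"] = pd.qcut(df["score"].rank(method="first"), 10,
                           labels=list(range(1, 11)))
    overall_lr = df["loss"].sum() / df["premium"].sum()
    grp = df.groupby("decile", observed=True)
    decile_lr = grp["loss"].sum() / grp["premium"].sum()
    # Express each decile's loss ratio relative to the overall loss ratio
    return decile_lr / overall_lr - 1.0
```

A relativity of -40% in the first decile, for example, means that decile's loss ratio is 40% better than the overall average.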
Gains Charts: Binary Target
Y is {0,1}-valued: fraud, defection, cross-sell, ...
Sort the data by Y* (score). For each data point, calculate the % of "1"s captured vs. the % of the population considered so far.
Gain: get 90% of the fraudsters by focusing on 40% of the population.
[Figure: "Fraud Detection - Gains Chart". Percent of fraud captured vs. percent of total population, for the decision tree model and a perfect model.]
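A similarly minimal sketch (illustrative only, not from the slides) of the gains-chart calculation for a binary target:

```python
import numpy as np

def gains_curve(y_true, y_score):
    """Sort by score (highest first) and return cumulative % of the
    population considered vs. cumulative % of the '1's captured."""
    order = np.argsort(-np.asarray(y_score))
    y = np.asarray(y_true)[order]
    pct_population = np.arange(1, len(y) + 1) / len(y)
    pct_ones = np.cumsum(y) / y.sum()
    return pct_population, pct_ones
```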
Model Selection vs. Validation
Suppose we've gone through an iterative model-building process: fit several models on the training data, tested/compared them on the test data, and selected the "best" model.
The test lift curve of the best model might still be overly optimistic. Why? Because we used the test data to select the best model; implicitly, it was used for modeling.
Validation Data
It is therefore preferable to divide the data into three pieces:
Training data: data used to fit the models.
Test data: "fresh" data used to select a model.
Validation data: data used to evaluate the final, selected model.
The train/test data is used iteratively for model building and model selection. During this time, the validation data is set aside in a "lock box".
Validation Data
The model lift on train data is overly optimistic. The lift on test data might be somewhat optimistic as well.
The Validation lift curve is a realistic estimate of future performance.
[Figure: lift curves (LR relativity to average, by decile 1-10) for the train, test, and validation datasets.]
Validation Data
This method is the best of all worlds: train/test is a good way to select an optimal model, and the validation lift is a realistic estimate of future performance.
Assuming you have enough data!
[Figure: train/test/validation lift curves, repeated from the previous slide.]
Cross-Validation
What if we don’t have enough data to set aside a test dataset?
Cross-Validation: Each data point is used both as train and test data.
Basic idea: fit the model on 90% of the data and test it on the other 10%. Now do this on a different 90/10 split, and cycle through all 10 such splits. Ten "folds" is a common rule of thumb.
Cross-Validation
Divide data into 10 equal pieces P1…P10.
Fit 10 models, each on 90% of the data.
Each data point is treated as an out-of-sample data point by exactly one of the models.
model P1 P2 P3 P4 P5 P6 P7 P8 P9 P10
1 train train train train train train train train train test
2 train train train train train train train train test train
3 train train train train train train train test train train
4 train train train train train train test train train train
5 train train train train train test train train train train
6 train train train train test train train train train train
7 train train train test train train train train train train
8 train train test train train train train train train train
9 train test train train train train train train train train
10 test train train train train train train train train train
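A minimal sketch of this scheme (using scikit-learn's KFold, with a simple linear model as a placeholder for whatever model is actually being validated):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def cv_out_of_sample_scores(X, y, n_folds=10):
    """Give every data point an out-of-sample score: each point is
    predicted by the one model (of the 10) that did not see it
    during training."""
    X, y = np.asarray(X), np.asarray(y)
    scores = np.empty(len(y))
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    for train_idx, test_idx in kf.split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        scores[test_idx] = model.predict(X[test_idx])
    return scores  # can be fed into a lift curve or gains chart
```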
Cross-Validation
Collect the scores from the "test" cells on the diagonal of the table below...
...and you have an out-of-sample lift curve based on the entire dataset,
even though the entire dataset was also used to fit the models.
model P1 P2 P3 P4 P5 P6 P7 P8 P9 P10
1 train train train train train train train train train test
2 train train train train train train train train test train
3 train train train train train train train test train train
4 train train train train train train test train train train
5 train train train train train test train train train train
6 train train train train test train train train train train
7 train train train test train train train train train train
8 train train test train train train train train train train
9 train test train train train train train train train train
10 test train train train train train train train train train
Uses of Cross-Validation
Model evaluation: collect the scores from the held-out "test" cells and generate a lift curve or gains chart. This simulates the effect of using the train/test method.
Model selection: index your models by some parameter α (# variables in a regression, # neural net nodes, # leaves in a tree) and choose the α value resulting in the lowest CV error rate.
Model Selection Example
Use CV to select an optimal decision tree. This is built into the Classification & Regression Tree (CART) algorithm.
Basic idea: "grow the tree" out as far as you can, then "prune back". CV tells you when to stop pruning.
How Trees Grow
Goal: partition the dataset so that each partition ("node") is as pure as possible.
How: find the yes/no split (Xi < θ) that results in the greatest increase in purity. A split is a variable/value combination.
Now do the same thing to the two resulting nodes, and keep going until you've exhausted the data.
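A toy sketch of the split search for a binary target and a single predictor, using Gini impurity as the purity measure (purely illustrative, not the CART implementation):

```python
import numpy as np

def gini(y):
    """Gini impurity of a set of 0/1 labels."""
    p = np.mean(y)
    return 2 * p * (1 - p)

def best_split(x, y):
    """Scan candidate thresholds theta for one variable x and return the
    (theta, impurity decrease) of the best yes/no split x < theta."""
    x, y = np.asarray(x), np.asarray(y)
    best = (None, 0.0)
    for theta in np.unique(x)[1:]:
        left, right = y[x < theta], y[x >= theta]
        # Weighted decrease in impurity from making this split
        decrease = gini(y) - (len(left) * gini(left)
                              + len(right) * gini(right)) / len(y)
        if decrease > best[1]:
            best = (theta, decrease)
    return best
```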
How Trees Grow
Suppose we are predicting fraudsters.
Ideally, each "leaf" would contain either 100% fraudsters or 100% non-fraudsters.
The more you split, the purer the nodes become (low bias).
But how do we know we're not over-fitting (high variance)?
Finding the Right Tree
“Inside every big tree is a small, perfect tree waiting to come out.”
--Dan Steinberg
The optimal tradeoff of bias and variance.
But how to find it??
Growing & Pruning
One approach: stop growing the tree early.
But how do you know when to stop?
CART: just grow the tree all the way out; then prune back.
Sequentially collapse nodes that result in the smallest change in purity.
“weakest link” pruning.
Cost-Complexity Pruning
Definition: the cost-complexity criterion is Rα = MC + αL, where
MC = misclassification rate, relative to the # of misclassifications in the root node, and
L = # leaves (terminal nodes).
You get credit for a lower MC, but you also pay a penalty for more leaves.
Let T0 be the biggest tree. Find the sub-tree Tα of T0 that minimizes Rα: the optimal trade-off of accuracy and complexity.
Weakest-Link Pruning
Let's sequentially collapse the nodes that result in the smallest change in purity.
This gives a nested sequence of trees, all sub-trees of T0: T0 ⊃ T1 ⊃ T2 ⊃ T3 ⊃ ... ⊃ Tk ⊃ ...
Theorem: the sub-tree Tα of T0 that minimizes Rα is in this sequence!
This gives us a simple strategy for finding the best tree: find the tree in the above sequence that minimizes the CV misclassification rate.
What is the Optimal Size?
Note that α is a free parameter in Rα = MC + αL; there is a 1:1 correspondence between α and the size of the tree.
What value of α should we choose?
α = 0: the maximum tree T0 is best.
α very large: you never get past the root node.
The truth lies in the middle: use cross-validation to select the optimal α (i.e., tree size).
Finding α
Fit 10 trees, one on each "train" portion of the data (see the table below).
Test each tree on the corresponding held-out "test" piece.
Keep track of misclassification rates for different values of α.
Now go back to the full dataset and choose the tree corresponding to the best α.
model P1 P2 P3 P4 P5 P6 P7 P8 P9 P10
1 train train train train train train train train train test
2 train train train train train train train train test train
3 train train train train train train train test train train
4 train train train train train train test train train train
5 train train train train train test train train train train
6 train train train train test train train train train train
7 train train train test train train train train train train
8 train train test train train train train train train train
9 train test train train train train train train train train
10 test train train train train train train train train train
How to Cross-Validate
Grow the tree on all the data: T0. Now break the data into 10 equal-size pieces.
10 times: grow a tree on 90% of the data.
Drop the remaining 10% (test data) down the nested sub-trees corresponding to each value of α.
For each α, add up the errors across all 10 test datasets, and keep track of the α with the lowest test error.
That α corresponds to one of the nested trees Tk ⊂ T0.
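A minimal sketch of the same recipe using scikit-learn's cost-complexity pruning (a stand-in for the CART software referenced in the talk): grow the full tree, obtain the weakest-link sequence of α values, and pick the α with the lowest cross-validated error.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def prune_by_cv(X, y, cv=10):
    """Grow the full tree, get the weakest-link alpha sequence, and
    refit on all the data with the alpha that minimizes CV error."""
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
    cv_error = [
        1 - cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                            X, y, cv=cv).mean()
        for a in path.ccp_alphas
    ]
    best_alpha = path.ccp_alphas[int(np.argmin(cv_error))]
    return DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
```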
Just Right
Relative error: the proportion of CV test cases misclassified.
According to CV, the 15-node tree is nearly optimal.
In summary: grow the tree all the way out, then weakest-link prune back to the 15-node tree.
[Figure: cross-validated relative error vs. tree size (1 to 21) and complexity parameter cp (Inf down to 0.0036).]
The Bootstrap
A Simulation-Based Technique for Estimating Distributions
The Bootstrap
The statistician Brad Efron proposed a very simple and clever idea for mechanically estimating confidence intervals: the bootstrap.
The idea is to take multiple resamples of your original dataset and compute the statistic of interest on each resample; you thereby estimate the distribution of this statistic!
Motivating Example
Suppose we take 1000 draws from the normal(500, 100) distribution.
Sample mean ≈ 500: what we expect, and a point estimate of the "true" mean.
From theory we know that:
s.d.(X̄) = 100/√1000 ≈ 3.16
[Table of sample statistics: N = 1000; MIN = 181.15; MAX = 836.87; MEAN = 499.23; STD = 98.96]
Sampling with Replacement
Draw a data point at random from the dataset, then throw it back in.
Draw a second data point, then throw it back in...
Keep going until we've got 1000 data points. You might call this a "pseudo" dataset.
This is not merely re-sorting the data: some of the original data points will appear more than once; others won't appear at all.
Sampling with Replacement
In fact, there is a chance of (1 − 1/1000)^1000 ≈ 1/e ≈ 0.368 that any given original data point won't appear at all if we sample with replacement 1000 times.
So any given data point is included with probability ≈ 0.632.
Intuitively, we treat the original sample as the "true population in the sky".
Each resample simulates the process of taking a sample from the "true" distribution.
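These probabilities are easy to verify numerically; a small illustrative sketch:

```python
import numpy as np

n = 1000
p_excluded = (1 - 1 / n) ** n      # ≈ 1/e ≈ 0.368
p_included = 1 - p_excluded        # ≈ 0.632

# Empirical check: fraction of distinct original points in one resample
rng = np.random.default_rng(0)
resample_idx = rng.integers(0, n, size=n)          # indices drawn with replacement
frac_included = len(np.unique(resample_idx)) / n   # typically close to 0.632
```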
Resampling
Sample with replacement 1000 data points from the original dataset S; call this S*1.
Now do this 399 more times: S*1, S*2, ..., S*400.
Compute X̄ on each of these 400 samples.
[Diagram: the original sample S and bootstrap resamples S*1, S*2, ..., S*N.]
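A minimal sketch of the whole procedure for the motivating example above (400 bootstrap resamples of a 1000-point normal(500, 100) sample):

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(loc=500, scale=100, size=1000)   # the original sample

# 400 bootstrap resamples of size 1000, each drawn with replacement from S
boot_means = np.array([rng.choice(S, size=len(S), replace=True).mean()
                       for _ in range(400)])

print(boot_means.mean())  # close to 500
print(boot_means.std())   # close to 100 / sqrt(1000) ≈ 3.16
```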
The Result
The green bars are a histogram of the sample means of S*1, ..., S*400.
The blue curve is a normal distribution with the sample mean and s.d.
The red curve is a kernel density estimate of the distribution underlying the histogram (intuitively, a smoothed histogram).
The Result
The result is an estimate of the distribution of X̄.
Notice that it is normal with mean ≈ 500 and s.d. ≈ 3.2.
The purely mechanical bootstrapping procedure produces what theory tells us to expect.
We can also apply this technique to statistics with unknown distributions... like loss ratio.
Bootstrapping Loss Data
The same idea can be applied to insurance statistics: loss ratio, frequency, customer lifetime value, outstanding reserves, ...
This is valuable because these statistics do not necessarily have well-behaved distributions, i.e., there are no handy results from probability theory that tell us how to create confidence intervals.
[Figures: empirical distributions of loss, log(loss), loss ratio, capped LR, and frequency.]
Bootstrapping & Validation
This is interesting in its own right, but bootstrapping also relates back to model validation, along the lines of cross-validation.
You can fit models on bootstrap resamples of your data.
For each resample, test the model on the ≈ 36.8% of the data not drawn into that resample.
The result will be biased, but corrections are available.
You get a spectrum of lift curves.
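A minimal sketch of this out-of-bag idea (a simple linear model stands in for the real model, and the bias correction mentioned above is omitted):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def out_of_bag_predictions(X, y, n_boot=200, seed=0):
    """Fit a model on each bootstrap resample and score only the
    out-of-bag rows (the ~36.8% not drawn into that resample)."""
    X, y = np.asarray(X), np.asarray(y)
    n = len(y)
    rng = np.random.default_rng(seed)
    pred_sum, pred_count = np.zeros(n), np.zeros(n)
    for _ in range(n_boot):
        in_bag = rng.integers(0, n, size=n)        # row indices, with replacement
        oob = np.setdiff1d(np.arange(n), in_bag)   # rows left out of this resample
        model = LinearRegression().fit(X[in_bag], y[in_bag])
        pred_sum[oob] += model.predict(X[oob])
        pred_count[oob] += 1
    # Average out-of-bag prediction per row (guard against a row never left out)
    return pred_sum / np.maximum(pred_count, 1)
```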
Closing Thoughts
The "cross-validation" approach has several nice features:
It relies on the data, not on likelihood theory, etc.
It comports nicely with the lift curve concept.
It allows model validation that has both business and statistical meaning.
It is generic: it can be used to compare models generated by competing techniques, or even pre-existing models.
It can be performed on different sub-segments of the data.
It is very intuitive and easily grasped.
Closing Thoughts
Bootstrapping has a family resemblance to cross-validation: both use the data itself to estimate features of a statistic or a model that we previously relied on statistical theory to give us.
Both are classic examples of the "data mining" (in the non-pejorative sense of the term!) mindset: leverage modern computers to "do it yourself" rather than look up a formula in a book.
They are generic tools that can be used creatively.
They can be used to estimate model bias and variance.
They can be used to estimate (simulate) distributional characteristics of very difficult statistics.
They are ideal for many actuarial applications.