CAS Predictive Modeling Seminar Practical Issues in Model Design
Chuck Boucek (312) 879-3859

Overview
• Data usually does not fit seamlessly into model assumptions
• The focus of this presentation is the impact that selected issues have on the design matrix
• Agenda
  – Overview of the Design Matrix
  – Non-linearity in predictors
  – Missing data

What is the Design Matrix?
• Representation of the predictor variables used to construct the model

Data
Class   State   AOI   Pop Density
65198   MA      125   .033
65198   IL      235   .032
70446   MA      240   .034
70446   FL      350   .044
64446   MA      100   .023
64446   IN      110   .025

Design Matrix
Intercept   Class (70446)   Class (64446)   ST MA   AOI   Pop Density
1           0               0               1       125   .033
1           0               0               0       235   .032
1           1               0               1       240   .034
1           1               0               0       350   .044
1           0               1               1       100   .023
1           0               1               0       110   .025

Design Matrix | Non-Linearity | Missing Data
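
The mapping from the data table to the design matrix can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function name `build_design_matrix` and the choice of 65198 as the base class level are assumptions inferred from the table above.

```python
# Raw records from the slide: (Class, State, AOI, Pop Density)
records = [
    ("65198", "MA", 125, 0.033),
    ("65198", "IL", 235, 0.032),
    ("70446", "MA", 240, 0.034),
    ("70446", "FL", 350, 0.044),
    ("64446", "MA", 100, 0.023),
    ("64446", "IN", 110, 0.025),
]


def build_design_matrix(rows, base_class="65198", state_flag="MA"):
    """Return (column_names, matrix): intercept, one indicator per non-base
    class level, a State == state_flag indicator, AOI and Pop Density.
    Hypothetical helper; the base level is an assumption."""
    # Non-base class levels, in order of first appearance
    levels = []
    for cls, _, _, _ in rows:
        if cls != base_class and cls not in levels:
            levels.append(cls)

    names = (["Intercept"] + [f"Class_{lvl}" for lvl in levels]
             + [f"ST_{state_flag}", "AOI", "PopDensity"])
    matrix = []
    for cls, state, aoi, density in rows:
        class_dummies = [1 if cls == lvl else 0 for lvl in levels]
        state_dummy = 1 if state == state_flag else 0
        matrix.append([1] + class_dummies + [state_dummy, aoi, density])
    return names, matrix


names, X = build_design_matrix(records)
for row in X:
    print(row)
```

Each categorical variable contributes one indicator column per non-base level, while continuous variables (AOI, Pop Density) enter as they are.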

How is a GLM Fit to Data?
• Linear predictors are transformed into estimates of the response data via the inverse link function
• Family and link function determine the form of the MLE
  – Family: Gaussian, Link: identity, MLE: maximize the log-likelihood

$$ l(y) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\Bigl(y_i - \sum_{j=1}^{p} x_{ij}\,a_j\Bigr)^2 $$

Linear Predictors = Design Matrix × Coefficients

LP1       1  0  0  1  125  .033       a1
LP2       1  0  0  0  235  .032       a2
LP3   =   1  1  0  1  240  .034   ×   a3
LP4       1  1  0  0  350  .044       a4
LP5       1  0  1  1  100  .023       a5
LP6       1  0  1  0  110  .025       a6
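
As a rough illustration of the two ideas on this slide, the sketch below computes the linear predictors as design matrix times coefficients and evaluates the Gaussian log-likelihood under the identity link. The coefficient values, response values and `sigma2` are hypothetical placeholders, not figures from the presentation.

```python
import math

# Design matrix from the slide (Intercept, Class 70446, Class 64446, ST MA, AOI, Pop Density)
X = [
    [1, 0, 0, 1, 125, 0.033],
    [1, 0, 0, 0, 235, 0.032],
    [1, 1, 0, 1, 240, 0.034],
    [1, 1, 0, 0, 350, 0.044],
    [1, 0, 1, 1, 100, 0.023],
    [1, 0, 1, 0, 110, 0.025],
]


def linear_predictors(design, coef):
    """LP_i = sum_j x_ij * a_j, i.e. design matrix times coefficient vector."""
    return [sum(x * a for x, a in zip(row, coef)) for row in design]


def gaussian_loglik(y, design, coef, sigma2):
    """Gaussian log-likelihood with identity link (fitted mean = linear predictor)."""
    n = len(y)
    fitted = linear_predictors(design, coef)
    rss = sum((yi - mi) ** 2 for yi, mi in zip(y, fitted))
    return -0.5 * n * math.log(2 * math.pi * sigma2) - rss / (2 * sigma2)


# Hypothetical coefficients a1..a6 and responses, for illustration only
coef = [0.95, 0.01, -0.02, 0.015, 0.0001, 0.5]
y = [0.99, 0.98, 1.01, 1.02, 0.97, 0.96]
print(linear_predictors(X, coef))
print(gaussian_loglik(y, X, coef, sigma2=0.01))
```

With other families or links only the inverse-link step and the likelihood change; the design matrix times coefficients step stays the same.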

Non-Linearity – Description of Issue
• GLMs fit linear patterns to data
• This produces a poor fit for certain predictor variables
• Splines can address non-linearity within a GLM
[Figure: plot with x-axis from 0 to 30 and y-axis from 0.980 to 1.005]

Natural Cubic Spline Characteristics
• 3rd-degree polynomial between the knots
• Continuous value, first derivative and second derivative at the knots
• Linear outside of the boundary knots
[Figure: plot with x-axis from 0 to 30 and y-axis from 0.980 to 1.005]
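
One way to build such a basis is the truncated power form given in Hastie, Tibshirani and Friedman. The sketch below assumes that construction; the slides do not say which basis was used, and the knot locations here are hypothetical.

```python
def natural_spline_basis(x, knots):
    """Non-linear columns N_3..N_K of the natural cubic spline basis
    (truncated power form); the constant and linear terms are left to the
    intercept and the predictor itself. Assumes knots are sorted ascending."""
    K = len(knots)

    def pos_cube(u):
        # Truncated cubic (u)_+^3
        return u ** 3 if u > 0 else 0.0

    def d(k):
        # d_k(x) = [(x - xi_k)_+^3 - (x - xi_K)_+^3] / (xi_K - xi_k)
        return (pos_cube(x - knots[k]) - pos_cube(x - knots[-1])) / (knots[-1] - knots[k])

    # N_{k+2}(x) = d_k(x) - d_{K-1}(x) for k = 1..K-2 (0-based: 0..K-3)
    return [d(k) - d(K - 2) for k in range(K - 2)]


knots = [5, 10, 20, 25]  # hypothetical knot locations on a 0-30 predictor
for x in (0, 7, 15, 22, 30):
    print(x, natural_spline_basis(x, knots))
```

Four knots give two non-linear basis columns. Below the first knot both columns are zero, and beyond the last knot the cubic and quadratic terms cancel so each column is linear in x.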

GLM with a Natural Spline
• Two columns are added to the design matrix
  – These columns are the spline basis
  – Two additional coefficients are needed
• GLM is fit with the same MLE and link function

Linear Predictors = Design Matrix × Coefficients (a1 through a8)

       Intercept   Class (70446)   Class (64446)   ST MA   AOI   Pop Density   Spline basis (2 columns)
LP1    1           0               0               1       125   .033          66.1     0.0
LP2    1           0               0               0       235   .032          401.3    75.1
LP3    1           1               0               1       240   .034          0.0      0.0
LP4    1           1               0               0       350   .044          98.4     0.6
LP5    1           0               1               1       100   .023          497.3    109.8
LP6    1           0               1               0       110   .025          0.0      0.0
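
A minimal sketch of how the two spline columns could be appended to the design matrix follows. It assumes a truncated power natural spline on AOI with four hypothetical knots; the slide does not state which predictor or knots were used, so the numbers will not match the table above.

```python
def natural_spline_columns(x, knots):
    """Non-linear natural cubic spline columns (truncated power form)."""
    def pos_cube(u):
        return u ** 3 if u > 0 else 0.0

    def d(k):
        return (pos_cube(x - knots[k]) - pos_cube(x - knots[-1])) / (knots[-1] - knots[k])

    last = len(knots) - 2
    return [d(k) - d(last) for k in range(len(knots) - 2)]


design = [
    [1, 0, 0, 1, 125, 0.033],
    [1, 0, 0, 0, 235, 0.032],
    [1, 1, 0, 1, 240, 0.034],
    [1, 1, 0, 0, 350, 0.044],
    [1, 0, 1, 1, 100, 0.023],
    [1, 0, 1, 0, 110, 0.025],
]
aoi_knots = (100, 150, 250, 350)  # hypothetical knots on AOI (column index 4)

# Each row gains two columns, so the model needs eight coefficients instead of six
augmented = [row + natural_spline_columns(row[4], aoi_knots) for row in design]

coef = [0.0] * 8  # placeholder coefficients a1..a8
linear_predictors = [sum(x * a for x, a in zip(row, coef)) for row in augmented]
for row in augmented:
    print(row)
```

The fitting step itself is unchanged: the same likelihood and link are applied to the wider design matrix.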

GLM with Natural Spline
• Proper reasonability testing
  – Statistical significance
  – Time consistency plot
[Figure: plot with x-axis from 0 to 30 and y-axis from 0.980 to 1.005]

Time Consistency Plot
[Figure: four panels labeled 1997, 1998, 1999 and 2000; x-axis from 0 to 25, y-axis from 0.99 to 1.01]

Missing Data – Description of Issue
• Missing data can present unique challenges in model creation

Data
Class   State   AOI   Pop Density
65198   MA      125   .033
65198   IL      235   .032
70446   MA      240   .034
70446   FL      350   .044
64446   MA      100   .023
64446   IN      110   NA

Design Matrix
Intercept   Class (70446)   Class (64446)   ST MA   AOI   Pop Density
1           0               0               1       125   .033
1           0               0               0       235   .032
1           1               0               1       240   .034
1           1               0               0       350   .044
1           0               1               1       100   .023
1           0               1               0       110   NA

Missing Data – Description of Issue
• What methodologies exist for addressing missing data?

Methodology #1
• Listwise Deletion: Eliminate any row in the design matrix with missing values

Intercept   Class (70446)   Class (64446)   ST MA   AOI   Pop Density
1           0               0               1       125   .033
1           0               0               0       235   .032
1           1               0               1       240   .034
1           1               0               0       350   .044
1           0               1               1       100   .023
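
Listwise deletion amounts to a row filter. The sketch below represents the missing Pop Density as `None`; the function name is a hypothetical choice for illustration.

```python
design = [
    [1, 0, 0, 1, 125, 0.033],
    [1, 0, 0, 0, 235, 0.032],
    [1, 1, 0, 1, 240, 0.034],
    [1, 1, 0, 0, 350, 0.044],
    [1, 0, 1, 1, 100, 0.023],
    [1, 0, 1, 0, 110, None],  # Pop Density is missing (NA on the slide)
]


def listwise_delete(rows):
    """Keep only rows in which every value is present."""
    return [row for row in rows if all(value is not None for value in row)]


complete = listwise_delete(design)
print(len(design), "rows before,", len(complete), "rows after")
```

The whole record is lost, including its known Class, State and AOI values, which is why this method discards information quickly when many rows have a missing field.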

Methodology #2
• Mean Imputation: Replace missing values with the mean of the values where data is present

Intercept   Class (70446)   Class (64446)   ST MA   AOI   Pop Density
1           0               0               1       125   .033
1           0               0               0       235   .032
1           1               0               1       240   .034
1           1               0               0       350   .044
1           0               1               1       100   .023
1           0               1               0       110   .033
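
The imputed value in the last row can be reproduced directly: the mean of the five observed Pop Density values is 0.0332, which rounds to the .033 shown. A minimal sketch, with a hypothetical helper name:

```python
design = [
    [1, 0, 0, 1, 125, 0.033],
    [1, 0, 0, 0, 235, 0.032],
    [1, 1, 0, 1, 240, 0.034],
    [1, 1, 0, 0, 350, 0.044],
    [1, 0, 1, 1, 100, 0.023],
    [1, 0, 1, 0, 110, None],  # Pop Density is missing
]


def mean_impute(rows, col):
    """Return a copy of rows with missing entries in column `col`
    replaced by the mean of the observed entries in that column."""
    observed = [row[col] for row in rows if row[col] is not None]
    mean = sum(observed) / len(observed)
    return [row[:col] + [mean if row[col] is None else row[col]] + row[col + 1:]
            for row in rows]


imputed = mean_impute(design, col=5)
print(round(imputed[5][5], 3))
```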

Methodology #3
• Linear Mean Imputation: Create the spline basis excluding missing values, then mean impute on the spline basis

Intercept   Class (70446)   Class (64446)   ST MA   AOI   Pop Density   Basis 2   Basis 3
1           0               0               1       125   .033          .109      .359
1           0               0               0       235   .032          .102      .328
1           1               0               1       240   .034          .116      .393
1           1               0               0       350   .044          .194      .852
1           0               1               1       100   .023          .053      .122
1           0               1               0       110   NA            NA        NA

Methodology #3
• Linear Mean Imputation: Create the spline basis excluding missing values, then mean impute on the spline basis

Intercept   Class (70446)   Class (64446)   ST MA   AOI   Pop Density   Basis 2   Basis 3
1           0               0               1       125   .033          .109      .359
1           0               0               0       235   .032          .102      .328
1           1               0               1       240   .034          .116      .393
1           1               0               0       350   .044          .194      .852
1           0               1               1       100   .023          .053      .122
1           0               1               0       110   .033          .115      .411
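
The imputed row can be checked numerically. In this example the two extra basis columns agree, to three decimals, with 100·x² and 10,000·x³ of Pop Density; that reading is inferred from the numbers and is not stated on the slide. Under that assumption, a sketch of the procedure is:

```python
pop_density = [0.033, 0.032, 0.034, 0.044, 0.023, None]


def basis(x):
    """Basis columns for one observed value. The scaled square and cube are
    an assumption inferred from the slide's numbers."""
    return [x, 100 * x ** 2, 10_000 * x ** 3]


def linear_mean_impute(values):
    """Build the basis from observed values only, then fill each basis
    column of a missing row with that column's mean."""
    observed = [basis(v) for v in values if v is not None]
    column_means = [sum(col) / len(observed) for col in zip(*observed)]
    return [basis(v) if v is not None else list(column_means) for v in values]


rows = linear_mean_impute(pop_density)
print([round(v, 3) for v in rows[5]])
```

Because every basis column is imputed with its own mean, the missing row contributes the average of the observed rows' linear-predictor contributions for this variable, rather than the basis evaluated at the mean value.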

Methodology #4
• Single Imputation: Use other predictor variables to build a model and impute missing values
  – Example: Model Pop Density based on AOI

Intercept   Class (70446)   Class (64446)   ST MA   AOI   Pop Density
1           0               0               1       125   .033
1           0               0               0       235   .032
1           1               0               1       240   .034
1           1               0               0       350   .044
1           0               1               1       100   .023
1           0               1               0       110   .027
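
A simple least-squares line of Pop Density on AOI, fit to the five complete rows of the data table in "Missing Data – Description of Issue", predicts about 0.0267 at AOI = 110, which rounds to the .027 shown. The sketch below assumes ordinary least squares; the slide does not say which model form was used.

```python
# (AOI, Pop Density) for the five rows where Pop Density is observed
complete = [(125, 0.033), (235, 0.032), (240, 0.034), (350, 0.044), (100, 0.023)]


def fit_line(pairs):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = fit_line(complete)
imputed = intercept + slope * 110  # AOI of the row with missing Pop Density
print(round(imputed, 3))
```

Unlike mean imputation, the filled-in value reflects what the other predictors say about the record, but it carries no residual noise, so it understates the uncertainty of the imputed value.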

Methodology #5
• Multiple Imputation: Use other predictor variables to model missing values
  – Multiple imputations are created based on the distribution of residuals in estimates of missing values

Three imputed design matrices, identical except for the imputed Pop Density in the last row:

Intercept   Class (70446)   Class (64446)   ST MA   AOI   Pop Density
1           0               0               1       125   .033
1           0               0               0       235   .032
1           1               0               1       240   .034
1           1               0               0       350   .044
1           0               1               1       100   .023
1           0               1               0       110   .025 (imputation 1), .029 (imputation 2), .027 (imputation 3)

Multiple Imputation Process
1. Choose starting values for the mean and covariance matrix of the predictor variables
2. Use the mean and covariance matrix to estimate regression parameters
3. Use the regression parameters to estimate missing values. Add a random draw from the residual normal distribution for that variable
4. Use the resulting data set to compute a new mean and covariance matrix
5. Make a random draw from the posterior distribution of the means and covariances
6. Using the random draw from step 5, go back to step 2 and cycle through the process until convergence is achieved
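
The six steps can be sketched for the two-variable case (AOI observed, Pop Density missing in one row). This is an illustrative sketch, not the algorithm as implemented in any particular package: the posterior in step 5 is assumed to be the normal / inverse-Wishart form that arises under a noninformative prior, and the burn-in of 100 iterations and spacing of 50 are hypothetical choices.

```python
import math
import random

# (AOI, Pop Density); None marks the missing value
data = [(125, 0.033), (235, 0.032), (240, 0.034), (350, 0.044), (100, 0.023), (110, None)]


def mean_cov(pairs):
    """Means and covariance entries (divisor n) of complete (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs) / n
    syy = sum((y - my) ** 2 for _, y in pairs) / n
    return (mx, my), (sxx, sxy, syy)


def chol2(sxx, sxy, syy):
    """Lower Cholesky factor (l11, l21, l22) of a 2x2 covariance matrix."""
    l11 = math.sqrt(sxx)
    l21 = sxy / l11
    return l11, l21, math.sqrt(syy - l21 ** 2)


def draw_posterior(pairs, rng):
    """Steps 4-5: recompute mean/covariance, then draw both from the assumed posterior."""
    n = len(pairs)
    (mx, my), (sxx, sxy, syy) = mean_cov(pairs)
    a, b, c = n * sxx, n * sxy, n * syy            # scale matrix Lambda = n * S
    det = a * c - b * b
    l11, l21, l22 = chol2(c / det, -b / det, a / det)  # factor of Lambda^-1
    wxx = wxy = wyy = 0.0
    for _ in range(n - 1):                         # W ~ Wishart(n - 1, Lambda^-1)
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        u, v = l11 * z1, l21 * z1 + l22 * z2
        wxx, wxy, wyy = wxx + u * u, wxy + u * v, wyy + v * v
    wdet = wxx * wyy - wxy * wxy
    cov = (wyy / wdet, -wxy / wdet, wxx / wdet)    # Sigma = W^-1
    m11, m21, m22 = chol2(cov[0] / n, cov[1] / n, cov[2] / n)
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)      # mu | Sigma ~ N(mean, Sigma / n)
    return (mx + m11 * z1, my + m21 * z1 + m22 * z2), cov


def impute_once(mean, cov, rng):
    """Steps 2-3: regression implied by (mean, cov), plus a residual draw."""
    (mx, my), (sxx, sxy, syy) = mean, cov
    slope = sxy / sxx
    intercept = my - slope * mx
    resid_sd = math.sqrt(max(syy - slope * sxy, 0.0))
    return [(x, y if y is not None else intercept + slope * x + rng.gauss(0, resid_sd))
            for x, y in data]


def run_chain(n_iter, seed=0):
    rng = random.Random(seed)
    # Step 1: starting values from the complete rows
    mean, cov = mean_cov([p for p in data if p[1] is not None])
    chain = []
    for _ in range(n_iter):                        # step 6: cycle
        completed = impute_once(mean, cov, rng)
        mean, cov = draw_posterior(completed, rng)
        chain.append(completed)
    return chain


chain = run_chain(200)
imputations = chain[99::50]  # hypothetical burn-in (100) and spacing (50)
for completed in imputations:
    print(round(completed[5][1], 4))
```

Observed values never change across iterations; only the missing entry is redrawn, so the variation between the retained data sets reflects the uncertainty in the imputed value.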

Multiple Imputation Process
• Assumptions underlying multiple imputation algorithms
  – Data is Missing At Random: missingness of predictor variable “V” cannot depend on the value of “V” but can depend on the values of other predictor variables
  – Data follows a multivariate normal distribution
• Two issues that must be addressed
  – Initial convergence of iterations
  – Correlation of consecutive iterations

Time Series Plot
• Initial convergence is assessed via a time series plot
[Figure: time series plot; x-axis "Iteration Number" from 0 to 100, y-axis from 0.84 to 0.94]

Autocorrelation Plot
• The spread between iterations (how far apart the iterations used as imputations need to be) is assessed via an autocorrelation plot
[Figure: autocorrelation plot; x-axis "lag" from 0 to 100, y-axis "ACF" from 0.0 to 1.0]
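
The quantity behind an autocorrelation plot is straightforward to compute for a parameter tracked across iterations. In the sketch below the chain is simulated from a hypothetical autoregressive process, standing in for a tracked parameter; the 0.05 cutoff used to pick a spacing is likewise an illustrative assumption.

```python
import random


def autocorrelation(series, max_lag):
    """Sample autocorrelation of `series` at lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    total = sum((v - mean) ** 2 for v in series)
    return [
        sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)) / total
        for lag in range(max_lag + 1)
    ]


def first_lag_below(acf, threshold=0.05):
    """Smallest lag whose autocorrelation is below the threshold in absolute value."""
    for lag, value in enumerate(acf):
        if abs(value) < threshold:
            return lag
    return None


# Hypothetical chain of a tracked parameter (simulated for illustration)
rng = random.Random(1)
chain = [0.9]
for _ in range(999):
    chain.append(0.9 + 0.8 * (chain[-1] - 0.9) + rng.gauss(0, 0.01))

acf = autocorrelation(chain, 100)
spacing = first_lag_below(acf)
print(spacing)
```

Imputations taken at least `spacing` iterations apart can then be treated as approximately uncorrelated draws.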

Testing of Missing Value Methods
• Method #1
  – Created training and holdout data sets
    • Both contained missing data
  – Built models of claim frequency under different missing value analysis methods with the training data set
    • Identical predictor variables in all models
  – Compared results (deviance) of methods in the data set where all data is present

Testing of Missing Value Options
• Method #2
  – Created a model of missing probability
  – Limited the modeling database to observations in which all data was present
  – Randomly generated missing values based on missing probability
    • 100 iterations
  – Built models of claim frequency under different missing value analysis methods
    • Identical predictor variables in all models
  – Compared results (deviance) of methods in the data set where all data is present
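
Two pieces of this test procedure lend themselves to a short sketch: blanking out values according to a per-record missing probability, and scoring fitted frequencies by deviance. The Poisson form of the deviance is an assumption (the slides say only that claim frequency models were compared on deviance), and all values below are hypothetical.

```python
import math
import random


def mask_by_probability(values, probs, rng):
    """Set each value to None with its own missing probability."""
    return [None if rng.random() < p else v for v, p in zip(values, probs)]


def poisson_deviance(y, mu):
    """Poisson deviance: 2 * sum(y * ln(y / mu) - (y - mu)), with 0 * ln(0) = 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        log_term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += log_term - (yi - mi)
    return 2.0 * total


# Hypothetical complete predictor values and modeled missing probabilities
pop_density = [0.033, 0.032, 0.034, 0.044, 0.023, 0.025]
missing_prob = [0.05, 0.10, 0.05, 0.30, 0.20, 0.40]

rng = random.Random(0)
missing_counts = []
for _ in range(100):  # 100 iterations, as on the slide
    masked = mask_by_probability(pop_density, missing_prob, rng)
    missing_counts.append(sum(v is None for v in masked))
    # A model would be fit here under each missing value method and scored
    # with poisson_deviance on the fully observed data.

print(sum(missing_counts) / len(missing_counts))
print(poisson_deviance([0, 1, 2], [0.2, 0.9, 1.7]))
```

Because the true values are known for every record that was masked, each method can be judged against the same complete data.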

Performance of Missing Value Methods
1. Single Imputation/Multiple Imputation
2. Linear Mean Imputation
3. Mean Imputation
4. Listwise Deletion

Missing Data Framework
• Questions
  – What is the level of missing data?
  – What can be inferred about the missing data mechanism?
  – What is the size of the modeling database in which all values are present?
  – Will the data continue to be missing when the model is applied?

Missing Data Framework
• Actions
  – For low proportions of missing data: Listwise Deletion
  – For higher proportions of missing data in a large modeling database: Listwise Deletion with oversampling
  – For mid to small modeling databases: employ imputation
    • Initial exploration with Linear Mean Imputation
    • Fit final model with Single Imputation or Multiple Imputation

Sources
• Splines
  – Hastie, Tibshirani and Friedman: The Elements of Statistical Learning
• Missing Data
  – Paul Allison: Missing Data
  – J.L. Schafer: Analysis of Incomplete Multivariate Data
  – Insightful Corporation: Analyzing Data with Missing Values in S-Plus