CAS Predictive Modeling Seminar Practical Issues in Model Design

Chuck Boucek (312) 879-3859

Overview

• Data usually does not seamlessly fit into model assumptions

• The focus of this presentation is the impact that selected issues have on the design matrix

• Agenda
  – Overview of the Design Matrix
  – Non-linearity in predictors
  – Missing data

What is the Design Matrix?

• Representation of the predictor variables used to construct the model

Data

Class   State   AOI   Pop Density
65198   MA      125   .033
65198   IL      235   .032
70446   MA      240   .034
70446   FL      350   .044
64446   MA      100   .023
64446   IN      110   .025

Design Matrix

Intercept   Class 70446   Class 64446   ST MA   AOI   Pop Density
1           0             0             1       125   .033
1           0             0             0       235   .032
1           1             0             1       240   .034
1           1             0             0       350   .044
1           0             1             1       100   .023
1           0             1             0       110   .025
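The mapping from the data table to the design matrix can be sketched in Python with NumPy. This is a minimal illustration rather than the presenter's tooling: the helper name `build_design_matrix` is hypothetical, and the choice of class 65198 as the base level and of a single MA indicator for State is inferred from the columns shown on the slide.

```python
import numpy as np

# Rows of the example data: (Class, State, AOI, Pop Density)
data = [
    ("65198", "MA", 125, 0.033),
    ("65198", "IL", 235, 0.032),
    ("70446", "MA", 240, 0.034),
    ("70446", "FL", 350, 0.044),
    ("64446", "MA", 100, 0.023),
    ("64446", "IN", 110, 0.025),
]

def build_design_matrix(rows):
    """Hypothetical helper: one design-matrix row per data row."""
    out = []
    for cls, state, aoi, dens in rows:
        out.append([
            1.0,                              # intercept
            1.0 if cls == "70446" else 0.0,   # Class 70446 indicator
            1.0 if cls == "64446" else 0.0,   # Class 64446 indicator
            1.0 if state == "MA" else 0.0,    # State = MA indicator
            float(aoi),                       # AOI enters as a continuous column
            dens,                             # Pop Density enters as a continuous column
        ])
    return np.array(out)

X = build_design_matrix(data)
print(X)
```

Categorical variables become indicator columns, while continuous variables are carried over unchanged.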


How is a GLM Fit to Data?

• Linear predictors are transformed to estimates of the response data via the inverse link function

• Family and link function determine the form of the MLE
  – Family: Gaussian, Link: identity, MLE:

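\[
\ell(y) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\Bigl(y_i - \sum_{j=1}^{p} x_{ij}\,a_j\Bigr)^2
\]

Design Matrix x Coefficients = Linear Predictors

\[
\begin{pmatrix}
1 & 0 & 0 & 1 & 125 & .033\\
1 & 0 & 0 & 0 & 235 & .032\\
1 & 1 & 0 & 1 & 240 & .034\\
1 & 1 & 0 & 0 & 350 & .044\\
1 & 0 & 1 & 1 & 100 & .023\\
1 & 0 & 1 & 0 & 110 & .025
\end{pmatrix}
\begin{pmatrix} a_1\\ a_2\\ a_3\\ a_4\\ a_5\\ a_6 \end{pmatrix}
=
\begin{pmatrix} LP_1\\ LP_2\\ LP_3\\ LP_4\\ LP_5\\ LP_6 \end{pmatrix}
\]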
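For the Gaussian family with identity link, maximizing the log-likelihood over the coefficients is equivalent to minimizing the sum of squared residuals, so ordinary least squares gives the MLE. The sketch below illustrates this under stated assumptions: the response values `y` are invented for illustration, and only the intercept and AOI columns are used to keep the example small.

```python
import numpy as np

def gaussian_loglik(y, X, a, sigma2):
    """Gaussian log-likelihood with identity link: mean = X @ a."""
    n = len(y)
    resid = y - X @ a
    return -0.5 * n * np.log(2 * np.pi * sigma2) - resid @ resid / (2 * sigma2)

# Intercept and AOI from the example; the response y is hypothetical
X = np.array([[1, 125], [1, 235], [1, 240], [1, 350], [1, 100], [1, 110]], dtype=float)
y = np.array([1.02, 0.99, 1.00, 0.97, 1.03, 1.01])

# Least squares solves the normal equations, which is the MLE of the coefficients
a_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
sigma2_hat = np.mean((y - X @ a_hat) ** 2)    # MLE of the variance

linear_predictors = X @ a_hat                 # identity link: LP is the fitted mean
best = gaussian_loglik(y, X, a_hat, sigma2_hat)
worse = gaussian_loglik(y, X, a_hat + np.array([0.01, 0.0]), sigma2_hat)
print(a_hat, best, worse)
```

Perturbing the fitted coefficients lowers the log-likelihood, which is the sense in which the least squares solution is the maximum.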



Non-Linearity – Description of Issue

[Figure: plot with horizontal axis from 0 to 30 and vertical axis from 0.980 to 1.005]



Non-Linearity – Description of Issue

• GLMs fit linear patterns to data
• This produces a poor fit for certain predictor variables
• Splines can address non-linearity within a GLM

[Figure: plot with horizontal axis from 0 to 30 and vertical axis from 0.980 to 1.005]



Natural Cubic Spline Characteristics

• 3rd degree polynomial between the knots
• Continuous value, first and second derivative at the knots
• Linear outside of the boundary knots

[Figure: plot with horizontal axis from 0 to 30 and vertical axis from 0.980 to 1.005]
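One way to build such a basis is the truncated power form given in The Elements of Statistical Learning. The sketch below is an illustration under assumptions: the function name `natural_spline_columns` and the knot locations are hypothetical, and only the non-linear columns are returned, on the assumption that the constant and linear terms are already in the design matrix. With four knots this yields two columns.

```python
import numpy as np

def natural_spline_columns(x, knots):
    """Non-linear natural cubic spline basis columns (truncated power form).

    With K knots this returns K - 2 columns; the constant and linear terms
    are assumed to be in the design matrix already.
    """
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    last = knots[-1]

    def d(k):
        # Scaled difference of two truncated cubics
        num = np.maximum(x - knots[k], 0.0) ** 3 - np.maximum(x - last, 0.0) ** 3
        return num / (last - knots[k])

    K = len(knots)
    return np.column_stack([d(k) - d(K - 2) for k in range(K - 2)])

knots = [0.0, 10.0, 20.0, 30.0]           # hypothetical knot locations
inside = natural_spline_columns([5.0, 15.0, 25.0], knots)
beyond = natural_spline_columns([40.0, 50.0, 60.0], knots)
below = natural_spline_columns([-5.0], knots)

# For equally spaced points beyond the upper boundary knot the second
# differences vanish, i.e. each basis column is linear there
second_diff = beyond[0] - 2 * beyond[1] + beyond[2]
print(inside.shape, second_diff, below)
```

Below the lower boundary knot the columns are identically zero, so the fitted curve there is governed by the intercept and linear term alone.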



GLM with a Natural Spline

• Two columns are added to the design matrix
  – These columns are the spline basis
  – Two additional coefficients are needed

• GLM is fit with same MLE and link function

\[
\begin{pmatrix}
1 & 0 & 0 & 1 & 125 & .033 & s_{11} & s_{12}\\
1 & 0 & 0 & 0 & 235 & .032 & s_{21} & s_{22}\\
1 & 1 & 0 & 1 & 240 & .034 & s_{31} & s_{32}\\
1 & 1 & 0 & 0 & 350 & .044 & s_{41} & s_{42}\\
1 & 0 & 1 & 1 & 100 & .023 & s_{51} & s_{52}\\
1 & 0 & 1 & 0 & 110 & .025 & s_{61} & s_{62}
\end{pmatrix}
\begin{pmatrix} a_1\\ a_2\\ a_3\\ a_4\\ a_5\\ a_6\\ a_7\\ a_8 \end{pmatrix}
=
\begin{pmatrix} LP_1\\ LP_2\\ LP_3\\ LP_4\\ LP_5\\ LP_6 \end{pmatrix}
\]

Spline basis value pairs shown on the slide, in the order extracted: (0.0, 0.0), (497.3, 109.8), (98.4, 0.6), (0.0, 0.0), (401.3, 75.1), (66.1, 0.0)



GLM with Natural Spline

• Proper reasonability testing
  – Statistical Significance
  – Time Consistency Plot

[Figure: plot with horizontal axis from 0 to 30 and vertical axis from 0.980 to 1.005]



Time Consistency Plot

[Figure: time consistency plot with one panel per year (1997, 1998, 1999, 2000); horizontal axis from 0 to 25, vertical axis from 0.99 to 1.01]


Missing Data – Description of Issue

• Missing data can present unique challenges in model creation

Data

Class   State   AOI   Pop Density
65198   MA      125   .033
65198   IL      235   .032
70446   MA      240   .034
70446   FL      350   .044
64446   MA      100   .023
64446   IN      110   NA

Design Matrix

Intercept   Class 70446   Class 64446   ST MA   AOI   Pop Density
1           0             0             1       125   .033
1           0             0             0       235   .032
1           1             0             1       240   .034
1           1             0             0       350   .044
1           0             1             1       100   .023
1           0             1             0       110   NA



Missing Data – Description of Issue

• What methodologies exist for addressing missing data?

Intercept   Class 70446   Class 64446   ST MA   AOI   Pop Density
1           0             0             1       125   .033
1           0             0             0       235   .032
1           1             0             1       240   .034
1           1             0             0       350   .044
1           0             1             1       100   .023
1           0             1             0       110   NA



Methodology #1

• Listwise Deletion: Eliminate any row in the design matrix with missing values

Intercept   Class 70446   Class 64446   ST MA   AOI   Pop Density
1           0             0             1       125   .033
1           0             0             0       235   .032
1           1             0             1       240   .034
1           1             0             0       350   .044
1           0             1             1       100   .023
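Listwise deletion amounts to a row filter on the design matrix. A minimal NumPy sketch, assuming the missing value is encoded as `np.nan`:

```python
import numpy as np

# Design matrix from the example; np.nan marks the missing Pop Density
X = np.array([
    [1, 0, 0, 1, 125, 0.033],
    [1, 0, 0, 0, 235, 0.032],
    [1, 1, 0, 1, 240, 0.034],
    [1, 1, 0, 0, 350, 0.044],
    [1, 0, 1, 1, 100, 0.023],
    [1, 0, 1, 0, 110, np.nan],
])

complete = ~np.isnan(X).any(axis=1)   # True for rows with no missing value
X_listwise = X[complete]              # rows with any missing value are dropped
print(X_listwise.shape)
```

The same `complete` mask would also need to be applied to the response vector so that rows stay aligned.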



• Mean Imputation: Replace missing values with mean of values where data is present

Intercept   Class 70446   Class 64446   ST MA   AOI   Pop Density
1           0             0             1       125   .033
1           0             0             0       235   .032
1           1             0             1       240   .034
1           1             0             0       350   .044
1           0             1             1       100   .023
1           0             1             0       110   .033
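The imputed value of .033 is the mean of the five observed Pop Density values. A short sketch of the calculation, again assuming `np.nan` marks the missing entry:

```python
import numpy as np

pop_density = np.array([0.033, 0.032, 0.034, 0.044, 0.023, np.nan])

observed_mean = np.nanmean(pop_density)   # mean over observed values only
imputed = np.where(np.isnan(pop_density), observed_mean, pop_density)
print(round(float(observed_mean), 3), imputed)
```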



• Linear Mean Imputation: Create spline basis excluding missing values and mean impute on spline basis

Intercept   Class 70446   Class 64446   ST MA   AOI   Pop Density   Spline Basis 1   Spline Basis 2
1           0             0             1       125   .033          .109             .359
1           0             0             0       235   .032          .102             .328
1           1             0             1       240   .034          .116             .393
1           1             0             0       350   .044          .194             .852
1           0             1             1       100   .023          .053             .122
1           0             1             0       110



Intercept   Class 70446   Class 64446   ST MA   AOI   Pop Density   Spline Basis 1   Spline Basis 2
1           0             0             1       125   .033          .109             .359
1           0             0             0       235   .032          .102             .328
1           1             0             1       240   .034          .116             .393
1           1             0             0       350   .044          .194             .852
1           0             1             1       100   .023          .053             .122
1           0             1             0       110   .033          .115             .411
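The values filled in for the incomplete row under linear mean imputation can be checked by averaging each of the three columns (Pop Density and its two spline basis columns) over the five observed rows. A minimal sketch:

```python
import numpy as np

# Pop Density and its two spline basis columns for the five observed rows
observed = np.array([
    [0.033, 0.109, 0.359],
    [0.032, 0.102, 0.328],
    [0.034, 0.116, 0.393],
    [0.044, 0.194, 0.852],
    [0.023, 0.053, 0.122],
])

fill = observed.mean(axis=0)              # one mean per column, in column order
completed = np.vstack([observed, fill])   # imputed row appended for the incomplete record
print(np.round(fill, 3))
```

The column means come out to roughly .033, .115 and .411 for Pop Density, the first spline column and the second spline column respectively.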



• Single imputation: Use other predictor variables to build a model and impute missing values
  – Example: Model Pop Density based on AOI

Intercept   Class 70446   Class 64446   ST MA   AOI   Pop Density
1           0             0             1       125   .033
1           0             0             0       235   .025
1           1             0             1       240   .034
1           1             0             0       350   .044
1           0             1             1       100   .023
1           0             1             0       110   .027
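A sketch of single imputation for this example, assuming the imputation model is a simple linear regression of Pop Density on AOI (the slides do not state the model form) and using the observed Pop Density values from the original data table:

```python
import numpy as np

aoi = np.array([125, 235, 240, 350, 100, 110], dtype=float)
pop_density = np.array([0.033, 0.032, 0.034, 0.044, 0.023, np.nan])

obs = ~np.isnan(pop_density)

# Simple linear regression of Pop Density on AOI, fitted on observed rows only
slope, intercept = np.polyfit(aoi[obs], pop_density[obs], deg=1)

imputed = pop_density.copy()
imputed[~obs] = intercept + slope * aoi[~obs]   # predicted value replaces the gap
print(np.round(imputed, 3))
```

Under these assumptions the prediction for AOI = 110 rounds to .027. Unlike mean imputation, the filled-in value reflects the relationship between the two predictors.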



• Multiple Imputation: Use other predictor variables to model missing values
  – Multiple imputations are created based on distribution of residuals in estimates of missing values

Intercept   Class 70446   Class 64446   ST MA   AOI   Pop Density (Imp. 1)   Pop Density (Imp. 2)   Pop Density (Imp. 3)
1           0             0             1       125   .033                   .033                   .033
1           0             0             0       235   .025                   .025                   .025
1           1             0             1       240   .034                   .034                   .034
1           1             0             0       350   .044                   .044                   .044
1           0             1             1       100   .023                   .023                   .023
1           0             1             0       110   .025                   .029                   .027



Multiple Imputation Process

1. Choose starting values for mean and covariance matrix of predictor variables

2. Use mean and covariance matrix to estimate regression parameters

3. Use regression parameters to estimate missing values. Add a random draw from the residual normal distribution for that variable

4. Use the resulting data set to compute new mean and covariance matrix

5. Make a random draw from the posterior distribution of the means and covariances

6. Using the random draw from step 5, go back to step 2 and cycle through the process until convergence is achieved
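The cycle above can be sketched for the simplest case: one fully observed predictor `x` and one predictor `y` with missing values. This is an illustration under assumptions, not the algorithm as implemented in any particular package: the function name `data_augmentation`, the synthetic data, the burn-in and spacing settings, and the noninformative prior used for the posterior draw in step 5 are all choices made for the example.

```python
import numpy as np

def data_augmentation(x, y, n_iter=400, burn_in=100, thin=100, seed=0):
    """Six-step cycle for bivariate normal data; np.nan marks missing y."""
    rng = np.random.default_rng(seed)
    miss = np.isnan(y)
    n = len(y)

    # Step 1: starting mean and covariance from the complete cases
    start = np.column_stack([x[~miss], y[~miss]])
    mean, cov = start.mean(axis=0), np.cov(start, rowvar=False)

    y_work = y.copy()
    saved = []
    for it in range(1, n_iter + 1):
        # Step 2: regression of y on x implied by the mean and covariance
        slope = cov[0, 1] / cov[0, 0]
        intercept = mean[1] - slope * mean[0]
        resid_var = cov[1, 1] - slope * cov[0, 1]

        # Step 3: predicted value plus a draw from the residual normal
        noise = rng.normal(0.0, np.sqrt(resid_var), size=miss.sum())
        y_work[miss] = intercept + slope * x[miss] + noise

        # Step 4: mean and sum-of-squares matrix of the completed data
        full = np.column_stack([x, y_work])
        ybar = full.mean(axis=0)
        centered = full - ybar
        S = centered.T @ centered

        # Step 5: posterior draw (assumed noninformative prior):
        # covariance from an inverse-Wishart, then mean from a normal
        z = rng.multivariate_normal(np.zeros(2), np.linalg.inv(S), size=n - 1)
        cov = np.linalg.inv(z.T @ z)
        mean = rng.multivariate_normal(ybar, cov / n)

        # Step 6: cycle again; keep spaced-out imputations after burn-in
        if it > burn_in and (it - burn_in) % thin == 0:
            saved.append(y_work.copy())
    return saved

# Synthetic data for illustration only
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=0.5, size=200)
y[x > 1.0] = np.nan            # missingness depends on x only

imputations = data_augmentation(x, y)
print(len(imputations))
```

Each saved array is one completed version of `y`; the model of interest would be fit once per completed data set and the results combined.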



Multiple Imputation Process

• Assumptions underlying multiple imputation algorithms
  – Data is Missing At Random: Missingness of predictor variable “V” cannot depend on the value of “V” but can depend on values of other predictor variables
  – Data follows a multivariate normal distribution

• Two issues that must be addressed
  – Initial convergence of iterations
  – Correlation of consecutive iterations



Time Series Plot

• Initial convergence is assessed via a time series plot

[Figure: time series plot; horizontal axis: Iteration Number (0 to 100); vertical axis from 0.84 to 0.94]



Auto Correlation Plot

[Figure: autocorrelation plot; horizontal axis: lag (0 to 100); vertical axis: ACF (0.0 to 1.0)]

• Spread between iterations is assessed via an autocorrelation plot
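The quantity behind an autocorrelation plot can be computed directly from the trace of a parameter across iterations. The sketch below uses a synthetic AR(1) series as a stand-in for such a trace; the series, the function name `autocorrelation` and the 0.05 threshold are assumptions made for illustration.

```python
import numpy as np

def autocorrelation(series, max_lag):
    """Sample autocorrelation of a parameter trace for lags 0..max_lag."""
    s = np.asarray(series, dtype=float)
    s = s - s.mean()
    denom = s @ s
    return np.array([(s[: len(s) - k] @ s[k:]) / denom for k in range(max_lag + 1)])

# Hypothetical trace: an AR(1) series standing in for a parameter across iterations
rng = np.random.default_rng(0)
trace = np.zeros(2000)
for t in range(1, len(trace)):
    trace[t] = 0.8 * trace[t - 1] + rng.normal()

acf = autocorrelation(trace, max_lag=100)

# Smallest lag at which the autocorrelation is negligible (threshold is an assumption)
spacing = int(np.argmax(np.abs(acf) < 0.05))
print(acf[:3], spacing)
```

Imputations taken at least `spacing` iterations apart can then be treated as approximately independent draws.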



Testing of Missing Value Methods

• Method #1
  – Created training and holdout data sets
    • Both contained missing data
  – Built models of claim frequency under different missing value analysis methods with the training data set
    • Identical predictor variables in all models
  – Compared results (deviance) of methods in data set where all data is present




Testing of Missing Value Options

• Method #2
  – Created a model of missing probability
  – Limited modeling database to observations in which all data was present
  – Randomly generated missing values based on missing probability
    • 100 iterations
  – Built models of claim frequency under different missing value analysis methods
    • Identical predictor variables in all models




Performance of Missing Value Methods

1. Single Imputation/Multiple Imputation

2. Linear Mean Imputation

3. Mean Imputation

4. Listwise Deletion



Missing Data Framework

• Questions
  – What is the level of missing data?
  – What can be inferred about the missing data mechanism?
  – What is the size of the modeling database in which all values are present?
  – Will the data continue to be missing when the model is applied?



Missing Data Framework

• Actions
  – For low proportions of missing data: Listwise Deletion
  – For higher proportions of missing data in a large modeling database: Listwise Deletion with oversampling
  – For mid to small modeling databases: employ imputation
    • Initial exploration with Linear Mean Imputation
    • Fit final model with Single Imputation or Multiple Imputation



Sources

• Splines
  – Hastie, Tibshirani and Friedman: The Elements of Statistical Learning

• Missing Data
  – Paul Allison: Missing Data
  – J.L. Schafer: Analysis of Incomplete Multivariate Data
  – Insightful Corporation: Analyzing Data with Missing Values in S-Plus
