CAS Predictive Modeling Seminar Practical Issues in Model Design

Page 1: CAS Predictive Modeling Seminar: Practical Issues in Model Design

Chuck Boucek (312) 879-3859

Page 2: Overview

• Data usually does not seamlessly fit into model assumptions

• The focus of this presentation is the impact that selected issues have on the design matrix

• Agenda
  – Overview of the Design Matrix
  – Non-linearity in predictors
  – Missing data

Page 3: What is the Design Matrix?

• Representation of the predictor variables used to construct the model

Data

Class  State  AOI  Pop Density
65198  MA     125  .033
65198  IL     235  .032
70446  MA     240  .034
70446  FL     350  .044
64446  MA     100  .023
64446  IN     110  .025

Design Matrix

Intercept  Class 70446  Class 64446  ST MA  AOI  Pop Density
1          0            0            1      125  .033
1          0            0            0      235  .032
1          1            0            1      240  .034
1          1            0            0      350  .044
1          0            1            1      100  .023
1          0            1            0      110  .025

Design Matrix | Non-Linearity | Missing Data
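The mapping from the data table to the design matrix can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: it assumes class 65198 is the base level (its rows carry zeros in both class columns on the slide), and the names `records`, `CLASS_LEVELS` and `design_row` are made up for the example.

```python
# Raw records from the slide: (class code, state, AOI, population density).
records = [
    ("65198", "MA", 125, 0.033),
    ("65198", "IL", 235, 0.032),
    ("70446", "MA", 240, 0.034),
    ("70446", "FL", 350, 0.044),
    ("64446", "MA", 100, 0.023),
    ("64446", "IN", 110, 0.025),
]

# Assumption: class 65198 is the base level, so it gets no indicator column.
CLASS_LEVELS = ["70446", "64446"]


def design_row(class_code, state, aoi, pop_density):
    """Return one row: intercept, class indicators, MA indicator, AOI, density."""
    class_dummies = [1 if class_code == level else 0 for level in CLASS_LEVELS]
    state_is_ma = 1 if state == "MA" else 0
    return [1] + class_dummies + [state_is_ma, aoi, pop_density]


design_matrix = [design_row(*rec) for rec in records]

for row in design_matrix:
    print(row)
```

Categorical predictors expand into indicator columns, while numeric predictors (AOI, Pop Density) pass through unchanged.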

Page 4: How is GLM Fit to Data?

• Linear predictors are transformed to an estimate of the response data via the inverse link function

• Family and link function determine the form of the MLE
  – Family: Gaussian, Link: identity, MLE:

$$ l(y) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\Bigl(y_i - \sum_{j=1}^{p} x_{ij}\,a_j\Bigr)^2 $$

Design Matrix x Coefficients = Linear Predictors

$$
\begin{pmatrix}
1 & 0 & 0 & 1 & 125 & .033 \\
1 & 0 & 0 & 0 & 235 & .032 \\
1 & 1 & 0 & 1 & 240 & .034 \\
1 & 1 & 0 & 0 & 350 & .044 \\
1 & 0 & 1 & 1 & 100 & .023 \\
1 & 0 & 1 & 0 & 110 & .025
\end{pmatrix}
\begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \\ a_5 \\ a_6 \end{pmatrix}
=
\begin{pmatrix} LP_1 \\ LP_2 \\ LP_3 \\ LP_4 \\ LP_5 \\ LP_6 \end{pmatrix}
$$
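As a rough illustration of the two pieces on this slide, the sketch below computes the linear predictors as design matrix times coefficients and evaluates the Gaussian log-likelihood under the identity link. The coefficients `a`, the responses `y` and the variance `sigma2` are hypothetical; none of them appear in the slides.

```python
import math

# Design matrix from the slide: intercept, two class indicators, MA, AOI, Pop Density.
X = [
    [1, 0, 0, 1, 125, 0.033],
    [1, 0, 0, 0, 235, 0.032],
    [1, 1, 0, 1, 240, 0.034],
    [1, 1, 0, 0, 350, 0.044],
    [1, 0, 1, 1, 100, 0.023],
    [1, 0, 1, 0, 110, 0.025],
]

# Hypothetical values, chosen only to make the arithmetic concrete.
a = [0.5, 0.1, -0.2, 0.05, 0.001, 2.0]     # coefficients a1..a6
y = [0.75, 0.80, 1.00, 1.05, 0.50, 0.45]   # observed responses
sigma2 = 0.01                              # assumed variance


def linear_predictors(X, a):
    """Row-by-row dot product of the design matrix with the coefficients."""
    return [sum(x_ij * a_j for x_ij, a_j in zip(row, a)) for row in X]


def gaussian_loglik(y, mu, sigma2):
    """Gaussian log-likelihood of responses y given fitted means mu."""
    n = len(y)
    rss = sum((y_i - mu_i) ** 2 for y_i, mu_i in zip(y, mu))
    return -0.5 * n * math.log(2 * math.pi * sigma2) - rss / (2 * sigma2)


lp = linear_predictors(X, a)   # identity link: fitted mean equals the linear predictor
print(lp)
print(gaussian_loglik(y, lp, sigma2))
```

Maximizing this log-likelihood over the coefficients is the same as minimizing the residual sum of squares, which is why the Gaussian/identity case reduces to ordinary least squares.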

Page 5: Non-Linearity – Description of Issue

[Figure: plot with x-axis ticks from 0 to 30 and y-axis ticks from 0.980 to 1.005]

Page 6: Non-Linearity – Description of Issue

• GLMs fit linear patterns to data
• Produces poor fit for certain predictor variables
• Splines can address non-linearity within a GLM

[Figure: plot with x-axis ticks from 0 to 30 and y-axis ticks from 0.980 to 1.005]

Page 7: Natural Cubic Spline Characteristics

• 3rd degree polynomial between the knots
• Value, first derivative and second derivative are continuous at the knots
• Linear outside of the boundary knots

[Figure: plot with x-axis ticks from 0 to 30 and y-axis ticks from 0.980 to 1.005]
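One common way to write a basis with these three properties is the truncated-power form given in Hastie, Tibshirani and Friedman. The slides do not say which parameterization was used, so the notation below (knots $\xi_1 < \dots < \xi_K$, basis functions $N_k$) is an assumption for illustration.

```latex
N_1(x) = 1, \qquad N_2(x) = x, \qquad
N_{k+2}(x) = d_k(x) - d_{K-1}(x), \quad k = 1, \dots, K-2,

d_k(x) = \frac{(x - \xi_k)_+^3 - (x - \xi_K)_+^3}{\xi_K - \xi_k},
\qquad (u)_+ = \max(u, 0).
```

With $K$ knots this gives $K$ basis functions. Each $N_{k+2}$ is zero to the left of $\xi_k$ and linear to the right of $\xi_K$, because the cubic and quadratic terms cancel in the difference $d_k - d_{K-1}$.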

Page 8: GLM with a Natural Spline

• Two columns are added to the design matrix
  – These columns are the spline basis
  – Two additional coefficients are needed

• GLM is fit with same MLE and link function

$$ X \, (a_1, a_2, \dots, a_8)^{\mathsf T} = (LP_1, LP_2, \dots, LP_6)^{\mathsf T} $$

X = the six original columns plus two spline basis columns

Intercept  Class 70446  Class 64446  ST MA  AOI  Pop Density
1          0            0            1      125  .033
1          0            0            0      235  .032
1          1            0            1      240  .034
1          1            0            0      350  .044
1          0            1            1      100  .023
1          0            1            0      110  .025

Spline basis columns (value pairs in the order they appear in the transcript):
(0.0, 0.0)  (497.3, 109.8)  (98.4, 0.6)  (0.0, 0.0)  (401.3, 75.1)  (66.1, 0.0)
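A minimal sketch of how two spline basis columns might be appended to the design matrix is shown below. It uses the truncated-power natural cubic spline basis from Hastie, Tibshirani and Friedman with four knots, which yields exactly two non-linear columns beyond the raw predictor. The knot locations in `AOI_KNOTS` are hypothetical (the slides do not state them), so the values will not reproduce the numbers on the slide.

```python
def natural_spline_columns(x, knots):
    """Return the K-2 non-linear natural cubic spline basis values at x.

    The constant and linear basis terms are left out because the design
    matrix already contains the intercept and the raw predictor.
    """
    def pos_cube(u):
        return u ** 3 if u > 0 else 0.0

    last = knots[-1]

    def d(k):
        return (pos_cube(x - knots[k]) - pos_cube(x - last)) / (last - knots[k])

    d_penultimate = d(len(knots) - 2)
    return [d(k) - d_penultimate for k in range(len(knots) - 2)]


# Hypothetical knot locations for AOI: two boundary knots, two interior knots.
AOI_KNOTS = [100, 150, 250, 350]
AOI_COLUMN = 4

X = [
    [1, 0, 0, 1, 125, 0.033],
    [1, 0, 0, 0, 235, 0.032],
    [1, 1, 0, 1, 240, 0.034],
    [1, 1, 0, 0, 350, 0.044],
    [1, 0, 1, 1, 100, 0.023],
    [1, 0, 1, 0, 110, 0.025],
]

# Append the two basis columns, giving eight columns and eight coefficients.
X_spline = [row + natural_spline_columns(row[AOI_COLUMN], AOI_KNOTS) for row in X]

for row in X_spline:
    print(row)
```

Because the spline enters only through extra columns, the fitting routine itself does not change: the GLM is still linear in the eight coefficients.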

Page 9: GLM with Natural Spline

• Proper reasonability testing
  – Statistical Significance
  – Time Consistency Plot

[Figure: plot with x-axis ticks from 0 to 30 and y-axis ticks from 0.980 to 1.005]

Page 10: Time Consistency Plot

[Figure: one panel per year (1997, 1998, 1999, 2000); x-axis ticks from 0 to 25, y-axis ticks from 0.99 to 1.01]

Page 11: Missing Data – Description of Issue

• Missing data can present unique challenges in model creation

Data

Class  State  AOI  Pop Density
65198  MA     125  .033
65198  IL     235  .032
70446  MA     240  .034
70446  FL     350  .044
64446  MA     100  .023
64446  IN     110  NA

Design Matrix

Intercept  Class 70446  Class 64446  ST MA  AOI  Pop Density
1          0            0            1      125  .033
1          0            0            0      235  .032
1          1            0            1      240  .034
1          1            0            0      350  .044
1          0            1            1      100  .023
1          0            1            0      110  NA

Page 12: Missing Data – Description of Issue

• What methodologies exist for addressing missing data?

Intercept  Class 70446  Class 64446  ST MA  AOI  Pop Density
1          0            0            1      125  .033
1          0            0            0      235  .032
1          1            0            1      240  .034
1          1            0            0      350  .044
1          0            1            1      100  .023
1          0            1            0      110  NA

Page 13: Methodology #1

• Listwise Deletion: Eliminate any row in the design matrix with missing values

Intercept  Class 70446  Class 64446  ST MA  AOI  Pop Density
1          0            0            1      125  .033
1          0            0            0      235  .032
1          1            0            1      240  .034
1          1            0            0      350  .044
1          0            1            1      100  .023
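Listwise deletion amounts to a row filter. The sketch below is only an illustration: it represents the missing Pop Density as `None`, which is a choice made here rather than anything specified in the slides.

```python
# Design matrix with the missing Pop Density in the last row marked as None.
X = [
    [1, 0, 0, 1, 125, 0.033],
    [1, 0, 0, 0, 235, 0.032],
    [1, 1, 0, 1, 240, 0.034],
    [1, 1, 0, 0, 350, 0.044],
    [1, 0, 1, 1, 100, 0.023],
    [1, 0, 1, 0, 110, None],
]

# Keep only rows in which every value is present.
complete_rows = [row for row in X if all(value is not None for value in row)]

print(len(X), "rows before,", len(complete_rows), "rows after")
```

The whole record is dropped, so the observed class, state and AOI values in that row are lost along with the missing one.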

Page 14: Methodology #2

• Mean Imputation: Replace missing values with the mean of the values where data is present

Intercept  Class 70446  Class 64446  ST MA  AOI  Pop Density
1          0            0            1      125  .033
1          0            0            0      235  .032
1          1            0            1      240  .034
1          1            0            0      350  .044
1          0            1            1      100  .023
1          0            1            0      110  .033
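The imputed value on this slide can be reproduced with a short sketch, assuming the missing entry is represented as `None` (an illustrative choice). The mean of the five observed Pop Density values is .0332, which the slide shows rounded to .033.

```python
from statistics import mean

X = [
    [1, 0, 0, 1, 125, 0.033],
    [1, 0, 0, 0, 235, 0.032],
    [1, 1, 0, 1, 240, 0.034],
    [1, 1, 0, 0, 350, 0.044],
    [1, 0, 1, 1, 100, 0.023],
    [1, 0, 1, 0, 110, None],
]
PD = 5  # column index of Pop Density

# Mean of the values that are present.
fill_value = mean(row[PD] for row in X if row[PD] is not None)

# Replace missing entries with that mean, leaving observed values untouched.
X_imputed = [
    row[:PD] + [fill_value if row[PD] is None else row[PD]] + row[PD + 1:]
    for row in X
]

print(round(X_imputed[5][PD], 3))
```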

Page 15: Methodology #3

• Linear Mean Imputation: Create spline basis excluding missing values and mean impute on spline basis

Intercept  Class 70446  Class 64446  ST MA  AOI  Pop Density  PD Spline 1  PD Spline 2
1          0            0            1      125  .033         .109         .359
1          0            0            0      235  .032         .102         .328
1          1            0            1      240  .034         .116         .393
1          1            0            0      350  .044         .194         .852
1          0            1            1      100  .023         .053         .122
1          0            1            0      110

Page 16: Methodology #3

• Linear Mean Imputation: Create spline basis excluding missing values and mean impute on spline basis

Intercept  Class 70446  Class 64446  ST MA  AOI  Pop Density  PD Spline 1  PD Spline 2
1          0            0            1      125  .033         .109         .359
1          0            0            0      235  .032         .102         .328
1          1            0            1      240  .034         .116         .393
1          1            0            0      350  .044         .194         .852
1          0            1            1      100  .023         .053         .122
1          0            1            0      110  .033         .115         .411
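The two-step procedure can be sketched as follows: expand the observed values into basis columns first, then fill the missing row with the mean of each column. The basis used here is a stand-in, not the presenter's spline: the extra columns on the slide agree to three decimals with 100·x² and 10,000·x³, so the sketch uses those two functions as an assumption.

```python
from statistics import mean

pop_density = [0.033, 0.032, 0.034, 0.044, 0.023, None]


def basis(x):
    """Stand-in basis (assumption): raw value, scaled square, scaled cube."""
    return [x, 100 * x ** 2, 10_000 * x ** 3]


# Step 1: build the basis from the non-missing values only.
observed = [basis(x) for x in pop_density if x is not None]

# Step 2: mean-impute each basis column separately.
column_means = [mean(column) for column in zip(*observed)]
expanded = [basis(x) if x is not None else list(column_means) for x in pop_density]

for row in expanded:
    print([round(value, 3) for value in row])
```

Imputing after the expansion means the filled row sits at the mean of every basis column, rather than at the basis evaluated at the mean value.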

Page 17: Methodology #4

• Single Imputation: Use other predictor variables to build a model and impute missing values
  – Example: Model Pop Density based on AOI

Intercept  Class 70446  Class 64446  ST MA  AOI  Pop Density
1          0            0            1      125  .033
1          0            0            0      235  .025
1          1            0            1      240  .034
1          1            0            0      350  .044
1          0            1            1      100  .023
1          0            1            0      110  .027
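As a sketch of the AOI example, the code below fits a simple least-squares line of Pop Density on AOI over the rows where Pop Density is present, then predicts the missing value. A straight-line model is an assumption (the slides do not say what model was used), and the observed values are taken from the original data table. Under those assumptions the prediction at AOI = 110 rounds to .027.

```python
from statistics import mean

aoi = [125, 235, 240, 350, 100, 110]
pop_density = [0.033, 0.032, 0.034, 0.044, 0.023, None]

# Fit on complete pairs only.
pairs = [(x, y) for x, y in zip(aoi, pop_density) if y is not None]
x_bar = mean([x for x, _ in pairs])
y_bar = mean([y for _, y in pairs])

slope = (sum((x - x_bar) * (y - y_bar) for x, y in pairs)
         / sum((x - x_bar) ** 2 for x, _ in pairs))
intercept = y_bar - slope * x_bar

# Replace each missing value with the model prediction for that row.
imputed = [y if y is not None else intercept + slope * x
           for x, y in zip(aoi, pop_density)]

print(round(imputed[5], 3))
```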

Page 18: Methodology #5

• Multiple Imputation: Use other predictor variables to model missing values
  – Multiple imputations are created based on the distribution of residuals in estimates of missing values

Three imputed design matrices, identical except for the imputed Pop Density in the last row:

Intercept  Class 70446  Class 64446  ST MA  AOI  Pop Density #1  Pop Density #2  Pop Density #3
1          0            0            1      125  .033            .033            .033
1          0            0            0      235  .025            .025            .025
1          1            0            1      240  .034            .034            .034
1          1            0            0      350  .044            .044            .044
1          0            1            1      100  .023            .023            .023
1          0            1            0      110  .025            .029            .027
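The idea of drawing several imputations from the residual distribution can be sketched by extending a simple regression of Pop Density on AOI. This is a deliberately simplified illustration: the straight-line model, the seed and the choice of three imputations are assumptions, and the sketch ignores uncertainty in the regression parameters themselves.

```python
import random
from statistics import mean

aoi = [125, 235, 240, 350, 100, 110]
pop_density = [0.033, 0.032, 0.034, 0.044, 0.023, None]

pairs = [(x, y) for x, y in zip(aoi, pop_density) if y is not None]
x_bar = mean([x for x, _ in pairs])
y_bar = mean([y for _, y in pairs])
slope = (sum((x - x_bar) * (y - y_bar) for x, y in pairs)
         / sum((x - x_bar) ** 2 for x, _ in pairs))
intercept = y_bar - slope * x_bar

# Residual standard deviation with n - 2 degrees of freedom.
residuals = [y - (intercept + slope * x) for x, y in pairs]
resid_sd = (sum(r * r for r in residuals) / (len(pairs) - 2)) ** 0.5

rng = random.Random(0)   # fixed seed so the example is repeatable
M = 3                    # number of imputed data sets (illustrative)

imputed_sets = []
for _ in range(M):
    filled = [
        y if y is not None else intercept + slope * x + rng.gauss(0, resid_sd)
        for x, y in zip(aoi, pop_density)
    ]
    imputed_sets.append(filled)

for filled in imputed_sets:
    print(round(filled[5], 3))
```

Each imputed data set is then modeled separately, and the spread across the fitted results reflects the uncertainty introduced by the missing value.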

Page 19: Multiple Imputation Process

1. Choose starting values for the mean and covariance matrix of the predictor variables

2. Use the mean and covariance matrix to estimate regression parameters

3. Use the regression parameters to estimate missing values. Add a random draw from the residual normal distribution for that variable

4. Use the resulting data set to compute a new mean and covariance matrix

5. Make a random draw from the posterior distribution of the means and covariances

6. Using the random draw from step 5, go back to step 2 and cycle through the process until convergence is achieved
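The steps above match the data augmentation scheme for multivariate normal data described by Schafer, and can be summarized in two alternating draws. The notation is introduced here for illustration and is not on the slide: subscripts $m$ and $o$ denote the missing and observed parts of a record, and $t$ is the iteration number.

```latex
\text{Imputation (steps 2--3):}\quad
x_m^{(t+1)} \sim N\!\left(
  \mu_m + \Sigma_{mo}\Sigma_{oo}^{-1}(x_o - \mu_o),\;
  \Sigma_{mm} - \Sigma_{mo}\Sigma_{oo}^{-1}\Sigma_{om}
\right),
\quad (\mu, \Sigma) = \left(\mu^{(t)}, \Sigma^{(t)}\right)

\text{Posterior (steps 4--5):}\quad
\left(\mu^{(t+1)}, \Sigma^{(t+1)}\right) \sim
p\!\left(\mu, \Sigma \,\middle|\, x_o,\, x_m^{(t+1)}\right)
```

The matrix $\Sigma_{mo}\Sigma_{oo}^{-1}$ holds the regression slopes of the missing variables on the observed ones, and the conditional covariance is the residual variance from which the random draw in step 3 is taken.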

Page 20: Multiple Imputation Process

• Assumptions underlying multiple imputation algorithms
  – Data is Missing At Random: Missingness of predictor variable “V” cannot depend on the value of “V” but can depend on the values of other predictor variables.
  – Data follows a multivariate normal distribution

• Two issues that must be addressed
  – Initial convergence of iterations
  – Correlation of consecutive iterations

Page 21: Time Series Plot

• Initial convergence is assessed via a time series plot

[Figure: time series plot; x-axis "Iteration Number" with ticks from 0 to 100, y-axis ticks from 0.84 to 0.94]

Page 22: Auto Correlation Plot

• Spread between iterations is assessed via an autocorrelation plot

[Figure: autocorrelation plot; x-axis "lag" with ticks from 0 to 100, y-axis "ACF" with ticks from 0.0 to 1.0]
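The quantity plotted in an autocorrelation plot can be computed as in the sketch below. The chain of parameter values is simulated here purely for illustration (a hypothetical series with strong dependence between consecutive iterations); in practice it would be the sequence of draws of one parameter across imputation iterations.

```python
import random


def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    center = sum(series) / n
    denom = sum((value - center) ** 2 for value in series)
    num = sum((series[t] - center) * (series[t + lag] - center)
              for t in range(n - lag))
    return num / denom


# Hypothetical chain: each value depends strongly on the previous one.
rng = random.Random(1)
chain = [0.9]
for _ in range(499):
    chain.append(0.9 + 0.8 * (chain[-1] - 0.9) + rng.gauss(0, 0.01))

acf = [autocorrelation(chain, lag) for lag in range(21)]

for lag in (0, 1, 5, 10, 20):
    print(lag, round(acf[lag], 3))
```

The lag at which the autocorrelation falls close to zero suggests how many iterations to leave between the imputed data sets that are retained.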

Page 23: Testing of Missing Value Methods

• Method #1
  – Created training and holdout data sets
    • Both contained missing data
  – Built models of claim frequency under different missing value analysis methods with the training data set
    • Identical predictor variables in all models
  – Compared results (deviance) of methods in data set where all data is present
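The deviance comparison in the last step might look like the sketch below. It assumes a Poisson model for claim frequency, which the slides do not state, and every number in it is hypothetical, chosen only to show the mechanics.

```python
import math


def poisson_deviance(actual, predicted):
    """Poisson deviance: 2 * sum(y * ln(y / mu) - (y - mu)), with 0 * ln(0) = 0."""
    total = 0.0
    for y, mu in zip(actual, predicted):
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += log_term - (y - mu)
    return 2.0 * total


# Hypothetical claim counts on complete records and predictions from two
# models that differ only in how missing values were handled in training.
holdout_claims = [0, 1, 0, 2, 0, 1]
predictions = {
    "mean imputation": [0.4, 0.6, 0.5, 1.1, 0.3, 0.9],
    "single imputation": [0.2, 0.8, 0.3, 1.6, 0.2, 0.9],
}

scores = {name: poisson_deviance(holdout_claims, pred)
          for name, pred in predictions.items()}

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(name, round(score, 3))
```

Lower deviance indicates a closer fit, so the methods can be ranked by sorting the scores in ascending order.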

Page 24: Testing of Missing Value Options

• Method #2
  – Created a model of missing probability
  – Limited modeling database to observations in which all data was present
  – Randomly generated missing values based on missing probability
    • 100 iterations
  – Built models of claim frequency under different missing value analysis methods
    • Identical predictor variables in all models
  – Compared results (deviance) of methods in data set where all data is present
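The random generation of missing values can be sketched as follows. The logistic form of `missing_probability` and its coefficients are hypothetical, standing in for whatever missing-probability model was actually fitted. Letting the probability depend on AOI rather than on Pop Density itself keeps the simulated data Missing At Random.

```python
import math
import random


def missing_probability(aoi):
    """Hypothetical logistic model of P(Pop Density is missing | AOI)."""
    return 1.0 / (1.0 + math.exp(-(-2.0 + 0.005 * aoi)))


def knock_out(rows, rng):
    """Return a copy of rows with Pop Density set to None at random."""
    return [
        (aoi, None if rng.random() < missing_probability(aoi) else density)
        for aoi, density in rows
    ]


# Complete observations only: (AOI, Pop Density).
complete = [(125, 0.033), (235, 0.032), (240, 0.034),
            (350, 0.044), (100, 0.023), (110, 0.025)]

rng = random.Random(0)   # fixed seed so the example is repeatable
simulated = [knock_out(complete, rng) for _ in range(100)]   # 100 iterations

share_missing = (sum(density is None for sim in simulated for _, density in sim)
                 / (len(simulated) * len(complete)))
print(round(share_missing, 3))
```

Because the true values are known for every knocked-out entry, each missing value method can be scored against data where nothing is actually missing.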

Page 25: Performance of Missing Value Methods

1. Single Imputation/Multiple Imputation

2. Linear Mean Imputation

3. Mean Imputation

4. Listwise Deletion

Page 26: Missing Data Framework

• Questions
  – What is the level of missing data?
  – What can be inferred about the missing data mechanism?
  – What is the size of the modeling database in which all values are present?
  – Will the data continue to be missing when the model is applied?

Page 27: Missing Data Framework

• Actions
  – For low proportions of missing data: Listwise Deletion
  – For higher proportions of missing data in a large modeling database: Listwise Deletion with oversampling
  – For mid to small modeling databases: employ imputation
    • Initial exploration with Linear Mean Imputation
    • Fit final model with Single Imputation or Multiple Imputation

Page 28: Sources

• Splines
  – Hastie, Tibshirani and Friedman: The Elements of Statistical Learning

• Missing Data
  – Paul Allison: Missing Data
  – J.L. Schafer: Analysis of Incomplete Multivariate Data
  – Insightful Corporation: Analyzing Data with Missing Values in S-Plus