2014 IIAG Imputation Assessments

Imputation assessments for the IIAG

1 Introduction

This document contains a write-up of some simulations performed by the Mo Ibrahim Foundation to assess imputation methods. We begin below by describing the imputation methods and the missingness mechanisms considered, as well as what was measured. We then assess the accuracy and precision of the various methods, and some characteristics of the predicted distributions. An assessment of the amount of remaining missingness after imputation follows.

Finally, we draw some conclusions from the experiments, viz., that in terms of accuracy and precision, linear interpolation is the most appropriate method for our data; and that for data where whole country timeseries are missing, the most accurate method is the so-called all variable multilevel method.

1.1 Methods of imputation under consideration

Below is the list of methods of imputation under consideration.

1. Mean substitution. Here missing interior datapoints are replaced by the mean of the closest available datapoints on either side in the timeseries; and last value carried forward (LVCF)/first value carried backwards (FVCB) is used for the exterior missing data. The special rule for the Antiretroviral Treatment Provision (ATP) and ART Provision for Pregnant Women (ARTPPW) variables (no imputation between the year 2000 and the next available datapoint) is used.

2. Mean substitution, no ad hoc. As above, but with no ad hoc rules for ATP and ARTPPW.

3. Linear interpolation. Interior missing data are replaced by linear interpolation, while exterior missing data are replaced by LVCF/FVCB (that is, 0th order extrapolation).

4. Linear interpolation, higher order extrapolation. Interior missing data are replaced by linear interpolation, and exterior missing data are replaced by higher order extrapolation. Higher order interpolation was also used but was found to be very inaccurate (the analysis has been omitted).

5. All variable multilevel. Missing data replaced using a multilevel model trained on all the data present.

6. Variable multilevel. Missing data for a variable replaced by a multilevel model trained on all data available for that variable.

7. Regression imputation. Missing data replaced by multiple imputation by chained equations, using linear regression on the variables.
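
The interpolation-type methods in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Foundation's implementation: the function names, the use of None to mark a missing year, and the assumption of equally spaced years are all ours. The ad hoc ATP/ARTPPW rule and higher order extrapolation are left out.

```python
from typing import List, Optional

# One country-variable timeseries; None marks a missing year (our convention).
Series = List[Optional[float]]


def _fill_interior(values: Series, linear: bool) -> Series:
    """Fill gaps that lie between two observed values."""
    out = list(values)
    observed = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(observed, observed[1:]):
        step = values[right] - values[left]
        for i in range(left + 1, right):
            if linear:
                # Straight line between the two neighbouring observations.
                out[i] = values[left] + step * (i - left) / (right - left)
            else:
                # Mean of the closest observation on either side.
                out[i] = (values[left] + values[right]) / 2
    return out


def _fill_exterior(values: Series) -> Series:
    """FVCB/LVCF: carry the first value backwards and the last value forwards."""
    observed = [i for i, v in enumerate(values) if v is not None]
    if not observed:
        # A wholly missing series has nothing to carry, so it stays missing.
        return list(values)
    first, last = observed[0], observed[-1]
    out = list(values)
    for i in range(first):
        out[i] = values[first]
    for i in range(last + 1, len(values)):
        out[i] = values[last]
    return out


def mean_substitution(values: Series) -> Series:
    """Method 2 of the list (mean substitution without the ad hoc rule)."""
    return _fill_exterior(_fill_interior(values, linear=False))


def linear_interpolation(values: Series) -> Series:
    """Method 3 of the list (linear interior, 0th order exterior)."""
    return _fill_exterior(_fill_interior(values, linear=True))
```

Note that both functions return a wholly missing series unchanged, which is why these methods can leave missingness behind after imputation.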

The first four methods above will be referred to as interpolation-type methods. Methods five and six will be referred to as multilevel methods, and the last three (five to seven) will be referred to as regression-type methods.

1.2 Missingness mechanisms

There are various ways in which missingness can be generated. We have simulated two in these experiments.

1. Missing completely at random. Data deleted from IIAG dataset at random.

2. Data deleted from IIAG dataset by the following procedure: Select a variable at random. Delete the data for a random number of years. Delete the data for a random number of countries.
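
The two mechanisms above might be simulated along the following lines. This is a sketch only: the data layout (a dictionary keyed by variable, country and year), the function names and the way the random counts are drawn are our assumptions, since the source does not describe them.

```python
import random
from typing import Dict, Optional, Tuple

Key = Tuple[str, str, int]           # (variable, country, year), an assumed layout
Data = Dict[Key, Optional[float]]    # None marks a deleted entry


def delete_completely_at_random(data: Data, proportion: float,
                                rng: random.Random) -> Data:
    """Mechanism 1: delete a given proportion of entries, chosen uniformly."""
    out = dict(data)
    n_delete = round(proportion * len(out))
    for key in rng.sample(sorted(out), n_delete):
        out[key] = None
    return out


def delete_years_and_countries(data: Data, rng: random.Random) -> Data:
    """Mechanism 2: for one random variable, delete whole years and whole countries."""
    out = dict(data)
    variables = sorted({k[0] for k in out})
    countries = sorted({k[1] for k in out})
    years = sorted({k[2] for k in out})

    variable = rng.choice(variables)
    # A random number of years and of countries (assumed uniform over 0..all).
    gone_years = set(rng.sample(years, rng.randint(0, len(years))))
    gone_countries = set(rng.sample(countries, rng.randint(0, len(countries))))

    for var, country, year in out:
        if var == variable and (year in gone_years or country in gone_countries):
            out[(var, country, year)] = None
    return out
```

Passing an explicit random.Random instance keeps each simulated deletion reproducible from its seed.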

1.3 What was measured?

For each random selection of parameters which determine the amount and type of missingness, the following quantities are computed:

1. The amount of missingness before imputation and the amount of missingness remaining after imputation.

2. The quantiles and mean of the distance between the actual values of the indicators and the imputed values.

3. The quantiles and mean of the distance between the actual and the imputed broad governance score.

4. Quantiles, mean, standard deviation, skewness and kurtosis of the difference between the actual indicator value and the imputed value.

5. The quantiles, mean and skewness of the difference of actual and imputed broad governance score.
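
As an illustration of the quantities listed above, the following sketch summarises the differences between actual and imputed values for one simulation run. The function name and the returned keys are hypothetical. We also assume that "distance" means absolute difference, that moments are population moments, and that kurtosis is excess kurtosis; the source does not state which conventions were used.

```python
import math
import statistics
from typing import Dict, Sequence


def summarise_imputation(actual: Sequence[float],
                         imputed: Sequence[float]) -> Dict[str, float]:
    """Summarise accuracy (distance) and precision (spread of differences)."""
    if len(actual) != len(imputed) or len(actual) < 2:
        raise ValueError("need two equally long sequences with at least two values")

    diffs = [a - b for a, b in zip(actual, imputed)]   # signed difference
    dists = [abs(d) for d in diffs]                    # distance (assumed absolute)

    mean_diff = statistics.fmean(diffs)
    sd = statistics.pstdev(diffs)                      # population standard deviation
    centred = [d - mean_diff for d in diffs]
    if sd > 0:
        skewness = statistics.fmean([c ** 3 for c in centred]) / sd ** 3
        kurtosis = statistics.fmean([c ** 4 for c in centred]) / sd ** 4 - 3.0
    else:
        # All differences identical: shape measures are undefined.
        skewness = kurtosis = math.nan

    q1, median, q3 = statistics.quantiles(dists, n=4)
    return {
        "mean_distance": statistics.fmean(dists),
        "distance_q1": q1,
        "distance_median": median,
        "distance_q3": q3,
        "difference_mean": mean_diff,
        "difference_sd": sd,
        "difference_skewness": skewness,
        "difference_excess_kurtosis": kurtosis,
    }
```

In this reading, a small mean distance indicates accuracy, while a small standard deviation of the differences indicates precision.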

2 Accuracy and precision of predictions

To assess the merits of the imputation techniques, we measure accuracy and precision. Accuracy means being on target, while precision means doing so consistently. Being accurate but not precise means an even spread centered on the target, while being both accurate and precise means a smaller spread centered on the target. Good estimates must be both accurate and precise.

The mean substitution methods with and without the ad hoc rule for the ATP and ARTPPW variables have mean distances between actual and imputed values of 1.24 and 1.26 respectively. The ad hoc procedure can be expected to be more accurate, since it tries to impute fewer values. The linear interpolation method of imputation is similarly accurate. Regression imputation and the all variable multilevel method are the least accurate, as can be expected given that the correlations that they use are lower, see Figure 1.

Figure 1: Mean variable distance

As far as distance from the broad governance score is concerned, the all variable multilevel method fares well, with the variable multilevel method and the mean substitution and linear interpolation methods in the second best group of methods, see Figure 2.

Figure 2: Broad governance score distance

The standard deviations of the difference between actual and imputed variable scores show that the precision of the variable multilevel method and the interpolation-type methods is a great deal higher than that of the other methods, with the interpolation-type methods performing the best. All variable multilevel imputation and regression imputation are quite bad in this respect, see Figure 3.

Figure 3: Variable difference standard deviation

The distribution of differences for the interpolation-type methods has shorter tails than the corresponding distribution for the other methods. This is slightly more so for the linear interpolation method than for mean substitution. From the kurtosis of the differences between imputed and actual variable values, it can be seen that on average the variable multilevel method produced the most sharply peaked distribution, see Table 1.

Method                         1st Qu.   Median      Mean   3rd Qu.
Variable multilevel              24.69    53.82   1501.00    168.90
Linear interpolation             27.83    64.83   1057.00    249.50
Mean substitution, no ad hoc     26.08    61.48   1048.00    241.80
Mean substitution                25.93    60.16   1048.00    239.30
Higher order interpolation       17.84    47.46    678.30    179.20
Regression imputation             2.76     7.29    364.00     24.25
All variable multilevel           0.50     2.47     36.31      8.45

Table 1: Variable difference kurtosis

As far as bias is concerned, the mean skewness of the difference between actual and imputed is generally positive for all the timeseries imputation methods, with mean skewness larger and positive for the variable multilevel method. The all variable multilevel method performs the best in this regard, and the regression imputation method also performs well, see Figure 4.

Figure 4: Variable difference skewness

2.0.1 Higher order extrapolation

Since the linear interpolation methods perform well, it would seem plausible that higher order interpolation methods could produce better results. This turns out not to be the case, however (results omitted).

In addition, at mid-to-high levels of missingness, higher order extrapolation (with linear interpolation) produces less accurate results, as can be seen by regressing the mean variable distance on the amount of original missingness and the order of the extrapolation.

               Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)     -4.1188       0.1328    -31.02     0.0000
ord             -0.0401       0.0204     -1.96     0.0498
origMiss         9.9823       0.2033     49.11     0.0000
ord:origMiss     0.1212       0.0316      3.84     0.0001

Table 2: Variable accuracy and order of extrapolation

The coefficient of order in Table 2 is small and negative, while the interaction term is larger and positive, which means that for a fixed level of original missingness higher than 0.33, an increase in order will produce a decrease in accuracy.
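
The 0.33 threshold follows from the Table 2 coefficients: in a model with an interaction term, the effect of increasing the order by one is the order coefficient plus the interaction coefficient times the level of original missingness. The short sketch below works this through; the variable and function names are ours, and the response is taken to be the mean variable distance, so a positive effect means lower accuracy.

```python
# Coefficients copied from Table 2.
COEF_ORD = -0.0401
COEF_ORD_ORIGMISS = 0.1212


def effect_of_one_more_order(orig_miss: float) -> float:
    """Change in fitted mean distance when the order increases by one."""
    return COEF_ORD + COEF_ORD_ORIGMISS * orig_miss


# The effect changes sign where COEF_ORD + COEF_ORD_ORIGMISS * orig_miss = 0.
threshold = -COEF_ORD / COEF_ORD_ORIGMISS

print(f"threshold: {threshold:.2f}")
print(f"effect at origMiss = 0.2: {effect_of_one_more_order(0.2):+.4f}")
print(f"effect at origMiss = 0.5: {effect_of_one_more_order(0.5):+.4f}")
```

Below the threshold the fitted effect is negative (a slightly smaller distance), and above it the effect is positive (a larger distance, hence lower accuracy).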

The mean distance for indicators increases a great deal with the level of original missingness, and it is also clear that the 0th order extrapolation (that is, the last value carried forward, and the first value carried backwards) is the most accurate, see Figure 5.

Figure 5: Variable distance quantiles and order of extrapolation

For missingness proportions smaller than 0.33, higher order extrapolation has a small, positive effect on the mean; for higher values of original missingness, the median and the 10th and 90th percentiles of the imputation-actual difference increase.

3 Remaining missingness

The amount of original missingness is of course predictive of the amount of missingness remaining after imputation, but the imputation method and the type of missingness also have a significant impact.

Introducing missingness completely at random, by selecting random entries in the IIAG dataset to delete, results in a dataset in which the all variable multilevel method can impute all the values, for almost all levels of missingness. Up to high levels of original missingness, regression imputation performs very well in this regard, followed by linear extrapolation, constant extrapolation with linear interpolation and the mean substitution method (which are almost indistinguishable), followed by the variable multilevel method, see Table 3.

Method                         1st Qu.   Median   Mean   3rd Qu.
All variable multilevel           0.00     0.00   0.00      0.00
Regression imputation             0.00     0.01   0.10      0.04
Higher order interpolation        0.05     0.06   0.14      0.14
Linear interpolation              0.06     0.11   0.24      0.32
Mean substitution, no ad hoc      0.06     0.11   0.24      0.32
Mean substitution                 0.07     0.12   0.24      0.32
Variable multilevel               0.17     0.45   0.46      0.73

Table 3: Remaining missingness proportion

We also generate missingness as follows: make a random selection of indicators, and for those variables delete the data for all countries for a random selection of years, and the data for all years for a random selection of countries. Again, the all variable multilevel and the regression imputation methods both perform well, leaving little unimputed. The proportion of remaining missingness is higher for this type of missingness for all methods, except the variable multilevel method, which performs a lot better, see Table 4.

Method                         1st Qu.   Median   Mean   3rd Qu.
All variable multilevel           0.00     0.00   0.00      0.00
Regression imputation             0.00     0.00   0.10      0.10
Variable multilevel               0.00     0.10   0.20      0.20
Mean substitution, no ad hoc      0.10     0.20   0.30      0.40
Mean substitution                 0.10     0.20   0.30      0.40
Linear interpolation              0.10     0.20   0.30      0.40
Higher order interpolation        0.10     0.20   0.30      0.40

Table 4: Remaining missingness

As is clear from Figure 6, the missingness mechanism is an important determinant of how much data will remain unimputed. Below, the left panel shows data from the missing completely at random mechanism, while the right panel shows data missing according to the above mechanism.

Figure 6: Remaining missingness

3.1 Accuracy and precision, and country and time missingness

Holding the proportion of variables deleted constant, the deletion of whole country timeseries is expected to have a larger effect on the remaining missingness for interpolation-type imputation methods, since it will not be possible to impute values, and this turns out to be the case, see Figure 7, where the rows of panels have constant variable missingness levels in the intervals (0, 0.33], (0.33, 0.67] and (0.67, 1]. Also, the variance in the amount of remaining missingness increases with the proportion of countries missing.
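
The reason why whole country timeseries cannot be imputed by interpolation-type methods can be made concrete with a small sketch. It assumes, as with LVCF/FVCB, that a series is filled completely as soon as it contains one observed value, and is left entirely missing otherwise. The panel layout (one list per country for a single variable) and the function names are hypothetical.

```python
from typing import Dict, List, Optional

# Assumed layout: country -> yearly values of one variable, None if missing.
Panel = Dict[str, List[Optional[float]]]


def original_missingness(panel: Panel) -> float:
    """Proportion of entries missing before imputation."""
    total = sum(len(series) for series in panel.values())
    missing = sum(v is None for series in panel.values() for v in series)
    return missing / total


def remaining_after_interpolation(panel: Panel) -> float:
    """Proportion still missing after interpolation with LVCF/FVCB.

    Only series with no observed value at all remain missing.
    """
    total = sum(len(series) for series in panel.values())
    unfillable = sum(len(series) for series in panel.values()
                     if all(v is None for v in series))
    return unfillable / total


panel = {
    "A": [1.0, None, 3.0],
    "B": [None, None, None],   # whole country timeseries deleted
    "C": [None, 2.0, None],
    "D": [4.0, 5.0, 6.0],
}
print(original_missingness(panel), remaining_after_interpolation(panel))
```

Under these assumptions, deleting one year for every country leaves nothing unimputed as long as each country keeps at least one observation, whereas every deleted country series remains missing in full.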

Figure 7: Remaining missingness

While holding the proportion of deleted country timeseries fixed, the deletion of variable-years affects the variable multilevel and the regression imputation more than the interpolation-type methods, see Figure 8, where the rows of panels have constant country missingness levels in the intervals (0, 0.33], (0.33, 0.67] and (0.67, 1]. Similarly, the variance in remaining missingness increases with the proportion of deleted variable-years. In addition, the more country-missingness there is, the less of an effect additional year-missingness has.

Figure 8: Remaining missingness

It is therefore clear that, for the interpolation-type imputation methods, the remaining missingness is more affected by deleting country timeseries than by deleting all countries for one year, while the variable multilevel method is more affected by the deletion of variable-years. The all variable multilevel method is not affected, while the regression imputation is somewhat affected.

Because the interpolation-type imputations leave more values unimputed, their accuracy is not affected as much as that of the regression-type imputation methods, which will attempt to impute values even though present data are sparse. Indeed, for fixed levels of missing variables, increasing the proportion of missing country timeseries decreases accuracy for the variable multilevel, all variable multilevel and regression imputation methods, as can be seen in Figure 9 below, where the rows of panels have constant variable missingness levels in the intervals (0, 0.33], (0.33, 0.67] and (0.67, 1]. The most affected is the regression imputation, followed by the all variable and variable multilevel methods.

Figure 9: Mean variable distance

Similarly, removing years for all countries decreases the accuracy for the interpolation-type methods, up to a point after which imputation is no longer possible. It should also be noted that the all variable and regression imputations are affected similarly, while the accuracy of the variable multilevel method does not follow this pattern, see Figure 10, where the rows of panels have constant country missingness levels in the intervals (0, 0.33], (0.33, 0.67] and (0.67, 1].

Figure 10: Mean variable distance

The amount of country coverage affects the accuracy of the prediction of the broad governance score for all methods of imputation, with the variable multilevel the least affected.

4 Conclusion

1. The linear interpolation method is slightly more accurate and precise than mean substitution, and is therefore the most suitable approach to imputation for our data. It also has the conceptual advantage of being consistent with the idea that natural phenomena are continuous.

2. For imputing variables where a whole timeseries is empty, we are reduced to a choice between all variable multilevel imputation and regression imputation. The all variable multilevel imputation is much better in terms of accuracy of prediction of the broad governance score and also in terms of leaving very little missingness behind.

3. Country timeseries missingness is detrimental to the accuracy of interpolation-type methods; the removal of additional years has a linear effect on accuracy. Variable-year missingness is less harmful.
