Post on 09-Jun-2020
CeMSIIS - Section for Artificial Intelligence and Decision Support Missing data and imputation – Philip Anner
Missing data and imputation Philip Anner
philip.anner@meduniwien.ac.at
Missing data
• Why?
  • administrative reasons
  • equipment failure
  • human errors
  • patients who dropped out
  • study design
• Prevention is better than statistical “cures”
• However, some missing values are unavoidable
Identify the underlying cause of missing values
Missing data patterns
van Buuren, S.: Multiple imputation of discrete and continuous data by fully conditional specification. Stat. Methods Med. Res. 16, 219–242 (2007).
Mechanisms of missing data
• Missing completely at random (MCAR)
  • The probability that a value is missing depends neither on observed nor on unobserved data, only on unknown parameters of the missingness process
  • Test: are there significant differences in observed values between patients with all variables observed and patients with missing values?
Example: missing blood pressure measurements in a study due to the breakdown of a medical device.
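The test mentioned above can be sketched roughly as follows. This is a minimal illustration on simulated data, not a prescribed procedure: the variable names (`ages`, `bp_mcar`, `bp_age`), the missingness probabilities and the choice of Welch's t statistic are all assumptions made for the example. It compares a fully observed covariate (age) between patients with and without a blood pressure value.

```python
import math
import random
import statistics


def welch_t(a, b):
    """Welch's t statistic for the difference in means of two samples."""
    var_a = statistics.variance(a)
    var_b = statistics.variance(b)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


def compare_groups(ages, bp):
    """Compare age between cases with observed and with missing blood pressure."""
    observed = [a for a, y in zip(ages, bp) if y is not None]
    missing = [a for a, y in zip(ages, bp) if y is None]
    return welch_t(observed, missing)


rng = random.Random(1)
ages = [rng.uniform(20, 80) for _ in range(2000)]

# MCAR: every measurement is lost with the same probability (device breakdown).
bp_mcar = [None if rng.random() < 0.3 else 100 + 0.5 * a for a in ages]

# Not MCAR: younger patients are more likely to have a missing measurement.
bp_age = [None if rng.random() < (0.6 if a < 40 else 0.1) else 100 + 0.5 * a
          for a in ages]

t_mcar = compare_groups(ages, bp_mcar)
t_age = compare_groups(ages, bp_age)
print(f"t (MCAR): {t_mcar:.2f}, t (age-dependent): {t_age:.2f}")
```

A large absolute t value speaks against MCAR. A small one does not prove MCAR, because the missingness may still depend on values that were never observed.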
• Missing at random (MAR):
  • less restrictive assumption than MCAR
  • systematic differences between observed and missing values can be explained by observed data
Example: missing blood pressure measurements of young patients in a study. Young patients may forget to measure their blood pressure more often than older patients, and young patients usually have lower blood pressure than older people. The missing values can therefore be explained by age, provided that age itself has no missing values.
• Missing not at random (MNAR):
  • Missingness does not occur at random and depends on unobserved factors
  • Significant differences between observed and missing data cannot be explained by observed data
  • Cannot be excluded! … can only be explained by other missing factors
Example: missing blood pressure measurements of patients suffering from hypertension. These patients have a higher incidence of headache. Therefore, they skip the examination in the clinic and prefer to stay at home.
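To make the three mechanisms concrete, the following sketch simulates a hypothetical population in which blood pressure rises with age and then removes values under MCAR, MAR and MNAR. All numbers (sample size, slope, missingness probabilities, the 135 threshold) are assumptions chosen for illustration. It shows how the mean of the remaining observed values behaves in each case.

```python
import random
import statistics

rng = random.Random(42)
n = 20000

# Hypothetical population: blood pressure rises with age, plus noise.
ages = [rng.uniform(20, 80) for _ in range(n)]
bp = [100 + 0.5 * a + rng.gauss(0, 10) for a in ages]


def observed_mean(p_missing):
    """Mean blood pressure of the cases that remain observed.

    p_missing(age, value) gives the probability that a value is missing.
    """
    kept = [y for a, y in zip(ages, bp) if rng.random() >= p_missing(a, y)]
    return statistics.mean(kept)


true_mean = statistics.mean(bp)
# MCAR: constant probability, unrelated to any data.
mean_mcar = observed_mean(lambda a, y: 0.3)
# MAR: probability depends on age, which is fully observed.
mean_mar = observed_mean(lambda a, y: 0.6 if a < 40 else 0.1)
# MNAR: probability depends on the blood pressure value itself.
mean_mnar = observed_mean(lambda a, y: 0.6 if y > 135 else 0.1)

print(f"true {true_mean:.1f}, MCAR {mean_mcar:.1f}, "
      f"MAR {mean_mar:.1f}, MNAR {mean_mnar:.1f}")
```

In this setup the observed mean stays close to the true mean under MCAR, is shifted upwards under MAR (young patients with low values are underrepresented) and downwards under MNAR (high values are underrepresented). Only the MAR shift could be corrected using the observed age.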
Missing value imputation
Why missing value imputation?
• Complete case analysis
  • the „easy way“
  • drop cases or variables containing missing values
  • only suitable for a small amount of missing data
• Problems:
  • MCAR:
    • reduced power (lower n!)
    • an existing effect may go undetected
  • MAR/MNAR:
    • severe bias
    • possibly wrong conclusions
Weighting Procedures
• “Weighted complete case analysis” vs. “weighted imputation”
• Weight respondents by their inverse probability of response
• Improved representation of the population containing missing values
• Pros:
  • Easy implementation/interpretation
  • Usually good results under MAR when only one variable contains missing values
• Cons:
  • Not sufficient in multiple/complex nonresponse situations
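A minimal sketch of a weighted complete case analysis, assuming a single incomplete variable (blood pressure) and a fully observed binary age group. The data are simulated and all names and probabilities are hypothetical. The response probability is estimated per age group, and each respondent is weighted by its inverse.

```python
import random
import statistics

rng = random.Random(7)
n = 20000

# Hypothetical data: blood pressure depends on age group; age is always observed.
young = [rng.random() < 0.4 for _ in range(n)]
bp = [rng.gauss(115 if is_young else 135, 10) for is_young in young]

# MAR response: young patients respond less often than older ones.
responded = [rng.random() < (0.4 if is_young else 0.9) for is_young in young]


def response_rate(group):
    """Estimated response probability within one age group."""
    flags = [r for is_young, r in zip(young, responded) if is_young == group]
    return sum(flags) / len(flags)


rate = {True: response_rate(True), False: response_rate(False)}

# Each respondent is weighted by the inverse of the estimated response probability.
pairs = [(y, 1.0 / rate[is_young])
         for y, is_young, r in zip(bp, young, responded) if r]

unweighted = statistics.mean(y for y, _ in pairs)
weighted = sum(y * w for y, w in pairs) / sum(w for _, w in pairs)
true_mean = statistics.mean(bp)

print(f"true {true_mean:.1f}, unweighted {unweighted:.1f}, weighted {weighted:.1f}")
```

The unweighted respondent mean overstates blood pressure because older patients are overrepresented among respondents. The weighted mean restores the age composition of the full sample.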
Single Imputation Methods
Imputation-Based Procedures
• Use available data to estimate missing values
  • Mean imputation
  • Hot deck imputation: copying available values from cases that are similar in the observed variables
  • Last value carried forward: replacing missing entries by the last measured value
  • Regression imputation
• Cons:
  • Biased parameter estimates
  • Biased standard errors
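Two of the single imputation methods listed above can be sketched in a few lines. The function names and the example series are hypothetical. The sketch also illustrates one source of the biased standard errors: mean imputation adds values without any spread.

```python
import statistics


def mean_impute(values):
    """Replace each missing entry (None) by the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]


def carry_last_value_forward(values):
    """Replace each missing entry by the most recent observed value.

    Leading missing entries have no earlier value and stay None.
    """
    result, last = [], None
    for v in values:
        if v is not None:
            last = v
        result.append(last)
    return result


series = [120, None, 130, None, None, 125, 140]
observed = [v for v in series if v is not None]

filled_mean = mean_impute(series)
filled_locf = carry_last_value_forward(series)

# Mean imputation adds values with no spread, so the variance shrinks.
var_observed = statistics.variance(observed)
var_imputed = statistics.variance(filled_mean)
print(filled_mean, filled_locf, var_observed, var_imputed)
```

The sample variance of the mean-imputed series is smaller than that of the observed values alone, so standard errors computed from the completed data are too optimistic.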
Model-Based Procedures
• Bayesian approach
• Estimate a posterior distribution or likelihood for the imputation of missing variables
• Methods:
  • Maximum likelihood
  • Multiple linear regression models
  • Multiple Imputation
Multiple Imputation
• Take the uncertainty of missing values into account
  • Each missing value has a distribution of likely values
  • The distribution reflects the uncertainty about what the value may have been
• Joint modeling (JM)
  • Partition observations into groups of identical missing data patterns
  • Impute each pattern according to a joint model
  • reduced flexibility
• Fully conditional specification (FCS)
  • Multivariate imputation by chained equations (MICE)
  • Generate a model for each variable containing missing values
  • Include: and other variables
Multiple Imputation by Chained Equations (MICE)
• An imputation model for each variable
  • Including other variables (all observed or partially missing)
• Iterative estimation of missing values
1. Single value imputation (e.g. mean imputation)
2. Model one variable, including all others as independent variables
3. Infer values by drawing from a Gibbs sampler
4. Repeat steps 2 and 3 for each variable
• Repeat multiple times, until the distribution of drawn values converges
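The iteration above can be sketched for two numeric variables as follows. This is a deliberately simplified illustration and not the implementation in the mice package. The function names are hypothetical, each imputation model is a simple linear regression, and imputed values are drawn around the prediction with the residual standard deviation while the uncertainty of the regression parameters is ignored.

```python
import math
import random
import statistics


def fit_line(x, y):
    """Least-squares intercept, slope and residual standard deviation."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    return intercept, slope, math.sqrt(rss / (len(x) - 2))


def chained_imputation(columns, n_iter=10, seed=0):
    """Return one completed copy of two numeric columns (None = missing)."""
    rng = random.Random(seed)
    missing = [[v is None for v in col] for col in columns]
    # Step 1: start from a simple single-value (mean) imputation.
    data = []
    for col in columns:
        fill = statistics.mean(v for v in col if v is not None)
        data.append([fill if v is None else v for v in col])
    for _ in range(n_iter):
        for target in (0, 1):
            other = 1 - target
            # Step 2: model the target on the other variable, using the rows
            # where the target was actually observed.
            rows = [i for i, m in enumerate(missing[target]) if not m]
            a, b, sd = fit_line([data[other][i] for i in rows],
                                [data[target][i] for i in rows])
            # Step 3: replace missing targets by a random draw around the
            # prediction (simplified: parameter uncertainty is ignored).
            for i, m in enumerate(missing[target]):
                if m:
                    data[target][i] = a + b * data[other][i] + rng.gauss(0, sd)
    return data


# Simulated example data with about 20% missing values per variable.
rng = random.Random(3)
x = [rng.gauss(50, 10) for _ in range(300)]
y = [2.0 * v + rng.gauss(0, 5) for v in x]
x_mis = [None if rng.random() < 0.2 else v for v in x]
y_mis = [None if rng.random() < 0.2 else v for v in y]

# m = 5 completed data sets, each from an independent run.
completed = [chained_imputation([x_mis, y_mis], seed=s) for s in range(5)]
```

Observed values are never changed. Only the missing entries differ between the five completed data sets, and that variation is what carries the uncertainty about the missing values into the later pooling step.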
Multiple Imputation
• Predictor selection for imputation
  • Use correlated (partially) observed variables
  • The more, the better
Elementary Imputation Methods
• Bayesian linear regression: numeric variables
• Predictive mean matching (PMM): numeric variables
  • Non-parametric approach (donor pool)
  • Imputed values cannot lie outside a variable’s observed range
• Logistic regression: 2 categories
• Polytomous logistic regression: >= 2 categories
• Linear discriminant analysis: >= 2 categories
Predictive mean matching
[Figure: Y against X; legend: X and Y observed, only X observed; annotation: Mahalanobis distance]
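A minimal sketch of predictive mean matching with a single predictor X. It rests on several assumptions: the function name `pmm_impute` and the donor pool size `k` are hypothetical, closeness is measured as the absolute difference between predicted values, and the regression parameters are not drawn from a posterior distribution as a full implementation would do.

```python
import random
import statistics


def pmm_impute(x, y, k=5, seed=0):
    """Impute missing y (None) by predictive mean matching on one predictor x."""
    rng = random.Random(seed)
    obs = [i for i, v in enumerate(y) if v is not None]
    # Linear regression of y on x, fitted on the cases with y observed.
    mx = statistics.mean(x[i] for i in obs)
    my = statistics.mean(y[i] for i in obs)
    sxx = sum((x[i] - mx) ** 2 for i in obs)
    slope = sum((x[i] - mx) * (y[i] - my) for i in obs) / sxx
    predicted = [my + slope * (v - mx) for v in x]

    completed = list(y)
    for i, v in enumerate(y):
        if v is None:
            # Donor pool: the k observed cases whose predicted value is closest.
            donors = sorted(obs, key=lambda j: abs(predicted[j] - predicted[i]))[:k]
            # The imputed value is the observed y of a randomly chosen donor.
            completed[i] = y[rng.choice(donors)]
    return completed


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, None, 6.2, 7.9, None, 12.1, 13.8, 16.0]
completed = pmm_impute(x, y, k=3)
print(completed)
```

Because every imputed value is copied from a donor, it is always a value that was actually observed, which is why imputations cannot fall outside the observed range.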
Multiple Imputation
• Assessing convergence of the Gibbs sampler (distribution)
• Example of non-convergence
• Validity of imputed values
Pooling estimates - Rubin's rules
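• Combine m results into a single estimate
• Point estimate
  • Average: $\bar{Q} = \frac{1}{m}\sum_{i=1}^{m} \hat{Q}_i$
• Variance
  • Total variance: $T = W + \left(1 + \frac{1}{m}\right) B$
  • Within-imputation variance: $W = \frac{1}{m}\sum_{i=1}^{m} U_i$
  • Between-imputation variance: $B = \frac{1}{m-1}\sum_{i=1}^{m} (\hat{Q}_i - \bar{Q})^2$
  • Here $\hat{Q}_i$ is the estimate and $U_i$ its variance from the i-th imputed data set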
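Rubin's rules in their standard form can be sketched as a small function. The name `pool_rubin` and the example numbers are hypothetical. The inputs are the m point estimates and their variances (squared standard errors), one per imputed data set.

```python
import statistics


def pool_rubin(estimates, variances):
    """Pool m point estimates and their variances with Rubin's rules."""
    m = len(estimates)
    pooled = statistics.mean(estimates)       # pooled point estimate (average)
    within = statistics.mean(variances)       # within-imputation variance
    between = statistics.variance(estimates)  # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between    # total variance
    return pooled, within, between, total


# Hypothetical results from m = 3 imputed data sets.
pooled, within, between, total = pool_rubin([10.0, 12.0, 14.0], [1.0, 1.0, 1.0])
print(pooled, within, between, total)
```

The factor (1 + 1/m) accounts for using only a finite number of imputations. The pooled standard error is the square root of the total variance.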
Questions?
Thank you for your attention!
Literature:
Rubin, D.B.: Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons (1987).
Little, R.J.A., Rubin, D.B.: Statistical Analysis with Missing Data. John Wiley & Sons (2002).
Horton, N.J., Kleinman, K.P.: Much ado about nothing: A comparison of missing data methods and software to fit incomplete data regression models. Am. Stat. 61, 79–90 (2007).
mice (R package): https://cran.r-project.org/web/packages/mice/index.html