
How black-box use of imputation can cause bias

Nicole Erler

Erasmus Medical Center, Rotterdam

Nicole Erler, FGME 2019, Kiel

Handling Missing Values is Easy!

Functions automatically exclude missing values:

## [...]
## Residual standard error: 2.305 on 69 degrees of freedom
##   (25 observations deleted due to missingness)
## Multiple R-squared: 0.09255, Adjusted R-squared: 0.02679
## F-statistic: 1.407 on 5 and 69 DF, p-value: 0.2325
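To see this default behaviour in isolation, here is a minimal sketch on simulated data. The data frame `d`, its variables and the 25 deleted values are assumptions made for illustration; they are not the data behind the output above. `lm()` uses `na.action = na.omit` by default, so incomplete rows are dropped without an error or warning:

```r
# Simulated example data (hypothetical; not the data from the slide)
set.seed(2019)
n <- 100
d <- data.frame(x = rnorm(n))
d$y <- 1 + 2 * d$x + rnorm(n)

# Delete 25 covariate values completely at random
d$x[sample(n, 25)] <- NA

# lm() drops incomplete rows by default (na.action = na.omit)
fit <- lm(y ~ x, data = d)

n_used    <- nobs(fit)                  # rows that entered the fit
n_dropped <- length(na.action(fit))     # rows removed because of missing values
cat(n_used, "rows used,", n_dropped, "rows dropped\n")
```

The only trace of the deletion is the note in the model summary, which is easy to overlook.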

Imputation is super easy:

library("mice")
imp <- mice(mydata)
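Because the one-line call hides every modelling decision, it can help to look at what was chosen. The following is a minimal sketch, assuming the mice package is installed. It uses the `nhanes` example data bundled with mice in place of `mydata`, and the analysis model `bmi ~ age + chl` is purely illustrative:

```r
# Sketch only: runs when the mice package is available
if (requireNamespace("mice", quietly = TRUE)) {
  library("mice")

  # nhanes ships with mice; it stands in for 'mydata' here
  imp <- mice(nhanes, m = 5, seed = 2019, printFlag = FALSE)

  # Which imputation method was chosen for each variable?
  print(imp$method)

  # Which variables serve as predictors in each imputation model?
  print(imp$predictorMatrix)

  # Fit the analysis model on each imputed data set and pool the results
  fit <- with(imp, lm(bmi ~ age + chl))
  print(summary(pool(fit)))
}
```

Inspecting `imp$method` and `imp$predictorMatrix` makes the otherwise implicit choices visible before any results are interpreted.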

However ...


Handling Missing Values Correctly is Not So Easy!

(Imputation) methods make certain assumptions, e.g.:

- missingness is M(C)AR
- the incomplete variable has a certain conditional distribution (e.g. normal)
- all associations are linear
  - no interactions
  - no non-linear effects
  - no transformations
- compatibility of the imputation models
- congeniality (compatibility between analysis and imputation models)

violation → bias






Literature: mis-specification in Multiple Imputation

Several authors have

- investigated robustness to mis-specification (of distribution)
  - in MI using FCS / MICE
  - in joint model MI
- and/or proposed to use
  - Tukey’s gh distribution
  - Fleishman polynomials
  - GAMs (in FCS)
  - doubly-robust weighted estimating equations (instead of MI)


MICE vs JointAI

Imputation in MICE

$p(x_1 \mid y, X_{compl.}, x_2, x_3, x_4, \ldots, \theta)$
$p(x_2 \mid y, X_{compl.}, x_1, x_3, x_4, \ldots, \theta)$
$p(x_3 \mid y, X_{compl.}, x_1, x_2, x_4, \ldots, \theta)$
$\ldots$

Imputation in JointAI

$p(y \mid X_{compl.}, x_1, x_2, x_3, \ldots, \theta)$
$p(x_1 \mid X_{compl.}, \theta)$
$p(x_2 \mid X_{compl.}, x_1, \theta)$
$p(x_3 \mid X_{compl.}, x_1, x_2, \theta)$
$\ldots$

No issues with

- complex outcomes, e.g.:
  - multi-level
  - survival
- congeniality
- compatibility
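Written out, the sequence of models listed for JointAI is a factorization of the joint distribution of the outcome and the incomplete covariates. For three incomplete covariates, and using the notation of the slide, it reads:

```latex
\begin{align*}
p(y, x_1, x_2, x_3 \mid X_{compl.}, \theta)
  = {} & p(y \mid X_{compl.}, x_1, x_2, x_3, \theta) \\
       & \times p(x_1 \mid X_{compl.}, \theta) \\
       & \times p(x_2 \mid X_{compl.}, x_1, \theta) \\
       & \times p(x_3 \mid X_{compl.}, x_1, x_2, \theta)
\end{align*}
```

The analysis model is the first factor, so imputed values are drawn in agreement with it. In MICE, by contrast, each incomplete covariate gets its own full conditional model that includes the outcome as a predictor.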



→ Potential mis-specification of

- association structure
- conditional distribution
- M(C)AR


Simulation Study: Quadratic Effect

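Analysis model: $y \sim \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots$

Quadratic association between covariates: $x_1 \sim \alpha_0 + \alpha_1 x_2 + \alpha_2 x_2^2 + \ldots$ (the term $\alpha_2 x_2^2$ is struck out on the slide, i.e. not included in the imputation model)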

[Figure: x1 (incomplete covariate) plotted against x2 (complete covariate) in three panels; relative bias of mice and JointAI for 10% NA, 30% NA and 50% NA]
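The mechanism behind this scenario can be tried out on a small scale. The sketch below is an illustration under stated assumptions, not the simulation from the talk. All coefficient values, the sample size and the 30% MCAR missingness are made up. The helper `impute_x1()` is hypothetical and performs a single stochastic regression imputation in base R, which ignores parameter uncertainty and stands in for neither mice nor JointAI:

```r
set.seed(2019)
n <- 500

# Data-generating model (all coefficient values are assumptions)
x2 <- rnorm(n)
x1 <- 1 + 0.5 * x2 + 1.5 * x2^2 + rnorm(n)   # quadratic in x2
y  <- 2 + 1 * x1 + 1 * x2 + rnorm(n)          # analysis model, beta1 = 1

# 30% of x1 missing completely at random
x1_obs <- x1
x1_obs[sample(n, 150)] <- NA
dat <- data.frame(y = y, x1 = x1_obs, x2 = x2)

# Hypothetical helper: stochastic regression imputation of x1
impute_x1 <- function(dat, formula) {
  fit  <- lm(formula, data = dat)             # fitted on the observed cases
  miss <- is.na(dat$x1)
  mu   <- predict(fit, newdata = dat[miss, ])
  dat$x1[miss] <- mu + rnorm(sum(miss), sd = sigma(fit))
  dat
}

imp_linear    <- impute_x1(dat, x1 ~ y + x2)            # ignores x2^2
imp_quadratic <- impute_x1(dat, x1 ~ y + x2 + I(x2^2))  # includes x2^2

# Estimate of beta1 divided by its true value of 1 (1 = unbiased)
rel_bias <- c(
  linear    = unname(coef(lm(y ~ x1 + x2, data = imp_linear))["x1"]) / 1,
  quadratic = unname(coef(lm(y ~ x1 + x2, data = imp_quadratic))["x1"]) / 1
)
print(round(rel_bias, 2))
```

A single run on one data set only hints at the effect; judging bias requires repeating this over many simulated data sets and comparing the averages.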


Simulation Study: Logarithmic Effect


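Log-association between covariates: $x_1 \sim \alpha_0 + \alpha_1 \log(x_2) + \ldots$ (the $\log$ is struck out on the slide, i.e. the imputation model uses $x_2$ untransformed)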

[Figure: x1 (incomplete covariate) in three panels; relative bias of mice and JointAI for 10% NA, 30% NA and 50% NA]


Simulation Study: Gamma distribution


Gamma-distributed covariate: $x_1 \mid x_2, x_3, \ldots \sim \text{Ga}()$

[Figure: conditional distribution of x1 (incomplete covariate) in three panels; relative bias for 10% NA, 30% NA and 50% NA]
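One visible symptom of this kind of mis-specification is that a normal imputation model can generate values a gamma-distributed variable can never take. A minimal base R sketch follows; the shape and rate parameters, the sample size and the 30% MCAR missingness are assumptions for illustration:

```r
set.seed(2019)
n <- 1000

# Right-skewed, strictly positive covariate (shape and rate are assumptions)
x2 <- rnorm(n)
x1 <- rgamma(n, shape = 1.5, rate = 1.5 / exp(0.3 * x2))

# 30% of x1 missing completely at random
miss <- sample(n, 300)
x1_obs <- x1
x1_obs[miss] <- NA

# Normal linear imputation model, fitted on the observed cases
fit <- lm(x1_obs ~ x2)
mu  <- predict(fit, newdata = data.frame(x2 = x2[miss]))
x1_imp <- mu + rnorm(length(miss), sd = sigma(fit))

# Compare the imputed values with the true (deleted) values
cat("share of negative imputations:", mean(x1_imp < 0), "\n")
cat("range of true values:", range(x1[miss]), "\n")
```

Negative imputations are only the most obvious sign; the symmetric shape of the imputed values also differs from the skewed distribution of the deleted ones.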


Flexible Bayesian Models

We need more flexible imputation models!

Ideally: models that fit (almost) any distribution / association structure.

Ideas:

- flexible association structure: penalized splines
- flexible residual distribution: mixture of Polya-Trees


Bayesian P-Splines

Instead of $\beta_1 x_2$ we use $\sum_{\ell=1}^{d} \beta_\ell B_\ell(x_2)$:

[Figure: scatter plot of x1 against x2]
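To make the basis expansion concrete, the sketch below builds the B-spline basis $B_1(x_2), \ldots, B_d(x_2)$ with `splines::bs()` and estimates the coefficients by penalized least squares with a second-order difference penalty. This is a frequentist analogue chosen for brevity, not the Bayesian implementation used in the talk; in a Bayesian formulation the difference penalty is commonly expressed as a random-walk prior on the coefficients. The data, `d = 20` and `lambda = 1` are assumptions:

```r
library("splines")   # ships with R

set.seed(2019)
n  <- 200
x2 <- runif(n, -2, 2)
x1 <- 1 + 0.5 * x2 + 1.5 * x2^2 + rnorm(n)   # hypothetical non-linear association

# B-spline basis B_1(x2), ..., B_d(x2)
d <- 20
B <- bs(x2, df = d, intercept = TRUE)

# Second-order difference penalty on neighbouring coefficients
D <- diff(diag(d), differences = 2)
lambda <- 1                                   # smoothing parameter (assumed)

# Penalized least squares: (B'B + lambda * D'D) beta = B'x1
beta_hat  <- solve(crossprod(B) + lambda * crossprod(D), crossprod(B, x1))
fitted_x1 <- as.vector(B %*% beta_hat)

cat("residual SD of the spline fit:", sd(x1 - fitted_x1), "\n")
```

The penalty shrinks differences between neighbouring coefficients, so a generous number of basis functions can be used without the fit becoming too wiggly.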


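Simulation: Bayesian P-Splines

Analysis model: $y \sim \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots$

Quadratic association between covariates: $x_1 \sim \alpha_0 + \alpha_1 x_2 + \alpha_2 x_2^2 + \ldots$ (the term $\alpha_2 x_2^2$ is struck out on the slide, i.e. not included in the default imputation model)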

[Figure: x1 (incomplete covariate) in three panels; relative bias (axis from 1.00 to 1.50) of JointAI default and JointAI p-spline]


Simulation: Bayesian P-Splines

Potential Issue:

[Figure: incomplete covariate plotted against covariate, with missing and observed values marked]


Mixture of Polya Trees

[Figure: density (axis from 0.0 to 1.5) built up over successive tree levels, with branches labelled Beta(a0, a0), Beta(a1, a1) and Beta(a2, a2)]
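The construction can be sketched for a single finite Polya tree: starting from the whole interval, every set is split in two, and the share of probability mass going to the left half at level j is drawn from Beta(a_j, a_j). The code below is an illustration under assumptions, not the mixture model from the talk: it uses a dyadic partition of [0, 1], a depth of three levels, and the choice a_j = c0 * (j + 1)^2, where the function `sample_polya_tree()` and the constant `c0` are hypothetical:

```r
# Draw the bin probabilities of one finite Polya tree on [0, 1]
sample_polya_tree <- function(depth, c0 = 1) {
  probs <- 1                                  # mass of the whole interval
  for (j in 0:(depth - 1)) {
    a_j  <- c0 * (j + 1)^2                    # assumed choice of a_j
    left <- rbeta(length(probs), a_j, a_j)    # share of mass going left
    # split every interval into two halves, keeping left-to-right order
    probs <- as.vector(rbind(probs * left, probs * (1 - left)))
  }
  probs
}

set.seed(2019)
depth <- 3                                    # levels a0, a1, a2
p <- sample_polya_tree(depth)

# Piecewise-constant density: bin mass divided by bin width 2^(-depth)
dens <- p * 2^depth
print(round(dens, 2))
```

Each call yields a different random density; larger values of a_j pull the splits towards one half each and therefore give smoother densities.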





Practical Issues & Ideas

Potential / probable issues for practice:

- flexible fit needs observed data everywhere
- computational time

Ideas:

- check first if a simple model fits, e.g.
  - posterior predictive checks
  - χ2 type of tests
  - Kolmogorov-Smirnov test?
  - discordance tests?
- feasibility checks before running the complex model
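As one possible form of such a check, the sketch below fits a simple normal linear model to a skewed covariate and applies a Kolmogorov-Smirnov test to its standardized residuals. The simulated data and parameter values are assumptions. Because the mean and standard deviation are estimated from the same data, the p-value is only approximate and should be read as a rough screening tool:

```r
set.seed(2019)
n  <- 300
x2 <- rnorm(n)
x1 <- rgamma(n, shape = 1.5, rate = 1.5 / exp(0.3 * x2))  # skewed covariate (assumed)

# Simple candidate imputation model: normal linear regression
fit <- lm(x1 ~ x2)
z   <- residuals(fit) / sigma(fit)        # standardized residuals

# Kolmogorov-Smirnov test against the standard normal distribution
ks <- ks.test(z, "pnorm")
print(ks)
```

A small p-value would suggest that the simple model does not fit and that a more flexible imputation model may be worth its computational cost.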


Take home message

- assumptions of imputation models can easily be violated → bias
- more flexible imputation models are needed
- semi- / non-parametric (Bayesian) methods can offer a solution
- more flexibility → more complexity → need for guidance tools


Thank you for your attention.

Twitter: https://twitter.com/N_Erler
GitHub: https://github.com/nerler
Website: https://nerler.com

Simulation Study: Beta Distribution

Analysis model: $y \sim \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots$

Beta-distributed covariate: $x_1 \mid x_2, x_3, \ldots \sim \text{Be}()$

[Figure: conditional distribution (x-axis from 0.00 to 1.00) in three panels; relative bias]


Simulation Study: Reverting the sequence


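Exclusion of important predictor: $x_1 \sim \alpha_0 + \alpha_1 x_2 + \alpha_2 x_3 + \ldots$ (the term $\alpha_2 x_3$ is struck out on the slide, i.e. not included in the imputation model)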

[Figure: conditional distribution in three panels; relative bias of mice and JointAI for 10% NA, 30% NA and 50% NA]


Simulation Study: Ignoring an Interaction


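Interaction between covariates: $x_1 \sim \alpha_0 + \alpha_1 x_2 + \alpha_2 x_3 + \alpha_3 x_2 x_3 + \ldots$ (the term $\alpha_3 x_2 x_3$ is struck out on the slide, i.e. not included in the imputation model)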

[Figure: x1 (incomplete covariate) in three panels; relative bias for 10% NA, 30% NA and 50% NA]

