R-miss-tastic: a uni ed platform for missing values methods and … · 2019. 8. 15. ·...

R-miss-tastic: a unified platform for missingvalues methods and workflows

Imke Mayer∗ Aude Sportisse† Julie Josse‡ Nicholas Tierney§

Nathalie Vialaneix¶

March 25, 2021

Abstract

Missing values are unavoidable when working with data. Their occurrence isexacerbated as more data from different sources become available. However, moststatistical models and visualization methods require complete data, and improperhandling of missing data results in information loss, or biased analyses. Since theseminal work of Rubin (1976), there has been a burgeoning literature on missing val-ues with heterogeneous aims and motivations. This has resulted in the developmentof various methods, formalizations, and tools (including a large number of R packagesand Python modules). However, for practitioners, it remains challenging to decidewhich method is most suited for their problem, partially because handling missingdata is still not a topic systematically covered in statistics or data science curricula.

To help address this challenge, we have launched a unified platform: “R-miss-tastic”, which aims to provide an overview of standard missing values problems,methods, how to handle them in analyses, and relevant implementations of method-ologies. In the same perspective, we have also developed several pipelines in R andPython to allow for a hands-on illustration of how to handle missing values in vari-ous statistical tasks such as estimation and prediction, while ensuring reproducibility

∗CAMS, EHESS, Paris, France (email: [email protected]). Corresponding co-author.†LPSM, Sorbonne Université, Paris, France (email: [email protected]). Corresponding co-author.‡Inria Sophia-Antipolis, Montpellier, France (email: [email protected]).§NUMBAT, Monash University, Victoria, Australia (email: [email protected]).¶MIAT, Université de Toulouse, INRA, Castanet Tolosan, France (email: [email protected]).

1

arX

iv:1

908.

0482

2v2

[st

at.M

E]

24

Mar

202

1

of the analyses. This will hopefully also provide some guidance on deciding whichmethod to choose for a specific problem and data. The objective of this work is notonly to comprehensively organize materials, but also to create standardized analysisworkflows, and to provide a common ground for discussions among the community.This platform is thus suited for beginners, students, more advanced analysts andresearchers.

Keywords: missing data; state of the art; bibliography; reproducibility; guided workflows;teaching material

2

1 Context and motivation

Missing data are unavoidable as soon as collecting or acquiring data is involved. Theyoccur for many reasons including: individuals choose not to answer survey questions, mea-surement devices fail, or data have simply not been recorded. Their presence becomes evenmore important as data are now obtained at increasing velocity and volume, and fromheterogeneous sources not originally designed to be analyzed together. As pointed out byZhu et al. (2019), “one of the ironies of working with Big Data is that missing data play anever more significant role, and often present serious difficulties for analysis”. Despite this,the approach most commonly implemented by default in software is to toss out cases withmissing values. At best, this is inefficient because it wastes information from the partiallyobserved cases. At worst, it results in biased estimates, particularly when the distributionsof the missing values are systematically different from those of the observed values (e.g.,Enders, 2010, Chap. 2).

However, handling missing data in a more efficient and relevant way (than limiting theanalysis on solely the complete cases) has attracted a lot of attention in the literature in thelast two decades. In particular, a number of reference books have been published (Schaferand Graham, 2002; Little and Rubin, 2002; van Buuren, 2012; Carpenter and Kenward,2012) and the topic is an active field of research (Josse and Reiter, 2018). The diversityof the problems of missing data means there is great variety in the methods proposed andstudied. They include model-based approaches, integrating likelihoods or posterior distri-butions over missing values, filling in missing values in a realistic way with single, or multipleimputations, or weighting approaches, appealing to ideas from the design-based literaturein survey sampling. The multiplicity of the available solutions makes sense because there isno single solution or tool to manage missing data: the appropriate methodology to handlethem depends on many features, such as the objective of the analysis, type of data, thetype of missing data and their pattern. Some of these methods are available in various andheterogeneous software solutions. As R (R Core Team, 2020) is one of the main softwaresfor statisticians and data scientists and as its development has started almost three decadesago (Ihaka, 1998), R is one of the language that offers the largest number of implementedapproaches. This is also due to its ease to incorporate new methods and its modular pack-aging system. Currently, there are over 270 R packages on CRAN that mention missingdata or imputation in their DESCRIPTION files. These packages serve many differentapplications, data types or types of analysis. More precisely, exploratory and visualizationtools for missing data are available in packages like naniar, VIM, and MissingDataGUI(Tierney, Cook, McBain, and Fay, Tierney et al.; Tierney and Cook, 2018; Kowarik and

3

Templ, 2016; Cheng et al., 2015). Imputation methods are included in packages like mice,Amelia, and mi (van Buuren and Groothuis-Oudshoorn, 2011; Honaker et al., 2011; Gelmanand Hill, 2011). Other packages focus on dealing with complex, heterogeneous (categor-ical, quantitative, ordinal variables) data or with large dimension multi-level data, suchas missMDA, and MixedDataImpute (Josse et al., 2016; Murray and Reiter, 2015). Toour knowledge, R is the programming language offering the largest variety of implementedmethods. However, other languages such as Python (Van Rossum and Drake, 2009), whichcurrently only have few publicly available implementations of methods that handle missingvalues in statistical tasks, offer more and more solutions. Two prominent examples are:1) the scikit-learn library (Pedregosa et al., 2011) which has recently added a module formissing values imputation; and 2) the DataWig library (Biessmann et al., 2018) whichprovides a framework to learn to impute incomplete data tables.

The rich variety of methods and tools for working with missing data means there aremany solutions for a variety of applications. Despite the number of options, missing dataare often not handled appropriately by practitioners. This may be for a few reasons. First,the plethora of options can be a double-edged sword - the sheer number of options makingit challenging to navigate and find the best one. Second, the topic of missing data isoften itself missing from many statistics and data science syllabuses, despite its relativeomnipresence in data. So then, faced with missing data, practitioners are left powerless:quite possibly never having been taught about missing data, they do not have an ideaof how to approach the problem, what are the dangers of mismanagement, navigate themethods, software, or choose the appropriate method or workflow for their problem.

To help promote better management and understanding of missing data, we have re-leased R-miss-tastic, an open platform for missing values. The platform takes the form ofa reference website1, which collects, organizes and produces material on missing data. Ithas been conceived by an infrastructure steering committee (ISC; its members are authorsof this article) working group, which first provided a CRAN Task View2 on missing data3 that lists and organizes existing R packages on the topic. The “R-miss-tastic” platformextends and builds on the CRAN Task View by collecting and organizing articles, tutorials,documentation, and workflows for analyses with missing data. The platform provides newtutorials, examples and pipelines of analyses we have developed with missing data thatspan the entirety of an analysis. The latter have been developed by us and implemented inR and in Python for this platform, implementing standard methods for generating missing

1https://rmisstastic.netlify.com/2https://CRAN.R-project.org/package=ctv3https://cran.r-project.org/web/views/MissingData.html

4

https://rmisstastic.netlify.com/https://CRAN.R-project.org/package=ctvhttps://cran.r-project.org/web/views/MissingData.html

values and analyzing them under different perspectives. Starting from data preparation,these pipelines also cover exploratory data analyses, establishing statistical models, analysisdiagnostics, applying machine learning methods, through to communication of the resultsobtained from incomplete data. We hope that these pipelines also serve as guidance whenit comes to choosing a method to handle missing values in a specific context.

Another important ingredient in statistical analyses, some may even argue the mostimportant one, is data which a proposed method or estimator is apply to. We thus alsoreference publicly available datasets that are commonly used as benchmark for new missingvalues methodologies. It is easily extendable and well documented, so it can seamlesslyincorporate future research in missing values. The intent of the platform is to foster awelcoming community, within and beyond the R community . “R-miss-tastic” has beendesigned to be accessible for a wide audience with different levels of prior knowledge, needs,and questions. This includes, for instance, students looking for course material to com-plement their studies, teachers and professors who can use a reference website for theirown classes or refer to students, statisticians or researchers in a different fields using statis-tics searching for solutions or existing work to help with analysis, researchers wishing tounderstand or contribute information for specific research questions, or find collaborators.

The remainder of the article is organized as follows: Section 2 describes the differentcomponents of the platform, the structure that has been chosen, and the targeted audi-ence. The section is organized as the platform itself, starting by describing materials forless advanced users then materials for researchers and finally resources for practical imple-mentation. Section 3 details the implementation and use-cases of the provided workflows,implemented both in R and in Python. Finally, in Section 4 we conclude with an overviewof the planed future development for the platform and of interesting areas in missing valuesresearch that we would like to bring to a broader audience.

2 Structure and content of the platform

The R-miss-tastic platform is released at https://rmisstastic.netlify.com/. It hasbeen developed using the R package blogdown Xie et al. (2017) which wraps hugo4. Liveexamples have been included using the tool https://rdrr.io/snippets/ provided bythe website “R Package Documentation”. The source code and materials of the platformhave been made publicly available on GitHub5, which provides a transparent record of the

4https://gohugo.io/5repository R-miss-tastic/website

5

https://rmisstastic.netlify.com/https://rdrr.io/snippets/https://gohugo.io/

platform’s development, and facilitates community contributions.We now discuss the structure of the R-miss-tastic platform, the aim and content of each

subsection, and highlight key features of the platform.

2.1 Workflows

An important contribution and novelty of this work is the proposal of several workflowsthat allow for a hands-on illustration of classical analyses with missing values, both onsimulated data and on publicly available real-world data. These workflows are providedboth in R and in Python code and cover the following topics:

• How to generate missing values? Generate missing values under different mecha-nisms, on complete or incomplete datasets. This is useful when performing simula-tions to compare methods.

• How to do statistical inference with missing values? We focus in particular in thedifferent solutions (maximum likelihood or multiple imputation) that are available toestimate linear and logistic regression parameters with missing values in the covari-ates.

• How to impute missing values? We compare different single imputation/matrix com-pletion methods (for instance using conditional models, low-rank models, etc.)

• How to predict with missing values? We consider establishing predictive models (forinstance using random forests (Breiman, 2001)) on data with incomplete predictors.The workflows present different strategies to deal with the missing values in thecovariates both in the train set and in the test set.

The aim of these workflows is threefold: 1) they provide a practical implementation ofconcepts and methods discussed in the lectures and bibliography sections (see Sections 2.2and 2.3); 2) they are coded in a generic way, allowing for simple re-use on other datasets, forintegration of other estimation or imputation methods; 3) the distinction between inference,imputation, and prediction lets the user keep in mind that the solutions are not the samein these cases.

Furthermore, they allow for a transparent and open discussion about the proposedimplementations which can be followed on the project GitHub repository6.

6https://github.com/R-miss-tastic/website

6

https://github.com/R-miss-tastic/website

We provide a more detailed view on the proposed workflows in Section 3, giving codeexamples and their corresponding tabular or graphical outputs as well as recommendationson how to interpret and leverage these outputs.

2.2 Lectures

Before starting to use any of the existing implementations for handling missing values in astatistical analysis, it is essential to understand why missing values are problematic. Thereare many lenses to view missing data through: the source of the missing values, theirpotential meaning, the relevance of the features they occur in – and the implications fordifferent types of analyses. For someone unfamiliar with missing data, it is a challengeto know where to begin the journey of understanding them, and the methods to addressthem. This challenge is being addressed with “R-miss-tastic”, which makes the materialto get started easily accessible.

Teaching and workshop material takes many forms – from slides, course notes, labworkshops, video tutorials and in-depth seminars. The material is of high quality, andhas been generously contributed by numerous renowned researchers who investigate theproblems of missing values, many of whom are professors having designed introductoryand advanced classes for statistical analysis with missing data. This makes the material onthe “R-miss-tastic” platform well suited for both beginners and more experienced users.

These teaching and workshop materials are described as “lectures”, and are organizedinto five sections:

1. General lectures: introduction to statistical analyses with missing values, theoryand concepts are covered, such as missing values mechanism, likelihood methods,imputation.

2. Multiple imputation: introduction to popular methods of multiple imputation (jointmodeling and fully conditional), how to correctly perform multiple imputation andlimits of imputation methods.

3. Principal component methods: introduction to methods exploiting low-rank typestructures in the data for visualization, imputation and estimation

4. Specific data or applications types: lectures covering in detail various sub-problemssuch as missing values in time series, in surveys, or in treatment effect estimation. In-deed, certain data types require adaptations of standard missing values methods (for

7

instance handling the time dependence in time series (Moritz and Bartz-Beielstein,2017)) or additional assumptions about the impact of missing values (such as theimpact on confounded treatment effects in the causal inference context (Mayer et al.,2020)). But also more in-depth material, for instance video recordings from a virtualworkshop on Missing Data Challenges in Computation, Statistics and Applications7

held in 2020.

5. Implementations: a non-exhaustive list of detailed vignettes describing functionalitiesof R packages and of Python modules that implement some of the statistical analysismethods covered in the other lectures. For instance the functionalities and possibleapplications of the missMDA R package are presented in a succinct summary, allowingthe reader to compare the main differences between this package and the mice packagewhich is also summarized using the same summary format.

Figure 1 is a screenshot of two views of the lectures page: Figure 1A shows a collapsedview presenting the different topics, Figure 1B shows an example of the expanded view ofone topic (General tutorials), with a detailed description of one of the lecture (obtainedby clicking on its title), “Analysis of missing values” by Jae-Kwang Kim. Each lecture cancontain several documents (as is the case for this one) and is briefly described by a headerpresenting its purpose.

Lectures that we found very complete and thus recommend are:

• Statistical Methods for Analysis with Missing Data by Mauricio Sadinle (in “Generaltutorials”)

• Missing Values in Clinical Research – Multiple Imputation by Nicole Erler (in “Mul-tiple imputation”)

• Handling missing values in PCA and MCA by François Husson. (in “Missing valuesand principal component methods”)

2.3 Bibliography

Complementary to the lectures section, this part of the platform serves as a broad overviewon the scientific literature discussing missing values taxonomies and mechanisms and sta-tistical, machine learning methods to handle them. This overview covers both classical

7https://www.ias.edu/math/mdccsa

8

https://www.ias.edu/math/mdccsa

Home

Workflows

Lectures

Bibliography

R packages

Data

People

More info & links

Contact

R-miss-tastic

A resource website on missing values - Methods and references for managing missing data

Website template created by @mdo, ported to Hugo by @mralanorth. Website proudly powered by Blogdown for R.

Collapse All

Collapse All

Below you will find a selection of high-quality lectures, tutorials and labs on different

aspects of missing values.

Ifyouwishtocontributesomeofyourownmaterialtothisplatform,pleasefeelfreeto

contactusviatheContactform.

General lectures

Multiple imputation

Missing values and principal component methods

Specific data or application types

Implementation in R

++

++

+

About

This website is proudly

sponsored by R Consortium

and maintained by Julie

Josse, Imke Mayer, Nicholas

Tierney and Nathalie

Vialaneix.

Read more →

FAQ →

Follow us!

Events

GitHub

(a) Collapsed view

Home

Workflows

Bibliography

Lectures

R packages

Data

People

More info & links

Contact

R-miss-tastic



Back to top

Collapse All

Collapse All

Below you will find a selection of high-quality lectures and labs on different aspects of

missing values.

Ifyouwishtocontributesomeofyourownmaterialtothisplatform,pleasefeelfreeto

contactusviatheContactform.

General tutorials

Multiple imputation

Missing values and principal component methods

Specific data or application types

Implementation in R

+

Statistical Methods for Analysis With Missing Data

(Marie Davidian, course at NC State University, spring 2017)

Handling missing values

(Julie Josse, course at École Polytechnique, fall 2018 and Julie Josse & Nick Tierney, tutorial at useR!2018, 2018)

This course focuses on the theory and methods for missing data analysis. Topics include maximum likelihood estimation under

missing data, EM algorithm, Monte Carlo computation techniques, imputation, Bayesian approach, propensity scores, semi-

parametric approach, and non-ignorable missing data.

Introduction

Likelihood-based approach

Computation

Imputation

Multiple imputation

Propensity Score approach

Nonignorable missing data

Analysis of missing values

(Jae-Kwang Kim, course at Iowa State University, fall 2015)

Statistical Methods for Analysis with Missing Data

(Mauricio Sadinle, course at University of Washington, winter 2019)

++

++

About






Vialaneix.

Read more →

FAQ →

Follow us!

Events

GitHub

(b) Extract

Figure 1: Lectures overview.

references with books, articles, etc. such as Schafer and Graham (2002); Little and Rubin(2002); van Buuren (2012); Carpenter and Kenward (2012) and more recent developmentssuch as Josse et al. (2019); Gondara and Wang (2018) who study the consistency of su-pervised learning with missing values. The entire (non-exhaustive) bibliography can bebrowsed in two ways: 1) a complete list, filtering by publication type and year, with asearch option for the authors or 2), as a contextualized version. For 2), we classified thereferences into several domains of research or application, briefly discussing important as-pects of each domain. This double representation is shown in Figure 2 and allows anextensive search in the existing literature, while providing some guidance for those focusedon a specific topic. All references are also collected in a unique BibTex file made avail-able on the GitHub repository8. This centralized file allows external users to easily proposeadditions to the bibliography which are then reviewed by the platform editorial and mainte-nance committee, composed of researchers with different focuses on the handling of missingvalues.

8in resources/rmisstastic_biblio.bib

9

resources/rmisstastic_biblio.bib

Home

Workflows

Bibliography

Lectures

R packages

Data

People

More info & links

Contact

R-miss-tastic



Back to top

Collapse All

Collapse All

On this platform we attempt to give you an overview of main references on missing

values. We do not claim to gather all available references on the subject but rather to

offer a peak into different fields of active research on handling missing values, allowing

for an introductory reading as well as a starting point for further bibliographical research.

See here for a full (and uncommented) list of references.

Inspired by CRAN Task View on Missing Data and a review of Imbert & Villa-Vialaneix on

handling missing values (2018, written in French) we organized our selection of relevant

references on missing values by different topics.

Short introduction to missing values

General references and reviews

Weighting methods

Hot-deck and kNN approaches

Likelihood-based approaches

Single imputation

Multiple imputation

Machine Learning

Missing values mechanisms

Share

++

++

++

++

+

About






Vialaneix.

Read more →

FAQ →

Follow us!

Events

GitHub

Processing math: 100%

(a) Contextualized version

Home

Workflows

Bibliography

Lectures

R packages

Data

People

More info & links

Contact

R-miss-tastic



Back to top

A commented version of this bibliography can be found here.

Publication type

All

Year

All

Author

Search by author name..

Share

Citation Year

Publication

type

Abayomi,K.,A.Gelman,andM.Levy.Diagnosticsformultivariateimputations.In:JournaloftheRoyalStatistical

Society,SeriesC(AppliedStatistics)57.3(2008),pp.273-291.

DOI

2008 Article

Albert,P.S.andD.A.Follmann.Modelingrepeatedcountdatasubjecttoinformativedropout.In:Biometrics

56.3(2000),pp.667-677.

DOI

2000 Article

Allison,P.D.MissingData.QuantitativeApplicationsintheSocialSciences.ThousandOaks,CA,USA:Sage

Publications,2001.ISBN:9780761916727.

DOI

2001 Book

Andridge,R.andR.J.A.Little.Areviewofhotdeckimputationforsurveynon-response.In:International

StatisticalReview78.1(2010),pp.40-64.

DOI

2010 Article

Audigier,V.,F.Husson,andJ.Josse.Aprincipalcomponentmethodtoimputemissingvaluesformixeddata.

In:AdvancesinDataAnalysisandClassification10.1(2016),pp.5-26.

DOI

2016 Article

Audigier,V.,F.Husson,andJ.Josse.MIMCA:multipleimputationforcategoricalvariableswithmultiple

correspondenceanalysis.In:StatisticsandComputing27.2(2016),pp.1-18.eprint:1505.08116.

DOI

2016 Article

Audigier,V.,F.Husson,andJ.Josse.MultipleimputationforcontinuousvariablesusingaBayesianprincipal

componentanalysis.In:JournalofStatisticalComputationandSimulation86.11(2015),pp.2140-2156.

DOI

2015 Article

Bang,H.andJ.M.Robins.Doublyrobustestimationinmissingdataandcausalinferencemodels.In:Biometrics

61.4(2005),pp.962-973.

DOI

2005 Article

Baraldi,A.N.andC.K.Enders.Anintroductiontomodernmissingdataanalysis.In:JournalofSchool

Psychology48.1(2010),pp.5-37.

DOI

2010 Article

Baretta,L.andA.Santaniello.Nearestneighborimputationalgorithms:acriticalevaluation.In:BMCMedical

InformaticsandDecisionMaking.Proceedingsofthe5thTranslationalBioinformaticsConference(TBC2015):

medicalinformaticsanddecisionmaking16.Supp.3(2016),p.74.

DOI

2016 Article

Bartlett,J.W.,O.Harel,andJ.R.Carpenter.Asymptoticallyunbiasedestimationofexposureoddsratiosin

completerecordslogisticregression.In:Americanjournalofepidemiology182.8(2015),pp.730–736.

DOI

2015 Article

Bengio,Y.andF.Gingras.Recurrentneuralnetworksformissingorasynchronousdata.In:Proceedingsofthe

8thInternationalConferenceonNeuralInformationProcessingSystems.(Nov.27,1995-Dec.02,1995).Ed.by-.

Cambridge,MA,USA:MITPress,1995,pp.395-401.

URL

1995 Paper

Biessmann,F.,D.Salinas,S.Schelter,etal.“Deep”LearningforMissingValueImputationinTableswithNon-

NumericalData.In:Proceedingsofthe27thACMInternationalConferenceonInformationandKnowledge

Management.Ed.by-.CIKM’18.Torino,Italy:ACM,2018,pp.2017–2025.ISBN:978-1-4503-6014-2.

DOI URL

2018 Paper

Blake,H.A.,C.Leyrat,K.Mansfield,etal.Propensityscoresusingmissingnesspatterninformation:apractical

guide.In:arXivpreprint(2019).arXiv:1901.03981[stat.ME].

URL

2019 Article

Buck,S.F.Amethodofestimationofmissingvaluesinmultivariatedatasuitableforusewithanelectronic

computer.In:JournaloftheRoyalStatisticalSociety,SeriesB22(1960),pp.302-306.

DOI

1960 Article

Burns,R.M.Multipleandreplicateitemimputationinacomplexsamplesurvey.In:Proceedingsofthe6th

AnnualResearchConference.Ed.byB.oftheCensus.WashingtonDC,USA,1990,pp.655-665.

1990 Paper

Buuren,S.van.FlexibleImputationofMissingData.BocaRaton,FL:ChapmanandHall/CRC,2018.

URL

2018 Book

Buuren,S.van.Multipleimputationofdiscreteandcontinuousdatabyfullyconditionalspecification.In:

StatisticalMethodsinMedicalResearch16(2007),pp.219-242.

DOI

2007 Article

Buuren,S.van,J.P.L.Brand,C.G.M.Groothuis-Oudshoorn,etal.Fullyconditionalspecificationinmultivariate

imputation.In:JournalofStatisticalComputationandSimulation76.12(2006),pp.1049-1064.

DOI

2006 Article

Buuren,S.vanandK.Groothuis-Oudshoorn.MICE:multivariateimputationbychainedequationsinR.In:

JournalofStatisticalSoftware45(2011),p.3.eprint:NIHMS150003.

DOI

2011 Article

Candès,E.J.,C.A.Sing-Long,andJ.D.Trzasko.Unbiasedriskestimatesforsingularvaluethresholdingand

spectralestimators.In:IEEETransactionsonSignalProcessing61.19(2013),pp.4643-4657.

DOI

2013 Article

Carpenter,J.R.,M.G.Kenward,andS.Vansteelandt.Acomparisonofmultipleimputationanddoublyrobust

estimationforanalyseswithmissingdata.In:JournaloftheRoyalStatisticalSociety:SeriesA(Statisticsin

Society)169.3(2006),pp.571–584.

DOI

2006 Article

Carpenter,J.andM.Kenward.MultipleImputationanditsApplication.Chichester,WestSussex,UK:Wiley,

2013.ISBN:9780470740521.

DOI

2013 Book

Chen,T.andC.Guestrin.XGBoost:AScalableTreeBoostingSystem.In:Proceedingsofthe22ndACM

SIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining.(Aug.13,2016-Aug.17,2016).Ed.

by-.NewYork,NY,USA:ACM,2016,pp.785-794.ISBN:0450342322.

DOI

2016 Paper

Chen,J.andJ.Shao.Nearestneighborimputationforsurveydata.In:JournalofOfficialStatistics16.2(2000),

pp.113-131.

URL

2000 Article

Collins,L.M.,J.L.Schafer,andK.Chi-Ming.Acomparisonofinclusiveandrestrictivestrategiesinmodern

missingdataprocedures.In:PsychologicalMethods6.4(2007),pp.330-351.

DOI

2007 Article

Cranmer,S.J.andJ.Gill.Wehavetobediscreteaboutthis:anon-parametricimputationtechniqueformissing

categoricaldata.In:BritishJournalofPoliticalScience43(2012),pp.425-449.

DOI

2012 Article

Crookston,N.L.andA.O.Finley.yaImpute:anRpackageforkNNimputation.In:JournalofStatisticalSoftware

23(2008),p.10.

DOI

2008 Article

Dax,A.ImputingMissingEntriesofaDataMatrix:Areview.In:JournalofAdvancedComputing3.3(2014),

pp.98-222.

DOI

2014 Article

Dempster,A.P.,N.M.Laird,andD.B.Rubin.MaximumlikelihoodfromincompletedataviatheEMalgorithm.

In:JournaloftheRoyalStatisticalSociety,SeriesB(Methodological)39.1(1977),pp.1-38.

URL

1977 Article

Diggle,P.andM.G.Kenward.Informativedrop-outinlongitudinaldataanalysis.In:JournaloftheRoyal

StatisticalSociety,SeriesC(AppliedStatistics)43.1(1994),pp.49-93.

DOI

1994 Article

Ding,P.andF.Li.CausalInference:AMissingDataPerspective.In:StatisticalScience33.2(2018),pp.214–237.

DOI

2018 Article

Ding,Y.andJ.S.Simonoff.Aninvestigationofmissingdatamethodsforclassificationtreesappliedtobinary

responsedata.In:JournalofMachineLearningResearch11.1(2010),pp.131-170.

URL

2010 Article

Dong,Y.andC.J.Peng.Principledmissingdatamethodsforresearchers.In:SpringerPlus2(2013),p.222.

DOI

2013 Article

Enders,C.K.AppliedMissingDataAnalysis.GuilfordPress,2010,p.401.ISBN:9781606236390. 2010 Book

Enders,C.K.Aprimeronmaximumlikelihoodalgorithmsavailableforusewithmissingdata.In:Structural

EquationModeling8.1(2001),pp.128-141.

DOI

2001 Article

Fang,F.,J.Zhao,andJ.Shao.Imputation-basedadjustedscoreequationsingeneralizedlinearmodelswith

nonignorablemissingcovariatevalues.In:StatisticaSinica28.4(2018),pp.1677–1701.

DOI

2018 Article

Fay,R.E.Alternativeparadigmsfortheanalysisofimputedsurveydata.In:JournaloftheAmericanStatistical

Association91.434(1996),pp.490-498.

DOI

1996 Article

Fellegi,I.P.andD.Holt.Asystematicapproachtoautomaticeditandimputation.In:JournaloftheAmerican

StatisticalAssociation71.353(1976),pp.17-35.

DOI

1976 Article

Ferrari,P.A.,P.Annoni,A.Barbiero,etal.Animputationmethodforcategoricalvariableswithapplicationto

nonlinearprincipalcomponentanalysis.In:ComputationalStatistics&DataAnalysis55.7(2011),pp.2410-2420.

DOI

2011 Article

Finkbeiner,C.Estimationforthemultiplefactormodelwhendataaremissing.In:Psychometrika44.4(1979),

pp.409-420.

DOI

1979 Article

Fitzmaurice,G.M.,G.Molenberghs,andS.R.Lipsitz.RegressionModelsforLongitudinalBinaryResponses

withInformativeDrop-Outs.In:JournaloftheRoyalStatisticalSociety.SeriesB(Methodological)57.4(1995),

pp.691–704.

URL

1995 Article

Follmann,D.andM.Wu.Anapproximategeneralizedlinearmodelwithrandomeffectsforinformativemissing

data.In:Biometrics51.1(1995),pp.151-168.

DOI

1995 Article

Gad,A.M.andN.M.M.Darwish.Asharedparametermodelforlongitudinaldatawithmissingvalues.In:

AmericanJournalofAppliedMathematicsandStatistics1.2(2013),pp.30-35.

URL

2013 Article

Gelman,A.,G.King,andC.Liu.Notaskedandnotanswered:Multipleimputationformultiplesurveys.In:

JournaloftheAmericanStatisticalAssociation93.443(1998),pp.846–857.

DOI

1998 Article

Gelman,A.,I.vanMechelen,G.Verbeke,etal.MultipleImputationforModelChecking:Completed-DataPlots

withMissingandLatentData.In:Biometrics61.1(2005),pp.74–85.

DOI

2005 Article

Gondara,L.andK.Wang.MIDA:MultipleImputationusingDenoisingAutoencoders.In:Proceedingsofthe

22ndPacific-AsiaConferenceonKnowledgeDiscoveryandDataMining(PAKDD2018).(Jun.03,2018-Jun.06,

2018).Ed.byD.Phung,V.Tseng,G.Webb,B.Ho,M.GanjiandL.Rashidi.LectureNotesinComputerScience.

SpringerInternationalPublishing,2018,pp.260-272.ISBN:3319930404.

DOI URL

2018 Paper

Goodfellow,I.,M.Mirza,A.Courville,etal.Multi-PredictionDeepBoltzmannMachines.In:Proceedingsofthe

26thInternationalConferenceonNeuralInformationProcessingSystems.(Dec.05,2013-Dec.10,2013).Ed.by

C.Burges,L.Bottou,M.Welling,Z.GhahramaniandK.Weinberger.AdvancesinNeuralInformationProcessing

Systems26.CurranAssociates,Inc.,2013,pp.548–556.

URL

2013 Paper

Graham,J.W.Missingdataanalysis:makingitworkintherealworld.In:AnnualReviewofPsychology60(2009),

pp.549-576.

DOI

2009 Article

Graham,J.W.,S.M.Hofer,S.I.Donaldson,etal.TheScienceofPrevention:MethodologicalAdvancesfrom

AlcoholandSubstanceAbuseResearch.In:TheScienceofPrevention:MethodologicalAdvancesfromAlcohol

andSubstanceAbuseResearch.Ed.byK.Bryant,M.WindleandS.West.Washington,DC,USA:American

PsychologicalAssociation,1997.Chap.Analysiswithmissingdatainpreventionresearch,pp.325-366.ISBN:1-

55798-439-5.

DOI

1997 Book

Graham,J.W.,A.E.Olchowski,andT.E.Gilreath.Howmanyimputationsarereallyneeded?Somepractical

clarificationsofmultipleimputationtheory.In:PreventionScience8.3(2007),pp.206-213.

DOI

2007 Article

Heckman,J.Sampleselectionbiasasaspecificationerror.In:Econometrica47.1(1979),pp.153-161.

DOI

1979 Article

Heckman,J.J.Thecommonstructureofstatisticalmodelsoftruncation,sampleselectionandlimited

dependentvariablesandasimpleestimatorforsuchmodels.In:AnnalsofEconomicandSocialMeasurement

5.4(1976),pp.475-492.

URL

1976 Article

Hogan,J.W.andN.M.Laird.Mixturemodelsforthejointdistributionofrepeatedmeasuresandeventtimes.In:

StatisticsinMedecine16.1-3(1997),pp.239-257.

1997 Article

Hogan,J.W.andT.Lancaster.Instrumentalvariablesandinverseprobabilityweightingforcausalinferencefrom

longitudinalobservationalstudies.In:StatisticalMethodsinMedicalResearch13.1(2004),pp.17-48.

DOI

2004 Article

Honaker,J.,G.King,andM.Blackwell.AmeliaII:aprogramformissingdata.In:JournalofStatisticalSoftware

45.7(2011).eprint:arXiv:1501.0228.

DOI

2011 Article

Hothorn,T.,K.Hornik,andA.Zeileis.UnbiasedRecursivePartitioning:AConditionalInferenceFramework.In:

JournalofComputationalandGraphicalStatistics15.3(2012),pp.651-674.

DOI

2012 Article

Horton,N.J.andK.P.Kleinman.MuchAdoAboutNothing-AComparisonofMissingDataMethodsand

SoftwaretoFitIncompleteDataRegressionModels.In:TheAmericanStatistician61.1(2017),pp.79-90.

DOI

2017 Article

Huisman,M.Imputationofmissingitemresponses:somesimpletechniques.In:Quality&Quantity34.4(2000),

pp.331-351.

DOI

2000 Article

Husson,F.andJ.Josse.Handlingmissingvaluesinmultiplefactoranalysis.In:FoodQualityandPreference30

(2013),pp.77-85.

DOI

2013 Article

Ibrahim,J.G.,S.R.Lipsitz,andM.Chen.MissingCovariatesinGeneralizedLinearModelsWhentheMissing

DataMechanismisNon-Ignorable.In:JournaloftheRoyalStatisticalSociety.SeriesB(StatisticalMethodology)

61.1(1999),pp.173-190.

DOI URL

1999 Article

Ibrahim,J.G.,M.Chen,andS.R.Lipsitz.Missingresponsesingeneralisedlinearmixedmodelswhenthe

missingdatamechanismisnonignorable.In:Biometrika88.2(2001),pp.551-564.

DOI

2001 Article

Ilin,A.andT.Raiko.PracticalapproachestoPrincipalComponentAnalysisinthepresenceofmissingvalues.In:

JournalofMachineLearningResearch11(2010),pp.1957-2000.

URL

2010 Article

Imbert,A.,A.Valsesia,C.LeGall,etal.Multiplehot-deckimputationfornetworkinferencefromRNA

sequencingdata.In:Bioinformatics34.10(2018),pp.1726-1732.

DOI

2018 Article

Jönsson,P.andC.Wohlin.Anevaluationofk-nearestneighbourimputationusinglIkertdata.In:Proceedingsof

the10thInternationalSymposiumonSoftwareMetrics.(Sep.14,2004-Sep.16,2004).Ed.by-.Chicago,IL,

USA:IEEE,2004,pp.1530-1435.ISBN:0769521290.

DOI

2004 Paper

Jamshidian,M.andS.Jalal.Testsofhomoscedasticity,normality,andmissingcompletelyatrandomfor

incompletemultivariatedata.In:Psychometrika75.4(2010),pp.649-674.eprint:NIHMS150003.

DOI

2010 Article

Jamshidian,M.,S.Jalal,andC.Jansen.MissMech:anRpackagefortestinghomoscedasticity,multivariate

normality,andmissingcompletelyatrandom(MCAR).In:JournalofStatisticalSoftware56.6(2014),pp.1-31.

DOI

2014 Article

Jiang,W.,J.Josse,andM.Lavielle.LogisticRegressionwithMissingCovariates–ParameterEstimation,Model

SelectionandPrediction.In:arXivpreprint(2018).arXiv:1805.04602[stat.ME].

2018 Article

Joenssen,D.W.andU.Bankhofer.Donorlimitedhotdeckimputation:effectonparameterestimation.In:

JournalofTheoreticalandAppliedComputerScience6.3(2012),pp.58-70.

URL

2012 Article

Jones,M.P.IndicatorandStratificationMethodsforMissingExplanatoryVariablesinMultipleLinearRegression.

In:JournaloftheAmericanStatisticalAssociation91.433(1996),pp.222-230.

DOI

1996 Article

Josse,J.,M.Chavent,B.Liquet,etal.Handlingmissingvalueswithregularizediterativemultiple

correspondanceanalysis.In:JournalofClassification29.1(2012),pp.91-116.

DOI

2012 Article

Josse,J.andF.Husson.missMDA:apackageforhandlingmissingvaluesinmultivariatedataanalysis.In:

JournalofStatisticalSoftware70.1(2016),pp.1-31.

DOI

2016 Article

Josse,J.andF.Husson.Handlingmissingvaluesinexploratorymultivariatedataanalysismethods.In:Journal

delaSociétéFrançaisedeStatistique153.2(2012),pp.79-99.

URL

2012 Article

Josse,J.,F.Husson,andJ.Pagès.GestiondesdonnéesmanquantesenAnalyseenComposantesPrincipales.

In:JournaldelaSociétéFrançaisedeStatistique150.2(2009),pp.28-51.

URL

2009 Article

Josse,J.,J.Pagès,andF.Husson.Multipleimputationinprincipalcomponentanalysis.In:AdvancesinData

AnalysisandClassification5.3(2011),pp.231-246.

DOI

2011 Article

Josse,J.,N.Prost,E.Scornet,etal.Ontheconsistencyofsupervisedlearningwithmissingvalues.In:arXiv

preprint(2019).arXiv:1902.06931[stat.ML].

URL

2019 Article

Kaiser,J.Dealingwithmissingvaluesindata.In:JournalofSystemsIntegration5.1(2014),pp.42-51.

DOI

2014 Article

Kallus,N.,X.Mao,andM.Udell.CausalInferencewithNoisyandMissingCovariatesviaMatrixFactorization.In:

AdvancesinNeuralInformationProcessingSystems.Ed.by-.2018.eprint:1806.00811.

URL

2018 Paper

Kalton,G.andD.Kasprzyk.Thetreatmentofmissingsurveydata.In:SurveyMethodology12.1(1986),pp.1-16.

URL

1986 Article

Kapelner,A.andJ.Bleich.PredictionwithmissingdataviaBayesianadditiveregressiontrees.In:Canadian

JournalofStatistics43.2(2015),pp.224-239.

DOI URL

2015 Article

Kim,J.K.andJ.Shao.StatisticalMethodsforHandlingIncompleteData.BocaRaton,FL,USA:Chapmanand

Hall/CRC,2013.ISBN:9781482205077.

2013 Book

Kohn,R.andC.F.Ansley.Estimation,prediction,andinterpolationforARIMAmodelswithmissingdata.In:

JournaloftheAmericanStatisticalAssociation81.395(1986),pp.751-761.

DOI

1986 Article

Kowarik,A.andM.Templ.ImputationwiththeRPackageVIM.In:JournalofStatisticalSoftware74.7(2016),

pp.1-16.

DOI

2016 Article

Kropko,J.,B.Goodrich,A.Gelman,etal.MultipleImputationforContinuousandCategoricalData:Comparing

JointMultivariateNormalandConditionalApproaches.In:PoliticalAnalysis22.4(2014),pp.497–519.

DOI

2014 Article

Lee,K.M.,R.Mitra,andS.Biedermann.Optimaldesignwhenoutcomevaluesarenotmissingatrandom.In:

StatisticaSinica28.4(2018),pp.1821–1838.

DOI

2018 Article

Little,R.J.A.Modelingthedrop-outmechanisminrepeated-measuresstudies.In:JournaloftheAmerican

StatisticalAssociation90.431(1995),pp.1112-1121.

DOI

1995 Article

Little,R.J.A.Pattern-mixturemodelsformultivariateincompletedata.In:JournaloftheAmericanStatistical

Association88.421(1993),pp.125-134.

DOI

1993 Article

Little,R.J.A.Atestofmissingcompletelyatrandomformultivariatedatawithmissingvalues.In:Journalofthe

AmericanStatisticalAssociation83.404(1988),pp.1198-1202.

DOI

1988 Article

Little,R.J.A.andD.B.Rubin.StatisticalAnalysiswithMissingData.Wiley,2002,p.408.ISBN:0471183865.

DOI

2002 Book

Little,R.J.A.RegressionwithmissingX’s:areview.In:JournaloftheAmericanStatisticalAssociation87.420

(1992),pp.1227-1237.

DOI

1992 Article

Louis,T.A.FindingtheObservedInformationMatrixwhenUsingtheEMAlgorithm.In:JournaloftheRoyal

StatisticalSociety.SeriesB(Methodological)44.2(1982),pp.226–233.

URL

1982 Article

McLachlan,G.J.andT.Krishnan.TheEMAlgorithmandExtensions.Wileyseriesinprobabilityandstatistics.

Hoboken,NJ,USA:Wiley,2008.ISBN:9780471201700.

2008 Book

Meng,S.L.andD.B.Rubin.MaximumlikelihoodestimationviatheECMalgorithm:ageneralframework.In:

Biometrika80.2(1993),pp.267-278.

DOI

1993 Article

Meng,X.L.andD.B.Rubin.UsingEMtoobtainasymptoticvariance-covariancematrices:theSEMalgorithm.

In:JournaloftheAmericanStatisticalAssociation86.416(1991),pp.899-909.

DOI

1991 Article

Meng,X.L.UYouwantmetoanalyzedataIdon’thave?Areyouinsane?In:ShanghaiArchivesofPsychiatry

24.5(2012),pp.287-301.

DOI URL

2012 Article

Miao,W.andE.J.TchetgenTchetgen.Identificationandinferencewithnonignorablemissingcovariatedata.In:

StatisticaSinica28.4(2018),pp.2049–2067.

DOI

2018 Article

Moeur,M.andA.R.Stage.Mostsimilarneighbor:animprovedsamplinginferenceprocedurefornatural

resourcesplanning.In:ForestScience42.1(1995),pp.337-359.

DOI

1995 Article

Molenberghs,G.,G.Fitzmaurice,M.G.Kenward,etal.HandbookofMissingDataMethodology.Chapman&

Hall/CRCHandbooksofModernStatisticalMethods.NewYork,NY,USA:ChapmanandHall/CRC,2014.ISBN:

9781439854624.

2014 Book

Molenberghs,G.andM.G.Kenward.MissingDatainClinicalStudies.Chichester,WestSussex,UK:Wiley,

2007.ISBN:9780470849811.

DOI

2007 Book

Molenberghs,G.,B.Michiels,M.G.Kenward,etal.Monotonemissingdataandpattern-mixturemodels.In:

StatisticaNeerlandica52.2(1998),pp.153-161.

DOI

1998 Article

Molnar,F.J.,B.Hutton,andD.Fergusson.Doesanalysisusinglastobservationcarriedforwardintroducebiasin

dementiaresearch?In:CanadianMedicalAssociationJournal179.8(2008),pp.751-753.

DOI

2008 Article

Moritz,S.andT.Bartz-Beielstein.imputeTS:timeseriesmissingvalueimputationinR.In:TheRJournal9.1

(2017),pp.207-218.

URL

2017 Article

Moritz,S.,A.Sardá,T.Bartz-Beielstein,etal.Comparisonofdifferentmethodsforunivariatetimeseries

imputationinR.PrepintarXiv1510.03924.2015.

URL

2015 Misc

Murray,J.S.andJ.P.Reiter.MultipleImputationofMissingCategoricalandContinuousValuesviaBayesian

MixtureModelsWithLocalDependence.In:JournaloftheAmericanStatisticalAssociation111.516(2016),

pp.1466-1479.

DOI

2016 Article

NationalResearchCouncil,U.ThePreventionandTreatmentofMissingDatainClinicalTrials.Washington(DC),

USA:NationalAcademiesPress,2010.ISBN:9780309158145.

DOI

2010 Book

Nowicki,R.K.,R.Scherer,andL.Rutkowski.Novelroughneuralnetworkforclassificationwithmissingdata.In:

21stInternationalConferenceonMethodsandModelsinAutomationandRobotics(MMAR).(Sep.29,2016-

Sep.01,2016).Ed.by-.IEEE,2016,pp.820–825.

DOI

2016 Paper

O’Kelly,M.andB.Ratitch.ClinicalTrialswithMissingData:AGuideforPractitioners.JohnWiley&Sons,Ltd,

2014.

DOI

2014 Book

Orchard,T.andM.A.Woodbury.Amissinginformationprinciple:theoryandapplications.In:Proceedingsofthe

SixthBerkeleySymposiumonMathematicalStatisticsandProbability,Volume1:TheoryofStatistic.Ed.byL.M.

LeCam,N.J.andE.L.Scott.Vol.1.UniversityofCaliforniaPress,1972,pp.697–715.

URL

1972 Paper

Peugh,J.L.andC.K.Enders.Missingdataineducationalresearch:areviewofreportingpracticesand

suggestionsforimprovement.In:ReviewofEducationalResearch74.4(2004),pp.525–556.

DOI URL

2004 Article

Pigott,T.D.Areviewofmethodsformissingdata.In:EducationalResearchandEvaluation7.4(2001),pp.353–

383.

DOI

2001 Article

Preisser,J.S.,K.K.Lohman,andP.J.Rathouz.Performanceofweightedestimatingequationsforlongitudinal

binarydatawithdrop-outsmissingatrandom.In:StatisticsinMedicine21.20(2002),pp.3035–3054.

DOI

2002 Article

Rahman,G.andZ.Islam.Missingvalueimputationusingdecisiontreesanddecisionforestsbysplittingand

mergingrecords:Twonoveltechniques.In:Knowledge-BasedSystems53(2013),pp.51–65.

DOI URL

2013 Article

Rao,J.N.K.andJ.Shao.Jackknifevarianceestimationwithsurveydataunderhotdeckimputation.In:

Biometrika79.4(1992),pp.811-822.

DOI

1992 Article

Reilly,M.andM.Pepe.Therelationshipbetweenhot-deckmultipleimputationandweightedlikelihood.In:

StatisticsinMedecine16.1-3(1997),pp.5-19.

DOI

1997 Article

Reiter,J.P.andM.Sadinle.Itemwiseconditionallyindependentnonresponsemodellingforincomplete

multivariatedata.In:Biometrika104.1(Jan.2017),pp.207-220.eprint:http://oup.prod.sis.lan/biomet/article-

pdf/104/1/207/13066719/asw063.pdf.

DOI URL

2017 Article

Rieger,A.,T.Hothorn,andC.Strobl.Randomforestswithmissingvaluesinthecovariates.Tech.rep.79.

UniversityofMunich,DepartmentofStatistics,2010.

URL

2010 Misc

Robins,J.M.,A.Rotnitzky,andL.P.Zhao.EstimationofRegressionCoefficientsWhenSomeRegressorsare

notAlwaysObserved.In:JournaloftheAmericanStatisticalAssociation89.427(1994),pp.846-866.

DOI

1994 Article

Robins,J.M.,A.Rotnitzky,andL.P.Zhao.Analysisofsemiparametricregressionmodelsforrepeatedoutcomes

inthepresenceofmissingdata.In:JournaloftheAmericanStatisticalAssociation90.429(1995),pp.106-121.

DOI

1995 Article

Robins,J.M.andN.Wang.Inferenceforimputationestimators.In:Biometrika87.1(2000),pp.113-124.

URL

2000 Article

Rosseel,Y.lavaan:anRpackageforstructuralequationmodeling.In:JournalofStatisticalSoftware48.2(2012).

DOI

2012 Article

Rotnitzky,A.,J.M.Robins,andD.O.Scharfstein.Semiparametricregressionforrepeatedoutcomeswith

nonignorablenonresponse.In:JournaloftheAmericanStatisticalAssociation93.444(1998),pp.1321-1339.

DOI

1998 Article

Rubin,D.B.Multipleimputationafter18+years.In:JournaloftheAmericanStatisticalAssociation91.434

(2012),pp.473-489.

DOI

2012 Article

Rubin,D.B.MultlipeImputationforNonresponseinSurveys.Hoboken,NJ,USA:Wiley,1987.ISBN:

9780471655740.

1987 Book

Rubin,D.B.Formalizingsubjectivenotionsabouttheeffectofnonrespondentsinsamplesurveys.In:Journalof

theAmericanStatisticalAssociation72.359(1977),pp.538-543.

DOI

1977 Article

Rubin,D.B.Inferenceandmissingdata.In:Biometrika63.3(1976),pp.581-592.

DOI

1976 Article

Sadinle,M.andJ.P.Reiter.SequentialIdentificationofNonignorableMissingDataMechanisms.In:Statistica

Sinica28.4(2018),pp.1741–1759.

DOI

2018 Article

Schafer,J.L.Multipleimputation:aprimer.In:StatisticalMethodsinMedicalResearch8.1(1999),pp.3-15.

DOI

1999 Article

Schafer,J.L.AnalysisofIncompleteMultivariateData.CRCMonographsonStatistics&AppliedProbability.

BocaRaton,FL,USA:ChapmanandHall/CRC,1997.ISBN:0412040611.

1997 Book

Schafer,J.L.andJ.W.Graham.Missingdata:ourviewofthestateoftheart.In:PsychologicalMethods7.2

(2002),pp.147-177.

DOI

2002 Article

Schafer,J.L.andM.K.Olsen.MultipleImputationformultivariatemissing-dataproblems:adataanalyst’s

perspective.In:MultivariateBehavioralResearch33.4(1998),pp.545-571.

DOI

1998 Article

Seaman,S.,J.Galati,D.Jackson,etal.WhatIsMeantby“MissingatRandom”?In:StatisticalScience28.2

(2013),pp.257–268.

DOI URL

2013 Article

Seaman,S.R.andI.R.White.Reviewofinverseprobabilityweightingfordealingwithmissingdata.In:

StatisticalMethodsinMedicalResearch22.3(2011),pp.278-295.

DOI

2011 Article

Seaman,S.R.andS.Vansteelandt.IntroductiontoDoubleRobustMethodsforIncompleteData.In:Statistical

Science33.2(2018),p.184.

DOI

2018 Article

Shao,J.andJ.Zhang.Atransformationapproachinlinearmixed-effectsmodelswithinformativemissing

responses.In:Biometrika102.1(2015),pp.107-119.

DOI

2015 Article

Sharpe,P.K.andR.J.Solly.Dealingwithmissingvaluesinneuralnetwork-baseddiagnosticsystems.In:Neural

Computing&Applications3.2(1995),pp.73-77.

DOI

1995 Article

Simon,G.A.andJ.S.Simonoff.Diagnosticplotsformissingdatainleastsquaresregression.In:Journalofthe


DOI

1986 Article

Śmieja,M.,Ł.Struski,J.Tabor,etal.Processingofmissingdatabyneuralnetworks.In:ComputingResearch

Repositoryabs/1805.07405(2018).eprint:1805.07405.

URL

2018 Article

Sovilj,D.,E.Eirola,Y.Miche,etal.Extremelearningmachineformissingdatausingmultipleimputations.In:

Neurocomputing174.A(2016),pp.220-231.

DOI

2016 Article

Sportisse,A.,C.Boyer,andJ.Josse.Imputationandlow-rankestimationwithMissingNonAtRandomdata.In:

arXivpreprint(2018).arXiv:1812.11409[stat.ML].

URL

2018 Article

Stacklies,W.,H.Redestig,M.Scholz,etal.pcaMethods–abioconductorpackageprovidingPCAmethodsfor

incompletedata.In:Bioconductor23.9(2007),pp.1164-1167.

DOI

2007 Article

Stage,A.R.andN.L.Crookston.Partitioningerrorcomponentsforaccuracy-assessmentofnear-neighbor

methodsofimputation.In:ForestScience53.1(2007),pp.62-72.

DOI

2007 Article

Stekhoven,D.J.andP.Bühlmann.Missforest-non-parametricmissingvalueimputationformixed-typedata.In:

Bioinformatics28.1(2012),pp.112-118.eprint:1105.0828.

DOI

2012 Article

Strobl,C.,A.L.Boulesteix,andT.Augustin.UnbiasedsplitselectionforclassificationtreesbasedontheGini

Index.In:ComputationalStatistics&DataAnalysis52.1(2007),pp.483-501.

DOI

2007 Article

Stuart,E.A.,M.Azur,C.Frangakis,etal.Multipleimputationwithlargedatasets:acasestudyofthechildren’s

mentalhealthinitiative.In:AmericanJournalofEpidemiology169.9(2009),pp.1133-1139.

DOI

2009 Article

Stubbendick,A.L.andJ.G.Ibrahim.MaximumLikelihoodMethodsforNonignorableMissingResponsesand

CovariatesinRandomEffectsModels.In:Biometrics59.4(2003),pp.1140–1150.

DOI

2003 Article

Stubbendick,A.L.andJ.G.Ibrahim.Likelihood-basedinferencewithnonignorablemissingresponsesand

covariatesinmodelsfordiscretelongitudinaldata.In:StatisticaSinica16.4(2006),pp.1143–1167.

URL

2006 Article

Su,Y.S.,A.Gelman,J.Hill,etal.Multipleimputationwithdiagnostics(mi)inR:openingwindowsintotheblack

box.In:JournalofStatisticalSoftware45(2011),p.2.

DOI

2011 Article

Tanner,M.A.andW.Wong.Thecalculationofposteriordistributionsbydataaugmentation.In:Journalofthe


DOI URL

1987 Article

TchetgenTchetgen,E.J.,L.Wang,andB.Sun.Discretechoicemodelsfornonmonotonenonignorablemissing

data:identificationandinference.In:StatisticaSinica28.4(2018),pp.2069–2088.

DOI

2018 Article

Templ,M.,A.Alfons,andP.Filzmoser.ExploringIncompletedatausingvisualizationtechniques.In:Advancesin

DataAnalysisandClassification6.1(2012),pp.29-47.

DOI

2012 Article

Thijs,H.,G.Molenberghs,B.Michiels,etal.Strategiestofitpattern-mixturemodels.In:Biostatistics3.2(2002),

pp.245-265.

DOI

2002 Article

Tierney,N.andD.Cook.Expandingtidydataprinciplestofacilitatemissingdataexploration,visualizationand

assessmentofimputations.MonashEconometricsandBusinessStatisticsWorkingPapers14/18.Monash

University,DepartmentofEconometricsandBusinessStatistics,2018.

URL

2018 Misc

Tierney,N.J.,F.A.Harden,M.J.Harden,etal.Usingdecisiontreestounderstandstructureinmissingdata.In:

BMJOpen5.6(2015),p.e007450.

DOI

2015 Article

Tran,L.,X.Liu,J.Zhou,etal.MissingModalitiesImputationviaCascadedResidualAutoencoder.In:2017IEEE

ConferenceonComputerVisionandPAtternRecognition(CVPR).(Jul.21,2017-Jul.26,2017).Ed.by-.IEEE,

2017,pp.4971-4980.

DOI

2017 Paper

Troyanskaya,O.,M.Cantor,G.Sherlock,etal.MissingvalueestimationmethodsforDNAmicroarrays.In:

Bioinformatics17.6(2001),pp.520-525.

DOI

2001 Article

Twala,B.E.T.H.,M.C.Jones,andD.J.Hand.Goodmethodsforcopingwithmissingdataindecisiontrees.In:

PatternRecognitionLetters29.7(2008),pp.950-956.

DOI

2008 Article

Unnebrink,K.andJ.Windeler.Intention-to-treat:methodsfordealingwithmissingvaluesinclinicaltrialsof

progressivelydeterioratingdiseases.In:StatisticsinMedecine20.24(2001),pp.3931-3946.

DOI

2001 Article

Velden,M.vandeandT.H.A.Bijmolt.Generalizedcanonicalcorrelationanalysisofmatriceswithmissingrows:

asimulationstudy.In:Psychometrika(2006).

DOI

2006 Article

Verbanck,M.,J.Josse,andF.Husson.RegularisedPCAtodenoiseandvisualisedata.In:Statisticsand

Computing25.2(2015),pp.471-486.

DOI

2015 Article

Verbeke,G.,G.Molenberghs,H.Thijs,etal.Sensitivityanalysisfornonrandomdropout:alocalinfluence

approach.In:Biometrics57.1(2001),pp.7-14.

DOI

2001 Article

Voillet,V.,P.Besse,L.Liaubet,etal.Handlingmissingrowsinmulti-omicsdataintegration:multipleimputation

inmultiplefactoranalysisframework.In:BMCBioinformatics17.402(2016).Forthcoming.

DOI

2016 Article

Wal,W.M.vanderandR.B.Geskus.ipw:anRpackageforinverseprobabilityweighting.In:Journalof

StatisticalSoftware43.13(2011).

DOI

2011 Article

Vansteelandt,S.,J.Carpenter,andM.G.Kenward.Analysisofincompletedatausinginverseprobability

weightinganddoublyrobustestimators.In:Methodology–EuropeanJournalofResearchMethodsforthe

BehavioralandSocialSciences6.1(2010),pp.37–48.

DOI

2010 Article

Vansteelandt,S.,A.Rotnitzky,andJ.Robins.Estimationofregressionmodelsforthemeanofrepeated

outcomesundernonignorablenonmonotonenonresponse.In:Biometrika94.4(2007),pp.841–860.

DOI

2007 Article

Wainer,H.,ed.DrawingInferencesfromSelf-SelectedSamples.NewYork,NY,USA:Springer,1986. 1986 Book

Wang,N.andJ.M.Robins.Large-sampletheoryforparametricmultipleimputationprocedures.In:Biometrika

85.4(1998),pp.935–948.

DOI

1998 Article

White,I.R.,J.Carpenter,andN.J.Horton.Ameanscoremethodforsensitivityanalysistodeparturesfromthe

missingatrandomassumptioninrandomisedtrials.In:StatisticaSinica28.4(2018),pp.1985–2003.

DOI

2018 Article

Wu,M.C.andR.J.Carroll.Estimationandcomparisonofchangesinthepresenceofinformativeright

censoringbymodelingthecensoringprocess.In:Biometrics44.1(1988),pp.175-188.

DOI

1988 Article

Xie,X.andX.L.Meng.Dissectingmultipleimputationfromamulti-phaseinferenceperspective:whathappens

whenGod’s,imputer’sandanalyst’smodelsareuncongenial?In:StatisticaSinica27.4(2017),pp.1485–1594.

DOI

2017 Article

Yang,S.,L.Wang,andP.Ding.Identificationandestimationofcausaleffectswithconfounderssubjectto

instrumentalmissingness.In:StatisticsMethodologyRepository(2017).

URL

2017 Article

Yoon,J.,J.Jordon,andM.vanderSchaar.GAIN:MissingDataImputationusingGenerativeAdversarialNets.

In:Proceedingsofthe35thInternationalConferenceonMachineLearning.(Jul.10,2018-Jul.15,2018).Ed.by

J.DyandA.Krause.Vol.80.ProceedingsofMachineLearningResearch.Stockholmsmässan,Stockholm

Sweden:PMLR,2018,pp.5689–5698.

URL

2018 Paper

Zhang,H.,P.Xie,andE.Xing.MissingValueImputationBasedonDeepGenerativeModels.In:Computing

ResearchRepositoryabs/1808.01684(2018).

URL

2018 Article

Zhang,S.NearestneighborselectionforiterativekNNimputation.In:JournalofSystemsandSoftware85.11

(2012),pp.2541-2552.

DOI

2012 Article

Zhou,Y.,R.J.A.Little,andJ.D.Kalbfleisch.Block-conditionalmissingatrandommodelsformissingdata.In:

StatisticalScience25.4(2010),pp.517–532.

DOI

2010 Article

Zhu,Z.,T.Wang,andR.J.Samworth.High-dimensionalprincipalcomponentanalysiswithheterogeneous

missingness.In:arXivpreprint(2019).

URL

2019 Article

About






Vialaneix.

Read more →

FAQ →

Follow us!

Events

GitHub

(b) Unordered version

Figure 2: Bibliography overview.

10

2.4 Implementations

R packages As mentioned in the introduction, prior to the platform development, theproject started with the release of the MissingData CRAN Task View, which currently listsapproximately 150 R packages. The CRAN Task View is continuously updated, adding newR packages, and removing obsolete ones. Packages are organized by topic: exploration ofmissing data, likelihood based approaches, single imputation, multiple imputation, weightingmethods, specific types of data, specific application fields. We selected only those that weresufficiently mature and stable, and that were already published on CRAN or Bioconductor.This choice was made to ensure all listed packages can easily be installed and used bypractitioners.

Despite the Task View classifying packages into different sub-domains, it may still be achallenge for practitioners and researchers inexperienced with missing values to choose theright package for the right application. To address this challenge, we provide a partial, lesscondensed overview on existing R packages on the platform, choosing the most popularand versatile. This overview is a blend of the Task View, and the individual packagedescription pages and vignettes. For each selected package (7 at the date of writing of thisarticle, namely imputeTS, mice, missForest, missMDA, naniar, simputation and VIM), weprovided a category (in the style of the categorization in the Task View), a short descriptionof use-cases, its description (as on CRAN), the usual CRAN statistics (number of monthlydownloads, last update), the handled data formats (e.g., data.frame, matrix, vector), alist of implemented algorithms (e.g., k-means, PCA, decision tree), and the list of availabledatasets, some references (such as articles and books) and a small chunk of code, ready fora direct execution on the platform via the R package Documentation9. Figure 3 shows thecondensed view of the package page and the expanded description sheet of a given package(here naniar).

We believe shortlisting R packages is especially useful for practitioners new to the field,as it demonstrates data analysis that handles missing values in the data. We are awarethat this selection is subjective, and welcome external suggestions for other packages toadd to this shortlist.

Python modules To the best of our knowledge, there only exist very few implementedmethods for handling missing data in Python. However, one of the major libraries for ma-chine learning and data analysis, scikit-learn (Pedregosa et al., 2011) has recently proposeda module for simple and multiple imputations, sklearn.impute. And, as an alternative, the

9https://rdrr.io/snippets/

11

https://CRAN.R-project.org/view=MissingData

Home

Workflows

Bibliography

Lectures

R packages

Data

People

More info & links

Contact

R-miss-tastic



Back to top

R Packages

This page provides introductions to popular missing data packages with small examples

on how to use them. Thus the page gives more extensive information than the CRAN

Task View on Missing Data, which is recommended to get a first overall overview about

the CRAN missing data landscape.

Youcanalsocontributeonyourowntothispageandprovideashortintroductiontoa

missingdatapackage.Takea lookat thisshortdescriptiononhow todo this.Weare

veryhappyaboutallcontributions.

Search Sort by name Sort by Category

missMDA

Category: Single and multiple Imputation, Multivariate Data Analysis

Imputationofincompletecontinuousorcategoricaldatasets;Missing

valuesareimputedwithaprincipalcomponentanalysis(PCA),a

multiplecorrespondenceanalysis(MCA)modeloramultiplefactor

analysis(MFA)model;PerformmultipleimputationwithandinPCAor

MCA.

downloadsdownloads 4000/month4000/month CRANCRAN 2019-01-232019-01-23

more..

imputeTS

Category: Time-Series Imputation, Visualisations for Missing Data

Imputation(replacement)ofmissingvaluesinunivariatetimeseries.

Offersseveralimputationfunctionsandmissingdataplots.Available

imputationalgorithmsinclude:'Mean','LOCF','Interpolation','Moving

Average','SeasonalDecomposition','KalmanSmoothingonStructural

TimeSeriesmodels','KalmanSmoothingonARIMAmodels'.

downloadsdownloads 12K/month12K/month CRANCRAN 2019-07-012019-07-01

more..

mice

Category: Multiple Imputation

MultipleimputationusingFullyConditionalSpecification(FCS)

implementedbytheMICEalgorithmasdescribedinVanBuurenand

Groothuis-Oudshoorn(2011).Eachvariablehasitsownimputation

model.Built-inimputationmodelsareprovidedforcontinuousdata

(predictivemeanmatching,normal),binarydata(logisticregression),

unorderedcategoricaldata(polytomouslogisticregression)and

orderedcategoricaldata(proportionalodds).MICEcanalsoimpute

continuoustwo-leveldata(normalmodel,pan,second-levelvariables).

Passiveimputationcanbeusedtomaintainconsistencybetween

variables.Variousdiagnosticplotsareavailabletoinspectthequalityof

theimputations.


more..

naniar

Category: Visualisations for Missing Data

Missingvaluesareubiquitousindataandneedtobeexploredand

handledintheinitialstagesofanalysis.'naniar'providesdatastructures

andfunctionsthatfacilitatetheplottingofmissingvaluesand

examinationofimputations.Thisallowsmissingdatadependenciesto

beexploredwithminimaldeviationfromthecommonworkpatternsof

'ggplot2'andtidydata.


more..

missForest

Category: Single Imputation

Thefunction'missForest'inthispackageisusedtoimputemissing

valuesparticularlyinthecaseofmixed-typedata.Itusesarandom

foresttrainedontheobservedvaluesofadatamatrixtopredictthe

missingvalues.Itcanbeusedtoimputecontinuousand/orcategorical

dataincludingcomplexinteractionsandnon-linearrelations.Ityieldsan

out-of-bag(OOB)imputationerrorestimatewithouttheneedofatest

setorelaboratecross-validation.Itcanberuninparalleltosave

computationtime.


more..

simputation

Category: Single Imputation, Meta-Package

Easytouseinterfacestoanumberofimputationmethodsthatfitinthe

not-a-pipeoperatorofthe'magrittr'package.


more..

VIM

Category: Single Imputation, Visualisations for Missing Data

Newtoolsforthevisualizationofmissingand/orimputedvaluesare

introduced,whichcanbeusedforexploringthedataandthestructure

ofthemissingand/orimputedvalues.Dependingonthisstructureof

themissingvalues,thecorrespondingmethodsmayhelptoidentifythe

mechanismgeneratingthemissingvaluesandallowstoexplorethe

dataincludingmissingvalues.Inaddition,thequalityofimputationcan

bevisuallyexploredusingvariousunivariate,bivariate,multipleand

multivariateplotmethods.Agraphicaluserinterfaceavailableinthe

separatepackageVIMGUIallowsaneasyhandlingoftheimplemented

plotmethods.


more..

Your favorite package is missing? Here is an explanation on how to make an entry for

your package. Template

Share

About






Vialaneix.

Read more →

FAQ →

Follow us!

Events

GitHub

(a) Extract of global view

Home

Workflows

Bibliography

Lectures

R packages

Data

People

More info & links

Contact

R-miss-tastic



Back to top

Package:

naniar

Category:

Data Structures, Summaries, and Visualisations for Missing Data

Use-Cases:

Visualization of missing values, descriptive statistics, …

Popularity:

downloadsdownloads 5305/month5305/month

Description:

Missing values are ubiquitous in data and need to be carefully explored and handled in

the initial stages of analysis. In this vignette we describe the tools in the package naniar

for exploring missing data structures with minimal deviation from the common

workflows of ggplot and tidy data.

Last update:

CRANCRAN 2019-02-152019-02-15

Datasets:

oceanbuoys

pedestrian

riskfactors

Further Information:

Tierney, N. J., & Cook, D. H. (2018). Expanding tidy data principles to facilitate

missing data exploration, visualization and assessment of imputations. arXiv

preprint arXiv:1809.02264. PDF (on arXiv)

Vignettes

Related visdat R-package

Input:

data.frame, vector

Example:

library(naniar)

data(airquality)

print("printdatasetwithNAs")

print(head(airquality))

##Replace"NA"valueswithvalues10%lower

##thantheminimumvalueinthatvariable.

##Thisisdonebycallingthegeom_miss_point()function

ggplot2::ggplot(airquality,

ggplot2::aes(x=Solar.R,

y=Ozone))+

geom_miss_point()

Here you can have a interactive look at the example:

library(naniar)

data(airquality)

print("printdatasetwithNAs")print(head(airquality))

##Replace“NA”valueswithvalues10%lower##thantheminimumvalueinthatvariable.##Thisisdonebycallingthegeom_miss_point()functionggplot2::ggplot(airquality, ggplot2::aes(x=Solar.R, y=Ozone))+geom_miss_point()Run (Ctrl-Enter)

You should assume that any scripts or data

that you put into this service are public.

Privacy policy.

Computation provided by rdrr.io: hosting

documentation for all R packages.

https://rdrr.io/snippets/embedding/

Share

About






Vialaneix.

Read more →

FAQ →

Follow us!

Events

GitHub

(b) Description sheet

Figure 3: R packages overview.

12

statsmodels10 library now also has an implementation module for multiple imputation inPython. We regularly survey new Python implementations for handling missing values and,if pertinent from a theoretical and practical point of view, reference them on our platform.We expect this to promote their use but also additional assessment by practitioners andresearchers from the missing values (statistics/machine learning) community.

2.5 Datasets

Especially in methodology research, an important aspect is the comparison of differentmethods to assess the respective strengths and weaknesses. There are several datasets thatare recurrent in the missing values literature but, they have not been listed together yet.We gathered publicly available datasets that have been used recurrently for comparisonor illustration purposes in publications, R packages and tutorials. Most of these datasetsare already included in R packages but some are available on other data collections. Fig-ure 4 shows how the datasets are presented, with a detailed description shown for oneof the dataset (“Ozone”, obtained by clicking on its name). The description follows theUCI Machine Learning Repository presentation (Dua and Graff, 2019), including a shortdescription of the dataset, how to obtain it, external references describing the dataset inmore details, and links to tutorials/lectures on our websites or to vignettes in R packagesthat use the dataset.

In addition, the Datasets section also references existing methods for generating miss-ing data, given assumptions on their generation mechanisms (as in the R package mice).These datasets also serve as benchmark in the proposed workflows and allow for a fair andtransparent comparison between the different methods.

Note however, that the list of datasets we gather here is comparatively short whencompared to benchmark datasets for full data methods such as the UCI Machine LearningRepository. Therefore, our proposed list also serves as an invitation to tackle this lack ofa wider variety of common benchmark datasets in the missing data community.

2.6 Additional content

This unified platform collects and edits the contributions of numerous individuals whohave investigated the missing values problems, and developed methods to handle them formany years. To provide an overview of some of the main actors in this field, the list of all

10https://www.statsmodels.org/stable/about.html

13

https://www.statsmodels.org/stable/about.html

Home

Workflows

Bibliography

Lectures

R packages

Data

People

More info & links

Contact

R-miss-tastic



Back to top

Here you will find a constantly growing list of interesting data sets which are frequently

used in the R community working on missing values. These data sets can be useful to

get familiar with different concepts in handling missing values and to assess the quality

and performance of new methods.

Ifyouhavesuggestionsonotherdatasetswhichmightbeofinteresttoothers,please

feelfreetocontactusviatheContactform.

Complete data

If you wish to evaluate a certain missing data method on real (or simulated) data it can

be useful to first generate missing values in a complete dataset. This allows to control

the response mechanism and evaluate the method for different response mechanisms.

Some useful tools for this:

The ampute function of the mice R-package. Rianne Schouten and her colleagues

wrote a self-contained tutorial on how to ampute data.

The R workflow on How to generate missing values? extending some

functionalities of the ampute function. For the related R source code click here.

The missCompare R-package.

Incomplete data

The data sets listed below are either widely used in general in the missing data

community or used for illustration of different methods handling missing values in the

tutorials from the Tutorials and R packages sections. This presentation scheme is

inspired by the UCI Machine Learning Repository.

Click on a table entry to obtain further information about the data set.

Share

Name Data Types

Attribute

Types

#

Instances

#

Attributes

%

Missing

entries

Complete

data

available Year

Airquality Multivariate,

Time Series

Real 154 6 7 No 1973

chorizonDL Multivariate Integer,

Real

606 110 15 Yes 1998

Health

Nutrition And

Population

Statistics

Multivariate,

Time Series

Integer,

Real

15,022 397 54 No 2017

NHANES Multivariate Categorical,

Integer,

Real

10,000 75 37 No 2012

oceanbuoys Multivariate,

Time Series

Real 736 8 3 No 1997

Ozone Multivariate Categorical,

Integer,

Real

366 13 6 No 1976

Los Angeles Ozone Pollution Data, 1976. This data set contains daily measurements of ozone concentration and

meteorological quantities. It can be found in R in the mlbench package and is loaded by calling data(Ozone).

More information on the dataset.

Tutorials illustrating methods on this data:

Julie Josse's course on missing values imputation using PC methods.

Julie Josse's and Nick Tierney's tutorial on handling missing values. Download the data set from this

tutorial: ozoneNA.csv

Nick Tierney's naniar vignette for missing data visualization.

pedestrian Multivariate,

Time series

Categorical,

Integer

37,700 9 2 No 2016

riskfactors Multivariate Categorical,

Integer,

Real

245 34 14 No 2009

SBS52424 Multivariate Real 262 9 2 No 2016

sleep Multivariate Integer,

Real

62 10 6 No 1976

tsAirgap Time series Integer 144 1 9 Yes 1960

tsHeating Time series Real 606,837 1 9 Yes 2015

tsNH4 Time series Real 4,552 1 9 Yes 2014

About






Vialaneix.

Read more →

FAQ →

Follow us!

Events

GitHub

Figure 4: Datasets overview.

14

contributors who agreed to appear on this platform is given with links to their personal orto their research lab website.

We also provide links to other interesting websites or working groups, not necessarilyworking with R and Python (Van Rossum and Drake, 2009) but with other programminglanguages such as SAS/STAT® and STATA (StataCorp., 2019).

The platform also provides two other features to engage the community:

1. a regularly updated list of events such as conferences or summer schools with specialfocus on missing values problems, and

2. a list of recurring questions together with short answers and links for more details forevery question.

3 Workflows

After this general introduction to the R-miss-tastic project and platform and the overviewof its structure, we now turn to a more detailed presentation of the various workflows wehave developed and proposed on this platform.

To allow for both hands-on tutorials illustrating current practices and state-of-the-art and ready-to-use pipelines, we propose the workflows under different formats such asHTML, PDF, R Markdown (for R code) and IPYNB (for Python code). We encouragepractitioners and researchers to use and adapt these workflows, in order to increase repro-ducibility and comparability of their work. Of course, we are aware that these workflowsdo not cover the entire spectrum of existing methods and data problems. The goal of theproposed workflows is rather to initiate a joint effort to create a larger spectrum of open-source workflows, and to encourage the use of standardized procedures to handle missingvalues.

With an incomplete dataset at hand, prior to embarking on an in-depth statisticalanalysis, a specific aim has to be defined in order to choose a specific method to use. Anexample of a method whose success crucially depends on the analyst’s goal is mean imputa-tion: this approach is strongly counter-indicated if the aim is to estimate parameters, butit can be consistent if the aim is to predict as well as possible (Josse et al., 2019). Followingthis observation, our workflows are divided into different parts, defined by the objective ofthe statistical analysis. We have tried to present and compare the main implementationsavailable both in R and Python for each objective. Currently there are seven workflowsavailable on the platform and we present their scope and use below.

15

3.1 How to generate missing values?

Rubin (1976) classifies the cause for a lack of data into three missing data mechanisms.The missing data mechanism is said to be: (i) missing completely at random (MCAR) ifthe lack of data is totally independent of the data values, (ii) missing at random (MAR) ifthe process that causes the missing data may only depend on the observed values and (iii)missing not at random (MNAR) if the unavailability of the data depends on the missingvariables, for instance on their values themselves and possibly on the values of observedvariables. More formally, let us define the indicator matrix R ∈ Rn×p, which indicates theindices of the observed values in X ∈ Rn×p, i.e., Rij = 1 if Xij is observed and Rij = 0otherwise. The missing data mechanism is then the distribution of the indicator matrix Rgiven the data X.

The goal of this workflow is to propose functions to generate missing values under thesedifferent mechanisms. The way in which the missing values are generated is crucial forcomparing methods (and studies) in a fair manner and is often subject of debate (Seamanet al., 2013). This code aims to unify classical solutions to do this. Indeed, a usual strategyto compare imputation or estimation strategies is to introduce (additional) missing valuesin the dataset, and use the ground truth for these missing values to evaluate the strategies.

In R In the R workflow, the main function produce NA allows to generate missing valuesunder the three missing-data mechanisms outlined above. This function internally calls theampute function of the mice package (van Buuren and Groothuis-Oudshoorn, 2011) butwe chose to simplify its use while adding some additional options to specify the missingvalues generation. In addition, the original ampute function generates missing values onlyfor complete dataset. In our workflow, the user can easily introduce (additional) missingvalues in a complete or incomplete dataset composed of quantitative, categorical or mixedvariables. The three main arguments are the initial dataset (data) in which the missingvalues are introduced using a given missing data mechanism (mechanism) and a givenpercentage of missing values (perc.missing). For example, to introduce 20% of MCARvalues in the dataset X, the code is detailed below.

1 X.miss.mcar

The function returns the data matrix containing the new missing values (and the previouslypresent missing values of the input data) and the indicator matrix R.

To introduce missing values under the MAR and MNAR mechanisms, there are severaloptions detailed and illustrated in the workflow. We consider the definitions of the missingdata mechanisms of Rubin (1976). For instance, if X contains three variables denoted asX1, X2, X3, two options are available to generate MAR values:

1. the first option consists of generating missing values in X1 by using a logistic modeldepending on (X2, X3), which are observed, i.e.

P(R1 = 0|X;φ) = 1/(1 + exp(−(φ2X2 + φ3X3)),

where φ = (φ2, φ3) is the parameter of the missing-data mechanism. In our function,φ is chosen so that the given percentage of missing values is reached. This allows toobtain missing values in the first variable XNA1 . Then, the same strategy is performedto introduce missing values in X2 and X3, by using a logistic model depending on(X1, X3) (which are observed) and (X1, X2) (which are observed) respectively. Toget the final matrix containing missing values, we concatenate XNA1 , X

NA2 and X

NA3

by handling the rows containing only missing values. To use this option, the code isdetailed below.

1 X.miss.mar

1 X.miss.marpat

For the MAR mechanism, by contrast with the R workflow, the Python code relies onthe definition of Little and Rubin (2002). For instance, if we have X = (X1, X2, X3), thenat least one variable should be fully-observed (say X3) and missing values in X1 and X2are introduced with the following logistic model,

P(R1 = 0|X;φ) = P(R2 = 0|X;φ) = 1/(1 + exp(−φ3X3),

with φ = φ3 the parameter of the missing-data mechanism chosen to reach the givenpercentage of missing values. To introduce 20% missing values in each missing variable(i.e., X1 and X2), the following code can be used, with p obs the proportion of fullyobserved variables.

1 X_miss_mar = produce_NA(X=X, p_miss =0.2,

2 mecha="MAR", p_obs =0.3)

Listing 6: Generating MAR values in Python.

For the MNAR mechanism, three main options are available, using a logistic model,a quantile censorship or a logistic model for a self-masked mechanism (for their exactdefinition, we refer to the workflow). For example, to introduce 20% of self-masked missingvalues (in all the variables), we can use the code below.

1 X_miss_selfmasked = produce_NA(X=X, p_miss =0.4,

2 mecha="MNAR", opt="selfmasked")

Listing 7: Generating self-masked MNAR values in Python.

3.2 How to impute missing values?

There exists a vast literature on how to impute missing values. The aim of these workflows(in R and Python) is to compare the most classical imputation methods and to proposea reference pipeline for comparison on simulated and real datasets, which can be easilyextended with other imputation methods. Different types of methods are included:

1. imputation by the mean, which serves as a naive benchmark.

2. conditional models, if, roughly speaking, the imputation relies on the distributions ofeach variable given the others.

• in R:

19

https://rmisstastic.netlify.app/how-to/python/generate_html/how%20to%20generate%20missing%20values

– mice (van Buuren and Groothuis-Oudshoorn, 2011): it allows to computemultiple imputations by chained equations and thus returns several imputeddatasets. We use the predictive mean matching method (default method)and aggregate the complete datasets using the mean of the imputations toget a simple imputation.

– missForest (Stekhoven and Bühlmann, 2012): it imputes missing valuesiteratively by training random forests.

• in Python:– IterativeImputer of scikit-learn library (Pedregosa et al., 2011): this func-

tion is inspired by mice, but it uses (iterative) regularized imputation usingconditional expectation and provides a simple imputation. We also use theExtraTreesRegressor estimator of IterativeImputer, which trains iterativerandom forests.

3. low-rank based models, if the data matrix to impute is assumed to be low rank and thesimilarities between the variables (or the observations) may inform the imputation,

• in R:– softImpute (Hastie et al., 2015): it minimizes the reweighted least squares

error penalized by the nuclear norm.

– missMDA (Josse et al., 2016): it minimizes the reweighted least squareserror penalized by a mix between the `2-norm and `0-norm.

• in Python: softImpute (coded in Python by ourselves).

4. recent methods (for the Python workflow only) using optimal transport or variationalautoencoders, variables (or the observations) may inform the imputation,

• in Python:– MIWAE (Mattei and Frellsen, 2019): it imputes missing values with a deep

latent variable model based on importance weighted variational inference.

– Sinkhorn (Muzellec et al., 2020): it extracts randomly several batches andconsists in minimizing optimal transport distances between batches to im-pute missing values.

Other methods such as GAIN (Yoon et al., 2018) which uses generative adversarial nets,have not yet been compared, as they are close to those already being compared, but willbe added in the future.

20

The metric we choose to compare the methods is the mean squared error (MSE), whichcan be calculated if the ground truth of the missing values is known. More precisely, theprocedure is the following one: (i) we have access to a complete dataset X, (ii) missingvalues are introduced in X and we get an incomplete dataset XNA, (iii) this incompletedataset is imputed and we obtain an imputed dataset X imp. The MSE for X imp is computedas follows

MSE(X imp, X) =1

nNA

∑i

∑j

1{XNAij =NA}(Ximpij −Xij)2

where nNA =∑

i

∑j 1{XNAij =NA} is the number of missing entries in X

NA. Note that this

procedure can also be performed on an incomplete dataset by introducing additional missingvalues. However, for now, both R and Python notebooks only consider complete datasets.

In R This workflow first presents the main imputation methods available in R, includingmice, missForest, softImpute and missMDA.

We compare the methods on a simulated dataset X ∈ Rn×d under the multivariateGaussian law X ∼ N (µ,Σ), with µ the mean vector and Σ the covariance matrix. Thefunction HowtoImpute compares the imputation methods by introducing missing valuesin a complete dataset (X) using different percentages of missing values (perc.list) andmissing data mechanisms (mecha.list). It returns the mean of the methods’ MSEs for thedifferent missing values settings by taking the average over nbsim repetitions. The codeto use this function is given below. For the sake of clarity, in the workflow, all the code isdetailed and commented. The output of this function and its associated plot are shown inFigure 5, when n = 1000, d = 10, µi = 1, ∀i ∈ {1, . . . , d} and Σij = 0.5 if i 6= j ∈ {1, . . . , d}and Σij = 1 if i = j. For the MCAR mechanism, the methods perform well, while for theMNAR mechanism, the results are close to those of the naive i

R-miss-tastic: a uni ed platform for missing values methods and … · 2019. 8. 15. ·...

Documents

Transcript of R-miss-tastic: a uni ed platform for missing values methods and … · 2019. 8. 15. ·...