R-miss-tastic: a uni ed platform for missing values methods and … · 2019. 8. 15. ·...

38
R-miss-tastic: a unified platform for missing values methods and workflows Imke Mayer * Aude Sportisse Julie Josse Nicholas Tierney § Nathalie Vialaneix March 25, 2021 Abstract Missing values are unavoidable when working with data. Their occurrence is exacerbated as more data from different sources become available. However, most statistical models and visualization methods require complete data, and improper handling of missing data results in information loss, or biased analyses. Since the seminal work of Rubin (1976), there has been a burgeoning literature on missing val- ues with heterogeneous aims and motivations. This has resulted in the development of various methods, formalizations, and tools (including a large number of R packages and Python modules). However, for practitioners, it remains challenging to decide which method is most suited for their problem, partially because handling missing data is still not a topic systematically covered in statistics or data science curricula. To help address this challenge, we have launched a unified platform: “R-miss- tastic”, which aims to provide an overview of standard missing values problems, methods, how to handle them in analyses, and relevant implementations of method- ologies. In the same perspective, we have also developed several pipelines in R and Python to allow for a hands-on illustration of how to handle missing values in vari- ous statistical tasks such as estimation and prediction, while ensuring reproducibility * CAMS, EHESS, Paris, France (email: [email protected]). Corresponding co-author. LPSM, Sorbonne Universit´ e, Paris, France (email: [email protected]). Corresponding co-author. Inria Sophia-Antipolis, Montpellier, France (email: [email protected]). § NUMBAT, Monash University, Victoria, Australia (email: [email protected]). MIAT, Universit´ e de Toulouse, INRA, Castanet Tolosan, France (email: [email protected]). 1 arXiv:1908.04822v2 [stat.ME] 24 Mar 2021

Transcript of R-miss-tastic: a uni ed platform for missing values methods and … · 2019. 8. 15. ·...

  • R-miss-tastic: a unified platform for missingvalues methods and workflows

    Imke Mayer∗ Aude Sportisse† Julie Josse‡ Nicholas Tierney§

    Nathalie Vialaneix¶

    March 25, 2021

    Abstract

    Missing values are unavoidable when working with data. Their occurrence isexacerbated as more data from different sources become available. However, moststatistical models and visualization methods require complete data, and improperhandling of missing data results in information loss, or biased analyses. Since theseminal work of Rubin (1976), there has been a burgeoning literature on missing val-ues with heterogeneous aims and motivations. This has resulted in the developmentof various methods, formalizations, and tools (including a large number of R packagesand Python modules). However, for practitioners, it remains challenging to decidewhich method is most suited for their problem, partially because handling missingdata is still not a topic systematically covered in statistics or data science curricula.

    To help address this challenge, we have launched a unified platform: “R-miss-tastic”, which aims to provide an overview of standard missing values problems,methods, how to handle them in analyses, and relevant implementations of method-ologies. In the same perspective, we have also developed several pipelines in R andPython to allow for a hands-on illustration of how to handle missing values in vari-ous statistical tasks such as estimation and prediction, while ensuring reproducibility

    ∗CAMS, EHESS, Paris, France (email: [email protected]). Corresponding co-author.†LPSM, Sorbonne Université, Paris, France (email: [email protected]). Corresponding co-author.‡Inria Sophia-Antipolis, Montpellier, France (email: [email protected]).§NUMBAT, Monash University, Victoria, Australia (email: [email protected]).¶MIAT, Université de Toulouse, INRA, Castanet Tolosan, France (email: [email protected]).

    1

    arX

    iv:1

    908.

    0482

    2v2

    [st

    at.M

    E]

    24

    Mar

    202

    1

  • of the analyses. This will hopefully also provide some guidance on deciding whichmethod to choose for a specific problem and data. The objective of this work is notonly to comprehensively organize materials, but also to create standardized analysisworkflows, and to provide a common ground for discussions among the community.This platform is thus suited for beginners, students, more advanced analysts andresearchers.

    Keywords: missing data; state of the art; bibliography; reproducibility; guided workflows;teaching material

    2

  • 1 Context and motivation

    Missing data are unavoidable as soon as collecting or acquiring data is involved. Theyoccur for many reasons including: individuals choose not to answer survey questions, mea-surement devices fail, or data have simply not been recorded. Their presence becomes evenmore important as data are now obtained at increasing velocity and volume, and fromheterogeneous sources not originally designed to be analyzed together. As pointed out byZhu et al. (2019), “one of the ironies of working with Big Data is that missing data play anever more significant role, and often present serious difficulties for analysis”. Despite this,the approach most commonly implemented by default in software is to toss out cases withmissing values. At best, this is inefficient because it wastes information from the partiallyobserved cases. At worst, it results in biased estimates, particularly when the distributionsof the missing values are systematically different from those of the observed values (e.g.,Enders, 2010, Chap. 2).

    However, handling missing data in a more efficient and relevant way (than limiting theanalysis on solely the complete cases) has attracted a lot of attention in the literature in thelast two decades. In particular, a number of reference books have been published (Schaferand Graham, 2002; Little and Rubin, 2002; van Buuren, 2012; Carpenter and Kenward,2012) and the topic is an active field of research (Josse and Reiter, 2018). The diversityof the problems of missing data means there is great variety in the methods proposed andstudied. They include model-based approaches, integrating likelihoods or posterior distri-butions over missing values, filling in missing values in a realistic way with single, or multipleimputations, or weighting approaches, appealing to ideas from the design-based literaturein survey sampling. The multiplicity of the available solutions makes sense because there isno single solution or tool to manage missing data: the appropriate methodology to handlethem depends on many features, such as the objective of the analysis, type of data, thetype of missing data and their pattern. Some of these methods are available in various andheterogeneous software solutions. As R (R Core Team, 2020) is one of the main softwaresfor statisticians and data scientists and as its development has started almost three decadesago (Ihaka, 1998), R is one of the language that offers the largest number of implementedapproaches. This is also due to its ease to incorporate new methods and its modular pack-aging system. Currently, there are over 270 R packages on CRAN that mention missingdata or imputation in their DESCRIPTION files. These packages serve many differentapplications, data types or types of analysis. More precisely, exploratory and visualizationtools for missing data are available in packages like naniar, VIM, and MissingDataGUI(Tierney, Cook, McBain, and Fay, Tierney et al.; Tierney and Cook, 2018; Kowarik and

    3

  • Templ, 2016; Cheng et al., 2015). Imputation methods are included in packages like mice,Amelia, and mi (van Buuren and Groothuis-Oudshoorn, 2011; Honaker et al., 2011; Gelmanand Hill, 2011). Other packages focus on dealing with complex, heterogeneous (categor-ical, quantitative, ordinal variables) data or with large dimension multi-level data, suchas missMDA, and MixedDataImpute (Josse et al., 2016; Murray and Reiter, 2015). Toour knowledge, R is the programming language offering the largest variety of implementedmethods. However, other languages such as Python (Van Rossum and Drake, 2009), whichcurrently only have few publicly available implementations of methods that handle missingvalues in statistical tasks, offer more and more solutions. Two prominent examples are:1) the scikit-learn library (Pedregosa et al., 2011) which has recently added a module formissing values imputation; and 2) the DataWig library (Biessmann et al., 2018) whichprovides a framework to learn to impute incomplete data tables.

    The rich variety of methods and tools for working with missing data means there aremany solutions for a variety of applications. Despite the number of options, missing dataare often not handled appropriately by practitioners. This may be for a few reasons. First,the plethora of options can be a double-edged sword - the sheer number of options makingit challenging to navigate and find the best one. Second, the topic of missing data isoften itself missing from many statistics and data science syllabuses, despite its relativeomnipresence in data. So then, faced with missing data, practitioners are left powerless:quite possibly never having been taught about missing data, they do not have an ideaof how to approach the problem, what are the dangers of mismanagement, navigate themethods, software, or choose the appropriate method or workflow for their problem.

    To help promote better management and understanding of missing data, we have re-leased R-miss-tastic, an open platform for missing values. The platform takes the form ofa reference website1, which collects, organizes and produces material on missing data. Ithas been conceived by an infrastructure steering committee (ISC; its members are authorsof this article) working group, which first provided a CRAN Task View2 on missing data3 that lists and organizes existing R packages on the topic. The “R-miss-tastic” platformextends and builds on the CRAN Task View by collecting and organizing articles, tutorials,documentation, and workflows for analyses with missing data. The platform provides newtutorials, examples and pipelines of analyses we have developed with missing data thatspan the entirety of an analysis. The latter have been developed by us and implemented inR and in Python for this platform, implementing standard methods for generating missing

    1https://rmisstastic.netlify.com/2https://CRAN.R-project.org/package=ctv3https://cran.r-project.org/web/views/MissingData.html

    4

    https://rmisstastic.netlify.com/https://CRAN.R-project.org/package=ctvhttps://cran.r-project.org/web/views/MissingData.html

  • values and analyzing them under different perspectives. Starting from data preparation,these pipelines also cover exploratory data analyses, establishing statistical models, analysisdiagnostics, applying machine learning methods, through to communication of the resultsobtained from incomplete data. We hope that these pipelines also serve as guidance whenit comes to choosing a method to handle missing values in a specific context.

    Another important ingredient in statistical analyses, some may even argue the mostimportant one, is data which a proposed method or estimator is apply to. We thus alsoreference publicly available datasets that are commonly used as benchmark for new missingvalues methodologies. It is easily extendable and well documented, so it can seamlesslyincorporate future research in missing values. The intent of the platform is to foster awelcoming community, within and beyond the R community . “R-miss-tastic” has beendesigned to be accessible for a wide audience with different levels of prior knowledge, needs,and questions. This includes, for instance, students looking for course material to com-plement their studies, teachers and professors who can use a reference website for theirown classes or refer to students, statisticians or researchers in a different fields using statis-tics searching for solutions or existing work to help with analysis, researchers wishing tounderstand or contribute information for specific research questions, or find collaborators.

    The remainder of the article is organized as follows: Section 2 describes the differentcomponents of the platform, the structure that has been chosen, and the targeted audi-ence. The section is organized as the platform itself, starting by describing materials forless advanced users then materials for researchers and finally resources for practical imple-mentation. Section 3 details the implementation and use-cases of the provided workflows,implemented both in R and in Python. Finally, in Section 4 we conclude with an overviewof the planed future development for the platform and of interesting areas in missing valuesresearch that we would like to bring to a broader audience.

    2 Structure and content of the platform

    The R-miss-tastic platform is released at https://rmisstastic.netlify.com/. It hasbeen developed using the R package blogdown Xie et al. (2017) which wraps hugo4. Liveexamples have been included using the tool https://rdrr.io/snippets/ provided bythe website “R Package Documentation”. The source code and materials of the platformhave been made publicly available on GitHub5, which provides a transparent record of the

    4https://gohugo.io/5repository R-miss-tastic/website

    5

    https://rmisstastic.netlify.com/https://rdrr.io/snippets/https://gohugo.io/

  • platform’s development, and facilitates community contributions.We now discuss the structure of the R-miss-tastic platform, the aim and content of each

    subsection, and highlight key features of the platform.

    2.1 Workflows

    An important contribution and novelty of this work is the proposal of several workflowsthat allow for a hands-on illustration of classical analyses with missing values, both onsimulated data and on publicly available real-world data. These workflows are providedboth in R and in Python code and cover the following topics:

    • How to generate missing values? Generate missing values under different mecha-nisms, on complete or incomplete datasets. This is useful when performing simula-tions to compare methods.

    • How to do statistical inference with missing values? We focus in particular in thedifferent solutions (maximum likelihood or multiple imputation) that are available toestimate linear and logistic regression parameters with missing values in the covari-ates.

    • How to impute missing values? We compare different single imputation/matrix com-pletion methods (for instance using conditional models, low-rank models, etc.)

    • How to predict with missing values? We consider establishing predictive models (forinstance using random forests (Breiman, 2001)) on data with incomplete predictors.The workflows present different strategies to deal with the missing values in thecovariates both in the train set and in the test set.

    The aim of these workflows is threefold: 1) they provide a practical implementation ofconcepts and methods discussed in the lectures and bibliography sections (see Sections 2.2and 2.3); 2) they are coded in a generic way, allowing for simple re-use on other datasets, forintegration of other estimation or imputation methods; 3) the distinction between inference,imputation, and prediction lets the user keep in mind that the solutions are not the samein these cases.

    Furthermore, they allow for a transparent and open discussion about the proposedimplementations which can be followed on the project GitHub repository6.

    6https://github.com/R-miss-tastic/website

    6

    https://github.com/R-miss-tastic/website

  • We provide a more detailed view on the proposed workflows in Section 3, giving codeexamples and their corresponding tabular or graphical outputs as well as recommendationson how to interpret and leverage these outputs.

    2.2 Lectures

    Before starting to use any of the existing implementations for handling missing values in astatistical analysis, it is essential to understand why missing values are problematic. Thereare many lenses to view missing data through: the source of the missing values, theirpotential meaning, the relevance of the features they occur in – and the implications fordifferent types of analyses. For someone unfamiliar with missing data, it is a challengeto know where to begin the journey of understanding them, and the methods to addressthem. This challenge is being addressed with “R-miss-tastic”, which makes the materialto get started easily accessible.

    Teaching and workshop material takes many forms – from slides, course notes, labworkshops, video tutorials and in-depth seminars. The material is of high quality, andhas been generously contributed by numerous renowned researchers who investigate theproblems of missing values, many of whom are professors having designed introductoryand advanced classes for statistical analysis with missing data. This makes the material onthe “R-miss-tastic” platform well suited for both beginners and more experienced users.

    These teaching and workshop materials are described as “lectures”, and are organizedinto five sections:

    1. General lectures: introduction to statistical analyses with missing values, theoryand concepts are covered, such as missing values mechanism, likelihood methods,imputation.

    2. Multiple imputation: introduction to popular methods of multiple imputation (jointmodeling and fully conditional), how to correctly perform multiple imputation andlimits of imputation methods.

    3. Principal component methods: introduction to methods exploiting low-rank typestructures in the data for visualization, imputation and estimation

    4. Specific data or applications types: lectures covering in detail various sub-problemssuch as missing values in time series, in surveys, or in treatment effect estimation. In-deed, certain data types require adaptations of standard missing values methods (for

    7

  • instance handling the time dependence in time series (Moritz and Bartz-Beielstein,2017)) or additional assumptions about the impact of missing values (such as theimpact on confounded treatment effects in the causal inference context (Mayer et al.,2020)). But also more in-depth material, for instance video recordings from a virtualworkshop on Missing Data Challenges in Computation, Statistics and Applications7

    held in 2020.

    5. Implementations: a non-exhaustive list of detailed vignettes describing functionalitiesof R packages and of Python modules that implement some of the statistical analysismethods covered in the other lectures. For instance the functionalities and possibleapplications of the missMDA R package are presented in a succinct summary, allowingthe reader to compare the main differences between this package and the mice packagewhich is also summarized using the same summary format.

    Figure 1 is a screenshot of two views of the lectures page: Figure 1A shows a collapsedview presenting the different topics, Figure 1B shows an example of the expanded view ofone topic (General tutorials), with a detailed description of one of the lecture (obtainedby clicking on its title), “Analysis of missing values” by Jae-Kwang Kim. Each lecture cancontain several documents (as is the case for this one) and is briefly described by a headerpresenting its purpose.

    Lectures that we found very complete and thus recommend are:

    • Statistical Methods for Analysis with Missing Data by Mauricio Sadinle (in “Generaltutorials”)

    • Missing Values in Clinical Research – Multiple Imputation by Nicole Erler (in “Mul-tiple imputation”)

    • Handling missing values in PCA and MCA by François Husson. (in “Missing valuesand principal component methods”)

    2.3 Bibliography

    Complementary to the lectures section, this part of the platform serves as a broad overviewon the scientific literature discussing missing values taxonomies and mechanisms and sta-tistical, machine learning methods to handle them. This overview covers both classical

    7https://www.ias.edu/math/mdccsa

    8

    https://www.ias.edu/math/mdccsa

  • Home

    Workflows

    Lectures

    Bibliography

    R packages

    Data

    People

    More info & links

    Contact

    R-miss-tastic

    A resource website on missing values - Methods and references for managing missing data

    Website template created by @mdo, ported to Hugo by @mralanorth. Website proudly powered by Blogdown for R.

    Collapse All

    Collapse All

    Below you will find a selection of high-quality lectures, tutorials and labs on different

    aspects of missing values.

    Ifyouwishtocontributesomeofyourownmaterialtothisplatform,pleasefeelfreeto

    contactusviatheContactform.

    General lectures

    Multiple imputation

    Missing values and principal component methods

    Specific data or application types

    Implementation in R

    ++

    ++

    +

    About

    This website is proudly

    sponsored by R Consortium

    and maintained by Julie

    Josse, Imke Mayer, Nicholas

    Tierney and Nathalie

    Vialaneix.

    Read more →

    FAQ →

    Follow us!

    Events

    GitHub

    (a) Collapsed view

    Home

    Workflows

    Bibliography

    Lectures

    R packages

    Data

    People

    More info & links

    Contact

    R-miss-tastic

    A resource website on missing values - Methods and references for managing missing data

    Website template created by @mdo, ported to Hugo by @mralanorth. Website proudly powered by Blogdown for R.

    Back to top

    Collapse All

    Collapse All

    Below you will find a selection of high-quality lectures and labs on different aspects of

    missing values.

    Ifyouwishtocontributesomeofyourownmaterialtothisplatform,pleasefeelfreeto

    contactusviatheContactform.

    General tutorials

    Multiple imputation

    Missing values and principal component methods

    Specific data or application types

    Implementation in R

    +

    Statistical Methods for Analysis With Missing Data

    (Marie Davidian, course at NC State University, spring 2017)

    Handling missing values

    (Julie Josse, course at École Polytechnique, fall 2018 and Julie Josse & Nick Tierney, tutorial at useR!2018, 2018)

    This course focuses on the theory and methods for missing data analysis. Topics include maximum likelihood estimation under

    missing data, EM algorithm, Monte Carlo computation techniques, imputation, Bayesian approach, propensity scores, semi-

    parametric approach, and non-ignorable missing data.

    Introduction

    Likelihood-based approach

    Computation

    Imputation

    Multiple imputation

    Propensity Score approach

    Nonignorable missing data

    Analysis of missing values

    (Jae-Kwang Kim, course at Iowa State University, fall 2015)

    Statistical Methods for Analysis with Missing Data

    (Mauricio Sadinle, course at University of Washington, winter 2019)

    ++

    ++

    About

    This website is proudly

    sponsored by R Consortium

    and maintained by Julie

    Josse, Imke Mayer, Nicholas

    Tierney and Nathalie

    Vialaneix.

    Read more →

    FAQ →

    Follow us!

    Events

    GitHub

    (b) Extract

    Figure 1: Lectures overview.

    references with books, articles, etc. such as Schafer and Graham (2002); Little and Rubin(2002); van Buuren (2012); Carpenter and Kenward (2012) and more recent developmentssuch as Josse et al. (2019); Gondara and Wang (2018) who study the consistency of su-pervised learning with missing values. The entire (non-exhaustive) bibliography can bebrowsed in two ways: 1) a complete list, filtering by publication type and year, with asearch option for the authors or 2), as a contextualized version. For 2), we classified thereferences into several domains of research or application, briefly discussing important as-pects of each domain. This double representation is shown in Figure 2 and allows anextensive search in the existing literature, while providing some guidance for those focusedon a specific topic. All references are also collected in a unique BibTex file made avail-able on the GitHub repository8. This centralized file allows external users to easily proposeadditions to the bibliography which are then reviewed by the platform editorial and mainte-nance committee, composed of researchers with different focuses on the handling of missingvalues.

    8in resources/rmisstastic_biblio.bib

    9

    resources/rmisstastic_biblio.bib

  • Home

    Workflows

    Bibliography

    Lectures

    R packages

    Data

    People

    More info & links

    Contact

    R-miss-tastic

    A resource website on missing values - Methods and references for managing missing data

    Website template created by @mdo, ported to Hugo by @mralanorth. Website proudly powered by Blogdown for R.

    Back to top

    Collapse All

    Collapse All

    On this platform we attempt to give you an overview of main references on missing

    values. We do not claim to gather all available references on the subject but rather to

    offer a peak into different fields of active research on handling missing values, allowing

    for an introductory reading as well as a starting point for further bibliographical research.

    See here for a full (and uncommented) list of references.

    Inspired by CRAN Task View on Missing Data and a review of Imbert & Villa-Vialaneix on

    handling missing values (2018, written in French) we organized our selection of relevant

    references on missing values by different topics.

    Short introduction to missing values

    General references and reviews

    Weighting methods

    Hot-deck and kNN approaches

    Likelihood-based approaches

    Single imputation

    Multiple imputation

    Machine Learning

    Missing values mechanisms

    Share

    ++

    ++

    ++

    ++

    +

    About

    This website is proudly

    sponsored by R Consortium

    and maintained by Julie

    Josse, Imke Mayer, Nicholas

    Tierney and Nathalie

    Vialaneix.

    Read more →

    FAQ →

    Follow us!

    Events

    GitHub

    Processing math: 100%

    (a) Contextualized version

    Home

    Workflows

    Bibliography

    Lectures

    R packages

    Data

    People

    More info & links

    Contact

    R-miss-tastic

    A resource website on missing values - Methods and references for managing missing data

    Website template created by @mdo, ported to Hugo by @mralanorth. Website proudly powered by Blogdown for R.

    Back to top

    A commented version of this bibliography can be found here.

    Publication type

    All

    Year

    All

    Author

    Search by author name..

    Share

    Citation Year

    Publication

    type

    Abayomi,K.,A.Gelman,andM.Levy.Diagnosticsformultivariateimputations.In:JournaloftheRoyalStatistical

    Society,SeriesC(AppliedStatistics)57.3(2008),pp.273-291.

    DOI

    2008 Article

    Albert,P.S.andD.A.Follmann.Modelingrepeatedcountdatasubjecttoinformativedropout.In:Biometrics

    56.3(2000),pp.667-677.

    DOI

    2000 Article

    Allison,P.D.MissingData.QuantitativeApplicationsintheSocialSciences.ThousandOaks,CA,USA:Sage

    Publications,2001.ISBN:9780761916727.

    DOI

    2001 Book

    Andridge,R.andR.J.A.Little.Areviewofhotdeckimputationforsurveynon-response.In:International

    StatisticalReview78.1(2010),pp.40-64.

    DOI

    2010 Article

    Audigier,V.,F.Husson,andJ.Josse.Aprincipalcomponentmethodtoimputemissingvaluesformixeddata.

    In:AdvancesinDataAnalysisandClassification10.1(2016),pp.5-26.

    DOI

    2016 Article

    Audigier,V.,F.Husson,andJ.Josse.MIMCA:multipleimputationforcategoricalvariableswithmultiple

    correspondenceanalysis.In:StatisticsandComputing27.2(2016),pp.1-18.eprint:1505.08116.

    DOI

    2016 Article

    Audigier,V.,F.Husson,andJ.Josse.MultipleimputationforcontinuousvariablesusingaBayesianprincipal

    componentanalysis.In:JournalofStatisticalComputationandSimulation86.11(2015),pp.2140-2156.

    DOI

    2015 Article

    Bang,H.andJ.M.Robins.Doublyrobustestimationinmissingdataandcausalinferencemodels.In:Biometrics

    61.4(2005),pp.962-973.

    DOI

    2005 Article

    Baraldi,A.N.andC.K.Enders.Anintroductiontomodernmissingdataanalysis.In:JournalofSchool

    Psychology48.1(2010),pp.5-37.

    DOI

    2010 Article

    Baretta,L.andA.Santaniello.Nearestneighborimputationalgorithms:acriticalevaluation.In:BMCMedical

    InformaticsandDecisionMaking.Proceedingsofthe5thTranslationalBioinformaticsConference(TBC2015):

    medicalinformaticsanddecisionmaking16.Supp.3(2016),p.74.

    DOI

    2016 Article

    Bartlett,J.W.,O.Harel,andJ.R.Carpenter.Asymptoticallyunbiasedestimationofexposureoddsratiosin

    completerecordslogisticregression.In:Americanjournalofepidemiology182.8(2015),pp.730–736.

    DOI

    2015 Article

    Bengio,Y.andF.Gingras.Recurrentneuralnetworksformissingorasynchronousdata.In:Proceedingsofthe

    8thInternationalConferenceonNeuralInformationProcessingSystems.(Nov.27,1995-Dec.02,1995).Ed.by-.

    Cambridge,MA,USA:MITPress,1995,pp.395-401.

    URL

    1995 Paper

    Biessmann,F.,D.Salinas,S.Schelter,etal.“Deep”LearningforMissingValueImputationinTableswithNon-

    NumericalData.In:Proceedingsofthe27thACMInternationalConferenceonInformationandKnowledge

    Management.Ed.by-.CIKM’18.Torino,Italy:ACM,2018,pp.2017–2025.ISBN:978-1-4503-6014-2.

    DOI URL

    2018 Paper

    Blake,H.A.,C.Leyrat,K.Mansfield,etal.Propensityscoresusingmissingnesspatterninformation:apractical

    guide.In:arXivpreprint(2019).arXiv:1901.03981[stat.ME].

    URL

    2019 Article

    Buck,S.F.Amethodofestimationofmissingvaluesinmultivariatedatasuitableforusewithanelectronic

    computer.In:JournaloftheRoyalStatisticalSociety,SeriesB22(1960),pp.302-306.

    DOI

    1960 Article

    Burns,R.M.Multipleandreplicateitemimputationinacomplexsamplesurvey.In:Proceedingsofthe6th

    AnnualResearchConference.Ed.byB.oftheCensus.WashingtonDC,USA,1990,pp.655-665.

    1990 Paper

    Buuren,S.van.FlexibleImputationofMissingData.BocaRaton,FL:ChapmanandHall/CRC,2018.

    URL

    2018 Book

    Buuren,S.van.Multipleimputationofdiscreteandcontinuousdatabyfullyconditionalspecification.In:

    StatisticalMethodsinMedicalResearch16(2007),pp.219-242.

    DOI

    2007 Article

    Buuren,S.van,J.P.L.Brand,C.G.M.Groothuis-Oudshoorn,etal.Fullyconditionalspecificationinmultivariate

    imputation.In:JournalofStatisticalComputationandSimulation76.12(2006),pp.1049-1064.

    DOI

    2006 Article

    Buuren,S.vanandK.Groothuis-Oudshoorn.MICE:multivariateimputationbychainedequationsinR.In:

    JournalofStatisticalSoftware45(2011),p.3.eprint:NIHMS150003.

    DOI

    2011 Article

    Candès,E.J.,C.A.Sing-Long,andJ.D.Trzasko.Unbiasedriskestimatesforsingularvaluethresholdingand

    spectralestimators.In:IEEETransactionsonSignalProcessing61.19(2013),pp.4643-4657.

    DOI

    2013 Article

    Carpenter,J.R.,M.G.Kenward,andS.Vansteelandt.Acomparisonofmultipleimputationanddoublyrobust

    estimationforanalyseswithmissingdata.In:JournaloftheRoyalStatisticalSociety:SeriesA(Statisticsin

    Society)169.3(2006),pp.571–584.

    DOI

    2006 Article

    Carpenter,J.andM.Kenward.MultipleImputationanditsApplication.Chichester,WestSussex,UK:Wiley,

    2013.ISBN:9780470740521.

    DOI

    2013 Book

    Chen,T.andC.Guestrin.XGBoost:AScalableTreeBoostingSystem.In:Proceedingsofthe22ndACM

    SIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining.(Aug.13,2016-Aug.17,2016).Ed.

    by-.NewYork,NY,USA:ACM,2016,pp.785-794.ISBN:0450342322.

    DOI

    2016 Paper

    Chen,J.andJ.Shao.Nearestneighborimputationforsurveydata.In:JournalofOfficialStatistics16.2(2000),

    pp.113-131.

    URL

    2000 Article

    Collins,L.M.,J.L.Schafer,andK.Chi-Ming.Acomparisonofinclusiveandrestrictivestrategiesinmodern

    missingdataprocedures.In:PsychologicalMethods6.4(2007),pp.330-351.

    DOI

    2007 Article

    Cranmer,S.J.andJ.Gill.Wehavetobediscreteaboutthis:anon-parametricimputationtechniqueformissing

    categoricaldata.In:BritishJournalofPoliticalScience43(2012),pp.425-449.

    DOI

    2012 Article

    Crookston,N.L.andA.O.Finley.yaImpute:anRpackageforkNNimputation.In:JournalofStatisticalSoftware

    23(2008),p.10.

    DOI

    2008 Article

    Dax,A.ImputingMissingEntriesofaDataMatrix:Areview.In:JournalofAdvancedComputing3.3(2014),

    pp.98-222.

    DOI

    2014 Article

    Dempster,A.P.,N.M.Laird,andD.B.Rubin.MaximumlikelihoodfromincompletedataviatheEMalgorithm.

    In:JournaloftheRoyalStatisticalSociety,SeriesB(Methodological)39.1(1977),pp.1-38.

    URL

    1977 Article

    Diggle,P.andM.G.Kenward.Informativedrop-outinlongitudinaldataanalysis.In:JournaloftheRoyal

    StatisticalSociety,SeriesC(AppliedStatistics)43.1(1994),pp.49-93.

    DOI

    1994 Article

    Ding,P.andF.Li.CausalInference:AMissingDataPerspective.In:StatisticalScience33.2(2018),pp.214–237.

    DOI

    2018 Article

    Ding,Y.andJ.S.Simonoff.Aninvestigationofmissingdatamethodsforclassificationtreesappliedtobinary

    responsedata.In:JournalofMachineLearningResearch11.1(2010),pp.131-170.

    URL

    2010 Article

    Dong,Y.andC.J.Peng.Principledmissingdatamethodsforresearchers.In:SpringerPlus2(2013),p.222.

    DOI

    2013 Article

    Enders,C.K.AppliedMissingDataAnalysis.GuilfordPress,2010,p.401.ISBN:9781606236390. 2010 Book

    Enders,C.K.Aprimeronmaximumlikelihoodalgorithmsavailableforusewithmissingdata.In:Structural

    EquationModeling8.1(2001),pp.128-141.

    DOI

    2001 Article

    Fang,F.,J.Zhao,andJ.Shao.Imputation-basedadjustedscoreequationsingeneralizedlinearmodelswith

    nonignorablemissingcovariatevalues.In:StatisticaSinica28.4(2018),pp.1677–1701.

    DOI

    2018 Article

    Fay,R.E.Alternativeparadigmsfortheanalysisofimputedsurveydata.In:JournaloftheAmericanStatistical

    Association91.434(1996),pp.490-498.

    DOI

    1996 Article

    Fellegi,I.P.andD.Holt.Asystematicapproachtoautomaticeditandimputation.In:JournaloftheAmerican

    StatisticalAssociation71.353(1976),pp.17-35.

    DOI

    1976 Article

    Ferrari,P.A.,P.Annoni,A.Barbiero,etal.Animputationmethodforcategoricalvariableswithapplicationto

    nonlinearprincipalcomponentanalysis.In:ComputationalStatistics&DataAnalysis55.7(2011),pp.2410-2420.

    DOI

    2011 Article

    Finkbeiner,C.Estimationforthemultiplefactormodelwhendataaremissing.In:Psychometrika44.4(1979),

    pp.409-420.

    DOI

    1979 Article

    Fitzmaurice,G.M.,G.Molenberghs,andS.R.Lipsitz.RegressionModelsforLongitudinalBinaryResponses

    withInformativeDrop-Outs.In:JournaloftheRoyalStatisticalSociety.SeriesB(Methodological)57.4(1995),

    pp.691–704.

    URL

    1995 Article

    Follmann,D.andM.Wu.Anapproximategeneralizedlinearmodelwithrandomeffectsforinformativemissing

    data.In:Biometrics51.1(1995),pp.151-168.

    DOI

    1995 Article

    Gad,A.M.andN.M.M.Darwish.Asharedparametermodelforlongitudinaldatawithmissingvalues.In:

    AmericanJournalofAppliedMathematicsandStatistics1.2(2013),pp.30-35.

    URL

    2013 Article

    Gelman,A.,G.King,andC.Liu.Notaskedandnotanswered:Multipleimputationformultiplesurveys.In:

    JournaloftheAmericanStatisticalAssociation93.443(1998),pp.846–857.

    DOI

    1998 Article

    Gelman,A.,I.vanMechelen,G.Verbeke,etal.MultipleImputationforModelChecking:Completed-DataPlots

    withMissingandLatentData.In:Biometrics61.1(2005),pp.74–85.

    DOI

    2005 Article

    Gondara,L.andK.Wang.MIDA:MultipleImputationusingDenoisingAutoencoders.In:Proceedingsofthe

    22ndPacific-AsiaConferenceonKnowledgeDiscoveryandDataMining(PAKDD2018).(Jun.03,2018-Jun.06,

    2018).Ed.byD.Phung,V.Tseng,G.Webb,B.Ho,M.GanjiandL.Rashidi.LectureNotesinComputerScience.

    SpringerInternationalPublishing,2018,pp.260-272.ISBN:3319930404.

    DOI URL

    2018 Paper

    Goodfellow,I.,M.Mirza,A.Courville,etal.Multi-PredictionDeepBoltzmannMachines.In:Proceedingsofthe

    26thInternationalConferenceonNeuralInformationProcessingSystems.(Dec.05,2013-Dec.10,2013).Ed.by

    C.Burges,L.Bottou,M.Welling,Z.GhahramaniandK.Weinberger.AdvancesinNeuralInformationProcessing

    Systems26.CurranAssociates,Inc.,2013,pp.548–556.

    URL

    2013 Paper

    Graham,J.W.Missingdataanalysis:makingitworkintherealworld.In:AnnualReviewofPsychology60(2009),

    pp.549-576.

    DOI

    2009 Article

    Graham,J.W.,S.M.Hofer,S.I.Donaldson,etal.TheScienceofPrevention:MethodologicalAdvancesfrom

    AlcoholandSubstanceAbuseResearch.In:TheScienceofPrevention:MethodologicalAdvancesfromAlcohol

    andSubstanceAbuseResearch.Ed.byK.Bryant,M.WindleandS.West.Washington,DC,USA:American

    PsychologicalAssociation,1997.Chap.Analysiswithmissingdatainpreventionresearch,pp.325-366.ISBN:1-

    55798-439-5.

    DOI

    1997 Book

    Graham,J.W.,A.E.Olchowski,andT.E.Gilreath.Howmanyimputationsarereallyneeded?Somepractical

    clarificationsofmultipleimputationtheory.In:PreventionScience8.3(2007),pp.206-213.

    DOI

    2007 Article

    Heckman,J.Sampleselectionbiasasaspecificationerror.In:Econometrica47.1(1979),pp.153-161.

    DOI

    1979 Article

    Heckman,J.J.Thecommonstructureofstatisticalmodelsoftruncation,sampleselectionandlimited

    dependentvariablesandasimpleestimatorforsuchmodels.In:AnnalsofEconomicandSocialMeasurement

    5.4(1976),pp.475-492.

    URL

    1976 Article

    Hogan,J.W.andN.M.Laird.Mixturemodelsforthejointdistributionofrepeatedmeasuresandeventtimes.In:

    StatisticsinMedecine16.1-3(1997),pp.239-257.

    1997 Article

    Hogan,J.W.andT.Lancaster.Instrumentalvariablesandinverseprobabilityweightingforcausalinferencefrom

    longitudinalobservationalstudies.In:StatisticalMethodsinMedicalResearch13.1(2004),pp.17-48.

    DOI

    2004 Article

    Honaker,J.,G.King,andM.Blackwell.AmeliaII:aprogramformissingdata.In:JournalofStatisticalSoftware

    45.7(2011).eprint:arXiv:1501.0228.

    DOI

    2011 Article

    Hothorn,T.,K.Hornik,andA.Zeileis.UnbiasedRecursivePartitioning:AConditionalInferenceFramework.In:

    JournalofComputationalandGraphicalStatistics15.3(2012),pp.651-674.

    DOI

    2012 Article

    Horton,N.J.andK.P.Kleinman.MuchAdoAboutNothing-AComparisonofMissingDataMethodsand

    SoftwaretoFitIncompleteDataRegressionModels.In:TheAmericanStatistician61.1(2017),pp.79-90.

    DOI

    2017 Article

    Huisman,M.Imputationofmissingitemresponses:somesimpletechniques.In:Quality&Quantity34.4(2000),

    pp.331-351.

    DOI

    2000 Article

    Husson,F.andJ.Josse.Handlingmissingvaluesinmultiplefactoranalysis.In:FoodQualityandPreference30

    (2013),pp.77-85.

    DOI

    2013 Article

    Ibrahim,J.G.,S.R.Lipsitz,andM.Chen.MissingCovariatesinGeneralizedLinearModelsWhentheMissing

    DataMechanismisNon-Ignorable.In:JournaloftheRoyalStatisticalSociety.SeriesB(StatisticalMethodology)

    61.1(1999),pp.173-190.

    DOI URL

    1999 Article

    Ibrahim,J.G.,M.Chen,andS.R.Lipsitz.Missingresponsesingeneralisedlinearmixedmodelswhenthe

    missingdatamechanismisnonignorable.In:Biometrika88.2(2001),pp.551-564.

    DOI

    2001 Article

    Ilin,A.andT.Raiko.PracticalapproachestoPrincipalComponentAnalysisinthepresenceofmissingvalues.In:

    JournalofMachineLearningResearch11(2010),pp.1957-2000.

    URL

    2010 Article

    Imbert,A.,A.Valsesia,C.LeGall,etal.Multiplehot-deckimputationfornetworkinferencefromRNA

    sequencingdata.In:Bioinformatics34.10(2018),pp.1726-1732.

    DOI

    2018 Article

    Jönsson,P.andC.Wohlin.Anevaluationofk-nearestneighbourimputationusinglIkertdata.In:Proceedingsof

    the10thInternationalSymposiumonSoftwareMetrics.(Sep.14,2004-Sep.16,2004).Ed.by-.Chicago,IL,

    USA:IEEE,2004,pp.1530-1435.ISBN:0769521290.

    DOI

    2004 Paper

    Jamshidian,M.andS.Jalal.Testsofhomoscedasticity,normality,andmissingcompletelyatrandomfor

    incompletemultivariatedata.In:Psychometrika75.4(2010),pp.649-674.eprint:NIHMS150003.

    DOI

    2010 Article

    Jamshidian,M.,S.Jalal,andC.Jansen.MissMech:anRpackagefortestinghomoscedasticity,multivariate

    normality,andmissingcompletelyatrandom(MCAR).In:JournalofStatisticalSoftware56.6(2014),pp.1-31.

    DOI

    2014 Article

    Jiang,W.,J.Josse,andM.Lavielle.LogisticRegressionwithMissingCovariates–ParameterEstimation,Model

    SelectionandPrediction.In:arXivpreprint(2018).arXiv:1805.04602[stat.ME].

    2018 Article

    Joenssen,D.W.andU.Bankhofer.Donorlimitedhotdeckimputation:effectonparameterestimation.In:

    JournalofTheoreticalandAppliedComputerScience6.3(2012),pp.58-70.

    URL

    2012 Article

    Jones,M.P.IndicatorandStratificationMethodsforMissingExplanatoryVariablesinMultipleLinearRegression.

    In:JournaloftheAmericanStatisticalAssociation91.433(1996),pp.222-230.

    DOI

    1996 Article

    Josse,J.,M.Chavent,B.Liquet,etal.Handlingmissingvalueswithregularizediterativemultiple

    correspondanceanalysis.In:JournalofClassification29.1(2012),pp.91-116.

    DOI

    2012 Article

    Josse,J.andF.Husson.missMDA:apackageforhandlingmissingvaluesinmultivariatedataanalysis.In:

    JournalofStatisticalSoftware70.1(2016),pp.1-31.

    DOI

    2016 Article

    Josse,J.andF.Husson.Handlingmissingvaluesinexploratorymultivariatedataanalysismethods.In:Journal

    delaSociétéFrançaisedeStatistique153.2(2012),pp.79-99.

    URL

    2012 Article

    Josse,J.,F.Husson,andJ.Pagès.GestiondesdonnéesmanquantesenAnalyseenComposantesPrincipales.

    In:JournaldelaSociétéFrançaisedeStatistique150.2(2009),pp.28-51.

    URL

    2009 Article

    Josse,J.,J.Pagès,andF.Husson.Multipleimputationinprincipalcomponentanalysis.In:AdvancesinData

    AnalysisandClassification5.3(2011),pp.231-246.

    DOI

    2011 Article

    Josse,J.,N.Prost,E.Scornet,etal.Ontheconsistencyofsupervisedlearningwithmissingvalues.In:arXiv

    preprint(2019).arXiv:1902.06931[stat.ML].

    URL

    2019 Article

    Kaiser,J.Dealingwithmissingvaluesindata.In:JournalofSystemsIntegration5.1(2014),pp.42-51.

    DOI

    2014 Article

    Kallus,N.,X.Mao,andM.Udell.CausalInferencewithNoisyandMissingCovariatesviaMatrixFactorization.In:

    AdvancesinNeuralInformationProcessingSystems.Ed.by-.2018.eprint:1806.00811.

    URL

    2018 Paper

    Kalton,G.andD.Kasprzyk.Thetreatmentofmissingsurveydata.In:SurveyMethodology12.1(1986),pp.1-16.

    URL

    1986 Article

    Kapelner,A.andJ.Bleich.PredictionwithmissingdataviaBayesianadditiveregressiontrees.In:Canadian

    JournalofStatistics43.2(2015),pp.224-239.

    DOI URL

    2015 Article

    Kim,J.K.andJ.Shao.StatisticalMethodsforHandlingIncompleteData.BocaRaton,FL,USA:Chapmanand

    Hall/CRC,2013.ISBN:9781482205077.

    2013 Book

    Kohn,R.andC.F.Ansley.Estimation,prediction,andinterpolationforARIMAmodelswithmissingdata.In:

    JournaloftheAmericanStatisticalAssociation81.395(1986),pp.751-761.

    DOI

    1986 Article

    Kowarik,A.andM.Templ.ImputationwiththeRPackageVIM.In:JournalofStatisticalSoftware74.7(2016),

    pp.1-16.

    DOI

    2016 Article

    Kropko,J.,B.Goodrich,A.Gelman,etal.MultipleImputationforContinuousandCategoricalData:Comparing

    JointMultivariateNormalandConditionalApproaches.In:PoliticalAnalysis22.4(2014),pp.497–519.

    DOI

    2014 Article

    Lee,K.M.,R.Mitra,andS.Biedermann.Optimaldesignwhenoutcomevaluesarenotmissingatrandom.In:

    StatisticaSinica28.4(2018),pp.1821–1838.

    DOI

    2018 Article

    Little,R.J.A.Modelingthedrop-outmechanisminrepeated-measuresstudies.In:JournaloftheAmerican

    StatisticalAssociation90.431(1995),pp.1112-1121.

    DOI

    1995 Article

    Little,R.J.A.Pattern-mixturemodelsformultivariateincompletedata.In:JournaloftheAmericanStatistical

    Association88.421(1993),pp.125-134.

    DOI

    1993 Article

    Little,R.J.A.Atestofmissingcompletelyatrandomformultivariatedatawithmissingvalues.In:Journalofthe

    AmericanStatisticalAssociation83.404(1988),pp.1198-1202.

    DOI

    1988 Article

    Little,R.J.A.andD.B.Rubin.StatisticalAnalysiswithMissingData.Wiley,2002,p.408.ISBN:0471183865.

    DOI

    2002 Book

    Little,R.J.A.RegressionwithmissingX’s:areview.In:JournaloftheAmericanStatisticalAssociation87.420

    (1992),pp.1227-1237.

    DOI

    1992 Article

    Louis,T.A.FindingtheObservedInformationMatrixwhenUsingtheEMAlgorithm.In:JournaloftheRoyal

    StatisticalSociety.SeriesB(Methodological)44.2(1982),pp.226–233.

    URL

    1982 Article

    McLachlan,G.J.andT.Krishnan.TheEMAlgorithmandExtensions.Wileyseriesinprobabilityandstatistics.

    Hoboken,NJ,USA:Wiley,2008.ISBN:9780471201700.

    2008 Book

    Meng,S.L.andD.B.Rubin.MaximumlikelihoodestimationviatheECMalgorithm:ageneralframework.In:

    Biometrika80.2(1993),pp.267-278.

    DOI

    1993 Article

    Meng,X.L.andD.B.Rubin.UsingEMtoobtainasymptoticvariance-covariancematrices:theSEMalgorithm.

    In:JournaloftheAmericanStatisticalAssociation86.416(1991),pp.899-909.

    DOI

    1991 Article

    Meng,X.L.UYouwantmetoanalyzedataIdon’thave?Areyouinsane?In:ShanghaiArchivesofPsychiatry

    24.5(2012),pp.287-301.

    DOI URL

    2012 Article

    Miao,W.andE.J.TchetgenTchetgen.Identificationandinferencewithnonignorablemissingcovariatedata.In:

    StatisticaSinica28.4(2018),pp.2049–2067.

    DOI

    2018 Article

    Moeur,M.andA.R.Stage.Mostsimilarneighbor:animprovedsamplinginferenceprocedurefornatural

    resourcesplanning.In:ForestScience42.1(1995),pp.337-359.

    DOI

    1995 Article

    Molenberghs,G.,G.Fitzmaurice,M.G.Kenward,etal.HandbookofMissingDataMethodology.Chapman&

    Hall/CRCHandbooksofModernStatisticalMethods.NewYork,NY,USA:ChapmanandHall/CRC,2014.ISBN:

    9781439854624.

    2014 Book

    Molenberghs,G.andM.G.Kenward.MissingDatainClinicalStudies.Chichester,WestSussex,UK:Wiley,

    2007.ISBN:9780470849811.

    DOI

    2007 Book

    Molenberghs,G.,B.Michiels,M.G.Kenward,etal.Monotonemissingdataandpattern-mixturemodels.In:

    StatisticaNeerlandica52.2(1998),pp.153-161.

    DOI

    1998 Article

    Molnar,F.J.,B.Hutton,andD.Fergusson.Doesanalysisusinglastobservationcarriedforwardintroducebiasin

    dementiaresearch?In:CanadianMedicalAssociationJournal179.8(2008),pp.751-753.

    DOI

    2008 Article

    Moritz,S.andT.Bartz-Beielstein.imputeTS:timeseriesmissingvalueimputationinR.In:TheRJournal9.1

    (2017),pp.207-218.

    URL

    2017 Article

    Moritz,S.,A.Sardá,T.Bartz-Beielstein,etal.Comparisonofdifferentmethodsforunivariatetimeseries

    imputationinR.PrepintarXiv1510.03924.2015.

    URL

    2015 Misc

    Murray,J.S.andJ.P.Reiter.MultipleImputationofMissingCategoricalandContinuousValuesviaBayesian

    MixtureModelsWithLocalDependence.In:JournaloftheAmericanStatisticalAssociation111.516(2016),

    pp.1466-1479.

    DOI

    2016 Article

    NationalResearchCouncil,U.ThePreventionandTreatmentofMissingDatainClinicalTrials.Washington(DC),

    USA:NationalAcademiesPress,2010.ISBN:9780309158145.

    DOI

    2010 Book

    Nowicki,R.K.,R.Scherer,andL.Rutkowski.Novelroughneuralnetworkforclassificationwithmissingdata.In:

    21stInternationalConferenceonMethodsandModelsinAutomationandRobotics(MMAR).(Sep.29,2016-

    Sep.01,2016).Ed.by-.IEEE,2016,pp.820–825.

    DOI

    2016 Paper

    O’Kelly,M.andB.Ratitch.ClinicalTrialswithMissingData:AGuideforPractitioners.JohnWiley&Sons,Ltd,

    2014.

    DOI

    2014 Book

    Orchard,T.andM.A.Woodbury.Amissinginformationprinciple:theoryandapplications.In:Proceedingsofthe

    SixthBerkeleySymposiumonMathematicalStatisticsandProbability,Volume1:TheoryofStatistic.Ed.byL.M.

    LeCam,N.J.andE.L.Scott.Vol.1.UniversityofCaliforniaPress,1972,pp.697–715.

    URL

    1972 Paper

    Peugh,J.L.andC.K.Enders.Missingdataineducationalresearch:areviewofreportingpracticesand

    suggestionsforimprovement.In:ReviewofEducationalResearch74.4(2004),pp.525–556.

    DOI URL

    2004 Article

    Pigott,T.D.Areviewofmethodsformissingdata.In:EducationalResearchandEvaluation7.4(2001),pp.353–

    383.

    DOI

    2001 Article

    Preisser,J.S.,K.K.Lohman,andP.J.Rathouz.Performanceofweightedestimatingequationsforlongitudinal

    binarydatawithdrop-outsmissingatrandom.In:StatisticsinMedicine21.20(2002),pp.3035–3054.

    DOI

    2002 Article

    Rahman,G.andZ.Islam.Missingvalueimputationusingdecisiontreesanddecisionforestsbysplittingand

    mergingrecords:Twonoveltechniques.In:Knowledge-BasedSystems53(2013),pp.51–65.

    DOI URL

    2013 Article

    Rao,J.N.K.andJ.Shao.Jackknifevarianceestimationwithsurveydataunderhotdeckimputation.In:

    Biometrika79.4(1992),pp.811-822.

    DOI

    1992 Article

    Reilly,M.andM.Pepe.Therelationshipbetweenhot-deckmultipleimputationandweightedlikelihood.In:

    StatisticsinMedecine16.1-3(1997),pp.5-19.

    DOI

    1997 Article

    Reiter,J.P.andM.Sadinle.Itemwiseconditionallyindependentnonresponsemodellingforincomplete

    multivariatedata.In:Biometrika104.1(Jan.2017),pp.207-220.eprint:http://oup.prod.sis.lan/biomet/article-

    pdf/104/1/207/13066719/asw063.pdf.

    DOI URL

    2017 Article

    Rieger,A.,T.Hothorn,andC.Strobl.Randomforestswithmissingvaluesinthecovariates.Tech.rep.79.

    UniversityofMunich,DepartmentofStatistics,2010.

    URL

    2010 Misc

    Robins,J.M.,A.Rotnitzky,andL.P.Zhao.EstimationofRegressionCoefficientsWhenSomeRegressorsare

    notAlwaysObserved.In:JournaloftheAmericanStatisticalAssociation89.427(1994),pp.846-866.

    DOI

    1994 Article

    Robins,J.M.,A.Rotnitzky,andL.P.Zhao.Analysisofsemiparametricregressionmodelsforrepeatedoutcomes

    inthepresenceofmissingdata.In:JournaloftheAmericanStatisticalAssociation90.429(1995),pp.106-121.

    DOI

    1995 Article

    Robins,J.M.andN.Wang.Inferenceforimputationestimators.In:Biometrika87.1(2000),pp.113-124.

    URL

    2000 Article

    Rosseel,Y.lavaan:anRpackageforstructuralequationmodeling.In:JournalofStatisticalSoftware48.2(2012).

    DOI

    2012 Article

    Rotnitzky,A.,J.M.Robins,andD.O.Scharfstein.Semiparametricregressionforrepeatedoutcomeswith

    nonignorablenonresponse.In:JournaloftheAmericanStatisticalAssociation93.444(1998),pp.1321-1339.

    DOI

    1998 Article

    Rubin,D.B.Multipleimputationafter18+years.In:JournaloftheAmericanStatisticalAssociation91.434

    (2012),pp.473-489.

    DOI

    2012 Article

    Rubin,D.B.MultlipeImputationforNonresponseinSurveys.Hoboken,NJ,USA:Wiley,1987.ISBN:

    9780471655740.

    1987 Book

    Rubin,D.B.Formalizingsubjectivenotionsabouttheeffectofnonrespondentsinsamplesurveys.In:Journalof

    theAmericanStatisticalAssociation72.359(1977),pp.538-543.

    DOI

    1977 Article

    Rubin,D.B.Inferenceandmissingdata.In:Biometrika63.3(1976),pp.581-592.

    DOI

    1976 Article

    Sadinle,M.andJ.P.Reiter.SequentialIdentificationofNonignorableMissingDataMechanisms.In:Statistica

    Sinica28.4(2018),pp.1741–1759.

    DOI

    2018 Article

    Schafer,J.L.Multipleimputation:aprimer.In:StatisticalMethodsinMedicalResearch8.1(1999),pp.3-15.

    DOI

    1999 Article

    Schafer,J.L.AnalysisofIncompleteMultivariateData.CRCMonographsonStatistics&AppliedProbability.

    BocaRaton,FL,USA:ChapmanandHall/CRC,1997.ISBN:0412040611.

    1997 Book

    Schafer,J.L.andJ.W.Graham.Missingdata:ourviewofthestateoftheart.In:PsychologicalMethods7.2

    (2002),pp.147-177.

    DOI

    2002 Article

    Schafer,J.L.andM.K.Olsen.MultipleImputationformultivariatemissing-dataproblems:adataanalyst’s

    perspective.In:MultivariateBehavioralResearch33.4(1998),pp.545-571.

    DOI

    1998 Article

    Seaman,S.,J.Galati,D.Jackson,etal.WhatIsMeantby“MissingatRandom”?In:StatisticalScience28.2

    (2013),pp.257–268.

    DOI URL

    2013 Article

    Seaman,S.R.andI.R.White.Reviewofinverseprobabilityweightingfordealingwithmissingdata.In:

    StatisticalMethodsinMedicalResearch22.3(2011),pp.278-295.

    DOI

    2011 Article

    Seaman,S.R.andS.Vansteelandt.IntroductiontoDoubleRobustMethodsforIncompleteData.In:Statistical

    Science33.2(2018),p.184.

    DOI

    2018 Article

    Shao,J.andJ.Zhang.Atransformationapproachinlinearmixed-effectsmodelswithinformativemissing

    responses.In:Biometrika102.1(2015),pp.107-119.

    DOI

    2015 Article

    Sharpe,P.K.andR.J.Solly.Dealingwithmissingvaluesinneuralnetwork-baseddiagnosticsystems.In:Neural

    Computing&Applications3.2(1995),pp.73-77.

    DOI

    1995 Article

    Simon,G.A.andJ.S.Simonoff.Diagnosticplotsformissingdatainleastsquaresregression.In:Journalofthe

    AmericanStatisticalAssociation81.394(1986),pp.501-509.

    DOI

    1986 Article

    Śmieja,M.,Ł.Struski,J.Tabor,etal.Processingofmissingdatabyneuralnetworks.In:ComputingResearch

    Repositoryabs/1805.07405(2018).eprint:1805.07405.

    URL

    2018 Article

    Sovilj,D.,E.Eirola,Y.Miche,etal.Extremelearningmachineformissingdatausingmultipleimputations.In:

    Neurocomputing174.A(2016),pp.220-231.

    DOI

    2016 Article

    Sportisse,A.,C.Boyer,andJ.Josse.Imputationandlow-rankestimationwithMissingNonAtRandomdata.In:

    arXivpreprint(2018).arXiv:1812.11409[stat.ML].

    URL

    2018 Article

    Stacklies,W.,H.Redestig,M.Scholz,etal.pcaMethods–abioconductorpackageprovidingPCAmethodsfor

    incompletedata.In:Bioconductor23.9(2007),pp.1164-1167.

    DOI

    2007 Article

    Stage,A.R.andN.L.Crookston.Partitioningerrorcomponentsforaccuracy-assessmentofnear-neighbor

    methodsofimputation.In:ForestScience53.1(2007),pp.62-72.

    DOI

    2007 Article

    Stekhoven,D.J.andP.Bühlmann.Missforest-non-parametricmissingvalueimputationformixed-typedata.In:

    Bioinformatics28.1(2012),pp.112-118.eprint:1105.0828.

    DOI

    2012 Article

    Strobl,C.,A.L.Boulesteix,andT.Augustin.UnbiasedsplitselectionforclassificationtreesbasedontheGini

    Index.In:ComputationalStatistics&DataAnalysis52.1(2007),pp.483-501.

    DOI

    2007 Article

    Stuart,E.A.,M.Azur,C.Frangakis,etal.Multipleimputationwithlargedatasets:acasestudyofthechildren’s

    mentalhealthinitiative.In:AmericanJournalofEpidemiology169.9(2009),pp.1133-1139.

    DOI

    2009 Article

    Stubbendick,A.L.andJ.G.Ibrahim.MaximumLikelihoodMethodsforNonignorableMissingResponsesand

    CovariatesinRandomEffectsModels.In:Biometrics59.4(2003),pp.1140–1150.

    DOI

    2003 Article

    Stubbendick,A.L.andJ.G.Ibrahim.Likelihood-basedinferencewithnonignorablemissingresponsesand

    covariatesinmodelsfordiscretelongitudinaldata.In:StatisticaSinica16.4(2006),pp.1143–1167.

    URL

    2006 Article

    Su,Y.S.,A.Gelman,J.Hill,etal.Multipleimputationwithdiagnostics(mi)inR:openingwindowsintotheblack

    box.In:JournalofStatisticalSoftware45(2011),p.2.

    DOI

    2011 Article

    Tanner,M.A.andW.Wong.Thecalculationofposteriordistributionsbydataaugmentation.In:Journalofthe

    AmericanStatisticalAssociation82.398(1987),pp.528-540.

    DOI URL

    1987 Article

    TchetgenTchetgen,E.J.,L.Wang,andB.Sun.Discretechoicemodelsfornonmonotonenonignorablemissing

    data:identificationandinference.In:StatisticaSinica28.4(2018),pp.2069–2088.

    DOI

    2018 Article

    Templ,M.,A.Alfons,andP.Filzmoser.ExploringIncompletedatausingvisualizationtechniques.In:Advancesin

    DataAnalysisandClassification6.1(2012),pp.29-47.

    DOI

    2012 Article

    Thijs,H.,G.Molenberghs,B.Michiels,etal.Strategiestofitpattern-mixturemodels.In:Biostatistics3.2(2002),

    pp.245-265.

    DOI

    2002 Article

    Tierney,N.andD.Cook.Expandingtidydataprinciplestofacilitatemissingdataexploration,visualizationand

    assessmentofimputations.MonashEconometricsandBusinessStatisticsWorkingPapers14/18.Monash

    University,DepartmentofEconometricsandBusinessStatistics,2018.

    URL

    2018 Misc

    Tierney,N.J.,F.A.Harden,M.J.Harden,etal.Usingdecisiontreestounderstandstructureinmissingdata.In:

    BMJOpen5.6(2015),p.e007450.

    DOI

    2015 Article

    Tran,L.,X.Liu,J.Zhou,etal.MissingModalitiesImputationviaCascadedResidualAutoencoder.In:2017IEEE

    ConferenceonComputerVisionandPAtternRecognition(CVPR).(Jul.21,2017-Jul.26,2017).Ed.by-.IEEE,

    2017,pp.4971-4980.

    DOI

    2017 Paper

    Troyanskaya,O.,M.Cantor,G.Sherlock,etal.MissingvalueestimationmethodsforDNAmicroarrays.In:

    Bioinformatics17.6(2001),pp.520-525.

    DOI

    2001 Article

    Twala,B.E.T.H.,M.C.Jones,andD.J.Hand.Goodmethodsforcopingwithmissingdataindecisiontrees.In:

    PatternRecognitionLetters29.7(2008),pp.950-956.

    DOI

    2008 Article

    Unnebrink,K.andJ.Windeler.Intention-to-treat:methodsfordealingwithmissingvaluesinclinicaltrialsof

    progressivelydeterioratingdiseases.In:StatisticsinMedecine20.24(2001),pp.3931-3946.

    DOI

    2001 Article

    Velden,M.vandeandT.H.A.Bijmolt.Generalizedcanonicalcorrelationanalysisofmatriceswithmissingrows:

    asimulationstudy.In:Psychometrika(2006).

    DOI

    2006 Article

    Verbanck,M.,J.Josse,andF.Husson.RegularisedPCAtodenoiseandvisualisedata.In:Statisticsand

    Computing25.2(2015),pp.471-486.

    DOI

    2015 Article

    Verbeke,G.,G.Molenberghs,H.Thijs,etal.Sensitivityanalysisfornonrandomdropout:alocalinfluence

    approach.In:Biometrics57.1(2001),pp.7-14.

    DOI

    2001 Article

    Voillet,V.,P.Besse,L.Liaubet,etal.Handlingmissingrowsinmulti-omicsdataintegration:multipleimputation

    inmultiplefactoranalysisframework.In:BMCBioinformatics17.402(2016).Forthcoming.

    DOI

    2016 Article

    Wal,W.M.vanderandR.B.Geskus.ipw:anRpackageforinverseprobabilityweighting.In:Journalof

    StatisticalSoftware43.13(2011).

    DOI

    2011 Article

    Vansteelandt,S.,J.Carpenter,andM.G.Kenward.Analysisofincompletedatausinginverseprobability

    weightinganddoublyrobustestimators.In:Methodology–EuropeanJournalofResearchMethodsforthe

    BehavioralandSocialSciences6.1(2010),pp.37–48.

    DOI

    2010 Article

    Vansteelandt,S.,A.Rotnitzky,andJ.Robins.Estimationofregressionmodelsforthemeanofrepeated

    outcomesundernonignorablenonmonotonenonresponse.In:Biometrika94.4(2007),pp.841–860.

    DOI

    2007 Article

    Wainer,H.,ed.DrawingInferencesfromSelf-SelectedSamples.NewYork,NY,USA:Springer,1986. 1986 Book

    Wang,N.andJ.M.Robins.Large-sampletheoryforparametricmultipleimputationprocedures.In:Biometrika

    85.4(1998),pp.935–948.

    DOI

    1998 Article

    White,I.R.,J.Carpenter,andN.J.Horton.Ameanscoremethodforsensitivityanalysistodeparturesfromthe

    missingatrandomassumptioninrandomisedtrials.In:StatisticaSinica28.4(2018),pp.1985–2003.

    DOI

    2018 Article

    Wu,M.C.andR.J.Carroll.Estimationandcomparisonofchangesinthepresenceofinformativeright

    censoringbymodelingthecensoringprocess.In:Biometrics44.1(1988),pp.175-188.

    DOI

    1988 Article

    Xie,X.andX.L.Meng.Dissectingmultipleimputationfromamulti-phaseinferenceperspective:whathappens

    whenGod’s,imputer’sandanalyst’smodelsareuncongenial?In:StatisticaSinica27.4(2017),pp.1485–1594.

    DOI

    2017 Article

    Yang,S.,L.Wang,andP.Ding.Identificationandestimationofcausaleffectswithconfounderssubjectto

    instrumentalmissingness.In:StatisticsMethodologyRepository(2017).

    URL

    2017 Article

    Yoon,J.,J.Jordon,andM.vanderSchaar.GAIN:MissingDataImputationusingGenerativeAdversarialNets.

    In:Proceedingsofthe35thInternationalConferenceonMachineLearning.(Jul.10,2018-Jul.15,2018).Ed.by

    J.DyandA.Krause.Vol.80.ProceedingsofMachineLearningResearch.Stockholmsmässan,Stockholm

    Sweden:PMLR,2018,pp.5689–5698.

    URL

    2018 Paper

    Zhang,H.,P.Xie,andE.Xing.MissingValueImputationBasedonDeepGenerativeModels.In:Computing

    ResearchRepositoryabs/1808.01684(2018).

    URL

    2018 Article

    Zhang,S.NearestneighborselectionforiterativekNNimputation.In:JournalofSystemsandSoftware85.11

    (2012),pp.2541-2552.

    DOI

    2012 Article

    Zhou,Y.,R.J.A.Little,andJ.D.Kalbfleisch.Block-conditionalmissingatrandommodelsformissingdata.In:

    StatisticalScience25.4(2010),pp.517–532.

    DOI

    2010 Article

    Zhu,Z.,T.Wang,andR.J.Samworth.High-dimensionalprincipalcomponentanalysiswithheterogeneous

    missingness.In:arXivpreprint(2019).

    URL

    2019 Article

    About

    This website is proudly

    sponsored by R Consortium

    and maintained by Julie

    Josse, Imke Mayer, Nicholas

    Tierney and Nathalie

    Vialaneix.

    Read more →

    FAQ →

    Follow us!

    Events

    GitHub

    (b) Unordered version

    Figure 2: Bibliography overview.

    10

  • 2.4 Implementations

    R packages As mentioned in the introduction, prior to the platform development, theproject started with the release of the MissingData CRAN Task View, which currently listsapproximately 150 R packages. The CRAN Task View is continuously updated, adding newR packages, and removing obsolete ones. Packages are organized by topic: exploration ofmissing data, likelihood based approaches, single imputation, multiple imputation, weightingmethods, specific types of data, specific application fields. We selected only those that weresufficiently mature and stable, and that were already published on CRAN or Bioconductor.This choice was made to ensure all listed packages can easily be installed and used bypractitioners.

    Despite the Task View classifying packages into different sub-domains, it may still be achallenge for practitioners and researchers inexperienced with missing values to choose theright package for the right application. To address this challenge, we provide a partial, lesscondensed overview on existing R packages on the platform, choosing the most popularand versatile. This overview is a blend of the Task View, and the individual packagedescription pages and vignettes. For each selected package (7 at the date of writing of thisarticle, namely imputeTS, mice, missForest, missMDA, naniar, simputation and VIM), weprovided a category (in the style of the categorization in the Task View), a short descriptionof use-cases, its description (as on CRAN), the usual CRAN statistics (number of monthlydownloads, last update), the handled data formats (e.g., data.frame, matrix, vector), alist of implemented algorithms (e.g., k-means, PCA, decision tree), and the list of availabledatasets, some references (such as articles and books) and a small chunk of code, ready fora direct execution on the platform via the R package Documentation9. Figure 3 shows thecondensed view of the package page and the expanded description sheet of a given package(here naniar).

    We believe shortlisting R packages is especially useful for practitioners new to the field,as it demonstrates data analysis that handles missing values in the data. We are awarethat this selection is subjective, and welcome external suggestions for other packages toadd to this shortlist.

    Python modules To the best of our knowledge, there only exist very few implementedmethods for handling missing data in Python. However, one of the major libraries for ma-chine learning and data analysis, scikit-learn (Pedregosa et al., 2011) has recently proposeda module for simple and multiple imputations, sklearn.impute. And, as an alternative, the

    9https://rdrr.io/snippets/

    11

    https://CRAN.R-project.org/view=MissingData

  • Home

    Workflows

    Bibliography

    Lectures

    R packages

    Data

    People

    More info & links

    Contact

    R-miss-tastic

    A resource website on missing values - Methods and references for managing missing data

    Website template created by @mdo, ported to Hugo by @mralanorth. Website proudly powered by Blogdown for R.

    Back to top

    R Packages

    This page provides introductions to popular missing data packages with small examples

    on how to use them. Thus the page gives more extensive information than the CRAN

    Task View on Missing Data, which is recommended to get a first overall overview about

    the CRAN missing data landscape.

    Youcanalsocontributeonyourowntothispageandprovideashortintroductiontoa

    missingdatapackage.Takea lookat thisshortdescriptiononhow todo this.Weare

    veryhappyaboutallcontributions.

    Search Sort by name Sort by Category

    missMDA

    Category: Single and multiple Imputation, Multivariate Data Analysis

    Imputationofincompletecontinuousorcategoricaldatasets;Missing

    valuesareimputedwithaprincipalcomponentanalysis(PCA),a

    multiplecorrespondenceanalysis(MCA)modeloramultiplefactor

    analysis(MFA)model;PerformmultipleimputationwithandinPCAor

    MCA.

    downloadsdownloads 4000/month4000/month CRANCRAN 2019-01-232019-01-23

    more..

    imputeTS

    Category: Time-Series Imputation, Visualisations for Missing Data

    Imputation(replacement)ofmissingvaluesinunivariatetimeseries.

    Offersseveralimputationfunctionsandmissingdataplots.Available

    imputationalgorithmsinclude:'Mean','LOCF','Interpolation','Moving

    Average','SeasonalDecomposition','KalmanSmoothingonStructural

    TimeSeriesmodels','KalmanSmoothingonARIMAmodels'.

    downloadsdownloads 12K/month12K/month CRANCRAN 2019-07-012019-07-01

    more..

    mice

    Category: Multiple Imputation

    MultipleimputationusingFullyConditionalSpecification(FCS)

    implementedbytheMICEalgorithmasdescribedinVanBuurenand

    Groothuis-Oudshoorn(2011).Eachvariablehasitsownimputation

    model.Built-inimputationmodelsareprovidedforcontinuousdata

    (predictivemeanmatching,normal),binarydata(logisticregression),

    unorderedcategoricaldata(polytomouslogisticregression)and

    orderedcategoricaldata(proportionalodds).MICEcanalsoimpute

    continuoustwo-leveldata(normalmodel,pan,second-levelvariables).

    Passiveimputationcanbeusedtomaintainconsistencybetween

    variables.Variousdiagnosticplotsareavailabletoinspectthequalityof

    theimputations.

    downloadsdownloads 41K/month41K/month CRANCRAN 2019-07-102019-07-10

    more..

    naniar

    Category: Visualisations for Missing Data

    Missingvaluesareubiquitousindataandneedtobeexploredand

    handledintheinitialstagesofanalysis.'naniar'providesdatastructures

    andfunctionsthatfacilitatetheplottingofmissingvaluesand

    examinationofimputations.Thisallowsmissingdatadependenciesto

    beexploredwithminimaldeviationfromthecommonworkpatternsof

    'ggplot2'andtidydata.

    downloadsdownloads 5305/month5305/month CRANCRAN 2019-02-152019-02-15

    more..

    missForest

    Category: Single Imputation

    Thefunction'missForest'inthispackageisusedtoimputemissing

    valuesparticularlyinthecaseofmixed-typedata.Itusesarandom

    foresttrainedontheobservedvaluesofadatamatrixtopredictthe

    missingvalues.Itcanbeusedtoimputecontinuousand/orcategorical

    dataincludingcomplexinteractionsandnon-linearrelations.Ityieldsan

    out-of-bag(OOB)imputationerrorestimatewithouttheneedofatest

    setorelaboratecross-validation.Itcanberuninparalleltosave

    computationtime.

    downloadsdownloads 9335/month9335/month CRANCRAN 2013-12-312013-12-31

    more..

    simputation

    Category: Single Imputation, Meta-Package

    Easytouseinterfacestoanumberofimputationmethodsthatfitinthe

    not-a-pipeoperatorofthe'magrittr'package.

    downloadsdownloads 1093/month1093/month CRANCRAN 2019-05-202019-05-20

    more..

    VIM

    Category: Single Imputation, Visualisations for Missing Data

    Newtoolsforthevisualizationofmissingand/orimputedvaluesare

    introduced,whichcanbeusedforexploringthedataandthestructure

    ofthemissingand/orimputedvalues.Dependingonthisstructureof

    themissingvalues,thecorrespondingmethodsmayhelptoidentifythe

    mechanismgeneratingthemissingvaluesandallowstoexplorethe

    dataincludingmissingvalues.Inaddition,thequalityofimputationcan

    bevisuallyexploredusingvariousunivariate,bivariate,multipleand

    multivariateplotmethods.Agraphicaluserinterfaceavailableinthe

    separatepackageVIMGUIallowsaneasyhandlingoftheimplemented

    plotmethods.

    downloadsdownloads 15K/month15K/month CRANCRAN 2019-02-112019-02-11

    more..

    Your favorite package is missing? Here is an explanation on how to make an entry for

    your package. Template

    Share

    About

    This website is proudly

    sponsored by R Consortium

    and maintained by Julie

    Josse, Imke Mayer, Nicholas

    Tierney and Nathalie

    Vialaneix.

    Read more →

    FAQ →

    Follow us!

    Events

    GitHub

    (a) Extract of global view

    Home

    Workflows

    Bibliography

    Lectures

    R packages

    Data

    People

    More info & links

    Contact

    R-miss-tastic

    A resource website on missing values - Methods and references for managing missing data

    Website template created by @mdo, ported to Hugo by @mralanorth. Website proudly powered by Blogdown for R.

    Back to top

    Package:

    naniar

    Category:

    Data Structures, Summaries, and Visualisations for Missing Data

    Use-Cases:

    Visualization of missing values, descriptive statistics, …

    Popularity:

    downloadsdownloads 5305/month5305/month

    Description:

    Missing values are ubiquitous in data and need to be carefully explored and handled in

    the initial stages of analysis. In this vignette we describe the tools in the package naniar

    for exploring missing data structures with minimal deviation from the common

    workflows of ggplot and tidy data.

    Last update:

    CRANCRAN 2019-02-152019-02-15

    Datasets:

    oceanbuoys

    pedestrian

    riskfactors

    Further Information:

    Tierney, N. J., & Cook, D. H. (2018). Expanding tidy data principles to facilitate

    missing data exploration, visualization and assessment of imputations. arXiv

    preprint arXiv:1809.02264. PDF (on arXiv)

    Vignettes

    Related visdat R-package

    Input:

    data.frame, vector

    Example:

    library(naniar)

    data(airquality)

    print("printdatasetwithNAs")

    print(head(airquality))

    ##Replace"NA"valueswithvalues10%lower

    ##thantheminimumvalueinthatvariable.

    ##Thisisdonebycallingthegeom_miss_point()function

    ggplot2::ggplot(airquality,

    ggplot2::aes(x=Solar.R,

    y=Ozone))+

    geom_miss_point()

    Here you can have a interactive look at the example:

    library(naniar)

    data(airquality)

    print("printdatasetwithNAs")print(head(airquality))

    ##Replace“NA”valueswithvalues10%lower##thantheminimumvalueinthatvariable.##Thisisdonebycallingthegeom_miss_point()functionggplot2::ggplot(airquality, ggplot2::aes(x=Solar.R, y=Ozone))+geom_miss_point()Run (Ctrl-Enter)

    You should assume that any scripts or data

    that you put into this service are public.

    Privacy policy.

    Computation provided by rdrr.io: hosting

    documentation for all R packages.

    https://rdrr.io/snippets/embedding/

    Share

    About

    This website is proudly

    sponsored by R Consortium

    and maintained by Julie

    Josse, Imke Mayer, Nicholas

    Tierney and Nathalie

    Vialaneix.

    Read more →

    FAQ →

    Follow us!

    Events

    GitHub

    (b) Description sheet

    Figure 3: R packages overview.

    12

  • statsmodels10 library now also has an implementation module for multiple imputation inPython. We regularly survey new Python implementations for handling missing values and,if pertinent from a theoretical and practical point of view, reference them on our platform.We expect this to promote their use but also additional assessment by practitioners andresearchers from the missing values (statistics/machine learning) community.

    2.5 Datasets

    Especially in methodology research, an important aspect is the comparison of differentmethods to assess the respective strengths and weaknesses. There are several datasets thatare recurrent in the missing values literature but, they have not been listed together yet.We gathered publicly available datasets that have been used recurrently for comparisonor illustration purposes in publications, R packages and tutorials. Most of these datasetsare already included in R packages but some are available on other data collections. Fig-ure 4 shows how the datasets are presented, with a detailed description shown for oneof the dataset (“Ozone”, obtained by clicking on its name). The description follows theUCI Machine Learning Repository presentation (Dua and Graff, 2019), including a shortdescription of the dataset, how to obtain it, external references describing the dataset inmore details, and links to tutorials/lectures on our websites or to vignettes in R packagesthat use the dataset.

    In addition, the Datasets section also references existing methods for generating miss-ing data, given assumptions on their generation mechanisms (as in the R package mice).These datasets also serve as benchmark in the proposed workflows and allow for a fair andtransparent comparison between the different methods.

    Note however, that the list of datasets we gather here is comparatively short whencompared to benchmark datasets for full data methods such as the UCI Machine LearningRepository. Therefore, our proposed list also serves as an invitation to tackle this lack ofa wider variety of common benchmark datasets in the missing data community.

    2.6 Additional content

    This unified platform collects and edits the contributions of numerous individuals whohave investigated the missing values problems, and developed methods to handle them formany years. To provide an overview of some of the main actors in this field, the list of all

    10https://www.statsmodels.org/stable/about.html

    13

    https://www.statsmodels.org/stable/about.html

  • Home

    Workflows

    Bibliography

    Lectures

    R packages

    Data

    People

    More info & links

    Contact

    R-miss-tastic

    A resource website on missing values - Methods and references for managing missing data

    Website template created by @mdo, ported to Hugo by @mralanorth. Website proudly powered by Blogdown for R.

    Back to top

    Here you will find a constantly growing list of interesting data sets which are frequently

    used in the R community working on missing values. These data sets can be useful to

    get familiar with different concepts in handling missing values and to assess the quality

    and performance of new methods.

    Ifyouhavesuggestionsonotherdatasetswhichmightbeofinteresttoothers,please

    feelfreetocontactusviatheContactform.

    Complete data

    If you wish to evaluate a certain missing data method on real (or simulated) data it can

    be useful to first generate missing values in a complete dataset. This allows to control

    the response mechanism and evaluate the method for different response mechanisms.

    Some useful tools for this:

    The ampute function of the mice R-package. Rianne Schouten and her colleagues

    wrote a self-contained tutorial on how to ampute data.

    The R workflow on How to generate missing values? extending some

    functionalities of the ampute function. For the related R source code click here.

    The missCompare R-package.

    Incomplete data

    The data sets listed below are either widely used in general in the missing data

    community or used for illustration of different methods handling missing values in the

    tutorials from the Tutorials and R packages sections. This presentation scheme is

    inspired by the UCI Machine Learning Repository.

    Click on a table entry to obtain further information about the data set.

    Share

    Name Data Types

    Attribute

    Types

    #

    Instances

    #

    Attributes

    %

    Missing

    entries

    Complete

    data

    available Year

    Airquality Multivariate,

    Time Series

    Real 154 6 7 No 1973

    chorizonDL Multivariate Integer,

    Real

    606 110 15 Yes 1998

    Health

    Nutrition And

    Population

    Statistics

    Multivariate,

    Time Series

    Integer,

    Real

    15,022 397 54 No 2017

    NHANES Multivariate Categorical,

    Integer,

    Real

    10,000 75 37 No 2012

    oceanbuoys Multivariate,

    Time Series

    Real 736 8 3 No 1997

    Ozone Multivariate Categorical,

    Integer,

    Real

    366 13 6 No 1976

    Los Angeles Ozone Pollution Data, 1976. This data set contains daily measurements of ozone concentration and

    meteorological quantities. It can be found in R in the mlbench package and is loaded by calling data(Ozone).

    More information on the dataset.

    Tutorials illustrating methods on this data:

    Julie Josse's course on missing values imputation using PC methods.

    Julie Josse's and Nick Tierney's tutorial on handling missing values. Download the data set from this

    tutorial: ozoneNA.csv

    Nick Tierney's naniar vignette for missing data visualization.

    pedestrian Multivariate,

    Time series

    Categorical,

    Integer

    37,700 9 2 No 2016

    riskfactors Multivariate Categorical,

    Integer,

    Real

    245 34 14 No 2009

    SBS52424 Multivariate Real 262 9 2 No 2016

    sleep Multivariate Integer,

    Real

    62 10 6 No 1976

    tsAirgap Time series Integer 144 1 9 Yes 1960

    tsHeating Time series Real 606,837 1 9 Yes 2015

    tsNH4 Time series Real 4,552 1 9 Yes 2014

    About

    This website is proudly

    sponsored by R Consortium

    and maintained by Julie

    Josse, Imke Mayer, Nicholas

    Tierney and Nathalie

    Vialaneix.

    Read more →

    FAQ →

    Follow us!

    Events

    GitHub

    Figure 4: Datasets overview.

    14

  • contributors who agreed to appear on this platform is given with links to their personal orto their research lab website.

    We also provide links to other interesting websites or working groups, not necessarilyworking with R and Python (Van Rossum and Drake, 2009) but with other programminglanguages such as SAS/STAT® and STATA (StataCorp., 2019).

    The platform also provides two other features to engage the community:

    1. a regularly updated list of events such as conferences or summer schools with specialfocus on missing values problems, and

    2. a list of recurring questions together with short answers and links for more details forevery question.

    3 Workflows

    After this general introduction to the R-miss-tastic project and platform and the overviewof its structure, we now turn to a more detailed presentation of the various workflows wehave developed and proposed on this platform.

    To allow for both hands-on tutorials illustrating current practices and state-of-the-art and ready-to-use pipelines, we propose the workflows under different formats such asHTML, PDF, R Markdown (for R code) and IPYNB (for Python code). We encouragepractitioners and researchers to use and adapt these workflows, in order to increase repro-ducibility and comparability of their work. Of course, we are aware that these workflowsdo not cover the entire spectrum of existing methods and data problems. The goal of theproposed workflows is rather to initiate a joint effort to create a larger spectrum of open-source workflows, and to encourage the use of standardized procedures to handle missingvalues.

    With an incomplete dataset at hand, prior to embarking on an in-depth statisticalanalysis, a specific aim has to be defined in order to choose a specific method to use. Anexample of a method whose success crucially depends on the analyst’s goal is mean imputa-tion: this approach is strongly counter-indicated if the aim is to estimate parameters, butit can be consistent if the aim is to predict as well as possible (Josse et al., 2019). Followingthis observation, our workflows are divided into different parts, defined by the objective ofthe statistical analysis. We have tried to present and compare the main implementationsavailable both in R and Python for each objective. Currently there are seven workflowsavailable on the platform and we present their scope and use below.

    15

  • 3.1 How to generate missing values?

    Rubin (1976) classifies the cause for a lack of data into three missing data mechanisms.The missing data mechanism is said to be: (i) missing completely at random (MCAR) ifthe lack of data is totally independent of the data values, (ii) missing at random (MAR) ifthe process that causes the missing data may only depend on the observed values and (iii)missing not at random (MNAR) if the unavailability of the data depends on the missingvariables, for instance on their values themselves and possibly on the values of observedvariables. More formally, let us define the indicator matrix R ∈ Rn×p, which indicates theindices of the observed values in X ∈ Rn×p, i.e., Rij = 1 if Xij is observed and Rij = 0otherwise. The missing data mechanism is then the distribution of the indicator matrix Rgiven the data X.

    The goal of this workflow is to propose functions to generate missing values under thesedifferent mechanisms. The way in which the missing values are generated is crucial forcomparing methods (and studies) in a fair manner and is often subject of debate (Seamanet al., 2013). This code aims to unify classical solutions to do this. Indeed, a usual strategyto compare imputation or estimation strategies is to introduce (additional) missing valuesin the dataset, and use the ground truth for these missing values to evaluate the strategies.

    In R In the R workflow, the main function produce NA allows to generate missing valuesunder the three missing-data mechanisms outlined above. This function internally calls theampute function of the mice package (van Buuren and Groothuis-Oudshoorn, 2011) butwe chose to simplify its use while adding some additional options to specify the missingvalues generation. In addition, the original ampute function generates missing values onlyfor complete dataset. In our workflow, the user can easily introduce (additional) missingvalues in a complete or incomplete dataset composed of quantitative, categorical or mixedvariables. The three main arguments are the initial dataset (data) in which the missingvalues are introduced using a given missing data mechanism (mechanism) and a givenpercentage of missing values (perc.missing). For example, to introduce 20% of MCARvalues in the dataset X, the code is detailed below.

    1 X.miss.mcar

  • The function returns the data matrix containing the new missing values (and the previouslypresent missing values of the input data) and the indicator matrix R.

    To introduce missing values under the MAR and MNAR mechanisms, there are severaloptions detailed and illustrated in the workflow. We consider the definitions of the missingdata mechanisms of Rubin (1976). For instance, if X contains three variables denoted asX1, X2, X3, two options are available to generate MAR values:

    1. the first option consists of generating missing values in X1 by using a logistic modeldepending on (X2, X3), which are observed, i.e.

    P(R1 = 0|X;φ) = 1/(1 + exp(−(φ2X2 + φ3X3)),

    where φ = (φ2, φ3) is the parameter of the missing-data mechanism. In our function,φ is chosen so that the given percentage of missing values is reached. This allows toobtain missing values in the first variable XNA1 . Then, the same strategy is performedto introduce missing values in X2 and X3, by using a logistic model depending on(X1, X3) (which are observed) and (X1, X2) (which are observed) respectively. Toget the final matrix containing missing values, we concatenate XNA1 , X

    NA2 and X

    NA3

    by handling the rows containing only missing values. To use this option, the code isdetailed below.

    1 X.miss.mar

  • 1 X.miss.marpat

  • For the MAR mechanism, by contrast with the R workflow, the Python code relies onthe definition of Little and Rubin (2002). For instance, if we have X = (X1, X2, X3), thenat least one variable should be fully-observed (say X3) and missing values in X1 and X2are introduced with the following logistic model,

    P(R1 = 0|X;φ) = P(R2 = 0|X;φ) = 1/(1 + exp(−φ3X3),

    with φ = φ3 the parameter of the missing-data mechanism chosen to reach the givenpercentage of missing values. To introduce 20% missing values in each missing variable(i.e., X1 and X2), the following code can be used, with p obs the proportion of fullyobserved variables.

    1 X_miss_mar = produce_NA(X=X, p_miss =0.2,

    2 mecha="MAR", p_obs =0.3)

    Listing 6: Generating MAR values in Python.

    For the MNAR mechanism, three main options are available, using a logistic model,a quantile censorship or a logistic model for a self-masked mechanism (for their exactdefinition, we refer to the workflow). For example, to introduce 20% of self-masked missingvalues (in all the variables), we can use the code below.

    1 X_miss_selfmasked = produce_NA(X=X, p_miss =0.4,

    2 mecha="MNAR", opt="selfmasked")

    Listing 7: Generating self-masked MNAR values in Python.

    3.2 How to impute missing values?

    There exists a vast literature on how to impute missing values. The aim of these workflows(in R and Python) is to compare the most classical imputation methods and to proposea reference pipeline for comparison on simulated and real datasets, which can be easilyextended with other imputation methods. Different types of methods are included:

    1. imputation by the mean, which serves as a naive benchmark.

    2. conditional models, if, roughly speaking, the imputation relies on the distributions ofeach variable given the others.

    • in R:

    19

    https://rmisstastic.netlify.app/how-to/python/generate_html/how%20to%20generate%20missing%20values

  • – mice (van Buuren and Groothuis-Oudshoorn, 2011): it allows to computemultiple imputations by chained equations and thus returns several imputeddatasets. We use the predictive mean matching method (default method)and aggregate the complete datasets using the mean of the imputations toget a simple imputation.

    – missForest (Stekhoven and Bühlmann, 2012): it imputes missing valuesiteratively by training random forests.

    • in Python:– IterativeImputer of scikit-learn library (Pedregosa et al., 2011): this func-

    tion is inspired by mice, but it uses (iterative) regularized imputation usingconditional expectation and provides a simple imputation. We also use theExtraTreesRegressor estimator of IterativeImputer, which trains iterativerandom forests.

    3. low-rank based models, if the data matrix to impute is assumed to be low rank and thesimilarities between the variables (or the observations) may inform the imputation,

    • in R:– softImpute (Hastie et al., 2015): it minimizes the reweighted least squares

    error penalized by the nuclear norm.

    – missMDA (Josse et al., 2016): it minimizes the reweighted least squareserror penalized by a mix between the `2-norm and `0-norm.

    • in Python: softImpute (coded in Python by ourselves).

    4. recent methods (for the Python workflow only) using optimal transport or variationalautoencoders, variables (or the observations) may inform the imputation,

    • in Python:– MIWAE (Mattei and Frellsen, 2019): it imputes missing values with a deep

    latent variable model based on importance weighted variational inference.

    – Sinkhorn (Muzellec et al., 2020): it extracts randomly several batches andconsists in minimizing optimal transport distances between batches to im-pute missing values.

    Other methods such as GAIN (Yoon et al., 2018) which uses generative adversarial nets,have not yet been compared, as they are close to those already being compared, but willbe added in the future.

    20

  • The metric we choose to compare the methods is the mean squared error (MSE), whichcan be calculated if the ground truth of the missing values is known. More precisely, theprocedure is the following one: (i) we have access to a complete dataset X, (ii) missingvalues are introduced in X and we get an incomplete dataset XNA, (iii) this incompletedataset is imputed and we obtain an imputed dataset X imp. The MSE for X imp is computedas follows

    MSE(X imp, X) =1

    nNA

    ∑i

    ∑j

    1{XNAij =NA}(Ximpij −Xij)2

    where nNA =∑

    i

    ∑j 1{XNAij =NA} is the number of missing entries in X

    NA. Note that this

    procedure can also be performed on an incomplete dataset by introducing additional missingvalues. However, for now, both R and Python notebooks only consider complete datasets.

    In R This workflow first presents the main imputation methods available in R, includingmice, missForest, softImpute and missMDA.

    We compare the methods on a simulated dataset X ∈ Rn×d under the multivariateGaussian law X ∼ N (µ,Σ), with µ the mean vector and Σ the covariance matrix. Thefunction HowtoImpute compares the imputation methods by introducing missing valuesin a complete dataset (X) using different percentages of missing values (perc.list) andmissing data mechanisms (mecha.list). It returns the mean of the methods’ MSEs for thedifferent missing values settings by taking the average over nbsim repetitions. The codeto use this function is given below. For the sake of clarity, in the workflow, all the code isdetailed and commented. The output of this function and its associated plot are shown inFigure 5, when n = 1000, d = 10, µi = 1, ∀i ∈ {1, . . . , d} and Σij = 0.5 if i 6= j ∈ {1, . . . , d}and Σij = 1 if i = j. For the MCAR mechanism, the methods perform well, while for theMNAR mechanism, the results are close to those of the naive i