R-miss-tastic: a uni ed platform for missing values methods and … · 2019. 8. 15. ·...
Transcript of R-miss-tastic: a uni ed platform for missing values methods and … · 2019. 8. 15. ·...
-
R-miss-tastic: a unified platform for missingvalues methods and workflows
Imke Mayer∗ Aude Sportisse† Julie Josse‡ Nicholas Tierney§
Nathalie Vialaneix¶
March 25, 2021
Abstract
Missing values are unavoidable when working with data. Their occurrence isexacerbated as more data from different sources become available. However, moststatistical models and visualization methods require complete data, and improperhandling of missing data results in information loss, or biased analyses. Since theseminal work of Rubin (1976), there has been a burgeoning literature on missing val-ues with heterogeneous aims and motivations. This has resulted in the developmentof various methods, formalizations, and tools (including a large number of R packagesand Python modules). However, for practitioners, it remains challenging to decidewhich method is most suited for their problem, partially because handling missingdata is still not a topic systematically covered in statistics or data science curricula.
To help address this challenge, we have launched a unified platform: “R-miss-tastic”, which aims to provide an overview of standard missing values problems,methods, how to handle them in analyses, and relevant implementations of method-ologies. In the same perspective, we have also developed several pipelines in R andPython to allow for a hands-on illustration of how to handle missing values in vari-ous statistical tasks such as estimation and prediction, while ensuring reproducibility
∗CAMS, EHESS, Paris, France (email: [email protected]). Corresponding co-author.†LPSM, Sorbonne Université, Paris, France (email: [email protected]). Corresponding co-author.‡Inria Sophia-Antipolis, Montpellier, France (email: [email protected]).§NUMBAT, Monash University, Victoria, Australia (email: [email protected]).¶MIAT, Université de Toulouse, INRA, Castanet Tolosan, France (email: [email protected]).
1
arX
iv:1
908.
0482
2v2
[st
at.M
E]
24
Mar
202
1
-
of the analyses. This will hopefully also provide some guidance on deciding whichmethod to choose for a specific problem and data. The objective of this work is notonly to comprehensively organize materials, but also to create standardized analysisworkflows, and to provide a common ground for discussions among the community.This platform is thus suited for beginners, students, more advanced analysts andresearchers.
Keywords: missing data; state of the art; bibliography; reproducibility; guided workflows;teaching material
2
-
1 Context and motivation
Missing data are unavoidable as soon as collecting or acquiring data is involved. Theyoccur for many reasons including: individuals choose not to answer survey questions, mea-surement devices fail, or data have simply not been recorded. Their presence becomes evenmore important as data are now obtained at increasing velocity and volume, and fromheterogeneous sources not originally designed to be analyzed together. As pointed out byZhu et al. (2019), “one of the ironies of working with Big Data is that missing data play anever more significant role, and often present serious difficulties for analysis”. Despite this,the approach most commonly implemented by default in software is to toss out cases withmissing values. At best, this is inefficient because it wastes information from the partiallyobserved cases. At worst, it results in biased estimates, particularly when the distributionsof the missing values are systematically different from those of the observed values (e.g.,Enders, 2010, Chap. 2).
However, handling missing data in a more efficient and relevant way (than limiting theanalysis on solely the complete cases) has attracted a lot of attention in the literature in thelast two decades. In particular, a number of reference books have been published (Schaferand Graham, 2002; Little and Rubin, 2002; van Buuren, 2012; Carpenter and Kenward,2012) and the topic is an active field of research (Josse and Reiter, 2018). The diversityof the problems of missing data means there is great variety in the methods proposed andstudied. They include model-based approaches, integrating likelihoods or posterior distri-butions over missing values, filling in missing values in a realistic way with single, or multipleimputations, or weighting approaches, appealing to ideas from the design-based literaturein survey sampling. The multiplicity of the available solutions makes sense because there isno single solution or tool to manage missing data: the appropriate methodology to handlethem depends on many features, such as the objective of the analysis, type of data, thetype of missing data and their pattern. Some of these methods are available in various andheterogeneous software solutions. As R (R Core Team, 2020) is one of the main softwaresfor statisticians and data scientists and as its development has started almost three decadesago (Ihaka, 1998), R is one of the language that offers the largest number of implementedapproaches. This is also due to its ease to incorporate new methods and its modular pack-aging system. Currently, there are over 270 R packages on CRAN that mention missingdata or imputation in their DESCRIPTION files. These packages serve many differentapplications, data types or types of analysis. More precisely, exploratory and visualizationtools for missing data are available in packages like naniar, VIM, and MissingDataGUI(Tierney, Cook, McBain, and Fay, Tierney et al.; Tierney and Cook, 2018; Kowarik and
3
-
Templ, 2016; Cheng et al., 2015). Imputation methods are included in packages like mice,Amelia, and mi (van Buuren and Groothuis-Oudshoorn, 2011; Honaker et al., 2011; Gelmanand Hill, 2011). Other packages focus on dealing with complex, heterogeneous (categor-ical, quantitative, ordinal variables) data or with large dimension multi-level data, suchas missMDA, and MixedDataImpute (Josse et al., 2016; Murray and Reiter, 2015). Toour knowledge, R is the programming language offering the largest variety of implementedmethods. However, other languages such as Python (Van Rossum and Drake, 2009), whichcurrently only have few publicly available implementations of methods that handle missingvalues in statistical tasks, offer more and more solutions. Two prominent examples are:1) the scikit-learn library (Pedregosa et al., 2011) which has recently added a module formissing values imputation; and 2) the DataWig library (Biessmann et al., 2018) whichprovides a framework to learn to impute incomplete data tables.
The rich variety of methods and tools for working with missing data means there aremany solutions for a variety of applications. Despite the number of options, missing dataare often not handled appropriately by practitioners. This may be for a few reasons. First,the plethora of options can be a double-edged sword - the sheer number of options makingit challenging to navigate and find the best one. Second, the topic of missing data isoften itself missing from many statistics and data science syllabuses, despite its relativeomnipresence in data. So then, faced with missing data, practitioners are left powerless:quite possibly never having been taught about missing data, they do not have an ideaof how to approach the problem, what are the dangers of mismanagement, navigate themethods, software, or choose the appropriate method or workflow for their problem.
To help promote better management and understanding of missing data, we have re-leased R-miss-tastic, an open platform for missing values. The platform takes the form ofa reference website1, which collects, organizes and produces material on missing data. Ithas been conceived by an infrastructure steering committee (ISC; its members are authorsof this article) working group, which first provided a CRAN Task View2 on missing data3 that lists and organizes existing R packages on the topic. The “R-miss-tastic” platformextends and builds on the CRAN Task View by collecting and organizing articles, tutorials,documentation, and workflows for analyses with missing data. The platform provides newtutorials, examples and pipelines of analyses we have developed with missing data thatspan the entirety of an analysis. The latter have been developed by us and implemented inR and in Python for this platform, implementing standard methods for generating missing
1https://rmisstastic.netlify.com/2https://CRAN.R-project.org/package=ctv3https://cran.r-project.org/web/views/MissingData.html
4
https://rmisstastic.netlify.com/https://CRAN.R-project.org/package=ctvhttps://cran.r-project.org/web/views/MissingData.html
-
values and analyzing them under different perspectives. Starting from data preparation,these pipelines also cover exploratory data analyses, establishing statistical models, analysisdiagnostics, applying machine learning methods, through to communication of the resultsobtained from incomplete data. We hope that these pipelines also serve as guidance whenit comes to choosing a method to handle missing values in a specific context.
Another important ingredient in statistical analyses, some may even argue the mostimportant one, is data which a proposed method or estimator is apply to. We thus alsoreference publicly available datasets that are commonly used as benchmark for new missingvalues methodologies. It is easily extendable and well documented, so it can seamlesslyincorporate future research in missing values. The intent of the platform is to foster awelcoming community, within and beyond the R community . “R-miss-tastic” has beendesigned to be accessible for a wide audience with different levels of prior knowledge, needs,and questions. This includes, for instance, students looking for course material to com-plement their studies, teachers and professors who can use a reference website for theirown classes or refer to students, statisticians or researchers in a different fields using statis-tics searching for solutions or existing work to help with analysis, researchers wishing tounderstand or contribute information for specific research questions, or find collaborators.
The remainder of the article is organized as follows: Section 2 describes the differentcomponents of the platform, the structure that has been chosen, and the targeted audi-ence. The section is organized as the platform itself, starting by describing materials forless advanced users then materials for researchers and finally resources for practical imple-mentation. Section 3 details the implementation and use-cases of the provided workflows,implemented both in R and in Python. Finally, in Section 4 we conclude with an overviewof the planed future development for the platform and of interesting areas in missing valuesresearch that we would like to bring to a broader audience.
2 Structure and content of the platform
The R-miss-tastic platform is released at https://rmisstastic.netlify.com/. It hasbeen developed using the R package blogdown Xie et al. (2017) which wraps hugo4. Liveexamples have been included using the tool https://rdrr.io/snippets/ provided bythe website “R Package Documentation”. The source code and materials of the platformhave been made publicly available on GitHub5, which provides a transparent record of the
4https://gohugo.io/5repository R-miss-tastic/website
5
https://rmisstastic.netlify.com/https://rdrr.io/snippets/https://gohugo.io/
-
platform’s development, and facilitates community contributions.We now discuss the structure of the R-miss-tastic platform, the aim and content of each
subsection, and highlight key features of the platform.
2.1 Workflows
An important contribution and novelty of this work is the proposal of several workflowsthat allow for a hands-on illustration of classical analyses with missing values, both onsimulated data and on publicly available real-world data. These workflows are providedboth in R and in Python code and cover the following topics:
• How to generate missing values? Generate missing values under different mecha-nisms, on complete or incomplete datasets. This is useful when performing simula-tions to compare methods.
• How to do statistical inference with missing values? We focus in particular in thedifferent solutions (maximum likelihood or multiple imputation) that are available toestimate linear and logistic regression parameters with missing values in the covari-ates.
• How to impute missing values? We compare different single imputation/matrix com-pletion methods (for instance using conditional models, low-rank models, etc.)
• How to predict with missing values? We consider establishing predictive models (forinstance using random forests (Breiman, 2001)) on data with incomplete predictors.The workflows present different strategies to deal with the missing values in thecovariates both in the train set and in the test set.
The aim of these workflows is threefold: 1) they provide a practical implementation ofconcepts and methods discussed in the lectures and bibliography sections (see Sections 2.2and 2.3); 2) they are coded in a generic way, allowing for simple re-use on other datasets, forintegration of other estimation or imputation methods; 3) the distinction between inference,imputation, and prediction lets the user keep in mind that the solutions are not the samein these cases.
Furthermore, they allow for a transparent and open discussion about the proposedimplementations which can be followed on the project GitHub repository6.
6https://github.com/R-miss-tastic/website
6
https://github.com/R-miss-tastic/website
-
We provide a more detailed view on the proposed workflows in Section 3, giving codeexamples and their corresponding tabular or graphical outputs as well as recommendationson how to interpret and leverage these outputs.
2.2 Lectures
Before starting to use any of the existing implementations for handling missing values in astatistical analysis, it is essential to understand why missing values are problematic. Thereare many lenses to view missing data through: the source of the missing values, theirpotential meaning, the relevance of the features they occur in – and the implications fordifferent types of analyses. For someone unfamiliar with missing data, it is a challengeto know where to begin the journey of understanding them, and the methods to addressthem. This challenge is being addressed with “R-miss-tastic”, which makes the materialto get started easily accessible.
Teaching and workshop material takes many forms – from slides, course notes, labworkshops, video tutorials and in-depth seminars. The material is of high quality, andhas been generously contributed by numerous renowned researchers who investigate theproblems of missing values, many of whom are professors having designed introductoryand advanced classes for statistical analysis with missing data. This makes the material onthe “R-miss-tastic” platform well suited for both beginners and more experienced users.
These teaching and workshop materials are described as “lectures”, and are organizedinto five sections:
1. General lectures: introduction to statistical analyses with missing values, theoryand concepts are covered, such as missing values mechanism, likelihood methods,imputation.
2. Multiple imputation: introduction to popular methods of multiple imputation (jointmodeling and fully conditional), how to correctly perform multiple imputation andlimits of imputation methods.
3. Principal component methods: introduction to methods exploiting low-rank typestructures in the data for visualization, imputation and estimation
4. Specific data or applications types: lectures covering in detail various sub-problemssuch as missing values in time series, in surveys, or in treatment effect estimation. In-deed, certain data types require adaptations of standard missing values methods (for
7
-
instance handling the time dependence in time series (Moritz and Bartz-Beielstein,2017)) or additional assumptions about the impact of missing values (such as theimpact on confounded treatment effects in the causal inference context (Mayer et al.,2020)). But also more in-depth material, for instance video recordings from a virtualworkshop on Missing Data Challenges in Computation, Statistics and Applications7
held in 2020.
5. Implementations: a non-exhaustive list of detailed vignettes describing functionalitiesof R packages and of Python modules that implement some of the statistical analysismethods covered in the other lectures. For instance the functionalities and possibleapplications of the missMDA R package are presented in a succinct summary, allowingthe reader to compare the main differences between this package and the mice packagewhich is also summarized using the same summary format.
Figure 1 is a screenshot of two views of the lectures page: Figure 1A shows a collapsedview presenting the different topics, Figure 1B shows an example of the expanded view ofone topic (General tutorials), with a detailed description of one of the lecture (obtainedby clicking on its title), “Analysis of missing values” by Jae-Kwang Kim. Each lecture cancontain several documents (as is the case for this one) and is briefly described by a headerpresenting its purpose.
Lectures that we found very complete and thus recommend are:
• Statistical Methods for Analysis with Missing Data by Mauricio Sadinle (in “Generaltutorials”)
• Missing Values in Clinical Research – Multiple Imputation by Nicole Erler (in “Mul-tiple imputation”)
• Handling missing values in PCA and MCA by François Husson. (in “Missing valuesand principal component methods”)
2.3 Bibliography
Complementary to the lectures section, this part of the platform serves as a broad overviewon the scientific literature discussing missing values taxonomies and mechanisms and sta-tistical, machine learning methods to handle them. This overview covers both classical
7https://www.ias.edu/math/mdccsa
8
https://www.ias.edu/math/mdccsa
-
Home
Workflows
Lectures
Bibliography
R packages
Data
People
More info & links
Contact
R-miss-tastic
A resource website on missing values - Methods and references for managing missing data
Website template created by @mdo, ported to Hugo by @mralanorth. Website proudly powered by Blogdown for R.
Collapse All
Collapse All
Below you will find a selection of high-quality lectures, tutorials and labs on different
aspects of missing values.
Ifyouwishtocontributesomeofyourownmaterialtothisplatform,pleasefeelfreeto
contactusviatheContactform.
General lectures
Multiple imputation
Missing values and principal component methods
Specific data or application types
Implementation in R
++
++
+
About
This website is proudly
sponsored by R Consortium
and maintained by Julie
Josse, Imke Mayer, Nicholas
Tierney and Nathalie
Vialaneix.
Read more →
FAQ →
Follow us!
Events
GitHub
(a) Collapsed view
Home
Workflows
Bibliography
Lectures
R packages
Data
People
More info & links
Contact
R-miss-tastic
A resource website on missing values - Methods and references for managing missing data
Website template created by @mdo, ported to Hugo by @mralanorth. Website proudly powered by Blogdown for R.
Back to top
Collapse All
Collapse All
Below you will find a selection of high-quality lectures and labs on different aspects of
missing values.
Ifyouwishtocontributesomeofyourownmaterialtothisplatform,pleasefeelfreeto
contactusviatheContactform.
General tutorials
Multiple imputation
Missing values and principal component methods
Specific data or application types
Implementation in R
+
Statistical Methods for Analysis With Missing Data
(Marie Davidian, course at NC State University, spring 2017)
Handling missing values
(Julie Josse, course at École Polytechnique, fall 2018 and Julie Josse & Nick Tierney, tutorial at useR!2018, 2018)
This course focuses on the theory and methods for missing data analysis. Topics include maximum likelihood estimation under
missing data, EM algorithm, Monte Carlo computation techniques, imputation, Bayesian approach, propensity scores, semi-
parametric approach, and non-ignorable missing data.
Introduction
Likelihood-based approach
Computation
Imputation
Multiple imputation
Propensity Score approach
Nonignorable missing data
Analysis of missing values
(Jae-Kwang Kim, course at Iowa State University, fall 2015)
Statistical Methods for Analysis with Missing Data
(Mauricio Sadinle, course at University of Washington, winter 2019)
++
++
About
This website is proudly
sponsored by R Consortium
and maintained by Julie
Josse, Imke Mayer, Nicholas
Tierney and Nathalie
Vialaneix.
Read more →
FAQ →
Follow us!
Events
GitHub
(b) Extract
Figure 1: Lectures overview.
references with books, articles, etc. such as Schafer and Graham (2002); Little and Rubin(2002); van Buuren (2012); Carpenter and Kenward (2012) and more recent developmentssuch as Josse et al. (2019); Gondara and Wang (2018) who study the consistency of su-pervised learning with missing values. The entire (non-exhaustive) bibliography can bebrowsed in two ways: 1) a complete list, filtering by publication type and year, with asearch option for the authors or 2), as a contextualized version. For 2), we classified thereferences into several domains of research or application, briefly discussing important as-pects of each domain. This double representation is shown in Figure 2 and allows anextensive search in the existing literature, while providing some guidance for those focusedon a specific topic. All references are also collected in a unique BibTex file made avail-able on the GitHub repository8. This centralized file allows external users to easily proposeadditions to the bibliography which are then reviewed by the platform editorial and mainte-nance committee, composed of researchers with different focuses on the handling of missingvalues.
8in resources/rmisstastic_biblio.bib
9
resources/rmisstastic_biblio.bib
-
Home
Workflows
Bibliography
Lectures
R packages
Data
People
More info & links
Contact
R-miss-tastic
A resource website on missing values - Methods and references for managing missing data
Website template created by @mdo, ported to Hugo by @mralanorth. Website proudly powered by Blogdown for R.
Back to top
Collapse All
Collapse All
On this platform we attempt to give you an overview of main references on missing
values. We do not claim to gather all available references on the subject but rather to
offer a peak into different fields of active research on handling missing values, allowing
for an introductory reading as well as a starting point for further bibliographical research.
See here for a full (and uncommented) list of references.
Inspired by CRAN Task View on Missing Data and a review of Imbert & Villa-Vialaneix on
handling missing values (2018, written in French) we organized our selection of relevant
references on missing values by different topics.
Short introduction to missing values
General references and reviews
Weighting methods
Hot-deck and kNN approaches
Likelihood-based approaches
Single imputation
Multiple imputation
Machine Learning
Missing values mechanisms
Share
++
++
++
++
+
About
This website is proudly
sponsored by R Consortium
and maintained by Julie
Josse, Imke Mayer, Nicholas
Tierney and Nathalie
Vialaneix.
Read more →
FAQ →
Follow us!
Events
GitHub
Processing math: 100%
(a) Contextualized version
Home
Workflows
Bibliography
Lectures
R packages
Data
People
More info & links
Contact
R-miss-tastic
A resource website on missing values - Methods and references for managing missing data
Website template created by @mdo, ported to Hugo by @mralanorth. Website proudly powered by Blogdown for R.
Back to top
A commented version of this bibliography can be found here.
Publication type
All
Year
All
Author
Search by author name..
Share
Citation Year
Publication
type
Abayomi,K.,A.Gelman,andM.Levy.Diagnosticsformultivariateimputations.In:JournaloftheRoyalStatistical
Society,SeriesC(AppliedStatistics)57.3(2008),pp.273-291.
DOI
2008 Article
Albert,P.S.andD.A.Follmann.Modelingrepeatedcountdatasubjecttoinformativedropout.In:Biometrics
56.3(2000),pp.667-677.
DOI
2000 Article
Allison,P.D.MissingData.QuantitativeApplicationsintheSocialSciences.ThousandOaks,CA,USA:Sage
Publications,2001.ISBN:9780761916727.
DOI
2001 Book
Andridge,R.andR.J.A.Little.Areviewofhotdeckimputationforsurveynon-response.In:International
StatisticalReview78.1(2010),pp.40-64.
DOI
2010 Article
Audigier,V.,F.Husson,andJ.Josse.Aprincipalcomponentmethodtoimputemissingvaluesformixeddata.
In:AdvancesinDataAnalysisandClassification10.1(2016),pp.5-26.
DOI
2016 Article
Audigier,V.,F.Husson,andJ.Josse.MIMCA:multipleimputationforcategoricalvariableswithmultiple
correspondenceanalysis.In:StatisticsandComputing27.2(2016),pp.1-18.eprint:1505.08116.
DOI
2016 Article
Audigier,V.,F.Husson,andJ.Josse.MultipleimputationforcontinuousvariablesusingaBayesianprincipal
componentanalysis.In:JournalofStatisticalComputationandSimulation86.11(2015),pp.2140-2156.
DOI
2015 Article
Bang,H.andJ.M.Robins.Doublyrobustestimationinmissingdataandcausalinferencemodels.In:Biometrics
61.4(2005),pp.962-973.
DOI
2005 Article
Baraldi,A.N.andC.K.Enders.Anintroductiontomodernmissingdataanalysis.In:JournalofSchool
Psychology48.1(2010),pp.5-37.
DOI
2010 Article
Baretta,L.andA.Santaniello.Nearestneighborimputationalgorithms:acriticalevaluation.In:BMCMedical
InformaticsandDecisionMaking.Proceedingsofthe5thTranslationalBioinformaticsConference(TBC2015):
medicalinformaticsanddecisionmaking16.Supp.3(2016),p.74.
DOI
2016 Article
Bartlett,J.W.,O.Harel,andJ.R.Carpenter.Asymptoticallyunbiasedestimationofexposureoddsratiosin
completerecordslogisticregression.In:Americanjournalofepidemiology182.8(2015),pp.730–736.
DOI
2015 Article
Bengio,Y.andF.Gingras.Recurrentneuralnetworksformissingorasynchronousdata.In:Proceedingsofthe
8thInternationalConferenceonNeuralInformationProcessingSystems.(Nov.27,1995-Dec.02,1995).Ed.by-.
Cambridge,MA,USA:MITPress,1995,pp.395-401.
URL
1995 Paper
Biessmann,F.,D.Salinas,S.Schelter,etal.“Deep”LearningforMissingValueImputationinTableswithNon-
NumericalData.In:Proceedingsofthe27thACMInternationalConferenceonInformationandKnowledge
Management.Ed.by-.CIKM’18.Torino,Italy:ACM,2018,pp.2017–2025.ISBN:978-1-4503-6014-2.
DOI URL
2018 Paper
Blake,H.A.,C.Leyrat,K.Mansfield,etal.Propensityscoresusingmissingnesspatterninformation:apractical
guide.In:arXivpreprint(2019).arXiv:1901.03981[stat.ME].
URL
2019 Article
Buck,S.F.Amethodofestimationofmissingvaluesinmultivariatedatasuitableforusewithanelectronic
computer.In:JournaloftheRoyalStatisticalSociety,SeriesB22(1960),pp.302-306.
DOI
1960 Article
Burns,R.M.Multipleandreplicateitemimputationinacomplexsamplesurvey.In:Proceedingsofthe6th
AnnualResearchConference.Ed.byB.oftheCensus.WashingtonDC,USA,1990,pp.655-665.
1990 Paper
Buuren,S.van.FlexibleImputationofMissingData.BocaRaton,FL:ChapmanandHall/CRC,2018.
URL
2018 Book
Buuren,S.van.Multipleimputationofdiscreteandcontinuousdatabyfullyconditionalspecification.In:
StatisticalMethodsinMedicalResearch16(2007),pp.219-242.
DOI
2007 Article
Buuren,S.van,J.P.L.Brand,C.G.M.Groothuis-Oudshoorn,etal.Fullyconditionalspecificationinmultivariate
imputation.In:JournalofStatisticalComputationandSimulation76.12(2006),pp.1049-1064.
DOI
2006 Article
Buuren,S.vanandK.Groothuis-Oudshoorn.MICE:multivariateimputationbychainedequationsinR.In:
JournalofStatisticalSoftware45(2011),p.3.eprint:NIHMS150003.
DOI
2011 Article
Candès,E.J.,C.A.Sing-Long,andJ.D.Trzasko.Unbiasedriskestimatesforsingularvaluethresholdingand
spectralestimators.In:IEEETransactionsonSignalProcessing61.19(2013),pp.4643-4657.
DOI
2013 Article
Carpenter,J.R.,M.G.Kenward,andS.Vansteelandt.Acomparisonofmultipleimputationanddoublyrobust
estimationforanalyseswithmissingdata.In:JournaloftheRoyalStatisticalSociety:SeriesA(Statisticsin
Society)169.3(2006),pp.571–584.
DOI
2006 Article
Carpenter,J.andM.Kenward.MultipleImputationanditsApplication.Chichester,WestSussex,UK:Wiley,
2013.ISBN:9780470740521.
DOI
2013 Book
Chen,T.andC.Guestrin.XGBoost:AScalableTreeBoostingSystem.In:Proceedingsofthe22ndACM
SIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining.(Aug.13,2016-Aug.17,2016).Ed.
by-.NewYork,NY,USA:ACM,2016,pp.785-794.ISBN:0450342322.
DOI
2016 Paper
Chen,J.andJ.Shao.Nearestneighborimputationforsurveydata.In:JournalofOfficialStatistics16.2(2000),
pp.113-131.
URL
2000 Article
Collins,L.M.,J.L.Schafer,andK.Chi-Ming.Acomparisonofinclusiveandrestrictivestrategiesinmodern
missingdataprocedures.In:PsychologicalMethods6.4(2007),pp.330-351.
DOI
2007 Article
Cranmer,S.J.andJ.Gill.Wehavetobediscreteaboutthis:anon-parametricimputationtechniqueformissing
categoricaldata.In:BritishJournalofPoliticalScience43(2012),pp.425-449.
DOI
2012 Article
Crookston,N.L.andA.O.Finley.yaImpute:anRpackageforkNNimputation.In:JournalofStatisticalSoftware
23(2008),p.10.
DOI
2008 Article
Dax,A.ImputingMissingEntriesofaDataMatrix:Areview.In:JournalofAdvancedComputing3.3(2014),
pp.98-222.
DOI
2014 Article
Dempster,A.P.,N.M.Laird,andD.B.Rubin.MaximumlikelihoodfromincompletedataviatheEMalgorithm.
In:JournaloftheRoyalStatisticalSociety,SeriesB(Methodological)39.1(1977),pp.1-38.
URL
1977 Article
Diggle,P.andM.G.Kenward.Informativedrop-outinlongitudinaldataanalysis.In:JournaloftheRoyal
StatisticalSociety,SeriesC(AppliedStatistics)43.1(1994),pp.49-93.
DOI
1994 Article
Ding,P.andF.Li.CausalInference:AMissingDataPerspective.In:StatisticalScience33.2(2018),pp.214–237.
DOI
2018 Article
Ding,Y.andJ.S.Simonoff.Aninvestigationofmissingdatamethodsforclassificationtreesappliedtobinary
responsedata.In:JournalofMachineLearningResearch11.1(2010),pp.131-170.
URL
2010 Article
Dong,Y.andC.J.Peng.Principledmissingdatamethodsforresearchers.In:SpringerPlus2(2013),p.222.
DOI
2013 Article
Enders,C.K.AppliedMissingDataAnalysis.GuilfordPress,2010,p.401.ISBN:9781606236390. 2010 Book
Enders,C.K.Aprimeronmaximumlikelihoodalgorithmsavailableforusewithmissingdata.In:Structural
EquationModeling8.1(2001),pp.128-141.
DOI
2001 Article
Fang,F.,J.Zhao,andJ.Shao.Imputation-basedadjustedscoreequationsingeneralizedlinearmodelswith
nonignorablemissingcovariatevalues.In:StatisticaSinica28.4(2018),pp.1677–1701.
DOI
2018 Article
Fay,R.E.Alternativeparadigmsfortheanalysisofimputedsurveydata.In:JournaloftheAmericanStatistical
Association91.434(1996),pp.490-498.
DOI
1996 Article
Fellegi,I.P.andD.Holt.Asystematicapproachtoautomaticeditandimputation.In:JournaloftheAmerican
StatisticalAssociation71.353(1976),pp.17-35.
DOI
1976 Article
Ferrari,P.A.,P.Annoni,A.Barbiero,etal.Animputationmethodforcategoricalvariableswithapplicationto
nonlinearprincipalcomponentanalysis.In:ComputationalStatistics&DataAnalysis55.7(2011),pp.2410-2420.
DOI
2011 Article
Finkbeiner,C.Estimationforthemultiplefactormodelwhendataaremissing.In:Psychometrika44.4(1979),
pp.409-420.
DOI
1979 Article
Fitzmaurice,G.M.,G.Molenberghs,andS.R.Lipsitz.RegressionModelsforLongitudinalBinaryResponses
withInformativeDrop-Outs.In:JournaloftheRoyalStatisticalSociety.SeriesB(Methodological)57.4(1995),
pp.691–704.
URL
1995 Article
Follmann,D.andM.Wu.Anapproximategeneralizedlinearmodelwithrandomeffectsforinformativemissing
data.In:Biometrics51.1(1995),pp.151-168.
DOI
1995 Article
Gad,A.M.andN.M.M.Darwish.Asharedparametermodelforlongitudinaldatawithmissingvalues.In:
AmericanJournalofAppliedMathematicsandStatistics1.2(2013),pp.30-35.
URL
2013 Article
Gelman,A.,G.King,andC.Liu.Notaskedandnotanswered:Multipleimputationformultiplesurveys.In:
JournaloftheAmericanStatisticalAssociation93.443(1998),pp.846–857.
DOI
1998 Article
Gelman,A.,I.vanMechelen,G.Verbeke,etal.MultipleImputationforModelChecking:Completed-DataPlots
withMissingandLatentData.In:Biometrics61.1(2005),pp.74–85.
DOI
2005 Article
Gondara,L.andK.Wang.MIDA:MultipleImputationusingDenoisingAutoencoders.In:Proceedingsofthe
22ndPacific-AsiaConferenceonKnowledgeDiscoveryandDataMining(PAKDD2018).(Jun.03,2018-Jun.06,
2018).Ed.byD.Phung,V.Tseng,G.Webb,B.Ho,M.GanjiandL.Rashidi.LectureNotesinComputerScience.
SpringerInternationalPublishing,2018,pp.260-272.ISBN:3319930404.
DOI URL
2018 Paper
Goodfellow,I.,M.Mirza,A.Courville,etal.Multi-PredictionDeepBoltzmannMachines.In:Proceedingsofthe
26thInternationalConferenceonNeuralInformationProcessingSystems.(Dec.05,2013-Dec.10,2013).Ed.by
C.Burges,L.Bottou,M.Welling,Z.GhahramaniandK.Weinberger.AdvancesinNeuralInformationProcessing
Systems26.CurranAssociates,Inc.,2013,pp.548–556.
URL
2013 Paper
Graham,J.W.Missingdataanalysis:makingitworkintherealworld.In:AnnualReviewofPsychology60(2009),
pp.549-576.
DOI
2009 Article
Graham,J.W.,S.M.Hofer,S.I.Donaldson,etal.TheScienceofPrevention:MethodologicalAdvancesfrom
AlcoholandSubstanceAbuseResearch.In:TheScienceofPrevention:MethodologicalAdvancesfromAlcohol
andSubstanceAbuseResearch.Ed.byK.Bryant,M.WindleandS.West.Washington,DC,USA:American
PsychologicalAssociation,1997.Chap.Analysiswithmissingdatainpreventionresearch,pp.325-366.ISBN:1-
55798-439-5.
DOI
1997 Book
Graham,J.W.,A.E.Olchowski,andT.E.Gilreath.Howmanyimputationsarereallyneeded?Somepractical
clarificationsofmultipleimputationtheory.In:PreventionScience8.3(2007),pp.206-213.
DOI
2007 Article
Heckman,J.Sampleselectionbiasasaspecificationerror.In:Econometrica47.1(1979),pp.153-161.
DOI
1979 Article
Heckman,J.J.Thecommonstructureofstatisticalmodelsoftruncation,sampleselectionandlimited
dependentvariablesandasimpleestimatorforsuchmodels.In:AnnalsofEconomicandSocialMeasurement
5.4(1976),pp.475-492.
URL
1976 Article
Hogan,J.W.andN.M.Laird.Mixturemodelsforthejointdistributionofrepeatedmeasuresandeventtimes.In:
StatisticsinMedecine16.1-3(1997),pp.239-257.
1997 Article
Hogan,J.W.andT.Lancaster.Instrumentalvariablesandinverseprobabilityweightingforcausalinferencefrom
longitudinalobservationalstudies.In:StatisticalMethodsinMedicalResearch13.1(2004),pp.17-48.
DOI
2004 Article
Honaker,J.,G.King,andM.Blackwell.AmeliaII:aprogramformissingdata.In:JournalofStatisticalSoftware
45.7(2011).eprint:arXiv:1501.0228.
DOI
2011 Article
Hothorn,T.,K.Hornik,andA.Zeileis.UnbiasedRecursivePartitioning:AConditionalInferenceFramework.In:
JournalofComputationalandGraphicalStatistics15.3(2012),pp.651-674.
DOI
2012 Article
Horton,N.J.andK.P.Kleinman.MuchAdoAboutNothing-AComparisonofMissingDataMethodsand
SoftwaretoFitIncompleteDataRegressionModels.In:TheAmericanStatistician61.1(2017),pp.79-90.
DOI
2017 Article
Huisman,M.Imputationofmissingitemresponses:somesimpletechniques.In:Quality&Quantity34.4(2000),
pp.331-351.
DOI
2000 Article
Husson,F.andJ.Josse.Handlingmissingvaluesinmultiplefactoranalysis.In:FoodQualityandPreference30
(2013),pp.77-85.
DOI
2013 Article
Ibrahim,J.G.,S.R.Lipsitz,andM.Chen.MissingCovariatesinGeneralizedLinearModelsWhentheMissing
DataMechanismisNon-Ignorable.In:JournaloftheRoyalStatisticalSociety.SeriesB(StatisticalMethodology)
61.1(1999),pp.173-190.
DOI URL
1999 Article
Ibrahim,J.G.,M.Chen,andS.R.Lipsitz.Missingresponsesingeneralisedlinearmixedmodelswhenthe
missingdatamechanismisnonignorable.In:Biometrika88.2(2001),pp.551-564.
DOI
2001 Article
Ilin,A.andT.Raiko.PracticalapproachestoPrincipalComponentAnalysisinthepresenceofmissingvalues.In:
JournalofMachineLearningResearch11(2010),pp.1957-2000.
URL
2010 Article
Imbert,A.,A.Valsesia,C.LeGall,etal.Multiplehot-deckimputationfornetworkinferencefromRNA
sequencingdata.In:Bioinformatics34.10(2018),pp.1726-1732.
DOI
2018 Article
Jönsson,P.andC.Wohlin.Anevaluationofk-nearestneighbourimputationusinglIkertdata.In:Proceedingsof
the10thInternationalSymposiumonSoftwareMetrics.(Sep.14,2004-Sep.16,2004).Ed.by-.Chicago,IL,
USA:IEEE,2004,pp.1530-1435.ISBN:0769521290.
DOI
2004 Paper
Jamshidian,M.andS.Jalal.Testsofhomoscedasticity,normality,andmissingcompletelyatrandomfor
incompletemultivariatedata.In:Psychometrika75.4(2010),pp.649-674.eprint:NIHMS150003.
DOI
2010 Article
Jamshidian,M.,S.Jalal,andC.Jansen.MissMech:anRpackagefortestinghomoscedasticity,multivariate
normality,andmissingcompletelyatrandom(MCAR).In:JournalofStatisticalSoftware56.6(2014),pp.1-31.
DOI
2014 Article
Jiang,W.,J.Josse,andM.Lavielle.LogisticRegressionwithMissingCovariates–ParameterEstimation,Model
SelectionandPrediction.In:arXivpreprint(2018).arXiv:1805.04602[stat.ME].
2018 Article
Joenssen,D.W.andU.Bankhofer.Donorlimitedhotdeckimputation:effectonparameterestimation.In:
JournalofTheoreticalandAppliedComputerScience6.3(2012),pp.58-70.
URL
2012 Article
Jones,M.P.IndicatorandStratificationMethodsforMissingExplanatoryVariablesinMultipleLinearRegression.
In:JournaloftheAmericanStatisticalAssociation91.433(1996),pp.222-230.
DOI
1996 Article
Josse,J.,M.Chavent,B.Liquet,etal.Handlingmissingvalueswithregularizediterativemultiple
correspondanceanalysis.In:JournalofClassification29.1(2012),pp.91-116.
DOI
2012 Article
Josse,J.andF.Husson.missMDA:apackageforhandlingmissingvaluesinmultivariatedataanalysis.In:
JournalofStatisticalSoftware70.1(2016),pp.1-31.
DOI
2016 Article
Josse,J.andF.Husson.Handlingmissingvaluesinexploratorymultivariatedataanalysismethods.In:Journal
delaSociétéFrançaisedeStatistique153.2(2012),pp.79-99.
URL
2012 Article
Josse,J.,F.Husson,andJ.Pagès.GestiondesdonnéesmanquantesenAnalyseenComposantesPrincipales.
In:JournaldelaSociétéFrançaisedeStatistique150.2(2009),pp.28-51.
URL
2009 Article
Josse,J.,J.Pagès,andF.Husson.Multipleimputationinprincipalcomponentanalysis.In:AdvancesinData
AnalysisandClassification5.3(2011),pp.231-246.
DOI
2011 Article
Josse,J.,N.Prost,E.Scornet,etal.Ontheconsistencyofsupervisedlearningwithmissingvalues.In:arXiv
preprint(2019).arXiv:1902.06931[stat.ML].
URL
2019 Article
Kaiser,J.Dealingwithmissingvaluesindata.In:JournalofSystemsIntegration5.1(2014),pp.42-51.
DOI
2014 Article
Kallus,N.,X.Mao,andM.Udell.CausalInferencewithNoisyandMissingCovariatesviaMatrixFactorization.In:
AdvancesinNeuralInformationProcessingSystems.Ed.by-.2018.eprint:1806.00811.
URL
2018 Paper
Kalton,G.andD.Kasprzyk.Thetreatmentofmissingsurveydata.In:SurveyMethodology12.1(1986),pp.1-16.
URL
1986 Article
Kapelner,A.andJ.Bleich.PredictionwithmissingdataviaBayesianadditiveregressiontrees.In:Canadian
JournalofStatistics43.2(2015),pp.224-239.
DOI URL
2015 Article
Kim,J.K.andJ.Shao.StatisticalMethodsforHandlingIncompleteData.BocaRaton,FL,USA:Chapmanand
Hall/CRC,2013.ISBN:9781482205077.
2013 Book
Kohn,R.andC.F.Ansley.Estimation,prediction,andinterpolationforARIMAmodelswithmissingdata.In:
JournaloftheAmericanStatisticalAssociation81.395(1986),pp.751-761.
DOI
1986 Article
Kowarik,A.andM.Templ.ImputationwiththeRPackageVIM.In:JournalofStatisticalSoftware74.7(2016),
pp.1-16.
DOI
2016 Article
Kropko,J.,B.Goodrich,A.Gelman,etal.MultipleImputationforContinuousandCategoricalData:Comparing
JointMultivariateNormalandConditionalApproaches.In:PoliticalAnalysis22.4(2014),pp.497–519.
DOI
2014 Article
Lee,K.M.,R.Mitra,andS.Biedermann.Optimaldesignwhenoutcomevaluesarenotmissingatrandom.In:
StatisticaSinica28.4(2018),pp.1821–1838.
DOI
2018 Article
Little,R.J.A.Modelingthedrop-outmechanisminrepeated-measuresstudies.In:JournaloftheAmerican
StatisticalAssociation90.431(1995),pp.1112-1121.
DOI
1995 Article
Little,R.J.A.Pattern-mixturemodelsformultivariateincompletedata.In:JournaloftheAmericanStatistical
Association88.421(1993),pp.125-134.
DOI
1993 Article
Little,R.J.A.Atestofmissingcompletelyatrandomformultivariatedatawithmissingvalues.In:Journalofthe
AmericanStatisticalAssociation83.404(1988),pp.1198-1202.
DOI
1988 Article
Little,R.J.A.andD.B.Rubin.StatisticalAnalysiswithMissingData.Wiley,2002,p.408.ISBN:0471183865.
DOI
2002 Book
Little,R.J.A.RegressionwithmissingX’s:areview.In:JournaloftheAmericanStatisticalAssociation87.420
(1992),pp.1227-1237.
DOI
1992 Article
Louis,T.A.FindingtheObservedInformationMatrixwhenUsingtheEMAlgorithm.In:JournaloftheRoyal
StatisticalSociety.SeriesB(Methodological)44.2(1982),pp.226–233.
URL
1982 Article
McLachlan,G.J.andT.Krishnan.TheEMAlgorithmandExtensions.Wileyseriesinprobabilityandstatistics.
Hoboken,NJ,USA:Wiley,2008.ISBN:9780471201700.
2008 Book
Meng,S.L.andD.B.Rubin.MaximumlikelihoodestimationviatheECMalgorithm:ageneralframework.In:
Biometrika80.2(1993),pp.267-278.
DOI
1993 Article
Meng,X.L.andD.B.Rubin.UsingEMtoobtainasymptoticvariance-covariancematrices:theSEMalgorithm.
In:JournaloftheAmericanStatisticalAssociation86.416(1991),pp.899-909.
DOI
1991 Article
Meng,X.L.UYouwantmetoanalyzedataIdon’thave?Areyouinsane?In:ShanghaiArchivesofPsychiatry
24.5(2012),pp.287-301.
DOI URL
2012 Article
Miao,W.andE.J.TchetgenTchetgen.Identificationandinferencewithnonignorablemissingcovariatedata.In:
StatisticaSinica28.4(2018),pp.2049–2067.
DOI
2018 Article
Moeur,M.andA.R.Stage.Mostsimilarneighbor:animprovedsamplinginferenceprocedurefornatural
resourcesplanning.In:ForestScience42.1(1995),pp.337-359.
DOI
1995 Article
Molenberghs,G.,G.Fitzmaurice,M.G.Kenward,etal.HandbookofMissingDataMethodology.Chapman&
Hall/CRCHandbooksofModernStatisticalMethods.NewYork,NY,USA:ChapmanandHall/CRC,2014.ISBN:
9781439854624.
2014 Book
Molenberghs,G.andM.G.Kenward.MissingDatainClinicalStudies.Chichester,WestSussex,UK:Wiley,
2007.ISBN:9780470849811.
DOI
2007 Book
Molenberghs,G.,B.Michiels,M.G.Kenward,etal.Monotonemissingdataandpattern-mixturemodels.In:
StatisticaNeerlandica52.2(1998),pp.153-161.
DOI
1998 Article
Molnar,F.J.,B.Hutton,andD.Fergusson.Doesanalysisusinglastobservationcarriedforwardintroducebiasin
dementiaresearch?In:CanadianMedicalAssociationJournal179.8(2008),pp.751-753.
DOI
2008 Article
Moritz,S.andT.Bartz-Beielstein.imputeTS:timeseriesmissingvalueimputationinR.In:TheRJournal9.1
(2017),pp.207-218.
URL
2017 Article
Moritz,S.,A.Sardá,T.Bartz-Beielstein,etal.Comparisonofdifferentmethodsforunivariatetimeseries
imputationinR.PrepintarXiv1510.03924.2015.
URL
2015 Misc
Murray,J.S.andJ.P.Reiter.MultipleImputationofMissingCategoricalandContinuousValuesviaBayesian
MixtureModelsWithLocalDependence.In:JournaloftheAmericanStatisticalAssociation111.516(2016),
pp.1466-1479.
DOI
2016 Article
NationalResearchCouncil,U.ThePreventionandTreatmentofMissingDatainClinicalTrials.Washington(DC),
USA:NationalAcademiesPress,2010.ISBN:9780309158145.
DOI
2010 Book
Nowicki,R.K.,R.Scherer,andL.Rutkowski.Novelroughneuralnetworkforclassificationwithmissingdata.In:
21stInternationalConferenceonMethodsandModelsinAutomationandRobotics(MMAR).(Sep.29,2016-
Sep.01,2016).Ed.by-.IEEE,2016,pp.820–825.
DOI
2016 Paper
O’Kelly,M.andB.Ratitch.ClinicalTrialswithMissingData:AGuideforPractitioners.JohnWiley&Sons,Ltd,
2014.
DOI
2014 Book
Orchard,T.andM.A.Woodbury.Amissinginformationprinciple:theoryandapplications.In:Proceedingsofthe
SixthBerkeleySymposiumonMathematicalStatisticsandProbability,Volume1:TheoryofStatistic.Ed.byL.M.
LeCam,N.J.andE.L.Scott.Vol.1.UniversityofCaliforniaPress,1972,pp.697–715.
URL
1972 Paper
Peugh,J.L.andC.K.Enders.Missingdataineducationalresearch:areviewofreportingpracticesand
suggestionsforimprovement.In:ReviewofEducationalResearch74.4(2004),pp.525–556.
DOI URL
2004 Article
Pigott,T.D.Areviewofmethodsformissingdata.In:EducationalResearchandEvaluation7.4(2001),pp.353–
383.
DOI
2001 Article
Preisser,J.S.,K.K.Lohman,andP.J.Rathouz.Performanceofweightedestimatingequationsforlongitudinal
binarydatawithdrop-outsmissingatrandom.In:StatisticsinMedicine21.20(2002),pp.3035–3054.
DOI
2002 Article
Rahman,G.andZ.Islam.Missingvalueimputationusingdecisiontreesanddecisionforestsbysplittingand
mergingrecords:Twonoveltechniques.In:Knowledge-BasedSystems53(2013),pp.51–65.
DOI URL
2013 Article
Rao,J.N.K.andJ.Shao.Jackknifevarianceestimationwithsurveydataunderhotdeckimputation.In:
Biometrika79.4(1992),pp.811-822.
DOI
1992 Article
Reilly,M.andM.Pepe.Therelationshipbetweenhot-deckmultipleimputationandweightedlikelihood.In:
StatisticsinMedecine16.1-3(1997),pp.5-19.
DOI
1997 Article
Reiter,J.P.andM.Sadinle.Itemwiseconditionallyindependentnonresponsemodellingforincomplete
multivariatedata.In:Biometrika104.1(Jan.2017),pp.207-220.eprint:http://oup.prod.sis.lan/biomet/article-
pdf/104/1/207/13066719/asw063.pdf.
DOI URL
2017 Article
Rieger,A.,T.Hothorn,andC.Strobl.Randomforestswithmissingvaluesinthecovariates.Tech.rep.79.
UniversityofMunich,DepartmentofStatistics,2010.
URL
2010 Misc
Robins,J.M.,A.Rotnitzky,andL.P.Zhao.EstimationofRegressionCoefficientsWhenSomeRegressorsare
notAlwaysObserved.In:JournaloftheAmericanStatisticalAssociation89.427(1994),pp.846-866.
DOI
1994 Article
Robins,J.M.,A.Rotnitzky,andL.P.Zhao.Analysisofsemiparametricregressionmodelsforrepeatedoutcomes
inthepresenceofmissingdata.In:JournaloftheAmericanStatisticalAssociation90.429(1995),pp.106-121.
DOI
1995 Article
Robins,J.M.andN.Wang.Inferenceforimputationestimators.In:Biometrika87.1(2000),pp.113-124.
URL
2000 Article
Rosseel,Y.lavaan:anRpackageforstructuralequationmodeling.In:JournalofStatisticalSoftware48.2(2012).
DOI
2012 Article
Rotnitzky,A.,J.M.Robins,andD.O.Scharfstein.Semiparametricregressionforrepeatedoutcomeswith
nonignorablenonresponse.In:JournaloftheAmericanStatisticalAssociation93.444(1998),pp.1321-1339.
DOI
1998 Article
Rubin,D.B.Multipleimputationafter18+years.In:JournaloftheAmericanStatisticalAssociation91.434
(2012),pp.473-489.
DOI
2012 Article
Rubin,D.B.MultlipeImputationforNonresponseinSurveys.Hoboken,NJ,USA:Wiley,1987.ISBN:
9780471655740.
1987 Book
Rubin,D.B.Formalizingsubjectivenotionsabouttheeffectofnonrespondentsinsamplesurveys.In:Journalof
theAmericanStatisticalAssociation72.359(1977),pp.538-543.
DOI
1977 Article
Rubin,D.B.Inferenceandmissingdata.In:Biometrika63.3(1976),pp.581-592.
DOI
1976 Article
Sadinle,M.andJ.P.Reiter.SequentialIdentificationofNonignorableMissingDataMechanisms.In:Statistica
Sinica28.4(2018),pp.1741–1759.
DOI
2018 Article
Schafer,J.L.Multipleimputation:aprimer.In:StatisticalMethodsinMedicalResearch8.1(1999),pp.3-15.
DOI
1999 Article
Schafer,J.L.AnalysisofIncompleteMultivariateData.CRCMonographsonStatistics&AppliedProbability.
BocaRaton,FL,USA:ChapmanandHall/CRC,1997.ISBN:0412040611.
1997 Book
Schafer,J.L.andJ.W.Graham.Missingdata:ourviewofthestateoftheart.In:PsychologicalMethods7.2
(2002),pp.147-177.
DOI
2002 Article
Schafer,J.L.andM.K.Olsen.MultipleImputationformultivariatemissing-dataproblems:adataanalyst’s
perspective.In:MultivariateBehavioralResearch33.4(1998),pp.545-571.
DOI
1998 Article
Seaman,S.,J.Galati,D.Jackson,etal.WhatIsMeantby“MissingatRandom”?In:StatisticalScience28.2
(2013),pp.257–268.
DOI URL
2013 Article
Seaman,S.R.andI.R.White.Reviewofinverseprobabilityweightingfordealingwithmissingdata.In:
StatisticalMethodsinMedicalResearch22.3(2011),pp.278-295.
DOI
2011 Article
Seaman,S.R.andS.Vansteelandt.IntroductiontoDoubleRobustMethodsforIncompleteData.In:Statistical
Science33.2(2018),p.184.
DOI
2018 Article
Shao,J.andJ.Zhang.Atransformationapproachinlinearmixed-effectsmodelswithinformativemissing
responses.In:Biometrika102.1(2015),pp.107-119.
DOI
2015 Article
Sharpe,P.K.andR.J.Solly.Dealingwithmissingvaluesinneuralnetwork-baseddiagnosticsystems.In:Neural
Computing&Applications3.2(1995),pp.73-77.
DOI
1995 Article
Simon,G.A.andJ.S.Simonoff.Diagnosticplotsformissingdatainleastsquaresregression.In:Journalofthe
AmericanStatisticalAssociation81.394(1986),pp.501-509.
DOI
1986 Article
Śmieja,M.,Ł.Struski,J.Tabor,etal.Processingofmissingdatabyneuralnetworks.In:ComputingResearch
Repositoryabs/1805.07405(2018).eprint:1805.07405.
URL
2018 Article
Sovilj,D.,E.Eirola,Y.Miche,etal.Extremelearningmachineformissingdatausingmultipleimputations.In:
Neurocomputing174.A(2016),pp.220-231.
DOI
2016 Article
Sportisse,A.,C.Boyer,andJ.Josse.Imputationandlow-rankestimationwithMissingNonAtRandomdata.In:
arXivpreprint(2018).arXiv:1812.11409[stat.ML].
URL
2018 Article
Stacklies,W.,H.Redestig,M.Scholz,etal.pcaMethods–abioconductorpackageprovidingPCAmethodsfor
incompletedata.In:Bioconductor23.9(2007),pp.1164-1167.
DOI
2007 Article
Stage,A.R.andN.L.Crookston.Partitioningerrorcomponentsforaccuracy-assessmentofnear-neighbor
methodsofimputation.In:ForestScience53.1(2007),pp.62-72.
DOI
2007 Article
Stekhoven,D.J.andP.Bühlmann.Missforest-non-parametricmissingvalueimputationformixed-typedata.In:
Bioinformatics28.1(2012),pp.112-118.eprint:1105.0828.
DOI
2012 Article
Strobl,C.,A.L.Boulesteix,andT.Augustin.UnbiasedsplitselectionforclassificationtreesbasedontheGini
Index.In:ComputationalStatistics&DataAnalysis52.1(2007),pp.483-501.
DOI
2007 Article
Stuart,E.A.,M.Azur,C.Frangakis,etal.Multipleimputationwithlargedatasets:acasestudyofthechildren’s
mentalhealthinitiative.In:AmericanJournalofEpidemiology169.9(2009),pp.1133-1139.
DOI
2009 Article
Stubbendick,A.L.andJ.G.Ibrahim.MaximumLikelihoodMethodsforNonignorableMissingResponsesand
CovariatesinRandomEffectsModels.In:Biometrics59.4(2003),pp.1140–1150.
DOI
2003 Article
Stubbendick,A.L.andJ.G.Ibrahim.Likelihood-basedinferencewithnonignorablemissingresponsesand
covariatesinmodelsfordiscretelongitudinaldata.In:StatisticaSinica16.4(2006),pp.1143–1167.
URL
2006 Article
Su,Y.S.,A.Gelman,J.Hill,etal.Multipleimputationwithdiagnostics(mi)inR:openingwindowsintotheblack
box.In:JournalofStatisticalSoftware45(2011),p.2.
DOI
2011 Article
Tanner,M.A.andW.Wong.Thecalculationofposteriordistributionsbydataaugmentation.In:Journalofthe
AmericanStatisticalAssociation82.398(1987),pp.528-540.
DOI URL
1987 Article
TchetgenTchetgen,E.J.,L.Wang,andB.Sun.Discretechoicemodelsfornonmonotonenonignorablemissing
data:identificationandinference.In:StatisticaSinica28.4(2018),pp.2069–2088.
DOI
2018 Article
Templ,M.,A.Alfons,andP.Filzmoser.ExploringIncompletedatausingvisualizationtechniques.In:Advancesin
DataAnalysisandClassification6.1(2012),pp.29-47.
DOI
2012 Article
Thijs,H.,G.Molenberghs,B.Michiels,etal.Strategiestofitpattern-mixturemodels.In:Biostatistics3.2(2002),
pp.245-265.
DOI
2002 Article
Tierney,N.andD.Cook.Expandingtidydataprinciplestofacilitatemissingdataexploration,visualizationand
assessmentofimputations.MonashEconometricsandBusinessStatisticsWorkingPapers14/18.Monash
University,DepartmentofEconometricsandBusinessStatistics,2018.
URL
2018 Misc
Tierney,N.J.,F.A.Harden,M.J.Harden,etal.Usingdecisiontreestounderstandstructureinmissingdata.In:
BMJOpen5.6(2015),p.e007450.
DOI
2015 Article
Tran,L.,X.Liu,J.Zhou,etal.MissingModalitiesImputationviaCascadedResidualAutoencoder.In:2017IEEE
ConferenceonComputerVisionandPAtternRecognition(CVPR).(Jul.21,2017-Jul.26,2017).Ed.by-.IEEE,
2017,pp.4971-4980.
DOI
2017 Paper
Troyanskaya,O.,M.Cantor,G.Sherlock,etal.MissingvalueestimationmethodsforDNAmicroarrays.In:
Bioinformatics17.6(2001),pp.520-525.
DOI
2001 Article
Twala,B.E.T.H.,M.C.Jones,andD.J.Hand.Goodmethodsforcopingwithmissingdataindecisiontrees.In:
PatternRecognitionLetters29.7(2008),pp.950-956.
DOI
2008 Article
Unnebrink,K.andJ.Windeler.Intention-to-treat:methodsfordealingwithmissingvaluesinclinicaltrialsof
progressivelydeterioratingdiseases.In:StatisticsinMedecine20.24(2001),pp.3931-3946.
DOI
2001 Article
Velden,M.vandeandT.H.A.Bijmolt.Generalizedcanonicalcorrelationanalysisofmatriceswithmissingrows:
asimulationstudy.In:Psychometrika(2006).
DOI
2006 Article
Verbanck,M.,J.Josse,andF.Husson.RegularisedPCAtodenoiseandvisualisedata.In:Statisticsand
Computing25.2(2015),pp.471-486.
DOI
2015 Article
Verbeke,G.,G.Molenberghs,H.Thijs,etal.Sensitivityanalysisfornonrandomdropout:alocalinfluence
approach.In:Biometrics57.1(2001),pp.7-14.
DOI
2001 Article
Voillet,V.,P.Besse,L.Liaubet,etal.Handlingmissingrowsinmulti-omicsdataintegration:multipleimputation
inmultiplefactoranalysisframework.In:BMCBioinformatics17.402(2016).Forthcoming.
DOI
2016 Article
Wal,W.M.vanderandR.B.Geskus.ipw:anRpackageforinverseprobabilityweighting.In:Journalof
StatisticalSoftware43.13(2011).
DOI
2011 Article
Vansteelandt,S.,J.Carpenter,andM.G.Kenward.Analysisofincompletedatausinginverseprobability
weightinganddoublyrobustestimators.In:Methodology–EuropeanJournalofResearchMethodsforthe
BehavioralandSocialSciences6.1(2010),pp.37–48.
DOI
2010 Article
Vansteelandt,S.,A.Rotnitzky,andJ.Robins.Estimationofregressionmodelsforthemeanofrepeated
outcomesundernonignorablenonmonotonenonresponse.In:Biometrika94.4(2007),pp.841–860.
DOI
2007 Article
Wainer,H.,ed.DrawingInferencesfromSelf-SelectedSamples.NewYork,NY,USA:Springer,1986. 1986 Book
Wang,N.andJ.M.Robins.Large-sampletheoryforparametricmultipleimputationprocedures.In:Biometrika
85.4(1998),pp.935–948.
DOI
1998 Article
White,I.R.,J.Carpenter,andN.J.Horton.Ameanscoremethodforsensitivityanalysistodeparturesfromthe
missingatrandomassumptioninrandomisedtrials.In:StatisticaSinica28.4(2018),pp.1985–2003.
DOI
2018 Article
Wu,M.C.andR.J.Carroll.Estimationandcomparisonofchangesinthepresenceofinformativeright
censoringbymodelingthecensoringprocess.In:Biometrics44.1(1988),pp.175-188.
DOI
1988 Article
Xie,X.andX.L.Meng.Dissectingmultipleimputationfromamulti-phaseinferenceperspective:whathappens
whenGod’s,imputer’sandanalyst’smodelsareuncongenial?In:StatisticaSinica27.4(2017),pp.1485–1594.
DOI
2017 Article
Yang,S.,L.Wang,andP.Ding.Identificationandestimationofcausaleffectswithconfounderssubjectto
instrumentalmissingness.In:StatisticsMethodologyRepository(2017).
URL
2017 Article
Yoon,J.,J.Jordon,andM.vanderSchaar.GAIN:MissingDataImputationusingGenerativeAdversarialNets.
In:Proceedingsofthe35thInternationalConferenceonMachineLearning.(Jul.10,2018-Jul.15,2018).Ed.by
J.DyandA.Krause.Vol.80.ProceedingsofMachineLearningResearch.Stockholmsmässan,Stockholm
Sweden:PMLR,2018,pp.5689–5698.
URL
2018 Paper
Zhang,H.,P.Xie,andE.Xing.MissingValueImputationBasedonDeepGenerativeModels.In:Computing
ResearchRepositoryabs/1808.01684(2018).
URL
2018 Article
Zhang,S.NearestneighborselectionforiterativekNNimputation.In:JournalofSystemsandSoftware85.11
(2012),pp.2541-2552.
DOI
2012 Article
Zhou,Y.,R.J.A.Little,andJ.D.Kalbfleisch.Block-conditionalmissingatrandommodelsformissingdata.In:
StatisticalScience25.4(2010),pp.517–532.
DOI
2010 Article
Zhu,Z.,T.Wang,andR.J.Samworth.High-dimensionalprincipalcomponentanalysiswithheterogeneous
missingness.In:arXivpreprint(2019).
URL
2019 Article
About
This website is proudly
sponsored by R Consortium
and maintained by Julie
Josse, Imke Mayer, Nicholas
Tierney and Nathalie
Vialaneix.
Read more →
FAQ →
Follow us!
Events
GitHub
(b) Unordered version
Figure 2: Bibliography overview.
10
-
2.4 Implementations
R packages As mentioned in the introduction, prior to the platform development, theproject started with the release of the MissingData CRAN Task View, which currently listsapproximately 150 R packages. The CRAN Task View is continuously updated, adding newR packages, and removing obsolete ones. Packages are organized by topic: exploration ofmissing data, likelihood based approaches, single imputation, multiple imputation, weightingmethods, specific types of data, specific application fields. We selected only those that weresufficiently mature and stable, and that were already published on CRAN or Bioconductor.This choice was made to ensure all listed packages can easily be installed and used bypractitioners.
Despite the Task View classifying packages into different sub-domains, it may still be achallenge for practitioners and researchers inexperienced with missing values to choose theright package for the right application. To address this challenge, we provide a partial, lesscondensed overview on existing R packages on the platform, choosing the most popularand versatile. This overview is a blend of the Task View, and the individual packagedescription pages and vignettes. For each selected package (7 at the date of writing of thisarticle, namely imputeTS, mice, missForest, missMDA, naniar, simputation and VIM), weprovided a category (in the style of the categorization in the Task View), a short descriptionof use-cases, its description (as on CRAN), the usual CRAN statistics (number of monthlydownloads, last update), the handled data formats (e.g., data.frame, matrix, vector), alist of implemented algorithms (e.g., k-means, PCA, decision tree), and the list of availabledatasets, some references (such as articles and books) and a small chunk of code, ready fora direct execution on the platform via the R package Documentation9. Figure 3 shows thecondensed view of the package page and the expanded description sheet of a given package(here naniar).
We believe shortlisting R packages is especially useful for practitioners new to the field,as it demonstrates data analysis that handles missing values in the data. We are awarethat this selection is subjective, and welcome external suggestions for other packages toadd to this shortlist.
Python modules To the best of our knowledge, there only exist very few implementedmethods for handling missing data in Python. However, one of the major libraries for ma-chine learning and data analysis, scikit-learn (Pedregosa et al., 2011) has recently proposeda module for simple and multiple imputations, sklearn.impute. And, as an alternative, the
9https://rdrr.io/snippets/
11
https://CRAN.R-project.org/view=MissingData
-
Home
Workflows
Bibliography
Lectures
R packages
Data
People
More info & links
Contact
R-miss-tastic
A resource website on missing values - Methods and references for managing missing data
Website template created by @mdo, ported to Hugo by @mralanorth. Website proudly powered by Blogdown for R.
Back to top
R Packages
This page provides introductions to popular missing data packages with small examples
on how to use them. Thus the page gives more extensive information than the CRAN
Task View on Missing Data, which is recommended to get a first overall overview about
the CRAN missing data landscape.
Youcanalsocontributeonyourowntothispageandprovideashortintroductiontoa
missingdatapackage.Takea lookat thisshortdescriptiononhow todo this.Weare
veryhappyaboutallcontributions.
Search Sort by name Sort by Category
missMDA
Category: Single and multiple Imputation, Multivariate Data Analysis
Imputationofincompletecontinuousorcategoricaldatasets;Missing
valuesareimputedwithaprincipalcomponentanalysis(PCA),a
multiplecorrespondenceanalysis(MCA)modeloramultiplefactor
analysis(MFA)model;PerformmultipleimputationwithandinPCAor
MCA.
downloadsdownloads 4000/month4000/month CRANCRAN 2019-01-232019-01-23
more..
imputeTS
Category: Time-Series Imputation, Visualisations for Missing Data
Imputation(replacement)ofmissingvaluesinunivariatetimeseries.
Offersseveralimputationfunctionsandmissingdataplots.Available
imputationalgorithmsinclude:'Mean','LOCF','Interpolation','Moving
Average','SeasonalDecomposition','KalmanSmoothingonStructural
TimeSeriesmodels','KalmanSmoothingonARIMAmodels'.
downloadsdownloads 12K/month12K/month CRANCRAN 2019-07-012019-07-01
more..
mice
Category: Multiple Imputation
MultipleimputationusingFullyConditionalSpecification(FCS)
implementedbytheMICEalgorithmasdescribedinVanBuurenand
Groothuis-Oudshoorn(2011).Eachvariablehasitsownimputation
model.Built-inimputationmodelsareprovidedforcontinuousdata
(predictivemeanmatching,normal),binarydata(logisticregression),
unorderedcategoricaldata(polytomouslogisticregression)and
orderedcategoricaldata(proportionalodds).MICEcanalsoimpute
continuoustwo-leveldata(normalmodel,pan,second-levelvariables).
Passiveimputationcanbeusedtomaintainconsistencybetween
variables.Variousdiagnosticplotsareavailabletoinspectthequalityof
theimputations.
downloadsdownloads 41K/month41K/month CRANCRAN 2019-07-102019-07-10
more..
naniar
Category: Visualisations for Missing Data
Missingvaluesareubiquitousindataandneedtobeexploredand
handledintheinitialstagesofanalysis.'naniar'providesdatastructures
andfunctionsthatfacilitatetheplottingofmissingvaluesand
examinationofimputations.Thisallowsmissingdatadependenciesto
beexploredwithminimaldeviationfromthecommonworkpatternsof
'ggplot2'andtidydata.
downloadsdownloads 5305/month5305/month CRANCRAN 2019-02-152019-02-15
more..
missForest
Category: Single Imputation
Thefunction'missForest'inthispackageisusedtoimputemissing
valuesparticularlyinthecaseofmixed-typedata.Itusesarandom
foresttrainedontheobservedvaluesofadatamatrixtopredictthe
missingvalues.Itcanbeusedtoimputecontinuousand/orcategorical
dataincludingcomplexinteractionsandnon-linearrelations.Ityieldsan
out-of-bag(OOB)imputationerrorestimatewithouttheneedofatest
setorelaboratecross-validation.Itcanberuninparalleltosave
computationtime.
downloadsdownloads 9335/month9335/month CRANCRAN 2013-12-312013-12-31
more..
simputation
Category: Single Imputation, Meta-Package
Easytouseinterfacestoanumberofimputationmethodsthatfitinthe
not-a-pipeoperatorofthe'magrittr'package.
downloadsdownloads 1093/month1093/month CRANCRAN 2019-05-202019-05-20
more..
VIM
Category: Single Imputation, Visualisations for Missing Data
Newtoolsforthevisualizationofmissingand/orimputedvaluesare
introduced,whichcanbeusedforexploringthedataandthestructure
ofthemissingand/orimputedvalues.Dependingonthisstructureof
themissingvalues,thecorrespondingmethodsmayhelptoidentifythe
mechanismgeneratingthemissingvaluesandallowstoexplorethe
dataincludingmissingvalues.Inaddition,thequalityofimputationcan
bevisuallyexploredusingvariousunivariate,bivariate,multipleand
multivariateplotmethods.Agraphicaluserinterfaceavailableinthe
separatepackageVIMGUIallowsaneasyhandlingoftheimplemented
plotmethods.
downloadsdownloads 15K/month15K/month CRANCRAN 2019-02-112019-02-11
more..
Your favorite package is missing? Here is an explanation on how to make an entry for
your package. Template
Share
About
This website is proudly
sponsored by R Consortium
and maintained by Julie
Josse, Imke Mayer, Nicholas
Tierney and Nathalie
Vialaneix.
Read more →
FAQ →
Follow us!
Events
GitHub
(a) Extract of global view
Home
Workflows
Bibliography
Lectures
R packages
Data
People
More info & links
Contact
R-miss-tastic
A resource website on missing values - Methods and references for managing missing data
Website template created by @mdo, ported to Hugo by @mralanorth. Website proudly powered by Blogdown for R.
Back to top
Package:
naniar
Category:
Data Structures, Summaries, and Visualisations for Missing Data
Use-Cases:
Visualization of missing values, descriptive statistics, …
Popularity:
downloadsdownloads 5305/month5305/month
Description:
Missing values are ubiquitous in data and need to be carefully explored and handled in
the initial stages of analysis. In this vignette we describe the tools in the package naniar
for exploring missing data structures with minimal deviation from the common
workflows of ggplot and tidy data.
Last update:
CRANCRAN 2019-02-152019-02-15
Datasets:
oceanbuoys
pedestrian
riskfactors
Further Information:
Tierney, N. J., & Cook, D. H. (2018). Expanding tidy data principles to facilitate
missing data exploration, visualization and assessment of imputations. arXiv
preprint arXiv:1809.02264. PDF (on arXiv)
Vignettes
Related visdat R-package
Input:
data.frame, vector
Example:
library(naniar)
data(airquality)
print("printdatasetwithNAs")
print(head(airquality))
##Replace"NA"valueswithvalues10%lower
##thantheminimumvalueinthatvariable.
##Thisisdonebycallingthegeom_miss_point()function
ggplot2::ggplot(airquality,
ggplot2::aes(x=Solar.R,
y=Ozone))+
geom_miss_point()
Here you can have a interactive look at the example:
library(naniar)
data(airquality)
print("printdatasetwithNAs")print(head(airquality))
##Replace“NA”valueswithvalues10%lower##thantheminimumvalueinthatvariable.##Thisisdonebycallingthegeom_miss_point()functionggplot2::ggplot(airquality, ggplot2::aes(x=Solar.R, y=Ozone))+geom_miss_point()Run (Ctrl-Enter)
You should assume that any scripts or data
that you put into this service are public.
Privacy policy.
Computation provided by rdrr.io: hosting
documentation for all R packages.
https://rdrr.io/snippets/embedding/
Share
About
This website is proudly
sponsored by R Consortium
and maintained by Julie
Josse, Imke Mayer, Nicholas
Tierney and Nathalie
Vialaneix.
Read more →
FAQ →
Follow us!
Events
GitHub
(b) Description sheet
Figure 3: R packages overview.
12
-
statsmodels10 library now also has an implementation module for multiple imputation inPython. We regularly survey new Python implementations for handling missing values and,if pertinent from a theoretical and practical point of view, reference them on our platform.We expect this to promote their use but also additional assessment by practitioners andresearchers from the missing values (statistics/machine learning) community.
2.5 Datasets
Especially in methodology research, an important aspect is the comparison of differentmethods to assess the respective strengths and weaknesses. There are several datasets thatare recurrent in the missing values literature but, they have not been listed together yet.We gathered publicly available datasets that have been used recurrently for comparisonor illustration purposes in publications, R packages and tutorials. Most of these datasetsare already included in R packages but some are available on other data collections. Fig-ure 4 shows how the datasets are presented, with a detailed description shown for oneof the dataset (“Ozone”, obtained by clicking on its name). The description follows theUCI Machine Learning Repository presentation (Dua and Graff, 2019), including a shortdescription of the dataset, how to obtain it, external references describing the dataset inmore details, and links to tutorials/lectures on our websites or to vignettes in R packagesthat use the dataset.
In addition, the Datasets section also references existing methods for generating miss-ing data, given assumptions on their generation mechanisms (as in the R package mice).These datasets also serve as benchmark in the proposed workflows and allow for a fair andtransparent comparison between the different methods.
Note however, that the list of datasets we gather here is comparatively short whencompared to benchmark datasets for full data methods such as the UCI Machine LearningRepository. Therefore, our proposed list also serves as an invitation to tackle this lack ofa wider variety of common benchmark datasets in the missing data community.
2.6 Additional content
This unified platform collects and edits the contributions of numerous individuals whohave investigated the missing values problems, and developed methods to handle them formany years. To provide an overview of some of the main actors in this field, the list of all
10https://www.statsmodels.org/stable/about.html
13
https://www.statsmodels.org/stable/about.html
-
Home
Workflows
Bibliography
Lectures
R packages
Data
People
More info & links
Contact
R-miss-tastic
A resource website on missing values - Methods and references for managing missing data
Website template created by @mdo, ported to Hugo by @mralanorth. Website proudly powered by Blogdown for R.
Back to top
Here you will find a constantly growing list of interesting data sets which are frequently
used in the R community working on missing values. These data sets can be useful to
get familiar with different concepts in handling missing values and to assess the quality
and performance of new methods.
Ifyouhavesuggestionsonotherdatasetswhichmightbeofinteresttoothers,please
feelfreetocontactusviatheContactform.
Complete data
If you wish to evaluate a certain missing data method on real (or simulated) data it can
be useful to first generate missing values in a complete dataset. This allows to control
the response mechanism and evaluate the method for different response mechanisms.
Some useful tools for this:
The ampute function of the mice R-package. Rianne Schouten and her colleagues
wrote a self-contained tutorial on how to ampute data.
The R workflow on How to generate missing values? extending some
functionalities of the ampute function. For the related R source code click here.
The missCompare R-package.
Incomplete data
The data sets listed below are either widely used in general in the missing data
community or used for illustration of different methods handling missing values in the
tutorials from the Tutorials and R packages sections. This presentation scheme is
inspired by the UCI Machine Learning Repository.
Click on a table entry to obtain further information about the data set.
Share
Name Data Types
Attribute
Types
#
Instances
#
Attributes
%
Missing
entries
Complete
data
available Year
Airquality Multivariate,
Time Series
Real 154 6 7 No 1973
chorizonDL Multivariate Integer,
Real
606 110 15 Yes 1998
Health
Nutrition And
Population
Statistics
Multivariate,
Time Series
Integer,
Real
15,022 397 54 No 2017
NHANES Multivariate Categorical,
Integer,
Real
10,000 75 37 No 2012
oceanbuoys Multivariate,
Time Series
Real 736 8 3 No 1997
Ozone Multivariate Categorical,
Integer,
Real
366 13 6 No 1976
Los Angeles Ozone Pollution Data, 1976. This data set contains daily measurements of ozone concentration and
meteorological quantities. It can be found in R in the mlbench package and is loaded by calling data(Ozone).
More information on the dataset.
Tutorials illustrating methods on this data:
Julie Josse's course on missing values imputation using PC methods.
Julie Josse's and Nick Tierney's tutorial on handling missing values. Download the data set from this
tutorial: ozoneNA.csv
Nick Tierney's naniar vignette for missing data visualization.
pedestrian Multivariate,
Time series
Categorical,
Integer
37,700 9 2 No 2016
riskfactors Multivariate Categorical,
Integer,
Real
245 34 14 No 2009
SBS52424 Multivariate Real 262 9 2 No 2016
sleep Multivariate Integer,
Real
62 10 6 No 1976
tsAirgap Time series Integer 144 1 9 Yes 1960
tsHeating Time series Real 606,837 1 9 Yes 2015
tsNH4 Time series Real 4,552 1 9 Yes 2014
About
This website is proudly
sponsored by R Consortium
and maintained by Julie
Josse, Imke Mayer, Nicholas
Tierney and Nathalie
Vialaneix.
Read more →
FAQ →
Follow us!
Events
GitHub
Figure 4: Datasets overview.
14
-
contributors who agreed to appear on this platform is given with links to their personal orto their research lab website.
We also provide links to other interesting websites or working groups, not necessarilyworking with R and Python (Van Rossum and Drake, 2009) but with other programminglanguages such as SAS/STAT® and STATA (StataCorp., 2019).
The platform also provides two other features to engage the community:
1. a regularly updated list of events such as conferences or summer schools with specialfocus on missing values problems, and
2. a list of recurring questions together with short answers and links for more details forevery question.
3 Workflows
After this general introduction to the R-miss-tastic project and platform and the overviewof its structure, we now turn to a more detailed presentation of the various workflows wehave developed and proposed on this platform.
To allow for both hands-on tutorials illustrating current practices and state-of-the-art and ready-to-use pipelines, we propose the workflows under different formats such asHTML, PDF, R Markdown (for R code) and IPYNB (for Python code). We encouragepractitioners and researchers to use and adapt these workflows, in order to increase repro-ducibility and comparability of their work. Of course, we are aware that these workflowsdo not cover the entire spectrum of existing methods and data problems. The goal of theproposed workflows is rather to initiate a joint effort to create a larger spectrum of open-source workflows, and to encourage the use of standardized procedures to handle missingvalues.
With an incomplete dataset at hand, prior to embarking on an in-depth statisticalanalysis, a specific aim has to be defined in order to choose a specific method to use. Anexample of a method whose success crucially depends on the analyst’s goal is mean imputa-tion: this approach is strongly counter-indicated if the aim is to estimate parameters, butit can be consistent if the aim is to predict as well as possible (Josse et al., 2019). Followingthis observation, our workflows are divided into different parts, defined by the objective ofthe statistical analysis. We have tried to present and compare the main implementationsavailable both in R and Python for each objective. Currently there are seven workflowsavailable on the platform and we present their scope and use below.
15
-
3.1 How to generate missing values?
Rubin (1976) classifies the cause for a lack of data into three missing data mechanisms.The missing data mechanism is said to be: (i) missing completely at random (MCAR) ifthe lack of data is totally independent of the data values, (ii) missing at random (MAR) ifthe process that causes the missing data may only depend on the observed values and (iii)missing not at random (MNAR) if the unavailability of the data depends on the missingvariables, for instance on their values themselves and possibly on the values of observedvariables. More formally, let us define the indicator matrix R ∈ Rn×p, which indicates theindices of the observed values in X ∈ Rn×p, i.e., Rij = 1 if Xij is observed and Rij = 0otherwise. The missing data mechanism is then the distribution of the indicator matrix Rgiven the data X.
The goal of this workflow is to propose functions to generate missing values under thesedifferent mechanisms. The way in which the missing values are generated is crucial forcomparing methods (and studies) in a fair manner and is often subject of debate (Seamanet al., 2013). This code aims to unify classical solutions to do this. Indeed, a usual strategyto compare imputation or estimation strategies is to introduce (additional) missing valuesin the dataset, and use the ground truth for these missing values to evaluate the strategies.
In R In the R workflow, the main function produce NA allows to generate missing valuesunder the three missing-data mechanisms outlined above. This function internally calls theampute function of the mice package (van Buuren and Groothuis-Oudshoorn, 2011) butwe chose to simplify its use while adding some additional options to specify the missingvalues generation. In addition, the original ampute function generates missing values onlyfor complete dataset. In our workflow, the user can easily introduce (additional) missingvalues in a complete or incomplete dataset composed of quantitative, categorical or mixedvariables. The three main arguments are the initial dataset (data) in which the missingvalues are introduced using a given missing data mechanism (mechanism) and a givenpercentage of missing values (perc.missing). For example, to introduce 20% of MCARvalues in the dataset X, the code is detailed below.
1 X.miss.mcar
-
The function returns the data matrix containing the new missing values (and the previouslypresent missing values of the input data) and the indicator matrix R.
To introduce missing values under the MAR and MNAR mechanisms, there are severaloptions detailed and illustrated in the workflow. We consider the definitions of the missingdata mechanisms of Rubin (1976). For instance, if X contains three variables denoted asX1, X2, X3, two options are available to generate MAR values:
1. the first option consists of generating missing values in X1 by using a logistic modeldepending on (X2, X3), which are observed, i.e.
P(R1 = 0|X;φ) = 1/(1 + exp(−(φ2X2 + φ3X3)),
where φ = (φ2, φ3) is the parameter of the missing-data mechanism. In our function,φ is chosen so that the given percentage of missing values is reached. This allows toobtain missing values in the first variable XNA1 . Then, the same strategy is performedto introduce missing values in X2 and X3, by using a logistic model depending on(X1, X3) (which are observed) and (X1, X2) (which are observed) respectively. Toget the final matrix containing missing values, we concatenate XNA1 , X
NA2 and X
NA3
by handling the rows containing only missing values. To use this option, the code isdetailed below.
1 X.miss.mar
-
1 X.miss.marpat
-
For the MAR mechanism, by contrast with the R workflow, the Python code relies onthe definition of Little and Rubin (2002). For instance, if we have X = (X1, X2, X3), thenat least one variable should be fully-observed (say X3) and missing values in X1 and X2are introduced with the following logistic model,
P(R1 = 0|X;φ) = P(R2 = 0|X;φ) = 1/(1 + exp(−φ3X3),
with φ = φ3 the parameter of the missing-data mechanism chosen to reach the givenpercentage of missing values. To introduce 20% missing values in each missing variable(i.e., X1 and X2), the following code can be used, with p obs the proportion of fullyobserved variables.
1 X_miss_mar = produce_NA(X=X, p_miss =0.2,
2 mecha="MAR", p_obs =0.3)
Listing 6: Generating MAR values in Python.
For the MNAR mechanism, three main options are available, using a logistic model,a quantile censorship or a logistic model for a self-masked mechanism (for their exactdefinition, we refer to the workflow). For example, to introduce 20% of self-masked missingvalues (in all the variables), we can use the code below.
1 X_miss_selfmasked = produce_NA(X=X, p_miss =0.4,
2 mecha="MNAR", opt="selfmasked")
Listing 7: Generating self-masked MNAR values in Python.
3.2 How to impute missing values?
There exists a vast literature on how to impute missing values. The aim of these workflows(in R and Python) is to compare the most classical imputation methods and to proposea reference pipeline for comparison on simulated and real datasets, which can be easilyextended with other imputation methods. Different types of methods are included:
1. imputation by the mean, which serves as a naive benchmark.
2. conditional models, if, roughly speaking, the imputation relies on the distributions ofeach variable given the others.
• in R:
19
https://rmisstastic.netlify.app/how-to/python/generate_html/how%20to%20generate%20missing%20values
-
– mice (van Buuren and Groothuis-Oudshoorn, 2011): it allows to computemultiple imputations by chained equations and thus returns several imputeddatasets. We use the predictive mean matching method (default method)and aggregate the complete datasets using the mean of the imputations toget a simple imputation.
– missForest (Stekhoven and Bühlmann, 2012): it imputes missing valuesiteratively by training random forests.
• in Python:– IterativeImputer of scikit-learn library (Pedregosa et al., 2011): this func-
tion is inspired by mice, but it uses (iterative) regularized imputation usingconditional expectation and provides a simple imputation. We also use theExtraTreesRegressor estimator of IterativeImputer, which trains iterativerandom forests.
3. low-rank based models, if the data matrix to impute is assumed to be low rank and thesimilarities between the variables (or the observations) may inform the imputation,
• in R:– softImpute (Hastie et al., 2015): it minimizes the reweighted least squares
error penalized by the nuclear norm.
– missMDA (Josse et al., 2016): it minimizes the reweighted least squareserror penalized by a mix between the `2-norm and `0-norm.
• in Python: softImpute (coded in Python by ourselves).
4. recent methods (for the Python workflow only) using optimal transport or variationalautoencoders, variables (or the observations) may inform the imputation,
• in Python:– MIWAE (Mattei and Frellsen, 2019): it imputes missing values with a deep
latent variable model based on importance weighted variational inference.
– Sinkhorn (Muzellec et al., 2020): it extracts randomly several batches andconsists in minimizing optimal transport distances between batches to im-pute missing values.
Other methods such as GAIN (Yoon et al., 2018) which uses generative adversarial nets,have not yet been compared, as they are close to those already being compared, but willbe added in the future.
20
-
The metric we choose to compare the methods is the mean squared error (MSE), whichcan be calculated if the ground truth of the missing values is known. More precisely, theprocedure is the following one: (i) we have access to a complete dataset X, (ii) missingvalues are introduced in X and we get an incomplete dataset XNA, (iii) this incompletedataset is imputed and we obtain an imputed dataset X imp. The MSE for X imp is computedas follows
MSE(X imp, X) =1
nNA
∑i
∑j
1{XNAij =NA}(Ximpij −Xij)2
where nNA =∑
i
∑j 1{XNAij =NA} is the number of missing entries in X
NA. Note that this
procedure can also be performed on an incomplete dataset by introducing additional missingvalues. However, for now, both R and Python notebooks only consider complete datasets.
In R This workflow first presents the main imputation methods available in R, includingmice, missForest, softImpute and missMDA.
We compare the methods on a simulated dataset X ∈ Rn×d under the multivariateGaussian law X ∼ N (µ,Σ), with µ the mean vector and Σ the covariance matrix. Thefunction HowtoImpute compares the imputation methods by introducing missing valuesin a complete dataset (X) using different percentages of missing values (perc.list) andmissing data mechanisms (mecha.list). It returns the mean of the methods’ MSEs for thedifferent missing values settings by taking the average over nbsim repetitions. The codeto use this function is given below. For the sake of clarity, in the workflow, all the code isdetailed and commented. The output of this function and its associated plot are shown inFigure 5, when n = 1000, d = 10, µi = 1, ∀i ∈ {1, . . . , d} and Σij = 0.5 if i 6= j ∈ {1, . . . , d}and Σij = 1 if i = j. For the MCAR mechanism, the methods perform well, while for theMNAR mechanism, the results are close to those of the naive i