Methods and software for editing and imputation: recent advancements at Istat M. Di Zio, U....
Methods and software for editing and imputation: recent advancements at Istat
M. Di Zio, U. Guarnera, O. Luzi, A. Manzari
ISTAT – Italian Statistical Institute
UN/ECE Work Session on Statistical Data Editing
Ottawa, 16-18 May 2005
Outline
• Introduction
• Editing: Finite Mixture Models for continuous data
• Imputation: Bayesian Networks for categorical data
• Imputation: Quis system for continuous data
• E&I: Data Clustering for improving the search of donors in the Diesis system
Recent advancements at Istat
In order to reduce waste of resources and to disseminate best practices, efforts were directed along two lines:
– identifying methodological solutions for some common types of errors
– providing survey practitioners with generalized tools in order to facilitate the adoption of new methods and increase the standardization of processes
Editing
Identifying systematic unity measure errors (UME)
A UME occurs when the “true” value of a variable $X_j$ is reported in a wrong scale (e.g. $X_j \cdot C$, with $C = 100$, $C = 1{,}000$, and so on)
Finite Mixture Models of Normal Distributions
Probabilistic clustering based on the assumption that observations come from a mixture of a finite number of populations or groups $G_g$ in various proportions $\pi_g$
Given some parametric form for the density function in each group, maximum likelihood estimates can be obtained for the unknown parameters
Finite Mixture Models for UME
Given q variables $X_1, \dots, X_q$, the $h = 2^q$ possible clusters (mixture components) correspond to groups of units with different subsets of items affected by UME (error patterns)
Assuming that valid data are normally distributed and using a log scale, each cluster is characterized by a multivariate normal p.d.f. $f_g(y; \theta)$, i.e. $MN(\mu_g, \Sigma)$, where $\mu_g$ is the mean of the error-free cluster translated by a known vector and $\Sigma$ is constant for all clusters
Units are assigned to clusters based on their posterior probability $\tau_g(y_i; \theta)$
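As an illustration of the constrained mixture described on this slide, the following is a minimal Python sketch for the simplest case q = 1, which gives two clusters: values reported in the correct scale and values multiplied by C = 1,000. It works on a log10 scale, where the error translates the mean by the known amount log10(1,000) = 3. The function name `fit_ume_mixture`, the starting values and the simulated data are assumptions made for the example; this is not the Istat implementation.

```python
import math
import random
import statistics

def normal_pdf(y, mean, var):
    """Density of N(mean, var) evaluated at y."""
    return math.exp(-((y - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fit_ume_mixture(y, shift, n_iter=100):
    """EM for a two-cluster normal mixture on log-scale data (q = 1).

    Cluster 0 (no error) has mean mu; cluster 1 (UME) has mean mu + shift,
    where shift = log10(C) is known. Both clusters share one variance.
    Returns (mu, var, pi1, tau); tau[i] is the posterior probability,
    from the last E-step, that unit i is affected by the UME.
    """
    n = len(y)
    mu = statistics.median(y)       # starting values (an arbitrary choice)
    var = statistics.pvariance(y)
    pi1 = 0.1
    tau = [0.0] * n
    for _ in range(n_iter):
        # E-step: posterior probability of the error cluster for each unit
        for i, yi in enumerate(y):
            f0 = (1.0 - pi1) * normal_pdf(yi, mu, var)
            f1 = pi1 * normal_pdf(yi, mu + shift, var)
            tau[i] = f1 / (f0 + f1)
        # M-step under the constraints (known shift, common variance)
        pi1 = sum(tau) / n
        mu = sum(yi - ti * shift for yi, ti in zip(y, tau)) / n
        var = sum((1.0 - ti) * (yi - mu) ** 2 + ti * (yi - mu - shift) ** 2
                  for yi, ti in zip(y, tau)) / n
    return mu, var, pi1, tau

# Simulated log10 data: 200 correct values and 20 values reported x 1,000.
rng = random.Random(0)
correct = [rng.gauss(2.0, 0.3) for _ in range(200)]
wrong = [rng.gauss(2.0, 0.3) + 3.0 for _ in range(20)]
mu, var, pi1, tau = fit_ume_mixture(correct + wrong, shift=3.0)

# Units whose posterior probability of error exceeds 0.5 are flagged.
flagged = [i for i, t in enumerate(tau) if t > 0.5]
```

Because the translation is known and the variance is shared, only the mean, the variance and the mixing proportion are estimated, which is what keeps the number of parameters small as the number of clusters grows.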
Model diagnostics used to prioritise units for manual check
Atypicality Index: identifies outliers w.r.t. the defined model (e.g. units possibly affected by errors other than the UME)
Classification probabilities $\tau_g(y_i; \theta)$: identify possibly misclassified units. They can be used directly to identify misclassifications that may be influential on target estimates (significance editing)
Main findings
Finite Mixture Modelling allows multivariate, non-hierarchical data analyses. The costs of developing ad hoc procedures are saved
Finite Mixture Modelling produces highly reliable automatic data clustering/error localization
Model diagnostics can be used to reduce the costs of manual editing
The approach is robust to moderate departures from normality
The number of model parameters is limited by the model constraints on $\mu_g$ and $\Sigma$
Imputation
Bayesian Networks for categorical variables
The idea of using BNs for imputation was first proposed by Thibaudeau and Winkler (2002)
• Let $C_1, \dots, C_n$ be a set of categorical variables, each having a finite set of mutually exclusive states
• BNs allow the joint distribution of the variables to be represented graphically and numerically:
– a BN can be viewed as a Directed Acyclic Graph, and
– an inferential engine that allows inferences to be made on the distribution parameters
Graphical representation of BNs
To each variable $C_j$ with parents $Pa(C_j)$ there is attached a conditional probability $P(C_j \mid Pa(C_j))$
BNs allow the joint probability distribution $P(C_1, \dots, C_n)$ to be factorized so that
$P(C_1, \dots, C_n) = \prod_{j=1}^{n} P(C_j \mid Pa(C_j))$
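The factorization can be checked on a toy network. The sketch below assumes a hypothetical chain A → B → C of binary variables with made-up conditional probability tables; the names `parents`, `cpt` and `joint_probability` are illustrative only.

```python
from itertools import product

# Hypothetical DAG: each variable maps to the tuple of its parents.
parents = {"A": (), "B": ("A",), "C": ("B",)}
states = {"A": ("a0", "a1"), "B": ("b0", "b1"), "C": ("c0", "c1")}

# Made-up conditional probability tables P(variable | parent configuration).
cpt = {
    "A": {(): {"a0": 0.6, "a1": 0.4}},
    "B": {("a0",): {"b0": 0.7, "b1": 0.3}, ("a1",): {"b0": 0.2, "b1": 0.8}},
    "C": {("b0",): {"c0": 0.9, "c1": 0.1}, ("b1",): {"c0": 0.5, "c1": 0.5}},
}

def joint_probability(assignment):
    """Product over the nodes of P(node | parents), as in the factorization."""
    p = 1.0
    for var, pa in parents.items():
        pa_values = tuple(assignment[v] for v in pa)
        p *= cpt[var][pa_values][assignment[var]]
    return p

# P(A=a0, B=b1, C=c0) = 0.6 * 0.3 * 0.5
p_example = joint_probability({"A": "a0", "B": "b1", "C": "c0"})

# The factorized joint distribution sums to one over all configurations.
names = list(states)
total = sum(joint_probability(dict(zip(names, combo)))
            for combo in product(*(states[v] for v in names)))
```

In this toy case the three tables hold 1 + 2 + 2 = 5 free parameters, against 7 for an unrestricted joint distribution of three binary variables.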
BNs and imputation: method 1
1. Order the variables according to their “reliability”
2. Estimate the network conditional on this order
3. Estimate the conditional probabilities for each node according to the network obtained in step 2
4. Impute each missing item by a random draw from its conditional probability distribution
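Steps 3 and 4 can be sketched as follows, assuming that the ordering and the network structure (steps 1 and 2) are already available. The variables `area` and `status`, the toy records and the function names are hypothetical; conditional probabilities are estimated here by simple relative frequencies, which is one possible choice rather than the method prescribed by the slides.

```python
import random
from collections import Counter, defaultdict

def estimate_cpt(records, child, parents):
    """Relative frequencies of `child` within each observed parent configuration."""
    counts = defaultdict(Counter)
    for r in records:
        if r[child] is not None and all(r[p] is not None for p in parents):
            counts[tuple(r[p] for p in parents)][r[child]] += 1
    return {cfg: {s: c / sum(cnt.values()) for s, c in cnt.items()}
            for cfg, cnt in counts.items()}

def impute(records, child, parents, cpt, rng):
    """Fill each missing `child` by a random draw from P(child | parents).

    Assumes the parents of the record are observed (or already imputed)
    and that their configuration was seen when estimating the table.
    """
    for r in records:
        if r[child] is None:
            dist = cpt[tuple(r[p] for p in parents)]
            values = list(dist)
            r[child] = rng.choices(values, weights=[dist[s] for s in values])[0]

# Toy data: "status" has the single parent "area"; None marks a missing item.
records = [
    {"area": "north", "status": "employed"},
    {"area": "north", "status": "employed"},
    {"area": "south", "status": "employed"},
    {"area": "south", "status": "unemployed"},
    {"area": "north", "status": None},
    {"area": "south", "status": None},
]
status_cpt = estimate_cpt(records, "status", ("area",))
impute(records, "status", ("area",), status_cpt, random.Random(0))
```

Processing the nodes in the chosen order means that, by the time a variable is imputed, its parents are either observed or have already been filled in.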
BNs and imputation: methods 2/3
In a multivariate context it is more convenient to use not only the information coming from the parents, but also that coming from the children. This can be done by using the Markov blanket (Mb):
$Mb(X) = Pa(X) \cup Ch(X) \cup Pa(Ch(X))$
In this case, for each node the conditional probabilities are estimated w.r.t. its Mb
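The Markov blanket of a node can be read directly off the graph. The sketch below uses a hypothetical DAG stored as a mapping from each node to its parents; the function name `markov_blanket` is illustrative.

```python
def markov_blanket(node, parents):
    """Parents, children and the children's other parents of `node`."""
    children = {v for v, pa in parents.items() if node in pa}
    blanket = set(parents[node]) | children
    for child in children:
        blanket |= set(parents[child])
    blanket.discard(node)  # the node itself is not part of its blanket
    return blanket

# Hypothetical DAG: A -> C, B -> C, C -> D, E -> D, D -> F
dag = {"A": (), "B": (), "C": ("A", "B"), "D": ("C", "E"), "E": (), "F": ("D",)}

mb_c = markov_blanket("C", dag)  # parents A, B; child D; D's other parent E
mb_a = markov_blanket("A", dag)  # child C and C's other parent B
```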
Main findings
BNs allow the joint probability distribution to be expressed with a dramatic decrease in the number of parameters to be estimated (reduction of complexity)
BNs may estimate the relationships between variables that are really informative for predicting values
Parametric models like BNs are efficient in terms of preservation of joint distributions
The graphical representation facilitates modelling
BNs and hot deck methods behave in the same way only when the hot deck is stratified according to variables that explain the missing-data mechanism exactly
Imputation
Quis system for continuous variables
Quis (QUick Imputation System) is a generalized SAS tool developed at Istat to impute continuous survey data in a unified environment
Given a set of variables subject to non-response, different methods can be used in a completely integrated way:
• Regression Imputation via EM algorithm
• Nearest Neighbour Donor Imputation (NND)
• Multivariate Predictive Mean Matching (PMM)
Regression imputation via EM
In the context of imputation, the EM algorithm is used to obtain Maximum Likelihood estimates, in the presence of missing data, of the parameters of the model assumed for the data
Assumptions:
• MAR mechanism
• Normality
Regression imputation via EM
Once ML estimates of the parameters have been obtained, missing data can be imputed in two different ways:
• directly through the expectations of the missing values conditional on the observed ones (predictive means)
• by adding a normal random residual to the predictive means (i.e. drawing values from the conditional distributions of the missing values)
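The two imputation options can be illustrated for a bivariate normal model in which Y1 is observed and Y2 is missing. The sketch assumes that the mean vector `mu` and covariance matrix `sigma` have already been estimated (for instance by EM); the numerical values and function names below are made up for the example.

```python
import math
import random

def conditional_params(mu, sigma, y1):
    """Mean and variance of Y2 given Y1 = y1 under a bivariate normal model."""
    beta = sigma[0][1] / sigma[0][0]        # regression coefficient of Y2 on Y1
    mean = mu[1] + beta * (y1 - mu[0])      # predictive mean
    var = sigma[1][1] - beta * sigma[0][1]  # residual variance
    return mean, var

def impute_y2(mu, sigma, y1, rng=None):
    """Predictive mean if rng is None, otherwise a draw from the conditional law."""
    mean, var = conditional_params(mu, sigma, y1)
    if rng is None:
        return mean
    return mean + rng.gauss(0.0, math.sqrt(var))

# Assumed parameter estimates (illustrative values only).
mu = (10.0, 20.0)
sigma = [[4.0, 3.0], [3.0, 9.0]]

pred_mean = impute_y2(mu, sigma, y1=12.0)                     # first option
random_draw = impute_y2(mu, sigma, 12.0, random.Random(1))    # second option
```

Imputing predictive means tends to understate the variability of the imputed variable, whereas adding the random residual is intended to preserve it.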
Multivariate Predictive Mean Matching (PMM)
Let $Y = (Y_1, \dots, Y_q)$ be a set of variables subject to non-response
ML estimates of the parameters of the joint distribution of Y are derived via EM
For each pattern of missing data $y_{miss}$, the parameters of the corresponding conditional distribution are estimated starting from the ML estimates of the joint distribution (sweep operator)
For each unit $u_i$, the predictive mean based on the estimated parameters is computed
For each unit with missing data, imputation is done using the nearest donor w.r.t. the predictive mean
The Mahalanobis distance is adopted to find donors
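The donor selection step can be sketched as below for a recipient with two missing variables. The predictive means are taken as given, and the matrix `s` used in the Mahalanobis distance is an assumption of the example (for instance, a residual covariance matrix of the missing variables); all names and numbers are illustrative.

```python
def invert_2x2(s):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    return [[s[1][1] / det, -s[0][1] / det],
            [-s[1][0] / det, s[0][0] / det]]

def mahalanobis_sq(u, v, s_inv):
    """Squared Mahalanobis distance between vectors u and v."""
    d = [a - b for a, b in zip(u, v)]
    n = len(d)
    return sum(d[i] * s_inv[i][j] * d[j] for i in range(n) for j in range(n))

def nearest_donor(recipient_mean, donor_means, s_inv):
    """Index of the donor whose predictive mean is closest to the recipient's."""
    return min(range(len(donor_means)),
               key=lambda k: mahalanobis_sq(recipient_mean, donor_means[k], s_inv))

# Assumed covariance matrix: the second variable is far more variable.
s = [[1.0, 0.0], [0.0, 100.0]]
s_inv = invert_2x2(s)

recipient_mean = (0.0, 0.0)
donor_means = [(2.0, 0.0), (0.0, 10.0), (3.0, 3.0)]
donor_values = [(2.4, -1.0), (0.3, 12.5), (2.9, 4.1)]  # observed donor data

k = nearest_donor(recipient_mean, donor_means, s_inv)
imputed = donor_values[k]  # the donor's observed values are transferred
```

In this example the second donor is selected even though the first is closer in Euclidean terms, because the distance accounts for the different scales of the two variables. The imputed values are the donor's observed values, not the predictive means themselves.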
Data clustering for improving the search for donors in the Diesis system
• The DIESIS system has been developed at ISTAT for treating the demographic variables of the 2001 Population Census
• Diesis uses both the data-driven and the minimum change approaches for editing and imputation
• For each failed household, the set of potential donors contains only the nearest passed households
• The adopted distance function is a weighted sum of the distances for each demographic variable over all the individuals within the household
The approach currently in use for donor search
• For each failed household e, the identification of potential donors should be made by searching within the set of all passed households D
• When D is very large, as in the case of a Census, the computation of the distance between each e and all $d \in D$ (exhaustive search) could require unacceptable computational time
• The sub-optimal search currently in use consists in stopping the search before the entire set D has been examined, according to some stopping criteria. This solution does not guarantee that the selected potential donors actually have minimum distance from e
The new approach for donor search
• In order to reduce the number of passed households to examine, the set of passed households D is preliminarily divided into smaller homogeneous subsets $\{D_1, \dots, D_n\}$ ($D_1 \cup \dots \cup D_n = D$)
• Such a subdivision is obtained by solving an unsupervised clustering problem (donor search guided by clustering)
• The search for the potential donors is then conducted, for each failed household e, by examining only the households within the cluster(s) most similar to e
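The idea of a donor search guided by clustering can be sketched as follows. The example is a stand-in under strong assumptions: households are represented as hypothetical numeric vectors, the distance is squared Euclidean rather than the weighted Diesis distance, and the clustering is a plain k-means with fixed starting centroids, which is not necessarily the clustering method used in Diesis.

```python
def sq_dist(a, b):
    """Squared Euclidean distance (a stand-in for the Diesis distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centroids, n_iter=20):
    """Plain Lloyd iterations from the given starting centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(n_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            k = min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            clusters[k].append(p)
        # Recompute each centroid; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
    return centroids, clusters

def exhaustive_donor_search(e, donors):
    """Nearest passed household over the whole set D."""
    return min(donors, key=lambda d: sq_dist(e, d))

def clustered_donor_search(e, centroids, clusters):
    """Nearest passed household within the cluster most similar to e.

    Assumes the selected cluster is not empty.
    """
    k = min(range(len(centroids)), key=lambda c: sq_dist(e, centroids[c]))
    return min(clusters[k], key=lambda d: sq_dist(e, d))

# Passed households D (hypothetical feature vectors) and a failed household e.
donors = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(donors, centroids=[(0, 0), (10, 10)])

e = (7, 7)
donor_clustered = clustered_donor_search(e, centroids, clusters)
donor_exhaustive = exhaustive_donor_search(e, donors)
```

Here only three of the six passed households are examined for e, and the selected donor coincides with the one found by the exhaustive search. In general the two can differ when the true nearest donor lies in a neighbouring cluster, which is why more than one cluster may be examined.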
Main findings
The donor search guided by clustering reduces computational time while preserving the E&I quality obtained by the exhaustive search
The donor search guided by clustering increases the proportion of actual minimum-distance donors selected compared with the sub-optimal search (this is especially useful for households with an uncommon structure, for which few passed households are generally available)