Editing, Imputation, and Synthesis: A Public Use File for ... values and imputation process. ......

Editing, Imputation, and Synthesis: A Public Use File for

the Census of Manufactures

Hang Kim

2015 Affiliates Annual Meeting, Miami, FL Sunday, March 15

NISS / Duke University

Joint work with Jerry Reiter, Alan Karr, and Larry Cox Research supported by NSF [SES-11-31897]

BackgroundDisseminate Public Use File

Production of Public Use File

Survey Design Survey Data Collection

Data Processing Publication

Current Practice: Data Processing

Initial Check / Recontact Imputation

Data EditingDisclosure Control/

Data Masking

Edit Rules

Logical constraints which are satisfied by reported records to be considered plausible and consistent

find unacceptable errors in survey data

e.g. pregnant male, $2M of avg. salary

specify space of reasonably imputed values

Common edit rules for continuous values

1. Range restriction e.g. total emp. > 0

2. Ratio edit e.g. total salary / total emp. < $1M

3. Balance edit

e.g. total emp. = production workers + other emp.

Three Related Research Topics

1. Imputation under linear constraints

• Technical Report No. 182, NISS

• Journal of Business and Economic Statistics, 2015, Vol 32

2. Simultaneous data editing and imputation

• Technical Report No. 189, NISS

3. Synthetic microdata for the U.S. Census of Manufactures

• work in progress

Topic IImputation under linear

constraints

Example: Colombian Manufacturing Survey

Similar to U.S. Census of Manufactures

Data 1977-1991have been used for researchers

All variables are continuous

Complex feasible region given edit rules

Not easy to assume a parametric distribution

Joint Modeling Imputation (NISS report 182)

Nonparametric Bayesian Model

use mixture normals with Dirichlet process (DP) priors

to capture complex features of data under very weak distributional assumption

restrict support under constraints regions

to guarantee that imputed values satisfy edit rules

Multiple Imputation Approach

to capture uncertainty introduced by missing values and imputation process

Illustration of Mixture Normals

Dirichlet process (DP) prior helps the model stochastically decide the number of components and weights

Simulation Study using Colombian Manufacturing Survey data

1. Assume data are truly reported values with no missing

2. Randomly blank some values as simulated nonresponse

3. Fill in simulated missing values using the suggested method

Pink dots: unchanged values Blue dots: (Left) original values before blanking

(Right) imputed values

Topic IISimultaneous data editing and imputation

Automatic Data Editing

Agencies detect and edit unacceptable errors in survey data

Manual editing

utilizing expert knowledge

Automatic editing

fast and handling massive datasets

Automatic editing process

1. Error localization step

• Which variable of a record is incorrect?

2. Imputation step

• What is a reasonable value to replace the incorrect value?

Fellegi-Holt (F-H) ApproachSince proposed by Fellegi and Holt in 1976, the best-known, most-used guiding principle for automatic data editing

Mathematical optimization approach

Objective function

the number of changed variables (to be minimized)

Constraints

imputed/edited values satisfy edit rules

Example

If avg. salary > $ 1M, need to further review

avg. salary = total salary / total employees

F-H changes either variable, but not change both variables

Edited/Imputed Values Under F-H Approach

Case 1. assume the observed value failing edit rules

Case 1. assume the observed value failing edit rulesCase 1. no option but changing the value of X1

Case 1. assume the observed value failing edit rulesCase 1. no option but changing the value of X1Case 1. can draw imputations from high density region

Case 1. assume the observed value failing edit rulesCase 1. no option but changing the value of X1Case 1. can draw imputations from high density regionCase 2. no option but changing the value of X2

Case 1. assume the observed value failing edit rulesCase 1. no option but changing the value of X1Case 1. can draw imputations from high density regionCase 2. no option but changing the value of X2Case 2. can draw imputations from high density region

Case 1. assume the observed value failing edit rulesCase 1. no option but changing the value of X1Case 1. can draw imputations from high density regionCase 2. no option but changing the value of X2Case 2. can draw imputations from high density regionCase 3. both options available: changing X1 or X2

Case 1. assume the observed value failing edit rulesCase 1. no option but changing the value of X1Case 1. can draw imputations from high density regionCase 2. no option but changing the value of X2Case 2. can draw imputations from high density regionCase 3. both options available: changing X1 or X2Case 3. draw imputed values from tails of distribution

Bayesian Data Editing (NISS report 189)

Nonparametric Bayesian Model

use Dirichlet process (DP) mixture normals

restrict support under constrained regions

balance edits as well as ratio edits

utilize latent indicator to stochastically find the location of error

Multiple Imputation Approach

measure uncertainty introduced by imputation process and data editing process

Simulation StudyReported values with errors

Generate simulated reported values with introduced errors

True simulated values

Result with Bayesian EditingEdited values by BE

Bayes. Editing successfully estimates the distribution of simulated true values

Result with Fellegi-HoltEdited values by FH

F-H approach results in some edited values at tails of the distribution of true values

Simulation Study: Comparison of Pairwise Correlations from Edited Data

Application: U.S. Census of Manufactures

Economic census for manufacturing industries, conducted by the U.S. Census Bureau every five years

variables: cost of materials, total emp., total value of shipments

widely used by researchers, e.g., interested in plant-level productivity

Current editing practice

F-H based automatic editing system

using ratio edits and balance edits

additional (separated) manual editing processes

We compare three editing approaches with pairwise correlation

BE: Bayesian Editing

FH: Fellegi-Holt based editing

FH & manual: Final edited data produced by the Census Bureau

2007 Census of Manufactures: Comparison of Pairwise Correlations from Edited Data

Topic IIISynthetic microdata for

the U.S. Census of Manufactures

Integration of Imputation, Editing and Synthesizer (in progress)

Two-stage Multiple Imputation Approach

1. impute/edit survey data X given a size measure z

• resulting in m copies of edited/imputed datasets, X(1), … ,X(m)

2. generate synthetic data given X(l) and z

• resulting in r synthetic file X1(l), … ,Xr

(l) for l=1,…m

Inferences

based on mr complete-data analyses and combining rules

Compared to current practices (with separate steps)

correctly estimate variance of final synthetic data

all benefits enjoyed by Bayesian editing/imputation

Concluding Remarks

We proposed a Bayesian framework to integrate the currently-separated processes: imputation, data editing, and disclosure limitation

A future research topic is simultaneous data editing/imputation methods for mixed type data, such as American Community Survey

R package for Bayesian editing/imputation of continuous variables will be published on CRAN soon

Technical reports are available at NISS websites (http://www.niss.org/publications/technical-reports)

– Hang J. Kim (hangkim@niss.org)

“Thank you”

Editing, Imputation, and Synthesis: A Public Use File for ... values and imputation process. ......

Documents

Transcript of Editing, Imputation, and Synthesis: A Public Use File for ... values and imputation process. ......

new features of vim - visualization and imputation of missing values

Presenter Information - Michigan SAS Users Group Home Page · Handling Missing Data, continued Multiple Imputation is 3 step process: 1. Impute missing data using PROC MI with appropriate

Characterization of missing values in untargeted MS-based ... · for imputation, including multiple imputation by chained equations (MICE) (van Buuren 2007) and K-nearest neigh-bors

Handling missing data in Stata: Imputation and likelihood ... · replaces missing values with multiple sets of simulated values to complete the data—imputation step applies standard

Open Access Research Using decision trees to …data are missing completely at random (MCAR).1–3 Although methods such as multiple imputation could be used to impute the missing

Analyzing Data with Missing Values in Spotfire S+ 8 · 2013-03-13 · Spotfire S+ to support a model–based approach to multiple imputation. Multiple imputation can be viewed as

Exploring Commodity Trading Activity: An Integrated … uses CIT data on agricultural positions to impute unobserved index positions in crude oil. However, imputation may provide a

Multiple Imputation of Right-Censored Wages in the … · Multiple Imputation of ... missing data problem and use multiple imputation approaches to impute the ... especially when

PARAMETRIC FRACTIONAL IMPUTATION FOR ... No.05 291 PARAMETRIC...336 PARAMETRIC FRACTIONAL IMPUTATION FOR LONGITUDINAL DATA WITH INTERMITTENT MISSING VALUES covariance matrix 𝑉 The

2007 PSID Income and Wage Imputation Methodology · In 2007, the variables in the 2007 PSID (prior to imputation) are ER36928/ER37299. The methodology used to impute wage and salary

Imputation of Missing Values using Random Forest

INTERMITTENT EXOTROPIA STUDY 5 (IXT5) A …118 impute 12-month data using multiple imputation for subjects who are prescribed IXT treatment 119 other than overminus or nonoverminus

IMPUTATION FOR MISSING STREAM FLOW USING THE …-To resolve these problems, data imputation can be used, which replaces incorrect and missing values in the dataset with probable ones.-This

Imputation of assay activity data using deep learninggjc29/Talks/Basel2019.pdf · 2019-01-25 · Impute values from sparse data Broadly applicable with ... Validate imputation of

An Empirical Comparison of Multiple Imputation …Multiple imputation is a common approach for dealing with missing values in statistical databases. The imputer ﬁlls in missing values

Multiple Imputation of Multilevel Data · modules to impute missing data (e.g., SAS PROC MI and the new Multiple Imputation procedure in SPSS V17.0). These approaches generally ignore

Nonparametric imputation by data depth · 2018-08-08 · missing values arbitrary, using unconditional mean imputation; (2) impute missing values in one variable by the values predicted

Imputation of truncated p-values for meta-analysis methods and its genomic application · 2015. 1. 20. · IMPUTATION IN MICROARRAY META-ANALYSIS 3 expression evidence across studies.

pattern. By default, mi impute chained checks whether imputation variables have a monotone missing-data pattern and, if they do, imputes them using the monotone method (without iteration).

Missing Data in Longitudinal Studies: To Impute or not to Impute? · 2016. 5. 23. · 7. Lee KJ, Carlin JB. Multiple Imputation for Missing Data: Fully Conditional Specification Versus