Use of Machine Learning Methods to Impute Categorical Data

UNECE CONFERENCE OF EUROPEAN STATISTICIANS
Work Session on Statistical Data Editing, 24-26 September 2012
Pilar Rey del Castillo*, EUROSTAT, Unit B1: Quality, Research and Methodology




Problem

Non-response in statistical surveys and missing information in machine learning: different approaches and different evaluation criteria

Aim: show that the commitment to the almost exclusive use of probabilistic data models prevents statisticians from using the most convenient technologies

Case of categorical variables: the practical recommendations from the statistical approach just reuse procedures designed for numeric variables

Outline of the presentation

1. Review of non-response treatments, imputation procedures and evaluation criteria

2. Recommendations for categorical data imputation from the statistical community: why these are not appropriate

3. Results of comparisons with two machine learning methods

4. Final remarks

Non-response treatments

• Deletion procedures: using only the units with complete data for further analysis

• Tolerance procedures: handling missing values internally, without removing incomplete records or completing them

• Imputation procedures: replacing each missing value by an estimate

Imputation procedures

• Algorithmic methods: use an algorithm to produce the imputations (cold-deck and hot-deck, nearest-neighbour, mean, machine learning classification & prediction techniques…)

• Model-based methods: the predictive distributions have a formal statistical model; state of the art: multiple imputation (MI)
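As an illustration of an algorithmic method, the following is a minimal sketch of random hot-deck imputation for one categorical variable. The function name `hot_deck_impute`, the variable names and the toy records are hypothetical and not taken from the presentation; missing values are assumed to be coded as `None`.

```python
import random
from collections import defaultdict

def hot_deck_impute(records, target, class_vars, seed=0):
    """Fill missing `target` values (None) with the value of a random donor
    that has the same values on `class_vars` (the imputation class)."""
    rng = random.Random(seed)
    donors = defaultdict(list)
    for rec in records:
        if rec[target] is not None:
            key = tuple(rec[v] for v in class_vars)
            donors[key].append(rec[target])
    # Fallback pool for recipients whose imputation class has no donor.
    all_donors = [rec[target] for rec in records if rec[target] is not None]

    completed = []
    for rec in records:
        rec = dict(rec)  # copy, so the input records are left untouched
        if rec[target] is None:
            key = tuple(rec[v] for v in class_vars)
            pool = donors.get(key) or all_donors
            rec[target] = rng.choice(pool)
        completed.append(rec)
    return completed

# Hypothetical toy survey: "intention" is missing for three respondents.
survey = [
    {"region": "north", "past_vote": "A", "intention": "A"},
    {"region": "north", "past_vote": "A", "intention": None},
    {"region": "south", "past_vote": "B", "intention": "B"},
    {"region": "south", "past_vote": "B", "intention": None},
    {"region": "south", "past_vote": "C", "intention": None},
]
completed = hot_deck_impute(survey, "intention", ["region", "past_vote"])
```

The last record has no donor in its own class, so the sketch falls back to the pool of all observed values; a production procedure would more likely collapse imputation classes in a controlled way.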

Criteria for evaluating the imputation results

• Statistical surveys: valid & efficient inferences, with the missing-data treatment regarded as part of the overall procedure

"… Judging the quality of missing data procedures by their ability to recreate the individual missing values (according to hit-rate, mean square error, etc.) does not lead to choosing procedures that result in valid inference, which is our objective" (Rubin, 1996)

• Machine learning: general artificial intelligence framework (empirical results obtained by simulating missing data and measuring the closeness between real & imputed values)
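The tension between the two criteria can be sketched with a small worked example. It assumes a hypothetical three-category variable and compares two simple imputation rules, neither of which is a method evaluated in the presentation: always imputing the mode, and drawing the imputed value at random from the observed distribution. The names `expected_hit_rate` and `total_variation` are illustrative.

```python
def expected_hit_rate(true_dist, impute_dist):
    """Expected share of correct imputations when the imputed category is
    drawn from `impute_dist` independently of the true category."""
    return sum(p * impute_dist.get(cat, 0.0) for cat, p in true_dist.items())

def total_variation(p, q):
    """Total variation distance between two categorical distributions."""
    cats = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in cats)

# Hypothetical distribution of the true values among non-respondents.
true_dist = {"A": 0.6, "B": 0.3, "C": 0.1}

mode_imputation = {"A": 1.0}   # always impute the most frequent category
random_draw = dict(true_dist)  # draw from the observed distribution

for name, dist in [("mode", mode_imputation), ("random draw", random_draw)]:
    hit = expected_hit_rate(true_dist, dist)
    distortion = total_variation(true_dist, dist)
    print(f"{name}: expected hit rate {hit:.2f}, distortion {distortion:.2f}")
```

Under these assumptions mode imputation wins on hit rate but puts every imputed value in one category, while the random draw has a lower hit rate and leaves the distribution of the imputed values undistorted. This is one way to read the objection quoted from Rubin (1996).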

Categorical data imputation in statistical surveys

State of the art: MI or other model-based methods

• Log-linear model: not always possible

• Logistic regression models: sometimes problems at the estimation step

• Binary case: Rubin & Schenker (1986), Schafer (1997): approximate by using a Gaussian distribution

• Non-binary case: Yucel & Zaslavsky (2003), Van Ginkel et al. (2007): rounding a multivariate normal distribution

• Criticisms from the practical perspective (Horton et al. (2003), Ake (2005), Allison (2006), Demirtas (2008))

• Contradiction: the theoretical framework focuses on model adequacy, while the practical recommendations rely on models that are clearly not adequate
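Why rounding a Gaussian approximation can be problematic may be seen from a deliberately simplified calculation. It is not the analysis of any of the cited papers: it assumes a single 0/1 variable imputed from a normal distribution with matching mean and variance and rounded at 0.5. The helper `rounded_share` is a hypothetical name.

```python
from math import sqrt
from statistics import NormalDist

def rounded_share(p, cutoff=0.5):
    """Share of imputed values rounded to 1 when a 0/1 variable with P(1) = p
    is imputed from a normal distribution with mean p and variance p(1 - p),
    then rounded to 1 if the draw exceeds `cutoff`."""
    approx = NormalDist(mu=p, sigma=sqrt(p * (1 - p)))
    return 1 - approx.cdf(cutoff)

for p in (0.02, 0.10, 0.30, 0.50):
    print(f"true share {p:.2f} -> share after rounding {rounded_share(p):.4f}")
```

In this simplified setting the share after rounding matches the true share only at p = 0.5 and falls well below it for rare categories, which is the kind of distortion the practical criticisms point to.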

Problem of categorical data imputation to be solved

• Survey microdata file: opinion poll (no. 2750 in the CIS catalogue)

‒ Quantitative variables (8): ideological self-location; rating of three specific political figures; likelihood to vote; likelihood to vote for three specific political parties…

‒ Ordered categorical variables (2): government and opposition party ratings (converted to quantitative)

‒ Categorical variables with non-ordered categories (7): voting intention; voting memory; the autonomous community; the political party the respondent would prefer to see win…

• Voting intention to be imputed: 11 categories (biggest political parties, "blank vote", "abstention", "others")

• 13,280 interviews with no missing values

Imputation methods to be compared

• MI logistic regression

• Classifiers (matching each class with one of the Voting intention categories)

‒ Fuzzy min-max neural network classifier recently extended to deal with mixed numeric & categorical data as inputs (Rey del Castillo & Cardeñosa, 2012)

‒ Bayesian network classifier: not a Naïve Bayes classifier but a more complex architecture learnt with a score + search paradigm

Comparison criterion

• The classical survey inference criterion cannot be applied because there are no models

• EUREDIT project: Wald statistic for categorical variables, but none of the methods passes the proposed test!

• The correctly imputed rate is used (ten-fold cross-validation)
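The correctly imputed rate under ten-fold cross-validation could be computed along the following lines. This is a sketch under stated assumptions: the lookup classifier `fit_lookup` is a hypothetical stand-in (most frequent target value per predictor value), not the fuzzy min-max or Bayesian network classifier, and the toy data and variable names are invented for the example.

```python
from collections import Counter, defaultdict

def fit_lookup(train, target, predictor):
    """Stand-in classifier: predicts the most frequent target value seen for
    each predictor value, falling back to the overall mode."""
    by_value = defaultdict(Counter)
    overall = Counter()
    for rec in train:
        by_value[rec[predictor]][rec[target]] += 1
        overall[rec[target]] += 1
    fallback = overall.most_common(1)[0][0]
    table = {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}
    return lambda rec: table.get(rec[predictor], fallback)

def correctly_imputed_rate(records, target, predictor, k=10):
    """k-fold cross-validation: treat the target as missing in each fold in
    turn, impute it with a classifier trained on the remaining folds, and
    return the share of imputed values equal to the true ones."""
    hits = 0
    for fold in range(k):
        test = records[fold::k]  # every k-th record, starting at `fold`
        train = [rec for i, rec in enumerate(records) if i % k != fold]
        predict = fit_lookup(train, target, predictor)
        hits += sum(predict(rec) == rec[target] for rec in test)
    return hits / len(records)

# Toy complete-case data: intention equals past vote except in every fifth record.
parties = ["A", "B", "C"]
data = []
for i in range(300):
    past = parties[i % 3]
    intention = parties[(i + 1) % 3] if i % 5 == 0 else past
    data.append({"past_vote": past, "intention": intention})

rate = correctly_imputed_rate(data, "intention", "past_vote")
print(f"correctly imputed rate: {rate:.1%}")
```

Folds are assigned by position to keep the sketch deterministic; with real survey data the records would normally be shuffled (or stratified) before splitting.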

Results of the comparison

Imputation method                          Correctly imputed rate (%)

MI logistic regression                     66.0

Fuzzy min-max neural network classifier    86.1

Bayesian network classifier                87.4

Conclusions & final remarks

1. The differences between the machine learning methods and MI logistic regression are always similar

2. Simplest case, with missing data exclusively on one variable

3. Extensible to numeric variables?

4. Machine learning procedures are easier to automate

• No dependence on model assumptions

• Do not break down with a large number of variables?

• More robust to outliers?

5. Machine learning may be used for massive imputation tasks

Thank you!

References (1)

• Ake, C. F. (2005), Rounding After Multiple Imputation with Non-Binary Categorical Covariates, SAS Conference Proceedings: SAS User Group International 30, Philadelphia, PA, April 2005.

• Allison, P. (2006), Multiple Imputation of Categorical Variables under the Multivariate Normal Model, paper presented at the Annual Meeting of the American Sociological Association, Montreal Convention Center, Montreal, Quebec, Canada, August 2006.

• Demirtas, H. (2008), On Imputing Continuous Data When the Eventual Interest Pertains to Ordinalized Outcomes Via Threshold Concept, Computational Statistics and Data Analysis, vol. 52, pp. 2261-2271.

• Horton, N. J., Lipsitz, S. R. and Parzen, M. (2003), A Potential for Bias when Rounding in Multiple Imputation, The American Statistician, vol. 57, no. 4, pp. 229-232, November 2003.

• Rey-del-Castillo, P., and Cardeñosa, J. (2012), Fuzzy Min–Max Neural Networks for Categorical Data: Application to Missing Data Imputation, Neural Computing and Applications, vol. 21, no. 6, pp. 1349-1362, DOI 10.1007/s00521-011-0574-x, Springer-Verlag London.

• Rubin, D. B. (1996), Multiple Imputation After 18+ Years, Journal of the American Statistical Association, vol. 91, no. 434, Applications and Case Studies, June 1996.

References (2)

• Rubin, D. B. and Schenker, N. (1986), Multiple Imputation for Interval Estimation from Simple Random Samples with Ignorable Nonresponse, Journal of the American Statistical Association, vol. 81, no. 394, Survey Research Methods, June 1986.

• Schafer, J. L. and Graham, J. W. (2002), Missing Data: Our View of the State of the Art, Psychological Methods, vol. 7, no. 2, pp. 147-177.

• Van Ginkel, J. R., Van der Ark, L. A. and Sijtsma, K. (2007), Multiple Imputation of Item Scores when Test Data are Factorially Complex, British Journal of Mathematics and Statistical Psychology, vol. 60, pp. 315-337.

• Yucel, R. M. and Zaslavsky, A. M. (2003), Practical Suggestions on Rounding in Multiple Imputation, Proceedings of the Joint American Statistical Association Meeting, Section on Survey Research Methods, Toronto, Canada, August 2003.