MOLLA HUNEGNAW STATISTICIAN AFRICAN CENTRE FOR STATISTICS ECASTATS.UNECA.ORG Confidentiality and...

MOLLA HUNEGNAWSTATISTICIAN

AFRICAN CENTRE FOR STATISTICSECASTATS.UNECA.ORG

Confidentiality and Anonymization of

Microdata1

United Nations Regional Seminar on Census Data Archiving for Africa, Addis Ababa, 20-23 September 2011

Contents

MicrodataMicrodata ConfidentilityAnonymizationAnonymization techniquesAnonymization toolsConclusion

2


Microdata

The term microdata can refer to data about an individual, person, household, business or other entity.

Microdata may be data collected by surveys, censuses or obtained from administrative records.

Microdata are collected using valuable resources of a country.

Microdata are usually of significance yet access to countries’ microdata is often limited by the fear that confidentiality protection cannot be guaranteed.

Microdata must be used and reused.

3


Microdata

Microdata access obstacles: legal obstacles (in many countries the statistical

legislation still in use is out dated and does not recognise the dissemination of electronic data particularly microdata

technical and financial obstacles including in-house capacity to handle the complex aspects of microdata dissemination such as data anonymization,

political obstacles, and psychological obstacles - the tendency to control

access perhaps because of concerns over its mis-interpretation or because ‘data is power’.

4


Confidentiality

United Nations Regional Seminar on Census Data Archiving for Africa

5

Confidentiality refers to limiting data access and disclosure to authorized users and preventing access by or disclosure to unauthorized ones

Confidentiality is also related to the broader concept of data privacy -- limiting access to individuals’ personal information.

Confidentiality

The sixth United Nations Fundamental Principle of Official Statistics about confidentiality. “Individual data collected by statistical agencies for

statistical compilation, whether or not they refer to natural or legal persons, are to be strictly confidential and used exclusively for statistical purposes.”

6


Confidentiality

How to make sure confidentiality Legal arrangements

Legislation be in place before any microdata are released. Legally enforceable agreement may be one of the

requirements of access. It should be possible to set up an arrangement where a prior agreement needs to be signed even where access to microdata is available online.

Technical arrangements The risk of confidentiality can be reduced by the use of a

number of techniques. The downside is that these techniques may reduce the

usefulness of the underlying microdata.

7


Confidentiality

Technical arrangements Remote Access Facilities (RAFs)

The key characteristic is that users do not have access to the microdata itself but tasks using that microdata can be submitted remotely over the Internet.

Data laboratories /Safe sites Effective in controlling identification risk whilst enabling

users access, particularly for data sets where release of a confidentialised microdata file is not possible.

Lack of convenience to the user, including sometimes being forced to use unfamiliar data analysis software. They are also expensive for the NSO to manage.

Anonymization techniques Removing or modifying the identifying variables contained

in the dataset

8


Anonymization

Anonymization is process of removing or modifying the identifying variables contained in the microdata dataset.

Identifying variables are those describe characteristic of a person Direct identifiers, which are variables such as names,

addresses, or identity card numbers. Indirect identifiers, which are characteristics that may be

shared by several respondents, and whose combination could lead to the re-identification of one of them.

Anonymizing the data consists in: determining which variables are potential identifiers (this

relies on one's personal judgment), and in modifying the level of precision of these variables to reduce

the risk of re-identification to an acceptable level.

9


Anonymization techniques

Statistical disclosure limitation techniques can be mainly classified in two categories: data reduction: aim at increasing the number of individuals in

the sample sharing the same or similar identifying characteristics presented by the investigated statistical unit. Avoids the presence of unique or rare recognizable individuals.

data perturbation: Such methods achieve data protection from a twofold perspective. First, if the data are modified, re‑identification by means of record linkage or matching algorithms is harder and uncertain. Secondly, even when an intruder is able to re-identify a unit, he/she cannot be confident that the disclosed data are consistent with the original data.

Synthetic microdata: Synthetic microdata are an alternative approach to data protection, and are produced by using data simulation algorithms.

10



Data Reduction Removing variables: The first obvious application of

this method is the removal of direct identifiers from the datasets (race, religion, HIV, etc.)

Removing records: Removing records can be adopted as an extreme measure of data protection when the unit is identifiable in spite of the application of other protection techniques.

Global recoding: The global recoding method consists in aggregating the values observed in a variable into pre-defined classes (for example, recoding the age into five-year age groups, or the number of employees in three-size classes: small, medium and large).

11



Data reduction Top and bottom coding: A special case of

global recoding that can be applied to numerical or ordinal categorical variables. The variables "Salary" and "Age" are two typical examples. The highest values of these variables are usually very rare and therefore identifiable.

Local suppression: Local suppression consists in replacing the observed value of one or more variables in a certain record with a missing value.

12



Data perturbation Micro-aggregation: replace an

observed value with the average computed on a small group of units (small aggregate or micro-aggregate), including the investigated one. The units belonging to the same group will be represented in the released file by the same value.

Methods in micro-aggregation include individual ranking and multivariate micro‑aggregation.

The easiest way to group records before aggregating them is to sort the units according to their similarity and the values resulting from this criterion, and to aggregate consecutive units into fixed size groups.

13



Data perturbation Data swapping: a perturbation technique for categorical

microdata, and aimed at protecting tabulation stemming from the perturbed microdata file. Data swapping consists in altering a proportion of the records in a file by swapping values of a subset of variables between selected pairs of records (swap pairs).

Post-randomization (PRAM): induces uncertainty in the values of some variables by exchanging them according to a probabilistic mechanism. PRAM can therefore be considered as a randomized version of data swapping.

Adding noise: Adding noise consists in adding a random value ε, with zero mean and predefined variance σ2, to all values in the variable to be protected. Generally, methods based on adding noise are not considered very effective in terms of data protection.

Resampling: a protection method for numerical microdata that consists in drawing with replacement t samples of n values from the original data, sorting the sample and averaging the sampled values. Data protection level guaranteed by this procedure is generally considered quite low.

14



Synthetic Microdata An alternative approach to data protection, and are

produced by using data simulation algorithms. The rationale for this approach is that synthetic data do not pose problems with regard to statistical disclosure control because they do not contain real data but preserve certain statistical properties.

Generally, users are not keen to work with synthetic data as they cannot be confident of the results of their statistical analysis. Nevertheless, this approach can also help to producing “test microdata set.” In this case, synthetic data files would be released to allow users to test their statistical procedures to successively access “true” microdata in a safe site.

15


Anonymization tools

Several computer software have been developed. The broadly used software is µ-Argus.

µ-Argus was created by Statistics Netherlands and further developed. Many of the statistical disclosure methods described above can be applied by µ-Argus. The program is interactive and goes through several steps such as the identification of the dataset (metadata); the selection and computation of frequency tables on which several statistical Disclosure Control methods are based; the selection of possible recoding and inspection of the results (global recoding); the selection and application of other protection methods and finally of the anonymisation microdata file including the production of a data process report.

IHSN has also developed tools and guidelines on microdata anonymization. These tools are developed as Stata, SPSS and/or SAS programs.

A free software for data swapping, the Data Swapping Toolkit (DSTK), is also available. The DSTK is a comprehensive software package for performing and analyzing data swapping on categorical data.

Several other software are also used in specialized areas like health, retailers, ecommerce, etc.

16


Conclusion

There is always some chance of identification, even if very small.

Software now exists which can estimate the proportion of records which are unique and therefore at some risk of identification.

The decision lies in the NSO to release microdata file, whether it an anonymised microdata file, through a remote access facility, or through a data laboratory or via electronic media. the risk of identification is sufficiently small; the adjustments made to the data items have not damaged the

microdata file for research purposes; and the variables that have been collapsed are the most appropriate,

taking into account both the needs of researchers and the identification risk.

Supporting legislation is the key to protect confidentiality and access to microdata is the key

17


References18

1. Anonymized Microdata Files, Eurostat2. International Household Survey Network3. Joint UNECE/Eurostat Work Session on Statistical

Data Confidentiality, October 2011, UNECE4. IPUMS-International Statistical Disclosure Controls:

159 Census Microdata Samples in Dissemination5. Managing Statistical Confidentiality & Microdata

Access: Principles and Guidelines of Good Practice, United Nations Economic Commission For Europe

6. Meeting of the African Association of Statistical Data Archivists, AASDA

7. United Nations Fundamental Principles of Official Statistics, UNSD

8. Data Privacy Through Optimal k-Anonymization


Thank you

19


MOLLA HUNEGNAW STATISTICIAN AFRICAN CENTRE FOR STATISTICS ECASTATS.UNECA.ORG Confidentiality and...

Documents

Transcript of MOLLA HUNEGNAW STATISTICIAN AFRICAN CENTRE FOR STATISTICS ECASTATS.UNECA.ORG Confidentiality and...