European Conference on Quality in Official Statistics, Rome, July 2008

Community Innovation Survey: a Flexible Approach to the Dissemination of Microdata Files for Research

Daniela Ichim

Outline

• Dissemination of Microdata Files for Research
• Risk assessment
• Disclosure limitation
• Data quality
  – Record linkage
  – Data utility

Confidentiality against Dissemination

Find the right balance!

Disclosure scenarios

Community Innovation Survey

• IDENTIFYING VARIABLES (STRUCTURAL VARIABLES)
  – NACE
  – NUTS
  – Size
  – Turnover (TURN)

• CONFIDENTIAL VARIABLES (VARIABLES INVOLVED IN ANALYSES)
  – Expenditures in innovation (RTOT, …)
  – Number of patents, …

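Confounding

[Diagram: each unit $t = (t_c, t_n)$ has a categorical part $t_c$ and a numerical part $t_n$; units are classified as safe or unsafe according to k-anonymity.]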
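The k-anonymity check on the categorical identifying variables can be sketched as follows. This is a minimal illustration, not the survey's actual procedure: the function name `unsafe_units`, the toy records and the value `k=3` are all assumptions made for the example.

```python
from collections import Counter

def unsafe_units(records, key_vars, k):
    """Return the indices of records whose combination of key
    variables is shared by fewer than k records (k-anonymity fails)."""
    keys = [tuple(r[v] for v in key_vars) for r in records]
    counts = Counter(keys)
    return [i for i, key in enumerate(keys) if counts[key] < k]

# Hypothetical toy data: three similar enterprises and one isolated one.
records = [
    {"nace": "A", "nuts": "ITC", "size": "small"},
    {"nace": "A", "nuts": "ITC", "size": "small"},
    {"nace": "A", "nuts": "ITC", "size": "small"},
    {"nace": "B", "nuts": "ITF", "size": "large"},
]
unsafe = unsafe_units(records, ["nace", "nuts", "size"], k=3)
print(unsafe)
```

Only the last record is flagged, because its key combination is unique in the toy file.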

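General risk function

a) Given a threshold $M^*$ (on units)
b) Local Outlier Factor (LOF) as a measure of the difference in density between a unit and its nearest neighbours

Distance between $t_1$ and $t_2$:

[Formula: $d(t_1, t_2)$, defined from a term $e$ on the numerical parts $(t_{1n}, t_{2n})$ and a term $I$ on the categorical parts $(t_{1c}, t_{2c})$, with values in $\{0, 1\}$.]

Density around $t$:

$$\mathrm{LOF}_{M^*}(t) = \frac{1}{|N_{M^*}(t)|} \sum_{t' \in N_{M^*}(t)} \frac{\mathrm{LRD}_{M^*}(t')}{\mathrm{LRD}_{M^*}(t)}$$

where $N_{M^*}(t)$ is the set of the $M^*$ nearest neighbours of $t$ and $\mathrm{LRD}_{M^*}$ is the local reachability density.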
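The Local Outlier Factor can be illustrated with a small pure-Python sketch. It follows the standard LOF definition and makes several simplifying assumptions that are not taken from the slides: a single stratum, one numerical variable, absolute difference as the distance, and a hypothetical cut-off of 2.0. The names `lof_scores`, `turnover` and `at_risk` are illustrative only.

```python
def lof_scores(values, m):
    """Standard Local Outlier Factor for 1-D values, using the m
    nearest neighbours of each unit (absolute difference as distance)."""
    n = len(values)
    dist = [[abs(a - b) for b in values] for a in values]

    # m-distance and neighbourhood of each unit (ties kept, unit itself excluded)
    kdist, neigh = [], []
    for i in range(n):
        others = sorted((dist[i][j], j) for j in range(n) if j != i)
        kd = others[m - 1][0]
        kdist.append(kd)
        neigh.append([j for d, j in others if d <= kd])

    # local reachability density: inverse of the mean reachability distance
    lrd = []
    for i in range(n):
        reach = [max(kdist[j], dist[i][j]) for j in neigh[i]]
        mean_reach = sum(reach) / len(reach)
        # degenerate case (many duplicated values): density treated as infinite
        lrd.append(float("inf") if mean_reach == 0 else 1.0 / mean_reach)

    # LOF: mean ratio between the neighbours' density and the unit's own density
    return [sum(lrd[j] for j in neigh[i]) / (len(neigh[i]) * lrd[i])
            for i in range(n)]

# Hypothetical turnover values in one stratum; the last unit is isolated.
turnover = [10, 11, 12, 13, 14, 100]
scores = lof_scores(turnover, m=2)
at_risk = [i for i, s in enumerate(scores) if s > 2.0]  # assumed cut-off
print(at_risk)
```

Units in dense regions get scores close to 1, while the isolated unit gets a much larger score and is the only one flagged.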

Parameters

• Threshold $M^*$ - dissemination policy
• Cut-off point for density (LOF)
  – quantiles
  – automatic
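A quantile-based cut-off on the LOF scores might look like the sketch below. The 90th percentile, the toy scores and the name `flag_above_percentile` are assumptions chosen for illustration; the slides do not state which quantile was used.

```python
import statistics

def flag_above_percentile(scores, percent):
    """Return the cut-off (the given percentile of the scores) and the
    indices of the units whose score lies strictly above it."""
    cuts = statistics.quantiles(scores, n=100, method="inclusive")
    cutoff = cuts[percent - 1]  # cuts[0] is the 1st percentile
    flagged = [i for i, s in enumerate(scores) if s > cutoff]
    return cutoff, flagged

# Hypothetical LOF scores: nine ordinary units and one outlying unit.
lof = [1.0] * 9 + [5.0]
cutoff, flagged = flag_above_percentile(lof, percent=90)
print(cutoff, flagged)
```

Raising or lowering the percentile is one way to express a stricter or looser dissemination policy.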

Stratification variables

[Figure: TURN, analysis by NACE. Panels: NACE A; all NACE.]

Disclosure limitation

MFR Selective masking

k-anonymity Nearest neighbour

Micro-aggregation on tails
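Micro-aggregation on a tail can be sketched as replacing the largest values with their common mean. This is a simplified illustration under stated assumptions: a single group of size `k` on the upper tail, unweighted values, and hypothetical names (`microaggregate_upper_tail`, `turnover`). Keeping weighted totals unchanged would require a weighted mean instead.

```python
def microaggregate_upper_tail(values, k):
    """Replace the k largest values with their mean, so that those
    units become indistinguishable while the total is preserved."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    tail = order[:k]
    mean = sum(values[i] for i in tail) / k
    masked = list(values)
    for i in tail:
        masked[i] = mean
    return masked

# Hypothetical turnover values; the three largest are masked.
turnover = [120.0, 95.0, 3000.0, 110.0, 2500.0, 3500.0]
masked = microaggregate_upper_tail(turnover, k=3)
print(masked)
```

Only the units on the tail are modified, in line with the idea of masking selectively.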

Quality assessment

Dissemination

Confidentiality

Risk measure assessment

Quality of the external database

[Diagram: record linkage between D and E, with the Chambers of Commerce database as the external database.]

Record linkage

M*=3

| | 1 equal unit within 10% | less than 3 units within 10% | less than 3 units within 20% | less than 3 units within 30% |
|---|---|---|---|---|
| NACE | 88% | 84% | 97% | 100% |
| NACEEMP | 63% | 60% (a) | 74% (a) | 87% (a) |

M*=5

| | 1 equal unit within 10% | less than 5 units within 10% | less than 5 units within 20% | less than 5 units within 30% |
|---|---|---|---|---|
| NACE | 88% | 73% | 87% | 96% |
| NACEEMP | 63% | 58% (a) | 70% (a) | 80% (a) |

(a) 100% for enterprises with more than 250 employees
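The "fewer than M* units within p%" criterion can be illustrated with a small candidate count against an external register. The sketch assumes that the percentage refers to the relative difference in turnover and that NACE is the blocking variable; neither is stated explicitly in the slides, and the names `candidates_within`, `unit` and `register` are hypothetical.

```python
def candidates_within(unit, register, pct, block_vars=("nace",)):
    """Return the register entries that agree with the unit on the
    blocking variables and whose turnover lies within +/- pct of it."""
    lo = unit["turn"] * (1 - pct)
    hi = unit["turn"] * (1 + pct)
    return [r for r in register
            if all(r[v] == unit[v] for v in block_vars)
            and lo <= r["turn"] <= hi]

# Hypothetical released unit and external register.
unit = {"nace": "A", "turn": 1000.0}
register = [
    {"nace": "A", "turn": 950.0},
    {"nace": "A", "turn": 1080.0},
    {"nace": "A", "turn": 1500.0},
    {"nace": "B", "turn": 1000.0},
]
matches = candidates_within(unit, register, pct=0.10)
fewer_than_m_star = len(matches) < 3  # M* = 3
print(len(matches), fewer_than_m_star)
```

Widening the percentage or dropping blocking variables increases the number of candidates.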

Information content analysis

Information preservation

• Selective masking
  – Data utility
  – Only identifying and confidential variables were modified.
  – Only records at risk were modified.
• The weights were not modified.
  – weighted totals (coherence with the already published information)

Some statistical indicators were slightly modified: variances

Information content analysis: Data utility

Assessment of the perturbation impact on ratios like RTOT/TURN

[Figure panels: Original; Selective masking; Individual ranking]
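One simple way to quantify the perturbation impact on a ratio such as RTOT/TURN is the mean absolute relative change of the unit-level ratios. The measure, the toy values and the function names below are assumptions for illustration; the slides compare the distributions graphically.

```python
def ratios(rtot, turn):
    """Unit-level ratios RTOT/TURN."""
    return [r / t for r, t in zip(rtot, turn)]

def mean_abs_rel_change(original, masked):
    """Mean absolute relative change between original and masked ratios."""
    return sum(abs(m - o) / abs(o)
               for o, m in zip(original, masked)) / len(original)

# Hypothetical values: only the last (at-risk) unit was masked.
rtot_orig, turn_orig = [10.0, 20.0, 300.0], [100.0, 200.0, 1000.0]
rtot_mask, turn_mask = [10.0, 20.0, 270.0], [100.0, 200.0, 1000.0]

impact = mean_abs_rel_change(ratios(rtot_orig, turn_orig),
                             ratios(rtot_mask, turn_mask))
print(round(impact, 4))
```

Because only records at risk are changed, the ratios of the other units contribute nothing to the measure.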

Conclusions

1. Confidentiality: Risk measure based on the k-anonymity principle

Flexible
a) continuous and categorical variables
b) easy to implement
c) consistent for extreme choices

2. Data utility: Selective protection to achieve k-anonymity

3. Comparable dissemination: Control both risk of re-identification and information loss

QUALITY DIMENSIONS