Pulling the wool over users' eyes: Why is a German Research Data Center interested in Synthetic Data?

NSF-Census-IRS Workshop on Synthetic Data and Confidentiality Protection 2009, Washington, D.C.

31 July 2009

Stefan Bender (Institute for Employment Research, Germany)




Overview

A very short History of German Data Access

The Portfolio Approach to Confidentiality Protection (Lane 2007)

The RDC of the BA in the IAB

Imputed data sets for research

Conclusions/Future Work


Data Access in Germany: The 1980s

Law for the census (1987): privilege for research

Federal Statistics Law: from absolute to factual anonymity

First scientific use files were published


Data Access in Germany: The 1990s

Constant pressure from the scientific community: "Set the data free"

The "Commission to Improve the Statistical Infrastructure in Cooperation with the Scientific Community and Official Statistics" (KVI 1999, report 2001), established by the Federal Ministry of Education and Research (BMBF).

German Council for Social and Economic Data (RatSWD). The Council's main purpose is to advise on the development of the German data infrastructure for empirical research in the social and economic sciences.

Establishment of Research Data Centers (RDC) by data producers and of Data Service Centers (DSC). At the beginning, all were co-financed by the BMBF.


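Development in Germany

[Figure: timeline of data access in Germany on a scale from "No access" to "Free access (web)", distinguishing on-site and off-site access. Marked years: 1980, 1995, 2001. Labels: SUF; remote access of the establishment panel; PUF/SUF; "remote access"; RDC.]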


RDCs provide researchers with access to micro data for non-commercial empirical research in the fields of social security and employment

Advisory service on data selection and data access

Handling of remote execution

Assistance and support for visiting researchers

Online data documentation and documentation of methodological aspects of data

Clarification of questions on data protection

Updates of scientific use files and other research datasets

Organization of workshops and user conferences


The Portfolio Approach to Confidentiality Protection (Lane 2007)

How should data be protected at the disseminating institution?

Technological Protection

Statistical Protection

Operational Protection

Legal Protection

(RDC in RDC approach, Remote Data Access)


Statistical Protection

Replacing all personal/organizational identifiers

Drawing samples

Standardised microdata (no individual data solutions)

Generating scientific use files (deleting variables, aggregation, multiple imputation)

Disclosure limitation review

Different kinds of data access


RDC of the BA in the IAB

Started in April 2004; positive evaluation in April 2006; since December 2006 100% financed by the Federal Employment Agency

6 researchers, 3 non-researchers, office in Nuremberg

Data based on:

- the notification process of the social security system
- the internal procedures of the Federal Employment Agency
- data from IAB surveys (e.g. IAB Establishment Panel)

Available data: IAB Establishment Panel, Establishment History Panel, IAB Employment Sample, BA Employment Panel, Integrated Employment Biographies Sample of the IAB, Linked Employer-Employee Data of the IAB


Data Access

Restriction: access for non-commercial empirical research in the fields of social security and employment

On-Site Use

Remote Data Access

Scientific Use Files

No costs

Financial support for guest researchers from abroad


Number of remote data accesses and on-site uses

[Figure: bar chart by year (2005, 2006, 2007, 2008) with two series, Remote Data Access and On-site Use; y-axis from 0 to 1600. Data labels as extracted: 677, 1390, 2271, 256, 1328, 359, 133.]


Number of users outside Germany (contractual partners, guests)

[Figure: bar chart of contractual partners from abroad and on-site uses of guest researchers from abroad, each for 2006, 2007 and 2008; y-axis from 0 to 120. Data labels as extracted: 21, 34, 105, 29, 22.]


Data Sources and Paths

[Figure: flow diagram with rows labelled Source, BA/IAB Data and FDZ Data (* combined data). Boxes: Process-generated data; Social security notifications; Surveys; Data Warehouse; Employment statistics; Participants in measures; Benefit recipient history; Application pool; Employment history; Employee and benefit recipient history; IAB Employment Samples; BA Employment Panel; Integrated Employment Biographies Sample; IAB Establishment Panel; Establishment History Panel; Linked Employer/Employee Data; Life Situation and Social Security 2005; Panel Study 'Labour Market and Social Security'.]


Data Access at the RDC

Types of data access (table columns): On-site Use, Remote Data Access, Scientific Use File

Datasets (table rows; surviving cell marks in parentheses):

- IAB Employment Samples
- BA Employment Panel
- Integrated Employment Biographies Sample
- IAB Establishment Panel (X)
- Linked Employer/Employee Data (?)
- Establishment History Panel (?)
- Cross-sectional survey 'Life Situation and Social Security 2005' (LSS 2005)
- Panel Study 'Labour Market and Social Security' (PASS)

We have only test data for the Establishment Data.


Anonymising business micro data (FAWE) – Motivation I

Public release of business data is often considered too risky

- Skewed distributions make identification of single units easy
- Number of units in total population is small
- Information on businesses is in the public domain
- High benefits from identifying a single unit
- High probability of inclusion for large establishments

Standard perturbation methods have to be applied on a high level

Release of high quality data is very difficult


FAWE – Motivation II

"Democratic" access to micro data (RatSWD guidelines for RDCs).

Research on anonymization techniques.

Need to have scientific use files for establishment/firm data.

"European" movement towards better data access (ESSnet projects).


RDC-Motivation to join the Project

Yearly conducted establishment survey (IAB Establishment Panel)

Strong demand for access from external researchers

Only on-site and remote access possible so far, only structural test data.

High costs in terms of time and money.

Project goal: Generate synthetic datasets of the survey for public release.

Project start: summer 2006

Number of people working on this project: 1 to 1.3


FAWE – Project Information

Former project (2001-2005) by the Federal Statistical Office Germany and the Statistical Offices of the Länder, the Institute for Applied Economic Research (IAW), and the Centre for European Economic Research (ZEW).

Result: Release of some cross-sectional firm data as scientific use files from the Statistical Offices.

New project with the Institute for Employment Research (IAB) to release panel data and data of the IAB.

Financed by the Federal Ministry of Education and Research (BMBF)


Results of our Partners

Compared estimation results between anonymized real data and original data (German Turnover Tax Statistics, German Structure of Costs Survey).

Anonymization techniques: information suppression, categorization, micro aggregation, and additive or multiplicative noise.

Micro aggregation and adding noise lead to biased estimates. For example, adding independent noise to the covariates leads to the errors-in-variables problem.

Corrections exist for some (mostly linear) estimators.
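
The errors-in-variables effect of additive noise can be illustrated with a small simulation. This is a minimal sketch under assumed values, not the partners' actual experiment: the variable names, the true slope of 2.0 and the noise variances are all hypothetical. When independent noise with variance s_u^2 is added to a covariate with variance s_x^2, the least-squares slope shrinks by roughly the factor s_x^2 / (s_x^2 + s_u^2).

```python
import random


def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


rng = random.Random(2009)   # fixed seed so the sketch is reproducible
n = 20000
true_slope = 2.0            # hypothetical value, for illustration only

# Original covariate (variance 1) and outcome
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [true_slope * xi + rng.gauss(0.0, 0.5) for xi in x]

# "Anonymized" covariate: independent additive noise with variance 1
x_noisy = [xi + rng.gauss(0.0, 1.0) for xi in x]

slope_original = ols_slope(x, y)
slope_anonymized = ols_slope(x_noisy, y)
attenuation = 1.0 / (1.0 + 1.0)   # var(x) / (var(x) + var(noise))

print(f"slope on original data:   {slope_original:.3f}")
print(f"slope on anonymized data: {slope_anonymized:.3f}")
print(f"theory predicts about:    {true_slope * attenuation:.3f}")
```

In this setup the slope estimated on the noisy covariate comes out at about half the true value, which is the kind of bias a correction for linear estimators has to undo.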


A fairy tale in which the hedgehog tricks the hare in a race, with his wife as a double already waiting at the finish.

Brothers Grimm: The Hare and the Hedgehog


Our Results

Results: Jörg's presentation

Real re-identification experiments:

• Data sets used: IAB Establishment Panel and a publicly available commercial database (AMADEUS).

• True matches between the two data sets were determined using business names and addresses.

• Probabilistic record linkage with three blocking variables and 13 matching variables (EM algorithm framework).

• Contrary to the literature, the re-identification risk of firms in our experimental setting is comparable to that of individuals.
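
A common way to set up probabilistic record linkage with EM is the Fellegi-Sunter model, in which EM estimates, for each matching variable, the probability of agreement among true matches (m) and among non-matches (u). Whether the experiment used exactly this model is an assumption. The sketch below runs on simulated agreement patterns, uses four hypothetical matching variables instead of 13, and skips blocking; all names and parameter values are illustrative.

```python
import math
import random


def em_match_probabilities(patterns, n_iter=50):
    """EM for a two-class mixture over binary agreement patterns.

    Returns (p, m, u): the share of matches among candidate pairs and the
    per-variable agreement probabilities for matches (m) and non-matches (u).
    Assumes the matching variables are conditionally independent.
    """
    n, k = len(patterns), len(patterns[0])
    p, m, u = 0.1, [0.9] * k, [0.1] * k   # starting values (assumed)
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a true match
        post = []
        for gamma in patterns:
            like_m, like_u = p, 1.0 - p
            for j, agrees in enumerate(gamma):
                like_m *= m[j] if agrees else 1.0 - m[j]
                like_u *= u[j] if agrees else 1.0 - u[j]
            post.append(like_m / (like_m + like_u))
        # M-step: re-estimate p, m and u from the posterior weights
        total = sum(post)
        p = total / n
        for j in range(k):
            agree_m = sum(w for w, g in zip(post, patterns) if g[j])
            agree_u = sum(1.0 - w for w, g in zip(post, patterns) if g[j])
            # Clamp away from 0 and 1 so the log weights stay finite
            m[j] = min(max(agree_m / total, 1e-6), 1.0 - 1e-6)
            u[j] = min(max(agree_u / (n - total), 1e-6), 1.0 - 1e-6)
    return p, m, u


def match_weight(gamma, m, u):
    """Log2 likelihood ratio of a pattern; high values point to a match."""
    return sum(
        math.log2(m[j] / u[j]) if agrees
        else math.log2((1.0 - m[j]) / (1.0 - u[j]))
        for j, agrees in enumerate(gamma)
    )


# Simulate candidate pairs with known (hypothetical) parameters
rng = random.Random(13)
true_p = 0.2
true_m = [0.95, 0.90, 0.85, 0.90]
true_u = [0.10, 0.05, 0.20, 0.10]
patterns = []
for _ in range(4000):
    probs = true_m if rng.random() < true_p else true_u
    patterns.append(tuple(rng.random() < q for q in probs))

p_hat, m_hat, u_hat = em_match_probabilities(patterns)
w_all_agree = match_weight((True,) * 4, m_hat, u_hat)
w_all_disagree = match_weight((False,) * 4, m_hat, u_hat)
print(f"estimated share of matches: {p_hat:.3f}")
print(f"weight if all agree: {w_all_agree:.2f}, if none agree: {w_all_disagree:.2f}")
```

Pairs whose weight exceeds an upper threshold would be declared links. In a re-identification experiment, those declared links can then be compared with the true matches to measure risk.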


Conclusions: Imputation

Synthetic datasets provide a high level of disclosure protection.

Partially synthetic datasets fulfill our needs for data protection.

Synthetic datasets offer a high level of data utility.

Future Work

First dataset almost ready for release.

Provide metadata for the user.

Weighting of data set.

Long-term goal: release complete longitudinal data.
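
Users of multiply imputed, partially synthetic data typically run the same analysis on each released dataset and then combine the results. A minimal sketch of the combining rules for partially synthetic data (Reiter 2003) follows. The function name and the example numbers are hypothetical, and nothing here reflects the actual IAB release.

```python
from statistics import mean, variance


def combine_partially_synthetic(estimates, variances):
    """Combine results from m partially synthetic datasets.

    estimates: one point estimate per synthetic dataset
    variances: the matching within-dataset variance estimates
    Returns (q_bar, total_variance), where total_variance = u_bar + b / m.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two datasets with matching variances")
    q_bar = mean(estimates)    # combined point estimate
    b = variance(estimates)    # between-dataset variance (divisor m - 1)
    u_bar = mean(variances)    # average within-dataset variance
    return q_bar, u_bar + b / m


# Hypothetical results from m = 3 synthetic datasets
q_bar, total_var = combine_partially_synthetic([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(q_bar, round(total_var, 4))
```

Unlike the rules for multiply imputed missing data, the between-dataset variance enters as b / m rather than (1 + 1/m) b, so standard multiple-imputation software may give the wrong variance here.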


Conclusion for the RDC

Synthetic data is a promising way for generating public use files for sensitive data like business data.

Generating synthetic datasets is a labour intensive task.

Faster release of imputed data?

Need to compare anonymization techniques (imputed data vs real data is only one dimension).

Need to convince users to use imputed data (also convince referees).

Trust in researchers' activities ("culture of confidentiality"). Raise researchers' knowledge of how their activities relate to data confidentiality (researcher training).

Imputation is just one dimension (RDC in RDC approach)


Very Short Summary


Contact: [email protected]

Web site of RDC: http://fdz.iab.de/en


The IAB Establishment Panel

Annually conducted establishment survey

Since 1993 in Western Germany, since 1996 in Eastern Germany

Population: All establishments with at least one employee covered by social security

Source: Official Employment Statistics

Response rate of repeatedly interviewed establishments more than 80%

Sample of more than 16,000 establishments in the last wave

Contents: employment structure, changes in employment, business policies, investment, training, remuneration, working hours, collective wage agreements, works councils


“RDC in RDC”

Remote access in Germany is far behind solutions such as those in Denmark or the Netherlands.

We have to fulfil requirements of data protection and data security.

Problem: not every country has the same data protection standards/regulations/laws.

Nearly the same standards apply in RDCs all over the world (or the standards can be established).

Main problem for German data protection: how to control who is sitting at the PC. That is possible in other RDCs.


Model of the RDC in RDC approach

[Figure: network diagram. Computers at RDC 1 and RDC 2 connect through firewalls and secure connections to the data and computational server at the RDC of the BA in the IAB. Three controls are marked: 1) general permission through contract, 2) personal control, 3) output control.]