Microdata Sharing Via Pseudonymization

David Galindo (Computer Science Department, University of Malaga) and Eric R. Verheul (PWC Netherlands & University of Nijmegen). UNECE Work session on statistical data confidentiality, Manchester, 2007 December 18th


Transcript of Microdata Sharing Via Pseudonymization

Page 1: Microdata Sharing Via Pseudonymization

David Galindo, Computer Science Department, University of Malaga

Eric R. Verheul, PWC Netherlands & University of Nijmegen

UNECE Work session on statistical data confidentiality

Manchester, 2007 December 18th

Page 2: Microdata Sharing Via Pseudonymization

20-06-2006

Motivation

Individuals' microdata is essential for empirical research

Its direct release compromises the privacy of the individuals

Goal: to build privacy-preserving microdata sharing systems through pseudonymization

Page 3: Microdata Sharing Via Pseudonymization

Problem statement

Suppliers own confidential microdata on individuals: (id1, D(id1)), …, (idn, D(idn))

Researchers want to correlate microdata from different Suppliers

Example: A Researcher wants to find out the correlation between drug prescription (Chemists) and traffic accidents (Insurers)

Question: How to enable Researchers to correlate microdata without having access to sensitive information?


Page 4: Microdata Sharing Via Pseudonymization

Framework

[Figure: the Chemists hold the table (id1, DataChm(id1)), …, (idn, DataChm(idn)) and the Insurers hold the table (idm, DataIns(idm)), …, (idt, DataIns(idt)). The Researcher says "I want to correlate"; the Suppliers answer "Maybe de-identified data?"]

Page 5: Microdata Sharing Via Pseudonymization

Supplying de-identified data

[Figure: the Researcher receives DataChm(id1), …, DataChm(idn) and DataIns(idm), …, DataIns(idt) without the identifiers.]

If Suppliers de-identify the data by:

- removing the identifier field
- applying Statistical Disclosure Control (SDC) mechanisms

no sensitive information is leaked, but…

Matching is not possible!

Page 6: Microdata Sharing Via Pseudonymization

Pseudonymizing data via TTPs

Solution 1: a Trusted Third Party (TTP) replaces real identifiers by random identifiers (pseudonyms)

[Figure: the TTP's table maps id1 → P(id1), …, idl → P(idl), where P(id) is random. This table is only known to the TTP.]

[Figure: the Insurers' table becomes (P(idm), DataIns(idm)), …, (P(idt), DataIns(idt)) and the Chemists' table becomes (P(id1), DataChm(id1)), …, (P(idn), DataChm(idn)). Matching!]
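As a rough illustration of Solution 1, the following Python sketch models a TTP that keeps a secret table of random pseudonyms and a Researcher who joins two pseudonymized tables. The class name `TrustedThirdParty`, the helper `pseudonymize`, the 16-byte pseudonym length and the sample records are all assumptions made for this example, not part of the slides.

```python
import secrets


class TrustedThirdParty:
    """Toy TTP holding the secret table id -> random pseudonym (hypothetical class)."""

    def __init__(self):
        # This table must stay secret: it links pseudonyms back to identities.
        self._table = {}

    def pseudonym(self, identity: str) -> str:
        # Draw a random pseudonym the first time an identity is seen and
        # reuse it afterwards, so the same id always maps to the same P(id).
        if identity not in self._table:
            self._table[identity] = secrets.token_hex(16)
        return self._table[identity]


def pseudonymize(ttp: TrustedThirdParty, records: dict) -> dict:
    """Turn {id: data} into {P(id): data}."""
    return {ttp.pseudonym(i): d for i, d in records.items()}


ttp = TrustedThirdParty()
chemists = {"alice": "sedatives", "bob": "antibiotics"}    # made-up microdata
insurers = {"alice": "1 accident", "carol": "0 accidents"}  # made-up microdata

chm = pseudonymize(ttp, chemists)
ins = pseudonymize(ttp, insurers)

# The Researcher joins on pseudonyms and never sees a real identifier.
matched = {p: (chm[p], ins[p]) for p in chm.keys() & ins.keys()}
```

The size of `_table` grows with the number of individuals, which is the storage burden a key-based scheme avoids.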

Page 7: Microdata Sharing Via Pseudonymization

Pseudonymizing data via TTPs (II)

Advantages:

- Unconditional security (w.r.t. pseudonymization)
- Matching is possible

Drawback: TTP must store a huge table secretly

Solution 2: Use a block cipher (Enc(K,·), Dec(K,·)), and then P(id) = Enc(K,id)

Advantage: Only the key K must be stored secretly

Drawbacks:

- Security is not unconditional
- Different Researchers might not have the same access rights
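To make P(id) = Enc(K,id) concrete, here is a minimal Python sketch. The slides do not name a cipher; because the Python standard library ships no block cipher, this example swaps in a toy 4-round Feistel network with HMAC-SHA256 as round function. The block size, the zero-padding of identifiers and the key value are assumptions of the sketch, and a real deployment would use a vetted cipher such as AES.

```python
import hashlib
import hmac

BLOCK = 16          # block size in bytes (assumption of this sketch)
HALF = BLOCK // 2
ROUNDS = 4


def _round(key: bytes, rnd: int, half: bytes) -> bytes:
    # Round function: HMAC-SHA256 under K, truncated to half a block.
    return hmac.new(key, bytes([rnd]) + half, hashlib.sha256).digest()[:HALF]


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def enc(key: bytes, identity: str) -> bytes:
    """P(id) = Enc(K, id), for identifiers of at most BLOCK bytes."""
    raw = identity.encode()
    if len(raw) > BLOCK:
        raise ValueError("identifier too long for one block")
    block = raw.ljust(BLOCK, b"\0")  # zero-pad; ids must not end in NUL bytes
    left, right = block[:HALF], block[HALF:]
    for rnd in range(ROUNDS):
        left, right = right, _xor(left, _round(key, rnd, right))
    return left + right


def dec(key: bytes, pseudonym: bytes) -> str:
    """Dec(K, P(id)) = id: run the Feistel rounds backwards."""
    left, right = pseudonym[:HALF], pseudonym[HALF:]
    for rnd in reversed(range(ROUNDS)):
        left, right = _xor(right, _round(key, rnd, left)), left
    return (left + right).rstrip(b"\0").decode()


K = b"demo key held only by the TTP"  # placeholder key for the example
p_alice = enc(K, "alice")
```

The TTP now stores only `K`: the same identifier always yields the same pseudonym, and `dec(K, p_alice)` recovers `"alice"`, so no lookup table is needed.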

Page 8: Microdata Sharing Via Pseudonymization

Pseudonymizing data via TTPs (III)

[Figure: one Researcher holds the Insurers' table (P(idm), DataIns(idm)), …, (P(id*), DataIns(id*)), …, (P(idt), DataIns(idt)); another holds the Chemists' table (P(id1), DataChm(id1)), …, (P(id*), DataChm(id*)), …, (P(idn), DataChm(idn)).]

Each Researcher is not allowed to match Chemists and Insurers data, but the two collude: "We share and win!" Since both tables carry the same pseudonym P(id*), pooling them makes the disallowed matching possible.

Page 9: Microdata Sharing Via Pseudonymization

Pseudonymizing data via TTPs (IV)

Solution 3: Allocate a different key Ki for every Researcher Ri

Pseudonyms are destination-dependent: P(id,Ri) = Enc(Ki,id)

[Figure: Researcher R2 holds the Insurers' table (P(idm,R2), DataIns(idm)), …, (P(id*,R2), DataIns(id*)), …, (P(idt,R2), DataIns(idt)); Researcher R1 holds the Chemists' table (P(id1,R1), DataChm(id1)), …, (P(id*,R1), DataChm(id*)), …, (P(idn,R1), DataChm(idn)).]

P(id*,R1) and P(id*,R2) look unrelated

Page 10: Microdata Sharing Via Pseudonymization

Pseudonymizing data via TTPs (V)

Advantage: Disallowed matching among malicious Researchers is prevented

Drawback: TTP must be on-line to perform sensitive operations (pseudonymization and matching)

Let's see why…

Page 11: Microdata Sharing Via Pseudonymization

Pseudonymization with symmetric encryption

Supplying pseudonymized data:

- Supplier Sj sends datablocks D(id1), …, D(idl) to Researcher Ri
- Sj sends the identities id1, …, idl in the same order to the TTP
- TTP sends the list P(id,Ri) = Enc(Ki,id) to Ri
- Ri forms the pseudonymized database (P(id1,Ri), D(id1)), …, (P(idl,Ri), D(idl))

Page 12: Microdata Sharing Via Pseudonymization

Pseudonymization with symmetric encryption

Matching Ri and Rd pseudonymized databases:

- Ri sends to Rd the data D(id1,i), …, D(idl,i)
- Ri sends to TTP P(id1,Ri), …, P(idl,Ri)
- TTP decrypts Dec(Ki,P(id,Ri)) = id and encrypts P(id,Rd) = Enc(Kd,id). The result is sent to Rd
- Rd matches the pseudonymized databases (P(id1,Rd), D(id1,i)), …, (P(idl,Rd), D(idl,i)) and (P(id1,Rd), D(id1,d)), …, (P(idm,Rd), D(idm,d))

As a result the TTP is a bottleneck to the system

[Figure: Rd holds (P(idm,Rd), D(idm,Rd)), …, (P(id*,Rd), D(id*,Rd)), …, (P(idt,Rd), D(idt,Rd)); Ri holds (P(id1,Ri), D(id1,Ri)), …, (P(id*,Ri), D(id*,Ri)), …, (P(idn,Ri), D(idn,Ri)).]
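The supply and matching steps with per-Researcher keys can be sketched in Python as follows. The `TTP` class, its method names, the key values and the sample records are hypothetical, and `enc`/`dec` are a toy 4-round Feistel construction standing in for whatever block cipher a real system would use.

```python
import hashlib
import hmac

HALF = 8  # half of a 16-byte block (assumption of this sketch)


def _f(key, rnd, half):
    return hmac.new(key, bytes([rnd]) + half, hashlib.sha256).digest()[:HALF]


def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def enc(key: bytes, identity: str) -> bytes:
    # Toy Feistel cipher; identifiers are zero-padded to 16 bytes.
    block = identity.encode().ljust(2 * HALF, b"\0")
    left, right = block[:HALF], block[HALF:]
    for rnd in range(4):
        left, right = right, _xor(left, _f(key, rnd, right))
    return left + right


def dec(key: bytes, pseudonym: bytes) -> str:
    left, right = pseudonym[:HALF], pseudonym[HALF:]
    for rnd in reversed(range(4)):
        left, right = _xor(right, _f(key, rnd, left)), left
    return (left + right).rstrip(b"\0").decode()


class TTP:
    """Holds one key Ki per Researcher Ri (hypothetical interface)."""

    def __init__(self, keys: dict):
        self._keys = keys

    def pseudonymize(self, researcher: str, ids: list) -> list:
        # P(id, Ri) = Enc(Ki, id), returned in the order the ids arrived.
        return [enc(self._keys[researcher], i) for i in ids]

    def convert(self, src: str, dst: str, pseudonyms: list) -> list:
        # Dec(Ki, P(id, Ri)) = id, then P(id, Rd) = Enc(Kd, id).
        k_src, k_dst = self._keys[src], self._keys[dst]
        return [enc(k_dst, dec(k_src, p)) for p in pseudonyms]


ttp = TTP({"Ri": b"placeholder key for Ri", "Rd": b"placeholder key for Rd"})

# Supply: data blocks go to the Researcher, identities go to the TTP.
db_i = dict(zip(ttp.pseudonymize("Ri", ["alice", "bob"]),
                ["sedatives", "antibiotics"]))
db_d = dict(zip(ttp.pseudonymize("Rd", ["carol", "alice"]),
                ["0 accidents", "1 accident"]))

# Matching: Ri sends its data blocks to Rd and its pseudonyms to the TTP.
converted = ttp.convert("Ri", "Rd", list(db_i))
matched = {p: (d, db_d[p])
           for p, d in zip(converted, db_i.values()) if p in db_d}
```

Both `pseudonymize` and `convert` need the secret keys, so every supply and every matching request passes through the TTP, which is the bottleneck the slide points out.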

Page 13: Microdata Sharing Via Pseudonymization

Pseudonymization using public key crypto

Let G = <g> be a group of prime order p. Let H: {0,1}* → G be a hash function

TTP assigns a secret key x_i ∈ Z_p to Researcher Ri

P(id,Ri) = H(id)^{x_i}

Supplying pseudonymized data from Sj to Ri:

- Supplier Sj and Researcher Ri jointly compute the pseudonymized database {P(id,Ri), D(id)}
- TTP allocates pseudonymizing keys (μ,ν) ∈ Z_p × Z_p such that μ·ν = x_i; μ is sent to Sj, ν is sent to Ri
- Sj computes and sends H(id1)^μ, …, H(idl)^μ to Ri
- Ri computes (H(id)^μ)^ν = H(id)^{x_i} = P(id,Ri)
- Ri forms the pseudonymized database (P(id1,Ri), D(id1)), …, (P(idl,Ri), D(idl))
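The joint computation of H(id)^{x_i} from two key shares whose product is x_i can be checked with a small Python sketch. Everything concrete here is an assumption for illustration: the group is the subgroup of squares modulo the toy safe prime 2039 (prime order 1019, playing the role of the slides' p), the hash-to-group map squares a SHA-256 value, and the names `hash_to_group` and `split_key` are invented. The parameters are far too small for real use, and `pow(mu, -1, ORDER)` needs Python 3.8 or later.

```python
import hashlib
import secrets

# Toy parameters: MODULUS = 2 * ORDER + 1, both prime. G is the subgroup of
# squares modulo MODULUS and has prime order ORDER.
MODULUS = 2039
ORDER = 1019


def hash_to_group(identity: str) -> int:
    """H: {0,1}* -> G, hashing to a nonzero residue and squaring it."""
    digest = int.from_bytes(hashlib.sha256(identity.encode()).digest(), "big")
    return pow(1 + digest % (MODULUS - 1), 2, MODULUS)


def split_key(x: int):
    """TTP side: pick shares (mu, nu) with mu * nu = x modulo ORDER."""
    mu = 1 + secrets.randbelow(ORDER - 1)   # in [1, ORDER-1], hence invertible
    nu = x * pow(mu, -1, ORDER) % ORDER
    return mu, nu


x_i = 1 + secrets.randbelow(ORDER - 1)  # key of Researcher Ri, known to the TTP
mu, nu = split_key(x_i)                 # mu goes to Supplier Sj, nu to Ri

ids = ["alice", "bob", "carol"]                       # made-up identifiers
data = ["sedatives", "antibiotics", "painkillers"]    # made-up data blocks

# Supplier Sj blinds the hashed identities with its share mu.
blinded = [pow(hash_to_group(i), mu, MODULUS) for i in ids]

# Researcher Ri finishes with its share nu and pairs pseudonyms with data.
pseudonyms = [pow(b, nu, MODULUS) for b in blinded]
database = list(zip(pseudonyms, data))
```

Neither share alone determines `x_i`, yet the two exponentiations together give exactly `H(id)^{x_i}`, so the TTP only has to hand out `mu` and `nu` in advance.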

Page 14: Microdata Sharing Via Pseudonymization

Pseudonymization with public key crypto (II)

Matching Ri and Rd pseudonymized databases:

- This can be done by Ri and Rd with a 1-round interactive protocol, provided certain keys are obtained off-line from the TTP
- Neither Ri nor Rd learns its pseudonymizing key x_i, x_d, even if they collude
- Rd only learns D(id,Ri) for ids in the intersection
- Security is based on the Decision Diffie-Hellman assumption

[Figure: Rd holds (H(idm)^{x_d}, D(idm,Rd)), …, (H(id*)^{x_d}, D(id*,Rd)), …, (H(idt)^{x_d}, D(idt,Rd)); Ri holds (H(id1)^{x_i}, D(id1,Ri)), …, (H(id*)^{x_i}, D(id*,Ri)), …, (H(idn)^{x_i}, D(idn,Ri)).]

Page 15: Microdata Sharing Via Pseudonymization

Pseudonymization with public key crypto (III)

Advantages:

- Matching is possible
- Disallowed matching among malicious Researchers is prevented
- TTP is not a bottleneck (only delivers off-line crypto keys)

Drawbacks:

- Suppliers must collaborate for every pseudonymization
- Interactive protocols (on-line communication)

Page 16: Microdata Sharing Via Pseudonymization

Advanced setting

Page 17: Microdata Sharing Via Pseudonymization

Properties

- Suppliers and Accumulators are assumed Honest-But-Curious
- Researchers are assumed Malicious
- Accumulators' intersection and union operations are non-interactive
- Two levels of pseudonymization corresponding to the different levels of trust
- It uses 'composite bilinear groups'

Page 18: Microdata Sharing Via Pseudonymization

Governance

From a functional perspective, permission to run these protocols is governed by a Regulatory Privacy Body (RPB). A strict licensing infrastructure will be enforced by the RPB, describing:

- Which parties are allowed to perform what protocols with each other
- What kind of data can be exchanged
- Which subsets of identities or pseudonyms are allowed as input to the protocols

Page 19: Microdata Sharing Via Pseudonymization


Thanks!