Page 1:

Privacy-Preserving Data Publishing

Donghui Zhang

Northeastern University

Acknowledgement: some slides come from Yufei Tao and Dimitris Sacharidis.

Page 2:

motivation

• several agencies, institutions, bureaus, organizations make (sensitive) data involving people publicly available
  – termed microdata (vs. aggregated macrodata), used for analysis
  – often required and imposed by law

• to protect privacy, microdata are sanitized
  – explicit identifiers (SSN, name, phone #) are removed

• is this sufficient for preserving privacy?

• no! susceptible to link attacks
  – publicly available databases (voter lists, city directories) can reveal the “hidden” identity

Page 3:

link attack example

• [Sweeney01] managed to re-identify the medical record of the governor of Massachusetts
  – MA collects and publishes sanitized medical data for state employees (microdata): left circle
  – voter registration list of MA (publicly available data): right circle

• looking for the governor’s record

• join the tables:
  – 6 people had his birth date
  – 3 were men
  – 1 in his zipcode

• regarding the US 1990 census data
  – 87% of the population are unique based on (zipcode, gender, dob)

Page 4:

Microdata

Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
Alice  22   14000    bronchitis
Andy   24   18000    flu
David  23   25000    gastritis
Gary   41   20000    flu
Helen  36   27000    gastritis
Jane   37   33000    dyspepsia
Ken    40   35000    flu
Linda  43   26000    gastritis
Paul   52   33000    dyspepsia
Steve  56   34000    gastritis

Page 5:

Inference Attack

Published table (Age and Zipcode are the quasi-identifier (QI) attributes)

Age  Zipcode  Disease
21   12000    dyspepsia
22   14000    bronchitis
24   18000    flu
23   25000    gastritis
41   20000    flu
36   27000    gastritis
37   33000    dyspepsia
40   35000    flu
43   26000    gastritis
52   33000    dyspepsia
56   34000    gastritis

An adversary

Name  Age  Zipcode
Bob   21   12000
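
The attack on this slide amounts to a join on the quasi-identifier attributes. A minimal Python sketch using the published table and the adversary's knowledge from the slide; the names `published`, `voter_list`, and `link_attack` are illustrative assumptions, not part of the slides:

```python
# Published table: explicit identifiers removed, QI attributes (Age, Zipcode) kept.
published = [
    (21, 12000, "dyspepsia"), (22, 14000, "bronchitis"), (24, 18000, "flu"),
    (23, 25000, "gastritis"), (41, 20000, "flu"), (36, 27000, "gastritis"),
    (37, 33000, "dyspepsia"), (40, 35000, "flu"), (43, 26000, "gastritis"),
    (52, 33000, "dyspepsia"), (56, 34000, "gastritis"),
]

# What the adversary knows from an external source: name plus QI values.
voter_list = [("Bob", 21, 12000)]


def link_attack(published, voter_list):
    """Join both tables on (Age, Zipcode); return candidate diseases per name."""
    result = {}
    for name, age, zipcode in voter_list:
        result[name] = [d for a, z, d in published if (a, z) == (age, zipcode)]
    return result


print(link_attack(published, voter_list))  # {'Bob': ['dyspepsia']}
```

Only one published tuple matches Bob's QI values, so his disease is revealed even though no name was published.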

Page 6:

k-anonymity [Samarati and Sweeney02]

• Transform the QI values into less specific forms

Age  Zipcode  Disease
21   12000    dyspepsia
22   14000    bronchitis
24   18000    flu
23   25000    gastritis
41   20000    flu
36   27000    gastritis
37   33000    dyspepsia
40   35000    flu
43   26000    gastritis
52   33000    dyspepsia
56   34000    gastritis

generalize

Age       Zipcode     Disease
[21, 22]  [12k, 14k]  dyspepsia
[21, 22]  [12k, 14k]  bronchitis
[23, 24]  [18k, 25k]  flu
[23, 24]  [18k, 25k]  gastritis
[36, 41]  [20k, 27k]  flu
[36, 41]  [20k, 27k]  gastritis
[37, 43]  [26k, 35k]  dyspepsia
[37, 43]  [26k, 35k]  flu
[37, 43]  [26k, 35k]  gastritis
[52, 56]  [33k, 34k]  dyspepsia
[52, 56]  [33k, 34k]  gastritis
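
Under the standard definition, a table is k-anonymous when every combination of (generalized) QI values is shared by at least k tuples. The sketch below checks this for the generalized table on this slide; the function name `is_k_anonymous` and the tuple encoding of the intervals are assumptions made for illustration:

```python
from collections import Counter

# Generalized table from the slide: (Age range, Zipcode range, Disease).
generalized = [
    ((21, 22), (12000, 14000), "dyspepsia"),
    ((21, 22), (12000, 14000), "bronchitis"),
    ((23, 24), (18000, 25000), "flu"),
    ((23, 24), (18000, 25000), "gastritis"),
    ((36, 41), (20000, 27000), "flu"),
    ((36, 41), (20000, 27000), "gastritis"),
    ((37, 43), (26000, 35000), "dyspepsia"),
    ((37, 43), (26000, 35000), "flu"),
    ((37, 43), (26000, 35000), "gastritis"),
    ((52, 56), (33000, 34000), "dyspepsia"),
    ((52, 56), (33000, 34000), "gastritis"),
]


def is_k_anonymous(rows, k):
    """True if every combination of QI values occurs in at least k tuples."""
    group_sizes = Counter((age, zipcode) for age, zipcode, _ in rows)
    return all(size >= k for size in group_sizes.values())


print(is_k_anonymous(generalized, 2))  # True
print(is_k_anonymous(generalized, 3))  # False
```

The table is 2-anonymous but not 3-anonymous, because four of its five QI groups contain only two tuples.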

Page 7:

Generalization

• Transform each QI value into a less specific form

An adversary

Name  Age  Zipcode
Bob   21   12000

A generalized table

Age       Zipcode     Disease
[21, 22]  [12k, 14k]  dyspepsia
[21, 22]  [12k, 14k]  bronchitis
[23, 24]  [18k, 25k]  flu
[23, 24]  [18k, 25k]  gastritis
[36, 41]  [20k, 27k]  flu
[36, 41]  [20k, 27k]  gastritis
[37, 43]  [26k, 35k]  dyspepsia
[37, 43]  [26k, 35k]  flu
[37, 43]  [26k, 35k]  gastritis
[52, 56]  [33k, 34k]  dyspepsia
[52, 56]  [33k, 34k]  gastritis

Page 8:

Graphically…

[Figure: the eleven tuples plotted by Age (21 to 56) and Zipcode (12000 to 35000); the points for Bob and Alice are labeled.]

Page 9:

Why not…

[Figure: the same plot of the tuples by Age (21 to 56) and Zipcode (12000 to 35000).]

How many people with age in [30, 50] contracted flu?

Page 10:

k-anonymity

How many people with age in [30, 50] contracted flu?

generalization with high utility: answers queries more accurately: 2.

Age       Zipcode     Disease
[21, 22]  [12k, 14k]  dyspepsia
[21, 22]  [12k, 14k]  bronchitis
[23, 24]  [18k, 25k]  flu
[23, 24]  [18k, 25k]  gastritis
[36, 41]  [20k, 27k]  flu
[36, 41]  [20k, 27k]  gastritis
[37, 43]  [26k, 35k]  dyspepsia
[37, 43]  [26k, 35k]  flu
[37, 43]  [26k, 35k]  gastritis
[52, 56]  [33k, 34k]  dyspepsia
[52, 56]  [33k, 34k]  gastritis

generalization with low utility: answers queries less accurately: [0..3]

Age       Zipcode     Disease
[21, 56]  [12k, 35k]  dyspepsia
[21, 56]  [12k, 35k]  bronchitis
[21, 56]  [12k, 35k]  flu
[21, 56]  [12k, 35k]  gastritis
[21, 56]  [12k, 35k]  flu
[21, 56]  [12k, 35k]  gastritis
[21, 56]  [12k, 35k]  dyspepsia
[21, 56]  [12k, 35k]  flu
[21, 56]  [12k, 35k]  gastritis
[21, 56]  [12k, 35k]  dyspepsia
[21, 56]  [12k, 35k]  gastritis
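
The two answers can be reproduced by bounding the count from the generalized age intervals: a tuple certainly qualifies if its interval lies inside the query range, and possibly qualifies if the interval only overlaps it. This is a sketch of one way to evaluate the query; `count_bounds` and the row encoding are illustrative assumptions:

```python
def count_bounds(rows, lo, hi, disease="flu"):
    """Return (certain, possible) counts of tuples with `disease` and age in [lo, hi]."""
    certain = possible = 0
    for (a_lo, a_hi), d in rows:
        if d != disease:
            continue
        if lo <= a_lo and a_hi <= hi:   # interval fully inside the query range
            certain += 1
        if a_lo <= hi and lo <= a_hi:   # interval overlaps the query range
            possible += 1
    return certain, possible


# (Age interval, Disease) of the fine-grained generalization.
fine = [
    ((21, 22), "dyspepsia"), ((21, 22), "bronchitis"),
    ((23, 24), "flu"), ((23, 24), "gastritis"),
    ((36, 41), "flu"), ((36, 41), "gastritis"),
    ((37, 43), "dyspepsia"), ((37, 43), "flu"), ((37, 43), "gastritis"),
    ((52, 56), "dyspepsia"), ((52, 56), "gastritis"),
]
# The coarse generalization puts every tuple into the single interval [21, 56].
coarse = [((21, 56), d) for _, d in fine]

print(count_bounds(fine, 30, 50))    # (2, 2): the answer is exactly 2
print(count_bounds(coarse, 30, 50))  # (0, 3): the answer is somewhere in [0..3]
```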

Page 11:

k-anonymity with utility

• Among all generalizations that enforce k-anonymity, we should maximize utility by minimizing the “rectangle” sizes!

• Several measures exist, e.g. minimizing the maximal perimeter of the rectangles.
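
As an illustration of the maximal-perimeter measure, the sketch below computes the perimeter of the minimum bounding rectangle of each QI group and takes the maximum. Expressing zipcodes in thousands so that both axes have comparable scale is an assumption of this example, as are the function names:

```python
def mbr_perimeter(points):
    """Perimeter of the minimum bounding rectangle of a set of 2-D points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return 2 * ((max(xs) - min(xs)) + (max(ys) - min(ys)))


def max_perimeter(groups):
    """The utility measure: the largest rectangle perimeter over all QI groups."""
    return max(mbr_perimeter(g) for g in groups)


# (Age, Zipcode in thousands) of the tuples in each QI group of the
# five-group generalization shown earlier.
groups = [
    [(21, 12), (22, 14)],
    [(24, 18), (23, 25)],
    [(41, 20), (36, 27)],
    [(37, 33), (40, 35), (43, 26)],
    [(52, 33), (56, 34)],
]
everything = [p for g in groups for p in g]

print(max_perimeter(groups))        # 30
print(max_perimeter([everything]))  # 116
```

Under this measure the five-group generalization (30) is much better than putting all tuples into one rectangle (116).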

Page 12:

Mondrian [LDR06]

Recursive half-plane partitioning, alternating dimensions.

let k=2
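
A simplified sketch of the recursive partitioning described on this slide: split at the median along one dimension, alternate the dimension at each level, and stop as soon as a split would leave fewer than k points on a side. It is not the full Mondrian algorithm of [LDR06]; the function name and the stopping rule `len(points) < 2 * k` are assumptions of this example:

```python
def mondrian(points, k, dim=0):
    """Partition points into groups of at least k by recursive median splits."""
    if len(points) < 2 * k:
        return [points]  # any split would leave one side with fewer than k points
    ordered = sorted(points, key=lambda p: p[dim])
    mid = len(ordered) // 2
    next_dim = (dim + 1) % len(points[0])  # alternate dimensions
    return (mondrian(ordered[:mid], k, next_dim)
            + mondrian(ordered[mid:], k, next_dim))


# (Age, Zipcode) of the eleven microdata tuples.
points = [
    (21, 12000), (22, 14000), (24, 18000), (23, 25000), (41, 20000),
    (36, 27000), (37, 33000), (40, 35000), (43, 26000), (52, 33000),
    (56, 34000),
]

groups = mondrian(points, k=2)
print([len(g) for g in groups])  # [2, 3, 3, 3]
```

Each resulting group is then generalized to its bounding rectangle.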

Page 13:

Mondrian [LDR06]

Unbounded approximation ratio!

let k=4

Page 14:

Our contributions [DXT+07]

• Proved that finding the optimal partitioning is NP-hard.

• Proved that finding a partitioning with approximation ratio less than 1.25 is also NP-hard.

• Provided three algorithms with tradeoffs in complexity and approximation ratio.

Page 15:

Divide-And-Group (DAG)

• Divide the space into square cells of a proper size

• Find a set of non-overlapping tiles of 2 x 2 cells to cover the points, such that each tile covers at least k points

• Assign the remaining (uncovered) points to the nearest tile

Page 16:

Min-MBR-Group (MMG)

• For each point p, find the smallest MBR which covers at least k points including p

• Find a set of non-overlapping MBRs from the result of the previous step

• Assign the points to the nearest MBR

Page 17:

Nearest-Neighbor-Group (NNG)

• For each point p, find the MBR which covers p and its k-1 nearest neighbors

• Find a set of non-overlapping MBRs from the result of the previous step

• Assign the points to the nearest MBR
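
The first NNG step can be sketched with a brute-force nearest-neighbor search. The example covers only that step; the Euclidean distance, the zipcodes expressed in thousands, and the name `nn_mbr` are assumptions, and the later steps (choosing non-overlapping MBRs and assigning points) are not shown:

```python
import math


def nn_mbr(p, points, k):
    """MBR (low corner, high corner) of p and its k-1 nearest neighbors.

    `p` must be one of `points`; it sorts first because its distance to itself is 0.
    """
    nearest = sorted(points, key=lambda q: math.dist(p, q))[:k]
    lows = tuple(min(q[i] for q in nearest) for i in range(len(p)))
    highs = tuple(max(q[i] for q in nearest) for i in range(len(p)))
    return lows, highs


# (Age, Zipcode in thousands) for a few of the microdata tuples.
points = [(21, 12), (22, 14), (24, 18), (23, 25), (41, 20), (36, 27)]

print(nn_mbr((21, 12), points, 2))  # ((21, 12), (22, 14))
print(nn_mbr((41, 20), points, 2))  # ((36, 20), (41, 27))
```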

Page 18:

Analysis

Algorithm  Complexity           Approximation Ratio
DAG        O(3^d d n log^2 n)   8d
MMG        O(d n^(2d+1))        2d+1
NNG        O(d n^2)             6d

Page 19:

Drawback of k-anonymity

• In a QI group, if many records have the same sensitive attribute value...

Quasi-identifier (QI) attributes: Age, Sex, Zipcode. Sensitive attribute: Disease.

Age       Sex  Zipcode         Disease
[21, 40]  M    [10001, 60000]  pneumonia
[30, 60]  M    [10001, 60000]  dyspepsia
[30, 60]  M    [10001, 60000]  dyspepsia
[21, 40]  M    [10001, 60000]  pneumonia
[61, 65]  F    [10001, 60000]  flu
[63, 70]  F    [10001, 60000]  gastritis
[61, 65]  F    [10001, 60000]  flu
[63, 70]  F    [10001, 60000]  bronchitis

If Bob is in this group, he must have pneumonia.

Page 20:

l-diversity [ICDE06]

• A QI-group with m tuples is l-diverse, iff each sensitive value appears no more than m / l times in the QI-group.

• A table is l-diverse, iff all of its QI-groups are l-diverse.

• The table on this slide, which has 2 QI-groups, is 2-diverse.

Quasi-identifier (QI) attributes: Age, Sex, Zipcode. Sensitive attribute: Disease.

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis
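
The definition translates directly into a check: group the tuples by their QI values and compare the frequency of the most common sensitive value with m / l. A minimal sketch for the table on this slide; the name `is_l_diverse` and the string encoding of the intervals are assumptions:

```python
from collections import Counter, defaultdict


def is_l_diverse(rows, l):
    """True if, in every QI-group of m tuples, no sensitive value occurs more than m / l times."""
    groups = defaultdict(list)
    for *qi, sensitive in rows:
        groups[tuple(qi)].append(sensitive)
    return all(
        max(Counter(values).values()) <= len(values) / l
        for values in groups.values()
    )


table = [
    ("[21, 60]", "M", "[10001, 60000]", "pneumonia"),
    ("[21, 60]", "M", "[10001, 60000]", "dyspepsia"),
    ("[21, 60]", "M", "[10001, 60000]", "dyspepsia"),
    ("[21, 60]", "M", "[10001, 60000]", "pneumonia"),
    ("[61, 70]", "F", "[10001, 60000]", "flu"),
    ("[61, 70]", "F", "[10001, 60000]", "gastritis"),
    ("[61, 70]", "F", "[10001, 60000]", "flu"),
    ("[61, 70]", "F", "[10001, 60000]", "bronchitis"),
]

print(is_l_diverse(table, 2))  # True: each value appears at most 4 / 2 = 2 times
print(is_l_diverse(table, 3))  # False: pneumonia appears 2 > 4 / 3 times
```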

Page 21:

What l-diversity guarantees

• From an l-diverse generalized table, an adversary (without any prior knowledge) can infer the sensitive value of each individual with confidence at most 1/l

An adversary

Name  Age  Sex  Zipcode
Bob   23   M    11000

A 2-diverse generalized table

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis

A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006

Page 22:

Problem with multi-publishing

• A hospital keeps track of the medical records collected in the last three months.

• The microdata table T(1), and its generalization T*(1), published in Apr. 2007.

Microdata T(1)

Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
Alice  22   14000    bronchitis
Andy   24   18000    flu
David  23   25000    gastritis
Gary   41   20000    flu
Helen  36   27000    gastritis
Jane   37   33000    dyspepsia
Ken    40   35000    flu
Linda  43   26000    gastritis
Paul   52   33000    dyspepsia
Steve  56   34000    gastritis

2-diverse Generalization T*(1)

G. ID  Age       Zipcode     Disease
1      [21, 22]  [12k, 14k]  dyspepsia
1      [21, 22]  [12k, 14k]  bronchitis
2      [23, 24]  [18k, 25k]  flu
2      [23, 24]  [18k, 25k]  gastritis
3      [36, 41]  [20k, 27k]  flu
3      [36, 41]  [20k, 27k]  gastritis
4      [37, 43]  [26k, 35k]  dyspepsia
4      [37, 43]  [26k, 35k]  flu
4      [37, 43]  [26k, 35k]  gastritis
5      [52, 56]  [33k, 34k]  dyspepsia
5      [52, 56]  [33k, 34k]  gastritis

Page 23:

Problem with multi-publishing

• Bob was hospitalized in Mar. 2007

Name  Age  Zipcode
Bob   21   12000

2-diverse Generalization T*(1)

G. ID  Age       Zipcode     Disease
1      [21, 22]  [12k, 14k]  dyspepsia
1      [21, 22]  [12k, 14k]  bronchitis
2      [23, 24]  [18k, 25k]  flu
2      [23, 24]  [18k, 25k]  gastritis
3      [36, 41]  [20k, 27k]  flu
3      [36, 41]  [20k, 27k]  gastritis
4      [37, 43]  [26k, 35k]  dyspepsia
4      [37, 43]  [26k, 35k]  flu
4      [37, 43]  [26k, 35k]  gastritis
5      [52, 56]  [33k, 34k]  dyspepsia
5      [52, 56]  [33k, 34k]  gastritis

Page 24:

Problem with multi-publishing

• One month later, in May 2007

Microdata T(1)

Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
Alice  22   14000    bronchitis
Andy   24   18000    flu
David  23   25000    gastritis
Gary   41   20000    flu
Helen  36   27000    gastritis
Jane   37   33000    dyspepsia
Ken    40   35000    flu
Linda  43   26000    gastritis
Paul   52   33000    dyspepsia
Steve  56   34000    gastritis

Page 25:

Problem with multi-publishing

• One month later, in May 2007

• Some obsolete tuples are deleted from the microdata.

Microdata T(1)

Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
Alice  22   14000    bronchitis
Andy   24   18000    flu
David  23   25000    gastritis
Gary   41   20000    flu
Helen  36   27000    gastritis
Jane   37   33000    dyspepsia
Ken    40   35000    flu
Linda  43   26000    gastritis
Paul   52   33000    dyspepsia
Steve  56   34000    gastritis

Page 26:

Problem with multi-publishing

• Bob’s tuple stays.

Microdata T(1)

Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
David  23   25000    gastritis
Gary   41   20000    flu
Jane   37   33000    dyspepsia
Linda  43   26000    gastritis
Steve  56   34000    gastritis

Page 27:

Problem with multi-publishing

• Some new records are inserted.

Microdata T(2)

Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
David  23   25000    gastritis
Emily  25   21000    flu
Jane   37   33000    dyspepsia
Linda  43   26000    gastritis
Gary   41   20000    flu
Mary   46   30000    gastritis
Ray    54   31000    dyspepsia
Steve  56   34000    gastritis
Tom    60   44000    gastritis
Vince  65   36000    flu

Page 28:

Problem with multi-publishing

• The hospital published T*(2).

Microdata T(2)

Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
David  23   25000    gastritis
Emily  25   21000    flu
Jane   37   33000    dyspepsia
Linda  43   26000    gastritis
Gary   41   20000    flu
Mary   46   30000    gastritis
Ray    54   31000    dyspepsia
Steve  56   34000    gastritis
Tom    60   44000    gastritis
Vince  65   36000    flu

2-diverse Generalization T*(2)

G. ID  Age       Zipcode     Disease
1      [21, 23]  [12k, 25k]  dyspepsia
1      [21, 23]  [12k, 25k]  gastritis
2      [25, 43]  [21k, 33k]  flu
2      [25, 43]  [21k, 33k]  dyspepsia
2      [25, 43]  [21k, 33k]  gastritis
3      [41, 46]  [20k, 30k]  flu
3      [41, 46]  [20k, 30k]  gastritis
4      [54, 56]  [31k, 34k]  dyspepsia
4      [54, 56]  [31k, 34k]  gastritis
5      [60, 65]  [36k, 44k]  gastritis
5      [60, 65]  [36k, 44k]  flu

Page 29:

Problem with multi-publishing

• Consider the previous adversary.

Name  Age  Zipcode
Bob   21   12000

2-diverse Generalization T*(2)

G. ID  Age       Zipcode     Disease
1      [21, 23]  [12k, 25k]  dyspepsia
1      [21, 23]  [12k, 25k]  gastritis
2      [25, 43]  [21k, 33k]  flu
2      [25, 43]  [21k, 33k]  dyspepsia
2      [25, 43]  [21k, 33k]  gastritis
3      [41, 46]  [20k, 30k]  flu
3      [41, 46]  [20k, 30k]  gastritis
4      [54, 56]  [31k, 34k]  dyspepsia
4      [54, 56]  [31k, 34k]  gastritis
5      [60, 65]  [36k, 44k]  gastritis
5      [60, 65]  [36k, 44k]  flu

Page 30:

Problem with multi-publishing

• What the adversary learns from T*(1).

Name  Age  Zipcode
Bob   21   12000

G. ID  Age       Zipcode     Disease
1      [21, 22]  [12k, 14k]  dyspepsia
1      [21, 22]  [12k, 14k]  bronchitis
……

• What the adversary learns from T*(2).

Name  Age  Zipcode
Bob   21   12000

G. ID  Age       Zipcode     Disease
1      [21, 23]  [12k, 25k]  dyspepsia
1      [21, 23]  [12k, 25k]  gastritis
……

• So Bob must have contracted dyspepsia!

• A new generalization principle is needed.
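
The adversary's reasoning is a set intersection over the candidate diseases that each release leaves open for Bob. A minimal sketch; the variable names are illustrative:

```python
# Sensitive values of the QI group that must contain Bob (age 21, zipcode 12000)
# in each release.
candidates_t1 = {"dyspepsia", "bronchitis"}  # group 1 of T*(1)
candidates_t2 = {"dyspepsia", "gastritis"}   # group 1 of T*(2)

# Bob's disease has to be consistent with both releases.
remaining = candidates_t1 & candidates_t2
print(remaining)  # {'dyspepsia'}
```

Each table is 2-diverse on its own, yet the two together leave a single candidate.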

Page 31:

m-invariance [SIGMOD07]

• A sequence of generalized tables T*(1), …, T*(n) is m-invariant, if and only if
  – T*(1), …, T*(n) are m-unique, and
  – each individual has the same signature in every generalized table s/he is involved in.

• Explanation
  – m-unique: every QI group contains at least m tuples with different sensitive attributes
  – signature: all the sensitive attributes in the individual’s QI group.

Page 32:

m-unique

• A generalized table T*(j) is m-unique, if and only if
  – each QI-group in T*(j) contains at least m tuples
  – all tuples in the same QI-group have different sensitive values.

A 2-unique generalized table

G. ID  Age       Zipcode     Disease
1      [21, 22]  [12k, 14k]  dyspepsia
1      [21, 22]  [12k, 14k]  bronchitis
2      [23, 24]  [18k, 25k]  flu
2      [23, 24]  [18k, 25k]  gastritis
3      [36, 41]  [20k, 27k]  flu
3      [36, 41]  [20k, 27k]  gastritis
4      [37, 43]  [26k, 35k]  dyspepsia
4      [37, 43]  [26k, 35k]  flu
4      [37, 43]  [26k, 35k]  gastritis
5      [52, 56]  [33k, 34k]  dyspepsia
5      [52, 56]  [33k, 34k]  gastritis
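
The two conditions of m-uniqueness can be checked per QI group. A short sketch for the table on this slide, reduced to (G. ID, Disease) pairs; the name `is_m_unique` is an assumption:

```python
from collections import defaultdict


def is_m_unique(rows, m):
    """True if every QI group has at least m tuples, all with distinct sensitive values."""
    groups = defaultdict(list)
    for gid, disease in rows:
        groups[gid].append(disease)
    return all(
        len(values) >= m and len(set(values)) == len(values)
        for values in groups.values()
    )


t1 = [
    (1, "dyspepsia"), (1, "bronchitis"),
    (2, "flu"), (2, "gastritis"),
    (3, "flu"), (3, "gastritis"),
    (4, "dyspepsia"), (4, "flu"), (4, "gastritis"),
    (5, "dyspepsia"), (5, "gastritis"),
]

print(is_m_unique(t1, 2))  # True
print(is_m_unique(t1, 3))  # False: groups 1, 2, 3 and 5 have only two tuples
```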

Page 33:

Signature

• The signature of Bob in T*(1) is {dyspepsia, bronchitis}

• The signature of Jane in T*(1) is {dyspepsia, flu, gastritis}

T*(1)

Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  dyspepsia
Alice  1     [21, 22]  [12k, 14k]  bronchitis
…      …     …         …           …
Jane   4     [37, 43]  [26k, 35k]  dyspepsia
Ken    4     [37, 43]  [26k, 35k]  flu
Linda  4     [37, 43]  [26k, 35k]  gastritis
…      …     …         …           …

Page 34:

The m-invariance principle

• Lemma: if a sequence of generalized tables {T*(1), …, T*(n)} is m-invariant, then for any individual o involved in any of these tables, we have

risk(o) <= 1/m

Page 35:

The m-invariance principle

• Lemma: let {T*(1), …, T*(n-1)} be m-invariant. {T*(1), …, T*(n-1), T*(n)} is also m-invariant, if and only if {T*(n-1), T*(n)} is m-invariant.

• Only T*(n - 1) is needed for the generation of T*(n).

• T*(1), T*(2), …, T*(n-2) can be discarded.

Page 36:

Solution idea

• Goal: Given T(n) and T*(n-1), create T*(n) such that {T*(n-1), T*(n)} is m-invariant.

• Idea: create counterfeits.

• Optimization goal: impose as little generalization as possible.

Page 37:

Microdata T(2)

Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
David  23   25000    gastritis
Emily  25   21000    flu
Jane   37   33000    dyspepsia
Linda  43   26000    gastritis
Gary   41   20000    flu
Mary   46   30000    gastritis
Ray    54   31000    dyspepsia
Steve  56   34000    gastritis
Tom    60   44000    gastritis
Vince  65   36000    flu

Counterfeited generalization T*(2)

Name   Group-ID  Age       Zipcode     Disease
Bob    1         [21, 22]  [12k, 14k]  dyspepsia
c1     1         [21, 22]  [12k, 14k]  bronchitis
David  2         [23, 25]  [21k, 25k]  gastritis
Emily  2         [23, 25]  [21k, 25k]  flu
Jane   3         [37, 43]  [26k, 33k]  dyspepsia
c2     3         [37, 43]  [26k, 33k]  flu
Linda  3         [37, 43]  [26k, 33k]  gastritis
Gary   4         [41, 46]  [20k, 30k]  flu
Mary   4         [41, 46]  [20k, 30k]  gastritis
Ray    5         [54, 56]  [31k, 34k]  dyspepsia
Steve  5         [54, 56]  [31k, 34k]  gastritis
Tom    6         [60, 65]  [36k, 44k]  gastritis
Vince  6         [60, 65]  [36k, 44k]  flu

The auxiliary relation R(2) for T*(2)

Group-ID  Count
1         1
3         1

Page 38:

An adversary

Name  Age  Zipcode
Bob   21   12000

Generalization T*(1)

Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  dyspepsia
Alice  1     [21, 22]  [12k, 14k]  bronchitis
Andy   2     [23, 24]  [18k, 25k]  flu
David  2     [23, 24]  [18k, 25k]  gastritis
Gary   3     [36, 41]  [20k, 27k]  flu
Helen  3     [36, 41]  [20k, 27k]  gastritis
Jane   4     [37, 43]  [26k, 35k]  dyspepsia
Ken    4     [37, 43]  [26k, 35k]  flu
Linda  4     [37, 43]  [26k, 35k]  gastritis
Paul   5     [52, 56]  [33k, 34k]  dyspepsia
Steve  5     [52, 56]  [33k, 34k]  gastritis

Counterfeited Generalization T*(2)

Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  dyspepsia
c1     1     [21, 22]  [12k, 14k]  bronchitis
David  2     [23, 25]  [21k, 25k]  gastritis
Emily  2     [23, 25]  [21k, 25k]  flu
Jane   3     [37, 43]  [26k, 33k]  dyspepsia
c2     3     [37, 43]  [26k, 33k]  flu
Linda  3     [37, 43]  [26k, 33k]  gastritis
Gary   4     [41, 46]  [20k, 30k]  flu
Mary   4     [41, 46]  [20k, 30k]  gastritis
Ray    5     [54, 56]  [31k, 34k]  dyspepsia
Steve  5     [54, 56]  [31k, 34k]  gastritis
Tom    6     [60, 65]  [36k, 44k]  gastritis
Vince  6     [60, 65]  [36k, 44k]  flu

The auxiliary relation R(2) for T*(2)

Group-ID  Count
1         1
3         1

Page 39:

Generalization T*(1)

Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  dyspepsia
Alice  1     [21, 22]  [12k, 14k]  bronchitis
Andy   2     [23, 24]  [18k, 25k]  flu
David  2     [23, 24]  [18k, 25k]  gastritis
Gary   3     [36, 41]  [20k, 27k]  flu
Helen  3     [36, 41]  [20k, 27k]  gastritis
Jane   4     [37, 43]  [26k, 35k]  dyspepsia
Ken    4     [37, 43]  [26k, 35k]  flu
Linda  4     [37, 43]  [26k, 35k]  gastritis
Paul   5     [52, 56]  [33k, 34k]  dyspepsia
Steve  5     [52, 56]  [33k, 34k]  gastritis

Generalization T*(2)

Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  dyspepsia
c1     1     [21, 22]  [12k, 14k]  bronchitis
David  2     [23, 25]  [21k, 25k]  gastritis
Emily  2     [23, 25]  [21k, 25k]  flu
Jane   3     [37, 43]  [26k, 33k]  dyspepsia
c2     3     [37, 43]  [26k, 33k]  flu
Linda  3     [37, 43]  [26k, 33k]  gastritis
Gary   4     [41, 46]  [20k, 30k]  flu
Mary   4     [41, 46]  [20k, 30k]  gastritis
Ray    5     [54, 56]  [31k, 34k]  dyspepsia
Steve  5     [54, 56]  [31k, 34k]  gastritis
Tom    6     [60, 65]  [36k, 44k]  gastritis
Vince  6     [60, 65]  [36k, 44k]  flu

• A sequence of generalized tables T*(1), …, T*(n) is m-invariant, if and only if
  – T*(1), …, T*(n) are m-unique, and
  – each individual has the same signature in every generalized table s/he is involved in.
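
The signature condition can be checked mechanically for the two generalizations on this slide. The sketch below covers only that condition (m-uniqueness is a separate check); the function names and the (name, group id, disease) encoding are assumptions, and `t2_without_counterfeits` holds just the first QI group of the T*(2) that was published earlier without counterfeits:

```python
def signatures(table):
    """Map each name to the set of sensitive values in its QI group."""
    by_group = {}
    for _, gid, disease in table:
        by_group.setdefault(gid, set()).add(disease)
    return {name: frozenset(by_group[gid]) for name, gid, _ in table}


def same_signatures(earlier, later):
    """True if everyone who appears in both tables keeps the same signature."""
    sig_a, sig_b = signatures(earlier), signatures(later)
    return all(sig_a[n] == sig_b[n] for n in sig_a.keys() & sig_b.keys())


t1 = [
    ("Bob", 1, "dyspepsia"), ("Alice", 1, "bronchitis"), ("Andy", 2, "flu"),
    ("David", 2, "gastritis"), ("Gary", 3, "flu"), ("Helen", 3, "gastritis"),
    ("Jane", 4, "dyspepsia"), ("Ken", 4, "flu"), ("Linda", 4, "gastritis"),
    ("Paul", 5, "dyspepsia"), ("Steve", 5, "gastritis"),
]
t2_counterfeited = [
    ("Bob", 1, "dyspepsia"), ("c1", 1, "bronchitis"), ("David", 2, "gastritis"),
    ("Emily", 2, "flu"), ("Jane", 3, "dyspepsia"), ("c2", 3, "flu"),
    ("Linda", 3, "gastritis"), ("Gary", 4, "flu"), ("Mary", 4, "gastritis"),
    ("Ray", 5, "dyspepsia"), ("Steve", 5, "gastritis"),
    ("Tom", 6, "gastritis"), ("Vince", 6, "flu"),
]
t2_without_counterfeits = [("Bob", 1, "dyspepsia"), ("David", 1, "gastritis")]

print(same_signatures(t1, t2_counterfeited))         # True
print(same_signatures(t1, t2_without_counterfeits))  # False
```

With the counterfeits c1 and c2, Bob's signature stays {dyspepsia, bronchitis} in both releases, so intersecting them no longer narrows down his disease.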

Page 42:

In case of corruption…

• If an adversary knows from Alice that she has bronchitis, he can conclude that Bob has dyspepsia.

Microdata

Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
Alice  22   14000    bronchitis
Andy   24   18000    flu
David  23   25000    gastritis
Gary   41   20000    flu
Helen  36   27000    gastritis
Jane   37   33000    dyspepsia
Ken    40   35000    flu
Linda  43   26000    gastritis
Paul   52   33000    dyspepsia
Steve  56   34000    gastritis

2-diverse Generalization

G. ID  Age       Zipcode     Disease
1      [21, 22]  [12k, 14k]  dyspepsia
1      [21, 22]  [12k, 14k]  bronchitis
2      [23, 24]  [18k, 25k]  flu
2      [23, 24]  [18k, 25k]  gastritis
3      [36, 41]  [20k, 27k]  flu
3      [36, 41]  [20k, 27k]  gastritis
4      [37, 43]  [26k, 35k]  dyspepsia
4      [37, 43]  [26k, 35k]  flu
4      [37, 43]  [26k, 35k]  gastritis
5      [52, 56]  [33k, 34k]  dyspepsia
5      [52, 56]  [33k, 34k]  gastritis
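
The corruption attack is an elimination within one QI group: every sensitive value the adversary can attribute to someone else is removed from the candidates. A minimal sketch for QI group 1; `remaining_values` is a hypothetical helper name:

```python
def remaining_values(group_values, known):
    """Sensitive values left for the target after removing those of corrupted individuals."""
    remaining = list(group_values)
    for disease in known.values():
        remaining.remove(disease)  # one tuple of the group is now accounted for
    return remaining


group1 = ["dyspepsia", "bronchitis"]  # sensitive values of QI group 1
known = {"Alice": "bronchitis"}       # learned from Alice herself

print(remaining_values(group1, known))  # ['dyspepsia']
```

Since Bob and Alice are the only members of group 1, the single remaining value must be Bob's.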

Page 43:

Anti-corruption publishing [ICDE08]

• We formalized anti-corruption publishing, by modeling the degree of privacy preservation as a function of an adversary’s background knowledge.

• We proposed a solution, by integrating generalization with
  – perturbation: switch selected records’ sensitive information.
  – stratified sampling: sample some records from each QI group.

Page 44:

Summary

• Introduced the problem of privacy-preserving publishing.

• Two principles:
  – k-anonymity
  – l-diversity

• Two extensions:
  – multi-publishing
  – corruption