M-Invariance: Towards Privacy Preserving Re-publication of Dynamic Datasets by Tyrone Cadenhead.

Main Contribution

• The paper presents a study on privacy-preserving publication of fully dynamic datasets.

• These datasets can be modified by any sequence of insertions and deletions.

• The approach rests on two concepts: m-invariance and counterfeited generalization.

[Table with columns Name, Age, Zip., Disease; the rows are not preserved in this transcript.]

Generalization

• Generalization is a popular methodology for privacy preservation.

• Divide the tuples into several QI groups, and then generalize the QI values in each group to a uniform format.

• For example, table 1b is a generalized version of table 1a.

• A generalized table is considered privacy preserving if it satisfies a generalization principle, for example k-anonymity and l-diversity.

• K-anonymity requires each QI group to include at least k tuples (table 1b is 2-anonymous).

• L-diversity requires every QI group to contain at least l “well-represented” sensitive values (table 1b is 2-diverse).
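
The two principles above can be checked mechanically. The following is a minimal sketch, assuming that "well-represented" is read in its simplest form (at least l distinct sensitive values per group). The rows, the interval notation and the function names are illustrative assumptions and are not taken from table 1b.

```python
from collections import defaultdict

# Hypothetical generalized rows: (group_id, age_range, zip_range, disease).
# These values are illustrative only; they are not the contents of table 1b.
rows = [
    (1, "[21, 22]", "[12k, 14k]", "dyspepsia"),
    (1, "[21, 22]", "[12k, 14k]", "bronchitis"),
    (2, "[23, 24]", "[18k, 25k]", "flu"),
    (2, "[23, 24]", "[18k, 25k]", "gastritis"),
]

def group_sensitive_values(rows):
    """Collect the sensitive values of each QI group, keyed by group id."""
    groups = defaultdict(list)
    for group_id, _age, _zip, disease in rows:
        groups[group_id].append(disease)
    return groups

def is_k_anonymous(rows, k):
    """True if every QI group holds at least k tuples."""
    return all(len(v) >= k for v in group_sensitive_values(rows).values())

def is_distinct_l_diverse(rows, l):
    """True if every QI group holds at least l distinct sensitive values
    (assumption: the simplest reading of "well-represented")."""
    return all(len(set(v)) >= l for v in group_sensitive_values(rows).values())

print(is_k_anonymous(rows, 2), is_distinct_l_diverse(rows, 2))
```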

Example 1

• Consider a hospital that releases patients’ records quarterly, but each publication includes only the results of diagnoses in the 6 months preceding the publication time.

• Table 1a is the microdata of the first release, and table 1b is its generalized relation.

• Likewise, table 2a is the microdata for the second release and table 2b is the generalized relation.

Example 1 cont’d

• The tuples of Alice, Andy, Helen, Ken and Paul have been deleted (they describe diagnoses made more than six months ago).

• The tuples of Emily, Mary, Ray, Tom and Vince have been added.

• The hospital then publishes the generalized relation in table 2b.

Example 1 cont’d

• Even though both published relations, tables 1b and 2b, are 2-anonymous and 2-diverse, an adversary can still precisely determine the disease of a patient by exploiting the correlation between the two snapshots.

• Assume an adversary knows Bob’s age and zipcode, and also knows that Bob has a record in tables 1b and 2b.

• Table 1b suggests that Bob has either dyspepsia or bronchitis, while table 2b suggests that Bob has either dyspepsia or gastritis. Combining the two, the adversary correctly identifies Bob’s real disease: dyspepsia, the only value common to both.
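
The attack in this example amounts to a set intersection. A small sketch, using only the candidate diseases named above (the function name is an assumption for illustration):

```python
# Candidate diseases for Bob read off each release, as stated in the example.
candidates_release_1 = {"dyspepsia", "bronchitis"}  # Bob's QI group in table 1b
candidates_release_2 = {"dyspepsia", "gastritis"}   # Bob's QI group in table 2b

def intersect_candidates(*candidate_sets):
    """Return the values that remain possible after combining all releases."""
    result = set(candidate_sets[0])
    for candidates in candidate_sets[1:]:
        result &= candidates
    return result

remaining = intersect_candidates(candidates_release_1, candidates_release_2)
print(remaining)  # {'dyspepsia'}
```

A single remaining value means the adversary has learned the disease, even though each release on its own is 2-diverse.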

M-invariance and counterfeited generalization

• Instead of publishing table 2b, publish table 3a and an auxiliary table 3b.

• The adversary cannot distinguish between a counterfeit tuple and a real one.

• From the QI details of Bob, the adversary can conclude that Bob has been generalized to the first QI group of tables 1b and 3a, but cannot infer Bob’s real disease.

Historical union

• At time n ≥ 1, the historical union U(n) contains all the tuples in T at timestamps 1, 2, ..., n. Formally: U(n) = T(1) ∪ T(2) ∪ ... ∪ T(n).

• Each tuple t ∈ U(n) is implicitly associated with a lifespan [x, y], where x (y) is the smallest (largest) integer j such that t appears in T(j).

• U(n) can be regarded as a table with the same schema as T. Note that if a tuple appears in several T(j) with different j, it is included only once in U(n). Now we can define the background knowledge that our technique can handle.

• In Example 1, U(2) includes all the tuples in T(1) and T(2) after eliminating duplicates.

• The lifespan of the tuple <Bob, 21, 12k, dyspepsia> is [1, 2], and the lifespan of <Alice, 22, 14k, bronchitis> is [1, 1].
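
The historical union and the lifespan can be computed directly from the snapshots. The sketch below is illustrative: the dictionary layout and function names are assumptions, and only the Bob and Alice tuples come from the example (the other tuples of T(1) and T(2) are left out).

```python
# Snapshots keyed by timestamp. Only Bob and Alice are taken from the example;
# the remaining tuples of T(1) and T(2) are omitted for brevity.
snapshots = {
    1: {("Bob", 21, "12k", "dyspepsia"), ("Alice", 22, "14k", "bronchitis")},
    2: {("Bob", 21, "12k", "dyspepsia")},
}

def historical_union(snapshots, n):
    """U(n): every tuple seen at timestamps 1..n, each included once."""
    union = set()
    for j in range(1, n + 1):
        union |= snapshots[j]
    return union

def lifespan(snapshots, t, n):
    """[x, y]: smallest and largest timestamp j <= n at which t appears."""
    stamps = [j for j in range(1, n + 1) if t in snapshots[j]]
    return (min(stamps), max(stamps))

bob = ("Bob", 21, "12k", "dyspepsia")
alice = ("Alice", 22, "14k", "bronchitis")
print(len(historical_union(snapshots, 2)))
print(lifespan(snapshots, bob, 2), lifespan(snapshots, alice, 2))
```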

Adversary’s Background Knowledge

Background Knowledge B(n)

• The tuple t = <Bob, 21, 12k, dyspepsia> in U(2) corresponds to the first tuple b ∈ B(n) in table 4.

• b[Group-ID] = * means the adversary is not sure about the generalized hosting group of t.

• The adversary also knows that c1 and c2 are counterfeits from table 3b.

• The adversary also knows the lifespan of each tuple.

Privacy Breach

• A privacy breach occurs if an adversary correctly finds out the sensitive value of any tuple t ∈ U(n), utilizing T*(1), …, T*(n) and B(n).

• The objective of privacy-preserving re-publication is to compute a pair {T*(n), R(n)} that minimizes the risk of privacy disclosure, yet captures as much information in the microdata as possible.

Microdata Reconstruction

• Microdata reconstruction deals with how an adversary can reconstruct the microdata tables T(1), …, T(n) from the published T*(1), …, T*(n) and the knowledge table B(n).

• Generalized Historical Union: Given a generalized relation T*(j) (1 ≤ j ≤ n), we convert each row t* ∈ T*(j) to a timestamped tuple <t*, j>, which augments t* with another attribute Atm, called Timestamp, storing j.

• The generalized historical union U*(n) includes all the timestamped tuples covering T*(1), …, T*(n).
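
Building U*(n) is a matter of tagging every published row with its release timestamp. A minimal sketch, in which the row layout and the sample rows are hypothetical:

```python
# Hypothetical published releases: timestamp -> list of generalized rows.
# Each row is (group_id, age_range, zip_range, disease); values are illustrative.
releases = {
    1: [(1, "[21, 22]", "[12k, 14k]", "dyspepsia"),
        (1, "[21, 22]", "[12k, 14k]", "bronchitis")],
    2: [(1, "[21, 23]", "[12k, 25k]", "dyspepsia"),
        (1, "[21, 23]", "[12k, 25k]", "gastritis")],
}

def generalized_historical_union(releases, n):
    """U*(n): every row of T*(1)..T*(n), augmented with its timestamp j.

    A list is used rather than a set so that identical generalized rows
    within one release are all kept.
    """
    return [(row, j) for j in range(1, n + 1) for row in releases[j]]

for timestamped in generalized_historical_union(releases, 2):
    print(timestamped)
```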

Microdata Reconstruction

• Generalized Historical Union

• Rebuilding Surjection: A mapping f: U*(n) → B(n) is a rebuilding surjective function if it fulfils these requirements:

• The superscript l denotes the lifespan.

Microdata Reconstruction

• The arrows depict five mappings in a surjective function f, namely f(t*1) = b1, f(t*2) = b2, etc.

• These mappings reconstruct four tuples in the microdata tables T(1) and T(2).

• We can reconstruct tuple <Bob, 21, 12k, dyspepsia> in T(1) and T(2), and tuple <Alice, 22, 14k, bronchitis> in T(1).

• Reconstruction is not always accurate: the reconstructed tuple <Helen, 36, 27k, flu> in T(1) is wrong, because Helen’s real disease is gastritis.

• There is a genuine surjective function, which exactly reconstructs the original microdata. If the adversary were able to discover it, he could obtain the sensitive information of all individuals. Fortunately, as the tuples in U*(n) have been generalized, there exists a huge number of possible surjective functions from U*(n) to B(n).

Reasonable Surjection

• A function f is reasonable if it satisfies these conditions:

Privacy Disclosure Risk

• Let t be a tuple in the historical union U(n). The privacy disclosure risk risk(t) of t is

Persistent Invariance

• Let t be a tuple in U(n). The candidate sensitive set t.CSS(j) of t at time j is the union of the sensitive values in each QI group QI* of T*(j) to which t may have been generalized.

Persistent Invariance

• Let n=2, and T*(1) = table 1b, and T*(2) = table 2b.

• Let t = <Bob,21,12k,dyspepsia> in U(2) and have lifespan [1,2].

• t.CSS(1) includes the sensitive values dyspepsia and bronchitis in QI group 1 of T*(1).

• t.CSS(2) is {dyspepsia, gastritis}.

• The two sets share only dyspepsia, so by lemma 1 the privacy risk of t is 1.

Persistent Invariance

• To protect privacy, re-publication must ensure a sufficiently large intersection of the candidate sensitive sets of t at each publication timestamp, until t is deleted from the microdata.

M-Invariance

• If a sensitive value is carried by multiple tuples in QI*, the value appears only once in the signature.

M-Invariance

• M-invariance demands that each sensitive value appear at most once in every QI group.

• The publisher can simply set m to a sufficiently large value to achieve the target extent of privacy preservation.
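
The signature and the m-invariance requirement can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm: groups are modelled as lists of (name, disease) pairs, only consecutive releases are compared, and the second-release groups (including the names new_patient and c1) are hypothetical.

```python
def signature(group):
    """Set of distinct sensitive values in a QI group of (name, disease) pairs."""
    return frozenset(disease for _name, disease in group)

def is_m_unique(group, m):
    """At least m tuples, and no sensitive value appears more than once."""
    return len(group) >= m and len(signature(group)) == len(group)

def is_m_invariant(releases, m):
    """Check a sequence of releases, each given as a list of QI groups.

    Simplification: a tuple's signature is compared only between
    consecutive releases in which it appears.
    """
    previous = {}  # name -> signature of its hosting group in the last release
    for release in releases:
        current = {}
        for group in release:
            if not is_m_unique(group, m):
                return False
            for name, _disease in group:
                current[name] = signature(group)
        for name, sig in current.items():
            if name in previous and previous[name] != sig:
                return False
        previous = current
    return True

release_1 = [[("Bob", "dyspepsia"), ("Alice", "bronchitis")]]
# Hypothetical second releases: one regroups Bob with a new tuple,
# the other keeps Bob's signature by adding a counterfeit c1.
naive_release_2 = [[("Bob", "dyspepsia"), ("new_patient", "gastritis")]]
counterfeit_release_2 = [[("Bob", "dyspepsia"), ("c1", "bronchitis")]]

print(is_m_invariant([release_1, naive_release_2], 2))
print(is_m_invariant([release_1, counterfeit_release_2], 2))
```

In the naive second release Bob's signature changes, which is what makes the intersection attack possible. With the counterfeit it stays {dyspepsia, bronchitis}.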

M-Invariance

• This points to an incremental approach for performing re-publication.

• To construct T*(n), the publisher only needs to consult the microdata tables T(n-1) and T(n) and the last release T*(n-1).

Algorithm by example

• The algorithm has three phases: Division, Balancing and Assignment.

• Divide the tuples in T(n) into two disjoint sets: S∩, the tuples that were also in T(n-1), and S_, the tuples newly inserted at time n.

• For any tuple t in S∩, t.QI*(n-1) and t.QI*(n) have the same signature.

• For any tuple t in S_, t* in T*(n) is in a QI group which has at least m tuples, all with distinct sensitive values.

Algorithm by example

• In the Balancing phase, each bucket BUC is inspected in turn. If BUC is not balanced, there is a shortage of some sensitive value.

• In this case we move tuples from S_ into BUC, as long as S_ is still m-eligible.

• Here S_ = {Emily, Mary, Ray, Tom, Vince}.

• In figure 2a, BUC2 is unbalanced. We move Ray from S_ to BUC2 because the tuples left in S_ (2 flu and 2 gastritis) are still 2-eligible.
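
The eligibility test used here can be sketched in a few lines. The slides do not define m-eligibility, so the sketch assumes the usual reading: a set is m-eligible if no sensitive value is carried by more than 1/m of its tuples. The function name and the choice to treat an empty set as eligible are also assumptions.

```python
from collections import Counter

def is_m_eligible(sensitive_values, m):
    """Assumed definition: no sensitive value is carried by more than
    len(sensitive_values) / m of the tuples."""
    if not sensitive_values:
        return True  # assumption: an empty set is trivially eligible
    most_common_count = Counter(sensitive_values).most_common(1)[0][1]
    return most_common_count * m <= len(sensitive_values)

# Sensitive values left in S_ after Ray is moved to BUC2, as stated above.
remaining_s_minus = ["flu", "flu", "gastritis", "gastritis"]
print(is_m_eligible(remaining_s_minus, 2))
```

Under this definition, removing one more flu tuple would leave 1 flu and 2 gastritis, which is no longer 2-eligible.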

Algorithm by example

• If S_ cannot be used to fix an unbalanced bucket BUC, there are two possibilities:

– No tuple in S_ carries the required sensitive value.

– S_ is no longer m-eligible

• In both cases we insert a counterfeit tuple to balance BUC.

• Both BUC3 and BUC4 are unbalanced, but neither can be remedied with S_. BUC3 needs a bronchitis tuple and BUC4 needs a flu tuple.

• So we add the counterfeits c1 and c2.

Algorithm by example

• In the Assignment phase, we assign the remaining tuples in S_ to buckets, subject to two rules:

– Each tuple t in S_ can be placed only in a bucket whose signature includes t[As], the sensitive value of t.

– At the end of the phase, all buckets are still balanced.

• Here S_ = {Emily, Mary, Tom, Vince}.

• These are assigned to BUC1.
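
The two assignment rules can be expressed as simple checks. The sketch assumes that a bucket is balanced when every sensitive value in its signature is carried by the same number of tuples. It also assumes that BUC1 has the signature {flu, gastritis}, which is not stated in the slides. Function names are illustrative.

```python
from collections import Counter

def can_place(sensitive_value, bucket_signature):
    """Rule 1: the bucket's signature must include the tuple's sensitive value."""
    return sensitive_value in bucket_signature

def is_balanced(bucket_signature, bucket_values):
    """Rule 2 (assumed definition): every value in the signature is carried
    by the same number of tuples, and no other value is present."""
    counts = Counter(bucket_values)
    if any(value not in bucket_signature for value in counts):
        return False
    return len({counts[value] for value in bucket_signature}) == 1

# Assumed signature for BUC1 and the sensitive values of the four
# remaining tuples in S_ (2 flu and 2 gastritis).
buc1_signature = {"flu", "gastritis"}
remaining_values = ["flu", "gastritis", "gastritis", "flu"]

print(all(can_place(v, buc1_signature) for v in remaining_values))
print(is_balanced(buc1_signature, remaining_values))
```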