Towards Achieving Anonymity

Post on 15-Jan-2016



An Zhu

Towards Achieving Anonymity

Introduction

Collect and analyze personal data
Infer trends and patterns
Making the personal data “public”
Joining multiple sources
Third party involvement
Privacy concerns
Q: How to share such data?

Example: Medical Records

Identifiers                      Sensitive Info
SSN  Name   Age  Race   Zipcode  Disease
614  Sara   31   Cauc   94305    Flu
615  Joan   34   Cauc   94307    Cold
629  Kelly  27   Cauc   94301    Diabetes
710  Mike   41   Afr-A  94305    Flu
840  Carl   41   Afr-A  94059    Arthritis
780  Joe    65   Hisp   94042    Heart problem
616  Rob    46   Hisp   94042    Arthritis

De-identified Records

                     Sensitive Info
Age  Race   Zipcode  Disease
31   Cauc   94305    Flu
34   Cauc   94307    Cold
27   Cauc   94301    Diabetes
41   Afr-A  94305    Flu
41   Afr-A  94059    Arthritis
65   Hisp   94042    Heart problem
46   Hisp   94042    Arthritis

Not Sufficient! [Sweeney 00]

Quasi-Identifiers    Sensitive Info
Age  Race   Zipcode  Disease
31   Cauc   94305    Flu
34   Cauc   94307    Cold
27   Cauc   94301    Diabetes
41   Afr-A  94305    Flu
41   Afr-A  94059    Arthritis
65   Hisp   94042    Heart problem
46   Hisp   94042    Arthritis

[Figure: joined with a public database, the quasi-identifiers act as unique identifiers!]

Anonymize the Quasi-Identifiers!

Quasi-Identifiers    Sensitive Info
Age  Race   Zipcode  Disease
***  ***    ***      Flu
***  ***    ***      Cold
***  ***    ***      Diabetes
***  ***    ***      Flu
***  ***    ***      Arthritis
***  ***    ***      Heart problem
***  ***    ***      Arthritis

[Figure labels: Public Database; Unique Identifiers!]

Q: How to share such data?

Anonymize the quasi-identifiers
  Suppress information
  Privacy guarantee: anonymity
  Quality: the amount of suppressed information
Clustering
  Privacy guarantee: cluster size
  Quality: various clustering measures


k-anonymized Table [Samarati 01]

Each row is identical to at least k-1 other rows

Original table:

Quasi-Identifiers    Sensitive Info
Age  Race   Zipcode  Disease
31   Cauc   94305    Flu
34   Cauc   94307    Cold
27   Cauc   94301    Diabetes
41   Afr-A  94305    Flu
41   Afr-A  94059    Arthritis
65   Hisp   94042    Heart problem
46   Hisp   94042    Arthritis

k-anonymized table:

Quasi-Identifiers    Sensitive Info
Age  Race   Zipcode  Disease
*    Cauc   *        Flu
*    Cauc   *        Cold
*    Cauc   *        Diabetes
41   Afr-A  *        Flu
41   Afr-A  *        Arthritis
*    Hisp   94042    Heart problem
*    Hisp   94042    Arthritis

Definition: k-anonymity

Input: a table of n rows, each with m attributes (quasi-identifiers)

Output: suppress some entries such that each row is identical to at least k-1 other rows

Objective: minimize the number of suppressed entries
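
The definition can be checked mechanically. The sketch below is only an illustration: the helper names `is_k_anonymous` and `suppressed_entries` are made up here, and the two tables are the quasi-identifier columns from the medical-records example, before and after suppression.

```python
from collections import Counter

SUPPRESSED = "*"

def is_k_anonymous(rows, k):
    """True if every row is identical to at least k-1 other rows."""
    counts = Counter(tuple(row) for row in rows)
    return all(count >= k for count in counts.values())

def suppressed_entries(rows):
    """Objective value: how many entries were replaced by '*'."""
    return sum(cell == SUPPRESSED for row in rows for cell in row)

# Quasi-identifiers (Age, Race, Zipcode) from the example table.
original = [
    ("31", "Cauc", "94305"), ("34", "Cauc", "94307"), ("27", "Cauc", "94301"),
    ("41", "Afr-A", "94305"), ("41", "Afr-A", "94059"),
    ("65", "Hisp", "94042"), ("46", "Hisp", "94042"),
]
anonymized = [
    ("*", "Cauc", "*"), ("*", "Cauc", "*"), ("*", "Cauc", "*"),
    ("41", "Afr-A", "*"), ("41", "Afr-A", "*"),
    ("*", "Hisp", "94042"), ("*", "Hisp", "94042"),
]

print(is_k_anonymous(original, 2))    # the raw table is not 2-anonymous
print(is_k_anonymous(anonymized, 2))  # the suppressed table is
print(suppressed_entries(anonymized))
```

The smallest group in the suppressed table has two rows, so it satisfies the definition for k = 2 but not for k = 3.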

Past Work and New Results

[MW 04]
  NP-hardness for a large size alphabet
  O(k log k)-approximation
[AFKMPTZ 05]
  NP-hardness even for ternary alphabet
  O(k)-approximation
  1.5-approximation for 2-anonymity
  2-approximation for 3-anonymity


Graph Representation

A: 0 0 1 0 0 0
B: 1 0 0 1 0 1
C: 0 1 0 1 0 1
D: 0 0 1 0 0 0
E: 1 1 0 1 1 1
F: 0 1 1 0 1 1

W(e) = Hamming distance between the two rows

[Figure: graph on vertices A to F; the edge weights shown are 4, 2, 4, 6, 3, 3, 2.]
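
The edge weights of this graph can be recomputed from the six rows. This is a minimal sketch; `hamming` and `distance_graph` are illustrative names, not something the slides define.

```python
from itertools import combinations

# Rows A..F from the slide, written as bit strings.
ROWS = {
    "A": "001000", "B": "100101", "C": "010101",
    "D": "001000", "E": "110111", "F": "011011",
}

def hamming(u, v):
    """Number of positions in which two equal-length rows differ."""
    return sum(a != b for a, b in zip(u, v))

def distance_graph(rows):
    """Complete graph: one weighted edge per unordered pair of rows."""
    return {(p, q): hamming(rows[p], rows[q])
            for p, q in combinations(sorted(rows), 2)}

graph = distance_graph(ROWS)
for (p, q), w in sorted(graph.items()):
    print(p, q, w)
```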

Edge Selection I

A: 0 0 1 0 0 0
B: 1 0 0 1 0 1
C: 0 1 0 1 0 1
D: 0 0 1 0 0 0
E: 1 1 0 1 1 1
F: 0 1 1 0 1 1

k = 3

Each node selects the lightest weight edge

[Figure: selected edges on vertices A to F, with weights 2, 2, 3, 0, 3.]

Edge Selection II

A: 0 0 1 0 0 0
B: 1 0 0 1 0 1
C: 0 1 0 1 0 1
D: 0 0 1 0 0 0
E: 1 1 0 1 1 1
F: 0 1 1 0 1 1

k = 3

For components with < k vertices, add more edges

[Figure: selected edges on vertices A to F, with weights 2, 3, 0, 2.]
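
The two edge-selection phases can be sketched as follows. This is a simplified reading of the slides, not the authors' exact procedure: ties are broken by row index, an edge is skipped if it would close a cycle, and in the second phase a vertex with no selected edge in a too-small component links to its nearest row outside that component. All function names are illustrative.

```python
ROWS = ["001000", "100101", "010101", "001000", "110111", "011011"]  # A..F

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def build_forest(rows, k):
    """Return (edges, components); every component ends with >= k vertices."""
    n = len(rows)
    if not 2 <= k <= n:
        raise ValueError("need 2 <= k <= number of rows")
    parent = list(range(n))   # union-find over row indices
    has_out = [False] * n     # each vertex selects at most one edge
    edges = []

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def components():
        groups = {}
        for v in range(n):
            groups.setdefault(find(v), []).append(v)
        return list(groups.values())

    def nearest(u, excluded):
        candidates = [w for w in range(n) if w not in excluded]
        return min(candidates, key=lambda w: (hamming(rows[u], rows[w]), w))

    def link(u, v):
        edges.append((u, v, hamming(rows[u], rows[v])))
        parent[find(u)] = find(v)
        has_out[u] = True

    # Phase I: each node selects its lightest edge (skipped if it closes a cycle).
    for u in range(n):
        v = nearest(u, {u})
        if find(u) != find(v):
            link(u, v)

    # Phase II: components with fewer than k vertices add more edges.
    while True:
        small = [c for c in components() if len(c) < k]
        if not small:
            return edges, components()
        comp = small[0]
        u = next(x for x in comp if not has_out[x])
        link(u, nearest(u, set(comp)))

edges, groups = build_forest(ROWS, k=3)
print(edges)
print(groups)
```

On the six example rows with k = 3 the first phase already yields two components of three rows each, so the second phase adds nothing; with k = 4 it has to join them.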

Lemma

Total weight of edges selected is no more than OPT
  In the optimal solution, each vertex pays at least the weight of the (k-1)st lightest weight edge
  Forest: at most one edge per vertex
  By construction, the edge weight is no more than the (k-1)st lightest weight edge per vertex

Grouping

Ideally, each connected component forms a group

Anonymize vertices within a group

Total cost of a group is at most (total edge weights) × (number of nodes), e.g. (2+2+3+3) × 6

Small groups: O(k) nodes

[Figure: tree on vertices A to F, with edge weights 3, 2, 2, 3, 0.]

Dividing a Component

Root the tree arbitrarily
Divide if sub-trees & rest ≥ k
  Aim: all sub-trees < k
Rotate the tree if necessary
Termination condition: size at most max(2k-1, 3k-5)

[Figure: rooted trees whose sub-trees are labelled ≥ k or < k.]

An Example

A: 0 0 1 0 0 0
B: 1 0 0 1 0 1
C: 0 1 0 1 0 1
D: 0 0 1 0 0 0
E: 1 1 0 1 1 1
F: 0 1 1 0 1 1

[Figure: the selected edges on vertices A to F, with weights 3, 2, 2, 3, 0, 0, 3.]

[Figure: the component redrawn as a rooted tree on C, F, E, D, B, A, with edge weights 2, 2, 3, 0, 3, then divided into two groups.]

Estimated cost: 4×3 + 3×3 = 21

A: 0 * 1 0 * *
B: * * 0 1 * 1
C: * * 0 1 * 1
D: 0 * 1 0 * *
E: * * 0 1 * 1
F: 0 * 1 0 * *

Optimal cost: 3×3 + 3×3 = 18
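
The anonymized table in this example follows from the grouping {A, D, F} and {B, C, E}: within a group, every column on which the rows disagree is suppressed. A minimal sketch, with `anonymize` as an illustrative name:

```python
ROWS = {
    "A": "001000", "B": "100101", "C": "010101",
    "D": "001000", "E": "110111", "F": "011011",
}

def anonymize(rows, groups):
    """Suppress, within each group, every column whose values are not all equal."""
    result = dict(rows)
    for group in groups:
        members = [rows[name] for name in group]
        merged = "".join(col[0] if len(set(col)) == 1 else "*"
                         for col in zip(*members))
        for name in group:
            result[name] = merged
    return result

table = anonymize(ROWS, [("A", "D", "F"), ("B", "C", "E")])
cost = sum(row.count("*") for row in table.values())
for name in sorted(table):
    print(name, table[name])
print("suppressed entries:", cost)
```

Each row ends up with three suppressed entries, which matches the 3×3 + 3×3 count on the slide.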


1.5-approximation

A: 0 0 1 0 0 0
B: 0 0 0 0 0 0
C: 1 1 1 1 1 1
D: 0 0 1 0 0 0
E: 1 1 0 1 1 1
F: 1 1 0 1 1 1

W(e) = Hamming distance between the two rows

[Figure: graph on vertices A to F; the edge weights shown are 1, 6, 5, 6, 6, 0.]

Minimum {1,2}-matching

A: 0 0 1 0 0 0
B: 0 0 0 0 0 0
C: 1 1 1 1 1 1
D: 0 0 1 0 0 0
E: 1 1 0 1 1 1
F: 1 1 0 1 1 1

Each vertex is matched to 1 or 2 other vertices

[Figure: matching on vertices A to F, with edge weights 0, 0, 1, 1.]
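
For a table this small, a minimum {1,2}-matching can be found by brute force over all edge subsets. The sketch below does exactly that and is meant only to illustrate the definition; it is exponential in the number of edges and is not the polynomial-time matching routine a real implementation would use. The function name is illustrative.

```python
from itertools import combinations

ROWS = ["001000", "000000", "111111", "001000", "110111", "110111"]  # A..F

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def min_12_matching(rows):
    """Cheapest edge set in which every vertex has degree 1 or 2 (brute force)."""
    n = len(rows)
    edges = [(hamming(rows[u], rows[v]), u, v)
             for u, v in combinations(range(n), 2)]
    best_weight, best_edges = None, None
    for mask in range(1, 1 << len(edges)):
        degree = [0] * n
        weight = 0
        chosen = []
        for i, (w, u, v) in enumerate(edges):
            if (mask >> i) & 1:
                degree[u] += 1
                degree[v] += 1
                weight += w
                chosen.append((u, v))
        if all(1 <= d <= 2 for d in degree):
            if best_weight is None or weight < best_weight:
                best_weight, best_edges = weight, chosen
    return best_weight, best_edges

weight, matching = min_12_matching(ROWS)
print(weight, matching)
```

Several edge sets tie for the minimum here (B can attach to A or to D, and C to E or to F), so only the total weight is determined.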

Properties

Each component has ≤ 3 nodes

[Figure: a component with > 3 nodes is either not optimal or not possible (degree ≤ 2).]

Cost ≤ 2 OPT

For binary alphabet: ≤ 1.5 OPT

Qualities

Pair joined by an edge of weight a:
  OPT pays: 2a
  We pay: 2a

Triple with selected edges p, q and third edge r ≥ p, q:
  OPT pays: ≥ p+q+r
  We pay: ≤ 3(p+q) ≤ 2(p+q+r)
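
The last inequality for a triple can be spelled out in one line, assuming the slide's condition on the third edge reads r ≥ p and r ≥ q:

```latex
\begin{align*}
3(p+q) &= 2(p+q) + (p+q) \\
       &\le 2(p+q) + 2r && \text{since } p \le r \text{ and } q \le r \\
       &= 2(p+q+r).
\end{align*}
```

So a group of three costs at most twice what the optimal solution pays for the same three rows.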


Open Problems

Can we improve O(k)?
Ω(k) for graph representation

1111111100000000000000000000000000000000
0000000011111111000000000000000000000000
0000000000000000111111110000000000000000
0000000000000000000000001111111100000000
0000000000000000000000000000000011111111

k = 5, d = 16, c = k d / 2

10101010101010101010101010101010
11001100110011001100110011001100
11110000111100001111000011110000
11111111000000001111111100000000
11111111111111110000000000000000

k = 5, d = 16, c = 2 d
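
The point of the two bit tables is that they give the same graph (every pair of rows is at Hamming distance d = 16) yet need very different amounts of suppression. The sketch below rebuilds both tables and checks this; it assumes c counts the columns that must be suppressed in every row when all k rows form one group, which is my reading of the slide. Names are illustrative.

```python
from itertools import combinations

K, BLOCK = 5, 8

# First table: row i has a block of eight 1s in block position i (40 columns).
table_a = ["0" * (BLOCK * i) + "1" * BLOCK + "0" * (BLOCK * (K - 1 - i))
           for i in range(K)]
# Second table: row i alternates runs of 2**i ones and 2**i zeros (32 columns).
table_b = [("1" * 2**i + "0" * 2**i) * (16 // 2**i) for i in range(K)]

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def pairwise_distances(table):
    return {hamming(u, v) for u, v in combinations(table, 2)}

def non_constant_columns(table):
    """Columns that must be suppressed if all rows are put in one group."""
    return sum(len(set(col)) > 1 for col in zip(*table))

print(pairwise_distances(table_a), non_constant_columns(table_a))
print(pairwise_distances(table_b), non_constant_columns(table_b))
```

Under this reading the first table needs all 40 = k d / 2 columns suppressed. The second comes out at 30 rather than the slide's 2 d = 32, because its first and last columns happen to be constant; either way the gap between the two tables grows linearly in k while the graph stays the same.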


Clustering Approach [AFKKPTZ 06]

Quasi-Identifiers    Sensitive Info
Age  Race   Zipcode  Disease
31   Cauc   94305    Flu
34   Cauc   94307    Cold
27   Cauc   94301    Diabetes
41   Afr-A  94305    Flu
41   Afr-A  94059    Arthritis
65   Hisp   94042    Heart problem
46   Hisp   94042    Arthritis

Transfers into a Metric…

Quasi-Identifiers    Sensitive Info
Age  Race   Zipcode  Disease
31   Cauc   94305    Flu
34   Cauc   94307    Cold
27   Cauc   94301    Diabetes
41   Afr-A  94305    Flu
41   Afr-A  94059    Arthritis
65   Hisp   94042    Heart problem
46   Hisp   94042    Arthritis

Clusters and Centers

Quasi-Identifiers    Sensitive Info
Age  Race   Zipcode  Disease
31   Cauc   94305    Flu
                     Cold
                     Diabetes
                     Flu
41   Afr-A  94059    Arthritis
                     Heart problem
46   Hisp   94042    Arthritis

Measure

How good are the clusters?
  “Tight” clusters are better
Minimize max radius: Gather-k
Minimize max distortion error: Cellular-k
  Σ radius × num_nodes
Cost (for the clustering in the figure):
  Gather-k: 10
  Cellular-k: 624
Handle outliers
Constant approximations!
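
Both measures are easy to compute once clusters and centers are fixed. The sketch below assumes Euclidean points and reads the Cellular-k cost as the sum over clusters of radius × num_nodes; the points are made up for illustration and are not the ones behind the figure's numbers.

```python
import math

def cluster_costs(clusters):
    """clusters: list of (center, members). Returns (Gather-k cost, Cellular-k cost)."""
    radii = [max(math.dist(center, p) for p in members)
             for center, members in clusters]
    gather = max(radii)                                   # maximum radius
    cellular = sum(r * len(members)                       # sum of radius * num_nodes
                   for r, (_, members) in zip(radii, clusters))
    return gather, cellular

# Hypothetical clustering: each center is one of its cluster's own points.
clusters = [
    ((0, 0), [(0, 0), (3, 4), (0, 2)]),   # radius 5, 3 nodes
    ((10, 0), [(10, 0), (10, 1)]),        # radius 1, 2 nodes
]
gather_cost, cellular_cost = cluster_costs(clusters)
print(gather_cost, cellular_cost)
```

One wide cluster dominates the Gather-k cost, while the Cellular-k cost also rewards keeping the other clusters tight.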

Comparison

k = 5

5-anonymity
  Suppress all entries
  More distortion
Clustering
  Can pick R5 as the center
  Less distortion
  Distortion is directly related with pair-wise distances

R1  0 1 1 1
R2  1 0 1 1
R3  1 1 0 1
R4  1 1 1 0
R5  1 1 1 1
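
The numbers behind this comparison can be worked out directly. In this sketch the suppression cost counts starred entries when all five rows form one group, and the clustering cost is radius × num_nodes with R5 as the center and Hamming distance as the metric; both readings are assumptions made for the illustration.

```python
ROWS = {"R1": "0111", "R2": "1011", "R3": "1101", "R4": "1110", "R5": "1111"}

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

# 5-anonymity: every column takes both values, so every entry is suppressed.
mixed_columns = sum(len(set(col)) > 1 for col in zip(*ROWS.values()))
suppressed = mixed_columns * len(ROWS)

# Clustering: publish R5 as the center together with the cluster radius.
radius = max(hamming(ROWS["R5"], row) for row in ROWS.values())
cluster_cost = radius * len(ROWS)

print(suppressed, radius, cluster_cost)
```

Suppression discards all 20 entries, while the cluster around R5 misstates each row in at most one position.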

Results [AFKKPTZ 06]

Gather-k
  Tight 2-approximation
  Extension to outliers: 4-approximation
Cellular-k
  Primal-dual constant approximation
  Extensions as well


2-approximation

Assume an optimal value R
Make sure each node has at least k - 1 neighbors within distance 2R.
Pick an arbitrary node as a center and remove all remaining nodes within distance 2R. Repeat until all nodes are gone.
Make sure we can reassign nodes to the selected centers.

[Figure: node A with circles of radius R and 2R.]

Example: k = 5

Optimal Solution
[Figure: two optimal clusters, 1 and 2, each of radius R.]

Center Selection
[Figure sequence: center 1 is picked and the nodes within distance 2R of it are removed; center 2 is then picked and treated the same way.]

Reassignment
[Figure: nodes reassigned between centers 1 and 2.]

Degree Constrained Matching
[Figure: matching between centers 1 and 2 and the remaining nodes; each center has degree ≥ k-1 and each node has degree = 1.]

Actual Clustering
[Figure: clusters 1 and 2 returned by the algorithm.]

Optimal Clustering
[Figure: optimal clusters 1 and 2.]

Our guarantees

Return clusters of radius no more than 2R
If R is guessed correctly, then reassignment is possible
Each cluster has at least k nodes
A binary search on the value of R suffices

Binary Search on R

Assume an optimal value R
Make sure each node has at least k - 1 neighbors within distance 2R.
  Not necessary, but is useful for quick pruning
Pick an arbitrary node as a center and remove all remaining nodes within distance 2R. Repeat until all nodes are gone.
Make sure we can reassign nodes to the selected centers.
  If successful, R could be smaller
  Otherwise, R should be larger
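
The steps above can be sketched end to end. This is an illustration under several assumptions rather than the authors' implementation: points are Euclidean tuples, candidate values of R are the pairwise distances, "arbitrary" center selection means lowest index first, and the reassignment check is done with a simple augmenting-path bipartite matching that gives every center k-1 slots. All names are made up.

```python
import math
from itertools import combinations

def try_radius(points, k, R):
    """Return {center: members} with radius <= 2R, or None if the guess R fails."""
    n = len(points)
    near = [[v for v in range(n)
             if v != u and math.dist(points[u], points[v]) <= 2 * R]
            for u in range(n)]
    if any(len(nb) < k - 1 for nb in near):      # quick pruning
        return None
    remaining, centers = set(range(n)), []
    for u in range(n):                           # arbitrary order: by index
        if u in remaining:
            centers.append(u)
            remaining -= {u, *near[u]}
    center_set = set(centers)
    slots = [c for c in centers for _ in range(k - 1)]
    owner = {}                                   # node -> slot index

    def augment(s, seen):
        for v in near[slots[s]]:
            if v in center_set or v in seen:
                continue
            seen.add(v)
            if v not in owner or augment(owner[v], seen):
                owner[v] = s
                return True
        return False

    if not all(augment(s, set()) for s in range(len(slots))):
        return None                              # some center cannot reach k nodes
    clusters = {c: [c] for c in centers}
    for v in range(n):
        if v in center_set:
            continue
        if v in owner:
            clusters[slots[owner[v]]].append(v)
        else:                                    # leftover: any center within 2R
            clusters[next(c for c in centers if v in near[c])].append(v)
    return clusters

def gather_k(points, k):
    """Binary search over candidate R; returns (R, clusters) or None."""
    candidates = sorted({math.dist(p, q) for p, q in combinations(points, 2)})
    lo, hi, best = 0, len(candidates) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        clusters = try_radius(points, k, candidates[mid])
        if clusters is not None:
            best, hi = (candidates[mid], clusters), mid - 1   # R could be smaller
        else:
            lo = mid + 1                                      # R should be larger
    return best

points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
R, clusters = gather_k(points, k=3)
print(R, clusters)
```

On these six made-up points the search settles on two clusters of three points each, both with at least k members and radius at most 2R.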

Results [AFKKPTZ 06]

Gather-k
  Tight 2-approximation
  Extension to outliers: 4-approximation
Cellular-k
  Primal-dual constant approximation
  Extensions

Ignore Cluster Size Constraint

Similar to Facility Location
  Σ radius × num_nodes vs. Σ individual_distance_to_center
Caveat
  Assigning one distant node to an existing cluster will increase cost proportional to number of nodes in that cluster
Each cluster is a (center, radius) pair

Intermediate Step I

Primal-dual constant approximation for Σ radius × num_nodes
  No cluster size constraint
  Arbitrary cluster setup cost
We want Σ radius × num_nodes
  Cluster size constraint
  No cluster setup cost

Enforce Cluster Size

Introduce extra cluster setup cost
Setup cost pays for k nodes to join a particular cluster, i.e., c_setup = k × r
This at most doubles the actual cost of any size constrained cluster solution
Each cluster’s total cost is at least k × r
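
The doubling claim is a one-line calculation. Writing r_i for the radius and n_i ≥ k for the number of nodes of cluster i in any size-constrained solution (notation introduced here for the illustration):

```latex
\begin{align*}
\sum_i \bigl( r_i n_i + c_{\text{setup},i} \bigr)
  &= \sum_i \bigl( r_i n_i + k\, r_i \bigr) \\
  &\le \sum_i \bigl( r_i n_i + n_i r_i \bigr) && \text{since } n_i \ge k \\
  &= 2 \sum_i r_i n_i .
\end{align*}
```

So adding the setup cost to a size-constrained solution raises its cost by at most a factor of two.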

Intermediate Step II

Shared solution!
For each cluster with fewer than k nodes, additional nodes can join the cluster
  At no additional cost, paid for by the cluster setup cost
Now nodes could be shared among multiple clusters
Key: convert a “shared” solution to a disjoint solution

Separation

Starting from small radius clusters
“Open” as long as there are enough nodes
The left over points in clusters “attach” to the intersecting smaller radius (open) clusters

[Figure: clusters labelled Open and Attached.]

Regroup (k = 5)

Open cluster has ≥ k nodes
Attached cluster has < k nodes
Group clusters to create bigger ones
Choose the “fat” cluster’s center as the new center

[Figure: clusters with 3, 2, 4 and 6 nodes.]

What About Cluster Cost?

These clusters intersect the open cluster
Routing cost is only a constant blowup w.r.t. the fat radius
Need to make sure the merged cluster is of reasonable size

Recap

Anonymize the quasi-identifiers
  Suppress information
  Privacy guarantee: anonymity
  Quality: the amount of suppressed information
Clustering
  Privacy guarantee: cluster size
  Quality: various clustering measures

Thanks!