Towards Achieving Anonymity

An Zhu

Introduction

Collect and analyze personal data
  Infer trends and patterns

Making the personal data “public”
  Joining multiple sources
  Third party involvement
  Privacy concerns
  Q: How to share such data?

Example: Medical Records

Identifiers: SSN, Name. Sensitive Info: Disease.

SSN  Name   Age  Race   Zipcode  Disease
614  Sara   31   Cauc   94305    Flu
615  Joan   34   Cauc   94307    Cold
629  Kelly  27   Cauc   94301    Diabetes
710  Mike   41   Afr-A  94305    Flu
840  Carl   41   Afr-A  94059    Arthritis
780  Joe    65   Hisp   94042    Heart problem
616  Rob    46   Hisp   94042    Arthritis

De-identified Records

Sensitive Info: Disease.

Age  Race   Zipcode  Disease
31   Cauc   94305    Flu
34   Cauc   94307    Cold
27   Cauc   94301    Diabetes
41   Afr-A  94305    Flu
41   Afr-A  94059    Arthritis
65   Hisp   94042    Heart problem
46   Hisp   94042    Arthritis

Not Sufficient! [Sweeney 00’]

Quasi-Identifiers: Age, Race, Zipcode. Sensitive Info: Disease.

Age  Race   Zipcode  Disease
31   Cauc   94305    Flu
34   Cauc   94307    Cold
41   Afr-A  94305    Flu
46   Hisp   94042    Arthritis

[Figure: the table linked to a Public Database; the quasi-identifiers act as Unique Identifiers!]

Anonymize the Quasi-Identifiers!

Age  Race  Zipcode  Disease
***  ***   ***      Flu
***  ***   ***      Cold
***  ***   ***      Diabetes
***  ***   ***      Flu
***  ***   ***      Arthritis
***  ***   ***      Heart problem
***  ***   ***      Arthritis

[Figure labels: Public Database, Unique Identifiers!]

Q: How to share such data?

Anonymize the quasi-identifiers
  Suppress information
  Privacy guarantee: anonymity
  Quality: the amount of suppressed information

Clustering
  Privacy guarantee: cluster size
  Quality: various clustering measures

k-anonymized Table [Samarati 01’]

Each row is identical to at least k-1 other rows

Age  Race   Zipcode  Disease
*    Cauc   *        Flu
*    Cauc   *        Cold
*    Cauc   *        Diabetes
41   Afr-A  *        Flu
41   Afr-A  *        Arthritis
*    Hisp   94042    Heart problem
*    Hisp   94042    Arthritis

Definition: k-anonymity

Input: a table consisting of n rows, each with m attributes (quasi-identifiers)

Output: a table with some entries suppressed such that each row is identical to at least k-1 other rows

Objective: minimize the number of suppressed entries
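The definition can be sketched in a few lines of Python. This is only an illustration: the function names and the hand-picked grouping are assumptions, and the rows are the quasi-identifiers from the medical-records example.

```python
from collections import Counter

SUPPRESSED = "*"

def suppress_groups(rows, groups):
    """Within each group, star out every column on which the group's rows disagree."""
    out = [list(r) for r in rows]
    for group in groups:
        for col in range(len(rows[0])):
            if len({rows[i][col] for i in group}) > 1:
                for i in group:
                    out[i][col] = SUPPRESSED
    return [tuple(r) for r in out]

def is_k_anonymous(rows, k):
    """Each row must be identical to at least k-1 other rows."""
    return all(count >= k for count in Counter(rows).values())

def suppression_cost(rows):
    """Objective value: the number of suppressed entries."""
    return sum(cell == SUPPRESSED for row in rows for cell in row)

# Quasi-identifiers (Age, Race, Zipcode) from the medical-records example.
table = [
    ("31", "Cauc", "94305"),
    ("34", "Cauc", "94307"),
    ("27", "Cauc", "94301"),
    ("41", "Afr-A", "94305"),
    ("41", "Afr-A", "94059"),
    ("65", "Hisp", "94042"),
    ("46", "Hisp", "94042"),
]
groups = [[0, 1, 2], [3, 4], [5, 6]]  # hypothetical grouping, chosen by hand
anonymized = suppress_groups(table, groups)

print(is_k_anonymous(anonymized, 2))  # True
print(suppression_cost(anonymized))   # 10
```

With this grouping the output is 2-anonymous and suppresses 10 entries. The hard part, which the sketch leaves out, is choosing the groups so that the cost is minimal.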

Past Work and New Results

[MW 04’]
  NP-hardness for a large size alphabet
  O(k log k)-approximation

[AFKMPTZ 05’]
  NP-hardness even for ternary alphabet
  O(k)-approximation
  1.5-approximation for 2-anonymity
  2-approximation for 3-anonymity

Graph Representation

0 0 1 0 0 0
1 0 0 1 0 1
0 1 0 1 0 1
0 0 1 0 0 0
1 1 0 1 1 1
0 1 1 0 1 1

Each row is a vertex; W(e) = Hamming distance between the two rows
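A minimal sketch of the edge weights for the six rows above. The helper names are illustrative, and rows are indexed from 0.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions on which two equal-length rows differ."""
    return sum(x != y for x, y in zip(a, b))

def edge_weights(rows):
    """Complete graph on the rows: W(e) is the Hamming distance of e's endpoints."""
    return {(i, j): hamming(rows[i], rows[j])
            for i, j in combinations(range(len(rows)), 2)}

rows = ["001000", "100101", "010101", "001000", "110111", "011011"]
weights = edge_weights(rows)

print(weights[(0, 3)])  # 0: the first and fourth rows are identical
print(weights[(1, 2)])  # 2
```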

Edge Selection I

Each node selects the lightest weight edge

Edge Selection II

For components with < k vertices, add more edges
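The two edge-selection steps can be sketched as one loop: while some component has fewer than k vertices, a vertex of that component that has not yet selected an edge picks its lightest edge leaving the component. This is a simplified illustration, not the authors' exact procedure. The union-find bookkeeping, the tie-breaking (lowest index first) and the function names are assumptions.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def build_forest(rows, k):
    """Grow a forest until every component has at least k vertices."""
    n = len(rows)
    if n < k:
        raise ValueError("need at least k rows")
    parent = list(range(n))  # union-find over row indices

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    def components():
        comps = {}
        for v in range(n):
            comps.setdefault(find(v), []).append(v)
        return list(comps.values())

    out_edge = {}  # each vertex selects at most one edge
    while True:
        small = [c for c in components() if len(c) < k]
        if not small:
            break
        # A tree on |c| vertices has |c|-1 selected edges, so one vertex is still free.
        u = next(v for v in small[0] if v not in out_edge)
        root = find(u)
        outside = [v for v in range(n) if find(v) != root]
        v = min(outside, key=lambda x: hamming(rows[u], rows[x]))
        out_edge[u] = v
        parent[root] = find(v)
    edges = [(u, v, hamming(rows[u], rows[v])) for u, v in out_edge.items()]
    return edges, components()

rows = ["001000", "100101", "010101", "001000", "110111", "011011"]
edges, comps = build_forest(rows, k=3)
print(comps)  # [[0, 3, 5], [1, 2, 4]]
print(edges)  # [(0, 3, 0), (3, 5, 3), (1, 2, 2), (2, 4, 2)]
```

Because a small component has fewer than k vertices, the lightest edge leaving it is never heavier than the selecting vertex's (k-1)st lightest edge.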

Lemma

Total weight of edges selected is no more than OPT
  In the optimal solution, each vertex pays at least the weight of the (k-1)st lightest weight edge
  Forest: at most one edge per vertex
  By construction, the edge weight is no more than the (k-1)st lightest weight edge per vertex

Grouping

Ideally, each connected component forms a group
Anonymize vertices within a group
Total cost of a group: at most (total edge weights) × (number of nodes), e.g. (2+2+3+3) × 6
Small groups: O(k)

Dividing a Component

Root tree arbitrarily
Divide if sub-trees & rest ≥ k
Aim: all sub-trees < k
Rotate the tree if necessary
Termination condition: max(2k-1, 3k-5)

An Example

0 0 1 0 0 0
1 0 0 1 0 1
0 1 0 1 0 1
0 0 1 0 0 0
1 1 0 1 1 1
0 1 1 0 1 1

Estimated cost: 4×3 + 3×3

0 * 1 0 * *
* * 0 1 * 1
* * 0 1 * 1
0 * 1 0 * *
* * 0 1 * 1
0 * 1 0 * *

Optimal cost: 3×3 + 3×3
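The two costs can be checked numerically, reading the slide's figures as 4×3 + 3×3 (estimate) and 3×3 + 3×3 (optimal). The grouping and the tree edges below are assumptions: they are one choice that reproduces those numbers, not something the slide states.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def group_cost(rows, group):
    """Entries suppressed when all rows of the group are made identical."""
    columns = range(len(rows[0]))
    differing = sum(len({rows[i][c] for i in group}) > 1 for c in columns)
    return differing * len(group)

def tree_estimate(rows, group, tree_edges):
    """Upper bound: (total edge weight of the group's tree) x (number of nodes)."""
    return sum(hamming(rows[u], rows[v]) for u, v in tree_edges) * len(group)

rows = ["001000", "100101", "010101", "001000", "110111", "011011"]

# Assumed groups (0-indexed rows) with an assumed spanning tree for each.
groups = {
    (1, 2, 4): [(1, 2), (2, 4)],
    (0, 3, 5): [(0, 3), (3, 5)],
}

estimated = sum(tree_estimate(rows, g, t) for g, t in groups.items())
actual = sum(group_cost(rows, g) for g in groups)
print(estimated)  # 21 = 4*3 + 3*3
print(actual)     # 18 = 3*3 + 3*3
```

The estimate overshoots because a column can differ along two tree edges but only needs to be suppressed once.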

1.5-approximation

0 0 1 0 0 0
0 0 0 0 0 0
1 1 1 1 1 1
0 0 1 0 0 0
1 1 0 1 1 1

W(e) = Hamming distance between the two rows

Minimum {1,2}-matching

Each vertex is matched to 1 or 2 other vertices

Properties

Each component has at most 3 nodes
  [Figure labels: Not optimal; Not possible (degree ≤ 2)]
Cost ≤ 2 OPT
For binary alphabet: ≤ 1.5 OPT

Qualities

Edge of weight a: OPT pays: 2a; we pay: 2a
Triangle with r ≥ p, q: OPT pays: at least p+q+r; we pay: at most 3(p+q) ≤ 2(p+q+r), since p+q ≤ 2r

Open Problems

Can we improve O(k)?
  Ω(k) for graph representation

1111111100000000000000000000000000000000
0000000011111111000000000000000000000000
0000000000000000111111110000000000000000
0000000000000000000000001111111100000000
0000000000000000000000000000000011111111

k = 5, d = 16, c = k d / 2

10101010101010101010101010101010
11001100110011001100110011001100
11110000111100001111000011110000
11111111000000001111111100000000
11111111111111110000000000000000

k = 5, d = 16, c = 2 d
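A quick check of the two instances, reading each flattened bit string as a table of 5 rows (40 and 32 columns). That reading is an interpretation of the slide, and the names below are illustrative. Both tables have every pairwise distance equal to d = 16, so their graph representations are identical, yet their widths are k d / 2 = 40 and 2 d = 32.

```python
from itertools import combinations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def pairwise_distances(rows):
    """Set of all distinct pairwise Hamming distances."""
    return {hamming(a, b) for a, b in combinations(rows, 2)}

def disagreeing_columns(rows):
    """Columns that must be suppressed to make all rows identical."""
    return sum(len(set(col)) > 1 for col in zip(*rows))

# First instance: row i carries eight 1s in its own block of columns.
blocks = ["0" * (8 * i) + "1" * 8 + "0" * (8 * (4 - i)) for i in range(5)]
# Second instance: stripes of width 1, 2, 4, 8 and 16.
stripes = ["10" * 16, "1100" * 8, "11110000" * 4,
           ("1" * 8 + "0" * 8) * 2, "1" * 16 + "0" * 16]

print(pairwise_distances(blocks), len(blocks[0]))    # {16} 40
print(pairwise_distances(stripes), len(stripes[0]))  # {16} 32
print(disagreeing_columns(blocks))   # 40
print(disagreeing_columns(stripes))  # 30
```

In the second table the first and last columns are constant, so 30 of its 32 columns disagree. An algorithm that only sees the edge weights cannot tell the two tables apart.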

Clustering Approach [AFKKPTZ 06’]

31  Cauc   94305  Flu
34  Cauc   94307  Cold
41  Afr-A  94305  Flu

Transfers into a Metric…

Clusters and Centers

[Figure: clusters and their centers, e.g. center 31 Cauc 94305 with Flu, Diabetes, Heart problem]

Measure

How good are the clusters?
  “Tight” clusters are better
  Minimize max radius: Gather-k
  Minimize max distortion error: Cellular-k
    radius × num_nodes

[Figure example] Gather-k: 10, Cellular-k: 624

Handle outliers
Constant approximations!
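A small sketch of the two measures. The cluster radii and sizes are illustrative values that do not appear in the slide text; they were picked so that the results match the slide's 10 and 624. Treating the Cellular-k measure as the sum of radius × num_nodes over all clusters is likewise an assumption.

```python
def gather_measure(clusters):
    """Gather-k: the largest cluster radius."""
    return max(radius for radius, _ in clusters)

def cellular_measure(clusters):
    """Cellular-k (assumed form): sum over clusters of radius x num_nodes."""
    return sum(radius * num_nodes for radius, num_nodes in clusters)

# Hypothetical clusters as (radius, num_nodes) pairs.
clusters = [(10, 50), (5, 20), (3, 8)]

print(gather_measure(clusters))    # 10
print(cellular_measure(clusters))  # 624
```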

Comparison

k = 5

R1  0 1 1 1
R2  1 0 1 1
R3  1 1 0 1
R4  1 1 1 0
R5  1 1 1 1

5-anonymity
  Suppress all entries
  More distortion

Clustering
  Can pick R5 as the center
  Less distortion
  Distortion is directly related with pair-wise distances
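The comparison can be reproduced on the five rows R1 to R5. The sketch below counts the entries that 5-anonymity by suppression has to hide and computes the radius of a single cluster centered at R5; the variable names are illustrative.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

rows = {"R1": "0111", "R2": "1011", "R3": "1101", "R4": "1110", "R5": "1111"}
values = list(rows.values())

# 5-anonymity: all five rows must become identical, so every column on
# which any two rows disagree is suppressed in every row.
disagreeing = sum(len(set(col)) > 1 for col in zip(*values))
suppressed_entries = disagreeing * len(values)

# Clustering: publish R5 as the center; the radius is the largest distance to it.
radius = max(hamming(rows["R5"], r) for r in values)

print(suppressed_entries)  # 20: all entries are suppressed
print(radius)              # 1
```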

Results [AFKKPTZ 06’]

Gather-k
  Tight 2-approximation
  Extension to outliers: 4-approximation

Cellular-k
  Primal-dual const. approximation
  Extensions as well

2-approximation

Assume an optimal value R
Make sure each node has at least k-1 neighbors within distance 2R.
Pick an arbitrary node as a center and remove all remaining nodes within distance 2R. Repeat until all nodes are gone.
Make sure we can reassign nodes to the selected centers.

Example: k = 5

[Figures: Optimal Solution; Center Selection; Reassignment; Degree Constrained Matching (≥ k-1); Actual Clustering; Optimal Clustering]

Our guarantees

Return clusters of radius no more than 2R
If R is guessed correctly, then reassignment is possible
  Each cluster has at least k nodes
A binary search on the value of R suffices

Binary Search on R

Make sure each node has at least k-1 neighbors within distance 2R.
  Not necessary, but is useful for quick pruning
Pick an arbitrary node as a center and remove all remaining nodes within distance 2R. Repeat until all nodes are gone.
Make sure we can reassign nodes to the selected centers.
  If successful, R could be smaller
  Otherwise, R should be larger
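The steps above can be sketched as follows. This is a hedged illustration, not the authors' implementation. The reassignment check uses a plain augmenting-path bipartite matching with k slots per center as a stand-in for the degree constrained matching. Candidate values of R are taken from the pairwise distances, and the points on a line are made up. The sketch assumes at least k nodes.

```python
def select_centers(dist, r2):
    """Pick an arbitrary remaining node as a center, drop everything within r2 of it."""
    remaining = list(range(len(dist)))
    centers = []
    while remaining:
        c = remaining[0]
        centers.append(c)
        remaining = [v for v in remaining if dist[c][v] > r2]
    return centers

def can_reassign(dist, centers, r2, k):
    """True if every center can be given k distinct nodes within distance r2."""
    n = len(dist)
    slots = [c for c in centers for _ in range(k)]  # k slots per center
    owner = [None] * n                              # node -> slot holding it

    def try_fill(slot, seen):
        c = slots[slot]
        for v in range(n):
            if dist[c][v] <= r2 and v not in seen:
                seen.add(v)
                if owner[v] is None or try_fill(owner[v], seen):
                    owner[v] = slot
                    return True
        return False

    return all(try_fill(s, set()) for s in range(len(slots)))

def feasible(dist, k, r):
    r2 = 2 * r
    n = len(dist)
    # Quick pruning: every node needs k-1 neighbors (k nodes counting itself) within 2R.
    if any(sum(dist[u][v] <= r2 for v in range(n)) < k for u in range(n)):
        return False
    return can_reassign(dist, select_centers(dist, r2), r2, k)

def guess_radius(dist, k):
    """Binary search over candidate R; the resulting clusters have radius at most 2R."""
    candidates = sorted({d for row in dist for d in row})
    lo, hi = 0, len(candidates) - 1  # the largest candidate succeeds when n >= k
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(dist, k, candidates[mid]):
            hi = mid      # successful: R could be smaller
        else:
            lo = mid + 1  # otherwise: R should be larger
    return candidates[lo]

points = [0, 1, 2, 10, 11, 12]  # hypothetical points on a line
dist = [[abs(a - b) for b in points] for a in points]
print(guess_radius(dist, k=3))  # 1
```

Nodes left unmatched after the check can join any center within distance 2R; one always exists because each node was removed by some selected center.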

Results [AFKKPTZ 06’]

Gather-k
  Tight 2-approximation
  Extension to outliers: 4-approximation

Cellular-k
  Primal-dual const. approximation
  Extensions

Ignore Cluster Size Constraint

Similar to Facility Location
  radius × num_nodes vs. individual_distance_to_center
Caveat
  Assigning one distant node to an existing cluster will increase cost proportional to number of nodes in that cluster
Each cluster is a (center, radius) pair
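The caveat is easy to see with numbers. In the sketch below a cluster is given as the list of its nodes' distances to the center; the distances and function names are made up for illustration.

```python
def cellular_cost(distances):
    """radius x num_nodes, where the radius is the largest distance to the center."""
    return max(distances) * len(distances)

def facility_location_cost(distances):
    """Sum of the individual distances to the center."""
    return sum(distances)

cluster = [1] * 10          # hypothetical: ten nodes at distance 1
with_far = cluster + [5]    # one extra node at distance 5

print(cellular_cost(cluster), cellular_cost(with_far))                    # 10 55
print(facility_location_cost(cluster), facility_location_cost(with_far))  # 10 15
```

One distant node raises the radius for every node already in the cluster, so the radius × num_nodes cost jumps by 45 while the facility-location style cost grows by only 5.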

Intermediate Step I

Primal-dual constant approximation for
  radius × num_nodes
  No cluster size constraint
  Arbitrary cluster setup cost

We want
  radius × num_nodes
  Cluster size constraint
  No cluster setup cost

Enforce Cluster Size

Introduce extra cluster setup cost
  Setup cost pays for k nodes to join a particular cluster, i.e., c_setup = k × r
This at most doubles the actual cost of any size constrained cluster solution
  Each cluster’s total cost is at least k × r

Intermediate Step II

Shared solution!
  For each cluster with fewer than k nodes, additional nodes can join the cluster
  At no additional cost, paid for by the cluster setup cost
  Now nodes could be shared among multiple clusters
Key: convert a “shared” solution to a disjoint solution

Separation

Starting from small radius clusters
“Open” as long as there are enough nodes
The left over points in clusters “attach” to the intersecting smaller radius (open) clusters

[Figure labels: Attached]

Regroup (k = 5)

Open cluster has ≥ k nodes
Attached cluster has < k nodes
Group clusters to create bigger ones
Choose the “fat” cluster’s center as the new center

What About Cluster Cost?

These clusters intersect with the open cluster
Routing cost is only a constant blowup w.r.t. the fat radius
Need to make sure the merged cluster is of reasonable size

Recap

Thanks!
