A Novel Anonymization Technique for Privacy Preserving Data Publishing


Page 1: A Novel Anonymization Technique for Privacy Preserving Data Publishing

1

A Novel Anonymization Technique for Privacy Preserving Data Publishing

Page 2: A Novel Anonymization Technique for Privacy Preserving Data Publishing

2

Introduction

• Publishing data without safeguards may lead to insufficient protection, so privacy must be provided when publishing microdata.

• Privacy protection is an important issue in data processing; data must be anonymized to protect privacy.

• Privacy preserving data publishing provides methods for publishing useful information while preserving privacy.

Page 3: A Novel Anonymization Technique for Privacy Preserving Data Publishing

3

• Here microdata contains records, each of which holds information about an individual entity such as a person or an organization.

• The data mining community has focused on hiding sensitive rules generated from transactional databases.

• Before anonymizing the data, one can analyze its characteristics and use those characteristics during anonymization.

Page 4: A Novel Anonymization Technique for Privacy Preserving Data Publishing

4

Literature Survey

• In both the generalization and bucketization approaches, attributes are partitioned into three categories:

1) Some attributes are identifiers that can uniquely identify an individual, such as Name or Social Security Number.

2) Some attributes are Quasi-Identifiers (QIs), which the adversary may already know and which can potentially identify an individual, e.g., Birthdate, Sex, and Zipcode.

Page 5: A Novel Anonymization Technique for Privacy Preserving Data Publishing

5

3) Some attributes are Sensitive Attributes (SAs), which are unknown to the adversary and are considered sensitive, such as Disease and Salary.

• In generalization and bucketization, one first removes identifiers from the data and then partitions tuples into buckets.

• Generalization transforms the QI-values in each bucket.

• In bucketization, one separates the SAs from the QIs by randomly permuting the SA values within each bucket.
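
A minimal Python sketch of these two steps, assuming a toy table whose QIs are Age, Sex, and Zipcode and whose SA is Disease (the names follow the example shown later in the deck; the values are made up):

import random

# Toy microdata: QIs = Age, Sex, Zipcode; SA = Disease (illustrative values).
table = [
    {"Name": "Alice", "Age": 22, "Sex": "F", "Zipcode": "47906", "Disease": "flu"},
    {"Name": "Bob",   "Age": 24, "Sex": "M", "Zipcode": "47906", "Disease": "dyspepsia"},
    {"Name": "Carol", "Age": 52, "Sex": "F", "Zipcode": "47905", "Disease": "bronchitis"},
    {"Name": "Dave",  "Age": 54, "Sex": "M", "Zipcode": "47905", "Disease": "flu"},
]

# Step 1 (common to both techniques): remove identifiers, partition into buckets.
for row in table:
    del row["Name"]
buckets = [table[:2], table[2:]]

# Bucketization: QI values stay exact; the SA values are randomly permuted
# within each bucket, breaking the exact QI-SA link.
for bucket in buckets:
    diseases = [row["Disease"] for row in bucket]
    random.shuffle(diseases)
    for row, d in zip(bucket, diseases):
        row["Disease"] = d

# Generalization would instead coarsen the QI values in each bucket,
# e.g. Age 22 -> "[20-29]" and Zipcode "47906" -> "4790*".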

Page 6: A Novel Anonymization Technique for Privacy Preserving Data Publishing

6

Privacy threats

• When publishing microdata, there are three types of privacy disclosure threats. They are as follows:

1) Membership disclosure

2) Identity disclosure

3) Attribute disclosure

Page 7: A Novel Anonymization Technique for Privacy Preserving Data Publishing

7

Problem Specification

• The anonymization techniques for privacy preserving microdata publishing are:

1. Generalization

2. Bucketization

• Generalization loses a considerable amount of information, especially for high-dimensional data, due to the curse of dimensionality.

• Bucketization has better data utility than generalization but has some drawbacks.

Page 8: A Novel Anonymization Technique for Privacy Preserving Data Publishing

8

• It does not prevent membership disclosure.

• In many data sets, it is unclear which attributes are QIs and which are SAs.

• It requires a clear separation between QIs and SAs.

• By separating the QI attributes from the SAs, it breaks the correlations between attributes.

• To overcome the drawbacks of the above approaches, a novel technique called slicing is implemented.

Page 9: A Novel Anonymization Technique for Privacy Preserving Data Publishing

9

Slicing

• Slicing partitions the dataset both vertically and horizontally.

• Slicing groups the attributes into columns; each column contains a subset of attributes (the vertical partition).

• Slicing also partitions the tuples into buckets; each bucket contains a subset of tuples (the horizontal partition).

• Within each bucket, values in each column are randomly permuted to break the linking between different columns, as in the sketch below.
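
A minimal Python sketch of slicing, assuming the column grouping and the buckets are already chosen (attribute names follow the deck's example; the values are illustrative):

import random

rows = [
    {"Age": 22, "Sex": "M", "Zipcode": "47906", "Disease": "dyspepsia"},
    {"Age": 22, "Sex": "F", "Zipcode": "47906", "Disease": "flu"},
    {"Age": 33, "Sex": "F", "Zipcode": "47905", "Disease": "flu"},
    {"Age": 52, "Sex": "F", "Zipcode": "47905", "Disease": "bronchitis"},
]
columns = [("Age", "Sex"), ("Zipcode", "Disease")]  # vertical partition
buckets = [rows[:2], rows[2:]]                      # horizontal partition

sliced = []
for bucket in buckets:
    # Permute each column independently to break the cross-column linking.
    pieces = []
    for col in columns:
        vals = [tuple(r[a] for a in col) for r in bucket]
        random.shuffle(vals)
        pieces.append(vals)
    # Reassemble the bucket row by row from the permuted column pieces.
    for i in range(len(bucket)):
        rec = {}
        for col, vals in zip(columns, pieces):
            rec.update(dict(zip(col, vals[i])))
        sliced.append(rec)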

Page 10: A Novel Anonymization Technique for Privacy Preserving Data Publishing

10

Slicing Algorithms

• Our algorithm consists of three phases:

Attribute partitioning

Column generalization

Tuple partitioning

Page 11: A Novel Anonymization Technique for Privacy Preserving Data Publishing

11

Flow Chart for Project Design

Page 12: A Novel Anonymization Technique for Privacy Preserving Data Publishing

12

Algorithm

• The algorithm maintains two data structures: a queue of buckets Q and a set of sliced buckets SB.

• Initially, Q contains only one bucket which includes all tuples and SB is empty.

• In each iteration, the algorithm removes a bucket from Q and splits the bucket into two buckets.

• If the sliced table after the split satisfies ℓ-diversity, then the algorithm puts the two buckets at the end of the queue Q.

Page 13: A Novel Anonymization Technique for Privacy Preserving Data Publishing

13

Cont..

• Otherwise, we cannot split the bucket anymore and the algorithm puts the bucket into SB.

• When Q becomes empty, we have computed the sliced table.

• The set of sliced buckets is SB. The main part of the tuple-partition algorithm is to check whether a sliced table satisfies ℓ-diversity.

Page 14: A Novel Anonymization Technique for Privacy Preserving Data Publishing

14

Tuple Partitioning Algorithm

Algorithm tuple-partition(T, ℓ)

1. Q = {T}; SB = ∅.
2. while Q is not empty
3.     remove the first bucket B from Q; Q = Q − {B}.
4.     split B into two buckets B1 and B2.
5.     if diversity-check(T, Q ∪ {B1, B2} ∪ SB, ℓ)
6.         Q = Q ∪ {B1, B2}.
7.     else SB = SB ∪ {B}.
8. return SB.
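
The same loop as a hedged Python sketch; split() and diversity_check() are assumed helpers standing in for lines 4 and 5 (a Mondrian-style median split is one common choice for split):

from collections import deque

def tuple_partition(T, ell, split, diversity_check):
    # T: the full table (list of tuples). split(B) -> (B1, B2) is an assumed
    # helper; diversity_check(T, buckets, ell) stands in for the check on the
    # next slide.
    Q = deque([list(T)])   # queue of buckets still to be considered
    SB = []                # finalised (sliced) buckets
    while Q:
        B = Q.popleft()
        B1, B2 = split(B)
        # Accept the split only if the whole candidate sliced table
        # (remaining queue + the two halves + finished buckets) stays l-diverse.
        if B1 and B2 and diversity_check(T, list(Q) + [B1, B2] + SB, ell):
            Q.extend([B1, B2])
        else:
            SB.append(B)   # B cannot be split any further
    return SB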

Page 15: A Novel Anonymization Technique for Privacy Preserving Data Publishing

15

Diversity Check Algorithm

Algorithm diversity-check(T, T*, ℓ)

1. for each tuple t ∈ T, L[t] = ∅.
2. for each bucket B in T*
3.     record f(v) for each column value v in bucket B.
4.     for each tuple t ∈ T
5.         calculate p(t, B) and find D(t, B).
6.         L[t] = L[t] ∪ {(p(t, B), D(t, B))}.
7. for each tuple t ∈ T
8.     calculate p(t, s) for each sensitive value s based on L[t].
9.     if p(t, s) ≥ 1/ℓ, return false.
10. return true.
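
The quantities f(v), p(t, B), D(t, B), and p(t, s) come from the slicing paper; the Python sketch below is a simplified reading rather than the exact computation: a tuple's weight in a bucket is taken as the expected number of permuted rows matching all of its column pieces, and its candidate sensitive values are those that co-occur with its QI piece in the column holding the SA. The argument layout (records as dicts, columns as lists of attribute names) is an assumption for illustration.

from collections import Counter

def diversity_check(records, buckets, columns, sa_name, ell):
    # records: original tuples as dicts; buckets: horizontal partition;
    # columns: attribute-name lists (vertical partition); sa_name: the SA.
    sa_col = next(c for c in columns if sa_name in c)
    qi_in_sa_col = [a for a in sa_col if a != sa_name]
    for t in records:
        weights, dists = [], []
        for B in buckets:
            if not B:
                continue
            # Expected number of permuted rows of B matching all of t's pieces.
            w = float(len(B))
            for col in columns:
                piece = tuple(t[a] for a in col if a != sa_name)
                if piece:
                    hits = sum(1 for r in B
                               if tuple(r[a] for a in col if a != sa_name) == piece)
                    w *= hits / len(B)
            # Candidate SA values: those paired with t's QI piece in the SA column.
            cand = [r[sa_name] for r in B
                    if all(r[a] == t[a] for a in qi_in_sa_col)]
            if w > 0 and cand:
                weights.append(w)
                dists.append(Counter(cand))
        if not weights:
            continue
        total = sum(weights)
        p_ts = Counter()
        for w, d in zip(weights, dists):
            for s, c in d.items():
                p_ts[s] += (w / total) * (c / sum(d.values()))
        if any(p >= 1.0 / ell for p in p_ts.values()):
            return False
    return True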

Page 16: A Novel Anonymization Technique for Privacy Preserving Data Publishing

16

Examples for Anonymization Techniques

• This is the original microdata table; the following slides show its anonymized versions produced by the different anonymization techniques.

Page 17: A Novel Anonymization Technique for Privacy Preserving Data Publishing

17

• The figure above consists of QIs and an SA: Age, Sex, and Zipcode are QIs and Disease is the SA, together with a generalized table that satisfies 4-anonymity.

Page 18: A Novel Anonymization Technique for Privacy Preserving Data Publishing

18

• The dataset above shows the bucketized table, which satisfies 2-diversity.

Page 19: A Novel Anonymization Technique for Privacy Preserving Data Publishing

19

The tables above show multiset-based generalization and one-attribute-per-column slicing; the table below shows the sliced table.

Page 20: A Novel Anonymization Technique for Privacy Preserving Data Publishing

20

Design and Implementation

The use case diagram shows the activities of the user. The user can protect the privacy of the microdata by generalizing the records, dividing them into a number of buckets, and breaking the correlation between the attributes. The attributes of the table are sliced by applying random permutation and a probability function.

Use case diagram: the actor User interacts with the use cases Data Slicing, Diverse Slicing, Correlation Measure, Attribute Clustering, and Tuple Partitioning.

Page 21: A Novel Anonymization Technique for Privacy Preserving Data Publishing

21

Class diagram

The class diagram shows how the probability functions are calculated and how values are randomly permuted. It has five classes, each with its attributes and the methods used to retrieve data and apply the operations.

The five classes and their members (summarized from the diagram):

• CorrelationMeasure: attributes[], mscdomain[], partitionset[]; methods RetrieveAttributes(), ComputeDomain(), CalculateMSC(), ApplyDiscretization(), PartitionsAttributeDomain()

• TuplePartitioning: tuples[], probability, ldiversity; methods CreateQueueBuckets(), AssignTuples(), RemoveBucket(), CheckDiversity(), RecordMatchingProbability()

• AttributeClustering: centroid, attribcorellation; methods RetrieveCorrelation(), ComputeCentroids(), CheckCorrelation(), ApplyClustering()

• DataSlicing: partSet[], tuples, AttribTemp; methods SelectAttributes(), ReplaceMultiset(), PartitionsAttributes(), PartitionTuples(), RandomPermuteTuple()

• DiverseSlicing: buckets[], bucketID, probablity; methods DetermineMatchingBuckets(), AssignProbability(), ComputeProbabilitySensitive(), CheckPotentialMatch()

The classes are connected by 1..* associations.

Page 22: A Novel Anonymization Technique for Privacy Preserving Data Publishing

22

Modules

• Data slicing

• Diverse slicing

• Correlation measure

• Attribute clustering

• Tuple partitioning

Page 23: A Novel Anonymization Technique for Privacy Preserving Data Publishing

23

• The attributes of the tables are used for slicing and bucketization.

• Slicing first partitions attributes into columns and then partitions tuples into buckets.

• Diverse slicing extends the above analysis to the general case, introducing the notion of ℓ-diverse slicing and applying the probability function to the attributes.

Page 24: A Novel Anonymization Technique for Privacy Preserving Data Publishing

24

• Two correlation measures are used: one for pairs of continuous attributes and one for pairs of categorical attributes (a sketch follows this list).

• After computing the correlation for each pair of attributes, we use clustering to partition the attributes into columns.

• After that, tuple partitioning is performed: tuples are partitioned into buckets.
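
A sketch of the two measures, assuming Pearson correlation for pairs of continuous attributes (the deck does not name its continuous measure) and the mean-square contingency coefficient for pairs of categorical attributes (suggested by CalculateMSC() in the class diagram):

from collections import Counter

def pearson(xs, ys):
    # Correlation between two continuous attributes, in [-1, 1].
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sum((a - mx) ** 2 for a in xs) ** 0.5
    sy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def mean_square_contingency(xs, ys):
    # Correlation between two categorical attributes, normalized to [0, 1].
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    total = 0.0
    for x, cx in px.items():
        for y, cy in py.items():
            f_ij = joint.get((x, y), 0) / n
            f_i, f_j = cx / n, cy / n
            total += (f_ij - f_i * f_j) ** 2 / (f_i * f_j)
    d = min(len(px), len(py))
    return total / (d - 1) if d > 1 else 0.0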

Page 25: A Novel Anonymization Technique for Privacy Preserving Data Publishing

25

H/W and S/W Specifications

Hardware:

(1) 2 GB RAM

(2) 320 GB hard disk

(3) Intel processor (P3)

Software:

(1) Visual Studio

(2) Windows XP/7

Page 26: A Novel Anonymization Technique for Privacy Preserving Data Publishing

26

Results

Retrieving the dataset file

Page 27: A Novel Anonymization Technique for Privacy Preserving Data Publishing

27

Showing the dataset files to obtain the path of the file

Page 28: A Novel Anonymization Technique for Privacy Preserving Data Publishing

28

Selecting the dataset test file

Page 29: A Novel Anonymization Technique for Privacy Preserving Data Publishing

29

Displaying the path of the dataset file

Page 30: A Novel Anonymization Technique for Privacy Preserving Data Publishing

30

Displaying the message in the textbox when the user cancels browsing

Page 31: A Novel Anonymization Technique for Privacy Preserving Data Publishing

31

Displaying the file path of the dataset attribute

Page 32: A Novel Anonymization Technique for Privacy Preserving Data Publishing

32

Displaying the attributes from the dataset

Page 33: A Novel Anonymization Technique for Privacy Preserving Data Publishing

33

Selecting the attribute set from the dropdown list

Page 34: A Novel Anonymization Technique for Privacy Preserving Data Publishing

34

Displaying the attribute domains of the attribute set

Page 35: A Novel Anonymization Technique for Privacy Preserving Data Publishing

35

Displaying the tuples

Page 36: A Novel Anonymization Technique for Privacy Preserving Data Publishing

36

Entering the number of buckets

Page 37: A Novel Anonymization Technique for Privacy Preserving Data Publishing

37

Displaying the generalization values in both text and table format

Page 38: A Novel Anonymization Technique for Privacy Preserving Data Publishing

38

Displaying the bucketization values in text format

Page 39: A Novel Anonymization Technique for Privacy Preserving Data Publishing

39

Displaying the values which have a clear separation between Quasi-Identifiers and Sensitive Attributes

Page 40: A Novel Anonymization Technique for Privacy Preserving Data Publishing

40

Displaying the diversity mapping based on the salary values

Page 41: A Novel Anonymization Technique for Privacy Preserving Data Publishing

41

Displaying the sliced values in the textbox

Page 42: A Novel Anonymization Technique for Privacy Preserving Data Publishing

42

Displaying the sliced sets in the table

Page 43: A Novel Anonymization Technique for Privacy Preserving Data Publishing

43

Showing the probability based on the countries and salary of all the buckets

Page 44: A Novel Anonymization Technique for Privacy Preserving Data Publishing

44

Providing security for the sliced sets and storing the values in the database

Page 45: A Novel Anonymization Technique for Privacy Preserving Data Publishing

45

Page 46: A Novel Anonymization Technique for Privacy Preserving Data Publishing

46

Duplicating the attributes

Page 47: A Novel Anonymization Technique for Privacy Preserving Data Publishing

47

Displaying the duplicate attributes in the table format

Page 48: A Novel Anonymization Technique for Privacy Preserving Data Publishing

48

Comparison

Graph showing time consumed for three techniques.

(Bar chart: Time (msec) on the y-axis, 0 to 400, comparing Anonymity, Diversity, and Slicing.)

Page 49: A Novel Anonymization Technique for Privacy Preserving Data Publishing

49

Comparative graph showing time consumed in Enhanced Slicing.

(Bar chart: Time (msec) on the y-axis, 0 to 400, comparing Slicing and Enhanced Slicing.)

Page 50: A Novel Anonymization Technique for Privacy Preserving Data Publishing

50

Conclusion

• A dataset is taken and anonymization techniques are performed to protect the privacy of the microdata.

• The attribute values are considered and the probability functions are applied.

• The generalization, bucketization, and slicing techniques are implemented.

• DES is used to provide security for the sliced-set table.

• Overlapping slicing is performed, which duplicates an attribute in more than one column.

• Based on the time consumed, the comparison shows that slicing preserves better data utility than generalization and bucketization.

Page 51: A Novel Anonymization Technique for Privacy Preserving Data Publishing

51

Future Work

• Slicing gives better privacy than generalization and bucketization, but there is still scope in future work to increase the privacy of microdata publishing by using different anonymization techniques.

Page 52: A Novel Anonymization Technique for Privacy Preserving Data Publishing

52

References

[1] T. Li, N. Li, J. Zhang, and I. Molloy, "Slicing: A New Approach for Privacy Preserving Data Publishing," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 3, Mar. 2012.

[2] A. Inan, M. Kantarcioglu, and E. Bertino, "Using Anonymized Data for Classification," Proc. IEEE 25th Int'l Conf. Data Eng. (ICDE), pp. 429-440, 2009.

[3] B.-C. Chen, K. LeFevre, and R. Ramakrishnan, "Privacy Skyline: Privacy with Multidimensional Adversarial Knowledge," Proc. Int'l Conf. Very Large Data Bases (VLDB), pp. 770-781, 2007.

[4] C. Dwork, "Differential Privacy: A Survey of Results," Proc. Fifth Int'l Conf. Theory and Applications of Models of Computation (TAMC), pp. 1-19, 2008.

Page 53: A Novel Anonymization Technique for Privacy Preserving Data Publishing

53

Thank you..