
Analyzing Efficiency of Methods Used for Privacy

Preserving Data Publishing

1G. Suman, 2Shobini.B, 3G. Shiva Krishna

1Student, Computer Science and Engineering, Swathi Institute of Technology & Sciences, Near Ramoji Film City, Beside Kothagudem 'X' Roads, Hyderabad, Telangana, India 501512

2Assistant Professor, Computer Science and Engineering, Swathi Institute of Technology & Sciences, Near Ramoji Film City, Beside Kothagudem 'X' Roads, Hyderabad, Telangana, India 501512

3Assistant Professor, Computer Science and Engineering, Swathi Institute of Technology & Sciences, Near Ramoji Film City, Beside Kothagudem 'X' Roads, Hyderabad, Telangana, India 501512

[email protected], [email protected], [email protected]

Abstract—Privacy-Preserving Data Publishing (PPDP) techniques enable data to be published with privacy preserved. Published data may be subjected to misuse and other privacy attacks, and PPDP techniques exist to protect data against such attacks; important techniques include k-anonymity, l-diversity and t-closeness. The problem with many existing PPDP methods is that they make no provision for privacy characterization and measurement. Recently, however, Afifi et al. proposed a method to achieve this, using metrics such as entropy leakage and distribution leakage to characterize privacy protection and measure the performance of PPDP methods. Further optimization is still needed to improve privacy protection. In this paper we propose a framework with an underlying algorithm known as Privacy Leakage Detection (PLD). The algorithm returns optimally anonymized data that can be used by third parties without compromising privacy. We built a prototype application as a proof of concept. The empirical results reveal that the proposed method is useful for analyzing the efficiency of methods used for privacy-preserving data publishing.

Keywords—Data privacy, data security, data publishing, big data, data mining, privacy quantification, privacy leakage

1. INTRODUCTION

Many organizations, companies and institutions publish privacy-related datasets. While a shared dataset gives researchers useful societal information, it also creates security risks and privacy concerns for the individuals whose data are in the table [1]. To avoid possible identification of individuals from records in published data, uniquely

ISSN NO: 0898-3577

Page No: 715

Compliance Engineering Journal

Volume 10, Issue 12, 2019


identifying information such as names and social security numbers is generally removed from the table. Even with the obvious personal identifiers removed, quasi-identifiers (QIDs) such as zip code, age, and gender may still be used to uniquely identify a significant portion of the population, since the released data makes it possible to infer or narrow down the possible identities of individuals more than would be possible without releasing the table. In fact, [2] showed that by correlating such data with publicly available side information, such as the voter registration list for Cambridge, Massachusetts, the medical visits of many individuals could be easily identified [3]. This study estimated that 87 percent of the population of the United States could be uniquely identified from quasi-identifiers through side-information-based attacks, including the medical records of the governor of Massachusetts in the medical data.

The increasing interest in collecting and publishing large amounts of individuals' data for purposes such as medical research, market analysis and economic measurement has created major privacy concerns about individuals' sensitive information. To address these concerns, many Privacy-Preserving Data Publishing (PPDP) techniques have been proposed in the literature. However, they lack a proper privacy characterization and measurement. The proposed method deals with privacy leakage detection by defining two metrics. Our contributions in this paper are as follows.

1. An algorithm named Privacy Leakage Detection (PLD) is proposed and implemented.
2. A prototype application is built to demonstrate proof of the concept.
3. The algorithm is evaluated and the results are compared with the state of the art.
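The side-information linkage attack described above can be made concrete with a small sketch. All of the records below are hypothetical and serve only to illustrate the mechanism: an equality join on the quasi-identifiers re-identifies every record whose QID combination is unique in both tables.

```python
# Illustrative quasi-identifier linkage attack (hypothetical data).
# The "published" table has direct identifiers removed, but joining it
# with a public register on the QIDs (zip, age, gender) re-links names
# to sensitive values.

published = [  # anonymized release: QIDs plus a sensitive attribute
    {"zip": "02138", "age": 34, "gender": "F", "diagnosis": "flu"},
    {"zip": "02139", "age": 51, "gender": "M", "diagnosis": "diabetes"},
]
voter_list = [  # publicly available side information
    {"name": "Alice", "zip": "02138", "age": 34, "gender": "F"},
    {"name": "Bob", "zip": "02139", "age": 51, "gender": "M"},
]

qid = ("zip", "age", "gender")
reidentified = {
    v["name"]: r["diagnosis"]
    for r in published
    for v in voter_list
    if all(r[k] == v[k] for k in qid)
}
print(reidentified)  # {'Alice': 'flu', 'Bob': 'diabetes'}
```

Because each QID combination occurs exactly once on both sides, every sensitive value is exposed; this is precisely what k-anonymity and its successors are designed to prevent.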

The remainder of the paper is structured as follows. Section 2 provides a review of the literature. Section 3 presents the proposed system in detail. Section 4 presents implementation details. Section 5 presents experimental results, while Section 6 concludes the paper and gives directions for future work.

2. RELATED WORK

Publishing data about individuals without revealing sensitive information about them is an important problem. In recent years, a definition of privacy called k-anonymity has gained popularity: in a k-anonymized dataset, each record is indistinguishable from at least k − 1 other records with respect to certain identifying attributes. Machanavajjhala et al. [4] show using two simple attacks that a k-anonymized dataset has some subtle but severe privacy problems. First, an attacker can discover the values of sensitive attributes when there is little diversity in those sensitive attributes. Second, attackers often have background knowledge, and k-anonymity does not guarantee privacy against attackers using background knowledge. They give a detailed analysis of these two attacks and propose a novel and powerful privacy criterion called ℓ-diversity that can defend against such attacks. In addition to building a formal foundation for ℓ-diversity, they show in an experimental evaluation that ℓ-diversity is practical and can be implemented efficiently [4]. Many researchers have explored this issue [5]-[15] and found that there is a need for evaluating anonymization methods. In this paper we characterize such methods and evaluate their performance with an empirical study.
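The two definitions above can be sketched as simple predicate checks. The snippet below is illustrative only; it uses distinct ℓ-diversity, the simplest of the ℓ-diversity instantiations in [4], and hypothetical generalized records.

```python
from collections import Counter

def equivalence_classes(rows, qid):
    """Group records by their quasi-identifier tuple."""
    classes = {}
    for r in rows:
        classes.setdefault(tuple(r[a] for a in qid), []).append(r)
    return classes

def is_k_anonymous(rows, qid, k):
    """k-anonymity: every equivalence class has at least k records."""
    return all(len(c) >= k for c in equivalence_classes(rows, qid).values())

def is_l_diverse(rows, qid, sensitive, l):
    """Distinct l-diversity: every class has at least l distinct
    sensitive values (guards against the homogeneity attack)."""
    return all(len({r[sensitive] for r in c}) >= l
               for c in equivalence_classes(rows, qid).values())

rows = [  # hypothetical generalized records
    {"zip": "476**", "age": "2*", "disease": "heart disease"},
    {"zip": "476**", "age": "2*", "disease": "heart disease"},
    {"zip": "479**", "age": "4*", "disease": "flu"},
    {"zip": "479**", "age": "4*", "disease": "cancer"},
]
print(is_k_anonymous(rows, ("zip", "age"), 2))           # True
print(is_l_diverse(rows, ("zip", "age"), "disease", 2))  # False
```

The table is 2-anonymous, yet the first equivalence class is homogeneous in its sensitive value, so it fails 2-diversity; this is exactly the first attack described above.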

3. PROPOSED SYSTEM

As shown in Figure 1, the proposed system involves different roles; the important ones are doctor, patient and healthcare unit. Since the case study is taken from the healthcare industry, it is very important to measure privacy leakage in the form of entropy leakage and distribution leakage. The architecture shows the activities supported by the system for the different users, and the application built on it demonstrates the measurement of privacy leakage. As privacy is related to non-disclosure of sensitive information, healthcare data is a good example through which to comprehend the concept.



Figure 1: Architecture of the proposed system

The proposed system has much functionality, logically divided into the following modules, each of which encapsulates its own functionality: Healthcare Admin, View and Authorize Users, View Chart Results, Patients and Doctors.

In the Healthcare Admin module, the admin has to log in using a valid user name and password. After a successful login, the admin can perform operations such as View and Authorize Doctors, View and Authorize Patients, Add Hospital, View All Attackers, Recover Data, View Attacked and Data Recovered Results, and View Patient's Disease Results.

In the View and Authorize module, the admin can view the list of registered users, see each user's details such as user name, email and address, and authorize the users.

In the View Chart Results module, it is possible to view the Attacked and Data Recovered Results and the Patient's Disease Results.

In the Patients module, there are n users. A user should register before performing any operations; once a user registers, their details are stored in the database. After successful registration, the user has to log in using the authorized user name and password, after which the user can perform operations such as Add Patients and Search Patients by Disease.

In the Doctors module, there are likewise n users, who must register and then log in using the authorized user name and password. After a successful login, the user can perform operations such as View Patient Details (Original Table), View Privacy Quantification Table, View Privacy Preserving Data Publishing Table, and Find t-Closeness Data.


3.1 Privacy Leakage Detection Algorithm

The PLD algorithm takes the data to be published and returns a quantification of the privacy leakage.

Algorithm: Privacy Leakage Detection (PLD)
Inputs: Dataset to be published D
Output: Quantified privacy leakage V

1. Start
2. Initialize entropy leakage p
3. Initialize distribution leakage dist
4. For each equivalence class e in the set of equivalence classes E of D
5.     D′ = Anonymize D with a PPDM method
6.     p = FindEntropyLeakage(D′)
7.     dist = FindDistributionLeakage(D′)
8.     Add p and dist to V
9. End For
10. Return V
11. End

Algorithm 1: Privacy leakage detection algorithm

The Privacy Leakage Detection algorithm takes a dataset as input. It exploits two metrics, entropy leakage and distribution leakage, to determine the privacy leakage of PPDM methods, and is thereby able to characterize and quantify the privacy leakage encapsulated by the two metrics. For each equivalence class, the privacy leakage is predicted for a given PPDM method. The results show that the proposed metrics are able to quantify the performance of PPDM methods. The entropy, information gain and symmetric uncertainty used as part of the proposed metrics are given in Eq. 1 through Eq. 4.
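The per-class loop of Algorithm 1 can be sketched as follows. This is not the paper's implementation: the input rows are assumed to be already anonymized, and the two metric definitions are assumptions made for illustration, entropy leakage as the reduction in Shannon entropy of the sensitive attribute within a class relative to the whole table, and distribution leakage as the total-variation distance between the class distribution and the overall distribution.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (bits) of a list of categorical values."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def pld(rows, qid, sensitive):
    """Sketch of the PLD loop: for each equivalence class over the
    quasi-identifiers, return (class key, entropy leakage,
    distribution leakage).  Both metric definitions here are
    illustrative assumptions, not the published formulas."""
    overall = [r[sensitive] for r in rows]
    h_prior = entropy(overall)
    n = len(rows)
    prior = {v: c / n for v, c in Counter(overall).items()}
    classes = {}
    for r in rows:  # group rows into equivalence classes
        classes.setdefault(tuple(r[a] for a in qid), []).append(r)
    result = []
    for key, members in classes.items():
        vals = [r[sensitive] for r in members]
        p = h_prior - entropy(vals)  # assumed entropy leakage
        freq = Counter(vals)
        dist = 0.5 * sum(abs(freq.get(v, 0) / len(vals) - q)
                         for v, q in prior.items())  # assumed dist. leakage
        result.append((key, p, dist))
    return result
```

A homogeneous class yields the maximum entropy leakage (the attacker learns the sensitive value exactly), while a class mirroring the overall distribution yields zero distribution leakage.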

SU = 2 × Gain / (H(X) + H(Y))    (1)

where H(X) and H(Y) are the entropies of X and Y. The entropy is computed as in Eq. 2 and Eq. 3.

H(X) = −∑x∈X p(x) log p(x)    (2)

H(Y) = −∑y∈Y p(y) log p(y)    (3)


Based on the entropy values obtained, information gain is computed as the reduction in entropy, as in Eq. 4.

Information gain = H(Y) − H(Y|X)    (4)

These metrics are used to improve the leakage detection process.

4. IMPLEMENTATION DETAILS

The implementation is a prototype web-based application developed on the Java platform. Servlet and JSP technologies are used to realize the application, and the algorithm is implemented in the Java programming language.

Figure 2: Patient original data

As presented in Figure 2, a user in the doctor role can view full patient details. The figure shows the data before anonymization is performed.


Figure 3: Privacy quantification details

As presented in Figure 3, a user in the doctor role can view the privacy quantification details. The figure shows the data anonymized appropriately, with the performance of the anonymization quantified as well.


Figure 4: Anonymized data with k-anonymity and l-diversity

As presented in Figure 4, the published data is shown with the anonymization results of both k-anonymity and l-diversity.


Figure 5: Searching t-closeness data of a patient

As presented in Figure 5, the published data is shown with t-closeness. The data is subjected to t-closeness and anonymized for further characterization as per the proposed algorithm.


Figure 6: Anonymized data with the t-closeness algorithm

As presented in Figure 6, the healthcare (patient) data is shown after anonymization with t-closeness. This protects the data from privacy attacks.
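For intuition, t-closeness [5] requires that the distribution of the sensitive attribute within each equivalence class stay within a distance t of its distribution in the whole table. The sketch below is an illustrative check, not the prototype's implementation: it measures distance with the total-variation distance, to which the Earth Mover's Distance of [10] reduces for unordered categorical attributes under a uniform ground distance; the function names are ours.

```python
from collections import Counter

def variational_distance(p_vals, q_vals):
    """Total-variation distance between two empirical distributions
    over categorical values."""
    p, q = Counter(p_vals), Counter(q_vals)
    support = set(p) | set(q)
    return 0.5 * sum(abs(p[v] / len(p_vals) - q[v] / len(q_vals))
                     for v in support)

def satisfies_t_closeness(classes, overall, t):
    """classes: list of lists of sensitive values, one per
    equivalence class; overall: sensitive values of the whole table."""
    return all(variational_distance(c, overall) <= t for c in classes)

overall = ["flu", "flu", "cancer", "cancer"]
balanced = ["flu", "cancer"]        # mirrors the overall distribution
skewed = ["flu", "flu"]             # homogeneous class
print(satisfies_t_closeness([balanced], overall, 0.1))  # True
print(satisfies_t_closeness([skewed], overall, 0.1))    # False
```

A class that mirrors the table-wide distribution has distance 0 and passes for any t, while a homogeneous class sits at distance 0.5 here and fails any strict threshold, which is what t-closeness adds beyond l-diversity.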

5. EXPERIMENTAL RESULTS

Experimental results obtained with the prototype application are presented in this section. The results are observed in terms of entropy leakage and distribution leakage for both the existing and the proposed approaches.

Equivalence Class | Distribution Leakage (Existing) | Distribution Leakage (Proposed) | Entropy Leakage (Existing) | Entropy Leakage (Proposed)
1 | 0.3 | 0.35 | 0.4 | 0.45
2 | 0.2 | 0.25 | 0.2 | 0.25
3 | 0.3 | 0.35 | 0.6 | 0.65
4 | 0.1 | 0.2 | 0.1 | 0.2
5 | 0.2 | 0.25 | 0.55 | 0.6
6 | 0.1 | 0.2 | 0.1 | 0.2
7 | 0.4 | 0.3 | 1 | 0.9
8 | 0.2 | 0.25 | 0.2 | 0.3

Table 1: Privacy leakage against equivalence classes

As presented in Table 1, two kinds of leakage values, distribution leakage and entropy leakage, are observed for the existing and proposed methods against different equivalence classes.

Figure 7: Privacy leakage comparison

As presented in Figure 7, the equivalence classes 1 to 8 are plotted on the horizontal axis, while the vertical axis shows the privacy leakage detection performance of the existing and the proposed systems. The equivalence classes have an influence on the privacy leakage. Another important observation is that the proposed system shows better performance than the existing one.

Original Distribution (Existing) | Original Distribution (Proposed) | Published Distribution (Existing) | Published Distribution (Proposed)
0.13 | 0.15 | 0.23 | 0.25
0 | 0.05 | 0 | 0.05
0.15 | 0.17 | 0.33 | 0.35
0.05 | 0.07 | 0 | 0.05
0.15 | 0.17 | 0.25 | 0.3
0.05 | 0.07 | 0.15 | 0.2
0.09 | 0.1 | 0 | 0.05

Table 2: Distribution dynamics against number of attributes

As presented in Table 2, the distribution values against the number of attributes are presented for both the existing and the proposed approaches.

Figure 8: Distribution leakage comparison

As presented in Figure 8, the number of attributes is plotted on the horizontal axis, while the vertical axis shows the distribution values of the existing and the proposed systems. The number of attributes has an influence on the distribution leakage. Again, the proposed system shows better performance than the existing one.

6. CONCLUSION AND FUTURE WORK

In this paper, we proposed a framework for formally analyzing the efficiency of methods used for PPDP. The framework exploits anonymization algorithms such as k-anonymity, l-diversity and t-closeness in order to characterize and analyze them and ascertain their efficiency. An algorithm named Privacy Leakage Detection (PLD) is proposed and implemented. An empirical study is made with a prototype application which uses privacy measures such as distribution leakage and entropy leakage. The results are evaluated in terms of privacy leakage against the number of equivalence classes and distribution leakage against different numbers of attributes. The results reveal that the proposed system shows better performance. In future, we intend to apply our algorithm to big data.

REFERENCES

[1] M. H. Afifi, Kai Zhou and Jian Ren, "Privacy characterization and quantification in data publishing," IEEE Transactions on Knowledge and Data Engineering, 2018, pp. 1-14.
[2] M. H. Afifi, Kai Zhou and Jian Ren, "Privacy characterization and quantification in data publishing," IEEE Transactions on Knowledge and Data Engineering, 2018, pp. 1-14.


[3] A. Narayanan and V. Shmatikov, "Robust de-anonymization of large sparse datasets," in Proc. IEEE Symp. Security and Privacy, 2008, pp. 111-125.
[4] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam, "ℓ-diversity: Privacy beyond k-anonymity," ACM Trans. Knowl. Discovery Data, vol. 1, Mar. 2007, Art. no. 3.
[5] N. Li, T. Li, and S. Venkatasubramanian, "t-closeness: Privacy beyond k-anonymity and l-diversity," in Proc. IEEE 23rd Int. Conf. Data Eng., 2007, pp. 106-115.
[6] N. Li, W. Qardaji, D. Su, Y. Wu, and W. Yang, "Membership privacy: A unifying framework for privacy definitions," in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., 2013, pp. 889-900.
[7] I. Wagner and D. Eckhoff, "Technical privacy metrics: A systematic survey," arXiv:1512.00327, 2015.
[8] J. Zhang, G. Cormode, C. M. Procopiuc, D. Srivastava, and X. Xiao, "PrivBayes: Private data release via Bayesian networks," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2014, pp. 1423-1434.
[9] M. Götz, S. Nath, and J. Gehrke, "MaskIt: Privately releasing user context streams for personalized mobile applications," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2012, pp. 289-300.
[10] Y. Rubner, C. Tomasi, and L. J. Guibas, "The earth mover's distance as a metric for image retrieval," Int. J. Comput. Vis., vol. 40, no. 2, pp. 99-121, 2000.
[11] D. Rebollo-Monedero, J. Forné, and J. Domingo-Ferrer, "From t-closeness-like privacy to postrandomization via information theory," IEEE Trans. Knowl. Data Eng., vol. 22, pp. 1623-1636, Nov. 2010.
[12] L. Sankar, S. R. Rajagopalan, and H. V. Poor, "Utility-privacy tradeoffs in databases: An information-theoretic approach," IEEE Trans. Inf. Forensics Secur., vol. 8, no. 6, pp. 838-852, Jun. 2013.
[13] C. Dwork, "Differential privacy," in Proc. 33rd Int. Colloq. Automata, Languages and Programming (ICALP'06), Part II, M. Bugliesi, B. Preneel, V. Sassone, and I. Wegener (Eds.), Springer-Verlag, Berlin, Heidelberg, 2006, pp. 1-12.
[14] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan, "Incognito: Efficient full-domain k-anonymity," in Proc. ACM SIGMOD, 2005, pp. 49-60.
[15] P. Samarati, "Protecting respondents' identities in microdata release," IEEE Trans. Knowl. Data Eng., vol. 13, no. 6, pp. 1010-1027, Nov./Dec. 2001.
