Privacy-Preserving Hierarchical-k-means Clustering on Horizontally Partitioned Data

ANRONG XUE, DONGJIE JIANG, SHIGUANG JU, WEIHE CHEN, and HANDA MA

School of Computer Science and Telecommunication Engineering, Jiangsu University, Zhenjiang, China

Privacy-preserving mining of distributed data is an important direction in data mining, and privacy-preserving clustering is one of its main research topics. Privacy-preserving data mining techniques enable knowledge discovery without requiring disclosure of private data. Existing privacy-preserving algorithms concentrate mainly on association rules and classification; only a few address clustering, and those mainly target centralized and vertically partitioned data. We therefore propose a privacy-preserving hierarchical k-means clustering algorithm on horizontally partitioned data, denoted HPPHKC.

Because the complexity of the k-means clustering algorithm is only O(n), most existing privacy-preserving clustering algorithms are built on k-means and assume two parties and a trusted third party. These algorithms have three drawbacks: inaccurate results caused by choosing the initial cluster centers randomly, difficulty in extending to multiple parties, and privacy leakage caused by excessive dependence on the third party. By introducing three secure multi-party computation protocols (distance computation, cluster center computation, and standardization) and combining the merits of hierarchical and k-means clustering, we present a privacy-preserving hierarchical-k-means clustering algorithm on horizontally partitioned data for semi-honest parties. The algorithm uses these protocols to protect the private data, uses hierarchical clustering to obtain k cluster centers, and then uses k-means clustering to obtain the final k clusters.

We introduce the clustering feature and the clustering feature tree, which are used to summarize the cluster representations. A clustering feature (CF) is a three-dimensional vector summarizing information about a cluster of objects. The i-th clustering feature is CF_i = (cn_i, cc_i, cp_i), where cn_i is the number of objects in the i-th cluster (the size of the i-th cluster), cc_i is the center of the cn_i objects, and cp_i is a pointer to the list of the cn_i objects.

The algorithm has two phases. In the first phase, every object starts as its own cluster; a secure computation protocol is used to compute the dissimilarity matrix, and the most similar clusters are merged. This process is repeated until the specified number of clusters k is reached, which yields k cluster centers. In the second phase, the semi-honest third party and all parties holding data use the k-means algorithm to refine the results of the first phase and obtain the final clustering results. Finally, we give a proof of security for the algorithm and an analysis of its communication costs, and we show that our scheme is secure and complete with good efficiency.

Keywords: Clustering; Privacy preserving; Secure multi-party computation; Horizontally partitioned data

International Journal of Distributed Sensor Networks, 5: 81, 2009
Copyright © Taylor & Francis Group, LLC
ISSN: 1550-1329 print / 1550-1477 online
DOI: 10.1080/15501320802571863

This work was supported by the National Natural Science Foundation of China (No. 60603041, No. 60773049), the Science Foundation of Jiangsu Education Council (No. 05KJB520017), and the Science Foundation of Jiangsu (No. BK2006073).

Address correspondence to Anrong Xue, School of Computer Science and Telecommunication Engineering, Jiangsu University, Zhenjiang, 212013, China. E-mail: [email protected]
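
The two-phase procedure summarized in the abstract can be illustrated with a small plaintext sketch. It is not the authors' HPPHKC implementation: the secure distance, cluster center, and standardization protocols are replaced by ordinary local computation on pooled data, the merge criterion (smallest Euclidean distance between cluster centers) is an assumption, and the names `CF`, `merge`, `hierarchical_centers`, `assign`, and `kmeans_refine` are hypothetical. It only shows how a clustering feature (cn, cc, cp) might be maintained while clusters are merged down to k centers, and how those centers could then seed k-means.

```python
from dataclasses import dataclass
from math import dist  # Euclidean distance, Python 3.8+


@dataclass
class CF:
    """Clustering feature CF_i = (cn_i, cc_i, cp_i)."""
    cn: int     # number of objects in the cluster
    cc: tuple   # center of those objects
    cp: list    # member objects (stands in for the pointer to the list)


def merge(a, b):
    """Merge two clustering features; the new center is the size-weighted mean."""
    cn = a.cn + b.cn
    cc = tuple((a.cn * x + b.cn * y) / cn for x, y in zip(a.cc, b.cc))
    return CF(cn, cc, a.cp + b.cp)


def hierarchical_centers(points, k):
    """Phase 1 (plaintext stand-in): start with one cluster per object and
    repeatedly merge the two clusters with the closest centers until k remain."""
    clusters = [CF(1, tuple(p), [tuple(p)]) for p in points]
    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: dist(clusters[ij[0]].cc, clusters[ij[1]].cc))
        merged = merge(clusters[i], clusters[j])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return [c.cc for c in clusters]


def assign(points, centers):
    """Assign every point to its nearest center."""
    groups = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda c: dist(p, centers[c]))
        groups[nearest].append(p)
    return groups


def kmeans_refine(points, centers, max_iter=100):
    """Phase 2: ordinary k-means iterations seeded with the phase-1 centers."""
    points = [tuple(p) for p in points]
    centers = list(centers)
    for _ in range(max_iter):
        groups = assign(points, centers)
        # An empty group keeps its previous center.
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[idx]
            for idx, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)
```

For example, calling `hierarchical_centers(points, 2)` on six two-dimensional points that form two well-separated groups, and passing the result to `kmeans_refine`, should return two centers and two groups of three points each. In the setting described above, each distance and center computation would instead be carried out through a secure multi-party protocol so that no party sees another party's objects.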
