Practical Privacy-Preserving MapReduce Based K-means Clustering over Large-scale Dataset

Jiawei Yuan, Member, IEEE, Yifan Tian, Student Member, IEEE

Abstract—Clustering techniques have been widely adopted in many real-world data analysis applications, such as customer behavior analysis, targeted marketing, and digital forensics. With the explosion of data in today's big data era, a major trend for handling clustering over large-scale datasets is outsourcing it to public cloud platforms. This is because cloud computing offers not only reliable services with performance guarantees, but also savings on in-house IT infrastructures. However, as datasets used for clustering may contain sensitive information, e.g., patient health information, commercial data, and behavioral data, directly outsourcing them to public cloud servers inevitably raises privacy concerns.

In this paper, we propose a practical privacy-preserving K-means clustering scheme that can be efficiently outsourced to cloud servers. Our scheme allows cloud servers to perform clustering directly over encrypted datasets, while achieving comparable computational complexity and accuracy to clustering over unencrypted ones. We also investigate the secure integration of MapReduce into our scheme, which makes our scheme extremely suitable for the cloud computing environment. Thorough security analysis and numerical analysis demonstrate the performance of our scheme in terms of security and efficiency. Experimental evaluation over a 5 million object dataset further validates the practical performance of our scheme.

Index Terms—Privacy-preserving, K-means Clustering, Cloud Computing

I. INTRODUCTION

CLUSTERING is one major task of exploratory data mining and statistical data analysis, which has been ubiquitously adopted in many domains, including healthcare, social networks, image analysis, and pattern recognition. Meanwhile, the rapid growth of big data involved in today's data mining and analysis also introduces challenges for clustering in terms of volume, variety, and velocity. To efficiently manage large-scale datasets and support clustering over them, public cloud infrastructure plays a major role for both performance and economic considerations. Nevertheless, using public cloud services inevitably introduces privacy concerns. This is because not only are many data involved in data mining applications sensitive by nature, such as personal health information, localization data, and financial data, but the public cloud is also an open environment operated by external third parties [1]. For example, a promising trend for predicting an individual's disease risk is clustering over existing patients' health records [2], which contain sensitive patient information according to the Health Insurance Portability and Accountability Act (HIPAA) Policy [3]. Therefore, appropriate privacy protection mechanisms must be in place when outsourcing sensitive datasets to the public cloud for clustering.

The problem of privacy-preserving K-means clustering has been investigated under the multi-party secure computation model [4]–[9], in which owners of distributed datasets interact for clustering without disclosing their own datasets to each other. In the multi-party setting, each party has a collection of data and wishes to collaborate with others in a privacy-preserving manner to improve clustering accuracy. Differently, the dataset in clustering outsourcing is typically owned by a single entity, who aims at minimizing the local computation by delegating the clustering task to a third-party cloud server. In addition, existing multi-party designs always rely on powerful but expensive cryptographic primitives (e.g., secure circuit evaluation, homomorphic encryption, and oblivious transfer) to achieve collaborative secure computation among multiple parties, and are inefficient for large-scale datasets. Thus, these multi-party designs are not practical for privacy-preserving outsourcing of clustering. Another line of research that targets efficient privacy-preserving clustering uses distance-preserving data perturbation or data transformation to encrypt datasets [10], [11]. Nevertheless, utilizing data perturbation and data transformation for privacy-preserving clustering may not achieve a sufficient privacy and accuracy guarantee [12], [13]. For example, adversaries who obtain a few unencrypted data records in the dataset will be able to recover the rest of the records protected by data transformation [12]. Recently, the outsourcing of K-means clustering was studied in ref [14] by utilizing homomorphic encryption and an order-preserving index. However, the homomorphic encryption utilized in [14] is not secure, as pointed out in ref [15]. Moreover, due to the cost of the relatively expensive homomorphic encryption, ref [14] is efficient only for small datasets, e.g., fewer than 50,000 data objects. Another possible candidate for achieving privacy-preserving K-means clustering is to extend existing privacy-preserving K-nearest neighbors (KNN) search schemes [16]–[18]. Unfortunately, these privacy-preserving KNN search schemes are limited by their vulnerability to linear analysis attacks [16], support for only two-dimensional data [17], or accuracy loss [18]. In addition, KNN is a single-round search task, but K-means clustering is an iterative process that requires the update of clustering centers based on the entire dataset after each round of clustering. Considering the efficient support of large-scale datasets, these update processes also need to be outsourced to the cloud server in a privacy-preserving manner.

Besides privacy protection, there are two other major factors in the outsourcing of K-means clustering: Clustering Efficiency and Clustering Accuracy. Specifically, a practical privacy-preserving outsourcing of K-means clustering shall be easily parallelized, which is important in the cloud computing environment for performance guarantees on large-scale datasets. Meanwhile, the computational cost of the dataset owner shall be minimized, i.e., the owner is only responsible for the system setup as well as lightweight interactions with cloud servers.


Although a number of MapReduce based K-means clustering schemes have been proposed to handle large-scale datasets in parallel [19]–[21], none of them consider privacy protection for the outsourced dataset. Moreover, the privacy protection offered in an outsourced K-means design shall have slight (or even no) influence on the clustering accuracy. This is because accuracy is the key factor determining the quality of a clustering algorithm. To the best of our knowledge, there is no existing privacy-preserving MapReduce based K-means outsourcing design that can achieve comparable efficiency and accuracy to clustering over unprotected datasets.

In this work, we propose a practical privacy-preserving K-means clustering scheme for large-scale datasets, which can be efficiently outsourced to public cloud servers. Our proposed scheme simultaneously meets the privacy, efficiency, and accuracy requirements discussed above. In particular, we propose a novel encryption scheme based on the Learning with Errors (LWE) hard problem [22], which achieves privacy-preserving similarity measurement of data objects directly over ciphertexts. Based on our encryption scheme, we further construct the whole K-means clustering process in a privacy-preserving manner, in which cloud servers only have access to encrypted datasets and perform all operations without any decryption. Moreover, we uniquely incorporate MapReduce [23] into our scheme with privacy protection, and thus significantly improve the clustering performance in the cloud computing environment. We provide thorough analysis of our scheme in terms of security and efficiency. We also implemented a prototype of our scheme on the Microsoft Azure cloud. Our extensive evaluation results over 5 million objects show that our privacy-preserving clustering is efficient, scalable, and accurate. Specifically, compared with K-means clustering over unencrypted datasets, our scheme achieves the same accuracy as well as comparable computational performance and scalability.

The rest of this paper is organized as follows: In Section II, we explain our system model and threat model. Section III describes the background of K-means clustering and the MapReduce framework. We provide the detailed construction of our scheme and its security analysis in Section IV. We evaluate the performance of our scheme in Section V, which is followed by the review of related work in Section VI. Section VII concludes this work.

II. MODELS

A. System Model

In our design, we consider two major entities as shown in Fig. 1: a Dataset Owner and a Cloud Server. The owner has a collection of data objects, which will be outsourced to the cloud server for clustering after encryption. The cloud server performs the K-means clustering directly over the encrypted dataset without any decryption. During the clustering, the cloud server interacts with the owner for a small amount of encrypted intermediate inputs/outputs. The clustering is finished when the clustering results no longer change, or a predefined number of iterations is reached.

Fig. 1. System Architecture (the owner outsources the encrypted dataset to the cloud server; each single round of privacy-preserving clustering produces encrypted intermediate results, the owner performs the privacy-preserving clustering center update and returns updated encrypted clustering centers, and the iteration yields the final clustering results)

B. Threat Model

In this work, the cloud server is considered "honest-but-curious" [24], i.e., the cloud server will honestly follow the designed protocol but try to disclose as much content of the dataset as possible. This assumption is consistent with existing works on privacy-preserving outsourcing in cloud computing [14], [25]–[27]. Based on the information available to the cloud server, we consider the following threat models in terms of the privacy protection of data.

• Ciphertext Only Model: The cloud server only has access to all encrypted data objects in the dataset, all encrypted clustering centers, and all intermediate outputs generated by the cloud server.

• Known Background Model: In this stronger threat model, the cloud server has all information as in the Ciphertext Only Model. In addition, the cloud server may have some background information about the dataset (e.g., what is the topic of the dataset?), and may obtain a small number of data objects in the dataset. We also assume the cloud server is not able to obtain the clustering centers from background information, since they are generated based on all data objects on the fly during the clustering process.

We assume the owner will not be compromised by adversaries. In our scheme, we should prevent the cloud server as well as outside adversaries from learning any data object or clustering center outsourced by the owner.

III. BACKGROUND AND TECHNICAL PRELIMINARIES

A. K-means Clustering

The K-means clustering algorithm aims to allocate a set of data objects into k disjoint clusters, each of which has a center. Each data object is assigned to the cluster whose center has the shortest distance to the object. Data objects and centers can be denoted as multi-dimensional vectors, and their distances can be measured using the square of the Euclidean distance. For example, the square of the distance between two m-dimensional vectors \vec{D}_1 = [d_{11}, d_{12}, \cdots, d_{1m}] and \vec{D}_2 = [d_{21}, d_{22}, \cdots, d_{2m}] can be calculated as

Dist(\vec{D}_1, \vec{D}_2) = \sum_{j=1}^{m} (d_{1j} - d_{2j})^2

As shown in Algorithm 1, K-means clustering is an iterative process. The algorithm selects k initial cluster centers, and all data objects are allocated to the cluster whose center has the shortest distance to them. After a round of clustering, the centers of the clusters are updated.


Algorithm 1: K-means Clustering
Input: k: number of clusters; max: a predefined number of iterations; n data objects \vec{D}_i = (d_{i1}, \cdots, d_{im}), 1 \le i \le n
Output: k clusters
begin
    Select k initial cluster centers \vec{C}_x, 1 \le x \le k
    while max > 0 do
        1. Assign each data object \vec{D}_i to the cluster center with minimum distance Dist(\vec{D}_i, \vec{C}_x) to it.
        2. Update \vec{C}_x to the average value of the \vec{D}_i assigned to cluster x.
    Output k reallocated clusters.

Particularly, the new center of a cluster is generated by averaging each element over all data object vectors in the same cluster. For example, if \vec{D}_x, \vec{D}_y, \vec{D}_z are assigned to the same cluster, the new center is calculated as (\vec{D}_x + \vec{D}_y + \vec{D}_z)/3. This clustering and center update process is conducted iteratively until the data objects in each cluster no longer change, or a predefined number of iterations is reached. For more details about K-means clustering, please refer to ref [28].

B. Weighted K-means Clustering

A popular extension of the original K-means clustering is weighted K-means clustering [29]. In weighted K-means clustering, every data element is associated with a real-valued weight, because different elements in an object can have different levels of importance. In weighted K-means clustering, a weight vector \vec{W} = [w_1, w_2, \cdots, w_m] is created for the dataset. Now, instead of directly using the Euclidean distance for the clustering measurement, the distance between a data object vector \vec{D} = [d_1, d_2, \cdots, d_m] and a clustering center \vec{C} = [c_1, c_2, \cdots, c_m] is calculated as

WDist(\vec{D}, \vec{C}, \vec{W}) = \sum_{j=1}^{m} w_j (d_j - c_j)^2

Other operations in weighted K-means clustering remain the same as in the original K-means clustering. For more details about weighted K-means clustering, please refer to ref [29].
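The only change from the unweighted case is the per-element weight, as in this small Java sketch (illustrative names, not the paper's code):

```java
// Sketch of the weighted squared distance
// WDist(D, C, W) = sum_j w_j (d_j - c_j)^2.
public class WeightedDistance {
    static double wdist(double[] d, double[] c, double[] w) {
        double s = 0;
        for (int j = 0; j < d.length; j++) {
            double diff = d[j] - c[j];
            s += w[j] * diff * diff;   // w_j * (d_j - c_j)^2
        }
        return s;
    }
}
```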

C. MapReduce Framework

MapReduce is a programming framework for processing large-scale datasets in a distributed manner. As shown in Fig. 2, to process a task with massive amounts of data, MapReduce divides the task into two phases: map and reduce. These two phases are expressed with map and reduce functions, which take <key,value> pairs as the input and output data format. In a cluster, the nodes responsible for the map and reduce functions are called mappers and reducers respectively.

In a MapReduce task, the framework splits input datasets into data chunks, which are processed by independent mappers in parallel. Each map function processes data and generates intermediate output as <key,value> pairs. These intermediate outputs are forwarded to reducers after a shuffle. According to the key space of the <key,value> pairs in the intermediate outputs, each reducer is assigned a partition of pairs. In MapReduce, intermediate <key,value> outputs with the same key are sent to the same reducer. After that, reducers sort and group all intermediate outputs in parallel to generate the final result. More details about MapReduce are introduced in ref [23].

Fig. 2. MapReduce Framework
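To illustrate the <key,value> flow described above, the following self-contained Java sketch simulates the map, shuffle, and reduce phases on a toy word-count job; it uses no Hadoop APIs, and all names are ours.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of the MapReduce <key,value> flow (illustrative only;
// a real deployment would use Hadoop mappers and reducers instead).
public class MapReduceFlow {
    public static void main(String[] args) {
        List<String> chunks = Arrays.asList("a b a", "b c", "a c c");

        // Map phase: each data chunk is processed independently,
        // emitting intermediate <key,value> pairs.
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String chunk : chunks)
            for (String token : chunk.split(" "))
                intermediate.add(new AbstractMap.SimpleEntry<>(token, 1));

        // Shuffle: pairs with the same key are grouped for the same reducer.
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (Map.Entry<String, Integer> kv : intermediate)
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                   .add(kv.getValue());

        // Reduce phase: aggregate the values of each key into a final result.
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            int sum = 0;
            for (int v : group.getValue()) sum += v;
            System.out.println(group.getKey() + " -> " + sum);
        }
    }
}
```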

IV. CONSTRUCTION OF PRIVACY-PRESERVING MAPREDUCE BASED K-MEANS CLUSTERING

A. Scheme Overview

Our scheme consists of three stages, as shown in Fig. 3: 1) System Setup and Data Encryption; 2) Single Round MapReduce Based Privacy-preserving Clustering; 3) Privacy-preserving Clustering Center Update. In Stage 1, the owner first sets up the system by selecting parameters for K-means and MapReduce. The owner then generates encryption keys for the system, and encrypts the dataset for clustering. In Stage 2, the cloud server performs a round of clustering and allocates encrypted objects to their closest clustering centers. After that, the cloud server returns a small amount of encrypted information back to the owner as the intermediate outputs. In Stage 3, the owner updates the clustering centers based on the information from the cloud server and his/her own secret keys. These new centers are sent to the cloud server in encrypted format for the next round of clustering. Stage 2 and Stage 3 are iteratively executed until the clustering result no longer changes or the predefined number of iterations is reached.

Fig. 3. Scheme Overview (Stage 1: the owner performs parameter selection, system key generation, and data encryption, and outsources the encrypted dataset and encrypted initial clustering centers; Stage 2: the cloud server performs a single round of MapReduce based privacy-preserving clustering and returns aggregated ciphertexts for each cluster; Stage 3: the owner decrypts the aggregated ciphertexts, then updates and re-encrypts the clustering centers; Stages 2 and 3 iterate until the clustering is finished)

We now give the detailed construction of each stage in our scheme. We summarize the important notations used in our construction in Table I, and define two mathematical operations below.


TABLE I
NOTATION

n: Total number of data objects in the dataset
m: Total number of elements in a data object
K: Total number of clusters
\vec{D}_i, \vec{D}'_i: Extended data object vectors
\vec{C}_k: Extended clustering centers
\vec{e}_i, \vec{e}_k: Random noise vectors
\vec{W}: Weight vector
M, Q: Random invertible matrices
M^{-1}, Q^{-1}: Inverses of M and Q
I: Identity matrix
E(·): Encryption
SUM_k: Aggregated ciphertexts of data objects
List_k: Clustering result list

Definition IV.1. For a \in \mathbb{R}, define \lceil a \rfloor to be the nearest integer to a, and \lceil a \rfloor_q to be the nearest integer to a with modulus q.

Definition IV.2. For a vector \vec{D} (or a matrix M), define |max(\vec{D})| (or |max(M)|) to be the maximum absolute value of its elements.

B. Detailed Construction

1) Stage 1 - System Setup and Data Encryption: In our system, we consider a dataset with n data objects, which needs to be clustered into K clusters. Each data object contains m elements, and the elements of all objects are scaled to integers with the same scale factor. We denote each scaled data object as {d_{i1}, d_{i2}, \cdots, d_{im}} \in Z_p. Given a data object, the owner first extends it to two 2m-dimensional vectors as

\vec{D}_i = [r_i d_{i1}, r_i d_{i2}, \cdots, r_i d_{im}, r_i, \alpha_{i1}, \alpha_{i2}, \cdots, \alpha_{i(m-1)}]

\vec{D}'_i = [d_{i1}, d_{i2}, \cdots, d_{im}, r_i, \alpha_{i1}, \alpha_{i2}, \cdots, \alpha_{i(m-1)}]

where r_i, \alpha_{i1}, \alpha_{i2}, \cdots, \alpha_{i(m-1)} \in Z_p are random numbers selected by the owner for each data object, and r_i is positive. \vec{D}_i will be used for privacy-preserving clustering, and \vec{D}'_i will be used for the privacy-preserving update of clustering centers. The owner also selects K initial clustering centers and extends them to 2m-dimensional vectors \vec{C}_k, 1 \le k \le K, as

\vec{C}_k = [c_{k1}, c_{k2}, \cdots, c_{km}, -\frac{1}{2}\sum_{j=1}^{m} c_{kj}^2, \beta_1, \beta_2, \cdots, \beta_{m-1}]

where \beta_1, \beta_2, \cdots, \beta_{m-1} \in Z_p are random numbers selected by the owner, regenerated for each round of clustering. Note that there are different ways of selecting the initial centers [30], and our design is independent of how the initial centers are selected.

Key Generation: The key generation of our scheme involves the selection of two random 2m \times 2m invertible matrices M and Q. We also use M^{-1} and Q^{-1} to denote the inverses of M and Q respectively, where M \times M^{-1} = I, Q \times Q^{-1} = I, and I is a 2m \times 2m identity matrix. SK = {M, Q, M^{-1}, Q^{-1}} is set as the secret key for the system, and is only known to the owner.

Data Encryption: Given the extended data vectors \vec{D}_i and \vec{D}'_i of a data object, the owner encrypts them as

E(\vec{D}_i) = (\Gamma \cdot \vec{D}_i + \vec{e}_i) \times M   (1)

E(\vec{D}'_i) = (\Gamma \cdot \vec{D}'_i + \vec{e}_i) \times Q   (2)

Here, \Gamma \in Z_q, q \gg p, \Gamma \gg 2|max(\vec{e}_i)|, and \vec{e}_i \in Z_q^{2m} is a random integer noise vector generated for each \vec{D}_i. Then, for each extended clustering center \vec{C}_k, the owner encrypts it as

E(\vec{C}_k) = M^{-1} \times (\Gamma \cdot \vec{C}_k^T + \vec{e}_k^T)   (3)

where \vec{C}_k^T and \vec{e}_k^T are the column forms of \vec{C}_k and \vec{e}_k respectively, and \vec{e}_k \in Z_q^{2m} is a random integer noise vector generated for each clustering center \vec{C}_k.

Considering the support of MapReduce, the ciphertexts of each data object are organized as a key-value pair <i, E(\vec{D}_i)||E(\vec{D}'_i)>, where the index i of the data object is used as the key and the concatenation of E(\vec{D}_i) and E(\vec{D}'_i) is set as the value. All n key-value pairs <i, E(\vec{D}_i)||E(\vec{D}'_i)>_{1 \le i \le n}, the K encrypted clustering centers {E(\vec{C}_k)}_{1 \le k \le K}, and the public parameter \Gamma are outsourced to the cloud server.

2) Stage 2 - Single Round MapReduce Based Privacy-preserving Clustering: As described in Section III-A, the clustering goal for a data object is to find the clustering center that has the minimum Euclidean distance to it. As different data objects are independent of each other in a single round of clustering, we set the clustering of one object as the MapReduce job in our scheme. Now, the first task is to achieve Euclidean distance comparison directly over encrypted data objects and clustering centers. Given an encrypted object E(\vec{D}_i) and any two encrypted clustering centers E(\vec{C}_a), E(\vec{C}_b), the cloud server first computes

Comp_{ia} = \left\lceil \frac{E(\vec{D}_i) \times E(\vec{C}_a)}{\Gamma^2} \right\rfloor_q   (4)

          = \left\lceil \sum_{j=1}^{m}\left(r_i d_{ij} c_{aj} - \frac{r_i c_{aj}^2}{2}\right) + \sum_{j=1}^{m-1}\alpha_{ij}\beta_j + \frac{\vec{e}_i \times \vec{e}_k^T}{\Gamma^2} + \frac{\vec{D}_i \times \vec{e}_k^T + \vec{e}_i \times \vec{C}_k^T}{\Gamma} \right\rfloor_q

          = \sum_{j=1}^{m} r_i d_{ij} c_{aj} - \frac{r_i}{2}\sum_{j=1}^{m} c_{aj}^2 + \sum_{j=1}^{m-1}\alpha_{ij}\beta_j

          = -\frac{r_i}{2}\left(Dist\{C_a, D_i\} - \sum_{j=1}^{m} d_{ij}^2\right) + \sum_{j=1}^{m-1}\alpha_{ij}\beta_j

Comp_{ib} = -\frac{r_i}{2}\left(Dist\{C_b, D_i\} - \sum_{j=1}^{m} d_{ij}^2\right) + \sum_{j=1}^{m-1}\alpha_{ij}\beta_j

The correctness of the above equations is guaranteed by the distributive property of matrix multiplication and the fact that \Gamma \gg p, \Gamma \gg 2|max(\vec{e}_i)|, and \Gamma \gg 2|max(\vec{e}_k)|. Based on Comp_{ia} and Comp_{ib}, the cloud can easily output the difference between Dist\{C_a, D_i\} and Dist\{C_b, D_i\} as

Comp_{ia} - Comp_{ib} = \frac{r_i}{2}\left(Dist\{C_b, D_i\} - Dist\{C_a, D_i\}\right)   (5)


As r_i is positive, it is clear that the output of the encrypted distance comparison in Eq. 5 is consistent with the comparison using the exact distances Dist\{C_a, D_i\} and Dist\{C_b, D_i\}. If the sign of Eq. 5 is positive, clustering center C_a has the smaller distance to the data object; otherwise, clustering center C_b has the smaller distance.
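On the cloud side, the whole comparison reduces to one dot product per center plus a rescaling, as in this Java sketch (the modular reduction \lceil \cdot \rfloor_q is omitted; names are ours, not the paper's code):

```java
// Sketch of the cloud-side comparison from Eq. 4 and Eq. 5: one dot product
// of an encrypted object with an encrypted center, rescaled by Γ^2 and
// rounded, yields Comp; the sign of comp(a) - comp(b) orders the distances.
public class EncryptedComparison {
    static long comp(long[] encObject, long[] encCenter, long gamma) {
        long dot = 0;
        for (int j = 0; j < encObject.length; j++)
            dot += encObject[j] * encCenter[j];    // E(D_i) × E(C_a)
        return Math.round((double) dot / ((double) gamma * gamma));
    }

    // True iff center a is closer to the object than center b, since
    // comp(a) - comp(b) = (r_i/2)(Dist{C_b,D_i} - Dist{C_a,D_i}) and r_i > 0.
    static boolean aCloserThanB(long[] encObj, long[] encA, long[] encB,
                                long gamma) {
        return comp(encObj, encA, gamma) - comp(encObj, encB, gamma) > 0;
    }
}
```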

To further process privacy-preserving K-means clustering with MapReduce, the cloud server splits and distributes the n encrypted data pairs <i, E(\vec{D}_i)||E(\vec{D}'_i)>_{1 \le i \le n} to all mappers. The cloud server also sends all K encrypted clustering centers {E(\vec{C}_k)}_{1 \le k \le K} to each mapper. Afterwards, each mapper initializes K 2m-dimensional vectors {SUM_k}_{1 \le k \le K}, in which all elements are 0, i.e., {0, 0, \cdots, 0}. These vectors are used to aggregate encrypted data objects allocated to the same center, and will be utilized for the update of clustering centers in Stage 3.

The map function in our privacy-preserving K-means clustering is shown in Fig. 4. Taking an encrypted data key-value pair <i, E(\vec{D}_i)||E(\vec{D}'_i)> and all encrypted clustering centers {E(\vec{C}_k)}_{1 \le k \le K} as inputs, a mapper iteratively invokes our privacy-preserving distance comparison described above to figure out the closest center for the data object. The intermediate outputs of our map function include two parts: 1) a key-value pair <k, i> with the index k of the closest center as the key and the index i of the data object as the value; 2) the updated SUM_k for the k-th center. Once a mapper finishes all jobs assigned to it, it organizes its final SUM_k values as key-value pairs <k, SUM_k>_{1 \le k \le K}. Finally, all outputs are sent to reducers.

Map Process of Privacy-preserving K-means

Input: K encrypted centers {E(\vec{C}_k)}_{1 \le k \le K}; an encrypted data pair <i, E(\vec{D}_i)||E(\vec{D}'_i)>; aggregated encrypted vectors {SUM_k}_{1 \le k \le K}
Output: a key-value pair <k, i> and the updated SUM_k, where k is the index of the clustering center assigned to data object \vec{D}_i

1. index = 1;
2. minCandidate = Comp_{i1}   // computed as in Eq. 4
3. For (k = 2; k < K+1; k++) {
       If (Comp_{ik} - minCandidate > 0) {   // compared as in Eq. 5; a larger Comp means a smaller distance
           minCandidate = Comp_{ik};
           index = k;
       }
   }
4. Output an <index, i> pair; SUM_{index} = SUM_{index} + E(\vec{D}'_i);

Fig. 4. Map Process of Privacy-preserving K-means

The reduce function is presented in Fig. 5. On receiving outputs from mappers, reducers first add the indexes of data objects allocated to the same cluster to the corresponding result list List_k, where 1 \le k \le K. Reducers also aggregate the partially aggregated SUM_k from each mapper into {SUM_k}_{1 \le k \le K}, whose final values are SUM_k = \sum_{i \in List_k} E(\vec{D}'_i).

Based on the output of the reducers, the cloud server checks whether the predefined number of clustering iterations is reached, or whether all result lists List_k, 1 \le k \le K, are the same as in the previous round of clustering.

Reduce Process of Privacy-preserving K-means

Input: <k, i> pairs for all data objects; partially aggregated <k, SUM_k>, 1 \le k \le K, from each mapper
Output: K classified clusters <k, List_k>; final aggregated <k, SUM_k>, 1 \le k \le K, for all n data objects

1. While (<k, i> pairs.hasNext()) {
       Add i to List_k;
   }
2. While (<k, SUM_k> pairs.hasNext()) {
       SUM_k = SUM_k + SUM_k^{(mapper)};   // merge each mapper's partial aggregate
   }
3. Output <k, List_k> and SUM_k, 1 \le k \le K;

Fig. 5. Reduce Process of Privacy-preserving K-means

If so, the clustering is finished and the cloud server sends List_k, 1 \le k \le K, back to the dataset owner as the clustering result; otherwise, the cloud server sends {SUM_k}_{1 \le k \le K} back to the owner for the update of the clustering centers. To this end, a single round of privacy-preserving MapReduce based K-means clustering is finished.

3) Stage 3 - Privacy-preserving Clustering Center Update: After a single round of clustering, the clustering centers need to be updated as described in Section III-A. Particularly, the m elements of a new clustering center are calculated as the mean values of the elements of the data objects currently allocated to the cluster, i.e., c_{kj} = \frac{1}{|List_k|} \sum_{i \in List_k} d_{ij}, where 1 \le j \le m and |List_k| is the total number of data objects in the k-th cluster. To efficiently achieve this in a privacy-preserving manner, our scheme first utilizes the MapReduce design in Stage 2 to generate aggregated ciphertexts of the data objects allocated to the same cluster. Note that only K aggregated ciphertexts {SUM_k}_{1 \le k \le K} need to be retrieved by the owner, where K is the total number of clusters and each ciphertext is a 2m-dimensional vector. Thus, the communication overhead of the interaction after each round of clustering is lightweight and independent of the size of the dataset. With {SUM_k}_{1 \le k \le K}, the owner decrypts them using the secret key Q^{-1} and the public parameter \Gamma as

\vec{C}'_k = \left\lceil \frac{SUM_k \times Q^{-1}}{\Gamma^2} \right\rfloor_q = \sum_{i \in List_k} \vec{D}'_i   (6)

As shown in Eq. 6, the decryption of an aggregated ciphertext outputs the aggregation of the corresponding data object vectors according to the associative and distributive properties of matrix multiplication. Given \vec{C}'_k for the k-th cluster, the owner generates the new center \vec{C}_k as \vec{C}'_k / |List_k|. During the update, the owner only keeps the first m elements of each \vec{C}'_k, since all the remaining elements are extended values or random numbers as described in the data encryption process of Stage 1. After the K new clustering centers \vec{C}_k are generated, the owner extends them to 2m-dimensional vectors and encrypts them using the encryption process presented in Stage 1.


All new encrypted centers are sent to the cloud server for the next round of clustering as described in Stage 2. Stage 2 and Stage 3 are iteratively executed until the clustering is finished. To this end, all operations required in a K-means clustering are supported in a privacy-preserving manner in our construction.

C. Extension to Weighted K-means Clustering

As presented in Section III-B, weighted K-means clustering is similar to the original K-means clustering, with only one difference in the distance computation. In weighted K-means clustering, the weighted distance is calculated as WDist(\vec{D}, \vec{C}, \vec{W}) = \sum_{j=1}^{m} w_j (d_j - c_j)^2, where \vec{W} = [w_1, w_2, \cdots, w_m] is the weight vector for the dataset. Thus, to support weighted K-means clustering in a privacy-preserving manner, our design should enable privacy-preserving weighted distance comparison. In particular, we only need to make the following change to Stage 1 of our design introduced in Section IV-B: involving the weight values in the K extended clustering center vectors as

\vec{C}_k = [w_1 c_{k1}, \cdots, w_m c_{km}, -\frac{1}{2}\sum_{j=1}^{m} w_j c_{kj}^2, \beta_1, \cdots, \beta_{m-1}]

These extended center vectors are encrypted in the same way as in our design for the original K-means.

We now show that such a change makes our design support privacy-preserving weighted distance comparison, and further leads to privacy-preserving MapReduce based weighted K-means clustering. Specifically, given the ciphertext E(\vec{D}_i) of a data object, we compute Comp_{ia} and Comp_{ib} for clustering centers \vec{C}_a and \vec{C}_b as

Comp_{ia} = \left\lceil \frac{E(\vec{D}_i) \times E(\vec{C}_a)}{\Gamma^2} \right\rfloor_q = \sum_{j=1}^{m} r_i d_{ij} w_j c_{aj} - \frac{r_i}{2}\sum_{j=1}^{m} w_j c_{aj}^2 + \sum_{j=1}^{m-1} \alpha_{ij}\beta_j   (7)

Comp_{ib} = \left\lceil \frac{E(\vec{D}_i) \times E(\vec{C}_b)}{\Gamma^2} \right\rfloor_q = \sum_{j=1}^{m} r_i d_{ij} w_j c_{bj} - \frac{r_i}{2}\sum_{j=1}^{m} w_j c_{bj}^2 + \sum_{j=1}^{m-1} \alpha_{ij}\beta_j

After that, the weighted distance comparison can be conducted as

Comp_{ia} - Comp_{ib} = \frac{r_i}{2}\sum_{j=1}^{m}\left(w_j c_{bj}^2 - w_j c_{aj}^2 + 2 w_j (c_{aj} - c_{bj}) d_{ij}\right)

                      = \frac{r_i}{2}\sum_{j=1}^{m} w_j \left(d_{ij}^2 + c_{bj}^2 - 2 d_{ij} c_{bj} - (d_{ij}^2 + c_{aj}^2 - 2 d_{ij} c_{aj})\right)

                      = \frac{r_i}{2}\left(WDist(\vec{D}_i, \vec{C}_b, \vec{W}) - WDist(\vec{D}_i, \vec{C}_a, \vec{W})\right)

In addition, since our extension does not make any change to the processing and encryption of the dataset, the later clustering center update is the same as in our design for the original K-means clustering. To this end, privacy-preserving weighted K-means clustering based on MapReduce can be supported.

D. Security Analysis

In this section, we show the security of our design under the Ciphertext Only Model and the Known Background Model described in Section II-B. We first prove the security of the encryption scheme for all data objects and clustering centers based on the hardness assumption of the Learning with Errors (LWE) problem [22], which guarantees that polynomial-time adversaries are not able to recover the owner's data directly from their ciphertexts.

Definition IV.3. Learning with Errors (LWE) Problem: Given polynomially many samples (\vec{a}_i \in Z_q^m, b_i \in Z_q) with

b_i = \vec{D} \times \vec{a}_i^T + \gamma_i

where the error term \gamma_i \in Z_q is drawn from some probability distribution, it is computationally infeasible to recover the vector \vec{D} with non-negligible probability.

Theorem IV.4. If the LWE problem is hard, then it is computationally infeasible for a polynomial-time adversary to recover \vec{D}_i from its ciphertexts E(\vec{D}_i) and E(\vec{D}'_i), or \vec{C}_k from its ciphertext E(\vec{C}_k), in our proposed scheme.

Proof. In our encryption, each extended \vec{D}_i is encrypted as

E(\vec{D}_i) = (\Gamma \cdot \vec{D}_i + \vec{e}_i) \times M   (8)

Since \vec{D}_i and \vec{D}'_i are encrypted in the same manner, we use \vec{D}_i in our proof for simplicity of expression. In E(\vec{D}_i), as \vec{D}_i and \vec{e}_i are 2m-dimensional vectors, their multiplication with the 2m \times 2m matrix M can be considered as 2m dot products of 2m-dimensional vectors as follows

E(\vec{D}_i)^{(1)} = \Gamma \cdot \vec{D}_i \times M^{(1)} + \vec{e}_i \times M^{(1)}
\cdots
E(\vec{D}_i)^{(2m)} = \Gamma \cdot \vec{D}_i \times M^{(2m)} + \vec{e}_i \times M^{(2m)}

where 1 \le j \le 2m, E(\vec{D}_i)^{(j)} is the j-th element of E(\vec{D}_i), and M^{(j)} is the j-th column of M. By denoting \Gamma \cdot M^{(j)} as M'^{(j)} and \vec{e}_i \times M^{(j)} as e'_i, we have 2m samples (M'^{(j)}, E(\vec{D}_i)^{(j)}) with

E(\vec{D}_i)^{(j)} = \vec{D}_i \times M'^{(j)} + e'_i,  1 \le j \le 2m

Therefore, recovering \vec{D}_i from the ciphertext E(\vec{D}_i) (respectively, \vec{D}'_i from E(\vec{D}'_i)) becomes the LWE problem presented in Definition IV.3, which is considered a hard problem and computationally infeasible. It is notable that, as M in our scheme is the secret key and is not available to the adversary, M'^{(j)} is actually also not available to the adversary. This fact makes the recovery of \vec{D}_i in our design even more difficult than the LWE problem.

With regard to any \vec{C}_k and its ciphertext E(\vec{C}_k), we can similarly convert its encryption into 2m dot products as

E(\vec{C}_k)^{(j)} = M'^{-1(j)} \times \vec{C}_k^T + e'_k,  1 \le j \le 2m


where M'^{-1(j)} = \Gamma \cdot M^{-1(j)}, M^{-1(j)} is the j-th row of M^{-1}, and e'_k = M^{-1(j)} \times \vec{e}_k^T. Thus, recovering \vec{C}_k from E(\vec{C}_k) also becomes the LWE problem, given the 2m samples (M'^{-1(j)}, E(\vec{C}_k)^{(j)})_{1 \le j \le 2m}. In addition, M'^{-1(j)} is also not available to the adversary, since M^{-1} is the secret key and only known to the dataset owner.

To this end, if the LWE problem is hard, recovering data objects and clustering centers from their corresponding ciphertexts is computationally infeasible for a polynomial-time adversary. Theorem IV.4 is proved.

As the cloud server only has access to ciphertexts in the Ciphertext Only Model, our proposed scheme is secure in this threat model according to our proof of Theorem IV.4. We now further analyze the security of our scheme in the Known Background Model.

Known Background Model: In this threat model, besides the ciphertexts available in the Ciphertext Only Model, the cloud server also has access to a small set of data objects obtained from background information and analysis. In the KNN problem, all data objects as well as query objects are independent of each other, and it is possible for the cloud server to obtain some query objects from background information without knowing other data objects. Differently, clustering centers in the K-means clustering setting are generated according to all data objects in the dataset on the fly, and are updated after each round of clustering. Thus, we consider that the cloud server cannot obtain clustering centers from background information and analysis. As the security of our encryption is guaranteed under the LWE problem, the cloud server is not able to recover data objects or clustering centers directly from ciphertexts. We now focus on the linear analysis attack introduced by Yao et al. [17]. Instead of recovering data objects and clustering centers directly from their ciphertexts, this kind of attack attempts to recover the data from the Euclidean distance comparison results shown in Eq. 5. Specifically, given the comparison result of one data object and any two clustering centers, the cloud server can construct an equation

Rst_{iab} = \frac{r_i}{2}\left(Dist\{C_b, D_i\} - Dist\{C_a, D_i\}\right)

In this equation, there are 3m + 1 unknowns from \vec{C}_a, \vec{C}_b, \vec{D}_i, and r_i. As the cloud has access to a small set of data objects, it can reduce the number of unknowns in an equation Rst_{iab} to 2m + 1 (from \vec{C}_a, \vec{C}_b, and r_i) if D_i is in its known set of data objects. The original idea of the linear analysis attack is: if the cloud can obtain more than 2m data objects, it can construct 2m Rst_{iab} equations for the 2m unknowns in \vec{C}_a, \vec{C}_b, and solve them to recover \vec{C}_a and \vec{C}_b. However, such an attack cannot work in our design. This is because we embed a different random number r_i in each data object \vec{D}_i. With such a design, each additional equation Rst_{iab} constructed from a newly obtained \vec{D}_i also introduces a new unknown r_i, and thus brings no contribution to recovering \vec{C}_a, \vec{C}_b. Therefore, when the cloud obtains 2m data objects, it can only construct 2m equations for 4m unknowns from \vec{C}_a, \vec{C}_b, and the 2m random numbers r_i, which are unsolvable using linear analysis. To this end, our scheme prevents the cloud server from learning data objects as well as clustering centers in the Known Background Model.

V. EVALUATION

A. Numerical Evaluation

In this section, we numerically analyze our proposed scheme in terms of computational cost, communication overhead, and storage overhead. We also compare the cost of our scheme with the original K-means algorithm, and summarize the results in Table II. For simplicity of expression, we use DOT_{2m} to denote one 2m-dimensional dot product operation in the rest of this paper. In particular, given two 2m-dimensional vectors \vec{A} = [a_1, a_2, \cdots, a_{2m}] and \vec{B} = [b_1, b_2, \cdots, b_{2m}], a DOT_{2m} operation for them is \vec{A} \times \vec{B}^T = \sum_{j=1}^{2m} a_j \cdot b_j. We ignore single addition and single modular operations in our evaluation, since their costs are negligible compared to the DOT_{2m} operation.

1) Computational Cost: In our scheme, the dataset owner is responsible for key generation, dataset encryption, and clustering center updates. The key generation process only involves the one-time selection of two random 2m \times 2m invertible matrices. To encrypt a data object with m elements, the owner needs to perform 4m DOT_{2m} operations, as shown in Eq. 1 and Eq. 2. Similarly, the encryption cost for a clustering center is 2m DOT_{2m} operations, as shown in Eq. 3. Therefore, given a dataset of n objects and K clusters, the owner needs (4mn + 2mK) DOT_{2m} operations for the one-time encryption. For each round of clustering center updates, the owner first needs 2mK DOT_{2m} operations to decrypt the intermediate outputs from the cloud server, as shown in Eq. 6. Then, another 2mK DOT_{2m} operations are needed to re-encrypt the K updated clustering centers. Therefore, the total computational cost on the owner for a round of clustering is 4mK DOT_{2m} operations. With regard to the cloud server, in each round of clustering it needs K DOT_{2m} operations, as shown in Eq. 4, to allocate a data object to the closest center. Thus, to allocate n objects to K clusters in a single round of clustering, the computational cost on the cloud server is nK DOT_{2m} operations. In comparison, in the original K-means clustering presented in Section III-A, the cloud server needs to compute K squares of Euclidean distances to allocate a data object in each round of clustering. Since the square of the Euclidean distance between two m-dimensional vectors contains the same number of addition and multiplication operations as a DOT_m operation, we represent each square of Euclidean distance as a DOT_m operation. To allocate n objects to K clusters in a round of clustering, the original K-means clustering algorithm requires nK DOT_m operations on the cloud server. Therefore, our scheme has the same computational complexity on the cloud server for each round of clustering.
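As a worked example (ours), plugging in the parameters later used in the experiments (n = 5 × 10^6, m = 10, K = 10) gives:

```latex
\text{Cloud, per round: } nK = 5\times10^{6} \cdot 10 = 5\times10^{7}\ \mathrm{DOT}_{2m}
\text{Owner, per round: } 4mK = 4 \cdot 10 \cdot 10 = 400\ \mathrm{DOT}_{2m}
\text{Owner, one-time encryption: } 4mn + 2mK = 2\times10^{8} + 200\ \mathrm{DOT}_{2m}
```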

2) Communication Overhead: The communication overhead in our scheme mainly comes from the interaction after each round of clustering. Specifically, if the clustering is not finished, the cloud server needs to send back the K aggregated ciphertexts {SUM_k}_{1 \le k \le K}, which are 2m-dimensional vectors. After that, the owner needs to return K ciphertexts {E(\vec{C}'_k)}_{1 \le k \le K} for the updated clustering centers, which are also 2m-dimensional vectors.


TABLE II
NUMERICAL ANALYSIS FOR A SINGLE ROUND OF CLUSTERING

                    Computational Complexity    Computational Complexity    Communication Overhead
                    (Cloud Server)              (Owner)
Our Scheme          nK DOT_{2m}                 4mK DOT_{2m}                4mK vector elements
Original K-means    nK DOT_m                    N/A                         N/A

*n: the number of data objects in the dataset; m: the number of elements in a data object; K: the number of clustering centers. Typically, n >> m, K.

Thus, the total communication cost in each round of clustering is 4mK vector elements, each of which is 8 bytes in our implementation. It is notable that the interaction cost for each round of clustering is independent of the size of the dataset, which makes our scheme scalable for large-scale datasets. Although the communication overhead is still linear in the number of clustering centers, this number is typically small in a practical clustering.
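As a worked check (ours), with m = 10 and K = 10 as in the experiments and 8-byte elements:

```latex
4mK \times 8\,\mathrm{B} = 4 \cdot 10 \cdot 10 \cdot 8\,\mathrm{B} = 3200\,\mathrm{B} = 3.2\,\mathrm{KB}
```

which matches the 3.2KB per-round overhead measured in Section V-B3, split evenly between the aggregated ciphertexts (1.6KB) and the re-encrypted centers (1.6KB).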

3) Storage Overhead: The storage overhead is introduced by the encryption of the dataset and the clustering centers. In our scheme, each data object and clustering center is denoted as an m-dimensional vector, and will be encrypted as two 2m-dimensional vectors. Thus, the total storage cost of our scheme is four times that of unprotected clustering.

4) Comparison with ref [16]: In this section, we also compare our proposed encryption scheme with that in ref [16], and summarize the results in Table III. As ref [16] is not designed for K-means, our comparison focuses on the major operation in ref [16], i.e., privacy-preserving Euclidean distance comparison. Specifically, given a data object vector with m elements, we assume both our scheme and ref [16] extend it to 2m elements by adding random elements (artificial elements in [16], respectively). To encrypt the extended vector, say \vec{D}, ref [16] first splits it into two vectors \vec{D}_A and \vec{D}_B, and then encrypts them as two 2m-dimensional vectors E(\vec{D}_A) and E(\vec{D}_B) using matrix multiplication. Differently, our scheme directly encrypts \vec{D} and outputs only one 2m-dimensional vector E(\vec{D}) as the ciphertext. As all ciphertexts will be outsourced to the cloud server for further processing, it is clear that the storage overhead introduced by ref [16] is twice that of our scheme. With regard to the Euclidean distance comparison process on the cloud, i.e., figuring out which of n encrypted vectors has the smallest Euclidean distance to an encrypted request vector, ref [16] requires 2n DOT_{2m} operations. Differently, our proposed scheme only requires n DOT_{2m} operations. This is because ref [16] needs to process two ciphertexts for each vector in the dataset, while only one is necessary in our scheme. Similarly, to encrypt a request vector, ref [16] needs two DOT_{2m} operations on the dataset owner, but our scheme only requires one DOT_{2m} operation. Therefore, compared with ref [16], our scheme saves about 50% of the computational cost on the cloud server and the data owner for privacy-preserving Euclidean distance comparison.

B. Experimental Evaluation

1) Experiment Configuration: We evaluate the performance of our privacy-preserving MapReduce based K-means clustering scheme in terms of efficiency, scalability, and accuracy.

TABLE III
NUMERICAL ANALYSIS OF PRIVACY-PRESERVING EUCLIDEAN DISTANCE COMPARISON

                                      Our Scheme             Ref [16]
Computational Cost (Cloud Server)     n DOT_{2m}             2n DOT_{2m}
Computational Cost (Owner)            DOT_{2m}               2 DOT_{2m}
Storage Overhead                      2nm vector elements    4nm vector elements

We implemented a prototype on the Microsoft Azure cloud [31] using Java 1.7. We deployed a cluster of 6 to 10 nodes with Apache Hadoop 2.6.3 [32] installed. Two nodes are used as head nodes and the other 2 to 8 are used as worker nodes. Each node runs Ubuntu Linux 12.04 with four 2.40GHz CPU cores and 14GB of memory. The local machine of the dataset owner is a desktop running OS X 10.11 with eight 3.3GHz CPU cores and 8GB of memory. To support the matrix-related operations in our scheme, the jblas library 1.2.4 [33] is adopted in the implementation. The dataset used in our evaluation consists of 5 million simulated data objects. Each object has 10 elements, and can be represented as a 10-dimensional vector. These objects need to be allocated into 10 clusters. To demonstrate that our scheme introduces reasonable computation and communication overhead for the privacy guarantee, we also implemented a non-privacy-preserving MapReduce based K-means under the same configuration. All experimental results represent the mean of 10 trials.


Fig. 6. Dataset Encryption Cost

2) System Setup: As discussed in Section V-A1, the system setup cost mainly comes from the dataset encryption by the dataset owner. As shown in Fig. 6, the dataset encryption cost increases linearly from 1.26s to 6.34s when the size of the dataset grows from 1 million objects to 5 million objects.


Note that this is a one-time cost in our scheme, and does not affect the subsequent clustering performance.

3) Efficiency: In our evaluation, we focus on the efficiency of a single round of clustering, because different rounds of clustering have the same computational cost on the owner and the cloud server, as shown in Section V-A1. In addition, the number of clustering rounds is mainly determined by the dataset itself and the selection of the initial clustering centers, and is independent of the design of our scheme.


Fig. 7. Computational Cost for a Single Round of Clustering (our scheme's cloud server cost and total cost vs. the non-privacy-preserving solution)

Using 4 worker nodes, Fig. 7 shows that the cloud server spends 5.05s to 10.22s to perform a single round of clustering over datasets of 1 million to 5 million objects. Compared with the non-privacy-preserving version, our scheme only brings within 35% more computational cost on the cloud for a single round of clustering. This is consistent with our analysis in Section V-A1, since our scheme achieves the same computational complexity on the cloud server as a non-privacy-preserving design. After each round of clustering, the dataset owner needs to update 10 clustering centers, which only costs 65ms. The total communication overhead after each round of clustering is 3.2KB¹, in which 1.6KB is aggregated ciphertexts returned by the cloud server and the other 1.6KB is updated clustering centers uploaded by the owner. It is notable that our communication overhead is independent of the size of the dataset, as shown in Fig. 8. This decent feature also promotes the scalability of our scheme for large-scale datasets. Using a 100Mb bandwidth Internet connection in our experiment, the communication after each round of clustering takes 1.34s. Therefore, the total cost for a single round of clustering ranges from 6.39s to 11.56s as the size of the dataset varies from 1 million to 5 million objects, as shown in Fig. 7.

4) Scalability: We evaluate the scalability of our scheme with respect to "scaleup" [34]. Specifically, scaleup is the ability to use m-times larger resources to perform an m-times larger job in the same running time as the original job.

¹Each element in a data object is formatted as a Long in Java, which is 8 bytes.


Fig. 8. Communication Overhead for a Single Round of Clustering

Thus, given an original job time T, the scaleup rate is defined as (% of the job finished within time T) / 100%. In our evaluation, the original job is set as the clustering over 1 million objects using two worker nodes. We then increase the number of worker nodes to 4, 6, and 8, and the number of data objects to 2 million, 3 million, and 4 million, respectively. As demonstrated in Fig. 9, the 2-times, 3-times, and 4-times scaleups in our scheme have scaleup rates of 0.92, 0.84, and 0.73, respectively, which is comparable with the scalability of the non-privacy-preserving MapReduce based K-means clustering [19].
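To make the metric concrete, here is a small illustrative sketch of the computation (names and inputs are our own, mapping the measured completion percentages to the rates reported in Fig. 9):

public class ScaleupRate {
    // Scaleup rate = (% of the m-times larger job finished within the
    // original job time T) / 100%; a rate of 1.0 means perfect scaleup.
    static double rate(double percentFinishedInT) {
        return percentFinishedInT / 100.0;
    }

    public static void main(String[] args) {
        System.out.println(rate(92.0)); // 2-times job: 0.92
        System.out.println(rate(84.0)); // 3-times job: 0.84
        System.out.println(rate(73.0)); // 4-times job: 0.73
    }
}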

Fig. 9. Scaleup Evaluation (scaleup rate vs. m-times; series: Our Scheme, Non-Privacy-preserving Solution).

5) Accuracy: Compared with the original K-means clustering algorithm, our scheme does not introduce any accuracy loss if all initial clustering centers are selected in the same way. In particular, the allocation of a data object to the closest center is determined by the Euclidean distance between the object and the center. As discussed in Stage 2 of our scheme, the Euclidean distance comparison result over encrypted data in our scheme is exactly the same as that over unprotected data. Moreover, the update of clustering centers is also the same as that in the original K-means clustering. Therefore, our scheme achieves the same accuracy as the original K-means clustering.


TABLE IV
ACCURACY COMPARISON (# ITERATIONS = 100)

Cluster | Our Scheme | Original K-means
1       | 5991       | 5991
2       | 71883      | 71883
3       | 41598      | 41598
4       | 125032     | 125032
5       | 17124      | 17124
6       | 320911     | 320911
7       | 98788      | 98788
8       | 12535      | 12535
9       | 220132     | 220132
10      | 86006      | 86006

As a proof of concept, we perform a 100-round clustering over one million data objects. The clustering results of our scheme and the original K-means algorithm are compared by the number of data objects in each cluster after the same number of clustering rounds. As shown in Table IV, our scheme has the same clustering results as the original K-means after 100 rounds of clustering.
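For clarity, the check behind Table IV can be expressed as the following minimal sketch (illustrative names; it assumes both executions start from the same initial centers):

import java.util.Arrays;

// Illustrative accuracy check corresponding to Table IV: given the number
// of objects placed in each of the k clusters by the two executions after
// the same number of rounds, verify that the results match exactly.
public class AccuracyCheck {
    public static boolean sameClustering(long[] oursPerCluster,
                                         long[] originalPerCluster) {
        return Arrays.equals(oursPerCluster, originalPerCluster);
    }
}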

VI. RELATED WORK

A. Privacy-preserving Clustering

In recent years, a number of schemes have been proposed to outsource clustering tasks in a privacy-preserving manner. In ref [10], [11], distance-preserving data perturbation or data transformation techniques are adopted to protect the privacy of the dataset while keeping the distance comparison property for clustering purposes. These perturbation based techniques are very efficient and even achieve the same computational cost as the original clustering algorithm. This is because data perturbation based encryption makes the ciphertext have the same size as the original data, and uses the same clustering operations as the original clustering algorithm. However, as shown in ref [12], [13], these data perturbation based solutions do not provide a sufficient privacy guarantee. Specifically, once adversaries obtain a small set of unencrypted data objects in the dataset from background analysis, they are able to recover the remaining objects [12]. To provide a strong privacy guarantee, novel cryptographic primitives have been adopted in privacy-preserving clustering outsourcing. In ref [14], a privacy-preserving outsourcing design for K-means clustering is proposed by utilizing homomorphic encryption and order preserving indexes. Nevertheless, as shown in ref [15], the homomorphic encryption adopted in ref [14] is not secure. Moreover, ref [14] is efficient only for small datasets, e.g., fewer than 50,000 data objects. As a comparison, ref [14] requires 15.48 seconds for a single round of clustering over only 30,000 data objects, while our proposed scheme can achieve a single round of clustering over 5 million data objects within 15 seconds, as evaluated in Section V-B3. The other promising candidate for privacy-preserving outsourcing of K-means clustering is the secure outsourcing of general linear algebra computations [35], since all required operations in K-means clustering can be converted to linear algebra computations. However, general secure computation outsourcing mainly focuses on one-round

computation, while K-means clustering is an iterative process and needs the ciphertexts to be updated in each round.

The problem of privacy-preserving clustering has also been studied in the distributed setting [4]–[9]. These schemes mainly rely on secure multi-party computation techniques, such as secure circuit evaluation, homomorphic encryption, and oblivious transfer. Nevertheless, privacy-preserving distributed clustering has a different purpose from privacy-preserving outsourcing of clustering. These designs involve multiple entities, which perform clustering over their shared data without disclosing their data to each other. In contrast, the dataset in clustering outsourcing is owned by a single entity, who wants to minimize the local computational cost of large-scale clustering.

Another line of research that is related to this work is privacy-preserving KNN search, since both K-means and KNN use Euclidean distance to measure the similarity of data vectors. An efficient matrix based privacy-preserving KNN search scheme was first proposed by Wong et al. [16], in which they convert the Euclidean distance comparison to scalar product computation. Nevertheless, as demonstrated by Yao et al. [17], ref [16] is vulnerable to the linear analysis attack when the cloud server obtains a set of data objects from the dataset. To overcome this security vulnerability, Yao et al. [17] present a secure solution by adopting a novel partition-based secure Voronoi diagram design. Unfortunately, their scheme only supports data with no more than two dimensions, and thus becomes impractical for most types of data in the domain of clustering. Recently, Su et al. [18] further modified ref [16] by adding a random noise term to each data vector, which enables their scheme to resist the linear analysis attack. However, their design also sacrifices search accuracy to some extent, since the added noise terms are included in the Euclidean distance comparison. In contrast, our proposed scheme supports data of any number of dimensions, is resistant to linear analysis attacks as shown in Section IV-D, and does not introduce any accuracy loss. Furthermore, considering privacy-preserving Euclidean distance comparison only, our scheme also significantly reduces computational cost and storage overhead compared with ref [16], as discussed in Section V-A4. In addition, extending privacy-preserving KNN to support the outsourcing of K-means clustering is not a trivial task. Unlike the KNN search, which is a single round task, K-means clustering is an iterative process and requires the update of clustering centers based on all objects in the dataset after each round of clustering. To guarantee the efficiency and privacy of the entire clustering process, our scheme uniquely makes these updates compatible with MapReduce and allows them to be mainly handled by the cloud server over ciphertexts. In particular, the dataset owner only needs to perform a constant number of operations for the update of clustering centers, as evaluated in Section V, which is independent of the size of the large-scale dataset.
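For context, the reduction behind such scalar-product based designs can be sketched as follows (generic notation of our own, not taken verbatim from ref [16]): for a data object $o$ and two candidate centers $c_1$ and $c_2$,

\[
\|o - c_1\|^2 \le \|o - c_2\|^2
\;\Longleftrightarrow\;
\|c_1\|^2 - 2\, o \cdot c_1 \le \|c_2\|^2 - 2\, o \cdot c_2,
\]

since expanding both sides produces a common $\|o\|^2$ term that cancels. Comparing Euclidean distances therefore only requires scalar products involving $o$, which is exactly the operation that matrix-based encryptions are designed to preserve.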

B. MapReduce Based K-means Clustering

To efficiently perform K-means clustering over large-scale datasets, the MapReduce framework [23] has been frequently adopted by researchers [19]–[21]. In ref [19], a fast parallel K-means clustering algorithm based on MapReduce is proposed.


Later on, ref [20] parallelized the initial phase of an existing efficient K-means algorithm and implemented it using MapReduce. After that, the performance of MapReduce based K-means was further optimized for large-scale datasets in ref [21]. Similarly, MapReduce is adopted in ref [36] for efficient earth mover's distance similarity joins over large-scale datasets. Nevertheless, all these designs only focus on improving computational performance over large-scale datasets, and none of them takes privacy protection into consideration.
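As a reference point, a plaintext (non-privacy-preserving) round of MapReduce K-means in the spirit of ref [19] can be sketched as follows; the class names, the "kmeans.centers" job parameter, and the comma-separated input format are our own illustrative assumptions, not taken from ref [19] or from our prototype.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Plaintext sketch of one MapReduce K-means round: the mapper assigns each
// object to its nearest center; the reducer averages each cluster's objects
// to produce the updated center for the next round.
public class KMeansRound {

    public static class AssignMapper
            extends Mapper<LongWritable, Text, IntWritable, Text> {
        private double[][] centers;

        @Override
        protected void setup(Context ctx) {
            // Illustrative: centers passed as "c11,c12,...;c21,c22,...".
            String[] rows = ctx.getConfiguration().get("kmeans.centers").split(";");
            centers = new double[rows.length][];
            for (int j = 0; j < rows.length; j++) {
                String[] p = rows[j].split(",");
                centers[j] = new double[p.length];
                for (int i = 0; i < p.length; i++)
                    centers[j][i] = Double.parseDouble(p[i]);
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split(",");
            double[] obj = new double[parts.length];
            for (int i = 0; i < parts.length; i++)
                obj[i] = Double.parseDouble(parts[i]);
            // Nearest center by squared Euclidean distance.
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int j = 0; j < centers.length; j++) {
                double d = 0;
                for (int i = 0; i < obj.length; i++) {
                    double diff = obj[i] - centers[j][i];
                    d += diff * diff;
                }
                if (d < bestDist) { bestDist = d; best = j; }
            }
            ctx.write(new IntWritable(best), value);
        }
    }

    public static class UpdateReducer
            extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable cluster, Iterable<Text> objs,
                              Context ctx) throws IOException, InterruptedException {
            double[] sum = null;
            long count = 0;
            for (Text t : objs) {
                String[] parts = t.toString().split(",");
                if (sum == null) sum = new double[parts.length];
                for (int i = 0; i < parts.length; i++)
                    sum[i] += Double.parseDouble(parts[i]);
                count++;
            }
            // Updated center = arithmetic mean of the cluster's objects.
            StringBuilder center = new StringBuilder();
            for (int i = 0; i < sum.length; i++) {
                if (i > 0) center.append(',');
                center.append(sum[i] / count);
            }
            ctx.write(cluster, new Text(center.toString()));
        }
    }
}

In our setting, the privacy-preserving counterpart performs the distance comparisons and aggregations over ciphertexts instead, but the MapReduce structure is analogous.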

VII. CONCLUSION

In this work, we proposed a privacy-preserving MapReduce based K-means clustering scheme in cloud computing. Thanks to our light-weight encryption design based on the LWE hard problem, our scheme achieves clustering speed and accuracy that are comparable to K-means clustering without privacy protection. To support large-scale datasets, we securely integrated the MapReduce framework into our design, making it extremely suitable for parallelized processing in cloud computing environments. In addition, the privacy-preserving Euclidean distance comparison component proposed in our design can also be used as an independent tool for distance based applications. We provide thorough analysis to show the security and efficiency of our scheme. Our prototype implementation over 5 million data objects demonstrates that our scheme is efficient, scalable, and accurate for K-means clustering over large-scale datasets.

REFERENCES

[1] European Network and Information Security Agency. Cloud computing security risk assessment. https://www.enisa.europa.eu/activities/risk-management/files/deliverables/cloud-computing-risk-assessment.

[2] Darcy A. Davis, Nitesh V. Chawla, Nicholas Blumm, Nicholas Christakis, and Albert-Laszlo Barabasi. Predicting individual disease risk based on medical history. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM '08, pages 769–778, Napa Valley, California, USA, 2008.

[3] U.S. Dept. of Health & Human Services. Standards for privacy of individually identifiable health information, final rule, 45 CFR, pt 160–164. http://www.hhs.gov/sites/default/files/introduction.pdf, 2002.

[4] Jaideep Vaidya and Chris Clifton. Privacy-preserving k-means clustering over vertically partitioned data. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03, pages 206–215, New York, NY, USA, 2003. ACM.

[5] Geetha Jagannathan and Rebecca N. Wright. Privacy-preserving distributed k-means clustering over arbitrarily partitioned data. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 593–599, New York, NY, USA, 2005. ACM.

[6] Paul Bunn and Rafail Ostrovsky. Secure two-party k-means clustering. In Proceedings of the 14th ACM Conference on Computer and Communications Security, CCS '07, pages 486–497, New York, NY, USA, 2007. ACM.

[7] Mahir Can Doganay, Thomas B. Pedersen, Yucel Saygin, Erkay Savas, and Albert Levi. Distributed privacy preserving k-means clustering with additive secret sharing. In Proceedings of the 2008 International Workshop on Privacy and Anonymity in Information Society, PAIS '08, pages 3–11, New York, NY, USA, 2008. ACM.

[8] Jun Sakuma and Shigenobu Kobayashi. Large-scale k-means clustering with user-centric privacy-preservation. Knowledge and Information Systems, 25(2):253–279, 2009.

[9] Xun Yi and Yanchun Zhang. Equally contributory privacy-preserving k-means clustering over vertically partitioned data. Inf. Syst., 38(1):97–107, March 2013.

[10] Rakesh Agrawal and Ramakrishnan Srikant. Privacy-preserving data mining. SIGMOD Rec., 29(2):439–450, May 2000.

[11] Stanley R. M. Oliveira and Osmar R. Zaiane. Privacy preserving clustering by data transformation. In Brazilian Symposium on Databases, SBBD, Manaus, Amazonas, Brazil, 2003.

[12] Kun Liu, Chris Giannella, and Hillol Kargupta. An attacker's view of distance preserving maps for privacy preserving data mining. In Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD '06, pages 297–308, Berlin, Heidelberg, 2006. Springer-Verlag.

[13] H. Kargupta, S. Datta, Q. Wang, and Krishnamoorthy Sivakumar. On the privacy preserving properties of random data perturbation techniques. In Data Mining, 2003. ICDM 2003. Third IEEE International Conference on, pages 99–106, Nov 2003.

[14] Dongxi Liu, Elisa Bertino, and Xun Yi. Privacy of outsourced k-means clustering. In Proceedings of the 9th ACM Symposium on Information, Computer and Communications Security, ASIA CCS '14, pages 123–134, New York, NY, USA, 2014. ACM.

[15] Yongge Wang. Notes on two fully homomorphic encryption schemes without bootstrapping. Cryptology ePrint Archive, Report 2015/519, 2015.

[16] Wai Kit Wong, David Wai-lok Cheung, Ben Kao, and Nikos Mamoulis. Secure kNN computation on encrypted databases. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, SIGMOD '09, pages 139–152, New York, NY, USA, 2009. ACM.

[17] B. Yao, F. Li, and X. Xiao. Secure nearest neighbor revisited. In Data Engineering (ICDE), 2013 IEEE 29th International Conference on, pages 733–744, April 2013.

[18] Sen Su, Yiping Teng, Xiang Cheng, Yulong Wang, and Guoliang Li. Privacy-preserving top-k spatial keyword queries over outsourced database. In Proceedings of the 20th International Conference on Database Systems for Advanced Applications, DASFAA '15, pages 589–608, 2015.

[19] Weizhong Zhao, Huifang Ma, and Qing He. Parallel k-means clustering based on MapReduce. In Proceedings of the 1st International Conference on Cloud Computing, CloudCom '09, pages 674–679, Berlin, Heidelberg, 2009. Springer-Verlag.

[20] Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and Sergei Vassilvitskii. Scalable k-means++. Proc. VLDB Endow., 5(7):622–633, March 2012.

[21] Xiaoli Cui, Pingfei Zhu, Xin Yang, Keqiu Li, and Changqing Ji. Optimized big data k-means clustering using MapReduce. J. Supercomput., 70(3):1249–1259, December 2014.

[22] Zvika Brakerski, Craig Gentry, and Shai Halevi. Packed ciphertexts in LWE-based homomorphic encryption. In 16th International Conference on Practice and Theory in Public-Key Cryptography (PKC), pages 1–13, February 2013.

[23] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, January 2008.

[24] Sabrina De Capitani di Vimercati, Sara Foresti, Sushil Jajodia, Stefano Paraboschi, and Pierangela Samarati. Over-encryption: Management of access control evolution on outsourced data. In Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB '07, pages 123–134. VLDB Endowment, 2007.

[25] Jiawei Yuan and Shucheng Yu. Privacy preserving back-propagation neural network learning made practical with cloud computing. IEEE Transactions on Parallel and Distributed Systems, 25(1):212–221, 2014.

[26] Ning Cao, Zhenyu Yang, Cong Wang, Kui Ren, and Wenjing Lou. Privacy-preserving query over encrypted graph-structured data in cloud computing. In Distributed Computing Systems (ICDCS), 2011 31st International Conference on, pages 393–402, 2011.

[27] Jiawei Yuan and Shucheng Yu. Efficient privacy-preserving biometric identification in cloud computing. In 2013 Proceedings IEEE INFOCOM (INFOCOM '2013), pages 2652–2660, Turin, Italy, April 2013.

[28] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques, Second Edition, Chapter 7. Morgan Kaufmann, 2006.

[29] Margareta Ackerman, Shai Ben-David, Simina Branzei, and David Loker. Weighted clustering. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, pages 858–863, 2012.

[30] Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman, and Chaitanya Swamy. The effectiveness of Lloyd-type methods for the k-means problem. J. ACM, 59(6):28:1–28:22, January 2013.

[31] Microsoft Azure cloud. https://azure.microsoft.com/en-us/.

[32] Apache Hadoop. http://hadoop.apache.org/.

[33] Mikio L. Braun. jblas library. http://jblas.org/.

[34] Xiaowei Xu, Jochen Jager, and Hans-Peter Kriegel. A fast parallel clustering algorithm for large spatial databases. Data Min. Knowl. Discov., 3(3):263–290, September 1999.


[35] Mikhail J. Atallah and Keith B. Frikken. Securely outsourcing linear algebra computations. In Proceedings of the 5th ACM Symposium on Information, Computer and Communications Security, ASIACCS '10, pages 48–59, New York, NY, USA, 2010. ACM.

[36] J. Huang, R. Zhang, R. Buyya, and J. Chen. Melody-Join: Efficient earth mover's distance similarity joins using MapReduce. In 2014 IEEE 30th International Conference on Data Engineering, pages 808–819, March 2014.

Jiawei Yuan (S'11-M'15) has been an assistant professor of Computer Science in the Dept. of ECSSE at Embry-Riddle Aeronautical University since 2015. He received his Ph.D. in 2015 from the University of Arkansas at Little Rock, and a B.S. in 2011 from the University of Electronic Science and Technology of China. His research interests are in the areas of cyber-security and privacy in cloud computing and big data, eHealth security, and applied cryptography. He is a member of IEEE.

Yifan Tian (S'16) has been a Ph.D. student at Embry-Riddle Aeronautical University since 2016. He received his M.S. in 2015 from Johns Hopkins University, and a B.S. in 2014 from Tongji University, China. His research interests are in the areas of cyber security and network security, with a current focus on secure computation outsourcing. He is a student member of IEEE.