IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, … filek-Nearest Neighbor Classification over...

13
k-Nearest Neighbor Classification over Semantically Secure Encrypted Relational Data Bharath K. Samanthula, Member, IEEE, Yousef Elmehdwi, and Wei Jiang, Member, IEEE Abstract—Data Mining has wide applications in many areas such as banking, medicine, scientific research and among government agencies. Classification is one of the commonly used tasks in data mining applications. For the past decade, due to the rise of various privacy issues, many theoretical and practical solutions to the classification problem have been proposed under different security models. However, with the recent popularity of cloud computing, users now have the opportunity to outsource their data, in encrypted form, as well as the data mining tasks to the cloud. Since the data on the cloud is in encrypted form, existing privacy-preserving classification techniques are not applicable. In this paper, we focus on solving the classification problem over encrypted data. In particular, we propose a secure k-NN classifier over encrypted data in the cloud. The proposed protocol protects the confidentiality of data, privacy of user’s input query, and hides the data access patterns. To the best of our knowledge, our work is the first to develop a secure k-NN classifier over encrypted data under the semi-honest model. Also, we empirically analyze the efficiency of our proposed protocol using a real-world dataset under different parameter settings. Index Terms—Security, k-NN classifier, outsourced databases, encryption Ç 1 INTRODUCTION R ECENTLY, the cloud computing paradigm [1] is revolu- tionizing the organizations’ way of operating their data particularly in the way they store, access and process data. As an emerging computing paradigm, cloud computing attracts many organizations to consider seriously regarding cloud potential in terms of its cost-efficiency, flexibility, and offload of administrative overhead. Most often, organiza- tions delegate their computational operations in addition to their data to the cloud. Despite tremendous advantages that the cloud offers, privacy and security issues in the cloud are preventing companies to utilize those advantages. When data are highly sensitive, the data need to be encrypted before outsourcing to the cloud. However, when data are encrypted, irrespective of the underlying encryption scheme, performing any data mining tasks becomes very challenging without ever decrypting the data. There are other privacy concerns, demonstrated by the following example. Example 1. Suppose an insurance company outsourced its encrypted customers database and relevant data min- ing tasks to a cloud. When an agent from the company wants to determine the risk level of a potential new customer, the agent can use a classification method to determine the risk level of the customer. First, the agent needs to generate a data record q for the customer containing certain personal information of the customer, e.g., credit score, age, marital status, etc. Then this record can be sent to the cloud, and the cloud will compute the class label for q. Nevertheless, since q contains sensitive information, to protect the customer’s privacy, q should be encrypted before send- ing it to the cloud. The above example shows that data mining over encrypted data (denoted by DMED) on a cloud also needs to protect a user’s record when the record is a part of a data mining process. Moreover, cloud can also derive useful and sensitive information about the actual data items by observ- ing the data access patterns even if the data are encrypted [2], [3]. Therefore, the privacy/security requirements of the DMED problem on a cloud are threefold: (1) confidentiality of the encrypted data, (2) confidentiality of a user’s query record, and (3) hiding data access patterns. Existing work on privacy-preserving data mining (PPDM) (either perturbation or secure multi-party computa- tion (SMC) based approach) cannot solve the DMED prob- lem. Perturbed data do not possess semantic security, so data perturbation techniques cannot be used to encrypt highly sensitive data. Also the perturbed data do not pro- duce very accurate data mining results. Secure multi-party computation based approach assumes data are distributed and not encrypted at each participating party. In addition, many intermediate computations are performed based on non-encrypted data. As a result, in this paper, we proposed novel methods to effectively solve the DMED problem assuming that the encrypted data are outsourced to a cloud. Specifically, we focus on the classification problem since it is one of the most common data mining tasks. Because each classification technique has their own advantage, to be con- crete, this paper concentrates on executing the k-nearest neighbor classification method over encrypted data in the cloud computing environment. B.K. Samanthula is with the Department of Computer Science, Purdue University, 305 N. University Street, West Lafayette, IN 47907. E-mail: [email protected]. Y. Elmehdwi and W. Jiang are with the Department of Computer Science, Missouri University of Science and Technology, 310 CS Building, 500 West 15th St., Rolla, MO 65409. E-mail: {ymez76, wjiang}@mst.edu. Manuscript received 23 Oct. 2013; revised 10 Sept. 2014; accepted 29 Sept. 2014. Date of publication 19 Oct. 2014; date of current version 27 Mar. 2015. Recommended for acceptance by G. Miklau. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TKDE.2014.2364027 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 5, MAY 2015 1261 1041-4347 ß 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Transcript of IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, … filek-Nearest Neighbor Classification over...

k-Nearest Neighbor Classification overSemantically Secure Encrypted Relational Data

Bharath K. Samanthula,Member, IEEE, Yousef Elmehdwi, and Wei Jiang,Member, IEEE

Abstract—Data Mining has wide applications in many areas such as banking, medicine, scientific research and among government

agencies. Classification is one of the commonly used tasks in data mining applications. For the past decade, due to the rise of various

privacy issues, many theoretical and practical solutions to the classification problem have been proposed under different security

models. However, with the recent popularity of cloud computing, users now have the opportunity to outsource their data, in encrypted

form, as well as the data mining tasks to the cloud. Since the data on the cloud is in encrypted form, existing privacy-preserving

classification techniques are not applicable. In this paper, we focus on solving the classification problem over encrypted data. In

particular, we propose a secure k-NN classifier over encrypted data in the cloud. The proposed protocol protects the confidentiality of

data, privacy of user’s input query, and hides the data access patterns. To the best of our knowledge, our work is the first to develop a

secure k-NN classifier over encrypted data under the semi-honest model. Also, we empirically analyze the efficiency of our proposed

protocol using a real-world dataset under different parameter settings.

Index Terms—Security, k-NN classifier, outsourced databases, encryption

Ç

1 INTRODUCTION

RECENTLY, the cloud computing paradigm [1] is revolu-tionizing the organizations’ way of operating their data

particularly in the way they store, access and process data.As an emerging computing paradigm, cloud computingattracts many organizations to consider seriously regardingcloud potential in terms of its cost-efficiency, flexibility, andoffload of administrative overhead. Most often, organiza-tions delegate their computational operations in addition totheir data to the cloud. Despite tremendous advantages thatthe cloud offers, privacy and security issues in the cloud arepreventing companies to utilize those advantages. Whendata are highly sensitive, the data need to be encryptedbefore outsourcing to the cloud. However, when data areencrypted, irrespective of the underlying encryption scheme,performing any data mining tasks becomes very challengingwithout ever decrypting the data. There are other privacyconcerns, demonstrated by the following example.

Example 1. Suppose an insurance company outsourced itsencrypted customers database and relevant data min-ing tasks to a cloud. When an agent from the companywants to determine the risk level of a potential newcustomer, the agent can use a classification method todetermine the risk level of the customer. First, theagent needs to generate a data record q for the

customer containing certain personal information ofthe customer, e.g., credit score, age, marital status, etc.Then this record can be sent to the cloud, and thecloud will compute the class label for q. Nevertheless,since q contains sensitive information, to protect thecustomer’s privacy, q should be encrypted before send-ing it to the cloud.

The above example shows that data mining overencrypted data (denoted by DMED) on a cloud also needsto protect a user’s record when the record is a part of a datamining process. Moreover, cloud can also derive useful andsensitive information about the actual data items by observ-ing the data access patterns even if the data are encrypted[2], [3]. Therefore, the privacy/security requirements of theDMED problem on a cloud are threefold: (1) confidentialityof the encrypted data, (2) confidentiality of a user’s queryrecord, and (3) hiding data access patterns.

Existing work on privacy-preserving data mining(PPDM) (either perturbation or secure multi-party computa-tion (SMC) based approach) cannot solve the DMED prob-lem. Perturbed data do not possess semantic security, sodata perturbation techniques cannot be used to encrypthighly sensitive data. Also the perturbed data do not pro-duce very accurate data mining results. Secure multi-partycomputation based approach assumes data are distributedand not encrypted at each participating party. In addition,many intermediate computations are performed based onnon-encrypted data. As a result, in this paper, we proposednovel methods to effectively solve the DMED problemassuming that the encrypted data are outsourced to a cloud.Specifically, we focus on the classification problem since itis one of the most common data mining tasks. Because eachclassification technique has their own advantage, to be con-crete, this paper concentrates on executing the k-nearestneighbor classification method over encrypted data in thecloud computing environment.

� B.K. Samanthula is with the Department of Computer Science, PurdueUniversity, 305 N. University Street, West Lafayette, IN 47907.E-mail: [email protected].

� Y. Elmehdwi and W. Jiang are with the Department of Computer Science,Missouri University of Science and Technology, 310 CS Building,500 West 15th St., Rolla, MO 65409. E-mail: {ymez76, wjiang}@mst.edu.

Manuscript received 23 Oct. 2013; revised 10 Sept. 2014; accepted 29 Sept.2014. Date of publication 19 Oct. 2014; date of current version 27 Mar. 2015.Recommended for acceptance by G. Miklau.For information on obtaining reprints of this article, please send e-mail to:[email protected], and reference the Digital Object Identifier below.Digital Object Identifier no. 10.1109/TKDE.2014.2364027

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 5, MAY 2015 1261

1041-4347� 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

1.1 Problem Definition

Suppose Alice owns a database D of n records t1; . . . ; tn andmþ 1 attributes. Let ti;j denote the jth attribute value ofrecord ti. Initially, Alice encrypts her database attribute-wise, that is, she computes Epkðti;jÞ, for 1 � i � n and1 � j � mþ 1, where column ðmþ 1Þ contains the classlabels. We assume that the underlying encryption scheme issemantically secure [4]. Let the encrypted database bedenoted by D0. We assume that Alice outsources D0 as wellas the future classification process to the cloud.

Let Bob be an authorized user who wants to classify hisinput record q ¼ hq1; . . . ; qmi by applying the k-NN classifi-cation method based on D0. We refer to such a process asprivacy-preserving k-NN (PPkNN) classification overencrypted data in the cloud. Formally, we define thePPkNN protocol as:

PPkNNðD0; qÞ ! cq;

where cq denotes the class label for q after applying k-NNclassification method onD0 and q.

1.2 Our Contributions

In this paper, we propose a novel PPkNN protocol, a securek-NN classifier over semantically secure encrypted data. Inour protocol, once the encrypted data are outsourced to thecloud, Alice does not participate in any computations.Therefore, no information is revealed to Alice. In addition,our protocol meets the following privacy requirements:

� Contents of D or any intermediate results should notbe revealed to the cloud.

� Bob’s query q should not be revealed to the cloud.� cq should be revealed only to Bob. Also, no other

information should be revealed to Bob.� Data access patterns, such as the records correspond-

ing to the k-nearest neighbors of q, should not berevealed to Bob and the cloud (to prevent any infer-ence attacks).

We emphasize that the intermediate results seen by the cloudin our protocol are either newly generated randomizedencryptions or random numbers. Thus, which data recordscorrespond to the k-nearest neighbors and the output classlabel are not known to the cloud. In addition, after sendinghis encrypted query record to the cloud, Bob does notinvolve in any computations. Hence, data access patterns arefurther protected from Bob (see Section 5 for more details).

The rest of the paper is organized as follows. We discussthe existing related work and some concepts as a back-ground in Section 2. A set of privacy-preserving protocolsand their possible implementations are provided in Section3. The formal security proofs for the mentioned privacy-pre-serving primitives are provided in Section 4. The proposedPPkNN protocol is explained in detail in Section 5. Section 6discusses the performance of the proposed protocol underdifferent parameter settings. We conclude the paper alongwith future work in Section 7.

2 RELATED WORK AND BACKGROUND

Due to space limitations, here we briefly review the existingrelated work and provide some definitions as a background.

Please refer to our technical report [5] for a more elaboratedrelated work and background.

At first, it seems fully homomorphic cryptosystems (e.g.,[6]) can solve the DMED problem since it allows a third-party (that hosts the encrypted data) to execute arbitraryfunctions over encrypted data without ever decryptingthem. However, we stress that such techniques are veryexpensive and their usage in practical applications have yetto be explored. For example, it was shown in [7] that evenfor weak security parameters one “bootstrapping” opera-tion of the homomorphic operation would take at least30 seconds on a high performance machine.

It is possible to use the existing secret sharing techniquesin SMC, such as Shamir’s scheme [8], to develop a PPkNNprotocol. However, our work is different from the secretsharing based solution in the following aspect. Solutionsbased on the secret sharing schemes require at least threeparties whereas our work require only two parties. Forexample, the constructions based on Sharemind [9], a well-known SMC framework which is based on the secret shar-ing scheme, assumes that the number of participating par-ties is three. Thus, our work is orthogonal to Sharemind andother secret sharing based schemes.

2.1 Privacy-Preserving Data Mining

Agrawal and Srikant [10], Lindell and Pinkas [11] werethe first to introduce the notion of privacy-preservingunder data mining applications. The existing PPDM tech-niques can broadly be classified into two categories: (i)data perturbation and (ii) data distribution. Agrawal andSrikant [10] proposed the first data perturbation tech-nique to build a decision-tree classifier, and many othermethods were proposed later (e.g., [12], [13], [14]). How-ever, as mentioned earlier in Section 1, data perturbationtechniques cannot be applicable for semantically secureencrypted data. Also, they do not produce accurate datamining results due to the addition of statistical noises tothe data. On the other hand, Lindell and Pinkas [11] pro-posed the first decision tree classifier under the two-partysetting assuming the data were distributed between them.Since then much work has been published using SMCtechniques (e.g., [15], [16], [17]). We claim that the PPkNNproblem cannot be solved using the data distributiontechniques since the data in our case is encrypted and notdistributed in plaintext among multiple parties. For thesame reasons, we also do not consider secure k-NN meth-ods in which the data are distributed between two parties(e.g., [18]).

2.2 Query Processing over Encrypted Data

Various techniques related to query processing overencrypted data have been proposed, e.g., [19], [20], [21].However, we observe that PPkNN is a more complex prob-lem than the execution of simple kNN queries overencrypted data [22], [23]. For one, the intermediate k-nearestneighbors in the classification process, should not be dis-closed to the cloud or any users. We emphasize that therecent method in [23] reveals the k-nearest neighbors to theuser. Second, even if we know the k-nearest neighbors, it isstill very difficult to find the majority class label amongthese neighbors since they are encrypted at the first place to

1262 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 5, MAY 2015

prevent the cloud from learning sensitive information.Third, the existing work did not addressed the access pat-tern issue which is a crucial privacy requirement from theuser’s perspective.

In our most recent work [24], we proposed a novelsecure k-nearest neighbor query protocol over encrypteddata that protects data confidentiality, user’s query pri-vacy, and hides data access patterns. However, as men-tioned above, PPkNN is a more complex problem and itcannot be solved directly using the existing secure k-near-est neighbor techniques over encrypted data. Therefore,in this paper, we extend our previous work in [24] andprovide a new solution to the PPkNN classifier problemover encrypted data.

More specifically, this paper is different from our pre-liminary work [24] in the following four aspects. First, inthis paper, we introduced new security primitives,namely secure minimum (SMIN), secure minimum out ofn numbers (SMINn), secure frequency (SF), and proposednew solutions for them. Second, the work in [24] did notprovide any formal security analysis of the underlyingsub-protocols. On the other hand, this paper provides for-mal security proofs of the underlying sub-protocols aswell as the PPkNN protocol under the semi-honest model.Additionally, we discuss various techniques throughwhich the proposed PPkNN protocol can possibly beextended to a protocol that is secure under the malicioussetting. Third, our preliminary work in [24] addressesonly secure kNN query which is similar to Stage 1 ofPPkNN. However, Stage 2 in PPkNN is entirely new.Finally, our empirical analyses in Section 6 are based on areal dataset whereas the results in [24] are based on asimulated dataset. Furthermore, new experimental resultsare included in this paper.

2.3 Threat Model

We adopt the security definitions in the literature of securemulti-party computation [25], [26], and there are three com-mon adversarial models under SMC: semi-honest, covertand malicious. In this paper, to develop secure and efficientprotocols, we assume that parties are semi-honest. Briefly,the following definition captures the properties of a secureprotocol under the semi-honest model [27], [28].

Definition 1. Let ai be the input of party Pi, PiðpÞ be Pi’s execu-tion image of the protocol p and bi be the output for party Pi

computed from p. Then, p is secure if PiðpÞ can be simulatedfrom ai and bi such that distribution of the simulated image iscomputationally indistinguishable from PiðpÞ.In the above definition, an execution image generally

includes the input, the output and the messages communi-cated during an execution of a protocol. To prove a protocolis secure under semi-honest model, we generally need toshow that the execution image of a protocol does not leakany information regarding the private inputs of participat-ing parties [28].

2.4 Paillier Cryptosystem

The Paillier cryptosystem is an additive homomorphic andprobabilistic public-key encryption scheme whose security

is based on the Decisional Composite Residuosity Assump-tion [4]. Let Epk be the encryption function with public keypk given by (N; g), where N is a product of two large primesof similar bit length and g is a generator in Z�

N2 . Also, let Dsk

be the decryption function with secret key sk. For any giventwo plaintexts a; b 2 ZN , the Paillier encryption schemeexhibits the following properties:

(1) Homomorphic addition.

DskðEpkðaþ bÞÞ ¼ DskðEpkðaÞ � EpkðbÞmodN2Þ:

(2) Homomorphic multiplication.

DskðEpkða � bÞÞ ¼ DskðEpkðaÞbmodN2Þ:

(3) Semantic security. The encryption scheme is semanti-cally secure[28], [29]. Briefly, given a set of cipher-texts, an adversary cannot deduce any additionalinformation about the plaintext(s).

For succinctness, we drop the modN2 term during homomor-

phic operations in the rest of this paper.

3 PRIVACY-PRESERVING PRIMITIVES

Here we present a set of generic sub-protocols that willbe used in constructing our proposed k-NN protocol inSection 5. All of the below protocols are considered undertwo-party semi-honest setting. In particular, we assumethe existence of two semi-honest parties P1 and P2 suchthat the Paillier’s secret key sk is known only to P2

whereas pk is public.

� Secure multiplication (SM). This protocol considers P1

with input ðEpkðaÞ; EpkðbÞÞ and outputs Epkða � bÞ toP1, where a and b are not known to P1 and P2. Dur-ing this process, no information regarding a and b isrevealed to P1 and P2.

� Secure squared euclidean distance (SSED). In this proto-col, P1 with input ðEpkðXÞ; EpkðY ÞÞ and P2 with sksecurely compute the encryption of squared euclid-ean distance between vectors X and Y . Here X andY are m dimensional vectors where EpkðXÞ ¼hEpkðx1Þ; . . . ; EpkðxmÞi and EpkðY Þ ¼ hEpkðy1Þ; . . . ;EpkðymÞi. The output EpkðjX � Y j2Þ will be knownonly to P1.

� Secure bit-decomposition (SBD). Here P1 with inputEpkðzÞ and P2 securely compute the encryptions of

the individual bits of z, where 0 � z < 2l. The output½z� ¼ hEpkðz1Þ; . . . ; EpkðzlÞi is known only to P1. Herez1 and zl are the most and least significant bits ofinteger z, respectively.

� Secure minimum. In this protocol, P1 holds privateinput ðu0; v0Þ and P2 holds sk, where u0 ¼ ð½u�;EpkðsuÞÞ and v0 ¼ ð½v�; EpkðsvÞÞ. Here su (resp., sv)denotes the secret associated with u (resp., v). Thegoal of SMIN is for P1 and P2 to jointly compute theencryptions of the individual bits of minimum num-ber between u and v. In addition, they computeEpkðsminðu;vÞÞ. That is, the output is ð½minðu; vÞ�;Epkðsminðu;vÞÞÞ which will be known only to P1.

SAMANTHULA ET AL.: K-NEAREST NEIGHBOR CLASSIFICATION OVER SEMANTICALLY SECURE ENCRYPTED RELATIONAL DATA 1263

During this protocol, no information regarding thecontents of u; v; su; and sv is revealed to P1 and P2.

� Secure minimum out of n numbers. In this protocol, weconsider P1 with n encrypted vectors ð½d1�; . . . ; ½dn�Þalong with their respective encrypted secrets and P2

with sk. Here ½di� ¼ hEpkðdi;1Þ; . . . ; Epkðdi;lÞi where

di;1 and di;l are the most and least significant bitsof integer di respectively, for 1 � i � n. The secretof di is given by sdi . P1 and P2 jointly compute

½minðd1; . . . ; dnÞ�. In addition, they computeEpkðsminðd1;...;dnÞÞ. At the end of this protocol, the out-

put ð½minðd1; . . . ; dnÞ�; Epkðsminðd1;...;dnÞÞÞ is known

only to P1. During SMINn, no information regardingany of di’s and their secrets is revealed to P1 and P2.

� Secure Bit-OR (SBOR). P1 with input ðEpkðo1Þ;Epkðo2ÞÞ and P2 securely compute Epkðo1 _ o2Þ, whereo1 and o2 are 2 bits. The output Epkðo1 _ o2Þ is known

only to P1.� Secure frequency. Here P1 with private inputðhEpkðc1Þ; . . .EpkðcwÞi; hEpkðc01Þ; . . . ; Epkðc0kÞiÞ and P2

securely compute the encryption of the frequency ofcj, denoted by fðcjÞ, in the list hc01; . . . ; c0ki, for

1 � j � w. Here we explicitly assume that cj’s are

unique and c0i 2 fc1; . . . ; cwg, for 1 � i � k. The out-

put hEpkðfðc1ÞÞ; . . . ; EpkðfðcwÞÞi will be known only

to P1. During the SF protocol, no information regard-ing c0i, cj, and fðcjÞ is revealed to P1 and P2, for

1 � i � k and 1 � j � w.Now we either propose a new solution or refer to the

most efficient known implementation to each of theabove protocols. First of all, efficient solutions to SM,SSED, SBD and SBOR were discussed in [24]. Therefore,in this paper, we discuss SMIN, SMINn, and SF prob-lems in detail and propose new solutions to each one ofthem.

Secure minimum. In this protocol, we assume that P1

holds private input ðu0; v0Þ and P2 holds sk, whereu0 ¼ ð½u�; EpkðsuÞÞ and v0 ¼ ð½v�; EpkðsvÞÞ. Here su and svdenote the secrets corresponding to u and v, respectively.The main goal of SMIN is to securely compute theencryptions of the individual bits of minðu; vÞ, denotedby ½minðu; vÞ�. Here ½u� ¼ hEpkðu1Þ; . . . ; EpkðulÞi and ½v� ¼hEpkðv1Þ; . . . ; EpkðvlÞi, where u1 (resp., v1) and ul (resp., vl)

are the most and least significant bits of u (resp., v), respec-tively. In addition, they compute Epkðsminðu;vÞÞ, the encryp-

tion of the secret corresponding to the minimum valuebetween u and v. At the end of SMIN, the outputð½minðu; vÞ�; Epkðsminðu;vÞÞÞ is known only to P1.

We assume that 0 � u; v < 2l and propose a novelSMIN protocol. Our solution to SMIN is mainly moti-vated from the work of [24]. Precisely, the basic idea ofthe proposed SMIN protocol is for P1 to randomly choosethe functionality F (by flipping a coin), where F is eitheru > v or v > u, and to obliviously execute F with P2.Since F is randomly chosen and known only to P1, theresult of the functionality F is oblivious to P2. Based onthe comparison result and chosen F , P1 computes½minðu; vÞ� and Epkðsminðu;vÞÞ locally using homomorphic

properties.

Algorithm 1. SMINðu0; v0Þ ! ½minðu; vÞ�; Epkðsminðu;vÞÞRequire: P1 has u

0 ¼ ð½u�; EpkðsuÞÞ and v0 ¼ ð½v�; EpkðsvÞÞ, where

0 � u; v < 2l; P2 has sk1: P1:

(a). Randomly choose the functionality F(b). for i ¼ 1 to l do:� Epkðui � viÞ SMðEpkðuiÞ; EpkðviÞÞ� Ti Epkðui � viÞ� Hi H

rii�1 � Ti; ri 2R ZN andH0 ¼ Epkð0Þ

� Fi Epkð�1Þ �Hi

� if F : u > v then:� Wi EpkðuiÞ � Epkðui � viÞN�1� Gi Epkðvi � uiÞ � EpkðriÞ; ri 2R ZN

else� Wi EpkðviÞ � Epkðui � viÞN�1� Gi Epkðui � viÞ � EpkðriÞ; ri 2R ZN

� Li Wi �Fr0ii ; r

0i 2R ZN

(c). if F : u > v then: d Epkðsv � suÞ � EpkðrÞelse d Epkðsu � svÞ � EpkðrÞ, where r 2R ZN

(d). G0 p1ðGÞ and L0 p2ðLÞ(e). Send d;G0 and L0 to P2

2: P2:(a). Receive d;G0 and L0 from P1

(b). Decryption:Mi DskðL0iÞ, for 1 � i � l(c). if 9 j such that Mj ¼ 1 then a 1

else a 0(d). if a ¼ 0 then:� M 0

i Epkð0Þ, for 1 � i � l� d0 Epkð0Þelse� M 0

i G0i � rN , where r 2R ZN and is different for1 � i � l

� d0 d � rNd , where rd 2R ZN

(e). SendM 0; EpkðaÞ and d0 to P1

3: P1:(a). ReceiveM 0; EpkðaÞ and d0 from P2

(b). eM p�11 ðM 0Þ and u d0 � EpkðaÞN�r(c). �i eMi � EpkðaÞN�ri , for 1 � i � l(d). if F : u > v then:� Epkðsminðu;vÞÞ EpkðsuÞ � u� Epkðminðu; vÞiÞ EpkðuiÞ � �i, for 1 � i � lelse� Epkðsminðu;vÞÞ EpkðsvÞ � u� Epkðminðu; vÞiÞ EpkðviÞ � �i, for 1 � i � l

The overall steps involved in the SMIN protocol areshown in Algorithm 1. To start with, P1 initially chooses thefunctionality F as either u > v or v > u randomly. Then,using the SM protocol, P1 computes Epkðui � viÞ with thehelp of P2, for 1 � i � l. After this, the protocol has the fol-lowing key steps, performed by P1 locally, for 1 � i � l:

� Compute the encrypted bit-wise XOR between thebits ui and vi using the following formulation1

Ti ¼ EpkðuiÞ � EpkðviÞ � Epkðui � viÞN�2� Compute an encrypted vector H by preserving the

first occurrence of Epkð1Þ (if there exists one) in T byinitializing H0 ¼ Epkð0Þ. The rest of the entries of H

are computed as Hi ¼ Hrii�1 � Ti. We emphasize that

1. In general, for any two given bits o1 and o2, the propertyo1 � o2 ¼ o1 þ o2 � 2ðo1 � o2Þ always holds.

1264 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 5, MAY 2015

at most one of the entry in H is Epkð1Þ and theremaining entries are encryptions of either 0 or a ran-dom number.

� Then, P1 computes Fi ¼ Epkð�1Þ �Hi. Note that“�1” is equivalent to “N � 1” under ZN . From theabove discussions, it is clear that Fi ¼ Epkð0Þ at mostonce since Hi is equal to Epkð1Þ at most once. Also, ifFj ¼ Epkð0Þ, then index j is the position at which thebits of u and v differ first (starting from the most sig-nificant bit position).

Now, depending on F , P1 creates two encrypted vectors Wand G as follows, for 1 � i � l:

� If F : u > v, compute

Wi ¼ Epkðui � ð1� viÞÞ;Gi ¼ Epkðvi � uiÞ � EpkðriÞ ¼ Epkðvi � ui þ riÞ:

� If F : v > u, compute

Wi ¼ Epkðvi � ð1� uiÞÞ;Gi ¼ Epkðui � viÞ � EpkðriÞ ¼ Epkðui � vi þ riÞ;

where ri is a random number (hereafter denoted by 2R) inZN . The observation is that if F : u > v, then Wi ¼ Epkð1Þ iffui > vi, and Wi ¼ Epkð0Þ otherwise. Similarly, whenF : v > u, we have Wi ¼ Epkð1Þ iff vi > ui, and Wi ¼ Epkð0Þotherwise. Also, depending of F , Gi stores the encryption ofthe randomized difference between ui and vi which will beused in later computations.

After this, P1 computes L by combining F and W . More

precisely, P1 computes Li ¼Wi �Fr0ii , where r0i is a random

number in ZN . The observation here is if 9 an index j suchthat Fj ¼ Epkð0Þ, denoting the first flip in the bits of u and v,then Wj stores the corresponding desired information, i.e.,whether uj > vj or vj > uj in encrypted form. In addition,depending on F , P1 computes the encryption of random-ized difference between su and sv and stores it in d. Specifi-cally, if F : u > v, then d ¼ Epkðsv � su þ rÞ. Otherwise,d ¼ Epkðsu � sv þ rÞ, where r 2R ZN .

After this, P1 permutes the encrypted vectors G and Lusing two random permutation functions p1 and p2. Specifi-

cally, P1 computes G0 ¼ p1ðGÞ and L0 ¼ p2ðLÞ, and sendsthem along with d to P2. Upon receiving, P2 decrypts L0

component-wise to get Mi ¼ DskðL0iÞ, for 1 � i � l, andchecks for index j. That is, if Mj ¼ 1, then P2 sets a to 1, oth-erwise sets it to 0. In addition, P2 computes a new encryptedvector M 0 depending on the value of a. Precisely, if a ¼ 0,then M 0

i ¼ Epkð0Þ, for 1 � i � l. Here Epkð0Þ is different foreach i. On the other hand, when a ¼ 1, P2 sets M

0i to the re-

randomized value of G0i. That is, M 0i ¼ G0i � rN , where the

term rN comes from re-randomization and r 2R ZN shouldbe different for each i. Furthermore, P2 computesd0 ¼ Epkð0Þ if a ¼ 0. However, when a ¼ 1, P2 sets d0 tod � rNd , where rd is a random number in ZN . Then, P2 sends

M 0; EpkðaÞ and d0 to P1. After receiving M 0; EpkðaÞ and d0, P1

computes the inverse permutation of M 0 as eM ¼ p�11 ðM 0Þ.Then, P1 performs the following homomorphic operations

to compute the encryption of ith bit of minðu; vÞ, i.e.,Epkðminðu; vÞiÞ, for 1 � i � l:

� Remove the randomness from eMi by computing

�i ¼ eMi � EpkðaÞN�ri� If F : u>v, compute Epkðminðu; vÞiÞ ¼ EpkðuiÞ�

�i ¼ Epkðui þ a � ðvi � uiÞÞ. Otherwise, computeEpkðminðu; vÞiÞ¼EpkðviÞ � �i ¼ Epkðviþ a � ðui � viÞÞ.

Also, depending on F , P1 computes Epkðsminðu;vÞÞ as fol-lows. If F : u > v, P1 computes Epkðsminðu;vÞÞ ¼ EpkðsuÞ � u,where u¼ d0 � EpkðaÞN�r. Otherwise, he/she computesEpkðsminðu;vÞÞ ¼ EpkðsvÞ � u.

In the SMIN protocol, one main observation (upon whichwe can also justify the correctness of the final output) is thatif F : u > v, then minðu; vÞi ¼ ð1� aÞ � ui þ a � vi alwaysholds, for 1 � i � l. On the other hand, if F : v > u, thenminðu; vÞi ¼ a � ui þ ð1� aÞ � vi always holds. Similar con-clusions can be drawn for sminðu;vÞ. We emphasize that using

similar formulations one can also design a SMAX protocolto compute ½maxðu; vÞ� and Epkðsmaxðu;vÞÞ. Also, we stress

that there can be multiple secrets of u and v that can be fedas input (in encrypted form) to SMIN and SMAX. For exam-

ple, let s1u and s2u (resp., s1v and s2v) be two secrets associatedwith u (resp., v). Then the SMIN protocol takes

ð½u�; Epkðs1uÞ; Epkðs2uÞÞ and ð½v�; Epkðs1vÞ; Epkðs2vÞÞ as P1’s input

and outputs ½minðu; vÞ�; Epkðs1minðu;vÞÞ and Epkðs2minðu;vÞÞ to P1.

Example 2. For simplicity, consider that u ¼ 55, v ¼ 58, andl ¼ 6. Suppose su and sv be the secrets associated with uand v, respectively. Assume that P1 holds ð½55�; EpkðsuÞÞð½58�; EpkðsvÞÞ. In addition, we assume that P1’s randompermutation functions are as given below. Without lossof generality, suppose P1 chooses the functionalityF : v > u. Then, various intermediate results based onthe SMIN protocol are as shown in Table 1. Followingfrom Table 1, we observe that:

� At most one of the entry in H is Epkð1Þ, namelyH3, and the remaining entries are encryptions ofeither 0 or a random number in ZN .

� Index j ¼ 3 is the first position at which the corre-sponding bits of u and v differ.

TABLE 1P1 Chooses F As v > uWhere u ¼ 55 and v ¼ 58

½u� ½v� Wi Gi Gi Hi Fi Li Gi’ L0i Mi �i mini

1 1 0 r 0 0 �1 r 1þ r r r 0 11 1 0 r 0 0 �1 r r r r 0 10 1 1 �1þ r 1 1 0 1 1þ r r r �1 01 0 0 1þ r 1 r r r �1þ r r r 1 11 1 0 r 0 r r r r 1 1 0 11 0 0 1þ r 1 r r r r r r 1 1

All column values are in encrypted form exceptMi column. Also, r 2R ZN isdifferent for each row and column.

i = 1 2 3 4 5 6# # # # # #

p1ðiÞ = 6 5 4 3 2 1p2ðiÞ = 2 1 5 6 3 4

SAMANTHULA ET AL.: K-NEAREST NEIGHBOR CLASSIFICATION OVER SEMANTICALLY SECURE ENCRYPTED RELATIONAL DATA 1265

� F3 ¼ Epkð0Þ sinceH3 is equal to Epkð1Þ. Also, sinceM5 ¼ 1, P2 sets a to 1.

� Epkðsminðu;vÞÞ ¼ Epkða � su þ ð1� aÞ � svÞ¼ EpkðsuÞ.At the end, only P1 knows ½minðu; vÞ� ¼ ½u� ¼ ½55� andEpkðsminðu;vÞÞ ¼ EpkðsuÞ.

Secure minimum out of n numbers. Consider P1 with pri-vate input ð½d1�; . . . ; ½dn�Þ along with their encrypted secrets

and P2 with sk, where 0 � di < 2l and ½di� ¼ hEpkðdi;1Þ;. . . ; Epkðdi;lÞi, for 1 � i � n. Here the secret of di is denotedby EpkðsdiÞ, for 1 � i � n. The main goal of the SMINn proto-

col is to compute ½minðd1; . . . ; dnÞ� ¼ ½dmin� without revealingany information about di’s to P1 and P2. In addition, theycompute the encryption of the secret corresponding to theglobal minimum, denoted by Epkðsdmin

Þ. Here we construct

a new SMINn protocol by utilizing SMIN as the buildingblock. The proposed SMINn protocol is an iterativeapproach and it computes the desired output in an hierar-chical fashion. In each iteration, minimum between a pair ofvalues and the secret corresponding to the minimum valueare computed (in encrypted form) and fed as input to thenext iteration, thus, generating a binary execution tree in abottom-up fashion. At the end, only P1 knows the finalresult ½dmin� and Epkðsdmin

Þ.

Algorithm 2. SMINnðð½d1�; Epkðsd1ÞÞ; . . . ; ð½dn�; EpkðsdnÞÞÞ! ð½dmin�; Epkðsdmin

ÞÞRequire: P1 has ðð½d1�; Epkðsd1ÞÞ; . . . ; ð½dn�; EpkðsdnÞÞÞ; P2 has sk1: P1:

(a). ½d0i� ½di� and s0i EpkðsdiÞ, for 1 � i � n(b). num n

2: for i ¼ 1 to log2 nd e:(a). for 1 � j � num

2

� �:

� if i ¼ 1 then:� ð½d02j�1�; s02j�1Þ SMINðx; yÞ, where

x ¼ ð½d02j�1�; s02j�1Þ and y ¼ ð½d02j�; s02jÞ� ½d02j� 0 and s02j 0else� ð½d02iðj�1Þþ1�; s02iðj�1Þþ1Þ SMINðx; yÞ, where

x ¼ ð½d02iðj�1Þþ1�; s02iðj�1Þþ1Þ and y ¼ ð½d02ij�1�; s02ij�1Þ� ½d02ij�1� 0 and s02ij�1 0

(b). num num2

� �3: P1: ½dmin� ½d01� and Epkðsdmin

Þ s01

The overall steps involved in the proposed SMINn proto-col are highlighted in Algorithm 2. Initially, P1 assigns ½di�and EpkðsdiÞ to a temporary vector ½d0i� and variable s0i, for1 � i � n, respectively. Also, he/she creates a global vari-able num and initializes it to n, where num represents thenumber of (non-zero) vectors involved in each iteration.Since the SMINn protocol executes in a binary tree hierarchy(bottom-up fashion), we have log2 nd e iterations, and in eachiteration, the number of vectors involved varies. In the firstiteration (i.e., i ¼ 1), P1 with private inputðð½d02j�1�; s02j�1Þ; ð½d02j�; s02jÞÞ and P2 with sk involve in the

SMIN protocol, for 1 � j � num2

� �. At the end of the first iter-

ation, only P1 knows ½minðd02j�1; d02jÞ� and s0minðd0

2j�1;d02jÞ, and

nothing is revealed to P2, for 1 � j � num2

� �. Also, P1 stores

the result ½minðd02j�1; d02jÞ� and s0minðd0

2j�1;d02jÞ in ½d02j�1� and

s02j�1, respectively. In addition, P1 updates the values of

½d02j�, s02j to 0 and num to num2

� �, respectively.

During the ith iteration, only the non-zero vectors (alongwith the corresponding encrypted secrets) are involved inSMIN, for 2 � i � log2 nd e. For example, during the seconditeration (i.e., i ¼ 2), only ð½d01�; s01Þ; ð½d03�; s03Þ, and so on areinvolved. Note that in each iteration, the output is revealed

only to P1 and num is updated to num2

� �. At the end of

SMINn, P1 assigns the final encrypted binary vector ofglobal minimum value, i.e., ½minðd1; . . . ; dnÞ� which is storedin ½d01�, to ½dmin�. Also, P1 assigns s

01 to Epkðsdmin

Þ.Example 3. Suppose P1 holds h½d1�; . . . ; ½d6�i (i.e., n ¼ 6). For

simplicity, here we are assuming that there are no secretsassociated with di’s. Then, based on the SMINn protocol,the binary execution tree (in a bottom-up fashion) tocompute ½minðd1; . . . ; d6Þ� is shown in Fig. 1. Note that,initially ½d0i� ¼ ½di�.Secure frequency. Let us consider a situation where P1

holds private input ðhEpkðc1Þ; . . . ; EpkðcwÞi; hEpkðc01Þ;. . . ; Epkðc0kÞiÞ and P2 holds the secret key sk. The goal of theSF protocol is to securely compute EpkðfðcjÞÞ, for 1 � j � w.Here fðcjÞ denotes the number of times element cj occurs(i.e., frequency) in the list hc01; . . . ; c0ki. We explicitly assumethat c0i 2 fc1; . . . ; cwg, for 1 � i � k.

The output hEpkðfðc1ÞÞ; . . . ; EpkðfðcwÞÞi is revealed onlyto P1. During the SF protocol, neither c0i nor cj is revealed toP1 and P2. Also, fðcjÞ is kept private from both P1 and P2,for 1 � i � k and 1 � j � w.

The overall steps involved in the proposed SF protocolare shown in Algorithm 3. To start with, P1 initially com-putes an encrypted vector Si such that Si;j ¼ Epkðcj � c0iÞ, for1 � j � w. Then, P1 randomizes Si component-wise to getS0i;j ¼ Epkðri;j � ðcj � c0iÞÞ, where ri;j is a random number in

ZN . After this, for 1 � i � k, P1 randomly permutes S0i com-ponent-wise using a random permutation function pi

(known only to P1). The output Zi piðS0iÞ is sent to P2.Upon receiving, P2 decrypts Zi component-wise, computesa vector ui and proceeds as follows:

� If DskðZi;jÞ ¼ 0, then ui;j is set to 1. Otherwise, ui;j isset to 0.

� The observation is, since c0i 2 fc1; . . . ; cwg, thatexactly one of the entries in vector Zi is an encryp-tion of 0 and the rest are encryptions of randomnumbers. This further implies that exactly one of thedecrypted values of Zi is 0 and the rest are randomnumbers. Precisely, if ui;j ¼ 1, then c0i ¼ cp�1ðjÞ.

Fig. 1. Binary execution tree for n ¼ 6 based on SMINn.

1266 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 5, MAY 2015

� Compute Ui;j ¼ Epkðui;jÞ and send it to P1, for1 � i � k and 1 � j � w.

Then, P1 performs row-wise inverse permutation on it to get

Vi ¼ p�1i ðUiÞ, for 1 � i � k. Finally, P1 computes

EpkðcjÞ ¼Qk

i¼1 Vi;j locally, for 1 � j � w.

Algorithm 3. SFðL;L0Þ ! hEpkðfðc1ÞÞ; . . . ; EpkðfðcwÞÞiRequire: P1 has L ¼ hEpkðc1Þ; . . . ; EpkðcwÞi, L0 ¼ hEpkðc01Þ; . . . ;

Epkðc0kÞi and hp1; . . . ;pki; P2 has sk1: P1:

(a). for i ¼ 1 to k do:� Ti Epkðc0iÞN�1� for j ¼ 1 to w do:� Si;j EpkðcjÞ � Ti

� S0i;j Si;jri;j , where ri;j 2R ZN

� Zi piðS0iÞ(b). Send Z to P2

2: P2:(a). Receive Z from P1

(b). for i ¼ 1 to k do� for j ¼ 1 to w do:� ifDskðZi;jÞ ¼ 0 then ui;j 1else ui;j 0� Ui;j Epkðui;jÞ

(c). Send U to P1

3: P1:(a). Receive U from P2

(b). Vi p�1i ðUiÞ, for 1 � i � k(c). EpkðfðcjÞÞ

Qki¼1 Vi;j, for 1 � j � w

4 SECURITY ANALYSIS OF PRIVACY-PRESERVING

PRIMITIVES UNDER THE SEMI-HONEST MODEL

First of all, we emphasize that the outputs in the above men-tioned protocols are always in encrypted format, and areknown only to P1. Also, all the intermediate results revealedto P2 are either random or pseudo-random.

Since the proposed SMIN protocol (which is used as asub-routine in SMINn) is more complex than other protocolsmentioned above and due to space limitations, we are moti-vated to provide its security proof rather than providingproofs for each protocol. Therefore, here we only include aformal security proof for the SMIN protocol based on thestandard simulation argument [28]. Nevertheless, we stressthat similar proof strategies can be used to show that otherprotocols are secure under the semi-honest model. For com-pleteness, we provided the security proofs for the other pro-tocols in our technical report [5].

4.1 Proof of Security for SMIN

As mentioned in Section 2.3, to formally prove that SMIN issecure [28] under the semi-honest model, we need to showthat the simulated image of SMIN is computationally indis-tinguishable from the actual execution image of SMIN.

An execution image generally includes the messagesexchanged and the information computed from these mes-sages. Therefore, according to Algorithm 1, let the executionimage of P2 be denoted by PP2ðSMINÞ, given by

fhd; sþ r modNi; hG0i;mi þ ri mod Ni; hL0i;aig:

Observe that sþ r modN and mi þ ri mod N are derived

upon decrypting d and G0i, for 1 � i � l, respectively. Notethat the modulo operator is implicit in the decryption func-tion. Also, P2 receives L

0 from P1 and let a denote the (obliv-ious) comparison result computed from L0. Without loss ofgenerality, suppose the simulated image of P2 be

PSP2ðSMINÞ, given by

fhd�; r�i; hs01;i; s02;ii; hs03;i;a0i j for 1 � i � lg:

Here d�; s01;i and s03;i are randomly generated from ZN2

whereas r� and s02;i are randomly generated from ZN . In

addition, a0 denotes a random bit. Since Epk is a semanti-cally secure encryption scheme with resulting ciphertext

size less than N2, d is computationally indistinguishable

from d�. Similarly, G0i and L0i are computationally indistin-guishable from s01;i and s03;i, respectively. Also, as r and riare randomly generated from ZN , sþ r mod N andmi þ ri modN are computationally indistinguishable fromr� and s02;i, respectively. Furthermore, because the function-

ality is randomly chosen by P1 (at step 1(a) of Algorithm 1),a is either 0 or 1 with equal probability. Thus, a is computa-tionally indistinguishable from a0. Combining all theseresults together, we can conclude that PP2ðSMINÞ is compu-

tationally indistinguishable from PSP2ðSMINÞ based on Defi-

nition 1. This implies that during the execution of SMIN, P2

does not learn any information regarding u; v; su; sv and theactual comparison result. Intuitively speaking, the informa-tion P2 has during an execution of SMIN is either randomor pseudo-random, so this information does not discloseanything regarding u; v; su and sv. Additionally, as F isknown only to P1, the actual comparison result is obliviousto P2.

On the other hand, the execution image of P1, denoted byPP1ðSMINÞ, is given by

PP1ðSMINÞ ¼ fM 0i; EpkðaÞ; d0 j for 1 � i � lg:

M 0i and d0 are encrypted values, which are random in ZN2 ,

received from P2 (at step 3(a) of Algorithm 1). Let the simu-

lated image of P1 be PSP1ðSMINÞ, where

PSP1ðSMINÞ ¼ fs04;i; b0; b00 j for 1 � i � lg:

The values s04;i; b0 and b00 are randomly generated from ZN2 .

Since Epk is a semantically secure encryption scheme with

resulting ciphertext size less than N2, it implies thatM 0

i ; EpkðaÞ and d0 are computationally indistinguishablefrom s04;i; b

0 and b00, respectively. Therefore, PP1ðSMINÞ iscomputationally indistinguishable from PS

P1ðSMINÞ based

on Definition 1. As a result, P1 cannot learn any informationregarding u; v; su; sv and the comparison result during theexecution of SMIN. Putting everything together, we claimthat the proposed SMIN protocol is secure under the semi-honest model (according to Definition 1).

5 THE PROPOSED PPKNN PROTOCOL

In this section, we propose a novel privacy-preserving k-NNclassification protocol, denoted by PPkNN, which is

SAMANTHULA ET AL.: K-NEAREST NEIGHBOR CLASSIFICATION OVER SEMANTICALLY SECURE ENCRYPTED RELATIONAL DATA 1267

constructed using the protocols discussed in Section 3 asbuilding blocks. As mentioned earlier, we assume thatAlice’s database consists of n records, denoted byD ¼ ht1; . . . ; tni, and mþ 1 attributes, where ti;j denotes thejth attribute value of record ti. Initially, Alice encrypts herdatabase attribute-wise, that is, she computes Epkðti;jÞ, for1 � i � n and 1 � j � mþ 1, where column ðmþ 1Þ con-tains the class labels. Let the encrypted database be denotedby D0. We assume that Alice outsources D0 as well as thefuture classification process to the cloud. Without loss ofgenerality, we assume that all attribute values and their

euclidean distances lie in ½0; 2lÞ. In addition, let w denote thenumber of unique class labels inD.

In our problem setting, we assume the existence of twonon-colluding semi-honest cloud service providers, denotedby C1 and C2, which together form a federated cloud. Underthis setting, Alice outsources her encrypted database D0 toC1 and the secret key sk to C2. Here it is possible for thedata owner Alice to replace C2 with her private server.However, if Alice has a private server, we can argue thatthere is no need for data outsourcing from Alice’s point ofview. The main purpose of using C2 can be motivated bythe following two reasons. (i) With limited computingresource and technical expertise, it is in the best interest ofAlice to completely outsource its data management andoperational tasks to a cloud. For example, Alice may wantto access her data and analytical results using a smart phoneor any device with very limited computing capability.(ii) Suppose Bob wants to keep his input query and accesspatterns private from Alice. In this case, if Alice uses a pri-vate server, then she has to perform computations assumedby C2 under which the very purpose of outsourcing theencrypted data to C1 is negated.

In general, whether Alice uses a private server or cloudservice provider C2 actually depends on her resources. Inparticular to our problem setting, we prefer to use C2 as thisavoids the above mentioned disadvantages (i.e., in case ofAlice using a private server) altogether. In our solution,after outsourcing encrypted data to the cloud, Alice doesnot participate in any future computations.

The goal of the PPkNN protocol is to classify users’query records using D0 in a privacy-preserving manner.Consider an authorized user Bob who wants to classifyhis query record q ¼ hq1; . . . ; qmi based on D0 in C1. Theproposed PPkNN protocol mainly consists of the follow-ing two stages:

� Stage 1—Secure Retrieval of k-Nearest Neighbors(SRkNN). In this stage, Bob initially sends his queryq (in encrypted form) to C1. After this, C1 and C2

involve in a set of sub-protocols to securely retrieve(in encrypted form) the class labels corresponding tothe k-nearest neighbors of the input query q. At theend of this step, encrypted class labels of k-nearestneighbors are known only to C1.

� Stage 2—Secure Computation of Majority Class(SCMCk). Following from Stage 1, C1 and C2 jointlycompute the class label with a majority votingamong the k-nearest neighbors of q. At the end ofthis step, only Bob knows the class label correspond-ing to his input query record q.

The main steps involved in the proposed PPkNN proto-col are as shown in Algorithm 4. We now explain each ofthe two stages in PPkNN in detail.

Algorithm 4. PPkNNðD0; qÞ ! cq

Require: C1 hasD0 and p; C2 has sk; Bob has q

1: Bob:(a). Compute EpkðqjÞ, for 1 � j � m(b). Send EpkðqÞ ¼ hEpkðq1Þ; . . . ; EpkðqmÞi to C1

2: C1 and C2:(a). C1 receives EpkðqÞ from Bob(b). for i ¼ 1 to n do:� EpkðdiÞ SSEDðEpkðqÞ; EpkðtiÞÞ� ½di� SBDðEpkðdiÞÞ

3: for s ¼ 1 to k do:(a). C1 and C2:� ð½dmin�; EpkðIÞ; Epkðc0ÞÞ SMINnðu1; . . . ; unÞ, where

ui ¼ ð½di�; EpkðItiÞ; Epkðti;mþ1ÞÞ� Epkðc0sÞ Epkðc0Þ

(b). C1:� D EpkðIÞN�1� for i ¼ 1 to n do:� ti EpkðiÞ � D� t0i t

rii , where ri 2R ZN

� b pðt0Þ; send b to C2

(c). C2:� b0i DskðbiÞ, for 1 � i � n� Compute U 0, for 1 � i � n:� if b0i ¼ 0, then U 0i ¼ Epkð1Þ� otherwise, U 0i ¼ Epkð0Þ

Send U 0 to C1

(d). C1: V p�1ðU 0Þ(e). C1 and C2, for 1 � i � n and 1 � g � l:� Epkðdi;gÞ SBORðVi; Epkðdi;gÞÞ

4: SCMCkðEpkðc01Þ; . . . ; Epkðc0kÞÞ

5.1 Stage 1: Secure Retrieval of k-NearestNeighbors

During Stage 1, Bob initially encrypts his query q attribute-wise, that is, he computes EpkðqÞ ¼ hEpkðq1Þ; . . . ; EpkðqmÞiand sends it to C1. The main steps involved in Stage 1 areshown as steps 1 to 3 in Algorithm 4. Upon receiving EpkðqÞ,C1 with private input ðEpkðqÞ; EpkðtiÞÞ and C2 with the secretkey sk jointly involve in the SSED protocol. HereEpkðtiÞ ¼ hEpkðti;1Þ; . . . ; Epkðti;mÞi, for 1 � i � n. The outputof this step, denoted by EpkðdiÞ, is the encryption of squared

euclidean distance between q and ti, i.e., di ¼ jq � tij2. Asmentioned earlier, EpkðdiÞ is known only to C1, for1 � i � n. We emphasize that the computation of exacteuclidean distance between encrypted vectors is hard toachieve as it involves square root. However, in our problem,it is sufficient to compare the squared euclidean distances asit preserves relative ordering. Then, C1 with input EpkðdiÞand C2 securely compute the encryptions of the individualbits of di using the SBD protocol. Note that the output½di� ¼ hEpkðdi;1Þ; . . . ; Epkðdi;lÞi is known only to C1, where di;1and di;l are the most and least significant bits of di, for1 � i � n, respectively.

After this, C1 and C2 compute the encryptions of classlabels corresponding to the k-nearest neighbors of q in an

1268 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 5, MAY 2015

iterative manner. More specifically, they compute Epkðc01Þ inthe first iteration, Epkðc02Þ in the second iteration, and so on.Here c0s denotes the class label of sth nearest neighbor to q,for 1 � s � k. At the end of k iterations, only C1 knowshEpkðc01Þ; . . . ; Epkðc0kÞi. To start with, consider the first itera-tion. C1 and C2 jointly compute the encryptions of the indi-vidual bits of the minimum value among d1; . . . ; dn andencryptions of the location and class label corresponding todmin using the SMINn protocol. That is, C1 with inputðu1; . . . ; unÞ and C2 with sk compute ð½dmin�; EpkðIÞ; Epkðc0ÞÞ,where ui ¼ ð½di�; EpkðItiÞ; Epkðti;mþ1ÞÞ, for 1 � i � n. Here

dmin denotes the minimum value among d1; . . . ; dn; Iti and

ti;mþ1 denote the unique identifier and class label corre-sponding to the data record ti, respectively. Specifically,ðIti ; ti;mþ1Þ is the secret information associated with ti. For

simplicity, this paper assumes Iti ¼ i. In the output, I and c0

denote the index and class label corresponding to dmin. Theoutput ð½dmin�; EpkðIÞ; Epkðc0ÞÞ is known only to C1. Now, C1

performs the following operations locally:

� Assign Epkðc0Þ to Epkðc01Þ. Remember that, accordingto the SMINn protocol, c0 is equivalent to the classlabel of the data record that corresponds to dmin.Thus, it is same as the class label of the most nearestneighbor to q.

� Compute the encryption of difference between I andi, where 1 � i � n. That is, C1 computes ti ¼ EpkðiÞ�EpkðIÞN�1 ¼ Epkði� IÞ, for 1 � i � n.

� Randomize ti to get t0i ¼ trii ¼ Epkðri � ði� IÞÞ,

where ri is a random number in ZN . Note that t0i isan encryption of either 0 or a random number, for1 � i � n. Also, it is worth noting that exactly one ofthe entries in t0 is an encryption of 0 (which happensiff i ¼ I) and the rest are encryptions of randomnumbers. Permute t0 using a random permutationfunction p (known only to C1) to get b ¼ pðt0Þ andsend it to C2.

Upon receiving b, C2 decrypts it component-wise to getb0i ¼ DskðbiÞ, for 1 � i � n. After this, he/she computes anencrypted vector U 0 of length n such that U 0i ¼ Epkð1Þ if

b0i ¼ 0, and Epkð0Þ otherwise. Since exactly one of entries int0 is an encryption of 0, this further implies that exactly oneof the entries in U 0 is an encryption of 1 and the rest of themare encryptions of 0’s. It is important to note that if b0k ¼ 0,

then p�1ðkÞ is the index of the data record that correspondsto dmin. Then, C2 sends U 0 to C1. After receiving U 0, C1 per-

forms inverse permutation on it to get V ¼ p�1ðU 0Þ. Notethat exactly one of the entries in V is Epkð1Þ and the remain-ing are encryptions of 0’s. In addition, if Vi ¼ Epkð1Þ, then tiis the most nearest tuple to q. However, C1 and C2 do notknow which entry in V corresponds to Epkð1Þ.

Finally, C1 updates the distance vectors ½di� due to the fol-lowing reason:

� It is important to note that the first nearest tuple to qshould be obliviously excluded from further compu-tations. However, since C1 does not know the recordcorresponding to Epkðc01Þ, we need to obliviouslyeliminate the possibility of choosing this recordagain in next iterations. For this, C1 obliviously

updates the distance corresponding to Epkðc01Þ to the

maximum value, i.e., 2l � 1. More specifically, C1

updates the distance vectors with the help of C2

using the SBOR protocol as below, for 1 � i � n and1 � g � l

Epkðdi;gÞ ¼ SBOR�Vi; Epkðdi;gÞ

�:

Note that when Vi ¼ Epkð1Þ, the corresponding dis-tance vector di is set to the maximum value. That is,under this case, ½di� ¼ hEpkð1Þ; . . . ; Epkð1Þi. On theother hand, when Vi ¼ Epkð0Þ, the OR operation hasno effect on the corresponding encrypted distancevector.

The above process is repeated until k iterations, and ineach iteration ½di� corresponding to the current chosen labelis set to the maximum value. However, C1 and C2 doesnot know which ½di� is updated. In iteration s, Epkðc0sÞ isknown only to C1. At the end of Stage 1, C1 hashEpkðc01Þ; . . . ; Epkðc0kÞi, the list of encrypted class labels ofk-nearest neighbors to the query q.

5.2 Stage 2: Secure Computation of Majority Class

Without loss of generality, let us assume that Alice’s datasetD consists of w unique class labels denoted by c ¼hc1; . . . ; cwi. We assume that Alice outsources her list ofencrypted classes to C1. That is, Alice outsourceshEpkðc1Þ; . . . ; EpkðcwÞi to C1 along with her encrypted data-base D0 during the data outsourcing step. Note that, forsecurity reasons, Alice may add dummy categories into thelist to protect the number of class labels, i.e., w from C1 andC2. However, for simplicity, we assume that Alice does notadd any dummy categories to c.

During Stage 2, C1 with private inputs L ¼ hEpkðc1Þ; . . . ;EpkðcwÞi and L0 ¼ hEpkðc01Þ; . . . ; Epkðc0kÞi, and C2 with sk

securely compute EpkðcqÞ. Here cq denotes the majority classlabel among c01; . . . ; c

0k. At the end of stage 2, only Bob knows

the class label cq.

Algorithm 5. SCMCkðEpkðc01Þ; . . . ; Epkðc0kÞÞ ! cq

Require: hEpkðc1Þ; . . . ; EpkðcwÞi, hEpkðc01Þ; . . . ; Epkðc0kÞi are knownonly to C1; sk is known only to C2

1: C1 and C2:(a). hEpkðfðc1ÞÞ; . . . ; EpkðfðcwÞÞi SFðL;L0Þ, where

L ¼ hEpkðc1Þ; . . . ; EpkðcwÞi, L0 ¼ hEpkðc01Þ; . . . ; Epkðc0kÞi(b). for i ¼ 1 to w do:� ½fðciÞ� SBDðEpkðfðciÞÞÞ

(c). ð½fmax�; EpkðcqÞÞ SMAXwðc1; . . . ;cwÞ, whereci ¼ ð½fðciÞ�; EpkðciÞÞ, for 1 � i � w

2: C1:(a). gq EpkðcqÞ � EpkðrqÞ, where rq 2R ZN

(b). Send gq to C2 and rq to Bob3: C2:

(a). Receive gq from C1

(b). g 0q DskðgqÞ; send g0q to Bob4: Bob:

(a). Receive rq from C1 and g 0q from C2

(b). cq g 0q � rq modN

The overall steps involved in Stage 2 are shown inAlgorithm 5. To start with, C1 and C2 jointly compute the

SAMANTHULA ET AL.: K-NEAREST NEIGHBOR CLASSIFICATION OVER SEMANTICALLY SECURE ENCRYPTED RELATIONAL DATA 1269

encrypted frequencies of each class label using the k-nearest

set as input. That is, they compute EpkðfðciÞÞ using ðL;L0Þas C1’s input to the secure frequency (SF) protocol, for1 � i � w. The output hEpkðfðc1ÞÞ; . . . ; EpkðfðcwÞÞi is knownonly to C1. Then, C1 with EpkðfðciÞÞ and C2 with sk involvein the secure bit-decomposition protocol to compute ½fðciÞ�,that is, vector of encryptions of the individual bits of fðciÞ,for 1 � i � w. After this, C1 and C2 jointly involve in theSMAXw protocol. Briefly, SMAXw utilizes the sub-routineSMAX to eventually compute ð½fmax�; EpkðcqÞÞ in an iterativefashion. Here ½fmax� ¼ ½maxðfðc1Þ; . . . ; fðcwÞÞ� and cq denotes

the majority class out of L0. At the end, the outputð½fmax�; EpkðcqÞÞ is known only to C1. After this, C1 computesgq ¼ Epkðcq þ rqÞ, where rq is a random number in ZN

known only to C1. Then, C1 sends gq to C2 and rq to Bob.

Upon receiving gq, C2 decrypts it to get the randomized

majority class label g 0q ¼ DskðgqÞ and sends it to Bob. Finally,

upon receiving rq from C1 and g 0q from C2, Bob computes the

output class label corresponding to q as cq ¼ g 0q � rq mod N .

5.3 Security Analysis of PPkNN under theSemi-Honest Model

First of all, we stress that due to the encryption of q and bysemantic security of the Paillier cryptosystem, Bob’s inputquery q is protected from Alice, C1 and C2 in our PPkNNprotocol. Apart from guaranteeing query privacy, the goalof PPkNN is to protect data confidentiality and hide dataaccess patterns.

In this paper, to prove a protocol’s security under thesemi-honest model, we adopted the well-known securitydefinitions from the literature of SMC. More specifically, asmentioned in Section 2.3, we adopt the security proofsbased on the standard simulation paradigm [28]. For pre-sentation purpose, we provide formal security proofs(under the semi-honest model) for Stages 1 and 2 of PPkNNseparately. Note that the outputs returned by each sub-pro-tocol are in encrypted form and known only to C1.

5.3.1 Proof of Security for Stage 1

As mentioned earlier, the computations involved in Stage 1of PPkNN are given as steps 1 to 3 in Algorithm 4. For sim-plicity, we consider the messages exchanged between C1

and C2 in a single iteration (similar analysis can be deducedfor other iterations).

According to Algorithm 4, the execution image of C2 isgiven by PC2

ðPPkNNÞ ¼ fhbi;b0ii j for 1 � i � ng where bi is

an encrypted value which is random in ZN2 . Also, b0i isderived upon decrypting bi by C2. Remember that, exactlyone of the entries in b0 is 0 and the rest are random numbersin ZN . Without loss of generality, let the simulated image of

C2 be given PSC2ðPPkNNÞ ¼ fha01;i; a02;ii j for 1 � i � ng. Here

a01;i is randomly generated from ZN2 and the vector a02 is ran-domly generated in such a way that exactly one of theentries is 0 and the rest are random numbers in ZN . SinceEpk is a semantically secure encryption scheme with result-ing ciphertext size less than ZN2 , we claim that bi is compu-tationally indistinguishable from a01;i. In addition, since the

random permutation function p is known only to C1, b0 is a

random vector of exactly one 0 and random numbers in ZN .

Thus, b0 is computationally indistinguishable from a02. Bycombining the above results, we can conclude thatPC2ðPPkNNÞ is computationally indistinguishable from

PSC2ðPPkNNÞ. This implies that C2 does not learn anything

during the execution of Stage 1.On the other hand, the execution image of C1 is given by

PC1ðPPkNNÞ ¼ fU 0gwhere U 0 is an encrypted value sent by

C2 (at step 3(c) of Algorithm 4). Let the simulated image of

C1 in Stage 1 be PSC1ðPPkNNÞ ¼ fa0g. Here the value of a0 is

randomly generated from ZN2 . Since Epk is a semanticallysecure encryption scheme with resulting ciphertexts in ZN2 ,we claim that U 0 is computationally indistinguishable froma0. This implies that PC1

ðPPkNNÞ is computationally indis-

tinguishable from PSC1ðPPkNNÞ. Hence, C1 cannot learn

anything during the execution of Stage 1 in PPkNN. Com-bining all these results together, it is clear that Stage 1 issecure under the semi-honest model.

In each iteration, it is worth pointing out that C1 andC2 do not know which data record belongs to currentglobal minimum. Thus, data access patterns are protectedfrom both C1 and C2. Informally speaking, at step 3(c) ofAlgorithm 4, a component-wise decryption of b revealsthe tuple that satisfy the current global minimum distanceto C2. However, due to the random permutation by C1, C2

cannot trace back to the corresponding data record. Also,note that decryption operations on vector b by C2 willresult in exactly one 0 and the rest of the results are ran-dom numbers in ZN . Similarly, since U 0 is an encryptedvector, C1 cannot know which tuple corresponds to cur-rent global minimum distance.

5.3.2 Security Proof for Stage 2

In a similar fashion, we can formally prove that Stage 2 ofPPkNN is secure under the semi-honest model. Briefly,since the sub-protocols SF, SBD, and SMAXw are secure, noinformation is revealed to C2. Also, the operations per-formed by C1 are entirely on encrypted data and thus noinformation is revealed to C1.

Furthermore, the output data of Stage 1 which are passedas input to Stage 2 are in encrypted format. Therefore, thesequential composition of the two stages lead to our PPkNNprotocol and we claim it to be secure under the semi-honestmodel according to the Composition Theorem [28]. In par-ticular, based on the above discussions, it is clear that theproposed PPkNN protocol protects the confidentiality ofthe data, the user’s input query, and also hides data accesspatterns from Alice, C1; and C2. Note that Alice does notparticipate in any computations of PPkNN.

5.4 Security under the Malicious Model

The next step is to extend our PPkNN protocol into a secureprotocol under the malicious model. Under the maliciousmodel, an adversary (i.e., either C1 or C2) can arbitrarilydeviate from the protocol to gain some advantage (e.g.,learning additional information about inputs) over the otherparty. The deviations include, as an example, for C1 (actingas a malicious adversary) to instantiate the PPkNN protocolwith modified inputs (say Epkðq0Þ and Epkðt0iÞÞ and to abortthe protocol after gaining partial information. However, inPPkNN, it is worth pointing out that neither C1 nor C2

1270 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 5, MAY 2015

knows the results of Stages 1 and 2. In addition, all the inter-mediate results are either random or pseudo-random val-ues. Thus, even when an adversary modifies theintermediate computations he/she cannot gain any addi-tional information. Nevertheless, as mentioned above, theadversary can change the intermediate data or performcomputations incorrectly before sending them to the honestparty which may eventually result in the wrong output.Therefore, we need to ensure that all the computations per-formed and messages sent by each party are correct.

Remember that the main goal of SMC is to ensure thehonest parties to get the correct result and to protect theirprivate input data from the malicious parties. Therefore,under the two-party SMC scenario, if both parties are mali-cious, there is no point to develop or adopt an SMC protocolat the first place. In the literature of SMC [30], it is the normthat at most one party can be malicious under the two-partyscenario. When only one of the party is malicious, the stan-dard way of preventing the malicious party from misbehav-ing is to let the honest party validate the other party’s workusing zero-knowledge proofs [31]. However, checking thevalidity of operations at each step of PPkNN can signifi-cantly increase the cost.

An alternative approach, as proposed in [32], is to instan-tiate two independent executions of the PPkNN protocol byswapping the roles of the two parties in each execution. Atthe end of the individual executions, each party receives theoutput in encrypted form. This is followed by an equalitytest on their outputs. More specifically, suppose Epk1ðcq;1Þand Epk2ðcq;2Þ be the outputs received by C1 and C2 respec-

tively, where pk1 and pk2 are their respective public keys.Note that the outputs in our case are in encrypted formatand the corresponding ciphertexts (resulted from the twoexecutions) are under two different public key domains.Therefore, we stress that the equality test based on the addi-tive homomorphic encryption properties which was used in[32] is not applicable to our problem. Nevertheless, C1 andC2 can perform the equality test based on the traditionalgarbled-circuit technique [33].

5.5 Complexity Analysis

The total computation complexity of Stage 1 is bounded byOðn � ðlþmþ k � l � log2 nÞÞ encryptions and exponentia-tions. On the other hand, the total computation complexityof Stage 2 is bounded by Oðw � ðlþ kþ l � log2 wÞÞ encryp-tions and exponentiations. Due to space limitations, werefer the reader to [5] for detailed complexity analysis ofPPkNN. In general, as w� n, the computation cost of Stage

1 should be significantly higher than that of Stage 2. Thisobservation is further justified by our empirical resultsgiven in the next section.

6 EMPIRICAL RESULTS

In this section, we discuss some experiments demonstratingthe performance of our PPkNN protocol under differentparameter settings. We used the Paillier cryptosystem [4] asthe underlying additive homomorphic encryption schemeand implemented the proposed PPkNN protocol in C. Vari-ous experiments were conducted on a Linux machine withan Intel Xeon Six-Core CPU 3.07 GHz processor and 12 GBRAM running Ubuntu 12.04 LTS. To the best of our knowl-edge, our work is the first effort to develop a secure k-NNclassifier under the semi-honest model. There is no existingwork to compare with our approach. Hence, we evaluatethe performance of our PPkNN protocol under differentparameter settings.

6.1 Dataset and Experimental Setup

For our experiments, we used the Car Evaluation datasetfrom the UCI KDD archive [34]. It consists of 1,728 records(i.e., n ¼ 1; 728) and six attributes (i.e., m ¼ 6). Also, there isa separate class attribute and the dataset is categorized intofour different classes (i.e., w ¼ 4). We encrypted this datasetattribute-wise, using the Paillier encryption whose key sizeis varied in our experiments, and the encrypted data werestored on our machine. Based on our PPkNN protocol, wethen executed a random query over this encrypted data. Forthe rest of this section, we do not discuss about the perfor-mance of Alice since it is a one-time cost. Instead, we evalu-ate and analyze the performances of the two stages inPPkNN separately.

6.2 Performance of PPkNN

We first evaluated the computation costs of Stage 1 inPPkNN for varying number of k-nearest neighbors. Also,the Paillier encryption key size K is either 512 or 1,024 bits.The results are shown in Fig. 2a. For K ¼ 512 bits, the com-putation cost of Stage 1 varies from 9.98 to 46.16 minuteswhen k is changed from 5 to 25, respectively. On the otherhand, when K ¼ 1;024 bits, the computation cost of Stage 1varies from 66.97 to 309.98 minutes when k is changed from5 to 25, respectively. In either case, we observed that thecost of Stage 1 grows almost linearly with k. For any givenk, we identified that the cost of Stage 1 increases by almost afactor of 7 whenever K is doubled. E.g., when k ¼ 10, Stage

Fig. 2. Computation costs of PPkNN for varying number of k nearest neighbors and encryption key sizeK.

SAMANTHULA ET AL.: K-NEAREST NEIGHBOR CLASSIFICATION OVER SEMANTICALLY SECURE ENCRYPTED RELATIONAL DATA 1271

1 took 19.06 and 127.72 minutes to generate the encryptedclass labels of the 10 nearest neighbors under K ¼ 512 and1024 bits, respectively. Moreover, when k ¼ 5, we observethat around 66.29 percent of cost in Stage 1 is accounted dueto SMINn which is initiated k times in PPkNN (once in eachiteration). Also, the cost incurred due to SMINn increasesfrom 66.29 to 71.66 percent when k is increased from 5 to 25.

We now evaluate the computation costs of Stage 2 forvarying k and K. As shown in Fig. 2b, for K ¼ 512 bits, thecomputation time for Stage 2 to generate the final class labelcorresponding to the input query varies from 0.118 to 0.285seconds when k is changed from 5 to 25. On the other hand,for K ¼ 1; 024 bits, Stage 2 took 0.789 and 1.89 secondswhen k ¼ 5 and 25, respectively. The low computation costsof Stage 2 were due to SMAXw which incurs significantlyless computations than SMINn in Stage 1. This further justi-fies our theoretical analysis in Section 5.5. Note that, in ourdataset, w ¼ 4 and n ¼ 1;728. Like in Stage 1, for any givenk, the computation time of Stage 2 increases by almost a fac-tor of 7 whenever K is doubled. E.g., when k ¼ 10, the com-putation time of Stage 2 varies from 0.175 to 1.158 secondswhen the encryption key size K is changed from 512 to1,024 bits. As shown in Fig. 2b, a similar analysis can beobserved for other values of k andK.

It is clear that the computation cost of Stage 1 is signifi-cantly higher than that of Stage 2 in PPkNN. Specifically,we observed that the computation time of Stage 1 accountsfor at least 99 percent of the total time in PPkNN. For exam-ple, when k ¼ 10 and K ¼ 512 bits, the computation costs ofStage 1 and 2 are 19.06 minutes and 0.175 seconds, respec-tively. Under this scenario, cost of Stage 1 is 99.98 percent ofthe total cost of PPkNN. We also observed that the totalcomputation time of PPkNN grows almost linearly with nand k.

6.3 Performance Improvement of PPkNN

We now discuss two different ways to boost the efficiency ofStage 1 (as the performance of PPkNN depends primarilyon Stage 1) and empirically analyze their efficiency gains.First, we observe that some of the computations in Stage 1can be pre-computed. For example, encryptions of randomnumbers, 0 and 10s can be pre-computed (by the corre-sponding parties) in the offline phase. As a result, the onlinecomputation cost of Stage 1 (denoted by SRkNNo) isexpected to be improved. To see the actual efficiency gainsof such a strategy, we computed the costs of SRkNNo andcompared them with the costs of Stage 1 without an offlinephase (simply denoted by SRkNN) and the results forK ¼ 1;024 bits are shown in Fig. 2c. Irrespective of the val-ues of k, we observed that SRkNNo is around 33 percentfaster than SRkNN. E.g., when k ¼ 10, the computationcosts of SRkNNo and SRkNN are 84.47 and 127.72 minutes,respectively (boosting the online running time of Stage 1 by33.86 percent).

Our second approach to improve the performance ofStage 1 is by using parallelism. Since operations on datarecords are independent of one another, we claim that mostcomputations in Stage 1 can be parallelized. To empiricallyevaluate this claim, we implemented a parallel version ofStage 1 (denoted by SRkNNp) using OpenMP programmingand compared its cost with the costs of SRkNN (i.e., the

serial version of Stage 1). The results for K ¼ 1;024 bits areshown in Fig. 2c. The computation cost of SRkNNp variesfrom 12.02 to 55.5 minutes when k is changed from 5 to 25.We observe that SRkNNp is almost six times more efficientthan SRkNN. This is because our machine has six cores andthus computations can be run in parallel on six separatethreads. Based on the above discussions, it is clear that effi-ciency of Stage 1 can indeed be improved significantly usingparallelism.

On the other hand, Bob’s computation cost in PPkNNis mainly due to the encryption of his input query. In ourdataset, Bob’s computation cost is 4 and 17 millisecondswhen K is 512 and 1,024 bits, respectively. It is apparentthat PPkNN is very efficient from Bob’s computationalperspective which is especially beneficial when he issuesqueries from a resource-constrained device (such asmobile phone and PDA).

6.3.1 A Note on Practicality

Our PPkNN protocol is not very efficient without utilizingparallelization. However, ours is the first work to propose aPPkNN solution that is secure under the semi-honestmodel. Due to rising demands for data mining as a servicein cloud, we believe that our work will be very helpful tothe cloud community to stimulate further research alongthat direction. Hopefully, more practical solutions toPPkNN will be developed (either by optimizing our proto-col or investigating alternative approaches) in the nearfuture.

7 CONCLUSIONS AND FUTURE WORK

To protect user privacy, various privacy-preserving classifi-cation techniques have been proposed over the past decade.The existing techniques are not applicable to outsourceddatabase environments where the data resides in encryptedform on a third-party server. This paper proposed a novelprivacy-preserving k-NN classification protocol overencrypted data in the cloud. Our protocol protects the confi-dentiality of the data, user’s input query, and hides the dataaccess patterns. We also evaluated the performance of ourprotocol under different parameter settings.

Since improving the efficiency of SMINn is an importantfirst step for improving the performance of our PPkNN pro-tocol, we plan to investigate alternative and more efficientsolutions to the SMINn problem in our future work. Also,we will investigate and extend our research to other classifi-cation algorithms.

ACKNOWLEDGMENTS

The authors wish to thank the anonymous reviewers fortheir invaluable feedback and suggestions. This work hasbeen partially supported by the US National Science Foun-dation under grant CNS-1011984.

REFERENCES

[1] P. Mell and T. Grance, “The NIST definition of cloud computing(draft),”NIST Special Publication, vol. 800, p. 145, 2011.

[2] S. De Capitani di Vimercati, S. Foresti, and P. Samarati,“Managing and accessing data in the cloud: Privacy risks andapproaches,” in Proc. 7th Int. Conf. Risk Security Internet Syst., 2012,pp. 1–9.

1272 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 5, MAY 2015

[3] P. Williams, R. Sion, and B. Carbunar, “Building castles out ofmud: Practical access pattern privacy and correctness onuntrusted storage,” in Proc. 15th ACM Conf. Comput. Commun.Security, 2008, pp. 139–148.

[4] P. Paillier, “Public key cryptosystems based on composite degreeresiduosity classes,” in Proc. 17th Int. Conf. Theory Appl. Crypto-graphic Techn., 1999, pp. 223–238.

[5] B. K. Samanthula, Y. Elmehdwi, andW. Jiang, “k-nearest neighborclassification over semantically secure encrypted relational data,”eprint arXiv:1403.5001, 2014.

[6] C. Gentry, “Fully homomorphic encryption using ideal lattices,”in Proc. 41st Annu. ACM Sympos. Theory Comput., 2009, pp. 169–178.

[7] C. Gentry and S. Halevi, “Implementing gentry’s fully-homomor-phic encryption scheme,” in Proc. 30th Annu. Int. Conf. TheoryAppl. Cryptographic Techn.: Adv. Cryptol., 2011, pp. 129–148.

[8] A. Shamir, “How to share a secret,” Commun. ACM, vol. 22,pp. 612–613, 1979.

[9] D. Bogdanov, S. Laur, and J. Willemson, “Sharemind: A frame-work for fast privacy-preserving computations,” in Proc. 13th Eur.Symp. Res. Comput. Security: Comput. Security, 2008, pp. 192–206.

[10] R. Agrawal and R. Srikant, “Privacy-preserving data mining,”ACM Sigmod Rec., vol. 29, pp. 439–450, 2000.

[11] Y. Lindell and B. Pinkas, “Privacy preserving data mining,” inProc. 20th Annu. Int. Cryptol. Conf. Adv. Cryptol., 2000, pp. 36–54.

[12] P. Zhang, Y. Tong, S. Tang, and D. Yang, “Privacy preservingNaive Bayes classification,” in Proc. 1st Int. Conf. Adv. Data MiningAppl., 2005, pp. 744–752.

[13] A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke, “Privacypreserving mining of association rules,” Inf. Syst., vol. 29, no. 4,pp. 343–364, 2004.

[14] R. J. Bayardo and R. Agrawal, “Data privacy through optimal k-anonymization,” in Proc. IEEE 21st Int. Conf. Data Eng., 2005,pp. 217–228.

[15] H. Hu, J. Xu, C. Ren, and B. Choi, “Processing private queries overuntrusted data cloud through privacy homomorphism,” in Proc.IEEE 27th Int. Conf. Data Eng., 2011, pp. 601–612.

[16] M. Kantarcioglu and C. Clifton, “Privately computing a distrib-uted k-nn classifier,” in Proc. 8th Eur. Conf. Principles PracticeKnowl. Discovery Databases, 2004, pp. 279–290.

[17] L. Xiong, S. Chitti, and L. Liu, “K nearest neighbor classificationacross multiple private databases,” in Proc. 15th ACM Int. Conf.Inform. Knowl. Manage., 2006, pp. 840–841.

[18] Y. Qi and M. J. Atallah, “Efficient privacy-preserving k-nearestneighbor search,” in Proc. IEEE 28th Int. Conf. Distrib. Comput.Syst., 2008, pp. 311–319.

[19] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu, “Order preservingencryption for numeric data,” in Proc. ACM SIGMOD Int. Conf.Manage. Data, 2004, pp. 563–574.

[20] H. Hacig€um€us, B. Iyer, C. Li, and S. Mehrotra, “Executing sql overencrypted data in the database-service-provider model,” in Proc.ACM SIGMOD Int. Conf. Manage. Data, 2002, pp. 216–227.

[21] B. Hore, S. Mehrotra, M. Canim, and M. Kantarcioglu, “Securemultidimensional range queries over outsourced data,” VLDB J.,vol. 21, no. 3, pp. 333–358, 2012.

[22] W. K. Wong, D. W.-l. Cheung, B. Kao, and N. Mamoulis, “SecurekNN computation on encrypted databases,” in Proc. ACM SIG-MOD Int. Conf. Manage. Data, 2009, pp. 139–152.

[23] X. Xiao, F. Li, and B. Yao, “Secure nearest neighbor revisited,” inProc. IEEE Int. Conf. Data Eng., 2013, pp. 733–744.

[24] Y. Elmehdwi, B. K. Samanthula, and W. Jiang, “Secure k-nearestneighbor query over encrypted data in outsourced environ-ments,” in Proc. IEEE 30th Int. Conf. Data Eng., 2014, pp. 664–675.

[25] A. C. Yao, “Protocols for secure computations,” in Proc. 23rdAnnu. Symp. Found. Comput. Sci., 1982, pp. 160–164.

[26] O. Goldreich, S. Micali, and A. Wigderson, “How to play anymental game—A completeness theorem for protocols with honestmajority,” in Proc. 19th Annu. ACM Symp. Theory Comput., 1987,pp. 218–229.

[27] O. Goldreich, “General cryptographic protocols,” in The Founda-tions of Cryptography, vol. 2, Cambridge, U.K.: Cambridge Univ.Press, 2004.

[28] O. Goldreich, “Encryption schemes,” in The Foundations of Cryp-tography, vol. 2. Cambridge, U.K.: Cambridge Univ. Press, 2004,pp. 373–470 [Online]. Available: http://www.wisdom.weizmann.ac.il/~oded/PSBookFrag/enc.ps

[29] S. Goldwasser, S. Micali, and C. Rackoff, “The knowledge com-plexity of interactive proof systems,” SIAM J. Comput., vol. 18,pp. 186–208, Feb. 1989.

[30] D. Chaum, C. Cr�epeau, and I. Damgard, “Multiparty uncondition-ally secure protocols,” in Proc. 20th Annu. ACM Symp. Theory Com-put., 1988, pp. 11–19.

[31] J. Camenisch and M. Michels, “Proving in zero-knowledge that anumber is the product of two safe primes,” in Proc. 17th Int. Conf.Theory Appl. Cryptographic Techn., 1999, pp. 107–122.

[32] Y. Huang, J. Katz, and D. Evans, “Quid-Pro-Quo-tocols: Strength-ening semi-honest protocols with dual execution,” in Proc. IEEESymp. Security Privacy, 2012, pp. 272–284.

[33] Y. Huang, D. Evans, J. Katz, and L. Malka, “Faster secure two-party computation using garbled circuits,” in Proc. 20th USENIXConf. Security, 2011, pp. 35–35.

[34] M. Bohanec and B. Zupan. (1997). The UCI KDD Archive[Online]. Available: http://archive.ics.uci.edu/ml/datasets/Car+Evaluation

Bharath K. Samanthula received the bachelor’sdegree in computer science and engineeringfrom the International Institute of InformationTechnology, Hyderabad, India, in 2008, and thePhD degree in computer science from the Mis-souri University of Science and Technology,Rolla, MO, in 2013. He is a postdoctoral researchassociate at the Purdue Cyber Center andCERIAS. His research interests include appliedcryptography, personal privacy and data securityin the fields of social networks, cloud computing,

and smart grids. He is a member of the IEEE.

Yousef Elmehdwi received the bachelor’sdegree in computer science from Benghazi Uni-versity, Libya, in 1993 and the MSc degree inInformation Technology from Mannheim Univer-sity of Applied Science, Germany, in 2005. Hejoined the Computer Science Department at theMissouri University of Science and Technology,Rolla, Missouri, as a graduate student in January2010. He is currently working toward the PhDdegree under the supervision of Dr. Wei Jiang.His research interests lie at the crossroads of pri-

vacy, security, and data mining with current focus on privacy-preservingquery processing over encrypted data outsourced to cloud.

Wei Jiang received the bachelor’s degrees inboth computer science and mathematics fromthe University of Iowa, Iowa City, Iowa, in 2002.He received the master’s degree in computer sci-ence and the PhD degree from Purdue Univer-sity, West Lafayette, IN, in 2004 and 2008,respectively. He is an assistant professor in theDepartment of Computer Science at MissouriUniversity of Science and Technology. Hisresearch interests include privacy-preservingdata mining, data integration, privacy issues in

federated search environments, and text sanitization. His research hasbeen funded by the US National Science Foundation, the Office of NavalResearch, and the University of Missouri Research Board. He is a mem-ber of the IEEE.

" For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.

SAMANTHULA ET AL.: K-NEAREST NEIGHBOR CLASSIFICATION OVER SEMANTICALLY SECURE ENCRYPTED RELATIONAL DATA 1273