IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING VOL:26 NO:2 YEAR 2014

Building Confidential and Efficient Query Services in the Cloud with RASP Data Perturbation

Huiqi Xu, Shumin Guo, Keke Chen

Data Intensive Analysis and Computing Lab
Ohio Center of Excellence in Knowledge Enabled Computing
Department of Computer Science and Engineering
Wright State University, Dayton, OH 45435

Abstract: With the wide deployment of public cloud computing infrastructures, using clouds to host data query services has become an appealing solution for its advantages in scalability and cost saving. However, some data might be so sensitive that the data owner does not want to move them to the cloud unless data confidentiality and query privacy are guaranteed. On the other hand, a secured query service should still provide efficient query processing and significantly reduce the in-house workload to fully realize the benefits of cloud computing. We propose the RASP data perturbation method to provide secure and efficient range query and kNN query services for protected data in the cloud. The RASP data perturbation method combines order preserving encryption, dimensionality expansion, random noise injection, and random projection to provide strong resilience to attacks on the perturbed data and queries. It also preserves multidimensional ranges, which allows existing indexing techniques to be applied to speed up range query processing. The kNN-R algorithm is designed to work with the RASP range query algorithm to process kNN queries. We have carefully analyzed the attacks on data and queries under a precisely defined threat model and realistic security assumptions. Extensive experiments have been conducted to show the advantages of this approach in efficiency and security.

Index Terms: query services in the cloud, privacy, range query, kNN query

1 INTRODUCTION

Hosting data-intensive query services in the cloud is increasingly popular because of the unique advantages in scalability and cost saving. With cloud infrastructures, service owners can conveniently scale the service up or down and pay only for the hours of server use. This is an attractive feature because the workloads of query services are highly dynamic, and it is expensive and inefficient to serve such dynamic workloads with in-house infrastructures [2]. However, because the service providers lose control over the data in the cloud, data confidentiality and query privacy have become the major concerns. Adversaries, such as curious service providers, can make a copy of the database or eavesdrop on users' queries, which is difficult to detect and prevent in cloud infrastructures.

While new approaches are needed to preserve data confidentiality and query privacy, the efficiency of query services and the benefits of using the clouds should also be preserved. It is not meaningful to provide slow query services as a result of security and privacy assurance. It is also not practical for the data owner to use a significant amount of in-house resources, because the purpose of using cloud resources is to reduce the need to maintain scalable in-house infrastructures. Therefore, there is an intricate relationship among data confidentiality, query privacy, the quality of service, and the economics of using the cloud.

We summarize these requirements for constructing a practical query service in the cloud as the CPEL criteria: data Confidentiality, query Privacy, Efficient query processing, and Low in-house processing cost. Satisfying these requirements dramatically increases the complexity of constructing query services in the cloud. Some related approaches have been developed to address some aspects of the problem. However, they do not satisfactorily address all of these aspects. For example, the crypto-index [12] and Order Preserving Encryption (OPE) [1] are vulnerable to attacks. The enhanced crypto-index approach [14] puts a heavy burden on the in-house infrastructure to improve security and privacy. The New Casper approach [24] uses cloaking boxes to protect data objects and queries, which affects the efficiency of query processing and the in-house workload. We summarize the weaknesses of the existing approaches in Section 7.

We propose the RAndom Space Perturbation (RASP) approach to constructing practical range query and k-nearest-neighbor (kNN) query services in the cloud. The proposed approach addresses all four aspects of the CPEL criteria and aims to achieve a good balance among them. The basic idea is to randomly transform the multidimensional datasets with a combination of order preserving encryption, dimensionality expansion, random noise injection, and random projection, so that the utility for processing range queries is preserved. The RASP perturbation is designed in such a way that the queried ranges are securely transformed into polyhedra in the RASP-perturbed data space, which can be efficiently processed with the support of indexing structures in the perturbed space. The RASP kNN query service (kNN-R) uses the RASP range query service to process kNN queries. The key components in the RASP framework include (1) the definition and properties of RASP perturbation; (2) the construction of the privacy-preserving range query services; (3) the construction of privacy-preserving kNN query services; and (4) an analysis of the attacks on the RASP-protected data and queries.

In summary, the proposed approach has a number of unique contributions.

• The RASP perturbation is a unique combination of OPE, dimensionality expansion, random noise injection, and random projection, which provides a strong confidentiality guarantee.

• The RASP approach preserves the topology of multidimensional ranges in secure transformation, which allows indexing and efficient query processing.

• The proposed service constructions are able to minimize the in-house processing workload because of the low perturbation cost and high-precision query results. This is an important feature enabling practical cloud-based solutions.

We have carefully evaluated our approach with synthetic and real datasets. The results show its unique advantages on all aspects of the CPEL criteria.

The paper is organized as follows. In Section 3, we define the RASP perturbation method, describe its major properties, and analyze the attacks on the RASP-perturbed data. We also introduce the framework for constructing the query services with the RASP perturbation. In Section 4, we describe the algorithm for transforming queries and processing range queries. In Section 5, the range query service is extended to handle kNN queries. When describing these two services, we also analyze the attacks on query privacy. Finally, we present some related approaches in Section 7 and analyze their weaknesses in terms of the CPEL criteria.

2 QUERY SERVICES IN THE CLOUD

This section presents the notations, the system architecture, and the threat model for the RASP approach, and prepares for the security analysis [3] in later sections. The design of the system architecture keeps the cloud economics in mind so that most data storage and computing tasks will be done in the cloud. The threat model makes realistic security assumptions and clearly defines the practical threats that the RASP approach will address.

2.1 Definitions and Notations

First, we establish the notations. For simplicity, we consider only single database tables, which can be the result of denormalization from multiple relations. A database table consists of $n$ records and $d$ searchable attributes. We also frequently refer to an attribute as a dimension or a column; these terms are interchangeable in the paper. Each record can be represented as a vector in the multidimensional space, denoted by lowercase letters. If a record $x$ is $d$-dimensional, we say $x \in \mathbb{R}^d$, where $\mathbb{R}^d$ means the $d$-dimensional vector space. A table is also treated as a $d \times n$ matrix, with records represented as column vectors. We use capital letters to represent a table, and indexed capital letters, e.g., $X_i$, to represent columns. Each column is defined on a numerical domain. Categorical data columns are allowed in range queries; they are converted to numerical domains as we will describe in Section 3.

Range query is an important type of query for many data analytic tasks, from simple aggregation to more sophisticated machine learning tasks. Let $T$ be a table, $X_i$, $X_j$, and $X_k$ be real-valued attributes in $T$, and $a$ and $b$ be some constants. Take the counting query for example. A typical range query looks like

select count(*) from T
where $X_i \in [a_i, b_i]$ and $X_j \in (a_j, b_j)$ and $X_k = a_k$,

which calculates the number of records in the range defined by the conditions on $X_i$, $X_j$, and $X_k$. Range queries may be applied to an arbitrary number of attributes, with the conditions on these attributes combined by the conditional operators “and”/“or”. We call each part of the query condition that involves only one attribute a simple condition. A simple condition like $X_i \in [a_i, b_i]$ can be described with two half-space conditions $X_i \le b_i$ and $-X_i \le -a_i$. Without loss of generality, we will discuss how to process half-space conditions like $X_i \le b_i$ in this paper. A slight modification will extend the discussed algorithms to handle other conditions like $X_i < b_i$ and $X_i = b_i$.
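The decomposition of a simple condition into half-space conditions can be sketched as follows. This is a minimal illustration on unperturbed data, not the paper's query processing algorithm; the helper names `interval_to_half_spaces`, `satisfies`, and `count_in_range` and the sample records are assumptions made for this example.

```python
from typing import List, Tuple

# A half-space condition w . x <= c, stored as (w, c).
HalfSpace = Tuple[List[float], float]

def interval_to_half_spaces(i: int, a: float, b: float, d: int) -> List[HalfSpace]:
    """Rewrite X_i in [a, b] as X_i <= b and -X_i <= -a over d attributes."""
    upper = [0.0] * d
    upper[i] = 1.0
    lower = [0.0] * d
    lower[i] = -1.0
    return [(upper, b), (lower, -a)]

def satisfies(x: List[float], hs: HalfSpace) -> bool:
    """Check a single half-space condition w . x <= c for record x."""
    w, c = hs
    return sum(wj * xj for wj, xj in zip(w, x)) <= c

def count_in_range(records: List[List[float]], conditions: List[HalfSpace]) -> int:
    """Counting query: number of records that lie inside every half space."""
    return sum(all(satisfies(x, hs) for hs in conditions) for x in records)

# Hypothetical table with d = 2 attributes and n = 4 records.
records = [[1.0, 5.0], [2.0, 7.0], [3.0, 9.0], [4.0, 2.0]]
# X_0 in [2, 4] and X_1 in [0, 8], expressed as four half-space conditions.
conds = interval_to_half_spaces(0, 2.0, 4.0, d=2) + interval_to_half_spaces(1, 0.0, 8.0, d=2)
print(count_in_range(records, conds))
```

Writing every condition in the uniform form w · x ≤ c is what later allows a condition to be carried through a linear transformation of the data space.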

A kNN query finds the $k$ records closest to the query point, where the Euclidean distance is often used to measure proximity. It is frequently used in location-based services for searching the objects close to a query point, and also in machine learning algorithms such as hierarchical clustering and the kNN classifier. A kNN query consists of the query point and the number of nearest neighbors, $k$.
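For reference, the plain (unprotected) kNN query defined above can be sketched as a brute-force scan. This is only a baseline on original data, not the kNN-R algorithm; the function name `knn` and the sample points are assumptions for illustration.

```python
import heapq
import math
from typing import List, Sequence

def knn(records: List[Sequence[float]], query: Sequence[float], k: int) -> List[Sequence[float]]:
    """Return the k records closest to `query` under Euclidean distance, nearest first."""
    return heapq.nsmallest(k, records, key=lambda x: math.dist(x, query))

# Hypothetical two-dimensional records, e.g. locations.
points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (2.0, 2.5)]
nearest = knn(points, query=(1.5, 1.5), k=2)
print(nearest)
```

A linear scan like this costs one distance computation per record for every query, which is the cost that index-supported processing aims to avoid.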

2.2 System Architecture

We assume that a cloud computing infrastructure, such as Amazon EC2, is used to host the query services and large datasets. The purpose of this architecture is to extend the proprietary database servers to the public cloud, or to use a hybrid private-public cloud, to achieve scalability and reduce costs while maintaining confidentiality.

[Figure omitted. Labels: Data owner; Data D, D′; Authorized Users; Query q, q′; Result R′; Result R; Trusted; Honest but curious.]

Fig. 1. The system architecture for RASP-based query services.

Each record $x$ in the outsourced database contains two parts: the RASP-processed attributes $D' = F(D, K)$ and the encrypted original records, $Z = E(D, K')$, where $K$ and $K'$ are keys for perturbation and encryption, respectively. The RASP-perturbed data $D'$ are used for indexing and query processing. Figure 1 shows the system architecture for both the RASP-based range query service and the kNN service.

There are two clearly separated groups: the trusted parties and the untrusted parties. The trusted parties include the data/service owner, the in-house proxy server, and the authorized users who can only submit queries. The data owner exports the perturbed data to the cloud. Meanwhile, the authorized users can submit range queries or kNN queries to learn statistics or find some records. The untrusted parties include the curious cloud provider who hosts the query services and the protected database. The RASP-perturbed data will be used to build indices to support query processing.

There are a number of basic procedures in this framework: (1) $F(D)$ is the RASP perturbation that transforms the original data $D$ to the perturbed data $D'$; (2) $Q(q)$ transforms the original query $q$ to the protected form $q'$ that can be processed on the perturbed data; (3) $H(q', D')$ is the query processing algorithm that returns the result $R'$. When statistics such as SUM or AVG of a specific dimension are needed, RASP can work with partially homomorphic encryption such as Paillier encryption [25] to compute these statistics on the encrypted data, which are then recovered with the procedure $G(R')$.
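To illustrate how an additively homomorphic scheme lets the cloud compute a SUM without seeing the values, here is a toy textbook Paillier sketch. It is an assumption-laden illustration, not the paper's implementation: the function names are hypothetical, the primes 61 and 53 are far too small to be secure, and the simplification g = n + 1 is used. In this sketch the final `decrypt` call plays a role comparable to the recovery step performed on the trusted side.

```python
import math
import random

def keygen(p: int, q: int):
    """Toy key generation from two small primes (insecure, for illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1
    # With g = n + 1, L(g^lam mod n^2) = lam mod n, so mu is the inverse of lam mod n.
    mu = pow(lam, -1, n)
    return (n, g), (lam, mu)

def encrypt(pub, m: int, rng: random.Random) -> int:
    """c = g^m * r^n mod n^2 with a fresh random r coprime to n."""
    n, g = pub
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def add_encrypted(pub, c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts modulo n."""
    n, _ = pub
    return (c1 * c2) % (n * n)

def decrypt(pub, priv, c: int) -> int:
    """m = L(c^lam mod n^2) * mu mod n, where L(u) = (u - 1) / n."""
    n, _ = pub
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

pub, priv = keygen(61, 53)
rng = random.Random(1)
values = [120, 45, 300]            # hypothetical attribute values to be summed
total_ct = encrypt(pub, 0, rng)    # encrypted running total
for v in values:
    total_ct = add_encrypted(pub, total_ct, encrypt(pub, v, rng))  # done by the untrusted side
recovered = decrypt(pub, priv, total_ct)                           # done by the trusted side
print(recovered)
```

The sum must stay below n for the result to be exact, which is one reason real deployments use moduli of thousands of bits.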

2.3 Threat Model

Assumptions. Our security analysis is built on the important features of the architecture. Under this setting, we believe the following assumptions are appropriate.

• Only the authorized users can query the proprietary database. Authorized users are not malicious and will not intentionally breach the confidentiality. We consider insider attacks orthogonal to our research; thus, we can exclude the situation in which the authorized users collude with the untrusted cloud providers to leak additional information.

• The client-side system and the communication channels are properly secured, and no protected data records or queries can be leaked.

• Adversaries can see the perturbed database, the transformed queries, the whole query processing procedure, and the access patterns, and they understand that the same query returns the same set of results, but nothing else.

• Adversaries can possibly have global information about the database, such as the applications of the database, the attribute domains, and possibly the attribute distributions, via other published sources (e.g., the distribution of sales, or patient diseases, in public reports).

These assumptions can be maintained and reinforced by applying appropriate security policies. Note that this model is equivalent to the eavesdropping model equipped with plaintext distributional knowledge in the cryptographic setting.

Protected Assets. Data confidentiality and query privacy should be protected in the RASP approach. While the integrity of query services is also an important issue, it is orthogonal to our study. Existing integrity checking and preventing techniques [34], [30], [19] can be integrated into our framework. Thus, the integrity problem is excluded from the paper, and we assume that the curious cloud provider is interested in the data and queries but will honestly follow the protocol to provide the infrastructure service.

Attacker Modeling. The goal of the attack is to recover (or estimate) the original data from the perturbed data, or to identify the exact queries (i.e., location queries) to breach users' privacy. According to the level of prior knowledge the attacker may have, we categorize the attacks into two categories.

• Level 1: The attacker knows only the perturbed data and transformed queries, without any other prior knowledge. This corresponds to the ciphertext-only attack in the cryptographic setting.

• Level 2: The attacker also knows the original data distributions, including individual attribute distributions and the joint distribution (e.g., the covariance matrix) between attributes. In practice, for some applications whose statistics are of interest to the public, the dimensional distributions might have been published via other sources.

These levels of knowledge are appropriate according to the assumptions we hold. We will analyze the security based on this threat model.

Security Definition. Unlike with traditional encryption schemes, attackers can also be satisfied with a good estimation. Therefore, we will investigate two levels of security definitions: (1) it is computationally intractable for the attacker to recover the exact original data based on the perturbed data; (2) the attacker cannot effectively estimate the original data. The effectiveness measure is defined with the NR MSE measure in Section 3.3.

3 RASP: RANDOM SPACE PERTURBATION

In this section, we present the basic definition of the RAndom Space Perturbation (RASP) method and its properties. We will also discuss the attacks on RASP-perturbed data, based on the threat model given in Section 2.

3.1 Definition of RASP

RASP is one type of multiplicative perturbation, with a novel combination of OPE, dimension expansion, random noise injection, and random projection. Let us consider multidimensional data that are numeric and in a multidimensional vector space (footnote 1). The database has $d$ searchable dimensions and $n$ records, which makes a $d \times n$ matrix $X$. The searchable dimensions can be used in queries and thus should be indexed. Let $x$ represent a $d$-dimensional record, $x \in \mathbb{R}^d$. Note that in the $d$-dimensional vector space $\mathbb{R}^d$, the range query conditions are represented as half-space functions, and a range query is translated to finding the point set in the corresponding polyhedron area described by the half spaces [4].

The RASP perturbation involves three steps. Its security is based on the existence of a random invertible real-value matrix generator and a random real value generator. For each $d$-dimensional input vector $x$,

1) An order preserving encryption (OPE) scheme [1], $E_{ope}$ with keys $K_{ope}$, is applied to each dimension of $x$: $E_{ope}(x, K_{ope}) \in \mathbb{R}^d$, to change the dimensional distributions to normal distributions with each dimension's value order still preserved.

2) The vector is then extended to $d+2$ dimensions as $G(x) = ((E_{ope}(x))^T, 1, v)^T$, where the $(d+1)$-th dimension is always a 1 and the $(d+2)$-th dimension, $v$, is drawn from a random real number generator RNG that generates random values from a tailored normal distribution. We will discuss the design of RNG and OPE later.

3) The $(d+2)$-dimensional vector is finally transformed to

$$F(x, K = \{A, K_{ope}, \mathrm{RNG}\}) = A((E_{ope}(x))^T, 1, v)^T, \tag{1}$$

where $A$ is a $(d+2) \times (d+2)$ randomly generated invertible matrix with $a_{ij} \in \mathbb{R}$ such that there are at least two non-zero values in each row of $A$ and the last column of $A$ is also non-zero (footnote 2).

1. For categorical attributes, we use the following simple mapping because it will not break the query semantics. For a categorical attribute $X_i$, the values $\{c_1, \ldots, c_m\}$ in the domain are mapped to $\{1, \ldots, m\}$. A query condition on categorical values, say $X_i = c_j$, is then converted to $j - \delta \le X_i \le j + \delta$, where $\delta \in (0, 1)$.

$K_{ope}$ and $A$ are shared by all vectors in the database, but $v$ is randomly generated for each individual vector. Since the RASP-perturbed data records are only used for indexing and helping query processing, there is no need to recover the perturbed data. As we mentioned, in the case that original records are needed, the encrypted records associated with the RASP-perturbed records will be returned. We give the detailed algorithm in the Appendix.
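The three steps can be sketched in plain Python as below. This is a simplified illustration under explicit assumptions, not the detailed algorithm from the Appendix: `ope_to_normal` is a rank-based stand-in that maps each column onto standard normal quantiles while keeping value order, and is not the OPE scheme of [1]; all function names and the sample table are hypothetical.

```python
import random
from statistics import NormalDist
from typing import List

BETA = 4.0  # assumed truncation bound for the normal domain [-BETA, BETA]

def ope_to_normal(column: List[float]) -> List[float]:
    """Stand-in for E_ope: order-preserving map of a column onto N(0, 1) quantiles."""
    distinct = sorted(set(column))
    rank = {v: r for r, v in enumerate(distinct)}
    nd = NormalDist()
    n = len(distinct)
    return [max(-BETA, min(BETA, nd.inv_cdf((rank[v] + 0.5) / n))) for v in column]

def truncated_normal(rng: random.Random) -> float:
    """Draw the noise value v from N(0, 1) restricted to [-BETA, BETA] by rejection."""
    while True:
        v = rng.gauss(0.0, 1.0)
        if -BETA <= v <= BETA:
            return v

def determinant(m: List[List[float]]) -> float:
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n, det = len(a), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            det = -det
        det *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return det

def random_invertible(size: int, rng: random.Random) -> List[List[float]]:
    """Draw Gaussian entries until the matrix is invertible.
    Gaussian entries are non-zero with probability 1, so the non-zero
    conditions on A are not checked separately in this sketch."""
    while True:
        a = [[rng.gauss(0.0, 1.0) for _ in range(size)] for _ in range(size)]
        if abs(determinant(a)) > 1e-9:
            return a

def rasp_perturb(table: List[List[float]], a: List[List[float]], rng: random.Random) -> List[List[float]]:
    """table is a list of records; returns A (E_ope(x)^T, 1, v)^T for each record x."""
    cols = [ope_to_normal(list(col)) for col in zip(*table)]  # step 1, per dimension
    out = []
    for enc in zip(*cols):
        y = list(enc) + [1.0, truncated_normal(rng)]          # step 2, extend to d + 2
        out.append([sum(aij * yj for aij, yj in zip(row, y)) for row in a])  # step 3
    return out

rng = random.Random(7)
table = [[23.0, 50000.0], [35.0, 42000.0], [51.0, 91000.0]]  # hypothetical records, d = 2
A = random_invertible(len(table[0]) + 2, rng)
perturbed = rasp_perturb(table, A, rng)
print(perturbed)
```

Note that `A` is generated once and reused for every record, while a fresh `v` is drawn per record, matching the sharing rule stated above.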

Design of OPE and RNG. We use the OPE scheme to convert all dimensions of the original data to the standard normal distribution $\mathcal{N}(0, 1)$ in the limited domain $[-\beta, \beta]$. $\beta$ can be selected as a value $\ge 4$, as the range $[-4, 4]$ covers more than 99% of the population. This can be done with an algorithm such as the one described in [1]. The use of OPE allows queries to be correctly transformed and processed. Similarly, we draw random noises $v$ from $\mathcal{N}(0, 1)$ in the limited domain $[-\beta, \beta]$. Such a design makes the extended noise dimension indistinguishable from the data dimensions in terms of the distributions.
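The coverage claim for the truncated domain can be checked numerically with the standard library. The helper name `coverage` is an assumption for this example.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def coverage(beta: float) -> float:
    """Probability mass of N(0, 1) that falls inside [-beta, beta]."""
    return std_normal.cdf(beta) - std_normal.cdf(-beta)

for beta in (2.0, 3.0, 4.0):
    print(f"beta={beta}: coverage={coverage(beta):.6f}")
```

Choosing a larger bound only trims less of the tails, so any value of at least 4 keeps the truncated distribution very close to the untruncated normal.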

The design of such an extended data vector $((E_{ope}(x))^T, 1, v)^T$ is to enhance the data and query confidentiality. The use of OPE is to transform large-scale or infinite domains to normal distributions, which addresses the distributional attack. The $(d+1)$-th homogeneous dimension is for hiding the query content. The $(d+2)$-th dimension injects random noise into the perturbed data and also protects the transformed queries from attacks. The rationale behind these aspects will be discussed in later sections.

3.2 Properties of RASP

RASP has several important features. First, RASP does not preserve the order of dimensional values because of the matrix multiplication component, which distinguishes it from order preserving encryption (OPE) schemes, and thus it does not suffer from the distribution-based attack (details in Section 7). An OPE scheme maps a set of single-dimensional values to another, while keeping the value order unchanged. Since the RASP perturbation can be treated as a combined transformation $F(G(E_{ope}(x)))$, it is sufficient to show that $F(y) = Ay$ does not preserve the order of dimensional values, where $y \in \mathbb{R}^{d+2}$ and $A \in \mathbb{R}^{(d+2) \times (d+2)}$. The proof is straightforward, as shown in the Appendix.
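A small numeric counterexample makes the first property concrete. It uses a hypothetical 2×2 invertible matrix and two hand-picked vectors instead of the full $(d+2)$-dimensional setting, so it illustrates the idea rather than reproducing the proof in the Appendix.

```python
from typing import List

def matvec(a: List[List[float]], y: List[float]) -> List[float]:
    """Compute the matrix-vector product A y."""
    return [sum(aij * yj for aij, yj in zip(row, y)) for row in a]

# Hypothetical invertible matrix (determinant 2) with two non-zero values per row.
A = [[1.0, -1.0],
     [1.0, 1.0]]
y1, y2 = [1.0, 0.0], [2.0, 5.0]

p1, p2 = matvec(A, y1), matvec(A, y2)
print(p1, p2)
print("order in dimension 0 before:", y1[0] < y2[0], "after:", p1[0] < p2[0])
```

Each output dimension mixes several input dimensions, so an ordering that holds in one input dimension can be reversed in the corresponding output dimension.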

Second, RASP does not preserve the distances between records, which protects the perturbed data from distance-based attacks [8]. Because none of the transformations in RASP ($E_{ope}$, $G$, and $F$) preserves distances, the RASP perturbation will not preserve distances. Similarly, RASP does not preserve other more sophisticated structures such as the covariance matrix and principal components [18]. Therefore, PCA-based attacks such as [16], [20] do not work either.

2. Currently, we use a random invertible matrix generator that draws matrix elements at random from the standard normal distribution and checks the matrix invertibility and the non-zero conditions.

Third, the original range queries can be transformed to the RASP-perturbed data space, which is the basis of our query processing strategy. A range query describes a hyper-cubic area (with possibly open bounds) in the multidimensional space. In Section 4, we will show that a hyper-cubic area in the original space is transformed to a polyhedron with the RASP perturbation. Thus, we can search the points in the polyhedron to get the query results.

3.3 Data Confidentiality Analysis

As the threat model describes, attackers might be interested in finding the exact original data records or estimating them based on the perturbed data. For an estimation attack, if the estimation is sufficiently accurate (above a certain accuracy threshold), we say the perturbation is not secure. Below, we define the measure for evaluating the effectiveness of estimation attacks.

3.3.1 Evaluating Effectiveness of Estimation Attacks

Because attackers may not need to exactly recover the original values, an accurate estimation will be sufficient. A measure is needed to define the “accuracy” or “uncertainty” as we mentioned. We use the commonly used mean-squared error (MSE) to evaluate the effectiveness of an attack. To be semantically consistent, the $j$-th dimension can be treated as sample values drawn from a random variable $X_j$. Let $x_{ij}$ be the value of the $i$-th original record in the $j$-th dimension and $\hat{x}_{ij}$ be the estimated value. The MSE for the $j$-th dimension can be defined as

$$\mathrm{MSE}(X_j, \hat{X}_j) = \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \hat{x}_{ij})^2,$$

which is equivalent to the variance $\mathrm{var}(X_j - \hat{X}_j)$ when the estimation error has zero mean. The square root of MSE (RMSE) represents the uncertainty of the estimation: for an estimated value $\hat{x}$, the original value $x$ could be in the range $(\hat{x} - \mathrm{RMSE}, \hat{x} + \mathrm{RMSE})$. Thus, the length of the range, $2 \cdot \mathrm{RMSE}$, also represents the accuracy of the estimation.

However, this length is subject to the length of the domain. Thus, we use the normalized square root of MSE (NR MSE) instead,

$$\text{NR MSE}(X_j) = 2\sqrt{\mathrm{MSE}(X_j, \hat{X}_j)} \,/\, \text{domain length}, \tag{2}$$

which is intuitively the ratio between the uncertain range and the whole domain.

To compare MSE for multiple columns, we also need to normalize the two series $\{x_{ij}\}$ and $\{\hat{x}_{ij}\}$ to eliminate the difference in domain scales. The normalization procedure [11] is described as follows. Assume the mean and variance of the series $\{x_{ij}\}$ are $\mu_j$ and $\sigma_j^2$, respectively. The series is transformed by $x_{ij} \leftarrow (x_{ij} - \mu_j)/\sigma_j$. A similar procedure is also applied to the series $\{\hat{x}_{ij}\}$. For the normalized domains, the range $[-2, 2]$ almost covers the whole population (footnote 3) [11]. Therefore, for normalized series, NR MSE is simply RMSE/2.
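The two measures can be computed directly from their definitions, as in the following minimal sketch. The function names `mse` and `nr_mse` and the sample values are assumptions for illustration.

```python
import math
from typing import Sequence

def mse(original: Sequence[float], estimated: Sequence[float]) -> float:
    """Mean squared error between original values and their estimates."""
    n = len(original)
    return sum((x - e) ** 2 for x, e in zip(original, estimated)) / n

def nr_mse(original: Sequence[float], estimated: Sequence[float], domain_length: float) -> float:
    """Normalized root MSE: the uncertain range 2 * RMSE divided by the domain length."""
    return 2.0 * math.sqrt(mse(original, estimated)) / domain_length

# Hypothetical column with domain length 10 and an attacker's estimates.
x = [1.0, 2.0, 3.0, 4.0]
x_hat = [1.5, 2.5, 2.5, 3.5]
print(mse(x, x_hat), nr_mse(x, x_hat, domain_length=10.0))
```

For normalized series the domain length is taken as 4, so `nr_mse(..., domain_length=4.0)` reduces to RMSE/2 as stated above.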

For an attack that can only result in a low-accuracy estimation (e.g., NR MSE ≥ 20%, meaning the uncertainty is more than 20% of the domain length), we say the RASP-perturbed dataset is resilient to that attack. Intuitively, an NR MSE higher than 100% is not very meaningful. Thus, we set the absolute upper bound to be 100%. We will discuss the specific upper bounds according to the level of prior knowledge.

3.3.2 Prior-Knowledge Based Analysis

Below, we analyze the security under the two levels of knowledge the attacker may have, according to the two levels of security definitions: exact match and statistical estimation.

Naive Estimation. We assume each value in the vector or matrix is encoded with $n$ bits. Let the perturbed vector $p$ be drawn from a random variable $P$, and the original vector $x$ be drawn from a random variable $X$. We show that it is computationally intractable for naive estimation to identify the exact original data from the perturbed data, if we use a random invertible real matrix generator and a random real value generator. The goal is to count the valid candidate datasets $X$ for a known perturbed dataset $P$. Below we discuss a simplified version that contains no OPE component; the OPE version has at least the same level of security.

Proposition 1: For a known perturbed dataset $P$, there exist $O(2^{(d+1)(d+2)n})$ candidate $X$ datasets in the original space.

Proof: For a given perturbation $P = AZ$, where $Z$ is $X$ with the two extended dimensions, we use $B_{d+1}$ to represent the $(d+1)$-th row of $A^{-1}$. Thus, $B_{d+1}P = [1, \ldots, 1]$, i.e., the appended $(d+1)$-th row of $Z$. Keeping $B_{d+1}$ unchanged, we randomly generate the other rows of $B$ for a candidate $B$. The result $\hat{Z} = BP$ is a valid estimate of $Z$ if $B$ is invertible. Thus, the number of candidate $X$ is the number of invertible $B$.

The total number of $B$, including non-invertible ones, is $2^{(d+1)(d+2)n}$. Based on the theory of invertible random matrices [28], the probability of generating a non-invertible random matrix is less than $\exp(-c(d+2))$ for some constant $c$. Thus, there are about $(1 - \exp(-c(d+2)))\,2^{(d+1)(d+2)n}$ invertible $B$. Correspondingly, there is the same number of candidate $X$. Thus, finding the exact $X$ has a negligible probability in terms of the number of bits, $n$.

3. For a normal distribution $N(\mu, \sigma^2)$, the range $(\mu - 2\sigma, \mu + 2\sigma)$ covers about 95% of the population. We use this length $4\sigma$ to approximately represent the majority of the population for all other distributions, as the normal distribution is a good approximation for many applications.
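As a worked instance of the count in Proposition 1, take $d = 2$ searchable dimensions and $n = 32$ bits per value; these two values are chosen here only for illustration and do not come from the paper. $B$ has $d+2$ rows of $d+2$ entries each, and fixing the row $B_{d+1}$ leaves $d+1$ rows free:

```latex
\begin{align*}
\text{free entries of } B &= (d+1)(d+2) = 3 \cdot 4 = 12, \\
\text{free bits} &= (d+1)(d+2)\,n = 12 \cdot 32 = 384, \\
\text{number of candidate } B &\le 2^{(d+1)(d+2)n} = 2^{384}.
\end{align*}
```

Even for this very small table width, the candidate space is far beyond exhaustive search.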

As the candidates have an equal probability over the whole domain, according to the definition of NR MSE, the uncertain range is the same as the whole domain, resulting in NR MSE = 100%.

Distribution-based Estimation. With the known distributional information, the attacker can do more in estimating the original data. The most relevant known method is Independent Component Analysis (ICA) [17]. For a multiplicative perturbation $P = AX$, the basic idea is to find an optimal projection, $wP$, where $w$ is a $(d+2)$-dimensional row vector, that results in a row vector with its value distribution close to that of one original attribute. It can be extended to find a matrix $W$, so that $WP$ gives independent and non-Gaussian rows, i.e., a good estimate of $X$.

The ICA algorithms [17], [13] are optimization algorithms that try to find such projections by maximizing the non-gaussianity (footnote 4) of the projection $wP$. The non-gaussianity of the original attributes is crucial because any projection of a multidimensional normal distribution is still a normal distribution, which leaves no clue for recovery.

Therefore, with our design of OPE and the noise dimension in Section 3, we have the following result.

Proposition 2: There are $O(2^{dn})$ candidate projection vectors, $w$, that lead to the same level of non-gaussianity.

Proof: The OPE-encrypted matrix $X$ (with the homogeneous dimension excluded, which can possibly be recovered) can be treated as a sample set drawn from a multivariate normal distribution $\mathcal{N}(\mu, \Sigma)$. Any invertible transformation $P = AX$ will result in another multivariate normal distribution $\mathcal{N}(A\mu, A\Sigma A^T)$. Thus, any projection $wP$ will not change the gaussianity, and there are $O(2^{dn})$ such candidates of $w$. Thus, the probability of identifying the right projection is negligible in terms of the number of bits $n$. This shows that any ICA-style estimation that depends on non-gaussianity is equally ineffective against the RASP perturbation.

In addition to ICA, the Principal Component Analysis (PCA) based attack is another possible distributional attack, which, however, depends on the preservation of the covariance matrix [20]. Because the covariance matrix is not preserved in RASP perturbation, the PCA attack cannot be used on RASP-perturbed data. It is unknown whether there are other distributional methods for approximately separating $X$ or $A$ from the perturbed data $P$, which will be studied in ongoing work.

4. Non-gaussianity means the distribution is not a normal distribution.

In the worst-case estimation, the attacker can simply draw a sample $\hat{X}_j$ from the known distribution of the original $X_j$; thus, $\hat{X}_j$ and $X_j$ are independent but have the same distribution. It follows that $MSE = var(X_j - \hat{X}_j) = var(X_j) + var(\hat{X}_j) = 2\,var(X_j) = 2\sigma^2$. Correspondingly, NR MSE $= (2\sqrt{MSE})/(4\sigma) = \sqrt{2}/2 \approx 71\%$.
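The 71% figure can be checked numerically. The following is a minimal sketch, assuming a normally distributed attribute with standard deviation `sigma` and an attacker who guesses by drawing independently from the same distribution; the function name `simulate_nr_mse` and its parameters are illustrative and do not come from the paper.

```python
import math
import random

def simulate_nr_mse(num_samples=200_000, sigma=1.0, seed=7):
    """Monte Carlo estimate of NR MSE = 2*sqrt(MSE) / (4*sigma)."""
    rng = random.Random(seed)
    squared_error = 0.0
    for _ in range(num_samples):
        x = rng.gauss(0.0, sigma)        # original value X_j
        x_hat = rng.gauss(0.0, sigma)    # attacker's independent draw
        squared_error += (x - x_hat) ** 2
    mse = squared_error / num_samples    # approaches 2 * sigma^2
    return 2.0 * math.sqrt(mse) / (4.0 * sigma)

nr_mse = simulate_nr_mse()
print(f"simulated NR MSE: {nr_mse:.3f}, sqrt(2)/2: {math.sqrt(2) / 2:.3f}")
```

With enough samples the simulated value should settle close to $\sqrt{2}/2$, independent of the chosen `sigma`.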

4 RASP RANGE-QUERY PROCESSING

Based on the RASP perturbation method, we design the services for two types of queries: range query and kNN query. This section is dedicated to range query processing. We first show that a range query in the original space can be transformed to a polyhedron query in the perturbed space, and then develop a secure way to do the query transformation. Finally, we develop a two-stage query processing strategy for efficient range query processing.

4.1 Transforming Range Queries

Let us look at the general form of a range query condition. Let $X_i$ be an attribute in the database. A simple condition in a range query involves only one attribute and is of the form “$X_i$ <op> $a_i$”, where $a_i$ is a constant in the normalized domain of $X_i$ and op $\in \{<, >, =, \le, \ge, \ne\}$ is a comparison operator. For convenience, we only discuss how to process $X_i < a_i$; the proposed method can be slightly changed for other conditions. Any complicated range query can be transformed into the disjunction of a set of conjunctions, i.e., $\bigcup_{j=1}^{n} (\bigcap_{i=1}^{m} C_{i,j})$, where $m, n$ are some integers depending on the original query conditions and $C_{i,j}$ is a simple condition about $X_i$. Again, to simplify the presentation, we restrict our discussion to a single conjunction condition $\bigcap_{i=1}^{m} C_i$, where $C_i$ is of the form $b_i \le X_i \le a_i$. Such a conjunction condition describes a hyper-cubic area in the multidimensional space.

According to the three nested transformations in RASP, $F(G(E_{ope}(x)))$, we first show that an OPE transforms the original hyper-cubic area to another hyper-cubic area in the OPE space.

Proposition 1: Order preserving encryption functions transform a hyper-cubic query range to another hyper-cubic query range.

Proof: The original range query condition consists of simple conditions like $b_i \le X_i \le a_i$ for each dimension. Since the order is preserved, each simple condition is transformed as follows: $E_{ope}(b_i) \le E_{ope}(X_i) \le E_{ope}(a_i)$, which means the transformed range is still a hyper-cubic query range.

Let $y = E_{ope}(x)$ and $c_i = E_{ope}(a_i)$. A simple condition $Y_i \le c_i$ defines a half-space. With the extended dimensions $z^T = (y^T, 1, v)$, the half-space can be represented as $w^T z \le 0$, where $w$ is a $(d+2)$-dimensional vector with $w_i = 1$, $w_{d+1} = -c_i$, and $w_j = 0$ for $j \ne i, d+1$. Finally, let $u = Az$, according to the RASP transformations. With this representation, the original condition is equivalent to

$w^T A^{-1} u \le 0$ (3)

in the RASP-perturbed space, which is still a half-space condition. However, this half-space is in general not parallel to the coordinate axes; the transformed conditions together form a polyhedron (as illustrated in Figure 3). The query service needs to find the records in the polyhedron area, which is supported by the two-stage processing algorithm.

Fig. 2. R-tree index.

Fig. 3. Illustration of the two-stage processing algorithm.
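The equivalence behind Eq. (3) can be checked with exact arithmetic. The sketch below is an illustration only: it skips the OPE step (the values `y` are used directly), builds a small diagonally dominant integer matrix `A` so that invertibility is guaranteed, and uses a hand-written `invert` helper; none of these choices come from the paper.

```python
from fractions import Fraction
import random

def invert(matrix):
    """Gauss-Jordan inverse with exact rational arithmetic."""
    n = len(matrix)
    aug = [[Fraction(v) for v in row] + [Fraction(int(r == c)) for c in range(n)]
           for r, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

rng = random.Random(0)
d = 3
dim = d + 2
# Strict diagonal dominance (20 > 4 * 3) guarantees that A is invertible.
A = [[20 if r == c else rng.randint(-3, 3) for c in range(dim)] for r in range(dim)]
A_inv = invert(A)

i, c_i = 1, 5                      # condition Y_i <= c_i (0-based dimension index)
w = [0] * dim
w[i], w[d] = 1, -c_i               # w_i = 1, w_{d+1} = -c_i in the paper's 1-based indexing
w_a = [sum(w[r] * A_inv[r][c] for r in range(dim)) for c in range(dim)]  # w^T A^{-1}

checks = []
for _ in range(50):
    y = [rng.randint(0, 10) for _ in range(d)]
    v = rng.randint(1, 9)          # noise component
    z = y + [1, v]                 # extended record (y, 1, v)
    u = [sum(a * b for a, b in zip(row, z)) for row in A]   # u = A z
    checks.append(sum(a * b for a, b in zip(w_a, u)) == y[i] - c_i)
all_match = all(checks)
print(all_match)
```

For every record, $w^T A^{-1} u$ evaluates to exactly $y_i - c_i$, so its sign decides the original half-space condition.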

4.2 Security Enhancement on Query Transformation

The attacker may also target the transformed queries. In this section, we discuss such attacks and describe the methods for countering them. Note that the attack on small ranges will be described in kNN query processing.

Countering Dimensional Selection Attack. We show that the dimensional selection attack can reveal partial information about the selected data dimensions if the attacker knows the distribution of the dimension. Assume the query condition is applied to the $i$-th dimension. If the query parameter $w^T A^{-1}$ is directly submitted to the cloud side, the server can apply $w^T A^{-1}$ to each record $u$ in the server and get $w^T A^{-1} u = E_{ope}(x_i) - E_{ope}(a_i)$, where $x_i$ is the $i$-th dimension of the corresponding original record $x$. After getting all such values for dimension $i$, with the known original data distributions, the attacker can apply the bucket-based distributional attack on the OPE-encrypted data (see Section 7) to get an accurate estimate.

According to the design of noise, the extended $(d+2)$-th dimension $v$ in the RASP perturbation $F(x) = A(E_{ope}(x)^T, 1, v)^T$ is always greater than $v_0$, which can be used to construct secure query conditions. Instead of processing a half-space condition $E_{ope}(X_i) \le E_{ope}(a_i)$, we use $(E_{ope}(X_i) - E_{ope}(a_i))(v - v_0) \le 0$. These two conditions are equivalent because $v$ always satisfies $v > v_0$. Using similar transformations, we get $E_{ope}(X_i) - E_{ope}(a_i) = w^T A^{-1} u$ and $v - v_0 = q^T A^{-1} u$, where $q_{d+2} = 1$, $q_{d+1} = -v_0$, and $q_j = 0$ for $j \ne d+1, d+2$. Thus, we get the transformed quadratic query condition

$u^T (A^{-1})^T w q^T A^{-1} u \le 0$. (4)

Let $\Theta_i = (A^{-1})^T w q^T A^{-1}$. Now $\Theta_i$ is submitted to the server, and the server uses $u^T \Theta_i u \le 0$ to filter the results.

We now show that this query transformation is resilient to the dimensional selection attack. Applying $u^T \Theta_i u$ to each record $u$, we get $(E_{ope}(X_i) - E_{ope}(a_i))(v - v_0)$. Since $v$ is randomly chosen for each record, the value $E_{ope}(X_i) - E_{ope}(a_i)$ is protected by the randomization. $\Theta_i$ does not reveal the key parameters either. Let $c_i = E_{ope}(a_i)$ and $a_i$ be the $i$-th row of $A^{-1}$. Then $\Theta_i$ is $(a_i - c_i a_{d+1})^T (a_{d+2} - v_0 a_{d+1})$. As all the components $a_i$, $c_i$, $a_{d+1}$, and $a_{d+2}$ are unknown and cannot be further reduced, $\Theta_i$ provides no information to help derive information about $A^{-1}$.

Other Potential Threats. The query transformation method does not introduce randomness: the same query always gets the same transformation, and thus the confidentiality of the access pattern is not preserved. We summarize the leaked information related to access patterns as follows.

• Attackers know the exact frequency of each transformed query.

• The set relationships (set intersection, union, difference, etc.) between the query results are revealed as a result of exact range query processing.

• Some query matrices on the same dimension may have a special relationship preserved, as shown in Proposition 3, which we will discuss later.

We admit this is a weakness of the current design. However, according to the threat model, the adversary will not know any of the original data and queries. Thus, by simply observing the query frequency or relationships between queries, one cannot derive useful information. An important future work is to formally define the specific information leakage caused by the leaked query and access patterns, and then precisely analyze the data and query confidentiality affected by this information leakage under different security assumptions.
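The quadratic condition of Eq. (4) can be illustrated with a small exact computation. This is a sketch under stated assumptions: OPE is skipped, `A` is a toy diagonally dominant integer matrix, and $q$ is taken with $q_{d+1} = -v_0$ and $q_{d+2} = 1$ so that $q^T z = v - v_0$. The helper names (`invert`, `make_theta`, `quad_form`) are hypothetical.

```python
from fractions import Fraction
import random

def invert(matrix):
    """Gauss-Jordan inverse with exact rational arithmetic."""
    n = len(matrix)
    aug = [[Fraction(v) for v in row] + [Fraction(int(r == c)) for c in range(n)]
           for r, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

rng = random.Random(2)
d, v0 = 3, 1
dim = d + 2
# Strict diagonal dominance guarantees that this toy A is invertible.
A = [[20 if r == c else rng.randint(-3, 3) for c in range(dim)] for r in range(dim)]
A_inv = invert(A)

def row_times_inverse(vec):
    """Return the row vector vec^T A^{-1}."""
    return [sum(vec[r] * A_inv[r][c] for r in range(dim)) for c in range(dim)]

def make_theta(i, c_i):
    w = [0] * dim
    w[i], w[d] = 1, -c_i            # w^T z = y_i - c_i
    q = [0] * dim
    q[d], q[d + 1] = -v0, 1         # q^T z = v - v0
    wa, qa = row_times_inverse(w), row_times_inverse(q)
    return [[x * y for y in qa] for x in wa]   # (A^{-1})^T w q^T A^{-1}

def quad_form(theta, u):
    return sum(u[r] * theta[r][c] * u[c] for r in range(dim) for c in range(dim))

theta = make_theta(i=0, c_i=5)
agree = []
for _ in range(100):
    y = [rng.randint(0, 10) for _ in range(d)]
    v = v0 + rng.randint(1, 50)     # noise component, always greater than v0
    u = [sum(a * b for a, b in zip(row, y + [1, v])) for row in A]
    agree.append((quad_form(theta, u) <= 0) == (y[0] <= 5))
all_agree = all(agree)
print(all_agree)
```

The server-side test $u^T \Theta_i u \le 0$ selects the same records as $y_i \le c_i$, while the value it computes is scaled by the per-record random factor $v - v_0$.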

4.3 A Two-Stage Query Processing Strategy with Multidimensional Index Tree

With the transformed queries, the next important task is to process queries efficiently and return precise results to minimize the client-side post-processing effort. A commonly used method is to use multidimensional tree indices to improve the search performance. However, multidimensional tree indices are normally used to process axis-aligned “bounding boxes”, whereas the transformed queries are arbitrary polyhedra, not necessarily aligned to the axes. In this section, we propose a two-stage query processing strategy to handle such irregular-shape queries in the perturbed space.

Multidimensional Index Tree. Most multidimensional indexing algorithms are derived from R-tree like algorithms [22], where the axis-aligned minimum bounding region (MBR) is the construction block for indexing the multidimensional data. For 2D data, an MBR is a rectangle. For higher dimensions, the shape of the MBR is extended to a hyper-cube. Figure 2 shows the MBRs in the R-tree for a 2D dataset, where each node is bounded by a node MBR. The R-tree range query algorithm compares the MBR and the queried range to find the answers.

The Two-Stage Processing Algorithm. The transformed query describes a polyhedron in the perturbed space that cannot be directly processed by multidimensional tree algorithms. New tree search algorithms could be designed to use arbitrary polyhedron conditions directly for search. However, we use a simpler two-stage solution that keeps the existing tree search algorithms unchanged.

At the first stage, the proxy on the client side finds the MBR of the polyhedron (as a part of the submitted transformed query) and submits the MBR and a set of secured query conditions $\{\Theta_1, \ldots, \Theta_m\}$ to the server. The server then uses the tree index to find the set of records enclosed by the MBR.

The MBR of the polyhedron can be efficiently found based on the original range. The original query condition constructs a hyper-cube shape. With the described query transformation, the vertices of the hyper-cube are also transformed to vertices of the polyhedron. Therefore, the MBR of the vertices is also the MBR of the polyhedron [27]. Figure 3 illustrates the relationship between the vertices and the MBR and the two-stage processing strategy.

At the second stage, the server uses the transformed half-space conditions to filter the initial result. In most cases of tight ranges, the initial result set will be reasonably small so that it can be filtered in memory by simply checking the transformed half-space conditions. However, in the worst case, the MBR of the polyhedron will possibly enclose the entire dataset and the second stage is reduced to a linear scan of the entire dataset. The second stage returns the exact range query result to the proxy server, which significantly reduces the post-processing cost that the proxy server needs to take. This is very important to the cloud-based service, because low post-processing cost requires low in-house investment.
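The two stages can be sketched on a toy example. The code below is a simplified illustration, not the paper's implementation: it uses a plain 2-D linear map `A` without OPE, homogeneous, or noise dimensions; a linear scan stands in for the R-tree lookup of stage 1; and stage 2 applies the rows of $A^{-1}$ directly, whereas the real scheme hides them inside the matrices $\Theta_i$.

```python
from itertools import product
import random

A = [[2, 1], [1, 1]]           # toy perturbation matrix (determinant 1)
A_INV = [[1, -1], [-1, 2]]     # its exact inverse

def apply(matrix, point):
    return tuple(sum(a * b for a, b in zip(row, point)) for row in matrix)

def query_mbr(lower, upper):
    """MBR of the transformed hyper-cube, computed from its transformed vertices."""
    vertices = [apply(A, corner) for corner in product(*zip(lower, upper))]
    dims = range(len(lower))
    return ([min(v[j] for v in vertices) for j in dims],
            [max(v[j] for v in vertices) for j in dims])

def two_stage(perturbed, lower, upper):
    lo, hi = query_mbr(lower, upper)
    # Stage 1: MBR filter (a linear scan stands in for the R-tree search).
    stage1 = [u for u in perturbed
              if all(lo[j] <= u[j] <= hi[j] for j in range(len(u)))]
    # Stage 2: exact filter, b_i <= (A^{-1} u)_i <= a_i for every dimension i.
    stage2 = [u for u in stage1
              if all(lower[i] <= x <= upper[i]
                     for i, x in enumerate(apply(A_INV, u)))]
    return stage1, stage2

rng = random.Random(1)
original = [(rng.randint(0, 100), rng.randint(0, 100)) for _ in range(500)]
perturbed = [apply(A, x) for x in original]
lower, upper = (20, 30), (50, 60)
stage1, stage2 = two_stage(perturbed, lower, upper)
expected = [apply(A, x) for x in original
            if all(l <= xi <= h for l, xi, h in zip(lower, x, upper))]
print(len(stage1), len(stage2), sorted(stage2) == sorted(expected))
```

Stage 1 may return extra records because the MBR is larger than the polyhedron; stage 2 removes them, so the final result matches the range query on the original data.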

5 KNN QUERY PROCESSING WITH RASP

Because the RASP perturbation does not preserve distances (and distance orders), kNN queries cannot be directly processed with the RASP-perturbed data. In this section, we design a kNN query processing algorithm based on range queries (the kNN-R algorithm). As a result, the use of an index in range query processing also enables fast processing of kNN queries.

5.1 Overview of the kNN-R Algorithm

The original distance-based kNN query processing finds the nearest k points in the spherical range that is centered at the query point. The basic idea of our algorithm is to use square ranges, instead of spherical ranges, to find the approximate kNN results, so that the RASP range query service can be used. There are a number of key problems to solve to make this work securely and efficiently. (1) How can we efficiently find the minimum square range that surely contains the k results, without many interactions between the cloud and the client? (2) Will this solution preserve data confidentiality and query privacy? (3) Will the proxy server’s workload increase, and to what extent?

The algorithm is based on square ranges to approximately find the kNN candidates for a query point, which are defined as follows.

Definition 1: A square range is a hyper-cube that is centered at the query point and has equal-length edges.

Figure 5 illustrates the range-query-based kNN processing with two-dimensional data. The Inner Range is the square range that contains at least k points, and the Outer Range encloses the spherical range that encloses the inner range. The outer range surely contains the kNN results (Proposition 2) but it may also contain irrelevant points that need to be filtered out.

Proposition 2: The kNN-R algorithm returns results with 100% recall.

Proof: The sphere in Figure 5 between the outer range and the inner range covers all points with distances less than the radius r. Because the inner range contains at least k points, there are at least k nearest neighbors to the query point with distances less than the radius r. Therefore, the k nearest neighbors must be in the outer range.

The kNN-R algorithm consists of two rounds of interactions between the client and the server. Figure 4 demonstrates the procedure. (1) The client sends the initial upper-bound range, which contains more than k points, and the initial lower-bound range, which contains fewer than k points, to the server. The server finds the inner range and returns it to the client. (2) The client calculates the outer range based on the inner range and sends it back to the server. The server finds the records in the outer range and sends them to the client. (3) The client decrypts the records and finds the top k candidates as the final result.
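The recall guarantee of Proposition 2 can be illustrated on plaintext data. The sketch below is a simplification under stated assumptions: no RASP perturbation is applied, the inner range is found by sorting rather than by the server-side search, and the function name `knn_r` is hypothetical.

```python
import math
import random

def knn_r(points, query, k):
    """Range-query-based kNN on plaintext points (no perturbation)."""
    d = len(query)
    # Inner range: the smallest square range holding at least k points. The
    # Chebyshev distance of a point is the half-side needed to cover it.
    chebyshev = sorted(max(abs(p[j] - query[j]) for j in range(d)) for p in points)
    half_side = chebyshev[k - 1]
    radius = half_side * math.sqrt(d)   # sphere enclosing the inner range
    # Outer range: the square range with half-side `radius`, enclosing the sphere.
    candidates = [p for p in points
                  if all(abs(p[j] - query[j]) <= radius for j in range(d))]
    # Client-side post-processing: keep the k closest candidates.
    candidates.sort(key=lambda p: math.dist(p, query))
    return candidates[:k], len(candidates)

rng = random.Random(3)
points = [(rng.random(), rng.random()) for _ in range(2000)]
query = (0.5, 0.5)
k = 3
result, num_candidates = knn_r(points, query, k)
brute_force = sorted(points, key=lambda p: math.dist(p, query))[:k]
print(result == brute_force, num_candidates)
```

The outer range returns more than k candidates in general; the k nearest neighbors are always among them, and the surplus is what the client has to filter.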

If the points are approximately uniformly distributed, we can estimate the precision of the returned result. With the uniform assumption, the number of points in an area is proportional to the size of the area. If the inner range contains $m$ points, $m \ge k$, the outer range contains $q$ points, and the dimensionality is $d$, we can derive $q = 2^{d/2} m$. Thus, the precision is $k/q = k/(2^{d/2} m)$. If $m \approx k$ and $d = 2$, the precision is around 0.5. When $d$ increases, the precision decreases exponentially due to the curse of dimensionality [23], which suggests that kNN-R does not work effectively on high-dimensional data. We will show this weakness in experiments.

Fig. 4. Procedure of the kNN-R algorithm.

Fig. 5. Illustration of the kNN-R algorithm when k = 3.

5.2 Finding Compact Inner Square Range

An important step in the kNN-R algorithm is to find the compact inner square range to achieve high precision. In the following, we give the $(k, \delta)$-range for efficiently finding the compact inner range.

Definition 2: A $(k, \delta)$-range is any square range centered at the query point that contains a number of points in the range $[k, k + \delta]$, where $\delta$ is a nonnegative integer.

We design an algorithm similar to binary search to efficiently find the $(k, \delta)$-range. Suppose a square range centered at the query point with length $L$ in each dimension is represented as $S^{(L)}$. Let the number of points included in this range be $N^{(L)}$. If a square range $S^{(in)}$ is enclosed by another square range $S^{(out)}$, we say $S^{(in)} \subset S^{(out)}$. It directly follows that $N^{(in)} \le N^{(out)}$, and also

Corollary 1: If $N^{(1)} < N^{(2)}$, then $S^{(1)} \subset S^{(2)}$.

Using this definition and notation, we can always construct a series of enclosed square ranges centered on the query point: $S^{(L_1)} \subset S^{(L_2)} \subset \ldots \subset S^{(L_m)}$. Correspondingly, the numbers of points enclosed by $\{S^{(L_i)}\}$ have the ordering $N^{(L_1)} \le N^{(L_2)} \le \ldots \le N^{(L_m)}$. Assume that $S^{(L_1)}$ is the initial lower-bound range containing fewer than $k$ points and $S^{(L_m)}$ is the initial upper-bound range; both are sent by the client. The problem of finding the compact inner range $S$ can be mapped to a binary search over the sequence $\{S^{(L_i)}\}$.

In each step of the binary search, we start with a lower-bound range, denoted as $S^{(low)}$, and a higher-bound range, $S^{(high)}$. We want the corresponding numbers of enclosed points to satisfy $N^{(low)} < k \le N^{(high)}$ in each step, which is achieved with the following procedure. First, we find the middle square range $S^{(mid)}$, where $mid = (low + high)/2$. If $S^{(mid)}$ covers no less than $k$ points, the higher bound $S^{(high)}$ is updated to $S^{(mid)}$; otherwise, the lower bound $S^{(low)}$ is updated to $S^{(mid)}$. At the beginning, $S^{(low)}$ is set to $S^{(L_1)}$ and $S^{(high)}$ to $S^{(L_m)}$. This process repeats until $N^{(mid)} < k + \delta$ or $high - low < E$, where $E$ is some small positive number. Algorithm 4 in the Appendix describes these steps.
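The search can be sketched on plaintext points as follows. This is an illustrative reading, not the paper's Algorithm 4 (which is in the Appendix and not reproduced here): the stopping test is taken as $k \le N^{(mid)} \le k + \delta$, following Definition 2, and the names `count_in_square` and `find_k_delta_range` are hypothetical.

```python
import random

def count_in_square(points, query, length):
    """Number of points in the square range of side `length` centered at `query`."""
    half = length / 2.0
    return sum(all(abs(p[j] - query[j]) <= half for j in range(len(query)))
               for p in points)

def find_k_delta_range(points, query, k, delta, high, low=0.0, eps=1e-9):
    """Binary search for a side length whose square range holds k..k+delta points.

    Assumes the initial `high` range holds at least k points and the initial
    `low` range holds fewer than k points.
    """
    rounds = 0
    while high - low >= eps:
        mid = (low + high) / 2.0
        n_mid = count_in_square(points, query, mid)   # one range query per round
        rounds += 1
        if n_mid >= k:
            high = mid
            if n_mid <= k + delta:
                return mid, n_mid, rounds
        else:
            low = mid
    # Fallback: the tightest known range that still holds at least k points.
    return high, count_in_square(points, query, high), rounds

rng = random.Random(4)
points = [(rng.random(), rng.random()) for _ in range(5000)]
length, n_found, rounds = find_k_delta_range(points, (0.5, 0.5), k=10, delta=5, high=1.0)
print(length, n_found, rounds)
```

A larger `delta` widens the set of acceptable side lengths, so the search tends to stop in fewer rounds at the price of a looser inner range.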

Selection of Initial Inner/Outer Bounds. The initial inner bound can be the query point itself. If the query point is $q(q_1, \ldots, q_d)$, $S^{(L_1)}$ is a hyper-cube defined by $\{q_i \le X_i \le q_i, i = 1 \ldots d\}$. The naive selection of $S^{(L_m)}$ would be the whole domain. However, we can effectively reduce the range with a coarse density map organized in a tiny flat multidimensional tree, which can be included in the preprocessing step on the client side. The details are omitted due to space limitations.

5.3 Finding Inner Range with RASP Perturbed Data

Algorithm 4 gives the basic ideas of finding the compact inner range in iterations. There are two critical operations in this algorithm: (1) finding the number of points in a square range and (2) updating the higher and lower bounds. Because range queries are secured in the RASP framework, the key is to update the bounds with the secured range queries, without the help of the client-side proxy server.

As discussed in the RASP query processing, a range query such as $S^{(L)}$ is encoded as the $MBR^{(L)}$ of its polyhedron range in the perturbed space and the $2(d+2)$ dimensional conditions $y^T \Theta_i^{(L)} y \le 0$ determining the sides of the polyhedron; each of the $d + 2$ extended dimensions gets a pair of conditions for the upper and lower bounds, respectively.

The problem of binary range search is to use the higher-bound range $S^{(high)}$ and the lower-bound range $S^{(low)}$ to derive $S^{(mid)}$. When all of these ranges are secured, the problem is transformed to (1) deriving $\Theta_i^{(mid)}$ from $\Theta_i^{(high)}$ and $\Theta_i^{(low)}$; and (2) deriving $MBR^{(mid)}$ from $MBR^{(high)}$ and $MBR^{(low)}$. The following discussion first focuses on the simplified RASP version without the OPE component, and is then extended to include the OPE component.

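We show that

Proposition 3: $(\Theta_i^{(high)} + \Theta_i^{(low)})/2 = \Theta_i^{(mid)}$.

Proof: Remember that $\Theta_i$ for $X_i < c_i$ can be represented as $(a_i - c_i a_{d+1})^T (a_{d+2} - v_0 a_{d+1})$, where $a_i$ is the $i$-th row of the matrix $A^{-1}$ and the second factor comes from $v - v_0 = q^T A^{-1} u$. Let the conditions be $X_i < h$, $X_i < l$, and $X_i < (h+l)/2$ for the high, low, and middle bounds, correspondingly. Since $\Theta_i$ is linear in the bound $c_i$, $(\Theta_i^{(high)} + \Theta_i^{(low)})/2 = (a_i - ((h + l)/2) a_{d+1})^T (a_{d+2} - v_0 a_{d+1})$, which is $\Theta_i^{(mid)}$.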
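Proposition 3 can be checked with exact rational arithmetic. In this sketch the rows of $A^{-1}$ are replaced by arbitrary random integer vectors, which is an assumption made for brevity; the identity only relies on $\Theta_i$ being linear in the bound, so it holds for any second factor that does not depend on the bound. The helper name `theta` is hypothetical.

```python
from fractions import Fraction
import random

def theta(rows, i, bound, v0):
    """Outer product (a_i - bound * a_{d+1})^T (a_{d+2} - v0 * a_{d+1})."""
    a_i, a_hom, a_noise = rows[i], rows[-2], rows[-1]   # rows i, d+1, d+2
    left = [x - bound * y for x, y in zip(a_i, a_hom)]
    right = [x - v0 * y for x, y in zip(a_noise, a_hom)]
    return [[l * r for r in right] for l in left]

rng = random.Random(5)
d = 3
dim = d + 2
# Random vectors standing in for the rows of A^{-1}.
rows = [[Fraction(rng.randint(-9, 9)) for _ in range(dim)] for _ in range(dim)]
v0, h, l = Fraction(2), Fraction(8), Fraction(3)

theta_high = theta(rows, 0, h, v0)
theta_low = theta(rows, 0, l, v0)
theta_mid = theta(rows, 0, (h + l) / 2, v0)
averaged = [[(a + b) / 2 for a, b in zip(row_h, row_l)]
            for row_h, row_l in zip(theta_high, theta_low)]
print(averaged == theta_mid)
```

The server can therefore obtain the middle condition by averaging the two matrices it already holds, without learning `h`, `l`, or the rows of $A^{-1}$.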

As we have mentioned, the MBR of an arbitrary polyhedron can be derived based on the vertices of the polyhedron. A polyhedron is mapped to another polyhedron after the RASP perturbation. Concretely, let a polyhedron $P$ have $m$ vertices $\{x_1, \ldots, x_m\}$, which are mapped to the vertices in the perturbed space: $\{y_1, \ldots, y_m\}$. Then, the upper bound and lower bound of dimension $j$ of the MBR of the polyhedron in the perturbed space are determined by $\max\{y_{ij}, i = 1 \ldots m\}$ and $\min\{y_{ij}, i = 1 \ldots m\}$, respectively.

Let the $j$-th dimension of $MBR^{(L)}$ be represented as $[s^{(L)}_{j,min}, s^{(L)}_{j,max}]$, where $s^{(L)}_{j,min} = \min\{y^{(L)}_{ij}, i = 1 \ldots m\}$ and $s^{(L)}_{j,max} = \max\{y^{(L)}_{ij}, i = 1 \ldots m\}$. Now we choose $MBR^{(MID)}$ as follows: for the $j$-th dimension we use $[(s^{(low)}_{j,min} + s^{(high)}_{j,min})/2, (s^{(low)}_{j,max} + s^{(high)}_{j,max})/2]$. We show that

Proposition 4: $MBR^{(MID)}$ encloses $MBR^{(mid)}$.

The details of the proof can be found in the Appendix. Because the MBR is only used for the first stage of range query processing, a slightly larger MBR still encloses the polyhedron, which guarantees the correctness of the two-stage range query processing.
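The enclosure claimed by Proposition 4 can be checked numerically in a simplified setting. The sketch below assumes a plain linear map with a hypothetical random integer matrix `A` and concentric square ranges, without OPE or the extended dimensions; in this reduced setting the two MBRs can even coincide, so the check only confirms that the averaged bounds are never too small.

```python
from itertools import product
import random

def mbr(vertices):
    dims = range(len(vertices[0]))
    return ([min(v[j] for v in vertices) for j in dims],
            [max(v[j] for v in vertices) for j in dims])

def square_vertices(center, length):
    half = length / 2
    return [tuple(c + s * half for c, s in zip(center, signs))
            for signs in product((-1, 1), repeat=len(center))]

def transform(matrix, point):
    return tuple(sum(a * b for a, b in zip(row, point)) for row in matrix)

rng = random.Random(11)
d = 3
A = [[rng.randint(-5, 5) for _ in range(d)] for _ in range(d)]   # hypothetical matrix
center = (4, -2, 7)
low_len, high_len = 2, 10
mid_len = (low_len + high_len) / 2

def perturbed_mbr(length):
    return mbr([transform(A, v) for v in square_vertices(center, length)])

low_min, low_max = perturbed_mbr(low_len)
high_min, high_max = perturbed_mbr(high_len)
mid_min, mid_max = perturbed_mbr(mid_len)                    # MBR^(mid), from vertices
MID_min = [(a + b) / 2 for a, b in zip(low_min, high_min)]   # MBR^(MID), averaged bounds
MID_max = [(a + b) / 2 for a, b in zip(low_max, high_max)]
encloses = all(MID_min[j] <= mid_min[j] and mid_max[j] <= MID_max[j]
               for j in range(d))
print(encloses)
```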

Including the OPE component. The results on $\Theta_i^{(mid)}$ and $MBR^{(MID)}$ can be extended to the RASP scheme with the OPE component. However, due to the introduction of the order preserving function $f_i()$, the middle point may not be strictly the middle point, but somewhere between the higher bound and the lower bound. We use “between” (btw) to denote it.

Specifically, let $X_i < h$ and $X_i < l$ be the corresponding conditions for the higher and lower bounds, and let the condition for the “between” bound be $X_i < b$, where $b$ satisfies $f_i(b) = (f_i(h) + f_i(l))/2$. According to the OPE property, we have $l < b < h$, i.e., the corresponding range is still between the lower range and the higher range. Therefore, the same binary search algorithm can still be applied, according to Corollary 1. The server can also derive $(\Theta_i^{(high)} + \Theta_i^{(low)})/2 = (a_i - ((f_i(h) + f_i(l))/2) a_{d+1})^T (a_{d+2} - v_0 a_{d+1}) = \Theta_i^{(btw)}$, a result similar to Proposition 3.

Similarly, we define $MBR^{(BTW)}$ with $f_i(s^{(BTW)}_{i,max}) = (f_i(s^{(low)}_{i,max}) + f_i(s^{(high)}_{i,max}))/2$ and $f_i(s^{(BTW)}_{i,min}) = (f_i(s^{(low)}_{i,min}) + f_i(s^{(high)}_{i,min}))/2$, while $MBR^{(btw)}$ is defined based on the vertices to be consistent with $\Theta_i^{(btw)}$. Because the relationships in Eq. 6 and 7 in the Appendix still hold with the OPE transformation $f_i()$, we can prove that $MBR^{(BTW)}$ also encloses $MBR^{(btw)}$. Due to space limitations, we skip the details.

5.4 Defining Initial Bounds

The complexity of the $(k, \delta)$-range algorithm is determined by the initial bounds provided by the client. Thus, it is important to provide compact ones to help the server process queries more efficiently. The initial lower bound is defined as the query point. For $q(q_1, \ldots, q_d)$, the dimensional bounds are simply $q_j \le X_j \le q_j$.

The higher bounds can be defined in multiple ways. (1) Applications often have a user-specified interest bound, for example, returning the nearest gas station within 5 miles, which can be used to define the higher bound. (2) We can also use a center-distance based bound setting. Let the query point have a distance $\gamma$ to the distribution center; as we always work on normalized distributions, the center is $(0, \ldots, 0)$. The upper bound is defined as $q_j - \epsilon\gamma \le X_j \le q_j + \epsilon\gamma$, where $\epsilon \in (0, 1]$ defines the level of conservativity. (3) If it is really expected to include all candidate kNN regardless of how distant they are, we can include a rough density map (a multidimensional histogram) for quickly identifying the appropriate higher bound. However, this method works best for low-dimensional data, as the number of bins increases exponentially with the number of dimensions. In experiments, we simply use method (1) and 5% of the domain length for the extension.
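The center-distance based setting (2) amounts to a few lines of arithmetic. The sketch below assumes that the distance $\gamma$ is Euclidean, which the text does not state explicitly; the function name `center_distance_bound` is hypothetical.

```python
import math

def center_distance_bound(query, epsilon):
    """Per-dimension bounds q_j - eps*gamma <= X_j <= q_j + eps*gamma."""
    if not 0 < epsilon <= 1:
        raise ValueError("epsilon must lie in (0, 1]")
    # Distance from the query point to the distribution center (0, ..., 0);
    # Euclidean distance is an assumption of this sketch.
    gamma = math.sqrt(sum(q * q for q in query))
    return [(q - epsilon * gamma, q + epsilon * gamma) for q in query]

bounds = center_distance_bound((3.0, 4.0), 0.5)
print(bounds)
```

For the query point (3, 4), $\gamma = 5$, so with $\epsilon = 0.5$ each dimension is extended by 2.5 on both sides.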

5.5 Security of kNN Queries

As all kNN queries are completely transformed to range queries, the security of kNN queries is equivalent to the security of range queries. According to the previous discussion in Section 4.2, the transformed range queries are secure under the assumptions. Therefore, the kNN queries are also secure. Detailed proofs are skipped due to space limitations.

6 EXPERIMENTS

In this section, we present four sets of experimental results to investigate the following questions, correspondingly. (1) How expensive is the RASP perturbation? (2) How resilient is the OPE-enhanced RASP to the ICA-based attack? (3) How efficient is the two-stage range query processing? (4) How efficient is the kNN-R query processing and what are the advantages?

6.1 Datasets

Three datasets are used in experiments. (1) A synthetic dataset that draws samples from a uniform distribution in the range [0, 1]. (2) The Adult dataset from the UCI machine learning database (see footnote 5). We assign numeric values to the categorical values using a simple one-to-one mapping scheme, as described in Section 3. (3) The 2-dimensional NorthEast location data from rtreeportal.org.

5. http://archive.ics.uci.edu/ml/

Fig. 6. The cost distribution of the full RASP scheme. Data: Adult (20K records, 5-9 dimensions).

Fig. 7. Randomly generated matrix A and the progressive resilience to ICA attack. Data: Adult (10 dimensions, 10K records).

6.2 Cost of RASP Perturbation

In this experiment, we study the costs of the components in the RASP perturbation. The major costs can be divided into two parts: the OPE and the rest of RASP. We implement a simple OPE scheme [1] by mapping original column distributions to normal distributions. The OPE algorithm partitions the target distribution into buckets. Then, the sorted original values are proportionally partitioned according to the target bucket distribution to create the buckets for the original distribution. With the aligned original and target buckets, an original value can be mapped to the target bucket and appropriately scaled. Therefore, the encryption cost mainly comes from the bucket search procedure (proportional to $\log D$, where $D$ is the number of buckets). Figure 6 shows the cost distributions for 20K records at different numbers of dimensions. The dimensionality has slight effects on the cost of RASP perturbation. Overall, the cost of processing 20K records is only around 0.1 second.
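The bucket-based mapping described above can be sketched as follows. This is a simplified illustration and not the scheme of [1]: it uses equal-probability buckets, clips the normal tails at an assumed quantile `tail`, interpolates linearly inside each bucket, and provides no keyed secrecy. The function name `build_ope` and its parameters are hypothetical.

```python
import bisect
import random
from statistics import NormalDist

def build_ope(values, num_buckets=64, tail=1e-3):
    """Return an order-preserving map from the value domain to a normal-shaped target."""
    data = sorted(values)
    target = NormalDist(0.0, 1.0)
    # Equal-probability bucket boundaries in the target distribution; the
    # unbounded tails are clipped at the `tail` and `1 - tail` quantiles.
    probs = [tail + (1 - 2 * tail) * i / num_buckets for i in range(num_buckets + 1)]
    target_edges = [target.inv_cdf(p) for p in probs]
    # Matching boundaries in the original data: empirical quantiles.
    source_edges = [data[round(i * (len(data) - 1) / num_buckets)]
                    for i in range(num_buckets + 1)]

    def encrypt(x):
        # Bucket search by bisection costs O(log D) for D buckets.
        b = bisect.bisect_right(source_edges, x) - 1
        b = min(max(b, 0), num_buckets - 1)
        lo, hi = source_edges[b], source_edges[b + 1]
        frac = 0.0 if hi == lo else (x - lo) / (hi - lo)
        # Scale the position inside the source bucket into the target bucket.
        return target_edges[b] + frac * (target_edges[b + 1] - target_edges[b])

    return encrypt

rng = random.Random(42)
column = [rng.uniform(0, 100) for _ in range(1000)]
encrypt = build_ope(column)
encrypted = [encrypt(x) for x in sorted(column)]
print(encrypted[0], encrypted[-1])
```

Because each bucket is mapped monotonically onto its target bucket, the order of the values is preserved while their distribution is reshaped towards a normal distribution.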

6.3 Resilience to ICA Attack

We have discussed the methods for countering the ICA distributional attack on the perturbed data. In this set of experiments, we evaluate how resilient the RASP perturbation is to the distributional attack.

Results. We simulate the ICA attack for randomly chosen matrices A. The data used in the experiment is the 10-dimensional Adult data with 10K records. Figure 7 shows the progressive results over a number of randomly chosen matrices A. The x-axis represents the total number of rounds for randomly choosing the matrix A; the y-axis represents the minimum dimensional NR MSE among all dimensions. Without OPE, the label “Best-without-OPE” represents the most resilient A at round i, “Worst-without-OPE” represents the A of the weakest resilience, and “Average-without-OPE” is the average quality of the generated A matrices for i rounds. We see that the best case is already close to the upper bound 0.7 (Section 3.3). With the OPE component, the worst case can also be significantly improved.

6.4 Performance of Two-stage Range Query Processing

In this set of experiments, we study the performance aspects of polyhedron-based range query processing. We use the two-stage processing strategy described in Section 4, and explore the additional cost incurred by this processing strategy. We implement the two-stage query processing based on an R*tree implementation provided by Dr. Hadjieleftheriou at AT&T Lab (see footnote 6). The block size is 4KB and we allow each block to contain only 20 entries to mimic a large database with many disk blocks. Samples from the original databases in different sizes (10,000-50,000 records, i.e., 500-2500 data blocks) are perturbed and indexed for query processing. Another set of indices is also built on the original data for the performance comparison with non-perturbed query processing. We use the number of disk block accesses, including index blocks and data blocks, to assess the performance, to avoid the possible variation caused by other parts of the computer system. In addition, we also show the wall-clock time for some results.

6. http://www2.research.att.com/ marioh/spatialindex/

Recall the two-stage processing strategy: using the MBR to search the indexing tree, and filtering the returned result with the secured query in quadratic form. We study the performance of the first stage by comparing it to two additional methods: (1) the original queries with the index built on the original data, which is used to identify how much additional cost is paid for querying the MBR of the transformed query; (2) the linear scan approach, which is the worst-case cost. Range queries are generated randomly within the domain of the datasets, and then transformed with the method described in Section 4. We also control the range of the queries to be [10%, 20%, 30%, 40%, 50%] of the total range of the domain, to observe the effect of the scale of the range on the performance of query processing.

Results. The first pair of figures (the left subfigures of Figures 8 and 9) shows the number of block accesses for 10,000 queries on different sizes of data with different query processing methods. For clear presentation, we use log10(# of block accesses) as the y-axis. The cost of linear scan is simply the number of blocks for storing the whole dataset. The data dimensionality is fixed to 5 and the query range is set to 30% of the whole domain. The first stage with the MBR for the polyhedron has a cost much lower than the linear scan method and only moderately higher than R*tree processing on the original data. Interestingly, different distributions of data result in slightly different patterns. The costs of R*tree on transformed queries are very close to those of original queries for Adult data, while the gap is larger on uniform data. The costs over different dimensions and different query ranges show similar patterns.

TABLE 1
Wall clock cost distribution (milliseconds) and comparison.

Dataset     Linear Scan  R*Tree-Orig  PrepQ  Stage-1  Stage-2  RPQ    Purity
Uniform5D   21.12        0.27         0.007  4.19     0.01     51.92  7.76%
Adult5D     16.28        0.39         0.007  1.9      0.01     5.12   1.17%

We also studied the cost of the second stage. We use “PrepQ” to represent the client-side cost of transforming queries, “purity” to represent the rate (final result count)/(1st stage result count), and records per query (“RPQ”) to represent the average number of records per query for the first stage results. The quadratic filtering conditions are used in experiments. Table 1 compares the average wall-clock time (milliseconds) per query for the two stages, the RPQ values for stage 1, and the purity of the stage-1 result. The tests are run with the setting of 10K queries, 20K records, 30% dimensional query range and 5 dimensions. Since the 2nd stage is done in memory, its cost is much lower than the 1st-stage cost. Overall, the two-stage processing is much faster than linear scan and comparable to the original R*Tree processing.

6.5 Performance of kNN-R Query Processing

In this set of experiments, we investigate several aspects of kNN query processing. (1) We study the cost of the $(k, \delta)$-range algorithm, which mainly contributes to the server-side cost. (2) We show the overall cost distribution over the cloud side and the proxy server. (3) We show the advantages of kNN-R over another popular approach: the Casper approach [24] for privacy-preserving kNN search.

$(k, \delta)$-Range Algorithm. In this set of experiments, we want to understand how the setting of the $\delta$ parameter affects the performance and the result precision. Figure 10 shows the effect of the $\delta$ setting on the $(k, \delta)$-range algorithm. Both datasets are two-dimensional data. As $\delta$ becomes larger, both the precision and the number of rounds needed to reach the $\delta$ condition decrease. Note that each round corresponds to one server-side range query. The choice of $\delta$ represents a tradeoff between the precision and the performance.

Fig. 10. Performance and result precision for different δ settings of the (k, δ)-range algorithm for 2-dimensional data.

Fig. 11. Precision reduction with more dimensions.

As we have discussed, the major weakness of the kNN-R algorithm is the precision reduction with increased dimensionality. When the dimensionality increases, the precision can significantly drop, which will increase the cost of post-processing on the client side. Figure 11 shows this phenomenon with the real Adult data and the simulated uniform data. However, compared to the overall cost, the client-side cost increase is still acceptable. We will show the comparison next.

Overall Costs. Many secure approaches cannot use indices for query processing, which results in poor performance. For example, the secure dot-product approach [33] encodes the points with random projections and recovers dot-products in query processing for distance comparison. This way of encoding data disallows index-based query processing. Without the aid of indices, processing a kNN query has to scan the entire database, leaving many optimizations impossible to implement.

Fig. 8. Performance comparison on Uniform data (y-axis: Log10 of block accesses; series: R*Tree Original, R*Tree Transformed, Linear Scan). Left: data size (number of records, in thousands) vs. cost of query; Middle: data dimensionality vs. cost of query; Right: query range (percentage of the domain) vs. cost of query.

Fig. 9. Performance comparison on Adult data (y-axis: Log10 of block accesses; series: R*Tree Original, R*Tree Transformed, Linear Scan). Left: data size (number of records, in thousands) vs. cost of query; Middle: data dimensionality vs. cost of query; Right: query range (percentage of the domain) vs. cost of query.

One concern with the kNN-R approach is the workload on the proxy server. Unlike in range query processing, the proxy server needs to filter the points returned by the server to find the final kNN. A reduced precision due to increased dimensionality implies an increased burden on the proxy server. We need to show how significant this proxy cost is.

We use a database of 100 thousand data points and 1000 randomly selected queries for the 1NN experiment. The wall clock time (milliseconds) is used to show the average cost per query in Table 2. We also list the cost of the secure dot-product method [33] for comparison. Table 2 shows that the proxy server incurs a negligible pre-processing cost and a very small post-processing cost, even with the reduced precision in the 5D datasets. We use 5% of the domain length to extend the query point to form the initial higher bound. Compared to the dot-product method, the user-specified higher bound setting can cut off uninteresting regions, giving a significant performance gain for sparse or skewed datasets, such as Adult5D. This cut-off effect cannot be implemented with the dot-product method. Furthermore, even for dense cases like the 2D datasets, the overall cost is only about half of that of the dot-product method.
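The "about half" and "significant performance gain" statements can be checked directly against the numbers in Table 2. The short sketch below sums the three kNN-R components per row and divides by the Linear Scan column; it assumes that column is the baseline the text compares against, and the names `TABLE2`, `knn_r_total`, and `cost_ratio` are illustrative only.

```python
# Per-query costs in milliseconds, copied from Table 2.
# Tuple layout: (linear scan, pre-processing, server cost, post-processing).
TABLE2 = {
    "Uniform2D": (27.37, 0.01, 13.54, 0.04),
    "Adult2D": (26.09, 0.01, 14.48, 0.06),
    "Uniform5D": (33.03, 0.01, 13.79, 0.34),
    "Adult5D": (31.96, 0.01, 2.56, 0.05),
}


def knn_r_total(row):
    """Sum of the three kNN-R cost components for one table row."""
    _, pre, server, post = row
    return pre + server + post


def cost_ratio(row):
    """kNN-R total cost divided by the linear-scan cost of the same row."""
    return knn_r_total(row) / row[0]


if __name__ == "__main__":
    for name, row in TABLE2.items():
        print(f"{name}: kNN-R total = {knn_r_total(row):.2f} ms, "
              f"ratio to linear scan = {cost_ratio(row):.2f}")
```

Under this reading, the 2D rows come out at roughly half of the baseline, while the sparse Adult5D row is below one tenth of it.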

Comparing kNN-R with the Casper Approach. In this set of experiments, we compare our approach and the Casper approach with a focus on the tradeoff between the data confidentiality and the query result precision (which indicates the workload of the in-house proxy). Based on the description in the paper [24], we implement the 1NN query processing algorithm for the experiment.

Data & setting     Linear Scan   Pre-processing   Server Cost   Post-processing
Uniform2D/kNN-R    27.37         0.01             13.54         0.04
Adult2D/kNN-R      26.09         0.01             14.48         0.06
Uniform5D/kNN-R    33.03         0.01             13.79         0.34
Adult5D/kNN-R      31.96         0.01             2.56          0.05

TABLE 2. Per-query performance comparison (milliseconds) between linear scan on the original non-perturbed data and index-aided kNN-R processing on perturbed data.

The Casper approach uses cloaking boxes to hide both the original data points in the database and the query points. It can also use the index to process kNN queries. The confidentiality of data in Casper is solely defined by the size of the cloaking box. Roughly speaking, the actual point is equally likely to be anywhere in the cloaking box. However, the size of the cloaking box also directly affects the precision of query results. Thus, the decision on the box size represents a tradeoff between the precision of query results and the data confidentiality.

For clear presentation, we assume that each dimension has the same domain length h and that each cloaking box is a square with edge length e. Assume the whole domain also has a uniform distribution. According to the variance of the uniform distribution, the NR MSE measure is $\sqrt{6}\,e/(3h)$. To achieve the protection of 10% of the domain length, we have $e \approx 0.12h$.
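The value of e quoted above follows from setting the stated measure equal to 0.1 and solving for e. The short calculation uses only the formula given in the text:

```latex
\frac{\sqrt{6}\,e}{3h} = 0.1
\;\Longrightarrow\;
e = \frac{0.3}{\sqrt{6}}\,h \approx 0.1225\,h \approx 0.12\,h .
```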

In Figure 12, the x-axis represents NR MSE, i.e., Casper's relative cloaking-edge length. It shows that when the edge length is increased from 2% to 10%, the precision drops dramatically from 62% to 13% for the 2D uniform data and from 43% to 10% for the 2D NE data, which shows the severe conflict between precision and confidentiality. The kNN-R results are also shown for comparison.

Fig. 12. The impact of cloaking-box size on precision for Casper for the NE data.

7 RELATED WORK

7.1 Protecting Outsourced Data

Order Preserving Encryption. Order preserving encryption (OPE) [1] preserves the dimensional value order after encryption. It can be described as a function y = F(x) such that, for all x_i and x_j, x_i < (>, =) x_j ⇔ y_i < (>, =) y_j. A well-known attack is based on the attacker's prior knowledge of the original distributions of the attributes. If the attacker knows the original distributions and manages to identify the mapping between the original attribute and its encrypted counterpart, a bucket-based distribution alignment can be performed to break the encryption for the attribute [6]. There are some applications of OPE in outsourced data processing. For example, Yiu et al. [21] use a hierarchical space division method to encode spatial data points, which preserves the order of dimensional values and thus is one kind of OPE.
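The order-preserving property and the reason distribution knowledge is dangerous can be illustrated with a toy example. The sketch below is not the scheme of [1]: `make_toy_ope` is a hypothetical strictly increasing lookup table, and `rank_alignment_attack` is a simplified stand-in for the bucket-based alignment mentioned above, under the idealized assumption that the attacker knows the exact multiset of original values.

```python
import random
from bisect import bisect_left


def make_toy_ope(domain_size, seed=0):
    """Build a strictly increasing random table: plaintext i -> table[i].

    Toy construction for illustration only, not a real OPE scheme.
    """
    rng = random.Random(seed)
    table, current = [], 0
    for _ in range(domain_size):
        current += rng.randint(1, 100)  # positive gap keeps the map strictly increasing
        table.append(current)
    return table


def encrypt(table, x):
    """Encrypt an integer plaintext x in range(len(table))."""
    return table[x]


def rank_alignment_attack(ciphertexts, known_sorted_plaintexts):
    """Guess each plaintext by matching its rank among the ciphertexts
    to the same rank among the known (sorted) original values."""
    ordered = sorted(ciphertexts)
    return [known_sorted_plaintexts[bisect_left(ordered, c)] for c in ciphertexts]


if __name__ == "__main__":
    table = make_toy_ope(50)
    data = [17, 3, 42, 17, 8]
    enc = [encrypt(table, x) for x in data]
    print("ciphertexts:", enc)
    print("recovered:  ", rank_alignment_attack(enc, sorted(data)))
```

Because the mapping keeps ranks intact, the rank of a ciphertext alone reveals which known value it corresponds to; no key is needed.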

Crypto-Index. Crypto-Index is also based on column-wise bucketization. It assigns a random ID to each bucket; the values in the bucket are replaced with the bucket ID to generate the auxiliary data for indexing. To utilize the index for query processing, a normal range query condition has to be transformed to a set-based query on the bucket IDs. For example, X_i < a_i might be replaced with X'_i ∈ [ID_1, ID_2, ID_3]. A bucket-diffusion scheme [14] was proposed to protect the access pattern, which, however, has to sacrifice the precision of query results, and thus increases the client's cost of filtering the query result.
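The bucketization idea and the resulting client-side filtering can be sketched as follows. This is a minimal simulation under assumptions not stated in the paper: equal-width buckets, integer IDs drawn at random, and hypothetical helper names (`build_buckets`, `transform_less_than`, `client_query`). In a real deployment the server would store only the bucket IDs next to encrypted records; the simulation keeps plaintext values only to make the false positives visible.

```python
import random


def build_buckets(lo, hi, n_buckets, seed=0):
    """Split [lo, hi) into equal-width buckets and give each a random ID.

    Returns a list of (bucket_lo, bucket_hi, bucket_id) tuples.
    """
    rng = random.Random(seed)
    ids = rng.sample(range(10_000), n_buckets)  # random IDs that carry no order
    width = (hi - lo) / n_buckets
    return [(lo + i * width, lo + (i + 1) * width, ids[i]) for i in range(n_buckets)]


def bucket_id(buckets, x):
    """ID that replaces the value x in the auxiliary index data."""
    for b_lo, b_hi, b_id in buckets:
        if b_lo <= x < b_hi:
            return b_id
    raise ValueError("value outside the bucketized domain")


def transform_less_than(buckets, a):
    """Rewrite the condition X < a as a set of bucket IDs.

    Every bucket overlapping (-inf, a) is included, so the boundary bucket
    may contribute false positives.
    """
    return {b_id for b_lo, _, b_id in buckets if b_lo < a}


def client_query(buckets, values, a):
    """Return (server candidates, exact answers after client-side filtering)."""
    wanted = transform_less_than(buckets, a)
    candidates = [x for x in values if bucket_id(buckets, x) in wanted]
    return candidates, [x for x in candidates if x < a]


if __name__ == "__main__":
    buckets = build_buckets(0, 100, 10)
    print(client_query(buckets, [5, 12, 27, 29, 31, 64, 99], 28))
```

The gap between the candidate list and the exact answer is the filtering cost that the client has to absorb; coarser or diffused buckets widen that gap.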

Distance-Recoverable Encryption. Distance-recoverable encryption (DRE) is the most intuitive method for preserving the nearest neighbor relationship. Because distances are exactly preserved, many attacks can be applied [33], [20], [8]. Wong et al. [33] suggest preserving dot products instead of distances to find kNN, which is more resilient to distance-targeted attacks. One drawback is that the search algorithm is limited to linear scan and no indexing method can be applied.

7.2 Preserving Query Privacy

Private information retrieval (PIR) [9] tries to fully preserve the privacy of the access pattern, while the data may not be encrypted. PIR schemes are normally very costly. Focusing on the efficiency side of PIR, Williams et al. [32] use a pyramid hash index to implement efficient privacy-preserving data-block operations based on the idea of Oblivious RAM. It is different from our setting of high-throughput range query processing.

Hu et al. [15] address the query privacy problem and require the authorized query users, the data owner, and the cloud to collaboratively process kNN queries. However, most computing tasks are done in the user's local system with heavy interactions with the cloud server. The cloud server only aids query processing, which does not meet the principle of moving computing to the cloud.

Papadopoulos et al. [26] use private information retrieval methods [9] to enhance location privacy. However, their approach does not consider protecting the confidentiality of data. SpaceTwist [35] proposes a method to query kNN by providing a fake user location for preserving location privacy, but the method does not consider data confidentiality either. The Casper approach [24] considers both data confidentiality and query privacy; its details have been discussed in our experiments.

7.3 Other Related Work

Another line of research [29] enables authorized users to access only the authorized portion of data, e.g., a certain range, with a public key scheme. However, the underlying encryption schemes do not produce indexable encrypted data. The setting of multidimensional range query in [29] is different from ours. Their approach requires that the data owner provides the indices and keys for the server, and authorized users use the data in the server. In the cloud database scenario, by contrast, the cloud server takes more responsibility for indexing and query processing. Secure keyword search on encrypted documents [10], [31], [5] scans each encrypted document in the database and finds the documents containing the keyword, which is more like point search in a database. The research on privacy-preserving data mining has discussed multiplicative perturbation methods [7], which are similar to the RASP encryption, but with more emphasis on preserving the utility for data mining.

8 CONCLUSION

We propose the RASP perturbation approach to hosting query services in the cloud, which satisfies the CPEL criteria: data Confidentiality, query Privacy, Efficient query processing, and Low in-house workload. The requirement of low in-house workload is a critical feature to fully realize the benefits of cloud computing, and efficient query processing is a key measure of the quality of query services.

RASP perturbation is a unique composition of OPE, dimensionality expansion, random noise injection, and random projection, which provides unique security features. It aims to preserve the topology of the queried range in the perturbed space, and allows indices to be used for efficient range query processing. With the topology-preserving features, we are able to develop efficient range query services that achieve sublinear time complexity in processing queries. We then develop the kNN query service based on the range query service. The security of both the perturbed data and the protected queries is carefully analyzed under a precisely defined threat model. We also conduct several sets of experiments to show the efficiency of query processing and the low cost of in-house processing.

We will continue our studies on two aspects: (1) further improve the performance of query processing for both range queries and kNN queries; (2) formally analyze the leaked query and access patterns and the possible effect on both data and query confidentiality.

REFERENCES

[1] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu, “Order preserving encryption for numeric data,” in Proceedings of ACM SIGMOD Conference, 2004.

[2] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, “Above the clouds: A Berkeley view of cloud computing,” Technical Report, University of California, Berkeley, 2009.

[3] J. Bau and J. C. Mitchell, “Security modeling and analysis,” IEEE Security and Privacy, vol. 9, no. 3, pp. 18–25, 2011.

[4] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.

[5] N. Cao, C. Wang, M. Li, K. Ren, and W. Lou, “Privacy-preserving multi-keyword ranked search over encrypted cloud data,” in INFOCOM, 2011.

[6] K. Chen, R. Kavuluru, and S. Guo, “RASP: Efficient multidimensional range query on attack-resilient encrypted databases,” in ACM Conference on Data and Application Security and Privacy, 2011, pp. 249–260.

[7] K. Chen and L. Liu, “Geometric data perturbation for outsourced data mining,” Knowledge and Information Systems, 2011.

[8] K. Chen, L. Liu, and G. Sun, “Towards attack-resilient geometric data perturbation,” in SIAM Data Mining Conference, 2007.

[9] B. Chor, E. Kushilevitz, O. Goldreich, and M. Sudan, “Private information retrieval,” ACM Computer Survey, vol. 45, no. 6, pp. 965–981, 1998.

[10] R. Curtmola, J. Garay, S. Kamara, and R. Ostrovsky, “Searchable symmetric encryption: improved definitions and efficient constructions,” in Proceedings of the 13th ACM Conference on Computer and Communications Security. New York, NY, USA: ACM, 2006, pp. 79–88.

[11] N. R. Draper and H. Smith, Applied Regression Analysis. Wiley, 1998.

[12] H. Hacigumus, B. Iyer, C. Li, and S. Mehrotra, “Executing SQL over encrypted data in the database-service-provider model,” in Proceedings of ACM SIGMOD Conference, 2002.

[13] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. Springer-Verlag, 2001.

[14] B. Hore, S. Mehrotra, and G. Tsudik, “A privacy-preserving index for range queries,” in Proceedings of Very Large Databases Conference (VLDB), 2004.

[15] H. Hu, J. Xu, C. Ren, and B. Choi, “Processing private queries over untrusted data cloud through privacy homomorphism,” Proceedings of IEEE International Conference on Data Engineering (ICDE), pp. 601–612, 2011.

[16] Z. Huang, W. Du, and B. Chen, “Deriving private information from randomized data,” in Proceedings of ACM SIGMOD Conference, 2005.

[17] A. Hyvarinen, J. Karhunen, and E. Oja, Independent Component Analysis. Wiley, 2001.

[18] I. T. Jolliffe, Principal Component Analysis. Springer, 1986.

[19] F. Li, M. Hadjieleftheriou, G. Kollios, and L. Reyzin, “Dynamic authenticated index structures for outsourced databases,” in Proceedings of ACM SIGMOD Conference, 2006.

[20] K. Liu, C. Giannella, and H. Kargupta, “An attacker’s view of distance preserving maps for privacy preserving data mining,” in Proceedings of PKDD, Berlin, Germany, September 2006.

[21] M. L. Yiu, G. Ghinita, C. S. Jensen, and P. Kalnis, “Enabling search services on outsourced private spatial data,” The International Journal on Very Large Data Bases, vol. 19, no. 3, 2010.

[22] Y. Manolopoulos, A. Nanopoulos, A. Papadopoulos, and Y. Theodoridis, R-trees: Theory and Applications. Springer-Verlag, 2005.

[23] R. Marimont and M. Shapiro, “Nearest neighbour searches and the curse of dimensionality,” Journal of the Institute of Mathematics and its Applications, vol. 24, pp. 59–70, 1979.

[24] M. F. Mokbel, C.-Y. Chow, and W. G. Aref, “The new Casper: Query processing for location services without compromising privacy,” in Proceedings of Very Large Databases Conference (VLDB), 2006, pp. 763–774.

[25] P. Paillier, “Public-key cryptosystems based on composite degree residuosity classes,” in EUROCRYPT. Springer-Verlag, 1999, pp. 223–238.

[26] S. Papadopoulos, S. Bakiras, and D. Papadias, “Nearest neighbor search with strong location privacy,” in Proceedings of Very Large Databases Conference (VLDB), 2010.

[27] F. P. Preparata and M. I. Shamos, Computational Geometry: An Introduction. Springer-Verlag, 1985.

[28] M. Rudelson and R. Vershynin, “Smallest singular value of a random rectangular matrix,” Communications on Pure and Applied Mathematics, vol. 62, pp. 1707–1739, 2009.

[29] E. Shi, J. Bethencourt, T.-H. H. Chan, D. Song, and A. Perrig, “Multi-dimensional range query over encrypted data,” in IEEE Symposium on Security and Privacy, 2007.

[30] R. Sion, “Query execution assurance for outsourced databases,” in Proceedings of Very Large Databases Conference (VLDB), 2005.

[31] C. Wang, N. Cao, J. Li, K. Ren, and W. Lou, “Secure ranked keyword search over encrypted cloud data,” in Proceedings of IEEE International Conference on Distributed Computing Systems (ICDCS), 2010.

[32] P. Williams, R. Sion, and B. Carbunar, “Building castles out of mud: Practical access pattern privacy and correctness on untrusted storage,” in ACM Conference on Computer and Communications Security, 2008.

[33] W. K. Wong, D. W.-l. Cheung, B. Kao, and N. Mamoulis, “Secure kNN computation on encrypted databases,” in Proceedings of ACM SIGMOD Conference. New York, NY, USA: ACM, 2009, pp. 139–152.

[34] M. Xie, H. Wang, J. Yin, and X. Meng, “Integrity auditing of outsourced data,” in Proceedings of Very Large Databases Conference (VLDB), 2007, pp. 782–793.

[35] M. L. Yiu, C. S. Jensen, X. Huang, and H. Lu, “SpaceTwist: Managing the trade-offs among location privacy, query performance, and query accuracy in mobile services,” in Proceedings of IEEE International Conference on Data Engineering (ICDE), Washington, DC, USA, 2008, pp. 366–375.


Huiqi Xu is a PhD student in the Distributed Computing Systems group at the University of Minnesota, Twin Cities. He obtained his Master's degree in Computer Science from Wright State University in June 2012 and his Bachelor's degree in Computer Science from Chongqing University in June 2009. His research interests include privacy-aware computing and cloud computing.

Shumin Guo is currently a PhD student in the Department of Computer Science and Engineering, and a member of the Data Intensive Analysis and Computing (DIAC) Lab, at Wright State University, Dayton, OH, USA. He received his Master's degree in Electronics Engineering from Xidian University, Xi'an, China, in 2008. His current research interests are privacy-preserving data mining, social network analysis, and cloud computing.

Keke Chen is an assistant professor in the Department of Computer Science and Engineering, and a member of the Ohio Center of Excellence in Knowledge Enabled Computing (the Kno.e.sis Center), at Wright State University. He directs the Data Intensive Analysis and Computing (DIAC) Lab at the Kno.e.sis Center. He earned his PhD degree from Georgia Institute of Technology in 2006, his Master's degree from Zhejiang University in China in 1999, and his Bachelor's degree from Tongji University in China in 1996. All degrees are in Computer Science. His current research areas include visual exploration of big data, secure data services and mining of outsourced data, privacy of social computing, and cloud computing. During 2006-2008, he was a senior research scientist at Yahoo! Labs, working on web search ranking, cross-domain ranking, and web-scale data mining. He owns three patents for his work at Yahoo!.

9 APPENDIX

9.1 Proofs.

Proving that RASP is not OPE. Let $y = (E_{ope}(x)^T, 1, v)^T$; we only need to prove that $F(y) = Ay$ does not preserve the dimensional value order. Let $f^i$ be the selection vector $(0, \ldots, 1, \ldots, 0)$, i.e., only the i-th dimension is 1 and the other dimensions are 0. Then $(f^i)^T y$ returns the value at dimension i of y.

Proof: Let A be an invertible matrix with at least two non-zero entries in each row. For any vector y, let $y' = Ay$. For any two vectors s and t, using the dimensional selection vector $f^i$, we have $s'_i = (f^i)^T A s$ and $t'_i = (f^i)^T A t$. If the dimensional order is preserved, we will have $(s_i - t_i)(s'_i - t'_i) > 0$. However,

$$(s_i - t_i)(s'_i - t'_i) = (s_i - t_i)(f^i)^T A (s - t) = (s_i - t_i)\sum_{j=1}^{k} a_{i,j}(s_j - t_j), \tag{5}$$

where $a_{i,j}$ is the element in the i-th row and j-th column of A. Without loss of generality, let us assume $s_i > t_i$ (for $s_i < t_i$ the same proof applies). It is straightforward to see that the sign of $(s_i - t_i)(s'_i - t'_i)$ is subject to the values $s_j$ and $t_j$ in the other dimensions $j \neq i$. As a result, RASP does not preserve the dimensional order.
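A concrete counterexample makes the sign argument tangible. The sketch below uses a hypothetical 3×3 matrix A (invertible, two or more non-zero entries per row, no zero in the last column) and two hypothetical vectors of the form (x, 1, v) for one-dimensional data; none of these values come from the paper.

```python
def mat_vec(A, y):
    """Plain matrix-vector product for small lists of lists."""
    return [sum(a * v for a, v in zip(row, y)) for row in A]


def order_product(A, s, t, i):
    """(s_i - t_i) * (s'_i - t'_i); for s_i != t_i this is positive
    exactly when the order in dimension i survives the mapping y' = A y."""
    s_p, t_p = mat_vec(A, s), mat_vec(A, t)
    return (s[i] - t[i]) * (s_p[i] - t_p[i])


# Hypothetical matrix with determinant 2 (hence invertible).
A = [
    [1.0, 0.0, -2.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 1.0],
]

# Vectors of the form (x, 1, v): s has the larger x, but also the larger noise v.
s = [2.0, 1.0, 9.0]
t = [1.0, 1.0, 5.0]

if __name__ == "__main__":
    print("s' =", mat_vec(A, s))
    print("t' =", mat_vec(A, t))
    print("order product in dimension 0:", order_product(A, s, t, 0))
```

Here s is larger than t in dimension 0 before the transformation and smaller after it, because the term a_{0,2}(s_2 - t_2) outweighs a_{0,0}(s_0 - t_0) in the sum of Eq. (5).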

Proving that MBR(MID) encloses MBR(mid).

Proof: In general, the MBR of an arbitrary polyhedron can be derived based on the vertices of the polyhedron. Based on the convexity-preserving property of RASP, a polyhedron is mapped to another polyhedron in the encrypted space. Concretely, let a polyhedron P have m vertices $\{x_1, \ldots, x_m\}$, which are mapped to the vertices in the encrypted space: $\{y_1, \ldots, y_m\}$. Then, the upper bound and lower bound of dimension j of the MBR of the polyhedron in the encrypted space are determined by $\max\{y_{ij}, i = 1 \ldots m\}$ and $\min\{y_{ij}, i = 1 \ldots m\}$, respectively.

Since we only use the MBR to reduce the set of results for filtering, a slightly larger MBR would still guarantee the correctness of the MBR-based query processing algorithm, with possibly increased filtering cost. In the following, we try to find such an MBR to enclose MBR(mid). By the definition of the square ranges S(low), S(mid) and S(high), their vertices have the relationship $x_i^{(mid)} = (x_i^{(low)} + x_i^{(high)})/2$. The images of the vertices are denoted as $y_i^{(low)}$, $y_i^{(high)}$, and $y_i^{(mid)}$, respectively. Correspondingly, the MBR(mid) in the perturbed space should be found from $\{y_1^{(mid)}, \ldots, y_m^{(mid)}\}$, where $y_i^{(mid)} = A(x_i^{(mid)}, 1, v_i^{(mid)})^T$. Since $(y_i^{(low)} + y_i^{(high)})/2 = A(x_i^{(mid)}, 1, (v_i^{(low)} + v_i^{(high)})/2)^T$ and $(v_i^{(low)} + v_i^{(high)})/2$ is a valid positive random number, MBR(mid) can be determined with the vertices $\{(y_i^{(low)} + y_i^{(high)})/2\}$.


Let the j-th dimension of MBR(L) be represented as $[s_{j,min}^{(L)}, s_{j,max}^{(L)}]$, where $s_{j,min}^{(L)} = \min\{y_{ij}^{(L)}, i = 1 \ldots m\}$ and $s_{j,max}^{(L)} = \max\{y_{ij}^{(L)}, i = 1 \ldots m\}$. Now we choose the MBR(MID) as follows: for the j-th dimension we use $[(s_{j,min}^{(low)} + s_{j,min}^{(high)})/2,\ (s_{j,max}^{(low)} + s_{j,max}^{(high)})/2]$. We show that this MBR(MID) encloses MBR(mid). For two sets of m real values $\{a_1, \ldots, a_m\}$ and $\{b_1, \ldots, b_m\}$, it is easy to verify that

$$\max\{a_1, \ldots, a_m\} + \max\{b_1, \ldots, b_m\} \ge \max\{a_1 + b_1, \ldots, a_m + b_m\} \tag{6}$$

$$\min\{a_1, \ldots, a_m\} + \min\{b_1, \ldots, b_m\} \le \min\{a_1 + b_1, \ldots, a_m + b_m\}. \tag{7}$$

Thus, $(s_{j,min}^{(low)} + s_{j,min}^{(high)})/2 \le \min\{(y_{ij}^{(low)} + y_{ij}^{(high)})/2, i = 1 \ldots m\} = s_{j,min}^{(mid)}$, and $(s_{j,max}^{(low)} + s_{j,max}^{(high)})/2 \ge s_{j,max}^{(mid)}$. Since the interval of MBR(MID) contains the interval of MBR(mid) in every dimension, MBR(MID) encloses MBR(mid).
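The enclosure argument can be checked numerically. In the sketch below, randomly drawn vertex sets stand in for the perturbed vertices of S(low) and S(high) (an assumption made only for the demonstration), and the function names are illustrative. `mbr_of_averaged_bounds` plays the role of MBR(MID) and `mbr_of_midpoints` the role of MBR(mid).

```python
import random


def mbr(points):
    """Per-dimension (min, max) bounds of a list of equal-length points."""
    dims = range(len(points[0]))
    return [(min(p[j] for p in points), max(p[j] for p in points)) for j in dims]


def mbr_of_averaged_bounds(y_low, y_high):
    """MBR(MID): average the bounds of MBR(low) and MBR(high) per dimension."""
    return [((lo_min + hi_min) / 2, (lo_max + hi_max) / 2)
            for (lo_min, lo_max), (hi_min, hi_max) in zip(mbr(y_low), mbr(y_high))]


def mbr_of_midpoints(y_low, y_high):
    """MBR(mid): MBR of the vertex midpoints (y_low_i + y_high_i) / 2."""
    mids = [[(a + b) / 2 for a, b in zip(p, q)] for p, q in zip(y_low, y_high)]
    return mbr(mids)


def encloses(outer, inner, eps=1e-12):
    """True if every interval of inner lies inside the matching interval of outer."""
    return all(o_min <= i_min + eps and o_max >= i_max - eps
               for (o_min, o_max), (i_min, i_max) in zip(outer, inner))


if __name__ == "__main__":
    rng = random.Random(0)
    y_low = [[rng.uniform(-5, 5) for _ in range(3)] for _ in range(8)]
    y_high = [[rng.uniform(-5, 5) for _ in range(3)] for _ in range(8)]
    print(encloses(mbr_of_averaged_bounds(y_low, y_high),
                   mbr_of_midpoints(y_low, y_high)))
```

The averaged-bounds box only needs the two MBRs of S(low) and S(high), so it can be computed without enumerating the midpoint vertices, at the price of being somewhat larger.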

9.2 Algorithms

Algorithm 1 RASP Data Perturbation

RASP_Perturb(X, RNG, RIMG, Kope)
Input: X: k × n data records; RNG: random real value generator that draws values from the standard normal distribution; RIMG: random invertible matrix generator; Kope: key for the OPE Eope.
Output: the matrix A.
    A ← 0;
    A3 ← the last column of A;
    v0 ← 4;
    while A3 contains zero do
        generate A with RIMG;
        A3 ← the last column of A;
    end while
    for each record x in X do
        v ← v0 − 1;
        while v < v0 do
            v ← RNG;
        end while
        y ← A((Eope(x, Kope))^T, 1, v)^T;
        submit y to the server;
    end for
    return A;
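The perturbation procedure can be sketched in plain Python as follows. Several pieces are assumptions made for illustration: `generate_matrix` is a simple stand-in for RIMG (uniform random entries, redrawn until the matrix is invertible and its last column has no zero), `ope` is any caller-supplied increasing function standing in for Eope with its key, d denotes the number of data dimensions, and the perturbed records are returned rather than submitted to a server.

```python
import random


def det(M):
    """Determinant by Gaussian elimination with partial pivoting."""
    M = [row[:] for row in M]
    n, value = len(M), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) < 1e-12:
            return 0.0
        if p != c:
            M[c], M[p] = M[p], M[c]
            value = -value  # a row swap flips the sign
        value *= M[c][c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n):
                M[r][k] -= f * M[c][k]
    return value


def generate_matrix(size, rng):
    """Stand-in for RIMG: redraw until A is invertible and its last column has no zero."""
    while True:
        A = [[rng.uniform(-1.0, 1.0) for _ in range(size)] for _ in range(size)]
        if all(row[-1] != 0.0 for row in A) and abs(det(A)) > 1e-6:
            return A


def draw_noise(rng, v0):
    """Rejection-sample the standard normal until v >= v0, as in the inner loop."""
    v = v0 - 1.0
    while v < v0:
        v = rng.gauss(0.0, 1.0)
    return v


def rasp_perturb(records, ope, rng, v0=4.0):
    """Return (A, perturbed records); each record x becomes A (ope(x)^T, 1, v)^T."""
    d = len(records[0])
    A = generate_matrix(d + 2, rng)
    perturbed = []
    for x in records:
        z = [ope(value) for value in x] + [1.0, draw_noise(rng, v0)]
        perturbed.append([sum(a * b for a, b in zip(row, z)) for row in A])
    return A, perturbed


if __name__ == "__main__":
    toy_ope = lambda value: 3.0 * value + 1.0  # placeholder, not a real OPE
    A, Y = rasp_perturb([[1.0, 2.0], [0.5, 4.0]], toy_ope, random.Random(7), v0=1.0)
    print(Y)
```

With v0 = 4 as in the listing, drawing from the standard normal by rejection needs tens of thousands of draws per record on average, so the usage example passes a smaller v0 to keep the demonstration fast.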

Algorithm 2 encodes a normal range query and generates the Qi matrices and the MBR for the transformed query.

In Algorithm 3, the two-stage query processing uses the MBR to find the initial query result and then filters the result with the transformed query conditions y^T Qi y < 0, where the matrices {Qi} and the MBR are passed by the client and y is each perturbed record.

The following Algorithm 4 describes the details of the (k, δ)-Range algorithm for determining the inner range.

Algorithm 2 RASP Secure Query Transformation

QuadraticQuery(Cond, A)
Input: Cond: 2d simple conditions for d-dimensional data, 2 conditions for each dimension; A: the perturbation matrix.
Output: the MBR of the transformed range and the quadratic query matrices Qi, i = 1 . . . 2d.
    v0 ← 4;
    for each condition Ci in Cond do
        u ← zeros(d + 2, 1);
        if Ci is like Xj < aj then
            u_j ← 1, u_{d+1} ← −aj;
        end if
        if Ci is like Xj > aj then
            u_j ← −1, u_{d+1} ← aj;
        end if
        w ← zeros(d + 2, 1);
        w_{d+2} ← 1;
        w_{d+1} ← v0;
        Qi ← (A^−1)^T u w^T A^−1;
    end for
    Use the vertex transformation method to find the MBR of the transformed queries;
    return MBR and {Qi, i = 1 . . . 2d};
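Why the quadratic form acts as a filter can be seen by expanding it. The derivation below takes u and w exactly as set in the listing for a condition of the form X_j < a_j, writes z for the unperturbed vector so that y = Az, and assumes that the bound a_j is expressed in the same OPE-encoded domain as the stored values:

```latex
\begin{aligned}
y^{T} Q_i\, y
  &= z^{T} A^{T} (A^{-1})^{T} u\, w^{T} A^{-1} A\, z
   = (z^{T} u)(w^{T} z), \qquad z = (E_{ope}(x)^{T}, 1, v)^{T},\\
z^{T} u &= E_{ope}(x)_j - a_j, \qquad w^{T} z = v_0 + v,\\
y^{T} Q_i\, y &= \bigl(E_{ope}(x)_j - a_j\bigr)(v + v_0).
\end{aligned}
```

Because Algorithm 1 only accepts noise values with v ≥ v_0 > 0, the second factor is always positive, so y^T Q_i y < 0 holds exactly when E_ope(x)_j < a_j. The case X_j > a_j is symmetric, with the signs in u flipped.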

Algorithm 3 Two-Stage Query Processing

ProcessQuery(MBR, {Qi})
Input: MBR: MBR for the transformed query; {Qi}: filtering conditions.
Output: the set of perturbed records satisfying the conditions.
    Y ← use the indexing tree to find answers for MBR;
    Y′ ← ∅;
    for each record y in Y do
        success ← 1;
        for each condition Qi do
            if y^T Qi y ≥ 0 then
                success ← 0;
                break;
            end if
        end for
        if success = 1 then
            add y into Y′;
        end if
    end for
    return Y′ to the client;
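The filtering stage can be exercised end to end with a small pure-Python sketch. It covers only the second stage: the list passed to `process_query` stands for the candidates that the index lookup on the MBR would return. The 4×4 matrix A, the records, and the noise values in the usage example are hypothetical, the record values are treated as already OPE-encoded, and `make_Q` builds each matrix as the outer product of (A^-1)^T u and (A^-1)^T w, which equals (A^-1)^T u w^T A^-1.

```python
def invert(A):
    """Gauss-Jordan inverse with partial pivoting; raises ValueError if singular."""
    n = len(A)
    M = [list(map(float, row)) + [float(i == j) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) < 1e-12:
            raise ValueError("matrix is singular")
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [row[n:] for row in M]


def mat_vec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def make_Q(A_inv, d, j, a, less_than, v0=4.0):
    """Q = (A^-1)^T u w^T A^-1 for the condition X_j < a (or X_j > a); j is 0-based."""
    u = [0.0] * (d + 2)
    w = [0.0] * (d + 2)
    u[j], u[d] = (1.0, -a) if less_than else (-1.0, a)
    w[d], w[d + 1] = v0, 1.0
    A_inv_T = [list(col) for col in zip(*A_inv)]
    p, q = mat_vec(A_inv_T, u), mat_vec(A_inv_T, w)
    return [[pi * qj for qj in q] for pi in p]  # outer product p q^T


def quad_form(y, Q):
    """Evaluate y^T Q y."""
    return sum(yi * sum(qij * yj for qij, yj in zip(row, y)) for yi, row in zip(y, Q))


def process_query(candidates, Qs):
    """Keep a perturbed record y only if y^T Q y < 0 for every condition Q."""
    return [y for y in candidates if all(quad_form(y, Q) < 0 for Q in Qs)]


if __name__ == "__main__":
    A = [[4, 1, 0, 1], [1, 5, 1, 2], [0, 1, 6, 1], [1, 0, 1, 7]]
    A_inv = invert(A)
    records = [(1, 1), (3, 2), (5, 5), (2, 6)]
    noise = [4.5, 5.0, 6.0, 4.2]
    Y = [mat_vec(A, [x0, x1, 1.0, v]) for (x0, x1), v in zip(records, noise)]
    # Range query 0 < X0 < 4 and 0 < X1 < 3.
    Qs = [make_Q(A_inv, 2, 0, 4.0, True), make_Q(A_inv, 2, 0, 0.0, False),
          make_Q(A_inv, 2, 1, 3.0, True), make_Q(A_inv, 2, 1, 0.0, False)]
    print(len(process_query(Y, Qs)), "of", len(Y), "records pass the filter")
```

The server in this sketch never sees A, the plaintext values, or the bounds; it only multiplies the perturbed records with the matrices it was handed.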


Algorithm 4 (k, δ)-Range Algorithm

procedure (k, δ)-RANGE(L1, Lm, k, δ)
    high ← Lm, low ← L1;
    while high − low ≥ E do
        mid ← (high + low)/2;
        num ← number of points in S(mid);
        if num ≥ k && num ≤ k + δ then
            break the loop;
        else if num > k + δ then
            high ← mid;
        else
            low ← mid;
        end if
    end while
    return S(mid);
end procedure
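The binary search can be sketched in Python as follows. Two points are assumptions made for the sketch: S(L) is taken to be the square of half-width L centered at the query point, and `count_in_square` counts over a local list of points, whereas in the actual system each count corresponds to one server-side range query. The tolerance `eps` plays the role of E in the listing.

```python
def count_in_square(points, center, half_width):
    """Number of points inside the square of the given half-width around center."""
    return sum(all(abs(p - c) <= half_width for p, c in zip(point, center))
               for point in points)


def k_delta_range(points, center, low, high, k, delta, eps=1e-6):
    """Binary search for a half-width whose square holds between k and k + delta points.

    low is assumed to hold fewer than k points and high at least k points.
    Returns (half_width, rounds); each round stands for one range query.
    """
    rounds = 0
    mid = (low + high) / 2
    while high - low >= eps:
        mid = (low + high) / 2
        num = count_in_square(points, center, mid)
        rounds += 1
        if k <= num <= k + delta:
            break
        if num > k + delta:
            high = mid
        else:
            low = mid
    return mid, rounds


if __name__ == "__main__":
    points = [(float(i), 0.0) for i in range(1, 21)]
    for delta in (0, 2):
        width, rounds = k_delta_range(points, (0.0, 0.0), 0.0, 20.0, 3, delta)
        found = count_in_square(points, (0.0, 0.0), width)
        print(f"delta={delta}: half-width={width}, rounds={rounds}, points returned={found}")
```

On this toy data, δ = 0 needs four rounds and returns exactly the 3 nearest points, while δ = 2 stops after two rounds but returns 5 points, which mirrors the precision versus rounds tradeoff discussed for Figure 10.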

10: else11: low← mid;12: end if13: end while14: return S(mid);15: end procedure