Keywords

1 Introduction

With the development of cloud computing, a data owner can outsource his/her database and their managements to a cloud. By outsourcing the database, the data owner can flexibly utilize the resource of the cloud, thus reducing the management costs. The cloud not only stores the database, but also provides an authorized user with querying services on the outsourced database. However, if the original database is outsourced to the cloud, the cloud or an attacker can abuse the private information stored in the database. For example, if a real estate agent outsources his/her original database to the cloud, the cloud or an attacker can sell the property information to the other agents. In addition, the private information of a user can be revealed to the attacker. For example, if a user sends a query with his/her location information to use location-based services, the attacker can find places where the user frequently visit.

Meanwhile, a range query, one of the most typical query types, is widely used as a baseline technique in many fields. The range query finds all the data inside a given query range. However, some privacy threat can occur when issuing the range query. This is because a range information is closely related to the interest of a user.

Therefore, researches on the range query processing algorithms which consider the data privacy have been performed [15]. However, all the existing works fail to hide the data access patterns during the query processing. The data access patterns are the good source to derive not only the actual data items, but also the private information of a querying issuer. This is the critical problem because the data access patterns can be exposed even though the data and query are encrypted [6]. To the best of knowledge, a scheme proposed in [7] is the only work that hides the data access patterns over the encrypted database. However, the scheme only supports kNN query processing and requires high computation cost. To solve the problem, in this paper we propose a new range query processing algorithm on the encrypted database. Our method conceals the data access patterns while supporting efficient query processing by using our proposed encrypted index search scheme.

The rest of the paper is organized as follows. Section 2 introduces the related work and Sect. 3 presents the overall system architecture and secure protocols. In Sect. 4, we propose our secure range query processing algorithm based on the encrypted index. The performance analysis of our scheme is presented in Sect. 5. Finally, we conclude the paper with some future research directions in Sect. 6.

2 Related Work

Yiu et al. [1] proposed the cryptographic transformation (CRT) method which utilizes encrypted R-tree index. However, CRT cannot preserve the data access pattern as a user hierarchically requests the required R-tree nodes to the cloud. A scheme proposed by Hore et al. [2] partitions the data into a set of buckets and builds indices for buckets. However, a data owner should store and search the indices locally. In addition, the result of schemes in [1, 2] usually contains the false-positives. Wang et al. [3] proposed a scheme which utilizes the encrypted version of R-tree. However, the scheme has a shortcoming that a result contains many false-positives. In addition, the data access patterns are revealed because the cloud returns a set of nodes which intersect the query range. Wang et al. [4] proposed an encrypted R-tree based range query processing scheme. However, the data access patterns are revealed to a cloud because all the identifiers of the data that satisfy the query are returned by the cloud. Most recently, Kim et al. [5] proposed a range query processing scheme using the Hilbert-curve order based index. However, the scheme has a problem that a user is in charge of index traversal during the query processing. In addition, the scheme leaks the data access pattern and the query result may contain false-positives.

Next, we briefly review the Paillier cryptosystem [8]. The Paillier cryptosystem is an additive homomorphic and probabilistic asymmetric encryption scheme for public key cryptography. The public key pk for encryption is given by (N, g), where N is a product of two large prime numbers p and q, and g is in \( Z_{N^{2}}^{*} \). The secret key sk for decryption is given by (p, q). Let E() denote the encryption function and D() denote the decryption function. The Paillier crypto system has the following properties. (i) The product of two ciphertexts E(m 1 ) and E(m 2 ) results in the encryption of the sum of their plaintexts m 1 and m 2 ; E(m 1  + m 2 ) = E(m 1 )*E(m 2 ) mod N 2. (ii) The bth power of ciphertext E(m 1 ) results in the encryption of the product of b and m 1 ; E(m 1 *b) = E(m 1 )b mod N 2. (iii) Encrypting the same plaintexts with the same public key results in distinct ciphertexts (aka ‘semantic security’).

3 System Architecture and Secure Protocols

The system consists of four components: data owner (DO), authorized user (AU), and two clouds (C A and C B , respectively). The DO owns the original database (T) of n records. A record t i (1 ≤ i ≤ n) consists of m attributes and jth attribute value of t i is denoted by t i,j . To provide the indexing on T, the DO partitions T by using kd-tree. If we retrieve the tree structure in hierarchical manner, the access pattern can be disclosed. So, we only consider the leaf nodes of the kd-tree and all the leaf nodes are retrieved once during the query processing. Let h denote the level of the constructed kd-tree and F be the fanout of each leaf node. The total number of leaf nodes is 2h−1. From now on, a node means a leaf node. Each node is represented as the lower bound lb z,j and the upper bound ub z,j for 1 ≤ z ≤ 2h−1 and 1 ≤ j ≤ m. Each node stores the identifiers (id) of data being located inside the node region.

To preserve the data privacy, the DO encrypts T attribute-wise using the public key (pk) of the Paillier cryptosystem [8] before outsourcing the database. So, the DO generates E(t i,j ) for 1 ≤ i ≤ n and 1 ≤ j ≤ m. The DO also encrypts the kd-tree nodes so as to support efficient query processing over the encrypted database. The lb and the ub of each node are encrypted attribute-wise, so E(lb z,j ) and E(ub z,j ) are generated for 1 ≤ z ≤ 2h1 and 1 ≤ j ≤ m. In the system, we assume that the clouds, C A and C B , act as semi-honest adversaries. This is because protocols under the semi-honest adversaries are efficient in practice and can be used to design protocols against malicious adversaries. So, the C A and C B correctly perform the given protocols and do not exchange unpermitted data. However, an adversary may try to obtain additional information from the intermediate data during executing his/her own protocol.

To support range query processing over the encrypted database, a secure multiparty computation (SMC) is required between C A and C B . For this, the DO outsources the encrypted database and its encrypted index to the C A with pk while the sk is sent to the C B . The encrypted index includes the region information of each node in cipher-text and the ids of data that are located inside the node in plain-text. The DO also sends pk to AUs to enable them to encrypt a query. At query time, an AU first encrypts a query attribute-wise. Then, the AU sends E(q.lb j ) and E(q.ub j ) for 1 ≤ j ≤ m to C A . C A processes the query with the help of C B and returns a query result to the AU.

As an example, assume that an AU has 8 data in two-dimensional space (e.g., x-axis and y-axis) as depicted in Fig. 1. The data are partitioned into 4 nodes for a kd-tree. To outsource the database, the DO encrypts each data and the information of each node attribute-wise. For example, t 1 is encrypted as E(t 1 ) = {E(2), E(1)}.

Fig. 1
figure 1

An example in two-dimensional space

Meanwhile, our range query processing algorithm is constructed using several secure protocols. We describe the secure protocols that are used in our range query processing algorithm. All the protocols except SBN protocol are performed through the SMC technique between C A and C B . SBN protocol can be executed by C A alone. Due to the space limitation, we first briefly introduce two existing secure protocols that we adopt from [7, 9]. SM (Secure Multiplication) protocol [7] computes the encryption of a × b, i.e., E(a × b), when two encrypted data E(a) and E(b) are given as inputs. SBD (Secure Bit-Decomposition) protocol [9] computes the encryptions of binary representation of the encrypted input E(a). The output is [a] = < E(a 1), …, E(a l ) > where a 1 and a l denote the most and least significant bits of a, respectively. We use symbol [a] to denote the encryptions of binary representation.

Next, we propose our new secure protocols. First, SBN (Secure Bit-Not) protocol performs the bit-not operation when an encrypted bit E(a) is given as input. The output E(~a) is computed by E(a)N1 × E(1). Here, “−1” is equivalent to “N−1” under Z N .

Second, SCMP (Secure Compare) protocol returns E(1) if u ≤ v, E(0) otherwise, when [u] and [v] are given as inputs. We devise SCMP by modifying SMIN [7] protocol which outputs [min] between two inputs [u] and [v]. The variables generated during SMIN can be categorized into two folds. One set of the variables include hints about what the minimum value is. Another set of the variables is used to securely extract the minimum value. Because we only need the information about whether u is smaller or not, we only compute the former (e.g., W, G, H, Φ, L, L′). The goal of designing SCMP is to make the returned value from C B be exactly opposite for the same inputs, based on the functionality selected by C A .

The overall procedure of SCMP is as follows. (i) C A appends E(0) to the least significant bits of [u] and E(1) to the least significant bits of [v]. By doing so, SCMP makes u smaller than v only when two values are the same. (ii) C A randomly chooses one functionality between F 0 :u > v and F 1 :v > u. The selected functionality is oblivious to C B . Then, C A computes E(u i  × v i ) using SM and W i, depending on the selected functionality. In particular, if F 0 :u > v is selected, C A computes W i  = E(u i ) × E(u i  × v i )N1. If F 1 :v > u is selected, C A computes W i  = E(v i ) × E(v i  × u i )N1. For F 0 :u > v, W i  = E(1) when u i  > v i , and W i  = E(0) otherwise. Similarly, for F 1 :v > u, W i  = E(1) when v i  > u i , and W i  = E(0) otherwise. (iii) C A performs bit-xor between E(u i ) and E(v i ) and stores the result into G i . C A computes H i  = (H i1 )ri × G i and Φ i  = E(−1) × H i where H 0  = E(0). Here, r i is a random number in Z N . Assume that j is the index of the first appearance of E(1) in G i . j means the first position where the minimum value between u and v can be determined. (iv) C A computes L i  = W i  × Φ ri i where L i involves the information about which value is smaller between u and v at j. C A generates L′ by permuting L by using a random permutation function π 1 and sends L′ to C B . (v) C B decrypts L′ attribute-wise and checks whether there exists 0 in L i for 1 ≤ i ≤ l. If so, C B sets α as 1, and 0 otherwise. After encrypting α, C B sends E(α) to C A . By doing so, the returned values by C B are exactly opposite with the selected functionalities for the same input which coincides with the goal of SCMP protocol. (vi) C A performs E(α) = SBN(E(α)) only when the selected functionality is F 0 :u > v and returns the E(α). So, the final E(α) is E(1) when u ≤ v, regardless of the selected functionality. Note that the only information decrypted during SCMP is L′ which is seen by C B . However, C B cannot obtain an additional information from D(L′) because the selected functionality is oblivious to C B .

Third, SRO (Secure Range Overlapping) protocol returns E(1) when range 1 overlaps range 2 , E(0) otherwise, when the encryptions of binary representation of two ranges [range 1 ] and [range 2 ] are given as inputs. Assuming that both range 1 and range 2 consist of [lb j ] and [ub j ], where 1 ≤ j ≤ m, the two ranges overlap only if two following conditions are satisfied; (i) E(range 1 .lb j ) ≤ E(range 2 .ub j ) for 1 ≤ j ≤ m, (ii) E(range 2. lb i ) ≤ E(range 1 .ub j ) for 1 ≤ j ≤ m. SRO determines the conditions by using our SCMP. The overall procedure of SRO is as follows. (i) C A initializes E(α) as E(1). (ii) C A obtains E(α′) by performing SCMP([range 1 .lb j ], [range 2 .ub j ]) and updates E(α) by executing SM(E(α), E(α′)). C A repeats this step for 1 ≤ j ≤ m. Similarly, C A computes E(α′) by performing SCMP([range 2 .lb j ], [range 1 .ub j ]) and updates E(α) by executing SM(E(α), E(α′)) for all attribute values. Only when all conditions are satisfied, the value of E(α) remains E(1). (iii) C B returns the final E(α). Note that no decryption is performed during SRO except performing SCMP and SM protocols.

Finally, SPE (Secure Point Enclosure) protocol returns E(1) when p is inside the range or on a boundary of the range, E(0) otherwise, when the encryptions of binary representation of a point [p] and [range] are given as inputs. The overall procedure of SPE is identical to SRO. This is because if a low bound and an upper bound of a range is the same, the range can be considered as a point. However, we also define SPE protocol to make the relations between inputs clear.

4 Secure Range Query Processing Algorithm

In this section, we present our secure range query processing algorithm (SRangeI) using the kd-tree on the encrypted database. SRangeI consists of two steps; encrypted kd-tree search step and result retrieval step.

First, the procedure of the encrypted kd-tree search step is as follows.

  1. (i)

    C A computes [q.lb j ] and [q.ub j ] for 1 ≤ j ≤ m by using the SBD. C A also computes [node z .lb j ] and [node z .ub j ] for 1 ≤ z ≤ num node and 1 ≤ j ≤ m by using SBD where num node is the total number of kd-tree leaf nodes. Then, C A securely finds nodes which overlap the query range by executing E(α z ) ← SRO([q], [node z ]) for 1 ≤ z ≤ num node . Note that the nodes with E(α z ) = E(1) overlaps the query range, but both C A and C B cannot know whether the value of each E(α z ) is E(1).

  2. (ii)

    C A generates E(α′) by permuting E(α) using a random permutation function π and sends E(α′) to C B . For example, SRO returns E(α) = {E(0), E(0), E(1), E(1)} in Fig. 1 as the query range overlaps the node3 and node4. Assuming that π permutes data in reverse way, C A sends the E(α′) = {E(1), E(1), E(0), E(0)} to C B .

  3. (iii)

    Upon receiving the E(α′), C B obtains α′ by decrypting the E(α′) and counts the number of α′ with the value of 1. The number of α′ = 1 is stored into c. So, c means the number of nodes that overlaps the query range.

  4. (iv)

    C B creates c number of node groups (e.g., NG). C B assigns to each NG a node with α′ = 1 and num node /c−1 nodes with α′ = 0. Then, C B computes NG′ by randomly shuffling the ids of nodes in each NG and sends NG′ to C A . For example, C B can obtain α′ = {1, 1, 0, 0}. However, C B cannot correctly point out ids of the nodes overlapping the query range because the values in α′ were permutated by C A . As two node groups are required, C B assigns node 1 and node 2 to NG 1 and NG 2 , respectively. In this example, num node /c−1 is calculated as 1 because num node  = 4 and c = 2. So, C B randomly assigns a node to each node group. Assume that C B assigns node 3 to NG 1 and node 4 to NG 2 . So, NG 1 = {1, 3} and NG 2 = {2, 4}. Then, C B randomly shuffles the ids of the nodes in each NG. The result can be like NG 1′ = {1, 3} and NG 2′ = {4, 2}.

  5. (v)

    C A obtains NG * by permuting the ids of nodes using π 1 in each NG′. In each NG *, there exists only one node overlapping the query range. However, C A cannot know the correct id of the node because the ids of the nodes in NG * are shuffled by C B . Sixth, C A gets access to one datum in each node (e.g., node z ) for each NG * and performs E(t i,j ) ← SM(node z .t s,j , E(α z )) for 1 ≤ s ≤ F and 1 ≤ j ≤ m. Here, α z is the outputs of SPE, corresponding to the node z . If a node has the less number of data than F, it performs SM by using E(max), instead of using node z .t s,j , where E(max) is the largest value in the domain. When C A accesses one datum from every node in a NG *, C A performs a homomorphic addition such as E(cand cnt,j ) ← \( \prod\nolimits_{i = 1}^{num} {E(t'_{i,j} )} \), where num means the total number of nodes in the selected NG *. By doing so, a datum in the node overlapping the query range is securely extracted without revealing the data access patterns. By repeating these steps, all the data in the nodes are safely stored into the E(cand cnt,j ) where cnt means the total number of data extracted during the index search.

As an example, C A obtains NG *1  = {2, 4} and NG *2  = {1, 3} by permuting NG 1′ = {3, 1} and NG 2′ = {4, 2} using π −1. Then, C A accesses E(t 3 ) in node 2 , E(t 7 ) in node 4 for NG * 1 . The results of SM using E(t 3), e.g., E(t1), are E(0) for every attribute because the E(α) value of node 2 is E(0). However, the results of SM using E(t 7 ), e.g., E(t′ 2 ), become E(7) and E(7) for x and y dimension, respectively. So, the results of the attribute-wise homomorphic addition of E(t′ 1 ) and E(t′ 2 ) are E(7) and E(7) for x and y dimension, respectively. Thus, one datum E(t 7 ) in node4 is securely extracted into E(cand 1 ). Similarly, values of E(t 8 ) can be securely extracted into E(cand 2 ) by using E(t 4 ) and E(t 8 ). In the same way, for NG * 2 , all the data in the node 3 (e.g., E(t 5 ) and E(t 6 )) are securely extracted into E(cand 3 ) and E(cand 4 ), respectively.

Second, the procedure of the result retrieval step is as follows. (i) C A computes {([cand i,j ] | 1 ≤ i ≤ cnt, 1 ≤ j ≤ m} by using the SBD. Here, cnt is equal to F × (the number of node groups).

(ii) C A securely finds data inside the query range by executing E(α i ) ← SPE([cand i ], [q]) for 1 ≤ i ≤ cnt. Note that the data with E(α i ) = E(1) are included in the query range. However, both C A and C B cannot know whether the value of each E(α i ) is E(1). For example, when E(cand) = {E(t 7 ), E(t 8 ), E(t 5 ), E(t 6 )} is given from the step 1, SPE returns E(α) = E(1) for E(t 7 ) and E(t 6 ), which are located inside the query range. The data with E(α) = E(1) should be sent to the user. To minimize the computation cost at the user side, it is required to send decrypted results. However, if the cloud decrypts the results, the actual content of the data are revealed to the cloud.

So, (iii) C A computes E(γ i,j ) = E(result i ) × E(r i,j ) for 1 ≤ i ≤ cnt and 1 ≤ j ≤ m by generating a random value r i,j . C A generates E(α′), E(γ′), and r′ by permuting E(α), E(γ), and r, using a random permutation function π 1 . Then, C A sends E(α′) and E(γ′) to C B , and r′ to AU, respectively. iv) C B decrypts E(α i ) and E(γ′ i,j ) for 1 ≤ i ≤ cnt and for 1 ≤ j ≤ m. Then, C B sends α′ and γ′ corresponding to the α′ = 1 to AU. Finally, AU computes γ′ i,j r′ i,j for 1 ≤ i ≤ cnt and 1 ≤ j ≤ m only for the corresponding α i  = 1.

5 Performance Analysis

There is no existing range query processing algorithm that hides the data access patterns. So, in this section, we compare our SRangeI with a baseline algorithm SRangeB which only performs result retrieval step by considering all the data without using an index. We do the performance analysis of both schemes in terms of query processing time with different parameters. We used the Paillier cryptosystem to encrypt a database for both schemes. We implemented both schemes by using C++. Experiments were performed on a Linux machine with an Intel Xeon E3-1220v3 4-Core 3.10 GHz and 32 GB RAM running Ubuntu 14.04.2. To examine the performance under various parameters, we randomly generated synthetic datasets by following [7]. We set the size of the range as 0.1 which means the relational portion of the range compared to the total domain size (i.e., l). In addition, we set the domain size (l) as 12 and the encryption key size (K) as 1024.

Figure 2 shows the performance of SRangeI for varying the level of kd-tree. Figure 2a shows the performance of SRangeI for varying h and n. Overall, as the n becomes larger, the query processing time increases. Meanwhile, when the n increases, the h that shows the best performance becomes larger. For all cases, when the h is increased, the query processing time decreases for the smaller h while the query processing time increases for the larger h. Figure 2b shows the performance of SRangeI for varying h and m. As the m becomes larger, the query processing time linearly increases. This is because all the protocols including SRO and SPE should process additional data as the m increases. We set the h as 7 in the following performance evaluation because SRangeI shows good performance when h = 7 in our parameter settings.

Fig. 2
figure 2

Performance of SRangeI for varying h. a m = 6, K = 1024, b n = 6 k, K = 1024

Figure 3a shows the performance of both SRangeI and SRangeB schemes varying the n. Overall, as the n becomes larger, the query processing time linearly increases for both schemes. Overall, SRangeI shows much better performance than SRangeB because SRangeI filters irrelevant data by using the index. On average, SRangeI shows about 25 times better performance than SRangeB. Meanwhile, Fig. 3b shows the performance of both schemes varying the m. Overall, as the m becomes larger, the query processing time linearly increases for both schemes. However, on average, SRangeI shows about 27 times better performance than SRangeB.

Fig. 3
figure 3

Comparison of SRangeI and SRangeB. a m = 6, K = 1024, b n = 6 k, K = 1024

6 Conclusion

With the popularity of the outsourced databases, researches on the range query processing methods over the encrypted database have been actively performed. They can preserve the data privacy and the query privacy, but there is no work that hides the data access patterns during the query processing. To solve the problem, we proposed a new secure range query processing algorithm on the encrypted database. Our method conceals the data access patterns while supporting efficient query processing by using our proposed encrypted index search scheme. We showed from our performance analysis that our algorithm achieves efficient query processing performance while hiding the data access patterns. As a future work, we plan to expand our work to support other query types, such as Top-k and skyline queries.