1 Introduction

A cloud database (CloudDB) is a database deployed in an Internet-based virtual computing environment, which allows users to store, modify and retrieve data from anywhere in the world, as long as they have access to the Internet [20]. Due to the advantages of pay-on-demand, expand-on-demand and high availability, CloudDBs have been widely used in information systems [10]. However, since a CloudDB resides on the cloud side instead of a local server, it is important to effectively protect the massive amount of personal privacy information (such as phone numbers and personal names) stored in it [11]. To ensure the security of personal privacy information, many strategies have been used in information systems, such as identity authentication, authorization and access control [3, 30, 32]. These strategies can prevent illegal users from accessing unauthorized data, and consequently ensure the security of personal information to a large extent. However, almost all of them target only the external illegal users of an information system; they cannot prevent internal users (such as administrators) at the cloud side from accessing personal information stored in the CloudDB.

A general framework of a CloudDB information system is shown in Figure 1, where (1) external users, who generally work at client sides, store their data into the CloudDB and use the data services it supplies; and (2) internal users, who work at the cloud side, manage the CloudDB and the large amount of external users' personal data stored in it. In a CloudDB, the client sides are considered trusted because existing security strategies used by database systems can prevent users from accessing unauthorized data. However, the cloud side is considered untrusted [31]. For example, an administrator or an attacker who has broken into the cloud side can easily access private data stored in the CloudDB. In other words, it is possible for internal users at the cloud side, driven by economic interests, to access and expose the personal data in the CloudDB, thereby leading to the disclosure of personal information.

Figure 1: A general framework of a CloudDB information system

1.1 Motivation

It has been reported by iResearch that the frequent disclosure of personal privacy is making people "transparent", and that more than half of such disclosure events are caused by the internal users of a network system. Therefore, it is very important to supply an effective approach to ensuring the security of personal privacy in a CloudDB, one that can prevent the disclosure of personal information caused by internal users at the cloud side, not just by external users at the client sides. A straightforward way to protect personal privacy in a CloudDB is to encrypt personal data, so that even if the encrypted data are exposed, they are difficult to decrypt [22]. However, an information system generally issues a large number of database query operations that are relevant to personal data (i.e., defined over personal data). Once the private data are encrypted using a traditional encryption algorithm (e.g., those in [2, 17]), most of these query operations (such as text similarity queries) can no longer be executed correctly over the encrypted data in the CloudDB.

To solve this problem of querying encrypted data, we could transmit the encrypted data (possibly a whole table) from the cloud side, decrypt them, and then execute the query operations over the decrypted data. However, since transmitting and decrypting an encrypted table is expensive, such a decrypt-before-query strategy greatly reduces the execution efficiency of database query operations, making it inapplicable to a CloudDB. Homomorphic encryption techniques [17] can maintain the original order and comparability of the encrypted data, so that a number of database query operations can be executed correctly over them; however, they are generally of weak security, e.g., the encrypted data are easy to decrypt by statistical attack, as pointed out in [13, 33]. Although there are many studies on data encryption [16], most of them require the encrypted data to be decrypted before queries can be executed, which makes it difficult for them to satisfy the efficiency requirement of a CloudDB. A small number of data encryption algorithms (see the related work section for details) allow users to query encrypted data directly without decryption, but they generally suffer from weak security or an inability to fully support query operations, making them unsuitable for querying encrypted personal data in a CloudDB.

1.2 Contribution

In this paper, we propose a client-based approach to protect personal privacy in a CloudDB. In the approach, before being submitted to the cloud side, personal data have to be encrypted on a trusted client side using a traditional encryption algorithm, so as to ensure the security of personal data on the untrusted cloud side. Meanwhile, to execute various kinds of query operations over the encrypted data efficiently, the approach generates additional feature information (called feature index) for the encrypted data, which allows a certain amount of query processing to occur on the cloud side without the need to decrypt the data. Thus, the approach mainly explores how the feature index of personal data is constructed, and how each query operation over personal data is transformed into a new query operation over the feature index so that it can be executed correctly on the cloud side. Specifically, the contributions of this paper are threefold.

(1) We present a scheme to generate the feature index for personal data, which has not only good security (i.e., it is difficult to infer the original personal data from the feature index), but also good usability (i.e., it can support various kinds of database query operations).

(2) We present a scheme to transform each user query relevant to personal data into a cloud-side query relevant to the feature index, so that the new query can be executed correctly on the cloud side, thereby improving the execution efficiency of database query operations.

(3) We demonstrate the effectiveness of our approach by theoretical analysis and experimental evaluation. The results show that the approach performs well in terms of security, usability and efficiency, so it is applicable to effectively protecting personal privacy in a CloudDB.

The rest of this paper is organized as follows. Section 2 surveys related work. Section 3 presents the system model and the problem studied in this paper, i.e., it formally describes what requirements should be satisfied to protect personal privacy effectively in a CloudDB. Section 4 presents a scheme to generate the feature index for personal data, and analyzes the security of the scheme. Section 5 presents a scheme to map each query over personal data into a cloud-side query over the feature index, and analyzes the usability of the scheme. Section 6 presents the experimental evaluation to demonstrate the efficiency of our approach. Finally, we conclude this paper in Section 7.

2 Related work

In this section, we briefly review research related to querying encrypted data in outsourced databases. In [13], Hacigumus et al. first proposed the bucket partitioning idea for querying encrypted data in the database-as-a-service model. The basic idea is to divide the attribute domains into multiple buckets and then map bucket identifiers to random numbers, thereby protecting the security of sensitive data. Moreover, this allows much of a query over encrypted data to be processed at the database service provider, improving query performance. Later, in [14], the authors proposed to use homomorphic encryption techniques to enhance their approach so as to support aggregation queries over encrypted data, and in [15], they further discussed an optimization technique, i.e., how to use multiple rounds of communication between the server and the client to decrease the client's workload. To better support range queries over encrypted data, Hore et al. [7] explored an optimal approach to partitioning the data domain, thereby improving the precision of range queries. The work of Hacigumus et al. is significant in that it presented a basic framework for ensuring data security in the database-as-a-service model. However, it did not analyze the security of the approach formally. Besides, the work is valid only for numerical data and does not consider text data. Since personal privacy data in an information system are generally of text type, the above approach is not suitable for protecting personal privacy in a CloudDB.

Recently, many studies on data security in cloud databases have been presented. Li et al. [19] discussed the problem of privacy-preserving range query processing on clouds, and presented a fast range query processing scheme that organizes indexing elements in a complete binary tree. Wai et al. [28] addressed security issues in a cloud database system, and proposed a secure query processing scheme on relational tables together with a set of elementary operators on encrypted data, which allows a wide range of database queries to be processed by the server over encrypted data. Chen et al. [9] proposed an efficient privacy- and integrity-preserving scheme for multi-dimensional range queries over cloud computing. Luca et al. [23] proposed an architecture that integrates cloud database services with data confidentiality and the possibility of executing concurrent operations on encrypted data; this is the first solution that supports geographically distributed clients connecting directly to an encrypted cloud database and executing concurrent and independent operations. Recent work [6] proposed a general framework for boolean queries in disjunctive normal form on encrypted data. Although all these approaches are proposed for cloud databases, most of them focus on building a generally secure cloud database; based on them, we cannot build an effective CloudDB that supports a variety of database query operations over privacy data (such as text similarity queries and range queries).

Some researchers have also proposed to split sensitive data among multiple servers to ensure data security. In [12], a scheme for vertical partitioning of relations among multiple untrusted servers was employed, whose privacy goal is to prevent any single server from accessing a given subset of attributes. Aggarwal et al. [1] used a similar vertical partitioning scheme with the same privacy goal but different partitioning and optimization algorithms. Wang et al. [24] used a salted version of the IDA scheme to split encrypted tuple data among multiple servers. In [26], a novel l-diversity privacy model was proposed for privacy preservation when releasing data for mining purposes. Recently, some researchers have also proposed hardware approaches to ensuring data security, such as TrustedDB [25], MONOMI [27] and Cipherbase [4]. The advantages of a hardware approach are that it provides strong security protection and does not limit query expressiveness; however, it requires reconstructing the system structure of a CloudDB. In addition, there are other related encryption techniques for spatial data [8, 21, 34].

From the above, we see that most existing approaches to data security protection in outsourced databases focus on constructing a secure framework, without fully taking into consideration the structure and type of sensitive data. As a result, if we applied these approaches directly to a CloudDB, it would be difficult to support a variety of similarity and range queries over encrypted privacy data. There are, however, some studies aimed specifically at querying encrypted textual data in a database. Wang et al. [29] proposed to turn a character string into characteristic values so as to support similarity queries. This approach can reduce the scope of data decryption, and thus improve query performance. However, it cannot handle similarity queries of the forms "LIKE '%s'" and "LIKE 's%'" well, and it cannot support range queries. Besides, owing to using only one characteristic function, the approach can hardly withstand statistical or inference attacks. By analyzing the traditional order-preserving encryption approach to numerical data, a fuzzy matching encryption approach for character strings was proposed in [18]: a character string is first transformed to numerical values, and an order-preserving encryption technique for numerical data from [14] is then used to encrypt the transformed values. To overcome the lack of range query support in [29], Wu et al. [33] defined a structure called the n-phase reachability matrix for a character string and used it as the characteristic index values; they then proposed to split a database query into a server-side and a client-side representation, partitioning the computation of a query across the client and the server and thus improving query performance. However, storing a complete reachability matrix is space-consuming.

3 Problem statement

3.1 System model

The system model used by our approach is presented in Figure 2. As shown in Figure 2, the system model consists of an index generator, a query translator and a query executor, whose processing flows can be briefly described as follows.

(1) Before being submitted to the cloud side, privacy data u have to be handled by the index generator, so as to generate the ciphertext E(u) and the feature index X(u), where E and X denote an encryption function and an index function, respectively.

(2) Each query operation \(q_u\) relevant to privacy data, before being submitted to the cloud side, has to be transformed into a new cloud-side query operation \(q_x\), which is defined over the feature index so that it can be executed correctly by the encrypted CloudDB. This transformation is completed by the query translator.

(3) The query executor decrypts the intermediate result \(R(q_x)\) returned from the cloud side, which is obtained by executing the cloud-side query operation \(q_x\) over the CloudDB, and then executes the user query operation \(q_u\) over the decrypted data \(\boldsymbol{D}(R(q_x))\) (where D denotes a decryption function), thereby obtaining the accurate result \(R(q_u)\) of \(q_u\).

(4) Meanwhile, each client side of a CloudDB also maintains an internal metadata structure that stores the parameter information of the above components. (A code sketch of this processing flow is given below.)
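To make this flow concrete, the following Python sketch shows how the index generator, query translator and query executor cooperate on a client side; the interfaces (db.insert, db.execute, q_u.matches) are hypothetical names for illustration, not prescribed by the paper.

```python
# A minimal sketch of the client-side flow, assuming hypothetical interfaces;
# it illustrates the system model, not the paper's actual implementation.

def store(record, encrypt, index, db, private_field="phone"):
    """Index generator: encrypt the whole tuple and index its private field."""
    ciphertext = encrypt(record)              # E(u), a black-box cipher
    feature = index(record[private_field])    # X(u), the feature index
    db.insert(ciphertext, feature)            # only E(u) and X(u) leave the client

def run_query(q_u, translate, decrypt, db):
    """Query translator and executor: run q_x remotely, refine locally."""
    q_x = translate(q_u)                  # user query -> cloud-side query
    coarse = db.execute(q_x)              # R(q_x): an encrypted superset
    rows = [decrypt(c) for c in coarse]   # D(R(q_x)), on the trusted client
    return [r for r in rows if q_u.matches(r)]  # the accurate result R(q_u)
```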

Figure 2: The system model used in our approach, where the arrows denote data processing flows

It can be seen that the system model is located on a client side (i.e., it is client-based), but it is transparent to the client side, i.e., it requires no change to existing software on the client side. In the system model, the cloud side is deemed untrusted, i.e., the adversary is located on the cloud side and has full access not only to the entire CloudDB, but also to all the database query operations issued from client sides. Thus, the adversary is deemed powerful and can amass a large quantity of plaintext, ciphertext and feature index information.

3.2 Problem analysis

In the system model, the encryption function E is built on an existing data encryption algorithm (e.g., AES [5]), so it is almost impossible for the adversary to infer the plaintext u from the encrypted data E(u), i.e., the security of the encrypted data itself is well protected on the untrusted cloud side. Thus, this paper no longer pays attention to the security of the encrypted data. From Figure 2, we know that the feature index is the key to protecting personal privacy effectively. In general, a good feature index scheme should satisfy the following requirements.

(1) Good security. On the cloud side, the indexes are visible to the adversary, so the feature indexes must ensure their own security, i.e., it should be difficult for the adversary to infer the plaintext u from the index X(u).

(2) Good usability. Based on the feature index, each common user query \(q_u\) over privacy data should be transformable into a cloud-side query \(q_x\) that can be executed over the encrypted CloudDB. The result returned by \(q_x\) must be a superset of the accurate result of \(q_u\), i.e., \(R(q_u) \subseteq \boldsymbol{D}(R(q_x))\).

(3) Good efficiency. On the cloud side, the query \(q_x\) should filter out as many of the non-target tuples (i.e., those that do not satisfy \(q_u\)) as possible, making the intermediate result \(R(q_x)\) as close to the accurate result \(R(q_u)\) as possible, so as to lighten the computation on the client side and thus improve the execution efficiency of \(q_u\).

However, it is difficult to meet these requirements simultaneously. On the one hand, good security requires the feature index to carry as little feature information about the privacy data as possible, so that the adversary cannot recover the original plaintext from the index. On the other hand, good usability and efficiency require the index to reflect as much feature information about the privacy data as possible. Thus, a good feature index is a reasonable compromise among security, usability and efficiency.

3.3 Problem definition

To simplify the presentation, below we use the symbol Θ to represent a privacy protection approach that runs on the system model of Section 3.1, and \(\boldsymbol{X}_{\Theta}\) to represent the index function used by Θ. Based on the analysis in Section 3.2, we now formulate the requirements that Θ has to satisfy in order to protect personal privacy effectively.

Let \(\mathcal{U}\) denote the domain of privacy data, and \(\mathcal{X}_{\Theta}\) the domain of index data generated by the approach Θ. Then, we have \(\mathcal{X}_{\Theta}=\left\{x \,|\, u \in \mathcal{U} \wedge x=\boldsymbol{X}_{\Theta}(u)\right\}\). As mentioned above, the adversary can amass a large quantity of plaintext and the corresponding feature index. Thus, (1) the prior knowledge that the adversary has mastered can be defined as a set of two-tuples from privacy data \(u \, (u \in \mathcal{U})\) to index data \(\boldsymbol{X}_{\Theta}(u)\); (2) the limit \(k_{\Theta}^{*}\) of the adversary's prior knowledge can be represented as \(k_{\Theta}^{*}=\left\{(u,x) \,|\, u \in \mathcal{U} \wedge x \in \mathcal{X}_{\Theta} \wedge x=\boldsymbol{X}_{\Theta}(u) \right\}\); and (3) the domain \(\mathcal{K}_{\Theta}\) of prior knowledge can be represented as \(\mathcal{K}_{\Theta}=2^{\,k_{\Theta}^{*}}\), i.e., the prior knowledge that an adversary has mastered is a subset of \(k_{\Theta}^{*}\). From experience, we know that (1) given any index data \(x \, (x \in \mathcal{X}_{\Theta})\), the probability that the adversary infers the plaintext from x depends mainly on the prior knowledge \(k_{\Theta} \, (k_{\Theta} \in \mathcal{K}_{\Theta})\) that the adversary has mastered, so we denote it as \(Pr(k_{\Theta})\) \((0 < Pr(k_{\Theta}) \leq 1)\); and (2) \(Pr(k_{\Theta}) \propto |k_{\Theta}|\), i.e., the probability of inferring the plaintext is proportional to the amount of prior knowledge mastered by the adversary. Now, the security of the approach Θ can be defined as follows.

Definition 1

Given a threshold λ (0 < λ ≤ 1), an approach Θ meets λ-security if and only if the probability that an adversary infers the plaintext from any index data established by Θ is always at most λ, regardless of the prior knowledge mastered by the adversary. Formally, Θ meets λ-security if and only if \(\forall k_{\Theta} \left(k_{\Theta} \in \mathcal{K}_{\Theta} \rightarrow Pr(k_{\Theta}) \leq \lambda \right)\), i.e., \(Pr(k_{\Theta}^{*}) \leq \lambda\) (since \(Pr(k_{\Theta}) \propto |k_{\Theta}|\)).

Let \(\mathcal{Q}_{u}\) denote the domain of user query operations relevant to privacy data. As mentioned in Section 3.2, good usability requires that each query \(q_{u} \, (q_{u} \in \mathcal{Q}_{u})\) can be transformed into a cloud-side query \(q_{x}\) such that \(R(q_{u}) \subseteq \boldsymbol{D}(R(q_{x}))\). However, \(\mathcal{Q}_{u}\) is an infinite set. Thus, we first define a core set of user query operations, and then define the usability of the approach Θ.

Definition 2

\(\mathcal {Q}_{u}^{*}\) is a core set of user queries relevant to privacy data if it meets: (1) \(\mathcal {Q}_{u}^{*} \subseteq \mathcal {Q}_{u}\); (2) \(\forall q_{1} \exists q_{2} (q_{1} \in \mathcal {Q}_{u} \wedge q_{2} \in \mathcal {Q}_{u}^{*} \rightarrow R(q_{1}) = R(q_{2}))\); and (3) \(\forall q_{1} \forall q_{2} (q_{1} \in \mathcal {Q}_{u}^{*} \wedge q_{2} \in \mathcal {Q}_{u}^{*} \wedge q_{1} \neq q_{2} \rightarrow R(q_{1}) \neq R(q_{2}))\).

Definition 3

An approach Θ meets usability if and only if any user query operation \(q_{u} \, (q_{u} \in \mathcal{Q}_{u}^{*})\) can be transformed into a cloud-side query operation \(q_{x}\) defined over the feature index \(\mathcal{X}_{\Theta}\), such that the accurate result \(R(q_{u})\) of \(q_{u}\) is contained in the intermediate result \(R(q_{x})\) returned by \(q_{x}\), i.e., \(R(q_{u}) \subseteq \boldsymbol{D}(R(q_{x}))\).

Let \(T(q_{u})\) denote all the tuples in the table related to a query operation \(q_{u} \, (q_{u} \in \mathcal{Q}_{u})\) (i.e., the table named in the FROM clause of \(q_{u}\)). Then, \(|T(q_{u}) - R(q_{u})|\) denotes the number of non-target tuples of \(q_{u}\), and \(|T(q_{u}) - \boldsymbol{D}(R(q_{x}))|\) (where \(q_{x}\) is the cloud-side query transformed from \(q_{u}\) by the approach Θ) denotes the number of non-target tuples filtered out on the cloud side by \(q_{x}\). As mentioned above, good efficiency requires that \(q_{x}\) filter out as many non-target tuples as possible, i.e., that \(|T(q_{u}) - \boldsymbol{D}(R(q_{x}))|\) be as close as possible to \(|T(q_{u}) - R(q_{u})|\). Below, we first define the filtering rate, and then define the efficiency of the approach Θ.

Definition 4

For any user query operation \(q_{u} \, (q_{u} \in \mathcal{Q}_{u}^{*})\), let \(q_{x}\) denote its cloud-side query operation generated by the approach Θ. Then, the filtering rate \(Fr_{\Theta}(q_{u})\) of Θ on the non-target tuples of \(q_{u}\) is defined as: \(Fr_{\Theta}\left(q_{u}\right)=\frac{|\,T(q_{u})-\boldsymbol{D}(R(q_{x}))\,|}{|\,T(q_{u})-R(q_{u})\,|}\).

Definition 5

Given a threshold μ (0 ≤ μ ≤ 1), an approach Θ meets μ-efficiency if and only if the mathematical expectation \(\sum\nolimits_{q_{u} \in \mathcal{Q}_{u}^{*}} Pr\left(q_{u}\right) \cdot Fr_{\Theta}\left(q_{u}\right) \geq \mu\), where \(Pr(q_{u})\) denotes the probability of a query \(q_{u} \, (q_{u} \in \mathcal{Q}_{u}^{*})\) being issued by external users, and \(\sum\nolimits_{q_{u} \in \mathcal{Q}_{u}^{*}} Pr\left(q_{u}\right)=1\).
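As a toy illustration of Definitions 4 and 5, the following snippet (with assumed result sets and query probabilities) computes a filtering rate and the expectation that μ-efficiency constrains.

```python
def filtering_rate(T, R_qu, D_R_qx):
    """Fr(q_u) = |T - D(R(q_x))| / |T - R(q_u)|, over sets of tuple ids."""
    return len(T - D_R_qx) / len(T - R_qu)

# A table of 10 tuples, 2 of which satisfy q_u; the cloud-side query q_x
# returns 4 tuples (a superset of the 2 targets), so 6 of the 8 non-target
# tuples were already filtered out on the cloud side.
T = set(range(10))
R_qu = {0, 1}
D_R_qx = {0, 1, 2, 3}
print(filtering_rate(T, R_qu, D_R_qx))    # (10-4)/(10-2) = 0.75

# mu-efficiency asks that the expectation of Fr over the core query set
# reach mu; the (Pr(q_u), Fr(q_u)) pairs below are assumed values.
queries = [(0.5, 0.75), (0.5, 0.95)]
print(sum(p * fr for p, fr in queries))   # 0.85, so 0.85-efficiency holds
```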

Now, based on Definitions 1, 3 and 5, we define the requirements that the approach Θ has to satisfy so as to protect personal privacy effectively.

Definition 6

Given two thresholds λ (0 < λ ≤ 1) and μ (0 ≤ μ ≤ 1), if an approach Θ meets λ-security, usability and μ-efficiency, then Θ effectively protects personal privacy in a CloudDB.

4 Privacy protection scheme

In this section, before introducing the approach to encrypting and indexing privacy data, we first show how the encrypted data and their feature indexes are stored in the CloudDB. Note that in a CloudDB, privacy data such as identification numbers, phone numbers, personal names and bank accounts are generally stored in text fields (i.e., fields of type CHAR or VARCHAR), so in our work privacy data of any type are treated uniformly as text (i.e., we take no account of privacy data of numeric type, which are out of the scope of this paper). We suppose that there exists a relational table \(R(A_{1}, A_{2}, ..., A_{r}, ...)\) in the CloudDB, where \(A_{r}\) is a field used to store privacy data and thus needs to be encrypted (\(A_{r}\) is called a private field; to simplify the presentation, we assume that there is only one private field \(A_{r}\) in the table R). Then, in the encrypted CloudDB, we store an encrypted relational table \(R^{E}\left(A^{E}, A_{1}, A_{2}, ..., {A_{r}^{X}}, ...\right)\) instead of R, wherein:

(1) The field \(A^{E}\) (called the encrypted field) stores an encrypted binary string (i.e., ciphertext) corresponding to a whole tuple of the table R (we explain how \(A^{E}\) is constructed in Section 4.2).

(2) The field \({A_{r}^{X}}\) (called the index field) stores the feature index for the private field \(A_{r}\); the type of \({A_{r}^{X}}\) is identical to that of \(A_{r}\), i.e., also CHAR or VARCHAR.

(3) The remaining fields in the encrypted table \(R^{E}\) are all consistent with those in the original table R.

Below, we present the privacy protection scheme used in our approach, i.e., how privacy data are encrypted and indexed so as to ensure their security on the untrusted cloud side. Specifically, we first show how to construct the feature index function X for privacy data. Second, we show how privacy data are encrypted, indexed and then stored in the encrypted CloudDB. Finally, we analyze the security of the scheme.

4.1 Feature index function

For any value u in the domain of the private field \(A_{r}\) of R, this subsection explains how u is mapped to the feature index value X(u) to be stored in the index field \({A_{r}^{X}}\) of \(R^{E}\), i.e., how the feature index function X is constructed. Suppose that each value in the domain of \(A_{r}\) contains no more than \(n \, (n \in \mathbb{N})\) characters; then \(A_{r}\) consists of n character units. Below, we use \(P_{i}\) (i = 1, 2, ..., n) to denote each character unit of \(A_{r}\), and \(\boldsymbol{dom}(P_{i})\) to denote the domain of values of \(P_{i}\). To construct a feature index function over \(A_{r}\), we need three steps: (1) automatically assigning a number \(\boldsymbol{num}(P_{i})\) to each character unit \(P_{i}\); (2) automatically dividing the domain \(\boldsymbol{dom}(P_{i})\) into \(\boldsymbol{num}(P_{i})\) partitions; and (3) automatically assigning a character identifier to each partition of \(P_{i}\). Below, we detail these steps and their implementations.

Algorithm 1

Step 1

Automatically assign a number \(\boldsymbol{num}(P_{i})\) (called a partition number) to each character unit \(P_{i}\) (i = 1, 2, ..., n) of the private field \(A_{r}\), meeting the following requirements:

(1) Each partition number \(\boldsymbol{num}(P_{i})\) has to be a positive integer no greater than the size of the domain \(\boldsymbol{dom}(P_{i})\), i.e., \(\boldsymbol{num}(P_{i}) \in \mathbb{N} \, \wedge \, \boldsymbol{num}(P_{i}) \leq |\boldsymbol{dom}(P_{i})|\).

(2) Together, the partition numbers of the private field \(A_{r}\) have to satisfy \(\rho {\prod}_{i=1}^{n} \boldsymbol{num}(P_{i}) \leq {\prod}_{i=1}^{n} |\boldsymbol{dom}(P_{i})|\), where ρ is a given factor.

In Step 1, ρ is called the security factor, with \(\rho \in \mathbb{N} \, \wedge \, \rho \leq {\prod}_{i=1}^{n} |\boldsymbol{dom}(P_{i})|\); it is preset for the private field \(A_{r}\) and controls the security of the generated feature index function. In general, the greater the value of ρ, the safer the generated index function X. A detailed analysis of how ρ impacts the security of X is presented in Section 4.3. Generally, there are many solutions satisfying the requirements of Step 1. In our approach, we use Algorithm 1 to perform Step 1, i.e., to automatically assign a group of partition numbers to all the character units of the private field \(A_{r}\). In Algorithm 1, the time complexity of Lines 4 and 5 is O(n) and the loop terminates after about log ρ iterations, so the time complexity of Algorithm 1 is O(n · log ρ). (A sketch reconstructing such a procedure is given after Example 1.)

Example 1

Consider a private field of phone number. Because a phone number in China generally consists of 11 numeric characters, the private field created for storing phone numbers also consists of 11 character units (n = 11). Besides, the first two numeric characters of a phone number can only be ‘13’, ‘15’ or ‘18’, so the domain of each character unit of the phone field is given as follows:

$${\boldsymbol{dom}(P_{1})=\{\text{`1'}\};\boldsymbol{dom}(P_{2})=\{\text{`3', `5', `8'}\}; \boldsymbol{dom}(P_{3})=...=\boldsymbol{dom}(P_{11})=\{\text{`0', `1', `2', ..., `9'}\}}$$

If the security factor ρ is set to 30, then using Algorithm 1, the partition number for each character unit of the phone field is assigned as follows:

$$\boldsymbol{num}(P_{1})=1;\boldsymbol{num}(P_{2})=3;\boldsymbol{num}(P_{3})=...=\boldsymbol{num}(P_{6})=10; \boldsymbol{num}(P_{7})=...=\boldsymbol{num}(P_{11})=5$$
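Algorithm 1 itself appears only as an image in the original. The following Python sketch is one plausible reconstruction: it satisfies both requirements of Step 1, matches the stated O(n · log ρ) bound, and happens to reproduce the assignment of Example 1; the right-to-left halving sweep is our assumption, not the paper's published pseudocode.

```python
import math

def assign_partition_numbers(dom_sizes, rho):
    """A plausible reconstruction of Algorithm 1: start from
    num(P_i) = |dom(P_i)| and repeatedly halve entries, sweeping from the
    last unit, until rho * prod(num) <= prod(dom).  Each sweep step costs
    O(n) and about log2(rho) halvings suffice, matching O(n * log rho).
    Assumes rho <= prod(dom), as required of the security factor."""
    num = list(dom_sizes)                    # num(P_i) <= |dom(P_i)| holds
    bound = math.prod(dom_sizes)
    i = len(num) - 1
    while rho * math.prod(num) > bound:      # requirement (2) of Step 1
        if num[i] > 1:
            num[i] = (num[i] + 1) // 2       # halve (rounding up), keep >= 1
        i = i - 1 if i > 0 else len(num) - 1 # sweep right to left, wrap
    return num

# Example 1: the phone field with rho = 30.
doms = [1, 3] + [10] * 9
print(assign_partition_numbers(doms, 30))
# -> [1, 3, 10, 10, 10, 10, 5, 5, 5, 5, 5], matching Example 1
```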
Algorithm 2

Step 2

Based on the \(\boldsymbol{num}(P_{i})\) assigned in Step 1 to each character unit of \(A_{r}\), we use some strategy (e.g., equi-width or equi-depth) to divide the domain \(\boldsymbol{dom}(P_{i})\) of \(P_{i}\) into \(\boldsymbol{num}(P_{i})\) subsets (called partitions). Let \(B_{k}^{(i)} \, (k=1, 2, ..., \boldsymbol{num}(P_{i}))\) denote a partition of \(P_{i}\). The partitions of \(P_{i}\) have to meet the following requirements:

(1) Each partition \(B_{k}^{(i)}\) is nonempty, i.e., \(\forall k (k \in \mathbb{N} \wedge k \leq \boldsymbol{num}(P_{i}) \rightarrow B_{k}^{(i)} \neq \emptyset)\).

(2) Any two distinct partitions \(B_{k}^{(i)}\) and \(B_{j}^{(i)}\) are disjoint, i.e., \(\forall k \forall j(k, j \in \mathbb{N} \wedge k, j \leq \boldsymbol{num}(P_{i}) \wedge k \neq j \rightarrow B_{k}^{(i)} \cap B_{j}^{(i)} = \emptyset)\).

(3) The union of all the partitions of \(P_{i}\) is \(\boldsymbol{dom}(P_{i})\), i.e., \(\bigcup_{k=1}^{\boldsymbol{num}(P_{i})} B_{k}^{(i)} = \boldsymbol{dom}(P_{i})\).

(4) Every element of \(B_{k}^{(i)}\) is greater than every element of \(B_{k-1}^{(i)}\), i.e., \(\forall k \forall a \forall b (k \in \mathbb{N} \wedge 2 \leq k \leq \boldsymbol{num}(P_{i}) \wedge a \in B_{k}^{(i)} \wedge b \in B_{k-1}^{(i)} \rightarrow a > b)\).

In our approach, we use Algorithm 2, developed based on an equi-width strategy, to perform Step 2, i.e., to automatically divide the domain \(\boldsymbol{dom}(P_{i})\) of each character unit \(P_{i}\) of the private field \(A_{r}\) into \(\boldsymbol{num}(P_{i})\) partitions. Line 9 of the inner loop scans all the elements of \(B_{k}^{(i)}\), so the time complexity of the inner loop is \(O(|\boldsymbol{dom}(P_{i})|)\), and hence the time complexity of Algorithm 2 is O(nα), where α denotes the average domain size of a character unit. (A sketch of such an equi-width split is given after Example 2.)

Example 2

Using the result of Example 1 as input, Algorithm 2 obtains a group of partitions for each character unit of the telephone number field, shown in the "partition" columns of Figure 3. The units \(P_{1}\) and \(P_{2}\) are divided into 1 and 3 partitions, respectively, and the units \(P_{3}\) to \(P_{6}\) and \(P_{7}\) to \(P_{11}\) are divided into 10 and 5 partitions, respectively.
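Algorithm 2 also appears only as an image in the original; a minimal equi-width sketch consistent with requirements (1) to (4) of Step 2 might look as follows (the rounding scheme is an assumption).

```python
def partition_domain(dom, k):
    """An equi-width sketch of Step 2: split the sorted domain of one
    character unit into k nonempty, disjoint, order-preserving buckets
    whose union is the whole domain."""
    dom = sorted(dom)                     # requirement (4): ordered buckets
    width = len(dom) / k                  # assumes k <= |dom|, per Step 1
    return [dom[round(j * width):round((j + 1) * width)] for j in range(k)]

# Example 2, unit P_7 of the phone field: ten digits into five partitions.
print(partition_domain("0123456789", 5))
# -> [['0','1'], ['2','3'], ['4','5'], ['6','7'], ['8','9']]
```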

Figure 3: The partitions and identifiers for a private field of telephone number (each number in the "partition" columns denotes a numeric character)

Algorithm 3

Step 3

For each partition \(B_{k}^{(i)}\) constructed in Step 2 for \(P_{i}\), we determine a character \(\boldsymbol{id}\left(B_{k}^{(i)}\right)\) as the identifier of \(B_{k}^{(i)}\), meeting the following requirements:

(1) The identifiers of any two partitions of a character unit \(P_{i}\) differ, i.e., \(\forall k \forall j \left(1 \leq k, j \leq \boldsymbol{num}(P_{i}) \wedge k \neq j \rightarrow \boldsymbol{id}\left(B_{k}^{(i)}\right) \neq \boldsymbol{id}\left(B_{j}^{(i)}\right)\right)\).

(2) The identifiers of all partitions of all units belong to the same range of values (where θ is a randomly generated character): \(\forall i \forall k \left(1 \leq i \leq n \wedge 1 \leq k \leq \boldsymbol{num}(P_{i}) \rightarrow 0 \leq \boldsymbol{id}\left(B_{k}^{(i)}\right) - \theta \leq \max_{j=1}^{n} \left(\boldsymbol{num}(P_{j}) - 1\right) \right)\).

We use Algorithm 3 to perform Step 3, i.e., to automatically assign a character identifier \(\boldsymbol{id}\left(B_{k}^{(i)}\right)\) to each partition \(B_{k}^{(i)}\) of each unit \(P_{i}\) of \(A_{r}\). The time complexity of the inner loop (Lines 5 to 6) is \(O(\boldsymbol{num}(P_{i}))\), so the time complexity of Algorithm 3 is O(nβ), where β denotes the average partition number of a character unit. (A sketch of such an assignment is given after Example 3.)

Example 3

Using the results of Examples 1 and 2 as input, Algorithm 3 assigns an identifier to each partition of each character unit of the telephone field; the output is shown in the "identifier" columns of Figure 3 (where θ is set to 'A'). Each partition receives an identifier within 'A' to 'J'.
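Algorithm 3 likewise appears only as an image in the original. The sketch below satisfies both requirements of Step 3; drawing the identifiers at random is an assumed design choice, though Figure 3 suggests they are indeed not assigned in domain order (the single partition of \(P_{1}\) receives 'H' rather than 'A').

```python
import random

def assign_identifiers(nums, theta="A"):
    """A sketch of Step 3: give each partition of unit P_i a distinct
    character from the shared range [theta, theta + max_j num(P_j) - 1];
    drawing them at random keeps the identifiers from betraying the order
    of the partitions they label."""
    pool = [chr(ord(theta) + d) for d in range(max(nums))]
    return [random.sample(pool, k) for k in nums]

# Example 3: with theta = 'A' and max num(P_j) = 10, every identifier
# falls within 'A'..'J'.
ids = assign_identifiers([1, 3, 10, 10, 10, 10, 5, 5, 5, 5, 5])
```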

Now, based on the partitions and identifiers produced by Steps 1 to 3, we define n mapping functions \(\boldsymbol{X}^{(1)}, \boldsymbol{X}^{(2)}, ..., \boldsymbol{X}^{(n)}\). Given any character \(u_{i} \in \boldsymbol{dom}(P_{i})\), the function \(\boldsymbol{X}^{(i)}\) maps \(u_{i}\) to the character identifier of the partition to which \(u_{i}\) belongs, i.e., \(\boldsymbol{X}^{(i)}(u_{i})=\boldsymbol{id}\left(B_{k}^{(i)}\right)\), where \(B_{k}^{(i)}\) is the partition containing \(u_{i}\). Furthermore, given any character string \(u = u_{1}u_{2}...u_{m}\) (m ≤ n) in the domain of the private field \(A_{r}\), we define the mapping function \(\boldsymbol{X}(u) = \boldsymbol{X}^{(1)}(u_{1})\boldsymbol{X}^{(2)}(u_{2})...\boldsymbol{X}^{(m)}(u_{m})\).

Example 4

Based on the partitions and identifiers shown in Figure 3, we can generate a feature index function X for the telephone number field. Given a telephone number u = '13587898721', X maps u to the index data X(u) = 'HBCBCIJIABA'. It can be seen that under our index construction scheme, the index data are always of the same type and length as the plaintext data.
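Given the outputs of Steps 2 and 3, assembling X reduces to per-unit table lookups; the sketch below (with illustrative names) shows the construction.

```python
def make_index_function(partitions, identifiers):
    """Assemble X from the per-unit partitions (Step 2) and identifiers
    (Step 3): X^(i) sends a character to the identifier of its bucket, and
    X concatenates X^(1)..X^(m), so X(u) keeps the length of u."""
    lookup = []                                  # one dict per unit P_i
    for parts, ids in zip(partitions, identifiers):
        table = {ch: ident                       # X^(i)(ch) = id(bucket)
                 for bucket, ident in zip(parts, ids) for ch in bucket}
        lookup.append(table)
    return lambda u: "".join(lookup[j][ch] for j, ch in enumerate(u))

# Fed the tables of Figure 3 (published only as an image), this function
# would reproduce Example 4: X('13587898721') == 'HBCBCIJIABA'.
```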

4.2 Encryption storage

Now, we describe how encrypted data and index data are stored in the CloudDB. Given any tuple \(t = \langle a_{1}, a_{2}, ..., a_{r}, ... \rangle\) over the relational table \(R(A_{1}, A_{2}, ..., A_{r}, ...)\) (where \(a_{r}\) is a privacy value of the private field \(A_{r}\)), the corresponding encrypted relational table \(R^{E}\left(A^{E},A_{1},A_{2}, ...,{A_{r}^{X}}, ...\right)\) in the CloudDB stores an encrypted tuple \(t^{E} = \langle \boldsymbol{E}(\langle a_{1}, a_{2}, ..., a_{r}, ... \rangle), a_{1}, a_{2}, ..., \boldsymbol{X}(a_{r}), ... \rangle\), where E is the function used to encrypt a whole tuple of R. We treat the encryption function as a black box, so any well-known data encryption technique (e.g., AES [5]) can be used.

Example 5

Let us consider a relation persons(no, name, phone, ...) (see the left of Figure 4). The CloudDB then stores an encrypted relation \(\text{persons}^{E}(\text{tuple}^{E}, \text{no}, \text{name}, \text{phone}^{X}, ...)\) (see the right of Figure 4), where the first column \(\text{tuple}^{E}\) stores the binary strings corresponding to the encrypted tuples. For example, the first tuple of persons is encrypted to '110111000011...', which is obtained by \(\boldsymbol{E}(\langle 1, \text{`Ada'}, \text{`13587898721'}, ... \rangle)\). Moreover, the column \(\text{phone}^{X}\) in \(\text{persons}^{E}\) is the index field corresponding to the private field phone in persons.

Figure 4: The translation from a relational table to its encrypted relational table

Note that in our approach, tasks such as encrypting privacy data and generating their feature index values are all completed on a trusted client side by the index generator; the encrypted data and their index data are then transmitted over the network and stored in the CloudDB (see Figure 2). A sketch of this storage path is given below.
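The paper treats E as a black box, so in the sketch below the cryptography package's Fernet cipher (AES-based authenticated encryption) stands in for it, and the tuple serialization is an assumption of ours.

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()        # kept in the client-side metadata
E = Fernet(key)                    # stand-in for the black-box cipher E

def to_encrypted_tuple(t, private_pos, X):
    """t -> <E(t), a_1, ..., X(a_r), ...>, mirroring the table R^E."""
    cipher = E.encrypt(repr(t).encode())      # the A^E column (whole tuple)
    rest = [X(a) if j == private_pos else a   # the index replaces plaintext
            for j, a in enumerate(t)]
    return (cipher, *rest)

# e.g. to_encrypted_tuple((1, 'Ada', '13587898721'), 2, X)
#      -> (b'gAAAA...', 1, 'Ada', 'HBCBCIJIABA')
```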

4.3 Security analysis

In this subsection, we first demonstrate that our approach meets λ-security, and then briefly analyze the security of the index function generated by our approach against two common types of attacks: statistical attack and known-plaintext attack.

Observation 1

Given a threshold λ (0 < λ ≤ 1), after the security factor ρ of the privacy protection scheme is set to ⌊1/λ⌋, our approach can meet λ-security.

Rationale

Based on the index function X constructed in Section 4.1, we know that X is a many-to-one mapping from the privacy domain \(\mathcal{U}\) to the index domain \(\mathcal{X}\); moreover, each index value in \(\mathcal{X}\) corresponds to about ρ privacy values in \(\mathcal{U}\). Thus, if the security factor ρ is set to ⌊1/λ⌋, each index value in \(\mathcal{X}\) corresponds to ⌊1/λ⌋ privacy values in \(\mathcal{U}\), i.e., the probability of inferring the plaintext from any index value is at most λ, even if the adversary has mastered the index function X itself. Combining this with Definition 1, we know that our approach meets λ-security if the security factor ρ is set to ⌊1/λ⌋.

The precondition of a statistical attack is that the attacker knows a set \(\mathcal{U}^{*}\) of privacy data and a set \(\mathcal{X}^{*}\) of corresponding index data. The attacker then attempts to establish a set of two-tuples from \(\mathcal{U}^{*}\) to \(\mathcal{X}^{*}\) so as to reconstruct the index function X. The attack rests on the observation that the frequency of each u in \(\mathcal{U}^{*}\) is basically consistent with that of X(u) in \(\mathcal{X}^{*}\). However, in our approach, each unit of privacy data is divided into several partitions by some strategy (e.g., equi-width or equi-depth), so that many privacy values are mapped to the same index value, weakening the consistency between the probability distributions of privacy data and index data. The toy example below illustrates this flattening effect.
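The sample string and partition mapping here are assumed for illustration:

```python
from collections import Counter

# Two characters with different frequencies share one partition, so the
# index exposes only their merged frequency.
sample = "8888899999000001"               # assumed values of one character unit
X_unit = {"8": "B", "9": "B",             # partition {'8','9'} -> 'B'
          "0": "A", "1": "A"}             # partition {'0','1'} -> 'A'
print(Counter(sample))                    # 8:5, 9:5, 0:5, 1:1 -- distinct
print(Counter(X_unit[c] for c in sample)) # B:10, A:6 -- flattened histogram
```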

The precondition of a known-plaintext attack is that the attacker knows a small set of two-tuples from privacy data \(\mathcal{U}^{*}\) to index data \(\mathcal{X}^{*}\), and attempts to reconstruct the index function X from them. From Section 4.1, we know that reconstructing X requires knowing the partitions of each unit of privacy data. For the unit \(P_{i}\), at least \(|\boldsymbol{dom}(P_{i})|\) two-tuples are needed to reconstruct \(\boldsymbol{X}^{(i)}\); therefore, at least about \({\sum}_{i=1}^{n} |\boldsymbol{dom}(P_{i})|\) two-tuples are needed to reconstruct X. Finally, it should be pointed out that it is still possible for an attacker to guess the function X using a statistical or known-plaintext attack, especially when the threshold λ is set to a great value (e.g., 1.0, which forces ρ = 1).

However, even if an attacker has completely mastered the index function X through a statistical or known-plaintext attack, it is still difficult to recover the plaintext u from a given index value x: since our approach meets λ-security, the attacker has only a \(1/\lfloor 1/\lambda \rfloor\) (i.e., at most λ) probability of guessing the plaintext u corresponding to x. Besides, although our approach meets λ-security, the index values it generates may still reveal some sensitive information to the cloud side, e.g., the length of a private value, since an index value has the same length as the corresponding private value. Improving the feature index scheme to make the index safer (beyond meeting λ-security) is left for future work.

5 Privacy query scheme

In our approach, privacy data are encrypted before being stored in the CloudDB so as to ensure their security. As a consequence, however, user query operations defined over the private field can no longer be executed correctly on the encrypted CloudDB. In this section, we discuss the privacy query scheme used in our approach, i.e., how each database query \(q_{u}\) over the private field \(A_{r}\) is mapped to a new cloud-side query \(q_{x}\) over the corresponding index field \({A_{r}^{X}}\), so that \(q_{x}\) can be executed correctly on the CloudDB. To this end, we first discuss how each type of basic query condition over the private field is mapped to its cloud-side representation over the index field. Second, based on these condition mappings, we discuss how a database query \(q_{u}\) is transformed into its cloud-side query \(q_{x}\). Finally, we analyze the usability and efficiency of the proposed scheme.

5.1 Mapping query conditions

A database query operation consists of several basic query conditions. Thus, once we know how each type of basic query condition over the private field is correctly mapped to its cloud-side representation, we know how a whole database query operation is mapped to its cloud-side query operation. In this subsection, we consider three main types of basic query conditions over the private field \(A_{r}\): (1) equivalence conditions, e.g., \(R.A_{r}\) = '123'; (2) similarity conditions, e.g., \(R.A_{r}\) LIKE '%123%'; and (3) range conditions, e.g., \(R.A_{r}\) > '123'. Below, we refer to the process of mapping a basic query condition to its cloud-side representation as a condition mapping for short, and use map to denote such a mapping. Besides, we use the table of Example 5 and the index function generated in Example 4 to illustrate the condition mappings.

Mapping 1

\(R.A_{r} = u\): this is the basic form of an equivalence condition, where u denotes a character string constant, and \(R.A_{r}\) denotes a private field of a relational table R. If \(u = u_{1}u_{2}...u_{m}\) (m ≤ n), then the condition mapping is defined as follows:

$$\boldsymbol{map}(R.A_{r} = u) \Rightarrow R^{E}.{A_{r}^{X}} = \boldsymbol{X}(u) \Rightarrow R^{E}.{A_{r}^{X}} = \boldsymbol{X}^{(1)}(u_{1})\boldsymbol{X}^{(2)}(u_{2})...\boldsymbol{X}^{(m)}(u_{m}). $$

For example, since ‘13587898721’ would be mapped into ‘HBCBCIJIABA’ by the index function generated in Section 4, we have a condition mapping as follows:

$$\boldsymbol{map}\left( \text{persons.phone} = \text{`13587898721'}\right) \Rightarrow \text{persons}^{E}.\text{phone}^{X}=\text{`HBCBCIJIABA'} $$

A similarity query condition generally contains wildcards: (1) '%', which matches one or more characters; (2) '_', which matches exactly one character; and (3) '[CharList]', which matches any single character listed in CharList. Below, we present the mappings for three main types of similarity conditions.

Mapping 2

\(R.A_{r}\) LIKE u_v: this is the basic form of a similarity condition based on the wildcard '_', where u and v represent two character string constants. If \(u = u_{1}u_{2}...u_{m}\) and \(v = v_{1}v_{2}...v_{k}\) (m + k ≤ n − 1; 0 ≤ m; 0 ≤ k), then the condition mapping is defined as follows:

$$\begin{array}{llllll} \boldsymbol{map}(R.A_{r} \,\, \text{LIKE} \,\, u\_v) \Rightarrow R^{E}.{A_{r}^{X}} \,\, \text{LIKE} \,\, u^{X}\_\,v^{X} \text{, where } \left\{ \begin{array}{llllll} & u^{X} = \boldsymbol{X}^{(1)}(u_{1})...\boldsymbol{X}^{(m)}(u_{m})\\ & v^{X} = \boldsymbol{X}^{(m+2)}(v_{1})...\boldsymbol{X}^{(m+k+1)}(v_{k}) \end{array} \right. \end{array} $$

For example, we have a mapping of similarity condition as follows:

$$\begin{array}{lllll} \boldsymbol{map}\left( \text{persons.phone LIKE `1358789\_721'}\right) \Rightarrow \text{persons}^{E}.\text{phone}^{X} \text{LIKE `HBCBCIJ\_ABA'} \end{array} $$

Mapping 3

\(R.A_{r}\) LIKE u[l]v: this is the basic form of a similarity condition based on the wildcard '[]', where u and v denote two character string constants, and l denotes a character list. If \(l = l_{1}l_{2}...l_{t}\), \(u = u_{1}u_{2}...u_{m}\) and \(v = v_{1}v_{2}...v_{k}\) (m + k ≤ n − 1; 0 ≤ m; 0 ≤ k; 0 ≤ t), then the condition mapping is defined as follows:

$$\begin{array}{lllllll} \boldsymbol{map}\left( R.A_{r} \,\, \text{LIKE} \,\, u[l]v\right) \Rightarrow R^{E}.{A_{r}^{X}} \,\, \text{LIKE} \,\, u^{X}[l^{X}]v^{X} \text{, where } \left\{ \begin{array}{lllllll} & u^{X} = \boldsymbol{X}^{(1)}(u_{1})...\boldsymbol{X}^{(m)}(u_{m})\\ & l^{X} = \boldsymbol{X}^{(m+1)}(l_{1})...\boldsymbol{X}^{(m+1)}(l_{t})\\ & v^{X} = \boldsymbol{X}^{(m+2)}(v_{1})...\boldsymbol{X}^{(m+k+1)}(v_{k}) \end{array} \right. \end{array} $$

For example, since each character in the list ‘[789]’ is respectively mapped to ‘H’, ‘I’ and ‘I’ by the index function X, we have a condition mapping as follows:

$$\begin{array}{llllll} \boldsymbol{map}\left( \text{persons.phone LIKE `1358789[789]721'}\right) \Rightarrow \text{persons}^{E}.\text{phone}^{X} \text{LIKE `HBCBCIJ[HII]ABA'} \end{array} $$

Mapping 4

\(R.A_{r}\) LIKE u%v: this is the basic form of a similarity condition based on the wildcard '%', where \(u = u_{1}u_{2}...u_{m}\) and \(v = v_{1}v_{2}...v_{k}\) (m + k ≤ n − 1; 0 ≤ m; 0 ≤ k). Since the wildcard '%' matches one or more characters, it is equivalent to '_' (one character), '__' (two characters), and so on, and the maximum number of '_' is n − m − k. Based on this observation, and with the help of Mapping 2, the condition mapping is defined as follows (a code sketch of this expansion follows the examples below):

$$\begin{array}{lllllll} \boldsymbol{map}\left( R.A_{r} \,\, \text{LIKE} \,\, u\%v\right) \Rightarrow \mathop{\text{OR}}_{i=1}^{n-m-k} R^{E}.{A_{r}^{X}} \, \text{LIKE} \, u^{X}\overbrace{\_...}^{i}{v_{i}^{X}} \text{, where } \left\{ \begin{array}{lllllll} & u^{X} = \boldsymbol{X}^{(1)}(u_{1})...\boldsymbol{X}^{(m)}(u_{m})\\ & {v_{i}^{X}} = \boldsymbol{X}^{(m+i+1)}(v_{1})...\boldsymbol{X}^{(m+i+k)}(v_{k}) \end{array} \right. \end{array}$$

In particular, if k = 0, the condition mapping reduces to:

$$\boldsymbol{map}\left( R.A_{r} \, \text{LIKE} \, u\%\right) \Rightarrow R^{E}.{A_{r}^{X}} \, \text{LIKE} \, \boldsymbol{X}^{(1)}(u_{1})...\boldsymbol{X}^{(m)}(u_{m})\%$$

Similarly, if m = 0, it reduces to:

$$\boldsymbol{map}\left( R.A_{r} \, \text{LIKE} \, \%v\right) \Rightarrow R^{E}.{A_{r}^{X}} \, \text{LIKE} \, \%\boldsymbol{X}^{(n-k+1)}(v_{1})...\boldsymbol{X}^{(n)}(v_{k}) $$

For example, we have the following three similarity condition mappings:

$$\begin{array}{llllll} \boldsymbol{map}\left( \text{persons.phone LIKE `135\%'}\right) \Rightarrow& \text{persons}^{E}.\text{phone}^{X} \text{LIKE `HBC\%'}\\ \boldsymbol{map}\left( \text{persons.phone LIKE `\%721'}\right) \Rightarrow& \text{persons}^{E}.\text{phone}^{X} \text{LIKE `\%ABA'}\\ \boldsymbol{map}\left( \text{persons.phone LIKE `135\%898721'}\right) \Rightarrow& \text{persons}^{E}.\text{phone}^{X} \text{LIKE} \text{`HBC\_BJJHGJ' OR } \\ &\text{persons}^{E}.\text{phone}^{X} \text{LIKE `HBC\_\_IJIABA'} \end{array} $$
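The expansion behind Mapping 4 can be sketched as follows; X_units, a list of per-unit mapping functions standing for X^(1)..X^(n), is an illustrative interface rather than the paper's API.

```python
def map_like_percent(u, v, X_units, n):
    """A sketch of Mapping 4: expand 'u%v' into patterns with i = 1..n-m-k
    underscores, shifting v's character units by m+i positions; the
    resulting patterns are OR-ed together in q_x."""
    m, k = len(u), len(v)
    ux = "".join(X_units[j](u[j]) for j in range(m))
    if k == 0:                                     # special case 'u%'
        return [ux + "%"]
    if m == 0:                                     # special case '%v'
        return ["%" + "".join(X_units[n - k + j](v[j]) for j in range(k))]
    patterns = []
    for i in range(1, n - m - k + 1):
        vx = "".join(X_units[m + i + j](v[j]) for j in range(k))
        patterns.append(ux + "_" * i + vx)
    return patterns

# With the Figure 3 index and n = 11, map_like_percent('135', '898721', ...)
# yields the two OR-ed patterns shown above for '135%898721'.
```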

Mapping 5

R.A r u: this is the basic form of a range query condition. Without loss of generality, we assume that u = u 1 u 2...u m (mn) and note v i as a character of the greatest value in the character unit P i (i = 1, 2, ..., n), i.e., ∀v (v dom(P i ) → v v i ). Then, any character string \(u^{*} = u_{1}^{*}u_{2}^{*}... u_{h}^{*} (1 \leq h)\) is greater than u, if and only if it satisfying that: \((u_{1} = u_{1}^{*}, u_{2} = u_{2}^{*}, ..., u_{m} = u_{m}^{*}, m \leq h)\); or \((u_{1} \leq u_{1}^{*}+1)\); or (u 1 = u1∗, u 2u2∗ + 1; or ...; or \((u_{1} = u_{1}^{*}, u_{2} = u_{2}^{*}, ..., u_{k-1} = u_{k-1}^{*}, u_{k} \leq u_{k}^{*}+1)\), where k is equal to m (if mh) or h (if m > h). Based on such an observation, the range condition mapping is defined as follows:

$$\boldsymbol{map}\left( R.A_{r} \geq u\right) \Rightarrow \mathop{\text{OR}}_{i=1}^{m} \boldsymbol{map}\left( R.A_{r} \,\, \text{LIKE} \,\, u_{1}u_{2}...u_{i-1}[(u_{i}+1)-v_{i}]\,\%\right) \,\, \text{OR} \,\, \boldsymbol{map}\left( R.A_{r} \,\, \text{LIKE} \,\, u\,\%\right) $$

For example, we have the following range condition mapping:

$$\begin{array}{lllllll} \boldsymbol{map}\left( \text{persons.phone} \geq \text{`13587'}\right) \Rightarrow &\text{persons}^{E}.\text{phone}^{X} \text{LIKE `HBCBC\%' OR}\\ &\text{persons}^{E}.\text{phone}^{X} \text{LIKE `H[FG]\%' OR }\\&\text{persons}^{E}.\text{phone}^{X} \text{LIKE `HB[DEFG]\%' OR}\\ &\text{persons}^{E}.\text{phone}^{X} \text{LIKE `HBC[A]\%' OR }\\&\text{persons}^{E}.\text{phone}^{X} \text{LIKE `HBCB[BA]\%'} \end{array} $$

Besides, a similar mapping can be defined for the other range condition, \(R.A_{r} < u\). (A sketch of the mapping above is given below.)
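Mapping 5 can be sketched in the same style, again with the illustrative X_units interface; doms supplies each unit's ordered domain, whose maximum plays the role of \(v_{i}\).

```python
def map_ge(u, X_units, doms):
    """A sketch of Mapping 5 for R.A_r >= u: for each prefix u_1..u_{i-1},
    allow the i-th character to exceed u_i (the class [(u_i+1)-v_i]), plus
    the pattern with u itself as a prefix."""
    patterns = []
    for i in range(len(u)):
        prefix = "".join(X_units[j](u[j]) for j in range(i))
        bigger = [c for c in doms[i] if c > u[i]]  # chars strictly above u_i
        if bigger:                                 # drop empty char classes
            ids = sorted({X_units[i](c) for c in bigger})
            patterns.append(prefix + "[" + "".join(ids) + "]%")
    patterns.append("".join(X_units[j](u[j]) for j in range(len(u))) + "%")
    return patterns                                # OR-ed together in q_x

# For u = '13587' and the Figure 3 index this yields the five OR-ed
# patterns of the example above (class order inside [] is immaterial).
```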

Above, we have described the condition mappings for the three main types of basic query conditions defined over privacy data. It can be seen that all the cloud-side condition representations are defined over the index field \(R^{E}.{A_{r}^{X}}\), and thus can be executed on the encrypted CloudDB. Besides, note that each cloud-side representation \(p_{x}\) is a necessary condition of the corresponding query representation \(p_{u}\) over privacy data, i.e., the result of executing \(p_{x}\) on the CloudDB is a superset of that of \(p_{u}\) (i.e., \(R(p_{u}) \subseteq \boldsymbol{D}(R(p_{x}))\)).

5.2 Privacy query processing

For any SELECT query operation from a client side, we first obtain from its WHERE clause all the basic query conditions defined over the private field. Second, based on the condition mappings above (Mappings 1 to 5), we map each basic query condition to a new condition representation defined over the corresponding index field of the CloudDB. Finally, we combine all the new condition representations to form a new cloud-side query operation. Note that all this work is completed by the query translator running on a trusted client side (see Figure 2).

Example 6

Consider the relation persons and the encrypted relation \(\text{persons}^{E}\) of Example 5. First, we issue a SQL query operation \(q_{u}\) as follows:

SELECT p.no, p.name FROM persons p WHERE p.phone = ‘15858707069’ OR p.phone LIKE ‘1358789_721’.

Then, based on the condition mappings, the query operation \(q_{u}\) is mapped to a new cloud-side query operation \(q_{x}\) defined over the index field \(\text{persons}^{E}.\text{phone}^{X}\):

SELECT p.tupleE FROM personsE p WHERE p.phoneX = 'HFFHBCHABCC' OR p.phoneX LIKE 'HBCBCIJ_ABA'.

Finally, the cloud-side query \(q_{x}\), instead of \(q_{u}\), is submitted to the cloud side. After \(q_{x}\) is executed by the CloudDB, a set \(R(q_{x})\) of encrypted tuples is returned to the client side, which is a superset of the result \(R(q_{u})\) of \(q_{u}\). Then, on the client side, the query executor decrypts the set \(R(q_{x})\) of encrypted tuples and executes the original query \(q_{u}\) over the decrypted tuples again, so as to obtain the accurate query result \(R(q_{u})\).

From the above, we can see that for a user query \(q_{u}\), both the corresponding cloud-side query \(q_{x}\) and the intermediate result \(R(q_{x})\) are revealed to the untrusted cloud side. However, neither is plaintext: the result \(R(q_{x})\) is in the form of ciphertext, and the conditions of \(q_{x}\) are defined over the index fields. Therefore, although \(R(q_{x})\) and \(q_{x}\) are visible to the cloud side, it is difficult for the cloud side to infer private information from them.

5.3 Usability and efficiency analysis

In this subsection, based on Definitions 2 and 3, we use several observations to demonstrate the usability and efficiency of the proposed approach.

Observation 2

Let \(\mathcal{P}_{u}\) denote the set of all basic query conditions defined over the private field \(R.A_{r}\). Then, any query requirement relevant to \(R.A_{r}\) can be described by a logical expression (i.e., one connected by NOT, AND and OR) over the basic query conditions in \(\mathcal{P}_{u}\).

Observation 3

Let \(p_{1}\) and \(p_{2}\) denote two basic query conditions over the private field \(R.A_{r}\), and \(p_{1}^{*}\) and \(p_{2}^{*}\) the corresponding cloud-side representations. Then, the result of executing the condition "\(p_{1}^{*}\) AND \(p_{2}^{*}\)" on the CloudDB is a superset of that of "\(p_{1}\) AND \(p_{2}\)", i.e., \(R(p_{1} \text{ AND } p_{2})\subseteq \boldsymbol{D}(R(p_{1}^{*} \text{ AND } p_{2}^{*}))\). Similarly, we have \(R(p_{1} \text{ OR } p_{2})\subseteq \boldsymbol{D}(R(p_{1}^{*} \text{ OR } p_{2}^{*}))\).

Rationale

Both observations above follow readily from the fundamentals of logic algebra and relational algebra.

Observation 3 does not mention NOT conditions. In fact, based on Mappings 1 to 5, we know that \(R(\text{NOT }p_{1})\nsubseteq \boldsymbol{D}(R(\text{NOT }p_{1}^{*}))\), i.e., our approach cannot support the NOT logical operation directly. However, for any NOT condition we can generally construct an equivalent positive condition; e.g., "NOT p.phone > '13587898721'" is equivalent to "p.phone <= '13587898721'". Thus, based on Observations 2 and 3, we have the following observation.

Observation 4

The privacy query scheme used in our approach meets usability, i.e., any user query operation \(q_{u}\) over private data can be transformed into a cloud-side query operation \(q_{x}\) defined over the corresponding index data, such that the result \(R(q_{u})\) of \(q_{u}\) is contained in the result \(R(q_{x})\) returned by \(q_{x}\), i.e., \(R(q_{u}) \subseteq \boldsymbol{D}(R(q_{x}))\).

Observation 5

Given any threshold μ (0 < μ ≤ 1), after setting a suitable value for the security factor ρ of the privacy protection scheme, our approach can meet μ-efficiency.

Rationale

Consider the extreme case where the security factor ρ = 1. Then the index function becomes a one-to-one mapping, so all the non-target tuples are filtered out by the cloud-side query operations, i.e., the approach meets 1.0-efficiency. More generally, decreasing ρ increases the filtering rate, so for any given threshold μ (0 ≤ μ ≤ 1), a suitable value of ρ makes our approach meet μ-efficiency.

Based on Observations 1 and 5, we conclude that the security factor ρ is proportional to the security of our approach, but inversely proportional to its efficiency.

6 Experiment evaluation

In Section 5.3, we demonstrated the efficiency of our approach by theoretical analysis. In this section, we evaluate its efficiency by experiments, i.e., we evaluate the filtering rate (see Definition 4) of the cloud-side query operations generated by our approach in filtering out non-target tuples in the CloudDB.

6.1 Experimental setup

Before the experimental evaluation, we briefly describe the experimental setup, including the dataset preparation, the user queries and the system configuration.

(1) Dataset preparation. To perform the experiments, we constructed in advance a database containing a single relational table persons. The schema of persons is similar to that of Example 5, but it contains two private fields, phone and name. Table 1 presents information about these two private fields. We then randomly generated one million tuples for persons (i.e., the database size is on the order of one million tuples), where the private field values were generated according to the two regular expressions in the fifth column of Table 1. As shown in Table 1, each value of the field name is defined over a set of 100 Chinese characters and consists of three to five characters.

(2) User query operations. Table 2 presents the basic query conditions used in our experiments, covering the general cases of the two main types of basic queries (basic similarity queries and basic range queries) over the private field phone or name. Note that equivalence queries can be regarded as a special kind of similarity query. Other, more complex similarity or range queries can be composed from these basic query operations.

(3) System configuration. The experiments were conducted on two Lenovo personal computers with an Intel(R) Core(TM) i7-4510U CPU and 8 GB RAM, one acting as the cloud side and the other as the client side. The network speed between the computers is about 2.0 MB/s, and the disk speed is about 200 MB/s. We used Microsoft Windows 7 as the operating system and MySQL (version 5.7.17) as the database system.

Table 1 The information about the private fields that need to be encrypted
Table 2 The similarity and range conditions used in the experiments, where \(A_{1}\), \(A_{2}\) and \(A_{3}\) denote three characters

6.2 Efficiency evaluation and analysis

In the experiments, we used the metric Fr (the filtering rate defined in Section 3.3) to evaluate the efficiency of the approach. For similarity queries, we conducted two groups of experiments over the private fields phone and name, respectively, with different values of the security factor ρ. The experimental results are shown in Table 3, where each value is the average of 10 runs.

Table 3 The Fr values for different similarity query conditions over the private field “phone” or “name” (ρ ranges from 2^9 to 2^21)

From Table 3, we make the following four observations. First, as the security factor ρ increases, the Fr value decreases, i.e., the cloud-side query operations become less effective at filtering non-target tuples, and thus the efficiency of the approach is reduced accordingly. The reason is that a larger ρ increases the number of distinct private field values mapped to the same index field value by the index function, which decreases the probability that a non-target tuple is filtered. Second, different similarity conditions lead to different trends in the Fr values: the Fr values increase with the amount of information contained in the similarity matching conditions (PL1-PL3 and NL1-NL3). This is because more informative matching conditions decrease the number of tuples returned by the cloud-side query (i.e., R(q_x) in Definition 4), thereby increasing the Fr values. Third, each Fr value for “name” is generally greater than the corresponding value for “phone”, which is caused by the larger value domain of the private field “name”. Finally, the mathematical expectation of the Fr values for the similarity query operations over “phone” is 0.86222 (assuming each similarity query operation occurs with the same probability), and that over “name” is 0.98183. Consequently, most non-target tuples are filtered at the cloud side, which reduces the number of encrypted tuples transmitted from the cloud side to the client and thereby improves the efficiency of similarity query operations.
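
The two expectations above are equal-weight averages of the per-condition Fr values. The following sketch makes this computation explicit under an assumed reading of Definition 4, namely that Fr is the fraction of non-target tuples filtered out at the cloud side; the function name and the sample values are ours, not the paper’s measurements.

    def filtering_rate(n_total: int, cloud_returned: int, targets: int) -> float:
        """Assumed reading of Definition 4: the fraction of non-target tuples
        filtered out by the cloud-side query, where cloud_returned = |R(q_x)|,
        targets = |R(q)|, and the cloud-side result always contains the targets."""
        non_targets = n_total - targets
        if non_targets == 0:
            return 1.0
        return 1 - (cloud_returned - targets) / non_targets

    # Equal-weight expectation over the query conditions, as in the text.
    # These Fr values are placeholders, not the paper's measurements.
    fr_values = [0.91, 0.86, 0.82]
    expected_fr = sum(fr_values) / len(fr_values)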

For range queries, we also conducted two groups of experiments over the private fields “phone” and “name”, respectively. The experimental results are presented in Table 4 and are, in general, similar to those for similarity queries: (1) increasing the security factor ρ decreases the effectiveness of the cloud-side query operations at filtering non-target tuples of the CloudDB, thereby decreasing the efficiency of the approach; (2) the larger value domain of the private field “name” means that each Fr value for “name” is generally greater than the corresponding value for “phone”; and (3) the mathematical expectations over “phone” and “name” are 0.75003 and 0.96468, respectively, i.e., most non-target tuples can be filtered by the cloud-side query operations, improving the efficiency of range query operations.

Table 4 The Fr values for different range query conditions over the private field “phone” or “name” (ρ ranges from 2^9 to 2^21)

In addition, we conducted experiments to evaluate the actual execution performance of our approach. We compared our approach (denoted by “We”) with the following two baselines: (1) decrypting the encrypted data before querying them (denoted by “Bw”); and (2) querying data without encryption (denoted by “Bn”, i.e., storing plaintext directly in the CloudDB). The execution time of our approach is computed as the sum of: (1) the time to execute a cloud-side query on the CloudDB and transmit the encrypted data from the cloud side to the client, i.e., the time consumed on the cloud side; and (2) the time to decrypt and query the data on the client, i.e., the time consumed on the client side. The experiments were performed with the basic similarity queries “PL2” and “NL2”, and the basic range queries “PR2” and “NR2”. The experimental results are shown in Figure 5, where the security factor ρ is set to 2^15. From the two subfigures, we see that, based on the feature index generated by our approach, the overall execution performance of similarity and range queries over the private fields is improved effectively: compared with “Bw”, the overall execution time of a basic similarity query is reduced to about 0.3 times, and that of a basic range query to about 0.6 times. We also see that the execution time of our approach is almost twice that of “Bn”, mainly because the volume of the encrypted tuples is greater than that of the original (unencrypted) tuples.
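
The total cost just described (cloud-side query plus transfer, then client-side decryption and refinement) can be measured with a harness along these lines; run_cloud_query, decrypt_tuples and filter_on_client are hypothetical stand-ins for the prototype’s functions, passed in as parameters so the sketch stays self-contained.

    import time

    def timed(fn, *args):
        """Return (result, elapsed_seconds) for a single call of fn."""
        start = time.perf_counter()
        result = fn(*args)
        return result, time.perf_counter() - start

    def measure_our_approach(run_cloud_query, decrypt_tuples, filter_on_client,
                             cloud_query, client_query):
        """Sum the two cost components described above. The three callables
        are hypothetical stand-ins for the prototype's cloud-side query and
        transfer, client-side decryption, and client-side refinement."""
        encrypted, t_cloud = timed(run_cloud_query, cloud_query)
        plaintext, t_decrypt = timed(decrypt_tuples, encrypted)
        _, t_filter = timed(filter_on_client, plaintext, client_query)
        return t_cloud + t_decrypt + t_filter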

Figure 5

The execution times for performing similarity and range queries over the private field “phone” or “name”

Finally, based on the above experimental results, we conclude that increasing the security factor ρ decreases the efficiency of the proposed approach, i.e., it reduces the effectiveness of the cloud-side query operations generated by our approach at filtering non-target tuples of the CloudDB.

6.3 Effectiveness comparison and analysis

From the related work section, we know that many approaches to database encryption exist, but most were not designed for personal privacy protection in a CloudDB and are therefore difficult to apply in this setting. In this subsection, we compare our approach with three existing ones, proposed in [18], [29] and [33], respectively. It should be pointed out that none of these approaches was designed for a CloudDB; for comparison, we re-implemented them on our prototype experimental system.

First, we compare effectiveness in terms of efficiency (i.e., Definition 4). In the experiments, (1) for our approach, the security factor ρ is set to 2^15; (2) for the approach in [29], the number of bits of the characteristic index field is set to 32 (as recommended by the authors); and (3) for the approach in [33], the size of the index matrix is set to 8 (for the private field “phone”) or 20 (for the private field “name”). The experiments were performed with the basic similarity queries “PL2” and “NL2”, and the basic range queries “PR2” and “NR2”. The experimental results are shown in Figure 6, where “AP-1”, “AP-2” and “AP-3” denote the approaches presented in [18], [29] and [33], respectively, and “AP-We” denotes our approach. From Figure 6, we see that our approach has nearly the same running efficiency as the approaches in [18] and [29], and better running efficiency than the approach in [33].

Figure 6

The efficiency comparisons between our approach and other existing ones

Second, based on the above results and the results reported in [18], [29] and [33], we make an overall effectiveness comparison in terms of security (i.e., Definition 1), usability (i.e., Definition 3) and efficiency (i.e., Definition 4). The comparison results are shown in Table 5. From Table 5, we can see that, compared with the other approaches, our approach not only has better usability, i.e., it can support all familiar kinds of query operations over private text fields (including similarity queries and range queries), but also has better security, making it harder for attackers to compromise the personal privacy stored in a CloudDB. Overall, our approach achieves better overall effectiveness in terms of security, usability and efficiency than the other approaches.

Table 5 The effectiveness comparison, where “low” denotes non-support, “medium” denotes partial support, and “high” denotes good support

7 Conclusion

In this paper, we proposed a client-based approach to protecting personal privacy in a CloudDB. The approach comprises a privacy protection scheme and a privacy query scheme, which together ensure not only good security of privacy data but also good efficiency of query operations over privacy data. Moreover, we demonstrated the effectiveness of the approach by theoretical analysis and experimental evaluation. The results show that: (1) the feature index constructed by the approach has good security, i.e., it is difficult to infer the plaintext from the feature index data; (2) the approach has good usability, i.e., with the help of the feature index, each familiar type of query operation over privacy data can be transformed into a cloud-side query operation that can be performed correctly at the cloud side; and (3) the approach has good efficiency, i.e., with the help of a cloud-side query operation, most non-target tuples can be filtered out at the cloud side, improving the execution efficiency of a client-side query operation over privacy data.

However, the approach proposed in this paper is not the end of our work. In future work, we will study several problems, e.g.: (1) how to automatically determine the security factor ρ based on the characteristics of users’ privacy data, instead of having it preset by users; (2) how to extend the approach to support more privacy data types beyond text; and (3) how to deploy the approach in a practical CloudDB. In addition, this work focuses only on the protection of users’ privacy data; in a CloudDB, however, users’ behaviour may also pose a threat to personal privacy. How to protect the privacy behind users’ behaviour is therefore also part of our future work.