1 Introduction

Organizations today collect and store large volumes of data, which they analyze for diverse purposes. However, the in-house computational capabilities of an organization may become an obstacle to storing and processing that data. Many untrusted cloud computing platforms (e.g., Amazon AWS, Google App Engine, and Microsoft Azure) offer database-as-a-service, with which data owners, instead of purchasing, installing, and running data management systems locally, can outsource their databases and query processing to the cloud. Such cloud-based services, available under the pay-as-you-go model, offer significant advantages to small, medium, and at times large organizations. The numerous benefits of public clouds, however, come with significant security and privacy concerns related to storing sensitive data (e.g., sensitive client information, credit card numbers, social security numbers, and medical records) or executing queries over it. The untrusted public cloud may be an honest-but-curious (or passive) adversary, which executes an assigned job but also tries to extract meaningful information, or a malicious (or active) adversary, which may tamper with the data or the query. Such concerns are not a new revelation – indeed, they were identified as a key impediment to organizations adopting the database-as-a-service model in early work on data outsourcing [25, 26]. Since then, the security/confidentiality challenge has been extensively studied in both the cryptography and database literature, resulting in many techniques for data privacy, query privacy, and inference prevention. Existing work can loosely be classified into the following three categories:

  1.

    Encryption based techniques. E.g., order-preserving encryption [3], deterministic encryption (Chap. 5 of [24]), homomorphic encryption [21], bucketization [25], searchable encryption [41], private informational retrieval (PIR) [8], practical-PIR (P-PIR) [42], oblivious-RAM (ORAM) [23], oblivious transfers (OT) [39], oblivious polynomial evaluation (OPE) [34], oblivious query processing [5], searchable symmetric encryption [13], and distributed searchable symmetric encryption (DSSE) [27].

  2.

    Secret-sharing [40] based techniques. E.g., distributed point function [22], function secret-sharing [7], functional secret-sharing [30], accumulating-automata [18, 19], Obscure [46], and others [20, 31, 32].

  3.

    Trusted hardware-based techniques. These are based on either a secure coprocessor or Intel SGX, e.g., [4, 6]. A secure coprocessor or Intel SGX [12] allows data to be decrypted in a secure area and some computations to be performed on it.

Fig. 1. Comparing different cryptographic techniques.

While approaches to compute over encrypted data and systems supporting such techniques are plentiful, secure data outsourcing and query processing remain an open challenge. Existing solutions suffer from several limitations. First, cryptographic approaches that prevent leakage, e.g., fully homomorphic encryption coupled with ORAM, simply do not scale to large datasets and complex queries, and are thus of limited practical value. Most of the above-mentioned techniques were not designed to deal with large amounts of data, and their overheads can be very high (see Fig. 1, which compares the time taken by TPC-H selection queries under different cryptographic solutions). To date, a scalable non-interactive mechanism for efficiently evaluating join queries based on homomorphic encryption without leaking information remains an open challenge. Systems such as CryptDB [38] have taken a more practical approach by allowing users to explore the tradeoff between system functionality and the security it offers. Unfortunately, precisely characterizing the security offered by such systems, given the underlying cryptographic approaches, has turned out to be extremely difficult. For instance, [28, 35] show that when order-preserving and deterministic encryption techniques are used together on a dataset in which the entropy of the values is not high enough, an attacker may be able to reconstruct the entire plaintext through frequency analysis of the encrypted data. While mechanisms based on secret-sharing [40] are potentially more scalable, splitting data amongst multiple non-colluding cloud operators (an assumption that is not valid in a general setting) incurs significant communication overheads and can only support a limited set of selection and aggregation queries efficiently.

While the race to develop cryptographic solutions that (i) are efficient, (ii) support complex SQL queries, and (iii) offer provable security from the application’s perspective is ongoing, this chapter departs from this well-trodden path by exploring a different (but complementary) approach to secure data processing: partitioning a computation over either the hybrid cloud or the public cloud based on a classification of data into sensitive and non-sensitive subsets. We focus on situations where only part of the data is sensitive, while the remainder (possibly the majority) is non-sensitive. In particular, we consider a partitioned computation model that exploits such a classification to develop efficient data processing solutions with provable security guarantees. Partitioned computing potentially provides significant benefits by (i) avoiding (expensive) cryptographic operations on non-sensitive data, and (ii) allowing query processing on non-sensitive data to exploit indexes.

While classifying data into sensitive and non-sensitive subsets may seem artificial or limiting at first, we refer the reader to the ongoing dialogue in the popular media [1, 2] about cloud security and the hybrid cloud, which clearly identifies data classification policies for labeling data as sensitive/non-sensitive as a key strategy for securing data in a cloud. Furthermore, similar to the model considered in this chapter, such articles emphasize either storing sensitive data on a private cloud while outsourcing the rest (in the hybrid cloud context) or encrypting only the sensitive part of the data prior to outsourcing. Also, note that data classification based on column-level sensitivity is not a new concept; [9,10,11, 15,16,17] have explored many ways to outsource column-level partitioned data to the cloud. However, these papers do not address joint query execution over two relations. Some recent database systems such as Jana (see Footnote 1) and Opaque [45] are exploring architectures that allow only some parts of the data (the sensitive parts) to be encrypted while the remainder of the (non-sensitive) data stays in plaintext, thereby supporting partitioned computing. That organizational data can actually be classified as sensitive/non-sensitive is not difficult to see if we consider specific datasets. For instance, in a university dataset, data about courses, catalogs, locations of classes, faculty, and student enrollment would likely not be considered sensitive, but a person’s SSN or a student’s grade would be.

Contribution. Our contributions in this chapter are twofold:

  • Partitioned computation on the hybrid cloud. Our work is motivated by recent works on the hybrid cloud that have exploited the fact that, for a large class of application contexts, data can be partitioned into sensitive and non-sensitive components. Such a classification has been exploited to build hybrid cloud solutions [29, 36, 37, 43, 44] that outsource only non-sensitive data and enjoy both the benefits of the public cloud and strong security guarantees (without revealing sensitive data to an adversary).

  • Partitioned computation on the public cloud. In the public cloud setting, sensitive data is outsourced in an appropriately encrypted form, while non-sensitive data can be outsourced in cleartext. While partitioned computing offers new opportunities for efficient and secure data processing by avoiding cryptographic operations on the non-sensitive data, it raises several challenges when used in the public cloud. Specifically, the partitioned approach introduces a new security challenge – leakage due to the simultaneous execution of queries on the encrypted (sensitive) dataset and on the plaintext (non-sensitive) dataset. In this chapter, we study such leakage (Sect. 3), a security definition for partitioned computing in the context of the public cloud (Sect. 3), and a partitioned data processing technique for selection queries (Sect. 4) that satisfies partitioned data security while exploiting existing cryptographic mechanisms for secure processing of sensitive data and cleartext processing of non-sensitive data. The proposed approach can also be extended to other operations such as joins or range queries, as described in [33].

2 Partitioned Computations at the Hybrid Cloud

In this section, our goal is to develop an approach to execute SQL-style queries efficiently in a hybrid cloud while guaranteeing that sensitive data is not leaked to the (untrusted) public machines. At an abstract level, the technique partitions data and computation between the public and private clouds in such a way that the resulting computation (i) minimizes the execution time, and (ii) ensures that there is no information leakage. Information leakage, in general, could occur either directly, by exposing sensitive data to the public machines, or indirectly, through inferences made from the selective data transferred between public and private machines during execution.

The problem of securely executing queries in a hybrid cloud naturally leads to two interrelated subproblems:

  • Data distribution: How is data distributed between the private and public clouds? Data distribution depends on factors such as the amount of storage available on private machines, the expected query workload, and whether the data and query workload are largely static or dynamic.

  • Query execution: Given a data distribution strategy, how do we execute a query securely and efficiently across the hybrid cloud, while minimizing the execution time and obtaining the correct final outputs?

Since data on the public cloud is stored in cleartext, the data distribution strategy must guarantee that sensitive data resides only on private machines. Non-sensitive data, on the other hand, could be stored on private machines, on public machines, or replicated on both. Given a data distribution, the query processing strategy splits a computation between public and private machines while simultaneously meeting the goals of good performance and secure execution.

2.1 Split Strategy

In order to ensure secure query execution, we develop a split strategy for executing SQL queries in the hybrid cloud setting. In a split strategy, a query Q is partitioned into two subqueries that can be executed independently over the private and the public cloud, respectively, and the final result of the query is computed by appropriately merging the results of the two sub-queries. In particular, a query Q on dataset D is split as follows:

$$ Q(D) = Q_{merge}\Big ( Q_{priv}(D_{priv}), Q_{pub}(D_{pub}) \Big )$$

where \(Q_{priv}\) and \(Q_{pub}\) are the private and public cloud sub-queries, respectively. \(Q_{priv}\) is executed on the private subset of D (i.e., \(D_{priv}\)), whereas \(Q_{pub}\) is performed over the public subset of D (i.e., \(D_{pub}\)). \(Q_{merge}\) is a private cloud merge sub-query that takes the outputs of the former two sub-queries as input and produces output equivalent to that of the original query Q. We call such an execution strategy the split-strategy.

Two aspects of the split-strategy are noteworthy:

  1.

    It offers full security, since the public machines only have access to \(D_{pub}\), which does not contain any sensitive data. Moreover, no information is exchanged between the private and public clouds during the execution of \(Q_{pub}\), making the execution at the public cloud observationally equivalent to a situation where \(D_{priv}\) could be any random data.

  2.

    The split-strategy gains efficiency by executing \(Q_{priv}\) and \(Q_{pub}\) in parallel at the private and public cloud, respectively, and, furthermore, by performing inter-cloud data transfer at most once throughout the query execution. Note that the network between the private and public clouds can be significantly slower than the networks used within each cloud; thus, minimizing the amount of data shuffled between the clouds has a big performance impact.

The split strategy, and its efficiency, depends upon the data distribution strategy used to partition the data between the private and public clouds. Besides storing sensitive data, the private cloud must also store part of the non-sensitive data (called pseudo-sensitive data) that may be needed on the private side to support efficient query processing. For instance, a join query may necessitate that non-sensitive data be available at the private node, since sensitive records from one relation may join with non-sensitive records in another. Because the two subqueries in the split-execution strategy execute independently with no communication, if we did not store non-sensitive data at the private side, we would need to transfer the entire relation to the private side for the join to be computed as part of the merge query.

Split-Strategy for Selection or Projection. An efficient split-strategy for a selection or projection operation is straightforward. In this case, \(Q_{priv}\) is equivalent to the original query Q but is performed only over the sensitive records in \(D_{priv}\). Likewise, \(Q_{pub} = Q\), but runs only over \(D_{pub}\). Finally, \(Q_{merge} = Q_{priv} \cup Q_{pub}\).
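The split-strategy for selection can be sketched in a few lines of Python. This is an illustrative toy, not the chapter's implementation: plain lists stand in for the two clouds' data, and all function names and tuple values are ours.

```python
# Toy sketch of the split-strategy Q(D) = Q_merge(Q_priv(D_priv), Q_pub(D_pub))
# for a selection query. Q_merge for selection is a simple union.

def run_selection(table, predicate):
    """Stand-in for executing a selection sub-query on one cloud."""
    return [row for row in table if predicate(row)]

def split_select(d_priv, d_pub, predicate):
    """Run the same selection on both partitions, then merge by union."""
    q_priv = run_selection(d_priv, predicate)   # executes on the private cloud
    q_pub = run_selection(d_pub, predicate)     # executes on the public cloud, in parallel
    return q_priv + q_pub                       # Q_merge: union of the two results

# Example: select all rows with region 1 (made-up data).
d_priv = [("secret-item", 1)]                   # sensitive tuples stay on the private side
d_pub = [("apple", 1), ("grape", 2)]            # non-sensitive tuples on the public side
result = split_select(d_priv, d_pub, lambda r: r[1] == 1)
# result == [("secret-item", 1), ("apple", 1)]
```

Note that no data crosses the cloud boundary during the two sub-queries; only the final merge brings the partial results together on the private side.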

Fig. 2. Example relations.

Split-Strategy for Equijoin. An efficient split-strategy for a join query such as \(Q= R \underset{C}{\bowtie }S\) is more complex. To see this, consider the relations R and S shown in Fig. 2, where the sensitive portions of R and S are denoted \(R_s\) and \(S_s\), respectively, the remaining non-sensitive portions are denoted \(R_{ns}\) and \(S_{ns}\), and the join condition is \(C = (R.Region = S.Region)\). Let us further assume that \(R_{ns}\) and \(S_{ns}\), besides being stored in the public cloud, are also replicated on the private cloud.

The naive split-strategy for \(R \underset{C}{\bowtie }S\) would be:

  • \(Q_{pub} = R_{ns} \underset{C}{\bowtie } S_{ns}\)

  • \(Q_{priv} = (R_{s} \underset{C}{\bowtie } S_{s}) \cup (R_{s} \underset{C}{\bowtie } S_{ns}) \cup (R_{ns} \underset{C}{\bowtie } S_s)\).

Note that if Q is split as above, \(Q_{priv}\) consists of three subqueries that scan 2, 3, and 3 tuples, respectively, resulting in 8 tuples being scanned and joined. In contrast, if we simply executed the query Q on the private side (which we can do, since R and S are fully stored on the private side), the cost would be lower, requiring a scan of only 6 tuples. Indeed, the overhead of the above split strategy increases even further for multiway joins (e.g., \(R \underset{C}{\bowtie } S \underset{C^{\prime }}{\bowtie } T\)) compared to simply executing the multiway join locally. Thus, if we use the split-strategy for computing \(R \underset{C}{\bowtie } S \underset{C^{\prime }}{\bowtie } T\), where \(C^{\prime }\) is \(S.Region=T.Region\), the number of tuples scanned/joined in the private cloud will be much higher than for the original query.

A Modified Approach for Equijoin. The cost of executing Q in the private cloud can be significantly reduced by pre-filtering relations R and S based on sensitive records of the other relation. To perform such a pre-filtering operation, the tuples in the relations \(R_{ns}\) and \(S_{ns}\) have to be co-partitioned based on whether they join with a sensitive tuple from the other table under condition C or not.

Let \(R^{S}_{ns}\) be the set of non-sensitive tuples of R that join with some sensitive tuple in S; in our example, \(R^{S}_{ns} = \langle \)apple, 1\(\rangle \). Similarly, let \(S^{R}_{ns}\) be the set of non-sensitive tuples of S that join with some tuple from \(R_s\), i.e., \(\langle \)Chris, 1\(\rangle \). The private-side computation can then be rewritten as:

$$\begin{aligned} (R_s \cup R^{S}_{ns}) \underset{C}{\bowtie } (S_s \cup S^{R}_{ns}). \end{aligned}$$
(1)

Thus, the scan and join cost of this new plan at the private cloud is 4, lower than the cost of 6 incurred by computing the query entirely on the private side.
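The pre-filtering step can be sketched as follows. The non-sensitive tuples mirror the Fig. 2 values named in the text; the sensitive tuple contents are placeholders, since Fig. 2 itself is not reproduced here.

```python
# Sketch of computing the co-partitions R^S_ns and S^R_ns for the equijoin on
# Region: the non-sensitive tuples of each relation that join with some
# sensitive tuple of the other relation.

def co_partition(ns_tuples, other_sensitive, key):
    """Non-sensitive tuples whose join key matches some sensitive tuple's key."""
    sensitive_keys = {key(t) for t in other_sensitive}
    return [t for t in ns_tuples if key(t) in sensitive_keys]

region = lambda t: t[1]                     # join attribute: Region

r_ns = [("apple", 1), ("grape", 2)]         # non-sensitive R tuples (from the text)
s_ns = [("Chris", 1), ("James", 2)]         # non-sensitive S tuples (from the text)
r_s = [("r-secret", 1)]                     # placeholder sensitive tuple of R
s_s = [("s-secret", 1)]                     # placeholder sensitive tuple of S

r_ns_S = co_partition(r_ns, s_s, region)    # non-sensitive R joining sensitive S
s_ns_R = co_partition(s_ns, r_s, region)    # non-sensitive S joining sensitive R
# r_ns_S == [("apple", 1)]; s_ns_R == [("Chris", 1)]
```

The private side then joins only \((R_s \cup R^{S}_{ns})\) with \((S_s \cup S^{R}_{ns})\), scanning 2 + 2 = 4 tuples instead of all 6.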

Guarded Join. The modified strategy above, however, introduces a new challenge. Since \(R^{S}_{ns} \underset{C}{\bowtie } S^{R}_{ns}\) is evaluated at both the public and the private cloud, its output, \(\langle \)apple, Chris, 1\(\rangle \), is computed on both clouds. To prevent this, we perform a guarded join (\(\bowtie ^{\prime }\)) on the private cloud, which discards any output tuple generated by joining two non-sensitive tuples. This can easily be implemented by adding a column to R and S that marks the sensitivity status of a tuple (sensitive or non-sensitive) and then adding an appropriate selection after the join operation. In other words, the complete private-side computation for \(R \underset{C}{\bowtie } S\) would be

$$\begin{aligned} \sigma _{R.sens = true \vee S.sens = true}((R_s \cup R^{S}_{ns}) \underset{C}{\bowtie } (S_s \cup S^{R}_{ns})) \end{aligned}$$
(2)

where sens is a boolean column (or partition id) appended to relations R and S on the private cloud, set to true for sensitive records and false for non-sensitive records.
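A minimal sketch of the guarded join, with tuples modeled as Python dicts carrying the sens flag described above (the data values are placeholders):

```python
# Sketch of the guarded join: join the private-side inputs, then apply the
# selection sigma_{l.sens OR r.sens} to drop any pair of two non-sensitive
# tuples, since those pairs are already produced by the public cloud.

def equijoin(left, right, key):
    """Plain nested-loop equijoin, for illustration only."""
    return [(l, r) for l in left for r in right if key(l) == key(r)]

def guarded_join(left, right, key):
    """Keep a joined pair only if at least one side is sensitive."""
    return [(l, r) for (l, r) in equijoin(left, right, key)
            if l["sens"] or r["sens"]]

region = lambda t: t["region"]
r_side = [{"val": "r-secret", "region": 1, "sens": True},
          {"val": "apple", "region": 1, "sens": False}]
s_side = [{"val": "s-secret", "region": 1, "sens": True},
          {"val": "Chris", "region": 1, "sens": False}]

out = guarded_join(r_side, s_side, region)
# The (apple, Chris) pair is discarded: it is computed on the public cloud.
```

The guard is exactly the post-join selection of Eq. (2), expressed as a filter over the sens flags.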

Challenges. There are multiple challenges in implementing this new approach. The first is the cost of creating \(R^{S}_{ns}\) and \(S^{R}_{ns}\) beforehand: extracting these partitions for a query might take as much time as executing the original query. However, this cost is amortized, since the partitions are computed once and used multiple times to improve join performance at the private cloud.

The second challenge is the creation of co-partitioning tables for complex queries. For instance, for a query \(R \underset{C}{\bowtie } S \underset{C^{\prime }}{\bowtie } T\), the plan would first compute the results of \(R \underset{C}{\bowtie } S\) and then join them with T. However, if we perform the private-side computation of \(R \underset{C}{\bowtie } S\) based on Eq. 1 (without duplicate filtering) and join the results with T, we will not obtain the complete set of sensitive \(R \underset{C}{\bowtie } S \underset{C^{\prime }}{\bowtie } T\) results.

To see this, consider the sensitive record \(\langle \)Japan, 2\(\rangle \) in T (Fig. 2), which joins with the non-sensitive tuple \(\langle \)grape, 2\(\rangle \) in \(R- R^{S}_{ns}\) and with the non-sensitive tuple \(\langle \)James, 2\(\rangle \) from \(S-S^{R}_{ns}\). Thus, the non-sensitive records of R and S have to be co-partitioned based on the sensitive records of T via their join paths from T. In \(R\underset{C}{\bowtie }S \underset{C^{\prime }}{\bowtie } T\), the join path from T to R is \(T \underset{C^{\prime }}{\bowtie } S \underset{C}{\bowtie } R\), and from T to S it is \(T \underset{C^{\prime }}{\bowtie } S\). Similarly, the non-sensitive records of T have to be co-partitioned based on the sensitive records of R and S via the join paths specified in the query.

The final challenge is maintaining these co-partitions and feeding the right one to an arbitrary incoming query. Given a workload of queries and multiple possible join paths between any two relations, each relation R in the dataset may need to be co-partitioned multiple times. This implies that a non-sensitive record r of R might appear in more than one co-partition of R, so maintaining each co-partition separately might be infeasible in terms of storage. Instead, the identifiers of the co-partitions that record r belongs to can be embedded into r as a new column, which we call the co-partition (CPT) column. Note that the CPT column is defined only on the private cloud data, since revealing it to the public cloud would violate our security requirement.

The CPT column is initially set to null for sensitive tuples on the private side, since co-partitions are defined only for non-sensitive tuples. It can thus serve another purpose as well: indicating the sensitivity status of a tuple by setting it to “sens” for sensitive tuples.

Join Path. To formalize the concept of co-partitioning, we first define the notion of a join path. Let \(R_{i}\) be a relation in our dataset D, and let Q be a query over \(R_i\). We say a join path exists from a relation \(R_j\) to \(R_i\) if either \(R_{i}\) is joined with \(R_j\) directly based on a condition C, i.e., \(R_j \underset{C}{\bowtie } R_i\), or \(R_{j}\) is joined with \(R_i\) indirectly through other relations in Q. A join path p can be represented as a sequence of relations and conditions between \(R_{j}\) and \(R_{i}\). Let PathSet be the set of all join paths extracted either from the expected workload or from a given dataset schema, and let

$$\begin{aligned} PathSet_{i} = \{ p \in PathSet : \text {path } p \text { ends at relation } R_i \}. \end{aligned}$$
(3)

Let \(CP(R_{i}, p)\) be the set of non-sensitive \(R_{i}\) records that join with at least one sensitive record from another relation \(R_j\) via the join path p. Note that p starts from \(R_j\) and ends at \(R_i\), and can thus be used as an identifier for \(CP(R_{i}, p)\). Each \(CP(R_{i}, p)\) is called a “co-partition” of \(R_{i}\). Given these definitions, the CPT column of an \(R_{i}\) record r can be defined as:

$$\begin{aligned} CPT(r) = {\left\{ \begin{array}{ll} \text {sens}, &{} \text {if } r \text { is sensitive} \\ \{ p \in PathSet_{i} : r \in CP(R_{i}, p) \}, &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(4)

Figure 3 shows our example R, S, and T relations with their CPT columns. For instance, the join path \(R \bowtie S\) is appended to the CPT column of every tuple in \( S^{R}_{ns}\). Additionally, the CPT column of every tuple in \(R_s\) is set to sens.

Fig. 3. Example relations with the CPT columns.
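The CPT assignment described above can be sketched as follows. The record keys and path identifiers (e.g., "R>S") are illustrative placeholders for join-path ids, not the chapter's encoding.

```python
# Sketch of assigning the CPT column on the private cloud: sensitive records
# get the marker "sens"; each non-sensitive record gets the ids of every
# co-partition it belongs to.

def assign_cpt(records, co_partitions):
    """records: list of (key, is_sensitive); co_partitions: {path_id: set of keys}."""
    cpt = {}
    for key, is_sensitive in records:
        if is_sensitive:
            cpt[key] = "sens"                         # sensitivity marker
        else:
            cpt[key] = [pid for pid, members in co_partitions.items()
                        if key in members]            # all co-partition ids of the record
    return cpt

# Example for relation S: "Chris" joins sensitive R tuples via path R>S,
# "James" joins sensitive T tuples via path T>S (placeholder path ids).
s_records = [("s-secret", True), ("Chris", False), ("James", False)]
s_co_partitions = {"R>S": {"Chris"}, "T>S": {"James"}}

cpt = assign_cpt(s_records, s_co_partitions)
# cpt == {"s-secret": "sens", "Chris": ["R>S"], "James": ["T>S"]}
```

At query time, the stored path ids let the system select exactly the co-partitions relevant to the incoming query's join paths.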

2.2 Experimental Analysis

To study the impact of the table partitioning discussed in the previous section, we differentiate between two realizations of our strategy. In the first, called CPT-C, every record of a table at the private cloud contains a CPT column, and all records are physically stored together. In the second, called CPT-P, tables are partitioned based on their records’ CPT column, and each partition is stored separately. Each partition file is then appended to the corresponding Hive table as a separate partition, so that at query time Hive filters out the partitions unnecessary for that particular query.

Sensitive Data Ratio. For these experiments, we varied the percentage of sensitive records (\(1, 5, 10, 25, 50\%\)) in the customer and supplier tables and set the number of public machines to 36. As expected, Fig. 4 shows that a larger percentage of sensitive data leads to a longer workload execution time for both CPT-C and CPT-P, in Hadoop as well as in Spark. The reason is that a higher sensitive data ratio results in more computation being performed on the private side, which implies a longer query execution time in the split-strategy. When the sensitivity ratio increases, CPT-P’s scan cost increases dramatically. Since the scan cost of queries is the dominant factor compared to other operators (join, filtering, etc.) in Spark, CPT-C provides very little performance gain over All-Private in Spark, because the scan costs of these two approaches are the same. Overall, when the sensitivity ratio is as low as \(1\%\), CPT-P provides an \(8.7{\times }\) speed-up in Hadoop and a \(5{\times }\) speed-up in Spark compared to All-Private.

Fig. 4. Running times for different sensitivity ratios.

Fig. 5. The CPT column’s creation for different sensitivity ratios.

Fig. 6. Comparison of pseudo-sensitive data and sensitivity ratio.

Recall that we created the CPT column using a Spark job for the CPT-C solution; for the CPT-P solution, we then physically partitioned the tables. Figure 5 shows how much time we spent preparing the private cloud data for both CPT-C and CPT-P, along with the gains of these approaches over All-Private in terms of overall workload execution time. As Fig. 5 indicates, up to \(25\%\) sensitivity, CPT-P’s data preparation time is less than its performance gain in Hadoop, whereas in Spark the data preparation time is always higher than the performance gain for both CPT-P and CPT-C. Note that we prepare the CPT column only once, on static data, for an expected workload that will likely be executed more than once with different selection and projection conditions. In Spark, even with a sensitivity ratio as high as \(10\%\), executing the workload more than once is enough for the performance gain of the CPT-P solution to exceed the overhead of data preparation.

Size of Private Storage. Besides storing sensitive data, our technique also stores pseudo-sensitive data on the private cloud. This enables us to execute queries in a partitioned manner while minimizing expensive inter-cloud communication during query execution. In Fig. 6, we plot the size of pseudo-sensitive data as a percentage of the total database size at different sensitivity levels. We note that even when sensitivity levels are as high as 5–10%, the pseudo-sensitive data remains only a fraction (15–25%) of the total data. At smaller sensitivity levels, the ratio is much smaller.

2.3 Other Approaches to Partitioned Computing

The discussion above focused on partitioned computing in hybrid clouds in the context of SQL queries and is based primarily on the work that appeared in [36]. Several other approaches to partitioned computing in the hybrid cloud have also been developed in the literature that, like the above-mentioned method, offer security by controlling data distribution between the private and public clouds. Many of these approaches [29, 37, 43, 44] were developed in the context of MapReduce job execution and address security at a lower level than the SQL-level approach described above. One could, potentially, transform SQL/Hive queries into lower-level MapReduce jobs and run them using privacy-preserving extensions. There are several limitations to such an approach, however; we refer the reader to [36] for a detailed discussion of these limitations and to [14] for a detailed survey of hybrid cloud based MapReduce security.

3 Partitioned Computations at the Public Cloud and Security Definition

In this section, we define partitioned computation, illustrate how such a computation can leak information due to joint processing of sensitive and non-sensitive data, present the corresponding security definition, and finally discuss the system and adversarial models under which we develop our solutions.

Partitioned Computations

Let R be a relation that is partitioned into two sub-relations, \(R_e\supseteq R_s\) and \(R_p \subseteq R_{ ns }\), such that \(R= R_e \cup R_p\). The relation \(R_e\) contains all the sensitive tuples (denoted by \(R_s\)) of R and will be stored in encrypted form in the cloud. Note that \(R_e\) may contain additional (non-sensitive) tuples of R, if that helps with secure data processing. The relation \(R_p\) refers to the sub-relation of R that will be stored in plaintext on the cloud; naturally, \(R_p\) does not contain any sensitive tuples. For the remainder of the chapter, we assume that \(R_e=R_s\) and \(R_p=R_{ns}\), though our approach can be generalized to allow a potentially replicated representation of non-sensitive data in encrypted form, if that helps evaluate queries more efficiently. Now consider a query Q over relation R. A partitioned computation strategy splits the execution of Q into two independent sub-queries: \(Q_s\), a query executed on \(E(R_e)\), and \(Q_{ ns }\), a query executed on \(R_{p}\). The final results are computed (using a query \(Q_{merge}\)) by appropriately merging the results of the two sub-queries at the trusted database (DB) owner side (or in the cloud, if a trusted component, e.g., Intel SGX, is available for the merge operation). In particular, the query Q on a relation R is partitioned as follows:

$$Q(R) = Q_{merge}\Big ( Q_s(R_e), Q_{ ns }(R_{ p }) \Big )$$

Let us illustrate partitioned computations through an example.

Fig. 7. A relation: Employee.

Example 1

Consider an Employee relation, shown in Fig. 7. In this relation, the attribute SSN is sensitive, and, furthermore, all tuples of employees with Department \(=\) “Defense” are sensitive. In such a case, the Employee relation may be stored as the following three relations: (i) Employee1 with attributes EId and SSN (see Fig. 8); (ii) Employee2 with attributes EId, FirstName, LastName, Office#, and Department, where Department \(=\) “Defense” (see Fig. 9); and (iii) Employee3 with attributes EId, FirstName, LastName, Office#, and Department, where Department \(\ne \) “Defense” (see Fig. 10). Since the relations Employee1 and Employee2 (Figs. 8 and 9) contain only sensitive data, these two relations are encrypted before outsourcing, while Employee3 (Fig. 10), which contains only non-sensitive data, is outsourced in cleartext. We assume that the sensitive data is strongly encrypted such that ciphertext indistinguishability (i.e., an adversary cannot distinguish pairs of ciphertexts) is achieved. Thus, the two occurrences of E152 have two different ciphertexts.
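The partitioning in Example 1 can be sketched as follows. The row contents are invented (Figs. 7–10 are not reproduced here); only the partitioning rule, splitting off SSN vertically and splitting the rest horizontally on the Department value, follows the text.

```python
# Sketch of the sensitivity-based partitioning of the Employee relation:
# Employee1 holds (EId, SSN) and is always sensitive; the remaining columns
# are split into Employee2 (Department = "Defense", sensitive) and
# Employee3 (all other departments, non-sensitive).

def partition_employee(rows):
    emp1 = [{"EId": r["EId"], "SSN": r["SSN"]} for r in rows]        # sensitive
    rest = [{k: v for k, v in r.items() if k != "SSN"} for r in rows]
    emp2 = [r for r in rest if r["Department"] == "Defense"]         # sensitive
    emp3 = [r for r in rest if r["Department"] != "Defense"]         # non-sensitive
    return emp1, emp2, emp3

# Invented rows: the same employee appears in both a sensitive and a
# non-sensitive department.
rows = [
    {"EId": "E152", "SSN": "111-22-3333", "Department": "Defense"},
    {"EId": "E152", "SSN": "111-22-3333", "Department": "Sales"},
]
emp1, emp2, emp3 = partition_employee(rows)
# emp1 and emp2 would be encrypted before outsourcing; emp3 stays cleartext.
```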

Fig. 8. A sensitive relation: Employee1.

Fig. 9. A sensitive relation: Employee2.

Fig. 10. A non-sensitive relation: Employee3.

Consider a query Q: SELECT FirstName, LastName, Office#, Department FROM Employee WHERE FirstName = “John”. In partitioned computation, the query Q is partitioned into two sub-queries: \(Q_s\), which executes on Employee2, and \(Q_{ns}\), which executes on Employee3. \(Q_s\) retrieves the tuple \(t_4\), while \(Q_{ ns }\) retrieves the tuple \(t_2\). \(Q_{merge}\) in this example is simply a union operator. Note that executing the query Q directly on the original relation would retrieve the same tuples.
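This partitioned execution can be sketched end-to-end. The row contents are invented for illustration; in a real deployment \(Q_s\) would run over the encrypted relation via a cryptographic technique, which we do not simulate here.

```python
# Sketch of the partitioned execution of the query
#   SELECT ... FROM Employee WHERE FirstName = "John".
# Employee2 is the sensitive partition (outsourced encrypted); Employee3 is
# the non-sensitive partition (outsourced in cleartext). Rows are made up.

employee2 = [  # sensitive partition (Department = "Defense")
    {"EId": "E254", "FirstName": "John", "Department": "Defense"},
    {"EId": "E152", "FirstName": "Mary", "Department": "Defense"},
]
employee3 = [  # non-sensitive partition
    {"EId": "E152", "FirstName": "John", "Department": "Sales"},
    {"EId": "E301", "FirstName": "Eve",  "Department": "HR"},
]

def q_sub(table, first_name):
    """One sub-query; Q_s would really run over E(R_e) via a secure technique."""
    return [row for row in table if row["FirstName"] == first_name]

q_s = q_sub(employee2, "John")    # sub-query over the sensitive partition
q_ns = q_sub(employee3, "John")   # sub-query over the cleartext partition
answer = q_s + q_ns               # Q_merge: union at the trusted DB owner
```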

Inference Attack in Partitioned Computations

Partitioned computations, if performed naively, could lead to inferences about sensitive data from non-sensitive data. To see this, consider the following three queries on the Employee2 and Employee3 relations: (i) retrieve the tuples of employee Eid = E259, (ii) retrieve the tuples of employee Eid = E101, and (iii) retrieve the tuples of employee Eid = E199. We consider an honest-but-curious adversarial cloud that returns correct answers to the queries but wishes to learn information about the encrypted sensitive tables, Employee1 and Employee2.

Table 1 shows the adversary’s view based on executing the corresponding \(Q_s\) and \(Q_{ns}\) components of the above three queries, assuming that the cryptographic techniques used for tuple retrieval do not hide access patterns. During the execution, the adversary gains complete knowledge of the non-sensitive tuples returned and, furthermore, learns which encrypted tuples are returned as a result of \(Q_s\) (\( E(t_i) \) in the table refers to the encrypted tuple \(t_i\)).

Table 1. Queries and returned tuples/adversarial view.

Given the above adversarial view, the adversary learns that employee E259 has tuples in both \(D_s\) (\(= D_e\)) and \(D_p\) (\(=D_{ns}\)). Coupled with knowledge of the data partitioning, the adversary can infer that E259 works in both sensitive and non-sensitive departments; moreover, the adversary learns which sensitive tuple has an Eid equal to E259. From the second query, the adversary learns that E101 works only in a sensitive department (since the query did not return any answer from the Employee3 relation). Likewise, from the third query, the adversary learns that E199 works only in a non-sensitive department.

In order to prevent such an attack, we need a new security definition. Before we discuss the formal definition of partitioned data security, we first provide the intuition behind it. Observe that before any tuple is retrieved, under the assumption that no one except the DB owner can decrypt an encrypted sensitive value, say \(E(s_i)\), the adversary cannot learn which non-sensitive value is identical to the cleartext value of \(E(s_i)\); let us denote the cleartext of \(E(s_i)\) by \(s_i\). Thus, the adversary considers the value \(s_i\) to be potentially identical to any of the non-sensitive values. Based on this fact, the adversary can create a complete bipartite graph with |S| nodes on one side and \(| NS |\) nodes on the other side, where |S| and \(| NS |\) are the numbers of sensitive and non-sensitive values, respectively. The edges in the graph are called surviving matches of the values. For example, before executing any query, the adversary can create a bipartite graph for the 4 sensitive and 4 non-sensitive values of the EId attribute of Example 1, as shown in Fig. 11.

Fig. 11.

A bipartite graph showing the initial condition of sensitive and non-sensitive values before query execution.
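The adversary's initial view is simple to construct; a small sketch over the Example 1 EId values, representing the graph as a set of edges:

```python
from itertools import product

def surviving_matches(sensitive_vals, non_sensitive_vals):
    """Before any query runs, every (sensitive, non-sensitive) pair is a
    surviving match: a complete bipartite graph with |S| * |NS| edges."""
    return set(product(sensitive_vals, non_sensitive_vals))

S = ["E101", "E259", "E152", "E159"]   # sensitive EId values (Example 1)
NS = ["E199", "E259", "E152", "E254"]  # non-sensitive EId values

edges = surviving_matches(S, NS)       # 4 x 4 = 16 edges, as in Fig. 11
```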

The query execution on the datasets creates an adversarial view that guides the adversary in creating a (new) bipartite graph with the same number of nodes on both sides. The requirement is to preserve all the edges of the initial bipartite graph in the graph obtained after query execution, thereby preserving the initial condition that the cleartext of the value \(E(s_i)\) may be identical to any of the non-sensitive values. Note that if the query execution removes any surviving match, it leaks that the value \(s_i\) is not identical to those non-sensitive values.

We also need to hide the number of occurrences of a sensitive value. Before a query execution, due to ciphertext indistinguishability, all occurrences of a single sensitive value look different, but a simple search or join query may reveal how many tuples share the same value. Based on the above two requirements, we can define the notion of partitioned data security.

Partitioned Data Security at the Public Cloud

Let R be a relation containing sensitive and non-sensitive tuples, and let \(R_s\) and \(R_{ ns }\) be the sensitive and non-sensitive relations, respectively. Let \(q(R_s,R_{ ns })[A]\) be a query, q, over an attribute A of the \(R_s\) and \(R_{ ns }\) relations. Let X be the auxiliary information about the sensitive data, and let \( Pr_{Adv} \) denote the probability with which the adversary knows some information. A query execution mechanism ensures partitioned data security if the following two properties hold:

  • \( Pr _{ Adv }[e_i \overset{\mathrm {a}}{=} ns _j|X] = Pr _{ Adv }[e_i \overset{\mathrm {a}}{=} ns _j|X, q(R_s,R_{ ns })[A]]\), where \(e_i= E(t_i)[A]\) is the encrypted representation of the attribute value A for any tuple \(t_i\) of the relation \(R_s\), and \( ns _j\) is a value of the attribute A for any tuple of the relation \(R_{ ns }\). The notation \(\overset{\mathrm {a}}{=}\) denotes that a sensitive value is identical to a non-sensitive value. This equation captures the fact that the initial probability of linking a sensitive tuple with a non-sensitive tuple remains identical after executing several queries on the relations.

  • \( Pr _{ Adv }[v_i \overset{\mathrm {r}}{\sim } v_j|X] = Pr _{ Adv }[v_i \overset{\mathrm {r}}{\sim } v_j | X, q(R_s,R_{ ns })[A]]\), for all \(v_i, v_j \in Domain (A)\). The notation \(\overset{\mathrm {r}}{\sim }\) denotes a relationship between the counts of tuples with sensitive values. This equation states that the probability of the adversary gaining information about the relative frequency of sensitive values does not increase with query execution.

The definition above formalizes the security requirement of any partitioned computation approach. Of course, a partitioned approach, besides being secure, must also be correct, in that it returns the same answer as the original query Q would return if executed without regard to security.

4 Query Binning: A Technique for Partitioned Computations Using a Cryptographic Technique at the Public Cloud

In this section, we study query binning (QB) as a partitioned computing approach. QB is related to bucketization, which has been studied in the past [25]. While bucketization in [25] was carried out over the data, QB performs bucketization over queries. In general, QB may ask more queries than the original query, adding overhead, but it prevents the above-mentioned inference attack. We study QB under the assumptions and setting given below.Footnote 2

Problem Setup. We assume the following two entities in our model: (i) A database (DB) owner, who splits each relation R in the database into two relations \(R_s\) and \(R_{ ns }\) containing all sensitive and non-sensitive tuples, respectively. (ii) A public cloud. The DB owner outsources the relation \(R_{ ns }\) to a public cloud. The tuples in \(R_s\) are encrypted using any existing mechanism before being outsourced to the same public cloud. In this approach, we use non-deterministic encryption, i.e., two occurrences of an identical value have different ciphertext representations.

DB Owner Assumptions. In our setting, the DB owner has to store some (limited) metadata, such as searchable values and their frequency counts, which is used for appropriate query formulation. The DB owner is assumed to have sufficient storage for such metadata, as well as the computational capability to perform encryption and decryption. The size of the metadata is significantly smaller than the size of the original data.

Adversarial Model. The adversary (i.e., the untrusted cloud) is assumed to be honest-but-curious, a standard setting for security in an untrustworthy public cloud. An honest-but-curious adversarial public cloud, thus, stores an outsourced dataset without tampering, correctly computes assigned tasks, and returns answers; however, it may exploit side knowledge (e.g., query execution, background knowledge, and output sizes) to gain as much information as possible about the sensitive data. Furthermore, the adversary can eavesdrop on the communication channels between the cloud and the DB owner, which may help in gaining knowledge about sensitive data, queries, or results. The adversary has full access to the following information: (i) all non-sensitive data outsourced in plaintext, and (ii) some auxiliary information about the sensitive data. The auxiliary information may contain the metadata of the relation and the number of tuples in the relation. Furthermore, in the case of selection queries, the adversary can observe frequent query types and frequent query terms on the non-sensitive data. The honest-but-curious adversary, however, cannot launch any attack against the DB owner.

Assumptions for QB. We develop QB initially under the assumption that queries are on a single attribute, say A. The QB approach takes as inputs: (i) the set of data values (of the attribute A) that are sensitive, along with their counts, and (ii) the set of data values (of the attribute A) that are non-sensitive, along with their counts. QB returns a partition of attribute values that forms the query bins for both the sensitive and the non-sensitive parts of the query.

In this chapter, we also restrict ourselves to the case where a value has at most two tuples, of which one must be sensitive and the other non-sensitive; both tuples cannot be sensitive, nor can both be non-sensitive. The scenario depicted in Example 1 satisfies this assumption. The EId attribute values corresponding to sensitive tuples are \(\langle \)E101, E259, E152, E159\(\rangle \), and those from the non-sensitive relation are \(\langle \)E199, E259, E152, E254\(\rangle \). Note that each value occurs only once in each set.

Full Version. In this chapter, we restrict the algorithm to selection queries on a single attribute. The full details of the algorithm, its extensions to values having different numbers of tuples, to conjunctive, range, join, and insert queries, and to dealing with the workload-skew attack are addressed in [33]. Further, the computation cost analysis and efficiency analysis of QB at different or identical levels of security against a pure cryptographic technique are given in [33].

The Approach. We develop an efficient approach to execute selection queries securely (preventing the information leakage shown in Example 1) by appropriately partitioning the query at a public cloud, where sensitive data is cryptographically secure while non-sensitive data stays in cleartext. For answering a selection query, we naturally use an existing cryptographic technique on the sensitive data and a simple search on the cleartext non-sensitive data. Alternatively, one could use secure hardware, e.g., Intel SGX, for all such operations; however, as mentioned in Sect. 1 (Fig. 1), SGX-based processing takes a significant amount of time due to the limited space of the enclave.

Informally, QB distributes attribute values in a matrix, where rows are sensitive bins and columns are non-sensitive bins. For example, suppose there are 16 values, say \(0, 1,\ldots ,15\), and assume every value has both a sensitive and an associated non-sensitive tuple. The DB owner then arranges the 16 values in a \(4\times 4\) matrix, as follows:

             \( NSB _0\)  \( NSB _1\)  \( NSB _2\)  \( NSB _3\)
\( SB _0\)      11          2           5          14
\( SB _1\)      10          3           8           7
\( SB _2\)       0         15           6           4
\( SB _3\)      13          1          12           9

In this example, we have four sensitive bins: \( SB _0 = \{11,2,5,14\}\), \( SB _1 = \{10,3,8,7\}\), \( SB _2 = \{0,15,6,4\}\), \( SB _3 = \{13,1,12,9\}\), and four non-sensitive bins: \( NSB _0 = \{11,10,0,13\}\), \( NSB _1 = \{2,3,15,1\}\), \( NSB _2 = \{5,8,6,12\}\), \( NSB _3 = \{14,7,4,9\}\). When a query arrives for a value, say 1, the DB owner searches for the tuples containing the values 2, 3, 15, 1 (viz. \( NSB _1\)) on the non-sensitive data and, using the cryptographic mechanism integrated into QB, for the values in \( SB _3\) (viz. 13, 1, 12, 9) on the sensitive data. While the adversary learns that the query corresponds to one of the four values in \( NSB _1\), since the query values in \( SB _3\) are encrypted, the adversary learns neither a sensitive value nor which non-sensitive value is identical to a cleartext sensitive value.
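The matrix layout above can be sketched as follows; the `create_bins` and `bins_for` helpers are illustrative stand-ins (assuming a pre-permuted input), not Algorithm 1 itself.

```python
import math

def create_bins(values):
    """Lay out n values in a sqrt(n) x sqrt(n) matrix: rows become
    sensitive bins (SB), columns non-sensitive bins (NSB). The input is
    assumed pre-permuted; this mirrors the 4x4 example in the text."""
    side = math.isqrt(len(values))
    rows = [values[i * side:(i + 1) * side] for i in range(side)]
    cols = [[r[j] for r in rows] for j in range(side)]
    return rows, cols  # (SB bins, NSB bins)

layout = [11, 2, 5, 14,   # the matrix from the text, read row by row
          10, 3, 8, 7,
          0, 15, 6, 4,
          13, 1, 12, 9]
SB, NSB = create_bins(layout)

def bins_for(value):
    """Return the (sensitive bin, non-sensitive bin) queried for value."""
    sb = next(b for b in SB if value in b)
    nsb = next(b for b in NSB if value in b)
    return sb, nsb

sb, nsb = bins_for(1)  # -> SB_3 = [13, 1, 12, 9], NSB_1 = [2, 3, 15, 1]
```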

Algorithm 1 (figure)

Formally, QB maps a selection query for a keyword w, say q(w), to a corresponding query over the non-sensitive relation, say \(q(W_{ ns })(R_{ ns })\), and one over the encrypted relation, say \(q(W_s)(R_s)\). The queries \(q(W_{ ns })(R_{ ns })\) and \(q(W_s)(R_s)\) each represent a set of query values that are executed over the relation \(R_{ ns }\) in plaintext and, respectively, over the sensitive relation \(R_s\) using the underlying cryptographic method. The sets \(W_{ ns }\) from \(R_{ ns }\) and \(W_s\) from \(R_s\) are selected such that: (i) \(w \in q(W_{ ns })(R_{ ns }) \cap q(W_s)(R_s)\), ensuring that all the tuples containing w are retrieved, and (ii) the execution of the queries \(q(W_{ ns })(R_{ ns })\) and \(q(W_s)(R_s)\) does not reveal any information (including w) to the adversary. The sets \(W_{ ns }\) are called non-sensitive bins, and the sets \(W_s\) are called sensitive bins. Algorithm 1 provides pseudocode of the bin-creation method.Footnote 3 The results from executing the queries \(q(W_{ ns })(R_{ ns })\) and \(q(W_s)(R_s)\) are decrypted, possibly filtered, and merged to generate the final answer.

For answering the three queries of Example 1, given in Sect. 2, Algorithm 1 creates two bins on the sensitive part: sensitive bin 1, denoted \( SB _1\), containing \(\{\)E101, E259\(\}\), and sensitive bin 2, denoted \( SB _2\), containing \(\{\)E152, E159\(\}\); and two bins on the non-sensitive part: non-sensitive bin 1, denoted \( NSB _1\), containing \(\{\)E259, E254\(\}\), and non-sensitive bin 2, denoted \( NSB _2\), containing \(\{\)E199, E152\(\}\).

Table 2. Queries and returned tuples/adversarial view when following QB.
Algorithm 2 (figure)

Algorithm 2 provides a way to retrieve the bins. By following Algorithm 2, Table 2 shows that the adversary can neither learn the query value w nor find a value shared between the two sets when answering the above-mentioned three queries. The reason is that the desired query value, w, is encrypted together with the other encrypted values of the set \(W_s\), and, furthermore, the query value w is obscured among the many requested non-sensitive values of the set \(W_{ ns }\), which are in cleartext. Consequently, the adversary is unable to find the intersection of the two sets, which would reveal the exact query value. Thus, while answering a query, the adversary cannot learn which employee works only in defense, only in design, or in both.
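A toy sketch of bin retrieval on the Example 1 bins, in the spirit of Algorithm 2: the `E(...)` tag is a stand-in for a real ciphertext, and only keywords present in both a sensitive and a non-sensitive bin (e.g., E259) are handled here.

```python
# Example 1 bins produced by Algorithm 1 (from the text).
SB = {"SB1": ["E101", "E259"], "SB2": ["E152", "E159"]}
NSB = {"NSB1": ["E259", "E254"], "NSB2": ["E199", "E152"]}

def query(w):
    """Fetch the sensitive bin (requested in encrypted form) and the
    non-sensitive bin (requested in cleartext) that contain keyword w."""
    ws = next(b for b in SB.values() if w in b)
    wns = next(b for b in NSB.values() if w in b)
    # What the cloud observes: ciphertext tags plus one cleartext bin,
    # so the exact query value w is never isolated.
    adversary_view = {"sensitive": [f"E({v})" for v in ws], "plain": wns}
    return ws, wns, adversary_view

ws, wns, view = query("E259")  # retrieves SB_1 and NSB_1 together
```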

Correctness. The correctness of QB here means that the approach maintains the initial probability of associating a sensitive tuple with a non-sensitive tuple: this probability remains identical after executing several queries on the relations.

We can illustrate the correctness of QB with the help of an example. The objective of the adversary is to deduce the cleartext value corresponding to an encrypted value of either \(\{\)E101, E259\(\}\) or \(\{\)E152, E159\(\}\), since these sets are retrieved as bins. Note that before executing a query, the probability of an encrypted value, say \(E_i\) (where \(E_i\) may be E101, E259, E152, or E159), having a particular cleartext value is 1/4, which QB maintains at the end of a query. Assume that \(E_1\) and \(E_2\) are the encrypted representations of E101 and E259, respectively. Also, let \(v_1\), \(v_2\), \(v_3\), and \(v_4\) denote the cleartext values E259, E254, E199, and E152, respectively.

When the query arrives for \(\langle E_1,E_2,v_1,v_2\rangle \), the adversary learns that the cleartext representations of \(E_1\) and \(E_2\) cannot be exactly \(\{v_1, v_2\}\), nor exactly \(\{v_3, v_4\}\); if that were the case, there would be no way to associate each sensitive bin of the new bipartite graph with each non-sensitive bin. Now, if the adversary assumes that the cleartext representation of \(E_1\) is \(v_1\), then the adversary has four possible allocations of the values \(v_1\), \(v_2\), \(v_3\), \(v_4\) to \(E_1\), \(E_2\), \(E_3\), \(E_4\): \(\langle v_1,v_3,v_2,v_4\rangle \), \(\langle v_1,v_3,v_4,v_2\rangle \), \(\langle v_1,v_4,v_2,v_3\rangle \), \(\langle v_1,v_4,v_3,v_2\rangle \).

Since the adversary is not aware of the exact cleartext value of \(E_1\), the adversary also considers the cleartext representation of \(E_1\) to be \(v_2\), \(v_3\), or \(v_4\). This results in 12 more possible allocations of the values \(v_1\), \(v_2\), \(v_3\), \(v_4\) to \(E_1\), \(E_2\), \(E_3\), \(E_4\). Thus, the retrieval of the four tuples corresponding to \(\langle E_1,E_2,v_1,v_2\rangle \) results in 16 possible allocations of the values \(v_1\), \(v_2\), \(v_3\), and \(v_4\) to \(E_1\), \(E_2\), \(E_3\), and \(E_4\), of which only four allocations have \(v_1\) as the cleartext representation of \(E_1\). Hence, the probability that \(E_1=v_1\) remains 1/4.
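The allocation count can be checked by brute force; a short sketch:

```python
from itertools import permutations

# Brute-force check of the correctness argument: enumerate allocations
# of the cleartexts v1..v4 to E1..E4 that survive the adversarial view.
# Seeing SB_1 = {E1, E2} retrieved together with NSB_1 = {v1, v2} only
# rules out allocations mapping {E1, E2} exactly onto {v1, v2} or onto
# {v3, v4}.
V = ["v1", "v2", "v3", "v4"]
surviving = [p for p in permutations(V)
             if {p[0], p[1]} not in ({"v1", "v2"}, {"v3", "v4"})]

total = len(surviving)                           # 16 allocations survive
e1_is_v1 = sum(p[0] == "v1" for p in surviving)  # 4 of them map E1 -> v1
# Pr[E1 = v1] stays 4/16 = 1/4, the same as before the query.
```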

Note that, following this technique, executing a query for each keyword does not eliminate any surviving matches of the bipartite graph; hence, the adversary finds the new bipartite graph identical to the bipartite graph before the query execution. Figure 11 shows the initial bipartite graph before query execution, and Fig. 12 shows the bipartite graph after query execution when creating bins on the values. Note that in Fig. 12, each sensitive bin is linked to each non-sensitive bin, which, in turn, shows that each sensitive value remains linked to each non-sensitive value.

Fig. 12.

A bipartite graph showing sensitive and non-sensitive bins after query execution, where each sensitive value gets associated with each non-sensitive value.

5 Effectiveness of QB

From the performance perspective, QB results in savings in encrypted data processing over non-sensitive data: the more non-sensitive data, the greater the potential savings. Nonetheless, QB incurs overhead: it converts a single-predicate selection query into a set of selection predicates over cleartext non-sensitive data, and a set of encrypted selection predicates, albeit over a smaller database consisting only of sensitive data. In this section, we compare QB against a pure cryptographic technique and show when using QB is beneficial.

For our model, we need the following notations: (i) \(C_{ com }\): the communication cost of moving one tuple over the network; (ii) \(C_p\) (or \(C_e\)): the processing cost of a single selection query on plaintext (or encrypted) data. In addition, we define three parameters:

  • \(\alpha \): is the ratio between the sizes of the sensitive data (denoted by S) and the entire dataset (denoted by \(S+ NS \), where \( NS \) is non-sensitive data).

  • \(\beta \): is the ratio between the predicate search time on encrypted data using a cryptographic technique and on clear-text data. The parameter \(\beta \) captures the overhead of a cryptographic technique. Note that \(\beta = C_e/C_p\).

  • \(\gamma \): is the ratio between the processing time of a single selection query on encrypted data and the time to transmit the single tuple over the network from the cloud to the DB owner. Note that \(\gamma = C_e/C_{ com }\).

Based on the above parameters, we can compute the cost of cryptographic and non-cryptographic selection operations as follows:

  • \( Cost _{ plain }(x, D)\) is the sum of the processing cost of x selection queries on plaintext data and the communication cost of moving all the tuples matching the x predicates from the cloud to the DB owner, i.e., \(x(\log (D)C_p+ \rho D C_{ com })\).

  • \( Cost _{ crypt }(x, D)\) is the sum of the processing cost of x selection queries on encrypted data and the communication cost of moving all the tuples matching the x predicates from the cloud to the DB owner, i.e., \(C_e D+ \rho x D C_{ com }\), where \(\rho \) is the selectivity of the query. Note that the cost of evaluating x queries over encrypted data using techniques such as [20, 22, 41] is amortized and can be performed using a single scan of the data. Hence, x is not a factor in the cost corresponding to encrypted data processing.

Given the above, we define a parameter \(\eta \) as the ratio between the computation and communication cost of searching using QB and the computation and communication cost of searching when the entire dataset (viz. sensitive and non-sensitive data) is fully encrypted using the cryptographic mechanism.

$$\eta = \frac{ Cost _{ crypt }(| SB |, S)}{ Cost _{ crypt }(1,D)} + \frac{ Cost _{ plain }(| NSB |, NS )}{ Cost _{ crypt }(1,D)}$$

Filling in the values from above, the ratio is:

$$\eta =\frac{C_e S + | SB |\rho D C_{ com }}{C_e D + \rho D C_{ com }} + \frac{| NSB |\log (D)C_p + | NSB | \rho D C_{ com }}{C_e D + \rho D C_{ com }}$$

Separating out the communication and processing costs, \(\eta \) becomes:

$$\eta =\frac{S}{D}\frac{C_e}{C_e + \rho C_{ com }} + \frac{| NSB |\log (D)C_p}{C_e D + \rho D C_{ com }} +\frac{\rho D C_{ com }(| NSB |+| SB |)}{C_e D + \rho D C_{ com }}$$

Substituting for various terms and cancelling common terms provides:

$$\eta = \alpha \frac{1}{(1 + \frac{\rho }{\gamma })} + \frac{\log (D)}{D} \frac{|NSB|}{\beta (1 + \frac{\rho }{\gamma })} + \frac{\rho }{\gamma } \frac{| NSB | + | SB |}{(1 + \frac{\rho }{\gamma })}$$

Note that \(\rho /\gamma \) is very small; thus the term \((1 + \rho /\gamma )\) can be approximated by 1. The equation then becomes:

$$\eta = \alpha + \log (D)| NSB |/(D\beta ) + \rho (| NSB | + | SB |)/\gamma $$

Note that the term \(\log (D)| NSB |/(D\beta )\) is very small, since \(| NSB |\) is the number of distinct values (approximately equal to \(\sqrt{| NS |}\)) in a non-sensitive bin, while D, the size of the database, is a large number, and the value of \(\beta \) is also very large. Thus, the equation becomes:

$$\eta = \alpha + \rho (| SB |+ | NSB |)/\gamma $$

QB is better than a cryptographic approach when \(\eta < 1\), i.e., \(\alpha + \rho (| SB |+ | NSB |)/\gamma < 1\). Thus,

$$\alpha < 1 - \frac{\rho (| SB |+ | NSB |)}{\gamma }$$

Since the values of \(| SB |\) and \(| NSB |\) are approximately \(\sqrt{| NS |}\), we can simplify the above equation to: \(\alpha < 1 - 2 \rho \sqrt{| NS |}/\gamma \). If we estimate \(\rho \) to be roughly \(1/| NS |\) (i.e., we assume a uniform distribution), the above equation becomes: \(\alpha < 1 - 2/(\gamma \sqrt{| NS |})\).
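Under these approximations, the break-even condition is easy to explore numerically; a sketch with illustrative (not measured) parameter values:

```python
import math

def eta(alpha, ns_size, gamma):
    """Simplified cost ratio eta = alpha + rho * (|SB| + |NSB|) / gamma,
    taking |SB| ~ |NSB| ~ sqrt(|NS|) and rho ~ 1/|NS| (uniform data)."""
    rho = 1.0 / ns_size
    num_bins = math.sqrt(ns_size)
    return alpha + rho * (2 * num_bins) / gamma

# Illustrative parameters: 30% sensitive data, |NS| = 10^6 distinct
# non-sensitive values, and a gamma in the thousands. The second term is
# negligible, so eta is essentially alpha < 1, and QB beats the fully
# encrypted baseline in this regime.
ratio = eta(0.3, 10**6, 2500)
```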

Fig. 13.

Efficiency graph using equation \(\eta = \alpha + \rho (| SB |+ | NSB |)/\gamma \).

The equation above demonstrates that QB trades increased communication costs for a reduction in the amount of data that needs to be searched in encrypted form. Note that the reduction in encryption cost is proportional to \(\alpha \) times the size of the database, while the increase in communication costs is proportional to \(\sqrt{|D|}\), where |D| is the number of distinct attribute values. This, coupled with the fact that \(\gamma \) is much higher than 1 for encryption mechanisms that offer strong security guarantees, ensures that QB almost always outperforms the full encryption approaches. For instance, the cryptographic cost of a search using secret-sharing is \({\approx }10\,\mathrm{ms}\) [20], while the cost of transmitting a single row (\(\approx \)200 bytes for the TPCH Customer table) is \({\approx }4\,\upmu \mathrm{s}\), making the value of \(\gamma \approx 2500\). Thus, based on the model, QB should outperform the fully encrypted solution for almost any value of \(\alpha \) under ideal situations where our assumption of uniformity holds. Figure 13 plots \(\eta \) as a function of \(\gamma \) for varying sensitivity and \(\rho = 10\%\).

To explore the effectiveness of QB under different DB sizes, we tested QB on three DB sizes: 150K, 1.5M, and 4.5M tuples. Figure 14 plots the \(\eta \) values for the three sizes while varying \(\alpha \). The figure shows that \(\eta < 1\) irrespective of the DB size, confirming that QB scales to larger DB sizes.

Fig. 14.

Dataset size.