Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

1 Introduction

Emerging paradigms like data outsourcing and cloud computing have attracted the attention of the research and industrial communities thanks to their advantages in terms of reduced costs for IT resources, increased storage, flexibility in resource management, and higher scalability. These advantages however do not come for free. In fact, these emerging paradigms also introduce a number of privacy and security risks that may represent a serious obstacle for their wide development and for their acceptance by users and companies. Security and privacy may relate to different aspects, including resources, data and network isolation, attacks to the cloud servers, compliance with laws and regulations, reliability of applications and services, protection of the confidentiality and integrity of data, and data availability (e.g., [11, 19, 38, 39, 44]). In this chapter, we will provide an overview of the problems and solutions related to the proper protection of the confidentiality of the data and to the efficient access to them. These problems become quite complex in a cloud scenario since users release and store their data on external servers that are outside their control. Also, the advances in the Information and Communication Technologies (ICTs), including the possibility of combining and analyzing more information from several data sources, intensify the data protection problem.

The protection of potentially sensitive data stored and managed by external cloud servers poses interesting challenges. In fact, cloud servers can be characterized by different levels of trust, ranging from honest-but-curious servers, meaning that they are trusted for the management of the data but cannot know (access) the data they store, to servers that may intentionally behave improperly in the storing and processing of the data. Data are therefore encrypted by the data owner before their storage in the cloud. Since cloud servers cannot decrypt data, there is the problem of defining techniques (e.g., indexes) for enforcing fine-grained retrieval of the data without compromising their privacy. However, techniques that support effective and efficient accesses to the outsourced data are not enough. In fact, if the server (or a generic observer) monitors the accesses by users, it may be able to draw inferences on which data have been accessed. Also, the presence of multiple users who rely on external storage for making their data available to others, introduces the problem of enforcing selective (read and write) access to the outsourced data.

In this chapter, after a brief overview of the different security and privacy problems that can arise in a cloud computing scenario, we survey and discuss research results related to the protection of the privacy of outsourced data, and on the fine-grained and selective retrieval of data. We also show that the combination of techniques addressing a specific problem can cause privacy breaches. The remainder of the chapter is organized as follows. Section 2 provides an overview of the main security and privacy risks in a cloud scenario. Section 3 illustrates some approaches and open issues related to the protection of data confidentiality, indexing for query support, and selective access. Section 4 describes how the combination of indexes for query support and fragments for data confidentiality can cause leakage of confidential information. Section 5 describes how the combination of indexes and selective encryption may allow unauthorized users to infer (or reduce their uncertainty on) information that they are not authorized to access. Finally, Sect. 6 provides our conclusions.

2 Security and Privacy in the Cloud

The security and privacy problems that arise when data are stored at external servers have been the subject of many studies (e.g., [22, 31, 37]). Depending on the considered aspect, the security and privacy problems can be related to: (i) the privacy of users; (ii) the privacy and integrity of data storage; (iii) the privacy and integrity of queries; and (iv) the secure and private data computations involving multiple providers. Figure 1 illustrates the reference cloud scenario where users interact with external cloud servers for accessing data and services, and different cloud servers collaborate for offering a service or responding to a query. In the remainder of this section, we provide a description of each of the four categories of security and privacy problems mentioned above.

Fig. 1
figure 1

Reference cloud scenario

Privacy of users. Cloud services allow users to access applications and data on demand every-time they need. To successfully complete the required access, users may be asked to provide some information while however wishing to protect their identities for privacy reasons. For instance, a user can be interested in querying a cloud server for collecting information about a given illness without revealing her identity to avoid possible correlations between the illness and herself or a person close to her. The techniques developed for supporting anonymous communication between parties and attribute-based access control can be helpful in protecting the privacy of the users. In fact, anonymous communication techniques allow users to communicate on the Internet without revealing their identities [9], meaning that an observer cannot trace who is communicating with whom, or who is interacting with which server or searching for which data. Attribute-based access control solutions allow users to access resources or data without reveling their identities [13]. The idea is that, instead of declaring their identities, users prove that they satisfy the conditions needed for the access. To this purpose, a user can disclose a credential (a set thereof) certifying the information necessary for the access. The server verifies whether the credential is valid and whether the information it certifies satisfies the policy regulating access to the resource. The research community has also devoted considerable attention to the use of anonymous credentials [16] for access control (e.g., [4]). An anonymous credential allows a user to make statements about attribute values, maintaining the values private. For instance, anonymous credentials permit to selectively release a subset of the properties in a credential or to prove that they satisfy some conditions, without revealing any information about their values. Anonymous credentials can be at the basis of a new generation of access control policy languages that can be particularly suited to open and dynamic scenarios like the cloud.

Recently, some proposals have started to address the problem of regulating the release of users’ personal information according to privacy preferences expressed by the users themselves. These proposals have introduced models relying on user preferences that permit to associate a higher or lower sensitivity with the combined release of a set of properties/credentials (e.g., [57, 40, 53]). For instance, a user may consider the joint release of her name and credit card number more sensitive than the release of each information singularly taken. Although these solutions represent a first step towards the definition of a comprehensive approach for the protection of users’ privacy, there are still several open issues: the development of user-friendly approaches for expressing privacy preferences; the ability of defining privacy preferences that depend on the context; and the integration of these approaches with server-side solutions supporting fine-grained policy disclosure, which permit the server to obfuscate the portions of its policies considered sensitive, while providing the user with enough information for releasing the information necessary to possibly gain access (e.g., [8]).

Privacy and integrity of data storage. When data are outsourced to an external server that is outside the control of the data owner, the protection of the confidentiality and of the integrity of the data, as well as the efficient access to them become clearly of paramount importance. In this context, the research community has been very active and produced advancements in several areas: solutions for protecting data confidentiality (e.g., encryption and fragmentation [1, 21, 37]); indexes for supporting queries (e.g., [17, 37]), solutions for supporting selective access to outsourced data (e.g., [24]), solutions for ensuring data integrity (e.g., signatures [14, 35, 43]). These approaches typically consider a scenario where a data owner outsources her data to an external server that can be trusted to properly manage the data, making them available to requesting users, but it is not trusted to read the content of the data it stores (i.e., honest-but-curious server). The outsourced data can be of any type, including files and relational tables. In the remainder of this chapter, for simplicity and without loss of generality, we will assume that the outsourced data are organized in a single relation r, stored in a (distributed) relational database. Relation r is defined over relational schema \(\mathit{R}_{}(\mathit{a}_{1},\ldots,\mathit{a}_{n})\), with attribute a i defined over domain D i , i = 1, , n. The presentation of solutions and issues related to the protection of the privacy of outsourced data will be the subject of the following sections.

Privacy and integrity of queries. Accessing information from external cloud servers and performing queries over outsourced data introduce several privacy and integrity issues. Existing data management architectures typically assume that the data obtained from distributed parties have not been tampered with, and are available only to authorized parties. Such assumptions do not apply anymore in cloud scenarios, where multi-tenant infrastructures orchestrate different services. Assurances on the fact that the privacy of the queries is preserved and that computations on data are processed in the expected way (integrity and verifiability) are becoming more and more important. In fact, there is an increasing need for novel techniques that support not only data privacy, but also the privacy of the accesses that users make on such data. This problem has been traditionally addressed by Private Information Retrieval (PIR) proposals (e.g., [18]), which provide protocols for querying a database that prevent the external server from inferring which data are being accessed. PIR solutions however have high computational complexity, and alternative approaches have been proposed. These novel approaches rely on the Oblivious RAM structure (e.g., [33, 47, 48]) or on the definition of specific tree-based data structures combined with a dynamic allocation of the data (e.g., [29, 30]). The goal is to support the access to a collection of encrypted data while preserving access and pattern confidentiality, meaning that an observer can infer neither what data are accessed nor whether two accesses aim to the same data. Besides protecting access and pattern confidentiality, it is also necessary to design mechanisms for protecting the integrity and authenticity of the computations, that is, to guarantee the correctness, completeness, and freshness of query results. Most of the techniques that can be adopted for verifying the integrity of query results operate on a single relation and are based on the idea of complementing the data with additional data structures (e.g., Merkle trees) or of introducing in the data collection fake tuples that can be efficiently checked to detect incorrect or incomplete results (e.g., [41, 46, 5052]). Interesting aspects that need further analysis are related to the design of efficient techniques able to verify the completeness and correctness of the results of complex queries (e.g., join operations among multiple relations, possibly stored and managed by different cloud servers with different levels of trust).

Secure and private data computations. More and more emerging scenarios require different cloud servers to cooperate to the aim of sharing information and/or performing distributed computations. This sharing process can be clearly selective, meaning that different servers may have different access privileges. Recently, a significant amount of research has addressed the problem of processing distributed queries under protection requirements (e.g., [2, 15, 26]). Some proposals are based on the concept of access pattern, a profile associated with each relation/view [15]. For each attribute of the relation/view, the access pattern includes a value that may be either i for input or o for output. When accessing a relation, the values for all i attributes must be supplied to obtain the corresponding values of o attributes. Sovereign joins [2] are an alternative solution for securely processing joins. This solution is based on a secure coprocessor, which is involved in query execution, and exploits cryptography. Other approaches propose an authorization model to regulate the view that each server can have on the data, ensuring that query computation exposes to each server only the data that the server can view [26]. The idea is that a relation (base or resulting from the evaluation of a query) can be released to a server whenever the information it carries (either directly or indirectly when the relation has been obtained as the result of a query) is visible from the receiving party. The proposed authorization model operates at the schema level and supports the definition of generic view patterns, thus nicely meeting both expressiveness and simplicity requirements.

Figure 2 summarizes the main categories of security and privacy issues discussed above (gray boxes) along with some of the corresponding solutions (white boxes). Note that this classification does not aim to be complete but only to provide a quick overview of the solutions mentioned.

Fig. 2
figure 2

Summary of security and privacy issues and corresponding solutions

3 Privacy of Data Storage

The problem of protecting outsourced data while enjoying effective and efficient data management and retrieval operations has attracted the attention of many researches, and several investigations have been carried out. The problem is quite complex and involves several aspects, including basic techniques for protecting data at rest (Sect. 3.1), techniques for efficiently accessing encrypted data without compromising their confidentiality (Sect. 3.2), and data-centric techniques for supporting selective access to the outsourced data without relying on the data owner and/or on the honest-but-curious server storing the data (Sect. 3.3). We now describe more in details these aspects.

3.1 Encryption and Fragmentation

The problem of protecting the confidentiality of outsourced data has been one of the first issues investigated in the data outsourcing and cloud scenarios. In fact, the risk that unauthorized parties (or even the external server itself) can access sensitive information is one of the main factors for which users (and not only) are often reluctant to adopt the cloud for storing their data. The solutions proposed to protect data confidentiality are based on encryption and fragmentation, which can be used either singularly or in combination.

Encryption consists in wrapping a protective layer of encryption around data before storing them at an external server (e.g., [17, 34, 37, 44]). Since the encryption key is known only to the data owner and to authorized users, this technique protects the data against both external (malicious) parties, and the server itself. While effective, this approach is based on the conservative assumption that all the outsourced data are equally sensitive and must therefore be protected. However, as first observed in [1, 20, 21], often data are not sensitive per se but what is sensitive is their association with other data. As an example, the list of the names of hospitalized patients and the list of diseases cured in a hospital are not sensitive. On the contrary, the association of patients’ names with the illness they suffer from is highly sensitive and should therefore be kept confidential. Data confidentiality can then be achieved by properly protecting sensitive associations. Given a relation r over relation schema \(\mathit{R}_{}(\mathit{a}_{1},\ldots,\mathit{a}_{n})\), both sensitive attribute values and sensitive associations among them can be modeled through confidentiality constraints [1]. A confidentiality constraint c over R is a subset of the attributes in R (i.e., c \(\subseteq \) R), modeling a sensitive association on the values of the attributes in c. Constraint c states that, for each tuple t in r: (i) value t[a] is considered sensitive per se, if c is a singleton constraint (i.e., c  = {a}); (ii) the joint visibility of the values of the attributes in c is considered sensitive, if c is an association constraint (i.e., c =\(\,\{\mathit{a}_{i},\ldots,\mathit{a}_{j}\}\)). For instance, Fig. 3b illustrates a set of confidentiality constraints over relation Patients in Fig. 3a. Singleton constraint c 0 states that the list of Social Security Numbers is considered sensitive per se. The remaining association constraints state that the association of: patients’ name with the disease they suffer from (c 1), patients’ names with their job (c 2), and patients’ job with their disease (c 3) are considered sensitive, respectively.

Fig. 3
figure 3

An example of plaintext relation (a) and of a set of confidentiality constraints over it (b)

Given a relation r and a set C of confidentiality constraints over it, the goal is to combine fragmentation and encryption techniques to guarantee that sensitive values and sensitive associations are properly obfuscated. Intuitively, singleton constraints are enforced by encrypting the attribute values before outsourcing or by not outsourcing the attribute values at all. Association constraints are enforced by partitioning the attributes in R in different subsets (fragments), or by not releasing (in clear form) at least one of the attributes in the constraint. A fragmentation correctly enforces the confidentiality constraints if no fragment stored at the external server represents all the attributes in a constraint in clear form, and fragments cannot be joined by unauthorized users.

The approaches that rely on fragmentation and encryption for enforcing confidentiality constraints differ in how they guarantee that fragments cannot be joined, and in how they protect attribute values considered sensitive per se. Based on these differences, existing techniques can be classified as follows.

  • Non-communicating pair of servers [1]. The data owner partitions relation R in two fragments, F 1 and F 2, stored at two non-communicating servers. Those attributes that cannot be stored at any of the two servers without violating confidentiality constraints are encoded and the result is stored at the two servers (e.g., the attribute values are encrypted via one-time-pad, and the result of encryption is stored at one server, while the key is stored at the other one). Only users who can access both the versions of an encoded attribute can reconstruct its plaintext values. Figure 4 illustrates an example of fragmentation for relation Patients in Fig. 3a that satisfies the confidentiality constraints in Fig. 3b. It is composed of fragments F 1  = {\(\underline{\mathtt{tid}}\), Name, YoB, SSN k, Disease k} and F 2  = {\(\underline{\mathtt{tid}}\), Job, SSN k, Disease k}. Attribute tid is a tuple identifier introduced in the two fragments to permit authorized users to correctly join F 1 and F 2 to reconstruct the original content of relation Patients. Attributes SSN k and Disease k represent the encoded version of attributes SSN and Disease, respectively.

  • Multiple fragments [21]. The data owner partitions relation R in an arbitrary set of fragments, \(\{\mathit{F}_{1},\ldots,\mathit{F}_{m}\}\), possibly stored at the same server. Fragments are disjoint, meaning that no attribute is represented in clear form in more than one fragment. All the attributes in R that are not represented in clear form in a fragment are however represented in encrypted form in the fragment (i.e., each fragment is complete). Figure 4 illustrates an example of fragmentation for relation Patients in Fig. 3a that satisfies the confidentiality constraints in Fig. 3b. It is composed of three fragments: F 1  = {\(\underline{\mathtt{salt}}\), enc, Name, YoB}, F 2  = {\(\underline{\mathtt{salt}}\), enc, Job}, and F 3  = {salt, enc, Disease}. Attribute salt is a randomly chosen value, different for each tuple in each fragment. Attribute enc is the result of the encryption of the attributes in the original relation that are not represented in clear form in the fragment, concatenated with salt. For readability, in all our examples tuples in fragments are in the same order as in the original relation, even if the order in which tuples are stored in fragments is independent from the order in which they appear in the original relation. Note that the possibility of using an arbitrary number of fragments has the advantage that all attributes that are not involved in singleton constraints can be represented in clear form in a fragment (in the worst case, we can have a fragment for each attribute), as it is visible from the example above.

  • Departing from encryption [20]. The data owner partitions relation R in two fragments, F o and F s , and locally stores one of them (F o ), while the other is outsourced to an external server (F s ). Since only authorized users can access F o , neither the server nor unauthorized users can join F o and F s to possibly reconstruct sensitive associations. Note that fragment F o can both include attributes considered sensitive per se and sensitive associations. This solution completely departs from encryption, but it requires the data owner to locally store a portion of her data and to cooperate with the external server in query evaluation. Figure 4 illustrates an example of fragmentation for relation Patients in Fig. 3a that satisfies the confidentiality constraints in Fig. 3b. It is composed of fragment F o   = {\(\underline{\mathtt{tid}}\), SSN, Job, Disease} stored at the data owner side, and fragment F s   = {\(\underline{\mathtt{tid}}\), Name, YoB} stored at the external server side.

Fig. 4
figure 4

An example of fragmentation of relation Patients in Fig. 3a according to the non-communication pair of servers, multiple fragments, and departing from encryption scenarios

Encryption, fragmentation, and their combinations are powerful mechanisms for protecting data confidentiality. However, there are still several open issues that need to be further investigated. In fact, fragmentation and encryption break associations among attribute values that could be considered of interest for final recipients, thus compromising the utility of released data. Alternative solutions that protect data while preserving a certain utility are therefore needed [25]. Also, confidentiality constraints are defined over relation schemas, while they could be extended to operate at the instance level (i.e., at the attribute values level). We also observe that encryption and fragmentation work under the assumption that the data collection never changes. Techniques supporting updates to the outsourced data collection without compromising confidentiality still need to be designed.

3.2 Indexes

The adoption of encryption for protecting data confidentiality makes query execution difficult. In fact, confidentiality demands that data decryption must be possible only at the user side. Solutions have been then developed to enable cloud servers to execute queries directly on encrypted data. These solutions complement the outsourced relation with a set of indexes, which are metadata information built on the plaintext values of the attributes [44]. Formally, a relation r, defined over schema \(\mathit{R}_{}(\mathit{a}_{1},\ldots,\mathit{a}_{n})\), is represented at the server side through an encrypted relation r k over schema \(\mathit{R}_{}^{k}\)(\(\underline{\mathtt{tid}}\), enc, \(\mathit{I}_{i_{1}},\ldots,\mathit{I}_{i_{j}}\)). Attribute tid is a numerical attribute added to the original relation and acting as a primary key. Attribute enc represents the encrypted tuple. Attribute \(\mathit{I}_{i_{l}}\), l = 1, , j, is the index defined over attribute \(\mathit{a}_{i_{l}}\) in R. Each tuple t in r is represented by an encrypted tuple t k in r k where t k[enc]  = E k (t), with E a symmetric encryption function with key k, and \({ {\mathit t}_{}^{k}}[{{\mathit I}_{i_{l}}}]\)  = \(\iota (\mathit{t}_{}[\mathit{a}_{i_{l}}])\), with ι an index function defined over \(\mathit{D}_{i_{l}}\). Note that \(\mathit{R}_{}^{k}\) has an index only for those attributes in R on which conditions need to be evaluated. Figure 5 illustrates an example of encrypted and indexed version of relation Patients in Fig. 3a, with indexes over attributes Name (I n ), YoB (I y ), Job (I j ), and Disease (I d ).

Fig. 5
figure 5

An example of encrypted and indexed version of relation Patients in Fig. 3a

Different indexing techniques have been proposed in the literature to support different kinds of conditions. Most of these indexing techniques can be classified in the following three classes, depending on how the corresponding index function ι maps the original values to the corresponding index values.

  • Direct index. Index function ι maps each plaintext value to a different index value and vice versa. An example of direct index is represented by encryption-based indexes (e.g., [22]). For each tuple t ∈ r, the value of index I, defined over attribute a, is computed as ι(t[a])  = E k (t[a]). For instance, index I y in relation Patients k in Fig. 5 represents an example of direct index over attribute YoB of relation Patients in Fig. 3a.

  • Bucket-based index. Index function ι maps different plaintext values to the same index value, generating collisions. Each plaintext value is however mapped to only one index value. An example of bucket-based index is represented by partition-based indexes, which partition the domain D of attribute a into non-overlapping subsets of contiguous values, and associate a label with each partition (e.g., [37]). For each tuple t ∈ r, the value of index I, defined over attribute a, corresponds to the label of the unique partition to which value t[a] belongs. For instance, index I n in relation Patients k in Fig. 5 represents an example of partition-based index over attribute Name of relation Patients in Fig. 3a. The domain of attribute Name has been partitioned in four intervals depending on the initial of the name, with labels: π for names with initial in the range [A,B], ρ for names with initial in the range [C,D], σ for names with initial in the range [E,F], and τ for names with initial in the range [G,H]. Another example of bucket-based index is represented by the hash-based indexes (e.g., [17]). For each tuple t ∈ r, the value of index I, defined over attribute a, is computed as ι(t[a])  = h(t[a]), where h is a secure hash function that generates collisions. For instance, index I j in relation Patients k in Fig. 5 represents an example of hash-based index over attribute Job of relation Patients in Fig. 3a. The hash function adopted generates collisions and, in particular, is defined as follows: h(Clerk)  = h(Nurse)  = δ, h(Doctor)  = \(\varepsilon\), and h(Lawyer)  = h(Teacher)  = ζ.

  • Flattened index. Index function ι maps each plaintext value to a set of index values to guarantee that all index values have the same number of occurrences (flattening). Each index value represents one plaintext value only. The index can be obtained by applying an encryption function to the plaintext values of the attribute and a post processing that flattens the distribution of the index values (e.g., [45]). For instance, index I d in relation Patients k in Fig. 5 represents an example of flattened index over attribute Disease of relation Patients in Fig. 3a, where each index value has exactly one occurrence.

These indexing techniques support the partial evaluation at the server-side of SQL queries. Given a query q, it is translated into a query q s executed at the server side on the encrypted relation, and a query q c executed at the client side on the decrypted result of q s . Query q c includes all conditions that cannot be evaluated by the server and aims at eventually discarding all spurious tuples returned by q s , that is, all tuples that do not satisfy the original query submitted by the user. The translation of query q into query q s and q c depends both on the kind of indexes defined for the attributes involved in the query and on the kind of query. As an example, consider query q  = “select Att from R where Cond”, where Att ⊆ R and Cond is a set of equality conditions of the form a  = v, with a ∈ R and v a constant value in the domain D of a. Each equality condition a  = v is translated into an equivalent condition I in ι(v), with I the index defined over a and ι the corresponding index function. Query q is then translated into query q s   = “select enc from R k where Cond k”, where Cond k includes, for each equality condition a  = v, the equivalent condition I in ι(v). The client will decrypt the result of q s computed by the server, and will execute query q c that eliminates spurious tuples, evaluates conditions that cannot be performed at the server side, and projects only the attributes in Att to obtain the result of q. For instance, query q  = select Name from Patients where Job  = ‘Nurse’ and Disease  = ‘Asthma’ is translated into query q s   = select enc from Patients k where I j   = δ and I d  ∈ {η,θ,ω}, which returns the first and third tuples in Fig. 5. The client then filters spurious tuples from the result of q s by evaluating query q c   = select Name from D k (Res k) where Job  = ‘Nurse’, where Res k is the encrypted result returned by the server and D the symmetric decryption function with key k. Query q c returns the value of attribute Name of tuple t 1 in Fig. 3a, which corresponds to the result of the original query q formulated by the user.

Indexing techniques specifically aimed at supporting the efficient evaluation of range conditions are based on order preserving encryption schemas (e.g., [3, 45]). Indexes that support aggregate functions and the basic arithmetic operators (i.e., +,−,×) rely on homomorphic encryption techniques (e.g., [32, 36]). Additional indexing techniques, which cannot be classified as mentioned above, are based, for example, on the definition of data structures (e.g., B+-tree) coupled with the encrypted relation and stored at the server [22].

The definition of indexes over outsourced relations must balance precision in query evaluation and privacy of the data [17]. In fact, more precise indexes provide more efficient query execution, at the price of a greater exposure to possible privacy violations. Also, the number of indexes complementing an outsourced relation should be carefully tuned, since each additional index may cause a rapid growth to the risk of privacy violations.

3.3 Selective Encryption

In many real-world systems, different users may have different privileges on the outsourced data. Traditional access control architectures are based on the presence of a trusted component, called reference monitor, that is in charge of enforcing the access control policy defined by the data owner. In a cloud scenario, however, neither the data owner (for efficiency reasons) nor the cloud server storing the data (for privacy reasons) can enforce the access control policy. An interesting solution addressing this issue consists in adopting selective encryption [24], meaning that different keys are used for encrypting different data. The encryption keys are then (directly or indirectly) released only to the users authorized to access the corresponding data. The idea of using different keys for enforcing access control is not new and has been first introduced in other contexts. For instance, in [42] the authors propose to store encrypted XML documents on (potentially insecure and vulnerable) Web servers. The decisions about access rights to different portions of an XML document can be made by the document creator and are immediately applied to the XML document by using different encryption keys for different portions of the same XML document. To enforce access restrictions, users then obtain only the keys associated with the portions of XML documents for which they have an access right. Other proposals put forward the idea of using hierarchical-based access control in the context of distributed environments and broadcast pay tv content (e.g., [12, 49]). In the remainder of this section, we describe the main characteristics of the selective encryption approach in [24], specifically designed for the cloud scenario.

Given a set U of users and a relation r, the authorization policy regulating access to tuples in r is represented by an access matrix M, with a row for each user u ∈ U and a column for each tuple t ∈ r. Cell M[u,t] is equal to 1 (0, respectively), if user u can (cannot, respectively) access tuple t. For each tuple t, acl(t) denotes the set of users who can access it (i.e., its access control list). For instance, Fig. 6 illustrates an example of access matrix regulating access to the tuples of relation Patients in Fig. 3a by a set U  = {A, B, C, D, E} of users.

Fig. 6
figure 6

An example of access matrix regulating access to relation Patients in Fig. 3a

The authorization policy defined by the data owner is translated into an equivalent encryption policy. The encryption policy regulates keys used to encrypt tuples as well as key distribution to users and must be equivalent to the access control policy defined by the data owner, that is, each user can decrypt all and only the tuples she is authorized to access.

The translation of an authorization policy into an equivalent encryption policy is driven by two requirements: (i) each user must manage at most one key, and (ii) each tuple must be encrypted at most once (i.e., no replication). To satisfy these two desiderata, the approach in [24] adopts a key derivation technique based on public tokens, which permit to compute the value of an encryption key starting from the knowledge of another key and a piece of publicly available information [10]. Each key k i is associated with a public label l i and, given keys k i and k j , token token i, j is computed as k j h(k i ,l j ), with ⊕ the bitwise xor operator, and h a deterministic cryptographic function. Token token i, j permits to derive key k j from k i and public label l j . Key derivation techniques are based on the definition of a key derivation graph, specifying which keys can be derived from other keys. A key derivation graph is a directed acyclic graph whose vertices represent keys, and whose edges represent tokens. The existence of a path from key k i to key k j in the key derivation graph denotes the fact that k j can be (directly or indirectly, via a chain of tokens) derived from k i . A key derivation graph correctly enforces an authorization policy M if each user u i  ∈ U can derive, starting from the key she knows, the keys used to encrypt all and only the tuples t j  ∈ r that she can access (i.e., with M[u i ,t j ]  = 1). To define such a graph, the idea is to exploit the set containment relationship ⊆ over U. A key derivation graph induced by ⊆ over U has a vertex for each subset of users in U and a path from vertex v i to vertex v j if v i represents a subset of the users represented by v j . The correct enforcement of the policy is guaranteed if each user knows the key of the vertex representing herself in the graph, and each tuple is encrypted with the key of the vertex representing its acl. For instance, consider the portion of the access matrix in Fig. 6 defined for the subset {A, B, C, D} of users. The encryption policy in Fig. 7 is equivalent to the access control policy represented by the first four rows in Fig. 6. For readability, each vertex in the graph of Fig. 7 is labeled with the set of users it represents. As an example, user A can decrypt tuples t 1, t 4, t 6, and t 7 since she can derive, starting from vertex labeled A, the keys with which these tuples are encrypted.

Fig. 7
figure 7

An example of encryption policy equivalent to the access control policy in Fig. 6, considering the subset {A, B, C, D} of users

Although effective for enforcing the authorization policy, the solution above defines more keys and tokens than necessary. Since the number of tokens in the system influences the access time, the proposal in [24] reduces the number of tokens by removing from the key derivation graph the vertices and edges that are not necessary to enforce M. The problem of minimizing the number of edges in a key derivation graph is however NP-hard. In [24] the authors propose an heuristic approach, which has been proved to obtain good results, based on two observations: (i) the vertices needed for correctly enforcing an authorization policy are those representing singleton sets of users and the acls of tuples in r; (ii) when two or more vertices have more than two common direct ancestors, the insertion of a vertex representing the set of users corresponding to these ancestors reduces the total number of tokens. Figure 8a illustrates an example of key derivation graph obtained adopting the approach in [24] over the access matrix in Fig. 6. As it is visible from the figure, the graph includes a vertex for each user and for each acl of a tuple in the system. It also includes an additional vertex (i.e., ABC), introduced to limit the number of tokens in the system. Clearly, the encryption policy in Fig. 8 is more convenient than the one in Fig. 7, as it reduces both the number of keys and the number of tokens in the system, while managing an additional user.

Fig. 8
figure 8

An example of encryption policy equivalent to the access control policy in Fig. 6

Since the keys used to encrypt tuples depend on their access control lists, whenever the authorization policy changes, the tuples involved in the policy update may need to be re-encrypted to guarantee the equivalence of the encryption policy. For instance, assume that user E is revoked the privilege to read tuple t 6. Such a tuple should be first decrypted using key k ABCDE , and then encrypted using key k ABCD . However, re-encryption requires the direct involvement of the data owner and can be computationally expensive. The number of re-encryption operations are therefore minimized by adopting two layers of encryption that allow the server to manage policy update operations [24]. The Base Encryption Layer (BEL) is applied by the data owner before transmitting the relation to the server and consists in encrypting the tuples according to the authorization policy existing at initialization time. The Surface Encryption Layer (SEL) is performed by the server over the tuples already encrypted by the data owner. It enforces the dynamic changes over the policy. The basic idea consists in over-encrypting the tuples so that a user can access a tuple only if she knows or can derive the key used for encrypting the tuples at both levels.

The solution in [24] enforces read privileges only and has been complemented with another technique that allows the management of write operations [23]. This work associates each tuple with a write tag. The write tag is a random value chosen by the data owner independently from the tuple content, and is encrypted with a key known only to users who can modify the tuple and to the external server. The server will then enforce a write operation on a tuple only if the requesting user proves to know the write tag of the tuple. The proposal in [23] extends the key derivation graph with a key for the server and the keys necessary for protecting write tags. For instance, consider the read privileges in Fig. 6 over relation Patients in Fig. 3a, and assume that: tuples t 1, t 4, and t 7 can be modified by user A only; tuples t 2 and t 6 can be modified by B, D, and E; tuples t 3 and t 5 can be modified by D; and tuple t 8 can be modified by C. Figure 9 illustrates the encryption policy in Fig. 8, extended to properly enforce write privileges. In the figure, we denote the external server as S.

Fig. 9
figure 9

Encryption policy in Fig. 8, extended to enforce write authorizations

Open issues that still need to be addressed are related to the expressive power of the supported access control policy, especially considering the ever-increasing bring-your-own-device (BYOD) trend. In fact, it would be interesting to develop solutions that will allow the specification of fine-grained restrictions, based on the users’ context and on the specific device adopted for accessing data.

4 Indexes and Fragmentation

The fragmentation works illustrated in Sect. 3.1 permit to delegate to the server the evaluation of any condition over attributes appearing plaintext in a fragment. However, the client still needs to evaluate those queries that operate on encrypted attributes, or that involve attributes that are not represented in plaintext in the same fragment. For instance, consider the fragmentation in Fig. 4 obtained in the multiple fragments scenario of relation Patients in Fig. 3a. Query q  = select Name from Patients where YoB  = 1980 and Disease  = ‘Asthma’ cannot be evaluated by the server, since attributes YoB and Disease do not appear in the clear in the same fragment and the server can neither decrypt attribute enc nor join F 1 and F 3. Hence, one of the two conditions in q must be evaluated by the client. To mitigate the client’s overhead in query evaluation, fragments can be complemented with indexes over encrypted attributes. Figure 10 illustrates three versions of fragment F 1 in Fig. 4, complemented with index I d over attribute Disease, which has been computed using each of the three kinds of indexes illustrated in Sect. 3.2. The presence of indexes in a fragment could however cause unintended leakage of sensitive information [28]. The exposure to leakage varies depending on the knowledge that a curious observer (e.g., the external server) can exploit and the kind of indexes. In particular, the following two kinds of knowledge can be exploited for breaching data confidentiality.

Fig. 10
figure 10

Fragment F 1 in Fig. 4 complemented with a direct index (a), a bucket-based index (b), and a flattened index (c) over attribute Disease

Fig. 11
figure 11

An example of vertical (a) and horizontal (b) knowledge by an observer

  • Vertical knowledge is the knowledge of the projection of attribute a over relation r, and is due to the presence of attribute a in the clear in one fragment and indexed in other fragments. Vertical knowledge does not require any additional external information for an observer since, apart from the case where the attribute appears in a singleton constraint, it refers to information immediately present in other accessible fragments. For instance, fragment F 3 in Fig. 4 makes visible the plaintext values (and their number of occurrences) of attribute Disease (see Fig. 11a).

  • Horizontal knowledge is the knowledge of the presence of a tuple t (or a set thereof) in r, and is due to external knowledge by an observer. For instance, an observer may know that Alice suffers from Asthma (see Fig. 11b).

Let us now examine the exposure risk of indexed fragments under the assumptions of horizontal and vertical knowledge and of the presence of indexes belonging to the three categories discussed in Sect. 3.2 [28].

  • Direct index. Index function ι preserves the frequency distribution of plaintext values, which can be exploited to reconstruct the value-index association by an observer with vertical and/or horizontal knowledge. Vertical knowledge permits to precisely reconstruct the value-index association for values characterized by a unique number of occurrences (outliers). For instance, consider the indexed fragment in Fig. 10a and the vertical knowledge in Fig. 11a. It is immediate to see that ι(Asthma)  = α and ι(Diabetes)  = δ since these are the only plaintext and index values with 3 occurrences and 1 occurrence, respectively. Hence, an observer can infer that Alice, Bob, and Carol have Asthma and Hilary has Diabetes. Horizontal knowledge permits to precisely reconstruct the value-index association for the plaintext value v  = t[a] known by the observer, exposing all the tuples in r with value v for attribute a. For instance, in the example above, knowing that Alice suffers from Asthma permits an observer to infer that ι(Asthma)  = α and then that also Bob and Carol suffer from the same illness.

  • Bucket-based index. Index function ι does not preserve the frequency distribution of plaintext values. However, the index value corresponding to plaintext value v will have a frequency equal to or higher than (in case of collisions) the frequency of v. Values with a high number of occurrences (outliers) are then still exposed. Vertical knowledge permits to identify the index values associated with frequent plaintext values, and then to reconstruct the value-index association for such values with a known probability of error. For instance, consider the indexed fragment in Fig. 10b and the vertical knowledge in Fig. 11a. Clearly, ι(Asthma)  = ε since this is the only index value with at least 3 occurrences. Also, ι(Diabetes)  = ε since Diabetes is the only plaintext value with 1 occurrence. An observer can then infer that three patients among Alice, Bob, Carol, and Hilary has Asthma (each with probability 0.75) and one has Diabetes (each with probability 0.25). Horizontal knowledge permits to identify the index value representing the known plaintext value v  = t[a]. This index value may however correspond also to other plaintext values, limiting the observer’s ability to precisely reconstruct value-index associations. For instance, in the example above, knowing that Alice suffers from Asthma permits an observer to infer that ι(Asthma)  = ε. However, nothing can be said about Bob, Carol, and Hilary since ε could also represent other plaintext values (different from Asthma). By combining horizontal with vertical knowledge, however, she can infer that two among Bob, Carol, and Hilary suffer from Asthma (each with probability 0.66) and one suffers from Diabetes (each with probability 0.33).

  • Flattened index. Index function ι flattens the frequency distribution of index values. Vertical knowledge does not help in establishing correspondences between plaintext values and index values. Horizontal knowledge permits to identify one of the index values representing the known plaintext value v  = t[a], exposing only the tuples associated with this index value (in contrast to the possibly larger set of tuples with value v for a). For instance, consider the indexed fragment in Fig. 10c and the horizontal knowledge in Fig. 11b. An observer can only learn that ι(Asthma)  = κ. However, no other association is exposed, because κ has only one occurrence in F 1 (although Asthma has frequency 3 in F 3).

An index function ι that flattens the frequency distribution of index values and that generates collisions provides protection against both horizontal and vertical knowledge. In fact, as illustrated above, inference attacks caused by vertical knowledge can be counteracted by flattening the frequency distribution of index values. Inference attacks caused by horizontal knowledge are mitigated by index functions that map different plaintext values to the same index value, generating collisions. For instance, Fig. 12 illustrates fragment F 1 in Fig. 4 complemented with a flattened index with collisions over attribute Disease. This indexed fragment is protected against both vertical and horizontal knowledge in Fig. 11. Indeed, vertical knowledge cannot be exploited for frequency-based attacks (all the index values have two occurrences). Horizontal knowledge permits to infer that ι(Asthma)  = α but, since ι generates collisions, the observer cannot say anything about the disease from which Bob suffers. Although the proposal in [28] is focused on the adoption of one index, the discussion can easily be extended to the case where fragments are complemented with multiple indexes. In fact, flattening and collisions provide adequate protection in different scenarios (e.g., multiple indexes in one fragment, a same attribute indexed in different fragments, two attributes appearing one in plaintext and the other indexed in one fragment and reversed in another fragment).

Fig. 12
figure 12

Fragment F 1 in Fig. 4 complemented with a flattened index with collisions over attribute Disease

Although effective to protect data at rest, a flattened index function with collisions has the disadvantage of reducing the performance in query evaluation. In fact, flattening requires to retrieve different index values when searching for one plaintext value, and collisions require a post-processing at the client side to remove spurious tuples in the query result computed by the server. As an example, consider fragment F 1 in Fig. 12, condition Disease  = ‘Asthma’ translates into condition I d in {α,δ}. The evaluation of this condition would however return a tuple with value Diabetes for attribute Disease (i.e., tuple t 8), since Asthma and Diabetes are both mapped to value δ. Also, flattened indexes with collisions remain still vulnerable to dynamic observations (i.e., to adversaries who can observe users’ queries). In fact, by observing a long enough sequence of queries, an observer can easily infer the index values to which each plaintext value has been mapped, since they always appear together in query conditions. With reference to the example above, every query including condition Disease  = ‘Asthma’ is translated into a query including condition I d in {α,δ}. An observer can then easily infer that α and δ represent the same plaintext value (Asthma, in our example). The protection against dynamic observations represents an open issue that still needs to be addressed, along with the problem of defining an efficient index function that provides both flattening and collisions.

5 Indexes and Selective Encryption

Selective encryption approaches illustrated in Sect. 3.3 enforce access control restrictions over outsourced data by guaranteeing that each user can decrypt all and only the tuples she is authorized to access. However, when data are made selectively available, the combination of selective encryption with indexes used for enabling efficient query execution on encrypted data may open the door to inferences. In fact, users may have visibility of indexes even of tuples they are not allowed to access. Such visibility, together with their ability to view data for which they are authorized, can allow them to possibly infer plaintext values of tuples they should not be able to read. In the following, for clarity in the exposition but without loss of generality, we will refer the discussion to one attribute a only.

Fig. 13
figure 13

Knowledge of user A over relation Patients (b) and Patients k (c)

The knowledge that a user u can exploit for inferences can be summarized as follows: (i) index function ι used to define index I over attribute a (necessary to translate user’ queries into queries that operate at the server side); (ii) plaintext tuples that the user can access (i.e., t such that u ∈ acl(t)); (iii) all the encrypted tuples in r k. For instance, consider relation Patients in Fig. 3a and the authorization policy in Fig. 6 (which is also summarized in Fig. 13a for the reader’s convenience), Fig. 13b,c illustrate the knowledge of user A over the plaintext and encrypted relation. Gray cells denote values that A is not authorized to read.

The information that a user with this knowledge can infer depends on the kind of index adopted (see Sect. 3.2), as illustrated in the following [27].

Fig. 14
figure 14

Knowledge inferred by user A over relation Patients

  • Direct index. Index function ι is a bijective function that maps each plaintext value to one index value (and vice versa). It then exposes all the tuples with the same plaintext value for attribute a of a tuple that the user is authorized to access. For instance, index I y over attribute YoB in Fig. 13c has been computed using a direct index function. Since user A can access tuple t 1, she knows that ι(1980)  = α. She can then infer that t 2[YoB]  = 1980, even if she is not authorized to access tuple t 2. In a similar way, A can also infer that ι(1970)  = β and that ι(1960)  = γ (i.e., she knows the plaintext value of attribute YoB of each tuple in Patients). The user also knows index function ι. Hence, she can compute the index value ι(v) associated with each value v in the domain of attribute a, and possibly reconstruct the value that attribute a assumes in each tuple t of the outsourced relation, independently from her access privileges over t.

  • Bucket-based index. Index function ι is a surjective function that maps multiple plaintext values to one index value. The inference risks described for direct indexes are mitigated by collisions. In fact, multiple occurrences of a same index value may correspond to different plaintext values. The user’s knowledge of index function ι could however reduce the uncertainty over the value assumed by attribute a in a tuple t that she is not authorized to access. For instance, index I j over attribute Job in Fig. 13c has been computed using a bucket-based index function. Since user A can access tuple t 1, she knows that ι(Clerk)  = δ. However, she does not know with certainty whether t 3[Job]  = Clerk and t 8[Job]  = Clerk since function ι may generate collisions and map different plaintext values to index value δ.

  • Flattened index. Index function ι is an injective function that maps a plaintext value to multiple index values, guaranteeing a flat distribution of the number of occurrences of index values. Like direct indexes, flattened indexes expose all the tuples with the same plaintext value for attribute a of a tuple that the user is authorized to access. In fact, when decrypting a tuple t that she can access, the user knows one of the index values representing value v  = t[a]. By computing ι(v), she exactly knows which tuples in r k have value v for attribute a. For instance, index I d over attribute Disease in Fig. 13c has been computed using a flattened index function. Since user A can access tuple t 1, she knows that ι(Asthma)  = η and, since she can compute ι(v) for any v in the domain of attribute Disease, she can compute the set of index values representing Asthma, that is, {η,θ,ω}. She can then infer that t 2[Disease]  = t 3[Disease]  = Asthma.

Inferences by user A over relation Patients are summarized in Fig. 14, where light-gray cells represent values, reported in italic, that A is not authorized to access but that she can infer from her knowledge.

From the observations above, we note that inference is mainly caused by the presence of the same index value associated with tuples characterized by different authorizations. In [27] the authors proposed a solution, which is focused on direct indexes since they represent the worst case scenario, based on the principle that different occurrences of the same index value must be mapped to different index values when they should be visible to different subsets of users. The index value to which t[a] should be mapped therefore depends, not only on value v  = t[a], but also on acl(t). To this purpose, each user u has its own index function \(\iota _{{ {\mathit u}_{}}}\), which depends on a private piece of information that she shares with the data owner. Given a tuple t, the data owner computes a different index value \(\iota _{\mathit{u}_{}}(\mathit{t}_{}[\mathit{a}_{}])\) for each u ∈ acl(t). Each user will then use her index function \(\iota _{\mathit{u}_{}}\) to formulate queries to be evaluated by the external server over indexes. For instance, Fig. 15a illustrates relation Patients k, where the index over attribute YoB has been computed adopting a user-dependent function. In the figure, for simplicity, we indicate with a sub-script the user whose index function generated the value (i.e., v u is a value generated by \(\iota _{\mathit{u}_{}}\)). Note that \(v_{u_{i}}\)  ≠  \(v_{u_{j}}\).

Fig. 15
figure 15

An example of encrypted and indexed version of relation Patients with index I y over YoB computed using a user-dependent function (a) and a salted user-dependent function (b)

Since all the index values associated with a specific plaintext value of attribute a are visible to all the users in the system, the adoption of user-dependent index functions is not sufficient to block all the inferences. In fact, tuples sharing the same value for attribute a that are characterized by different but overlapping acls, called conflicting tuples, are exposed to inferences by users who can access at least one of these tuples. For instance, with reference to relation Patients k in Fig. 15a, user A cannot exploit her knowledge of tuple t 1 to infer the value of t 2[YoB]. However, by observing that β D appears in tuples t 4 together with β A , A can infer that β D represents value 1970 and hence that t 3[YoB]  = t 4[YoB]  = t 5[YoB]  = 1970. To block this inference channel, conflicting tuples must be associated with disjoint sets of index values. To impose diversity of indexes, the value computed by index function \(\iota _{\mathit{u}_{}}\) is differentiated by applying different randomly generated salts to conflicting tuples. For instance, Fig. 15a illustrates relation Patients k, where the index over attribute YoB has been computed adopting a salted user-dependent function. In the figure, we denote salted versions of value v as v′ and v″.

While effective, the solution illustrated above presents similar privacy risks to the one described in Sect. 4. More precisely, this indexing technique remains vulnerable to dynamic observations, since monitoring a sufficient number of queries would permit an observer to reconstruct which (salted) index values represent the same plaintext value. Furthermore, collusion between authorized users and the external server may put data confidentiality at risk. The protection against these threats still remains an open issue.

6 Conclusions

Cloud computing offers a variety of new opportunities to users and companies, and many efforts have been therefore dedicated to the design of cloud-based services, applications, and infrastructures. While appealing, cloud computing however introduces new security and privacy issues. In this chapter, we analyzed the data protection issues, and described approaches for the protection of data confidentiality, and for the efficient and selective access to data. We also illustrated open problems arising from the combined application of such solutions and highlighted possible directions to address them.