1 Introduction

The establishment of the cloud has brought tremendous benefits to users and enterprises. The idea behind the cloud is to provide ubiquitous, on-demand access to processing resources and data storage, so that computers and other devices can store and process their data at third-party data centers that may be located outside their premises. The allocated on-demand resources can be invoked and revoked with minimal administration effort. The shared resources aim to provide coherence and economies of scale, much like utilities delivered over networks (electricity, gas, water, etc.). Companies and enterprises can therefore avoid many infrastructure costs and focus more on their business and productivity.

A cloud provider (also called a cloud service provider or CSP) offers cloud computing components (see Fig. 2.1) on a “pay as you go” or “pay per use” basis. This may lead to high charges if the cloud pricing model is not well managed by the administrators.

Fig. 2.1

Cloud components, typically Infrastructure as a Service (IaaS), Software as a Service (SaaS), or Platform as a Service (PaaS) [1]

With the constant growth in demand for cloud computing, a cloud provider might not meet the legal needs of different organizations, which must weigh the benefits of using the cloud against its risks. For instance, control of the back-end infrastructure is limited to the CSP alone. Moreover, CSPs often decide on the usage and management policies, which limits the cloud users’ control over their deployments. Cloud users are also restricted by certain control and management policies on their applications, data, and services; for example, a certain amount of bandwidth is allocated to each customer and is often shared among other cloud users. Cloud computing involves constraints that make progress in cloud computing services challenging; these constraints are consolidated in Table 2.1.

Table 2.1 Cloud computing constraints and challenges

Organizations and users took a long time to come to rely on cloud computing after it came into existence. The reason behind this delay in adoption is security concerns, because IT security is challenging even under the best of circumstances. Typically, cloud environments are likely to have strong security measures deployed at their infrastructures. However, companies and organizations are more concerned about security at the CSP itself.

The CSP might not be able to meet the regulatory requirements of a company or organization. For instance, a law that allows a government to access data in secret is a demotivating factor for foreign companies considering storing their data inside such countries. Other countries may have even more rigorous government-access rules. Typically, in the cloud environment, data are processed or stored at data centers located far away from the organization’s city or country. Losing control of the data is therefore a security risk for most organizations worldwide, because someone else (i.e., the CSP) is controlling the data. The concern is amplified further with free CSPs, especially since CSPs can delete the outsourced data if they believe the data violate some service terms [2,3,4]. Even though the demand for cloud computing is increasing, concerns about users’ data privacy are also growing and remain formidable. Therefore, another set of issues concerning advances in privacy preservation for users’ identities and their data also exists and acts as a barrier in this regard, as shown in Table 2.2. Unfortunately, providing and preserving data privacy in the cloud has not been fully developed yet and still requires extra effort to achieve successful results. Addressing all these issues could assist in designing novel privacy-preserving search mechanisms over encrypted cloud data that are secure against intruders or attackers. Such designs could mark a success in the preservation of privacy in cloud computing.

Table 2.2 Privacy-preserving issues and challenges

In this paper, the issues related to preserving cloud data privacy are addressed. Various existing approaches to data encryption for cloud data privacy preservation are discussed. After studying the existing approaches, issues and challenges are pointed out. To the best of our knowledge, this is the first survey that shortlists the issues and challenges of preserving user and data privacy over the cloud along with possible solutions for future research.

2 Privacy-Preserving Methods

Various efforts have been made to address the preservation of data privacy over the cloud. This paper analyzes some of those efforts and provides a brief overview of the best-known approaches in the field. It classifies the privacy-preserving approaches in cloud computing into five broad categories, as illustrated in Fig. 2.2.

Fig. 2.2

Categories of privacy-preserving techniques in cloud computing

The following subsections examine the best-known cloud-based privacy-preserving methodologies and analyze their pros and cons in comparison with each other.

3 Searchable Encryption-Based Techniques

Generally, IT managers and even individuals are likely to be cautious about delegating control of their data to outside service providers, because information stored at a third party may have weaker privacy protections than information in the possession of its creator. Moreover, the outside provider has the right to change its underlying technology without the customers’ consent, which may cause issues related to performance and latency [4, 7]. Traditionally, data privacy is preserved using cryptographic primitives alongside unique and secure identities for queries and their responses, jointly with usage/access-rights policies. However, searching over encrypted data is a formidable task. Moreover, users normally lose control over their encrypted outsourced data as a tradeoff for the security and privacy preservation of those data. Considering the diverse types of data that can be stored in the cloud and the users’ demand for data safety, preserving data privacy in the cloud becomes even more challenging [8].

For instance, to look for certain data stored in encrypted form in the cloud, one may need to download all the encrypted data, then decrypt and search them locally. This is neither efficient nor convenient, especially with huge volumes of encrypted data or resource-constrained devices. Alternatively, the user may have to send his private key to the cloud server to perform the decryption and searching procedures on his behalf. However, sending the private key to the cloud server may cause serious issues with the integrity and secrecy of the data files [9,10,11,12,13]. Therefore, to ensure the privacy of the outsourced data, different searchable encryption-based systems have been proposed. These systems entail the data owner encrypting the data before outsourcing it to the cloud, with the ability to search and retrieve relevant data through keyword search or ranked keyword search techniques. Searchable encryption schemes can be divided into three categories: symmetric-key based techniques, fuzzy-searchable based techniques, and public-key based techniques, as portrayed in Fig. 2.3.

Fig. 2.3

Taxonomy of searchable encryption-based techniques

3.1 Symmetric-Key Based Techniques

The symmetric-key encryption system allows a data owner to outsource his data, encrypted with a symmetric encryption technique (e.g., a stream cipher), to untrusted locations over the cloud. The encrypted outsourced data remain searchable for relevant files by means of a trapdoor (i.e., a keyword) generated via the data owner’s private key. The generated trapdoor is transferred to the server, which searches for encrypted data matching the trapdoor. In this regard, Song et al. [14] introduced an encryption and searching technique over encrypted data with sequential scanning. The authors construct a special two-layered encryption technique that allows searching over ciphertexts without disclosing any sensitive information to the server. They propose to encrypt each word separately, assuming that each word has the same length, and then compute the bitwise exclusive or (XOR) with a special sequence of pseudorandom bits. To carry out the search, the data owner must create a private key \( k_i \) corresponding to the location of the searched word \( W_i \). The word is XORed with the ciphertext, \( C_i \oplus W_i \), to extract a corresponding structure of the form \( \langle s, F_{k_i}(s) \rangle \), where \( s \) is a pseudorandom sequence value generated using some stream cipher and \( F_{k_i}(s) \) is a pseudorandom function. In this technique, the complexity of encrypting the data and searching for a specific keyword over the encrypted data increases at most linearly with the size of the file collection and the data length. For instance, for a document of length \( n \) words, the encryption and searching algorithms require \( O(n) \) stream- and block-cipher operations. However, the proposed technique leaks important statistical information about the documents. To handle variable-length words, Goh [15] developed a semantically secure index model that prevents leaking any sensitive or statistical information about the outsourced documents against adaptive chosen-keyword attacks. The proposed model constructs an index for each document based on pseudorandom functions used as hash functions, and Bloom filters (BF) as per-document word indexes. A word is represented in an index by a per-document codeword, derived by applying the pseudorandom function once with the word as input and again with a unique document identifier; this non-standard use of the pseudorandom function prevents correlation attacks. To search over encrypted documents for a word \( y \), the user computes the trapdoor \( T_y \leftarrow \mathrm{Trapdoor}(K_{\mathrm{priv}}, y) \), where \( K_{\mathrm{priv}} \) is the master private key. The trapdoor \( T_y \) is then sent to the server, where the encrypted documents and the corresponding BF indexes \( I_{D_i} = (D_i, \mathrm{BF}) \) for each document identifier \( D_i \) reside. The server tests for a match through the function \( \mathrm{SearchIndex}(T_y, I_{D_i}) \). The BF is represented as an array of bits initially set to 0, together with a set of hash functions that mark each set element by setting some array positions to 1. To verify whether an element belongs to the BF array, the hash values for this element are computed to identify the corresponding array positions; if any of the bits at these positions is 0, the element is not in the set. This technique provides \( O(1) \) search time per document and can handle variable-length words. However, this scheme only supports exact-match queries.
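To make the Bloom-filter index of Goh [15] concrete, the following minimal Python sketch builds a per-document BF index and tests it with a trapdoor. HMAC-SHA256 stands in for the pseudorandom function, and the filter size, hash count, key, and identifiers are illustrative assumptions, not parameters from [15]:

```python
import hmac, hashlib

M = 1024          # Bloom filter size in bits (illustrative)
NUM_HASHES = 4    # number of array positions marked per codeword

def prf(key: bytes, data: bytes) -> bytes:
    return hmac.new(key, data, hashlib.sha256).digest()

def positions(codeword: bytes) -> list:
    # Derive NUM_HASHES array positions from the codeword.
    return [int.from_bytes(prf(codeword, bytes([i])), "big") % M
            for i in range(NUM_HASHES)]

def codeword(k_priv: bytes, word: str, doc_id: str) -> bytes:
    # Apply the PRF once with the word, then again with the document
    # identifier, so the same word maps to different codewords per document.
    return prf(prf(k_priv, word.encode()), doc_id.encode())

def build_index(k_priv: bytes, doc_id: str, words: set) -> list:
    bf = [0] * M
    for w in words:
        for p in positions(codeword(k_priv, w, doc_id)):
            bf[p] = 1
    return bf

def search_index(trapdoor: bytes, doc_id: str, bf: list) -> bool:
    # The server re-derives the per-document codeword from the trapdoor
    # and doc id, then tests the BF; a 0 bit means "not present".
    cw = prf(trapdoor, doc_id.encode())
    return all(bf[p] for p in positions(cw))

k = b"owner-master-key"
idx = build_index(k, "doc1", {"cloud", "privacy"})
t = prf(k, b"privacy")                   # trapdoor for the word "privacy"
print(search_index(t, "doc1", idx))      # True (up to BF false positives)
```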

Similarly, Chang and Mitzenmacher [16] built a dictionary-based keyword index for each document based on pseudorandom functions. The authors aim to mask a dictionary keyword index for each file using pseudorandom bits, to be kept at a remote server. The users can then retrieve certain files using a short seed that enables the server to unmask selective parts of the index. For each file, an index is created as a set of linked lists, where each linked list is associated with a list of keywords in the dictionary of the corresponding file. Initially, all values are set to 0; then, if document \( m_j \) contains the keyword \( w_i \), its index position \( I_j[P_s(i)] \) is set to 1. The user computes a secret value \( r_i \) using a mapping function \( F \), where \( r_i = F_r(i), i \in [2^d] \). For each document, a masking index string \( M_j \) is created through a document mapping function \( G \), such that \( M_j[i] = I_j[i] \oplus G_{r_i}(j) \). The documents are then encrypted using an encryption algorithm, and the encrypted documents are outsourced to the cloud along with the corresponding index mask strings \( M_j \). Two secret keys \( s \) and \( r \), along with the dictionary, are kept at the user’s device. Since the authors presume that the data owners use mobile devices with limited bandwidth and storage space, their solution incurs minimal overhead in terms of bandwidth and storage. The search time for this approach is \( O(n/p) \), where \( n \) is the size of the document collection and \( p \) is the number of cores. However, this scheme supports only exact single-keyword match queries.
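The masking step can be illustrated with a short sketch. The following Python fragment mirrors the relation \( M_j[i] = I_j[i] \oplus G_{r_i}(j) \), with HMAC-SHA256 standing in for the pseudorandom functions and the keyed permutation \( P_s \) left as the identity for readability; all names and keys are illustrative:

```python
import hmac, hashlib

def prf_bit(key: bytes, data: bytes) -> int:
    return hmac.new(key, data, hashlib.sha256).digest()[0] & 1

dictionary = ["cloud", "privacy", "encryption", "index"]
s, r = b"key-s", b"key-r"    # the two secret keys kept on the user's device

def perm(i: int) -> int:
    # P_s: keyed permutation of dictionary positions (identity here for
    # readability; a real instantiation would use a pseudorandom permutation).
    return i

def derive_r_i(i: int) -> bytes:
    return hmac.new(r, str(i).encode(), hashlib.sha256).digest()  # r_i = F_r(i)

def mask_index(doc_j: int, words: set) -> list:
    masked = []
    for i, w in enumerate(dictionary):
        bit = 1 if w in words else 0                               # I_j[P_s(i)]
        masked.append(bit ^ prf_bit(derive_r_i(i), str(doc_j).encode()))
    return masked

def unmask_position(masked: list, doc_j: int, word: str) -> int:
    # Revealing r_i for one keyword unmasks only that position of the index.
    i = dictionary.index(word)
    return masked[perm(i)] ^ prf_bit(derive_r_i(i), str(doc_j).encode())

m1 = mask_index(1, {"cloud", "index"})
print(unmask_position(m1, 1, "cloud"))    # 1: document 1 contains "cloud"
print(unmask_position(m1, 1, "privacy"))  # 0: it does not contain "privacy"
```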

To improve efficiency and security beyond the previous schemes and to support multi-user environments, Curtmola et al. [17] proposed searchable symmetric encryption schemes (SSE-1 and SSE-2) based on an index per document, combined with broadcast encryption. The user who owns the data can grant or revoke the privileges of authorized users to access and query the outsourced data. In this scheme, the proposed index has an array that holds a collection of linked lists of document identifiers containing a keyword, in encrypted form, and a look-up table to trace and decrypt the first element of each list in the array. The nodes of the linked list \( L_i \) are the document identifiers \( D(w_i) \) that contain the keyword \( w_i \). The array locations are the nodes of all \( L_i \), stored in scrambled order. The lookup table \( T \) entries, on the other hand, hold the index of keyword \( w_i \) in the array and the decryption key of the first element in \( L_i \). Both the array and the lookup table are encrypted and kept at the server along with the encrypted files. If a position in the array is known along with the first node’s encryption key, one can trace and decrypt the other nodes of \( L_i \), which correspond to the document identifiers \( D(w_i) \). In this scheme, the server’s work is constant per document matching the searched word, and the overall complexity of each query is proportional to the number of documents that contain the searched single word. The computation and storage complexity at the user side is \( O(1) \) and the search time for the server is optimal, but updating the index is inefficient. Similarly, Chase and Kamara [18] considered stronger security definitions to produce a scheme that is efficient, associative, and adaptively secure for structured data. The authors proposed an encryption model for structured data such as social networks, images, maps, and location information that can, at the same time, be privately queried. The focus of this scheme is to build a structured encryption algorithm that is searchable using a specific query token if the secret key is known. The structured-data encryption algorithm operates over labeled data that have a label \( L \) and a sequence of data items \( m \) (i.e., connecting a set of keywords to each data item). For each keyword \( w \), an array is initiated to hold a pointer \( j \) from the pseudorandom permutation \( G_K(L(w)) \) and the semi-private item \( v_i \). In this scheme, the dictionary is implemented with hash tables, which yields an optimal search time \( O(|I|) \). However, the encrypted index can be very large. Similarly, van Liesdonk et al. [19] proposed a scheme to deal with adaptive security based on one index per keyword, supporting efficient search and updates of the documents stored at a CSP server. Their scheme converts each distinct keyword into a searchable representation of the form \( S_w = (f_{k_f}(w), m(I_w), R(w)) \) that can be tracked by the trapdoor \( T_w = (f_{k_f}(w), R'(w)) \), with the ability to efficiently update the searchable representation whenever needed. Here \( f_{k_f}(w) \) is a pseudorandom function that identifies \( S_w \), \( m(I_w) \) is a masking function for the collection of document IDs that contain the keyword \( w \), and \( R(w) \) and \( R'(w) \) are the associated unmasking functions.
If \( f_{k_f}(w) \) is found, the server sends back the encrypted data items with the matched IDs in \( I_w \) to the client. Even though this scheme uses only simple primitives such as pseudorandom functions, it still requires two rounds of communication to generate and update the index and to search for documents. Finally, the scheme may produce a very large encrypted index. Kurosawa and Ohtaki [20] proposed a scheme that is slightly stronger than that of Curtmola et al. [17]. They proposed a verifiable searchable symmetric encryption scheme that is universally composable (i.e., the protocol’s security is preserved even if arbitrarily composed with other instances of the same or other protocols) [21] and reliable against active adversaries or malicious servers. They address the issue of an active adversary who might forge the encrypted files so that files are retrieved incorrectly. The scheme is expressed as a client/server protocol with two phases: (1) the store phase, executed once by the client to compute \( (I, C) \leftarrow \mathrm{Enc}(\mathrm{Gen}(1^k), D, W) \), where \( I \) is an encrypted index of the keywords \( W \), \( C \) is the encrypted document set \( D \), and \( \mathrm{Gen}(1^k) \) generates the secret key; and (2) the search phase, executed many times by the server to compute \( (C(w), \mathrm{Tag}) \leftarrow \mathrm{Search}(I, C, \mathrm{Trpdr}(K, w)) \), where \( C(w) \) is a ciphertext of \( D \), \( t(w) \leftarrow \mathrm{Trpdr}(K, w) \) is a trapdoor generated by the client in response to a query for keyword \( w \), and \( \mathrm{Tag} \) is \( \mathrm{MAC}(K, m) \), a tag-generation algorithm for a message \( m \) under the key \( K \). When the client receives \( (\tilde{C}(w), \mathrm{Tag}) \) from the server, the client verifies the validity of the received tag via \( \mathrm{Accept}/\mathrm{Reject} \leftarrow \mathrm{Verify}(K, \mathrm{Trpdr}(K, w), \tilde{C}(w), \mathrm{Tag}) \) and decrypts the files only if the verification function returns Accept. The proposed scheme consists of six polynomial-time algorithms and requires linear search time, but supports only single-keyword search.
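The verifiability idea of Kurosawa and Ohtaki [20], namely MACing each (trapdoor, result) pair at store time so the client can detect a forged or truncated answer, can be sketched as follows. HMAC-SHA256 stands in for both the PRF and the MAC, and the handling of empty results via dummy entries is omitted; this is a simplified illustration, not the six-algorithm scheme itself:

```python
import hmac, hashlib

K = b"client-secret-key"   # illustrative secret key held by the client

def trapdoor(word: str) -> bytes:
    return hmac.new(K, b"trapdoor|" + word.encode(), hashlib.sha256).digest()

def mac(t: bytes, result: bytes) -> bytes:
    return hmac.new(K, t + b"|" + result, hashlib.sha256).digest()

# Store phase (client): index maps trapdoor -> (encrypted result, tag).
# The real scheme also tags dummy entries so empty results are verifiable.
docs = {"cloud": b"enc(doc1,doc3)", "privacy": b"enc(doc2)"}
index = {trapdoor(w): (c, mac(trapdoor(w), c)) for w, c in docs.items()}

# Search phase (server): return the stored pair for the received trapdoor.
def search(index: dict, t: bytes):
    return index.get(t)

# Verify phase (client): recompute the MAC and accept or reject the result.
t = trapdoor("cloud")
c_w, received_tag = search(index, t)
ok = hmac.compare_digest(received_tag, mac(t, c_w))
print("Accept" if ok else "Reject")   # Accept
```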

None of the previous schemes is explicitly dynamic, with the ability to add, delete, and update files efficiently. Therefore, Kamara and Papamanthou [22] proposed to extend the inverted-index approach of Curtmola et al. [17] and construct a new sublinear-time scheme that is secure against adaptive chosen-keyword attacks. The proposed scheme has reduced index sizes with the ability to add and delete files efficiently. To this end, they added three extra encrypted data structures, namely a search array, a search table (i.e., dictionary), and a deletion array that the server can use to monitor search-array positions in case of an update. They used a homomorphic encryption scheme to encrypt the nodes’ pointers: to modify a pointer without ever having to decrypt the node, they used a private-key encryption scheme that consists of XORing the message with two pseudorandom functions. Finally, they added a free list that the server can use to determine free locations for adding new files. The proposed dynamic index-based scheme is a tuple of nine polynomial-time algorithms. The client generates a secret key \( K \leftarrow \mathrm{Gen}(1^k) \), used to encrypt the files \( D \) and produce an encrypted index \( I \) and a sequence of ciphertexts \( C \): \( (I, C) \leftarrow \mathrm{Enc}(K, D) \). To search for a keyword, the client builds a search token \( \tau_s \leftarrow \mathrm{SrchToken}(K, w) \). The client can also request to add or delete a file \( f \) by generating add tokens \( (\tau_a, C_f) \leftarrow \mathrm{AddToken}(K, f) \) or delete tokens \( \tau_d \leftarrow \mathrm{DelToken}(K, f) \). The client issues a search request \( I_w \leftarrow \mathrm{Search}(I, C, \tau_s) \) with the encrypted index \( I \), the sequence of ciphertexts \( C \), and a search token \( \tau_s \) to retrieve a sequence of file identifiers \( I_w \subset C \). In this scheme, the server’s search time (using a hash table) is optimal, but the approach is very complex and difficult to implement.

Moreover, the search procedure cannot be parallelized on the server because the T-set is represented as a linked list. As a result, Kamara and Papamanthou [23] improved the efficiency further by proposing a new dynamic and highly parallelizable sublinear searchable symmetric encryption scheme for multi-core architectures. In this scheme, they used a new tree-based multi-map data structure which they call a keyword red-black (KRB) tree. The KRB tree is a dynamic data structure that is similar to an inverted index but can answer multi-map queries efficiently; it allows both keyword-based and file-based search operations and is useful for handling updates efficiently. The parallel search is executed as in binary trees: the first processor searches for a specific keyword at the root of the tree; the tree is then divided into two subtrees, with the first processor continuing on one subtree while another processor is assigned to the other (a sketch follows below). The set of keywords is kept in a keyword hash table as (key, value) tuples, where the key is of exponential size and the value is an encryption of a Boolean value. This approach yields very efficient schemes whose parallel search time is below the optimal sequential search time, and it allows efficient updates, but it is designed only for single-keyword Boolean search, i.e., it only reports whether or not the keyword exists. A complete comparison of all the schemes can be found in Tables 2.3 and 2.4.
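A minimal sketch of the pruned, parallelizable tree search follows. Plaintext keyword sets replace the encrypted per-node bits of the real KRB tree, and a thread pool merely illustrates that disjoint subtrees can be handled by different cores; all names are illustrative:

```python
from dataclasses import dataclass
from concurrent.futures import ThreadPoolExecutor

@dataclass
class Node:
    words: set              # keywords present somewhere in this subtree
    file_id: str = None     # set at leaves only
    left: "Node" = None
    right: "Node" = None

def build(files: list) -> Node:
    # files: list of (file_id, keyword set); build a balanced binary tree
    # whose internal nodes store the union of their children's keyword sets
    # (an encrypted bit per keyword in the real scheme).
    if len(files) == 1:
        fid, words = files[0]
        return Node(words=set(words), file_id=fid)
    mid = len(files) // 2
    l, r = build(files[:mid]), build(files[mid:])
    return Node(words=l.words | r.words, left=l, right=r)

def search(node: Node, w: str) -> list:
    if w not in node.words:          # bit = 0: prune the whole subtree
        return []
    if node.file_id is not None:
        return [node.file_id]
    # The two recursive calls touch disjoint subtrees, so they can run on
    # different cores; a thread pool is used here purely as illustration.
    with ThreadPoolExecutor(max_workers=2) as ex:
        lf = ex.submit(search, node.left, w)
        rf = ex.submit(search, node.right, w)
        return lf.result() + rf.result()

root = build([("f1", {"cloud"}), ("f2", {"privacy"}),
              ("f3", {"cloud", "index"}), ("f4", {"search"})])
print(search(root, "cloud"))   # ['f1', 'f3']
```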

Table 2.3 Comparison of several symmetric-key encryption schemes
Table 2.4 Comparison of several symmetric-key encryption schemes

Although these searchable symmetric encryption techniques allow a user to search securely over encrypted data through keywords, their main disadvantage is that they support only exact keyword searches. Consequently, system efficiency is reduced because the search complexity grows with the number of distinct keywords in the document collection. Another approach to solving such problems is fuzzy-searchable encryption.

3.2 Fuzzy-Searchable Encryption

Fuzzy keyword search returns files whose keywords either match the user’s search input exactly or are the closest possible matches based on keyword similarity semantics; it can therefore tolerate minor typos and formatting inconsistencies [24]. In this regard, Adjedj et al. [25] described a way to preserve privacy in a biometric identification system using a fuzzy search scheme. They used symmetric searchable encryption (SSE), which allows a client to encrypt the data in such a way that they can still be searched, achieving reasonable computational costs for each identification request. In this scheme, they combined SSE with locality-sensitive hashing (LSH). The main property of LSH is that it outputs the same result for near points and different results for distant points, using a matching algorithm that computes a similarity score between two points. With the SSE architecture, the secret keys are stored on the client side and not on the database side (i.e., the server stores the encrypted data without secret keys). This ensures the privacy of the stored data, but it is unsuitable for many applications, for example when data are frequently updated or streamed.
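The bucketing behavior of LSH as used in [25] (and later in [26]) can be sketched as follows, with MinHash over salted SHA-1 as an assumed stand-in for the LSH family and keywords reduced to bigram sets; in the real schemes the bucket contents are encrypted:

```python
import hashlib

def bigrams(word: str) -> set:
    return {word[i:i + 2] for i in range(len(word) - 1)}

def minhash(grams: set, salt: int) -> int:
    h = lambda g: int.from_bytes(
        hashlib.sha1(f"{salt}|{g}".encode()).digest()[:8], "big")
    return min(h(g) for g in grams)

def lsh_keys(word: str, bands: int = 4) -> list:
    # One MinHash per band: two strings share a bucket if ANY band key
    # matches, which happens with probability 1 - (1 - J)^bands for
    # Jaccard similarity J of their bigram sets.
    g = bigrams(word)
    return [(b, minhash(g, b)) for b in range(bands)]

# Server-side buckets map LSH keys to (encrypted, in the real scheme) items.
buckets: dict = {}
for w in ["privacy", "privcy", "encryption"]:      # "privcy" has a typo
    for k in lsh_keys(w):
        buckets.setdefault(k, []).append(w)

# A query for the misspelled word retrieves near matches from its buckets.
candidates = {w for k in lsh_keys("privcy") for w in buckets.get(k, [])}
print(candidates)   # likely {'privacy', 'privcy'}; "encryption" stays apart
```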

In an attempt to tolerate minor typos and formatting inconsistencies, Li et al. [24] observed that a spell-checker mechanism does not address the problem (i.e., mistyped words or two valid words typed interchangeably) because of the extra communication cost with the users to identify the correct words. They therefore proposed the first solution for effective fuzzy keyword search over encrypted cloud data. They construct a wildcard-based fuzzy set \( S_{w_i,d} = \{S'_{w_i,0}, S'_{w_i,1}, \dots, S'_{w_i,d}\} \) with edit distance \( d \) for each keyword \( w_i \in W \) before building the index, where \( S'_{w_i,\tau} \) denotes the set of words \( w'_i \) with \( \tau \) wildcards representing the edit operations on \( w_i \). This technique deals with minor typing errors in query keywords by using the edit distance to quantify keyword similarity; for example, with \( d = 1 \), all words \( w'_i \) satisfying the similarity criterion \( \mathrm{ed}(w_i, w'_i) \le d \) are listed. The index \( \{(\{T_{w'_i}\}_{w'_i \in S_{w_i,d}}, \mathrm{Enc}(sk, \mathrm{FID}_{w_i} \| w_i))\}_{w_i \in W} \), with the set of encrypted file IDs \( \mathrm{FID}_{w_i} \) that contain the keyword \( w_i \), is built, and a trapdoor set \( \{T_{w'}\} \) is computed for each word \( w' \in S_{w_i,d} \). The index and the encrypted files are then outsourced to the cloud server for storage, and the secret key \( sk \) is shared between the data owner and authorized users. To search for a keyword \( w \) with a private key \( k \), the authorized user computes the trapdoor set \( \{T_{w'}\}_{w' \in S_{w,k}} \) and sends it to the server. The server then compares the request with the index table and returns all possible encrypted file identifiers \( \{\mathrm{Enc}(sk, \mathrm{FID}_{w_i} \| w_i)\} \). The size of the fuzzy set \( S_{w_i,d} \) for a keyword of length \( l \) and edit distance \( d \) is \( O(l^d) \). This scheme is secure and privacy preserving, but it is only applicable to strings under edit distance, and the fuzzy sets may become too big with longer words, which necessitates issuing large trapdoor sets (a sketch of the wildcard construction follows below). Therefore, Kuzu et al. [26] described an efficient similarity search over encrypted data based on locality-sensitive hashing (LSH), a nearest-neighbor technique, for index creation, and on Bloom filters (BF) for the encoding of strings, to provide a more generic solution that works across distinct similarity-search contexts. Due to the property of LSH, similar features are put into one bucket with high probability, while dissimilar features are kept in different buckets. The scheme represents the query string as a set of n-grams and embeds it into the BF: each n-gram is subject to a hash function, and the corresponding bit locations are set to 1.
They use a publicly available typo-generator, which produces a variety of spelling errors, to check whether keywords contain typographical errors, and they measure the Jaccard distance between the encodings of the original and perturbed versions to determine distance thresholds for their fuzzy search scheme. In this scheme, one communication round is needed for a limited number of data items with a large set of features, and two rounds are needed if the number of data items is huge; however, the approach introduces a certain false-positive rate in the search results.
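Returning to the wildcard-based fuzzy set of Li et al. [24], the following sketch enumerates the variants for edit distance \( d = 1 \); the construction follows the description above, and the example words are our own:

```python
def fuzzy_set_d1(word: str) -> set:
    """Wildcard-based fuzzy keyword set for edit distance d = 1."""
    s = {word}                                  # zero edit operations
    for i in range(len(word)):
        s.add(word[:i] + "*" + word[i + 1:])    # edit at existing position i
    for i in range(len(word) + 1):
        s.add(word[:i] + "*" + word[i:])        # edit between positions
    return s

print(fuzzy_set_d1("cat"))
# 8 variants: cat, *at, c*t, ca*, *cat, c*at, ca*t, cat*
# A query with one typo shares a wildcard variant with the stored set,
# so their trapdoor sets intersect:
print(fuzzy_set_d1("cat") & fuzzy_set_d1("cot"))   # {'c*t'}
```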

However, a semi-honest-but-curious cloud server might save computation or download bandwidth by executing only a fraction of the search operation honestly and, likewise, returning only a fraction of the search results. Therefore, a verifiable scheme is needed to ensure that the user can verify the correctness and completeness of the search results. In this regard, Wang et al. [27] proposed a new efficient and verifiable fuzzy keyword search (VFKS) scheme over encrypted data in cloud computing that returns the closest possible results based on similarity semantics. They use a wildcard-based fuzzy keyword set and a BF to enable fuzzy keyword search over encrypted data while maintaining keyword privacy and the verifiability of the search result. Their approach consists of the algorithms (Keygen, Buildindex, Trapdoor, Search, Verify). The Keygen algorithm \( (sk, sk') \leftarrow \mathrm{Keygen}(1^k) \) is executed by the data owner with a security parameter \( k \) to produce the secret key \( sk \), used to generate the index, and the document encryption key \( sk' \), used to decrypt the documents. The Buildindex algorithm \( G_W \leftarrow \mathrm{Buildindex}(sk, W) \) is executed by the data owner to create the index \( G_W \), i.e., a symbol-based tree, using the secret key \( sk \) and the distinct keyword set \( W \) of the document collection \( D \).

The symbol-based index tree \( G_W \) and the encrypted documents are outsourced to the cloud server. The user can generate a trapdoor set \( \{T_{\omega'}\}_{\omega' \in S_{\omega,d}} \leftarrow \mathrm{Trapdoor}(sk, S_{\omega,d}) \) for the wildcard-based fuzzy keyword set \( S_{\omega,d} = \{S'_{\omega,0}, S'_{\omega,1}, \dots, S'_{\omega,d}\} \) of the keyword \( \omega \) with edit distance \( \mathrm{ed}(\omega, \omega') \le d \). Upon receiving the user’s trapdoor set \( \{T_{\omega'}\} \), the server executes the search algorithm \( (\mathrm{flag}, \mathrm{ID}_{\omega}, \mathrm{proof}) \leftarrow \mathrm{Search}(G_W, \{T_{\omega'}\}) \) to search for documents with keyword \( \omega \), returning the document identifiers \( \mathrm{ID}_{\omega} \), a flag set to true if matching documents exist and false otherwise, and a proof. The user executes \( (\mathrm{true}/\mathrm{false}) \leftarrow \mathrm{Verify}(T_{\omega}, (\mathrm{flag}, \mathrm{ID}_{\omega}, \mathrm{proof})) \) to check whether the server was honest about the search result, outputting true if the server searched honestly and false otherwise. They utilized the well-known multi-way tree to store the fuzzy keyword set over a predefined symbol set, which might grow in size if the keyword length is large (a sketch follows below). This scheme is secure and privacy preserving while supporting efficient verifiability of the search result. However, it focuses on keyword search and does not consider phrase search. Moreover, the index generation is handled by the data owner, which means that the owner might abandon the exact-keyword index constructed before and generate a specialized fuzzy-keyword index for fuzzy search, hence wasting much more computation and storage (Table 2.5).
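The symbol-based index tree can be sketched as a plain multi-way trie over the fuzzy keyword set, as follows. In VFKS the edges and leaves are derived from the secret key and carry proofs; both are omitted here for readability, and all names are illustrative:

```python
def trie_insert(root: dict, fuzzy_word: str, file_ids: list) -> None:
    # Walk/create one child per symbol of the predefined symbol set.
    node = root
    for sym in fuzzy_word:
        node = node.setdefault(sym, {})
    node["$"] = file_ids          # leaf marker holds the matching file ids

def trie_search(root: dict, fuzzy_word: str):
    node = root
    for sym in fuzzy_word:
        if sym not in node:
            return None           # plus a proof of absence in the real scheme
        node = node[sym]
    return node.get("$")

index: dict = {}
for fw in ["c*t", "ca*", "cat"]:   # part of the fuzzy set of "cat"
    trie_insert(index, fw, ["doc7"])
print(trie_search(index, "c*t"))   # ['doc7']
```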

Table 2.5 Comparison of several Fuzzy-searchable encryption schemes

All these previous techniques are based on symmetric-key encryption, in which the same key is used to encrypt and decrypt the data. To enable an authorized user to access the encrypted data, the data owner must share this key; once shared, however, the key may also reach unauthorized users, who can then access the encrypted data.

3.3 Public-Key Encryption

Searchable symmetric-key encryption schemes are suitable for users who own their data and wish to upload it to a third-party, untrusted server (i.e., a cloud server). On the other hand, there are cases where the outsourced data (medical data, stock quotes, emails, etc.) are public and uploaded by different owners unknown to the user, while the user wishes to retrieve certain files without revealing to the server which files he wants. Public-key encryption with keyword search is the solution for such cases. Public-key encryption uses two different keys, a private and a public key. In this context, the private key is given by the data owner to the users, and the public key is given to the server, as illustrated in Fig. 2.4.

Fig. 2.4

Public-key encryption architecture

The first searchable encryption scheme using a public-key system was proposed in [29]. This scheme can be extended to handle range, subset, and conjunctive queries, and it hides the attributes of messages that match a query. The authors use identity-based encryption (IBE), in which the keyword acts as the identity. The proposed public-key encryption with keyword search consists of four polynomial-time randomized algorithms (KeyGen, PEKS, Trapdoor, Test). The data owner generates his public/private key pair using \( (A_{\mathrm{pub}}, A_{\mathrm{priv}}) \leftarrow \mathrm{KeyGen}(s) \) over a security parameter \( s \). To search for a keyword \( W' \), the user generates a trapdoor \( T_{W'} \leftarrow \mathrm{Trapdoor}(A_{\mathrm{priv}}, W') \) using the private key \( A_{\mathrm{priv}} \). Given a searchable encryption \( S = \mathrm{PEKS}(A_{\mathrm{pub}}, W) \) and the received trapdoor \( T_{W'} \), the server determines via \( \mathrm{Test}(A_{\mathrm{pub}}, S, T_{W'}) \) whether a document contains the queried keyword, outputting yes if \( W = W' \) and no otherwise. The scheme has two constructions for public-key searchable encryption: (1) an efficient construction based on the bilinear Diffie–Hellman (BDH) assumption, building a non-interactive searchable encryption scheme from a bilinear map; the authors prove this construction semantically secure against a chosen-keyword attack in the random-oracle model, based on the difficulty of the BDH problem; and (2) a limited construction using any trapdoor permutation, which is less efficient because it assumes that the total number of keywords the user wishes to search for is bounded by some polynomial function in the security parameter. The size of the public file can be reduced by allowing the user to re-use individual public keys for different keywords. In this scheme the searching time is linear, and public-key solutions are usually computationally expensive. Furthermore, keyword privacy cannot be protected in the public-key setting, since the server could encrypt any keyword with the public key and then use the received trapdoor to evaluate the ciphertext. Finally, the proposed constructions are applicable to searching over a small number of keywords rather than an entire file.

Bellare et al. [30] proposed a deterministic searchable public-key encryption scheme. The main idea in this technique is to associate a tag with a plaintext, which can be computed both by the client from the plaintext, as \( F(pk, x_1) \), and by the server from a ciphertext that encrypts it, as \( G(pk, c) \). This tag (i.e., the output of the polynomial-time algorithms \( F \) and \( G \)) can then be used to create a tree-based index for searching. Since searchable tags are deterministic, the server can organize them in sorted order and match them in logarithmic time. The proposed scheme consists of three polynomial-time algorithms \( \mathrm{AE} = (K, E, D) \) and is a \( t \)-efficiently searchable encryption, with \( t(\cdot) < 1 \), where for every \( x_1 \in \mathrm{PtSp}(k) \) (the plaintext space) the probability that \( F(pk, x_1) = G(pk, c) \) equals 1 over \( (pk, sk) \leftarrow K(1^k) \) and \( c \leftarrow E(pk, x_1) \). The technique combines any public-key encryption scheme with a deterministic hash function, and the resulting scheme is secure, but the problem of finding standard-model schemes is left open. The drawback of this approach is that it provides privacy only for plaintexts drawn from a space of large min-entropy.
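The tag idea can be sketched with any deterministic public-key encryption; the toy below uses textbook RSA with tiny primes (insecure, purely illustrative) so that the client-side tag \( F(pk, x_1) \) and the server-side tag \( G(pk, c) \) coincide and a sorted tag list supports logarithmic-time lookup:

```python
import hashlib
from bisect import bisect_left

n, e = 3233, 17                        # toy RSA public key (p=61, q=53)

def det_encrypt(x: int) -> int:
    return pow(x, e, n)                # deterministic: same x -> same c

def tag_from_plaintext(x: int) -> bytes:     # F(pk, x), run by the client
    return hashlib.sha256(str(det_encrypt(x)).encode()).digest()

def tag_from_ciphertext(c: int) -> bytes:    # G(pk, c), run by the server
    return hashlib.sha256(str(c).encode()).digest()

# Server stores the sorted tags of the ciphertexts it holds.
ciphertexts = [det_encrypt(x) for x in (7, 42, 1001)]
tags = sorted(tag_from_ciphertext(c) for c in ciphertexts)

# Client queries for plaintext 42 via binary search over the sorted tags.
q = tag_from_plaintext(42)
i = bisect_left(tags, q)
print(i < len(tags) and tags[i] == q)  # True
```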

Range queries over multiple attributes in the public-key setting were studied in [31]. The authors proposed an encryption scheme called Multi-dimensional Range Query over Encrypted Data (MRQED) that allows a network gateway to encrypt summaries of network flows before submitting them to the cloud, and they demonstrated the scheme on network audit logs. An authority can release a key to an auditor to decrypt flows within certain ranges only. The scheme operates over a tuple of flow features \( (t, a, p) \) representing the flow timestamp range \( t \in [t_1, t_2] \), the flow source address range \( a \in [a_1, a_2] \), and the destination port range \( p \in [p_1, p_2] \). A typical range query has the form \( (t \ge t_1) \wedge (a = a_1) \wedge (p_1 \le p \le p_2) \), where all flows \( (t, a, p) \) within the defined range can be decrypted with the provided decryption key, without revealing the other flow attribute values and without issuing a huge number of keys. The scheme consists of four polynomial-time algorithms (Setup, Encrypt, DeriveKey, QueryDecrypt). The setup algorithm \( (PK, SK) \leftarrow \mathrm{Setup}(k, L_\Delta) \), over a security parameter \( k \) and a lattice \( L_\Delta \) (a tuple is represented as a point in \( L_\Delta \)), produces a public key \( PK \) and a private key \( SK \). The gateway encrypts the pair \( (\mathrm{Msg}, X) \), consisting of an arbitrary string representing the entire flow summary and a point \( X \) in the multi-dimensional space representing the attributes, using the public key \( PK \) to produce the ciphertext \( C \leftarrow \mathrm{Encrypt}(PK, X, \mathrm{Msg}) \). The authority derives a decryption key \( DK \leftarrow \mathrm{DeriveKey}(PK, SK, B) \) for a hyper-rectangle \( B \) in \( L_\Delta \) (i.e., to test whether a point \( X \) falls inside it) using the key pair \( (PK, SK) \). Finally, an auditor can decrypt relevant flows, \( (\mathrm{plaintext}/\mathrm{null}) \leftarrow \mathrm{QueryDecrypt}(PK, DK, C) \), using the provided keys \( (PK, DK) \) over the retrieved ciphertext \( C \). However, in this scheme each queried region is represented as a hyper-rectangle \( B \) in \( L_\Delta \), requiring one derived key per region; a huge number of regions would require a huge pool of keys.
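The reason a single derived key can cover a whole range is the canonical (dyadic) decomposition of the attribute domain, which MRQED applies per dimension. The following sketch shows the one-dimensional cover; the interval endpoints and domain size are illustrative assumptions:

```python
def dyadic_cover(a: int, b: int, lo: int, hi: int) -> list:
    """Canonical subtrees of the domain [lo, hi) that cover the range [a, b)."""
    if b <= lo or hi <= a:
        return []                      # subtree disjoint from the range
    if a <= lo and hi <= b:
        return [(lo, hi)]              # subtree entirely inside the range
    mid = (lo + hi) // 2
    return dyadic_cover(a, b, lo, mid) + dyadic_cover(a, b, mid, hi)

# Cover timestamps [3, 11) in a domain of size 16 with O(log 16) nodes:
print(dyadic_cover(3, 11, 0, 16))
# [(3, 4), (4, 8), (8, 10), (10, 11)] -> keys for 4 nodes instead of 8 values
```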

Liu et al. [32] proposed an Efficient Privacy-Preserving Keyword Search Scheme (EPPKS) in cloud computing, which reduces a client’s computational overhead by allowing the cloud service provider to participate partially in the decipherment process while protecting the privacy of the data and the queries. The scheme does not require private-key transmission, making it suitable for the cloud environment. It consists of the following seven randomized polynomial-time algorithms: EPPKS = (Keygen, EMBEnc, KWEnc, TCompute, KWTest, PDecrypt, Recovery). The user and the service provider each execute the Keygen function to produce a public/private key pair: the user \( U \) executes \( (U_{\mathrm{pub}}, U_{\mathrm{priv}}) \leftarrow \mathrm{Keygen}(k_1) \) over a sufficiently large security parameter \( k_1 \), and the service provider \( S \) executes \( (S_{\mathrm{pub}}, S_{\mathrm{priv}}) \leftarrow \mathrm{Keygen}(k_2) \) over a sufficiently large security parameter \( k_2 \). The user encrypts a message \( m \) using his public key and the service provider’s public key to produce the ciphertext \( C_m \leftarrow \mathrm{EMBEnc}(U_{\mathrm{pub}}, S_{\mathrm{pub}}, m) \). The keywords are also encrypted before outsourcing the data to the service provider, using the user’s public key: \( C_{w_i} \leftarrow \mathrm{KWEnc}(U_{\mathrm{pub}}, W_i) \). To retrieve a file with keyword \( W_j \), the user executes \( T_{W_j} \leftarrow \mathrm{TCompute}(U_{\mathrm{priv}}, W_j) \) and sends the result to the CSP. The CSP executes \( (W_i \overset{?}{=} W_j) \leftarrow \mathrm{KWTest}(U_{\mathrm{pub}}, C_{W_i}, T_{W_j}) \) to determine whether a given file contains the keyword \( W_j \). Before returning a matching file to the user, the CSP computes an intermediate result \( C_\rho \leftarrow \mathrm{PDecrypt}(S_{\mathrm{priv}}, U_{\mathrm{pub}}, C_m) \). Upon receiving the files, the user executes \( m \leftarrow \mathrm{Recovery}(U_{\mathrm{priv}}, C_m, C_\rho) \). This scheme supports multiple-keyword search on the encrypted data and is semantically secure, because the service provider can search the encrypted files efficiently without leaking any information; however, a significant burden remains when the user requires the service provider to perform this computational service.
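The partial-decipherment idea, in which the provider applies its own secret to produce the intermediate value \( C_\rho \) and the user finishes decryption, can be illustrated with an additively split ElGamal key, as below. This is a stand-in of our own choosing, not the actual EPPKS algorithms, and the toy parameters are insecure:

```python
p, g = 2_147_483_647, 5            # toy group: Z_p* with generator g
x_user, x_csp = 123_456, 654_321   # additive shares of the secret key
y = pow(g, x_user + x_csp, p)      # combined public key

def encrypt(m: int, k: int) -> tuple:
    return pow(g, k, p), (m * pow(y, k, p)) % p     # (c1, c2)

def csp_partial_decrypt(c1: int) -> int:            # run by the provider
    return pow(c1, x_csp, p)                        # C_rho = c1^{x_csp}

def user_recover(c1: int, c2: int, c_rho: int) -> int:
    s = (c_rho * pow(c1, x_user, p)) % p            # c1^{x_user + x_csp} = y^k
    return (c2 * pow(s, -1, p)) % p                 # m = c2 / y^k

c1, c2 = encrypt(31_337, k=99_991)
print(user_recover(c1, c2, csp_partial_decrypt(c1)))   # 31337
```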

All these schemes achieve good security and privacy, but they demand heavy computation and memory from end-devices during the encryption and decryption processes. Moreover, while these schemes provide searchable encryption, they do not fit well with less powerful client devices, which have only limited bandwidth, CPU, and memory, as discussed in [33]. Table 2.6 consolidates the advantages and shortcomings of the various public-key-based privacy-preserving approaches.

Table 2.6 Comparison of several public-key encryption schemes

Among the different available solutions that aim to design operations compatible with data encryption while preserving the privacy of data outsourced to the cloud, Searchable Encryption (SE) schemes allow a potentially curious party to carry out searches on encrypted cloud data without having to decrypt them, hence maintaining their privacy. Table 2.7 summarizes the advantages and disadvantages of the common searchable encryption schemes in cloud computing.

Table 2.7 Comparison of searchable encryption schemes in cloud computing

4 Conclusion and Future Work

While data encryption seems to be the right countermeasure to prevent privacy violations, classical encryption mechanisms fall short of meeting the privacy requirements in the cloud setting. Typical cloud storage systems also provide basic operations on stored data, such as statistical data analysis, logging, and searching, and these operations would not be feasible if the data were encrypted using classical encryption algorithms. Among the various solutions aiming to design operations compatible with data encryption, Searchable Encryption (SE) schemes allow a potentially curious party to perform searches on encrypted data without having to decrypt them; SE therefore seems a suitable approach to solving the data privacy problem in the cloud setting. A further challenge is raised by SE in the multi-user setting, whereby each user may have access to a set of encrypted data segments stored by a number of different users. Multi-user searchable encryption schemes allow a user to search through several data segments based on search rights granted by the owners of those segments. Privacy requirements in this setting are manifold: not only the confidentiality of the data segments but also the privacy of the queries should be ensured against intruders and a potentially malicious CSP. Recently, a few research efforts have come up with multi-user keyword search schemes meeting these privacy requirements, either through some key sharing among users or based on a Trusted Third Party (TTP).

These studies provide limited keyword-search functionality for cloud storage services; thus, service providers must implement a complete secure search scheme to promote their services. This study proposes a scheme for performing ranked multi-keyword searches with fault tolerance in cloud storage systems. The proposed scheme uses similar-keyword sets to perform a similarity search and a secure k-nearest-neighbor (kNN) scheme to perform a ranked multi-keyword search. Moreover, the proposed scheme is fault tolerant: when a cloud user inputs an incorrect keyword, a file search is still performed. When the files are located, they are assigned an associated correlation value.