A privacy-preserving multi-keyword search approach in cloud computing

Manasrah, Ahmed M.; Abu Nasir, Mahmoud; Salem, Maher

doi:10.1007/s00500-019-04033-z

A privacy-preserving multi-keyword search approach in cloud computing

Focus
Published: 08 May 2019

Volume 24, pages 5609–5631, (2020)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Soft Computing Aims and scope Submit manuscript

A privacy-preserving multi-keyword search approach in cloud computing

Download PDF

912 Accesses
4 Citations
Explore all metrics

Abstract

Cloud computing provides the users with the ability to outsource their data to a third-party cloud storage for cost-effective management of resources and on-demand network access. However, outsourcing the data to a third-party location may raise concerns about data privacy. To maintain the user’s privacy, users tend to encrypt their sensitive data before outsourcing it. Encrypting the data will preserve its privacy, but at the same time, it makes the searching process for a specific keyword a time-consuming and challenging process, mainly if the encryption key is not provided. On the other hand, the data owner should be able to perform multiple keyword searches to retrieve specific documents that are relevant to the search query. This paper proposes a new privacy-preserving multi-keyword search approach for the cloud outsourced data. The objective of the proposed approach is to allow the data owners and the authorized users to retrieve the most relevant data with minimum computation and communication overhead, and reduced false positives (irrelevant documents) and searching time. To evaluate the proposed approach, the NSF research dataset is used. Results demonstrate that the proposed method achieves better searching time and overall performance of the cloud environment regarding computation and communication overhead as well as false positives in comparison with other approaches.

Fuzzy and Semantic Search over Encrypted Data in the Cloud

Privacy Preserving on Searchable Encrypted Data in Cloud

Privacy-Preserving Multi-keyword Search over Outsourced Data for Resource-Constrained Devices

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Cloud computing became the platform of choice for users looking for storage infrastructure with minimal cost and administration efforts. In such an environment, users (i.e., data owners) can easily increase their subscribed storage space without the need to get extra local storage devices (Plageras et al. 2018). Cloud environment, therefore, provides a cost-effective resource utilization with on-demand network access anytime anywhere. Despite the different advantageous benefits of the cloud, users still have serious concerns regarding their outsourced data security in the cloud (Stergiou et al. 2018). Such concerns include exposing outsourced data to malicious attacks, the data integrity of the outsourced data as well as the privacy of the data itself (Li et al. 2017; Ranjan et al. 2017).

Privacy-preserving concerns are one of the main issues that hinder cloud computing adoption by users and organizations because outsourcing the data to the cloud service provider (CSP) deprives the data owners (DO) of directly controlling their data and applications. Therefore, CSPs provide their customers with various infrastructure capabilities with a high degree of transparency into their operations to ensure the preservation of their data privacy and its security. However, CSPs usually provide different mechanisms to suite the customers’ needs such as auditing and history-based access control (Nedjah et al. 2017). Likewise, some mechanisms are also offered by the CSPs to control and optimize the use of the resources through resource allocation and load balancing mechanisms (Aljammal et al. 2017; Manasrah 2017). Nevertheless, balancing between these mechanisms and the preservation of the cloud data privacy is still a significant challenge where physical perimeters are virtual (Takabi 2014). However, researchers are also convinced that data’s sensitivity and privacy can be preserved by using strong encryption mechanisms before outsourcing the data to the cloud. While data encryption offers protection against malicious attacks and illegal accesses, it increases the computation and the communication overhead at the DOs side, especially with users having large data.

Consequently, increasing the overhead and the computation time will also increase the cost of the service being used (Manasrah et al. 2016). On the other hand, the DO will lose the ability to search his encrypted data for specific keywords to retrieve the most related documents to the search query. Encrypting the data before outsourcing them to the cloud will raise a new key management concern, as this key has to be managed by the DO or the CSP. Nevertheless, if the key management is delegated to the CSP, the problem will become even more complicated.

This paper contributes a privacy-preserving multi-keyword search approach for the cloud outsourced data. The computation of the encryption and decryption of the data before outsourcing is reduced through adopting a probabilistic public key encryption. Moreover, the proposed approach also reduces the searching time and increases its accuracy while maintaining the privacy of the retrieved data through a document’s topic clustering summary using a summarization technique. The multi-keyword search capability is another feature to increase the search accuracy and retrieve the most relevant files to the user that fulfills the search criteria. To the best of our knowledge, this is the first approach that attempts to address the problem of privacy preserving using a clustering technique based on document topics and summary.

The paper is organized as follows. Section 2 describes the related work. Section 3 gives the preliminaries and the main notations. Section 4 presents the privacy-preserving multi-keyword search approach. In Sect. 5, we analyze the security of the schemes. Section 6 carries on the experiment, compared with the existing schemes regarding the search results accuracy and efficiency. Section 7 concludes the work and presents future directions.

2 Related work

The research in the area of privacy preserving over cloud environment is mainly classified into encryption-based privacy-preserving approaches in which researchers try to achieve the security of the encrypted files using various encryption-based techniques. Other groups of researches focus on indexing and searching algorithms because encrypting the data before outsourcing is the only acceptable way to secure the files while residing outside the data owner’s premises. However, the question of how to search these encrypted files at the cloud without decrypting them with minimum computations is challenging. The DO first encrypts his data before outsourcing it to CSP and later applies the keyword search or a ranked keyword search to retrieve them. These searchable encryption schemes can be grouped as symmetric key encryption (Chang and Mitzenmacher 2005; Chase and Kamara 2010; Curtmola et al. 2006), fuzzy-searchable encryption (Adjedj et al. 2009; Wang et al. 2012) and public key encryption (Bellare et al. 2007; Boneh et al. 2004; Manasrah and Al-Din 2016). However, these techniques require a considerable number of comparisons to retrieve the results. The required number of comparisons increases the time and computation complexity and hence the overall cost (Pasupuleti et al. 2016).

Clustering similar documents together based on standard features will speed up the searching process over encrypted documents set (Krishna and Handa 2016). In this regard, Zhu et al. (2015) proposed a hierarchical clustering algorithm using fuzzy logic and swarm intelligence (HCSF) to solve the ciphertext search complexity issues. The generated clusters in this schema are created using swarm intelligence to provide the proper choices for the initial cluster centers. The fuzzy logic is used for addressing ambiguous documents during the index establishment. The authors also define the search process as the following sequence: (1) index formation during which the clusters and the sub-cluster are generated using the k-means algorithm to provide proper initial cluster center values. However, to improve the search efficiency, the hierarchy algorithm is used with the clustering process. (2) Encryption of document and clusters center values. (3) Trapdoor generation allows users to generate a query vector. (4) And finally, the search process in which the cloud server receives the trapdoor and chooses the cluster center and the sub-cluster center and the documents based on the similarity with the search query. This schema is efficient because the cloud server does not need to scan all the documents. However, it does not support dynamic updates (i.e., insertion, modification and deletion of data). Similarly, Handa and Challa (2015) proposed a cluster-based searching schema. This schema reduces the number of comparisons and the time required to perform the searching and achieves the security requirements. This search schema includes the following steps: (1) cluster generation to generate multiple clusters depending on the similarity between the documents; (2) document index generation to generate the index for each document; (3) cluster index generation to generate the clustered index using the bitwise product of the indices of all the documents within the cluster; (4) document encryption encrypts all documents; (5) query index generation to calculate the HMAC for each search term in the search query; (6) document searching in which the CSP will select the appropriate cluster and select the document in this cluster; (7) document decryption in which the DO will send the secret key to the end user; then the user decrypts the document. This schema is efficient, but it doesn’t support the dynamic updates.

To overcome and solve the dynamic updates issues, Krishna and Handa (2016) proposed a searching schema based on dynamic clustering. This schema enables dynamic updates by generating the index dynamically. This schema is efficient and reduces the cost of the index generation. Their work is an extension to work presented in Handa and Challa (2015). This search schema includes the following steps: (1) cluster generation to generate the clusters based on the similarity between keywords; (2) cluster index generation to generate the clustered index by calculating the hash-based message authentication code (HMAC) for each keyword in the cluster; (3) document index generation which is similar to the cluster index generation; (4) document encryption to encrypt the documents; (5) query index generation in which the authorized users (AU) calculate the HMAC value for all search query items and generate search index using the bitwise product for each term index and finally send this query index to the CSP; (6) document searching, when the CSP receives the searching query, compares it with the clustered index to select the matching cluster and then compares it with document index to select the matching documents and retrieve it; (7) document decryption, when the user retrieves the documents, he will request the credentials from DO to decrypt the documents. However, the proposed schema never evaluated over big data sets. Therefore, Chen et al. (2016) proposed a schema to search over encrypted data using hierarchical clustering for a big data environment. In this schema, the documents are grouped into sub-clusters (i.e., like a tree and sub-trees), and each document is represented as a point in the vector space to reflect the relevance of the corresponding documents to the normalized length of vectors. The proposed schema delivers good efficiency in terms of searching time. Nevertheless, it needs to store the high-dimensional vectors at the cloud server to represent the clusters and the documents, which will require a substantial unnecessary storage volume in big data environments. Similarly, Xiangyang et al. (2017) presented an efficient multi-keyword text search over encrypted cloud data (MUSE) based on hierarchical clustering. In this schema, the authors introduce a novel index structure, a new tree data structure and a depth-first search algorithm to improve the efficiency of the text searching. This schema may leak the privacy of keywords because the CSP may find some high-frequency keyword usage and check whether a particular keyword exists in a document by comparing the searching requests with the searching results. Therefore, they further enhanced the MUSE by adding some phantom terms into document vectors and query vectors, and then the privacy will be protected. The performance of EMUSE schema is less than the basic MUSE because of adding some phantom terms, but they keep the privacy preserved and does not leak certain frequency on the used keywords neither the text itself.

Most of the proposed works in the field of clustering-based privacy preserving are focusing on clustering the data at the CSP. For instance, Yin et al. (2018) proposed a privacy-preserving k-means clustering algorithm for a multi-dimensional and encrypted data using the scalar-product-preserving encryption primitive at the cloud server. The primary goal of their proposed algorithm is to perform data mining over the encrypted cloud data while preserving the data privacy on behalf of users. Similarly, Jiang et al. (2018) proposed a cloud server-based privacy-preserving collaborative k-means clustering framework on mining encrypted cloud data for useful knowledge while keeping the privacy of both data and the mined results enacted. Despite the number of methods in privacy preserving, these methods are quite complex in terms of computation and memory usage, thus leading to limited usage of these methods especially in the case of paid cloud services. Table 1 provides a comparison and summary for some of the cluster-based privacy-preserving schemas.

Table 1 Summary for some of the known cluster-based privacy-preserving schemas

Full size table

3 Preliminaries and notations

In this paper, the following variables and notations are used: D is the documents collection denoted as a set of n data documents D = {d₁, d₂, …, d_n}, where d_i ∈ D, ∀i ∈ {1, 2, …, n}, W is all distinct words collection denoted as a set of m words W = {w₁, w₂, …, w_m}, where w_j ∈ W, ∀j ∈ {1, 2, …, m}, C is the total cluster collection denoted as a set of k-clusters C = {C₁, C₂, …, C_k}, where C_t ∈ C, ∀t ∈ {1, 2, …, k}, and V_t is the cluster center vector of C_t, id(d_i) is the vector space model which represents the document identifier (id) that can help to identify the actual document d_i uniquely. I_t is the index tree, u is a node in the index I_t, and θ_t is a multinomial topic distribution. The multinomial distribution provides a formula for expanding an expression such as (x₁ + x₂ + ··· + x_t)ⁿ. Word_Index is the index for each word in all documents. Cluster_Index is the index for each cluster in C, and x_i is a pseudorandom binary which is a binary sequence that is generated with a deterministic algorithm. The distance measure between two vectors p and q, i.e., $ E(p,q) $, is the Euclidean distance.

Vector space model based on term frequency-inverse document frequency (TF-IDF) algorithm is very popular in information retrieval (IR) as well as insecure multi-keyword search (Ramos 2003; Salton et al. 1975). By using the vector space model, each document in D will be represented as an n-dimensional vector. All elements in id(d_i) will be normalized in the final score to increase the retrieval performance.

Moreover, the following documents will be used for illustrative purposes:

Document-1 (d₁) has the following sentence {“cloud computing means accessing resources over the Internet instead of your computer’s hard drive”}
and Document-2 (d₂) has the following sentence {“Cloud computing is a model for enabling ubiquitous access to shared pools of configurable resources”}.

4 The privacy-preserving multi-keyword search approach

This section presents the new privacy-preserving schema that is based on topic summary and clustering to reduce the number of comparisons required for the searching and retrieving of the outsourced encrypted data. The proposed schema is based on public key encryption along with a new ranked based multi-keyword searching approach in for cloud environment. In this approach, a model of cloud environment that consists of three entities as illustrated in Fig. 1 is considered.

The DO who will handle most of the computation before outsourcing the data to the cloud, such as creating Cluster_Index, documents index and encrypting the user’s documents and their index. The CSP is an entity that provides different cloud services to the DO’s and AU’s such as searching over the outsourced data on their behalf. The CSP offers storage resources as a pay-per-use model. The way and the time in which a resource is occupied is used to determine the cost of the service (Manasrah and Gupta 2017). The DO shares typically some information with the AU that can be used for searching documents using a set of keywords. The documents usually are encrypted; hence, the AU is allowed to use the shared information to perform the searching process over the encrypted data before decrypting it to its original form. The interactions between the three entities are as follows:

1.
The DO outsources a set of documents to the CSP in an encrypted format and still poses the ability to retrieve them back. To achieve this, the DO handles the document topics-based clustering process as well as extracting the document summary to be used in the index creation. The DO creates a Cluster_Index for each cluster by generating a summary for each cluster and sends the Cluster_Index to be used in a trapdoor generation based on the similarity with the search query. The DO also creates an index for each document and encrypts all user’s documents and their indexes to be kept at the CSP.
2.
The AU retrieves some documents from the CSP by communicating with the DO to request for specific information for the trapdoor generation. The similarity between the search query and the Cluster_Index identifies the matched cluster. Finally, the AU sends the search query to the CSP as a trapdoor. The CSP uses the trapdoor to apply the ranked keyword search for the documents in the matched cluster and returns a set of encrypted and relevant documents to the AU.
3.
Finally, the AU browses through the returned documents to verify their integrity before decrypting them using the private key.

The privacy-preserving multi-keyword search (PPMKS) approach uses a ranked keyword search over encrypted documents. To ensure efficiency, the CSP should return the top-k most relevant documents based on a relevance score rank to enhance the retrieval accuracy as well as minimizing the communication cost. To reduce the time and the number of comparisons that are required for the searching process, the proposed approach clusters the documents based on the document topic and summary. To ensure the preservation of the privacy, the CSP should learn neither the index nor the original plaintext. The index should not contain any information that might be used to guess the original keywords and break the encrypted data. The proposed approach consists mainly of two phases: (1) setup phase and the (2) retrieval phase.

4.1 Setup phase

Upon the DO selection of the documents to be outsourced to the CSP, the DO generates a pair of public and private keys for the encryption and decryption processes. During this phase, a document index will be created from multiple keywords that are extracted from the document collections for each cluster. The relevance score is then calculated for each keyword to be kept in the document index. The index and the document collections are encrypted using the private key to ensure its privacy. Finally, the DO outsources the encrypted documents and the encrypted index to the CSP. The setup phase consists of five steps: (1) key generation, (2) data preprocessing, (3) clustering and summarization, (4) word index creation and (5) privacy preserving, as illustrated in Fig. 2.

4.1.1 Key generation

The key generation phase generates one public (PK) and one private (PR) keys using the key generation algorithm proposed in Pasupuleti et al. (2016). The DO chooses two large primes numbers (p, q) randomly of the same size to compute N = pq. Then it calculates r and s using the extended Euclidean algorithm (Katz et al. 2008) where $ pr + qs = 1 $. Thus, the public key (PK) can now be defined as PK = {N}, and the private key (PR) is PR = {p, q, r, s}. However, the documents may contain punctuations, numbers, stop-words and various suffixes. Processing such documents will increase the complexity and the computation in handling the different types of documents. Therefore, a preprocessing that aims to remove the unnecessary punctuations, numbers, stop-words, the different suffixes and creating the document words matrix to be used with the other setup steps is needed.

4.1.2 Data preprocessing

The data preprocessing phase aims to prepare or transform the raw format of the documents into an understandable format for outsourcing, using four steps; extraction, stop-words removal, spell checking and stemming.

A.
Extraction

In this step, a list of individual words is extracted based on the (1) punctuation marks and (2) the white spaces. This list of words is then used to generate an index after the stop-words removal, spell checking and stemming. To further minimize the needed processing steps, the commonly used words such as over, the, instead, of, your should be removed to focus on the important words instead.

B.
Stop-words removal

To speed up the searching process, the most common words in the English language that does not add any useful meaning to the text, such as a, an, the, with, etc., should be removed from the document. This means reducing the number of the keywords in the index where the stop-words may account for 20–30% of the total word count in a text document (Witten et al. 2016). In this work, a stop list that contains 571 words from Salton and Buckley (1988) is adopted.

C.
Spell checking

The spellchecker Hunspell is used to recognize misspelled words in the documents and replaces them with the correctly spelled ones. Hunspell is a popular spellchecker tool that is widely used in popular software packages including Google Chrome, Mac operating system, Opera and InDesign (Xu et al. 2015). Hunspell has a dictionary of correct words which will be compared with all words in W.

D.
Stemming

The words in W should, therefore, be stemmed to their roots. For example, the words “material, materially, materialize, materialization and materiality” can be stemmed to the word “material” (Ramasubramanian and Ramya 2013). By removing the various English suffixes, the efficiency of the searching process could be enhanced by minimizing the number of words for the matching stems. The stemmer of Porter et al. (2002) which is a set of rules to group words with similar roots will be adapted to remove the suffixes.

For example, after applying the adopted Stemming algorithm over W, the list of words W in ascending order will be:

$$ W = \left\{ {access,\;cloud,\;comput,\;drive,\;hard,\;internet,\;mean,\;resourc} \right\}. $$

After preprocessing all the documents, the identifier (id) for each document must be built to represent the vector space model for each document, using the following equation:

$$ id(d_{i} ) = \left\{ {w_{1,i} , w_{2,i} , \ldots w_{t,i} } \right\} $$

(1)

where $ w_{j,i} $ is how many times the jth word in W appears in $ d_{i} $. t is the number of words in W.

For instance, if d₁ content is “Cloud computing means accessing resources over the Internet instead of your computer’s hard drive”, and d₂content is “Cloud computing is a model for enabling ubiquitous access to shared pools of configurable resources”, then the result of the preprocessing step will be as follows:

d₁ = {cloud comput mean access resourc internet comput hard drive}.
d₂ = {cloud comput model enabl ubiquitou access share pool configur resourc}.
W = {access, cloud, comput, configur, drive, enabl, hard, internet, mean, model, pool, resourc, share, ubiquitou}.
id(d₁) = {1,1,2,0,1,0,1,1,1,0,0,1,0,0}.
id(d₂) = {1,1,1,1,0,1,0,0,0,1,1,1,1,1}.

However, if the DO has a huge number of documents, then the searching process may require a longer time. Therefore, to speed up the searching time, this paper proposes to group the documents with similar topics or concepts (i.e., semantic clustering) together as the following section demonstrates.

4.1.3 Semantic clustering

Clustering is used to group the documents into different clusters before creating the index. Classifying the documents should reduce the number of documents for each cluster to increase the efficiency of the searching process. In this regard, the proposed approach adopts the clustering technique that is proposed by Nagwani (2015) due to its superior performance, semantic similarity and main topics extraction using latent Dirichlet allocation (LDA) for summarizing the huge documents collection (Blei et al. 2003). To perform the clustering, the DO collects all (n) documents into one data set D = {d₁, d₂ … d_n}. The DO then apply the K-means clustering algorithm on D, to generate k-clusters, denoted by C = {C₁, C₂ … C_k} where t = 1, 2 … K and C_t is a set of similar documents with the same features based on documents similarities. That means all documents within the same cluster should be as similar as possible C_t = {∪ (D_i ∈ C_t)}, and the documents in different clusters should be as dissimilar as possible from another document in all other clusters.

Definition 1

Assume that C_t = {id(d₁), id(d₂), …, id(d_i)} is a cluster contains i documents where id(d_i) ∈ C_t is the corresponding identifier vector of the document i, then if the cluster center of C_t is denoted as V_t, then we have:

$$ V_{t} [j] = \frac{{\mathop \sum \nolimits_{l = 1}^{i} V_{l} [j]}}{i} $$

(2)

where j: 1, 2, …, m is the words index in W, and i is the number of documents in C_t. The steps for the K-means clustering are illustrated in Algorithm 1 as follows:

Unfortunately, the k-means algorithm distance computations require (nk) calculations in each iteration, where n is the number of all documents and k is the number of clusters. Therefore, to address the issue of the extensive computations, especially with huge datasets, the parallel K-means clustering (PKMeans) based on MapReduce is also used (Zhao et al. 2009). The Map function is shown in Algorithm 2.

The Combine function will perform the procedure of combining the intermediate data of the same Map function. The Combine algorithm is illustrated in Algorithm 3.

The Reduce function, on the other hand, will perform the procedure of updating the new centers. The Reduce algorithm is illustrated in Algorithm 4.

After the clustering phase is done, the clusters topics have to be extracted to represent the main information in the documents collection to organize and summarize the collection of documents in the same cluster.

4.1.4 Topic modeling

Topic modeling is the process of finding the main words (i.e., keywords) from a collection of documents. Many algorithms were used to create topic models such as the LDA modeling which was proposed in Blei et al. (2003) to extract the main words from a collection of documents. The LDA modeling is adopted in this work to generate topics and main terms for each cluster in C = {C₁, C₂ … C_k}. These terms will be used for the summary extraction of each cluster. The LDA operates as follows:

1.
Select a multinomial θ_t as the topic distribution for each topic t from a Dirichlet distribution with parameter β as the parameter of the Dirichlet prior on the per-topic word distribution.
2.
For each document d, a multinomial document distribution θ_d is selected from a Dirichlet distribution with parameter α as the parameter of the Dirichlet prior on the per-document topic distributions.
3.
For each word w_i in document d_i, a topic t from θ_d is selected.
4.
A word w_i from θ_t is selected to represent the topic for the document.

The probability of generating a corpus to be used is given by the following equation (Blei et al. 2003):

$$ {\iint } \mathop \prod \limits_{t = 1}^{k} P\left( {\theta_{p } \left| \beta \right. } \right)\mathop \prod \limits_{b = 1}^{N} P\left( {\theta_{p } \left| \alpha \right. } \right) \left( {\mathop \prod \limits_{t = 1}^{Nb} \mathop \sum \limits_{b = 1}^{K} P\left( {t_{i } \left| \theta \right. } \right)P\left( {w_{i } \left| {t,} \right.\emptyset } \right)} \right){\text{d}}\theta {\text{d}}\emptyset $$

(3)

where K is the number of topics, $ \theta_{p } $ is the topic distribution for document p, i.e., multi-dimension vector, β is the parameter of the Dirichlet prior on the per-topic word distribution, N is the total number of words in all documents, α is the parameter of the Dirichlet prior on the per-document topic distributions, $ t_{i } $ is the topic for the ith word in a document, θ is the topic distribution for a document, w_i is the ith word in a document, and $ \emptyset $ is the word distribution for a topic.

LDA is used to generate topics and to extract the main terms for each cluster. The extracted sentences from each document represent the summary for each cluster which will be used to generate the index for each cluster. However, to deal with the possible huge amount of document within each cluster, the MapReduce is used with the LDA creation to reduce the expensive computations across clusters and to ensure that all values associated with the same key are brought together in the reducer as illustrated in Algorithm 5.

The Reduce function, on the other hand, updates the parameters that are associated with each topic. It requires an aggregation over all intermediate $ \emptyset $ vectors. The Reduce algorithm is illustrated in Algorithm 6.

The result of the LDA process is the list of the main words for each cluster which should represent the main information in all documents within the same cluster.

Definition 2

Assumes that W_t= {w₁, w₂, …, w_j} is a distinct word in cluster C_t where w_j ∈ W, if w_j is a representative word of C_t, then w_j will be added to the Main_Words list of C_t which has the following structure:

$$ Main\_Words\;(C_{t} ) = \{ w_{j} ,w_{j} \in W\;and\;w_{j} \;is\;main\;word\} $$

(4)

Then the Main_Words (C_t) are used to create the Cluster_Index for the cluster (C_t). The process of creating the Cluster_Index is to find all sentences in the cluster (C_t) which contains any word in Main_Words (C_t).

Definition 3

Assumes that S_t= {S₁, S₂, …, S_i} is all sentences in C_t, if S_i contains any w_j ∈ Main_Words of C_t, then S_i will be added to the Cluster_Index for C_t in the following form:

$$ Cluster\_Index\;(C_{t} ) = \{ S_{q} ,S_{q} \in d_{i} \;and\;S_{q} \in w_{j} \} $$

(5)

where S_q is a sentence in d_i and w_j is a representative word of C_t.

The ranked keyword search proposed in Cao et al. (2014) is adopted to enable the CSP to return the most topmost relevant documents to reduce the network traffic and to support multiple keywords search.

4.1.5 Word index creation

After generating the clusters, the DO creates an index for each cluster using LDA and creates another index for the documents collection within each cluster. The clustered index acts as the summary for each cluster. As discussed in Sect. 2.4 various indexing techniques are existed in the literature. In this paper, the indexing technique proposed by Pasupuleti et al. (2016) is adopted and enhanced with different features such as the document length and the keyword location and frequency. The proposed indexing technique is executed using a ranking function to evaluate the relevance score. The Word_Index consists of three scores: position score, length score and the relevance score. The details of the word index creation steps are illustrated in Algorithm 7 and works as follows:

The DO scans cluster C_t and extracts its distinct words W_t = {w₁, w₂, …, w_m}. For each document d_i in cluster C_t over each Word w_j, the following scores are computed:

1.
Position Score (PS) The location of the words indicates there importance, as the most important word appears first (i.e., the words in the title should have higher weights than those in the abstract and the text), the location score (Sarkar 2012) of word w_j is calculated using Eq. (6).
$$ PS_{i} = \frac{1}{\sqrt k } $$
(6)
where PS_i is the location score of the word w_j in a document d_i. k is the location of the word from the top of the document.
2.
Length Score (LS) Long documents usually contain more important information. The length score is calculated using Eq. (7).
$$ LS_{i} = \frac{{\left| {Ws_{i} } \right|}}{{\mathop \sum \nolimits_{i = 1}^{N} \left| {Ws_{i} } \right|}} $$
(7)
where LS_i is the length score of the document. Ws_i is the set of words in the document. N is the number of documents in a cluster.
3.
The Relevance Score (RS) the information retrieval (IR) community uses the term frequency (TF) × inverse document frequency (IDF) to compute the RS. TF is simply the number of times a given word appears within a document. IDF is obtained for each cluster by dividing the number of documents in this cluster by the number of documents containing the word within the same cluster. The relevance score (Wang et al. 2012) is calculated using Eq. (8).
$$ RS_{wj} = \frac{1}{{\left| {f_{d} } \right|}}.\left( {1 + {\text{In}}\,f_{d.t } } \right) $$
(8)
where RS_wi is the relevance score for the word w_j in the document d_i. |F_d| is the length of the document, and f_d,t denotes the word frequency in document d_i.
Finally, to generate a Word_Index for w_i, all previous scores are calculated as follows:
$$ I(w_{j} ) = PS_{i} + LS_{i} + \, RS_{wi} $$
(9)
where I(w_i) is an index for the word (w_i) in the document.
The result of the word index creation algorithm is an index for each word w_j in all documents (D). The structure of the Word_Index is illustrated in Fig. 3.
Fig. 3
Word_Index structure
Full size image
The data usually have to be encrypted before outsourcing to the cloud to ensure its privacy. Therefore, the Word_Index and all documents must be encrypted before outsourcing to the CSP, and the Cluster_Index must be encrypted before sharing with AUs.

4.1.6 Privacy preserving

The DO encrypts both the document index and the document collection to preserve the privacy using the probabilistic public key encryption technique that is based on the Rabin cryptosystem (Gupta et al. 2016) and proposed in Pasupuleti et al. (2016). This technique should support not only keyword search over the encrypted data but also offer high-security characteristics. The procedure of the encryption process is illustrated in Algorithm 8.

1.
Let the documents D= {d₁, …, d_n} with length n, where each m_i is a binary string of length h and index I(w_i).
2.
Select the random seed r and generate
$$ x = r^{2} \;\bmod \;N $$
(10)
where r is the random seed and N is the public key.
3.
Generate the pseudorandom binary bit $ x_{i} $
$$ x_{i} = x_{i - 1}^{2} \;\bmod \;N $$
(11)
$$ p_{i} = x\;\bmod \;2 $$
(12)
where $ p_{i} $ is the least significant bits (LSB) of $ x_{i} $. LSB is the rightmost bit position in a binary that determines whether the number is even or odd.
4.
The pseudorandom bit sequence p_i XORed with the binary representation of Word_Index to get $ I^{\prime } (W_{i} ) $ and the plaintext of document d_i to get the ciphertext $ c_{i} $;
$$ I^{\prime } (W_{i} ) = p_{i} \oplus I(W_{i} ) $$
(13)
and
$$ c_{i} = p_{i} \oplus d_{i} $$
(14)
5.
Finally, generate the next random
$$ x_{n + 1} = x_{n}^{2} \;\bmod \;N $$
(15)

The DO sends the encrypted documents collection and the encrypted Word_Index D′= {d₁′, d₂′, …, d_n′, I′(w_i)} to the CSP. The resulting bit sequence x_n+1 and the Cluster_Index will be kept at the DO who will share it with the AUs to retrieve the needed documents with certain keywords search.

4.2 Retrieval phase

In this phase, if the DO or the AU wants to retrieve certain documents with certain keywords, they should start by generating a trapdoor for the set of keywords using the bit sequence x_n+1 and the Cluster_Index. The CSP searches for the matched documents in the same cluster based on their corresponding relevance scores using the trapdoor. If the keywords match with the Word_Index, the CSP ranks the matched documents based on the relevance score and sends the top-k most relevant documents back to the DO or the AU in a ranked descending ordered. Then DO or AU decrypts the document using their private key. This phase consists of three processes: (1) trapdoor generation (2) ranked keyword search and (3) data decryption.

4.2.1 Trapdoor generation

If the DO or the AU wants to retrieve the documents with certain keywords, they should compute the trapdoor for keywords w_i ∈ W using the bit sequence x_n+1 and the Cluster_Index to be sent to the CSP as a search request as Algorithm 9 illustrates.

1.
The AU will get the trapdoor information that contains the bit sequence x_n+1 and the Cluster_Index from the DO. The bit sequence x_n+1 will be used to encrypt documents, where the Cluster_Index will be used to determine to which cluster the search query belongs.
2.
Using the cosine similarity proposed by Salton and Buckley (1988), the AU computes the similarity between the search query and Cluster_Index to determine which cluster contains this search query (to determine an index-for-the-specific-cluster) as in Eq. (16).
$$ Sim_{j,k} = \frac{{\mathop \sum \nolimits_{i = 1}^{t} w_{i,j} *w_{i,k} }}{{\sqrt {\mathop \sum \nolimits_{i = 1}^{t} w_{i,j}^{2} } *\sqrt {\mathop \sum \nolimits_{i = 1}^{t} w_{i,k}^{2} } }} $$
(16)
where $ Sim_{j,k} $ is the cosine similarity value between the two sentences j and k. W_i,j is the weight of the term i in the sentence j. t is the number of terms in the sentence.
3.
The weight W of the term j in a search query i (i represents a sentence in a single cluster) can be computed using Eq. (17):
$$ W_{i,j} = tf_{i,j} *{ \log }\frac{n}{{df_{j} }} $$
(17)
where tf_i,j is the number of times that the term j appears in cluster i summary in Cluster_Index. n is the number of clusters in the collection. In a single cluster, n is the number of all sentences in the cluster. df_j is the cluster frequency of term j. In a single cluster, it is the sentence frequency of term j (Alguliev and Aliguliyev 2007).
4.
For search query, the AU computes the trapdoor as follows ($ T_{{w_{i} }} $, index-for-specific-cluster, k) where k is the number of the retrieved documents and $ T_{{w_{i} }} $ is an encrypted form for each word in the search query w as follows:
$$ T_{{w_{i} }} = \mathop \sum \limits_{i = 1}^{m} H(w_{i} )^{r} $$
(18)
where H is a collision-resistant hash function (Secure Hash Algorithm 1 (SHA-1)). The SHA-1 is a cryptographic hash function; it produces a single output message digest of 160-bit (20-byte) from an input message. Multiple chunks of 512 bits each compose the input message. Afterward, each chunk is further divided into sixteen 32-bit words (W_t= {W₀, W₁, W₂, … W₁₅}), one 32-bit word for each round of the SHA-1 processing (Chaves et al. 2006). The $ T_{{w_{i} }} $ generation is illustrated in Algorithm 10.

4.2.2 Ranked keyword search

The CSP searches for the matching documents in a specific cluster after receiving the trapdoor (T_wi, index-for-specific-cluster, k) from the AU or the DO. The CSP conducts a ranked keyword-based search as illustrated in Algorithm 11.

The CSP first compares the index-for-specific-cluster with the clusters number in Words_Index to select the matching cluster and then finds the matching entries of the corresponding encrypted document in this cluster using (T_wi). The CSP gets the matched document identifiers by checking whether the terms in the search query existed in these documents and then finding the ranks for each matched terms to determine the matched documents according to the relevance scores and send the top-k most relevant documents in descending order back to the DO or the AU. To obtain the rank of the corresponding documents, the B⁺ tree-based and the dictionary (key/value) data structure is used per cluster. The B⁺ tree-based is a balanced search tree with O (log(n)) searching time for the worst-case scenario. The data that represent the score for each word are stored at the leaf nodes, where the other nodes only store the keys. The B⁺ tree-based and the dictionary (key/value) data structure can be demonstrated as the following definition.

Definition 4

Assume u is a node of index tree I_t, then if a node u is an internal (non-leaf) node, then it has the following four fields: (1) u.n is the router value currently stored in node u, (2) the router values stored in node u in increasing order (u.router₁ < u.router₂ < ··· < u.router_u.n), (3) u.leaf is a Boolean field where a zero value indicates that u is a non-leaf node and (4) u.n + 1 pointers u.c₁, u.c₂, u.c₃, …, u.c_u.n+1 to the children of u. If a node u is a leaf node, then it has the following three fields (1) u.n is the number of key values currently stored in u. (2) The key values stored in node u in increasing order u.key₁ < u.key₂ < ··· < u.key_u.n and (3) u.leaf are a Boolean field where one value means that u is a leaf node. The searching process using B⁺ tree-based is illustrated in Algorithm 12.

The CSP first selects the matching cluster using the index-for-specific-cluster and then selects the corresponding tree for this cluster and starts comparing each term in the trapdoor (Tw_i) against the selected tree to determine which documents contain the search query. Then it computes the score for each term in (Tw_i) to produce the final score for each document. Finally, these documents must be sorted in descending order based on their scores, and the top-k most relevant documents will be retrieved and send back to the AU or the DO, as portrayed in Fig. 4.

After the ranked keyword search is completed, the top-k most relevant documents will be retrieved and send back to the AU or the DO in an encrypted format.

4.2.3 Data decryption

The AU and the DO need to decrypt the received documents from the CSP in response to his/her search query. The plain text could be obtained using the shared private key (Pasupuleti et al. 2016). The procedure of the decryption process is illustrated in Algorithm 13.

To decrypt the document, the AU computes the four modular square roots (r, c, a, b) and uses them to get the plaintext d_i (Pasupuleti et al. 2016).

To ensure the integrity of the received documents, the authorized user will check the documents by matching the hash value from CSP with the hash value from the data owner, so the authorized user can detect the modification of the encrypted data in a cloud environment if there are matches between these hash values. This could be achieved using any collision-resistant hash function H (e.g., SHA-1). As a result, if any hacker changes the document content, the index content or the encrypted documents, then h₁ ! = h₂ which means the data are changed or corrupted in the cloud, so our approach can achieve the data integrity of encrypted documents stored in the cloud. Therefore, privacy-preserving requirements are satisfied in our approach.

5 Security analysis

Privacy preserving of cloud computing data is considered as one of the most important issues that stop users from adopting cloud computing into their infrastructures. Therefore, in this section, we analyze the security of PPMKS against internal and external attacks. To analyze the security of PPMKS, we adopt the following definitions proposed by Pasupuleti et al. (2016).