1 Introduction

Cloud computing is an internet-based service model that consists of group of remote servers integrated to provide ubiquitous on demand service that can be provisioned rapidly from a shared pool of configurable resources at reduced cost [1]. Due to increase in need for computing resources, individuals and businesses outsource their computing and storage needs to the cloud [2]. The business benefit of cloud storage is undeniable and enterprises achieve effective functionality at a reduced price while improving business agility. Also, the public cloud storage option enables the Small and Medium Enterprises (SMEs) to fully focus on their core business rather than IT management [3]. However, security and privacy are the major barriers for adopting the cloud for both the large enterprises and SMEs [4]. According to a survey [5], an overwhelming majority of 91% of organizations are concerned about public cloud security. This is because the data owner has no physical control over the outsourced data.

The security of stored data is of concern, especially when outsourced data is sensitive data like health records, critical business data and credit card information to name a few [2]. The cloud server (CS) or cloud service provider (CSP) and data owners (DOs) are not in the same trusted domain, which may put the outsourced unencrypted data at risk. CSPs may leak information regarding the outsourced data to unauthorized entities or even be hacked or data could be destroyed with malicious intents. Therefore, sensitive data has to be encrypted by the DO prior to outsourcing to combat unsolicited access and to preserve privacy. Additionally, to speed up retrieval, indices are sent to the CSP, who can provide various functionalities on behalf of the DO. The commonly used indices are index per document (forward index) or index per keyword (inverted index). Inverted index is a popular data structure [6] used to speed up the search. However, the inverted index is inherently sequential. The index has to be rebuilt in order to update (addition/deletion) keywords and documents. Even if the updates are handled using a separate data structure like a delete array, the construction is complex and allows only minimal updates [7]. Second, updates leak information since the update information requires rebuilding inverted index at the CSP-owned resources.

Most of the traditional searchable encryption schemes [8] do not capture the relevance of the retrieved documents. When applied on large data such as big data or collaborative data, there are a few drawbacks. First, without foreknowledge of the encrypted data stored at the CSP, the authorized data users (DUs) have to go through every document to find the ones that match their interest. Second, especially in the pay-per-view cost model, downloading all the documents that match the search query incurs unnecessary traffic and cost. Ranked retrieval conserves bandwidth and computations, besides improving user productivity, as only the most relevant documents are retrieved. Hence, top-k is an essential information retrieval technique that makes economic usage of resources like bandwidth and computational resources. The working principle of public cloud storage as a service therefore relies in the exhibition of the CSPs’ competence in protecting the outsourced data by providing security mechanisms. Outsourcing of data to the CSP raises new security concerns. First, since the CSP and the DO are not in the same trusted domain, the secrecy of the outsourced data is a challenge. It is required that the CSP must not learn any information from the stored data, or the way it is accessed [9]. Second, the completeness of the search, i.e. the retrieved result must include all the documents that match the search query. The search results must be verifiable, since a malicious CSP can return only fragments of the search result in order to save computational resources.

Most of the existing literature do not provide practical and fully functional searchable encryption construction. For searchable encryption to be practical, the constructions should have properties like privacy preserving, efficient ranked search and verifiable search with provisions to dynamically update the encrypted documents without index reconstruction. As an effort to address these issues, verifiable top-k ranked searchable encryption over encrypted cloud data (VSED) is proposed. The contributions are summarized as follows:

  1. 1.

    Construction of a top-k searchable encryption scheme based on inverted index and secret orthogonal vector to retrieve documents based on their relevance.

  2. 2.

    The construction supports dynamic update on encrypted index flexibly without reconstructing the entire index.

  3. 3.

    Verifiable search in order to verify the completeness of the search result returned by the CS.

  4. 4.

    Security analysis to demonstrate that VSED is semantically secure with provable trapdoor unlinkability, correctness and privacy guarantees.

  5. 5.

    Experimental evaluations on real-world dataset is used to show the efficiency of the proposed scheme.

1.1 Organization

The paper is organized as follows. Section 2 describes the related works. Section 3 gives the problem formulation, which includes the system model, system design and notations. Section 4 presents the construction of the proposed work VSED. The security model and the security analysis are given in sections 5 and 6, respectively. The performance analysis of VSED is described in section 7. Finally, section 8 concludes the paper with suggestions for future work.

2 Related works

Searchable encryption is a cryptographic primitive that has gained considerable attention recently. A number of traditional searchable encryption techniques use symmetric key setting with the focus on improvement of efficiency and formalization of security definitions. Song et al [10] were the first to put forward a practical solution for searchable encryption with a linear search complexity. Goh [11] proposed a secure per document index construction based on bloom filter and pseudo-random. The search complexity is proportional to the number of documents in the document set and the construction with bloom filter results in false positives. Chang and Mitzenmacher [12] presented a similar construction with an index per document with linear search time. Curtmola et al [13] constructed an inverted index or per keyword index and pointed a link between the index and the trapdoor. The search complexity realized is a sub-linear search. Boneh and Waters [14] were the first to present an asymmetric searchable encryption scheme. Their construction has a drawback of low search efficiency and large computation cost. The aforementioned constructions support single-keyword and Boolean search without capturing the relevance of the documents.

Multi-keyword top-k search: The common pattern used by data users to search the outsourced data is to use multiple query keywords at a time. Cao et al [15] introduced multi-keyword ranked search using coordinate matching and vector space model. The search complexity is a linear search and the scheme uses a deterministic trapdoor; thus, it is prone to distinguishability attack. Since then, there are numerous schemes and multi-keyword searchable encryption schemes have gained attention [16,17,18]. Sun et al [19] realized a secure multi-keyword search using a searchable index-tree based on vector space model and cosine similarity measures along with the TF–IDF to rank relevant documents. Jiang et al [20] present a multi-keyword ranked search using an inverted index, and a special data structure QSet to mask the correspondence between the keyword and the document set that contains the keyword. TF–IDF values are used to find the relevance scores.

Verifiable search: The search results returned by the CS may not contain complete result or can contain errors. This is possible due to malicious CS intending to save computational resource or due to software/hardware malfunction. Therefore, a mechanism to verify the completeness of the search result is desired. A number of verifiable search is based on Merkle hash tree [21,22,23,24,25,26], but suffers from huge storage overhead. Other schemes include the cryptographic signature schemes [27, 28]. Wang et al [29] propose a verifiable auditing scheme based on bloom filters. The scheme suffers from computational and communication overload.

Dynamic updates: In literature, very few schemes support addition or deletion of keywords in the document. Song et al [10] present a dynamic scheme where the document is divided into fixed-size words and each word is individually indexed. Updating is straightforward but with varying trade-offs between security and efficiency. The construction of Goh [11] using bloom filters updates the document but suffers from false positives and linear search time. Kamara et al [7] presented a construction with inverted index but with complex implementation. Kamara and Papamanthou [30] proposed an index based on keyword red black tree for Boolean and single-keyword search. Cash et al [31] proposed a dynamic searchable encryption scheme using “T-Sets”—a data structure for the keyword/identity tuple. A separate database is used to add tuples, and deleted tuples are stored in a revocation list. The construction supports only single-keyword search. Naveed et al [32] proposed a dynamic searchable encryption via a blind storage that allows DOs to store dynamic collection of documents with the CS. However, the scheme leaks information on the updates. Moreover, updating the existing document by addition or deletion of keywords is not supported. Xia et al [33] proposed an encryption scheme based on keyword balanced binary tree and greedy depth-first search. The cost of the search and the time complexity of trapdoor are high.

3 Problem formulation

3.1 System model

The system model involves four entities, namely the data owner or document owner (DO), CS or the CSP, secure co-processor (SCP) and the data user (DU), as illustrated in figure 1. The SCP [34] (like the IBM PCIe or the Freescale C29x) is assumed to reside at the CSPs’ isolated execution environment. It is assumed that the CS and the SCP do not collude. The DO has a collection of document set \(\mathcal {D=}\{d_{1},...,d_{n}\}\) to be outsourced to the CS. Using a secure encryption algorithm, the DO generates the encrypted document set \({\mathcal {C}}=\{C_{1},...,C_{n}\}\) from the document set \({\mathcal {D}}\). To enable searching through the document set, an encrypted searchable index per keyword (inverted index) is constructed from the document set \({\mathcal {D}}\). Both the encrypted document set \({\mathcal {C}}\) and the encrypted index \(I_{{\mathcal {D}}}\) are outsourced to the CS. The DO computes the term frequency and inverse document frequency \({\mathcal {T}}\), and sends the encrypted score index \({\mathcal {S}}\) to the SCP. To search for the documents containing the keyword of interest, an authorised DU acquires a trapdoor corresponding to the search keyword from the DO. On receiving the trapdoor, the CSP executes search over the encrypted index \(I_{{\mathcal {D}}}\) and retrieves all the documents that match the search query. The CSP sends the trapdoor and an optional k obtained from the DU to the SCP to compute the top-k documents. The SCP verifies and computes the scores for the query keyword using the encrypted score index \({\mathcal {S}}\), ranks according to its relevance to the query and returns the top-k document identifiers along with verifiable parameters to the CSP. The CSP returns the top-k documents along with the verifiable parameter to the DU.

Figure 1
figure 1

Architecture of VSED.

3.2 Design goals

To enable dynamic, efficient and secure top-k ranked search over encrypted cloud data, VSED’s construction aims at the following design goals.

  1. 1.

    Top-k multi-keyword ranked search: The design of the scheme must accept multiple input queries from the DU and return the top-k relevant results that match the search query.

  2. 2.

    Dynamic index: The scheme should have provisions for dynamic update, i.e. both addition and deletion of documents, without reconstructing the index.

  3. 3.

    Search and update efficiency: The scheme should be accomplished with sub-linear search and update time.

  4. 4.

    Verifiable search: The scheme should support the verifiability of the completeness of the results returned from the CS.

  5. 5.

    Privacy preserving: The construction should not leak any information (access pattern, history, search pattern, trace) to the CS beyond what is allowed.

4 Construction of VSED protocol (\(\pi _{V}\))

The construction of the VSED scheme aims to securely and publicly perform computations with encrypted data or to modify the encrypted data using special functions in such a way that they are computed while preserving the privacy. Such encryption mechanisms are called homomorphic cryptosystems. Cryptosystems like RSA, Paillier and ElGammal exhibit partial homomorphic properties. VSED makes use of the Paillier cryptosystem’s homomorphic addition property [35]. The preliminary set-up and related algorithms required for construction of VSED are described in this section. The construction of the VSED scheme consists of the following algorithms.

  • KeyGen: A probabilistic polynomial-time algorithm that generates secret parameters for encryption and index construction. The KeyGen algorithm generates secret parameter for standard symmetric encryption and homomorphic Paillier cryptosystem.

  • Enc(pkmr): A probabilistic polynomial-time algorithm that takes a secret parameter, message and a random number as input to produce a cipher text.

  • Dec(skc): A probabilistic polynomial-time algorithm that takes a secret parameter and cipher text as input to output the original plain text.

  • \(BuildScoreIndex({\mathcal {D}},{\mathcal {W}},pk,{\mathcal {V}})\): This algorithm is executed by the DO, which takes the data item \(w\in {\mathcal {W}}\), public key pk and secret vector set \({\mathcal {V}}\) as inputs, and outputs a score index \({\mathcal {S}}\).

  • \(BuildIndex({\mathcal {D}},{\mathcal {W}},pk,{\mathcal {V}})\): This algorithm is executed by the DO, which takes the keyword \(w\in {\mathcal {W}}\), the public key pk and secret vector set \({\mathcal {V}}\) as inputs, and outputs an encrypted index \(I_{{\mathcal {D}}}\).

  • \(TrapDoor(w_{q},pk,{\mathcal {V}})\): This algorithm takes a keyword \(w_{i}\in {\mathcal {W}}\), public key pk and secret vector set \({\mathcal {V}}\), and generates the trapdoor \(T_{w_{q}}\) to be used for searching a keyword in the encrypted index.

  • \(Search(I_{D},T_{w_{q}},k)\): This algorithm takes an index \(I_{{\mathcal {D}}}\), a trapdoor \(T_{w_{q}}\) and k, the number of top matching documents, to retrieve and outputs the set of top-k file identifiers denoted by \({\mathcal {F}}_{qk}=\{f_{i}\}_{\forall f_{i}\in w_{q}}\).

  • \(topk(k,\mathcal {E_{{\mathcal {F}}}}_{q},sk,T_{w_{q}},{\mathcal {S}})\): This algorithm takes a value k, encrypted binary index vector \(windex_{q}\) (computed by the search algorithm and derived from \(\mathcal {E_{{\mathcal {F}}}}_{q}\)), secret parameter sk, encrypted score index \({\mathcal {S}}\) and a trapdoor \(T_{w_{q}}\), generated by the DO, and outputs the top-k document identifiers that match the query keyword.

  • \(Update(pk,I_{D},{\mathcal {S}},f_{u},{\mathcal {V}})\): This algorithm takes an index \(I_{{\mathcal {D}}}\), trapdoor to update \(Y_{w_{q}},\) the public key pk, secret vector set \({\mathcal {V}}\) and document or keyword to be updated as input, and outputs an updated index and score index. \(AddKeyword(pk,I_{D},Y_{w_{u}},{\mathcal {S}},f_{u},{\mathcal {V}})\), for adding new keywords, and \(DeleteKeyword(pk,I_{D},Y_{w_{u}},{\mathcal {S}},f_{u},{\mathcal {V}})\), for deleting keywords from the encrypted index, are the two algorithms that constitute update.

4.1 Set-up

Let \({\mathcal {D}}={d_{1},\ldots , d_{n}}\) where \({\mathcal {D}}\) denotes the set of documents and \(d_{i}\) denotes the \(i^{\mathrm{th}}\) document in the set of documents. Let \({\mathcal {C}}=\{C_{1},...,C_{n}\}\) denote the set of encrypted documents such that \(C_{i}=Enc(sk_{STD},d_{i})\). Any standard symmetric key encryption algorithm proven to be a pseudo-random function like AES is assumed for encrypting the files or documents. The operation of encrypting the files is denoted as \(Enc(sk_{STD},d_{j})\), where \(sk_{STD}\) denotes the secret key of such encryption algorithm and \(d_{j}\) denotes the document to be encrypted. Let \({\mathcal {F}}={f_{1},\ldots , f_{n}}\) denote the file identifiers of the encrypted documents and each \(f_{i}\) is given by \(f_{i}=id(C_{i})\). Let \({\mathcal {W}}=\{w_{1},..,w_{|{\mathcal {W}}|}\}_{\ne }\) be the set of distinct keywords of the document collection \({\mathcal {D}}\).

4.2 Key generation

Key generation is a probabilistic polynomial-time algorithm that generates the secret parameters used for encryption and building index. The VSED algorithm uses Paillier cryptosystem for generation of trapdoor and index. Hence, its key generation set-up is assumed.

Let \(n=pq\) for primes p and q; set \(\lambda =lcm(p-1,q-1)\), and choose \(g\in Z_{n^{2}}^{*}\) such that \(\gcd (L(g^{\lambda }\mod n^{2}),n)=1\), where \(L(x)=(x-1)/n\). Then \(P=R=Z_{n},C=Z_{n}^{*}\) and \(K={(n,g,p,q,\lambda )}\), where \(n,g,p, q\, \mathrm {and}\,\lambda \) are defined earlier. Given security parameter \(\varepsilon \), the algorithm chooses two distinct \(\varepsilon /2-\)bit primes p and q, sets \(n=pq\) and \(\lambda =lcm(p-1,q-1\)), and chooses \(g\in Z_{n^{2}}^{*}\) such that \(\gcd (L(g^{\lambda }\mod n^{2}),n)=1\). The tuple(ng) is the public key denoted by pk, and the tuple \((p,q,\lambda )\) is a private key denoted by sk.

The Principle of Orthogonality states that two vectors \(X,Y\in R\) are orthogonal or perpendicular if \(X.Y=0\). Moreover, \(X_{1},..,X_{p}\in R^{n}\) are mutually orthogonal if \(X_{i}.X_{j}=0\) whenever \(i\ne j\). A set of mutually orthogonal vectors is called an orthogonal set. Mutually orthogonal unit vectors \({v_{1},...,v_{p}\in R^{n}}\) are said to be orthonormal. Let \({\mathcal {V}}\) be a set of mutually orthogonal vectors given by \({\mathcal {V}}={v_{1},v_{2},\ldots ,v_{|{\mathcal {W}}|}}\) where \(v_{i}\) denotes the \(i^{\mathrm{th}}\) row vector and \(|{\mathcal {W}}|\) denotes the number of keywords in the keyword set \({\mathcal {W}}\). The row vectors in the set \({\mathcal {V}}\) exhibit mutual orthogonal properties such that \(\{v_{i}.v_{i}=1: \) \(v_{i}=v_{i},v_{i}\in {\mathcal {V}}\}\) and \(\{v_{i}.v_{j}=0\) : \(v_{i}\ne v_{j},v_{i},v_{j}\in {\mathcal {V}}\}\).

Alternatively, \({v_{1},v_{2},\ldots ,v_{p}}\) is called an orthonormal set. A square \(n\times n\) matrix H with elements \(\pm 1\) that satisfies \(H\times H^{T}=nI_{n}\) is called a Hadamard matrix of order n [36]. The Hadamard matrices exhibit good orthogonal properties. The Hadamard matrix can be generated by choosing p Hadamard arrays \(HA_{1},HA{}_{2},..,HA{}_{p}\) each of size, say, \(e_{i}\times e_{i}\) for \(1\le i\le p\) where each \(e_{i}\) is either \(2,\,4\) or 8. Later, construct \(e_{1}e_{2}\ldots e_{p}\)-sized matrix \(HA_{M}\) by the tensor product of these p matrices given as

$$\begin{aligned} {\mathcal {V}}=H_{M}={\displaystyle H_{1}\otimes H_{2}\otimes \dots \otimes H_{p}}. \end{aligned}$$
(1)

4.3 Encryption

Given a message \(m\in P\) and a public key \(pk=(n,g), Enc(pk,m,r)\) chooses a random \(r\in Z_{n}^{*}\) such that \(\gcd (r,n)=1\) and returns the cipher text given by

$$\begin{aligned} c=g^{m}r^{n}\mod n^{2}. \end{aligned}$$
(2)

4.4 Decryption

Given a ciphertext \(c\in {\mathcal {C}}\) and a private key \(sk=(p,q,\lambda )\), decryption Dec(skc) returns the message

$$\begin{aligned} m=\frac{(L(c^{\lambda }\mod n^{2}))}{L(g^{\lambda }\mod n^{2})\mod n}. \end{aligned}$$
(3)

4.5 BuildScoreIndex

The document score is computed based on term frequency (tf) and inverse term frequency (idf). The product of tf and idf is denoted by \({\mathcal {T}}\).

Term frequency is the frequency of a keyword \(w_{i}\) for the document \(d_{j}\) and is given by

$$\begin{aligned} tf(w_{i},d_{j})=\mathrm{frequency\,of}\,w_{i}. \end{aligned}$$
(4)

The inverse document frequency of a keyword \(w_{i}\) for the document collection \({\mathcal {D}}\) is given by

$$\begin{aligned} idf(w_{i},{\mathcal {D}})=\log \left( \dfrac{|{\mathcal {D}}|}{|d_{j}\in {\mathcal {D}}:w_{i}\in d_{j}|}\right) . \end{aligned}$$
(5)

Inverse document frequency of a keyword \(w_{i}\) over the document collection \({\mathcal {D}}\), denoted by \(idf(w_{i},{\mathcal {D}})\), is the log of the ratio of the total number of documents in the document collection, represented as \(|{\mathcal {D}}|\), to the number of documents in the document collection containing the keyword \(w_{i}\), represented as \(|d_{j}\in {\mathcal {D}}:w_{i}\in d_{j}|\). The term-inverse document score of a keyword \(w_{i}\) for the document \(d_{j}\) is given by

$$\begin{aligned} {\mathcal {T}}(w_{i},d_{j})=tf(w_{i},d_{j})idf(w_{i},{\mathcal {D}}). \end{aligned}$$
(6)

4.5a Algorithm description for \(\mathcal {S\leftarrow }BuildScoreIndex({\mathcal {D}},{\mathcal {W}},pk,{\mathcal {V}})\):

The algorithm description to compute the term frequency is as follows:

  1. 1.

    Initialize the document term frequency \(tf(w_{i},d_{j})\) and inverted document frequency \(idf(w_{i},d_{j})\) for all keywords and documents to be 0. Also, initialize all term-inverse document score \({\mathcal {T}}(w_{i},d_{j})\) to 0.

  2. 2.

    For each document \(d_{j}\in {\mathcal {D}}\), increment the document term frequency for the corresponding keyword \(w_{i}\in d_{j}\) as \(tf(w_{i},d_{j})=tf(w_{i},d_{j})+1\) .

  3. 3.

    For each keyword \(w_{i}\in {\mathcal {W}}\), compute the inverse document term frequency as \(idf(w_{i},{\mathcal {D}})=\log (\dfrac{|{\mathcal {D}}|}{|d_{j}\in \mathcal { D}:w_{i}{{\in }}d_{j}|})\).

  4. 4.

    For each keyword \(w_{i}\in {\mathcal {W}}\) and document \(d_{j}\in {\mathcal {D}}\), calculate \({\mathcal {T}}(w_{i},d_{j})=tf(w_{i},d_{j})idf(w_{i},{\mathcal {D}})\).

  5. 5.

    Later, the encrypted per document score index denoted by \({\mathcal {S}}(d_{j})\) is computed for all documents \(d_{j}\in {\mathcal {D}}\) as follows:

    $$\begin{aligned} {\mathcal {S}}(d_{j})={\displaystyle \sum _{i=1}^{|w_{i}\in d_{j}|}v_{i}M_{i}{\mathcal {T}}(w_{i},d_{j})+v_{r}r'} \end{aligned}$$
    (7)

    where \(v_{i}={\mathcal {V}}(w_{i})\), \(M_{i}=Enc(pk,w_{i},r_{w_{i}})\), \(v_{r}={\mathcal {V}}(r)\) and \((r_{w_{i}},r')\in Z_{n}^{*}\) such that \(\gcd (r,n)=1\).

The diagrammatic representation of the score index is shown in figure 2.

Figure 2
figure 2

Score index.

Figure 3
figure 3

Index.

4.6 BuildIndex

The index of VSED scheme denoted by \(I_{D}\) is computed as follows and diagrammatically shown in figure 3:

$$\begin{aligned} I_{D}=\sum _{i=1}^{|{\mathcal {W}}|}(v_{i}M_{i}S_{i})+v_{r}r'. \end{aligned}$$
(8)

4.6a Algorithm description for \(I_{{\mathcal {D}}}\leftarrow BuildIndex({\mathcal {D}},{\mathcal {W}},pk,{\mathcal {V}})\):

The step by step procedure to build encrypted index is described as follows:

  1. 1.

    Calculate \(v_{i}={\mathcal {V}}(w_{i})\) the corresponding secret vector assigned for the keyword \(w_{i}\).

  2. 2.

    Calculate \(M_{i}=Enc(pk,w_{i},r_{w_{i}})\) where pk is the public key of the Paillier cryptosystem, \(w_{i}\) is the respective keyword and a random \(r_{w_{i}}\in Z_{n}^{*}\) such that \(\gcd (r,n)=1\).

  3. 3.

    Calculate \(S_{i}=Enc(pk,{\mathcal {L}}(w_{i}))\) where

    1. (a)

      pk is the public key of the Paillier cryptosystem,

    2. (b)

      \({\mathcal {L}}(w_{i})=windex_{i}\Vert h_{w_{i}}\) where

      i. \(windex_{i}\) is the binary vector index for the keyword \(w_{i}\). The binary vector index denoted as \(windex_{i}\) is a data vector in which a value of 1 in a position indicates the presence of a keyword in that document. For example, consider a binary vector \(windex_{i}=1011\) for a keyword \(w_{i}\) in document collection \({\mathcal {D}}={d_{1},d_{2},d_{3},d_{4}}\); then documents \(d_{1},d_{3},d_{4}\) are said to contain the keyword \(w_{i}.\)

      ii. \(h_{w_{i}}={\mathcal {H}}(sk,windex_{i})\) is the keyed hash of \(windex_{i}\) for the keyword \(w_{i}.\)

  4. 4.

    Calculate \(v_{r}={\mathcal {V}}(r)\), a random vector, and \(r'\) is a random number added for randomness.

  5. 5.

    Repeat steps 1–4 for all keywords \(w_{i}\in {\mathcal {W}}\) and cumulatively add them to obtain the index \(I_{{\mathcal {D}}}\).

4.7 Trapdoor

The trapdoor generation of VSED can be represented as

$$\begin{aligned} T_{w_{q}}=v_{q}M_{w_{q}}^{-1}+v_{r}r' \end{aligned}$$
(9)

4.7a Algorithm description for \(T_{w_{q}}\leftarrow TrapDoor(w_{q},pk,{\mathcal {V}})\):

The step by step procedure to generate trapdoor is given as follows:

  1. 1.

    Calculate \(v_{w_{q}}={\mathcal {V}}(w_{q})\), the corresponding vector for the keyword to be queried \(w_{q}\).

  2. 2.

    Calculate \(M_{w_{q}}^{-1}=Enc(pk,-w_{q},r_{q})\) where pk is the public key of the Paillier cryptosystems, \(-w_{q}\) is the negation of the keyword to be queried, \(r_{q}\) is a random number chosen such that \(r_{q}\in Z_{n}^{*}\) and \(\gcd (r,n)=1\).

  3. 3.

    Calculate \(v_{r}={\mathcal {V}}(r)\), the random vector, and \(r'\) is a random number added for randomness

  4. 4.

    Calculate the trapdoor \(T_{w_{q}}\) for the keyword to be queried \(w_{q}\) as \(T_{w_{q}}=v_{w_{q}}M_{w_{q}}^{-1}+v_{r}r\)’.

4.8 Search

The search mechanism involves multiplying the received trapdoor with the index, represented as

$$\begin{aligned} {\mathcal {E}}_{{\mathcal {F}}_{q}}=I_{{\mathcal {D}}}T_{w_{q}}. \end{aligned}$$
(10)

4.8a Algorithm description for \({\mathcal {F}}_{qk}\leftarrow Search(I_{{\mathcal {D}}},T_{w_{q}},k)\):

The step by step description of the search mechanism is given as follows:

  1. 1.

    Calculate \({\mathcal {E}}_{{\mathcal {F}}_{q}}=I_{{\mathcal {D}}}T_{w_{q}}\) and find the respective document file identifiers:

    1. (a)

      If \({\mathcal {E}}_{{\mathcal {F}}_{q}}\)equals 0, then return null indicated by \(\perp \). This happens when either the query keyword is not found in the document set (i.e., \(|d_{j}\in D:w_{i}\in d_{j}|=0\)) or when a invalid trapdoor is submitted.

    2. (b)

      If \({\mathcal {E}}_{{\mathcal {F}}_{q}}\ne 0\) then send \((k,{\mathcal {E}}_{{\mathcal {F}}_{q}},T_{w_{q}})\) to the SCP invoking the top-k algorithm. The SCP returns the top-k document identifiers that match the query keyword.

  2. 2.

    Later, the CSP returns the corresponding documents to the user who had sent the trapdoor along with \({\mathcal {F}}_{qk}={\mathcal {F}}_{qk}^{'}\Vert h_{{\mathcal {F}}_{qk}^{'}}\) where \({\mathcal {F}}_{qk}^{'}\) is the set of top-k document identifiers that match the query keyword and \(h_{{\mathcal {F}}_{qk}^{'}}={\mathcal {H}}(sk,{\mathcal {F}}_{qk}^{'})\) is the keyed hash of those top-k document identifiers that match the query keyword.

Hence, it can be observed that \({\mathcal {F}}_{qk}\) can be retrieved without revealing the keyword and \(h_{{\mathcal {F}}_{qk}^{'}}\)can be used by the user to verify whether the documents returned by the CSP are correct. The correctness, verifiability and security are discussed in the security analysis subsection.

4.9 Top-k

The top-k ranking is computed by the SCP, ensuring that neither the CSP nor the SCP learns anything about the data involved other than the allowed information leakage described in the security model. The SCP will have the encrypted \({\mathcal {S}}\) score index and may have the knowledge of the number of documents in the document set that match the search query, but does not know to which keyword the scores are associated. The CSP has the encrypted index and the top-k documents only. CSP does not have the knowledge of either the keywords being searched or the information of all the documents that contain the query keyword as only top-k documents are returned to it by SCP.

4.9a Algorithm description for \({\mathcal {F}}_{qk}\leftarrow topk(k,\mathcal {E_{{\mathcal {F}}}}_{q},sk,T_{w_{q}},{\mathcal {S}})\):

The step by step process for computing the top-k documents at the SCP is as follows:

  1. 1.

    SCP decrypts \({\mathcal {E}}_{{\mathcal {F}}_{q}}\) using the private key sk to obtain \(windex_{i}\Vert h_{w_{i}}\).

  2. 2.

    Calculate the hash for \(windex_{i}\) as \(h_{w_{i}}^{'}={\mathcal {H}}(sk,windex_{i})\).

  3. 3.

    Verify whether the received hash \(h_{w_{i}}\) and the computed hash \(h_{w_{i}}^{'}\) are the same.

  4. 4.

    If there is a mismatch \((h_{w_{i}}^{'}\ne h_{w_{i}})\), SCP returns null represented by \(\bot \). This prevents any modification or fabrication of \({\mathcal {E}}_{{\mathcal {F}}_{q}}\)by the CSP.

  5. 5.

    If \(h_{w_{i}}^{'}=h_{w_{i}}\), then calculate the following:

    1. (a)

      Obtain \(windex_{i}\) from \({\mathcal {E}}_{{\mathcal {F}}_{q}}\)by decrypting it with sk.

    2. (b)

      Calculate the score \({\mathcal {Q}}(d_{x})\) for all x, if \(windex_{i}[x]=1,\) by computing \({\mathcal {Q}}(d_{x})={\mathcal {S}}(d_{x})T_{w_{q}}\).

    3. (c)

      Sort the scores \((sort({\mathcal {Q}},k))\); obtain the first k document identifiers denoted by \({\mathcal {F}}_{qk}^{'}\) and given by \({\mathcal {F}}_{qk}^{'}=sort({\mathcal {Q}},k)\).

  6. 6.

    Calculate the hash as \(h_{{\mathcal {F}}_{qk}^{'}}={\mathcal {H}}(sk,{\mathcal {F}}_{qk}^{'})\) where \(h_{{\mathcal {F}}_{qk}^{'}}\) is the hash of top-k document identifiers for verification.

  7. 7.

    Calculate \({\mathcal {F}}_{qk}={\mathcal {F}}_{qk}^{'}\Vert h_{{\mathcal {F}}_{qk}^{'}}\) where \({\mathcal {F}}_{qk}\) is the concatenation of the set of top-k document identifiers and the hash.

  8. 8.

    Send \({\mathcal {F}}_{qk}\) to the calling search algorithm.

4.10 Update

The dynamic keyword update process involves two algorithms, namely AddKeyword and DeleteKeyword.

4.10a Algorithm description for adding keyword \((I_{{\mathcal {D}}},{\mathcal {S}})\leftarrow AddKeyword(pk,I_{{\mathcal {D}}},Y_{w_{u}},{\mathcal {S}},f_{u},{\mathcal {V}})\):

The step by step procedure to add a keyword dynamically is as follows:

  1. 1.

    For adding a keyword that already belongs to the set \({\mathcal {W}}\), (i.e., \(w_{u}\in {\mathcal {W}}\)) the old binary vector index has to be removed and the new binary vector index has to be added to the index:

    1. (a)

      Calculate \(Y_{w_{u}}=v_{u}M_{u}S_{u}\) where \(w_{u}\) is the keyword whose binary vector index needs to be updated, \(v_{u}={\mathcal {V}}(w_{u})\), \(M_{u}=Enc(pk_{,}w_{u},r_{u})\) and \(S_{u}=Enc(pk,{\mathcal {L}}(w_{u}))\).

    2. (b)

      Calculate \(Y_{w_{u}}^{'}=v_{u}M_{u}S_{u}^{'}\) where \(w_{u}\) is the keyword whose binary vector index needs to be updated, \(v_{u}={\mathcal {V}}(w_{u})\), \(M_{u}=Enc(pk,w_{u},r_{u})\) and \(S_{u}^{'}=Enc(pk,{\mathcal {L}}(w_{u}^{'}))\) is the newly generated binary vector index in which the respective bit for the document is set as given by \(windex_{u}^{'}[f_{u}]=1\) for a given keyword \(w_{u}\).

  2. 2.

    For adding a new keyword or a keyword that does not belong to the set \({\mathcal {W}},\) (i.e., \(w_{u}\notin {\mathcal {W}}\)) the new binary vector index has to be added to the index:

    1. (a)

      Set \(Y_{w_{u}}=0\) and calculate \(Y_{w_{u}}^{'}=v_{u}M_{u}S_{u}^{'}\) where \(w_{u}\) is the new keyword whose binary index vector needs to be added, \(M_{u}=Enc(pk,w_{u},r_{u})\), \(v_{u}={\mathcal {V}}(w_{u})\) and \(S_{u}^{'}=Enc(pk,{\mathcal {L}}(w_{u}^{'}))\) is the newly generated binary index vector in which the respective bit for the document is set as given by \(windex_{u}^{'}[f_{u}]=1\) for a given keyword \(w_{u}\).

  3. 3.

    The DO sends \({Y_{w_{u}},Y_{w_{u}}^{'}}\) to the CSP and the CSP computes \(I_{{\mathcal {D}}}=I_{{\mathcal {D}}}-Y_{w_{u}}+Y_{w_{u}}^{'}\), where \(I_{{\mathcal {D}}}\) is the index.

  4. 4.

    Compute \({\mathcal {S}}'(d_{f_{u}})\leftarrow BuildScoreIndex({\mathcal {V}},d_{f_{u}},w_{u})\), the new score index and send it to SCP.

4.10b Algorithm description for deleting keyword \((I_{{\mathcal {D}}},{\mathcal {S}})\leftarrow DeleteKeyword(pk,I_{D},Y_{w_{u}},{\mathcal {S}},f_{u},{\mathcal {V}})\):

The step by step procedure to delete a keyword dynamically is as follows:

  1. 1.

    For deleting a keyword, the old binary index vector has to be removed from the index:

    1. (a)

      Calculate \(Y_{w_{u}}=v_{u}M_{u}S_{u}\) where \(w_{u}\) is the keyword whose binary index vector needs to be updated, \(v_{u}={\mathcal {V}}(w_{u})\), \(M_{u}=Enc(pk,w_{u},r_{u})\) and \(S_{u}=Enc(pk,{\mathcal {L}}(w_{u}))\), and set \(Y_{w_{u}}^{'}=0\).

  2. 2.

    The DO sends \(\{Y_{w_{u}},Y_{w_{u}}^{'}\}\) to the CSP and the CSP computes \(I_{{\mathcal {D}}}=I_{{\mathcal {D}}}-Y_{w_{u}}+Y_{w_{u}}^{'}\), where \(I_{{\mathcal {D}}}\) is the index.

  3. 3.

    Compute \({\mathcal {S}}'(d_{f_{u}})\leftarrow BuildScoreIndex({\mathcal {V}},d_{f_{u}},w_{u})\), the new score index, and send the same to the SCP.

4.11 Protocol description (\(\pi _{V}\))

Let \(\pi _{V}\) denote VSED protocol and the description of the working of the protocol is given as follows:

  1. 1.

    Let \(P_{1}\) be the CSP, \(P_{2}\) be the DO and \(P_{3}\) be the SCP. \(P_{2}\) computes the collection of encrypted files \({\mathcal {C}}\) given by

    $$\begin{aligned} {\mathcal {C}}\leftarrow {Enc(sk_{STD},d_{1}),\ldots , Enc(sk_{STD},d_{n})}. \end{aligned}$$
  2. 2.

    \(P_{2}\) invokes the BuildIndex algorithm to build the index for all the keywords given by

    $$\begin{aligned} I_{D}\leftarrow BuildIndex({\mathcal {D}},{\mathcal {W}},pk,{\mathcal {V}}). \end{aligned}$$
  3. 3.

    \(P_{2}\) invokes the BuildScoreIndex algorithm to build the document scores for the keywords given by

    $$\begin{aligned} {\mathcal {S}}\leftarrow BuildScoreIndex({\mathcal {D}},{\mathcal {W}},pk,{\mathcal {V}}). \end{aligned}$$
  4. 4.

    \(P_{2}\) sends the index, score index and the encrypted files to \(P_{1}\) and document score index to \(P_{3}\) given by

    $$\begin{aligned} P_{1}\leftarrow P_{2}:(I_{{\mathcal {D}}},{\mathcal {C}});\,P_{3}\leftarrow P_{2}:({\mathcal {S}}). \end{aligned}$$
  5. 5.

    If an user wants to search for a keyword over the secure cloud storage then \(P_{2}\) generates a trapdoor and sends it to user, which then sends it to \(P_{1}\), or simply it can be considered as \(P_{2}\) sending it to \(P_{1}\) given by

    $$\begin{aligned} P_{1}\leftarrow P{}_{2}:T_{w_{q}}\leftarrow TrapDoor(w_{q},pk,{\mathcal {V}}). \end{aligned}$$
  6. 6.

    \(P_{1}\) receives the trapdoor \(T_{w_{q}}\), invokes \(Search(I_{{\mathcal {D}}},T_{w_{q}},k)\) algorithm and computes \(\mathcal {E_{{\mathcal {F}}}}_{q}\).

    1. (a)

      \(P_{1}\) sends \(\mathcal {E_{{\mathcal {F}}}}_{q}\) to \(P_{3}\), which then executes the algorithm \(topk(k,\mathcal {E_{{\mathcal {F}}}}_{q},T_{w_{q}},{\mathcal {S}})\) and returns to \(P_{1}\) the top-k document identifiers given by

      $$\begin{aligned}&P_{3}\leftarrow P_{1}:(\mathcal {E_{{\mathcal {F}}}}_{q});\,P_{1}\leftarrow P_{3}: \\&{\mathcal {F}}_{qk}\leftarrow topk(k,\mathcal {E_{{\mathcal {F}}}}_{q},sk_{,}T_{w_{q}},{\mathcal {S}}). \end{aligned}$$
    2. (b)

      \(P_{1}\) returns \({\mathcal {F}}_{qk}\), a set of file identifiers with the hash returned by Search to \(P_{2}\), which is given by

      $$\begin{aligned} P_{2}\leftarrow P_{1}:{\mathcal {F}}_{qk}\leftarrow Search(I_{{\mathcal {D}}},T_{w_{q}},k). \end{aligned}$$
    3. (c)

      \(P_{1}\) receives the file identifiers \({\mathcal {F}}_{qk}\) from \(P_{3}\) and returns the top-k documents to \(P_{2}\).

  7. 7.

    For keyword addition, \(P_{2}\) generates \(Y_{w_{u}}\),\(f_{u}\),\({\mathcal {V}}\), computes \(AddKeyword(pk,I_{{\mathcal {D}}},Y_{w_{u}},{\mathcal {S}},f_{u},{\mathcal {V}})\) and then sends \((Y_{w_{u}},Y_{w_{u}}^{'})\) to \(P_{1}\) to update \(I_{{\mathcal {D}}}\). Then it computes \({\mathcal {S}}'\leftarrow BuildScoreIndex(d_{f_{u}})\) and sends to \(P_{3}\), which updates \({\mathcal {S}}\).

  8. 8.

    For keyword deletion, \(P_{2}\) generates \(Y_{w_{u}}\),\(f_{u}\),\({\mathcal {V}},\) computes \(DeleteKeyword(pk,I_{{\mathcal {D}}},Y_{w_{u}},{\mathcal {S}},f_{u},{\mathcal {V}})\) and then sends \((Y_{w_{u}},Y_{w_{u}}^{'})\) to \(P_{1}\) to update \(I_{{\mathcal {D}}}\). Then computes \({\mathcal {S}}'\leftarrow BuildScoreIndex(d_{f_{u}})\) and sends to \(P_{3}\), which updates \({\mathcal {S}}\).

5 Security analysis

The construction of the security model in VSED assumes an adversarial entity denoted by \({\mathcal {A}}\) that controls the CSP and can attack the execution of the protocol denoted by \(\pi _{V}\). The parties under the control of the adversary are said to be corrupted and follow the adversary’s instructions. A secure protocol is said to withstand any adversarial attack, assuming the adversary is computationally bounded. Therefore, to formally claim and prove that a protocol is secure, an accurate definition of secure protocol execution is required.

5.1 Model

The problem of cloud storage is similar to a two-party protocol problem where the two parties are CSP and DO. A two-party problem is defined by specifying a random process that maps pairs of inputs to pairs of outputs, one from each party. Generally, such a process is referred to as a functionality and denoted as \(\text {f}:\{0,1\}^{*}\times \{0,1\}^{*}\rightarrow \{0,1\}^{*}\times \{0,1\}^{*}\), where \(\text {f}:\{0,1\}^{*}\rightarrow (\text {f}_{1},\text {f}_{2})\). For every pair of inputs (xy), the output is a vector of random variables \((\text {f}_{P_{1}}(x,y),\text {f}_{P_{2}}(x,y))\) where \(\text {f}(P_{1})\) and \(\text {f}(P_{2})\) are received by entities \(P_{1}\) and \(P_{2}\), respectively. The same process is also sometimes represented as \((x,y)\leftarrow (\text {f}_{P_{1}}(x,y),\,\text {f}_{P_{2}}(x,y))\) where the entity \(P_{1}\) is the CSP and \(P_{2}\) is the DO. Unlike the general two-party problem defined earlier, in the secure cloud storage problem considered here, only \(P_{2}\) receives the specified output from \(P_{1}\) such that \(P_{1}\) learns nothing about the input. \(P_{2}\) does not output anything back to \(P_{1}\). This is formalized as one-sided simulation by [36], and \(P_{1}\) is assumed to be corrupted.

The standard security simulation model formalizes security in a general way as ideal world and real world. In an ideal world, it is assumed that an external trusted [38] party helps the parties to carry out their computation securely. In an ideal world, the parties involved in the two-party computation just send their inputs to the trusted party, which computes the desired function and securely passes the correct prescribed output to each party, and each party receives output. Hence, the adversary can force only the corrupted parties to send their chosen inputs and cannot do anything else. Two important security properties hold in the ideal world.

  1. 1.

    Privacy: No party should learn anything more than its prescribed outputs. In particular, the only information that should be learned about other parties’ inputs is what can be derived from the output itself securely. For example, while searching for documents matching a keyword over a secure cloud storage, the only information revealed to CSP should be the document identifiers that match the search query criteria without giving any information about keywords contained in the documents or in the search query. Privacy holds in the ideal world, because the only message ever received by a party is its output and so it cannot learn anything else from what it did not receive.

  2. 2.

    Correctness: Each party is guaranteed that the output that it receives is correct. For example, while searching for documents matching a keyword over a secure cloud storage, the output should be the correct document identifiers that match the desired keywords, which should be guaranteed. In ideal world, correctness holds since the trusted party cannot be corrupted and thus it will always compute the function correctly.

The adversary considered in this security model is static malicious polynomial time such that honest parties remain honest throughout, while the corrupted parties remain corrupted. The adversary may arbitrarily deviate from the protocol specification. A detailed description of the execution in ideal and real model can be found in [38,39,40,40].

Let \(\mathtt {Real}_{\pi ,A(z),i}(x,y,k)\) denote the output of the honest party and the adversary \({\mathcal {A}}\) (controlling \(P_{i}\)) after a real execution of protocol \(\pi \) where \(P_{1}\) has input \(x,\,P_{2}\) has input \(y,\,{\mathcal {A}}\) has an auxiliary input z and the security parameter is k.

Let \(\mathtt {Ideal}_{f,S(z),i)}(x,y,k)\) be the analogous distribution in an ideal execution with a trusted party who computes \(\text {f}\) for the parties. Also, let \(\mathtt {View}{}_{\pi ,{\mathcal {A}}(z),i)}^{{\mathcal {A}}}(x,y,k)\) denote the view of the adversary after a real protocol execution of \(\pi _{V}\) as before. Then, the following definition describes the security model [26].

5.1a Computational indistinguishability:

Let the security parameter be denoted as k. A function \(\mu (.)\) is negligible if for every polynomial p(.) there exists a value N such that for all \(k>N\), \(\mu (k)<1/p(k)\) holds. Let \(X=\{X(a,k)\}_{k{{\in }}N,a{{\in }}{0,1}^{*}}\) and be the distribution ensembles. Then, it can be said that X and Y are computationally indistinguishable, denoted \(X\overset{c}{\equiv }Y\), if for every non-uniform distinguisher D there exists a negligible function \(\mu (.)\) such that for every \(a\in \{0,1\}^{*}\)

$$\begin{aligned} |Pr[D(X(a,k))=1]-Pr[D(Y(a,k))=1]|<\mu (k). \end{aligned}$$
(11)

5.1b Privacy:

Let \(\text {f}\) be a functionality where only \(P_{2}\) receives output. A protocol \(\pi _{V}\) is said to securely compute \(\text {f}\) with one-sided simulation if the following holds:

  1. 1.

    For every non-uniform probabilistic polynomial-time adversary \({\mathcal {A}}\) controlling \(P_{2}\) in the real model, there exists a non-uniform probabilistic expected polynomial-time adversary S for the ideal model, such that

    $$\begin{aligned} \{\mathtt {Real}_{\pi ,{\mathcal {A}}(z),2}(x,y,k)\}\overset{c}{\equiv }\{\mathtt {Ideal}_{f,S(z),2}(x,y,k)\} \end{aligned}$$
    (12)

    where \(|x|=|y|\), \(x,y,z\in \{0,1\}^{*},k\in N\).

  2. 2.

    For every non-uniform probabilistic polynomial-time adversary \({\mathcal {A}}\) controlling \(P_{1}\)

    $$\begin{aligned} \{\mathtt {View}_{\pi _{V,{\mathcal {A}}(z),1}}^{{\mathcal {A}}}(x,y,k)\}\overset{c}{\equiv }\{\mathtt {View}{}_{\pi _{V,{\mathcal {A}}(z),1}}^{{\mathcal {A}}}(x,y',k)\} \end{aligned}$$
    (13)

    where \(|x|=|y'|\), \(x,y,z\in \{0,1\}^{*},k\in N\).

Since it is a one-sided simulation and \(P_{1}\) is the only party assumed to be corrupted, proving the second part of Definition 2 is enough to guarantee privacy, meaning that \(P_{1}\) learns nothing whatsoever about \(P_{2}\)’s input.

5.1c Search pattern \((s_{p})\):

Let \(s_{p}\) denote the frequency of the keyword queries searched, which is found by checking the equality between two queries. Formally, let \({Q_{1},Q_{2},\ldots , Q_{n}}\) be a set of n consecutive queries, and search pattern \(s_{p}\) is an \(n\times n\) binary matrix where \(s_{p}(i,j)=1\) iff \(Q_{i}=Q_{j}\).

5.1d Access pattern \((a_{p})\):

Let \(a_{p}\) denote the access pattern and is a collection of tuples \(a_{i}=({\mathcal {F}}_{i},Q_{i})\) where \({\mathcal {F}}_{i}\) is a set of file identifiers returned for the query \(Q_{i}\).

5.1e History \((t_{n})\):

Let \(t_{n}\) denote the history given by \(t_{n}({\mathcal {D}},q\)) where q, given by\({Q_{1},Q_{2},\ldots , Q_{n}}\), and \({\mathcal {D}}\), given by \({d_{1},d_{2}{\dots },d_{n}}\), are the set of search queries and corresponding document identifiers returned, respectively.

5.1f Trace \(\gamma (t_{n})\):

The trace is the allowed information leaked to an adversary.

Let \(\gamma (t_{n})=\{(f_{1},\ldots , f_{n}),(|C_{1}|,\ldots ,|C_{n}|),t_{n},I_{{\mathcal {D}}},T_{w_{q}},Y_{w_{u}},{\mathcal {S}}\}\) where \(C_{i}=Enc(pk,d_{i})\) is the encryption of document \(d_{i},f_{i}=id(C_{i})\) is the file identifier of \(C_{i}\), \(|C_{i}|\) is the size of \(f_{i},t_{n}\) is the history, \(I_{{\mathcal {D}}}\) is the index, \(T(w_{q})\) is the trapdoor(s), \(Y(w_{u})\) is the update trapdoor(s) and \({\mathcal {L}}(w_{i})\) is the hash map data structure holding the per document index.

A secure protocol should guarantee that Definition 2 holds and no more information than the trace is leaked to an adversary.

5.2 Analysis

Theorems 7–9 prove independently the correctness of the protocol, and subsequently the privacy and correctness are proved in Theorem 10 with respect to the security model.

The scheme is said to return top-k documents correctly for a given encrypted binary data vector \({\mathcal {E}}_{{\mathcal {F}}_{q}}\), trapdoor \(T_{w_{q}}\) for a keyword \(w_{q}\in {\mathcal {W}}\) and score index \({\mathcal {S}}\) such that

$$\begin{aligned} Pr[{\mathcal {F}}_{qk}\leftarrow topk(k,sk_{,}{\mathcal {E}}_{{\mathcal {F}}_{q}},T{}_{w_{q}},{\mathcal {S}}):{\mathcal {F}}_{qk}={f_{i}}_{d_{j}\in {\mathcal {D}}:w_{q}\in d_{j}}^{i=1,..k}]=1. \end{aligned}$$
(14)

The top-k algorithm described in section 4.9 returns the top-k document denoted by \({\mathcal {F}}_{qk}\). The search algorithm inputs the binary vector index or binary vector such that the binary vector for an query keyword \(w_{q}\) contains \(windex_{q}[d_{j}]=1\) iff \(w_{q}\in d_{j}\), where \(j=1,\ldots ,|{\mathcal {D}}|\), while the top-k returns \({\mathcal {F}}_{qk}={f_{i}}_{d_{j}{{\in }}{\mathcal {D}}:w_{q}{{\in }}d_{j}}^{i=1,..k}\), the top-k document file identifiers \(f_{i}\) such that \({d_{j}\in {\mathcal {D}}:w_{q}\in d_{j}}\) and the value of \(i=1,\ldots ,k\). The encrypted binary vector whose bits are set satisfies the condition \({d_{j}\in {\mathcal {D}}:w_{q}\in d_{j}}\) and its retrieval is given by

$$\begin{aligned} {\mathcal {E}}_{{\mathcal {F}}_{q}}T_{w_{q}}=S_{q}=Enc(pk,{\mathcal {L}}(w_{q}))=Enc(pk,windex_{q}\Vert h_{w_{q}}). \end{aligned}$$
(15)

The vector \(windex_{q}\) contains the binary data vector as described in section 4.6, such that \(windex_{q}[x]=1\) if and only if \({d_{j}\in D:w_{q}\in d_{j}}\). Therefore, \(\forall x, \, windex_{q}[x]=1\), then the algorithm described in section 4.9 computes \({\mathcal {Q}}(d_{x})={\mathcal {S}}(d_{x})T_{w_{q}}\)where \({\mathcal {Q}}(d_{x})\) is the document score, and from section 4.5 it can be written as

$$\begin{aligned} {\mathcal {S}}(d_{x})=\sum _{i=1}^{|w_{i}\in d_{j}|}(v_{i}M_{i}{\mathcal {T}}(w_{i},d_{j}))+v_{r}r'. \end{aligned}$$
(16)

From section 4.7, \(T_{w_{q}}={v_{w_{q}}M_{w_{q}}^{-1}+v_{r}r'}\) where \(M_{w_{q}}^{-1}=Enc(pk,-w_{q},r_{q})\). Therefore, on substituting for \(T_{w_{q}}\), \({\mathcal {S}}\), applying the principle of orthogonality and expanding:

$$\begin{aligned} Q(d_{x})= & {} \mathcal {{\mathcal {S}}}(d_{x})T_{w_{q}}\nonumber \\= & {} \left( \sum _{i=1}^{|w_{i}\in d_{x}|}(v_{i}M_{i}{\mathcal {T}}(w_{i},d_{x}))+v_{r}r')(v_{w_{q}}M_{w_{q}}^{-1}+v_{r}r'\right) \nonumber \\= & {} (v_{i}M_{i}{\mathcal {T}}(w_{i},d_{x}))(v_{w_{q}}M_{w_{q}}^{-1}+v_{r}r^{'})+0\nonumber \\= & {} v_{q}v_{q}M_{w_{q}}M_{w_{q}}^{-1}{\mathcal {T}}(w_{q},d_{x})\nonumber \\= & {} g^{w_{q}}g^{-w_{q}}r_{q}^{n}(r_{q}^{n})^{-1}{\mathcal {T}}(w_{q},d_{x})\nonumber \\= & {} {\mathcal {T}}(w_{q},d_{x}). \end{aligned}$$
(17)

Hence, the score of the document \(d_{x}\) can be retrieved correctly. Similarly, for all other documents, \(\forall x \), \(windex_{q}[x]=1\), the top-k algorithm computes the score, sorts the score and returns the top-k scored document to the CSP. Hence, it can be said that the algorithm correctly returns the top-k documents, formally given by

$$\begin{aligned} Pr[{\mathcal {F}}_{qk}{\leftarrow }topk(k,sk_{,}{\mathcal {E}}_{{\mathcal {F}}_{q}},T{}_{w_{q}},{\mathcal {S}}):{\mathcal {F}}_{qk}={f_{i}}_{d_{j}{{\in }}{\mathcal {D}}:w_{q}{{\in }}d_{j}}^{i=1,..k}]=1. \end{aligned}$$
(18)

The scheme is said to provide correct verifiable search for a given trapdoor \(T_{w_{q}}\) for a keyword \(w_{q}\in {\mathcal {W}}\):

$$\begin{aligned} Pr[{\mathcal {F}}_{qk}{\leftarrow }Search(k,I_{{\mathcal {D}}},T_{w_{q}}):{\mathcal {F}}_{qk}={\bot }_{w_{q}\notin {\mathcal {W}}}or{f_{i}}_{w_{q}\in {\mathcal {W}}}]=1. \end{aligned}$$
(19)

The search algorithm operates under two cases: \(w_{q}\in {\mathcal {W}}\) and \(w_{q}\notin {\mathcal {W}}\).

5.2a Case \(\,w_{q}\notin {\mathcal {W}}\):

The search algorithm can be written as \({\mathcal {E}}_{{\mathcal {F}}_{q}}=I_{{\mathcal {D}}}T_{w_{q}}\)where \(w_{q}\) is the keyword searched with the trapdoor \(T_{w_{q}}\)while \({\mathcal {E}}_{{\mathcal {F}}_{q}}\) is the resultant encrypted binary vector, \(windex_{w_{q}}[d_{j}]=1:{d_{j}\in {\mathcal {D}}:w_{q}\in d_{j}}\). The index and the trapdoor can be written as

$$\begin{aligned} I_{{\mathcal {D}}}=\sum _{i=1}^{|{\mathcal {W}}|}(v_{i}M_{i}S_{i})+v_{r}r'; \,T_{w_{q}}=v_{q}M_{w_{q}}^{-1}+v_{r}r' \end{aligned}$$
(20)

where

$$\begin{aligned} M_{w_{q}}= & {} Enc(pk,w_{q},r_{q});\,M_{w_{q}}^{-1}=Enc(pk,-w_{q},r_{q}); \\ S_{q}= & {} Enc(pk,windex_{q}\Vert h_{w_{q}}). \end{aligned}$$

Substituting for \(I_{{\mathcal {D}}}\), \(T_{w_{q}}\) and expanding, the equation can be written as

$$\begin{aligned} {\mathcal {E}}_{{\mathcal {F}}_{q}}= & {} I_{{\mathcal {D}}}T_{w_{q}}\\= & {} \left( \sum _{i=1}^{|{\mathcal {W}}|}(v_{i}M_{i}S_{i})+v_{r}r')(v_{q}M_{w_{q}}^{-1}+v_{r}r'\right) \\= & {} 0\\= & {} \bot . \end{aligned}$$

Since \(v_{i}.v_{q}=1\) iff \(v_{i}=v_{q}\), by the principle of orthogonality, all vectors \(v_{i}\in {\mathcal {V}}\) are mutually orthogonal. Therefore, when a trapdoor \(T_{w_{q}}\) is generated for a keyword \(w_{q}\) such that \(w_{q}\notin {\mathcal {W}}\), then search algorithm outputs correctly null represented by \(\bot \).

5.2b Case \(\,w_{q}\in {\mathcal {W}}\):

Similar to the previous case, the search algorithm for \(w_{q}\in {\mathcal {W}}\) is given as follows:

$$\begin{aligned} {\mathcal {E}}_{{\mathcal {F}}_{q}}= & {} I_{D}T_{w_{q}}\\= & {} \left( \sum _{i=1}^{|{\mathcal {W}}|}(v_{i}M_{i}S_{i})+v_{r}r')(v_{q}M_{w_{q}}^{-1}+v_{r}r'\right) \\= & {} (v_{1}M_{1}S_{1}+\cdot \cdot \cdot +v_{|{\mathcal {W}}|}M_{|{\mathcal {W}}|}S_{|{\mathcal {W}}|})(v_{q}M_{w_{q}}^{-1}+v_{r}r')\\= & {} (v_{q}M_{q}S_{q})(v_{q}M_{w_{q}}^{-1}+v_{r}r')\\= & {} (v_{q}v_{q}M_{q}M_{w_{q}}^{-1}S_{q})+(v_{q}v_{r}M_{q}M_{w_{q}}^{-1}S_{q}r')\\= & {} (v_{q}v_{q}M_{q}M_{w_{q}}^{-1}S_{q})\\= & {} g^{w_{q}}g^{-w_{q}}r_{q}r_{q}^{-1}S_{q}\\= & {} S_{q}\\= & {} Enc(pk,windex_{q}\Vert h_{w_{q}}). \end{aligned}$$

Thus, the encrypted binary vector \({\mathcal {E}}_{{\mathcal {F}}_{q}}\) is retrieved correctly for the keyword \(w_{q}\) searched when \(w_{q}\in {\mathcal {W}}\). The search algorithm then sends \({\mathcal {E}}_{{\mathcal {F}}_{q}}\) to the SCP for retrieval of the top-k document identifiers. From Theorem 7, it was proved that given \({\mathcal {E}}_{{\mathcal {F}}_{q}}\), the top-k algorithm returns the top-k documents correctly. Also, the hash of the binary vector for every keyword \(h_{w_{q}}={\mathcal {H}}(sk,windex_{q})\) being concatenated with the binary vector itself, the CSP cannot forge \(S_{q}\) sent to the SCP. Moreover, from the top-k algorithm it can be seen that \({\mathcal {F}}_{qk}={\mathcal {F}}_{qk}^{'}||h_{{\mathcal {F}}_{qk}}\), and the set of top-k document identifiers also contains the hash of the top-k document identifiers sent to the user. Thus, \(S_{q}\) computed by CSP and \({\mathcal {F}}_{qk}\) computed by the SCP and returned to the user by CSP are verifiable. Therefore, it is proved that the VSED’s search algorithm is said to provide correct verifiable search for a given trapdoor \(T_{w_{q}}\) for a keyword given by

$$\begin{aligned} Pr[{\mathcal {F}}_{qk}\leftarrow Search(k,I_{{\mathcal {D}}},T_{w_{q}}):{\mathcal {F}}_{qk}={\bot }_{w_{q}\notin {\mathcal {W}}}\,or\,{f_{i}}_{w_{q}\in {\mathcal {W}}}]=1. \end{aligned}$$
(21)

The scheme is said to provide correct dynamic index update for a given update trapdoor \(Y_{w_{q}}\).

The index update uses an update trapdoor to make dynamic updates. The index and the score index are updated during addition of new keyword to a document or deletion of a keyword from a document. For both cases, update trapdoor is generated. The process of adding a keyword to the encrypted index uses simple addition and deletion of vectors represented as \(I_{{\mathcal {D}}}=I_{{\mathcal {D}}}-Y_{w_{u}}+Y_{w_{u}}^{'}\), where the old sub-index is denoted by \(Y_{w_{u}}=v_{u}M_{u}S_{u}\) and the new sub-index that needs to be updated in the index is denoted by \(Y_{w_{u}}^{'}=v_{u}M_{u}S_{u}^{'}\). Therefore, it can be written as

$$\begin{aligned} I_{{\mathcal {D}}}= & {} I_{{\mathcal {D}}}-Y_{w_{u}}+Y_{w_{u}}^{'}\\= & {} \left( \sum _{i=1}^{|{\mathcal {W}}|}(v_{i}M_{i}S_{i})+v_{r}r\right) -v_{u}M_{u}S_{u}+v_{u}M_{u}S_{u}^{'}\\= & {} (v_{1}M_{1}S_{1}+{{\cdots }}+v_{|{\mathcal {W}}|}M_{|{\mathcal {W}}|}S_{|{\mathcal {W}}|})-Y_{w_{u}}+Y_{w_{u}}^{'}\\= & {} (v_{1}M_{1}S_{1}+{{\cdots }}+v_{u}M_{u}S_{u}^{'}+{{\cdots }}+v_{|{\mathcal {W}}|}M_{|{\mathcal {W}}|}S_{|{\mathcal {W}}|}). \end{aligned}$$

From these equations, it can be seen that the new sub-index vector is updated in the index \(I_{{\mathcal {D}}}\) correctly. Similar reasoning can also be applied to the case when a keyword is to be deleted. Hence, it can be said that the scheme is said to provide correct dynamic index update for a given update trapdoor \(Y_{w_{q}}\).

The Paillier encryption function is a homomorphic and semantically secure cryptosystem and its security analysis can be found in [35].

Assume that the Paillier cryptosystem is semantically secure and then the protocol \(\pi _{V}\) computes securely in the presence of non-colluding static malicious adversaries in accordance with Definition 2.

Definition 2 states that a protocol securely computes in the presence of non-colluding static malicious adversaries if the view of the adversary when given y and view of the adversary when given \(y'\) are computationally indistinguishable. The definition can thus be written as

$$\begin{aligned} \{\mathtt {View}{}_{\pi _{V,{\mathcal {A}}(z),1}}^{{\mathcal {A}}}(x,y,k)\}\overset{c}{\equiv }\{\mathtt {View}{}_{\pi _{V,{\mathcal {A}}(z),1}}^{{\mathcal {A}}}(x,y',k)\} \end{aligned}$$
(22)

where |\(x|=|y|=|y^{'}\)| and \(x,y,z\epsilon \{0,1\}^{*},k\in N\).

5.2c Game:

Assume the existence of an encryption oracle as defined by [41]. Assume a challenger who generates a key pair (pksk) in accordance with and based on the Paillier cryptosystem key set-up instructions. The challenger publishes the key pk but keeps sk private. Adversary \({\mathcal {A}}\) is given access to an encryption oracle (encryption service), namely \({\mathcal {E}}\). The oracle can perform a polynomial bounded number of encryptions or other operations. \({\mathcal {A}}\) then chooses two plain-texts (or keywords) \(m_{0}=y\) and \(m_{1}=y'\) and submits to the challenger. The challenger selects a bit \(a\in {0,1}\) uniformly at random, encrypts \(m_{a}\) as \(C=Enc(pk,m_{a})\) and sends C to \({\mathcal {A}}\). The adversary may then perform any number of additional operations or encryptions using the encryption oracle. \({\mathcal {A}}\) then outputs its guess for the value of a. From Definition 1, for computational indistinguishability

$$\begin{aligned} |Pr[D(X(a,k))=1]-Pr[D(Y(a,k))=1]|<\mu (k). \end{aligned}$$
(23)

From [44], it can be said that Paillier encryption mechanism is semantically secure. Therefore, the advantage of \({\mathcal {A}}\) in playing Game is negligible or utmost \(\mu (k)\), which means \({\mathcal {A}}\) can distinguish between Enc(pky) and \(Enc(pk,y^{'})\) though \({\mathcal {A}}\) knows the plaintext y and \(y'\) with only negligible probability \(\mu (k)\). Hence, message encrypted with Enc(., .) is computationally indistinguishable. The parameters defined such as trapdoor \(T_{w_{q}}\), search pattern \(s_{p}\), access pattern \(a_{p}\), history \(t_{n}\), trace \(\gamma (t_{n})\) including sub-index \({\mathcal {L}}\) and document score index \({\mathcal {S}}\) are encrypted using Enc(., .), making them computationally indistinguishable too.

Generally, in a real world, the parties run some protocol among themselves without the help of the trusted party, unlike the ideal world. The real world protocol executed by the parties without the trusted party is said to be secure if no adversary can do more harm in a real world protocol execution than in an execution that takes place in the ideal world. In an ideal/real simulation paradigm for any adversary carrying out a successful attack in the real world, there exists an adversary that successfully carries out the same attack in the ideal world. However, this is contradictory since successful adversarial attacks cannot be carried out in the ideal world. Therefore, it is concluded that all adversarial attacks on protocol execution in the real world must also fail. The security of the real world protocol or real protocol is established by comparing the output of a real protocol execution to the output of an ideal computation. A real protocol execution is said to emulate ideal protocol execution if the input/output distributions of the adversary and the participating parties in the real and ideal executions are essentially the same.

Hence, the parameters that constitute the view of the adversary in Definition 2 are computationally indistinguishable as given here:

$$\begin{aligned} \{\mathtt {View}{}_{\pi _{V,{\mathcal {A}}(z),1}}^{{\mathcal {A}}}(x,y,k)\}\overset{c}{\equiv }\{\mathtt {View}{}_{\pi _{V,{\mathcal {A}}(z),1}}^{{\mathcal {A}}}(x,y',k)\}. \end{aligned}$$
(24)

Also, the malicious adversaries are assumed to be non-colluding since \(P_{1}\) and \(P_{3}\) compute the top-k documents among themselves. \(P_{3}\) is by design embedded with the private key sk to decrypt the encrypted binary vector \(windex_{x}\). \(P_{3}\) leaks the tuple (document, score) but not the respective keyword as the keyword is encrypted within the trapdoor, which is computationally indistinguishable. Moreover, it has to be noted that \(P_{3}\) does not have in its possession the encrypted documents. Thus, \(P_{1}\) knows the set of documents or top-k documents returned for a query like any other mechanisms leaking access pattern and does not learn anything more than that.

Hence, as long as the views of \({\mathcal {A}}\) for y and y’ are computationally indistinguishable, the protocol \(\pi _{V}\) is said to securely compute in the presence of non-colluding malicious adversaries. This guarantees privacy and correctness of the protocol as per Definition 2.

6 Performance analysis

VSED is implemented usng Java in a Windows server with Intel Xeon Processor (2.30 GHz) using the real-world Enron [42] e-mail dataset. The performance of VSED is compared to that of multi-keyword ranked searchable encryption (MRSE) proposed by Cao et al [43]. VSED is evaluated for the efficiency of index construction, trapdoor generation, search and update of index. The efficiency comparison is given in table 1 and feature comparison is given in table 2.

6.1 Index construction

The index construction, a one-time operation performed by the DO, is described in sections 4.5 and 4.6. The time cost for encrypting the binary vector \({\mathcal {L}}(w_{i})\) and the keyword \(M_{i}\) is dependent on the size of the dictionary \(|{\mathcal {W}}|\). The time cost for building the index \(I_{{\mathcal {D}}}\) is equal to the number of documents \(|{\mathcal {D}}|\) in the document set and the number of keywords in the dictionary \(|{\mathcal {W}}|\). In MRSE, the major computation of the index involves the splitting process and multiplication between two \((n+2)\times (n+2)\) matrix and a \((n+2)\) dimension vector where \(n=|{\mathcal {W}}|\), the number of keywords in the document set. The size of the sub-index is linear to the size of the data vector. Therefore, the time complexity of MRSE is O\((|{\mathcal {D}}||{\mathcal {W}}|^{2})\) and storage complexity is O\((|{\mathcal {D}}||{\mathcal {W}}|)\). The time complexity of VSED is O\((|{\mathcal {D}}||{\mathcal {W}}|)\) and the storage complexity is O\((|{\mathcal {D}}|+|{\mathcal {W}}|)\) (\(|{\mathcal {D}}|)\) towards score index and \(|{\mathcal {W}}|\) for the index). Figure 4 shows the time cost for building the index \(I_{{\mathcal {D}}}\) by varying the documents and size of the dictionary \(|{\mathcal {W}}|=4000\). The time cost is linear with the size of the document set \(|{\mathcal {D}}|\). Figure 5 shows the dependence of the index \(I_{{\mathcal {D}}}\) on the dictionary. The number of keywords in the dictionary has comparatively less influence than the number of documents in the document set.

Figure 4
figure 4

BuildIndex time vs #documents.

Figure 5
figure 5

BuildIndex time vs #keywords.

Table 1 Comparison of efficiency with other schemes.
Table 2 Features comparison.

6.2 Trapdoor generation

The trapdoor \(T_{w_{q}}\) for VSED is generated by the DO using the keyword of interest or the query keyword \(w_{q}\). The computation involves a multiplication operation of the secret vector with \(M_{w_{q}}^{-1}\), where \(M_{w_{q}}^{-1}=Enc(pk,-w_{q},r_{q})\), and addition with a random secret vector, which is a very trivial operation. Figure 6 shows the time cost for trapdoor for \(|{\mathcal {D}}|=1000\) and \(Q=10\). The time cost for generation of trapdoors does not depend on the document set \(|{\mathcal {D}}|\) or on the keyword set \(|{\mathcal {W}}|.\) However, in MRSE, the size of the dictionary influences the time cost for trapdoor as shown in figure 6. The number of query keywords \(w_{q}\) influences the time cost of the trapdoor. Figure 7 shows time cost of building the trapdoor with \(|{\mathcal {D}}|=1000\) and \(|{\mathcal {W}}|=4000\) by varying the query keyword. The number of query keywords \(w_{q}\) is varied from 5 to 25 keywords. The trapdoor computation time for VSED is significantly less when compared with MRSE. Similar to index construction, MRSE incurs a splitting vector and two matrix multiplications. The time cost for trapdoor is O(1) for VSED and O\((|{\mathcal {W}}|^{2})\) for MRSE.

Figure 6
figure 6

Trapdoor time vs #keywords.

Figure 7
figure 7

Trapdoor time vs # query keywords.

Figure 8
figure 8

Search time vs #documents.

Figure 9
figure 9

Search time vs # query keywords.

Figure 10
figure 10

Time cost of update.

6.3 Search efficiency

The search operation is performed by the CS. The computation is a multiplication of the index \(I_{{\mathcal {D}}}\) that is with the CSP with the trapdoor \(T_{w_{q}}\). The computation carried by the CSP is the product of the index \(I_{{\mathcal {D}}}\) and the trapdoor \(T_{w}\). For top-k retrieval, at the SCP, the time cost for search includes the time to decrypt \({\mathcal {E}}_{{\mathcal {F}}_{q}}\), the file identifiers returned from the CSP, and ranking the retrieved document identifiers from the binary vector. The time complexity of search operation in VSED is constant time, i.e. O(1) without top-k and O\(({\mathcal {F}}_{q_{k}}(1+\log {\mathcal {F}}_{q_{k}}))\) with top-k. The complexity of MRSE is O\((|{\mathcal {D}}||{\mathcal {W}}|)\), as similarity scores for all the documents are computed. It can be inferred that for all practical cases, \({\mathcal {F}}_{q_{k}}\ll |{\mathcal {D}}|\) and \({\mathcal {F}}_{q_{k}}\ll |{\mathcal {W}}|\). The time cost for search by varying the document and varying the query for top-k search is shown, respectively, in figures 8 and 9. In figure 8, it can be seen that the number of documents have less influence on search as the operation \({\mathcal {E}}_{{\mathcal {F}}_{q}}=I_{{\mathcal {D}}}T_{w_{q}}\) is a simple multiplication operation. However, the time for search also includes the time for scoring and sorting the documents identifiers returned by the CSP at the SCP. Figure 9 shows the time cost by keeping the documents constant \(|{\mathcal {D}}|=1000,\) dictionary \(|{\mathcal {W}}|=4000\) constant and varying the query keywords. The number of query keywords has less influence on VSED and has no influence in MRSE. In MRSE, the search algorithm consists of finding the similarity scores ranking for all the documents. Thus the query time is dependent on the number of documents and the number of keywords has less influence. In VSED, if the user wishes for top-k, the k value along with the file identifier is sent to the SCP to compute the ranking. Therefore, VSED cannot be compromised by timing-based side channel attacks that try to differentiate queries based on query time.

6.4 Update efficiency

In order to update the document, the DO modifies the binary vector of the keyword that requires update. The trapdoor for update is sent to the CS, which is similar to the trapdoor function. The CS updates the index by removing the old binary index vector and adds the new binary vector; therefore, the time complexity is O\((|{\mathcal {D}}||{\mathcal {W}}|)\). Figure 10 shows the time cost of update with fixed number of keywords in the dictionary \(|{\mathcal {W}}|=4000\) for query \(|Q|=10\) and varying the documents.

7 Conclusion

In this paper we have presented a secure and verifiable top-k ranked search over encrypted cloud data with support for dynamic addition and deletion of documents. The proposed scheme uses an inverted index and a secret orthogonal vector to achieve better search and update efficiency. The security analysis demonstrates that VSED is semantically secure with completeness and correctness guarantees. The experimental evaluation indicates that the proposed the construction is efficient in terms of time and storage complexity. In the future, we will modify the proposed scheme to work efficiently and securely even without the presence of an SCP.