1 Introduction

In recent years, P2P networks and applications have developed rapidly, but the limited local storage space of users has become a bottleneck for their development. With the unique advantages of CSS, more and more users store their data on cloud servers. This reduces local storage and computational overhead and has also enabled the rapid development of searchable encryption (SE) technology. The efficiency of an SE scheme is affected by factors such as the index structure, the trapdoors and the encryption methods. Since Song et al. [21] first proposed symmetric searchable encryption (SSE), SE technology has made tremendous progress in improving search efficiency. At the same time, the cloud servers in most current SE schemes are assumed to be honest-but-curious, whereas in reality a malicious server may perform illegal operations on the ciphertexts it stores. Researchers have therefore proposed verifiable searchable encryption (VSE) schemes. However, most existing VSE schemes only support static databases or rely on specific SE structures. Meanwhile, for most existing searchable encryption schemes that support dynamic updates, the update operations are performed on the cloud server, so forward and backward security is essential for dynamic searchable encryption (DSE) schemes. They are defined as follows:

  1. Forward privacy: If we search for a keyword and later add a new document containing that keyword, the cloud server does not learn that the new document contains a keyword we searched for in the past.

  2. Backward privacy: New queries cannot be executed over deleted documents.

In this paper, we propose a dynamic and verifiable multi-keyword ranked search scheme with forward and backward security for the P2P networking environment (DVMRS). The contributions of DVMRS can be summarized as follows:

  1. DVMRS realizes forward and backward security on top of the traditional inner product and secure kNN algorithms [16, 17, 28], which has been a challenging issue for past DSE schemes.

  2. DVMRS combines a Merkle tree with a time-stamp chain to verify the integrity and freshness of search results over dynamic databases.

We organize this paper as follows. In Section 2, we review related work on SE. In Section 3, we introduce the preliminaries, and the design goals, system model and security model are described in Section 4. We present our scheme in detail in Section 5. We then carry out the performance and security analysis in Section 6 and Section 7, respectively. Finally, we conclude and discuss future work in Section 8.

2 Related work

Song et al. [21] first proposed a searchable symmetric encryption scheme using sequential scanning for single-keyword search on encrypted data, which was proven secure. Then Boneh et al. [4] proposed a public-key searchable encryption scheme. Based on these two schemes, research on searchable encryption has made great progress in recent years.

Multi-keyword search

After a large amount of research, the schemes in [2, 5, 10] realized conjunctive multi-keyword search. Boneh et al. [5] proposed a public-key conjunctive query scheme that also supports subset and range queries. Wang et al. [2] designed a public-key scheme based on an inverted index; this approach uses private set intersection to support conjunctive multi-keyword queries. Wu et al. [27] constructed an efficient multi-keyword public-key searchable encryption scheme based on an inverted encrypted index structure and homomorphic encryption.

Ranked search

Ranked search sorts the search results according to some measure of relevance, so that search users can find the most relevant results faster. Wang et al. [24] used inverted indexes and constructed an order-preserving symmetric encryption scheme, but it only supports single-keyword search. Cao et al. [19] first constructed a searchable encryption scheme (MRSE) that supports multi-keyword ranked search using secure inner product computation with low computing and communication overhead; however, this scheme ignored the different importance of keywords. Fu et al. [8] used a stemming algorithm, LSH and Bloom filters to build a scheme that supports ranked and fuzzy search at the same time. Sun et al. [22] extended the index structure to a multi-dimensional binary (MDB) tree based on Fu et al. [8]. Dai et al. [3] proposed a privacy-preserving multi-keyword ranked search scheme for encrypted data in a hybrid cloud, based on a binary tree and a keyword partition algorithm built on equally divided k-means clustering. Zhang et al. [33] modified the computation of the term frequency (TF) value by incorporating the word positions and the file length.

Dynamic search

In practice, the files stored on a cloud server are not immutable, so file updates should be considered. In Goh et al. [9], file update is implemented through a Bloom filter. Kamara et al. [14] constructed a new dynamic encrypted index to implement dynamic updates. The scheme proposed by Wang et al. [25] implements efficient dynamic indexing under homomorphic encryption and pseudo-random padding. Wan and Deng [23] utilized a bilinear-map accumulation tree to achieve an updatable scheme. Du et al. [6] put forward a dynamic multi-client searchable encryption scheme that supports Boolean queries by incorporating client authorization information into the query token and index.

Verifiable search

Kamara et al. [13] first proposed a VSSE scheme for static databases. Wan and Deng [23] proposed a verification approach based on homomorphic MAC. Wang et al. [26] proposed an efficient verifiable keyword searchable encryption scheme utilizing aggregate keys to save computational resources. Zhang et al. [33] combined a binary tree and a Merkle tree based on inner product computation to implement data integrity verification.

3 Preliminaries

  • Incremental hash (IH). Incremental hash was first proposed by Bellare et al. [1] and was used by numerous existing SE schemes. An IH function is a collision-resistant function:

    $$ IH:{\{ 0,1\}^{*}} \to {\{ 0,1\}^{l}} $$
    (1)

    such that adding or subtracting the IH values of two random strings will not produce a collision.

  • Merkle tree. The Merkle tree was first proposed by Kamara et al. [18]. The Merkle tree in DVMRS is built on top of the tree index. First, the IH value corresponding to the file at each leaf node of the tree is calculated. Then the IH value of each non-leaf node is calculated by hashing the concatenation (denoted ||) of the IH values of its left and right child nodes. Repeating this process yields the IH value of the root node rt. In Fig. 1, we demonstrate how to build a Merkle tree.

  • Time-stamp chain. The time-stamp chain is a chain consisting of fixed update time-stamps upi, data update time-stamps TP(i,j), and query time-stamps Ti, where a fixed update time-stamp is the point in time after a fixed update period, a data update time-stamp is the point in time when data are updated, and a query time-stamp is the point in time when a user issues a query. In Fig. 2, we depict the composition of the time-stamp chain in DVMRS.

  • Minimum hash sub-tree (MHS). The MHS is the part of the Merkle tree that we utilize to help compute the root IH value for the search results. Take Fig. 1 as an example and suppose the requested document is f5: the MHS returned consists of the hash22 node and its two child nodes. When a search user delivers a search query to the CSP, the CSP sends back the final search results along with the MHS and the corresponding auxiliary information, which are introduced in Section 5. A minimal sketch of these three structures is given after this list.
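The following is a minimal, self-contained Python sketch of how these building blocks could fit together, assuming an AdHash-style incremental hash; the helper names (ih, merkle_levels, mhs_proof, recompute_root) are illustrative and are not the paper's notation.

```python
# A minimal, illustrative sketch of the three structures above: an AdHash-style
# incremental hash (IH), a Merkle tree whose leaves are file IH values, and the
# minimum hash sub-tree (MHS) used to recompute the root for one requested file.
import hashlib

MOD = 1 << 256                                    # IH range {0,1}^l with l = 256

def h(data: bytes) -> int:
    return int.from_bytes(hashlib.sha256(data).digest(), "big")

def ih(blocks) -> int:
    """AdHash-style IH: sum of per-block hashes mod 2^l, so adding or removing
    a block is a single modular addition or subtraction."""
    acc = 0
    for i, b in enumerate(blocks):
        acc = (acc + h(i.to_bytes(8, "big") + b)) % MOD
    return acc

def node_hash(left: int, right: int) -> int:
    """IH value of a non-leaf node: hash of the concatenation (||) of its children."""
    return h(left.to_bytes(32, "big") + right.to_bytes(32, "big"))

def merkle_levels(leaves):
    """All levels of the Merkle tree, bottom-up (level 0 = leaf IH values)."""
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2:                          # duplicate the last node if odd
            cur = cur + [cur[-1]]
        levels.append([node_hash(cur[i], cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels

def mhs_proof(levels, idx):
    """Sibling IH values needed to recompute the root for leaf `idx`
    (the minimum hash sub-tree for a single requested file)."""
    proof = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sib = idx ^ 1
        proof.append((level[sib], sib < idx))     # (sibling value, sibling-on-the-left?)
        idx //= 2
    return proof

def recompute_root(leaf, proof):
    cur = leaf
    for sib, sib_is_left in proof:
        cur = node_hash(sib, cur) if sib_is_left else node_hash(cur, sib)
    return cur

# Example with eight files f1..f8; the requested document is f5, as in Fig. 1.
files = [[b"file-%d-block-%d" % (j, k) for k in range(3)] for j in range(1, 9)]
leaves = [ih(f) for f in files]
levels = merkle_levels(leaves)
rt = levels[-1][0]                                # root IH value rt
proof = mhs_proof(levels, 4)                      # MHS for f5 (index 4)
assert recompute_root(leaves[4], proof) == rt     # data user re-derives the root
```

Because the IH of a file is a modular sum of per-block hashes, updating one block of a file only changes its leaf by one modular addition and subtraction, after which the path to the root is recomputed as above.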

Fig. 1 Merkle tree

Fig. 2 Time-stamp chain

4 System design

4.1 Design goals

In order to realize the privacy protection, efficient search, multi-keyword ranked search and search results verification proposed in our scheme, we put forward the following design goals:

  • Search efficiency: The index structure of an SE scheme has a significant effect on search efficiency. In our scheme, a tree-based index is an effective way to achieve efficient search.

  • Dynamic update: Our scheme should support dynamic updating of data. When the data owner adds, deletes or updates local data, the same operations need to be executed on the data stored on the cloud server to keep the data consistent between the CSP and the data owner.

  • Privacy: In this scheme, we need to realize data privacy, index and query privacy, keyword privacy, trapdoor unlinkability, and forward and backward privacy. (1) Data Privacy. The CSP cannot deduce the corresponding plaintext by analyzing the stored ciphertext. (2) Index and Query Privacy. Both the query and the index are represented by vectors. The index vector contains information such as TF values, while the query vector contains information such as inverted document frequency (IDF) values, so their privacy must be protected. (3) Keyword Privacy. The CSP cannot distinguish the specific keywords. (4) Trapdoor Unlinkability. Trapdoors generated for the same query should be different, so that the CSP cannot link them. (5) Forward and Backward Privacy. In dynamic schemes, forward and backward privacy are indispensable requirements.

  • Results verification: In DVMRS, we need to verify the completeness, freshness and correctness of the search results. (1) Completeness. The CSP must return all matching search results. (2) Freshness. The results returned to the user should be up to date. (3) Correctness. The CSP cannot return search results containing irrelevant data to the users.

  • Multi-keyword search: Compared with single-keyword search, multi-keyword search better meets the search requirements of data users and achieves higher search efficiency.

4.2 System model

As shown in Fig. 3, DVMRS consists of three entities.

Fig. 3 System model

  • Data owner: The data owner uses a symmetric encryption algorithm to encrypt the local data and outsources it to the cloud service provider (CSP). Meanwhile, the data owner uses the data dictionary to build a tree-based index structure, which is transmitted to the CSP after encryption. The data owner sends the key that generates the search trapdoors and the symmetric key that encrypts the data to authorized users through a secret channel.

  • Cloud service provider (CSP): The CSP is an intermediate entity that stores the encrypted data and the corresponding indexes obtained from the data owner, and provides data access and search services to authorized data users. When an authorized user sends a trapdoor to the CSP, it returns the set of ciphertexts matching the trapdoor, together with the authenticator under the query time-stamp, to the data user for result verification.

  • Data users: An authorized user obtains the symmetric key and the secret key from the data owner. When a data user wants to search on the CSP, they first generate a search keyword set, then use the secret key to generate the corresponding trapdoor and send it to the CSP. The data user verifies the search results using the auxiliary information obtained from the CSP.

4.3 Security model

In the DVMRS model, the data users are authorized by the data owner to conduct searches, so we do not consider leakage from the data users. Meanwhile, communication between the data owner and the data users is conducted through a secret channel, so we do not need to consider leakage in the communication between them.

However, the CSP is considered curious and dishonest in this scheme: it may not follow the protocols, and it is curious about the data content, keywords and other additional information. DVMRS mainly considers two threat models:

  • Known Ciphertext Model: In this model, the CSP is only aware of encrypted information, which includes encrypted data, encrypted indexes and searching trapdoors.

  • Known Background Model: In this model, the CSP knows information beyond the encrypted information, such as document frequency and keyword frequency. This information can be used in statistical attacks to infer the keywords in query requests.

5 Scheme construction

5.1 Notations

Table 1 lists the symbols that need to be used in this scheme and the corresponding instructions.

Table 1 Notations summary

5.2 Algorithm construction

5.2.1 Data owner

The following algorithms are executed by the data owner:

  • KGen(1k) →{K1,K2,K3,ssk,spk}: This is a probabilistic algorithm run by the data owner. It takes a security parameter as input and outputs the secret keys and a random signing key pair (ssk,spk). K1 is used to encrypt the tree-based index, K2 is used to encrypt documents and K3 is used to generate the authenticator together with ssk. The secret key K1 = (S,M1,M2), where S is an (m + d + 1)-bit vector and M1 and M2 are two (m + d + 1) × (m + d + 1) invertible matrices. The vector S is a splitting indicator used to split the document vector into two random vectors, which hides the values of the document vector; M1 and M2 are used to encrypt the two split vectors.

  • Enc(F,K2) →{C}: The data owner uses secret key K2 to generate the ciphertexts of document set.

  • GenIndex(F,K1,K3,ssk) →{I,π}: This algorithm is divided into two sub-algorithms, BuildTree(F,K1) →{I} and GenMerkleTree(I,K3,ssk) →{π}. The data owner generates an m-bit vector Pj for each document Fj(j = 1,2,⋯,N), where each bit Pj[i] indicates the weight of keyword ti in Fj, i.e. Pj[i] = Value(ti,Fj). Then the data owner extends the vector Pj to an (m + d + 1)-bit vector \(P_{j}^{\prime } = {P_{j}} \mid \mid {V_{j}}\), where Vj is a (d + 1)-bit vector. To give the document collection a certain scalability, d should be set to the possible maximum size of the outsourced document collection. Assume the possible maximum similarity score between a search keyword set and a document is \({\max \limits _{s}}\); the data owner then chooses a random parameter g with \(g > {\max \limits _{s}}\). The vector Vj(j = 1,2,⋯,N) can be generated as shown in Algorithm 1, where g* represents an arbitrary integer multiple of g. Thus, Vj has the following form:

    $$ \begin{array}{@{}rcl@{}} {V_{1}} &=& ({g^{*}},{g^{*}}, {\cdots} ,{a_{1}},{g^{*}}, {\cdots} ,{g^{*}}, - {a_{1}})\\ {V_{2}} &=& ({g^{*}},{g^{*}},{a_{2}}, {\cdots} ,{g^{*}}, {\cdots} ,{g^{*}}, - {a_{2}})\\ {V_{3}} &=& ({g^{*}},{g^{*}}, {\cdots} ,{g^{*}},{a_{3}}, {\cdots} ,{g^{*}}, - {a_{3}}). \end{array} $$
    (2)
    Algorithm 1

    The vector \(P_{j}^{\prime }\) is encrypted with the secure kNN algorithm: the data owner uses the vector S to split \(P_{j}^{\prime }\) into two (m + d + 1)-dimensional vectors (pa,pb), where S functions as a splitting indicator. Namely, if S[i] = 0(i = 1,2,⋯,m + d + 1), pa[i] and pb[i] are both set to \({P_{j}^{\prime }}[i]\); if S[i] = 1(i = 1,2,⋯,m + d + 1), the value of \({P_{j}^{\prime }}[i]\) is randomly split into pa[i] and pb[i], i.e. \({P_{j}^{\prime }}[i]={p_{a}}[i]+{p_{b}}[i]\). Then the index of the encrypted document Cj is calculated as \(I = ({M_{1}^{T}}{p_{a}},{M_{2}^{T}}{p_{b}})\). Moreover, following the Merkle tree construction method in Section 3, the data owner builds the corresponding Merkle tree MI over the index tree. Finally, a key tuple \(({K_{1}},{K_{2}},\bar \sigma )\) is sent to the authenticated search users through a secure broadcast channel, where K2 is the symmetric key used to encrypt the documents outsourced to the CSP. The data owner publishes g and stores σ in her own storage space. Meanwhile, the data owner outputs the authenticator π, stores the tree index and the original authenticator locally, and then sends them to the CSP. A simplified sketch of the Vj padding appears after this list of algorithms.

  • PreUpdate(K1,K2,K3,ssk,f) →{τu,πi,j}: This algorithm is run by the data owner. It takes as input the symmetric keys K1, K2, K3, the secret signing key ssk and a file f to be updated, and outputs the update token τu and the new authenticator πi,j. The data owner sends τu and πi,j to the CSP.

  • \(Update(I,{\tau _{u}},\sigma ,\bar \sigma ) \to \{ {I^{\prime }},{\sigma ^{\prime }},{\bar \sigma ^{\prime }}\}\): This algorithm applies the update token. It takes the tree index, the update token τu, and the previous number sets σ and \(\bar \sigma \) as input, and outputs the new tree index and the updated number sets \({\sigma ^{\prime }}\) and \({\bar \sigma ^{\prime }}\). The Merkle tree is updated simultaneously. After this algorithm has been executed, the data owner delivers the new index and number sets to the CSP.
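As a concrete illustration of the padding step in GenIndex, the sketch below generates vectors of the form (2) under stated assumptions: the slot-assignment policy (slot j mod d), the ranges of a_j and of the g* multiples, and the function names are placeholders, not the paper's Algorithm 1 (whose listing appears only as a figure).

```python
# Illustrative sketch of one plausible way to generate the padding vectors V_j
# of (2): each document j is assigned a slot recorded in the number set sigma;
# V_j carries a random a_j in that slot, -a_j in the last position (d+1), and
# arbitrary multiples of g (g*) elsewhere.
import random

def gen_padding(num_docs: int, d: int, g: int, seed: int = 0):
    rng = random.Random(seed)
    sigma = {}                      # document id -> slot index holding a_j
    V = []
    for j in range(num_docs):
        s_j = j % d                 # slot for document j (placeholder policy)
        a_j = rng.randrange(1, g)   # random a_j in [1, g-1], never a multiple of g
        v = [g * rng.randrange(1, 10) for _ in range(d + 1)]   # g* entries
        v[s_j] = a_j                # a_j sits in the document's slot
        v[d] = -a_j                 # the (d+1)-th position holds -a_j
        sigma[j] = s_j
        V.append(v)
    return sigma, V

sigma, V = gen_padding(num_docs=3, d=5, g=1000)
# Under a query vector V' that selects exactly the slots in sigma-bar plus the
# (d+1)-th position (cf. GenTrapdoor), a_j and -a_j cancel and only multiples
# of g remain, so the score in (3) reduces to P_j . Q modulo g.
```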

5.2.2 CSP

The following algorithm is executed by the CSP. \(Search(I,TD) \to \{ R,{{\pi }_{q}^{t}},{\pi _{i,j}}\}\): This algorithm uses the index and the trapdoor to calculate similarity scores and obtain the top-k search results. After receiving the trapdoor TD from a data user, the CSP calculates the similarity score between TD and the index vector stored in each node to get the top-k relevant results. The similarity score is calculated as:

$$ \begin{array}{@{}rcl@{}} Score(\bar W,{F_{j}}) &=& I \cdot TD\\ &=& ({P}_{j}^{\prime} \cdot {Q_{\bar{W}}^{\prime}}) mod g\\ &=& ({p_{a}} \cdot {q_{a}} + {p_{b}} \cdot {q_{b}})mod g\\ &=& ({{M}_{1}^{T}}{p_{a}},{{M}_{2}^{T}}{p_{b}}) \!\cdot\! ({M}_{1}^{- 1}{q_{a}},{M}_{2}^{- 1}{q_{b}})mod g\\ &=& ({P_{j}} \cdot {Q_{\bar W}})mod g \end{array} $$
(3)

A larger score indicates that the corresponding document Fj is more relevant to the search keyword set \(\bar W\); the documents with the top scores are returned to the data user. Meanwhile, when the data user sends a request to the server, the CSP returns the authenticator \({\pi _{q}^{t}}\) for the query time together with the authenticator πi,j of the checkpoint.

Moreover, it should be noted that the similarity scores do not have to be calculated on every node one by one. When the weight of the corresponding keyword position on a non-leaf node is zero, the search on the child nodes of this node is abandoned directly, which decreases the communication and computational overhead.
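The following small numerical sketch illustrates the inner-product preservation behind (3), assuming the standard secure kNN construction. It omits the Vj/V′ padding and the mod-g reduction so that I · TD = Pj · Q is easy to check, and all names (split, is_trapdoor, ...) are illustrative.

```python
# A small numerical check of the inner-product preservation behind (3),
# assuming the usual secure kNN construction with key K1 = (S, M1, M2).
import numpy as np

rng = np.random.default_rng(1)
m = 6                                            # dictionary size (illustrative)

# Splitting indicator and two invertible matrices.
S = rng.integers(0, 2, size=m)
M1, M2 = rng.random((m, m)), rng.random((m, m))
while np.isclose(np.linalg.det(M1), 0) or np.isclose(np.linalg.det(M2), 0):
    M1, M2 = rng.random((m, m)), rng.random((m, m))

def split(v, is_trapdoor):
    """Split v into (va, vb) under S: index vectors are split where S[i] = 1,
    trapdoor vectors where S[i] = 0, as described in Section 5.2."""
    va = v.astype(float)
    vb = v.astype(float)
    for i in range(m):
        if (S[i] == 1) != is_trapdoor:           # split at this position
            r = rng.random()
            va[i], vb[i] = r, v[i] - r
    return va, vb

# Data owner: index I for a document vector P of keyword weights.
P = rng.integers(0, 5, size=m)
pa, pb = split(P, is_trapdoor=False)
I = (M1.T @ pa, M2.T @ pb)

# Data user: trapdoor TD for a 0/1 query vector Q.
Q = rng.integers(0, 2, size=m)
qa, qb = split(Q, is_trapdoor=True)
TD = (np.linalg.inv(M1) @ qa, np.linalg.inv(M2) @ qb)

# CSP: similarity score, computed without ever seeing P or Q in the clear.
score = I[0] @ TD[0] + I[1] @ TD[1]
assert np.isclose(score, P @ Q)                  # equals the plaintext inner product
```

Because the matrices cancel only inside the combined inner product, the CSP can rank documents by score without learning Pj or Q themselves.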

5.2.3 Data user

The following algorithms are executed by the data user:

  • \(GenTrapdoor(\bar W,{K_{1}},\bar \sigma ) \to \{ TD\}\): The data user generates the keyword set \(\bar W\) for searching and creates an m-bit vector \({{Q}_{\bar W}}\) according to \(\bar W\), where \({{Q}_{\bar W}}[i]\) indicates whether the i-th keyword wi of the dictionary is in \(\bar W\), i.e. \({{Q}_{\bar W}}[i] = 1\) means yes and \({{Q}_{\bar W}}[i] = 0\) means no. The data user then extends \({{Q}_{\bar W}}\) to an (m + d + 1)-bit vector \({{Q}_{\bar W}^{\prime }} = {{Q}_{\bar W}}||{V^{\prime }}\), where \({V^{\prime }}\) is a (d + 1)-bit vector generated as follows: \({V^{\prime }}[i] = 1\) for \(i \in \bar \sigma \cup \{ d + 1\}\) and \({V^{\prime }}[i] = 0\) otherwise. The data user splits \({Q}_{\bar W}^{\prime }\) into two (m + d + 1)-bit vectors (qa,qb): if S[i] = 0(i = 1,2,⋯,m + d + 1), the value of \({Q}_{\bar W}^{\prime }[i]\) is randomly split into qa[i] and qb[i]; if S[i] = 1(i = 1,2,⋯,m + d + 1), qa[i] and qb[i] are both set to \({{Q}_{\bar W}^{\prime }}[i]\). Thus the trapdoor is generated as \(TD = (M_{1}^{- 1}{q_{a}},M_{2}^{- 1}{q_{b}})\).

  • Dec(R,K2) →{PR}: The data user uses the secret key K2, transmitted from the data owner over a secure broadcast channel, to decrypt the encrypted results R and obtain the plaintext results PR.

  • \(Check({K_{3}},spk,{{\pi }_{q}^{t}},{\pi _{i,j}}) \to \{ b_{1}\}\): The data user runs this algorithm to check the correctness of the authenticator obtained at query time t. It takes the symmetric key K3, the public key spk of the signing key pair, and the authenticators \({{\pi }_{q}^{t}}\) and πi,j as input, and outputs a bit b1 that represents acceptance or rejection. The detailed process of this algorithm is shown in Algorithm 2:

    Algorithm 2
  • \(Verify({K_{1}},{K_{3}},spk,R,{\pi _{q}^{t}},aux) \to \{ b_{2}\}\): The data user runs this algorithm, which takes as input the symmetric keys K1 and K3, the public key spk of the signing key pair, the search results R, the authenticator \({\pi _{q}^{t}}\) and the corresponding auxiliary information aux returned by the CSP. It outputs a bit b2 that indicates whether verification passed. The verification process for the search results is shown in Algorithm 3:

Algorithm 3

5.3 Detailed process

5.3.1 Authenticator construction

We utilize the Merkle tree and the time-stamp chain to design the authenticator. There are two challenges in verifying search results.

The first is how to design an efficient generic proof index that supports not only data integrity verification but also data updates. In DVMRS we build and maintain such a proof index by leveraging the full Merkle tree and IH values.

The second is how to ensure data freshness by preventing the root from being replayed by a malicious server in the context of data updates. To solve this problem, we add a time-stamp to the authenticator; the time-stamp lies on the time-stamp chain known to all three parties, which ensures that each authenticator has a time identity that cannot be tampered with.

Combined with the solutions to the above two problems, we design the authenticator used in the scheme as follows:

$$ \left\{ \begin{array}{ll} {\pi_{i,0}} = ({\alpha_{i,0}},Si{g_{ssk}}({\alpha_{i,0}})),(u{p_{i}} < t{p_{i,0}} \le u{p_{i + 1}})\\ {\alpha_{i,0}} = En{c_{{K_{3}}}}(r{t_{i,0}}||t{p_{i,0}})\\ {\cdots} \\ {\pi_{i,j}} = ({\alpha_{i,j}},Si{g_{ssk}}({\alpha_{i,j}})),(t{p_{i,j - 1}} < t{p_{i,j}} \le u{p_{i + 1}})\\ {\alpha_{i,j}} = En{c_{{K_{3}}}}(r{t_{i,j}}||t{p_{i,j}}||{\alpha_{i,j - 1}})\\ {\cdots} \\ {\pi_{i,n}} = ({\alpha_{i,n}},Si{g_{ssk}}({\alpha_{i,n}})),(t{p_{i,n}} = u{p_{i + 1}})\\ {\alpha_{i,n}} = En{c_{{K_{3}}}}(r{t_{i,n}}||t{p_{i,n}}||{\alpha_{i,n - 1}}) \end{array} \right. $$
(4)
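A minimal sketch of the chain in (4) is given below. The real scheme uses a symmetric cipher EncK3 and a signature Sigssk verified with spk; here both are replaced by HMAC-based stand-ins purely to show how each authenticator binds the current root IH value rti,j, its time-stamp tpi,j and the previous αi,j−1. Keys and helper names are placeholders.

```python
# A toy sketch of the authenticator chain in (4); Enc_K3 and Sig_ssk are
# replaced by HMAC-based stand-ins, not the paper's instantiation.
import hmac, hashlib, time

K3 = b"symmetric-key-K3"                # placeholder for K3
SSK = b"signing-key-ssk"                # placeholder for ssk

def seal(key, msg):                     # stand-in for Enc_K3(.)
    return hmac.new(key, msg, hashlib.sha256).digest() + msg

def sign(key, msg):                     # stand-in for Sig_ssk(.)
    return hmac.new(key, msg, hashlib.sha256).digest()

def make_authenticator(rt, tp, prev_alpha=None):
    """pi_{i,j} = (alpha_{i,j}, Sig(alpha_{i,j})), where alpha_{i,j} binds the
    root IH value rt_{i,j}, the time-stamp tp_{i,j} and alpha_{i,j-1}."""
    payload = rt + b"|" + repr(tp).encode()
    if prev_alpha is not None:          # chain to the previous authenticator
        payload += b"|" + prev_alpha
    alpha = seal(K3, payload)
    return alpha, sign(SSK, alpha)

# Data owner: authenticator at tp_{i,0}, then a new one after a data update.
alpha0, sig0 = make_authenticator(b"root-ih-0", time.time())
alpha1, sig1 = make_authenticator(b"root-ih-1", time.time(), prev_alpha=alpha0)

# Data user: check the signature, then read the sealed root/time-stamp payload
# (with a real Enc_K3 the payload would be decrypted rather than sliced).
assert hmac.compare_digest(sign(SSK, alpha1), sig1)
payload = alpha1[32:]                   # b"root-ih-1|<tp>|<alpha_{i,0}>"
```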

The detailed process of constructing the authenticator is as follows:

  1. The data owner generates the IH values of the files.

  2. The data owner builds the Merkle tree from the IH values of the files.

  3. The data owner determines a time-stamp chain, which contains the fixed update time points of the authenticator and the length of the update interval.

  4. The data owner generates an original authenticator and the time-stamp chain, broadcasts them to the data users over the broadcast channel, and uploads them to the CSP. Note that we use the Network Time Protocol running on the CSP to synchronize the clocks of the data owner and the data users during their interactions with the servers.

In DVMRS, the update of the authenticator is divided into two cases:

The first case is that the Merkle tree is not updated during an update interval; the data owner then only needs to update the time-stamp at the next fixed time point.

The second case is that the document set of the data owner is modified within an update interval, which means the root of the Merkle tree has been updated. The data owner then calculates a new authenticator using the latest root IH value and the current time-stamp, and the CSP updates its authenticator simultaneously.

5.3.2 Results verification

The process by which data users verify the search results is divided into two stages:

  1. To prevent a malicious CSP from returning an incorrect authenticator to the user, we design the Check algorithm to check the freshness and correctness of the authenticator; it is executed by the data user and verifies whether the authenticator has been replayed.

  2. We use the Verify algorithm to verify the search results by leveraging the root hash value extracted from the authenticator and the minimum hash sub-tree extracted from the Merkle tree together with the search results.

To prevent the CSP from replaying previous authenticators and to ensure the freshness of the root hash value, we apply the time-stamp chain: data users can trace authenticators along the chain and identify whether the root hash value is fresh.

In this setting, the CSP needs to provide an authenticator at the query time and, in addition, an authenticator at the checkpoint, where the checkpoint refers to the next fixed update time point after the query time t.

When data users search for files with keywords from the keyword dictionary, there are three cases at different query times on the time-stamp chain with the same keywords:

  1. The first case is that the query occurs at t1; the malicious CSP can only send πi−1,n to the data user.

  2. The second case is that the query occurs at t2 after the data update at tpi,0, and the authenticator that the malicious CSP sends to the user is πi,0.

  3. The last case is that the query is generated at t2, but the authenticator sent by the malicious CSP is πi−1,n. In this case, a data freshness attack occurs, but it will be detected at the checkpoint upi+1. The data user obtains the latest authenticator πi,1 from the malicious CSP at the checkpoint to verify whether the data obtained at the query time has been replayed.

Fig. 4 Precision test

6 Performance analysis

DVMRS improves the security of secure kNN-based searchable encryption. By changing the way the index tree is constructed and the trapdoor vector is generated, the scheme adds a number set to mark the data and thereby achieves forward and backward security. Verification of data integrity, freshness and correctness is achieved by combining the time-stamp chain with the Merkle tree.

6.1 Precision

In the search accuracy analysis, we use the following formula to define the precision:

$$ Precision = \frac{{{k^{\prime}}}}{k} $$
(5)

where \(k^{\prime }\) is the number of real top-k documents contained in the returned results and k is the number of returned documents.

We compare DVMRS with the schemes of Xia et al. [16] and Zhang et al. [17]. The accuracy of DVMRS is slightly higher than that of Xia et al. [16], while the accuracy of Zhang et al. [17] is slightly higher than both DVMRS and Xia et al. [16]; this is due to its modified TF × IDF inner product computation, which assigns different weights to keywords at different positions in the data and thereby achieves higher search accuracy. However, DVMRS realizes the forward and backward security that the other schemes do not, and additionally introduces the time-stamp chain combined with the Merkle tree to verify the search results. The precision comparison of the three schemes is shown in Fig. 4.

To ensure the fairness of the comparison, the same parameters and the same data set are used in the simulations.

The data set used in the simulation contains 3026 documents and 1789 keywords extracted from the documents. It can be seen from the simulation results that the search accuracy of the three schemes shows no obvious fluctuation as the number of query keywords varies from 50 to 500. As can be seen from Fig. 4, DVMRS maintains high search accuracy while achieving forward and backward security.

Fig. 5 Build index (a)

6.2 Index construction

In DVMRS and the two schemes used for comparison, index construction is divided into two stages, building the tree structure and encryption, and the TF × IDF model and the secure kNN algorithm are used in both Zhang et al. [17] and Xia et al. [16]. In the encryption stage, the two comparison schemes extend the n-bit original vector to (n + U + 1) bits, where U is a random number. Note, however, that U is a random number within a certain range: when a file contains more keywords, the range of U should be increased accordingly, otherwise its randomization provides little security.

In DVMRS, to achieve forward and backward security, we modify the vector padding method and extend the original n-bit vector to (n + d + 1) bits, where d is the maximum size of the outsourced document set. In this way, the larger the document set, the more bits the vector occupies; although this adds a small amount of computational overhead, the security is stronger. It should also be noted that, to ensure keyword security, the two comparison schemes should likewise adjust the range of U according to the quantity of stored data.

In Fig. 5, the number of keywords in the dictionary is fixed to n = 750 and the number of documents varies from 500 to 3000. In Fig. 6, the number of documents is fixed to N = 1500 and the number of keywords in the dictionary varies from 250 to 1500.

Fig. 6 Build index (b)

6.3 Trapdoor generation

In DVMRS and the comparison schemes, the complexity of trapdoor generation is determined by the splitting vector and the secret matrices used in encryption, and it is also related to the total number of keywords extracted from the document set. When the total number of keywords in the dictionary is n, the complexity of generating a trapdoor is O(n^2). Meanwhile, the padding of the vector used in encryption is modified in DVMRS, so the time complexity of trapdoor generation is also affected by the number of documents, but it remains within an acceptable range.

We have simulated trapdoor generation many times, as shown in Figs. 7 and 8. In a practical CSP environment with a large amount of data, the gap in trapdoor generation time between the schemes therefore remains small.

In Fig. 7, the number of keywords in each query is fixed at 10 and the number of keywords in the dictionary varies from 500 to 2000. In Fig. 8, the number of keywords in the dictionary is fixed at n = 500 and the number of keywords in each query varies from 5 to 30.

Fig. 7 Trapdoor generation (a)

Fig. 8 Trapdoor generation (b)

6.4 Search efficiency

The search process of this scheme is as follows: (1) calculate the inner product of the relevant node vectors and the query vector in the tree; (2) return the documents with the highest similarity scores according to the number of documents to be returned. Therefore, the search time complexity is mainly affected by the number of nodes in the tree, i.e. it mainly depends on the number of documents. When the number of stored documents is N, the time complexity is O(log_2 N). It is also affected by the number of keywords in the dictionary. We compare DVMRS with the schemes of Zhang et al. [17] and Xia et al. [16], simulating different numbers of documents and different numbers of keywords. The simulation results are shown in Figs. 9 and 10.

Fig. 9 Search efficiency (a)

Fig. 10 Search efficiency (b)

In Fig. 9, the number of keywords in the dictionary is fixed to n = 500, the size of the search keyword set is m = 10, the number of returned documents is k = 30, and the number of documents varies from 500 to 3000. In Fig. 10, the number of documents is fixed to N = 1000, the size of the search keyword set is m = 10, and the number of returned documents is k = 30. The number of keywords in the dictionary varies from 250 to 1500.

7 Security analysis

7.1 Privacy protection

  1. Data privacy. The document set is encrypted locally by the data owner using symmetric encryption, and the ciphertexts are uploaded directly to the CSP after encryption. The data owner transfers the symmetric key to authorized users through the secret channel. The symmetric encryption used on the document set is proven to be semantically secure.

  2. Index and trapdoor privacy. In DVMRS, the security of the index and the trapdoor is based on the secure kNN algorithm. Even if the keyword sets of two files, or two search keyword sets, are the same, the resulting indexes or trapdoors are not, because the secure kNN algorithm is non-deterministic: the splitting vector S used to split the trapdoors and indexes and the two matrices M1, M2 are randomly generated. As long as the security of the encryption key K1 is ensured, the CSP cannot recover the underlying plaintext vectors from the trapdoors and indexes. This is proven secure under the Known Ciphertext Model.

  3. Trapdoor unlinkability. When constructing the vectors Vj(j = 1,2,⋯,N), DVMRS first selects a random parameter g (g > the maximum similarity score), and then selects g*, an arbitrary integer multiple of g, to construct Vj. The trapdoor vector is randomly split into two vectors and then encrypted, which protects the search pattern and makes different trapdoors, and even trapdoors for the same query, indistinguishable. Therefore, the similarity scores differ for each query and the CSP cannot distinguish them.

  4. Keyword privacy. A random number is used to randomize the similarity scores when the data owner encrypts the indexes, so keyword privacy is protected under the Known Background Model.

  5. Forward and backward security. Since the CSP is malicious in the context of DVMRS, it may keep historical information such as previous queries and deleted documents in its local storage.

    In this scenario, our goal is that newly inserted documents cannot be matched by previous queries and new queries cannot be executed on deleted documents. In DVMRS, a legitimate query computes similarity scores as follows:

    $$ \begin{array}{@{}rcl@{}} Score(\bar W,{F_{j}}) &=& ({P_{j}^{\prime}} \cdot {Q^{\prime}}) \bmod g\\ &=& ({P_{j}} \cdot Q + \sum\limits_{i = 1}^{d + 1} {{V_{j}}[i] \cdot {V^{\prime}}[i]} ) \bmod g\\ &=& ({P_{j}} \cdot Q + {a_{j}} + {g^{*}} - {a_{j}}) \bmod g\\ &=& {P_{j}} \cdot Q \end{array} $$
    (6)

    However, when previous queries (or new queries) “touch” new files (or deleted files), there will be:

    $$ \begin{array}{@{}rcl@{}} Score(\bar W,{F_{j}}) &=& ({P_{j}^{\prime}} \cdot {Q^{\prime}}) \bmod g\\ &=& ({P_{j}} \cdot Q + \sum\limits_{i = 1}^{d + 1} {{V_{j}}[i] \cdot {V^{\prime}}[i]} ) \bmod g\\ &=& ({P_{j}} \cdot Q + {g^{*}} - {a_{j}}) \bmod g\\ &=& ({P_{j}} \cdot Q - {a_{j}}) \bmod g \end{array} $$
    (7)

    Because the parameter aj is randomly generated, the correct similarity score cannot be calculated. Therefore, this scheme has forward and backward security. A numerical sketch of this effect is given below.
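The cancellation in (6) and the residual term in (7) can be checked numerically. The sketch below evaluates the padded vectors in the clear (the kNN encryption preserves these inner products); the parameters and the slot-assignment policy are illustrative placeholders.

```python
# Numerical sketch of (6) and (7): a legitimate query cancels a_j and -a_j
# modulo g, while a query that "touches" a deleted/new document does not.
import random

rng = random.Random(7)
g, d = 10_000, 4

def padding(slot, a):
    """V_j with a at `slot`, -a at the last position (d+1), and g* elsewhere."""
    v = [g * rng.randrange(1, 5) for _ in range(d + 1)]
    v[slot], v[d] = a, -a
    return v

def selector(valid_slots):
    """V' selects the slots of currently valid documents plus position d+1."""
    return [1 if (i in valid_slots or i == d) else 0 for i in range(d + 1)]

def score(P, V_j, Q, V_prime):
    inner = sum(p * q for p, q in zip(P, Q)) + sum(v * w for v, w in zip(V_j, V_prime))
    return inner % g

P, Q = [3, 0, 2, 5], [1, 0, 1, 0]                 # keyword weights and query, P.Q = 5
a_j = rng.randrange(1, g)

V_valid = padding(slot=1, a=a_j)                  # document whose slot is in sigma-bar
V_stale = padding(slot=2, a=a_j)                  # deleted/new document: slot not selected

print(score(P, V_valid, Q, selector({1})))        # 5, the true P.Q, as in (6)
print(score(P, V_stale, Q, selector({1})))        # (5 - a_j) mod g: garbage, as in (7)
```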

7.2 Security model

Known Ciphertext Model. In DVMRS, the CSP only knows the encrypted information, specifically the encrypted document set C, the encrypted index tree I, and the trapdoor TD. The adversary can only try to distinguish two files through their generated indexes and encrypted files. The vector representing a file has (m + d + 1) bits: the first m bits represent the weights of the keywords, and the last (d + 1) bits represent the dimensions of the extension vector V.

In the index generation stage, we first expand the vector Pj corresponding to each file to \({P}_{j}^{*}\), and then use the splitting vector to split \({P}_{j}^{*}\). When S[i] = 1, \({{P}_{j}^{*}}[i]\) is randomly divided into two values Pa[i] and Pb[i]. Assuming that the number of "1"s in the first m bits is μ1 and each dimension of the file vector is ηf bits, there are \({({2^{{\eta _{f}}}})^{{\mu _{1}}}} \cdot {({2^{{\eta _{f}}}})^{d}}\) possible combinations, while the two vectors are encrypted by matrices of (m + d + 1) × (m + d + 1) dimensions. Assuming that each element of the matrices is ηM bits, the two matrices have \({({2^{{\eta _{M}}}})^{{{(m + d + 1)}^{2}} \times 2}}\) possible values. Therefore, the probability that two files have the same index is calculated as follows:

$$ \begin{array}{@{}rcl@{}} {P_{d}} &=& \frac{1}{{{{({2^{{\eta_{f}}}})}^{{\mu_{1}}}} \cdot {{({2^{{\eta_{f}}}})}^{d}} \cdot {{({2^{{\eta_{M}}}})}^{{{(m + d + 1)}^{2}} \times 2}}}}\\ &=& \frac{1}{{{2^{{\mu_{1}}{\eta_{f}} + {\eta_{f}}d + 2{\eta_{M}}{{(m + d + 1)}^{2}}}}}} \end{array} $$
(8)

It can be seen from the above equation that the larger the parameters μ1, d, ηf and ηM are, the more difficult it is to distinguish two indexes, so the encrypted index is indistinguishable.

Known Background Model. In DVMRS, the CSP may obtain information other than the encrypted information, such as the file search frequency and the keyword search frequency. This information can be used in statistical attacks to infer the keywords in a query.

In DVMRS, the trapdoor is an (m + d + 1)-bit vector: the first m bits indicate whether the keyword at the corresponding position exists in the query, and the remaining (d + 1)-bit extension vector satisfies \({V^{\prime }}[i] = 1\) for \(i \in \bar \sigma \cup \{ d + 1\}\) and \({V^{\prime }}[i] = 0\) otherwise. Firstly, a random number r of ηr bits is used to extend the vector, which has \({2^{{\eta _{r}}}}\) possible values. Then the vector is divided into two vectors using the (m + d + 1)-bit splitting vector, in which there are μ0 "0"s. Assuming that each dimension of the first (m + d) bits of the vector is ηq bits, there are \({({2^{{\eta _{q}}}})^{{\mu _{0}}}}\) possibilities in total. Finally, the random matrices are used to encrypt the two query vectors. The probability that two trapdoors are the same is calculated as follows:

$$ {P_{d}} = \frac{1}{{{2^{{\eta_{r}}}} \cdot {{({2^{{\eta_{q}}}})}^{{\mu_{0}}}}}} $$
(9)

As can be seen from the above formula, when ηr, ηq and μ0 are set to larger values, the query vectors become indistinguishable.

8 Conclusion and future work

8.1 Conclusion

On the basis of a multi-keyword ranked search scheme built on the traditional inner product, we propose a searchable encryption scheme with forward and backward security and search result verification.

First, we realize forward and backward security by changing the construction of the document vector and introducing two number sets.

Moreover, to make the verification of the returned results more secure and effective, we combine the Merkle tree with the time-stamp chain to ensure data freshness and prevent a malicious server from returning a wrong authenticator. DVMRS is also extended to support dynamic updates.

Finally, we analyze the performance and security of the proposed scheme. In the security analysis, we prove security under the two threat models proposed in the paper (Table 2).

Table 2 Scheme comparison

The work in this article still leaves room for improvement. In the DVMRS model, the data owner sends the authenticator to the users through a secure broadcast channel, and also informs all users through this channel whenever the data is updated.

8.2 Future work

Above all, with the rapid development of machine learning (ML) and federated learning (FL), the security and privacy of the data used in them also need to be guaranteed [11, 30,31,32]. The combination of SE technology with ML and FL will be a research focus in the future, and it is a problem that we will continue to study.

Moreover, a variety of novel computing modes such as edge computing and fog computing have been proposed in recent years. These computing modes have been used in the design of SE schemes [7, 15, 20, 29] due to their high efficiency and low latency. DVMRS could be combined with these computing modes to improve query efficiency and reduce communication latency.

Finally, blockchain technology has matured, and its applications in the context of the Internet of Things are increasing. Many desirable properties of the blockchain, such as decentralization, cryptographic protection and unchangeable transaction records, give it great value in privacy protection, authentication and key authorization for SE [12]. In subsequent work, we will also try to combine DVMRS with the blockchain to achieve fine-grained management of multi-user authority.