Keywords

1 Introduction

With the advent of cloud computing, the users with limited local resources do not need to purchase expensive hardware to support massive data storage. Thus, for economic savings, more and more individuals and enterprises engage cloud servers to maintain their data. However, users would lose control of outsourced data, which may leak some sensitive information, for example, in the cases where health records and private emails are hosted on cloud servers. Therefore, to protect data privacy, user data should be stored on the cloud server in ciphertext format.

For retrieving the interested data, users can request the cloud server to search over outsourced dataset with some specific keywords. However, the keywords may contain some sensitive information of outsourced data, which means these keywords cannot be presented to the cloud server in plaintext format, otherwise the users’ private information could be deduced by the cloud server. To address this issue, privacy-preserving keyword search has recently gained attention, and many solutions have been proposed [8, 10, 17, 26].

However, existing solutions require users to take heavy computations in both phases of data processing and search trapdoor generation. For example, in [4], users need to perform high-dimensional matrix operations, such as multiplications and inversion, where the matrix dimension is determined by the cardinality of the keyword set. While in [20], users have to compute many exponentiation operations for generating searchable indexes and search trapdoor. To deploy in the Internet of Things setting, existing works are not suitable since those heavy computation operations are not affordable by resource-constrained devices.

1.1 Our Contributions

To address the above issue, this paper proposes a light-weight scheme supporting privacy-preserving ranked multi-keyword search over outsourced data, where the search results are determined by the similarity score between the search query and the keyword index of outsourced data. Our contributions are summarized as follows.

  • The proposed scheme allows the user to search for outsourced data with multiple keywords, and the search results can be ranked so that the cloud server only needs to return the results satisfying the given threshold.

  • The proposed scheme can guarantee the privacy of searchable index of outsourced data and queries. That is, the cloud server cannot deduce any private information of outsourced data from the encrypted index and queries.

  • The proposed scheme can guarantee the unlinkability of search trapdoors. That is, for two search trapdoors submitted by the user, the cloud server is unable to identify whether they are generated for the same query.

  • In both data processing and query generation phases, the user only needs to take light-weight computation operations.

Performance analysis demonstrates that our scheme is much more efficient than existing solutions, thus it can be deployed in IoT setting to support resource-constrained devices.

1.2 Related Works

The first single keywords searchable encryption scheme over outsourced encrypted data was proposed by Song et al. [21] in the symmetric key setting. Subsequently, a lot of this type schemes [2, 6, 15, 23] were designed. David et al. [5] proposed a scheme supporting single-keyword boolean search over large outsourced dataset. However, the single keyword search mechanism cannot provide accurate search results. Since the cloud server usually stores massive data, there would be many match data satisfying the search condition of a single keyword, and most of the search results may have no relation with the expected data.

To support more sophisticated outsourcing search methods, many multi-keyword search schemes have been proposed [4, 9, 11, 14]. These schemes can allow the cloud server to return the most relevant data, thus, they are more practical than the single keyword search mechanism in supporting real-world applications. Multi-keyword search can allow complicated search conditions on outsourced data, for example, the works [1, 16] support multi-keyword search with fully homomorphic encryption, [11, 14] support conjunctive keyword search, [9] supports multi-keyword fuzzy search, and [4] supports ranked multi-keyword search.

In the public-key setting, the first searchable encryption scheme was proposed by Boneh et al. [3], where anyone can outsource encrypted data to the cloud server, but only the user holding the private key can issue search queries. Xu et al. [25] constructed a searchable public-key ciphertexts scheme with hidden structures to achieve fast search. Hu et al. [13] presented a public-key encryption scheme with keyword search from obfuscation, where the cloud server is given an obfuscated simple decrypt-then-compare circuit with the secret key to perform keyword search. Xu et al. [24] designed a public-key multi-keyword searchable encryption scheme with a hidden structures model, which also supports boolean search over encrypted e-mails. Wang et al. [22] proposed a tree-based public-key multi-dimensional range searchable encryption scheme from the predicate encryption method and leakage function.

To enrich the functionality of searching over remote data, various practical schemes have been designed. He and Ma [12] proposed a fuzzy search scheme over encrypted data using bloom filter. Zhang et al. [27] noticed that He and Ma’s proposal [12] cannot resist the sparse non-negative matrix factorization based attacks, and further presented a multi-keyword fuzzy search scheme using random redundancy method. Fu et al. [10] designed a semantic-aware search scheme, where both the index and search trapdoor contain two vectors.

Cao et al. [4] proposed an efficient multi-keyword ranked search scheme over encrypted cloud data, where coordinate matching was introduced to capture the relevance between data documents and the search query. In Raghavendra et al.’s solution [19], the index for keywords was generated using split factor, and to save computation overheads, the index tree was constructed to store keywords. Ren et al. [20] studied multi-keyword ranked search, where the search trapdoor is generated using a polynomial function. Ding et al. [7] constructed a keyword set using k-grams and Jaccard coefficient, and also built searchable index of small size. In Liu et al.’s scheme [17], the user is allowed to update the outsourced data and verify the search result.

1.3 Paper Organization

The remainder of this paper is organized as follows. Section 2 describes our system model, threat model, and design goals. Section 3 presents our scheme, which security and performance are evaluated and compared in Sect. 4. Section 5 concludes the paper.

2 Problem Formulation and Design Goals

2.1 System Model

As shown in Fig. 1, a multi-keyword search system consists of three types of entities, that is, data owner, data user and cloud server. There is a secure communication channel between data owner and data user. The data owner outsources a collection of documents to the cloud server. Since the documents may contain sensitive information, they cannot be directly uploaded to the cloud server. Thus, to protect the privacy of outsourced documents, they should be outsourced in ciphertext format.

Fig. 1.
figure 1

The system model

To facilitate data searching, the outsourced documents should be attached with a list of keywords. All keywords are contained in a keyword dictionary. To guarantee that the keywords cannot leak the privacy of outsourced documents, in the data processing phase, the data owner is able to produce an encrypted searchable index for each document. The searchable index is outsourced to the cloud server along with the document.

In the search phase, data user can generate a search trapdoor of its query vector with multiple keywords to enable the cloud server to search over outsourced documents. The keywords in the query vector are also contained in the keyword dictionary, which should be transformed into search trapdoor to protect the privacy of outsourced data. Upon receiving the search trapdoor, the cloud server computes the similarity score between each encrypted searchable index and the search trapdoor, and returns the document if its similarity score satisfies the given search threshold.

2.2 Threat Model

In the honest-but-curious model, the cloud server can perform multi-keyword search according to the user’s request, but it is curious about the sensitive information of outsourced documents. That is, the cloud server may try to deduce some information from the outsourced documents, ciphertext indexes, and search trapdoors. This paper assumes the adversary is able to launch known ciphertext attacks and known background attacks on outsourced documents.

Known ciphertext attack: The cloud server only knows some ciphertext information including encrypted documents, ciphertext indexes and search trapdoors. With these information, the cloud server aims to get the sensitive information of outsourced documents.

Known background attack: The cloud server may also know more background information of outsourced documents, such as statistic information of documents and relation of search trapdoors. These background information may leak the search pattern to the cloud server.

2.3 Design Goals

A secure multi-keyword ranked search scheme needs to satisfy the following requirements.

  • Data privacy: The cloud server should not be able to infer any information about outsourced documents.

  • Keyword privacy: The cloud server should not determine whether a specific keyword is relevant to a outsourced document according to encrypted document, encrypted index, search trapdoor and background knowledge.

  • Trapdoor unlinkability: The cloud server should not be able to identify whether two search trapdoors are generated from the same query.

  • Efficiency: Due to the limited computation capability of data owner and data user, the data processing and query generation phases cannot contain resource-intensive computations.

3 Concrete Construction

This section introduces a light-weight and privacy-preserving multi-keyword search scheme based on the inner product similarity computing scheme [18]. Some notations and the corresponding descriptions are given in Table 1.

Table 1. Notations and descriptions
  • System setup: With input security parameters \(\lambda _{1}, \lambda _{2},\lambda _{3},\lambda _{4}\), the data owner constructs a dictionary W, which contains n keywords. The data owner randomly picks a large prime numbers p such that \(|p| = \lambda _{2}\), \(S \in _R Z_{p}^*\), and a cryptographic one-way hash function \(H : \{0, 1\}^* \rightarrow \{0, 1\}^{\lambda _1}\). Thus, the public parameters are \(para = (\lambda _{1}, \lambda _{2},\lambda _{3},\lambda _{4}, p, n, H)\), and the data owner keeps W and S secret.

  • Index generation: For each document \(F_i\) \((i=1,...,m)\), the data owner encrypts it as ciphertext document \(\overline{F}_i\) using some secure symmetric encryption, randomly picks a unique file name \(N_i\), and calculates the length \(d_i\) of document \(F_i\). The data owner computes \(\gamma _i = H( N_i, d_i)\) and constructs the index vector \(\overline{I}_i\) such that if the document \(F_i\) contains the jth keyword in the dictionary W, then \(\overline{I}_{i,j}=1\), otherwise \(\overline{I}_{i,j}=0\). The data owner further sets \(\overline{I}_{i,n+1} = 0\) and \(\overline{I}_{i,n+2} = 0\), chooses \(n+2\) random number \(m_{j}\) such that \(|m_{j}| = \lambda _{3}\) for \(1 \le j \le n+2\), and encrypts each \(\overline{I}_{i,j}\) as follow:

    $$\begin{aligned} \hat{I}_{i,j}= S \cdot (\overline{I}_{i,j} \cdot \gamma _i+ m_{j} ) \mod p \end{aligned}$$
    (1)

    Then for document \(F_i\), the data owner outsources the ciphertext index vector \(\hat{I}_{i} = (\hat{I}_{i,1},\hat{I}_{i,2}, \cdots ,\hat{I}_{i,n+2})\) and the processed file \(\hat{F_i} = (\overline{F}_i, \gamma _i)\) to the cloud server, and keeps \((N_i, d_i)\) at local.

  • Trapdoor generation: Data user picks a large random number \(\delta \) such that \(|\delta | = \lambda _{1}\), and computes \(S^{-1} \mod p\). From the query keyword set \(\overline{W}\), data user constructs query vector \(Q_{\overline{W}}\), where \(Q_{\overline{W}_j} = 1\) if the query keyword set \(\overline{W}\) contains the jth keyword in the dictionary W, otherwise \(Q_{\overline{W}_j} = 0\). Data user then sets \(Q_{\overline{W}_{n+1}}=0\) and \(Q_{\overline{W}_{n+2}}=0\), and randomly chooses \(n+2\) numbers \(t_{j}\) such that \(|t_{j}| = \lambda _{4}\) for \(1 \le j \le n+2\). Data user constructs the search trapdoor \(\hat{Q}_{\overline{W}}\) as follows.

    $$\begin{aligned} \hat{Q}_{\overline{W}_j} = S^{-1} \cdot ( Q_{\overline{W}_j} \cdot \delta + t_{j}) \mod p \end{aligned}$$
    (2)

    Data user sets search threshold \(\tau \), and submits the search trapdoor \(\hat{Q}_{\overline{W}} = (\hat{Q}_{\overline{W}_1} ,\hat{Q}_{\overline{W}_2} , \cdots ,\hat{Q}_{\overline{W}_{n+2}} )\) and \((\tau , \delta )\) to the cloud server.

  • Search: Once received the encrypted search trapdoor \(\hat{Q}_{\overline{W}}\), the cloud server computes the similarity score \(Score(\overline{I}_i, Q_{\overline{W}})\) with each \(\hat{I}_i\) as follows. The cloud server computes

    $$\begin{aligned} E_i = Score(\hat{I}_i,\hat{Q}_{\overline{W}})= \hat{I}_i \cdot \hat{Q}_{\overline{W}} \mod p \end{aligned}$$
    (3)

    By properly choosing the elements under the given security parameters \(\lambda _1, \lambda _2, \lambda _3, \lambda _4\), we assume both the following conditions hold

    $$ \hat{I}_i \cdot \hat{Q}_{\overline{W}}< p $$

    and

    $$\begin{aligned} \rho&= \sum _{j=1, I_{i,j} \ne 0,Q_{\overline{W}_j} \ne 0}^{n+2} (\gamma _i t_j \overline{I}_{i,j}+m_j \delta Q_{\overline{W}_j} + m_j t_j )+ \\& \sum _{j=1, I_{i,j} = 0,Q_{\overline{W}_j} \ne 0}^{n+2} (m_j \delta Q_{\overline{W}_j} +m_j t_j)+ \sum _{j=1, I_{i,j} \ne 0,Q_{\overline{W}_j} = 0}^{n+2} (\gamma _i t_j \overline{I}_{i,j} +m_j t_j)+ \\& \sum _{j=1, I_{i,j} = 0,Q_{\overline{W}_j} = 0}^{n+2} m_j t_j \\&< \gamma _i\delta . \end{aligned}$$

    Then, the cloud server computes

    $$\begin{aligned} Score(\overline{I}_i, Q_{\overline{W}}) = \sum _{j=1}^{n} \overline{I}_{i,j} \cdot Q_{\overline{W}_j} = \frac{E_i-(E_i \bmod \delta \cdot \gamma _i)}{\delta \cdot \gamma _i} \end{aligned}$$
    (4)

    If the following search condition is satisfied

    $$ Score(\overline{I}_i, Q_{\overline{W}}) \ge \tau $$

    then the cloud server returns the corresponding document \(\overline{F}_i\).

Theorem 1

The proposed multi-keyword search scheme is correct.

Proof

To compute the similarity score \(Score(\overline{I}_i, Q_{\overline{W}})\), it is required that both \(\overline{I}_{i,j} \ne 0\) and \(Q_{\overline{W}_j} \ne 0\) are satisfied for \(1 \le j \le n\). Let

$$ E_i' = \sum _{j=1, I_{i,j} \ne 0,Q_{\overline{W}_j} \ne 0}^{n+2} \gamma _i \delta \overline{I}_{i,j}Q_{\overline{W}_j} \bmod p $$

Note that

$$\begin{aligned} E_i&= \hat{I}_i \cdot \hat{Q}_{\overline{W}} \\&= \sum _{j=1, I_{i,j} \ne 0,Q_{\overline{W}_j} \ne 0}^{n+2} (\gamma _i \delta \overline{I}_{i,j}Q_{\overline{W}_j} + \gamma _i t_j\overline{I}_{i,j}+ m_j \delta Q_{\overline{W}_j} + m_j t_j ) + \\& \sum _{j=1, I_{i,j} = 0,Q_{\overline{W}_j} \ne 0}^{n+2} (m_j \delta Q_{\overline{W}_j} +m_j t_j)+ \sum _{j=1, I_{i,j} \ne 0,Q_{\overline{W}_j} = 0}^{n+2} (\gamma _i t_j \overline{I}_{i,j} +m_j t_j) + \\& \sum _{j=1, I_{i,j} = 0,Q_{\overline{W}_j} = 0}^{n+2} m_j t_j \\&= E_i' + \rho \mod p \end{aligned}$$

If \(E_i < p\) and \(\rho < \gamma _i\delta \) hold, then we have

$$\begin{aligned} Score(\overline{I}_i, Q_{\overline{W}})= & {} \frac{E_i-(E_i \bmod \delta \cdot \gamma _i)}{\delta \cdot \gamma _i} \\= & {} \frac{E_i -\rho }{\delta \cdot \gamma _i} \\= & {} \frac{\sum _{j=1, I_{i,j} \ne 0,Q_{\overline{W}_j} \ne 0}^{n+2} (\gamma _i \delta \overline{I}_{i,j}Q_{\overline{W}_j})}{\delta \cdot \gamma _i} \\= & {} \sum _{j=1, I_{i,j} \ne 0,Q_{\overline{W}_j} \ne 0}^{n+2} ( \overline{I}_{i,j}Q_{\overline{W}_j}) \\= & {} \overline{I}_i \cdot Q_{\overline{W}} \mod p \end{aligned}$$

Thus, the proposed scheme is correct.

4 Analysis and Comparison

4.1 Security Analysis

The proposed multi-keyword search scheme can guarantee the privacy of outsourced data in the known ciphertext attack model and the known background attack model.

Resistance of the Known Ciphertext Attacks. In the dada processing phase, the cloud server can get the encrypted documents and ciphertext indexes, while in the search phase, it is given the search trapdoors. These information are submitted to the cloud server in ciphertext format, where one-time parameters are used for processing each document, index and search query. Thus, the cloud server cannot deduce any sensitive information of outsourced documents and the private key of data owner even though it holds a lots of outsourced materials.

Resistance of the Known Background Attacks. Under this type of attacks, the cloud server can also get some background information of outsourced documents, for example, keyword frequency. In our scheme, for generating a search trapdoor, the one-time elements \(t_j\) and \(\delta \) are randomly picked, which means that the same search query vector will be mapped to different search trapdoors in different round of searching requests. Moreover, the cloud server is unable to infer the real search query vector from these search trapdoors.

4.2 Performance Evaluation

We conduct experimental evaluation of our scheme and compare with Cao et al.’s scheme [4]. The experiments are implemented using Matlab on a Windows 10 operation system with Intel(R) Core(TM) i5-6500 Processor 3.20 GHz and 8 GB memory. In experiments, we compare the performance of each procedures, that is, ciphertext index construction, search trapdoor generation and cloud search. In experiments, the parameters satisfy \(n \le 2^{32}\), \(|\gamma _i|=|\delta | = \lambda _{1} =200\), \(|p| = \lambda _{2} = 512\), \(|m_{j}| = \lambda _{3} = 128\), and \(|t_{j}| = \lambda _{4} = 128\).

Fig. 2.
figure 2

Time cost on ciphertext index generation.

Fig. 3.
figure 3

Time cost on search trapdoor generation.

Fig. 4.
figure 4

Time cost on search process.

As shown in Fig. 2, we set the size of the vector from 10 to 100 to evaluate the performance of generating ciphertext index. It can be seen that the time costs of our scheme are less than 1ms for all cases, while the costs of Cao et al.’s scheme [4] rapidly increase as the number of keywords increases. As shown in Fig. 3, the time costs of both schemes are linear with the number of keywords in the query. Note that Cao et al.’s scheme [4] needs to perform matrix multiplications in generating search trapdoor. Thus, our scheme is more efficient than their scheme in all cases. For the search by the cloud server, our scheme does not involve complicated computation operations. Thus, as shown in Fig. 4, the performance of keyword search of our scheme keeps roughly the same for all cases. Whereas for Cao et al.’s scheme [4], the performance decreases greatly as the number of keywords in the query vector increasing. Thus, our scheme is more efficient than their scheme.

5 Conclusion

Existing privacy-preserving multi-keyword search schemes cannot be deployed on resource-constrained devices due to the complicated computation operations at the device side. To address this issue, this paper presented a light-weight multi-keyword search scheme to allow weak device to process data and generate search trapdoors of outsourced documents. Our proposal can protect the privacy of outsourced documents, index and search trapdoor against the known ciphertext attacks and known background attacks. Performance analysis demonstrated that our scheme is more efficient than existing proposals and can be deployed on resource-constrained devices.