Keywords

1 Introduction

In recent years, cloud storage has becoming more popular than ever as it can allow data owners to store, access and share their data anytime anywhere. Many outsourced data are sensitive and need to be protected from the Cloud Service Provider, which has complete control of the uploaded data. However, for data privacy consideration, data owners usually encrypt the sensitive data before outsourced to the cloud, which makes the deployment of search on encrypted data a difficult task. The simplest way is to download all uploaded data from the cloud and decrypt the encrypted data locally, but the large amount of storage and bandwidth cost makes it an impractical way. Moreover, encryption significantly complicates the search operation on the data. Thus, the cloud should explore effective search service on encrypted data while protecting the data privacy.

In order to solve the above problems, many schemes [15] have been proposed that provide the search service on encrypted data. Curtmola et al. [2] introduced a SE scheme based on inverted index, and its search process is extremely efficient. In [68], the authors proposed ranked keyword search schemes using order-preserving techniques, which can achieve efficient search and meanwhile maintain the accuracy. Boneh et al. [9] proposed the first generalization for symmetric searchable encryption, where data owners can outsource data to the cloud using public key and the data users can search over encrypted data using private key. Goh [10] proposed the first secure index scheme based on bloom filter, which may bring false positive into the result. Scheme [11] proposed a searchable encryption scheme which is not related to the public-key encryption algorithm, and the scheme supports incrementation efficiently. Kamara et al. [12] proposed a dynamic searchable encryption scheme, which supports data insertion and deletion on the encrypted data. Later in scheme [13], they improved their search process by parallelization.

The researchers studied verifiable search functionality extensively in the plaintext database scenario [14, 15]. Merkle hash tree is often adopted to verify the search results. But these works only focused on the verification functionality without considering the data privacy. Scheme [6] used hash chain to construct a single keyword search result verification scheme in encrypted data for the first time. Chai et al. [16] proposed a verifiable searchable encryption scheme under a semi-honest-but-curious model. Kurosawa et al. proposed a verifiable searchable encryption scheme against the malicious server in [17], then they extended it to a verifiable dynamic searchable encryption scheme [18], which supports dynamic update operation efficiently.

But almost all these schemes have been designed for exact keyword search. However, in real-life scenarios, the search keywords maybe input with spelling errors or format inconsistencies. The simplest method to implement keyword search over outsourced data is to encrypt all the keywords extracted from the documents. When the data user submits a trapdoor, the cloud server will search the index and return the encrypted document if the trapdoor is in the index list. The main weakness of this scheme is that it only supports exact keyword search.

However, if the users type the wrong keyword, the cloud server will fail to return the search results. Li et al. proposed the first fuzzy searchable encryption scheme using wild-card approach in [19]. This scheme builds the index based on the wildcard technique, which can tolerate the input typos under predefined edit distance. The index is built based on the keywords extracted from the files and the extended keywords generated based on the wildcard technique. But this scheme only supports the search that the input keyword may not exactly match the extended keywords set which includes the extracted keywords and the keywords that are very similar to the extracted keywords. Liu et al. [20] proposed a dictionary based fuzzy searchable encryption scheme with a small index, because the fuzzy keyword set based on wildcard technique contains many meaningless words, this scheme reduces the fuzzy set based on dictionary. In [21], Kuzu et al. proposed a similarity search scheme over encrypted cloud data, the scheme utilized locality sensitive hashing to generate file index. Chuah et al. [22] built the index based on the bedtree technique and implemented the multi-keyword fuzzy search over encrypted cloud data.

However, all the fuzzy searchable encryption schemes mentioned above didn’t consider the semantical keywords, for example, “holiday” and “vocation” are semantic related, but they don’t have similar structure. Scheme [23] proposed a method which enables semantic keyword search over encrypted documents based on stemming algorithm. In order to support semantic keyword search, the basic way is to generate all the semantically close keywords based on the original keyword.

Scheme [24] proposed a semantic multi-keyword ranked search method, which supports synonym query. Scheme [25] proposed three synonym keyword search schemes and demonstrated the efficiency of their schemes. But these semantic schemes only consider the synonym search and ignore the fuzzy search. Scheme [26] proposed a fuzzy and synonym search scheme, this scheme generates the fuzzy set based on the gram technique and semantic set based on NARCT. However, the gram-based technique still contains many invalid words compared to the dictionary-based technique, and the semantic keyword set only considers the synonyms and ignores the other main semantic relationships. Due to the above drawbacks, the existing schemes signifies that it’s important to develop a novel searchable encryption scheme which can support both fuzzy and semantic search, including main semantic relations, not just the synonyms.

We propose a fuzzy and semantic searchable encryption scheme based on dictionary and WordNet, which not only supports privacy preserving fuzzy keyword search, but also provides semantic search over encrypted cloud data, including three main kinds of relations. Fuzzy and semantic keyword search greatly increases the system usability. Our scheme returns all the matching documents when the searching input matches the fuzzy and semantic keyword set. The contributions are summarized as follows:

  1. (1)

    We utilize the dictionary and WordNet to construct our fuzzy and semantic keyword sets, which reduces the size of the extracted keyword sets. Then we build our secure index using the privacy preserving techniques.

  2. (2)

    The searching input is expanded based on dictionary and WordNet, the query expansion includes the fuzzy keywords and semantic keywords. The cloud server conducts the search operation and returns all the related files, which greatly improves the system flexibility.

  3. (3)

    By combining the fuzzy keyword searchable encryption scheme with keyword-based semantic search scheme, we propose a fuzzy and semantic searchable encryption scheme supporting fuzzy and semantic search in one scenario. We prove our scheme is privacy preserving through rigorous analysis.

We organize the rest of this paper as follows. Section 2 introduces the problem formulation. Section 3 describes the details of our proposed fuzzy and semantic search scheme. In Sect. 4, we present the security analysis. The conclusion of our paper is in Sect. 5.

2 Problem Formulation

2.1 System Model

In our fuzzy and semantic keyword searchable encryption scheme, we consider a system comprising a data owner, a data user, and a remote cloud server. The data owner first encrypts a collection of n documents \({ D =\{ d_1, d_2,\cdots , d_n \}}\) into a set of ciphertexts \({C =\{c_1, c_2, \cdots , c_n \}}\), then the data owner outsources the ciphertexts and the index to the cloud. The authorized data user receives a trapdoor from data owner of her search interest via the search control mechanism and then send the trapdoor to the cloud server. Then the cloud server searches over the encrypted index and returns the search result. Figure 1 shows the overall architecture of our scheme.

Fig. 1.
figure 1

Architecture of our scheme

2.2 Threat Model

The cloud server is assumed semi-honest-but-curious in this paper, which means that the cloud server may try to derive data privacy from the input trapdoors and search operation. In this work, we aim to propose a privacy-preserving search and protect the sensitive information from the server. While designing our privacy preserving fuzzy and semantic keyword searchable encryption scheme, we adopt the secure definition which are also used in the related work [2]. The cloud server can only access the encrypted files, the secure indexes and the submitted trapdoors. The cloud server can also know and record the search results. We require that the server should not be able to learn any more information.

2.3 Design Goals

In this paper, to enable secure fuzzy and semantic keyword search service over the ciphertexts, our scheme has the following design goals: (1) We propose a method to construct fuzzy and semantic keyword sets based on dictionary and WordNet; (2) We propose a scheme to allow users make fuzzy and semantic search over encrypted data; (3) We prove that our proposed scheme is secure and privacy-preserving.

2.4 Preliminaries

Edit Distance. There are many methods to calculate the string similarity, and we choose edit distance [27] in this paper. \(d(w_1, w_2)\) is defined as the edit distance between two strings \(w_1\) and \(w_2\), which is the number of operations required to transform one string into another. There are three primary operations (1) Insertion: insert a character into a string; (2) Substitution: alter a character in a string; (3) Deletion: delete one character from a string.

Dictionary. Dictionary contains a set of legal words. The dictionary is used to reduce the size of the keyword set in this paper. The dictionary excludes all the meaningless English words, and removes all the stop words.

WordNet. WordNet [28] is a large English lexical database. The words are organized into a collection of synonym sets, which represents a lexicalized concept. Synonym sets are linked by different kinds of relations, including synonym, meronym, holonym, hypernym, hyponym and so on. In this paper, we consider three main relations: synonym, meronym/holonym, hypernym/hyponym.

3 Construction of Fuzzy and Semantic Search in Encrypted Cloud Data

3.1 Keyword Set Construction

Fuzzy Keyword Set. The fuzzy keyword search scheme [19] proposed the wildcard technique to generate fuzzy keyword set. The key idea of this scheme is to extract all possible fuzzy keywords under a predefined edit distance, and then build the encrypted index based on fuzzy keyword set. In the wildcard-based fuzzy keyword set construction, \(*\) is used as a wildcard, set the edit distance as 1, the keyword set of use is \(\{use, *use, *se, u*se, u*e, us*e, us*, use*\}\).

The main drawback of scheme [19] is that it pull in all the variations of the keyword, but most of them are invalid. Liu et al. [20] proposed a dictionary based fuzzy searchable encryption scheme with a small index. This scheme uses a dictionary which contains the valid keywords rather than all the variations. For example, considering the wildcard-based fuzzy keyword set of keyword use, the fuzzy keywords \(*use\) are not valid words, and these meaningless keywords will be removed in this scheme based on dictionary. Thus, the index of [20] is much smaller than the index of [19]. In this paper, in order to reduce the index, we choose the dictionary-based technique to generate our fuzzy keyword set.

Semantic Keyword Set. But all the above fuzzy schemes didn’t consider the users’ real search intention. Although the two fuzzy schemes extract the keywords and their variations from the documents, but these schemes can only support search with minor typos. If the input keyword doesn’t have a similar structure with the exact keyword, the search results will not contain these right documents.

However, there exists one situation, the input keyword may have a semantic relation with the keywords in the documents, so the search scheme should consider the semantic keywords of the original keyword. For example, when the input keyword is “corporation”, the fuzzy keyword search scheme will return documents which contain keywords such as “corporation” or “corporations”, but it will ignore the semantic keywords such as “firm”, “enterprise” and “company”. And for some verbs and its variation, such as “bring” and “brought”, the fuzzy keyword search scheme didn’t consider this scene and will not return the right results. In this paper, we consider three main semantic relations. For example, “firm” is a synonym of “corporation”, “wheel” is a meronym of “bicycle” and “flower” is a hypernym of “rose”.

In order to cover the ignored semantic keywords of the fuzzy keyword search schemes, the keyword set should consider the fuzzy keywords and the semantic keywords together in one scenario, including three kinds of semantic keyword sets mentioned above.

3.2 Construct Keyword Set

The data owner should first construct the keyword dictionary. The keyword dictionary contains two parts: the fuzzy keyword set and the semantic keyword set. The semantic keyword set contains three parts: synonym set, meronym/holonym set and hypernym/hyponym set.

Firstly, construct the fuzzy keyword set \(S_{w}\) using the dictionary-based method. Then retrieve all the synonyms of the original keyword w from the WordNet and add them into the semantic keyword set set1, then compared the fuzzy keyword set with the synonym set. If the synonym of the keyword is not contained in the fuzzy keyword set, then add it into the keyword set. If the synonym is duplicate, then remove it. The meronym/holonym set and hypernym/hyponym set can be processed in the same way as the synonym set. At Last, our keyword set contains the fuzzy keyword set and semantic keyword set. The keyword set construction is described in Algorithm 1, \(S_{w}\) is denoted as the fuzzy and semantic keyword set of keyword w.

3.3 Encryption

To construct the index for our fuzzy and semantic keyword search scheme, the data owner first generates secret keys (ksk), where k is a secret key, sk is for algorithm Enc and Dec. Enc is used to encrypt the documents, and Dec is used to decrypt the documents. f is a pseudorandom function. FID\(_{w}\) denotes a set of document IDs whose corresponding documents contain the keyword w.

Our scheme constructs the secure index Index by calling the algorithm SecIndex as follows:

  1. (1)

    The document set D was encrypted into an encrypted form C by calling the algorithm Enc.

  2. (2)

    For each keyword \(w_i \in W\), construct the fuzzy and semantic keyword set \(S_{w_i}\) for the original keyword \(w_i\) by calling the algorithm FuzzySemanticSet;

  3. (3)

    For each keyword \(w_i' \in S_{w_i}\), compute the trapdoor set \(T_{w_i'} = f(k, w_i')\);

  4. (4)

    For each FID\(_{w_i}\), compute the encrypted form \(\mathsf{Enc}(k,\) FID\(_{w_i}||{w_i})\);

  5. (5)

    The data owner finally generates the secure index \({Index} = \{ \{T_{w_i'}\}_{w_i' \in S_{w_i}}, \mathsf{Enc}(k,\) FID\(_{w_i}||{w_i}) \}_{w_i \in W}\);

Finally, the data owner sends the ciphertexts C and the index Index to the server.

figure a

3.4 Search Process

When the data user inputs a search keyword \(w_a\), he first generates the fuzzy and semantic keyword set \(S_{w_a}\) for the original keyword \(w_a\) by calling Algorithm 1, then computes the trapdoors \(\{T_{w_a'}\}_{w_a' \in S_{w_a}}\) for \(w_a' \in S_{w_a}\). After that, the data user sends the trapdoor set to the remote cloud. Then the server searches Index and returns back all the possible encrypted FIDs as the search result. At last, the data user decrypt the search result using secret key k. If the data user intends to download the documents from the cloud. After receiving the download request from the data user, the server will return the corresponding encrypted documents. At last, the user can decrypt the encrypted documents using the secret key sk.

4 Security Analysis

In this section, we analyze privacy preserving issue for our fuzzy and semantic keyword search scheme.

Theorem 1

The proposed fuzzy and semantic keyword search scheme is privacy preserving.

Proof

In our dictionary based fuzzy and semantic keyword search scheme, we compute the secure index and the trapdoor using the same way. Suppose that there exists an simulator S, a challenger C, it does the followings:

  1. (1)

    The challenger C sends \({|d_1|, \cdots , |d_n|}\) and \(m = |W|\) to the simulator S.

  2. (2)

    S keeps (ksk) secret.

  3. (3)

    S computes \(c_j = { \mathsf Enc} (sk, 0^{|d_j|})\) for \(j = 1, \cdots , n\).

  4. (4)

    S chooses \(\{T_{w_i'}\}_{w_i' \in S_{w_i}}\) as the trapdoor set and chooses the secure index \({Index'} = \{ \{T_{w_i'}\}_{w_i' \in S_{w_i}} , \mathsf{Enc}(k,\) FID\(_{w_i}||{w_i}) \}_{w_i \in W}\) randomly for \(i = 1,\cdots ,m\).

  5. (5)

    S sends \({C' = (c_1, \cdots , c_n)}\) and \(Index'\) to C.

  6. (6)

    S then computes \(\{T'_{w_a'}\}_{w_a' \in S_{w_i}}\) as the trapdoor set to C.

In the search phase, C receives(\(C'\), \(Index'\), \(\{T'_{w_a'}\}_{w_a' \in S_{w_i}}\)) from S.

Because all the documents are encrypted with CPA-secure algorithms Enc in this scheme, we consider them as confidential to the cloud. Therefore \(C'\) and C, \((j,{ \mathsf Enc} (sk, 0^{|d_j|}))\) and \((j,{ \mathsf Enc} (sk, 0^{|d'_j|}))\) are indistinguishable. \(Index'\) and Index are indistinguishable as f is a pseudorandom function. The data privacy is preserved because C cannot learn more information.

5 Conclusion

We propose a fuzzy and semantic keyword searchable encryption scheme while maintaining privacy preserving. We combine the dictionary-based technique and WordNet to build our fuzzy and semantic keyword set. After constructing the keyword set, we further introduce the process of the constructing the secure index, then we propose the search process. Our scheme is an effective solution which enables users make fuzzy and semantic search over encrypted data. We prove that our scheme is privacy preserving through rigorous analysis.