
1 Introduction

With the emergence of information islands and the increasing volume of government data, smart government [1] is being developed to facilitate the integration of government management services, as evidenced by data sharing among government departments at all levels. The cloud [2] serves as an intermediate carrier for data storage and sharing, and the public integrity auditing [3, 5,6,7, 9, 10] paradigm can be used to check the integrity of outsourced data and thus support its availability. However, for massive and targeted government data, directly applying the existing probabilistic public auditing (PPA) model to smart government scenarios has two limitations. (i) The PPA model does not support targeted auditing of government data. Specifically, civil servants in different departments care about the integrity of only those files related to their own departments rather than the entire outsourced database. The challenge information in the PPA model cannot completely cover the involved data and therefore cannot complete integrity auditing of the specified data. (ii) PPA, which selects the data to be audited uniformly at random, is not economical for government data verification. In each audit, PPA wastes computation resources checking irrelevant randomly selected files in addition to the department's files. In summary, for smart government, the proposed public keyword-based auditing (PKA) model [4], which determines the challenge information according to users' wishes, could be a better choice.

In the PKA model, the third-party auditor (TPA) periodically verifies the integrity of users' data of interest by retrieving specific keywords and performing verifications, thereby reducing the overall auditing cost while satisfying users' needs. Unfortunately, the PKA model cannot fully meet the security requirements of data auditing and sharing in smart government. In general, each department's keyword setting is, to a certain extent, a long-term and business-related process, so the files selected for auditing are relatively fixed, and the probability of unrelated files being audited is negligible. In this case, malicious cloud servers can infer private contexts, such as department information or file types, from the auditing frequency, and may even delete files that are rarely retrieved to save storage space. Undoubtedly, the value of government data is enormous: either a privacy leak or a file corruption can expose a government to a major security crisis.

To ensure the security requirements of government data in integrity auditing, we propose HFKA that combines PPA and PKA. Specifically, the contributions of this work are as follows:

  • We propose the first hidden frequency keyword-based auditing scheme for a smart government, named HFKA. In HFKA, a Bloom filter is introduced to achieve fuzzy matching between user-specified keywords and files to be audited. According to the degree of confidentiality requirements, HFKA sets Bloom filter modes with different fuzziness levels. The generated verifiable challenge information can not only cover the files that users are interested in but also randomly select some low-frequency auditing files. HFKA is well-suited for scenarios where users perform differential targeted auditing on specific data in a shared dataset.

  • We design an index table label (ITL) to support the implementation of fuzzy matching that can protect keyword contents from storage servers and resist two lazy behaviors. The ITL aggregates the serial numbers of all files corresponding to a defined keyword, so the aggregated value determines whether a storage proof covers all the involved files in the challenge information. In addition, a global variable indicating the update time is embedded in the ITL. When a storage server does not perform update operations according to the storage protocol, HFKA can identify replay attacks on proofs in a new round of auditing.

  • We demonstrate the storage robustness, auditing-frequency privacy protection and data security of HFKA. The performance evaluation shows that HFKA keeps both the computation complexity and the communication complexity of the integrity auditing phase at O(1) in the total number of files n. Furthermore, we design a comparative experiment to observe the auditing frequency of Type-A files (files corresponding to the predefined high-frequency keywords among all extracted keywords) and Type-B files (files corresponding to the other keywords). The experimental results show that the advantage in distinguishing user-specified high-frequency files from randomly selected low-frequency files is negligible. Our scheme thus achieves auditing frequency hiding and satisfies the security requirements of government data.

Organization. The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 covers the system model, threat model and definitions, and briefly introduces design goals of HFKA. Section 4 presents a detailed construction of HFKA. Section 5 provides a security analysis, and Sect. 6 presents the performance evaluation. Finally, Sect. 7 concludes the paper and gives directions for future work.

2 Related Work

Data integrity auditing enables users to verify the integrity of outsourced data. In 2007, Juels et al. [6] proposed the proofs of retrievability (PoR) protocol, which enables a user to recover complete data from partial data provided by a server. Ateniese et al. [5] introduced a provable data possession (PDP) proof generation model that ensures overall data integrity by randomly sampling a portion of the data. Considering the computation overhead and economic burden on users verifying data in a cloud environment, Wang et al. [7] achieved public verifiability by introducing a TPA, based on the PoR model, to complete the auditing work. Wang et al. [9] then implemented storage correctness assurance and error localization based on a homomorphic token and an erasure-coded approach [8]. In [10], to resist an honest-but-curious TPA, the TPA was designed to audit the integrity of users' outsourced data without learning the data content, based on a homomorphic linear authenticator combined with a masking technique. Similarly, masking was used to improve system decentralization and storage efficiency in [13,14,15]. To provide conditional identity privacy for medical data, Zhang et al. [12] designed an identity-based aggregate signature to protect patients' real identities and used the Ethereum blockchain to record the TPA's auditing results, thereby preventing a dishonest TPA from performing malicious auditing behaviors. To address the authenticity of big data streams in untrusted environments, a novel data structure, P-ATHAT, was constructed in [16] based on the BLS signature and the Merkle hash tree to achieve real-time authentication of data streams. Li et al. [17] proved that P-ATHAT cannot resist forgery attacks by cloud servers. Shen et al. [19] focused on identity privacy, using a user's biological data to verify their identity; the method can perform integrity auditing without a hardware token storing the private key. Zheng et al. [20] protected a user's private key by updating it; to reduce the burden of recalculating the private key, a TPA participates in generating part of it. Zhou et al. [11] discussed the use of certificateless signatures to avoid certificate management and computation problems in multicopy storage. Notably, these schemes all use the probabilistic auditing model of PDP to audit the integrity of a user's private data.

For integrity verification of shared data, different auditing models should be selected depending on the specific scenario. Take smart government as an example: users in different departments have different concerns about outsourced shared data, and the probabilistic auditing model cannot meet these individual needs. In 2021, Gao et al. [4] proposed a keyword-based auditing paradigm that determines the scope of audited data based on keywords selected by users. A feasible keyword-based auditing scheme was then achieved using the privacy-preserving keyword-file index table designed by Ge et al. [21] and the trapdoors of [22]. Because the index table adopts a linked-list structure with an index vector and generates an authentication label for each keyword, storage space is saved, and the user can more easily detect malicious behavior of the cloud server from the authentication label. The trapdoor content in [22] is ciphertext, meaning that without access to the user's encryption key, no one can forge a valid trapdoor or recover the information it contains; the construction also supports submitting multiple keywords in a single trapdoor.

However, since users have preferences for certain data, semihonest cloud servers can infer user privacy from how often particular data are audited. Therefore, we need a new paradigm to address the challenge of auditing-frequency privacy in this scenario. Bringer et al. [23] first used the locality-sensitive hash (LSH) algorithm to enable fuzzy search, and LSH has since been used in many schemes as a basic technique for fuzzing keywords [25,26,27,28]. Li et al. [24] instead used edit distance to measure the similarity between keywords for imprecise fuzzy search. The Bloom filter is another widely used fuzzing tool. In 2021, Indra et al. [30] designed a two-dimensional Bloom matrix to achieve fast matching of similar words. Tong et al. [31] combined the above approaches in 2022 with a twin Bloom filter built on LSH. However, the fuzzy operations of these schemes aim to narrow the trapdoor search and improve keyword search accuracy, which is the opposite of our goal of amplifying the search results to hide keyword frequency. Only the application of a Bloom filter by Gervais et al. [32], which fuzzes a user's address during simple payment verification, fits our intended purpose. Centralized storage accidents have occurred frequently in recent years, and in smart government, any corruption of shared data is unacceptable; consequently, a distributed storage architecture is appropriate. Hyperledger Fabric, designed by Cachin et al. [39], is an ideal distributed architecture for data storage, and Xu et al. [18] demonstrated the feasibility of the Hyperledger Fabric architecture for distributed storage integrity auditing. More privacy-preserving schemes are summarized in [40].

3 Preliminaries

3.1 System Model

On the basis of previous work [4], combined with the Hyperledger Fabric architecture, HFKA introduces the Fabric certificate authority and optimizes the traditional single cloud storage server into a scalable distributed storage structure. In addition, HFKA introduces a retrieval server to modularize each entity's functionality.

The system model of HFKA involves five entities as illustrated in Fig. 1: Fabric Certificate Authority (\(\mathcal {FCA}\)), User (\(\mathcal {U}\)), Third-Party Auditor (\(\mathcal {TPA}\)), Retrieval Server (\(\mathcal{R}\mathcal{S}\)), Storage Node (\(\mathcal{S}\mathcal{N}\)).

Fig. 1. System model of HFKA

  • Fabric Certificate Authority. The \(\mathcal {FCA}\) is a trusted institution that generates and determines digital certificates on the basis of public key infrastructure (PKI). According to the auditing requirements of smart government, the \(\mathcal {FCA}\) mainly issues identity certificates of various entities and generates the signature key pairs for \(\mathcal {U}\).

  • User. Each \(\mathcal {U}\) is a collator and uploader of some government outsourced data. According to the construction goal of the smart government platform, \(\mathcal {U}\)s of each department outsource files to a distributed storage network, involving many \(\mathcal{S}\mathcal{N}\)s. To ensure the recoverability and confidentiality of outsourced data, each \(\mathcal {U}\) completes data preprocessing, such as redundancy and encryption. During the auditing phase, each \(\mathcal {U}\) generates authenticators and index information.

  • Third-Party Auditor. The \(\mathcal {TPA}\) performs audit tasks on behalf of \(\mathcal {U}\)s, with expertise and corresponding computing resources. The \(\mathcal {TPA}\) completes the fuzzy matching of the specified keywords, challenges the \(\mathcal{R}\mathcal{S}\) in the auditing phase, and finally checks the validity of the proof information fed back by \(\mathcal{S}\mathcal{N}\)s.

  • Retrieval Server. The \(\mathcal{R}\mathcal{S}\) assists the \(\mathcal {TPA}\) in generating the audited file set, specifically by saving index information and retrieving the corresponding files according to the keyword trapdoor.

  • Storage Node. Each \(\mathcal{S}\mathcal{N}\) is a storage unit in the distributed storage architecture that stores redundantly processed data blocks and calculates storage proofs based on challenge information.

Here, we briefly describe the relationships among the entities in the system model of HFKA. To free up local storage space and realize data sharing, the involved \(\mathcal {U}\)s outsource the preprocessed (encrypted and redundantly processed) data to a distributed storage architecture with multiple \(\mathcal{S}\mathcal{N}\)s. The \(\mathcal {TPA}\) is authorized to perform periodic integrity auditing tasks, aiming to determine the integrity of outsourced data with the least user-side communication and computational overhead. The \(\mathcal{R}\mathcal{S}\) assists \(\mathcal{S}\mathcal{N}\)s with confirming the serial numbers of the audited data during the challenge phase. The \(\mathcal {FCA}\) creates signature keys for \(\mathcal {U}\)s and offers identity certificate management for each entity to support the security of the distributed storage system.

3.2 Threat Model

In the threat model, the \(\mathcal{R}\mathcal{S}\) and \(\mathcal {U}\) are assumed to be completely trusted, whereas each \(\mathcal{S}\mathcal{N}\) is considered semihonest and the \(\mathcal {TPA}\) is considered curious. These security assumptions are consistent with the interests of the parties in an actual smart government.

Semihonest \(\mathcal{S}\mathcal{N}\). An \(\mathcal{S}\mathcal{N}\) may probe data privacy by exploiting keyword-based auditing, or may replay an existing storage proof to avoid the computational overhead of data updating. The malicious behaviors of a semihonest \(\mathcal{S}\mathcal{N}\) are as follows: (i) Frequency analysis attack. In the challenge generating phase, the \(\mathcal {TPA}\) and the \(\mathcal{R}\mathcal{S}\) select the auditing target data based on the search trapdoor input by the \(\mathcal {U}\). The \(\mathcal {U}\)'s subjectivity concentrates the auditing distribution on a small number of files. A curious \(\mathcal{S}\mathcal{N}\) can analyze the auditing frequency of the stored data and delete data that are rarely audited to save storage space. Additionally, a curious \(\mathcal{S}\mathcal{N}\) can combine this frequency information with external information (such as the \(\mathcal {U}\)'s department) to guess the value and types of files with high auditing frequency. (ii) Replay attack. When the \(\mathcal {U}\) updates the outsourced data, a lazy \(\mathcal{S}\mathcal{N}\) may skip the update operation and generate storage proofs from the old data rather than from the data specified by the challenge information. Because such a proof still follows the verification logic, it can pass the \(\mathcal {TPA}\)'s verification. Moreover, the random number selection range of the \(\mathcal {TPA}\) is limited: although it is extremely unlikely that two groups of challenge information are exactly the same in a single comparison, the probability becomes nonnegligible after multiple rounds of auditing, making replayed proofs even harder to distinguish. Whether this attack arises from hardware/software failures or from an attempt to reduce computational costs, the behavior violates cloud storage principles.
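The claim that repeated challenge information becomes nonnegligible over many rounds follows a standard birthday bound. As a hedged illustration (the \(2^{20}\) effective challenge space is an assumed figure for the sketch, not a parameter of HFKA):

```python
import math

def challenge_collision_prob(space_size: float, rounds: int) -> float:
    """Birthday bound: probability that at least two of `rounds` challenges
    coincide when each is drawn uniformly from `space_size` possibilities."""
    return 1.0 - math.exp(-rounds * (rounds - 1) / (2.0 * space_size))

# With an effective challenge space of 2^20, repeats are already likely
# after a few thousand auditing rounds:
for rounds in (100, 1000, 2000):
    print(rounds, round(challenge_collision_prob(2**20, rounds), 3))
```

Even a modest number of auditing rounds thus gives a lazy \(\mathcal{S}\mathcal{N}\) a realistic chance of seeing a repeated challenge it can answer with a cached proof.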

Curious \(\mathcal {TPA}\). A curious \(\mathcal {TPA}\) does not initiate attacks, but will eavesdrop on all kinds of private data. (iii) Privacy speculation. In HFKA, a curious \(\mathcal {TPA}\) may guess the type of outsourced files and further infer a \(\mathcal {U}\)’s identity based on the keywords in the \(\mathcal {U}\)’s search trapdoor. As the number of auditing rounds increases, the \(\mathcal {TPA}\) learns more information. Based on the \(\mathcal {U}\)’s identity and the keywords entered in the search trapdoor, the \(\mathcal {TPA}\) can detect the general content of the files. This level of confidentiality leakage is unacceptable for government data.

Based on the above discussion, we provide three relevant definitions for the security of HFKA.

Definition 2 (Storage Robustness). Storage robustness means that HFKA can resist \(\mathcal{S}\mathcal{N}\)'s replay attack and frequency analysis attack.

Definition 3 (Privacy Protection). Privacy protection means that the privacy of the outsourced data context, the keywords specified by the user, the index and the keyword-file relations can all be guaranteed.

Definition 4 (Data Security). Data security means that no adversary can recover the original files in a direct or indirect manner, such as via a brute-force attack or a chosen-ciphertext attack.

3.3 Design Goals

To achieve a secure and efficient application of keyword-based auditing for smart government, HFKA aims to achieve the following goals: (i) Privacy preserving. HFKA hides information about files, keywords, the keyword-file index table and trapdoors using encryption and other processing methods so that \(\mathcal{S}\mathcal{N}\)s and the \(\mathcal{R}\mathcal{S}\) cannot obtain sensitive information about the files. Even if the data are intercepted by an adversary during network transmission, no information is leaked, not even the correspondence between files and keywords. (ii) Even auditing. Even auditing makes it impossible for adversaries to capture file auditing regularity, which is an important feature for protecting data integrity. The traditional PPA model achieves even auditing through its probabilistic framework, but the considerable overhead of such absolute evenness is unacceptable for smart government. HFKA overcomes this limitation and achieves relatively even auditing over both the files that \(\mathcal {U}\) is concerned with and those it is not, making it suitable for the smart government scenario. (iii) Replay attack resistance. To further ensure data security, HFKA resists replay attacks, focusing on the replay of storage proofs of updated data. The \(\mathcal {TPA}\) can verify whether the storage proofs are computed from the updated data rather than the old data, which prevents \(\mathcal{S}\mathcal{N}\)s from skipping updates to reduce computational overhead or because of hardware/software failures. Meanwhile, the computation added by the replay-resistance design is constant.

4 The Proposed Scheme

An \(\mathcal {FCA}\), multiple \(\mathcal {U}\)s, a \(\mathcal {TPA}\), an \(\mathcal{R}\mathcal{S}\) and multiple \(\mathcal{S}\mathcal{N}\)s in the distributed storage architecture are involved in HFKA. For readability, we show a \(\mathcal {U}\) and an \(\mathcal{S}\mathcal{N}\) that interact with other entities, as shown in Fig. 2.

Fig. 2. Procedure of HFKA

Setup phase.

(1) SysInit. Given a security parameter \(\lambda \), \( \mathcal {FCA} \) chooses the public parameters, issues identification certificates for all entities and generates signature key pairs for \( \mathcal {U} \).

  • Initialize the public parameters: choose two multiplicative cyclic groups \(G_1, G_2\) of order q with generators g and u, respectively; a bilinear pairing \(e: G_1\times G_1\rightarrow G_2\); three hash functions \(H_1,H_2,H_3:\{0,1\}^*\rightarrow G_1\); secure hash functions \(SHA_i:\{0,1\}^*\rightarrow Z_q^*, i\in \{1,2,...,\mathcal {R}\}\); a symmetric encryption algorithm \(Enc(\cdot ,key)\); a pseudorandom permutation (PRP) \(\pi (\cdot ,key)\); and a pseudorandom function (PRF) \(f(\cdot ,key)\).

  • Issue the identification certificates for \(\mathcal {U}\)s, \(\mathcal{R}\mathcal{S}\), \(\mathcal{S}\mathcal{N}\)s and \(\mathcal {TPA}\). Figure 3 shows an example illustrating the structure of each entity's certificate.

  • Choose \(ssk\in Z_q^*\) randomly, then compute \(spk=g^{ssk}\) and send \((ssk, spk)\) to \(\mathcal {U}\).

Fig. 3. Certificate issued by \(\mathcal {FCA}\).

Data Processing phase.

(2) AuthGen. \(\mathcal {U}\) generates the encrypted data block set C and the authenticator set \(\phi \).

  • Divide the raw data set F into files \(f_1||f_2||...||f_n\) and split each file into s data blocks \(m_{ij}\), where \(i\in [1,n],j\in [1,s]\).

  • Compute the encrypted data block \(c_{ij}=Enc(m_{ij},k_0)\) and obtain the encrypted data block set \(C=\{c_{ij}\}_{i\in [1,n],j\in [1,s]}\), where \(k_0\) is the encryption key.

  • For each \(c_{ij}\in C\), compute the data block authenticator

    $$\sigma _{ij}=[H_1(ID_i||j)\cdot g^{c_{ij}}]^{ssk},$$

    where \(ID_i\) is the unique identifier of \(f_i\). Then, create authenticator set \(\phi =\{\sigma _{ij}\}_{i\in [1,n],j\in [1,s]}\).

  • Send \(\{C,\phi \}\) to certain \(\mathcal{S}\mathcal{N}\)s randomly and record these \(\mathcal{S}\mathcal{N}\)s’ certificates.

(3) ExtractKW. Based on F, \(\mathcal {U}\) generates the keyword set W and the index vector set V.

  • Extract the top-\(\mathcal {K}\) ranked keywords \(\{\omega _t\}_{t=1,2,...,\mathcal {K}}\) from the files using the term frequency-inverse document frequency (TF-IDF) text keyword extraction method. Then, create the keyword set \(W=\{\omega _t\}_{t=1,2,...,\mathcal {K}}\).

  • For \(\omega _t\), set up an n-bit binary array as the index vector \(v_{\omega _t}\) and initialize \(v_{\omega _t}=0\). For each file \(f_i\in F\), set \(v_{\omega _t}[i]=1\) if \(f_i\) contains the keyword \(\omega _t\) (e.g., \(v_{\omega _t}=\{0,1,0,0,...,0\}\) when only the file \(f_2\) contains \(\omega _t\)). Create the index vector set \(V=\{v_{\omega _1},v_{\omega _2},...,v_{\omega _\mathcal {K}}\}\).
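The ExtractKW step can be sketched in Python. The following is a minimal, illustrative TF-IDF ranking and index-vector construction over toy documents; the scoring formula and sample documents are assumptions for the sketch, not the paper's exact extractor:

```python
import math
from collections import Counter

def top_k_keywords(docs, k):
    """Rank terms by summed TF-IDF score across all documents (toy TF-IDF)."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = Counter()
    for toks in tokenized:
        tf = Counter(toks)
        for term, count in tf.items():
            scores[term] += (count / len(toks)) * math.log(n / df[term] + 1)
    return [term for term, _ in scores.most_common(k)]

def index_vectors(docs, keywords):
    """v_w[i] = 1 iff file f_i contains keyword w (one bit list per keyword)."""
    tokenized = [set(doc.lower().split()) for doc in docs]
    return {w: [1 if w in toks else 0 for toks in tokenized] for w in keywords}

# Hypothetical files f_1, f_2, f_3:
docs = ["tax budget report", "budget meeting notes", "permit application tax"]
W = top_k_keywords(docs, 3)      # keyword set W
V = index_vectors(docs, W)       # index vector set V
```

For instance, with these toy files the keyword "tax" yields the vector [1, 0, 1], mirroring the paper's example where \(v_{\omega _t}[i]=1\) exactly for the files containing \(\omega _t\).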

(4) IndexGen. Based on (W, V), \(\mathcal {U}\) generates the encrypted index table \(I_{\mathcal{R}\mathcal{S}}\) and the challenge generating auxiliary index \(I_{\mathcal {TPA}}\).

  • For \(\omega _t\), compute the index address \(f_{\omega _t}=f(\omega _t,k_1)\) via PRF and then update the original index vector \(v_{\omega _t}\) to \(v_{f_{\omega _t}}\), where \(k_1\) is the key of the PRF implementation.

  • Compute the encrypted permutation \(\pi _{\omega _t}=\pi (v_{f_{\omega _t}},k_2)\) via PRP, where \(k_2\) is the key of the PRP implementation.

  • Compute the encrypted index vector \(e_{\omega _t}=v_{f_{\omega _t}}\oplus \pi _{\omega _t}\) to facilitate the restoration of the original index vector.

  • Create a set \(S_{\omega _t}=\emptyset \) to record the subscripts of files containing \(\omega _t\) and add i to \(S_{\omega _t}\) if \(f_i\) contains \(\omega _t\). Then, compute the index table label

    $$\varDelta _{\omega _t,j}=[H_2(Z)^{-1}\cdot H_3(f_{\omega _t}||j)\cdot \prod \limits _{i\in S_{\omega _t}}(H_1(ID_i||j)^{-1})]^{ssk},$$

    where Z is the number of file updates, with an initial value of 1. Let \(\varDelta _{\omega _t}=\{\varDelta _{\omega _t,j}\}_{j=1,2,...,s}\) be the ITL set.

  • Set the encrypted index table \(I_{\mathcal{R}\mathcal{S}}=\{(f_{\omega _t},e_{\omega _t},\varDelta _{\omega _t})\}_{\omega _t\in W}\) and challenge generating auxiliary index \(I_{\mathcal {TPA}}=\{(f_{\omega _t}, \pi _{\omega _t})\}_{\omega _t\in W}\). Send \(I_{\mathcal{R}\mathcal{S}}\) and \(I_{\mathcal {TPA}}\) to \(\mathcal{R}\mathcal{S}\) and \(\mathcal {TPA}\), respectively.
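The PRF/PRP-based index encryption can be sketched as follows. This is a simplified illustration in which HMAC-SHA256 stands in for both the PRF \(f(\cdot ,k_1)\) and the mask derived from the PRP \(\pi (\cdot ,k_2)\); for brevity the mask is derived from the index address, whereas the paper applies the PRP to the index vector itself. All keys and values are hypothetical:

```python
import hmac
import hashlib

def prf(msg: bytes, key: bytes) -> bytes:
    """Pseudorandom function f(., k1), instantiated here with HMAC-SHA256."""
    return hmac.new(key, msg, hashlib.sha256).digest()

def prp_mask(addr: bytes, key: bytes, nbits: int) -> int:
    """Keyed mask standing in for pi(., k2), truncated to nbits bits."""
    stream, ctr = b"", 0
    while len(stream) * 8 < nbits:
        stream += hmac.new(key, addr + ctr.to_bytes(4, "big"),
                           hashlib.sha256).digest()
        ctr += 1
    return int.from_bytes(stream, "big") >> (len(stream) * 8 - nbits)

k1, k2 = b"prf-key", b"prp-key"   # hypothetical keys
n = 8                             # number of files
v = 0b01000010                    # example 8-bit index vector
f_w = prf(b"budget", k1)          # index address f_w hides the keyword itself
pi = prp_mask(f_w, k2, n)         # keyed mask over the vector
e = v ^ pi                        # encrypted index vector stored at RS

# Given the trapdoor component pi, the original vector is restored by XOR:
assert e ^ pi == v
```

The XOR structure is what lets the \(\mathcal{R}\mathcal{S}\) later recover \(v_{\omega _t}=e_{\omega _t}\oplus \pi _{\omega ''}\) in the Retrieval step without ever learning the keyword \(\omega _t\).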

(5) TrapdoorGen. Based on the searched keyword \(\omega '\), \(\mathcal {U}\) generates the search trapdoor \(T_{\omega '}\).

  • Compute the search index address \(f_{\omega '}=f(\omega ',k_1)\) and the encrypted permutation \(\pi _{\omega '}=\pi (v_{f_{\omega '}},k_2)\). Set the search trapdoor \(T_{\omega '}=(f_{\omega '},\pi _{\omega '})\).

Challenge Generating phase.

(6) InitBF. \(\mathcal {TPA}\) initializes the Bloom filter and sets its parameter tuple BF.

  • Compute \(Len=\frac{\ln 2\cdot |T_{\omega '}|}{\mathcal {R}}\). Create a Len-bit array \(\mathcal {B}\) and set \(\mathcal {B}[i]=0\) for all \(i\in [0,Len-1]\). Build \(BF=(\mathcal {B},\{SHA_i\}_{i=1,2,...,\mathcal {R}})\).

(7) FuzzyTDGen. Based on \(T_{\omega '}\) and \(I_{\mathcal {TPA}}\), \(\mathcal {TPA}\) updates BF and generates the fuzzy search trapdoor \(FT_\omega \). The pseudo-code is shown in Algorithm 1.

  • When \(T_{\omega '}\) is verified to be legitimate with \(I_{\mathcal {TPA}}\), extract the index address \(f_{\omega '}\) from \(T_{\omega '}\) and then update BF: \(\mathcal {B}[SHA_i(f_{\omega '})\!\bmod Len]_{i=1,2,...,\mathcal {R}}=1\).

  • Create a set \(F_{\omega '}\) and test every file's index address against BF. Write \(f_{\omega ''}\) into \(F_{\omega '}\) if \(\mathcal {B}[SHA_i(f_{\omega ''})\!\bmod Len] = 1\) for all \(i=1,2,...,\mathcal {R}\).

  • Search \(\pi _{\omega ''}\) with \(f_{\omega ''}\) from \(I_{\mathcal {TPA}}\), and set \(\Pi _{\omega '}=\{\pi _{\omega ''}\}\). Let \(FT_\omega =(F_{\omega '},\Pi _{\omega '})\) be the fuzzy search trapdoor.
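The InitBF/FuzzyTDGen logic can be sketched as below. The parameters (a 32-bit array, \(\mathcal {R}=3\) salted SHA-256 hashes, hypothetical address strings) are illustrative and deliberately small, so that false positives, i.e., the randomly covered low-frequency files, are plausible; they are not the paper's Len formula:

```python
import hashlib

class BloomFilter:
    """Len-bit array queried by R salted SHA-256 hashes."""
    def __init__(self, length: int, r: int):
        self.length, self.r = length, r
        self.bits = [0] * length

    def _positions(self, item: bytes):
        for i in range(self.r):
            h = hashlib.sha256(i.to_bytes(2, "big") + item).digest()
            yield int.from_bytes(h, "big") % self.length

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def maybe_contains(self, item: bytes) -> bool:
        return all(self.bits[pos] == 1 for pos in self._positions(item))

# TPA inserts the user's index address, then tests every file's address:
bf = BloomFilter(length=32, r=3)
target = b"addr-of-searched-keyword"     # hypothetical index address f_w'
bf.add(target)
all_addrs = [target] + [f"addr-{i}".encode() for i in range(100)]
fuzzy_set = [a for a in all_addrs if bf.maybe_contains(a)]
# The searched keyword is always covered; any extra members of fuzzy_set
# are false positives -- the low-frequency files audited "for free",
# which is exactly what hides the true auditing frequency.
```

Note that HFKA inverts the usual goal of Bloom filters: the false-positive rate is a feature to be tuned up, not an error to be minimized.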

Algorithm 1. FuzzyTDGen

(8) ChalGen. Based on \(FT_{\omega }\), \(\mathcal {TPA}\) generates the challenge information Chal.

  • Randomly choose a c-element subset \(Q=\{q_1,q_2,...,q_c\}\subseteq [1,s]\) and a random \(v_j\in Z_q^*\) for each \(j\in Q\).

  • Generate \(Chal=(FT_\omega ,Q,\{v_j\}_{j\in Q})\) and send it to \(\mathcal{R}\mathcal{S}\) and \(\mathcal{S}\mathcal{N}\).

Integrity Auditing phase.

(9) Retrieval. Based on Chal and \(I_{\mathcal{R}\mathcal{S}}\), \(\mathcal{R}\mathcal{S}\) selects the challenged file set \(S_{\omega _t}\) and corresponding index table label \(\varDelta _{\omega _t}\).

  • Take \(f_{\omega ''}\) from \(FT_{\omega }\) and retrieve \(f_{\omega _t}=f_{\omega ''}\) in \(I_{\mathcal{R}\mathcal{S}}\). Then, remove the \(e_{\omega _t}\) and \(\varDelta _{\omega _t}\) corresponding to \(f_{\omega _t}\) from \(I_{\mathcal{R}\mathcal{S}}\) and remove the \(\pi _{\omega ''}\) corresponding to \(f_{\omega ''}\) from \(FT_{\omega }\).

  • Compute \(v_{\omega _t}=e_{\omega _t}\oplus \pi _{\omega ''}\).

  • Initialize the challenged file set \(S_{\omega _t}=\emptyset \). For each \(i\in [1,n]\) with \(v_{\omega _t}[i]=1\), write i into \(S_{\omega _t}\), and send (\(S_{\omega _t},\varDelta _{\omega _t}\)) to \(\mathcal{S}\mathcal{N}\).

(10) ProofGen. Based on Chal, \(S_{\omega _t}\), \(\varDelta _{\omega _t}\), C and \(\phi \), \( \mathcal{S}\mathcal{N} \) generates the storage proof Prf.

  • With \(Chal=(FT_\omega ,Q,\{v_j\}_{j\in Q})\) from \(\mathcal {TPA}\), \(S_{\omega _t},\varDelta _{\omega _t}\) from \(\mathcal{R}\mathcal{S}\) and \(C=\{c_{ij}\},\phi =\{\sigma _{ij}\}\) from \(\mathcal {U}\), compute

    $$T=\prod \limits _{i\in S_{\omega _t}}\prod \limits _{j\in Q}\sigma _{ij}^{v_j}\cdot \prod \limits _{j\in Q}\varDelta _{\omega _t,j}^{v_j},$$
    $$\mu =\sum \limits _{i\in S_{\omega _t}}\sum \limits _{j\in Q}c_{ij}\cdot v_j,$$

    and then set \(Prf=(T,\mu )\).
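To sanity-check the aggregation algebra behind ProofGen and VerifyProof, the following toy sketch works in the subgroup of \(Z_p^*\) generated by g. No pairing is used: since the sketch holds ssk, the pairing check reduces to the exponent-level identity \(T=(g^\mu \cdot \prod _{j\in Q}[H_2(Z)^{-1}H_3(f_{\omega _t}||j)]^{v_j})^{ssk}\), which is verified directly. The modulus, hash encodings and sizes are all illustrative assumptions, not the paper's parameters:

```python
import hashlib
import random

p = 2**127 - 1                      # a Mersenne prime as a toy modulus
g = 3
inv = lambda x: pow(x, -1, p)       # modular inverse in Z_p^*

def h(tag: bytes, msg: bytes) -> int:
    """Hash into the group, modeled as g^{SHA-256(tag||msg)} mod p."""
    e = int.from_bytes(hashlib.sha256(tag + msg).digest(), "big")
    return pow(g, e, p)

rng = random.Random(7)
ssk = rng.randrange(2, p - 1)
n_files, s_blocks, Z = 2, 3, 1
f_w = b"index-address"                                 # hypothetical f_w
c = [[rng.randrange(p) for _ in range(s_blocks)] for _ in range(n_files)]
S = [0, 1]                                             # challenged file subscripts
Q = {0: rng.randrange(1, p), 2: rng.randrange(1, p)}   # challenge pairs (j, v_j)

# Authenticators sigma_ij = [H1(ID_i||j) * g^{c_ij}]^{ssk}
sigma = [[pow(h(b"H1", b"%d|%d" % (i, j)) * pow(g, c[i][j], p) % p, ssk, p)
          for j in range(s_blocks)] for i in range(n_files)]

# ITLs Delta_{w,j} = [H2(Z)^{-1} * H3(f_w||j) * prod_i H1(ID_i||j)^{-1}]^{ssk}
delta = {}
for j in range(s_blocks):
    acc = inv(h(b"H2", b"%d" % Z)) * h(b"H3", f_w + b"|%d" % j) % p
    for i in S:
        acc = acc * inv(h(b"H1", b"%d|%d" % (i, j))) % p
    delta[j] = pow(acc, ssk, p)

# Proof Prf = (T, mu) aggregated over the challenge
T, mu = 1, 0
for j, v in Q.items():
    for i in S:
        T = T * pow(sigma[i][j], v, p) % p
        mu = (mu + c[i][j] * v) % (p - 1)
    T = T * pow(delta[j], v, p) % p

# Verification: the H1 terms from sigma cancel against those in delta,
# leaving exactly g^mu * prod_j [H2(Z)^{-1} H3(f_w||j)]^{v_j} under ssk.
rhs = pow(g, mu, p)
for j, v in Q.items():
    rhs = rhs * pow(inv(h(b"H2", b"%d" % Z)) * h(b"H3", f_w + b"|%d" % j) % p,
                    v, p) % p
assert T == pow(rhs, ssk, p)
```

The cancellation of the \(H_1(ID_i||j)\) terms is the point of the ITL design: the verifier never needs the individual file identifiers, only \(f_{\omega _t}\), Z and the challenge coefficients.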

(11) VerifyProof. Based on Chal and Prf, \(\mathcal {TPA}\) verifies the validity of Prf.

  • Verify whether Prf is valid via the following equation:

    $$e(T,g)\overset{?}{=}e(g^\mu \cdot \prod \limits _{j\in Q}[H_2(Z)^{-1}\cdot H_3(f_{\omega _t}||j)]^{v_j},spk).$$

5 Security Analysis

Theorem 1

HFKA achieves storage robustness, i.e., semihonest storage nodes are unable to launch successful auditing frequency analysis attacks and auditing proof replay attacks using stored data.

Proof

We design two games to prove why HFKA can resist the above two attacks.

Game 1. We assume that adversary \(\mathcal {M}_1\) is a malicious \(\mathcal{S}\mathcal{N}\) that counts the frequency with which stored files are audited and attempts to delete files with minimal auditing frequency to free storage space for other users. Let M be the number of files stored at \(\mathcal {M}_1\) for a \(\mathcal {U}\), A the number of files selected by \(\mathcal {U}\) in each round, and R the number of auditing rounds. In the original keyword-based auditing paradigm, as the number of auditing rounds increases, \(\mathcal {M}_1\) can estimate the probability of each file being audited by the law of large numbers, i.e., learn \(\mathcal {U}\)'s file preferences, as shown in the following equation: \(P_i=\lim \limits _{R\rightarrow \infty }\frac{\sum _{j=1}^{R}X_{ij}}{A\cdot R},\) where \(X_{ij}=1\) if \(f_i\) is selected in the j-th round and \(X_{ij}=0\) otherwise. Given the estimated \( P_i \) of each file, \(\mathcal {M}_1\) can delete files with \( P_i\) close to 0 while hardly being detected by the \(\mathcal {U}\). In HFKA, as a result of the Bloom filter, an additional \(A\cdot p\) files are chosen randomly in each round, where p is the false positive rate of the Bloom filter. Consequently, each file not selected by \(\mathcal {U}\) has a probability of \(\frac{A\cdot p}{M-A}\) of being selected by the Bloom filter in each round, and the \( P_i \) observed by \(\mathcal {M}_1\) becomes \(P_i=\lim \limits _{R\rightarrow \infty }\frac{\sum _{j=1}^{R}X_{ij}}{A(1+p)\cdot R}+\frac{A\cdot p}{M-A}=P_i'+\frac{A\cdot p}{M-A},\) where \(X_{ij}=1\) if \(f_i\) is selected in the j-th round and \(X_{ij}=0\) otherwise. From this equation, we can see that even if the probability of a file being selected by \(\mathcal {U}\) tends toward 0, the probability of it being audited is still no less than \(\frac{A\cdot p}{M-A}\). Hence, it is not worthwhile for the \(\mathcal{S}\mathcal{N}\) to delete a \(\mathcal {U}\)'s data, since each deletion incurs a \(\frac{A\cdot p\cdot 100}{M-A}\)% detection risk in every auditing round.
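Game 1's bound can be checked empirically with a small Monte Carlo sketch; the parameters M, A, p and the round count are illustrative, not the paper's experimental settings:

```python
import random

def simulate_audits(M, A, p_fp, rounds, seed=0):
    """Per-file audit counts when the user always targets files 0..A-1 and the
    Bloom filter's false positives pull in each untargeted file independently
    with probability A*p/(M-A) per round, as in the Game 1 analysis."""
    rng = random.Random(seed)
    counts = [0] * M
    for _ in range(rounds):
        audited = set(range(A))                    # user-targeted (Type-A) files
        for i in range(A, M):                      # untargeted (Type-B) files
            if rng.random() < A * p_fp / (M - A):
                audited.add(i)
        for i in audited:
            counts[i] += 1
    return counts

counts = simulate_audits(M=100, A=10, p_fp=0.2, rounds=5000)
# Every untargeted file is audited with probability ~A*p/(M-A) per round,
# so none of them keeps a zero auditing frequency for long, and a deleting
# SN is caught with the corresponding per-round probability.
```

In this run, the targeted files are audited every round, while each untargeted file still accumulates a nonzero audit count, which is precisely the detection risk the bound \(\frac{A\cdot p}{M-A}\) expresses.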

Game 2. We assume that the adversary \(\mathcal {M}_2\) is a lazy \(\mathcal{S}\mathcal{N}\): when \(\mathcal {U}\) requests a data update, it does not perform the update operation, in order to save its own computing resources, and it attempts to use the original data to pass the auditing verification. We analyze whether \(\mathcal {M}_2\) can forge a proof with nonupdated data according to the auditing verification formula: \(e(T,g)=e(g^\mu \cdot \prod \limits _{j\in Q}[H_2(Z)^{-1}\cdot H_3(f_{\omega _t}||j)]^{v_j},spk).\) In this formula, apart from the T and \(\mu \) provided by \(\mathcal {M}_2\), Z and spk are public, and \(f_{\omega _t}\) and \((j,v_j)_{j\in Q}\) are generated by the \(\mathcal {TPA}\) itself. At first glance, only T and \(\mu \) are computed over the data blocks \(c_{ij}\), which might suggest that \(\mathcal {M}_2\) could pass the verification with nonupdated data. We provide more details to refute this: \(T=\prod \limits _{i\in S_{\omega _t}}\prod \limits _{j\in Q}\sigma _{ij}^{v_j}\cdot \prod \limits _{j\in Q}\varDelta _{\omega _t,j}^{v_j},\) \(\sigma _{ij}=[H_1(ID_i||j)\cdot g^{c_{ij}}]^{ssk},\) \(\varDelta _{\omega _t,j}=[H_2(Z)^{-1}\cdot H_3(f_{\omega _t}||j)\cdot \prod \limits _{i\in S_{\omega _t}}(H_1(ID_i||j)^{-1})]^{ssk},\) \(\mu =\sum \limits _{i\in S_{\omega _t}}\sum \limits _{j\in Q}c_{ij}\cdot v_j.\) Ignoring the challenge information \((j,v_j)_{j\in Q}\) and the challenged file set \(S_{\omega _t}=\{i\}\), \(\mu \) is related only to \(c_{ij}\), but T also embeds Z through \(\varDelta _{\omega _t,j}\). Let \(Z'\) be the update count at \(\mathcal {M}_2\); if the system performs z further updates that \(\mathcal {M}_2\) skips, the global number of updates is \(Z=Z'+z.\) The probability of \(H_2(Z)=H_2(Z')\) is negligible due to the collision resistance of the hash function, so a proof replayed from nonupdated data cannot pass the verification.\(\blacksquare \)

Theorem 2

HFKA achieves privacy protection, i.e., no entities other than users can obtain specific content about outsourced data or users through the Bloom filter and the index table while executing auditing tasks.

Proof

We analyze how HFKA achieves privacy protection to resist a semihonest \(\mathcal{S}\mathcal{N}\) or a curious \(\mathcal {TPA}\).

There are two strategies by which a semihonest \(\mathcal{S}\mathcal{N}\) can undermine \(\mathcal {U}\)’s privacy: (i) it deciphers \(\omega _t\) and then infers the file’s type or even its content from \(\omega _t\); (ii) it does not decipher \(\omega _t\) but speculates on the importance of files based on the frequency of the selected \(f_{\omega _t}\) and the audited files in each round. For the first strategy, HFKA adopts a PRF to protect \(\omega _t\). Only \(\mathcal {U}\) knows its keyword \(\omega _t\); other entities know only the keyword index address \(f_{\omega _t}=f(\omega _t, k_1)\). Because of the backward unpredictability of the PRF, \(\mathcal{S}\mathcal{N}\) cannot invert \(\omega _t\) from any subsequence of \(f_{\omega _t}\). For the second strategy, HFKA resists \(\mathcal{S}\mathcal{N}\)’s privacy attacks in two ways. Before analyzing the countermeasures, we show how \(\mathcal{S}\mathcal{N}\) could violate \(\mathcal {U}\)’s privacy: by querying the \(\mathcal {TPA}\) or \(\mathcal{R}\mathcal{S}\), the \(\mathcal{S}\mathcal{N}\) can easily obtain the \( f_{\omega _t} \) values in the auditing trapdoor of each round. After accumulating enough rounds of data, \(\mathcal{S}\mathcal{N}\) can guess the documents corresponding to certain high-frequency keywords, as well as the connections between keywords. Two countermeasures address this situation: first, the Bloom filter makes the per-round mapping relationships more ambiguous; second, distributed storage ensures that the auditing information received by any single \(\mathcal{S}\mathcal{N}\) is incomplete, further weakening the inferable relationships between keywords and documents and among keywords themselves. Hence, HFKA blocks frequency-distribution attacks. \(\mathcal {TPA}\) faces the same difficulty regarding \(f_{\omega _t}\rightarrow \omega _t\).
Since users are anonymous and the PRF is backward unpredictable, a curious \(\mathcal {TPA}\) learns nothing.\(\blacksquare \)
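The keyword index address \(f_{\omega _t}=f(\omega _t, k_1)\) can be instantiated with HMAC as the PRF. A minimal sketch, where the key value and the 10-byte truncation are illustrative choices (10 B matches the address size used in our experiments):

```python
import hashlib
import hmac

def index_address(keyword: str, k1: bytes, out_len: int = 10) -> bytes:
    """PRF-based keyword index address: deterministic for the key
    holder, but unpredictable (hence non-invertible) for SN and TPA,
    which never see k1 or the keyword itself."""
    return hmac.new(k1, keyword.encode(), hashlib.sha256).digest()[:out_len]

k1 = b"user-secret-prf-key"      # known only to the user (illustrative)
addr = index_address("finance", k1)
# The TPA can query by address without ever learning the keyword.
print(addr.hex())
```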

Theorem 3

HFKA achieves data security, which means that no external adversary can obtain any details about the data in the event that outsourced data are intercepted in transit.

Proof

In the whole process of information interaction, the data involved can be classified as keyword-searchable data (the index address \(f_{\omega _t}\), search vector \(\pi _{\omega _t}\), encrypted index vector \(e_{\omega _t}\) and ITL \(\varDelta _{\omega _t,j}\)) and auditing information (the encrypted data \(c_{ij}\) and data authenticator \(\sigma _{ij}\)). The security of \(f_{\omega _t}\) rests on the PRF, and the security of \(\pi _{\omega _t}\) rests on the PRP. The PRF is indistinguishable, as proven in detail in [36]; that is, an attacker cannot determine whether the same PRF was selected and therefore cannot further guess the function’s input. The PRP likewise achieves indistinguishability based on computationally hard mathematical problems. Since \(e_{\omega _t}\) is designed to recover the original index vector \(v_{\omega _t}\), its security need not be considered separately. The security of \(c_{ij}\) rests on AES-128. AES resists differential cryptanalysis, linear cryptanalysis and other basic attacks; details are provided in [38]. Moreover, AES with a 128-bit key offers the same security strength as RSA-3072 and ECC-256, which is estimated to remain sufficient until 2040. The security of \(\sigma _{ij}\) and \(\varDelta _{\omega _t,j}\) rests on the discrete logarithm problem (DLP) [37], stated as follows: for \(a,g\in G\), with G a cyclic group, there exists b such that \(g^b=a\), but finding such b is computationally infeasible. For \(\sigma _{ij}\), \(\mathcal {M}\) must solve the DLP twice to obtain the data block \(c_{ij}\), which is still protected by AES-128. \(\blacksquare \)
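The structure of the authenticator \(\sigma _{ij}=[H_1(ID_i||j)\cdot g^{c_{ij}}]^{ssk}\) can be sketched in a toy multiplicative group modulo a prime. The real scheme works in a pairing-friendly elliptic-curve group; the modulus, generator, key and block value below are all illustrative:

```python
import hashlib

# Toy group Z_p^* with the Mersenne prime 2^127 - 1 (illustrative
# only; deployments use pairing-friendly elliptic-curve groups).
p = 2**127 - 1
g = 3

def h1(label: str) -> int:
    """Hash-to-group stand-in for H1."""
    return int.from_bytes(hashlib.sha256(label.encode()).digest(), "big") % p

ssk = 0x1234567890ABCDEF   # signing secret (illustrative)
c_ij = 42                  # one encrypted data block (illustrative)

# sigma_ij = [H1(ID_i || j) * g^{c_ij}]^{ssk}
sigma = pow(h1("ID_1||7") * pow(g, c_ij, p) % p, ssk, p)

# Recovering c_ij from sigma would require solving the DLP twice;
# even then the block is still AES-encrypted.
print(sigma % 1000)
```

The exponentiation is multiplicative, i.e., \((a\cdot b)^{ssk}=a^{ssk}\cdot b^{ssk}\), which is what lets the prover aggregate many \(\sigma _{ij}^{v_j}\) into a single proof element T.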

6 Performance Evaluation

In this section, we evaluate the performance of HFKA on a Lenovo desktop computer equipped with an Intel Core i5 CPU and 8 GB of RAM. All cryptographic operations in HFKA, such as the PRP, PRF, Bloom filter and bilinear pairing, are implemented with the PBC library. We conduct detailed experiments to demonstrate HFKA’s distinctive capability of balancing the auditing distribution to hide files’ audited frequencies.

6.1 Auditing Distribution

The Bloom filter plays an important role in HFKA: it hides the frequency distribution of keyword-based auditing. We illustrate this role through experiments from three perspectives. Experiment 1 investigates the factors influencing the false positive rate of the Bloom filter and identifies the parameters most suitable for HFKA. Experiment 2 demonstrates HFKA’s frequency-hiding function. Experiment 3 further improves the fuzzing capability of the Bloom filter. We assume that the \(\mathcal {K}\) keywords are uniformly distributed over the n files, so the distribution of keywords is representative of the auditing distribution of files.

Fig. 4.
figure 4

Effect of array length on false positive rate of BF

Based on the original introduction of the Bloom filter in [34], its false positive rate is \(p=(1-e^{\frac{-k\cdot m}{n}})^k\), where k is the number of hash functions, m is the size of the input, and n is the length of the BF array. In Experiment 1, we construct the Bloom filter and control its false positive rate by adjusting these parameters. Figure 4 shows the variation in the false positive rate under different parameters. Comparing the four subplots (a), (b), (c) and (d) in Fig. 4, we find that the larger the input size is, the larger the false positive rate. Within any single subplot, the false positive rate decreases as either the number of hash functions or the length of the BF array increases. Here, we set the size of each keyword index address to 10 B, so an input size of 60 B means that the user selects 6 keywords in a search trapdoor. According to the fuzzing requirements of HFKA and the experimental data, the most suitable setting is an input size between 90 B and 110 B, an array length of 1000 bits, and 3 hash functions, which yields a false positive rate of approximately 55%.
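The parameter search of Experiment 1 can be reproduced from the closed-form rate. A small helper, using the paper's notation (m inserted elements, n-bit array, k hash functions); note that the concrete 55% operating point additionally depends on how byte-level inputs map to filter elements, so the absolute values here are illustrative:

```python
import math

def bloom_fpr(k: int, m: int, n: int) -> float:
    """Bloom filter false positive rate p = (1 - e^{-k*m/n})^k,
    with k hash functions, m inserted elements and an n-bit array."""
    return (1 - math.exp(-k * m / n)) ** k

# The rate rises with more inserted elements and falls as the
# array grows, matching the trends observed in Fig. 4.
print(bloom_fpr(k=3, m=10, n=1000) < bloom_fpr(k=3, m=10, n=500))   # True
print(bloom_fpr(k=3, m=10, n=1000) < bloom_fpr(k=3, m=20, n=1000))  # True
```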

In Experiment 2, we use keywords from the simulated dataset to perform fuzzy matching experiments. In each round, we select several keywords to update the Bloom filter, then input all keywords for matching, and finally collect the matching results. The total number of keywords is 50, and 9 or 10 keywords are chosen in each round; with the 55% false positive rate, 4 or 5 additional keywords are also matched. Figure 5 shows the distribution of the selected keywords after 20 rounds. Horizontally, the false positive rate of the Bloom filter in each round can be seen, with red \(\bullet \) marking the keywords selected by \(\mathcal {U}\) and blue ✖ marking the fuzzy keywords selected by the Bloom filter. Vertically, the selection frequency of each keyword can be seen. The more red \(\bullet \) appear in a column, the higher the probability that the keyword is selected by \(\mathcal {U}\) (e.g., \(\omega _{10}\), \(\omega _{24}\)), while the blue ✖ balance the selection probability of the other keywords (e.g., \(\omega _{20}\), \(\omega _{40}\)). The frequency of each keyword is counted and compared with traditional auditing in Fig. 7: under HFKA, the audited frequencies of high- and low-frequency keywords are evened out, approaching the uniform distribution that PPA exhibits. Moreover, we consider the total auditing overhead. To guarantee a detection confidence of 99%, the TPA must randomly select 460 of 10,000 files in the PPA model. In the PKA model [4], however, the user is concerned with only 2% of the total files, i.e., 200 of the 10,000 files are selected. In HFKA, owing to the 55% false positive rate of the Bloom filter, an additional 110 files are selected, for a total of 310 files per round. Therefore, HFKA audits 150 fewer files than PPA each time, saving 32.6% of the auditing overhead.
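The overhead comparison at the end of Experiment 2 is straightforward to check:

```python
# Files audited per round under each model (numbers from the text).
ppa_files = 460                       # PPA: 99% detection confidence over 10,000 files
pka_files = 200                       # PKA: the user's 2% of 10,000 files
hfka_extra = round(pka_files * 0.55)  # Bloom-filter false positives: 110
hfka_files = pka_files + hfka_extra   # 310 files per round

saving = (ppa_files - hfka_files) / ppa_files
print(hfka_files, f"{saving:.1%}")    # 310 32.6%
```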

Fig. 5.
figure 5

The keyword distribution of Experiment 2

Fig. 6.
figure 6

The use of salt-hashing BF in Experiment 3

Fig. 7.
figure 7

The auditing frequency comparison between HFKA and PPA

In Experiment 3, we fix the configuration of the Bloom filter and the search trapdoor but add a different salt to the hash functions in each round. We adopt a controlled comparison to verify the efficacy of the salted hash functions. To rule out coincidence in a single experimental group, we set up three parallel groups, where the \( 1^{st} \), \( 2^{nd} \) and \( 3^{rd} \) groups have different Bloom filter parameter settings. As shown in Fig. 6, the red \(\bullet \) in each group indicate the input of the original Bloom filter, while the dark-blue \(\bullet \) indicate the input of the salted Bloom filter, which is identical to the red \(\bullet \). Correspondingly, the light-blue ✖ indicate the keywords fuzzed by the original Bloom filter, while the purple ✖ are the keywords fuzzed by the salted Bloom filter. The results of these groups show that even for exactly the same input, different outputs are obtained with the salting method.
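The salting trick of Experiment 3 can be sketched as follows: each round mixes a fresh salt into the hash functions, so identical inputs set different bit positions and yield different fuzzy matches. The filter below is deliberately tiny (8 bits) so that false positives are visible; real parameters follow Experiment 1:

```python
import hashlib

class SaltedBloom:
    """Minimal Bloom filter whose k hash functions are salted per
    round, so the same trapdoor produces different fuzzy matches."""
    def __init__(self, n_bits=8, k=3, salt=b""):
        self.bits = bytearray(n_bits)
        self.n, self.k, self.salt = n_bits, k, salt

    def _positions(self, item: bytes):
        for i in range(self.k):
            h = hashlib.sha256(self.salt + bytes([i]) + item).digest()
            yield int.from_bytes(h, "big") % self.n

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def query(self, item: bytes) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))

round1 = SaltedBloom(salt=b"round-1")
round2 = SaltedBloom(salt=b"round-2")
for kw in (b"w10", b"w24"):          # the user's chosen keywords
    round1.add(kw)
    round2.add(kw)

# Inserted keywords always match in both rounds; the false positives
# among the remaining keywords typically differ because the salts do.
fuzzy1 = {i for i in range(50) if round1.query(b"w%d" % i)}
fuzzy2 = {i for i in range(50) if round2.query(b"w%d" % i)}
print(sorted(fuzzy1), sorted(fuzzy2))
```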

6.2 Computation Overhead

Table 1 lists the symbols used to represent the operations in HFKA. The main calculations are hash, exponentiation, multiplication and bilinear pairing operations. We ignore the XOR, PRF, PRP and addition operations, which have minimal computational cost.

Table 1. Notation description

Table 2 shows the concrete operations on the involved data in each phase of HFKA. The data preprocessing phase includes data authenticator generation and index label generation. The challenge generation phase includes Bloom filter updating and fuzzy matching. The integrity auditing phase includes proof generation and verification. Compared with the method of Gao et al. [4], HFKA adds only the Bloom filter updating and fuzzy matching processes, which introduce \(\mathcal {R} \cdot (|F_{\omega '}| + \mathcal {K})\) HMAC operations in each round of auditing. We assessed the run time of keyed SHA (HMAC) operations with 128-bit keys and 1024-byte inputs in Python; each operation takes only 0.03 ms, which is negligible relative to the whole system.
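The micro-benchmark for the extra HMAC work can be reproduced with the standard library; key size and input length follow the text, while absolute timings of course depend on the hardware:

```python
import hashlib
import hmac
import os
import timeit

key = os.urandom(16)    # 128-bit key
msg = os.urandom(1024)  # 1024-byte input, as in the text

n_iters = 10_000
t = timeit.timeit(lambda: hmac.new(key, msg, hashlib.sha256).digest(),
                  number=n_iters)
per_op_ms = t / n_iters * 1e3
# On a desktop-class CPU this is well under a millisecond per call.
print(f"{per_op_ms:.4f} ms per HMAC")
```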

Table 2. Computation overhead

6.3 Storage Overhead

Here, we consider the storage overhead of HFKA. Owing to the keyword-file index, HFKA incurs some additional storage overhead compared with PPA. Table 3 lists the additional data structures and the storage space they consume as the number of files increases. According to the experimental results, with 100 files, \(\mathcal{R}\mathcal{S}\) requires only 0.37 MB of space to store \(I_{\mathcal{R}\mathcal{S}}\), and \(\mathcal {TPA}\) requires only 0.368 MB. With 1000 files, these values become 16.64 MB and 16.62 MB, respectively. Even when the number of files grows to 10,000, the size of \(I_{\mathcal{R}\mathcal{S}}\) is only 466.51 MB and that of \(I_{\mathcal {TPA}}\) is 466.31 MB. Notably, the size of a Bloom filter with a 55% false positive rate is always 1 K bits, as elaborated in Sect. 6.1. Furthermore, the sizes of \(I_{\mathcal{R}\mathcal{S}}\) and \(I_{\mathcal {TPA}}\) are directly related to the number of keywords, which in turn grows with the number of files.

Table 3. Storage overhead

7 Conclusion and Future Work

In this paper, we propose a keyword-based auditing scheme, HFKA, to address the auditing frequency leakage problem in smart government. We utilize the Bloom filter to achieve fuzzy matching of specified keywords. Meanwhile, we design an index table label to resist replay attacks by lazy server nodes and to generate storage proofs without exposing any keyword-file privacy. We also separate the retrieval work from the storage work using a dedicated retrieval server, which improves retrieval efficiency, reduces the storage and computing costs of the storage node, and further preserves keyword privacy. Finally, the security of the scheme is established through rigorous security analysis, and its feasibility is demonstrated through performance evaluation.

In future work, we will examine the security protection and economic viability of the keyword-based auditing paradigm in real-world applications and strengthen the integration of the auditing model with smart government. First, to allow batch auditing and dynamic auditing across various storage nodes, we will enhance the data label construction of outsourced data. Second, we will attempt to store massive outsourced and more finely partitioned data via a directed acyclic graph (DAG). We will then examine the difficulties of implementing smart government affairs and work to overcome the security issues and performance bottlenecks in data sharing and integrity auditing.