
1 Introduction

Electronic medical records (EMRs) are widely used in healthcare systems as the healthcare industry moves toward digitization. Many hospitals and medical institutes use EMRs for online diagnosis, health screening, and new drug development. Since a large number of EMRs and images are generated every day, most hospitals and institutes store these data on public cloud storage platforms. However, medical records contain highly sensitive personal information that should not be outsourced without protection. A simple way to protect EMR information is to encrypt it before outsourcing, but this reduces the accuracy and efficiency of EMR retrieval. To solve this problem, Song et al. [1] proposed the first searchable encryption scheme, in which some basic search approaches over encrypted data were discussed. Boneh et al. [2] proposed the first public key encryption scheme with keyword search. Since then, searchable encryption has become an important technology for privacy-preserving retrieval of encrypted data [3,4,5].

Another issue that affects retrieval accuracy and efficiency is the correctness of the input keywords. Errors in the input keywords cause inaccurate search results and even retrieval failure. In this paper, we use our previously proposed Probability-Levenshtein based Spelling Correction (PLSC) algorithm [6, 7] for recommended keyword ranking and medical keyword correction, so that the scheme supports fuzzy multi-keyword input and produces a more accurate search query. We then propose a relevance score encryption and calculation algorithm based on homomorphic encryption, so that the cloud server can securely compute the sum of the keyword relevance scores in each EMR. In addition, a proxy is introduced in our scheme to support multiple EMR owners and multi-keyword relevance score ranking. Finally, we compare our scheme, PREMR, with recently published searchable encryption schemes for performance evaluation. Our contributions can be summarized as follows.

  • In order to test the accuracy of PLSC, we build a library that contains 2000 medical records with more than 3000 medical words. Based on this library, we compare our work with Norvig’s spelling corrector and edit distance.

  • We design a relevance score encryption and ranking algorithm based on homomorphic encryption to support secure keyword-based query and retrieval. The algorithm adopts Paillier-based encryption to sum up encrypted multi-keyword relevance scores.

  • We build an encrypted EMR retrieval system that supports data outsourcing and dynamic updates for multiple EMR owners. We also compare the performance of PREMR with several similar searchable encryption schemes.

The rest of the paper is organized as follows. Section 2 reviews related work on searchable encryption. Section 3 introduces the EMR template and evaluates PLSC correction. Section 4 presents the construction and definitions of our scheme. Section 5 describes the PREMR scheme in detail. Section 6 gives a theoretical security analysis. Section 7 reports the implementation results and comparison. Section 8 concludes the paper.

2 Related Work

Most research on searchable encryption aims to improve the accuracy and security of data retrieval. To improve accuracy, Sun et al. [9] proposed a multi-keyword search scheme that uses a vector space model and a cosine measure with TF (term frequency) × IDF (inverse document frequency) to provide order-preserving document retrieval. Kabir et al. [10] improved Sun’s scheme by writing the plaintext TF values into the index tree in order. However, the plaintext TF values may leak information about keywords and documents. To improve the security of encrypted document retrieval, Liu et al. [11] proposed a verifiable searchable encryption scheme that can verify the correctness of retrieval results over dynamic data collections. Du et al. [12] proposed a searchable symmetric encryption scheme that combines access control and boolean queries. Liu et al. [13] combined an attribute hierarchy with comparison-based encryption to achieve dynamic access control over encrypted personal health records. These searchable encryption schemes are usually considered a way to guarantee data privacy and search efficiency. There is also much research on searchable encryption schemes with multi-keyword support [14, 15], and such schemes are applied in many other areas [16]. However, these schemes have limitations in retrieval efficiency, accuracy, or privacy. In cloud computing applications, especially in medical cooperation projects, searchable encryption should support precise and efficient retrieval of outsourced medical records for further diagnosis.

Another research topic in searchable encryption is fuzzy search over multiple keywords. Li et al. [13] proposed a scheme that uses kNN and Euclidean distance to select the k nearest database records, but its search accuracy is not satisfactory. Traditional spelling correction algorithms, such as those based on the Levenshtein distance, do not achieve high correction accuracy when a spelling error involves more than two letters. Zhong [8] proposed a fuzzy search scheme that uses k-grams to construct a fuzzy keyword set and the Jaccard coefficient to calculate keyword similarity. Gnanasekaran [18] converted each keyword into a vector and used LSH (locality-sensitive hashing) to support fuzzy keyword search. Aritomo [19] used simhash to realize fuzzy keyword search and a VP-tree to improve search accuracy. K. Wang [20] used LSH to build the index and a Bloom filter to realize fuzzy search over multiple keywords. However, these schemes do not consider the misalignment of letters within keywords, which may lead to less accurate search results.

3 Spelling Correction on Electronic Medical Records

3.1 Electronic Medical Records Templates

In order to support fuzzy search, we adopt our previous PLSC (Probability-Levenshtein based Spelling Correction) algorithm [6] to correct the ambiguous input search words. We build a library with 2000 EMRs that contain 3000 common medical terms. The medical terms are selected from [17]. The format example of EMR is shown in Fig. 1. This is a typical EMR, which contains private information such as the patient’s name, address and phone number, and also sensitive information such as the patient’s condition, diagnosis and prescription.

3.2 Spelling Correction Evaluation of EMRs

We evaluate the PLSC algorithm using our EMR library. The experiment tests the correction accuracy of PLSC, Norvig’s spelling corrector, and edit distance. Table 1 gives the correction probability of the three spelling correctors when the spelling errors in each keyword are random. The results show that PLSC gives more accurate candidate corrections, especially when there are more than two random errors in the input keywords.
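To make the edit-distance baseline in this comparison concrete, the following is a minimal sketch of a Levenshtein-based corrector. It is not PLSC, whose probability model is not described in this section; the vocabulary `VOCAB` and the function names are hypothetical and chosen only for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def correct(word: str, vocabulary: list[str]) -> str:
    """Return the vocabulary term closest to `word` (ties: first in the list)."""
    return min(vocabulary, key=lambda term: levenshtein(word, term))


# Hypothetical mini-vocabulary; the library in this paper has about 3000 terms.
VOCAB = ["diabetes", "hypertension", "pneumonia", "asthma"]
print(correct("diabetis", VOCAB))  # diabetes
```

A corrector of this kind ranks candidates by distance alone, which is one plausible reason its accuracy drops once a keyword contains several random errors and many vocabulary terms become equally close.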

Fig. 1. Medical Record Template

Table 1. Accuracy comparison with random errors

4 System Construction and Preliminaries

This section first introduces our system structure, and then describes threat models, system goals, notations, and cryptographic preliminaries.

4.1 System Model

There are four principals in the PREMR system: EMR owners, the Proxy, the cloud service provider (CSP), and users. EMR owners are responsible for medical data encryption and index building. They upload encrypted EMRs to the CSP and send their indexes to the Proxy. The Proxy merges the indexes from all EMR owners, encrypts the merged index, and uploads the resulting secure index to the CSP. Meanwhile, EMR owners distribute decryption keys to authorized users via a secure channel.

In our scheme, the EMR storage server is considered an “honest-but-curious” entity. Specifically, the storage server honestly follows the protocol, but curiously analyzes the index, stored data, and queries to learn more about the plaintext EMRs. EMR owners are assumed to be honest because they hold the original plaintext records. The Proxy is a trusted entity that builds the secure index for outsourced data and generates trapdoors for users’ search queries. Users are untrusted and may collude with others to learn more about the encrypted EMRs. Secret keys are assumed to be uncompromised.

4.2 Notations

  • \({\text{R}}\): plaintext EMR set, \({\text{R}}\) = {\(R_{1}\), \(R_{2}\), …, \(R_{n}\)};

  • \({\text{R}^{\prime}}\): encrypted EMR set, \({\text{R}^{\prime}}\) = {\(R_{1}^{\prime}\), \(R_{2}^{\prime}\), …, \(R_{n}^{\prime}\)};

  • \(ID\): EMR identifier in plaintext, \(ID\) = {\(id_{1}\), \(id_{2}\), …, \(id_{n}\)};

  • \(ID^{\prime}\): encrypted EMR identifier, \(ID^{\prime}\) = {\(id_{1}^{\prime}\), \(id_{2}^{\prime}\), …, \(id_{n}^{\prime}\)};

  • \({\text{SW}}\): keywords set in plaintext, \({\text{SW}}\) = {\(W_{1}\), \(W_{2}\), …, \(W_{m}\)};

  • \({\text{SW}^{\prime}}\): keywords set in ciphertext, \({\text{SW}^{\prime}}\) = {\(W_{1}^{\prime}\), \(W_{2}^{\prime}\), …, \(W_{m}^{\prime}\)};

  • \(S_{i,j}\): plaintext relevance score of keyword \(W_{i}\) in \(R_{j}\); \(S_{j}\): sum of relevance scores in \(R_{j}\) in plaintext;

  • \(S_{i,j}^{\prime}\): encrypted relevance score of keyword \(W_{i}\) in \(R_{j}\); \(S_{j}^{\prime}\): sum of relevance scores in \(R_{j}\) in ciphertext;

  • \({\mathbb{Z}}_{{y^{2} }}^{*}\): the set of integers between 1 and \(y^{2}\) that are coprime to \(y^{2}\).

Our PREMR scheme includes three major processes: EMR index building, encrypted EMR searching and queue-based ciphertext retrieval.

4.3 Cryptographic Preliminaries

In the PREMR scheme, we adopt both a symmetric key algorithm and homomorphic encryption to protect the EMRs and the values of the relevance scores. The symmetric key algorithm (SKA) is used to encrypt keywords, EMR identifiers, and EMRs. The homomorphic encryption (HE) is used to encrypt the relevance score of each keyword in every EMR. The algorithms involved in the PREMR system are defined as follows.

  • SKA = (T, K, ENC1, DEC1) is a symmetric key encryption algorithm, where T is the input data, K is the symmetric key, ENC1 is the encryption algorithm, and DEC1 is the decryption algorithm.

  • HE = (RS, PK, SK, ENC2, DEC2) is a Paillier-based homomorphic encryption, where RS is the relevance score of a keyword, PK is the public key used to encrypt RS, SK is the secret key, and ENC2 and DEC2 are the encryption and decryption algorithms. PK and SK are generated as follows:

    1. Choose two large prime numbers \(p\) and \(q\) such that gcd(pq, (p − 1)(q − 1)) = 1. Let \(n = pq\), Φ(n) = (p − 1)(q − 1), and \(\lambda\) = lcm(p − 1, q − 1).

    2. Encryption is a map \(Z_{n} \times Z_{n}^{*} \to Z_{{n^{2} }}^{*}\), where \(\left| {Z_{{n^{2} }}^{*} } \right| = {\Phi }\left( {n^{2} } \right) = n{\Phi }\left( n \right)\). \(g\) is an element of \(Z_{{n^{2} }}^{*}\), and \(r \in \left( {0,n} \right)\) is a random integer with gcd(\(r\), \(n\)) = 1, so that \(r^{p - 1} \equiv 1 \pmod{p}\), \(r^{\lambda } \equiv 1 \pmod{n}\), and \(r^{n\lambda } \equiv 1 \pmod{n^{2}}\).

    3. Let \(L\left( x \right) = \frac{x - 1}{n}\) and compute the modular multiplicative inverse \(\mu = \left( L\left( g^{\lambda } \bmod n^{2} \right) \right)^{-1} \bmod n\).

    4. The public key is PK = (\(n\), \(g\)) and the secret key is SK = \(\lambda\).
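The key generation steps above can be sketched as follows. This is a toy illustration under stated assumptions, not the implementation evaluated in this paper: the primes are tiny and insecure, \(g = n + 1\) is one common choice of \(g\) that the scheme does not mandate, and \(\mu\) is stored next to \(\lambda\) for convenience. All function names are hypothetical.

```python
import math
import secrets


def L(x: int, n: int) -> int:
    """L(x) = (x - 1) / n, defined for x congruent to 1 mod n."""
    return (x - 1) // n


def keygen(p: int, q: int):
    """Steps 1-4: derive (n, g) and (lambda, mu) from two primes."""
    n = p * q
    assert math.gcd(n, (p - 1) * (q - 1)) == 1
    lam = math.lcm(p - 1, q - 1)
    g = n + 1  # assumption: a common choice, not fixed by the scheme
    mu = pow(L(pow(g, lam, n * n), n), -1, n)
    return (n, g), (lam, mu)


def encrypt(pk, m: int) -> int:
    """c = g^m * r^n mod n^2 with a fresh random r coprime to n."""
    n, g = pk
    while True:
        r = secrets.randbelow(n - 1) + 1  # r in [1, n - 1]
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)


def decrypt(pk, sk, c: int) -> int:
    """m = L(c^lambda mod n^2) * mu mod n."""
    n, _ = pk
    lam, mu = sk
    return (L(pow(c, lam, n * n), n) * mu) % n


pk, sk = keygen(61, 53)  # toy primes, far too small for real use
print(decrypt(pk, sk, encrypt(pk, 42)))  # 42
```

Because `encrypt` draws a fresh `r` each time, two encryptions of the same score are, with high probability, different ciphertexts, which is what keeps equal relevance scores from being recognizable in the index.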

5 Encrypted EMR Searching with Privacy Preserving

This section introduces the index building process and EMR searching. Then it describes the relevance score calculation and ranking algorithms.

5.1 EMR Index Building

Before encrypting EMRs, the owners first extract keywords, and build inverted plaintext indexes. Subsequently, EMR owners encrypt and upload the medical records to the cloud server, and at the same time send the plaintext index to the Proxy. Proxy collects indexes from all EMR owners, merges and builds up the secure inverted index.

Plaintext Index Building and EMR Encryption.

EMR owners first extract keywords from EMRs, calculate the TF-IDF value of each keyword as its relevance score, and then build the plaintext index \({\text{I}} = \{ {\text{I}}_{1}, {\text{I}}_{2}, \ldots, {\text{I}}_{m} \}\), where \({\text{I}}_{i} = (W_{i}, \bigcup_{j} \langle id_{j}, S_{i,j} \rangle)\) is the inverted index of keyword \(W_{i}\), \(id_{j}\) is the identifier of an EMR that contains \(W_{i}\), and \(S_{i,j}\) denotes the TF-IDF score of keyword \(W_{i}\) in the EMR with identifier \(id_{j}\).
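A minimal sketch of how an owner might build such an inverted index is given below. The TF-IDF variant used here (relative term frequency times the natural logarithm of the inverse document frequency) is an assumption, since the exact formula is not stated in this section; `build_index` and the sample records are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_index(records: dict[str, list[str]]) -> dict[str, list[tuple[str, float]]]:
    """Map each keyword W_i to its posting list of (id_j, S_ij) pairs."""
    n_docs = len(records)
    doc_freq = Counter()
    for words in records.values():
        doc_freq.update(set(words))  # count each keyword once per record

    index = defaultdict(list)
    for rec_id, words in records.items():
        if not words:
            continue  # an empty record contributes no postings
        for word, count in Counter(words).items():
            tf = count / len(words)
            idf = math.log(n_docs / doc_freq[word])  # assumed IDF variant
            index[word].append((rec_id, tf * idf))
    return dict(index)


# Hypothetical records, already reduced to extracted keywords.
records = {
    "id1": ["fever", "cough", "fever"],
    "id2": ["cough", "asthma"],
}
index = build_index(records)
print(index["fever"])
```

With this variant a keyword that occurs in every record gets a score of zero, so it never influences the ranking; other TF-IDF variants smooth the IDF term to avoid that.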

Furthermore, each EMR owner runs SKA(∗, K1, ENC1) to encrypt the EMR identifiers and the EMRs. Equation (1) describes the encryption process.

$$ \begin{array}{c} id_{j}^{\prime} \leftarrow {\text{SKA}}(id_{j}, K_{1}, {\text{ENC1}}) \\ R_{j}^{\prime} \leftarrow {\text{SKA}}(R_{j}, K_{1}, {\text{ENC1}}) \\ C = \left\{ \left( id_{1}^{\prime}, R_{1}^{\prime} \right), \left( id_{2}^{\prime}, R_{2}^{\prime} \right), \ldots, \left( id_{n}^{\prime}, R_{n}^{\prime} \right) \right\} \end{array} $$
(1)

EMR owners then send I to the Proxy, and upload C to the CSP.

Indexes Merging and Encryption.

Our system allows multiple EMR owners to outsource their medical records. The Proxy is introduced to merge the owners’ indexes and build the secure index, so that retrieval remains accurate and efficient even though the EMRs are encrypted with different keys. The secure index \({\mathbf{I^{\prime}}}\) is generated in the following steps.

  • Step 1. Proxy receives multiple indexes from different EMR owners and merges them into a new index based on keywords.

  • Step 2. Proxy runs SKA(∗, K2, ENC1) to encrypt keywords and EMR identifiers.

    $$ \begin{array}{c} W_{i}^{\prime} \leftarrow {\text{SKA}}(W_{i}, K_{2}, {\text{ENC1}}) \\ id_{j}^{\prime\prime} \leftarrow {\text{SKA}}(id_{j}, K_{2}, {\text{ENC1}}) \end{array} $$
    (2)

    Comparing Eq. (1) and Eq. (2), we can see that different encryption keys (K1, K2) are used to encrypt the same EMR identifier (\(id_{j}\)), so that index entries cannot be linked to the stored EMRs.

  • Step 3. Proxy runs HE(RS, PK, ENC2) to encrypt each relevance score \(S_{i,j}\) into \(S_{i,j}^{\prime}\). The encryption process is defined in Eq. (3).

    $$ S_{i,j}^{\prime} = g^{S_{i,j}} \times r^{n} \bmod n^{2} $$
    (3)

Finally, Proxy establishes the secure index \({\mathbf{I^{\prime}}}\) and uploads it to the CSP. The format of the secure index is defined in Eq. (4).

$$ {\text{I}}^{\prime} = \left\{ {\text{I}}_{1}^{\prime}, {\text{I}}_{2}^{\prime}, \ldots, {\text{I}}_{m}^{\prime} \right\}, \quad {\text{I}}_{i}^{\prime} = \left\{ W_{i}^{\prime}, \bigcup\nolimits_{j} \left\langle id_{j}^{\prime\prime}, S_{i,j}^{\prime} \right\rangle \right\} $$
(4)
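Steps 1 and 2 of the index construction can be sketched as below. Two assumptions are worth stating loudly. First, HMAC-SHA256 is used as a stand-in for SKA(∗, K, ENC1) purely because it is a deterministic keyed function available in the standard library; unlike SKA it cannot be decrypted, so a Proxy built this way would need to keep a token-to-plaintext table. Second, relevance scores are left in plaintext here, whereas Step 3 encrypts them with Eq. (3). The names `token`, `merge_indexes`, and `protect_index` are hypothetical.

```python
import hashlib
import hmac
from collections import defaultdict


def token(key: bytes, value: str) -> str:
    """Deterministic keyed token; HMAC stand-in for SKA(value, key, ENC1)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def merge_indexes(owner_indexes):
    """Step 1: union the posting lists of all owners, keyword by keyword."""
    merged = defaultdict(list)
    for index in owner_indexes:
        for word, postings in index.items():
            merged[word].extend(postings)
    return dict(merged)


def protect_index(merged, k2: bytes):
    """Step 2: replace keywords and identifiers by tokens under K2."""
    return {
        token(k2, word): [(token(k2, rec_id), score) for rec_id, score in postings]
        for word, postings in merged.items()
    }


# Hypothetical indexes from two owners (scores still in plaintext).
owner_a = {"fever": [("a1", 0.46)]}
owner_b = {"fever": [("b7", 0.23)], "cough": [("b7", 0.35)]}
secure = protect_index(merge_indexes([owner_a, owner_b]), b"proxy-key-K2")

# The same identifier under K1 and K2 yields unrelated tokens.
print(token(b"owner-key-K1", "a1") == token(b"proxy-key-K2", "a1"))  # False
```

The final comparison mirrors the unlinkability argument made for Eq. (1) and Eq. (2): the identifier stored with the encrypted EMR and the identifier stored in the index are derived under different keys.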

5.2 Encrypted EMR Retrieval

When a user searches for a set of keywords, the PLSC algorithm first corrects the misspelled ones. The user then sends the plaintext keyword set SW = {\(W_{1}\), \(W_{2}\), …, \(W_{t}\)} to the Proxy. The Proxy generates the query trapdoor \({\mathbf{SW}}^{\prime}\) = {\(W_{1}^{\prime}\), \(W_{2}^{\prime}\), …, \(W_{t}^{\prime}\)}, where \(W_{i}^{\prime}\) = SKA(\(W_{i}\), K2, ENC1).

Alg. 1. CSP searching algorithm

CSP Searching Algorithm.

CSP searches for \({\mathbf{SW}}^{\prime}\) in \({\mathbf{I^{\prime}}}\). The searching algorithm is described in Alg. 1. The search result is the set of EMRs that contain all queried keywords in \({\mathbf{SW}}^{\prime}\). CSP then homomorphically sums the encrypted relevance scores of the queried keywords in each EMR by multiplying the ciphertexts. The relevance score calculation is defined in Eq. (5).

$$ S_{j}^{\prime} = \prod\nolimits_{i} S_{i,j}^{\prime} = g^{\sum_{i} S_{i,j}} \times \prod\nolimits_{i} r_{i}^{n} \bmod n^{2} $$
(5)

CSP returns the search result to the Proxy for relevance score decryption and ranking.
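The conjunctive search and the ciphertext product of Eq. (5) can be sketched as follows. This is not a reproduction of Alg. 1; it is a small self-contained illustration under several assumptions: toy Paillier parameters with \(g = n + 1\), relevance scores already scaled to small non-negative integers, and opaque strings such as `"w1"` and `"r1"` standing in for the encrypted keywords and identifiers. The helper names are hypothetical.

```python
import math
import secrets

# Toy Paillier parameters (assumption: tiny primes and g = n + 1; insecure).
P, Q = 61, 53
N = P * Q
N2 = N * N
LAM = math.lcm(P - 1, Q - 1)
MU = pow(LAM, -1, N)  # with g = n + 1, L(g^lambda mod n^2) = lambda mod n


def enc(m: int) -> int:
    """Eq. (3): g^m * r^n mod n^2 with a fresh random r coprime to n."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            return (pow(N + 1, m, N2) * pow(r, N, N2)) % N2


def dec(c: int) -> int:
    """Proxy-side decryption, used here only to check the result."""
    return ((pow(c, LAM, N2) - 1) // N * MU) % N


def csp_search(secure_index, trapdoor):
    """Return {record token: encrypted score sum} for records matching all tokens."""
    postings = [dict(secure_index.get(w, [])) for w in trapdoor]
    if not postings:
        return {}
    matches = set(postings[0]).intersection(*postings[1:])
    result = {}
    for rec in matches:
        total = 1
        for plist in postings:
            total = (total * plist[rec]) % N2  # ciphertext product = plaintext sum
        result[rec] = total
    return result


secure_index = {
    "w1": [("r1", enc(3)), ("r2", enc(5))],
    "w2": [("r1", enc(4)), ("r3", enc(9))],
}
result = csp_search(secure_index, ["w1", "w2"])
print({rec: dec(c) for rec, c in result.items()})  # {'r1': 7}
```

Only the modulus \(n^{2}\) is needed on the CSP side; the server never sees \(\lambda\), so it can add scores without learning them. The sum must stay below \(n\) to decrypt correctly, which is one reason the integer scaling of the scores matters.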

Relevance Ranking Algorithm.

After receiving the search results from CSP, Proxy decrypts and ranks the sum of relevance scores for each returned EMR. Proxy runs SKA(∗, K2, DEC1) to obtain the plaintext keywords \(W_{i}\) and EMR identifiers \(id_{j}\). Then, Proxy runs HE(\(S_{j}^{\prime}\), SK, DEC2) to decrypt the sum of relevance scores. The decryption process is defined in Eq. (6).

$$ \begin{aligned} S_{j} &= \frac{L\left( \left( S_{j}^{\prime} \right)^{\lambda} \bmod n^{2} \right)}{L\left( g^{\lambda} \bmod n^{2} \right)} \bmod n \\ &= \frac{L\left( g^{\lambda \sum_{i} S_{i,j}} \times \prod\nolimits_{i} r_{i}^{\lambda n} \bmod n^{2} \right)}{L\left( g^{\lambda} \bmod n^{2} \right)} \bmod n \\ &= \sum\nolimits_{i} S_{i,j} \end{aligned} $$
(6)

where \(\prod\nolimits_{i} r_{i}^{\lambda n} \equiv 1 \pmod{n^{2}}\). Proxy ranks the top-k EMRs based on their \(\sum_{i} S_{i,j}\) and returns the EMR identifiers to the users. Upon receiving the EMR identifiers, users send download requests to the CSP directly.
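The last equality in Eq. (6) can be checked by hand in the special case \(g = n + 1\). This choice of \(g\) is an assumption made only for the worked derivation; the scheme itself requires only \(g \in Z_{n^{2}}^{*}\). Write \(m = \sum_{i} S_{i,j}\), assume \(m < n\), and note that gcd(\(\lambda\), \(n\)) = 1 follows from the condition on \(p\) and \(q\) in key generation. By the binomial theorem, every term of \((1 + n)^{k}\) beyond the first two is divisible by \(n^{2}\), so:

$$ \begin{aligned} \left( S_{j}^{\prime} \right)^{\lambda} &\equiv (1 + n)^{\lambda m} \times \prod\nolimits_{i} r_{i}^{\lambda n} \equiv 1 + \lambda m n \pmod{n^{2}} \\ L\left( \left( S_{j}^{\prime} \right)^{\lambda} \bmod n^{2} \right) &\equiv \lambda m \pmod{n} \\ L\left( g^{\lambda} \bmod n^{2} \right) &\equiv \lambda \pmod{n} \\ S_{j} &= \lambda m \times \lambda^{-1} \bmod n = m \end{aligned} $$

The division in Eq. (6) is therefore a multiplication by the modular inverse \(\mu\) defined in the key generation, and the random factors \(r_{i}\) disappear because each \(r_{i}^{\lambda n}\) is congruent to 1 modulo \(n^{2}\).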

6 Security Analysis

This section analyzes the data confidentiality and privacy-preserving properties of our scheme. We show that our scheme guarantees the security of ciphertext retrieval through the queue-based search strategy and protects EMR privacy through different encryption algorithms.

Data Confidentiality.

The original EMRs are encrypted before being outsourced to the CSP, and the decryption keys are distributed to users via a secure channel. Under the assumptions made in the system model in Sect. 4, EMRs cannot be compromised without the correct secret keys. Thus, the confidentiality of EMR data is guaranteed.

Indexes are constructed separately by the EMR owners, then merged and encrypted by the Proxy. EMR identifiers in the index and in the outsourced EMRs are encrypted with different keys, so the CSP cannot learn the relationship between the encrypted EMRs and the encrypted index. The keyword relevance scores of each EMR are encrypted and summed under homomorphic encryption, so the CSP cannot obtain any information from the keywords or their relevance scores. Therefore, as long as the encryption keys are not compromised, the confidentiality of the data, index, keywords, and relevance scores is guaranteed.

Possibility of Privacy Leakage.

Queries are encrypted by the Proxy and then forwarded to the CSP, so the CSP cannot obtain user information and user privacy is protected. Meanwhile, the file download requests and the query trapdoors are sent separately, by the users and the Proxy respectively. The CSP therefore cannot determine the exact correspondence between the queried keywords and the downloaded EMRs.

7 Performance Test

The performance experiments are implemented in C++ on Windows 7 machines, each with an Intel(R) Core(TM) i5 6500 3.2 GHz processor and 2 GB of RAM. The performance is evaluated with our own EMR dataset, which uses more than 3000 medical keywords to generate 2000 EMRs covering various diseases. We compare our scheme with the most relevant work on searchable encryption: FMS [13], TBMSM, and Zhong’s scheme [8]. In the experiments, the number of keywords in the 2000 EMRs varies from 1000 to 3000, and the number of EMRs varies from 100 to 2000.

7.1 Index Building Efficiency

We compare the index building time and storage cost of the four schemes. Figure 2(a) shows the time required to build an index as the number of keywords increases. The index building time of FMS grows exponentially since it needs to create an index vector for each document; when the number of keywords exceeds 1500, the index building time of FMS exceeds that of the other three schemes, whose index building times are stable and increase linearly. The index structure of our PREMR scheme is an inverted index based on keywords, so its index generation time increases linearly with the number of keywords. Figure 2(b) shows the index generation time as the number of EMRs increases. PREMR takes less time to build the index than the other three schemes, which makes it the most efficient of the compared schemes at the index building stage.

Fig. 2. Index Building Time

Fig. 3. Index Storage Space

Figure 3(a) shows the index storage size when the number of index keywords is 1000, 1500, 2000, 2500, and 3000, respectively. When the number of keywords in the index is greater than 1000 or the number of EMRs in the data set is greater than 300, the index storage overhead of PREMR is less than that of the other three schemes. Figure 3(b) shows the required index storage space as the number of EMRs ranges from 100 to 2000. These results indicate that the PREMR scheme has better index generation efficiency and lower index storage overhead than the other three schemes.

Fig. 4. Trapdoor Generation Time

Fig. 5. Search Efficiency Comparison

7.2 Trapdoor Generation Time

Figure 4 compares the trapdoor generation efficiency of the four schemes for 1000 queries, with the number of keywords in each query ranging from 10 to 50. The trapdoor generation time of FMS is not affected by the number of queried keywords, because the trapdoor in FMS is a fixed-length one-dimensional vector over the keywords; even when the number of keywords increases, the trapdoor generation time remains essentially unchanged. The trapdoor generation time of PREMR, TBMSM, and Zhong’s scheme grows linearly with the number of queried keywords. The comparison shows that the PREMR scheme performs better in trapdoor generation, especially when supporting multiple keywords and simultaneous queries.

7.3 Search Efficiency

Figure 5 shows the search efficiency of the compared schemes. All schemes are evaluated with the number of EMRs ranging from 100 to 2000 and 5 keywords in each query. In FMS, a matrix calculation is carried out between the retrieval vector and the index vector of each EMR, which increases the search time significantly as the number of stored EMRs grows. In the TBMSM scheme, a search sequence must first be obtained by matching each search keyword against the keywords in the index, so the search time of TBMSM increases linearly with the number of keywords in the index. The search efficiency of Zhong’s scheme is mainly affected by the mapping of the index and query vectors with the LSH (locality-sensitive hashing) function. Although our PREMR scheme is also affected by the number of keywords, its search time grows slowly. From Fig. 5 we can see that PREMR has a shorter search time than the other schemes. The search time of PREMR is less than 1 s even when there are 2000 encrypted EMRs in the database.

8 Conclusion

This paper proposes a privacy-preserving retrieval scheme over encrypted medical records. The proposed scheme achieves multi-keyword fuzzy search and relevance ranking. We use PLSC to support fuzzy input keywords and improve spelling correction. In addition, a homomorphic encryption algorithm is introduced to support the secure calculation and ranking of keyword relevance scores. The security analysis shows that our PREMR scheme can guarantee the security of query vectors and stored EMRs. Finally, we experimentally compare PREMR with three similar schemes, and the results show that PREMR performs better in index building, query trapdoor generation, and search efficiency.