Keywords

1 Introduction

Managing large amounts of data securely and efficiently has always been an essential task for the State Grid [28]. Traditionally, access control and permission management methods have been deployed to protect sensitive data, e.g., users’ personal private information, management personnel information, and financial information. However, their limited security guarantees make them vulnerable to attacks, which normally result in terrible consequence [1]. Hence, a more rigorous method, encryption, is generally taken as a promising solution.

In practice, however, applying cryptography methods for database protection is in the face of two challenges, which seriously hinder their feasibility. First, encryption disables any parties to manipulate the data. This implies that any data query requires the users first to download the entire database, then decrypt it locally, and conduct data queries on the decrypted data. Downloading the entire data lies a huge burden for the network transmission. Second, this workflow presents high requirements for the client – it requires each client to have enough storage space to hold the data and has the computational capability to perform queries on the entire dataset (Fig. 1).

Thus, we expect to design a scheme that can perform queries on the encrypted data directly, i.e., do not reveal the data to the server and without downloading the data. Depending on the data type, queries can be divided into two main categories: character-oriented data query and numerical data query. Specifically, a keyword-oriented search is performed to obtain records/data entries containing specific keywords (e.g., to find information on experts whose research interests are “smart grid” and “superconducting DC” in the expert database). On the other hand, when we want to query a specific record to satisfy certain numerical characteristics (i.e., querying the oldest employee with more than 25 years of project review experience), we need to perform calculations (including numerical operations and data comparisons) on specific data records, etc.

In this paper, we build a complete encrypted database query scheme that supports the most common queries. Specifically, for queries that rely only on addition operations, we adopt a somewhat homomorphic encryption scheme to encrypt the data so that the server can perform queries directly on the ciphertext. For complex queries that rely on multiplication operations, we adopt a dual-server architecture, in which two servers that store encrypted data independently work together to make queries. Moreover, we devise an Authority Separation mechanism further to restrict the data exploration of the cloud server, so that the relevant third-party department must approve the user before querying and using the data. Based on the proposed framework, we can mitigate the risk of information leakages caused by unreliable management and employees while maintaining the usability of sensitive data.

Fig. 1.
figure 1

Architecture and workflow of the data query system.

2 Preliminary

2.1 Basic Cryptographic Tools

The BCP Crypto-System. BCP cryptosytem [12, 23] is quin-tuple (Setup, KeyGen, Enc, Dec and mDec) that consists of the following algorithms:

SetUp(\(\kappa \)): for the security parameter \(\kappa \), choose a safe prime RSA -modulus \(N=qp\) of bit \(\kappa \), s.t. \(p=2p'+1\) and \(q=2q'+1\) for distinct prime \(p'\) and \(q'\). Pick a random element \(g \in \mathbb {Z}^*_{N^2}\) of order \(pp'qq'\) s.t. \(g^{p'q'}\mod N^2 =1+kN\). The plaintext space of the crypto-system is \(\mathbb {Z^*}\). This algorithm output the public parameters PP = (Nkg) and the master key is MK=(\(p'q'\)).

KeyGen(PP): the key generation algorithm which take the public parameter as input and out put the secret/public key pairs. For each public-secret key pair, it firstly randomly pick \(a \in \mathbb {Z}_{N^2}\) and computes \(h= g^a\mod N^2\) and then our put the public key pk = h and secret key sk = a.

Enc\(_{(\texttt {PP,pk})}(m)\): the encryption algorithm which take the plaintext, the public key and public parameters as input and output the ciphertext. The algorithm firstly pick a random \(r\in \mathbb {Z}_{N^2}\) and generate the ciphertext (\(\alpha , \beta \)) as:

$$\alpha =g^r\mod N^2 and \beta =h^(1+mN)\mod N^2$$

Dec\(_{(\texttt {PP,sk})}\)(PP): the regular decryption algorithm that take the ciphertext, the secret key and public parameters as input and output its corresponding plaintext. In terms a key pair pk = \(g^a \mod N^2\) sk=a, the ciphertext (\(\alpha , \beta \)), it output the plaintext as:

$$m=\dfrac{\dfrac{\beta }{\alpha ^a}-1\mod N^2}{N}$$

mDec\(_{(\mathtt {PP,pk,mk})}(\alpha ,\beta )\): the alternative decryption algorithm which take the public key, the master key, the ciphertext and public parameters as input and output the corresponding plaintext. It first computed \(a \mod N\) as:

$$a \mod N=\dfrac{h^{p'q'}-1 \mod N^2}{N} \cdot k^{-1}\mod N$$

in which \(k^{-1}\) denotes the inverse of k modulo N. Then compute \(r\mod N\) as

$$r \mod N=\dfrac{A^{p'q'}-1 mod N^2}{N} k^{-1} \mod N$$

Then let \(\delta \) denote the inverse of \(p'q'\) modulo N and compute \(\gamma =ar\mod N\). The plaintext is finally computed as

$$m=\dfrac{(\dfrac{\beta }{g^{\gamma }})^{p'q'}-1\mod N^2 }{N}\cdot \delta \mod N$$

The BCP Crypto system has flowing special properties.

  1. 1.

    Additive Homomorphism. For the cipertext encrypted using the same public key, the BCP crytosystem satisfies the addictive homomorphism.

  2. 2.

    Key Homomorphism. The key homomorphism refers to the property that, if a plaintext is encrypted with the public key \(pk=pk_1\times pk_2\), it can be successfully decrypted with the secret key \(sk=sk_1+sk_2\). Formally, it satisfies

    $$m=Dec_{sk_1+sk_2}[Enc_{pk_1+pk_2}(m))]$$
  3. 3.

    Double Trapdoor. The BCP cryptosystem provides two independent mechanisms for decryption. Except for decrypting a ciphertext (encrypted with public key \(p_i\)) with the corresponding secret key (i.e., \(sk_i\)) as the traditional public key cryptosystem does, the BCP cryptosystem allows a master to hold the master key and decrypt ciphertext encrypted by any public-key without knowing its corresponding secret key (i.e., the second trapdoor).

The Goldwasser–Micali Cryptosystem is a bitwise probabilistic (semantically secure) public key encryption scheme based on the quadratic residuosity assumption. It satisfies the additive homomorphic property over \(\mathbb {Z}_{2}\). That is, for two plaintext bit \( m_1 \) and \(m_2\in \{0,1\}\),

$$Enc(m_1)\cdot Enc(m_2)\mod N=Enc(m_1\oplus m_2)$$

As its special bitwise homomorphism, the Goldwasser-Micali Cryptosystem is often used as a basic building block to construct high-level primitives.

2.2 Symbols and Notations

For a plaintext \(a\in \mathbb {Z}_{N^2}\) we use \(Enc_{pk_A}(a)\) to denote its corresponding ciphertext encrypted by the BCP cryptosystem with public key \(pk_A\). If we do not care which public key is used in encryption, we use [a] to represent the ciphertext of a for short. For a plaintext \(b \in \mathbb {Z}_2\), we use \(\Vert b\Vert \) to denote the corresponding ciphertext encrypted with the Goldwasser-Micali encryption algorithm.

3 Scheme Description

3.1 System Architecture High Level Description

This scheme adopts two non-collusive servers – the storage server \(\mathcal {S}\) and the (auxiliary) computing server \(\mathcal {C}\). The whole database is stored in server \(\mathcal {S}\), and each sensitive attribute (column) is encrypted using BCP cryptosystem with a separate public key, whereas the non-sensitive columns remain in plaintext form. This public key is also stored in the server S along with the ciphertext of this column. The computing server holds the master key mk of the BCP cryptosystem, whereas no data is stored on the computing server.

Operation Modes. As some columns are encrypted whereas some are not, we denote inter-column operations between an encrypted and an unencrypted column as the [EP] mode and denote inter-column operations in two encrypted columns as [EE] mode. If there is no sensitive column involved, the database management system will execute operations as in the traditional manner, and in this paper, we will not discuss this case. For each operation, if the serve \(\mathcal {S}\) gets the encrypted result (which is either used for further operations such as aggregation or used as a final classified result), we denote this case as [\(\rightarrow \)E] mode. Likewise, if the serve \(\mathcal {S}\) gets the plaintext result (which is normally used for data retrieval such as select clause), we denote this case as [\(\rightarrow \)P] mode.

3.2 Scheme Construction

Aggregation. The SUM() operation considers the setting that, given k values \(a_1, a_2, ..., a_k\) in one column, it requires to compute the sum \(\sigma _a=\sum ^k_{i=1}a_i\) It serves as the basic step for other aggregation operations such as AGV(), COUNT(), etc. Actually, as the addictive homomorphism of the BCP cryptosystem, the sum operation within a column can simply achieved by \(\mathcal {S}\) independently, without the computing server \(\mathcal {C}\) involved.

Addition. [EP\(\rightarrow \)E mode] Given an encrypted column A encrypted with public key \(pk_A\) and a non-sensitive column B in the plaintext form, we want to compute the column \(C=A+B\), such that each item in C is in encrypted form and can be decrypted with secret key \(sk_A\). Specifically, if a is the element in i-th row of column A and b is the element in i-th row of column B, we want to get c such that \(Enc(c)=Enc(a+b)\) This can be achieved by the following two steps (i.e., entire done by \(\mathcal {S}\) without interaction)

  • Encrypt b with public key \(pk_A\), which will result in \(Enc_{pk_A}(b)\).

  • based on the addictive homomorphism of the BCP cryptosystem, compute \(Enc_{pk_A}(c)= Enc_{pk_A}(a)\cdot Enc_{pk_A}(b)\)

[EE \(\rightarrow \)E mode] we want to compute the column \(C=A+B\), s.t. each item in C is in encrypted form and can be decrypted by an individual secret key \(sk_C\).

figure a

Multiplication. [EP\(\rightarrow \)E mode] In this setting, given a sensitive column A encrypted by BCP-cryptosystem with public key pk and a none-sensitive column X in plaintext form (with x denoting an item in column X), we want to compute B such that \(B=AX\).

Recall that the ciphertext of m under the BCP encryption is in the form of \(c=Enc(m)=(\alpha ,\beta )\), in which \(\alpha =g^r mod\ N^2\) and \(\beta =h^r(1+mN) mod\ N^2\). Thus when multiplay a constant x into the ciphertext c, the storage server just computer \(c'={<}\alpha ^x, \beta ^x{>}\), which will get the ciphertext of xm.(as proved below).

$$xc={<}\alpha ^x, \beta ^x{>}$$
$$\alpha =g^r mod\ N^2, \beta =h^2(1+mN)mod\ N^2$$
$$\alpha ^a=g^{rax} mod\ N^2$$
$$\dfrac{\beta ^x}{\alpha ^{xa}}=\dfrac{h^{xr}(1+mN)^xmod \ N2}{g^{rxa}mod\ N^2}=\dfrac{g^{xar}(1+mN)^xmod \ N2}{g^{rxa}mod\ N^2}=$$
$$(1+mN)^x mod\ N^2=(1+xmN+C^{2}_{x}mN^2+...) mod N^2=(1+xmN)mod\ N^2$$
$$Dec=\dfrac{\frac{\beta ^x}{\alpha ^{xa}}-1 mod\ N^2}{N} mod N^2 =\dfrac{xmN}{N}mod\ N^2=xm$$

[EE\(\rightarrow \)E mode] Given an encrypted column A (encrypted with \(pk_A\)) and column B (encrypted with \(pk_B\)), we want to computed \(C=A\times B\), such that C is in the encrypt form and can be decrypted by a separate key \(sk_C\). Recall that the ciphertext of BCP cryptosystem is in the form like \(<\alpha ,\beta>\). We denote the i-th item in column A as \([a]=<\alpha ,\beta>\) and the i-th item in column B as \([b]=<\alpha ',\beta '>\). The output [c] of Algorithm 2 (i.e., CT) is the ciphertext of \(a\times b\) that can be decrypted with secret key \(sk_C=sk_A+sk_B\).

figure b

Comparison. For the functionality of secure comparison protocol, the server \(\mathcal {S}\) holds two values (either both in encrypted form, or one in encrypted and the other in unencrypted form) and hopes to know the relationship (i.e., whether \(a\le b\) or not) among the two numbers. Normally, the server \(\mathcal {C}\) is an external service provider that concentrates on providing auxiliary computing service (especially cryptographic operation). Thus, as one design principle, we hope the server \(\mathcal {C}\) gets as little information as possible.

As the server \(\mathcal {S}\) holds the BCP public key of each column, both EE or EP mode can be transformed to EE mode.

[EE\(\rightarrow \)E mode] Get the BCP-encrypted comparison result, which is normally used for SELECT COUNT(*).

[EE\(\rightarrow \)P mode] Get the comparison result in plaintext form. Normally used for directly retrieve the tuples that satisfy the query condition (e.g., SELECT COUNT * where Production \(\le \) Sells).

The functionality of comparison protocols in different modes is listed in Table 1.

Table 1. Functionality of comparison protocols.
figure c
figure d
figure e
figure f

Equality Test. Similar to the comparison protocol, all cases can be converted to equality test among two encrypted ciphertexts (either encrypted in the same or different keys). As demonstrated in Sect. 3.2, we have composed protocols for encrypted ciphertext comparison, i.e., for two encrypted items [a] and [b], the server \(\mathcal {S}\) can get the encrypted or unencrypted bit t indicating whether \(a\le b\). In order to determine whether \(a=b\), the servers firstly run the comparison protocol to get the encrypted bit \(||t_1||\), indicating whether \(a \le b\); and then, ran the comparison to get the encrypted bit \(||t_2||\) indicating whether \(b \le b\). Note that, during the excursion of these two protocol, the server \(\mathcal {S}\) get the encrypt bits whereas the server \(\mathcal {C}\) get no information about a and b. Then the server \(\mathcal {S}\) computer \(||t||= ||t_1||\cdot ||t_2||\), which equals to \(||t_1\oplus t_2||\), and \(t_1\oplus t_2 = 1\) if and only if \(t_1=t_2\) (i.e., when \(a\le b\) and \(b \le a\) satisfies simultaneously, indicating \(a=b\)).

After acquiring the QR-encrypted comparison result, if we want to get the plaintext of the result, the server \(\mathcal {S}\) first generate a random bit \(\tau \) and blind the QR-encrypted ||t|| with \(\tau \) and send the blind result to server \(\mathcal {C}\). Then, the server \(\mathcal {C}\) decrypted the blind equality test result and sent back the decrypted result to \(\mathcal {S}\), which will afterward remove the blind factor \(\tau \) and get the result in plaintext setting (Algorithm 8). Likewise, if the server is required to get BCP encrypted result, then we apply the Chenging Encryption Schemes (Algorithm 9).

figure g
figure h
figure i

Unlike theirs, in our scheme, each column is encrypted with BCP cryptosystem of independent keys, and the randomness of each ciphertext is derived from the cryptosystem itself. In this way, a random number does not exist that influences all the encrypted data items in a whole tuple, and thus, the Cartesian product is natively supported. Joint is actually performing selection operation on the Cartesian product, and accordingly, it is natively supported.

4 Performance Evaluation

We deployed our system on top of Amazon AWS and Aliyun and conducted a comprehensive performance evaluation of our scheme. Specifically, we deploy the storage server (i.e., server \(\mathcal {S}\)) on a standard VM (with a dual-core CPU @2.5 GHz, 32-GB memory, and 256-GB SSD storage) rented from Amazon EC2 located at N. Virginia datacenter. The computing server (i.e., server \(\mathcal {C}\)) is run on Aliyun (Silicon Valley datacenter). Both server \(\mathcal {S}\) and server \(\mathcal {C}\) are connected with a high-bandwidth network. Our experiments are designed to answer the following two questions:

\(\mathsf {Q1}\): What is the efficiency of the building blocks, and

\(\mathsf {Q2}\): How practical our proposed method is for typical database query workloads.

To answer \(\mathsf {Q1}\) we implement each cryptographic primitive with C++, and we adopt the GMP(https://gmplib.org) and NTL (http://www.shoup.net/ntl/) library for large integer representation and algebraic manipulation. To answer question \(\mathsf {Q2}\), we build a prototype based on our design. We use MySQL 5.7.19 as the underlying DBMS. To examine the performance of our scheme in the general setting, we conducted the evaluation with the TPC-H benchmark.

Performance of Each Cryptographic Primitive. We tested each building block used in our scheme (i.e., used in query evaluation). In our experiment, we run the aforementioned algorithms/protocols 50 times and record their average exerting time in Table 2. From Table 2 we can see that all the building blocks can be finished in millisecond-level, which indicates that they are efficiently constructed and can be used to build practical query evaluation schemes on an encrypted database.

Table 2. Overhead of each cryptographic algorithm and protocol.

The TPC-H Benchmark. We run our scheme on TPC-H (version 3.0) benchmark (http://www.tpc.org/tpch). Meanwhile, we also compare the performance of our scheme with CryptDB, a state-of-the-art system for encrypted database queries. There are 22 decision-support queries named Q1 to Q22 in the TPC-H workload, and we select Q1-18 to present the performance evaluation. Among them, some of the workloads (e.g., Q13 and Q16) are so complicated that they cannot be run on both our scheme and CryptDB in encrypted form. We record the running time of each workload and record the total time consumption on Table 3. In general, encrypted data query in both our system and CryptDB takes several times (less than 10 \(\times \)) than query on plaintext setting. Only a few of them take longer times (e.g., Q1, Q17 and Q18), but still less than 100\(\times \) of plaintext data query. Most importantly, we can also see that our scheme achieves a better performance than CryptDB in most cases.

Table 3. Performance of our scheme under TPC-H workload, compared with CryptDB; “–” denotes the corresponding system cannot work on this workload.

5 Related Works

Encrypted Database. The notion of executing SQL query on encrypted relational database was firstly considered by Hacigümüş et al.  [16]. They proposed a prospective architecture in which the client outsources his encrypted database along with some additional auxiliary information (i.e., serves as a secure index) into the server and can latterly issue the SQL queries to this database. In their scheme, the database is encrypted by a traditional encryption scheme (i.e., symmetric encryption scheme such as DES and AES), and the aforementioned split-and-transformation mechanism is achieved by a delicately designed algebraic framework. Followed by this approach, Hacıgümüs et al. proposed a query optimization method in [18] and constructed a secure DBMS that support aggregation query in [17]. However, for the sake of query processing efficiency, DBMS built in this approach only provides very limited notions of data privacy.

Property-Preserving Encryption Approach. In order to accelerate data processing operations meanwhile provide appropriate privacy guarantees, researchers proposed some specific cryptographic primitives that support particular calculations to be performed on encrypted data. Specifically, for example, Order-Preserving Encryption [2, 9] enables the ciphertext to maintain the numerical order of its corresponding plaintext, and the Order-Revealing Encryption[11] enables efficient comparison among multiple (randomly-encrypted) ciphertexts. Deterministic encryption [7] have advantage on equality search and match. Using these primitives, Popa et al.  [25] proposed the first practical system, CryptDB, an integrated system that can perform query processing on encrypted data. Afterwards, several systems, like MONOMI [29], are proposed.

FHE and SMPC Based Approach. Fully homomorphic encryption (FHE) [15] is a cryptographic encryption primitive that enables to compute any function on encrypted data. Boneh et al.  [10] firstly implement the database-query functionally with a specific homomorphic encryption scheme. This line of work is based on the prerequisite that efficient fully homomorphic encryption schemes exist; nevertheless, the efficiency for FHE is still far more satisfactory. Similarly, SMPC enables multiple distributed parties to jointly compute an arbitrary functionality in an privacy-preserving manner. Secure database systems implemented SMPC include Blindseer [14, 22], Arx [24], sharemind [8] and SDB [19, 32].

Secure Hardware Based Approach. Secure hardware enclaves (e.g., Intel SGX [13] and Catalyst [21] etc.) promise data confidentiality and secure execution of arbitrary computation. The TrustDB [5, 6], constructed with an ordinary commodity cryptographic co-processor, is the first practical outsourced database prototype that supports a subset of SQL operation, and the Cipherbase [3, 4] achieves the similar functionality with FPGA serving as the tamper-proof hardware. SGX based systems include include VC3 [26], Ryoan [20] and Opaque [33]. However, the secure hardware may not be as “secure” as it claims [31], and there are lots of effective attacks targeted to hardware that can totally devastate the underlying systems[27, 30].

6 Conclusion

Reliable, secure, and trustworthy services are essential for industries with highly sensitive data. In SGCC, we deployed our systems that ensure only authorized users have access to the expected data, while unauthorized entities, including cloud service providers and unapproved internal staff, know nothing about the data. To achieve these goals, we devise an authority separation mechanism, store data in encrypted form, and design a set of mechanisms to enable search and query on encrypted data using searchable encryption (SE) and homomorphic encryption (HE). Experimental and real-work running experiences indicate the practicability of our system.