
1 Introduction

1.1 Motivation

Until now, backend (on-premise & cloud) deployments were considered the single source of truth and unique point of access with regard to Enterprise Systems (ES). Nevertheless, a paradigm shift has recently been observed: ES assets are being deployed towards the Edge sectors of the landscapes, by distributing data, decentralizing applications, de-abstracting technology and integrating edge components seamlessly with the central backend systems.

Capitalizing on recent advances in High Performance Computing along with the rising amount of publicly available labeled data, Deep Neural Networks (DNN), as an implementation of AI, have revolutionized, and will continue to revolutionize, virtually every current application domain, as well as enable novel ones such as autonomous, predictive, resilient, self-managed, adaptive, and evolving applications.

Distributively deployed AI capabilities will drive the above-mentioned transition. As reported by Deloitte, “... companies are incorporating artificial intelligence – in particular, machine learning – into their Internet of Things applications and seeing capabilities grow, including improving operational efficiency and helping avoid unplanned downtime” [28].

1.2 Problem Statement

The deployment of data processing capabilities throughout Distributed Enterprise Systems raises several security challenges related to the protection of input & output data [26] as well as of DNN-based/enhanced software assets.

In the specific context of distributed intelligence, DNN-based/enhanced software assets represent key investments in infrastructure, skills and governance, as well as in the acquisition of data and talent. The software industry is therefore in direct need to safeguard these strategic investments by enforcing the protection of this new form of Intellectual Property.

Furthermore, in the wake of Data Protection (DP) regulations such as the EU-GDPR [26], Independent Software Vendors (ISVs) have the non-transferable obligation to comply with them.

Therefore, ISVs aim to protect both the data and the Intellectual Property of their DNN-based/enhanced software assets, deployed on potentially insecure edge hardware & platforms [15].

1.3 State-of-the-Art

Security of Deep Neural Networks is a current research topic taking advantage of two major cryptographic approaches: variants of Fully Homomorphic Encryption (FHE) [12] and Secure Multi-Party Computation (SMC) [8]. While FHE techniques allow addition and multiplication on encrypted data, SMC enables arithmetic operations on data shared across multiple parties.

Several approaches can be found in the literature, at different phases of the development and deployment of DNNs.

Secure Training. Secure DNN training has been addressed using FHE [16] and SMC [30], disregarding protection once the trained model is productively deployed. Other Machine Learning models, such as linear and logistic regression, have also been trained in a secure way in [24]. In those approaches, confidentiality of the training data is guaranteed, while runtime protection (i.e. input, model, output) is out of scope.

Processing on Encrypted Data. At processing phase, SMC has led to cooperative solutions where several devices work together to obtain federated inferences [21], without supporting the deployment of the trained DNN to decentralized systems. DNN processing on FHE-encrypted data is covered in CryptoNets [13] and improved in [4, 18]. More recently, in [2], the authors proposed a privacy-preserving framework for deep learning, making use of the SEAL [29] FHE library. While disclosure of data at runtime is prevented in these solutions, protection of the DNN models remains out of scope.

Intellectual Property Protection of DNN Models. In [31], the authors tackle IP protection of DNN models through model watermarking. While infringement can be detected with this method, it cannot be prevented. Furthermore, runtime protection of input, model and output is out of scope.

To the best of our knowledge, no other publication has holistically tackled the protection of both trained DNN models and data, targeting distributed untrusted systems.

1.4 Data and Intellectual Property Protection for Deep Neural Networks

In this paper we propose a novel approach for the Intellectual Property protection of DNN-based/enhanced software assets that also enables data protection at processing time, making use of concepts such as Fully Homomorphic Encryption (FHE).

Once trained, the DNN model parameters (i.e. weights, biases) are encrypted homomorphically. The resulting (encrypted) DNN can be distributed across untrusted landscapes, preserving its IP while mitigating the risk of reverse engineering. At runtime, the homomorphically encrypted DNN produces FHE-encrypted insights from encrypted input data. Confidentiality of the trained DNN as well as of the input and output data is therefore guaranteed.

Despite recent improvements of FHE schemes [3, 5] and implementations [17, 25, 29], homomorphic encryption remains computationally expensive. Hence it could represent a bottleneck, with a negative impact on overall performance and on the accuracy of encrypted DNN outputs computed from encrypted inputs. In this paper, we therefore also evaluate the overall performance (e.g. CPU, memory, disk usage) along with the accuracy of encrypted DNNs.

This paper is organized as follows: Sect. 2 details the fundamentals of our approach. Section 3 provides an overview of our solution. In Sects. 4 and 5, we present the architecture and evaluation, concluding with an outlook in Sect. 6.

2 Fundamentals

2.1 Deep Neural Network

Figure 1 depicts a DNN with multiple layers. It is composed of L layers:

  1. An input layer, the tensor of input data \(\mathbf {X}\).

  2. \(L-1\) hidden layers, mathematical computations transforming \(\mathbf {X}\) sequentially.

  3. An output layer, the tensor of output data \(\mathbf {Y}\).

Fig. 1. Deep neural network [14].

We denote the output of layer i as a tensor \(\mathbf {A^{[i]}}\), with \(\mathbf {A^{[0]}}=\mathbf {X}\) and \(\mathbf {A^{[L]}}=\mathbf {Y}\). Tensors can have different sizes and numbers of dimensions.

Each layer \(\mathbf {A^{[i]}}\) depends on the mathematical computations performed at the previous layer \(\mathbf {A^{[i-1]}}\). At each layer \(\mathbf {A^{[i]}}\), two types of functions can be computed:

  • Linear: involving polynomial operations.

  • Non-linear: involving non-linear operations, so-called activation functions, such as max, exp, division, ReLU, or Sigmoid.

Linear Computation Layer. For the sake of clarity, we exemplify the inner linear computation with a Fully Connected (FC) layer, as depicted in Fig. 2.

Fig. 2.
figure 2

Fully Connected layer with activation function [14].

A Fully Connected layer, noted \(\mathbf {A^{[i]}}\), is composed of N parallel neurons, performing an \(\mathbb {R}^M\rightarrow \mathbb {R}^N\) transformation (see Fig. 2). We define:

  • \(\mathbf {a^{[i]}} = \begin{bmatrix} a^{[i]}_0 \ldots a^{[i]}_k \ldots a^{[i]}_N \end{bmatrix}^T\) as the output of layer \(\mathrm {A}^{[i]}\);

  • \(\mathbf {z^{[i]}} = \begin{bmatrix} z^{[i]}_0 \ldots z^{[i]}_k \ldots z^{[i]}_N \end{bmatrix}^T\) as the linear output of layer \(\mathrm {A}^{[i]}\); (\(\mathbf {z^{[i]}}=\mathbf {a^{[i]}}\) if there is no activation function)

  • \(\mathbf {b^{[i]}} = \begin{bmatrix} b^{[i]}_0 \ldots b^{[i]}_k \ldots b^{[i]}_N \end{bmatrix}^T\) as the bias for layer \(\mathrm {A}^{[i]}\);

  • \(\mathbf {W^{[i]}} = \begin{bmatrix} \mathbf {w^{[i]}_0} \ldots \mathbf {w^{[i]}_k} \ldots \mathbf {w^{[i]}_N} \end{bmatrix}^T\) as the weights for layer \(\mathrm {A}^{[i]}\).

Neuron k performs a linear combination of the output of the previous layer \(\mathbf {a^{[i-1]}}\) multiplied by the weight vector \(\mathbf {w^{[i]}_k}\) and shifted with a bias scalar \(b^{[i]}_k\), obtaining the linear combination \(z^{[i]}_k\):

$$\begin{aligned} z^{[i]}_k=\left( \sum _{l=0}^{M}w^{[i]}_k[l]*a^{[i-1]}_l\right) +b^{[i]}_k={\mathbf{w}^{[\mathbf{i}]}_\mathbf{k}}*{\mathbf{a}^{[\mathbf{i}-{\mathbf {1}}]}}+b^{[i]}_k~[14] \end{aligned}$$
(1)

Vectorizing the operations for all the neurons in layer \(A^{[i]}\) we obtain the dense layer transformation:

$$\begin{aligned} \mathbf {z^{[i]}}=\mathbf {W^{[i]}}*\mathbf {a^{[i-1]}}+\mathbf {b^{[i]}}~[14] \end{aligned}$$
(2)

where \(\mathbf {W}\) and \(\mathbf {b}\) are the parameters for layer \(A^{[i]}\).

Activation Functions. Activation functions are the major source of non-linearity in DNNs. They are applied element-wise (\(\mathbb {R}\rightarrow \mathbb {R}\), thus easily vectorized), and are generally located after linear transformations such as Fully Connected layers.

$$\begin{aligned} a^{[i]}_k=f_{act}\left( z^{[i]}_k\right) \end{aligned}$$
(3)

Several activation functions have been proposed in the literature, but the Rectified Linear Unit (ReLU) is currently considered the most efficient activation function for DL. Several variants of ReLU exist, such as Leaky ReLU [23], ELU [7] or its differentiable version Softplus.

$$\begin{aligned} \begin{aligned} ReLU(z)&=z^+=max(0, z) \\ Softplus(z)&= log(e^z + 1) \end{aligned}~[14] \end{aligned}$$
(4)
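To make Eqs. (2)–(4) concrete, the following minimal NumPy sketch evaluates one Fully Connected layer followed by ReLU on plaintext data; the layer sizes and values are illustrative only and are not taken from any model discussed in this paper.

    import numpy as np

    def relu(z):
        # ReLU(z) = max(0, z), applied element-wise (Eq. 4)
        return np.maximum(0.0, z)

    def fc_layer(a_prev, W, b):
        # Dense layer transformation z = W * a_prev + b (Eq. 2)
        return W @ a_prev + b

    rng = np.random.default_rng(0)
    a_prev = rng.normal(size=(4, 1))   # a^[i-1]: previous layer with 4 outputs
    W = rng.normal(size=(3, 4))        # W^[i]: 3 neurons, 4 inputs each
    b = rng.normal(size=(3, 1))        # b^[i]
    z = fc_layer(a_prev, W, b)         # linear output z^[i]
    a = relu(z)                        # activation output a^[i] (Eq. 3)
    print(z.ravel(), a.ravel())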

2.2 Homomorphic Encryption

While preserving data privacy, Homomorphic Encryption (HE) schemes allow certain computations on ciphertexts without revealing either the inputs or the internal states. Gentry [12] first proposed a Fully Homomorphic Encryption (FHE) scheme, which can theoretically compute any kind of arithmetic circuit but is computationally intractable in practice. FHE evolved into more efficient schemes preserving addition and multiplication over encrypted data, such as BGV [3], FV [11] or CKKS [5], allowing approximations of the multiplicative inverse, the exponential and logistic functions, or the discrete Fourier transform. Similar to asymmetric encryption, a public-private key pair (pub, priv) is generated.

Definition 1

An encryption scheme is called homomorphic over an operation \(\odot \) if it supports the following

$$\begin{aligned}\begin{gathered} Enc_{\mathbf {pub}}(m) = \left\langle {m} \right\rangle _{{\varvec{pub}}}, \forall m \in \mathcal {M} \\ \left\langle {m_1\odot m_2} \right\rangle _{{\varvec{pub}}} = \left\langle {m_1} \right\rangle _{{\varvec{pub}}} \odot \left\langle {m_2} \right\rangle _{{\varvec{pub}}}, \forall m_1, m_2 \in \mathcal {M} \end{gathered}\end{aligned}$$

where \(Enc_{\mathbf {pub}}\) is the encryption algorithm and \(\mathcal {M}\) is the set of all possible messages.

Definition 2

Decryption is performed as follows

$$\begin{aligned}\begin{gathered} Enc_{\mathbf {pub}}(m) = \left\langle {m} \right\rangle _{{\varvec{pub}}}, \forall m \in \mathcal {M} \\ Dec_{\mathbf {priv}}(\left\langle {m} \right\rangle _{{\varvec{pub}}}) = m \end{gathered}\end{aligned}$$

where \(Dec_{\mathbf {priv}}\) is the decryption algorithm and \(\mathcal {M}\) is the set of all possible messages.
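As a tangible illustration of Definitions 1 and 2, the following toy Python sketch uses unpadded textbook RSA, which is homomorphic over multiplication only (unlike FHE, which supports both addition and multiplication). It is not semantically secure and the tiny key is chosen purely for readability; it merely shows an operation on ciphertexts matching the corresponding operation on plaintexts (Python 3.8+ for the modular inverse).

    # Toy, insecure textbook RSA: Enc(m) = m^e mod n is homomorphic over multiplication,
    # i.e. Enc(m1) * Enc(m2) mod n = Enc((m1 * m2) mod n).
    p, q = 61, 53                      # toy primes, far too small for real use
    n = p * q                          # modulus, part of pub and priv
    phi = (p - 1) * (q - 1)
    e = 17                             # public exponent: pub = (n, e)
    d = pow(e, -1, phi)                # private exponent: priv = (n, d)

    def enc(m):                        # Enc_pub
        return pow(m, e, n)

    def dec(c):                        # Dec_priv
        return pow(c, d, n)

    m1, m2 = 7, 11
    c_prod = (enc(m1) * enc(m2)) % n   # operate on ciphertexts only
    assert dec(c_prod) == (m1 * m2) % n
    print(dec(c_prod))                 # 77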

2.3 Challenges

Even though HE schemes seem theoretically promising, their usage comes with several drawbacks, particularly when applied to Deep Learning.

Noise Budget. In Gentry’s lattice-based HE scheme [12] and subsequent variants of it, ciphertexts contain a small term of random noise drawn from some probability distribution. Every operation performed on a ciphertext increases the noise of the resulting ciphertext, and it is important to keep the noise below a certain threshold: once the noise reaches that threshold, it is no longer possible to decrypt the ciphertext. To estimate the current magnitude of the noise, a noise budget can be calculated, which starts as a positive integer, decreases with subsequent operations, and reaches 0 exactly when the ciphertext becomes undecryptable. The noise budget is affected more strongly by multiplications than by additions.

In order to cope with that challenge, encryption parameters can be adjusted according to the required computation depth of an arithmetic circuit. In addition, Gentry introduced the so-called bootstrapping procedure, which resets the noise budget of a ciphertext but incurs significant additional computational cost. Recently, in [5], the authors proposed an optimized bootstrapping approach with improved performance.

FHE Libraries and APIs. As summarized in Table 1, multiple FHE libraries are available. Depending on the supported HE schemes, those libraries show noticeable differences in performance (e.g. computation time, memory consumption), supported operation types (e.g. addition, multiplication, negation, square, division), data types (e.g. floating point, integer), and chipset infrastructure (e.g. CPU, GPU).

In addition, and regardless of their level of maturity and performance, HE libraries can be configured through several encryption parameters, such as:

  • Polynomial modulus degree: which determines the available noise budget and strongly affects the performance.

  • Plaintext modulus: which is mostly associated with the size of the input data.

  • Security parameter: which sets the security level of the cryptosystem in bits (e.g. 128, 192, 256-bit security level).

Fine-tuning of these encryption parameters enables developers to optimize the performance of encryption and of encrypted operations. The selection of the right encryption parameters depends on the size of the plaintext data, the targeted accuracy loss and the required level of security.

Table 1. FHE implementation libraries [14].

Linear Function Support Only. By construction, linear functions, composed of addition and multiplication operations, are seamlessly protected by FHE. However, non-linear activation functions such as ReLU or Sigmoid require approximation to be computed with FHE schemes.

The challenge lies in the transformation of activation functions into polynomial approximations supported by HE schemes. We elaborate further on the approximation of activation functions in Sect. 3.2.

Supported Plaintext Type. The vast majority of HE schemes allow operations on integers [17, 29], while others use booleans [6] or floating point numbers [5, 29]. In the case of integer-supporting HE schemes, rational numbers can be approximated using fixed-point arithmetic, by scaling with a scaling factor and rounding.
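The following short NumPy sketch illustrates this fixed-point encoding for the single operation w * a + b; the scaling factor of 2^10 is an arbitrary choice made for the example, not a parameter used in our evaluation.

    import numpy as np

    SCALE = 2 ** 10  # illustrative scaling factor

    def encode(x, scale=SCALE):
        # Map rational numbers to integers before integer-only (homomorphic) arithmetic
        return np.round(np.asarray(x) * scale).astype(np.int64)

    # w * a + b computed purely on integers, as an integer-only HE scheme would
    w, a, b = 0.731, -1.25, 0.5
    z_int = encode(w) * encode(a) + encode(b, SCALE ** 2)  # the product carries SCALE^2
    z = z_int / SCALE ** 2                                 # decode back to a rational
    print(z, w * a + b)  # equal up to the fixed-point rounding error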

Performance. FHE schemes are computationally expensive and memory consuming. In addition, ciphertexts are often significantly bigger than plaintexts and thus use more memory and disk space.

Even if, in past years, the performance of FHE made it impractical, recent FHE schemes show promising throughput. New FHE libraries also take advantage of GPU acceleration.

In addition, modern implementations of HE schemes such as HELib [17], SEAL [29], or PALISADE [25] benefit from Single Instruction Multiple Data (SIMD), allowing multiple integers to be stored in a single ciphertext and vectorizing operations, which can accelerate certain applications significantly.

3 Approach

As introduced in Sect. 1.2, the delivery of DNN-enriched insights comes at a cost. ISVs aim to guarantee data security, together with the IP protection of their DNN-based/enhanced software assets, deployed on potentially insecure edge hardware & platforms. In order to achieve those security objectives on DNNs, we utilize FHE schemes to operate on ciphertexts at runtime.

Consequently, secure training of DNNs is out of scope of our approach, as we focus on runtime execution. We assume that DNN training already preserves data privacy & confidentiality, as well as the resulting trained model. Once a model is trained, as discussed in Sect. 2.1, we obtain a set of parameters for each DNN layer, i.e. weights \(\mathbf {W^{[i]}}\) and biases \(\mathbf {b^{[i]}}\) for Fully Connected layers. DNNs are not solely made of FC layers, and in [14] we identified further types of linear operation parameters within DNNs, such as Batch Normalization [19] or Convolutional layers [20]. Those parameters constitute the IP to be protected when deploying a DNN to distributed systems.

3.1 Linear Computation Layer Protection

Our approach is agnostic to the type of layer. In [14], we detail the encryption of layers such as Convolutional layers or Batch Normalization. For the sake of simplicity, we exemplify the encryption of DNN layer parameters on FC layers. Since FC layers are simply a linear transformation of the previous layer's outputs, encryption is achieved straightforwardly as follows

$$\begin{aligned} \begin{aligned} \left\langle {\mathbf {z^{[i]}}} \right\rangle _\mathbf{pub }&= \left\langle {\mathbf {W^{[i]}}*\mathbf {a^{[i-1]}}+\mathbf {b^{[i]}}} \right\rangle _\mathbf{pub } \\&= \left\langle {\mathbf {W^{[i]}}} \right\rangle _\mathbf{pub }*\left\langle {\mathbf {a^{[i-1]}}} \right\rangle _\mathbf{pub }+\left\langle {\mathbf {b^{[i]}}} \right\rangle _\mathbf{pub } \\ \end{aligned}~[14] \end{aligned}$$
(5)
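A minimal Python sketch of the idea behind Eq. (5): as long as the scheme supports addition and multiplication, the dense-layer code itself does not change when weights, biases and activations are wrapped in ciphertext objects. The Ct class below is a plaintext-backed stand-in for a real FHE ciphertext (it performs no cryptography) and only illustrates the operator flow.

    import numpy as np

    class Ct:
        # Stand-in "ciphertext": wraps a value and exposes only + and *,
        # mimicking the interface an FHE scheme provides.
        def __init__(self, v):
            self.v = v
        def __add__(self, other):
            return Ct(self.v + other.v)
        def __mul__(self, other):
            return Ct(self.v * other.v)

    def enc(x):   # placeholder for Enc_pub
        return Ct(x)

    def dec(ct):  # placeholder for Dec_priv
        return ct.v

    def encrypted_fc(W_enc, a_enc, b_enc):
        # <z^[i]> = <W^[i]> * <a^[i-1]> + <b^[i]>, using only + and * on ciphertexts
        z = []
        for k in range(len(W_enc)):
            acc = W_enc[k][0] * a_enc[0]
            for l in range(1, len(a_enc)):
                acc = acc + W_enc[k][l] * a_enc[l]
            z.append(acc + b_enc[k])
        return z

    W = np.array([[0.2, -0.5], [1.0, 0.3]])
    a = np.array([0.7, -1.1])
    b = np.array([0.1, 0.0])
    W_enc = [[enc(w) for w in row] for row in W]
    a_enc = [enc(x) for x in a]
    b_enc = [enc(x) for x in b]
    z_enc = encrypted_fc(W_enc, a_enc, b_enc)
    print([dec(z) for z in z_enc], (W @ a + b).tolist())  # identical results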

Fully Connected Layer (FC). Also known as Dense Layer, it is composed of N parallel neurons, performing an \(\mathbb {R}^M\rightarrow \mathbb {R}^N\) transformation (see Fig. 2). We define:

  • \(\mathbf {a^{[i]}} = \begin{bmatrix} a^{[i]}_0 \ldots a^{[i]}_k \ldots a^{[i]}_N \end{bmatrix}^T\) as the output of layer i;

  • \(\mathbf {z^{[i]}} = \begin{bmatrix} z^{[i]}_0 \ldots z^{[i]}_k \ldots z^{[i]}_N \end{bmatrix}^T\) as the linear output of layer i; (\(\mathbf {z^{[i]}}=\mathbf {a^{[i]}}\) if there is no activation function)

  • \(\mathbf {b^{[i]}} = \begin{bmatrix} b^{[i]}_0 \ldots b^{[i]}_k \ldots b^{[i]}_N \end{bmatrix}^T\) as the bias of layer i;

  • \(\mathbf {W^{[i]}} = \begin{bmatrix} \mathbf {w^{[i]}_0} \ldots \mathbf {w^{[i]}_k} \ldots \mathbf {w^{[i]}_N} \end{bmatrix}^T\) as the weights of layer i.

Neuron k performs a linear combination of the output of the previous layer \(\mathbf {a^{[i-1]}}\) multiplied by the weight vector \(\mathbf {w^{[i]}_k}\) and shifted with a bias scalar \(b^{[i]}_k\), obtaining the linear combination \(z^{[i]}_k\):

$$\begin{aligned} z^{[i]}_k=\left( \sum _{l=0}^{M}w^{[i]}_k[l]*a^{[i-1]}_l\right) +b^{[i]}_k=\mathbf {w^{[i]}_k}*\mathbf {a^{[i-1]}}+b^{[i]}_k~[14] \end{aligned}$$
(6)

Vectorizing the operations for all the neurons in layer i we obtain the dense layer transformation:

$$\begin{aligned} \mathbf {z^{[i]}}=\mathbf {W^{[i]}}*\mathbf {a^{[i-1]}}+\mathbf {b^{[i]}}~[14] \end{aligned}$$
(7)

Protecting FC Layer. Since FC is a linear layer, it can be directly computed in the encrypted domain using additions and multiplications. Vectorization is achieved straightforwardly:

$$\begin{aligned} \begin{aligned} \left\langle {\mathbf {z^{[i]}}} \right\rangle _\mathbf{pub }&\equiv \left\langle {\mathbf {W^{[i]}}*\mathbf {a^{[i-1]}}+\mathbf {b^{[i]}}} \right\rangle _\mathbf{pub } \\&=\left\langle {\mathbf {W^{[i]}}} \right\rangle _\mathbf{pub }*\left\langle {\mathbf {a^{[i-1]}}} \right\rangle _\mathbf{pub }+\left\langle {\mathbf {b^{[i]}}} \right\rangle _\mathbf{pub } \\ \end{aligned}~[14] \end{aligned}$$
(8)
$$\begin{aligned} \left\langle {\mathbf {a^{[i]}_k}} \right\rangle _\mathbf{pub }\equiv \left\langle {\mathbf {f_{approxact}\left( z^{[i]}_k\right) }} \right\rangle _\mathbf{pub }~[14] \end{aligned}$$
(10)
Fig. 3. Conv layer with activation for map k [14].

Convolutional Layer (Conv). Conv layers constitute a key improvement for image recognition and classification using NNs. The \(\mathbb {R}^{2|3}\rightarrow \mathbb {R}^{2|3}\) linear transformation involved is spatial convolution, where a 2D \(s*s\) filter (a.k.a. kernel) is multiplied element-wise with patches (subsets) of the 2D input image of size \(s*s\), at defined steps (strides), then summed up and shifted by a bias (see Fig. 3). For input data with several channels or maps (e.g. RGB counts as 3 channels), the filter is applied to the same patch of each map and the results are added up into a single value of the output image (cumulative sum across maps). A map in Conv layers is the equivalent of a neuron in FC layers. We define:

  • \(\mathbf {A^{[i]}_k}\) as the map k of layer i;

  • \(\mathbf {Z^{[i]}_k}\) as the linear output of map k of layer i; (\(\mathbf {Z^{[i]}_k}=\mathbf {A^{[i]}_k}\) in absence of activation function)

  • \({b^{[i]}_k}\) as the bias value for map k in layer i;

  • \(\mathbf {W^{[i]}_k}\) as the \(s*s\) filter/kernel for map k.

This operation can be vectorized by smartly replicating data [27]. The linear transformation can be expressed as:

$$\begin{aligned} \mathbf {Z^{[i]}_k}=\left( \sum _{m=0}^{M\; maps}\mathbf {A^{[i-1]}_m}\oplus \mathbf {W^{[i]}}_k\right) +{b^{[i]}_k}~[14] \end{aligned}$$
(11)

Protecting Convolutional Layers. Convolution operation can be decomposed in a series of vectorized sums and multiplications over patches of size \(s*s\):

$$\begin{aligned} \begin{aligned}&\left\langle {\mathbf {Z^{[i]}_k}} \right\rangle _\mathbf{pub }=\left\langle {\left( \sum _{m=0}^{M\; maps}\;\mathbf {A^{[i-1]}_m}\oplus \mathbf {W^{[i]}_k}\right) +b^{[i]}_k } \right\rangle _\mathbf{pub } =\\&\sum _{m=0}^{M\; maps}\left\langle {\mathbf {A^{[i-1]}_m}\oplus \mathbf {W^{[i]}_k}} \right\rangle _\mathbf{pub }+\left\langle {b^{[i]}_k} \right\rangle _\mathbf{pub } =\\&\left\{ \sum _{m=0}^{M\;}\left\langle {\mathbf {A^{[i-1]}_m}[j]} \right\rangle _\mathbf{pub }*\left\langle {\mathbf {W^{[i]}}_k} \right\rangle _\mathbf{pub }\right\} _{[s*s]}+\left\langle {b^{[i]}_k} \right\rangle _\mathbf{pub } \\ \end{aligned}~[14] \end{aligned}$$
(12)
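As a plaintext illustration of the patch decomposition in Eq. (12), the NumPy sketch below computes one output map of a convolution using only element-wise multiplications and sums over \(s*s\) patches (single input map, stride 1, no padding; sizes are illustrative). For several input maps, the patch-wise products are additionally accumulated across maps as in Eq. (11); under FHE, each product and sum is simply replaced by its homomorphic counterpart.

    import numpy as np

    def conv2d_single_map(A_prev, W_k, b_k, stride=1):
        # Spatial convolution for one map: sums of element-wise products over
        # s*s patches, plus a bias -- only additions and multiplications.
        s = W_k.shape[0]
        h = (A_prev.shape[0] - s) // stride + 1
        w = (A_prev.shape[1] - s) // stride + 1
        Z_k = np.zeros((h, w))
        for i in range(h):
            for j in range(w):
                patch = A_prev[i * stride:i * stride + s, j * stride:j * stride + s]
                Z_k[i, j] = np.sum(patch * W_k) + b_k
        return Z_k

    rng = np.random.default_rng(1)
    A_prev = rng.normal(size=(5, 5))   # one input map
    W_k = rng.normal(size=(3, 3))      # 3x3 kernel for map k
    print(conv2d_single_map(A_prev, W_k, b_k=0.1))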
Fig. 4. Max and mean packing for pooling layers [14].

Pooling Layer. This layer reduces the input size by using a packing function. The most commonly used functions are max and mean. Similarly to convolutional layers, pooling layers apply their packing function to patches (subsets) of the image of size \(s*s\), at strides (steps) of a defined number of pixels, as depicted in Fig. 4.

Protecting Pooling Layer. Max can be approximated by the sum of all the values in each patch of size \(s*s\), which is equivalent to scaled mean pooling. Mean pooling can be scaled (sum of values) or standard (multiplying by 1/N). By employing a flattened input, pooling becomes easily vectorized.
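A short NumPy sketch of the FHE-friendly pooling variants described above: sum pooling over non-overlapping \(2\times 2\) patches (equivalent to scaled mean pooling), optionally multiplied by 1/N for standard mean pooling; the input values are illustrative.

    import numpy as np

    def sum_pool_2x2(A):
        # Sum over non-overlapping 2x2 patches: only additions, hence FHE-friendly
        h, w = A.shape[0] // 2, A.shape[1] // 2
        return A[:2 * h, :2 * w].reshape(h, 2, w, 2).sum(axis=(1, 3))

    A = np.arange(16, dtype=float).reshape(4, 4)
    print(sum_pool_2x2(A))        # scaled mean pooling (sum of values)
    print(sum_pool_2x2(A) / 4.0)  # standard mean pooling (multiply by 1/N)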

Other Techniques

  • Batch Normalization (BN): reduces the range of input values by ‘normalizing’ across data batches: subtracting the mean and dividing by the standard deviation. BN also allows finer tuning using the trained parameters \(\beta \) and \(\gamma \) (\(\epsilon \) is a small constant used for numerical stability).

    $$\begin{aligned} a^{[i+1]}_k=BN_{\gamma , \beta }(a^{[i]}_k)=\gamma *\frac{a^{[i]}_k-E[a^{[i]}_k]}{\sqrt{Var[a^{[i]}_k]+\epsilon }}+\beta ~[14] \end{aligned}$$
    (13)

    Protection of BN: is achieved by treating the division as a multiplication by a precomputed inverse (see the sketch following Fig. 5).

    $$\begin{aligned} \begin{aligned} \left\langle {a^{[i+1]}_k} \right\rangle _\mathbf{pub }&=\left\langle {\gamma } \right\rangle _\mathbf{pub }*\left( \left\langle {a^{[i]}_k} \right\rangle _\mathbf{pub }-\left\langle {E[a^{[i]}_k]} \right\rangle _\mathbf{pub }\right) \\ *&\left\langle {\frac{1}{\sqrt{Var[a^{[i]}_k]+\epsilon }}} \right\rangle _\mathbf{pub }+\left\langle {\beta } \right\rangle _\mathbf{pub } \end{aligned}~[14] \end{aligned}$$
    (14)
  • Dropout and Data Augmentation: only affect the training procedure. They do not require protection.

  • Residual Block: is an aggregation of layers where the input is added unaltered at the end of the block, thus allowing the layers to learn incremental (‘residual’) modifications (Fig. 5).

    $$\begin{aligned} \mathbf {A^{[i]}}=\mathbf {A^{[i-1]}}+ResBlock\left( \mathbf {A^{[i-1]}}\right) \end{aligned}$$
    (15)

    Protection of ResBlock: is achieved by protecting the sum and the layers inside ResBlock:

    $$\begin{aligned} \left\langle {\mathbf {A^{[i]}}} \right\rangle _\mathbf{pub }=\left\langle {\mathbf {A^{[i-1]}}} \right\rangle _\mathbf{pub }+\left\langle {ResBlock\left( \mathbf {A^{[i-1]}}\right) } \right\rangle _\mathbf{pub }~[14] \end{aligned}$$
    (16)
Fig. 5. Example of a possible residual block [14].
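A minimal plaintext sketch of the BN protection idea from Eq. (14): since the mean, variance, \(\gamma \), \(\beta \) and \(\epsilon \) are fixed after training, the inverse standard deviation can be precomputed before encryption, so that the runtime evaluation only needs additions and multiplications. NumPy stands in for the homomorphic operations; the values are illustrative.

    import numpy as np

    def prepare_bn_params(gamma, beta, mean, var, eps=1e-5):
        # Precomputed in plaintext before encryption: the division becomes a
        # multiplication by the inverse standard deviation.
        inv_std = 1.0 / np.sqrt(var + eps)
        return gamma, beta, mean, inv_std  # these would then be FHE-encrypted

    def bn_mul_add_only(a, gamma, beta, mean, inv_std):
        # gamma * (a - mean) * inv_std + beta, i.e. only additions and multiplications
        # (subtraction is the addition of a negated, precomputed constant)
        return gamma * (a + (-mean)) * inv_std + beta

    a = np.array([0.2, -1.3, 0.7])
    params = prepare_bn_params(gamma=1.5, beta=0.1, mean=np.mean(a), var=np.var(a))
    print(bn_mul_add_only(a, *params))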

3.2 Activation Function Protection

Due to their innate non-linearity, activation functions need to be approximated with polynomials to be encrypted with FHE. Several approaches have been elaborated in the literature. In [13, 22], the authors proposed to use a square function as activation function; the last layer, a sigmoid activation function, is only applied during training. Chabanne et al. used Taylor polynomials around \(x=0\), studying performance based on the polynomial degree [4]. In [18], Hesamifard et al. instead approximate the derivative of the activation function and then integrate it to obtain their approximation.

Regardless of the approximation technique, we denote the approximation of an activation function \(f_{act}()\) as

$$\begin{aligned} \mathbf {f_{act}()} \approx \mathbf {f_{approxact}()}~[14] \end{aligned}$$
(17)

By construction, we have

$$\begin{aligned} \begin{aligned} \left\langle {\mathbf {a^{[i]}_k}} \right\rangle _\mathbf{pub }&= \left\langle {\mathbf {f_{act}\left( z^{[i]}_k\right) }} \right\rangle _\mathbf{pub }\\&\equiv \left\langle {\mathbf {f_{approxact}\left( z^{[i]}_k\right) }} \right\rangle _\mathbf{pub } \end{aligned}~[14] \end{aligned}$$
(18)
  • Rectifier Linear Unit (ReLU): is currently considered as the most efficient activation function for DL. Several variants have been proposed, such as Leaky ReLU [23], ELU [7] or its differentiable version Softplus.

    $$\begin{aligned} \begin{aligned} ReLU(z)&=z^+=max(0, z) \\ Softplus(z)&= log(e^z + 1) \end{aligned}~[14] \end{aligned}$$
    (19)
  • Sigmoid \(\sigma \). The classical activation function. Its efficiency has been debated in the DL community.

    $$\begin{aligned} Sigmoid(z)= \sigma (z)=\frac{1}{1+e^{-z}}~[14] \end{aligned}$$
    (20)
  • Hyperbolic Tangent (tanh): is currently being used in the industry because it is easier to train than ReLU: it avoids having any inactive neurons and it keeps the sign of the input.

    $$\begin{aligned} tanh(z)= \frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}~[14] \end{aligned}$$
    (21)

Protecting Activation Functions. Due to their innate non-linearity, activation functions need to be approximated with polynomials. [13] proposed using only \(\sigma (z)\), approximating it with a square function. [4] used Taylor polynomials around \(x=0\), studying performance based on the polynomial degree. [18] instead approximate the derivative of the function and then integrate it to obtain their approximation. One alternative would be to use Chebyshev polynomials.
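As a simple illustration of such approximations (a least-squares fit, rather than the Taylor- or derivative-based constructions of [4, 18]), the following NumPy sketch fits a degree-2 polynomial to ReLU on an interval and reports the approximation error; the interval and degree are illustrative choices.

    import numpy as np

    z = np.linspace(-4.0, 4.0, 401)   # interval on which layer inputs are expected
    relu = np.maximum(0.0, z)

    # Degree-2 least-squares polynomial approximation of ReLU: only add/mul needed at runtime
    coeffs = np.polyfit(z, relu, deg=2)
    relu_approx = np.polyval(coeffs, z)

    print("coefficients (z^2, z, 1):", np.round(coeffs, 4))
    print("max abs error on the interval:", np.max(np.abs(relu - relu_approx)))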

4 Architecture

In this section we outline the architecture of our IP protection system, as depicted in Fig. 6.

4.1 Encryption of Trained DNN

At backend level, a DNN is trained by the DNN Training Agent. The training outcome (NN architecture and parameters) is pushed to the Trained DNN Protection Agent. Alternatively, an already trained DNN can be imported directly into the Protection Agent. The DNN Protection Agent obtains a Fully Homomorphic key pair from the Key Generator component. The DNN is then encrypted and stored, together with its homomorphic key pair, in the Trained and Protected DNN Database.

Fig. 6. Activity diagram in our solution [14].

4.2 Deployment of Trained and Protected DNN

At the deployment phase, the Trained DNN Deployment Agent deploys the DNN on the distributed systems, together with its public key.

4.3 DNN Processing

On the distributed system, data is collected by a Data Stream Acquisition component and forwarded to the DNN Processing Agent. The input layer does not involve any computation, and can therefore be seamlessly FHE-encrypted as follows

$$\begin{aligned} \mathbf {X} \xrightarrow {encryption} Enc_{\mathbf {pub}} (X) = \left\langle {X} \right\rangle _\mathbf{pub }~[14] \end{aligned}$$
(22)

Encrypted inferences are sent to the Decryption Agent for decryption, using the private key associated with the DNN. FHE encryption propagates across the DNN layers, from the input to the output layer. By construction, the output layer is encrypted homomorphically.

IP of the DNN, together with the computed inferences, is protected from any disclosure on the distributed system throughout the entire process.

The decryption of the last layer’s output \(\mathbf {Y}\) is done with the private encryption key priv, as in standard asymmetric encryption schemes:

$$\begin{aligned} \left\langle {\mathbf {A^{[L]}}} \right\rangle _\mathbf{pub } \xrightarrow {decryption} Dec_{\mathbf {priv}}\left( \left\langle {\mathbf {A^{[L]}}} \right\rangle _\mathbf{pub }\right) =\mathbf {Y}~[14] \end{aligned}$$
(23)

4.4 Sequential Processes

Encryption of Trained NN. Once a Neural Network is trained or imported, we encrypt all its parameters, using the Protected NN DataBase to store it and handle Homomorphic Keys (Fig. 7).

Fig. 7. Sequence diagram of trained NN Encryption [14].

Deploy Trained and Protected NN. The newly trained and protected deep neural network is deployed on the decentralized systems, including:

  1. Network architecture;

  2. Network model: encrypted parameters;

  3. Public encryption key.

Encrypted Inference. On the decentralized system, data is collected and injected into the deployed NN. We must encrypt \(\mathbf {A^{[0]}}=\mathbf {X}\) with the public encryption key associated to the deployed NN (Fig. 8).

Fig. 8. Sequence diagram of inference processing [14].

Inference Decryption. Encrypted inferences are sent to the backend, together with an identifier of the NN used for the inference. The inference is homomorphically decrypted using the corresponding private decryption key (Fig. 9).

Fig. 9. Sequence diagram of inference decryption [14].

5 Evaluation

As detailed in Sect. 2.3, FHE introduces additional computational costs at each step of the DNN life-cycle. In this section, we evaluate the performance overhead, in terms of computation time, memory load and disk usage, for DNN model encryption, encrypted processing and output decryption.

5.1 Hardware Setup

As backend, we use an NVIDIA DGX-1 server equipped with 8 Tesla V100 GPUs. This machine is theoretically not resource-constrained (computation & memory), so we reasonably neglect the performance overhead introduced by FHE for trained DNN model encryption and output decryption, which both run on the backend.

We deploy and execute our encrypted DNN on an NVIDIA Jetson-TX2. Powered by the NVIDIA Pascal architecture, this platform embeds 256 CUDA cores, an HMP CPU combining a dual-core NVIDIA Denver 2 (2 MB L2) and a quad-core ARM Cortex-A57 (2 MB L2), and 8 GB of memory. This platform is closer to the hardware configuration of a Distributed Enterprise System.

5.2 Software Setup

DNN Model. As demonstrated in Sect. 3, our approach is fully agnostic to the NN topology or implementation. For the sake of our evaluation, which involves several modifications to the NN model, we choose a simple CNN classifier, implemented with the Keras library. Two datasets have been used in our experiment: CIFAR10, for image classification, and MNIST, for handwritten digit classification.

As depicted in Fig. 10, we distinguish two main parts in this CNN: a feature extractor and a classifier. The feature extractor reduces the information contained in the input image to a set of high-level, more manageable features. This step facilitates the subsequent classification of the input data.

Composed of four layers, \([\textit{FC} \rightarrow \textit{ReLU} \rightarrow \textit{FC} \rightarrow \textit{Softmax}]\), the classifier categorizes the input data according to the extracted features, and outputs a discrete probability distribution over 10 classes of objects.

Fig. 10. Keras convolutional neural network.

As a reference point, we evaluate key performance figures at model training and processing time without encryption. Once trained, the size of the CNN plaintext model is 9.6 MB. On the Jetson TX2, a single unencrypted image classification is computed on average in 89.1 ms.

FHE Library. As introduced in Sect. 2.3, several libraries are available for FHE. We use the SEAL [29] C++ library from Microsoft Research, running on CPU. This choice is motivated by the library’s performance, support of multiple schemes such as BGV [3], stability, and documentation. Using SEAL, implemented in C++, together with the Keras Python library requires some engineering effort. To combine the fast performance of the native C++ library with rapid prototyping in Python, we use Cython.

We conduct our evaluation with the BGV scheme [3], utilizing the integer encoding with SIMD support. To handle the floating-point DNN parameters, we use fixed-point arithmetic with a fixed scaling factor, similarly to CryptoNets [13]. This has no noticeable impact on the classification accuracy, if a suitable scaling factor is applied. The SIMD operations allow for optimized performance through vectorization.

Fig. 11. Classification accuracy with ReLU approximation - MNIST dataset.

Fig. 12. Classification accuracy with ReLU approximation - CIFAR10 dataset.

5.3 Linearization

We tackle the problem of linearization of the ReLU function with the two following approaches: we approximate it with a modified square function, and we skip the activation function altogether. The modified square function \(x^2+2x\) (see Fig. 13) is derived from the ReLU approximation proposed in [4]. In order to optimize the computation of that function on ciphertexts, we used simpler coefficients.
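A minimal Keras sketch of this substitution, using the tf.keras API; the surrounding layers are placeholders and not the exact classifier used in our experiments, which only replaces the last ReLU activation.

    import tensorflow as tf
    from tensorflow import keras

    def modified_square(x):
        # FHE-friendly ReLU replacement: x^2 + 2x uses only additions and multiplications
        return tf.square(x) + 2.0 * x

    # Placeholder classifier head (the Softmax layer is omitted, see Sect. 5.4)
    classifier = keras.Sequential([
        keras.Input(shape=(128,)),
        keras.layers.Dense(64),
        keras.layers.Activation(modified_square),  # instead of Activation("relu")
        keras.layers.Dense(10),
    ])
    classifier.summary()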

In order to evaluate the impact of these approaches, we trained the CNN on the CIFAR10 and MNIST datasets, replacing the last ReLU activation. We report the accuracy loss in Figs. 11 and 12. Both approximations have only a minor impact on the output classification accuracy.

Skipping the last activation function shows good results on this simple CNN, but we do not generalize this observation to other DNNs or datasets.

Fig. 13. ReLU approximation as square function.

5.4 Experimentation Results

Model & Data Protection. Intellectual Property-wise, we consider the feature extractor to be of minor importance, as CNNs generally use state-of-the-art feature extractors. The IP of the model rather lies in the parameters, weights and biases, of the trained classifier. For that reason, we encrypt the classifier only, as a first step towards full model encryption, as depicted in Fig. 10. To better understand the impact of computation depth, we also complete our evaluation with the encryption of the last FC layer only.

Confidentiality-wise, we evaluate the impact of extracted features encryption by comparing processing performance on an encrypted model with plaintext and encrypted feature extractor outputs.

As depicted in Fig. 10, we evaluate our approach on three modified versions of the model:

  • Last FC Layer Encrypted

  • Full Classifier Encrypted with no Activation Function

  • Full Classifier Encrypted with our Modified Square Activation Function

In order to optimize our approach, we omit the Softmax layer within the classifier. This layer does not have any influence on the classification results, as the Softmax layer is mostly required at training time, to normalize the network outputs into a probability distribution for more consistent loss calculations.
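A one-line check of this argument: Softmax is monotonic, so omitting it at inference time leaves the predicted class unchanged (the logits below are illustrative).

    import numpy as np

    logits = np.array([1.2, -0.3, 3.1, 0.4])           # raw network outputs
    softmax = np.exp(logits) / np.sum(np.exp(logits))  # normalized probabilities
    assert np.argmax(logits) == np.argmax(softmax)     # same predicted class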

The overall experiment, as described in Sect. 4, has been run 5 times on each model. We report average evaluation metrics for each step: model encryption, encrypted processing and decryption.

DNN Model Encryption. Each trained CNN model is encrypted on the DGX-1's CPU. In Table 2, we report the average resource consumption for the following metrics:

  • Time to Compute: Time to encrypt the model.

  • Model Size: Size of resulting encrypted model.

  • Memory Load: Overall memory usage for model encryption.

We target three security levels: 128, 192, and 256 bits. For each of those, we optimize the SEAL parameters as introduced in Sect. 2.3, maximizing performance and minimizing the leftover noise budget. Note that the security level can have a counter-intuitive effect on performance, where for instance the 192-bit security level might be faster than the 128-bit security level. This can be explained by the fact that the 128-bit security level offers more (unnecessary) noise budget, depending on the choice of FHE scheme parameters (e.g. plaintext modulus, polynomial degree). Therefore, we target a remaining noise budget as close as possible to zero.

Compared to the plaintext model size (9.6 MB), the encrypted model size increases by a factor of 8.22 in the best case, and by up to a factor of 1173.33 in the worst case.

Table 2. Model encryption

DNN Processing Encryption. The three encrypted CNN models are deployed on the Jetson-TX2 for CPU-based encrypted processing. At this stage, we evaluate the following metrics:

  • Time to compute: Processing time for an encrypted classification.

  • Memory: Memory usage for encrypted classification.

  • Remaining Noise Budget: at the end of encrypted processing, we evaluate the remaining noise budget, which determines whether additional homomorphic operations could be performed on the output vector.

In Tables 3 and 4, we depict the performance of encrypted processing with plaintext and with encrypted previous-layer outputs, studying the impact of preserving the confidentiality of the preceding layer's outputs. The SEAL library supports computation between a plaintext and a ciphertext, producing a ciphertext. As a consequence, the output of the last MaxPooling2D layer can be fed in plaintext form to the FHE-encrypted Fully Connected layer. Such plaintext-ciphertext computation has a lower impact on performance.

We observe a slight performance improvement in computation time and memory between the 128 and 192-bit security levels. This is due to the FHE parameter optimization described above, where the initial noise budget is oversized for the 128-bit security level, which has a direct impact on performance.

Experiment results show that, depending on the achieved security level and the targeted scenario, we can perform an encrypted classification in, at best, 2.1 s (for the 128-bit security level with only one layer encrypted). In the worst case, with encrypted input and the full classifier encrypted with a modified square function as activation layer, 5627 s (93 min) are required for a single classification.

Table 3. Runtime encryption with plaintext input.
Table 4. Runtime encryption with encrypted input.
Table 5. Decryption - performance.

Decryption. Following our approach, encrypted outputs are decrypted by the backend, on the DGX-1. We therefore consider decryption as not computationally expensive compared to encryption. Results are available in Table 5.

6 Conclusion

In this paper, we discuss and evaluate a holistic approach for the protection of distributed DNN-based/enhanced software assets, i.e. confidentiality of their input & output data streams as well as safeguarding their Intellectual Property. On that matter, we take advantage of Fully Homomorphic Encryption (FHE). We evaluate the feasibility of this solution on a Convolutional Neural Network (CNN) for image classification.

Our evaluation on the NVIDIA DGX-1 and Jetson-TX2 shows promising results for the CNN image classifier. Firstly, the impact of the activation function approximation is negligible, with almost no accuracy loss on the output classification probability. Most of the overhead is introduced at processing time, affecting computation time & memory consumption. Performance varies from 2.1 s for an encrypted classification, with only 53.9 MB of consumed memory, up to 1 h 33 min with almost 5 GB of consumed memory. This requires balancing the expected classification throughput, the targeted security level and the encryption depth of the model. Currently, this approach would be unrealistic for real-time analytics with DNN-based/enhanced software assets. Still, the Industry calls for numerous scenarios – such as predictive maintenance – matching the current performance of our approach.

As future work, we aim to improve the performance of our approach by different means: following the constant evolution of FHE, such as the recent CKKS scheme [5], accelerating FHE libraries on GPU-based infrastructure, or optimizing vectorized operations on FHE-encrypted data [1]. In addition, we foresee a deployment of our solution in a Smart City scenario for risk prevention in public spaces, while expanding our approach to different types of DNNs and to the complete encryption of CNNs, including the feature extraction layers.