
1 Introduction

Machine learning (ML), and neural networks specifically, are widely deployed in many different scenarios, from voice assistants like Siri [27], Alexa [6], and Google Assistant [21], through writing assistants like Grammarly [22] and chatbots like Bard [20] and ChatGPT [39], to medical diagnostic systems [16, 30]. Many of these systems deal with privacy-sensitive data, some of which enjoys special legal protection, e.g., medical data. These systems send the data to a server, which runs it through its model and returns the result to the client. Since the server needs access to the unencrypted client data to perform the computation, the client’s privacy is at risk. The server might use the data to train further ML models, which could expose the data to privacy attacks, or the server itself could be breached and the data stolen. Researchers have recently proposed solutions to protect user data privacy in ML applications using different methods. Differential Privacy [18] solutions preserve the privacy of the training data in the trained model [1, 40]. To protect the data during inference, solutions commonly use Secure Multiparty Computation (SMC) [12, 14, 36, 37], Fully Homomorphic Encryption (FHE) [31, 33, 34], or a mixture of the two [7, 38]. SMC allows multiple parties to jointly evaluate a function without revealing their private inputs; however, it requires all parties to stay online during the computation. FHE, on the other hand, can be used entirely offline. FHE is a type of encryption that allows computation on encrypted data without exposing any inputs, intermediate values, or final results. Neural networks are a popular choice for privacy-preserving ML models since most operations, like fully connected layers or convolutions, can be performed easily under FHE. Additionally, neural networks perform very well on a wide range of tasks.

However, FHE introduces significant time and memory overhead. Some FHE schemes support single instruction multiple data (SIMD) processing, which can offset some of this overhead. FHE ciphertexts can be thought of as fixed-size encrypted vectors containing thousands of elements, called slots. Two approaches for filling the slots have been used for ML. 1.) Pack all the features of an instance into as few ciphertexts as possible and perform convolutions and dot products with the help of rotations [2, 9, 34], called inter-axis packing. This has the advantage that the number of ciphertexts and total operations is relatively small, making it fast for a small number of instances. However, this approach often requires large rotation keys, and the rotations require additional time. 2.) Pack the same feature of multiple instances into a single ciphertext [17, 25, 42], called batch-packing. This produces as many ciphertexts as the data has features. Batch-packing allows us to simultaneously compute results for many instances, leading to low amortized per-instance cost and high throughput. However, it suffers from high latency and memory requirements. Batch-packing is beneficial when many instances need to be processed and low latency is not essential, for example, in a medical image diagnostic system where images are collected throughout the day and analyzed by an ML system overnight.

This work focuses specifically on convolutional neural networks (CNN). We address the memory requirements of convolutional layers by trading disk space for main memory. Disk space is typically orders of magnitude cheaper; however, it is also slower.
We dynamically load ciphertexts and plaintexts and clear them from memory when they are no longer needed. We present and compare different strategies and their impact on memory and runtime. Prior work focuses primarily on latency reduction; reduction in memory is often a side effect of inter-axis packing. To the best of our knowledge, this is the first study that performs an in-depth analysis of caching strategies and memory reduction for batch-packed inference. Brutzkus et al. [9] and Lee et al. [34] propose input packing techniques, which reduce the number of ciphertexts and thereby the memory requirements. However, these approaches require additional operations like masking and rotation, which lower the overall throughput. Boemer et al. [7] present a complex encoding, allowing them to fit more values into a ciphertext. This can reduce the number of ciphertexts and plaintexts when using inter-axis packing; for batch-packing, however, it only affects the batch size. Approaches that use client interaction, such as Boemer et al. [7], Podschwadt et al. [41], and Cai et al. [10], can often use smaller crypto parameters, since the client interaction resets the noise level, allowing for further computation. However, these approaches require the client to be online during the computation. We make the following main contributions:

  • We propose a schedule representation for convolutions that allows us to reorder its fundamental operations to achieve increased caching performance.

  • We propose a memory estimation algorithm for schedules.

  • We propose an algorithm for executing a schedule using multiple threads.

  • We propose multiple strategies for creating schedules, which we analyze and experimentally evaluate with regard to their time and memory requirements.

The paper is organized as follows: in Sect. 2, we discuss the theoretical background and notation. In Sect. 3, we discuss related work before we describe our proposed approach in detail in Sect. 4. Section 5 describes ways to reorder the computation to reduce memory requirements, which we experimentally evaluate in Sect. 6. We conclude the paper in Sect. 7.

2 Background

Here, we consider 2-D convolutional layers since they are commonly used in image classification, a prevalent ML task. However, our proposed approach is not limited to 2-D and can easily be transferred to convolutions in other dimensions. We consider convolutions with inputs X, weights W, and outputs Y, where X, W, and Y are all four-dimensional tensors. The first dimension of X and Y is the batch dimension, and \(|\cdot |\) denotes the number of elements in a tensor. Lowercase bold letters, e.g., \(\textbf{x}\), indicate elements of a tensor.

2.1 Fully Homomorphic Encryption

FHE schemes are public-key crypto schemes that can evaluate addition and multiplication on encrypted data without decrypting it at any point. The result of the computation is also encrypted; after decryption, the result is as if the computation had been performed on plain data. In this paper, we use the Residue Number System (RNS) version of the Cheon-Kim-Kim-Song (CKKS) scheme [13]. Unlike most other schemes, CKKS supports real numbers. However, it performs encrypted computation only approximately, leading to approximation errors. The error appears first in the least significant bits of the result, which keeps it small. We can think of CKKS plaintexts and ciphertexts as one-dimensional vectors of values offering vectorized, element-wise SIMD computation [44]. The maximum number of values, typically called slots, is determined by the security parameters and is a power of two. The number of filled slots in a ciphertext does not impact the performance of an operation, allowing us to add or multiply thousands of values at once.

2.2 Batch Packing

We consider \(n_s\) to be the number of slots in a ciphertext. For simplicity, we assume that the batch size is equal to \(n_s\); otherwise, we would need to split the data into multiple batches or pad it. To encrypt the inputs, we flatten all dimensions of X except the batch dimension. We take each column of the resulting two-dimensional matrix and encrypt it into a ciphertext, leaving us with a vector of ciphertexts. We need to encode the weights as well. Each weight value in W is encoded into its own plaintext; before encoding, we turn each value into a vector by repeating it \(n_s\) times. This produces \(|X|/n_s\) ciphertexts and |W| plaintexts. If the model needs to be encrypted as well, we can encrypt the encoded plaintext weights, in which case W is represented by |W| ciphertexts. We can think of the encoding as setting the batch axis to one. The issue that arises is that FHE ciphertexts and plaintexts require a substantial amount of memory: a single ciphertext can range from a few hundred kilobytes to multiple megabytes, depending on the crypto parameters, and a plaintext is half the size of a ciphertext. We refer to the encoded and/or encrypted inputs, weights, and outputs as \(X'\), \(W'\), and \(Y'\), respectively, and to values taken from them as \(\textbf{x}'\), \(\textbf{w}'\), and \(\textbf{y}'\).
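To make the packing concrete, the following minimal Python sketch (an illustration of ours, not the paper's implementation) emulates batch packing with NumPy arrays; the `encrypt` and `encode` stand-ins take the place of calls into an FHE library such as OpenFHE.

```python
import numpy as np

def batch_pack_inputs(X, encrypt):
    """X has shape (n_s, h, w, c_in): one slot per instance in the batch.

    Flatten everything except the batch axis; each column of the result
    (one feature across all instances) becomes one ciphertext.
    """
    n_s = X.shape[0]
    flat = X.reshape(n_s, -1)                            # shape (n_s, h*w*c_in)
    return [encrypt(flat[:, f]) for f in range(flat.shape[1])]

def batch_pack_weights(W, n_s, encode):
    """Each scalar weight is repeated n_s times and encoded as one plaintext."""
    return [encode(np.full(n_s, w)) for w in W.reshape(-1)]

# Toy stand-ins: a real system would call the FHE library here.
encrypt = encode = lambda vec: np.asarray(vec, dtype=float)

X = np.random.rand(8, 5, 5, 2)      # batch of 8 instances, 5x5 images, 2 channels
W = np.random.rand(3, 3, 2, 2)      # 3x3 filters, 2 input / 2 output channels
X_prime = batch_pack_inputs(X, encrypt)      # |X|/n_s = 5*5*2 = 50 "ciphertexts"
W_prime = batch_pack_weights(W, 8, encode)   # |W| = 3*3*2*2 = 36 "plaintexts"
```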

2.3 Convolutional Layers

Here, we consider the two-dimensional convolutions commonly used in neural networks; however, other dimensionalities work fundamentally the same way. Goodfellow et al. [19] define the operation as follows: given the inputs X and W and the output Y, which are all tensors, the two-dimensional convolution is defined as:

$$\begin{aligned} Y_{b,m,n, c_{\text {out}}}=\sum _j\sum _k\sum _{c_{\text {in}}} X_{b,m-j,n-k, c_{\text {in}}} W_{j,k, c_{\text {in}}, c_{\text {out}}} \end{aligned}$$
(1)

We use subscripts to indicate a single element in a tensor, where b is the batch index, m and n the spatial output indices, j and k the filter position, \(c_{\text {in}}\) the input channel index, and \(c_{\text {out}}\) the output channel index. Eq. 1 needs to be computed for all values in Y.

2.4 Lock-Free Multi-threaded Convolution

The most straightforward way to compute a convolution with multiple threads is to have each \(\textbf{y}\) computed by a thread; Eq. 1 is computed by a separate thread for each unique \((b,m,n,c_{\text {out}})\). For \(s = |Y|\), we can use at most \(n_t = s\) parallel threads without requiring any synchronization between the threads: all threads read from the shared resources X and W but do not modify them, and every \(Y_{b,m,n,c_{\text {out}}}\) is only modified by one thread, ruling out any race conditions that could lead to lost updates. With fewer than s threads, threads can compute multiple \(Y_{b,m,n,c_{\text {out}}}\). With more than s threads, we either need synchronization or cannot use the additional threads. Generally, given \(n_t\) threads where each thread is assigned a unique integer \( i \in [1,n_t]\), we can use Algorithm 1.

[Algorithm 1: Lock-free multi-threaded convolution]

We assume that inputs are stored on a disk (the term disk refers to any persistent storage, i.e., a hard disk drive or solid-state drive) and must be loaded into memory. With Algorithm 1, we have two options to keep it lock-free: 1.) load X and W before we start the computation, or 2.) have each thread load the \(\textbf{x}\) and \(\textbf{w}\) it needs on demand. Option 1.) has the upside that we only need to load each value once and can reuse it at no additional cost. The downside is that we need to keep all values in memory for the entire computation. Option 2.), on the other hand, keeps far fewer objects in memory: each thread has only three objects in memory, one \(\textbf{x}\), one weight \(\textbf{w}\), and the output \(\textbf{y}\). However, each thread must perform two loads for each iteration of the nested sums in line 6. Furthermore, multiple threads may load the same \(\textbf{x}\) and \(\textbf{w}\), causing redundant loads. A further issue with this algorithm arises when |Y| is not divisible by \(n_t\). In this case, \(n_t - (|Y| \bmod n_t)\) threads finish one iteration (line 4) early and are idle for the rest of the computation, leading to unused computational resources. However, this impact is small if |Y| is large compared to \(n_t\). Performing the second option on plain data leads to slower execution since arithmetic operations are much faster than data loading. Additionally, a single \(\textbf{x}\) or \(\textbf{w}\) is so small that we cannot save significant memory by loading it on demand.
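The following Python sketch illustrates this work distribution on plain NumPy data; it is our own rendering of the idea behind Algorithm 1, not a transcription of it, and it uses the cross-correlation indexing common in neural network libraries. Each thread computes only the outputs whose flat index is congruent to its id modulo \(n_t\), so no two threads write the same element and no locks are needed.

```python
import numpy as np
from threading import Thread

def conv_thread(X, W, Y, tid, n_t):
    """Thread tid computes every output whose flat index i satisfies i % n_t == tid,
    so no two threads ever write the same output element (lock-free)."""
    fh, fw, c_in, _ = W.shape
    for flat in range(tid, Y.size, n_t):
        b, m, n, co = np.unravel_index(flat, Y.shape)
        acc = 0.0
        for j in range(fh):
            for k in range(fw):
                for ci in range(c_in):
                    acc += X[b, m + j, n + k, ci] * W[j, k, ci, co]
        Y[b, m, n, co] = acc          # only this thread writes this element

n_t = 4
X = np.random.rand(2, 5, 5, 2)        # batch 2, 5x5 input, 2 input channels
W = np.random.rand(3, 3, 2, 2)        # 3x3 filter, 2 input / 2 output channels
Y = np.zeros((2, 3, 3, 2))            # valid output: (5-3+1) x (5-3+1)
threads = [Thread(target=conv_thread, args=(X, W, Y, i, n_t)) for i in range(n_t)]
for t in threads: t.start()
for t in threads: t.join()
```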

Running Algorithm 1 on encrypted data is straightforward using the batch packing described earlier: we replace X with \(X'\), W with \(W'\), and Y with \(Y'\). This replacement sets the batch dimension to one, allowing us to remove it from consideration. For the algorithm, it does not matter whether \(W'\) consists of encoded FHE plaintexts or, if the model is encrypted, of ciphertexts. In the case of a plaintext model, we assume that the unencoded weights W are loaded into memory before the computation starts and are encoded when needed; strictly speaking, we therefore do not need to load them, but for simplicity we still call this operation loading.

3 Related Work

Akavia et al. [3] focus on reducing the storage footprint of FHE ciphertexts rather than their in-memory size during computation. They design a protocol that allows multiple data producers to upload and store data in the cloud with no overhead compared to storing AES (Advanced Encryption Standard) encrypted data. The AES-encrypted data is stored on an untrusted server and, using secret sharing, a computing server can use the data for HE computation with the help of an auxiliary server. In contrast, our proposed solution reduces the memory footprint at computation time rather than the encrypted storage size.

Jiang et al. [28], Brutzkus et al. [9], Lee et al. [34], Dathathri et al. [15], and Lee et al. [33] are conceptually similar works that all reduce the number of ciphertexts required by using inter-axis packing. While all these approaches reduce the inference latency, they require expensive rotations, lowering the throughput compared to batch-packed solutions. Additionally, they often rely on designing the packing strategy for the specific network architecture.

Other studies rely on interactive solutions for privacy preservation. Hao et al. [23] and Huang et al. [26] both propose efficient matrix multiplications in a two-party setting. Both studies propose rotation-free matrix multiplication over polynomial encoded ciphertexts. However, both require interactive phases where one party must extract specific polynomial coefficients and mask the result. Zheng et al. [46] propose a method for fast private inference using transformers and SMC. The authors use a similar protocol to the one proposed by Juvekar et al. [29], where the server performs much of the expensive matrix multiplication computation in an offline phase. Zheng et al. [46] reduce the number of ciphertext rotations required by packing the same feature of different tokens into the same ciphertext, similar to the batch-packing we use in our approach. However, we compute the intermediate terms in a less memory-consuming way.

Prior work on batch-packed PPML using FHE [8, 11, 17, 24] does not explicitly state how matrix multiplications or convolutions are performed. These works focus on other improvements like better polynomial approximation [11, 24] or parameter fusion and special value bypass [8]. We believe most of these solutions could decrease their memory requirements using our proposed algorithm. Another work that addresses memory limitations is Badawi et al. [4], which implements a CNN over FHE data using GPU acceleration for the basic ciphertext operations. To fit the input to the convolution into GPU memory, they split it into multiple blocks of the same size as the filter. The filter and as many of these blocks as possible are loaded into GPU memory, where the convolution is performed. Compared to our proposed approach, this process only reduces the memory requirement on the GPU; the input and weights still need to be present in main memory. Shivdikar et al. [43] also present techniques aimed at GPUs. They reduce the repeated memory reads inside the GPU when performing polynomial multiplication for HE primitives. While this speeds up the low-level operations underpinning most HE schemes, unlike our work, it does not address the issue of requiring a large number of plaintexts or ciphertexts in memory.

4 Our Proposed Approach

To address the issues of memory consumption and unused resources, we model the convolutional layer as a schedule, which determines the order of operations. We present and compare multiple schedule construction strategies based on the computation and available resources. We further present an algorithm to execute a schedule. From now on, we assume that all tensors are flattened.

4.1 Modeling the Problem as a Schedule

We can write each element \(\textbf{y}\) as a sum of products of elements \(\textbf{x}\) and \(\textbf{w}\). We denote such a product as the triple \(t = (\textbf{x},\textbf{w}, \textbf{y})\), where \(\textbf{y}\) is the result that holds the sum the product \(\textbf{x}\textbf{w}\) is part of. To refer to an element of a triple t, we use the notation \(t^i\), \(i \in \{x,w,y\}\).

[Algorithm 2: Schedule generation for a two-dimensional convolution]

Definition 1 (Schedule)

Let f be a convolutional layer; we say \(t_i \in f\) iff the sum to compute \(t_i^y\) contains the product \(t_i^x t_i^w\). A schedule is an ordered list of triples \(t_i\) that contains all \(t_i \in f\) exactly once.

In other words, we represent f as a sequence of all its element-wise products. To compute the function f, we need to compute all products given in the schedule. Additionally, we must sum all products with the same value for \(\textbf{y}\). We call the number of triples in a schedule the length or steps of a schedule, denoted by |S|. Algorithm 2 shows how to generate a schedule for two-dimensional convolutions. Higher dimensional convolutions work analogously by expanding the iteration bounds in lines 2 and 4, the decomposition of i and j in lines 3 and 5, and the formula for \(t^x\) and \(t^w\) by the extra dimensions. In addition to the computation steps, we also insert load instructions into the schedule. Load instructions specify which elements to load into memory, discard from memory, or write back to disk in case they were updated.
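As an illustration of this construction (a sketch of the idea behind Algorithm 2, not a transcription; the flat-index conventions are our own), the following Python snippet enumerates all triples for a stride-1, unpadded two-dimensional convolution.

```python
def make_schedule(in_h, in_w, c_in, fh, fw, c_out):
    """Return the schedule as an ordered list of triples (x, w, y) of flat indices.

    Flat-index conventions (ours, for illustration):
      x' index for input position (r, c, ci) : (r * in_w + c) * c_in + ci
      w' index for weight (j, k, ci, co)     : ((j * fw + k) * c_in + ci) * c_out + co
      y' index for output (m, n, co)         : (m * out_w + n) * c_out + co
    """
    out_h, out_w = in_h - fh + 1, in_w - fw + 1          # stride 1, no padding
    schedule = []
    for m in range(out_h):                               # every output row
        for n in range(out_w):                           # every output column (a window)
            for co in range(c_out):                      # every output channel
                y = (m * out_w + n) * c_out + co
                for j in range(fh):
                    for k in range(fw):
                        for ci in range(c_in):
                            x = ((m + j) * in_w + (n + k)) * c_in + ci
                            w = ((j * fw + k) * c_in + ci) * c_out + co
                            schedule.append((x, w, y))
    return schedule

S = make_schedule(in_h=5, in_w=5, c_in=2, fh=3, fw=3, c_out=2)
assert len(S) == (5 - 3 + 1) ** 2 * 2 * 3 * 3 * 2        # |S| = out_h*out_w*c_out*|filter|
```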

4.2 Executing a Schedule

To execute a schedule, we evaluate all triples in order. To evaluate a triple t, we multiply the input \(X_{t^x}\) by the weight \(W_{t^w}\) and add the result to \(Y_{t^y}\); \(Y_{t^y} = Y_{t^y} + X_{t^x}W_{t^w}\). We assume that all \(\textbf{y}\) are 0 at the beginning. We parallelize the execution of the schedule across multiple threads, each of which repeatedly evaluates the first unevaluated triple. This requires synchronization at two points: 1.) we must ensure that every triple is evaluated exactly once, and 2.) unlike in Algorithm 1, we cannot guarantee that multiple threads do not write to the same output; therefore, we need locking to prevent race conditions. To ensure that all values are correctly summed into the output values, we use Algorithm 3. The parts that must be protected from concurrent access are marked as Critical Section.

[Algorithm 3: Multi-threaded schedule execution]

Line 8 indicates where in the algorithm load instructions are processed. A load instruction has three attributes: 1.) the step at which it is executed, 2.) the type of instruction, load or unload, and 3.) the object to load. In every iteration, each thread checks whether there is an unprocessed load instruction with a step equal to or lower than the step the thread is executing. If there is, the thread marks it complete and executes it. Again, we must ensure that only one thread updates the load instructions at any time. Each thread tries to execute any outstanding load instructions before moving on. Objects loaded through load instructions stay cached until explicitly unloaded through another load instruction or until the computation is complete. If a thread requires values not loaded by any load instruction, it loads them on demand and does not cache them.
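The sketch below captures the structure of this execution model in Python: a shared step counter and the load-instruction pointer are protected by locks, outputs are accumulated under per-output locks, and values not covered by a load instruction are fetched on demand without caching. It simplifies Algorithm 3 (for example, unload instructions take effect as soon as one thread reaches them, rather than after all threads have passed the point), and all names and data structures are ours.

```python
import threading

def execute_schedule(schedule, loads, load_obj, n_threads):
    """Evaluate all triples (x, w, y) of a schedule with n_threads workers.

    `loads`    : list of (step, action, kind, index) instructions, sorted by step,
                 with action in {"load", "unload"} and kind in {"x", "w"}.
    `load_obj` : callable (kind, index) -> value, standing in for a disk read
                 (or, for plaintext weights, an on-the-fly encoding).
    """
    step_lock = threading.Lock()    # protects the shared step counter
    cache_lock = threading.Lock()   # protects the cache and load-instruction pointer
    out_locks = {y: threading.Lock() for (_, _, y) in schedule}
    Y = {y: 0.0 for (_, _, y) in schedule}
    cache, state = {}, {"step": 0, "load": 0}

    def fetch(kind, idx):
        with cache_lock:
            if (kind, idx) in cache:
                return cache[(kind, idx)]
        return load_obj(kind, idx)  # on-demand load, not cached

    def worker():
        while True:
            with step_lock:                      # critical section: claim one step
                step = state["step"]
                if step >= len(schedule):
                    return
                state["step"] += 1
            with cache_lock:                     # critical section: due load instructions
                while state["load"] < len(loads) and loads[state["load"]][0] <= step:
                    _, action, kind, idx = loads[state["load"]]
                    state["load"] += 1
                    if action == "load":
                        cache[(kind, idx)] = load_obj(kind, idx)
                    else:
                        cache.pop((kind, idx), None)
            x, w, y = schedule[step]
            prod = fetch("x", x) * fetch("w", w)
            with out_locks[y]:                   # critical section: accumulate output
                Y[y] += prod

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return Y
```

For a toy run, `execute_schedule(S, [], lambda kind, idx: 1.0, 4)` with the schedule S from the previous sketch returns 18.0 for every output, i.e., the number of products summed into each output.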

4.3 Cost of a Schedule

We can use the schedule to estimate the maximum memory required for execution on encrypted data. The maximum memory is important since we cannot execute the schedule if it requires more than the available memory. Most of the memory required during execution stems from the ciphertexts and plaintexts; therefore, we ignore additional objects like keys, the schedule, and other data in our estimation. To estimate the cost, we look at the load instructions, the number of threads, and the objects loaded on demand. We first examine the simpler case with only one thread and extend it to multiple threads later. Let \(s_x\) be the size of a single \(\mathbf {x'}\), \(s_w\) the size of a single \(\mathbf {w'}\), and \(s_y\) the size of a single \(\mathbf {y'}\). To estimate the memory requirement of a schedule, we perform the following steps (a sketch of the procedure follows the list):

  1. Split the schedule into parts at the load instructions so that each part begins with load instructions and contains no other load instructions except those at the beginning. A part must not consist solely of load instructions.

  2. For each part, count how many \(\mathbf {x'}\), \(\mathbf {w'}\), and \(\mathbf {y'}\) are loaded and unloaded.

  3. Weight the counts of \(\mathbf {x'}\), \(\mathbf {w'}\), and \(\mathbf {y'}\) by \(s_x\), \(s_w\), and \(s_y\), respectively.

  4. For every step, weight the objects loaded on demand by their sizes and sum them.

  5. For each part, add the weighted counts from step 3 to the maximum from step 4. The maximum over all parts is our estimate for the schedule.

We now extend the estimation to multiple threads. The estimate for multiple threads is less precise than that for a single thread since we can only make assumptions about how the threads will interact. We make the following simplifying assumptions: 1.) threads execute schedule steps at the same speed, and 2.) a continuous block of load instructions is executed simultaneously, no matter how many instructions are in that block. The main ideas are as follows: if we have split the schedule into parts that contain fewer steps than we have threads \(n_t\), we merge adjacent parts until all parts contain at least as many steps as we have threads available; then we identify the \(n_t\) steps that require the most memory in each part. To do this, we start as in the single-threaded case above. Next, we look at the number of steps in each part. If a part has fewer steps than the number of threads \(n_t\), we combine it with the next part to form a new part by adding the cost of the load instructions, repeating until the new combined part has more steps than threads. We repeat this for all parts of the schedule. To estimate the cost of the on-demand loaded objects, we assume that \(n_t\) schedule steps are executed at the same time. In the final step, we handle the cost of the schedule steps. We compute the on-demand cost for all steps in the schedule parts created in the previous step, in the same way as described in the single-threaded case above. However, instead of adding only the step with the highest cost, we add the \(n_t\) steps with the highest costs. This method provides a reasonable estimate of the memory cost of a schedule under multi-threaded execution.
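Continuing the sketch above, one possible rendering of the multi-threaded adjustment merges parts shorter than \(n_t\) steps and charges the \(n_t\) costliest on-demand steps of each merged part; this is an illustration under the stated simplifying assumptions, not the exact estimator.

```python
def estimate_memory_multi_thread(parts, sizes, n_t):
    """Multi-threaded variant: merge parts shorter than n_t steps, then charge the
    cached objects plus the n_t costliest on-demand steps of each merged part."""
    merged, buf = [], None
    for part in parts:
        if buf is None:
            buf = {"loads": list(part["loads"]), "on_demand": list(part["on_demand"])}
        else:
            buf["loads"] += part["loads"]
            buf["on_demand"] += part["on_demand"]
        if len(buf["on_demand"]) >= n_t:          # part now has at least n_t steps
            merged.append(buf)
            buf = None
    if buf is not None:
        merged.append(buf)

    cached, peak = 0, 0
    for part in merged:
        for kind, delta in part["loads"]:
            cached += delta * sizes[kind]
        costs = sorted((sum(sizes[k] for k in s) for s in part["on_demand"]), reverse=True)
        peak = max(peak, cached + sum(costs[:n_t]))
    return peak
```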

4.4 Threat Model

In this work, we assume that all parties are honest but curious: they follow all protocols and algorithms without deviation but try to learn as much information as they can. The server offers private inference to the client. Only the dimensions of the data and the data domain need to be shared between the client and server in plaintext. The actual instances and the inference results are only ever shared in encrypted form. Besides the input and output dimensions of the model, the client gains no additional information about the model. However, model extraction attacks by the client, as described by Tramer et al. [45], can still threaten the server-side model. Additionally, since we rely on the CKKS scheme, the client needs to make sure not to share decrypted inference results with the server, since this can be used to compromise the security of the client’s secret key [35]. Our proposed approach reorders the operations the server performs on the encrypted data, which is not observable by the client. The client can only observe the time the server needs to perform the computation. Without knowing the server’s hardware configuration, this does not provide any useful information to the client. Even if the client knows the exact hardware configuration, it cannot learn any more information than it would if the server used a different computational model.

5 Reduced Memory Schedules

Fig. 1. Breaking an example base schedule down into multiple sub-schedules. This schedule is executed row-wise.

In this section, we propose different ways to construct schedules. These schedules provide trade-offs between runtime and memory. The fastest way to execute a schedule is to load all data at the beginning of the computation and then use the lock-free Algorithm 1. However, this requires a large amount of memory. We can reduce the memory footprint by loading everything on demand, but this increases the runtime significantly.

We can transform the computation performed by Algorithm 1 into a schedule. Again, consider \(n_t\) to be the number of threads, \(n_o\) the number of outputs of the computation, and \(n_f\) the number of products that sum up into a single output. In Algorithm 1, every thread i executes a sub-schedule containing exactly the triples \(t \in S\) whose output is assigned to thread i, i.e., \(t^y \equiv i-1 \pmod {n_t}\). We can obtain the combined schedule by taking all sub-schedules and interleaving them elementwise. See Fig. 2 for an example with three threads.

Fig. 2. Example of how to turn a lock-free execution with three threads into a schedule.

The lock-free algorithm computes \(n_t\) outputs simultaneously and needs to keep them in memory. The base schedule, on the other hand, fully computes a single output before moving on to the next one, allowing us to keep fewer outputs in memory. This is the lowest amount of memory we can achieve; however, we need to load objects from disk frequently and do not use any caching. Caching aims to reduce the number of load operations as much as possible. We can exploit the regular structure of convolutions to find the best values for caching. We can split a schedule into a regular, repeating pattern defined by the size and number of filters and input channels. In two-dimensional convolutions, as used in neural networks, we have a four-dimensional filter volume, W, whose dimensions are, in order: i, j the position in the filter, \(c_{\text {in}}\) the input channel, and \(c_{\text {out}}\) the output channel. We move W across the entire input, creating \(c_{\text {out}}\) outputs at every position. Note that how far we move the filter is given by the stride, which we assume to be one here; however, our method remains applicable to other stride values. At a given position of W, each output uses the same values from X. We call each unique position of the filter on the inputs the filter position or window.

We need to keep three kinds of objects in memory during the computation: inputs \(\textbf{x}\), weights \(\textbf{w}\), and outputs \(\textbf{y}\). We design multiple caching strategies based on the memory available. We do not cover the trivial case in which all values of \(\textbf{x}\) and \(\textbf{w}\) fit into memory.

5.1 Caching by Object Type

The simplest caching strategy is to load either all values from \(X'\) or all values from \(W'\) at the beginning of the computation and load the other values on demand as needed. This strategy creates very simple schedules; however, it underutilizes caching. If we preload all \(\mathbf {x'}\), we load too many values much earlier than needed in the computation, and if we preload all \(\mathbf {w'}\), we need to load \(\mathbf {x'}\) values frequently.

5.2 Full Window Caching

We can improve the caching by object type strategy by utilizing the underlying structure of the convolution operation. To obtain the output values, we move the filter across the input values; at each position, each output channel of the filter creates one output value. The filter values are the same at every position. Therefore, if we load them only once and cache them for the duration of the computation, we save a significant number of load operations. However, the input values change with each position. Each position of W requires only \( |W'|/c_\text {out}\) \(\mathbf {x'}\). If we can fit these objects and all \(\mathbf {y'}\) into memory, we only load \(W'\) once. Since \(W'\) usually moves over the inputs with some overlap, i.e., the stride is smaller than the width and height of the filter, we can reuse many \(\mathbf {x'}\) and only need to unload and reload a small number. We start at the top left and move W from left to right. Once we reach the end on the right, we move down and start over on the left, repeating until we reach the bottom right. If the filter size or stride is not symmetrical, it is beneficial to first move in the direction that has the most overlap, reducing the number of values that need to be loaded and increasing the number of values that can be reused.

Fig. 3. Load instructions that are necessary when moving from the first window to the second using window caching with a 5\(\,\times \,\)5\(\,\times \,\)2 input and a 3\(\,\times \,\)3\(\,\times \,\)2\(\,\times \,\)2 kernel.

5.3 Partial Window Caching

If we can fit \( |W'|/c_\text {out}\) \(\mathbf {x'}\) but not all of \(W'\) into memory, we can modify the full window caching strategy to reduce the number of loads. Let n be the number of \(\mathbf {w'}\) that we can fit into memory in addition to all the \(\mathbf {x'}\) in the window. We then split the schedule into sub-schedules, one for every position of W. To reorder the sub-schedules and increase the caching potential, we reverse every second sub-schedule; see Fig. 4. This reordering ensures that, for any \(i \in [1,|W|]\), the i weight values immediately before and after a sub-schedule boundary are the same. This allows us to cache the last n values before a sub-schedule boundary and reuse them in the next sub-schedule.

Fig. 4. Weights \(W_i\) of the sub-schedules that correspond to individual filter positions, and how they can be reordered to increase caching potential.

5.4 Column-Wise Caching

If we cannot fit \( |W'|/c_\text {out}\) \(\mathbf {x'}\) or \(W'\) into memory, we cannot use any of the caching methods described above. However, we can construct a different schedule that allows us to cache \(\mathbf {x'}\) values. For this schedule, we need to be able to fit \(c_\text {out}\) \(\mathbf {y'}\) into memory. By taking each window sub-schedule and reordering it column-major instead of row-major (see Fig. 5), we can reuse the same \(\mathbf {x'}\) multiple times before unloading it. This ordering requires us to keep \(c_\text {out}\) \(\mathbf {y'}\) in memory. It is most beneficial when the number of input channels is much larger than the number of output channels or the filter is relatively large; both scenarios lead to a large number of \(\mathbf {x'}\) in a window. Depending on how much memory is available, we can cache multiple columns. Additionally, we can combine this with the idea from partial window caching of reordering the computation so that adjacent window sub-schedules end and start with the same \(\mathbf {x'}\) values.

Fig. 5. Transforming the base schedule into a column-wise caching schedule.

A downside of the proposed approach is that, to achieve any benefits, we require the data to be batch-packed and the layer to be convolutional. Only batch-packing allows us to reorder the computation at this granular level. Whether this approach provides any benefits with inter-axis packing strategies is beyond the scope of this work. We need a convolutional layer to exploit its repeating weight structure. It is possible that we could use similar optimizations with recurrent layers since they also have repeating weights; however, recurrent layers impose additional challenges when used with HE [42].

6 Experimental Evaluation

We evaluate our proposed solution on the layers of a convolutional neural network (CNN) trained on the CIFAR-10 [32] dataset. We first estimate the memory requirements and then compare them to the measurements we obtain by running the model on encrypted data. Table 1 shows the model’s architecture. We use two different models: one for plain data and one adapted to be HE-friendly, meaning it only contains operations that are easy to compute on encrypted data. Both models achieve very similar accuracies on the test data: 70.9% for the original model and 69.7% for the HE-friendly model. The main interest of this paper is not to propose new models or techniques that increase the accuracy of models on encrypted data but to analyze and reduce the memory consumption of these models.

Table 1. Architecture of the evaluation model with the layer parameters showing the filter size (FS), stride (S), number of filters (NF), and the activation or pooling function used on plain text (PT) and on encrypted data (HE).

We define three sets of crypto parameters: small, medium, and large. All parameters guarantee at least 128-bit security. We use OpenFHE [5] as the underlying crypto library in our implementation. The small parameters have a ring dimension of \(2^{14}\) and a multiplicative depth of 2, the medium parameters have a ring dimension of \(2^{14}\) and a multiplicative depth of 8, and the large parameters have a ring dimension of \(2^{15}\) and a multiplicative depth of 19. This results in ciphertext sizes of 0.75 MB, 2.225 MB, and 10 MB for the small, medium, and large parameters, respectively. A plaintext is always half the size of a ciphertext. We use two machines: one with 16 cores, 20 GB of memory, and 32 GB of operating system (OS) swap space, and another with 104 cores and 768 GB of memory. Both machines have 2 TB solid-state drives. We define different schedules, then estimate the required memory using the technique described in Sect. 4.3, and finally execute the schedules to obtain real measurements.

We define several schedules for which we estimate and measure the memory requirements. Schedule names are given in italics. We use the lock-free algorithm (Algorithm 1) as our baseline, once loading all values on demand (Lock-free on demand) and once preloading all values before execution (Lock-free Preload). We compare these baselines to their direct equivalent using our proposed algorithm (Algorithm 3), where we preload all values (Preload everything). Next, we investigate the behavior when we preload either all of \(X'\) (Preload \(X'\)) or all \(W'\) values (Preload \(W'\), \(x'\) on demand). Finally, we look closer at window, partial window, and column-wise caching. For (partial) window caching, we always load all of \(\textbf{x}'\) in the window and investigate the following strategies for loading \(\textbf{w}'\)s:

  • load all of \(W'\) (Load \(X'\) window \(W'\))

  • load \(\mathbf {w'}\)s on demand (Load \(X'\) window, \(w'\) on demand)

  • load half of the \(W'\) values (Load \(X'\) window, \(W'/2\))

  • load a quarter of \(W'\) (Load \(X'\) window, \(W'/4\))

For column-wise caching, we only cache one \(\mathbf {x'}\) (Column Major). For all schedules, we cache the \(y'\)s from their first appearance in the schedule to their last.

6.1 Memory Estimate

To demonstrate that our proposed solution is scalable from large servers to consumer hardware, we run the selected schedules on two different machines. A desktop PC with a 16-core AMD Ryzen CPU, 20 GB of RAM, 32 GB of swap space, and a large server with two Intel 54-core CPUs and 756 GB of RAM. Both machines have a 2 TB solid-state drive and run Ubuntu Linux 20.04 LTS. In the tables and figures throughout this paper, we refer to the server and PC by their number of threads: 104 and 16, respectively.

We use the algorithm described in Sect. 4.3 to estimate the cost of all convolutional layers for small, medium, and large parameters and 16 and 104 threads. We need to estimate the memory requirements based on the number of threads used during execution since that can influence the number of objects in memory. The estimate columns in Tables 2, 3, and 4 show the estimates for each layer and schedule for the large parameters (for the small parameters, see the appendix). We can see that, especially for the large parameters, the estimate frequently exceeds the 20 GB of the PC and often even the 52 GB of memory and swap space combined. The estimate never exceeds the 756 GB of the server. For the estimate and the following experiments, we assume the input \(X'\) is encrypted while the model \(W'\) is in plaintext.

Unsurprisingly, the schedules that preload all objects, Preload everything and Lock-free Preload, have the highest memory estimates. On the other hand, schedules that load most objects on demand and cache very little, Lock-free on demand and Column Major, have the lowest memory estimates. For the Conv 2d (1) layer, the estimates range from 380 MB to about 35 GB. Schedules that do not load all of \(X'\) are significantly below that value, estimated at 6193 MB at most. For the second layer, Conv 2d (2), both the number of \(\mathbf {x'}\) and the number of \(\mathbf {w'}\) are significantly larger. This, however, does not significantly change the estimate for the Lock-free on demand schedule, an observation that also holds for the next layer, Conv 2d (3). The estimation aligns with the insights of a theoretical analysis of the execution: as discussed earlier, during runtime this schedule has at most \(n_t\) of each \(\mathbf {x'}\), \(\mathbf {w'}\), and \(\mathbf {y'}\) in memory, where \(n_t\) is the number of threads. Therefore, the memory consumption of the schedule is only influenced by the number of threads and is independent of the layer. For the Conv 2d (2) layer, we also encounter values outside the PC’s available memory, ranging from 400 MB to 164 GB. We see a similar picture for the last convolutional layer, Conv 2d (3), with large estimates of up to 208 GB, especially for schedules that load and cache \(W'\) values.

Fig. 6. Time and memory requirements of the schedules, run with large parameters, on the 104-thread server and the 16-thread PC. The memory graphs also include the PC’s memory limit of 18000 MB.

6.2 Measurements

After obtaining the estimates, we execute the schedules on both the server and the PC. We measure the time it takes to execute the schedules and the memory the process requires. For the memory measurement, it is important to note that it does not include swap memory and only measures actual main memory usage. The PC has 20 GB of memory, about 1.5 GB of which the OS uses, leaving about 18.5 GB for the execution of the schedule. Therefore, measurements in the range of 18.5 GB on the PC will likely have used the OS’s swapping mechanism, especially if the estimated value is much larger. As mentioned in the previous section, for some schedules, the memory available is insufficient even with swapping. In these cases, the execution is terminated by the OS, yielding no result. We deliberately leave the OS swapping mechanism on to test whether our implementation is faster than simply relying on the built-in OS mechanism. We further assign each schedule a score combining time and memory requirements. To calculate the score, we compute the geometric mean of the time t and the memory m as \(\sqrt{tm}\); the lower the score, the better. However, the schedule with the lowest score is not automatically the best schedule on a given machine. The best schedule is typically the schedule that executes the fastest on the machine. It is possible for a slower schedule to achieve a lower score because it requires less memory; this, however, indicates that we could perform the computation on a machine with less memory.

Fig. 7. Comparison of the fastest schedule for each layer with 16 and 104 threads. For each layer, the figure shows the increase factor in runtime from 104 to 16 threads and the increase factor in memory from 16 to 104 threads for the fastest schedule.

Table 2. Time (s) and memory (MB) requirements for all schedules on Conv 2d (1) with large parameters on the PC with 16 threads and the server with 104 threads. Also shown are the memory estimate (MB) and the score.

Tables 2, 3, and 4 list the time and memory requirements and the score, using the large crypto parameters (for medium and small, see the Appendix). The first important observation concerns the accuracy of the estimation algorithm. We expect the memory measurements to be larger than the estimate since there is runtime overhead, like the schedule itself, key material, and other data structures, that the estimation does not take into account. However, in some cases, the estimate is off by a factor of 4–5. This is especially true for smaller values. An explanation for the discrepancy between estimate and measurements most likely lies in how we process cache instructions that drop data from memory. To ensure that we do not delete data that other threads still need, we only execute the delete instructions once all threads have passed the point for which the instructions are scheduled. During execution, we have little control over how fast threads advance. It is certainly possible for some threads to fall far behind, waiting for locks or input/output operations, thereby preventing the deletion of objects from memory. We have no way of predicting how the threads will interact at runtime and therefore need to make simplifying assumptions that can cause the differences between estimated and measured values. Overall, the estimate still provides a useful tool to understand a schedule’s memory requirements without running it.

Table 3. Time (s) and memory (MB) requirements for all schedules on Conv 2d (2) with large parameters on the PC with 16 threads and the server with 104 threads. Also shown are the memory estimate (MB) and the score. * indicates out of memory.

The most important metric is time. The schedule that executes the fastest is typically the schedule that uses the available resources most efficiently. Figure 6 shows the time and memory requirements for the large parameters and all schedules. Note that schedules that do not display a time for 16 threads (the PC) were terminated by the OS for running out of memory. For Conv 2d (2) and Conv 2d (3), we can see multiple schedules that reach the critical limit of 18000 MB of memory, after which the OS’s swapping system kicks in. On the medium parameters and Conv 2d (2) (complete figures and table in the Appendix), we observe that the Load \(X'\) window \(W'\) schedule reaches the swapping limit and takes 10675 s. The Load \(X'\) window, \(w'\) on demand schedule does not reach that limit, needing \(\sim \)2 GB. However, despite needing to encode data more often, it is faster at 3885 s. This strongly suggests that our algorithm is more efficient than relying on the OS’s swapping mechanism.

Table 4. Time (s) and memory (MB) requirements for all schedules on Conv 2d (3) with large parameters on the PC with 16 threads and the server with 104 threads. Also shown are the memory estimate (MB) and the score. * indicates out of memory.

Table 5 and Fig. 7 compare the fastest schedule for each layer and set of parameters. We are most interested in the increase in runtime and the reduction in memory when running on the 16-thread PC compared to running on the 104-thread server. For the small parameters, the fastest schedule is either the Lock-free Preload or the Preload everything schedule. Since these schedules have very similar memory requirements, there is no significant reduction in memory; the time, however, increases by a factor of 3.3–3.8. We start to see a much bigger difference when moving to the medium parameters. For the Conv 2d (1) layer, the time increases by a factor of 5.4 while the memory usage stays almost the same between PC and server. For this layer, both systems can still use the Lock-free Preload schedule, which explains the negligible reduction in memory. The time increase for the next two layers is 5.9 and 3.5, respectively; however, the memory reduction is significant, at factors of 21 and 83.3. While the server still uses the Lock-free Preload schedule, the PC is forced to use window caching and column-wise caching to fit the objects into memory. The picture repeats for the large parameters, except that now the server uses a more memory-efficient schedule for Conv 2d (3), which leads to only a 15.3-fold memory reduction and a 4.7-fold increase in runtime. An interesting observation: on the small parameters, the PC seems to have higher per-thread performance, as the time increase is only around 3.5 for all layers despite the server having 6.5 times as many threads. As the parameters get larger, the time increase seems to approach 6.5, as expected.

Table 5. Fastest schedule for each layer and parameter size (Param.) on the server (104 threads, T) and the PC (16 threads), as well as the increase (Inc.) in time and the reduction (Red.) in memory.

Additionally, we compare the time and memory of the different schedules run with the large crypto parameters on the server. For the Conv 2d (1) layer, the fastest schedule is Load \(X'\) window, \(W'/4\). It is 74 s (8%) faster than the Preload everything schedule. The Preload everything schedule, in turn, is much faster, by 627 s (64%), than the Lock-free Preload schedule. However, both preload schedules require 38 GB of memory, compared to the 5.4 GB of the Load \(X'\) window, \(W'/4\) schedule. For the second layer, Conv 2d (2), the Lock-free Preload schedule is the fastest at 3257 s. The Preload everything schedule is marginally slower at 3364 s. Both schedules require 170 GB of memory. Schedules that require significantly less memory, Load \(X'\) window, \(W'/4\) (33 GB) and Column Major (6.3 GB), are only slightly slower at 3662 s and 3843 s. For the Conv 2d (3) layer, the Lock-free Preload schedule is the slowest and consumes the most memory, at 7074 s and 215 GB. The comparable Preload everything schedule requires approximately the same amount of memory but only 12.5% of the time, 855 s. Interestingly, schedules that cache very little, Preload \(X'\) and Column Major, are faster than Preload everything at 815 s and 825 s. Both of these schedules also require significantly less memory, at 25.7 GB and 4.5 GB. This is a reduction factor of 47.8 between Lock-free Preload and Column Major.

Interestingly, schedules with minimal caching of \(w'\)s are often faster than schedules that cache these values substantially. A potential explanation could be cache locality inside the CPU: values that are not cached by our method and are loaded on demand could be accessed faster because they are placed inside the CPU cache. Alternatively, the locking required for processing the load instructions could introduce additional slowdowns that are not present when values are loaded on demand. Another interesting observation is the poor performance of the Lock-free Preload schedule on the Conv 2d (3) layer: it is eight times slower than the Preload everything schedule. Both schedules load all the required data at the start of the computation and do not need to load any values during it. Where they differ is the points at which they write the results to disk. If we assume that all threads advance in lockstep, in the Lock-free Preload schedule all threads want to write to disk at once, whereas in the Preload everything schedule the write operations are more spread out. It could be that the large number of simultaneous writes slows the schedule down significantly.

7 Conclusion

In this paper, we present ways of reordering the computation to tailor the memory requirements to the available hardware while executing as fast as possible. We further present a technique to estimate the required memory of convolutions over batch-packed, encrypted data. We show that our proposed caching mechanism is faster than relying on the OS’s swapping mechanism. The method proposed in this paper is especially suited for ML workloads with thousands of instances that can run longer, e.g., overnight or over the weekend, and do not need a fast turnaround. Since our method can reduce the memory requirements for inference, it opens up the potential to save on hardware costs.