
1 Introduction

Our perception of the environment is inherently multimodal, involving multiple sensory channels such as vision, hearing and taste. A modality refers to the way in which something happens or is experienced, and research is considered multimodal when it involves more than one of these channels. Machine learning approaches based on multimodal data have emerged as a rapidly growing field of research [12, 26, 27, 29]. Substantial research [4, 21] has shown that classifiers based on multimodal data outperform those based on unimodal data, which is consistent with the way in which humans perceive and comprehend the world.

Transmitting sensitive multimodal data from client to server typically entails privacy risks, as the server may be malicious or susceptible to attacks by third parties. Moreover, because multimodal data contains rich cross-modal information, the damage caused by its leakage is more severe than that caused by leaking unimodal data. There is therefore strong motivation to design a privacy-preserving strategy for multimodal machine learning that protects the privacy of clients who use cloud computing services. A highly relevant technology is Fully Homomorphic Encryption (FHE) [15]. FHE is an encryption system that enables data owners to encrypt their data and authorize third parties to perform computations on it. The third party can carry out these computations without ever accessing the original data, thus preserving the client's privacy. However, practical applications require specialized solutions to reduce the high computation and transmission costs associated with FHE. Previous works have provided specialized homomorphic schemes for unimodal machine learning [2, 8, 11].

In this paper, we propose, for the first time, a privacy-preserving prediction approach for multimodal representation based on the FHE scheme CKKS [10] and Tensor Fusion Network (TFN) [32]. In this method, the server accepts the encrypted multimodal representation sent by the client, performs fusion and model evaluation, and feeds back the evaluation results to the client. Only the client can decrypt the results and obtain the inference. The main contributions of our study are as follows:

  • This research is the first to propose a multimodal representation inference approach based on FHE. Specifically, we implement the multimodal feature fusion network in the ciphertext state.

  • We provide a pre-expansion method to reduce the computational complexity of the homomorphic tensor feature fusion layer.

  • We utilize the rotation operation of CKKS and the unoccupied slots in the ciphertext to pack the data of multiple modalities into the same ciphertext. This significantly decreases the amount of data transmission required. To optimize the ciphertext matrix multiplication, which has the highest latency, we also leverage multithreading to achieve a 3.5× speedup.

In the rest of the paper, we summarize the related work in Sect. 2 and present the background knowledge in Sect. 3. In Sect. 4 we present our proposed approach in detail, followed by the performance evaluation and experimental results in Sect. 5. Finally, we conclude the paper in Sect. 6.

2 Related Work

2.1 Multimodal Machine Learning

The employment of multimodal data provides human beings with a comprehensive and multifaceted understanding, thereby enabling them to make more informed decisions [4, 12]. Despite the convenience brought by cloud services, the privacy concern regarding multimodal data is increasingly urgent. To address this issue, Cai et al. [9] proposed applying differential privacy to the representation after multimodal fusion in order to protect the privacy of the original training data. However, the noise introduced by differential privacy inevitably compromises the utility of the data and the accuracy of the model inference. Moreover, this method cannot protect users' private data during the inference process.

2.2 Homomorphic Encrypted Neural Network

With the increasing application of machine learning in education, finance, and other fields that deal with sensitive customer data, there is a growing need for machine learning algorithms that make accurate predictions while protecting privacy. To address this issue, several cryptographic techniques have been proposed, such as Trusted Execution Environments [20], Secure Multi-Party Computation (SMPC) [28] and Homomorphic Encryption (HE) [15]. Each method has its own tradeoffs in terms of computation, accuracy, and security. Among them, schemes based on Fully Homomorphic Encryption (FHE) can generally achieve post-quantum security, which is the most stringent of these security models [1].

The concept of homomorphic encryption was first proposed by Rivest et al. [24], and Gentry [15] constructed the first generation of FHE schemes. Although it allows an arbitrary number of homomorphic multiplications and additions, its computational cost is still far too high for practical applications. To address this issue, several practical leveled homomorphic encryption schemes have been proposed, such as the integer-based BGV [7] and BFV [14] schemes, and the CKKS scheme [10], which operates on complex numbers. These schemes can perform homomorphic computation up to a preset maximum multiplicative depth. CKKS permits approximate HE operations over real numbers, making it a suitable choice for machine learning inference tasks with a fixed number of layers.

Dowlin et al. [16] proposed CryptoNets, which demonstrated the feasibility of using HE for private neural network inference. However, CryptoNets has two limitations. The first is time cost: although it supports high-throughput prediction, predicting a single sample still takes 205 s. The second is network width: CryptoNets encodes each node of the network into a separate ciphertext and therefore requires a large amount of memory. To address these problems, LoLa [8] encrypts an entire network layer at a time, significantly reducing memory requirements and achieving single-sample prediction in 2.2 s. Beyond image classification, homomorphic schemes have also been proposed for text classification [2] and audio similarity calculation [22]. However, these schemes only consider the homomorphic implementation of unimodal machine learning. To the best of our knowledge, there is currently no homomorphic encryption scheme available for multimodal machine learning and multimodal feature fusion.

2.3 Homomorphic Representation Inference

The aforementioned work enables homomorphic inference over shallow encrypted networks. However, there are two primary limitations when Fully Homomorphic Encryption (FHE) is applied to deep network models: the growth of noise and the growth of ciphertext size. Every ciphertext contains noise that increases with each homomorphic operation; too many operations drive the noise past the point where decryption remains correct. Secondly, without bootstrapping, supporting a deeper circuit requires larger encryption parameters, which yield larger ciphertexts, increase memory requirements, and cause greater latency. To address these issues, LoLa [8] proposed using deep representations for encrypted inference: the client converts the original data into a deep representation locally through a feature extraction network, so that prediction only requires a shallow network model, which is more suitable for a low-latency homomorphic implementation. Chou et al. [11] proposed extracting deep representations from a screenshot of a phishing website locally, encrypting them, and sending them to the cloud for homomorphic logistic regression, achieving a low-latency and secure homomorphic inference scheme.

3 Preliminaries

3.1 Fully Homomorphic Encryption (FHE)

Fully Homomorphic Encryption (FHE) is a powerful encryption method that allows ciphertext computation with minimal loss of accuracy upon decryption. This capability makes it useful for secure computing outsourcing: the client encrypts the data and sends it to a third party for computation, where the third party cannot access the plaintext data. After receiving the encrypted output, the client decrypts it to obtain the computation result. Specifically, the encryption function is denoted by Enc, and plaintext data by x and y. Then, \(Enc(x+y) = Enc(x)\oplus Enc(y)\) and \(Enc(x*y) = Enc(x)\odot Enc(y)\), where \(\oplus \) and \(\odot \) represent homomorphic addition and multiplication, respectively.

While HE allows ciphertext computation to obtain the encrypted output with practically little loss of accuracy, encryption adds noise to the plaintext data, and homomorphic operations continuously increase this noise. Once the noise exceeds a certain threshold, the correct plaintext result can no longer be decrypted. Although the original bootstrapping method [15] can refresh the noise, it is limited by the massive amount of computation required. A more practical approach is leveled homomorphic encryption [5], which allows multiple addition and multiplication operations up to a predetermined maximum multiplicative depth. Since the multiplicative depth of a neural network inference is also fixed in advance, leveled homomorphic encryption has been widely used in encrypted machine learning inference tasks. In this paper, we choose the CKKS leveled homomorphic encryption scheme, which is specifically designed for approximate computation over real numbers.

3.2 The Levelled FHE Scheme - CKKS

In the following paragraphs, we briefly introduce the CKKS scheme. Let N be a power of two, and let \(R=\mathbb{Z}[X]/(X^N+1)\) be the ring of integers of the 2N-th cyclotomic field. For some small prime integers \(p_i\), let \(q_L=\prod_{i=1}^L p_i\) and let \(R_{q_L} = \mathbb{Z}_{q_L}[X]/(X^N+1)\) consist of the polynomials of R with coefficients reduced modulo \(q_L\). Here, L is the preset maximum multiplicative depth. A message \(m\in \mathbb{C}^{N/2}\) is encoded and encrypted into \(R_{q_L}\), after which homomorphic additions and multiplications can be applied. Each multiplication consumes one level of depth and rescales a ciphertext in \(R_{q_l}\) into \(R_{q_{l-1}}\). Another essential ciphertext operation is rotation, which cyclically shifts the encrypted elements across the N/2 slots.

One advantage of CKKS is its SIMD feature: a single homomorphic addition (or multiplication) between ciphertexts yields the corresponding addition (or element-wise product) of the two underlying plaintext vectors. Suppose that m is a k-dimensional vector. When \(k<N/2\), m is padded with zeros to size N/2. Since a homomorphic operation acts on all slots simultaneously, the operation is highly inefficient when \(k\ll N/2\). The optimization methods introduced later in this paper fully utilize the free slots in the ciphertext to improve computational efficiency without incurring additional space consumption.
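To make the SIMD packing concrete, the following minimal sketch encrypts two small vectors and performs one element-wise homomorphic multiplication. It uses TenSEAL, a Python wrapper around SEAL, purely for illustration (our implementation uses the SEAL C++ library directly); the parameter values below are illustrative assumptions, not the settings used in our experiments.

```python
import tenseal as ts

# Illustrative CKKS context (parameters are assumptions, not the paper's settings).
ctx = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,          # N = 8192, so each ciphertext has N/2 = 4096 slots
    coeff_mod_bit_sizes=[60, 40, 40, 60],
)
ctx.global_scale = 2 ** 40
ctx.generate_galois_keys()             # enables slot rotations

a = ts.ckks_vector(ctx, [1.0, 2.0, 3.0])   # remaining slots are implicitly zero-padded
b = ts.ckks_vector(ctx, [4.0, 5.0, 6.0])

# One homomorphic multiplication acts on all slots at once (SIMD).
c = a * b
print(c.decrypt())   # approximately [4.0, 10.0, 18.0]
```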

3.3 Homomorphic Linear Layer

Here we introduce the implementation of the homomorphic linear layer [5]. The linear layer is one of the most important layers in a machine learning model and consists of a vector-matrix multiplication followed by the addition of a bias. Traditionally, each column of the matrix is packed separately, multiplied by the ciphertext vector, and the result summed slot by slot; this spreads the output across multiple ciphertexts rather than storing it in a single one. Halevi et al. [17] proposed a matrix multiplication realized by matrix diagonalization. Let \(n=N/2\) denote the number of slots in the ciphertext c, and let \(M\in \mathbb{R}^{n\times m}\). First, M is decomposed into n vectors in diagonal order, where the j-th element of the i-th diagonal is \(diag[i][j]=M[(i+j)\bmod n][j]\). Then \(M\cdot c=\sum^{n-1}_{i=0}diag(i)\odot Rotate(c,i)\), where \(\odot\) denotes the coefficient-wise vector multiplication and \(Rotate(c,i)=(c_i,c_{i+1},\dots,c_{n-1},c_0,\dots,c_{i-1})\) is the rotation of c by i slots to the left. In practice, to obtain the correct rotation result, the encrypted vector must be duplicated [19], i.e., \(c=(c_0,c_1,\dots,c_{n-1},c_0,c_1,\dots,c_{n-1},0,\dots,0)\). The complexity of the vector-matrix multiplication is O(n), and it dominates the computational cost of the homomorphic linear layer. Since nonlinear activation functions (such as ReLU) cannot be evaluated in the ciphertext state, the standard practice [3] is to substitute them with the square activation function. The last layer of the network model is a homomorphic dot-product layer, which first multiplies by a k-dimensional vector and then accumulates all elements into the first slot of the output ciphertext using \(\log(k)\) rotations. Note that the final activation function is computed locally by the client after decryption [2].
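The diagonal decomposition can be checked in plaintext. The sketch below simulates the Halevi-Shoup method with NumPy, where each np.roll stands for one ciphertext rotation and each element-wise product for one homomorphic multiplication; it assumes a square n×n matrix for simplicity and follows the convention \(diag[i][j]=M[(i+j)\bmod n][j]\) above, which yields the vector-matrix product c·M.

```python
import numpy as np

def diagonal_vec_mat(c, M):
    """Plaintext simulation of the diagonal method: sum_i diag(i) * Rotate(c, i).

    Each np.roll corresponds to a ciphertext rotation and each element-wise
    product to one homomorphic multiplication. Assumes M is square (n x n).
    """
    n = M.shape[0]
    result = np.zeros(n)
    for i in range(n):
        diag_i = np.array([M[(i + j) % n, j] for j in range(n)])  # i-th generalized diagonal
        result += diag_i * np.roll(c, -i)                          # rotate c left by i slots
    return result

n = 8
M = np.random.randn(n, n)
c = np.random.randn(n)
assert np.allclose(diagonal_vec_mat(c, M), c @ M)   # equals the vector-matrix product
```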

3.4 Multimodal Fusion Representation Learning

The original multimodal data contains a large amount of redundant information, and the feature vectors of each modality initially lie in different subspaces, which can impede learning in subsequent models. To address this issue, representation learning has been proposed [6]. This technique maps input data to a low-dimensional representation, enabling efficient learning. Unimodal representation learning is now widely used in Natural Language Processing (NLP) [13] and Computer Vision (CV) [18]. In our work, we employ an appropriate encoder network for each data modality and fuse the resulting low-dimensional feature vectors. The validity of the representation is ensured by the convergence of the network parameters.

Multimodal Fusion Technology (MFT) [23] fuses the feature vectors of each modality to create a more effective representation for subsequent networks. One simple fusion method [33] is direct concatenation, but it struggles to capture the interaction information and nonlinear relationships between different modalities. To address this issue, Zadeh et al. proposed the Tensor Fusion Network [32], which models each sub-modality feature as a different dimension of a Cartesian space. The fusion between modalities can then be achieved through the tensor outer product, as shown in Eq. (1):

$$\begin{aligned} z=\left[ \begin{array}{c}v_1 \\ 1 \\ \end{array}\right] \otimes \left[ \begin{array}{c}v_2 \\ 1 \\ \end{array}\right] \otimes \dots \otimes \left[ \begin{array}{c}v_n \\ 1 \\ \end{array}\right] \end{aligned}$$
(1)

where z is the output of the TFN, \(v_i\) denotes the feature of the i-th modality, and \(\otimes\) denotes the outer product operator. To ensure that the fused representation contains both the cross-modal information, from pairwise up to n-modal interactions, and the independent information of each unimodal feature, an element equal to 1 is appended to the end of each modality representation.
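As a plaintext illustration of Eq. (1), the sketch below (with arbitrary example dimensions) appends a 1 to each modality vector, takes successive outer products, and flattens the result; this flattened tensor is what the HTFN later reproduces in the ciphertext domain.

```python
import numpy as np

def tensor_fusion(modalities):
    """Plaintext TFN: outer product of [v_i ; 1] over all modalities, flattened."""
    z = np.array([1.0])
    for v in modalities:
        v1 = np.concatenate([v, [1.0]])   # appended 1 preserves unimodal and lower-order terms
        z = np.outer(z, v1).flatten()     # successive outer products, kept as a flat vector
    return z

text, video, audio = np.random.randn(2), np.random.randn(2), np.random.randn(3)
z = tensor_fusion([text, video, audio])
print(z.shape)   # (3 * 3 * 4,) = (36,)
```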

Fig. 1. HE-based Multimodal Representation Inference System

4 The Proposed Approach

4.1 System Model

This section introduces our system model and task settings. A client possesses multimodal raw data and uses the corresponding encoder for feature extraction. Subsequently, the client performs pre-expansion and transmits the ciphertext to the cloud. In the cloud, the encrypted fusion representation is obtained through the homomorphic tensor fusion network (HTFN). The homomorphic dense neural network (HDNN) performs the final evaluation, and the encryption result is delivered back to the client. The system model is illustrated in Fig. 1. Throughout the computing process, the server is limited to operating solely on the ciphertext and cannot access the sensitive data.

Consistent with existing private inference tasks, such as those described in [8, 11, 22], the complete network model is trained on plaintext data in the usual way and uses encrypted data only during the inference phase. In this task, we implicitly assume that the client has the computational capacity to preprocess the raw data and execute the encryption/decryption tasks.

4.2 Homomorphic TFN

In TFN, the fused features of the high-dimensional Cartesian product are obtained through successive outer products of the feature vectors of each modality. In the ciphertext state, however, only element-wise operations between vectors are supported, and the Cartesian product cannot be obtained directly. One possible solution, in the spirit of CryptoNets, is to encrypt each element of each feature independently and send them to the server to be multiplied one by one. However, this approach is computationally expensive, since the fusion requires computing hundreds of ciphertexts. An optimization of this method is to pack one modality (POM) into a single ciphertext and multiply it with the elements of the other features one by one. Even with this optimization, the computational complexity and the number of output ciphertexts still depend on the feature length.

To solve this issue, we propose a pre-expansion processing method. Since the fused features of the high-dimensional Cartesian product are eventually flattened as input to the linear layer, and the SIMD feature of CKKS makes element-wise multiplication between ciphertexts efficient, we expand each representation to the fused length before encryption. In the simplest case of two modalities with lengths \(L_1\) and \(L_2\), pre-expansion repeats each element of the first modality \(L_2\) times and tiles the second modality \(L_1\) times, so that both expanded vectors have length \(L_1\cdot L_2\) and are aligned element by element. As a result, only one ciphertext multiplication is needed in the HTFN.

Let us consider the widely used fusion of text, video, and audio as an example. As shown in Fig. 2, the lengths of the text, video and audio modalities are 2, 2 and 3, respectively. First, each element of the video modality is repeated twice so that it is aligned with the whole length of the text modality; correspondingly, the video and text modalities are both expanded to length 4. In the same way, each element of the audio modality is repeated 4 times, and finally all three modalities are expanded to length 12.

Let m denote the number of modalities and \(L_i\) the length of the i-th modality feature. After pre-expansion, all modalities have a length of \(\prod_{i=1}^m L_i\), and \(m-1\) ciphertext multiplications are required to compute the HTFN. As shown in Table 1, pre-expansion reduces the number of ciphertext multiplications required in HTFN, and the resulting fused representation can be stored in a single ciphertext. In Sect. 3.2, we mentioned that a CKKS ciphertext has N/2 slots, which is significantly larger than the length of each feature. Therefore, pre-expansion can exploit the vacant slots in the ciphertext without any additional space requirements.
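The following NumPy sketch reproduces the example of Fig. 2 (modalities of lengths 2, 2 and 3, with hypothetical values) and checks that the element-wise product of the pre-expanded vectors equals the flattened outer product, so that only \(m-1\) SIMD multiplications are needed in the ciphertext domain.

```python
import numpy as np

def pre_expand(modalities):
    """Expand each modality to the fused length prod(L_i).

    Modalities are ordered from innermost (fastest-varying) to outermost;
    each element is repeated by the product of the inner lengths and the
    result is tiled up to the full fused length.
    """
    total = int(np.prod([len(v) for v in modalities]))
    expanded, inner = [], 1
    for v in modalities:
        e = np.tile(np.repeat(v, inner), total // (inner * len(v)))
        expanded.append(e)
        inner *= len(v)
    return expanded

text  = np.array([1.0, 2.0])        # length 2
video = np.array([3.0, 4.0])        # length 2
audio = np.array([5.0, 6.0, 7.0])   # length 3

et, ev, ea = pre_expand([text, video, audio])   # all expanded to length 12
fused = et * ev * ea                            # m - 1 = 2 element-wise multiplications

ref = np.einsum('i,j,k->ijk', audio, video, text).flatten()   # flattened outer product
assert np.allclose(fused, ref)
```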

Table 1. Performance Comparison of HTFN Under Different Preprocessing Methods.
Fig. 2. An Illustration of Pre-expansion on 3 Modalities

Fig. 3. An Illustration of HTFN with our Packing Method based on Ciphertext Rotations

4.3 Other Optimizations

The client sends one ciphertext per modality to the cloud. As the number of modalities increases, the number of ciphertexts also increases, resulting in a linear increase in encryption time and data transmission. To solve this problem, we concatenate all the feature vectors after pre-expansion and encrypt them into a single ciphertext. In the ciphertext state, the ciphertext of each feature can be decoupled by two CKKS rotation operations. As mentioned in Sect. 3.3, the subsequent homomorphic linear layer requires a copy of the original vector; this is obtained by concatenating the features of the first modality once more.

As shown in Fig. 3, assuming that the fused features of the three modalities have n dimensions, we perform two consecutive rotation operations to obtain two additional ciphertexts, where the first n slots of each ciphertext store the corresponding modality representation. Through this packing method, the three original ciphertexts are reduced to a single transmitted ciphertext, which significantly decreases the client's encryption cost and the transmission delay. Finally, to optimize the homomorphic linear layer, which has the longest computation time, we employ a multi-threaded approach for the n rotations and homomorphic multiplications. Algorithm 1 outlines the complete process of HTFN, where \(\oplus\) and \(\odot\) represent homomorphic addition and multiplication, respectively. For simplicity, we omit the relinearization and rescaling operations that must follow each homomorphic multiplication.
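The packing and decoupling step can also be illustrated in plaintext. In the sketch below (hypothetical dimensions), the client concatenates the three pre-expanded modalities into one vector, and the server recovers each modality in the first n slots with one rotation each before taking the element-wise products; under CKKS, every np.roll corresponds to a ciphertext rotation and every * to a homomorphic multiplication.

```python
import numpy as np

n = 12                                               # fused length after pre-expansion (example)
m1, m2, m3 = (np.random.randn(n) for _ in range(3))  # pre-expanded modality representations

packed = np.concatenate([m1, m2, m3])                # the single "ciphertext" sent by the client

c2 = np.roll(packed, -n)                             # rotation 1: modality 2 moves to the first n slots
c3 = np.roll(packed, -2 * n)                         # rotation 2: modality 3 moves to the first n slots

fused = packed * c2 * c3                             # m - 1 = 2 homomorphic multiplications
assert np.allclose(fused[:n], m1 * m2 * m3)          # first n slots hold the fused representation
```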

5 Experiments

5.1 Experimental Setup

We use two commonly used multimodal datasets in our experiments. The Multimodal Corpus of Sentiment Intensity (CMU-MOSI) [33] contains 2199 movie review clips from YouTube video blogs, recorded by 89 narrators aged between 20 and 30 years. Audio, text, and video information is available for each clip, and the videos are labeled with seven sentiment categories in [\(-3\), 3], ranging from negative to positive. Unlike the original CMU-MOSI setup, we use BERT preprocessing to obtain more accurate text features [31]. CH-SIMS [30] is a Chinese multimodal sentiment analysis dataset. It consists of 2281 video clips cut from different movies and television works, collated into audio, text and video modalities, and contains 474 distinct speakers with a wide range of characters and ages. Each video clip is labeled with a sentiment intensity value in [\(-1\), 1], divided into five categories.

Algorithm 1. The Complete Process of HTFN

For the network settings, following previous work [32], we use an LSTM to extract text features and a DNN with three hidden layers for the audio and video features. Two further hidden layers then make the final prediction from the fused representation. Note that the maximum fused representation length should not exceed 2048; considering practical usability, our experiments use a smaller length of 512. The other hyperparameters and the dataset partitioning are consistent with the TFN experimental settings in MMSA [31]. After training the model, we use the SEAL library [25] to carry out the homomorphically encrypted inference task.
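For concreteness, a minimal encryption-context sketch is shown below. It uses TenSEAL (a Python wrapper over SEAL) only for illustration, with N = 8192 as reported in Sect. 5.2; the coefficient-modulus chain and scale are assumptions, since they are not listed here, and the final check simply confirms that a fused representation of length 512 fits in a single ciphertext.

```python
import tenseal as ts

# Illustrative CKKS parameters; the modulus chain and scale are assumptions.
context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,                  # N = 8192 -> N/2 = 4096 slots per ciphertext
    coeff_mod_bit_sizes=[50, 30, 30, 30, 50],  # small multiplicative depth for fusion + dense layers
)
context.global_scale = 2 ** 30
context.generate_galois_keys()                 # required for the server-side rotations

fused_length = 512                             # maximum fusion length used in our experiments
assert fused_length <= 8192 // 2               # the fused representation fits in one ciphertext
```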

5.2 Experimental Results

In Table 2, we compare the inference results of the model before and after encryption on the test set, using the F1 score and MAE as evaluation metrics, which are commonly used in multimodal machine learning. The results demonstrate that our encrypted inference method is highly effective: the model exhibits no performance loss compared to plaintext inference up to four decimal places of precision.

Table 3 illustrates the improvements of the proposed optimization methods in terms of data throughput and inference time in the cloud. We began with the idea of CryptoNets, which encrypts each element individually; although this supports batch prediction of multiple samples, a single computation takes approximately 150 s. LoLa [8] is a low-latency model derived from CryptoNets for single-sample prediction, but it does not support multimodal feature fusion. In POM, with two features of lengths 4 and 16, the elements of one feature must still be encrypted into separate ciphertexts, since TFN requires element-wise multiplications. The computation time is reduced to 4.43 s, but for larger feature lengths this time increases rapidly and requires larger data transfers. With pre-expansion, feature fusion can be completed with only two ciphertext multiplications, enabling smaller encryption parameters such as \(N=8192\) for further optimization while maintaining security. Furthermore, packing the features of the three modalities into a single ciphertext reduces the amount of data transmission to one third; the packed features are then decoupled by two rotation operations on the server. Multithreading effectively reduces the latency of the ciphertext vector-matrix multiplication in the first homomorphic linear layer after feature fusion. Overall, the proposed optimizations yield an inference time of approximately 0.91 s for a single sample with a data transfer volume of 211.88 KB.

Table 4 presents a detailed comparison of the performance of HTFN implemented under the three preprocessing methods. The number of multiplications required by our approach depends linearly on the number of modalities, whereas the other two methods show a rapid increase in the number of multiplications as the feature length increases. Our approach, combined with the two rotations and the packing method, compactly stores both input and output in a single ciphertext, thereby reducing the encryption burden on the client and the computational cost on the server. The experimental results demonstrate that our method maintains stable and efficient computing performance in practical multimodal scenarios.

Table 2. Encrypted Test Set Inference Results for CMU-MOSI and SIMS.
Table 3. Model Performance - Data Throughput and Timing.
Table 4. Performance Comparisons of HTFN.

6 Conclusion

In this paper, we propose the first multimodal representation inference method based on Fully Homomorphic Encryption (FHE). Our method provides complete protection of multimodal data privacy and enables private prediction tasks on the cloud server. We propose a pre-expansion method and a homomorphic TFN scheme using only two ciphertext multiplications. We also propose a variety of optimization schemes that not only improve the computational efficiency of ciphertext but also reduce the amount of data transmission. The experimental results show that our method has the advantages of low computation and communication costs. In the future, we will continue to improve the performance of the model and extend it to multi-user application scenarios using multi-key FHE.