1 Introduction

Federated learning (FL) has emerged as a promising paradigm for training machine learning models on decentralized data [1, 23, 35,36,37,38]. In many realistic scenarios, multi-modal data are collected across distributed data silos and stored in a privacy-sensitive manner, such as the examination and diagnostic records of patients in different hospitals and the multimedia data generated on mobile devices. However, most existing federated learning works focus on single-modality scenarios (e.g., image or text) and have limited capacity for data with heterogeneous formats and properties. Given the rapid development of multimedia technology and distributed systems, developing a robust and efficient FL framework for multi-modal machine learning tasks is of great significance.

To date, several early attempts at multi-modal federated learning (MFL) [2] have been proposed [3, 6, 18, 42, 44, 45, 46, 48]. Some of these approaches [3, 44, 45] consider scenarios where the federated system contains both uni-modal and multi-modal clients. However, most of these works assume that all modalities are available to all clients, a strong assumption that may not hold in real-world situations. For example, content posted on social media often combines images and text, but users may also publish posts containing only images or only text. This modality missing problem poses a substantial challenge, as it can severely impact the model's learning ability and performance.

In this paper, we aim to address this general and realistic problem of modality missing, where clients share the same modality combinations, but some multi-modal instances lack part of the modality data. For example, a client may hold 1000 image-text pairs, of which 200 have only image data and 300 have only text data. A few existing works [20, 22] focus on the modality incompleteness problem. However, they either consider only text missing in vision-language learning tasks or deal with sensor signals that are similar in format. We believe that an advanced MFL framework should be robust to modality-incomplete training data and maintain satisfactory performance.

To resolve these challenges, we propose a multi-modal federated learning framework, namely Federated Multi-modal contrastiVe training with Pre-trained completion (FedMVP), which uses frozen pre-trained models as teachers to support a learnable multi-modal joint encoder module for efficient multi-modal representation learning and to generate informative synthetic data. To enhance the model's resilience to the performance degradation caused by modality missing, we utilize the cross-modal generation ability of recently proposed pre-trained models [14, 15, 27] to complete the missing modalities. To further improve representation learning, we propose an efficient knowledge-transferring method that transfers representation knowledge from large pre-trained models to our multi-modal joint learning module. This method alleviates the conflict between the massive data and computation required for training and fine-tuning large pre-trained models and the limited resources of federated learning clients. The proposed framework can integrate various pre-trained models at an affordable communication cost. As shown in Table 1, compared to the most costly baseline FedViLT, FedMVP reduces the communication cost by \(\boldsymbol{26.7\times }\) and the computation FLOPs by \(\boldsymbol{15.5\times }\). The pre-trained foundation models serve as frozen data encoders that transform the original data into high-quality representations, which play an important role in the contrastive training process of the multi-modal joint encoder module.

Table 1. Comparison between FedMVP and baselines in terms of #FLOPs (floating-point operations) and #transmitted parameters per round.

We summarize our contributions as follows: (1) We propose a novel MFL framework that integrates large-scale pre-trained models to conduct efficient multi-modal representation learning and is robust to the modality missing challenge. Our method shows superior performance on two multi-modal classification benchmarks under both IID and non-IID settings. (2) To efficiently transfer representation knowledge from the pre-trained model to the multi-modal joint module under resource-limited scenarios, we propose a Multi-modal Contrastive Matching (MCM) loss and a Representation Aligned Margin (RAM) loss, which effectively improve model performance even with severe modality missing of up to \(80\%\). (3) Instead of aggregating models based on data distribution or model architecture, we propose a novel server aggregation algorithm for MFL based on the representation abilities of the client models.

2 Related Work

Multi-modal Federated Learning (MFL). MFL is still in its early stages of development. Most existing works [18, 42, 48] focus on exploring task-specific approaches with complete modalities. In [42], the authors propose a multi-modal federated learning framework for multi-modal activity recognition with a local co-attention module to fuse multi-modal features. [5] gives a detailed analysis of the convergence problem of MFL with late-fusion methods under the non-IID setting. [3, 44, 45] adopt modality-wise encoders to handle MFL systems with both uni-modal and multi-modal clients. However, few of these works explore the scenario where multi-modal data are incomplete, which may cause significant performance degradation.

Modality Missing in Multi-modal Learning. As a challenge that widely exists in realistic scenarios, handling modality missing has drawn the attention of the multi-modal learning community. Some early works [25, 30, 39] build their methods on conditional VAEs to capture the multi-modal distribution for cross-modal generation. [33], as one of the recent works, utilizes cross-modal fusion to improve model robustness to modality missing at test time. [29] proposes a contrastive framework for learning from both paired and unpaired data. In [22], the authors leverage Bayesian meta-learning to reconstruct pseudo text input from image input to resolve the missing modality issue. Instead of training a generative model from scratch, we utilize large-scale pre-trained models [14, 15] and prompt augmentation to achieve effective cross-modal generation for completing the missing data pairs.

Vision-language Pre-training. Represented by CLIP [27] and ALIGN [12], large-scale Vision and Language Pre-training (VLP) models have demonstrated surprising performance on many downstream vision-language tasks [10] and strong adaptability to new scenarios. A few works have taken the first steps towards incorporating federated learning with pre-training techniques. In [32], the authors propose a split-learning-based framework for training large-scale models like BERT in federated learning systems. PromptFL [9] allows clients to collaboratively train shared soft prompts using CLIP [27], providing strong adaptation capability to distributed users' tasks. [4, 19, 40, 41] explore efficient methods for lightweight and fast adaptation of pre-trained models. [31] proposes FedPCL to transfer shared knowledge among clients based on prototype contrastive learning. In this work, instead of fine-tuning large-scale pre-trained models or splitting the model into multiple modules, we conduct effective knowledge transfer to enhance the representation learning performance of a lightweight local module.

Multi-modal Contrastive Learning. Contrastive learning is widely used in the self-supervised learning field, where samples are grouped into positive and negative pairs and their learned representations are pulled together or pushed apart accordingly. In its application to multi-modal learning [16, 17, 47], instead of applying spatial or temporal transformations to a single instance, positive pairs are defined as samples sharing the same ID or time window. In [47], the authors propose CrossCLR to improve the quality of joint embeddings learned from multi-modal data with a novel contrastive loss that utilizes both inter-modality and intra-modality alignment. [26] extends multi-modal contrastive learning to efficiently align cross-modal representations. Inspired by these predecessors, we adopt a multi-modal contrastive loss to improve the quality of the learned multi-modal joint representations, based on the modality-specific representations encoded by the frozen pre-trained models.

3 Methodology

To exploit multi-modal data in federated systems, we propose FedMVP, an MFL framework that is robust to modality missing during training. As illustrated in Fig. 1, FedMVP contains four main modules: the Modality Completion Module, the Multi-modal Joint Learning Module, Knowledge Transferring via Contrastive Training, and CKA-based Aggregation.

3.1 Problem Formulation

Multi-modal Federated Learning. In an MFL system, there exist N clients aiming to collaboratively train a global model \(w_G\) for multi-modal representation learning. For client n, the local dataset \(\mathcal {D}_n = \{ (X_i,y_i)\}^{|\mathcal {D}_n|}_{i=1}\) contains \(|\mathcal {D}_n|\) image-text pairs denoted as \(X_i = \{x^I_i, x^T_i\}\), i.e., the i-th image data \(x^I_i\) and text data \(x^T_i\), with \(y_i\) the corresponding label. A data instance is denoted as \(X_i = \{x^I_i\}\) or \(X_i = \{x^T_i\}\) if modality missing happens. Each local model \(w_n\) performs the local task \(F_n(\cdot ;w_n) : \mathbb {R}^{n} \rightarrow \mathbb {R}^{d} \) and collaborates with the other clients on the global task \(F_G(\cdot ;w_G) : \mathbb {R}^{d_G} \rightarrow \mathbb {R}^{d} \). Formally, the global objective of MFL for the image-text classification problem is defined as

$$\begin{aligned} \min ~L_{G}(F_G(\cdot ;w_G)) = \min ~\sum ^{N}_{n=1} \gamma _n L_n(F_n(\mathcal {D}_n; w_n)) \end{aligned}$$
(1)

where \(\gamma _n\) is the aggregation weight of client n and \(L_n\) is its local loss function.
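To make the setup concrete, the following minimal Python sketch shows one way to represent a local instance with a possibly missing modality and the weighted global objective of Eq. (1); the class and function names are illustrative assumptions, not part of the FedMVP implementation.

```python
from dataclasses import dataclass
from typing import Any, List, Optional

@dataclass
class Instance:
    """One local instance X_i; a missing modality is stored as None."""
    image: Optional[Any] = None   # x^I_i (e.g., a tensor or a file path)
    text: Optional[str] = None    # x^T_i
    label: int = 0                # y_i

def global_objective(local_losses: List[float], gammas: List[float]) -> float:
    """Weighted sum of client losses as in Eq. (1): L_G = sum_n gamma_n * L_n."""
    return sum(g * l for g, l in zip(gammas, local_losses))
```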

Fig. 1. The overview of the proposed FedMVP framework.

3.2 Local Data Preprocessing

A pre-trained foundation model is deployed on both the server and client sides; it consists of an image encoder \(f^I_{E}(\cdot )\) and a text encoder \(f^T_{E}(\cdot )\) for representation extraction, and an image decoder \(f^I_{D}(\cdot )\) and a text decoder \(f^T_{D}(\cdot )\) for cross-modal generation. Notably, all parameters of the pre-trained models are frozen and are never transmitted between the server and clients. Below, we explain the pre-trained models we use, as well as the details of the local training process.

Modality Completion Module. To mitigate the performance drop caused by modality missing, the modality completion module utilizes the cross-modal generation ability of the pre-trained models to complete the missing part of the multi-modal data. We use DALL-E 2 [28] for text-to-image generation and BLIP-2 [14] for image-to-text generation. Inspired by [27], we use designed prompts to improve the generation quality of the modality completion module.

Prompt Augmented Text-to-Image Generation. Given an image-text pair \(X_i\) with only text data \(x^T_i\), the modality completion module generates an image from a text prompt. To avoid the semantic ambiguities caused by synonyms and polysemy in the text data and label names, instead of directly using the text data \(x^T_i\) as input, we adopt a coarse-to-fine prompt to augment the generation. The prompt template is "A photo of {fine-grained label}, a kind of {class label}, {text description}", which helps the pre-trained model better understand the characteristics of the generation target and improves the semantic correlation between the text prompt and the generated image. Figures 2 and 3 show examples with different inputs for generating the classes "snapdragon" and "yellow throat" from the Oxford Flower and CUB-200 datasets, where our designed prompt yields high-quality synthetic images that are close to the original ones.

Fig. 2. Examples of the generated "snapdragon" images.

Fig. 3. Examples of the generated "yellowthroat" images.

Accordingly, we obtain the image generation prompt from the original text data; this process is denoted as \(p^T(x^T_i)\). The augmented prompt \(p^T(x^T_i)\) is first encoded by the text encoder \(f^T_{E}(\cdot )\) and then passed to the image decoder \(f^I_{D}(\cdot )\) to generate the synthetic image \(\hat{x}^I_i\), i.e., \(\hat{x}^I_i = f^I_{D}(f^T_{E}(p^T(x^T_i)))\).
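The following sketch illustrates how the coarse-to-fine prompt template could be assembled; the function name and example values are hypothetical and only mirror the template described above.

```python
def build_t2i_prompt(fine_grained_label: str, class_label: str, text_description: str) -> str:
    """Coarse-to-fine prompt template used to augment text-to-image generation."""
    return (f"A photo of {fine_grained_label}, a kind of {class_label}, "
            f"{text_description}")

# Example (values are illustrative):
prompt = build_t2i_prompt(
    fine_grained_label="snapdragon",
    class_label="flower",
    text_description="this flower has long tubular pink petals with a yellow lip",
)
# The prompt is then encoded by the frozen text encoder and decoded into a
# synthetic image: x_hat_I = f_I_D(f_T_E(prompt)).
```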

Prompt Augmented Image-to-text Generation. For image-to-text generation, considering that the original text data contains detailed descriptions of the paired image, a direct image captioning result may not cover the fine-grained textual details. Therefore, we adopt both the visual question answering (VQA) and image captioning functions of the pre-trained model to generate the text pair \(\hat{x}^T_i\) for an image input. Specifically, given an image input \(x^I_i\), the modality completion module first performs VQA over three serial question prompts to obtain fine-grained descriptions of the image. For instance, given the prompt "What is the color of the petals?" for a flower image with pink petals, the response could be "Pink". After obtaining the answers to the three question prompts, we combine them with the image captioning output as the final synthetic text, e.g., "A photo of {flower}, with {pink} petals and {white} pistils, {there is a pink flower with a yellow center in the middle of the picture}". We show examples of image-to-text generation in Table 2.
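A minimal sketch of how the VQA answers and the caption might be assembled into the final synthetic text, following the template in the example above; the helper name, attribute names, and example values are assumptions for illustration.

```python
from typing import List, Tuple

def build_i2t_text(class_label: str, qa_pairs: List[Tuple[str, str]], caption: str) -> str:
    """Assemble the synthetic text from VQA answers plus the image caption,
    following the template 'A photo of {class}, with {...}, {caption}'."""
    # qa_pairs: (attribute, VQA answer), e.g. ("petals", "pink").
    attributes = " and ".join(f"{answer} {attr}" for attr, answer in qa_pairs)
    return f"A photo of {class_label}, with {attributes}, {caption}"

# Example (illustrative values):
text = build_i2t_text(
    class_label="flower",
    qa_pairs=[("petals", "pink"), ("pistils", "white")],
    caption="there is a pink flower with a yellow center in the middle of the picture",
)
```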

For clarity of the model description and to avoid notation confusion, we use the completed image-text pair \(X_i = \{x^I_i, x^T_i\}\) in the following sections to illustrate how data is processed in FedMVP.

Table 2. Image-to-text completion examples from CUB-200 and Oxford Flower.

Modality-specific Representations. Foundation models are believed to have extraordinary representation extraction ability since they are trained on millions of data instances. Thus, we obtain the image-specific and text-specific embeddings via the pre-trained encoders. Specifically, we use a pre-trained Vision Transformer (ViT) [8] with a patch size of \(16 \times 16\) as the image-specific encoder to generate high-quality embeddings from the image input. The image-specific embedding \(\textbf{X}^I\) is encoded by the pre-trained image encoder \(f^I_{E}(\cdot )\) and then mapped to the multi-modal latent space via a shared projection head \(f_{shared}(\cdot )\), i.e., \(\textbf{X}^I = f_{shared}(f^I_{E}(\textbf{x}^I)) \in \mathbb {R}^{d_{latent}}\). Similarly, we obtain the text-specific embedding \(\textbf{X}^T\) from the pre-trained BERT model [7], where \(\textbf{X}^T = f_{shared}(f^T_{E}(\textbf{x}^T)) \in \mathbb {R}^{d_{latent}}\).
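The sketch below illustrates this encoding path with a shared projection head mapping into \(\mathbb {R}^{d_{latent}}\); the dimensions and the random tensors standing in for the frozen ViT/BERT outputs are placeholders, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

d_enc, d_latent = 768, 256   # assumed encoder output / latent dimensions

# Shared projection head f_shared mapping both encoders into R^{d_latent}
f_shared = nn.Linear(d_enc, d_latent)

# Stand-ins for the frozen pre-trained encoder outputs f^I_E(x^I) and f^T_E(x^T);
# in FedMVP these come from a frozen ViT-B/16 and BERT, computed without gradients.
with torch.no_grad():
    h_image = torch.randn(8, d_enc)
    h_text = torch.randn(8, d_enc)

X_I = f_shared(h_image)   # image-specific embedding X^I in R^{d_latent}
X_T = f_shared(h_text)    # text-specific embedding X^T in R^{d_latent}
```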

3.3 Local Training

Multi-modal Joint Learning Module. The multi-modal joint learning module contains a joint encoder \(f^{Joint}_{E}(\cdot )\) designed to efficiently fuse image-text information into a complete view. It consists of a cross-modal fusion layer followed by attention-based embedding layers.

Pre-processing. Given an image-text pair \(\{{\textbf{x}^I, \textbf{x}^T}\}\) as input, we use a non-overlapping patch embedding layer and the pre-trained text encoder \(f^T_{E}(\cdot )\) to obtain the patch sequence \(\textbf{I}_{com}\) and the text embedding \(\textbf{T}_{com}\), both of which lie in the common dimension \(d_{com}\).

Cross-Modal Fusion. After positional embedding, both the image and text embeddings are fed into the cross-modal fusion layer, which contains a vision-to-language attention module and a language-to-vision attention module. Both modules are based on cross-modal attention [33], which effectively fuses the representations of the two input modality embeddings. We take the image-to-text embedding \(\textbf{X}^{I \rightarrow T}\) as an example of the cross-modal attention:

$$\begin{aligned} \textbf{X}^{I \rightarrow T} = CM_{I \rightarrow T}(\textbf{I}_{com}, \textbf{T}_{com}) = softmax\left(\frac{W_{Q_I} \textbf{I}_{com} W^\textsf{T}_{K_T} \textbf{T}^\textsf{T}_{com}}{\sqrt{d_{com}}}\right) W_{V_T} \textbf{T}_{com}. \end{aligned}$$
(2)

Similarly, we obtain the text-to-image embedding \(\textbf{X}^{T \rightarrow I}\). The two embeddings \(\textbf{X}^{I \rightarrow T}\) and \(\textbf{X}^{T \rightarrow I}\) are concatenated and projected into the latent space as the final joint embedding via a self-attention layer and the shared projection head \(f_{shared}(\cdot )\), as follows:

$$\begin{aligned} \textbf{X}^{joint} =f_{shared}( SelfAttention( \textbf{X}^{I \rightarrow T} \oplus \textbf{X}^{T \rightarrow I} )). \end{aligned}$$
(3)

We now obtain the image-specific embedding \(\textbf{X}^I\), text-specific embedding \(\textbf{X}^T\), and joint embedding \(\textbf{X}^{joint}\) in the same latent space \(\mathbb {R}^{d_{latent}}\).
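A minimal single-head PyTorch sketch of the cross-modal attention in Eq. (2), with queries from the image patches and keys/values from the text tokens; the actual FedMVP layer may use multiple heads, residual connections, and normalization, so this is only an illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    """Single-head cross-modal attention CM_{I->T}: queries come from the image
    patch sequence, keys/values from the text embedding."""
    def __init__(self, d_com: int):
        super().__init__()
        self.W_Q = nn.Linear(d_com, d_com, bias=False)
        self.W_K = nn.Linear(d_com, d_com, bias=False)
        self.W_V = nn.Linear(d_com, d_com, bias=False)
        self.scale = d_com ** -0.5

    def forward(self, I_com: torch.Tensor, T_com: torch.Tensor) -> torch.Tensor:
        # I_com: (B, N_I, d_com) image patches; T_com: (B, N_T, d_com) text tokens
        Q, K, V = self.W_Q(I_com), self.W_K(T_com), self.W_V(T_com)
        attn = F.softmax(Q @ K.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ V  # (B, N_I, d_com) image-to-text embedding X^{I->T}
```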

Knowledge Transferring from Pre-trained Model. The training data used by large-scale models in the pre-training stage is neither available to nor affordable for distributed silos to process, which makes fine-tuning and traditional knowledge distillation [11] of large-scale models impractical in the MFL scenario. To transfer the rich representation knowledge from the pre-trained model, we propose the Multi-modal Contrastive Matching (MCM) loss and the Representation Aligned Margin (RAM) loss to improve the representation learning performance of the joint encoding module.

Multi-modal Contrastive Matching Loss. To obtain a high-quality joint representation, we use contrastive learning to pull the joint embedding closer to its corresponding modality-specific embeddings and push it away from the embeddings of other instances in the latent space. Let \(s_c(x_i,x_j)\) denote the cosine similarity between two embeddings \(x_i\) and \(x_j\), and let \(\tau \in (0,1]\) be the temperature hyperparameter. The scaled similarity is defined as \( sim(x_i,x_j) = \exp (\frac{s_c(x_i, x_j)}{\tau }). \)

Given a batch of embeddings \(\mathcal {B} = \{\textbf{X}^T_i, \textbf{X}^I_i, \textbf{X}^{joint}_i\}^{|\mathcal {B}|}_{i=1}\), a positive pair for contrastive learning is defined as the joint embedding with its corresponding modality-specific embedding, i.e., \((\textbf{X}^T_i, \textbf{X}^{joint}_i)\) and \((\textbf{X}^I_i, \textbf{X}^{joint}_i)\). All other pairings are treated as negative pairs, whose aggregate similarity is denoted as:

$$\begin{aligned} \Omega ^M_i = \sum \limits _{j \ne i} \left( sim(\textbf{X}^M_i, \textbf{X}^M_j) + sim(\textbf{X}^M_i, \textbf{X}^{joint}_j) + sim(\textbf{X}^{joint}_i, \textbf{X}^{joint}_j) \right), \end{aligned}$$
(4)

where \(M \in \{I, T\}\) indicates the modality type. The multi-modal contrastive matching (MCM) loss over all data embeddings is defined as follows:

$$\begin{aligned} \begin{aligned} L_{MCM}(\mathcal {B}) = -\frac{1}{|\mathcal {B}|}\sum ^{|\mathcal {B}|}_{i=1} \log \left( \frac{sim(\textbf{X}^T_i, \textbf{X}^{joint}_i)}{\Omega ^T_i} + \frac{sim(\textbf{X}^I_i, \textbf{X}^{joint}_i)}{\Omega ^I_i}\right) . \end{aligned} \end{aligned}$$
(5)
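The MCM loss can be transcribed directly from Eqs. (4)-(5); the PyTorch sketch below assumes all three embeddings share the shape (B, d_latent) and is not the authors' released code.

```python
import torch
import torch.nn.functional as F

def mcm_loss(X_T: torch.Tensor, X_I: torch.Tensor, X_joint: torch.Tensor,
             tau: float = 0.1) -> torch.Tensor:
    """Multi-modal Contrastive Matching loss, a batch-wise sketch of Eqs. (4)-(5)."""
    def sim(a, b):
        # exp(cosine similarity / tau) between all pairs -> (B, B)
        return torch.exp(F.cosine_similarity(a.unsqueeze(1), b.unsqueeze(0), dim=-1) / tau)

    B = X_T.size(0)
    off_diag = ~torch.eye(B, dtype=torch.bool, device=X_T.device)

    pos_terms = []
    for X_M in (X_T, X_I):
        # Negative pool Omega^M_i: intra-modal, cross-pair, and joint-joint terms over j != i
        omega = ((sim(X_M, X_M) + sim(X_M, X_joint) + sim(X_joint, X_joint))
                 * off_diag).sum(dim=1)
        pos = sim(X_M, X_joint).diagonal()      # positive pair (X^M_i, X^joint_i)
        pos_terms.append(pos / omega)
    return -(torch.log(pos_terms[0] + pos_terms[1])).mean()
```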

Representation Aligned Margin Loss. We propose the Representation Aligned Margin (RAM) loss to further enrich the joint representation with pre-trained knowledge and to close the semantic gap between the joint embedding and the modality-specific embeddings. We use the classification loss derived from an embedding to evaluate its representation quality. For the i-th data sample, the supervised classification loss of one of its embeddings is denoted as \( L^M_{sup}(i) = CE(f_c(\textbf{X}^M_{i}), y_i)\).

Intuitively, embeddings with lower cross-entropy loss contain more informative features from the raw data. Given an embedding batch \(\mathcal {B}\), the RAM loss aligns the joint embedding with the image and text embeddings separately whenever the modality-specific embedding provides a better representation. Thus, the RAM loss is defined as:

$$\begin{aligned} {L}_{RAM}(\mathcal {B}) = \frac{1}{|\mathcal {B}|}\sum ^{|\mathcal {B}|}_{i=1} \left( {L}^I_{RAM}(i) + {L}^T_{RAM}(i)\right) , \end{aligned}$$
(6)
$$\begin{aligned} {L}^M_{RAM}(i) = {\left\{ \begin{array}{ll} \Vert \textbf{X}^{joint}_i - \textbf{X}^M_i \Vert _2, & \text {if } L^M_{sup}(i) < L^{joint}_{sup}(i) \\ 0, & \text {otherwise} \end{array}\right. }, \end{aligned}$$
(7)

where \(\textbf{X}^M_i\) and \(\textbf{X}^{joint}_i\) are all derived from the i-th sample in the batch, and \(|\mathcal {B}|\) is the batch size. The L2 norm is denoted by \(\Vert \cdot \Vert _2\).
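A per-modality sketch of the RAM term in Eq. (7), assuming per-sample logits are available for both the joint and modality-specific embeddings; Eq. (6) then sums the image and text terms over the batch.

```python
import torch
import torch.nn.functional as F

def ram_loss(X_joint: torch.Tensor, X_M: torch.Tensor,
             logits_joint: torch.Tensor, logits_M: torch.Tensor,
             labels: torch.Tensor) -> torch.Tensor:
    """RAM term for one modality M: pull the joint embedding toward the
    modality-specific one only when the latter gives a lower per-sample loss."""
    ce_joint = F.cross_entropy(logits_joint, labels, reduction="none")  # L^joint_sup(i)
    ce_M = F.cross_entropy(logits_M, labels, reduction="none")          # L^M_sup(i)
    distance = torch.norm(X_joint - X_M, p=2, dim=-1)                   # ||X^joint_i - X^M_i||_2
    mask = (ce_M < ce_joint).float()    # align only if modality M is the better teacher
    return (mask * distance).mean()     # batch average; both modalities are summed in Eq. (6)
```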

Classification Loss. A two-layer linear classifier \(f_C(\cdot )\) serves as the classifier, using only the joint embedding as input. The supervised classification loss \(L_{sup}\) of client n is:

$$\begin{aligned} L_{sup}(\mathcal {B}) = \frac{1}{|\mathcal {B}|} \sum _{i=1}^{|\mathcal {B}|} CE \left( f_C\left( \textbf{X}^{joint}_{i}; \boldsymbol{\omega }_n\right) , y_{i}\right) , \end{aligned}$$
(8)

where \(f_C(\cdot )\) denotes the classifier model of client n, \(CE(\cdot )\) is the cross-entropy loss function, and \(y_{i}\) is the corresponding label of i-th joint embedding \(\textbf{X}^{joint}_{i}\).

Total Loss. The final local training loss of client k in FedMVP is:

$$\begin{aligned} L_{local}(\mathcal {D}_k) = L_{sup}(\mathcal {D}_k) + L_{MCM}(\mathcal {D}_k) + L_{RAM}(\mathcal {D}_k). \end{aligned}$$
(9)

At each communication round, each client uploads the parameters of its multi-modal joint learning module and classifier to the server for global aggregation.
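As a rough sketch, the classifier and the total local objective of Eq. (9) could look as follows; the hidden size is an assumption, and the paper does not specify whether a non-linearity sits between the two layers, so the ReLU here is also an assumption.

```python
import torch.nn as nn

class JointClassifier(nn.Module):
    """Two-layer classifier f_C over the joint embedding (sizes are assumed)."""
    def __init__(self, d_latent: int, num_classes: int, d_hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_latent, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, num_classes),
        )

    def forward(self, x):
        return self.net(x)

def local_loss(l_sup, l_mcm, l_ram):
    """Total local objective of Eq. (9): unweighted sum of the three terms."""
    return l_sup + l_mcm + l_ram
```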

Fig. 4. CKA-based server aggregation.

3.4 Server Aggregation

Previous works tend to aggregate based on the modality types held by the clients [3, 43], a shared public dataset [44], or the model structure [45], which may lead to data privacy leakage and lack uniformity. To better enhance the representational ability of the global model, we propose a server aggregation method based on the similarity of model output representations.

At the beginning of the aggregation phase, the server-side pre-trained model automatically generates m synthetic data pairs \(X_1, \ldots , X_m\), where m equals the number of classes in the dataset. Given an uploaded client model, its output representations on the generated data are defined as:

$$\begin{aligned} \textbf{X}_{\omega } = [F_{\omega }(X_1), \ldots , F_{\omega }(X_m)]^{T} \in \mathbb {R}^{m \times d_{out}}. \end{aligned}$$
(10)

To measure the similarity of model representations among clients, we utilize the centered kernel alignment (CKA) metric [13] on the output representations of the uploaded models, defined as follows:

$$\begin{aligned} s_{ij}(\omega _i, \omega _{j}) = \frac{Cov(\textbf{X}_{\omega _i}, \textbf{X}_{\omega _j})}{\sqrt{Cov(\textbf{X}_{\omega _i}, \textbf{X}_{\omega _i}) Cov(\textbf{X}_{\omega _j}, \textbf{X}_{\omega _j})}}, \end{aligned}$$
(11)

where \(Cov(X,Y) = \frac{1}{(m-1)^2} tr(XX^{T}H_{m}YY^{T}H_{m})\), \(H_m\) is the centering matrix, \(tr(\cdot )\) denotes the matrix trace, and m is the number of input representations.

With the calculated representation similarity scores, the server constructs a representation similarity graph to capture the relationships among clients, as shown in Fig. 4. The graph importance of each client is determined by the sum of its similarity scores with all other clients:

$$\begin{aligned} \gamma ^t_i = softmax([s_1, \ldots ,s_i,\ldots , s_K])_i, \end{aligned}$$
(12)

where K is the number of clients participating in the t-th aggregation round and \(s_i = \sum _{j \ne i} s_{ij}\) is the graph importance of client i. Finally, the global model is aggregated as the weighted sum of the client models based on their graph importance \(\gamma ^t_i\):

$$\begin{aligned} w^t_G = \sum ^K_{i=1}\gamma ^t_i w^t_i. \end{aligned}$$
(13)
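The aggregation step amounts to linear CKA between client representation matrices followed by a softmax over the summed scores (Eqs. 11-13). The sketch below uses the standard linear-CKA estimator (the normalizing constants cancel in the ratio) and is only an illustration of the procedure, not the authors' implementation.

```python
import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """Linear CKA between two (m, d_out) representation matrices, Eq. (11)."""
    m = X.shape[0]
    H = torch.eye(m) - torch.ones(m, m) / m                 # centering matrix H_m
    def cov(A, B):
        return torch.trace(A @ A.T @ H @ B @ B.T @ H) / (m - 1) ** 2
    return cov(X, Y) / torch.sqrt(cov(X, X) * cov(Y, Y))

def aggregation_weights(client_reps):
    """Graph importance per client: softmax over summed pairwise CKA scores, Eq. (12)."""
    K = len(client_reps)
    s = torch.tensor([sum(linear_cka(client_reps[i], client_reps[j]).item()
                          for j in range(K) if j != i) for i in range(K)])
    return torch.softmax(s, dim=0)

# The global model is then the importance-weighted sum of client parameters (Eq. 13):
#   w_G = sum_i gamma_i * w_i
```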

4 Experiments

4.1 Experiment Setting

Datasets. We evaluate the proposed FedMVP on two multi-modal fine-grained categorization datasets: the Caltech-UCSD Birds-200-2011 (CUB-200) dataset [34] and Oxford Flower [24]. Both contain paired image-text data, and each image has 10 related descriptive texts. CUB-200 has 200 bird classes with 10,610 training image-text instances and 1,178 for testing. Oxford Flower has 102 flower classes, a training set of 7,370 instances, and a test set of 819.

Table 3. Evaluating the impact of incomplete modality on CUB-200 and Oxford Flower datasets under IID setting. \(\beta \) indicates the missing ratio of the training set.

Data Distribution Setting. For the independent and identically distributed (IID) setting, we randomly and equally distribute the training data to 10 clients. Each client holds the same quantity of local data with a balanced category distribution. To simulate the non-IID scenario in federated systems, we divide the training set into C shards according to the dataset categories, i.e., 200 shards for the CUB-200 dataset and 102 shards for the Oxford Flower dataset. With 10 fixed clients, the data shards are randomly and equally distributed to clients.

Modality Missing Setting. We set \(\beta \in [0,1]\) as the missing ratio. For example, given \(\beta = 0.3\), \(30\%\) of randomly selected image-text pairs lose either their image or their text data with equal probability. We conduct our experiments with \(\beta = 0.3, 0.5, 0.8\), keeping the numbers of missing images and missing texts equal. A sketch of how this setting could be simulated is shown below.
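For illustration, the missing-ratio protocol could be simulated as follows, assuming each training instance exposes mutable image/text fields; this is a sketch of the setting, not the authors' preprocessing code.

```python
import random

def apply_modality_missing(dataset, beta: float, seed: int = 0):
    """Drop modalities from a beta fraction of instances; half of the selected
    instances lose the image and half lose the text, mirroring the setting above."""
    rng = random.Random(seed)
    n_missing = int(beta * len(dataset))
    selected = rng.sample(range(len(dataset)), n_missing)
    for k, idx in enumerate(selected):
        if k < n_missing // 2:
            dataset[idx].image = None   # image-missing instance
        else:
            dataset[idx].text = None    # text-missing instance
    return dataset
```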

Training Setting. With 10 fixed clients, the total number of communication rounds is 200. In each communication round, every client performs 10 local training epochs on its own local dataset, and the server randomly selects \(70\%\) of the clients for aggregation. We use AdamW as the optimizer with a scheduler-controlled learning rate of \(2e-5\), adopting a warm-up scheduler and a cosine annealing scheduler during training.

Baselines. Since existing approaches for addressing modality missing in multi-modal federated learning are relatively limited, we choose FedViT and FedBERT as uni-modal baselines and FedCLIP, FedViLT, and MMFed as multi-modal baselines. FedViT [8], FedBERT [7], FedCLIP [27], and FedViLT [21] use large-scale foundation models pre-trained on millions of data instances as the local models; these large models are fine-tuned on the local data and upload all their parameters to the server for aggregation. MMFed [42] is a federated multi-modal learning method that does not leverage foundation models. FedViLT [21] is designed specifically for modality missing. Please refer to the Appendix for implementation details.

Table 4. Evaluating the impact of incomplete modality on CUB-200 and Oxford Flower datasets under the non-IID setting. \(\beta \) indicates the missing ratio of the training set.
Table 5. Evaluating the robustness of the methods over different test sets. Image only and text only indicate that the test set contains only image or only text data, respectively. All methods are trained on a training set WITHOUT modality missing.

4.2 Empirical Results

Results of the IID Setting. Table 3 shows the superior performance of FedMVP across different missing ratios under the IID setting on both the CUB-200 and Oxford Flower datasets. All models exhibit a decline in accuracy as the missing ratio (\(\beta \)) increases. FedMVP consistently outperforms the baseline methods and demonstrates exceptional resilience to the performance degradation caused by missing modalities. For instance, on the CUB-200 dataset, FedMVP's accuracy margin over the next best-performing model, FedViLT, widens from about \(1.6\%\) at \(\beta =0.3\) to \(6.2\%\) at \(\beta =0.8\). A similar trend is observed on the Oxford Flower dataset, with the margin increasing from \(0.52\%\) to \(7.8\%\). FedMVP also degrades notably more slowly than the other models: as \(\beta \) increases from 0.3 to 0.8, its accuracy drops by merely \(7.58\%\) and \(3.87\%\) on the CUB-200 and Oxford Flower datasets, respectively, whereas FedViT suffers larger drops of \(14.38\%\) and \(15.51\%\).

Results of the Non-IID Setting. As shown in Table 4, under the non-IID setting all methods, including FedMVP, experience a significant decrease in accuracy compared to the IID setting. The proposed FedMVP consistently outperforms the other methods across the settings and suffers the least performance degradation from non-IID data, with a drop of no more than \(5\%\) on CUB-200 and no more than \(7\%\) on Oxford Flower. Despite the missing ratio increasing from \(\beta =0.3\) to \(\beta =0.8\), FedMVP maintains a substantial lead in accuracy on both datasets. For instance, even at \(\beta =0.8\), FedMVP achieves accuracies of \(66.44\%\) and \(82.47\%\) on the CUB-200 and Oxford Flower datasets, respectively, confirming its robustness to modality incompleteness under non-IID settings. Notably, the performance margin between FedMVP and the baselines widens further compared to the IID setting: on the Oxford Flower dataset at \(\beta =0.8\), the accuracy of FedMVP is \(29.68\%\) higher than MMFed, compared to \(25.27\%\) under IID.

Results of Single-modality Testing. As shown in Table 5, all methods experience significant performance drops when tested with only one modality (image or text). FedMVP shows the best resilience, achieving the highest accuracy in both image-only and text-only scenarios across datasets. FedViLT [21] performs best with complete data, since it has \(26.7\times \) more parameters than FedMVP and is pre-trained on millions of instances; it holds second place in the single-modality tests. FedCLIP's performance is limited by the local dataset size but benefits from its separate ViT and BERT encoders. MMFed suffers the most due to its co-attention mechanism and performs better in text-only testing thanks to its integrated BERT. In summary, FedMVP demonstrates robustness to missing modalities in both training and testing.

Ablation Study. The results in Table 6 show that all modules of FedMVP contribute significantly to its performance. The MCM and RAM losses effectively improve the quality of the representations generated by the multi-modal joint encoder and enhance the final performance of the model by transferring pre-trained knowledge through representation learning. The modality completion module supplements the data with additional training information using the transferable knowledge of the pre-trained model. Furthermore, the results suggest that CKA similarity effectively measures the importance of the representation learned by each client's local model and improves aggregation performance compared to traditional average aggregation.

Table 6. Ablation study on both CUB-200 and Oxford Flower datasets with \(\beta =0.3\) under the non-IID setting; w/o MCM denotes the MCM loss excluded; w/o RAM excludes the RAM loss; w/o Completion refers to training without the modality completion module; w/o CKA replaces the CKA-based server aggregation with FedAvg.

5 Conclusion

In conclusion, we proposed the FedMVP framework to tackle modality missing, a widely existing real-world challenge in which part of the multi-modal data is incomplete or unaligned. Our framework utilizes large-scale pre-trained models with frozen parameters for modality completion and representation knowledge transfer at each client, providing a solution for integrating large-scale pre-trained models to make the federated system robust to modality incompleteness. Experiments on real-world image-text benchmarks demonstrated the effectiveness of the proposed method. FedMVP shows great potential for addressing the missing-modality and unified representation learning challenges of multi-modal federated learning, and we hope this work can inspire future research in this field.