1 Introduction

The significant progress [1, 2] achieved in object detection and image localization has established a substantial empirical foundation and theoretical framework for person re-identification (ReID) [3]. Currently, there is growing interest in visible-infrared person re-identification (VI-ReID) [4]. VI-ReID aims to retrieve the person images most similar to a query across diverse illumination conditions. Because of the different imaging principles involved, images captured under varying lighting conditions manifest significant discrepancies in their visual attributes. Moreover, even under identical illumination, images of individuals with the same identity show considerable intra-class diversity owing to factors such as posture and viewpoint. Hence, the pursuit of consistent inter-modal embeddings for the same identity, exemplified in Fig. 1, emerges as a pivotal challenge in VI-ReID.

Fig. 1

The proposed UPPI framework generates consistent embeddings across diverse modal environments, enabling the network to learn identity-representative features. This approach effectively reduces the modal gap between VIS and IR images of the same identity

This challenge has garnered considerable attention, leading to two prominent lines of work. One involves feature-based methods [5,6,7,8,9,10,11,12], which employ single-path or multi-path deep neural networks (DNNs) to learn cross-modal consistent embeddings; these methods leverage sample labels to reduce the modal disparity between cross-modal features. The alternative centers on image-based methods [13,14,15,16], which aim to bridge the substantial appearance disparities between modalities. For instance, GANs [17] can convert images from one modality to the other, mitigating the appearance differences induced by cross-modal imaging. Despite their effectiveness and frequently superior performance, these approaches have certain limitations: (1) because the infrared modality is absent from ImageNet, the pre-trained models involved suffer from domain disparities; (2) the scarcity of labeled samples across modalities restricts the matching performance of these networks.

To address the aforementioned issues, we make extensive efforts on the samples, the training paradigm, and the loss functions rather than on the network architecture itself. Consequently, we propose UPPI, as depicted in Fig. 2. First, alongside visible and infrared images, we assemble a vast sample repository named UnitCP, which encompasses paired images with distinct appearance styles: visible and pseudo-infrared. The pseudo-infrared images exhibit an appearance style comparable to infrared images while maintaining semantic information consistent with the visible images. This effectively bridges the substantial disparities between ImageNet and specific infrared datasets and also offers a solution to the scarcity of labeled infrared samples. Second, within the pre-training phase, we introduce a cross-modal feature fusion module (CF\(^2\)), designed to compel the network to discern shared features across modalities by explicitly merging redundant features from paired samples.

During the fine-tuning process, to help the model adapt to the absence of paired images at this stage, we introduce a novel central contrast loss (C\(^2\)). This loss encourages the network to prioritize identity-consistent cross-modal features when paired samples are unavailable.

Fig. 2

Illustrates the UPPI framework pipeline, encompassing the cross-modal feature fusion module (CF\(^2\)) and the central contrast loss (C\(^2\)). The distinctively colored arrows depict the different training stages, namely pre-training and fine-tuning

Our main contributions are summarized as follows. (1) To address the domain discrepancies between ImageNet and specific infrared datasets, as well as the absence of paired samples, we investigate the feasibility of pre-training and propose a Unified Framework for Pre-training with Pseudo-Infrared Images (UPPI). (2) We establish a comprehensive cross-modal sample repository (UnitCP), serving as the foundation for our pre-training. (3) We propose a cross-modal feature fusion mechanism (CF\(^2\)), strategically devised to maximize the utilization of paired images. Moreover, to enhance the model’s adaptability during fine-tuning, where paired images are lacking, we introduce a novel central contrast loss (C\(^2\)). (4) Extensive experimental results on two benchmark datasets, SYSU-MM01 and RegDB, validate the effectiveness and strong performance of our proposed method.

2 Related work

2.1 VI-ReID tasks

In tackling this formidable cross-modal challenge, two distinct approaches have arisen. One class of methodologies endeavors to attain consistent embeddings across diverse modalities [4,5,6,7,8,9,10,11,12]. For example, Wu et al. [4] constructed the largest visible near-infrared dataset, SYSU-MM01, and proposed a zero-padding framework. TONE+HCML [5] advanced the two-stream cross-modal Re-ID framework by concurrently optimizing shared and specific metrics. Lu et al. [6] endeavored to introduce modality-specific features. Meanwhile, Wu et al. [7] proposed a modal-gated extractor that integrates a similarity preservation loss. Ye et al. [9] proposed an attention-based framework aimed at leveraging both local-level and image-level contextual cues. FMCNet [12] devised a feature compensation structure to extract additional discriminative features from shared ones. While these methods have achieved notable advancements in performance, the inherent domain distinctions between ImageNet and the specific datasets pertaining to VI-ReID constrain further performance gains. Unlike these methods, which apply ImageNet pre-trained models directly to recognition, we first use a pre-training task to reduce the domain disparity between ImageNet and the specific VI-ReID datasets, and then perform recognition.

An alternative set of methodologies endeavors to mitigate the visual disparities among cross-modal images through sample-based interventions [13,14,15,16] or through effective sample augmentation strategies [18,19,20]. For instance, Wang et al. [15] decomposed the extracted features and decoded the shared modal features to produce high-quality cross-modal paired images. X-modality [16] devised a lightweight network that learns intermediate representations of visible and infrared images. Ye et al. [18] employed generated grayscale images to train more robust networks. Ye et al. [19] proposed a data augmentation strategy that randomly selects color channels to generate single-channel samples. Our pseudo-infrared images are similar to the latter category of methods; however, to avoid introducing additional noise, we employ a pseudo-infrared approach similar to that described in [19] without being limited by it. We extensively apply this pseudo-infrared technique to existing visible person datasets, yielding a substantial number of cross-modal sample pairs.

2.2 Pre-training tasks

The practice of pre-training networks [21,22,23] on extensive corpora has exhibited significant advantages in natural language processing tasks. In computer vision, networks pre-trained with supervision on ImageNet [24], such as AlexNet [25], ResNet [26], ViT [27], and Swin Transformer [28], have shown substantial performance gains when applied to other tasks. Models pre-trained on ImageNet significantly boost retrieval tasks within the same spectrum despite domain disparities. However, this performance gain diminishes in cross-spectral settings. From the sample perspective, the lack of infrared samples leaves the pre-trained models without the relevant prior knowledge, which is likely a key factor limiting performance in cross-spectral scenarios. Therefore, we propose a pre-training task on a repository of visible-pseudo-infrared sample pairs, aiming to compensate for the missing infrared prior knowledge by introducing pseudo-infrared samples. Considering the similar format of visible and infrared samples, the unified pre-training framework resembles a single-stream network [29,30,31,32,33]. To further enhance the network’s generalization performance, we also incorporate elements from the pre-training paradigms discussed in [34, 35].

3 Methodology

Let \(x_k\) represent a sample of modality k, where \(k\in \{v,i\}\) (v denotes the visible modality, i the infrared modality). The dataset consists of visible and infrared samples \(V=\left\{ x_{v}^{j},y_{v}^{j} \right\} _{j=0}^{N_v}\) and \(I=\left\{ x_{i}^{j},y_{i}^{j} \right\} _{j=0}^{N_i}\), respectively, where \(N_v\) and \(N_i\) are the numbers of samples of each modality in the dataset. Here, \(y_k^j\) represents the identity label of the j-th sample in modality k. VI-ReID aims to match samples with identical identities across modalities. The proposed unified framework is illustrated in Fig. 2, with further details discussed in subsequent sections.
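To make the matching objective concrete, the following minimal PyTorch sketch ranks a gallery of visible embeddings against infrared query embeddings by cosine similarity and reports rank-1 accuracy. The tensor names and random features are illustrative stand-ins for the embeddings produced by the network; the actual evaluation protocol is described in Sect. 4.1.

```python
import torch
import torch.nn.functional as F

def cross_modal_rank1(query_feats, query_ids, gallery_feats, gallery_ids):
    """Rank the visible gallery for each infrared query by cosine similarity
    and report rank-1 accuracy (illustrative names and tensors)."""
    q = F.normalize(query_feats, dim=1)            # (Nq, D)
    g = F.normalize(gallery_feats, dim=1)          # (Ng, D)
    sim = q @ g.t()                                # pairwise cosine similarities
    best = sim.argmax(dim=1)                       # most similar gallery sample per query
    return (gallery_ids[best] == query_ids).float().mean().item()

# toy usage with random features standing in for network embeddings
q_feats, g_feats = torch.randn(8, 2048), torch.randn(32, 2048)
q_ids, g_ids = torch.randint(0, 4, (8,)), torch.randint(0, 4, (32,))
print("rank-1:", cross_modal_rank1(q_feats, q_ids, g_feats, g_ids))
```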

3.1 UnitCP: a repository of visible pseudo-infrared samples

Fig. 3

Illustrates a comparison between pseudo-infrared images in UnitCP and infrared images generated by GANs. The leftmost column depicts the original image, while the upper and lower right columns show the images generated by the GAN network and the pseudo-infrared images, respectively. The pseudo-infrared images exhibit superior visual credibility, semantic consistency, and detail compared to those generated by GANs

In transfer learning, domain disparity refers to the difference between the data distribution learned by the model and the data distribution encountered during actual application or testing. Existing works [36,37,38] have explored the mechanisms behind domain disparity and methods for adjusting to it. In VI-ReID, due to the lack of cross-modality source datasets, many efforts [6,7,8] have focused on leveraging the large-scale single-spectrum dataset ImageNet to acquire prior knowledge. However, we argue that domain disparity remains a significant issue in these approaches. Standard domain disparity problems focus on inter-domain differences caused by factors such as different scenes and camera viewpoints within a single spectrum. In contrast, the domain disparity we identify concerns differences between datasets under different spectral settings. Specifically, since the target dataset comprises visible-infrared cross-spectrum samples, the source dataset should maintain a consistent setting. Therefore, we construct a source dataset with a cross-spectrum setting consistent with the target dataset.

Several GAN-based works [13,14,15] have successfully transformed visible images into infrared images, while some channel-related works [18, 19] have simulated infrared styles in visible images. As depicted in Fig. 3, the images generated by the former significantly differ in quality from those generated by the latter, even though the former successfully produced infrared-style images. Hence, we constructed UnitCP: a visible-to-pseudo-infrared sample repository based on the latter approach.

Initially, we gathered training and test samples from various datasets including Market1501 [39], DukeMTMC-reID [40], MSMT17 [41], and others. These samples were then organized based on labels to obtain approximately 180,000 visible person samples representing 7,250 unique identities. To ensure optimal generalization performance of the network, all these 180,000 samples were utilized for training while a distinct dataset called SYSU-MM01 was used as the test set. After thoroughly examining the merits and demerits of pseudo-infrared technologies, we synergistically integrated these techniques through a randomized selection process to establish a diversified sample-based pseudo-infrared scheme, as outlined below:

$$\begin{aligned} \left\{ \begin{array}{l} x_{i}^{r}=\left( x_{i}^{R},x_{i}^{R},x_{i}^{R} \right) ,\quad n=0\\ x_{i}^{r}=\left( x_{i}^{G},x_{i}^{G},x_{i}^{G} \right) ,\quad n=1\\ x_{i}^{r}=\left( x_{i}^{B},x_{i}^{B},x_{i}^{B} \right) ,\quad n=2\\ x_{i}^{r}=\left( x_{i}^{\delta },x_{i}^{\delta },x_{i}^{\delta } \right) ,\quad 2<n\le 5\\ x_{i}^{\delta }=\alpha \cdot x_{i}^{R}+\beta \cdot x_{i}^{B}+\gamma \cdot x_{i}^{G}\\ \end{array} \right. \end{aligned}$$
(1)

where v and r represent the visible and (pseudo-)infrared modes respectively, and R, G, and B denote the three color channels of the visible image \(x_i\). The variable n is a random integer employed to enhance sample diversity, while \(\alpha , \beta , \gamma \) are the channel coefficients used in grayscale conversion.
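For illustration, a minimal sketch of the conversion in (1) is given below; the ITU-R BT.601 grayscale coefficients used for \(\alpha , \beta , \gamma \) are assumptions, since their exact values are not specified here.

```python
import random
import numpy as np

# Assumed ITU-R BT.601 coefficients for alpha, beta, gamma (values not given in the text)
ALPHA, BETA, GAMMA = 0.299, 0.587, 0.114

def pseudo_infrared(img):
    """Convert an HxWx3 RGB uint8 array into a pseudo-infrared image per Eq. (1):
    a randomly chosen channel, or a weighted grayscale image, replicated to 3 channels."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    n = random.randint(0, 5)                 # random selector n from Eq. (1)
    if n == 0:
        ch = r
    elif n == 1:
        ch = g
    elif n == 2:
        ch = b
    else:                                    # 2 < n <= 5: weighted grayscale channel
        ch = ALPHA * r + BETA * g + GAMMA * b
    ch = ch.astype(np.uint8)
    return np.stack([ch, ch, ch], axis=-1)   # replicate single channel to 3 channels

# usage: (img, pseudo_infrared(img)) forms one visible/pseudo-infrared sample pair
```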

Subsequently, we employed the pseudo-infrared technology to convert the 180,000 samples into an equivalent number of visible pseudo-infrared sample pairs. The abundance of such pairs in UnitCP significantly mitigates the domain disparities between ImageNet and VI-ReID datasets, thereby maximizing the model’s potential. To validate this, we visualize the performance curves of AGW across various scenarios. As illustrated in Fig. 4, the network pre-trained on UnitCP not only converges faster but also performs significantly better. Notably, the zero-shot mAP of the AGW baseline on SYSU-MM01 surpasses the best performance achieved without pre-training by more than 10 percentage points and exceeds the zero-shot mAP of AGW pre-trained on ImageNet by over 20 percentage points. This indicates that, through exposure to simulated visible pseudo-infrared environments, the network effectively adapts to cross-modal settings.

Note that our proposed UnitCP serves not only as a static sample repository but also as a dynamic cross-modal sample augmentation strategy. In future research, we intend to apply this augmentation strategy to larger datasets, such as ImageNet, to further address domain disparity and mitigate the scarcity of cross-modal samples.

Fig. 4

Illustrates the performance curves of the network under different pre-training scenarios. \(\alpha \) denotes the difference between the zero-shot performance with UnitCP pre-training and the saturated training performance without pre-training. \(\beta \) represents the gap between zero-shot performance pre-trained on UnitCP and that pre-trained on ImageNet. Finally, \(\gamma \) indicates the difference in fine-tuning performance after pre-training on UnitCP versus pre-training on ImageNet

3.2 CF\(^2\): Cross-modal feature fusion module

After introducing a substantial collection of visible pseudo-infrared sample pairs, we thoroughly harness the potential of cross-modal pairwise semantics. Among the high-dimensional features extracted by the network, a significant portion has negligible impact on identity discrimination or even an adverse effect, as demonstrated in previous studies [42]. Motivated by this, we propose a module that exploits these redundant features to alleviate disparities in cross-modal features. As illustrated in Fig. 2(b), we first segregate the feature map into dense and sparse information components, and then perform cross-reconstruction to establish fused features that enhance inter-feature information flow.

Specifically, the correlation coefficients in the group normalization (GN) layer are first utilized to evaluate the information density of individual channels. Given a feature map \(X\in \mathbb {R}^{N\times C\times H\times W}\) at an intermediate layer, where N represents the batch size, C the channel count, and \(H\times W\) the spatial dimensions of the feature, we normalize the input feature X using the subsequent formula:

$$\begin{aligned} X_{out}=GN\left( X \right) =\gamma \frac{X-\mu }{\sqrt{\sigma ^2+\theta }}+\beta \end{aligned}$$
(2)

where \(\mu \) and \(\sigma \) are the mean and standard deviation of X, \(\theta \) is a small positive constant that prevents division by zero, and \(\gamma \) and \(\beta \) are trainable affine transformation parameters.

The trainable parameter \(\gamma \) in the GN layer is utilized to quantify the spatial pixel variance of the channel during the standardization process, where a larger value of \(\gamma \) indicates a richer representation of spatial information. The weight \(W_\gamma \), which determines the significance of each channel, can be derived from (3).

$$\begin{aligned} W_{\gamma }=\left\{ w_i \right\} =\frac{\gamma _i}{\sum \limits _{j=1}^C{\gamma _j}},i,j=1,2,...,C \end{aligned}$$
(3)

The weights of the reweighted feature map \(W_\gamma \) are subsequently normalized to the range (0,1) through the application of the sigmoid function. We assign 0 to weights below the threshold and 1 to weights above the threshold, resulting in two weight matrices \(W_1\) and \(W_2\) that have the same scale as the feature map. In summary, the process for calculating weights to distinguish sparse feature maps from dense feature maps is as follows:

$$\begin{aligned} W=Gate\left( Sigmoid\left( W_{\gamma }\left( GN\left( X \right) \right) \right) \right) \end{aligned}$$
(4)

We respectively replicate and weight the features of the two modalities to obtain a feature map that encompasses dense information as well as one that captures sparse information. Finally, the reconstructed features of both modalities are calculated as follows:

$$\begin{aligned} \left\{ \begin{array}{l} X_{11}^{v}=W_1\otimes X^v,\quad X_{12}^{v}=W_2\otimes X^v\\ X_{11}^{i}=W_1\otimes X^i,\quad X_{12}^{i}=W_2\otimes X^i\\ \tilde{X}^{v}=X_{11}^{v}\oplus X_{12}^{i}\\ \tilde{X}^{i}=X_{11}^{i}\oplus X_{12}^{v}\\ \end{array} \right. \end{aligned}$$
(5)

where \(\oplus \) denotes element-wise addition and \(\otimes \) denotes element-wise multiplication.
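The following PyTorch sketch illustrates how (2)-(5) fit together. The gate threshold of 0.5, the group count, and the choice to derive the shared masks \(W_1, W_2\) from the mean of the two modality feature maps are assumptions made for concreteness; the hard gate is likewise left non-differentiable for simplicity.

```python
import torch
import torch.nn as nn

class CF2(nn.Module):
    """Sketch of the cross-modal feature fusion module (Eqs. 2-5).
    Threshold, group count, and deriving the masks from the mean of the
    two modality feature maps are assumptions."""

    def __init__(self, channels, groups=16, threshold=0.5):
        super().__init__()
        self.gn = nn.GroupNorm(groups, channels)   # supplies the learnable scale gamma
        self.threshold = threshold

    def gate(self, x):
        # Eq. (3): per-channel weights derived from the GN scale gamma
        gamma = self.gn.weight.abs()
        w_gamma = gamma / gamma.sum()
        # Eq. (4): reweight the normalized features, squash, then hard-gate
        scores = torch.sigmoid(self.gn(x) * w_gamma.view(1, -1, 1, 1))
        w1 = (scores > self.threshold).float()     # dense-information mask W1
        w2 = 1.0 - w1                              # sparse-information mask W2
        return w1, w2

    def forward(self, x_v, x_i):
        w1, w2 = self.gate(0.5 * (x_v + x_i))
        # Eq. (5): keep each modality's dense part, swap in the other's sparse part
        return w1 * x_v + w2 * x_i, w1 * x_i + w2 * x_v

# usage on paired feature maps of shape (N, C, H, W), e.g. C = 512:
# cf2 = CF2(channels=512); fused_v, fused_i = cf2(feat_v, feat_i)
```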

The effect comparison before and after CF\(^2\) is illustrated in Fig. 5. The red-boxed area in (a) is severely affected by the background, leading to feature loss in the corresponding area in (b). In contrast, the corresponding area in (c) shows effective feature compensation. This improvement can be attributed to the efficient inter-modal information reciprocity channel constructed by CF\(^2\), which significantly alleviates the issue of feature loss in infrared samples.

Fig. 5

Illustrates the comparative analysis of feature representations before and after the incorporation of the CF\(^2\) framework. The integration of CF\(^2\) has facilitated enhanced information exchange across multiple modalities, thereby significantly augmenting the fidelity and discriminative power of the resultant feature set. This improvement in feature quality is attributed to the synergistic interplay between the diverse data modalities, which collectively contribute to a more robust feature learning process

3.3 C\(^2\): Central contrast loss

In the fine-tuning phase, the challenging conditions characterized by a scarcity of paired images and substantial intra-class variations compel the model to redirect its focus from images towards identity. Hence, it is crucial to establish intermodal identity consistency constraints for alignment purposes. Previous works [11, 43, 44] have introduced cross-modal distance constraints to effectively reduce modal distances between cross-modal features and achieved remarkable outcomes. However, these additional hard distance constraints may potentially result in the loss of essential discriminative features.

Instead, we propose a similarity-based central contrast loss. First, we use the Euclidean center of the features to represent each identity in each modality.

$$\begin{aligned} C_{i}^{v}=\frac{1}{N_v}\sum \limits _{j=1}^{N_v}{f_{j}^{v}},\quad C_{i}^{r}=\frac{1}{N_r}\sum \limits _{j=1}^{N_r}{f_{j}^{r}} \end{aligned}$$
(6)

where \(N_v\) and \(N_r\) denote the numbers of visible and infrared samples of the same identity within a batch. Compared with sample-based hard distance constraints, we use a soft identity constraint in order to preserve higher-quality discriminative image features.

Assuming a batch size \(N_b\) and a sample size m for each modality and identity, the number of identities in a batch is \(N=N_b/(2m)\). The contrast loss maximizes the cross-modal center cosine similarity within the same identity while minimizing it between different identities. The corresponding formula is presented below.

$$\begin{aligned} \begin{array}{c} L_{c^2}^{v}=\frac{1}{N}\sum \limits _{n=1}^N{-\log \left( \frac{\exp \left( sim\left( C^v,C^{r+} \right) \right) }{\sum \limits _{i=1}^{N_{c^{r-}}}{\exp \left( sim\left( C^v,C^{r-} \right) \right) }} \right) ,}\\ L_{c^2}^{r}=\frac{1}{N}\sum \limits _{n=1}^N{-\log \left( \frac{\exp \left( sim\left( C^r,C^{v+} \right) \right) }{\sum \limits _{i=1}^{N_{c^{v-}}}{\exp \left( sim\left( C^r,C^{v-} \right) \right) }} \right) ,}\\ L_{c^2}=\left( L_{c^2}^{v}+L_{c^2}^{r} \right) /2\\ \end{array} \end{aligned}$$
(7)

where \(+\) and − respectively denote the feature centers sharing the same identity as the anchor center and those with different identities; \(sim\left( a,b \right) =a^{\top }b/\left( \parallel a\parallel \cdot \parallel b\parallel \right) \) denotes the cosine similarity between a and b.
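A sketch of C\(^2\) in PyTorch is given below, assuming every identity in the batch has samples in both modalities (as the sampler in Sect. 4.1.2 guarantees). For brevity it uses the standard InfoNCE formulation, whose denominator also includes the positive pair, which is a mild simplification of (7).

```python
import torch
import torch.nn.functional as F

def central_contrast_loss(feat_v, feat_r, ids_v, ids_r):
    """Sketch of C^2 (Eqs. 6-7): contrast per-identity feature centers across modalities."""
    ids = torch.unique(ids_v)
    # Eq. (6): Euclidean center of each identity within each modality
    c_v = torch.stack([feat_v[ids_v == pid].mean(dim=0) for pid in ids])
    c_r = torch.stack([feat_r[ids_r == pid].mean(dim=0) for pid in ids])
    c_v, c_r = F.normalize(c_v, dim=1), F.normalize(c_r, dim=1)
    sim = c_v @ c_r.t()                      # cosine similarities between centers
    labels = torch.arange(len(ids))          # matching centers lie on the diagonal
    # Eq. (7): symmetric contrastive terms in both retrieval directions
    # (InfoNCE form: the denominator also contains the positive pair)
    loss_v = F.cross_entropy(sim, labels)
    loss_r = F.cross_entropy(sim.t(), labels)
    return 0.5 * (loss_v + loss_r)
```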

In addition, to provide the network with basic semantic supervision, the two general loss functions in (8) are also adopted.

$$\begin{aligned} \begin{array}{c} L_{id}=-\frac{1}{N}\sum \limits _{i=1}^N{\log \left( P\left( y_i \mid x_i \right) \right) }\\ L_{wrt}=\frac{1}{N}\sum \limits _{i=1}^N{\log \left( 1+\exp \left( \sum _{ij}{w_{ij}^{p}d_{ij}^{p}}-\sum _{ik}{w_{ik}^{n}d_{ik}^{n}} \right) \right) }\\ w_{ij}^{p}=\frac{\exp \left( d_{ij}^{p} \right) }{\sum _{d_{ij}^{p}\in p_i}{\exp \left( d_{ij}^{p} \right) }},\quad w_{ik}^{n}=\frac{\exp \left( -d_{ik}^{n} \right) }{\sum _{d_{ik}^{n}\in n_i}{\exp \left( d_{ik}^{n} \right) }}\\ \end{array} \end{aligned}$$
(8)

where \(L_{id}\) denotes the cross-entropy identity loss, \(L_{wrt}\) denotes the weighted regularization triplet loss, and (i, j, k) denotes a triplet in each training batch for each anchor sample \(x_i\). For \(x_i\), \(p_i\) is the corresponding positive sample set and \(n_i\) is the corresponding negative sample set. \(d^p\) and \(d^n\) denote the pairwise distances of positive and negative sample pairs, respectively, where \(d_{ij}\) is the Euclidean distance between two sample features. A softmax weighting strategy yields the two weights \(w^p\) and \(w^n\), which force the network to pay more attention to optimizing the distances of hard samples.

The overall objective function of UPPI can be expressed as (9), where \(\omega \) represents the hyperparameter utilized for balancing \(L_{c^2}\) with the two base losses.

$$\begin{aligned} L=L_{id}+L_{wrt}+\omega L_{c^2} \end{aligned}$$
(9)
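For reference, a sketch of \(L_{wrt}\) and the overall objective in (9) follows, assuming an identity-balanced batch in which every sample has at least one positive and one negative; \(L_{c^2}\) is taken from the sketch above, and \(\omega =0.6\) as reported in Sect. 4.1.2.

```python
import torch
import torch.nn.functional as F

def weighted_reg_triplet(feats, labels):
    """Sketch of the weighted regularization triplet loss L_wrt in Eq. (8)."""
    dist = torch.cdist(feats, feats)                       # pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool)
    pos_mask, neg_mask = same & ~eye, ~same
    losses = []
    for i in range(len(labels)):
        d_p, d_n = dist[i][pos_mask[i]], dist[i][neg_mask[i]]
        w_p = torch.softmax(d_p, dim=0)                    # harder (farther) positives weighted more
        w_n = torch.softmax(-d_n, dim=0)                   # harder (closer) negatives weighted more
        losses.append(F.softplus((w_p * d_p).sum() - (w_n * d_n).sum()))
    return torch.stack(losses).mean()

def total_loss(logits, labels, feats, l_c2, omega=0.6):
    """Eq. (9): identity loss + weighted regularization triplet loss + omega * C^2 loss."""
    l_id = F.cross_entropy(logits, labels)
    l_wrt = weighted_reg_triplet(feats, labels)
    return l_id + l_wrt + omega * l_c2
```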

4 Experiment

4.1 Datasets and implementation details

4.1.1 Datasets

We conducted simulation experiments on two general VI Re-ID datasets to evaluate our proposed method.

SYSU-MM01 consists of 491 identities captured by 6 cameras (4 RGB cameras and 2 IR cameras) at different times and under varying environmental conditions, resulting in a total of nearly 60,000 images. The training set comprises 395 identities with 19,659 RGB images and 12,792 IR images, while the test set includes 96 identities. Following its standard evaluation protocol, the dataset offers an all-search mode and an indoor-search mode. To ensure fair comparison with state-of-the-art methods, we respectively extracted 301 and 3010 images to construct the gallery sets for the single-shot and multi-shot settings.

RegDB [45], captured by a paired visible camera and thermal camera, contains 412 identities, each with 10 visible and 10 thermal images taken from several different perspectives, for a total of 8,240 images. In the training phase, we randomly selected the images of 206 identities as the training set and used the images of the remaining 206 identities as the test set.

4.1.2 Implementation details

We utilized the AGW [46] baseline as our backbone network and collected two versions of the UnitCP sample repository: UnitCP12, with 4,100 identities and nearly 120,000 visible images, and UnitCP18, with 7,250 identities and almost 180,000 images. To account for the different numbers of classes, we set independent classifiers for the pre-training and fine-tuning processes. Our proposed CF\(^2\) was inserted after the second and third convolution blocks; however, its parameters were not updated during fine-tuning owing to the lack of cross-modal paired images at that stage. After numerous experiments, we set \(\omega \) to 0.6, which effectively balanced the C\(^2\) loss term. During the pre-training and fine-tuning stages, each batch randomly selected sixteen identities, from each of which four visible images and four infrared images were chosen. Each infrared sample was stacked into a three-channel image before being fed into the network. To expedite training, mixed precision training was employed. The Adam optimizer with a warm-up strategy was utilized, setting the learning rate to \(3.5\times 10^{-4}\) for SYSU-MM01 and \(8.25\times 10^{-4}\) for RegDB during the initial 10 epochs. Subsequently, at epoch 20 and epoch 40, the learning rate was reduced by factors of 0.1 and 0.01, respectively. Standard data augmentation techniques, including random cropping, random horizontal flipping, color jittering, and random erasing, were applied. A total of 120 epochs were used for network training, comprising an initial pre-training phase of 80 epochs followed by fine-tuning for 40 epochs. To ensure fair comparison among different networks, no re-ranking algorithm was employed during evaluation.
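For reference, the learning-rate schedule described above can be sketched as follows; the linear shape of the warm-up over the first 10 epochs is an assumption, as only the use of a warm-up strategy is stated (for RegDB, base_lr would be \(8.25\times 10^{-4}\)).

```python
def learning_rate(epoch, base_lr=3.5e-4):
    """Warm-up + step schedule for SYSU-MM01 (assumed linear warm-up)."""
    if epoch < 10:
        return base_lr * (epoch + 1) / 10   # warm-up over the first 10 epochs
    if epoch < 20:
        return base_lr                      # full learning rate
    if epoch < 40:
        return base_lr * 0.1                # decayed at epoch 20
    return base_lr * 0.01                   # decayed at epoch 40
```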

Table 1 Comparison with the state-of-the-art methods on the SYSU-MM01 dataset

4.2 Comparison with state-of-the-art methods

The comparison results with state-of-the-art methods on SYSU-MM01 and RegDB are shown in Tables 1 and 2. The performance indicators of the proposed UPPI on the two datasets mostly surpass those of existing state-of-the-art methods. Most of the compared methods rely on models pre-trained on ImageNet and lack paired cross-modal images, leaving part of their potential untapped. UPPI is specifically designed to tackle these two challenges: (1) the construction of a large-scale visible pseudo-infrared paired sample repository (UnitCP), based on pseudo-infrared images, not only addresses the scarcity of cross-modal samples but also enables the pre-trained model to compensate for the lack of infrared experience while retaining prior knowledge acquired from ImageNet; (2) CF\(^2\) facilitates the integration of cross-modal information by leveraging redundant features, thereby mitigating substantial visual disparities across modalities, while during the fine-tuning phase C\(^2\) redirects the model’s attention from images to identity.

Table 2 Comparison with the state-of-the-art methods on the RegDB dataset

4.3 Ablation study and analysis

4.3.1 Effectiveness of proposed components

The experimental results on the SYSU-MM01 dataset are presented in Table 3. It is evident that our pre-training weights significantly enhance the performance of the network on the cross-modal dataset, while the other two components also contribute to a portion of this improvement. To further validate our approach, we integrate these proposed components into well-established backbone networks such as Inception-V3 [58], ResNet50, EfficientNet-B3 [59], and Vit-Base [27]. As shown in Table 3, our components yield performance gains for these networks, demonstrating the versatility of our method.

Table 3 Effectiveness of the proposed components over different backbone networks on the SYSU-MM01 dataset under the all-search single-shot mode
Fig. 6

Illustrates the performance curves for four feature extraction networks pre-trained on different datasets. The networks exhibit significantly improved performance and faster convergence on UnitCP12 compared to ImageNet. A moderate enhancement is also observed from UnitCP12 to UnitCP18. Notably, the substantial gap in zero-shot performance highlights the network’s proficiency in cross-modal image environments, attributed to our specialized pre-training approach

Table 4 Comparison results of different feature fusion methods
Fig. 7

The performance was systematically assessed across a range of parameter \(\omega \) settings from 0.1 to 0.6, revealing a consistent enhancement that highlights the efficacy of C\(^2\). Nevertheless, further increasing the parameter values may lead to feature distortion

4.3.2 Comparative analysis of pre-training at different scales

We present in Fig. 6 the results of loading pre-training weights obtained on the ImageNet, UnitCP12, and UnitCP18 sample repositories into four different backbone networks: ResNet-50, Inception-V3, EfficientNet-B3, and ViT-Base. The cross-modal matching performance of the model is significantly enhanced following saturated pre-training. Notably, the zero-shot mAP of ResNet-50 increases by over \(20\%\), while the other backbone networks also demonstrate remarkable gains in their zero-shot matching capabilities. The substantial improvement in final retrieval performance can likely be attributed to the inclusion of a large number of samples. What, then, leads to the significant enhancement in zero-shot retrieval performance after pre-training? Zero-shot performance is a crucial indicator of the extent of acquired prior knowledge, and the sharp increase undoubtedly signifies that the model has acquired substantial prior knowledge through our tailored pre-training tasks. Crucially, prior knowledge of the visible spectrum is already obtained effectively from ImageNet. Our customized pre-training tasks therefore compensate for the missing infrared prior knowledge, which is the key to mitigating domain discrepancy between cross-spectrum datasets.

Table 5 Comparison results of different identity constraints
Fig. 8

The purple and green colors in Fig. 6(a) and (b) respectively denote intra-class distance and inter-class distance. In (c) and (d), distinct colors represent features of different identities, while diverse shapes indicate different modes. It is evident that our proposed UPPI method effectively reduces both modal distance and intra-class variation simultaneously

4.3.3 Comparative analysis of different fusion methods

In Table 4, we substituted CF\(^2\) in the pre-training phase with alternative fusion techniques, while keeping all other experimental conditions unchanged. The compared methods include DenseFuse [60], SeAFusion [61], FusionGAN [62], and DIVFusion [63]. For these approaches, we omitted the part of the network responsible for generating the fused image and extracted the fusion features directly. The results demonstrate that our proposed feature fusion method significantly outperforms these techniques in cross-modal retrieval. We attribute this superiority to our method’s emphasis on optimization that aids recognition rather than on the visual quality of synthesized images. This observation further validates the suitability of CF\(^2\) within the image environment established by UPPI.

Fig. 9

The top-5 retrieval results of several hard queries obtained by the baseline method (AGW) and the proposed framework on the SYSU-MM01 dataset. The images with green bounding boxes have the same identity labels as the query images (i.e., correct matches), and those with red bounding boxes have different identity labels (i.e., wrong matches)

4.3.4 Comparative analysis of different identity constraints

In (9), we set a parameter \(\omega \) to control the trade-off between the C\(^2\) loss and the identity and triplet losses. To explore the impact of this hyperparameter, we conducted an empirical analysis on the SYSU-MM01 dataset and report the results in Fig. 7. We observe that even adding the C\(^2\) loss with a small weight (0.2) improves the final accuracy and mAP significantly. The best performance is achieved when \(\omega \) reaches 0.6. Although a larger \(\omega \) may yield a more compact feature space for each identity, an overly high weight on the C\(^2\) loss limits the optimization of features from different identities. To compare C\(^2\) with distance-based hard constraints, we substituted it with alternative hard distance constraints and conducted comparative experiments. The compared constraints include Center Loss (CL) [64], Center Cluster Loss (CCL) [10], Dual-Enhancement Center Loss (DCL) [11], Center-Guided Pair Mining Loss (CPL) [43], and Cross-center Loss (CC) [44]. The results in Table 5 demonstrate that the proposed C\(^2\), which aims to alleviate intra-class variations and promote the network’s focus on identity features, yields superior performance improvements and significantly enhances accuracy.

4.4 Qualitative analysis

4.4.1 Feature visualization

We visualized the spatial distribution and distance distribution of the different modal features through t-SNE [65]. Figure 8(a) and (b) present the intra-class and inter-class distance distributions of the baseline and UPPI, respectively. The inter-class distance of UPPI features is clearly greater than the intra-class distance. As shown in Fig. 8(c), the inter-modal difference for some identities is very large, even greater than the difference between different identities within the same modality, which may lead to incorrect cross-modal retrieval results. In contrast, our proposed UPPI makes the features more compact and the intra-class differences smaller, as shown in Fig. 8(d).

4.4.2 Analysis of retrieval results

Figure 9 shows the comparison of retrieval results between UPPI and the baseline on SYSU-MM01, including single-shot and multi-shot settings. The results demonstrate a significant enhancement in retrieval accuracy achieved by UPPI. Notably, certain queries depicted in Fig. 9 are challenging even for humans; however, the network incorporating UPPI still manages to achieve accurate matches.

5 Conclusion

In this paper, we propose UPPI, the first unified pre-training framework for VI-ReID. It enhances network performance from a training-method perspective by incorporating three main components: (1) constructing a large-scale cross-modal sample repository (UnitCP) based on pseudo-infrared images and pre-training the network on it for cross-modal learning; (2) utilizing the cross-modal feature fusion module (CF\(^2\)) to identify potentially redundant features and fuse them into cross-modal features; and (3) implementing the central contrast loss (C\(^2\)), which establishes flexible identity-consistency constraints that encourage the network to focus on identity-consistent cross-modal features while reducing differences among such features. Extensive experiments confirm significant performance gains resulting from UPPI.