1 Introduction

Pedestrian re-identification is a popular technology in computer vision whose purpose is to search for pedestrians across multiple non-overlapping camera views [1, 2]; it benefits from the vigorous development of pattern recognition and deep learning. In recent years, researchers have proposed a series of excellent pedestrian re-identification methods that achieve high performance under ideal, simulated conditions [3,4,5]. However, most current methods focus on images produced by visible light cameras, which in practice meet the needs of only some scenes. Under night-time conditions, visible light cameras cannot accurately capture the appearance of pedestrians. Infrared cameras, which image according to temperature, are therefore the first choice for night-time monitoring and, together with visible light cameras for daytime monitoring, form an all-weather closed-loop surveillance system. The main problem in realizing such an all-weather intelligent video surveillance system is how to match pedestrian images captured in the visible light modality with those captured in the infrared modality, that is, cross-modal pedestrian re-identification.

Cross-modal pedestrian re-identification is a multi-source, fine-grained image retrieval task. Pedestrian images of the two modalities are shown in Fig. 1. The task must match infrared and visible light images of the same pedestrians over long day and night time spans, so it is more difficult than traditional single-modal pedestrian re-identification. The difficulties are mainly reflected in two aspects. (1) Intra-class variation: first, cross-modal pedestrian re-identification still faces the intra-class variation among images of the same identity caused by illumination, occlusion, posture, and viewing angle within a single modality; secondly, the amount of information conveyed by an infrared image and a visible light image is not equal, so the intra-class variation may well exceed the inter-class variation. (2) Modal difference: cross-modal pedestrian re-identification requires mutual retrieval between two heterogeneous kinds of images, and feature alignment is the basis of correct matching. Because the two kinds of images follow different imaging principles, their distributions in feature space differ considerably. Cross-modal pedestrian re-identification therefore has to overcome an additional challenge: the difference between modalities.

Fig. 1

Examples of images to be processed for cross-modal pedestrian re-identification. Samples in each row belong to the same person. The images in the first three columns are captured by RGB cameras, while those in the last three columns are captured by infrared cameras

For the problem of intra-class variation, most existing work uses the overall (global) feature of an image as the final pedestrian representation, and some literature also mentions strategies that partition the image into horizontal blocks. However, considering only the global feature, or only local features at a single scale, is one-sided. In addition, existing work considers only the high-level features extracted from the deepest layer of the feature extraction network and ignores the low-level features obtained from shallow layers. Low-level features reflect the detailed information of an image and are also of great significance for identifying pedestrians. Therefore, to obtain a more discriminative cross-modal pedestrian re-identification model, a multi-scale and multi-level refined feature extraction strategy can be considered.

Regarding modal differences, most current research maps the features of the two kinds of images into the same feature space with shared network parameters and uses the modality-shared features as the final pedestrian representation. However, the features of the two modal images can be divided into modality-shared and modality-specific features; considering only the shared features and directly discarding the specific ones means that the information contained in the images is not fully utilized. Some studies have found that, with the idea of modality conversion, the recognition rate is significantly better than that of traditional methods. GAN-based methods can use style transfer and similar techniques to convert between the two modalities and effectively alleviate the difference between them. However, although GANs can improve task performance to a certain extent, these methods destroy the original spatial structure and introduce additional noise while reconstructing images or generating features. At the same time, the large amount of computation and the convergence difficulties of GAN training cannot be ignored. Therefore, when dealing with modal differences, the feature complementarity between the source images of the same person in the two modalities should be fully exploited so that the utilization of heterogeneous information is improved, while neither losing information nor introducing noise in the process of narrowing the modal difference.

In cross-modal pedestrian re-identification, the various features extracted from images of the same pedestrian have different distributions, but they collectively reflect the identity of that pedestrian. The complementarity among these features can therefore be exploited with collaborative learning methods, and the learning ability of the network can be improved through information fusion.

2 Related work

2.1 Monomodal pedestrian re-identification

Single-modal pedestrian re-identification considers only the visible light modality; that is, it solves the problem of matching pedestrian images between non-overlapping visible light cameras [6, 7]. Its key challenges are the intra-class variations among images of the same identity caused by different camera angles, pedestrian posture changes and light intensity [8,9,10,11,12,13]. Existing single-modal methods can be roughly divided into representation learning methods and metric learning methods. Representation learning methods mainly use pedestrian identity labels to learn discriminative feature representations [14]. Metric learning methods usually learn distances between sample features so as to increase inter-class differences and reduce intra-class differences [15]. Early research often described pedestrians with anthropometric data, spatio-temporal data, kinematic data, dynamic data and video stream data through specially designed methods [16]. Recently, with the help of deep convolutional neural networks, single-modal pedestrian re-identification has achieved excellent results [9] and has even surpassed human-level recognition on some widely used public data sets [3, 17]. However, existing single-modal methods only deal with pedestrian images collected by visible light cameras under good daylight conditions and often transfer poorly to cross-modal pedestrian re-identification at night [18]. This limits the applicability of the technology in actual all-weather monitoring scenarios.

2.2 Cross-modal pedestrian re-identification

Cross-modal pedestrian re-identification needs to solve the matching problem between pedestrian images from different imaging sources; in this paper it refers to re-identification between visible light images and infrared images [19,20,21]. Wu et al. released the large-scale cross-modal data set SYSU-MM01 [18], analyzed three different network structures and proposed a deep zero-padding method. Nguyen et al. published another related data set, RegDB [22]. Ye et al. designed a dual-stream network to learn multi-modal shared features [23], with a dual-constrained top-ranking loss to handle inter-modal and intra-modal variations. In addition, generative adversarial networks (GAN) have been used for cross-modal pedestrian re-identification [24], achieving better performance than before and providing new ideas for subsequent research [25, 26]. Later, Zhu et al. considered local characteristics of the human body [27] and introduced the heterogeneous center loss, which greatly improved recognition accuracy. This paper fully considers the intra-class variations and modal differences between images of the same pedestrian identity and proposes a refined multi-source feature collaborative network. Refined feature collaborative learning is used to enhance feature discrimination and cope with intra-class variation, while multi-source feature collaborative learning improves the utilization of heterogeneous information and reduces modal differences. The effectiveness of the method is verified on the SYSU-MM01 and RegDB data sets.

2.3 Collaborative learning

Collaborative learning was introduced to solve classification problems [28]. It refers to training multiple feature learners of the same network on the same batch of training data; the information complementarity among multiple features is used for collaborative fusion, so that the generalization ability of the model and its robustness to label noise are improved without increasing inference cost, allowing the network to achieve a better learning effect. Collaborative learning offers the advantages of auxiliary training [29], multi-task learning [30, 31] and knowledge distillation [32], but does not require many additional training networks and can be trained end to end, which makes it an idea worth exploring. For cross-modal pedestrian re-identification, this paper considers the discriminative ability of multi-scale and multi-level features of deep convolutional neural networks, as well as the information complementarity of multi-source heterogeneous image data, and proposes a collaborative learning method for refined features and multi-source features.

3 Materials and methods

A refined multi-source feature collaborative network is designed; its overall architecture is shown in Fig. 2. For visible light and infrared images, the research aim of this paper is to enhance feature discrimination and improve the utilization of heterogeneous information, thereby overcoming the two major problems of intra-class variation and modal difference. Two parallel ResNet50 backbones form a dual-branch network [33] that serves as the feature extractors for visible light and infrared images, respectively. The early stages of the network (Stage1 to Stage4) extract the modality-specific features of each branch, while the later stages use shared network parameters to extract cross-modal common features. In particular, the designed network includes a refined feature collaborative learning module (multi-scale feature collaboration and multi-level feature collaboration) and a multi-source feature collaborative learning module (modality-shared and modality-specific feature collaboration, and human body semantic self-supervision).

Fig. 2

The overall network architecture for the refined multi-source feature collaborative network

3.1 Collaborative learning of refined features

3.1.1 Multi-scale feature collaboration

Most cross-modal pedestrian re-identification work extracts the overall feature of the image as the final pedestrian representation [18, 34, 35]. However, because some pedestrians with different identities differ only slightly in appearance, or because of occlusion and other noise between cross-modal images, the overall feature alone is often not effective in distinguishing identities. Recently, researchers have demonstrated the effectiveness of horizontal image partitioning for obtaining local features in both single-modal and cross-modal pedestrian re-identification tasks [27]. Local features at different positions attend to different human details, and such details are more distinguishable, enabling the model to separate different identities. However, owing to diverse pedestrian postures, camera distances and angles, it is sometimes difficult to learn aligned and robust local features with the horizontal partitioning method. Therefore, relying on global features alone, or on local features at a single specific scale, is not sufficient.

Synthesizing the respective advantages and disadvantages of global and local features, a multi-scale feature collaborative learning strategy is proposed in this paper, in which horizontal multi-scale partitioning is applied to cross-modal pedestrian features. For the feature map obtained in the fourth stage of ResNet50, a multi-scale block pooling method is used to obtain the multi-scale feature vectors of the pedestrian image. To obtain part-level information at a suitable scale without increasing the amount of computation, a reasonable partitioning scale needs to be determined. Based on the structure of human joints and common dressing habits, as well as experimental verification, this paper chooses three partitioning schemes, namely the whole body, a split into two parts, and a split into three parts, so that the multi-scale feature collaborative learning method can capture more discriminative pedestrian information; a sketch of this pooling scheme is given below. The work in this paper is the first to comprehensively consider global and local features at multiple scales for cross-modal pedestrian re-identification.
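The following is a minimal PyTorch sketch of such multi-scale horizontal block pooling, assuming average pooling and a 2048-channel Stage4 feature map; the module and variable names are illustrative and not the authors' released code.

```python
import torch
import torch.nn as nn

class MultiScaleBlockPool(nn.Module):
    """Pool a backbone feature map into 1 global + 2 + 3 horizontal parts (6 blocks)."""

    def __init__(self):
        super().__init__()
        # Adaptive pooling lets the same module handle any input height/width.
        self.pools = nn.ModuleList([
            nn.AdaptiveAvgPool2d((1, 1)),  # whole body
            nn.AdaptiveAvgPool2d((2, 1)),  # two-part split (upper / lower body)
            nn.AdaptiveAvgPool2d((3, 1)),  # three-part split
        ])

    def forward(self, feat):                       # feat: (B, C, H, W), e.g. (B, 2048, 24, 8)
        blocks = []
        for pool in self.pools:
            pooled = pool(feat)                    # (B, C, s, 1) with s in {1, 2, 3}
            blocks.extend(pooled.split(1, dim=2))  # s tensors of shape (B, C, 1, 1)
        return [b.flatten(1) for b in blocks]      # 6 vectors of shape (B, C)

# Usage: each of the 6 block vectors is supervised separately (cf. Eq. (3)).
feats = MultiScaleBlockPool()(torch.randn(4, 2048, 24, 8))
assert len(feats) == 6 and feats[0].shape == (4, 2048)
```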

3.1.2 Multi-level feature collaboration

After a pedestrian image passes through the feature extractor, features of the corresponding level can be learned at the shallow to deep stages of the convolutional neural network. To visualize the differences among the features of each layer, ResNet50 is used as the feature extractor, and the heat maps of the feature distributions at each stage are obtained as shown in Fig. 3. In a heat map, the distribution of colors represents the distribution of feature saliency: red marks the most salient regions and blue the most dispersed ones. The visualization method Grad-CAM (Gradient-weighted Class Activation Mapping) is applied to the output of the neural network; it intuitively displays the features learned by the convolutional layers and helps in understanding the working principle and decision process of the network. Grad-CAM takes the gradient of any target concept (such as the logit of a class in classification, or even the output in a captioning task), lets it flow back into the final convolutional layer, and generates a coarse localization map that highlights the image regions important for the prediction. A minimal sketch of this visualization procedure is given below.
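Below is a compact, illustrative Grad-CAM routine for a torchvision ResNet50 (recent torchvision assumed), using hooks on `layer4`; it is a generic sketch of the technique, not the exact code used to produce Fig. 3.

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet50(weights=None).eval()
acts, grads = {}, {}
model.layer4.register_forward_hook(lambda m, i, o: acts.update(v=o))
model.layer4.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

def grad_cam(image, class_idx):
    """Return a [0, 1] heat map over `image` for the logit `class_idx` (Grad-CAM)."""
    logits = model(image)                                   # forward pass stores layer4 activations
    model.zero_grad()
    logits[0, class_idx].backward()                         # backward pass stores layer4 gradients
    weights = grads["v"].mean(dim=(2, 3), keepdim=True)     # GAP of gradients -> channel weights
    cam = F.relu((weights * acts["v"]).sum(dim=1, keepdim=True))   # weighted sum, keep positive evidence
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

heatmap = grad_cam(torch.randn(1, 3, 384, 128), class_idx=0)
```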

Fig. 3

Heat maps of the feature distribution at each stage of ResNet50 for a pedestrian image

Figure 3 shows that, as the network deepens, the convolutional layers at different stages shift their regions of interest during learning. In Stage1, the first-stage convolutional layers have the most scattered attention and mainly extract detailed features from the entire pedestrian image, whereas in Stage4 the fourth-stage convolutional layers focus on discriminative features and mainly extract key semantic information. Therefore, from the bottom to the top of the convolutional neural network, attention becomes increasingly concentrated, and the extracted information shifts from scattered spatial structure information to concentrated semantic information.

Currently popular pedestrian re-identification models usually use the deep features of convolutional neural networks to identify pedestrians. However, when learning deep features, the large number of padding and pooling operations during training causes some important spatial information that exists in the shallow features, such as shape and texture, to be lost. In addition, infrared images contain less information, which leads to a huge difference in the semantic expressive power of the two modal images of the same identity. Therefore, it is not appropriate to use only deep features for cross-modal pedestrian re-identification; features extracted from the shallow layers are needed as a discriminative supplement. A multi-level feature collaborative learning method is proposed in this paper. To avoid adding a large amount of computation and feature dimensions, only the feature map obtained in Stage3 is considered as the shallow information: the number of channels of this feature map is increased from 1024 to 2048 with a 1 × 1 convolution, the shallow features are then concatenated with the deep features of Stage4, and the result is sent to the following network. Such a multi-level feature coordination strategy effectively expresses image features of different granularities, thereby yielding a more discriminative pedestrian representation. A minimal sketch of this operation follows.
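A minimal sketch of this multi-level combination, assuming a torchvision ResNet50 and an adaptive-pooling step to align the spatial sizes of the Stage3 and Stage4 maps (the paper does not specify how the sizes are matched, and many re-ID implementations instead set the Stage4 stride to 1):

```python
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet50(weights=None)
expand = nn.Conv2d(1024, 2048, kernel_size=1)    # lift Stage3 channels 1024 -> 2048

def multi_level_features(x):
    """Concatenate the (channel-expanded) Stage3 map with the Stage4 map."""
    x = backbone.maxpool(backbone.relu(backbone.bn1(backbone.conv1(x))))
    x = backbone.layer2(backbone.layer1(x))
    s3 = backbone.layer3(x)                      # Stage3: (B, 1024, 24, 8) for a 384x128 input
    s4 = backbone.layer4(s3)                     # Stage4: (B, 2048, 12, 4) with the default stride 2
    shallow = expand(s3)                         # (B, 2048, 24, 8)
    # Downsample the shallow map so the two maps can be concatenated channel-wise.
    shallow = nn.functional.adaptive_avg_pool2d(shallow, s4.shape[-2:])
    return torch.cat([shallow, s4], dim=1)       # (B, 4096, 12, 4)

out = multi_level_features(torch.randn(2, 3, 384, 128))
```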

A Feature Pyramid Network (FPN) is used here; it is essentially a combination of multi-scale feature fusion and multi-scale prediction, as shown in Fig. 4. Through upsampling combined with lateral connections, the semantic information of higher layers is gradually propagated to lower layers. Specifically, the features of the higher layer are upsampled by a factor of 2, a 1 × 1 convolution changes the number of channels of the lower-layer features, and the two results are added element-wise. In this way, the feature map of each layer merges features of different resolutions and can detect objects of the corresponding resolution, ensuring that each layer has an appropriate resolution and strong semantic features; small targets can thus be detected while strong semantic information for classification is retained.

Fig. 4

Feature Pyramid Network (FPN)

The receptive field at each scale of the FPN is different and corresponds to a different size in the original image, so targets of different scales can be processed. After a series of convolutions the feature map is obtained and then restored step by step through upsampling; while the high-level semantic information is preserved, the size of the feature map is enlarged, so that large feature maps can be used to detect small targets and the difficulty of detecting small targets is alleviated. In addition, this method adds little time and computation to the original network. A minimal sketch of one top-down merge step is given below.
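A minimal sketch of one top-down FPN merge step (2x upsampling, 1 × 1 lateral convolution, element-wise addition, and the usual 3 × 3 smoothing convolution); the channel sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNMerge(nn.Module):
    """One top-down FPN step: upsample the high-level map x2, add the 1x1-projected lateral map."""

    def __init__(self, low_channels, out_channels=256):
        super().__init__()
        self.lateral = nn.Conv2d(low_channels, out_channels, kernel_size=1)
        self.smooth = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, top, low):
        top_up = F.interpolate(top, scale_factor=2, mode="nearest")  # 2x upsampling
        merged = top_up + self.lateral(low)                          # lateral connection + addition
        return self.smooth(merged)                                   # reduce aliasing after the add

# e.g. merge a Stage4-level map (B, 256, 12, 4) into a Stage3-level map (B, 1024, 24, 8)
p4 = torch.randn(2, 256, 12, 4)
c3 = torch.randn(2, 1024, 24, 8)
p3 = FPNMerge(low_channels=1024)(p4, c3)   # -> (B, 256, 24, 8)
```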

3.2 Multi-source feature collaborative learning

3.2.1 Modal sharing and unique feature synergy

The purpose of cross-modal pedestrian re-identification is to realize mutual retrieval between images of the two modalities. Because of the differences between modalities, describing and using the pedestrian representation is a very challenging task when learning features from cross-modal images. To solve this problem, researchers usually use shared network parameters to obtain the common features of the two kinds of images as the final pedestrian representation. However, the images of the two modalities of a person contain both modality-shared characteristics and modality-specific characteristics; the shared part can be regarded as the intersection of the two feature sets. Considering only the shared features and ignoring the specific ones means that the image information is not fully utilized. A cross-modality Shared-Specific Feature Transfer network (cm-SSFT) has been proposed [36] that fully considers the shared features between modalities and the specific features within each modality. That work achieved the best recognition effect at the time, which also verifies the complementary effects of shared and specific features. However, cm-SSFT also has shortcomings such as a complex model and a large amount of computation.

Based on the heterogeneous information complementarity of cross-modal image features, a simple and effective method is proposed for collaborative learning of modality-shared and modality-specific features. On top of the cross-modal dual-branch network, a fully connected layer with shared parameters extracts the modality-shared features, and fully connected layers without parameter sharing extract the modality-specific features. During supervised learning, the shared and specific features are then trained separately to achieve heterogeneous complementarity and improve the utilization of image information. Fully connected (FC) layers play the role of a "classifier" in the convolutional neural network: if the convolutional, pooling and activation layers map the original data into a hidden feature space, the fully connected layer maps the learned "distributed feature representation" into the sample label space. A minimal sketch of such shared and modality-specific heads is shown below.
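The sketch below shows one shared and two modality-specific fully connected heads; the feature dimensions and the placement of the identity classifier are assumptions for illustration, not the authors' exact design.

```python
import torch
import torch.nn as nn

class SharedSpecificHeads(nn.Module):
    """Map backbone features to a modality-shared space and to per-modality specific spaces."""

    def __init__(self, in_dim=2048, feat_dim=512, num_ids=395):
        super().__init__()
        self.shared_fc = nn.Linear(in_dim, feat_dim)          # parameters shared by both modalities
        self.specific_fc = nn.ModuleDict({                    # parameters NOT shared across modalities
            "visible": nn.Linear(in_dim, feat_dim),
            "infrared": nn.Linear(in_dim, feat_dim),
        })
        self.classifier = nn.Linear(feat_dim, num_ids)        # identity classifier (the "classifier" role of FC)

    def forward(self, feat, modality):                        # feat: (B, 2048), modality: "visible" or "infrared"
        shared = self.shared_fc(feat)                         # modality-shared representation
        specific = self.specific_fc[modality](feat)           # modality-specific representation
        return shared, specific, self.classifier(shared)

heads = SharedSpecificHeads()
sh, sp, logits = heads(torch.randn(8, 2048), modality="visible")
```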

3.2.2 Human body semantic self-supervision

Information interaction between modalities is an effective means of reducing modal differences. Existing work usually uses a GAN to realize style transfer or feature transfer of images, through which information interaction between modalities is achieved. However, GANs may introduce new noise because they are based on the idea of generation, and they face problems such as convergence difficulties during training. Therefore, how to learn knowledge that is not constrained by modal characteristics, without introducing noise and while remaining easy to train, so as to realize information interaction between modalities, is an idea worth exploring.

A human body semantic self-supervision module is proposed in this paper. It uses the semantic information of the human body structure as prior knowledge: the images of both modalities are sent, without identity labels, into a self-supervised learning network with shared parameters, which learns basic characteristics of the human body that are not restricted by modality or identity. From an objective perspective, no matter which identity a pedestrian image belongs to and which modality it comes from, the relative positions of the body parts in the image are fixed; in other words, from top to bottom every person's image follows a semantic structure of head, shoulders, chest, abdomen, legs and feet. This semantic structure is an obvious difference between pedestrian images and other natural images and is an important piece of prior knowledge, and it is exactly this prior knowledge that is used to design a human body semantic self-supervision module with cross-modal parameter sharing.

The human body parts in each image of the two modalities are divided into small blocks whose order is shuffled; the blocks are then reordered under the supervision of block-position labels so that the original order is recovered. In this way, the relative positional relationship among body parts can be used to learn basic human body information that is independent of the modal source of the image, thus reducing modal differences. Experiments show that this simple operation yields better results. Moreover, the self-supervision module reuses the multi-scale feature blocks of a single pedestrian image from the multi-scale collaborative learning method above, so it does not introduce much computation. A hedged sketch of the block-shuffling objective follows.
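A sketch of the block-shuffling objective of Eq. (7), assuming the shuffling and position prediction operate directly on the six block feature vectors; the single-layer position head is an illustrative assumption.

```python
import torch
import torch.nn as nn

class BodySemanticSSL(nn.Module):
    """Shuffle the 6 multi-scale block features of each image and predict each block's original slot."""

    def __init__(self, feat_dim=2048, num_blocks=6):
        super().__init__()
        self.num_blocks = num_blocks
        self.position_head = nn.Linear(feat_dim, num_blocks)    # shared across both modalities

    def forward(self, blocks):                    # blocks: (B, 6, feat_dim), ordered top-to-bottom
        B = blocks.size(0)
        perm = torch.stack([torch.randperm(self.num_blocks) for _ in range(B)])  # per-image shuffle
        shuffled = blocks[torch.arange(B).unsqueeze(1), perm]    # (B, 6, feat_dim) in shuffled order
        logits = self.position_head(shuffled)                    # predict the original slot of each block
        targets = perm                                           # label = original index of each block
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, self.num_blocks), targets.reshape(-1))
        return loss                                              # corresponds to Eq. (7)

loss = BodySemanticSSL()(torch.randn(4, 6, 2048))
```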

3.3 Loss function design

Cross-modal pedestrian re-identification networks often use Cross-Entropy Loss (CE Loss) and Triplet Loss to supervise feature learning [34, 35]. The cross-entropy loss classifies pedestrian identities, and the triplet loss reduces intra-class distances and increases inter-class distances. Later, a Hetero-Center Loss (HC Loss) was proposed [27]; this loss function is designed to reduce the difference between samples of the same identity from different modalities. That work combined the heterogeneous center loss with the cross-entropy loss and achieved better results.

In the method of this paper, a mixed-modality triplet loss is introduced [37] and used in combination with the cross-entropy loss and the heterogeneous center loss. It has been shown that applying the cross-entropy loss and the triplet loss in the same feature space causes convergence difficulties [38]; the same problem exists between the cross-entropy loss and the heterogeneous center loss. Therefore, a batch normalization layer (BN Layer) and a fully connected layer (FC Layer) are used to map the feature vectors into two feature spaces, which resolves the conflict.

The size of each input batch is denoted N, with N = 2 × P × K, meaning that each batch contains P pedestrian identities and each identity has K visible light images and K infrared images. For the modality-shared features, the identity of each image is used as the supervision label [27], and the joint effect of the cross-entropy loss and the heterogeneous center loss is applied to each feature block. The cross-entropy loss on each feature block is calculated as in formula (1):

$$ L_{sh - CE} = - \mathop \sum \limits_{i = 1}^{2 \times P \times K} p\left( {x_{i} } \right)lb\left( {q\left( {x_{i} } \right)} \right) $$
(1)

Here, xi refers to a certain feature block of the i-th image, p(xi) is the expected output, that is, the ground-truth label, and q(xi) is the predicted label obtained after the extracted feature vector of the block passes through the Softmax layer. The heterogeneous center loss calculated on each feature block is given in formula (2):

$$ L_{sh - HC} = \mathop \sum \limits_{p = 1}^{P} \left| {\left| {c_{p,1} - c_{p,2} } \right|} \right|_{2}^{2} $$
(2)

Here, \({c}_{p,1}=\frac{1}{K}\sum_{j=1}^{K}{x}_{p,1j}\) and \({c}_{p,2}=\frac{1}{K}\sum_{j=1}^{K}{x}_{p,2j}\) are the heterogeneous centers, namely the centers of the K samples of the p-th pedestrian identity in the visible light and infrared modalities, respectively. Accordingly, \({\left|\left|{c}_{p,1}-{c}_{p,2}\right|\right|}_{2}^{2}\) is the distance between the centers of the p-th identity in the two modalities. The loss function on the shared features is then given by formula (3):

$$ L_{SH} = \mathop \sum \limits_{f = 1}^{7} (L_{sh - CE,f} + \lambda L_{sh - HC,f} ) $$
(3)

Here, λ is a weight parameter that balances the cross-entropy loss and the heterogeneous center loss, and f runs from 1 to 7, meaning that the total loss is computed over 7 feature blocks: one shallow feature block and 6 multi-scale feature blocks in the deep layer. A sketch of this shared-feature loss is given below.
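A simplified sketch of the shared-feature loss of Eqs. (1) to (3); note that the paper writes the cross-entropy with lb (base-2 logarithm), whereas `F.cross_entropy` uses the natural logarithm, so the two differ only by a constant factor.

```python
import torch
import torch.nn.functional as F

def hetero_center_loss(feats, labels, modality):
    """Eq. (2): squared L2 distance between the visible and infrared centers of each identity.

    feats: (N, D) block features, labels: (N,) identity labels,
    modality: (N,) tensor with 0 for visible and 1 for infrared samples.
    """
    loss = feats.new_zeros(())
    for pid in labels.unique():
        vis_center = feats[(labels == pid) & (modality == 0)].mean(dim=0)  # center of the K visible samples
        ir_center = feats[(labels == pid) & (modality == 1)].mean(dim=0)   # center of the K infrared samples
        loss = loss + ((vis_center - ir_center) ** 2).sum()
    return loss

def shared_feature_loss(block_logits, block_feats, labels, modality, lam=0.5):
    """Eq. (3): sum of cross-entropy + lambda * hetero-center loss over the 7 feature blocks."""
    loss = 0.0
    for logits, feats in zip(block_logits, block_feats):   # 7 pairs: 1 shallow block + 6 multi-scale blocks
        loss = loss + F.cross_entropy(logits, labels) + lam * hetero_center_loss(feats, labels, modality)
    return loss
```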

For the modality-specific features, the cross-entropy loss is first used to classify the identity of each sample, as shown in Eq. (4):

$$ L_{sp - CE} = - \mathop \sum \limits_{i = 1}^{2 \times P \times K} p\left( {g_{i} } \right)lb\left( {q\left( {g_{i} } \right)} \right) $$
(4)

Here, gi denotes the overall feature vector obtained from the single-modality branch. In addition, the triplet loss is used to reduce intra-class differences and increase inter-class differences. Computing the triplet loss requires three input images: an anchor image a, a positive sample image p and a negative sample image n, where a and p form a positive pair and a and n form a negative pair. Considering that other parts of the network already help to reduce the modal difference, a mixed-modality triplet loss is used here, that is, the sample features of both modalities are placed in the same set for triplet sampling (a sketch is given after Eq. (6)). Let the number of images in a batch be \(2\times P\times K\), let batch be the set of all images in the batch, and for an anchor image a let A be its positive sample set and B its negative sample set. The mixed-modality triplet loss is then calculated as in formula (5):

$$ L_{{sp - m{\text{Tri}}}} = \frac{1}{2PK}\mathop \sum \limits_{{a \in {\text{batch}}}} \left[ {\alpha + \mathop {\max }\limits_{p \in A} d_{a,p} - \mathop {\min }\limits_{n \in B} d_{a,n} } \right]_{ + } $$
(5)

Here, α is the margin parameter of the triplet loss, and [.]+ means that a value smaller than 0 is recorded as 0. A and B are both subsets of batch. The loss function on the modality-specific features is given by formula (6):

$$ L_{SP} = L_{sp - CE} + L_{{sp - m{\text{Tri}}}} $$
(6)
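A minimal batch-hard sketch of the mixed-modality triplet loss of Eq. (5): the two modalities are mixed in one batch, and the hardest positive and hardest negative are selected for every anchor (the anchor itself appears among the positives at distance 0 but never attains the maximum when other positives exist).

```python
import torch

def mixed_modality_triplet_loss(feats, labels, margin=0.3):
    """Eq. (5): batch-hard triplet loss with anchors, positives and negatives drawn from both modalities."""
    dist = torch.cdist(feats, feats)                        # pairwise Euclidean distances, (N, N)
    same_id = labels.unsqueeze(0) == labels.unsqueeze(1)    # (N, N) positive mask
    pos_dist = dist.masked_fill(~same_id, float("-inf")).max(dim=1).values   # hardest positive per anchor
    neg_dist = dist.masked_fill(same_id, float("inf")).min(dim=1).values     # hardest negative per anchor
    return torch.clamp(margin + pos_dist - neg_dist, min=0).mean()           # [.]_+ then average over the batch

# Features and labels of one mixed batch (visible and infrared samples together).
loss = mixed_modality_triplet_loss(torch.randn(32, 512), torch.randint(0, 4, (32,)))
```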

For the human body semantic self-supervision module, the goal is to reorder the shuffled block feature vectors during self-supervised training. Specifically, each of the six multi-scale feature blocks is labeled with its position, the labels are predicted during training, and the original relative spatial positions of the 6 block feature vectors are thereby learned. The cross-entropy loss on the predicted block labels serves as the self-supervised loss; with Si,s denoting the s-th block of the i-th sample, the loss of this module is calculated as in formula (7):

$$ L_{{{\text{SSL}}}} = - \mathop \sum \limits_{i = 1}^{2 \times P \times K} \mathop \sum \limits_{s = 1}^{6} p\left( {S_{i,s} } \right)lb\left( {q\left( {S_{i,s} } \right)} \right) $$
(7)

In formula (7), p(Si,s) is the true position label of each feature block, and q(Si,s) is the predicted label of each feature block.

In summary, the total loss function of this refined multi-source feature collaborative network in the end-to-end training process is Eq. (8):

$$ L = L_{SH} + L_{SP} + L_{SSL} $$
(8)

4 Experiment and performance analysis

4.1 Experimental setup

4.1.1 Data set

There are currently two public data sets, SYSU-MM01 [18] and RegDB [22], that can be used to evaluate cross-modal pedestrian re-identification methods. The images in these data sets are collected from visible light cameras and infrared (near-infrared and far-infrared) cameras.

SYSU-MM01 is a large-scale data set collected by six different cameras in outdoor and indoor environments, including four visible light cameras and two near-infrared cameras. It contains training data for 395 pedestrian identities, comprising 22,258 visible light images and 11,909 near-infrared images. The test set contains another 95 pedestrian identities, together with two evaluation modes and two test set construction methods. In both evaluation modes the query set is the same and contains 3,803 images captured by the two infrared cameras. In All-search mode, the gallery set contains all visible light images captured by all four visible light cameras; in Indoor-search mode it contains only the visible light images captured by the two indoor visible light cameras. Generally speaking, All-search is more challenging than Indoor-search. The two test set construction methods are Single-shot and Multi-shot, which randomly select 1 or 10 images of each pedestrian identity, respectively, when constructing the gallery set. A detailed description of the evaluation protocol can be found in the literature [17]. The most difficult setting, namely the All-search evaluation mode with the Single-shot construction method, is used; 10 tests are performed and the average retrieval performance is reported.

RegDB is a small-scale data set collected by a dual-camera system (one visible light camera and one far-infrared camera). In RegDB, the contours in the visible and infrared images are very similar, so cross-modal pedestrian re-identification is less difficult. The data set contains 412 pedestrian identities, each with 10 visible images and 10 infrared images. Following the evaluation protocol [22], 206 identities (2,060 images) are randomly selected for training and the remaining 206 identities (2,060 images) are used for testing. Two retrieval settings are evaluated, namely visible light images retrieving infrared images (Visible to Thermal) and infrared images retrieving visible light images (Thermal to Visible), and the average accuracy over 10 random splits of the training and test sets is reported.

4.1.2 Evaluation index

For fairness, the Cumulative Matching Characteristics (CMC) and the mean Average Precision (mAP) are used as evaluation indicators in this experiment. The Rank-k accuracy in CMC is the probability that a correct cross-modal pedestrian image appears within the first k retrieval results, while mAP reflects the average retrieval performance of the method. A simplified sketch of both metrics is given below.
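A simplified sketch of CMC and mAP computed from a query-by-gallery distance matrix; the per-camera filtering rules of the official SYSU-MM01 and RegDB protocols are omitted here.

```python
import numpy as np

def cmc_and_map(dist, q_ids, g_ids, max_rank=20):
    """Compute the CMC (Rank-k) curve and mAP from a query-by-gallery distance matrix."""
    cmc, aps, valid = np.zeros(max_rank), [], 0
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])                   # gallery indices sorted by increasing distance
        matches = (g_ids[order] == q_ids[i])          # boolean hit list for this query
        if not matches.any():                         # query identity absent from the gallery
            continue
        valid += 1
        first_hit = int(np.argmax(matches))           # rank of the first correct gallery image
        if first_hit < max_rank:
            cmc[first_hit:] += 1                      # Rank-k counts queries hit within the top k
        hits = np.cumsum(matches)
        precision = hits / (np.arange(len(matches)) + 1)
        aps.append((precision * matches).sum() / matches.sum())  # average precision of this query
    return cmc / valid, float(np.mean(aps))           # Rank-k accuracies and mAP
```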

4.1.3 Experimental design details

In the experiments, the PyTorch framework is used to implement the code, and training and testing are conducted on an NVIDIA GeForce 1080Ti GPU. The pedestrian images are resized to 384 × 128. In the training phase, four pedestrian identities are randomly selected, and eight visible light images and eight infrared images are then randomly selected for each identity, so the batch size in each training iteration is 64. To balance the cross-entropy loss and the heterogeneous center loss, the weight of the heterogeneous center loss in formula (3) is set to 0.5 [27]. The margin of the triplet loss is set to 0.3. A stochastic gradient descent (SGD) optimizer with momentum 0.9 is used; a warm-up learning rate strategy is adopted for the first 10 epochs, and the refined multi-source feature collaborative network is trained for 80 epochs in total. The learning rate lr(t) changes with the training epoch t as shown in formula (9) (a small sketch of this schedule follows the formula):

$$ lr\left( t \right) = \left\{ {\begin{array}{*{20}c} {0.1 \times t/10, t \le 10} \\ {0.1, 10 < t \le 20} \\ {0.01, 20 < t \le 50} \\ {0.001, 50 < t \le 80} \\ \end{array} } \right. $$
(9)
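A small sketch of this schedule; the linear warm-up form over the first 10 epochs is an interpretation of formula (9) and may differ from the authors' exact implementation.

```python
def learning_rate(epoch):
    """Step schedule of formula (9): linear warm-up for 10 epochs, then decay at epochs 20 and 50."""
    if epoch <= 10:
        return 0.1 * epoch / 10      # warm-up toward the base rate of 0.1
    if epoch <= 20:
        return 0.1
    if epoch <= 50:
        return 0.01
    return 0.001

# e.g. torch.optim.lr_scheduler.LambdaLR(optimizer, lambda e: learning_rate(e + 1) / 0.1)
# with the optimizer's base learning rate set to 0.1
```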

During training, the network is optimized with both modality-shared and modality-specific features. During test-time inference, only the modality-shared features are used to evaluate the similarity between the query image and the gallery images. The first reason is that, under the influence of the modality-specific features during end-to-end collaborative training, the extracted modality-shared features can already describe the image effectively, as demonstrated in the experiments of this paper. The second reason is that using the shared features alone speeds up feature inference during testing.

4.2 Performance analysis

4.2.1 Comparative analysis with other methods

On the SYSU-MM01 and RegDB data sets, the method in this paper is compared with popular cross-modal pedestrian re-identification methods under the same experimental setup. These methods include Zero-Padding [18], HCML [39], cmGAN [24], HSME [40], D2RL [25], AlignGAN [26], HPILN [41], eBDTR [23], Hi-CMD [35], JSIA [26], MSR [42], AGW [43], XIV [44], HAT [45], SIM [46], EDFL [47], TSLFN + HC [27] and cm-SSFT [36]. The experimental results are shown in Tables 1 and 2.

Table 1 Comparison of the method in this paper with other methods on the SYSU-MM01 data set (%)
Table 2 Comparison of the method in this article and other methods on the RegDB data set

In Table 1, the refined multi-source feature collaborative network has a structure similar to TSLFN + HC, yet the method in this paper leads by 9.28% in Rank1 and 10.45% in mAP. Among all comparison methods, cm-SSFT is the strongest; although cm-SSFT reaches 61.60% Rank1 and 63.20% mAP, the results of this paper are 4.64 and 2.20 percentage points higher, respectively. Moreover, cm-SSFT has a more complex network structure, which brings more parameters and computation.

Table 2 shows that the method in this paper is also competitive on the RegDB data set, and the recognition accuracy is higher than on SYSU-MM01. This is largely because the images in RegDB are collected by a dual-camera system, so the pedestrians in the visible light images are similar to those in the infrared images and the cross-modal intra-class differences are small. In addition, the results in Table 2 show that the visible-to-infrared retrieval mode performs better than the infrared-to-visible mode; this is because a pedestrian image in the infrared modality carries less information and is less discriminative of identity, which is consistent with the viewpoint stated above.

4.2.2 Visual analysis of retrieval results

To visually analyze the re-identification effect of the proposed method, several samples from the SYSU-MM01 data set were selected and their retrieval results are visualized in Fig. 5. The first three rows show the results of retrieving visible light images with infrared queries, and the last three rows show the results of retrieving infrared images with visible light queries. The first column contains the query pedestrian images; the remaining columns are the top 10 retrieved pedestrian images, sorted from left to right in descending order of the similarity computed by the model. Green boxes mark correctly retrieved samples, and red boxes mark incorrectly retrieved samples.

Fig. 5

Retrieval results of the proposed method on the SYSU-MM01 data set

As shown in Fig. 5, although the colors of the upper-body and lower-body clothes differ clearly in the visible light images, there is no obvious difference in the infrared modality. To pair such samples correctly, the model must pay more attention to movement, body shape and detailed texture characteristics of the pedestrian. The retrieval results show that the method in this paper can effectively extract refined pedestrian features.

Figure 5 also shows that features common to both modalities, such as bags and clothing logos, remain key matching cues during recognition, and these modality-common features may help in judging the correct result. It is therefore important to adopt a multi-source feature collaborative learning method to promote information interaction between modalities and extract more discriminative features.

In addition, when color cannot be used as identification information, the deep network learns characteristics such as the pedestrian's body shape and posture and uses them as an important basis for distinguishing pedestrians. As shown in the sixth row of Fig. 5, although the first, second and fourth columns contain pedestrians with different identities, they are misjudged as the same person because they all cross their legs. It can be seen that extracting reliable discriminative features is still an important challenge.

4.3 Ablation experiment

To verify the effectiveness of the various modules of the refined multi-source feature collaborative network, an ablation experiment was performed. On the SYSU-MM01 data set, TSLFN + HC [27] is used as the baseline model, and the modules proposed in this paper are added to the network one by one, which clearly quantifies the improvement contributed by each module.

The results show that every module proposed in this paper is helpful for the cross-modal pedestrian re-identification task. The module design of each experiment in Table 3 is analyzed below.

Table 3 Ablation experiment results of the proposed modules on the SYSU-MM01 data set (%)

4.3.1 Multi-scale feature collaboration

As in Experiment 2 in Table 3, local features at different scales are obtained through several horizontal partitioning strategies and concatenated with the overall feature to form the multi-scale feature module of this paper. To determine the optimal scales for the multi-scale feature collaboration module, several combinations of horizontal partitioning levels are compared, with TSLFN + HC as the baseline (Baseline); that is, only the six-part feature division of TSLFN + HC is changed while the rest of the network structure and experimental settings remain unchanged. The combinations are: Scale1 (global feature + 2 horizontal equal blocks), Scale2 (global feature + 2 horizontal equal blocks + 3 horizontal equal blocks) and Scale3 (global feature + 2 horizontal equal blocks + 3 horizontal equal blocks + 4 horizontal equal blocks). As shown in Fig. 6, Scale2 performs best. Moreover, according to common knowledge of human body structure, dividing the body into two or three parts yields units that can be understood as independent semantic regions, so Scale2 is suitable for pedestrian representation.

Fig. 6

Performance comparison of different multi-scale partitioning combinations

4.3.2 Multi-level feature collaboration

As in Experiment 3 in Table 3, after the best multi-scale features are selected, experiments are carried out to find the best multi-level features: features are extracted at different levels and different combinations are analyzed. Level2 and Level3 denote the feature maps extracted from Stage2 and Stage3 of the ResNet50 backbone, respectively. The results are shown in Table 4. The best shallow feature is Level3. The features extracted at Level2 degrade performance in any combination: for example, Multi-Scale + Level2 performs worse than Multi-Scale, and Multi-Scale + Level2 + Level3 performs worse than Multi-Scale + Level3. It can be seen that the information extracted at Level2 is too low-level and does not contribute significantly to semantic classification.

Table 4 Performance analysis of various hierarchical combination methods (%)

4.3.3 Modal sharing and unique feature synergy

As shown in Experiment 4 in Table 3, after multi-scale and multi-level refined feature collaborative learning is in place, a modality-shared and modality-specific feature collaborative learning module is added. The mixed-modality triplet loss is an important part of this module; it samples triplets from the two modal images mixed within a training batch, so that inter-modal information exchange can be better realized during metric learning. To verify the effect of the mixed-modality triplet loss and its advantage over the single-modality triplet loss, the two are compared while the rest of the network design is kept unchanged. The results are shown in Fig. 7.

Fig. 7

Performance comparison of triplet loss function under different conditions

4.3.4 Human body semantic self-supervision

As in Experiment 5 in Table 3, the human body semantic self-supervision module improves the performance of the cross-modal pedestrian re-identification task. Logically, the input of this module comes from both modalities, so it can help overcome modal differences; however, the module may also act as a form of local feature learning. It is therefore necessary to determine whether the performance improvement brought by this module stems from overcoming modal differences or from local feature learning.

With the other parts of the network unchanged, three settings, namely no self-supervision, single-modality self-supervision and cross-modal self-supervision, are compared, as shown in Table 5. It can be seen that cross-modal human body semantic self-supervision not only performs local feature learning but also helps to overcome modal differences.

Table 5 Performance analysis of human body semantic self-supervised module (%)

5 Conclusions and outlook

Visible-thermal person re-identification (VT-ReID) is a challenging cross-modality pedestrian retrieval problem because of the large intra-class variations and the modality discrepancy across different cameras. At the classifier level, modality-specific and modality-sharable identity classifiers for the two modalities have been introduced to handle the modality discrepancy [48]; to exploit the complementary information among the different classifiers, that work employs an ensemble learning scheme that incorporates the modality-sharable classifier and the modality-specific classifiers, together with a collaborative learning strategy that regularizes the modality-specific identity predictions and the ensemble outputs.

In this paper, enhancing feature discrimination and improving the utilization of multi-source heterogeneous information are considered comprehensively. Guided by collaborative learning, a refined multi-source feature collaborative network is proposed: multi-scale and multi-level features realize refined feature collaborative learning, while modality-shared and modality-specific feature collaboration together with human body semantic self-supervision achieve multi-source feature collaborative learning. The proposed method is clearly superior to other methods on the two related data sets and provides a simple and effective idea for further research in this field.

The main contributions of this article are as follows:

  1. (1)

    To enhance feature discrimination, a collaborative learning method for refined features is proposed: when designing the convolutional neural network for feature extraction, multi-scale and multi-level pedestrian features are considered comprehensively. Experiments show that collaborative learning of refined features is a simple and effective way to enhance feature discrimination.

  2. (2)

    To improve the utilization of multi-source heterogeneous information, a multi-source feature collaborative learning method is proposed in this paper. First, in view of the heterogeneous information complementarity of visible light and infrared images, a dual-stream network extracts the modality-shared and modality-specific features of cross-modal images for collaborative learning. Secondly, the prior knowledge of the relative positions of human body parts is used as an auxiliary task, and a human body semantic self-supervision method is proposed. Finally, multi-source feature collaborative learning is achieved under the joint supervision of multiple targeted loss functions.

  3. (3)

    Extensive experiments were conducted on the relevant cross-modal pedestrian re-identification data sets. They verify that the refined multi-source feature collaborative network in this paper outperforms the current best related work and is more reliable and advanced.