1 Introduction

Person re-identification (re-ID) aims to match the same person across different scenes and has received a great deal of attention in the past decade due to its research and practical value. As one of its application scenarios, domain generalizable re-ID (DG re-ID) aims to learn a model from multiple labeled source domains and generalize its inference to an unseen domain; it is more practical because it makes fewer assumptions and emphasizes that the target domain is inaccessible during training. Generalizing re-ID algorithms to unseen domains is an important research topic for intelligent video surveillance, and interest in this more challenging setting has motivated a growing body of work [1,2,3,4,5]. This paper is devoted to the DG re-ID task.

Although a large number of domain generalization methods have been proposed, covering data augmentation [6,7,8,9], representation learning [10,11,12,13,14], and learning strategies [15,16,17], many of them address homogeneous tasks in which the source and target domains share the same label space. Such methods are difficult to adapt directly to re-ID, and there are relatively few studies on DG re-ID, because in this open-set task the source and target domains always contain different classes. The approach proposed in this paper is built around the characteristics of the re-identification task, with emphasis on the class discriminability and the generalization ability of pedestrian features. Recent DG methods typically train across multiple source domains to learn a representation or model that generalizes to new unseen target domains. For deep neural networks with strong descriptive power, training a single model on fixed batches drawn from different source domains exposes it to domain gaps that can harm generalization and cause overfitting and catastrophic forgetting. The multi-source DG re-ID problem therefore naturally splits into two aspects: how to conduct representation learning under domain gaps, and how to fully exploit the gaps among source domains to improve generalization to an unseen target.

To enhance discrimination capability, some methods [18, 19] employ adversarial learning or disentanglement models to learn domain-invariant features or to achieve feature disentanglement, but the introduction of bias and the loss of some domain-specific characteristics are inevitable. Other attempts exploit adaptive normalization techniques or architectures. For example, Jin et al. [4] proposed a style normalization and restitution (SNR) module to reduce the domain gap caused by variations in appearance style. In [14], combinations of different normalization techniques are presented to show that adaptively learning the normalization scheme can improve DG. However, these attempts mainly aim at improving the model's adaptation to the unseen domain and do not deeply analyze how source-domain shifts affect feature learning during training, which limits their applicability to multi-source training. Inspired by recent work [20] on memory banks and continual learning, this paper argues that the way humans acquire knowledge can be exploited for domain generalization: remembering the shared information of each class and integrating new knowledge into representations without forgetting previously learned knowledge helps to maintain and improve representation learning under multi-domain shifts. The solution in this paper is therefore to perform fine-grained class matching by incorporating the notion of continual learning into the classifier design.

In terms of exploiting the gaps among source domains for better generalization, one of the most popular approaches is meta-learning [3, 15, 16]. These models mimic realistic training–testing domain shifts to improve generalization, but a significant increase in model complexity and training difficulty is inevitable. The most intuitive way to learn a robust model is data augmentation. For example, several methods [7, 9] explore Mixup [8] and AdaIN [21] in the image or feature space to achieve fast stylization and expose the model to abundant domain styles, which yields promising performance on popular benchmarks. Unlike direct mixing at the instance level, this paper adopts a probabilistic combination that mixes the bottom-level domain feature statistics to train a more robust model. Overall, we aim to remain conceptually and computationally simple while taking both intra-domain discriminability and inter-domain variability into account.

In this work, a novel multi-source DG re-ID model is proposed, which consists of two strategies: knowledge accumulation and distribution enhancement. On the one hand, to enhance discrimination capability, a knowledge-accumulation weight-updating strategy based on the notion of continual learning is applied to classifier learning. The model accumulates prior knowledge from the seen source domains in a memory bank and then uses this knowledge to update the weights of the feature classifiers, which directly addresses the unstable parameter updates caused by small-batch data and source-domain shifts.

On the other hand, this paper makes full use of the domain gaps by developing a multi-mix batch normalization (MMBN) module, where the exponential moving average (EMA) technique is used to update the domain distribution statistics, and a probabilistic mixture between batch features and the statistics of other domains generates mixture embeddings. This improves the robustness of the model under cross-domain shifts, thereby facilitating the learning of the whole framework. To further optimize the model, a global contrastive loss (GCL) is designed to optimize the distance from the mixture embeddings of the original samples to their prototypes, and a hybrid triplet loss (HTRI) is employed for metric learning within small batches at different layers. Extensive experiments on multiple widely used benchmarks achieve satisfactory performance, demonstrating that knowledge accumulation and distribution enhancement effectively improve generalization to unseen domains. The contributions of this paper are summarized as follows:

(1) An effective framework for multi-source domain generalizable person re-ID is presented, where intra-domain discriminability and inter-domain variability are both considered to enhance generalization capability on the unseen domain.

(2) A novel knowledge accumulation feature classifier (KAFC) is designed, which alleviates the degrading impact of catastrophic forgetting and distributional shifts by adaptively updating previous knowledge and the old parameters.

(3) A data augmentation module, MMBN, is introduced to generate mixture features with cross-domain distribution information, while the GCL loss and HTRI loss are designed to further promote the representation learning and classifier learning of the model.

2 Related Work

2.1 Domain Generalization

The goal of domain generalization (DG) is to acquire knowledge from several related source domains and apply it to unseen target domains. The DG literature is highly diverse and can be primarily categorized into three aspects: (1) data augmentation, (2) representation learning, and (3) learning strategy. Data augmentation is a form of domain randomization that applies refinements or transformations to increase domain diversity, thereby assisting in learning general representations. For example, Tobin et al. [22] first used this approach to generate more training data from a simulated environment for generalization in the real environment. Representation learning methods [10, 19, 23] have also proven very effective; they aim to learn domain-invariant features or to explicitly align features between domains. From the optimization perspective, many methods [5, 15] exploit different training strategies to promote generalization capability, such as ensemble learning and meta-learning. In recent years, more multi-source DG methods have been studied for practical applications. However, most of them assume overlapping label spaces among multiple domains, making them unsuitable for direct application to open-set person re-ID, where the target domain has identities and classes that differ from those of the source domains.

2.2 Generalizable Person Re-identification

Person re-ID has made great progress in recent years, and plenty of methods have been introduced to address person re-ID under different practical assumptions, including fully supervised methods, unsupervised methods, and domain generalization re-ID. The ability to generalize is crucial for person re-ID models deployed on unseen datasets in practical applications. To address this problem, several methods have been explored, and their common goal is to learn more generalizable representations or models under domain shifts. The first category concerns representation learning: DG re-ID methods based on representation learning encourage the model to disentangle person representations and learn robust domain-invariant features. For example, SNR [4] introduced IN layers into the network architecture and disentangles the task-relevant features from the residual. To further improve interpretability, QAConv [24] considered point-to-point image matching in deep feature maps and suggested that explicit matching generalizes to unknown domains more easily than feature learning. Moreover, some studies [9, 23] on multi-source DG re-ID share the same network among multiple source domains to align distributions, or give each domain its own normalization layer to remove domain-specific styles. However, the benefits of these remain limited when there are significant domain shifts, small-batch training data, and imbalanced label distributions within and across domains. The second category is meta-learning-based methods [3, 16]; although these methods have proved very helpful for DG tasks, their complex training procedures make model optimization difficult. Another promising solution for DG re-ID is data augmentation; recent papers [9, 25] pointed out the crucial role of diverse features learned from synthetic data in preventing overfitting to the source domains in the re-ID task.

In contrast to the previous solutions, we approach the domain generalization issue from the perspective of prototype-wise matching and distribution enhancement. Both the common class knowledge within each domain and the discrepancy between the multi-source domains are considered in this paper.

2.3 Continual Learning

Transfer learning [26, 27] applies existing knowledge to learn new knowledge, typically between two different models. In contrast, continual learning is an active machine learning task in which a single model continuously learns new knowledge to adjust itself while preserving most of the previously learned knowledge, much as humans learn.

Existing works can be divided into three categories: knowledge distillation, parameter regularization, and memory replay. Among them, some works rectify or post-process the output classification layer to mitigate the bias caused by catastrophic forgetting and distributional shifts [28,29,30,31,32]. Hou et al. [29] designed a unified classifier and adopted intra-class knowledge distillation to address the class imbalance between base and new classes. Zhang et al. [30] proposed a continually evolved classifier for few-shot incremental learning, in which an adaptation module updates the classifier weights at the global level. Another line of continual learning considers memory selection and generation for previous samples or gradients [20, 31], which is highly effective when storing training data is possible. The work in [31] devised an exponential moving average framework to alleviate the degrading impact of forgetting and distributional shifts by adapting to a history of old parameters. MemVir [20] memorizes both embedding features and class weights to use them as additional virtual classes, which not only provides augmented information for training but also alleviates an excessive focus on seen classes for better generalization. There have also been efforts focused on the domain differences that the spatial and temporal dispersion of data brings to the task [33,34,35]. Yang et al. [35] first adopt a one-vs-all detector to discover persons who have appeared under previous cameras, which requires re-ID models to continuously learn informative representations without forgetting previously learned ones.

These methods inspire us to fully leverage past knowledge to obtain better representations for DG re-ID. However, the training characteristics of multi-source DG re-ID must be examined further, such as the overlap of categories within domains and the shifts and imbalances between domains. In this paper, the classifier is designed based on the notion of continual learning: class prototypes are memorized and updated continuously, and are meanwhile used to control the optimization direction of the weights.

3 Methodology

In this section, we introduce the framework for multi-source domain generalization person re-ID. We first give an overview of the framework in Section 3.1. Then the proposed knowledge accumulation feature classifier is presented in Section 3.2. To better conduct representation learning, a data augmentation module, MMBN, is adopted, which is introduced in Section 3.3.

3.1 Overview

In multi-source domain generalization person re-ID, we are given K source domains Ɗs = {d1,d2,…,dK}, where {N1,N2,…,NK} denote the numbers of classes of {d1,d2,…,dK}, respectively, and N is the total number of classes. The objective of this paper is to generalize well to an unknown target domain by transferring the classification knowledge learned from the fully supervised source domains. Note that the target domain is unlabeled and cannot be accessed in advance.

This paper proposes a novel multi-source DG framework that focuses on designing a feature classifier with knowledge accumulation and on mixing multi-domain information to improve generalization. As shown in Fig. 1, a backbone, which can be chosen from various popular networks, extracts features from images. Secondly, a memory bank, as commonly employed in recent methods, stores the global class information of the source domains according to the labels. The feature vectors stored in the memory bank are used to update the weights of the feature classifiers in each epoch. Moreover, the BN layer after the pooling layer is replaced with an MMBN module, which is expected to mix information among multiple domains to further enhance feature robustness and reduce the inter-domain gaps.

Fig. 1

The overall architecture of the proposed framework. It contains a parameters-shared backbone network to extract domain-invariant features, memory-based feature classifiers that update the weights under the global guidance of label information, and a multi-mix BN module for mixing the information across the domains 

3.2 Knowledge Accumulated Feature Classifier (KAFC)

To extract discriminative features, we assign an FC layer to each domain as its classifier for training, where the linear classifier learns a classification pattern (which can also be regarded as a prototype for each person) for each class through the weights W. In our baseline, the cross-entropy loss is used to optimize the network and learn the class weights W. For a domain with Ni classes, given an image with ground-truth ID label yi, the prediction logits can be defined as follows:

$$p({y}_{i}|{f}_{i})=\frac{{\text{exp}}({W}_{{y}_{i}}^{T}{f}_{i}+{b}_{{y}_{i}})}{{\sum }_{j=1}^{{N}_{i}}{\text{exp}}({W}_{j}^{T}{f}_{i}+{b}_{j})}$$
(1)

where fi is the D-dimensional feature of the i-th image in a mini-batch, the weight matrix W ∈ ℝNi×D, and the bias b can be set to 0 for convenience.
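
Below is a minimal sketch of this per-domain classifier setup (assumed PyTorch; module and variable names are our own illustration, not the paper's code): each source domain receives its own bias-free FC head, and Eq. 1 is the softmax over its logits.

```python
import torch.nn as nn

class PerDomainClassifiers(nn.Module):
    def __init__(self, feat_dim, classes_per_domain):
        super().__init__()
        # one bias-free FC head (weight matrix W of shape N_i x D) per source domain
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, n_i, bias=False) for n_i in classes_per_domain]
        )

    def forward(self, features, domain_idx):
        # features: (B, D) mini-batch drawn from a single source domain
        return self.heads[domain_idx](features)  # logits W^T f; softmax / CE applied outside

# usage (illustrative): loss = nn.CrossEntropyLoss()(clf(feats, k), labels_in_domain_k)
```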

Obviously, the learned weights are critical to the image prediction logits. However, due to the limited batch size and the shifts between domains, it is difficult to keep the parameter updates stable, which not only leads to the forgetting of prior knowledge but also hinders the learning of new knowledge. Inspired by the idea of continual learning, we expect classifier learning to combine new knowledge with previous information at the global level. To achieve this goal, we equip the classifiers with a knowledge accumulation strategy in which class prototypes are memorized and continually updated. The main idea is to use label information and past knowledge as constraints that guide the domain classifiers, which keeps features and prior knowledge consistent and reduces the influence of unstable parameters. The knowledge accumulation strategy consists of three training stages: the knowledge initialization stage, the knowledge organization stage, and the classifier learning and updating stage, as shown in Fig. 2.

Fig. 2

Overview of KAFC, where D, N and B denote the feature dimension, the number of classes, and the batch size, respectively. During the knowledge initialization stage, we pre-train the model for several epochs to obtain better initial representations for the class prototypes. In each iteration of formal training, features are extracted from the network to update the memory bank; we then match them with the class weights and obtain the robust generalized cross-entropy loss. At the beginning of each training epoch, class prototypes in the memory bank are selected to update the current class weights

Knowledge Initialization. It is commonly known that in the early training process, features are not adequately expressed and the class prototypes are captured inaccurately. To reduce the bias introduced by ImageNet [36] pre-training and to separate the domains, we pre-train the model for several epochs to obtain the initial class prototypes, where the class weights are initialized with the Kaiming normal initialization algorithm [37], and the ADAM optimizer is employed for gradient descent. The baseline can then be reused to extract features in the next stage.

Knowledge Organization. Based on the pre-trained baseline model, we continue training with the proposed method. A multi-domain memory bank M = {M1,M2,…,MK} is utilized to obtain global information, in which the N class prototype features FC ∈ ℝN×D of the K source domains are stored, where N is the total number of classes over all training domains. In each training iteration, the k-th centroid M[k] is dynamically updated with the encoded features as follows:

$$M[k]\leftarrow mM[k]+(1-m)\frac{1}{|{\mathcal{B}}_{k}|}\sum_{{x}_{i}\in {\mathcal{B}}_{k}}f({x}_{i})$$
(2)

where f(xi) is the feature vector of sample xi, Ɓk denotes the samples belonging to source-domain class k in the mini-batch, and m ∈ [0,1] is the update ratio of the memory.
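
A minimal sketch of this memory-bank update (assumed PyTorch; function and tensor names are hypothetical): each class centroid is moved toward the mean feature of that class in the current mini-batch with ratio m, as in Eq. 2.

```python
import torch

@torch.no_grad()
def update_memory(M, feats, labels, m=0.2):
    """M: (N, D) class prototypes; feats: (B, D) encoded batch; labels: (B,) global class ids."""
    for k in labels.unique():
        batch_mean = feats[labels == k].mean(dim=0)  # mean feature of class k in the mini-batch
        M[k] = m * M[k] + (1.0 - m) * batch_mean     # EMA update with memory ratio m, Eq. 2
    return M
```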

Classifier Learning and Updating. Having described how knowledge from the previous learning process is organized and extracted, we now explain how to leverage such knowledge to benefit training. At the beginning of each training epoch, the current classifier weights Wt are updated based on the memorized prototypes in the memory bank and the previous weights Wt−1, which can be written as:

$$W^{t}=\lambda W^{t-1}+(1-\lambda )M[k]$$
(3)

where λ ∈ [0,1] is the update ratio of the weights. The classifier weights are then updated by stochastic gradient descent during the training iterations.
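
A sketch of this epoch-level weight refresh, under the same assumptions as above (PyTorch; names are hypothetical):

```python
import torch

@torch.no_grad()
def refresh_classifier(classifier, M, class_ids, lam):
    """classifier: bias-free nn.Linear for one domain; M: (N, D) memory bank;
    class_ids: rows of M for this domain's classes; lam: update ratio λ in [0, 1]."""
    W_prev = classifier.weight.data                                   # W^{t-1}, shape (N_i, D)
    classifier.weight.data = lam * W_prev + (1.0 - lam) * M[class_ids]  # Eq. 3
    # within the epoch, the weights then continue to be optimized by SGD as usual
```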

Naturally, the updated classifiers are used to make predictions, yielding a robust generalized cross-entropy loss that integrates the robustness of the memory module with the training efficiency of the traditional cross-entropy loss; it can be computed as:

$$L_{id}=-\sum_{i=1}^{B}\log\frac{\exp\left(({W}_{{y}_{i}}^{t})^{T}{f}_{i}\right)}{\sum_{j=1}^{{N}_{i}}\exp\left(({W}_{j}^{t})^{T}{f}_{i}\right)}$$
(4)

3.3 Multi-Mix Batch Normalization (MMBN)

Learning a domain-invariant model becomes more difficult as the source domains become more diverse, because each domain contains a great deal of domain-specific style information. BN is a widely used training technique in domain generalization re-ID works. However, if the domain gap is significant, sharing the parameters of a single common BN layer across multiple domains may not be conducive to generalization and robustness. Thus, we introduce a multi-mix BN module, in which domain-specific BN layers are integrated and are expected to enhance the diversity and robustness of the output by mixing domain information. In MMBN, a common BN (CBN) layer is employed to perform normalization and store the multi-domain statistics for the test stage. Meanwhile, domain-specific BN (DSBN) layers are used to obtain the individual domain statistics and share the affine parameters with the CBN layer. The operation of MMBN is illustrated in Fig. 3.

Fig. 3

Illustration of the MMBN module, where the number of domains K is set to 3 for convenience. We only present the feature augmentation for one domain; the other domains are processed in the same way. MMBN consists of two branches: the CBN layer shared by all domains and the DSBN layers for the specific source domains. In the training stage, the μ and σ in the CBN and the DSBN are updated by an exponential moving average (EMA). The statistics of CBN are used for the testing stage, while the domain-specific statistics are used by the mixup technique to mix with the current domain features

Specifically, the features before the MMBN are denoted as fg, and the CBN can be expressed as:

$${f}_{i}=CBN({f}_{g})=\gamma \frac{{f}_{g}-\mu }{\sqrt{{\sigma }^{2}+\varepsilon }}+\beta$$
(5)

where fg ∈ ℝB×D, γ and β are the affine parameters, and ε is a small constant to avoid division by zero. The mean μ and variance σ² within a mini-batch can be computed as follows:

$$\begin{array}{cc}\mu =\frac{1}{B}\sum\limits_{\text{b=1}}^{B}{f}_{g}[b,:],& {\sigma }^{2}=\frac{1}{B}\sum\limits_{\text{b=1}}^{B}({f}_{g}[b,:]-\mu {)}^{2}\end{array}$$
(6)

During training, CBN estimates the mean and variance of the activations across multiple domains by an exponential moving average operation, and these estimates are used for the testing stage. The EMA operation can be written as:

$$\overline{\mu }=(1-\alpha )\overline{\mu }+\alpha \mu$$
(7)
$${\tilde{\sigma }}^{2}=(1-\alpha ){\tilde{\sigma }}^{2}+\alpha {\sigma }^{2}$$
(8)

where α is the exponential averaging factor; the higher the factor, the more weight is given to the current batch statistics.
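
As a sketch (assumed PyTorch; names are hypothetical), the EMA update of Eqs. 7 and 8 corresponds to the usual running-statistics update of a BN layer with momentum α:

```python
import torch

@torch.no_grad()
def ema_stats(run_mean, run_var, batch_mean, batch_var, alpha):
    run_mean.mul_(1.0 - alpha).add_(alpha * batch_mean)  # Eq. 7
    run_var.mul_(1.0 - alpha).add_(alpha * batch_var)    # Eq. 8
    return run_mean, run_var
```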

Inspired by [7, 8], this paper aims to mix domain information by using domain-specific statistics. However, directly fusing batch features for each domain is prone to introducing noise. Therefore, in order to obtain a global representation of each domain, Eqs. 7 and 8 are employed to estimate the domain statistics more stably. For domain di, the domain-specific statistics are denoted as \({\overline{\mu }}_{i}\) and \({\tilde{\sigma }}_{i}\). We assume that the domain-specific statistics follow a Gaussian distribution, and thus obtain K Gaussian distributions as the domain agents. A reparameterization trick is employed to randomly sample B features from each Gaussian distribution N and obtain the domain-specific features {\(f_{\mathit 1}^{ds}\),…,\(f_{\mathit K}^{ds}\)}; the process is defined as follows:

$${f}_{i}^{ds}\sim N({\overline{\mu }}_{i},{\tilde{\sigma }}_{i}^{2})$$
(9)

To maintain consistency within domains and alleviate the impact of variability among domains, we mix the information of the other domains with the current domain. The mixing process can be formulated as:

$$\begin{array}{cc}{g}_{j}^{mix}=\frac{1}{2}\times \theta {f}_{j}^{ds}+\left(1-\frac{1}{2}\times \theta \right){f}_{g},& 1\le j\le K-1,\ j\ne i\end{array}$$
(10)
$${f}_{j}^{mix}=BN({g}_{j}^{mix})={\gamma }_{j}\frac{{g}_{j}^{mix}-{\mu }_{j}^{mix}}{\sqrt{({\sigma }_{j}^{mix})^{2}+\varepsilon }}+{\beta }_{j}$$
(11)

where \(g_j^{mix}\) and \(f_j^{mix}\) reflect the characteristics after distribution fusion, \(\mu_j^{mix}\) and \((\sigma_j^{mix})^{2}\) are the mean and variance of \(g_j^{mix}\), and θ ~ Beta(1,1) is the mixing ratio.
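
The following sketch (assumed PyTorch; module and tensor names are hypothetical) puts Eqs. 9–11 together: domain agents are sampled via the reparameterization trick, mixed into the current-domain features with θ ~ Beta(1,1), and re-normalized by a per-branch BN layer.

```python
import torch

def mmbn_mix(f_g, domain_stats, cur_domain, branch_bns):
    """f_g: (B, D) pre-MMBN features of the current domain;
    domain_stats: per-domain (mu_bar, sigma2_tilde) EMA estimates from the DSBN layers;
    branch_bns: per-domain nn.BatchNorm1d layers used as the BN in Eq. 11."""
    B, D = f_g.shape
    mixed = []
    for j, (mu_j, var_j) in enumerate(domain_stats):
        if j == cur_domain:
            continue
        # Eq. 9: reparameterization trick, draw B samples from N(mu_j, var_j)
        f_ds = mu_j + var_j.sqrt() * torch.randn(B, D, device=f_g.device)
        # Eq. 10: probabilistic mixing with ratio theta ~ Beta(1, 1)
        theta = torch.distributions.Beta(1.0, 1.0).sample().to(f_g.device)
        g_mix = 0.5 * theta * f_ds + (1.0 - 0.5 * theta) * f_g
        # Eq. 11: re-normalize the mixed embedding with its own BN branch
        mixed.append(branch_bns[j](g_mix))
    return mixed  # K-1 tensors of shape (B, D)
```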

Loss functions for the outputs of MMBN are designed to further guarantee the semantic meaningfulness and robustness of the features. Specifically, a global contrastive loss based on inner-product similarity is employed to reflect the relative difference in direction, while a hybrid triplet loss based on Euclidean distance is adopted to reflect the absolute difference in magnitude.

Considering only the optimization of local samples may be detrimental to the generalizability of the model; thus we design a prototype-based contrastive loss at the global optimization level, named the global contrastive loss (GCL). The main idea of GCL is to gradually optimize the feature representation by contrastive learning between the mixed features and the class prototypes in the multi-domain feature space.

Specifically, for the feature set fmix = {\(f_{\mathit 1}^{mix}\),…,\(f_{K-1}^{mix}\)}, we compute the feature similarity matrix with the N class prototypes of all domains, where the prototypes are the current classifier weights. Then, one positive sample and Kneg negative samples are selected to calculate the contrastive loss. The GCL loss of domain di is defined as:

$${L}_{gcl}=-\frac{1}{K-1}\sum_{j=1,j\ne i}^{K}{\text{log}}\frac{{\text{exp}}(<{f}_{j}^{mix}\cdot {w}^{+}>/\tau )}{{\text{exp}}(<{f}_{j}^{mix}\cdot {w}^{+}>/\tau )+{\sum }_{z=1}^{{K}_{neg}}{\text{exp}}(<{f}_{j}^{mix}\cdot {w}_{z}^{-}>/\tau )}$$
(12)

where < • > denotes the inner product and \(\tau\) is the temperature parameter.
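
A minimal sketch of this loss (assumed PyTorch; names are hypothetical): the positive prototype is placed at index 0 so that a standard cross-entropy over the concatenated logits reproduces Eq. 12.

```python
import torch
import torch.nn.functional as F

def gcl_loss(f_mix_list, w_pos, w_neg, tau=0.05):
    """f_mix_list: K-1 mixed feature tensors of shape (B, D);
    w_pos: (B, D) positive prototype per sample; w_neg: (B, K_neg, D) negative prototypes."""
    losses = []
    for f_mix in f_mix_list:
        pos = (f_mix * w_pos).sum(dim=1, keepdim=True) / tau     # <f, w+>/τ, shape (B, 1)
        neg = torch.einsum('bd,bkd->bk', f_mix, w_neg) / tau     # <f, w_z^->/τ, shape (B, K_neg)
        logits = torch.cat([pos, neg], dim=1)                    # positive prototype at index 0
        target = torch.zeros(f_mix.size(0), dtype=torch.long, device=f_mix.device)
        losses.append(F.cross_entropy(logits, target))
    return torch.stack(losses).mean()                            # average over the K-1 mixtures
```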

Deep metric learning between samples in small batches is more appropriate for capturing discriminative features. Because it decreases the intra-class distance and increases the inter-class distance, the triplet loss is well suited to training person re-ID networks. To make the features more robust to inter-domain distribution differences, a hybrid triplet loss (HTRI) is adopted in the model. The hybrid triplet loss extends the optimization scope to include both the original optimization objective fg and the mixed features fmix = {\(f_{\mathit 1}^{mix}\),…,\(f_{K-1}^{mix}\)}:

$${L}_{htri}=\frac{1}{(K-1)\times B}\sum_{k=1}^{K-1}\sum\limits_{a\in {f}_{k}^{mix}}{\left[{d}_{a,p}-{d}_{a,n}+\delta \right]}_{+}+\frac{1}{B}\sum\limits_{a^{\prime}\in {f}_{g}}{\left[{d}_{a^{\prime},p^{\prime}}-{d}_{a^{\prime},n^{\prime}}+\delta \right]}_{+}$$
(13)

where da,p and da,n are the feature distances of the positive pair and the negative pair, δ is the margin of the hybrid triplet loss, and [z]+ equals max(z,0). This guides the network to pay more attention to the intra-domain feature variability after distribution fusion, while encouraging it to extract more discriminative features.
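
A minimal sketch of the hybrid triplet loss of Eq. 13 (assumed PyTorch; names are hypothetical, and batch-hard mining is our own illustrative choice since the mining scheme is not specified here):

```python
import torch

def hard_triplet(feats, labels, margin):
    dist = torch.cdist(feats, feats)                                # pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    d_ap = dist.masked_fill(~same, 0.0).max(dim=1).values           # hardest positive per anchor
    d_an = dist.masked_fill(same, float('inf')).min(dim=1).values   # hardest negative per anchor
    return torch.clamp(d_ap - d_an + margin, min=0.0).mean()

def htri_loss(f_g, f_mix_list, labels, delta=1.0):
    # mixed-feature term averaged over the K-1 mixtures, plus the original-feature term
    mix_term = torch.stack([hard_triplet(f, labels, delta) for f in f_mix_list]).mean()
    return mix_term + hard_triplet(f_g, labels, delta)
```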

In total, the model is trained with three losses:

$${L}_{final}={L}_{id}+{L}_{gcl}+{L}_{htri}$$
(14)

4 Experiments

4.1 Implementation Details

We follow the general pipeline of multi-source DG re-ID methods to build the baseline and incorporate the proposed method on top of it. These are described separately below.

Baseline (Base). In the baseline setting, each source domain is treated equally and has its own classification layer, and the feature network is trained jointly. Specifically, a batch of images from each domain is sent to the model sequentially, and the identification loss and triplet loss are used to optimize the model, where the label-smoothing scheme is employed for the cross-entropy loss.

This paper (Base + KAFC + MMBN). We validate the effectiveness of the proposed method on top of the baseline (Base). A memory bank is built from the centroids (each centroid is the averaged feature of one person) of the IDs of all domains. The centroids are utilized to update the weights of the classification layers in KAFC and to compute the loss function. To enhance the robustness of the mixed features of MMBN, a global contrastive loss and a hybrid triplet loss are added to enlarge the within-identity similarity and encourage the network to learn domain-invariant representations.

Implementation details

We implement the method with two common backbones, i.e., ResNet-50 [38] and IBN-Net50 [39].

For training, each mini-batch contains 32 images (8 identities with 4 images each). Images are resized to 256 × 128, and random flipping and random cropping are used for data augmentation. For the memory, the momentum coefficient m is set to 0.2 and the temperature factor τ is set to 0.05. The ID loss uses label smoothing with the smoothing parameter set to 0.1. The margins δ of the triplet loss and the hybrid triplet loss are 0.3 and 1, respectively. To optimize the model, we use the Adam optimizer with a weight decay of 0.0005. The learning rate is initialized to 3.5 × 10−5 and increases linearly to 3.5 × 10−4 over the first 10 epochs. Then, the learning rate is decayed by 0.1 at the 30th, 40th and 50th epochs. The total training stage takes 70 epochs.
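
As a sketch, the optimizer and learning-rate schedule described above could be set up as follows (assumed PyTorch; the model variable is a stand-in for the actual network):

```python
import torch

model = torch.nn.Linear(2048, 751)  # placeholder for the actual backbone and classification heads
optimizer = torch.optim.Adam(model.parameters(), lr=3.5e-4, weight_decay=5e-4)

def lr_factor(epoch):
    if epoch < 10:                    # linear warmup: 3.5e-5 -> 3.5e-4 over the first 10 epochs
        return 0.1 + 0.9 * epoch / 10
    return 0.1 ** sum(epoch >= e for e in (30, 40, 50))  # decay by 0.1 at epochs 30, 40, 50

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

for epoch in range(70):               # 70 training epochs in total
    # ... one training epoch over all source domains ...
    scheduler.step()
```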

Datasets and evaluation metrics

We demonstrate the effectiveness of the proposed method on four large-scale benchmark datasets: Market1501 [40], DukeMTMC-reID [41], CUHK03 [42] and MSMT17 [43]. Specifically, three datasets are used as source domains for training and the remaining one for testing. Table 1 shows the specific details of the four datasets. Note that for CUHK03 and MSMT17, which have multiple protocols, we use MSMT17_V1 and the new protocol CUHK03-NP for both training and testing. For brevity, we denote these datasets as M, D, C-NP and MS in the following tables. Rank-n (for n = 1, 5, and 10) and mean average precision (mAP) are adopted to evaluate the performance of different re-ID models.

Table 1 Statistics of four experiment datasets 
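
For reference, a simplified sketch of how Rank-n and mAP can be computed from a query–gallery distance matrix is given below (assumed NumPy; names are hypothetical, and the usual same-camera gallery filtering of re-ID evaluation is omitted):

```python
import numpy as np

def evaluate(dist, q_ids, g_ids, ranks=(1, 5, 10)):
    """dist: (Q, G) distance matrix; q_ids/g_ids: identity labels of queries/gallery."""
    cmc = np.zeros(max(ranks))
    aps, valid = [], 0
    for q in range(dist.shape[0]):
        order = np.argsort(dist[q])                            # gallery sorted by distance to query q
        matches = (g_ids[order] == q_ids[q]).astype(float)
        if matches.sum() == 0:
            continue                                           # query has no correct match in gallery
        valid += 1
        cmc[int(np.argmax(matches)):] += 1                     # CMC: hit from the first correct rank on
        prec = np.cumsum(matches) / (np.arange(len(matches)) + 1.0)
        aps.append((prec * matches).sum() / matches.sum())     # average precision for this query
    return {f"Rank-{r}": cmc[r - 1] / valid for r in ranks}, float(np.mean(aps))
```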

4.2 Comparison with State-of-the-art Methods

In this section, we compare the proposed method with current state-of-the-art methods, including QAConv [24], OSNet [44], CBN [45], SNR [4], M3L [3], MixNorm [46] and DSAF [47]. QAConv focuses on the interpretability of the matching process and constructs query-adaptive convolution kernels on the fly to achieve local matching; it is trained with ResNet-50. OSNet uses a flexible fusion mechanism over features extracted at different scales, and the variant OSNet-AIN further adds instance normalization layers. CBN proposes camera-based batch normalization, which normalizes the image features of each camera separately by calculating the per-camera mean and variance, thus eliminating the domain differences between cameras. M3L improves robustness and generalization through a memory-based module and a meta-learning strategy. SNR and MixNorm are data augmentation methods that reduce overfitting by enhancing data diversity, but their performance still leaves room for improvement. Table 2 shows the experimental results of our method and previous works. With a plain backbone, ResNet-50 or IBN-Net50, the proposed method outperforms these methods on the aforementioned four benchmarks.

Table 2 Comparison with the state-of-the-art domain generalization person re-ID methods on four datasets. (BOLD: BEST)

Results on Market1501 and DukeMTMC-reID. Market1501 and DukeMTMC-reID can be regarded as mid-scale datasets. As shown in Table 2, our method achieves the best performance under both types of backbone. Specifically, when testing on Market1501, our method achieves 56.5% mAP and 80.9% Rank-1, which outperforms M3L by 4.0% and 2.6%, respectively, with the same IBN-Net50 backbone. On DukeMTMC-reID, the previous methods show no significant improvement, whereas our method boosts Rank-1 accuracy and mAP by 3.1% and 3.6%, respectively, compared with the suboptimal M3L. Compared with the best method, MixNorm, our method stays competitive in mAP and achieves satisfactory Rank-1 results.

Results on MSMT17 and CUHK03-NP. MSMT17 and CUHK03-NP are large-scale and small-scale datasets, respectively. With IBN-Net50, the proposed method achieves 17.0%/41.3% mAP/Rank-1 on MSMT17 and 34.3%/35.7% mAP/Rank-1 on CUHK03-NP. Furthermore, we observe that introducing instance normalization layers indeed enhances the diversity and generalizability of the models: performance with IN-related backbones is significantly better than without them, particularly on datasets of inconsistent scale such as MSMT17 and CUHK03-NP. When testing on CUHK03-NP, because this domain is smaller than the others, extracting more discriminative representations is more important than reducing the multi-source domain gaps. The above results verify the strong domain generalization ability of our method on datasets of different scales.

4.3 Ablation study

In this section, we conduct comprehensive ablation experiments using IBN-Net50 as the backbone to investigate how the components and hyperparameters affect the performance of our proposal. The results of the ablation study of the different components on the four datasets are shown in Table 3, and the performance variation with the hyperparameters on Market1501 is shown in Fig. 4.

Table 3 Ablation study on the impact of the knowledge accumulated feature classifier (\(\mathcal{C}\)), global contrastive loss (\(\mathcal{G}\)) and hybrid triplet loss (\(\mathcal{H}\)) for multi-source DG re-ID
Fig. 4

Sensitivity of person re-ID accuracy to the weights update ratio λ. Rank-1 and mAP on Market-1501 are shown

Analysis of Proposed Components. In this part, we analyze the impact of the knowledge accumulated feature classifier (\(\mathcal{C}\)), the global contrastive loss (\(\mathcal{G}\)) and the hybrid triplet loss (\(\mathcal{H}\)) on the four datasets. Firstly, Index-0 denotes our baseline, whose performance on the different targets is unsatisfying. Secondly, in Index-1 and Index-4, the knowledge accumulated feature classifiers bring great boosts on all datasets, especially on the small and mid-scale datasets: on Market-1501, the results improve by 5.9% and 9.0% in Rank-1 and mAP. This shows that accumulating previous knowledge under the global guidance of label information is useful and plays an important role in steady and continuous representation learning. Thirdly, under the settings of Index-2 and Index-3, the model further improves over adding KAFC alone, which indicates the effectiveness of our proposed loss functions. Moreover, the model integrated with all components outperforms the baseline on all target datasets. On Market1501 and DukeMTMC-reID, the results increase by 10.1%/11.6% in Rank-1 accuracy and 14.3%/12.6% in mAP, respectively. On MSMT17 and CUHK03-NP, the results increase by 6.4%/18.8% in Rank-1 accuracy and 3.9%/17.6% in mAP, respectively. These results demonstrate that the proposed components are effective and mutually beneficial for improving model generalization. We also observe that it is unrealistic to expect data augmentation to make the transformed training data cover the distribution of all test data: when only distribution mixing is applied, the performance improvement is not significant, especially on the large-scale dataset MSMT17 with its abundant distributions.

The impact of the weight update ratio λ. λ is the weight update ratio of the feature classifiers in KAFC. A large λ enhances the guidance of the memory bank, but if the gradient direction is completely constrained by a large λ during iterations, gradient descent may become too slow. In the proposed method, we only apply the constraint at the beginning of each epoch and still use stochastic gradient descent in each training iteration to guarantee the learning ability of the network. In this section, we compare the performance under different settings of λ on Market-1501, varying from 0 to 1. As seen in Fig. 4(a), the best precision is obtained when λ is set to 1.

4.4 Visualization

Figure 5 visualizes the feature representations of the baseline and of the proposed method on the four datasets. In the baseline, there are large domain gaps between the multi-source domains and the target domains, while with our method the features of different domains fuse better in the feature space, which illustrates that the proposed method effectively reduces the domain gaps and learns domain-invariant representations.

Fig. 5

t-SNE plots of the sample representations. The top line is the baseline, and the bottom is the proposed method. a MS + C-NP + D → M(orange). b MS + C-NP + M → D(blue). c M + C-NP + D → MS(green). d MS + M + D → C-NP(pink)

5 Conclusions

In this paper, a multi-source domain generalization method is presented for person re-ID, which aims to improve discrimination and generalization capabilities on both seen and unseen domains through two novel strategies: a knowledge accumulation strategy and a distribution enhancement strategy. In particular, we design a new knowledge accumulation feature classifier that adaptively updates previously learned knowledge and the old parameters to alleviate the degrading impact of catastrophic forgetting and distributional shifts. To enhance the robustness of the model under multi-domain shifts, the MMBN module is introduced to capture and mix the domain-specific statistics. Moreover, to better optimize feature representations at the global and local levels, we introduce the global contrastive loss and hybrid triplet loss. Finally, our method is evaluated on four public benchmark datasets, and extensive experiments show its effectiveness and superiority over other state-of-the-art methods.

In the future, we will jointly leverage semi-supervised and weakly supervised algorithms to reduce the reliance on labels for large datasets and to improve person re-ID performance by exploring the knowledge in unlabeled data and data with weaker labels.