1 Introduction

Person re-identification (ReID) aims to match person images across multiple non-overlapping cameras and has attracted attention from both industry and academia. With the development of deep learning, many complicated visual tasks can be solved with the powerful representation ability of deep neural networks [1, 2, 6, 15, 46]. Person ReID has likewise benefited from deep convolutional networks [7, 35, 36, 54, 55]. Most existing person ReID models follow the supervised approach [30, 31, 41, 42, 43] and have achieved satisfactory performance. These models are optimized with annotated labels, which provide an accurate supervisory signal for learning from the data distribution. The features produced by the well-trained models can then be used to recognize the target pedestrian. However, these models generally perform less well in real applications because they have never been trained to adapt to the application scenes. To address this issue, unsupervised domain adaptation has become a hot topic in ReID [33, 56, 66]; it focuses on how to adapt a model pretrained on a labeled source domain to an unlabeled target domain.

The main challenge of unsupervised domain adaptive person ReID lies in learning feature representations from unlabeled target domain data. One major line of work assigns pseudo labels to target samples using a model pretrained on labeled source samples [9, 17, 60, 66], and then fine-tunes the model on the pseudo-labeled target samples. With this approach, ReID performance depends heavily on the quality of the pseudo labels, so some works focus on obtaining more dependable pseudo labels. Some of them concentrate on the clustering process [9, 17, 29], in which target samples are assigned pseudo labels based on different metrics in order to improve the clustering result. Other works study how to use the target samples effectively given the clustering results [60, 61]; during training, pseudo labels with different reliability are assigned different weights.

However, because the clustering results are imperfect, all of the existing methods suffer from noisy pseudo labels. To observe and analyze this phenomenon, we conducted experiments that track and tag the noisy labels arising during the training phase. The results revealed that a small proportion of the samples are frequently assigned wrong pseudo labels. These samples can be regarded as hard samples for the unsupervised domain adaptive person ReID task. As is well known, the performance of a well-trained model relies more on hard samples than on easy samples. For the same reason, hard samples with stubbornly wrong pseudo labels heavily limit the performance of the ReID model. Unfortunately, existing unsupervised domain adaptive ReID methods can hardly solve this problem. Meanwhile, most of these models train only with supervised loss functions based on the pseudo labels, such as the Cross Entropy and Triplet losses, and neglect unsupervised feature learning that requires neither ground-truth labels nor pseudo labels. Appropriate unsupervised guidance can compensate for the absence of ground-truth labels and provide robust optimization for model training, which can enhance the generalization ability of the learned feature representation.

To address these two problems, we propose a novel solution, FDL-SD, which resists hard samples and learns features with better generalization ability in a unified framework. To limit the negative influence of hard samples, we propose a simple but powerful method called Sample Dropout (SD) to smooth the distribution of noisy pseudo labels. It helps the model escape from local minima and explore better solutions. Most existing clustering-based unsupervised domain adaptive ReID works assign pseudo labels to all target samples before each fine-tuning iteration, which gives the model no chance to leave a local minimum. In contrast, we propose a novel clustering method in which a proportion of target samples is randomly discarded before each training epoch, so that the vicious circle of iterative training caused by hard samples can be broken. In addition to SD, we present a new architecture that realizes Feature Diversity Learning (FDL) in an unsupervised way. FDL optimizes model training stably without labels and increases feature diversity. The representation ability of the model is thereby improved, and we expect our method to suppress the ill effect of wrong pseudo labels and enhance the generalization ability of the feature representation. Owing to this representation ability, the proposed Feature Diversity Learning may also be used in many other fields [5, 13, 26, 38] in which features play an important role in the specific tasks. Overall, the main contributions of this paper can be summarized in three aspects:

  1. We propose the Sample Dropout (SD) method to reduce the adverse effect of hard samples on domain adaptation. It prevents hard samples from being assigned wrong pseudo labels all the time, thus breaking the vicious circle caused by these samples.

  2. We propose Feature Diversity Learning (FDL) and embed it into a dual-branch architecture to learn diverse feature representations in an unsupervised fashion, which yields stable feature representations and boosts the generalization ability of the model on the target domain.

  3. Extensive experiments on multiple benchmark datasets show that the proposed FDL-SD achieves state-of-the-art performance, which demonstrates the effectiveness of our approach.

2 Related work

2.1 Unsupervised domain adaptive person ReID

Unsupervised domain adaptive person ReID aims at transferring the knowledge learned from a labeled source domain to an unlabeled target domain, so that the model produces discriminative feature representations for the target domain. Existing works can be roughly divided into two categories. The first category attempts to reduce the domain gap between the labeled source domain and the unlabeled target domain so that the models generalize better. Some methods reduce the discrepancy between the two domains by aligning the feature distributions [23, 34, 39]; aligning the distributions in representation space reduces the shift between the source and target domains. Other methods adopt Generative Adversarial Networks (GANs) to transfer person images from the source style to the target style [10, 12, 50, 68], which alleviates the domain bias at the image level. However, these methods ignore the identity information of the target domain, so the target samples are not fully utilized to train the model. The second category assigns pseudo labels to target samples using a model pretrained on labeled source domain data, and then fine-tunes the model in a supervised fashion [16, 17, 32, 57, 58, 60]. These methods are widely used because of their superior performance. BUC [32], HCT [57] and SpCL [17] directly fine-tune the model by iterating pseudo-label mining: the target domain samples with pseudo labels are used to train the model in the following phase. Other methods, such as MMT [16], MEB-Net [58] and NRMT [60], adopt mutual training to cluster target samples and train the model, which can achieve better clustering results and more powerful feature representations. However, wrong pseudo labels are inevitable in these approaches, and some hard samples may even seriously damage the model.

In this paper, we follow the second, clustering-based line of methods to solve the unsupervised domain adaptive person ReID task. Unlike existing works, our work is motivated by reducing the ill effect caused by hard samples and by exploring diverse feature representations.

2.2 Learning with noisy labels

Many methods have been devoted to handling noisy labels. They can be categorized into loss correction, sample reweighting and label correction. Loss correction methods design special loss functions that are robust to noisy labels [18, 48, 59]; with these loss functions, the effect of the noise is reduced and the optimization can reach better solutions. Sample reweighting methods assign different weights to different samples [11, 19, 40, 61], so that samples with noisy labels are suppressed and their ill influence on model training is alleviated. For example, UNRN [61] re-weights samples based on the uncertainty of their pseudo labels: reliable samples contribute more, and samples with low certainty are assigned small weights. Label correction methods [9, 24, 27, 29, 45] focus on correcting noisy labels directly. DCML [9] gradually increases the usage of pseudo labels as training proceeds and the reliability of the pseudo labels improves, which helps the model learn from easy to hard. ADTC [24] uses a two-stage clustering strategy to assign pseudo labels to target samples: it first uses k-means to generate the cluster centroids and then assigns pseudo labels using the k-reciprocal Jaccard distance, which can improve the clustering quality. However, none of these methods solves the problem that some hard samples with noisy labels seriously hurt the training of the model and are difficult to detect or correct during iterative optimization. Therefore, this paper adopts a novel Sample Dropout method to weaken the ill influence caused by these hard samples.

2.3 Unsupervised feature learning

Ground-truth labels are unavailable in some classification tasks because annotating large amounts of data is expensive and time consuming. Therefore, unsupervised methods are adopted in many works. MMCL [49] designed a memory-based multi-label classification loss which integrates multi-label classification and single-label classification in a unified framework. Similar to MMCL, Xiao et al. [53] introduced a parameter-free Online Instance Matching loss with a memory dictionary scheme, which trains the feature encoder directly instead of learning a large classifier matrix and can speed up model convergence. To mitigate the effects of noisy pseudo labels, MMT [16] introduced a novel soft softmax-triplet loss that supports learning with soft pseudo labels and makes feature learning more reliable. Previous contrastive losses [8, 20, 51] only focused on separating instances without considering any ground-truth classes or pseudo-class labels. To solve this problem, SpCL [17] proposed a unified contrastive loss that jointly distinguishes source domain classes, clusters, and un-clustered instances of the target domain; using all the data enables the model to learn better representations. However, all of these methods rely on pseudo labels as supervisory signals, and wrong pseudo labels are inevitable. In this paper, we propose Feature Diversity Learning (FDL), which does not need labels as supervision and therefore prevents noisy labels from affecting this part of the training. At the same time, it can boost the model's representation ability.

3 Method

3.1 Overall framework

For the unsupervised domain adaptive person ReID task, we have a labeled training dataset \(D^{\mathrm {s}}=\left \{(\mathbf {x}_{i}^{\mathrm {s}},y_{i}^{\mathrm {s}})\right \}_{i=1}^{N_{\mathrm {s}}}\) collected from the source domain, where \(\mathbf {x}_{i}^{\mathrm {s}}\) and \(y_{i}^{\mathrm {s}}\) denote the i-th source sample and its corresponding person identity label, and \(N_{\mathrm {s}}\) is the number of source samples used for training. The unlabeled target domain samples are denoted as \(D^{\mathrm {t}}=\left \{{\mathbf {x}_{i}^{\mathrm {t}} }\right \}_{i=1}^{N_{\mathrm {t}}}\). The general UDA person ReID task is written "A to B", where A is the source domain dataset with annotated labels and B is the target domain dataset without annotated labels. In the UDA task, we need to learn discriminative feature representations for the target domain using the knowledge from the source domain dataset. Before each iterative training epoch, the clustering algorithm DBSCAN [3] is used to assign pseudo labels to target samples. To reduce the ill influence of hard samples in the target domain and boost the generalization ability of the model, we propose a novel framework that contains Sample Dropout and Feature Diversity Learning, as shown in Fig. 1.

Fig. 1 An overview of the proposed architecture. Our framework adopts a dual-branch structure consisting of feature encoders F1, F2 and their corresponding mean feature encoders \(\tilde {F}_{1}\), \(\tilde {F}_{2}\). Classifiers C1 and C2 follow F1 and F2. The labeled source domain in the figure means that the labels of the source domain are available during the training phase, and the unlabeled target domain means that there are no annotated labels for the target domain to guide model learning

Our model adopts a dual-branch structure which consists of feature encoders F1 and F2. Correspondingly, we use a momentum update mechanism to construct and update two mean feature encoders \(\tilde {F}_{1}\) and \(\tilde {F}_{2}\). During training, two mini-batches are randomly selected from the source domain and the target domain and fed into the two branches. As shown in Fig. 1, the same input is passed through F1, F2, \(\tilde {F}_{1}\) and \(\tilde {F}_{2}\) of the mutual teaching architecture at the same time. The feature encoders F1 and F2 are optimized by backpropagation using the gradients of the loss functions designed in this paper, and the parameters of the mean feature encoders \(\tilde {F}_{1}\) and \(\tilde {F}_{2}\) are updated by momentum from the parameters of F1 and F2. The mean feature encoders therefore smooth the negative effect caused by noisy labels and provide more reliable features to guide the learning of F1 and F2. Note that the proposed SD step is applied to the target domain before each training epoch. Along each branch, the output feature vector \(\boldsymbol {f}_{i}\) of the encoder \(F_{i}\) is sent into the classifier \(C_{i}\).

As shown in Fig. 1, the classic Cross Entropy loss is computed on the ID predictions given by the classifier \(C_{i}\), the Triplet loss is calculated on the feature vector \(\boldsymbol {f}_{i}\), and the proposed FDL loss is embedded into the mutual teaching architecture based on the encoder's output \(\boldsymbol {f}_{i}\) and the mean feature encoder's output \(\boldsymbol {\tilde {f}}_{j},i\neq j\). Finally, the feature vectors of the two mean encoders are concatenated into a united vector \(\boldsymbol {\tilde {f}}=[\boldsymbol {\tilde {f}}_{1};\boldsymbol {\tilde {f}}_{2}]\) to represent each sample in the testing stage; the same vector is also used by the clustering algorithm to produce pseudo labels.

In summary, there are three main steps during each epoch of the iterative training: 1) Sample Dropout in the target domain, 2) assigning pseudo labels for target domain samples, 3) training the model with source samples and target samples. Through the iterative optimization between feature representations and pseudo labels, the performance of the proposed method will be effectively improved.

3.2 Sample dropout

Noisy pseudo labels are harmful to the model but inevitable in unsupervised domain adaptive ReID tasks. We found, however, that a small subset of samples accounts for a large proportion of the noisy pseudo labels over the whole training process; we define these as hard samples. Compared with general samples, which are assigned wrong pseudo labels only a few times during training, hard samples continuously mislead the training process and irreversibly damage the model. To solve this problem, we propose Sample Dropout to smooth the distribution of noisy pseudo labels, which can reduce the adverse effect caused by hard samples. Besides, correcting noisy pseudo labels also lets the model learn better feature representations, which can further improve the model's performance.

Sample Dropout is applied before the clustering step of each iterative training epoch. At the beginning of the k-th epoch, a proportion of samples is randomly selected from the target dataset and denoted as \(D_{k}^{\mathrm {t}}=\left \{\mathbf {x}_{r(j,k)}^{\mathrm {t}}\right \}_{j=1}^{M_{\mathrm {t}}}\), where \(M_{\mathrm {t}}=(1-\rho )N_{\mathrm {t}}\) and ρ is the Sample Dropout rate. The function r(j,k) gives the index of the j-th randomly selected sample from the target dataset \(D^{\mathrm {t}}\) in the k-th epoch. Only the selected samples \(D_{k}^{\mathrm {t}}\) are then assigned pseudo labels based on their clustering results, and the remaining target samples are dropped from the current training epoch.
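The selection step above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `sample_dropout`, the use of index lists, and rounding \((1-\rho )N_{\mathrm {t}}\) to the nearest integer are all assumptions made here.

```python
import random


def sample_dropout(num_target, rho, rng):
    """Return sorted indices of the target samples kept for this epoch.

    M_t = (1 - rho) * N_t indices are drawn without replacement; the other
    samples take no part in clustering or training during the epoch.
    Rounding M_t to the nearest integer is an assumption of this sketch.
    """
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must lie in [0, 1)")
    num_keep = int(round((1.0 - rho) * num_target))
    return sorted(rng.sample(range(num_target), num_keep))


rng = random.Random(0)
for epoch in range(3):
    # A fresh subset is drawn every epoch, so a hard sample is only
    # occasionally present when pseudo labels are assigned.
    kept = sample_dropout(num_target=10, rho=0.4, rng=rng)
    print(epoch, kept)
```

Because the subset is redrawn independently in every epoch, each target sample is left out of roughly a fraction ρ of the epochs.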

Though the proposed SD seems extremely simple, it proves very powerful in dealing with noisy pseudo labels because it prevents the model from accumulating the degradation caused by hard samples. In the upper row of Fig. 2, the sample h is a hard sample which is always assigned wrong pseudo labels. As mentioned above, such a hard sample can easily lead to a vicious circle of iterative training between wrong labels and bad features, so that the feature space is forced to gather samples with different person IDs while scattering samples with the same ID, as shown in the top row of Fig. 2. Since a query in a traditional ReID task is always based on the feature distance between two samples, such a distorted feature space noticeably degrades the performance of the model. In contrast, as shown in Fig. 2(b), SD occasionally stops these hard samples from participating in the clustering and training steps and consequently prevents the training process from falling into the local minimum caused by them. The proposed SD method can significantly reduce the noisy labels on hard samples, and these samples have a chance to be corrected in the next training epoch. Without the disturbance of hard samples, the feature space can be pulled back towards a better direction in the next epoch. As a result, these hard samples are more likely to be assigned correct pseudo labels, as shown in the second plot in the bottom row of Fig. 2. By repeatedly applying SD, the negative impact of hard samples on the whole training process can be effectively suppressed, and the final performance of the model is improved.

Fig. 2 Illustration of the clustering results before each fine-tuning epoch. Circles with the same color represent samples with the same person ID. Samples in the same red dotted circle belong to the same cluster. The upper row (a) represents a typical clustering process without SD, and the bottom row (b) represents the process with SD, in which the black dotted boxes mark the samples discarded in the SD step

3.3 Feature diversity learning

Most unsupervised domain adaptive ReID methods employ only pseudo labels to guide the training of the model. On the one hand, noisy pseudo labels can hurt the model severely. On the other hand, the generalization ability of the feature representation is not sufficiently boosted because only pseudo-label-based losses are applied in the training stage. Therefore, we present a new approach referred to as Feature Diversity Learning (FDL), in which the model learns from the data distribution without annotated labels or pseudo labels, so that the generalization ability can be effectively enhanced.

Our model adopts a dual-branch structure to produce two feature streams which are concatenated at the final stage to form a united feature representation. The diversity between the two streams can be regarded as a kind of regularization term, which we expect to benefit the generalization ability of the feature representation. The main idea of FDL is to make both streams serve the same ReID task while keeping them diverse from each other. The biggest challenge lies in the trade-off between the similarity and the diversity of the two feature streams. Higher similarity helps to speed up convergence and reduce the empirical error but carries a higher risk of overfitting. In contrast, keeping appropriate feature diversity generally indicates better generalization ability but makes the training harder to converge. To get a stable and proper balance, as shown in Fig. 1, we build a mutual teaching architecture between the two streams by constructing a mean feature encoder \(\tilde {F}_{i}\) for each stream and updating it with the momentum of the corresponding feature encoder, as given in Eq. 1:

$$ \boldsymbol{\tilde{{\theta}}}_{i}(T)=\alpha\boldsymbol{\tilde{\theta}}_{i}(T-1)+(1-\alpha)\boldsymbol{\theta}_{\boldsymbol{i}}(T) $$
(1)

where \(\boldsymbol {\theta }_{i}\) and \(\boldsymbol {\tilde {\theta }}_{i}\) are the parameters of \(F_{i}\) and \(\tilde {F}_{i}\) respectively, T and T − 1 index the current and the previous training iteration, and α is the momentum coefficient. Whenever the parameters of the feature encoder \(F_{i},i=1,2\) are updated through backpropagation on a mini-batch, the parameters of the mean feature encoder \(\tilde {F}_{i},i=1,2\) are updated at the same time. Error amplification can be avoided in this way because the temporally averaged model produces more reliable feature representations. In addition, the FDL loss function \({\mathscr{L}}_{\text {FDL}}\) is put forward to guide the feature diversity learning within the proposed architecture, as shown in Eq. 2:

$$ \mathcal{L}_{\text{FDL}}=S(\boldsymbol{f}_{1}^{\mathrm{T}}\boldsymbol{\tilde{f}}_{2})+S(\boldsymbol{f}_{2}^{\mathrm{T}}\boldsymbol{\tilde{f}}_{1}) $$
(2)

where S(x) is the softplus function:

$$ S(x)=\ln(1+\exp(x)) $$
(3)

Since S is monotonically increasing, the smaller the inner products between \(\boldsymbol {f}_{i}\) and \(\boldsymbol {\tilde {f}}_{j}\), \(i\neq j\), that is, the more diverse the two vectors are, the smaller the FDL loss \({\mathscr{L}}_{\text {FDL}}\) is. When the model is trained in this unsupervised fashion, it can not only learn robust feature representations but also improve its discriminative ability. With the proposed mutual teaching structure embedded with the FDL loss, stable and proper diversity between the two branches can be expected. Experimental results show that the proposed Feature Diversity Learning is effective. We believe this is because the two branches converge to different suboptimal solutions under the guidance of the FDL loss, which lowers the risk of overfitting to outlier samples or unreliable pseudo labels and therefore enhances the generalization ability of the feature representation and boosts the model's recognition ability.
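Eqs. 1 to 3 can be checked numerically with a small, framework-free sketch. Everything here is illustrative: parameters and features are plain Python lists, the helper names (`ema_update`, `softplus`, `fdl_loss`) are hypothetical, and the loss is evaluated for a single sample because the text does not state how it is reduced over a mini-batch or whether the features are normalized first.

```python
import math


def ema_update(mean_params, params, alpha=0.999):
    """Eq. 1: theta_mean(T) = alpha * theta_mean(T-1) + (1 - alpha) * theta(T)."""
    return [alpha * m + (1.0 - alpha) * p for m, p in zip(mean_params, params)]


def softplus(x):
    """Eq. 3, S(x) = ln(1 + exp(x)), rearranged to avoid overflow for large x."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dot(u, v):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def fdl_loss(f1, f2, f1_mean, f2_mean):
    """Eq. 2 for one sample: S(f1 . f2_mean) + S(f2 . f1_mean)."""
    return softplus(dot(f1, f2_mean)) + softplus(dot(f2, f1_mean))


# Two branches that produce the same unit vector versus orthogonal vectors.
aligned = fdl_loss([1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0])
diverse = fdl_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0])
print(aligned > diverse)  # the more diverse pair gives the smaller loss
```

In this toy case the orthogonal pair yields 2 ln 2, while the aligned pair yields 2 ln(1 + e), which matches the monotonicity argument above.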

3.4 Overall loss

In addition to the FDL loss, we also adopt the widely used Cross Entropy (CE) loss and the Triplet loss to train the model.

The CE loss is computed from the predictions given by both classifiers C1 and C2. For the source mini-batch \(B^{\mathrm {s}}\) and the target mini-batch \(B^{\mathrm {t}}\), the CE loss \({\mathscr{L}}_{\text {CE}}\) is defined as follows:

$$ \begin{aligned} \mathcal{L}_{\text{CE}}=-[\frac{1}{\lvert B^{\mathrm{s}}\rvert}{\sum}_{\boldsymbol{x}_{i}\in{B^{\mathrm{s}}}}(\log(\hat{y}_{i,*}^{(1)})+\log(\hat{y}_{i,*}^{(2)}))+\frac{1}{\lvert B^{\mathrm{t}}\rvert}{\sum}_{\boldsymbol{x}_{j}\in{B^{\mathrm{t}}}}(\log(\hat{y}_{j,\#}^{(1)})+\log(\hat{y}_{j,\#}^{(2)}))] \end{aligned} $$
(4)

where \(\hat {y}_{i,*}^{(1)}\) and \(\hat {y}_{i,*}^{(2)}\) denote the predicted probabilities of the source sample \(\mathbf {x}_{i}^{\mathrm {s}}\) on its true label, which are obtained by the classifiers C1 and C2 respectively. Similarly, \(\hat {y}_{j,\#}^{(1)}\) and \(\hat {y}_{j,\#}^{(2)}\) are those of the target sample \(\mathbf {x}_{j}^{\mathrm {t}}\) on the pseudo label.

The standard Triplet loss for a training sample \(\boldsymbol {x}_{a}\) (the anchor sample) and its feature vector \(F(\boldsymbol {x}_{a})\) can be calculated according to Eq. 5:

$$ \begin{aligned} \mathcal{L}_{\mathrm{t}}(\boldsymbol{x}_{a};F)=\left[\tau+\left\|F(\boldsymbol{x}_{a})-F(\boldsymbol{x}_{p})\right\| - \left\|F(\boldsymbol{x}_{a})-F(\boldsymbol{x}_{n})\right\| \right]_{+} \end{aligned} $$
(5)

where \(\boldsymbol {x}_{p}\) and \(\boldsymbol {x}_{n}\) are the hardest positive and negative samples for the anchor \(\boldsymbol {x}_{a}\) in the same mini-batch, and τ is the margin. By applying the Triplet loss \({\mathscr{L}}_{\mathrm {t}}\) to both the source and target mini-batches in the two branches, the overall Triplet loss can be written as Eq. 6. Note that the hardest positive and negative samples of a target sample are selected according to their pseudo labels.

$$ \begin{array}{@{}rcl@{}} \mathcal{L}_{\text{TRI}}=&\frac{1}{\lvert B^{\mathrm{s}}\rvert}\sum\limits_{\boldsymbol{x}_{i}\in{B^{\mathrm{s}}}}(\mathcal{L}_{\mathrm{t}}(\boldsymbol{x}_{i};F_{1}) + \mathcal{L}_{\mathrm{t}}(\boldsymbol{x}_{i};F_{2}) )+ \\ &\frac{1}{\lvert B^{\mathrm{t}}\rvert}\sum\limits_{\boldsymbol{x}_{j}\in{B^{\mathrm{t}}}}(\mathcal{L}_{\mathrm{t}}(\boldsymbol{x}_{j};F_{1}) + \mathcal{L}_{\mathrm{t}}(\boldsymbol{x}_{j};F_{2}) ) \end{array} $$
(6)
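The hardest-positive and hardest-negative selection in Eq. 5 can be illustrated for one encoder and one mini-batch as follows. This is a sketch under stated assumptions: features are plain lists, the distance is Euclidean, the function name `batch_hard_triplet` is hypothetical, and anchors without a positive or a negative in the batch are skipped, a case the text does not discuss.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def batch_hard_triplet(features, labels, margin=0.3):
    """Mean over anchors of [margin + d(a, hardest pos) - d(a, hardest neg)]_+."""
    losses = []
    for a, (fa, la) in enumerate(zip(features, labels)):
        pos = [euclidean(fa, f)
               for i, (f, l) in enumerate(zip(features, labels))
               if l == la and i != a]
        neg = [euclidean(fa, f) for f, l in zip(features, labels) if l != la]
        if not pos or not neg:
            continue  # no valid triplet for this anchor in the mini-batch
        # Hardest positive is the farthest one; hardest negative is the closest.
        losses.append(max(0.0, margin + max(pos) - min(neg)))
    return sum(losses) / len(losses) if losses else 0.0


# Labels are ground-truth IDs for source samples and pseudo labels for targets.
feats = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
ids = [0, 0, 1, 1]
print(batch_hard_triplet(feats, ids))
```

Evaluating this function on the source and target mini-batches for both F1 and F2 and summing the four terms corresponds to the structure of Eq. 6.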

Finally, the CE loss, the Triplet loss and the proposed FDL loss are combined to train the model in an end-to-end fashion. The overall loss function can be written as:

$$ \mathcal{L}=\beta\mathcal{L}_{\text{CE}}+\gamma\mathcal{L}_{\text{TRI}}+\delta\mathcal{L}_{\text{FDL}} $$
(7)

where β, γ and δ are the coefficients of the three losses. Note that no target samples are involved in the pretraining on the source domain. In other words, the mini-batch \(B^{\mathrm {t}}\) can be viewed as an empty set in the pretraining stage.

According to related research, the CE loss helps to cluster each ID class at the global scale, and the Triplet loss does well in correcting the boundaries between neighboring classes. Coupled with the proposed FDL loss, the empirical error and the generalization error of the model can be properly balanced through iterative training with the overall loss given by Eq. 7. Overall, the training algorithm of the proposed approach can be summarized as follows.

Algorithm 1 Training process.
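The three per-epoch steps listed in Section 3.1 can be laid out as the following control-flow skeleton. It is a sketch, not a reproduction of Algorithm 1: `train_fdl_sd`, `cluster_fn` and `train_epoch_fn` are hypothetical names, the clustering and optimization steps are passed in as placeholder callables, and treating the first epoch as source-only pretraining follows the statement in Section 4.2.

```python
import random


def train_fdl_sd(num_target, num_epochs, rho, cluster_fn, train_epoch_fn, seed=0):
    """Skeleton of the iterative training; returns the kept-sample count per epoch."""
    rng = random.Random(seed)
    history = []
    for epoch in range(num_epochs):
        if epoch == 0:
            # Pretraining: source data only, so the target mini-batch is empty.
            kept, pseudo_labels = [], {}
        else:
            # Step 1: Sample Dropout keeps (1 - rho) * N_t random target samples.
            num_keep = int(round((1.0 - rho) * num_target))
            kept = sorted(rng.sample(range(num_target), num_keep))
            # Step 2: cluster only the kept samples to obtain pseudo labels.
            pseudo_labels = cluster_fn(kept)
        # Step 3: one epoch on source data plus pseudo-labeled target samples;
        # the momentum update of the mean encoders would happen inside.
        train_epoch_fn(epoch, pseudo_labels)
        history.append(len(kept))
    return history


def dummy_cluster(indices):
    """Placeholder for DBSCAN on mean-encoder features: index -> cluster id."""
    return {i: i % 3 for i in indices}


def dummy_train(epoch, pseudo_labels):
    """Placeholder for optimizing the overall loss of Eq. 7."""
    return None


print(train_fdl_sd(10, 3, 0.4, dummy_cluster, dummy_train))
```

Passing the clustering and training steps as callables keeps the skeleton independent of any particular deep learning framework.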

4 Experiments

4.1 Datasets

The proposed approach is evaluated on three popular datasets: Market-1501 [63], DukeMTMC-reID [37] and MSMT17 [50]. The Market-1501 dataset contains 32,668 labeled images of 1,501 identities from 6 disjoint cameras, in which the training set includes 12,936 images of 751 identities, the gallery set contains 19,732 images from 750 identities, and the query set contains 3,368 images from 750 identities. The DukeMTMC-reID dataset includes 36,411 images of 1,404 identities, of which 16,522 images of 702 identities are used for training. The gallery and query sets contain 17,661 and 2,228 images of the other 702 identities. MSMT17 contains 126,441 images of 4,101 identities, of which 1,041 identities and 3,060 identities are used for training and testing respectively. Sample person images from these datasets are shown in Fig. 3. When an image is fed into the feature encoder, it passes through the convolution, activation and pooling layers to produce the feature vector, as illustrated in Fig. 4.

Fig. 3 The person images in Market-1501, DukeMTMC-reID and MSMT17

Fig. 4 The illustration for the feature extraction

4.2 Settings

Implementation details: The feature encoders F1, F2 and their mean encoders \(\tilde {F}_{1}\) and \(\tilde {F}_{2}\) adopt ResNet-50 [21] as the backbone. They are initialized with parameters pretrained on ImageNet. Compared with the original ResNet-50, the fully connected layers of these encoders are discarded. The corresponding classifiers C1 and C2 are trained directly on the given datasets. In the training process, each input image is resized to 256×128. Horizontal flipping, random cropping, and random erasing are performed for data augmentation [25, 44]. The mini-batch size is set as \(\lvert B^{\mathrm {s}}\rvert =\lvert B^{\mathrm {t}}\rvert =60\), with 4 samples per identity. The hyperparameters α, β, γ, δ and τ in Eqs. 1, 7 and 5 are set to 0.999, 1, 1, 0.5 and 0.3 respectively. The model is trained with the Adam optimizer; the learning rate η starts at 0.00035 and is divided by 10 after every 20 epochs. The whole training process finishes after 55 epochs, in which the first epoch pretrains the model with only the source domain dataset and the other epochs train the model with both the source and target datasets.
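The learning rate schedule and mini-batch composition described above amount to the following arithmetic. The function name `learning_rate` and the choice to count epochs from 0 are assumptions of this sketch; the text does not specify the indexing.

```python
def learning_rate(epoch, base_lr=0.00035, step=20, factor=10.0):
    """Step schedule: base_lr divided by `factor` after every `step` epochs."""
    return base_lr / (factor ** (epoch // step))


BATCH_SIZE = 60
SAMPLES_PER_IDENTITY = 4
# Number of identities in each source or target mini-batch.
identities_per_batch = BATCH_SIZE // SAMPLES_PER_IDENTITY

for epoch in (0, 19, 20, 40, 54):
    print(epoch, learning_rate(epoch))
print(identities_per_batch)
```

Under this indexing the 55 training epochs span three learning rate stages, and each mini-batch holds 15 identities.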

Evaluation metrics: In the testing process, the cumulative matching characteristics (CMC) at Rank-1, Rank-5 and Rank-10 and the mean average precision (mAP) are used to evaluate the performance of our method.
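For a single query, both metrics can be computed from the gallery ranking alone, as in the sketch below. The helper names are hypothetical, the input is assumed to be a list of booleans marking which ranked gallery images share the query identity, and the usual filtering of same-camera gallery images is assumed to have been done beforehand.

```python
def rank_k_hit(ranked_match_flags, k):
    """1.0 if a correct match appears within the top-k gallery results."""
    return 1.0 if any(ranked_match_flags[:k]) else 0.0


def average_precision(ranked_match_flags):
    """Mean of the precision values measured at each correct match."""
    hits, precisions = 0, []
    for position, is_match in enumerate(ranked_match_flags, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / position)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Gallery sorted by ascending feature distance to the query.
flags = [False, True, False, True]
print(rank_k_hit(flags, 1), rank_k_hit(flags, 5), average_precision(flags))
```

Averaging `rank_k_hit` over all queries gives the Rank-k score, and averaging `average_precision` over all queries gives mAP.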

4.3 Comparison with the state-of-the-art methods

We compare our proposed FDL-SD method with other SOTA unsupervised domain adaptive ReID works on the Market-1501, DukeMTMC-reID and MSMT17 datasets. The comparison results are shown in Table 1.

Table 1 Performance (%) comparison with some popular SOTA unsupervised domain adaptive ReID methods on Market-1501, DukeMTMC-reID and MSMT17

First, we compare the ReID performance on Market-1501 and DukeMTMC-reID. We compare with typical works that follow the domain bias reduction approach, including the GAN-based PTGAN [50], SPGAN [12] and CR-GAN [10], and the feature-alignment-based DAAM [22], D-MMD [34] and SADA [47]. The results show that our model outperforms the best of these models by about 15 to 23 percentage points in mAP (71.3% vs 55.8% on the Market-to-Duke task, and 83.0% vs 59.8% on the Duke-to-Market task). Compared with traditional clustering-based methods such as SSG [14] and ECN [66], our model is still superior in all the metrics on both adaptation tasks. The comparison also includes recent models that tackle the noisy label problem, such as DRDL [28], HDS [65], GLT [62] and ACMA [67]. Even though these methods achieve better performance than the above approaches, our method still outperforms them. In particular, our method outperforms ACMA [67] by 1.5%/1.0% on the mAP/Rank-1 scores for the Market-to-Duke task, and by 1.9% on the mAP score for the Duke-to-Market task. On the most challenging tasks, Market-to-MSMT and Duke-to-MSMT, our approach also brings significant improvement; for the mAP score on the Duke-to-MSMT task in particular, it outperforms the existing works by a large margin. The results indicate that our approach generalizes across tasks better than the other methods and provides a solid baseline for future research on UDA person ReID. Together, these results support the superiority of our method over the competing SOTA approaches.

4.4 Analysis for sample dropout rate ρ

The experimental results show that the Sample Dropout rate ρ has a significant impact on the representation ability of the model and affects its final performance. To clarify this influence, a group of tests is carried out to analyze the underlying relationship between the parameter ρ, the ReID metrics and the clustering results.

The influence on performance metrics: We uniformly sample the values of ρ from 0 to 0.8. The metric scores under different values of ρ are shown in Tables 2 and 3. The performance on all the tasks improves clearly as ρ increases from 0, reaches its peak at ρ = 0.4, and falls back as ρ continues to increase. The results show that the proposed SD method has a notable effect on the training results, and that a proper value of ρ can significantly improve the performance of the model, which is consistent with our purpose.

The influence on clustering result: To examine the underlying mechanism of the proposed SD method, we design a group of experiments to observe the distribution of noisy pseudo labels for different values of ρ. In the proposed clustering-based method, a target sample \(\boldsymbol {x}_{i}^{\mathrm {t}}\) in the k-th clustering and training process is assigned a noisy pseudo label when its true label differs from the dominant true label of the cluster it belongs to. The more often a sample is assigned a noisy pseudo label over the iterative epochs, the harder the sample is. Based on the results, the curves of four metrics against increasing values of ρ are plotted in Fig. 5.
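The noisy-label criterion above can be illustrated with a minimal sketch. It assumes that ground-truth identities and per-epoch cluster assignments are available as integer arrays; the function names `noisy_label_mask` and `count_noisy_assignments` are hypothetical and not taken from our implementation. Ties for the dominant identity are broken toward the smallest identity index, which is an assumption of this sketch, and samples left unclustered are not treated specially.

```python
import numpy as np

def noisy_label_mask(true_labels, cluster_ids):
    """Flag samples whose true label differs from the dominant true
    label of the cluster they were assigned to (hypothetical helper)."""
    true_labels = np.asarray(true_labels)
    cluster_ids = np.asarray(cluster_ids)
    noisy = np.zeros(len(true_labels), dtype=bool)
    for c in np.unique(cluster_ids):
        members = np.flatnonzero(cluster_ids == c)
        ids, counts = np.unique(true_labels[members], return_counts=True)
        # Assumption: ties are resolved toward the smallest identity index.
        dominant = ids[np.argmax(counts)]
        noisy[members] = true_labels[members] != dominant
    return noisy

def count_noisy_assignments(true_labels, cluster_ids_per_epoch):
    """Count, per sample, how many epochs produced a noisy pseudo label.
    A larger count corresponds to a harder sample."""
    counts = np.zeros(len(true_labels), dtype=int)
    for cluster_ids in cluster_ids_per_epoch:
        counts += noisy_label_mask(true_labels, cluster_ids)
    return counts
```

Ranking the samples by the returned counts gives the ordering from which the hardest samples are selected.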

Table 2 The performance (%) of the model on Market-to-Duke and Duke-to-Market tasks when different values of ρ are applied
Table 3 The performance (%) of the model on Market-to-MSMT and Duke-to-MSMT tasks when different values of ρ are applied
Fig. 5 Curves of clustering error rate, relative error rate of the 10% and 20% hardest samples and the mAP plotted along with growing values of the parameter ρ. Similar to (a), the left vertical axes of (b), (c) and (d) represent the clustering error rate and relative error rate, the right axes indicate the mAP scores

The clustering error rate is the ratio of noisy labels to all labels when the clustering algorithm is applied to the fully converged model. The pink curves in Fig. 5(a), (b), (c) and (d) show that the clustering error rate increases as ρ grows. This agrees with intuition, because a higher SD rate means that more samples are not fully utilized in training. The blue and green curves in Fig. 5 record the relative error rates of the 10% and 20% hardest samples against ρ, respectively. By the 10% or 20% hardest samples, we mean the top 10% or 20% of samples that are most frequently assigned noisy pseudo labels. The corresponding relative error rate is defined as the ratio of the noisy labels ever assigned to these hardest samples to all the noisy labels generated over the whole training process. The relative error rates of the 10% and 20% hardest samples fall steadily as ρ grows from 0 to 0.8. This result shows that the proposed SD method effectively prevents noisy pseudo labels from concentrating on a minority of target samples, namely the hardest samples mentioned above, and thereby alleviates the damage these hard samples cause to the model.
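The two rates defined above can be computed as in the following minimal sketch. It assumes a boolean mask of noisy labels from the final clustering and a per-sample count of noisy assignments accumulated over training; the function names are hypothetical, and rounding the size of the hardest subset to the nearest integer (with at least one sample) is an assumption of this sketch rather than a detail stated in the text.

```python
import numpy as np

def clustering_error_rate(noisy_mask):
    """Ratio of noisy labels to all labels for one clustering result."""
    return float(np.mean(np.asarray(noisy_mask, dtype=bool)))

def relative_error_rate(noisy_counts, top_fraction):
    """Share of all noisy labels generated during training that fell on
    the `top_fraction` most frequently mislabelled (hardest) samples."""
    noisy_counts = np.asarray(noisy_counts)
    total = noisy_counts.sum()
    if total == 0:
        return 0.0  # no noisy label was ever generated
    # Assumption: subset size is rounded, and never smaller than one sample.
    k = max(1, int(round(top_fraction * len(noisy_counts))))
    hardest = np.sort(noisy_counts)[::-1][:k]  # k largest counts
    return float(hardest.sum() / total)
```

A relative error rate close to 1 for a small `top_fraction` means that the noisy labels are concentrated on a few hard samples, while a lower value means they are spread more evenly.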

Considering the combined influence of the hyperparameter ρ on both the clustering error rate and the relative error rate of the hardest samples, it follows that there should be a value of ρ that best balances the two factors. The mAP curves in Fig. 5 support this point: the mAP score reaches its peak at ρ = 0.4 on all four tasks. The above experimental results and the corresponding analysis show that the proposed SD method can significantly suppress the adverse influence of hard samples and consequently improve the reliability of the model.

4.5 Ablation study on FDL

To validate the effectiveness of the proposed Feature Diversity Learning (FDL), a group of ablation tests is designed to compare the mAP and Rank-1 scores of the model with and without the FDL. As shown in Fig. 6, for different values of ρ on the horizontal axes, the red and green bars represent the mAP scores with and without the FDL, and the blue and pink bars represent the Rank-1 scores with and without the FDL, respectively. When ρ = 0.4, the FDL achieves 2.8% and 1.2% improvements on the mAP and Rank-1 scores for the Market-to-Duke task, and 2.7% and 1.4% improvements on the mAP and Rank-1 scores for the Duke-to-Market task. This advantage also holds on both the Market-to-MSMT and Duke-to-MSMT tasks, where the performance improves significantly when the FDL loss is adopted. The results under different values of ρ on all the tasks lead to the same conclusion: the proposed FDL improves the generalization ability of the model on the unsupervised domain adaptive ReID task. This supports our view that the diversity between the two feature streams helps to prevent the model from overfitting.

Fig. 6 The mAP and Rank-1 scores for ablation tests under different values of ρ on four tasks, where “w/” means the FDL is applied to the model while “w/o” means the opposite

4.6 The application of our method

The person ReID task has important application value in the real world, as it can help maintain public order and serve society. We have implemented an intelligent video system that can recognize and query a target person. Because annotated labels of the persons in the target scene are absent during the training phase, we have to use UDA person ReID technology to implement these functions. As mentioned above, existing works leave some problems unsolved, whereas our proposed method can meet these demands. Therefore, our method has important real-world application value.

5 Conclusion

Noisy pseudo labels are one of the most challenging problems for clustering-based unsupervised domain adaptive ReID models, and they are the main problem addressed in this paper. With the proposed Sample Dropout method, the hard samples can be effectively suppressed and the uneven distribution of noisy labels can consequently be smoothed, which helps to break the vicious circle between noisy labels and bad features and improves the training results. The proposed Feature Diversity Learning provides a new approach for the well-known mutual teaching architecture: it lets the model learn knowledge from the data distribution in an unsupervised fashion and focuses on enhancing the diversity of the two feature streams. The ablation study shows that FDL has a stable positive impact on the generalization ability of the model. With these two improvements, the comparison results show that the proposed FDL-SD outperforms most state-of-the-art methods on the unsupervised domain adaptive ReID task. Our method has also been applied in a real-world scene, where it has significant value for video systems.

Despite these performance advantages, the proposed SD and FDL methods leave much room for further exploration, including the intrinsic mechanism of the dropout rate ρ and the underlying relationship between the momentum-based mean feature vector and the FDL scheme. We will continue to focus on these problems in our future work.