
1 Introduction

Person re-identification (re-ID), which aims at retrieving images of the same person from a database given a query person image, has advanced considerably in recent years thanks to the power of deep learning  [19, 29, 32, 34, 35, 48, 50, 51, 53, 58]. However, due to the problem of domain shift  [17], a deep re-ID model that performs well in a source domain may suffer a significant performance drop when applied to a target domain. Moreover, labels for the target data are usually difficult to obtain in practice, which hinders supervised fine-tuning of the deep model on that data.

Fig. 1. Overview of the proposed Noise Resistible Mutual-Training (NRMT). NRMT maintains two networks during training, which perform collaborative clustering to ease the fitting to noisy instances and mutual instance selection to further select reliable and informative instances for the network update.

To learn a deep re-ID model that generalizes well in the target domain without using labels from this domain, unsupervised domain adaptation (UDA) approaches have been proposed that leverage labeled source data and unlabeled target data  [5, 21, 24, 45, 56, 57]. Different from the traditional setting of UDA, which assumes that the source and target domains share the same classes, UDA in person re-ID is an open-set scenario, i.e., the two domains have entirely different person identities (classes), making it a more challenging task.

Self-training is an effective strategy for UDA in person re-ID  [8, 11, 31, 49], which performs clustering with the pre-trained source model to assign pseudo-labels to samples of the target dataset, then alternately updates the model with the pseudo-labels on target data and re-assigns the labels with the updated model, so that the model adapts to the target data progressively. In the early stage of training, pseudo-labels assigned by clustering usually contain substantial noise due to the divergence between the source and target domains. The model can correct some of these errors by learning from clean labels. However, as training iterations proceed, some noisy instances are fitted by the model and can no longer be corrected. This accumulated noise eventually harms the performance of the self-training model on the target data.

In order to address the problem mentioned above, we propose Noise Resistible Mutual-Training (NRMT) to effectively reduce the impact of noisy instances throughout the training process by leveraging dual networks with information interaction. As shown in Fig. 1, NRMT maintains two networks during training, which perform collaborative clustering to ease the fitting to noisy instances and mutual instance selection to further select reliable and informative instances for the network update. We argue that, in the iterative self-training process, there always exist some noisy instances that a single network cannot identify by itself. Inspired by deep learning with noisy labels  [14, 22], we use another network with a different learning ability to assist in correcting pseudo-label errors.

Specifically, at each iteration, collaborative clustering allows the two networks to not only learn from their respective pseudo-labels but also exploit the ones provided by each other as additional supervision. For one network, its peer network can provide different labels for the same instances due to its different learning ability. Although these labels also contain noise, they can still be used to reduce the effect of label errors of a single network, because deep neural networks tend to fit easy (more likely to be correct) instances first  [1]. For each mini-batch, mutual instance selection is introduced to further filter out noisy instances while keeping informative instances. Here, the reliability of a triplet of instances is assessed for one network according to the prediction confidence of its peer network on this triplet. Informative instances are also important for improving the network performance. Thus, we further measure the amount of information of the triplet by the relationship disagreement of the predictions across the networks. Combining collaborative clustering at each iteration with mutual instance selection within each mini-batch, the proposed NRMT can effectively depress noise in pseudo-labels and improve the performance of both networks.

Our main contributions can be summarized as follows: 1) We present a novel noise resistible mutual-training method for unsupervised domain adaptation in person re-ID, which exploits dual-network interaction to depress noise in the pseudo-labels of unsupervised iterative training on the target data. 2) We introduce collaborative clustering, which eases the fitting to noisy instances by exploiting the memorization effects of deep networks. 3) We propose mutual instance selection, which uses the peer-confidence and relationship disagreement of the networks on triplets of instances to select reliable and informative instances within each mini-batch.

2 Related Work

Unsupervised Domain Adaptation. Our work is related to unsupervised domain adaptation (UDA)  [3, 28, 36, 37]. Some methods have been proposed to match distributions between the source and target domains  [20, 33]. Long et al.  [20] embed features of task-specific layers in a reproducing kernel Hilbert space to explicitly match the mean embeddings of different domain distributions. Sun et al.  [33] propose to learn a linear transformation that aligns the second-order statistics of feature distributions between the two domains. Several works instead learn domain-invariant features  [12, 37]. Ganin et al.  [12] introduce a gradient reversal layer to learn domain-invariant features via an adversarial loss. The aforementioned methods only consider the closed-set scenario. Recently, some works have been introduced to address open-set domain adaptation  [10, 23, 27], where several classes are unknown in the two domains (or in the target domain). However, for UDA in person re-ID, the classes of the two domains are entirely different, which presents a greater challenge.

UDA for Person re-ID. Many works have been proposed for unsupervised cross-domain person re-ID  [5, 24, 25, 30, 31, 38, 40, 41, 42, 44, 46, 56, 57]. Some of them focus on image-level domain invariance. Wei et al.  [39] propose a person transfer generative adversarial network to bridge the domain gap, which considers both style transfer and person identity preservation. Deng et al.  [7] generate target image samples through the coordination between a CycleGAN and a Siamese network. Several works also try to improve model generalization from the perspective of feature learning. Wang et al.  [38] establish an identity-discriminative and attribute-sensitive feature representation space that is transferable to any new (unseen) target domain. Qi et al.  [25] develop a camera-aware domain adaptation method to reduce the discrepancy across sub-domains in cameras and utilize the temporal continuity in each camera to provide discriminative information.

Recently, some methods have been developed based on the self-training framework. Fu et al.  [11] present a self-similarity grouping approach that explores potential similarities using both global and local appearance cues. Zhang et al.  [49] propose a self-training method with a progressive augmentation framework that offers complementary data information through different learning strategies. In contrast, our method provides complementary information through dual-network interaction. Ge et al.  [13] present a mutual mean-teaching framework to softly refine the pseudo-labels in the target domain. Note that our method and  [13] are complementary and can be combined.

Deep Learning with Noisy Labels. Several works aim at improving the training of deep models with noisy labels. Decoupling  [22] trains two networks simultaneously and updates the models only using the instances on which the two networks make different predictions. Co-teaching  [14] selects the small-loss instances of each network as useful knowledge and transfers them to the peer network for further training. Yu et al.  [47] combine the disagreement strategy with Co-teaching, training two deep neural networks with a disagreement-update step (data update) and a cross-update step (parameter update). These methods mainly focus on the classification problem and cannot be directly applied to the metric learning problem in our task.

3 Our Method

Given a labeled training dataset \(\{\mathbf{{X}}^s, \mathbf{{Y}}^s\}\) from the source domain and an unlabeled training dataset \(\mathbf{{X}}^t\) from the target domain, where person identities are different from those in the source domain, we aim to learn discriminative feature representations for the target testing dataset. In this section, we present the proposed Noise Resistible Mutual-Training (NRMT) method, which incorporates the interaction of dual networks to depress noise in the pseudo-labels produced by unsupervised clustering in a self-training process. We now explain each component of NRMT in detail.

3.1 Self-training with Clustering

Since the ground truth labels of the target person images are not available, one way to fine-tune the target model is to consider the target labels as latent variables that can be inferred in the learning process. Thus, a typical self-training framework for unsupervised domain adaptation aims to minimize the following loss function:

$$\begin{aligned} \mathop {\min }\limits _{\mathbf{{\hat{Y}}}^t,\mathbf{{W}}} {\mathcal {L}}(\mathbf{{\hat{Y}}}^t,f(\mathbf{{X}}^t;\mathbf{{W}})), \end{aligned}$$
(1)

where \(\mathbf{{\hat{Y}}}^t\) denotes the estimated target labels, \(\mathbf{{X}}^t\) is the set of target images and f denotes the target model parameterized by \(\mathbf{{W}}\).

In the case of person re-ID, the source and target domains do not share a common label space. Thus, one cannot directly apply the classifier trained on the source dataset to estimate the target identities. Similar to  [8, 31], we perform clustering on CNN features to assign pseudo-labels to the instances with the most confident predictions and assume that they are mostly correct. Once the target model is updated with these pseudo-labels, the remaining, less confident instances are progressively explored as the model adapts better to the target domain. Therefore, to minimize the loss function in Eq. (1), we first initialize the model parameters \(\mathbf {W}\) on the source data \(\{\mathbf{{X}}^s, \mathbf{{Y}}^s\}\) and then apply an alternating block coordinate descent algorithm: 1) Fix \(\mathbf {W}\) and minimize the loss w.r.t. \(\mathbf{{\hat{Y}}}^t\) through clustering. 2) Fix \(\mathbf{{\hat{Y}}}^t\) and optimize the loss w.r.t. \(\mathbf {W}\) by stochastic gradient descent.
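As a concrete illustration, the sketch below instantiates this two-step alternation, with scikit-learn's DBSCAN standing in for the clustering step (our experiments use HDBSCAN) and a generic loss on the pseudo-labeled data; all names, hyper-parameter values and helper functions here are illustrative assumptions, not the exact implementation.

```python
# Minimal sketch of the alternation for Eq. (1); DBSCAN stands in for the
# clustering step and a generic loss_fn for the target loss (illustrative only).
import torch
import torch.nn as nn
from sklearn.cluster import DBSCAN

def estimate_pseudo_labels(backbone: nn.Module, images: torch.Tensor) -> torch.Tensor:
    """Step 1: fix W and minimize over Y_hat by clustering the CNN features."""
    backbone.eval()
    with torch.no_grad():
        feats = nn.functional.normalize(backbone(images), dim=1)
    labels = DBSCAN(eps=0.5, min_samples=4).fit_predict(feats.cpu().numpy())
    return torch.as_tensor(labels)      # label -1 marks instances outside any cluster

def update_weights(model: nn.Module, images: torch.Tensor, pseudo: torch.Tensor,
                   loss_fn, lr: float = 6e-5, steps: int = 10) -> None:
    """Step 2: fix Y_hat and minimize over W by stochastic gradient descent."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    keep = pseudo >= 0                  # drop clustering outliers
    model.train()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(images[keep]), pseudo[keep])
        loss.backward()
        opt.step()

# The two steps are then alternated for a fixed number of outer iterations:
#   for it in range(num_iterations):
#       y_hat = estimate_pseudo_labels(backbone, target_images)
#       update_weights(model, target_images, y_hat, loss_fn)
```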

3.2 Mutual-Training with Collaborative Clustering

The problem with self-training based models  [8, 31] is that the quality (correctness) of the pseudo-labels generated by unsupervised clustering on the target data heavily affects the model performance. Although the deep learning model in self-training can avoid fitting noisy instances in the early stage of training due to the memorization effects of deep neural networks  [1] and improves progressively as more and more instances with high confidence are explored, there inevitably exist some label errors that cannot be corrected and are eventually overfitted as training proceeds. These accumulated errors ultimately impede the performance growth.

In order to reduce the label error accumulation throughout the training process, the proposed NRMT maintains two neural networks f parameterized by \(\mathbf{{W}}_f\) and g parameterized by \(\mathbf{{W}}_g\) simultaneously during training, and allows them to share clustering information by collaborative clustering at each iteration to reduce the effect of their respective label errors.

To make f and g have different learning abilities, we use different random seeds to pre-train f and g on the source dataset \(\mathbf{{X}}^s\) with labels \(\mathbf{{Y}}^s\) using the triplet loss and the Softmax loss  [31]. Here f and g have the same network architecture to facilitate deployment. Because deep neural networks are highly non-convex models, different initializations can still lead to different local optima even with the same architecture and optimization algorithm  [14]. Then, we use the pre-trained f and g to extract features on the target dataset \(\mathbf{{X}}^t\) and obtain two sets of pseudo-labels \(\mathbf{{\hat{Y}}}_f^t\) and \(\mathbf{{\hat{Y}}}_g^t\) by applying clustering to the features. Since the target domain has classes different from the source domain, we drop the Softmax loss and fine-tune the networks on the target data using only the triplet loss with the pseudo-labels. To share clustering information, f and g consider both their own pseudo-labels and those of their peer networks. Thus, we have a joint loss function for each network:

$$\begin{aligned} {{\mathcal {L}}_f}&= {\mathcal {L}}_{tri}(\mathbf{{\hat{Y}}}_f^t,f(\mathbf{{X}}^t;\mathbf{{W}}_f)) + {\mathcal {L}}_{tri}(\mathbf{{\hat{Y}}}_g^t,f(\mathbf{{X}}^t;\mathbf{{W}}_f)), \end{aligned}$$
(2)
$$\begin{aligned} {{\mathcal {L}}_g}&= {\mathcal {L}}_{tri}(\mathbf{{\hat{Y}}}_g^t,g(\mathbf{{X}}^t;\mathbf{{W}}_g)) + {\mathcal {L}}_{tri}(\mathbf{{\hat{Y}}}_f^t,g(\mathbf{{X}}^t;\mathbf{{W}}_g)), \end{aligned}$$
(3)

where \({\mathcal {L}}_{tri}\) is the batch-sampling triplet loss  [16].

Different from self-training, where the network assigns new pseudo-labels to the training instances at each iteration only according to its own parameter update, in NRMT the two networks f and g collaboratively assign pseudo-labels to make the learning more robust, i.e., each instance receives two pseudo-labels, one from f and one from g. The study on memorization in deep networks  [1] suggests that deep networks tend to prioritize learning easy patterns. Noisy instances caused by clustering are usually relatively hard examples; thus, if an instance is assigned two labels, a network will fit the clean (easy) one first to become robust, and the error may be eliminated at the next iteration. The joint loss functions in Eq. (2) and Eq. (3) are similar to Co-training  [2], where classifiers are trained on two views (two independent sets of features). However, here we have two networks but only a single view, and we utilize the memorization effect of deep networks to handle label errors.
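For concreteness, the sketch below shows one way to realize the joint loss of Eq. (2) in PyTorch, assuming a batch-hard variant of the triplet loss as \({\mathcal {L}}_{tri}\); the helper names and the margin value are illustrative assumptions rather than the exact implementation, and the loss for g in Eq. (3) is symmetric.

```python
# Minimal sketch of Eq. (2): the features of f are supervised by f's own
# pseudo-labels and by those of its peer g (a batch-hard triplet loss is
# assumed as the batch-sampling triplet loss).
import torch

def batch_hard_triplet_loss(feats: torch.Tensor, labels: torch.Tensor,
                            margin: float = 0.5) -> torch.Tensor:
    """For each anchor, use the hardest positive and hardest negative in the batch."""
    dist = torch.cdist(feats, feats)                        # (B, B) pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)       # same-pseudo-label mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=feats.device)
    hardest_pos = (dist * (same & ~eye).float()).max(dim=1).values
    hardest_neg = (dist + 1e6 * same.float()).min(dim=1).values
    return torch.clamp(hardest_pos - hardest_neg + margin, min=0).mean()

def joint_loss_f(feats_f: torch.Tensor, pseudo_f: torch.Tensor,
                 pseudo_g: torch.Tensor, margin: float = 0.5) -> torch.Tensor:
    """L_f = L_tri(Y_f, f(X)) + L_tri(Y_g, f(X)); Eq. (3) for g is symmetric."""
    return (batch_hard_triplet_loss(feats_f, pseudo_f, margin)
            + batch_hard_triplet_loss(feats_f, pseudo_g, margin))
```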

3.3 Mutual Instance Selection

Although collaborative clustering across networks is able to ease the fitting to noisy instances at each iteration, these noisy instances still have an impact on the network training within a mini-batch, especially in the later stages of training. To further select reliable and informative instances in a mini-batch, we introduce a mutual instance selection strategy that considers both the peer-confidence and the relationship disagreement of the two networks.

Reliable Instance Selection by Peer-Confidence. In order to select reliable instances for training, we use the prediction confidence of the peer network to measure the reliability of instances for one network. We argue that, in metric learning, the relationship of one pair of instances with other pairs in the feature space provides more information about the network prediction than the distance between two individual instances. Thus, we compute the prediction confidence based on the relationship within a triplet of instances.

Given an instance x, its corresponding positive instance \(x_p\) and negative instance \(x_n\) from a mini-batch, we encode the relationship of the triplet \(\{x, x_p, x_n\}\) by the difference between the Euclidean distances of the positive and negative pairs in the feature space:

$$\begin{aligned} \mathcal {D}(x, x_p, x_n;f)&= ||f({x}) - f({x_p})|{|_2} - ||f({x}) - f({x_n})|{|_2}, \end{aligned}$$
(4)
$$\begin{aligned} \mathcal {D}(x, x_p, x_n;g)&= ||g({x}) - g({x_p})|{|_2} - ||g({x}) - g({x_n})|{|_2}, \end{aligned}$$
(5)

where f(x) and g(x) are the features extracted by the networks f and g, respectively. The smaller the difference is, the higher the confidence is. If the difference computed by the peer network g of f (resp. f of g) for the triplet \(\{x, x_p, x_n\}\) is smaller than a threshold \(T_c\):

$$\begin{aligned} \mathcal {D}({x},{x_p},{x_n};g) < T_c, \end{aligned}$$
(6)
$$\begin{aligned} \quad \text {resp.} \ \mathcal {D}({x},{x_p},{x_n};f) < T_c, \end{aligned}$$
(7)

we call \(\{x, x_p, x_n\}\) a peer-confident triplet of instances for f (resp. g) and use this peer-confident triplet to update f (resp. g). Because the two networks have different learning abilities, we expect that they can filter out different noisy instances  [14] and thus make up for each other's mistakes.
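The peer-confidence test of Eqs. (4)-(6) can be expressed compactly on batches of triplet features, as in the sketch below; the tensor names and the default threshold value are illustrative assumptions.

```python
# Minimal sketch of Eqs. (4)-(6): a triplet is kept for updating f when its
# peer g is confident on it, i.e. D(x, x_p, x_n; g) < T_c (names illustrative).
import torch

def triplet_relation(anchor: torch.Tensor, pos: torch.Tensor,
                     neg: torch.Tensor) -> torch.Tensor:
    """D(x, x_p, x_n; net) = ||x - x_p||_2 - ||x - x_n||_2, as in Eq. (4)/(5)."""
    return torch.norm(anchor - pos, dim=-1) - torch.norm(anchor - neg, dim=-1)

def peer_confident_for_f(anchor_g: torch.Tensor, pos_g: torch.Tensor,
                         neg_g: torch.Tensor, t_c: float = 1.0) -> torch.Tensor:
    """Eq. (6): boolean mask of triplets that g considers reliable for updating f."""
    return triplet_relation(anchor_g, pos_g, neg_g) < t_c
```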

Informative Instance Selection by Relationship Disagreement. The peer-confidence of the network can pick up reliable (clean) instances in a mini-batch, but these usually include many easy instances that provide limited information for improving the network. To further select more informative instances, we propose to use the relationship disagreement between one network and its peer to measure the amount of information on top of the peer-confidence.

Similar to the peer-confidence, we compute the relationship disagreement on a triplet of instances. We first define the prediction inconsistency of the two networks f and g combined with Eq. (4) and Eq. (5) as:

$$\begin{aligned} \mathcal {I}({x},{x_p},{x_n};f,g) = \mathcal {D}(x, x_p, x_n;f) - \mathcal {D}(x, x_p, x_n;g). \end{aligned}$$
(8)

A larger absolute value of the inconsistency indicates that the triplet of instances carries more information. We consider that there is a relationship disagreement between the predictions of the two networks for the triplet \(\{x, x_p, x_n\}\) if the absolute value of the prediction inconsistency exceeds a threshold \(T_d\):

$$\begin{aligned} |\mathcal {I}({x},{x_p},{x_n};f,g)| > T_d \end{aligned}$$
(9)

The networks are only updated on the mini-batch data exhibiting relationship disagreement. Furthermore, when combined with the peer-confidence, Eq. (9) can be rewritten with the absolute value removed:

$$\begin{aligned}&\mathcal {I}({x},{x_p},{x_n};f,g) > T_d, \end{aligned}$$
(10)
$$\begin{aligned}&\mathcal {I}({x},{x_p},{x_n};g,f) > T_d. \end{aligned}$$
(11)

The intuition is as follows: if the term inside the absolute value in Eq. (9) is smaller than \(-T_d\), then, since \(T_d\) is non-negative and \(\{x, x_p, x_n\}\) satisfies the peer-confidence condition in Eq. (6) or Eq. (7), we have

$$\begin{aligned} \mathcal {D}(x, x_p, x_n;f)< \mathcal {D}(x, x_p, x_n;g) - T_d< \mathcal {D}(x, x_p, x_n;g) < T_c, \end{aligned}$$
(12)
$$\begin{aligned} \text {or} \ \mathcal {D}(x, x_p, x_n;g)< \mathcal {D}(x, x_p, x_n;f) - T_d< \mathcal {D}(x, x_p, x_n;f) < T_c. \end{aligned}$$
(13)

As a result, when \(T_c\) is set to an appropriately small value, the triplet \(\{x, x_p, x_n\}\) is actually an easy instance for the network f or g and can be ignored during training. Figure 2 illustrates the three types of triplets of instances obtained by the proposed mutual instance selection strategy, where we consider instance selection for the network f according to the prediction of the network g.
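Combining the two criteria, a triplet is used to update f only when g is confident on it (Eq. (6)) and the two networks disagree sufficiently on it (Eq. (10)). The sketch below expresses this selection on the precomputed Eq. (4)/(5) values; the default thresholds and names are illustrative assumptions.

```python
# Minimal sketch combining Eq. (6) with Eq. (10); d_f and d_g are the Eq. (4)/(5)
# values for a batch of triplets (e.g. from the previous sketch).
import torch

def select_for_f(d_f: torch.Tensor, d_g: torch.Tensor,
                 t_c: float = 1.0, t_d: float = 0.5) -> torch.Tensor:
    reliable = d_g < t_c               # Eq. (6): peer-confidence of g
    informative = (d_f - d_g) > t_d    # Eq. (10): relationship disagreement
    return reliable & informative      # boolean mask over mini-batch triplets

# The mask for updating g is symmetric: (d_f < t_c) & ((d_g - d_f) > t_d),
# corresponding to Eq. (7) and Eq. (11).
```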

Fig. 2. Three types of triplets of instances obtained by the proposed mutual instance selection strategy. Different shapes (circle, triangle and square) denote different ground truth class labels and different colors (blue, green and yellow) denote different pseudo-labels. (a) Noisy triplet of instances, obtained by \(\mathcal {D}({x},{x_p},{x_n};g) \ge T_c\); (b) Reliable but easy triplet of instances, obtained by \(\mathcal {D}({x},{x_p},{x_n};g) < T_c\) but \(\mathcal {I}({x},{x_p},{x_n};f,g) \le T_d\); (c) Reliable and informative triplet of instances, obtained by \(\mathcal {D}({x},{x_p},{x_n};g) < T_c\) and \(\mathcal {I}({x},{x_p},{x_n};f,g) > T_d\). (Best viewed in color.)

For clarity, the training process of NRMT is summarized in Algorithm 1. It is worth noting that the two networks are only maintained during training, and their performance converges to a similar level through the information interaction. Thus, either of the two networks can be used for deployment in practice.

Algorithm 1. The training process of NRMT.

4 Experiments

In this section, we evaluate the proposed NRMT on three large-scale person re-ID datasets, i.e., Market-1501  [52], DukeMTMC-reID  [26, 54] and MSMT17  [39]. Performance is reported in terms of the Cumulative Matching Characteristic (CMC) and mean Average Precision (mAP) under the single-query setting.

4.1 Datasets

Market-1501  [52] contains 32,668 labeled images of 1,501 identities. 12,936 images of 751 identities form the training set. 3,368 query images from the other 750 identities and 19,732 gallery images (with 2,793 distractors) are used as the test set. The bounding boxes of persons are generated by Deformable Part Model (DPM)  [9]. DukeMTMC-reID  [26, 54] includes 36,411 labeled images of 1,404 identities. 702 identities are randomly selected for training and the rest is used for testing. There are 16,522 training images, 2,228 query images and 17,661 gallery images. MSMT17  [39] is the largest re-ID dataset consisting of 126,441 bounding boxes of 4,101 identities taken by 12 outdoor and 3 indoor cameras. 32,621 images of 1,041 identities are used for training.

4.2 Implementation Details

We adopt ResNet-50  [15] as the architecture of both networks and initialize them with parameters pre-trained on ImageNet  [6]. All images are resized to 256\(\times \)128. Random horizontal flipping and random erasing  [55] are employed for training data augmentation. We use the Softmax and triplet losses to pre-train the two networks on the source dataset with different random seeds. The margin m in the triplet loss is 0.5. For each mini-batch, we randomly sample 32 identities and 4 images per identity. The SGD optimizer with a momentum of 0.9 is used to train the networks with a learning rate of 6e-5.
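As an illustration of the identity-balanced mini-batch sampling described above (32 identities with 4 images each), a simple sketch is given below; the data layout and function name are assumptions, not the actual sampler used in training.

```python
# Minimal sketch of P x K mini-batch sampling (P = 32 identities, K = 4 images
# each); pseudo_labels is an iterable of int pseudo-identities, -1 for outliers.
import random
from collections import defaultdict

def pk_batch(pseudo_labels, p: int = 32, k: int = 4):
    by_id = defaultdict(list)
    for idx, pid in enumerate(pseudo_labels):
        if pid >= 0:                                       # skip clustering outliers
            by_id[pid].append(idx)
    candidates = [pid for pid, idxs in by_id.items() if len(idxs) >= k]
    ids = random.sample(candidates, p)                     # assumes >= p valid identities
    return [idx for pid in ids for idx in random.sample(by_id[pid], k)]
```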

The peer-confidence threshold \(T_c\) is set to 1.0 and the relationship disagreement threshold \(T_d\) is set to 0.5. The HDBSCAN clustering algorithm  [4] is adopted to produce pseudo-labels at each iteration, as it does not require the number of clusters as a prior parameter. The number of minimum samples for each cluster is set to 8. The maximal number of iterations is 30. During the first half of the iterative process, we train the networks using only collaborative clustering. We then add mutual instance selection to further select clean and informative data in mini-batches for the network update.
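A minimal sketch of this pseudo-label assignment step with the hdbscan package is shown below; whether the "number of minimum samples" above maps to min_cluster_size or min_samples in that API is our assumption.

```python
# Minimal sketch of clustering target features into pseudo-labels with HDBSCAN;
# min_cluster_size = 8 is assumed to correspond to the setting described above.
import numpy as np
import hdbscan

def cluster_pseudo_labels(features: np.ndarray, min_cluster_size: int = 8) -> np.ndarray:
    """HDBSCAN infers the number of clusters; label -1 marks unclustered instances."""
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size, metric="euclidean")
    return clusterer.fit_predict(features)

# usage: labels = cluster_pseudo_labels(target_features); keep = labels >= 0
```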

Table 1. Evaluation on different values of the threshold \(T_c\). Results of the two networks f and g are reported, respectively.
Table 2. Evaluation on different values of the threshold \(T_d\). Results of the two networks f and g are reported, respectively.
Table 3. Evaluation on different numbers of the minimum samples for each cluster in HDBSCAN. Results of the two networks f and g are reported, respectively.

4.3 Parameter Analysis

We first study the impact of several important parameter settings in the proposed NRMT, including the peer-confidence threshold \(T_c\), the relationship disagreement threshold \(T_d\) and the number of minimum samples in the HDBSCAN clustering algorithm.

Peer-Confidence Threshold \(T_c\) . To analyze the impact of \(T_c\) in Eq. (6) and Eq. (7), we fix the relationship disagreement threshold \(T_d=0.5\) in all experiments. The results are listed in Table 1. We observe that a proper value of \(T_c\), which provides a reasonable assessment of instance reliability, is important for NRMT to filter out noisy instances. The best performance is achieved when \(T_c\) is set to 1.0.

Relationship Disagreement Threshold \(T_d\) . We also conduct experiments to investigate the impact of \(T_d\) in Eq. (10) and Eq. (11). In all experiments, we fix the peer-confidence threshold \(T_c = 1.0\). As reported in Table 2, the best results are obtained when \(T_d = 0.5\). When \(T_d\) is set to a larger value, fewer instances are selected for the update, which is likely to discard instances that are actually informative. Too small a value of \(T_d\) allows most of the instances to be involved in the update, including too many easy instances that provide little useful information for improving the network.

Number of Minimum Samples. To evaluate the influence of the number of minimum samples in HDBSCAN, we report the results for {6, 8, 10} minimum samples in Table 3. As we can see, a value of 8 yields the best performance. Note that NRMT is not very sensitive to this prior clustering parameter.

Table 4. Performance evaluation of components in the proposed NRMT on Market-1501 and DukeMTMC-reID. Separate Training: Train the two networks separately. CC: Collaborative clustering. SC: Instance selection by the peer-confidence. SD: Instance selection by the relationship disagreement. Results of the two networks f and g are reported, respectively.

4.4 Ablation Study

We further validate the effectiveness of each component in the proposed NRMT, including collaborative clustering and instance selection by peer-confidence and by relationship disagreement, on Market-1501 and DukeMTMC-reID. The results are shown in Table 4. By sharing clustering information between the two networks on the whole dataset, “Ours w/ CC” improves the performance of both networks compared with “Separate Training”. This demonstrates that collaborative clustering is able to ease the fitting to noisy instances caused by unsupervised clustering by exploiting the different learning abilities of the two networks and the memorization effect of deep networks. “Ours w/ CC+SC” and “Ours w/ CC+SC+SD” obtain further gains through prediction information interaction between the networks within mini-batches, which picks up clean and informative instances for updating the networks.

Fig. 3. Comparison on the accuracy of pseudo-labels in the iteration process for DukeMTMC-reID \(\rightarrow \) Market-1501.

Fig. 4. Examples of (a) clean and informative, (b) noisy and (c) easy triplets of instances obtained by the proposed mutual instance selection strategy in a mini-batch. Only the clean and informative triplets are used for the network update. In each triplet, the first two images are positive examples and the last is a negative example.

To explore the ability of collaborative clustering to correct label errors, Fig. 3 shows the accuracy of the pseudo-labels generated by clustering over the iteration process. It can be seen that the pseudo-label accuracies of the two networks f and g trained with collaborative clustering are both significantly higher than those of the networks trained separately. This shows that sharing clustering information between the two networks on the whole dataset can effectively correct label errors at each iteration and reduce the accumulation of noise during training.

In Fig. 4, we show examples of clean and informative, noisy, and easy triplets of instances obtained by the proposed mutual instance selection strategy. We observe that the clean and informative triplets selected by our strategy contain negative examples with similar appearances and positive examples with large variations. Meanwhile, our strategy filters out not only noisy triplets but also easy triplets. This indicates that our strategy acts as robust online hard example mining for the triplet loss when training with noisy labels.

Table 5. Comparison with the state-of-the-art UDA methods on Market-1501 and DukeMTMC-reID. The averaged performance of the two networks f and g is reported.
Table 6. Comparison with the state-of-the-arts on transfers from DukeMTMC-reID and Market-1501 to MSMT17.

4.5 Comparison with State-of-the-art Methods

In this section, we compare the proposed NRMT with state-of-the-art unsupervised person re-ID methods on the transfers between DukeMTMC-reID and Market-1501 and the transfers from DukeMTMC-reID/Market-1501 to MSMT17. Here we report the average performance of the two networks f and g in NRMT.

Table 5 shows the results on the transfers between DukeMTMC-reID and Market-1501. We first compare the proposed NRMT with two hand-crafted features, i.e., LOMO  [18] and Bag-of-Words (BoW)  [52]. We can see that deep learning features significantly improve the performance. Three unsupervised methods, including UMDL  [24], PUL  [8] and DECAMEL  [45], are also compared. Our method surpasses these methods by a large margin by progressively adapting from the source data to the target data. We further compare with unsupervised domain adaptation methods, including UDAP  [31], MAR  [46], ECN  [57], PCB-R-PAST  [49], SSG  [11] and ACT  [43], among others; our method still achieves the best performance. In particular, our NRMT outperforms PCB-R-PAST  [49], which also focuses on improving label quality, by 17.1%/9.4% on mAP/Rank-1 accuracy for DukeMTMC-reID \(\rightarrow \) Market-1501 and by 7.9%/5.4% for Market-1501 \(\rightarrow \) DukeMTMC-reID. This demonstrates the effectiveness of information interaction between dual networks for noise reduction. Moreover, our NRMT also exceeds the second best method, ACT  [43], by clear margins.

We also evaluate our NRMT on the transfers from DukeMTMC-reID and Market-1501 to MSMT17 in Table 6. The results obtained by NRMT are 20.6%/45.2% on mAP/Rank-1 accuracy for DukeMTMC-reID \(\rightarrow \) MSMT17 and 19.8%/43.7% for Market-1501 \(\rightarrow \) MSMT17, both of which exceed the second best method, i.e., SSG  [11]. This further demonstrates the superiority of our NRMT on the large-scale dataset.

5 Conclusions

This paper proposed a noise resistible mutual-training method (NRMT) for unsupervised domain adaptation (UDA) in person re-ID to effectively depress label noise in a self-training process. In NRMT, two networks are maintained during training. At each iteration, the two networks share clustering information to ease the fitting to noisy instances. For each mini-batch update, the networks also exchange prediction information to further select reliable and informative instances. Extensive experimental results demonstrate that the proposed NRMT achieves state-of-the-art performance for UDA in person re-ID.