
1 Introduction

Person re-identification (re-ID), which aims at retrieving images of the same person from a database given a query person image, has advanced considerably in recent years thanks to the power of deep learning  [19, 29, 32, 34, 35, 48, 50, 51, 53, 58]. However, due to the problem of domain shift  [17], a deep re-ID model that performs well in a source domain may suffer a significant performance drop when applied to a target domain. Moreover, labels for the target data are usually difficult to obtain in practice, which hinders supervised fine-tuning of the deep model on that data.

Fig. 1. Overview of the proposed Noise Resistible Mutual-Training (NRMT). NRMT maintains two networks during training, which perform collaborative clustering to ease the fitting to noisy instances and mutual instance selection to further select reliable and informative instances for the network update.

To learn a deep re-ID model that generalizes well in the target domain without using labels from this domain, unsupervised domain adaptation (UDA) approaches have been proposed that leverage labeled source data and unlabeled target data  [5, 21, 24, 45, 56, 57]. Different from the traditional setting of UDA, which assumes that the source and target domains share the same classes, UDA in person re-ID is an open-set scenario, i.e., the two domains have entirely different person identities (classes), making it a more challenging task.

Self-training is an effective strategy for UDA in person re-ID  [8, 11, 31, 49], which performs clustering with the pre-trained source model to assign pseudo-labels to samples of the target dataset, then alternately updates the model with the pseudo-labels on target data and re-assigns the labels with the updated model, so that the model adapts to the target data progressively. In the early stage of training, pseudo-labels assigned by clustering usually contain substantial noise due to the divergence between the source and target domains. The model can correct some of these errors by learning from clean labels. However, as training iterations proceed, some noisy instances are fitted by the model and can no longer be corrected. This accumulated noise eventually harms the performance of the self-training model on the target data.

In order to address the problem mentioned above, we propose Noise Resistible Mutual-Training (NRMT) to effectively reduce the impact of noisy instances throughout the training process by leveraging dual networks with information interaction. As shown in Fig. 1, NRMT maintains two networks during training, which perform collaborative clustering to ease the fitting to noisy instances and mutual instance selection to further select reliable and informative instances for the network update. We argue that, in the iterative self-training process, there always exist some noisy instances that a single network cannot identify by itself. Inspired by deep learning with noisy labels  [14, 22], we use another network with a different learning ability to assist in correcting pseudo-label errors.

Specifically, at each iteration, collaborative clustering allows the two networks to not only learn from their respective pseudo-labels but also exploit the ones provided by each other as additional supervision. For one network, its peer network can provide different labels for the same instances due to its different learning ability. Although these labels also contain noise, they can still be used to reduce the effect of label errors of a single network, because deep neural networks tend to fit easy (more likely to be correct) instances first  [1]. For each mini-batch, mutual instance selection is introduced to further filter out noisy instances while keeping informative instances. Here, the reliability of a triplet of instances is assessed for one network according to the prediction confidence of its peer network on this triplet. Informative instances are also important for improving the network performance. Thus, we further measure the amount of information of the triplet by the relationship disagreement of the predictions across the networks. Combining collaborative clustering at each iteration with mutual instance selection within each mini-batch, the proposed NRMT can effectively depress noise in pseudo-labels and improve the performance of both networks.

Our main contributions can be summarized as follows: 1) We present a novel noise resistible mutual-training method for unsupervised domain adaptation in person re-ID, which exploits dual-network interaction to depress noise in the pseudo-labels of unsupervised iterative training on the target data. 2) We introduce collaborative clustering, which eases the fitting to noisy instances by exploiting the memorization effects of deep networks. 3) We propose mutual instance selection, which uses the peer-confidence and relationship disagreement of the networks on triplets of instances to select reliable and informative instances within each mini-batch.

2 Related Work

Unsupervised Domain Adaptation. Our work is related to unsupervised domain adaptation (UDA)  [3, 28, 36, 37]. Some methods have been proposed to match distributions between the source and target domains  [20, 33]. Long et al.  [20] embed features of task-specific layers in a reproducing kernel Hilbert space to explicitly match the mean embeddings of different domain distributions. Sun et al.  [33] propose to learn a linear transformation that aligns the second-order statistics of feature distributions between the two domains. Several works instead learn domain-invariant features  [12, 37]. Ganin et al.  [12] introduce a gradient reversal layer to learn domain-invariant features via an adversarial loss. The aforementioned methods only consider the closed-set scenario. Recently, some works have been introduced to address open-set domain adaptation  [10, 23, 27], where several classes are unknown in the two domains (or in the target domain). However, for UDA in person re-ID, the classes of the two domains are entirely different, which presents a greater challenge.

UDA for Person re-ID. Many works have been proposed for unsupervised cross-domain person re-ID  [5, 24, 25, 30, 31, 38, 40, 41, 42, 44, 46, 56, 57]. Some of them focus on image-level domain invariance. Wei et al.  [39] propose a person transfer generative adversarial network to bridge the domain gap, which considers both style transfer and person identity preservation. Deng et al.  [7] generate target image samples through the coordination between a CycleGAN and a Siamese network. Several works also try to improve model generalization from the perspective of feature learning. Wang et al.  [38] establish an identity-discriminative and attribute-sensitive feature representation space that is transferable to any new (unseen) target domain. Qi et al.  [25] develop a camera-aware domain adaptation method to reduce the discrepancy across sub-domains in cameras and utilize the temporal continuity in each camera to provide discriminative information.

Recently, some methods have been developed based on the self-training framework. Fu et al.  [11] present a self-similarity grouping approach that explores potential similarities using both global and local appearance cues. Zhang et al.  [49] propose a self-training method with a progressive augmentation framework that offers complementary data information through different learning strategies. In contrast, our method provides complementary information through dual-network interaction. Ge et al.  [13] present a mutual mean-teaching framework to softly refine the pseudo-labels in the target domain. Note that our method and  [13] are complementary and can be combined.

Deep Learning with Noisy Labels. Several works aim at improving the training of deep models with noisy labels. Decoupling  [22] trains two networks simultaneously and updates the models only using the instances on which the two networks make different predictions. Co-teaching  [14] selects the small-loss instances of each network as useful knowledge and transfers them to the peer network for further training. Yu et al.  [47] combine the disagreement strategy with Co-teaching, training two deep neural networks with a disagreement-update step (data update) and a cross-update step (parameter update). These methods mainly focus on the classification problem and cannot be directly applied to the metric learning problem in our task.

3 Our Method

Given a labeled training dataset \(\{\mathbf{{X}}^s, \mathbf{{Y}}^s\}\) from the source domain and an unlabeled training dataset \(\mathbf{{X}}^t\) from the target domain, where person identities are different from those in the source domain, we aim to learn discriminative feature representations for the target testing dataset. In this section, we present the proposed Noise Resistible Mutual-Training (NRMT) method, which incorporates the interaction of dual networks to depress noise in the pseudo-labels produced by unsupervised clustering in a self-training process. We now explain each component of NRMT in detail.

3.1 Self-training with Clustering

Since the ground truth labels of the target person images are not available, one way to fine-tune the target model is to consider the target labels as latent variables that can be inferred in the learning process. Thus, a typical self-training framework for unsupervised domain adaptation aims to minimize the following loss function:

$$\begin{aligned} \mathop {\min }\limits _{\mathbf{{\hat{Y}}}^t,\mathbf{{W}}} {\mathcal {L}}(\mathbf{{\hat{Y}}}^t,f(\mathbf{{X}}^t;\mathbf{{W}})), \end{aligned}$$
(1)

where \(\mathbf{{\hat{Y}}}^t\) denotes the estimated target labels, \(\mathbf{{X}}^t\) is the set of target images and f denotes the target model parameterized by \(\mathbf{{W}}\).

In the case of person re-ID, the source and target domains do not share a common label space. Thus, one cannot directly apply the classifier trained on the source dataset to estimate the target identities. Similar to  [8, 31], we perform clustering on CNN features to assign pseudo-labels to the instances with the most confident predictions and assume that they are mostly correct. Once the target model is updated with these pseudo-labels, the remaining, less confident instances are progressively explored as the model adapts better to the target domain. Therefore, to minimize the loss function in Eq. (1), we first initialize the model parameters \(\mathbf {W}\) on the source data \(\{\mathbf{{X}}^s, \mathbf{{Y}}^s\}\) and then apply an alternating block coordinate descent algorithm: 1) Fix \(\mathbf {W}\) and minimize the loss w.r.t. \(\mathbf{{\hat{Y}}}^t\) through clustering. 2) Fix \(\mathbf{{\hat{Y}}}^t\) and optimize the loss w.r.t. \(\mathbf {W}\) by stochastic gradient descent.
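As a concrete illustration, the sketch below instantiates this two-step alternation, with scikit-learn's DBSCAN standing in for the clustering step (our experiments use HDBSCAN) and a generic loss on the pseudo-labeled data; all names, hyper-parameter values and helper functions here are illustrative assumptions, not the exact implementation.

```python
# Minimal sketch of the alternation for Eq. (1); DBSCAN stands in for the
# clustering step and a generic loss_fn for the target loss (illustrative only).
import torch
import torch.nn as nn
from sklearn.cluster import DBSCAN

def estimate_pseudo_labels(backbone: nn.Module, images: torch.Tensor) -> torch.Tensor:
    """Step 1: fix W and minimize over Y_hat by clustering the CNN features."""
    backbone.eval()
    with torch.no_grad():
        feats = nn.functional.normalize(backbone(images), dim=1)
    labels = DBSCAN(eps=0.5, min_samples=4).fit_predict(feats.cpu().numpy())
    return torch.as_tensor(labels)      # label -1 marks instances outside any cluster

def update_weights(model: nn.Module, images: torch.Tensor, pseudo: torch.Tensor,
                   loss_fn, lr: float = 6e-5, steps: int = 10) -> None:
    """Step 2: fix Y_hat and minimize over W by stochastic gradient descent."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    keep = pseudo >= 0                  # drop clustering outliers
    model.train()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(images[keep]), pseudo[keep])
        loss.backward()
        opt.step()

# The two steps are then alternated for a fixed number of outer iterations:
#   for it in range(num_iterations):
#       y_hat = estimate_pseudo_labels(backbone, target_images)
#       update_weights(model, target_images, y_hat, loss_fn)
```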

3.2 Mutual-Training with Collaborative Clustering

The problem with self-training based models  [8, 31] is that the quality (correctness) of the pseudo-labels generated by unsupervised clustering on the target data heavily affects the model performance. Although the deep learning model in self-training can avoid fitting noisy instances in the early stage of training due to the memorization effects of deep neural networks  [1] and improves progressively as more and more instances with high confidence are explored, there inevitably exist some label errors that cannot be corrected and are eventually overfitted as training proceeds. These accumulated errors ultimately impede the performance growth.

In order to reduce the label error accumulation throughout the training process, the proposed NRMT maintains two neural networks f parameterized by \(\mathbf{{W}}_f\) and g parameterized by \(\mathbf{{W}}_g\) simultaneously during training, and allows them to share clustering information by collaborative clustering at each iteration to reduce the effect of their respective label errors.

To make f and g have different learning abilities, we use different random seeds to pre-train f and g on the source dataset \(\mathbf{{X}}^s\) with labels \(\mathbf{{Y}}^s\) using the triplet loss and the Softmax loss  [31]. Here f and g have the same network architecture to facilitate deployment. Because deep neural networks are highly non-convex models, different initializations can still lead to different local optima even with the same architecture and optimization algorithm  [14]. Then, we use the pre-trained f and g to extract features on the target dataset \(\mathbf{{X}}^t\) and obtain two sets of pseudo-labels \(\mathbf{{\hat{Y}}}_f^t\) and \(\mathbf{{\hat{Y}}}_g^t\) by applying clustering to the features. Since the target domain has classes different from the source domain, we drop the Softmax loss and fine-tune the networks on the target data using only the triplet loss with the pseudo-labels. To share clustering information, f and g consider both their own pseudo-labels and those of their peer networks. Thus, we have a joint loss function for each network:

$$\begin{aligned} {{\mathcal {L}}_f}&= {\mathcal {L}}_{tri}(\mathbf{{\hat{Y}}}_f^t,f(\mathbf{{X}}^t;\mathbf{{W}}_f)) + {\mathcal {L}}_{tri}(\mathbf{{\hat{Y}}}_g^t,f(\mathbf{{X}}^t;\mathbf{{W}}_f)), \end{aligned}$$
(2)
$$\begin{aligned} {{\mathcal {L}}_g}&= {\mathcal {L}}_{tri}(\mathbf{{\hat{Y}}}_g^t,g(\mathbf{{X}}^t;\mathbf{{W}}_g)) + {\mathcal {L}}_{tri}(\mathbf{{\hat{Y}}}_f^t,g(\mathbf{{X}}^t;\mathbf{{W}}_g)), \end{aligned}$$
(3)

where \({\mathcal {L}}_{tri}\) is the batch-sampling triplet loss  [16].

Different from self-training, where the network assigns new pseudo-labels to the training instances at each iteration only according to its own parameter update, in NRMT the two networks f and g collaboratively assign pseudo-labels to make the learning more robust, i.e., each instance receives two pseudo-labels, one from f and one from g. The study on memorization in deep networks  [1] suggests that deep networks tend to prioritize learning easy patterns. Noisy instances caused by clustering are usually relatively hard examples; thus, if an instance is assigned two labels, a network will fit the clean (easy) one first to become robust, and the error may be eliminated at the next iteration. The joint loss functions in Eq. (2) and Eq. (3) are similar to Co-training  [2], where classifiers are trained on two views (two independent sets of features). However, here we have two networks but only a single view, and we utilize the memorization effect of deep networks to handle label errors.
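For concreteness, the sketch below shows one way to realize the joint loss of Eq. (2) in PyTorch, assuming a batch-hard variant of the triplet loss as \({\mathcal {L}}_{tri}\); the helper names and the margin value are illustrative assumptions rather than the exact implementation, and the loss for g in Eq. (3) is symmetric.

```python
# Minimal sketch of Eq. (2): the features of f are supervised by f's own
# pseudo-labels and by those of its peer g (a batch-hard triplet loss is
# assumed as the batch-sampling triplet loss).
import torch

def batch_hard_triplet_loss(feats: torch.Tensor, labels: torch.Tensor,
                            margin: float = 0.5) -> torch.Tensor:
    """For each anchor, use the hardest positive and hardest negative in the batch."""
    dist = torch.cdist(feats, feats)                        # (B, B) pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)       # same-pseudo-label mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=feats.device)
    hardest_pos = (dist * (same & ~eye).float()).max(dim=1).values
    hardest_neg = (dist + 1e6 * same.float()).min(dim=1).values
    return torch.clamp(hardest_pos - hardest_neg + margin, min=0).mean()

def joint_loss_f(feats_f: torch.Tensor, pseudo_f: torch.Tensor,
                 pseudo_g: torch.Tensor, margin: float = 0.5) -> torch.Tensor:
    """L_f = L_tri(Y_f, f(X)) + L_tri(Y_g, f(X)); Eq. (3) for g is symmetric."""
    return (batch_hard_triplet_loss(feats_f, pseudo_f, margin)
            + batch_hard_triplet_loss(feats_f, pseudo_g, margin))
```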

3.3 Mutual Instance Selection

Although collaborative clustering across networks is able to ease the fitting to noisy instances at each iteration, these noisy instances still have an impact on the network training within a mini-batch, especially in the later stages of training. To further select reliable and informative instances in a mini-batch, we introduce a mutual instance selection strategy that considers both the peer-confidence and the relationship disagreement of the two networks.

Reliable Instance Selection by Peer-Confidence. In order to select reliable instances for training, we use the prediction confidence of the peer network to measure the reliability of instances for one network. We argue that, in metric learning, the relationship of one pair of instances with other pairs in the feature space provides more information about the network prediction than the distance between two individual instances. Thus, we compute the prediction confidence based on the relationship within a triplet of instances.

Given an instance x, its corresponding positive instance \(x_p\) and negative instance \(x_n\) from a mini-batch, we encode the relationship of the triplet \(\{x, x_p, x_n\}\) by the difference between the Euclidean distances of the positive and negative pairs in the feature space:

$$\begin{aligned} \mathcal {D}(x, x_p, x_n;f)&= ||f({x}) - f({x_p})|{|_2} - ||f({x}) - f({x_n})|{|_2}, \end{aligned}$$
(4)
$$\begin{aligned} \mathcal {D}(x, x_p, x_n;g)&= ||g({x}) - g({x_p})|{|_2} - ||g({x}) - g({x_n})|{|_2}, \end{aligned}$$
(5)

where f(x) and g(x) are the features extracted by the networks f and g, respectively. The smaller the difference is, the higher the confidence is. If the difference computed by the peer network g of f (resp. f of g) for the triplet \(\{x, x_p, x_n\}\) is smaller than a threshold \(T_c\):

$$\begin{aligned} \mathcal {D}({x},{x_p},{x_n};g) < T_c, \end{aligned}$$
(6)
$$\begin{aligned} \quad \text {resp.} \ \mathcal {D}({x},{x_p},{x_n};f) < T_c, \end{aligned}$$
(7)

we call \(\{x, x_p, x_n\}\) a peer-confident triplet of instances for f (resp. g) and use this peer-confident triplet to update f (resp. g). Because the two networks have different learning abilities, we expect that they can filter out different noisy instances  [14] and thus make up for each other's mistakes.
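The peer-confidence test of Eqs. (4)-(6) can be expressed compactly on batches of triplet features, as in the sketch below; the tensor names and the default threshold value are illustrative assumptions.

```python
# Minimal sketch of Eqs. (4)-(6): a triplet is kept for updating f when its
# peer g is confident on it, i.e. D(x, x_p, x_n; g) < T_c (names illustrative).
import torch

def triplet_relation(anchor: torch.Tensor, pos: torch.Tensor,
                     neg: torch.Tensor) -> torch.Tensor:
    """D(x, x_p, x_n; net) = ||x - x_p||_2 - ||x - x_n||_2, as in Eq. (4)/(5)."""
    return torch.norm(anchor - pos, dim=-1) - torch.norm(anchor - neg, dim=-1)

def peer_confident_for_f(anchor_g: torch.Tensor, pos_g: torch.Tensor,
                         neg_g: torch.Tensor, t_c: float = 1.0) -> torch.Tensor:
    """Eq. (6): boolean mask of triplets that g considers reliable for updating f."""
    return triplet_relation(anchor_g, pos_g, neg_g) < t_c
```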

Informative Instance Selection by Relationship Disagreement. The peer-confidence of the network can pick up reliable (clean) instances in a mini-batch, but these usually include many easy instances that provide limited information for improving the network. To further select more informative instances, we propose to use the relationship disagreement between one network and its peer to measure the amount of information on top of the peer-confidence.

Similar to the peer-confidence, we compute the relationship disagreement on a triplet of instances. We first define the prediction inconsistency of the two networks f and g combined with Eq. (4) and Eq. (5) as:

$$\begin{aligned} \mathcal {I}({x},{x_p},{x_n};f,g) = \mathcal {D}(x, x_p, x_n;f) - \mathcal {D}(x, x_p, x_n;g). \end{aligned}$$
(8)

A larger absolute value of the inconsistency indicates that the triplet of instances carries more information. We consider that there is a relationship disagreement between the predictions of the two networks for the triplet \(\{x, x_p, x_n\}\) if the absolute value of the prediction inconsistency exceeds a threshold \(T_d\):

$$\begin{aligned} |\mathcal {I}({x},{x_p},{x_n};f,g)| > T_d \end{aligned}$$
(9)

The networks are only updated on the mini-batch data exhibiting relationship disagreement. Furthermore, when combined with the peer-confidence, Eq. (9) can be rewritten with the absolute value removed:

$$\begin{aligned}&\mathcal {I}({x},{x_p},{x_n};f,g) > T_d, \end{aligned}$$
(10)
$$\begin{aligned}&\mathcal {I}({x},{x_p},{x_n};g,f) > T_d. \end{aligned}$$
(11)

The intuition is as follows: if the term inside the absolute value in Eq. (9) is smaller than \(-T_d\), then, since \(T_d\) is non-negative and \(\{x, x_p, x_n\}\) satisfies the peer-confidence condition in Eq. (6) or Eq. (7), we have

$$\begin{aligned} \mathcal {D}(x, x_p, x_n;f)< \mathcal {D}(x, x_p, x_n;g) - T_d< \mathcal {D}(x, x_p, x_n;g) < T_c, \end{aligned}$$
(12)
$$\begin{aligned} \text {or} \ \mathcal {D}(x, x_p, x_n;g)< \mathcal {D}(x, x_p, x_n;f) - T_d< \mathcal {D}(x, x_p, x_n;f) < T_c. \end{aligned}$$
(13)

As a result, when \(T_c\) is set to an appropriately small value, the triplet \(\{x, x_p, x_n\}\) is actually an easy instance for the network f or g and can be ignored during training. Figure 2 illustrates the three types of triplets of instances obtained by the proposed mutual instance selection strategy, where we consider instance selection for the network f according to the prediction of the network g.
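Combining the two criteria, a triplet is used to update f only when g is confident on it (Eq. (6)) and the two networks disagree sufficiently on it (Eq. (10)). The sketch below expresses this selection on the precomputed Eq. (4)/(5) values; the default thresholds and names are illustrative assumptions.

```python
# Minimal sketch combining Eq. (6) with Eq. (10); d_f and d_g are the Eq. (4)/(5)
# values for a batch of triplets (e.g. from the previous sketch).
import torch

def select_for_f(d_f: torch.Tensor, d_g: torch.Tensor,
                 t_c: float = 1.0, t_d: float = 0.5) -> torch.Tensor:
    reliable = d_g < t_c               # Eq. (6): peer-confidence of g
    informative = (d_f - d_g) > t_d    # Eq. (10): relationship disagreement
    return reliable & informative      # boolean mask over mini-batch triplets

# The mask for updating g is symmetric: (d_f < t_c) & ((d_g - d_f) > t_d),
# corresponding to Eq. (7) and Eq. (11).
```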

Fig. 2. Three types of triplets of instances obtained by the proposed mutual instance selection strategy. Different shapes (circle, triangle and square) denote different ground truth class labels and different colors (blue, green and yellow) denote different pseudo-labels. (a) Noisy triplet of instances, obtained by \(\mathcal {D}({x},{x_p},{x_n};g) \ge T_c\); (b) Reliable but easy triplet of instances, obtained by \(\mathcal {D}({x},{x_p},{x_n};g) < T_c\) but \(\mathcal {I}({x},{x_p},{x_n};f,g) \le T_d\); (c) Reliable and informative triplet of instances, obtained by \(\mathcal {D}({x},{x_p},{x_n};g) < T_c\) and \(\mathcal {I}({x},{x_p},{x_n};f,g) > T_d\). (Best viewed in color.)

For clarity, the training process of NRMT is summarized in Algorithm 1. It is worth noting that the two networks are only maintained during training, and their performance converges to a similar level through the information interaction. Thus, either of the two networks can be used for deployment in practice.

Algorithm 1. The training process of NRMT.

4 Experiments

In this section, we evaluate the proposed NRMT on three large-scale person re-ID datasets, i.e., Market-1501  [52], DukeMTMC-reID  [26, 54] and MSMT17  [39]. Performance is reported in terms of the Cumulative Matching Characteristic (CMC) and mean Average Precision (mAP) under the single-query setting.

4.1 Datasets

Market-1501  [52] contains 32,668 labeled images of 1,501 identities. 12,936 images of 751 identities form the training set. 3,368 query images from the other 750 identities and 19,732 gallery images (with 2,793 distractors) are used as the test set. The bounding boxes of persons are generated by Deformable Part Model (DPM)  [9]. DukeMTMC-reID  [26, 54] includes 36,411 labeled images of 1,404 identities. 702 identities are randomly selected for training and the rest is used for testing. There are 16,522 training images, 2,228 query images and 17,661 gallery images. MSMT17  [39] is the largest re-ID dataset consisting of 126,441 bounding boxes of 4,101 identities taken by 12 outdoor and 3 indoor cameras. 32,621 images of 1,041 identities are used for training.

4.2 Implementation Details

We adopt ResNet-50  [15] as the architecture of both networks and initialize them with parameters pre-trained on ImageNet  [6]. All images are resized to 256\(\times \)128. Random horizontal flipping and random erasing  [55] are employed for training data augmentation. We use the Softmax and triplet losses to pre-train the two networks on the source dataset with different random seeds. The margin m in the triplet loss is 0.5. For each mini-batch, we randomly sample 32 identities and 4 images per identity. The SGD optimizer with a momentum of 0.9 is used to train the networks with a learning rate of 6e-5.
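As an illustration of the identity-balanced mini-batch sampling described above (32 identities with 4 images each), a simple sketch is given below; the data layout and function name are assumptions, not the actual sampler used in training.

```python
# Minimal sketch of P x K mini-batch sampling (P = 32 identities, K = 4 images
# each); pseudo_labels is an iterable of int pseudo-identities, -1 for outliers.
import random
from collections import defaultdict

def pk_batch(pseudo_labels, p: int = 32, k: int = 4):
    by_id = defaultdict(list)
    for idx, pid in enumerate(pseudo_labels):
        if pid >= 0:                                       # skip clustering outliers
            by_id[pid].append(idx)
    candidates = [pid for pid, idxs in by_id.items() if len(idxs) >= k]
    ids = random.sample(candidates, p)                     # assumes >= p valid identities
    return [idx for pid in ids for idx in random.sample(by_id[pid], k)]
```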

The peer-confidence threshold \(T_c\) is set to 1.0 and the relationship disagreement threshold \(T_d\) is set to 0.5. The HDBSCAN clustering algorithm  [4] is adopted to produce pseudo-labels at each iteration, as it does not require the number of clusters as a prior parameter. The number of minimum samples for each cluster is set to 8. The maximal number of iterations is 30. During the first half of the iterative process, we train the networks using only collaborative clustering. We then add mutual instance selection to further select clean and informative data in mini-batches for the network update.
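A minimal sketch of this pseudo-label assignment step with the hdbscan package is shown below; whether the "number of minimum samples" above maps to min_cluster_size or min_samples in that API is our assumption.

```python
# Minimal sketch of clustering target features into pseudo-labels with HDBSCAN;
# min_cluster_size = 8 is assumed to correspond to the setting described above.
import numpy as np
import hdbscan

def cluster_pseudo_labels(features: np.ndarray, min_cluster_size: int = 8) -> np.ndarray:
    """HDBSCAN infers the number of clusters; label -1 marks unclustered instances."""
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size, metric="euclidean")
    return clusterer.fit_predict(features)

# usage: labels = cluster_pseudo_labels(target_features); keep = labels >= 0
```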

Table 1. Evaluation on different values of the threshold \(T_c\). Results of the two networks f and g are reported, respectively.
Table 2. Evaluation on different values of the threshold \(T_d\). Results of the two networks f and g are reported, respectively.
Table 3. Evaluation on different numbers of the minimum samples for each cluster in HDBSCAN. Results of the two networks f and g are reported, respectively.

4.3 Parameter Analysis

We first study the impact of several important parameter settings in the proposed NRMT, including the peer-confidence threshold \(T_c\), the relationship disagreement threshold \(T_d\) and the number of minimum samples in the HDBSCAN clustering algorithm.

Peer-Confidence Threshold \(T_c\) . To analyze the impact of \(T_c\) in Eq. (6) and Eq. (7), we fix the relationship disagreement threshold \(T_d=0.5\) in all experiments. The results are listed in Table 1. We observe that a proper value of \(T_c\), which provides a reasonable assessment of instance reliability, is important for NRMT to filter out noisy instances. The best performance is achieved when \(T_c\) is set to 1.0.

Relationship Disagreement Threshold \(T_d\) . We also conduct experiments to investigate the impact of \(T_d\) in Eq. (10) and Eq. (11). In all experiments, we fix the peer-confidence threshold \(T_c = 1.0\). As reported in Table 2, the best results are obtained when \(T_d = 0.5\). When \(T_d\) is set to a larger value, fewer instances are selected for the update, which is likely to discard instances that are actually informative. Too small a value of \(T_d\) allows most of the instances to be involved in the update, including too many easy instances that provide little useful information for improving the network.

Number of Minimum Samples. To evaluate the influence of the number of minimum samples in HDBSCAN, we report the results for {6, 8, 10} minimum samples in Table 3. As we can see, a value of 8 yields the best performance. Note that NRMT is not very sensitive to this prior clustering parameter.

Table 4. Performance evaluation of components in the proposed NRMT on Market-1501 and DukeMTMC-reID. Separate Training: Train the two networks separately. CC: Collaborative clustering. SC: Instance selection by the peer-confidence. SD: Instance selection by the relationship disagreement. Results of the two networks f and g are reported, respectively.

4.4 Ablation Study

We further validate the effectiveness of each component in the proposed NRMT, including collaborative clustering and instance selection by peer-confidence and by relationship disagreement, on Market-1501 and DukeMTMC-reID. The results are shown in Table 4. By sharing clustering information between the two networks on the whole dataset, “Ours w/ CC” improves the performance of both networks compared with “Separate Training”. This demonstrates that collaborative clustering is able to ease the fitting to noisy instances caused by unsupervised clustering by exploiting the different learning abilities of the two networks and the memorization effect of deep networks. “Ours w/ CC+SC” and “Ours w/ CC+SC+SD” obtain further gains through prediction information interaction between the networks within mini-batches, which picks up clean and informative instances for updating the networks.

Fig. 3. Comparison on the accuracy of pseudo-labels in the iteration process for DukeMTMC-reID \(\rightarrow \) Market-1501.

Fig. 4. Examples of (a) clean and informative, (b) noisy and (c) easy triplets of instances obtained by the proposed mutual instance selection strategy in a mini-batch. Only the clean and informative triplets are used for the network update. In each triplet, the first two images are positive examples and the last is a negative example.

To explore the ability of collaborative clustering to correct label errors, Fig. 3 shows the accuracy of the pseudo-labels generated by clustering over the iteration process. It can be seen that the pseudo-label accuracies of the two networks f and g trained with collaborative clustering are both significantly higher than those of the networks trained separately. This shows that sharing clustering information between the two networks on the whole dataset can effectively correct label errors at each iteration and reduce the accumulation of noise during training.

In Fig. 4, we show examples of clean and informative, noisy, and easy triplets of instances obtained by the proposed mutual instance selection strategy. We observe that the clean and informative triplets selected by our strategy contain negative examples with similar appearances and positive examples with large variations. Meanwhile, our strategy filters out not only noisy triplets but also easy triplets. This indicates that our strategy acts as robust online hard example mining for the triplet loss when training with noisy labels.

Table 5. Comparison with the state-of-the-art UDA methods on Market-1501 and DukeMTMC-reID. The averaged performance of the two networks f and g is reported.
Table 6. Comparison with the state-of-the-arts on transfers from DukeMTMC-reID and Market-1501 to MSMT17.

4.5 Comparison with State-of-the-art Methods

In this section, we compare the proposed NRMT with state-of-the-art unsupervised person re-ID methods on the transfers between DukeMTMC-reID and Market-1501 and the transfers from DukeMTMC-reID/Market-1501 to MSMT17. Here we report the average performance of the two networks f and g in NRMT.

Table 5 shows the results on the transfers between DukeMTMC-reID and Market-1501. We first compare the proposed NRMT with two hand-crafted features, i.e., LOMO  [18] and Bag-of-Words (BoW)  [52]. We can see that deep learning features significantly improve the performance. Three unsupervised methods, including UMDL  [24], PUL  [8] and DECAMEL  [45], are also compared. Our method surpasses these methods by a large margin by progressively adapting from the source data to the target data. We further compare with unsupervised domain adaptation methods, including UDAP  [31], MAR  [46], ECN  [57], PCB-R-PAST  [49], SSG  [11] and ACT  [43], among others; our method still achieves the best performance. In particular, our NRMT outperforms PCB-R-PAST  [49], which also focuses on improving label quality, by 17.1%/9.4% on mAP/Rank-1 accuracy for DukeMTMC-reID \(\rightarrow \) Market-1501 and by 7.9%/5.4% for Market-1501 \(\rightarrow \) DukeMTMC-reID. This demonstrates the effectiveness of information interaction between dual networks for noise reduction. Moreover, our NRMT also exceeds the second best method, ACT  [43], by clear margins.

We also evaluate our NRMT on the transfers from DukeMTMC-reID and Market-1501 to MSMT17 in Table 6. The results obtained by NRMT are 20.6%/45.2% on mAP/Rank-1 accuracy for DukeMTMC-reID \(\rightarrow \) MSMT17 and 19.8%/43.7% for Market-1501 \(\rightarrow \) MSMT17, both of which exceed the second best method, i.e., SSG  [11]. This further demonstrates the superiority of our NRMT on the large-scale dataset.

5 Conclusions

This paper proposed a noise resistible mutual-training method (NRMT) for unsupervised domain adaptation (UDA) in person re-ID to effectively depress label noise in a self-training process. In NRMT, two networks are maintained during training. At each iteration, the two networks share clustering information to ease the fitting to noisy instances. For each mini-batch update, the networks also exchange prediction information to further select reliable and informative instances. Extensive experimental results demonstrate that the proposed NRMT achieves state-of-the-art performance for UDA in person re-ID.