1 Introduction

Person re-identification (Re-ID) aims at matching images of the same identity across different camera views and plays an important role in intelligent surveillance systems. Although many studies [5, 17, 22, 36, 53, 54] have made great progress in a supervised learning manner, it is still difficult to learn a Re-ID model that generalizes well on a target domain without annotations. One main reason is the data distribution discrepancy between source and target domains, caused by variations such as body pose, camera view, illumination, image resolution, occlusion and background.

To address the unsupervised domain adaptation (UDA) problem in Re-ID, some recent works [6, 38, 42, 47, 56, 57] focus on transferring knowledge from a labeled source dataset or on clustering the unlabeled target dataset. When extending to a new target domain, most UDA Re-ID approaches require source domain data for pre-training or joint training. However, in some practical scenarios, source domain data cannot be obtained because of privacy or data-transfer restrictions. This requirement of source data reduces the scalability and usability of these UDA Re-ID methods. In such cases, directly transferring knowledge from models trained on labeled source domains to unlabeled target domains remains a meaningful challenge, as illustrated in Fig. 1.

Fig. 1

Illustration of the source-free multi-source domain adaptation problem setting. Using pre-trained source models and unlabeled data, the knowledge from labeled source domains can be transferred to the target domain via a proxy task, which improves the scalability and reusability of Re-ID systems

There are a few studies [43] on source-free (without any source data) multi-source domain adaptation for unsupervised Re-ID. Distill [43] treats this problem as multi-teacher knowledge distillation and develops a sample pairwise similarity matrix for the target model to imitate the source models. However, using a sample pairwise similarity matrix to transfer knowledge suffers from two limitations. On the one hand, the knowledge from the source models depends on the quality of the sampled batch data; when there are noisy samples in the training batch, the sample pairwise similarity matrix may not provide effective guidance for target model learning. On the other hand, the weighting strategy for source models based on the sample pairwise similarity matrix is coarse-grained because of the limited number of samples in a batch.

To tackle the above problems, we introduce a novel proxy task learning framework to transfer knowledge from source models. The proposed proxy task is inspired by recent self-supervised learning methods [3, 13, 49], which construct pretext tasks by discovering supervisory signals directly from the input data itself and learn useful visual representations from these pretext tasks. Similarly, our proxy task also defines supervisory signals from the unlabeled data itself. Specifically, the proxy task includes two parts, proxy label learning and domain discriminative learning, as shown in Fig. 2.

In proxy label learning, the features extracted by different source models contain different source-specific information for each image. We train a classifier (the proxy label generator) to identify which image an input feature comes from. In this way, the supervisory signal for the proxy label generator is the data itself, since each image used for the proxy task is regarded as an individual identity (category). When the training process converges, the output probability distribution vectors of the proxy label generator are used as proxy labels. Intuitively, each entry of the proxy label generator can be viewed as a prototype of an image feature, since the generator is a single fully connected layer optimized with a cross-entropy loss. Thus, each element of a proxy label vector represents the similarity between the input data and the corresponding image sampled for proxy label learning. By learning the proxy label, the target model acquires the knowledge embedded in the source models, which relates to the similarity among unlabeled data.

Meanwhile, in domain discriminative learning, a domain discriminator is trained to identify which source model an input proxy label comes from. Domain discriminative learning aims to estimate the discrepancy between the target model and the different source models through proxy labels, since we expect the target model to learn more from the more relevant source models. To integrate the knowledge from different source models, we apply a weighting strategy over the source proxy labels to generate an aggregated proxy label. The weights are chosen by the domain discriminator, which emphasizes the more similar source models and suppresses the dissimilar ones. Finally, knowledge can be adaptively transferred from multiple source models into the target model by learning the aggregated proxy label.

The main contributions of this paper are summarized as follows. First, we propose a novel proxy label learning method to embed the knowledge from multiple source models into proxy labels. Second, we propose to use a domain discriminator to automatically choose a weighting strategy for effectively aggregating knowledge from different source models via domain discriminative learning. Third, experimental results on DukeMTMC and Market-1501 show that our model outperforms the state-of-the-art source-free multi-source unsupervised Re-ID methods.

Fig. 2

Illustration of the proposed proxy task, which includes proxy label learning and domain discriminative learning. Proxy label learning trains a classifier, the proxy label generator, to distinguish which unlabeled image an input feature comes from; the proxy label is the output probability distribution of this classifier. Domain discriminative learning attempts to distinguish which source model a proxy label comes from

2 Related work

2.1 Unsupervised domain adaptation

Unsupervised domain adaptation (UDA) aims at tackling the domain shift [30] problem by transferring knowledge from a labeled source domain to an unlabeled target domain. The common idea of most single-source UDA methods is to align the source and target domains at different levels. Discrepancy-based methods explicitly measure the feature discrepancy between the source and target domains, using, for example, variants of the maximum mean discrepancy (MMD) [20, 21], correlation alignment (CORAL) [34], and adversarial discriminative losses [12, 37]. GAN-based (Generative Adversarial Network) approaches focus on learning image-to-image transformations to align the discrepancies in pixel space, such as PixelDA [4] and CyCADA [15]. Multi-source methods assume that the training data are collected from multiple sources [35]. Recent multi-source methods [28, 44, 50, 51] focus on extracting knowledge from source domain data. MDAN [50] and DCTN [44] use a domain discriminator for adversarial domain learning. MMN [28] transfers the learned knowledge from multiple sources to the target by dynamically aligning the moments of their feature distributions. MDDA [51] proposes a sample selection framework to distill effective knowledge from source data to the target domain. SK et al. [1] propose an algorithm to find optimal combinations of source models, which leads to superior results compared with any single source model. Song et al. [33] establish a theoretical foundation for domain adaptive classification and are the first to extend this theory to re-ID tasks. Further, SSG focuses on harnessing the similar natural characteristics of samples from the target domain to conduct person re-ID. However, the approaches mentioned above cannot be directly applied to the person re-ID task: most of them assume that the source and target domains share the same classes, whereas person identities (classes) are completely different across datasets.

2.2 Self-supervised learning

Self-supervised learning (SSL) constructs pretext tasks by discovering supervisory signals directly from the input data itself. A pretext task is designed so that solving it requires learning useful visual representations. These methods use various cues and pretext tasks, such as in-painting [26], jigsaw puzzles of patch context [8, 25], noise-as-targets [3], colorization [16, 49], and predicting transformations [13, 48]. Recently, contrastive learning-based methods have also led to superior results on image classification [14] and video action recognition [39]. Inspired by the idea that a pretext task can be constructed automatically and easily from images alone, we introduce a proxy task that extracts knowledge from source models by utilizing supervisory signals from unlabeled images alone. The major difference is that we have no source data available and we use the source models as pseudolabel generators. In addition, the pretext task in this work is based on dynamic pseudolabels that change with the input image.

2.3 Person Re-ID

2.3.1 Supervised learning methods

Most related work in person re-identification is based on supervised learning. Among them, MPN [7] proposes a robust part-aware model driven by part-level representations and achieves significant improvement on Market-1501. Song et al. [2] focus on adversarial attacks with human-imperceptible noise on gallery images and further improve model robustness. Ahmed et al. [1] propose an interesting setting: they explore the dynamic nature of a camera network and minimize the additional effort required when adapting existing re-identification models. Besides utilizing information from the whole image, PAUL (Yang et al.) [45] introduces a patch-based unsupervised learning framework to learn discriminative features from patches.

2.3.2 Unsupervised learning methods

Recent unsupervised person re-ID methods fall into three categories. Clustering-based methods [9, 10, 19] cluster images of the same identity for training, similar to supervised methods. BUC [9] proposes a bottom-up clustering framework with a diversity regularization. SSG [10] clusters on global and local features and assigns hard pseudolabels. With the aid of Generative Adversarial Networks (GANs), GAN-based methods [6, 42, 56] aim at learning an image-to-image transformation to close the gap between domains at the pixel level. PTGAN [42] and SPGAN [6] transform source images into the target domain style without changing the original person identity labels. HHL [56] focuses on camera-invariant learning with camera-style-transferred images across domains. The third category of methods [29, 47, 57] exploits the knowledge within the source and target domains themselves and designs constraints to learn a generalized model. Wang et al. [41] focus on the higher-order relationships across the entire camera network and propose a consistent cross-view matching framework. ENC [57] utilizes invariant properties to generalize the model with an exemplar memory module. MAR [57] conducts soft label learning with reference persons from source data. UCDA [29] reduces the discrepancy not only between source and target domains but also among camera-aware sub-domains. Previous works solve the unsupervised person Re-ID problem by learning from data, whereas in this work we focus on learning knowledge from pre-trained models, i.e., at the "model level." However, unsupervised methods still perform poorly compared with supervised alternatives, and Wang et al. [40] propose to learn models from weak supervision.

Our work is most closely related to Distill [43], which uses a sample similarity matrix to transfer knowledge from source models under the source-free setting. However, the knowledge embedded in the sample similarity matrix is limited by the sampled batch data, which might not be effective enough to transfer knowledge from the source models to the target domain.

3 Proposed method

Problem definition To study source-free multi-source domain adaptation for Re-ID, the problem is formulated as follows. We are given K labeled source domains \(\{S_1, S_2, \ldots , S_K\}\) and one fully unlabeled target domain T. Suppose we have models \(\{F_{1}^{s}, F_{2}^{s}, \ldots , F_{K}^{s}\}\) trained on the source domains \(\{S_1, S_2, \ldots , S_K\}\), respectively. In our setting, we aim to learn a model \(F^t\) from the trained source models \(\{F_{1}^{s}, F_{2}^{s}, \ldots , F_{K}^{s}\}\) and the unlabeled data \({X^t} = \{{x^t_i}\}_{i=1}^{N_t}\) without any source data.

3.1 Framework overview

In this section, we introduce the proposed source-free multi-source domain adaptation framework. Our framework includes two steps: proxy task learning and multi-source domain adaptation via aggregated proxy labels. For proxy task learning, we first sample \(N_p\) unlabeled images from the target domain to train a proxy label generator and a domain discriminator. In the multi-source domain adaptation stage, the aggregated proxy labels are generated from two branches. First, we feed the features of an input image extracted by the different source models into the proxy label generator and obtain the proxy labels corresponding to the different source models. Second, the domain discriminator takes the proxy label produced by the target model saved from the last iteration to choose a weighting strategy over the multiple proxy labels. We obtain the aggregated proxy label of the input image by combining all proxy labels from the different source models with these weights. Finally, we fine-tune the target network with the aggregated proxy labels to transfer knowledge. The details of each step are explained in the following subsections.

3.2 Proxy task learning

The proxy task includes two sub-tasks, proxy label learning and domain discriminative learning, as shown in Fig. 2. By learning the proxy task, the knowledge from the source models is embedded into the proxy labels.

Proxy label learning In brief, proxy label learning trains a classifier on input features to identify which image each feature comes from, where each image \({x_i^{t}}\) is regarded as an individual identity (category) \(y_i^{t}\). Suppose we sample \(N_{p}\) images from the target dataset for training; the proxy label is then the output probability distribution \(\tilde{y} \in {\mathbb {R}}^{N_p}\) of the classifier. Note that using the K source models as feature extractors yields K different features for each image, while the classifier discriminates the feature of only one source model at a time. The classifier G is optimized by the cross-entropy loss, written as:

$$\begin{aligned} {\mathcal {L}}_{pl}({G}) = \frac{1}{N_p} \frac{1}{K} \sum _{i=1}^{N_p} \sum _{j=1}^{K} {\mathcal {L}}_{ce}\left( G\left( F_{j}^{s}\left( {x}_{i}^{t} \right) \right) , {{y}}_{i}^{t}\right) \end{aligned}$$
(1)

where each target image \({x_i^{t}}\) corresponds to a unique \({{y}}_{i}^{t}\). When the training process has converged, the classifier G is used as the proxy label generator. The proxy label is formally defined as:

$$\begin{aligned} \tilde{y}^{(z)} = G(F(x^t))^{(z)} = \frac{\exp \left( {W_{g,z}^{\mathrm {T}}} F({x}_{}^{t} )\right) }{\varSigma _{a} \exp \left( {W_{g,a}^{\mathrm {T}}} F({x}_{}^{t} )\right) } \end{aligned}$$
(2)

where \(\tilde{y}^{(z)}\) is the z-th entry of the proxy label \(\tilde{y}\), \(F(\cdot )\) denotes any network used for feature extraction, and \(W_{g,a}\) is the a-th weight vector of the proxy label generator G.

Intuitively, the proxy label \(\tilde{y}\) represents the similarity between the input image \(x^t\) and the \(N_p\) images sampled for the proxy task. Each entry \(W_{g,a}\) of the proxy label generator G can be viewed as a prototype of image \(x^t_a\). Even though the source models suffer from the domain shift problem, which degrades their performance, they still preserve the ability to capture common low-level features for the Re-ID task. By optimizing the cross-entropy loss with learnable parameters, each prototype of the \(N_p\) images becomes more discriminative than the original feature. Compared with modeling the similarity directly from the original features, our proxy task not only addresses the limited discriminative ability of the raw features but also integrates the knowledge contained in multiple source models. The knowledge contained in a proxy label does not depend on any specific data point; instead, it is constructed from the similarity among a set of data. Our proxy label thus provides another way to characterize the similarity among samples instead of using identities, which is consistent with the fact that the common Re-ID task is based on a similarity metric among identities.
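To make the construction concrete, the following PyTorch-style sketch implements proxy label learning as described by Eqs. (1) and (2). It is a minimal illustration under our own assumptions: the class and function names are hypothetical, the source models are treated as frozen feature extractors, and details such as batching and learning-rate schedules are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProxyLabelGenerator(nn.Module):
    """A single fully connected layer G; the a-th row of `fc.weight`
    plays the role of the prototype W_{g,a} of the a-th sampled image."""

    def __init__(self, feat_dim: int, num_proxy_images: int):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_proxy_images, bias=False)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Proxy label: softmax over the N_p instance classes (Eq. 2).
        return F.softmax(self.fc(feats), dim=1)


def proxy_label_learning_step(generator, source_models, images, instance_ids, optimizer):
    """One optimization step of Eq. (1): each sampled image is its own class,
    and G must recognize it from the feature produced by every source model."""
    optimizer.zero_grad()
    loss = 0.0
    for f_s in source_models:               # frozen source extractors F_j^s
        with torch.no_grad():
            feats = f_s(images)             # (B, feat_dim)
        logits = generator.fc(feats)        # (B, N_p), pre-softmax
        loss = loss + F.cross_entropy(logits, instance_ids)
    loss = loss / len(source_models)
    loss.backward()
    optimizer.step()
    return loss.item()
```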

Domain discriminative learning Another key problem in multi-source domain adaptation is how to select effective knowledge from multiple source domains. To address this problem, we introduce another sub-task, domain discriminative learning, into the proxy task. In domain discriminative learning, the domain discriminator D aims to distinguish which source model a proxy label comes from, which measures the discrepancy among the different sources. We optimize D by minimizing the following cross-entropy loss:

$$\begin{aligned} {\mathcal {L}}_{dd}({D}) = \frac{1}{N_p} \frac{1}{K} \sum _{i=1}^{N_p} \sum _{j=1}^{K} {\mathcal {L}}_{ce}\left( D \left( G\left( F_{j}^{s}\left( {x}_{i}^{t} \right) \right) \right) , {{y}}^{s}_j\right) \end{aligned}$$
(3)

where \({{y}}^{s}_j\) denotes the source \(S_j\) of the model \(F^s_j\). The output of the domain discriminator D represents the distance between the input model and the multiple source models, which provides guiding information for selecting appropriate source models for knowledge transfer.

Domain discriminators are widely used in adversarial unsupervised domain adaptation methods [12, 15, 37] to distinguish the source and target domains. In contrast, our proposed domain discriminative learning is based on proxy labels instead of features. Compared with image classification, the Re-ID task is an open-set problem, which means that the identities (categories) are not shared among different datasets. Hence, features across different identities might not characterize the discrepancy among different sources well, while proxy labels are based on the similarity among samples and thus contain more information than a specific feature.
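Continuing the sketch above (still a hedged illustration with hypothetical names, not the released implementation), the domain discriminator of Eq. (3) can be realized as another linear classifier that takes proxy labels as input and predicts the index of the source model that produced them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DomainDiscriminator(nn.Module):
    """Predicts which of the K source models produced a given proxy label."""

    def __init__(self, num_proxy_images: int, num_sources: int):
        super().__init__()
        self.fc = nn.Linear(num_proxy_images, num_sources, bias=False)

    def forward(self, proxy_labels: torch.Tensor) -> torch.Tensor:
        return self.fc(proxy_labels)        # raw logits over the K sources


def domain_discriminative_step(discriminator, generator, source_models, images, optimizer):
    """One optimization step of Eq. (3): recover the source index j
    from the proxy label G(F_j^s(x))."""
    optimizer.zero_grad()
    loss = 0.0
    for j, f_s in enumerate(source_models):
        with torch.no_grad():
            proxy = generator(f_s(images))  # (B, N_p) proxy labels
        logits = discriminator(proxy)       # (B, K)
        target = torch.full((images.size(0),), j, dtype=torch.long, device=images.device)
        loss = loss + F.cross_entropy(logits, target)
    loss = loss / len(source_models)
    loss.backward()
    optimizer.step()
    return loss.item()
```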

Fig. 3

Illustration of the proposed multi-source domain adaptation via proxy labels. It consists of three branches: (1) providing proxy labels of the input image from multiple source models; (2) estimating the target domain distribution with the last-iteration target model to generate the domain weighting strategy; (3) updating the target model by learning the aggregated proxy label

3.3 Multi-source domain adaptation via proxy label

We show an overall illustration of our multi-source domain adaptation method in Fig. 3. Recall that the proxy label generator can generate K proxy labels using the K different source models for any input image. To learn from these proxy labels, the target model needs to integrate the knowledge embedded in them. Since the discrepancies between the source and target domains vary, the contributions of the multiple source models need to be adjusted adaptively. To achieve this goal, we estimate the target distribution with the target network \(F^{t}_{\tau -1}\) from the last iteration. The domain discriminator D generates a weighting strategy over the source domains from the proxy label of the target network \(F^{t}_{\tau -1}\). The weights of the different source models reflect the distance between source and target distributions through the proxy label, emphasizing more relevant sources and suppressing irrelevant ones. The aggregated proxy label \(\tilde{y}^{*}_{i}\) and weighting strategy \({\{\omega _{i,j}}\}_{j=1}^K\) of image \(x^t_i\) can be written as:

$$\begin{aligned}&\tilde{y}^{*}_{i} = \frac{1}{K} \sum _{j=1}^{K} \omega _{i,j} * G(F_{j}^{s}( x^t_i)) \end{aligned}$$
(4)
$$\begin{aligned}&\omega _{i,j} = \frac{\exp \left( {W_{d,j}^{\mathrm {T}}} G\left( F^{t}_{\tau -1} \left( {x}_{i}^{t}\right) \right) \right) }{\varSigma _{a} \exp \left( {W_{d,a}^{\mathrm {T}}}G\left( F^{t}_{\tau -1} ({x}_{i}^{t}) \right) \right) } \end{aligned}$$
(5)

Next, in order to transfer the knowledge embedded in the aggregated proxy label \(\tilde{y}^{*}_{i}\), we minimize the Kullback–Leibler divergence between the proxy label from the target network \(F_{\tau }^{t}\) and the aggregated proxy label \(\tilde{y}^{*}_{i}\) of the input \(x_i^t\), formulated as follows:

$$\begin{aligned} {\mathcal {L}}_{ada}(F_{\tau }^{t}) = \frac{1}{N_{t}} \sum _{i=1}^{N_{t}} {\mathcal {L}}_{kl} \left( G\left( F_{\tau }^{t}\left( {x}_{i}^{t} \right) \right) , {\tilde{y}^{*}_{i}}\right) \end{aligned}$$
(6)

By minimizing \({\mathcal {L}}_{ada}\), the proxy label of the target model better estimates the target distribution, which in turn enables the domain discriminator to select a weighting strategy that provides better guidance for the target model.
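The following sketch puts Eqs. (4)–(6) together into one adaptation step, reusing the hypothetical modules from the earlier sketches. The KL direction, the 1e-8 numerical-stability constant, and the assumption that the target model outputs features of the dimension expected by the generator are our own choices, not taken from the paper.

```python
import torch
import torch.nn.functional as F


def adaptation_step(target_model, prev_target_model, source_models,
                    generator, discriminator, images, optimizer):
    with torch.no_grad():
        # Weighting strategy from the last-iteration target model F^t_{tau-1} (Eq. 5).
        prev_proxy = generator(prev_target_model(images))                 # (B, N_p)
        weights = F.softmax(discriminator(prev_proxy), dim=1)             # (B, K)

        # Proxy labels of the input image from every source model, then
        # the aggregated proxy label (Eq. 4).
        source_proxies = torch.stack(
            [generator(f_s(images)) for f_s in source_models], dim=1)     # (B, K, N_p)
        aggregated = (weights.unsqueeze(2) * source_proxies).mean(dim=1)  # (B, N_p)

    # Update the target model by matching its proxy label to the aggregated
    # proxy label with the Kullback-Leibler divergence (Eq. 6).
    optimizer.zero_grad()
    target_proxy = generator(target_model(images))
    loss = F.kl_div(torch.log(target_proxy + 1e-8), aggregated, reduction="batchmean")
    loss.backward()
    optimizer.step()
    return loss.item()
```

After each iteration, \(F^{t}_{\tau -1}\) would be refreshed with a copy of the current target model so that the weighting strategy tracks the latest estimate of the target distribution; the generator and discriminator are kept fixed in this stage.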


4 Experiment

4.1 Experimental settings and datasets

We conduct extensive experiments on two large person re-identification benchmark datasets: Market-1501 [52] and DukeMTMC [55].

Table 1 Statistics of the datasets
Fig. 4

Examples of the CUHK03, MSMT17, LPW, Market-1501 and DukeMTMC datasets. Images in each column represent the same identity

Datasets Source models are trained with labeled data from the training sets of CUHK03 [17], MSMT17 [42], LPW [32], Market-1501 [52] and DukeMTMC [55]. CUHK03 [17] contains 14,096 images of 1467 identities from two cameras. The dataset has two settings, manually labeled bounding boxes and DPM-detected bounding boxes; we use the labeled setting for source model training. MSMT17 [42] is currently the largest publicly available person Re-ID dataset, with 126,441 images of 4101 identities captured by 15 cameras. LPW [32] consists of 2731 different pedestrians collected from three different crowded scenes. A total of 7694 image sequences are generated, with an average of 77 frames per sequence. We randomly select 2 frames from each training sequence to construct our training set, which yields 11,732 training images of 1975 identities. Market-1501 [52] contains 32,668 images of 1501 identities from six cameras; 12,936 images of 751 identities are used for training, while the query and gallery sets contain 3368 and 19,732 images, respectively. DukeMTMC [55] has 1404 identities from eight cameras, with 16,522 training images of 702 identities, 2228 query images, and 17,661 gallery images. Basic information about the datasets is given in Table 1. In evaluation, similarities between query and gallery samples are computed by the target model. We use the cumulative matching characteristic (CMC) [52] and mean average precision (mAP) [55] as performance metrics (Fig. 4).

Implementation details We adopt ResNet-50 as the backbone with an extra batch normalization layer [22] after the global average pooling layer in all experiments unless otherwise indicated. The source models are trained only with the ID-discriminative embedding (IDE) loss [53] for 80 epochs. The target model is initialized with ImageNet pre-trained weights, without training on any Re-ID dataset. The Adam optimizer is used to train all models with a learning rate of 0.00035. We resize each image to 384 \(\times \) 128 pixels and pad 10 pixels with zero values, then randomly crop it back to 384 \(\times \) 128 and apply random horizontal flipping with probability 0.5. We set the mini-batch size to 64. The 2048-dimensional features extracted after the additional batch normalization layer are used as feature vectors. During proxy label learning, we randomly select 75% of the unlabeled images from the target dataset to train the proxy label generator.
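For reference, a torchvision-based sketch of the backbone and augmentation described above is given below; the class name and minor details (e.g., the absence of normalization) are our own assumptions.

```python
import torch
import torch.nn as nn
import torchvision.transforms as T
from torchvision.models import resnet50

# Augmentation described above: resize to 384x128, zero-pad 10 pixels,
# random crop back to 384x128, random horizontal flip with probability 0.5.
train_transform = T.Compose([
    T.Resize((384, 128)),
    T.Pad(10),
    T.RandomCrop((384, 128)),
    T.RandomHorizontalFlip(p=0.5),
    T.ToTensor(),
])


class ReIDBackbone(nn.Module):
    """ResNet-50 followed by global average pooling and an extra BN layer;
    the 2048-d output of the BN layer is used as the feature vector."""

    def __init__(self, pretrained: bool = True):
        super().__init__()
        base = resnet50(pretrained=pretrained)
        self.features = nn.Sequential(*list(base.children())[:-2])  # drop avgpool and fc
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.bn = nn.BatchNorm1d(2048)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.gap(self.features(x)).flatten(1)   # (B, 2048)
        return self.bn(x)


# Optimizer setting from the text: Adam with learning rate 0.00035.
model = ReIDBackbone()
optimizer = torch.optim.Adam(model.parameters(), lr=3.5e-4)
```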

Table 2 Comparison with state-of-the-art unsupervised Re-ID methods on Market-1501 and DukeMTMC-reID

4.2 Comparison with the state of the art

Table 2 presents the comparison with recent state-of-the-art unsupervised learning methods on Market-1501 and DukeMTMC. The compared methods can be divided into four types based on the usage of source data: (1) N-N: fully unsupervised learning methods without any source data, including LOMO [18], BoW [52], PUL [9] and CAMEL [46]; (2) S-N: single-source domain adaptation methods using source data only for pre-training, including TJ-AIDL [38] and T-Fusion [23]; (3) S-P: single-source domain adaptation methods with source data used for joint learning during adaptation, including PTGAN [42], SPGAN [6], HHL [56] and UCDA [29]; (4) M-N: source-free multi-source domain adaptation, including Distill [43].

As shown, our M-N approach achieves competitive performance compared with the listed unsupervised Re-ID methods, although its performance on Market-1501 is not far beyond the best N-N methods. It should be noted that our method does not use a source model for training in the second phase and the target model learns from random initialization. Compared with the state-of-the-art method Distill [43] under the same experimental setting, our model achieves 9.8% and 9.3% improvements in rank-1 accuracy and mAP on DukeMTMC, respectively. This indicates that the proposed proxy task framework can transfer knowledge from multiple source models to the target domain more effectively. The main advantage of our method is that we can utilize information from multiple source models, even with different architectures, and learn the common information from multiple teachers, which is beneficial for practical online models. Besides, we do not need access to the specific architecture of the source models: the target model can transfer knowledge purely from the outputs of the source models. Our method is therefore more practical with respect to privacy and security.

Table 3 Performance (%) of source models directly tested on the target datasets DukeMTMC and Market-1501

4.3 Further evaluation and ablation study

In this section, we first evaluate the baseline performance by directly deploying the source models on the target datasets. Then we conduct ablation studies on the number of images for proxy task learning, the effectiveness of our weighting strategy, the loss function for adaptation and the architecture of the target model.

Baseline evaluation We evaluate the performance of each source model on the target datasets individually. The results are reported in Table 3. When trained and tested on the same dataset, the models achieve high performance. However, performance drops significantly when a model is directly deployed on a target dataset different from its training set. For example, the baseline model trained and tested on DukeMTMC achieves 82.8% rank-1 accuracy, but this drops to 41.8% when tested on Market-1501. The domain shift among datasets is the cause of this performance degradation.

Number of images \(N_p\) for proxy task In this experiment, we analyze the impact of the ratio of samples used for our self-supervised proxy task. Specifically, we conduct experiments with two default settings: (1) transferring the trained models from Market-1501 to DukeMTMC, and (2) transferring the trained models from DukeMTMC to Market-1501. The direct transfer results with no unlabeled images available are also reported for reference. We show the results of sampling different numbers of images for proxy label learning in Table 4. We observe that, in general, the more training samples are used for proxy task learning, the higher the performance of the target model, which illustrates that the number of images used for training relates to the efficiency and robustness of knowledge transfer. We also observe that using 75% of the unlabeled images leads to the best result on DukeMTMC. Since DukeMTMC images may contain multiple persons and serious occlusion, this dataset is more challenging; the reason may be training instability, together with the fact that the remaining 25% of unlabeled images are easy samples that contribute less to the gradient optimization. Empirically, we set \(N_p\) to 75% of the unlabeled dataset in the following experiments.

Table 4 Performance (%) of using different numbers of unlabeled images for proxy label learning
Table 5 Performance (%) of using different combinations of source models and weighting strategies under the proposed adaptation framework

Benefit of the multi-source aggregated adaptation We study the benefit of our proposed multi-source domain adaptation in Table 5. First, we conduct experiments on different combinations of source models. The results show that even when using a single source for the proxy task, our method outperforms the baseline, i.e., direct testing on the target domain. A case in point is that our method obtains 38.4% rank-1 accuracy on DukeMTMC using CUHK03 as the source model, surpassing the baseline by \(+18.5\)%. Also, the proxy label becomes more robust as the number of source models increases, which boosts adaptation performance. Furthermore, we compare the proposed weighting strategy with a baseline that uses average weighting to generate the aggregated proxy label. As seen in Table 5, our proposed weighting strategy outperforms average weighting. This is reasonable because average weighting does not reflect the importance of different sources; therefore the proxy label aggregated by average weighting may not fit the target distribution well.

Different loss functions for target model adaptation As shown in Table 6, we evaluate how different loss functions affect adaptation. The performance of the L1 and L2 loss functions drops significantly compared with the Kullback–Leibler divergence. This indicates that the Kullback–Leibler divergence is more suitable for characterizing the discrepancy between the distributions of the source and target models through proxy labels.

The variations of source/target model architectures In this experiment, we explore the transfer ability of our method with different source/target architectures. To this end, we adopt three widely used networks, ResNet-50, ShuffleNetV2 [24] and MobileNetV2 [31], for inference. The results are reported in Table 7, from which we make the following observations. (i) With fewer parameters and lower computational cost, MobileNetV2 [31] achieves the best result in most cases. This shows the flexibility and generalization of our method: the knowledge can be effectively transferred across heterogeneous model architectures. (ii) Compared with the target model, the performance is more robust to the selection of the source model; e.g., the rank-1 result on DukeMTMC with different source models but the same MobileNetV2 target varies from 59.2 to 61.7% (\(+2.5\)%). In contrast, with the same source model, the rank-1 result on DukeMTMC with different target models varies from 49.7 to 59.2% (\(+9.5\)%).

Table 6 Performance (%) of using different loss functions for adaptation
Table 7 Performance (%) of using different backbones as source and target models. R50 is short for ResNet-50
Fig. 5

Visualization of the proxy labels of 50 identities randomly selected from the DukeMTMC dataset. Panels a–d use the pre-trained source model from CUHK03, the pre-trained source model from Market-1501, the target model after multi-source adaptation, and the model trained with supervised learning on DukeMTMC, respectively. The coordinate axes represent different samples sorted by identity index in the same order, and each element of the heat map corresponds to the cosine similarity of the proxy labels between the x- and y-axis images

4.4 Visualization of proxy label

In order to show the interpretability of our proposed proxy task, we use heat maps to visualize the cosine similarity of the proxy labels produced by different models on target samples. As illustrated in Fig. 5, the coordinate axes represent different samples sorted by identity index in the same order. We observe that the brightness along the diagonal is very low when the pre-trained source model (CUHK03) is used to generate the proxy labels. Meanwhile, there are very bright squares with clear boundaries along the diagonal when using the model trained and tested on the same dataset. There also exists a large performance gap between these two models, which implies that our proxy label can, to some extent, identify images like actual annotations. This verifies that the proposed proxy label indeed embeds useful knowledge from the source models for the Re-ID task via the proxy task.

5 Conclusion

In this paper, we present a new proxy task learning framework for unsupervised multi-source domain adaptation in person re-identification, which does not require any source data. By introducing proxy label learning into the proxy task, the knowledge from source models can be embedded into proxy labels. To integrate the knowledge from different source models, we also propose domain discriminative learning for the proxy task, which generates a weighting strategy over the proxy labels from different sources. Experiments conducted on DukeMTMC and Market-1501 verify that our approach achieves competitive performance compared with the state of the art. In future work, we will further combine metric learning and meta learning with the proxy task.