1 Introduction

Person re-identification (re-id) [1,2,3,4,5,6,7,8,9,10] can be considered a single-modal retrieval task that searches a gallery filmed by multi-view cameras for the underlying suspect matching a query person. It is a technology that employs computer vision to determine whether a particular suspect appears in a video. Because the cameras are broadly distributed, their angles, specifications, and the illumination of the scenes they cover differ entirely, so the appearance characteristics of the same pedestrian may vary greatly across cameras [11]. In addition, researchers encounter many other challenges such as low image resolution, pedestrian posture change, and occlusion. In detail, the present difficulties can be summarized as:

  • In most cases, images from surveillance cameras are blurry, and an image may show only part of a pedestrian. Even with advanced object detection algorithms, the resolution remains too low to compare embeddings of extracted face features, so the only solution is to judge from clothes and postures.

  • Pedestrian re-id images may be filmed at different times, and pedestrians’ posture and appearance may change to a certain extent; for instance, images taken during the day and at night differ considerably. Moreover, many monitored environments have a large flow of people, which makes pedestrians prone to overlap and occlusion.

  • Acquiring a pedestrian re-identification dataset raises personal privacy and security issues. Pedestrian detection must be conducted over whole video frame sequences, which is time-consuming, and the detected pedestrians must then be matched across frames and labeled by identity. The whole process is tedious and inefficient, yet existing supervised re-id models require large-scale datasets, which creates a contradiction between model performance and training cost.

In a word, the above situations bring significant challenges to the re-id task. Researchers still need to focus on these practical issues and make long-term efforts.

1.1 Motivation

Due to its important applications in security monitoring, person re-id has attracted wide attention from academia. However, present algorithms commonly depend on enormous labeled datasets, which limits their applicability in practice. Even though many traditional supervised algorithms show excellent performance on the benchmarks, they perform poorly on real-world datasets [12,13,14,15], because even a slight gap between data distributions significantly degrades a supervised model. As the specifications of different cameras vary greatly, the appearance of pedestrians is easily affected by factors such as clothing, occlusion, posture, and lighting. All the factors mentioned above make pedestrian re-id a popular yet challenging task.

This paper addresses person re-id in an unsupervised way: the neural network is trained on a labeled source domain in a supervised manner and on an unlabeled target domain in an unsupervised manner, with the objective of optimizing performance on the unlabeled domain. The self-adaptive pedestrian re-id algorithm aims to improve generalization capability and reduce the cost of manual labeling. Regular unsupervised domain adaptation (UDA) is based on the assumption that the two domains share an identical category distribution [12, 16]. However, this assumption does not hold given the inherent open-set peculiarity of re-id [17]. Recently, most UDA methods [18,19,20] have aimed to decrease the distributional discrepancy between the two domains. In addition, some UDA methods leverage generative adversarial networks (GAN) [21] to realize image-to-image domain adaptation, which is essential for augmenting the pedestrian dataset. Nevertheless, these methods neither exploit the potential label information in the target domain nor address the diversity of camera styles.

In this work, a reciprocal neighbors label smoothing loss (RNLSL) is formulated, based on the observation that when the model is trained on the labeled dataset, the highest-ranked returns are more likely to share the identity of the query; in a word, the match of a pedestrian’s similar image is likely to also be similar to that pedestrian. Therefore, we can guide the model to be aware of the potential invariance in the target domain by decreasing the discrepancy between each sample and its neighbors. In practice, however, the situation deviates from this best case: mismatches may be contained among the k-nearest neighbors and may even rank highly [3]. RNLSL is designed to solve this problem by pulling in the true matches among the k-nearest neighbors and pushing away the hard negative samples.

Simultaneously, this paper regards each camera in the unlabeled dataset as a domain and employs a style transfer GAN to train a domain-adaptive transfer model. MBCR then augments the unlabeled domain images with this pre-trained model, as illustrated in Fig. 1. The generated images help the model grasp the variations in person appearance caused by differences in camera styles. Finally, a cross-camera invariance loss (CCIL) is proposed to further eliminate the interference of camera style on personal identities.

Fig. 1

Examples of style-transferred images in Market-1501. The Market-1501 benchmark contains six cameras. Every pedestrian filmed by one camera will also be captured by the other five cameras

1.2 Contributions

The contributions can be concluded as:

  • We demonstrate how to employ the memory-based circular ranking mechanism to generate reliable smooth labels in an unsupervised way and optimize the objective with RNLSL. The proposed method is named MBCR. Instead of labeling the whole dataset by clustering the entire target domain, it labels individual samples under the guidance of neighborhood information. MBCR boosts performance by improving the efficiency of model optimization and avoiding the noise generated by clustering algorithms.

  • A loss optimization named CCIL is proposed to direct the model to focus on cross-camera invariance in the target domain, which increases the model’s robustness.

  • We verify the method on three benchmarks: Market-1501 [22], DukeMTMC-reID [23], and MSMT17 [20]. Experiments on these datasets demonstrate that the proposed approach surpasses the state-of-the-art methods.

  • The proposed method is simple to implement and adds no massive extra parameters or computational overhead to the training process, so it is suitable for practical application. How to design appropriate losses based on training samples and their adjacent images remains an open problem, and the proposed method provides a useful reference for future work.

2 Related work

Over the past few decades, most existing supervised re-id works have focused on learning distance metrics or subspaces [24,25,26,27,28]. With the rise of neural networks, deep learning methods [29,30,31,32] have been fully applied to person re-id. In this section, we primarily introduce the traditional metric learning methods and the deep learning methods.

2.1 Metric learning Person Re-identification

LMNN [33] constructs triplets of an anchor sample, a positive sample, and a negative sample. It requires that embeddings of the same category be close enough and those of distinct categories be distant enough, where the distance is measured by Euclidean or cosine distance. Hinge loss is employed to optimize the objective function, which is a classical convex optimization problem with low complexity, and many scholars have since improved on this basis. Chen et al. [34] apply the quadruplet loss to pedestrian re-id for the first time, where the distance between the negative samples is additionally constrained.

KISSME [35] judges whether two pedestrian images depict the same person according to a log-likelihood ratio. It hypothesizes that the pairwise feature differences of both matched and unmatched image pairs follow zero-mean Gaussian distributions, differing only in their covariances. Hao et al. [36] propose a local similarity metric method based on KISSME to discriminate the local regional similarity of pedestrian images, avoiding the similarity conflicts between positive and negative samples that exist in previous distance measurements.
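
In its standard form (stated here in our notation, not quoted from [35]), with xij = xi − xj and ΣS, ΣD denoting the covariance matrices of similar and dissimilar pair differences, the Gaussian assumptions reduce the log-likelihood ratio to a Mahalanobis-like metric:

$$ \delta(x_{ij}) = x_{ij}^{T} \left({\Sigma}_{S}^{-1} - {\Sigma}_{D}^{-1}\right) x_{ij} $$

A smaller δ(xij) indicates a higher likelihood that the two images share an identity.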

2.2 Supervised Person Re-identification

He et al. [37] proposed SPP-Net. Traditional neural networks utilize single-size convolution kernels, which cannot capture information at different scales in images. The authors employ a spatial pyramid structure to extract sample features, with a pyramid pooling structure that makes the extracted feature dimension no longer depend on the size of the input images.

Subsequently, a deeper convolutional neural network, VGG-Net, was proposed by Simonyan et al. [38]. Considering that deepening the neural network can boost the fitting capacity of the model, the authors propose to replace the previous 5 × 5 convolutional kernels with multiple stacked 3 × 3 convolutional kernels, which increases the depth and improves the performance while maintaining the feature map’s receptive field.

However, the methods mentioned above are universal image feature extractors. Re-id has its own particularities, and the challenges described in Section 1 prevent these models from generalizing well. OSNet [39] was therefore proposed to attend to omni-scale features and advance re-id to a new level. It designs a convolution block that introduces multiple feature streams of different scales, where the scale attended to by each stream can be adjusted through hyperparameters. Moreover, features of different scales are uniformly fed into an aggregation module, which generates dynamic weights for the streams through a fully connected layer and conducts multi-scale feature fusion. For each input image, the feature aggregation module can adaptively focus on an appropriate scale or mix features from different scales to produce heterogeneous feature scales. In addition to realizing multi-scale feature fusion learning, OSNet is designed to be lightweight and employs depthwise separable convolutions to replace the original 3 × 3 convolutions.

In [32], image pairs are divided into three overlapped sub-pairs. A cosine measuring function is utilized to jointly perform feature extraction and metric learning through a siamese CNN. Recently, deep attention mechanisms [40, 41] have been proposed to handle lighting, occlusions, and background variations. In [40], a dual attention matching network is designed to search for implicit context representations and compare them simultaneously. [41] adopts a pose-guided part attention mechanism to reduce noise interference. In addition, [42] uses fully convolutional siamese networks to calculate visual similarity at different levels and combines multi-level information to improve the robustness of matching. However, these methods lack effective guidance for unlabeled datasets, which leads to poor model scalability in realistic deployments.

2.3 Unsupervised domain adaptation

UDA can effectively solve the learning problem caused by distribution inconsistency between two domains. When their categories are the same, an alignment operation is realized by reducing the maximum mean discrepancy (MMD) [43] in a Reproducing Kernel Hilbert Space (RKHS) [44, 45]. However, in most scenarios, unknown categories exist in the target domain. To address this, Busto and Gall [17] propose open set domain adaptation, which relates the labeled source domain to the unlabeled target domain by assigning images within the target domain to certain categories of the source domain. Recently an adversarial learning framework [46] was proposed to achieve style transformation of the source domain by means of a cycle-consistency loss. This paper also utilizes a domain style transfer GAN to bridge the gap between domains.
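
For reference, the empirical MMD between source features \(\{x_{i}\}_{i=1}^{n}\) and target features \(\{y_{j}\}_{j=1}^{m}\) takes the standard form below, where \(\phi(\cdot)\) is the feature map associated with the RKHS \(\mathcal{H}\):

$$ \text{MMD}(X,Y) = \left\| \frac{1}{n}\sum\limits_{i=1}^{n} \phi(x_{i}) - \frac{1}{m}\sum\limits_{j=1}^{m} \phi(y_{j}) \right\|_{\mathcal{H}} $$

Minimizing this quantity pulls the mean embeddings of the two domains together.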

2.4 Domain adaptive person re-identification

Several unsupervised approaches utilize the source domain to initialize a pre-trained model and mine potential label information by unsupervised clustering on the unlabeled target domain [12,13,14,15]. Nevertheless, they do not use the labeled source images to continue refining the pre-trained model. Recently, unsupervised methods [6, 18, 20, 47] have been proposed to leverage inter-domain style conversion. SPGAN [18] and PTGAN [20] utilize an image-to-image conversion network to preprocess the source dataset and then conduct supervised learning. An iterative pseudo-label framework is proposed in [47], which significantly boosts accuracy but is very sensitive to initialization. Tzeng et al. [16] utilize a style transfer GAN to acquire an initialization and then generate pseudo-labels by unsupervised clustering for all target images. Nevertheless, clustering-based algorithms work poorly on similar images: the pseudo-labels they assign to similar images from different categories can be the same, which indicates their weakness in distinguishing confusing samples [45]. ECN [11] constructs a continuously updated feature memory to estimate the similarity between all target images. Compared with ECN, the proposed method does not treat every neighbor of a training sample equally, because one-way ranking is often contaminated by mismatches. In contrast, the proposed method adopts RNLSL to effectively distinguish true matches from hard negative samples, which greatly improves the reliability of similarity estimation. Meanwhile, CCIL brings a new idea for handling camera style variations.

2.5 K-Reciprocal encoding

K-reciprocal encoding [3], which for the first time applies set intersection to re-rank sample similarity, is based on the hypothesis that an optimal match and its query should be each other’s top-k nearest neighbors. Specifically, Mahalanobis distance is first employed to obtain the query’s primary k-nearest gallery list, and then Jaccard distance is used to derive the k-reciprocal nearest list. The method sequentially calculates the k-reciprocal nearest list for each sample in the query’s k-reciprocal nearest list, so that positive samples that were ignored because of illumination and perspective variances can be recalled under certain restricted conditions. The Jaccard distance represents the distance between the query and the recalled image. The authors encode the k-reciprocal nearest information into an equivalent but simpler vector to reduce complexity, with higher weights for nearer samples.
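
The core set-intersection test can be sketched as follows. This is a minimal illustration with our own helper name; the Jaccard re-weighting and local-expansion steps of [3] are omitted:

```python
import numpy as np

def k_reciprocal_neighbors(dist, k):
    """Return, for each sample i, the indices j such that j is in i's
    top-k neighbors AND i is in j's top-k neighbors (the set-intersection
    test of k-reciprocal encoding; Jaccard re-weighting is omitted).

    dist: (N, N) pairwise distance matrix. In practice the sample itself
    (distance 0) is usually filtered out of its own list.
    """
    topk = [set(row) for row in np.argsort(dist, axis=1)[:, :k]]
    return [{j for j in topk[i] if i in topk[j]} for i in range(len(topk))]
```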

2.6 Regular loss functions

In the field of re-id, the commonly used loss functions mainly follow those of classification and retrieval tasks.

Cross-entropy loss [48] is mainly adopted in classification. It first converts the model output into predicted probabilities through the softmax function and then takes the negative log-probability of the ground-truth class as each sample’s loss, which can be expressed as:

$$ \begin{array}{@{}rcl@{}} L_{CE} = -\frac{1}{B} \sum\limits_{i=1}^{B} log \frac{e^{s_{i}}}{{\sum}_{j=1}^{n} e^{s_{j}}} \end{array} $$
(1)

where B denotes the number of samples in the mini-batch, n denotes the number of categories, and si is the logit of sample i’s ground-truth category.

Center loss [49] is an improvement on cross-entropy loss. It assumes a centroid for each category and promotes convergence by narrowing the gap between each input sample and its corresponding category centroid.

$$ \begin{array}{@{}rcl@{}} L_{CL} = -\frac{1}{B} \sum\limits_{i=1}^{B} log \frac{e^{s_{i}}}{{\sum}_{j=1}^{n} e^{s_{j}}} + \frac{\lambda}{2}\sum\limits_{i=1}^{B} \|x_{i}-c_{i}\|^{2} \end{array} $$
(2)

where xi denotes the embedding of sample i, ci represents the centroid of its category, and λ balances the two terms.

Triplet loss, put forward in FaceNet [50], is based on the crucial idea that samples of the same category should be closer in the embedding space. Since this simple constraint alone may also draw the cluster centroids of different categories closer together, a margin constant m is added:

$$ \begin{array}{@{}rcl@{}} L_{triplet}=\|{x_{i}^{a}}-{x_{i}^{p}}\|^{2}-\|{x_{i}^{a}}-{x_{i}^{n}}\|^{2}+m \end{array} $$
(3)

where \({x_{i}^{a}}\), \({x_{i}^{p}}\), and \({x_{i}^{n}}\) denote the anchor sample, positive sample, and negative sample, respectively.

3 Method

Preparatory Work

To facilitate the following analysis, this paper first defines the mathematical symbols to be used. The domain adaptive person re-id task provides two datasets: a source domain {Ps,Ls} with person identities and a target domain {Pt,Ct} with camera identities. As in many UDA person re-id methods, we adopt the assumption that each target image’s camera-id is known in advance, which is easy to obtain when gathering target images from frame sequences in a video. The source domain contains M pedestrian images belonging to X categories, and each ps,i in the source domain carries a label ls,i representing its identity. In addition, there are N pedestrian images and Y cameras in the target domain, and each target image pt,i corresponds to a camera identity annotation ct,i.

3.1 Overview of network

The overall architecture of the proposed method is shown in Fig. 2. In the first step, MBCR regards each single camera as a unique domain and utilizes CycleGAN [46] to train a camera-style transfer model; the specific implementation can be found in [51]. The trained model is then employed to augment the target images and generate a corresponding fake version of each original image. These fake images are added to the original dataset to participate in model training together. All images are fed to the pre-trained CNN backbone, followed by an embedding module consisting of a 512-dimensional feed-forward network (FFN), one-dimensional batch normalization, and ReLU. The 512-dimensional features of each image are extracted through this embedding module. For source images, the extracted features are sent to an X-dimensional FFN (named FC *P-id), followed by softmax, and the cross-entropy function is employed for supervised training. Simultaneously, MBCR maintains a feature memory module that saves the latest output of the embedding module for each target image. To more accurately estimate the similarity between the target samples in the mini-batch and those in the memory module, this paper proposes a reciprocal neighbors label smoothing loss (RNLSL) based on memory-based circular ranking. Since RNLSL and the memory-based circular ranking mechanism are closely related, they are introduced together in Section 3.2.
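
As a concrete illustration, the architecture just described can be sketched in PyTorch as follows. This is a minimal sketch under our reading of Fig. 2; the class and attribute names are ours, not the authors’:

```python
import torch.nn as nn
from torchvision.models import resnet50

class MBCRNet(nn.Module):
    """Sketch: ResNet-50 backbone -> 512-d embedding module (FC + BN + ReLU)
    -> FC *P-id head (X source identities) and FC *C-id head (Y cameras)."""

    def __init__(self, num_src_ids: int, num_cams: int):
        super().__init__()
        base = resnet50(weights="IMAGENET1K_V1")
        # keep everything up to and including global average pooling
        self.backbone = nn.Sequential(*list(base.children())[:-1])
        self.embed = nn.Sequential(
            nn.Linear(2048, 512), nn.BatchNorm1d(512), nn.ReLU()
        )
        self.fc_pid = nn.Linear(512, num_src_ids)  # FC *P-id, identity classifier
        self.fc_cid = nn.Linear(512, num_cams)     # FC *C-id, camera classifier

    def forward(self, x):
        f = self.backbone(x).flatten(1)  # (B, 2048)
        f = self.embed(f)                # (B, 512) features stored in the memory
        return f, self.fc_pid(f), self.fc_cid(f)
```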

Fig. 2

The overall architecture of MBCR. In the training process, style transfer images, original target images, and source images are sent to the deep re-id network to acquire updated features. For the source domain, the cross-entropy loss is utilized; for the target domain, RNLSL and CCIL are used

In addition to RNLSL, we also add a Y-dimensional FC layer (FC *C-id) after the embedding module and formulate a cross-camera invariance loss (CCIL) to guide the model to discern the discrepancies of pedestrians in the unlabeled domain. Occlusion, illumination, pose, and background clutter can all cause such variations across different domains.

3.2 Reciprocal neighbors label smoothing loss

From the perspective of supervised learning, we hope that identities of the same category are close enough in the embedding space while maintaining a certain distance from other identities. With the help of the feature memory module T, the reciprocal neighbors label smoothing loss can effectively mine the potential identity information in the target domain. In the beginning, each image in the target domain is treated as a separate category and assigned an index. Each row of the feature memory module stores the 512-dimensional feature corresponding to one index. During iterative training, the feature T[i] corresponding to pt,i is updated with the L2-normalized feature f(pt,i) by,

$$ \begin{array}{@{}rcl@{}} T_{t}[i] = (1-\lambda)T_{t-1}[i] + \lambda f(p_{t,i}) \end{array} $$
(4)

where t denotes the epoch number and λ controls the updating rate. The original k-nearest neighbors of pt,i can be obtained from the pairwise cosine distance between f(pt,i) and the feature memory module; we denote the indexes of these neighbors by S(pt,i). It is fallacious to directly pull pt,i toward all of its neighbors, because one-way ranking often contains mismatches. Thus, a memory-based circular ranking mechanism is proposed to excavate the confusing samples in a batch, as shown in Fig. 3. We process the feature memory module with k-reciprocal encoding [3] and obtain an aggregate similarity matrix Drecode, which contains the k-reciprocal encoding distances among all the embeddings saved in the memory module. It is worth noting that even though the cosine function is widely employed as a similarity metric, the k-reciprocal encoding distance has recently exhibited better generalization.
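
A minimal sketch of the memory update in Eq. (4) follows. The re-normalization step is our assumption (a common practice, e.g. in ECN, that keeps cosine similarity a plain dot product) and is not stated in the text:

```python
import torch
import torch.nn.functional as F

def update_memory(memory, feats, indices, lam=0.1):
    """Sketch of Eq. (4).

    memory:  (N, 512) tensor T, one row per target image.
    feats:   (B, 512) L2-normalized embeddings f(p_{t,i}) of the mini-batch.
    indices: (B,) long tensor giving those samples' rows in the memory.
    """
    with torch.no_grad():
        memory[indices] = (1.0 - lam) * memory[indices] + lam * feats
        # Re-normalize (our assumption) so similarity stays a dot product.
        memory[indices] = F.normalize(memory[indices], dim=1)
    return memory
```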

Fig. 3

Illustration of the memory-based circular ranking mechanism. First, we calculate the k-nearest neighbor list of pt,i. For each top-k neighbor pt,j in this list, if pt,i also exists in the top-k nearest neighbors of pt,j, pt,j is called a k-reciprocal nearest neighbor of pt,i; otherwise, pt,j represents a hard negative sample of pt,i

For each image xt in the original k-nearest ranking list S(pt,i), its own k-nearest ranking list S(xt) is obtained by sorting the similarity matrix Drecode. If pt,i also exists in S(xt), xt denotes a positive sample of pt,i; otherwise xt is regarded as a hard negative sample with high probability. By traversing S(pt,i), the original ranking list is divided into Spos(pt,i) and Sneg(pt,i). Finally, we treat the N unlabeled pedestrians as N categories and assign the pseudo-label Wt,i = {wi,1,wi,2,wi,3,…,wi,N} to pt,i as,

$$ \begin{array}{@{}rcl@{}} w_{i,j}=\left\{ \begin{array}{ll} 1 & p_{t,j} \in S_{pos}(p_{t,i}) \\ -\frac{1}{k} & p_{t,j} \in S_{neg}(p_{t,i}) \\ 0 & otherwise \end{array} \right. , \quad \forall p_{t,j} \in \{P_{t}\} \end{array} $$
(5)

where k denotes the size of S(pt,i). The estimated probability that pt,i belongs to the i-th class is obtained through,

$$ \begin{array}{@{}rcl@{}} p(i \lvert p_{t,i}) = \frac{\exp(\alpha f(p_{t,i}) \cdot T[i])}{{\sum}_{j=1}^{N} \exp(\alpha f(p_{t,i}) \cdot T[j])} \end{array} $$
(6)

where α is a scaling factor and nt denotes the number of unlabeled images in a training batch. Finally, the reciprocal neighbors label smoothing loss is formulated as,

$$ \begin{array}{@{}rcl@{}} \mathcal{L}_{RNLSL} = -\frac{1}{n_{t}}\sum\limits_{i} \sum\limits_{j=1}^{N} w_{i,j} \log p(j \lvert p_{t,i}) \end{array} $$
(7)
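
Under the definitions above, the full RNLSL computation can be sketched as follows. This is a minimal sketch with our own helper names, assuming Spos and Sneg have already been produced by the circular ranking step and that each sample’s own memory index is included in its positive set:

```python
import torch
import torch.nn.functional as F

def rnlsl_loss(feats, memory, pos_sets, neg_sets, alpha=20.0):
    """Sketch of Eqs. (5)-(7).

    feats:    (B, 512) L2-normalized batch embeddings f(p_{t,i}).
    memory:   (N, 512) feature memory T.
    pos_sets: list of B index lists, S_pos(p_{t,i}) from circular ranking.
    neg_sets: list of B index lists, S_neg(p_{t,i}).
    """
    logits = alpha * feats @ memory.t()   # (B, N) scaled similarities
    log_p = F.log_softmax(logits, dim=1)  # log p(j | p_{t,i}), Eq. (6)
    loss = 0.0
    for i, (pos, neg) in enumerate(zip(pos_sets, neg_sets)):
        k = len(pos) + len(neg)           # size of the original k-NN list
        w = torch.zeros(memory.size(0), device=feats.device)
        w[pos] = 1.0                      # true matches, Eq. (5)
        w[neg] = -1.0 / k                 # hard negatives are pushed away
        loss = loss - (w * log_p[i]).sum()  # one sample's term of Eq. (7)
    return loss / feats.size(0)
```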

In the course of computing RNLSL, an important step is to assign pseudo-labels to target samples with the memory-based circular ranking mechanism, and two principles of invariance are involved. In (5), relatively large weights are assigned to the sample itself and to its reciprocal neighbor samples; these are called sample invariance and neighborhood invariance in this paper.

Sample invariance pulls each sample toward its own stored feature, which is a conservative approach in the absence of labels, but it also keeps different sample instances far from each other, widening the gap between instances of different categories. Since each target domain image is regarded as its own category in this paper, features of images that actually share an identity can be pushed far apart, which decreases performance.

Neighborhood invariance guides each pedestrian image instance and its candidate nearest-neighbor samples to converge toward each other, which helps to reduce the distance between similar pedestrian images in the embedding space. However, the pseudo-labels generated by circular ranking are not perfectly accurate, so neighborhood invariance may draw together the embeddings of two pedestrians with different identities. Even after circular ranking and filtering, it is still not guaranteed that the query sample shares a label with every candidate in the screened positive set.

Considering the limitations of these two invariances, the cross-camera invariance loss is proposed as follows.

3.3 Cross-camera invariance loss

Camera style variations can significantly change the appearance of a person, which makes it difficult for a re-id model to find persons of the same identity across different cameras. Although we employ the camera style transfer model to reduce the differences among camera styles, the network’s inferred results are still sensitive to image transformations. To reduce the correlation between features and camera styles, we propose a cross-camera invariance loss as,

$$ \begin{array}{@{}rcl@{}} \mathcal{L}_{CCIL} = \frac{1}{n_{t}}\sum\limits_{i} \log p(c_{t,i}\lvert p_{t,i}) \end{array} $$
(8)

where nt denotes the batch size and \(p(c_{t,i} \lvert p_{t,i})\) represents the predicted probability that the target image pt,i was filmed by its true camera ct,i, obtained from the camera classification head. Each style transfer image is annotated with the camera identity of the style domain it was transferred to. Cross-camera invariance loss is a reversed form of the original cross-entropy loss. The traditional classification task strengthens the divergences between different categories by reducing the cross-entropy loss so as to obtain more discriminative features. When we optimize the traditional cross-entropy loss in the opposite direction, namely by increasing its value, the camera style domains that were originally relatively independent are disrupted and the inter-domain distance decreases. With the guidance of CCIL, the extracted features become more robust to various camera styles.
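
A minimal sketch of Eq. (8), with our own helper names; note the sign convention, which makes the loss the negative of the usual camera-classification cross-entropy:

```python
import torch
import torch.nn.functional as F

def ccil_loss(cam_logits, cam_ids):
    """Sketch of Eq. (8).

    cam_logits: (B, Y) output of the FC *C-id camera classifier.
    cam_ids:    (B,) true camera id of each target image (style-transferred
                images carry the camera id of their target style domain).
    The value is always <= 0; pushing it further down raises the camera
    cross-entropy, mixing the camera style domains together.
    """
    log_p = F.log_softmax(cam_logits, dim=1)             # log p(c | p_{t,i})
    return log_p.gather(1, cam_ids.unsqueeze(1)).mean()  # (1/n_t) sum log p(c_{t,i} | p_{t,i})
```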

3.4 Final loss for network

During training, we collaboratively optimize the source and target domains. The traditional cross-entropy loss function is leveraged to optimize the source domain as,

$$ \begin{array}{@{}rcl@{}} \mathcal{L}_{s} = -\frac{1}{n_{s}}\sum\limits_{i} \log p(l_{s,i} \lvert p_{s,i}) \end{array} $$
(9)

where ns denotes the number of source images in a mini-batch and ls,i is ps,i’s person identity. Finally, the source domain loss and the target domain losses are combined as,

$$ \begin{array}{@{}rcl@{}} \mathcal{L} = \beta \mathcal{L}_{s}+\mathcal{L}_{RNLSL}-\log(-\mathcal{L}_{CCIL}) \end{array} $$
(10)

where β controls the proportion of \({\mathscr{L}}_{s}\). It is worth noting that the value of \({\mathscr{L}}_{CCIL}\) is always negative, so \(-{\mathscr{L}}_{CCIL}\) is positive and the logarithm is well defined. In experiments, we found that \({\mathscr{L}}_{CCIL}\) decreased rapidly in the first epoch of training, which caused the overall loss to become negative and severely disrupted the optimization process. The logarithmic function therefore limits the weight of \({\mathscr{L}}_{CCIL}\) in the overall loss, making the entire optimization smoother.
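
Putting the pieces together, Eq. (10) can be sketched as below (helper name ours):

```python
import torch

def total_loss(l_src, l_rnlsl, l_ccil, beta=4.0):
    """Sketch of Eq. (10). l_ccil is always negative (a mean of
    log-probabilities), so -l_ccil > 0 and the log is well defined;
    the log damps its influence as the camera domains mix."""
    return beta * l_src + l_rnlsl - torch.log(-l_ccil)
```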

4 Experiments and analysis

We perform experiments on three popular academic benchmarks: Market-1501 [22], DukeMTMC-reID [23], and MSMT17 [20]. These three datasets include abundant variations in viewpoint, occlusion, illumination, pose, and background, which exactly match the investigated issues.

4.1 Datasets

The Market-1501 training set contains 12,936 images of 751 pedestrians, while the test set contains 19,732 images of 750 identities and the query set contains 3,368 images. Six cameras are used to capture this dataset.

DukeMTMC-reID is obtained from the multi-camera tracking dataset DukeMTMC by manually sampling bounding boxes, which results in images of different sizes. There are 8 cameras and about 36k labeled images of about 1.4k pedestrians in DukeMTMC-reID. The training set and the test set each contain half of the person identities.

MSMT17 was released recently. It uses Faster R-CNN [52] as the pedestrian detector and extracts 126,441 bounding boxes of 4,101 pedestrians from video sequences captured under different weather conditions. MSMT17 is divided with a training-test ratio of 1:3, rather than equally like other datasets, in order to encourage efficient training strategies. Finally, the training set contains 1,041 pedestrians with 32,621 bounding boxes, while the test set contains 3,060 identities with 93,820 bounding boxes.

4.2 Metrics

As to the evaluation metric, we choose the following ones:

  • Rank-n. It denotes the probability that a positive sample appears among the top-n recalled results.

  • Mean Average Precision (mAP). It is calculated as the area under the precision-recall (PR) curve, where P refers to precision and R refers to recall. In a robust re-id system, it is hoped that as many true matches of the query as possible are recalled, and that the most confident returns are positive ones. The mAP metric urges the model to balance precision and recall, making it a significant means of measuring performance (a sketch of both metrics follows this list).
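
For concreteness, the following is a minimal sketch of both metrics for a single query, with our own helper name; mAP is the mean of the per-query average precision:

```python
import numpy as np

def rank_n_and_ap(good_mask, n=1):
    """good_mask: boolean array over the gallery, already sorted by
    descending similarity to the query; True marks a correct match.
    Returns (rank_n_hit, average_precision)."""
    rank_n_hit = bool(good_mask[:n].any())  # any true match in the top-n?
    hits = np.cumsum(good_mask)             # correct matches seen so far
    precision = hits / (np.arange(len(good_mask)) + 1)
    ap = (precision * good_mask).sum() / max(good_mask.sum(), 1)
    return rank_n_hit, ap
```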

4.3 Implementation details

ResNet-50 [53] is employed as the backbone to extract the base features. MBCR retains only the layers before the last average pooling layer and adds an embedding module. During training, the random transformations used for data augmentation are the same as in [53]. The input images are resized to 256 × 128. Each batch contains 128 source domain samples and 128 target domain samples, where the target domain samples are randomly selected from the original images and the camera-style transfer images. The scaling factor α is set to 20, the updating rate of the feature memory is set to λ = 0.1, and the weight β of Ls is set to 4. For the reciprocal neighbors label smoothing loss (RNLSL), the size of the neighbor candidate list k is set to 14. SGD [54] is employed as the optimizer, with a learning rate of 0.01 for the backbone and 0.1 for the other layers. During inference, we utilize the L2-normalized output of the average pooling layer as the features, and the Euclidean distance is adopted to measure the similarity between the query image and the gallery images.
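
For convenience, the stated hyper-parameters can be collected in a single (hypothetical) configuration dict; the key names are ours:

```python
config = {
    "backbone": "resnet50",    # ImageNet pre-trained, truncated after avg pooling
    "input_size": (256, 128),
    "batch_source": 128,
    "batch_target": 128,       # mix of original and camera-style-transferred images
    "alpha": 20,               # similarity scaling in Eq. (6)
    "lambda": 0.1,             # memory update rate in Eq. (4)
    "beta": 4,                 # source loss weight in Eq. (10)
    "k": 14,                   # size of the neighbor candidate list
    "optimizer": "SGD",
    "lr_backbone": 0.01,
    "lr_heads": 0.1,
}
```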

Baseline

The proposed method is based on the mechanism of ECN [11], so ECN is selected as the baseline for experimental analysis.

4.4 Comparisons with the previous SOTA method

We compare MBCR with hand-crafted methods (LOMO [31] and BoW [22]) and other strong unsupervised learning methods. Table 1 reports the experimental comparisons on Market-1501 and DukeMTMC-reID, where one of them serves as the source domain and the other as the target domain. The results show that MBCR is conspicuously superior on these two large-scale datasets. LOMO and BoW perform poorly on both datasets; the mAP of these two methods is even below 10% when tested on DukeMTMC-reID, because these hand-crafted methods neither utilize the supervised information in the source domain nor mine the potential invariance in the target domain. CAMEL [14] significantly improves the rank-1 accuracy through unsupervised clustering, but the label pollution generated by clustering limits its performance. Compared with previous strong domain adaptation methods (including PTGAN [20], SPGAN [18], CamStyle [51], HHL [55], OSNet-AIN [56], CCSE [57], PGS [6], PREST [58], and MMCL [4]), MBCR outperforms them significantly on both datasets. Specifically, MBCR attains rank-1 accuracy = 81.3% and mAP = 53% when DukeMTMC-reID and Market-1501 are used as the source and test datasets, respectively, and reaches rank-1 accuracy = 69.2% and mAP = 48.5% when using Market-1501 as the source dataset and testing on DukeMTMC-reID. Compared to the baseline method ECN, MBCR achieves rank-1 accuracy gains of 6.2% and 5.9% when tested on Market-1501 and DukeMTMC-reID, respectively.

Table 1 Performance (%) comparison with previous SOTA method on Market-1501 and DukeMTMC-reID

Finally, to verify generalization performance, we also adopt the dedicated backbone network OSNet [39] for supplementary experiments. OSNet can dynamically capture multi-scale features and aggregate them with flexible weights. To effectively capture correlations among spatial channels and alleviate overfitting, OSNet employs both pointwise and depthwise convolutions, which enables the model to achieve better performance with fewer parameters. With OSNet, the rank-1 accuracy and mAP of MBCR increase by 7.3% and 14.3% when using DukeMTMC-reID as the source dataset and testing on Market-1501, while ECN improves by only 3.6% and 7.9% on the two metrics. In MBCR, the gains from a high-performance backbone are even more significant, as is also the case when verified on DukeMTMC-reID: the proposed method integrates better with the backbone network and fully exploits its capacity. Meanwhile, compared with the previous SOTA methods PREST and MMCL, MBCR still shows strong competitiveness.

We also demonstrate the scalability of the proposed method on the larger dataset MSMT17. Compared with the previous two datasets, MSMT17 contains more pedestrian images, bounding boxes, and cameras, as well as more complicated scenes and backgrounds, a longer time span, and intricate lighting variations. Since MSMT17 was released recently, few unsupervised methods have published experimental results on it, so two unsupervised methods (PTGAN and ECN) are selected for comparison. As shown in Table 2, the proposed approach significantly outperforms PTGAN and ECN, whether using Market-1501 or DukeMTMC-reID as the source domain. Specifically, MBCR attains rank-1 accuracy = 35.2% and mAP = 12.2% when using DukeMTMC-reID as the source dataset. Compared to the baseline method ECN, MBCR boosts rank-1 and mAP by 5% and 2%, respectively.

Table 2 Performance evaluation when tested on MSMT17

All in all, MBCR can utilize memory-based circular ranking mechanism to produce smooth labels for the unlabeled dataset. With the guidance of RNLSL and CCIL, MBCR can mine the identity information hidden in the neighborhood and make the model less sensitive to various variations in the target domain.

4.5 Ablation study

To demonstrate that the performance improvements described in this paper are due to the proposed components, extensive ablation experiments are performed and reported in Table 3.

Table 3 Ablation study (%) on Market-1501 and DukeMTMC-reID

First, RNLSL is added to the baseline network to demonstrate its effectiveness. As shown in Table 3, RNLSL improves the rank-1 accuracy from 75.1% to 78.7% and from 63.3% to 66.7% when regarding Market-1501 and DukeMTMC-reID as the unlabeled domain, respectively. This shows that RNLSL can not only effectively narrow the gap between truly matching images but also mine hard negative samples in the neighborhood.

Next we validate the performance improvements of CCIL. In Table 3, CCIL achieves rank-1 accuracy increments of 3.2% on Market-1501 and 3.1% on DukeMTMC-reID. This demonstrates that CCIL can instruct the model to extract camera-independent features, effectively alleviating the influence of camera style diversity. Moreover, the combination of RNLSL and CCIL further improves performance. This indicates that the two loss functions coordinate with each other to guide model optimization from two different aspects. Specifically, CCIL makes the network more robust to image variations in the unlabeled domain, narrowing the gap among images of the same identity with different camera styles. This allows the memory-based circular ranking mechanism to more accurately distinguish the positive samples from the negative samples in the neighborhood and generate more accurate smooth labels. With the guidance of these smooth labels, RNLSL prompts the model to extract more discriminative features.

To further show that the proposed RNLSL, which builds on k-reciprocal encoding, performs better than k-reciprocal encoding alone, an ablation study is conducted as reported in Table 3. Pure k-reciprocal encoding indeed brings increments of 1.9% and 2.8% in rank-1 and mAP, but these are still smaller than the gains from RNLSL. Moreover, the combination of ECN, k-reciprocal encoding, and CCIL performs worse than that of ECN, k-reciprocal encoding, and RNLSL.

4.6 Further analysis

To further understand the effectiveness of the proposed memory-based circular ranking mechanism, we demonstrate how the model screens the candidates in a sample’s neighborhood during training, referring to Fig. 4. Simultaneously, we carry out experiments on two important hyper-parameters of MBCR: the weight of the source domain loss β and the number of neighbor candidates k.

Fig. 4

Example results of four images on the Market-1501 dataset. For each probe, its initial k-nearest neighbors are listed. The two rows below the initial list show the positive and negative sample sets produced by memory-based circular ranking, respectively

4.6.1 Analysis of memory-based circular ranking

As shown in Fig. 4, true positive samples (marked with a green border) and true negative samples (marked with a red border) are mixed together in the initial neighborhood. ECN mines neighborhood invariance by directly reducing the gap between a sample and all of its neighbors, whereas MBCR uses the feature memory to perform circular ranking, which further examines the similarity between two images and effectively reduces the noise in the original neighborhood. Four example results are shown in Fig. 4; most candidates in the sample neighborhoods are effectively distinguished, and only a few are classified into the wrong set.

4.6.2 The weight of loss: β.

The analysis of the source domain loss weight β is reported in Fig. 5. When β is fixed to 0, the network is optimized only by the unlabeled images of the target domain. As β increases, the source domain and the target domain jointly guide the training process and improve the results, showing that MBCR leverages the labels in the source domain as useful guidance during training. The model achieves the best results at β = 4. As β continues to increase, the performance of MBCR begins to degrade: at β = 8, the proportion of the target domain losses is too small, so the re-id model focuses on the distribution of the source domain and ignores the invariance in the target domain.

Fig. 5

Evaluation with different values of β

4.6.3 The number of candidate neighbors: k.

In Fig. 6, we show the comparison with ECN when k takes different values. As k increases, the results continue to improve, reaching an optimum at k = 14. Compared with ECN, MBCR extends the size of the reliable neighborhood from 6 to 14, because MBCR can utilize the memory-based circular ranking mechanism to more accurately distinguish images that look similar but have different identities. When k is assigned a large value, the rank-1 accuracy of MBCR does not decay as quickly as that of ECN. We also point out that MBCR outperforms the best result of ECN at all values of k.

Fig. 6

Analysis of the number of candidate neighbors together with ECN

4.6.4 Computational cost analysis

In this paper, feature memory and mini-batch schemes are compared for training the model with the proposed loss functions. In the mini-batch based method, the input samples comprise the target sample, its corresponding camera style transfer sample, and its k-nearest neighbor candidate samples. Referring to Table 4, the memory-based approach is significantly superior to the mini-batch based one. It is worth noting that the memory-based method introduces limited additional training time (+1.6 minutes) and GPU memory (+780 MB), which are negligible compared to the total cost.

Table 4 Computational cost analysis of the exemplar memory

Feature memory modules are frequently employed in recent self-supervised training models, and most existing unsupervised models are based on contrastive learning. Traditional neural network training is carried out in mini-batches, where comparing positive and negative samples is unreliable due to the limited number of samples: there may be no pedestrian with the same identity as the target sample in the mini-batch, making circular ranking within it meaningless. By adding a feature memory module, the model can rank globally, and the performance improves significantly.

5 Conclusion

In this work, we present a novel unsupervised domain adaptation method for person re-identification that uses a memory-based circular ranking mechanism to adaptively assign pseudo-labels in the target domain. It is worth noting that, unlike existing unsupervised re-id methods, MBCR does not use clustering to generate pseudo-labels for the entire set of target images, because clustering algorithms involve heavy CPU computation and long training times. The memory-based circular ranking mechanism iteratively generates smooth labels for the samples in each mini-batch, which significantly decreases the time cost and avoids the noise caused by clustering. Different from previous unsupervised approaches, MBCR also scales better to large real-world pedestrian benchmarks. The labeled source images supervise the training process, while unsupervised optimization of the target domain is realized by jointly training with the reciprocal neighbors label smoothing loss (RNLSL) and the cross-camera invariance loss (CCIL). RNLSL screens out positive samples in the neighborhood and mines hard negative samples with similar appearance; CCIL ensures that the deep re-id network is robust to variations of camera style. Abundant experiments demonstrate the superiority of the components proposed in this paper.