1 Introduction

Person re-identification (ReID) seeks to match a pedestrian’s visual representation against a gallery of images taken by a camera network. In general, person ReID is difficult due to challenges such as (i) the requirement of re-identifying (matching) a person across different camera sensors, (ii) intra-class variations such as changes in clothing, viewpoint, scale, and body deformation, and (iii) background clutter and distracting objects. Despite these challenges, deep learning based models [1,2,3,4,5] have made considerable progress in recent years, achieving state-of-the-art performance in the supervised setting by extracting high-quality discriminative features. The majority of recent studies focus on designing novel loss functions and proposing better feature-extraction strategies for person ReID. Complementary design choices, such as batch normalization [1] and centroid loss [2], have also been investigated in the literature.

Fig. 1. The objective of a test-time adaptation (TTA) model is to update the encoder parameters \(\theta \) during the testing phase, so that the weights are updated as \(f_{\theta } \rightarrow {f_{\theta +1}}\) at each step. The proposed method, Anchor ReID, uses feature distribution alignment to minimize the divergence caused by domain shift.

Although recent supervised deep learning approaches achieve impressive performance when the source and target datasets come from the same domain, they struggle to generalize across domains. In practical re-identification applications, however, the model will very likely be deployed on target domains that differ from the source domain used for training. For example, camera sensors, camera viewpoints, backgrounds, illumination, and clothing styles may all differ between the source and target domains. In recent years, unsupervised domain adaptation (UDA) for ReID [6,7,8] has achieved strong performance while avoiding the extensive labor of labeling target-domain data. Combining labeled source-domain data with unlabeled target-domain data yields more discriminative feature embeddings and better results. Pseudo-labeling has driven much of this progress, with methods such as [7] that exploit a dynamic memory during training to pseudo-label the target domain. These solutions, however, require more resources and training time for a model to adapt to a new domain. Methods that take an adversarial approach to adapting to new domains [9,10,11] likewise require access to the source data and are expensive in terms of training resources.

Domain generalization (DG) in person ReID goes beyond the premise of UDA by designing a model that is more robust to domain shift without accessing target-domain data. The common strategy is to train a model on one or more source datasets and test it on other target datasets to measure how well it generalizes to unseen domains. Combining batch normalization (BN) and instance normalization (IN) [12,13,14] makes the model’s normalization statistics more invariant to domain shift. Another direction is to match the convolutional feature maps [3, 15] of the gallery and query images, resulting in a more robust and generalizable framework.

The majority of DG studies focus on designing more robust models that are trained on source datasets in an offline setting, with explicit components or training mechanisms for domain generalization. Similarly, UDA methods [7,8,9, 16, 17] use labeled source-domain data together with unlabeled target-domain data and train their models offline. Popular DG [18,19,20] and UDA approaches disregard the fact that real-world environments change continually: they train offline and provide no mechanism for updating the model parameters when data from unseen domains are encountered. A major obstacle to a practical model for real-world applications is therefore its ability to adapt to changes in the environment and to deployment in a domain that differs from the source domain. Thus, we propose a new problem formalization designed for practical use in real-world applications, named test-time adaptation (TTA) for re-identification. Our TTA problem setting takes a pre-trained model (one that is not designed for domain generalization) and adapts it to the given target domain during inference, without access to the source dataset used for pre-training.

In this work, we propose a framework for TTA named Anchor ReID, which uses a parameter \(\theta \) updating method similar to [18] together with a novel anchor distribution strategy that summarizes the source domain, as shown in Fig. 1. Anchor ReID uses the KL-divergence to measure the discrepancy between the two data sources [21]. Our framework can take any pre-trained ReID model from the source domain, and it does not require additional components or an explicit training strategy on the source data for domain generalization. The proposed method is therefore designed to be compatible with any fully supervised person ReID method. In summary, the contributions of this paper are as follows:

  • We introduce a new problem formulation, TTA for person ReID, that focuses on cross-domain evaluation and brings a unique set of challenges.

  • We design a new framework that takes a pre-trained ReID model and builds a summarization distribution of the source domain to adapt effectively to the target domain.

  • We show that by using a pre-trained model trained on Market1501 [22], Anchor ReID achieved an absolute gain of over 10 mAP on CUHK03 [23] and Duke-mtmc [24] benchmarks when compared to the baseline [1] model.

Fig. 2. We visualize the distribution of each domain by taking 64 samples and projecting them into a 2D feature space using t-SNE. On the left, each domain distribution occupies its own region of the 2D space, which degrades discriminative feature extraction. In domain adaptation, the main objective is for the encoder to adapt to new domains. Thus, in the right image, Anchor ReID blurs the boundaries between domains, forcing the encoder to push the target-domain means \(\mu \) toward the source domain.

2 Related Work

Domain Generalizability: The objective of DG ReID is to learn a robust model that is invariant to domain shift so that it performs better on unseen target domains, whereas standard ReID aims for a more discriminative visual representation of each instance. For DG, [14] proposed using instance normalization (IN) to eliminate style variations and to disentangle relevant from irrelevant features, yielding a more robust model. Normalization plays a major role in model robustness, which is explored in [13] and [25] by replacing batch normalization with, or adding, instance normalization in the bottleneck. These approaches require considerable trial and error to reach a consistent and robust solution. MetaBIN [12] combines BN and IN with meta-learning to boost the generalization capability. Image augmentations are used in [26] to make a model generalize to unseen data. Instead of matching on embedded ReID features, another approach is to extract a feature map and preserve the spatial information of the image. The methods in [3, 15] extract feature maps and pass them into a kernel that computes a similarity score for each pair within the network. Furthermore, better sampling of the data through a graph [3] improves the model’s accuracy.

Test-Time Adaptation: UDA methods require access to both the source domain and the target domain. TTA, in contrast, does not require access to the source domain for adaptation at inference. Fine-tuning a source-trained model has recently emerged as a more cost-effective approach in terms of resources and training time. One way to reduce a model’s uncertainty is to minimize the entropy of its predictions by updating its normalization parameters. Test entropy minimization (TENT) [18] computes the entropy of the predictions and updates the batch normalization layers to adapt to the new domain. Updating only the normalization parameters \(\theta _{norm}\) reduces the risk of the model diverging. TTT++ [19] uses self-supervised learning with SimCLR to train the model and then generates a summarization of the source domain. At inference, TTT++ uses the distribution of the source domain to minimize its distance to the target domain. Similar to TENT, [16] restricts the update to a small part of the model, in this case only the classification layer, which makes it a back-propagation-free solution for TTA. This is done by generating pseudo-labels from previous inferences to update the classifier.

To the best of our knowledge, no published work addresses TTA for re-identification, which makes our work the first in the field.

3 Problem Formulation

Table 1. Summary of the different ReID domain settings and how our proposed method differs from previous methods. The training data describes which data a model needs in order to update its parameters, in either the training or the testing stage. Anchor ReID takes pre-trained weights, so the model is already trained; unlike the other settings, it updates the model parameters \(\theta \) only at inference.

We now describe the problem formalization of TTA ReID and its constraints in detail. The formulation builds on TTA for classification but adds privacy constraints, since ReID is a more sensitive security application. Table 1 summarizes how the proposed setting differs from the other ReID settings. TTA ReID does not access source-domain images when updating the model at inference, unlike settings such as UDA that rely on source data. The main objective is to make a model adapt at inference, whereas the other settings focus on better feature discrimination. A pre-trained model \(f_\theta (x)\), trained on \((X^s, Y^s)\), is used for inference on a stream of data from \(X^t\). Each sample \(x^{t}_{i}\) is accessible only once during inference. The goal is to update the model parameters \(\theta \) so that the model adapts better to the target domain. The target set is split into multiple streams of a fixed size, \(X^t=(X^{t}_{1}, X^{t}_{2}, X^{t}_{3}, \ldots , X^{t}_{n})\), and the streams are sent to the model sequentially. Each instance of a stream is processed only once by the model to mimic real-world deployment. At iteration \(i\), the model produces the output \(f_{\theta _{i-1}}(X^{t}_{i})\) and its parameters are then updated, \(\theta _{i-1} \rightarrow \theta _{i}\).
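The streaming protocol described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `split_into_streams`, `run_tta`, `toy_predict`, and `toy_update` are hypothetical, and the toy model is a single scalar parameter that is nudged toward the mean of each batch.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def split_into_streams(target: Sequence[float], stream_size: int) -> Iterator[Sequence[float]]:
    """Yield consecutive, non-overlapping chunks of the target set."""
    for start in range(0, len(target), stream_size):
        yield target[start:start + stream_size]


def run_tta(
    target: Sequence[float],
    stream_size: int,
    predict: Callable[[float, Sequence[float]], List[float]],
    update: Callable[[float, Sequence[float]], float],
    theta: float,
) -> Tuple[List[float], float]:
    """Process every chunk exactly once: predict first, then adapt."""
    outputs: List[float] = []
    for batch in split_into_streams(target, stream_size):
        outputs.extend(predict(theta, batch))  # output uses the current parameters
        theta = update(theta, batch)           # one parameter update per chunk
    return outputs, theta


# Hypothetical toy model: a scalar offset that tracks the batch mean.
def toy_predict(theta: float, batch: Sequence[float]) -> List[float]:
    return [x - theta for x in batch]


def toy_update(theta: float, batch: Sequence[float], lr: float = 0.5) -> float:
    batch_mean = sum(batch) / len(batch)
    return theta + lr * (batch_mean - theta)
```

Because predictions for a chunk are produced before the update that uses it, early chunks are scored with parameters that have seen little target data, which is the behavior the formulation is meant to capture.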

4 Method

Our main objective is to minimize the disparity between the source and target domains for the ReID model in a test-time adaptation setting, in which the target-set data are divided into a stream and are accessible only once. Anchor ReID is built from multiple modules that together make it possible to adapt an off-the-shelf ReID model. Sect. 4.1 presents the sampling strategy that selects the samples summarizing the source domain. Sect. 4.2 describes the policy for selecting which parameters to update and optimize; in our case, the BN layers are the only parameters being optimized. Finally, we design a way to measure the discrepancy between the different datasets.

Fig. 3. The proposed Anchor ReID framework comprises two stages: offline sampling and test-time adaptation (TTA). (i) First, in the offline sampling stage, we take an off-the-shelf person ReID model and extract a single embedding for each ID class in the source dataset \(X^s\). These embeddings are stored in a cache, from which P samples are selected as anchors; together they form an anchor distribution that summarizes the source data. (ii) Next, in the test-time adaptation stage, we obtain the embeddings of the test data and use them to compute the disparity between the test-data distribution and the anchor distribution. Finally, the normalization layers of the model are updated to minimize the disparity between the anchor and target distributions.

4.1 Sampling the Anchor Memory

Taking inspiration from work on unsupervised domain adaptation [3, 9, 17], Anchor ReID takes an alignment learning approach and tries to minimize the discrepancy between the source and target datasets. In the offline stage (Fig. 3), we construct a subset of the source data by randomly selecting one image from each ID class. Assuming that \(f_\theta (x)\) is robust on the training set, the random choice has minimal effect, since all instances of an ID have almost identical embeddings. Applying \(f_\theta (x)\) to this subset produces the Anchor Memory \(A\in R^{C\times d}\), where C is the number of ID classes and d is the embedding dimension. The next step is to reduce the number of samples in the memory by selecting P of them. Pairwise distances between the class embeddings are computed, \(dist \in R^{C \times C}\). The most central ID instance in the distribution is selected first, followed by its \(P-1\) nearest samples. Selecting samples that are close to each other is essential because it reduces the variance of the distribution. The Anchor Memory acts as a summarization of the source domain and eliminates the need to access \(X^s\) during TTA.
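The selection step can be sketched as follows. This is an illustrative reading of the description rather than the authors' code: the function name `select_anchors` is hypothetical, Euclidean distance is assumed, and "most centered" is interpreted as the embedding with the smallest total distance to all others (the medoid).

```python
import math
from typing import List, Sequence


def select_anchors(embeddings: Sequence[Sequence[float]], p: int) -> List[int]:
    """Return the indices of P anchors: the medoid and its P-1 nearest neighbours."""
    c = len(embeddings)
    if not 1 <= p <= c:
        raise ValueError("p must satisfy 1 <= p <= number of embeddings")

    # Pairwise Euclidean distance matrix of shape C x C.
    dist = [[math.dist(a, b) for b in embeddings] for a in embeddings]

    # Assumed definition of the most central instance: smallest summed distance.
    center = min(range(c), key=lambda i: sum(dist[i]))

    # Sort all instances by distance to the center; the center itself comes first.
    order = sorted(range(c), key=lambda j: dist[center][j])
    return order[:p]
```

For example, with embeddings `[[0, 0], [1, 0], [0, 1], [10, 10], [0.5, 0.5]]` and `p=4`, the outlier at `[10, 10]` is excluded, which is the variance-reducing effect the text describes.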

4.2 Parameter Selection and Update

ReID models encode the feature representation of a pedestrian into a single embedding vector. Although this gives the model highly discriminative visual features, it also makes optimizing the model parameters \(\theta \) more difficult. Optimizing the \(\theta \) of the whole model may cause it to diverge significantly and damage the robustness of the visual feature extraction. To reduce this sensitivity, we select only the normalization layers for adapting the model to the target domain. Similar to [18], the normalization parameters are updated during inference. Updating the running mean and running variance of the normalization layers shifts them toward the target domain while reducing the risk that the model drifts away from its optimal \(\theta \). ReID models are designed to be highly discriminative between IDs: they extract embeddings that are highly similar within the same ID class and repel embeddings of different IDs. For Anchor ReID, the BN on the final layer [1] is particularly important, since it affects the similarity measurement, as can be seen in Fig. 2.
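How running statistics drift toward the target domain can be illustrated with the exponential-moving-average rule commonly used by batch normalization layers. The sketch below is not the authors' code: the function name `update_running_stats`, the single scalar feature, the momentum of 0.1, and the use of the unbiased batch variance are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def update_running_stats(
    running_mean: float,
    running_var: float,
    batch: Sequence[float],
    momentum: float = 0.1,
) -> Tuple[float, float]:
    """Blend the stored statistics with the statistics of one target batch."""
    n = len(batch)
    if n < 2:
        raise ValueError("need at least two samples to estimate a variance")

    batch_mean = sum(batch) / n
    # Unbiased batch variance (an assumption of this sketch).
    batch_var = sum((x - batch_mean) ** 2 for x in batch) / (n - 1)

    new_mean = (1.0 - momentum) * running_mean + momentum * batch_mean
    new_var = (1.0 - momentum) * running_var + momentum * batch_var
    return new_mean, new_var
```

Starting from source statistics (mean 0, variance 1) and repeatedly feeding target batches with mean 5, the running mean moves a fraction `momentum` of the remaining gap at each step, so only the normalization statistics change while all other weights stay fixed.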

Next, a challenging aspect is measuring the uncertainty of the predicted output \(Z^t_i\), since TENT [18] and other TTA frameworks compute an entropy-based loss over class predictions, whereas a ReID model outputs an embedding. Given that \(X^t\) is the only data available, a feature alignment method from adversarial UDA is required for robust adaptation at test time. The strategy is to constrain the feature distribution of the target domain to stay near that of the training domain. For feature alignment, Maximum Mean Discrepancy (MMD) [21] is widely used in adversarial learning for unsupervised domain adaptation: it measures the distance between the means of two distributions and acts as a discriminator loss function. The Anchor Memory \(A \in R^{P\times d}\) serves as an anchor that constrains the feature distribution of \(X^t_i\) to stay near the source distribution, which minimizes the latent feature discrepancy, as seen in Fig. 2. Features are mapped into a reproducing kernel Hilbert space \(\mathcal {H}_k\), in which a distribution \(\mathbb {P}\) is represented by its mean embedding \(\mu _k(\mathbb {P})\). Given the anchors \(A\) and the target embeddings \(Z^t\), the MMD distance between them reads \(MMD^2(A,Z^t)=||\mu _A-\mu _{Z}||^{2}_{\mathcal {H}_k}\). Because the proposed method does not have access to the full distributions, we estimate this quantity with a Gaussian kernel as follows:

$$\begin{aligned} MMD^2=l(A,Z^t)=\frac{1}{m(m-1)}\sum _i \sum _{j \ne i}k(z_{i},z_{j})-\frac{2}{m^2}\sum _i \sum _{j}k(z_{i},a_{j}) \\ + \frac{1}{m(m-1)}\sum _i \sum _{j \ne i}k(a_{i},a_{j}) \end{aligned} \tag{1}$$
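Equation (1) can be evaluated directly on two equally sized sets of embeddings. The following is a minimal sketch, not the authors' implementation: the names `gaussian_kernel` and `mmd2` are hypothetical, and the kernel bandwidth `sigma=1.0` is an assumed value that the text does not specify.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def gaussian_kernel(x: Vector, y: Vector, sigma: float = 1.0) -> float:
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2)); sigma is an assumed bandwidth."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2(anchors: Sequence[Vector], targets: Sequence[Vector], sigma: float = 1.0) -> float:
    """Estimate MMD^2 between anchors a_i and target embeddings z_i as in Eq. (1)."""
    m = len(targets)
    if m < 2 or len(anchors) != m:
        raise ValueError("both sets must contain the same number m >= 2 of vectors")

    # Within-target term: pairs with j != i, normalised by m(m-1).
    zz = sum(
        gaussian_kernel(targets[i], targets[j], sigma)
        for i in range(m) for j in range(m) if i != j
    ) / (m * (m - 1))

    # Within-anchor term: pairs with j != i, normalised by m(m-1).
    aa = sum(
        gaussian_kernel(anchors[i], anchors[j], sigma)
        for i in range(m) for j in range(m) if i != j
    ) / (m * (m - 1))

    # Cross term: all pairs, normalised by m^2.
    za = sum(gaussian_kernel(z, a, sigma) for z in targets for a in anchors) / (m * m)

    return zz - 2.0 * za + aa
```

With anchors `[[0.0], [1.0]]`, a distant target batch `[[5.0], [6.0]]` yields a value of roughly 1.21, while a nearby batch `[[0.1], [1.1]]` yields a much smaller one, so minimizing this loss pulls the target embeddings toward the anchors. Because the cross term includes the pairs with equal indices while the within-set terms exclude them, the estimate can be slightly negative for nearly identical sets.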

5 Results

5.1 Implementation Details

Our experiments are adapted from the official PyTorch code of [1]. This codebase was selected to mimic the experimental environment of TTA papers, since it uses ResNet50 as a backbone and offers multiple public pre-trained weights for different datasets and configurations. The main premise of TTA is to take an off-the-shelf model and adapt it at inference. By stating these design decisions, we make clear that we take pre-trained weights and only adapt them at inference. The batch size is set to 128, and the same number of anchor samples is used, \(P=128\). The Adam optimizer is used for Anchor ReID with a learning rate of 0.0003 and betas of \((0.9, 0.99)\).

5.2 Datasets

The datasets were selected according to the public pre-trained weights of [1]. CUHK03 [23], Market-1501 [22], and Duke-mtmc [24] are used in the experiments. The pre-trained weights are trained on Market-1501 and Duke-mtmc. The CUHK03 dataset is composed of 13,164 images of 1,360 pedestrians. The Market-1501 dataset includes 1,501 IDs, split into 12,936 images for training and 19,732 images for testing. Finally, the Duke-mtmc dataset contains 16,522 training images of 702 identities and 19,889 testing images of 702 identities.

Evaluation Protocol. The standard TTA evaluation metric is the error of the model predictions. For Anchor ReID, however, we follow the ReID community and report the Cumulative Matching Characteristics (CMC) curve and the mean Average Precision (mAP). As in domain generalization, the objective is to measure cross-domain performance.
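For readers less familiar with these metrics, the following sketch computes mAP and a CMC rank-k score from a query-by-gallery distance matrix. It is a simplified illustration, not the evaluation code used in the experiments: the names `average_precision` and `evaluate` are hypothetical, and the camera-based filtering of gallery entries used in standard ReID protocols is omitted.

```python
from typing import List, Sequence, Tuple


def average_precision(ranked_matches: Sequence[bool]) -> float:
    """Mean of the precision values measured at each correct match in the ranking."""
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def evaluate(
    dist: Sequence[Sequence[float]],
    query_ids: Sequence[int],
    gallery_ids: Sequence[int],
    topk: int = 1,
) -> Tuple[float, float]:
    """Return (mAP, CMC rank-k) for a query-by-gallery distance matrix."""
    aps: List[float] = []
    cmc_hits = 0
    for q, row in enumerate(dist):
        # Rank the gallery from the most to the least similar entry.
        order = sorted(range(len(row)), key=lambda g: row[g])
        matches = [gallery_ids[g] == query_ids[q] for g in order]
        aps.append(average_precision(matches))
        cmc_hits += int(any(matches[:topk]))
    n = len(dist)
    return sum(aps) / n, cmc_hits / n
```

As a usage example, `evaluate([[0.1, 0.9, 0.5], [0.2, 0.8, 0.3]], [1, 2], [1, 2, 1])` ranks both correct matches first for the first query (AP of 1) and the single correct match last for the second query (AP of 1/3), giving an mAP of 2/3 and a rank-1 score of 0.5.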

Table 2. Comparison on cross-domain evaluation for person re-identification using CMC and mAP evaluation metrics. The [1] weights are taken as baseline to measure (\(\%\)) change with Anchor ReID. The Anchor ReID (\(\%\)) is the mean and standard deviation from running the experiment 10 times, due to shuffling the Test set.

5.3 Results

Table 2 compares the performance (\(\%\)) of Anchor ReID with the baseline [1] using the baseline’s public weights. Anchor ReID is also compared indirectly with state-of-the-art DG models, since they share the same evaluation metrics, to show the challenges that TTA ReID needs to overcome. Overall, Anchor ReID improves on the baseline in all scenarios, with +10 mAP on CUHK03 when the source is Market1501 and +3.7 mAP when the source is Duke-mtmc. In the Market1501 \(\rightarrow \) Duke-mtmc and Duke-mtmc \(\rightarrow \) Market1501 cases, our proposed method improves the baseline by between +4 and +10 mAP. In these cases, however, the proposed method is still behind DG models. Interestingly, in Market1501/Duke-mtmc \(\rightarrow \) CUHK03, Anchor ReID outperforms the majority of DG models and shows a gain of +10 mAP. In general, Anchor ReID improves on the baseline in all cases, but it still lags behind domain generalization models in some scenarios. QACONV-GS [3] performs better on CUHK03 in all cases. It should be noted, however, that Anchor ReID does not add to or change anything in the original model (ResNet50), unlike DG methods. TTA and DG also have different constraints and challenges. A limiting factor for Anchor ReID is that the model \(f_{\theta }\) is updated throughout testing, so the better instance embeddings only appear toward the end, and the first few iterations pull down the mAP and top-1 scores.

The numbers reported in Table 2 are the mean and standard deviation obtained when Anchor ReID is tested with a randomized stream of test data. Randomizing the data stream improves the overall accuracy when adapting to the target sets. This is a consequence of how the datasets are organized. In CUHK03, Market1501, and Duke-mtmc the data are ordered by ID class, so measuring the divergence of the distributions on an unshuffled stream gives a sub-optimal picture of the real distribution of the target set. For comparison, Table 4 shows a mean of 14.7 mAP when shuffling and 10 mAP when not shuffling; Table 4 explores this in more detail.

5.4 Ablation Study

We performed analytic experiments to understand the limitations and effectiveness of Anchor ReID in different settings. Table 3 shows how data streaming and sampling affect the model performance. A limiting factor of the study is the randomized order of the testing set, which results in a different score on each run. To limit the effect of this randomization, all reported numbers are the mean and standard deviation over repeated experiments. Each experiment is run 5 times for a more consistent testing environment. In Table 3, the results vary for each testing set, but there is a large difference in Market1501 to Duke-mtmc when the anchor samples are selected randomly, namely a decrease of −1.7 mAP and −3.4 top-1 accuracy. The explanation is that, with the pairwise sampling method, the mean of the Anchor distribution is closest to the actual mean of the source data, whereas random selection shifts the mean in unpredictable directions.

Table 3. Sampling methods (\(\%\)) on Anchor selection. For each sampling method, we tested them in two scenarios of shuffling and not shuffling the testing set.
Table 4. Performance (\(\%\)) of Anchor ReID in a cross-domain setting as the number of samples used in testing changes. The proposed model performs better with smaller batch sizes when the data are shuffled.

A limiting factor of Anchor ReID is the number of samples needed to improve the mAP and CMC scores in cross-domain testing. Table 4 shows the effect of increasing the number of samples in two settings: without and with shuffling of the testing set. The pattern in both scenarios is that the more samples are introduced into Anchor ReID, the more the overall score improves. The Anchor Memory is a summarization of the \(X^s\) distribution, and the more samples it contains, the more accurately the disparity between the domains is calculated. In the majority of the testing scenarios, the model starts to adapt better once it has at least 64 samples in the Anchor. Furthermore, when the data are not shuffled, the model degrades significantly as long as it has few samples with which to measure the divergence. The datasets sort images by ID class, so when the feature alignment method is applied to the latent embeddings, the mean \(\mu \) is biased by the samples in \(X^{t}_{i}\), resulting in less than optimal performance (\(\%\)).

5.5 Further Evaluations

Table 5. Performance (\(\%\)) comparison of how increasing the number of samples affects TTA. The ablation first applies TTA to the model on the training set of the target domain with label data. Frozen means that, after going through the training set, the model is frozen at inference. In Full, the model goes through the whole set without being frozen for evaluation. In each category, the experiment is run both with and without shuffling the data.

Anchor ReID is affected by the number of samples available at inference. To simulate more data, in Table 5 we took the training set of the target domain and used only its images \(X^t\) to update the model. We divided the testing into two settings. In the first (Frozen), the model is updated on the training set and then frozen in evaluation mode at inference. Compared with the shuffled results in Table 2, there is a large improvement on CUHK03 of +2.8 mAP and a slight decrease on Duke-mtmc of −1.4 mAP. The increase on CUHK03 could be attributed to the large difference between the domains, which suggests that the first few predictions drag down the model’s overall accuracy. In the second setting, the model is not frozen at inference and continues adapting from the updated weights. The model improves slightly on both target domains, but the most interesting insight is that the standard deviation of the CMC and top-1 recall decreases. This shows that Anchor ReID stabilizes over time and is therefore more likely to perform better in real-world applications.

Fig. 4. Qualitative results of the baseline and the proposed method. Yellow marks the query image, red a wrong output, and green a correct output.

In Fig. 4 we visualize the top-3 results returned by the baseline [1] and by the proposed method on the CUHK03 dataset. As can be observed, Anchor ReID is able to retrieve images of pedestrians with similar visual cues. A notable observation is that the baseline retrieves images of people who clearly do not share features with the query image. The baseline seems to confuse background features with the query and instead returns images from the same viewpoint showing individuals with drastically different clothing. In the upper-right and bottom-left examples, the baseline does not differentiate between colors and instead returns images of people with the same pose and direction as the query, whereas Anchor ReID handles color cues better and returns results from different viewpoints.

Table 6. Performance of proposed method on continual testing setting through multiple datasets. The testing goes through Target A and B. The number after ± is the standard deviation after 5 experiments.

To provide a more comprehensive study of the proposed method, we conducted a continual test-time adaptation experiment, shown in Table 6. The experiment illustrates a practical setting in which the source of the data stream may change abruptly depending on the task, simulating a real-world scenario. Across the 4 settings tested, the proposed method maintained its performance.

6 Conclusion

In this paper, we conduct an in-depth analysis of the potential of test-time adaptation for person re-identification; to the best of our knowledge, we are the first to introduce this setting. To address the limited research on TTA for re-identification, we proposed Anchor ReID, a novel method that is evaluated on cross-domain testing. The proposed method works with off-the-shelf models. To adapt these models effectively, we introduce a sampling method that builds an Anchor distribution summarizing the distribution of the source domain. This summarization helps to reduce the divergence of the target domain from the source domain. The proposed method was efficient in adapting to new domains and showed promising improvements when compared to domain generalization models.