1 Introduction

Visual surveillance plays a pivotal role in proactively managing risks and bolstering security through extensive person tracking across large areas. With the proliferation of CCTV cameras, traditional manual surveillance methods are increasingly impractical, necessitating the adoption of automated video surveillance systems powered by computer vision. Facial features [1, 2] are widely used as biometric identifiers due to their distinctiveness. Recent advancements in face recognition leveraging Robust Principal Component Analysis (RPCA) [3] and optimization techniques such as Grey Wolf Optimization [4, 5] have significantly propelled the field forward. Despite these advancements, challenges persist in CCTV environments where facial visibility can be compromised or completely obscured, prompting the use of person re-identification (reID) techniques to ensure reliable identification under such circumstances.

Person reID has emerged as a prominent area of research in computer vision in recent years, addressing the challenge of matching individuals across different visual contexts. This task is complex due to variations in poses, lighting conditions, and occlusions, resulting in significant differences in how individuals appear across images and videos. Applications of person reID include surveillance systems, human identification in videos, pedestrian tracking, and crowd analysis, highlighting its practical importance in various domains.

Although supervised reID methods [6] have achieved commendable performance, they rely on labelled data from the same domain (dataset) for model training. When these supervised models are deployed in different domains, they frequently experience a significant performance drop [7].

In this context, Unsupervised Domain Adaptation (UDA) [9] plays a crucial role in adapting supervised reID models to new, unlabeled domains by mitigating the issue of domain shift. UDA is concerned with the transfer of knowledge from a labeled source domain to an unlabeled target domain. This knowledge transfer aims to enhance the model’s performance in the target domain by leveraging insights gained from the source domain. Figure 1 provides a graphical representation of the UDA process for person reID. The schematic representation illustrates the adaptation process, which involves a feature extractor and classifier trained in a supervised manner. This adaptation leverages domain-invariant reID knowledge to generate pseudo-labels from target domain images.

Among various UDA methods, clustering-based UDA [10,11,12] has gained popularity over time due to its ability to group similar-looking individuals together. This approach utilizes a pre-trained model to assign pseudo labels to target domain images, which are then employed for clustering. The advantage of clustering-based UDA is its ability to reduce the impact of noisy pseudo labels, a common challenge in conventional clustering-based UDA methods [10].

Fig. 1

A schematic representation of the Unsupervised Domain Adaptation process in person reID. The adaptation involves a feature extractor and classifier trained in a supervised manner, and it leverages domain-invariant reID knowledge to generate pseudo-labels from target domain images. The person images are taken from the Market1501 dataset [8], and for privacy reasons, the faces have been masked

One of the fundamental challenges in UDA models is handling label noise in the form of inaccuracies in the pseudo labels generated for target domain data. Several recent studies have suggested different strategies to address this issue. Some approaches focus on refining pseudo labels to enhance their accuracy [13, 14], while others emphasize learning from soft pseudo labels [13, 15]. Furthermore, some efforts have been made towards developing noise-resilient methods [16, 17] to counter the adverse effects of inaccurate pseudo labels on reID model performance.

The motivation behind this study is to address the persistent issue of noisy pseudo labels in UDA for person reID. Noisy labels can significantly degrade the performance of reID models by introducing inaccuracies that lead to incorrect model learning, reducing the model’s ability to generalize across different domains. These inaccuracies stem from the erroneous assignment of labels during the pseudo-labeling process, which can cause the model to learn incorrect associations and patterns. To overcome this problem, we propose an Unsupervised Dual-Teacher Knowledge Distillation (UDKD) learning scheme that combines collaborative noise-rectification and learning from soft pseudo labels. This approach aims to enhance the robustness and accuracy of reID models in unsupervised settings by mitigating the adverse effects of noisy labels.

The workflow of the proposed method consists of three phases, namely teacher training (3.1), clustering (3.2), and knowledge distillation (3.3). For better clarity, the contributions of this paper can be summarized as follows:

  1. An Unsupervised Dual-Teacher Knowledge Distillation (UDKD) method is proposed to enhance the robustness of person reID models.

  2. An enhancement to the baseline model is suggested, with a modified backbone architecture and feature learning technique.

  3. To further improve robustness in person reID tasks, soft pseudo-labels from the teachers are utilized, contributing to a more effective knowledge distillation process.

  4. A modified soft triplet loss for UDKD is incorporated to optimize sample discrepancies and positive distances.

  5. A comprehensive experimental evaluation is conducted on various benchmark datasets, and the proposed scheme shows superior performance compared to state-of-the-art approaches.

The subsequent sections of this paper are structured as follows. Section 2 provides a comprehensive overview of recent advancements in cluster-based domain adaptive person reID. In Section 3, the proposed method is extensively elucidated, highlighting its key components and underlying principles. The experimental setup and specific details are outlined in Section 4. Section 5 presents the results of a thorough testing and evaluation of the proposed scheme and its various components, along with a comparative analysis between our method and state-of-the-art approaches in UDA for person reID. Finally, Section 6 discusses the findings and Section 7 concludes the paper, summarizing the key results and outlining potential avenues for future research.

2 Related works

In the field of person reID, Unsupervised Domain Adaptation (UDA) techniques are crucial due to the absence of human identity information in real-world deployment scenarios. UDA leverages reID knowledge from labelled source domains to compensate for the lack of labelled data in target or deployment domains. Recent research has underscored the competitive performance of clustering-based UDA methods when compared to supervised approaches. However, the existence of inter-domain disparities introduces noise into pseudo labels generated by pre-trained networks, impacting overall model performance. To address this challenge, we draw inspiration from existing works in the domain of person reID, clustering-based domain adaptation, and knowledge distillation.

In clustering-based Domain Adaptation approaches, [18] introduced a novel method for matching individuals across camera views by using spatiotemporal sequences. Furthermore, [19] improved pseudo-label accuracy through hierarchical clustering with hard-batch triplet loss. These developments have laid the foundation for research in this domain.

To tackle noisy pseudo labels, several strategies have been proposed. [13] introduced the Mutual Mean-Teaching (MMT), which employs a mutual teaching strategy for pseudo-label refinement. Similarly, [11] presented a two-branch architecture optimized for classification and metric learning to adapt to target domains, emphasizing domain adaptation and metric learning techniques.

A range of label refinement approaches, including Group-aware Label Transfer (GLT) by [20], probabilistic uncertainty-guided progressive label refinery (P\(^2\)LR) by [21], and Noise Resistible Network (NRNet) by [22], have been suggested to tackle noisy pseudo labels. These approaches collectively address challenges related to intra-camera similarity, self-discrepancy, and feature distribution noise in different domains.

Knowledge distillation in UDA has been exemplified by methods such as Moving Semantic Transfer Network and Progressive Feature Alignment Network, both of which focus on enhancing semantic understanding and feature alignment, reducing domain divergence, and improving representation learning capabilities. Moreover, the Adversarial Double Mask based Pruning (ADMP) method by [23] employs advanced adversarial learning techniques to eliminate unnecessary features and enhance feature alignment in domain adaptation scenarios. These approaches underline the importance of leveraging knowledge from diverse sources to narrow domain gaps.

A major challenge in person re-identification (reID) is the need for extensive data annotation. The Color Prompting (CoP) method [24] addresses this issue by generating pseudo-supervision messages to facilitate data-free, continual, unsupervised domain adaptive reID. This approach leverages colour prompts to support continual learning and adaptation, effectively managing the domain shift over time. Similarly, the Adaptive Memorization with Group labels (AdaMG) framework [25] emphasizes creating comprehensive sample descriptions and managing noisy pseudo labels in unsupervised reID tasks. By integrating group labels, AdaMG enhances model adaptability and robustness, resulting in improved performance under varying domain conditions.

In the context of multi-view, multi-person 3D pose estimation, the problem of domain shift is addressed using unsupervised domain adaptation with a dropout discriminator [26]. This method improves the accuracy of 3D pose estimation by aligning features across different views and individuals, significantly enhancing pose estimation robustness and generalization capabilities. Additionally, a global-local transformer-based framework for unsupervised reID [27] employs a multi-branch structure to learn robust features from pedestrian images. This approach highlights the effectiveness of transformer architectures in capturing both global and local features essential for accurate reID across diverse datasets and environments. Collectively, these studies demonstrate significant advancements in unsupervised domain adaptation techniques, encompassing innovative labelling strategies and transformative model architectures to enhance the reliability and applicability of reID systems.

Despite commendable efforts in addressing challenges related to clustering-based domain adaptation, pseudo-label generation, and knowledge distillation for person reID in existing literature, several persistent limitations remain. The presence of noisy pseudo-labels continues to affect model performance due to inter-domain disparities. Current strategies for pseudo-label refinement, although advanced, still suffer from issues like intra-camera similarity, self-discrepancy, and feature distribution noise. Additionally, extensive data annotation remains a significant challenge, particularly for unsupervised methods. Domain shifts in multi-view, multi-person 3D pose estimation and capturing both global and local features in transformer-based frameworks also present ongoing difficulties. These gaps underscore the necessity for more robust and effective solutions, such as our proposed UDKD scheme, to advance the field of unsupervised domain adaptation for person reID.

In response to this challenge, our UDKD scheme takes a novel approach by integrating the outputs of two teachers. This integration aims to minimize the impact of noisy pseudo-labels generated by one teacher, leveraging the complementary guidance provided by the other. An integral aspect of our UDKD method is the adoption of soft pseudo-labels, where the predicted probability distribution across all classes is considered for training the student network. This not only contributes to the robustness of the learning process but also addresses the issue of noisy pseudo-labels. To effectively handle these soft labels, we introduce a modified soft triplet loss into the scheme. This nuanced strategy collectively addresses the challenge of noisy pseudo-labels, thereby enhancing the efficacy of our proposed method in advancing unsupervised domain adaptation (UDA) for person reID. The subsequent section, ‘Proposed Methodology’, delves into the intricate details of our approach, elucidating its components and demonstrating their impact on refining pseudo-labels in the domain adaptive setting.

3 Proposed methodology

Unsupervised domain adaptation is the process where data from one domain \(D_s = \{(x_i^s,y_i^s)\}_{i=1}^{m}\) is used to train a model that can be applied to another domain \(D_t = \{x_i^t\}_{i=1}^{n}\), where \(x^s\) and \(x^t\) denote person images in the source and target domains, respectively, and m and n represent the number of images in the source and target domains. The goal is to learn a model that can effectively generalize to the new domain by predicting labels \(y^{\sim t}\) for every target image \(x^t\), even though no labelled data is available for that domain. This can be accomplished by either transferring knowledge from the source domain to the target domain or learning a model invariant to changes in the domain.

Clustering-based UDA involves pre-training the deep neural network \(F(\cdot |\theta )\) on the source domain in order to encode the features \(\{F(x_i^s|\theta )\}_{i=1}^{m}\), followed by re-tuning the network parameters \(\theta \) so that the encoded features \(\{F(x_i^t|\theta )\}_{i=1}^{n}\) transfer to the target domain.

The present work introduces an Unsupervised Dual-Teacher Knowledge Distillation (UDKD) learning scheme, aiming to ameliorate the challenges posed by the generation of pseudo labels during the clustering process. These pseudo-labels are produced to categorize data points based on their similarity, thus serving as a form of weak supervision. However, due to the inherent noise and imprecision associated with the clustering procedure, these labels may exhibit inaccuracies.

To mitigate the influence of such noise, the approach at hand capitalizes on the deployment of two independent classifiers. An averaging method is employed to minimize the potential impact of outliers within the pseudo labels, leading to the refinement of these labels and rendering them more dependable and robust. Subsequently, these refined soft pseudo labels assume a pivotal role in training a smaller network in a supervised manner, thereby augmenting the overall robustness of UDKD.

The methodology builds upon the combined architecture of the stronger baseline of Luo et al. [28] and the Group-aware Label Transfer network of Zheng et al. [29] as the initial method. This baseline forms the foundation for subsequent modifications, introduced in two critical phases: Teacher Training (3.1) and Clustering (3.2). In Phase 1, we enhance the baseline by integrating two teacher models, denoted as \(C_1\) and \(C_2\). These teachers undergo fine-tuning on the source domain, adding discriminative power to the original baseline. In Phase 2, a clustering mechanism is introduced to adapt the model to the target domain, refining features through k-means clustering. Architectural details of these modifications, along with a rationale and a comparative analysis against the original stronger baseline, are provided in Appendix A. This approach aims to capitalize on the baseline’s strengths while tailoring it to the specific demands of unsupervised domain adaptation for person reID.

For a comprehensive exposition of the proposed UDKD, inclusive of its procedural intricacies pertaining to training and knowledge distillation, refer to the end-to-end workflow diagram depicted in Fig. 2. In phase 1 (represented in blue and P1), image samples \(x^s\) of the source domain \(D_s\) are used to train the parameters \(\theta _1\) and \(\theta _2\) of classifiers \(C_1^s\) and \(C_2^s\) by using the source domain labels \(y^s\) and loss function \(\mathcal {L}^s\). In phase 2 (represented in green and P2), the parameters of the same classifiers are re-tuned by using k-means clustering and loss function \(\mathcal {L}^c\) for every image sample \(x^t\) of the target domain \(D_t\). In the last phase (represented in red and P3), the parameters \(\theta _3\) of the third classifier \(C^t\) are tuned by using the combined output \(Y^{\sim t}\) of \(C_1^s\) and \(C_2^s\) and loss function \(\mathcal {L}_{st}\) for the target domain \(D_t\). Additionally, the flowchart presented in Fig. 3 offers a step-by-step depiction of the sequential processes involved in the UDKD methodology, covering the Dual-Teacher Training, Clustering, and Knowledge Distillation phases with their corresponding inputs, processes, and outputs. Algorithm 1 succinctly encapsulates the essence of the training and knowledge distillation procedures for reference.

Fig. 2

End-to-end workflow diagram of the proposed UDKD learning scheme. In phase 1 (represented in blue and P1), image samples \(x^s\) of the source domain \(D_s\) are used to train the parameters \(\theta _1\) and \(\theta _2\) of classifiers \(C_1^s\) and \(C_2^s\) by using the source domain labels \(y^s\) and loss function \(\mathcal {L}^s\). In phase 2 (represented in green and P2), the parameters of the same classifiers are re-tuned by using k-means clustering and loss function \(\mathcal {L}^c\) for every image sample \(x^t\) of the target domain \(D_t\). In the last phase (represented in red and P3), the parameters \(\theta _3\) of the third classifier \(C^t\) are tuned by using the combined output \(Y^{\sim t}\) of \(C_1^s\) and \(C_2^s\) and loss function \(\mathcal {L}_{st}\) for the target domain \(D_t\)

Fig. 3

Flowchart of the proposed Unsupervised Dual-Teacher Knowledge Distillation (UDKD) methodology. This flowchart provides a visual representation of the sequential steps involved in the UDKD process, including Dual-Teacher Training, Clustering, and Knowledge Distillation phases. Each phase is depicted with its corresponding input, processes, and output, highlighting the comprehensive approach of UDKD for effective unsupervised domain adaptation in person re-identification

Algorithm 1

Unsupervised Dual-Teacher Knowledge Distillation (UDKD) training algorithm.

3.1 Phase 1: Dual-Teacher Training

UDKD begins by training two different CNNs separately with the source data \(D_s\) to model two feature transformation functions, \(F(.|\theta _1)\) and \(F(.|\theta _2)\). Here, each input sample \(x_i^s\) is transformed into two feature representations \(y_i^{\sim s1}\) and \(y_i^{\sim s2}\). Using these feature representations, the reID classifiers \(C_1^s\) and \(C_2^s\) produce two m-dimensional probability vectors corresponding to the predicted identities of the source domain. Here, m is the number of classes or identities present in the source domain. Using a combination of a triplet loss function \(\mathcal {L}^s_{trip}\) and an identity loss function \(\mathcal {L}^s_{id}\), the CNNs are optimized to distinguish features belonging to distinct identities.

$$\begin{aligned} \mathcal {L}^s(\theta ) = \mathcal {L}_{id}^s(\theta )+ (\lambda ^s \times \mathcal {L}_{trip}^s(\theta )) \end{aligned}$$
(1)
$$\begin{aligned} \mathcal {L}_{id}^s(\theta ) = - \frac{1}{n} \sum ^{n}_{i=1} y_i^{s} \log (y_i^{\sim s}) \end{aligned}$$
(2)
$$\begin{aligned} \mathcal {L}^s_{trip}(\theta )=\frac{1}{n} \sum _{i=1}^{n} \max (0,||y_i^{\sim s}-y_{i,p}^{s}|| +\chi -||y_i^{\sim s}-y_{i,n}^{s}||) \end{aligned}$$
(3)

Where \(y^s_{i,p}\) and \(y^s_{i,n}\) represent positive and negative identities in each mini-batch, \(\mathcal {L}_{id}^s\) is the cross-entropy identity loss of (2), \(||\cdot ||\) represents the Euclidean distance, \(\chi \) is the margin, and \(\lambda ^s\) is the weighting parameter for the source domain losses.

In summary, in phase 1 of UDKD, the classifiers \(C^s_1\) and \(C_2^s\) are trained with the images \(x_i^s\) and ground truth \(y_i^s\) of the source domain \(D_s\) to produce the predictions \(y_i^{\sim s1}\) and \(y_i^{\sim s2}\) for each \(x_i^s\). The loss functions (\(\mathcal {L}_{id}^s\), \(\mathcal {L}_{trip}^s\)) are then used to fine-tune the weights \(\theta _1\) and \(\theta _2\) to enhance the performance.
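To make the phase 1 objective concrete, the listing below gives a minimal PyTorch sketch of the combined source-domain loss in (1)-(3). The batch mining of hardest positives and negatives follows the strong-baseline practice of [28]; all function and argument names here are illustrative assumptions rather than the exact implementation.

```python
# A minimal sketch of the Phase 1 source-domain objective (eqs. 1-3).
# logits:   (B, m) class scores from C_1^s or C_2^s
# features: (B, d) embeddings; pos/neg are the hardest positive and
#           negative embeddings mined within the mini-batch.
import torch
import torch.nn.functional as F

def source_loss(logits, features, labels, pos, neg,
                lambda_s=1.0, margin=0.3):
    # Identity loss: cross-entropy over the m source identities (eq. 2)
    l_id = F.cross_entropy(logits, labels)
    # Triplet loss with Euclidean distances and margin chi (eq. 3)
    d_pos = (features - pos).norm(dim=1)
    d_neg = (features - neg).norm(dim=1)
    l_trip = F.relu(d_pos + margin - d_neg).mean()
    # Combined objective (eq. 1)
    return l_id + lambda_s * l_trip
```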

3.2 Phase 2: Clustering

After the two teacher networks are trained, the classifiers \(C_1^s\) and \(C_2^s\) are used to generate soft pseudo-labels for the k image samples \(x_{i}^t\) present in the target domain \(D_t\). Each classifier generates an n-dimensional probability vector for each \(x^t_i\), in which each element represents the probability of the image \(x^t_i\) belonging to the pseudo identity \(y_i^{\sim t1}\) or \(y_i^{\sim t2}\) (i.e. \(P(x_i^t \in y_i^{\sim t1})\) and \(P(x_i^t \in y_i^{\sim t2})\)). Further, the k-means clustering algorithm is used to generate the cluster centroids \((M_1, M_2,..., M_n)\). Here, the parameters \(\theta _1\) and \(\theta _2\) of \(C^s_1\) and \(C_2^s\) are re-tuned using the \(\mathcal {L}_c\) loss function.

$$\begin{aligned} \mathcal {L}_c = \sum _{i=1}^k \sum _{j=1}^n || x_i - M_j ||^2 \end{aligned}$$
(4)
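The listing below sketches this clustering step, assuming the teacher embeddings for all target images have already been extracted; the use of scikit-learn’s KMeans and the helper name are illustrative.

```python
# A sketch of the Phase 2 clustering step: target-domain features are
# grouped with k-means, and the cluster assignments act as pseudo ids.
import numpy as np
from sklearn.cluster import KMeans

def cluster_pseudo_labels(target_features, n_clusters=500, seed=0):
    """target_features: (k, d) array of teacher embeddings for x^t."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    pseudo_ids = km.fit_predict(target_features)  # one cluster id per image
    centroids = km.cluster_centers_               # M_1, ..., M_n in eq. 4
    return pseudo_ids, centroids
```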

3.3 Phase 3: Knowledge Distillation

In the final phase, the target domain image samples \(x^t\) are fed to the classifiers \(C_1^s\) and \(C_2^s\) to generate two probability vectors (\(y^{\sim t1}_i\) and \(y^{\sim t2}_i\)). Subsequently, these two vectors are combined using an averaging function \(\mu \) (as defined in (5)). The rationale behind this process lies in harnessing the strengths of multiple models to create a more robust ensemble. By leveraging the diverse perspectives and learned representations of each classifier, the ensemble aims to enhance overall performance, utilizing the complementary capabilities of the individual models. This ensemble-based approach contributes to improved accuracy and generalization across the target domain, ultimately achieving superior results in the context of unsupervised domain adaptation for person reID. The following reasons justify combining the outputs of the teacher models in an ensemble:

  • Improved Prediction Accuracy: Combining the predictions of multiple models helps mitigate variance and enhance the overall accuracy of the ensemble. This proves beneficial, especially when individual models exhibit high bias or susceptibility to overfitting.

  • Increased Robustness: Utilizing multiple models contributes to the ensemble’s robustness against noise and outliers in the data. The presence of diverse models ensures that the impact of any single model is diluted, making the ensemble more resilient to erratic data patterns.

  • Leveraging Complementary Models: Different models may excel at predicting distinct facets of the data. Integrating predictions from these complementary models allows the ensemble to exploit their individual strengths, leading to an overall improvement in performance by capturing a broader spectrum of data characteristics.

  • Enhanced Generalization Ability: Training multiple models on different subsets of the data and amalgamating their predictions enhances the ensemble’s generalization ability. This approach enables the ensemble to grasp a wider range of patterns and relationships present in the data, ultimately contributing to superior performance across diverse scenarios.

$$\begin{aligned} Y_i^{\sim t} = \bigcup _{j=1}^n \mu (P(x_i^t \in y_j^{\sim t1}),P(x_i^t \in y_j^{\sim t2})) \end{aligned}$$
(5)

In subsection 5.2, the simple averaging method \(\mu _{sa}\), as well as the weighted averaging method \(\mu _{w}\), are tested in different scenarios. We have considered the mAP and R1 scores of each teacher network \(C_1^s\) and \(C_2^s\) as weights in the weighted averaging method (\(w=m\) and \(w=r\)). In conclusion, we utilize the mAP-weighted averaging method \(\mu _{w=m}\) as a final approach because it performs better than the simple and R1-weighted averaging methods in each scenario.

Simple Averaging Method:

$$ \mu _{sa} (y_i^{\sim t1}, y_i^{\sim t2}) = \frac{y_i^{\sim t1} + y_i^{\sim t2}}{2} $$

Weighted Averaging Method (with \(w_1\) and \(w_2\) as weights):

$$\begin{aligned} \mu _{w} (y_i^{\sim t1}, y_i^{\sim t2}) = \frac{w_1 \cdot y_i^{\sim t1} + w_2 \cdot y_i^{\sim t2}}{w_1 + w_2} \end{aligned}$$
(6)
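For illustration, both ensembling rules can be written as the short sketch below, where \(w_1\) and \(w_2\) stand for the teachers’ mAP (for \(\mu _{w=m}\)) or R1 (for \(\mu _{w=r}\)) scores; the function names are ours and not part of the original implementation.

```python
# A sketch of the two ensembling rules for teacher probability vectors.
import torch

def simple_average(p1, p2):
    # mu_sa: equal-weight mean of the two probability vectors
    return (p1 + p2) / 2.0

def weighted_average(p1, p2, w1, w2):
    # mu_w (eq. 6): e.g. w1, w2 = teachers' mAP scores for mu_{w=m}
    return (w1 * p1 + w2 * p2) / (w1 + w2)
```

Note that with \(w_1 = w_2\) the weighted rule reduces to simple averaging, so \(\mu _{sa}\) can be viewed as the special case of (6) with equal weights.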

Meanwhile, a comparatively smaller classifier \(C^t\) is fed the target sample image \(x_i^t\) to obtain the feature vector \(y_i^{\sim t}\). The loss function \(\mathcal {L}_{st}\) uses both \(y_i^{\sim t}\) and \(Y_i^{\sim t}\) to fine-tune the trainable parameters \(\theta _3\) of \(C^t\). In place of the triplet loss \(\mathcal {L}^s_{trip}\) used in the previous phase (3), a modified soft-triplet loss \(\mathcal {L}_{st}\) is used, because the classifier \(C^t\) deals with the soft labels \(Y_i^{\sim t}\), which are collections of probabilities and therefore cannot be handled by a standard triplet loss. In subsection 5.2, the proposed method is also tested with the triplet loss by computing \(max(Y_i^{\sim t})\).

The proposed soft triplet loss function \(\mathcal {L}_{st}\) is an adaptation of the original soft triplet loss \(\mathcal {L}_{trip}\) introduced in [13]. This adaptation addresses the complex demands of UDKD, which involves multiple teacher networks. The refined loss function incorporates a softmax function, ensuring its smooth and differentiable nature for effective backpropagation during model training.

$$ \tau _{i,p} = \exp (\chi -||y_i^{\sim t}-\mu _{w=m}( y_{i,p}^{\sim t1},y_{i,p}^{\sim t2})||) $$
$$ \tau _{i,n} = \exp (||y_i^{\sim t}-\mu _{w=m}( y_{i,n}^{\sim t1}, y_{i,n}^{\sim t2})||-\chi ) $$
$$\begin{aligned} \mathcal {L}_{st} (\theta _3) = \frac{1}{n} \sum _{i=1}^n \log (1+ \tau _{i,p}) + \log (1+\tau _{i,n}) \end{aligned}$$
(7)
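A direct PyTorch transcription of (7) is sketched below, assuming the ensembled vectors \(\mu _{w=m}(\cdot )\) for each anchor’s positive and negative samples have already been computed; the mining of positives and negatives is elided.

```python
# A transcription of the modified soft triplet loss L_st (eq. 7).
# feats: student outputs y~t; pos/neg: mAP-weighted teacher ensembles
# for the positive and negative samples of each anchor; chi: margin.
import torch

def soft_triplet_loss(feats, pos, neg, chi=0.3):
    d_pos = (feats - pos).norm(dim=1)   # ||y~t - mu(y_p~t1, y_p~t2)||
    d_neg = (feats - neg).norm(dim=1)   # ||y~t - mu(y_n~t1, y_n~t2)||
    tau_p = torch.exp(chi - d_pos)      # tau_{i,p}
    tau_n = torch.exp(d_neg - chi)      # tau_{i,n}
    # log(1 + tau) terms keep the loss smooth and differentiable
    return (torch.log1p(tau_p) + torch.log1p(tau_n)).mean()
```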

In the UDKD context, the soft triplet loss serves the dual purpose of minimizing the distance between an anchor and its positive samples while maximizing the distance between the anchor and its negative samples. This nuanced training objective is essential for acquiring a feature representation capable of effectively discriminating between similar and dissimilar samples, aligning seamlessly with the unique architecture of UDKD. Appendix B provides comprehensive details on the modification of the original triplet loss, offering an in-depth exploration of the adjustments made to suit the distinctive features and requirements of UDKD.

To fine-tune the network parameters of \(C^t\) (the student) in real time, the above procedures are carried out concurrently. For this particular task, online learning, or online knowledge distillation, is preferred to offline learning for several reasons. Firstly, online learning facilitates continuous improvement by perpetually refining the model as it remains updated with fresh data. Secondly, it enhances scalability and data flexibility by enabling the model to dynamically adjust to shifts in the data distribution as they occur. Additionally, online learning proves advantageous in handling larger and more dynamic datasets, as it does not require storing all the data at once, which also reduces computational costs. Moreover, it offers the benefit of real-time performance by updating the model on the fly with new data, which can be critical in time-sensitive applications. Importantly, it effectively addresses challenges such as concept drift, where the underlying distribution of the data changes over time, as well as non-stationary data, whose statistical properties evolve over time.

3.4 Model evaluation

In order to assess the effectiveness of the proposed UDKD approach, two prominent person reID datasets, namely DukeMTMC [30] and Market1501 [8], are utilized for evaluation purposes. Details of these datasets are provided in Table 1.

Table 1 Description of Market-1501 and DukeMTMC datasets

To evaluate and compare the performance of the UDKD model, two distinct combinations of source-to-target datasets are employed: Market-to-Duke and Duke-to-Market. Various evaluation metrics are utilized to gauge the model’s performance, including mean average precision (mAP) and the Cumulative Matching Characteristic (CMC) at ranks 1, 5, and 10. These metrics are commonly employed in the field of person reID research and serve as benchmarks for evaluating the efficacy of different models.
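For reference, the listing below sketches how the CMC rank-k and mAP scores can be computed from a query-gallery distance matrix. It assumes every query has at least one true match in the gallery and, for brevity, omits the same-camera filtering of the standard Market1501 evaluation protocol.

```python
# A sketch of CMC rank-k and mAP computation from a distance matrix.
import numpy as np

def cmc_map(dist, q_ids, g_ids):
    """dist: (num_query, num_gallery) distances; q_ids/g_ids: identities."""
    ranks = np.zeros(len(g_ids))
    aps = []
    for i in range(len(q_ids)):
        order = np.argsort(dist[i])          # gallery sorted by distance
        matches = g_ids[order] == q_ids[i]
        ranks[np.argmax(matches):] += 1      # count from first correct hit
        hits = np.where(matches)[0]          # positions of all true matches
        precisions = (np.arange(len(hits)) + 1) / (hits + 1)
        aps.append(precisions.mean())        # average precision per query
    return ranks / len(q_ids), float(np.mean(aps))

# Usage: cmc, mAP = cmc_map(dist, q_ids, g_ids); rank1 = cmc[0]
```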

4 Implementation details

4.1 Training and optimization details

In the experimental setup, we employ a Windows server machine equipped with an Intel Xeon Silver 4114 CPU operating at 2.20 GHz and 64 GB of RAM, together with a single NVIDIA Quadro RTX 8000 GPU with 48 GB of dedicated graphics memory. This computational infrastructure enables us to undertake both supervised pre-training on the source domain and unsupervised fine-tuning on the target domain.

During the training process, a mini-batch approach is adopted, where each batch consists of 64 images belonging to 16 distinct individuals. To ensure uniformity and compatibility within our network, all input images undergo a preprocessing step, wherein they are uniformly resized to dimensions of \(256 \times 128\) pixels.
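A torchvision sketch of this input pipeline is shown below; a \(P\times K\) identity sampler with \(P=16\) identities and \(K=4\) instances yields the 64-image mini-batches described above. The transform composition and helper names are illustrative assumptions.

```python
# A sketch of the preprocessing and batch composition described above.
import random
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((256, 128)),  # uniform resize to 256 x 128 pixels
    transforms.ToTensor(),
])

def pk_batch(index_by_id, P=16, K=4):
    """index_by_id: dict mapping identity -> list of image indices
    (each identity is assumed to have at least K images).
    Returns one mini-batch of P identities x K instances = 64 images."""
    ids = random.sample(list(index_by_id), P)
    return [i for pid in ids for i in random.sample(index_by_id[pid], K)]
```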

4.2 Hyperparameter tuning

The hyperparameter values were determined through a pragmatic trial-and-error method, considering hardware limitations. Due to computational constraints, a manual iterative approach was adopted, allowing for adjustments based on intuition and observed model performance. This heuristic method proved effective in achieving satisfactory results within the given limitations. Further details regarding the specific hyperparameter values used in the experiments are provided in Table 2.

Table 2 Training Parameters for Each Stage

4.3 Dual-teacher training

The teacher networks \(C^s_1\) and \(C^s_2\) employ the ResNet152 [31] and DenseNet169 [32] CNNs, respectively, which are pretrained on ImageNet [33]. These networks are trained independently using the image samples \(x^s\) and corresponding labels \(y^s\) from the source domain \(D_s\). The training procedure, as outlined in [29], involves fine-tuning the weights \(\theta _1\) and \(\theta _2\) over 100 iterations, each consisting of 30 epochs. An initial learning rate of 0.00035 is set, which decreases to 1/10 of its previous value every 10 epochs.
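The sketch below illustrates this dual-teacher setup with torchvision: ImageNet-pretrained ResNet152 and DenseNet169 backbones receive identity heads sized to the source domain, and a stepwise schedule decays the learning rate by a factor of 10 every 10 epochs. The optimizer choice and head construction are assumptions for illustration; 751 is the number of Market1501 training identities.

```python
# A sketch of the dual teachers C_1^s (ResNet152) and C_2^s (DenseNet169).
import torch.nn as nn
import torch.optim as optim
from torchvision import models

def build_teachers(num_ids):
    t1 = models.resnet152(weights="IMAGENET1K_V1")
    t1.fc = nn.Linear(t1.fc.in_features, num_ids)  # identity head of C_1^s
    t2 = models.densenet169(weights="IMAGENET1K_V1")
    t2.classifier = nn.Linear(t2.classifier.in_features, num_ids)  # C_2^s head
    return t1, t2

t1, t2 = build_teachers(num_ids=751)  # e.g. Market1501 as source domain
opt = optim.Adam(list(t1.parameters()) + list(t2.parameters()), lr=3.5e-4)
sched = optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.1)  # x0.1 / 10 epochs
```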

4.4 Clustering

After the pre-training process, the images \(x^t\) of the target domain are used as input to the teacher networks to generate n clusters of identities by deploying the k-means clustering algorithm. With a learning rate of 0.0004, 100 iterations of training, each consisting of 80 epochs, are executed to re-tune the weights \(\theta _1\) and \(\theta _2\). At the end, each cluster is treated as the soft pseudo-labels \(y^{\sim t1}\) and \(y^{\sim t2}\).

4.5 Knowledge distillation

Finally, \(C^s_1\), \(C^s_2\), and \(C^t\) are fed with the image samples \(x^t\) of \(D_t\) to generate the refined pseudo-labels \(y^{\sim t1}_i\) and \(y^{\sim t2}_i\). Further, \(\mu _{w=m}\) is calculated for each sample by using (6) to fine-tune the weights \(\theta _3\) of the student network (ResNet50) in an online manner. The training involves 400 iterations of 40 epochs with a 0.00035 learning rate.
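Putting the pieces together, the self-contained sketch below summarizes one online distillation step: the frozen teachers’ outputs are fused with the mAP-weighted average of (6), and the student is updated with \(\mathcal {L}_{st}\) of (7). The positive/negative index tensors and the mining strategy are assumptions for illustration.

```python
# A high-level sketch of one Phase 3 online distillation step.
import torch

def distill_step(student, t1, t2, x_t, pos_idx, neg_idx, opt,
                 w1, w2, chi=0.3):
    with torch.no_grad():                          # teachers stay frozen
        p1 = torch.softmax(t1(x_t), dim=1)         # y~t1 from C_1^s
        p2 = torch.softmax(t2(x_t), dim=1)         # y~t2 from C_2^s
        y_ens = (w1 * p1 + w2 * p2) / (w1 + w2)    # Y~t via mu_{w=m}, eq. 6
    y_s = torch.softmax(student(x_t), dim=1)       # student output y~t
    d_pos = (y_s - y_ens[pos_idx]).norm(dim=1)     # distances in eq. 7
    d_neg = (y_s - y_ens[neg_idx]).norm(dim=1)
    loss = (torch.log1p(torch.exp(chi - d_pos))
            + torch.log1p(torch.exp(d_neg - chi))).mean()  # L_st, eq. 7
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```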

Table 3 Experimental results of the proposed UDKD on Duke-to-Market and Market-to-Duke datasets with two kinds of loss functions (triplet loss \(\mathcal {L}^t_{trip}\) and soft-triplet loss \(\mathcal {L}_{st}\)), three types of averaging methods (simple averaging \(\mu _{sa}\), R1-weighted averaging \(\mu _{w=r}\), and mAP-weighted averaging \(\mu _{w=m}\)), and five different numbers of clusters (\(n = 500, 750, 1000, 1250, 1500\))

5 Experimental setup and results

This section aims to describe, analyze, and signify the different components of the proposed scheme, such as the type of loss, averaging methods, and the number of clusters. The proposed UDKD is evaluated using all possible combinations of two loss functions (triplet loss \(\mathcal {L}^t_{trip}\) and soft-triplet loss \(\mathcal {L}_{st}\)), three averaging methods (simple averaging \(\mu _{sa}\), R1-weighted averaging \(\mu _{w=r}\), and mAP-weighted averaging \(\mu _{w=m}\)), and five different numbers of clusters (\(n = 500, 750, 1000, 1250, 1500\)).

To facilitate a clearer comprehension, Table 3 illustrates the significance of each component. This table presents the experimental results of the proposed UDKD on Duke-to-Market and Market-to-Duke datasets with various configurations. The detailed breakdown includes performance metrics when employing the triplet loss \(\mathcal {L}^t_{trip}\) and soft-triplet loss \(\mathcal {L}_{st}\), alongside comparisons of the three averaging methods (simple averaging \(\mu _{sa}\), R1-weighted averaging \(\mu _{w=r}\), and mAP-weighted averaging \(\mu _{w=m}\)). Additionally, it explores the impact of different numbers of clusters, ranging from 500 to 1500, on the overall performance. These evaluations underscore the relative importance and contribution of each component to the efficacy of the UDKD scheme, highlighting the nuanced improvements brought by specific configurations.

Fig. 4
figure 4

Comparison of the significance of soft-triplet loss over triplet loss with fixed \(n=500\) and \(\mu _{w=m}\). This figure illustrates the performance improvement achieved using soft-triplet loss compared to triplet loss in the context of a fixed number of clusters (\(n=500\)) and using the mAP-weighted averaging method (\(\mu _{w=m}\))

5.1 Soft triplet loss analysis

To investigate the necessity of the soft-triplet loss \(\mathcal {L}_{st}\), we compared the model’s performance with and without it, keeping the number of clusters (\(n=500\)) and the pseudo-label averaging method (\(\mu _{w=m}\)) constant (Fig. 4). In the context of the Market-to-Duke dataset, we observed an increase of 6.15% in mAP and a significant 4.64% in the rank 1 score. Similarly, on the Duke-to-Market dataset, there is a 3.83% and 4.68% increase in mAP and rank 1 score, respectively. Based on these observations, we adopted the soft triplet loss during the knowledge distillation process.

Fig. 5
figure 5

Comparison of the significance of \(\mu _{w=m}\) over \(\mu _{w=r}\) and \(\mu _{sa}\) with soft-triplet loss and \(n=500\). This figure demonstrates the superiority of mAP-weighted averaging (\(\mu _{w=m}\)) over other averaging methods (\(\mu _{w=r}\) and \(\mu _{sa}\)) in conjunction with soft-triplet loss, particularly in the scenario where the number of clusters is fixed at \(n=500\)

5.2 mAP-weighted averaging evaluation

Selecting an effective method for combining pseudo labels becomes paramount when dealing with diverse teacher networks exhibiting varying performance on the target domain. A simplistic equal-weights (\(\mu _{sa}\)) approach falls short, prompting the use of weighted averaging. Here, each teacher’s output is assigned a specific weight, considering its individual performance.

Our mAP-weighted averaging method, denoted as \(\mu _{w=m}\), plays a pivotal role in refining pseudo labels within our UDKD learning scheme. Notably, we incorporate the mAP (\(\mu _{w=m}\)) and R1 (\(\mu _{w=r}\)) scores of each teacher network as their respective weights in the averaging process.

To underscore the significance of \(\mu _{w=m}\), we conducted comparative evaluations with alternative averaging methods (\(\mu _{w=r}\) and \(\mu _{sa}\)), utilizing \(\mathcal {L}_{st}\) as the loss function and maintaining a constant cluster number (\(n=500\)). Results, based on the Duke-to-Market dataset, reveal a 7.07% and 0.54% increase in mAP scores compared to the simple average and R1-weighted average, respectively. Similarly, in the Market-to-Duke dataset, a remarkable increase of 10.37% and 3.61% in mAP is observed with the mAP-weighted average over the other two methods. These findings underscore the efficacy of our chosen \(\mu _{w=m}\) method. Refer to Fig. 5 for a detailed barplot comparison.

5.3 Impact of cluster number

Selecting a large number of clusters in clustering algorithms can engender overfitting, whereby the algorithm models the intrinsic noise within the data rather than discerning the underlying patterns. This can yield clusters that are excessively specific and lack generalizability to novel data instances. Conversely, opting for a low number of clusters may result in underfitting, where the algorithm fails to capture all pertinent patterns present in the data. Consequently, the produced clusters may prove excessively broad and inadequately specific. Thus, it is imperative to ascertain the appropriate number of clusters or employ methodologies such as the elbow method and silhouette score to determine the optimal number of clusters. By leveraging these techniques, one can navigate the challenge of cluster selection, thereby enhancing the fidelity and generalizability of the clustering results.
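As an illustration of such a selection procedure, the sketch below scores candidate cluster counts with the silhouette coefficient on target-domain features; the candidate values mirror those evaluated in this section, and sub-sampling (assuming at least 5000 samples) is used purely to keep the computation tractable.

```python
# A sketch of selecting the cluster count via the silhouette score.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_n_clusters(features, candidates=(500, 750, 1000, 1250, 1500)):
    """features: (k, d) array of target-domain embeddings."""
    scores = {}
    for n in candidates:
        labels = KMeans(n_clusters=n, n_init=10,
                        random_state=0).fit_predict(features)
        # Higher silhouette = tighter, better-separated clusters
        scores[n] = silhouette_score(features, labels,
                                     sample_size=5000, random_state=0)
    return max(scores, key=scores.get), scores
```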

To establish a baseline, [13] opted for 500 clusters in their research, guiding our initial evaluation and ensuring a meaningful comparison. Our model underwent systematic testing across different cluster counts, including \(n=500, 750, 1000, 1250,\) and 1500, utilizing the soft-triplet loss with mAP-weighted averaging. On the Market-to-Duke dataset, notable distinctions of 3.02% and 2.34% were observed in mAP and R1 scores, respectively, between \(n=500\) and \(n=750\). The Duke-to-Market dataset revealed an even more pronounced contrast, with differences of 4.21% and 6.77% for mAP and R1 scores, respectively. Given these insights, our model settles on \(n=500\) clusters. Figure 6 visually summarizes the detailed comparison through a barplot.

Fig. 6
figure 6

Comparison of the significance of \(n=500\) over \(n=750, 1000, 1250,\) and 1500 with soft-triplet loss and \(\mu _{w=m}\). The x-axis represents evaluation metrics, and the y-axis represents the percentage (60-100%). This figure illustrates the impact of varying the number of clusters on performance metrics when using soft-triplet loss and the mAP-weighted averaging method

Table 4 The performance of the proposed UDKD compared with current state-of-the-art methods for UDA for person reID. Bold values indicate the highest score

5.4 Comparative analysis with state-of-the-art methods

The evaluation of the proposed method is conducted by comparing its performance with state-of-the-art UDA methods for person reID on the Duke-to-Market and Market-to-Duke benchmarks. The assessment results, including the mAP, Rank 1, Rank 5, and Rank 10 scores for both settings, are provided in Table 4. Notably, the experimental findings clearly demonstrate the superior performance of the proposed method across all evaluation metrics, significantly surpassing existing approaches.

6 Discussion

The results from our experiments show that the proposed Unsupervised Dual-Teacher Knowledge Distillation (UDKD) approach improves the performance of person re-identification (reID) systems across different domains. By using dual-teacher networks pre-trained on a source domain and incorporating soft-triplet loss and mAP-weighted averaging, we observed significant improvements in both mean Average Precision (mAP) and Cumulative Matching Characteristic (CMC) metrics. The dual-teacher framework, involving ResNet152 and DenseNet169, allows the student network, ResNet50, to learn diverse and robust features, leading to better generalisation to the target domain.

A key finding of this study is the effectiveness of the soft-triplet loss function over the traditional triplet loss. The soft-triplet loss provides a softer margin between positive and negative pairs, leading to more stable and effective training, which is particularly beneficial in an unsupervised setting where ground truth labels are absent. Additionally, the mAP-weighted averaging strategy for combining teacher network outputs significantly enhances the quality of pseudo-labels used for training the student network. This method outperforms other averaging techniques by prioritising the most relevant features, thereby improving the model’s adaptability to new domains.

The number of clusters used for pseudo-label generation also plays a crucial role. Our experiments indicate that setting the number of clusters to 500 strikes a balance between intra-cluster compactness and inter-cluster separation, resulting in more accurate and meaningful pseudo-labels. When compared to other state-of-the-art unsupervised domain adaptation methods for person reID, the UDKD approach shows superior performance across various evaluation metrics. This highlights the robustness and effectiveness of our method, demonstrating its potential for practical applications in real-world reID systems.

One of the strengths of the UDKD approach is its ability to operate effectively in an unsupervised manner. By leveraging knowledge from pre-trained networks and generating high-quality pseudo-labels, the method reduces the reliance on labelled data, which is often scarce and expensive to obtain. Furthermore, the use of dual-teacher networks ensures that the student network benefits from diverse perspectives, leading to more comprehensive feature learning.

However, there are limitations to this study. Hardware constraints limited the extent of hyperparameter tuning, which may have affected the model’s performance. Future work could explore more extensive hyperparameter optimisation and apply UDKD to a broader range of datasets and domains. Additionally, while this implementation uses two specific teacher networks, examining different combinations of teacher architectures could provide further insights into optimising the dual-teacher framework. Exploring other advanced clustering algorithms and loss functions could also enhance the model’s performance.

7 Conclusion and future work

This paper presented an efficient UDKD learning scheme designed specifically for UDA in person reID tasks. The proposed method addresses the challenge of noisy pseudo labels by leveraging the outputs of two large networks to minimize misclassifications and by training a smaller classifier using the knowledge distilled from both networks.

Through an extensive experimental investigation, we have demonstrated the significant contributions of two key components: the mAP-weighted average method and the soft-triplet loss method. The mAP-weighted average method effectively combines the predictions of the dual teacher networks, providing a robust and reliable output for UDKD. Additionally, the modified soft-triplet loss facilitates improved discrimination between classes and enhances the discriminative capabilities of the smaller classifier.

The UDKD has exhibited superior performance compared to the current state-of-the-art UDA methods in Market-to-Duke and Duke-to-Market domain adaptation tasks. The comprehensive evaluation of the proposed approach demonstrates its effectiveness in achieving higher accuracy in person identification results. The proposed method achieves an mAP of 84.57 and 73.32, and Rank 1 scores of 94.34 and 88.26 for Duke to Market and Market to Duke scenarios, respectively. These improvements underscore the efficacy of UDKD in advancing UDA techniques for person reID, highlighting its potential to enhance performance and robustness in real-world applications.

However, to further enhance the performance of UDKD, future research should prioritize addressing the challenges associated with noisy labels generated by clustering techniques. By mitigating the impact of noisy labels, the performance of UDKD can be elevated to a level comparable to that of fully supervised models.

Moreover, as an additional consideration, exploring the potential and effects of incorporating more than two teacher networks into the UDKD could provide valuable insights. Investigating the benefits and trade-offs of increasing the number of teacher networks could lead to improved knowledge distillation techniques and further advancements in UDA for person identification.