1 Introduction

Person re-identification (Re-ID) is to search for target pedestrians according to given query images across non-overlapping cameras. It has attracted widespread attention due to its wide applications in cross-camera tracking, video surveillance, etc [1,2,3]. Encouraged by deep learning, supervised Re-ID has obtained great success [4,5,6,7]. However, these methods will have a significant drop when they face a new database. In addition, supervised Re-ID requires abundant labeled pedestrian data, and its cost is expensive. Hence, supervised Re-ID is difficult to satisfy the requirement of real-world scenarios.

To alleviate the scalability challenge, researchers attempt to study unsupervised methods. Most of them utilize the labeled source domain and unlabeled target domain to optimize deep models, treated as unsupervised domain adaptation (UDA) methods. Some works leverage GAN to transform the style from the source domain to the target domain for bridging the domain gap [8, 9]. Some other methods treat each image as an individual identity and learn the pedestrian invariance [10, 11]. In addition, several state-of-the-art methods employ clustering algorithms to generate pseudo-labels and translate them into a supervised learning problem [12, 13]. But, UDA requires leveraging the prior knowledge to optimize Re-ID models and lacks flexibility. It is closely related to the domain gap and needs to select a suitable source domain.

Fully unsupervised person Re-ID aims to mine robust features by no labeled image data. A few existing methods learn the softened label distribution or design the multi-label classification loss to optimize deep models [14, 15], which obtains the significant improvement. But, these methods are difficult to identify pedestrians with significant variation due to lack of effective supervision information. Inspired by UDA, some approaches employ clustering algorithms to produce pseudo-labels. However, the clustering algorithm depends on discriminative pedestrian features. As shown in Fig. 1a, the initial feature distribution has poor pedestrian information. Simply utilizing the clustering generates low-quality labels and cannot learn effective features. They still have a significant performance gap with supervised learning or UDA methods.

In the paper, we propose a novel Hybrid Feature Constraint Network (HFCN) to adequately implement the feature constraint and provide meaningful supervision information without any labeled identities for the Re-ID model. To this end, we define the feature constraint loss to force near-neighbor features to have high similarity and decrease the similarity among different pedestrian features as shown in Fig. 1b. As a result, pedestrian features with the same identity have the similar feature distribution and different pedestrians possess large distribution gaps as shown in Fig. 1c. Afterward, we design the multi-task operation with the iterative update for clustering algorithm to adequately employ predicted labels and identify complex samples. Finally, we integrate the feature constraint loss and multi-task operation to jointly optimize the Re-ID model. This could stimulate the clustering algorithm to produce high-quality labels, and in turn, effective labels provide effective supervision signals for the deep model. The paper has the following contributions:

(1) We define the feature constraint loss to restrict the feature distribution so that pedestrian features with the same identity have the similar distribution and different pedestrians possess large distribution gaps.

(2) The proposed HFCN successfully integrates the feature constraint loss and multi-task operation process. This facilitates the clustering algorithm to generate high-quality labels and provides effective supervision information for the Re-ID model.

(3) Extensive experiments prove that our method is valuable for fully unsupervised Re-ID and exceeds the state-of-the-art methods on mainstream Re-ID databases.

Fig. 1
figure 1

a Original feature distribution is represented as different circles. b The optimization method for example A is presented. The near-neighbor similar features are obtained inside circle \(C_1\), and the top-n dissimilar features are extracted outside circle \(C_2\). The similarity of these near-neighbor features is further enhanced, and the similarity of the top-n features is further reduced. c The optimization result is visualized by our optimization strategy

2 Related work

2.1 Supervised person Re-ID

With the renaissance of neural networks [16,17,18,19,20,21], deep learning has become the mainstream technology and dominated the Re-ID community to optimize objective features [22, 23]. Most Re-ID methods mainly focus on supervised learning by obtaining abundant annotated data [24,25,26,27,28,29]. For extracting discriminative features, some methods explore the similarity relationship among pedestrian images and constrain the spatial distribution of features. For example, Chen et al. [30] take into account the local similarity and dependencies of global features to optimize the deep model. Luo et al. [25] construct sample pairs according to pedestrian identities and employ the triplet loss and center loss to constrain feature distributions.

The above methods suffer significant performance degradation while applied to a new domain. In our paper, we abandon annotated pedestrian labels to focus on unsupervised person Re-ID. We explore the feature distribution among different pedestrians to mine meaningful pedestrian information.

2.2 Unsupervised domain adaptation

Unsupervised domain adaptation (UDA) methods leverage the labeled source domain data and unlabeled target domain data to optimize deep models [31,32,33]. Some works implement the style transformation operation from one scenario to the other for aligning the feature distribution. Wei et al. [8] design an adversarial network to transfer the pedestrian style from one database to the other database so as to decrease the domain gap. Liu et al. [31] propose the Adaptive Transfer Network (ATNet) to learn three critical styles (illumination, resolution, and camera) for addressing the domain discrepancy. Some approaches study the intra-domain invariance to mine robust features. For example, Zhong et al. [10] consider each image as a class and require the nearest neighbor of each image to have the same label by assigning weights. In [34], GAN is utilized to generate positive training pairs by image-image translation and then employ the triplet loss to constrain the intra-domain variation. In addition, several clustering-based methods assign pseudo-labels to pedestrian images and achieve impressive performance. For instance, Ge et al. [35] mitigate the effect of predicted false labels by designing the mutual mean-teaching framework. Zhai et al. [12] pre-trains several different networks based on clustering algorithms and integrate these models to implement a mutual learning process.

In addition, Ge et al. [36] propose a self-paced contrastive learning strategy by designing a hybrid memory to encode available information from both source and target domains. Differently, the hybrid memory method solves the inconsistency problem between the source and target domains and effectively utilizes outliers in the clustering process. To this end, they design a hybrid memory to encode available information from training samples for feature learning. Our method resolves the feature distribution problem, which stimulates the clustering algorithm to generate effective labels. We define the feature constraint loss to restrict the feature distribution and then integrate the multi-task operation with clustering to jointly optimize the Re-ID model.

In the optimization process, the hybrid memory method builds the memory to store clustering features and employs the contrastive loss to learn the identity information, while our method directly employs the classifier to identify clustering-based labels and perform the multi-task operation to mine pedestrian features. In addition, they design a self-paced contrastive learning strategy to prevent training error amplification, while our method defines the feature constraint loss to constrain the feature distribution for promoting the clustering algorithm. Furthermore, as for all instance features, the hybrid memory method minimizes the negative log-likelihood over feature similarities, while we utilize the specificity of cosine similarity to maximize the feature similarity. In conclusion, our method is completely different from the existing hybrid memory [36]. Our method and the existing hybrid memory possess different motivations and implementation processes.

These methods mentioned above obtain impressive performance in the UDA field. However, they require sufficient labeled source domain data. Otherwise, these methods will have significant performance degradation. In contrast, we explore the fully unsupervised Re-ID method by effectively constraining the feature distribution, which achieves superior performance and exceeds many UDA methods.

2.3 Unsupervised person Re-ID

The conventional hand-craft features do not employ any external information [37,38,39]. But, they are difficult to extract robust features, and their performance suffers a significant degradation when dealing with large databases. Recently, several deeply learned approaches have been proposed to stimulate the performance of unsupervised person Re-ID. Lin et al. [40] utilize a clustering algorithm to predict pedestrian labels and iteratively train the Re-ID model. Nevertheless, the clustering may generate many wrong labels and affect the model’s generalization capability. On the other hand, since the camera-ID can be directly employed, some works utilize camera-ID to finish unsupervised Re-ID tasks and achieve impressive performance. For example, Lin et al. [15] propose the cross-camera encouragement to learn the potential feature similarity, which has a significant improvement. Wang et al. [14] apply GAN to generate pedestrian images with different styles and design multiple labels for each image. However, the two methods easily cause performance degradation when pedestrian images have large variations due to lack of effective supervision signals.

In the paper, we define the feature constraint loss to restrict the feature distribution and then integrate the multi-task operation with clustering to adequately employ generated pseudo-labels. The strategy could promote the clustering algorithm to generate high-quality labels, which provides effective supervision information for the Re-ID model.

3 Proposed method

The study aims to restrict the feature distribution of different pedestrians for solving the unsupervised person Re-ID challenge. Given an unlabeled pedestrian database X = {\(x_1\), \(x_2\), \(x_3\), \(\ldots \), \(x_N\)}, where N is the total number of samples, our method could extract robust features to match the same pedestrian. We first introduce the whole structure in Sect. 3.1 and then explain the feature constraint loss in Sect. 3.2. Afterward, we illustrate the multi-task operation process in Sect. 3.3. Finally, we present the final optimization objective in Sect. 3.4.

3.1 The framework of HFCN

Fig. 2
figure 2

The whole framework of the proposed HFCN. We extract image features and calculate the similarity between them. The feature constraint loss is defined to restrict the feature distribution. Meanwhile, we design the multi-task operation with the iterative update for the clustering algorithm to learn robust features. We integrate the feature constraint loss and the multi-task operation to jointly optimize the Re-ID model and update pedestrian labels. The solid line represents the operation in each iteration, and the dotted line denotes the process on the entire database

In the paper, we employ the ResNet-50 [41] as the backbone to extract image features, and the whole framework is shown in Fig. 2. After the final pooling operation, we remove the original fully connected layer and extract corresponding pedestrian features. Afterward, we add two branches to optimize the deep model. Specifically, we define the feature constraint loss to restrict the feature distribution for distinguishing different pedestrians in the first stage in the upper branch. After m epochs, the multi-task operation is implemented to further optimize pedestrian features with clustering algorithm for identifying complex samples in the below branch. We integrate the two branches to jointly train the Re-ID model and update pedestrian labels at each epoch.

For producing pedestrian labels, the density-based DBSCAN [42, 43] is utilized to assign pseudo-labels for all samples. According to predicted labels, we randomly select P identities, and each identity has K pedestrian images as the input. So, each batch contains PK pedestrian images. As for the input, we implement the random horizontal flipping, random cropping, and random erasing [44] data enhancement methods. Since the camera-ID can be utilized directly, Camstyle [45] is leveraged to increase the variety of camera styles.

In the test stage, we employ the trained Re-ID model to process pedestrian images and obtain corresponding features behind the pooling layer. We compute the Euclidean distance to match the same pedestrian, which obtains a remarkable improvement.

3.2 Feature constraint loss

Fully unsupervised person Re-ID is a challenging technology because there are no annotated identities. Several state-of-the-art methods apply clustering algorithms to produce pseudo-labels and obtain impressive performance [35, 46]. Due to large variances in cameras, poses, backgrounds, etc., pedestrian images have a large distribution discrepancy as shown in Fig. 1a. Since effective clustering depends on discriminative pedestrian features, these existing methods are easy to produce low-quality labels by directly employing clustering algorithms and affect the performance of person Re-ID.

Pedestrian images in the person Re-ID field generally have large background changes and relatively small foreground variations. It means that the foreground region contains the identifiable semantic feature and the foreground feature has a strong correlation, which is an essential factor for solving unsupervised Re-ID. In order to explore the potential correlation, we define the effective feature constraint loss to restrict the feature distribution without any external information. The strategy forces image features with the same identity to possess the similar distribution and different pedestrians to have large distribution gaps. More importantly, the strategy could stimulate clustering algorithms to generate high-quality labels by distinguishing various pedestrians and extract valuable information for unsupervised person Re-ID.

Specifically, we compute the feature similarity and sort these similarity scores for each pedestrian image. We set the threshold v to filter high similarity features, treated as near-neighbor features. To this end, we construct an exemplar memory M to store pedestrian features, where M[i] represents the feature of the i-th pedestrian image. In the initialization, we assign all values to be 0 in M. During the iteration, we forward each batch through the Re-ID model and extract the L2-normalized feature. In the back-propagation stage, M is updated:

$$\begin{aligned} M[i] \leftarrow \theta f_i + (1-\theta )\cdot M[i], \end{aligned}$$
(1)

where \(\theta \epsilon \) [0, 1] is the updating rate, \(f_i\) denotes the current normalized feature, and M[i] is then L2-normalized via \(M[i] \leftarrow ||M[i]||_2\). So, the similarity between \(f_i\) and M[j] is defined as:

$$\begin{aligned} s_{ij} = f[i] \times M[j]^T, \end{aligned}$$
(2)

We compute the similarity between the feature \(f_i\) and all other features. Afterward, we rank the similarity to get the corresponding sort score:

$$\begin{aligned} S_{i} = \mathop {\mathrm {arg ~sort}}_{j} {(s_{ij})}, ~ j\epsilon [1,N], \end{aligned}$$
(3)

where N denotes the total number of pedestrian images.

In the beginning, the initial model has poor identification capability, and initial features stored in M have poor discriminability. These features have a large distribution gap, so it is difficult to search for the same pedestrian properly. Nevertheless, it is easy to search for different pedestrians because there are many negative samples in M. Therefore, we first treat each image as an individual identity by constraining the feature similarity between \(f_i\) and M[i]. Simultaneously, we force dissimilar features to have low similarity. As for dissimilar samples, pedestrian features with high similarity scores are considered hard negative sample features, while pedestrian features with low similarity scores are treated as easy negative sample features. Obviously, when hard negative sample pairs possess large distribution gaps, easy negative sample pairs naturally produce a larger difference. Thus, we select n difficult scores \(S_{i}\) for the feature \(f_i\) and obtain the similarity score set, i.e., \(V_{i}^{n}\). As for any two features, their cosine similarity is the matrix inner product of two features after L2 normalization. Since pedestrian features stored in M are \(L_2\) normalized, the feature similarity \(s_{ij}\) is restricted between [\(-1, 1\)]. Therefore, the same pedestrian features should have high similarity with the highest similarity close to 1, while different pedestrian features have low similarity with the lowest similarity reaching \(-1\). So, the feature constraint loss is defined as:

$$\begin{aligned} L_{sc}^i = \eta (1- s_{ii})^2 + \frac{1}{n} \sum ^{}_{s_{iv} \epsilon V_{i}^{n}} (1 + s_{iv})^2, \end{aligned}$$
(4)

where \(\eta \) controls the importance of reliable features, \(s_{ii}\) indicates the similarity of current epoch feature and last updated feature M[i], and \(s_{iv}\) denotes the similarity score between the feature \(f_i\) and \(f_v\) in \(V_{i}^{n}\).

After m epochs, pedestrian features achieve a certain discriminant ability. We search for reliable samples and force them to have the similar feature distribution. Intuitively, reliable samples have high similarities. Hence, we set the threshold v to filter low similarity score in \(S_{i}\), and then obtain a reliable score set \(U_{i}^v\). Hence, the feature constraint loss is escalated to:

$$\begin{aligned} L_{sc}^i=\frac{\eta }{T} \sum ^{}_{s_{iu} \epsilon U_{i}^v}(1- s_{iu})^2 + \frac{1}{n} \sum ^{}_{s_{iv} \epsilon V_{i}^{n}} (1 + s_{iv})^2, \end{aligned}$$
(5)

where T is the total number of selected reliable samples and \(s_{iu}\) denotes the similarity score between the feature \(f_i\) and \(f_u\) in \(U_i^v\). So, the loss value within one batch could be written as:

$$\begin{aligned} L_{sc} = \frac{1}{PK} \sum ^{PK}_{i=1}L_{sc}^i. \end{aligned}$$
(6)

By implementing the feature constraint, near-neighbor features generate high similarity; otherwise, they have a low similarity. Approximatively, image features with the same identity are forced to have the similar feature distribution, and different pedestrians possess larger distribution gaps. This promotes the Re-ID model to distinguish different pedestrians without any external labels and learn meaningful features. Furthermore, the strategy could effectively facilitate clustering algorithms to predict pedestrian identities.

Table 1 Ablation study on three person Re-ID databases

3.3 Multi-task operation

The feature constraint loss encourages pedestrian features with the same identity to have high similarity and forces different pedestrians to possess low similarity. The effect could stimulate the deep model to identify different pedestrians. With the promotion of the feature distribution, the clustering algorithm can generate valuable label information. To this end, we design the multi-task operation to adequately employ generated labels for distinguishing different pedestrians. Specifically, the label smoothing regularization loss is utilized to perform the classification task, while the hard triplet loss is employed to accomplish the ranking task, which can improve the generalization ability of the Re-ID model.

Since the number of pedestrian labels is uncertain, we employ DBSCAN [42, 43] to predict pedestrian identities. In each epoch, the clustering algorithm generates a variable number of labels. Thus, we create a new classifier depending on the number of generated identities in each epoch to perform the classification task. We calculate the mean feature of each cluster and implement the normalization operation to initialize the classifier.

Similarly, we extract pedestrian features behind the pooling operation and then feed them into the classifier. Afterward, we employ the label smoothing regularization loss to calculate the loss of feature \(f_i\):

$$\begin{aligned} L_{id}^i= -(1-\varepsilon )\log (q(r\mid f_i))- \frac{\varepsilon }{Z} \sum _{{\begin{array}{c} z = 1\\ z \ne r\\ \end{array}}}^{Z}\log (q(z\mid f_i)), \end{aligned}$$
(7)

where \(\varepsilon \) is set to 0.1 that denotes the hyperparameter, \(q(z\mid f_i)\) is the predicted probability that the feature \(f_i\) belongs to the z-th label, Z is the total number of predicted labels, and r is the corresponding prediction label. So, the loss value within one batch is formulated as:

$$\begin{aligned} L_{id}= \sum _{i=1}^{PK} L_{id}^i. \end{aligned}$$
(8)

In the optimization process, hard samples are difficult to be effectively distinguished. The hard triplet loss could decrease the distance among the same pedestrian features and increase the distance among different pedestrians. It is widely used in supervised methods to perform the similarity measurement learning operation and obtains the stimulative effect [47,48,49]. In the paper, we intuitively employ the hard triplet loss to unsupervised person Re-ID based on generated labels to implement the ranking task. The loss could further constrict the feature distribution to identify hard samples and in turn, it facilitates the clustering to produce effective labels. The loss value within one batch is written as:

$$\begin{aligned} L_{t} = \sum _{i=1}^{PK} [\psi + D_{ip} - D_{ib}]_+. \end{aligned}$$
(9)

where \(\psi \) denotes the margin, \(D_{ip}\) represents the maximum distance between the i-th feature and the hardest positive sample feature within one batch, \(D_{ib}\) is the minimum distance between the i-th feature and the hardest negative sample feature within one batch, and \([w]_+\) = \(\max (w, 0)\).

In a word, the label smoothing regularization loss could make full use of predicted label information, while the hard triplet loss increases the inter-class variation and decreases the intra-class variation. The two loss functions enable the deep model to easily identify hard samples, which effectively improves the performance of the Re-ID model.

3.4 Final objective

On the one hand, we define the feature constraint loss to learn the potential feature correlation, which enforces pedestrian features with the same identity to have high similarity and different pedestrians to possess low similarity. This not only promotes the model to learn robust features but also facilitates the clustering algorithm to generate high-quality labels. On the other hand, we design the multi-task operation with the iterative update to make full use of predicted labels and distinguish hard pedestrian samples. The final objective function is formulated as:

$$\begin{aligned} L = L_{sc} + \alpha L_{id} + \beta L_{t}. \end{aligned}$$
(10)

where \(\alpha \) and \(\beta \) denote the importance of the label smoothing regularization loss and the hard triplet loss, respectively.

We integrate the feature constraint loss and multi-task operation to optimize the Re-ID model with iteratively updated pedestrian labels. The iterative optimization strategy in each epoch promotes the clustering algorithm to generate high-quality labels, and in turn, these labels stimulate the Re-ID model to learn potential valuable information. The proposed HFCN substantially improves the performance of the Re-ID model and exceeds the state-of-the-art methods.

4 Experiments

4.1 Databases and implementation details

Our method is evaluated on three mainstream person Re-ID databases: Market1501 [39], DukeMTMC-reID [2] and MSMT17 [8]. Two typical assessment criteria, i.e., rank-1 and mAP, are utilized to evaluate the performance of the algorithm.

In the paper, we employ the ResNet-50 [41] as the backbone due to its competitive advantages. We set P to 32 and K to 4, resulting in 128 pedestrian images within one batch. We crop each pedestrian image into \(256\times 128\). The total number of epoch is 60. We set the learning rate to 0.1 and change it to 0.01 after 40 epochs. We employ SGD to optimize the Re-ID model. The parameter \(\theta \) in Eq. 1 starts from 0 increases linearly to 0.5. The threshold \(\psi \) in Eq. 9 is set to 0.5. Several other important parameters are evaluated in Sect. 4.4.

4.2 Ablation study

We systematically assess the effectiveness of each component in HFCN by performing the ablation study, and the results are reported in Table 1.

Direct transfer denotes the deep model pre-trained on ImageNet [66] is directly applied to the person Re-ID field. This has a poor performance and illustrates the pre-trained model contains very little pedestrian information, which can not provide meaningful features for the clustering algorithm.

\(L_{sc}\) indicates the feature constraint loss is utilized to explore the potential pedestrian correlation. The strategy surpasses direct transfer by 53.2% rank-1, 28.2% mAP on Market1501 and 41.8% rank-1, 24.6% mAP on DukeMTMC-reID. It also surpasses direct transfer by 4.9% rank-1, 1.3% mAP on MSMT17. This proves the feature constraint loss is valuable for restricting the feature distribution. More importantly, the method could promote the pedestrian features with the same identity to have the similar distribution and different pedestrians to possess large distribution gaps.

\(L_{id} + L_{t}\) represents that the clustering algorithm is directly applied to predict pedestrian labels. Afterward, the classification and ranking tasks are performed, which reveals a certain performance. Nevertheless, the effect is not satisfying for us. This is because that simple clustering cannot produce high-quality label information due to lack of effective constraint, limiting the effect of unsupervised person Re-ID.

Next, we add \(L_{sc}\) to \(L_{id}\) + \(L_{t}\) to validate the effect of our final HFCN, i.e, \(L_{sc}\) + \(L_{id}\) + \(L_{t}\). We can see that rank-1 and mAP are improved by 14.5% and 20.8% on Market1501 compared with \(L_{id}\) + \(L_{t}\). Meanwhile, the improvement on DukeMTMC-reID is 10.8% and 14.6% on rank-1 and mAP, respectively. The significant advantage proves that predicted labels possess the high quality. Our feature constraint loss could provide effective constraint to restrict the feature distribution. In addition, this also verifies we effectively fuse the feature constraint loss and the multi-task operation to mine robust pedestrian features.

To further analyze HFCN, we compare the invariance loss presented in ECN [10], which is expressed as Invariance. ECN constructs a memory to save all pedestrian features and assigns near-neighbor features to have the same identity. The invariance loss exhibits a certain performance for the Re-ID model. However, we replace our feature constraint loss with the invariance loss to integrate the multi-task operation, i.e., Invariance + \(L_{id}\) + \(L_{t}\), which reflects worse performance than that of the proposed HFCN. Even the invariance loss makes the performance of multi-task operation degrade, especially in the two large-scale person Re-ID databases, i.e., DukeMTMC-reID and MSMT17. This is because the invariant loss sets hard labels for each pedestrian, which is inconsistent with the multi-task operation. Our method is superior Invariance + \(L_{id}\) + \(L_{t}\) by 14.2% rank-1, 25.2% mAP on Market1501 and 14.4% rank-1, 20.7% mAP on DukeMTMC-reID. In addition, HFCN surpasses Invariance + \(L_{id}\) + \(L_{t}\) over 31.7% rank-1, 15.7% mAP on MSMT17. The significant advantage proves the proposed feature constraint loss can stimulate the clustering algorithm to generate high-quality labels. The proposed HFCN could effectively integrate the feature constraint loss and the multi-task operation to mine potential pedestrian features.

Table 2 Comparison with other state-of-the-art methods on Market1501 and DukeMTMC-reID. In column “Setting”, “U” denotes fully unsupervised methods and “UDA” indicates utilizing external labeled data

In addition, we also replace our feature constraint loss with the triplet loss at the first stage to prove its effectiveness, i.e., Triplet + \(L_{id}\) + \(L_{t}\). Obviously, the performance is inferior to our final results. This is because the triplet loss mainly employs pseudo-labels to constrain the feature space. The feature constraint loss only utilizes the feature similarity to constrain the feature distribution, which stimulates the clustering algorithm to produce high-quality labels. The feature constraint loss is complementary to the multi-task operation. This further demonstrates that our feature constraint loss is valuable for unsupervised person Re-ID.

4.3 Comparison with the state-of-the-art methods

We compare the proposed HFCN with other competitive unsupervised person Re-ID methods on Market1501 and DukeMTMC-reID, and the compared results are reported in Table 2.

We divide these competing approaches into three categories. The first category employs hand-craft feature methods. The second category adopts deep learning methods for unsupervised person Re-ID, and the third type utilizes the labeled source domain to implement UDA tasks. From Table 2, deep learning methods obtain higher accuracy than the two hand-crated features. This is because neural networks can learn discriminative features by adapting to various situations.

More importantly, our method obtains the best performance and outperforms all other unsupervised methods by a large margin. The proposed HFCN achieves rank-1 = 88.7%, mAP = 69.3% on Market1501. Meanwhile, it also obtains rank-1 = 75.5% and mAP = 56.6% on DukeMTMC-reID. The impressive performance proves that the proposed HFCN successfully integrates the feature constraint loss and multi-task operation, which could promote the clustering algorithm to produce high-quality labels. The method could adequately utilize predicted labels and identify different pedestrians. All results demonstrate that the proposed HFCN is effective and could extract discriminative features without any annotated labels.

BUC [40] and HCT [52] apply the clustering method to fully unsupervised person Re-ID. From Table 2, the proposed HFCN outperforms them by a large margin, which proves our method is valuable. HCT also utilizes the hard triplet loss to optimize the Re-ID model. Our method is still superior HCT by 8.7% rank-1, 12.9% mAP on Market1501 and 5.9% rank-1, 5.9% mAP on DukeMTMC-reID. The distinct advantage demonstrates the proposed HFCN is a meaningful algorithm.

Table 3 Performance comparison with the state-of-the-art methods on MSMT17. The column “Source” denotes external data utilized, where “None” stands for fully unsupervised methods

Furthermore, even though UDA methods leverage the external source domain data, our method still exceeds them. This further proves HFCN is an innovative approach for unsupervised Re-ID, which could extract discriminative features and markedly boost the generalization ability of the Re-ID model.

MSMT17 is the most challenging database in the person Re-ID field. We also verify our method on MSMT17, and the results are listed in Table 3. From the Table, the proposed HFCN achieves rank-1 = 52.3% and mAP = 21.9%, which outstands others competitive approaches. Although UDA methods employ different external data, our method still has a competitive advantage. This proves HFCN can generate high-quality labels to learn discriminative features once again.

Fig. 3
figure 3

Evaluation of three parameters

4.4 Parameter analysis

We investigate six hyperparameters and analyze their effect on Market1501, i.e., the weight \(\eta \) in Eq. 4, the candidate number n in Eq. 5, the threshold v in Eq. 5, the parameter \(\alpha \), \(\beta \) in Eq. 10, and m.

The weight \(\eta \). Figure 3a investigates the effect of the weight \(\eta \) in Eq. 4. The parameter \(\eta \) controls the importance of reliable samples. We verify the parameter value from 2 to 7, and the performance is best when \(\eta \) is 5. When \(\eta \) gradually increases, the accuracy is not significantly changed, which means that appropriate similar feature constraints can promote the deep model to identify different pedestrians.

The candidate number n. We employ parameter n in Eq. 4 to control the candidate number of pedestrian features that are different from feature \(f_i\). The evaluation results are shown in Fig. 3b. We rank the feature similarity and select n similarity scores after one hundred. From the figure, the proposed HFCN achieves the best performance when n is equal to 30. When n is too small, the feature constraint loss is difficult to stimulate the Re-ID model to distinguish different pedestrians, resulting in performance degradation. So, n = 30 is utilized in our experiments.

The threshold v. We evaluate the threshold v in Eq. 5, and the results are listed in Fig. 3c. The parameter is utilized to select reliable pedestrian features. A higher threshold selects fewer and more similar features, and the lower threshold selects more features for optimization. The results show that the threshold of 0.8 drives this model to achieve the best accuracy.

The parameter \(\alpha \). Table 4 shows the effect of parameter \(\alpha \) in Eq. 10. The performance achieves the best result when \(\alpha \) = 1.2. This proves the generated label information is important. Since the accuracy is relatively stable, the parameter is not sensitive to the optimization process. So, we set \(\alpha \) to be 1.2 in our experiments.

Table 4 Evaluation of parameter \(\alpha \) in Eq. 10
Table 5 Evaluation of parameter \(\beta \) in Eq. 10

The parameter \(\beta \). The parameter \(\beta \) denotes the importance of the hard triplet loss, and the assessment is shown in Table 5. The proposed HFCN obtains the highest accuracy when \(\beta \) = 1.4. This illustrates that the hard triplet loss is vital for identifying hard samples and further stimulates the deep model to distinguish different pedestrians.

Table 6 Evaluation of parameter m
Fig. 4
figure 4

Visualization of twenty randomly selected identities from the test set on Market1501. a Denotes the initial distribution, b represents the optimized result by performing the multi-task operation, and c indicates the final feature distribution by HFCN

The parameter m. Table 6 reveals the effect of the parameter m. The parameter expresses we start to search for reliable features after m epochs. The results illustrate that we obtain the best performance when m is equal to 5. This reason may be pedestrian features have certain discrimination after proper constraints, which is more conducive to the clustering algorithm to produce effective labels.

4.5 Visualization analysis

We visualize the feature distribution as shown in Fig. 4, where the same color indicates the same identity. Figure 4a is the initialized feature extracted by a pre-trained model on ImageNet. It hardly distinguishes any pedestrians, reflecting the poor performance. By performing the clustering algorithm and implementing the multi-task operation, Fig. 4b reveals certain performance with its ability to distinguish different pedestrians. The proposed HFCN achieves superior performance as shown in Fig. 4c, which significantly identifies pedestrian images. This further proves that our feature constraint loss effectively encourages the same pedestrian to have the similar distribution and different pedestrians to generate the large discrepancy. More importantly, we successfully integrate the feature constraint loss and the multi-task operation, which stimulates the clustering algorithm to generate high-quality labels for learning discriminative features.

5 Conclusion

In the paper, we propose HFCN to restrict potential pedestrian features and mine effective information with no labeled data for fully unsupervised person Re-ID. Specifically, we define the feature constraint loss to restrict the feature distribution. As a result, pedestrian features with the same identity are facilitated to have the similar distribution and different pedestrians possess large distribution gaps. Hence, the operation could promote the deep model to distinguish different pedestrians, which further stimulates the clustering algorithm to generate high-quality labels. Afterward, we design the multi-task operation to sufficiently utilize predicted labels and identify hard samples. Finally, the feature constraint loss and multi-task operation are integrated to jointly optimize the Re-ID model, improving the Re-ID model’s generalization ability. Numerous experiments demonstrate the effectiveness of the proposed HFCN that surpasses the state-of-the-art for fully unsupervised Re-ID.

The proposed HFCN possesses six hyperparameters that have been analyzed in Sect. 4.4. Although the six parameters are easy to determine, their choice still requires some experience when expanding to other fields. In the future, we will extend the idea to unsupervised cross-modality person ReID to solve the cross-modality problem and further verify its effectiveness and stability [69].