Introduction

Benefiting from the development of deep learning and the emergence of large-scale face datasets [9, 14, 16, 48], face recognition techniques have made significant progress in recent years [6, 15, 20, 24, 35]. However, manually annotating a massive number of face images is time consuming and labor intensive. To alleviate these limitations, several face clustering methods have been proposed [29, 36, 37, 41, 42]. These approaches not only provide high-quality pseudo-labels for model learning but also extend their applications from family photo album management [46] to automatic data cleaning [25, 45].

Most traditional clustering algorithms have relatively strict assumptions. For example, K-means [22] requires specifying the number of clusters, spectral clustering favors balanced clustering results, and density-based spatial clustering of applications with noise (DBSCAN) [7] may not work in cases with large cluster density variations. While hierarchical clustering methods [18, 26, 47] generate clusters with arbitrary shapes and numbers, they are not suitable for large-scale face datasets due to the high complexity of the proximity matrix computation in each iteration.

Fig. 1 The motivation of our approach. a Conventional pairwise features mainly focus on neighbor-to-center properties. b Our enhanced pairwise features further encode intra-neighbor relationships

Recently, researchers have attempted to incorporate a small amount of supervised information into the clustering process, resulting in several supervised face clustering methods [19, 29, 41]. Considering the nature of clustering, those methods do not yield clustering results in a straightforward manner. Instead, they seek to generalize the structural knowledge learned from labels, such as sample density [8, 19, 41] and inter-sample connectivity [27, 29, 37], to unknown observations. Thus, the key issue is, given partial labels, how to effectively and efficiently model these structural characteristics. Due to the great power of propagating and aggregating information over graphs, graph convolutional networks (GCNs) [38] are widely used to describe intra-cluster and inter-cluster relationships [2, 13, 17], and they significantly improve the performance of face clustering [29, 36, 37, 41]. However, GCN-based approaches bring excessive computational cost and memory consumption. While some sampling strategies [37, 40] have been proposed to improve scalability, they also suffer from overlapping subgraphs [37] and a lack of global views [40]. Moreover, one or more thresholds are usually employed in the post-processing phase to remove noisy edges, thus limiting the adaptability of these methods.

Concurrently, following the link prediction approach [3, 43, 44], some methods uncover clustering patterns from pairwise relationships [19], which can alleviate the influence of redundant and noisy connections in graphs [41]. Specifically, when constructing the pairwise features of two target faces, their k-nearest neighbor (k-NN) features are also involved to improve contextual awareness (as shown in Fig. 1(a)). However, the explicit exchange of information among neighbors, which also plays an essential role in outlining local variations, is usually ignored during pairwise learning, as highlighted in Fig. 1(b).

To resolve these problems, in this paper, we propose a structure-enhanced pairwise feature learning method (SEPFL) for face clustering. Unlike existing methods that employ some trivial functions to aggregate the neighbors around each sample [19], in SEPFL, the relationships between a sample and its neighbors, as well as the inherent intra-neighbor patterns, are jointly encoded to adaptively weight the candidate neighbors. Thus, the elements of the pair features ensure that the local relationships are sufficiently mixed, which is particularly beneficial for clustering learning.

Moreover, the two samples forming a pair should have some differences to characterize the local structural changes. Thus, in terms of density, a density gap should be present between them. To achieve this goal, a combined density is proposed to enhance the density decay from high-density samples, usually cluster centers, to low-density samples, such as boundary samples or outliers. Guided by the combined density, many redundant pairs are discarded, resulting in both clustering performance and inference efficiency improvements.

The main contributions of this paper are summarized as follows.

  1. We propose a novel face clustering framework that performs data grouping at the pair level. Compared to graph-based approaches, our framework incorporates pairwise feature learning for connectivity classification, reducing the computational cost and alleviating the dependence on thresholds in the inference phase.

  2. We propose a neighborhood mixing mechanism for learning structure-enhanced pairwise representations, in which information interchange among neighbors is effectively promoted to characterize the local structural variations.

  3. We design a combined density strategy to assist in selecting more representative pairs for both training and inference, thus further improving the clustering accuracy of our method.

  4. Our method is more competitive than other advanced methods and demonstrates effectiveness in other clustering tasks, such as fashion clustering.

The remainder of this paper is organized as follows. Related work briefly reviews face clustering methods according to whether supervision information is involved. Methodology presents the proposed approach in detail. Experiments reports comparisons with different face clustering algorithms, and Conclusion concludes the paper.

Related work

Due to their restrictive assumptions regarding data distribution and their scalability issues, traditional clustering methods [7, 11, 22] cannot be directly applied to large-scale face clustering. Therefore, we briefly review unsupervised and supervised face clustering methods in this section.

Unsupervised face clustering

Hierarchical clustering can handle complex data distributions and does not require the number of clusters to be specified in advance, which has motivated a series of methods that group faces in an agglomerative manner. Lin et al. [18] proposed a proximity-aware hierarchical clustering (PAHC) method, which separates positive and negative samples in the feature space to simplify the subsequent clustering procedure; however, this approach is inadequate when dealing with unbalanced cluster sizes. By introducing neighborhood structures, Zhu et al. [47] proposed a rank-order (RO) distance metric to alleviate the nonuniform distribution issues that are typically related to face data. As an extended version of the RO metric, Otto et al. [26] adopted an approximation mechanism for faster nearest-neighbor searching, which supports the clustering of millions of faces. The above methods usually work under transductive settings, requiring model retraining when new faces are encountered. Therefore, some recent methods have taken inductive or supervised approaches for face clustering.

Fig. 2 Overview of the proposed SEPFL framework. First, we calculate the combined density of each sample based on a k-NN graph, which is used to select representative sample pairs. Then, these selected pairs are fed into a neighborhood mixing module where structure-enhanced pairwise representations are derived. Next, by utilizing a binary multilayer perceptron (MLP) classifier, two samples are directly assigned to one or more clusters. Finally, we perform a breadth-first search (BFS) over the selected relationships, which produces the output clustering results

Supervised face clustering

Supervised face clustering leverages the structural descriptions learned from labels to guide the clustering process for unknown samples. These methods can be roughly divided into two categories: two-stage methods and one-stage methods.

Two-stage methods

The two-stage approaches usually follow a coarse-to-fine procedure, in which the first stage provides an overall view of the input data, commonly in the form of a full graph or multiple subgraphs. Each vertex in the graph represents a face, and the edges indicate the relationships between pairs of vertices. Ideally, two faces belonging to the same identity are linked by an edge. The second stage gradually refines the relationships to discard the influences of noisy vertices and outliers. For instance, consensus-driven propagation (CDP) [42] first builds a multiview representation of the given data based on k-NN graphs through a base model and several committee models, which are then merged into a single graph for label propagation. Most two-stage methods use GCNs as the base feature extractors due to their powerful representation capabilities. Wang et al. devised an L-GCN method [37] that takes a subgraph as a learning unit. It first utilizes a GCN to reason about the linkage likelihood between a pivot and its nearest neighbors and then obtains consistent clustering results by merging the predicted links of multiple subgraphs. Among its variants, Qi et al. proposed a residual GCN (RGCN) [27] method, which cascades multiple GCN blocks with residual connections. Yang et al. [40] also proposed a two-stage clustering framework: the first stage, GCN-D, detects high-quality cluster proposals with high recall and purity, and the second stage, GCN-S, refines the proposals by removing outliers. Similarly, in a later work [41], the clustering process was divided into two sequential subtasks, GCN-V, which predicts node confidence, and GCN-E, which estimates edge connectivity. Notably, GCN-V was trained on the entire graph rather than on subgraphs, resulting in less learning bias. However, only a one-layer GCN was deployed to reduce the computational cost, limiting the representation power of the model. Guo et al. designed a density-aware feature embedding network (DA-Net) [8], which consists of two subnetworks, one based on a GCN and one based on a long short-term memory (LSTM) unit. The first subnetwork enables information propagation over each subgraph, and the second covers remote dependencies via a proposed density chain module. To improve the graph quality, Wang et al. [36] transformed the input samples into a structured space with fewer noisy edges and identified the candidate neighbors of each vertex with a designed adaptive filter module.

In short, these two-stage algorithms not only achieve improved clustering performance but also provide richer structural descriptions, including the quality of graphs [40], the centrality of points [41], the connectivity of edges [27, 37], and more discriminative feature representations [36], which can support further decision making. However, these methods also have the following drawbacks. First, the use of multiple phases or subtasks may lead to suboptimal results, and it requires more hyperparameters to be tuned. Second, although GCNs have shown advantages in aggregating features, they significantly increase the computational burden. Even though the resource usage can be reduced by sampling subgraphs, additional mechanisms need to be designed for subgraph merging.

One-stage methods

Recently, by designing sampling strategies at different scales, some single-stage face clustering methods have been investigated to simplify the training processes of two-stage techniques. The structure-aware face clustering (STAR-FC) algorithm [29] takes clusters as the smallest sampling units instead of points and achieves large-scale GCN training while preserving the important structural information of the entire graph. In addition, STAR-FC takes the full graph as input to ensure inference efficiency. CFPC [19] performs face clustering at the pair level and uses a breadth-first search (BFS) algorithm to deduce the final clustering results from the predicted pairwise relationships. Pair-based approaches have demonstrated competitive performance in face clustering. However, how to learn effective pair representations and how to construct pairwise data that are less subject to structural biases have not been effectively explored.

Methodology

As previously stated, existing methods for learning pairwise embeddings usually ignore the relationships among neighbors. Therefore, we propose an SEPFL method that employs a neighborhood mixing module to better capture structural properties. Moreover, a combined density strategy, which selects more representative pairs for training and testing, is further proposed. The entire framework is shown in Fig. 2.

For a given face dataset with C identities, image features are extracted by a pretrained convolutional neural network (CNN) [6] and normalized to a set \(\mathcal {F}=\{f_i\}_{i=1}^N\), where \(f_i \in \mathbb {R}^D\) is the i-th face, N is the total number of samples, and D denotes the dimensionality of the features. To obtain a broad view of \(\mathcal {F}\), a k-NN graph is constructed based on cosine similarity. The goal of face clustering is to assign a unique pseudo-label \(y^{\prime }\) to each cluster, where \(y^{\prime } \in \{1,2,\dots ,C^{\prime }\}\) and \(C^{\prime }\) denotes the number of predicted clusters.
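For concreteness, the following sketch builds such a cosine-similarity k-NN graph from L2-normalized features; the feature file name, the use of an exact brute-force search from scikit-learn, and the local variable names are illustrative assumptions rather than details prescribed by our method.

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.neighbors import NearestNeighbors

K = 80  # neighborhood size used for MS1M in our experiments
features = normalize(np.load("face_features.npy"))  # hypothetical (N, D) CNN features

# Exact k-NN search under the cosine metric; each query returns itself first.
nn = NearestNeighbors(n_neighbors=K + 1, metric="cosine").fit(features)
dists, knn_idx = nn.kneighbors(features)
sims = 1.0 - dists[:, 1:]   # cosine similarities to the K nearest neighbors
knn_idx = knn_idx[:, 1:]    # neighbor indices with the query itself excluded
```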

Table 1 Comparison of the main hyperparameters adopted by different methods

Ideally, all faces associated with the same identity should be grouped together. Therefore, a variety of metrics are introduced to measure the gap between the predicted clustering results and the ground-truth labels [1, 30].

Structure-enhanced pairwise feature learning

Learning better structural descriptions is essential for face clustering. To handle large-scale clustering problems, most existing methods adopt GCNs to learn the structural patterns of predivided subgraphs, which are typically based on vertex confidence [41], edge or subgraph connectivity [29], and so on. Subsequently, the patterns obtained from the subgraphs are fused to restore the clustering results. However, this multistage cluster generation scheme increases the computational cost of the overall method, and each stage may also introduce some hyperparameters. For instance, Table 1 summarizes the main hyperparameters used in the different stages of several face clustering algorithms. The k-NN method provides the initial structure of the input data for all the mentioned methods. In addition, some hyperparameters are applied at different stages to control the sparsity [40], connectivity [36, 37] or randomness [29] of the graphs. Generally, at least one cutting threshold is applied during the postprocessing step to eliminate the noisy connections between samples or clusters [29, 36, 37, 41]. As a result, these additional hyperparameters may affect the generalizability and scalability of the associated methods.

Based on the above observations, instead of graphs, we use more primitive structural descriptions, i.e., pairwise relationships, as learnable units. Specifically, a single-stage clustering framework that integrates pairwise feature learning and pairwise relationship classification is proposed. A binary MLP classifier is used to directly predict whether two samples belong to the same cluster. Consequently, according to these filtered positive pairs, each connected component found by a simple search strategy can be taken as a cluster.
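A minimal sketch of this final cluster recovery step is shown below. It merges the predicted positive pairs into connected components; union-find is used here for brevity, whereas the search strategy in our framework is a BFS over the same edges, and all identifiers are illustrative.

```python
def clusters_from_positive_pairs(num_samples: int, positive_pairs):
    """Assign a pseudo-label to every sample by merging predicted positive pairs
    into connected components (union-find with path halving)."""
    parent = list(range(num_samples))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in positive_pairs:
        parent[find(a)] = find(b)

    roots = [find(i) for i in range(num_samples)]
    remap = {r: c for c, r in enumerate(dict.fromkeys(roots))}
    return [remap[r] for r in roots]

# Example: clusters_from_positive_pairs(5, [(0, 1), (1, 2), (3, 4)]) -> [0, 0, 0, 1, 1]
```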

The key challenges of the proposed approach are twofold. First, the simplicity of the paired structure comes at the cost of limited representational power. A pairwise relationship is dominated by the two samples contained in it, leading to a lack of local structure perception. To solve the representation problem, we propose a neighborhood mixing approach for pairwise feature learning. Details are provided in “Neighborhood mixing module section”. Second, while pairwise structures allow for greater flexibility, they give rise to large amounts of redundant relationships. For example, for N samples, at most \(N \times (N-1) / 2\) pairs can be extracted from them. To ensure the sufficiency of the training data and the efficiency of the inference process, we propose a density-guided pair selection strategy for the construction of candidate pairs. The mechanism of this strategy is presented in “Density-guided pair selection section”.

Table 2 List of different weight setting methods
Fig. 3
figure 3

Neighbor weight learning procedure via the neighborhood mixing module

Neighborhood mixing module

As a key component of the GCN, neighborhood aggregation captures each center node’s contextual information to enhance its local structure awareness. For the same reason, combining pairwise features and their neighborhood embeddings is necessary to classify pairwise relationships. Specifically, given a pair of feature vectors, \(f_a\) and \(f_b\), as well as their neighborhood vectors, \(f_{\mathcal {N}_a}\) and \(f_{\mathcal {N}_b}\), the combined pair representation is defined as:

$$\begin{aligned} f_{ab} = [f_a, f_{\mathcal {N}_a}, f_b, f_{\mathcal {N}_b}] \end{aligned}$$
(1)

where [, ] is the concatenation operation, and \(\mathcal {N}_a\) is the set of neighbors of \(f_a\).

Leaving linear transformations and nonlinear activations aside, a common aggregation mechanism is to calculate the weighted combination of the neighboring features of \(f_a\):

$$\begin{aligned} f_{\mathcal {N}_a}=\sum _{i \in \mathcal {N}_a} w_{ai}f_i \end{aligned}$$
(2)

where \(w_{ai}\) denotes the contribution of the ith neighbor in \(\mathcal {N}_a\) to \(f_a\). Therefore, the effectiveness of the aggregator is mainly dependent on the weighting scheme and the representation power of the feature vectors of the neighboring samples. Table 2 summarizes some typical weight setting methods, which are based on neighborhood sizes [37], attention coefficients [34], the distance decay between two samples [19], etc. These weight values are only learned through center-neighbor pairs, which may fail to reflect complex similarity relationships. While some approaches perform multiplication pooling over each neighbor pair [32], more collaborative relationships among the neighbors should be explored. To leverage local interactions for weight learning, we design a module for neighborhood mixing. It consists of three steps: center-neighbor relation embedding, neighborhood information mixing and neighboring weight generation (as shown in Fig. 3). Each of these three steps is elaborated below.

Center-neighbor relation embedding: In this step, we focus on learning the relations between a center and its first-order neighbors. First, a feature matrix \(X_i\in \mathbb {R}^{K\times 2D}\) is built, where the jth row is the concatenation of \(f_i\) and its jth neighbor, and K is the neighborhood size. To obtain primitive relational representations, \(X_i\) is passed to an MLP consisting of two fully connected layers and a nonlinear activation function, which is followed by a batch normalization (BatchNorm) layer for stable optimization. In addition, a skip connection is added between the input and the output of the MLP. The whole process can be written as follows:

$$\begin{aligned} \dot{X_i} = X_i + BatchNorm(\sigma (X_{i}W_1)W_2) \end{aligned}$$
(3)

where \(\sigma \) is the Gaussian error linear unit (GELU) activation function [10], and \(W_1\in \mathbb {R}^{2D \times 4D}\) and \(W_2\in \mathbb {R}^{4D \times 2D}\) are two learnable weight matrices.

Neighborhood information mixing: When measuring the importance of one neighbor point \(j\in \mathcal {N}_i\) to the center point i, it is not sufficient to exchange information between the two members of the pair. Indeed, all the neighbors in \(\mathcal {N}_i\) should be considered. To achieve this goal, we devise a neighborhood mixing mechanism inspired by the mixing operation introduced in [33], which supports information communication along the spatial dimension. Specifically, we assign a similar MLP that is fed with the output feature matrix \(\dot{X}_{i}\) of the previous step and aggregate features across neighbors as follows:

$$\begin{aligned} \hat{X_i} = \dot{X_i} + BatchNorm(W_4\sigma (W_3\dot{X_i})) \end{aligned}$$
(4)

where \(W_3\in \mathbb {R}^{2K \times K}\) and \(W_4\in \mathbb {R}^{K \times 2K}\). As shown in Fig. 3, the neighborhood interactions are encoded along each row of \(\dot{X_i}\).

Neighboring weight generation: Finally, a two-layer MLP is used to obtain the weight assignment of the jth neighbor of point i:

$$\begin{aligned} w_{ij} = \sigma (\hat{X_i}[j,:]W_5)W_6 \end{aligned}$$
(5)

where \(\hat{X_i}[j,:]\) denotes the jth row of \(\hat{X_i}\), \(W_5\in \mathbb {R}^{2D \times D}\) and \(W_6\in \mathbb {R}^{D \times 1}\).
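To make the three steps concrete, the following PyTorch sketch assembles Eqs. (3)-(5) into one module and then forms the aggregated neighborhood feature of Eq. (2). The placement of batch normalization over the feature dimension and the absence of any extra weight normalization are assumptions made for illustration; the layer sizes follow the equations above.

```python
import torch
import torch.nn as nn

class NeighborhoodMixing(nn.Module):
    """Sketch of the neighborhood mixing module (Eqs. (3)-(5))."""

    def __init__(self, d: int, k: int):
        super().__init__()
        # Eq. (3): center-neighbor relation embedding, applied to each row of X_i
        self.relation_mlp = nn.Sequential(nn.Linear(2 * d, 4 * d), nn.GELU(), nn.Linear(4 * d, 2 * d))
        self.bn1 = nn.BatchNorm1d(2 * d)
        # Eq. (4): information mixing across the K neighbors
        self.mixing_mlp = nn.Sequential(nn.Linear(k, 2 * k), nn.GELU(), nn.Linear(2 * k, k))
        self.bn2 = nn.BatchNorm1d(2 * d)
        # Eq. (5): per-neighbor weight generation
        self.weight_mlp = nn.Sequential(nn.Linear(2 * d, d), nn.GELU(), nn.Linear(d, 1))

    @staticmethod
    def _bn(x, bn):
        # apply BatchNorm1d over the feature dimension of a (B, K, C) tensor
        b, k, c = x.shape
        return bn(x.reshape(b * k, c)).reshape(b, k, c)

    def forward(self, center, neighbors):
        # center: (B, D) features f_i; neighbors: (B, K, D) features of N_i
        _, k, _ = neighbors.shape
        x = torch.cat([center.unsqueeze(1).expand(-1, k, -1), neighbors], dim=-1)       # X_i: (B, K, 2D)
        x = x + self._bn(self.relation_mlp(x), self.bn1)                                 # Eq. (3)
        x = x + self._bn(self.mixing_mlp(x.transpose(1, 2)).transpose(1, 2), self.bn2)   # Eq. (4)
        w = self.weight_mlp(x).squeeze(-1)                                               # Eq. (5): (B, K) weights
        f_nbr = torch.einsum('bk,bkd->bd', w, neighbors)                                 # Eq. (2): weighted aggregation
        return f_nbr, w
```

The aggregated \(f_{\mathcal {N}_a}\) and \(f_{\mathcal {N}_b}\) produced by this module are then concatenated with \(f_a\) and \(f_b\) as in Eq. (1) before being fed to the binary MLP classifier.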

Fig. 4 Comparison between the similarity density and combined density

Density-guided pair selection

Due to the redundancy of pairwise connections, it is unnecessary to evaluate all the possible pairs extracted from a dataset. We assume that a data point should be attracted to some cluster center or to another point that is closer to the cluster center than it is. Consequently, the set of candidate pairs can be constructed by concentrating on such directed relationships. Since cluster centers usually lie in the high-density regions of the feature space [41], the two samples forming a pair are expected to differ in terms of density.

Algorithm 1

Face images belonging to the same individual should be clustered together. Informative faces with near-frontal views and normal illumination are found in higher-density regions, while low-quality faces with complex expressions and extreme lighting conditions appear near the lower-density cluster boundaries. Thus, candidate pairs can be collected along a path that starts from one sample and ends at a density peak.

Most existing face clustering methods adopt density settings based on neighborhood distances or similarities [8, 41]. Given a sample \(f_i\), its density \(\rho ^{d}_{i}\) can be written as:

$$\begin{aligned} \rho ^{d}_{i} = \sum _{j \in \mathcal {N}_i}a_{ij} \end{aligned}$$
(6)

where \(a_{ij}\) is the similarity between \(f_i\) and its jth nearest neighbor. However, due to the complex distribution of faces, samples with higher density levels may be located at the boundaries of the clusters. As shown in Fig. 4(a), such samples lead to some undesirable connections between two clusters, which degrade the resulting clustering performance. To suppress the densities of the boundary samples, we propose a density fusion mechanism. In particular, a spherical density is added to adjust the initial value of \(\rho ^{d}_{i}\), which is written as follows:

$$\begin{aligned} \rho _{i}^{s} = \sum _{j \in \mathcal {N}_i}\mathbb {1}(a_{ij} \ge \theta _{\rho }) \end{aligned}$$
(7)

where \(\mathbb {1}(\cdot )\) denotes the indicator function and \(\theta _{\rho }\) is the threshold value used to identify neighbors with high similarities. The introduction of the spherical density is motivated by the fact that samples farther away from the cluster center often exhibit more significant neighborhood differences. Hence, the combined density for sample i is defined as:

$$\begin{aligned} \rho _i = \frac{\rho ^{d}_{i}}{2\max (\{\rho ^{d}_{i}\}_{i=1}^{N})} + \frac{\rho ^{s}_{i}}{2\max (\{\rho ^{s}_{i}\}_{i=1}^{N})} \end{aligned}$$
(8)

Fig. 4(b) illustrates the results obtained after applying the combined density; the main steps of this process are summarized in algorithm 1.
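Assuming the per-sample neighbor similarities sims and indices knn_idx from the k-NN search sketched earlier, the combined density of Eqs. (6)-(8) and a density-guided selection in the spirit of algorithm 1 can be written as follows; linking each sample to its most similar neighbor with a strictly higher combined density is an assumption about the exact selection rule.

```python
import numpy as np

def combined_density(sims: np.ndarray, theta_rho: float = 0.7) -> np.ndarray:
    """Combined density (Eq. (8)) from an (N, K) matrix of neighbor similarities."""
    rho_d = sims.sum(axis=1)                 # Eq. (6): similarity density
    rho_s = (sims >= theta_rho).sum(axis=1)  # Eq. (7): spherical density
    return rho_d / (2.0 * rho_d.max()) + rho_s / (2.0 * rho_s.max())

def select_candidate_pairs(knn_idx: np.ndarray, rho: np.ndarray):
    """Pair each sample with its most similar neighbor of strictly higher density."""
    pairs = []
    for i, nbrs in enumerate(knn_idx):       # neighbors are ordered by similarity
        for j in nbrs:
            if rho[j] > rho[i]:
                pairs.append((i, int(j)))
                break
    return pairs
```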

Construction of training and testing sets

Algorithm 2

Unlike graph-based face clustering methods, our method takes a pair of samples as the underlying processing unit, so it becomes crucial to identify pairs of data. As described in “Density-guided pair selection section”, a density-guided pair selection strategy is proposed for constructing training and testing sets.

As a result of algorithm 1, a set of candidate pairs can be constructed for model training. Specifically, pairs with the same label are regarded as positive pairs; otherwise, they are negative pairs. However, this splitting approach tends to cause an imbalance between the positive and negative pairs, leading to an excess of positive pairs. To address this issue, we select pairs of samples and their K-nearest neighbors to augment the training set and keep the ratio of positive and negative pairs at 1:1.
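One possible realization of this balancing step is sketched below: the density-guided candidate pairs are split by label, and the minority side is topped up with sample/k-NN pairs until the 1:1 ratio is reached. The random sampling order and the helper names are assumptions.

```python
import random

def build_balanced_training_pairs(selected_pairs, knn_idx, labels, seed=0):
    """Split candidate pairs into positives/negatives and augment the minority side
    with sample/k-NN pairs until the ratio is 1:1 (illustrative sketch)."""
    rng = random.Random(seed)
    pos = [(a, b) for a, b in selected_pairs if labels[a] == labels[b]]
    neg = [(a, b) for a, b in selected_pairs if labels[a] != labels[b]]

    # candidate sample/k-NN pairs used for augmentation
    candidates = [(i, int(j)) for i, nbrs in enumerate(knn_idx) for j in nbrs]
    rng.shuffle(candidates)
    for a, b in candidates:
        if len(pos) == len(neg):
            break
        if labels[a] == labels[b] and len(pos) < len(neg):
            pos.append((a, b))
        elif labels[a] != labels[b] and len(neg) < len(pos):
            neg.append((a, b))
    return pos, neg
```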

In addition, for the testing set, the experimental results suggest that the aforementioned pair selection strategy is capable of characterizing the structures of clusters. Therefore, to guarantee the efficiency of the inference process, the set of testing pairs is not augmented. The main steps of the proposed framework are summarized in algorithm 2.

Complexity analysis

The computational complexity of algorithm 2 mainly arises from the k-NN graph construction and pair selection steps. The k-NN search is the bottleneck of the algorithm, which has a complexity of \(O(n^2)\). With the approximate nearest search technique [39], the time complexity is reduced to \(O(n \log n)\). Since the size of the candidate set used for inference does not exceed the number of samples, the time cost of pair selection is O(n). Due to the pair augmentation process employed for training, the pair selection cost is O(nk), where \(k\ll n\) is the neighbor size for the pair search. Hence, the overall complexity is approximately \(O(n \log n)\).
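As an example of such an approximate search, a graph-based index (here HNSW via the faiss library, used purely as an illustration and not necessarily the technique of [39]) can replace the exact k-NN computation; since the features are L2-normalized, L2 ranking coincides with cosine ranking. The index parameters below are assumptions.

```python
import numpy as np
import faiss

feats32 = np.ascontiguousarray(features.astype("float32"))  # L2-normalized (N, D) features
index = faiss.IndexHNSWFlat(feats32.shape[1], 32)            # HNSW graph with 32 links per node
index.add(feats32)
_, knn_idx = index.search(feats32, K + 1)                    # approximate K+1 nearest neighbors
knn_idx = knn_idx[:, 1:]                                     # drop the query itself
```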

Experiments

Datasets

MS-Celeb-1M (MS1M) [9] is a widely used large-scale face dataset. It contains approximately 100K identities and 10M face images, with varying numbers of images associated with each identity. Following the protocol used in [19, 29, 36, 41], we clean the dataset according to the annotations from Arcface [6], producing approximately 86K identities and 5.82M face images. Then, we divide the dataset into 10 equal parts and select the first part for training and the rest for testing. In particular, the five testing sets are constructed by selecting 1, 3, 5, 7 and 9 parts, and the numbers of images are 584K, 1.74M, 2.89M, 4.05M, and 5.21M, respectively. Additionally, to verify the feasibility of the proposed method, we test it on the DeepFashion [21] dataset for fashion clustering. We follow the settings in [41], where the training set includes about 26K images and 4K categories, and the testing set includes about 27K images and 4K categories.

Evaluation metrics

Two common metrics are used to assess the performance of clustering algorithms, namely, the Pairwise F-score (\(F_P\)) [30] and BCubed F-score (\(F_B\)) [1], which are harmonic means of precision and recall.

The Pairwise F-score is calculated based on sample pairs. The pairwise precision indicates the proportion of sample pairs that are correctly predicted among all pairs predicted to belong to the same class, which is written as:

$$\begin{aligned} Pairwise\;Precision = \frac{TP}{TP + FP} \end{aligned}$$
(9)

where TP and FP are the abbreviations of true-positive pairs and false-positive pairs, respectively. Similarly, the pairwise recall is written as:

$$\begin{aligned} Pairwise\;Recall = \frac{TP}{TP + FN} \end{aligned}$$
(10)

where FN denotes false-negative pairs.

The BCubed F-score measures the difference between the ground-truth labels and the clustering results. Let G(i) and P(i) denote the sets of samples that have the same annotation and cluster assignment as sample i, respectively, and let C(i, j) indicate the consistency of samples i and j, which is formulated as:

$$\begin{aligned} C(i, j) = \left\{ \begin{array}{ll} 1, &{}G(i)=G(j)\;and\;P(i)=P(j) \\ 0, &{}otherwise \end{array} \right. \end{aligned}$$
(11)

The precision and recall are defined as

$$\begin{aligned} BCubed Precision = \frac{1}{N}\sum _{i=1}^{N}\sum _{j \in P(i)} \frac{C(i, j)}{|P(i) |} \end{aligned}$$
(12)

and

$$\begin{aligned} BCubed Recall = \frac{1}{N}\sum _{i=1}^{N}\sum _{j \in G(i)}\frac{C(i, j)}{|G(i) |} \end{aligned}$$
(13)

where \(|G(i) |\) and \(|P(i) |\) denote the sample sizes of sets G(i) and P(i), respectively. The following formula calculates the F-score for both metrics:

$$\begin{aligned} F\text {-}score = \frac{2 \times Precision \times Recall}{Precision + Recall} \end{aligned}$$
(14)
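For reference, a direct implementation of Eqs. (11)-(14) is sketched below; it computes the BCubed precision and recall from contingency counts instead of explicit pairwise loops.

```python
from collections import Counter

def bcubed_f_score(gt_labels, pred_labels):
    """BCubed precision, recall and F-score (Eqs. (11)-(14))."""
    n = len(gt_labels)
    g_sizes = Counter(gt_labels)                  # |G(i)| for each ground-truth class
    p_sizes = Counter(pred_labels)                # |P(i)| for each predicted cluster
    joint = Counter(zip(gt_labels, pred_labels))  # samples sharing both labels

    precision = sum(joint[(g, p)] / p_sizes[p] for g, p in zip(gt_labels, pred_labels)) / n
    recall = sum(joint[(g, p)] / g_sizes[g] for g, p in zip(gt_labels, pred_labels)) / n
    return precision, recall, 2 * precision * recall / (precision + recall)

# Example: bcubed_f_score([0, 0, 1, 1], [0, 0, 0, 1]) -> approximately (0.667, 0.75, 0.706)
```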

Implementation details

In this study, we follow the settings in [19, 29, 36, 40, 41] and utilize Arcface [6] as a base feature extractor to obtain 256-dimensional input features. We also use the k-NN algorithm [5] to search K neighbors for each sample, where K is set to 80 and 20 for the MS1M dataset and the DeepFashion dataset, respectively. For the construction of the combined density, we set \(\theta _{\rho }=0.7\) for the MS1M dataset and \(\theta _{\rho }=0.9\) for the DeepFashion dataset to properly increase the density gap between the cluster center and boundary. During the training phase, we use the cross-entropy loss to optimize the network. The stochastic gradient descent (SGD) optimizer is used with an initial learning rate of 0.01, a momentum of 0.9 and a weight decay of 1e-4. The batch size is set to 512, and the training process ends after 100 epochs.
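Expressed in PyTorch for concreteness, the training configuration above corresponds to the following snippet; the pair classifier architecture shown here is a placeholder, not the exact network used in our experiments.

```python
import torch
import torch.nn as nn

# placeholder binary classifier over the 4D-dimensional pair features of Eq. (1), with D = 256
classifier = nn.Sequential(nn.Linear(4 * 256, 512), nn.GELU(), nn.Linear(512, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
# batch size 512, 100 training epochs, as described above
```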

Method comparison

To demonstrate the validity of our approach, we compare SEPFL with a series of clustering baselines, including six traditional clustering methods and seven deep learning-based methods. A brief description of each algorithm is given below.

  • K-Means [22]: The most commonly used clustering method, K-means produces clustering results with a predefined number of clusters.

  • HAC [31]: Hierarchical clustering is a bottom-up approach that iteratively merges samples through various distance metrics.

  • DBSCAN [7]: DBSCAN is a density-based method that has demonstrated advantages in handling data with complex distributions.

  • MeanShift [4]: Samples that converge to the same local density maximum constitute a resulting cluster.

  • Spectral Clustering [12]: The similarity matrix of the data is eigendecomposed, and the samples are clustered according to the resulting eigenvectors.

  • ARO [26]: A new metric is proposed to achieve improved rank-order clustering [47].

  • CDP [42]: This is a graph-based clustering method. By fusing the information of a base model and several committee models, more robust pairwise relationships can be obtained.

  • L-GCN [37]: L-GCN is a supervised clustering algorithm that uses a GCN to learn sample structure information for connection prediction.

  • LTC [40]: This is a two-stage clustering method. The input data are processed using the ideas of detection and segmentation, respectively.

  • GCN(V+E) [41]: This is a two-stage clustering method. The whole constructed graph is fed into a GCN to predict the confidence levels of the samples and construct subgraphs. After that, the noisy points in the subgraphs are identified and removed.

  • CFPC [19]: This is a pairwise learning-based clustering method. CFPC assists the pairwise learning process by fusing neighbor sample information to obtain structural information, differing from the GCN-based approaches.

  • STAR-FC [29]: This method suggests a structure-preserving sampling strategy to build a subgraph that preserves enough structural information to make training on tens of millions of face data possible.

  • Ada-NETS [36]: Ada-NETS solves the problem of introducing too many noisy edges when constructing graph-structured data via the adaptive neighbor discovery method.

Table 3 Comparison among the face clustering results obtained with different numbers of unlabeled images from the MS1M dataset

Result comparison

Table 3 shows the results obtained by our method and other clustering baselines, including four conventional clustering methods and seven supervised clustering methods, on the MS1M dataset. It can be seen that the proposed method obtains the best results on all five testing sets, which possess varying sizes. Compared to the second-best approach, i.e., Ada-NETS, the proposed SEPFL method achieves greater performance gains as the dataset volume increases. For instance, SEPFL outperforms Ada-NETS by 0.55% and 0.29% on the 584K subset in terms of \(F_P\) and \(F_B\), and when the data size increases to 5.21M, the performance gaps are enlarged to 2.28% and 1.77%, respectively. These results suggest that our method is able to learn more generalized representations for clustering. In addition, by introducing the neighborhood learning strategy, SEPFL consistently outperforms CFPC, which also works at the pair level. Note that among the four conventional clustering methods, only K-means achieves competitive results, because the ground-truth number of clusters is given in advance.

To demonstrate the applicability of our model to nonface images, we conduct experiments on the DeepFashion dataset. As shown in Table 4, our approach produces the best results. In terms of the two utilized metrics, SEPFL is ahead of Ada-NETS by 2.77% and 1.76%, respectively, which indicates the generalization ability of the proposed model on clustering tasks.

Table 4 Comparison results obtained on DeepFashion

Figure 5 compares the efficiency and accuracy of our method with those of several clustering baselines on the MS1M dataset (part 1). Since ARO does not obtain competitive performance (as shown in Table 3), its results are not included. The results indicate that our method achieves the highest accuracy while remaining time-efficient, with a faster inference time than most of the compared clustering methods.

Parameter analysis

In this section, we explore the effects of different numbers of neighbors K and density thresholds \(\theta \) on the clustering results.

Influence of the number of neighbors k

In our model, the number of neighbors k is mainly used to establish the regions for density calculation and pair selection. For simplicity, we set the same k for both operations. To investigate the influence of different values of k on the clustering performance, we increase the parameter from 20 to 80 with a step size of 10, and the results are shown in Fig. 6. We observe similar trends in the metrics on all testing partitions. More candidate pairs are included as k increases, resulting in a higher recall; however, a larger k also brings more false-positive pairs, which decreases the precision. In short, the performance is stable when k is greater than 30, demonstrating that our model is not sensitive to k.

Influence of the density threshold \(\theta \)

As defined in Eq. (8), the proposed combined density \(\rho \) is the average of the similarity density \(\rho ^d\) and the spherical density \(\rho ^s\). The parameter \(\theta \) is employed to adjust the importance of \(\rho ^d\) to \(\rho \). Figure 7 presents the influences of \(\theta \) on the MS1M and DeepFashion datasets. Apparently, the performance of the proposed approach can be improved when \(\theta \) is set within a certain range (e.g., from 0.6 to 0.75 on the MS1M dataset). This is because as \(\theta \) tends to 0 or 1, \(\rho ^s\) approximates a constant, resulting in the degradation from \(\rho \) to \(\rho ^d\). Moreover, due to the differences in the distributions and sizes of datasets, a larger \(\theta \) is required to activate \(\rho ^s\) on DeepFashion than on MS1M.

Fig. 5 Comparison of efficiency and accuracy between our method and other clustering methods

Fig. 6 Comparison of the \(F_P\) and \(F_B\) results obtained with different values of K from the MS1M dataset

Influence of postprocessing thresholds

In the “Structure-enhanced pairwise feature learning” section, we argue that some manually set postprocessing thresholds may limit the applicability of existing clustering methods to different datasets and scenarios. To verify the sensitivity of the clustering results to the threshold values, we take STAR-FC [29] as an example, which includes two postprocessing thresholds, i.e., \(\tau _1\) and \(\tau _2\). Specifically, \(\tau _1\) represents the threshold for removing weak edges, and \(\tau _2\) represents the threshold for removing edges with low node intimacy. For comparison, we first adjust them to achieve the best performance, then fix one of them and change the value of the other. The experimental results are illustrated in Fig. 8.

It can be observed that (1) the clustering results are sensitive to the threshold values; (2) dataset-specific thresholds should be established due to the distribution differences between datasets; and (3) the two metrics (\(F_P\) and \(F_B\)) show different trends under the same group of thresholds. As a result, these inconsistencies increase the difficulty of setting thresholds. Since our method requires no postprocessing threshold, it provides better robustness and stability.

Fig. 7 Effects of different density thresholds on the clustering results

Fig. 8 Effects of different postprocessing thresholds on the clustering results. \(\tau _1\) and \(\tau _2\) are two edge-cutting thresholds employed in STAR-FC [29]. Our results are shown as two lines

Ablation study

In this section, we further analyze the effectiveness of our algorithm through a large number of ablation experiments on the MS1M dataset.

Design of pairwise features

We explore different designs of pairwise features for clustering purposes. According to Eq. (1), we can formulate a pair representation by directly concatenating the features of two faces without any neighborhood information; this design is denoted as simple concatenation. To investigate the benefits of local relations with respect to pairwise descriptors, we compare several neighborhood aggregation strategies. Mean aggregation computes the average of the neighbors [37]. Similarity aggregation obtains the weighted sum of the neighbors based on cosine similarity [41]. As discussed in the “Neighborhood mixing module” section, the proposed SEPFL approach employs a neighborhood mixing scheme for weight learning.

Fig. 9 Comparison of the results obtained with different feature designs in terms of the \(F_P\) and \(F_B\) evaluation metrics

Fig. 10 Comparison of different weighting strategies. a Given a face sample, we list the top ten most similar samples based on the calculated weights (a larger weight indicates a higher similarity value). The faces with the same identity are outlined by green boxes, and the rest are outlined by red boxes. b We plot the weight distributions of two weighting strategies. As we can see, our method can generate more discriminative weights to reduce the influence of noisy or unreliable face images

Table 5 Comparison among the results obtained with different densities
Fig. 11 Visualization of the feature distributions of two types of densities

As shown in Fig. 9, the performances are significantly improved when the contextual properties are considered. In addition, these aggregation strategies differ mainly in their weighting procedures. Mean aggregation assigns the same weight to each neighbor. Since this setting does not reveal the importance levels of the neighbors, its performance is inferior to that of similarity-weighted aggregation. Our model enhances the exchange of information within the neighborhood for weight learning and thus achieves the best results.

Furthermore, to evaluate the quality of the weights produced by different weighting strategies, we illustrate the weights assigned to the neighbors of a given sample in Fig. 10. The results obtained based on cosine similarity weighting are included for comparison. As shown in Fig. 10(a), after ranking the neighbors by their weights, the similarity weighting method yields two false-positive samples (outlined by red boxes). In contrast, SEPFL generates more robust weight assignments. Additionally, Fig. 10(b) shows the distributions of the weight values of the two weighting schemes. The results suggest that our model can assign more discriminative weights to the neighborhood samples, by which the importance of irrelevant samples is effectively suppressed.

Design of the combined density

As discussed in “Density-guided pair selection section”, we combine two densities for pair selection. Based on a fixed-size neighborhood centered at a sample \(f_i\), the first density is computed as the sum of the similarities to \(f_i\), which is called the similarity density and defined by Eq. (6), and the second density is derived from the number of samples whose similarities to \(f_i\) are higher than a threshold, which is called the spherical density and defined by Eq. (7). Table 5 shows the effects of different density settings. It can be seen that the spherical density performs the worst since it outputs discrete values. However, it can improve the similarity density by increasing the density gap between the cluster center and boundary samples. This enables the combined density to achieve the best results.

In addition, Fig. 11 visualizes the feature distribution of 6 identities sampled from the MS1M dataset using t-distributed stochastic neighbor embedding (t-SNE) [23]. As shown in Fig. 11(a), in terms of the similarity density, the samples in the dense regions share relatively high densities and therefore may result in excessive low-information pairs. We present the updated densities in Fig. 11(b). It is clearly seen that the density variations within clusters are appropriately enlarged. Hence, the combined density can effectively improve the quality of the pairs output by algorithm 1.

Influence of the number of pairs

Unlike graph-based face clustering methods, our method uses a sample pair as the underlying processing unit, so it is crucial to identify data pairs. As described in “Construction of training and testing sets”, a density-guided pair selection strategy is proposed for constructing training and testing sets. Basically, each sample serves as the starting point for the construction of at least one pair. However, the resulting imbalance between positive and negative pairs may cause training bias, and insufficient data lead to overfitting. To solve these issues, we adjust the numbers of positive and negative pairs to be equal by adding random pairs. To investigate how the number of pairs affects the clustering performance, we vary this number over a wide range on the MS1M dataset.

For the training phase, we include the results obtained without the balancing procedure for comparison and report the F-scores in Fig. 12. The experimental results reveal that (1) both the \(F_P\) and \(F_B\) metrics gradually improve as the training size increases, and once the performance saturates, adding more training data does not yield better results; (2) the balance between positive and negative pairs is essential for achieving performance improvements; and (3) this imbalance also affects the generalizability of the model; for example, as the volume of the testing set increases (i.e., from part 1 to part 9), the performance loss is enlarged.

Fig. 12 Comparison among the F-scores obtained under different numbers of training pairs from the MS1M dataset. \(F_B\) (b) or \(F_B\) (wb) stands for the results obtained with or without the balancing procedure (the \(F_P\) metric is denoted in the same way)

Fig. 13 Comparison among the BCubed F-scores obtained under different numbers of testing pairs from the MS1M dataset

Given testing pairs with a size of \(N'\), we begin with a random selection of samples to construct the corresponding pairs. The process continues until each sample has been selected once as the starting point of a pair. After that, the size of the candidate set for testing is further increased to 2.0 million by adding the remaining pairs. A baseline, in which equal numbers of positive and negative pairs are randomly selected from all possible pairs, is utilized for comparison purposes. As summarized in Fig. 13, we observe that (1) the density-guided pair selection method consistently outperforms its counterpart based on random selection as the number of pairs increases, and (2) its performance peaks when the number of pairs reaches \(N'\) and deteriorates thereafter. The reason for this is that our method can identify sufficiently representative pairs, while adding extra noisy pairs degrades the achieved performance. According to these observations, the proposed strategy provides a simple mechanism for pair selection, yielding improved clustering accuracy and inference efficiency.

Statistical testing

In this section, we investigate whether the difference between SEPFL and other clustering methods is statistically significant using the Wilcoxon signed-rank test [28], a non-parametric statistical hypothesis test. The null hypothesis \(H_0\) in this experiment states that there is no significant difference between the two compared methods, while the alternative hypothesis \(H_1\) states the opposite. The p-value is used to determine whether the null hypothesis holds; if the p-value is less than the significance level \(\alpha \), the null hypothesis is rejected.
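As an illustration of the test procedure (the scores below are placeholder values, not results from our experiments):

```python
from scipy.stats import wilcoxon

# F-scores from 10 repeated runs of two methods (placeholder values)
sepfl_scores    = [93.1, 92.8, 93.0, 92.9, 93.2, 93.0, 92.7, 93.1, 92.9, 93.0]
baseline_scores = [91.9, 91.7, 92.0, 91.8, 92.1, 91.9, 91.6, 92.0, 91.8, 91.9]

stat, p_value = wilcoxon(sepfl_scores, baseline_scores)
print(f"p-value = {p_value:.4f}")  # reject H0 at alpha = 0.05 when p_value < 0.05
```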

As shown in Tables 3 and 4, there are large gaps between the traditional clustering methods and our method in terms of the two clustering metrics. Therefore, we choose to compare our method with two deep learning-based methods, namely STAR-FC [29] and Ada-NETS [36], which achieve competitive results among the compared methods.

Table 6 Statistical analysis results of Wilcoxon signed-rank test

To ensure the robustness of the results, we repeat the experiments 10 times using different random seeds on each dataset, and the results are shown in Table 6. It can be seen that the obtained p-values are all less than the common significance level of 0.05, indicating that our method differs significantly from the compared clustering methods.

Conclusion

A GCN effectively obtains sample structure information by aggregating neighboring features but incurs substantial memory and time costs. This paper proposes a novel pairwise learning method for face clustering, termed SEPFL. In particular, we design a neighborhood mixing block that learns the correlations between samples and their neighbors to weight the aggregation of neighborhood features as structural information. Unlike other methods, the neighborhood mixing block considers both the relationships between samples and neighbors and the relationships among neighbors, thereby learning more comprehensive structural information. In addition, a density-guided pair selection strategy is used to select candidate pairs, which avoids the influence of excessive redundant pairs on the clustering results.

We conduct extensive experiments on the MS1M and DeepFashion datasets. The experimental analysis demonstrates that (1) SEPFL reduces the computational cost and alleviates the dependence on thresholds; (2) the neighborhood mixing block has a powerful ability to capture structural information; (3) the density-guided pair selection strategy is capable of selecting representative candidate pairs; and (4) SEPFL achieves higher accuracy than other advanced face clustering methods and exhibits better robustness as the amount of data increases.

Although SEPFL achieves good performance in various experiments, it performs clustering in the original feature space and is therefore limited by the utilized feature extraction model. Problems such as complex sample distributions and large numbers of noisy or difficult samples may be encountered, which can substantially degrade the clustering results. In future work, we will explore an efficient feature learning method to transform the given data into a more easily distinguishable feature space to assist with data clustering.