
1 Introduction

Face recognition is a classical computer vision task [13, 21, 34] that aims to infer person identities from face images. Scaling it up with deeper models relies on ever more annotated data. Face clustering is a popular and efficient way to reduce the annotation cost [5, 15, 16, 27].

Problems. Face clustering is challenging because 1) recognizing person identities is a fine-grained task; 2) the number of identities is always large, e.g., 77k on the MS1M 5.21M dataset [8]; and 3) the derived face clusters often vary greatly in both size and sparsity, and small or sparse clusters, which we call hard clusters, are hard to identify. Figure 1(a) and (b) show the distributions of ground-truth clusters on the MS1M 5.21M dataset. For Fig. 1(c) and (d), we first group these clusters into five subsets based on a fixed ranking of size and sparsity, respectively, and then evaluate three top-performing methods and ours on each subset. It is clear that the performance drops significantly for hard clusters, e.g., in subsets sz-5 and sp-5, particularly in terms of recall (see Fig. 2). We think the reason is two-fold: 1) small clusters are overtaken by large ones; 2) samples of sparse clusters are wrongly taken as lying in low-density regions, i.e., on the boundaries between dense clusters.

Fig. 1.

(a) and (b) show the ground-truth distributions of the face identity clusters on the MS1M 5.21M dataset [8]. (a) is for cluster size, i.e., the number of samples in a cluster. (b) is for cluster sparsity, defined as the average cosine distance between all pairs of samples in a cluster, e.g., for cluster C we have \(\text {Sparsity}(C) = 1 - \frac{\sum _{ij\in C, i\ne j}\text {cosine}{<}x_{i},x_{j}{>}}{|C|(|C|-1)}\), where |C| denotes the cluster size. (c) and (d) show the performances (Pairwise F-score) of three top-performing methods, GCN(V+E) [31], Pair-Cls [14], and STAR-FC [23], compared to ours, on five cluster subsets with descending size (from sz-1 to sz-5) and ascending sparsity (from sp-1 to sp-5), respectively.

Fig. 2.

Pairwise precision and recall of the three baselines, elaborating the results in Fig. 1(c) and (d). The recall on hard cluster subsets drops significantly.

We elaborate on these issues based on Density Peak Clustering (DPC) [22], which has shown impressive effectiveness in state-of-the-art face clustering works [14, 31]. DPC requires a point-wise density and a pair-wise distance to derive clustering results. The density is usually defined as the number of neighbor points covered by an \(\epsilon \)-ball around each point [4], and the distance is the standard cosine distance. We find that both density and distance are highly influenced by the size and sparsity of latent clusters in face data. For example, 1) smaller clusters tend to have lower density, as shown in Fig. 3(a), so they could be absorbed into big ones by DPC, and 2) to identify positive pairs, higher-sparsity (lower-sparsity) clusters prefer a higher (lower) distance threshold, as indicated in Fig. 3(b), so it is hard to determine a uniform threshold for DPC.

Our Solution. Our clustering framework is based on DPC, and we aim to solve the above issues by introducing new definitions of point-wise density and pair-wise distance. We propose a probabilistic method to derive a size-invariant density called Neighborhood-Diffusion-based Density (NDDe), and a sparsity-aware distance called Transition-Probability-based Distance (TPDi). Applying DPC with NDDe and TPDi can mitigate hard clusters and yield efficient face clustering with a simple and uniform threshold.

Fig. 3.

(a) The average standard density of clusters in each subset. (b) The probability density function on each subset with respect to the positive pairs. "Pos" stands for "Positive" and "Neg" for "Negative".

We first build a transition matrix where each row contains the normalized similarities (predicted by a pre-trained model as in related works [14, 23, 29, 31]) between a point and its K-nearest neighbors; each row is thus the transition probability vector from a point to the others. Then, for NDDe, we specify a diffusion process on the matrix by 1) initializing a uniform density for each point, and 2) distributing the density to its K-nearest neighbors, with strength proportional to the transition probability, until convergence. The derived NDDe is invariant to the cluster size and thus free from the issue of small clusters. We provide the theoretical justification and empirical validation in Sect. 4.2. For TPDi, we define a relative closeness that equals the inner product between two points' transition probability vectors (corresponding to two rows of the transition matrix). We assume two points are close if they have similar transition probabilities to their common neighbors. Our TPDi yields more uniform sparsity within clusters than conventional distances such as cosine or Euclidean, and is thus free from the issue of sparse clusters. Our justification and validation are in Sect. 4.3.

Our main contributions are threefold. 1) We inspect the face clustering problem and find that existing methods fail to identify hard clusters, yielding significantly lower recall for small or sparse clusters. 2) To mitigate the issue of small clusters, we introduce NDDe based on the diffusion of neighborhood densities. 3) To mitigate the issue of sparse clusters, we propose the relative distance TPDi, which facilitates a uniform sparsity across different clusters. In experiments, we evaluate NDDe and TPDi on large-scale benchmarks and incorporate them into multiple baselines to show their effectiveness.

2 Related Work

Face clustering has been extensively studied as an important task in the field of machine learning. Existing methods can be broadly divided into traditional methods and learning-based methods.

Traditional Methods. Traditional methods include K-means [18], HAC [24], DBSCAN [6] and ARO [20]. These methods directly perform clustering on the extracted features without any supervision; they are simple but have obvious limitations. K-means [18] assumes the cluster shape is convex, and DBSCAN [6] assumes that the compactness of different clusters is homogeneous. The performances of these two methods are limited since both assumptions are impractical for face data. Moreover, the scale of unlabeled face images is usually large, and from this perspective, traditional methods are inefficient and thus unsuitable for the face clustering task. The computational cost of HAC [24] is unacceptable when handling millions of samples. To achieve better scalability, Otto et al. [20] proposed ARO, which uses an approximate rank-order similarity metric for clustering, but its performance is still far from satisfactory.

Learning-Based Methods. To improve the clustering performance, recent works [7, 14, 23, 28, 29, 31,32,33] adopt a learning-based paradigm. Specifically, they first train a clustering model on a small part of the data in a supervised manner and then test its performance on the rest of the data. CDP [33] proposed to aggregate the features extracted by different models, but the ensemble strategy results in a much higher computational cost. L-GCN [29] first uses Graph Convolutional Networks (GCNs) [12] to predict the linkage in an instance pivot subgraph, and then extracts the connected components as clusters. LTC [32] and GCN(V+E) [31] both adopt two-stage GCNs for clustering on the whole K-NN graph. Specifically, LTC generates a series of subgraphs as proposals and detects face clusters thereon, and GCN(V+E) learns both confidence and connectivity via GCNs. To address the low efficiency of GCNs, STAR-FC [23] proposed a local graph learning strategy to simultaneously tackle the challenges of large-scale training and efficient inference. To address the noisy connections in the K-NN graph constructed in feature space, Ada-NETS [28] proposed an adaptive neighbor discovery strategy to build clean graphs for GCNs. Although GCN-based methods have achieved significant improvements, they only use shallow GCNs, which lack high-order connectivity information, and their efficiency remains a problem. Pair-Cls [14] proposed to use pairwise classification instead of GCNs to reduce memory consumption and inference time. Clusformer [19] proposed an automatic visual clustering method based on the Transformer [25].

In general, existing learning-based methods have achieved significant improvements by focusing on developing deep models to learn better representations or pair-wise similarities, but they fail to identify and address the aforementioned hard-cluster issues. In this paper, we explore the face clustering task from a new perspective. Based on DPC [22], we propose a size-invariant point-wise density, NDDe, and a sparsity-aware pair-wise distance, TPDi, which can be incorporated into multiple existing methods for better clustering performance, especially on hard clusters.

3 Preliminaries

Problem Formulation. Given N unlabelled face images with feature points \(\boldsymbol{X} = \{x_{1}, x_{2}, \cdots , x_{N}\}\), which are extracted by deep face recognition models, face clustering aims to separate these points into disjoint groups \(\boldsymbol{X} = \boldsymbol{X}_{1}\bigcup \boldsymbol{X}_{2}\bigcup \cdots \bigcup \boldsymbol{X}_{m}\), such that points with the same identity tend to be in the same group, while points with different identities tend to be in different groups.

Data Preprocessing. Following the general process of the learning-based face clustering paradigm, the dataset \(\boldsymbol{X}\) is split into a training set and a test set, \(\boldsymbol{X} = \boldsymbol{X}_{\text {train}}\bigcup \boldsymbol{X}_{\text {test}}\). For a specific learning-based face clustering method, a clustering model is first trained on \(\boldsymbol{X}_{\text {train}}\) in a supervised manner, and then the clustering performance is tested on \(\boldsymbol{X}_{\text {test}}\). Without loss of generality, we always denote the features and labels as \(\boldsymbol{X} = \{x_{1}, x_{2}, \cdots , x_{N}\}\) and \(l = \{l_{1}, l_{2}, \cdots , l_{N}\}\), respectively, for both the training and test stages.

Density Peak Clustering (DPC). DPC [22], adopted by several state-of-the-art face clustering methods [14, 31], identifies implicit cluster centers and assigns the remaining points to these clusters by connecting each of them to a nearby point of higher density. In this paper, we also adopt DPC as the clustering algorithm. Given a point-wise density \(\rho = \{\rho _{1},\rho _{2},\cdots , \rho _{N}\}\) and a pair-wise distance \((d_{ij})_{N\times N}\), for each point i, DPC first finds its nearest neighbor whose density is higher than its own, i.e.,

$$\hat{j} = \text {argmin}_{\{j | \rho _{j} > \rho _{i}\}}d_{ij},$$

If \(\hat{j}\) exists and \(d_{i\hat{j}} < \tau \), DPC connects i to \(\hat{j}\), where \(\tau \) is a connecting threshold. In this way, the connected points form many separate trees, and each tree corresponds to a final cluster. Note that \(\tau \) is uniform for all clusters, so a consistent point-wise density \(\rho \) and pair-wise distance \((d_{ij})_{N\times N}\) are essential for the success of DPC. To solve the hard-cluster issues, we propose a size-invariant density called Neighborhood-Diffusion-based Density (NDDe) and a sparsity-aware distance called Transition-Probability-based Distance (TPDi) for better \(\rho \) and \((d_{ij})_{N\times N}\).
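To make the assignment rule concrete, below is a minimal sketch of this step in Python. The brute-force \(O(N^2)\) neighbor search, the function names, and the tree traversal are illustrative assumptions; an actual implementation would restrict the search to the K-nearest neighbors.

```python
# Minimal sketch of the DPC assignment step, assuming densities `rho` (length-N
# array) and a pairwise distance callable `dist(i, j)` are already available.
import numpy as np

def dpc_assign(rho, dist, tau):
    """Connect each point to its nearest higher-density neighbor within tau."""
    N = len(rho)
    parent = np.arange(N)                      # each point starts as its own root
    for i in range(N):
        higher = np.flatnonzero(rho > rho[i])  # candidates j with rho_j > rho_i
        if higher.size == 0:
            continue                           # global density peak: stays a root
        d = np.array([dist(i, j) for j in higher])
        if d.min() < tau:
            parent[i] = higher[np.argmin(d)]   # connect i -> j_hat
    labels = np.empty(N, dtype=int)            # root of each tree = cluster label
    for i in range(N):
        r = i
        while parent[r] != r:
            r = parent[r]
        labels[i] = r
    return labels
```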

4 Method

Figure 4 shows the overall framework consisting of four steps. First, we construct a transition matrix by learning the refined similarities between each point and its K-nearest neighbors using a model consisting of a feature encoder \(\mathcal {F}\) and a Multi-Layer Perceptron (MLP). The second step uses our first novel module: computing Neighborhood-Diffusion-based Density (NDDe) by diffusing point-wise density on the neighboring transition matrix, which is invariant to cluster size. The third step is our second novel module: computing Transition-Probability-based Distance (TPDi) by introducing a relative closeness, which is aware of cluster sparsity. Fourth, we directly apply DPC with NDDe and TPDi to derive the final clustering result.

Fig. 4.

Overview. Our method consists of four steps: (1) Constructing transition matrix \(\boldsymbol{P}\). Feature encoder \(\mathcal {F}\) is for feature refinement, which can be a Transformer, GCNs, or a feature aggregation module; after that, an MLP is used to predict the similarity between each anchor point and its neighbors. (2a) Computing NDDe for each point through a diffusion process. (2b) Computing TPDi to measure the distance between points. (3) Applying DPC with NDDe and TPDi to obtain the final clustering result.

4.1 Constructing Transition Matrix

The standard way of constructing the transition matrix is to compute the similarity between the deep features of pair-wise samples. The “similarity” can be the conventional cosine similarity or the learned similarity in more recent works such as [14, 23, 31]. To reduce the memory consumption of using GCNs [23, 31] for similarity learning, Pair-Cls [14] simply learns the similarity via pair-wise classification, i.e., by deciding whether two points share the same identity. However, in Pair-Cls, all pairs are treated independently during training. We argue that the similarity between a point and one of its neighbors usually depends on the similarities between the point and its other neighbors. Therefore, in our work, we adopt the same pair-wise classification, i.e., using an MLP to predict the similarity, but in addition we leverage a collaborative prediction manner that considers the similarities between each point (as an anchor) and its neighbors as a whole to improve the robustness of the prediction, similar to [14, 19].

Here, we present a general formulation. For a sample point i, we first find its K-nearest neighbors, denoted as \(\text {nbr}_{i} = \{i_{1},\cdots , i_{K}\}\), and then generate the following token sequence:

$$\tilde{x}_{i} = [x_{i}, x_{i_{1}}, x_{i_{2}}, \cdots , x_{i_{K}}].$$

Our similarity prediction model first takes \(\tilde{x}_{i}\) as input, and outputs \(K+1\) features after feature encoder \(\mathcal {F}\):

$$\{t_{i}, t_{i_{1}}, \cdots , t_{i_{K}}\} = \mathcal {F}(\tilde{x}_{i}),$$

where \(\mathcal {F}\) can be a Transformer [25], GCNs [23, 31] or a simple feature aggregation module [14] (which aggregates the neighbors' features and concatenates them to the anchor's feature). Then, for each neighbor \(i_{j}\), \(j = 1, ..., K\), \(t_{i}\) is concatenated with \(t_{i_{j}}\) and fed into an MLP with a Sigmoid function to estimate the probability that i and \(i_{j}\) share the same identity:

$$p_{ij} = \text {MLP}([t_{i}, t_{i_{j}}]).$$

Let \(l_{ij}\) denote the ground-truth pair label: \(l_{ij}=1\) if \(l_{i} = l_{i_{j}}\) and \(l_{ij}=0\) otherwise. The total loss function is formulated as:

$$\begin{aligned} \mathcal {L} = -\sum _{i=1}^{N}\sum _{j=1}^{K}(l_{ij}\log p_{ij} + (1-l_{ij})\log (1- p_{ij})). \end{aligned}$$
(1)

Once the model converges, its predicted similarity takes the anchor's feature as well as its neighbors' features into consideration. Then, we can derive the similarity matrix \(\boldsymbol{\hat{S}}_{N\times N}\) by applying this model to the test set.
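As a concrete illustration, below is a minimal PyTorch-style sketch of such a similarity prediction model, using a single Transformer encoder layer as \(\mathcal {F}\) and the MLP sizes reported in Sect. 5.1; the class name and the exact wiring are assumptions for illustration rather than the exact training code.

```python
# Minimal sketch of the similarity prediction model (Sect. 4.1), assuming 256-d
# input features; hyper-parameters follow Sect. 5.1, the rest is illustrative.
import torch
import torch.nn as nn

class SimilarityPredictor(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Feature encoder F: jointly refines the anchor and its K neighbors.
        self.encoder = nn.TransformerEncoderLayer(
            d_model=dim, nhead=8, dim_feedforward=2048,
            dropout=0.2, norm_first=True, batch_first=True)
        # MLP: maps a concatenated (anchor, neighbor) pair to a probability.
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid())

    def forward(self, x_tilde):
        # x_tilde: (B, K+1, dim); row 0 is the anchor x_i, rows 1..K its neighbors.
        t = self.encoder(x_tilde)                         # refined features
        anchor = t[:, :1, :].expand_as(t[:, 1:, :])       # repeat t_i per neighbor
        pairs = torch.cat([anchor, t[:, 1:, :]], dim=-1)  # [t_i, t_{i_j}]
        return self.mlp(pairs).squeeze(-1)                # (B, K) probabilities p_ij

# Training uses the binary cross-entropy of Eq. (1), e.g.
# loss = nn.BCELoss(reduction='sum')(p, l) with pair labels l in {0, 1}.
```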

Finally, we take \(d_{i} = \sum _{j=1}^{N}\hat{s}_{ij}\) as a measure of the volume around point i, and generate the probability transition matrix \(\boldsymbol{P}\) with each element \(p_{ij} = {\hat{s}_{ij}}/{d_{i}}\). The size of \(\boldsymbol{P}\) is \(N\times N\). Note that \(\boldsymbol{\hat{S}}\) is a sparse matrix where each row contains \(K+1\) non-zero elements (the point itself and its top-K nearest neighbors), so \(\boldsymbol{P}\) is also sparse. We highlight that the above approach is not the only way to construct the transition matrix \(\boldsymbol{P}\); we show the results of using other approaches to obtain \(\boldsymbol{P}\) in the experiments section.
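The row normalization itself is a one-liner once \(\boldsymbol{\hat{S}}\) is stored as a sparse matrix; the following sketch assumes a SciPy CSR representation (an implementation choice, not a requirement).

```python
# Minimal sketch: turn the sparse K-NN similarity matrix S_hat into the
# row-stochastic transition matrix P with p_ij = s_ij / d_i.
import numpy as np
import scipy.sparse as sp

def build_transition_matrix(S_hat):
    """S_hat: N x N sparse CSR matrix with K+1 non-zeros per row (self + top-K)."""
    d = np.asarray(S_hat.sum(axis=1)).ravel()      # volume d_i around each point
    D_inv = sp.diags(1.0 / np.maximum(d, 1e-12))   # guard against empty rows
    return (D_inv @ S_hat).tocsr()                 # rows of P sum to 1
```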

4.2 Neighborhood-Diffusion-Based Density

In this section, we propose a new definition of the point-wise density, called NDDe, to alleviate the issue of small clusters. In the transition matrix \(\boldsymbol{P}\), each element \(\boldsymbol{P}_{ij}\) denotes the transition probability from point i to its neighbor j. It satisfies the conservation property \(\sum _{j}\boldsymbol{P}_{ij} = 1\), which induces a Markov chain on \(\boldsymbol{X}\). Denoting by \(\boldsymbol{L} = \boldsymbol{I} - \boldsymbol{P}\) the normalized graph Laplacian, where \(\boldsymbol{I}\) is the identity matrix, we can specify a diffusion process as follows:

$$\begin{aligned} {\left\{ \begin{array}{ll} \frac{\partial }{\partial t} \rho _{i}(t) = -\boldsymbol{L} \rho _{i}(t), \\ \rho _{i}(0) = 1. \end{array}\right. } \end{aligned}$$
(2)

where \(\rho _{i}(t)\) is the density of point i at the t-th step. Starting from a uniformly initialized density, the diffusion process keeps distributing the density of each point to its K-nearest neighbors, following the corresponding transition probabilities in \(\boldsymbol{P}\), until it converges to a stationary distribution. The diffusion density can thus be defined as:

$$\begin{aligned} \rho _{i} = \lim _{t \rightarrow \infty } \rho _{i}(t). \end{aligned}$$
(3)
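In practice, Eq. 3 is approximated iteratively (see Sect. 4.4). Below is a minimal sketch of one such scheme; the mass-conserving update \(\rho \leftarrow \boldsymbol{P}^{\top }\rho \) (each point redistributes its density to its neighbors in proportion to \(\boldsymbol{P}\)) and the stopping rule are our reading of the description above, not the authors' exact implementation.

```python
# Minimal sketch of computing NDDe by iterating the diffusion until the density
# change falls below a tolerance eps (0.05 in Sect. 5.1).
import numpy as np

def diffusion_density(P, eps=0.05, max_iter=1000):
    """P: N x N sparse row-stochastic transition matrix; returns one density per point."""
    rho = np.ones(P.shape[0])                 # rho_i(0) = 1 for every point
    for _ in range(max_iter):
        rho_next = P.T @ rho                  # redistribute density along transitions
        if np.abs(rho_next - rho).max() < eps:
            return rho_next                   # converged within tolerance
        rho = rho_next
    return rho
```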

Justification of Local Properties of Diffusion Density. The diffusion process is local because each point transfers its density only to its K-nearest neighbors and itself (based on the transition matrix \(\boldsymbol{P}\)). Considering the ideal situation where \(\boldsymbol{P}\) is closed, i.e., \(p_{ij} > 0\) if and only if \(x_{i}\) and \(x_{j}\) share the same identity, we have the following theorem.

Theorem 1

Assume the dataset \(\boldsymbol{X}\) can be split into m disjoint clusters, i.e., \(\boldsymbol{X} = \boldsymbol{X}_{1}\bigcup ...\bigcup \boldsymbol{X}_{m}\). Define \(\bar{\rho }_{i} = \frac{\sum _{j\in \boldsymbol{X}_{i}}\rho _{j}}{|\boldsymbol{X}_{i}|}\) as the average density of \(\boldsymbol{X}_{i}\), where \(|\boldsymbol{X}_{i}|\) is the number of points in \(\boldsymbol{X}_{i}\). Then \(\bar{\rho }_{1} =\dots = \bar{\rho }_{m} = 1\).

Theorem 1 demonstrates that the average diffusion densities of all clusters are the same regardless of cluster size. In a dynamic sense, the diffusion process elevates the density of latent small clusters, and thus enables the DPC algorithm to identify density peaks in such clusters. To further demonstrate this claim, we divide the clusters in the MS1M 5.21M dataset into five subsets according to cluster size and calculate the average diffusion density for each subset. As shown in Fig. 5(a)(b), compared with the standard density, the average NDDe values of the different subsets are much more comparable.

Fig. 5.

(a) and (b) show the average values of the standard density and our NDDe on five cluster subsets from the MS1M 5.21M dataset, respectively. NDDe is shown to be more uniform, i.e., the issue of small clusters is alleviated. (c) and (d) show the probability density functions of the conventional cosine distance and TPDi on five cluster subsets from the MS1M 5.21M dataset, respectively. With TPDi, it is easier to decide a uniform threshold that separates positive and negative pairs in all subsets, including the sparsest one "sp-5".

4.3 Transition-Probability-Based Distance

In this section, we introduce our new definition of the pair-wise distance, called TPDi, to address the varying sparsity of latent face clusters. TPDi depicts the similarity between two points based on their respective transition probabilities (in \(\boldsymbol{P}\)) to their common neighbors. Let \(C_{ij}=\text {nbr}_{i}\cap \text {nbr}_{j}\) denote the common neighbors among the K-nearest neighbors of points i and j. The TPDi between them is defined as:

$$\begin{aligned} d_{ij} = 1 - \sum _{c\in C_{ij}}\sqrt{p_{ic}p_{jc}}. \end{aligned}$$
(4)

We highlight that TPDi has three desirable properties: (1) By the Cauchy-Schwarz inequality, we have \(\left( \sum _{c\in C_{ij}}\sqrt{p_{ic}p_{jc}}\right) ^2\le (\sum _{c\in C_{ij}}p_{ic})(\sum _{c\in C_{ij}}p_{jc})\le 1\), so it is easy to check that \(0\le d_{ij}\le 1\), which implies that \(d_{ij}\) is a valid distance. (2) \(d_{ij}= 0\) if and only if \(p_{ic} = p_{jc}\) for all \(c = 1,..., N\), which implies that \(d_{ij}\) is small when i and j share many common neighbors with similar transition probabilities; this is consistent with the motivation of TPDi. (3) Compared with the cosine distance, TPDi separates negative pairs from positive pairs better, regardless of cluster sparsity (Fig. 5(c)(d)), so it is easier to choose a uniform threshold for TPDi.

Remark 1. Considering the simple case where each point transits to its neighbors with equal probability \(\frac{1}{K}\), we have \(d_{ij} = 1-\frac{2\text {Jaccard}(i,j)}{1+\text {Jaccard}(i,j)}\), where \(\text {Jaccard}(i,j)\) is the Jaccard similarity [9]. This implies that TPDi is a generalization of the Jaccard distance, which further supports the soundness of TPDi.
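A small sketch of TPDi, given the sparse transition rows of two points as dictionaries mapping neighbor index to probability, together with a numerical check of the uniform-probability special case in Remark 1 (the variable names and the toy neighbor sets are illustrative):

```python
# Minimal sketch of TPDi (Eq. 4) and a check of the Jaccard special case.
import numpy as np

def tpdi(p_i, p_j):
    """d_ij = 1 - sum over common neighbors c of sqrt(p_ic * p_jc)."""
    common = set(p_i) & set(p_j)
    return 1.0 - sum(np.sqrt(p_i[c] * p_j[c]) for c in common)

# With uniform probabilities p_ic = p_jc = 1/K, TPDi reduces to 1 - |C_ij|/K,
# which equals 1 - 2J/(1+J) for the Jaccard similarity J of the two neighbor sets.
K = 80
nbr_i, nbr_j = set(range(0, K)), set(range(40, 40 + K))
p_i = {c: 1.0 / K for c in nbr_i}
p_j = {c: 1.0 / K for c in nbr_j}
J = len(nbr_i & nbr_j) / len(nbr_i | nbr_j)
assert np.isclose(tpdi(p_i, p_j), 1.0 - 2 * J / (1 + J))
```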

4.4 Overall Algorithm

The overall clustering procedure is summarized in Algorithm 1. In our implementation, we use an iterative method as an approximation of Eq. 3.
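For readability, the sketch below chains the hypothetical helpers sketched in the previous subsections (build_transition_matrix, diffusion_density, tpdi, dpc_assign) into one pipeline; it is an illustrative outline of the procedure, not the listing of Algorithm 1.

```python
# Illustrative outline: DPC on top of NDDe and TPDi, reusing the sketches above.
def cluster(S_hat, tau=0.7, eps=0.05):
    P = build_transition_matrix(S_hat)                     # Sect. 4.1
    rho = diffusion_density(P, eps=eps)                    # Sect. 4.2 (NDDe)
    rows = [dict(zip(P[i].indices, P[i].data)) for i in range(P.shape[0])]
    dist = lambda i, j: tpdi(rows[i], rows[j])             # Sect. 4.3 (TPDi)
    return dpc_assign(rho, dist, tau)                      # Sect. 3 (DPC)
```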


5 Experiments

5.1 Experimental Settings

Datasets. We evaluate the proposed method on two public face clustering benchmark datasets, MS1M [8] and DeepFashion [17]. MS1M contains 5.8M images from 86K identities and the image representations are extracted by ArcFace [5], which is a widely used face recognition model. MS1M is split into 10 almost equal parts officially. Following the same experimental protocol as in [14, 23, 31], we train our model on one labeled part and choose parts 1, 3, 5, 7, and 9 as unlabeled test data, resulting in five test subsets with sizes of 584K, 1.74M, 2.89M, 4.05M, and 5.21M images respectively. For DeepFashion dataset, following [31], we randomly sample 25,752 images from 3,997 categories for training and use the other 26,960 images with 3,984 categories for testing.

Metrics. The performance of face clustering methods is evaluated using two commonly used clustering metrics, Pairwise F-score (\(F_{P}\)) [3] and BCubed F-score (\(F_{B}\)) [1]. Both metrics combine precision and recall.
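For reference, the Pairwise F-score can be computed from the contingency table between ground-truth and predicted labels, as in the minimal sketch below (the helper name is hypothetical; the BCubed F-score is computed analogously from per-sample precision and recall):

```python
# Minimal sketch of the Pairwise F-score F_P, counting sample pairs that fall in
# the same cluster in the prediction and/or the ground truth.
import numpy as np
from scipy.special import comb

def pairwise_fscore(gt_labels, pred_labels):
    gt_ids, gt = np.unique(gt_labels, return_inverse=True)
    pr_ids, pr = np.unique(pred_labels, return_inverse=True)
    cont = np.zeros((len(gt_ids), len(pr_ids)), dtype=np.int64)
    np.add.at(cont, (gt, pr), 1)                     # contingency counts
    tp = comb(cont, 2).sum()                         # pairs together in both
    pred_pairs = comb(cont.sum(axis=0), 2).sum()     # pairs together in prediction
    gt_pairs = comb(cont.sum(axis=1), 2).sum()       # pairs together in ground truth
    precision, recall = tp / pred_pairs, tp / gt_pairs
    return 2 * precision * recall / (precision + recall)
```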

Implementation Details. Our similarity prediction model consists of one Transformer encoder layer [26] as \(\mathcal {F}\) and an MLP. The input feature dimension, feedforward dimension, and number of heads of \(\mathcal {F}\) are set to 256, 2048, and 8, respectively. LayerNorm [2] is applied before the Multi-head Attention module and the Feed Forward module in \(\mathcal {F}\), following [30]. Dropout is set to 0.2. The MLP consists of three linear layers (\(512\rightarrow 256, 256\rightarrow 128, 128\rightarrow 1\)) with ReLU as the activation function for the first two layers and Sigmoid for the last layer. Adam [11] is used for optimization. For the computation of NDDe, we set the number of top nearest neighbors K to 80 for MS1M and 10 for DeepFashion (the same as in previous works [14, 31]). The convergence threshold \(\epsilon \) is set to 0.05. The connecting threshold \(\tau \) is searched within the range [0.5, 0.9] with a step of 0.05 on the MS1M 584K subset, and is fixed to 0.7 for all experiments.

5.2 Method Comparison

We compare the proposed method with a series of clustering baselines, including both traditional methods and learning-based methods. Traditional methods include K-means [18], HAC [24], DBSCAN [6], and ARO [20]. Learning-based methods include CDP [33], L-GCN [29], LTC [32], GCN(V+E) [31], Clusformer [19], Pair-Cls [14], STAR-FC [23], and Ada-NETS [28]. Since NDDe and TPDi can be incorporated into existing face clustering methods for better performance, we also incorporate them into GCN(V+E), Pair-Cls, and STAR-FC by using these three methods to obtain the transition matrix \(\boldsymbol{P}\); the resulting variants are denoted as GCN(V+E)++, Pair-Cls++, and STAR-FC++, respectively.

Table 1. Comparison on MS1M when training with 0.5M labeled face images and testing on five test subsets with different numbers of unlabeled face images. \(F_P\) and \(F_B\) are reported. GCN(V+E)++, Pair-Cls++ and STAR-FC++ denote incorporating NDDe and TPDi into the corresponding methods. The best results are highlighted in bold.

Results on MS1M. Experimental results on the MS1M dataset are shown in Table 1, which contains both \(F_P\) and \(F_B\) on five test subsets of different scales. We can observe that 1) our method consistently outperforms the other methods in terms of both metrics, especially for large-scale subsets, e.g., the improvements of our method on the 4.05M and 5.21M subsets are more than \(2.5\%\); 2) by incorporating NDDe and TPDi into GCN(V+E), Pair-Cls and STAR-FC, their ++ versions achieve better clustering performance than the original versions, e.g., compared to GCN(V+E), the performance gains brought by GCN(V+E)++ are more than \(3\%\) on large-scale test subsets, which demonstrates that NDDe and TPDi can raise the performance of other methods to a new state of the art.

Results on Hard Clusters. To demonstrate that our method is capable of tackling the issues of small clusters and sparse clusters, we conduct experiments by adding NDDe and TPDi one by one to our baseline model, i.e., the model with the same transition matrix but with the density and distance computed in the standard way. As shown in the last three rows of Table 2 and Table 3, both NDDe and TPDi raise the performance of the baseline model to a new level, especially on hard clusters.

Fig. 6.

(a) and (b) show Pairwise precision and recall of three baselines and our method. Significant improvements of our method in terms of recall can be observed. (c) ROC curves of the three baselines and our method. (d) Optimal threshold \(\tau \) for the five test subsets of MS1M dataset.

Table 2. The effectiveness of NDDe and TPDi. \(F_P\) and \(F_B\) of five cluster subsets from MS1M 5.21M with descending size (from sz-1 to sz-5) are reported.
Table 3. The effectiveness of NDDe and TPDi. \(F_P\) and \(F_B\) of five cluster subsets from MS1M 5.21M with ascending sparsity (from sp-1 to sp-5) are reported.

We also reproduce GCN(V+E), Pair-Cls and STAR-FC for comparison, all of which employ a clustering algorithm identical or similar to DPC, as shown in the first two rows in Table 2 and Table 3. It is worth noting that the improvements brought by our method over the three top-performing methods keep increasing across the five cluster subsets with descending size or ascending sparsity. As shown in Fig. 6(a)(b), the improvements of our method in terms of Pairwise recall are more significant than those in Pairwise precision. All the experimental results show the success of our method in mitigating hard clusters, owing to NDDe and TPDi.

Fig. 7.

Top-20 images ranked by distance, using an image from a hard cluster as the probe.

The Superiority of TPDi. Figure 6(c) shows the receiver operating characteristic (ROC) curves of the three top-performing methods and ours, which are obtained by computing the true/false positive rates at various distance thresholds. Our method achieves the highest Area Under the Curve (AUC), which illustrates that TPDi endows our method with a good measure of separability. To show that, with TPDi, our method can yield efficient face clustering with a uniform connecting threshold \(\tau \), we conduct experiments using different \(\tau \) (from 0.5 to 0.9, with a step of 0.05) on all the test subsets of the MS1M dataset, as shown in Fig. 6(d). It can be observed that the best \(\tau \) is the same for test subsets of varying scales. Specifically, given \(\tau =0.7\), our method consistently achieves the highest \(F_P\) on all test subsets. Figure 7 shows the discovery results of several methods with the image in the first column as a probe, where the images are ranked in ascending order of distance. We can observe that the discovery result of our method contains the largest number of positive images.

Table 4. Comparison on DeepFashion. #Clusters, \(F_P\), \(F_B\) and computing time are reported.

Results on DeepFashion. For the DeepFashion dataset, the clustering task is much harder since it is an open-set problem. As shown in Table 4, our method also consistently outperforms the other methods in terms of both \(F_P\) and \(F_B\) with comparable computing time.

5.3 Ablation Study

To demonstrate the effectiveness of NDDe and TPDi, we conduct an ablation study on the MS1M 5.21M dataset, as shown in Table 5. All four methods use the same transition matrix as described in Sect. 4.1. \(M_1\) is our baseline model, which uses the standard density and the cosine distance. \(M_2\) is obtained by replacing the cosine distance in \(M_1\) with TPDi, \(M_3\) is obtained by replacing the standard density in \(M_1\) with NDDe, and \(M_4\) is the proposed method using both NDDe and TPDi as the density \(\rho \) and distance \((d_{ij})_{N\times N}\) required by DPC. Table 5 shows that both NDDe and TPDi contribute to the final clustering performance. The improvement brought by NDDe is more significant, which illustrates that NDDe is essential to the success of our method.

5.4 Face Recognition

Table 5. Ablation study of NDDe and TPDi on MS1M. \(F_P\) and \(F_B\) are reported.
Fig. 8.

Rank-1 face identification accuracy on MegaFace with 1M distractors.

To further show the potential of our method in scaling up face recognition systems with large-scale unlabeled face images, we use our method to generate pseudo-labels for unlabeled face images and use them to train face recognition models. For a fair comparison, we adopt the same experimental setting as in [23, 31, 32]. We use a fixed number of labeled data and different ratios of unlabeled data with pseudo-labels to train face recognition models, and test their performance on the MegaFace benchmark [10], taking the rank-1 face identification accuracy with 1M distractors as the metric. In Fig. 8, the upper bound is trained by assuming all unlabeled data have ground-truth labels, and the other five curves illustrate that all the methods benefit from an increase in the unlabeled data with pseudo-labels. It can be observed that our method consistently achieves the highest performance given any ratio of unlabeled data, and improves the performance of the face recognition model from \(58.20\%\) to \(80.80\%\), which is the closest to the upper bound.

6 Conclusion

In this paper, we point out a key issue in the face clustering task: the low recall on hard clusters, i.e., small clusters and sparse clusters. We find the reasons behind this are that 1) smaller clusters tend to have lower density, and 2) it is hard to set a uniform (distance) threshold to identify clusters of varying sparsity. We tackle these problems by proposing two novel modules, NDDe and TPDi, which yield a size-invariant density and a sparsity-aware distance, respectively. Our extensive ablation study shows that each of them contributes to improving the recall on hard clusters, consistently on multiple face clustering benchmarks.