
1 Introduction

Face clustering is a fundamental task in face analysis and has been extensively studied in recent years [8, 15, 20, 22, 23]. Existing face clustering methods roughly fall into two categories, i.e., unsupervised methods and supervised methods. Unsupervised approaches, such as K-Means [10] and DBSCAN [6], rely on specific assumptions and lack the capability of reasoning over high-dimensional structured data. Supervised face clustering methods mainly aim to learn a more discriminative embedding subspace [3, 14] or the complex cluster patterns [23]. These existing methods are mainly based on node distances in the feature space, while ignoring the negative effect caused by low-quality face images. Low-quality face nodes are common and hinder clustering due to their ambiguous identities. Figure 1(a) shows an example where a node pair containing one low-quality node can have high similarity even though the two nodes belong to different identities. Such low-quality face nodes will obviously degrade the face clustering precision if not handled appropriately. It is therefore essential for face clustering algorithms to be able to deal with this general application circumstance.

Fig. 1.

(a) Demonstration of two node pairs, with the quality score shown on each image and the similarity given in parentheses. (b) Mean absolute value of the quality score difference between paired nodes on the IJB-C dataset [11] under the same and different identities, with respect to the similarity threshold. Only node pairs with similarity higher than the threshold are involved in this statistic. The face quality scores are obtained by EQFace [9], and the identity features are extracted by a pre-trained IResNet50 [4, 5].

Fortunately, the face quality score provides helpful auxiliary information for clustering. Figure 1(b) shows the mean absolute value of the quality score difference between paired nodes under the same and different identities, with respect to the similarity threshold. Only node pairs in the IJB-C dataset [11] with similarity higher than the threshold are involved in this statistic. There always exists a gap between the two score-difference curves, which implies that two nodes that are similar in the feature subspace but belong to different identities tend to have a larger score difference. Based on this observation, this paper proposes a face clustering algorithm, referred to as FC-Q, which takes face quality as an extra input to better exploit unlabeled face data.
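
The statistic in Fig. 1(b) is straightforward to reproduce. Below is a minimal sketch in PyTorch, assuming L2-normalized identity features `feats` (N, d), quality scores `quals` (N,), and identity labels `ids` (N,); all names are illustrative.

```python
import torch

def quality_gap_stats(feats, quals, ids, threshold):
    """Mean |quality score difference| over node pairs whose cosine
    similarity exceeds `threshold`, split into same-identity and
    different-identity pairs (cf. Fig. 1(b))."""
    sim = feats @ feats.T                               # cosine similarity (feats L2-normalized)
    qdiff = (quals[:, None] - quals[None, :]).abs()     # |q_m - q_n| for every pair
    same = ids[:, None] == ids[None, :]                 # same-identity mask
    upper = torch.triu(torch.ones_like(same), diagonal=1)  # each unordered pair once, no self-pairs
    linked = (sim > threshold) & upper
    return (qdiff[linked & same].mean().item(),         # same-identity gap
            qdiff[linked & ~same].mean().item())        # different-identity gap
```

Sweeping `threshold` over a grid of similarity values then traces out the two curves of Fig. 1(b).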

Face quality can be conveniently assessed beforehand by recent deep-learning-based methods. SER-FIQ [17] obtains face quality by measuring the variations of embeddings generated from random sub-networks of the face recognition model. A deep tiny network [13] has also been proposed to learn a recognition-oriented face quality prediction function. Meanwhile, face quality can be given explicitly along with the identity feature by a face recognition network. MagFace [12] introduces an adaptive mechanism to learn a universal feature embedding whose magnitude measures the face quality. EQFace [9] outputs the face quality and the identity feature at the same time by adding a quality network branch to the baseline face recognition network. Such methods make gathering our input data efficient.

Fig. 2.

Flowchart of our FC-Q algorithm. The proposed algorithm follows the link-based clustering paradigm. Given one pivot face node and its K nearest neighbors in the feature subspace, the algorithm first obtains the input vectors by concatenating the identity features and face quality scores of pivot-neighbor pairs as well as a specific Pearson correlation encoding. Its linkage prediction module takes the form of a modified Transformer encoder, and infers the linkage likelihood with the help of a pre-calculated prior relevancy matrix. Its label propagation module transitively merges face nodes according to the linkage likelihood refined with local face quality information. Instances with the same pseudo label constitute a cluster.

The proposed FC-Q algorithm incorporates face quality into its two main modules, i.e., the linkage prediction module and the label propagation module. The linkage prediction module adopts the framework of the Transformer encoder [18], specially modified to fit the general clustering circumstance. A relevancy prior is designed according to the quality score relationships among nodes in the neighborhood. This prior helps the self-attention mechanism of the Transformer encoder better infer the linkage likelihood between node pairs. The label propagation module utilizes face quality to recalibrate abnormal linkage likelihoods based on local quality information. A linkage in which one node has a quality score inconsistent with those of its neighbors is suspected of being unreliable, and its likelihood is suppressed if the two paired nodes have a large gap in their local quality information. Finally, our proposed algorithm transitively merges face nodes according to the refined linkage likelihood, and obtains the clusters.

To summarize, the main contributions of this work are as follows:

  • A face clustering algorithm named FC-Q is proposed that takes face quality as an extra input. Compared with the state of the art, this algorithm deals with the more general circumstance in which face nodes are not all guaranteed to be of good quality.

  • The proposed FC-Q algorithm modifies the Transformer encoder and designs a relevancy prior with face quality to infer more reliable linkage likelihoods between similar pairs in the feature subspace.

  • The proposed FC-Q algorithm utilizes the local quality information of each node to further suppress pairings with abnormally high linkage likelihood.

  • The proposed FC-Q algorithm specifically conducts face clustering experiments on the IJB-C dataset with low face quality and achieves a \(91.7\%\) pairwise F-score on partial IJB-C, which provides a strong baseline for low-quality face clustering.

2 Methodology

In this section, we introduce the details of the proposed FC-Q algorithm, which consists of a specific linkage prediction module and a label propagation module. Figure 2 shows the flowchart.

2.1 Linkage Prediction with Prior Attention

Following the link-based clustering paradigm, the proposed algorithm selects every face node as a pivot, and estimates the linkage likelihood between the pivot and each of its K nearest neighbors in the feature subspace.
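
For reference, a minimal sketch of this neighbor selection follows, assuming L2-normalized features so that the inner product equals cosine similarity; for large N, an approximate nearest-neighbor library would replace the dense similarity matrix.

```python
import torch

def knn_in_feature_subspace(feats, K=80):
    """Indices of the K nearest neighbors of every pivot by cosine
    similarity. feats: (N, d_f) L2-normalized identity features."""
    sim = feats @ feats.T           # (N, N) pairwise cosine similarity
    sim.fill_diagonal_(-1.0)        # exclude each pivot from its own neighbors
    _, nbr_idx = sim.topk(K, dim=1)
    return nbr_idx                  # (N, K) neighbor indices per pivot
```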

Given the ith pivot face node \(\textbf{f}_0^{i}\in \mathbb {R}^{d_f}\) and its K nearest neighbors \(\textbf{f}_j^{i}, j\in \{1,...,K\}\), the input of the linkage prediction module consists of \(K+1\) vectors, each of the form

$$\begin{aligned} \textbf{g}_j = \textrm{cat}\left( \textbf{f}_j^i, \textbf{f}_0^{i}, q_j^i, q_0^i, \textbf{p}_j\right) \in \mathbb {R}^{d_g}, \quad j\in \{0,...,K\}, \end{aligned}$$
(1)

where \(q_0^i\in \mathbb {R}\) denotes the face quality score of the pivot node, and \(q_j^{i}, j\in \{1,...,K\}\) denotes the face quality score of its jth neighbor. Operator \(\textrm{cat}()\) denotes the concatenation operation along the feature dimension, thus \(d_g=2d_f+2+K+1\). Vector \(\textbf{p}_j\in \mathbb {R}^{K+1}\) represents the Pearson correlation encoding of each involved face node. Its kth element, i.e., the Pearson correlation coefficient, is calculated as

$$\begin{aligned} \textbf{p}_j^k = \frac{\textrm{cov}\left( \textbf{f}_j^i, \textbf{f}_k^i\right) }{\sigma _{\textbf{f}_j^i}\sigma _{\textbf{f}_k^i}}, \quad k\in \{0,...,K\}, \end{aligned}$$
(2)

where \(\textrm{cov}\) and \(\sigma \) denote the covariance and the standard deviations of \(\textbf{f}_j^i\) and \(\textbf{f}_k^i\), respectively. Concatenating the pivot feature \(\textbf{f}_0^{i}\) in (1) informs the linkage prediction module to learn the relationship within each pivot-neighbor pair. Concatenating the quality scores \(q_j^i, q_0^i\) in (1) lets the linkage prediction module infer the linkage likelihood with more references. Concatenating the Pearson correlation encoding \(\textbf{p}_j\) in (1) is based on the consideration that the Pearson correlation coefficient offers the linkage prediction module the linear relations between paired nodes and is highly sensitive to outliers [2].
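
A minimal sketch of assembling the input vectors of Eq. (1) for one pivot, using torch.corrcoef for the Pearson encoding of Eq. (2); `feats`, `quals`, and the neighbor indices are assumed to come from the preceding steps.

```python
import torch

def build_input_vectors(feats, quals, pivot, nbrs):
    """Assemble the K+1 input vectors g_j of Eq. (1) for one pivot.
    feats: (N, d_f) identity features; quals: (N,) quality scores;
    nbrs: (K,) neighbor indices of `pivot`."""
    idx = torch.cat([torch.tensor([pivot]), nbrs])  # position 0 is the pivot
    F, q = feats[idx], quals[idx]                   # (K+1, d_f), (K+1,)
    P = torch.corrcoef(F)                           # Pearson encoding, Eq. (2)
    n = F.shape[0]
    return torch.cat([F,                            # neighbor/pivot feature f_j
                      F[0].expand(n, -1),           # pivot feature f_0, repeated
                      q[:, None],                   # quality score q_j
                      q[0].expand(n)[:, None],      # pivot score q_0, repeated
                      P], dim=1)                    # (K+1, 2*d_f + 2 + K + 1)
```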

With the input vectors generated, the linkage prediction module adopts a modified Transformer encoder to estimate the linkage likelihood of each pivot-neighbor pair. Compared with GCN-based frameworks [20], the Transformer encoder can learn relation weights through its effective self-attention mechanism. As shown in Fig. 2, the linkage prediction module is composed of a stack of 3 identical layers. Each layer has three sub-layers: the first is a pre-layer normalization [21], the second is a modified multi-head self-attention mechanism, and the third is a simple fully connected feed-forward network. Note that the skip connections of the last two sub-layers are specifically removed to let the whole module focus more on inferring the difference between pivot and neighbor nodes. The last feed-forward sub-layer performs binary node classification followed by a softmax activation, which outputs the probability that the corresponding input belongs to the same class as the pivot.
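
Under our reading of this description, one encoder layer can be sketched as follows; `PriorSelfAttention` refers to the prior-guided attention sketched after Eq. (7), and the feed-forward width `d_ff` is an illustrative choice not specified in the text.

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One of the 3 identical layers: pre-layer normalization, modified
    multi-head self-attention, feed-forward network. The residual
    connections around the last two sub-layers are removed on purpose."""
    def __init__(self, d_g, attention, d_ff=2048):
        super().__init__()
        self.norm = nn.LayerNorm(d_g)    # pre-layer normalization [21]
        self.attn = attention            # e.g., a PriorSelfAttention instance
        self.ffn = nn.Sequential(nn.Linear(d_g, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_g))

    def forward(self, G, A):
        G = self.attn(self.norm(G), A)   # no skip connection here
        return self.ffn(G)               # nor here
```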

Fig. 3.

Architecture of our modified self-attention mechanism. The input vectors are first projected into three super vectors named key, query and value. The key and query are forced to share the same projection. A prior relevancy matrix is added to the learned relevancy, and offers negative relevancy information when paired nodes have a large quality score difference. The brighter the color, the larger the value.

Figure 3 further shows the architecture of the three modified self-attention sub-layers. Taking the first one as an example, every normalized vector \(\textbf{g}_j'\) in the ith input \(\textbf{G}_i'\in \mathbb {R}^{(K+1)\times d_g}\) is first linearly projected into three super vectors named key, query and value, which can be expressed in matrix form as

$$\begin{aligned} \begin{aligned}&\textbf{K}=\textbf{G}_i'\textbf{W}^S, \textbf{K}\in \mathbb {R}^{(K+1)\times d_s},\\&\textbf{Q}=\textbf{G}_i'\textbf{W}^S, \textbf{Q}\in \mathbb {R}^{(K+1)\times d_s},\\&\textbf{V}=\textbf{G}_i'\textbf{W}^V, \textbf{V}\in \mathbb {R}^{(K+1)\times d_v}, \end{aligned} \end{aligned}$$
(3)

where matrices \(\textbf{W}^S\in \mathbb {R}^{d_g\times d_s}\) and \(\textbf{W}^V\in \mathbb {R}^{d_g\times d_v}\) denote the learnable projection matrices. The key and query are forced to share the same projection, so the relevancy between any two samples is symmetric. This setting is helpful for face clustering, as shown in the experiment section.

The relevancy between the samples is constructed as

$$\begin{aligned} \begin{aligned} \frac{1}{\sqrt{d_s}}\textbf{Q}\textbf{K}^T + \textbf{A}, \end{aligned} \end{aligned}$$
(4)

where matrix \(\textbf{A}\in \mathbb {R}^{(K+1)\times (K+1)}\) represents the prior relevancy, and its element in the mth row and nth column has the form

$$\begin{aligned} \begin{aligned} \textbf{A}_{m,n} = \textrm{sim}\left( \textbf{f}^i_m, \textbf{f}^i_n\right) \cdot \left( 1 - \textrm{abs}(q^i_m - q^i_n)\right) , \end{aligned} \end{aligned}$$
(5)

where operator \(\textrm{sim}()\) calculates the similarity between identity features \(\textbf{f}^i_m\) and \(\textbf{f}^i_n\), and operator \(\textrm{abs}()\) denotes the absolute value. The prior relevancy matrix \(\textbf{A}\) is also symmetric, and offers negative relevancy information between two samples when they have a large quality score difference.
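
A minimal sketch of Eq. (5) for one pivot neighborhood, taking sim() as cosine similarity (an assumption; the text does not fix the similarity measure) and quality scores in [0, 1].

```python
import torch.nn.functional as Fn

def prior_relevancy(F, q):
    """Prior relevancy matrix A of Eq. (5).
    F: (K+1, d_f) identity features; q: (K+1,) quality scores in [0, 1]."""
    Fhat = Fn.normalize(F, dim=1)
    sim = Fhat @ Fhat.T                        # sim(f_m, f_n), cosine
    qgap = (q[:, None] - q[None, :]).abs()     # abs(q_m - q_n)
    return sim * (1.0 - qgap)                  # large quality gaps shrink the prior
```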

The output \(\textbf{Z}\in \mathbb {R}^{(K+1)\times d_v}\) in self-attention mechanism is the aggregation of value matrix \(\textbf{V}\) by attention weights, i.e.,

$$\begin{aligned} \begin{aligned} \textbf{Z}=\textrm{softmax}\left( \frac{1}{\sqrt{d_s}}\textbf{Q}\textbf{K}^T + \textbf{A}\right) \textbf{V}, \end{aligned} \end{aligned}$$
(6)

where the attention weights are obtained by applying softmax normalization to (4).

The modified self-attention is further incorporated into the multi-head mechanism, where the self-attention is performed H times in parallel. The outputs of the individual self-attention heads are concatenated and once again projected, resulting in the final output of the multi-head self-attention sub-layer

$$\begin{aligned} \begin{aligned} \textbf{Z}_{M}=\textrm{cat}\left( \textbf{Z}_1,...,\textbf{Z}_H\right) \textbf{W}_M, \end{aligned} \end{aligned}$$
(7)

where matrix \(\textbf{W}_M\in \mathbb {R}^{Hd_v\times d_g}\) denotes the learnable projection matrix.
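
Putting Eqs. (3)-(7) together, the modified multi-head self-attention can be sketched as below; the shared key/query projection and the additive prior follow the text, while details such as the absence of bias terms are our assumptions.

```python
import torch
import torch.nn as nn

class PriorSelfAttention(nn.Module):
    """Multi-head self-attention with a shared key/query projection and an
    additive prior relevancy matrix A, following Eqs. (3)-(7)."""
    def __init__(self, d_g, d_s=2048, H=2):
        super().__init__()
        self.d_s = d_s
        self.W_S = nn.ModuleList(nn.Linear(d_g, d_s, bias=False) for _ in range(H))
        self.W_V = nn.ModuleList(nn.Linear(d_g, d_s, bias=False) for _ in range(H))
        self.W_M = nn.Linear(H * d_s, d_g, bias=False)

    def forward(self, G, A):
        # G: (K+1, d_g) normalized input vectors; A: (K+1, K+1) prior relevancy
        heads = []
        for W_S, W_V in zip(self.W_S, self.W_V):
            KQ = W_S(G)                                   # K = Q: shared projection, Eq. (3)
            V = W_V(G)
            logits = KQ @ KQ.T / self.d_s ** 0.5 + A      # symmetric relevancy, Eq. (4)
            heads.append(torch.softmax(logits, -1) @ V)   # attention aggregation, Eq. (6)
        return self.W_M(torch.cat(heads, dim=-1))         # multi-head output, Eq. (7)
```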

During the training stage, the whole linkage prediction module is trained with the cross-entropy loss

$$\begin{aligned} \begin{aligned} \mathcal {L}=-\sum _{k=0}^K\log \left( \hat{y}_{k,1}^i\right) , \end{aligned} \end{aligned}$$
(8)

where \(\hat{y}_{k,1}^i\) denotes the output probability, i.e., the linkage likelihood, that the ith pivot and its kth neighbor belong to the same class.
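
In practice, Eq. (8) amounts to a standard two-way cross-entropy over the node classification outputs. A sketch with illustrative dummy tensors:

```python
import torch
import torch.nn.functional as F

K = 80
logits = torch.randn(K + 1, 2)            # scores from the final feed-forward sub-layer
labels = torch.randint(0, 2, (K + 1,))    # 1 if the node shares the pivot's identity
loss = F.cross_entropy(logits, labels, reduction='sum')  # Eq. (8), summed over the K+1 nodes
```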

Algorithm 1. The label propagation procedure with anomaly suppression.

2.2 Label Propagation with Anomaly Suppression

As all pivot nodes are processed by the linkage prediction module, a set of pivot-neighbor pairs \(\mathcal {E}=\{\textbf{e}_j\}_{j=1,...,NK}\) is obtained along with the corresponding linkage likelihood set \(\mathcal {P}=\{p_j\}_{j=1,...,NK}\). The goal of the label propagation module is to assign a pseudo label \(y_i\) to every face image \(\textbf{x}_i\) with the help of the quality score set.

The label propagation module starts with the initial linkage threshold \(\tau _p\) and proceeds iteratively. In each iteration, the module first finds all the connected neighbors of every unlabeled node according to the current pair set \(\mathcal {E}\). Given one pivot-neighbor pair \(\textbf{e}_j\) connecting nodes m and n, the module calculates the mean absolute value of the quality score difference between each node and its connected neighbors,

$$\begin{aligned} \begin{aligned} \breve{q}_{*} = \frac{1}{|\mathcal {N}_*|}\sum _{k\in \mathcal {N}_*}\textrm{abs}\left( q_*-q_k\right) ,\,\,\, *=m,n, \end{aligned} \end{aligned}$$
(9)

where \(\mathcal {N}_*\) denotes the index set of connected neighbors, and \(|\mathcal {N}_*|\) its cardinality. If the value \(\breve{q}\) of either node is larger than a predefined threshold \(\tau _q\), the pivot-neighbor pair is considered unreliable and its linkage likelihood is suppressed as

$$\begin{aligned} \begin{aligned} p_j =p_j \cdot \left( 1-\textrm{abs}\left( \bar{q}_m-\bar{q}_n\right) \right) , \end{aligned} \end{aligned}$$
(10)

where \(\bar{q}_m\) and \(\bar{q}_n\) denote the mean quality scores of the corresponding node and its connected neighbors. The gap in local face quality information between the two nodes determines the degree to which the corresponding linkage likelihood is suppressed.
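
A minimal sketch of this recalibration for a single pair, directly following Eqs. (9)-(10); `q` maps node indices to quality scores, and the neighbor lists come from the current pair set.

```python
def suppress_linkage(p, q, nbrs_m, nbrs_n, m, n, tau_q=0.3):
    """Refine the linkage likelihood p of pair (m, n) per Eqs. (9)-(10)."""
    breve_m = sum(abs(q[m] - q[k]) for k in nbrs_m) / len(nbrs_m)  # Eq. (9) at node m
    breve_n = sum(abs(q[n] - q[k]) for k in nbrs_n) / len(nbrs_n)  # Eq. (9) at node n
    if breve_m > tau_q or breve_n > tau_q:                         # suspected unreliable link
        bar_m = (q[m] + sum(q[k] for k in nbrs_m)) / (len(nbrs_m) + 1)
        bar_n = (q[n] + sum(q[k] for k in nbrs_n)) / (len(nbrs_n) + 1)
        p *= 1.0 - abs(bar_m - bar_n)                              # Eq. (10)
    return p
```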

With all linkage likelihoods updated, the label propagation module first removes from \(\mathcal {E}\) the pivot-neighbor pairs whose linkage likelihood is below the threshold \(\tau _p\). The module then finds the connected components formed by the remaining pivot-neighbor pairs. If the number of nodes in a component is below the predefined maximum size M, all nodes in the component are annotated with a new pseudo label, and the corresponding pivot-neighbor pairs are also removed from set \(\mathcal {E}\). The threshold \(\tau _p\) is increased at the end of every iteration.

The above process is iterated until the pair set \(\mathcal {E}\) is empty. The remaining orphan nodes are annotated with new individual pseudo labels. Finally, the face nodes with the same pseudo label constitute a cluster. In summary, Algorithm 1 lists the whole procedure of this label propagation. The maximum computational complexity of one iteration is of order \(\mathcal {O}(3NK)\), and decreases as the iterations progress. We observe that involving face quality helps improve the clustering precision at the cost of a slight drop in recall.
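
The whole procedure can be sketched as follows, reusing `suppress_linkage` from above and networkx for the connected components; the per-iteration increment of \(\tau _p\) (`step`) is an illustrative value the text does not specify.

```python
import networkx as nx
from collections import defaultdict

def label_propagation(pairs, likes, q, tau_p=0.9, tau_q=0.3, M=1000, T=20, step=0.05):
    """Iterative label propagation (Algorithm 1). pairs: list of (m, n)
    node-index pairs; likes: aligned linkage likelihoods; q: quality scores."""
    labels, next_label = {}, 0
    for _ in range(T):
        if not pairs:
            break
        nbrs = defaultdict(list)                  # connected neighbors per node
        for m, n in pairs:
            nbrs[m].append(n)
            nbrs[n].append(m)
        likes = [suppress_linkage(p, q, nbrs[m], nbrs[n], m, n, tau_q)
                 for p, (m, n) in zip(likes, pairs)]
        kept = [(e, p) for e, p in zip(pairs, likes) if p >= tau_p]
        G = nx.Graph([e for e, _ in kept])        # remaining pivot-neighbor pairs
        done = set()
        for comp in nx.connected_components(G):
            if len(comp) <= M:                    # annotate small components
                labels.update((v, next_label) for v in comp)
                next_label += 1
                done |= comp
        pairs = [e for e, p in kept if e[0] not in done and e[1] not in done]
        likes = [p for e, p in kept if e[0] not in done and e[1] not in done]
        tau_p += step                             # tighten the threshold
    return labels  # orphan nodes then receive fresh individual labels
```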

3 Experiments

3.1 Experimental Settings

Datasets. We use the IJB-C dataset [11] with low face quality as well as the refined MS1M dataset [4, 7] with good face quality for training and testing in face clustering. IJB-C contains about 138K face images from 3.5K identities, and MS1M contains about 5.8M face images from 85K identities. The IJB-C dataset is randomly partitioned into 10 splits with an equal number of identities, and each part has the same distribution of nodes per identity. As the IJB-C dataset is small, 9 parts are used for training to alleviate overfitting and 1 part for testing. The MS1M dataset is partitioned in the same way, with 1 part for training and the other 9 parts for testing.

We evaluate the performance of face clustering by three commonly used metrics, i.e., Pairwise [16], BCubed [1] and NMI [19]. Pairwise and BCubed both measure the precision and recall of clustering, with the F-score being their harmonic mean. The former metric places more emphasis on large clusters. NMI measures the global closeness between the output pseudo labels and the ground truth. All the metrics have a range of [0, 1], with 1 being perfect. Four state-of-the-art algorithms, namely L-GCN [20], GCN-DS [23], GCN-VE [22], and STAR-FC [15], are used for comparison.

Our framework is implemented in PyTorch. We first use the pre-trained IResNet50 [4, 5] to extract the identity features of face nodes with dimension \(d_f=512\), and EQFace [9] to obtain the corresponding face quality scores. All the competing algorithms share the same input. We then set the neighbor number \(K = 80\), the dimension of key, query and value \(d_s=d_v=2048\), and the number of attention heads \(H=2\). We also empirically set the initial linkage threshold \(\tau _p=0.9\), the quality threshold \(\tau _q=0.3\), and the maximum cluster size \(M=1000\). The maximum iteration number T is set to 20, which is sufficient to obtain a satisfactory clustering. These parameter values remain the same in the following experiments.

In addition, we train the model for 20 epochs from scratch, and optimize the loss with the SGD optimizer. The weight decay and the momentum are set to 0.0005 and 0.9, respectively. The initial learning rate is set to 0.01 and is empirically divided by 10 at epochs 8, 12 and 18. All the experiments are performed on a single Tesla-P40 GPU; more GPUs can be used for acceleration.

Table 1. Results on the IJB-C dataset and the MS1M dataset. “P” is short for “Pairwise”, and “B” is short for “BCubed”.

3.2 Experimental Results

Table 1 first shows the comparison results on the IJB-C dataset. GCN-D uses only its detection module. GCN-DS further incorporates the segmentation module but achieves no performance improvement. This is because the algorithm fails to learn the complex cluster patterns, as the identity similarities are not reliable among low-quality faces. GCN-V and GCN-VE present the same phenomenon, as their vertex confidence and edge connectivity estimation modules depend heavily on the identity similarity in the feature subspace. STAR-FC performs relatively better owing to its structure-preserved sub-graph sampling strategy. Overall, our FC-Q algorithm outperforms the other algorithms on this dataset, achieving a \(91.7\%\) pairwise F-score by employing face quality scores. Table 1 also shows the comparison results on the MS1M dataset. All the algorithms achieve performance improvements when the faces are of good quality. Although the assisting role of face quality is diluted there, our FC-Q algorithm still obtains satisfactory results. This indicates that our algorithm can be the first choice when implementing face clustering under the general circumstance where the face quality is unknown or mixed.

3.3 Ablation Studies

In this subsection, we evaluate several design elements of our algorithm. The following experiments are all conducted on the IJB-C dataset. Table 2 presents the clustering results of our algorithm when individually removing the following four design elements: concatenating the quality scores, concatenating the Pearson correlation encoding, adding the prior relevancy matrix, and making the key and query share the same projection. All four design elements contribute to the performance improvement, and the last one helps the most.

Table 2. Results on IJB-C when the four design elements are individually removed. “P” is short for “Pairwise”, and “B” is short for “BCubed”.
Table 3. Results of L-GCN and FC-Q algorithms with their label propagation modules swapped. “A+B” denotes using A for linkage prediction and B for label propagation.

We further swap the label propagation modules of L-GCN and FC-Q to validate the effectiveness of incorporating face quality. Table 3 shows that equipping L-GCN with our proposed label propagation module evidently improves its clustering precision with only a slight drop in recall. This is because our module further suppresses abnormal pairings based on the extra local quality information. The improvement is less significant for FC-Q, as its linkage prediction module has already incorporated face quality to output more reliable linkage likelihoods.

Fig. 4.

Effect of the self-attention mechanism. One attention map is shown on the left and three pivot-neighbor pairs on the right.

Figure 4 further shows the effectiveness of our modified self-attention mechanism. An attention map extracted from the last layer is shown on the left, which involves one pivot example and its \(K = 80\) nearest neighbors. The neighbors are sorted in descending order of similarity. The attention values are plotted on a logarithmic scale to better demonstrate the numerical differences. The sequence above the attention map describes the identity consistency, where a black dot indicates that the neighbor node at the same sequence position has a different identity. Three pivot-neighbor pairs are also shown on the right, with the quality score displayed on each image and the similarity below the arrow. The first pivot-neighbor pair in the red box and the second pair in the rose box have close similarity but opposite identity consistency. The self-attention mechanism successfully suppresses the attention values of the former neighbor node based on the large score difference prior. Note that although the third neighbor node in the black box has a different identity but a small score difference, the self-attention mechanism can still suppress its attention value through the relevancy aggregation among the involved nodes.

4 Conclusion

This paper has introduced a face clustering algorithm, referred to as FC-Q, to tackle face nodes of mixed quality. The algorithm takes the face quality score as an extra input, which is incorporated into the linkage prediction module and the label propagation module as a prior. The algorithm first modifies the Transformer encoder and uses a quality-based relevancy prior to infer more reliable linkage likelihoods. It then utilizes local quality information to further suppress abnormal pairings with high linkage likelihood. Experimental results validate that our algorithm achieves satisfactory clustering results under the general circumstance where the face quality is unknown or mixed.