1 Introduction

Face clustering has applications in organizing personal photo albums, video understanding and automatic labeling of data for semi-supervised learning. Many existing methods cannot cluster millions of faces: they are either too slow, too inaccurate, or require too much memory. Our method runs on a CPU in minutes given the nearest-neighbor graph and the embedding values. Greedy clustering algorithms such as TC are fast but inaccurate. Non-greedy algorithms such as spectral clustering are slow, use a lot of memory and need an accurate number of clusters to produce a good result. On large datasets, spectral clustering runs out of memory on a desktop computer because it has to store and process the pairwise distances between all embeddings during clustering. Our method combines the best of both worlds: our optimization-based clustering consists of both a greedy and a non-greedy stage. Deep learning clustering requires learning and produces noisy results with many singleton clusters, whereas our method does not require learning. These deep learning clustering methods cannot scale to a large number of faces (in the millions); they are slow, need a large amount of memory and their accuracy drops as the number of faces grows.

Clustering is an important step for semi-supervised face recognition. Semi-supervised learning learns from both labeled and unlabeled data. The trend in deep learning face recognition models is that the larger the dataset, the better the performance. However, a larger dataset requires more man-hours to label. Labeling face classes is difficult because the number of classes is unbounded. Semi-supervised learning helps to circumvent the problem of large-scale data collection and makes it possible to train a state of the art face recognition model with less labeled data. It is an under-explored area in face recognition. Face recognition is an open-set classification problem, which means we can always add more face classes for learning; this unlimited number of classes makes fixed-class semi-supervised learning methods impossible to use.

Omni-supervised learning is a recently proposed technique to label unlabeled data using only a single trained model [1]. It was originally applied to the human pose estimation problem, where the input image is transformed multiple times and fed into the model to produce multiple labels. These labels are then combined and used as labels for semi-supervised learning. We adapt the technique to face recognition: the face image is transformed before the face alignment process and flipped before the computation of the face embedding. This allows us to compute multiple embeddings, which are combined to form a more robust embedding that is more accurate for clustering.

The main contributions of this paper are:

  • Developed an unsupervised clustering method that is very fast and accurate. It is a two-stage algorithm: a greedy clustering is performed first, followed by a non-greedy clustering algorithm, to tackle both easy and difficult to separate clusters

  • Achieved state of the art performance on large-scale face clustering, surpassing supervised deep learning clustering algorithms. Our clustering algorithm achieved an F-measure of 76.30% in 22.75 min on 5 million faces of the MS-Celeb-1M dataset, compared to a competing method with an F-measure of 71.63% in 162.27 min

  • Shown that our clustering result can be fed into a face recognition algorithm to perform semi-supervised learning

2 Related work

Classical clustering methods (such as K-means, DBSCAN and spectral clustering) are too slow, consume too much memory and have poor accuracy. They are not tuned to the distribution of face recognition embeddings, and many of them require the number of clusters to be specified. Density clustering methods such as DBSCAN have low recall. An unsupervised clustering algorithm using the first neighbor relation (FINCH) [2] does not scale to a large number of face classes in either accuracy or speed.

State of the art deep learning clustering algorithms [3, 4] require training and large amounts of memory, and they are slow. They cannot work with face embeddings of arbitrary vector length. We therefore propose an unsupervised two-stage clustering algorithm that does not require learning on any dataset and has few parameters to tune. These deep learning algorithms not only need the nearest-neighbor graph between face embeddings to do clustering, they also require the embedding values. The graph convolutional network (GCN) method [4] increases and decreases the edge distances between embeddings so that embeddings from the same class are moved closer together and embeddings from different classes are moved further apart. Then a greedy clustering algorithm segments the embeddings based on a simple increasing distance threshold and breadth-first search. Another work that uses an affinity graph (LTC) [3] groups the embeddings into super-vertices, generates cluster proposals, performs non-maximal suppression on the proposal clusters and refines the cluster labels within each proposal, similar to the Mask-RCNN [5] algorithm. By grouping into super-vertices, their method is more robust against producing small noisy clusters.

Semi-supervised algorithms are generally divided into two categories. One type of semi-supervised learning operates on classification problems with a fixed number of classes (closed-set classification) [6,7,8,9,10,11,12,13,14,15]. Some of these train multiple deep networks and combine the output labels from the networks to label the unlabeled data [8, 10, 13, 14]. Some use label propagation to spread labels from labeled samples to unlabeled samples [6, 12]. One method uses graph filtering similar to clustering [9]. Some use specialized loss functions to propagate labels during deep learning [11, 16]. One very innovative work uses learning speed to determine the labels [17]. Since face recognition has an unlimited number of classes, these fixed-class semi-supervised learning methods are not suitable for face recognition and are difficult to modify for the face recognition problem.

The second type of semi-supervised learning algorithm is designed to work with a variable number of classes (open-set classification). One subcategory of this type requires the unlabeled data to be clustered [3, 4] to become labeled data for learning. A simple model is learned from the labeled data. Then clustering is performed on the unlabeled data using the trained model to label it. The combined data from both the labeled and unlabeled sets is used to train a final model for face recognition.

Another subcategory of this type does not require clustering; labeling is done during learning using a semi-supervised loss function [18]. This approach learns and labels at the same time using a single loss function and does not output cluster labels. In contrast, semi-supervised learning with our clustering algorithm can use existing face recognition deep learning networks and loss functions without such a complex loss function.

Some semi-supervised face recognition algorithms take into account the context in which the photos are taken [19,20,21]. Our method does not need any context information and can work on random faces crawled from the internet.

CDP [22] is a recently proposed semi-supervised learning algorithm. Its disadvantage is that it requires training multiple committee models to achieve higher clustering accuracy, and training many models is time consuming. Omni-supervised learning [1] has recently been proposed to generate multiple labels for a single input image by applying multiple transforms to the input image. The different transformations are fed into a single model to produce multiple labels, which are then combined in the same way that labels from multiple committee models are combined. This avoids the need to train multiple models. A deep network model has a redundant representation of weights, so a slight transformation of the input should result in the same label. If the output label of a transformation is incorrect, it is detected as differing from the labels of the other transformations, and the labels of the multiple transformations can be combined to create a more accurate label. The model can be thought of as 'self-ensembled', with different models of the ensemble being expressed when different transformations are applied to its input.

As for recent deep learning approaches [23, 24] to clustering, clustering is inherently an optimization problem, and deep learning clustering tries to learn the optimization function of clustering. Deep learning clustering is neither interpretable nor explainable, so it does not add knowledge to the science of clustering. It also has practical drawbacks: it does not scale well to extremely large datasets, so it takes very long and consumes a lot of memory to do clustering. An analogy is the use of linear programming versus reinforcement learning to solve the Travelling Salesman Problem; reinforcement learning does not scale well in computational performance with the number of cities.

3 The proposed clustering algorithm

3.1 Overview

Fig. 1 Our face clustering framework

Figure 1 shows the framework of our clustering algorithm, with the last step performing semi-supervised face recognition learning. Step 1 trains a shallow model using only the labeled data; a shallow network is used to prevent overfitting. Step 2 applies multiple transforms to each input face image. The transformed face images are fed into the shallow network to produce multiple embeddings, which are averaged and normalized to produce the final embedding of each face image used for clustering in the later stages. The transformed embeddings should complement each other's mispredictions. Step 3 labels the unlabeled data using the TC algorithm [22]. Step 4 refines the clusters by splitting or merging them. We will explain that it is very unlikely that clusters need to be merged at this step to produce a better result; therefore, only splitting is carried out in our algorithm. Step 5 propagates the labels to neighboring face embeddings whose distances are smaller than a threshold. Step 6 trains the final semi-supervised model using both the labeled and unlabeled data with the cluster labels.

While it is true that two-stage clustering, as well as TC clustering and K-means, have been introduced and adopted in the literature, our approach combines these methods in a novel way to address a specific problem. The combination of the two clustering stages has not been experimented with in the literature.

3.2 Omni-supervised model with multiple transformations

Fig. 2 The four transforms for omni-supervised learning

Figure 2 shows how four transforms are carried out to produce four different embeddings using only one model in step 2 of our framework. The four transformations are composed of simpler transformations, namely left and right flips, alignment, and scaling the face image up by two times before alignment. The embeddings produced by running the different transformations through the same model are combined using an average operation and then normalized. We notice that there are errors in the alignment stage. Scaling up the face image before running the align operation returns slightly different alignment keypoint positions, which helps when the original alignment without scaling is incorrect. This averages out the errors in the embeddings due to wrong alignment. The left-to-right flipping also averages out errors in the embedding. We apply another alignment transformation [25] instead of using the alignment information in the dataset, and use the alignment information in the data only if the alignment transform fails to find an alignment for a face image. We assume the unlabeled data has no alignment information. Other transformations such as rotation could also be used, but we did not investigate them in this work.
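The snippet below is a minimal sketch of how the four transformed embeddings can be combined into a single robust embedding. The model call (embed_fn), alignment step (align_fn), 2x up-scaling (scale_fn) and the embedding length are hypothetical stand-ins for whatever face recognition model and alignment code is used; only the average-then-normalize combination follows the text.

```python
import numpy as np

def combined_embedding(image, embed_fn, align_fn, scale_fn):
    """Average the embeddings of several input transforms, then L2-normalize.

    embed_fn, align_fn and scale_fn are hypothetical stand-ins for the face
    recognition model, the face alignment step and a 2x up-scaling operation.
    """
    variants = []
    aligned = align_fn(image)                 # plain alignment
    variants.append(aligned)
    variants.append(np.fliplr(aligned))       # left-right flip
    rescaled = align_fn(scale_fn(image))      # scale up 2x, then align
    variants.append(rescaled)
    variants.append(np.fliplr(rescaled))      # flip of the rescaled variant

    embs = np.stack([embed_fn(v) for v in variants])   # shape (4, emb_dim)
    mean = embs.mean(axis=0)
    return mean / np.linalg.norm(mean)        # unit-length combined embedding
```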

These transformations act like image augmentations in deep learning. We show that using all the transformations leads to the best result. These transformations tackle the face keypoint misalignment problem, but do not tackle the blurriness or color distortion caused by surveillance cameras; these problems can be addressed in future work.

3.3 Greedy clustering and non-greedy cluster splitting

An additional clustering is performed at step 4 to further improve the clustering result. As TC is a greedy clustering approach, it results in clusters that geometrically contain two or more clusters. These clusters can be further split at step 4 by a non-greedy clustering algorithm such as K-means, hierarchical clustering, spectral clustering, etc. Details of the implementation of our clustering algorithm can be found in the supplementary sections of our paper.

D is a distance matrix where each element \(d_{ij}\) represents the distance between embedding i and embedding j; likewise, S is a similarity matrix where \(s_{ij}\) is the reciprocal of \(d_{ij}\),

$$\begin{aligned} d_{ij}&=\Vert e_i-e_j \Vert \nonumber \\ s_{ij}&=1/d_{ij}. \end{aligned}$$
(1)

\(S^{(2)}=\delta (S,t)\) thresholds the similarities \(s_{ij}\) in S elementwise: an element becomes 0 if it is smaller than the threshold, else 1. \(s^{(2)}_{ij}=1\) if embedding i is linked to embedding j. The thresholded matrix \(S^{(2)}\) is the adjacency matrix that represents the graph connecting the data points. TC clustering performs transitive closure on this adjacency matrix. The TC of the graph is equivalent to the Floyd-Warshall algorithm with path algebra, so the algorithm can be represented in matrix form.

$$\begin{aligned} \delta ((S^{(2)})^2,0) \end{aligned}$$
(2)

is the linking of embeddings that are 2 edges apart, where the matrix \(S^{(2)}\) is raised to the power of 2. This adds edges between vertices connected by a path of 2 edges in the adjacency matrix.

$$\begin{aligned} S^{(3)}=\delta ((S^{(2)})^\infty ,0) \end{aligned}$$
(3)

is the linking of embeddings that are many (any finite positive number of) edges apart; in other words, it is the transitive closure of the edges using path algebra. The matrix \(S^{(2)}\) is raised to the power of \(\infty \). To subdivide the clusters using non-greedy clustering, let

$$\begin{aligned}&\min _T \{|T-S^{(3)}|\}_{\ge 0} \nonumber \\&\quad \text {such that} \quad S^{(4)} = \delta (S,T) \nonumber \\&\quad S^{(4)} \le S^{(3)} \nonumber \\&\quad v_i = \lambda _i(S^{(4)}) \nonumber \\&\quad v_i \in \{0,1\} \nonumber \\&\quad S^{(4)} \textbf{1} < t_{max\_size}. \end{aligned}$$
(4)

\(\textbf{1}\) is a vector of ones and \(S^{(4)}\) is the thresholded matrix, which is a block diagonal matrix in which each block represents one cluster, because each cluster should be fully connected with every other data point in the cluster. In essence, we repeatedly break down a large cluster by increasing the threshold until the size of each cluster is below, e.g., \(t_{max\_size}=600\). Note that different clusters will have different thresholds. In Eq. 4, the matrix S is thresholded by a matrix of thresholds T, so each individual element of \(S^{(4)}\) has a different threshold. \(v_i=\lambda _i(S^{(4)})\) returns eigenvector i of matrix \(S^{(4)}\) multiplied by its eigenvalue. The ones in \(v_i\) indicate which data points (indexed by the ones in the vector) are in cluster i. The clustering result is in \(S^{(4)}=\delta (S^{(3)},T)\).

The clustering results of TC and the non-greedy clustering are represented as adjacency matrices \(S^{(3)}\) and \(\delta (S^{(3)},T)\). Examples of how the matrices look are shown in Figs. 3 and 4. In the examples, \(S^{(3)}\) is clustered into 2 clusters with the data points in each cluster fully connected with each other by TC. \(\delta (S^{(3)},T)\) further breaks the clusters down into 3 clusters so that each cluster is below a maximum size, using non-greedy clustering. \(\min _T |T-S^{(3)}|_{\ge 0}\) maximises the size of each cluster as much as possible, but the clusters should be subdivisions of the TC clusters, as enforced by \(S^{(4)}\le S^{(3)}\). \(|T-S^{(3)}|_{\ge 0}\) is the same as \(|\max (T-S^{(3)},0)|\), where we sum the difference between T and \(S^{(3)}\) wherever T is larger than \(S^{(3)}\).
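The block-diagonal structure illustrated in Figs. 3 and 4 can be reproduced with a small numerical sketch: threshold a similarity matrix, take connected components for the TC result, then raise the threshold inside the oversized block to split it further. The toy similarity values and thresholds below are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Toy symmetric similarity matrix S for 6 embeddings (illustrative values only).
S = np.array([
    [9.0, 5.0, 2.0, 0.1, 0.1, 0.1],
    [5.0, 9.0, 1.5, 0.1, 0.1, 0.1],
    [2.0, 1.5, 9.0, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 9.0, 6.0, 6.0],
    [0.1, 0.1, 0.1, 6.0, 9.0, 6.0],
    [0.1, 0.1, 0.1, 6.0, 6.0, 9.0],
])

def tc_labels(S, t):
    """delta(S, t) followed by transitive closure = connected components."""
    A = (S > t).astype(int)                             # S^(2): thresholded adjacency
    _, labels = connected_components(csr_matrix(A), directed=False)
    return labels                                        # S^(3) as cluster labels

labels = tc_labels(S, t=1.0)
print(labels)                      # two clusters: {0,1,2} and {3,4,5}

# Raising the threshold only inside the first block (one entry of the threshold
# matrix T) splits that cluster again, mimicking delta(S, T).
idx = np.where(labels == labels[0])[0]
print(tc_labels(S[np.ix_(idx, idx)], t=3.0))   # splits {0,1} from {2}
```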

Fig. 3 Example of adjacency matrix with 2 clusters after TC

Fig. 4 Example of adjacency matrix with 3 clusters after non-greedy clustering

The TC clustering algorithm consists of two parts. First, the 15 nearest neighbors of all face embeddings of the unlabeled data are computed, and these neighbors are used to create a graph that connects nearby embeddings whose similarities are above a threshold. The graph is then partitioned into connected components and each component forms a cluster. This is the first step of the clustering algorithm. Next, the connected components (clusters) are repeatedly broken down if their sizes are larger than a specified size (e.g. 600 embeddings); the similarity threshold is increased by a small amount to break the clusters into smaller connected components.

Ideally, the face embeddings of the face recognition model are trained to have a fixed distance between pairs of face embeddings from different face identities and a very small distance between pairs from the same identity. A greedy algorithm with a fixed similarity threshold that links embeddings above the threshold is then able to produce the clusters. This holds for a small number of face embeddings in the test set. However, for a large number of faces in the test set, some pairs of face embeddings from different identities may have similarities above a fixed similarity threshold. This is because many of these new faces in the test set are very different from the training set, and the face recognition model is confused about whether some pairs of faces are from the same or different persons at a fixed similarity threshold. The greedy algorithm therefore needs an adaptive similarity threshold (increased gradually) to further break down the face embedding clusters if they are larger than a predefined number of face embeddings (e.g. 600).
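A minimal sketch of the greedy TC stage under the stated assumptions: a precomputed 15-nearest-neighbor list with similarities, a starting similarity threshold that is raised by a small step whenever a connected component exceeds the maximum size (e.g. 600). The starting threshold and step size below are illustrative, not the paper's exact settings.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def tc_cluster(knn_idx, knn_sim, n, t0=0.55, step=0.05, max_size=600):
    """Greedy TC clustering sketch: link k-NN pairs whose similarity exceeds a
    threshold, take connected components, and re-cluster oversized components
    with a higher threshold until every cluster is below max_size."""
    labels = -np.ones(n, dtype=int)
    next_label = 0
    stack = [(np.arange(n), t0)]               # (member indices, threshold to apply)
    while stack:
        members, t = stack.pop()
        member_set = set(members.tolist())
        local = {g: i for i, g in enumerate(members)}
        rows, cols = [], []
        for g in members:                      # edges restricted to this subset
            for j, s in zip(knn_idx[g], knn_sim[g]):
                if s > t and j in member_set:
                    rows.append(local[g]); cols.append(local[j])
        A = coo_matrix((np.ones(len(rows)), (rows, cols)),
                       shape=(len(members), len(members)))
        _, comp = connected_components(A, directed=False)
        for c in np.unique(comp):
            idx = members[comp == c]
            if len(idx) > max_size:            # still too big: raise the threshold
                stack.append((idx, t + step))
            else:
                labels[idx] = next_label
                next_label += 1
    return labels
```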

After the face embeddings are broken down into clusters by TC, some of these clusters are still large and contain two or more face classes. For these clusters, a greedy clustering algorithm using connected components is not able to break them down, and a non-greedy algorithm is needed at step 4. The non-greedy algorithm looks at all pairwise distances between the face embeddings in these large clusters and is still able to identify the sub-clusters within them using between-class and within-class distances. In practice, if the number of face embeddings in a cluster is smaller than a predefined number (e.g. 150), we ignore the cluster and do not split it further in step 4, since the cluster relationships are uncertain for small numbers of embeddings. As the TC algorithm is greedy, it will always separate the easy to separate clusters and is therefore very unlikely to oversplit the data points (it is very unlikely that two or more resulting clusters are parts of a single actual face class). We have verified experimentally that the TC algorithm has high recall and low precision, so it is unlikely to oversplit the data points. The difficult clusters are left for the non-greedy clustering algorithm at step 4.

The non-greedy clustering algorithm can be implemented using the k-means algorithm, whose loss function is

$$\begin{aligned} J(C_{ij}) = \sum _{ (i,j) \in \{(m, n) \mid C_{mn} = 1\} } \Vert x_i-u_j \Vert ^2 \end{aligned}$$
(5)

where \(x_i\) is embedding i and \(u_j\) is the centroid of cluster j, with \(x_i\) assigned to cluster j (\(C_{ij}=1\)). The non-greedy clustering algorithm is based on centroid clustering.
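As a hedged sketch of step 4, the snippet below splits one oversized TC cluster with scikit-learn's KMeans. The size threshold of 150 follows the text, while splitting into two sub-clusters per pass is an illustrative simplification (the paper also uses spectral or hierarchical clustering at this step).

```python
import numpy as np
from sklearn.cluster import KMeans

MIN_SPLIT_SIZE = 150   # clusters smaller than this are left untouched (from the text)

def split_large_cluster(embeddings, labels, cluster_id, next_label):
    """Split one TC cluster into two K-means sub-clusters if it is large enough.

    Returns the updated label array and the next unused label id.
    """
    idx = np.where(labels == cluster_id)[0]
    if len(idx) < MIN_SPLIT_SIZE:
        return labels, next_label            # too small; relationships are uncertain
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings[idx])
    labels = labels.copy()
    labels[idx[km.labels_ == 1]] = next_label   # second sub-cluster gets a fresh label
    return labels, next_label + 1
```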

If we use the non-greedy algorithm directly on the face embeddings without first doing TC greedy clustering, the clustering performance is poor, because it is difficult to determine the number of clusters and the size of each cluster with a non-greedy clustering algorithm. Table 1 shows the results of spectral clustering applied directly to the face embeddings with 2577 (the exact ground truth number of clusters), 2000 and 3000 clusters. We can see that if we wrongly estimate the number of clusters even by a small fraction, the clustering performance (F-measure) differs a lot. Besides that, a non-greedy clustering algorithm looks at all pairwise distances of the face embeddings, so it takes up a lot of memory and makes it impossible to cluster millions of faces. For a small set of about 600 face embeddings, however, the non-greedy clustering algorithm can execute very fast using a small amount of memory.

Table 1 Spectral clustering results with different numbers of clusters

Figure 5 shows the cascade clustering process of our algorithm. A quick greedy clustering algorithm splits off the easy to separate large clusters (each of which may contain a few sub-clusters), and the non-greedy clustering algorithm then decomposes these large clusters into small clusters.

Fig. 5 Our two-stage clustering algorithm

Figure 6 shows the types of clusters that exist among the face embeddings. For case 1, a simple fixed threshold is able to separate the clusters. For case 2, although a fixed threshold cannot separate them, the between-class distances are larger than the within-class distances, so by increasing the similarity threshold the greedy clustering algorithm is still able to separate them. For case 3, where a bridge exists between two clusters, the greedy clustering algorithm cannot separate them; a non-greedy clustering algorithm such as K-means can. For case 4, although the two clusters can be split by a threshold and the greedy clustering algorithm, some small clusters are produced as a side effect; these small clusters (e.g. clusters with only one embedding) can be merged into the nearest clusters in step 5 of our framework. More information on the distance distribution of each case can be found in the supplementary section.

Fig. 6 Four types of clusters

3.4 Label propagation of remaining unlabeled face embeddings

Label propagation is performed at step 5 of our algorithm. The TC algorithm labels some face embeddings as noise (unlabeled); each of these face embeddings forms a singleton cluster of size one. We can also treat the labels of small clusters (e.g. size \(\le 3\)) the same as singleton clusters, as these labels are noisy. They are separated into clusters by the TC clustering algorithm. These singleton clusters are far away from all other face embeddings and are therefore labeled as noise. They can be ignored during training of the final face recognition model after the unlabeled data is labeled, as they make up a small proportion of the total number of unlabeled face embeddings. For clustering purposes, however, we can assign them to the nearest clusters if their distances are smaller than a threshold. This improves the overall F-measure of the clustering result when compared against the ground truth labels of the unlabeled faces. These singleton clusters may be formed during the greedy clustering process and most likely belong to their nearest clusters.

For a small cluster k smaller than a certain size,

$$\begin{aligned} t_{ij}&=\delta (C(i),k)\times (1-\delta (C(j),k))\times \delta _{\text {top 3}}(i,j)\times c_j^{(size)}\nonumber \\ c_{k}&=C(arg\,max_j \bigcup _{C(i)=k} s_{ij}) \end{aligned}$$
(6)

where \(t_{ij}\) is an element of a matrix: \(t_{ij}>0\) if embedding i is in cluster k and embedding j is not in cluster k, in which case \(t_{ij}\) takes the size of the cluster where embedding j is located, subject to the additional top-3 constraint; otherwise it is 0. \(\delta _{\text {top 3}}(i,j)\) returns 1 if j is among the top 3 nearest embeddings to embedding i (in terms of distance), else 0. \(\delta (C(i),k)\) and \(1-\delta (C(j),k)\) ensure that the element is zero when either embedding i does not belong to cluster k or embedding j belongs to cluster k. \(c_j^{(size)}\) is the size of the cluster that contains embedding j. The function C() returns the cluster id of an embedding. \(c_{k}\) is the cluster id to which cluster k is finally assigned. In essence, considering all the embeddings in the small cluster k, we find the largest cluster that cluster k is connected to based on the top 3 distances of each embedding, and merge cluster k with that cluster.

The labels are propagated synchronously to the unlabeled embeddings by taking the most frequently occurring label among the 3 nearest neighbors of each unlabeled embedding. Note that unlabeled embeddings that are far away from their nearest embeddings (e.g. further than a 0.4 threshold) are ignored and are not assigned new labels.
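A minimal sketch of this step 5 propagation rule under the stated assumptions: 3 nearest labeled neighbors, a majority vote, and a 0.4 distance cutoff. The neighbor search uses scikit-learn's NearestNeighbors as a stand-in for whatever k-NN index is used in practice, and the votes are computed from the labeled set only (the synchronous reading of the text).

```python
import numpy as np
from collections import Counter
from sklearn.neighbors import NearestNeighbors

DIST_THRESHOLD = 0.4   # neighbors further than this are ignored (from the text)

def propagate_labels(labeled_emb, labels, unlabeled_emb):
    """Assign each unlabeled embedding the majority label of its 3 nearest
    labeled neighbors, skipping embeddings whose nearest neighbor is too far."""
    nn = NearestNeighbors(n_neighbors=3).fit(labeled_emb)
    dists, idx = nn.kneighbors(unlabeled_emb)
    new_labels = np.full(len(unlabeled_emb), -1, dtype=int)   # -1 = still noise
    for i, (d, nbrs) in enumerate(zip(dists, idx)):
        if d[0] > DIST_THRESHOLD:                 # too far from everything: keep as noise
            continue
        votes = [labels[j] for j, dj in zip(nbrs, d) if dj <= DIST_THRESHOLD]
        new_labels[i] = Counter(votes).most_common(1)[0][0]
    return new_labels
```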

4 Experiment results

In this section, we carried out experiments to validate the effectiveness of our clustering algorithm. We tested our algorithm on the IJB-B 1845 [26] data and the MS-Celeb-1M [27] data, which are commonly used by the face clustering community for validation. We used the Fowlkes and Mallows F-measure [28,29,30] to evaluate the pairwise performance of clustering,

$$\begin{aligned} \text {Avg Recall}&=\sum _{i,j}N_{i,j}/\sum _{i}N_i \end{aligned}$$
(7)
$$\begin{aligned} \text {Avg Precision}&=\sum _{i,j}N_{i,j}/\sum _{j}N_j \end{aligned}$$
(8)
$$\begin{aligned} \text {Avg F-Measure}&=\frac{2\times \text {Avg Recall}\times \text {Avg Precision}}{\text {Avg Recall}+\text {Avg Precision}} \end{aligned}$$
(9)

where \(N_j\) is the number of possible pairs of embeddings in cluster j and \(N_i\) is the number of possible pairs of embeddings in class i. For class i and cluster j, \(N_{ij}\) is the number of possible pairs of embeddings of class i in cluster j. Class i is the ground truth label of an embedding and cluster j is a cluster label from the clustering algorithm.
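A small hedged sketch of the pairwise metric defined in Eqs. 7-9: count within-class pairs, within-cluster pairs, and pairs that agree on both, using n-choose-2 per group. The function and variable names are ours.

```python
from collections import Counter
from math import comb

def pairwise_f_measure(true_classes, pred_clusters):
    """Pairwise recall, precision and F-measure as in Eqs. 7-9.

    sum_i N_i  = pairs within each ground-truth class,
    sum_j N_j  = pairs within each predicted cluster,
    sum_ij N_ij = pairs sharing both the same class and the same cluster.
    """
    n_i = sum(comb(c, 2) for c in Counter(true_classes).values())
    n_j = sum(comb(c, 2) for c in Counter(pred_clusters).values())
    n_ij = sum(comb(c, 2) for c in Counter(zip(true_classes, pred_clusters)).values())
    recall = n_ij / n_i
    precision = n_ij / n_j
    f = 2 * recall * precision / (recall + precision)
    return recall, precision, f
```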

4.1 Omni-supervised clustering results

In this subsection, we use a 0.5 million face partition of the MS-Celeb-1M dataset from the github site [30] to investigate the use of different transformations in our omni-supervised face clustering.

Table 2 Combining two weak transformations produces a strong transformation

Table 2 shows that using two weak transformations to generate two embeddings and then combining the embeddings leads to a better result than using either embedding by itself. The two weak transformations are a 3\(\times \)3 median filter and scaling down by 4 pixels. They are called weak transformations because they produce embedding values that give only slightly higher performance on test data compared to using the original image without transformation. These two transformations are weaker than the transformations of the original image and the original image flipped left to right. However, combining the embeddings of the two transformations (by averaging) leads to a better clustering result than using each transformation alone. This is analogous to the idea of weak classifiers, where a combination of weak classifiers leads to a stronger classifier.

Table 3 Combining four input transformations to generate robust embeddings for clustering

Table 3 shows that by applying the two or four best transformations for omni-supervised clustering, we achieve about 1% and 5% improvements respectively. This is a considerable improvement obtained using only one trained model instead of multiple trained models as in the CDP algorithm. The results in this subsection use embeddings generated by our own model trained on the labeled data. We did not use the existing embeddings from the github site because they do not provide their deep network model for us to experiment with different input transformations. Our omni-supervised transformations can act like multiple committee models, without the need to train multiple models.

4.2 Compare with CDP multiple committee models

We experimented with the CDP algorithm using their committee models instead of our omni-supervised transformations to generate and combine the embeddings. The cluster results from CDP are then further refined using our step 4 cluster splitting and step 5 label propagation algorithms. We used the embeddings from the CDP github site [29]. From Table 4, we can see that on the 200k MS-Celeb-1M dataset, our clustering method with 1 committee model (top right entry in the table) has close to the same performance as CDP with 4 committee models (bottom left entry). Using 4 committee models requires much more training time, so it is not worth training 4 models to cluster this small dataset. Using 4 committee models with our clustering algorithm (bottom right entry) reaches almost the same F-measure as spectral clustering (97.22%, see Table 1) with the actual number of ground truth clusters specified as input.

Table 4 Clustering results on the 200k dataset

4.3 Compare with state of the art clustering algorithms

Table 5 shows the step 3, 4 and 5 results of our framework on the 1.7M and 5M data. The 1.7 and 5 million face datasets are partitions of the large MS-Celeb-1M dataset provided on the github site [30]; the embeddings are also downloaded from this github site. Our algorithm has a significant performance improvement over the CDP algorithm, and step 5 brings some improvement over step 4 of our framework. The timings shown in this table exclude the time taken to build the 15-nearest-neighbor graphs for the TC algorithm in step 3; it takes 15.97 and 59.83 min to compute the 15-nearest-neighbor graph for the 1.7 and 5 million data respectively. The k-nearest-neighbor graph computation is a bottleneck in clustering algorithms and is also needed by the deep learning clustering algorithms; in fact, the deep learning clustering algorithms LTC and GCN require 80 and 200 nearest neighbors respectively.

Table 5 Clustering results on 1.7 and 5 million faces of the MS-Celeb-1M dataset

In Tables 5 and 6, we use only one face recognition model without omni-supervised transformations and without multiple models, unlike the multiple committee model case in the previous subsection. We used the embeddings provided at the LTC paper github site [30] and the GCN paper github site [31]. We used spectral clustering for step 4 of our framework. Spectral clustering is applied only to clusters larger than a predefined number (e.g. 150 embeddings). We select the number of clusters in spectral clustering using the number of eigenvalues greater than a threshold, with a maximum of 5 clusters. The spectral clustering is repeated if the broken-down clusters are still larger than the predefined number.
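Below is a hedged sketch of this eigenvalue rule as we read it: build an affinity matrix over one large cluster, count the eigenvalues of the normalized affinity above a threshold (capped at 5), and pass that count to scikit-learn's SpectralClustering. The RBF affinity and the 0.5 eigenvalue threshold are our illustrative choices, not the paper's exact settings.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import rbf_kernel

MAX_K = 5            # at most 5 sub-clusters (from the text)
EIG_THRESHOLD = 0.5  # illustrative threshold on the normalized-affinity eigenvalues

def split_by_spectral(embeddings):
    """Split one large cluster with spectral clustering, choosing the number of
    clusters from the count of large eigenvalues of the normalized affinity."""
    A = rbf_kernel(embeddings)                    # illustrative affinity choice
    d = A.sum(axis=1)
    A_norm = A / np.sqrt(np.outer(d, d))          # symmetric normalization D^-1/2 A D^-1/2
    eigvals = np.linalg.eigvalsh(A_norm)
    k = int(np.sum(eigvals > EIG_THRESHOLD))      # count of "large" eigenvalues
    k = max(2, min(k, MAX_K))
    sc = SpectralClustering(n_clusters=k, affinity='precomputed', random_state=0)
    return sc.fit_predict(A)
```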

From Table 6, we can see that our clustering algorithm outperforms the LTC [3] and GCN [4] clustering algorithms even on the large datasets of 1.7 and 5 million faces. Both of these algorithms use a deep learning network. Our algorithm is much faster, which gives the user more time to vary the parameters of our clustering algorithm to further improve the performance. Note that all clustering algorithms are run on CPU and the timings of the deep learning clustering algorithms are inference times only, without training time. The GCN, LTC and our clustering timings in the table do not include the nearest-neighbor computation, as the k-nearest neighbors are computed separately. The FINCH method requires only one nearest neighbor, which is computed together with the clustering process. We can also see that the GCN algorithm works well on the small 66k IJB-B 1845 data but performs poorly on the 1.7M data; this algorithm does not scale well to large datasets. The FINCH algorithm [2] is a hierarchical clustering technique that returns a set of different numbers of clusters; even choosing the best number of clusters and computing the F-measure, our clustering algorithm still outperforms FINCH. The LTC algorithm performs poorly on the IJB-B 1845 data because the algorithm is not trained to cluster effectively on that data.

Table 6 Our clustering results compared to the state of the art algorithms

Table 7 compares the best clustering algorithm, VEGCN [32], with our algorithm with and without omni-supervision. It clearly shows that our algorithm outperforms VEGCN even without omni-supervision. The VEGCN result reported here is slightly lower than in the original paper because we do not know the exact hyperparameters and we reduced the length of each embedding from 512 dimensions to 256 dimensions so that the embeddings could be input into the deep learning network of VEGCN. The embeddings are reduced from 512 to 256 dimensions simply by adding the first 256 dimensions to the next 256 dimensions and normalizing the resulting embeddings to unit length.
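A small sketch of this 512-to-256 reduction as described: fold the two halves together by addition and renormalize each embedding to unit length (function and variable names are ours).

```python
import numpy as np

def fold_to_256(embeddings_512):
    """Reduce 512-dim embeddings to 256 dims by adding the two halves,
    then renormalizing each embedding to unit length."""
    folded = embeddings_512[:, :256] + embeddings_512[:, 256:]
    norms = np.linalg.norm(folded, axis=1, keepdims=True)
    return folded / norms
```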

Table 7 Compare our clustering algorithm (with and without omni-supervision) and the best algorithm VEGCN [32]

Table 8 shows the percentage of singleton clusters produced by the different clustering algorithms. Although FINCH has 0 singleton clusters, it has poor clustering performance and is not able to identify noisy face images in the data. We can see that our clustering algorithm is more robust, producing fewer singleton clusters and a much higher clustering F-measure than the deep learning clustering algorithms. Higher clustering performance is needed for semi-supervised learning.

Table 8 Percentage of singleton clusters by different clustering algorithms on the 1.7M MS-Celeb-1M data

Figure 7 shows an example of the ground truth cluster distribution of four classes of face embeddings. The purple dotted circles are the large clusters returned by step 3 (greedy clustering) of our clustering algorithm; these large clusters can be further broken down into small clusters by step 4 (non-greedy clustering). The green dotted circle is a singleton cluster created by the first stage of our clustering algorithm, which can be re-merged into the nearest cluster at step 5.

Fig. 7 Example of the embedding distribution of four face identities in the first two principal components space

4.4 Different variations of our clustering algorithm

Table 9 shows the results of our clustering algorithm at step 5 if we treat clusters of size smaller than or equal to 3 (instead of 1) as noisy clusters. We can see that this assumption helps to improve the recall and therefore also the F-measure of the clustering results.

Table 9 Clustering performance of our algorithm with assumption of noisy clusters of different sizes

We used the embeddings provided at the LTC paper github site [30]. Table 10 investigates the use of different non-greedy clustering algorithms in step 4 in terms of execution speed and F-measure. We used the squared error,

$$\begin{aligned} SE=\frac{1}{N}\sum _{i=1}^{N}{\Vert x_i-C(x_i)\Vert }^2 \end{aligned}$$
(10)

to select the number of clusters for the K-means, hierarchical clustering and Birch algorithms. N is the number of face embeddings, \(x_i\) is face embedding i and \(C(x_i)\) is the centroid of the cluster of face embedding \(x_i\). In this experiment, a face embedding cluster is repeatedly broken into 2 if it contains 2 or more clusters. A cluster is broken into 2 if

$$\begin{aligned} SE_2<SE_1/1.1 \end{aligned}$$
(11)

where \(SE_1\) and \(SE_2\) are the squared errors when using one and two clusters respectively. This is similar to the elbow method for selecting the number of clusters, where we keep increasing the number of clusters as long as there is a large decrease in error. The DBSCAN algorithm needs the eps (neighborhood distance) parameter to be specified, and the F-measure is sensitive to this parameter. Spectral clustering selects the number of clusters by counting how many eigenvalues are larger than a threshold. Detailed information on which python package is used to implement each type of non-greedy clustering algorithm can be found in the supplementary section of our paper.
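A hedged sketch of the split rule in Eq. 11 using scikit-learn's KMeans: compare the squared error of one cluster against two and split only when the two-cluster error drops by more than the 1.1 factor. The recursion and bookkeeping are ours, not the paper's exact implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

SPLIT_FACTOR = 1.1   # split when SE_2 < SE_1 / 1.1  (Eq. 11)

def recursive_split(X, indices, out_labels, next_label):
    """Recursively split a cluster into 2 while the elbow criterion is met."""
    if len(indices) < 2:
        return next_label
    centroid = X[indices].mean(axis=0, keepdims=True)
    se1 = np.mean(np.sum((X[indices] - centroid) ** 2, axis=1))          # SE_1 (Eq. 10)
    km = KMeans(n_clusters=2, n_init=1, random_state=0).fit(X[indices])
    se2 = np.mean(np.sum((X[indices] - km.cluster_centers_[km.labels_]) ** 2, axis=1))
    if se2 < se1 / SPLIT_FACTOR:                  # large enough drop: accept the split
        left, right = indices[km.labels_ == 0], indices[km.labels_ == 1]
        next_label = recursive_split(X, left, out_labels, next_label)
        out_labels[right] = next_label            # fresh label for the second half
        next_label = recursive_split(X, right, out_labels, next_label + 1)
    return next_label
```

For one TC cluster with embeddings X and a labels array initialized to a single id, calling recursive_split(X, np.arange(len(X)), labels, labels.max() + 1) refines the labels in place.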

Table 10 Clustering results of our algorithm using different non-greedy clustering algorithms at step 4 on the 1.7M MS-Celeb-1M data

Using K-means with 1 random initialization is much faster than using 10 random initializations, with only a slight loss of F-measure; K-means is the fastest of all the algorithms. DBSCAN is sensitive to the choice of the eps parameter: the result differs a lot for a slight change in the parameter. Hierarchical clustering works best with the 'Ward' merging criterion, which is also the best overall algorithm. The 'Single' criterion leads to greedy merging and tends to form long chain clusters, and therefore performs poorly. 'Complete', 'Average' and 'Weighted' are slightly inferior to 'Ward', although they behave similarly. 'Centroid' and 'Median' perform poorly as they disregard the sizes of the clusters when merging. Spectral clustering is too time consuming. Although 'Ward' has the best performance and is slightly better than spectral clustering, we think that spectral clustering is theoretically better as it can partition non-circular and irregularly shaped clusters; therefore, for Table 6 we chose to use spectral clustering.

We can also select the number of clusters in 'Ward' hierarchical clustering using the dendrogram (see row 'Hierarchy clustering (ward with selection of number of clusters)' in Table 10). We search between 1 and 14 clusters and cut the dendrogram where the distance between the nearest split and merge operations is larger than 1.1, using the largest such number of clusters. The clusters are repeatedly broken down over a few iterations. Using this approach, we achieve an F-measure of 79.27% in 0.73 min. This is about 0.4% more accurate than Ward with repeated splitting into 1 or 2 clusters at each iteration, with about the same speed.
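A hedged sketch of this dendrogram-based choice, using scipy's Ward linkage: inspect the heights of the top merges and cut where the gap between consecutive merge heights exceeds a 1.1 factor, capped at 14 clusters. This gap test is our paraphrase of the paper's split/merge distance rule, not its exact implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

MAX_CLUSTERS = 14
GAP_FACTOR = 1.1   # cut where a merge height jumps by more than this factor

def ward_dendrogram_labels(X):
    """Pick a cluster count from the Ward dendrogram and return the labels."""
    Z = linkage(X, method='ward')
    heights = Z[:, 2]                        # merge distances, in increasing order
    k = 1
    # Examine the top MAX_CLUSTERS-1 merges; a big jump suggests cutting there,
    # keeping the largest number of clusters that shows such a jump.
    for i in range(1, min(MAX_CLUSTERS, len(heights))):
        lower, upper = heights[-i - 1], heights[-i]
        if upper > lower * GAP_FACTOR:
            k = i + 1                        # cutting below this merge gives i+1 clusters
    return fcluster(Z, t=k, criterion='maxclust')
```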

Table 11 Optimal clustering results at step 4 on the 1.7M MS-Celeb-1M data

Table 11 shows the optimal clustering results at step 4. These are obtained by taking the cluster results at step 3 and splitting them according to the ground truth within each cluster. If we do not perform additional clustering at step 4 on the clusters that are smaller than or equal to a predefined number (150 samples), the optimal F-measure is 80.58%; our result is about 1.3% lower than this optimum. If we perform additional clustering on all clusters at step 4, the optimal F-measure is 82.44%; our result is about 3.2% lower. As for the small clusters from step 3, it is difficult to cluster them accurately at step 4, so these small clusters are ignored at step 4.

The value of 150 is chosen because the number of images per identity in MS-Celeb-1M is about 100, which is slightly less than 150. The value of 600 is chosen for the transitive closure because it is a multiple of 100, so that after transitive closure clustering there is still room for K-means to further break the clusters down into smaller clusters.

4.5 Semi-supervised face recognition results

We trained an initial 14-layer shallow model, a modified version of ResNeXt [33], and used it to label the unlabeled data. A final 50-layer ResNeXt model is then trained on all the labeled and unlabeled data. Note that the face identities in the labeled and unlabeled data can overlap; separate classifier heads for each part of the data could be used to overcome this overlap. In our case, we assume that the identities in the labeled and unlabeled data do not overlap, so only one classifier head with the ArcFace loss function is used. The results are shown in Table 12. 'Supervised model on labeled data' is a model trained on the labeled data only. 'Supervised model on all data' is trained on both labeled and unlabeled data assuming we have the labels of the unlabeled data. We can see that our semi-supervised model achieves good performance (almost the same identification performance on the MegaFace dataset) compared to the fully supervised model that uses the ground truth labels of the unlabeled data for training. We use the labeled and unlabeled data provided on the github site [30].

Table 12 Semi-supervised face recognition results

Our semi-supervised model with removal of clusters of size \(\le 4\) achieves nearly the same identification rate as our supervised model on the whole data, validated on the MegaFace dataset. Without the removal of small clusters of size \(\le 4\), the semi-supervised result is clearly lower than that of the model with the removal of small clusters. Our semi-supervised model outperforms the supervised model trained on labeled data only by more than 10%.

5 Conclusion

In this paper, we have shown that combining two weak transformations leads to a strong clustering result, analogous to how combining weak classifiers leads to a strong classifier. Our omni-supervised clustering gives about a 5% improvement over our clustering algorithm with no transformation. For clustering of the 200k dataset, we have shown that using one committee model has about the same performance as using four committee models if steps 4 and 5 of our clustering are performed after the step 3 TC. We have tried substituting step 4 of our clustering algorithm with many other classical clustering algorithms and have shown that K-means, hierarchical clustering (Ward) and spectral clustering perform similarly well. We have shown that a perfect step 4 clustering would be only about 4% away from our hierarchical clustering (Ward) result. We have trained a semi-supervised model using both labeled and unlabeled data (labeled by clustering); it has almost the same performance as the model trained on all ground truth data without stripping the labels from the unlabeled data.

Although our method is an optimization technique, it is more interpretable and explainable than the deep learning approach, and it runs in much less time using less memory. In future work, we will add convex optimization clustering to further improve semi-supervised learning.