1 Introduction

The use of biometric authentication methods has been growing in recent decades. These methods use biometric factors such as iris, fingerprint, voice, and face for authentication [58]. However, the face is known as one of the popular biometric factors. The reason is the simplicity of using the face image and also the fact that there is no need for interaction between the person and the system. Face recognition is frequently used in video surveillance [2], law enforcement [11], access control systems, facial rejuvenation [7] and so forth due to its ease of use as well as the possibility of remote authentication [29].

Despite the many advances that have been made in face recognition, it encounters many challenges, that each method is trying to solve one or more of them. Conditions such as illumination and changes in the pose of the face, occlusion and expressions [20] make it difficult to build a robust face recognition system [4]. Although some methods have worked these challenges, they require high processing time, high memory consumption, and are relatively complex [29, 32].

In addition to the above mentioned challenges, other factors such as large number of images in the dataset [44] and having only one sample per person [35] are also problematic. Most face recognition methods are not scalable and their performance dramatically decreases as the number of images increases. Also, most methods require several training samples for identification and fail if the number of samples is small [10]. The goal of this paper is to present a novel approach addressing problems of large number of images in the dataset and having only one sample per person.

The process of recognizing faces consists of three main steps: Preprocessing, Feature extracting, Matching features [32]. The first step is to divide the image into two segments, face and non-face [68], and also after detecting face, the image quality is enhanced in terms of brightness, contrast, etc. In the second step, discriminative features are extracted from face images to make it easier to identify faces. Finally, in the last step, extracted features are learned by classifiers to predict the identity of the new faces.

Face recognition methods are mainly divided into three categories based on their perspective: Holistic, Local, and Hybrid [29]. The first category, also known as subspace methods (such as Principal Components Analysis (PCA) [61], Linear Discriminant Analysis (LDA) [53], Independent Components Analysis (ICA) [6]), projects the entire image to a smaller space. The second category can be divided into two subcategories, local-appearance-based methods (such as Local Binary Patterns (LBP) [1] and its variations, Histogram of Oriented Gradients (HOG) [18]) that divide the images into blocks and extract features locally in each block, and the key-point-based methods (such as Scale-Invariant Feature Transform (SIFT) [15], Speeded-Up Robust Features (SURF) [75]) that detect points of interest in images, then extract local features around those points. The third category uses a combination of holistic and local methods (such as Gabor-LBP-LDA [8], DCT-PCA-GA [41]) to obtain better performance in recognition [60].

The purpose of this paper is to focus on the challenge of large dataset size and the Single Sample Per Person (SSPP) problem. Since the problem of face recognition can be considered as a multiple optimization problem [32], the existing methods usually focus on a single problem to solve. For example, Deep Convolutional Neural Network (DCNN) can work well with large datasets [51, 56, 66]. But they also have certain problems and limitations. Deep learning methods require a large number of training samples to work well [30]. They have high resource consumption and need too much time to be trained. These methods have many hyperparameters that must be specified [27] and also they are always at risk of overfitting [65].

Some other methods use clustering techniques to divide images into clusters in order to solve the problem of large dataset size. In this case, face recognition can be done with a relatively small number of images within each cluster [44, 46]. These methods are also associated with problems such as finding the right clustering method with high performance, and sometimes the need for a large number of samples for training.

To solve the SSPP problem, some methods project the images into another space where the distance between the samples is maximum [42, 71]. Another set of methods use an auxiliary dataset to learn the original gallery images with help of the generic set images [25, 26, 69]. Some other methods divide images into several patches and consider the extracted features of each patch as a new sample for each person [10, 37]. Other methods transform images to create new images in different spaces of the sample image to generate new samples for each person and use them for identification [38, 72].

This paper proposes a hierarchical face recognition method that is robust to large datasets and single sample per person. The proposed method uses a simple clustering approach to hierarchically divide images into clusters with fewer images. The clustering method uses a user-defined threshold to cluster images. Within each cluster, two sets of features are extracted using Non-negative Matrix Factorization (NMF) [28] and Fast Retina Keypoint (FREAK) [3] descriptors. NMF is one of the holistic methods that projects the entire image space to a feature space with smaller dimensions. Then it extracts basic features that are very similar to the human visual system [31]. FREAK descriptor is one of the local key-point-based methods that extracts binary features from the region of interests. Finally, these two set of features will be used to identify new faces using the nearest neighbor with City-Block distance measure. The proposed method combines these two features to take advantage of both holistic and local methods. The results of experiments performed on the FERET [47] dataset and an extended image set collected from other standard datasets indicate the good performance of the proposed method.

The main contributions of this paper include 1) a robust face recognition approach that solves the challenge of having a high number of images in the dataset. Using a simple clustering method, the images are divided into smaller sets, so the identification task can be done on smaller image sets. 2) Using one sample per person. Combining the Non-negative matrix factorization method and FREAK descriptors helps us to extract discriminative features with only one sample per person. The proposed method investigates on these two problems simultaneously. While previous works could only solve one of the problems at a time. 3) The proposed method does not require high-resolution images. The discriminative features can be extracted using 120 × 120 pixel images.

The rest of this paper is organized as follows. Section 2 summarizes some of the methods related to the subject of this paper. The proposed method is descibed in details in Section 3. To evaluate the proposed method, experiments have been performed, the explanations and results of these experiments are presented in Section 4. We then conclude the proposed method and discuss about the potential future directions in Section 5.

2 Related works

This section examines several face recognition methods that have been proposed to address the challenges of large dataset size or SSPP. Also, some other related works that use hierarchical approaches and hybrid features for image or face applications are examined. Lu et al. [36] proposed a method based on the LDA approach. In this method, to deal with face recognition on large-scale datasets first a clustering method is used to divide the images into smaller size sets. This method uses a “separability criterion” to partition the training set into a set of K smaller and simpler subsets or Maximal-Separability Clusters (MSCs). Then, inside the clusters, feature extraction is done using the LDA method. In the method presented by Dugelay et al. [46] like Lu’s method [36], images are clustered before classification. The difference is that the Maximum Likelihood (ML) and Expected-Maximization (EM) is used for clustering.

Choi’s method [12] is one of the methods based on hierarchical classification. This method uses a combination of the wrinkle features extracted using Gabor filters and the skin features extracted using a local binary pattern. A hierarchical classifier based on a Support Vector Machine (SVM) and a Support Vector Regression (SVR) is used in this method. Some other methods used Radial Basis Function (RBF) networks to build a hybrid and hierarchical classifier [9, 17]. An ensemble of the RBF networks is used to build a hierarchical classifier. These methods were evaluated on a low number of subjects.

FaceNet [51] is a deep learning model that directly learns to map images to a compact Euclidean space. In this space, the distance between the images is directly related to their similarity. This model has been trained using about 260 million images in 1000 hours. This method has been evaluated on the Labeled Faces in the Wild (LFW) dataset for face verification tasks, which has good results.

Otto et al. [44] proposed a method based on Rank-Order clustering to cluster millions of faces by identity. LFW dataset images were used along with 123 million other unlabeled images. The Rank-Order method is a form of agglomerative hierarchical clustering that uses the nearest neighbor approach. This method starts with each image in its own cluster and starts to merge the two closest clusters. The minimum distance between any members in the clusters is considered as the distance between two clusters.

Zhang et al. [72] proposed two methods based on Singular Value Decomposition (SVD) perturbation to deal with the SSPP problem. In the first method, the original image is linearly combined with its derived image gotten by perturbating the image matrix’s singular values. In the second method, the derived images are regarded as independent images that could augment the training image set. After all, PCA is performed on the joined images and derived images.

Lu et al. [37] proposed a Discriminative Multi-Manifold Analysis (DMMA) method that extracts discriminant features from image patches. To do this, the image is first divided into several non-overlapping patches to generate a set of images for each person. This image set is then used to extract the features and face recognition tasks. Liu et al. [35] used a modified K-means method to cluster two groups of people. Then they used the DMMA method to extract features from images. These two steps were repeated to obtain a binary tree projection matrix of fast DMMA. Finally, this tree is used for identification.

Xin et al. [67] created new images using geometric transformation from original images. Then the main images along with the new images are considered as a set of training images and feature extraction is done using the Local Gabor Binary Patterns (LGBP) method. Finally, the Locality Preserving Discriminant Analysis (LPDA) method is used to reduce the dimension of the features. Then the identity of the images is determined using the nearest neighbor method with Euclidean distance.

Hu et al. [24] proposed a method that uses lower-upper (LU) decomposition to generate two new image sets of original images. Then two approximation images were reconstructed from the two basis image sets. The authors used the Fisher Linear Discriminant Analysis (FLDA) method to extract features from the original image and its two approximation images.

Pan et al. [45] proposed an algorithm that is based on Locality Preserving Projection (LPP) to solve the SSPP problem. In this algorithm, first, transfer sources are screened to obtain the selective sample source using the whitened cosine similarity metric. Then vectors of source faces and target faces are projected into feature subspace by LPP. Finally, the feature transfer matrix can be calculated to approximate the mapping relationship on the source and target faces.

Based on the transfer learning method of Zeng et al. [71], new samples are created using a traditional sample expansion approach. New samples are then used to fine-tune the DCNN model. The Class-Level Joint Representation with Regional Adaptive Convolution Features (CJR-RACF) method [70] was proposed by Yang et al. to solve the SSPP problem. This method learns the local and global features of images that are robust to facial variations. Min et al. [40] also used transfer learning to train a deep model using a large-scale dataset. Then, a sample expansion method in the feature space called K-Class Feature Transfer (KCFT) was introduced to enrich intra-class variation information for each image. Finally, expanded samples are used to train the classifier.

Nikan and Hassanpour [42] proposed a solution to the problem of large-scale datasets and SSPP based on NMF. In this method, images are first divided into two parts, top and bottom, then the basic features are extracted separately from the lower and upper parts, using the NMF method. Finally, the same images from the same identities are identified using the correlation distance measure.

In recent years, several deep learning methods for face recognition have been introduced. For example, DeepFace [57], DeepID [55], and ResNet [23] models, which were among the first deep methods used to identify and verify faces in unconstrained environments, performed well. However, these methods were very complex [22]. The next generation of deeper models such as Face++ [76] and FaceNet [51], which also had deeper structures, were able to improve the performance of previous methods as well.

The state-of-the-art in deep learning is the ArcFace [14] method, which uses the ResNet [23] structure and the Additive Angular Margin loss function to learn the face images. These approaches have even more accuracy than humans, which can identify a face among millions of images [51]. However, the training process requires a large number of samples from each person [30]. The resolution of the images strongly affects the performance of these models. These methods have complex computations and need a long time to be trained [27].

3 Proposed method

In general, the proposed method consists of three main steps: preprocessing, clustering, and face recognition. The face images are extracted from the original images. They are improved in terms of brightness and contrast. Then a simple method is proposed for clustering images into two clusters using NMF features. Finally, in each cluster, two types of features including NMF and FREAK descriptors are extracted and matching faces is done using the nearest neighbor method [13]. The whole process of the proposed method is drawn in Fig. 1.

Fig. 1
figure 1

Diagram of the proposed method

3.1 Preprocessing

Since images may contain non-face regions, it is necessary to specify the location of the face in the image in the first step. Face detection at this stage is done using the known Viola-Jones [62] algorithm. We also examine another state-of-the-art face detection method known as Multi-Task Cascaded Convolutional Networks (MTCNN) [73] in our experiments that results show its weaker performance than the traditional Viola-Jones method.

Then, the brightness and contrast of the extracted faces are improved by using conventional enhancement methods including Histogram Equalization (HE) [54], and Contrast Limited Adaptive Histogram Equalization (CLAHE) [49].

3.2 Clustering

Extracted face images after enhancements are ready for clustering. For this goal, we propose an innovative and simple approach. The images are first divided into two equal non-overlapping blocks, upper and lower parts. The upper part includes the eyes, eyebrows, and forehead, and the lower part includes the nose, lips, and chin. These steps are shown in Fig. 2. In order to achieve a better clustering, discriminative features should be extracted from images. Then, the features will be extracted separately from both parts.

Fig. 2
figure 2

Four main steps before clustering images

3.2.1 Feature extraction for clustering

The NMF method is used to extract the discriminative features from images. This method is known as an unsupervised approach in which all elements of the decomposed matrix and the obtained matrix factors are forced to be non-negative [16]. This non-negativity makes the resulting matrices easier to inspect and to visualize [64]. Also NMF is able to decompose objects into its parts [43]. For instance, face images can be decomposed into basis elements of the face, including eyes, nose, lips, etc. The NMF method factorizes any given matrix \(V\in \mathbb {R}^{m \times n}\) into two matrices, \(W\in \mathbb {R}^{m \times r}\) and \(H\in \mathbb {R}^{r \times n}\) with an integer \(r \ll \min \limits (m, n)\), and both matrices are non-negative (1).

$$ V \approx WH $$
(1)

Factors W and H are then found by optimizing the problem (2) [28]. The matrix W is called the base matrix and the matrix H is called the coefficient matrix. In other words, according to this method, each matrix V can be written as a linear combination of columns of matrix W which matrix H is its coefficients.

$$ \min f(W, H) = \frac{1}{2} \parallel V - WH {\parallel^{2}_{F}} $$
(2)

3.2.2 Clustering approach

After extracting the feature vectors from the top and bottom parts of the images, the average feature vectors of each part are calculated. Then the difference between the mean of the top and bottom features is determined. Finally, a threshold is used to cluster the images. In this way, images whose feature difference is greater than or equal to the threshold are in one cluster and the rest of the images (images whose feature difference is less than the threshold level) are in another cluster (3).

$$ C(x) = \left\{\begin{array}{ll} 1 & \quad\text{if } diff_{x} < \tau \\ 2 & \quad\text{if } diff_{x} \ge \tau \end{array}\right. $$
(3)

In (3), diffx = mean(ftop) − mean(fbottom), where ftop and fbottom correspond to the extracted feature vectors from top and bottom parts respectively. The threshold level (τ) considered in this paper is the median of the differences. By considering the median as the threshold level, the images will be practically divided into two clusters with equal size. At this point, all gallery images are divided into two clusters.

Now with a given probe image y, the cluster related to that person must be specified. To do this, the average of each cluster is considered as the centroid of that cluster. Features of the probe image, like the gallery image, is extracted by the NMF method after being divided into upper and lower parts. Then the distance of the test image with the centroids is calculated and the nearest cluster is determined. If the distance of the test image from the nearest cluster is larger than the threshold (\(\tau \prime \)), the cluster of the nearest centroid is announced as the probe image’s cluster. Otherwise, the closest image of the gallery is specified and its cluster is announced as the probe image’s cluster (4).

$$ C(y) = \left\{\begin{array}{ll} C(c) & \quad\text{if } d_{y, c} \ge \tau \prime \\ C(\arg\min(D)) & \quad\text{if } d_{y, c} < \tau \prime \end{array}\right. $$
(4)

In (4), D is the distance matrix calculated by the distance measure between the probe image and all the images in the gallery. c is the centroid of the closest cluster to the probe image y, and dy,c is the distance between y and c. \(\tau \prime \) is the threshold level that specifies the way that we must use to identify the test image’s cluster. This parameter was selected by trial and error.

3.3 Face recognition

After identifying the new image’s cluster, its identity must be determined from the faces within the same cluster. At this stage, new features must be extracted to fit the members of the cluster. To do this, two methods of feature extraction, including NMF and FREAK descriptors will be used. Shi-Tomasi’s method [52] is used to detect the points of interest.

3.3.1 Feature extraction for recognising faces

Shi-Tomasi’s method (also known as Min-Eigen) is one of the interest point detectors. In fact, it’s an improvement over the traditional Harris’s method for detecting corner points. For detecting corners, the Sum of Squared Differences (SSD) between any region in the image and patches shifted in some directions should be computed. Then a corner response function (R) along with a threshold indicates whether it’s a corner region or not [50].

Harris and Stephens [19] computed an approximation to the second derivation of the SSD that is shown in (5). Since the type of the regions in images can be determined by the amount of changes in directions around it, some inferences can be made based on the value of the eigenvalues of H (5):

  1. 1.

    If there is no change in any directions, then it’s a flat region (λ1 ≈ 0, λ2 ≈ 0).

  2. 2.

    If there are some significant changes in one direction, then it’s an edge region (λ1 ≈ 0, λ2 has large value).

  3. 3.

    If there are some significant changes in all directions, then it’s a corner region (λ1 and λ2 both have large values).

    $$ H = \left[\begin{array}{cc} \widehat{{I_{x}^{2}}} & \widehat{I_{x} I_{y}} \\ \widehat{I_{x} I_{y}} & \widehat{{I_{y}^{2}}} \end{array}\right], $$
    (5)

    where Ix and Iy are the partial derivatives of image I, and \( \widehat {~} \) denotes averaging (i.e. summation over the patch). Then Harris and Stephens used (6) to compute the corner response. Shi and Tomasi [52] then conclude that it is better to use the minimum of the eigenvalues as the corner response (7). Shi-Tomasi’s method has been used in some object tracking [33] and action recognition [63] applications.

    $$ R = |H| - k(trace(H))^{2} $$
    (6)
    $$ R = \min(\lambda_{1}, \lambda_{2}) $$
    (7)

After detecting key-points (corner points), descriptors around those key-points should be calculated. We use FREAK descriptor to extract binary features from Shi-Tomasi’s key-points. FREAK is inspired by the human visual system and precisely the retina. Based on FREAK a retinal sampling grid is used which is also circular and have high density in the center. Each circle represents the standard deviation of a Gaussian kernel that applied to the corresponding sampling points. It constructs descriptor F that is a binary string formed by a sequence of one-bit Difference of Gaussians (DoG):

$$ F = \sum\limits_{0\le a < N} 2^{a} T(P_{a}), $$
(8)

where

$$ T(P_{a}) = \left\{\begin{array}{ll} 1 & \quad\text{if } (I({P}_{a}^{r_{1}}) - I({P}_{a}^{r_{2}})) > 0 \\ 0 & \quad\text{if } (I({P}_{a}^{r_{1}}) - I({P}_{a}^{r_{2}})) \le 0 \end{array}\right. $$
(9)

and \(I({P}_{a}^{r_{1}})\) denotes the smoothed intensity of the first receptive field of the pair Pa. Then a greedy algorithm is used to select best pairs to find uncorrelated tests with high variance. Finally, they are normalized to make descriptors robust to orientation. FREAK descriptor’s computation is faster than the previous methods such as SIFT and SURF [3].

NMF features are extracted exactly like the clustering step. Then corner points are detected using Shi-Tomasi’s method, and FREAK descriptors are calculated on detected key-points. NMF features are extracted globally from the image, while FREAK descriptors are extracted locally. Global features describe the image as a whole and can be interpreted as a particular property of the image involving all pixels; While, the local features describe individual pixels with their neighbors, and in some approaches, they detect keypoints and describe the regions around these keypoints [21]. Combining these two features makes the proposed method more powerful. Because by combining these two features, both the generality of the image and the details of the image are given importance.

3.3.2 Matching faces

After extracting the features, the image distances matrix is calculated for the two types of features. Finally, the total distance matrix is calculated by combining the distance matrices of the NMF features and FREAK descriptors. In principle, to obtain the total distance matrix, we must calculate the average of the two matrices. In the proposed method, we calculate the weighted average to make the effect of the NMF features different from FREAK descriptors (10). Weight values (α, β) are calculated by trial and error. Then, matching the same images is done by applying the 1-nearest neighbor method with the City-Block distance measure.

$$ D_{total} = \frac{\alpha \times D_{NMF} + \beta \times D_{FREAK}}{\alpha + \beta} $$
(10)

4 Experiments

In this section, we will discuss the results of experiments performed to evaluate the proposed method. To perform the experiments and to show the robustness of the proposed method to large datasets, the well-known FERET [47] dataset was used as well as an extended image set consisting of images collected from several standard datasets.

The FERET dataset was collected from August 1993 to July 1996 in 15 sessions from 1,199 people. This dataset has a total of 14,126 images that each person can have multiple images. Images belong to each subject were categorized into frontal, half, quarter, profile, rotated, and duplicate images. In this article, 990 images from the frontal set (fa) are used for training and their corresponding fb images for testing.

Also, an extended image set is collected consisting of frontal images from several standard datasets. These datasets are FERET, MUCT [39], Psychological Image Collection at Stirling (PICS),Footnote 1 FEI [59], Face94.Footnote 2 Their main difference is in the brightness or change of expressions. In total, the number of collected subjects reaches 1684 people (Fig. 3).

Fig. 3
figure 3

We evaluated our method using FERET [47] dataset, and an extended image set including FERET [47], MUCT [39], PICS, FEI [59], and Face94 images

To perform the experiments, the images are first converted to grayscale and the face image is extracted using the Viola-Jones [62] algorithm. We also examine another face detection method that is called MTCNN [73]. The results shown in Table 1 indicate that Viola-Jones method works better than MTCNN for face image extraction.

Table 1 Comparing the performance of two face detection methods (bold number shows the best result)

Face images are then enhanced using histogram equalization and CLAHE. To extract the NMF features, images are downsampled to size 20 × 20 pixel. 120 × 120 pixel images have been used to detect corner points and extract FREAK descriptors.

The first experiment examines the performance of the proposed clustering approach using F-measure (F1-score). F-measure is known as the harmonic mean of precision and recall. F1-score ranges from [0,1], where 0 indicates the worst and 1 is the best [34].

We compared our clustering method with some common clustering methods including K-means, K-medoids, Agglomerative Hierarchical Clustering (Agglomerative Nesting - AGNES), DBScan, and Spectral Clustering. However, DBScan method failed to cluster images into separate clusters, so this method is not mentioned in the results.

According to Fig. 4, the proposed method performs better than the other clustering methods in terms of both F-measure and final recognition accuracy. AGNES and Spectral Clustering methods also have higher recognition rate, but their clusters are not balanced in terms of cluster size. The balance of the clusters is evaluated using the ratio of cluster members according to (11). For example they split images into two clusters with the size of 20 and 970, while our method guarantees that final clusters will be in the same size. Balanced clustering tends to avoid forming outlier clusters, and thus has beneficial regularizing effect [74].

$$ R = \frac{|\text{members of the small cluster}|}{|\text{members of the large cluster}|} $$
(11)
Fig. 4
figure 4

Performance of the clustering methods on a) FERET dataset, b) extended dataset

Another experiment was performed to investigate the effect of the clustering steps that applied to the dataset. For this purpose, we increased the number of clustering hierarchies. In this case, at each hierarchy, each cluster is further divided into two clusters. For example, if there are two hierarchies of clustering, in the first hierarchy we will have two clusters and in the second hierarchy, we will have two more clusters for each cluster; So the final number of clusters will be four. In this way, a binary tree of clusters can be formed. The structure of the clustering hierarchies is shown in Fig. 5. The results of this experiment are shown in Fig. 6. As can be seen, the final accuracy decreases as the clustering steps increases. Hence the best accuracy is achieved when only one clustering hierarchy is performed.

Fig. 5
figure 5

Hierarchies of the proposed clustering approach

Fig. 6
figure 6

Effect of clustering hierarchies, evaluated on a) FERET dataset, b) extended dataset

According to the proposed method, the images were first divided into upper and lower parts. Then we can extract features separately from both parts. We consider two cases to investigate the effect of dividing the images into two parts in the clustering step. In the first case, as described in the proposed method, the images were divided into two parts, top and bottom, and the NMF features were extracted from each part separately. In the second case, the NMF features were extracted from the whole image. Then, the feature vector is divided into two equal parts so that we can calculate the difference between the averages of the two parts.

The results of this experiment are reported in Table 2. These results show that dividing images into two parts and extracting features separately from both parts gives us a much better performance. The rank value in this experiment was obtained by trial and error.

Table 2 Effect of dividing images into two parts in clustering step (bold number shows the best result)

In another experiment, the effect of the different sizes of dataset is evaluated. In this experiment the dataset size starts with a number of 100 images. These images are randomly selected from all the images of the dataset, and the final accuracy is the average accuracy of multiple runs of the proposed method. The results of this experiment are shown in Fig. 7. The proposed method performs well even for large sets of images. As can be seen, as the dataset size increases, the final accuracy decreases at a very small rate.

Fig. 7
figure 7

Effect of the number of images in the dataset on the performance of the proposed method

The experiments were performed on a personal computer with Intel Corei5, 3.2 GHz CPU, and 12 GB RAM. To examine the computational complexity of the proposed method, it was tested with different dataset sizes. The results are reported in Fig. 8. In Fig. 8a, the running time is required for the proposed method to learn the gallery set of the dataset, is shown. In Fig. 8b, the running time is required for an unseen test image to be compared and identified by the proposed method, is recorded. Both running times increase linearly with the increase of the dataset size. It can be said that the computational complexity of the proposed method is approximately in order of N, i.e. (\(\mathcal {O}(N)\)), which N represents the number of images in the dataset.

Fig. 8
figure 8

Required running time for a) training step, b) test an unseen image

We compare the performance of our method with some other existing methods that were described in Section 2. These methods mainly focus on solving dataset size and SSPP problems. The disadvantage of the methods proposed to solve the problem of large dataset size is the need for a large number of samples for training. On the other hand, the disadvantage of the methods proposed to solve the problem of SSPP is the weak performance on the large datasets.

SPCA + [72], ML-EM [46], FLVD-SVD [48], DMMA [37], Fast-DMMA [35], LU [24], LPP [45], and NMF [42] methods use handcrafted features, while TDL [71] and KCFT [40] methods use transfer learning models to extract features. However, they were evaluated on small image sets. The accuracy of recognition on the FERET dataset achieved by the mentioned methods is listed in Table 3. The results show that the proposed method outperforms previous methods. While these methods just focus on one problem, the proposed method considers both of the dataset size and SSPP problems simultaneously.

Table 3 Comparison of the proposed method with the existing methods on FERET dataset (bold number shows the best result)

To investigate further with more images in the dataset, we implemented Nikan’s method (NMF) [42], which is the latest published method in this field, and evaluated it on the extended dataset. The result of this evaluation is shown in Table 4. The results indicate that our proposed method can perform better than the NMF method.

Table 4 Comparison of the proposed method with the existing methods on the Extended dataset (bold number shows the best result)

One of the important parameters in the proposed method is the rank of the NMF method (r). The value of rank should be an integer that is far less than the dimensions of the initial matrices [42]. For example, if the input matrix is 400 × 990 (400 is for flattened images that are 20 × 20 pixel, and 990 is for the number of images in the dataset), then the rank value should be \(r \ll \min \limits (400, 990)\). In the proposed method, the value of rank was obtained by trial and error. It can be seen in Fig. 9 that for a number of 990 images from FERET dataset, the rank value of 55 gives the best result.

Fig. 9
figure 9

Effect of the parameter r (rank in NMF) in recognition rate

Since one of the most important steps in this method is to select the nearest image from the gallery collection, an appropriate distance measure must be selected. For this purpose, experiments with several distance measures, including Euclidean, Correlation, Cosine, City-block, and Minkowski were performed. The results of this experiment are known in Table 5. According to the results, the City-block measure was better than other measures to show the same-class images close to each other and different-class images apart.

Table 5 Performance of the proposed method using different distance measures

In another experiment, to investigate the effect of various parts of the proposed method including clustering and use of the two NMF and FREAK descriptors, we applied these factors separately. For example, to see the effect of using NMF features, we applied only the NMF method on the images set; Or to investigate the performance of the proposed method when not using the clustering step, we just applied NMF and FREAK on the images set without clustering. The results are shown in Fig. 10.

Fig. 10
figure 10

Effects of various factors on the recognition rate of the proposed method. The letters N, F, and C stand for NMF, FREAK, and Clustering factors

As can be seen, using only NMF or FREAK descriptors has poor performance. But using a combination of these two features has a significant improvement in the final performance. Finally, we added the clustering step to the proposed method, which provides the best performance.

For further comparison, we examined several related works, including Choi’s method [12], Gutta’s method [17], and Bai’s method [5], which use hierarchical approaches and hybrid features for image or face applications. Most of these methods were not mentioned in Table 3. The proposed method uses a balanced clustering approach to create a hierarchy of the dataset. It is guaranteed that the created clusters are always balanced, while the mentioned methods do not generate balanced clusters.

The proposed method uses one sample per subject for training while the mentioned methods use several samples for training. The number of subjects used to evaluate the proposed method is more than a thousand, while most of the previous methods were only able to evaluate a limited number of subjects. Gutta’s method [17] and Choi’s method [12] use 350 subjects, Bai’s method [5] uses 70 subjects to evaluate.

5 Conclusion and future works

In this paper, a hierarchical face recognition method for large datasets is presented. The proposed method also considers the SSPP problem in face recognition. We use a simple method for clustering images, which has performed better than existing clustering methods. Within each step of clustering (hierarchy), images are divided into two clusters. Also, the proposed clustering approach guarantees that the clusters in each steps are balanced and equal. Then two sets of discriminative features including NMF and FREAK descriptors are extracted from faces within each cluster. Finally, the identity of the test image is determined using the nearest neighbor method with the City-Block distance measure.

Previous methods focused on eighter solving large dataset size or the SSPP problems, whilst the proposed method can consider these two problems simultaneously. Experiments show that the computational complexity of the proposed method is in order of N, which means it can be applied to larger datasets without getting computational problems. Since the employed images in this study have various expressions, experimental results indicate robustness of the proposed method to the expression changes. Other experiments conducted in this research indicate that the proposed method performs better than the previous methods.

The images used in this method are mostly frontal. One of the limitations of the proposed method is the use of images with no variation in pose. Therefore, research can be continued to eventually make the proposed method invariant to pose changes. In this case, the proposed method can be tested on other real-world large-scale datasets. The focus of this research is mainly on facial expressions, and not images with cosmetic and aging effects. Hence, if the given image has any of such effects, the final performance will be affected. This issue will be considered in future works. In addition, research can be continued to improve the clustering method so that by applying several clustering hierarchies, the dataset can be divided into several clusters with a much smaller number of members.