1 Introduction

With the rapid growth of artificial intelligence applications and technology, face recognition (FR) has attracted considerable industrial and research interest. Face recognition offers several advantages, such as low cost, low intrusiveness and abundant data sources. However, the processing and analysis of high-dimensional data in FR is still a challenge [1,2,3], and robust FR remains difficult in small-sample environments. A common face recognition system consists of two main stages. (1) Robust and discriminant feature extraction, for example principal component analysis (PCA) [4], linear discriminant analysis (LDA) [5] and regularized linear discriminant analysis (RLDA) [6]. In null LDA (NLDA) [7], the orientation matrix is calculated in two steps: the data is first projected onto the null space of SW, and then a W that maximizes \(\left| {{{\text{W}}^T}{S_B}W} \right|\) is found; spectral regression discriminant analysis (SRDA) [8] is another example. Sparse graph-based discriminant analysis (SGDA) [9] was developed by preserving the sparse connections in a block-structured affinity matrix built from class-specific samples. Using low-rank constraints, low-rank graph-based discriminant analysis (LGDA) [10] preserves the global structure of the data. The sparse and low-rank graph-based discriminant analysis (SLGDA) method was developed in [10] to pursue a block-diagonal structured affinity matrix under both sparsity and low-rank constraints. (2) Classifier construction, e.g. the nearest neighbor (NN) classifier [11]. However, many methods, including LDA-based statistical learning methods, are affected by the “small-sample-size” (SSS) problem [12].

Feature extraction has proved effective in transforming a high-dimensional space into a lower-dimensional one while retaining most of the intrinsic information in the original data [13, 14]. PCA was originally used to remove the null space of Sw, after which LDA was executed in the reduced-dimensional subspace. It has been pointed out, however, that the removed null space contains discriminatory information that cannot be ignored. Moreover, for supervised dimensionality reduction methods, which are only suitable for single-modal data, classification performance is closely related to between-class separation, within-class compactness and an equal emphasis on the separation between all class pairs [15]. In RLDA, the within-class scatter matrix Sw is regularized to deal with its singularity: Sw is approximated by Sw + ηI. However, this does not consider whether the definition of the scatter matrix itself is reasonable. Li and Tang [16] argue that the between-class scatter matrix of traditional LDA is not optimally defined: the resulting projection does not help separate classes other than the edge classes and may cause the remaining classes to overlap, degrading discriminant performance. Second, RLDA uses a fixed regularization parameter, which may not yield the best classification. In [17], an approach that estimates the η term by maximizing a modified Fisher criterion shows better performance than other methods. In addition, close class pairs are prone to overlap in the subspace, which is referred to as the class separation problem. A number of weighting methods have been proposed to deal with this problem [18, 19]; the basic idea is to assign large weights to close class pairs. However, these methods do not solve the problem thoroughly [20]. The proximity function proposed in [21] overcomes the shortcomings of the traditional distance function for high-dimensional data. In view of this, this paper proposes an improved RLDA algorithm. It redefines the between-class scatter matrix and introduces a precisely estimated regularization parameter to control the bias and variance of the eigenvalues. Finally, the improved scatter matrices are combined with a better parameter estimation method.

This study is motivated by the previous work in [6, 17, 21]. The remainder of this article is organized as follows. Section 2 gives a detailed mathematical derivation of the regularized linear discriminant analysis algorithm. Section 3 introduces the proposed method, consisting of improved scatter matrices and a precisely estimated regularization parameter. Section 4 presents the simulation results and analysis. The last section concludes the paper.

2 Regularized linear discriminant analysis

Among supervised dimensionality reduction methods, RLDA is a popular discriminant analysis method for the SSS problem and is widely used in pattern recognition. Both the degree of bias and the variance are determined by the severity of the SSS problem. Friedman proposed a related improvement under similar conditions, observing that estimating a covariance matrix Si for every class may not be appropriate. His remedy is to add a regularization term ηI, giving Si = Si + ηI, where I is the identity matrix and η is the regularization parameter. This regularization increases the smaller eigenvalues and decreases the larger ones, thus counteracting the bias; it also stabilizes the smallest eigenvalues.

Consider a training set \({\text{z}}=\{ {z_i}\} _{{i=1}}^{C}\) containing C classes, where every class \({{\text{z}}_i}=\{ {z_{ij}}\} _{{j=1}}^{{{C_i}}}\) consists of facial images zij, so that a total of \(N=\sum\nolimits_{{i=1}}^{C} {{C_{\text{i}}}}\) images is available. To ease computation, each face image is represented by a lexicographic ordering of its pixel elements (i.e. \({z_{ij}} \in {R^J}\)) of length J (= Iw × Ih), where RJ denotes the J-dimensional data space. The method obtains a discriminant vector by maximizing a regularized ratio of the between-class scatter measure to the within-class scatter measure, which can be formulated as:

$${\text{W}}=\mathop {\arg \hbox{max} }\limits_{W} \frac{{\left| {{W^T}{S_b}W} \right|}}{{\left| {\eta ({W^T}{S_b}W)+{W^T}{S_w}W} \right|}}.$$
(1)

Here Sb is the between-class scatter matrix, Sw is the within-class scatter matrix, \(W \in {R^m}\), and 0 ≤ η ≤ 1 is the regularization parameter, with

$${S_{\text{b}}}=1/N\sum\limits_{{i=1}}^{C} {{C_i}} ({\bar {z}_i} - \bar {z}){({\bar {z}_i} - \bar {z})^T}$$
(2)
$${S_{\text{w}}}=1/N\sum\limits_{{i=1}}^{C} {\sum\limits_{{j=1}}^{{{C_i}}} {({z_{ij}} - {\bar {z}_i}){{({z_{ij}} - {\bar {z}_i})}^T}} },$$
(3)

where \(\bar{z}_i\) is the mean (or center) of class i and \(\bar{z}\) is the overall mean (or center) of all classes. In general, they are estimated from the training samples as \(\bar{z}_i=\frac{1}{C_i}\sum\nolimits_{j=1}^{C_i} z_{ij}\) and \(\bar{z}=\frac{1}{N}\sum\nolimits_{i=1}^{C}\sum\nolimits_{j=1}^{C_i} z_{ij}\).

According to (1), a series of discriminant vectors can be obtained by the eigenvalue decomposition of \(S_{w}^{{ - 1}}{S_b}\) when Sw is full rank. The projection matrix is constructed from the eigenvectors associated with the d largest eigenvalues, which gives a (sub)optimal solution to (1). However, as mentioned in Sect. 1, RLDA still has some disadvantages for FR and can be improved; a computational sketch of this baseline procedure is given below.
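To make the baseline concrete, the following is a minimal NumPy sketch of Eqs. (2)–(3) and of the eigen-decomposition step; it is only illustrative (not the original implementation), and the names X, y and n_components are our own. A test image x is projected onto the subspace simply as x @ W.

```python
import numpy as np

def scatter_matrices(X, y):
    """Sb of Eq. (2) and Sw of Eq. (3) for data X (N x J) with labels y."""
    N, J = X.shape
    z_bar = X.mean(axis=0)                       # overall mean
    Sb = np.zeros((J, J))
    Sw = np.zeros((J, J))
    for c in np.unique(y):
        Xi = X[y == c]
        Ci = Xi.shape[0]
        zi_bar = Xi.mean(axis=0)                 # class mean
        d = (zi_bar - z_bar)[:, None]
        Sb += Ci * (d @ d.T)                     # between-class contribution
        D = Xi - zi_bar
        Sw += D.T @ D                            # within-class contribution
    return Sb / N, Sw / N

def lda_projection(Sb, Sw, n_components):
    """Leading eigenvectors of Sw^{-1} Sb; a pseudo-inverse guards against
    a singular Sw in the small-sample setting."""
    vals, vecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(-vals.real)
    return vecs[:, order[:n_components]].real
```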

3 Improved regularized linear discriminant analysis

Equation (2) is defined so that the class means are separated as much as possible from the overall mean. The means of different classes may nonetheless remain close to each other, so that many samples of adjacent classes overlap and recognition performance decreases. The reason is that the most discriminating projection direction obtained by the conventional algorithm is the direction of largest variance, along which the edge classes are separated from the other classes as far as possible. This direction, however, does not help separate the classes other than the edge classes; it may even cause them to overlap with each other, degrading the discriminant performance.

Therefore, the existing definition of the between-class scatter is not optimal: the edge classes dominate the eigen-decomposition, so the dimension-reducing transformation matrix places too much emphasis on classes that are already well separated, which in turn causes adjacent classes to overlap.

3.1 The model of improved between-class scatter matrices

The improved between-class scatter matrix is defined as:

$${S_b}=\sum\limits_{{i=1}}^{{C - 1}} {\sum\limits_{{j=i+1}}^{C} {{P_i}} } {P_j}Close({\bar {z}_i},{\bar {z}_j})({\bar {z}_i} - {\bar {z}_j}){({\bar {z}_i} - {\bar {z}_j})^T},$$
(4)

where the Close function is defined as:

$${\text{Close}}({\bar {z}_i},{\bar {z}_j}) = \frac{1}{m}\sum\limits_{{k=1}}^{m} {e^{ - \left| {{{\bar {z}}_{ik}} - {{\bar {z}}_{jk}}} \right|}} .$$
(5)

Each class mean is a point in the m-dimensional feature space. The range of the Close function is (0, 1), and it indicates the proximity of \(\bar{z}_i\) to \(\bar{z}_j\): the closer the two means, the larger the value of the function; the farther apart they are, the smaller the value. Pi and Pj are the prior probabilities of classes i and j, respectively, and \(\bar{z}_i\) and \(\bar{z}_j\) are the means of the i-th and j-th classes.

From Eqs. (4) and (5), the larger the distance \(\left\| {\bar{z}_i - \bar{z}_j} \right\|\) between two class means, the smaller the weight assigned to that class pair; conversely, the closer two class means are, the larger the weight, as illustrated by the sketch below.
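A minimal sketch of the Close weighting of Eq. (5) and the improved between-class scatter of Eq. (4) might look as follows; the class priors Pi are estimated here as Ci/N, which is our assumption rather than a choice stated in the text.

```python
import numpy as np

def close(zi_bar, zj_bar):
    """Close(z_i, z_j) of Eq. (5): mean of exp(-|difference|) over the m dimensions."""
    return np.mean(np.exp(-np.abs(zi_bar - zj_bar)))

def improved_between_scatter(X, y):
    """Improved Sb of Eq. (4): pairwise class-mean differences weighted by
    the class priors and the Close proximity of the two means."""
    N, m = X.shape
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    priors = np.array([np.mean(y == c) for c in classes])   # assumed: P_i = C_i / N
    Sb = np.zeros((m, m))
    for i in range(len(classes)):
        for j in range(i + 1, len(classes)):
            d = (means[i] - means[j])[:, None]
            w = priors[i] * priors[j] * close(means[i], means[j])
            Sb += w * (d @ d.T)
    return Sb
```

Because of the exponential in Eq. (5), distant class pairs receive weights close to zero, so they no longer dominate the eigen-decomposition the way edge classes do in Eq. (2).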

3.2 The model of improved within-class scatter matrices

Many algorithms require large databases to train the scatter matrices, but in practice the number of training samples is limited. Under small-sample conditions, the estimated model cannot correctly and effectively represent the underlying structure of the data, and the scatter matrices are easily over-fitted, which significantly lowers face recognition performance. Because the variation among images of the same person is highly susceptible to external factors, it can even exceed the variation between different persons; that is, the within-class scatter changes more than the between-class scatter, which makes its estimation error larger. Therefore, under small-sample conditions the within-class scatter is clearly more sensitive, and this paper pays particular attention to reducing the sensitivity of the within-class scatter matrices to small samples.

When the data set contains few effective samples, the relevant information can still be estimated robustly by making full use of the local data structure around each sample. When the data set contains outliers, the local structure of neighboring samples can likewise be used to represent the characteristics of the outlier. The over-fitting problem caused by small samples can be alleviated by smoothing. In this paper, the KNN algorithm is used to select the within-class scatter matrices of adjacent classes, and the within-class scatter matrices are smoothed by exploiting this local data structure, so that the over-fitting problem caused by small samples can be resolved.

Let the training data set be \({z_{ij}} \in {R^J}\), i = 1,..., C and j = 1,..., Ci, where Ci is the number of samples in class i, C is the number of classes, N is the total number of samples, and zij denotes the j-th face image of the i-th class. The scatter matrix of class i can be expressed as:

$${S_i}=\sum\limits_{{j=1}}^{{{C_i}}} {({{\text{z}}_{ij}} - {{\bar {z}}_i}){{({z_{ij}} - {{\bar {z}}_i})}^T}} .$$
(6)

The conventional within-class scatter matrix is then:

$${S_{\text{w}}}=1/N\sum\limits_{{i=1}}^{C} {{S_i}} .$$
(7)

Each class scatter matrix is smoothed using those of its adjacent classes:

$${\tilde {S}_i}=\beta {S_i}+(1 - \beta )\sum\nolimits_{{k \in KNN(i)}} {{\omega _k}} {S_k}.$$
(8)

\(k \in {\text{KNN}}(i)\) indexes the K nearest neighbor classes of class i, \(\beta \in [0,1]\) is a trade-off parameter, and \(\omega_k\) is a weight determined by the nearest-neighbor distances: the smaller the distance between class k and class i, the greater the weight.

The improved within-class scatter matrix is then expressed as:

$${S_{\text{w}}}=1/N\sum\limits_{{i=1}}^{C} {{{\tilde {S}}_i}} .$$
(9)

By definition, \({\tilde {S}_i}\) is obtained by smoothing Si with the scatter matrices Sk of the K nearest neighbor classes of class i, so the over-fitting problem can be mitigated by making full use of both the samples of class i and those of its adjacent classes. When a class has only one sample, its scatter matrix cannot be estimated effectively, but it can be approximated using the samples of neighboring classes. This smoothing exploits the local data structure and reduces the adverse effect of outliers within each class. The improved algorithm therefore alleviates the over-fitting of the conventional within-class scatter matrix and yields a more accurate estimate; a sketch of this smoothing step follows.
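The smoothing of Eqs. (6)–(9) can be sketched as below; since the text does not give an explicit formula for the weights ωk, normalized inverse distances between class means are used here as an assumption.

```python
import numpy as np

def smoothed_within_scatter(X, y, K=10, beta=0.5):
    """Within-class scatter with KNN smoothing (Eqs. 6-9)."""
    N, J = X.shape
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # per-class scatter S_i (Eq. 6)
    S = np.array([(X[y == c] - means[i]).T @ (X[y == c] - means[i])
                  for i, c in enumerate(classes)])
    Sw = np.zeros((J, J))
    for i in range(len(classes)):
        dist = np.linalg.norm(means - means[i], axis=1)
        dist[i] = np.inf                               # exclude the class itself
        nn = np.argsort(dist)[:K]                      # K nearest neighbour classes
        w = 1.0 / (dist[nn] + 1e-12)
        w /= w.sum()                                   # closer class -> larger weight (assumed form)
        S_tilde = beta * S[i] + (1 - beta) * np.sum(w[:, None, None] * S[nn], axis=0)  # Eq. (8)
        Sw += S_tilde
    return Sw / N                                      # Eq. (9)
```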

3.3 A deterministic approach to RLDA

Let ST, SW and SB denote the total, within-class and between-class scatter matrices, respectively. Under the SSS condition, these scatter matrices are singular. It is well known that no discriminant information lies in the null space of ST, so the feature dimensionality can be reduced from d to rt (where rt is the rank of ST) by a PCA preprocessing step. The range space of ST, \({P_1} \in {R^{d \times rt}}\), is used as the transformation matrix, giving the reduced scatter matrices \({{\text{S}}_{\text{w}}}={\text{P}}_{1}^{{\text{T}}}{{\text{S}}_{\text{W}}}{{\text{P}}_1}\) and \({{\text{S}}_{\text{b}}}={\text{P}}_{1}^{{\text{T}}}{{\text{S}}_{\text{B}}}{{\text{P}}_1}\), with \({S_w} \in {R^{rt \times rt}}\) and \({S_b} \in {R^{rt \times rt}}\). A sketch of this preprocessing step is given below.
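This preprocessing can be sketched as follows (our own illustrative code; the tolerance tol used to decide the numerical rank of ST is our choice):

```python
import numpy as np

def reduce_to_range_space(St, Sw, Sb, tol=1e-10):
    """Project Sw and Sb onto the range space of St (rank r_t); the null
    space of St is discarded because it carries no discriminant information."""
    vals, vecs = np.linalg.eigh(St)             # St is symmetric positive semi-definite
    P1 = vecs[:, vals > tol * vals.max()]       # d x r_t transformation matrix
    return P1, P1.T @ Sw @ P1, P1.T @ Sb @ P1
```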

In RLDA, the within-class scatter matrix SW is regularized by adding η to its diagonal elements, i.e., Sw = Sw + ηI. This makes SW non-singular and invertible, which allows the modified Fisher criterion to be maximized:

$${\text{W}}=\mathop {\arg \hbox{max} }\limits_{W} \frac{{\left| {{{\text{W}}^{\text{T}}}{{\text{S}}_{\text{b}}}{\text{W}}} \right|}}{{\left| {{{\text{W}}^{\text{T}}}{\text{(}}{{\text{S}}_{\text{w}}}+\eta {\text{I)W}}} \right|}},$$
(10)

where \(w \in {R^{rt \times 1}}\) is the orientation vector. To avoid any heuristic in determining η, Eq. (10) is solved as follows. Denote

$${\text{f}}={{\text{W}}^{\text{T}}}{{\text{S}}_{\text{b}}}{\text{W.}}$$
(11)

Constraint condition:

$${\text{g}}={{\text{W}}^{\text{T}}}({{\text{S}}_{\text{w}}}+\eta {\text{I}}){\text{W}} - {\text{b}}=0.$$
(12)

where b > 0 is a constant. The constrained relative maximum of f on the curve g = 0 is obtained by setting the derivative of the Lagrangian to zero:

$$\frac{\partial (f - \lambda g)}{\partial W} = 2 S_b W - \lambda (2 S_w W + 2\eta W) = 0,$$

Or

$$\left( \frac{1}{\lambda} S_b - S_w \right) W - \eta W = 0,$$
(13)

where λ is the Lagrange multiplier (λ ≠ 0). Substituting \(\eta W = \left(\frac{1}{\lambda}S_b - S_w\right)W\) from Eq. (13) into Eq. (12) and simplifying, we obtain

$${{\text{W}}^{\text{T}}}{{\text{S}}_{\text{b}}}{\text{W}}=\lambda {\text{b}}.$$
(14)

Combining Eq. (14) with Eq. (12), we get

$$\lambda = \frac{{{\text{W}}^{{\text{T}}} {\text{S}}_{{\text{b}}} {\text{W}}}}{{{\text{W}}^{{\text{T}}} {\text{(S}}_{{\text{w}}} + \eta I){\text{W}}}}.$$
(15)

The left side of Eq. (15) is the Lagrange multiplier and the right side is exactly the modified Fisher criterion, so maximizing the modified Fisher criterion amounts to maximizing λ. An approximate value of λ can be obtained by maximizing \({W^T}{S_b}W/{W^T}{S_w}W\), whose maximizer W corresponds to the largest eigenvalue of \({\text{S}}_{w}^{{ - 1}}{{\text{S}}_b}\). When Sw is singular and not invertible, \({\text{S}}_{w}^{{ - 1}}\) is replaced by its pseudoinverse \({\text{S}}_{w}^{+}\), and λmax is obtained by the eigenvalue decomposition of \({\text{S}}_{w}^{+}{{\text{S}}_b}\). Thus,

$$\lambda_{\max} = \max \left( \frac{W^T S_b W}{W^T (S_w + \eta I) W} \right) \approx \max \left( \frac{W^T S_b W}{W^T S_w W} \right) \approx \text{the maximum eigenvalue of } S_w^{+} S_b.$$
(16)

Equation (16) allows us to find η by the eigenvalue decomposition of \(\frac{1}{\lambda_{\max}}S_b - S_w\), which gives \({r_b}=rank({S_b})\) finite eigenvalues. Since the dominant eigenvalue corresponds to the most discriminant eigenvector, η is taken to be the maximum eigenvalue. Then,

$$\eta ={\Lambda _{{\text{max}}}},$$
(17)

where \(\frac{1}{\lambda_{\max}}{S_b} - {S_w}=E\Lambda {E^T}\), \(E \in {R^{rt \times rt}}\) is the matrix of eigenvectors and Λ is the diagonal matrix of corresponding eigenvalues. Once η is determined, the projection vectors W are obtained by the eigenvalue decomposition of \({({S_w}+\eta I)^{ - 1}}{S_b}\), which can be formulated as:

$${\text{((S}}_{{\text{w}}} + \eta {\text{I)}}^{{ - 1}} {\text{S}}_{{\text{b}}} ){\text{W}} = \beta {\text{W}}.$$
(18)

The m eigenvectors of Eq. (18) corresponding to the m largest eigenvalues form the columns of W. A sketch of the complete deterministic procedure is given below.
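Putting Eqs. (15)–(18) together, the deterministic estimation of η and the final projection can be sketched as follows, assuming Sw and Sb have already been reduced to the range space of ST as above; this is an illustrative sketch rather than the authors' code.

```python
import numpy as np

def deterministic_rlda(Sw, Sb, n_components):
    """Deterministic eta (Eqs. 16-17) and projection W (Eq. 18)."""
    # lambda_max: largest eigenvalue of pinv(Sw) @ Sb (Eq. 16)
    lam_max = np.max(np.linalg.eigvals(np.linalg.pinv(Sw) @ Sb).real)
    # eta: largest eigenvalue of (1/lambda_max) Sb - Sw (Eq. 17)
    M = Sb / lam_max - Sw
    eta = np.max(np.linalg.eigvalsh((M + M.T) / 2))   # symmetrized for numerical stability
    # W: leading eigenvectors of (Sw + eta I)^{-1} Sb (Eq. 18)
    A = np.linalg.solve(Sw + eta * np.eye(Sw.shape[0]), Sb)
    vals, vecs = np.linalg.eig(A)
    order = np.argsort(-vals.real)
    return vecs[:, order[:n_components]].real, eta
```

In the proposed algorithm, Sb and Sw passed to this step would be the improved matrices of Eqs. (4) and (9) rather than the conventional ones.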


4 Simulation results and analysis

In this section, our approach is compared with a number of related state-of-the-art methods, including LGDA [10], NLDA [7], SRDA [8], SGDA [9] and SLGDA [10]. The regularization parameter of RLDA [6] lies in [0, 1]; in the following experiments RLDA gives better results with a value of 0.001. Our algorithm avoids the difficulty of determining this regularization parameter in RLDA. The parameters introduced in our algorithm are β and k in Eq. (8); the algorithm achieves comparatively good results with β = 0.5 and k = 10 in the following experiments, and under different dimensions, numbers of training samples or numbers of classes these values can be changed to obtain even better performance. NLDA relies on the null space of the within-class scatter matrix containing important discriminative information, but in some cases this null space may not exist, so its results are sometimes not as good. SRDA uses a Tikhonov regularizer to control model complexity, but the resulting projection matrix is not orthogonal, which is not conducive to eliminating information redundancy between samples. All of these feature extraction algorithms are combined with the NN classifier for face recognition. The experiments are conducted on three face datasets, the Extended Yale B [22], CMU PIE [23] and AR databases, to evaluate performance. Details of the datasets are given in Table 1 and Fig. 1.

Table 1 The three data sets used in our experiments
Fig. 1 Some facial images used in the experiments: a AR; b CMU PIE; c extended Yale B

The parameters of the competing methods are tuned to their best performance according to the suggestions in the original papers.
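For reference, a minimal sketch of the evaluation protocol used throughout this section (p random training samples per class, feature projection, NN classification, averaged over 10 runs) might look like the following; fit_projection stands for any of the compared feature extractors and is our own placeholder name.

```python
import numpy as np

def nn_classify(train_feats, train_labels, test_feats):
    """1-nearest-neighbour classifier with Euclidean distance."""
    d = np.linalg.norm(test_feats[:, None, :] - train_feats[None, :, :], axis=2)
    return train_labels[np.argmin(d, axis=1)]

def average_recognition_rate(X, y, fit_projection, p=3, runs=10, seed=0):
    """Average accuracy over `runs` random splits with p training samples per class."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(runs):
        tr, te = [], []
        for c in np.unique(y):
            idx = rng.permutation(np.where(y == c)[0])
            tr.extend(idx[:p]); te.extend(idx[p:])
        Xtr, ytr, Xte, yte = X[tr], y[tr], X[te], y[te]
        W = fit_projection(Xtr, ytr)            # e.g. the method of Sect. 3
        pred = nn_classify(Xtr @ W, ytr, Xte @ W)
        accs.append(np.mean(pred == yte))
    return float(np.mean(accs))
```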

4.1 2-D visualization experiment on CMU PIE dataset

In this part, the discriminative ability of the different methods is illustrated on a subset of the CMU PIE [23] face database. In the experiment, 7 images of each individual are randomly selected for training and the remaining (about 17) images are used for testing. Figure 2a–j visualizes the distribution of the test data along the first two dimensions obtained by the different methods. Several conclusions can be drawn from Fig. 2. First, under the small-sample problem, NLDA [7], RLDA [6] and SRDA [8] are superior to PCA [4] and LDA [5], but the overlaps are still serious. Second, SGDA [9] only uses the local neighborhood structure through sparse representation and does not perform very well; parts of the 5 classes are mixed together in Fig. 2g. LGDA [10] shows better separation ability by introducing global low-rank regularization, but there are still significant overlaps among classes 2, 3 and 5, and classes 1 and 3 are not far apart. With both sparse and low-rank constraints, SLGDA [10] performs better than the previous two; however, classes 2, 3 and 5 are still not separated, as shown in Fig. 2i. In contrast, the proposed method shows clearer boundaries among the classes and stronger robustness in the following experiments.

Fig. 2 Two-dimensional five-class CMU PIE data projected by different methods. a PCA; b LDA; c NLDA; d RLDA; e PCA + LDA; f SRDA; g SGDA; h LGDA; i SLGDA; j OURS

4.2 Experiments on face recognition

4.2.1 CMU PIE database

The CMU PIE database contains more than 40,000 images of 68 individuals, with each person photographed under 13 different poses. Here we use a near-frontal pose subset, C07, for the experiment, which contains 1629 images of the 68 people, about 24 images per person. All facial images were cropped to 32 × 32 pixels. For each individual, a subset of p (= 2, 3, ...) samples is selected for training and the rest is used for testing. For each p, we ran all of the methods 10 times independently and report the average results in Table 2. The FR rates under different feature dimensions are shown in Table 3. Table 2 shows that our method exceeds the other methods in almost all experimental settings. The results of LGDA and SLGDA are similar to ours, but clearly lower when few training samples per subject are available; they also achieve performance comparable to ours under different dimensions when the number of training samples per subject is fixed. RLDA performs better than SRDA for the different numbers of training samples per subject except p = 2, and NLDA performs better than both RLDA and SRDA. Our method maintains a higher recognition rate under different dimensions. Figure 3 shows the recognition rate versus the number of training samples and versus the feature dimension on CMU PIE for several methods. From Fig. 3b, it is clear that using the full feature dimensionality provides no significant advantage for classification and only increases the computational cost; feature extraction is therefore necessary.

Table 2 Recognition rates under different number of training set (CMU PIE database)
Table 3 Recognition rates under the condition of different feature dimensions

4.2.2 Experiments on AR database

A subset of the AR database consisting of 50 men and 50 women captured in two sessions is used, with 6 illumination and 8 expression variations per person. Seven images with illumination and expression changes are taken from session 1 and seven from session 2. For each individual, a random subset of p (= 2, 3, ...) samples is selected for training. For each p, we ran the experiment 10 times independently and report the average results in the tables. From Table 4, one can conclude that all the algorithms achieve better performance as the number of training samples per class increases, and our method has a higher recognition rate than the other methods for every number of training samples per individual. RLDA performs better than SRDA for the different numbers of training samples per subject except p = 2, and NLDA performs better than both RLDA and SRDA. The FR rates under different feature dimensions are listed in Table 5, which shows that NLDA and our method exceed the other methods in the different experimental settings. NLDA gains the best outcome on AR at 50 feature dimensions, slightly better than our method, but it does not beat our algorithm at higher dimensions. In particular, because the number of training samples per subject is only 14 in the AR database, the performance of SGDA, LGDA and SLGDA drops sharply when p is less than half of that. Figure 4 shows the recognition rate versus the number of training samples per class and versus the feature dimension for several methods. From Fig. 4b, it is again clear that using the full feature dimensionality provides no significant advantage for classification and only increases the computational cost.

Table 4 Recognition rates under different number of training set
Table 5 Recognition rates under the condition of different feature dimensions
Fig. 3 Face recognition accuracy versus a number of training samples per subject, b feature dimension on CMU PIE

4.2.3 Yale face database

The Extended Yale B database contains about 2414 frontal face images of 38 people, with around 64 images per person under different lighting conditions. In this experiment, the cropped images are resized to 32 × 32 pixels; Fig. 1 shows some example images. For each individual, p (= 3, 4, 5, ...) labeled samples are taken for training and the remaining images are used for testing, and each experiment is run 10 times. From Table 6, one can conclude that all the methods achieve better performance as the number of training images per subject grows. We then randomly select four images of every person for training and use the remaining samples for testing. RLDA performs better than SRDA for the different numbers of training samples per subject except p = 3, 4, 5, and NLDA performs better than both RLDA and SRDA. SGDA does not show the superior performance on this dataset that it does on CMU PIE. The results of LGDA and SLGDA are similar to ours in some cases, but clearly lower when few training samples per subject are available; they also achieve performance comparable to ours under different dimensions when the number of training samples per subject is fixed. The rates under different feature dimensions are listed in Table 7. Our method exceeds the other methods in the different experimental settings and is more robust to the illumination problem in FR. Figure 5 illustrates the recognition rate versus the number of training samples per class and versus the feature dimension for several methods. From Fig. 5b, it is once more clear that using the full feature dimensionality provides no significant advantage for classification and only increases the computational cost.

Table 6 Recognition rates under different number of training set
Table 7 Recognition rates under different feature dimensions
Fig. 4 Face recognition accuracy versus a number of training samples per class, b feature dimension on AR

Fig. 5 Face recognition rates versus a number of training samples per subject, b feature dimension on Yale

5 Conclusions

This paper studies the small-sample-size problem in FR. The regularized linear discriminant analysis algorithm still has some disadvantages when dealing with SSS problems. Considering that the scatter matrix model can be made more reasonable and that the regularization parameter can be obtained without heuristics, an improved algorithm is introduced that fixes not only the singularity problem of the scatter matrix but also the problem of parameter estimation. PCA is simple to compute and performs well in some cases, but its performance is limited by its unsupervised nature. By introducing different discrimination criteria to address SSS problems, RLDA, NLDA, SRDA and related methods perform well to some extent. SGDA, LGDA and SLGDA can adaptively select neighbors for graph construction and use the labeled samples of the same class to find a block-diagonal structured representation of each sample. However, due to the limited number of samples per class, this process may result in large representation errors, which may not reveal the within-class adjacency relationships as well as our method does; hence SGDA, LGDA and SLGDA rarely perform better than the proposed method when the training samples are insufficient. The simulation results on well-known databases illustrate that the proposed method performs much better than the other methods and improves face recognition.