1 Introduction

With the rapid growth of artificial intelligence applications and technology, face recognition (FR) has attracted considerable industrial and research interest. Face recognition offers several advantages, such as low cost, low intrusiveness and abundant data sources. However, the processing and analysis of high-dimensional data in FR is still a challenge [1,2,3], and robust FR remains difficult in small-sample environments. A common face recognition system consists of two main stages. (1) Robust and discriminant feature extraction, for example principal component analysis (PCA) [4], linear discriminant analysis (LDA) [5] and regularized linear discriminant analysis (RLDA) [6]. In null LDA (NLDA) [7], the orientation matrix is calculated in two steps: the data is first projected onto the null space of SW, and then a W that maximizes \(\left| {{{\text{W}}^T}{S_B}W} \right|\) is found; spectral regression discriminant analysis (SRDA) [8] is another example. Sparse graph-based discriminant analysis (SGDA) [9] was developed by preserving the sparse connections in a block-structured affinity matrix built from class-specific samples. Using low-rank constraints, low-rank graph-based discriminant analysis (LGDA) [10] preserves the global structure of the data. The sparse and low-rank graph-based discriminant analysis (SLGDA) method was developed in [10] to pursue a block-diagonal structured affinity matrix under both sparsity and low-rank constraints. (2) Classifier construction, e.g. the nearest neighbor (NN) classifier [11]. However, many methods, including LDA-based statistical learning methods, are affected by the “small-sample-size” (SSS) problem [12].

Feature extraction has proved effective in transforming a high-dimensional space into a lower-dimensional one while retaining most of the intrinsic information in the original data [13, 14]. PCA was originally used to remove the null space of Sw, after which LDA was executed in the reduced-dimensional subspace. It has been pointed out, however, that the removed null space contains discriminatory information that cannot be ignored. Moreover, for supervised dimensionality reduction methods, which are only suitable for single-modal data, classification performance is closely related to between-class separation, within-class compactness and an equal emphasis on the separation between all class pairs [15]. In RLDA, the within-class scatter matrix Sw is regularized to deal with its singularity: Sw is approximated by Sw + ηI. However, this does not consider whether the definition of the scatter matrix itself is reasonable. Li and Tang [16] argue that the between-class scatter matrix of traditional LDA is not optimally defined: the resulting projection does not help separate classes other than the edge classes and may cause the remaining classes to overlap, degrading discriminant performance. Second, RLDA uses a fixed regularization parameter, which may not yield the best classification. In [17], an approach that estimates the η term by maximizing a modified Fisher criterion shows better performance than other methods. In addition, close class pairs are prone to overlap in the subspace, which is referred to as the class separation problem. A number of weighting methods have been proposed to deal with this problem [18, 19]; the basic idea is to assign large weights to close class pairs. However, these methods do not solve the problem thoroughly [20]. The proximity function proposed in [21] overcomes the shortcomings of the traditional distance function for high-dimensional data. In view of this, this paper proposes an improved RLDA algorithm. It redefines the between-class scatter matrix and introduces a precisely estimated regularization parameter to control the bias and variance of the eigenvalues. Finally, the improved scatter matrices are combined with a better parameter estimation method.

This study is motivated by the previous work in [6, 17, 21]. The remainder of this article is organized as follows. Section 2 gives a detailed mathematical derivation of the regularized linear discriminant analysis algorithm. Section 3 introduces the proposed method, consisting of improved scatter matrices and a precisely estimated regularization parameter. Section 4 presents the simulation results and analysis. The last section concludes the paper.

2 Regularized linear discriminant analysis

Among supervised dimensionality reduction methods, RLDA is a popular discriminant analysis method for the SSS problem and is widely used in pattern recognition. Both the degree of bias and the variance are determined by the severity of the SSS problem. Friedman proposed a related improvement under similar conditions, observing that estimating a covariance matrix Si for every class may not be appropriate. His remedy is to add a regularization term ηI, giving Si = Si + ηI, where I is the identity matrix and η is the regularization parameter. This regularization increases the smaller eigenvalues and decreases the larger ones, thus counteracting the bias; it also stabilizes the smallest eigenvalues.

Consider a training set \({\text{z}}=\{ {z_i}\} _{{i=1}}^{C}\) containing C classes, where every class \({{\text{z}}_i}=\{ {z_{ij}}\} _{{j=1}}^{{{C_i}}}\) consists of facial images zij, so that a total of \(N=\sum\nolimits_{{i=1}}^{C} {{C_{\text{i}}}}\) images is available. To ease computation, each face image is represented by a lexicographic ordering of its pixel elements (i.e. \({z_{ij}} \in {R^J}\)) of length J (= Iw × Ih), where RJ denotes the J-dimensional data space. The method obtains a discriminant vector by maximizing a regularized ratio of the between-class scatter measure to the within-class scatter measure, which can be formulated as:

$${\text{W}}=\mathop {\arg \hbox{max} }\limits_{W} \frac{{\left| {{W^T}{S_b}W} \right|}}{{\left| {\eta ({W^T}{S_b}W)+{W^T}{S_w}W} \right|}}.$$
(1)

Here Sb is the between-class scatter matrix, Sw is the within-class scatter matrix, \(W \in {R^m}\), and 0 ≤ η ≤ 1 is the regularization parameter, with

$${S_{\text{b}}}=1/N\sum\limits_{{i=1}}^{C} {{C_i}} ({\bar {z}_i} - \bar {z}){({\bar {z}_i} - \bar {z})^T}$$
(2)
$${S_{\text{w}}}=1/N\sum\limits_{{i=1}}^{C} {\sum\limits_{{j=1}}^{{{C_i}}} {({z_{ij}} - {\bar {z}_i}){{({z_{ij}} - {\bar {z}_i})}^T}} },$$
(3)

where \(\bar{z}_i\) is the mean (or center) of class i and \(\bar{z}\) is the overall mean (or center) of all classes. In general, they are estimated from the training samples as \(\bar{z}_i=\frac{1}{C_i}\sum\nolimits_{j=1}^{C_i} z_{ij}\) and \(\bar{z}=\frac{1}{N}\sum\nolimits_{i=1}^{C}\sum\nolimits_{j=1}^{C_i} z_{ij}\).

According to (1), a series of discriminant vectors can be obtained by the eigenvalue decomposition of \(S_{w}^{{ - 1}}{S_b}\) when Sw is full rank. The projection matrix is constructed from the eigenvectors associated with the d largest eigenvalues, which gives a (sub)optimal solution to (1). However, as mentioned in Sect. 1, RLDA still has some disadvantages for FR and can be improved; a computational sketch of this baseline procedure is given below.
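To make the baseline concrete, the following is a minimal NumPy sketch of Eqs. (2)–(3) and of the eigen-decomposition step; it is only illustrative (not the original implementation), and the names X, y and n_components are our own. A test image x is projected onto the subspace simply as x @ W.

```python
import numpy as np

def scatter_matrices(X, y):
    """Sb of Eq. (2) and Sw of Eq. (3) for data X (N x J) with labels y."""
    N, J = X.shape
    z_bar = X.mean(axis=0)                       # overall mean
    Sb = np.zeros((J, J))
    Sw = np.zeros((J, J))
    for c in np.unique(y):
        Xi = X[y == c]
        Ci = Xi.shape[0]
        zi_bar = Xi.mean(axis=0)                 # class mean
        d = (zi_bar - z_bar)[:, None]
        Sb += Ci * (d @ d.T)                     # between-class contribution
        D = Xi - zi_bar
        Sw += D.T @ D                            # within-class contribution
    return Sb / N, Sw / N

def lda_projection(Sb, Sw, n_components):
    """Leading eigenvectors of Sw^{-1} Sb; a pseudo-inverse guards against
    a singular Sw in the small-sample setting."""
    vals, vecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(-vals.real)
    return vecs[:, order[:n_components]].real
```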

3 Improved regularized linear discriminant analysis

Equation (2) is defined so that the class means are separated as much as possible from the overall mean. The means of different classes may nonetheless remain close to each other, so that many samples of adjacent classes overlap and recognition performance decreases. The reason is that the most discriminating projection direction obtained by the conventional algorithm is the direction of largest variance, along which the edge classes are separated from the other classes as far as possible. This direction, however, does not help separate the classes other than the edge classes; it may even cause them to overlap with each other, degrading the discriminant performance.

Therefore, the existing definition of the between-class scatter is not optimal: the edge classes dominate the eigen-decomposition, so the dimension-reducing transformation matrix places too much emphasis on classes that are already well separated, which in turn causes adjacent classes to overlap.

3.1 The model of improved between-class scatter matrices

The improved between-class scatter matrix is defined as:

$${S_b}=\sum\limits_{{i=1}}^{{C - 1}} {\sum\limits_{{j=i+1}}^{C} {{P_i}} } {P_j}Close({\bar {z}_i},{\bar {z}_j})({\bar {z}_i} - {\bar {z}_j}){({\bar {z}_i} - {\bar {z}_j})^T},$$
(4)

where the Close function is defined as:

$${\text{Close}}({\bar {z}_i},{\bar {z}_j}) = \frac{1}{m}\sum\limits_{{k=1}}^{m} {e^{ - \left| {{{\bar {z}}_{ik}} - {{\bar {z}}_{jk}}} \right|}} .$$
(5)

Each class mean is a point in the m-dimensional feature space. The range of the Close function is (0, 1), and it indicates the proximity of \(\bar{z}_i\) to \(\bar{z}_j\): the closer the two means, the larger the value of the function; the farther apart they are, the smaller the value. Pi and Pj are the prior probabilities of classes i and j, respectively, and \(\bar{z}_i\) and \(\bar{z}_j\) are the means of the i-th and j-th classes.

From Eqs. (4) and (5), the larger the distance \(\left\| {\bar{z}_i - \bar{z}_j} \right\|\) between two class means, the smaller the weight assigned to that class pair; conversely, the closer two class means are, the larger the weight, as illustrated by the sketch below.
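A minimal sketch of the Close weighting of Eq. (5) and the improved between-class scatter of Eq. (4) might look as follows; the class priors Pi are estimated here as Ci/N, which is our assumption rather than a choice stated in the text.

```python
import numpy as np

def close(zi_bar, zj_bar):
    """Close(z_i, z_j) of Eq. (5): mean of exp(-|difference|) over the m dimensions."""
    return np.mean(np.exp(-np.abs(zi_bar - zj_bar)))

def improved_between_scatter(X, y):
    """Improved Sb of Eq. (4): pairwise class-mean differences weighted by
    the class priors and the Close proximity of the two means."""
    N, m = X.shape
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    priors = np.array([np.mean(y == c) for c in classes])   # assumed: P_i = C_i / N
    Sb = np.zeros((m, m))
    for i in range(len(classes)):
        for j in range(i + 1, len(classes)):
            d = (means[i] - means[j])[:, None]
            w = priors[i] * priors[j] * close(means[i], means[j])
            Sb += w * (d @ d.T)
    return Sb
```

Because of the exponential in Eq. (5), distant class pairs receive weights close to zero, so they no longer dominate the eigen-decomposition the way edge classes do in Eq. (2).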

3.2 The model of improved within-class scatter matrices

Many algorithms require large databases to train the scatter matrices, but in practice the number of training samples is limited. Under small-sample conditions, the estimated model cannot correctly and effectively represent the underlying structure of the data, and the scatter matrices are easily over-fitted, which significantly lowers face recognition performance. Because the variation among images of the same person is highly susceptible to external factors, it can even exceed the variation between different persons; that is, the within-class scatter changes more than the between-class scatter, which makes its estimation error larger. Therefore, under small-sample conditions the within-class scatter is clearly more sensitive, and this paper pays particular attention to reducing the sensitivity of the within-class scatter matrices to small samples.

When the data set contains few effective samples, the relevant information can still be estimated robustly by making full use of the local data structure around each sample. When the data set contains outliers, the local structure of neighboring samples can likewise be used to represent the characteristics of the outlier. The over-fitting problem caused by small samples can be alleviated by smoothing. In this paper, the KNN algorithm is used to select the within-class scatter matrices of adjacent classes, and the within-class scatter matrices are smoothed by exploiting this local data structure, so that the over-fitting problem caused by small samples can be resolved.

Let the training data set be \({z_{ij}} \in {R^J}\), i = 1,..., C and j = 1,..., Ci, where Ci is the number of samples in class i, C is the number of classes, N is the total number of samples, and zij denotes the j-th face image of the i-th class. The scatter matrix of class i can be expressed as:

$${S_i}=\sum\limits_{{j=1}}^{{{C_i}}} {({{\text{z}}_{ij}} - {{\bar {z}}_i}){{({z_{ij}} - {{\bar {z}}_i})}^T}} .$$
(6)

The conventional within-class scatter matrix is then:

$${S_{\text{w}}}=1/N\sum\limits_{{i=1}}^{C} {{S_i}} .$$
(7)

Each class scatter matrix is smoothed using those of its adjacent classes:

$${\tilde {S}_i}=\beta {S_i}+(1 - \beta )\sum\nolimits_{{k \in KNN(i)}} {{\omega _k}} {S_k}.$$
(8)

\(k \in {\text{KNN}}(i)\) indexes the K nearest neighbor classes of class i, \(\beta \in [0,1]\) is a trade-off parameter, and \(\omega_k\) is a weight determined by the nearest-neighbor distances: the smaller the distance between class k and class i, the greater the weight.

The improved within-class scatter matrix is then expressed as:

$${S_{\text{w}}}=1/N\sum\limits_{{i=1}}^{C} {{{\tilde {S}}_i}} .$$
(9)

By definition, \({\tilde {S}_i}\) is obtained by smoothing Si with the scatter matrices Sk of the K nearest neighbor classes of class i, so the over-fitting problem can be mitigated by making full use of both the samples of class i and those of its adjacent classes. When a class has only one sample, its scatter matrix cannot be estimated effectively, but it can be approximated using the samples of neighboring classes. This smoothing exploits the local data structure and reduces the adverse effect of outliers within each class. The improved algorithm therefore alleviates the over-fitting of the conventional within-class scatter matrix and yields a more accurate estimate; a sketch of this smoothing step follows.
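The smoothing of Eqs. (6)–(9) can be sketched as below; since the text does not give an explicit formula for the weights ωk, normalized inverse distances between class means are used here as an assumption.

```python
import numpy as np

def smoothed_within_scatter(X, y, K=10, beta=0.5):
    """Within-class scatter with KNN smoothing (Eqs. 6-9)."""
    N, J = X.shape
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # per-class scatter S_i (Eq. 6)
    S = np.array([(X[y == c] - means[i]).T @ (X[y == c] - means[i])
                  for i, c in enumerate(classes)])
    Sw = np.zeros((J, J))
    for i in range(len(classes)):
        dist = np.linalg.norm(means - means[i], axis=1)
        dist[i] = np.inf                               # exclude the class itself
        nn = np.argsort(dist)[:K]                      # K nearest neighbour classes
        w = 1.0 / (dist[nn] + 1e-12)
        w /= w.sum()                                   # closer class -> larger weight (assumed form)
        S_tilde = beta * S[i] + (1 - beta) * np.sum(w[:, None, None] * S[nn], axis=0)  # Eq. (8)
        Sw += S_tilde
    return Sw / N                                      # Eq. (9)
```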

3.3 A deterministic approach to RLDA

Let ST, SW and SB denote the total, within-class and between-class scatter matrices, respectively. Under the SSS condition, these scatter matrices are singular. It is well known that no discriminant information lies in the null space of ST, so the feature dimensionality can be reduced from d to rt (where rt is the rank of ST) by a PCA preprocessing step. The range space of ST, \({P_1} \in {R^{d \times rt}}\), is used as the transformation matrix, giving the reduced scatter matrices \({{\text{S}}_{\text{w}}}={\text{P}}_{1}^{{\text{T}}}{{\text{S}}_{\text{W}}}{{\text{P}}_1}\) and \({{\text{S}}_{\text{b}}}={\text{P}}_{1}^{{\text{T}}}{{\text{S}}_{\text{B}}}{{\text{P}}_1}\), with \({S_w} \in {R^{rt \times rt}}\) and \({S_b} \in {R^{rt \times rt}}\). A sketch of this preprocessing step is given below.
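This preprocessing can be sketched as follows (our own illustrative code; the tolerance tol used to decide the numerical rank of ST is our choice):

```python
import numpy as np

def reduce_to_range_space(St, Sw, Sb, tol=1e-10):
    """Project Sw and Sb onto the range space of St (rank r_t); the null
    space of St is discarded because it carries no discriminant information."""
    vals, vecs = np.linalg.eigh(St)             # St is symmetric positive semi-definite
    P1 = vecs[:, vals > tol * vals.max()]       # d x r_t transformation matrix
    return P1, P1.T @ Sw @ P1, P1.T @ Sb @ P1
```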

In RLDA, the within-class scatter matrix SW is regularized by adding η to its diagonal elements, i.e., Sw = Sw + ηI. This makes SW non-singular and invertible, which allows the modified Fisher criterion to be maximized:

$${\text{W}}=\mathop {\arg \hbox{max} }\limits_{W} \frac{{\left| {{{\text{W}}^{\text{T}}}{{\text{S}}_{\text{b}}}{\text{W}}} \right|}}{{\left| {{{\text{W}}^{\text{T}}}{\text{(}}{{\text{S}}_{\text{w}}}+\eta {\text{I)W}}} \right|}},$$
(10)

where \(w \in {R^{rt \times 1}}\) is the orientation vector. To avoid any heuristic in determining η, Eq. (10) is solved as follows. Denote

$${\text{f}}={{\text{W}}^{\text{T}}}{{\text{S}}_{\text{b}}}{\text{W.}}$$
(11)

Constraint condition:

$${\text{g}}={{\text{W}}^{\text{T}}}({{\text{S}}_{\text{w}}}+\eta {\text{I}}){\text{W}} - {\text{b}}=0.$$
(12)

where b > 0 is a constant. The constrained relative maximum of f on the curve g = 0 is obtained by setting the derivative of the Lagrangian to zero:

$$\frac{\partial (f - \lambda g)}{\partial W} = 2 S_b W - \lambda (2 S_w W + 2\eta W) = 0,$$

Or

$$\left( \frac{1}{\lambda} S_b - S_w \right) W - \eta W = 0,$$
(13)

where λ is the Lagrange multiplier (λ ≠ 0). Substituting \(\eta W = \left(\frac{1}{\lambda}S_b - S_w\right)W\) from Eq. (13) into Eq. (12) and simplifying, we obtain

$${{\text{W}}^{\text{T}}}{{\text{S}}_{\text{b}}}{\text{W}}=\lambda {\text{b}}.$$
(14)

Combining Eq. (14) with Eq. (12), we get

$$\lambda = \frac{{{\text{W}}^{{\text{T}}} {\text{S}}_{{\text{b}}} {\text{W}}}}{{{\text{W}}^{{\text{T}}} {\text{(S}}_{{\text{w}}} + \eta I){\text{W}}}}.$$
(15)

The left side of Eq. (15) is the Lagrange multiplier and the right side is exactly the modified Fisher criterion, so maximizing the modified Fisher criterion amounts to maximizing λ. An approximate value of λ can be obtained by maximizing \({W^T}{S_b}W/{W^T}{S_w}W\), whose maximizer W corresponds to the largest eigenvalue of \({\text{S}}_{w}^{{ - 1}}{{\text{S}}_b}\). When Sw is singular and not invertible, \({\text{S}}_{w}^{{ - 1}}\) is replaced by its pseudoinverse \({\text{S}}_{w}^{+}\), and λmax is obtained by the eigenvalue decomposition of \({\text{S}}_{w}^{+}{{\text{S}}_b}\). Thus,

$$\lambda_{\max} = \max \left( \frac{W^T S_b W}{W^T (S_w + \eta I) W} \right) \approx \max \left( \frac{W^T S_b W}{W^T S_w W} \right) \approx \text{the maximum eigenvalue of } S_w^{+} S_b.$$
(16)

Equation (16) allows us to find η by the eigenvalue decomposition of \(\frac{1}{\lambda_{\max}}S_b - S_w\), which gives \({r_b}=rank({S_b})\) finite eigenvalues. Since the dominant eigenvalue corresponds to the most discriminant eigenvector, η is taken to be the maximum eigenvalue. Then,

$$\eta ={\Lambda _{{\text{max}}}},$$
(17)

where \(\frac{1}{\lambda_{\max}}{S_b} - {S_w}=E\Lambda {E^T}\), \(E \in {R^{rt \times rt}}\) is the matrix of eigenvectors and Λ is the diagonal matrix of corresponding eigenvalues. Once η is determined, the projection vectors W are obtained by the eigenvalue decomposition of \({({S_w}+\eta I)^{ - 1}}{S_b}\), which can be formulated as:

$${\text{((S}}_{{\text{w}}} + \eta {\text{I)}}^{{ - 1}} {\text{S}}_{{\text{b}}} ){\text{W}} = \beta {\text{W}}.$$
(18)

The m eigenvectors of Eq. (18) corresponding to the m largest eigenvalues form the columns of W. A sketch of the complete deterministic procedure is given below.
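Putting Eqs. (15)–(18) together, the deterministic estimation of η and the final projection can be sketched as follows, assuming Sw and Sb have already been reduced to the range space of ST as above; this is an illustrative sketch rather than the authors' code.

```python
import numpy as np

def deterministic_rlda(Sw, Sb, n_components):
    """Deterministic eta (Eqs. 16-17) and projection W (Eq. 18)."""
    # lambda_max: largest eigenvalue of pinv(Sw) @ Sb (Eq. 16)
    lam_max = np.max(np.linalg.eigvals(np.linalg.pinv(Sw) @ Sb).real)
    # eta: largest eigenvalue of (1/lambda_max) Sb - Sw (Eq. 17)
    M = Sb / lam_max - Sw
    eta = np.max(np.linalg.eigvalsh((M + M.T) / 2))   # symmetrized for numerical stability
    # W: leading eigenvectors of (Sw + eta I)^{-1} Sb (Eq. 18)
    A = np.linalg.solve(Sw + eta * np.eye(Sw.shape[0]), Sb)
    vals, vecs = np.linalg.eig(A)
    order = np.argsort(-vals.real)
    return vecs[:, order[:n_components]].real, eta
```

In the proposed algorithm, Sb and Sw passed to this step would be the improved matrices of Eqs. (4) and (9) rather than the conventional ones.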


4 Simulation results and analysis

In this section, our approach is compared with a number of related state-of-the-art methods, including LGDA [10], NLDA [7], SRDA [8], SGDA [9] and SLGDA [10]. The regularization parameter of RLDA [6] lies in [0, 1]; in the following experiments RLDA gives better results with a value of 0.001. Our algorithm avoids the difficulty of determining this regularization parameter in RLDA. The parameters introduced in our algorithm are β and k in Eq. (8); the algorithm achieves comparatively good results with β = 0.5 and k = 10 in the following experiments, and under different dimensions, numbers of training samples or numbers of classes these values can be changed to obtain even better performance. NLDA relies on the null space of the within-class scatter matrix containing important discriminative information, but in some cases this null space may not exist, so its results are sometimes not as good. SRDA uses a Tikhonov regularizer to control model complexity, but the resulting projection matrix is not orthogonal, which is not conducive to eliminating information redundancy between samples. All of these feature extraction algorithms are combined with the NN classifier for face recognition. The experiments are conducted on three face datasets, the Extended Yale B [22], CMU PIE [23] and AR databases, to evaluate performance. Details of the datasets are given in Table 1 and Fig. 1.

Table 1 The three data sets used in our experiments
Fig. 1 Some facial images used in the experiments: a AR; b CMU PIE; c extended Yale B

The parameters of the competing methods are tuned to their best performance according to the suggestions in the original papers.
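For reference, a minimal sketch of the evaluation protocol used throughout this section (p random training samples per class, feature projection, NN classification, averaged over 10 runs) might look like the following; fit_projection stands for any of the compared feature extractors and is our own placeholder name.

```python
import numpy as np

def nn_classify(train_feats, train_labels, test_feats):
    """1-nearest-neighbour classifier with Euclidean distance."""
    d = np.linalg.norm(test_feats[:, None, :] - train_feats[None, :, :], axis=2)
    return train_labels[np.argmin(d, axis=1)]

def average_recognition_rate(X, y, fit_projection, p=3, runs=10, seed=0):
    """Average accuracy over `runs` random splits with p training samples per class."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(runs):
        tr, te = [], []
        for c in np.unique(y):
            idx = rng.permutation(np.where(y == c)[0])
            tr.extend(idx[:p]); te.extend(idx[p:])
        Xtr, ytr, Xte, yte = X[tr], y[tr], X[te], y[te]
        W = fit_projection(Xtr, ytr)            # e.g. the method of Sect. 3
        pred = nn_classify(Xtr @ W, ytr, Xte @ W)
        accs.append(np.mean(pred == yte))
    return float(np.mean(accs))
```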

4.1 2-D visualization experiment on CMU PIE dataset

In this part, the discriminative ability of the different methods is illustrated on a subset of the CMU PIE [23] face database. In the experiment, 7 images of each individual are randomly selected for training and the remaining (about 17) images are used for testing. Figure 2a–j visualizes the distribution of the test data along the first two dimensions obtained by the different methods. Several conclusions can be drawn from Fig. 2. First, under the small-sample problem, NLDA [7], RLDA [6] and SRDA [8] are superior to PCA [4] and LDA [5], but the overlaps are still serious. Second, SGDA [9] only uses the local neighborhood structure through sparse representation and does not perform very well; parts of the 5 classes are mixed together in Fig. 2g. LGDA [10] shows better separation ability by introducing global low-rank regularization, but there are still significant overlaps among classes 2, 3 and 5, and classes 1 and 3 are not far apart. With both sparse and low-rank constraints, SLGDA [10] performs better than the previous two; however, classes 2, 3 and 5 are still not separated, as shown in Fig. 2i. In contrast, the proposed method shows clearer boundaries among the classes and stronger robustness in the following experiments.

Fig. 2 Two-dimensional five-class CMU PIE data projected by different methods. a PCA; b LDA; c NLDA; d RLDA; e PCA + LDA; f SRDA; g SGDA; h LGDA; i SLGDA; j OURS

4.2 Experiments on face recognition

4.2.1 CMU PIE database

The CMU PIE database contains more than 40,000 images of 68 individuals, with each person photographed under 13 different poses. Here we use a near-frontal pose subset, C07, for the experiment, which contains 1629 images of the 68 people, about 24 images per person. All facial images were cropped to 32 × 32 pixels. For each individual, a subset of p (= 2, 3, ...) samples is selected for training and the rest is used for testing. For each p, we ran all of the methods 10 times independently and report the average results in Table 2. The FR rates under different feature dimensions are shown in Table 3. Table 2 shows that our method exceeds the other methods in almost all experimental settings. The results of LGDA and SLGDA are similar to ours, but clearly lower when few training samples per subject are available; they also achieve performance comparable to ours under different dimensions when the number of training samples per subject is fixed. RLDA performs better than SRDA for the different numbers of training samples per subject except p = 2, and NLDA performs better than both RLDA and SRDA. Our method maintains a higher recognition rate under different dimensions. Figure 3 shows the recognition rate versus the number of training samples and versus the feature dimension on CMU PIE for several methods. From Fig. 3b, it is clear that using the full feature dimensionality provides no significant advantage for classification and only increases the computational cost; feature extraction is therefore necessary.

Table 2 Recognition rates under different number of training set (CMU PIE database)
Table 3 Recognition rates under the condition of different feature dimensions

4.2.2 Experiments on AR database

A subset of the AR database consisting of 50 men and 50 women captured in two sessions is used, with 6 illumination and 8 expression variations per person. Seven images with illumination and expression changes are taken from session 1 and seven from session 2. For each individual, a random subset of p (= 2, 3, ...) samples is selected for training. For each p, we ran the experiment 10 times independently and report the average results in the tables. From Table 4, one can conclude that all the algorithms achieve better performance as the number of training samples per class increases, and our method has a higher recognition rate than the other methods for every number of training samples per individual. RLDA performs better than SRDA for the different numbers of training samples per subject except p = 2, and NLDA performs better than both RLDA and SRDA. The FR rates under different feature dimensions are listed in Table 5, which shows that NLDA and our method exceed the other methods in the different experimental settings. NLDA gains the best outcome on AR at 50 feature dimensions, slightly better than our method, but it does not beat our algorithm at higher dimensions. In particular, because the number of training samples per subject is only 14 in the AR database, the performance of SGDA, LGDA and SLGDA drops sharply when p is less than half of that. Figure 4 shows the recognition rate versus the number of training samples per class and versus the feature dimension for several methods. From Fig. 4b, it is again clear that using the full feature dimensionality provides no significant advantage for classification and only increases the computational cost.

Table 4 Recognition rates under different number of training set
Table 5 Recognition rates under the condition of different feature dimensions
Fig. 3 Face recognition accuracy versus a number of training samples per subject, b feature dimension on CMU PIE

4.2.3 Yale face database

The Extended Yale B database contains about 2414 frontal face images of 38 people, with around 64 images per person under different lighting conditions. In this experiment, the cropped images are resized to 32 × 32 pixels; Fig. 1 shows some example images. For each individual, p (= 3, 4, 5, ...) labeled samples are taken for training and the remaining images are used for testing, and each experiment is run 10 times. From Table 6, one can conclude that all the methods achieve better performance as the number of training images per subject grows. We then randomly select four images of every person for training and use the remaining samples for testing. RLDA performs better than SRDA for the different numbers of training samples per subject except p = 3, 4, 5, and NLDA performs better than both RLDA and SRDA. SGDA does not show the superior performance on this dataset that it does on CMU PIE. The results of LGDA and SLGDA are similar to ours in some cases, but clearly lower when few training samples per subject are available; they also achieve performance comparable to ours under different dimensions when the number of training samples per subject is fixed. The rates under different feature dimensions are listed in Table 7. Our method exceeds the other methods in the different experimental settings and is more robust to the illumination problem in FR. Figure 5 illustrates the recognition rate versus the number of training samples per class and versus the feature dimension for several methods. From Fig. 5b, it is once more clear that using the full feature dimensionality provides no significant advantage for classification and only increases the computational cost.

Table 6 Recognition rates under different number of training set
Table 7 Recognition rates under different feature dimensions
Fig. 4 Face recognition accuracy versus a number of training samples per class, b feature dimension on AR

Fig. 5 Face recognition rates versus a number of training samples per subject, b feature dimension on Yale

5 Conclusions

This paper studies the small-sample-size problem in FR. The regularized linear discriminant analysis algorithm still has some disadvantages when dealing with SSS problems. Considering that the scatter matrix model can be made more reasonable and that the regularization parameter can be obtained without heuristics, an improved algorithm is introduced that fixes not only the singularity problem of the scatter matrix but also the problem of parameter estimation. PCA is simple to compute and performs well in some cases, but its performance is limited by its unsupervised nature. By introducing different discrimination criteria to address SSS problems, RLDA, NLDA, SRDA and related methods perform well to some extent. SGDA, LGDA and SLGDA can adaptively select neighbors for graph construction and use the labeled samples of the same class to find a block-diagonal structured representation of each sample. However, due to the limited number of samples per class, this process may result in large representation errors, which may not reveal the within-class adjacency relationships as well as our method does; hence SGDA, LGDA and SLGDA rarely perform better than the proposed method when the training samples are insufficient. The simulation results on well-known databases illustrate that the proposed method performs much better than the other methods and improves face recognition.