1 Introduction

Feature extraction is a critical issue in the field of pattern recognition [6]. A main goal of feature extraction is to obtain a few low-dimensional, representative features for discrimination or data visualization. In past decades, many feature extraction algorithms have been proposed in the literature [10]. The most famous feature extraction algorithms are perhaps principal component analysis (PCA) [6, 7] and linear discriminant analysis (LDA) [1, 6, 7].

PCA, which aims to find a set of orthogonal projection vectors that maximize the variance of the training samples, is an unsupervised feature extraction algorithm. In contrast, LDA, which seeks projection vectors on which training samples of the same class are as close to each other as possible while training samples of different classes are as far from each other as possible, is a supervised technique. Generally, LDA is more suitable than PCA for object recognition problems.

The conventional LDA and PCA are both based on the L2-norm, which is more sensitive to outliers than the L1-norm since the squaring operation of the L2-norm magnifies the effect of outliers [12]. To address this drawback of L2-norm based feature extraction methods, researchers have in recent years developed L1-norm based feature extraction techniques, e.g., robust PCA [5, 11, 12, 18, 22] and robust LDA [15, 19, 20, 23, 25]. Compared with L2-norm based feature extraction methods, the main advantage of L1-norm based methods is that they are less sensitive to outliers. The relationship between the L1-norm and its robustness is illustrated intuitively in Fig. 1, where the solid line and the dotted line correspond to the L1-norm and the L2-norm, respectively. From Fig. 1, we observe that, compared with the L1-norm distance, the L2-norm exaggerates the influence of large errors, which are usually caused by outliers and noise.

Fig. 1 Illustration of the exaggeration effect of the L2-norm and comparison with that of the L1-norm
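As a minimal numeric illustration of this effect (the residual values below are hypothetical and chosen purely for exposition), the following Python snippet compares the aggregate L1 and L2 costs of a residual vector before and after a single outlier is added; the squared terms let the outlier dominate the L2 cost far more than the L1 cost.

```python
import numpy as np

residuals = np.array([0.5, -0.3, 0.8, 0.2])    # typical small errors
residuals_out = np.append(residuals, 10.0)     # the same errors plus one outlier

for name, r in [("clean", residuals), ("with outlier", residuals_out)]:
    l1 = np.abs(r).sum()     # L1 cost: grows linearly with the outlier
    l2 = (r ** 2).sum()      # L2 cost: grows quadratically with the outlier
    print(f"{name:13s} L1 = {l1:6.2f}   L2 = {l2:7.2f}")
```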

By using maximum likelihood estimation, Ke et al. [11] proposed L1-PCA, which obtains the optimal projection matrix via convex programming techniques. R1-PCA [5], which is rotationally invariant, combines the advantages of PCA and L1-PCA. However, L1-PCA and R1-PCA are more computationally expensive than the conventional L2-PCA. Recently, Kwak [12] proposed a rotationally invariant and robust L1-norm based PCA, i.e., PCA-L1, which maximizes the L1-norm variance and learns the optimal projection vectors with a greedy iterative method. In contrast to L1-PCA and R1-PCA, PCA-L1 achieves much lower reconstruction error in facial image reconstruction. In [17], Nie et al. proposed a non-greedy strategy for solving PCA-L1, which obtains a much better projection matrix than that of L1-PCA. By using matrix and tensor techniques, Li et al. [22] and Pang et al. [18] generalized L1-PCA to propose L1-norm based 2DPCA and tensor PCA, respectively.

For object recognition problems, it is more suitable to choose LDA rather than PCA since LDA exploits class labels to obtain a discriminative projection matrix. By combining the maximum margin criterion (MMC) [13, 14] and R1-PCA, Li et al. proposed a rotationally invariant L1-norm based MMC, called R1-MMC. However, R1-MMC is computationally expensive since its iterative algorithm for obtaining the optimal projection matrix is based on eigenvalue decomposition. Recently, in order to extract robust electroencephalography (EEG) features, Wang et al. [19] replaced the L2-norm with the L1-norm and proposed L1-norm based common spatial patterns (L1-CSP), which achieves better performance in EEG classification experiments. Similar ideas also appear in [21, 24, 25], where the L2-norm in the LDA objective function is replaced with the L1-norm to obtain L1-LDA. However, these methods use a greedy strategy to obtain the optimal projection vectors one by one.

Generally, the most discriminant information is contained in the null space of the within-class scatter matrix. The method that exploits this null-space discriminant information, originally proposed to address the small sample size (SSS) problem of LDA, is called null space based LDA (NSLDA) [3, 4, 9, 16]. Motivated by NSLDA and L1-LDA, in this paper we propose an L1-norm based null space linear discriminant analysis (L1-NSLDA). By analyzing the objective function of NSLDA, we derive a transformed version of this objective function that is expressed in terms of the L2-norm, and then replace the L2-norm with the L1-norm. As with other L1-norm based feature extraction methods, it is very difficult to directly find the globally optimal projection matrix of L1-NSLDA. To address this problem, we present a non-greedy iterative algorithm that obtains a local solution of L1-NSLDA.

The remainder of the paper is organized as follows. In Section 2, we briefly review LDA and NSLDA. In Section 3, we propose the L1-NSLDA method, including its objective function and algorithmic procedure. Section 4 is devoted to the experiments. Finally, we conclude the paper in Section 5.

2 Outline of LDA and NSLDA

Given a data matrix \( X=\left\{{x}_1,{x}_2,\cdots, {x}_n\right\}=\left[{X}_1,\cdots, {X}_c\right]\in {\mathrm{R}}^{d\times n} \), where \( {x}_i\in {\mathrm{R}}^d \), for i = 1, 2, ⋯, n, is the ith training sample in a d-dimensional space, \( {X}_i\in {\mathrm{R}}^{d\times {n}_i} \), for i = 1, 2, ⋯, c, is the collection of training samples from the ith class, and \( {\displaystyle {\sum}_{i=1}^c{n}_i=n} \). Let \( {N}_i \) be the set of column indices belonging to the ith class, i.e., \( {x}_j \), for \( j\in {N}_i \), belongs to the ith class. In LDA, the within-class scatter matrix, the between-class scatter matrix and the total scatter matrix are defined, respectively, as follows:

$$ {S}_w={\displaystyle \sum_{i=1}^c{\displaystyle \sum_{j\in {N}_i}\left({x}_j-{m}_i\right){\left({x}_j-{m}_i\right)}^T}}={H}_w{H}_w^T $$
(1)
$$ {S}_b={\displaystyle \sum_{i=1}^c{n}_i\left({m}_i-m\right)}{\left({m}_i-m\right)}^T={H}_b{H}_b^T $$
(2)
$$ {S}_t={\displaystyle \sum_{i=1}^n\left({x}_i-m\right){\left({x}_i-m\right)}^T}={H}_t{H}_t^T={S}_b+{S}_w $$
(3)

where \( {m}_i \) is the mean of the ith class and is defined as \( {m}_i=\frac{1}{n_i}{X}_i{e}_i \), where \( {e}_i={\left(1,1,\cdots, 1\right)}^T\in {\mathrm{R}}^{n_i} \), \( m \) is the global mean and is defined as \( m=\frac{1}{n}Xe \), where \( e={\left(1,1,\cdots, 1\right)}^T\in {\mathrm{R}}^n \), and \( {H}_b \), \( {H}_w \) and \( {H}_t \) are defined, respectively, as

$$ \begin{array}{l}{H}_w=\left[{X}_1-{m}_1{e}_1^T,\cdots, {X}_c-{m}_c{e}_c^T\right]\in {\mathrm{R}}^{d\times n}\\ {}{H}_b=\left[\sqrt{n_1}\left({m}_1-m\right),\dots, \sqrt{n_c}\left({m}_c-m\right)\right]\in {\mathrm{R}}^{d\times c}\\ {}{H}_t=X-m{e}^T=\left({x}_1-m,{x}_2-m,\cdots {x}_n-m\right)\in {\mathrm{R}}^{d\times n}\end{array} $$
(4)
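To make these definitions concrete, the following NumPy sketch (variable names are illustrative; X is the d × n data matrix and labels holds the class index of each column) builds \( {H}_w \), \( {H}_b \), \( {H}_t \) and the corresponding scatter matrices exactly as in Eqs. (1)-(4).

```python
import numpy as np

def scatter_matrices(X, labels):
    """X: d x n data matrix; labels: length-n array of class indices."""
    m = X.mean(axis=1, keepdims=True)             # global mean m = (1/n) X e
    Hw_cols, Hb_cols = [], []
    for c in np.unique(labels):
        Xi = X[:, labels == c]                    # samples of the ith class
        ni = Xi.shape[1]
        mi = Xi.mean(axis=1, keepdims=True)       # class mean m_i = (1/n_i) X_i e_i
        Hw_cols.append(Xi - mi)                   # X_i - m_i e_i^T
        Hb_cols.append(np.sqrt(ni) * (mi - m))    # sqrt(n_i) (m_i - m)
    Hw = np.hstack(Hw_cols)                       # d x n
    Hb = np.hstack(Hb_cols)                       # d x c
    Ht = X - m                                    # d x n
    Sw, Sb, St = Hw @ Hw.T, Hb @ Hb.T, Ht @ Ht.T  # Eqs. (1)-(3)
    return Sw, Sb, St, Hw, Hb, Ht
```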

LDA aims to find an optimal projection matrix G that maximizes the following criterion:

$$ G= \arg \underset{G}{ \max } trace\left({\left({G}^T{S}_wG\right)}^{-1}\left({G}^T{S}_bG\right)\right) $$
(5)

where trace(⋅) denotes the trace operator. The projection matrix G can be obtained by solving the following generalized eigenvalue problem:

$$ {S}_bg=\lambda {S}_wg,\kern0.5em \lambda \ne 0 $$
(6)

The eigenvectors corresponding to the c − 1 largest eigenvalues form the columns of G.
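When \( {S}_w \) is nonsingular, Eq. (6) can be solved with a standard generalized symmetric eigensolver. A minimal sketch using SciPy (assuming the scatter matrices have already been computed, e.g., with the hypothetical helper above):

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(Sb, Sw, n_components):
    """Solve S_b g = lambda S_w g and keep the eigenvectors of the largest eigenvalues."""
    eigvals, eigvecs = eigh(Sb, Sw)               # generalized symmetric eigenproblem
    order = np.argsort(eigvals)[::-1]             # sort eigenvalues in descending order
    return eigvecs[:, order[:n_components]]       # columns of G (d x n_components)
```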

When the small sample size problem occurs, \( {S}_w \) is singular and LDA cannot be applied directly [8]. NSLDA, which makes use of the discriminant information in the null space of \( {S}_w \), has been proposed to address the singularity of \( {S}_w \). In particular, the optimization problem associated with NSLDA [3, 4, 9, 16] is

$$ \left\{\begin{array}{c}\hfill G= \arg \max trace\left({G}^T{S}_bG\right)\hfill \\ {}\hfill trace\left({G}^T{S}_wG\right)=0,{G}^TG=I\hfill \end{array}\right. $$
(7)

where I is an identity matrix.
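One straightforward way to realize Eq. (7) numerically is to take an orthonormal basis of the null space of \( {S}_w \) and then maximize the between-class scatter inside that subspace. The sketch below follows this interpretation; it is not claimed to reproduce the exact procedures of [3, 4, 9, 16], and the tolerance used to detect zero eigenvalues is an assumption.

```python
import numpy as np

def nslda_projection(Sb, Sw, n_components, tol=1e-10):
    """Null-space LDA sketch: restrict to null(S_w), then maximize between-class scatter."""
    w_vals, w_vecs = np.linalg.eigh(Sw)
    Z = w_vecs[:, w_vals < tol * w_vals.max()]               # orthonormal basis of null(S_w)
    b_vals, b_vecs = np.linalg.eigh(Z.T @ Sb @ Z)            # between-class scatter in that subspace
    V = b_vecs[:, np.argsort(b_vals)[::-1][:n_components]]
    return Z @ V    # G = Z V satisfies G^T S_w G ~ 0 and G^T G = I
```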

3 L1-norm based null space linear discriminant analysis

3.1 Problem formulation

In this subsection, we present our proposed L1-norm based null space linear discriminant analysis. We first reformulate the objective function of the conventional NSLDA method as an L2-norm based expression, which reveals that NSLDA is based on the L2-norm distance criterion. It is well known that the L2-norm distance criterion is sensitive to outliers, which means that atypical samples may distort the desired NSLDA solution. In the literature, the L1-norm is often used as a robust alternative to the L2-norm [19, 21, 24, 25]. Motivated by this idea, we replace the L2-norm in NSLDA with the L1-norm and present the objective function of L1-NSLDA.

The optimization problem associated with NSLDA, i.e., Eq.(7), can be reformulated as

$$ \begin{array}{l}G= \arg \max trace\left({G}^T\left(\frac{1}{n}{\displaystyle \sum_{i=1}^c{n}_i\left({m}_i-m\right){\left({m}_i-m\right)}^T}\right)G\right)\kern0.5em \\ {}s.t.\kern0.5em {G}^TG=I,{G}^T\left(\frac{1}{n}\left({\displaystyle \sum_{i=1}^c{\displaystyle \sum_{j\in {N}_i}\left({x}_j-{m}_i\right){\left({x}_j-{m}_i\right)}^T}}\right)\right)G=0\end{array} $$
(8)

By a simple algebraic transformation, Eq. (8) can be rewritten as

$$ \begin{array}{l}G= \arg \max {\displaystyle \sum_{i=1}^c{\left\Vert \sqrt{n_i}{G}^T\left({m}_i-m\right)\right\Vert}_2^2}\kern0.5em \\ {}s.t.\kern0.5em {G}^TG=I,{\displaystyle \sum_{i=1}^c{\displaystyle \sum_{j\in {N}_i}{\left\Vert {G}^T\left({x}_j-{m}_i\right)\right\Vert}_2^2}}=0\end{array} $$
(9)

where \( {\left\Vert \cdot \right\Vert}_2 \) denotes the L2-norm. From Eq. (9), we find that NSLDA is based on an L2-norm measurement, which is sensitive to the effect of outliers. Generally, we can replace the L2-norm with the L1-norm to obtain a more robust method. Then, the optimization problem associated with L1-NSLDA is

$$ \begin{array}{l}G= \arg \max {\displaystyle \sum_{i=1}^c{\left\Vert \sqrt{n_i}{G}^T\left({m}_i-m\right)\right\Vert}_1}\kern0.5em \\ {}s.t.\kern0.5em {G}^TG=I,{\displaystyle \sum_{i=1}^c{\displaystyle \sum_{j\in {N}_i}{\left\Vert {G}^T\left({x}_j-{m}_i\right)\right\Vert}_1}}=0\end{array} $$
(10)

where \( {\left\Vert \cdot \right\Vert}_1 \) denotes the L1-norm. By using Eq. (4), Eq. (10) can be rewritten as

$$ \begin{array}{l}G= \arg \max {\left\Vert {G}^T{H}_b\right\Vert}_1\kern0.5em \\ {}s.t.\kern0.5em {G}^TG=I,{\left\Vert {G}^T{H}_w\right\Vert}_1=0\end{array} $$
(11)

Although it is easy to obtain the globally optimal solution to the objective function of conventional NSLDA, i.e., Eq. (7), it is very difficult to directly obtain the globally optimal solution to L1-NSLDA, i.e., Eq. (11), since the absolute value operation is nonconvex. In the next subsection, we propose an iterative algorithm to find a locally optimal solution to Eq. (11).

3.2 Algorithm of finding the projection G

In this subsection, we present an iterative procedure for finding the optimal projection matrix G of L1-NSLDA.

Theorem 1

Let \( Q \) be a matrix whose columns span the null space of \( {S}_t \); then \( {\left\Vert {Q}^T{H}_b\right\Vert}_1=0 \) and \( {\left\Vert {Q}^T{H}_w\right\Vert}_1=0 \).

Proof: Since the columns of \( Q \) span the null space of \( {S}_t \), we have \( {Q}^T{S}_tQ=0 \). Then \( {Q}^T{S}_bQ=0 \) and \( {Q}^T{S}_wQ=0 \), since \( {S}_b \) and \( {S}_w \) are both positive semi-definite and \( {S}_t={S}_b+{S}_w \). From Eq. (1) and Eq. (2), we have

$$ {Q}^T{H}_b=0,\kern0.5em {Q}^T{H}_w=0 $$
(12)

Then we can obtain

$$ {\left\Vert {Q}^T{H}_b\right\Vert}_1=0,\kern0.5em {\left\Vert {Q}^T{H}_w\right\Vert}_1=0 $$
(13)

Theorem 1 shows that we can first remove the null space of \( {S}_t \) without any loss of discriminant information.

Let \( {H}_B={\left({Q}^{\perp}\right)}^T{H}_b \), \( {H}_W={\left({Q}^{\perp}\right)}^T{H}_w \), and let \( U \) be the solution of the following optimization problem

$$ \begin{array}{l}U= \arg \max {\left\Vert {U}^T{H}_B\right\Vert}_1\kern0.5em \\ {}s.t.\kern0.5em {U}^TU=I,{\left\Vert {U}^T{H}_W\right\Vert}_1=0\end{array} $$
(14)

where \( {Q}^{\perp } \) is a matrix whose columns form an orthonormal basis of the range space of \( {S}_t \), i.e., the orthogonal complement of its null space. Then the projection matrix \( G \) can be obtained as \( G={Q}^{\perp }U \), so solving for the projection matrix \( G \) boils down to solving for the matrix \( U \).
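In practice, \( {Q}^{\perp } \) can be obtained from the singular value decomposition of \( {H}_t \): the left singular vectors with nonzero singular values span the range of \( {S}_t \). A minimal sketch (the helper name and the tolerance are assumptions):

```python
import numpy as np

def remove_null_space_of_St(Ht, Hb, Hw, tol=1e-10):
    """Project H_b and H_w onto the range of S_t (Theorem 1: no discriminant info is lost)."""
    U, s, _ = np.linalg.svd(Ht, full_matrices=False)
    Q_perp = U[:, s > tol * s.max()]    # orthonormal basis of range(S_t)
    H_B = Q_perp.T @ Hb                 # reduced between-class matrix
    H_W = Q_perp.T @ Hw                 # reduced within-class matrix
    return Q_perp, H_B, H_W
```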

Theorem 2

Let \( P \) be a matrix whose columns span the null space of \( {H}_W^T \), i.e., \( {P}^T{H}_W=0 \); then \( {\left\Vert {P}^T{H}_W\right\Vert}_1=0 \) and \( {\left\Vert {P}^T{H}_B\right\Vert}_1\ne 0 \).

Proof: Let \( {S}_T={\left({Q}^{\perp}\right)}^T{S}_t{Q}^{\perp } \), \( {S}_B={\left({Q}^{\perp}\right)}^T{S}_b{Q}^{\perp } \) and \( {S}_W={\left({Q}^{\perp}\right)}^T{S}_w{Q}^{\perp } \); obviously we have

$$ {S}_T={S}_B+{S}_W $$
(15)

Since the columns of \( P \) span the null space of \( {H}_W^T \), we have \( {P}^T{H}_W=0 \), \( {\left\Vert {P}^T{H}_W\right\Vert}_1=0 \) and \( {P}^T{S}_WP=0 \). Moreover, since the null space of \( {S}_t \) has already been removed, \( {S}_T \) is positive definite, so \( {P}^T{S}_TP\ne 0 \). From Eq. (15) we can then obtain

$$ {P}^T{S}_BP\ne 0 $$
(16)

That is

$$ {P}^T{H}_B\ne 0 $$
(17)

and

$$ {\left\Vert {P}^T{H}_B\right\Vert}_1\ne 0 $$
(18)

From Theorem 2, solving for the projection matrix \( U \) boils down to solving for the matrix \( V \), where \( V \) is the solution of the following optimization problem

$$ \begin{array}{l}V= \arg \max {\left\Vert {V}^T{\overline{H}}_B\right\Vert}_1\kern0.5em \\ {}s.t.\kern0.5em {V}^TV=I\end{array} $$
(19)

where \( {\overline{H}}_B={P}^T{H}_B \) and \( {\overline{H}}_W={P}^T{H}_W \). Obviously we have U = PV.
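Similarly, \( P \) can be taken from the SVD of \( {H}_W \): the left singular vectors associated with (numerically) zero singular values span the null space of \( {H}_W^T \), so that \( {P}^T{H}_W=0 \). A minimal sketch, continuing the hypothetical helpers above:

```python
import numpy as np

def null_space_of_HW(H_W, tol=1e-10):
    """Return orthonormal P with P^T H_W = 0 (null space of S_W in the reduced space)."""
    U, s, _ = np.linalg.svd(H_W, full_matrices=True)   # full SVD exposes the null space
    rank = int(np.sum(s > tol * s.max()))
    return U[:, rank:]                                  # columns span null(H_W^T)
```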

Now we consider how to solve V. Suppose \( {\alpha}_i=\mathrm{s}\mathrm{g}\mathrm{n}\left({V}^T\left({\overline{m}}_i-\overline{m}\right)\right) \), 1 ≤ i ≤ c, where \( {\overline{m}}_i-\overline{m}={P}^T{\left({Q}^{\perp}\right)}^T\sqrt{n_i}\left({m}_i-m\right) \). Then Eq. (19) can be reformulated as

$$ \begin{array}{l}V=\underset{V^TV=I}{ \arg \max }{\left\Vert {V}^T{\overline{H}}_B\right\Vert}_1=\underset{V^TV=I}{ \arg \max }{\displaystyle \sum_{i=1}^c{\left\Vert {V}^T\left({\overline{m}}_i-\overline{m}\right)\right\Vert}_1}\\ {}=\underset{V^TV=I}{ \arg \max }{\displaystyle \sum_{i=1}^c{\alpha}_i^T{V}^T\left({\overline{m}}_i-\overline{m}\right)}=\underset{V^TV=I}{ \arg \max } trace\left({V}^TM\right)\end{array} $$
(20)

where \( M={\displaystyle \sum_{i=1}^c\left({\overline{m}}_i-\overline{m}\right){\alpha}_i^T} \). By the method of Lagrange multipliers, the problem in Eq. (20) leads to the Lagrangian

$$ f(V)=tr\left({V}^TM\right)-\frac{1}{2}tr\left(L\left({V}^TV-I\right)\right) $$
(21)

where L is a symmetric Lagrange multiplier matrix. Setting the partial derivative of f(V) with respect to V equal to zero, we obtain

$$ M-VL=0 $$
(22)

That is

$$ V=M{L}^{-1} $$
(23)

Since \( {V}^TV=I \), we must have

$$ {\left({L}^{-1}\right)}^T{M}^TM{L}^{-1}=I $$
(24)

Suppose the singular value decomposition (SVD) of \( M \) is \( M={U}_M{\varLambda}_M{V}_M^T \). Taking \( {L}^{-1}={V}_M{\varLambda}_M^{-1}{V}_M^T \), we have

$$ {\left({L}^{-1}\right)}^T{M}^TM{L}^{-1}={\left({V}_M{\varLambda}_M^{-1}{V}_M^T\right)}^T{\left({U}_M{\varLambda}_M{V}_M^T\right)}^T\left({U}_M{\varLambda}_M{V}_M^T\right)\left({V}_M{\varLambda}_M^{-1}{V}_M^T\right)=I $$
(25)

Then we can obtain

$$ V=M{L}^{-1}={U}_M{\varLambda}_M{V}_M^T{V}_M{\varLambda}_M^{-1}{V}_M^T={U}_M{V}_M^T $$
(26)

Note that \( {\alpha}_i \), 1 ≤ i ≤ c, is an unknown variable since it depends on \( V \). We therefore propose an iterative procedure to solve Eq. (19) and prove that it converges to a local solution. Based on the above analysis, the algorithm for solving Eq. (11) is summarized in Algorithm 1.

Algorithm 1
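Since Algorithm 1 is given only as a figure, the following Python sketch reconstructs the procedure described above from the derivation in this section (the function names, the random initialization and the stopping tolerance are assumptions): remove the null space of \( {S}_t \) (Theorem 1), move into the null space of \( {H}_W^T \) (Theorem 2), and then alternate the sign update of Eq. (20) with the SVD-based update \( V={U}_M{V}_M^T \) of Eq. (26) until the objective of Eq. (19) stops increasing.

```python
import numpy as np

def l1_nslda(Hb, Hw, Ht, n_components, n_iter=50, tol=1e-8):
    """Sketch of the L1-NSLDA iteration; returns G = Q_perp P V (uses the helpers above)."""
    Q_perp, H_B, H_W = remove_null_space_of_St(Ht, Hb, Hw)        # Theorem 1
    P = null_space_of_HW(H_W)                                     # Theorem 2
    Hb_bar = P.T @ H_B                   # columns are the reduced sqrt(n_i)(m_i - m)
    k = Hb_bar.shape[0]
    rng = np.random.default_rng(0)
    V, _ = np.linalg.qr(rng.standard_normal((k, n_components)))   # random orthonormal start
    prev_obj = -np.inf
    for _ in range(n_iter):
        A = np.sign(V.T @ Hb_bar)        # alpha_i = sgn(V^T (m_i - m)), cf. Eq. (20)
        A[A == 0] = 1.0                  # avoid zero signs
        M = Hb_bar @ A.T                 # M = sum_i (m_i - m) alpha_i^T
        U_M, _, Vt_M = np.linalg.svd(M, full_matrices=False)
        V = U_M @ Vt_M                   # Eq. (26): V = U_M V_M^T
        obj = np.abs(V.T @ Hb_bar).sum() # objective of Eq. (19)
        if obj - prev_obj < tol:         # stop once the objective no longer increases
            break
        prev_obj = obj
    return Q_perp @ (P @ V)              # G = Q_perp U = Q_perp P V
```

The monotone increase of the objective during this inner loop is exactly what Theorem 3 below establishes.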

In [2], Cevikalp et al. pointed out that all samples of the same class are projected onto the same unique vector by NSLDA, which makes NSLDA overfit the data and hence unstable [4]: a small variation of a test sample in the range space of \( {S}_w \) can lead to misclassification. L1-NSLDA inherits this weakness of NSLDA since it is also a null space based method. However, L1-NSLDA finds the optimal projection matrix in the null space of \( {S}_w \) using an L1-norm based objective function, whereas NSLDA uses an L2-norm based objective function. The projection matrices obtained by L1-NSLDA and NSLDA are therefore different, and L1-NSLDA is less sensitive to atypical samples. That is, L1-NSLDA can alleviate the effects of outliers and noise when they are present in the training samples.

The following Theorem 3 guarantees the convergence of Step 6 of Algorithm 1.

Theorem 3

Step 6 of Algorithm 1 monotonically increases the objective value of Eq. (19) in each iteration.

Proof: Obviously, we have

$$ \begin{array}{l} trace\left({\left({V}^{t+1}\right)}^T\left({\displaystyle \sum_{i=1}^c\left({\overline{m}}_i-\overline{m}\right){\left({\alpha}_i^t\right)}^T}\right)\right)= trace\left({\left({V}^{t+1}\right)}^T{M}^t\right)\\ {}\ge trace\left({\left({V}^t\right)}^T{M}^t\right)= trace\left({\left({V}^t\right)}^T\left({\displaystyle \sum_{i=1}^c\left({\overline{m}}_i-\overline{m}\right){\left({\alpha}_i^t\right)}^T}\right)\right)\end{array} $$
(27)

Since \( {\alpha}_i^t=\mathrm{s}\mathrm{g}\mathrm{n}\left({\left({V}^t\right)}^T\left({\overline{m}}_i-\overline{m}\right)\right) \) for each i, we have

$$ {\left\Vert {\left({V}^{t+1}\right)}^T\left({\overline{m}}_i-\overline{m}\right)\right\Vert}_1={\left({\alpha}_i^{t+1}\right)}^T{\left({V}^{t+1}\right)}^T\left({\overline{m}}_i-\overline{m}\right)\ge {\left({\alpha}_i^t\right)}^T{\left({V}^{t+1}\right)}^T\left({\overline{m}}_i-\overline{m}\right) $$
(28)

From Eq. (20) and Eq. (28), we have

$$ trace\left({\left({V}^{t+1}\right)}^T{M}^{t+1}\right)={\displaystyle \sum_{i=1}^c{\left({\alpha}_i^{t+1}\right)}^T{\left({V}^{t+1}\right)}^T\left({\overline{m}}_i-\overline{m}\right)}\ge {\displaystyle \sum_{i=1}^c{\left({\alpha}_i^t\right)}^T{\left({V}^{t+1}\right)}^T\left({\overline{m}}_i-\overline{m}\right)}= trace\left({\left({V}^{t+1}\right)}^T{M}^t\right) $$
(29)

By combining Eq. (27) and Eq. (29), we have

$$ trace\left({\left({V}^{t+1}\right)}^T{M}^{t+1}\right)\ge trace\left({\left({V}^t\right)}^T{M}^t\right) $$
(30)

Thus, Step 6 of Algorithm 1 monotonically increases the objective value of Eq. (19) in each iteration.

3.3 Computational complexity analysis

In this subsection, we discuss the computational complexity of the proposed L1-NSLDA method. In Table 1, we list the computational complexity of each step of L1-NSLDA; in Step 6, t denotes the number of iterations. Generally, d ≫ n ≫ c when the SSS problem occurs. Then, from Table 1, the total computational complexity of L1-NSLDA is of order \( O\left(d{n}^2+dnc+{n}^3+{n}^2c\right) \).

Table 1 Computational complexity of each step in L1-NSLDA

4 Experiments and results

In this section, we compare our proposed L1-NSLDA with PCA [7], L1-PCA [12], LDA [1], L1-LDA [20, 25] and NSLDA [3, 4, 9, 16] on the ORL, Yale and FERET face databases. In order to overcome the small sample size (SSS) problem of LDA, we first use conventional PCA as a preprocessing step; in the PCA phase we keep nearly 98 % of the image energy. For L1-LDA, the updating parameter β is set to 0.08. In the experiments we use the Euclidean distance based nearest neighbor classifier (1-NN) for classification, and the recognition rate is computed by the following formula

$$ \mathrm{recognition}\ \mathrm{rate}=\frac{rec}{tot} $$
(31)

where rec denotes the number of test samples that are correctly classified and tot denotes the total number of test samples. The experiments are implemented on an Acer computer with a Mobile DualCore Intel Pentium processor (1.8 GHz) and 4 GB RAM, and the programming environment is MATLAB 2008.
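For reference, the following is a minimal sketch of the evaluation protocol (1-NN classification with the Euclidean distance in the projected space, recognition rate as in Eq. (31)); the function and variable names are illustrative only.

```python
import numpy as np

def recognition_rate(G, X_train, y_train, X_test, y_test):
    """Project with G, classify by 1-NN (Euclidean), return rec / tot as in Eq. (31)."""
    Z_train, Z_test = G.T @ X_train, G.T @ X_test    # low-dimensional features
    # squared Euclidean distances between every test and training feature vector
    d2 = ((Z_test[:, :, None] - Z_train[:, None, :]) ** 2).sum(axis=0)
    pred = y_train[np.argmin(d2, axis=1)]            # label of the nearest training sample
    return np.mean(pred == y_test)                   # rec / tot
```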

4.1 Experiments on ORL face database

The ORL face database consists of 400 face images of 40 people (10 samples per person). For some subjects, the images were taken at different times, with varying lighting, facial expressions (open/closed eyes, smiling/not smiling) and facial details (glasses/no glasses). All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement). In our experiments, the size of each image in the ORL database is 112 × 92.

First, we test the performance of L1-NSLDA and the other methods on the ORL database without added noise. In the experiments, we randomly choose i (i = 4, 5) samples of each person for training, and the remaining ones are used for testing. The procedure is repeated 20 times, and the average recognition rates as well as the standard deviations are reported in Table 2. We also report the computing time of each algorithm in Table 3.

Table 2 Comparison of recognition rates for the different methods on ORL database without noise
Table 3 Computing times (S) of each algorithm on ORL database

To test the robustness of the proposed L1-NSLDA against outliers, we randomly choose 40 % of the training samples to be contaminated by rectangle noise. The rectangle noise consists of white or black dots, its location in the face image is random, and its size is 50 × 50. Some face images with and without rectangle noise are shown in Fig. 2. The procedure is repeated 20 times, and the average recognition rates as well as the standard deviations are reported in Table 4.
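The occlusion process described above can be reproduced with a simple routine such as the following (a hypothetical helper; a size × size block at a uniformly random position is overwritten with either white or black pixels).

```python
import numpy as np

def add_rectangle_noise(img, size=50, rng=None):
    """Overwrite a random size x size block of a grayscale image with white or black."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    value = 255 if rng.random() < 0.5 else 0     # white or black rectangle
    noisy = img.copy()
    noisy[top:top + size, left:left + size] = value
    return noisy
```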

Fig. 2 Some face images with/without occlusion in the ORL database

Table 4 Comparison of recognition rates for the different methods on ORL database with noise

4.2 Experiments on Yale face database

The Yale face database contains 165 gray-scale images of 15 individuals; each individual has 11 images. The images demonstrate variations in lighting condition and facial expression (normal, happy, sad, sleepy, surprised, and wink). In our experiments, each image in the Yale database was manually cropped and resized to 100 × 80.

First, we test the performance of L1-NSLDA and the other methods on the Yale database without added noise. In the experiments, we randomly choose i (i = 4, 5) samples of each person for training, and the remaining ones are used for testing. The procedure is repeated 20 times, and the average recognition rates as well as the standard deviations are reported in Table 5. We also report the computing time of each algorithm in Table 6.

Table 5 Comparison of recognition rates for the different methods on Yale database without noise
Table 6 Computing times (S) of each algorithm on Yale database

To test the robustness of the proposed L1-NSLDA against outliers, we again randomly choose 40 % of the training samples to be contaminated by rectangle noise (white or black, random location, size 50 × 50). Some face images with and without rectangle noise are shown in Fig. 3. The procedure is repeated 20 times, and the average recognition rates as well as the standard deviations are reported in Table 7.

Fig. 3 Some face images with/without occlusion in the Yale database

Table 7 Comparison of recognition rates for the different methods on Yale database with noise

4.3 Experiments on FERET face database

The FERET face database contains 14,126 images of 1199 individuals. In our experiments, we select a subset containing 1400 images of 200 individuals (seven images per individual). The subset involves variations in facial expression, illumination and pose. Each image in the FERET database was manually cropped and resized to 80 × 80.

First, we test the performance of L1-NSLDA and the other methods on the FERET database without added noise. In the experiments, we randomly choose i (i = 3) samples of each person for training, and the remaining ones are used for testing. The procedure is repeated 20 times, and the average recognition rates as well as the standard deviations are reported in Table 8. We also report the computing time of each algorithm in Table 9.

Table 8 Comparison of recognition rates for the different methods on FERET database without noise
Table 9 Computing times (S) of each algorithm on FERET database

To test the robustness of the proposed L1-NSLDA against outliers, we again randomly choose 40 % of the training samples to be contaminated by rectangle noise (white or black, random location, size 50 × 50). Some face images with and without rectangle noise are shown in Fig. 4. The procedure is repeated 20 times, and the average recognition rates as well as the standard deviations are reported in Table 10.

Fig. 4 Some face images with/without occlusion in the FERET database

Table 10 Comparison of recognition rates for the different methods on FERET database with noise

4.4 Discussions

From the experimental results, we can draw the following conclusions:

(1) From Tables 2-10, the recognition rates of the unsupervised methods, e.g., PCA and L1-PCA, are generally lower than those of the supervised methods, e.g., LDA, L1-LDA, NSLDA and L1-NSLDA. This shows that class label information is critical for recognition problems.

(2) The L1-norm based methods achieve higher recognition rates than their L2-norm based counterparts. This shows that the L1-norm is indeed helpful for suppressing the negative effects of outliers.

(3) Generally, the L1-norm based methods need more computing time than their L2-norm based counterparts to obtain the projection vectors.

(4) Our proposed L1-NSLDA achieves the highest recognition rates in our experiments. This is attributed to the combined use of the discriminant information in the null space of the within-class scatter matrix and the L1-norm.

5 Conclusions

In this paper, a new feature extraction method, called L1-norm based null space linear discriminant analysis (L1-NSLDA), has been proposed. Since the L1-norm can suppress the effects of outliers, L1-NSLDA is more robust to outliers than conventional L2-norm based feature extraction methods. The experimental results on several face image databases show the effectiveness of the proposed L1-NSLDA. In the future, we will investigate L1-norm based nonlinear feature extraction methods.