1 Introduction

Recently, the techniques for dimensionality reduction have received a lot of attentions in the fields of computer vision and pattern recognition [9, 10, 12, 18, 24]. Real data are usually depicted in high dimensions, a common way to deal with the high-dimensional data adequately and avoid the curse of dimensionality is to use the dimensionality reduction techniques [8, 23]. Principal component analysis (PCA) [3, 16, 19], linear discriminant analysis (LDA) [1, 20] and locality preserving projections (LPP) [6, 14] are the most widely used techniques for reduced dimensionality. The high-dimensional data information in the real world tend to lie on a smooth nonlinear low-dimensional manifold [2, 23]. However, both PCA and LDA see only the global Euclidean structure and fail to discover the underlying manifold structure [4, 23]. The LPP is proposed to uncover the essential manifold structures by preserving the local structure of image samples [26].

Based on LPP, many related dimensionality reduction approaches including orthogonal locality preserving projections (OLPP) [4], marginal fisher analysis (MFA) [7, 24] and SOLDE [12], and Fast and Orthogonal Locality Preserving Projections (FOLPP) [22] have been developed. In MFA, the intraclass compactness (similarity) of data was characterized by minimizing the distance among nearby data belonging to same classes, and at the same time the interclass separability of data was characterized by maximizing the distance among nearby data belonging to different classes. Both the similarity and the interclass separability are important for improving the algorithmic discriminating ability. However, the projection vectors obtained by LPP and MFA are mutually nonorthogonal. Some researchers [4, 7, 25] have pointed out that enforcing an orthogonality relationship between the projection directions is more effective for preserving local geometry of data and improving the discriminating ability. Thus, the OLPP was proposed by characterizing the similarity of data and enforcing an orthogonality relationship between projection vectors, simultaneously.

Moreover, both MFA and OLPP characterize the similarity of data by minimizing the sum of the distance among nearby data from the same class. Following this idea, nearby data from the same class can be mapped to a single data point in the reduced space. In this way, the intraclass separability (diversity) of data was completely ignored [12], which is also important for preserving local geometry of data. Combining above analysis, we can see that the above mentioned discriminant approaches ignore the diversity of data resulting in instable intrinsic structure representation [12]. Recently, the stable orthogonal local discriminant embedding (SOLDE) was proposed, which takes the similarity, diversity, interclass separability and orthogonal constraint together into account, to preserve local geometry of data and improve the algorithmic discriminating ability. And it shows a good performance for face recognition compared with many prevalent approaches including LPP, OLPP, MFA, locality sensitive discriminant analysis (LSDA) [5] and maximum margin criterion (MMC) [17]. However, it adopted a step-by-step procedure to obtain the orthogonal projection vectors sequentially, due to which the algorithmic computation burden is seriously increased [4, 11].

We solve the problem in SOLDE using the trace ratio criterion. Due to the fact that there is no closed form solution for solving the trace ratio problem, Wang et al. [21] proposed an efficient algorithm, called iterative trace ratio (ITR) algorithm, to find the optimal solution based on an iterative procedure. It is faster than the prevalent approach proposed by Guo et al. [13] and will converge to global solutions. In this paper, we propose a stable and orthogonal local discriminant embedding using trace ratio criterion (SOLDE-TR), which is much faster than SOLDE. Specially, we first generate the objective function of SOLDE to a trace ratio maximization problem. Then, we adopt the ITR algorithm to solve the trace ratio maximization problem iteratively [15, 21]. By this procedure, the orthogonal projection vectors were optimized simultaneously, and will always converge to a global solution. Especially, the solutions are always better than that of SOLDE.

The rest of this paper is organized as follows: Section 2 gives a brief review of SOLDE. The SOLDE-TR is proposed in Section 3. Section 4 reports all experimental results, and conclusion are drawn in Section 5.

2 Brief review of SOLDE

Given N training data \(\boldsymbol {x}_{1}, \boldsymbol {x}_{2},\cdots ,\boldsymbol {x}_{N} \in \mathbb {R}^{d}\) from C classes. Denote the training data matrix by X = [x 1,x 2,⋯ ,x N ]. Then, three adjacency graphs G gs = {X,S}, G gv = {X,H} and G d = {X,F} are constructed over the training data to model the local similarity, diversity and interclass separability. The elements S i j ,i,j = 1,⋯ ,N in weight matrix \( {S}\in {\mathbb {R}^{N\times N}}\) are defined as

$$ S_{ij}=\left\{ \begin{array}{ll} K(\boldsymbol{x}_{i},\boldsymbol{x}_{j}), & \text{if}~\boldsymbol{x}_{i} \in \boldsymbol{N}_{s}(\boldsymbol{x}_{j})~\text{or}\\ &\boldsymbol{x}_{j} \in \boldsymbol{N}_{s}(\boldsymbol{x}_{i}), \text{and}~\tau_{i}=\tau_{j} ,\\ 0, & \text{otherwise}, \end{array} \right. $$
(1)

where \(K(\boldsymbol {x}_{i},\boldsymbol {x}_{j})=\exp (-||\boldsymbol {x}_{i}-\boldsymbol {x}_{j}||^{2}/\sigma )\), σ > 0 is a proper parameter, ∥⋅∥ denotes the 2-norm of a vector, N s (x i ) stands for the set of s nearest neighbors of x i and τ i is the class label of x i . This weight matrix S mainly characterizes the important similarity of patterns, when the nearby points lie on a compact region.

Moreover, when the nearby points lie on a sparse region, a weight matrix \(\boldsymbol {H}\in {\mathbb {R}^{N\times N}}\) is defined to characterize the important diversity of patterns. The elements of H are defined as

$$ H_{ij}=\left\{\!\! \begin{array}{ll} 1-K(\boldsymbol{x}_{i},\boldsymbol{x}_{j}),&\text{if}~\boldsymbol{x}_{i} \in \boldsymbol{N}_{s}(\boldsymbol{x}_{j})~\text{or}\\ &\boldsymbol{x}_{j} \in \boldsymbol{N}_{s}(\boldsymbol{x}_{i}), \text{and}~\tau_{i}=\tau_{j} ,\\ 0,&\text{otherwise}. \end{array}\right. $$
(2)

In order to characterize the interclass separability of data in low dimensional representation, a weight matrix \( {F}\in {\mathbb {R}^{N\times N}}\) is defined, of which the elements are defined as

$$ F_{ij}=\left\{\begin{array}{ll} K(x_{i},x_{j}),&\text{if}~\boldsymbol{x}_{i} \in \boldsymbol{N}_{s}(\boldsymbol{x}_{j})~\text{or}\\ &\boldsymbol{x}_{j} \in \boldsymbol{N}_{s}(\boldsymbol{x}_{i}), \text{and}~\tau_{i}\neq\tau_{j} ,\\ 0,&\text{otherwise}. \end{array}\right. $$
(3)

Then, three objective functions of SOLDE are constructed to map the original high-dimensional data to a line so that the nearby points with small distance in a same classes can be mapped close in the reduced space, the nearby points with large distance in a same classes can be mapped far, and at the same time the nearby points with small distance in the different classes can be mapped far. The objective functions of SOLDE are defined as

$$\begin{array}{@{}rcl@{}} &&\min \sum\limits_{ij} (y_{i}-y_{j})^{2}\boldsymbol{S}_{ij}=2\boldsymbol{w}^{T}\boldsymbol{X}\boldsymbol{L}_{s}\boldsymbol{X}^{T}\boldsymbol{w}, \end{array} $$
(4)
$$\begin{array}{@{}rcl@{}} &&\max \sum\limits_{ij} (y_{i}-y_{j})^{2}\boldsymbol{H}_{ij}=2\boldsymbol{w}^{T}\boldsymbol{X}\boldsymbol{L}_{v}\boldsymbol{X}^{T}\boldsymbol{w}, \end{array} $$
(5)
$$\begin{array}{@{}rcl@{}} &&\max \sum\limits_{ij} (y_{i}-y_{j})^{2}\boldsymbol{F}_{ij}=2\boldsymbol{w}^{T}\boldsymbol{X}\boldsymbol{L}_{m}\boldsymbol{X}^{T}\boldsymbol{w}, \end{array} $$
(6)

where \(y_{i}=\boldsymbol {w}^{T}\boldsymbol {x}_{i}\) is the one-dimensional representation of x i , w is a projection vector, \(\boldsymbol {L}_{s}=\boldsymbol {D}^{s}-\boldsymbol {S}\), D s is a diagonal matrix whose elements on diagonal are column sum of S, \(\boldsymbol {L}_{v}=\boldsymbol {D}^{v}-\boldsymbol {H}\), D v is a diagonal matrix whose elements on diagonal are column sum of H, \(\boldsymbol {L}_{m}=\boldsymbol {D}^{m}-\boldsymbol {F}\), D m is a diagonal matrix whose elements on diagonal are column sum of F. After some simple algebraic steps, the optimal objective functions of (4), (5) and (6) are integrated into the following objective function

$$ \boldsymbol{w}^{*}=\arg\max{\frac{\boldsymbol{w}^{T}\boldsymbol{X}\boldsymbol{L}_{m}\boldsymbol{X}^{T}\boldsymbol{w}}{\boldsymbol{w}^{T}\boldsymbol{X}\boldsymbol{L}_{d}\boldsymbol{X}^{T}\boldsymbol{w}}}, $$
(7)

where L d = a L s − (1 − a)L v , a is a suitable constant and 0.5 ≤ a ≤ 1.

Aiming to find a set of orthogonal projection vectors, the objective function of (7) can be written as

$$ \boldsymbol{w}_{1}=\arg\max\limits_{\boldsymbol{w}}{\frac{\boldsymbol{w}^{T}\boldsymbol{X}\boldsymbol{L}_{m}\boldsymbol{X}^{T}\boldsymbol{w}} {\boldsymbol{w}^{T}\boldsymbol{X}\boldsymbol{L}_{d}\boldsymbol{X}^{T}\boldsymbol{w}}}, $$
(8)

and

$$\begin{array}{@{}rcl@{}} &&\boldsymbol{w}_{k}=\arg\max\limits_{\boldsymbol{w}}{\frac{\boldsymbol{w}^{T}\boldsymbol{X}\boldsymbol{L}_{m}\boldsymbol{X}^{T}\boldsymbol{w}} {\boldsymbol{w}^{T}\boldsymbol{X}\boldsymbol{L}_{d}\boldsymbol{X}^{T}\boldsymbol{w}}},\\ &&\boldsymbol{w}_{k}^{T}\boldsymbol{w}_{1}=\boldsymbol{w}_{k}^{T}\boldsymbol{w}_{2}=\cdots=\boldsymbol{w}_{k}^{T}\boldsymbol{w}_{k-1}=0. \end{array} $$
(9)

Motivated by [4, 11], the orthogonal projection vectors were obtained by adopting a step-by-step procedure, because of which the algorithmic computation burden was seriously increased. This step-by-step procedure is written as follows:

  1. 1.

    Compute w 1 as the eigenvector of \((\boldsymbol {X}\boldsymbol {L}_{d}\boldsymbol {X}^{T})^{-1}\boldsymbol {X}\boldsymbol {L}_{m}\boldsymbol {X}^{T}\) associated with the largest eigenvalue.

  2. 2.

    Compute w k as the eigenvector of

    $$\begin{array}{@{}rcl@{}} \boldsymbol{M}^{(k)}&=& \{\boldsymbol{I}- (\boldsymbol{X}\boldsymbol{L}_{d}\boldsymbol{X}^{T})^{-1}\boldsymbol{W}^{k-1}[\boldsymbol{B}^{k-1}]^{-1}[\boldsymbol{W}^{k-1}]^{T} \} \\ &&\cdot(\boldsymbol{X}\boldsymbol{L}_{d}\boldsymbol{X}^{T})^{-1}\boldsymbol{X}\boldsymbol{L}_{m}\boldsymbol{X}^{T}, \end{array} $$
    (10)

associated with the largest eigenvalue, where {w 1,w 2,⋯ ,w k−1} are the first k − 1 projection vectors,

\(\boldsymbol {W}^{k-1}=[\boldsymbol {w}_{1},\boldsymbol {w}_{2},\cdots ,\boldsymbol {w}_{k-1}]\) and \(\boldsymbol {B}^{k-1}=[\boldsymbol {W}^{k-1}]^{T}(\boldsymbol {X}\boldsymbol {L}_{d}\boldsymbol {X}^{T})^{-1}\boldsymbol {W}^{k-1}\).

3 Stable and orthogonal local discriminant embedding using trace ratio criterion (SOLDE-TR)

In order to obtain a set of orthogonal projection vectors based on the adjacency graphs of G gs = {X,S}, G gv = {X,H} and G d = {X,F} with a relatively less computation burden, we introduce a fast stable and orthogonal local discriminant analysis algorithm, termed as SOLDE-TR.The objection functions of (8) and (9) can be generalized to the following trace ratio maximization problem:

$$ \boldsymbol{W}^{*}=\arg\max\limits_{\boldsymbol{W}^{T}\boldsymbol{W}=\boldsymbol{I}} \frac{ \text{tr} (\boldsymbol{W}^{T}\boldsymbol{S}_{p}\boldsymbol{W} ) }{\text{tr} ({\boldsymbol{W}^{T}\boldsymbol{S}_{d}\boldsymbol{W}}) }, $$
(11)

and the optimum trace ratio value

$$ \lambda^{*}= \frac{ \text{tr} (\boldsymbol{W}^{T}\boldsymbol{S}_{p}\boldsymbol{W}^) }{\text{tr} ({\boldsymbol{W}^{T}\boldsymbol{S}_{d}\boldsymbol{W}}) }, $$
(12)

where W = [w 1,w 2,⋯ ,w m ], m is the desired lower feature dimension, \(\boldsymbol {S}_{p}=\boldsymbol {X}\boldsymbol {L}_{m}\boldsymbol {X}^{T}\) and \(\boldsymbol {S}_{d}=\boldsymbol {X}\boldsymbol {L}_{d}\boldsymbol {X}^{T}\). We firstly assume that the data x i ,i = 1,⋯ ,N have been projected into the PCA subspace such that the matrix S d is nonsingular. Thus, the denominator of the objective function (11) is always positive for non-zero W , that is, S d is positive definite.

In order to solve the trace ratio maximization problem (11), we resort to solve the following trace different problem iteratively [21]

$$ \boldsymbol{W}^{*}=\arg\max\limits_{\boldsymbol{W}^{T}\boldsymbol{W}=\boldsymbol{I}} \text{tr} [\boldsymbol{W}^{T}(\boldsymbol{S}_{p}-\lambda_{t}\boldsymbol{S}_{d})\boldsymbol{W}], $$
(13)

where λ t is the trace ratio value calculated from the projection matrix W t−1 of the previous step, i.e.,

$$ \lambda_{t}=\frac{ \text{tr} \left( \boldsymbol{W}_{t-1}^{T} \boldsymbol{S}_{p} \boldsymbol{W}_{t-1} \right) } {\text{tr} \left( {\boldsymbol{W}_{t-1}^{T} \boldsymbol{S}_{d}\boldsymbol{W}_{t-1}}\right) }. $$
(14)

Then, we propose our fast stable and orthogonal local discriminant analysis method to solve the trace ratio maximization problem (11). The algorithmic procedure of the proposed algorithm is stated as follows:

  1. 1.

    PCA Projection: Project the data x i ,i = 1,⋯ ,N into the PCA subspace such that the matrix \(\boldsymbol {S}_{d}=\boldsymbol {X}\boldsymbol {L}_{d}\boldsymbol {X}^{T}\) is nonsingular. And denote the transformation matrix of PCA by W P C A .

  2. 2.

    Construct the Adjacency Graphs and Choose Weights: Construct the adjacency graphs of G gs = {X,S}, G gv = {X,H} and G d = {X,F}. Compute the weight matrices S, H, and F according to (1), (2) and (3), respectively.

  3. 3.

    Compute the Orthogonal Projection Vectors:

    1. (a)

      Initialize W 0 as an arbitrary matrix whose columns are orthogonal and normalized, t = 1;

    2. (b)

      Compute the trace ratio value λ t from the projection matrix W t−1 according to

      $$ \lambda_{t}=\frac{ \text{tr} \left( \boldsymbol{W}_{t-1}^{T} \boldsymbol{S}_{p} \boldsymbol{W}_{t-1} \right) } {\text{tr} \left( {\boldsymbol{W}_{t-1}^{T} \boldsymbol{S}_{d}\boldsymbol{W}_{t-1}}\right) }. $$
      (15)
    3. (c)

      Compute the eigenvalue decomposition of (S p λ t S d ) as:

      $$ (\boldsymbol{S}_{p}-\lambda_{t}\boldsymbol{S}_{d})\boldsymbol{w}_{k}^{t}={\rho_{k}^{t}}\boldsymbol{w}_{k}^{t}, $$
      (16)

      where \({\rho _{k}^{t}}\) is the k th largest eigenvalue of (S p λ t S d ) with the corresponding eigenvector \(\boldsymbol {w}_{k}^{t}\).

    4. (d)

      Set \(\boldsymbol {W}_{t}=\left [\boldsymbol {w}_{1}^{t},\boldsymbol {w}_{2}^{t},\cdots ,\boldsymbol {w}_{m}^{t}\right ]\), where m is the desired lower feature dimensions.

    5. (e)

      If ∣λ t+1λ t ∣ < ε (0 < ε < 10−6), go to (f); else set t = t + 1, go to (b).

    6. (f)

      Output W S O L D ET R = W t .

  4. 4.

    Orthogonal Embedding: Thus, the embedding is given as follows:

    $$ \boldsymbol{x}\rightarrow \boldsymbol{y}=\boldsymbol{W}^{T}\boldsymbol{x} ~~~~~\boldsymbol{W}=\boldsymbol{W}_{PCA}\boldsymbol{W}_{SOLDE-TR}, $$

    where y is a m dimensional representation of the data x, and W is the transformation matrix.

In each step, a trace difference problem \(\arg \max _{\boldsymbol {W}} \text {tr} [\boldsymbol {W}^{T}(\boldsymbol {S}_{p}-\lambda _{t}\boldsymbol {S}_{d})\boldsymbol {W}]\) is solved with λ t being the trace ratio value computed from the previous step. The projection matrix of our proposed approach will converge to a global solution [21]. And the trace ratio value will monotonic increase, which have been proved based on point-to-set map theories [21]. Due to the fact that the intrinsic structure of real data is complex, and the testing data are usually different from the training data for a same subject, only similarity or diversity is not sufficient to guarantee the algorithmic generalization capability and stableness. Thus, the proposed SOLDE-TR is more stable and robust than the prevalent OLPP and MFA for testing data.

4 Experimental results

In this section, several experiments were carried out to show the effectiveness of the proposed method on the ORL database and the AR database. More details of the two databases include:

  1. 1.

    ORL Database: ORL consists of 400 facial images from 40 subjects. For each subject, the images were taken at different times. The facial expressions and occlusion also vary. The images were taken with a tolerance for fitting and rotation up to 20 degrees. Each subject contributes 10 different images. All images are grayscale and normalized to the size of 64 × 64.

  2. 2.

    AR Database: AR consists of over 4000 color face images of 126 people (70 men and 56 women). The images of the most persons (65 men and 55 women) are taken in two sessions (separated by two weeks). Each session contained 13 gray scale images for each subject, which have been normalized to the size of 50 × 40. The images from each session consist of three images with illumination variations, three images with wearing scarf, three images with wearing sun glasses, and four images with expression variations.

We compare the proposed method with the LPP, OLPP and SOLDE for face recognition problem. Without loss generality, a nearest neighbor classifier was used for classification [4]. Likewise, the data vectors was firstly normalized to unit length [4]. In all experiments, the parameter σ was set to the mean distance of the normalized data vectors and the parameters a to 0.9. The weight matrix S was computed according to the fixed σ for each database. In the following experiments, the best recognition accuracies in Tables 12 and the training time of the SOLDE and SOLDE-TR in Table 3 have been emphasized with the bold numbers.

Table 1 Best recognition accuracies (%) on the ORL database
Table 2 Best recognition accuracies (%) on the AR database
Table 3 Training time in seconds required by LPP, OLPP, SOLDE and SOLDE-TR on the AR database

4.1 Face recognition on the ORL database

For the ORL database, four random subsets with (l = 2,3,4,5) images per individual were taken with labels to form the training sets. The rest of the database were considered to be the testing sets.

First, we compare the best recognition accuracies of the LPP, OLPP, SOLDE and our proposed SOLDE-TR on the ORL database. Table 1 summarises the best recognition accuracies of the four methods with the subsets of (l = 2,3,4,5). The projection dimension was set to 100, expect for the training set (l = 2), in which the projection dimension was set to 50. For each given l, we averaged the results over 20 random splits. As can be seen, the SOLDE-TR performs the best in most training sets (l = 2,3,4). As to the last training set (l = 5), SOLDE-TR performs an equal level with SOLDE.

Then, we concentrate on the convergence speed of the SOLDE-TR on the ORL database. Figure 1 shows the increasing trend of the trace ratio value in (15) vs. the iterative numbers. The projection dimension was set to 100 for the ORL training set (l = 5). As can be seen, the trace ratio value approximated the maximum after 3th iteration. And our proposed method empirically converged from 6 to 9 iterations.

Fig. 1
figure 1

Trace ratio value vs. iterative numbers on ORL database

Next, we show the recognition accuracies versus dimensions on the ORL database. Figure 2 shows the classification performance of the SOLDE, LPP, OLPP and SOLDE-TR on the subset (l = 3) of the ORL database. We averaged the results over 20 random splits. As can be seen, the SOLDE-TR outperformed the other methods distinctly.

Fig. 2
figure 2

Recognition accuracy vs. dimensions on the ORL database

4.2 Face recognition on the AR database

For the AR database, we randomly selected 20 men and 20 women that had participated in two sessions with each individual having 13 images in each session. Then, three subsets with (l = 3,7,13) images per individual from the first session were selected for the training subsets. The first subset (l = 3) was composed of the images with merely illumination variations. The second subset (l = 7) was composed of the images with merely expression variations and the images of the first subset. The third subset (l = 13) was composed of the total images from the first session. And the corresponding images from the second session of each individual composed the testing sets.

First, we compare the best recognition accuracies of the LPP, OLPP, SOLDE and our proposed SOLDE-TR on the AR database. Table 2 summarizes the best recognition accuracies of the four methods with the subsets of (l = 3,7,13). It is shown that the SOLDE-TR performs the best in most training sets (l = 7,13). As to the first training set (l = 3), SOLDE-TR performs an equal level with OLPP.

Then, we show the recognition accuracies versus dimensions on the AR database. Figure 3 shows recognition accuracies of the SOLDE, LPP, OLPP and SOLDE-TR on the subset (l = 3) of the AR database. We averaged the results over 20 random splits. As can be seen, the SOLDE-TR outperformed the other methods in most cases.

Fig. 3
figure 3

Recognition accuracy vs. dimensions on the AR database

Next, we concentrate on the time consumption of the LPP, OLPP, SOLDE and SOLDE-TR on the AR database. Table 3 summarises the average CPU time consumption of the training procedure with projection dimensions ranging from 100 to 500, measured in seconds, required by LPP, OLPP, SOLDE and SOLDE-TR method on the subset (l = 13) of the AR database. All algorithms have been implemented on Matlab R2014a and a computer with Intel I7 2600 CPU (3.5Ghz) and 8 GB RAM. As can be seen, the SOLDE-TR outperforms the SOLDE distinctly. Thus, the computation burden of SOLDE-TR is evidently alleviated compared with SOLDE.

5 Conclusion

Stable orthogonal local discriminant embedding (SOLDE) is a recently proposed dimensionality reduction method. However, it solves the orthogonal projection vectors through a step-by-step procedure, and is thus computationally expensive. In this paper, a stable and orthogonal local discriminant embedding using trace ratio criterion (SOLDE-TR) method was proposed for dimensionality reduction. The objective function of SOLDE was firstly generalized to a trace ratio maximization problem. Then, the ITR algorithm was provided to optimize the orthogonal projection vectors simultaneously resulting in global solutions. Therefore, the new SOLDE-TR method usually leads to a better solution for face recognition in contrast to the conventional LPP, OLPP and SOLDE. Especially, the solutions of our proposed method are always faster than that of SOLDE. Experimental results on the ORL databases and the AR databases demonstrate the effectiveness and advantages of the proposed method.