1 Introduction

Many previous works have demonstrated that manifold learning can effectively improve image recognition accuracy [16]. Manifold learning constructs an adjacency graph over the training data to model the geometric relationships among data points and maps nearby points in the high-dimensional space to nearby points in the reduced space. It intuitively preserves the intrinsic structure of the data and has been widely used in image recognition, clustering, image retrieval, and data classification. Locality Preserving Projection (LPP) [1, 7] is one of the prevalent approaches and has proven its effectiveness in face recognition, document clustering, and image retrieval. However, LPP neglects the labels of samples when learning the intrinsic structure of data, which may impair recognition accuracy.

Motivated by LPP and linear discriminant analysis (LDA) [8], many discriminant approaches based on manifold learning have been developed for linear dimensionality reduction [9–14], among which the representative approaches are Marginal Fisher Analysis (MFA) [2], Local Discriminant Embedding (LDE) [9], and Local Fisher Discriminant Analysis (LFDA) [11, 12]. MFA, LFDA, and LDE represent the intraclass compactness by LPP [7], which maps nearby data points in the observed data space to nearby points in the low-dimensional representation. MFA and LDE represent the between-class separability by maximizing the sum of distances among nearby data with different labels, thereby characterizing the local discriminant structure of data. In contrast, LFDA represents the interclass separability by maximizing the sum of distances of data from different classes and detects the global discriminant structure of data. Moreover, the data pairs with large distances dominate the global discriminant structure, which may impair the local discriminant structure.

Although the motivations of the above-mentioned discriminant approaches differ, they can be unified within the Graph Embedding Framework (GEF) proposed by Yan et al. [2]. GEF constructs two adjacency graphs, namely an intrinsic graph and a penalty graph, to model the intrinsic geometric structure and the discriminant geometric structure of the data, respectively. However, GEF considers only the local or the global discriminant structure of data and does not consider them simultaneously for dimensionality reduction. In real cases, the intrinsic structure of data is unknown and complex, and testing data usually differ from the training data due to many factors. Thus, the local or global discriminant structure alone is not sufficient to guarantee recognition accuracy [3, 15–17]. This indicates that the recognition accuracy of the above-mentioned discriminant approaches may not be good enough on testing data.

In this paper, motivated by [17], we propose a novel discriminant approach, namely global–local Fisher discriminant analysis (GLFDA), for dimensionality reduction. Specifically, GLFDA constructs two adjacency graphs to model the local intrinsic geometric structure and the discriminant geometric structure of data, respectively, and then integrates the global discriminant structure learned by LDA, the local discriminant structure, and the intrinsic structure into the objective function for dimensionality reduction. Experimental results on the AR, YALE, and UMIST databases demonstrate the effectiveness of the proposed approach.

The rest of the paper is organized as follows. Section 2 introduces our GLFDA approach. Section 3 provides some experimental results. We conclude the paper in Sect. 4.

2 Global–local Fisher discriminant analysis

2.1 Motivation

According to the above analysis, the graph embedding framework encodes only the local or the global discriminating information, which reduces the flexibility of the algorithm and impairs recognition accuracy. In this section, we present a novel dimensionality reduction approach, namely GLFDA, which explicitly considers both the global and the local discriminating information of data. Specifically, we construct three adjacency graphs, namely an intrinsic graph, a global penalty graph, and a local penalty graph, to model the intrinsic and discriminant geometrical structures of the data manifold. The global penalty graph encodes the global discriminating information of data and maps large-distance pairs from different classes far apart in the reduced space. The local penalty graph encodes the local discriminating information and maps nearby points from different classes far apart in the reduced space. Thus, the discriminant structure of data can be effectively characterized by the local and global penalty graphs. The intrinsic graph maps nearby data points from the same class to nearby points in the low-dimensional representation. In the following section, we first describe the intrinsic graph.

2.2 Intrinsic graph

Suppose we have a set of p-dimensional samples \(X = \{x_1, x_2, \ldots, x_n\}\) belonging to C classes. We construct an adjacency graph \(G_{w} = \left\{ {X,W} \right\}\), namely the intrinsic graph, with vertex set X and weight matrix W to learn the similarity of data from the same class, which characterizes the intrinsic structure. The elements \(W_{ij}\) of W are defined as follows:

$$W_{ij} = \begin{cases} A_{ij}/n_{c}, & \text{if } \tau_{i} = \tau_{j} = c \\ 0, & \text{if } \tau_{i} \ne \tau_{j} \end{cases}$$
(1)

where \(A_{ij} = 1\) if \(x_j\) is among the \(k_1\) nearest neighbors of \(x_i\) or vice versa, and \(A_{ij} = 0\) otherwise; \(n_c\) denotes the number of samples in class c; and \(\tau_i\) and \(\tau_j\) denote the class labels of \(x_i\) and \(x_j\), respectively.

The intrinsic graph aims to map nearby data points from the same class in the observed space to nearby points in the reduced space. A reasonable criterion for choosing a good map is therefore to minimize the following objective function:

$$\min \sum_{ij} (y_{i} - y_{j})^{2} W_{ij}$$
(2)

where \(y_i = \alpha^T x_i\) denotes the one-dimensional representation of \(x_i\) and \(\alpha\) is the transformation vector.

With our choice of symmetric weight matrix W, the objective function (2) incurs a heavy penalty if nearby points \(x_i\) and \(x_j\), which are close to each other and share the same label, are mapped far apart. Therefore, minimizing Eq. (2) attempts to ensure that if \(x_i\) and \(x_j\) are close, then \(y_i\) and \(y_j\) are also close. Thus, the intrinsic structure of the data manifold, which characterizes the similarity of data, is well preserved by Eq. (2).
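
For concreteness, the following is a minimal NumPy sketch of how the intrinsic weight matrix W of Eq. (1) could be assembled; the function name and the brute-force neighbor search are illustrative choices, not part of the original formulation.

```python
import numpy as np

def build_intrinsic_graph(X, labels, k1):
    """Sketch of Eq. (1): W_ij = A_ij / n_c for same-class pairs, 0 otherwise.

    X      : (n, p) array, one sample per row
    labels : (n,) integer array of class labels
    k1     : neighborhood size used for the adjacency A
    """
    n = X.shape[0]
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)

    # A_ij = 1 if x_j is among the k1 nearest neighbors of x_i, or vice versa
    knn = np.argsort(dists, axis=1)[:, :k1]
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k1), knn.ravel()] = 1.0
    A = np.maximum(A, A.T)  # symmetrize ("or vice versa")

    # W_ij = A_ij / n_c when tau_i = tau_j = c, else 0
    same_class = labels[:, None] == labels[None, :]
    n_c = np.array([np.sum(labels == c) for c in labels])  # class size of each x_i
    return np.where(same_class, A / n_c[:, None], 0.0)
```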

2.3 Local penalty graph

Likewise, we construct an adjacency graph \(G_B = \{X, B\}\), namely the local penalty graph, with vertex set X and weight matrix B to encode the discriminating information embedded in nearby data from different classes. The elements \(B_{ij}\) of B are defined as follows:

$$B_{ij} = \begin{cases} 1/n_{c}, & \text{if } \left(x_{i} \in N_{k_{2}}(x_{j})\ \text{or}\ x_{j} \in N_{k_{2}}(x_{i})\right)\ \text{and}\ \tau_{i} \ne \tau_{j} \\ 0, & \text{otherwise} \end{cases}$$
(3)

where \(N_{k_2}(x_i)\) denotes the set of \(k_2\) nearest neighbors of \(x_i\).

The local penalty graph aims to find a map such that the data connected in the adjacency graph \(G_B\) stay as distant as possible. Suppose \(y_i\) is the map of the training datum \(x_i\) (i = 1, 2, ..., n); a reasonable map then maximizes the following objective function:

$$\max \sum_{ij} (y_{i} - y_{j})^{2} B_{ij}$$
(4)

The objective function (4) on the graph \(G_B\) incurs a heavy penalty if two nearby points \(x_i\) and \(x_j\) are mapped close together while they actually belong to different classes. So, maximizing Eq. (4) attempts to ensure that if \(x_i\) and \(x_j\) are close but have different labels, then \(y_i\) and \(y_j\) are far apart. Thus, Eq. (4) captures the local discriminant geometrical structure of the data and encodes the local discriminating information.
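
A corresponding sketch of the local penalty weights of Eq. (3) is given below; as in the previous sketch, the function name is illustrative, and \(n_c\) is read as the class size of \(x_i\) (one may symmetrize the resulting matrix if strict symmetry is required).

```python
import numpy as np

def build_local_penalty_graph(X, labels, k2):
    """Sketch of Eq. (3): B_ij = 1/n_c for nearby pairs with different labels."""
    n = X.shape[0]
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)

    # x_i in N_k2(x_j) or x_j in N_k2(x_i)
    knn = np.argsort(dists, axis=1)[:, :k2]
    near = np.zeros((n, n), dtype=bool)
    near[np.repeat(np.arange(n), k2), knn.ravel()] = True
    near |= near.T

    diff_class = labels[:, None] != labels[None, :]
    n_c = np.array([np.sum(labels == c) for c in labels])  # class size of each x_i
    return np.where(near & diff_class, 1.0 / n_c[:, None], 0.0)
```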

Obviously, the objective function (4) does not explicitly consider the data points that are not in the local neighborhood and have different labels. In other words, such data points may be mapped close to each other in the reduced space. This impairs the global discriminant structure of the data and degrades the recognition accuracy of the algorithm. In Sect. 2.4, we construct a global graph, namely the global penalty graph, to address this problem.

2.4 Global penalty graph

Motivated by Fisher discriminant analysis, we construct a global graph \(G_g = \{X, F\}\), namely the global penalty graph, with vertex set X and weight matrix F to model the global discriminant structure that encodes the global discriminating information of data. The elements \(F_{ij}\) of F are defined as follows:

$$F_{ij} = \begin{cases} 1/n, & \text{if } \tau_{i} \ne \tau_{j} \\ 0, & \text{otherwise} \end{cases}$$
(5)

The global penalty graph seeks a map such that data points with different labels are mapped far apart in the reduced space. Suppose \(y_i\) is the map of the training datum \(x_i\) (i = 1, 2, ..., n); a reasonable map then maximizes the following objective function:

$$\max \sum_{ij} (y_{i} - y_{j})^{2} F_{ij}$$
(6)

The objective function (6) on the graph \(G_g\) incurs a heavy penalty if two points \(x_i\) and \(x_j\) are mapped close together while they actually belong to different classes. So, maximizing Eq. (6) attempts to ensure that if \(x_i\) and \(x_j\) have different labels, then \(y_i\) and \(y_j\) are far apart. Thus, Eq. (6) captures the global discriminant geometrical structure of the data.

Moreover, the objective function (6) emphasizes the large-distance pairs and guarantees that the larger the distance between two data points is, the farther apart they are embedded in the reduced space. This means that Eq. (6) keeps data points that lie outside each other's local neighborhoods and have different labels distant in the reduced space, which effectively avoids the disadvantage caused by Eq. (4).
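
The global penalty weights of Eq. (5) depend only on the labels, so a sketch is short (again, the function name is an illustrative choice):

```python
import numpy as np

def build_global_penalty_graph(labels):
    """Sketch of Eq. (5): F_ij = 1/n for pairs with different labels, 0 otherwise."""
    n = labels.shape[0]
    diff_class = labels[:, None] != labels[None, :]
    return np.where(diff_class, 1.0 / n, 0.0)
```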

2.5 Objective function

In this section, we describe how the GLFDA approach combines the objective functions (2), (4), and (6). Suppose \(\alpha\) is a projection vector. Substituting \(y_i = \alpha^T x_i\) into the objective function (2) and following some simple algebraic steps, we obtain

$$\begin{aligned} \sum_{ij} (y_{i} - y_{j})^{2} W_{ij} & = \sum_{ij} (\alpha^{T} x_{i} - \alpha^{T} x_{j})^{2} W_{ij} \\ & = 2\alpha^{T} \Big( \sum_{i} x_{i} D_{w}(ii)\, x_{i}^{T} - \sum_{ij} x_{i} W_{ij} x_{j}^{T} \Big) \alpha \\ & = 2\alpha^{T} X (D_{w} - W) X^{T} \alpha \\ & = 2\alpha^{T} X L_{w} X^{T} \alpha \end{aligned}$$
(7)

where \(L_w = D_w - W\) and \(D_w\) is a diagonal matrix whose entries are the column (or row, since W is symmetric) sums of W, i.e., \(D_w(ii) = \sum_j W_{ij}\).

Substituting \(y_i = \alpha^T x_i\) into the objective function (4), we obtain

$$\begin{aligned} \sum_{ij} (y_{i} - y_{j})^{2} B_{ij} & = \sum_{ij} (\alpha^{T} x_{i} - \alpha^{T} x_{j})^{2} B_{ij} \\ & = 2\alpha^{T} \Big( \sum_{i} x_{i} D_{b}(ii)\, x_{i}^{T} - \sum_{ij} x_{i} B_{ij} x_{j}^{T} \Big) \alpha \\ & = 2\alpha^{T} X (D_{b} - B) X^{T} \alpha \\ & = 2\alpha^{T} X L_{b} X^{T} \alpha \end{aligned}$$
(8)

where \(L_b = D_b - B\) and \(D_b\) is a diagonal matrix whose entries are the column (or row, since B is symmetric) sums of B, i.e., \(D_b(ii) = \sum_j B_{ij}\).

Likewise, substituting \(y_i = \alpha^T x_i\) into the objective function (6), we obtain

$$\begin{aligned} \sum_{ij} (y_{i} - y_{j})^{2} F_{ij} & = \sum_{ij} (\alpha^{T} x_{i} - \alpha^{T} x_{j})^{2} F_{ij} \\ & = 2\alpha^{T} \Big( \sum_{i} x_{i} D_{f}(ii)\, x_{i}^{T} - \sum_{ij} x_{i} F_{ij} x_{j}^{T} \Big) \alpha \\ & = 2\alpha^{T} X (D_{f} - F) X^{T} \alpha \\ & = 2\alpha^{T} X L_{f} X^{T} \alpha \end{aligned}$$
(9)

where \(L_f = D_f - F\) and \(D_f\) is a diagonal matrix whose entries are the column (or row, since F is symmetric) sums of F, i.e., \(D_f(ii) = \sum_j F_{ij}\).
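
The graph Laplacians of Eqs. (7)–(9) and the corresponding scatter matrices \(S = X L X^T\) are straightforward to compute; a minimal sketch follows (here X is arranged as a \(p \times n\) matrix with one sample per column, as in the paper).

```python
import numpy as np

def graph_laplacian(W):
    """L = D - W, where D is the diagonal matrix of the row sums of W."""
    return np.diag(W.sum(axis=1)) - W

def scatter_matrix(X_cols, W):
    """S = X L X^T, with X_cols of shape (p, n), one sample per column."""
    return X_cols @ graph_laplacian(W) @ X_cols.T
```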

Using Eqs. (7), (8), and (9), there are several ways to build the objective function of GLFDA by integrating Eqs. (2), (4), and (6). A representative one is the Rayleigh quotient form, which is formally similar to LDA [8]. Thus, we write the objective function of GLFDA as follows:

$$\mathop{\arg\max}\limits_{\alpha} \; \frac{\alpha^{T} \left( t\, S_{b} + (1 - t)\, S_{f} \right) \alpha}{\alpha^{T} S_{w} \alpha}$$
(10)

where \(t \in [0, 1]\) is a parameter that balances the local and global discriminant information of data, \(S_b = X L_b X^T\) is the local between-class scatter matrix derived from the local penalty graph, \(S_f = X L_f X^T\) is the global between-class scatter matrix derived from the global penalty graph, and \(S_w = X L_w X^T\) is the local within-class scatter matrix.

The optimal projection vector \(\alpha\) that maximizes Eq. (10) is given by the eigenvector associated with the maximum eigenvalue of the generalized eigenvalue problem:

$$\left( t\, S_{b} + (1 - t)\, S_{f} \right) \alpha = \lambda S_{w} \alpha$$
(11)

Let the column vectors \(\alpha_1, \alpha_2, \ldots, \alpha_d\) be the solutions of Eq. (11), ordered according to their eigenvalues \(\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d > 0\); then the projection matrix P can be written as \(P = [\alpha_1, \alpha_2, \ldots, \alpha_d]\).
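
As a sketch, the generalized eigenproblem of Eq. (11) can be solved with scipy.linalg.eigh when \(S_w\) is nonsingular (the singular case is handled by the PCA step discussed next); the function name glfda_directions is an illustrative choice.

```python
import numpy as np
from scipy.linalg import eigh

def glfda_directions(S_b, S_f, S_w, t, d):
    """Solve (t*S_b + (1-t)*S_f) a = lambda * S_w a   (Eq. 11).

    Assumes S_w is symmetric positive definite; returns the d eigenvectors
    with the largest eigenvalues as the columns of the projection matrix.
    """
    M = t * S_b + (1.0 - t) * S_f
    eigvals, eigvecs = eigh(M, S_w)        # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:d]  # keep the d largest
    return eigvecs[:, order]
```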

Note that in face recognition \(S_w\) is usually singular, so the optimal projection matrix cannot be directly calculated from Eq. (11). Many approaches have been developed to solve for the optimal projection vector \(\alpha\) in this case. In the experiments, we simply follow Fisherface [8] for its simplicity: PCA is first used to reduce the dimension such that \(X L_w X^T\) is nonsingular, and Eq. (11) is then solved in the PCA subspace.

2.6 Algorithm

Given a training data set \(X = [x_1, x_2, \ldots, x_n]\) containing C classes, the algorithmic procedure of GLFDA is summarized as follows.

2.6.1 Construct intrinsic and penalty graphs

Suppose that \(G_w = \{X, W\}\), \(G_B = \{X, B\}\), and \(G_g = \{X, F\}\) denote the local intrinsic graph, the local penalty graph, and the global penalty graph, respectively. The weight matrices W, B, and F are calculated by Eqs. (1), (3), and (5), respectively.

2.6.2 Calculate scatter matrices

In this step, we calculate the scatter matrices \(S_w\), \(S_b\), and \(S_f\) by Eqs. (7), (8), and (9), respectively.

2.6.3 PCA projection

In this step, we project the data points \(x_i\) (i = 1, ..., n) into the PCA subspace. Suppose that the projection matrix of PCA is \(W_{\text{PCA}} \in R^{p \times l}\), whose columns are the l leading eigenvectors of the covariance matrix of the data. After PCA projection, the scatter matrices \(S_w\), \(S_b\), and \(S_f\) become \(\tilde{S}_{w} = W^{T}_{\text{PCA}} X L_{w} X^{T} W_{\text{PCA}}\), \(\tilde{S}_{b} = W^{T}_{\text{PCA}} X L_{b} X^{T} W_{\text{PCA}}\), and \(\tilde{S}_{f} = W^{T}_{\text{PCA}} X L_{f} X^{T} W_{\text{PCA}}\), respectively.

2.6.4 Compute the projection matrix

Solve the following generalized eigenvalue problem:

$$\left( t\, \tilde{S}_{b} + (1 - t)\, \tilde{S}_{f} \right) \alpha = \lambda \tilde{S}_{w} \alpha$$

Let \(\alpha_1, \alpha_2, \ldots, \alpha_d\) be the solutions of the above equation, ordered according to their eigenvalues \(\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d\). Thus, the projection matrix of GLFDA is \(P_S = [\alpha_1, \alpha_2, \ldots, \alpha_d]\).

2.6.5 Linear embedding

In this step, we compute the linear projections. Given an arbitrary training vector \(x_i\), the linear embedding is as follows:

$$x_{i} \to y_{i} = P^{T} x_{i}$$
$$P = W_{\text{PCA}} P_{\text{S}}$$
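
Putting the steps of Sect. 2.6 together, a possible end-to-end sketch is given below. It reuses the illustrative helpers sketched earlier (build_intrinsic_graph, build_local_penalty_graph, build_global_penalty_graph, scatter_matrix, glfda_directions); the PCA dimension l, neighborhood sizes k1 and k2, trade-off t, and output dimension d are user-chosen parameters, and the routine is a sketch rather than the authors' reference implementation.

```python
import numpy as np

def glfda_fit(X_cols, labels, k1, k2, t, l, d):
    """GLFDA training sketch; X_cols has shape (p, n), one sample per column."""
    samples = X_cols.T                                   # (n, p), one sample per row

    # Step 2.6.1: intrinsic, local penalty, and global penalty graphs.
    W = build_intrinsic_graph(samples, labels, k1)       # Eq. (1)
    B = build_local_penalty_graph(samples, labels, k2)   # Eq. (3)
    F = build_global_penalty_graph(labels)               # Eq. (5)

    # Step 2.6.2: scatter matrices S = X L X^T (Eqs. (7)-(9)).
    S_w = scatter_matrix(X_cols, W)
    S_b = scatter_matrix(X_cols, B)
    S_f = scatter_matrix(X_cols, F)

    # Step 2.6.3: PCA projection (l leading eigenvectors of the covariance matrix).
    Xc = X_cols - X_cols.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    W_pca = U[:, :l]                                     # (p, l)
    S_w_t = W_pca.T @ S_w @ W_pca
    S_b_t = W_pca.T @ S_b @ W_pca
    S_f_t = W_pca.T @ S_f @ W_pca

    # Steps 2.6.4-2.6.5: generalized eigenproblem (Eq. 11) and final projection.
    P_s = glfda_directions(S_b_t, S_f_t, S_w_t, t, d)
    return W_pca @ P_s                                   # P = W_PCA P_S, shape (p, d)

# Embedding of a sample x (a column vector): y = P.T @ x
```

In practice, the PCA dimension l must be chosen small enough that \(\tilde{S}_w\) is nonsingular, as noted above.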

3 Experiments

In this section, we evaluate the proposed approach on three well-known face databases (YALE, UMIST, and AR) and compare its performance with Fisherface [8], MFA [2], SDA [17], DLA [18], and LFDA [11]. In the experiments, we use the Euclidean metric and the nearest-neighbor classifier for classification because of their simplicity. In real cases, it is very difficult to select a suitable neighborhood parameter k for the MFA, LFDA, DLA, and GLFDA approaches due to the complex distribution of the training data; the same problem is often encountered in many manifold learning approaches. In the experiments, we empirically determined a proper neighborhood parameter k within the interval [1, n − 1] for all approaches, where n denotes the number of training samples. The parameter t depends on the distribution of the data; we set t to 0.5, 0.1, and 0.4 for the YALE, UMIST, and AR databases, respectively.
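
For reference, the evaluation protocol described above amounts to the following sketch (an illustrative 1-NN classifier in the reduced space; the function name is ours, and P is the projection matrix learned from the training set).

```python
import numpy as np

def nearest_neighbor_accuracy(P, X_train_cols, y_train, X_test_cols, y_test):
    """Project with P and assign each test sample the label of its nearest
    training sample under the Euclidean metric in the reduced space."""
    Z_train = (P.T @ X_train_cols).T      # (n_train, d)
    Z_test = (P.T @ X_test_cols).T        # (n_test, d)
    dists = np.linalg.norm(Z_test[:, None, :] - Z_train[None, :, :], axis=2)
    pred = y_train[np.argmin(dists, axis=1)]
    return float(np.mean(pred == y_test))
```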

The Yale face database [19] contains 165 images of 15 individuals (each person providing 11 different images) under various facial expressions and lighting conditions. In our experiments, each image is manually cropped and resized to 32 × 32 pixels. Figure 1 shows some images of one person. In the experiments, the first 3, 6, and 9 images per person are selected for training, respectively, and the corresponding remaining images are used for testing. Thus, we have three training sets containing 45, 90, and 135 images, and the corresponding three testing sets containing 120, 75, and 30 images. The top recognition accuracies of the six approaches are shown in Table 1. Figure 2 plots the recognition accuracy versus the number of projection vectors when the first 6 images per person are selected for training.

Fig. 1 Some images of one subject from the Yale database

Table 1 The top recognition accuracies (%) of the six approaches and the corresponding number of features on the Yale database (shown in parentheses)

Fig. 2 The recognition accuracy versus the number of projection vectors on the YALE database

The UMIST database [20] is a multi-view database consisting of 564 images of 20 individuals under various poses. Each image is of size 112 × 92. In the experiments, the first 19 images per person are selected as the gallery. Figure 3 shows some images of one subject. The first 3, 6, and 9 images per person are, respectively, selected as training images, and the remaining gallery images are used as the corresponding testing images. Thus, we have three training sets containing 60, 120, and 180 images, and the corresponding three testing sets containing 320, 260, and 200 images, respectively. The top recognition accuracies of the six approaches are shown in Table 2. Figure 4 plots the recognition accuracy versus the number of projection vectors of the six approaches when the first 3 images per person are selected for training.

Fig. 3 Some sample images of one subject in the UMIST database

Table 2 The top classification accuracies (%) of the six approaches and the corresponding number of dimensions on the UMIST database (shown in parentheses)

Fig. 4 The recognition accuracy versus the number of projection vectors on the UMIST database

The AR face database [21], established at Purdue University, contains over 4,000 color face images of 126 people (70 men and 56 women), including frontal views of faces with different facial expressions, lighting conditions, and occlusions. The pictures of most persons were taken in two sessions (separated by 2 weeks). Each session contains 13 color images, and 120 individuals (65 men and 55 women) participated in both sessions. In the experiments, the facial portion of each image is manually cropped and then normalized to a size of 50 × 40. The images from the first session with (a) “neutral expression”, (b) “smile”, (c) “anger”, (d) “scream”, (e) “left light on”, (f) “right light on”, and (g) “both side lights on” are selected for training, and the corresponding second-session images, i.e., (n)–(t), are selected for testing. Thus, the training set and the testing set both contain 840 images. Figure 5 shows some sample images of one subject. Table 3 lists the top recognition accuracies of the six approaches. Figure 6 plots the recognition accuracy versus the number of projection vectors of the six approaches.

Fig. 5 Some sample images of one subject in the AR database

Table 3 The top classification accuracies (%) of the six approaches and the corresponding number of dimensions on the AR database (shown in parentheses)

Fig. 6 The recognition accuracy versus the number of projection vectors on the AR database

From Tables 1, 2, and 3 and Figs. 2, 4, and 6, we can see that:

  1. Our GLFDA approach is superior to Fisherface. This is probably because Fisherface neither encodes the discriminative information embedded in nearby data from different classes nor preserves the local intrinsic geometrical structure embedded in nearby data from the same class, which impairs its recognition accuracy.

  2. The top recognition accuracy of GLFDA is also superior to that of MFA and DLA. This is probably because MFA and DLA only encode the discriminative information embedded in nearby data points from different classes and do not explicitly consider the global discriminative information embedded in data points with different labels, which impairs their recognition accuracy.

  3. Global–local Fisher discriminant analysis is also superior to LFDA. This is probably because LFDA only detects the global discriminant structure, which maps data points from different classes far apart in the reduced space through a quadratic function. This quadratic function emphasizes the large-distance pairs and may impair the local discriminant information of data, leading to inseparability among nearby data from different classes.

  4. SDA is inferior to GLFDA. Although SDA considers both the global and local geometrical structures of data, it ignores the local discriminant geometrical structure. This may impair the local discriminating information of data and lead to inseparability among nearby data with different labels. In Fig. 4, the recognition accuracy of GLFDA is not obviously better than that of SDA. This is probably because GLFDA ignores the non-local intrinsic structure of data.

  5. Global–local Fisher discriminant analysis is superior to the other discriminant approaches. This is probably because the other discriminant approaches encode only the global or the local discriminant information of data. In real applications, the intrinsic structure of data is unknown and complex, and a single structure is not sufficient to guarantee the discrimination ability in the reduced space. Different from these approaches, GLFDA explicitly considers both the global and local discriminant information of data and can therefore well separate data with different class labels in the reduced space. We randomly generate some two-dimensional data points using MATLAB and show the projection directions of GLFDA, Fisherface, and MFA in Fig. 7a; the corresponding one-dimensional embedded results are shown in Fig. 7b–d, respectively. It is easy to see that GLFDA encodes more discriminant information embedded in the data than Fisherface and MFA.

    Fig. 7 a Two-dimensional data and one-dimensional projection directions obtained by GLFDA, Fisherface, and MFA, respectively; b one-dimensional embedded results obtained by GLFDA; c one-dimensional embedded results obtained by Fisherface; d one-dimensional embedded results obtained by MFA

4 Conclusion

In this paper, we propose a novel dimensionality reduction approach, namely GLFDA. GLFDA explicitly considers the local intrinsic structure, the global discriminant structure, and the local discriminant structure of the data manifold by constructing three adjacency graphs, and then incorporates them into one objective function for dimensionality reduction. In this way, GLFDA well encodes the discriminative information. Experiments on the YALE, UMIST, and AR databases show the effectiveness of the proposed GLFDA approach.