1 Introduction

Dimensionality reduction is a learning task that aims to find a low-dimensional representation of high-dimensional data while preserving as much of the information in the data as possible. Processing data in the low-dimensional space can reduce computational cost and suppress noise. Provided that dimensionality reduction is performed appropriately, the discovered low-dimensional representations will benefit subsequent tasks, e.g., classification, clustering, and data visualization.

PCA, as an unsupervised dimensionality reduction method, seeks a set of orthogonal projection directions along which the sum of the variances of the data is maximized. Other popular unsupervised methods are geometrically motivated and aim to discover the geometrical structure of the underlying manifold, such as Laplacian eigenmaps [2], Hessian eigenmaps [5], locally linear embedding [9], locality preserving projections [7], and local tangent space alignment [17]. Although unsupervised approaches can reveal the underlying data manifold, they may not be the best choices for some learning scenarios because they are not able to utilize the discriminative information from data labels.

LDA is a supervised dimensionality reduction method. It finds a subspace in which data points from different classes are projected far away from each other, while those belonging to the same class are projected as close as possible. However, LDA tends to produce undesirable results when data are multimodal [6] or are mainly characterized by their variances. The reason lies in the assumption adopted by LDA that the data points of each class are generated from a multivariate Gaussian distribution with the same covariance matrix but a different mean. This assumption does not hold for data formed by several separate clusters or data living on an underlying manifold.

To solve this problem, subclass discriminant analysis [18] approximates the potential distribution of data with a mixture of Gaussian distributions. More specifically, it first divides each class into a set of subclasses through clustering and then performs LDA on the divided data. Another way to overcome the drawback of LDA is to preserve the data structure locally. Marginal Fisher analysis (MFA) [15] aims to gather the nearby examples of the same class and separate the marginal examples belonging to different classes. Locality sensitive discriminant analysis (LSDA) [3] maps data points into a subspace where the examples with the same label in each local area are close, while the nearby examples from different classes are apart from each other. Local Fisher discriminant analysis (LFDA) [12] also focuses on discovering the local data structure; it is equivalent to operating LDA in the local scope around each example. In fact, these local structure oriented methods fall into the same graph Laplacian based framework: all of them employ the Laplacian matrix to preserve the local geometry of the data manifold. However, this framework fails to discover the local manifold information from tangent spaces, which can be very useful and can enhance the performance of dimensionality reduction in some situations [11, 13].

In this paper, we present a novel supervised dimensionality reduction method named local tangent space discriminant analysis (TSD). Unlike previous approaches using the graph Laplacian to discover the data manifold, our method uses the first-order Taylor expansion to represent the geometry of the local area around each data point. This strategy provides us with a natural way to utilize the information from tangent spaces. Then we seek a linear transformation to preserve the local manifold structure of the data belonging to the same class as much as possible, while maximizing the marginal data points with different class labels. As a result, the geometrical information from tangent spaces can be readily incorporated into the proposed method to improve the performance of dimensionality reduction. Moreover, the objective function of our method can be optimized analytically by solving a generalized eigenvalue problem. This also leads to a natural extension for non-linear dimensionality reduction through the kernel trick.

The rest of this paper is organized as follows. We briefly review some related work including MFA, LSDA and LFDA, and show how these methods can be considered in the same framework in Sect. 2. Then local tangent space discriminant analysis (TSD), along with its kernelization, is introduced in Sect. 3. Section 4 discusses the connections and differences between the proposed method and related work. In Sect. 5, experimental results are presented. Finally, we give concluding remarks and discuss some future work in Sect. 6.

2 Related Work

Many dimensionality reduction methods have been proposed in recent years. Although they have different names and are derived from various motivations, a large portion of them fall into a unified graph Laplacian based framework. For example, Yan et al. [15] proposed a dimensionality reduction framework called graph embedding which can group many popular dimensionality reduction approaches into a general formulation. In this section, we first introduce the graph embedding framework, then briefly review some related work and show how they fall into the same framework. Finally, we provide a brief summary to discuss the strengths and weaknesses of this framework.

2.1 Graph Embedding

Given a data set X consisting of n examples and labels, \( \{ (\varvec{x}_i,y_i) \}_{i=1}^n \), where \( \varvec{x}_i \in \mathbb {R}^d\) denotes a d-dimensional example, \(y_i \in \{ 1,2,\ldots ,C \}\) denotes the class label corresponding to \(\varvec{x}_i\), and C is the total number of classes. The relationships among the examples can be easily characterized by an undirected weighted graph \(G = \{X, W\}\), where each example serves as a vertex of G and a symmetric weight matrix \(W \in \mathbb {R}^{n \times n}\) records the weight on the edge of each pair of vertices. W measures the similarity between each pair of examples, and its characteristics vary as the criterion of similarity changes. Generally, if two examples \(\varvec{x}_i\) and \(\varvec{x}_j\) are “close”, the corresponding weight \(W_{ij}\) is large, whereas if they are “far away”, then \(W_{ij}\) is small. Provided a certain W, the intrinsic geometry of the graph G can be represented by the Laplacian matrix [4], which is defined as

$$\begin{aligned} L = D - W, \end{aligned}$$
(1)

where D is a diagonal matrix with the i-th diagonal element being \(D_{ii} = \sum _{j\ne i}W_{ij}\). The Laplacian matrix is capable of representing certain geometry of data according to a specific weight matrix, and thus can be used for dimensionality reduction.
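
For concreteness, the Laplacian in (1) can be computed directly from a given weight matrix. The following NumPy sketch (the function name is hypothetical) assumes W is symmetric with a zero diagonal, matching the definition above.

```python
import numpy as np

def graph_laplacian(W):
    """Unnormalized graph Laplacian L = D - W (Eq. (1)).

    W is assumed to be an (n, n) symmetric weight matrix with zero diagonal,
    so D_ii = sum_j W_ij matches the definition in the text.
    """
    D = np.diag(W.sum(axis=1))
    return D - W
```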

To find a good low-dimensional embedding \(\varvec{b} = (b_{\varvec{x}_1},b_{\varvec{x}_2},\ldots ,b_{\varvec{x}_n})^{\top }\) from high-dimensional data, we have to preserve the intrinsic geometry of the original data as much as possible. Therefore, it is natural to seek the embedding preserving the most information from G in each dimension. This graph-preserving criterion can be formulated as follows [15]:

$$\begin{aligned} \varvec{b}^*=&\arg \mathop {\min }\limits _{\varvec{b}^{\top } L_p \varvec{b} = a} \sum _{i \ne j} W_{ij} \left( b_{\varvec{x}_i}-b_{\varvec{x}_j} \right) ^2 \nonumber \\ =&\arg \mathop {\min }\limits _{\varvec{b}^{\top } L_p \varvec{b}=a} \varvec{b}^{\top } L \varvec{b}, \end{aligned}$$
(2)

where L is the Laplacian matrix of G defined in (1), \(L_p\) is the penalty constraint matrix, and a is a constant defined to avoid a trivial solution of the objective function. Note that \(L_p\) can have multiple forms, which is usually a diagonal matrix or the Laplacian matrix of a penalty graph \(G_p=\{X,W^p\}\) constructed by the same vertices X yet a different weight matrix \(W^p\).

In this paper, we mainly focus on linear dimensionality reduction, where the embedding of each example \(\varvec{x}_i\) is computed as \(b_{\varvec{x}_i}=\varvec{t}^{\top } \varvec{x}_i\) with \(\varvec{t}\) a projection vector. In this case, (2) becomes:

$$\begin{aligned} \varvec{t}^* = \arg \min _{\varvec{t}^{\top } X L^p X^{\top } \varvec{t}=a} \varvec{t}^{\top } X L X^{\top } \varvec{t}. \end{aligned}$$
(3)

The objective function (3) can be converted to a generalized eigenvalue problem:

$$\begin{aligned} X L X^{\top }\varvec{t}=\lambda X L^p X^{\top } \varvec{t}, \end{aligned}$$
(4)

whose solution can be easily given by the eigenvector with respect to the smallest eigenvalue.
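
As a minimal illustration of how (3) and (4) can be solved in practice, the sketch below (hypothetical helper, assuming X is the d-by-n data matrix and L, \(L^p\) are precomputed) uses SciPy's symmetric generalized eigensolver; the small ridge term is a numerical convenience not required by the formulation.

```python
import numpy as np
from scipy.linalg import eigh

def graph_embedding_projection(X, L, Lp, ridge=1e-8):
    """Solve X L X^T t = lambda X L^p X^T t (Eq. (4)) and return the
    eigenvector associated with the smallest eigenvalue."""
    A = X @ L @ X.T
    B = X @ Lp @ X.T + ridge * np.eye(X.shape[0])  # keep B positive definite
    _, eigvecs = eigh(A, B)                        # eigenvalues in ascending order
    return eigvecs[:, 0]                           # minimizer of (3)
```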

Given the above results, many local structure oriented dimensionality reduction approaches, which are closely related to our proposed method, can be grouped into this unified framework. Next, we mainly discuss three of them: MFA, LSDA and LFDA.

2.2 Marginal Fisher Analysis

Marginal Fisher analysis (MFA) is a dimensionality reduction approach that is directly derived from the graph embedding framework [15]. The main idea of MFA is to preserve the intraclass compactness represented by an intrinsic graph under the constraint that the interclass separability characterized by a penalty graph should be kept. For this purpose, each example is connected to its \(k_1\)-nearest neighbors belonging to the same class in the intrinsic graph, and the penalty graph is built by connecting the \(k_2\)-nearest pairs of marginal points from different classes.

For each example, let \(N_{k_1}(i)\) be the set of the \(k_1\)-nearest neighbors of \(\varvec{x}_i\) in the same class, and let \(P_{k_2}(i)\) denote the set of the \(k_2\)-nearest pairs among the set \(\{(i,j),y_i \ne y_j \}\). The intraclass compactness is formulated by a local within-class scatter matrix \({\bar{S}}_w\):

$$\begin{aligned} {\bar{S}}_w&=\mathop {\sum }\limits _{i}\mathop {\sum }\limits _{j:\, i \in N_{k_1}(j)~\text {or}~j \in N_{k_1}(i)} (\varvec{x}_i - \varvec{x}_j)(\varvec{x}_i - \varvec{x}_j)^{\top } \\&= \,2 X ({\bar{D}}-{\bar{W}}) X^{\top } \\&= \,2 X {\bar{L}} X^{\top }, \end{aligned}$$

where \({\bar{L}} = {\bar{D}}-{\bar{W}}\) is the Laplacian matrix, \({\bar{D}}\) is a diagonal matrix with the i-th diagonal element being \({\bar{D}}_{ii} = \sum _{j\ne i}{\bar{W}}_{ij}\), and the weight matrix \({\bar{W}}\) is defined as follows:

$$\begin{aligned} {\bar{W}}_{ij} = \left\{ {\begin{array}{ll} 1 &{} ~~\text {if}~i \in N_{k_1}(j)~\text {or}~j \in N_{k_1}(i) \\ 0 &{} ~~\text {else}. \end{array}} \right. \end{aligned}$$

Similarly, the interclass separability can be characterized by a local between-class scatter matrix \({\bar{S}}_b\):

$$\begin{aligned} {\bar{S}}_b&=\mathop {\sum }\limits _{i}\mathop {\sum }\limits _{j:\,(i,j) \in P_{k_2}(i)~\text {or}~(i,j) \in P_{k_2}(j)} (\varvec{x}_i - \varvec{x}_j)(\varvec{x}_i - \varvec{x}_j)^{\top } \\&=\, 2 X ({\bar{D}}^p-{\bar{W}}^p) X^{\top } \\&=\, 2 X {\bar{L}}^p X^{\top }, \end{aligned}$$

where \({\bar{L}}^p = {\bar{D}}^p-{\bar{W}}^p\) is the Laplacian matrix, \({\bar{D}}^p\) is a diagonal matrix with the i-th diagonal element being \({\bar{D}}_{ii}^p = \sum _{j\ne i}{\bar{W}}_{ij}^p\) and the weight matrix \({\bar{W}}^p\) is defined as follows:

$$\begin{aligned} {\bar{W}}_{ij}^p = \left\{ {\begin{array}{ll} 1 &{} ~~\text {if}~(i,j) \in P_{k_2}(i)~\text {or}~(i,j) \in P_{k_2}(j) \\ 0 &{} ~~\text {else}. \end{array}} \right. \end{aligned}$$

Following the graph-preserving criterion presented in (2), the objective function of MFA can be written as follows:

$$\begin{aligned} \varvec{t}^{MFA}=&\arg \mathop {\min }\limits _{\varvec{t}^{\top } {\bar{S}}_b \varvec{t}=a} \varvec{t}^{\top } {\bar{S}}_w \varvec{t} \nonumber \\ =&\arg \min _{\varvec{t}^{\top } X {\bar{L}}^p X^{\top } \varvec{t}=a} \varvec{t}^{\top } X {\bar{L}} X^{\top } \varvec{t}. \end{aligned}$$
(5)
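
The two weight matrices \({\bar{W}}\) and \({\bar{W}}^p\) above can be built directly from pairwise distances. A rough sketch follows (hypothetical names; the penalty graph here connects the \(k_2\) globally shortest different-class pairs, which is one reading of \(P_{k_2}\) and should be treated as an assumption of the sketch).

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def mfa_weight_matrices(X, y, k1, k2):
    """Intrinsic (same-class k1-NN) and penalty (marginal-pair) weight matrices.

    X : (n, d) data matrix with examples as rows, y : (n,) integer labels.
    """
    n = X.shape[0]
    dist = squareform(pdist(X))

    W_intr = np.zeros((n, n))
    for i in range(n):
        same = np.where(y == y[i])[0]
        same = same[same != i]                      # same-class candidates
        nn = same[np.argsort(dist[i, same])[:k1]]   # k1 nearest of them
        W_intr[i, nn] = 1
        W_intr[nn, i] = 1

    W_pen = np.zeros((n, n))
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n) if y[i] != y[j]]
    pairs.sort(key=lambda p: dist[p])               # shortest inter-class pairs
    for i, j in pairs[:k2]:
        W_pen[i, j] = W_pen[j, i] = 1
    return W_intr, W_pen
```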

2.3 Locality Sensitive Discriminant Analysis

Another related method is locality sensitive discriminant analysis (LSDA) [3] which assumes that data live on or close to a manifold. It aims to preserve the local geometrical structure of the manifold while maximizing the local margin between different classes.

LSDA seeks a linear projection \(\varvec{t}\) optimizing

$$\begin{aligned}&\min \mathop {\sum }\limits _{ij} {\bar{W}}_{ij}^w \left( \varvec{t}^{\top } \varvec{x}_i - \varvec{t}^{\top } \varvec{x}_j\right) ^2, \\&\max \mathop {\sum }\limits _{ij} {\bar{W}}_{ij}^b \left( \varvec{t}^{\top } \varvec{x}_i - \varvec{t}^{\top } \varvec{x}_j\right) ^2 \end{aligned}$$

under the constraint that \(\varvec{t}^{\top } X {\bar{D}}^w X^{\top } \varvec{t}=1\). Let \(N^w(i)\) be the set of the k-nearest neighbors of \(\varvec{x}_i\) sharing the same label \(y_i\), and \(N^b(i)\) be the set of the k-nearest neighbors of \(\varvec{x}_i\) having the labels different from \(y_i\). Then the weight matrices \({\bar{W}}^w\) and \({\bar{W}}^b\) are defined as:

$$\begin{aligned} {\bar{W}}_{ij}^w =&\left\{ {\begin{array}{ll} 1 &{} ~~\text {if}~i \in N^w(j)~\text {or}~j \in N^w(i) \\ 0 &{} ~~\text {else}, \end{array}} \right. \\ {\bar{W}}_{ij}^b =&\left\{ {\begin{array}{ll} 1 &{} ~~\text {if}~i \in N^b(j)~\text {or}~j \in N^b(i) \\ 0 &{} ~~\text {else}. \\ \end{array}} \right. \end{aligned}$$

The objective function of LSDA described above can be formulated as follows:

$$\begin{aligned} \varvec{t}^{ LSDA}=&\arg \mathop {\max }\limits _{ \varvec{t}^{\top } X {\bar{D}}^w X^{\top } \varvec{t}=1} \varvec{t}^{\top } X \left( \alpha {\bar{L}}^b + (1-\alpha ) {\bar{W}}^w\right) X^{\top } \varvec{t} \nonumber \\ =&\arg \mathop {\min }\limits _{ \varvec{t}^{\top } X {\bar{D}}^w X^{\top } \varvec{t}=1} \varvec{t}^{\top } X \left( (\alpha -1) {\bar{W}}^w - \alpha {\bar{L}}^b\right) X^{\top } \varvec{t} , \end{aligned}$$
(6)

where \(\alpha \) is a trade-off parameter, \({\bar{L}}^b\) is the Laplacian matrix constructed from \({\bar{W}}^b\), and \({\bar{D}}^w\) is a diagonal matrix with the i-th diagonal element being \({\bar{D}}_{ii}^w = \sum _{j\ne i}{\bar{W}}_{ij}^w\). LSDA follows the framework defined in (3), and the solution \(\varvec{t}^{LSDA}\) is given by (4) with \(L=(\alpha -1) {\bar{W}}^w - \alpha {\bar{L}}^b\) and \(L^p={\bar{D}}^w\).
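
To make this correspondence explicit, a small sketch (hypothetical helper) that assembles the two framework matrices from given within-class and between-class weight matrices is shown below; the projection then follows from (4).

```python
import numpy as np

def lsda_framework_matrices(W_w, W_b, alpha):
    """Express LSDA in the graph-embedding form of Eq. (3):
    L = (alpha - 1) W^w - alpha L^b and L^p = D^w."""
    D_w = np.diag(W_w.sum(axis=1))
    L_b = np.diag(W_b.sum(axis=1)) - W_b
    L = (alpha - 1) * W_w - alpha * L_b
    return L, D_w

# The projection is then obtained from the generalized eigenproblem of Eq. (4),
# e.g. via the graph_embedding_projection sketch given earlier.
```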

2.4 Local Fisher Discriminant Analysis

Local Fisher discriminant analysis (LFDA) [12] combines the ideas of LDA and LPP [7] to overcome the problem that LDA [6] cannot appropriately handle multimodal data. More specifically, it evaluates the levels of the between-class scatter and the within-class scatter in a local manner, and tries to attain local between-class separation and local within-class structure preservation at the same time [12].

Let \({\tilde{S}}_w\) and \({\tilde{S}}_b\) be the local within-class scatter matrix and the local between-class scatter matrix defined by

$$\begin{aligned} {\tilde{S}}_w&=\,\frac{1}{2}\mathop {\sum }\limits _{ij} {\tilde{W}}_{ij}^w(\varvec{x}_i - \varvec{x}_j)(\varvec{x}_i - \varvec{x}_j)^{\top } = X {\tilde{L}}^w X^{\top }, \\ {\tilde{S}}_b&=\,\frac{1}{2}\mathop {\sum }\limits _{ij} {\tilde{W}}_{ij}^b(\varvec{x}_i - \varvec{x}_j)(\varvec{x}_i - \varvec{x}_j)^{\top } = X {\tilde{L}}^b X^{\top }, \end{aligned}$$

where \({\tilde{L}}^w\) and \({\tilde{L}}^b\) are the Laplacian matrices constructed from the weight matrices \({\tilde{W}}^w\) and \({\tilde{W}}^b\) with

$$\begin{aligned} {\tilde{W}}_{ij}^w =&\left\{ {\begin{array}{ll} A_{ij}/n_c &{}\quad \text {if}~y_i=y_j \\ 0 &{}\quad \text {if}~y_i \ne y_j, \end{array}} \right. \\ {\tilde{W}}_{ij}^b =&\left\{ {\begin{array}{ll} A_{ij}(1/n - 1/n_c) &{}\quad \text {if}~y_i=y_j \\ 1/n &{} \quad \text {if}~y_i \ne y_j. \\ \end{array}} \right. \end{aligned}$$

\(n_c\) denotes the number of examples from the c-th class, and \(A_{ij}\) is a weight that indicates the similarity between \(\varvec{x}_i\) and \(\varvec{x}_j\), whose definition is given as follows:

$$\begin{aligned} A_{ij} = \left\{ {\begin{array}{ll} \exp \left( - \frac{{\Vert \varvec{x}_i - \varvec{x}_j \Vert }^2}{\sigma _i \sigma _j}\right) &{}\quad \mathrm{if}~i \in N_{k}(j)~\mathrm{or}~j \in N_{k}(i) \\ 0 &{}\quad \mathrm{else}, \end{array}} \right. \end{aligned}$$

where \(\sigma _i\) is set to be the distance between \(\varvec{x}_i\) and its k-th nearest neighbor.

LFDA seeks a linear projection such that \({\tilde{S}}_w\) is minimized and \({\tilde{S}}_b\) is maximized. Essentially, this strategy is equivalent to finding a projection which fits the Fisher criterion in the local area around each example. The optimization problem of LFDA is given as follows:

$$\begin{aligned} \varvec{t}^{LFDA}=&\arg \mathop {\max }\limits _{\varvec{t}} \frac{\varvec{t}^{\top } {\tilde{S}}_b \varvec{t}}{\varvec{t}^{\top } {\tilde{S}}_w \varvec{t}} = \arg \mathop {\max }\limits _{\varvec{t}} \frac{\varvec{t}^{\top } X {\tilde{L}}^b X^{\top } \varvec{t}}{\varvec{t}^{\top } X {\tilde{L}}^w X^{\top } \varvec{t}} \nonumber \\ =&\arg \min _{\varvec{t}^{\top } X {\tilde{L}}^b X^{\top } \varvec{t}=1} \varvec{t}^{\top } X {\tilde{L}}^w X^{\top } \varvec{t}. \end{aligned}$$
(7)

According to (7), it is easy to see that LFDA also falls into the graph embedding framework defined in (3) with \(L={\tilde{L}}^w\), \(L^p={\tilde{L}}^b\) and \(a=1\).
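
For completeness, the local-scaling affinity \(A_{ij}\) defined above can be computed as in the following sketch (hypothetical names; the default \(k=7\) is only a common choice and is an assumption here, as is the small floor on \(\sigma _i\) used to avoid division by zero).

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def local_scaling_affinity(X, k=7):
    """Affinity A_ij = exp(-||x_i - x_j||^2 / (sigma_i sigma_j)) restricted to
    k-nearest-neighbor pairs, with sigma_i the distance from x_i to its k-th
    nearest neighbor. X has examples as rows."""
    n = X.shape[0]
    dist = squareform(pdist(X))
    order = np.argsort(dist, axis=1)              # column 0 is the point itself
    sigma = np.maximum(dist[np.arange(n), order[:, k]], 1e-12)
    knn = np.zeros((n, n), dtype=bool)
    knn[np.arange(n)[:, None], order[:, 1:k + 1]] = True
    mask = knn | knn.T                            # i in N_k(j) or j in N_k(i)
    A = np.exp(-dist ** 2 / np.outer(sigma, sigma))
    return np.where(mask, A, 0.0)
```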

2.5 A Brief Summary

With the above results, it is easy to see that all the methods mentioned above can be considered in the same graph Laplacian based framework; the main difference among them lies only in the graphs adopted by each method. For different methods, such graphs are constructed with certain weight matrices to incorporate specific neighborhood information of the data set. One important merit of this framework is that it not only takes advantage of the facility of the Laplacian matrix to preserve the local geometry of the data manifold, but also benefits from the elegant formulation which can be easily optimized through generalized eigenvalue decomposition.

Although the graph Laplacian provides us with a powerful and flexible tool to discover the underlying data manifold, it fails to discover the local geometrical structure from tangent spaces, and thus may lose much useful information whose effectiveness has been shown in many applications, especially in handwritten digit recognition [11]. Moreover, the graph Laplacian may fail to capture meaningful manifold structures when data are sparsely distributed in the original space. In this case, the graph Laplacian constructed from sparsely distributed data in the high-dimensional space may not be able to discover the correct underlying manifold, since it can hardly connect sparse data points into a smooth manifold. On the other hand, the tangent spaces of the underlying manifold, which are low-dimensional in nature, can reflect the manifold structure in each local area. This implies that tangent spaces are very useful for learning the data manifold. How, then, can we develop a dimensionality reduction algorithm that combines the flexibility of the graph Laplacian with the utility of tangent spaces? To solve this problem, we present an algorithm which can readily use the structural information from tangent spaces for supervised dimensionality reduction.

3 Local Tangent Space Discriminant Analysis

In this section, we present the local tangent space discriminant analysis (TSD) algorithm and its non-linear extension. As a supervised dimensionality reduction method, TSD aims to seek an embedding space where the local manifold structure of the data belonging to the same class is preserved as much as possible, and the marginal data points with different class labels are better separated. Compared with the methods discussed in Sect. 2, the key advantage of our algorithm is that it is capable of capturing the local manifold structure from tangent spaces without losing the analytic form of the solution.

3.1 Preliminaries

To begin with, we briefly introduce the concepts of the tangent space and tangent vector. In differential geometry, one can attach to every point \(\varvec{x}\) of a differentiable manifold \(\mathcal {M}\) a tangent space \(T_{\varvec{x}} \mathcal {M}\) in which every vector tangentially passes through \(\varvec{x}\). The elements of the tangent space are called tangent vectors at \(\varvec{x}\), which is a vector that is tangent to a curve or surface at \(\varvec{x}\) (see Fig. 1 for the illustration). All the tangent spaces of a connected manifold have the same dimension, equal to the dimension of the manifold. In practice, if the manifold is smooth enough, the subspace constructed by performing PCA on the neighborhood of \(\varvec{x}\) can be a good approximation of the tangent space at \(\varvec{x}\) [14], since the nearby data points of \(\varvec{x}\) can be viewed as approximately lying in a subspace which is tangent to the data manifold. Once tangent spaces and tangent vectors have been introduced, they can serve to characterize a differentiable curve on the manifold whose derivative at any point is equal to the tangent vector attached to that point. This is a crucial property that plays a key role in deriving our TSD algorithm.

Fig. 1

The tangent space \(T_{\varvec{x}} \mathcal {M}\) and a tangent vector \(\varvec{v} \in T_{\varvec{x}} \mathcal {M}\), along a curve travelling through \(\varvec{x} \in \mathcal {M}\)
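
In practice, the tangent space basis at a point is estimated by local PCA on its neighborhood, as noted above. A minimal sketch of this step follows (hypothetical name; the neighborhood is assumed to be given and to contain at least m points).

```python
import numpy as np

def tangent_space_basis(X_nbrs, m):
    """Estimate an orthonormal basis of the tangent space at a point from its
    neighborhood X_nbrs (rows are the neighboring examples, point included).

    Performs local PCA: center the neighborhood and take the top-m right
    singular vectors. Returns a (d, m) matrix T with orthonormal columns."""
    centered = X_nbrs - X_nbrs.mean(axis=0)
    # Rows of Vt are principal directions ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[:m].T
```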

In recent years, some tangent space based dimensionality reduction methods have been proposed based on the above property. They learn the data manifold by estimating a function whose value can serve as a low-dimensional representation of the manifold. Tangent space intrinsic manifold regularization (TSIMR) [13] estimates a local linear function on the manifold which has constant manifold derivatives. Parallel vector field embedding (PFE) [8] represents a function along the manifold from the perspective of vector fields and requires the vector field at each data point to be as parallel as possible. Although these tangent space based methods are effective at preserving the manifold geometry, they are unsupervised in nature and have no ability to utilize the discriminant information of class labels. Therefore, they are not optimal for the supervised case. To solve this problem, we propose the TSD algorithm, which partly shares the same spirit with TSIMR and PFE but is designed for supervised dimensionality reduction.

3.2 The Algorithm

Suppose that data are sampled from an m-dimensional smooth manifold \(\mathcal {M}\) in a d-dimensional space. Let \(\mathcal {T}_{\varvec{z}} \mathcal {M}\) denote the tangent space attached to \(\varvec{z}\), where \(\varvec{z} \in \mathcal {M}\) is a fixed data point on \(\mathcal {M}\). Motivated by tangent space intrinsic manifold regularization (TSIMR) [13], we represent the local manifold structure of data by means of tangent spaces. According to the first-order Taylor expansion at \(\varvec{z}\), any function f defined on the manifold \(\mathcal {M}\) can be expressed as:

$$\begin{aligned} f(\varvec{x}) = f(\varvec{z}) + {\varvec{w}_{\varvec{z}}}^{\top } \varvec{u}_{\varvec{z}}(\varvec{x}) + O({||\varvec{x} - \varvec{z} ||}^2), \end{aligned}$$

where \(\varvec{x} \in \mathbb {R}^d\) is a d-dimensional data point and \(\varvec{u}_{\varvec{z}}(\varvec{x}) = T_{\varvec{z}}^{\top } (\varvec{x} - \varvec{z})\) is an m-dimensional tangent vector which gives the m-dimensional representation of \(\varvec{x}\) in \(\mathcal {T}_{\varvec{z}} \mathcal {M}\). \(T_{\varvec{z}}\) is a \(d \times m\) matrix formed by the orthonormal bases of \(\mathcal {T}_{\varvec{z}} \mathcal {M}\), which can be estimated through local PCA, i.e., performing standard PCA on the neighborhood of \(\varvec{z}\). And \({\varvec{w}_{\varvec{z}}}\) is an m-dimensional vector representing the directional derivative of f at \(\varvec{z}\) with respect to \(\varvec{u}_{\varvec{z}}(\varvec{x})\) on the manifold \(\mathcal {M}\).

In the scenario of dimensionality reduction, \(f(\varvec{x})\) denotes a one-dimensional embedding of \(\varvec{x}\). If two data points \(\varvec{z}\) and \(\varvec{z}'\) have a small Euclidean distance, then by using the first-order Taylor expansion at \(\varvec{z}'\), the embedding \(f(\varvec{z})\) can be represented as:

$$\begin{aligned} f(\varvec{z})=f(\varvec{z}') + \varvec{w}_{\varvec{z'}}^{\top } T_{\varvec{z'}}^{\top } (\varvec{z} - \varvec{z}') + O({||\varvec{z} - \varvec{z}' ||}^2). \end{aligned}$$
(8)

Suppose that the data can be well characterized by a linear function on the underlying manifold \(\mathcal {M}\). Then we can omit the remainder in (8) because the second-order derivatives of f vanish. Therefore, provided \(\varvec{z}\) and \(\varvec{z}'\) are close enough, any embedding \(f(\varvec{z})\) can be well approximated by a linear function as follows:

$$\begin{aligned} f(\varvec{z}) \approx f(\varvec{z}') + \varvec{w}_{\varvec{z'}}^{\top } T_{\varvec{z'}}^{\top } (\varvec{z} - \varvec{z}'). \end{aligned}$$
(9)

Based on the above results, we know that every data point in a local area should satisfy (9), which leads to a natural criterion for preserving the local manifold structure of data. Suppose that the training data include n examples \( \{ (\varvec{x}_i,y_i) \}_{i=1}^n \) belonging to C classes, where \(\varvec{x}_i \in \mathbb {R}^d\) is a d-dimensional example and \(y_i \in \{ 1,2,\ldots ,C \}\) is the class label associated with the example \(\varvec{x}_i\). Consider a linear projection \(\varvec{t} \in \mathbb {R}^d\) which maps the data to a one-dimensional embedding. Then the embedding of \(\varvec{x}\) can be expressed as \(f(\varvec{x}) = \varvec{t}^{\top } \varvec{x}\). We aim to find a projection \(\varvec{t}\) that minimizes the difference between the l.h.s and the r.h.s of (9) for every example and its neighbors belonging to the same class, and better separates the marginal data points in different classes.

In order to minimize the difference between the l.h.s and the r.h.s of (9) for nearby intraclass data, we need to construct the within-class graph \(G^w\). For the within-class graph \(G^w\), if \(\varvec{x}_i\) is among the \(k_1\)-nearest neighbors of \(\varvec{x}_j\) with \(y_i=y_j\), an edge is added between \(\varvec{x}_i\) and \(\varvec{x}_j\), and the elements of the weight matrix \(W^w\) are set to \(W^w_{ij}=W^w_{ji}=1\). Then we can formulate a within-class objective function as follows:

$$\begin{aligned}&\min \mathop {\sum }\limits _{ij} W_{ij}^w \left( \varvec{t}^{\top } \varvec{x}_i - \varvec{t}^{\top } \varvec{x}_j - \varvec{w}_{\varvec{x}_j}^{\top } T_{\varvec{x}_j}^{\top } (\varvec{x}_i - \varvec{x}_j)\right) ^2, \end{aligned}$$
(10)
$$\begin{aligned}&W_{ij}^w = \left\{ {\begin{array}{ll} 1 &{} ~~\text {if}~i \in N_{k_1}(j)~\text {or}~j \in N_{k_1}(i) \\ 0 &{} ~~\text {else}, \end{array}} \right. \end{aligned}$$
(11)

where \(N_{k_1}(i)\) denotes the set of the \(k_1\)-nearest neighbors of \(\varvec{x}_i\) sharing the same label \(y_i\), and the orthonormal basis matrix \(T_{\varvec{x}_i}\) of the tangent space \(\mathcal {T}_{\varvec{x}_i} \mathcal {M}\) at each data point \({\varvec{x}_i}\) is computed by performing PCA on the \(k_1\)-nearest neighborhood of \({\varvec{x}_i}\).

To separate the marginal data points in different classes, we need to maximize the distances between the embeddings of nearby interclass data points. To this end, we construct the between-class graph \(G^b\). For the between-class graph \(G^b\), if the pair (i, j) is among the \(k_2\)-shortest pairs in the set \(\{(i,j),y_i\ne y_j\}\), an edge is added between \(\varvec{x}_i\) and \(\varvec{x}_j\), and the elements of the weight matrix \(W^b\) are set to \(W^b_{ij}=W^b_{ji}=1\). Similar to the above objective function, we can write a between-class objective function as follows:

$$\begin{aligned}&\max \mathop {\sum }\limits _{ij} W_{ij}^b \left( \varvec{t}^{\top } \varvec{x}_i - \varvec{t}^{\top } \varvec{x}_j\right) ^2 , \end{aligned}$$
(12)
$$\begin{aligned}&W_{ij}^b = \left\{ {\begin{array}{ll} 1 &{}\quad \text {if}~(i,j) \in P_{k_2}(i)~\text {or}~(i,j) \in P_{k_2}(j) \\ 0 &{}\quad \text {else}, \end{array}} \right. \end{aligned}$$
(13)

where \(P_{k_2}(i)\) indicates the set of the \(k_2\)-nearest pairs among the set \(\{(i,j),y_i \ne y_j \}\).

Note that the terms \(\varvec{w}_{\varvec{x}_j}^{\top } T_{\varvec{x}_j}^{\top } (\varvec{x}_i - \varvec{x}_j)~(i,j=1,\ldots ,n)\) in (10) distinguish our strategy of preserving the local data structure from the graph Laplacian based one, where \(\varvec{w}_{\varvec{x}_j}\) is a coefficient vector and should be optimized with \(\varvec{t}\) simultaneously. These terms characterize how well two different examples \(\varvec{x}_i\) and \(\varvec{x}_j\) fit into the local linear approximation of f, which leads to an appropriate way to preserve the local intraclass geometry along the manifold \(\mathcal {M}\). Therefore, our strategy can extract more geometrical information from the data than the graph Laplacian based one. Moreover, any valid weight matrix \(W^w\), such as the one used in LFDA, can be used to preserve specific geometrical structure of the data manifold. This free-form property of the weight matrix is of great importance when we want to apply dimensionality reduction to various types of data.

The objective function (10) can be reformulated as a canonical matrix quadratic form as follows:

$$\begin{aligned}&\mathop {\sum }\limits _{ij} W_{ij}^w \left( \varvec{t}^{\top } \varvec{x}_i - \varvec{t}^{\top } \varvec{x}_j - \varvec{w}_{\varvec{x}_j}^{\top } T_{\varvec{x}_j}^{\top } (\varvec{x}_i - \varvec{x}_j)\right) ^2 \nonumber \\&\quad = {\left( {\begin{array}{cc} \varvec{t} \\ \varvec{w} \end{array}} \right) ^ \top } \left( {\begin{array}{cc} {X{S_1}X^{\top }}&{}{X{S_2}} \\ {{S_2}^ \top X ^ \top }&{}{{S_3}} \end{array}} \right) \left( {\begin{array}{cc} \varvec{t} \\ \varvec{w} \end{array}} \right) \nonumber \\&\quad ={\left( {\begin{array}{cc} \varvec{t} \\ \varvec{w} \end{array}} \right) ^ \top } S \left( {\begin{array}{cc} \varvec{t} \\ \varvec{w} \end{array}} \right) , \end{aligned}$$
(14)

where we have defined \(\varvec{w} = \left( \varvec{w}_{\varvec{x}_1}^{\top }, \varvec{w}_{\varvec{x}_2}^{\top },\ldots ,\varvec{w}_{\varvec{x}_n}^{\top }\right) ^{\top }\), \(X = (\varvec{x}_1,\ldots ,\varvec{x}_n)\) is the data matrix, and S is a \((d + mn) \times (d + mn)\) positive semi-definite matrix constructed from four blocks, i.e., \(X S_1 X^{\top }\), \(X S_2\), \(S_2^{\top } X^{\top }\) and \(S_3\). For simplicity, we omit the detailed derivation of S here, which is available in Appendix 1.
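
Although the closed-form blocks of S are derived in Appendix 1, the same matrix can equivalently be assembled directly from the definition in (10) by accumulating one rank-one term per within-class edge. The sketch below (hypothetical names, assuming the tangent bases \(T_{\varvec{x}_j}\) have already been estimated) makes the structure of (14) explicit.

```python
import numpy as np

def build_tsd_scatter(X, W_w, T_list, m):
    """Assemble the (d + m*n) x (d + m*n) matrix S of Eq. (14) from Eq. (10).

    X      : (d, n) data matrix (columns are examples).
    W_w    : (n, n) within-class weight matrix.
    T_list : list of n (d, m) tangent-space bases, T_list[j] = T_{x_j}.
    """
    d, n = X.shape
    dim = d + m * n
    S = np.zeros((dim, dim))
    for i in range(n):
        for j in range(n):
            if W_w[i, j] == 0:
                continue
            diff = X[:, i] - X[:, j]
            c = np.zeros(dim)
            c[:d] = diff                              # coefficient of t
            block = slice(d + j * m, d + (j + 1) * m)
            c[block] = -T_list[j].T @ diff            # coefficient of w_{x_j}
            S += W_w[i, j] * np.outer(c, c)           # (c^T f)^2 = f^T c c^T f
    return S
```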

Recall that \(\varvec{w}_{\varvec{x}_i}\) is the directional derivative of f at \(\varvec{x}_i\). Note that the linear projection vector \(\varvec{t}\) is influenced by both the direction and the length of each \(\varvec{w}_{\varvec{x}_i}\). To make within-class examples more compact, we would like the projection \(\varvec{t}\) to be affected more by the direction of \(\varvec{w}_{\varvec{x}_i}\) than by its length. Therefore, it makes sense to regularize the length of \(\varvec{w}_{\varvec{x}_i}~ (i=1,\ldots ,n)\). This can be achieved by adding a regularizer \(\Vert \varvec{t}\Vert ^2 + \sum _i \Vert \varvec{w}_{\varvec{x}_i} \Vert ^2\) to (14). Defining \(\varvec{f} = (\varvec{t}^{\top }, \varvec{w}^{\top })^{\top }\), the optimization function becomes:

$$\begin{aligned}&\varvec{f}^{\top } S \varvec{f} + \gamma \left( \Vert \varvec{t}\Vert ^2 + \sum _i \Vert \varvec{w}_{\varvec{x}_i} \Vert ^2\right) \nonumber \\&\quad = \varvec{f}^{\top } S \varvec{f} + \gamma \Vert \varvec{f}\Vert ^2 = \varvec{f}^{\top } (S + \gamma I) \varvec{f}, \end{aligned}$$
(15)

where \(\gamma > 0\) is a trade-off parameter. In fact, the extra term \(\gamma \Vert \varvec{f}\Vert ^2\) is often referred to as the Tikhonov regularizer, which is commonly used with a small \(\gamma \) to keep matrices from being singular. However, the value of \(\gamma \) is crucial for our method, because it also controls the influence of \(\varvec{w}\)’s length on the projection \(\varvec{t}\).

With simple algebraic manipulation, the objective function (12) becomes

$$\begin{aligned}&\mathop {\sum }\limits _{ij} W_{ij}^b \left( \varvec{t}^{\top } \varvec{x}_i - \varvec{t}^{\top } \varvec{x}_j \right) ^2 \nonumber \\&\quad = 2 \varvec{t}^{\top } X (D^b-W^b) X^{\top } \varvec{t} = 2 \varvec{t}^{\top } X L^b X^{\top } \varvec{t} \nonumber \\&\quad ={\left( {\begin{array}{cc} \varvec{t} \\ \varvec{w} \end{array}} \right) ^ \top } \left( {\begin{array}{cc} {2 X L^b X^{\top }}&{} \varvec{0} \\ \varvec{0} &{} \varvec{0} \end{array}} \right) \left( {\begin{array}{cc} \varvec{t} \\ \varvec{w} \end{array}} \right) \nonumber \\&\quad ={\left( {\begin{array}{cc} \varvec{t} \\ \varvec{w} \end{array}} \right) ^ \top } S_b \left( {\begin{array}{cc} \varvec{t} \\ \varvec{w} \end{array}} \right) , \end{aligned}$$
(16)

where \(L^b = D^b - W^b\) is the Laplacian matrix, and \(D^b\) is a diagonal matrix with the i-th diagonal element being \(D_{ii}^b = \sum _{j\ne i}W_{ij}^b\).
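
Analogously, \(S_b\) in (16) involves the data only through \(2 X L^b X^{\top }\) and is zero elsewhere; a minimal sketch (hypothetical name) is given below.

```python
import numpy as np

def build_tsd_between_scatter(X, W_b, m):
    """Assemble S_b of Eq. (16): the top-left d x d block is 2 X L^b X^T and
    every other block is zero. X is (d, n) with columns as examples."""
    d, n = X.shape
    L_b = np.diag(W_b.sum(axis=1)) - W_b
    S_b = np.zeros((d + m * n, d + m * n))
    S_b[:d, :d] = 2 * X @ L_b @ X.T
    return S_b
```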

Finally, by integrating (15) and (16), the objective function of TSD can be written as follows:

$$\begin{aligned} \varvec{f}^{*} = \arg \max _{\varvec{f}} \frac{\varvec{f}^{\top } S_b \varvec{f}}{\varvec{f}^{\top } (S + \gamma I) \varvec{f}}. \end{aligned}$$
(17)

The optimization of (17) can be achieved by solving a generalized eigenvalue problem:

$$\begin{aligned} S_b \varvec{f} = \lambda (S + \gamma I) \varvec{f} \end{aligned}$$
(18)

whose solution can be easily given by the eigenvector with respect to the largest eigenvalue. In order to obtain a one-dimensional embedding of an example \(\varvec{x}\), we just use the first part of \(\varvec{f}^* = (\varvec{t}^{*\top }, \varvec{w}^{*\top })^{\top }\) and compute \(b = \varvec{t}^{*\top } \varvec{x}\). Suppose that we want to project d-dimensional data into an r-dimensional subspace. Let \(\varvec{f}_1,\ldots ,\varvec{f}_r\) be the solutions of (17) corresponding to the r largest eigenvalues \(\lambda _1>\ldots >\lambda _r\). Then the r-dimensional embedding \(\varvec{b}\) of \(\varvec{x}\) is computed as follows:

$$\begin{aligned} \varvec{b} = T^{\top } \varvec{x}, ~~T = (\varvec{t}_1,\ldots ,\varvec{t}_r). \end{aligned}$$
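
Assuming S and \(S_b\) have been assembled as above, the projection can be obtained with a standard symmetric generalized eigensolver; only the first d entries of each eigenvector (the \(\varvec{t}\)-part) enter the transformation matrix. The sketch below is illustrative, with hypothetical names.

```python
import numpy as np
from scipy.linalg import eigh

def tsd_projection(S, S_b, d, gamma, r):
    """Solve S_b f = lambda (S + gamma I) f (Eq. (18)) and return the (d, r)
    transformation matrix T formed by the t-parts of the top-r eigenvectors."""
    dim = S.shape[0]
    _, eigvecs = eigh(S_b, S + gamma * np.eye(dim))  # ascending eigenvalues
    top = eigvecs[:, ::-1][:, :r]   # eigenvectors of the r largest eigenvalues
    return top[:d, :]               # T = (t_1, ..., t_r); the w-parts are dropped

# Embedding of new data X_new with columns as examples: B = T.T @ X_new.
```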

Algorithm 1 gives the pseudo-code for TSD. It is worth noting that although \(\varvec{w}^*\) seems not to be used in computing the low-dimensional embeddings, as the parameter which is simultaneously optimized with \(\varvec{t}^*\), it exerts a crucial influence on the resultant transformation matrix T. This means that both \(\varvec{t}^*\) and \(\varvec{w}^*\) determine the final results of TSD.

The main computational cost of TSD lies in building tangent spaces for n data points and solving the generalized eigenvalue problem (18). Our algorithm has a time complexity of \(O((d^2m+m^2d)\times n)\) for the construction of n tangent spaces and \(O(r^2\times (d+mn))\) for finding the r eigenvectors with respect to the r largest eigenvalues. For comparison, we also give the time complexities of some classical and related methods. The time complexity of PCA is \(O(n^2d)\), and that of LDA is also \(O(n^2d)\) [16]. As we have discussed in Sect. 2, MFA, LSDA and LFDA fall into the same framework with different graphs, which implies that they have the same computational cost. Since LDA also falls into the graph Laplacian based framework [15], their time complexities turn out to be \(O(n^2d)\). The above analysis suggests that TSD is more time consuming than the other methods. However, since local tangent spaces are estimated by local PCA, we can obtain at most \(k_1+1\) meaningful orthonormal bases for each tangent space (see Footnote 1), where \(k_1\) is the size of the within-class neighborhood. This implies that the dimensionality m of the directional derivative \(\varvec{w}_{\varvec{x}_i}\) \((i=1,\ldots ,n)\) is always less than \(k_1+1\). In practice, \(k_1\) is usually small to ensure locality, so m is actually a small constant. To conclude, the overall time complexity of TSD is \(O((d^2m+m^2d)\times n + r^2\times (d+mn))\). Since m is usually small, TSD has an acceptable computational cost.

Algorithm 1 Local tangent space discriminant analysis (TSD)

3.3 Kernel TSD

TSD is a linear dimensionality reduction method. In this section, we propose Kernel TSD which can be performed in a Reproducing Kernel Hilbert Space (RKHS) for non-linear dimensionality reduction.

Consider a feature space \(\mathcal {F}\) induced by a non-linear mapping \(\phi \) : \(\mathcal {X} \rightarrow \mathcal {F}\), where \(\mathcal {X}\) is an input domain. We can construct an RKHS \(H_\mathcal {K}\) by defining a kernel function \(\mathcal {K}(\cdot ,\cdot )\) using the inner product operation \(\langle \cdot , \cdot \rangle \), such that \(\mathcal {K}(\varvec{x},\varvec{y}) = \langle \phi (\varvec{x}), \phi (\varvec{y}) \rangle \). Given a data set \(\{\varvec{x}_i \in \mathcal {X}\}_{i=1}^n\), we can define the data matrix in the feature space \(\mathcal {F}\) as \(\varPhi = (\phi (\varvec{x}_1),\ldots ,\phi (\varvec{x}_n))\). Then one can use the orthogonal projection to decompose any projection vector \(\varvec{t} \in H_\mathcal {K}\) into a sum of two functions: one lying in \(span\{ \phi (\varvec{x}_1),\ldots ,\phi (\varvec{x}_n) \}\), and the other lying in the orthogonal complementary space. Therefore, there exists a set of coefficients \(\alpha _i\) \((i = 1,2,\ldots ,n)\) satisfying

$$\begin{aligned} \varvec{t} = \sum _{i=1}^n \alpha _i \phi (\varvec{x}_i) + \varvec{v} = \varPhi \varvec{\alpha } + \varvec{v}, \end{aligned}$$
(19)

where \(\varvec{\alpha } = (\alpha _1,\alpha _2,\ldots ,\alpha _n)^{\top }\) and \(\langle \varvec{v},\phi (\varvec{x}_i)\rangle = 0\) for all i. Note that although we set \(\varvec{f} = (\varvec{t}^{\top },\varvec{w}^{\top })^{\top }\) and optimize \(\varvec{t}\) and \(\varvec{w}\) together, we should estimate tangent spaces in \(\mathcal {F}\) through local Kernel PCA [10] rather than reparametrize \(\varvec{w}\) like \(\varvec{t}\).

Let \(T_{\varvec{x}_i}^\phi \) be the matrix formed by the orthonormal bases of the tangent space attached to \(\phi (\varvec{x}_i)\). By replacing \(T_{\varvec{x}_i}\) with \(T_{\varvec{x}_i}^\phi \) \((i = 1,2,\ldots ,n)\) and substituting (19) into (15), the objective function (15) turns out to be:

$$\begin{aligned}&{\left( {\begin{array}{cc} \varvec{t} \\ \varvec{w} \end{array}} \right) ^ \top } \left( {\begin{array}{cc} {X{S_1}X^{\top }}&{}{X{S_2}} \\ {{S_2}^ \top X ^ \top }&{}{{S_3}} \end{array}} \right) \left( {\begin{array}{cc} \varvec{t} \\ \varvec{w} \end{array}} \right) + \gamma {\left( {\begin{array}{cc} \varvec{t} \\ \varvec{w} \end{array}} \right) ^ \top } \left( {\begin{array}{cc} \varvec{t} \\ \varvec{w} \end{array}} \right) \\&\quad ={\left( {\begin{array}{cc} \varvec{\alpha } \\ \varvec{w} \end{array}} \right) ^ \top } \left( {\begin{array}{cc} {K S_1 K} &{} {K S_2} \\ {S_2^{\top } K} &{} {{S_3}} \end{array}} \right) \left( {\begin{array}{cc} \varvec{\alpha } \\ \varvec{w} \end{array}} \right) + \gamma {\left( {\begin{array}{cc} \varvec{\alpha } \\ \varvec{w} \end{array}} \right) ^ \top } {\left( {\begin{array}{cc} K &{}\varvec{0} \\ \varvec{0}&{} {\bar{I}} \end{array}} \right) } \left( {\begin{array}{cc} \varvec{\alpha } \\ \varvec{w} \end{array}} \right) \\&\quad ={\left( {\begin{array}{cc} \varvec{\alpha } \\ \varvec{w} \end{array}} \right) ^ \top } \left( {\begin{array}{cc} {K S_1 K + \gamma K} &{} {K S_2} \\ {S_2^{\top } K} &{} {S_3+\gamma {\bar{I}}} \end{array}} \right) \left( {\begin{array}{cc} \varvec{\alpha } \\ \varvec{w} \end{array}} \right) \\&\quad ={\left( {\begin{array}{cc} \varvec{\alpha } \\ \varvec{w} \end{array}} \right) ^ \top } S^\phi \left( {\begin{array}{cc} \varvec{\alpha } \\ \varvec{w} \end{array}} \right) , \end{aligned}$$

where K is a kernel matrix with \(K_{ij} = \mathcal {K}(\varvec{x}_i,\varvec{x}_j)\) and \({\bar{I}}\) is an identity matrix sized \(mn \times mn\). Similarly, the objective function (16) becomes

$$\begin{aligned}&{\left( {\begin{array}{cc} \varvec{t} \\ \varvec{w} \end{array}} \right) ^ \top } \left( {\begin{array}{cc} {2 X L^b X^{\top }}&{} \varvec{0} \\ \varvec{0} &{} \varvec{0} \end{array}} \right) \left( {\begin{array}{cc} \varvec{t} \\ \varvec{w} \end{array}} \right) \nonumber \\&\quad = {\left( {\begin{array}{cc} \varvec{\alpha } \\ \varvec{w} \end{array}} \right) ^ \top } \left( {\begin{array}{cc} {2 K {L^b} K} &{} \varvec{0} \\ \varvec{0} &{} \varvec{0} \end{array}} \right) \left( {\begin{array}{cc} \varvec{\alpha } \\ \varvec{w} \end{array}} \right) \\&\quad = {\left( {\begin{array}{cc} \varvec{\alpha } \\ \varvec{w} \end{array}} \right) ^ \top } S_b^\phi \left( {\begin{array}{cc} \varvec{\alpha } \\ \varvec{w} \end{array}} \right) . \end{aligned}$$

Note that due to \(\langle \varvec{v},\phi (\varvec{x}_i)\rangle = 0\) for all i, every term of \(\varvec{v}\) vanishes from the above formulations. Finally, Kernel TSD can be converted to a generalized eigenvalue problem as follows:

$$\begin{aligned} S_b^\phi \varvec{\varphi } = \lambda S^\phi \varvec{\varphi }, \end{aligned}$$
(20)

where we have defined \(\varvec{\varphi } = (\varvec{\alpha }^{\top },\varvec{w}^{\top })^{\top }\).

Given the eigenvectors \(\varvec{\varphi }_1,\ldots ,\varvec{\varphi }_r\) with respect to the r largest eigenvalues of (20), the resultant transformation matrix can be written as \(\varGamma = (\varvec{\alpha }_1,\ldots ,\varvec{\alpha }_r)\). Then, the embedding \(\varvec{b}\) of an original example \(\varvec{x}\) is computed as:

$$\begin{aligned} \varvec{b} = \varGamma ^{\top } \varPhi ^{\top } \phi (\varvec{x}) = \varGamma ^{\top } (\mathcal {K}(\varvec{x}_1,\varvec{x}),\ldots ,\mathcal {K}(\varvec{x}_n,\varvec{x}))^{\top }. \end{aligned}$$
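
As an illustration of this final step, the sketch below (hypothetical names; the RBF kernel and its bandwidth are arbitrary choices for the example) embeds a new example given the learned coefficient matrix \(\varGamma \).

```python
import numpy as np

def kernel_tsd_embed(x_new, X_train, Gamma, kernel):
    """Embed a new example with Kernel TSD: b = Gamma^T (K(x_1,x),...,K(x_n,x))^T.

    X_train : (n, d) training examples as rows; Gamma : (n, r) coefficient
    matrix taken from the top-r generalized eigenvectors; kernel : callable."""
    k_vec = np.array([kernel(x_i, x_new) for x_i in X_train])
    return Gamma.T @ k_vec

# Example with an RBF kernel (the bandwidth is an arbitrary illustrative choice):
# rbf = lambda x, y, s=1.0: np.exp(-np.linalg.norm(x - y) ** 2 / (2 * s ** 2))
# b = kernel_tsd_embed(x_new, X_train, Gamma, rbf)
```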

4 Discussion

For developing a good graph-based dimensionality reduction method, one of the most important problems is how to construct a good graph so that the preferred data structure can be preserved. As we have discussed in Sect. 2, many existing methods such as MFA, LSDA and LFDA aim to design specific graphs to enhance the local compactness of the data in the same class and separate the data points with different class labels. However, none of them goes beyond the graph Laplacian based framework. Our method mainly focuses on developing a new strategy to extract more information about the data manifold from a given graph. More specifically, we use the first-order Taylor expansion to incorporate the structural information from tangent spaces, i.e., the terms \(\varvec{w}_{\varvec{x}_j}^{\top } T_{\varvec{x}_j}^{\top } (\varvec{x}_i - \varvec{x}_j)~(i,j=1,\ldots ,n)\) in (10), into the scatter matrix S. Moreover, it is worth noting that although we specify a certain weight matrix (11) to construct the within-class graph \(G^w\), our method is flexible enough to utilize the information from any valid graph, such as the one used in LFDA.

Local tangent space alignment (LTSA) [17] is a popular dimensionality reduction method which also uses information from tangent spaces. The main idea of LTSA is to align the local tangent spaces to construct global coordinates. Although both TSD and LTSA utilize local tangent spaces, there are two main differences between them: (1) they solve different problems in essence. TSD is a linear supervised dimensionality reduction method, while LTSA is a non-linear unsupervised one. As a result, our method not only considers the class labels to make use of discriminant information, but can also obtain an explicit transformation matrix to compute the mappings for out-of-sample data. (2) TSD is a graph-based method which can adopt any valid graph for training, whereas LTSA is not. This property provides TSD with much more flexibility to handle various types of data for different applications.

Our method shares the same spirit with TSIMR [13], and both of them employ tangent spaces to discover the geometrical structure of the data manifold. However, our approach and TSIMR differ in two key aspects: (1) Like LTSA, TSIMR is a non-linear unsupervised method, and thus has no ability to capture the discriminant information or give an explicit transformation matrix. (2) They have totally different objective functions. It should be noted that TSD employs (10) to construct the scatter matrix S, while the objective function of TSIMR contains additional terms \(\left\| \varvec{w}_{\varvec{x}_i} - T_{\varvec{x}_i}^{\top } T_{\varvec{x}_j} \varvec{w}_{\varvec{x}_j} \right\| _2^2~(i,j=1,\ldots ,n)\), and we find that these terms are not particularly beneficial for discriminant analysis.

The effect of the Tikhonov regularizer \(\gamma \Vert \varvec{f}\Vert ^2 = \gamma \left( \Vert \varvec{t}\Vert ^2+\sum _i\Vert \varvec{w}_{\varvec{x}_i}\Vert ^2\right) \) in (15) should be highlighted, since it plays a key role in our method. Generally, Tikhonov regularization is a common technique employed by many dimensionality reduction methods to deal with the singularity problem of the matrix, where the parameter \(\gamma \) is always set to a very small value. However, our method needs an appropriately large \(\gamma \) to penalize large \(\Vert \varvec{w}_{\varvec{x}_i}\Vert ~(i= 1,\ldots ,n) \) so that the within-class compactness can be enhanced.

5 Experiments

To evaluate the proposed method, related dimensionality reduction methods including PCA, LDA, MFA, LSDA and LFDA are compared with TSD on multiple real-world data sets from the UCI machine learning repository [1], the Protein Sequence data set (see Footnote 2) from the glycosylation database Uniprot (v8.0), and the USPS data set. Specifically, we first perform dimensionality reduction to map all examples into a subspace, and then carry out classification using the nearest neighbor classifier (1-NN) in the subspace. This experimental setting is the same as the one adopted in [12]. Moreover, we also compare with the baseline method that just employs the 1-NN classifier in the original space without performing dimensionality reduction.

Seven UCI data sets (Satellite, Theorem Prove, Breast Cancer, Column2C, Image, Ionosphere, Semeion Handwritten), the Protein Sequence data set, and the USPS data set are used in our experiments. Originally, the Theorem Prove and USPS data sets are divided into a training set and a test set; for simplicity, we just use their test sets to carry out the experiments. For the Protein Sequence data set, we use a subset of the Uniprot database which contains only 99 mammalian protein entries. For each data set, we randomly split a certain proportion of the data as the training set and the rest as the test set. Furthermore, all the parameters for MFA, LSDA, LFDA and TSD are selected by three-fold cross-validation. The configuration of each data set is shown in Table 1.

Table 1 List of the classification data sets used in our experiments

The performance of PCA and of the graph-based dimensionality reduction methods including MFA, LSDA, LFDA and TSD depends on the dimensionality of the discovered embedding subspace. Thus we show the best results obtained by those methods. Every experimental result is obtained from the average over 20 splits. We give the mean values and standard deviations of the error rates (%) on the employed data sets, where the best method is highlighted in bold font and the best and comparable ones based on the t-test at the 5 % significance level are marked with ‘\(\vartriangle \)’.

The experimental results on the satellite, theorem prove, breast cancer and column2C data sets are shown in Table 2. In most cases, classification with dimensionality reduction is statistically better than the baseline. However, LDA performs well on none of the four data sets, probably because the implicit assumption adopted by LDA mismatches the distributions of these data sets. On the other hand, all the graph-based methods obtain reasonably good results, because they aim to preserve the local structure of data. PCA also works well for the purpose of separating data from different classes. Although it does not attain the best performance, our method achieves comparably good results.

Table 2 Mean values and standard deviations of the error rates (%) on the satellite, theorem prove, breast cancer and column2C data sets

Table 3 describes the classification performance of each method on the Image, Ionosphere, Protein Sequence, Semeion Handwritten and USPS data sets. Again, LDA performs poorly. Surprisingly, the counterparts of TSD, including MFA, LSDA and LFDA, fail to perform well on the Protein Sequence and Semeion Handwritten data sets. In the case of the Semeion Handwritten data set, these methods are even worse than the baseline. The characteristics of the feature vectors in the two data sets probably explain why this happens. Every example in the two data sets has a sparse and binary feature vector with high dimensionality, in which only a small number of elements are one and the rest are zero. For instance, the Semeion Handwritten data set contains 1593 binary images of handwritten digits consisting of \(16 \times 16\) pixels. In this case, the graph Laplacian based methods may not be able to capture the meaningful local geometry of the data manifold any more. On the other hand, TSD achieves the best results, probably because it can capture extra geometrical information from tangent spaces. This suggests that our proposed method makes good use of the information from tangent spaces and thus can correctly discover the data structure. In addition, the limitation inherited from the graph Laplacian based framework, rather than the choice of graphs in each graph-based method, should be responsible for the undesirable results, since the adopted graph in TSD is similar to those in MFA, LSDA and LFDA. Moreover, even when the feature vectors are no longer sparse and binary, TSD still obtains the lowest error rates compared with the other methods with a high level of statistical significance on the Image and Ionosphere data sets. This demonstrates that, by utilizing the structural information from tangent spaces, TSD can not only improve the performance of dimensionality reduction, but also be applied to data sets on which the graph Laplacian based counterparts fail to perform effectively.

Table 3 Mean values and standard deviations of the error rates (%) on the image, ionosphere, protein sequence, semeion handwritten and USPS data sets

Table 4 gives the computation time of the different methods. As can be seen, TSD is less efficient than its counterparts, because it has to estimate the tangent space and tangent vector at each data point. In fact, this is also a weakness of other tangent space based methods [8, 13]. Therefore, proposing a strategy to make tangent space based methods more scalable is an interesting research direction.

Table 4 Computation time (in seconds) of each method for dimensionality reduction

6 Conclusion

In this paper, we have proposed a novel supervised dimensionality reduction method named local tangent space discriminant analysis (TSD), which differs from the methods based on the graph Laplacian framework. By introducing tangent spaces and using the first-order Taylor expansion, we develop a new strategy to utilize the information from tangent spaces, which leads to a natural way of preserving the geometrical structure of the data manifold. The proposed method aims to seek an embedding space where the local manifold structure of the data belonging to the same class is preserved as much as possible, while the marginal data points with different class labels are better separated. Moreover, TSD has an analytic solution obtained by solving a generalized eigenvalue problem, and it can be easily extended to non-linear dimensionality reduction through the kernel trick.

The effectiveness of the proposed method has been demonstrated by comparing it with related work on multiple real-world data sets, including the UCI data sets and the Protein Sequence data set. The experimental results show that TSD works well on data sets which can hardly be handled well by its counterparts, and attains better classification performance due to utilizing the extra information from tangent spaces. Future work includes extending our method to different learning scenarios such as semi-supervised learning and developing a sparse variant of TSD for large-scale learning tasks.