1 Introduction

In many areas of research such as chemometrics, bioinformatics, and machine learning, data samples are usually represented as points in high-dimensional spaces. Researchers in these fields are often confronted with tasks involving clustering and classification. Most clustering and classification algorithms work well with small dimensionality, usually less than 100, but their efficiency and effectiveness drop rapidly as the dimensionality increases; this is usually referred to as the “curse of dimensionality” [1]. With recent advances in computer technologies, there has been an explosion in the amount of data used for classification. This is certainly true for gene expression data generated using advanced microarray technologies, which are high-dimensional and contain thousands to tens of thousands of genes. A further complication is that publicly available data sets are very small, some with fewer than 100 samples. This situation is well known as the small sample size problem, where the number of predictor variables by far exceeds the number of samples. Clustering or classification of such data sets is almost infeasible. Thus, dimensionality reduction is often needed prior to classification to reduce the dimension and extract features that contain most of the necessary information in the data. Two of the most widely used dimension reduction algorithms are principal component analysis (PCA) [2,3,4,5] and partial least squares discriminant analysis (PLS-DA) [6,7,8,9,10,11].

PLS-DA is a supervised dimension reduction algorithm that attempts to maximize between-class covariance in the projected space. Its aim is to capture the global variability of the processed data and maximize between-class separation. Although PLS-DA is seen as an alternative to linear discriminant analysis (LDA), it has been shown in [12, 13] that the eigenstructure of PLS-DA does not capture any within-class information. In [13], Aminu and Ahmad proposed a modification of the PLS-DA algorithm called locality preserving PLS-DA (LPPLS-DA) which encodes the within-class information via an affinity graph. LPPLS-DA then finds a projection that respects the graph structure while minimizing the within-class distance. The affinity graph represents a sub-manifold embedded in the ambient space that allows for the modeling of local within-class data structure. The effectiveness of the approach in increasing the discriminating power of the conventional PLS-DA was demonstrated through experiments on appearance-based face recognition and complex chemical data sets [14]. Another effort to improve the discriminant power of PLS-DA is the locally weighted PLS-DA (LW-PLS-DA) [15]. LW-PLS-DA extended the locally weighted PLS approach described in [16], where locally weighted regression [17] was integrated into PLS-DA. It is also possible to incorporate other local manifold modeling techniques into PLS-DA [18]. These methods include Laplacian eigenmaps (LE) [19, 20], locally linear embedding (LLE) [21] and neighborhood preserving embedding (NPE) [22]. Combining a global method such as PLS-DA with a local manifold learning method in the form of a global–local structure preserving framework is among the current trends in feature extraction, and very promising results have been reported [23,24,25,26,27,28,29]. It was also emphasized in [30] that, for multi-clustered data in particular, a global–local structure preserving framework can help to preserve information on both global and local clustering structures in the low-dimensional embedding.

Both PLS-DA and LPPLS-DA are linear methods and are not equipped to handle nonlinearities. To overcome this limitation, we propose in this paper a new algorithm for discriminant feature extraction, called Kernel Locality Preserving PLS-DA (KLPPLS-DA). Our proposed KLPPLS-DA algorithm is essentially a kernel extension of the linear LPPLS-DA algorithm in reproducing kernel Hilbert space (RKHS). Unlike other kernelized extensions of PLS-DA (for example, see [31,32,33,34,35]), the proposed KLPPLS-DA combines both the global and local discriminating structure of the data in RKHS. While kernelized PLS-DA (KPLS-DA) finds global linear patterns in the associated RKHS, KLPPLS-DA goes one step further to also identify intrinsic local manifold structures and uses that information to provide more compact within-class structures. In this way, KLPPLS-DA not only provides a nonlinear extension to PLS-DA but also has the potential to provide increased discriminant power. We show how the KLPPLS-DA solutions can be obtained by solving a generalized eigenvalue problem, which reveals the link between KLPPLS-DA and several other classical kernel-based feature extraction methods such as kernel Fisher discriminant (KFD) [36], kernel locality preserving projections (KLPP) [37] and kernel local Fisher discriminant analysis (KLFDA) [38].

The remainder of the paper is organized as follows: In Sect. 2, a brief review of the PLS-DA and LPPLS-DA algorithms is provided. In Sect. 3, global and local structure preservation in RKHS is described and the proposed Kernel Locality Preserving PLS-DA algorithm is outlined. Extensive experimental results on two synthetic datasets and six gene expression datasets are presented in Sect. 4. Finally, concluding remarks and suggestions for future work are given in Sect. 5.

2 A Brief Review of PLS-DA and LPPLS-DA

2.1 PLS-DA

PLS-DA is derived from the well-known PLS algorithm for modeling the linear relationship between two sets of observed variables. The PLS algorithm has been widely used as a feature extraction method to deal with the undersampling and multi-collinearity issues usually encountered in high-dimensional data. Suppose the data sets are \(X=[x_1,x_2,\ldots ,x_n]^T\in R^{n \times m}\) and \(Y=[y_1,y_2,\ldots ,y_n]^T\in R^{n \times N}\), where the rows correspond to observations and the columns correspond to variables. The main idea behind PLS is to find projection matrices \(W=[w_1,w_2,\ldots ,w_d]\in R^{m \times d}\) and \(V=[v_1,v_2,\ldots ,v_d]\in R^{N \times d}\), where each pair of column projection vectors \((w,v)\) maximizes the covariance of the projected data. Mathematically, this is represented by a constrained optimization problem of the form

$$\begin{aligned} \max _{\left\| w \right\| = 1,\left\| v \right\| =1}[\textrm{cov}(\bar{X}w,\bar{Y}v)], \end{aligned}$$
(1)

where \(\bar{X}\) and \(\bar{Y}\) are mean-centered data matrices. In matrix form, (1) can be written as

$$\begin{aligned} \max _{\left\| w \right\| = 1,\left\| v \right\| =1} w^T\bar{X}^T\bar{Y}v. \end{aligned}$$
(2)

By using the Lagrangian method (see for example [18]), the PLS component that is often sought after for dimension reduction is the vector w which can be solved via an eigenvalue problem of the form,

$$\begin{aligned} \bar{X}^T\bar{Y}\bar{Y}^T\bar{X}w=\lambda w. \end{aligned}$$
(3)

Alternatively, one can also cast problem (3) as a subspace optimization problem in the projection matrix \(\hat{W}\), that is

$$\begin{aligned} \max _{\hat{W} \in R^{m \times d}, \hat{W}^T\hat{W} = I} tr\left( \hat{W}^T\bar{X}^T\bar{Y}\bar{Y}^T\bar{X}\hat{W}\right) , \end{aligned}$$
(4)

where the columns of \(\hat{W}\) are the d eigenvectors of \(\bar{X}^T\bar{Y}\bar{Y}^T\bar{X}\) (with \(d < m\)) associated with the d largest eigenvalues.

For discrimination and classification purposes, the second data matrix \(\bar{Y}\) is replaced by a dummy (class membership) matrix that encodes the class information.

Let

$$\begin{aligned} X = [X_1,X_2,\ldots ,X_C]^T, \end{aligned}$$
(5)

where \(X_c=[x_{1}^{(c)},x_{2}^{(c)},\ldots ,x_{n_c}^{(c)}]\), \(c=1,2,\ldots , C\), with sample \(x_{i}^{(c)}\) being the ith vector belonging to the cth class, and \(n_c\) denotes the number of samples in the cth class such that \(\sum _{c=1}^{C}n_c=n\). Then, the class membership matrix \(\bar{Y}\) can be defined as

$$\begin{aligned} \bar{Y}=\begin{pmatrix} 1_{n_1} &{} 0_{n_1} &{} \ldots &{} 0_{n_1} \\ 0_{n_2} &{} 1_{n_2} &{} \ldots &{} 0_{n_2} \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ 0_{n_C} &{} 0_{n_C} &{} \ldots &{} 1_{n_C} \end{pmatrix} \end{aligned}$$
(6)

where \(0_{n_c}\) and \(1_{n_c}\) are \(n_c\times 1\) vectors of zeros and ones, respectively. It was shown in [13] that, given \(\bar{Y}\) as in (6),

$$\begin{aligned} S_b = \bar{X}^T\bar{Y}\bar{Y}^T\bar{X} = \sum _{c=1}^C n_{c}^2\left( \mu ^{(c)} - \mu \right) \left( \mu ^{(c)} - \mu \right) ^T, \end{aligned}$$
(7)

where \(\mu \) denotes the total sample mean vector and \(\mu ^{(c)}\) denotes the cth class mean vector. The matrix \(S_b\) in (7) is similar to the between-class scatter matrix in linear discriminant analysis (LDA), except that each term in the sum carries an extra factor of \(n_{c}\).
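For concreteness, the construction of (6) and (7) and the extraction of the leading eigenvectors in (4) can be written in a few lines of NumPy. The sketch below is illustrative only; the function and variable names are introduced here for the example and are not part of the formal development.

```python
import numpy as np

def plsda_components(X, labels, d):
    """Illustrative PLS-DA sketch: top-d eigenvectors of S_b in (7), cf. (3)-(4)."""
    classes = np.unique(labels)
    Y = (labels[:, None] == classes[None, :]).astype(float)  # class membership matrix (6)
    Xc = X - X.mean(axis=0)                                   # mean-centred data
    Sb = Xc.T @ Y @ Y.T @ Xc                                  # between-class matrix S_b in (7)
    evals, evecs = np.linalg.eigh(Sb)                         # S_b is symmetric
    return evecs[:, np.argsort(evals)[::-1][:d]]              # d leading eigenvectors

# toy usage: project 30 samples with 50 variables onto 2 PLS-DA components
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 50))
labels = np.repeat([0, 1, 2], 10)
W = plsda_components(X, labels, d=2)
Z = (X - X.mean(axis=0)) @ W
```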

2.2 LPPLS-DA

In [13], locality preserving PLS-DA was proposed to combine PLS-DA and existing manifold learning techniques. More specifically, an affinity graph is constructed to model local neighborhood structures. To further enhance the discriminating ability of the LPPLS-DA method, class information is employed while constructing the neighborhood graph.

The objective of LPPLS-DA is to find a lower-dimensional subspace in which the intrinsic geometrical and discriminant structure of the data is preserved. LPPLS-DA finds a projection matrix W that transforms the high-dimensional data set X into a lower-dimensional representation \(Z=XW\) such that the relative local distances among data samples are preserved and class separation is maximized at the same time. The objective function of LPPLS-DA can be formulated as the following multi-objective optimization problem:

$$\begin{aligned} {\left\{ \begin{array}{ll} \max _{W} tr(W^T\bar{X}^TYY^T\bar{X}W) \\ \min _{W} tr(W^T\bar{X}^TL\bar{X}W) \end{array}\right. } \end{aligned}$$
(8)

where Y is given by (6), \(L=D-S\) is the graph Laplacian, and D is a diagonal matrix whose ith diagonal element is \(D_{ii}=\sum _{j}S_{ij}\). The entries of the weight matrix S are defined as:

$$\begin{aligned} S_{ij} = {\left\{ \begin{array}{ll} \exp {\Big (-{\frac{\Vert x_i-x_j\Vert ^2}{t}}\Big )}; &{} \text {if}\ x_i\ \text {and}\ x_j\ \text {both belong to the same class}. \\ 0; &{}\text {Otherwise.} \end{array}\right. } \end{aligned}$$
(9)

where t is a user-specified parameter. Equation (8) leads to the following optimal solution:

$$\begin{aligned} W_\textrm{opt}=\arg \max tr\left( \frac{W^T\bar{X}^T YY^T\bar{X}W}{W^T\bar{X}^TL\bar{X}W}\right) . \end{aligned}$$
(10)

It can be shown that \(W_\textrm{opt}\) is the matrix whose columns are the eigenvectors corresponding to the principal eigenvalues of a generalized eigenvalue problem of the form:

$$\begin{aligned} \bar{X}^TYY^T\bar{X}w=\lambda \bar{X}^TL\bar{X}w \end{aligned}$$
(11)

Let \(w_1,w_2,\ldots ,w_d\) be the solutions of (11) associated with the d largest eigenvalues. Then, a data point x can be mapped into the d-dimensional LPPLS-DA subspace by

$$\begin{aligned} x\rightarrow z=xW_\textrm{opt}, \qquad W_\textrm{opt}=[w_1, w_2, \ldots , w_{d}]. \end{aligned}$$
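A minimal sketch of LPPLS-DA along the lines of (9)–(11) is given below. It uses SciPy's generalized symmetric eigensolver; the small ridge term added to the denominator matrix is purely for numerical stability in this illustration and is not part of the LPPLS-DA formulation in [13].

```python
import numpy as np
from scipy.linalg import eigh

def lpplsda_components(X, labels, d, t=1.0, eps=1e-8):
    """Illustrative LPPLS-DA sketch: generalized eigenproblem (11) with weights (9)."""
    n, m = X.shape
    classes = np.unique(labels)
    Y = (labels[:, None] == classes[None, :]).astype(float)    # class membership matrix (6)
    Xc = X - X.mean(axis=0)

    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    same_class = labels[:, None] == labels[None, :]
    S = np.where(same_class, np.exp(-sq_dists / t), 0.0)       # supervised weights (9)
    L = np.diag(S.sum(axis=1)) - S                             # graph Laplacian L = D - S

    A = Xc.T @ Y @ Y.T @ Xc                                    # numerator matrix of (10)
    B = Xc.T @ L @ Xc + eps * np.eye(m)                        # denominator matrix (ridge for stability only)
    evals, evecs = eigh(A, B)                                  # generalized eigenproblem (11)
    return evecs[:, np.argsort(evals)[::-1][:d]]

# usage: Z = (X - X.mean(axis=0)) @ lpplsda_components(X, labels, d=2)
```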

3 Kernel Locality Preserving PLS-DA (KLPPLS-DA)

From the description of LPPLS-DA in the previous section, it can be observed that LPPLS-DA inherits the linear framework of PLS-DA. In other words, just like PLS-DA, it describes a variable as a linear combination of linear features or vectors. However, the locality-preserving feature of LPPLS-DA gives it the added capability of representing locally nonlinear structures. This property makes LPPLS-DA suitable for problems involving data that are globally linear but contain patches of data that reside on local nonlinear manifolds. For hyperspectral image and gene expression data sets, which generally exhibit a global nonlinear structure, the applicability of LPPLS-DA may be quite limited.

To overcome this shortcoming, we propose a nonlinear extension of LPPLS-DA called kernel locality preserving PLS-DA (KLPPLS-DA) that allows for a global nonlinear framework while providing local representation for the respective classes so that local geometric/class structures are preserved. The main idea of KLPPLS-DA is based on the same perspectives that can be found in [34, 37] where LPPLS-DA is reformulated in an implicit feature space \({\mathcal {H}}\) induced by some nonlinear mapping

$$\begin{aligned} \phi :R^m \rightarrow {\mathcal {H}}. \end{aligned}$$

Given input data \(x_1, x_2, \ldots , x_n \in R^m\), the mapped data \(\phi (x_1),\ldots ,\phi (x_n) \in {\mathcal {H}}\) are assumed to have a linearly separable structure. The feature space \({\mathcal {H}}\) is the so-called reproducing kernel Hilbert space (RKHS) with the associated inner-product \(\langle .,.\rangle _{{\mathcal {H}}}\). The reproducing property of \({\mathcal {H}}\) allows the inner product to be represented by a kernel function \(\kappa (\cdot ,\cdot )\). More specifically, for any two variables \(x,y \in R^m\),

$$\begin{aligned} \kappa (x,y)=\langle \phi (x),\phi (y)\rangle _{{\mathcal {H}}} \end{aligned}$$
(12)

holds, where \(\kappa (\cdot ,\cdot )\) is a positive semi-definite kernel function. Some of the popular kernel functions are: the linear kernel \(\kappa (x,y)=x^Ty\); the polynomial kernel \(\kappa (x,y)=(x^Ty+1)^d\); and the radial basis function kernel \(\kappa (x,y)=\exp (-\Vert x-y\Vert ^2/2\sigma ^2)\).
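These kernel functions can be written directly as small Python functions. Note that the parameterizations shown here (in particular the offset in the polynomial kernel and the factor \(2\sigma ^2\) in the RBF kernel) are the common textbook forms and are used only for illustration.

```python
import numpy as np

# Common parameterizations, used for illustration only.
def linear_kernel(x, y):
    return float(x @ y)

def polynomial_kernel(x, y, degree=2):
    return float(x @ y + 1.0) ** degree

def rbf_kernel(x, y, sigma=1.0):
    return float(np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2)))
```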

3.1 Partial Least Squares Discriminant Analysis in RKHS

The goal here is to construct a lower-dimensional embedding in \({\mathcal {H}}\) using PLS-DA, and this can be done by constructing a subspace optimization problem analogous to (4) where the optimum projection matrix is obtained from the eigenstructure of the between-class scatter matrix. Depending on the nonlinear transformation \(\phi (.)\) the feature space \({\mathcal {H}}\) can be high-dimensional, even infinite-dimensional when the Gaussian kernel function is used. However, in practice, we are working only with n observations. In such circumstances, we have to restrict ourselves to finding a lower-dimensional subspace in the span of the points \(\phi (x_1),\ldots ,\phi (x_n)\). We start with a matrix of mapped, mean-centered training data \(\bar{\Phi } = [\bar{\phi }(x_1), \ldots , \bar{\phi }(x_n)]^T\), where \(\bar{\phi }(x_i)=\phi (x_i)-\mu _{\phi }\), such that \(\mu _{\phi }\) is the global centroid that is given by \(\mu _{\phi } = \frac{1}{n}\sum _{i=1}^n\phi (x_i)=\frac{1}{n}\Phi ^T 1_n\) with \(1_n\) being an n-element vector with entries all equal to one. Equivalently, we can write \(\bar{\Phi }=[\bar{\phi }(x_1), \ldots , \bar{\phi }(x_n)]^T=H\Phi \), where \(H=I_n-\frac{1}{n}1_n1_n^T\) is the usual centering matrix.
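The identity \(\bar{\Phi }=H\Phi \) also implies that the centered kernel matrix factorizes as \(\bar{K}=\bar{\Phi }\bar{\Phi }^T=HKH\), a fact used in Sect. 3.3. The following small numerical check illustrates this double-centering identity using an explicit toy feature map (the code is illustrative only).

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(8, 5))            # rows: mapped samples phi(x_i), explicit toy feature map
n = Phi.shape[0]
H = np.eye(n) - np.ones((n, n)) / n      # centering matrix H = I - (1/n) 1 1^T

Phi_bar = H @ Phi                        # mean-centred mapped data
K = Phi @ Phi.T                          # kernel matrix K
K_bar = Phi_bar @ Phi_bar.T              # centred kernel

assert np.allclose(K_bar, H @ K @ H)     # double-centring identity: K_bar = H K H
```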

Suppose the mean-centered mapped training set from C classes is represented by

$$\begin{aligned} \bar{\Phi } = \left[ \bar{\Phi }_1,\bar{\Phi }_2,\ldots ,\bar{\Phi }_C\right] ^T, \end{aligned}$$
(13)

where \(\bar{\Phi }_c=[\bar{\phi }(x_{1}^{(c)}), \bar{\phi }(x_{2}^{(c)}),\ldots ,\bar{\phi }(x_{n_c}^{(c)})]\), \(c=1,2,\ldots , C\) such that \(\bar{\phi }(x_{i}^{(c)})\) is the mean-centered, mapped ith training sample belonging to the cth class. Define the mean of the cth class in the feature space \({\mathcal {H}}\) as \(\mu _{\phi }^{(c)} =\frac{1}{n_c} \sum _{j}^{n_c}\phi (x_{j}^{(c)})\) and let \(\bar{Y}\) be defined as in (6). In terms of the mean vectors \(\mu _{\phi }\) and \(\mu _{\phi }^{(c)},c =1,2,\ldots ,C\):

$$\begin{aligned} \bar{\Phi }^T\bar{Y} =\begin{pmatrix} \vdots &{} \vdots &{} &{} \vdots \\ n_1(\mu _{\phi }^{(1)} - \mu _{\phi }) &{} n_2(\mu _{\phi }^{(2)} - \mu _{\phi }) &{} \ldots &{} n_C(\mu _{\phi }^{(C)} - \mu _{\phi }) \\ \vdots &{} \vdots &{} &{} \vdots \end{pmatrix}. \end{aligned}$$

Consequently, by writing \(\bar{\Phi }^T\bar{Y}\bar{Y}^T\bar{\Phi }\) as a sum of outer-products of the columns of \(\bar{\Phi }^T\bar{Y}\), we get

$$\begin{aligned} \bar{\Phi }^T\bar{Y}\bar{Y}^T\bar{\Phi } =\sum _{c=1}^C n_{c}^2\left( \mu _{\phi }^{(c)} - \mu _{\phi }\right) \left( \mu _{\phi }^{(c)} - \mu _{\phi }\right) ^T. \end{aligned}$$
(14)

The term on the right-hand side of (14) is a slightly altered between-class scatter matrix in the feature space \({\mathcal {H}}\). Now, let U denote a projection matrix in the feature space \({\mathcal {H}}\); then, the subspace optimization problem

$$\begin{aligned} \max _{U^TU = I} tr\left( U^T\bar{\Phi }^T\bar{Y}\bar{Y}^T\bar{\Phi }U\right) , \end{aligned}$$
(15)

finds a linear projection U that maximizes between-class separation in \({\mathcal {H}}\). The optimization problem (15) defines the PLS-DA method in the feature space \({\mathcal {H}}\).

3.2 Locality Projections and Within Class Structure Preservation in RKHS

Following [39], the optimal projection vector u that preserves the local within-class structure in the feature space \({\mathcal {H}}\) is obtained from the minimization problem

$$\begin{aligned} \min \frac{1}{2}u^T\bar{\Phi }^TL^{\Phi }\bar{\Phi }u, \end{aligned}$$
(16)

where \(L^{\Phi }=D^{\Phi }-S^{\Phi }\) is the graph Laplacian in \({\mathcal {H}}\), \(D^{\Phi }\) is a diagonal matrix whose ith diagonal element is \(D^{\Phi }_{ii}=\sum _{j}S^{\Phi }_{ij}\) and \(S^{\Phi }\) is the weight matrix in \({\mathcal {H}}\) whose entries are defined as:

$$\begin{aligned} S^{\Phi }_{ij} = {\left\{ \begin{array}{ll} \kappa (x_i,x_j); &{} \text {if}\ x_i\ \text {and}\ x_j\ \text {both belong to the same class}. \\ 0; &{}\text {Otherwise.} \end{array}\right. } \end{aligned}$$
(17)

The kernel function \(\kappa (x_i,x_j)\) measures the similarity between \(\phi (x_i)\) and \(\phi (x_j)\) in the feature space. Thus, (17) provides a local graph for each class, and preserving the graph structure is akin to preserving the within-class structure. Furthermore, \(S^{\Phi }\) can be written in the block-diagonal form given below:

$$\begin{aligned} S^{\Phi } = \begin{pmatrix} S_{\Phi }^{(1)} &{} \textbf{0} &{} \ldots &{} \textbf{0} \\ \textbf{0} &{} S_{\Phi }^{(2)} &{} \ldots &{} \textbf{0} \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ \textbf{0} &{} \textbf{0} &{} \ldots &{} S_{\Phi }^{(C)} \end{pmatrix}, \end{aligned}$$
(18)

where for each \(c=1,2,\ldots ,C\), \(S_{\Phi }^{(c)} \in R^{n_c \times n_c}\) and \((S_{\Phi }^{(c)})_{rs}=\kappa (x_{r}^{(c)},x_{s}^{(c)})\), \(r,s=1,2,\ldots ,n_c\). The block-diagonal shape of \(S^{\Phi }\) implies the same block-diagonal shape for the graph Laplacian \(L^{\Phi }\):

$$\begin{aligned} L^{\Phi } = \begin{pmatrix} L_{\Phi }^{(1)} &{} \textbf{0} &{} \ldots &{} \textbf{0} \\ \textbf{0} &{} L_{\Phi }^{(2)} &{} \ldots &{} \textbf{0} \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ \textbf{0} &{} \textbf{0} &{} \ldots &{} L_{\Phi }^{(C)} \end{pmatrix}, \end{aligned}$$
(19)

where for each \(c=1,2,\ldots ,C\), \(L_{\Phi }^{(c)} = D_{\Phi }^{(c)}- S_{\Phi }^{(c)} \in R^{n_c \times n_c}\) and \((D_{\Phi }^{(c)})_{rr} = \sum _{j}(S_{\Phi }^{(c)})_{rj}\), \(r=1,2,\ldots ,n_c\).

By combining (13) and (19), we are able to write \(\bar{\Phi }^TL^{\Phi }\bar{\Phi }\) as a sum over the classes:

$$\begin{aligned} \bar{\Phi }^TL^{\Phi }\bar{\Phi } = \sum _{c=1}^C \bar{\Phi }_c L_{\Phi }^{(c)}\bar{\Phi }_{c}^T. \end{aligned}$$
(20)

It is worth noting that for the special case where \((S_{\Phi }^{(c)})_{rs}=\frac{1}{n_c}\) for all \(r,s=1,2,\ldots ,n_c\), \(L_{\Phi }^{(c)}\) reduces to the \(n_c\times n_c\) matrix \(I-\frac{1}{n_c}(1_{n_c}1_{n_c}^T)\) with I an identity matrix and \(1_{n_c}\) an \(n_c\)-dimensional vector with all entries equal to one. It can be shown further that, for this special case, the right-hand side matrix in (20) reduces to the within-class scatter matrix of Kernel Fisher Discriminant Analysis (KFDA) [36]. This special case highlights the connection between (16) and the within-class distance in KFDA. KFDA assumes constant similarity in \({\mathcal {H}}\) between training samples of the same class, whereas the objective function in (16) admits a varying similarity among training samples. As such, samples from within the same class are allowed to have a structure, and this structure is preserved by the similarity graph in (17).
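The reduction in this special case is easy to verify numerically: with constant weights \((S_{\Phi }^{(c)})_{rs}=1/n_c\), the per-class Laplacian equals \(I-\frac{1}{n_c}1_{n_c}1_{n_c}^T\). A minimal check:

```python
import numpy as np

n_c = 6
S_c = np.full((n_c, n_c), 1.0 / n_c)        # constant within-class similarity 1/n_c
L_c = np.diag(S_c.sum(axis=1)) - S_c        # per-class graph Laplacian D - S

assert np.allclose(L_c, np.eye(n_c) - np.ones((n_c, n_c)) / n_c)
```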

3.3 Kernel LPPLS-DA: Global and Local Structure Preservation in RKHS

The Kernel LPPLS-DA aims to reduce the dimensionality of data while preserving both its global and local class structures in the feature space \({\mathcal {H}}\). The approach is similar to that of LPPLS-DA which finds a low-dimensional embedding of the training samples that best preserves both global and local structures of the data. Based on the analysis in Sects. 3.1 and 3.2, global and local structures can be preserved simultaneously by solving a multi-objective function of the form

$$\begin{aligned} {\left\{ \begin{array}{ll} \max _{U} tr(U^T\bar{\Phi }^T\bar{Y}\bar{Y}^T\bar{\Phi }U) \\ \min _{U} tr(U^T\bar{\Phi }^TL^{\Phi }\bar{\Phi }U) \end{array}\right. } \end{aligned}$$
(21)

where U is a projection matrix with d columns, \(d<n\) being the dimension of the low-dimensional embedding. Following the approach in [13], equation (21) leads to an optimal projection matrix of the form:

$$\begin{aligned} U_{opt}=\arg \max tr\left( \frac{U^T\bar{\Phi }^T\bar{Y} \bar{Y}^T\bar{\Phi }U}{U^T\bar{\Phi }^TL^{\Phi }\bar{\Phi }U}\right) . \end{aligned}$$
(22)

To solve for \(U_{opt}\), we first observe that each column \(u_i\), \(i= 1,2,\ldots , d\), of U is a solution of the generalized eigenvalue problem

$$\begin{aligned} \bar{\Phi }^T\bar{Y}\bar{Y}^T\bar{\Phi }u_i = \lambda \bar{\Phi }^TL^{\Phi }\bar{\Phi }u_i. \end{aligned}$$
(23)

Since each \(u_i\) lies in \({\mathcal {H}}\), we can write \(u_i =\nu _{i}^{(1)} + \nu _{i}^{(2)}\) where \(\nu _{i}^{(1)} \in {\mathcal {C}}(\bar{\Phi }^T)\) and \(\nu _{i}^{(2)} \in {\mathcal {C}}(\bar{\Phi }^T)^{\perp }={\mathcal {N}}(\bar{\Phi })\) [40]. Here, \({\mathcal {C}}(\bar{\Phi }^T)\) denotes the column space of \(\bar{\Phi }^T\), whereas \({\mathcal {N}}(\bar{\Phi })\) denotes the nullspace of \(\bar{\Phi }\), which is the orthogonal complement of \({\mathcal {C}}(\bar{\Phi }^T)\). Since \(\bar{\Phi }\nu _{i}^{(2)}=0\), then

$$\begin{aligned} \bar{\Phi }^T\bar{Y}\bar{Y}^T\bar{\Phi }u_i =\bar{\Phi }^T\bar{Y}\bar{Y}^T\bar{\Phi }\left( \nu _{i}^{(1)} + \nu _{i}^{(2)}\right) = \bar{\Phi }^T\bar{Y}\bar{Y}^T\bar{\Phi }\nu _{i}^{(1)}, \end{aligned}$$

and

$$\begin{aligned} \bar{\Phi }^TL^{\Phi }\bar{\Phi }u_i = \bar{\Phi }^TL^{\Phi } \bar{\Phi }\left( \nu _{i}^{(1)} + \nu _{i}^{(2)}\right) = \bar{\Phi }^T L^{\Phi }\bar{\Phi }\nu _{i}^{(1)}. \end{aligned}$$

Thus, only \(u_i \in {\mathcal {C}}(\bar{\Phi }^T)\) needs to be considered. In other words, there exist coefficients \(\alpha _{ij}\) \((j=1,\ldots ,n)\) such that:

$$\begin{aligned} u_i=\sum _{j=1}^{n}\alpha _{ij} \bar{\phi }(x_j)={\bar{\Phi }}^T\alpha _i \end{aligned}$$
(24)

where \(\alpha _i=[\alpha _{i1},\ldots ,\alpha _{in}]^T\).

Substituting (24) into (22) results in the following optimization problem:

$$\begin{aligned} A_\textrm{opt}=\arg \max tr\left( \frac{A^T\bar{K}\bar{Y}\bar{Y}^T \bar{K}A}{A^T\bar{K}L^{\Phi }\bar{K}A}\right) , \end{aligned}$$
(25)

where \(A = [\alpha _1,\ldots ,\alpha _n]\) is an n by n matrix, and \(\bar{K}=\bar{\Phi }\bar{\Phi }^T=H {\Phi }{\Phi }^TH=HKH\). The matrix \(K=\Phi \Phi ^T\) is called the kernel matrix, whereas \(\bar{K}\) is the centered kernel matrix. The subspace optimization problem (25) is equivalent to the corresponding eigendecomposition:

$$\begin{aligned} \bar{K}\bar{Y}\bar{Y}^T\bar{K}A = \bar{K}L^{\Phi }\bar{K}A\Lambda , \end{aligned}$$
(26)

and the columns of \(A_\textrm{opt}\) are the first d principal generalized eigenvectors of (26). Each column \(\alpha _i\) of \(A_\textrm{opt}\) gives a projection direction \(u_i\) in \({\mathcal {H}}\), and the projection of a data sample x onto the direction \(u_i\) is computed as:

$$\begin{aligned} \langle u_i,\bar{\phi }(x)\rangle _{{\mathcal {H}}}=\alpha _i^T\bar{\Phi }\bar{\phi }(x)=\alpha _i^T\bar{K}(:,x), \end{aligned}$$
(27)

where \(\bar{K}(:,x)=\bar{\Phi }\bar{\phi }(x)=\left[ \langle \bar{\phi }(x_1),\bar{\phi }(x)\rangle _{{\mathcal {H}}},\ldots ,\langle \bar{\phi }(x_n),\bar{\phi }(x)\rangle _{{\mathcal {H}}}\right] ^T\). A data sample x can be mapped into the d-dimensional subspace by:

$$\begin{aligned} x\rightarrow z=A_\textrm{opt}^T \bar{K}(:,x). \end{aligned}$$
(28)

For many datasets, the matrix \(\bar{K}L^{\Phi }\bar{K}\) in (26) is only positive semi-definite. This situation often occurs because the mapping from the input space to a feature space by a kernel function typically leads to a feature space with a dimension that is much larger than the number of samples. Positive semi-definiteness can lead to the ill-posedness of (25) because the denominator might become zero. To this end, we add a multiple of the identity matrix to \(\bar{K}L^{\Phi }\bar{K}\), i.e.,

$$\begin{aligned} \bar{K}L^{\Phi }\bar{K} \rightarrow \bar{K}L^{\Phi }\bar{K} +\mu I. \end{aligned}$$
(29)

For \(\mu > 0\), the strategy in (29) makes the denominator of (25) positive and leads to a more numerically stable generalized eigenvalue problem. Alternatively, the strategy can also be viewed as imposing a regularization on \(\Vert \alpha _i \Vert ^2\) that gives preference to solutions with small expansion coefficients. The algorithmic procedure of KLPPLS-DA is formally summarized in Algorithm 1.

Algorithm 1 The KLPPLS-DA algorithm
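Since Algorithm 1 is presented as a figure, the sketch below summarizes the training and projection steps of KLPPLS-DA as described in Sects. 3.1–3.3. It is an illustrative outline rather than the verbatim pseudocode of Algorithm 1; the kernel function is passed in as an argument (e.g., the RBF kernel above), the feature-space weights follow (17), and the regularization parameter \(\mu \) corresponds to (29).

```python
import numpy as np
from scipy.linalg import eigh

def klpplsda_fit(X, labels, d, kernel, mu=1e-6):
    """Illustrative KLPPLS-DA training sketch (Sects. 3.1-3.3), not the verbatim Algorithm 1."""
    n = X.shape[0]
    classes = np.unique(labels)
    Y = (labels[:, None] == classes[None, :]).astype(float)      # class membership matrix (6)

    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])     # kernel matrix K
    H = np.eye(n) - np.ones((n, n)) / n
    Kb = H @ K @ H                                               # centred kernel, K_bar = HKH

    same_class = labels[:, None] == labels[None, :]
    S = np.where(same_class, K, 0.0)                             # feature-space weights (17)
    Lphi = np.diag(S.sum(axis=1)) - S                            # graph Laplacian L_Phi (19)

    A = Kb @ Y @ Y.T @ Kb                                        # numerator matrix of (25)
    B = Kb @ Lphi @ Kb + mu * np.eye(n)                          # regularized denominator, cf. (29)
    evals, evecs = eigh(A, B)                                    # generalized eigenproblem (26)
    Aopt = evecs[:, np.argsort(evals)[::-1][:d]]                 # d leading generalized eigenvectors
    return Aopt, K, H

def klpplsda_transform(Aopt, K_train, H, X_train, X_new, kernel):
    """Project new samples via (28) using the centred cross-kernel K_bar(:, x)."""
    n = K_train.shape[0]
    k_new = np.array([[kernel(xi, x) for x in X_new] for xi in X_train])   # n x n_new
    k_bar = H @ (k_new - K_train @ np.ones((n, 1)) / n)          # centre consistently with K_bar = HKH
    return (Aopt.T @ k_bar).T                                    # rows: d-dimensional embeddings
```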

4 Experimentation and Results

The focus of our experiments is to compare the performance of KLPPLS-DA with the other PLS-DA-based methods, namely the conventional PLS-DA, LPPLS-DA, and kernel PLS-DA, in supervised learning of a low-dimensional subspace. The objective is to highlight the dimension reduction properties of each method and how these properties affect the classification of real-world datasets in a low-dimensional embedding. For convenience, the dimension reduction mechanisms of each method are summarized in Table 1.

Table 1 Summary of the PLS-DA family of methods

4.1 Experiment 1: Dimensionality Reduction for Data Visualization

For the experiments in this section, two synthetic datasets are used to demonstrate visually the comparative dimension reduction properties of the PLS-DA family of methods. The synthetic datasets are the Swiss roll and Helix datasets, which are benchmark datasets for nonlinear dimensionality reduction [41,42,43]. Data points for the Swiss roll were generated using the coordinates \(\{x = t \cos t, y = 0.5t \sin t, z\}\), where t is varied from 0 to \(5.5\pi \) in steps of \(5.5\pi /n\) (n is the number of data points) and z are random numbers in the interval [0, 30]. For the Helix dataset, data points are generated using the coordinates \(\{x = (2 + \cos 8t)\cos t, y = (2 + \cos 8t)\sin t, z = \sin 8t\}\), where t is varied from \(-\pi \) to \(\pi \) in steps of \(2\pi /n\). Preliminary experiments show that for the kernel methods (KPLS-DA and KLPPLS-DA), the radial basis function (RBF) kernel is most suitable for both the Swiss roll and Helix data; it is given by

$$\begin{aligned} \kappa (x_i,x_j)=\exp {\Big (-\frac{\Vert x_i-x_j\Vert ^2}{2\sigma ^2}\Big )}, \end{aligned}$$
(30)

where the parameter \(\sigma \) is set to 10.
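The two synthetic datasets can be generated directly from the stated coordinate formulas, as in the sketch below. The assignment of the points to three classes is not fully specified in the text, so the split by parameter range used here is an assumption made only for illustration.

```python
import numpy as np

def make_swiss_roll(n=1500, rng=None):
    """Swiss roll: x = t cos t, y = 0.5 t sin t, z uniform in [0, 30], t in [0, 5.5*pi]."""
    rng = np.random.default_rng() if rng is None else rng
    t = np.arange(n) * (5.5 * np.pi / n)
    z = rng.uniform(0.0, 30.0, size=n)
    return np.column_stack([t * np.cos(t), 0.5 * t * np.sin(t), z]), t

def make_helix(n=1500):
    """Helix: x = (2 + cos 8t) cos t, y = (2 + cos 8t) sin t, z = sin 8t, t in [-pi, pi]."""
    t = -np.pi + np.arange(n) * (2 * np.pi / n)
    r = 2 + np.cos(8 * t)
    return np.column_stack([r * np.cos(t), r * np.sin(t), np.sin(8 * t)]), t

# assumed class labels: split the parameter range into three contiguous localities
X_sr, t_sr = make_swiss_roll()
labels_sr = np.digitize(t_sr, np.quantile(t_sr, [1 / 3, 2 / 3]))
```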

4.1.1 Local Class Structure

To simulate datasets with local class structures, 1500 data points are generated for the Swiss roll and Helix structures. The data points are divided into three different classes as shown in Fig. 1. Data from each class are distributed in the same locality. The dimension reduction of the datasets by each PLS-DA-based method is shown in Fig. 2 for the Swiss roll dataset and Fig. 3 for the Helix dataset.

Fig. 1 Original datasets with local class structures

Fig. 2 Two-dimensional projection of the Swiss roll dataset with local class structures using PLS-DA, LPPLS-DA, KPLS-DA and KLPPLS-DA

Fig. 3 Two-dimensional projection of the Helix dataset with local class structures using PLS-DA, LPPLS-DA, KPLS-DA and KLPPLS-DA

From Figs. 2 and 3, it is observed that both PLS-DA and LPPLS-DA cannot unfold the global nonlinear structure of the Swiss roll and Helix datasets. This is expected due to the linear nature of the projections. The class structures under the LPPLS-DA projection are less spread out (particularly for the Swiss roll data) owing to the minimization of the within-class distance. The Swiss roll dataset was unfolded successfully into linear structures by both the KPLS-DA and KLPPLS-DA projections, thereby justifying the kernelized extensions of their linear counterparts. Class separation by the KLPPLS-DA projection is much more refined than by the KPLS-DA projection. Here, the minimization of the locality-preserving within-class distance in KLPPLS-DA provides a clear advantage over KPLS-DA. The KPLS-DA projection unfolds the nonlinear structure while preserving the global class structure but, unlike KLPPLS-DA, it does not provide enough separation between the classes. For the Helix dataset in Fig. 3, the KLPPLS-DA projection again successfully unfolds the nonlinear structure into three well-separated linear class structures. The KPLS-DA projection, however, cannot quite unfold the nonlinear structure, although it still maintains the global class structure.

4.1.2 Non-local Class Structure

To simulate datasets with non-local class structures, 1500 data points are generated for the Swiss roll and Helix structures, and the points are again divided into three different classes. This time, data points for Class 1 are separated into two groups at two different localities as shown in Fig. 4. The dimension reduction of the datasets by each PLS-DA-based method is shown in Fig. 5 for the Swiss roll dataset and Fig. 6 for the Helix dataset.

Fig. 4 Original datasets with non-local class structures

Fig. 5 Two-dimensional projection of the Swiss roll dataset with non-local class structures using PLS-DA, LPPLS-DA, KPLS-DA and KLPPLS-DA

Fig. 6 Two-dimensional projection of the Helix dataset with non-local class structures using PLS-DA, LPPLS-DA, KPLS-DA and KLPPLS-DA

Figures 5 and 6 further highlight the limitation of PLS-DA and LPPLS-DA in separating data points of different classes that are embedded within a nonlinear structure. Both methods preserve the original global (and local) distribution of the data points and are unable to assemble the two groups of Class 1 in the same locality. The minimization of the locality-preserving within-class distance in LPPLS-DA only manages to reduce the within-class distance within the locality of each group. It is noticed in Fig. 6 that the LPPLS-DA projection results in some unfolding of the nonlinear Helix structure in an attempt to bring the two groups of Class 1 data together, but only to a limited extent. On the other hand, the unfolding of the Swiss roll and Helix datasets by the KPLS-DA and KLPPLS-DA projections proves very useful in merging the two groups of Class 1 data. The global class structures from the KPLS-DA and KLPPLS-DA projections are quite similar; however, the KLPPLS-DA projection tends to provide more compact within-class structures.

4.2 Experiment 2: Dimensionality Reduction for the Classification of Real-World Dataset

4.2.1 Benchmark Procedure

To evaluate the performance of the PLS-DA family of methods in dimensionality reduction for the classification of real-world datasets, a benchmark procedure is used. Given a real-world dataset, the procedure involves five main components:

  1. Split the dataset into two parts: training data and test data;

  2. Compute the projection matrix for the low-dimensional embedding using the training data;

  3. Use the projection matrix to compute the projections of the training and test data in the low-dimensional embedding according to (28);

  4. Train suitable classifiers on the projected training data;

  5. Classify the projected test data using the trained classifiers.

For the classification in Steps 4 and 5, typical classifiers used in the literature [44,45,46] include the support vector machine (SVM) classifier, the k-NN classifier, the naive Bayes classifier (NBC) and random forest (RF). In the experiments in this section, both the SVM and the k-NN classifier (with \(k = 2\)) are used for the purpose of comparison.
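The five-step benchmark can be sketched as follows using scikit-learn classifiers. The sketch reuses the illustrative klpplsda_fit, klpplsda_transform and rbf_kernel functions introduced earlier (those names are ours and do not refer to a published implementation), with the parameter values stated in the text (\(k=2\), \(\sigma =2\)).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def benchmark(X, y, d=2, sigma=2.0, n_repeats=10, test_size=1 / 3, seed=0):
    """Steps 1-5 of the benchmark, averaged over repeated random splits."""
    kernel = lambda a, b: rbf_kernel(a, b, sigma=sigma)
    accs = {"knn": [], "svm": []}
    for r in range(n_repeats):
        # 1. split the dataset into training and test data
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed + r)
        # 2. learn the projection from the training data
        Aopt, K_tr, H = klpplsda_fit(X_tr, y_tr, d=d, kernel=kernel)
        # 3. project training and test data into the low-dimensional embedding, cf. (28)
        Z_tr = klpplsda_transform(Aopt, K_tr, H, X_tr, X_tr, kernel)
        Z_te = klpplsda_transform(Aopt, K_tr, H, X_tr, X_te, kernel)
        # 4.-5. train classifiers on the projected training data, classify the projected test data
        for name, clf in [("knn", KNeighborsClassifier(n_neighbors=2)), ("svm", SVC())]:
            clf.fit(Z_tr, y_tr)
            accs[name].append(clf.score(Z_te, y_te))
    return {name: (np.mean(v), np.std(v)) for name, v in accs.items()}
```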

For the kernel methods (kernel PLS-DA and KLPPLS-DA), two kinds of kernel functions are used: the radial basis function (RBF) kernel (30) and the polynomial kernel \(\kappa (x,y)=(x^Ty+1)^d\).

A series of preliminary experiments show that setting \(\sigma =d=2\) produces overall good results. Thus, these values are used in all of the forthcoming experiments.

4.2.2 Datasets

The experiments in this section are conducted on six publicly available datasets. The important statistics of the datasets are summarized below as well as in Table 2.

  1. The Brain Tumor dataset [47] contains gene expression profiles of 42 patient samples: 10 medulloblastomas, 10 non-embryonal brain tumors (malignant glioma), 10 central nervous system atypical teratoid/rhabdoid (CNS AT/RT) and renal/extrarenal rhabdoid tumors, 8 supratentorial PNETs and 4 normal human cerebella. Each sample contains 5597 gene expression values.

  2. The Colon cancer dataset contains gene expression levels of 40 tumor and 22 normal colon tissues for 6500 genes, measured using Affymetrix technology. In [48], a selection of 2000 genes with the highest minimal intensity across the samples was made. We used this selection in our experiments.

  3. The Leukemia dataset [49] contains gene expression levels of 72 patients diagnosed with acute lymphoblastic leukemia (ALL, 47 cases) and acute myeloid leukemia (AML, 25 cases). The number of gene expression levels is 7129. Following the procedure in [50], the dataset was further processed by carrying out logarithmic transformation, thresholding, filtering and standardization, so that the final data contain expression values of 3571 genes.

  4. The Lymphoma dataset [51] consists of gene expression levels for 62 lymphomas: 42 samples of diffuse large B-cell lymphoma (DLBCL), 11 samples of chronic lymphocytic leukemia (CLL) and 9 samples of follicular lymphoma (FL). The whole dataset contains expression values of 4026 genes.

  5. The Prostate dataset [52] contains 102 tissue samples: 52 tumor tissues and 50 normal tissues. The number of gene expression levels is 6033.

  6. The SRBCT dataset [53] contains 63 samples which include both tumor biopsy material [13 Ewing family of tumors (EWS) and 10 rhabdomyosarcoma (RMS)] and cell lines [10 EWS, 10 RMS, 12 neuroblastoma (NB) and 8 Burkitt lymphomas (BL)]. Each sample in this dataset contains 2308 gene expression values.

All data samples were normalized to have a unit norm so that the undesirable effects caused by genes with larger expression values are eliminated.

Table 2 Summary of datasets used in the experiments
Fig. 7 Two-dimensional visualization of the Brain, Colon, and Leukemia datasets using LPPLS-DA and KLPPLS-DA

Fig. 8 Two-dimensional visualization of the Lymphoma, Prostate, and SRBCT datasets using LPPLS-DA and KLPPLS-DA

4.2.3 Data Visualization

The complex structure of microarray data may be too challenging for linear methods to capture the true structure of the data in a very low dimensional subspace [4]. To briefly investigate the comparative performance of LPPLS-DA and KLPPLS-DA in distinguishing the different classes in tumor datasets based on gene expression data, we produce projections of the gene expression data onto two viewable dimensions. The two-dimensional representations obtained by both methods on the different datasets are shown in Figs. 7 and 8. For the two-class datasets, both methods produce comparable results, with the two clusters in each dataset separated quite well. For the multi-class datasets, it is observed that KLPPLS-DA performs considerably better than LPPLS-DA on the SRBCT dataset. The two-dimensional embedding subspace obtained from KLPPLS-DA preserves more structural information than that of LPPLS-DA, allowing a much better separation of the four clusters in the SRBCT dataset. The Brain dataset appears to be rather challenging for both LPPLS-DA and KLPPLS-DA. Although good separation is observed for the medulloblastoma, malignant glioma, and rhabdoid samples, it is apparent that the two-dimensional embeddings of these methods are unable to capture the information necessary to effectively separate the normal cerebella from the primitive neuroectodermal tumors (PNETs). These results also imply that the structure of the data is sometimes too complex to be captured in a two-dimensional subspace.
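Two-dimensional visualizations of this kind can be produced from the learned embeddings with a few lines of plotting code; the snippet below is illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_embedding(Z, labels, title):
    """Scatter the first two embedding components, coloured by class label."""
    labels = np.asarray(labels)
    fig, ax = plt.subplots(figsize=(4, 4))
    for c in np.unique(labels):
        mask = labels == c
        ax.scatter(Z[mask, 0], Z[mask, 1], s=15, label=f"class {c}")
    ax.set_xlabel("component 1")
    ax.set_ylabel("component 2")
    ax.set_title(title)
    ax.legend()
    return fig
```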

Fig. 9 Average classification accuracies by a k-NN classifier as a function of the reduced dimension. Here, two-thirds of the datasets are used as training sets and the remaining one-third as test sets

Fig. 10 Average classification accuracies by an SVM classifier as a function of the reduced dimension. Here, two-thirds of the datasets are used as training sets and the remaining one-third as test sets

4.2.4 Data Classification: Average Classification Accuracies

Here, we use increasing values of the embedding dimension to compare the performance of KLPPLS-DA with the other PLS-DA-based methods, namely the conventional PLS-DA, LPPLS-DA, and kernel PLS-DA. These methods are used to compute optimal low-dimensional embeddings of the gene expression data for a range of reduced dimensions. Then, for each reduced dimension, classification accuracies are calculated for each method. From these experiments, we aim to study the ability of each method to preserve class information in a low-dimensional embedding. The benchmark procedure in Sect. 4.2.1 is used, where each dataset is randomly partitioned into a training set consisting of two-thirds of the whole set and a test set consisting of the remaining one-third. To reduce variability in the results, we repeat the random splitting 10 times and report the average classification accuracies.

Figure 9 shows the average classification accuracies obtained using the k-NN classifier as the number of dimensions increases. Overall, we observe that KLPPLS-DA has the best performance for all the datasets used in the experiment. For the Colon, Leukemia, and Prostate datasets, KLPPLS-DA is able to achieve the highest accuracy rate with only a one-dimensional optimal embedding subspace. For the Brain, Lymphoma, and SRBCT datasets, the KLPPLS-DA method obtains the highest classification accuracies using only four-, two-, and three-dimensional optimal embedding subspaces, respectively. In Fig. 10, the average classification accuracies obtained using the SVM classifier are presented. Here, we see performances similar to those in Fig. 9, which again reveal that the KLPPLS-DA method provides the highest accuracy at the lowest dimension. Based on this observation, it can be deduced that the superior performance of KLPPLS-DA is independent of the choice of classifier.

One interesting observation from Figs. 9 and 10 is that the KPLS-DA method (i.e., kernelized PLS-DA without the locality preserving feature) performs rather poorly in most of the experiments. In the classification of the Brain and Prostate datasets, its performance is worse than that of LPPLS-DA and no better than that of the conventional PLS-DA method. We see this as an indication of the need to minimize the locality-preserving within-class distance in order to enhance the dimension reduction of gene expression data.

It is worth noting that the performance of KLPPLS-DA is comparable to that of several recently published feature selection methods. For example, in [54] it is reported that for the Colon and SRBCT datasets, the best classification accuracies obtained with a k-NN classifier are \(84\%\) and \(84.6\%\), respectively, and with an SVM classifier, \(83.8\%\) and \(93.6\%\), respectively. Meanwhile, using the k-NN classifier, our proposed KLPPLS-DA method achieves best classification accuracies of more than \(85\%\) for the Colon dataset (see Table 4) and \(100\%\) for the SRBCT dataset for the two-thirds train/test split (see Table 8). In another study [55], for the Lymphoma dataset, the best classification accuracies obtained using both k-NN and SVM are \(100\%\). The best classification accuracy recorded for KLPPLS-DA is also \(100\%\) using both classifiers (see Table 6). However, KLPPLS-DA only requires a two-dimensional optimal embedding subspace to achieve \(100\%\) accuracy, whereas in [55], an eight-dimensional subspace is used.

4.2.5 Data Classification: Effects of Different Data Partitioning

In this section, we investigate the sensitivity of the family of PLS-DA-based methods with respect to the data partition, i.e., the training sample size. It is reasonable to expect that as the training sample size is increased, the classification accuracy will increase as well, because more information is available to learn the relevant features of the dataset. All six datasets are randomly partitioned into training sets consisting of one-third, half, and two-thirds of the whole sets, with the test sets consisting of the remaining two-thirds, half, and one-third of the datasets, respectively. As in the previous experiments, the results are averaged over 10 random splits of the datasets. The best average classification accuracy rates, standard deviations, and the corresponding dimensions (in brackets) for the different splits are reported in Tables 3, 4, 5, 6, 7 and 8. The original number of features in all six datasets is in the thousands. For each method, 20 components are extracted.

Based on the experimental results presented in Tables 3, 4, 5, 6, 7 and 8, it can be seen that, in general, all the methods used in the experiment show an increase in classification accuracy as the size of the training sample increases. On all six datasets, the classification accuracies of KLPPLS-DA (using both the RBF and polynomial kernels) are significantly higher than those of the other methods for all data partitions. In particular, KLPPLS-DA provides better classification accuracies than LPPLS-DA and KPLS-DA on all the datasets under consideration. This shows that KLPPLS-DA improves on LPPLS-DA in terms of capturing nonlinear features of gene expression datasets, and on KPLS-DA owing to its ability to provide more compact local structures in RKHS.

5 Conclusions

In this paper, a novel algorithm for nonlinear feature extraction is proposed, called Kernel Locality Preserving PLS-DA (KLPPLS-DA). KLPPLS-DA combines nonlinear subspace learning with enhanced class discrimination via local structure preservation. As such, KLPPLS-DA has the ability to resolve the nonlinear structure while at the same time discovering the discriminant structure in the low-dimensional RKHS embedding. Several experiments using synthetic datasets with nonlinear structures have allowed a visual understanding of the dimension reduction properties of KLPPLS-DA in comparison with the other PLS-DA-based methods.

The capabilities of KLPPLS-DA are also demonstrated in multi-class tumor classification from gene expression data, where the experimental results show that the KLPPLS-DA algorithm consistently outperforms the linear PLS-DA-based methods. KLPPLS-DA also shows significant improvement over KPLS-DA, which indicates the importance of minimizing the locality-preserving within-class distance in RKHS to enhance the dimension reduction of gene expression data. The performance of LPPLS-DA and KLPPLS-DA is comparable in two-dimensional embeddings; however, in higher dimensions, KLPPLS-DA clearly outperforms LPPLS-DA. This indicates the presence of certain nonlinear features in the gene expression datasets that cannot be effectively resolved by LPPLS-DA. KLPPLS-DA also demonstrates the ability to handle the complexity of the gene expression datasets with the fewest dimensions and with a small number of training samples. In general, the performance of KLPPLS-DA is seen to be independent of the choice of classifier.

One main disadvantage of KLPPLS-DA is the need to identify optimum values of \(\sigma \) (for the RBF kernel) and d (for the polynomial kernel) for each dataset. In the experiments in Sect. 4, these parameters are determined empirically. For many of the datasets, the performances of KLPPLS-DA using the RBF and polynomial kernels are comparable to one another. However, the KLPPLS-DA method can perform poorly when given unsuitable values of \(\sigma \) and d. Thus, suitable values of \(\sigma \) and d are critical to the success of the KLPPLS-DA method.

Table 3 Classification accuracy on the Brain dataset
Table 4 Classification accuracy on the Colon dataset
Table 5 Classification accuracy on the Leukemia dataset
Table 6 Classification accuracy on the Lymphoma dataset
Table 7 Classification accuracy on the Prostate dataset
Table 8 Classification accuracy on the SRBCT dataset