1 Introduction

The presence of multivariate data in pattern recognition and machine learning applications has been increasing drastically over the years. Modern datasets are often composed of a large number of examples, each of which has an even larger number of features. While a large sample size is generally beneficial to learning processes, an arbitrary increase in the number of features is known to be highly detrimental, especially in pattern classification tasks (Trunk 1979; Lee and Verleysen 2007; Hughes 1968; Zimek et al. 2012; Marimont and Shapiro 1979; Chávez et al. 2001).

One of the major drawbacks in high-dimensional data analysis is the curse of dimensionality. This term was first mentioned by Bellman (1961) and refers to the fact that, to estimate a function of several variables to a given degree of accuracy, the sample size needs to grow with the number of variables. A related fact is that hyperspaces are inherently sparse, causing the empty space phenomenon (Carreira-Perpinan 1997). In contrast with the usual 3D Euclidean space, the geometric properties of hyperspaces are highly non-intuitive, making the learning of supervised discriminative functions a painful task (Jimenes and Landgrebe 1998). It has been shown that for linear classifiers, such as the Nearest Mean Classifier, the number of training samples needed in supervised classification problems is a linear function of the dimensionality, and for quadratic classifiers, such as the Bayesian classifier under the Gaussian hypothesis, it is a quadratic function of the dimensionality (Fukunaga 1990). In the case of non-parametric classifiers, the situation is even worse, since experimental results have shown that as dimensionality increases, the number of samples must grow exponentially (Scott 1992; Hwang et al. 1994). Thus, extracting relevant information from high-dimensional data requires a large sample size n, which is not always available. Therefore, a natural way to mitigate this problem, and indirectly reduce the required n, is to significantly reduce the data dimensionality m.

Another issue with high-dimensional data is that the usual Euclidean distance tends to behave poorly as a measure of dissimilarity. In this context, unsupervised metric learning methods try to overcome this limitation by finding more suitable distance functions. A class of algorithms extremely relevant to this problem is manifold learning. Manifold learning is deeply connected to unsupervised metric learning in the sense that, besides learning a more compact and meaningful representation for the observed dataset, these methods also learn a distance function that is geometrically better suited to represent a similarity measure between a pair of objects in the collection (Li and Tian 2018; Wang and Sun 2015; Yang and Jin 2006; Bellet et al. 2013; Suárez 2018).

The key idea of dimension reduction is to find the most compact low-dimensional structure that is embedded in a higher dimensional space. Historically, Occam’s razor has been used to justify dimension reduction (Domingos 1999). The basic concept in Occam’s razor is to choose the simplest model from a set of equivalent models to explain a given phenomenon (Huo et al. 2008). There are many approaches to dimensionality reduction, based on different assumptions and used in a variety of contexts. In this paper, we propose a parametric PCA algorithm based on an information-theoretic measure: the relative entropy. The main goal is to find a surrogate for the covariance matrix by replacing the Euclidean distance in the feature space with the KL-divergence between Gaussian distributions estimated in each local neighborhood. One possible limitation of PCA is that this method maximizes the variance of the retained data, which often produces clusters with large scattering. This can be a negative side-effect in many classification problems. In summary, the main contribution of the proposed method is that, unlike traditional dimensionality reduction methods, PCA-KL is a patch-based method, which makes it less sensitive to the presence of noise and outliers in data and produces more compact clusters, in the sense that their intra-class scatter is reduced. As a consequence, in several different datasets, the features extracted by PCA-KL show more discriminant power than the features obtained by some manifold learning algorithms, making the proposed method a promising alternative for unsupervised metric learning.

The remainder of the paper is organized as follows: Sect. 2 describes the classic dimensionality reduction methods PCA, NMF, Kernel PCA, ISOMAP, LLE, Laplacian Eigenmaps and t-SNE. In Sect. 3, we briefly discuss the KL-divergence and its computation in the Gaussian case. In Sect. 4, we describe the proposed PCA-KL method in detail. Section 5 shows the experiments and results. Finally, in Sect. 6 we present the conclusions, final remarks and future directions for research in dimensionality reduction for unsupervised metric learning.

2 Dimensionality reduction for unsupervised metric learning

2.1 Principal component analysis

Principal Component Analysis, or simply PCA, is a computational method that implements the Karhunen–Loève transform, also known as the Hotelling transform (Hotelling 1933), a classical multivariate statistical technique that expands a given random vector \(\mathbf {x} \in R^m\) in the eigenvectors of its covariance matrix (Jolliffe 2002; Shlens 2005). PCA is the most widely known method for data compression and feature extraction. PCA does not make assumptions about probability density functions, since all the information needed by the method can be estimated directly from data (Hastie et al. 2009). Since it depends solely on the covariance matrix, PCA is a second-order statistical method. PCA is optimal in two different ways: (1) it maximizes the variance of the new compact representation Y; (2) it minimizes the mean square error between the original data X and the new compact representation Y. From a statistical point of view, the goal of PCA is to reduce the redundancy between the random variables that compose the random vector \(\mathbf {x} \in R^m\), which is measured by the correlations between them. In this sense, PCA first decorrelates the features and then reduces the dimensionality by finding new features that are linear combinations of the original ones.

2.1.1 PCA by the maximization of the variance

Let \(Z = [T^T, S^T]\) be an orthonormal basis for \(R^m\) in which \(T^T = [\mathbf {w}_1 , \mathbf {w}_2 , \ldots , \mathbf {w}_d]\) denotes the \(d < m\) components that we wish to retain during the dimensionality reduction process and \(S^T = [\mathbf {w}_{d+1} , \mathbf {w}_{d+2} , \ldots , \mathbf {w}_m]\) are the remaining components that should be discarded. In other words, T defines the linear PCA subspace and S defines the linear subspace eliminated by the reduction process (Young and Calvert 1974).

The problem in question can be summarized as: given an input feature space, we want to find d directions \(\mathbf {w}_j\), for \(j=1,2,\ldots , d\), such that the variance of the projected data is maximized. In other words, we want the directions that maximize data scattering. The question is: how to obtain the directions \(\mathbf {w}_j\)? Without loss of generality, we assume that the sample \(X = \left[ \mathbf {x}_1 , \mathbf {x}_2 , \ldots , \mathbf {x}_n \right] \) has zero mean, that is, the data points are centred around the origin.

Note that we can write \(\mathbf {x} \in R^m\) as an expansion in the orthonormal basis Z as:

$$\begin{aligned} \mathbf {x} = \sum _{j=1}^{m} (\mathbf {x}^T \mathbf {w}_j) \mathbf {w}_j = \sum _{j=1}^{m} c_j \mathbf {w}_j \end{aligned}$$
(1)

where \(c_j\) are the coefficients of the expansion.

Thus, the new vector \(\mathbf {y} \in R^d\) can be obtained by the transformation \(\mathbf {y} = T \mathbf {x}\), that is:

$$\begin{aligned} \mathbf {y}^T = \mathbf {x}^T T^T = \sum _{j=1}^{m} c_j \mathbf {w}_j^T [\mathbf {w}_1 , \mathbf {w}_2 , \ldots , \mathbf {w}_d] \end{aligned}$$
(2)

As we have an orthonormal basis, \(\mathbf {w}_i^T \mathbf {w}_j = 1\) for \(i = j\) and \(\mathbf {w}_i^T \mathbf {w}_j = 0\) for \(i \ne j\), leading to:

$$\begin{aligned} \mathbf {y}^T = [c_1 , c_2 , \ldots , c_d] \end{aligned}$$
(3)

In this way, a linear transformation T is sought that maximizes the variance retained in the data, that is, we want to maximize the following functional (Hyvarinen et al. 2001):

$$\begin{aligned} J_1^{PCA}(T) = E[ \left\Vert \mathbf {y} \right\Vert ^2 ] = E[ \mathbf {y}^T \mathbf {y}] = \sum _{j=1}^{d} E[c_j^2] \end{aligned}$$
(4)

Since \(c_j\) is the projection of \(\mathbf {x}\) in \(\mathbf {w}_j\), that is, \(c_j = \mathbf {x}^T \mathbf {w}_j\), we have:

$$\begin{aligned} J_1^{PCA}(T) = \sum _{j=1}^{d} E \left[ \mathbf {w}_j^T \mathbf {x} \mathbf {x}^T \mathbf {w}_j \right] = \sum _{j=1}^{d} \mathbf {w}_j^T E \left[ \mathbf {x} \mathbf {x}^T \right] \mathbf {w}_j = \sum _{j=1}^{d} \mathbf {w}_j^T \varSigma _x \mathbf {w}_j \end{aligned}$$
(5)

where \(\varSigma _x\) denotes the covariance matrix of the data points X.

Hence, we have the following constrained optimization problem:

$$\begin{aligned} \mathop {\arg \max }\limits _{\mathbf {w}_j}~\sum _{j=1}^{d} \mathbf {w}_j^T \varSigma _x \mathbf {w}_j \qquad \text {subject to} \quad \left\Vert \mathbf {w}_j \right\Vert = 1 \quad \text {for} \quad j=1,2,\ldots ,d \end{aligned}$$
(6)

which is solved by Lagrange multipliers. The Lagrangian function is given by:

$$\begin{aligned} J_1^{PCA}(T, \lambda _1, \lambda _2,\ldots , \lambda _d) = \sum _{j=1}^{d} \mathbf {w}_j^T \varSigma _x \mathbf {w}_j - \sum _{j=1}^{d} \lambda _j \left( \mathbf {w}_j^T \mathbf {w}_j - 1 \right) \end{aligned}$$
(7)

Differentiating with respect to \(\mathbf {w}_j\) and setting the result to zero gives us the necessary condition for the optimum:

$$\begin{aligned} \frac{\partial }{\partial \mathbf {w}_j}J_1^{PCA}(T, \lambda _1, \lambda _2,\ldots , \lambda _d) = \varSigma _x \mathbf {w}_j - \lambda _j \mathbf {w}_j = 0 \end{aligned}$$
(8)

which leads to the eigenvector equation:

$$\begin{aligned} \varSigma _x \mathbf {w}_j = \lambda _j \mathbf {w}_j \end{aligned}$$
(9)

Going back to the optimization problem, we can rewrite it as:

$$\begin{aligned} \mathop {\arg \max }\limits _{\mathbf {w}_j}~ \sum _{j=1}^{d} \mathbf {w}_j^T \varSigma _x \mathbf {w}_j = \mathop {\arg \max }\limits _{\mathbf {w}_j}~ \sum _{j=1}^{d} \mathbf {w}_j^T \lambda _j \mathbf {w}_j = \mathop {\arg \max }\limits _{\mathbf {w}_j}~ \sum _{j=1}^{d} \lambda _j \end{aligned}$$
(10)

which means that we should select, to compose the basis of the linear PCA subspace, the d eigenvectors associated with the d largest eigenvalues of the data covariance matrix. Algorithm 1 describes dimensionality reduction by PCA.

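To make the procedure concrete, the following is a minimal NumPy sketch of the steps derived above (centring, eigendecomposition of the covariance matrix and projection onto the d leading eigenvectors); it assumes the samples are the rows of X, and the function and variable names are illustrative rather than part of Algorithm 1.

```python
import numpy as np

def pca(X, d):
    """Project the rows of X (n samples, m features) onto the d leading
    principal components, following the variance-maximization derivation."""
    Xc = X - X.mean(axis=0)                        # zero-mean assumption
    Sigma = np.cov(Xc, rowvar=False)               # m x m covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)       # ascending eigenvalues
    T = eigvecs[:, np.argsort(eigvals)[::-1][:d]]  # d leading eigenvectors
    return Xc @ T, T                               # new representation Y and basis T

# toy usage
Y, T = pca(np.random.default_rng(0).normal(size=(100, 5)), d=2)
print(Y.shape)  # (100, 2)
```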

The great advantage of PCA is that it is a very fast method. It has been verified that the time complexity of PCA is \(O(max(n^2 m, n m^2))\), where n is the number of samples and m is the number of input features (Nguyen and Holmes 2019).

2.2 Non-negative matrix factorization

Non-Negative Matrix Factorization (NMF) is applied to dimensionality reduction of multivariate data by considering as input a set of multivariate m-dimensional data vectors arranged in a matrix \(V_{m \times n}\), where n is the number of examples in the data set. This matrix is then approximately factorized into an \(m \times r\) matrix W and an \(r \times n\) matrix H, that is, \(V \approx WH\). Usually r is chosen to be smaller than n or m, so that W and H are smaller than the original matrix V. This results in a compressed version of the original data matrix (Lee and Seung 1999). The significance of this representation is that \(\mathbf {v} \approx W \mathbf {h}\), that is, each vector is approximated by a linear combination of the columns of W. In this context, H plays the role of the extracted features, that is, each column \(\mathbf {h}_j\) of H represents a vector in the transformed space.

To measure how close V is to WH, a cost function is usually adopted. Two common choices of distance measures are the Euclidean distance and the Kullback–Leibler divergence, given by:

$$\begin{aligned} D_E(A, B)&= \left\Vert A - B \right\Vert ^2 = \sum _{i,j} (A_{i,j} - B_{i,j})^2 \end{aligned}$$
(11)
$$\begin{aligned} D_{KL}(A, B)&= \sum _{i,j} \left[ A_{i,j}~ log \left( \frac{A_{i,j}}{B_{i,j}} \right) - A_{i,j} + B_{i,j} \right] \end{aligned}$$
(12)

Theorem 1

The Euclidean distance \(D_E(V, WH) = \left\Vert V - WH \right\Vert ^2\) is non-increasing for a finite number of steps of the multiplicative rules:

$$\begin{aligned} H_{i,j}^{k+1}&= H_{i,j}^k \frac{ ((W^k)^T V)_{i,j} }{ ((W^k)^T W^k H^k)_{i,j} } \end{aligned}$$
(13)
$$\begin{aligned} W_{i,j}^{k+1}&= W_{i,j}^k \frac{ (V(H^{k+1})^T)_{i,j} }{ (W^k H^{k+1} (H^{k+1})^T)_{i,j} } \end{aligned}$$
(14)

where k is the iteration counter and \(W^0\) and \(H^0\) are initialized with positive random numbers, typically sampled from a [0, 1] uniform distribution.

Theorem 2

The KL-divergence \(D_{KL}(V, WH)\) is non-increasing for a finite number of steps of the multiplicative rules:

$$\begin{aligned} W_{i,j}^{k+1}&= W_{i,j}^k \frac{\displaystyle \sum _r H_{jr}^k \frac{ V_{ir} }{ (W^k H^k)_{ir} } }{ \displaystyle \sum _r H_{jr}^k } \end{aligned}$$
(15)
$$\begin{aligned} H_{i,j}^{k+1}&= H_{i,j}^k \frac{\displaystyle \sum _r W_{ri}^{k+1} \frac{ V_{rj} }{ (W^{k+1} H^k)_{rj} } }{\displaystyle \sum _r W_{ri}^{k+1} } \end{aligned}$$
(16)

where k is the iteration counter and \(W^0\) and \(H^0\) are initialized with positive random numbers, typically sampled from a [0, 1] uniform distribution.

The complete proofs of Theorems 1 and 2 are described in detail in Lee and Seung’s seminal paper (Lee and Seung 1999). The computational complexity of NMF has been shown to be O(nmd) times the number of iterations for convergence, where n is the sample size, m is the number of features and d is the output dimensionality (Lin 2007).
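As an illustration of Theorem 1, a minimal NumPy sketch of the Euclidean multiplicative updates of Eqs. (13) and (14) is given below; the small constant eps in the denominators is a numerical safeguard added here and is not part of the original rules.

```python
import numpy as np

def nmf_euclidean(V, r, n_iter=200, eps=1e-10, seed=0):
    """Factorize a non-negative V (m x n) as W (m x r) H (r x n) using the
    multiplicative rules of Eqs. (13)-(14)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.uniform(size=(m, r))               # positive random initialization
    H = rng.uniform(size=(r, n))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # Eq. (13)
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # Eq. (14)
    return W, H

# toy usage: the reconstruction error decreases along the iterations
V = np.abs(np.random.default_rng(1).normal(size=(20, 30)))
W, H = nmf_euclidean(V, r=5)
print(np.linalg.norm(V - W @ H))
```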

2.2.1 The clustering property

It has been shown that NMF has an intrinsic clustering property, that is, it automatically clusters the columns of the data matrix \(V = [ \mathbf {v}_1, \mathbf {v}_2,\ldots , \mathbf {v}_n ]\), where \(\mathbf {v}_i \in R^m\) (Ding et al. 2005). More precisely, the approximation of V as WH is achieved by minimizing the error function:

$$\begin{aligned} \min _{W, H} ~ || V - WH ||^2 \end{aligned}$$
(17)

subject to \(W \ge 0\) and \(H \ge 0\). By imposing the orthogonality constraint \(H H^T = I\), the NMF minimization problem is mathematically equivalent to the minimization performed by K-means clustering (Ding et al. 2005). When the error function to be minimized is the KL-divergence, NMF is identical to PLSA (probabilistic latent semantic analysis), another popular clustering method (Ding et al. 2008).

2.3 Kernel PCA

Principal Component Analysis only allows linear dimensionality reduction. However, if the data have more complicated structures which are non-linear functions of the original features, standard PCA will fail to capture meaningful information. Fortunately, kernel PCA allows us to generalize standard PCA to non-linear dimensionality reduction (Schölkopf et al. 1999). The first assumption is that the mean of the data after the mapping \(\phi \) to the high-dimensional space is zero, that is:

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^{n} \phi (\mathbf {x}_i) = 0 \end{aligned}$$
(18)

Thus, the \(M \times M\) sample covariance matrix of the mapped data, where M denotes the dimensionality of the feature space induced by \(\phi \), is given by:

$$\begin{aligned} C = \frac{1}{n}\sum _{i=1}^{n} \phi (\mathbf {x}_i) \phi (\mathbf {x}_i)^T \end{aligned}$$
(19)

and the eigenvectors of C are:

$$\begin{aligned} C \mathbf {v}_k = \lambda _k \mathbf {v}_k \quad \text {for} \quad k=1,2,\ldots ,M \end{aligned}$$
(20)

The following result shows that we can write the eigenvectors of the covariance matrix in terms of \(\phi (\mathbf {x}_i)\).

Theorem 3

The eigenvectors of C can be expressed as a linear combination of the features, that is:

$$\begin{aligned} \mathbf {v}_k = \sum _{i=1}^{n} \alpha _{ki} \phi (\mathbf {x}_i) \end{aligned}$$
(21)

Note that from Eqs. (19) and (20), we have:

$$\begin{aligned} C \mathbf {v}_k = \frac{1}{n}\sum _{i=1}^{n} \phi (\mathbf {x}_i) \phi (\mathbf {x}_i)^T \mathbf {v}_k = \lambda _k \mathbf {v}_k \end{aligned}$$
(22)

which implies:

$$\begin{aligned} \mathbf {v}_k = \frac{1}{n \lambda _k} \sum _{i=1}^{n} (\phi (\mathbf {x}_i)^T \mathbf {v}_k) \phi (\mathbf {x}_i) = \sum _{i=1}^{n} \alpha _{ki} \phi (\mathbf {x}_i) \end{aligned}$$
(23)

where \(\alpha _{ki} = \frac{1}{n \lambda _k}\phi (\mathbf {x}_i)^T \mathbf {v}_k\). So, finding the eigenvectors is equivalent to finding the coefficients \(\alpha _{ki}\). By substituting back Eq. (23) into Eq. (22), we have:

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^{n} \phi (\mathbf {x}_i) \phi (\mathbf {x}_i)^T \left( \sum _{j=1}^{n} \alpha _{kj} \phi (\mathbf {x}_j) \right) = \lambda _k \sum _{j=1}^{n} \alpha _{kj} \phi (\mathbf {x}_j) \end{aligned}$$
(24)

Rewriting Eq. (24), we can express it as:

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^{n} \phi (\mathbf {x}_i) \left( \sum _{j=1}^{n} \alpha _{kj} \phi (\mathbf {x}_i)^T \phi (\mathbf {x}_j) \right) = \lambda _k \sum _{j=1}^{n} \alpha _{kj} \phi (\mathbf {x}_j) \end{aligned}$$
(25)

And using the kernel trick, that is, \(K(\mathbf {x}_i, \mathbf {x}_j) = \phi (\mathbf {x}_i)^T \phi (\mathbf {x}_j)\), we have:

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^{n} \phi (\mathbf {x}_i) \left( \sum _{j=1}^{n} \alpha _{kj} K(\mathbf {x}_i, \mathbf {x}_j) \right) = \lambda _k \sum _{j=1}^{n} \alpha _{kj} \phi (\mathbf {x}_j) \end{aligned}$$
(26)

Multiplying both sides by \(\phi (\mathbf {x}_l)^T\) leads to:

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^{n} \phi (\mathbf {x}_l)^T \phi (\mathbf {x}_i) \left( \sum _{j=1}^{n} \alpha _{kj} K(\mathbf {x}_i, \mathbf {x}_j) \right) = \lambda _k \sum _{j=1}^{n} \alpha _{kj} \phi (\mathbf {x}_l)^T \phi (\mathbf {x}_j) \end{aligned}$$
(27)

Using the kernel trick once again, we have:

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^{n} K(\mathbf {x}_l, \mathbf {x}_i) \left( \sum _{j=1}^{n} \alpha _{kj} K(\mathbf {x}_i, \mathbf {x}_j) \right) = \lambda _k \sum _{j=1}^{n} \alpha _{kj} K(\mathbf {x}_l, \mathbf {x}_j) \end{aligned}$$
(28)

Using the matrix vector notation we can express the equation as (Schölkopf et al. 1999):

$$\begin{aligned} K^2 {\alpha }_k = (\lambda _k n) K {\alpha }_k \end{aligned}$$
(29)

where \(K_{i,j} = K(\mathbf {x}_i, \mathbf {x}_j)\) and \({\alpha }_k\) is the n-dimensional column vector of \(\alpha _{ki}\), that is, \({\alpha }_k = [\alpha _{k1}, \alpha _{k2}, \ldots , \alpha _{kn}]^T\). Simplifying Eq. (29), we finally reach:

$$\begin{aligned} K {\alpha }_k = (\lambda _k n) {\alpha }_k \end{aligned}$$
(30)

showing that the \({\alpha }_k\) are the eigenvectors of the kernel matrix. For a new point \(\mathbf {x}\), its projection onto the k-th principal component is given by:

$$\begin{aligned} y_k(\mathbf {x}) = \phi (\mathbf {x})^T \mathbf {v}_k = \sum _{i=1}^{n} \alpha _{ki} \phi (\mathbf {x})^T \phi (\mathbf {x}_i) = \sum _{i=1}^{n} \alpha _{ki} K(\mathbf {x}, \mathbf {x}_i) \end{aligned}$$
(31)

The advantage of employing the kernel trick is that we do not have to compute \(\phi (\mathbf {x}_i)\) explicitly for \(i=1,2,\ldots ,n\). We can directly construct the kernel matrix from the training data. Two widely used non-linear kernels are the polynomial kernel:

$$\begin{aligned} K(\mathbf {x}, \mathbf {y}) = (\mathbf {x}^T \mathbf {y} + c)^d \end{aligned}$$
(32)

where \(c \ge 0\) is a constant, and the Gaussian kernel:

$$\begin{aligned} K(\mathbf {x}, \mathbf {y}) = exp\left( - \frac{\left\Vert \mathbf {x} - \mathbf {y} \right\Vert ^2}{2\sigma ^2} \right) \end{aligned}$$
(33)

with parameter \(\sigma ^2\). In case the mapped data do not have zero mean, we need to centre the data by making:

$$\begin{aligned} \tilde{\phi }(\mathbf {x}_i) = \phi (\mathbf {x}_i) - \frac{1}{n}\sum _{k=1}^{n}\phi (\mathbf {x}_k) \end{aligned}$$
(34)

Hence, the corresponding kernel matrix is given by:

$$\begin{aligned} \tilde{K}(\mathbf {x}_i, \mathbf {x}_j)&= \tilde{\phi }(\mathbf {x}_i)^T \tilde{\phi }(\mathbf {x}_j) = \left( \phi (\mathbf {x}_i) - \frac{1}{n}\sum _{k=1}^{n}\phi (\mathbf {x}_k) \right) ^T \left( \phi (\mathbf {x}_j) - \frac{1}{n}\sum _{k=1}^{n}\phi (\mathbf {x}_k) \right) \nonumber \\&= \phi (\mathbf {x}_i)^T \phi (\mathbf {x}_j) - \frac{1}{n} \sum _{k=1}^{n}\phi (\mathbf {x}_i)^T \phi (\mathbf {x}_k) - \frac{1}{n} \sum _{k=1}^{n}\phi (\mathbf {x}_k)^T \phi (\mathbf {x}_j) \nonumber \\&\qquad + \frac{1}{n^2} \sum _{k=1}^{n}\sum _{l=1}^{n} \phi (\mathbf {x}_k)^T \phi (\mathbf {x}_l) \nonumber \\&= K(\mathbf {x}_i, \mathbf {x}_j) - \frac{1}{n} \sum _{k=1}^{n}K(\mathbf {x}_i, \mathbf {x}_k) - \frac{1}{n} \sum _{k=1}^{n}K(\mathbf {x}_k, \mathbf {x}_j) \nonumber \\&\qquad + \frac{1}{n^2} \sum _{k=1}^{n}\sum _{l=1}^{n} K(\mathbf {x}_k, \mathbf {x}_l) \end{aligned}$$
(35)

In matrix form, we have to replace the kernel matrix K by the Gram matrix \(\tilde{K}\):

$$\begin{aligned} \tilde{K} = K - 1_n K - K 1_n + 1_n K 1_n \end{aligned}$$
(36)

where \(1_n\) is the \(n \times n\) matrix with all elements equal to \(\frac{1}{n}\). In the following, we present an algorithm for dimensionality reduction through kernel PCA.

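A minimal sketch of kernel PCA with the Gaussian kernel of Eq. (33) could look as follows, assuming the samples are the rows of X: it builds the kernel matrix, centres it with Eq. (36), solves the eigenproblem of Eq. (30), rescales the coefficients so that the implicit eigenvectors \(\mathbf {v}_k\) have unit norm, and projects the training points with Eq. (31). The function name and default parameters are illustrative.

```python
import numpy as np

def kernel_pca(X, d, sigma2=1.0):
    """Kernel PCA with the Gaussian kernel of Eq. (33).
    X: n x m data matrix (rows are samples); returns the n x d embedding."""
    n = X.shape[0]
    sq = np.sum(X**2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0)
    K = np.exp(-D2 / (2 * sigma2))                       # kernel matrix
    one_n = np.full((n, n), 1.0 / n)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n   # centring, Eq. (36)
    eigvals, eigvecs = np.linalg.eigh(Kc)                # eigenproblem, Eq. (30)
    idx = np.argsort(eigvals)[::-1][:d]
    alphas, lams = eigvecs[:, idx], eigvals[idx]
    # rescale so that v_k = sum_i alpha_ki phi(x_i) has unit norm:
    # alpha_k^T Kc alpha_k = lams_k ||alpha_k||^2 = 1
    alphas = alphas / np.sqrt(np.maximum(lams, 1e-12))
    return Kc @ alphas                                   # projections, Eq. (31)

Y = kernel_pca(np.random.default_rng(0).normal(size=(50, 3)), d=2)
print(Y.shape)  # (50, 2)
```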

It has been shown that the computational complexity of the original Kernel PCA is \(O(n^3)\), where n denotes the number of samples. However, faster algorithms can perform KPCA in \(O(d n^2)\), such as Fast Iterative Kernel PCA (Günter et al. 2007).

2.4 ISOMAP

ISOMAP was one of the pioneering algorithms in manifold learning for dimensionality reduction. The authors propose an approach that combines the major algorithmic features of PCA and Multidimensional Scaling (MDS) (Cox and Cox 2001; Borg and Groenen 2005), namely computational efficiency, global optimality, and asymptotic convergence guarantees, with the flexibility to learn a broad class of non-linear manifolds (Tenenbaum et al. 2000). The basic idea of the ISOMAP algorithm is first to build a graph by joining the k-nearest neighbors (KNN) in the input space, then to compute the shortest paths between each pair of vertices in the graph and, knowing the approximate geodesic distances between the points, to find a mapping to a Euclidean subspace of \(R^d\) that preserves those distances. The hypothesis of the ISOMAP algorithm is that the shortest paths in the KNN graph are good approximations for the true geodesic distances in the manifold.

The ISOMAP algorithm can be divided in three main steps:

  1. 1.

    From the input data \(\mathbf {x}_1, \mathbf {x}_2, \ldots , \mathbf {x}_n \in R^m\) build an undirected proximity graph using the KNN rule or the \(\epsilon \)-neighborhood rule (von Luxburg 2007);

  2. 2.

    Compute the pairwise distance matrix D using n executions of Dijkstra’s algorithm or one execution of the Floyd–Warshall algorithm (Cormen et al. 2009);

  3. 3.

    Estimate the new coordinates of the points in a Euclidean subspace of \(R^d\) by preserving the distances through the Multidimensional Scaling (MDS) method.

2.4.1 Multidimensional scaling

Basically, the main goal of MDS is, given an \(n \times n\) matrix of pairwise distances, to recover the coordinates of the n points \(\mathbf {x}_r \in R^d\), for \(r=1,2,\ldots ,n\), in a Euclidean subspace, where d, the target dimensionality, is a parameter of the algorithm (Cox and Cox 2001; Borg and Groenen 2005).

We begin by noting that the pairwise distance matrix is given by \(D = \{ d_{rs}^2 \}\), for \(r,s=1,2,\ldots ,n\) where the distance between two arbitrary points \(\mathbf {x}_r\) and \(\mathbf {x}_s\) is:

$$\begin{aligned} d_{rs}^2 = \left\Vert \mathbf {x}_r - \mathbf {x}_s \right\Vert ^2 = (\mathbf {x}_r - \mathbf {x}_s)^T (\mathbf {x}_r - \mathbf {x}_s) \end{aligned}$$
(37)

Let B denote the inner products matrix, that is, \(B = \{ b_{rs} \}\), where \(b_{rs} = \mathbf {x}_r^T \mathbf {x}_s \). To find the embedding, MDS needs the matrix B, not D. First, we need to assume that the data have zero mean; otherwise there would be infinitely many different solutions, since any arbitrary translation of the set of points would preserve the pairwise distances. From Eq. (37), applying the distributive law we have:

$$\begin{aligned} d_{rs}^2 = \mathbf {x}_r^T \mathbf {x}_r + \mathbf {x}_s^T \mathbf {x}_s - 2\mathbf {x}_r^T \mathbf {x}_s \end{aligned}$$
(38)

From the matrix D, we can calculate the mean of an arbitrary column s by:

$$\begin{aligned} \frac{1}{n}\sum _{r=1}^{n} d_{rs}^2 = \frac{1}{n}\sum _{r=1}^{n} \mathbf {x}_r^T \mathbf {x}_r + \mathbf {x}_s^T \mathbf {x}_s \end{aligned}$$
(39)

Similarly, we can compute the mean of an arbitrary row r as:

$$\begin{aligned} \frac{1}{n}\sum _{s=1}^{n} d_{rs}^2 = \mathbf {x}_r^T \mathbf {x}_r + \frac{1}{n}\sum _{s=1}^{n} \mathbf {x}_s^T \mathbf {x}_s \end{aligned}$$
(40)

Finally, we can compute the mean of all elements of D as:

$$\begin{aligned} \frac{1}{n^2}\sum _{r=1}^{n}\sum _{s=1}^{n} d_{rs}^2 = \frac{2}{n}\sum _{r=1}^{n} \mathbf {x}_r^T \mathbf {x}_r \end{aligned}$$
(41)

Note that from Eq. (38), it is possible to define \(b_{rs}\) as:

$$\begin{aligned} b_{rs} = \mathbf {x}_r^T \mathbf {x}_s = -\frac{1}{2}(d_{rs}^2 - \mathbf {x}_r^T \mathbf {x}_r - \mathbf {x}_s^T \mathbf {x}_s) \end{aligned}$$
(42)

Combining Eqs. (39), (40) and (41) we have:

$$\begin{aligned} b_{rs} = -\frac{1}{2}\left( d_{rs}^2 -\frac{1}{n}\sum _{r=1}^{n} d_{rs}^2 - \frac{1}{n}\sum _{s=1}^{n} d_{rs}^2 + \frac{1}{n^2}\sum _{r=1}^{n}\sum _{s=1}^{n} d_{rs}^2 \right) \end{aligned}$$
(43)

Making \(a_{rs} = -\frac{1}{2}d_{rs}^2\) we can write:

$$\begin{aligned} a_{r.} = \frac{1}{n}\sum _{s=1}^{n} a_{rs} \qquad a_{.s} = \frac{1}{n}\sum _{r=1}^{n} a_{rs} \qquad a_{..} = \frac{1}{n^2}\sum _{r=1}^{n}\sum _{s=1}^{n} a_{rs} \end{aligned}$$
(44)

leading to:

$$\begin{aligned} b_{rs} = a_{rs} - a_{r.} - a_{.s} + a_{..} \end{aligned}$$
(45)

which in matrix notation becomes \(B = H A H\), where:

$$\begin{aligned} H = I - \frac{1}{n}\mathbf {1}\mathbf {1}^T \end{aligned}$$
(46)

is the centring matrix. To find the embedding, that is, the coordinates of the points in \(R^d\), we have to perform an eigendecomposition of the matrix B, that is:

$$\begin{aligned} B = V \varLambda V^T \end{aligned}$$
(47)

where \(\varLambda = diag(\lambda _1, \lambda _2,\ldots , \lambda _n)\) is the diagonal matrix with the eigenvalues of B and V is the matrix whose columns are the eigenvectors of B. The coordinates are then obtained by selecting the d eigenvectors associated with the d largest eigenvalues and scaling them by the square roots of the respective eigenvalues, that is, \(Y = V_d \varLambda _d^{1/2}\). Algorithm 3 summarizes the whole process in a sequence of logical and objective steps.

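A compact sketch of the three ISOMAP steps, assuming the rows of X are samples, a symmetrized KNN graph and SciPy’s Dijkstra implementation for the shortest paths, is given below; handling of disconnected graphs is omitted and all names are illustrative.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap(X, d, k=10):
    """X: n x m data matrix (rows are samples); returns the n x d embedding."""
    n = X.shape[0]
    # Step 1: KNN graph weighted by Euclidean distances
    sq = np.sum(X**2, axis=1)
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0))
    G = np.full((n, n), np.inf)
    for i in range(n):
        nbrs = np.argsort(D[i])[1:k + 1]      # skip the point itself
        G[i, nbrs] = D[i, nbrs]
    G = np.minimum(G, G.T)                    # symmetrize the graph
    # Step 2: geodesic distances approximated by shortest paths (Dijkstra)
    DG = shortest_path(G, method='D', directed=False)
    # Step 3: classical MDS on the squared geodesic distances
    A = -0.5 * DG**2
    H = np.eye(n) - np.full((n, n), 1.0 / n)  # centring matrix, Eq. (46)
    B = H @ A @ H                             # B = HAH
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:d]
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0))

Y = isomap(np.random.default_rng(0).normal(size=(80, 3)), d=2, k=8)
print(Y.shape)  # (80, 2)
```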

It has been shown that the overall complexity of the ISOMAP algorithm is given by \(O(n^2(m + log~n))\) (Nguyen and Holmes 2019), where n and m denote the number of samples and number of features, respectively.

2.5 Locally linear embedding

The ISOMAP algorithm is a global method in the sense that, to find the coordinates of a given input vector \(\mathbf {x}_i \in R^m\) in the manifold, it uses information from all the samples through the matrix B. On the other hand, Locally Linear Embedding (LLE), as the name emphasizes, is a local method, that is, the new coordinates of any \(\mathbf {x}_i \in R^m\) depend only on the neighborhood of that point. The main hypothesis behind LLE is that for a sufficiently high density of samples, it is expected that a vector \(\mathbf {x}_i\) and its neighbors define a linear patch, that is, they all belong to a Euclidean subspace (Roweis and Saul 2000).

Basically, the LLE algorithm requires as inputs an \(n \times m\) data matrix X, with rows \(\mathbf {x}_i\), a desired number of dimensions \(d < m\) and an integer \(k > d + 1\) for finding local neighborhoods. The output is an \(n \times d\) matrix Y, with rows \(\mathbf {y}_i\). The LLE algorithm can be divided in three main steps (Roweis and Saul 2000; Saul and Roweis 2003):

  1. 1.

    For each \(\mathbf {x}_i \in R^m\), find its k nearest neighbors;

  2. 2.

    Find the weight matrix W which minimizes the reconstruction error for each data point \(\mathbf {x}_i \in R^m\);

  3. 3.

    Find the coordinates Y which minimize the reconstruction error using the optimum weights;

2.5.1 Least-squares estimation of the weights

The second step of LLE is to reconstruct each data point from its nearest neighbors. The optimal reconstruction weights can be computed in closed form. Without loss of generality, we can express the local reconstruction error at point \(\mathbf {x}_i\) as:

$$\begin{aligned} E(\mathbf {w}) = \left\Vert \sum _{j} w_j(\mathbf {x}_i - \mathbf {x}_j) \right\Vert ^2 = \sum _j \sum _k w_j w_k (\mathbf {x}_i - \mathbf {x}_j)^T(\mathbf {x}_i - \mathbf {x}_k) \end{aligned}$$
(50)

Defining the matrix C as:

$$\begin{aligned} C_{jk} = (\mathbf {x}_i - \mathbf {x}_j)^T (\mathbf {x}_i - \mathbf {x}_k) \end{aligned}$$
(51)

we have the following expression for the local reconstruction error:

$$\begin{aligned} E(\mathbf {w}) = \sum _j \sum _k w_j C_{jk} w_k = \mathbf {w}^T C \mathbf {w} \end{aligned}$$
(52)

Actually, the estimation of the matrix W reduces to n independent constrained least-squares problems: as there are no constraints across the rows of W, we can find the optimal weights for each sample \(\mathbf {x}_i\) separately, which drastically simplifies the computations. Thus, we have n independent constrained optimization problems given by:

$$\begin{aligned} \mathop {\arg \min }\limits _{\mathbf {w}_i}~ \mathbf {w}_i^T C_i \mathbf {w}_i \qquad \text {subject to} \quad \mathbf {1}^T \mathbf {w}_i = 1 \quad \text {for} \quad i=1,2,\ldots ,n \end{aligned}$$
(53)

Using Lagrange multipliers, we write the Lagrangian function as:

$$\begin{aligned} L(\mathbf {w}_i , \lambda ) = \mathbf {w}_i^T C_i \mathbf {w}_i - \lambda (\mathbf {1}^T \mathbf {w}_i - 1) \end{aligned}$$
(54)

Taking the derivatives with relation to \(\mathbf {w}_i\):

$$\begin{aligned} \frac{\partial }{\partial \mathbf {w}_i}L(\mathbf {w}_i , \lambda ) = 2 C_i \mathbf {w}_i - \lambda \mathbf {1} = 0 \end{aligned}$$
(55)

which leads to

$$\begin{aligned} C_i \mathbf {w}_i = \frac{\lambda }{2}\mathbf {1} \end{aligned}$$
(56)

In order to speed up the algorithm, instead of explicitly computing the inverse of the matrix \(C_i\), it is usual to solve the linear system:

$$\begin{aligned} C_i \mathbf {w}_i = \mathbf {1} \end{aligned}$$
(57)

and then normalize the solution to guarantee that \(\sum _j w_i(j) = 1\).
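For a single point, the weight estimation just described can be sketched as follows; the regularization term added to the diagonal of \(C_i\) is a common practical safeguard (e.g., when \(k > m\)) and is not part of the derivation above.

```python
import numpy as np

def lle_weights(xi, neighbors, reg=1e-3):
    """Reconstruction weights for one point: solve C_i w = 1 (Eq. 57) and
    normalize so that the weights sum to one.
    xi: (m,) sample; neighbors: (k, m) array with its k nearest neighbors."""
    Z = xi - neighbors                        # rows are x_i - x_j
    C = Z @ Z.T                               # local Gram matrix, Eq. (51)
    C += reg * np.trace(C) * np.eye(len(C))   # regularization (sketch only)
    w = np.linalg.solve(C, np.ones(len(C)))
    return w / w.sum()                        # enforce sum_j w_j = 1
```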

2.5.2 Finding the coordinates

The key idea behind the third step of the LLE algorithm is to use the optimal reconstruction weights estimated by least-squares as the proper weights on the manifold and solve for the local manifold coordinates. Thus, fixing the weight matrix W, the goal is to solve another quadratic minimization problem to minimize:

$$\begin{aligned} \varPhi (Y) = \sum _{i=1}^{n} \left\Vert \mathbf {y}_i - \sum _j w_{ij} \mathbf {y}_j \right\Vert ^2 \end{aligned}$$
(58)

In order to avoid degeneracy, we have to impose two constraints:

  1. 1.

    The mean of the data in the transformed space is zero, otherwise we would have an infinite number of solutions;

  2. 2.

    The covariance matrix of the transformed data is the identity matrix, that is, there is no correlation between the components of \(\mathbf {y} \in R^d\);

Denoting by Y the \(n \times d\) matrix in which each row \(\mathbf {y}_i\), for \(i=1,2,\ldots ,n\), stores the coordinates of the i-th point in the manifold, and knowing that \(\mathbf {w}_i(j) = 0\) unless \(\mathbf {x}_j\) is one of the neighbors of \(\mathbf {x}_i\), we can write \(\varPhi (Y)\) as:

$$\begin{aligned} \varPhi (Y) = Tr(Y^T (I - W)^T (I - W) Y) \end{aligned}$$
(59)

Defining the \(n \times n\) matrix M as:

$$\begin{aligned} M = (I - W)^T (I - W) \end{aligned}$$
(60)

we get the following optimization problem:

$$\begin{aligned} \mathop {\arg \min }\limits _{Y}~ Tr(Y^T M Y) \qquad \text {subject to} \quad \frac{1}{n}Y^T Y = I \end{aligned}$$
(61)

Thus, the Lagrangian function is given by:

$$\begin{aligned} L(Y, \lambda ) = Tr(Y^T M Y) - \lambda \left( \frac{1}{n}Y^T Y - I \right) \end{aligned}$$
(62)

Differentiating the function and setting the result to zero gives:

$$\begin{aligned} 2MY - 2\frac{\lambda }{n}Y&= 0 \end{aligned}$$
(63)
$$\begin{aligned} MY&= \beta Y \end{aligned}$$
(64)

where \(\beta = \frac{\lambda }{n}\), showing that Y must be composed of eigenvectors of the matrix M. Since we have a minimization problem, we want to select, to compose Y, the d eigenvectors associated with the d smallest eigenvalues. Note that, M being an \(n \times n\) matrix, it has n eigenvalues and n orthogonal eigenvectors. Although the eigenvalues are real and non-negative, the smallest of them is always zero, with the constant eigenvector \(\mathbf {1}\). This bottom eigenvector corresponds to the mean of Y and should be discarded to enforce the constraint that \(\sum _{i=1}^{n} \mathbf {y}_i = 0\) (de Ridder and Duin 2002), so in practice the embedding is given by the eigenvectors associated with the d smallest non-zero eigenvalues. Algorithm 4 shows a summary of the LLE method.

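Given the complete weight matrix W, the embedding step can be sketched as below: form \(M = (I - W)^T (I - W)\), take the eigenvectors associated with the smallest eigenvalues and discard the constant one, as discussed above. The function name is illustrative.

```python
import numpy as np

def lle_embedding(W, d):
    """W: n x n weight matrix with rows summing to one (zeros outside each
    neighborhood).  Returns the n x d embedding Y of Eq. (64)."""
    n = W.shape[0]
    M = (np.eye(n) - W).T @ (np.eye(n) - W)   # Eq. (60)
    eigvals, eigvecs = np.linalg.eigh(M)      # ascending eigenvalues
    # discard the constant eigenvector (eigenvalue ~0) and keep the next d
    return eigvecs[:, 1:d + 1]
```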

It has been shown that the overall complexity of the LLE algorithm is given by \(O((m~log~k) (n~log~n)) + O(m n k^3) + O(d n^2)\) (Roweis and Saul 2000), where n, m, k and d denote the number of samples, number of features, number of neighbors in the KNN graph and the output dimensionality, respectively.

2.6 Laplacian Eigenmaps

Basically, the Laplacian Eigenmaps algorithm requires as inputs an \(n \times m\) data matrix X, with each row \(\mathbf {x}_i\) defining a data point, a desired number of dimensions \(d < m\) and an integer k for finding local neighborhoods. The output is an \(n \times d\) matrix Y, with rows \(\mathbf {y}_i\). The algorithm can be divided in three main steps (Belkin and Niyogi 2003):

  1. 1.

    Construct the neighborhood graph \(G = (V,E)\) by linking nodes \(v_i\) and \(v_j\) if \(\mathbf {x}_i\) and \(\mathbf {x}_j\) are close. The two variants are:

    • \(\epsilon \)-neighborhood: connect \(v_i\) and \(v_j\) by an edge if \(\left\Vert \mathbf {x}_i - \mathbf {x}_j \right\Vert ^2 \le \epsilon \).

    • k-nearest neighbors: connect \(v_i\) and \(v_j\) by an edge if \(v_i\) is among the k-nearest neighbors of \(v_j\) or \(v_j\) is among the k-nearest neighbors of \(v_i\).

  2. 2.

    Choose the weights to define the adjacency matrix W. There are also two variations:

    • Heat kernel (with parameter \(t \in R\)): if nodes \(v_i\) and \(v_j\) are connected, make

      $$\begin{aligned} W_{ij} = exp\left\{ -\frac{\left\Vert \mathbf {x}_i - \mathbf {x}_j \right\Vert ^2}{t} \right\} \end{aligned}$$
      (67)

      otherwise make \(W_{ij} = 0\). The justification for this choice is given by the heat equation.

    • Binary weights: make \(W_{ij} = 1\) if nodes \(v_i\) and \(v_j\) are connected by an edge and \(W_{ij} = 0\) if \(v_i\) and \(v_j\) are not connected by an edge. There is no need to choose t.

  3. 3.

    Embedding: find the coordinates Y by choosing the d eigenvectors associated to the d smallest non-zero eigenvalues of the graph Laplacian L.

2.6.1 Laplacian embedding on \(R^d\)

Consider the generalized problem of embedding the graph \(G=(V,E)\) into a d-dimensional Euclidean space. Now each node \(v_i \in V\) has to be mapped to a point in \(R^d\), that is, we need to estimate d coordinates for each node. We denote the final embedding by an \(n \times d\) matrix \(Y = [\mathbf {y}_1 , \mathbf {y}_2 , \ldots , \mathbf {y}_d]\), where the i-th row, \(\mathbf {y}^{(i)}\), provides the coordinates of \(v_i\) in the manifold. The objective function is generalized to:

$$\begin{aligned} J(Y) = \frac{1}{2}\sum _{i=1}^{n} \sum _{j=1}^{n} W_{ij} \left\Vert \mathbf {y}^{(i)} - \mathbf {y}^{(j)} \right\Vert ^2 \end{aligned}$$
(68)

where \(\mathbf {y}^{(i)} = [\mathbf {y}_1(i), \mathbf {y}_2(i), \ldots , \mathbf {y}_d(i)]\) is the d-dimensional representation of \(v_i\). Note that, considering Y as an \(n \times d\) matrix in which each row represents a \(\mathbf {y}^{(i)}\), for \(i = 1, 2, \ldots , n\), we can rewrite the objective function as:

$$\begin{aligned} J(Y) = \frac{1}{2}\sum _{i=1}^{n} \sum _{j=1}^{n} W_{ij} (\mathbf {y}^{(i)} - \mathbf {y}^{(j)}) (\mathbf {y}^{(i)} - \mathbf {y}^{(j)})^T \end{aligned}$$
(69)

Expanding the expression for J(Y), we can simplify it to:

$$\begin{aligned} J(Y)&= \frac{1}{2} \sum _{i=1}^{n} \sum _{j=1}^{n} \left[ W_{ij} \mathbf {y}^{(i)} \mathbf {y}^{(i)^T} - W_{ij} \mathbf {y}^{(i)} \mathbf {y}^{(j)^T} - W_{ij} \mathbf {y}^{(j)} \mathbf {y}^{(i)^T} + W_{ij} \mathbf {y}^{(j)} \mathbf {y}^{(j)^T}\right] \nonumber \\&= \frac{1}{2} \left[ \sum _{i=1}^{n} d_i \mathbf {y}^{(i)} \mathbf {y}^{(i)^T} - 2 \sum _{i=1}^{n}\sum _{j=1}^{n} W_{ij} \mathbf {y}^{(i)} \mathbf {y}^{(j)^T} + \sum _{j=1}^{n} d_j \mathbf {y}^{(j)} \mathbf {y}^{(j)^T}\right] \nonumber \\&= \frac{1}{2} \left[ 2 \sum _{i=1}^{n} d_i \mathbf {y}^{(i)} \mathbf {y}^{(i)^T} - 2 \sum _{i=1}^{n}\sum _{j=1}^{n} W_{ij} \mathbf {y}^{(i)} \mathbf {y}^{(j)^T}\right] \nonumber \\&= \sum _{i=1}^{n} d_i \mathbf {y}^{(i)} \mathbf {y}^{(i)^T} - \sum _{i=1}^{n}\sum _{j=1}^{n} W_{ij} \mathbf {y}^{(i)} \mathbf {y}^{(j)^T} \end{aligned}$$
(70)

Considering \(Y_{n \times d}\) the matrix of the coordinates for the n points, \(D_{n \times n}\) the diagonal matrix of the degrees \(d_i\) and \(W_{n \times n}\) the adjacency matrix, we can rewrite the equation using a matrix-vector notation as:

$$\begin{aligned} J(Y)&= Tr(D Y Y^T) - Tr(W Y Y^T) \end{aligned}$$
(71)

As the trace is an operator that is invariant under cyclic permutations, we have:

$$\begin{aligned} J(Y)&= Tr(Y^T D Y) - Tr(Y^T W Y) = Tr(Y^T (D Y - W Y)) \nonumber \\&= Tr(Y^T (D - W) Y) = Tr( Y^T L Y ) \end{aligned}$$
(72)

Thus, we have the following constrained optimization problem:

$$\begin{aligned} \mathop {\arg \min }\limits _{Y}~ Tr(Y^T L Y) \qquad \text {subject to} \quad Y^T D Y = I \end{aligned}$$
(73)

whose Lagrangian function is given by:

$$\begin{aligned} L(Y, \lambda ) = Tr(Y^T L Y) - \lambda (Y^T D Y - I) \end{aligned}$$
(74)

Taking the derivative and setting the result to zero leads to:

$$\begin{aligned} \frac{\partial }{\partial Y}L(Y, \lambda ) = 2 L Y - 2 \lambda D Y = 0 \end{aligned}$$
(75)

leading to the following eigenvector problem:

$$\begin{aligned} L Y = \lambda D Y \end{aligned}$$
(76)

This result shows that we should select, to compose the columns of the matrix Y, the d eigenvectors associated with the d smallest non-zero eigenvalues of the normalized Laplacian \(D^{-1}L\). Algorithm 5 shows a summary of the Laplacian Eigenmaps method.

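A minimal sketch of the method, assuming the rows of X are samples, heat-kernel weights (Eq. 67) on a KNN graph and SciPy’s generalized symmetric eigensolver for Eq. (76), is shown below; all names and default parameters are illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmaps(X, d, k=10, t=1.0):
    """X: n x m data matrix (rows are samples); returns the n x d embedding
    given by the generalized eigenvectors of L y = lambda D y (Eq. 76)."""
    n = X.shape[0]
    sq = np.sum(X**2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D2[i])[1:k + 1]
        W[i, nbrs] = np.exp(-D2[i, nbrs] / t)   # heat kernel weights, Eq. (67)
    W = np.maximum(W, W.T)                      # connect if either is a neighbor
    Deg = np.diag(W.sum(axis=1))                # degree matrix
    L = Deg - W                                 # graph Laplacian
    eigvals, eigvecs = eigh(L, Deg)             # ascending generalized eigenvalues
    return eigvecs[:, 1:d + 1]                  # drop the trivial constant eigenvector

Y = laplacian_eigenmaps(np.random.default_rng(0).normal(size=(60, 4)), d=2)
print(Y.shape)  # (60, 2)
```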

It has been shown that the overall complexity of the Laplacian Eigenmaps algorithm is given by \(O((m~log~k) (n~log~n)) + O(m n k^3) + O(d n^2)\) (Belkin and Niyogi 2003), where n, m, k and d denote the number of samples, number of features, number of neighbors in the KNN graph and the output dimensionality, respectively.

2.7 t-Distributed stochastic neighbor embedding

The algorithm t-SNE is based on its predecessor Stochastic Neighbor Embedding, or simply SNE (Hinton and Roweis 2003). Basically, SNE converts Euclidean distances between samples \(\mathbf {x}_i \in R^m\) for \(i=1,2,\ldots ,n\) into conditional probabilities. The similarity between two points \(\mathbf {x}_i\) and \(\mathbf {x}_j\) is the conditional probability \(p_{j|i}\) that \(\mathbf {x}_i\) would choose \(\mathbf {x}_j\) as a neighbor if neighbor selection is done in proportion to a probability density under a Gaussian centered at \(\mathbf {x}_i\) (van der Maaten and Hinton 2008):

$$\begin{aligned} p_{j|i} = \frac{ \text {exp} \left( - \left\Vert \mathbf {x}_i - \mathbf {x}_j \right\Vert ^2 / 2 \sigma _i^2 \right) }{\sum _{k \ne i} \text {exp} \left( - \left\Vert \mathbf {x}_i - \mathbf {x}_k \right\Vert ^2 / 2 \sigma _i^2 \right) } \end{aligned}$$
(80)

where \(\sigma _i^2\) is the variance of the Gaussian centered on \(\mathbf {x}_i\). For \(\mathbf {x}_i\) and \(\mathbf {x}_j\) close enough, \(p_{j|i}\) has a high value, but if they are far apart \(p_{j|i}\) tends to zero. It is possible to compute a similar conditional probability for the low dimensional representation vectors \(\mathbf {y}_i\) and \(\mathbf {y}_j\) in \(R^d\), denoted by \(q_{j|i}\). Fixing the standard deviation of this Gaussian to \(1/\sqrt{2}\) (so that \(2\sigma ^2 = 1\)), the similarity measure is given by:

$$\begin{aligned} q_{j|i} = \frac{ \text {exp} \left( - \left\Vert \mathbf {y}_i - \mathbf {y}_j \right\Vert ^2 \right) }{\displaystyle \sum \nolimits _{k \ne i} \text {exp} \left( - \left\Vert \mathbf {y}_i - \mathbf {y}_k \right\Vert ^2 \right) } \end{aligned}$$
(81)

Note that \(p_{i|i} = q_{i|i} = 0\). The goal of SNE is to find a low dimensional representation that minimizes the mismatch between the two distributions, making them as close as possible. A statistical measure of how close two probability distributions are is the Kullback–Leibler divergence, also known as relative entropy. SNE minimizes the sum of Kullback–Leibler divergences over all samples using gradient descent. Thus, we have to minimize (Hinton and Roweis 2003):

$$\begin{aligned} C = \sum _{i=1}^{n} KL(P_i || Q_i) = \sum _{i=1}^{n} \sum _{j=1}^{n} p_{j|i} \text {log} \frac{p_{j|i}}{q_{j|i}} \end{aligned}$$
(82)

where \(P_i\) represents the conditional probability distribution over all samples given \(\mathbf {x}_i\) and \(Q_i\) represents the conditional probability distribution over all other map points given map point \(\mathbf {y}_i\). The SNE cost function is proposed so that we retain the local structure of the data in the map.

2.7.1 Defining the variance of the Gaussians

It is not suitable to assume that there is a single value of \(\sigma _i^2\) that is optimal for all samples. In dense regions, a smaller value is more appropriate than in sparse regions. The perplexity measure is then defined as (van der Maaten and Hinton 2008):

$$\begin{aligned} \text {Perp}(P_i) = 2^{H(P_i)} \end{aligned}$$
(83)

where \(H(P_i)\) is the Shannon entropy in bits:

$$\begin{aligned} H(P_i) = - \sum _{j=1}^{n} p_{j|i} log_2 p_{j|i} \end{aligned}$$
(84)

SNE searches for a value of \(\sigma _i^2\) that produces a \(P_i\) with a fixed perplexity, defined by the user. The perplexity can be interpreted as a smooth measure of the effective number of neighbors, and typical values are between 5 and 50 (van der Maaten and Hinton 2008). The minimization of the objective function (KL divergence) is performed by a gradient descent approach with momentum to speed up convergence:

$$\begin{aligned} \mathscr {Y}^{(t)} = \mathscr {Y}^{(t-1)} - \eta \frac{\partial C}{\partial \mathscr {Y}} + \alpha (t) \left( \mathscr {Y}^{(t-1)} - \mathscr {Y}^{(t-2)} \right) \end{aligned}$$
(85)

where \(\mathscr {Y}^{(t)}\) denotes the solution at iteration t, \(\eta \) denotes the learning rate and \(\alpha (t)\) represents the momentum at iteration t.
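In practice, each \(\sigma _i\) is usually found by a binary search on \(\beta _i = 1/(2\sigma _i^2)\) so that the perplexity of \(P_i\) matches the user-defined value. The sketch below, for a single point i, assumes the squared distances to all other samples are precomputed; it is only illustrative, not the reference implementation.

```python
import numpy as np

def sigma_for_perplexity(d2_i, target_perp, n_steps=50, tol=1e-5):
    """Binary search on beta = 1/(2 sigma_i^2) so that the perplexity of P_i
    (Eq. 83) matches the user-defined value.
    d2_i: squared distances from x_i to all other samples (self excluded)."""
    beta, beta_lo, beta_hi = 1.0, 0.0, np.inf
    for _ in range(n_steps):
        p = np.exp(-beta * d2_i)
        p /= p.sum()                             # conditional probabilities p_{j|i}
        H = -np.sum(p * np.log2(p + 1e-12))      # Shannon entropy, Eq. (84)
        perp = 2.0 ** H                          # perplexity, Eq. (83)
        if abs(perp - target_perp) < tol:
            break
        if perp > target_perp:                   # too flat: increase beta
            beta_lo = beta
            beta = 2 * beta if np.isinf(beta_hi) else 0.5 * (beta_lo + beta_hi)
        else:                                    # too peaked: decrease beta
            beta_hi = beta
            beta = 0.5 * (beta_lo + beta_hi)
    return np.sqrt(1.0 / (2.0 * beta)), p        # sigma_i and P_i
```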

2.7.2 Computing the gradient in t-SNE

At each stage of the optimization process, we have a set of points \(\mathbf {y}_1, \mathbf {y}_2, \ldots , \mathbf {y}_n\) and the computation of the gradient is required. In the SNE algorithms, the whole process can be summarized as follows (Melville 2015):

  • Create a matrix where the element \(d_{ij}\) represents the Euclidean distance between \(\mathbf {y}_i\) and \(\mathbf {y}_j\);

  • Transform the distances to create \(f_{ij}\) (squaring the distances);

  • Apply a weighting function to define a weight \(w_{ij}\), in a way that the larger the weight, the smaller the distance;

  • Convert the weights into probabilities \(q_{ij} = q_{j|i}\), by normalizing over their sum;

With the probabilities at hand, we can compute the gradient. Basically, the cost function employed by t-SNE differs from the SNE cost function in two aspects:

  1. 1.

    it uses a symmetrized version of the SNE cost function (Cook et al. 2007).

  2. 2.

    it uses a Student t distribution to compute the similarity between two points in the low-dimensional space.

In the symmetric version, we minimize a single KL divergence:

$$\begin{aligned} C = KL(P || Q) = \sum _{i=1}^{n} \sum _{j=1}^{n} p_{ij} \text {log} \frac{p_{ij}}{q_{ij}} \end{aligned}$$
(86)

where \(p_{ii} = q_{ii} = 0\), \(p_{ij} = p_{ji}\) and \(q_{ij} = q_{ji}\), \(\forall i,j\).

Moreover, the pairwise similarities in the high dimensional space are given by:

$$\begin{aligned} p_{ij} = \frac{ \text {exp}\left( - \left\Vert \mathbf {x}_i - \mathbf {x}_j \right\Vert ^2 / 2 \sigma ^2 \right) }{ \sum _{k \ne l} \text {exp} \left( - \left\Vert \mathbf {x}_k - \mathbf {x}_l \right\Vert ^2 / 2 \sigma ^2 \right) } \end{aligned}$$
(87)

where the normalization constant involves all pairs of points.

By using the Student’s t distribution with one degree of freedom, the probabilities \(q_{ij}\) are defined by:

$$\begin{aligned} q_{ij} = \frac{\left( 1 + \left\Vert \mathbf {y}_i - \mathbf {y}_j \right\Vert ^2 \right) ^{-1}}{ \sum _{k \ne l} \left( 1 + \left\Vert \mathbf {y}_k - \mathbf {y}_l \right\Vert ^2 \right) ^{-1} } = \frac{w_{ij}^{-1}}{\sum _{k \ne l} w_{kl}^{-1}} = \frac{w_{ij}^{-1}}{Z} \end{aligned}$$
(88)

and the objective function is:

$$\begin{aligned} C= & {} \sum _{k=1}^{n} \sum _{l=1}^{n} \left( p_{kl} \text {log}~p_{kl} - p_{kl} \text {log}~q_{kl} \right) \nonumber \\= & {} \sum _{k=1}^{n} \sum _{l=1}^{n} \left( p_{kl} \text {log}~p_{kl} - p_{kl} \text {log}~w_{kl}^{-1} + p_{kl} \text {log}~Z \right) \end{aligned}$$
(89)

Differentiating with respect to \(\mathbf {y}_i\) leads to:

$$\begin{aligned} \frac{\partial C}{\partial \mathbf {y}_i} = -\sum _{k=1}^{n} \sum _{l=1}^{n} p_{kl} \frac{\partial }{\partial \mathbf {y}_i} \text {log}~w_{kl}^{-1} + \sum _{k=1}^{n} \sum _{l=1}^{n} p_{kl} \frac{\partial }{\partial \mathbf {y}_i} \text {log}~Z \end{aligned}$$
(90)

For the first term of the gradient, the derivative is non-zero only for the terms with \(k = i\) or \(l = i\). Also, as \(p_{ij} = p_{ji}\) and \(w_{ij} = w_{ji}\):

$$\begin{aligned} -\sum _{k=1}^{n} \sum _{l=1}^{n} p_{kl} \frac{\partial }{\partial \mathbf {y}_i} \text {log}~w_{kl}^{-1}&= -\sum _{j=1}^{n} \left( p_{ij} \frac{\partial }{\partial \mathbf {y}_i} \text {log}~w_{ij}^{-1} + p_{ji} \frac{\partial }{\partial \mathbf {y}_i} \text {log}~w_{ji}^{-1} \right) \nonumber \\&= -2 \sum _{j=1}^{n} p_{ij} \frac{\partial }{\partial \mathbf {y}_i} \text {log}~w_{ij}^{-1} \end{aligned}$$
(91)

The derivative of the inverse of the weight is:

$$\begin{aligned} \frac{\partial w_{ij}^{-1}}{\partial \mathbf {y}_i} = -2 w_{ij}^{-2} \left( \mathbf {y}_i - \mathbf {y}_j \right) \end{aligned}$$
(92)

so the first term of the gradient becomes:

$$\begin{aligned} 4 \sum _{j=1}^{n} p_{ij} w_{ij}^{-1} \left( \mathbf {y}_i - \mathbf {y}_j \right) \end{aligned}$$
(93)

In the differentiation of the second term, note that Z does not depend on k or l and the sum of the probabilities is equal to one:

$$\begin{aligned} \sum _{k=1}^{n} \sum _{l=1}^{n} p_{kl} \frac{\partial }{\partial \mathbf {y}_i} \text {log}~Z&= \frac{\partial }{\partial \mathbf {y}_i} \text {log}~Z \sum _{k=1}^{n} \sum _{l=1}^{n} p_{kl} = \frac{\partial }{\partial \mathbf {y}_i} \text {log}~Z \nonumber \\&= \frac{1}{Z} \frac{\partial }{\partial \mathbf {y}_i} Z = \frac{1}{Z} \sum _{k=1}^{n} \sum _{l=1}^{n} \frac{\partial }{\partial \mathbf {y}_i} w_{kl}^{-1} \end{aligned}$$
(94)

Once again, the derivative is non-zero only for the terms with \(k = i\) or \(l = i\), that is:

$$\begin{aligned} \frac{1}{Z} \sum _{k=1}^{n} \sum _{l=1}^{n} \frac{\partial }{\partial \mathbf {y}_i} w_{kl}^{-1} = \frac{1}{Z} \sum _{j=1}^{n} \left( \frac{\partial }{\partial \mathbf {y}_i} w_{ij}^{-1} + \frac{\partial }{\partial \mathbf {y}_i} w_{ji}^{-1} \right) \end{aligned}$$
(95)

Due to symmetry, \(w_{ij} = w_{ji}\), leading to:

$$\begin{aligned} \frac{1}{Z} \sum _{j=1}^{n} \left( \frac{\partial }{\partial \mathbf {y}_i} w_{ij}^{-1} + \frac{\partial }{\partial \mathbf {y}_i} w_{ji}^{-1} \right) = 2 \sum _{j=1}^{n} \frac{1}{Z} \frac{\partial }{\partial \mathbf {y}_i} w_{ij}^{-1} \end{aligned}$$
(96)

From Eq. (92), we can write:

$$\begin{aligned} 2 \sum _{j=1}^{n} \frac{1}{Z} \frac{\partial }{\partial \mathbf {y}_i} w_{ij}^{-1} = -4 \sum _{j=1}^{n} \frac{w_{ij}^{-1}}{Z} w_{ij}^{-1} \left( \mathbf {y}_i - \mathbf {y}_j \right) = -4 \sum _{j=1}^{n} q_{ij} w_{ij}^{-1} \left( \mathbf {y}_i - \mathbf {y}_j \right) \end{aligned}$$
(97)

Finally, by combining Eqs. (93) and (97), we have an expression for the gradient in the t-SNE iteration:

$$\begin{aligned} \frac{\partial C}{\partial \mathbf {y}_i}&= 4 \sum _{j=1}^{n} \left( p_{ij} - q_{ij} \right) w_{ij}^{-1} \left( \mathbf {y}_i - \mathbf {y}_j \right) \nonumber \\&= 4 \sum _{j=1}^{n} \left( p_{ij} - q_{ij} \right) \left( 1 + \left\Vert \mathbf {y}_i - \mathbf {y}_j \right\Vert ^2 \right) ^{-1} \left( \mathbf {y}_i - \mathbf {y}_j \right) \end{aligned}$$
(98)

Algorithm 6 shows the main steps of t-SNE. The parameters of the algorithm are the input data set \(X = \{ \mathbf {x}_1, \mathbf {x}_2, \ldots , \mathbf {x}_n \}\), the number of iterations T, the perplexity Perp (a measure of the effective number of neighbors), the learning rate \(\eta \) and the momentum \(\alpha (t)\), used in the gradient descent optimization method.

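Given precomputed joint probabilities \(p_{ij}\), one iteration of the gradient descent with momentum (Eqs. 85 and 98) might be sketched as follows; the perplexity calibration of the \(p_{ij}\) and common tricks such as early exaggeration are omitted, and all names are illustrative.

```python
import numpy as np

def tsne_gradient(P, Y):
    """Gradient of the KL cost with respect to the embedding Y (Eq. 98).
    P: n x n symmetric joint probabilities; Y: n x d current embedding."""
    sq = np.sum(Y**2, axis=1)
    W_inv = 1.0 / (1.0 + sq[:, None] + sq[None, :] - 2 * Y @ Y.T)  # (1+||y_i-y_j||^2)^-1
    np.fill_diagonal(W_inv, 0.0)
    Q = W_inv / W_inv.sum()                     # low dimensional similarities, Eq. (88)
    A = (P - Q) * W_inv
    # grad_i = 4 sum_j (p_ij - q_ij)(1+||y_i-y_j||^2)^-1 (y_i - y_j)
    return 4.0 * (np.diag(A.sum(axis=1)) - A) @ Y

def tsne_step(P, Y, Y_prev, eta=100.0, alpha=0.8):
    """One gradient descent iteration with momentum, Eq. (85)."""
    Y_new = Y - eta * tsne_gradient(P, Y) + alpha * (Y - Y_prev)
    return Y_new, Y
```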

It has been shown that the complexity of the t-SNE algorithm is \(O(n^2 m + n^2 h)\) (Nguyen and Holmes 2019), where n, m and h denote the number of samples, the number of features and the number of iterations, respectively. The t-SNE algorithm is considered the state-of-the-art in dimensionality reduction for data visualization. Despite its remarkable performance, t-SNE also has some limitations: (1) SNE and t-SNE approaches do not preserve long-range interactions between data points and generate visualizations in which the arrangement of non-neighboring groups of observations is not informative (Nguyen and Holmes 2019); (2) the high computational cost, especially due to the numerical optimization of its objective function (there is no closed-form solution), makes it prohibitive for larger datasets; (3) since the algorithm is stochastic, several initializations using different seeds can produce distinct embeddings.

3 Kullback–Leibler divergence

In pattern recognition and machine learning, the problem of quantifying the similarity between different objects or clusters is a challenging task, especially in cases where the standard Euclidean distance is not a reasonable choice. Many works on feature selection adopt statistical divergences to choose the set of features that maximize some measure of separation between classes. Part of their success comes from the fact that most dissimilarity measures are related to distance metrics. In this context, entropy and related divergences provide a fruitful mathematical background for metric learning.

We begin by introducing the entropy of a random variable x as the expected value of the self-information:

$$\begin{aligned} H(p) = - \int p(x) [log~p(x)] dx = - E \left[ log~p(x) \right] \end{aligned}$$
(101)

where p(x) is the probability density function (pdf) of x. Assuming x is normally distributed as \(N(\mu , \sigma ^2)\), its entropy is given by:

$$\begin{aligned} H(p)&= \frac{1}{2} log~(2\pi \sigma ^2) + \frac{1}{2\sigma ^2}E[(x - \mu )^2] = \frac{1}{2} \left( 1 + log~(2\pi \sigma ^2) \right) \end{aligned}$$
(102)

In a similar way, we can define the cross-entropy between two probability density functions as:

$$\begin{aligned} H(p, q) = - \int p(x) [log~q(x)] dx \end{aligned}$$
(103)

The Kullback–Leibler divergence, or simply relative entropy, is the difference between the cross-entropy of p(x) and q(x) and the entropy of p(x), that is:

$$\begin{aligned} D_{KL}(p, q) = H(p, q) - H(p)&= - \int p(x) [log~q(x)] dx + \int p(x) [log~p(x)] dx \nonumber \\&= \int p(x) log \left( \frac{p(x)}{q(x)}\right) dx \nonumber \\&= E_{p} \left[ log \left( \frac{p(x)}{q(x)} \right) \right] \end{aligned}$$
(104)

It should be mentioned that the relative entropy is always non-negative, that is, \(D_{KL}(p, q) \ge 0\), being equal to zero if, and only if, \(p(x) = q(x)\). Let p(x) and q(x) be univariate Gaussian densities, \(N(\mu _1, \sigma _1^2)\) and \(N(\mu _2, \sigma _2^2)\). Then, the KL-divergence between them is given by:

$$\begin{aligned} D_{KL}(p, q)&= E_p \left[ - log~\sigma _1 - \frac{1}{2\sigma _1^2}(x - \mu _1)^2 + log~\sigma _2 + \frac{1}{2\sigma _2^2}(x - \mu _2)^2 \right] \nonumber \\&= log \left( \frac{\sigma _2}{\sigma _1} \right) + \frac{1}{2\sigma _2^2} E_p[(x - \mu _2)^2] - \frac{1}{2\sigma _1^2} E_p[(x - \mu _1)^2]\nonumber \\ \end{aligned}$$
(105)

It is straightforward to note that:

$$\begin{aligned} E_p[(x - \mu _1)^2]&= \sigma _1^2 \end{aligned}$$
(106)
$$\begin{aligned} E_p[(x - \mu _2)^2]&= E[x^2] - 2E[x]\mu _2 + \mu _2^2 \end{aligned}$$
(107)
$$\begin{aligned} E[x^2]&= Var[x] + E^2[x] = \sigma _1^2 + \mu _1^2 \end{aligned}$$
(108)

which finally leads to:

$$\begin{aligned} D_{KL}(p, q)&= log \left( \frac{\sigma _2}{\sigma _1} \right) + \frac{1}{2\sigma _2^2}(\sigma _1^2 + \mu _1^2 - 2\mu _1\mu _2 + \mu _2^2) - \frac{1}{2} \nonumber \\&= log \left( \frac{\sigma _2}{\sigma _1} \right) + \frac{\sigma _1^2 + (\mu _1 - \mu _2)^2}{2\sigma _2^2} - \frac{1}{2} \end{aligned}$$
(109)

Note that \(D_{KL}(p, q) \ne D_{KL}(q, p)\), that is, the relative entropy is not symmetric. The symmetrized KL-divergence between p(x) and q(x) is:

$$\begin{aligned} D_{KL}^{sym}(p, q)&= \frac{1}{2} [D_{KL}(p, q) + D_{KL}(q, p)] \nonumber \\&= \frac{1}{4} \left[ \frac{\sigma _1^2 + (\mu _1 - \mu _2)^2}{\sigma _2^2} + \frac{\sigma _2^2 + (\mu _1 - \mu _2)^2}{\sigma _1^2} - 2 \right] \nonumber \\&= \frac{1}{4\sigma _1^2\sigma _2^2} \left[ \left( \sigma _1^2 - \sigma _2^2 \right) ^2 + \left( \mu _1 - \mu _2 \right) ^2 \left( \sigma _1^2 + \sigma _2^2 \right) \right] \end{aligned}$$
(110)
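A direct NumPy transcription of Eqs. (109) and (110), useful as a sanity check of the closed-form expressions (the function names are illustrative):

```python
import numpy as np

def kl_gauss(mu1, var1, mu2, var2):
    """D_KL(p, q) between N(mu1, var1) and N(mu2, var2), Eq. (109)."""
    return 0.5 * np.log(var2 / var1) + (var1 + (mu1 - mu2)**2) / (2.0 * var2) - 0.5

def kl_gauss_sym(mu1, var1, mu2, var2):
    """Symmetrized KL-divergence, Eq. (110)."""
    return ((var1 - var2)**2 + (mu1 - mu2)**2 * (var1 + var2)) / (4.0 * var1 * var2)

# the symmetrized version is the average of the two directed divergences
assert np.isclose(kl_gauss_sym(0.0, 1.0, 1.0, 2.0),
                  0.5 * (kl_gauss(0.0, 1.0, 1.0, 2.0) + kl_gauss(1.0, 2.0, 0.0, 1.0)))
```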

4 PCA-KL: a parametric dimensionality reduction algorithm

One issue found in high-dimensional spaces is related to the weak discrimination power of the Euclidean metric. As dimensionality grows, the contrast provided by the Euclidean distance decreases, i.e., the distribution of norms in a given distribution of points tends to concentrate. This is known as the concentration phenomenon (Lee and Verleysen 2007). In other words, the L2-norm of i.i.d. random vectors grows proportionally to \(\sqrt{m}\), where m is the number of features, as naturally expected, but its variance remains more or less constant for a sufficiently large m. Therefore, high-dimensional random i.i.d. vectors seem to be distributed close to the surface of a hypersphere. In summary, when data is high dimensional, PCA can behave in unexpected ways: upward bias in sample eigenvalues and inconsistency of sample eigenvectors are among the most notable phenomena that appear (Johnstone and Paul 2018). These findings have been exploited to develop new estimation and hypothesis testing methods for the population covariance matrix (Vaswani et al. 2018).

In the last decades, several PCA variations were proposed to mitigate the limitations of the original method. Kernel PCA is a non-linear generalization that corresponds to PCA performed in a reproducing kernel Hilbert space associated with a positive definite kernel (Schölkopf et al. 1999). While PCA is optimal in minimizing the mean squared error, it is still sensitive to outliers in the data. It is a common practice to remove outliers before PCA. However, outliers can be difficult to identify. Robust PCA is a PCA generalization developed to deal with the presence of outliers and random noise in data (Yang and Wang 1999). Another issue with PCA is that every feature is a linear combination of all input variables. Sparse PCA mitigates this issue by finding linear features that contain just a few input variables. It is an extension of PCA for dimensionality reduction that adds sparsity constraints on the input variables (Kunzhe and Huaitie 2018). In multilinear subspace learning, dimensionality reduction is performed on a data tensor. Multilinear PCA extracts features directly from these tensor representations by performing PCA in each mode of the tensor iteratively (Lee and Park 2012).

4.1 Proposed method

As a way to mitigate some limitations of PCA, we propose PCA-KL, a parametric patch-based algorithm for unsupervised metric learning based on the computation of the entropic covariance matrix, a surrogate for the covariance matrix of the data, using the KL-divergence between Gaussian distributions instead of the usual Euclidean distance between the data points.

We denote by \(X = \{ \mathbf {x}_1 , \mathbf {x}_2 , \ldots , \mathbf {x}_n \}\), with \(\mathbf {x}_i \in R^m\), the input data matrix. We can build a KNN graph \(G = (V, E)\), with \(|V| = n\), by connecting each sample \(\mathbf {x}_i\) with its k nearest neighbors. Since each neighborhood can be well approximated by a linear patch, we use the Euclidean distance as the similarity measure in this step. Let a patch \(P_i\) be the set \(\{ \mathbf {x}_i \} \cup \{ \mathbf {x}_j \in N(i) \}\), where N(i) is the neighborhood set of \(\mathbf {x}_i\). Then, we can define the patch matrix \(P_i\) as:

$$\begin{aligned} P_i = \left[ \mathbf {x}_i, \mathbf {x}_{i1}, \mathbf {x}_{i2}, \ldots , \mathbf {x}_{ik} \right] \end{aligned}$$
(111)

to denote the \(m \times (k+1)\) data matrix that composes the i-th patch. In this study, we assume a parametric model \(p(\mathbf {x}; {\theta })\) for each feature of the patch matrix, that is, for each row of \(P_i\), where \({\theta } \in R^L\) is a vector of L parameters. For instance, if we consider a Gaussian model, then \(L = 2\), where \(\theta _1 = \mu \) is the mean and \(\theta _2 = \sigma ^2\) is the variance.
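To make the patch construction concrete, the sketch below builds the KNN graph with scikit-learn and returns, for each sample, the indices that define its patch. It assumes the data are stored with one sample per row (the transpose of the column-wise convention above); the function name is illustrative only:

```python
from sklearn.neighbors import NearestNeighbors

def build_patches(X_rows, k):
    """Return an (n, k+1) index array: row i holds the indices of {x_i} U N(i)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_rows)  # k neighbors plus the point itself
    _, idx = nn.kneighbors(X_rows)
    return idx

# The patch matrix P_i of Eq. (111), as an m x (k+1) array, for a given i:
# P_i = X_rows[build_patches(X_rows, k)[i]].T
```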

Basically, the proposed patch-based method maps each patch \(P_i\) to an m-dimensional vector composed of L-dimensional tuples, where each tuple j contains the maximum likelihood estimates of the model parameters for the j-th feature. As we have m distinct features, we have exactly m tuples. Figure 1 illustrates the mapping from a patch \(P_i\) to a parametric feature vector \(\mathbf {p}_i\).

Fig. 1 Mapping from a patch \(P_i\) on a graph to a parametric feature vector \(\mathbf {p}_i\)

The parametric feature vector \(\mathbf {p}_i\) for the patch \(P_i\) can be expressed as:

$$\begin{aligned} \mathbf {p}_i = \left[ {\theta }_1^{(i)}, {\theta }_2^{(i)}, \ldots , {\theta }_m^{(i)} \right] \end{aligned}$$
(112)

where each component is a tuple of L parameters:

$$\begin{aligned} {\theta }_j^{(i)} = \left( \theta _{j1}^{(i)}, \theta _{j2}^{(i)}, \ldots , \theta _{jL}^{(i)} \right) \end{aligned}$$
(113)

The set of all \(\mathbf {p}_i\), for \(i = 1, 2, \ldots , n\) defines our entropic feature space. We associate to this feature space a centroid given by the sample average of all \(\mathbf {p}_i\)’s:

$$\begin{aligned} \tilde{\mathbf {p}} = \frac{1}{n}\sum _{i=1}^{n} \mathbf {p}_i \end{aligned}$$
(114)

We propose to define the entropic difference between two vectors \(\mathbf {p}_i\) and \(\mathbf {p}_j\) in the parametric feature space as the symmetrized KL-divergence between each one of the tuples, that is:

$$\begin{aligned} \mathbf {p}_i - \mathbf {p}_j = \left[ d_{KL}({\theta }_1^{(i)}, {\theta }_1^{(j)}) , \ldots , d_{KL}({\theta }_m^{(i)}, {\theta }_m^{(j)}) \right] = \mathbf {d}_{KL}\left( \mathbf {p}_i , \mathbf {p}_j \right) \end{aligned}$$
(115)

For univariate Gaussian models, a closed-form expression for the symmetrized KL-divergence is given by Eq. (110). Given the above, we can define the entropic covariance matrix C as:

$$\begin{aligned} C = E \left[ \mathbf {d}_{KL}(\mathbf {p}_i, \tilde{\mathbf {p}})~\mathbf {d}_{KL}(\mathbf {p}_i, \tilde{\mathbf {p}})^T \right] \end{aligned}$$
(116)

Note that both the regular covariance matrix (used in PCA) and the kernel matrix (used in KPCA) are defined in terms of individual samples, whereas the entropic covariance matrix is defined in terms of patches, making it less sensitive to outliers. Besides, the proposed entropic PCA has the advantage of producing an explicit projection matrix, exactly like regular PCA. This means that new instances that do not belong to the training set can be mapped to their reduced representation in a straightforward way. This is an advantage of PCA-KL over manifold learning algorithms such as ISOMAP (Tenenbaum et al. 2000), LLE (Roweis and Saul 2000), Laplacian Eigenmaps (Belkin and Niyogi 2003) and t-SNE (van der Maaten and Hinton 2008).
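For instance, once a projection matrix W (assumed here to be \(m \times d\), produced by a previous PCA-KL fit) is available, an unseen sample is reduced with a single matrix-vector product, as in this illustrative sketch:

```python
import numpy as np

def project_new_sample(x_new, W):
    """Map a new m-dimensional sample to the d-dimensional PCA-KL subspace."""
    return W.T @ x_new   # shape (d,)
```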

In the following, we present the PCA-KL algorithm for unsupervised metric learning under Gaussian hypothesis. The input for PCA-KL is a data matrix \(X_{m \times n}\), in which each column \(\mathbf {x}_j \in R^m\) is a sample. Basically, the main steps of PCA-KL can be summarized as:

  1.

    From the input data \(\mathbf {x}_1, \mathbf {x}_2, \ldots , \mathbf {x}_n \in R^m\) build an undirected proximity graph using the KNN rule;

  2.

    For each patch \(\{\mathbf {x}_i\} \cup \eta _i\), where \(\eta _i\) denotes the local neighborhood around \(\mathbf {x}_i\), compute the mean and variance of each feature:

    $$\begin{aligned} \mu _k&= \frac{1}{|\eta _i|} \sum _{j \in \eta _i} x_j(k) \end{aligned}$$
    (117)
    $$\begin{aligned} \sigma _k^2&= \frac{1}{|\eta _i| - 1} \sum _{j \in \eta _i} ( x_j(k) - \mu _k )^2 \end{aligned}$$
    (118)

    for \(k = 1,2,\ldots ,m\). For simplicity, we assume a Gaussian model, but other distributions could be adopted at this stage. At the end, this step generates, for each patch, the following parametric vector:

    $$\begin{aligned} \mathbf {p}_i = \left[ (\mu _1, \sigma _1^2), (\mu _2, \sigma _2^2), \ldots , (\mu _m, \sigma _m^2) \right] \end{aligned}$$
    (119)
  3.

    Compute the mean parametric vector over all patches, \(\tilde{\mathbf {p}}\), which represents the average distribution over the entire dataset:

    $$\begin{aligned} \tilde{\mathbf {p}} = \left[ (\tilde{\mu }_1, \tilde{\sigma }_1^2), (\tilde{\mu }_2, \tilde{\sigma }_2^2), \ldots , (\tilde{\mu }_m, \tilde{\sigma }_m^2) \right] \end{aligned}$$
    (120)
  4.

    Compute the entropic covariance matrix C, a surrogate for the usual covariance matrix based on the KL-divergence between each parametric vector \(\mathbf {p}_i\) and the average distribution \(\tilde{\mathbf {p}}\) as:

    $$\begin{aligned} C = \frac{1}{n-1} \sum _{i=1}^n \mathbf {d}_{KL}(\mathbf {p}_i, \tilde{\mathbf {p}})~\mathbf {d}_{KL}(\mathbf {p}_i, \tilde{\mathbf {p}})^T \end{aligned}$$
    (121)

    where \(\mathbf {d}_{KL}(\mathbf {p}_i, \tilde{\mathbf {p}})\) is a column vector of KL-divergences:

    $$\begin{aligned} \mathbf {d}_{KL}(\mathbf {p}_i, \tilde{\mathbf {p}}) = \left[ D_{KL}^{sym}\left( {\theta }_1, \tilde{{\theta }}_1 \right) , \ldots , D_{KL}^{sym}\left( {\theta }_m, \tilde{{\theta }}_m \right) \right] ^T \end{aligned}$$
    (122)

    with \({\theta }_i = (\mu _i, \sigma _i^2)\) and \(\tilde{{\theta }}_i = (\tilde{\mu }_i, \tilde{\sigma }_i^2)\).

  5.

    Select the \(d < m\) eigenvectors associated with the d largest eigenvalues of the matrix C to compose the projection matrix \(W_{PCAKL}\).

  6.

    Project the data into the subspace spanned by \(W_{PCAKL}\).

The entropic covariance matrix C is real, symmetric and positive semidefinite, so all its eigenvalues are non-negative, which is a desirable property of the proposed method. Moreover, the method is fully unsupervised since it does not make any assumptions regarding the class labels.
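To make the six steps above concrete, the following Python sketch implements them under the Gaussian hypothesis using NumPy and scikit-learn. It assumes one sample per row (the transpose of the \(X_{m \times n}\) convention above), and the small variance floor is a numerical safeguard of ours, not part of the method description:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def pca_kl(X_rows, d, k=20):
    """Sketch of PCA-KL. X_rows: (n, m) data, d: target dimension, k: patch size."""
    n, m = X_rows.shape

    # Step 1: KNN graph -- each patch is the point plus its k nearest neighbors.
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X_rows).kneighbors(X_rows)

    # Step 2: per-patch, per-feature means and variances (Eqs. 117-119).
    mus = np.array([X_rows[idx[i]].mean(axis=0) for i in range(n)])               # (n, m)
    variances = np.array([X_rows[idx[i]].var(axis=0, ddof=1) for i in range(n)])  # (n, m)
    variances = np.maximum(variances, 1e-12)   # floor to avoid division by zero (our addition)

    # Step 3: mean parametric vector over all patches (Eq. 120).
    mu_t, var_t = mus.mean(axis=0), variances.mean(axis=0)

    # Step 4: symmetrized KL-divergences (Eqs. 110 and 122) and the
    # entropic covariance matrix (Eq. 121).
    d_kl = (((variances - var_t) ** 2 + (mus - mu_t) ** 2 * (variances + var_t))
            / (4.0 * variances * var_t))                                          # (n, m)
    C = (d_kl.T @ d_kl) / (n - 1)

    # Steps 5-6: eigenvectors of C for the d largest eigenvalues give W.
    eigvals, eigvecs = np.linalg.eigh(C)                 # ascending eigenvalues
    W = eigvecs[:, np.argsort(eigvals)[::-1][:d]]        # (m, d) projection matrix
    return X_rows @ W, W
```

Since C is symmetric, `np.linalg.eigh` can be used; its eigenvalues come in ascending order, hence the reordering before keeping the top d columns. For example, `X_2d, W = pca_kl(X_rows, d=2, k=30)` (with an illustrative value of k) yields a two-dimensional embedding of the kind evaluated in the experiments below.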

5 Experiments and results

In order to test and evaluate the proposed method for unsupervised dimensionality reduction in classification tasks, we compared its performance against the usual PCA, NMF, Kernel PCA, ISOMAP, LLE, Laplacian Eigenmaps and t-SNE on several public datasets available at www.openml.org. It is worth mentioning that the selected datasets have significant variations in the numbers of samples and features, as well as different numbers of classes.

In the first set of experiments, we used an internal index to assess the quality of the clusters obtained after the unsupervised metric learning provided by the dimensionality reduction methods. We chose the Silhouette coefficient, a method for interpreting and validating the consistency within clusters of data (Rousseeuw 1987). Let \(C_i\) denote the cluster to which data point i is assigned. For each data point \(i \in C_i\), let a(i) be the mean distance between i and all other data points in the same cluster \(C_i\):

$$\begin{aligned} a(i) = \frac{1}{|C_i| - 1} \sum _{j\in C_i, j\ne i} d(i, j) \end{aligned}$$
(123)

where d(i, j) is the distance between data points i and j in the cluster \(C_i\). In other words, a(i) measures how well the data point i is assigned to its cluster (the smaller the value, the better). Then, we define the mean dissimilarity of a data point i to a cluster C as the mean of the distances from i to all points in C. For each data point i, let b(i) be the smallest mean distance from i to the points of any cluster of which i is not a member:

$$\begin{aligned} b(i) = \min _{k \ne i} \frac{1}{|C_k|}\sum _{j \in C_k} d(i, j) \end{aligned}$$
(124)

The cluster with the smallest mean dissimilarity is the neighboring cluster of i because it is the next best fit cluster for point i. Let:

$$\begin{aligned} s(i) = \frac{b(i) - a(i)}{max\{ a(i), b(i) \}}, \text { if } |C_i| > 1 \end{aligned}$$
(125)

be the silhouette value of the data point i and

$$\begin{aligned} s(i) = 0, \text { if } |C_i| = 1 \end{aligned}$$
(126)

Combining both definitions we have:

$$\begin{aligned} s(i) = {\left\{ \begin{array}{ll} 1 - \frac{\displaystyle a(i)}{\displaystyle b(i)} &{} \quad \text {if } a(i) < b(i) \\ 0 &{} \quad \text {if } a(i) = b(i) \\ \frac{\displaystyle b(i)}{\displaystyle a(i)} - 1 &{} \quad \text {if } a(i) > b(i) \end{array}\right. } \end{aligned}$$
(127)

Note that \(-1 \le s(i) \le 1\). A value of s(i) close to one means that the data point is appropriately clustered. If s(i) is close to negative one, then i would be better assigned to its neighboring cluster. An s(i) near zero means that the data point i lies on the border between two natural clusters. The mean s(i) over all points of a cluster measures how tightly grouped the points of that cluster are. Therefore, the mean s(i) over all data points of the entire dataset, known as the Silhouette coefficient, measures how appropriately the data have been clustered.
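In practice, the Silhouette coefficient (and the individual s(i) values) can be computed directly with scikit-learn; in the snippet below, `X_2d` and `y` stand for a reduced two-dimensional representation and its cluster (class) labels and are assumed to exist:

```python
from sklearn.metrics import silhouette_score, silhouette_samples

coefficient = silhouette_score(X_2d, y)   # mean s(i) over the whole dataset
s_values = silhouette_samples(X_2d, y)    # per-point silhouette values s(i)
```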

Table 1 Silhouette coefficients for clusters produced by PCA, PCA-KL, NMF, Kernel PCA, ISOMAP, LLE, Laplacian Eigenmaps and t-SNE for several datasets from OpenML.org (2D case)

A few datasets have a prevalence of negative feature values, so NMF is not applicable to them. Table 1 summarizes the results. The results suggest that PCA-KL is more effective than PCA in building a meaningful representation in terms of the consistency within clusters of data. Note that in 22 of 44 datasets, PCA-KL obtained the highest Silhouette coefficient, that is, in 50% of the cases the proposed method produced better-defined clusters than the other linear and non-linear methods. Another important aspect is that PCA-KL, on average, outperformed not only PCA and NMF (linear methods) but also all manifold learning algorithms, which indicates that the proposed method can be a promising alternative to dimensionality reduction for unsupervised metric learning. Moreover, according to a Wilcoxon signed-rank test, PCA-KL produced significantly better clusters (in terms of the Silhouette coefficient) than PCA (p value = \(1.12 \times 10^{-8}\)), NMF (p value = \(4.60 \times 10^{-5}\)), Kernel PCA (p value = \(1.81 \times 10^{-6}\)), ISOMAP (p value = \(5.54 \times 10^{-7}\)), LLE (p value = \(9.04 \times 10^{-8}\)), Laplacian Eigenmaps (p value = \(1.42 \times 10^{-4}\)) and t-SNE (p value = \(9.02 \times 10^{-5}\)) for a significance level \(\alpha = 1\%\).
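The paired comparison reported above can be reproduced with SciPy's Wilcoxon signed-rank test; the two arrays below are hypothetical vectors of per-dataset Silhouette coefficients, one entry per dataset:

```python
from scipy.stats import wilcoxon

# silhouette_pca_kl and silhouette_pca: hypothetical paired per-dataset scores
stat, p_value = wilcoxon(silhouette_pca_kl, silhouette_pca)
significant = p_value < 0.01   # significance level of 1%
```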

To illustrate how the proposed method is capable of producing better-defined clusters, we present some scatter plots for the two-dimensional case, comparing PCA and PCA-KL. Figures 2 and 3 show the clusters for the iris and cardiotocography datasets. Note that the clusters produced by PCA-KL have a lower intra-class scatter, that is, they tend to be more concentrated around their means. Moreover, it is possible to notice that the variance along the principal components is smaller for PCA-KL.

Fig. 2 Scatterplots of the iris dataset for the 2D case: PCA versus PCA-KL

Fig. 3 Scatterplots of the cardiotocography dataset for the 2D case: PCA versus PCA-KL

In the second set of experiments, we compared the performance of PCA-KL against PCA, NMF, Kernel PCA, ISOMAP, LLE, Laplacian Eigenmaps and t-SNE in supervised classification. For this purpose, eight different parametric and non-parametric classifiers were selected: K-Nearest Neighbors (k = 7), Support Vector Machine (linear), Naive Bayes and Quadratic Discriminant Analysis under Gaussian hypothesis, Decision Tree, Multilayer Perceptron, Gaussian Process Classifier and Random Forest Classifier. In all experiments, we selected 60% of the samples for training and 40% for testing. Tables 2, 3 and 4 show the classification accuracies for several datasets after dimensionality reduction. The best result in each row is boldfaced and the second best is underlined.
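A sketch of this classification protocol in scikit-learn is shown below. Only the settings stated in the text (k = 7 for KNN, a linear kernel for the SVM, and the 60/40 split) are taken from the paper; everything else uses library defaults, and `X_2d` and `y` are assumed to hold the reduced data and labels:

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.ensemble import RandomForestClassifier

# The eight classifiers used in the comparison (defaults assumed otherwise).
classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=7),
    "SVM (linear)": SVC(kernel="linear"),
    "Naive Bayes": GaussianNB(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "Decision Tree": DecisionTreeClassifier(),
    "MLP": MLPClassifier(),
    "Gaussian Process": GaussianProcessClassifier(),
    "Random Forest": RandomForestClassifier(),
}

# 60% training / 40% testing, as in the experiments.
X_tr, X_te, y_tr, y_te = train_test_split(X_2d, y, train_size=0.6)
accuracies = {name: clf.fit(X_tr, y_tr).score(X_te, y_te)
              for name, clf in classifiers.items()}
```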

At first glance, it is difficult to evaluate the results, since no method is uniformly superior to all the others. However, looking at the average accuracy, the results are more conclusive. Table 5 shows the average and standard deviation of all accuracies for each feature extraction algorithm. The results indicate that, for these datasets, PCA-KL outperformed all linear and non-linear dimensionality reduction methods on average. We also performed a hypothesis test to check whether the differences are statistically significant. According to a Wilcoxon signed-rank test, PCA-KL produces significantly higher classification accuracies than PCA (p value = \(2.40 \times 10^{-16}\)), NMF (p value = \(1.61 \times 10^{-12}\)), Kernel PCA (p value = \(6.32 \times 10^{-14}\)), ISOMAP (p value = \(4.10 \times 10^{-11}\)), LLE (p value = \(9.31 \times 10^{-14}\)) and Laplacian Eigenmaps (p value = \(2.73 \times 10^{-13}\)) for \(\alpha = 0.01\), that is, a significance level of 1%. For the same value of \(\alpha \), we cannot conclude that there is a significant difference between PCA-KL and t-SNE (p value = 0.371).

Table 2 Supervised classification accuracy obtained by different classifiers after PCA, PCA-KL, NMF, Kernel PCA, ISOMAP, LLE, Laplacian Eigenmaps and t-SNE dimensionality reduction for datasets iris, mfeat-fourier, cardiotocography, car(3) and haberman from OpenML.org (2D case)
Table 3 Supervised classification accuracy obtained by different classifiers after PCA, PCA-KL, NMF, Kernel PCA, ISOMAP, LLE, Laplacian Eigenmaps and t-SNE dimensionality reduction for datasets sonar, tae, hayes-roth, newton_hema and servo from OpenML.org (2D case)
Table 4 Supervised classification accuracy obtained by different classifiers after PCA, PCA-KL, NMF, Kernel PCA, ISOMAP, LLE, Laplacian Eigenmaps and t-SNE dimensionality reduction for datasets disclosure_z, plasma_retinol, diggle_table_a2 and rmftsa_ladata from OpenML.org (2D case)
Table 5 Average accuracies obtained by different classifiers after PCA, PCA-KL, NMF, Kernel PCA, ISOMAP, LLE, Laplacian Eigenmaps and t-SNE dimensionality reduction for the OpenML.org datasets in Tables 2, 3 and 4 (2D case)

The obtained results emphasize that the proposed PCA-KL method is competitive with existing dimensionality reduction algorithms, since, overall, it is capable of producing features that are more discriminant than those generated by PCA, NMF and some manifold learning algorithms. In other words, PCA-KL is a viable option for dimensionality reduction and unsupervised metric learning in classification problems. One important aspect of the proposed method is the choice of the number of neighbors (K) in the KNN graph. In some datasets, PCA-KL can produce significantly different results when this parameter changes, showing that the method is quite sensitive to it. One should keep in mind the tradeoff between locality preservation and precision of the parameter estimation when defining the best value of K. In our experiments, we observed that for all datasets the chosen values of K belonged to the interval [20, n/5], where n is the number of samples. In all experiments, the fine-tuning of this parameter was performed by a line search in which the lower bound was set to 20, the increment to 10 and the upper bound to n/2.
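Such a line search over K could look like the sketch below. Since the text does not state which criterion guided the selection, the Silhouette coefficient is used here as a plausible stand-in; `pca_kl` refers to the earlier sketch, and `X_rows` and `y` are assumed to exist:

```python
import numpy as np
from sklearn.metrics import silhouette_score

best_k, best_score = None, -np.inf
for k in range(20, len(X_rows) // 2 + 1, 10):     # lower bound 20, step 10, upper bound n/2
    X_2d, _ = pca_kl(X_rows, d=2, k=k)
    score = silhouette_score(X_2d, y)             # selection criterion (our assumption)
    if score > best_score:
        best_k, best_score = k, score
```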

Finally, a comparison of computational times for each dimensionality reduction method was performed to investigate how efficient the different algorithms are in practice. Our hardware platform consists of a 64-bit Core i7-4510 2.0 GHz with 8 GB of RAM. The software platform comprises Linux Ubuntu with the Anaconda Python distribution. Table 6 shows the elapsed times in seconds for each dimensionality reduction algorithm considering several public datasets. Note that, as expected, PCA is the fastest method by a large margin. The results also show that t-SNE is, on average, about 175 times slower than PCA-KL, and ISOMAP is about 4 times slower than the proposed method.
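For reference, wall-clock times of the kind reported in Table 6 can be measured with a simple pattern like the one below; the call to `pca_kl` is illustrative, and the same protocol would be repeated for each method and dataset:

```python
import time

start = time.perf_counter()
X_2d, _ = pca_kl(X_rows, d=2, k=30)
elapsed = time.perf_counter() - start   # elapsed time in seconds
```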

Table 6 Computational time in seconds for each dimensionality reduction algorithm for several public OpenML.org datasets

6 Conclusions

Dimensionality reduction is a fundamental step in the analysis of multivariate data in many pattern recognition and machine learning applications. From simple visualization to feature extraction, these methods play an important role in classification problems. Besides, it has been shown that learning from high dimensional data can be a challenging task.

Among all dimensionality reduction methods, PCA is still the de facto standard algorithm and the first choice of many machine learning researchers. PCA is based on finding the orthogonal directions that maximize the variance of the data. This is optimal from a data compression point of view, since it has been shown to be equivalent to minimizing the mean square error between the original data and its reconstruction from the reduced representation. For this reason, after the PCA transformation, data tend to be organized in clusters with large scattering, which is undesirable for classification problems. To overcome this limitation, in this paper we presented PCA-KL, a parametric method based on an information-theoretic measure to find a surrogate for the covariance matrix, replacing the standard Euclidean distance in the feature space by the relative entropy between Gaussian distributions estimated in each local neighborhood (patch). Results with real datasets showed that besides improving the quality of the produced clusters, which is a desirable feature in unsupervised classification, PCA-KL can also improve the classification accuracy of several supervised classifiers from the pattern recognition literature, indicating that it is a promising alternative for dimensionality reduction and unsupervised metric learning.

Briefly speaking, the main advantages of PCA-KL are: (1) the evaluation of new instances is straightforward, since once the projection matrix is built, the mapping is direct, unlike most manifold learning algorithms (ISOMAP, LLE, Laplacian Eigenmaps and t-SNE); (2) PCA-KL is a patch-based method, so it can be less sensitive to the presence of noise and outliers in the data; (3) the method can be easily extended to different statistical models and divergences. However, PCA-KL has some limitations, the major one being its sensitivity to the patch size K. Experiments have shown that variations in this parameter can lead to significantly different classification results. We still do not have a complete solution for the estimation of this parameter.

Future work may include the use of other information-theoretic measures, such as the Bhattacharyya and Hellinger distances. Furthermore, we also intend to employ distances based on the Fisher information metric. A supervised version of PCA-KL, considering only neighbors that belong to the same class as the central data point, is another possible improvement. The incorporation of different kernels to obtain a non-linear version of the method is another possible extension. Different statistical models can be considered instead of the Gaussian distribution; for instance, Gaussian–Markov random field models can be used to capture the spatial dependence between the data points through an additional coupling parameter, known as the inverse temperature. Another relevant problem to be tackled in the future is the adaptive definition of the appropriate patch size. Local analysis of the Hessian matrix can provide insights into how to adjust this parameter: points exhibiting high curvature suggest that smaller neighborhoods are preferable, whereas points with lower curvature indicate that larger neighborhoods can be considered. A limitation of this approach would be an increase in the computational cost. Finally, we intend to study how information-theoretic measures can be applied in manifold learning algorithms as a way to improve unsupervised metric learning.