
1 Introduction

Dimensionality reduction (DR) aims to embed the relevant information of high-dimensional data into a lower-dimensional representation, and is of great use in data-related areas such as big data, pattern recognition, clustering, and data visualization. Many DR techniques have been extensively studied, ranging from distance-preservation criteria, e.g. classical approaches such as classical multidimensional scaling (CMDS) [1], to graph-based approaches such as Laplacian eigenmaps (LE) [2]. Given such a wide range of techniques developed under different frameworks and settings, generalized DR approaches are of great interest, as they enable contrasting and enhancing those techniques in a fair, proper manner.

In this work, a generalized spectral dimensionality reduction (GSDR) approach capable of representing spectral DR techniques and exploiting their representation ability is introduced. To do so, the here-studied GSDR obtains a new space representation through an initial nonlinear transformation based on kernel representations. Then, a feature extraction process based on principal component analysis (PCA) is applied in that new space. Experimental results show that, in terms of the scaled version of the average agreement rate, GSDR is able to outperform the conventional implementation of well-known spectral DR techniques, namely classical multidimensional scaling and Laplacian eigenmaps. Additionally, aimed at better understanding the effect of preserving the data structure at local and global levels simultaneously, theoretical developments and relevant insights are provided.

The remainder of this manuscript is structured as follows: Sect. 2 states the notation used throughout this work and presents a brief overview of kernels. In Sect. 3, we introduce the GSDR method; both the nonlinear mapping and the feature extraction are explained in theoretical and computational terms. Section 4 describes the setup and parameter settings of the experiments. Section 5 gathers and discusses the experimental results. Finally, conclusions and final remarks are drawn in Sect. 6.

2 Background on Kernel Functions and Notation

2.1 Notation

Let us define the input data matrix as \(\mathbf{{X}} \in \mathbbm {R}^{N \times D}\), holding N samples represented by D variables, in the form: \(\mathbf{{X}} = (\mathbf{{x}}_1^\top , \ldots , \mathbf{{x}}_N^\top )^\top \), with \(\mathbf{{x}}_i \in \mathbbm {R}^D\) and \(i \in \{1, \ldots , N\}\). Likewise, let \(\mathbf{{Y}} \in \mathbbm {R}^{N \times d}\) be the output data matrix, such that \(\mathbf{{Y}}= (\mathbf{{y}}_1^\top , \ldots , \mathbf{{y}}_N^\top )^\top \), with \(\mathbf{{y}}_i \in \mathbbm {R}^{d}\) and \(d \le D\). In terms of feature extraction, the matrix \(\mathbf{{Y}}\) is the embedded (also extracted, projected, or mapped) space. In this vein, \(d<D\) is traditionally set for DR purposes. That said, the aim of DR is to embed the space \(\mathbf{{X}}\) into a lower-dimensional space \(\mathbf{{Y}}\).

2.2 Concept of Kernel Function

Roughly speaking, the so-named kernel function can be understood as an approach that allows for estimating the similarity among input data samples [3]. In general, such a similarity is calculated over samples from either independent or associated spaces [4, 5]. In this work, the concept of kernel refers to the pairwise similarity or affinity measures intended to represent the input data. Naturally, such similarity measures must be ruled by a positive semi-definite function. In mathematical terms, a positive semi-definite kernel function \(\mathcal {K}(\cdot , \cdot )\) can be expressed as follows:

$$\begin{aligned} \mathcal {K}(\cdot , \cdot ):&\; \mathbbm {R}^D \times \mathbbm {R}^D \longrightarrow \mathbbm {R} \nonumber \\&\;\;\; \mathbf{{x}}_i, \mathbf{{x}}_j \quad \longmapsto \mathcal {K}(\mathbf{{x}}_i,\mathbf{{x}}_j), \end{aligned}$$
(1)

satisfying

$$\begin{aligned} \sum \limits _{i = 1}^{N}\sum \limits _{j = 1}^{N} c_{i}\bar{c}_{j}\mathcal {K}(\mathbf{{x}}_i,\mathbf{{x}}_j)\ge 0, \end{aligned}$$
(2)

for all \(c_{i} \in \mathbb {C}\), where \(\bar{c}_i\) denotes the complex conjugate of \(c_i\).
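As a concrete illustration of Eqs. (1) and (2), the following minimal Python sketch (an editorial illustration; the experiments reported later are implemented in MATLAB) builds a Gaussian kernel matrix, a standard positive semi-definite kernel, and numerically checks that its eigenvalues are non-negative. The bandwidth sigma and the toy data are assumed values.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """Pairwise Gaussian (RBF) kernel: K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)          # guard against small negative values
    return np.exp(-sq_dists / (2.0 * sigma**2))

# Toy check of positive semi-definiteness (Eq. (2)): all eigenvalues should be >= 0
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))                  # N = 50 samples, D = 3 variables
K = gaussian_kernel_matrix(X, sigma=1.0)
eigvals = np.linalg.eigvalsh(K)                   # K is symmetric, so eigvalsh applies
print(eigvals.min() >= -1e-10)                    # True up to numerical precision
```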

3 Proposed Generalized Spectral Dimensionality Reduction (GSDR) Approach

The here-proposed generalized spectral dimensionality reduction approach, termed GSDR for short, is based on the premise that data can be mapped onto another space \(\mathbf{{Z}} \in \mathbbm {R}^{N \times M}\) before going through the feature extraction procedure itself. In this connection, and inspired by works devoted to dissimilarity-based representations [6], we propose to explore a nonlinear mapping \(\mathcal {T}\{\cdot \}\) based on pairwise similarities, such that:

$$\begin{aligned} \mathbf{{Z}} = \mathcal {T}\{\mathbf{{X}}\}, \end{aligned}$$
(3)

where \(z_{ij} = \mathcal {K}(\mathbf{{x}}_i,\mathbf{{x}}_j)\). Therefore, \(\mathbf{{Z}}\) is a kernel matrix and \(M = N\). Then, a linear projection is performed over the mapped space to obtain the embedded space \(\mathbf{{Y}}\), such that \(\mathbf{{Y}} = \mathbf{{Z}}\mathbf{{R}}\), where \(\mathbf{{R}} \in \mathbbm {R}^{N \times d}\) is a rotation matrix to be defined.

Figure 1 depicts a high-level outline of the proposed GSDR.

Fig. 1. Block diagram of the proposed GSDR. It mainly involves two steps: a nonlinear transformation based on kernel functions (similarities) and a linear feature extraction using PCA.

Notice that, in this work, both an element of the space (a matrix) and the space itself are indistinctly referred to as a space.

3.1 Nonlinear Transformation Using a Kernel Matrix

Since the vectors \(\mathbf{{x}}_{i}\) are assumed to be real and D-dimensional, and a collection of N vectors is available (just as stated in the notation given above), a matrix \(\mathbf{{Z}} \in \mathbbm {R}^{N \times N}\) with entries \(z_{ij} = \mathcal {K}(\mathbf{{x}}_{i},\mathbf{{x}}_{j})\) can be formed. Such a matrix is known as a kernel matrix (also as a Gram or generalized covariance matrix). Therefore, a real symmetric \(N \times N\) matrix \(\mathbf{{Z}}\) whose entries satisfy Eq. (2) for all \(c_{i}\in \mathbb {R}\) is also a positive semi-definite matrix.

A remarkable benefit of this property is that all eigenvalues of \(\mathbf{{Z}}\) are ensured to be non-negative, which makes it possible to readily carry out useful spectral developments for feature extraction purposes.

Figure 2 depicts the effect of the kernel-based data representation. By nature, the kernel entries can be understood as pairwise similarities, and therefore a non-directed, weighted graph becomes a suitable geometric representation thereof.

Fig. 2. An explanatory diagram depicting the relationship between the kernel entries and the weights of an N-node weighted, non-directed graph within a similarity-based representation framework.

As a matter of fact, the kernel matrix entries may be related to the similarity among nodes (data points), which is in turn related to an opposite notion of distance, and therefore the concept of a close neighborhood (local structure) arises. Such a local structure of the data can be preserved by a kernel function if its corresponding kernel matrix is properly selected and tuned, and subsequently used as an input to either a sufficiently robust kernelized DR method [7] or a similarity-driven generalized DR [8].
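To illustrate this graph view, the sketch below turns kernel similarities into the weight matrix of a non-directed graph restricted to the k nearest neighbors of each node; both the Gaussian similarity and the k-NN sparsification are illustrative assumptions and are not prescribed by the method.

```python
import numpy as np

def knn_affinity(X, k=10, sigma=1.0):
    """Weighted adjacency of a non-directed graph whose edge weights are kernel
    similarities restricted to the k nearest neighbors of each node (data point)."""
    N = X.shape[0]
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(sq_dists, np.inf)            # exclude self-similarity from the neighborhood
    W = np.zeros((N, N))
    for i in range(N):
        nn = np.argsort(sq_dists[i])[:k]          # indices of the k closest samples to node i
        W[i, nn] = np.exp(-sq_dists[i, nn] / (2.0 * sigma**2))
    return np.maximum(W, W.T)                     # symmetrize: non-directed graph
```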

3.2 PCA-Based Feature Extraction

For the dimensionality reduction process to be carried out over the new space \(\mathbf{{Z}}\), we use a PCA-based feature extraction approach, which works as follows. First, let us consider a linear projection of the form:

$$\begin{aligned} \mathbf{{Y}} = \mathbf{{Z}}\mathbf{{R}}, \end{aligned}$$
(4)

where \(\mathbf{{R}} \in \mathbbm {R}^{N \times d}\) is a projection or rotation matrix. In this connection, the condition \(d < D\) holds so that features are extracted in a lower-dimensional space.

To ensure linear independence and prevent length effects, an orthonormal rotation matrix is considered, i.e. \(\mathbf{{R}}^\top \mathbf{{R}} = \mathbf{{I}}_d\), where \(\mathbf{{I}}_d\) is the d-dimensional identity matrix.

The estimation of \(\mathbf{{R}}\) follows from the distance-based framework widely explained in [8].

Briefly put, this framework minimizes the distance between \(\mathbf{{Z}}\) and a low-rank representation thereof, \(\widehat{\mathbf{{Z}}} \in \mathbbm {R}^{N \times N}\), as follows:

$$\begin{aligned} \min _{\mathbf{{R}}}&\; \Vert \mathbf{{Z}} - \widehat{\mathbf{{Z}}}\Vert ^2_2\\ \nonumber \text {s.t.}&\; \mathbf{{R}}^\top \mathbf{{R}} = \mathbf{{I}}_d, \end{aligned}$$
(5)

where \(\Vert \cdot \Vert _2\) stands for the Euclidean (\(L_2\)) norm. As demonstrated in [8], the previous formulation is equivalent to the following dual problem:

$$\begin{aligned} \max _{\mathbf{{R}}}&\; \text {tr}(\mathbf{{R}}^\top \mathbf{{\Sigma }}\mathbf{{R}})\\ \nonumber \text {s.t.}&\; \mathbf{{R}}^\top \mathbf{{R}} = \mathbf{{I}}_d, \end{aligned}$$
(6)

where

$$\begin{aligned} \mathbf{{\Sigma }} = \mathbf{{Z}}^\top \mathbf{{Z}} \end{aligned}$$
(7)

and \(\text {tr}(\cdot )\) denotes the conventional matrix trace operator.

As the functional of the dual optimization problem presented in (6) is quadratic and \(\mathbf{{R}}\) is an orthonormal matrix, it is easy to demonstrate that a feasible solution is to select \(\mathbf{{R}}\) as the eigenvectors corresponding to the d largest eigenvalues of \(\mathbf{{\Sigma }}\). It is worth noticing that, once the matrix \(\mathbf{{Z}}\) is centered with

$$\begin{aligned} \mathbf{{Z}} \leftarrow \left( \mathbf{{I}}_{N}-\frac{1}{N}\mathbf{{1}}_{N}\mathbf{{1}}_{N}^\top \right) \mathbf{{Z}}, \end{aligned}$$
(8)

where \(\mathbf{{1}}_N\) is an N-dimensional all-ones vector, \(\mathbf{{\Sigma }}\) becomes an estimation of the covariance matrix of \(\mathbf{{Z}}\).
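The following Python sketch summarizes Eqs. (4)-(8) for illustration (the reported experiments use MATLAB): the mapped space \(\mathbf{{Z}}\) is centered, \(\mathbf{{\Sigma }} = \mathbf{{Z}}^\top \mathbf{{Z}}\) is formed, and \(\mathbf{{R}}\) is taken as the eigenvectors associated with the d largest eigenvalues of \(\mathbf{{\Sigma }}\).

```python
import numpy as np

def pca_feature_extraction(Z, d):
    """Solve the dual problem (6): R holds the d leading eigenvectors of Sigma = Z^T Z."""
    Z = Z - Z.mean(axis=0, keepdims=True)         # centering, Eq. (8): (I - 11^T / N) Z
    Sigma = Z.T @ Z                               # Eq. (7)
    eigvals, eigvecs = np.linalg.eigh(Sigma)      # ascending eigenvalues of a symmetric matrix
    R = eigvecs[:, ::-1][:, :d]                   # d eigenvectors with the largest eigenvalues
    return Z @ R                                  # embedded space Y = Z R, Eq. (4)
```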

3.3 GSDR Algorithm

Algorithm 1 gathers the steps of the proposed GSDR in pseudocode.

Algorithm 1. \({\text {GSDR}}\left( \mathbf{{X}}, \mathcal {K}(\cdot , \cdot ), d\right) \): kernel-based nonlinear mapping (Eq. (3)) followed by PCA-based feature extraction (Eqs. (4)-(8)).
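Since the pseudocode itself is rendered as a figure, a self-contained Python sketch mirroring the procedure described above under the interface \({\text {GSDR}}\left( \mathbf{{X}}, \mathcal {K}(\cdot , \cdot ), d\right) \) (used later in Sect. 5) is provided below; the function and parameter names, as well as the Gaussian kernel of the usage example, are illustrative assumptions.

```python
import numpy as np

def gsdr(X, kernel_fn, d):
    """Generalized spectral DR: kernel-based nonlinear mapping followed by PCA-based projection."""
    N = X.shape[0]
    # Step 1: nonlinear transformation, z_ij = K(x_i, x_j)  (Eq. (3))
    Z = np.array([[kernel_fn(X[i], X[j]) for j in range(N)] for i in range(N)])
    # Step 2: centering (Eq. (8)) and PCA-based feature extraction (Eqs. (4)-(7))
    Z = Z - Z.mean(axis=0, keepdims=True)
    _, eigvecs = np.linalg.eigh(Z.T @ Z)
    R = eigvecs[:, ::-1][:, :d]                   # rotation matrix with the d leading eigenvectors
    return Z @ R                                  # embedded space Y in R^{N x d}

# Usage with an assumed Gaussian kernel as K(., .), reducing to d = 2 as in the experiments
rbf = lambda xi, xj: np.exp(-np.sum((xi - xj) ** 2) / 2.0)
Y = gsdr(np.random.default_rng(0).standard_normal((100, 5)), rbf, 2)
```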

4 Experimental Setup

Table 1. Brief description of the here-used kernel matrices representing dimensionality reduction.

Kernels for DR: Two kernel approximations for spectral DR methods [9] are considered, namely CMDS and LE, as detailed in Table 1.

All previously mentioned kernels are widely described in [9].
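As a reminder of the kind of kernel approximations proposed in [9] (recalled here from that reference rather than from Table 1, so the exact expressions should be taken as an assumption), the CMDS kernel can be obtained by double-centering the matrix of negative halved squared Euclidean distances, while the LE kernel corresponds to the pseudo-inverse of the graph Laplacian built from an affinity matrix:

```python
import numpy as np

def cmds_kernel(X):
    """Kernel approximation for CMDS [9]: double-centered matrix of -1/2 squared distances."""
    N = X.shape[0]
    sq_norms = np.sum(X**2, axis=1)
    D2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    H = np.eye(N) - np.ones((N, N)) / N           # centering matrix I - 11^T / N
    return -0.5 * H @ D2 @ H

def le_kernel(W):
    """Kernel approximation for LE [9]: pseudo-inverse of the graph Laplacian of affinity W."""
    L = np.diag(W.sum(axis=1)) - W                # unnormalized Laplacian L = D - W
    return np.linalg.pinv(L)
```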

Performance Measure: The performance of the considered methods is quantified by the scaled version, ranged within the interval [0, 1], of the average agreement rate \(R_{NX}(K)\) presented in [11]. Given that \(R_{NX}(K)\) is obtained at each neighborhood size K from 2 to \(N-1\), a numerical indicator of the overall performance is acquired by calculating its area under the curve (AUC). Therefore, this AUC is an overall indicator of the quality preservation of a DR approach, as it aggregates the curve over all neighborhood scales.
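For reference, a compact Python sketch of this quality measure is given below, following the co-ranking-based definitions of \(Q_{NX}(K)\) and \(R_{NX}(K)\) from the literature and an AUC computed with 1/K weights (i.e. a log-scale average); it reflects our reading of [11] rather than the paper's MATLAB code, and the exact range of K may differ slightly from the one stated above.

```python
import numpy as np

def rnx_auc(X, Y):
    """Scaled average agreement rate R_NX(K) and its AUC (1/K-weighted), following [11]."""
    N = X.shape[0]
    dX = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    dY = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    np.fill_diagonal(dX, np.inf)                  # exclude each sample from its own neighborhood
    np.fill_diagonal(dY, np.inf)
    nX = np.argsort(dX, axis=1)                   # nX[i, m]: index of the (m+1)-th neighbor of i in X
    rank_Y = np.argsort(np.argsort(dY, axis=1), axis=1)   # rank_Y[i, j]: rank of j around i in Y
    r = np.take_along_axis(rank_Y, nX, axis=1)    # embedding rank of each high-dimensional neighbor
    Ks = np.arange(1, N - 1)                      # neighborhood sizes K = 1 .. N-2
    QNX = np.array([np.mean(r[:, :K] < K) for K in Ks])   # fraction of preserved K-neighbors
    RNX = ((N - 1) * QNX - Ks) / (N - 1 - Ks)     # rescaling so a random embedding scores ~0
    auc = np.sum(RNX / Ks) / np.sum(1.0 / Ks)     # area under the curve with 1/K weights
    return RNX, auc
```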

Both the dimensionality reduction techniques (GSDR, LE, and CMDS) and the performance measure (the \(R_{NX}(K)\) curve) are implemented in MATLAB Version 9.10 (R2021a).

Databases: Experiments are carried out over four conventional data sets. Figure 3 depicts examples/views of the considered data sets.

Fig. 3. The four considered data sets: COIL-20, MNIST, Swiss roll and Sphere

The first data set is a randomly selected subset of the MNIST image bank [12], which is formed by 6000 gray-level images of each of the 10 digits (\(N = 1500\) data points, 150 instances per digit, and \(D = 24^2\)); a sample is presented in Fig. 3(b). The second one is the COIL-20 image bank [13], which contains 72 gray-level images of each of 20 different objects (\(N = 1440\) data points, 20 objects in 72 poses/angles, with \(D = 128^2\)), as seen in Fig. 3(a). The third data set (Sphere) is an artificial spherical shell (\(N = 1500\) data points and \(D = 3\)), and the fourth data set is a toy set here called Swiss roll (\(N = 3000\) data points and \(D = 3\)), depicted in Fig. 3(d) and 3(c), respectively.
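For the two synthetic cases, a minimal generation sketch with the stated sizes is shown below; since the exact sampling scheme is not specified in the text, the procedures (uniform sampling on the unit sphere and the classical Swiss roll parameterization) are assumptions.

```python
import numpy as np

def sphere_dataset(N=1500, seed=0):
    """Points uniformly sampled on a unit spherical shell (D = 3)."""
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((N, 3))
    return P / np.linalg.norm(P, axis=1, keepdims=True)   # project onto the unit sphere

def swiss_roll_dataset(N=3000, seed=0):
    """Classical Swiss roll manifold embedded in D = 3."""
    rng = np.random.default_rng(seed)
    t = 1.5 * np.pi * (1.0 + 2.0 * rng.random(N))          # angular parameter of the roll
    h = 21.0 * rng.random(N)                                # height along the roll axis
    return np.column_stack((t * np.cos(t), h, t * np.sin(t)))
```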

5 Results and Discussion

The plot of the \(R_{NX}(K)\) curve and its AUC obtained from reducing the data sets with the conventional (non-kernelized) implementations of CMDS [1] and LE [2] are compared with the ones reached by \({\text {GSDR}}\left( \mathbf{{X}}, \mathcal {K}(\cdot , \cdot ), 2\right) \) (according to Algorithm 1) when selecting \(\mathcal {K}(\cdot , \cdot )\) as the kernel functions producing, respectively, the kernel matrices \(\mathbf{{K}}_{\text {CMDS}}\) and \(\mathbf{{K}}_{\text {LE}}\). Results are shown in Figs. 4, 5, 6 and 7.

As can be observed, GSDR slightly outperforms conventional CMDS and LE in all cases. A remarkable advantage of the proposed GSDR is its ability to both unfold manifolds (as seen in Figs. 4 and 5) and reach visualizations with separable classes from complex, high-dimensional real data (as seen in Figs. 6 and 7).

It is also worth noticing that some rotation occurs over the embedded spaces, as can be appreciated in Fig. 6(b), 6(d), 7(b) and 7(d). This fact is due to the orthonormal rotation performed at the second step of the GSDR procedure, which adds an effect of global structure preservation.

Fig. 4. Results for Sphere. Both the embedded spaces of GSDR and conventional CMDS and LE are shown. Comparison is made in terms of the \(R_{NX}\) curve

Fig. 5. Results for Swiss roll.

That said, as it performs a kernel-based representation and linearly projects the data with a PCA-based rotation, GSDR is able to preserve both the global and local structure of the input data.

Fig. 6. Results for COIL 20.

Fig. 7. Results for MNIST.

Even though these preliminary results exhibit no great improvement over conventional DR methods, the mathematical development and versatility of GSDR are highly promising, as they open possibilities to simultaneously exploit a kernel-matrix-based representation together with simple PCA.

As demonstrated in previous works [14, 15], a joint formulation involving linear projections (such as PCA) and either similarity-based or kernelized representations is a suitable framework to design DR alternatives able to preserve the local and global structure of the data.

6 Conclusions and Future Work

In this work, we present a generalized spectral dimensionality reduction (GSDR) approach, which simultaneously exploits kernel-based representations and a feature extraction stage. The former is an initial nonlinear transformation aimed at generating a new space wherein local-structure attributes are captured. The latter uses that new space as the input for a principal-component-analysis-driven projection, which enables the preservation of global-structure attributes. Experimentally, we show that the proposed GSDR reaches competitive performance in contrast to the conventional implementations of classical multidimensional scaling and Laplacian eigenmaps in terms of structure preservation criteria.

As future work, more kernel representations are to be explored, which can be plugged into spectral dimensionality reduction approaches aiming at reaching a suitable trade-off between the preservation of the local and global structure of the data.