
1 Introduction

Nowadays, large volumes of data are accompanied by the need for powerful analysis and representation tools: a dense data repository without the appropriate tools may yield little useful information [1]. Hence the need for techniques and tools that help researchers and analysts in tasks such as extracting useful patterns from large volumes of data; these tools are the subject of an emerging field of research known as Knowledge Discovery in Databases (KDD). Dimension reduction (DR) is considered a pre-processing stage within the KDD process because it projects the data onto a space where the original data are represented with fewer attributes or characteristics, preserving as much of the intrinsic information of the original data as possible in order to enhance tasks such as data mining and machine learning. For example, in classification tasks, knowing the representation of the data, as well as whether they have separability characteristics, makes them easier to handle and interpret by the user [2, 3].

PCA (Principal Component Analysis) and CMDS (Classical Multi-Dimensional Scaling) are two classic DR methods whose objective is to preserve variance or distance, respectively [4]. More recently, DR methods have focused on criteria aimed at preserving the data topology. Such a topology can be represented by an undirected, weighted graph built from the data, whose points are the nodes and whose edge weights are contained in a non-negative affinity (similarity) matrix. This representation is leveraged by methods based on spectral and divergence approaches. Within the spectral approach, the distances can be encoded as weights in a similarity matrix, as in the LE (Laplacian Eigenmaps) method [5]; using an unsymmetrical similarity matrix and focusing on the local structure of the data gives rise to the method called LLE (Locally Linear Embedding) [6]. It is also possible to work in a high-dimensional feature space, which greatly enhances the representation and visualization of the embedded data obtained from mapping the original space into that high-dimensional space and computing its eigendecomposition. An estimate of the inner product (kernel) can be designed according to the function and application one wants to develop [7]; in this work the kernel matrices represent distance or similarity functions associated with a dimension reduction method.

In this research, three spectral dimension reduction methods are considered, trying to encompass the different criteria on which CMDS, LLE and LE are based. They are used under two approaches: the first is the representation of their embedded spaces obtained from their standard algorithms, widely explained in [5, 6, 8]; the second is based on kernel approximations of the same methods. After obtaining each of the embedded spaces, a linear weighting is performed to combine the different approaches, leveraging each of the DR methods; the same is done for the kernel matrices obtained from the approximations of the spectral methods. Subsequently, the kernel PCA technique is applied to reduce the dimension and obtain the embedded space from the combination of the kernel-based approach. The combination of embedded spaces already obtained from the DR methods is not mathematically clear or intuitive; on the other hand, the linear combination of kernel (similarity) matrices, which are represented in the same high-dimensional space, is mathematically more intuitive and concise. Nevertheless, in tasks such as information visualization, choosing between the two interactive mixture approaches for dimension reduction is a crucial decision on which the representation of the data, and also its interpretation by the user, will depend. Therefore, this research proposes a quantitative and qualitative comparison, in addition to a demonstration of the previous assumption, in order to contribute to machine learning, data visualization and data mining tasks where dimension reduction plays an imperative role. For example, to perform classification tasks on high-dimensional data, it is necessary to visualize the data in such a way that they are understandable for non-expert users who want to know the topology of the data and characteristics such as separability, which help to determine which classifier could be adequate for a given data set.

2 Methodology

Mathematically, the objective of dimension reduction is to map or project data from a high-dimensional space \({\varvec{Y}} \in \mathbb {R}^{D \times N}\) to a low-dimensional space \({\varvec{X}} \in \mathbb {R}^{d \times N}\), where \(d < D\). The original data and the embedded data therefore consist of the same N points or registers, denoted respectively by \({\varvec{y}}_{i} \in \mathbb {R}^{D}\) and \({\varvec{x}}_{i} \in \mathbb {R}^{d}\), with \(i \in \{1,\ldots ,N\}\) [5, 6]. This means that the number of samples in the high-dimensional data matrix is not affected when the number of attributes or characteristics is reduced. In order to represent the resulting embedded space in a two-dimensional Cartesian plane, this research takes into account only the two principal components of the kernel matrix, which represent most of the information of the original space.

2.1 Kernel-Based Approaches

The DR method known as principal component analysis (PCA) is a linear projection that tries to preserve the variance by means of the eigenvalues and eigenvectors of the covariance matrix [9, 10]. Moreover, when the data matrix is centered, which means that the average value of each row (characteristic) is equal to zero, the preservation of variance can be interpreted as preservation of the Euclidean inner product [9].
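As a minimal illustration of this criterion (a NumPy sketch with names of our own choosing, not the authors' implementation), PCA centers each characteristic, eigendecomposes the covariance matrix and keeps the d leading eigenvectors:

```python
import numpy as np

def pca_embedding(Y, d=2):
    """Classical PCA sketch: Y is D x N (features x samples); returns a d x N embedding."""
    Yc = Y - Y.mean(axis=1, keepdims=True)      # center each row (characteristic)
    C = Yc @ Yc.T / (Y.shape[1] - 1)            # D x D covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)        # eigenvalues in ascending order
    V = eigvecs[:, ::-1][:, :d]                 # d eigenvectors of largest variance
    return V.T @ Yc                             # d x N embedded space
```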

The kernel PCA method is similar to PCA in that it maximizes the variance criterion, but in this case over a kernel matrix, which is basically an inner product in an unknown high-dimensional space. We define \(\varvec{\phi } \in \mathbb {R}^{D_h \times N}\) as the data representation in a high-dimensional space with \(D_h \gg D\), which is completely unknown except for its inner product, which can be estimated [9]. To use the properties of this new high-dimensional space and its inner product, it is necessary to define a function that maps the data from the original space to the high-dimensional one \((\varvec{\phi })\), as follows:

$$\begin{aligned} \varvec{\phi }: \mathbb {R}^{D} \rightarrow \mathbb {R}^{D_h}, \qquad {\varvec{y}}_i \mapsto \varvec{\phi }({\varvec{y}}_i), \end{aligned}$$
(1)

where the i-th column vector of the matrix \(\varvec{\phi }\) is \(\varvec{\phi }({\varvec{y}}_i)\).

Considering Mercer's conditions [11], and provided that the matrix \(\varvec{\phi }\) is centered, the inner product given by the kernel function can be calculated as \({\varvec{\phi }} ({\varvec{y}}_i)^\top {\varvec{\phi }} ({\varvec{y}}_j) = {\varvec{K}}({\varvec{y}}_i, {\varvec{y}}_j) \). In short, the kernel function can be understood as the composition of the mapping generated by \(\varvec{\phi }\) and its scalar product \({\varvec{\phi }} ({\varvec{y}}_i)^\top {\varvec{\phi }} ({\varvec{y}}_j)\), so that for each pair of elements of the set \(\varvec{Y}\) the scalar product is assigned directly, without explicitly computing the mapping \((\varvec{\phi })\). Organizing all possible inner products in an \(N \times N\) array results in the kernel matrix:

$$\begin{aligned} {\varvec{K}}_{N \times N} = {{\varvec{\varphi }}^T}_{D_h \times N} {{\varvec{\varphi }}_{D_h \times N}}. \end{aligned}$$
(2)
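For illustration only, the following sketch builds Eq. (2) with an explicit degree-2 monomial map standing in for \(\varvec{\phi }\); this particular mapping is our own arbitrary choice, not one used in this work:

```python
import numpy as np

def phi(y):
    """Illustrative explicit mapping to a higher-dimensional space (degree-2 monomials)."""
    return np.outer(y, y)[np.triu_indices(len(y))]

Y = np.random.randn(3, 100)                                        # toy data: D = 3, N = 100
Phi = np.column_stack([phi(Y[:, i]) for i in range(Y.shape[1])])   # D_h x N mapped data
K = Phi.T @ Phi                                                    # Eq. (2): N x N kernel matrix
# Each entry K[i, j] is the inner product phi(y_i)^T phi(y_j)
assert np.isclose(K[0, 1], phi(Y[:, 0]) @ phi(Y[:, 1]))
```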

The advantage of working with the high-dimensional space \((\varvec{\phi })\) is that it can greatly improve the representation and visualization of the embedded data obtained from mapping the original space to the high-dimensional one, through the computation of the eigenvalues and eigenvectors of its inner product. An estimate of the inner product (kernel) can be designed based on the function and application the user wants to develop [12]; in this case, the kernel matrices represent distance functions associated with a dimension reduction method, and the kernel approximations presented below are widely explained in [13]. The kernel representation for the CMDS reduction method is defined as the doubly centered distance matrix \(\varvec{D} \in \mathbb {R}^{N \times N}\), that is, with the mean of its rows and columns made zero, as follows:

$$\begin{aligned} {\varvec{K}}_{CMDS} = -\dfrac{1}{2}\left( {\varvec{I}}_N - \dfrac{1}{N}{\varvec{1}}_N{\varvec{1}}_N^\top \right) {\varvec{D}}\left( {\varvec{I}}_N - \dfrac{1}{N}{\varvec{1}}_N{\varvec{1}}_N^\top \right) , \end{aligned}$$
(3)

where the ij entry of \({\varvec{D}}\) is the squared Euclidean distance:

$$\begin{aligned} d_{ij} = ||{\varvec{y}}_i - {\varvec{y}}_j||_2^2. \end{aligned}$$
(4)
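A sketch of this approximation (using SciPy for the pairwise distances; the helper name is ours):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def kernel_cmds(Y):
    """K_CMDS (Eq. 3): doubly centered matrix of squared Euclidean distances (Eq. 4)."""
    N = Y.shape[1]
    D = squareform(pdist(Y.T, metric="sqeuclidean"))   # N x N, entries ||y_i - y_j||^2
    H = np.eye(N) - np.ones((N, N)) / N                # centering matrix
    return -0.5 * H @ D @ H
```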

A kernel for LLE can be approximated from a quadratic form in terms of the matrix \({\varvec{\mathcal {W}}}\) holding the linear coefficients that sum to 1 and optimally reconstruct the observed data. Define a matrix \({\varvec{M}} \in \mathbb {R}^{N \times N}\) as \({\varvec{M}} = ({\varvec{I}}_N - {\varvec{\mathcal {W}}})({\varvec{I}}_N - {\varvec{\mathcal {W}}}^\top )\) and \(\lambda _{max}\) as the largest eigenvalue of \({\varvec{M}}\). The kernel matrix for LLE then has the form:

$$\begin{aligned} {\varvec{K}}_{LLE} = \lambda _{max}{\varvec{I}}_N - {\varvec{M}}. \end{aligned}$$
(5)
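A sketch of Eq. (5), assuming the \(N \times N\) reconstruction-weight matrix \({\varvec{\mathcal {W}}}\) has already been computed by the standard LLE step (that step is omitted here):

```python
import numpy as np

def kernel_lle(W):
    """K_LLE (Eq. 5) from the LLE reconstruction weights W (rows sum to 1)."""
    N = W.shape[0]
    M = (np.eye(N) - W) @ (np.eye(N) - W).T   # M = (I_N - W)(I_N - W^T)
    lam_max = np.linalg.eigvalsh(M)[-1]       # largest eigenvalue of M
    return lam_max * np.eye(N) - M
```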

Considering that kernel PCA is a maximization problem over the high-dimensional covariance represented by a kernel, the kernel for LE can be represented as the pseudo-inverse of the graph Laplacian \(\varvec{L}\), as shown in the following expression:

$$\begin{aligned} {\varvec{K}}_{LE} = {\varvec{L}}^\dag , \end{aligned}$$
(6)

where \({\varvec{L}} = {\varvec{\mathcal {D}}} - {\varvec{S}}\), such that \(\varvec{S}\) is a similarity matrix and \({\varvec{\mathcal {D}}} = \text {Diag}({\varvec{S}}{\varvec{1}}_N)\) is its degree matrix. The similarity matrix \(\varvec{S}\) is built so that its relative bandwidth parameter is estimated by keeping the entropy of the distribution over the nearest neighbors at approximately \(\log K\), where K is the given number of neighbors, as explained in [14]. For this research, the number of neighbors was set to the integer closest to 10% of the number of data points.
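A simplified sketch of Eq. (6): it uses a Gaussian similarity restricted to the k nearest neighbors with a fixed bandwidth, instead of the entropy-based bandwidth tuning described above (that simplification, and the helper name, are ours):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def kernel_le(Y, sigma=1.0):
    """K_LE (Eq. 6): pseudo-inverse of the graph Laplacian L = Deg - S."""
    N = Y.shape[1]
    k = max(1, round(0.1 * N))                         # neighbors: ~10% of the data
    Dsq = squareform(pdist(Y.T, metric="sqeuclidean"))
    S = np.exp(-Dsq / (2.0 * sigma ** 2))              # Gaussian similarities
    keep = np.zeros_like(S, dtype=bool)                # keep only the k nearest neighbors
    nn = np.argsort(Dsq, axis=1)[:, 1:k + 1]           # column 0 is the point itself
    np.put_along_axis(keep, nn, True, axis=1)
    S = np.where(keep | keep.T, S, 0.0)                # symmetrize the neighborhood graph
    np.fill_diagonal(S, 0.0)
    Deg = np.diag(S.sum(axis=1))                       # degree matrix Diag(S 1_N)
    return np.linalg.pinv(Deg - S)                     # pseudo-inverse of the Laplacian
```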

Finally, to project the data matrix \(\varvec{Y} \in \mathbb {R}^{D \times N}\) into an embedded space \(\varvec{X} \in \mathbb {R}^{d \times N}\), we use PCA. In PCA, the embedded space is obtained by selecting the most representative eigenvectors of the covariance matrix [6, 10]; therefore, we obtain the d most representative eigenvectors of the kernel matrix \({\varvec{K}}_{N \times N}\) obtained previously, constructing the embedded space \(\varvec{X}\). As stated before, for this research the embedded space is fixed at two dimensions, which represent most of the characteristics of the data.
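The final projection step, sketched below, centers the (possibly mixed) kernel matrix and keeps its two leading eigenpairs as the embedded coordinates (the function name is ours):

```python
import numpy as np

def kernel_pca_embedding(K, d=2):
    """Kernel PCA projection: d x N embedding from the d leading eigenpairs of the centered kernel."""
    N = K.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    Kc = H @ K @ H                                    # center the kernel matrix
    eigvals, eigvecs = np.linalg.eigh(Kc)             # ascending order
    eigvals = np.maximum(eigvals[::-1][:d], 0.0)      # d largest eigenvalues
    eigvecs = eigvecs[:, ::-1][:, :d]
    return (eigvecs * np.sqrt(eigvals)).T             # d x N embedded space X
```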

2.2 DR-Methods Mixture

In terms of data visualization through DR methods, the parameters to be combined are the kernel matrices or the embedded spaces obtained by each method; each matrix corresponds to one of the M DR methods considered, that is, \(\{{\varvec{K}}^{(1)},\cdots ,{\varvec{K}}^{(M)}\}\). Consequently, depending on the approach (kernel or embedded space), a final matrix \(\varvec{\widehat{K}}\) is obtained from the mixture of the M matrices, such that:

$$\begin{aligned} \varvec{\widehat{K}} = \sum _{m=1}^{M}\alpha _{m}{\varvec{K}}^{(m)}, \end{aligned}$$
(7)

where \(\alpha _{m}\) is the weighting factor corresponding to the m-th method and \(\varvec{\alpha }=\{ \alpha _{1},\cdots ,\alpha _{M} \}\) is the weighting vector. In this research these parameters are set to 1/3 for each of the three methods used, so that they sum to 1, in order to give each method equal priority, since the aim of this research is to compare the proposed approaches under equal conditions. Each \({\varvec{K}}^{(m)}\) represents either a kernel matrix obtained after applying the approximations presented in Eqs. (3), (5) and (6), or an embedded space obtained by applying the corresponding DR method in its classical form.
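Putting the pieces together, Eq. (7) with equal weights reduces to a weighted sum of the matrices produced by the sketches above (the data matrix Y and the LLE weights W are assumed to be available; all names are illustrative):

```python
import numpy as np

# Kernel-based approach: mix the kernel approximations, then project with kernel PCA
kernels = [kernel_cmds(Y), kernel_lle(W), kernel_le(Y)]
alpha = np.full(len(kernels), 1.0 / len(kernels))        # alpha_m = 1/3, summing to 1
K_hat = sum(a * K for a, K in zip(alpha, kernels))       # Eq. (7)
X_kernel = kernel_pca_embedding(K_hat, d=2)

# Embedded-space-based approach: mix the d x N embeddings of the classical algorithms instead
# X_hat = sum(a * X_m for a, X_m in zip(alpha, embeddings))
```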

3 Results

Data-Sets: Experiments are carried out over four conventional data sets. The first data set (Fig. 1(a)) is an artificial spherical shell (\(N = 1500\) data points and \(D = 3\)). The second data set (Fig. 1(c)) is a toy set here called Swiss roll (\(N = 3000\) data points and \(D = 3\)). The third data set (Fig. 1(d)) is Coil-20, a database of gray-scale images of 20 objects; images of the objects were taken at pose intervals of 5 degrees, which corresponds to 72 images per object (\(N = 1440\) data points and \(D = 128^2\) pixels) [15]. The fourth data set (Fig. 1(b)) is a randomly selected subset of the MNIST image bank [11], which is formed by 6000 gray-level images of each of the 10 digits (\(N = 1500\) data points, 150 instances per digit, and \(D = 24^2\) pixels). Figure 1 depicts examples of the considered data sets.

Fig. 1. The four considered data sets (source: https://archive.ics.uci.edu/ml/datasets.html).

Performance Measure: In dimensionality reduction, the most significant aspect defining how effective a DR method is, is its capability to preserve the data topology of the high-dimensional space in the low-dimensional one. Therefore, we apply a quality criterion based on the preservation of the K nearest neighbors, developed in [16], as the performance measure for each approach proposed for the interactive mixture of DR methods. This criterion is widely accepted as an adequate unsupervised measure [14, 17], and it allows assessing the embedded space in the following way: the rank of \({\varvec{y}}_j\) with respect to \({\varvec{y}}_i\) in the high-dimensional space is denoted as:

$$\begin{aligned} p_{ij}=|\{ k: \delta _{ik}< \delta _{ij} \,\,\, \text {or} \,\,\, (\delta _{ik} = \delta _{ij} \,\,\, \text {and} \,\,\, 1 \le k < j \le N)\}|. \end{aligned}$$
(8)

In Eq. (8), \(|\cdot |\) denotes set cardinality and \(\delta _{ij}\) is the distance between \({\varvec{y}}_i\) and \({\varvec{y}}_j\) in the high-dimensional space. Similarly, as defined in [13], the rank of \({\varvec{x}}_j\) with respect to \({\varvec{x}}_i\) in the low-dimensional space is:

$$\begin{aligned} r_{ij}=|\{ k: d_{ik}< d_{ij} \,\,\, \text {or} \,\,\, (d_{ik} = d_{ij} \,\,\, \text {and} \,\,\, 1 \le k < j \le N)\}|. \end{aligned}$$
(9)

The K nearest neighbors of \({\varvec{y}}_i\) and \({\varvec{x}}_i\) are the sets defined by (10) and (11), respectively.

$$\begin{aligned} {{\varvec{v}}_i}^K =\{ j: 1 \le p_{ij} \le K \}, \end{aligned}$$
(10)
$$\begin{aligned} {{\varvec{n}}_i}^K =\{ j: 1 \le r_{ij} \le K \}. \end{aligned}$$
(11)

A first performance index can be defined as:

$$\begin{aligned} {\varvec{Q_{NX}(K)}} = \sum _{i=1}^{N}\frac{|{{\varvec{v}}_i}^K \cap {{\varvec{n}}_i}^K|}{KN}. \end{aligned}$$
(12)

Equation (12) results in values between 0 and 1 and measures the normalized average agreement between the corresponding K-th neighborhoods of the high-dimensional and low-dimensional spaces. In this way, a co-classification matrix is defined:

$$\begin{aligned} {\varvec{Q}}=[q_{kl}] \,\,\, \text {for} \,\,\, 1 \le k,l \le N-1, \end{aligned}$$
(13)

with \(q_{kl}=|\{ (i,j) : p_{ij}=k \,\,\, \text {and} \,\,\, r_{ij}=l\}|\).

Therefore, \(\varvec{Q_{NX}(K)}\) counts the K-by-K upper-left block of \(\varvec{Q}\), accounting for the ranks preserved (on the main diagonal) and the permutations within the neighborhood (on either side of the diagonal) [12]. This research employs an adjustment of the curve \(\varvec{Q_{NX}(K)}\) introduced in [12], so that the area under the curve is an adequate indicator of how well the embedded data preserve the topology; hence, the quality curve applied in the visualization methodology is given by:

$$\begin{aligned} {\varvec{R_{NX}(K)}} = \frac{(N-1){\varvec{Q_{NX}(K)}}-K}{N-1-K}. \end{aligned}$$
(14)

When the curve in (14) is plotted against a logarithmic scale for K, errors in large neighborhoods do not weigh proportionally more than those in small ones [14]. This logarithmic scale allows obtaining the area under the curve of \({\varvec{R_{NX}(K)}}\), given by:

$$\begin{aligned} {\varvec{AUC}}\log _{K}({\varvec{R_{NX}(K)}}) = \frac{\sum _{K=1}^{N-2} \frac{{\varvec{R_{NX}(K)}}}{K}}{\sum _{K=1}^{N-2} \frac{1}{K}}. \end{aligned}$$
(15)
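A sketch of the whole criterion (Eqs. (12), (14) and (15)); for simplicity the ranks are obtained by sorting distances, so ties are broken by ordering rather than by the exact rule of Eqs. (8)–(9):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def rnx_auc(Y, X):
    """Return Q_NX(K), R_NX(K) and the log-K-weighted AUC for an embedding X of the data Y."""
    N = Y.shape[1]
    hd = np.argsort(squareform(pdist(Y.T)), axis=1)[:, 1:]   # neighbor order, high dimension
    ld = np.argsort(squareform(pdist(X.T)), axis=1)[:, 1:]   # neighbor order, low dimension
    Ks = np.arange(1, N - 1)
    Q = np.empty(len(Ks))
    for t, K in enumerate(Ks):
        overlap = sum(len(np.intersect1d(hd[i, :K], ld[i, :K])) for i in range(N))
        Q[t] = overlap / (K * N)                             # Eq. (12)
    R = ((N - 1) * Q - Ks) / (N - 1 - Ks)                    # Eq. (14)
    w = 1.0 / Ks
    return Q, R, np.sum(w * R) / np.sum(w)                   # Eq. (15)
```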

The results obtained by applying the proposed methodology over the four databases described above are shown in Fig. 2, where the curve \({\varvec{R_{NX}(K)}}\) of each approach is presented, as well as the \({\varvec{AUC}}\) in (15), which assesses the quality of the dimension reduction corresponding to each proposed combination. As a result of the DR procedure, in terms of visualization, we show the embedded space for each test performed. It is necessary to clarify that each combination was carried out in the same scenario under equal conditions, which allows us to measure the computational cost in terms of execution time, as shown in Table 1. This is an important issue if users are looking for an interactive mixture of DR methods with satisfactory performance as well as efficient computation.

Nevertheless, the results achieved in this research allow us to conclude that, in terms of data visualization, performing an interactive DR-method mixture based on kernels is more favorable than one based on the standard methods. Mathematically combining the kernel approximations, which are all represented in the same high-dimensional space where the classes become separable before the mixture is performed, is more appropriate than combining embedded spaces already obtained from the standard methods, whose underlying spaces are unknown.

The computational cost (Table 1) allows us to infer that the cost of executing the kernel approximations together with kernel PCA for dimension reduction is slightly higher in all cases. This is because the databases have a large number of records, which means that obtaining the kernel matrices involves considerable processing: if a database consists of N samples, the kernel matrix size is \(N\times N\).

Table 1. Time consumed by each approach over the four data sets.
Fig. 2. Results obtained for the four experimental databases.

Comparing the \({\varvec{R_{NX}(K)}}\) curves for each database, there is a low performance in the dimension reduction process for the Coil-20 database, whose AUC is the lowest among all, which means that the data topology is not as well preserved in its embedded space as in the other studied cases. Evidently, the best performance was accomplished for the 3D spherical shell and the Swiss roll, which obtained the best AUC and preserve the local structure of the data; in general, preserving the local structure produces superior embedded spaces [13]. On the other hand, the MNIST and spherical shell databases preserved the global data structure better than the other cases.

4 Conclusion

This work presented a comparative analysis of two different approaches for mixing DR methods, applied in an interactive fashion. The results obtained in this research allow us to conclude that performing an interactive DR-method mixture can be a demanding task for a data set with a great number of points and dimensions, since the computational cost of the kernel-based approach proved to be higher; however, this approach gives users high-quality performance, since a greater area under the quality curve is obtained, which indicates that the topology of the data is better preserved. On the other hand, the embedded-spaces-based approach shows only a slight difference in the \({\varvec{R_{NX}(K)}}\) AUC, so if the user wants to carry out a quicker mixture, the embedded-spaces-based approach will be more appropriate for data visualization where interactivity is the main goal, seeking a better perception of the data set for inexpert users.