1 Introduction

With the rapid emergence of high dimensional data in recent years, more and more scholars have paid close attention to dimensionality reduction, an important research issue in the machine learning community. Dimensionality reduction aims to remove redundant or irrelevant features and extract salient characteristics in order to compress data, decrease computational complexity, and improve the efficiency and accuracy of data classification and recognition [14, 37, 38, 43].

Feature extraction is a prominent approach to dimensionality reduction that maps a high dimensional dataset into a lower dimensional space in which the essential characterization of the original data is preserved as much as possible. It can be divided into two categories: linear and nonlinear. Principal component analysis (PCA) [3], independent component analysis (ICA) [32], linear discriminant analysis (LDA) [24, 46], and local preserving projection (LPP) [13] are typical linear dimensionality reduction techniques developed under different optimization criteria. In practice, almost all datasets lack linear structures, and linear techniques cannot handle nonlinear data. Various nonlinear dimensionality reduction techniques have been developed; they fall into two types, those based on kernel functions [35] and those based on manifold learning [21]. Kernel-based techniques face the difficulty of choosing a suitable kernel function. An appropriate kernel function can make data linearly separable or approximately linearly separable in a lower dimensional space, but no single kernel is applicable to every dataset. Nonlinear dimensionality reduction techniques based on manifold learning have been extensively investigated in recent years. ISOMAP [37], local tangent space analysis (LTSA) [52], local linear embedding (LLE) [31], Laplacian Eigenmaps (LE) [4], and their generalizations [2, 6, 19] have been successfully developed.

Among the manifold learning based techniques, Laplacian Eigenmaps is a frequently used nonlinear dimensionality reduction method owing to its superior property of preserving the local neighborhood structure of data. However, it has several deficiencies, including sensitivity to noise, difficulty in choosing the size of the neighborhood, and no capability of preserving the class structures of data. Many scholars have therefore focused on developing improved variants. Raducanu [29] and Keyhanian et al. [17] separately extended LE by constructing adaptive neighborhood graphs to avoid the puzzle of choosing the neighborhood size parameter. Wang et al. [40] proposed a distinguishing variance embedding (DVE) method by introducing the idea of maximum variance unfolding (MVU) into the classical LE. Liu et al. [25] proposed a local linear Laplacian Eigenmaps by combining LE with LLE. Malik et al. [26] explored a generalized incremental LE that can be applied to dynamic data.

Meanwhile, several researchers have introduced supervised information of datasets into Laplacian Eigenmaps to improve the performance of dimensionality reduction [5]. Jiang et al. [16] and Li [20] introduced class information of datasets into LE and developed supervised LE algorithms applied to fault diagnosis and face recognition. Xu et al. [48] combined the idea of LDA with LE in the framework of marginal patch alignment. Such supervised LE techniques can not only preserve the local neighborhood structures of samples but also strengthen the class structures of datasets. At the same time, Costa et al. [8] proposed a classification constrained dimensionality reduction (CCDR) method that makes samples tend to collapse onto their class prototypes. In that method, two neighborhood graphs, a k-nearest neighborhood graph and a sample-class neighborhood graph, were constructed. The weights between vertices (samples) in the first graph were assessed according to the distances between samples, while the weights in the second were set to 1 or 0, depending on whether a sample had a class label or not. Based on whether samples are labelled or not, Kim et al. [18] proposed a semi-supervised Laplacian Eigenmaps (SSLE) by constructing two neighborhood graphs. The weights were assigned as 1, 0.5, or 0, depending on whether the corresponding vertices are labelled and whether one sample is in the neighborhood of the other. SSLE makes homogeneous samples attract each other and heterogeneous samples repel each other, thereby strengthening the performance of dimensionality reduction.

In both the CCDR and SSLE algorithms, fixed weights in the sample-class neighborhood graph are assigned to unlabelled samples, which ignores their membership degrees to each class and cannot exactly express the topological structures of the data. Wang et al. [41] developed a T-S norm neural network to train weights for fuzzy if-then rules, where the T-S norms are fundamental ingredients in the theory of fuzzy sets. A fuzzy set is a generalization of a classical set for modelling imprecise and vague information [50]. A fuzzy similarity relation is a typical notion describing the association of objects. The derived fuzzy similarity classes are fuzzy information granules characterizing the topological structures of data [47]. Another notion, rough sets, was initiated by Pawlak [27] for modelling and processing incomplete information. It has found extensive and successful applications in the field of artificial intelligence. Various extensions of the Pawlak rough set model have been proposed, such as a general relation based rough set [36], a dominance relation based rough set [34], a similarity or tolerance relation based rough set [15, 33], a covering rough set [53], a neighborhood rough set [44, 49], and a decision-theoretic rough set [22, 45]. The integration of the two granular computing frameworks yields fuzzy rough sets, rough fuzzy sets, and their various variants [1, 9, 10, 30, 39, 51] for solving problems with imprecise and incomplete information.

In this paper, the notion of granular computing is introduced into LE and a semi-supervised LE for dimensionality reduction is developed. In this method, a set of semi-supervised fuzzy similarity granules is constructed to characterize the similarity between samples according to the principle that homogeneous samples have higher similarity degrees than heterogeneous samples. A neighborhood rough fuzzy set model of such fuzzy similarity granules is built to assess the degree to which two samples belong to the same class. A class-related neighborhood graph of the dataset is created based on the semi-supervised fuzzy similarity granules for classes to describe the relationship between samples and their prototypes, whereas a Laplacian k-nearest neighborhood graph is established according to both the semi-supervised class information and the association degrees of samples derived from the neighborhood rough fuzzy lower approximations. Simultaneously, the significance of each feature is assessed by building an information entropy measure, and the resulting weighted distance is incorporated into the construction of the Laplacian neighborhood graph. The proposed semi-supervised rough fuzzy Laplacian Eigenmaps (SSRFLE) model for dimensionality reduction of hybrid data not only inherits the advantages of the classical LE but also preserves the class characterization of the original dataset.

The rest of this paper is organized as follows. In Sect. 2, the classical Laplacian Eigenmaps model and its semi-supervised extensions for dimensionality reduction are recalled, and some basic notions related to fuzzy sets, rough sets, and their integrations are briefly reviewed. Section 3 presents an approach to determining the weights of the features of a dataset and describes a semi-supervised Laplacian Eigenmaps model based on neighborhood rough fuzzy sets. In Sect. 4, various comparative experiments on real world datasets are carried out, and parameter and performance analyses of the proposed method are presented. Conclusions and further work follow in Sect. 5.

2 Related work

In this section, we first recall the classical Laplacian Eigenmaps method and its two typical semi-supervised versions for dimensionality reduction. Then some basic notions related to fuzzy sets, rough sets, and their integrations are briefly reviewed. These notions will be used in the subsequent sections of this work.

2.1 Laplacian Eigenmaps (LE)

Laplacian Eigenmaps [4] is a typical nonlinear dimensionality reduction technique based on spectral graph theory. It has the remarkable property of preserving the local neighborhood structure of data. A k-nearest neighborhood graph or an \(\varepsilon\)-ball neighborhood graph is constructed, and the weights of edges (between vertices) are assigned using the Gaussian kernel function or the 0-1 weighting method.

Consider a dataset \(X = \{x_1, x_2, \ldots , x_n\}\) with n samples, where each sample \(x_i\in X\) has m features, namely \(A=\{a_1, a_2, \ldots , a_m\}\). Let \(\{y_1, y_2, \ldots , y_n\}\) be the \(d\, (d\ll m)\) dimensional representations of X; that is, each \(y_i\) is a d dimensional row vector. With LE, the lower dimensional representation of X can be achieved by solving the following optimization problem

$$\begin{aligned} \min \;\sum _{i=1}^n \sum _{j=1}^n{\left\| {y_i - y_j } \right\| ^2W_{ij}} = 2tr(Y^TLY) \end{aligned}$$
(2.1)

where \(Y=(y_1^T\, y_2^T\, \ldots \, y_n^T)^T\) is the \(n\times d\) embedded matrix of X, \(W=(W_{ij})_{n\times n}\) is the weight matrix of the k-nearest neighborhood graph, \(D=(D_{ij})_{n\times n}\) is a diagonal matrix with \(D_{ii} = \sum _{j=1}^n W_{ij}\), and \(L = D - W\) is the Laplacian matrix, a symmetric and positive semi-definite matrix. In order to ensure that problem (2.1) has a unique solution, the constraints \(Y^TDY = I\) and \(Y^TD\mathbf{1} = 0\) are imposed to remove the arbitrary scaling factor and the translational degree of freedom in the lower dimensional embedding, where I is the \(d\times d\) identity matrix and \(\mathbf{1}\) is a column vector with all components being 1. By the Lagrange multiplier method, the optimization problem (2.1) can be transformed into the following generalized eigenvalue problem

$$\begin{aligned} LY = \lambda DY \end{aligned}$$
(2.2)

If \(0 \ne \lambda _1 \le \lambda _2 \le \cdots \le \lambda _d\) are the d smallest positive eigenvalues of Eq. (2.2) and the column vectors \(y^1, y^2, \ldots , y^d\) are the corresponding d eigenvectors, then the lower dimensional embedding of \(x_i\) is as follows

$$\begin{aligned} y_i = (y^1(i), y^2(i), \ldots , y^d(i)), i=1, 2, \ldots , n \end{aligned}$$

that is, the ith row vector of Y is exactly the d dimensional embedding of \(x_i\).
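For concreteness, the procedure above can be sketched in a few lines of code. The following Python sketch (using NumPy and SciPy; the function name and parameter choices are illustrative, not taken from the original work, whose experiments were run in Matlab) builds the k-nearest neighborhood graph with Gaussian weights and solves the generalized eigenvalue problem (2.2):

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh

def laplacian_eigenmaps(X, d=2, k=5, sigma=1.0):
    """Minimal sketch of classical LE: k-NN graph, Gaussian weights,
    and the generalized eigenproblem L y = lambda D y."""
    n = X.shape[0]
    dist = cdist(X, X)                       # pairwise Euclidean distances
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(dist[i])[1:k + 1]  # k nearest neighbors (excluding the sample itself)
        W[i, nbrs] = np.exp(-dist[i, nbrs] ** 2 / sigma)
    W = np.maximum(W, W.T)                   # symmetrize the neighborhood graph
    D = np.diag(W.sum(axis=1))
    L = D - W
    vals, vecs = eigh(L, D)                  # eigenvalues in ascending order
    return vecs[:, 1:d + 1]                  # skip the trivial zero eigenvalue; rows are y_i
```

Here the trivial eigenvector associated with the zero eigenvalue is discarded, which corresponds to keeping only the d smallest positive eigenvalues as described above.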

2.2 Classification constrained dimensionality reduction (CCDR)

CCDR [8], proposed by Costa et al. in 2005, is an extension of the classical LE that fuses class information. It makes samples with the same class label collapse onto their corresponding prototypes.

Suppose each sample of a dataset X (or of a subset of X) is labelled, namely \(x_i\) has a class label \(l_i \in \{1, 2, \ldots , f\}\), where f is the number of classes of X. Two neighborhood graphs \(G^N\) and \(G^C\) were constructed, where \(G^N\) is a k-nearest neighborhood graph whose edge weights are computed using the Gaussian kernel function, and \(G^C\) is a graph regarding the class information of X, called the class-related neighborhood graph. Edges were inserted between prototypes and samples with the same class label, and the weights of such edges were set to 1. Herein, the prototypes of X were computed by maximizing the alignment between samples and classes.

In order to cluster samples with the same label around their prototypes while simultaneously preserving the neighborhood structures of X, the following cost function was constructed.

$$\begin{aligned} E=\beta \sum \limits _{i=1}^n \sum \limits _{j=1}^n{W_{ij} \left\| {y_i - y_j } \right\| ^2} + \sum \limits _{i = 1}^n {\sum \limits _{k = 1}^f {C_{ki} \left\| {y_i - z_k }\right\| ^2} } \end{aligned}$$
(2.3)

where \(W=(W_{ij})_{n\times n}\) is the weight matrix of \(G^N\), \(C=(C_{ki})_{f\times n}\) is the weight matrix of \(G^C\), \(\{y_1, y_2, \ldots , y_n\}\) is the lower dimensional representation of X, \({z_1, z_2, \ldots , z_f}\) are the prototypes of X in the lower dimensional space, and \(\beta\) is a parameter trading off the influences between preserving local neighborhood structures and keeping class structures.

Let \(Z=(z_1^T\, \ldots \, z_f^T\, y_1^T\, \ldots \, y_n^T)^T\), then minimizing Eq. (2.3) is equivalent to solving the following optimization problem

$$\begin{aligned} \min _{Z^{T}\mathbb {D}Z=I, Z^{T}\mathbb {D}1=0} tr(Z^{T}\mathbb {L}Z) \end{aligned}$$
(2.4)

where \(\mathbb {L}=\mathbb {D}-\mathbb {W}\) is a \((f+n)\times (f+n)\) Laplacian matrix associated with the weight matrix \(\mathbb {W} =(\mathbb {W}_{ij})_{(f+n)\times (f+n)}= \left( \begin{array}{cc} I & C \\ C^T & 2\beta W \end{array} \right)\) and \(\mathbb {D}=(\mathbb {D}_{ij})_{(f+n)\times (f+n)}\) with \(\mathbb {D}_{ii}=\sum _{j=1}^{f+n}\mathbb {W}_{ij}\) and \(\mathbb {D}_{ij}=0\) when \(i\ne j\). By the Lagrange multiplier method, problem (2.4) can be transformed into the following generalized eigenvalue problem

$$\begin{aligned} \mathbb {L}Z = \lambda \mathbb {D}Z \end{aligned}$$
(2.5)

If \(z^1, z^2, \ldots , z^d\) are the eigenvectors corresponding to the d smallest positive eigenvalues of Eq. (2.5), then the first f rows of \(Z=[z^1\, z^2\, \ldots \, z^d]\) correspond to the coordinates of prototypes and the following n rows determine the embedding of the original samples.
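The following short Python sketch (illustrative only, assuming the weight matrices W and C have already been built as described above) shows how the augmented matrices of Eqs. (2.3)–(2.5) can be assembled and the generalized eigenproblem solved:

```python
import numpy as np
from scipy.linalg import eigh

def ccdr_embedding(W, C, beta=1.0, d=2):
    """Sketch of CCDR (Eqs. 2.3-2.5): augment prototypes and samples into one graph.
    W: (n, n) sample-sample weights, C: (f, n) class-sample weights."""
    f, n = C.shape
    WW = np.block([[np.eye(f), C],
                   [C.T, 2 * beta * W]])     # (f+n) x (f+n) augmented weight matrix
    DD = np.diag(WW.sum(axis=1))
    LL = DD - WW                             # augmented Laplacian
    vals, vecs = eigh(LL, DD)
    Z = vecs[:, 1:d + 1]                     # d smallest positive eigenvalues
    return Z[:f], Z[f:]                      # prototype coordinates, sample embedding
```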

2.3 Semi-supervised Laplacian Eigenmaps (SSLE)

Kim et al. proposed a semi-supervised Laplacian Eigenmaps algorithm, called SSLE, which is suitable for sentiment analysis [18]. Execution of the SSLE algorithm rests on two premises: one is that label information about the similarity among samples is available, and the other is that the similarity between homogeneous samples should be larger than that between heterogeneous samples.

For a dataset X, let \(X_c=\{x_1, x_2, \ldots , x_s\}\) be a subset of X, in which each sample is labelled as a cluster among f clusters, namely \(x_i\in X_c\) is assigned a label \(l_i\in \{1, 2, \ldots , f\}\). Two k-nearest neighborhood graphs, \(G_u\) and \(G_l\), were built, where \(G_u\) was a k-nearest neighborhood graph without label information and the weights \(W_{ij}^u\) of edges were computed by using the Gaussian kernel function, while \(G_l\) was a k-nearest neighborhood graph with label information and the weights of edges were assessed as

$$\begin{aligned} W_{ij}^l=\left\{ \begin{array}{l l} 1, &{}\quad {\text {if}}~x_i\in N_{l+}(x_j)~{\text{or}}~x_j\in N_{l+}(x_i)\\ 0, &{}\quad {\text {if}}~x_i\in N_{l-}(x_j)~{\text{or}}~x_j\in N_{l-}(x_i)\\ 0.5, &{}\quad {\text {if}}~x_i\in N_{l0}(x_j)~{\text{or}}~x_j\in N_{l0}(x_i)\\ 0, &{}\quad {\text {otherwise}}\end{array} \right. \end{aligned}$$

where \(N_{l+}(x_j)\) and \(N_{l-}(x_j)\) were the homogeneous neighborhood set and heterogeneous neighborhood set of a labelled sample \(x_j\), respectively. \(N_{l0}(x_j)\) was the neighborhood set of an unlabelled sample \(x_j\). In SSLE, the objective function was defined as follows

$$\begin{aligned} \Phi (Y)=(1-\mu )tr(Y^T L^u Y)+\mu tr(Y^T L^l Y)= tr(Y^T((1-\mu )L^u+\mu L^l)Y) \end{aligned}$$
(2.6)

where \(\{y_1, y_2, \ldots , y_n\}\) is the lower dimensional embedding of X and \(Y=(y_1^T\, y_2^T\, \ldots \, y_n^T)^T\), \(L^u=D^u-W^u\) and \(L^l=D^l-W^l\) are the Laplacian matrices of \(G_u\) and \(G_l\), respectively, and \(D^u\) and \(D^l\) are two diagonal matrices with \(D^u_{ii}=\sum _{j=1}^nW^u_{ij}\) and \(D^l_{ii}=\sum _{j=1}^nW^l_{ij}\).

Let \(W=(1-\mu )W^u+\mu W^l\), \(D=(1-\mu )D^u+\mu D^l\), and \(L=(1-\mu )L^u+\mu L^l\), then \(L=D-W\) and \(D_{ii} = \sum _{j=1}^nW_{ij}\). Meanwhile, the same constraint conditions \(Y^TDY=I\) and \(Y^TD\mathbf{1}=0\) were imposed. The lower dimensional embedding of X can be obtained by solving the generalized eigenvalue problem \(LY=\lambda DY\).

If \(0 \ne \lambda _1 \le \lambda _2 \le \cdots \le \lambda _d\) are the d smallest positive eigenvalues of the equation \(LY=\lambda DY\) and \(y^1, y^2, \ldots , y^d\) are the corresponding eigenvectors, then the lower dimensional embedding of sample \(x_i\) is \(y_i = (y^1(i), y^2(i), \ldots , y^d(i)), i=1, 2, \ldots , n\).
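Assuming the two weight matrices \(W^u\) and \(W^l\) have been constructed as above, the SSLE embedding reduces to the same eigenproblem with a blended graph; a minimal Python sketch (illustrative names only) is:

```python
import numpy as np
from scipy.linalg import eigh

def ssle_embedding(Wu, Wl, mu=0.5, d=2):
    """Sketch of SSLE (Eq. 2.6): blend the unsupervised and label-based graphs."""
    W = (1 - mu) * Wu + mu * Wl              # combined weight matrix
    D = np.diag(W.sum(axis=1))
    L = D - W
    vals, vecs = eigh(L, D)                  # generalized eigenproblem L Y = lambda D Y
    return vecs[:, 1:d + 1]                  # rows are the embeddings y_i
```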

2.4 Fundamentals of fuzzy sets and rough sets

Both fuzzy sets and rough sets are fundamental components of granular computing theory for uncertain information analysis and processing in the fields of decision analysis and artificial intelligence. In this subsection, we review some basic notions related to the granular computing framework that are indispensable in our proposal.

Let X be a universe of discourse (the dataset mentioned above). A map \(\mu _F\) from X to [0, 1] models a fuzzy concept F on X [50]. For any \(x\in X\), \(\mu _F(x)\), or F(x) for brevity, denotes the membership degree of x belonging to the fuzzy concept F.

A fuzzy relation R on X is a fuzzy set on \(X\times X\). It is said to be reflexive if \(\mu _R(x, x)=1\) for all \(x\in X\) and symmetric when \(\mu _R(x, y)=\mu _R(y, x)\) for any \(x, y\in X\). A fuzzy relation is called a fuzzy similarity relation if it is reflexive and symmetric. A fuzzy similarity relation characterizes the similarity degrees between objects. A fuzzy similarity relation R on X is associated with a group of fuzzy sets (fuzzy similarity classes) \(\{[x]_R\mid x\in X\}\), reflecting the topological and granular structures of X, where \(\mu _{[x]_R}(y)=\mu _R(x, y)\) for any \(y\in X\).

There exist many ways to determine a fuzzy similarity relation describing the associations between objects in a dataset, such as the correlation coefficient method, the similarity coefficient method, and distance-based methods [50].

The concept of rough sets, introduced by Pawlak [27], differs from that of fuzzy sets in that it interprets and handles objects by using an indiscernibility relation.

For a dataset X with its attribute (feature) set A, the pair (X, A) is called an information system. For any \(B\subseteq A\), \(R_B=\{(x, y)\in X^2 \mid \forall b\in B (x(b)=y(b)) \}\) is an indiscernibility relation on X, which is a crisp equivalence relation partitioning X into a family of disjoint subsets \(X/R_B\), called the quotient set of X with respect to \(R_B\) or B, where x(b) denotes the value of x on b. For \(Y\subseteq X\), the two sets

$$\begin{aligned} \underline{R_B}(Y)=\cup \{Z\in X/R_B\mid Z\subseteq Y\},\quad \overline{R_B}(Y)=\cup \{Z\in X/R_B\mid Z\cap Y\ne \emptyset \} \end{aligned}$$

are referred to as the lower approximation and upper approximation of Y with respect to \(R_B\) or B, respectively.

The lower approximation of a set consists of the granules of indiscernible objects totally contained in the set, whereas its upper approximation is composed of the granules of indiscernible objects at least partly contained in the set. The pair of approximation sets can be used to discern and analyze such a set, and it forms the core of rough set theory for knowledge representation and discovery.
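As a small illustration (the function and variable names are hypothetical), the lower and upper approximations can be computed by grouping objects into indiscernibility classes and testing inclusion or overlap with the target set:

```python
def pawlak_approximations(X, B, Y):
    """Sketch of Pawlak lower/upper approximations.
    X: list of objects (dicts attribute -> value), B: list of attributes,
    Y: set of object indices to be approximated."""
    classes = {}
    for i, x in enumerate(X):                       # indiscernibility classes under B
        classes.setdefault(tuple(x[b] for b in B), set()).add(i)
    lower, upper = set(), set()
    for Z in classes.values():
        if Z <= Y:
            lower |= Z                              # class fully contained in Y
        if Z & Y:
            upper |= Z                              # class overlapping Y
    return lower, upper
```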

When the class information of X is known and D is the class label feature, called the decision feature of X, the information system \((X, A\cup D)\) is referred to as a decision information system. For any \(B\subseteq A\),

$$\begin{aligned} Pos_B(D)=\cup _{Y\in X/R_D}\underline{R_B}(Y) \end{aligned}$$

is said to be the positive domain of D with respect to B, where \(R_B\) and \(R_D\) are indiscernibility relations derived from B and D, respectively. The dependency of D to B can be described by

$$\begin{aligned} \gamma _B(D)=|Pos_B(D)|/|X| \end{aligned}$$

where |Y| denotes the cardinality of Y. For any \(a\in B\), the measure

$$\begin{aligned} sig(a)=\gamma _B(D)-\gamma _{B{\setminus } \{a\}}(D) \end{aligned}$$

can be used to describe the significance of feature a in X with respect to B [11].
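A minimal sketch of these two measures, assuming objects are given as attribute dictionaries and class labels as a parallel list (hypothetical helper names), is:

```python
def dependency_and_significance(X, B, labels):
    """Sketch of gamma_B(D) and sig(a) for a decision information system."""
    def positive_size(attrs):
        classes = {}
        for i, x in enumerate(X):                   # indiscernibility classes under attrs
            classes.setdefault(tuple(x[b] for b in attrs), []).append(i)
        # a class lies in the positive region iff all its decision labels agree
        return sum(len(idx) for idx in classes.values()
                   if len({labels[i] for i in idx}) == 1)
    gamma_B = positive_size(B) / len(X)
    sig = {a: gamma_B - positive_size([b for b in B if b != a]) / len(X) for a in B}
    return gamma_B, sig
```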

In the case of an information system having continuous-valued attributes, on the one hand, some discretization methods [42] have been developed to process such information systems. On the other hand, the classical indiscernibility relation has been extended to more general mathematical notions, and various generalized rough set models have been constructed, such as a general relation based rough set [36], a similarity or tolerance relation based rough set [15, 33], a dominance relation based rough set [12, 34], a covering rough set [53], and a neighborhood rough set [44, 49]. Herein the neighborhood rough set model is outlined; for the others, the reader is referred to the literature.

Let X be a finite universe. For each \(x\in X\), we associate with it a subset \(n(x)\subseteq X\), called the neighborhood of x. A neighborhood system \(N(X)=\{n(x)\mid x\in X\}\) of X is the family of neighborhoods associated with all \(x\in X\). The pair (X, N(X)) is referred to as a neighborhood approximation space. In the neighborhood approximation space (X, N(X)), several models of neighborhood rough sets have been proposed [44, 49], among which the following two sets of formulae

$$\begin{aligned} \underline{N_1}(Y)=\{x\in X\mid n(x)\subseteq Y\},\quad \overline{N_1}(Y)=\{x\in X\mid n(x)\cap Y\ne \emptyset \} \end{aligned}$$

and

$$\begin{aligned} \underline{N_2}(Y)=\cup \{n(x)\in N(X)\mid n(x)\subseteq Y, x\in X\},\quad \overline{N_2}(Y)=\cup \{n(x)\in N(X)\mid n(x)\cap Y\ne \emptyset , x\in X\} \end{aligned}$$

are two typical definitions of neighborhood rough lower approximations and neighborhood rough upper approximations. Each pair provides a different characterization of objects and has its own specific properties.

Meanwhile, the connection between fuzzy sets and rough sets has been extensively investigated and several versions of fuzzy rough set models [9, 10, 30] have been worked out, in which

$$\begin{aligned} \mu _{\underline{R}(F)}(x)=\inf _{y\in X}I(\mu _R(x, y), \mu _F(y)), \mu _{\overline{R}(F)}(x)=\sup _{y\in X}T(\mu _R(x, y), \mu _F(y)) \end{aligned}$$

are the typical definitions of generalized fuzzy rough lower approximation and upper approximation of a fuzzy set F with respect to a fuzzy relation R on X, \(x\in X\), where I is a kind of fuzzy implication and T is a t-norm.

When the fuzzy relation R reduces to a crisp relation on X, or even a neighborhood system \(N(X)=\{n(x)\mid x\in X\}\), the model of rough fuzzy set

$$\begin{aligned} \mu _{\underline{N}(F)}(x)=\inf _{y\in n(x)}\mu _F(y), \mu _{\overline{N}(F)}(x)=\sup _{y\in n(x)}\mu _F(y) \end{aligned}$$

is achieved.
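For a finite universe and a neighborhood system, this rough fuzzy model amounts to taking the minimum and maximum of the membership degrees over each neighborhood. A small Python sketch (illustrative names) reads:

```python
import numpy as np

def neighborhood_rough_fuzzy(mu_F, neighborhoods):
    """Sketch of the rough fuzzy set over a neighborhood system:
    lower(x) = min of mu_F over n(x), upper(x) = max of mu_F over n(x).
    mu_F: (n,) membership degrees of the fuzzy set F,
    neighborhoods: list of index arrays, one n(x) per object."""
    lower = np.array([mu_F[nx].min() for nx in neighborhoods])
    upper = np.array([mu_F[nx].max() for nx in neighborhoods])
    return lower, upper
```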

In the following section, the neighborhood rough fuzzy set model is introduced to assess the membership degree of an object belonging to a fuzzy association class in a neighborhood system. Such membership measures are combined with the similarity measures between objects to construct the weight matrix of the Laplacian neighborhood graph.

3 A rough fuzzy sets based semi-supervised Laplacian Eigenmaps

Both CCDR and SSLE are semi-supervised nonlinear dimensionality reduction methods. In CCDR, the weights of labelled samples with respect to their classes are all 1 and those of unlabelled samples are all 0, regardless of the membership degrees of the samples to their classes. In SSLE, two neighborhood graphs complement each other, but the weights between unlabelled neighbors are directly assigned as 0.5, which cannot exactly express the relationship between labelled and unlabelled samples. Such fixed weights disregard the influence of the distances and similarities between samples and their prototypes. Therefore, neither method can preserve the class structures of datasets well. In this section we introduce the ideas of fuzzy sets and rough fuzzy sets into the design of the weight matrices of the Laplacian neighborhood graph and the class-related neighborhood graph, and propose a novel semi-supervised LE method for dimensionality reduction.

3.1 Assessment of significance of features

In a high dimensional data space, data distributions are sparse. Each feature contributes differently to data clustering, and some features usually have a serious impact on the clustering effect. In this subsection, motivated by rough set theory [11], the significance of each feature is adaptively assessed by introducing an information entropy measure.

Given a continuous, discrete, or hybrid dataset \(X=\{x_1, x_2, \ldots , x_n\}\) with the feature (attribute) set \(A = \{a_1, a_2, \ldots , a_m\}\), we assume that there are s labelled samples in X and that \(l_1, l_2, \ldots , l_s\in \{1, 2, \ldots , f\}\) are their class labels. Because the ranges of different features vary in magnitude, the dataset needs to be normalized on all continuous-valued features before carrying out the lower dimensional embedding. The commonly used approaches to data normalization are as follows [28].

(1) The range normalization (min–max method)

    $$\begin{aligned} x_{ij}^*=\frac{x_{ij}-\min _k x_{kj}}{\max _k x_{kj}-\min _k x_{kj}},\quad i=1, 2, \ldots , n,\ j=1, 2, \ldots , m \end{aligned}$$

(2) The Z-score normalization (mean-deviation method)

    $$\begin{aligned} x_{ij}^*=\frac{x_{ij}-\mu _j}{\sigma _j} \end{aligned}$$

    where

    $$\begin{aligned} \mu _j=\frac{1}{n}\sum \limits _{i=1}^n x_{ij},\quad \sigma _j=\sqrt{\frac{1}{n}\sum \limits _{i=1}^n (x_{ij}-\mu _j)^2} \end{aligned}$$
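Both normalizations are straightforward to apply column-wise; a brief Python sketch (illustrative function name) is:

```python
import numpy as np

def normalize(X, method="minmax"):
    """Sketch of the two normalizations above for a continuous data matrix X of shape (n, m)."""
    X = np.asarray(X, dtype=float)
    if method == "minmax":                               # range normalization
        return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    return (X - X.mean(axis=0)) / X.std(axis=0)          # Z-score normalization
```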

After data normalization, we establish a semi-supervised fuzzy similarity matrix for the given dataset. For convenience, the normalized dataset is also denoted by X. The similarity degree between samples with the same class label is set to 1, the similarity between samples with different class labels is set to 0, and the similarity degrees between the remaining pairs are computed by a distance-based method. That is, for two samples, if one is labelled and the other is not, or if neither of them is labelled, then the similarity degree between them is evaluated by their distance. Mathematically, the semi-supervised fuzzy similarity matrix \(R_A\) of X is obtained by

$$\begin{aligned} \mu _{R_A}(x_i, x_j) =\left\{ \begin{array}{l l} 1, &{}\quad {\text {if}}~x_i~{\text {and}}~x_j~{\text {have~the~same~label}} \\ 1-\alpha \sqrt{\sum _{k=1}^m d_k^2(x_{ik}, x_{jk})}, &{}\quad {\text {if~at~least~one~of}}~x_i~{\text {and}}~x_j~{\text {is~not~labelled}}\\ 0, &{}\quad {\text {if}}~x_i~{\text {and}}~x_j~{\text{have~different~labels}}\end{array} \right. \end{aligned}$$
(3.1)

where \(\alpha\) is an appropriate positive number such that \(\mu _{R_A}(x_i, x_j)\in [0, 1]\) for all \(x_i, x_j\in X\), and \(d_k(x_{ik}, x_{jk})\) is the distance between the values of the kth feature of samples \(x_i\) and \(x_j\). If \(a_k\) is a continuous attribute, we define \(d_k(x_{ik}, x_{jk})=|x_{ik}-x_{jk}|\), and when \(a_k\) is a discrete or categorical attribute, we set

$$\begin{aligned} d_k(x_{ik}, x_{jk})=\left\{ \begin{array}{ll} 0, &{}\quad {\text {if}}~x_{ik}=x_{jk}\\ 1, &{}\quad {\text {if}}~x_{ik}\ne x_{jk}\end{array} \right. \end{aligned}$$

It is clear that \(R_A\) is a reflexive and symmetric relation, that is, a fuzzy similarity relation. The family of fuzzy similarity classes \(\{[x]_{R_A}\}\) characterizes the granular structures of X, where \([x]_{R_A}\) is a fuzzy set on X whose membership degree at \(y\in X\) is \(\mu _{R_A}(y, x)\); thus, \(\mu _{[x]_{R_A}}(y)\) or \(\mu _{R_A}(y, x)\) can be interpreted as the membership degree of sample y belonging to the fuzzy similarity class of sample x. Let

$$\begin{aligned} H(A) = - \sum \limits _{j = 1}^n {\frac{\sum \nolimits _{i = 1}^n {\mu _{R_A} (x_i, x_j)} }{\left| X \right| ^2}} \log \frac{\sum \nolimits _{i = 1}^n {\mu _{R_A}(x_i, x_j)} }{\left| X \right| ^2} \end{aligned}$$

then H(A) is referred to as the information entropy of the family of fuzzy information granules (similarity classes) \(\{[x]_{R_A}|x\in X\}\), reflecting the distribution or fluctuation of the membership degrees of the fuzzy information granules. Obviously, \(H(A)\in [0, \log n)\).

According to (3.1), it is evident that \(\sum \nolimits _{i=1}^n {\mu _{R_A}(x_i, x_j)}\le \sum \nolimits _{i=1}^n{\mu _{R_{A{\setminus }\{a\}}}(x_i, x_j)}\le |X|\) for any \(a\in A\) and \(x_j\in X\). Thus,

$$\begin{aligned} {\frac{\sum \nolimits _{i = 1}^n {\mu _{R_A}(x_i, x_j)} }{\left| X \right| ^2}}\le {\frac{\sum \nolimits _{i = 1}^n {\mu _{R_{A{\setminus }\{a\}}} (x_i, x_j)} }{\left| X \right| ^2}}\le \frac{1}{n} \end{aligned}$$

Therefore, when \(n>2\), \(H(A)\le H(A{\setminus }\{a\})\). If \(n\le 2\), this result is trivial. Hence, for any \(a\in A\), let

$$\begin{aligned} sig(a) = \frac{H(A{\setminus }\{a\})- H(A)}{\max _{a_i\in A} (H(A{\setminus }\{a_i\}) - H(A))} \end{aligned}$$
(3.2)

then \(sig(a)\in [0, 1]\), and sig(a) characterizes the significance of a in A with respect to preserving the same indiscernibility. According to (3.2), if \(sig(a_k)\) is smaller, then the fluctuation of the values of the kth attribute is smaller and, for any \(x_i\) and \(x_j\), the value of \(d_k(x_{ik}, x_{jk})\) is relatively smaller; thus it is more likely that \(x_i\) and \(x_j\) are in the same class. On the other hand, if \(sig(a_k)\) is bigger, the fluctuation of the values of the kth attribute is larger and the value of \(d_k(x_{ik}, x_{jk})\) is relatively bigger; thus the possibility of samples \(x_i\) and \(x_j\) belonging to different classes is higher, and therefore the attribute \(a_k\) has a stronger discernibility ability. Based on this fact, we modify the fuzzy similarity matrix (3.1) by introducing the significance of features into the computation of distances between samples and obtain the following weighted fuzzy similarity relation

$$\begin{aligned} \mu _{R_W}(x_i, x_j) =\left\{ \begin{array}{l l} 1, &{}\quad {\text {if}}~x_i~{\text {and}}~x_j~{\text {have~the~same~label}}\\ 1-\alpha \sqrt{\sum _{k=1}^m {S_k} d_k^2(x_{ik}, x_{jk})}, &{}\quad {\text {if~at~least~one~of}}~x_i~{\text {and}}~x_j~{\text {is~unlabelled}}\\ 0, &{}\quad {\text {if}}~x_i~{\text {and}}~x_j~{\text{have~different~labels}}\end{array} \right. \end{aligned}$$
(3.3)

where \(S_k=e^{sig(a_k)}\).

From (3.3) we know that if \(sig(a_k)\) is smaller, then \(\mu _{R_W}(x_i, x_j)\) is almost equal to \(\mu _{R_A}(x_i, x_j)\). If \(sig(a_k)\) is bigger, then \(\mu _{R_W}(x_i, x_j)\) will be smaller than \(\mu _{R_A}(x_i, x_j)\), assuming that there are no other variable factors. That is, if two samples belong to different classes with a high possibility, then we impose a bigger weight in the computation of the distance and derive a smaller similarity degree between the two samples. Therefore, if the similarity degree of two samples is small, then \(R_W\) can push the two samples farther apart when the samples are embedded in a lower dimensional space.
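To make the construction concrete, the following Python sketch assembles the per-feature distances, the semi-supervised fuzzy similarity of Eq. (3.1), the entropy-based significance of Eq. (3.2), and the weighted similarity of Eq. (3.3). It is a simplified, illustrative reading (the function names, the coding of unlabelled samples as -1, and the choice of \(\alpha\) are assumptions, not part of the original formulation):

```python
import numpy as np

def mixed_distances(X, continuous):
    """Per-feature distances d_k: |difference| for continuous features,
    0/1 mismatch for categorical ones.
    X: (n, m) array (object dtype allowed), continuous: boolean mask of length m.
    Returns an (m, n, n) array of per-feature distance matrices."""
    n, m = X.shape
    D = np.zeros((m, n, n))
    for k in range(m):
        col = X[:, k]
        if continuous[k]:
            c = col.astype(float)
            D[k] = np.abs(c[:, None] - c[None, :])
        else:
            D[k] = (col[:, None] != col[None, :]).astype(float)
    return D

def fuzzy_similarity(D, labels, alpha, weights=None):
    """Semi-supervised fuzzy similarity (Eq. 3.1, or Eq. 3.3 when weights are given).
    labels: array with -1 for unlabelled samples; weights: the S_k, default all 1."""
    m = D.shape[0]
    S = np.ones(m) if weights is None else weights
    dist = np.sqrt(np.tensordot(S, D ** 2, axes=1))        # weighted distance
    R = np.clip(1.0 - alpha * dist, 0.0, 1.0)
    both = (labels[:, None] >= 0) & (labels[None, :] >= 0)
    R[both & (labels[:, None] == labels[None, :])] = 1.0   # same label
    R[both & (labels[:, None] != labels[None, :])] = 0.0   # different labels
    return R

def feature_significance(D, labels, alpha):
    """Entropy-based significance sig(a_k) of each feature (Eq. 3.2)."""
    m = D.shape[0]
    def entropy(mask):
        R = fuzzy_similarity(D[mask], labels, alpha)
        p = R.sum(axis=0) / (R.shape[0] ** 2)
        return -(p * np.log(p)).sum()
    H_A = entropy(np.ones(m, dtype=bool))
    gains = np.array([entropy(np.arange(m) != k) - H_A for k in range(m)])
    return gains / gains.max() if gains.max() > 0 else np.zeros(m)
```

The weighted relation \(R_W\) of Eq. (3.3) is then obtained by calling fuzzy_similarity again with weights set to np.exp(sig), where sig is the vector returned by feature_significance.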

If two samples \(x_i\) and \(x_j\) have the same label, we assume that the membership degree of any sample \(x\in X\) belonging to the fuzzy similarity class of \(x_i\) is equal to that of x belonging to the fuzzy similarity class of \(x_j\). Under such an assumption, we let

$$\begin{aligned} \mu _{R_S}(x_i, x_j ) = \left\{ \begin{array}{l l} \mu _{R_W}(x_i, x_j ), &{}\quad {\text {if}}~x_j~{\text {is~not~labelled}}\\ {\max }_{y \in l(t)}\mu _{R_W}(x_i, y), &{}\quad {\text {if}}~x_j~{\text {is~labelled}}\\ \end{array} \right. \end{aligned}$$
(3.4)

where t is the class label of sample \(x_j\) and l(t) is the subset of X, whose elements have the class label t.

From (3.4) we know that the similarity between homogeneous samples is larger than that between heterogeneous samples. We call \(R_S\) a fuzzy association relation. It is obvious that \(R_S\) is no longer a fuzzy similarity relation. We refer to the fuzzy set \(X_j\), defined by \(\mu _{X_j}(x)=\mu _{R_S}(x, x_j)\), \(x\in X\), as the fuzzy association class of \(x_j\). Thus, \(\mu _{R_S}(x_i, x_j)\), denoting the membership degree of \(x_i\) belonging to \(X_j\), is not necessarily equal to the membership degree of \(x_j\) belonging to \(X_i\). That is, a sample may belong to some class with a certain possibility, while the samples in that cluster may belong to the class of the given sample with different degrees of possibility. Such interpretations are reasonable in clustering analysis.
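Assuming the weighted similarity matrix \(R_W\) and a label vector (with -1 marking unlabelled samples, an illustrative convention) are available, Eq. (3.4) can be sketched as:

```python
import numpy as np

def fuzzy_association(R_W, labels):
    """Sketch of the fuzzy association relation R_S (Eq. 3.4).
    Column j of the result is the fuzzy association class X_j of sample x_j."""
    R_S = R_W.copy()
    for j in np.where(labels >= 0)[0]:                 # labelled samples only
        same_class = np.where(labels == labels[j])[0]  # the set l(t) for t = labels[j]
        R_S[:, j] = R_W[:, same_class].max(axis=1)     # max similarity to any member of l(t)
    return R_S
```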

3.2 A semi-supervised rough fuzzy Laplacian Eigenmaps (SSRFLE)

For a hybrid dataset \(X=\{x_1, x_2, \ldots , x_n\}\) with its feature set \(A = \{a_1, a_2, \ldots , a_m\}\), we assume that s samples are labelled as \(l_1, l_2, \ldots , l_s\), where \(l_i\in \{1, 2, \ldots , f\}\) and f is the number of classes, \(f\le s\). For a given positive integer k, we construct a k-nearest neighborhood graph \(G^N\) of X, where the distance between samples \(x_i\) and \(x_j\) is computed using the following weighted distance

$$\begin{aligned} d_W(x_i, x_j) = \sqrt{\sum \limits _{k = 1}^m {S_k d_k^2\left( {x_{ik}, x_{jk} } \right) } } \end{aligned}$$
(3.5)

In order to determine the membership degree of a sample belonging to a certain class, we construct a neighborhood rough fuzzy set model for the fuzzy association classes \(\{[x]_{R_S}\mid x\in X\}\).

Let \(N_k(x)\) be the k-nearest neighborhood set of \(x\in X\) and \(N_k=\{N_k(x)| x\in X\}\) be the k-nearest neighborhood system on X. The pair \(K = (X, N_k)\) is referred to as a neighborhood approximation space. Let \(X_j\) denote the fuzzy association class of \(x_j\); it is indeed the jth column of \(R_S\). The rough fuzzy lower and upper approximation sets of \(X_j\) with respect to the neighborhood system \(N_k\) are two fuzzy sets on X, whose membership functions are defined, for \(x_i\in X\), by

$$\begin{aligned} \begin{aligned} \mu _{\underline{N}(X_j)}(x_i)=&\mathop {\wedge }\limits _{y\in N_k(x_i)} \mu _{X_j}(y)\\ \mu _{\overline{N}(X_j)}(x_i)=&\mathop {\vee }\limits _{y\in N_k(x_i)} \mu _{X_j}(y) \\ \end{aligned} \end{aligned}$$
(3.6)

Since the similarity degrees between homogeneous samples are larger than those between heterogeneous samples, according to (3.6) we know that \(\mu _{\underline{N}(X_j)}(x_i)\) is larger if the samples in the k-nearest neighborhood of \(x_i\) belong to the class of \(x_j\), whereas \(\mu _{\underline{N}(X_j)}(x_i)\) is smaller if the neighbors of \(x_i\) and \(x_j\) belong to heterogeneous classes. So \(\mu _{\underline{N}(X_j)}(x_i)\) can be used to assess the degree to which two samples \(x_i\) and \(x_j\) belong to the same class.

In order to ensure that homogeneous samples are mapped closer in the lower dimensional space, in the k-nearest neighborhood graph \(G^N\) of X the weight between two vertices should be large when both vertices lie in homogeneous neighborhoods, and small if they lie in heterogeneous neighborhoods. Motivated by this idea, we set the weights (similarity degrees) between neighboring samples with the same class label to the largest value 1 and the similarity between samples with different labels to the smallest value 0. The similarity between samples that are not included in each other's neighborhoods is also set to the smallest value 0, and the similarity degree between samples of which only one is labelled or neither is labelled is evaluated by combining the weighted Gaussian kernel distance between the samples with the membership degrees of both samples belonging to the same class. As a result, we establish the weight measure between vertices \(x_i\) and \(x_j\) as

$$\begin{aligned} W_{ij}^N =\left\{ \begin{array}{l l} 1, &{}\quad {\text {if}}~x_i~{\text{and}}~x_j~{\text {have~the~same~label}}\\ &{}\quad {\text {and}}~x_i\in N_k(x_j)~{\text {or}}~x_j\in N_k(x_i)\\ \mu _{\underline{N}(X_i)}(x_j)e^{-d_W(x_i, x_j)^2/\sigma }, &{}\quad {\text {if~at~least~one~of}}~x_i~{\text {and}}~x_j ~{\text {is~not~labelled}} \\ &{}\quad {\text {and}}~x_i\in N_k (x_j )~{\text {or}}~x_j \in N_k (x_i )\\ 0, &{}\quad {\text {otherwise}}\end{array} \right. \end{aligned}$$
(3.7)

where \(\sigma\) is a scale factor adjusting the Gaussian kernel function, usually determined by the average similarity degree between samples. The weight matrix of \(G^N\) is denoted by \(W^N=(W_{ij}^N)_{n\times n}\).
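A direct, illustrative Python sketch of Eqs. (3.5)–(3.7) (assuming the weighted distance matrix, the fuzzy association matrix \(R_S\), and a label vector with -1 for unlabelled samples are given; the function name is hypothetical) is:

```python
import numpy as np

def laplacian_graph_weights(dW, R_S, labels, k, sigma):
    """Sketch of the Laplacian neighborhood graph weights W^N (Eq. 3.7).
    dW: (n, n) weighted distances (Eq. 3.5), R_S: fuzzy association matrix (Eq. 3.4)."""
    n = dW.shape[0]
    nbrs = [set(np.argsort(dW[i])[1:k + 1]) for i in range(n)]   # k-nearest neighborhoods
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j or (j not in nbrs[i] and i not in nbrs[j]):
                continue                                         # not neighbors: weight 0
            if labels[i] >= 0 and labels[i] == labels[j]:
                W[i, j] = 1.0                                    # homogeneous labelled neighbors
            elif labels[i] >= 0 and labels[j] >= 0:
                W[i, j] = 0.0                                    # heterogeneous labelled neighbors
            else:
                # rough fuzzy lower approximation of X_i evaluated at x_j (Eq. 3.6)
                lower = min(R_S[y, i] for y in nbrs[j])
                W[i, j] = lower * np.exp(-dW[i, j] ** 2 / sigma)
    return W
```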

Since the similarity degrees between each sample and the samples with the same label are unified in Eq. (3.4), the samples with the same label can be regarded as one sample, namely the prototype. If a sample \(x_j\) has a class label \(t\in \{1, 2, \ldots , f\}\), then \(\mu _{R_S}(x_i, x_j)\) can be interpreted as the membership degree of \(x_i\) belonging to the tth class. Accordingly, we build a weighted class-related neighborhood graph \(G^C\) to depict the relationship between samples and their prototypes, and the corresponding weight matrix is denoted by \(W^C=(W_{it}^C)_{n\times f}\), where the weight \(W_{it}^C\), defined by

$$\begin{aligned} W_{it}^C = \mu _{R_S}(x_i, y) \end{aligned}$$
(3.8)

represents the membership degree of \(x_i\) belonging to the tth class, where y is an arbitrary element of l(t), the set of samples labelled t.
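Correspondingly, the class-related weight matrix of Eq. (3.8) can be read directly from \(R_S\); a short sketch (class labels assumed to be coded 0, ..., f-1 and -1 for unlabelled samples) is:

```python
import numpy as np

def class_related_weights(R_S, labels, f):
    """Sketch of W^C (Eq. 3.8): membership degree of each sample in each of the f classes."""
    n = R_S.shape[0]
    WC = np.zeros((n, f))
    for t in range(f):
        y = np.where(labels == t)[0][0]      # an arbitrary labelled sample of class t (each class has one)
        WC[:, t] = R_S[:, y]                 # membership degrees of all samples in class t
    return WC
```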

Let the row vectors \(y_1, y_2, \ldots , y_n\in \mathbb {R}^d\) be the \(d\, (d<n)\) dimensional representations of the dataset \(X = \{x_1, x_2, \ldots , x_n \}\). In order to achieve the goal that homogeneous samples are mapped closer and more compactly around the prototypes in the lower dimensional space, the following optimization problem

$$\begin{aligned} \min \;(1-\gamma )\sum \limits _{i,j=1}^n {W_{ij}^N \left\| {y_i - y_j} \right\| ^2} +\gamma \sum \limits _{i = 1}^n {\sum \limits _{t = 1}^f {W_{it}^C \left\| {y_i - c_t } \right\| ^2} } \end{aligned}$$
(3.9)

is formulated, where \(c_1, c_2, \ldots , c_f\in \mathbb {R}^d\) are the prototypes of the dataset X in the lower dimensional space and \(\gamma\) is a parameter trading off the preservation of the local neighborhood structure against the maintenance of the clustering structure.

Let

$$\begin{aligned} Z =\left( \begin{array}{c} c_1 \\ \vdots \\ c_f \\ y_1 \\ \vdots \\ y_n \\ \end{array} \right) = \left( \begin{array}{cccc} c_{11} & c_{12} & \cdots & c_{1d} \\ \vdots & \vdots & & \vdots \\ c_{f1} & c_{f2} & \cdots & c_{fd} \\ y_{11} & y_{12} & \cdots & y_{1d} \\ \vdots & \vdots & & \vdots \\ y_{n1} & y_{n2} & \cdots & y_{nd} \\ \end{array} \right) =[z^1\, z^2\, \ldots \, z^d] \end{aligned}$$

then the optimization problem (3.9) can be translated to

$$\begin{aligned} \min tr(Z^T\mathbb {L}Z) \end{aligned}$$
(3.10)

where \(\mathbb {L}=\mathbb {D}-\mathbb {W}\) is the \((f+n)\times (f+n)\) Laplacian matrix, which is a symmetric and positive semidefinite matrix, \(\mathbb {D}\) is an \((f+n)\times (f+n)\) diagonal matrix with \(\mathbb {D}_{ii} = \sum \nolimits _{j=1}^{f+n} \mathbb {W}_{ij}\), and \(\mathbb {W}=\left( \begin{array}{cc} I & \gamma (W^C)^T \\ \gamma W^C & 2(1-\gamma )W^N \\ \end{array} \right)\).

In order to ensure that the optimization problem (3.10) has a unique solution, the two constraints \(Z^T\mathbb {D}Z = I\) and \(Z^T\mathbb {D}\mathbf{1} = 0\) are imposed to remove the scaling and translation factors in the lower dimensional embedding. By the Lagrange multiplier method, the optimization problem (3.10) can be transformed into the following generalized eigenvalue problem

$$\begin{aligned} \mathbb {L}Z = \lambda \mathbb {D}Z \end{aligned}$$
(3.11)

The column vectors \(z^1, z^2, \ldots , z^d\), corresponding to the d smallest positive eigenvalues \(0 \ne \lambda _1 \le \lambda _2 \le \cdots \le \lambda _d\), are the solutions to Eq. (3.11) and hence to problem (3.9). The first f rows of the matrix \([z^1\, z^2\, \ldots \, z^d]\) correspond to the coordinates of the prototypes and the remaining n rows determine the d dimensional embedding of the original samples.
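Finally, given \(W^N\) and \(W^C\), the embedding of Eqs. (3.10)–(3.11) follows the same pattern as in Sect. 2; a minimal sketch (illustrative names, using SciPy's generalized symmetric eigensolver) is:

```python
import numpy as np
from scipy.linalg import eigh

def ssrfle_embedding(WN, WC, gamma=0.5, d=2):
    """Sketch of solving Eqs. (3.10)-(3.11) for the SSRFLE embedding.
    WN: (n, n) neighborhood graph weights (Eq. 3.7),
    WC: (n, f) class-related weights (Eq. 3.8)."""
    n, f = WC.shape
    WW = np.block([[np.eye(f), gamma * WC.T],
                   [gamma * WC, 2 * (1 - gamma) * WN]])   # augmented weight matrix
    DD = np.diag(WW.sum(axis=1))
    LL = DD - WW                                          # augmented Laplacian
    vals, vecs = eigh(LL, DD)
    Z = vecs[:, 1:d + 1]                                  # d smallest positive eigenvalues
    return Z[:f], Z[f:]                                   # prototypes c_t, embeddings y_i
```

The prototype coordinates occupy the first f rows of Z and the sample embeddings the remaining n rows, as stated above.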

4 Comparative experiments and analysis

Due to the diversity and complex nature of captured datasets, no single method is able to deal successfully with all situations. In this section, we compare the proposed SSRFLE with the classical (unsupervised) LE, DVE, and the state-of-the-art semi-supervised CCDR and SSLE for dimensionality reduction in terms of classification performance and data visualization through several experiments on benchmark datasets. All of the experiments are implemented on the platform of Matlab 7.0.

We take five benchmark datasets from the UCI Machine Learning Repository [23] for our experiments. Three of them, Wine, Seeds, and Wisconsin Diagnostic Breast Cancer (WDBC), are continuous datasets. Statlog Heart (Heart) is a hybrid dataset, while Mushroom is a categorical dataset. Because the Mushroom dataset has missing values (leaving 5643 complete samples) and is classified into two classes, we randomly select \(30\%\) of the complete samples in each class as the test samples in the follow-up experiments. The details of the five datasets are listed in Table 1.

Table 1 The details of five datasets used in the experiments

4.1 Performance analysis

Since the values of the continuous features in the first four datasets have different scales, all of these datasets are preprocessed by the normalization methods listed in Sect. 3.1. We compare the proposed SSRFLE with the classical LE and DVE and with the state-of-the-art CCDR and SSLE. For consistency, we set the parameters \(\beta =1\) for CCDR, \(\mu =0.5\) for SSLE, and \(\gamma =0.5\) for SSRFLE.

The first two methods (LE and DVE) are unsupervised, while the last two and the proposal are semi-supervised. Thus we randomly label a percentage of the samples in each class of every dataset as the semi-supervised information in the simulation experiments. Since one of the biggest advantages of the proposed SSRFLE is the preservation of the class characterization of the original data, the classification (clustering) accuracy is used as one of the criteria to test the performance of these techniques. With the proposed SSRFLE, the lower dimensional representations of the original data as well as the prototypes of the lower dimensional embedding can be derived. Therefore, the commonly used fuzzy C-means (FCM) clustering method is introduced to cluster the dataset in the lower dimensional embedding. The fuzzification factor m in FCM is set to 2 in all experiments.

Although there are several approaches to estimating the dimensionality of a data manifold [7], the real intrinsic dimensionality of a dataset may not be correctly determined. In the following experiments, we let the embedded dimensionality d vary from 2 to 4 instead of estimating it. For every fixed d, the size of the neighborhood of each sample varies from 4 to 14. In each semi-supervised method, \(5\%\) of the samples in each class of every dataset are randomly labelled. For every set of such parameters, tenfold cross-validation is carried out and the average classification accuracy is used to evaluate the listed methods for dimensionality reduction. Tables 2, 3 and 4 show the average classification accuracies when d is set to 2, 3 and 4, respectively. The bold number in each column (each dataset) of every table indicates the highest classification accuracy among the tested methods.

Table 2 The average classification accuracies when \(d=2\)
Table 3 The average classification accuracies when \(d=3\)
Table 4 The average classification accuracies when \(d=4\)

From these tables we know that the semi-supervised dimensionality reduction methods, CCDR, SSLE, and the proposed SSRFLE, are all superior to the classical LE in classification performance for the five datasets and for the three chosen dimensionalities. These results are consistent with our intuition that semi-supervised methods outperform unsupervised ones. The unsupervised DVE can yield better classification accuracies than LE and SSLE in most cases. However, its time cost is rather large since it is a global method of unfolding the data manifold by maximizing the global variance. Nevertheless, the experimental results show that the proposed SSRFLE yields the highest average classification accuracies among the tested methods for each of the five datasets when the embedded dimensionality is taken as 2 or 3. When \(d=4\), although SSRFLE does not produce the best classification accuracies for all of the five datasets, it achieves the best classification accuracies for four out of the five datasets, and the remaining one (for the Seeds dataset) is close to the best accuracy obtained by DVE. These facts indicate that different embedded dimensionalities have slightly different impacts on the classification accuracy for a given dataset. One may choose a suitable embedded dimensionality empirically by experiments in practical applications.

As for the fact that each of the aforementioned methods produces different classification accuracies on different datasets, the main reasons arise from the differences in acquisition, mechanism, and topological structure of the datasets. It is these differences between datasets that can be used to verify the performance of an algorithm.

To further test the semi-supervised performance of the proposed SSRFLE, we compare it with the semi-supervised CCDR and SSLE on the five datasets mentioned above with the same parameter settings, except that the rates of labelled samples in each class of every dataset are set to 5, 15 and \(30\%\), respectively. The embedded dimensionality is set to 2. Figure 1 shows the experimental results.

Fig. 1 The influence of rate of labelled samples on the classification accuracy (for interpretation of the references to color in the text, the reader is referred to the web version of this article)

In each subfigure of Fig. 1, the blue, green, and brown bars in each group represent the classification accuracies in the cases of 5, 15 and \(30\%\) labelled samples, respectively. The results show that, for each of the five datasets, the classification accuracy of each method increases as the rate of labelled samples increases. These facts are consistent with our intuition that more supervised information yields higher classification accuracy. Furthermore, the proposed method achieves the highest classification accuracy among the three implemented methods for each dataset and for each rate of labelled samples under the same experimental setting.

4.2 Data visualization

Data visualization is an important research issue in the field of machine learning. In particular, it enables one to perceive and inspect high dimensional data or complicated phenomena in an intuitive and visual way. An effective dimensionality reduction and visualization method provides a vivid and convincing demonstration of high dimensional data. In this subsection, we test the visualization of the Wine dataset using LE, DVE, SSLE, CCDR, and the proposed SSRFLE method. This dataset has 178 samples that fall into 3 classes. We randomly label \(5\%\) of the samples in each class. Figure 2 shows the visualized results obtained by these methods.

Fig. 2 The results of visualization obtained by LE, DVE, SSLE, CCDR, and SSRFLE (for interpretation of the references to color in the text, the reader is referred to the web version of this article)

In Fig. 2, the three classes of samples are depicted by different colors. From Fig. 2 we see that the first two algorithms do not separate the classes completely. Although SSLE and CCDR are semi-supervised and the label information of the dataset has been incorporated, the similarity between homogeneous samples and the dissimilarity between heterogeneous samples have not been taken into account; homogeneous samples are scattered and do not form a compact embedding. The last subfigure in Fig. 2 plots the visualization of this dataset obtained by the proposed method. It is evident that homogeneous samples have compact distributions and heterogeneous samples can be better distinguished.

4.3 Impact of feature weights on classification accuracy

It is natural that different features have distinct impacts on the clustering structures of a dataset. In the proposed SSRFLE, an adaptive weight is assigned to each feature based on information entropy in the computation of the similarity, or alternatively the distance, between two samples. In order to verify the rationality of using the weighted distance in SSRFLE, we repeat the comparative experiments above with and without feature weighting (in the case without weighting, we set \(S_k=1\), or equivalently \(sig(a_k)=0\), for all \(k=1, 2, \ldots , m\) in (3.3) and (3.5), and denote this method as 'SSRFLE without weighting'). In these experiments, we set \(d=2\) and let the size of the neighborhood vary from 4 to 14. We also randomly label \(5\%\) of the samples of each class in each dataset. Figure 3 shows the experimental results for the five datasets.

Fig. 3 The effects of with and without weighting in SSRFLE for five datasets (for interpretation of the references to color in the text, the reader is referred to the web version of this article)

From Fig. 3 one sees that, on the whole, the classification accuracies of SSRFLE are higher than those of SSRFLE without weighting for the first four datasets, whereas the results fluctuate for the Mushroom dataset. The reason may be that Mushroom is a discrete dataset and the weights of the features are computed according to the frequencies of the feature values that appear; features with different significance may have similar frequency distributions of feature values. On the whole, we are convinced that the weighted distance can improve the effectiveness of dimensionality reduction and greatly increase the classification accuracy, especially for continuous and hybrid datasets.

4.4 Parameters selection

In the proposed method, there are two parameters to be assigned before the experiments, the size of the neighborhood and the embedded dimensionality. The choice strategy for the latter has been discussed in Sect. 4.1. In this subsection, we discuss the influence of the size of the neighborhood on the classification accuracy by comparing the five methods on the five datasets in terms of dimensionality reduction and clustering performance.

We label \(5\%\) of the samples in each class of every dataset. The embedded dimensionality d is taken as 2, 3, and 4, and the size of the neighborhood k varies from 4 to 14. Tenfold cross-validation is executed for each set of parameters and the average classification accuracy is recorded. Figure 4 shows the experimental results.

Fig. 4 The influences of parameters k and d on classification accuracies for five datasets (for interpretation of the references to color in the text, the reader is referred to the web version of this article)

In Fig. 4, each of the three rows displays the experimental results for a different embedded dimensionality, \(d=2\), 3, and 4, respectively. The five subfigures in each row show the relationship between the classification accuracy and the size of the neighborhood for the five datasets, namely Wine, Seeds, WDBC, Heart, and Mushroom, respectively. Every subfigure plots five groups of classification accuracies derived from the aforementioned five methods as the size of the neighborhood varies from 4 to 14. It is shown that the proposed method produces the highest classification accuracies among all five methods for all datasets, for all neighborhood sizes, and for most of the given embedded dimensionalities. It also illustrates that the classification accuracy of the proposed SSRFLE is not very sensitive to the choice of the size of the neighborhood. The essential reason may be the intervention of the adaptive weighted distance and the rough fuzzy approximation characterization of sample neighborhoods.

5 Conclusions and future work

In this work, a semi-supervised rough fuzzy Laplacian Eigenmaps (SSRFLE) method is proposed for nonlinear dimensionality reduction of hybrid data. In this method, a semi-supervised fuzzy similarity relation is introduced, and the weights of the features of a dataset are adaptively assessed by designing an information entropy measure based on this fuzzy similarity relation. A new fuzzy association relation is derived from the weighted distances between samples together with the label information of the dataset. The rough fuzzy lower approximations of the fuzzy association classes related to the fuzzy association relation, together with the Gaussian kernel weighted distances between samples, are used to characterize the similarities between vertices of the Laplacian neighborhood graph. At the same time, the fuzzy association classes are considered as descriptions of the similarities between samples and their prototypes, which produce the weighted class-related neighborhood graph. The combination of both neighborhood graphs ensures that homogeneous samples are embedded closer and more compactly around the prototypes in the lower dimensional space of a dataset.

A series of comparative experiments on real world hybrid datasets is implemented. The experimental results show that the proposed method outperforms the state-of-the-art methods in terms of classification accuracy and data visualization owing to its superior performance in preserving class structures when a dataset is embedded into a lower dimensional space.

Although the proposed method is not very sensitive to the choice of the size of the neighborhood, we will further study theoretically grounded approaches to the choice of this parameter in the future. Theoretical analysis and applications of the related approaches to incremental and dynamic data are also under consideration.