
1 Introduction

With the continuous growth in the scale of datasets in recent years, a given learning or classification task may involve a large number of features, many of which are irrelevant or redundant. Such features lead to high computational complexity, weak generalization ability and poor interpretability. Feature selection is an effective technique to alleviate these problems. It retains highly relevant features and removes redundant and irrelevant ones to find an optimal feature subset, thereby improving the performance of models [1]. It has therefore become one of the important preprocessing steps in machine learning, data mining, pattern recognition, etc. [2].

Rough set theory [3] provides an effective method for modeling vague, uncertain, or imprecise data. It uses a pair of exact sets (the upper and lower approximations) to describe the uncertainty within a dataset. Attribute reduction methods in this theory remove redundant features while maintaining the relevance of the feature subset to the decision classes, which coincides with the purpose of feature selection [4]. Based on rough set theory, the correlation information within the data is found without the need for supplementary information, and the number of attributes in the dataset is reduced, which realizes rough set-based feature selection. Presently, there are many extensions of rough set theory, such as probabilistic rough sets [5], neighborhood rough sets [6], and fuzzy rough sets [7, 8].

To deal with the information loss caused by discretizing data, Dubois and Prade defined fuzzy rough sets [7, 8] by introducing fuzzy membership functions and extending the membership of elements to [0, 1]. This provides a high degree of flexibility when dealing with continuous data in fields such as medicine, industry and finance, and can effectively model the ambiguity and uncertainty in the data. While preserving the advantages of rough set-based feature selection for processing high-dimensional data, fuzzy rough set-based feature selection is realized by fuzzily partitioning each feature with fuzzy set theory [9]. This method can effectively handle noisy discrete or continuous data without requiring extra information.

Real-world data are often linearly inseparable, that is, no separating hyperplane can correctly classify the training samples in the original space. Kernel methods address this problem by mapping the samples from the original space to a higher-dimensional feature space in which they become linearly separable; for a finite-dimensional sample space, there always exists a higher-dimensional feature space in which the mapped samples are linearly separable. Hu integrated kernel functions with fuzzy rough sets and proposed the model of kernelized fuzzy rough sets, which forms a bridge between kernel machines and rough set-based data analysis [10]. Generalized feature evaluation functions and attribute reduction algorithms based on this model were presented, and the effectiveness of the technique was validated.

The three-way decision theory [11] extends traditional two-way decision theory and is a decision-making method that conforms to human thinking. In a two-way decision, an object can only be accepted or rejected. In practice, however, people often defer judgment: they decide immediately only on objects they are confident to accept or reject, and postpone decisions when the information is uncertain or incomplete. The three-way decision divides objects into three regions (the positive, negative and boundary regions) according to a decision function and region thresholds, and then constructs the corresponding three-way decision rules [12].

In this paper, we introduce a feature selection method based on kernelized fuzzy rough sets and three-way decision. When constructing feature subsets, a key issue is how to maintain maximum relevance to the decision class while minimizing redundancy within the subset. The three-way strategy we employ constructs three differentiated feature subsets and expands them from different perspectives. The feature subsets constructed by this strategy tend to be smaller than those constructed by traditional methods. Dependency is an important metric in rough set theory for measuring the relevance of features to the decision classes, so the dependency gained from a new feature serves as the criterion in our subset expansion strategy. Finally, we adopt the idea of ensemble learning to construct a co-classification model based on the multiple feature subsets.

The rest of the paper is organized as follows. Section 2 presents the notions and properties of the fuzzy rough set model and feature selection. Section 3 shows the three-way attribute reduction algorithm. Experimental analysis is given in Sect. 4. Conclusions come in Sect. 5.

2 Preliminaries

In this section, we will first give some basic definitions, and then review the related work of rough sets, fuzzy rough sets, and kernelized fuzzy rough sets.

2.1 The Notations

Let \(I=(\mathbb {U}, \mathbb {A})\) be an information system, where \(\mathbb {U} = \{x_1,x_2,...,x_n\}\) is a nonempty set of finite objects called the universe of discourse and \(\mathbb {A}\) is a nonempty finite set of attributes \(a: \mathbb {U} \rightarrow V_{a}\) for every \(a \in \mathbb {A}\). For decision systems, \(\mathbb {A}=(\mathbb {C}, \mathbb {D})\), where \(\mathbb {C}\) is the set of input features and \(\mathbb {D}\) is the set of output features. Additionally, \(a(x)\), with \(a\in \mathbb {C}\) and \(x\in \mathbb {U}\), represents the value of the object x under the attribute a.

2.2 Rough Sets

With any \(P \subseteq \mathbb {A}\) there is an associated equivalence relation IND(P):

$$\begin{aligned} I N D(P)=\left\{ (x, y) \in \mathbb {U}^{2} \mid \forall a \in P, a(x)=a(y)\right\} \end{aligned}$$
(1)

The equivalence relation IND(P) is reflexive, symmetric and transitive. The family of all equivalence classes of IND(P) is denoted by \(\mathbb {U}/IND(P)\), or \(\mathbb {U}/P\) for short, and is simply the set of equivalence classes generated by IND(P):

$$\begin{aligned} \mathbb {U} / I N D(P)=\otimes \{\mathbb {U} / I N D(\{a\}) \mid a \in P\} \end{aligned}$$
(2)

where

$$\begin{aligned} A \otimes B=\{X \cap Y \mid X \in A, Y \in B, X \cap Y \ne \emptyset \} \end{aligned}$$
(3)

The equivalence classes of the indiscernibility relation with respect to P are denoted \([x]_{P}, x \in \mathbb {U}\). Let \(X \subseteq \mathbb {U}\); X can be approximated using only the information contained within P by constructing the P-lower and P-upper approximations of the classical crisp set X:

$$\begin{aligned} \underline{P} X=\left\{ x \mid [x]_{P} \subseteq X\right\} \end{aligned}$$
(4)
$$\begin{aligned} \bar{P} X=\left\{ x \mid [x]_{P} \cap X \ne \emptyset \right\} \end{aligned}$$
(5)

Let P and Q be subsets of the condition attributes and decision attributes, respectively. According to the upper and lower approximations, the positive, negative, and boundary regions are defined as:

$$\begin{aligned} P O S_{P}(Q)=\bigcup _{X \in \mathbb {U} / Q} \underline{P} X \end{aligned}$$
(6)
$$\begin{aligned} N E G_{P}(Q)=\mathbb {U}-\bigcup _{X \in \mathbb {U} / Q} \bar{P} X \end{aligned}$$
(7)
$$\begin{aligned} B N D_{P}(Q)=\bigcup _{X \in \mathbb {U} / Q} \bar{P} X-\bigcup _{X \in \mathbb {U} / Q} \underline{P} X \end{aligned}$$
(8)

All objects in the positive region \(P O S_{P}(Q)\) certainly belong to some class \(X \in \mathbb {U}/Q\), all objects in the negative region \(N E G_{P}(Q)\) certainly do not belong to X, and objects in the boundary region \(B N D_{P}(Q)\) may belong to X. The attribute reduction model in rough sets requires that the positive region of the decision attributes remains unchanged.
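To make these definitions concrete, the following is a minimal Python sketch of Eqs. (1), (4) and (6) on a toy categorical decision table; the helper names (equivalence_classes, lower_approx, positive_region) and the toy data are our own illustration, not part of the paper.

```python
# Minimal sketch of IND(P), the P-lower approximation and POS_P(Q)
# on a toy categorical decision table (illustrative names and data).
from collections import defaultdict

def equivalence_classes(objects, attrs, value):
    """Blocks of U/IND(P): objects grouped by their values on attrs (Eq. 1)."""
    blocks = defaultdict(set)
    for x in objects:
        blocks[tuple(value(x, a) for a in attrs)].add(x)
    return list(blocks.values())

def lower_approx(objects, attrs, value, X):
    """P-lower approximation of X (Eq. 4): union of the blocks contained in X."""
    return {x for b in equivalence_classes(objects, attrs, value) if b <= X for x in b}

def positive_region(objects, cond_attrs, dec_attrs, value):
    """POS_P(Q) (Eq. 6): union of the lower approximations of the decision classes."""
    pos = set()
    for X in equivalence_classes(objects, dec_attrs, value):
        pos |= lower_approx(objects, cond_attrs, value, X)
    return pos

# Toy example: U = {0, 1, 2, 3}, one condition attribute 'a', decision 'd'.
vals = {(0, 'a'): 1, (1, 'a'): 1, (2, 'a'): 2, (3, 'a'): 2,
        (0, 'd'): 'p', (1, 'd'): 'p', (2, 'd'): 'p', (3, 'd'): 'n'}
value = lambda x, a: vals[(x, a)]
print(positive_region(range(4), ['a'], ['d'], value))   # -> {0, 1}
```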

If \(IND(P) = IND(P-\{a\})\), the attribute \(a\in P\) is dispensable in the feature set; otherwise it is indispensable. Attribute reduction aims to find a minimal subset P of the condition attribute set C that satisfies the following two conditions:

  1. \(POS_P(Q)=POS_C(Q)\);

  2. \(\forall a \in P,\; POS_{P-\{a\}}(Q) \ne POS_C(Q)\).

Then the subset P is a reduct of C.
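Continuing the sketch above, a hypothetical is_reduct helper (our own illustration) checks the two conditions directly, reusing positive_region from the previous code block.

```python
def is_reduct(objects, P, C, dec_attrs, value):
    """P is a reduct of C: it preserves POS_C(Q) (condition 1) and no attribute
    of P can be removed without changing the positive region (condition 2)."""
    full = positive_region(objects, C, dec_attrs, value)
    if positive_region(objects, P, dec_attrs, value) != full:
        return False
    return all(
        positive_region(objects, [b for b in P if b != a], dec_attrs, value) != full
        for a in P
    )
```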

2.3 Fuzzy Rough Sets

The membership of an object \(x \in \mathbb {U}\) to the fuzzy lower approximation of X can be defined by

$$\begin{aligned} \mu _{\underline{R_{P} X}}(x)=\inf _{y \in \mathbb {U}} I\left( \mu _{R_{P}}(x, y), \mu _{X}(y)\right) \end{aligned}$$
(9)

Here I is a fuzzy implicator and T is a t-norm. \(R_P\) is the fuzzy similarity relation induced by the subset of features P:

$$\begin{aligned} \mu _{R_{P}}(x, y)=\bigcap _{a \in P}\left\{ \mu _{R_{a}}(x, y)\right\} \end{aligned}$$
(10)

Many fuzzy similarity relations can be constructed to represent the similarity between objects x and y for feature a, such as

$$\begin{aligned} \mu _{R_{a}}(x, y)=1-\frac{|a(x)-a(y)|}{\left| a_{\max }-a_{\min }\right| } \end{aligned}$$
(11)
$$\begin{aligned} \mu _{R_{a}}(x, y)=\exp \left( -\frac{(a(x)-a(y))^{2}}{2 \sigma _{a}^{2}}\right) \end{aligned}$$
(12)

where \(\sigma ^2_a\) is the variance of feature a. The fuzzy positive region can be defined as

$$\begin{aligned} \mu _{{P O S}_{R_{P}}(Q)}(x)=\sup _{X \in \mathbb {U} / Q} \mu _{\underline{R_{P} X}}(x) \end{aligned}$$
(13)

Using the definition of the fuzzy positive region, the new dependency function can be defined as follows:

$$\begin{aligned} \gamma _{P}^{\prime }(Q)=\frac{\sum _{x \in \mathbb {U}} \mu _{P O S_{R_{P}}(Q)}(x)}{|\mathbb {U}|} \end{aligned}$$
(14)

A fuzzy-rough reduct R can be defined as a subset of features that preserves the dependency degree of the entire dataset, that is, \(\gamma _{R}^{\prime }(\mathbb {D})=\gamma _{\mathbb {C}}^{\prime }(\mathbb {D})\).
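To make the computation concrete, here is a minimal NumPy sketch of the dependency of Eq. (14) for a real-valued feature subset, assuming the Gaussian similarity of Eq. (12), the min operator for Eq. (10) and the Łukasiewicz implicator \(I(a,b)=\min (1, 1-a+b)\); these operator choices and the function name fuzzy_dependency are assumptions made for illustration, since the text leaves them open.

```python
# Sketch of Eqs. (9)-(14) under assumed operators (Gaussian similarity,
# min aggregation, Lukasiewicz implicator).
import numpy as np

def fuzzy_dependency(X, y):
    """X: (n, m) real feature matrix for subset P; y: (n,) crisp class labels."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n, m = X.shape
    sigma2 = X.var(axis=0) + 1e-12                        # sigma_a^2 per feature
    diff = X[:, None, :] - X[None, :, :]                  # pairwise differences (n, n, m)
    rel = np.exp(-diff ** 2 / (2 * sigma2)).min(axis=2)   # R_P via Eq. (12) and Eq. (10)
    pos = np.zeros(n)
    for c in np.unique(y):
        member = (y == c).astype(float)                   # mu_X(z) of one decision class
        imp = np.minimum(1.0, 1.0 - rel + member[None, :])  # I(R_P(x,z), X(z))
        lower = imp.min(axis=1)                           # Eq. (9): inf over z
        pos = np.maximum(pos, lower)                      # Eq. (13): sup over classes
    return pos.sum() / n                                  # Eq. (14)
```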

2.4 Kernelized Fuzzy Rough Set

Some widely encountered kernel functions satisfying reflexivity, symmetry, and transitivity are:

  1. Gaussian kernel: \(k_{G}(x, y)=\exp \left( -\frac{\Vert x-y\Vert ^{2}}{\delta }\right) \);

  2. Exponential kernel: \(k_{E}(x, y)=\exp \left( -\frac{\Vert x-y\Vert }{\delta }\right) \);

  3. Rational quadratic kernel: \(k_{R}(x, y)=1-\frac{\Vert x-y\Vert ^{2}}{\Vert x-y\Vert ^{2}+\delta }\).
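For reference, the three kernels above can be transcribed directly as follows; delta is the kernel width parameter (assumed positive), and the function names are ours.

```python
# Direct transcription of the three kernels; delta > 0 is the width parameter.
import numpy as np

def gaussian_kernel(x, y, delta):
    return np.exp(-np.linalg.norm(x - y) ** 2 / delta)

def exponential_kernel(x, y, delta):
    return np.exp(-np.linalg.norm(x - y) / delta)

def rational_quadratic_kernel(x, y, delta):
    d2 = np.linalg.norm(x - y) ** 2
    return 1.0 - d2 / (d2 + delta)
```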

Table 1. Selected t-norms and their duals (S-conorms)

With the kernel functions and the fuzzy operators in Table 1 and Table 2, we can substitute kernel functions for the fuzzy relations in fuzzy rough sets. The kernelized fuzzy lower and upper approximation operators are defined as:

  1. S-kernel fuzzy lower approximation operator: \(\underline{k_{S}} X(x)=\inf _{y \in U} S(N(k(x, y)), X(y))\);

  2. \(\theta \)-kernel fuzzy lower approximation operator: \(\underline{k_{\theta }} X(x)=\inf _{y \in U} \theta (k(x, y), X(y))\);

  3. T-kernel fuzzy upper approximation operator: \(\overline{k_{T}} X(x)=\sup _{y \in U} T(k(x, y), X(y))\);

  4. \(\sigma \)-kernel fuzzy upper approximation operator: \(\overline{k_{\sigma }} X(x)=\sup _{y \in U} \sigma (N(k(x, y)), X(y))\).

Let the classification task be formulated as \({<}U, A, D{>}\), where U is the nonempty and finite set of samples, A is the set of features characterizing the classification, and D is the class attribute which divides the samples into the subsets \(\{d_1,d_2,...,d_K\}\). For all \(x \in U\),

$$d_{i}(x)=\left\{ \begin{array}{l}0, x \notin d_{i} \\ 1, x \in d_{i}\end{array}\right. $$

For a given kernel function, the fuzzy lower and upper approximations of the decision classes can be computed as:

  1. \(\underline{k_{S}} d_{i}(x)=\inf _{y \notin d_{i}}(1-k(x, y))\);

  2. \(\underline{k_{\theta }} d_{i}(x)=\inf _{y \notin d_{i}}\left( \sqrt{1-k^{2}(x, y)}\right) \);

  3. \(\overline{k_{T}} d_{i}(x)=\sup _{y \in d_{i}} k(x, y)\);

  4. \(\overline{k_{\sigma }} d_{i}(x)=\sup _{y \in d_{i}}\left( 1-\sqrt{1-k^{2}(x, y)}\right) \).

The kernelized dependency function is defined as follows:

$$\begin{aligned} \gamma _{B}^{S}(D)=\frac{\left| \bigcup _{i=1}^{K} \underline{k_{S}} d_{i}\right| }{|U|} \text{ or } \gamma _{B}^{\theta }(D)=\frac{\left| \bigcup _{i=1}^{K} \underline{k_{\theta }} d_{i}\right| }{|U|} \end{aligned}$$
(15)

This coefficient of classification quality reflects the approximating ability of the approximation space, that is, the ability of the granulated space induced by the attribute subset B to characterize the decision.
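The following sketch computes \(\gamma _{B}^{S}(D)\) of Eq. (15) with the S-kernel lower approximation, assuming a Gaussian kernel, the standard negator \(N(a)=1-a\) and the usual fuzzy union (max) and fuzzy cardinality (sum of memberships); the function name kernelized_dependency and the handling of the single-class corner case are our own.

```python
# Sketch of Eq. (15) with the S-kernel lower approximation and a Gaussian kernel.
import numpy as np

def kernelized_dependency(X, y, delta=1.0):
    """gamma_B^S(D): fuzzy cardinality of the union of the S-kernel lower
    approximations of the decision classes, divided by |U|."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-d2 / delta)                          # Gaussian kernel matrix k(x, y)
    union = np.zeros(n)
    for c in np.unique(y):
        outside = (y != c)
        if not outside.any():                        # single-class corner case
            union[:] = 1.0
            continue
        # kS_lower d_i(x) = inf_{y not in d_i} (1 - k(x, y))
        lower = (1.0 - K[:, outside]).min(axis=1)
        union = np.maximum(union, lower)             # fuzzy union over the classes
    return union.sum() / n                           # fuzzy cardinality / |U|
```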

3 Kernelized Fuzzy Rough Set-Based Three-Way Decision Feature Selection

This section first expounds the problems of the heuristic feature selection strategy based on kernelized dependency, and then describes the feature selection method based on the idea of three-way decision.

3.1 Heuristic Feature Selection

Since finding the minimal subset is an NP-hard problem, heuristic search algorithms are generally used to obtain feature subsets. The maximal dependency (MD) strategy is designed in [13], and its heuristic feature evaluation function is

$$\begin{aligned} \max _{f \in C-S} \Psi (f, S, D) \end{aligned}$$
(16)

where \(\Psi (f, S, D)=\gamma _{S \cup \{f\}}(D)-\gamma _{S}(D)\) and \(\gamma \) denotes the kernelized dependency of Eq. (15); C is the initial feature set, S is the selected feature subset, D is the decision feature, and f is a candidate feature.

The purpose of feature selection is to obtain the subset with the fewest features while maintaining its descriptive ability. MD adopts a greedy strategy: at each step it adds the candidate feature that maximizes \(\Psi \), so that the dependency of the selected subset increases as quickly as possible. Such a search only guarantees a local optimum; the selected feature subset may be too large and redundant, and its quality is difficult to guarantee.
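A hedged sketch of the MD greedy search follows, reusing the kernelized_dependency helper from Sect. 2.4's sketch; the stopping rule based on a fixed dependency threshold is a simplification introduced here for illustration.

```python
# Greedy MD forward selection (Eq. 16): add the feature with the largest
# dependency gain until the threshold is reached or no feature helps.
def greedy_md_selection(X, y, threshold=0.99, delta=1.0):
    n_features = X.shape[1]
    selected, current = [], 0.0
    while len(selected) < n_features and current < threshold:
        best_f, best_gain = None, 0.0
        for f in range(n_features):
            if f in selected:
                continue
            gain = kernelized_dependency(X[:, selected + [f]], y, delta) - current
            if gain > best_gain:
                best_f, best_gain = f, gain
        if best_f is None:              # no remaining feature increases the dependency
            break
        selected.append(best_f)
        current += best_gain
    return selected
```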

3.2 Feature Selection Based on Three-Way Decision

In order to avoid the problems caused by the greedy strategy and make the feature subsets more concise and informative, this paper proposes a three-way decision-based feature selection strategy. In the three-way search, each layer generally maintains three feature subsets, and each of them generates its top three new feature subsets, giving 9 candidates in total. The top three candidates are then selected from these 9 subsets as the feature subsets of the next layer, under the constraint that they do not all originate from the same branch. Three-way feature selection eventually produces three improved feature subsets. The selection of features and generation of successors is formulated as follows:

$$\begin{aligned} \bigcup _{i=1}^{w} \max _{f_{i} \in C-S_{i}} \Psi \left( f_{i}, S_{i}, D\right) , \end{aligned}$$
(17)
$$\begin{aligned} \Psi \left( f_{i}, S_{i}, D\right) =\gamma _{S_{i} \cup \{f_{i}\}}(D)-\gamma _{S_{i}}(D), \end{aligned}$$
(18)

where C is the conditional feature set, w is the number of branches (three in this paper), i is the index of a branch, \(S_i\) is the feature subset selected by the ith branch, and \(f_i\) is a candidate feature of the ith branch.

Fig. 1. Three-way feature selection

The idea of three-way feature selection is shown in Fig. 1. The solid and dashed circle nodes in the figure represent feature subsets: a solid circle indicates that the feature subset will continue to expand, and a dashed circle indicates that it will not. Node G indicates that the feature subset has reached the stopping condition.

The three-way feature selection procedure is described in Algorithm 1.

Algorithm 1. Three-way feature selection

The algorithm starts from an empty set and selects the top three features by dependency to form three feature subsets of size 1. Next, it tests whether each current feature subset reaches the threshold; if so, the expansion of that subset is terminated and it is added to the output set. Otherwise, the algorithm continues to select the top three features by dependency gain to expand the subsets, until three feature subsets have been added to the output set. In order to maintain the diversity of the feature subsets, the algorithm requires that the subsets selected in each round do not all come from the same branch, and that already existing subsets cannot be selected again.
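The following Python sketch gives one possible reading of Algorithm 1; ranking the nine candidates by their absolute kernelized dependency and the exact handling of branches that reach the threshold are our assumptions where the description leaves details open. It reuses kernelized_dependency from the sketch in Sect. 2.4.

```python
# Sketch of three-way feature selection: w branches, top-w extensions per
# branch, successors must be distinct and not all from one parent branch.
def three_way_selection(X, y, threshold=0.95, delta=1.0, w=3):
    n_features = X.shape[1]

    def top_extensions(subset):
        """Rank the candidate extensions of one branch by kernelized dependency (Eqs. 17-18)."""
        cands = []
        for f in range(n_features):
            if f not in subset:
                dep = kernelized_dependency(X[:, sorted(subset | {f})], y, delta)
                cands.append((dep, subset | {f}))
        cands.sort(key=lambda c: c[0], reverse=True)
        return cands[:w]

    finished, branches = [], [frozenset()]            # every branch starts from the empty set
    while branches:
        need = w - len(finished)                      # how many subsets still have to grow
        successors = []                               # (dependency, subset, parent branch)
        for i, subset in enumerate(branches):
            successors += [(d, frozenset(s), i) for d, s in top_extensions(subset)]
        successors.sort(key=lambda s: s[0], reverse=True)
        chosen, seen, parents = [], set(branches) | set(finished), set()
        for dep, subset, parent in successors:
            if subset in seen:                        # existing subsets cannot be reselected
                continue
            if len(chosen) == need - 1 and parents == {parent} and len(branches) > 1:
                continue                              # selected subsets must not all share a branch
            chosen.append((dep, subset))
            seen.add(subset)
            parents.add(parent)
            if len(chosen) == need:
                break
        branches = []
        for dep, subset in chosen:
            if dep >= threshold or len(subset) == n_features:
                finished.append(subset)               # reached the cutoff: stop expanding
            else:
                branches.append(subset)
    return [sorted(s) for s in finished]
```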

Let the size of the original feature set A in the dataset be N. In the kth round, a feature subset has selected k features, and the time complexity of calculating the dependency gain of the remaining \((N-k)\) features is \(O(N-k)\). Then in the worst case, that is, when all features are selected, the total complexity of one feature subset is \(O\left( \sum _{k=1}^{N}(N-k)\right) =O\left( N^{2}\right) \), and the total complexity of three feature subsets is approximately \(O(N^2)\).

After obtaining the three feature subsets, each of them is used to build a homogeneous learner, and the three learners form a collaborative decision-making model to obtain better learning performance.

3.3 Computational Complexity

The main computational cost of three-way decision feature selection comes from computing the kernelized dependency of different feature subsets and from selecting features for the different branches. Compared with the traditional fuzzy rough set approach, the use of kernel functions greatly reduces the storage space and computational cost. With M features, the time complexity of computing the Euclidean distance between a pair of samples is O(M). With N samples, it takes O(N) to calculate the kernelized lower approximation of each sample and another O(N) to merge the lower approximations of all samples into the kernelized dependency that measures the quality of a feature subset. In the feature selection process of the three-way decision, each branch evaluates at most M features, and the size of a branch is at most \(M-2\) features. Therefore, the time complexities of computing the kernelized dependency and of the feature selection process over the branches are \(O(N^2M)\) and \(O(M^2)\), respectively. In practice, the actual cost is much smaller than this worst case because of the limited branch sizes and the cutoff threshold.

4 Experiment

This part describes the experiment steps and analyzes the model in terms of classification accuracy. We compare the three-way decision model based on kernelized fuzzy rough sets with the traditional greedy algorithm, and we also compare soft voting with hard voting within our model. For each test sample, its three feature subsets are obtained, and for each subset the distance from the sample to its nearest training sample of every class is calculated. In hard voting, each feature subset votes for its closest class, and the class receiving the majority of votes is taken as the class of the sample. In soft voting, the distances from the three feature subsets to the nearest samples of each class are summed, and the class with the minimum total distance is taken as the predicted class. For convenience, 'KFRS-FS(S)' denotes soft voting and 'KFRS-FS(H)' denotes hard voting in the following.
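A sketch of the two voting schemes follows, assuming Euclidean nearest-neighbour distances computed on each feature subset; the function names nearest_class_distances and predict are illustrative.

```python
# Hard vs. soft voting over three feature subsets, using distances to the
# nearest training sample of each class (Euclidean distance assumed).
import numpy as np

def nearest_class_distances(x, X_train, y_train, subset):
    """Distance from x to its nearest training sample of each class, using only 'subset'."""
    dists = {}
    for c in np.unique(y_train):
        diff = X_train[y_train == c][:, subset] - x[subset]
        dists[c] = np.sqrt((diff ** 2).sum(axis=1)).min()
    return dists

def predict(x, X_train, y_train, subsets, voting="hard"):
    per_subset = [nearest_class_distances(x, X_train, y_train, s) for s in subsets]
    if voting == "hard":                  # each subset votes for its closest class
        votes = [min(d, key=d.get) for d in per_subset]
        return max(set(votes), key=votes.count)
    totals = {c: sum(d[c] for d in per_subset) for c in per_subset[0]}  # soft voting
    return min(totals, key=totals.get)
```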

4.1 Datasets and Settings

The specific information of the datasets used in this experiment is shown in Table 3, which summarizes the name, the number of features and the number of samples of each dataset. For each dataset, ten-fold cross-validation is used and the average performance is reported as the final performance of our model, in order to reduce the effect of random variation in the experiments.

Tests on small-scale datasets show that the kernelized fuzzy rough set method can extract better feature subsets when the dependency value belongs to [0.5, 1.0]. The performance of the proposed algorithm is therefore compared with that of the greedy algorithm with the dependency in [0.5, 1.0]. At the same time, for each dataset, the performance difference between soft voting and hard voting is compared. The specific experimental results are shown in Table 5 and Table 6.

Table 2. Residual Implication Induced by the t-Norms and Their Duals
Table 3. Experiment datasets
Table 4. Classification accuracy by KFRS-FS and GA

4.2 Algorithm Performance Comparison

The performance comparison of the two algorithms on the selected datasets is shown in Table 4, where the second column gives the performance of the model in this paper (denoted KFRS-FS) and the third column gives the performance of the classical greedy algorithm (denoted GA). It can be seen that the performance of the proposed algorithm is generally higher than that of the greedy algorithm. On datasets such as australian, bupa, dnatest, mammographic and spect-train, the improvement exceeds 5 percentage points and is particularly significant. Overall, the proposed KFRS-FS algorithm improves upon the classical greedy algorithm to different degrees depending on the dataset.

4.3 Analysis of KFRS-FS

This part compares soft voting and hard voting within the KFRS-FS algorithm introduced in this paper, as shown in Table 5 and Table 6, where the second column gives the distribution of the obtained feature subsets and the third column gives the performance of soft voting and hard voting, respectively. It can be seen that for datasets with fewer features, soft voting outperforms hard voting, whereas for datasets with more features, hard voting performs better. This suggests that hard voting, which first determines the class of each feature subset, has an advantage in high-dimensional sample spaces, while soft voting ignores the performance of individual feature subsets and seeks an overall decision, which weakens it in higher dimensions. In low-dimensional spaces, however, this overall decision yields better model performance.

Table 5. Classification accuracy by soft voting
Table 6. Classification accuracy by hard voting

In addition, the experiments show that the most appropriate cutoff threshold varies across datasets. When the cutoff threshold is too low, the model cannot fully exploit the information in the feature space; when it is too high, the feature subsets may become highly redundant. Both cases degrade the performance of the model.

5 Conclusions

In this paper, the idea of three-way decision is introduced into feature selection based on kernelized fuzzy dependency. From a multi-branch perspective, multiple feature subsets that are informative and complementary are obtained, and the classification performance is further improved through ensemble learning. The proposed algorithm has been evaluated on benchmark datasets and compared with traditional methods. The experimental results show that the three feature subsets computed by the new method are much smaller than the original feature set, which reduces the computational complexity of classification. Moreover, ensemble learning based on the three feature subsets achieves better classification accuracy on multiple datasets than the traditional kernelized fuzzy rough set feature selection method. Further research topics include extending the three-way decision to the semi-supervised setting, so that the method can be applied in more practical situations.