1 Introduction

Given a dataset, the task of feature selection is to select the smallest subset of the most important and discriminative input features. The aim of feature selection is to overcome the curse of dimensionality, a severe difficulty that can arise in high-dimensional datasets [1–3].

All traditional feature selection methods assume that the entire input feature set is available from the beginning. However, streaming features (SF) are an integral part of many real-world applications. In SF, the number of instances (feature vectors) is fixed while the feature set (attributes) grows over time. For example, in bioinformatics and clinical machine learning problems, acquiring the entire set of features for every training instance is expensive due to the high cost of lab experiments [4]. As another example, in texture-based image segmentation, the number of different texture filters can be infinite and therefore acquiring the entire feature set is infeasible [5]. In all these scenarios, we need to incrementally update the feature set as new features become available over time. There are also scenarios where the entire feature space is accessible, but feature streaming offers many advantages. This is common in statistical relational learning [6] and social network analysis [7], where the feature space is very large and exhaustively storing and searching the entire feature space is infeasible.

Streamwise feature selection (SFS) is the task of selecting the best feature subset in SF scenarios. Any SFS method must satisfy three critical conditions. First, it should not require any domain knowledge about the feature space, because the full feature space is unknown or inaccessible. Second, it should allow efficient incremental updates of the selected features, especially when only a limited amount of computational time is available between feature arrivals. Finally, it should be as accurate as possible at each time instance, to allow reliable classification and learning at that time instance.

Motivated by these challenges, several research efforts have been made to address SFS. Perkins and Theiler proposed an iterative gradient descent algorithm called grafting [8, 9]. In this algorithm, a newly seen feature is added to the selected features if the improvement in model accuracy is greater than a predefined threshold \(\lambda\). While this algorithm is able to handle streaming features, it is ineffective in dealing with true SF scenarios for two reasons: (1) choosing a suitable value for \(\lambda\) requires information about the global feature space; (2) the algorithm suffers from the so-called nesting effect: if a previously chosen feature is later found to be redundant, there is no way for it to be discarded [10].

Ungar et al. [6] proposed a streamwise regression algorithm called information-investing. In this algorithm, a newly generated feature is added to the model if the entropy reduction is greater than the cost of coding it. In spite of its success in handling feature spaces of unknown or even infinite size, this algorithm, like grafting, suffers from the nesting effect.

Wu et al. [5] proposed a causality-based SFS algorithm called fast-OSFS. This algorithm comprises two major steps: (1) online relevance analysis, which discards irrelevant features, and (2) online redundancy analysis, which eliminates redundant features. Although this framework is able to select the most informative features in SF, it relies on a conditional independence test that requires a large number of training instances, especially as the number of features involved in the test grows over time.

Wang et al. [11] proposed a dimension incremental attribute reduction algorithm called DIA-RED. This algorithm maintains a rough sets-based entropy value of the currently selected subsets and updates this value whenever new conditional features are added to the dataset. While DIA-RED is able to handle streaming scenarios without any knowledge about the feature space, it does not implement an effective redundant attribute elimination mechanism, so the selected subsets remain large during feature streaming. This leads to ineffective partitioning steps when calculating the rough set approximations, and therefore the algorithm is not time-efficient for most real-world datasets.

In this paper, the SFS problem is considered from the rough sets (RS) perspective. The main motivation for this choice is that RS-based data mining does not require any domain knowledge other than the given dataset. Several successful RS-based feature selection algorithms have been proposed in the literature [12–17]. However, all these algorithms consider the batch feature selection problem and are not applicable to SF scenarios. In this paper, a new SFS algorithm is proposed, which adopts the classical RS-based feature significance concept to eliminate irrelevant features. The efficiency and accuracy of the proposed algorithm are demonstrated through several experimental results.

The remainder of the paper is organized as follows: Sect. 2 summarizes the theoretical background of rough sets along with a look at the rough set-based attribute reduction methods. Section 3 discusses the new SFS algorithm. Section 4 reports experimental results and Sect. 5 concludes the paper.

2 Rough sets

Uncertainty is a natural phenomenon in machine learning, which can be embedded in the entire process of data preprocessing, learning and reasoning [18–20]. Rough set theory was introduced by Pawlak [21] to express uncertainty by means of the boundary region of a set. The main idea of rough sets is to use known knowledge in a knowledge base to approximate inaccurate and uncertain knowledge [22]. Therefore, the main advantage of this theory is that it requires no human input or domain knowledge other than the given dataset. This section summarizes the theoretical background of rough set theory along with a look at rough set-based attribute reduction methods.

2.1 Information system and indiscernibility

An information system is a pair \(IS = (U, F)\), where \(U\) is a non-empty finite set of objects called the universe and \(F\) is a non-empty finite set of features such that \(f : U \to V_{f}\) for every \(f \in F\). The set \(V_{f}\) is called the value set or domain of \(f\). A decision system is a triple \(DS = (U, F, d)\), where \(d\) is the decision feature.

For any set \(B \subseteq F \cup \{d\}\), we define the \(B\)-indiscernibility relation as:

$$IND_{IS}(B) = \{(x,y) \in U \times U \mid \forall f \in B,\; f(x) = f(y)\}.$$
(1)

If \((x,y)\) belongs to \(IND_{IS}(B)\), \(x\) and \(y\) are said to be indiscernible with respect to the feature subset \(B\). Equivalence classes of the relation \(IND_{IS}(B)\) are denoted \([x]_{B}\) and referred to as \(B\)-elementary sets. The partition of \(U\) into \(B\)-elementary subsets is denoted by \(U/B\). The time complexity of generating \(U/B\) is \(\Theta(|B||P||U|)\), where \(|P|\) is the number of generated \(B\)-elementary subsets. If none of the objects in \(U\) are indiscernible with respect to \(B\), the number of \(B\)-elementary subsets is \(|U|\) and therefore the worst-case complexity of generating \(U/B\) is \(O(|B||U|^{2})\) [23]. For most real-world applications, \(|P| \ll |U|\), and therefore the partitioning process is very time-efficient in practice [23].
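To make the partitioning step concrete, the following minimal Python sketch groups object indices by their value vector on a feature subset \(B\); the function name and data layout (a list of records indexable by attribute name) are our own illustrative assumptions, not part of the original algorithms.

```python
from collections import defaultdict

def partition(data, B):
    """Compute U/B: group object indices by their value vector on the attributes in B."""
    blocks = defaultdict(list)
    for i, row in enumerate(data):           # one pass over U
        blocks[tuple(row[a] for a in B)].append(i)
    return list(blocks.values())             # the B-elementary subsets

# Example: U/{'outlook'} for a toy dataset of two attributes
data = [{'outlook': 'sunny', 'play': 'no'},
        {'outlook': 'rainy', 'play': 'yes'},
        {'outlook': 'sunny', 'play': 'yes'}]
print(partition(data, ['outlook']))          # [[0, 2], [1]]
```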

2.2 Lower and upper approximations

Two fundamental concepts of rough sets are the lower and upper approximations of sets. Let \(B \subseteq F\) and \(X \subseteq U\); the \(B\)-lower and \(B\)-upper approximations of \(X\) are defined as follows:

$$\underline{B}X = \{x \mid [x]_{B} \subseteq X\},$$
(2)
$$\overline{B}X = \{x \mid [x]_{B} \cap X \ne \emptyset\}.$$
(3)

The approximations \(\underline{B}X\) and \(\overline{B}X\) summarize the information contained in \(B\). If \(x \in \underline{B}X\), it certainly belongs to \(X\), whereas if \(x \in \overline{B}X\), it may or may not belong to \(X\).

By the definition of \(\underline{B}X\) and \(\overline{B}X\), the objects in \(U\) can be partitioned into three parts, called the positive, boundary and negative regions:

$$POS_{B}(X) = \underline{B}X,$$
(4)
$$BND_{B}(X) = \overline{B}X - \underline{B}X,$$
(5)
$$NEG_{B}(X) = U - \overline{B}X.$$
(6)
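Assuming the `partition` sketch above, the lower and upper approximations in Eqs. (2)–(3), and hence the three regions in Eqs. (4)–(6), can be computed directly from the \(B\)-elementary subsets. This is only an illustrative Python sketch, not the paper's implementation; \(X\) is given as a collection of object indices.

```python
def lower_approximation(data, B, X):
    """B-lower approximation (Eq. 2): union of B-elementary sets fully contained in X."""
    X = set(X)
    return {i for block in partition(data, B) if set(block) <= X for i in block}

def upper_approximation(data, B, X):
    """B-upper approximation (Eq. 3): union of B-elementary sets that intersect X."""
    X = set(X)
    return {i for block in partition(data, B) if set(block) & X for i in block}

def regions(data, B, X):
    """Positive, boundary and negative regions of X (Eqs. 4-6)."""
    low, up = lower_approximation(data, B, X), upper_approximation(data, B, X)
    return low, up - low, set(range(len(data))) - up
```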

2.3 Dependency

Discovering dependencies between attributes is an important issue in data analysis. Let \(C\) and \(D\) be subsets of \(F \cup \{d\}\). For \(0 \le k \le 1\), we say that \(D\) depends on \(C\) to the \(k\)th degree (denoted \(C \Rightarrow_{k} D\)) if

$$k = \gamma(C,D) = \frac{|POS_{C}(D)|}{|U|},$$
(7)

where,

$$POS_{C}(D) = \bigcup_{X \in U/D} \underline{C}X,$$
(8)

is called the positive region of the partition \(U/D\) with respect to \(C\). This region is the set of all elements of \(U\) that can be uniquely classified into blocks of the partition \(U/D\) by means of \(C\).
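The dependency of Eq. (7) follows immediately from the positive region of Eq. (8). Below is a minimal sketch reusing the approximation helpers above; the names are ours, and the decision side \(D\) is passed as a list of attribute keys.

```python
def gamma(data, C, D):
    """Classical dependency gamma(C, D) = |POS_C(D)| / |U| (Eqs. 7-8)."""
    pos = set()
    for X in partition(data, D):                  # blocks of the partition U/D
        pos |= lower_approximation(data, C, X)    # objects certainly classified by C
    return len(pos) / len(data)
```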

2.4 Significance

Significance analysis measures the effect of removing an attribute, or a subset of attributes, from a decision system on the positive region defined by that decision system. Let \(DS = (U, F, d)\) be a decision system. The significance of an attribute \(f \in F\), denoted \(\sigma_{(F,d)}(f)\), is defined as [24]

$$\sigma_{(F,d)}(f) = \frac{\gamma(F,d) - \gamma(F - \{f\}, d)}{\gamma(F,d)}.$$
(9)
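For instance, with the `gamma` sketch above, the significance of a single attribute in Eq. (9) can be evaluated as follows (an illustrative helper, undefined when \(\gamma(F,d) = 0\)):

```python
def significance(data, F, d, f):
    """Significance of attribute f with respect to (F, d) (Eq. 9)."""
    full = gamma(data, F, [d])
    return (full - gamma(data, [a for a in F if a != f], [d])) / full
```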

2.5 Reduct

A reduct \(R\) is a subset of the conditional features that satisfies both of the following conditions:

$$IND_{IS}(R) = IND_{IS}(F),$$
(10)
$$\forall R' \subset R:\; IND_{IS}(R') \ne IND_{IS}(R).$$
(11)

An optimal reduct is a reduct with minimum cardinality. The intersection of all reducts contains those attributes that cannot be eliminated and is called the core. Finding a minimal reduct is NP-hard [24], since in the worst case all possible subsets of conditional features must be examined to retrieve such a reduct. Therefore, finding a near-optimal reduct has attracted much interest. Figure 1 shows the steps of the QUICKREDUCT algorithm [25], which searches for a minimal subset without exhaustively generating all possible subsets.

Fig. 1 The QUICKREDUCT algorithm [25]
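Since Fig. 1 is not reproduced here, the sketch below gives a hedged Python reconstruction of the greedy forward search commonly attributed to QUICKREDUCT [25]: repeatedly add the attribute that yields the largest increase in \(\gamma\) until the dependency of the full conditional set is reached. It is an approximation of the figure, not a verbatim copy.

```python
def quickreduct(data, C, d):
    """Greedy search for a (near-minimal) reduct of the conditional attributes C."""
    R, full = [], gamma(data, C, [d])
    while gamma(data, R, [d]) < full:
        best, best_gain = None, gamma(data, R, [d])
        for f in C:
            if f in R:
                continue
            g = gamma(data, R + [f], [d])
            if g > best_gain:                 # attribute with the largest increase
                best, best_gain = f, g
        if best is None:                      # no attribute improves the dependency
            break
        R.append(best)
    return R
```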

2.6 Rough set extensions

Efforts have been made to connect the attribute reduction concept in rough set theory to feature selection in machine learning and classification tasks. However, traditional rough set-based attribute reduction (RSAR) only operates effectively on datasets containing discrete values, and therefore a discretization step is necessary for real-valued attributes [23, 26]. Consequently, several extensions of the original theory have been proposed to deal with real-valued (continuous) datasets. Four well-known extensions are the variable precision rough sets (VPRS) [27], the tolerance rough set model (TRSM) [28], fuzzy rough sets (FRS) [29] and the neighborhood rough set model (NRSM) [30–32].

VPRS [27] attempts to overcome the shortcomings of traditional rough sets by generalizing the standard set inclusion relation (\(\subseteq\)). In the generalized inclusion relation, a set \(X\) is considered a subset of \(Y\) if the proportion of elements of \(X\) that are not in \(Y\) is less than a predefined threshold. However, choosing a suitable threshold requires more information than is contained within the data itself. This is contrary to rough set theory and to our SF requirement of operating with no domain knowledge.

TRSM [28] uses a similarity relation instead of the indiscernibility relation to relax the crisp character of classical rough set theory. Like equivalence classes (elementary sets) in classical rough sets, tolerance classes in TRSM are generated using the similarity relation and are used to define the lower and upper approximations. TRSM has two deficiencies that are contrary to our SF requirements. First, generating tolerance classes needs a tolerance threshold, which must be defined by the user. Second, the time complexity of generating all tolerance classes, using an attribute subset \(B\), is \(\Theta(|B||U|^{2})\), which equals the worst-case time complexity of the partitioning process in classical rough sets.

FRS [29] uses fuzzy equivalence classes generated by a fuzzy similarity relation to represent vagueness in data. The term fuzziness refers to the unclear boundary between two linguistic terms and is based on the membership function of fuzzy sets [20, 33]. Fuzzy lower and upper approximations are generated based on the fuzzy equivalence classes. These approximations extend their crisp counterparts in classical rough sets, except that elements may have membership degrees in the range (0, 1). FRS needs no extra knowledge to operate on a given dataset. However, generating fuzzy equivalence classes in FRS is an expensive routine (\(\Theta(|B||U|^{2})\)).

NRSM [30–32] replaces the equivalence relation of the traditional rough set model with a \(\theta\)-neighborhood relation, which supports both continuous and discrete datasets. Although successful in dealing with real-valued datasets, NRSM suffers from low computational efficiency, especially in computing the neighborhood of each record. Moreover, choosing a suitable \(\theta\) value (the distance threshold parameter) is a challenge in this extension and requires extra knowledge about the whole feature space.

In addition to the rough set extensions, there are also modifications that do not change the classical rough set principles. The notable work in this group is the one proposed in [15]. This modification does not redefine the lower and upper approximations of classical rough sets, but introduces a new dependency measure, based on the classical rough set principles, to deal with real-valued data. The new dependency measure uses a proximity measure that quantifies the information contained in the boundary region. It needs no human input to deal with the available data. Moreover, generating equivalence classes in this modification is more efficient than generating tolerance classes or fuzzy equivalence classes in TRSM and FRS, respectively.

3 Streamwise feature selection using rough sets

As stated above, in streaming features (SF) new conditional features flow in one by one over time while the number of objects in the dataset remains fixed. In this section, we propose a new algorithm that applies rough set theory to feature selection in SF scenarios.

3.1 The proposed algorithm

Because we do not have access to the full feature space in the streaming features context, batch RS-based feature selection algorithms, such as QUICKREDUCT, are not directly applicable. This problem can be relaxed by allowing a newly arrived feature to be included in the selected subset if it increases the dependency measure. However, this relaxation can be dangerous in SF scenarios, because early-arriving features are more likely to be selected while most late-arriving features may never be considered. Here, we first give a generalized definition of the significance concept, and then provide a general rough set-based feature selection algorithm for SF scenarios.

Definition 1

Let \(DS = (U, F, d)\) be a decision system and \(\pi : 2^{F \cup \{d\}} \times 2^{F \cup \{d\}} \to [0,1]\) be a dependency function defined on \(DS\). The generalized significance of \(P \subseteq F\), denoted \(\sigma^{g}_{(F,d)}(P)\), is defined as:

$$\sigma^{g}_{(F,d)}(P) = \frac{\pi(F,d) - \pi(F - P, d)}{\pi(F,d)}.$$
(12)

\(\pi\) can be any dependency function, such as the classical rough set-based dependency function (\(\gamma\)), a dependency function based on a rough set extension (such as VPRS, TRSM, and FRS), or one based on a modification (such as the dependency measure defined in [15]).
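A direct transcription of Eq. (12) with a pluggable dependency function might look as follows (illustrative only; `pi` is any function mapping an attribute list and the decision to \([0,1]\)):

```python
def generalized_significance(pi, F, d, P):
    """Generalized significance of a subset P under a dependency function pi (Eq. 12)."""
    full = pi(F, d)
    return (full - pi([a for a in F if a not in P], d)) / full
```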

Definition 2

Let \(DS = (U, F, d)\) be a decision system and \(P \subseteq F\). \(P\) is non-significant for \(DS\) if and only if \(\sigma^{g}_{(F,d)}(P) = 0\).

Definition 3

Let \(DS = (U, F, d)\) be a decision system and \(P \subseteq F\). \(DS\) is \(\pi\)-consistent using \(P\) if and only if \(\pi(P,d) = 1\).

Figure 2 presents the proposed streamwise feature selection algorithm, called SFS-RS, which uses generalized significance analysis to control the inclusion of each new incoming feature in the SF context. The algorithm starts with an empty selected subset \(R\). Then it waits for a new incoming feature (line 3). Once a new feature \(f\) is provided, the algorithm proceeds based on the \(\pi\)-consistency of the dataset. If the dataset is not \(\pi\)-consistent (\(\pi(R,d) \ne 1\)), the algorithm tests whether the dependency value increases when \(f\) is added to the current subset (line 5). If the measure increases, the current subset is updated to include \(f\) (line 6); otherwise, \(f\) is rejected. On the other hand, if the dataset is \(\pi\)-consistent (\(\pi(R,d) = 1\)), the new incoming feature is not eliminated immediately; instead, the algorithm checks whether any subset of the current reduct becomes non-significant due to the presence of \(f\) (lines 9–19). If such a subset exists and its size is larger than one, the subset is replaced with \(f\) (lines 21–23). This makes the selected subset smaller while preserving \(\pi\)-consistency. Moreover, if only one feature (say \(f'\)) becomes non-significant due to \(f\), one of the features \(f\) and \(f'\) is removed based on the dependency value (lines 23–25). The algorithm alternates between these two phases until the stopping criterion is satisfied. If the size of the streaming dataset is known, the algorithm can keep running until the last feature arrives (no further features are available). However, if we have no knowledge about the feature space, the algorithm can stop once a predefined accuracy is reached or a maximum number of iterations is exceeded.

Fig. 2 The proposed streamwise feature selection algorithm (SFS-RS)
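Because Fig. 2 is not reproduced here, the following Python sketch summarizes the two-phase loop described above. It is a minimal reading of the algorithm, not its exact pseudocode: the per-feature non-significance check and the tie-breaking rule in the single-feature case are our own assumptions, and `pi(B, d)` stands for any dependency function returning a value in \([0,1]\).

```python
def sfs_rs(feature_stream, d, pi, stop=lambda R: False):
    """Streamwise feature selection with a pluggable dependency function pi (a sketch of Fig. 2)."""
    R = []                                            # currently selected subset
    for f in feature_stream:                          # a new conditional feature arrives
        if stop(R):                                   # external stopping criterion
            break
        if pi(R, d) < 1.0:                            # phase 1: dataset not yet pi-consistent
            if pi(R + [f], d) > pi(R, d):             # keep f only if it increases the dependency
                R.append(f)
        else:                                         # phase 2: dataset already pi-consistent
            candidate = R + [f]
            base = pi(candidate, d)
            # features of R that become non-significant once f is present
            non_sig = [g for g in R
                       if pi([h for h in candidate if h != g], d) >= base]
            if len(non_sig) > 1:                      # replace the non-significant subset with f
                R = [h for h in R if h not in non_sig] + [f]
            elif len(non_sig) == 1:                   # keep whichever of f, f' gives higher dependency
                f_prime = non_sig[0]
                without_f_prime = [h for h in candidate if h != f_prime]
                if pi(without_f_prime, d) >= pi(R, d):
                    R = without_f_prime
            # if no feature becomes non-significant, f is discarded
    return R
```

As a usage example under these assumptions, `sfs_rs(iter(all_features), 'class', lambda B, dd: gamma(data, B, [dd]))` streams the columns of a discrete dataset through the classical \(\gamma\) dependency, where `all_features`, `'class'`, and `data` are hypothetical names for the feature list, decision attribute, and record list.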

3.2 The time complexity of SFS-RS

The time complexity of SFS-RS depends on the number of \(\pi\) tests. Suppose that the time complexity of calculating \(\pi(B,d)\) is given by a function \(\Psi(B,U)\), where \(U\) is the set of objects. For example, if we use the classical rough set-based dependency function (\(\gamma\)), then \(\Psi(B,U) = \Theta(|B||P||U|)\), whereas if we use the fuzzy rough set-based dependency function, then \(\Psi(B,U) = \Theta(|B||U|^{2})\). Suppose that at time \(t\) a new feature \(f_{t}\) is presented to the SFS-RS algorithm and let \(R_{t}\) be the selected feature subset at that time. If the available dataset is not \(\pi\)-consistent using \(R_{t}\), the first phase of the algorithm is triggered. This phase includes a single \(\pi\) test and therefore its worst-case time complexity is \(\Psi(R_{t} \cup \{f_{t}\},U)\). However, if the dataset is \(\pi\)-consistent, the second phase is triggered, which needs \(2|R_{t}|\) \(\pi\) tests (two \(\pi\) tests for each consistency check) to remove non-significant features. Therefore, the worst-case time complexity of this phase is \(2|R_{t}|\,\Psi(R_{t} \cup \{f_{t}\},U)\).
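As an illustrative worked case (our own derivation, assuming the classical \(\gamma\) is used as \(\pi\)), the per-arrival cost of the second phase becomes:

$$\Psi\left(R_{t} \cup \{f_{t}\}, U\right) = \Theta\left((|R_{t}|+1)\,|P|\,|U|\right) \;\Longrightarrow\; 2|R_{t}|\,\Psi\left(R_{t} \cup \{f_{t}\}, U\right) = \Theta\left(|R_{t}|^{2}\,|P|\,|U|\right).$$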

4 Experimental results

In this section, we provide several experimental results to illustrate the performance of the proposed method. To this end, we compare the proposed SFS algorithm (SFS-RS) with four existing SFS algorithms: grafting [9], information-investing [6], fast-OSFS [5] and DIA-RED [11]. Table 1 summarizes the datasets used in our experiments. The dorothea, arcene, dexter, and madelon datasets are from the NIPS 2003 feature selection challenge [34]. Arrhythmia, mf (multiple features), ionosphere, wine, and credit are from the UCI machine learning repository [35], and threshold max 1–3 (tm1–tm3) are three synthetic datasets from [9]. All experiments are carried out on a DELL workstation with Windows 7, 2 GB memory, and a 2.4 GHz CPU. The J48 [36] and SVM with RBF kernel [37] classifiers are used to evaluate the performance of the proposed SFS algorithm.

Table 1 Summary of the benchmark datasets

4.1 Experiments on different settings of the dependency function

Here we consider the effect of different dependency functions (different settings for \(\pi\)) on the performance of the SFS-RS algorithm. Six dependency measures are considered: (1) the classical rough sets-based dependency function, (2) the VPRS-based dependency function, (3) the TRSM-based dependency function, (4) the FRS-based dependency function, (5) the NRSM-based dependency measure, and (6) the dependency measure introduced in [15]. Two different threshold values, β = 0.1 and β = 0.2, are employed to define the generalized inclusion relation in the VPRS-based dependency. Moreover, two different tolerance thresholds, α = 0.9 and α = 0.95, are used for the TRSM-based function. Since the experimental results in [32] show that (0.1, 0.3) is an optimal candidate interval for \(\theta\) in NRSM, we use the value 0.2 for this parameter. Table 2 summarizes the six dependency measures and their representations used in our reports.

Table 2 The six dependency measures and their representations in our experiments

The selected subset sizes, running times, and classification accuracies of the six measures are reported in Tables 3, 4 and 5, respectively. Based on these tables, we can draw the following conclusions:

Table 3 The selected subset sizes using SFS-RS with different dependency settings
Table 4 Running times of SFS-RS with different dependency settings
Table 5 Classification accuracies of SFS-RS with different dependency settings
  • There is only a marginal increase in the running times of the dependency measure proposed in [15] (\(\rho\) in the tables) compared with the classical rough set-based method. This is because the two methods use the same partitioning process to generate elementary classes.

  • Although the VPRS-based algorithm is comparable with the classical rough sets-based algorithm in terms of running times, the results show a strong dependence of VPRS on the β-value. Although an ideal threshold value can be obtained by repeated experimentation on a given dataset, this value will be biased towards that dataset. Moreover, finding such a value imposes a large computational cost on the overall feature selection process.

  • The TRSM-, FRS- and NRSM-based algorithms are very slow and are unable to finish for the seven datasets dorothea, arcene, dexter, madelon, mf, tm1, and tm2.

  • Similar to VPRS, the optimal threshold value in TRSM is data-driven and needs a pre-processing step for each dataset. This is contrary to the rough set theory consideration of operating with no domain knowledge.

  • Although the classical rough sets-based dependency measure is able to find accurate results for discrete-valued datasets, it loses the tests for most of the continuous-valued datasets. In contrast, the dependency measure proposed in [15] improves classification accuracy for most of the tests, especially for continuous-valued datasets.

4.2 Comparison of SFS-RS with other SFS algorithms

Here, we compare the performance of the proposed SFS algorithm with four existing SFS algorithms: grafting [9], information-investing [6], fast-OSFS [5] and DIA-RED [11]. In the grafting algorithm, a multi-layer perceptron (MLP) is adopted as the learning model and the \(\lambda\) parameter is chosen using fivefold cross-validation on each training dataset. In the information-investing algorithm, the parameters are set to their defaults, \(W_{0} = 0.5\) and \(W_{\Delta} = 0.5\). For the fast-OSFS algorithm, the independence tests are \(G^{2}\) tests for all fully categorical or integer datasets and Fisher's z-tests for all datasets containing real-valued features. For both tests, the statistical significance level (\(\alpha\)) is set to 0.05. DIA-RED is implemented using the combination entropy [38] and the size of the incremental attribute set (SIA) is set to 1. In the proposed SFS-RS algorithm, the measure proposed in [15] is used as the dependency function.

Results are presented in terms of the selected subset size (compactness), the time needed to locate the subset (running time), the classification accuracy at the end of the stream, and the classification accuracy of the selected subsets during feature streaming.

Table 6 reports the compactness of the subsets selected by the five algorithms. As can be seen, the proposed algorithm selects fewer features than the other four algorithms. For the datasets with larger feature sets (dorothea, arcene, and dexter), grafting and information-investing find considerably large subsets. This can be attributed to the nesting effect of the two algorithms. Moreover, the fast-OSFS algorithm fails to select a feature subset for five datasets: arcene, tm3, ionosphere, wine, and credit. This is related to the failure of the conditional independence tests when applied to a limited number of training instances. DIA-RED loses the comparison for all the tests. It fails to finish in a reasonable time for dorothea, arcene, dexter, mf, and tm3. This can be related to the fact that DIA-RED does not implement an effective redundant attribute elimination mechanism, so the selected subsets remain large during feature streaming, which causes ineffective partitioning steps in calculating the rough set approximations, and therefore the algorithm is not time-efficient.

Table 6 Selected subset size comparison of the five SFS algorithms

The running time results are reported in Table 7. SFS-RS is superior for six datasets: dorothea, arcene, dexter, madelon, tm2, and tm3. The running times of fast-OSFS are comparable with those of the proposed method. Another important result is that the grafting, information-investing and DIA-RED algorithms are not time-efficient for datasets with large feature spaces.

Table 7 Running time comparison of the five SFS algorithms

The classification results, reported in Table 8, show that the proposed algorithm performs very well and improves classification accuracy for most of the tests. Although grafting wins the test for tm1, SFS-RS shows an increase of up to 60% for the dorothea dataset. Compared with information-investing, the proposed algorithm is superior in all tests except on the credit dataset. Compared with fast-OSFS and DIA-RED, the proposed algorithm is superior in all tests.

Table 8 Classification accuracy comparison of the five SFS algorithms

Figure 3 shows the classification accuracy of the selected subsets during feature streaming for the three higher-dimensional datasets dorothea, arcene, and dexter. A general conclusion from this figure is that the proposed algorithm is more accurate at each time instance and therefore allows reliable classification and learning at that time instance during the streaming phase. Moreover, the grafting and information-investing algorithms show the smallest changes in classification accuracy as more features are seen. This can be attributed to the lower dynamism of these two algorithms, which is due to the nesting effect.

Fig. 3 Classification results of selected subsets during feature streaming

5 Conclusions

Feature selection, as a pre-processing step, selects a small subset of the most important and discriminative input features. This paper considered the SFS problem from the rough sets (RS) perspective. The main motivation was that RS-based data mining does not require any domain knowledge other than the given dataset. A new SFS algorithm, called SFS-RS, was proposed. This algorithm adopts the feature significance concept to eliminate features that have no influence in deciding the output feature.

To show the efficiency and accuracy of the proposed algorithm, it was compared with the grafting, information-investing, fast-OSFS, and DIA-RED algorithms. Twelve high-dimensional datasets were used for the comparisons, and their features were presented one by one to simulate true SF scenarios. The experiments demonstrate that the proposed algorithm achieves better results than the existing SFS algorithms on all evaluation criteria.