1 Introduction

Dataset sizes have been increasing dramatically, which usually incurs high computational costs, so handling huge and high-dimensional datasets has become a major problem (Kumar 1998; Aouiche and Darmont 2009). To deal with this problem, feature reduction and data reduction are practically important. Feature selection (FS), as a kind of feature reduction, finds a feature subset that is more relevant to the labels. Since FS reduces the number of features, it saves computational time and memory when dealing with high dimensional datasets. It is also useful for improving classification accuracy as a result of removing redundant and irrelevant features. According to their selection mechanisms, FS methods fall into three categories: wrapper methods (Kuncheva and Jain 1999; Ron and John 1997), embedded methods (Breiman et al. 1984; Quinlan 1993), and filter methods (Xu et al. 2010; Hu et al. 2010; Dash and Liu 2003; Liu et al. 2009; Peng et al. 2005; Sotoca and Pla 2010). Since filter methods balance classification accuracy against computational cost, they are the most widely used. Two kinds of filter methods, based on mutual information (MI) and on rough sets, are introduced below, because both theories are employed in our method.

Since MI can estimate the certainty between two variables, it is used as a criterion to select features. The main idea of MI based methods is to find a subset by maximizing the MI between features and labels while minimizing the MI among features. On the basis of this idea, each method defines its own selection criterion. The basic methods, mutual information based feature selection (MIFS) (Battiti 1994) and MIFS-U (Kwak and Choi 2002a), consider the certainty between labels and features. Feature selection by mutual information based on Parzen window (PWMI) eliminates features with little information about the class variable (Kwak and Choi 2002b). For optimal feature subset using MI (OFS-MI), salient features are selected by comparing the quadratic MI (Chow and Huang 2005; Huang and Chow 2005). Normalized MIFS (Estévez et al. 2009) is a fast and efficient method, owing to its incremental nature. For MI methods, a serious problem is that the subset found by selecting the ‘best individual features’ is not necessarily the best subset of features (Cover and Thomas 1991). In other words, MI based selection cannot verify combinations of features, because high dimensional MI is too expensive to compute. In contrast, rough sets based feature selection delivers encouraging results on combinations of features.

Rough sets is a mathematical theory developed by Pawlak (1982, 1991) and Pawlak and Skowron (2007a, b, c). The process of standard feature selection employing rough sets can be viewed as reduct construction (Pawlak 1982). Compared with other feature selection methods, reduct construction defines its own selection and termination criteria (Yao et al. 2008). A reduct construction method does not stop until it finds a set in which the combination of each feature with the others increases the dependency. There are several extensions of rough sets based FS methods. Samples in the Boundary Region are employed to define a distance metric as part of the selection criterion (Parthaláin et al. 2007; Parthaláin and Shen 2009); the subsets selected by these methods include more information about the labels. Heuristic functions guide the searching process of FS methods (Zhong et al. 2001; ElAlami 2009; Bae et al. 2010; Wang et al. 2007; Chen et al. 2011), which favors finding a more desirable output. Heuristic functions depend on the internal information of the data. In contrast, external information based on semantics or constraints (Yao et al. 2006) is used to find a reduct with user preference. Hybrid fuzzy and rough methods (Hu et al. 2007; Parthaláin et al. 2010; Cornelis et al. 2008, 2010a, b; Jensen et al. 2009; Cornelis and Jensen 2008) can provide flexible solutions by extending the lower and upper approximations of rough sets to fuzzy sets. Finally, a feature selection algorithm for multiple classifiers can decrease the number of decision-relative feature subsets (Delimata and Suraj 2008). This method evaluates the feature subsets according to deterministic and inhibitory rules, which consider the influence of adding a new object to the decision table (Moshkov et al. 2008, 2010; Delimata et al. 2008, 2009, 2010).

Rough sets based FS methods are filter methods, but the central concept of rough sets is classification, which makes them similar to wrapper methods. That is to say, the subsets selected by rough sets methods favor classification accuracy without a dramatic increase in computational cost. However, finding an optimal solution is an NP-hard problem. In this paper, we propose a new algorithm to find a near optimal subset. Since the dependency expresses the ability to discern feature values, it is used to guide the searching process. To avoid selecting features that add little dependency, MI is employed to evaluate the similarities of features. As a result, the dependency of the subset found by our method reaches its maximum with a small number of features. The dependency reflects the definitive relevance of the samples in the Positive Region, but the relationship between the samples in the Boundary Region and the labels is not considered by the dependency. A class-based distance metric (CDM), which forms part of the selection criterion, is defined to evaluate the benefit of the Boundary Region to classification accuracy.

Our proposed feature selection based on rough sets and mutual information (RSMI-FS) shows four major advantages. First, since the dependency of each single feature guides the searching process as heuristic information, the subset determined by RSMI-FS is optimal or near-optimal. This is confirmed by our results in Section 4: for each dataset, the classification accuracies of the subsets found by our method are larger than or comparable with those of other methods. Second, RSMI-FS can find more relevant features, since its selection criterion, which combines the dependency with the CDM, evaluates both the definitive and the uncertain relevance between features and labels. Third, MI is used to reduce redundant features. Fourth, the optimal subset is found without a dramatic increase in computational time, as quadratic MI is used to estimate the similarities of features, which reduces the computational complexity of MI (Chow and Huang 2005; Huang and Chow 2005).

A real application dataset is also used to verify the effectiveness of our method. An estimated 5.4 million Americans have Alzheimer’s disease (AD). Currently, an additional person develops AD every 68 s, and by 2050 there will be one new case of AD every 33 s. The situation of AD is serious, but the precise physiological changes triggering the development of AD remain largely unknown. In this work, our method is applied to a real AD dataset developed by the National Alzheimer’s Coordinating Center of the US. The maximum classification accuracy on the selected subsets reaches 81.3 %. For the samples with AD, the maximum classification accuracy is 82.75 %. These results verify that our method is a suitable tool for mining the characteristics of AD.

The rest of the paper is organized as follows. Section 2 gives the background of our study, covering the basics of rough sets and mutual information. The proposed feature selection scheme is detailed in Section 3. The experimental results are presented in Section 4. The application to the AD dataset is in Section 5. Finally, the conclusion is drawn in Section 6.

2 Background

In this section, the theories related to our work are introduced. First, the concept of mutual information, which can be used to evaluate the similarities among features, is presented. Then, the basic idea of rough sets and the extended methods employing samples of the Boundary Region are given.

2.1 Mutual information

In accordance with Shannon’s information theory (Cover and Thomas 1991), the uncertainty of a variable C can be measured by the entropy H(C). For two variables C and Y, the MI I(C;Y) expresses the degree of reduced uncertainty about C after observing Y. It is calculated as

$$ I\left( {C;Y} \right)=H\left( C \right)-H\left( {\left. C \right|Y} \right)\textrm{,} $$
(1)

where \(H\left( {\left. C \right|Y} \right)\) is the conditional entropy, measuring the uncertainty about C after observing Y.

When C is a discrete variable, its Shannon entropy is expressed as

$$ H\left( C \right)=-\sum\limits_{c\in C} {p\left( c \right)\log } p\left( c \right), $$
(2)

where p(·) is the probability mass function of C. When Y is also a discrete variable, the conditional entropy H(C ∣ Y) is expressed as

$$ H\left( {\left. C \right|Y} \right)=-\sum\limits_{y\in Y} {p\left( y \right)\left( {\sum\limits_{c\in C} {p\left( {\left. c \right|y} \right)\log p\left( {\left. c \right|y} \right)} } \right)} \textrm{,} $$
(3)

where p(c ∣ y) represents the conditional probability mass function of C given Y. Their MI is

$$ I\left( {C;Y} \right)=\sum\limits_{c\in C} {\sum\limits_{y\in Y} {p\left( {c,y} \right)\log \frac{p\left( {c,y} \right)}{p\left( c \right)p\left( y \right)}} } \textrm{,} $$
(4)

where p(c, y) is the joint probability mass function. When C and Y are continuous variables, the MI between C and Y is (Chow and Huang 2005)

$$ I\left( {C;Y} \right)=\int\int {p\left( {c,y} \right)\log \frac{p\left( {c,y} \right)}{p\left( c \right)p\left( y \right)}} dydc. $$
(5)

Based on the basic MI expression, quadratic MI is obtained by inserting quadratic expressions of the related variables (Torkkola and Campbell 2000), which is useful for reducing the computational complexity. If D 1 and D 2 are discrete variables, their quadratic MI (Chow and Huang 2005; Huang and Chow 2005; Torkkola and Campbell 2000) is

$$ I\left( {D_1 ;D_2 } \right)=\log \frac{\sum\limits_{d_1 \in D_1 } {\sum\limits_{d_2 \in D_2 } {p\left( {d_1 ,d_2 } \right)^2} } \sum\limits_{d_1 \in D_1 } {\sum\limits_{d_2 \in D_2 } {p\left( {d_1 } \right)^2p\left( {d_2 } \right)^2} } }{\left( {\sum\limits_{d_1 \in D_1 } {\sum\limits_{d_2 \in D_2 } {p\left( {d_1 ,d_2 } \right)p\left( {d_1 } \right)p\left( {d_2 } \right)} } } \right)^2}. $$
(6)

For two continuous variables E 1 and E 2, their quadratic MI is:

$$ I\left( {E_1 ;E_2 } \right)=\log \frac{\int\int {p\left( {e_1 ,e_2 } \right)^2de_1 de_2 } \int\int {p\left( {e_1 } \right)^2p\left( {e_2 } \right)^2de_1 de_2 } }{\left( {\int\int {p\left( {e_1 ,e_2 } \right)p\left( {e_1 } \right)p\left( {e_2 } \right)de_1 de_2 } } \right)^2}. $$
(7)

The probability mass function describes the relative likelihood that the variable takes a given value, and it is estimated from the observations. Histogram and kernel methods are commonly used to estimate the probability mass function. However, the histogram tends to produce large estimation errors, and its memory requirement grows exponentially as the size of the data increases.
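As a concrete illustration (not taken from the cited works), the following minimal Python sketch estimates the discrete MI of (4) and the quadratic MI of (6) from observed samples, using empirical relative frequencies as the probability mass estimates.

    import numpy as np

    def discrete_mi(x, y):
        """Shannon MI I(X;Y) of two discrete sample vectors, formula (4)."""
        x, y = np.asarray(x), np.asarray(y)
        mi = 0.0
        for xv in np.unique(x):
            px = np.mean(x == xv)
            for yv in np.unique(y):
                py = np.mean(y == yv)
                pxy = np.mean((x == xv) & (y == yv))
                if pxy > 0:
                    mi += pxy * np.log(pxy / (px * py))
        return mi

    def quadratic_mi(x, y):
        """Quadratic MI of formula (6) from empirical frequencies."""
        x, y = np.asarray(x), np.asarray(y)
        num1 = num2 = den = 0.0
        for xv in np.unique(x):
            px = np.mean(x == xv)
            for yv in np.unique(y):
                py = np.mean(y == yv)
                pxy = np.mean((x == xv) & (y == yv))
                num1 += pxy ** 2          # sum of squared joint probabilities
                num2 += px ** 2 * py ** 2 # sum of squared marginal products
                den += pxy * px * py      # cross term in the denominator
        return np.log(num1 * num2 / den ** 2)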

2.2 Rough sets based feature selection

The theory of rough sets aims to approximately describe sets that are unknown, incompletely specified, or whose specification is overly complex. The fundamental notions of rough sets are the lower and upper approximations of sets (Pawlak 1982, 1991; Pawlak and Skowron 2007a, b, c). The lower approximation describes the domain objects that certainly belong to the concept of interest, whereas the upper approximation describes the domain objects that possibly belong to the concept of interest.

Definition 2.1

Let IS(U,A) be a complete information system, where U is a nonempty finite set of objects and A is a nonempty finite set of features such that \(f:U\to V_f\) for every f ∈ A, where V f is the set of values that f takes. For any \(P\subseteq A\), there exists an indiscernible relation IND(P)

$$ IND\left( P \right)=\left\{ {\left( {x,y} \right)\in U^2\left| {\forall f\in P,f\left( x \right)=f\left( y \right)} \right.} \right\}\textrm{.} $$
(8)

Dataset can be seen as an information system, where samples are the objects of U and features are the elements of A.

Definition 2.2

A partition of U generated by P is defined as

$$ U/IND\left( P \right)=\otimes \left\{ {U/IND\left( {\left\{ f \right\}} \right)\left| {f\in P} \right.} \right\}, $$
(9)

where

$$ \begin{array}{*{20}c} {U/IND\left( {\left\{ f \right\}} \right)=\left\{ {\left. {\left\{ {x\vert f\left( x \right)=b,x\in U} \right\}} \right|b\in V_f } \right\}} \\ {A\otimes B=\left\{ {X\cap Y\left| {\forall X\in A,\forall Y\in B,X\cap Y\ne \emptyset } \right.} \right\}} \\ \end{array}\textrm{.} $$
(10)

If \(\left( {x,y} \right)\in IND\left( P \right)\), x and y are indiscernible with respect to the feature subset P. The equivalence class of x under the P-indiscernible relation is denoted by \(\left[ x \right]_P\). If x and y are indiscernible with respect to the feature subset P, then \(y\in \left[ x \right]_P\). The P-lower and P-upper approximations of X are constructed as

$$ \underline{P}X=\left\{ {x\left| {\left[ x \right]_P \subseteq X} \right.} \right\}\textrm{,} $$
(11)
$$ \overline P X=\left\{ {x\left| {\left[ x \right]_P \cap X\ne \emptyset } \right.} \right\}\textrm{,} $$
(12)

where (11) is the P-lower approximation and (12) is the P-upper approximation.

By the definitions of the P-lower and P-upper approximations, the objects in U can be partitioned into three regions: the Positive Region, the Boundary Region, and the Negative Region.

Definition 2.3

Positive Region, Boundary Region and Negative Region are defined as

$$ \displaystyle POS_P \left( L \right)=\bigcup\limits_{X\in U/IND\left( L \right)} {\underline{P}X} , $$
(13)
$$ \displaystyle BND_P \left( L \right)=\bigcup\limits_{X\in U/IND\left( L \right)} {\overline P X} -\bigcup\limits_{X\in U/IND\left( L \right)} {\underline{P}X} , $$
(14)
$$ \displaystyle NEG_P \left( L \right)=U-\bigcup\limits_{X\in U/IND\left( L \right)} {\overline P X} \textrm{.} $$
(15)

Although L can be any set of features, it is fixed as the set containing the label feature to simplify the description. In this work, we only consider systems with one label feature, that is, L = {l}.

Definition 2.4

The dependency of the label feature set L on a feature set P is calculated as

$$ \gamma_P \left( L \right)=\frac{\left| {POS_P \left( L \right)} \right|}{\left| U \right|}\textrm{,} $$
(16)

where ∣ · ∣ is the number of objects in the set.
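To make Definitions 2.1–2.4 concrete, the following minimal sketch (not the authors' implementation) computes the partition U/IND(P), the Positive Region of (13), and the dependency of (16) for a discrete data table; representing the data as a dict of feature-value lists is our own assumption.

    from collections import defaultdict

    def partition(data, features, n):
        """U/IND(features): group sample indices by their value tuple on `features`."""
        blocks = defaultdict(list)
        for i in range(n):
            blocks[tuple(data[f][i] for f in features)].append(i)
        return list(blocks.values())

    def positive_region(data, P, label, n):
        """POS_P(L): samples whose P-equivalence class lies inside one label class."""
        pos = set()
        for block in partition(data, P, n):
            labels = {data[label][i] for i in block}
            if len(labels) == 1:   # block is contained in a single label class
                pos.update(block)
        return pos

    def dependency(data, P, label, n):
        """gamma_P(L) = |POS_P(L)| / |U|, formula (16)."""
        return len(positive_region(data, P, label, n)) / n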

The following lists QUICK REDUCT (Słowiński 1992) using forward search. C is the set of all conditional features; that is, C includes all features except the label. R is the output of the algorithm. However, the output of QUICK REDUCT is not necessarily a reduct. A set S, a subset of C, is a reduct if \(POS_S \left( L \right)=POS_C \left( L \right)\) and \(\forall a\in S,\;POS_{S-\left\{ a \right\}} \left( L \right)\ne POS_S \left( L \right)\) (Pawlak 1991). The algorithm provides an addition strategy for constructing a subset, and \(POS_R \left( L \right)=POS_C \left( L \right)\) is verified by the value of the dependency. However, there may exist a ∈ R such that \(POS_{R-\left\{ a \right\}} \left( L \right)=POS_{R} \left( L \right)\), in which case the output of QUICK REDUCT is not a reduct. Three strategies described in Yao et al. (2008) guarantee that the output is a reduct: reduct construction by deletion, by addition-deletion, and by addition. The deletion strategy eliminates features from the set of all conditional features until it finds a subset R producing the same Positive Region as the set of all conditional features and containing no feature a ∈ R with \(POS_{R-\left\{ a \right\}} \left( L \right)=POS_R \left( L \right)\). This strategy is not efficient when the reduct is short (Yao et al. 2008). The addition-deletion strategy first constructs a set R by inserting features with respect to the Positive Region; a deletion step then removes every feature a ∈ R with \(POS_{R-\left\{ a \right\}} \left( L \right)=POS_R \left( L \right)\). Reduct construction by addition inserts features by considering the discernibility matrix to obtain a reduct directly.

figure a
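For reference, here is a compact sketch of the forward addition strategy of QUICK REDUCT as described above; it assumes the `dependency` helper from the previous sketch, and the listing in the figure remains the authoritative version.

    def quick_reduct(data, features, label, n):
        R, best = [], 0.0
        full = dependency(data, features, label, n)   # gamma_C(L)
        while best < full:
            gains = {f: dependency(data, R + [f], label, n)
                     for f in features if f not in R}
            f_best = max(gains, key=gains.get)
            if gains[f_best] <= best:                 # no feature increases dependency
                break
            R.append(f_best)
            best = gains[f_best]
        return R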

2.3 Boundary region of rough sets

QUICK REDUCT only considers the samples in the Positive Region, so the information carried by the samples in the other regions is not captured. Distance Metric Quick Reduct (DMQR) (Parthaláin et al. 2007) and Distance Metric Tolerance Rough Sets (DM-TRS) (Parthaláin and Shen 2009) derive distance metrics as part of the feature selection criteria in order to find a better feature subset. The distance metrics employed by DMQR and DM-TRS qualify the objects in the Boundary Region with regard to their proximity to the lower approximations. Since evaluating the margin of a high dimensional space requires large computational effort, both definitions calculate the mean of all samples in the P-lower approximations to reduce the computational burden. The mean defined in Parthaláin et al. (2007) and Parthaláin and Shen (2009) is

$$ \underline{P}X_{\mathrm{mean}} =\left\{ {\frac{\sum\limits_{x\in \underline{P}X} {f\left( x \right)} }{\left| {\underline{P}X} \right|}\left| {\forall f\in P} \right.} \right\}. $$
(17)

Another definition (Parthaláin et al. 2010) of the mean is given as

$$ \underline{P}X_{\mathrm{mean}} =\left\{ {\frac{\sum\limits_{x\in \underline{P}X} {f\left( x \right)} }{\left| {POS_P \left( X \right)} \right|}\left| {\forall f\in P} \right.} \right\}. $$
(18)

Formulas (17) and (18) give two definitions of the mean, which is employed to calculate the distance metric in formula (19). According to their explanation, these two formulas compute the means of the samples in the Positive Region; however, X is not described clearly. Moreover, they are general means over all objects, so the distance metric cannot evaluate the clustering of samples within the same class. That is to say, the distance metric is not guaranteed to benefit classification accuracy. In addition, the problem becomes more complex when the Positive Region is sparse, since there are not enough samples providing information.

The distance metrics of DMQR and DM-TRS are defined as

$$ \omega_P \left( L \right)=\left( {\sum\limits_{y\in BND_p \left( L \right)} {\delta \left( {\underline{P}X_{\mathrm{mean}} ,y} \right)} } \right)^{-1}\textrm{,} $$
(19)

where y is a sample in the Boundary Region, and δ is a distance function, namely the Euclidean distance described in Parthaláin et al. (2007) and Parthaláin and Shen (2009).

The feature selection criteria of DMQR and DM-TRS (Parthaláin et al. 2007; Parthaláin and Shen 2009) include two parts: the dependency and the distance metric of the samples in the Boundary Region. The evaluation measure of the two algorithms is defined as

$$ M_P \left( L \right)=\frac{\omega_P \left( L \right)+\gamma_P \left( L \right)}{2}\textrm{.} $$
(20)

Apart from the feature selection criterion and the stopping rule, their mechanisms are the same as QUICK REDUCT. Since \(\gamma_R \left( L \right)=\gamma_C \left( L \right)\) is an ideal condition that cannot always be attained in practice, DMQR and DM-TRS are set to stop when no new feature can be found.
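The following rough sketch combines (17), (19) and (20); following the reading discussed above, the mean is taken over all samples of the Positive Region, and `pos`/`bnd` are assumed to be index collections of the Positive and Boundary Region samples restricted to the features in P.

    import numpy as np

    def dmqr_measure(X, pos, bnd, gamma):
        """DMQR-style evaluation measure M_P(L); X is an (n x d) numeric array on P."""
        pos, bnd = list(pos), list(bnd)
        mean_low = X[pos].mean(axis=0)                        # mean of (17)/(18)
        dists = np.linalg.norm(X[bnd] - mean_low, axis=1)     # Euclidean delta in (19)
        omega = 1.0 / dists.sum() if len(bnd) and dists.sum() > 0 else 0.0
        return (omega + gamma) / 2.0                          # formula (20)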

3 Class-based boundary rough sets with mutual information for feature selection

Rough sets based feature selection methods focus on analyzing the Positive Region, which is the certain part of the data information. This, however, means the data information which lies in the Boundary Region is overlooked. DMQR and DM-TRS exploit a distance metric of the samples in the Boundary Region as part of the feature selection criteria. Compared with other rough sets based feature selection methods, the feature subset determined by DMQR or DM-TRS exhibits a stronger relationship between the selected features and the labels. In this section, we introduce a new algorithm which selects features according to the data information of both the Positive Region and the Boundary Region. In the new method, a class-based distance metric is defined to measure the clustering degree of the samples in the Boundary Region and the Positive Region. With the class-based distance metric, samples in the Boundary Region are as close as possible to the samples having the same label in the Positive Region, so that the probability of classifying samples in the Boundary Region incorrectly decreases. Rough sets cannot evaluate the degree of relevance improvement when a new feature is inserted into the selected subset, so they cannot always reduce feature redundancy efficiently. In this paper, we use MI to eliminate feature redundancy. MI, which is robust to noise, can estimate the similarity of variables. Thus, the solution of the proposed method is less redundant and more relevant.

3.1 Differentiated rough sets

For rough sets based FS methods, the dependency is one of the most important concepts. The dependency is based on the samples in the Positive Region, so it reflects the definitive relevance between features and labels. DMQR and DM-TRS evaluate the uncertain relevance through distance metrics which qualify the samples in the Boundary Region with regard to their proximity to the samples in the P-lower approximations. That is to say, the larger the distance metric is, the higher the likelihood that samples in the Boundary Region belong to the set of interest (Parthaláin and Shen 2009). But the P-lower approximations are not direct objects in the process of FS. Thus, in this paper, we focus on the Positive Region directly.

In some situations, the information concerning the label feature is only partial; that is, the label values of some samples are missing. For rough sets, the label feature is sensitive data which plays an important role in calculating relevance. When we categorize samples, missing labels must be considered.

We can categorize the samples of a dataset into six groups:

  1. Type-1:

    The equivalence class of the sample under the P-indiscernible relation includes only the sample itself, and its label value is observed.

  2. Type-2:

    The equivalence class of the sample under the P-indiscernible relation includes more than one object, its label value is observed, and all the samples in its equivalence class have the same label value.

  3. Type-3:

    The equivalence class of the sample under the P-indiscernible relation includes more than one object, the label values of all samples in its equivalence class are observed, and there exist at least two samples in its equivalence class with different label values.

  4. Type-4:

    The equivalence class of the sample under the P-indiscernible relation includes only the sample itself, and its label value is not observed.

  5. Type-5:

    The equivalence class of the sample includes more than one sample, and its equivalence class contains both samples with observed labels and samples with missing labels.

  6. Type-6:

    The equivalence class of the sample includes more than one sample, and all samples have missing label values.

If some samples have missing labels, the dataset is not complete. In this work, we focus on complete datasets; that is, the label values of all samples are observed.

Theorem 3.1

The samples of Type-1 or Type-2 are in the Positive Region; the samples of Type-3 are in the Boundary Region.

Proof of Theorem 3.1

When a sample x is of Type-1, the equivalence class \(\left[ x \right]_P\) of the P-indiscernible relation contains only the sample itself. According to the definition of Type-1, the label value of x is observed. So there exists \(Z\in U/IND\left( L \right)\) such that x ∈ Z, where U/IND(L) is the partition of U generated by the label feature. As a result, \(\left[ x \right]_P =\left\{ x \right\}\subseteq Z\). According to Definition 2.3, the sample must be in the Positive Region.

When x is of Type-2, \(\left[ x \right]_P\) contains more than one sample, and all samples in \(\left[ x \right]_P\) have the same label value. So there exists \(Z\in U/IND\left( L \right)\) such that \(\left[ x \right]_P \subseteq Z\). Thus, x must be in the Positive Region.

When x is of Type-3, there exist at least two samples in \(\left[ x \right]_P\) with different label values. For every \(Z\in U/IND\left( L \right)\), if \(\exists y\in \left[ x \right]_P\) with y ∈ Z, then \(\exists y^\prime \in \left[ x \right]_P\) with \(y^\prime \notin Z\). So \(x\notin \bigcup\nolimits_{X\in U/IND\left( L \right)} {\underline{P}X}\). Since the label values of all samples in \(\left[ x \right]_P\) are observed, there exists \(Z^\prime \in U/IND\left( L \right)\) such that \(\left[ x \right]_P \cap Z^\prime \ne \emptyset\). As a result, x is in the Boundary Region.□

In rough sets, a missing value is generally treated as any possible value of the same feature. However, this treatment is dangerous for labels. First, a value may be missing either because it could not be obtained or because it is definitively impossible to obtain (Stefanowski and Tsoukias 2001). Second, this treatment decreases the possibility that samples in the Positive Region are classified into the correct class. The samples of Types 4–6 cannot be directly partitioned into regions before redefining U/IND(L). Since incomplete systems are not our objective in this work, their details are not given; more on incomplete systems can be found in Stefanowski and Tsoukias (2001), Grzymala-Busse and Rzasa (2006) and Słowiński and Stefanowski (1989).

Table 1 shows an example elaborating the three types of samples and Theorem 3.1. The example contains 6 samples, among which 3 samples are of class “F” and 3 samples are of class “S”. Sample 1 is of Type-1; samples 2 and 3 are of Type-2; samples 4–6 are of Type-3. So samples 1–3 are in the Positive Region and the others are in the Boundary Region.

Table 1 An example elaborating Theorem 3.1

It is easy to verify that each sample in a complete dataset belongs to exactly one of Type-1 to Type-3, so Theorem 3.2 follows from Theorem 3.1.

Theorem 3.2

For a complete information system, the samples in the Positive Region are of Type-1 or Type-2, and the samples in the Boundary Region are of Type-3.

In the Positive Region, we define the samples that are in the same equivalence class of the P-indiscernible relation to form a cluster-element. The cluster-element of a Type-1 sample contains only the sample itself, so the label of this cluster-element is the same as that of the sample. For a sample of Type-2, the samples in its cluster-element have the same label, so the cluster-element carries the same label as its samples. By this analysis of cluster-elements, the Positive Region has the characteristic of labels. According to the concept of rough sets, the samples in the Positive Region are classified correctly. In theory, an optimal selected feature subset should be able to deliver high classification accuracy. In our work, we require that the samples in the Boundary Region be close to the samples in the Positive Region with the same label. Since the Positive Region has the characteristic of labels, we partition the samples in the Positive Region and the Boundary Region according to their respective class labels.

Definition 3.1

Samples in the Positive Region are divided into subsets where samples have the same class label. It can be denoted as

$$ \label{eq21} POSL_P =\left\{ {POS_P \left( L \right)\cap Z\left| {Z\in U/IND\left( L \right)} \right.} \right\}\textrm{.} $$
(21)

By Definition 3.1, the Positive Region is divided into several subsets according to label feature.

Definition 3.2

The mean set of elements in POSL P is defined as

$$ \label{eq22} POSL_{MP} =\left\{ {\left. {\left( {Mean\left( {PL} \right)_P ,\;L\left( {PL} \right)} \right)} \right|PL\in POSL_P } \right\}\textrm{,} $$
(22)

where

$$ \label{eq23} Mean\left( {PL} \right)_P =\left\{ {\frac{\sum\limits_{x\in PL} {f\left( x \right)} }{\left| {PL} \right|}\left| {\forall f\in P} \right.} \right\}\textrm{.} $$
(23)

\(L\left( {PL} \right)=\left\{ {l\left( x \right)\left| {\forall x\in PL} \right.} \right\}\), where l is the label feature, since we only consider systems with a unique decision feature; ∣PL∣ denotes the number of samples in the current PL. Since all the samples in one PL have the same label by Definition 3.1, the L(PL) of each object in POSL MP contains only one element. Each object in POSL MP has two elements: the first is the mean of the samples in the same PL, and the second is the label set.

Definition 3.3

Samples in the Boundary Region are divided into subsets where samples have the same class label. It can be denoted as

$$ \label{eq24} BNDL_P =\left\{ {BND_P \left( L \right)\cap Z\left| {Z\in U/IND\left( L \right)} \right.} \right\}\textrm{.} $$
(24)

Definition 3.4

Distance metric set is denoted as

$$ \label{eq25} DMS_P =\left\{ {\sum\limits_{x\in BL} {\delta_P \left( {x,y} \right)\left| {BL\in BNDL_P ,\left( {y,L^\prime } \right)\in POSL_{MP} ,L\left( {BL} \right)=L^\prime } \right.} } \right\}\textrm{,} $$
(25)

where δ is a distance function, and \(L\left( {BL} \right)=\left\{ {l\left( x \right)\left| {\forall x\in BL} \right.} \right\}\). \(L\left( {BL} \right)=L^\prime\) means that the elements of these two sets are exactly the same. Since all the samples in one BL have the same label, the element of L(BL) is unique.

Definition 3.5

Class-based distance metric on a feature set (CDM P ) is defined as

$$ CDM_P =\sum\limits_{\delta_P \in DMS_P } {\delta_P^{-1}} \textrm{.} $$
(26)

CDM P is based on POSL P and BNDL P, which are partitioned according to the class labels. Thus, CDM P exhibits the characteristic of labels. For samples with the same label, increasing CDM P moves the samples in the Boundary Region closer to the mean of the samples in the Positive Region. That is to say, the uncertain part of the dataset gets closer to the samples which are sure to be classified correctly. As a result, CDM P evaluates the uncertain relevance between features and labels.

The combination of the class-based distance metric and the dependency is denoted as CDM-D P  = CDM P  + γ P.
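A minimal sketch of Definitions 3.1–3.5 and of CDM-D under stated assumptions: X is a numeric array on the features in P, y the labels, `pos` and `bnd` index collections of the Positive and Boundary Regions, and `gamma` the dependency γ P(L).

    import numpy as np
    from collections import defaultdict

    def cdm_d(X, y, pos, bnd, gamma):
        by_label_pos, by_label_bnd = defaultdict(list), defaultdict(list)
        for i in pos:
            by_label_pos[y[i]].append(i)        # POSL_P, Definition 3.1
        for i in bnd:
            by_label_bnd[y[i]].append(i)        # BNDL_P, Definition 3.3
        cdm = 0.0
        for lab, idx in by_label_bnd.items():
            if lab not in by_label_pos:
                continue                        # no positive samples share this label
            mean_pos = X[by_label_pos[lab]].mean(axis=0)                 # Definition 3.2
            dist_sum = np.linalg.norm(X[idx] - mean_pos, axis=1).sum()   # Definition 3.4
            if dist_sum > 0:
                cdm += 1.0 / dist_sum           # Definition 3.5
        return cdm + gamma                      # CDM-D_P = CDM_P + gamma_P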

3.2 Forward feature selection process with CDM-D and MI

Rough sets based feature selection relies on the estimated relevance between features and labels to find a feature subset, but the degree to which relevance increases when a new feature is added to the subset is not considered. The FSC proposed in Chow and Huang (2005) can estimate the similarity between the feature subset S and a single feature f m . The FSC is defined as

$$ FSC\left( {f_m } \right)=\mathop{\max}\limits_{f_i \in S} \left( {\frac{I\left( {f_m ;f_i } \right)}{H\left( {f_i } \right)}} \right)\textrm{.} $$
(27)

When FSC(f m ) is large enough, i.e., \(FSC\left( {f_m } \right)\ge \theta \), the feature f m can be considered redundant with respect to S and should not be added to S. Throughout this paper, θ is set to 0.95. In order to verify the effectiveness of the similarity estimation, a comparison of CDM-D on different feature subsets of the LED dataset, which is highly redundant, is presented in Section 4. Without the FSC, redundant features are inserted into the selected subset, so CDM-D increases slowly. In contrast, redundant features can be eliminated by the FSC, enabling CDM-D to reach its maximum much faster.
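A small sketch of the FSC check of (27), assuming a `discrete_mi` helper such as the one sketched in Section 2.1; quadratic MI can be substituted to reduce the cost, as discussed above.

    import numpy as np

    def entropy(x):
        """Shannon entropy of a discrete sample vector, formula (2)."""
        x = np.asarray(x)
        probs = np.array([np.mean(x == v) for v in np.unique(x)])
        return -np.sum(probs * np.log(probs))

    def is_redundant(f_m, S, theta=0.95):
        """True if the candidate feature f_m is redundant for the selected set S."""
        ratios = [discrete_mi(f_m, f_i) / entropy(f_i)
                  for f_i in S if entropy(f_i) > 0]
        return bool(ratios) and max(ratios) >= theta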

The proposed RSMI-FS is a forward process. It begins with an empty feature set, and additional features are included one by one. The process of FS is guided by the dependency of each feature; that is, the initial feature sequence is determined by the individual dependencies. Based on CDM-D and the FSC, the RSMI-FS algorithm is realized as follows, and its flow diagram is illustrated in Fig. 1.

Fig. 1
figure 1

Flow diagram of RSMI-FS

figure b
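The following condensed sketch shows one way to read the forward process described above and in Fig. 1; it is not the authors' pseudocode. Features are ranked by their individual dependency, skipped when the FSC flags them as redundant, kept only when they increase the current CDM-D, and the loop stops once the dependency of the selected subset reaches that of the full feature set. It reuses the `dependency`, `positive_region`, `cdm_d` and `is_redundant` helpers sketched earlier, and the data layout (a dict of value lists plus a numeric array) is our own assumption.

    def rsmi_fs(data, X_all, y, features, label, n, theta=0.95):
        order = sorted(features, key=lambda f: dependency(data, [f], label, n),
                       reverse=True)                 # heuristic ranking by dependency
        full_gamma = dependency(data, features, label, n)
        R, best_cdmd = [], 0.0
        for f in order:
            if is_redundant(data[f], [data[g] for g in R], theta):
                continue                             # FSC: f is similar to a selected feature
            cand = R + [f]
            gamma = dependency(data, cand, label, n)
            pos = positive_region(data, cand, label, n)
            bnd = set(range(n)) - pos                # complete data: the rest is the Boundary Region
            cols = [features.index(g) for g in cand]
            score = cdm_d(X_all[:, cols], y, pos, bnd, gamma)
            if score > best_cdmd:
                R, best_cdmd = cand, score           # keep f only if CDM-D increases
            if dependency(data, R, label, n) >= full_gamma:
                break                                # stopping criterion
        return R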

4 Results and discussion

In this study, 2 synthetic datasets and 6 real datasets were used. The 2 synthetic datasets verify the correctness of the proposed RSMI-FS. The first synthetic dataset highlights the computational time advantage of RSMI-FS in finding an optimal subset. The highly redundant LED domain dataset is used to verify the capability of dealing with redundancy and irrelevance in the feature set. For the 6 real datasets, five classifiers, namely Naïve Bayes, NNge, DTNB, OneR, and JRIP, are adopted to evaluate the classification accuracies of the selected subsets. A comparative analysis among RSMI-FS, QUICK REDUCT, DMQR, and DM-TRS is based on subset size and classification accuracy. Additionally, for RSMI-FS, the difference between Mahalanobis distance and Euclidean distance is discussed. Then, three MI based methods (FS-RAW-MI (Bonnlander 1996), PWMI (Kwak and Choi 2002b), and OFS-MI (Chow and Huang 2005)), two fuzzy rough methods (Attribute Selection with Fuzzy Decision Reducts (Cornelis et al. 2010a) and the Vaguely Quantified Rough Sets based method (VQRS) (Cornelis and Jensen 2008)), and typical filter methods (Relief (Kononenko 1994; Robnik-Sikonja and Kononenko 1997) and the filter method based on correlation-based feature subset selection (Filter-CFS) (Hall 1998)) are also compared with RSMI-FS.

4.1 Feature subset evaluation index

In this section, five classifiers are employed to evaluate the classification accuracies of the selected subsets. They are Naïve Bayes, NNge, DTNB, OneR, and JRIP which are briefly outlined below.

Naïve Bayes (John and Langley 1995) is based on the Bayesian classifier. For a Bayesian network, modeling the probability mass function is an important problem when the variables are continuous. Most works solve the problem by discretization or by assuming that the data are generated by a Gaussian distribution, but this assumption is not suitable for certain domains. Naïve Bayes employs kernel estimation to approximate more complex distributions.

NNge (Martin 1995) is an instance-based learning classifier that classifies new examples by comparing them to those already known. A main problem of instance-based learners is that running time increases as more examples are used for learning. NNge proposes a non-nested generalized exemplar to solve this problem, and represents more useful rules. With the generalized exemplar, NNge reduces the role of distance function to determine the class, and decreases the classification errors caused by the inaccuracies of distance function.

The combined model DTNB (Hall and Frank 2008) investigates a semi-Naïve Bayesian ranking method that combines Naïve Bayes with the induction of decision tables. DTNB splits the set of features into two groups: the class probabilities of one group are estimated by Naïve Bayes, those of the other group by a decision table, and the resulting probability estimates are combined.

The OneR algorithm (Holte 1993) is based on the observation that simple rules achieve highly accurate results on most datasets. OneR uses a system, called 1R, whose input is a set of training examples and whose output is a 1-rule. OneR constructs a relatively small set of candidate rules and selects one of them. It has been verified that, on many datasets, the performance of OneR is highly competitive with more complex methods in the context of machine learning research and applications.

JRIP (Cohen 1995) learns propositional rules by growing rules and pruning them to produce error reduction. Antecedents are added greedily during the growing phase until a termination condition is satisfied; the antecedents are then pruned according to a pruning metric. When the rule set is generated, an optimization step is performed to evaluate the rules.

4.2 Synthetic data set of varying size

Four-dimensional synthetic datasets F = {f 1, f 2, f 3, f 4} are generated in this experiment. The datasets consist of 100, 500, 1,000, 2,000, 3,000, and 4,000 samples, respectively. For each dataset, the samples belong to two classes {1, 2}, and each class contains half of the samples. The input variables f 1 and f 2 are generated from the following two binomial distributions:

  1. Class 1:

    (half samples of the dataset) \(\left\{ {f_1 ,f_2 } \right\}\sim\textrm{binornd}\left( {\left[ {10,10} \right],\left[ {0.5,0.5} \right]} \right)\)

  2. Class 2:

    (half samples of the dataset) \(\left\{ {f_1 ,f_2 } \right\}\sim\textrm{binornd}\left( {\left[ {20,20} \right],\left[ {0.5,0.5} \right]} \right)\).

The input variable f 3 equals the sum of f 1 and f 2, and f 4 equals 2×f 2. Obviously, the input variables f 1 and f 2 are more important than f 3 and f 4 in this experiment, so a good feature selection method should select the subset {f 1, f 2}.
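For reproducibility, here is a sketch of how these synthetic datasets can be generated; numpy's binomial sampler is used as the equivalent of MATLAB's binornd.

    import numpy as np

    def make_synthetic(n_samples, seed=0):
        rng = np.random.default_rng(seed)
        half = n_samples // 2
        f1 = np.concatenate([rng.binomial(10, 0.5, half), rng.binomial(20, 0.5, half)])
        f2 = np.concatenate([rng.binomial(10, 0.5, half), rng.binomial(20, 0.5, half)])
        f3 = f1 + f2                 # redundant: sum of f1 and f2
        f4 = 2 * f2                  # redundant: scaled copy of f2
        labels = np.concatenate([np.ones(half, dtype=int), 2 * np.ones(half, dtype=int)])
        return np.column_stack([f1, f2, f3, f4]), labels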

Table 2 shows the selection results of the compared methods. In this section, RSMI-FS uses the Euclidean distance to calculate CDM-D. The table indicates that RSMI-FS, ORS, and Relief are able to identify the relevant features correctly. ORS (Bazan et al. 2000), which can also find an optimal subset, is introduced to compare computational time with RSMI-FS. Relief can find the correct subset, but it needs the prior knowledge that 2 features are suitable. QUICK REDUCT finds the correct subset when the number of samples is 1,000. DMQR finds the correct subsets when the numbers of samples are 500 and 2,000. Also, it is noticed that DM-TRS with tolerance value 0.9, FS-RAW-MI, PWMI, and OFS-MI are not able to obtain the correct results on every dataset. For the Fuzzy Decision Reducts method, α = 0.95 is set in this work, a suitable choice verified in Cornelis et al. (2010a). For the VQRS based method, 0 and 0.8 are employed to define the VQRS Positive Region (Cornelis and Jensen 2008). The values of the current CDM-D for different numbers of samples are shown in Fig. 2. When the selection process stops according to its stopping criterion, no additional features are selected into the subset by RSMI-FS. Although the remaining features are not deleted from the initial feature set, the values of the current dependency and CDM-D do not change after the process stops. We find that, as the number of samples increases, the maximum of the related CDM-D in Fig. 2 decreases. This is because the number of samples that have the same feature values but different labels is likely to be larger when the dataset is larger.

Table 2 Results of feature selection on the synthetic dataset
Fig. 2
figure 2

CDM-D of the datasets with increasing samples. ∣samples∣ denotes the number of samples in the dataset

Figure 3, which illustrates the average computational time for different numbers of samples over 5 trials, shows that the computational time cost of RSMI-FS is not significantly larger than that of QUICK REDUCT, DMQR, and DM-TRS. ORS can find the correct feature subset for datasets with different numbers of samples, but its time cost is substantially larger than that of the other four rough sets methods. This is because each new feature found by ORS must, in combination with the features already selected, give the largest increase in dependency; that is, all features not yet selected must be considered in each selection step. As a result, the number of features plays a significant role in the computational time cost, which is illustrated in Fig. 4. In our study, with the number of samples set at 1,000, feature counts ranging from 4 to 20 are used for a running time comparison. The additional features are generated by linear combinations of f 1 and f 2, similar to f 3 and f 4. In Fig. 4, the running time of ORS increases dramatically with the number of features.

Fig. 3
figure 3

Comparison of computational time cost with increasing number of samples

Fig. 4
figure 4

Computational time cost with increasing features

4.3 Highly redundant LED display domain dataset

The LED display domain dataset has 24 features, in which the first 7 features determine the concept of the LED display, whilst the remaining 17 features are redundant (Aha 1992). In this work, 1,000 samples are generated. Given this prior knowledge, a subset consisting of the first 7 features is a good result. Table 3 shows the presence of the first 7 features in the feature subsets found by the FS methods. In Table 3, f i represents a feature in the selected feature subset, and (f i ) represents a feature which is not selected. Since there are other features in their results, QUICK REDUCT, DMQR, and DM-TRS cannot avoid redundancy in the selected feature sets. The MI based methods, Fuzzy Decision Reducts, and Filter-CFS can deal with redundancy of the feature set, but relevant features are also ruled out from the selected subsets. In contrast, RSMI-FS, Relief, and VQRS find the correct feature subset, which contains only the first 7 features.

Table 3 Results of feature selection on LED dataset

In order to show the necessity of MI for RSMI-FS, the increase of the current CDM-D with the number of features is shown in Fig. 5. When the selection process of RSMI-FS stops, all the samples are in the Positive Region; hence, the related maximum of CDM-D in Fig. 5 is 1. When feature similarities are not evaluated by MI, 15 features must be selected in order to partition all samples into the Positive Region. In contrast, only 7 features are in the subset found by RSMI-FS, as shown in Table 3. This comparison verifies that MI is necessary for RSMI-FS.

Fig. 5
figure 5

Comparison of CDM-D between RSMI-FS and the method without MI

4.4 Comparison of RSMI-FS with rough sets based methods

This section shows the results of experimental studies using 5 UCI datasets and the YALE dataset (Georghiades et al. 2001). The data in Table 4 show the classification accuracies of the 5 classifiers, expressed as percentages. The setup of the experiments is the same as in Parthaláin and Shen (2009) and Parthaláin et al. (2010), where classification using 10-fold cross validation is initially performed on the unreduced dataset, followed by the reduced datasets. In this section, the results of RSMI-FS with Euclidean distance and with Mahalanobis distance are shown to evaluate the effect of different distance functions. The results of DM-TRS with tolerance values 0.8 and 0.9 are both considered.

Table 4 Classification accuracies of FS methods based on rough sets

In Table 4, the classification accuracies of RSMI-FS are larger than or comparable to those of the related unreduced set. This verifies that FS is useful for finding an indicative subset and removing measurement noise. For Small Soybean, Heart, and Breast Tumor Diagnosis, the performance of RSMI-FS is better than the other methods in almost all cases. In particular, for Small Soybean and Heart, the subsets of RSMI-FS perform best for every classifier. Although the performances of RSMI-FS and QUICK REDUCT are the same on Small Soybean, the subset selected by QUICK REDUCT contains 2.5 times as many features as that of RSMI-FS, as shown in Table 5. Compared with the other methods, the improvement in classification is more than 21.2766 %. For YALE, the best performance is achieved by DM-TRS (tolerance value 0.9), but 86 features are selected into the subset, which is 13 times larger than the result of RSMI-FS. The computational time cost of DM-TRS is 3827.3 s, whereas the cost of RSMI-FS with Euclidean distance and with Mahalanobis distance is just 396.4 and 391.7 s, respectively.

Table 5 Numbers of features selected by rough sets based methods

For FS methods based on rough sets, the dependency evaluates the definitive relevance of the features to the labels. Figure 6 shows the values of the current dependency with increasing numbers of features. For the rough sets based methods, when the selection process stops according to its stopping criterion, no additional features are selected. Although the remaining features are not deleted from the initial feature set, they are not useful for increasing the dependency. In Fig. 6, RSMI-FS (ED) and RSMI-FS (MD) indicate that RSMI-FS uses Euclidean distance and Mahalanobis distance, respectively, to calculate the distance metric. Figure 6 shows that the maximal dependencies of RSMI-FS are always larger than those of DMQR and DM-TRS. Compared with QUICK REDUCT, which uses the dependency as its selection criterion, RSMI-FS reaches its maximum value at a faster rate. On SPECT, the dependency of QUICK REDUCT is larger than that of RSMI-FS, but its accuracies shown in Table 4 are not larger than those of RSMI-FS. According to these results, the features selected by RSMI-FS are more relevant and less redundant.

Fig. 6
figure 6

Values of dependencies with increasing features, a Small Soybean b SPECT c SPECTF d Heart e Breast Tumor f YALE

In this work, we use the Euclidean distance and the Mahalanobis distance, which are widespread in machine learning, to calculate the CDM-D of RSMI-FS. Their difference in classification accuracies shown in Table 4 ranges from 0 to 2.99 %. Figure 7 shows the values of the current CDM-D with increasing numbers of features. It can be observed that the difference in CDM-D is not distinctive. As a result, we conclude that, in RSMI-FS, the difference between the Euclidean distance and the Mahalanobis distance is not significant.

Fig. 7
figure 7

Values of CDM-D with increasing features, a Small Soybean b SPECT c SPECTF d Heart e Breast Tumor f YALE

4.5 Comparison of RSMI-FS with MI based methods

In this section, the results of RSMI-FS are compared with three other MI based FS methods, namely FS-RAW-MI, PWMI, and OFS-MI. Table 6 shows their classification accuracies. For Small Soybean, Heart, and Breast Tumor, the accuracies of the subsets found by RSMI-FS with the 2 distance functions are larger than the other methods in almost all cases. For Heart, the accuracies of PWMI with Naïve Bayes and DTNB are larger than those of RSMI-FS, but the number of features found by PWMI is 13, as shown in Table 7; that is, PWMI selects all features of Heart, so it plays no role on this dataset. For YALE, the performance of the MI based methods is better than RSMI-FS, but the difference is insignificant. RSMI-FS selects the most relevant features, so its number of selected features is considerably smaller than that of the MI methods. Moreover, the computational time cost of RSMI-FS is significantly less than that of the other three methods, the difference ranging from 9,504 s to 24,706 s.

Table 6 Classification accuracies of RSMI-FS and MI based methods
Table 7 Numbers of features found by RSMI-FS and MI methods

The numbers of features found by RSMI-FS are consistently smaller than those of the MI methods, as shown in Table 7, while the classification accuracies detailed in Table 6 are larger or comparable. The findings in Sections 4.4 and 4.5 confirm that RSMI-FS is able to find a more relevant and less redundant feature subset than the rough sets or MI based FS methods.

4.6 Comparison of RSMI-FS with other methods

In this section, RSMI-FS is compared with two fuzzy rough methods and two typical non-rough-sets methods. Fuzzy sets and rough sets are two natural computing approaches for dealing with inconsistency and uncertainty in data in a human-like fashion (Jensen and Cornelis 2011b); their difference lies in the type of uncertainty and in the way they deal with it (Jensen and Cornelis 2011a). In this work, Attribute Selection with Fuzzy Decision Reducts (Cornelis et al. 2010a) and the Vaguely Quantified Rough Sets based method (VQRS) (Cornelis and Jensen 2008) are compared with RSMI-FS. By setting the degree of reducthood α, the subset found by Fuzzy Decision Reducts can provide a comparable accuracy with a small number of features. In this work, α is set to 0.95, a suitable value introduced in Cornelis et al. (2010a); the other details of the setup of this method are the same as in Cornelis et al. (2010a). Rough set approximations are abrupt in the sense that adding or omitting a single element may drastically alter the outcome of the approximations, so any misclassified object prevents rough sets from making any conclusive statement about all objects related to it. The VQRS based method defines a smoother region of tolerance towards classification errors to reduce this kind of negative impact; the parameters of the region are set to 0 and 0.8 as employed in Cornelis and Jensen (2008). Relief (Kononenko 1994; Robnik-Sikonja and Kononenko 1997) gives each feature a relevance weighting that reflects its ability to discern class labels; it is typically used in conjunction with a feature ranking method to select features. In order to compare its performance with our method, the number of its selected features is set to be the same as that of RSMI-FS; when the numbers of features selected by RSMI-FS with the two distance functions differ, the larger one is used for Relief. The filter method based on correlation-based feature subset selection (Filter-CFS) (Hall 1998) runs a subset evaluator on the data passed through a resample filter. The subset evaluator is correlation-based feature subset selection, which evaluates the worth of a feature subset by considering the individual predictive ability of each feature along with the degree of redundancy among the features. Filter-CFS employs linear forward search (Guetlein et al. 2009), an extension of best first.

Table 8 shows the accuracy performance. For Small Soybean, the accuracies of the subsets found by RSMI-FS, VQRS, and Fuzzy Decision Reducts are the same with Naïve Bayes, DTNB, NNge, and JRIP. With OneR, the accuracy of RSMI-FS is 4.26 % higher than VQRS and Fuzzy Decision Reducts. For the same dataset, compared with Relief, the improvement of RSMI-FS is more than 23.4 %. For SPECTF, RSMI-FS obtains the largest classification accuracies with DTNB and OneR, and with JRIP on SPECTF, RSMI-FS with Euclidean distance is better than the other methods. Moreover, RSMI-FS selects no more features than the others, as shown in Table 9. In particular, the features selected by Filter-CFS are 4.7 and 7 times as many as those of RSMI-FS with the two distance functions. For the SPECT, Heart, and Breast Tumor datasets, the performance of RSMI-FS is better than or comparable with the others. Because of the memory cost, the fuzzy rough methods, Relief, and Filter-CFS cannot finish the FS process on YALE.

Table 8 Classification accuracies of RSMI-FS and other methods
Table 9 Numbers of features found by RSMI-FS and other methods

5 An application of RSMI-FS in Alzheimer’s disease

Alzheimer’s disease (AD) was first identified more than 100 years ago, but research on its symptoms, causes, risk factors, and treatment has gained momentum only in the past 30 years. Until now, the precise physiological changes triggering the development of AD remain largely unknown (Alzheimer’s Association 2012). In this section, we employ a machine learning method to find the characteristics of AD.

In order to provide data for current research initiatives, the National Alzheimer’s Coordinating Center of the US has developed a dataset which includes standardized clinical and cognitive data. The dataset is not hypothesis-driven, since all clinical data are collected according to a uniform assessment and diagnosed by all participating AD centers. Research on this dataset is therefore useful for mining the factors triggering AD.

RSMI-FS is a suitable tool for finding the relevant factors. Rough sets analyze data in a human-like fashion (Jensen and Cornelis 2011b), which is consistent with the construction of the dataset and the process of diagnosing AD. Although rough sets based FS methods are filter methods, the central concept of rough sets is classification, so they can find subsets which improve classification accuracy without a dramatic addition of computational cost. Our method combines the advantages of rough sets and MI, so the subset selected by RSMI-FS is more relevant and less redundant; that is, RSMI-FS can find the most important risk factors. In order to find more information about AD, the algorithm stops only when the dependency of the selected features reaches its maximum. In Section 4, it is verified that the difference between the Euclidean distance and the Mahalanobis distance is not significant, so only the Euclidean distance is used.

The dataset has 11,053 samples and 171 features, each sample describing one person. Features whose proportion of missing values is larger than 70 % are removed from the dataset; after this reduction, only 60 features remain. The missing values of the remaining features are estimated by a Self-Organizing Map; details of the Self-Organizing Map are in Kohonen (1982).
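A small sketch of the screening step just described (the Self-Organizing-Map imputation is not shown); representing the data as a pandas DataFrame with missing entries encoded as NaN is our own assumption.

    import pandas as pd

    def drop_sparse_features(table: pd.DataFrame, max_missing=0.7) -> pd.DataFrame:
        missing_ratio = table.isna().mean()              # per-feature missing proportion
        keep = missing_ratio[missing_ratio <= max_missing].index
        return table[keep]                               # keep features with <= 70 % missing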

In the process of mining factors triggering AD, three aspects must be considered:

  1. 1.

    The quality of results obtained by machine learning method.

  2. 2.

The number of samples in the dataset. It should be large enough to mine credible factors which trigger the development of AD.

  3. 3.

The role of the label distribution. The proportion of samples with the disease is 70.52 %, so it must play a role in classification accuracy.

The number of samples should be large enough to provide credible results, so the variation of classification accuracies with different numbers of samples is evaluated. Sets of 100, 500, 1,000, 2,000, 3,000, and 4,000 samples are respectively used as experimental objects. In order to avoid contingency, for each number of samples, 10 different sets are constructed by random sampling. The setup of classification on each subset found by RSMI-FS is the same as in Section 4.4. Table 10 shows the details of the results, where accuracy is expressed as a percentage. For each number of samples, there are 50 classification accuracies, arising from 10 samplings and 5 classifiers. It is hard to directly analyze the variation of classification accuracies as the number of samples increases, so the mean of the 50 accuracies is used, presented as “Average of the same size” in Table 10. We find that the difference among the means for 1,000, 2,000, 3,000, and 4,000 samples ranges only from 0.484 % to 3.175 %. This guarantees that the samples in the dataset are sufficient to provide precise information. In each row of Table 10, the average accuracy is the average over Naïve Bayes, DTNB, JRIP, NNge, and OneR for each sampling set. The average accuracy facilitates the performance evaluation of each sampling set and is used in the following experimental step.

Table 10 Results of AD dataset

The proportion of samples with AD in the dataset is 70.52 %. It is more dangerous for a person with AD to be diagnosed as healthy, so the center mainly focuses on people with the disease. But the results of FS should serve all kinds of people; that is, the proportion of AD in tested people is unknown, and it is quite possible that healthy people outnumber people with the disease. In order to evaluate the effect of testing sets with different proportions of AD, the setup of the new classification is as follows: choose a selected subset from the above results as the training set; then a testing set with the same number of samples is formed from the samples outside the training set. The proportion of samples with AD in the testing set is fixed before random sampling. For each training set, we have 5 testing sets with AD proportions of 0.9, 0.7, 0.5, 0.3, and 0.1, respectively. The training sets are chosen as follows: for each number of samples, the selected subsets with the largest and the smallest average accuracy are both chosen as training sets, and one of the 8 others is randomly picked as the third training set. Obviously, all the training sets are results of FS, so the features of a testing set are reduced according to the related training set before classification. A direct classifier, 1NN, is employed in Fig. 8 to avoid the influence of the classifiers. In Fig. 8, the proportion of samples with AD plays a significant role in accuracy: as the proportion increases, the accuracies increase. This result reflects the agreement of label distributions between training and testing sets. It must be noted that, for the AD dataset, the indicative ability for samples with the disease is more important, so Fig. 9 shows the classification accuracies of the samples with the disease in the testing sets. It can be observed that the variation of accuracies with the proportion in Fig. 9 is not significant. According to the analysis above, RSMI-FS provides quality results.
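A sketch of how a testing set with a fixed AD proportion can be drawn, as in the setup above; the index arrays and parameter names are hypothetical, not the authors' code.

    import numpy as np

    def make_test_set(pos_idx, neg_idx, train_idx, size, prop, seed=0):
        """Sample a test set of `size` samples with an AD-positive fraction of `prop`."""
        rng = np.random.default_rng(seed)
        pos_pool = np.setdiff1d(pos_idx, train_idx)     # AD samples not used for training
        neg_pool = np.setdiff1d(neg_idx, train_idx)     # healthy samples not used for training
        n_pos = int(round(size * prop))
        test_pos = rng.choice(pos_pool, n_pos, replace=False)
        test_neg = rng.choice(neg_pool, size - n_pos, replace=False)
        return np.concatenate([test_pos, test_neg])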

Fig. 8
figure 8

Variation of classification accuracy under proportion of AD. (a1) the training set is one with smallest average accuracy among experiments of 100 samples; (a2) the training set is one with largest average accuracy among experiments of 100 samples; (a3) the training set is picked randomly among other 8 experiments of 100 samples. (b1)–(b3) the results of 500 samples are shown in the same way of 100 samples; (c1)–(c3) 1,000 samples (d1)–(d3) 2,000 samples. (e1)–(e3) 3,000 samples. (f1)–(f3) 4,000 samples

Fig. 9
figure 9

Classification accuracy of samples with the disease. The setup is the same as in Fig. 8

6 Conclusion

Feature selection aims to find a small feature subset to represent a given high dimensional dataset. In this paper, we propose a new feature selection method (RSMI-FS) which is based on rough sets theory and mutual information. Owing to the theoretical contributions of the CDM-D and FSC criteria, the proposed feature selection method offers three desirable properties over existing rough sets methods. First, the feature subset considers both the definitive and the uncertain relevance with the labels: the definitive relevance is calculated on the samples in the Positive Region, and the uncertain part reflects the relevance information provided by the samples in the Boundary Region. This enables RSMI-FS to find a strongly indicative subset of a dataset. Second, the proposed method hybridizes the concept of rough sets with mutual information so that the redundancy of features is reduced by the FSC. Third, using the dependency of each feature as heuristic information, the results of RSMI-FS are optimal or near-optimal feature subsets. More importantly, the running time of RSMI-FS is comparable to the other rough sets based methods.