1 Introduction

Fuzzy rule-based classification systems produce fuzzy models whose output is crisp and discrete. The ability to work with imprecise data and missing values, together with the human-understandable form of the acquired knowledge, are the main advantages of fuzzy models (Mansoori et al. 2008; Marin-Blazquez and Shen 2002). Basically, designing a fuzzy rule-based classifier amounts to finding a compact set of fuzzy if–then rules able to model the input–output behavior of the system. Many approaches for generating fuzzy classification rules from data have been proposed in the literature, including heuristic approaches (Mansoori et al. 2007; Ishibuchi and Yamamoto 2004; Ishibuchi and Nakashima 2001), neuro-fuzzy techniques (Nauck and Kruse 1997; Almaksour and Anquetil 2011), association rule discovery (Alcala-Fdez et al. 2011a), genetic algorithms (Mansoori et al. 2008), and evolving systems (Iglesias et al. 2010; Lughofer and Buchtala 2013; Angelov et al. 2008).

In high-dimensional problems, the rule base of a fuzzy classification system would contain too many rules (Rehm et al. 2007). Reducing the search space of fuzzy rules in the design phase of classifiers is therefore an important concern, and many methods have been suggested for it. For example, rule reduction methods using neural networks (Halgamuge and Glesner 1994), clustering techniques (Chiu 1994) and similarity measures (Setnes et al. 1998) have been recommended. There are also GA-based methods for selecting a set of cooperative rules among the candidate rules (Cordon et al. 1999; Roubos and Setnes 2000). Feature weighting is another technique, which smoothly decreases the influence of less relevant features in the rules (Lughofer 2011). However, in high-dimensional problems, obtaining a small and efficient rule base is difficult and the interpretability of the system cannot be guaranteed (Alcala-Fdez et al. 2011a; Casillas et al. 2001).

Another suggested approach for reducing the search space is to generate fuzzy rules with restricted antecedent conditions (Ishibuchi and Murata 1997). In this regard, some effective features of a high-dimensional problem should be selected. Feature selection, or data dimensionality reduction, refers to the process of identifying a few, yet more important, variables or features which help in predicting the outcome. Feature selection has many potential benefits: facilitating data visualization and understanding, reducing the measurement and storage requirements, decreasing the training and utilization times, and avoiding the curse of dimensionality to improve the prediction performance.

In general, feature selection methods can be grouped into three categories: filters, wrappers and embedded models. Filters score all features in a preprocessing stage and then select the best ones. Wrappers select some feature sets and then evaluate them via the designed classifiers. The embedded methods, however, are specific to the selected learning machines (Guyon and Elisseeff 2003; Tuv et al. 2009), and feature selection takes place during their training step. Common feature selection approaches include the Fisher criterion (Fisher 1936), fuzzy entropy (Lee et al. 2001; Shie and Chen 2007) with similarity measures (Luukka 2011), the mRMR method (Peng et al. 2005), and mutual information (MI) (Estevez et al. 2009). However, computing Shannon's MI between high-dimensional vectors is impractical because the number of required samples, and hence the CPU time, is high (Estevez et al. 2009).

Feature selection methods can also be viewed from another perspective. Traditional algorithms select the features for all classes in common, while class-specific feature selection approaches try to find a subset of features for each class separately. Using class-specific feature selection methods, a better discrimination of classes has been obtained in most cases (Pineda-Bautista et al. 2011). Recently, some feature selection methods have been proposed that combine fuzzy approaches with other techniques. Neuro-fuzzy solutions (Chakraborty and Pal 2004) and genetic feature selection methods (Yang and Honavar 1998; Casillas et al. 2001) constitute most of the research in this field; however, computational complexity is their major difficulty.

Building on the second of the above advantages of fuzzy models, the human-understandable form of the acquired knowledge, we propose a fuzzy feature selection algorithm in this paper. Its aim is to choose the more relevant features: those which can distinguish the distinct classes well. Our method is a class-specific approach which tries to find a subset of features for each class separately. It combines the interclass distance concept [as in Fisher (1936)] with the compatibility degree of data in some predefined fuzzy sets on each feature to evaluate that feature. Since our method processes each feature individually, it can be applied efficiently to high-dimensional data.

Our approach also selects some suitable fuzzy sets for each dimension in order to obtain a good (small and hence more interpretable) set of rules. Moreover, a new criterion for evaluating the capability of each candidate rule in discriminating the positive and negative patterns is introduced. It leads to selecting more powerful rules, which results in a more efficient rule base.

The rest of this paper is organized as follows. In Sect. 2, the general design of fuzzy rule-based classification systems is explained. Our fuzzy method for feature selection is described in Sect. 3. In Sect. 4, we explain our method for designing fuzzy classifiers. The experimental results are presented in Sect. 5. Section 6 concludes the paper.

2 General design of fuzzy rule-based classification systems

Consider a classification problem with a data set of m patterns, \(DS = \{(X_p, y_p),\; p = 1, \ldots, m\}\). For the pth pattern, the input vector of variables, \(X_p\), is n-dimensional; that is, \(X_p = [x_{p1}, \ldots, x_{pn}]\) with feature labels \(\{f_i,\; i = 1, \ldots, n\}\). The output variable, \(y_p\), is a class label among M classes such that \(y_p \in \{c_1, \ldots, c_M\}\). We assume that each input variable, \(x_{pi}\), is rescaled to the unit interval [0, 1] using a linear transformation that preserves the distribution of the data set.
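As a minimal illustration (a sketch in Python, although the paper's own experiments use MATLAB; the function name is ours), this per-feature linear rescaling can be written as:

```python
import numpy as np

def rescale_unit_interval(X):
    """Min-max rescale each feature (column) of X to [0, 1].

    A per-feature linear transformation; it preserves the shape of the
    data distribution, as assumed in the text."""
    X = np.asarray(X, dtype=float)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)  # guard constant features
    return (X - mins) / span
```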

In this paper, the classical single model architecture of fuzzy classifiers is utilized to handle the multiclass classification problems. The benefits of this model are simplicity, transparency and more interpretability of the designed classifiers (Lughofer and Buchtala 2013). In this model, the general form of fuzzy if–then rules is:

$${\text{Rule}}\,R_{j} :\,{\text{if}}\,x_{1} \,{\text{is}}\,A_{j1} \,{\text{and}}\, \ldots \,{\text{and}}\;x_{n} \,{\text{is}}\,A_{jn} \,{\text{then}}\,{\text{class}}\,C_{j} ,\quad {\text{for}}\;j = 1, \ldots ,N$$
(1)

where \(X = [x_1, \ldots, x_n]\) is an input vector, \(A_{ji}\) (i = 1, …, n) indicates the fuzzy set on variable \(x_i\) in the antecedent part of \(R_j\), \(C_j\) is the consequent class (that is, \(C_j \in \{c_1, \ldots, c_M\}\)), and N is the number of fuzzy rules. Herein, the fuzzy rule \(R_j\) is abbreviated as \(A_{j} \Rightarrow {\text{class }}C_{j}\) where \(A_j = A_{j1} \times \cdots \times A_{jn}\). Generally, designing a fuzzy classifier can be described as generating a set of N fuzzy rules in the form of (1).

The first step in generating fuzzy rules is partitioning the pattern space into fuzzy subspaces. If a subspace contains some patterns, a fuzzy rule will refer to it. Partitioning is usually done using K suitable membership functions. Triangular membership functions are the most common type because they are simple and easily understandable by humans; moreover, under some assumptions, fuzzy partitions built from triangular membership functions lead to entropy equalization (Pedrycz 1994). Figure 1 shows these membership functions for four different values of K. Though up to five membership functions per variable is common in generating fuzzy classification rules, the number of entities a human can reliably handle is seven to nine at most, so this is often used as an upper bound on the number of fuzzy sets in fuzzy modeling techniques (Gacto et al. 2011).

Fig. 1 Fourteen fuzzy sets of each input variable
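The exact parameterization of the 14 fuzzy sets is given only graphically in Fig. 1; a plausible reconstruction, assuming the standard uniform grid partitions with K = 2, 3, 4, 5 used in Ishibuchi-style rule generation (the helper names are ours), is sketched below:

```python
def triangular_mf(center, width):
    """Return a symmetric triangular membership function on [0, 1]."""
    return lambda x: max(0.0, 1.0 - abs(x - center) / width)

def build_fuzzy_sets():
    """The 14 fuzzy sets of Fig. 1, assuming uniform triangular
    partitions with K = 2, 3, 4, 5 functions (2 + 3 + 4 + 5 = 14).

    Centers sit at k/(K-1) and the half-width equals the spacing,
    so adjacent functions cross at membership 0.5."""
    sets = []
    for K in (2, 3, 4, 5):
        width = 1.0 / (K - 1)
        sets.extend(triangular_mf(k * width, width) for k in range(K))
    return sets  # list of 14 callables, L_1 ... L_14
```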

For the problem of generating fuzzy classification rules, some approaches have been suggested in Mansoori et al. (2007) and Ishibuchi and Yamamoto (2004). The approach in Ishibuchi and Yamamoto (2004) applies the fuzzy set don't care (with membership function \(\mu_{don't\,care} (x_{i} ) = 1,\quad \forall x_{i} \in [0,1]\)) besides the 14 triangular fuzzy sets in Fig. 1. Using this don't care fuzzy set for a variable in the antecedent part of a rule effectively removes that variable and so reduces the length of the rule.

The consequent class \(C_j\) of fuzzy rule \(R_j\) in (1) is determined using the patterns in the corresponding fuzzy subspace. The compatibility grade of a training pattern \(X_p = [x_{p1}, \ldots, x_{pn}]\) with the antecedent part \(A_j = A_{j1} \times \cdots \times A_{jn}\) of rule \(R_j\) is defined as:

$$\mu_{j} (X_{p} ) = \prod\limits_{i = 1}^{n} {\mu_{ji} (x_{pi} )}$$
(2)

where \(\mu_{ji}(x_i)\) is the membership function of the antecedent fuzzy set \(A_{ji}\) on variable \(x_i\). One method for selecting the consequent class of a rule is based on confidence (Ishibuchi and Yamamoto 2004; Bouchachia and Mittermeir 2006). The confidence of the fuzzy rule \(A_{j} \Rightarrow {\text{class}}\,c\) is defined as:

$$Conf(A_{j} \Rightarrow {\text{class }}c) = \frac{{\sum\nolimits_{{X_{p} \in \,{\text{class}}\,\,c}} {\mu_{j} (X_{p} )} }}{{\sum\nolimits_{p = 1}^{m} {\mu_{j} (X_{p} )} }} .$$
(3)

The consequent class \(C_j\) of fuzzy rule \(R_j\) can be obtained by identifying the class with the maximum confidence:

$$C_{j} = \mathop{\arg\max}\limits_{c} \{ Conf(A_{j} \Rightarrow {\text{class }}c) \mid c \in \{ c_{1}, \ldots, c_{M}\}\}.$$
(4)
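The following Python sketch (our naming; a don't care antecedent is modeled as a constant-one function) implements the compatibility grade (2), the confidence (3) and the consequent selection (4):

```python
import numpy as np

def compatibility(rule_mfs, x):
    """Eq. (2): product of the antecedent membership degrees mu_ji(x_i).
    rule_mfs holds one membership function per feature;
    a don't care antecedent is simply (lambda v: 1.0)."""
    mu = 1.0
    for mf, xi in zip(rule_mfs, x):
        mu *= mf(xi)
    return mu

def confidence(rule_mfs, X, y, c):
    """Eq. (3): confidence of the rule A_j => class c."""
    mus = np.array([compatibility(rule_mfs, xp) for xp in X])
    total = mus.sum()
    return mus[np.asarray(y) == c].sum() / total if total > 0 else 0.0

def consequent_class(rule_mfs, X, y, classes):
    """Eq. (4): the class with maximum confidence becomes C_j."""
    return max(classes, key=lambda c: confidence(rule_mfs, X, y, c))
```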

In Ishibuchi and Yamamoto (2004), some heuristic measures for evaluating the candidate rules have been used. A basic criterion is the difference between the number of positive and negative samples. Its fuzzy version is specified as:

$$Eval(A_{j} \Rightarrow {\text{class }}C_{j} ) = \sum\limits_{{X_{p} \in {\text{ class }}C_{j} }} {\mu_{j} (X_{p} )} - \sum\limits_{{X_{p} \notin {\text{ class }}C_{j} }} {\mu_{j} (X_{p} )} .$$
(5)
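For completeness, the fuzzy positive-minus-negative criterion (5) can be sketched as follows (self-contained; the compatibility product of (2) is inlined):

```python
def evaluate_rule(rule_mfs, X, y, c_j):
    """Eq. (5): summed compatibility of positive patterns minus that of
    negative patterns for the rule A_j => class c_j."""
    score = 0.0
    for xp, yp in zip(X, y):
        mu = 1.0
        for mf, xi in zip(rule_mfs, xp):
            mu *= mf(xi)                 # Eq. (2), inlined
        score += mu if yp == c_j else -mu
    return score
```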

Single winner (that is, the winner-takes-all approach) is the most popular reasoning method in fuzzy rule-based classifiers (Ishibuchi et al. 1999) because of its simplicity and intuitiveness for human users. Using this method, a new pattern \(X_t = [x_{t1}, \ldots, x_{tn}]\) is classified according to the consequent class of the winner rule \(R_w\); that is, the rule with the maximum compatibility grade with \(X_t\) among the fired rules. This can be stated as:

$$\mu_{w} (X_{t} ) = \hbox{max} \{ \mu_{j} (X_{t} ),j = 1, \ldots ,N\}$$
(6)

where \(\mu_j(X_t)\) is the compatibility grade of rule \(R_j\) with pattern \(X_t\), as in (2).
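A sketch of single-winner reasoning (6), with each rule stored as a (membership functions, consequent class) pair (a representation we assume for illustration), could look like this:

```python
def classify(rules, x):
    """Single-winner reasoning, Eq. (6): among the fired rules, the one
    with the highest compatibility grade decides the class."""
    best_mu, best_class = 0.0, None
    for rule_mfs, c_j in rules:
        mu = 1.0
        for mf, xi in zip(rule_mfs, x):
            mu *= mf(xi)                 # Eq. (2)
        if mu > best_mu:                 # mu must be positive to fire
            best_mu, best_class = mu, c_j
    return best_class                    # None if no rule fires
```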

3 The proposed fuzzy feature selection method

The basis of our method is using the distribution of patterns in the fuzzy sets applied on each dimension (feature). This hopefully yields more relevant and interpretable features to be used in fuzzy rule-based classifiers. To avoid the curse of dimensionality, the features are selected for each class individually. Generally, more relevant features are those which can better discriminate the different classes. The basic idea is that a feature is relevant to a class if the number of patterns with true class labels (true positives) exceeds that of the others (false positives). So, the larger the difference between the summed membership degrees of positive and negative patterns in the fuzzy sets, the more relevant the feature is to the positive class. The relevance degree of feature \(f_i\) to class c in the lth fuzzy set is defined as:

$$Relev(f_{i}, c, l) = \sum\limits_{X_{p} \in \,{\text{class}}\; c} {\mu_{L_{l}}(x_{pi})} - \sum\limits_{X_{p} \notin \,{\text{class}}\; c} {\mu_{L_{l}}(x_{pi})}$$
(7)

where \(L_l\) is one of the fuzzy sets in Fig. 1. Thus, the effectiveness of this feature for class c can be calculated by summing the measures in (7) over all fuzzy sets in Fig. 1:

$$Effec(f_{i}, c) = \sum\limits_{l = 1}^{14} {\{ Relev(f_{i}, c, l) \mid Relev(f_{i}, c, l) > 0\}}.$$
(8)

Clearly, only the fuzzy sets with positive relevance degree in (7) contribute to the effectiveness measure of each feature. This feature selection approach is summarized in the following algorithm. Its computational complexity is O(nM) for n features and M classes, plus the ranking complexity of the features and fuzzy sets, O(n log n) + O(16n′), which sums up to O(n log n) since n′ ≪ n and 14 × log 14 ≈ 16.

Algorithm (figure a): the proposed fuzzy feature selection procedure
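Since the algorithm itself is rendered as a figure, the following Python sketch gives one plausible reading of it, computing (7) and (8) and keeping the n′ best features together with the K′ best positive-relevance fuzzy sets of each; all names and the exact tie-handling are ours:

```python
import numpy as np

def relevance(feature_vals, y, c, mf):
    """Eq. (7): summed membership of class-c patterns minus that of
    the remaining patterns, in one fuzzy set on this feature."""
    mu = np.array([mf(v) for v in feature_vals])
    y = np.asarray(y)
    return mu[y == c].sum() - mu[y != c].sum()

def effectiveness(feature_vals, y, c, fuzzy_sets):
    """Eq. (8): sum of the positive relevance degrees over all 14 sets."""
    degs = (relevance(feature_vals, y, c, mf) for mf in fuzzy_sets)
    return sum(d for d in degs if d > 0)

def select_features(X, y, c, fuzzy_sets, n_prime=4, k_prime=4):
    """Rank the features by effectiveness for class c, keep the n' best,
    and for each keep (at most) the K' best positive-relevance sets."""
    scores = [effectiveness(X[:, i], y, c, fuzzy_sets)
              for i in range(X.shape[1])]
    best = np.argsort(scores)[::-1][:n_prime]
    selected = {}
    for i in best:
        degs = [(relevance(X[:, i], y, c, mf), l)
                for l, mf in enumerate(fuzzy_sets)]
        good = sorted([dl for dl in degs if dl[0] > 0], reverse=True)
        selected[int(i)] = [l for _, l in good[:k_prime]]
    return selected  # {feature index: list of fuzzy set indices}
```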

Since the features are ranked according to their effectiveness values and then the most important ones are selected, this algorithm performs a single dimension-wise feature selection step in a greedy manner. It can therefore be trapped in local optima, so only truly redundant features are guaranteed to be discarded: a feature that is completely unimportant on its own may become important when joined with another feature in a two-dimensional space, a situation a greedy method cannot detect (Guyon and Elisseeff 2003).

4 Our method for designing fuzzy rule-based classifiers

To generate the candidate fuzzy classification rules (to be evaluated in the next phase), the method in Ishibuchi and Yamamoto (2004) is used. This approach simultaneously considers all membership functions in Fig. 1 for each variable (feature); that is, one of the 14 fuzzy sets or don't care is used for each variable when generating a candidate rule. This can reduce the number of antecedent fuzzy sets of each rule. In our approach, however, instead of employing all 14 fuzzy sets for each variable, only the n′ selected features and their K′ suitable fuzzy sets (identified by the fuzzy feature selection algorithm), plus don't care, are used. So, the length of the rules would be n′ at most. Moreover, since the features and their fuzzy sets are class-specific, the consequent of the generated rules is predefined to that class. In other words, the combinations of the K′ fuzzy sets identified for each of the n′ features of class c construct the antecedent parts of the rules, while their consequent part is set to class c. Thus, the number of generated fuzzy rules for each class is at most \(K'^{n'}\), where K′ < 14 and n′ ≪ n for high-dimensional data. However, a fuzzy rule with a specific consequent class c is generated only if the number of positive patterns (from class c) is larger than the number of negative patterns, as sketched below.
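A sketch of this rule generation step under the assumptions above (don't care modeled as a constant-one function; the filtering reuses evaluate_rule from the Sect. 2 sketch as the fuzzy counterpart of the positive/negative comparison):

```python
from itertools import product

DONT_CARE = lambda v: 1.0  # the always-firing 'don't care' fuzzy set

def generate_candidates(selected, fuzzy_sets, X, y, c):
    """Candidate rules for class c: each selected feature takes one of
    its K' fuzzy sets or don't care, giving at most (K'+1)^n' rules;
    a rule is kept only if it matches class c better than the rest."""
    feats = sorted(selected)
    options = [[None] + selected[i] for i in feats]  # None = don't care
    candidates = []
    for combo in product(*options):
        rule_mfs = [DONT_CARE] * X.shape[1]          # full-length antecedent
        for i, l in zip(feats, combo):
            if l is not None:
                rule_mfs[i] = fuzzy_sets[l]
        if evaluate_rule(rule_mfs, X, y, c) > 0:     # Eq. (5), from Sect. 2
            candidates.append((rule_mfs, c))
    return candidates
```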

After generating the candidate rules, the next step is to construct the rule base from among them. Since interpretability is a major issue in fuzzy rule-based classifiers, the final rule base should be as compact as possible (Lughofer et al. 2011). For this purpose, the candidate rules are evaluated and the best ones selected. Several heuristic criteria have been suggested so far (Mansoori et al. 2007, 2008), and a good survey of some of these metrics appears in Ishibuchi and Yamamoto (2004).

In Mansoori et al. (2007), the authors introduced a covering subspace and a decision subspace for each fuzzy rule, together with two thresholds for identifying them. The patterns having positive membership degree are considered to reside in the covering subspace of a rule, while the patterns with membership degrees greater than 0.5 determine the decision subspace, since they will certainly be classified by this rule. Using only the patterns in the decision subspace of a rule, we propose a modified version of the criterion in (5) for candidate rule evaluation. This new measure is formulated as:

$$Eval'(R_{j} ) = Eval'(A_{j} \Rightarrow {\text{class}}\,C_{j} ) = \sum\limits_{{X_{p} \in {\text{ class }}C_{j} }} {\mu '_{j} (X_{p} )} - \sum\limits_{{X_{p} \notin {\text{ class }}C_{j} }} {\mu '_{j} (X_{p} )} \,$$
(9)

where

$$\mu '_{j} (X_{p} ) = \prod\limits_{i = 1}^{n} {\{ \mu_{ji} (x_{pi} )|\mu_{ji} (x_{pi} ) > 0.5\} } .$$
(10)
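One plausible reading of (9)–(10), under which a pattern contributes only when it lies in the rule's decision subspace (every antecedent degree above 0.5), is sketched below; the set-builder in (10) could also be read as skipping only the weak factors, so this interpretation is an assumption on our part:

```python
def compatibility_decision(rule_mfs, x):
    """Eq. (10): as Eq. (2), but a pattern counts only if every
    antecedent membership degree exceeds 0.5 (decision subspace)."""
    mu = 1.0
    for mf, xi in zip(rule_mfs, x):
        m = mf(xi)
        if m <= 0.5:
            return 0.0          # pattern outside the decision subspace
        mu *= m
    return mu

def evaluate_rule_decision(rule_mfs, X, y, c_j):
    """Eq. (9): the criterion of Eq. (5) restricted to the decision
    subspace of the rule."""
    return sum(compatibility_decision(rule_mfs, xp) *
               (1.0 if yp == c_j else -1.0)
               for xp, yp in zip(X, y))
```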

To construct the final rule base, all candidate rules generated for each class in the first step are ranked and the best ones are chosen. For this purpose, a simple hill climbing method is used: first, the best rule of each class according to (9) is placed in the rule base; then, the next best and most cooperative rules over all classes are added repeatedly in a greedy manner, according to the classification accuracy of the rule base on the training data. The accuracy of a classifier using fuzzy rule base RB on data set DS is defined as:

$$Acc(DS,RB) = \frac{{\sum\limits_{k = 1}^{M} {m_{k} } }}{|DS|}$$
(11)

where \(m_k\) is the number of patterns from class \(c_k\) in DS that are classified correctly using fuzzy rule base RB. The algorithm is summarized below.

Algorithm (figure b): the rule base construction procedure
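As with the feature selection algorithm, the rule base construction is rendered as a figure; the sketch below gives one plausible reading of it, reusing classify from the Sect. 2 sketch for Eq. (11). The stopping details are our assumptions:

```python
def accuracy(rules, X, y):
    """Eq. (11): fraction of patterns classified correctly by the rule base."""
    return sum(classify(rules, xp) == yp for xp, yp in zip(X, y)) / len(y)

def build_rule_base(ranked, X, y, Q=5):
    """Greedy hill climbing: seed with the best rule of each class
    (ranked is {class: candidates sorted by Eq. (9)}), then repeatedly
    add the candidate giving the largest training-accuracy gain,
    up to Q rules per class."""
    rule_base = [cands[0] for cands in ranked.values()]
    pool = [r for cands in ranked.values() for r in cands[1:]]
    counts = {c: 1 for c in ranked}
    while pool:
        base_acc = accuracy(rule_base, X, y)
        gains = [(accuracy(rule_base + [r], X, y) - base_acc, k)
                 for k, r in enumerate(pool) if counts[r[1]] < Q]
        if not gains:
            break
        gain, k = max(gains)
        if gain <= 0:
            break                      # no candidate improves accuracy
        rule = pool.pop(k)
        rule_base.append(rule)
        counts[rule[1]] += 1
    return rule_base
```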

To obtain the complexity of the algorithm, the required computations of each step are accounted for. In step 1, the loop runs M times and each iteration generates \(K'^{n'}\) fuzzy rules (for n′ selected features, K′ fuzzy sets and M classes), so the complexity of step 1 is \(O(MK'^{n'})\). This is also the complexity of step 2, since \(K'^{n'}\) rules are evaluated for each of the M classes. To classify m patterns in step 3, \(K'^{n'}\) rules are used, so this step needs \(O(mK'^{n'})\). In steps 4–7, the computations are negligible, except in step 6, which requires step 3 to be repeated several times. Overall, the complexity of the algorithm is \(O\left( {\left( {m + 2M} \right)K^{{\prime n^{\prime } }} } \right) \approx O(m \, K^{{\prime n^{\prime } }} )\) since, in real-world data sets, m ≫ M. However, K′ and n′ are both set to 4 in the experiments, so the complexity is O(256m) in practice.

5 Experimental results

In this section, the efficiency of the proposed methods is examined. The results are obtained by applying our algorithms to 14 data sets with low and moderate dimensions and 6 high-dimensional ones, all from the UCI ML repository (Asuncion and Newman 2007). Table 1 summarizes the data sets used in the experiments, ordered by their number of features.

Table 1 Data sets used in the experiments

The Iris data set is used to illustrate the steps of our method. This data set consists of 150 samples with four dimensions and three classes. The effectiveness measure (8) of its features is shown in Table 2. Table 3 lists at most four of the best fuzzy sets (from Fig. 1) for the two best features of each class.

Table 2 The effectiveness measure of features in Iris data set
Table 3 The best fuzzy sets of two best features for each class of Iris

After applying our proposed method, the final rule base will contain three fuzzy rules. Figure 2 depicts the decision area of these rules.

Fig. 2 Decision area of three rules generated for Iris data set

  • \(R_1\): If Petal width is \(L_3\) and Petal length is \(L_3\) then class is Setosa.

  • \(R_2\): If Petal width is \(L_4\) and Petal length is \(L_4\) then class is Versicolour.

  • \(R_3\): If Petal width is \(L_5\) and Petal length is \(L_{13}\) then class is Virginica.

The most sensitive parameters of our feature selection algorithm, which also affect the fuzzy rule-based classifier, are n′, the number of desired features, and K′, the number of fuzzy sets on each feature. To examine the sensitivity of our methods to these parameters, two data sets are used: Wine and Tiger, with low and high dimensions, respectively. For this purpose, three distinct values of n′ (3, 4 and 5) versus all possible values of K′ (1, 2, …, 14) are studied. In this regard, the n′ best features with the K′ best fuzzy sets, reported by the feature selection algorithm, are used to design the fuzzy classifiers. The accuracy (11) of these classifiers is computed using the training data for testing as well. Figure 3 depicts these accuracies for the Wine and Tiger data sets. For both data sets, the number of features is not decisive, at least for 3, 4 and 5 features. The same holds for the number of fuzzy sets per feature, except when weak fuzzy sets are included. Clearly, using only the best fuzzy set is sufficient, since the obtained accuracies are not influenced by adding more fuzzy sets. Nevertheless, in the coming experiments, the numbers of features and fuzzy sets are both set to 4.

Fig. 3 Effects of number of features (n′) and fuzzy sets (K′) on accuracy of designed classifiers

The experimental results are studied in two subsections. First, our fuzzy feature selection method is evaluated; next, the proposed method for designing fuzzy rule-based classifiers is discussed. All methods are implemented in MATLAB R2014 and run on a Core i5, 3.1-GHz CPU with 4 GB of memory under Windows 7.

To compare the different approaches in a formal and efficient manner, five iterations of tenfold cross-validation (5×10CV) are used. In this method, each data set is randomly divided into ten subsets of the same size; nine subsets are used for training and the tenth for testing. The same training and testing procedure is performed nine more times, exchanging the roles of the subsets so that every subset is used as test data once. Since the error rate on test patterns depends on the initial division of the data set, the 10CV is repeated five times using different divisions, and the average accuracy is reported.
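A sketch of this 5×10CV protocol follows; the fold-splitting details and the train_and_test callback are our assumptions, not the paper's implementation:

```python
import numpy as np

def five_times_tenfold_cv(X, y, train_and_test, seed=0):
    """5x10CV: five random tenfold partitions; returns mean test accuracy.

    train_and_test(X_tr, y_tr, X_te, y_te) is assumed to train a
    classifier and return its accuracy on the test split."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(5):                       # five different divisions
        idx = rng.permutation(len(y))
        folds = np.array_split(idx, 10)      # ten (nearly) equal subsets
        for k in range(10):                  # each subset is test data once
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(10) if j != k])
            accs.append(train_and_test(X[train], y[train],
                                       X[test], y[test]))
    return float(np.mean(accs))
```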

5.1 Examining our fuzzy feature selection approach

In this part, our fuzzy feature selection algorithm is compared with the mRMR method (Peng et al. 2005), the Fisher criterion (Fisher 1936), and a fuzzy method based on fuzzy entropy and a similarity measure (Luukka 2011). The mRMR method maximizes the relevancy and minimizes the redundancy between the features using mutual information. In the Fisher criterion, the ratio of the traces of the within-class and between-class scatter matrices in each dimension is the basis for ranking the features. Table 4 examines the scalability of the methods in selecting four features for each data set. The computational cost of the methods is stated in terms of CPU time. As shown in boldface, the CPU cost of our proposed algorithm is lower than that of the other methods on most data sets and also on average.

Table 4 Computational cost (in sec) of four feature selection methods

Using the features obtained by each method, the approach in Ishibuchi and Yamamoto (2004) for designing fuzzy classifiers is employed to examine the influence of the selected features on the classification accuracy, and thus the effectiveness of each method in feature selection. Since our proposed method selects the features class-specifically, an ensemble of classifiers, one per class (Pineda-Bautista et al. 2011), is applied in this case. Using the features obtained by the four feature selection methods, the performance of the designed fuzzy classifiers, in terms of the accuracy in (11), is compared in Table 5. According to these accuracies, the performance of our approach (in selecting discriminative features and suitable fuzzy sets) is much better than the entropy-based method and comparable to the mRMR and Fisher methods.

Table 5 Classification accuracy obtained using different feature selection methods

To justify our claim statistically, a paired t test (Kreyszig 1970) is applied to the null hypothesis that the classification accuracy of our proposed method on the 14 + 6 data sets is not better than that of the others. The p values of the paired comparisons with α = 0.05 are reported in Table 6, where small values cast doubt on the validity of the null hypothesis. Clearly, the differences are statistically significant: the performance of our method is considerably better than the entropy-based and mRMR methods, but not better than the Fisher criterion.

Table 6 p Value of paired t test on the classification accuracy of methods

5.2 Performance of our method in designing fuzzy rule-based classifiers

In this subsection, the performance of our proposed method for generating fuzzy classification rules is compared with the well-known method proposed by Ishibuchi and Yamamoto (2004). Because of its rule-length constraints, only fuzzy rules of length at most two are generated. Also, since it needs an evaluation criterion, the product of confidence and support (Ishibuchi and Yamamoto 2004) is used for this purpose. To construct the final rule base, the method proposed in this paper is also employed for the Ishibuchi approach. Additionally, the number of rules per class (Q in the algorithm) is set to 5 for both methods. The performance of our algorithm versus the Ishibuchi method is compared in Table 7 in terms of the classification accuracy in (11), computational cost, number of rules in the final rule base, |FRB|, and length of rules, n′. In this table, only low- and moderate-dimensional data sets are used for comparison, because of the mentioned constraint of Ishibuchi's approach.

Table 7 Comparing the performance of our proposed method versus Ishibuchi method

The data sets in this table are grouped into two categories because of their distinct performances. In the first group, the classification accuracy of our algorithm is significantly better than the Ishibuchi method, by about 8 % on average. Even in the second group, our accuracies are mostly better. This is because of selecting discriminative features for each class and suitable fuzzy sets for each feature. Moreover, the proposed criterion for rule evaluation puts the more accurate rules at the top of the ranking list. The reported computational costs clearly justify the scalability of our approach in generating a compact set of short fuzzy rules for high-dimensional and/or large data. The size of the rule bases and the length of the rules in the proposed method are close to, but not as good as, those of the Ishibuchi approach. This is because the length of the fuzzy rules in his approach is restricted to two, while in our method the rules are allowed to be longer (up to the number of selected features; at most four in these experiments). So, the generated rules of Ishibuchi are more general than our rules and, therefore, our method must generate more rules to cover the same subspace of the problem.

As the last experiment, the effectiveness of our method is compared against some state-of-the-art fuzzy classifier design schemes. In this regard, two recently proposed and efficient methods in fuzzy classifier design are used. The first one is SGERD (Mansoori et al. 2008), a steady-state genetic algorithm for extracting fuzzy classification rules from data. This method is fast and scalable, with acceptable accuracy. The second algorithm is FARC-HD (Alcala-Fdez et al. 2011a), a fuzzy association rule-based classification model for high-dimensional problems with genetic rule selection and lateral tuning. FARC-HD is efficient and accurate because of its goodness in rule selection and lateral tuning.

However, the computational complexity of these methods should be managed, especially when they are applied to moderate- and high-dimensional data. For this purpose, our proposed feature selection method is first applied to each data set and then the four best features are used by SGERD and FARC-HD. In these experiments, the implementations of these algorithms in the KEEL data-mining software tool (Alcala-Fdez et al. 2011b) are used, with their required parameters set to default values.

Table 8 compares their classification accuracy, in terms of (11), against our method. Clearly, the performance of the designed classifiers on some data sets is weak because four features are inadequate for them. By selecting more than four features for these data sets, their classification rate would hopefully increase.

Table 8 Classification accuracy of our method against SGERD and FARC-HD

According to the results in this table, the performance of our method is comparable to SGERD and FARC-HD, though the latter is more accurate, both on average and on most of the data sets. This is because of its lateral tuning and use of rule weights.

6 Conclusion

In this paper, we proposed a novel and fast fuzzy feature selection method that chooses the more relevant features: those which can distinguish the distinct classes well. Our method uses the membership degrees of positive and negative patterns in the fuzzy sets to compute the relevancy of features to the classes. The selected features and their effective fuzzy sets were then used in designing fuzzy rule-based classifiers. To evaluate the initially generated candidate rules, a new criterion was also proposed to measure the class-discrimination ability of each fuzzy rule.

The experimental results showed that our feature selection method is fast and scalable enough to be applied to high-dimensional data. By using just a few of these features, our approach for designing fuzzy rule-based classifiers could generate accurate and interpretable rule bases. In future work, we aim to develop a fuzzy feature selection approach which can also detect redundant features.