
1 Introduction

Feature selection is an important step for many machine learning tasks. Unlike traditional feature selection methods that choose a single “optimal” subset of the features for the entire dataset, some studies [6, 8,9,10,11,12, 14, 15] have used class-specific approaches for selecting features, where a unique subset of the original features is selected for each class. If there are C classes, the class-specific approach chooses C subsets. In a traditional feature selection method, the selected feature subset is chosen based on the global characteristics of the data. It does not take into account any class-specific or local characteristics that may be present. For example, there may exist a group of features that follows a distinct distribution for a specific class but varies randomly over the remaining classes. Such a group of features plays a significant role in distinguishing the specific class from the other classes but may not be very useful for the C-class problem as a whole. Class-specific characteristics may also exist in the form of class-specific redundancy: different sets of features could be redundant for different classes. The class-specific feature selection (CSFS) methods in [6, 8,9,10,11,12, 14, 15] have proposed suitable frameworks that exploit class-specific feature subsets to solve classification problems. They have shown that classifiers built with the subsets chosen by class-specific methods performed better than, or comparably to, classifiers built with subsets chosen by global feature selection methods. CSFS may also enhance the transparency/explainability of the associated classification process [6].

The majority of the CSFS methods [6, 8, 10, 11, 14, 15] follow the one-versus-all (OVA) strategy to decompose a C-class classification problem into C binary classification problems. They choose C class-specific feature subsets that are optimal for the C binary classification problems. OVA strategy-based CSFS methods have certain drawbacks. Generally, the decomposition leads to class imbalance, demanding careful design of the classifiers. The OVA strategy-based methods are computationally intensive, and an aggregation mechanism is needed for testing. Even for the other CSFS methods [9, 12], which are not based on the OVA strategy, C classifiers are employed to exploit the obtained class-specific feature subsets in the classification process, and the C outputs must be aggregated for testing. Moreover, none of these methods consider the presence of (i) class-specific redundancy and (ii) within-class substructures in their frameworks.

Here, we propose a CSFS scheme embedded in a fuzzy rule-based classifier (FRBC) that does not use the OVA strategy. Our method selects class-specific feature subsets by learning a single FRBC and hence avoids the issues associated with CSFS methods involving C classifiers. Moreover, we extend our framework to deal with (i) CSFS with redundancy control, and (ii) rule-specific feature selection (RSFS), which can exploit the presence of substructures within a class. The rules provided by the FRBC are generally interpretable and more specific. Exploiting the class-specific local features, the proposed FRBC enjoys more transparency and interpretability than a standard classifier exploiting class-specific feature subsets. Our main contributions in this study are summarised as follows.

  1. We propose a CSFS method that, unlike most of the existing CSFS schemes, is not based on the OVA strategy. Thus, our method is free from the weaknesses of the OVA strategy.

  2. We propose a classifier framework that is defined by class-specific fuzzy rules involving class-specific feature subsets.

  3. Our method can monitor the level of redundancy in the selected features.

  4. We also propose a general version of our class-specific feature selection method that not only chooses different subsets for different classes but also chooses different subsets for different rules within a class if different substructures are present.

2 Proposed Method

Let the input data be \(\textbf{X}=\{\textbf{x}^{i}=(x_{1}^{i},x_{2}^{i},\cdots ,x_{P}^{i})^{T}\in \mathcal {R}^{P}: i \in \{1,2,\cdots ,n\}\}\). Let us denote the input space by \(\mathcal {X} \subseteq \mathcal {R}^{P}\). Let the collection of class labels of \(\textbf{X}\) be \(\textbf{y}=\{y^{i} \in \mathcal {Y}:i \in \{1,2,\cdots ,n\}\}\), where \(y^{i}\) is the class label corresponding to \(\textbf{x}^{i}\) and \(\mathcal {Y}=\{1,2,\cdots , C\}\). For our purposes, we represent the class label of \(\textbf{x}^{i}\) as \(\textbf{t}^{i} \in \{0,1\}^{C}\), where \({t}_{k}^{i}=1\) if \(y^{i}=k\) and \({t}_{k}^{i}=0\) otherwise. We denote the jth feature by \(x_{j}\), the class label by y, and the target vector by \(\textbf{t}\). The set of original features is \(\textbf{f}=\{x_{1}, x_{2}, \cdots, x_{P}\}\). Let the optimal class-specific subset of features for the kth class be \(\textbf{s}_k\). We need to find the \(\textbf{s}_{k}\)s, where \(\textbf{s}_{k} \subset \textbf{f}\; \forall k \in \{1,2,\cdots , C\}\). Let \(\mathcal {X}^{(\textbf{s}_k)}\) be the input space projected onto the features in \(\textbf{s}_k\). The classifier constructed using the class-specific feature subsets \(\textbf{s}_{k}\)s can be defined as \(\mathcal {F}: \{\mathcal {X}^{(\textbf{s}_1)}, \mathcal {X}^{(\textbf{s}_2)}, \cdots, \mathcal {X}^{(\textbf{s}_C)}\} \mapsto \mathcal {Y}\). We propose a CSFS mechanism embedded in a fuzzy rule-based classifier (FRBC), so we discuss the FRBC next.

2.1 Fuzzy Rule-Based Classifiers

We employ the FRBC framework used in [4, 5]. Each class is represented by a set of rules. Let there be \(N_{k}\) rules for the kth class. The lth rule of the kth class, \(\textrm{R}_{kl}\), is given by

$$\begin{aligned} \textrm{R}_{kl}:\texttt {If }\,\, x_{1}\,\, \texttt { is }\,\, A_{1,kl} \,\,\texttt { and }\,\, x_{2} \,\,\texttt { is } \,\,A_{2,kl} \,\,\texttt { and} \,\,\cdots x_{P}\,\, \texttt { is } \,\,A_{P,kl} \,\,\texttt {then }\,\, y \,\,\texttt { is } \,\,k. \end{aligned}$$
(1)

Here, \(k\in \{1,2,\cdots , C\}\), \(l \in \{1,2,\cdots , N_{k}\}\), and \(A_{j,kl}\) is a linguistic value (fuzzy set) defined on the jth feature for the lth rule of the kth class. Let \(\alpha _{kl}\) be the firing strength of the rule \(\textrm{R}_{kl}\). The firing strength is computed using the product T-norm [7] over the fuzzy sets \(A_{1,kl},A_{2,kl},\cdots ,A_{P,kl}\). Let the membership to the fuzzy set \(A_{j,kl}\) be \(\mu _{j,kl}\). Then \(\alpha _{kl}=\prod _{j=1}^{P} \mu _{j,kl}\). The final output of the FRBC is of the form \(\textbf{o}=(o_{1}, o_{2}, \cdots , o_{C})\), where \(o_{k}\) is the support for the kth class, computed as \(o_{k}=\max \{\alpha _{k1}, \alpha _{k2}, \cdots , \alpha _{kN_{k}}\}\). To learn an efficient classifier from the initial fuzzy rule-based system, the parameters defining the fuzzy sets \(A_{j,kl}\)s can be tuned by minimizing the loss function

$$\begin{aligned} E_{cl}=\textstyle {\sum _{i=1}^{n}\sum _{k=1}^{C}(o^{i}_{k}-t^{i}_{k})^2}. \end{aligned}$$
(2)
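To make the computation concrete, the following minimal sketch (our own illustration, not the authors' implementation; all variable names are ours) evaluates the firing strengths, the class supports \(\textbf{o}\), and the loss \(E_{cl}\) of (2) for a batch of samples, assuming the membership values \(\mu _{j,kl}\) have already been computed.

```python
import numpy as np

def frbc_outputs(mu_per_class):
    """mu_per_class[k]: array of shape (n, N_k, P) holding the memberships
    mu_{j,kl} of every sample to every fuzzy set of the k-th class."""
    supports = []
    for mu_k in mu_per_class:
        alpha_k = np.prod(mu_k, axis=2)        # product T-norm -> firing strengths (n, N_k)
        supports.append(alpha_k.max(axis=1))   # o_k = max over the rules of the class
    return np.stack(supports, axis=1)          # o, shape (n, C)

def loss_cl(o, t):
    """E_cl of Eq. (2); t is the one-hot target matrix of shape (n, C)."""
    return np.sum((o - t) ** 2)
```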

To extract the rules, following [4, 5], we cluster the training data of the kth class into \(N_{k}\) clusters. We note here that the kth class may not have \(N_{k}\) clusters in the pattern recognition sense; by clustering we simply group nearby points and then define a rule for each group. Let the centroid of the lth cluster of the kth class be \(\textbf{v}_{kl}=(v_{1,kl}, v_{2,kl},\cdots ,v_{P,kl})\). The cluster centroid \(\textbf{v}_{kl}\) is then translated into P fuzzy sets, \(A_{j,kl}=\) “close to” \(v_{j,kl}\; \forall j \in \{1,2,\cdots ,P\} \). The fuzzy set ‘“close to” \(v_{j,kl}\)’ is modeled by a Gaussian membership function with mean \(v_{j,kl}\). Although the membership parameters can be tuned to refine the fuzzy rules, we have not done that in this study. We have used fixed rules defined by the obtained cluster centers and a fixed spread value in the Gaussian membership functions.
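The rule-extraction step can be sketched as follows (a minimal illustration under our own naming, using scikit-learn's KMeans for the clustering; the spread sigma is a fixed, user-chosen value, as noted above):

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_rule_centres(X, y, n_rules_per_class):
    """Cluster the training data of each class; each centroid defines one rule."""
    centres = {}
    for k in np.unique(y):
        km = KMeans(n_clusters=n_rules_per_class[k], n_init=10, random_state=0)
        km.fit(X[y == k])
        centres[k] = km.cluster_centers_       # shape (N_k, P)
    return centres

def gaussian_memberships(X, V_k, sigma):
    """mu_{j,kl} = exp(-(x_j - v_{j,kl})^2 / (2 sigma_j^2)); returns (n, N_k, P)."""
    d = (X[:, None, :] - V_k[None, :, :]) / sigma
    return np.exp(-0.5 * d ** 2)
```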

2.2 Feature Selection

Following [2,3,4,5, 13], we use feature modulators that prevent derogatory features from taking part in the rules of the FRBC and promote useful features. For each feature, there is an associated modulator of the form \(M(\lambda _{j})=\exp {(-\lambda _{j}^{2})}\), where \(j \in \{1,2,\cdots ,P\}\) [5]. To select or reject a feature using the modulator function, the membership values associated with the jth feature are modified as

$$\begin{aligned} \textstyle {\hat{\mu }_{j,kl}=\mu _{j,kl}^{M(\lambda _{j})}=\mu _{j,kl}^{\exp {(-\lambda _{j}^{2})}} \forall k,l} \end{aligned}$$
(3)

Note that \(\lambda _{j} \approx 0\) makes \(\hat{\mu }_{j,kl}\approx \mu _{j,kl}\). Similarly, when \(\lambda _{j}\) is high (say, \(\lambda _{j} \ge 2\)), \(\hat{\mu }_{j,kl}\approx 1\). The rule firing strength is now calculated as \(\alpha _{kl}=\prod _{j=1}^{P} \hat{\mu }_{j,kl}\). So, when \(\hat{\mu }_{j,kl}\approx \mu _{j,kl}\), the jth feature influences the rule firing strength \(\alpha _{kl}\) and in turn influences the classification process, whereas, if \(\hat{\mu }_{j,kl}\approx 1\), the jth feature has no influence on the firing strength and hence no influence on the predictions of the FRBC. This would be true for any T-norm, since \(T(x,1)=x,\ x \in [0,1]\). Thus, for useful features, the \(\lambda _{j}\)s should be made close to zero, and for derogatory features, the \(\lambda _{j}\)s should be made high. The desirable values of the \(\lambda _{j}\)s are obtained by minimizing \(E_{cl}\) defined in (2) with respect to the \(\lambda _{j}\)s. The training begins with \(\lambda _{j}=2+\) Gaussian noise. \(M(\lambda _{j}) \approx 0\) indicates a strong rejection of \(x_j\), while \(M(\lambda _{j}) \approx 1\) suggests a strong acceptance of \(x_j\). However, training may lead to \(\lambda _{j}\)s for which \(M(\lambda _{j})\) takes a value in between 0 and 1, implying that the corresponding feature influences the classification only partially. This is not desirable in our case, as our primary goal is to select or reject features. To facilitate this, we add a regularizer term \(E_{select}\) to \(E_{cl}\) such that \(E_{select}\) adds a penalty if any \(\lambda _{j}\) allows the corresponding feature only partially. In [5], \(E_{select}\) is set as follows.

$$\begin{aligned} E_{select}=\textstyle {(\nicefrac {1}{P})\sum _{j=1}^{P}\exp {(-\lambda _{j}^{2})}(1-\exp {(-\lambda _{j}^{2})})} \end{aligned}$$
(4)

So, the overall loss function for learning suitable \(\lambda _{j}\)s becomes

$$\begin{aligned} \textstyle {E= E_{cl}+ c_{1}E_{select}} . \end{aligned}$$
(5)
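The global feature-selection loss can be sketched as follows (again our own illustration; lam is the vector of \(\lambda _{j}\)s and \(c_{1}\) is a user-chosen regularization weight):

```python
import numpy as np

def modulators(lam):
    """M(lambda_j) = exp(-lambda_j^2), one value per feature."""
    return np.exp(-lam ** 2)                   # shape (P,)

def modulate(mu_k, lam):
    """Eq. (3): raise the memberships of every rule to the power M(lambda_j)."""
    return mu_k ** modulators(lam)             # mu_k has shape (n, N_k, P)

def loss_select(lam):
    """E_select of Eq. (4): penalizes modulators stuck strictly between 0 and 1."""
    M = modulators(lam)
    return np.mean(M * (1.0 - M))

# Overall loss of Eq. (5): E = E_cl + c1 * loss_select(lam), where E_cl is computed
# from the modulated memberships exactly as in the earlier FRBC sketch.
```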

Class-Specific Feature Selection. So far we have not considered the selection of class-specific features. In the class-specific scenario, a different set of P modulators is engaged for each class, so a total of \(C\times P\) feature modulators are employed. Consequently, for each class a different set of features, if appropriate, can be selected. Here, we represent the feature modulator for the jth feature of the kth class as \(M(\lambda _{j,k})=\exp {(-\lambda _{j,k}^{2})}\), where \(j \in \{1,2,\cdots ,P\}; k \in \{1,2,\cdots ,C\}\). The modulator value \(M(\lambda _{j,k})\) modifies the membership values corresponding to the jth feature of the kth class as follows:

$$\begin{aligned} \textstyle {\hat{\mu }_{j,kl}=\mu _{j,kl}^{M(\lambda _{j,k})}=\mu _{j,kl}^{\exp {(-\lambda _{j,k}^{2})}} \forall l} \end{aligned}$$
(6)

For this problem, \(E_{select}\) is changed to

$$\begin{aligned} E_{select}=\textstyle {(\nicefrac {1}{(CP)})\sum _{k=1}^{C}\sum _{j=1}^{P}\exp {(-\lambda _{j,k}^{2})}(1-\exp {(-\lambda _{j,k}^{2})})} \end{aligned}$$
(7)

We now minimize (5) with respect to \(\lambda _{j,k}\)s to find the optimal \(\lambda _{j,k}\)s.
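The class-specific variant only changes the shape of the modulator parameters: lam becomes a \(C \times P\) matrix and each class modulates its own rules. A minimal sketch mirroring (6) and (7):

```python
import numpy as np

def modulate_class_specific(mu_per_class, lam):
    """Eq. (6): class k uses its own modulators M(lambda_{j,k}); lam has shape (C, P)."""
    M = np.exp(-lam ** 2)
    return [mu_k ** M[k] for k, mu_k in enumerate(mu_per_class)]

def loss_select_cs(lam):
    """Eq. (7): the selection penalty averaged over all C*P modulators."""
    M = np.exp(-lam ** 2)
    return np.mean(M * (1.0 - M))
```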

2.3 Monitoring Redundancy

Suppose a data set has three useful features, say \(x_{1},x_{2},x_{3}\), such that each of \(x_2\) and \(x_3\) is strongly dependent on (say, correlated with) \(x_1\). Then all three features carry practically the same information, and only one of them is enough. These three form a redundant set of features. However, if we use just one of them and there is some error in measuring that feature, the system may fail to do the desired job. Therefore, a controlled use of redundant features is desirable. For the global feature selection framework, redundancy control has been realized by adding the regularizer (8) to (5) [3, 5, 13]:

$$\begin{aligned} E_{r}=\textstyle {(\nicefrac {1}{(P(P-1))})\sum _{j=1}^{P}\sum _{m=1, m\ne j}^{P} \sqrt{\exp {(-\lambda _{j}^{2})}\exp {(-\lambda _{m}^{2})}\rho ^{2}(x_{j},x_{m})}} \end{aligned}$$
(8)

Here, \(\rho ()\) is Pearson’s correlation coefficient (it could also be mutual information), which is a measure of the dependency between two features. When \(x_{j}\) and \(x_{m}\) are highly correlated, \(\rho ^{2}(x_{j},x_{m})\) is close to one (its highest value). In this case, to reduce the penalty \(E_{r}\), the training process will adapt \(\lambda _{j}\) and \(\lambda _m\) in such a way that one of \(\exp {(-\lambda _{j}^{2})}\) and \(\exp {(-\lambda _{m}^{2})}\) is close to 0 and the other is close to 1. Note that (8) is not suitable for the class-specific scenario. Next, we modify (8) for class-specific redundancy.
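A sketch of the redundancy regularizer (8); rho2 is the \(P \times P\) matrix of squared Pearson correlations, which can be precomputed once from the training data (this is our illustration, not the authors' code):

```python
import numpy as np

def loss_redundancy(lam, rho2):
    """E_r of Eq. (8); rho2[j, m] = rho(x_j, x_m)^2, precomputed, shape (P, P)."""
    P = lam.shape[0]
    M = np.exp(-lam ** 2)                      # modulator values, shape (P,)
    pair = np.sqrt(np.outer(M, M) * rho2)      # sqrt(M_j * M_m * rho^2(x_j, x_m))
    np.fill_diagonal(pair, 0.0)                # exclude the m = j terms
    return pair.sum() / (P * (P - 1))

# rho2 can be obtained as, e.g., rho2 = np.corrcoef(X, rowvar=False) ** 2
```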

Class-Specific Redundancy. For class-specific redundancy, we compute \(\rho _{k}(x_{j},x_{m})\) between features \(x_{j}\) and \(x_{m}\) considering only the instances of the kth class. In the class-specific case, for each class, we have P feature modulators, \(M(\lambda _{j,k})=\exp {(-\lambda _{j,k}^{2})}\), where \(j \in \{1,2,\cdots ,P\}; k \in \{1,2,\cdots ,C\}\). So, (8) is modified as follows.

$$\begin{aligned} E_{r_{c}}=\textstyle {(\nicefrac {1}{(CP(P-1))})\sum _{k=1}^{C}\sum _{j=1}^{P}\sum _{m=1, m\ne j}^{P} \sqrt{\exp {(-\lambda _{j,k}^{2})}\exp {(-\lambda _{m,k}^{2})}\rho _{k}^{2}(x_{j},x_{m})}} \end{aligned}$$
(9)

Considering the class-specific redundancy, our new loss function for learning the system becomes:

$$\begin{aligned} \textstyle {E_{tot}=E_{cl}+c_{1}E_{select}+c_{2}E_{r_c}}. \end{aligned}$$
(10)
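A corresponding sketch of (9) and (10); rho2_per_class[k] holds the squared correlations computed from the instances of the kth class only, and \(c_{1}\), \(c_{2}\) are user-chosen weights:

```python
import numpy as np

def loss_redundancy_cs(lam, rho2_per_class):
    """E_{r_c} of Eq. (9); lam has shape (C, P), rho2_per_class[k] has shape (P, P)."""
    C, P = lam.shape
    total = 0.0
    for k in range(C):
        M = np.exp(-lam[k] ** 2)
        pair = np.sqrt(np.outer(M, M) * rho2_per_class[k])
        np.fill_diagonal(pair, 0.0)
        total += pair.sum()
    return total / (C * P * (P - 1))

# Eq. (10): E_tot = E_cl + c1 * E_select + c2 * loss_redundancy_cs(lam, rho2_per_class)
```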

2.4 Exploiting Substructures Within a Class

For some real-world problems, the data corresponding to a class may have distinct clusters, and some of the clusters may lie in different sub-spaces. For example, in a multi-cancer gene expression data set, each cancer may have several sub-types, where each sub-type is characterized by a different set of highly expressed genes/features. This generalizes the concept of class-specific feature selection further. To exploit such local substructures within a class while extracting rules, we need to use rule-specific feature modulators. Each rule of the kth class is assumed to represent a local structure or cluster present in the kth class. So, for the kth class there are \(N_{k}\times P\) feature modulators. For the overall system there are \(n_{rule}\times P\) feature modulators, where \(n_{rule}(=\sum _{k=1}^{C}N_{k})\) is the total number of rules. A modulator function is now represented by \(M(\lambda _{j,kl})\), and the corresponding modulated membership is the following.

$$\begin{aligned} \textstyle {\hat{\mu }_{j,kl}=\mu _{j,kl}^{M(\lambda _{j,kl})}=\mu _{j,kl}^{\exp {(-\lambda _{j,kl}^{2})}}} \end{aligned}$$
(11)

The regularizer, \(E_{select}\) is now modified as

$$\begin{aligned} E_{select}=\textstyle {(\nicefrac {1}{(CP)})\sum _{k=1}^{C}(\nicefrac {1}{N_{k}})\sum _{l=1}^{N_{k}}\sum _{j=1}^{P}\exp {(-\lambda _{j,kl}^{2})}(1-\exp {(-\lambda _{j,kl}^{2})})} \end{aligned}$$
(12)

In this framework, we do not consider redundancy. Using the modulated memberships (11) in \(E_{cl}\) and (12) as \(E_{select}\), we define the loss function \(E= E_{cl}+ c_{1}E_{select}\) for discovering rule-specific feature subsets.
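A minimal sketch of the rule-specific case, where the modulators are indexed by both the rule and the feature (our own illustration):

```python
import numpy as np

def modulate_rule_specific(mu_per_class, lam_per_class):
    """Eq. (11): rule l of class k uses its own modulators M(lambda_{j,kl});
    lam_per_class[k] has shape (N_k, P), mu_per_class[k] has shape (n, N_k, P)."""
    return [mu_k ** np.exp(-lam_k ** 2)[None, :, :]
            for mu_k, lam_k in zip(mu_per_class, lam_per_class)]

def loss_select_rs(lam_per_class):
    """Eq. (12): average the selection penalty over the rules of each class,
    then over classes and features."""
    C = len(lam_per_class)
    P = lam_per_class[0].shape[1]
    total = 0.0
    for lam_k in lam_per_class:                        # lam_k: (N_k, P)
        M = np.exp(-lam_k ** 2)
        total += np.mean(M * (1.0 - M), axis=0).sum()  # (1/N_k) sum over rules, sum over j
    return total / (C * P)
```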

3 Experiments and Results

We have performed three experiments to validate the three main contributions of our proposed framework. In Experiment 1, we show the effectiveness of the proposed CSFS over the usual global feature selection using the proposed FRBC framework. In Experiment 2, we demonstrate the significance of class-specific redundancy control using our approach. In Experiment 3, a data set having multiple sub-structures in different sub-spaces within a class has been considered to show the utility of our method. We have not tuned the rule base parameters of the FRBC; we have tuned only the feature modulators to select/reject features. For clustering, we have used the K-means algorithm. Each fuzzy set is modeled using a Gaussian membership function having two parameters: center and spread. The cluster centers are used as the centers of the Gaussian membership functions, and their spreads have been set to 0.2 times the feature-specific range. To minimize the error functions, we use the train.GradientDescentOptimizer optimizer from TensorFlow [1]. For all experiments, the learning rate is set to 0.2 and the stopping criterion is 10000 iterations. However, the loss function typically decreases very rapidly at the beginning of training and converges within a few hundred iterations. As mentioned in Sect. 2, we denote the class-specific feature subset for class 1 as \(\textbf{s}_{1}\), for class 2 as \(\textbf{s}_{2}\), and so on.
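For reference, the class-specific training loop can be sketched as below. This is our own re-implementation sketch in modern TensorFlow (the paper uses the TF1-style train.GradientDescentOptimizer; SGD with the same learning rate plays that role here), and X, T, centres, sigma, C, P, and c1 are assumed to be the data matrix, one-hot targets, K-means centres, Gaussian spreads, and hyperparameters described above.

```python
import tensorflow as tf

lam = tf.Variable(2.0 + tf.random.normal((C, P), stddev=0.1))  # lambda = 2 + Gaussian noise (scale is our choice)
opt = tf.keras.optimizers.SGD(learning_rate=0.2)               # plain gradient descent

for _ in range(10000):                                         # fixed iteration budget
    with tf.GradientTape() as tape:
        M = tf.exp(-lam ** 2)                                  # modulators, shape (C, P)
        supports = []
        for k in range(C):
            d = (X[:, None, :] - centres[k][None, :, :]) / sigma
            mu_hat = tf.exp(-0.5 * d ** 2) ** M[k]             # modulated memberships
            supports.append(tf.reduce_max(tf.reduce_prod(mu_hat, axis=2), axis=1))
        o = tf.stack(supports, axis=1)                         # class supports, (n, C)
        loss = tf.reduce_sum((o - T) ** 2) + c1 * tf.reduce_mean(M * (1.0 - M))
    grads = tape.gradient(loss, [lam])
    opt.apply_gradients(zip(grads, [lam]))
```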

3.1 Experiment 1

We use a three-class synthetic data set, Synthetic1, with six features having distributions as described in Table 1. Here, \(\mathcal {N}(m,s)\) represents a normal distribution with mean m and standard deviation s; \(\mathcal {U}(a,b)\) represents a uniform distribution over the interval (a, b). Without loss of generality, we have assigned the first 100 points to class 1, the next 100 points to class 2, and the last 100 points to class 3. From Table 1 we can see that class 2 and class 3 are uniformly distributed over a given interval for features \(x_{1}\) and \(x_{2}\). On the other hand, class 1 is clustered around (0, 0) in the feature space formed by \(x_{1}\) and \(x_{2}\). Hence, the feature space formed by \(x_{1}\) and \(x_{2}\) discriminates class 1 from the other two classes. Similarly, \((x_3,x_4)\) and \((x_{5},x_{6})\) discriminate class 2 and class 3, respectively, from the corresponding remaining classes. To understand the importance of CSFS, we perform both global feature selection (GFS) and CSFS, and compare their performances. We have also computed the performance of the FRBC with all features. The number of rules per class is set to one. We have conducted five runs for each FRBC. We observe from Table 2 that for the proposed CSFS, in all five runs, for each class its characteristic features (i.e., \(x_{1}, x_{2}\) for class 1 and so on) are selected. The FRBC with the class-specific selected features has achieved an average accuracy of \(98.7\%\); in fact, each run achieved the same accuracy. In contrast, with GFS, the selected subset is \(x_{3},x_{6}\). The FRBC using the globally selected feature subset has achieved an accuracy of \(34.7\%\) in each of the five runs. One can argue that the class-specific model uses all six features and hence performs better than the global model, which uses two features. However, when we learn the FRBC rules using all six features, it achieves an average accuracy of \(62.08\%\) over the five runs. The importance of class-specific feature selection is clearly established by this experiment.

Table 1. Description of the dataset Synthetic1
Table 2. Performance on the dataset, Synthetic1

3.2 Experiment 2

For Experiment 2, we have considered another synthetic dataset, Synthetic2, which is produced by appending two additional features \(x_{7}\) and \(x_{8}\) to the Synthetic1 data set. For class 1, \(x_{7}\) and \(x_{8}\) are generated as \(x_{1}+\mathcal {N}(0,0.1)\) and \(x_{2}+\mathcal {N}(0,0.1)\), respectively. For the other two classes, \(x_{7}\) and \(x_{8}\) are generated from \(\mathcal {U}(-10,10)\) and \(\mathcal {U}(-10,10)\), respectively. We observe that \(x_{7}\) is dependent on \(x_{1}\) and \(x_{8}\) is dependent on \(x_{2}\) for class 1, but the remaining two classes are indiscernible from each other considering features \(x_{7}\) and \(x_{8}\). Clearly, features \(x_{7},x_{8}\) are also discriminatory for class 1. However, do \(x_{7},x_{8}\) add any information over \(x_{1},x_2\) for class 1? The answer is no, because for class 1, \(x_{7}\) and \(x_{8}\) are noisy versions of \(x_{1}\) and \(x_{2}\), respectively. This feature redundancy is specific to class 1. In Table 3 we report the performances of the FRBCs in the CSFS framework without and with class-specific redundancy control. Here also, we have set the number of fuzzy rules per class to one and repeated the experiments five times with each model. The term ‘Acc.’ in Table 3 refers to the accuracy of the FRBC in percent. For class 1, features \(x_{1}\) and \(x_{7}\) are heavily dependent. Hence, to avoid redundancy, only one of them should be selected. The same argument holds for features \(x_2\) and \(x_8\). Minimizing (10), which uses (9) to control class-specific redundancy, the FRBC has successfully chosen only one of \(x_{1}\) and \(x_{7}\) and only one of \(x_2\) and \(x_8\) to include in \(\textbf{s}_{1}\) in all five runs (Table 3). On the other hand, we observe that without any redundancy control, the class-specific feature selection framework selects all four discriminatory features for \(\textbf{s}_{1}\) in four runs. The best accuracy achieved by the CSFS framework without redundancy control and that of CSFS with class-specific redundancy control are the same, \(99.3\%\), although the latter selects only two features. This establishes the benefit of class-specific redundancy control.
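To see why the class-specific correlation \(\rho _{k}\) matters here, one can compare the correlation between \(x_{1}\) and \(x_{7}\) computed on the class-1 instances only with the one computed on the full data set (a small sketch with our own variable names; X is the Synthetic2 data matrix and y the label vector):

```python
import numpy as np

# Within class 1, x7 = x1 + N(0, 0.1), so the two features are strongly correlated
# (provided the noise is small relative to the spread of x1 in class 1).
rho_class1 = np.corrcoef(X[y == 1][:, 0], X[y == 1][:, 6])[0, 1]

# Over the whole data set the correlation is diluted by classes 2 and 3,
# where x7 is drawn independently from U(-10, 10).
rho_global = np.corrcoef(X[:, 0], X[:, 6])[0, 1]
```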

Table 3. Class-specific feature selection on Synthetic2 data set

3.3 Experiment 3

In Experiment 3, we validate our proposed framework for handling the presence of different clusters in different sub-spaces within a class. We have generated Synthetic3, a two-class data set having four features, where each class is composed of two distinct clusters lying in two different sub-spaces. The data set Synthetic3 is described in Table 4. Each class has 200 points. For class 1, instances 1 to 100 form a distinct cluster around \((0,-5)\) in the feature space formed of \(x_{1},x_{2}\), and instances 101 to 200 form a distinct cluster around (0, 0) in the feature space formed of \(x_{3},x_{4}\). Similarly, class 2 also has two distinct clusters in the feature spaces formed of \(x_{2},x_{3}\) and \(x_{1},x_{4}\), respectively. To handle this dataset we employ our proposed rule-specific approach implemented using (11) and (12). The number of rules per class is set to two. As observed from Table 5, the rule-specific feature selection is successful in identifying the two important sub-spaces, i.e., \(x_{1},x_{2}\) and \(x_{3},x_{4}\) for class 1, and \(x_{2},x_{3}\) and \(x_{1},x_{4}\) for class 2. We note that for both classes the selected rule-specific subsets may interchange between rules 1 and 2 because the assignment of cluster numbers to the different groups of points of a class varies. We also note from Table 5 that, using the class-specific feature selection method, in different runs \(\textbf{s}_{1}\) comprises \(x_{1}\) or \(x_{2}\) and \(\textbf{s}_{2}\) comprises \(x_{2}\) or \(x_{1},x_{2}\). These subsets obviously do not characterize the classes correctly. The average accuracies of the FRBC using the feature subsets selected by rule-specific selection, by class-specific selection, and using all features are \(100\%\), \(77.4\%\), and \(87.5\%\), respectively. This demonstrates the usefulness of our proposed RSFS framework.

Table 4. Description of the dataset Synthetic3
Table 5. Features subsets selected for Synthetic3

4 Conclusion

In this work, we have first proposed a class-specific feature selection scheme using feature modulators embedded in an FRBC. The parameters of the feature modulators are tuned by minimizing a loss function comprising the classification error and a regularizer that makes the modulators completely select or reject features. This framework was used in [4, 5] for selecting globally useful features; we modified it to make it suitable for CSFS. Our proposed CSFS method does not employ the OVA strategy, unlike most of the existing CSFS works, and hence is free from the increased computational overhead and other hazards associated with OVA-based methods. We have extended the CSFS scheme so that it can monitor class-specific redundancy by adding a suitable regularizer. Finally, our CSFS framework is generalized to a rule-specific feature selection framework to handle the presence of multiple sub-space-based clusters within a class. The three approaches are validated through three experiments on appropriate synthetic data sets. There are certain limitations of the proposed CSFS. The rule firing strength is computed using the product T-norm, so for very high-dimensional data it may lead to underflow. However, this issue is true for any FRBC; it is not special to the systems proposed here. Since we have experimented on synthetic data sets, parameters such as the number of rules per class and the spreads of the Gaussian membership functions are set intuitively. However, for real data sets, these parameters need to be chosen more judiciously with a systematic method.