Keywords

1 Introductory Notes

In the construction of fuzzy rule based systems, we have been witnessing a wealth of design strategies and detailed algorithms involving the technology of Evolutionary Computing and neurocomputing. Just recent developments reported in this realm can be found in a series of studies [13, 5, 6, 8]. Predominantly, the development of fuzzy models is guided by the criterion of accuracy. Another fundamental criterion being at the heart of fuzzy modeling is interpretability (transparency) of resulting fuzzy models. This criterion is central to fuzzy models however its multifaceted nature requires a thorough formulation and a detailed quantification of essential aspects of interpretability. Subsequently, it calls for engaging advanced optimization techniques supporting the realization of the ensuing design.

The concept of interpretability of fuzzy rule-based models has been around for several decades and attracted a significant deal of attention. The transparency of fuzzy models is one of the outstanding and important features of fuzzy models. In contrast to the criterion of accuracy, whose quantification is relatively straightforward and easy to come up with performance indexes, transparency of fuzzy rules is more difficult to describe. What makes the fuzzy rule-based easier to interpret and comprehend is still an open issue. It is quite subjective to assess and in one way or another invokes a factor of subjective judgment given that a human user is ultimately involved in the evaluation process. What also becomes apparent, is a multifaceted nature of the problem and a multitude of various approaches supported by various optimization technologies including evolutionary optimization. When it comes to the main factors worth considering when discussing a concept of interpretability, we can enumerate a list of factors that may be involved in the reduction process:

  • number of rules forming a rule base of the model,

  • number of sub-conditions (input variables) forming a condition part of a given rule,

  • number or rules and the number of input variables,

  • complexity of local regression models forming the conclusion part of the rules (in case of Takagi-Sugeno model)

  • interpretability of a family of fuzzy sets formed in the input space for individual variables.

As a result, given this diversity of possible ways of reduction of rules, it is difficult to quantify the effect of reduction. For instance, it is not always clear if it would be better to have a larger number of simple rules (whose condition parts are linear functions) or a smaller number of rules of a more complex conclusion parts (say, those of polynomial form).

The reader may refer to a large body of studies devoted to the issue of interpretability, cf. [3, 10]. Quite often, given the combinatorial nature of the reduction problems, Genetic Algorithms are used in the optimization process, see [9]. There are also more specialized algorithmic vehicles to support the reduction process such as e.g., singular value decomposition [14] however one has to be cognizant as to the interpretability of the reduced rules.

As to the overall strategy of rule reduction and interpretability enhancement, there are two general design strategies: either the reduction is completed once the model has been constructed or the reduction mechanism of rule-based modeling is incorporated into the design process from its very beginning.

The objective of the study is to pursue a fundamental issue of building more transparent and user-centric fuzzy rule-based models by starting from an already designed fuzzy model. The intent is to make the rules more readable in two different ways: (a) by reducing the input space (the number of antecedents) of the individual rules, and (b) by isolating input variables completed for the input variables treated en block in the condition parts of the rules.

In both these fundamental scenarios we can establish and quantify a tradeoff between a gradual reduction of accuracy (which is inevitable when realizing any of the reduction mechanisms of the rules and enhancing its intensity) and the increased interpretability of the rules. The presentation of the material is structured in the following way. We briefly revisit the essentials of Takagi-Sugeno rule- based systems, stressing the proposed design approach in which we use fuzzy clustering (Sect. 2). The reduction of input space and a formation of input subspaces for each rule is introduced in Sect. 3; in the same section we also present an optimization process realized with the use of genetic algorithm. The second approach to the enhancement of the interpretability of the rule – an isolation of input variables is discussed in Sect. 5. Detailed experimental studies are reported in Sects. 3 and 6.

In the study, we experiment with a number of publicly available datasets coming from eight datasets from UCI Machine learning repository and DELVE repository; their main characteristics are listed in Table 1.

Table 1 Main characteristics of datasets used in the experiments (number of data and dimensionality –number of input variables)

2 Fuzzy Rule-Based Models: An Overview and Main Design Issues

Our point of departure of the reduction processes of the rules is a “standard” Takagi-Sugeno fuzzy model comprising c rules coming in the following form

$$ - if\,{\boldsymbol{x}}\,is\,A_{i} \,then\,y = \, f_{i} \left( {{\boldsymbol{x}},{\boldsymbol{a}}_{i} } \right) $$
(1)

for i = 1, 2, .., c where x \( \in \) R n. Ai is a fuzzy set defined in the n-dimensional space while fi is the corresponding local model (linear or nonlinear) endowed with its parameters a i forming the conclusion part of the i th rule. The design of such models is well reported in the literature and as usual consists of the two main steps, namely (a) a construction of condition parts (through clustering of input data done in the input space) and (b) estimating parameters of the linear models (which leads to the problem of linear regression). The number of rules (c) is determined by monitoring the behavior of the model on the training and testing data. The rules in the form (1) come with several essential properties. The condition part is a fuzzy set expressed in R n making the rules concise, which helps avoid a curse of dimensionality we are commonly faced with in rule based systems with a higher number of input variables. In the case of treating all input variables at the same time, the number of rules becomes small (c) and the rule base itself is compact. Unfortunately, the interpretability could be negatively impacted as no individual variables in the condition part are treated and visualized separately.

When it comes to the quantification of the accuracy of the fuzzy model (1) its performance is commonly expressed by the RMSE index computed for the training set

$$ \sqrt {\frac{ 1}{N}\sum\limits_{k = 1}^{N} { ( {\text{FM(}}{\mathbf{x}}_{k} )- target_{k} )^{ 2} } } $$
(2)

where N denotes the number of data in the training set. In the same way, quantified is the performance of the constructed model on the testing data (consisting of M data points)

$$ \sqrt {\frac{ 1}{\text{M}}\sum\limits_{{{\text{k}} = 1}}^{\text{M}} { ( {\text{FM(}}{\mathbf{x}}_{\text{k}} )- {\text{target}}_{\text{k}} )^{ 2} } } $$
(3)

Proceeding with the data summarized in Table 1, the performance of the corresponding models visualized versus the number of rules is illustrated in a series of figures shown below, Fig. 1. The results are reported both for the training and testing data. In the design, we use a standard version of the Fuzzy C-Means (FCM) [4] with the fuzzification coefficient (m) set to 2.0. The algorithm was run for 10 iterations (more specifically, we completed 10 runs with different splits of data into training/testing data)

Fig. 1
figure 1

Performance of fuzzy models versus the number of rules reported for the training and testing data for datasets (Table 1)

For the training sets, there is a general tendency of having lower values of the RMSE with the increase of the number of rules. The performance reported on the testing sets points at the memorization effect where the some models tend to lose their generalization capabilities. By eyeballing these plots, we choose a suitable number of rules (clusters) where sound approximation abilities come hand in hand with the generalization of the models. The values selected in this way are collected in Table 2.

Table 2 Number of rules of fuzzy models constructed for the corresponding data

3 Reduction of Input Subspaces in Rule-Based Models

The essence of the enhancement of interpretability of the rules is accomplished by reducing the number of input variables standing in the condition parts of the rules. The reduced rules are concisely described in the form

$$ - if\, {\boldsymbol{x}}\,is\,\left[ {A_{i} } \right]_{Xi} \,then\, \, y = \, f_{i} ({\boldsymbol{x}},{\boldsymbol{a}}_{i} ) $$
(4)

where the symbol [] X i stresses the fact that the fuzzy set A i is now effectively confined to the reduced input space X i \( \subset \) R n where some original input variables have been removed. In other words, dim (X i ) = n i  < n.

The computing of the activation level of A i positioned in this new reduced space is realized as follows

$$ \left[ {{\text{A}}_{\text{i}} } \right]_{{{\mathbf{X}}{\text{i}}}} \left( {\mathbf{x}} \right) = 1 /\sum\limits_{{{\text{j}} = 1}}^{\text{c}} { (\frac{{\left. {\left\| {\underline{{\mathbf{x}}} - \underline{{{\mathbf{v}}_{\text{i}} }} } \right\|} \right|_{{\aleph_{\text{i}} }} }}{{\left. {\left\| {\underline{{\mathbf{x}}} - \underline{{{\mathbf{v}}_{\text{j}} }} } \right\|} \right|_{{\aleph_{\text{i}} }} }} )^{{\frac{ 2}{{{\textit{m}} - 1}}}} } , $$
(5)

for i = 1,2,.., c; m > 1 where the computations of the distance are realized in the reduced input space X i .

The reduction of the rules can be quantified in terms of a reduction factor ν, which relates with the number n*c (expressing an overall number of variables across all the rules) in the following way

$$ p \, = n\left( {n*c} \right). $$
(6)

where p represents a reduced number of input variables used in c rules.

The reduction of the input variables existing in the new, more interpretable rules is done via engaging the optimization capabilities of Genetic Algorithms (GAs). More specifically, we optimize a matrix of allocation of input variables W = [w ij ] with c rows and n input variables. The p largest entries of W are selected giving rise to a binary 0–1 matrix. The p entries with the largest values are set to 1 while the remaining ones are suppressed to zero. Each row of the matrix formed in this way identifies the variables to be used in the corresponding reduced rule. If all entries of the i th row of W are equal to zero, this entails that the corresponding rule does not exist in the reduced set of rules.

The process of identifying which input variables should be kept in the antecedents of rules is translated into an optimization problem. Its solution is obtained through evolutionary optimization, namely Genetic Algorithm (GA) [8]. GA uses elements of natural selection to determine the best solution to a problem by minimizing a certain fitness function capturing the essence of the optimization problem. The best solution is obtained via selecting and modifying a population of potential solutions (chromosomes). A main flow of GA computing is outlined in Table 3.

Table 3 A flow of optimization realized by genetic algorithm

The two aspects of GA that require special attention and directly impact the quality of results of the optimization process are: 1) construction of a suitable fitness function, and 2) mapping a solution to the problem onto a structure of the chromosome.

A fitness function is used to evaluate possible solutions. Fitness values associated with solutions (chromosomes) represent their ability to solve a given problem. The values are used during selection of chromosomes for further processing (Table 3, line 5). The fitness function has to reflect the objective of optimization process and ensure that the obtained fitness values are adequate for mechanisms of selection, i.e., the fitness values should be such that a selection process leads to a diversified set of potential solutions [11]. In the case of our problem of selecting input attributes that should be kept in the rules, the fitness function evaluates quality of fuzzy rules that model a given dataset. The form of the fitness function used in the optimization is given by (2).

As noted earlier, each chromosome represents a solution to the considered problem. It consists of simple elements, called genes, that can assume values 1 or 0 (Binary Coded GA), or any real numbers from a specified range (Real Coded GA). A chromosome with such a simple structure, called genotype, has to represent a solution to the problem, i.e., it should be mapped into a form called phenotype that is adequate for a given problem domain.

For the optimization problem considered in the paper, we use a Real Coded GA (RCGA). The size of chromosome, i.e., a number of genes gsize, is equal to the product of a number of rules c and a number of input attributes n of a given dataset: \( gsize = cn \). Therefore, each chromosome contains information which input attributes are present in each rule. A single gene is a real number between 0 and 1. The translation of such a chromosome (genotype) into a solution to our problem (phenotype) occurs in the following way. The user has to determine a number p of input attributes or isolated attributes (Sect. 5) that should be kept in all c rules. This means that among all genes of a chromosome only p of them should be used to construct c rules. The selection of these genes is done via sorting all genes based on their values (between 0 and 1) and identifying the first p of them. The process is presented in Fig. 2.

Fig. 2
figure 2

GA chromosome: (i) original matrix of rules, (ii) a single chromosome built from the whole matrix, (iii) rules built based on this chromosome (there are p black boxes representing attributes selected for c rules)

Once an initial population of chromosomes c k has been generated, RCGA starts an iteration process. A single iteration consists of calculating fitness values (Table 3, line 4), selecting best chromosomes (line 6), and performing crossover and mutation operations on selected chromosomes (line 7). All this constitutes a single generation. After executing a number of generations, GA provides an optimal solution in a form of a chromosome that gives the highest value of the fitness function.

The purpose of crossover is to exchange information between chromosomes. In its simplest form it means exchange of genes between a pair of randomly selected, with the probability p c given by the user, chromosomes. For our RCGA, a linear crossover is applied [13]. The linear crossover generates three offspring (new chromosomes): \( O_{{k}} {{ = (o}}_{ 1}^{{k}} ,\ldots , {{o}}_{{i}}^{{k}} ,\ldots {{o}}_{{n}}^{{k}} ) , {{k}} = 1,2,3 \), where \( {{o}}_{{i}}^{\text{k}} \) represents the ith gene of the kth chromosome. The genes of offspring are built from the genes of two parents. Let us assume that the parents are: \( {{C}}_{{k}} {{ = (c}}_{ 1}^{{k}} ,\ldots , {{c}}_{{i}}^{{k}} ,\ldots {{c}}_{{n}}^{{k}} ) , {\text{ k}} = 1,2 \). The values of genes of the offspring are calculated in the following way:

$$ o_{\text{i}}^{ 1} = \frac{1}{2}{\textit{c}}_{\textit{i}}^{ 1} + \frac{1}{2}{\textit{c}}_{\textit{i}}^{ 2} ,\,{\textit{o}}_{\textit{i}}^{ 2} = \frac{3}{2}{\textit{c}}_{\textit{i}}^{ 1} - \frac{1}{2}{\textit{c}}_{\textit{i}}^{ 2} \,{\text{and}}\,{\textit{o}}_{\textit{i}}^{ 3} = - \frac{1}{2}{\textit{c}}_{\textit{i}}^{ 1} + \frac{3}{2}{\textit{c}}_{\textit{i}}^{ 2} $$
(7)

Each of the offspring is evaluated, i.e., a fitness value is determined for each of them. The two most promising offspring are selected to substitute their two parents in the population.

The mutation operator modifies genes of a single chromosome in order to introduce diversity to the population. The frequency of modifications is controlled by the mutation probability p m provided by the user. Mutation ensures that a search process covers the whole space of possible solutions. In the case of RCGA, the Muhlenbein’s mutation is adopted [12]. For a given parent chromosome \( {\textit{C}}_{\textit{k}} {{ = \textit{(c}}}_{\textit{1}} ,\ldots , {\textit{c}}_{\textit{i}} ,\ldots {\textit{c}}_{\textit{n}} ) \), the gene values of a new – mutated – chromosome are calculated as \( {\textit{c}}_{\textit{i}}^{\prime } {\textit{ = c}}_{\textit{i}} \pm {\textit{rang}}_{\textit{i}} \bullet {\textit{g}} \), where \( {\textit{rang}}_{\textit{i}} \) defines the mutation range, and it is normally set to \( 0.1 * ( {\textit{b}}_{\textit{i}} {\textit{ - a}}_{\textit{i}} ) \), where \( {\textit{b}}_{\textit{i}} \) and \( {\text{a}}_{\text{i}} \) are the maximum and the minimum of \( {\text{c}}_{\text{i}} \), respectively. The sign + or − is chosen with a probability of 0.5, while \( {\text{g = }}\sum\limits_{\text{k = 0}}^{ 1 5} {{\textit{a}}_{\textit{k}} 2^{\text{ - k}} , {\textit{a}}_{\textit{k}} \in \{ 0,1\} } \) is randomly generated with \( {\textit{p(a}}_{\textit{i}} = 1 )= \frac{1}{16} , {\textit{ i }= }0.15 \).

The operations on a single chromosome are implemented in such a way that at least one input attribute is kept in each rule.

For reduction of input spaces, if p is a number of input attributes to be kept, the top p genes will be used to calculate the model output. The rest gsize-p attributes will be omitted in the calculation.

For isolation of attributes (Sect. 5), if p is a number of isolated attributes in each rule, the top p attributes in each rule (row of the matrix W) will be marked as isolated. The rest n − p attributes will be treated as the remaining group of attributes. Thus, the total number of isolated attributes is pc.

4 Experimental Studies

In the following experiments we present how the reduction of the rules proceeds and how the reduced, more interpretable rules perform. We use different values of ν and report the corresponding values of the RMSE for the training and the testing data. The GA used a population of 100 individuals and was run for 100 generations. The crossover rate was set to 0.8 while the mutation rate was equal to 0.1. The choice of these numeric values was a result of some preliminary experimentation. The results are quantified by reporting the RMSE values obtained for different values of the reduction index; refer to Fig. 3.

Fig. 3
figure 3

RMSE values of the reduced rule-based models versus reduction level ν

It becomes apparent (and intuitively anticipated) that lower values of ν result in higher RMSE values. The detailed behavior varies across data with regard to how far the rules can be reduced and how the differences shape up for the training and testing data. For example, the reduction could be made quite substantial not compromising the performance of the model as this becomes present in case of abalone, auto, concrete, and white wine. In some case, we witness a phenomenon of increased generalization abilities of the model (lower differences of the RMSE for the training and testing data for lower values of ν). Figure 4 illustrates the performance of the GA for some selected data; most of the improvement is visible at the beginning of the optimization (first 20–30 generations).

Fig. 4
figure 4

Values of fitness function reported in successive GA generations for ν ranging from 0.1 to 0.9 with step 0.1 show direction of ν

The detailed results of reduction of the number of variables in the rules are contained in Fig. 5. The shaded regions identify the input variables being retained in the corresponding rules. This offers a better view as to which input variables can be dropped and points at a sequence of the variables, which have been eliminated.

Fig. 5
figure 5figure 5

Visualization of reduced rules: shaded regions identify the input variables being retained

5 Isolation of Input Variables

To enhance the transparency of the rules, we express the fuzzy set A i as a Cartesian product of a single isolated fuzzy set defined in R and a relational remainder expressed in R n1. In other words, we form the expression describing the condition part as follows

$$ A_{i}^{\wedge}{(x_{j})} \times A_{i}^{\wedge}{\sim } \left( {x^{\sim } } \right) $$
(8)

where A ^ i is a fuzzy sets defined in R and A ~ i is expressed in R n−1. Then the rules of the form read as follows

$$ \text{-} if\, x_{j}\, is\, A_{i}^{\wedge} (x_{j})\, and\, x^{\sim} A_{i}^{\sim}(x^{\sim})\,then\, y\, is\, f_{i}(x, a_{i}) $$
(9)

Here a certain input variable (jth one) has been selected to be isolated. The term isolation pertains to the fact that a certain variable has been chosen and subsequently a fuzzy set isolated from the fuzzy set Ai is treated separately and thus becomes more visible and interpretable.

Obviously, the above Cartesian product is not identical to the original A i that is the following holds

$$ A_{i} \ne (A_{i}^{\wedge} \times A_{i}^{\sim} ) $$
(10)

In terms of membership functions this means that the following relationship holds

$$ A_{i} \left( {\user2{x}} \right) \ne \text{min}(Ai^{\wedge}{(xj)} ,Ai^{\sim} (x^{ \sim})) $$
(11)

Note that the corresponding membership functions of A ^ i (x j ) and A ~ i (x ~) are computed as follows

$$ A_{i}^{\wedge}{(x_{j})} = \frac{ \it{1}}{{\sum\limits_{\it{l} = \it{1}}^{\it{c}} {\left( {\frac{{{\it{x}}_{\it{j}} {\it{ - v}}_{\it{ij}} }}{{{\it{x}}_{\it{j}} {\it{ - v}}_{\it{il}} }}} \right)^{{ \it{2} / ( {\it{m} {-} \it{1)}}}} } }} $$
(12)
$$ A_{i}^{ \sim } \left( {\varvec{x}^{\sim } } \right) = \frac{\it{1}}{{\sum\limits_{l = 1}^{c} {\left( {\frac{{||\varvec{x}^{\sim } - \varvec{v}_{i}^{\sim } ||}}{{||\varvec{x}^{\sim } - \varvec{v}_{l}^{\sim } ||}}} \right)^{{2/(\varvec{m} - 1)}} } }} $$
(13)

The consequence is that if A ^ i (x j ) \( \times \) A ~ i ( x ~ ) is used as the condition part of the ith rule, the output of the model is going to be different than the original rule. It is likely that the accuracy of the model could be reduced as a result of the increased interpretability of the rules because of the isolation of the input variables. In the above formulation, one is interested in choosing an individual variable (jth one) for which the results provided by the rule-based model are as close as possible to those formed by the original fuzzy model. The selection of the input variable is quite straightforward through a direct enumeration.

The plots showing the LHS and the RHS relationship is shown for concrete strength with one or five isolated input variables.

From the plots above, several observations of a general character can be drawn. First, the values of the LHS are higher than the corresponding ones for the RHS. This is reflective of the fact that the separation of the variable(s) leads to the higher activation levels of the rules with eventual reduction of the specificity of the results of reasoning. It is also apparent that with the increase of the variables being isolated – compare Fig. 6a, b, the differences between the values produced by the LHS and RHS of the expression (10) are more profound. Again, this is not surprising as by isolating more variables we depart from the RHS more vigorously.

Fig. 6
figure 6

Values of the original activation of the rules value versus the one with isolated input variables - concrete strength data set, 4 rules in the rulebase

In a general setting, one can realize an isolation of L input variables, which as a result leads to the rules in the form

$$ \begin{aligned} & - {\text{if}}\,{\text{x}}_{{{\text{j}}1}} \,{\text{is}}\,{\text{A}}_{i}^{ \wedge } \text{x}_\text{j{1}} \,and\,{\text{x}}_{{{\text{j}}2}} \,is\, {\text{A}}_{i}^{ \wedge } ({\text{x}}_{{{\text{j}}2}} ) \\ & and\, \cdots and\,{\text{x}}_{\text{jL}} \,{\text{is}}\,{\text{A}}_{i}^{ \wedge } ({\text{x}}_{\text{jL}}) \,and\,x^{ \sim } \,{\text{A}}_{i}^{ \sim } ({\mathbf{x}}^{ \sim } ) \\ & \quad \,\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad {\text{then}}\,{\text{y}}\,{\text{is}}\,{\text{f}}_{\text{i}} ({\mathbf{x}},{\mathbf{a}}_{\text{i}} ) \\ \end{aligned} $$
(14)

note that in this case x ~ is defined in R nL.

Here an optimal choice of L variables gives rise to a combinatorial optimization problem. This could be solved by GA optimization. Considering that the value of L is specified in advance, GA forms an optimal isolation matrix I consisting of c rows (number of rules) and n rows (number of input variables) where in each row there are L 1 s indicating the variables which are isolated in the rule. For instance, for L = 3 the matrix with the entries

$$ {\mathbf{I}} = \left[ {\begin{array}{*{20}l} { 1\quad 0\quad 0\quad 1\quad 1\quad 0} \hfill \\ { 0\quad 1\quad 1\quad 1\quad 0\quad 0} \hfill \\ \ldots \hfill \\ \end{array} } \right] $$
(15)

states that in the first rule isolated are variables 1, 4, and 5; in the second rule we isolate variables 2, 3, and 4, and so on.

6 Experiments

The GA was carried out with 100 populations and maximum 50 generations. Preliminary experiments with GA indicated lack of improvement before 50th generation, so the maximum generation is set to 50. The population is not large, so relative large values of moderate crossover and mutation rates are used. The crossover rate is set as 0.8 and mutation rate is 0.1.

We present the results in a similar way as before by focusing on the presentation of the rules with isolated variables and showing how the families of isolated variables impact the performance of the model (Figs. 7, 8 and 9).

Fig. 7
figure 7

Performance of the fuzzy model versus the number of isolated input variables

Fig. 8
figure 8

Fitness function reported in successive GA generations

Fig. 9
figure 9figure 9

Isolated variables (shaded) obtained for selected values of L

7 Conclusions

The two approaches enhancing the interpretability of rule-based models are directly applied to the already constructed Takagi-Sugeno fuzzy models realized with the use of fuzzy clustering. Both the formation of the input subspace of conditions as well as the isolation of the input variables are the methods refining multivariable fuzzy sets produced through fuzzy clustering. The proposed approaches are quantifiable in terms of the level of the interpretability abilities offered by them (expressed either in terms of the number of variables eliminated or the variables isolated). This aspect is helpful in determining how much the interpretability could be enhanced without any significant sacrifice of accuracy of the model. Furthermore in this way one could reveal input variables (or their combinations) that are essential in rule-based modeling.

The approach offers a certain new view at the enhancement of fuzzy rule-based models. There could be several avenues worth pursuing in the future including (a) development of a hybrid arrangement of the formation of subspaces of conditions of the rules associated with some further isolation of variables from such subspaces, (b) use of other techniques of Evolutionary Optimization in the entire process, (c) construction of interpretability measures quantifying various facets of the interpretation mechanisms.