1 Introduction

In knowledge discovery in databases we can distinguish between two different approaches (Lavrač et al. 2004): predictive induction and descriptive induction. The difference lies in the main objective pursued and, therefore, in the learning method used to attain it. On the one hand, predictive induction aims to generate legible models that describe, with the highest reliability, the data set that represents the analyzed system. In that case, the goal is to use the obtained model to simulate the system, thus getting an explanation of its complex behavior. On the other hand, descriptive induction looks for particular (interesting) patterns in the data set. In that case, we do not get a global view of the relationships among variables, but we discover a set of statistically significant rules that differ sufficiently from one another.

This paper focuses on the former approach, predictive induction, to deal with regression problems where both inputs and outputs are real-valued and where the knowledge obtained is important to better understand the analyzed system. To represent the knowledge, and with the aim of generating legible enough models (which, no doubt, is one of the fundamental premises in any knowledge extraction process), we propose the use of fuzzy rule-based systems. These systems use IF–THEN fuzzy rules and linguistic variables to express the knowledge about the problem.

The automatic extraction of fuzzy rule-based systems can be done with different learning methods, either greedy algorithms (Nozaki et al. 1997; Wang and Mendel 1992) or other methods such as neural networks (Fullér 2000; Nauck et al. 1997) and genetic algorithms (GAs) (Cordón et al. 2001). Given this paper's aim of generating knowledge with good interpretability, we propose the use of GAs because they offer several features useful for our purpose. Firstly, they have a powerful search capacity that allows us to work with multiobjective optimization. Secondly, they can manage flexible representation structures, mixing coding schemes or including restrictions. Since the beginning of the 1990s, many researchers have turned their attention to the use of GAs to automatically design different components of a fuzzy rule-based system (Karr 1991; Thrift 1991; Valenzuela-Rendón 1991). These learning systems are usually known as genetic fuzzy systems (Cordón et al. 2001).

Regardless of the learning tool used, a crucial problem emerges: obtaining a model that is both accurate and understandable. Indeed, fuzzy modeling (i.e., the process of deriving fuzzy systems) usually imposes two contradictory requirements on the obtained model: interpretability, the capability to express the behavior of the real system in a comprehensible way, and accuracy, the capability to faithfully represent the real system. Of course, the ideal thing would be to satisfy both criteria to a high degree but, since they are contradictory issues, it is generally not possible. The quest for a good trade-off between interpretability and accuracy is the target of numerous current research works (Casillas et al. 2003a, b).

To reach this balance, we propose in this paper the use of fuzzy rules whose antecedent is in conjunctive normal form (i.e., the antecedent is a conjunction of propositions, each associating an input variable with a set of linguistic terms connected by disjunction), usually known as DNF-type fuzzy rules. This representation provides a high degree of compactness and knowledge synthesis. Since we are interested in predictive induction, the Pittsburgh-style GA (where each individual encodes a complete set of rules) seems to be the best approach to properly assess the interaction among the different fuzzy rules when performing interpolative reasoning.

However, combining DNF-type fuzzy rules with a Pittsburgh-style GA is far from easy, since several difficulties arise:

  • Consistency: each combination of antecedents (one label per input variable) should have only one possible consequent label.

  • Completeness: every training data example should fire at least one fuzzy rule.

  • Compactness: the lowest number of rules to accurately represent input-output relationships should be obtained. Among other issues, it involves avoiding redundant rules.

  • Non-overgeneral rules: a DNF-type fuzzy rule should be general enough to represent the input-output relationship in a compact way, but specific enough to avoid covering input areas without data.

Although it is relatively easy to comply with these conditions when using simple (Mamdani-style) fuzzy rules [see for example Jin et al. (1999), where measures of incompleteness and inconsistency are used as penalties in the rule's fitness], it becomes more complex in the case of DNF-type fuzzy rules. Most of the methods that deal with some kind of generalization of the fuzzy rule antecedent (e.g., DNF-type rules or rules with "do not care") do not address the problem properly (Casillas and Martínez-López 2008; Castro et al. 1999; González and Pérez 1998, 1999; Ishibuchi et al. 2006; Ishibuchi and Nojima 2007; Magdalena 1997; Otero and Sánchez 2006; Sánchez et al. 2001; Xiong and Litz 2000). Indeed, some of these proposals use a penalty fitness to correct these deficiencies, others infer a default output when no rules are fired, others tend to generate a high number of rules, and some simply do not prevent the system from generating inconsistencies or redundancies.

Few proposals explicitly try to hold one or more of the consistency, completeness, and compactness properties with a fuzzy rule structure whose antecedent has generalization capability. For example, Wang et al. (2005) use the same functions defined in Jin et al. (1999) to detect conflicts with an agent-based evolutionary approach in which the agents are multiobjective Pittsburgh-style genetic fuzzy systems. However, they use simple crossover and mutation operators and a posteriori reparation to solve inconsistencies and redundancies. Another solution is proposed in Wang et al. (1998), where redundancy by subsumption is removed by a specific a posteriori operator. However, consistency of the rule set is not ensured.

We take a completely different approach from the above methods and propose an algorithm that intrinsically explores only feasible solutions (according to the mentioned consistency, completeness, non-redundancy, and non-overgenerality restrictions), thus avoiding the use of penalties, reparations, or additional objective functions. It performs a multiobjective optimization process which generates a large range of solutions with different interpretability-accuracy balances under the mentioned restrictions.

The paper is organized as follows: Section 2 briefly presents the difficulties that appear when using DNF-type fuzzy rules. Section 3 describes the proposed algorithm, called Pitts-DNF. Section 4 shows the results obtained in a set of real-world problems compared with other fuzzy rule learning methods. Finally, Sect. 5 concludes and suggests some further work.

2 Some properties to be considered when learning DNF-type fuzzy rules

In order to obtain a high degree of knowledge synthesis, thus improving the interpretability, we opted for a compact description based on DNF-type fuzzy rules, whose antecedent is described in conjunctive normal form. This kind of fuzzy rule structure is defined as follows:

$$ \text{IF } X_1 \text{ is } \widetilde{A}_1 \text{ and } \ldots \text{ and } X_n \text{ is } \widetilde{A}_n \text{ THEN } Y \text{ is } B, $$

where each input variable \(X_i\) takes as its value a set of linguistic terms \(\widetilde{A}_i=\{A_{i1}\ \text{or}\ \ldots\ \text{or}\ A_{il_i}\}\), whose members are joined by a disjunctive (T-conorm) operator, whilst the output variable remains a usual linguistic variable with a single associated label. This structure naturally supports the absence of an input variable in a rule: \(\widetilde{A}_i\) is simply made the whole set of available linguistic terms.
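For concreteness, here is a minimal sketch (ours, not the paper's) of one way to hold such a rule in code; the tuple-of-frozensets layout and the term indexing are assumptions that the later sketches reuse:

```python
# Hypothetical representation: one frozenset of linguistic-term indices
# per input variable (the disjunction), one term index per output.
N_TERMS = 3  # e.g., 0 = small, 1 = medium, 2 = large for every variable

# IF X1 is small and X2 is {medium or large} THEN Y is medium
rule = ((frozenset({0}), frozenset({1, 2})),  # antecedent
        (1,))                                 # consequent

# A variable is absent from a rule when its set holds every term:
is_absent = rule[0][1] == frozenset(range(N_TERMS))  # False here
```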

When a whole set of such rules is learnt simultaneously (as recommended in predictive induction tasks), collisions easily appear. Basically, these collisions are of two types: inconsistency and redundancy. Furthermore, a fuzzy rule structure that allows a variable generality degree of the antecedent may lead the learning algorithm to two undesirable situations: over-generality and incompleteness. These four cases are discussed in the following sections.

2.1 Consistency

The first kind of collision is inconsistency. Two rules are inconsistent when their antecedents overlap, i.e., their antecedents are identical, they coincide in some labels for each input variable, or one is subsumed by the other (i.e., one antecedent is completely contained in a larger, more comprehensive one), but their consequents differ. For instance, the two following rules are inconsistent:

Example 1

$$ \begin{aligned} &{\tt R_1:\;IF\;X_1\;is\;A_1\;and\;X_2\;is\;B_1\;THEN\;Y\;is\;C_1}\\ &{\tt R_2:\;IF\;X_1\;is\;A_1\;and\;X_2\;is\;B_1\;THEN\;Y\;is\;C_2} \end{aligned} $$

and the same in this second example where the antecedents are partially overlapped:

Example 2

$$ \begin{aligned} &{\tt R_1:\;IF\;X_1\;is\;\{A_1\;or\;A_2\}\;and\;X_2\;is\;B_1\;THEN\;Y\;is\;C_1}\\ & {\tt R_2:\;IF\;X_1\;is\;A_1\;and\;X_2\;is\;\{B_1\;or\;B_2\}\;THEN\;Y\;is\;C_2} \end{aligned} $$

or in this third case where the antecedent of the former rule subsumes the latter:

Example 3

$$ \begin{aligned} &{\tt R_1:\;IF\;X_1\;is\;A_1\;and\;X_2\;is\;\{B_1\;or\;B_2\}\;THEN\;Y\;is\;C_1}\\ &{\tt R_2:\;IF\;X_1\;is\;A_1\;and\;X_2\;is\;B_1\;THEN\;Y\;is\;C_2} \end{aligned} $$

All these cases of inconsistency cause a linguistic contradiction that should be avoided for the sake of better interpretability.
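Under the tuple-of-frozensets representation sketched above, the overlap test reduces to a per-variable set intersection; a minimal sketch:

```python
# Sketch of the inconsistency test implied above: antecedents overlap
# iff their term sets intersect for every input variable; the pair is
# inconsistent when the consequents additionally differ.
def antecedents_overlap(ant1, ant2):
    return all(s1 & s2 for s1, s2 in zip(ant1, ant2))

def inconsistent(rule1, rule2):
    (ant1, con1), (ant2, con2) = rule1, rule2
    return antecedents_overlap(ant1, ant2) and con1 != con2

# Example 2, indexing A1/B1/C1 as 0 and A2/B2/C2 as 1:
r1 = ((frozenset({0, 1}), frozenset({0})), (0,))
r2 = ((frozenset({0}), frozenset({0, 1})), (1,))
assert inconsistent(r1, r2)  # both cover the cell (A1, B1)
```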

Solving these inconsistencies sometimes involves making a general rule more specific (as R1 or R2 in Example 2, or R1 in Example 3) or removing the more specific rule (as R1 or R2 in Example 1, or R2 in Example 3). Therefore, in these cases, solving the inconsistency also helps to reduce the complexity of the rule set.

In other situations, however, solving the inconsistency may require a higher number of rules, as shown in Fig. 1. Therefore, the interpretability improvement obtained when solving the inconsistency may involve a more complex rule set. This could be addressed by considering a firing-level-based hierarchy of the rule set (with the more specific rules in upper positions) (Yager 1993), by discounting the strength of the more general rules when they are inconsistent with more specific ones (Ishibuchi et al. 2006), or even by extending the knowledge representation to consider rules with exceptions (Carmona et al. 2004). These issues are out of the scope of this paper.

Fig. 1 Example of (a) an inconsistent solution and (b) a consistent solution. Notice that in this case more rules are needed to hold consistency

2.2 Redundancy

A second, less serious collision arises when the antecedents overlap as in any of the above examples but the consequent is the same. In that case, we have a redundancy. For instance:

Example 4

$$ \begin{aligned} &{\tt R_1:\;IF\;X_1\;is\;A_1\;and\;X_2\;is\;\{A_2\;or\;B_2\}\;THEN\;Y\;is\;C_1}\\ &{\tt R_2:\;IF\;X_1\;is\;A_1\;and\;X_2\;is\;A_2\;THEN\;Y\;is\;C_1} \end{aligned} $$

Redundancy increases the fuzzy rule set size unnecessarily and can even provoke undesirable interaction effects with some inference engines. When both rules have the same antecedent or one subsumes the other, the fuzzy rule set can easily be fixed by removing the repeated rule or the more specific one (R2 in Example 4), respectively.
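A minimal sketch of this clean-up, reusing the representation above (subsumption is a per-variable subset test; of exact duplicates only the first copy is kept):

```python
def subsumes(ant_general, ant_specific):
    # True when every term set of the specific antecedent is contained
    # in the corresponding set of the general one.
    return all(s2 <= s1 for s1, s2 in zip(ant_general, ant_specific))

def drop_redundant(rules):
    kept = []
    for i, (ant_i, con_i) in enumerate(rules):
        dominated = any(
            j != i and con_j == con_i and subsumes(ant_j, ant_i)
            and (ant_j != ant_i or j < i)  # keep the first exact duplicate
            for j, (ant_j, con_j) in enumerate(rules)
        )
        if not dominated:
            kept.append((ant_i, con_i))
    return kept

# Example 4: R2 is subsumed by R1 and shares its consequent, so it goes.
r1 = ((frozenset({0}), frozenset({1, 2})), (0,))
r2 = ((frozenset({0}), frozenset({1})), (0,))
assert drop_redundant([r1, r2]) == [r1]
```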

2.3 Over-generality

The use of a flexible structure to represent the antecedent of the fuzzy rule can also lead the algorithm to generate over-general rules. This happens when a DNF-type fuzzy rule includes more linguistic terms per variable than strictly necessary according to the training data, as illustrated in Fig. 2. It makes the fuzzy rule cover input regions where no information is available, which is not desirable at all since the quality of the rule in those regions cannot be tested.

Fig. 2 Example of a solution with (a) over-generality and (b) optimal generality according to training data. Notice that in this case more rules are needed to hold optimal generality

If the learning algorithm does not take this into account, the antecedent structure of some generated fuzzy rules is actually designed at random, since any consequent used in the empty regions would return exactly the same accuracy degree. This is even more serious if the learning algorithm is guided by an objective function designed to increase the generality degree of the fuzzy rules.

Ensuring optimal generality, however, may provoke some undesirable effects. The first is that sometimes a higher number of rules is needed (as shown in Fig. 2). Another drawback is that generating a fuzzy rule set that only covers the training input regions may worsen the prediction generality (i.e., the capability to accurately predict the output for unforeseen input data), typically measured by the test error, if test data lies in these areas.

Therefore, whether keeping optimal generality is recommendable or not is a controversial issue. We believe, nonetheless, that from the knowledge discovery point of view it is preferable to provide the expert with a set of rules that properly represents the analyzed data instead of performing a pseudo-random generalization of the rule antecedents.

As a last resort, and always considering the expert's interests, a good solution would be to provide one fuzzy rule set strictly generated from the data and another one to cover the input gaps. This latter fuzzy rule set, with total completeness, could be generated by linguistic extrapolation (e.g., Wang 2003). We leave this approach for further work.

A simpler solution, extensively used in the literature, is to return a conservative output (e.g., in regression problems it is usual to give the mean value of the output domain) when no rules are fired because the data lies in an uncovered input region. If this is done during the learning process, an incomplete rule set may be generated (as discussed in the next section), so it is not recommended at all. However, it can be considered in the inference engine once learning has finished and the final fuzzy system is used to predict the output for new input data. We follow this approach for reporting test results in this paper.

2.4 Completeness

The last interesting property is completeness. This means that no input regions containing data should be left uncovered, i.e., with no rules being triggered. In classification tasks this situation is unusual since an uncovered training example is considered misclassified, so such solutions are penalized. However, in regression problems, where an output must always be returned, if the inference is designed to give a default answer (e.g., the mean value, as mentioned in the previous section), incomplete fuzzy rule sets may be generated.

Moreover, if the learning system seeks optimal generality (see the previous section) and/or a reduced number of rules, the risk of obtaining incomplete rule sets is higher. We endow our proposed algorithm with an effective way of ensuring completeness.

3 Pitts-DNF algorithm

The learning algorithm we propose in this paper, called Pitts-DNF, has been designed to avoid generating DNF-type fuzzy rule sets with inconsistencies, redundancies, over-generality, or training incompleteness. Its scheme is shown in Algorithm 1.

Algorithm 1: Pitts-DNF algorithm

Parameters: Population size, crossover probability, antecedent mutation probability, and consequent mutation probability

Input: Data set: \(D=\{(x,y)\,|\,x \in {\mathbb R}^n,\ y \in {\mathbb R}^m\}.\) Membership function definitions

Output: Set of non-dominated solutions, each one with a different number-of-rules/accuracy trade-off. Each solution is a consistent, non-redundant, non-over-general, and complete DNF-type fuzzy rule set

begin

  Initialization(P)

  CH \(\longleftarrow\) Covering_Hypermatrix(D)

  Evaluation(P, D)

while not stop condition do

   P1 \(\longleftarrow\) Multiobjective_Selection(P)

   P2 \(\longleftarrow\) Crossover(P1)

   P3 \(\longleftarrow\) Antecedent_Mutation(P2, CH)

   P4 \(\longleftarrow\) Consequent_Mutation(P3)

   P5 \(\longleftarrow\) Completeness_Operator(P4, D)

   Evaluation(P5, D)

   P \(\longleftarrow\) Multiobjective_Replacement(P5, P)

end

end

3.1 Coding scheme

Each individual of the population represents a set of fuzzy rules (i.e., Pittsburgh style). Therefore, each chromosome consists of the concatenation of a number of rules. The number of rules is not fixed a priori, so the chromosome is variable-length. Each rule (part of the chromosome) is encoded by a binary string for the antecedent part and an integer coding scheme for the consequent part. A binary coding could also be used for the consequent part without influencing the algorithm's behavior but, since we use a fuzzy rule structure where only one label is associated with each output variable, integer coding seems more appropriate.

The antecedent part has a size equal to the sum of the number of linguistic terms used in each input variable. The allele ‘1’ means that the corresponding linguistic term is used in the corresponding variable. The consequent part has a size equal to the number of output variables. In that part, each gene contains the index of the linguistic term used for the corresponding output variable.

For example, assuming we have three linguistic terms (S [small], M [medium], and L [large]) for each input/output variable, the fuzzy rule

$$ \text{IF } X_1 \text{ is S and } X_2 \text{ is } \{\text{M or L}\} \text{ THEN } Y \text{ is M} $$

is encoded as

$$ [100|011||2]. $$

A chromosome will be the concatenation of a variable number of these fuzzy rules, e.g.,

$$ [100|011||2\quad 010|111||1\quad 001|101||3] $$

for a set of three rules. Notice that we do not fix a maximum number of rules per chromosome. Since our algorithm removes any unnecessary redundancy and unfired fuzzy rules, the number of rules is naturally restrained at all times.

A variable with all labels set to '1' is allowed (meaning the variable is not considered in the corresponding rule), but a variable with all labels set to '0' is forbidden. Although one could think of also assigning this latter combination to the absence of the variable (as with all labels set to '1'), we would then have solutions that are genotypically close but phenotypically distant, which is not advisable.
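As an illustration, here is a minimal sketch (our own, not the paper's code) that reproduces the encoding above from the tuple-of-frozensets representation used in Sect. 2:

```python
# Sketch of the coding scheme: '1' bits mark the terms used by each
# input variable; the consequent is a 1-based term index.
N_TERMS_IN = (3, 3)  # S, M, L for X1 and X2

def encode(rule):
    ant, con = rule
    parts = [''.join('1' if t in terms else '0' for t in range(n))
             for terms, n in zip(ant, N_TERMS_IN)]
    return '|'.join(parts) + '||' + str(con[0] + 1)

# IF X1 is S and X2 is {M or L} THEN Y is M
rule = ((frozenset({0}), frozenset({1, 2})), (1,))
assert encode(rule) == '100|011||2'
```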

3.2 Initialization

Since we are looking for optimal completeness, we need to start with rules that cover all the examples. For that reason, we use the well-known Wang–Mendel algorithm (Wang and Mendel 1992), briefly described in the Appendix, to generate the antecedent structure.

Specifically, every chromosome is generated with the minimum number of rules that cover the examples according to this algorithm and with a random consequent for every rule (except one chromosome that uses the consequents provided by Wang–Mendel). In this way, all chromosomes start with the same number of rules, each as specific as possible (i.e., with a Mamdani-type structure instead of the DNF one).

3.3 Covering hypermatrix computation

The objective of this step is to generate a data structure that will be used, when generating new rules, to efficiently avoid over-generality, i.e., the creation of rules in regions without training data. This structure, which we call the covering hypermatrix, stores the antecedent label combinations that cover all the examples in the training data set. Notice that the hypermatrix represents the highest allowed input covering, but it does not show whether a lower number of rules would completely cover the training data set, so it cannot be used to ensure completeness.

The structure of this hypermatrix is an array whose dimensionality equals the number of input variables, containing '1' in a cell if the corresponding input combination covers at least one training example and '0' otherwise. With this structure it is possible to design an efficient mutation operator that avoids over-general rules.

The implementation of this structure must be especially efficient because of its demanding access-time requirements. In this work we implement the hypermatrix as a hash table whose keys are built from the concatenation of the rule labels. To optimize the table in space and retrieval time, the '0' combinations are not stored: if a particular key does not exist, its value is considered to be '0'.
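A minimal sketch of this structure, assuming a helper best_label() that returns the index of the strongest-matching term for a value (a placeholder for whatever fuzzification the implementation uses):

```python
# The hypermatrix as a hash table: only the '1' cells are stored, so a
# missing key reads as '0'. A Python set of label tuples suffices.
def build_hypermatrix(examples, best_label):
    covered = set()
    for x, _y in examples:
        covered.add(tuple(best_label(v, i) for i, v in enumerate(x)))
    return covered

def cell_has_data(hypermatrix, labels):
    return tuple(labels) in hypermatrix  # absent key -> '0'
```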

3.4 Crossover operator

The crossover operator is applied with a given probability rate to each randomly mated pair of parents. It only interchanges rules between the two parents; it does not modify them. Furthermore, it guarantees that the children present neither inconsistencies nor redundancies. The pseudo-code is included in Fig. 3 and an example of its application is shown in Fig. 4; a rough sketch follows the figures.

Fig. 3 Crossover operator

Fig. 4 Example of crossover operator application
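The exact procedure is given in Fig. 3; the following is only a rough, hypothetical approximation of a collision-free rule exchange, where collides() stands for the inconsistency and redundancy checks of Sect. 2:

```python
import random

def crossover(parent1, parent2, collides):
    # collides(rule, rule_set) -> True if adding `rule` would introduce
    # an inconsistency or a redundancy (placeholder for the real checks).
    def mix(base, donor):
        child = list(base)
        for rule in donor:
            if random.random() < 0.5 and not collides(rule, child):
                child.append(rule)  # take the rule only if it fits cleanly
        return child
    return mix(parent1, parent2), mix(parent2, parent1)
```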

3.5 Antecedent mutation operator

This operator and the consequent mutation are the ones that create new rules. It is applied with a given probability rate to each individual. As its name suggests, it acts on the input variables. When a gene in the antecedent part of a fuzzy rule is chosen to be mutated, the operator determines, among the available movements (explained below), those that keep consistency and non-overgenerality (the latter is quickly checked with the covering hypermatrix). Consistency is checked by analyzing collisions of the candidate mutated rule with the rest. One option among the permitted ones is then randomly chosen. Therefore, the antecedent mutation operator only explores feasible solutions, thus constraining the search space and ensuring a better exploration.

Figure 5 shows the pseudo-code of the operator. The two different actions are explained in the following subsections.

Fig. 5 Antecedent mutation operator

3.5.1 Contraction operation

It converts the mutated rule into a more specific one by choosing a gene of the selected variable with a '1' and flipping it to '0'. Clearly, the contraction operation can only be applied when there are at least two '1's, since otherwise all the genes of the variable would be '0', which, as mentioned in Sect. 3.1, is not allowed.

This operation will never cause inconsistency, redundancy, or over-generality since it generates a more specific rule, thus avoiding conflicts with other rules. The only undesired property it could cause is lack of completeness, but that is solved later by the completeness operator.

3.5.2 Expansion operation

This operation carries out the opposite process to contraction, making the rule more general. It chooses a gene with allele '0' and flips it to '1'. In this case, the restriction is that expansion can only be applied when there is at least one '0' among the genes of the variable.

Unfortunately, this operation can cause collisions with other rules or generate over-general rules. Therefore, first the set of expansion movements that can be applied to the selected variable without causing inconsistencies or over-generality (the latter checked using the covering hypermatrix) is generated, and then one of them is randomly chosen. If there are no allowed movements, and the variable contains at least two linguistic terms, the contraction operation is applied instead.

If, after expansion, the mutated rule subsumes other rules, the more specific ones are removed. This operation cannot cause lack of completeness.
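The following sketch combines both movements under the representation of Sect. 2; allowed() is a placeholder for the combined consistency and covering-hypermatrix (non-overgenerality) test applied to each candidate expansion:

```python
import random

def mutate_antecedent(ant, var, n_terms, allowed):
    terms = set(ant[var])
    # Expansion: candidate '0'->'1' flips, filtered by the feasibility test.
    options = [terms | {t} for t in set(range(n_terms)) - terms
               if allowed(ant, var, terms | {t})]
    if not options and len(terms) >= 2:
        # Fallback contraction: any '1'->'0' flip is safe by construction.
        options = [terms - {t} for t in terms]
    if not options:
        return ant  # no applicable movement
    return ant[:var] + (frozenset(random.choice(options)),) + ant[var + 1:]
```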

3.6 Consequent mutation operator

This operator, which is applied with a given probability rate to each individual, creates new rules by changing the consequent. It simply consists of randomly selecting an output variable of a rule that is not partially overlapped with other rules (the only problematic case, since the consequent mutation operator receives consistent and non-subsumed rules). Then, the consequent is randomly changed to the immediately higher or lower linguistic term. The operation cannot cause over-generality or lack of completeness since the fuzzy rule antecedent structures are kept unchanged.
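A minimal sketch of the consequent move (one step up or down in the ordered term set, assuming at least two output terms):

```python
import random

def mutate_consequent(con, n_out_terms):
    c = con[0]
    # Only moves that stay inside the term indices 0..n_out_terms-1.
    moves = [d for d in (-1, 1) if 0 <= c + d < n_out_terms]
    return (c + random.choice(moves),)  # immediately lower or higher term
```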

3.7 Completeness operator

The crossover operator and antecedent mutation by contraction can produce fuzzy rule sets that do not cover some training examples. This operator fixes that by adding rules to patch the uncovered input subspaces. It can be considered a reparation operator with low incidence since it does not change the generated rules; it only adds new ones. Figure 6 shows its operation mode.

Fig. 6 Completeness operator
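As a rough approximation of that procedure (the exact steps are in Fig. 6), each uncovered training example can receive the most specific rule of its input cell, Wang–Mendel style; best_label() and best_output() are hypothetical fuzzification helpers:

```python
def repair_completeness(rules, examples, best_label, best_output):
    # Placeholder coverage test: a rule covers x when x's best-matching
    # label for every variable belongs to the rule's term set.
    def covers(ant, x):
        return all(best_label(v, i) in ant[i] for i, v in enumerate(x))
    for x, y in examples:
        if not any(covers(ant, x) for ant, _ in rules):
            ant = tuple(frozenset({best_label(v, i)}) for i, v in enumerate(x))
            rules.append((ant, (best_output(y),)))  # most specific patch
    return rules
```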

3.8 Inference mechanism

We consider the Max–Min inference scheme (i.e., T-conorm of maximum as aggregation and T-norm of minimum as relational operator), with the T-norm of minimum as conjunction, the T-conorm of maximum as disjunction, and center of gravity as defuzzification. Moreover, in test mode the mean of the output domain is returned when no rules are fired for the given test example, as explained in Sect. 2.3.
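A compact sketch of this inference engine for triangular, uniformly spaced linguistic terms; the sampled centre of gravity and all helper names are our own choices, not the paper's code:

```python
import numpy as np

def tri(u, centers, k):
    # Triangular membership of value u in the k-th of >= 2 uniform terms.
    width = centers[1] - centers[0]
    return max(0.0, 1.0 - abs(u - centers[k]) / width)

def infer(rules, x, in_centers, out_centers, lo, hi, samples=1000):
    ys = np.linspace(lo, hi, samples)
    agg = np.zeros_like(ys)
    for ant, (c,) in rules:
        # DNF antecedent: max (disjunction) over a variable's terms,
        # min (conjunction) across variables.
        firing = min(max(tri(v, in_centers[i], t) for t in ant[i])
                     for i, v in enumerate(x))
        clipped = np.minimum(firing, [tri(y, out_centers, c) for y in ys])
        agg = np.maximum(agg, clipped)            # max aggregation
    if agg.sum() == 0.0:
        return (lo + hi) / 2.0                    # no rule fired: domain mean
    return float((ys * agg).sum() / agg.sum())    # centre of gravity
```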

3.9 Multiobjective approach

A generational approach with the multiobjective elitist replacement strategy of NSGA-II (Deb et al. 2002) is used. Crowding distance in the objective function space is considered. Binary tournament selection based on the nondomination rank (or the crowding distance when both solutions belong to the same front) is applied. The crowding distance is normalized for each objective according to the extreme values of the solutions contained in the analyzed front.
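A minimal sketch of the selection step, assuming the nondomination ranks and crowding distances have already been computed as in NSGA-II:

```python
import random

def binary_tournament(ranks, crowding):
    # ranks[i]: nondomination front of individual i (lower is better);
    # crowding[i]: crowding distance inside that front (higher is better).
    a, b = random.sample(range(len(ranks)), 2)
    if ranks[a] != ranks[b]:
        return a if ranks[a] < ranks[b] else b
    return a if crowding[a] >= crowding[b] else b
```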

3.10 Objective functions

We consider two objective functions to assess the quality of the generated fuzzy systems: the former (approximation error) to improve the accuracy and the latter (complexity) to improve the interpretability. Both are sketched in code after the list.

  • Approximation error: The mean squared error (MSE) is used. It is computed as follows:

    $$ F_1(S) = {\frac{1}{N}}\sum\limits_{i=1}^N{(S(x^i)-y^i)^2}, $$
    (1)

    with S being the fuzzy system to be evaluated, N the data set size, and \((x^i, y^i)\) the ith input-output pair of the data set. The objective is to minimize this function.

  • Complexity: As complexity measure, we use the number of DNF-type fuzzy rules:

    $$ F_2(S) = |S|. $$
    (2)

    The objective is to minimize this function.
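In code form, assuming a hypothetical fuzzy system object S exposing a predict() method and its rule list:

```python
def f1_mse(S, data):
    # Eq. (1): mean squared error over the data set.
    return sum((S.predict(x) - y) ** 2 for x, y in data) / len(data)

def f2_complexity(S):
    # Eq. (2): number of DNF-type fuzzy rules.
    return len(S.rules)
```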

Since the algorithm is designed to ensure optimal covering, i.e., without lack of completeness or over-generality, we do not need to care about the linguistic complexity (i.e., generality) of each fuzzy rule. In a natural way, the more general the fuzzy rules (i.e., the more labels considered in each rule), the fewer the rules.

This is an advantage of our approach, since it simplifies the design of an interpretability-based objective function. For example, Ishibuchi and Nojima (2007) need to consider a third objective to evaluate the generality degree of "do not care" fuzzy rules in classification, since their algorithm does not maintain this correspondence between number of rules and generality.

4 Experimental results

This section presents the results obtained by the proposed algorithm in five real-world regression problems (i.e., with real-valued inputs and outputs) and compares them with other fuzzy rule learning methods.

4.1 Problems and learning methods

We have considered the following regression problems:

  • The Diabetes problem concerns the study of the factors affecting patterns of insulin-dependent diabetes mellitus in children (Hastie and Tibshirani 1990). The objective is to investigate the dependence of the level of serum C-peptide on the various other factors in order to understand the patterns of residual insulin secretion. The response measurement is the logarithm of C-peptide concentration (pmol/ml) at diagnosis, and the predictor measurements are age and base deficit, a measure of acidity. The data set has been obtained from L. Torgo's website (footnote 1).

  • The Ele1 problem relates to the estimation of the low voltage electrical line length in rural towns (Cordón et al. 1999). We were provided with a sample of real measurements from 495 towns in Spain. The estimation is based on the number of inhabitants of the town and the distance from the center of the town to the three furthest clients. The data set (and the partitions used in this paper) is available at the authors' website (footnote 2).

  • The Laser problem uses a set of laser data from the Santa Fe Institute (SFI) time series prediction and analysis competition (Weigend and Gershenfeld 1993). The original laser data set from the SFI competition consisted of 1000 observations of the fluctuations in a far-infrared laser. The time series has been adapted for regression by taking the four last values as inputs and the next value as output. The data set is available at the KEEL website (footnote 3).

  • The Ele2 problem concerns the estimation of the electrical network maintenance costs of the medium voltage line, based on the sum of the lengths of all streets in the town, the total area of the town, the area occupied by buildings, and the energy supply to the town (Cordón et al. 1999). The information is obtained by sampling a model of the optimal electrical network for a town. The data set (and the partitions used in this paper) is available at the authors' website (see footnote 2).

  • The DEE problem involves predicting the daily average price of TkWhe electricity energy in Spain. The data set contains real values from 2003 about the daily consumption in Spain of energy from hydroelectric, nuclear electric, carbon, fuel, and natural gas. The data set has been obtained from the KEEL website (see footnote 3).

Table 1 collects the main features of each problem, where #InputVar stands for number of input variables, #Exam for total number of examples, and #LingTerms for the number of triangular-shaped uniformly distributed linguistic terms considered for each variable in this experimental analysis.

Table 1 Data sets considered in the experimental analysis

The experiments shown in this paper have been performed with fivefold cross validation. Thus, the data set is divided into five subsets of (approximately) equal size. The algorithm is then applied five times to each problem, each time leaving out one of the subsets from training and using only the omitted subset to compute the test error.

We have considered several learning methods for comparison (all of them use the same inference engine as our proposal, described in Sect. 3.8):

  • Wang and Mendel (Wang and Mendel 1992): It is a simple algorithm that, in spite of not obtaining accurate results, is a traditional reference in the research community. The algorithm has been implemented by us.

  • COR-BWAS (Casillas et al. 2005b): It is an ant colony optimization-based learning algorithm with a good interpretability-accuracy trade-off. We have disabled fuzzy rule selection because with it the algorithm does not guarantee total completeness, so the results could not be directly compared with our proposal.

  • Thrift (Thrift 1991): It is a classic Pittsburgh-style GA-based Mamdani-type fuzzy rule learning method. The mean output value is provided to compute MSE when no fuzzy rules are fired for a training example. The algorithm has been implemented by us.

  • Pittsburgh (Casillas and Martínez-López 2008): It is a Pittsburgh-style GA that also learns DNF-type fuzzy rules. A generational approach and direct replacement are performed, with elitism of the best solution. The fitness is the MSE (Eq. 1). The pool is randomly initialized and binary tournament selection is done. The same variable-length coding scheme used in this paper is considered. Specific genetic operators for this representation are used. As in Thrift, the mean value is provided when no fuzzy rules are fired.

  • Fuzzy-GAP (Sánchez and Couso 2000): This method employs a genetic programming algorithm hybridized with a GA (i.e., GA-P) to learn a fuzzy regression model. The algorithm generates complex fuzzy rules with any combination of conjunctions and/or disjunctions in the antecedent part. The number of fuzzy rules must be fixed a priori. We have used the implementation of this algorithm available in the KEEL software (see footnote 3).

Our algorithm has been run with the following parameter values: 300 iterations, population size 60, crossover probability 0.7, and antecedent and consequent mutation probability 0.2 per chromosome. We have not performed any prior analysis to fix these values, so better results could probably be obtained by tuning them, though we have informally observed that our algorithm is not especially sensitive to any parameter. The same parameter values are also used in the Thrift and Pittsburgh algorithms. For COR-BWAS, we have fixed standard values [mostly the ones recommended in Casillas et al. (2005b)], i.e., 100 iterations, number of ants equal to the number of input subspaces (defined by the Wang–Mendel algorithm), heuristic 1 (max value), ρ = 0.20, α = 2, β = 1, local search depth 10, local search neighbor size equal to the number of ants, mutation probability 0.3, σ = 4, and restart after 10 iterations without improvement. Fuzzy-GAP is run with the default parameter values suggested in the KEEL software (see footnote 3), except for the number of linguistic terms per variable, which is the same as in the rest of the algorithms, and the number of fuzzy rules, which is set to about half the number of rules used by the Wang–Mendel method (note Fuzzy-GAP uses flexible structures based on conjunctions and disjunctions to express the antecedent of the fuzzy rules).

4.2 Obtained results

Table 2 collects the obtained results for each problem, where #R stands for the number of fuzzy rules, and MSE_tra and MSE_tst for the approximation error (Eq. 1) over the training and test data sets, respectively. Since our algorithm performs multiobjective optimization, several solutions are returned in each run. Therefore, we show three representative solutions from the final Pareto-optimal set: the one with the minimum number of rules (i.e., the worst MSE_tra), the median solution, and the one with the maximum number of rules (i.e., the best MSE_tra). \(\bar{x}\) represents the arithmetic mean of each value over the five partitions and σ the corresponding standard deviation. The best mean results for each problem among the analyzed methods are shown in boldface.

Table 2 Results obtained in the different problems

4.3 Analysis

From the obtained results we can observe that the proposed method generates fuzzy models with a good degree of accuracy and interpretability. The most accurate solutions provided by our method (Pitts-DNF_max) obtain the best training errors in four problems. Good test errors are also achieved.

Compared to the rest of the methods, we can observe the following:

  • As regards Wang–Mendel and COR-BWAS, our method not only outperforms them in accuracy but also uses about 50% of the rules they need, thus improving the interpretability of the obtained fuzzy models. To illustrate the capability of Pitts-DNF to compactly represent the rules, Table 3 shows an example of the fuzzy rule sets obtained by COR-BWAS and by the median solution of Pitts-DNF in a data partition of the Ele1 problem. Both solutions offer similar degrees of accuracy; however, the rule set obtained by Pitts-DNF is much more compact, consisting of only seven DNF-type fuzzy rules (identified with a subindex in the consequent linguistic term of each cell of Table 3b). Note also that this set of rules is the optimal one to represent the obtained decision table.

  • Thrift and the Pittsburgh method show serious difficulties in generating compact fuzzy rule sets, a fact that becomes more significant as the complexity of the problem increases. They both tend to generate a huge number of fuzzy rules, even taking into account that the Pittsburgh method uses DNF-type fuzzy rules. This shows how the constraints on the search space imposed by our Pitts-DNF algorithm dramatically improve the search process, which is significantly more accurate and interpretable than these other two methods. This leads us to think our algorithm deals better with the curse of dimensionality.

  • Finally, even considering our median results from the Pareto sets (Pitts-DNF_med), our method outperforms Fuzzy-GAP, generating fuzzy models that are more accurate and have fewer rules.

Table 3 Fuzzy rule set obtained in the first data partition of Ele1 problem by (a) COR-BWAS and (b) Pitts-DNF (median solution)

Analyzing our median results (Pitts-DNF_med), we can observe that the algorithm is able to derive fuzzy models with a very low number of fuzzy rules while preserving a good accuracy degree. It is important to remark that, due to the design of the proposed algorithm, these small fuzzy rule sets still completely cover the training data sets, which is not ensured by the other two Pittsburgh-style algorithms.

Furthermore, Figs. 7, 8, 9, 10, and 11 show the average Pareto fronts and generality degrees obtained by the proposed Pitts-DNF algorithm. The generality degree is computed as the mean number of linguistic terms used per variable in each rule, normalized to lie between 0 (maximum specificity, i.e., all fuzzy rules are Mamdani-style) and 1 (maximum generality, i.e., a single rule covering the whole input space).
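One way to compute this degree under the representation of Sect. 2 (a sketch; the exact normalization may differ):

```python
def generality_degree(rules, n_terms):
    # (|terms| - 1) / (n - 1) is 0 for a single term (Mamdani-style) and
    # 1 when a variable uses every term; assumes >= 2 terms per variable.
    fracs = [(len(ant[i]) - 1) / (n_terms[i] - 1)
             for ant, _ in rules for i in range(len(n_terms))]
    return sum(fracs) / len(fracs)
```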

Fig. 7 Average Pareto front (solid circles) and generality degrees (empty squares) obtained in the Diabetes problem

Fig. 8 Average Pareto front (solid circles) and generality degrees (empty squares) obtained in the Ele1 problem

Fig. 9 Average Pareto front (solid circles) and generality degrees (empty squares) obtained in the Laser problem

Fig. 10 Average Pareto front (solid circles) and generality degrees (empty squares) obtained in the Ele2 problem

Fig. 11 Average Pareto front (solid circles) and generality degrees (empty squares) obtained in the DEE problem

The generality degrees are plotted to show the correspondence between number of rules and generality maintained by the algorithm without needing to consider this third objective. Naturally, as the number of rules increases, the rules become more specific, so the mean generality degree decreases. As can be observed, the algorithm generates a large range of solutions with different interpretability-accuracy balances.

5 Self-analysis: strengths, weaknesses, opportunities, and threats

An honest self-analysis of the proposed algorithm is given in Table 4, where strengths represent the main advantages of Pitts-DNF, weaknesses show its drawbacks, opportunities outline some suggested further lines of investigation, and threats include some alternative approaches that could compete with our proposal.

Table 4 SWOT analysis of Pitts-DNF

Pitts-DNF has some important strengths. Firstly, the experiments show that the algorithm's performance, both in interpretability and accuracy, is competitive with other approaches. Moreover, it uses a flexible fuzzy rule structure for better knowledge synthesis, which increases the interpretability. Besides, it generates consistent fuzzy rule sets, which improves the interpretability since the rules do not interfere with one another. It also generates both complete and non-over-general fuzzy rule sets; therefore, the expert can be sure that the provided fuzzy rule sets always cover the whole training data set and do not cover areas where there are no training data, which helps to gain additional insight into the data set. It generates compact fuzzy rule sets without redundancies. Finally, it performs multiobjective optimization, so a large range of different interpretability-accuracy trade-offs is returned.

The main weaknesses of the method are the following. Firstly, it only works on regression problems. Although it can easily be adapted to classification, obtaining a competitive classification algorithm may involve more effort than a simple adaptation. Secondly, as discussed in Sect. 2, constraining the fuzzy rule sets to be consistent and non-over-general may imply generating a higher number of rules in some problems. Finally, even though the algorithm is better prepared for dealing with high-dimensional problems than other approaches, it still needs to be improved to properly generalize the fuzzy rules.

We also want to mention some possible threats to Pitts-DNF. On the one hand, approaches based on guiding the search process by objective functions that penalize inconsistencies of the fuzzy rule set (Wang et al. 2005), though they seem to be worse approaches since infeasible solutions are explored, can obtain good solutions in practice due to their search flexibility. On the other hand, consistency may be faced by applying a two-stage sequential approach: generation of Mamdani-type fuzzy rules (e.g., by Wang and Mendel 1992) plus an a posteriori rule grouping process (e.g., by Carmona and Castro 2005). However, we think this two-stage approach, although useful for finding accurate solutions, is not able to balance accuracy with linguistic generalization capability since it keeps the original decision table unaltered.

As further work, we intend to adapt the algorithm to classification problems (where the output is a class instead of a real value) and to learning Takagi–Sugeno fuzzy rules, to combine Pitts-DNF with a membership function parameter learning/tuning process (e.g., Casillas et al. 2005a), to study other solutions for avoiding over-generality without leaving uncovered regions (e.g., by linguistic extrapolation), and to analyze other fuzzy rule structures even more flexible than the DNF type for a more compact knowledge representation (such as using more relational operators in the antecedent or local exceptions to general rules).