1 Introduction

If a conventional fuzzy system (FS) is applied to large-scale problems (i.e., those with a high number of input variables), the number of rules grows exponentially with respect to the number of inputs (Raju et al. 1991; Joo and Lee 1999). Indeed, with \(n\) variables and \(k\) linguistic terms per variable, up to \(k^n\) rules are required to build a complete Mamdani-type fuzzy system, and consequently the accuracy-interpretability balance is broken. This problem is better known as the “curse of dimensionality.”

In order to solve it, several approaches have been suggested, such as variable selection (Chiu 1996; Hong and Chen 1999; Jin 2000; González and Pérez 2001; Hong and Harris 2001; Lee et al. 2001; Xiong and Funk 2006; Tan et al. 2008) and rule set reduction (Ishibuchi et al. 1995; Taniguchi et al. 2001; Casillas et al. 2005; Alcalá et al. 2006, 2007, 2011b; Gacto et al. 2009). Nevertheless, when the number of variables increases considerably, this kind of reduction may not be enough. A different approach to deal with this problem is hierarchical fuzzy systems (HFSs). An HFS is made up of a set of fuzzy subsystems or modules, linked in such a way that the output of one module is an input of other ones.

We may distinguish between three types of modules (Fig. 1):

  • SISO (Single Input, Single Output): It has one input and one output (Fig. 1a).

  • MISO (Multiple Inputs, Single Output): It has several inputs and a single output (Wang et al. 2006; Chen et al. 2007; Zajaczkowski and Verma 2012) (Fig. 1b). Modules of this kind are often called Fuzzy Logic Units (FLUs). A special case of FLU with two inputs and one output is also frequently found in the literature (Joo and Lee 1999, 2002; Shimojima et al. 1995; Wang 1998; Lee et al. 2003; Gaweda and Scherer 2004; Jelleli and Alimi 2005, 2010; Zhang and Zhang 2006; Cheong 2007; Aja-Fernández and Alberola-López 2008).

  • MIMO (Multiple Inputs, Multiple Outputs): They have several inputs and outputs (Salgado 2008) (Fig. 1c).

Fig. 1
figure 1

Types of modules

Apart from distinguishing several kinds of modules, it is also possible to find different types of hierarchical structures. There are different classifications (Gaweda and Scherer 2004; Duan and Chung 2002; Torra 2002), though the most general one is the following (Aja-Fernández and Alberola-López 2008):

  • Serial HFS (SHFS): The inputs of a module are the output of the previous one, along with external variables (Lee et al. 2003; Cheong 2007; Wang 1999; Aja-Fernández and Alberola-López 2008; Zeng et al. 2008; Benítez and Casillas 2009; Zajaczkowski and Verma 2012) (Fig. 2a).

  • Parallel HFS (PHFS): This system is organized in layers (Fig. 2b). The first one is made up of a set of modules receiving the input variables. Each variable is used as an input only in a single module. The outputs of the modules in the first layer are the inputs of the modules which constitute the next layer, and so on. An aggregate operation might also exist in order to combine the outputs of one layer (Holve 1998; Joo and Lee 1999; Jelleli and Alimi 2005, 2010; Lee et al. 2003; Salgado 2008; Zhang and Zhang 2006; Aja-Fernández and Alberola-López 2008; Joo and Sudkamp 2009).

  • Hybrid HFS (HHFS): This type of HFS is a mixture of the two previous ones (Joo and Lee 2002; Wang et al. 2006; Chen et al. 2004, 2007; Aja-Fernández and Alberola-López 2008) (Fig. 2c).

Figure 2 shows this classification of hierarchical structures.

Fig. 2
figure 2

Types of hierarchical structures

We want to emphasize that each module of an HFS deals with a subset of variables; in other words, it does not handle the problem as a whole. Thanks to this decomposition of the FS, the complexity of each module is significantly reduced: its rules are simpler than those of a single-module FS because each module involves fewer variables.

Several approaches have studied the best way to deal with the variables used to link modules. The usual way is to generate new linking variables without linguistic meaning (Lee et al. 2003; Cheong 2007; Wang 1999; Zeng et al. 2008; Holve 1998; Joo and Lee 1999, 2002; Jelleli and Alimi 2005, 2010; Salgado 2008; Zhang and Zhang 2006; Joo and Sudkamp 2009; Wang et al. 2006; Chen et al. 2004, 2007; Aja-Fernández and Alberola-López 2008; Zajaczkowski and Verma 2012). Gaweda and Scherer (2004) approach this task with fuzzy numbers. In that work, the term mutual subsethood denotes a similarity measure based simultaneously on the degree of inclusion of \(A^{\prime }\) in \(A\) and on the degree of inclusion of \(A\) in \(A^{\prime }\), with \(A\) being a fuzzy set and \(A^{\prime }\) a set of fuzzy numbers.

Jelleli and Alimi (2005) propose an iterative algorithm which linearly associates the inputs in pairs (\(x_{L,i}, x_{L,i+1}\)), building a first layer with a set of FLUs, with \(L\) being the \(L\)th layer and \(i\) the \(i\)th FLU. Each FLU generates a linking output, \(y_{L,i}\), and these outputs are associated in pairs (\(y_{L,i}, y_{L,i+1}\)). The non-dominant linking output, \(d_{L,i}\), is selected from each pair. Next, these variables become the inputs of the next layer. This procedure is repeated in a loop until all hierarchical levels have been swept.

Therefore, we distinguish three types of rule bases: (1) the usual rule base of an FS (Fig. 3a), (2) an HFS with artificial linking variables (the linking variables are new variables created by means of a mathematical function; in Fig. 3b, variables \(y_1\) and \(y_2\)), and (3) an HFS with natural linking variables (the linking variables are the problem’s own variables; in Fig. 3c, variables \(X_2\) and \(X_5\)). Consider the FSs in Fig. 3 with five variables and three linguistic terms per variable: the FS in Fig. 3a requires up to \(3^5 = 243\) rules to build a complete FS; in Fig. 3b there are 45 rules (\(3^3\) rules in \(M_1\), \(3^2\) rules in \(M_2\), and \(3^2\) rules in \(M_3\)); and in Fig. 3c there are 21 rules (\(3^2\) rules in \(M_1\), \(3^2\) rules in \(M_2\), and \(3^1\) rules in \(M_3\)). The lowest total number of rules is thus obtained in Fig. 3c. Comparing the FS in Fig. 3c with the FS in Fig. 3a, distributing the variables into modules decreases the total number of rules and, moreover, reduces the complexity of each rule. Comparing the FS in Fig. 3c with the FS in Fig. 3b, the number of rules in Fig. 3c is lower because no artificial linking variables are generated.

Fig. 3
figure 3

Types of rule bases in a fuzzy system
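As a quick check of these counts, the following sketch (our own illustration, not part of the GSHFS implementation) computes the size of a complete flat rule base and of a hierarchical decomposition given the module sizes of Fig. 3:

```python
# Illustrative sketch: rule counts of a complete flat rule base versus a
# hierarchical decomposition (module sizes taken from Fig. 3).
def complete_rules(n_vars, n_terms):
    """Rules of a complete flat Mamdani rule base: k^n."""
    return n_terms ** n_vars

def hierarchical_rules(vars_per_module, n_terms):
    """Sum of complete rule bases over the modules: sum of k^(n_i)."""
    return sum(n_terms ** n for n in vars_per_module)

k = 3                                         # linguistic terms per variable
print(complete_rules(5, k))                   # Fig. 3a: 3^5 = 243 rules
print(hierarchical_rules([3, 2, 2], k))       # Fig. 3b: 27 + 9 + 9 = 45 rules
print(hierarchical_rules([2, 2, 1], k))       # Fig. 3c: 9 + 9 + 3 = 21 rules
```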

Therefore, in this paper, we propose using the problem’s own variables to communicate between modules. A previous version of our proposal was published in Benítez and Casillas (2009). Each module has a set of Mamdani-type rules, which infer the linking variables. The generated FS is more interpretable because it uses Mamdani-type rules and the natural linking variables have semantic meaning. As far as we know, this is the first time an algorithm combines an HFS with Mamdani-type rules. Besides, we use a robust approach with a learning process based on evolutionary computation, known as genetic fuzzy systems (GFSs). A GFS combines population-based search with an interpretable linguistic representation (Casillas and Carse 2009; Nojima et al. 2011).

We propose a multi-objective genetic algorithm (Multi-Objective GA) with the following features:

  • It allows large-scale problems to be dealt with, thanks to its hierarchical structure and the inclusion of variable selection.

  • It ensures good interpretability because the algorithm uses the problem’s own variables to link modules. The hierarchical structure decreases the number of rules, and each rule is simpler because it involves fewer variables. Moreover, the algorithm deals with Mamdani-type rules for better understanding.

  • It obtains different trade-offs between interpretability and accuracy, thanks to the multi-objective algorithm, thus generating a set of solutions with different trade-offs.

Most HFS proposals are focused on improving the accuracy. However, the aim of our proposal is to improve the interpretability by means of a precise and compact set of rules, while the accuracy is maintained or improved.

The paper is organized as follows: Sect. 2 summarizes the most relevant algorithms in this area; Sect. 3 describes the proposed algorithm: structure, coding scheme, initialization, genetic operators, inference mechanism, rule base learning, multi-objective approach, and objective functions; Sect. 4 presents the empirical study: the problems used, the results obtained, and an analysis of the algorithm’s behavior; Sect. 5 provides a comparison between our proposal and the state of the art in the area of GFSs; Sect. 6 discusses conclusions and future work.

2 Related works

In the literature, there are some proposals dealing with HFSs without learning the hierarchical structure (Wang 1998, 1999; Joo and Lee 1999, 2002; Gaweda and Scherer 2004; Lee et al. 2003; Zhang and Zhang 2006; Cheong 2007; Zajaczkowski and Verma 2012). These cases are outside the scope of our interest. Indeed, the design of an optimal hierarchical structure is as important as the accuracy of each module, because the more modules the HFS has, the more error is propagated through the system. This section reviews the main existing approaches that learn hierarchical structures. The description is sorted in chronological order and does not mention approaches that use hierarchical structures but do not learn them.

The algorithm of Shimojima et al. (1995) learns a hybrid hierarchical structure with a GA and a back-propagation method. It uses exponential radial basis membership functions (MFs), which are tuned with supervised learning by means of the gradient descent method. The rules of the HHFS are of Takagi–Sugeno–Kang (TSK) type.

Chen et al. (2004) designed an algorithm capable of learning HHFS structures by means of ant programming, performing a fine-tuning of the rule parameters with the Particle Swarm Optimization (PSO) algorithm. It codes the HFS in a tree structure and uses TSK rules with exponential MFs.

Wang et al. (2006) used the gradient descent method to learn the HHFS structure. It uses Mamdani-type rules and triangular MFs.

Chen et al. (2007) improved the version of Chen et al. (2004): it learns HHFS structures by means of Probabilistic Incremental Program Evolution (PIPE) and performs a fine-tuning of the rule parameters using evolutionary programming. The shapes of the MFs are learned with adaptive techniques, and TSK rules are used.

Aja-Fernández and Alberola-López (2008) proposed a new way of learning the rule base by means of a matrix procedure called Fast Inference using Transition Matrices (FITM), a methodology to perform inference with the standard additive model. It uses Mamdani-type rules, and this rule base learning can be applied to SHFSs, PHFSs, and HHFSs.

Joo and Sudkamp (2009) designed a two-layer PHFS. It uses Mamdani-type rules in the first layer, TSK-type rules in the second layer, and applies a rule reduction establishing a dependence relationship among first-layer FSs. The MFs are triangular. This approach is described in detail in Appendix 3.

The learning of HFSs remains an open problem. For example, Jelleli and Alimi (2010) have recently published a learning algorithm for PHFSs. The variables are grouped into modules by learning the fuzzy behavior from data. Later, a decision maker chooses the modules of the next layer according to a weight. The MFs are Gaussian and TSK rules are used.

Table 1 shows a summary of the approaches with hierarchical structure learning. Notice that in all the reviewed algorithms, new variables are generated to link the modules. However, as stated above, our algorithm will use natural variables, thus ensuring better interpretability.

 

Table 1 Summary of the main existing approaches with hierarchical structure learning

 

3 GSHFS algorithm

The algorithm, called Genetic SHFS (GSHFS), uses a multi-objective genetic algorithm where there are two objectives to minimize: the mean squared error (MSE) and the maximum number of rules. Our algorithm has three specific genetic operators: one crossover operator and two mutation operators. Its scheme is shown in Algorithm 1. The following subsections detail the different components of the algorithm.

figure a

3.1 Coding scheme

We will distinguish between two kinds of input variables: (1) endogenous variable: it is a variable linking two modules, (2) exogenous variable: it is a system’s external variable.

In our algorithm, each individual represents an SHFS. The variable-length coding scheme consists of a gene concatenation. Each gene has two fields: a variable index and a flag. The flag takes binary values (‘0’ if it is an exogenous variable and ‘1’ if it is an endogenous variable).

Fig. 4
figure 4

Example of coding scheme

A variable with ‘1’ in the flag field is an endogenous variable, which means that a module exists whose output is that variable; this variable is an input of the next module. All the variables used are variables of the problem: the algorithm decides whether a variable of the problem will be endogenous or exogenous, but it does not create new variables. The highest hierarchy level occurs when all modules are SISO; in this case, all variables are endogenous except the first one, and the number of levels equals the number of input variables.

Figure 4 shows an example of the coding scheme. The problem in this example has ten input variables: the SHFS uses four of them as exogenous variables (\(X_{10}, X_{3}, X_{5}\), and \(X_{6}\)) and two as endogenous variables (\(X_8\) and \(X_7\)). The remaining four variables (\(X_{1}, X_{2}, X_{4}\), and \(X_9\)) do not appear.

3.2 Initialization

The algorithm generates an initial population randomly. The variable index and the endogenous/exogenous flag are randomly chosen for each gene. The only restrictions are that a variable cannot be repeated and that the flag of the first gene cannot be ‘1’ (i.e., the first variable has to be exogenous). The algorithm is constrained to a maximum number of modules because a high number of modules increases the propagated error considerably; moreover, this constraint also reduces the search space.
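A minimal sketch of this coding scheme and of the random initialization is given below. The representation (a list of (variable index, flag) pairs) and all names are our own simplification, not the paper's actual data structures, and the maximum-module constraint is handled only roughly:

```python
import random

# Each gene is a (variable_index, flag) pair: flag 0 = exogenous, 1 = endogenous
# (an endogenous gene indicates a module whose output is that variable).
def random_individual(n_vars, max_modules):
    n_used = random.randint(2, n_vars)                 # how many variables to use
    variables = random.sample(range(n_vars), n_used)   # variables cannot be repeated
    genes, modules = [], 1                             # the last module (output Y) always exists
    for position, v in enumerate(variables):
        flag = 0 if position == 0 else random.randint(0, 1)  # first gene must be exogenous
        if flag == 1 and modules >= max_modules:
            flag = 0                                   # respect the maximum number of modules
        modules += flag
        genes.append((v, flag))
    return genes

print(random_individual(n_vars=10, max_modules=4))
```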

3.3 Crossover operator

The crossover operator is applied with a given probability between mated parents and exploits the search space. When the crossover operator is applied to two parents, \(P_1\) and \(P_2\), three cases may be distinguished: (1) both parents have several modules, (2) parent \(P_1\) has several modules and parent \(P_2\) has one module (or the reverse case), and (3) both parents have one module. A parent-centric approach is used in some cases, where the offspring mainly inherits the information of one parent and takes the secondary parent in order to add diversity. Figure 5 shows the cases by means of decision trees. The crossover operator controls the maximum number of modules in order to reduce the search space: it creates a new solution only if the cross between two parents generates a number of modules less than or equal to the maximum number of modules.

Fig. 5
figure 5

Decision trees of the crossover operator

The crossover is applied to individuals depending on the types of variables that the two parents have in common. Each case has an ordered priority list, and the crossover corresponding to the first matching priority is applied.

3.3.1 Both parents, \(P_1\) and \(P_2\), have several modules

 

  • Priority 1. Endogenous–Endogenous case If \(P_1\) has common endogenous variables with \(P_2\), a common endogenous variable is selected at random; this variable is the crossing point. The offspring \(O_1\) is generated centered on \(P_1\): it inherits from \(P_2\) the variables and modules from the beginning of the chromosome up to (but excluding) the crossing point, the remaining part is taken from \(P_1\), and repeated variables are removed from the part inherited from \(P_2\). The offspring \(O_2\) is generated in the same way but centered on \(P_2\). Figure 6a illustrates an example of this case.

  • Priority 2. Endogenous–Exogenous case If \(P_1\) has endogenous variables which are exogenous in \(P_2\), one of these common variables is randomly selected as the crossing point. The offspring \(O_1\) is generated centered on \(P_1\): the common endogenous variable inherited from \(P_1\) is converted into an exogenous one, and the modules preceding this variable are removed. This kind of alteration produces a less aggressive change because the offspring resemble their parents.

  • Priority 3. Exogenous–Endogenous case If \(P_1\) has exogenous variables which are endogenous in \(P_2\), one of those variables is randomly selected as the crossing point. The offspring \(O_1\) is created centered on \(P_1\), but excluding the crossing point. The first part of \(O_1\) is inherited from the first part of \(P_2\) (from the first gene to the crossing point). The repeated variables are removed in \(O_1\) from the part inherited from \(P_2\).

  • Priority 4. Exogenous–Exogenous case If \(P_1\) has a set of exogenous variables in common with exogenous variables in \(P_2\), the offspring \(O_1\) is generated as a copy of \(P_1\) with one change: a common exogenous variable between \(P_1\) and \(P_2\) is randomly chosen. In \(P_2\), this variable (along with other variables) is the input of a module that generates an endogenous variable. That endogenous variable of \(P_2\) replaces, in \(O_1\), the randomly selected exogenous variable common to \(P_1\) and \(P_2\). Afterwards, the repeated variables of \(O_1\) in the part inherited from \(P_1\) are removed. If this case also holds for \(P_2\), the offspring \(O_2\) is generated with the same procedure but centered on \(P_2\). Figure 6b shows an example of this case; the generated SHFS offspring has four modules. Notice that the common exogenous variables of \(P_1\) are looked for in \(P_2\) excluding the exogenous variables of the module with output \(Y\) in \(P_2\), because, if that module were included and one of its exogenous variables were selected, there would be no endogenous variable to use as output, since it coincides with the output of the SHFS.

  • Priority 5. Different variables case If the variables in \(P_1\) and \(P_2\) are different, an endogenous variable from \(P_1\) and another from \(P_2\) are chosen at random as crossing points. The offspring are generated in the same way as in priority 1.

Fig. 6
figure 6

Crossover operator

3.3.2 Parent \(P_1\) has several modules and parent \(P_2\) has one module

 

  • Priority 1. Endogenous–Exogenous case If \(P_1\) has common endogenous variables with the exogenous variables of \(P_2\), the offspring \(O_1\) is generated as in the Endogenous–Exogenous case when \(P_1\) and \(P_2\) have several modules. Figure 6c depicts this case.

  • Priority 2. Exogenous–Exogenous case If \(P_1\) and \(P_2\) have common exogenous variables, the offspring \(O_1\) centered on \(P_1\) is created as follows. Firstly, \(O_1\) is generated as a copy of \(P_1\). Then, an exogenous common variable of \(P_1\) and \(P_2\) is randomly chosen and moved to module with output \(Y\). This is shown in Fig. 6d.

  • Priority 3. Different variables case In this case, \(O_1\) is a copy of \(P_1\) and later, an exogenous variable of \(P_2\) is selected at random and inserted into the module of \(O_1\) with the output \(Y\).

3.3.3 Parent \(P_1\) has one module and parent \(P_2\) has several modules

 

  • Priority 1. Exogenous–Endogenous case If \(P_1\) has exogenous variables which are endogenous in \(P_2\), the offspring \(O_1\) is generated as in the Exogenous–Endogenous case when \(P_1\) and \(P_2\) have several modules. Figure 6e illustrates this.

  • Priority 2. Exogenous–Exogenous case If \(P_1\) has a set of exogenous variables in common with exogenous variables in \(P_2\), the offspring \(O_1\) is generated as in the Exogenous–Exogenous case when \(P_1\) and \(P_2\) have several modules.

  • Priority 3. Different variables case If the variables in \(P_1\) and \(P_2\) are different, an exogenous variable from \(P_1\) and an endogenous variable from \(P_2\) are chosen at random as crossing points. The offspring are generated in the same way as in priority 1.

3.3.4 Both parents, \(P_1\) and \(P_2\), have one module

In this case, the offspring \(O_1\) and \(O_2\) are generated as follows (Fig. 6f shows this case): (1) the common exogenous variables of \(P_1\) and \(P_2\) are inserted into both offspring; (2) the rest of the variables (those not common to both parents) are inserted into a set \(V\), with \(n=|V|\) being its size; (3) a number \(r\) is randomly generated: if there are no common variables between \(P_1\) and \(P_2\), then \(r \in \{1,\dots,n-1\}\); otherwise, \(r \in \{0,\dots,n\}\); (4) \(r\) variables are randomly taken out of \(V\) and inserted into \(O_1\); (5) the rest of the variables are inserted into \(O_2\).
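A minimal sketch of this last case is shown below, representing each single-module parent simply as a set of exogenous variable indices (a simplification of ours):

```python
import random

# Crossover for two single-module parents (Sect. 3.3.4): offspring share the
# common variables; the remaining ones are split at random between O1 and O2.
def crossover_single_module(p1, p2):
    common = p1 & p2
    rest = list((p1 | p2) - common)
    random.shuffle(rest)
    n = len(rest)
    # without common variables, each offspring must receive at least one variable
    r = random.randint(1, n - 1) if not common else random.randint(0, n)
    return common | set(rest[:r]), common | set(rest[r:])

o1, o2 = crossover_single_module({1, 3, 5}, {3, 6})
print(o1, o2)
```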

3.4 Mutation operator

The mutation operator makes local changes to the chromosome. Like the crossover operator, it controls the maximum number of modules. Two kinds of mutation have been designed (Fig. 7 shows the decision tree of this operator):

Fig. 7
figure 7

Decision tree of the mutation operator

3.4.1 Exchange mutation

This operator exchanges variables within a module: it chooses an exogenous variable at random in a module and exchanges it with the endogenous variable of the module. Figure 8 shows an example.

Fig. 8
figure 8

Example of exchange mutation
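The following sketch illustrates the exchange mutation on a single module, assuming (our simplification) that a module is represented by its list of exogenous inputs and its output variable:

```python
import random

# Exchange mutation: swap a randomly chosen exogenous input of the module with
# the module's endogenous output variable.
def exchange_mutation(module):
    if module["inputs"]:                       # nothing to exchange otherwise
        i = random.randrange(len(module["inputs"]))
        module["inputs"][i], module["output"] = module["output"], module["inputs"][i]
    return module

print(exchange_mutation({"inputs": [2, 7, 4], "output": 9}))
```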

3.4.2 Insertion mutation

The mutation operator distinguishes between used and unused variables in the individual. Accordingly, by means of a probabilistic decision (Algorithm 2 shows the scheme), the mutation operator chooses between inserting an unused variable into a module, removing a used variable, or moving a used variable to another module.

figure b
Fig. 9
figure 9

Insertion mutation operator

  • The insertion of an unused variable proceeds as follows. Given an unused variable \(v\), a module \(m\) is chosen randomly. Next, the operator decides with a probability of 0.5 whether \(v\) is inserted into \(m\). If so, \(v\) is inserted as an exogenous variable. Otherwise, if \(m\) has at least one exogenous variable as an input, one of them, \(e\), is selected at random, and the mutation operator creates a new SISO module previous to \(m\) whose input is \(e\) and whose output is \(v\); Fig. 9a shows an example. In this case, \(v\) becomes endogenous and an input of module \(m\). If \(m\) has no exogenous variables (Fig. 9b), the output of \(m\) is converted into a new exogenous variable of \(m\) and \(v\) becomes the new output of \(m\). The mutation is applied only if the solution has fewer modules than the maximum number of modules.

  • To remove a used variable, first a module \(m\) of the SHFS is randomly chosen. If \(m\) has exogenous variables, one of them is removed at random (Fig. 9c). Otherwise, the module \(m\) is removed, thus linking the output of the module before \(m\) with the input of the module after \(m\) (Fig. 9d). A sketch of this removal branch is given after Algorithms 3–5 below.

  • To move a used variable from one module to another, two modules of the chromosome are chosen at random: a source module \(m_1\), from which a variable is removed, and a destination module \(m_2\), into which the variable is inserted. An exogenous variable of \(m_1\) is randomly removed and inserted as an exogenous variable of \(m_2\). If \(m_1\) has no exogenous variables, and consequently has one input and one output (a SISO module), its endogenous variable is chosen; in this situation, \(m_1\) disappears and the chosen variable is inserted as an exogenous variable into \(m_2\). If \(m_1 = m_2\), an exogenous variable \(v\) is randomly selected and a new module is created after \(m_1/m_2\): the input of this new module (and therefore the output of module \(m_1/m_2\)) is \(v\), while the output of the new module is the old output of \(m_1/m_2\). In this case, the mutation is applied only if the solution has fewer modules than the maximum number of modules.

Algorithms 3, 4, and 5 show the behavior of this mutation operator.

figure c
figure d
figure e
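As an illustration of the removal branch described above (Algorithms 3–5 are not reproduced in the text), the following self-contained sketch models an SHFS as an ordered list of modules, where the output of module i is an implicit extra input of module i+1 and the last module produces the system output Y; the representation and names are assumptions of ours:

```python
import random

def remove_used_variable(modules):
    m = random.randrange(len(modules))
    if modules[m]["inputs"]:
        # module m has exogenous inputs: remove one of them at random (Fig. 9c)
        modules[m]["inputs"].pop(random.randrange(len(modules[m]["inputs"])))
    elif m < len(modules) - 1:
        # no exogenous inputs: drop the whole module, which links the output of
        # the previous module with the input of the following one (Fig. 9d)
        modules.pop(m)
    return modules

shfs = [{"inputs": [10, 3, 5], "output": 8},    # module outputs are endogenous variables
        {"inputs": [],         "output": 7},
        {"inputs": [6],        "output": "Y"}]  # last module produces the system output
print(remove_used_variable(shfs))
```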

3.5 Inference mechanism

We consider the Max–Min inference scheme (i.e., the T-conorm of the maximum as aggregation and the T-norm of the minimum as relational operator), the T-norm of the minimum as conjunction, and center-of-gravity defuzzification. The inference in the SHFS proceeds from the leftmost modules to the rightmost ones. When a module infers its output, it produces an error that is carried through the SHFS. For this reason, it is not good to have a high number of modules in an HFS, because the carried error will increase considerably. We must design the system carefully in order to obtain a low propagation of the error, because the more modules are considered, the more uncertainty we get (Maeda 1996).
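A compact sketch of this inference scheme for a single module is given below (triangular MFs, minimum as conjunction and relational operator, maximum as aggregation, center of gravity on a discretized output domain); the rule encoding and function names are our own:

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular MF with support [a, c] and core b (a < b < c assumed)."""
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

def infer(rules, inputs, y_grid):
    """rules: list of (antecedent MF parameters per input, consequent MF parameters)."""
    aggregated = np.zeros_like(y_grid)
    for antecedents, consequent in rules:
        # minimum as conjunction of the antecedent matching degrees
        firing = min(tri(x, *mf) for x, mf in zip(inputs, antecedents))
        # minimum as relational operator, maximum as aggregation
        aggregated = np.maximum(aggregated, np.minimum(firing, tri(y_grid, *consequent)))
    # center-of-gravity defuzzification
    return float(np.sum(aggregated * y_grid) / (np.sum(aggregated) + 1e-12))

y = np.linspace(0.0, 1.0, 101)
rules = [([(0.0, 0.25, 0.5), (0.25, 0.5, 0.75)], (0.5, 0.75, 1.0)),
         ([(0.5, 0.75, 1.0), (0.25, 0.5, 0.75)], (0.0, 0.25, 0.5))]
print(infer(rules, inputs=[0.2, 0.6], y_grid=y))
```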

3.6 Fuzzy rule set learning

The fuzzy rule set learning proceeds as follows: each module learns its Mamdani-type fuzzy rule set by the Wang–Mendel (WM) method (Wang and Mendel 1992) (see Appendix 1). The module’s exogenous variables are taken from the data set, whereas the endogenous variables of the SHFS are inferred by their respective modules from the modules’ input variables. Figure 10 shows a graphic example: the first module receives three exogenous variables from the data set and infers an endogenous natural variable; the value of this inferred endogenous variable (which is not taken from the data set) is the input of the next module, along with other exogenous variables taken from the data set, and so on. The obtained system is evaluated with the training data examples, choosing the corresponding values of the exogenous variables in each module. The aim is to build a closed system in order to ensure good interpretability and reduce the error.

Fig. 10
figure 10

Example of inference process
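The sketch below summarizes the WM rule-generation step that each module applies (see Appendix 1 for the actual method); uniform strong triangular partitions are assumed and the helper names are ours:

```python
def memberships(value, cores):
    width = cores[1] - cores[0]                       # uniform partition spacing
    return [max(0.0, 1.0 - abs(value - c) / width) for c in cores]

def wang_mendel(examples, input_cores, output_cores):
    """examples: iterable of (input_vector, output); returns {antecedent: consequent}."""
    best = {}                                         # antecedent -> (degree, consequent label)
    for x, y in examples:
        labels, degree = [], 1.0
        for v, cores in zip(x, input_cores):
            mus = memberships(v, cores)
            l = max(range(len(cores)), key=mus.__getitem__)   # best-matching label
            labels.append(l)
            degree *= mus[l]
        mus_y = memberships(y, output_cores)
        cons = max(range(len(output_cores)), key=mus_y.__getitem__)
        ant = tuple(labels)
        # keep only the highest-degree rule per antecedent to avoid inconsistency
        if ant not in best or degree * mus_y[cons] > best[ant][0]:
            best[ant] = (degree * mus_y[cons], cons)
    return {ant: cons for ant, (_, cons) in best.items()}

cores3 = [0.0, 0.5, 1.0]                              # three uniform labels on [0, 1]
data = [([0.1, 0.9], 0.2), ([0.15, 0.8], 0.3), ([0.6, 0.4], 0.9)]
print(wang_mendel(data, [cores3, cores3], cores3))
```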

3.7 Multi-objective approach

A generational approach with the multi-objective elitist replacement strategy of NSGA-II (Deb et al. 2002) is used. Crowding distance in the objective function space is considered. Binary tournament selection based on the non-domination rank (or on the crowding distance when both solutions belong to the same front) is applied. The crowding distance is normalized for each objective according to the extreme values of the solutions contained in the analyzed front.

We use NSGA-II because of its efficient and effective performance. A further analysis of the best multi-objective approach is beyond the scope of this paper.
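The selection step can be pictured with the following small sketch (rank and crowding values would come from the NSGA-II sorting; the dictionary representation is ours):

```python
import random

def binary_tournament(population):
    a, b = random.sample(population, 2)
    if a["rank"] != b["rank"]:
        return a if a["rank"] < b["rank"] else b       # lower non-domination rank wins
    return a if a["crowding"] >= b["crowding"] else b  # same front: larger crowding wins

pop = [{"id": 1, "rank": 0, "crowding": 0.7},
       {"id": 2, "rank": 0, "crowding": float("inf")},
       {"id": 3, "rank": 1, "crowding": 0.4}]
print(binary_tournament(pop)["id"])
```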

3.8 Objective functions

We consider two objective functions to minimize:

  • Mean squared error (MSE) The lower the MSE, the better the accuracy. It is computed as:

    $$F_1(S)=\frac{1}{N}\sum\limits _{{i=1}}^{N}(S(x^i)-y^i)^2$$
    (1)

    with S being the SHFS to be evaluated, N the data set size and \((x^i, y^i)\) the ith input–output pair of the data set. The objective is to minimize this function to get good accuracy.

  • Maximum number of rules The lower the maximum number of rules, the more interpretable the system. It is computed as follows:

    $$F_2(S)=\sum\limits _{{i=1}}^{M}k_i^{\,n_i}$$
    (2)

    with \(M\) being the number of modules of the SHFS, and \(k_i\) and \(n_i\) the number of labels and variables of the \(i\)th module, respectively. Compared to considering, e.g., the final number of rules, this objective has the main advantage of depending exclusively on the “static” structure of the SHFS, that is, the number of modules, the number of variables per module, and the number of linguistic terms used for each variable. Therefore, the objective is not influenced by the learning algorithm used to generate the fuzzy rule set of each module. This is especially relevant when the MFs are tuned, as two solutions with exactly the same hierarchical structure could otherwise generate different numbers of rules. Besides, although the number of modules or the number of variables per module is also a notable criterion to evaluate the compactness of the SHFS, the maximum number of rules is a more intuitive criterion to assess interpretability because it combines both criteria in a single objective. If the maximum number of rules is considered, the number of variables and modules will be controlled automatically: as the maximum number of rules decreases, the number of modules increases and the number of variables decreases. We analyze this behavior in Sect. 4.4. A small computational sketch of both objectives is given after this list.
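The sketch announced above illustrates both objectives; the prediction function passed to the MSE and the per-module counts are placeholders of ours:

```python
def mse(predict, data):
    """Eq. (1): mean squared error of a prediction function over a data set."""
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)

def max_rules(modules):
    """Eq. (2): sum over the modules of k_i^n_i, the maximum possible rules per module."""
    return sum(m["labels"] ** m["n_inputs"] for m in modules)

print(mse(lambda x: sum(x) / len(x), [([0.2, 0.4], 0.3), ([0.8, 0.6], 0.5)]))  # 0.02
print(max_rules([{"labels": 3, "n_inputs": 2},
                 {"labels": 3, "n_inputs": 2},
                 {"labels": 3, "n_inputs": 1}]))                               # 9 + 9 + 3 = 21
```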

4 Experimental results

4.1 Problems

We have considered 14 regression problems with moderate and high numbers of real-valued variables. Table 2 collects the main features of each problem, where #InputVar stands for the number of input variables, #Exam for the total number of examples, #LingTerms for the number of triangular-shaped uniformly distributed linguistic terms considered for each variable in this experimental analysis, and Available indicates the web site where the data sets are available. We set three labels per variable for the large-scale problems because, if the number of inputs and labels is high, the rule set grows considerably and the FS has lower interpretability. For this reason, a higher number of labels is used in the problems with a lower number of inputs.

Table 2 Data sets considered in the experimental analysis
Fig. 11
figure 11

Average Pareto Front in several problems

Our proposal addresses regression problems. The data set with the highest number of inputs we have found on the considered web sites (Table 2, column Available) is “Ailerons”, with 40 real-valued variables. A data set called TIC (85 variables) can be found at the KEEL web site. We do not use it because most of its variables are nominal with no order and the output variable is binary; therefore, this data set corresponds to a classification problem rather than a regression one.

4.2 Obtained results

The experiments shown in this paper have been performed with a fivefold cross validation and six runs for each data set partition (i.e., a total of 30 runs per problem). Thus, the data set is divided into five subsets of (approximately) equal size. The algorithm is then applied five times to each problem, each time leaving one of the subsets out of training and using only the omitted subset to compute the test error.

Our algorithm has been run with the following parameter values: 50,000 evaluations, a population size of 60, a crossover probability of 0.7, and a mutation probability of 0.2 per chromosome. We have not performed any previous analysis to fix these values, so better results might be obtained by tuning them, although we have informally noticed that our algorithm is not especially sensitive to any parameter. Strong fuzzy partitions with uniformly distributed triangular MFs are used.

The maximum number of modules is set to four because an SHFS with more modules suffers from more propagation error; at the same time, this constraint reduces the search space.

Table 3 collects the obtained results, where \({\text{ MSE}}_{\text{ tra}}\) and \({\text{ MSE}}_{\text{ tst}}\) are the approximation error (Eq. 1) values over the training and test data set, respectively; #M, #R, and #V stand for the number of modules, the number of fuzzy rules, and the number of input variables, respectively; \(\sigma _{\#M}\) is the standard deviation of the number of modules, and \(\bar{x}_{\#R}\) is the average number of variables per rule, which is computed as follows:

$$\bar{x}_{\#R} = \frac{\displaystyle \sum \nolimits _{{i=1}}^{\#M} (V_i \times R_i)}{\sum\nolimits _{{i=1}}^{\#M}R_i}$$
(3)

with \(R_i\) being the number of rules of the ith module and \(V_i\) the number of variables of the ith module. This measure indicates the mean complexity of the rules.
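Eq. (3) can be read as a rule-weighted mean of the module sizes; a tiny worked example (with made-up values) follows:

```python
# Eq. (3): average number of variables per rule, weighting each module's number
# of variables by its number of rules (values below are made up).
def avg_vars_per_rule(vars_per_module, rules_per_module):
    total = sum(v * r for v, r in zip(vars_per_module, rules_per_module))
    return total / sum(rules_per_module)

print(avg_vars_per_rule([3, 2], [20, 7]))   # (3*20 + 2*7) / 27 = 2.74 variables per rule
```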

Since our algorithm performs multi-objective optimization, several solutions are returned in each run. The average Pareto fronts (over the 30 runs per problem) are computed according to a process based on percentiles. Let \(R\) be the number of runs (30 in our experiments) and \({\text{PS}}_i\) the Pareto set obtained in the \(i\)th run, ordered by the first objective. Thus, given the average size of the Pareto sets:

$$\bar{p} = {\text{ round}}\left(\frac{\sum _{i=1}^{R}{|PS_i|}}{R}\right),$$
(4)

the average Pareto set

$$\overline{\text{ PS}} = \{{\user2{p}}_k \in {\mathbb R }^C\ |\ k \in \{1,\dots,\bar{p}\}\} $$
(5)

is built as follows:

$${\user2{p}}_k = (p_{k,1},\dots,p_{k,C}), p_{k,i} = \frac{\sum _{j=1}^{R}{t^k_{j,i}}}{R}, $$
(6)

with C being the number of objectives and

$${\user2{t}}^k_j = (t^k_{j,1},\dots,t^k_{j,C}) = {\text{ rank}}\left(1+\frac{(|{\text{ PS}}_j|-1) \cdot (k-1)}{\bar{p}-1},{\text{ PS}}_j\right) $$
(7)

with \({\text{ rank}}(r,P)\) being the \(r\)th element of the ordered set \(P\). If \(r\) is an integer, the element holding this position is directly taken. If not, a linear interpolation is computed. For example, if we are considering two objectives \((C=2)\) and \({\text{PS}}_1=\{(1,10),(2,7), (5,6), (7,5), (9,3)\}\):

$$\begin{aligned}{\text{ rank}}(1,{\text{ PS}}_1)&= (1,10), \quad {\text{ rank}}(2,{\text{ PS}}_1)=(2,7), \\ {\text{ rank}}(3,{\text{ PS}}_1)&= (5,6), \quad {\text{ rank}}(4,{\text{ PS}}_1)=(7,5), \\ {\text{ rank}}(5,{\text{ PS}}_1)&= (9,3), \\ {\text{ rank}}(3.67,{\text{ PS}}_1)&= (5,6) \cdot 0.33 + (7,5) \cdot 0.67 = (6.33,5.33). \end{aligned}$$

Therefore, in an experiment with three runs (\(R=3\)) where we have the above \({\text{ PS}}_1, {\text{ PS}}_2=\{(2,9),(3,6),(7,4)\}\), and \({\text{ PS}}_3=\{(2,10),(4,8),(6,5),(8,3)\}\), the average Pareto set is:

$$\overline{\text{ PS}} = \{(1.67,9.67), (3.22,7.22), (5.56,5.22), (8,3.33)\}. $$
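The following sketch reproduces this percentile-based averaging on the three-run example above (function names are ours; positions are 1-indexed as in the paper and interpolated linearly):

```python
def rank(r, ps):
    lo, frac = int(r) - 1, r - int(r)
    if frac == 0 or lo + 1 >= len(ps):
        return ps[lo]
    return tuple((1 - frac) * a + frac * b for a, b in zip(ps[lo], ps[lo + 1]))

def average_pareto(fronts):
    p_bar = round(sum(len(ps) for ps in fronts) / len(fronts))   # Eq. (4)
    avg = []
    for k in range(1, p_bar + 1):                                # Eqs. (5)-(7)
        pts = [rank(1 + (len(ps) - 1) * (k - 1) / (p_bar - 1), ps) for ps in fronts]
        avg.append(tuple(sum(c) / len(fronts) for c in zip(*pts)))
    return avg

ps1 = [(1, 10), (2, 7), (5, 6), (7, 5), (9, 3)]
ps2 = [(2, 9), (3, 6), (7, 4)]
ps3 = [(2, 10), (4, 8), (6, 5), (8, 3)]
print(average_pareto([ps1, ps2, ps3]))   # ~[(1.67, 9.67), (3.22, 7.22), (5.56, 5.22), (8, 3.33)]
```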

We show five representative solutions from the average Pareto front, ordered from low to high MSE: for each problem, the first row is the solution with the highest accuracy, the second row is the first quartile, the third row is the median, the fourth row is the third quartile, and the fifth row is the solution with the lowest accuracy.

The following algorithms have been implemented in order to carry out a study of the algorithm that we propose:

  • WM method (Wang and Mendel 1992). Although WM is a simple algorithm and the results it obtains are not particularly good, we include it in the analysis because it is used to learn the rule bases, so the benefits provided by the hierarchical structure can be observed.

  • JS (Joo and Sudkamp algorithm). The algorithm is explained in Appendix 3.

  • GSHFS w/VS (GSHFS without Variable Selection). This algorithm is GSHFS without variable selection. The crossover operator exchanges modules between SHFSs: a different crossing point is selected randomly in each parent; the first offspring inherits from the first parent from the beginning of the chromosome to the crossing point, and the rest of the variables are inherited from the second parent, keeping their relative order in the second parent. The second offspring inherits from the second parent from the beginning of the chromosome to the crossing point, and the rest of the variables are inherited from the first parent, keeping their relative order in the first parent. Regarding the mutation operator, it uses the exchange, insertion, and movement of variables of GSHFS. The algorithm creates SHFSs with at most \(n\) modules, \(n\) being the number of the problem’s inputs.

  • GSHFS4M w/VS (GSHFS without variable selection creating 4 modules). It is the GSHFS w/VS but constrained to a maximum of four modules.

  • VSGA (Variable Selection by Genetic Algorithm). It is a variable selection algorithm based on a genetic process (see Appendix 2). It has been implemented because it provides a fair comparison with our algorithm.

Table 3 shows the results obtained by these algorithms.

The GSHFS w/VS, GSHFS4M w/VS, and VSGA rows show the average, over the runs, of the most accurate solution of the obtained Pareto set that improves on the WM solution. The JS rows contain the average of the results obtained by the JS algorithm.

An internal analysis of GSHFS is performed in Sect. 4.4. With this study, we will observe whether our algorithm’s benefits are due to the hierarchical structure, the variable selection, or both. Before that, in the next section, we analyze the results obtained by GSHFS and compare them with the results of the JS algorithm.

Table 3 Results obtained with MSEtra and MSEtst being the approximation error values over the training and test data set

 

4.3 Analysis

Our main objective is to improve interpretability by means of a precise and compact set of rules, while accuracy is maintained or improved. When the number of rules of an FS decreases, the system gains interpretability but loses accuracy, and vice versa. This trade-off can be clearly observed in the results obtained in problems with both low and high numbers of variables. Notice that the fuzzy rule set is learned individually for each module.

We can observe in the problems with a lower number of variables (from four to six variables) that the SHFSs obtained by our algorithm with a higher number of modules have worse accuracy than the WM method. However, in the problems with a considerable number of variables (more than eight), our algorithm performs better: the results it obtains in accuracy and complexity are better than those of WM, up to and including the third quartile.

In the Dee problem, we can observe that accuracy is high when the number of rules is high and, consequently, the FS has lower interpretability. Compared with the WM method, the highest-accuracy solution is better in accuracy while having a lower number of rules and variables. In the first quartile, the accuracy in \({\text{ MSE}}_{\text{ tra}}\) is slightly increased with respect to the solution obtained by WM, the \({\text{ MSE}}_{\text{ tst}}\) is lower, and the number of rules decreases by almost half.

Now, we examine the Elevators problem, which has 18 variables. The highest-accuracy solution obtained by our algorithm is better than the WM solution in both \({\text{ MSE}}_{\text{ tra}}\) and \({\text{ MSE}}_{\text{ tst}}\), and its interpretability is better because the SHFS has a lower number of variables and almost two modules; the number of rules has been decreased by 95 %. The third quartile is more accurate than WM in \({\text{ MSE}}_{\text{ tra}}\) and also in \({\text{ MSE}}_{\text{ tst}}\), and the number of rules obtained by our algorithm has decreased by 98 %.

In the Ailerons problem (40 variables), we can see that our solution with the highest accuracy is better than the solution obtained by WM. We want to emphasize that the number of rules is decreased by 92 % and the number of variables by 75 %. The median of the \({\text{ MSE}}_{\text{ tra}}\) and the median of the \({\text{ MSE}}_{\text{ tst}}\) is lower and the interpretability is better than the solution of WM, due to a decrease in the number of variables and its distribution in modules.

Figure 11 shows the average Pareto fronts obtained by GSHFS in the following large-scale problems: Concrete, Census 8L, Census 16H, Elevators, Computer Activity, and Ailerons. The SHFS achieves higher accuracy when the number of modules tends to two; if the SHFS has a higher number of modules, the error carried by the modules increases considerably. As for complexity, an SHFS has a smaller input configuration per module; for this reason, the SHFS’ total number of rules is lower, which allows greater system interpretability.

Figure 12 illustrates some examples of the hierarchical structures obtained by our algorithm; they are non-dominated solutions obtained by GSHFS in one data set partition of Census 16H, Elevators, Computer Activity, and Ailerons. Our aim is to observe the effect produced by the first module of the SHFS. Therefore, we compare these solutions obtained by GSHFS with the solutions obtained by WM over the set of input variables of the last module obtained by GSHFS. We can observe that the \({\text{ MSE}}_{\text{ tra}}\) and \({\text{ MSE}}_{\text{ tst}}\) obtained by our algorithm are lower than the values obtained by WM without inferring the endogenous input variable of the last module, while decreasing the number of rules by 6 % in Census 16H, 14 % in Elevators, 26 % in Computer Activity, and 16 % in Ailerons. The average number of variables per rule is also lower in the solutions obtained by GSHFS, which allows greater system interpretability because the rules are simpler.

Fig. 12
figure 12

Example of solutions obtained by GSHFS (a, b, e, f) and the corresponding solutions obtained by WM over the set of input variables of the last module obtained by GSHFS (c, d, g, h). The results in this figure (MSEtra/MSEtst) should be multiplied by \(10^{5}\) in the case of Census 16H

The solutions obtained by GSHFS are more promising in accuracy as well as in complexity, because the algorithm searches in local areas of the search space and because the rules are simpler. The obtained system is more compact: it has a lower number of variables, which are distributed among different modules, so the learned rules are easier to interpret because they are simpler and there are not many modules (the number of modules is around two).

Table 4 illustrates a real example of a hierarchical rule base. This rule base corresponds to the SHFS in Fig. 12b. The problem consists of 18 inputs. They have been reduced to seven and hierarchically distributed in two modules.

Table 4 Rule base obtained in elevators problem corresponding to Fig. 12b

Figure 13 is a bar chart of the variable dependence rate from the most accurate solution to the first quartile for all problems. The variable dependence rate is defined as the number of different variables that depend on other variable(s) in at least one of the most accurate non-dominated solutions (those contained in the first quartile). We consider that a variable \(X\) is a dependent variable if it is the output of one of the modules of the SHFS. The rate has been normalized to a percentage.

Fig. 13
figure 13

Bar-chart of the variable dependence rate from the most accurate to the first quartile solutions

The bar chart shows that some complex problems have a high percentage of dependent variables (Mortgage, Census 16L, and Ailerons) but others do not (Treasury and Computer Activity). This behavior also happens in problems with a low number of variables; for example, Ele2 and Laser have a higher percentage of dependent variables than the Dee or Concrete problems. We can also observe that some simple problems (Laser and Ele2) have a higher percentage of dependent variables than other complex problems (Treasury and Computer Activity). Hence, the dependence between variables is not related to the number of variables of the problem, but rather to the characteristics of the problem. This can be seen clearly in the Census problems. The letters (L or H) denote a very rough approximation of the task difficulty: for low task difficulty, more correlated inputs were chosen, as signified by a univariate smooth fit of each input on the target; tasks with high difficulty had their inputs chosen to make the modeling harder, due to higher variance or lower correlation of the inputs with the target. Census 16L and Census 16H have the same number of variables and instances, but we can observe in the bar chart that the variable dependence rate in Census 16L is higher than in Census 16H. This means that the variables of Census 16L are more correlated and the learning of dependencies between variables is easier. In this way, an SHFS can work in both simple and complex problems.

This interesting behavior deserves a deeper analysis than the one developed here. To do so, a data complexity analysis in a similar manner as made in classification (Ho and Basu 2002; Ho et al. 2006) would be helpful. However, these data complexity measures are not easy to define in regression, so tools as the GSHFS proposed in this paper may help to provide a further insight to understand the complexity of the problem beyond merely considering its number of variables and instances.

Figure 14 illustrates the correlation between variables by means of directed graphs in several problems. Each node is labeled with the real name of the variable, and the arcs represent the degree of dependency between pairs of variables, where the head is an input of a module whose output is the tail, in the shown percentage of cases. For instance, in Fig. 14a, the variable H5.6 depends on P19.4 by 13 % (i.e., in 13 % of the cases the algorithm obtains an accurate solution with a module where H5.6 is the output and P19.4 is one of its inputs). These graphs may help to understand the correlations among the input variables discovered by GSHFS. As previously stated for Fig. 13, the dependencies may differ considerably in problems with the same number of variables and instances, as shown in Fig. 14a (Census 16L) and c (Census 16H).

Fig. 14
figure 14

Some examples of found dependence graphs (dependencies below 10 % are omitted)

The JS algorithm (Joo and Sudkamp 2009) is one of the most recent approaches in this area. For this reason, we thought it would be interesting to implement this approach and compare it with our algorithm.

Table 3 shows the results obtained by JS for the considered problems. In the problems with four variables, the most accurate solution obtained by GSHFS is better than the JS solution in \({\text{ MSE}}_{\text{ tra}}\) and \({\text{ MSE}}_{\text{ tst}}\), with a lower number of rules and modules. In the problems with six to nine inputs, the first quartile or the mean of GSHFS obtains better results than the JS approach in \({\text{ MSE}}_{\text{ tra}}, {\text{ MSE}}_{\text{ tst}}\), number of modules, rules, and variables. In the large-scale problems, the third quartile of GSHFS achieves better accuracy and a lower number of rules than the JS algorithm.

In Dee, the mean value is considerably more accurate than the solution obtained by JS; GSHFS reduces the number of modules by 85 % and the number of rules by 78 %. In Census 16L, the highest-accuracy solution obtained by GSHFS improves accuracy by 25 % and decreases the number of modules by 59 %, while the number of rules is slightly higher. Ailerons is the considered problem with the highest number of variables; the mean obtained by GSHFS improves the JS solution by 3 % in \({\text{ MSE}}_{\text{ tra}}\) and \({\text{ MSE}}_{\text{ tst}}\), and the numbers of rules and modules are decreased by 59 and 63 %, respectively.

GSHFS improves the results obtained by JS and maintains a better trade-off between accuracy and interpretability. All in all, the JS approach obtains solutions with a higher number of modules; as mentioned in previous sections, the carried error increases with the number of modules, and for this reason the FS’s accuracy is worse. The JS algorithm does not remove variables, so the number of rules is higher. The generated FS is less interpretable due to the use of TSK rules, and it needs a complete rule base in order to create the hierarchical fuzzy system, which becomes a serious problem when the algorithm considers a higher number of variables, for example in the Ailerons problem.

However, GSHFS solves these problems; it controls the number of modules and variables by means of genetic operators and does not generate new linking variables. In this way, there is a lower number of rules, and as the GSHFS’ rules are Mamdani-type, they are more interpretable. For these reasons, it is a very suitable algorithm for large-scale problems.

4.4 Study of the GSHFS’ behavior

In this section, we analyze the reasons why our algorithm obtains good results. In order to do this, we compare our algorithm with VSGA and with GSHFS generating one, two, three, and four modules. The aim of this study is to isolate the variable selection benefits and to check if the variable selection or the hierarchical structure has more influence.

The solution with the highest accuracy for each problem obtained by VSGA is shown in Table 3. If we compare the results obtained by VSGA and GSHFS for problems with between four and nine variables, we can observe that there is no major difference in accuracy when the problem has four variables (Laser and Ele2). When our algorithm considers large-scale problems and problems with between six and nine variables, its behavior is better: the results obtained by GSHFS are better than those of VSGA, both in accuracy and in simplicity. VSGA obtains an FS with a low number of rules, but its inference is not accurate.

\(\overline{x}_{\#R}\) (Table 3) is a measure of how the variables are distributed among the rules; thus, if the number of variables is equal, the algorithm with a lower \(\overline{x}_{\#R}\) is better. In this way, we can observe the simplicity of GSHFS: the number of variables per rule in our algorithm is lower than in VSGA. This is because GSHFS deals with the search space in a modular way.

Figure 11 depicts the average Pareto fronts of GSHFS and VSGA, and the \(\overline{x}_{\#R}\) values in Concrete, Census 8L, Census 16H, Computer Activity, and Ailerons. It shows that the GSHFS algorithm is better in accuracy and simplicity (because there is a lower number of variables per rule in each module). If we consider the measure \(\overline{x}_{\#R}\), its value for GSHFS is lower than for VSGA along the whole range of numbers of rules. We can see this behavior in the rest of the problems, and it indicates that the rules generated by GSHFS are simpler and, therefore, more easily interpretable. In Census 16H, we can clearly observe how GSHFS is slightly worse in \(\overline{x}_{\#R}\) than VSGA when the number of rules is roughly between 60 and 100.

In order to get further insight into the GSHFS’ behavior, we have experimented with another version of our algorithm in which the dominance criterion is changed to obtain solutions with one, two, three, and four modules. In this way, we can observe the benefits provided by the hierarchical structure. Figure 15 depicts the average Pareto fronts with one, two, three, and four modules obtained by our algorithm in the Dee, Census 8L, Census 16H, Elevators, Computer Activity, and Ailerons problems. The average size of each Pareto set is shown in the graph legend. These figures illustrate that the combination of these four types of SHFS achieves better accuracy as the number of rules grows, because the problem’s variables are selected and distributed in the SHFS.

Fig. 15
figure 15

Average Pareto front with one, two, three, and four modules in several problems

Next, we compare the GSHFS algorithm with GSHFS w/VS and GSHFS4M w/VS. We can observe that GSHFS4M w/VS is more accurate than GSHFS w/VS in some problems, although the number of rules is slightly increased. In large-scale problems, the improvement is more significant than in problems with a lower number of variables. In Concrete, the \({\text{ MSE}}_{\text{ tra}}\) is reduced by 2 % and the number of rules is increased by approximately 1.6 %. However, in Mortgage the accuracy is improved by 3.6 %, although the number of rules is increased by 23 %. We want to point out that GSHFS4M w/VS reduces the WM’s \({\text{ MSE}}_{\text{ tra}}\) and number of rules by 57 and 33 %, respectively. In this way, we can verify that it is not good to have an SHFS with a higher number of modules, because it obtains worse accuracy.

If we compare GSHFS with GSHFS w/VS and GSHFS4M w/VS, we can observe that the elimination of variables by means of the genetic operators produces improvements in all problems, particularly the large-scale ones. In Concrete, the mean improves the solution obtained by GSHFS4M w/VS in \({\text{ MSE}}_{\text{ tra}}\) and number of rules by 0.1 and 50 %, respectively.

If we pay attention to CompAct, the improvement of the most accurate solution produced by GSHFS is quite considerable; the \({\text{ MSE}}_{\text{ tra}}\) is reduced by 64 % and the number of rules by 63 %. The third quartile reduces the \({\text{ MSE}}_{\text{ tra}}\) by 39 %.

4.5 Runtime analysis

To complete the analysis, we show the running time of GSHFS. Table 5 collects the runtime of the GSHFS algorithm, where #VarWM indicates the average number of variables per module (and, therefore, per WM method call) during the complete execution, and #Runtime is the average and standard deviation of the runtime. The algorithm was implemented in the C language, and all the experiments were executed on an Intel Core 4 Quad, 2.5-GHz CPU, with 8 GB of RAM and Linux Red Hat 5.2.

Table 5 Runtime of the GSHFS algorithm

The algorithm takes up to 12 h in the worst case. This can be considered a reasonable runtime given that we are dealing with some large-scale regression data sets (which is not so prevalent in GFSs) and that the proposed algorithm is conceived for off-line learning.

We can find some interesting aspects after a closer look at the table. Indeed, the runtime mainly depends on the number of examples of the data set (see Table 2). This is to be expected, as the WM method generates as many candidate rules as there are examples in its first stage, so it clearly depends on this parameter. In this way, problems such as Census, Elevators, or Ailerons take longer, since their numbers of training examples are between 11,000 and 18,228. Furthermore, it is interesting to note that the number of input variables of the problem is not so relevant in determining the runtime. Thus, given almost the same number of examples, a problem with 8 variables such as Concrete takes longer (about 8 min) than a problem with 15 variables such as Treasury (about 2 min). The explanation must be found in the number of rules finally generated: since WM groups the candidate rules by antecedent and then chooses one rule per group to avoid inconsistency, sparse training data may lead to a higher number of rules even when the input dimension and the number of examples are lower. As shown in Table 5, the number of variables used in the modules along the execution of the algorithm directly influences the runtime and may ultimately represent the difficulty of solving the problem beyond its number of variables; this parameter would be related to more complex features such as the sparseness of the examples or the correlation among the variables.

5 Comparison with other GFSs

In order to evaluate the accuracy and interpretability of our proposal in large-scale problems, we compare it with some state-of-the-art algorithms in the area of GFSs. Nowadays, most GFS algorithms (particularly in multi-objective optimization) adjust the MF parameters in order to reach additional accuracy. Therefore, in order to make a fair comparison, we opted to endow our proposal with the capability of tuning the fuzzy partitions. However, the main contribution of the paper is not providing an algorithm that generates fuzzy models with excellent accuracy, but hierarchical fuzzy models that organize the input variables in a more understandable way without sacrificing much accuracy.

This section is organized as follows: Sect. 5.1 explains the proposed tuning approach of the GSHFS algorithm; Sect. 5.2 describes the experimental setup; Sect. 5.3 shows the obtained results, and Sect. 5.4 analyzes and compares our algorithm with other well-known GFSs.

5.1 Proposed tuning algorithm

This approach consists of two genetic operators that move the core of the MFs in order to adapt the fuzzy partitions for an additional accuracy improvement. Strong fuzzy partitions are considered.

  1. Fuzzy partition codification: A double-coding vector scheme to represent the cores of the MFs of each variable is used; Fig. 16 illustrates it. The spacing between MF cores is calculated by means of the parameter (Eq. 8):

    $$\varepsilon = \frac{{\text{ max}}-{\text{ min}}}{k-1} $$
    (8)

    with max and min being the maximum and minimum of the variable interval, respectively, and \(k\) the number of labels per variable. In this way, the core of the first label is the \(minimum\), the core of the second label is \(minimum + \varepsilon \), the core of the third label is \(minimum + 2\varepsilon \), and so on.

  2. Initialization: The first half of the population is initialized with the predefined MFs. The second half is initialized randomly by means of Algorithm 6.

  3. Crossover operator: This crossover operator is applied per variable to an offspring obtained by means of the GSHFS crossover, into which it is embedded. The type of crossing applied for tuning depends on the parent(s) from which the offspring's selected common variable is inherited:

     (a) If the variable inherited by the offspring comes from both parents, the BLX-\(\alpha\) crossover is applied per label. Figure 17 illustrates it.

     (b) If the variable inherited by the offspring comes from only one of the two parents, the MF values are copied from that parent into the offspring's inherited variable.

     Finally, if the MFs' cores overlap each other, the vector of cores is sorted from lowest to highest value.

  4. Mutation operator: The mutation process is as follows:

     (a) Select a random individual \(i\).

     (b) Choose a random variable \(x\) from \(i\).

     (c) Choose a random label \(l\) from \(x\).

     (d) Select a random number \(r\) within \(l\)'s interval.

     (e) Test whether the new core \(r\) overlaps with the cores on either side. If it does, the new core is set to the average of the cores located on both sides of \(r\).
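To make these ingredients concrete, the following sketch shows one possible reading of them in code: the core placement of Eq. 8, the per-label BLX-\(\alpha\) crossover of case (a) with the sorting repair, and the core mutation of steps (c)-(e). Function names and the exact definition of a label's interval are our assumptions; the actual operators are embedded in the GSHFS chromosome handling.

```python
import random


def initial_cores(v_min, v_max, k):
    """Uniformly spaced MF cores (Eq. 8): eps = (max - min) / (k - 1)."""
    eps = (v_max - v_min) / (k - 1)
    return [v_min + j * eps for j in range(k)]


def blx_alpha(c1, c2, alpha=0.3):
    """BLX-alpha crossover applied to one label's core (case (a))."""
    lo, hi = min(c1, c2), max(c1, c2)
    spread = hi - lo
    return random.uniform(lo - alpha * spread, hi + alpha * spread)


def crossover_variable(cores_p1, cores_p2, alpha=0.3):
    """Case (a): the variable is inherited from both parents, so BLX-alpha is
    applied per label; overlapping cores are repaired by sorting them."""
    child = [blx_alpha(a, b, alpha) for a, b in zip(cores_p1, cores_p2)]
    return sorted(child)


def mutate_core(cores, label, v_min, v_max):
    """Steps (c)-(e): draw a random core r inside the label's interval (taken
    here as the span between its neighbours' cores, an assumption of ours) and,
    if it collides with a neighbouring core, replace it by their average."""
    left = cores[label - 1] if label > 0 else v_min
    right = cores[label + 1] if label < len(cores) - 1 else v_max
    r = random.uniform(left, right)
    if r <= left or r >= right:          # step (e): overlap with a neighbouring core
        r = (left + right) / 2.0         # repair: average of the cores on both sides
    mutated = list(cores)
    mutated[label] = r
    return mutated
```

For instance, with five labels on [0, 1], initial_cores(0.0, 1.0, 5) yields [0.0, 0.25, 0.5, 0.75, 1.0], i.e., Eq. 8 with \(\varepsilon = 0.25\).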

Fig. 16
figure 16

Example of fuzzy partition coding scheme

Fig. 17
figure 17

MFs crossover

These genetic operators are applied independently of the hierarchical genetic operators. Once 75 % of the evaluations have been consumed, the algorithm stops learning hierarchical structures and only the MF tuning is performed.
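In code, this schedule reduces to a single check on the evaluation counter; the following snippet is only a hedged illustration and the names are ours, not taken from the implementation:

```python
def learn_structure(evaluations_done, max_evaluations, switch_ratio=0.75):
    """True while hierarchical structure learning is still active; after 75 %
    of the evaluation budget, only the MF tuning operators are applied."""
    return evaluations_done <= switch_ratio * max_evaluations
```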

5.2 Experimental setup

The experiments shown in this section have again been performed with a fivefold cross validation and six runs for each data set partition (i.e., a total of 30 runs per problem). Our algorithm has been run with the following parameter values: 100,000 and 300,000 evaluations, a population size of 60, a crossover probability of 0.7, a mutation probability of 0.2 per chromosome, and 0.3 as the \(\alpha\) parameter of the BLX crossover. Strong fuzzy partitions with uniformly distributed triangular MFs are used, with five fuzzy sets for each linguistic variable. The maximum number of modules is set to four.
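For reference, the same setup can be written down as a configuration sketch; the key names are illustrative and not taken from the original implementation:

```python
# Parameter values used in the experiments of Sect. 5 (key names are ours).
GSHFS_TUNING_SETUP = {
    "evaluations": (100_000, 300_000),   # budget depends on the comparison
    "population_size": 60,
    "crossover_probability": 0.7,
    "mutation_probability": 0.2,         # per chromosome
    "blx_alpha": 0.3,
    "fuzzy_sets_per_variable": 5,        # strong, uniform triangular partitions
    "max_modules": 4,
    "validation": "5-fold CV, 6 runs per partition (30 runs per problem)",
}
```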

The datasets used with 100,000 evaluations are Ele2, Weather Ankara, Mortgage, Treasury, and Computer Activity, while Ele2, Weather Ankara, Mortgage, and Computer Activity have been used with 300,000 evaluations. In this way, our experiments match the setup applied by the algorithms considered for comparison.

5.3 Experimental results

Table 6 collects the obtained results, where \(\text{MSE}_{\text{tra}}\) and \(\text{MSE}_{\text{tst}}\) are the approximation error values (Eq. 1) over the training and test data sets, respectively; #M, #R, and #V stand for the number of modules, the number of fuzzy rules, and the number of input variables, respectively; \(\sigma_{\#M}\) is the standard deviation of the number of modules, and \(\bar{x}_{\#R}\) is the average number of variables per rule.

Table 6 Results obtained by different algorithms with 100,000 and 300,000 evaluations and five fuzzy sets for each linguistic variable

The following GFSs have been included for comparison:

  • Three single-objective-based methods: GR-MF (Cordón et al. 2001a) learns the granularity of each fuzzy partition and the MF parameters. GA-WM (Cordón et al. 2001b) learns the granularity, the scaling factors, and the domain of each system variable. GLD-WM (Alcalá et al. 2007b) learns the knowledge base by obtaining the granularity and the individual lateral displacements of the MFs. The results are taken from Alcalá et al. (2011a).

  • Three versions of the two-objective (2 + 2)M-PAES: PAES-RB learns only rules. PAES-SF and PAES-SFC learn concurrently the RB and the MF parameters by means of the piecewise linear transformation. These algorithms have been taken from Antonelli et al. (2011).

  • A three-objective evolutionary algorithm: PAES-SF3 (Antonelli et al. 2011) seeks different trade-offs among complexity, accuracy, and partition integrity by concurrently learning the RB and the MF parameters of the linguistic variables.

Table 6 shows the results of the most accurate solution obtained by our tuning approach (called GSHFS-Tuning) and by GR-MF, GA-WM, and GLD-WM. These algorithms have been run with 100,000 evaluations and five fuzzy sets for each linguistic variable. The table also presents the FIRST point obtained by PAES-RB, PAES-SF, PAES-SFC, and PAES-SF3, which have been run with 300,000 evaluations and five fuzzy sets for each linguistic variable; most of the datasets used in Antonelli et al. (2011) are smaller. GSHFS-Tuning has been run with both 100,000 and 300,000 evaluations in order to compare it with the algorithms mentioned above. Both the most accurate solution (Min) and the first-quartile solution (Q1), after sorting the solutions in ascending order by error, are shown in this table.

5.4 Analysis

In this subsection, we analyze the behaviour of hierarchical structures against other GFS algorithms that generate non-hierarchical structures.

In the problems with 4 to 15 variables, both the accuracy and the number of rules obtained by GSHFS-Tuning rank at intermediate positions with respect to the rest of the algorithms. In large-scale problems (from 18 to 40 variables), however, the approach is the most accurate one, albeit with a higher number of rules. Notice also that the number of variables per rule is lower in GSHFS-Tuning than in the different MOEAs implemented by Antonelli et al. (2011): although the number of rules obtained is high, each individual rule is simpler.

We can observe that, with 100,000 evaluations, the GSHFS-Tuning errors are usually higher than those of GLD-WM, but the number of rules is significantly lower (e.g., 70 % fewer rules in Ele2).

Compared to the PAES-based algorithms with 300,000 evaluations, GSHFS-Tuning only improves the accuracy of PAES-RB in the Ele2 problem, but it is simpler in terms of both the number of rules and the number of variables per rule, which are reduced by 65 and 50 %, respectively. Looking at the Computer Activity problem, GSHFS-Tuning obtains the most accurate result, improving the result of PAES-SF by 25 %. GSHFS-Tuning uses 97 % more rules, but note that its number of variables per rule is the lowest, so simpler individual rules are considered. Moreover, even the first-quartile solutions outperform PAES-SF in accuracy while using fewer than half the rules.

To sum up, the results obtained by GSHFS-Tuning are more accurate in large-scale problems, but the number of rules is higher. It is difficult to decide which approach is better because the complete Pareto fronts of the PAES-based algorithms are not available. Besides, GSHFS-Tuning is compared with algorithms that also learn the number of labels, which reduces their number of rules considerably.

6 Conclusion and further work

We have proposed a multi-objective genetic algorithm for learning SHFSs that mitigates the exponential growth of the number of rules as the number of variables increases. The set of variables is selected and distributed into modules by the algorithm. We have shown that this decomposition into SHFSs can obtain good results in problems with a high number of variables.

The interpretability is good for several reasons: (1) the hierarchical structure uses a lower number of variables, (2) the algorithm does not generate artificial linking variables, so all of the variables are interpretable because they belong to the system, and (3) the rules are simpler because the number of variables per module is lower. At the same time, the accuracy is maintained or improved.

Besides, the hierarchical structures may help to analyze the relationships among the input variables, thus providing further insight into the complexity of the problem beyond merely considering its number of variables and instances.

The results obtained are promising, which opens a new research line in GFSs. As further work, we suggest studying other mechanisms to improve accuracy (e.g., learning the MFs), carrying out a detailed study of the interpretability of each rule of the FS, and extending this algorithm to parallel and hybrid structure learning.