
1 Introduction

The development of methods for nonlinear classification focuses mostly on reaching high accuracy. Another important goal is to achieve good clarity and interpretability of the classification rules, which allows the considered problem to be better understood. These two aims are contradictory, so the balance between the accuracy and interpretability of a classifier is often investigated in the literature (see e.g. [6, 7, 8, 18]).

Nonlinear classification can be based on many types of approaches. Among them, for example, neuro-fuzzy systems (see e.g. [13, 17]) can be found. In these systems knowledge is gathered in the form of \(if \ldots then \ldots\) rules. These rules contain linguistic variables and variables corresponding to fuzzy sets and their parameters. Methods created to increase the interpretability of neuro-fuzzy system rules take an important place in the literature. Interpretability arises not only from the complexity of the system, but also from the semantics of the rules (see e.g. [2, 7, 19]). In this research area it is worth listing methods focused on: (a) definition and implementation of new criteria of interpretability of fuzzy rules (see e.g. [1, 7]); (b) appropriate aggregation of these criteria (see e.g. [8, 18]) and the use of multi-objective methods (see e.g. [1, 18]); (c) the use of population-based algorithms to obtain interpretable systems (see e.g. [12]), etc.

In this paper we propose a new approach which allows fuzzy classifiers to be selected taking into account different interpretability criteria (including, among others, semantics). This approach is based on a hybrid population-based algorithm, which is a fusion of a genetic algorithm (see e.g. [17]) and the imperialist competitive algorithm (ICA) (see e.g. [3]). The genetic part of the algorithm allows for automatic selection of the structure of the neuro-fuzzy system, while the imperialist part simultaneously selects the parameters of these structures. The ICA was chosen as a part of the proposed hybrid method because: (a) it was created taking inspiration from social evolution, (b) it is a multi-population algorithm which provides migration and competition of sub-populations in order to improve the obtained solutions, and (c) it is distinguished by two interesting operators: assimilation and revolution. It is worth mentioning that the system presented in our previous paper [14] was used for the classification process. Our approach additionally focuses on the trade-off between accuracy and interpretability of the system and allows accuracy-interpretability dependences to be presented using an estimated Pareto front (see e.g. [17]).

This paper is organized as follows: in Sect. 2 a description of the proposed system and its tuning process for nonlinear classification is presented. In Sect. 3 the interpretability criteria for neuro-fuzzy systems are shown. The results of simulations are presented in Sect. 4; finally, the conclusions are drawn in Sect. 5.

2 Description of Neuro-Fuzzy System for Classification and Algorithm for Its Tuning

2.1 Description of the System

We consider a multi-input, multi-output neuro-fuzzy system mapping \({\mathbf{X}} \to {\mathbf{Y}}\), where \({\mathbf{X}} \subset {\mathbf{R}}^{n}\) and \({\mathbf{Y}} \subset {\mathbf{R}}^{m}\). The flexible fuzzy rule base consists of a collection of N fuzzy if-then rules in the form:

$$R^{k} :\left[ {\left( \begin{aligned} {\text{IF}}\left( {\bar{x}_{1}\; {\text{is}}\; A_{1}^{k} } \right)\left| {w_{k,1}^{A} } \right.{\text{AND}} \, \ldots \, {\text{AND}}\left( {\bar{x}_{n} \; {\text{is}}\; A_{n}^{k} } \right)\left| {w_{k,n}^{A} } \right. \\ {\text{THEN}}\left( {y_{1}\; {\text{is}}\; B_{1}^{k} } \right)|w_{1,k}^{B} , \ldots ,\left( {y_{m}\; {\text{is}}\; B_{m}^{k} } \right)|w_{m,k}^{B} \\ \end{aligned} \right)\left| {w_{k}^{\text{rule}} } \right.} \right],$$
(1)

where n is the number of inputs, m is the number of outputs, \({\bar{\mathbf{x}}} = \left[ {\bar{x}_{1} , \ldots ,\bar{x}_{n} } \right] \in {\mathbf{X}}\), \({\mathbf{y}} = \left[ {y_{1} , \ldots ,y_{m} } \right] \in {\mathbf{Y}},A_{1}^{k} , \ldots ,A_{n}^{k}\) are fuzzy sets characterized by membership functions \(\mu_{{A_{i}^{k} }} \left( {x_{i} } \right),i = 1, \ldots ,n,k = 1, \ldots ,N,B_{1}^{k} , \ldots ,B_{m}^{k}\) are fuzzy sets characterized by membership functions \(\mu_{{B_{j}^{k} }} \left( {y_{j} } \right),j = 1, \ldots ,m,k = 1, \ldots ,N,w_{k,i}^{A} \in \left[ {0,1} \right],i = 1, \ldots ,n,k = 1, \ldots ,N\), are weights of antecedents, \(w_{j,k}^{B} \in \left[ {0,1} \right],k = 1, \ldots ,N,j = 1, \ldots ,m\), are weights of consequents, and \(w_{k}^{\text{rule}} \in \left[ {0,1} \right],k = 1, \ldots ,N\), are weights of rules. The flexibility of the rule base results from using weights of the antecedents and consequents of the rules. The use of weights requires a properly defined aggregation function, whose definition can be found in our previous work (see [5]). In the logical approach the output signal \(\bar{y}_{j} ,j = 1, \ldots ,m,\) of the neuro-fuzzy system can be described by the formula:

$$\bar{y}_{j} = \frac{{\sum\nolimits_{r = 1}^{R} {\bar{y}_{j,r}^{\text{def}} \cdot\mathop {\mathop {T^{*} }\limits^{N} }\limits_{k = 1} \left\{ {S^{*} \left\{ {1 - \mathop {\mathop {T^{*} }\limits^{n} }\limits_{i = 1} \left\{ {\mu_{{A_{i}^{k} }} \left( {\bar{x}_{i} } \right);w_{k,i}^{A} } \right\},\mu_{{B_{j}^{k} }} \left( {\bar{y}_{j,r}^{\text{def}} } \right);1,w_{j,k}^{B} } \right\};w_{k}^{\text{rule}} } \right\}} }}{{\sum\nolimits_{r = 1}^{R} {\mathop {\mathop {T^{*} }\limits^{N} }\limits_{k = 1} \left\{ {S^{*} \left\{ {1 - \mathop {\mathop {T^{*} }\limits^{n} }\limits_{i = 1} \left\{ {\mu_{{A_{i}^{k} }} \left( {\bar{x}_{i} } \right);w_{k,i}^{A} } \right\},\mu_{{B_{j}^{k} }} \left( {\bar{y}_{j,r}^{\text{def}} } \right);1,w_{j,k}^{B} } \right\};w_{k}^{\text{rule}} } \right\}} }},$$
(2)

where \(\bar{y}_{j,r}^{\text{def}} ,j = 1, \ldots ,m,r = 1, \ldots ,R\), are discretization points and R is the number of discretization points (points in Y in which the fuzzy inference from the rule base (1) is discretized; they result from, among others, the defuzzification operations typical for neuro-fuzzy systems, which allow the real value of the system output signal to be determined), and \(T^{*} \left\{ \cdot \right\}\) and \(S^{*} \left\{ \cdot \right\}\) are weighted triangular norms (see e.g. [17]). In particular, the t-norm with weights of arguments can be denoted as follows (see e.g. [17]):

$$T^{*} \left\{ {a_{1} ,a_{2} ;w_{1} ,w_{2} } \right\} = T\left\{ {1 - w_{1} \cdot \left( {1 - a_{1} } \right),1 - w_{2} \cdot \left( {1 - a_{2} } \right)} \right\}\mathop = \limits^{{{\text{e}}.{\text{g}}.}} \left( {1 - w_{1} \cdot \left( {1 - a_{1} } \right)} \right) \cdot \left( {1 - w_{2} \cdot \left( {1 - a_{2} } \right)} \right) ,$$
(3)

where the t-norm \(T\left\{ \cdot \right\}\) is a generalization of the usual two-valued logical conjunction (studied in classical logic), and \(w_{1}\) and \(w_{2} \in \left[ {0,1} \right]\) are the importance weights of the arguments \(a_{1} ,a_{2} \in \left[ {0,1} \right]\). The t-conorm with weights of arguments can be denoted analogously:

$$S^{*} \left\{ {a_{1} ,a_{2} ;w_{1} ,w_{2} } \right\} = S\left\{ {w_{1} \cdot a_{1} ,w_{2} \cdot a_{2} } \right\}\mathop = \limits^{{{\text{e}}.{\text{g}}.}} 1 - \left( {1 - w_{1} \cdot a_{1} } \right) \cdot \left( {1 - w_{2} \cdot a_{2} } \right) .$$
(4)

For more details see our previous papers, e.g. [17].
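The weighted algebraic norms (3) and (4) can be sketched directly. This is a minimal illustration (the function names are ours, not the paper's); the key property is that a weight of 0 makes an argument neutral, while a weight of 1 uses it at full strength:

```python
def weighted_t_norm(a1, a2, w1, w2):
    """Algebraic t-norm (3) with argument weights in [0, 1]: a weight of
    0 makes the argument neutral (treated as 1), 1 uses it fully."""
    return (1 - w1 * (1 - a1)) * (1 - w2 * (1 - a2))

def weighted_t_conorm(a1, a2, w1, w2):
    """Algebraic t-conorm (4) with argument weights: a weight of 0 makes
    the argument neutral (treated as 0)."""
    return 1 - (1 - w1 * a1) * (1 - w2 * a2)

# With full weights these reduce to the ordinary algebraic norms:
assert abs(weighted_t_norm(0.5, 0.8, 1.0, 1.0) - 0.4) < 1e-9
assert abs(weighted_t_conorm(0.5, 0.8, 1.0, 1.0) - 0.9) < 1e-9
# A zero weight removes the argument's influence entirely:
assert abs(weighted_t_norm(0.3, 0.9, 1.0, 0.0) - 0.3) < 1e-9
```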

2.2 Description of the Tuning Algorithm

The purpose of the algorithm described in this section is the automatic selection of the structure and parameters of the rules in form (1) (number of inputs, antecedents, consequents, rules) and of the system in form (2) (discretization points). In this process the interpretability criteria defined in Sect. 3 are used. The considered algorithm is a fusion of a genetic algorithm (which selects the structure of the system) and the imperialist competitive algorithm (which selects the parameters of the system).

Encoding of parameters and structure. The parameters of system (2) are encoded in the following individuals (Pittsburgh approach, in which a single individual of the population encodes the entire neuro-fuzzy system):

$$\begin{aligned} {\mathbf{X}}_{ch}^{\text{par}} & = \left\{ {\begin{array}{*{20}c} {\bar{x}_{1,1}^{A} ,\sigma_{1,1}^{A} , \ldots ,\bar{x}_{n,1}^{A} ,\sigma_{n,1}^{A} , \ldots \bar{x}_{1,Nmax}^{A} ,\sigma_{1,Nmax}^{A} , \ldots ,\bar{x}_{n,Nmax}^{A} ,\sigma_{n,Nmax}^{A} ,} \\ {\bar{y}_{1,1}^{B} ,\sigma_{1,1}^{B} , \ldots ,\bar{y}_{m,1}^{B} ,\sigma_{m,1}^{B} , \ldots \bar{y}_{1,Nmax}^{B} ,\sigma_{1,Nmax}^{B} , \ldots ,\bar{y}_{m,Nmax}^{B} ,\sigma_{m,Nmax}^{B} ,} \\ {w_{1,1}^{A} , \ldots ,w_{1,n}^{A} , \ldots ,w_{Nmax,1}^{A} , \ldots ,w_{Nmax,n}^{A} ,w_{1,1}^{B} , \ldots ,w_{m,1}^{B} , \ldots ,w_{1,Nmax}^{B} , \ldots ,w_{m,Nmax}^{B} ,} \\ {w_{1}^{\text{rule}} , \ldots ,w_{Nmax}^{\text{rule}} ,\bar{y}_{1,1}^{\text{def}} , \ldots ,\bar{y}_{1,Rmax}^{\text{def}} , \ldots ,\bar{y}_{m,1}^{\text{def}} , \ldots ,\bar{y}_{m,Rmax}^{\text{def}} } \\ \end{array} } \right\} \\ & = \left\{ {X_{ch,1}^{\text{par}} , \ldots ,X_{ch,L}^{\text{par}} } \right\}, \\ \end{aligned}$$
(5)

where \(L = Nmax \cdot \left( {3 \cdot n + 3 \cdot m + 1} \right) + Rmax \cdot m\) is the length of the parameters \({\mathbf{X}}_{ch}^{\text{par}} ,ch = 1, \ldots ,\mu\) for the parent population or \(ch = 1, \ldots ,\lambda\) for the temporary population, \(\left\{ {\bar{x}_{i,k}^{A} ,\sigma_{i,k}^{A} } \right\},i = 1, \ldots ,n,k = 1, \ldots ,N\), are parameters of the Gaussian membership functions \(\mu_{{A_{i}^{k} }} \left( {x_{i} } \right)\) of the input fuzzy sets \(A_{1}^{k} , \ldots ,A_{n}^{k}\) (Gaussian functions were used in our simulations), \(\left\{ {\bar{y}_{j,k}^{B} ,\sigma_{j,k}^{B} } \right\},k = 1, \ldots ,N,j = 1, \ldots ,m\), are parameters of the Gaussian membership functions \(\mu_{{B_{j}^{k} }} \left( {y_{j} } \right)\) of the output fuzzy sets \(B_{1}^{k} , \ldots ,B_{m}^{k}\), \(Nmax\) is the maximum number of rules, and \(Rmax\) is the maximum number of discretization points. The process of selecting the structure of the system is done using additional parameters \({\mathbf{X}}_{ch}^{\text{str}}\). Their genes take binary values and indicate which rules, antecedents, consequents, inputs, and discretization points are selected. The parameters \({\mathbf{X}}_{ch}^{\text{str}}\) are given by:

$$\begin{aligned} {\mathbf{X}}_{ch}^{\text{str}} & = \left\{ {\begin{array}{*{20}c} {x_{1} , \ldots ,x_{n} ,A_{1}^{1} , \ldots ,A_{n}^{1} , \ldots ,A_{1}^{Nmax} , \ldots ,A_{n}^{Nmax} ,B_{1}^{1} , \ldots ,B_{m}^{1} , \ldots ,B_{1}^{Nmax} , \ldots ,B_{m}^{Nmax} ,} \\ {{\text{rule}}_{1} , \ldots ,{\text{rule}}_{Nmax} ,\bar{y}_{1,1}^{\text{def}} , \ldots ,\bar{y}_{1,Rmax}^{\text{def}} , \ldots ,\bar{y}_{m,1}^{\text{def}} , \ldots ,\bar{y}_{m,Rmax}^{\text{def}} } \\ \end{array} } \right\} \\ & = \left\{ {X_{ch,1}^{\text{str}} , \ldots ,X_{{ch,L^{\text{str}} }}^{\text{str}} } \right\}, \\ \end{aligned}$$
(6)

where \(L^{\text{str}} = Nmax \cdot (n + m + 1) + n + Rmax \cdot m\) is the length of the parameters \({\mathbf{X}}_{ch}^{\text{str}}\). Their genes indicate which rules \(( {\text{rule}}_{k} ,k = 1, \ldots ,Nmax)\), antecedents \((A_{i}^{k} ,i = 1, \ldots ,n,k = 1, \ldots ,Nmax)\), consequents \((B_{j}^{k} ,j = 1, \ldots ,m,k = 1, \ldots ,Nmax)\), inputs \((\bar{x}_{i} ,i = 1, \ldots ,n)\), and discretization points \((\bar{y}^{r} ,r = 1, \ldots ,Rmax)\) are included in the system. We can easily notice that the number of inputs used in the system and encoded in the individual ch can be determined as follows:

$$n_{ch} = \sum\limits_{i = 1}^{n} {{\mathbf{X}}_{ch}^{\text{str}} \left\{ {x_{i} } \right\}} ,$$
(7)

where \({\mathbf{X}}_{ch}^{\text{str}} \left\{ {x_{i} } \right\}\) denotes the parameter of the individual \({\mathbf{X}}_{ch}^{\text{str}}\) associated with the input \(x_{i}\) (as previously mentioned, if the value of the gene is 1, the associated input is taken into account during the operation of the system). The number of rules \((N_{ch} )\) used in the system and encoded in the individual ch may be determined analogously.
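The selector (7) is just a sum of binary genes. A minimal sketch, with an illustrative dict-of-genes layout rather than the paper's exact encoding:

```python
# Sketch of reading the binary structure genes X_ch^str of (6).
# The gene names ("x1", "rule1", ...) are illustrative placeholders.

def count_active(structure, prefix):
    """Count genes with value 1 whose name starts with `prefix`, as in (7)."""
    return sum(v for k, v in structure.items() if k.startswith(prefix))

structure = {
    "x1": 1, "x2": 0, "x3": 1,           # input genes
    "rule1": 1, "rule2": 1, "rule3": 0,  # rule genes
}
n_ch = count_active(structure, "x")      # number of active inputs, Eq. (7)
N_ch = count_active(structure, "rule")   # number of active rules (analogous)
assert (n_ch, N_ch) == (2, 2)
```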

Evolution of the parameters and structure. The idea of the proposed algorithm is shown in Fig. 1. In Step 1 of the algorithm, an initial population (of size \(N_{pop}\)) is created and evaluated (each individual is called a colony). It is worth mentioning that for each colony both the real-valued parameters \({\mathbf{X}}_{ch}^{\text{par}}\) and the structure parameters \({\mathbf{X}}_{ch}^{\text{str}}\) are initialized. From the initial population the N best colonies are chosen, and on the basis of each of them an empire (subpopulation) is created. The best colony in every empire is called the imperialist. The remaining \(N_{pop} - N\) colonies are spread in a specified way among the empires. In Step 2 of the algorithm the assimilation and revolution processes [which are responsible for tuning the real-valued parameters of the system (2)] are performed. The purpose of these processes is to move the colonies toward the imperialists of their empires. The extension of this step relies on using the mutation operator from the genetic algorithm, which modifies the structure of the system (2). The mutation operator has been designed to be proportional to the value of the evaluation function of the colonies (the best colony has a 0 % chance of being modified, the worst colony has a 100 % chance). In Step 3 an evaluation of the modified colonies is made. If a colony gets a better value than the imperialist of its empire, then the imperialist is replaced by this colony. It is worth mentioning that the fitness function defined in our paper promotes colonies which are characterized, among others, by the simplest structures. In Step 4 of the algorithm, an empire competition (based on the power of the empires) takes place. The empire which wins the competition (selected using the roulette wheel method on a probability calculated from the power of the empires) gets the weakest colony of the weakest empire. If an empire loses all its colonies, it is removed from the algorithm. In Step 5 a stop condition is checked (e.g. whether the number of iterations has reached its maximum value). If the stop condition is met, the algorithm ends (and the best colony of the best empire is presented); otherwise the algorithm goes back to Step 2. More details about the algorithms used in the proposed hybrid genetic-imperialist algorithm can be found e.g. in [3, 17].
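As a rough illustration of Steps 1-5, the following toy sketch runs the imperialist loop on a simple real-parameter minimisation problem. The population sizes, the assimilation coefficient, and the deterministic competition rule are our own simplifications (the paper uses roulette-wheel competition on empire power), and the genetic mutation of the binary structure genes from Step 2 is omitted:

```python
import random

def fitness(x):
    """Toy objective to minimise (stands in for the fitness function (8))."""
    return sum(v * v for v in x)

def evolve(dim=4, n_pop=30, n_emp=3, iters=200, beta=0.5, rev_rate=0.3, seed=1):
    rng = random.Random(seed)
    colonies = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_pop)]
    colonies.sort(key=fitness)
    init_fit = fitness(colonies[0])
    # Step 1: the n_emp best colonies become imperialists; the remaining
    # colonies are spread round-robin among the empires.
    empires = [{"imp": colonies[e], "cols": colonies[n_emp + e::n_emp]}
               for e in range(n_emp)]
    for _ in range(iters):
        for emp in empires:
            if not emp["cols"]:
                continue
            for col in emp["cols"]:
                # Step 2: assimilation - move the colony toward its
                # imperialist - then revolution - a random restart of
                # one coordinate.
                for d in range(dim):
                    col[d] += beta * rng.random() * (emp["imp"][d] - col[d])
                if rng.random() < rev_rate:
                    col[rng.randrange(dim)] = rng.uniform(-5, 5)
            # Step 3: a colony that beats its imperialist replaces it.
            best_col = min(emp["cols"], key=fitness)
            if fitness(best_col) < fitness(emp["imp"]):
                emp["cols"].remove(best_col)
                emp["cols"].append(emp["imp"])
                emp["imp"] = best_col
        # Step 4: empire competition, simplified to "the strongest empire
        # takes the worst colony of the weakest one".
        weakest = max(empires, key=lambda e: fitness(e["imp"]))
        strongest = min(empires, key=lambda e: fitness(e["imp"]))
        if weakest is not strongest:
            if weakest["cols"]:
                worst = max(weakest["cols"], key=fitness)
                weakest["cols"].remove(worst)
                strongest["cols"].append(worst)
            else:  # a collapsed empire is absorbed by the strongest one
                strongest["cols"].append(weakest["imp"])
                empires.remove(weakest)
    # Step 5 (stop condition) is a fixed iteration budget in this sketch.
    best = min((e["imp"] for e in empires), key=fitness)
    return best, init_fit

best, init_fit = evolve()
assert fitness(best) <= init_fit  # imperialists are only ever replaced by better colonies
```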

Fig. 1 The basic idea of the proposed hybrid algorithm

Chromosome population evaluation. Each individual \({\mathbf{X}}_{ch}\) of the parental and temporary populations is represented by a sequence of parameters \(\left\{ {{\mathbf{X}}_{ch}^{\text{par}} ,{\mathbf{X}}_{ch}^{\text{str}} } \right\}\), given by formulas (5) and (6). The first take real values, whereas the second take binary values from the set \(\left\{ {0,1} \right\}\). The system aims to minimize the value of the following fitness function:

$${\text{ff}}\left( {{\mathbf{X}}_{ch} } \right) = T^{*} \left\{ {{\text{ffaccuracy}}\left( {{\mathbf{X}}_{ch} } \right),{\text{ffinterpretability}}\left( {{\mathbf{X}}_{ch} } \right);w_{\text{ffaccuracy}} ,w_{\text{ffinterpretability}} } \right\},$$
(8)

where \(T^{*} \left\{ \cdot \right\}\) is the algebraic weighted t-norm (see e.g. [17]), \(w_{\text{ffaccuracy}} \in \left( {0,1} \right]\) is the weight of the component \({\text{ffaccuracy}}\left( {{\mathbf{X}}_{ch} } \right)\), and \(w_{\text{ffinterpretability}} \in \left( {0,1} \right]\) is the weight of the component \({\text{ffinterpretability}}\left( {{\mathbf{X}}_{ch} } \right)\). The component \({\text{ffaccuracy}}\left( {{\mathbf{X}}_{ch} } \right)\) determines the accuracy of the system (2) (in the form of a classification error). The component \({\text{ffinterpretability}}\left( {{\mathbf{X}}_{ch} } \right)\) determines the complexity-based (component \({\text{ffint}}_{A} \left( {{\mathbf{X}}_{ch} } \right)\)) and semantics-based (components \({\text{ffint}}_{B} \left( {{\mathbf{X}}_{ch} } \right) - {\text{ffint}}_{E} \left( {{\mathbf{X}}_{ch} } \right)\)) interpretability of the system (2) encoded in the tested individual:

$$\begin{aligned} & {\text{ffinterpretability}}\left( {{\mathbf{X}}_{ch} } \right) = \\ & T^{*} \left\{ {\begin{array}{*{20}c} {{\text{ffint}}_{A} \left( {{\mathbf{X}}_{ch} } \right),{\text{ffint}}_{B} \left( {{\mathbf{X}}_{ch} } \right),{\text{ffint}}_{C} \left( {{\mathbf{X}}_{ch} } \right),{\text{ffint}}_{D} \left( {{\mathbf{X}}_{ch} } \right),{\text{ffint}}_{E} \left( {{\mathbf{X}}_{ch} } \right);} \\ {w_{\text{ffintA}} ,w_{\text{ffintB}} ,w_{\text{ffintC}} ,w_{\text{ffintD}} ,w_{\text{ffintE}} } \\ \end{array} } \right\}, \\ \end{aligned}$$
(9)

where \(w_{\text{ffintA}} \in \left( {0,1} \right]\) denotes weight of the component \({\text{ffint}}_{A} \left( {{\mathbf{X}}_{ch} } \right)\), etc. The individual components of the formula (9) are defined in the next section.
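The aggregation in (8) and (9) is the same algebraic weighted t-norm applied to more than two arguments. A sketch with an n-ary version of that construction (the function name and the sample values are illustrative, not taken from the paper):

```python
# n-ary algebraic weighted t-norm: prod_i (1 - w_i * (1 - a_i)).
# A weight of 0 removes a component; a weight of 1 uses it fully.

def weighted_t_norm_n(args, weights):
    result = 1.0
    for a, w in zip(args, weights):
        result *= 1 - w * (1 - a)
    return result

# Aggregating an accuracy term and an interpretability term as in (8),
# with the interpretability weight halved:
ff = weighted_t_norm_n([0.1, 0.4], [1.0, 0.5])
assert abs(ff - 0.1 * 0.7) < 1e-9  # (1 - 0.9) * (1 - 0.5 * 0.6)
```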

3 Interpretability Criteria for the Neuro-Fuzzy System for Nonlinear Classification

In this section new interpretability criteria for the neuro-fuzzy system for nonlinear classification are described. Each criterion is a component of the interpretability part (9) of the fitness function. The criteria are defined as follows:

  1. (a)

    The component \({\text{ffint}}_{A} \left( {{\mathbf{X}}_{ch} } \right)\) determines the complexity of the system (2), i.e. the number of non-reduced (active) elements of the system (rules, input fuzzy sets, output fuzzy sets, inputs, and discretization points) in relation to the length of the parameters \({\mathbf{X}}_{ch}^{\text{str}}\) (minimizing it increases the complexity-based interpretability):

    $${\text{ffint}}_{A} \left( {{\mathbf{X}}_{ch} } \right) = \frac{{\left( \begin{aligned} & \sum\nolimits_{i = 1}^{n} {{\mathbf{X}}_{ch}^{\text{str}} \left\{ {x_{i} } \right\} \cdot \sum\nolimits_{k = 1}^{Nmax} {{\mathbf{X}}_{ch}^{\text{str}} \left\{ {{\text{rule}}_{k} } \right\} \cdot {\mathbf{X}}_{ch}^{\text{str}} \left\{ {A_{i}^{k} } \right\}} } \\ & \quad + \sum\nolimits_{j = 1}^{m} {\sum\nolimits_{k = 1}^{Nmax} {{\mathbf{X}}_{ch}^{\text{str}} \left\{ {{\text{rule}}_{k} } \right\} \cdot {\mathbf{X}}_{ch}^{\text{str}} \left\{ {B_{j}^{k} } \right\}} } + \sum\nolimits_{j = 1}^{m} {\sum\nolimits_{r = 1}^{Rmax} {{\mathbf{X}}_{ch}^{\text{str}} \left\{ {\bar{y}_{j,r}^{\text{def}} } \right\}} } \\ \end{aligned} \right)}}{{N_{ch} \cdot \left( {n_{ch} + m} \right) + m \cdot Rmax}},$$
    (10)

    where \({\mathbf{X}}_{ch}^{\text{str}} \left\{ {x_{i} } \right\}\) means a parameter of \({\mathbf{X}}_{ch}^{\text{str}}\) associated with the input \(x_{i}\), etc.

  2. (b)

    The component \({\text{ffint}}_{B} \left( {{\mathbf{X}}_{ch} } \right)\) reduces the overlapping of the input and output fuzzy sets of the system (2) encoded in the tested individual. This criterion aims at the situation where the crossover point between two neighbouring fuzzy sets has a membership value \(\mu \left( x \right)\) equal to \(c_{\text{ffint}}\) (set to 0.5), and it prevents situations where neighbouring fuzzy sets overlap each other:

    $${\text{ffint}}_{B} \left( {{\mathbf{X}}_{ch} } \right) = \frac{{\sum\nolimits_{i = 1}^{{n_{ch} }} {\sum\nolimits_{k = 1}^{{{\text{noifs}}\left( i \right) - 1}} {\left( {2\left| {c_{\text{ffintc}} - \hat{y}_{i,k}^{1} } \right| + \hat{y}_{i,k}^{2} } \right) + \sum\nolimits_{j = 1}^{{m_{ch} }} {\sum\nolimits_{k = 1}^{{{\text{noofs}}\left( j \right) - 1}} {\left( {2\left| {c_{\text{ffintc}} - \hat{y}_{j,k}^{1} } \right| + \hat{y}_{j,k}^{2} } \right)} } } } }}{{2\left( {\sum\nolimits_{i = 1}^{{n_{ch} }} ({{\text{noifs}}\left( i \right) - 1}) + \sum\nolimits_{j = 1}^{{m_{ch} }} {\left( {{\text{noofs}}\left( j \right) - 1} \right)} } \right)}},$$
    (11)

    where \({\text{noifs}}\left( i \right)\) stands for the number of active fuzzy sets of the i-th input, \({\text{noofs}}\left( j \right)\) stands for the number of active fuzzy sets of the j-th output, \(\hat{y}_{i,k}^{1} ,\hat{y}_{i,k}^{2}\) are the membership values \(\mu_{{A_{i}^{k} }} \left( x \right)\) at the crossover points between two neighbouring input fuzzy sets, and \(\hat{y}_{j,k}^{1} ,\hat{y}_{j,k}^{2}\) are the membership values \(\mu_{{B_{j}^{k} }} \left( x \right)\) at the crossover points between two neighbouring output fuzzy sets. These values can be calculated for the inputs (and analogously for the outputs) as:

    $$\hat{y}_{i,k}^{1,2} = \exp \left( { - 0.5\left( {\left( {{\mathbf{X}}_{ch}^{\text{supp}} \left\{ {\bar{x}_{i,k + 1}^{A} } \right\} - {\mathbf{X}}_{ch}^{\text{supp}} \left\{ {\bar{x}_{i,k}^{A} } \right\}} \right)/\left( {{\mathbf{X}}_{ch}^{\text{supp}} \left\{ {\sigma_{i,k}^{A} } \right\} \pm {\mathbf{X}}_{ch}^{\text{supp}} \left\{ {\sigma_{i,k + 1}^{A} } \right\}} \right)} \right)^{2} } \right),$$
    (12)

    where \({\mathbf{X}}_{ch}^{\text{supp}}\) stands for an additional set of system parameters [built temporarily on the basis of \({\mathbf{X}}_{ch}\)] with the list of non-reduced fuzzy sets sorted by the positions of their centres (for details see [5]).

  3. (c)

    The component \({\text{ffint}}_{C} \left( {{\mathbf{X}}_{ch} } \right)\) increases the integrity of the shapes of the input and output fuzzy sets associated with the inputs and outputs of the system (2) encoded in the tested individual. This criterion aims to achieve fuzzy sets of similar size for the same inputs and outputs:

    $${\text{ffint}}_{C} \left( {{\mathbf{X}}_{ch} } \right) = \frac{1}{{n_{ch} + m}}\left( \begin{aligned} \sum\nolimits_{i = 1}^{n} {{\mathbf{X}}_{ch}^{\text{str}} \left\{ {x_{i} } \right\} \cdot \sum\nolimits_{k1 = 1}^{N\hbox{max} } {{\mathbf{X}}_{ch}^{\text{str}} \left\{ {{\text{rule}}_{k1} } \right\} \cdot {\text{shx}}\left( {{\mathbf{X}}_{ch} ,i,k1} \right)} } \\ + \sum\nolimits_{j = 1}^{m} {\sum\nolimits_{k1 = 1}^{N\hbox{max} } {{\mathbf{X}}_{ch}^{\text{str}} \left\{ {{\text{rule}}_{k1} } \right\} \cdot {\text{shy}}\left( {{\mathbf{X}}_{ch} ,j,k1} \right)} } \\ \end{aligned} \right),$$
    (13)

    where \({\text{shx}}\left( {{\mathbf{X}}_{ch} ,i,k1} \right)\) (and analogously \({\text{shy}}\left( {{\mathbf{X}}_{ch} ,j,k1} \right)\)) is a function calculating the proportion between the widths of the fuzzy sets, defined as follows:

    $${\text{shx}}\left( {{\mathbf{X}}_{ch} ,i,k1} \right) = 1 - \frac{{\hbox{min} \left( {{\mathbf{X}}_{ch}^{\text{par}} \left\{ {\sigma_{i,k1}^{A} } \right\},\frac{1}{{N_{ch} }}\sum\nolimits_{k2 = 1}^{N\hbox{max} } {{\mathbf{X}}_{ch}^{\text{str}} \left\{ {{\text{rule}}_{k2} } \right\}{\mathbf{X}}_{ch}^{\text{par}} \left\{ {\sigma_{i,k2}^{A} } \right\}} } \right)}}{{\hbox{max} \left( {{\mathbf{X}}_{ch}^{\text{par}} \left\{ {\sigma_{i,k1}^{A} } \right\},\frac{1}{{N_{ch} }}\sum\nolimits_{k2 = 1}^{N\hbox{max} } {{\mathbf{X}}_{ch}^{\text{str}} \left\{ {{\text{rule}}_{k2} } \right\}{\mathbf{X}}_{ch}^{\text{par}} \left\{ {\sigma_{i,k2}^{A} } \right\}} } \right)}},$$
    (14)

    where \({\mathbf{X}}_{ch}^{\text{par}} \left\{ {\sigma_{i,k}^{A} } \right\}\) stands for the gene of the individual \({\mathbf{X}}_{ch}^{\text{par}}\) associated with the parameter \(\sigma_{i,k}^{A}\) (the width of the Gaussian function), and \({\mathbf{X}}_{ch}^{\text{par}} \left\{ {\sigma_{j,k}^{B} } \right\}\) is defined analogously for the parameter \(\sigma_{j,k}^{B}\).

  4. (d)

    The component \({\text{ffint}}_{D} \left( {{\mathbf{X}}_{ch} } \right)\) increases the complementarity (adjusting the positions of the input fuzzy sets to the data \(\bar{x}_{z,i}\), where \(Z\) denotes the number of data samples) of the system (2) encoded in the tested individual:

    $${\text{ffint}}_{D} \left( {{\mathbf{X}}_{ch} } \right) = \frac{1}{{Z \cdot n_{ch} }}\left( {\sum\limits_{z = 1}^{Z} {\sum\limits_{i = 1}^{n} {{\mathbf{X}}_{ch}^{\text{str}} } } \left\{ {x_{i} } \right\} \cdot \hbox{min} \left( {1,\left| {1 - \sum\limits_{k = 1}^{N\hbox{max} } {{\mathbf{X}}_{ch}^{\text{str}} \left\{ {{\text{rule}}_{k} } \right\} \cdot \mu_{{A_{i}^{k} }} \left( {\bar{x}_{z,i} } \right)} } \right|} \right)} \right).$$
    (15)
  5. (e)

    The component \({\text{ffint}}_{E} \left( {{\mathbf{X}}_{ch} } \right)\) increases the readability of the antecedents and weights of the rules of the system (2) encoded in the tested individual (it aims to reach the specified weight values 0, 0.5 and 1):

    $$\begin{aligned} {\text{ffint}}_{E} \left( {{\mathbf{X}}_{ch} } \right) & = 1 - \frac{1}{{2N_{ch} }}\left( {\frac{1}{{n_{ch} }}\sum\limits_{k = 1}^{Nmax} {{\mathbf{X}}_{ch}^{\text{str}} \left\{ {{\text{rule}}_{k} } \right\}\sum\limits_{i = 1}^{n} {{\mathbf{X}}_{ch}^{\text{str}} \left\{ {x_{i} } \right\}\cdot\mu_{w} \left( {w_{i,k}^{A} } \right)} } } \right. \\ & \quad \quad \quad \quad \quad \quad \left. { + \sum\limits_{k = 1}^{Nmax} {{\mathbf{X}}_{ch}^{\text{str}} \left\{ {{\text{rule}}_{k} } \right\}\cdot\mu_{w} \left( {w_{k}^{\text{rule}} } \right)} } \right), \\ \end{aligned}$$
    (16)

    where \(\mu_{w} \left( {w_{i,k}^{A} } \right)\) is a function promoting concentration of the weights around the values 0, 0.5 and 1 (in the simulations we assumed \(a = 0.25,b = 0.50\) and \(c = 0.75\)). This function is described as follows:

    $$\mu_{w} \left( x \right) = \left\{ {\begin{array}{*{20}l} {\begin{array}{*{20}c} {\left( {a - x} \right)a^{ - 1} } & {\text{for}} & {x \ge 0} & {\text{and}} & {x \le a} \\ \end{array} } \hfill \\ {\begin{array}{*{20}c} {\left( {x - a} \right)\left( {b - a} \right)^{ - 1} } & {\text{for}} & {x \ge a} & {\text{and}} & {x \le b} \\ \end{array} } \hfill \\ {\begin{array}{*{20}c} {\left( {c - x} \right)\left( {c - b} \right)^{ - 1} } & {\text{for}} & {x \ge b} & {\text{and}} & {x \le c} \\ \end{array} } \hfill \\ {\begin{array}{*{20}c} {\left( {x - c} \right)\left( {1 - c} \right)^{ - 1} } & {\text{for}} & {x \ge c} & {\text{and}} & {x \le 1} \\ \end{array} } \hfill \\ \end{array} } \right..$$
    (17)
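To make the semantic criteria concrete, the crossover-point height used in (11)-(12), the width-proportion function (14), and the weight-shaping function (17) can be sketched in Python. This is a minimal transcription under our reading of those formulas: the function names are ours, `crossover_membership` follows the standard closed form for the crossing height of two Gaussians (the "+" branch of the \(\pm\) sign in (12)), and `shx` takes the widths directly instead of the encoded individual:

```python
import math

def crossover_membership(c1, s1, c2, s2):
    """Height of the crossing point of two Gaussians with centres c1, c2
    and widths s1, s2; criterion (11) targets the value c_ffint = 0.5."""
    return math.exp(-0.5 * ((c2 - c1) / (s1 + s2)) ** 2)

def shx(sigma_k, sigmas_active):
    """Width-proportion measure in the spirit of (14): 0 when sigma_k
    equals the mean width of the active sets on the same input,
    approaching 1 for strong outliers."""
    mean = sum(sigmas_active) / len(sigmas_active)
    return 1 - min(sigma_k, mean) / max(sigma_k, mean)

def mu_w(x, a=0.25, b=0.5, c=0.75):
    """Piecewise-linear function (17): equals 1 at x = 0, b and 1, and 0
    at x = a and x = c, so (16) rewards weights near 0, 0.5 and 1."""
    if x <= a:
        return (a - x) / a
    if x <= b:
        return (x - a) / (b - a)
    if x <= c:
        return (c - x) / (c - b)
    return (x - c) / (1 - c)

# Two equally wide Gaussians centred 2*sigma apart cross at height exp(-0.5):
assert abs(crossover_membership(0.0, 1.0, 2.0, 1.0) - math.exp(-0.5)) < 1e-12
assert shx(1.0, [1.0, 1.0, 1.0]) == 0.0        # identical widths
assert mu_w(0.0) == 1.0 and mu_w(0.25) == 0.0  # peak and valley of (17)
```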

4 Simulation Results

In our simulations we considered five typical problems from the field of nonlinear classification [15]: (a) the wine recognition problem, (b) the glass identification problem, (c) the Pima Indians diabetes problem, (d) the iris classification problem, and (e) the Wisconsin breast cancer problem. For each problem 10-fold cross-validation was used, and the process was repeated 10 times. Moreover, for each simulation problem seven variants of learning were applied. Each variant had a different set of weights of the fitness function (8) (see Table 1). The weights of the remaining criteria were set as follows: \(w_{\text{ffintA}} = 0.50,w_{\text{ffintB}} = 1.00,w_{\text{ffintC}} = 1.00,w_{\text{ffintD}} = 0.20,w_{\text{ffintE}} = 0.50\). The parameters of the ICA part were set as follows: number of colonies \(N_{pop} = 100\), number of empires N = 10, number of iterations 1000, and revolution rate 0.3. The mutation probability of the genetic operator was set to 0.2.

Table 1 Values of the weights of the components \({\text{ffaccuracy}}\left( {{\mathbf{X}}_{ch} } \right)\) and \({\text{ffinterpretability}}\left( {{\mathbf{X}}_{ch} } \right)\) [see formula (8)] for the variants considered in the simulations: case I–case VII

The conclusions from the simulations can be summarized as follows: (a) Using a low value of the weights (such as 0.2) for the components of the function (9) reduced the readability of the relationship between the values of the interpretability criteria and the accuracy of the system (see Fig. 3, row 4). (b) Using the extreme weight cases (case I and case VII) often has no effect on the improvement of the system (see Table 2) and can cause deterioration of the solutions (in comparison to the other cases). Solutions found for these cases may appear below the estimated Pareto front (see Fig. 3). (c) Using the proposed interpretability criteria allows semantically clear rules of the system (2) to be achieved (see Fig. 2). (d) Considering seven cases of weights allowed the estimated Pareto fronts to be determined, which makes it possible for the user to select the interpretability-accuracy trade-off (compromise) (see Fig. 3). (e) The number of reduced inputs and rules depends on the simulation problem (see Fig. 3, rows 6 and 7). For example, for classification problem (c) the system can reduce up to 3 inputs (see Fig. 2) without a significant loss in the accuracy of the system. (f) The achieved results are comparable (in terms of accuracy) with the results achieved by other authors using different methods (see Table 2). It should be emphasized that the purpose of the paper was not to achieve the best possible accuracy in comparison with the accuracy obtained by other methods. The purpose of the paper was to increase the legibility of the knowledge represented in the form of fuzzy rules while keeping acceptable accuracy of the system. It seems that this objective has been achieved.

Table 2 The accuracy (%) of the neuro-fuzzy classifier (2) for the learning phase, the testing phase and their average for the simulation variants case I–case VII
Fig. 2 Example input and output fuzzy sets of the neuro-fuzzy system (2) for the Pima Indians diabetes problem for three settings of the function (8): a case II, b case IV, c case VI. The positions of the discretization points are marked as black circles and the weights of the fuzzy sets are marked by rectangles. The degree of coverage of a rectangle corresponds to the value of the weight (a fully covered rectangle stands for weight 1, a non-covered rectangle stands for weight 0)

Fig. 3 Dependence between the accuracy (%) of the neuro-fuzzy classifier (2) (average over the learning and testing phases) and the values of the interpretability components \({\text{ffint}}_{A} \left( {{\mathbf{X}}_{ch} } \right) - {\text{ffint}}_{E} \left( {{\mathbf{X}}_{ch} } \right)\) for the considered simulation variants case I–case VII for the following simulation problems: a the wine recognition problem, b the glass identification problem, c the Pima Indians diabetes problem, d the iris classification problem, e the Wisconsin breast cancer problem

5 Conclusions

In this paper a new approach to nonlinear classification was proposed. It is based on the capabilities of a neuro-fuzzy system and a new hybrid genetic-imperialist algorithm. The purpose of this algorithm was to select both the structure and the parameters of the estimated classifier with different interpretability criteria taken into consideration. These criteria focus not only on the complexity of the system, but also on its semantics. Simulations performed for typical classification problems confirmed the correctness of the proposed approach.