1 Introduction

Fuzzy inference systems can model continuous input/output relationships by means of fuzzy IF–THEN rules without requiring precise quantitative analysis. This fuzzy modelling has been used successfully in many applications, e.g., to build function approximators (Pomares et al. 2000; Chen and Saif 2005), fuzzy controllers (Boukezzoula et al. 2006), fuzzy classifiers (Pal and Mitra 1992), and decision-making systems. Researchers have developed many practical systems, such as prediction and inference (Kandel 1992), non-linear control (Sugeno 1985; Pedrycz 1989; Berenji and Khedkar 1992; Lin 1994; Jouffe 1998), signal processing and communication systems (Nie and Linkens 1994), and type-2 fuzzy system models (Uncu and Türkşen 2007), to mention just a few.

The basic components of a fuzzy inference system are fuzzy rules. It is desirable that the rule base covers all the states of the system that are important to data understanding (Duch et al. 2001; Duch et al. 2004) and the intended application (Setnes 2000). In general, fuzzy rules can be obtained either from human experts or by a data-driven extraction scheme applied to measured input–output data pairs. The latter case is currently a growing research topic, since the people or decision makers in a data-mining project or industrial application are usually not trained statisticians, mathematicians, or AI experts. Moreover, many commercial areas hold huge collections of unstructured data. It is therefore important to learn knowledge and derive fuzzy rules from the data itself (Cherkassky and Mulier 1998). Various methods have been proposed, e.g., fuzzy clustering in the product space (Setnes 2000) or in an augmented data set (Pal et al. 2002), rule generation using a data condensation algorithm (Mitra et al. 2002), genetic algorithms (Lee and Takagi 1993; Ishibuchi et al. 1995), entropy (Yager and Filev 1993), orthogonal transformation methods (Wang and Mendel 1992; Yen and Wang 1999), and a group of neuro-fuzzy approaches (Berenji and Khedkar 1992; Brown and Harris 1994; Gupta and Rao 1994; Jang 1993; Fritzke 1997).

Over the last few years, research on combining neural networks and fuzzy systems, i.e., neuro-fuzzy systems, has gained considerable importance and attention (Mitra and Hayashi 2000; Jang and Sun 1995). The learning capabilities of neural networks give fuzzy systems the ability to tune the parameters and shapes of fuzzy membership functions (MFs). There are several paradigms for combining neural networks and fuzzy systems: concurrent neural/fuzzy models, cooperative neuro-fuzzy models, and hybrid neuro-fuzzy models (see Kruse and Nauck 1995 for details). Among these approaches, hybrid neuro-fuzzy models are the most popular form of modern neuro-fuzzy systems. In general, a hybrid neuro-fuzzy system can be viewed as a special n-layer feed-forward neural network (Nauck 1997) with sampled fuzzy memberships as the input/output (Keller and Tahani 1992; Keller et al. 1992), with parameterized MFs stored in the neurons (Jang 1993; Berenji and Khedkar 1992), or with fuzzy sets as the link weights (Nauck 1997). In fact, few neuro-fuzzy approaches actually employ true neural networks, although they are very often depicted as some kind of neural network structure that exhibits learning capability (Nauck 1997). With this learning capability, we are able to determine the fuzzy sets or fuzzy rules through an iterative process. Learning in a neuro-fuzzy system normally involves two phases: structural learning and parametric learning (Jouffe 1998; Rojas et al. 2000). Structural learning adjusts the number of rules, and is therefore often referred to as rule learning, whereas parametric learning tunes the positions of the fuzzy sets. Some neuro-fuzzy systems in the literature (e.g., Berenji and Khedkar 1992; Jang 1993) define no rule learning procedure; once the neural network architecture has been determined, the number of rules is assumed fixed during the learning process. Such a process often fails to yield good (i.e., small and interpretable) rule bases. In this case, we may consider pruning techniques to reduce the number of rules and variables in the neuro-fuzzy system (Nauck 1997; Zimmermann et al. 1996). Pruning usually starts from a large number of rules; during the learning process some of the rules are removed (rule deduction), leaving a set of rules that are most relevant to the problem at hand. However, it is difficult in most cases to choose a reasonably large initial number of rules, because we simply do not have enough knowledge about the data, and a poor choice can lead to poor system performance in real-world applications. In recent years, incremental learning (rule induction) has emerged as an alternative. The adaptive nature of incremental learning requires it to be implemented in an on-line fashion (Fritzke 1997). A serious problem with such learning algorithms is that they lack a criterion to terminate the growth of the neural network structure: termination is judged by the user's satisfaction with the learning performance and a pre-defined maximum network size. Furthermore, the shape adaptation of an MF is based on a simple calculation between the two closest rule patches in the input space, e.g., half of their distance. This is not suitable for rule patches of different sizes, because the memberships of smaller rule patches may then overlap significantly with parts of larger rule patches, leading to misclassification in data analysis.

In this paper, we introduce a new incremental, hybrid neuro-fuzzy system based on the supervised self-spawning competitive learning (SSSCL) paradigm, the self-spawning neuro-fuzzy system (SSNFS). It is a scatter-partitioning fuzzy inference system that allows the IF-parts of the fuzzy rules to be positioned at arbitrary locations in the input space. A major problem with scatter partitions is to find a suitable number of rules, and suitable positions and widths of the rule patches in the input space (Fritzke 1997). Our neuro-fuzzy model focuses on this problem and is able to incrementally and adaptively build up a network structure during the training process. It requires only a single rule prototype randomly initialized in the input space; during training, the rule base expands adaptively according to a split validity measure to discover more rule patches. The shape width of each rule patch is indicated by an on-line property vector. The rule induction process terminates when a stop criterion is satisfied. The output weights are trained in an on-line manner, which is computationally less complex than repeatedly solving the complete linear system with matrix manipulation as in batch learning. To extract rules from data, we assume that a limited number of input–output pairs are provided on-line for single-pass processing.

The rest of this paper is organized as follows. Section 2 gives a brief analysis of scatter-partitioning fuzzy inference systems. Section 3 describes the detailed neuro-fuzzy modelling and the self-spawning learning algorithm. Section 4 presents the experimental studies and simulations, and Sect. 5 presents discussions and conclusions.

2 Scatter-partitioning fuzzy systems

2.1 Takagi–Sugeno fuzzy model

The problem we are going to solve is to design a partitioning fuzzy inference system from input–output data for classification analysis. The basic component of a fuzzy inference system is the fuzzy rule which is expressed using linguistic labels, for instance,

$$ \begin{aligned}&\hbox{IF } (x_{1} \hbox{ is low}) \hbox{ AND } (x_{2} \hbox{ is medium})\\&\quad \hbox{ THEN } (y_{1} \hbox{ is } 0.2) \hbox{ AND } (y_{2} \hbox{ is } 0.4),\end{aligned} $$
(1)

where \(x_{1},\;x_{2} \in {{\mathcal{R}}}\) are the input variables (antecedent) and \(y_{1},\;y_{2} \in {{\mathcal{R}}}\) are the outputs (consequent) of this rule. The linguistic labels (e.g., low) are usually modelled as parameterized MFs within a particular area of the input space. \({\tt AND}\) is the T-norm fuzzy operator and in some cases \({\tt OR}\) (T-conorm) is used in fuzzy rules. In the example shown above, the fuzzy sets are involved only in the premise part \(({\tt IF}\hbox{-part}).\) Fuzzy systems consisting of this kind of rules are referred to as Takagi–Sugeno fuzzy systems. Moreover, since the \({\tt THEN}\hbox{-parts}\) consist of only crisp values, e.g., 0.2 and 0.4, this model is called a zero-order Sugeno fuzzy model. In an nth-order Sugeno fuzzy system the \({\tt THEN}\hbox{-part}\) of each rule consists of a polynomial of degree n in the input variables (Fritzke 1997).

In this paper we will concentrate on zero-order Sugeno fuzzy systems since they have the interesting property of being equivalent to radial basis function networks (RBFNs) and have been used for data analysis in many applications. Different shapes of the MFs have been proposed such as the Gaussian, triangular, or trapezoidal. In the following discussions we assume that the MFs have the form of the Gaussian function and that the fuzzy system consists of m fuzzy rules, each with n input variables, and is designed to classify data into k fuzzy or crisp classes. Thus, the fuzzy system can be described as \(\{R_i\}_{i \in \{1,\ldots,m\}}\). For the ith rule \(R_{i}\),

$$\begin{aligned}{} &{{\mathcal{R}}}_{i}: \hbox{IF } (x_{1}\hbox{ is } g_{i_{1}}(x_{1})) \hbox{ AND } (x_{2}\hbox{ is } g_{i_{2}}(x_{2}))\\ & \qquad \hbox{ AND } \cdots \hbox{ AND } (x_{n}\hbox{ is } g_{i_{n}}(x_{n})) \\ & \quad \hbox{THEN } (y_{1}\hbox{ is }t_{i_{1}})\hbox{ AND } (y_{2} \hbox{ is } t_{i_{2}})\\ & \qquad \hbox{ AND } \cdots \hbox{ AND } (y_{k}\hbox{ is } t_{i_{k}}), \end{aligned} $$
(2)

where \(g_{ij}(x_{j})\) is the MF denoting the linguistic label associated with the jth input variable in the ith fuzzy rule,

$$ g_{ij}(x_{j})={\rm exp}\left[-\frac{(x_{j}-p_{ij})^{2}} {2\sigma_{ij}^{2}}\right], $$
(3)

where \(p_{ij}\) and \(\sigma_{ij}\) are the center and width of the Gaussian MF \(g_{ij}\), respectively. \({\bf T}_{i}=[t_{i_{1}},\ldots,t_{i_{k}}]^{{\rm T}}\) is the output vector for the ith fuzzy rule, with \(t_{ij} \in \{0, 1\}\) for partitioning crisp clusters or function approximation and \(t_{ij} \in [0, 1]\) for partitioning fuzzy clusters.

Usually only a moderate number of MFs is defined for each input variable. It is always desirable that the membership values for each input variable add to unity everywhere. This may be achieved by dividing the membership values g ij (x j ) by the sum of all memberships with respect to x j , leading to normalized MFs,

$$ {\widehat{g}}_{ij}(x_{j})=g_{ij}(x_{j})\Bigg/\sum\nolimits_{l=1}^{m}g_{lj}(x_{j}). $$
(4)

We focus our attention on rules that combine their sub-expressions by fuzzy \({\tt AND}.\) In the case of Gaussian MFs, the fuzzy \({\tt AND}\) can be realized by the arithmetic product. Let G i denote the membership value for the IF-part of R i in Eq. 2,

$$ G_{i}=\prod_{j=1}^{n}g_{ij}(x_{j}). $$
(5)

In this case the \({\tt IF}\hbox{-part}\) of each fuzzy rule can be described by an n-dimensional Gaussian MF,

$$ G_{i}({\mathbf{X}})={\rm exp}\left(-\frac{1} {2}({\mathbf{X}}-{\mathbf{P}}_{i})^{{\rm T}}\Uplambda^{-1}({\mathbf{X}}- {\mathbf{P}}_{i})\right), $$
(6)

where X is the multivariate input vector \([x_{1}, x_{2}, \ldots, x_{n}]^{{\rm T}},\) \({\bf P}_{i}=[p_{i_{1}}, p_{i_{2}}, \ldots, p_{i_{n}}]^{{\rm T}}\) is the n-dimensional Gaussian center that has the corresponding centers of the one-dimensional factor Gaussians as components. \(\Uplambda^{-1}={\rm diag}(1/\sigma_{i_{1}}^{2},1/\sigma_{i_{2}}^{2},\ldots, 1/\sigma_{i_{n}}^{2})\) is the inverse of the covariance matrix Λ corresponding to the product of n one-dimensional Gaussian memberships. Similarly, we can normalize the multivariate Gaussian MFs as follows,

$$ {\widehat{G}}_{i}({\mathbf{X}})=G_{i}({\mathbf{X}})\Bigg/\sum\nolimits_{l= 1}^{m}G_{l}({\mathbf{X}}). $$
(7)

For the input pattern X, the output of the ith component of a fuzzy system with normalization is given by

$$ O^{(i)}({\mathbf{X}})=\sum_{j=1}^{m}t_{ji}{\widehat{G}}_{j}({\mathbf{X}})= \sum\nolimits_{j=1}^{m}t_{ji}G_{j}({\mathbf{X}})\Bigg/\sum\nolimits_{j=1}^{m}G_{j}({\mathbf{X}}). $$
(8)
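
To make Eqs. 3–8 concrete, the following minimal sketch (in Python/NumPy; the function and variable names and the toy parameter values are ours, not part of the original formulation) evaluates a zero-order Sugeno system with m multivariate Gaussian rules on a single input pattern:

```python
import numpy as np

def rule_memberships(X, P, Sigma):
    """Eqs. 5/6: product of 1-D Gaussians = multivariate Gaussian with
    diagonal covariance. P and Sigma are (m, n) arrays of centres/widths."""
    z = (X - P) / Sigma                               # broadcast over the m rules
    return np.exp(-0.5 * np.sum(z ** 2, axis=1))      # shape (m,)

def sugeno_output(X, P, Sigma, T):
    """Eqs. 7/8: normalised memberships weighted by the consequents T (m, k)."""
    G = rule_memberships(X, P, Sigma)
    G_hat = G / np.sum(G)                             # Eq. 7
    return G_hat @ T                                  # Eq. 8, shape (k,)

# toy example: two rules in a 2-D input space, three output classes
P = np.array([[0.2, 0.3], [0.7, 0.8]])                # rule centres p_ij
Sigma = np.array([[0.1, 0.1], [0.2, 0.2]])            # rule widths sigma_ij
T = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])      # zero-order consequents
print(sugeno_output(np.array([0.25, 0.35]), P, Sigma, T))
```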

2.2 Scatter-partitioning fuzzy systems

Based on Eq. 6, we may simplify the fuzzy rule in Eq. 2 as

$$ {{\mathcal{R}}}_{i}: \hbox{ if }{\mathbf{X}}\hbox{ is }G_{i}({\mathbf{X}})\hbox{ then } {\mathbf{Y}}\hbox{ is } {\mathbf{T}}_{i}, $$
(9)

where \({\bf Y}=\{y_{1}, y_{2}, \ldots, y_{k}\}\) is the multivariate output variable and T i is the consequent output vector for ith rule.

Since each Gaussian MF \(G_{i},\; i \in \{1,\ldots,m\}\) covers a particular area in the input space, each rule is significantly activated only in this area. One can think of the input space as being covered by small patches, each of them corresponding to the IF-part of one fuzzy rule (Fritzke 1997). We refer to these small patches as rule patches in this paper. Building a meaningful fuzzy inference system is, to a large extent, a matter of establishing a set of fuzzy rules that cover all the rule patches in the input space. At the same time, we must keep the number of rules as low as possible in order to maintain the generalizing ability of the model and to ensure a compact and transparent model (Setnes 2000). Therefore, our task reduces to seeking an optimal number of rule patches, locating their positions, and computing their shape widths in the input space. This is known as scatter-partitioning, as distinguished from grid-partitioning approaches in which the rule patches are assumed to lie on a regular, predefined grid. In a scatter-partitioning fuzzy system the fuzzy memberships are not confined to the corners of a rectangular grid. Rather, they can be located anywhere, with any width, in the input space as long as the P and Λ in Eq. 6 can be detected. If the covariance matrices Λ i of the Gaussians are diagonal, then they can still be thought of as being generated as a product of n one-dimensional Gaussian MFs. Figure 1 shows an example of grid-partitioning and scatter-partitioning fuzzy systems. Once the IF-parts are determined, the output vector T for each rule may be obtained by minimizing the summed squared error,

$$ E=\frac{1} {2}\sum_{i=1}^{M}\sum_{j=1}^{m}\left[{\mathbf{Y}}_{i}- {\widehat{G}}_{j}({\mathbf{X}}_{i}) {\mathbf{T}}_{j}\right]^2, $$
(10)

where (X i , Y i ) is the ith training pattern, and M is the total number of patterns presented for training. This approach is usually computationally complex. In fact, the output vectors can be learned on-line, concurrently in the output space, to reduce the complexity, e.g., by training with the delta rule (Fritzke 1997; Nauck and Kruse 1995).
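
As a rough sketch of this on-line alternative (a standard delta-rule/LMS step for the linear-in-consequents model of Eq. 8; the learning rate eta and all names are our own choices, not taken from the paper), each presentation of a training pair (X, Y) nudges the consequent matrix T:

```python
import numpy as np

def delta_rule_step(X, Y, P, Sigma, T, eta=0.05):
    """One on-line delta-rule update of the consequents T (m, k):
    move the normalised-RBF output of Eq. 8 towards the target Y."""
    z = (X - P) / Sigma
    G = np.exp(-0.5 * np.sum(z ** 2, axis=1))     # Eqs. 5/6
    G_hat = G / np.sum(G)                         # Eq. 7
    error = Y - G_hat @ T                         # residual of Eq. 8
    T += eta * np.outer(G_hat, error)             # gradient step on one term of Eq. 10
    return T
```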

Fig. 1

Classification example with a grid-partitioning fuzzy system and a scatter-partitioning fuzzy system. a Two-dimensional dataset in the unit square consisting of two classes (white and black). b The grid-partitioning system initializes a 3 × 3 grid to arrange nine fuzzy rules. c The non-normalized Gaussian view in the input space for this grid-partitioning system. d The classification result by the grid-partitioning system. e The scatter-partitioning system captured six rule patches in the input space. f The classification result by the scatter-partitioning system

Now, the problem is how to correctly identify the rule clusters in the input space. Conventional learning algorithms using neural networks or fuzzy clustering are very sensitive to the number of prototypes initialized. They perform reasonably well when the number of prototypes has been chosen appropriately; the best case is when the number of prototypes equals that of the natural clusters and the initial prototypes are properly positioned. We, however, do not usually have adequate prior knowledge about the data set. Therefore it is desirable to develop an algorithm that can adaptively detect the number of rule patches and their locations. In this paper we consider an incremental learning paradigm that first constructs a fuzzy system by seeking one fuzzy rule and then conducts rule induction in an iterative learning process according to a self-spawning validity measure. The rule growth terminates when all the rule patches have been detected, based on a stop criterion. The algorithm is implemented as an on-line competitive learning process. Therefore, we may consider the scatter-partitioning fuzzy system as a neuro-fuzzy system.

3 The SSNFS neuro-fuzzy modeling

In general, neuro-fuzzy modelling includes two phases: building a neural network to interpret the fuzzy rules, and employing a learning algorithm to update fuzzy perceptrons or the structure itself. We consider neuro-fuzzy modelling as a technique to derive a fuzzy system from data, or to enhance it by learning from examples.

Many neuro-fuzzy systems use parameterized MFs that are stored within the “neurons” of a multilayered feed-forward architecture, e.g., GARIC (Berenji and Khedkar 1992) and ANFIS (Jang 1993). The links in these kinds of networks indicate only the data flow directions between nodes, and no weights are associated with the links. The membership functions are usually parameterized Gaussian, trapezoidal, or triangular functions, or they are constructed by superposing sigmoids. NEFCLASS (Nauck 1997; Nauck and Kruse 1995) builds a different structure which uses sampled memberships (fuzzy sets) as link weights. For a quick review, Fig. 2a shows a typical ANFIS architecture and Fig. 2b shows an example of the NEFCLASS structure. It can be seen that, in both Fig. 2a and b, each input variable is associated with a specific MF in its input area. This, however, complicates structural learning. For instance, in Fig. 2a, a new linguistic label (membership function) will cause the network to learn the connection relationships both from the input to layer 1 and from layer 1 to layer 2.

Fig. 2

An example of the ANFIS and NEFCLASS architectures. a The ANFIS structure with two antecedent variables, four fuzzy rules and one consequent output; the membership functions (MFs) are stored in the neurons and the link connections only indicate the data flow. b A NEFCLASS system with two inputs, five rules and two output classes; the MFs are modeled as the link weights. In both cases, each one-dimensional antecedent variable is associated with one MF in its input area

It must be clear that the above-mentioned neuro-fuzzy models (and in fact almost all neuro-fuzzy models in the literature) have little to do with fuzzy logic in the narrow sense (Kruse and Gebhardt 1994); rather, they deal with the parameterized MFs in the fuzzy IF-THEN rules, which are associated with linguistic labels.

3.1 SSNFS architecture

As discussed in Sect. 2, each fuzzy rule is associated with one parameterized Gaussian membership with multivariate inputs and one output vector, as seen in Eq. 9. The Gaussian MFs can be identified by detecting rule patches in the input space, whereas the output vectors can be updated on-line in the output space. In this perspective, we propose SSNFS based on a generic incremental perceptron and a new learning algorithm, self-spawning competitive learning, to incrementally search the rule patches.

Figure 3 shows an example of the SSNFS architecture and its fuzzy reasoning. Similar to ANFIS, the MFs are stored in the hidden nodes to model continuous input relationships. The difference is that the new model uses a multivariate membership function for each rule, so the node holding this MF is fully connected to all the input variables. Also, in SSNFS, the parameters of each MF are modelled as the link weights between the input nodes and the membership nodes, and the consequent vector of each rule is modelled as the weights between the input nodes and the rule nodes. The benefit is that the weight vectors can be adaptively updated for each input training pattern. Finally, unlike network models with pre-specified structures, the SSNFS is not a static model but is dynamically adjusted according to the distribution of patterns. This means that the number of fuzzy rules changes with the number of nodes in layer 4. This dynamic learning scheme, called structure learning, gives SSNFS better adaptivity than models with a fixed structure.

Fig. 3

An example of the SSNFS architecture. In this case, the MFs are generalized as multivariate parameterized functions stored in the hidden neurons. Both their parameters and the consequent output vectors are modelled as the link weights

To build a neuro-fuzzy system for extracting fuzzy linguistic rules, we present a generic five-layer incremental perceptron that will provide a basis for SSNFS architecture and self-spawning learning algorithm.

Definition 1

A generic five-layer incremental perceptron is a five-layer feed-forward neural network (U, W, P, T, NET, A, O, ex, Inc) with the following specifications:

  1. 1.

    \(U=\bigcup_{i\;\in\;I}U_{i}\) is a non-empty set of neurons and \(I=\{1,\ldots,5\}\) is the index set of U. For all \(i, j \in I\), \(U_i \ne \emptyset\) and \(U_i \cap U_j = \emptyset\) for \(i \ne j\). \(U_{1}=U_{1}^{(\user2{{\mathcal{X}}})} \bigcup U_{1}^{({\bf X})} \bigcup U_{1}^{({\bf Y})}\) is the input layer, U 2, U 3 and U 4 are the hidden layers, and U 5 is the output layer.

  2. 2.

    Let M = {2m 1 + m 5, m 2, m 3, m 4, m 5} define the number of neurons for layer 1 to layer 5, respectively; m 1 is the number of neurons for either \(U_{1}^{(\user2{{\mathcal{X}}})}\) or \(U_{1}^{({\bf X})}\); m 2 = m 3 = m 4 is the number of neurons for each hidden layer; m 5 is the number of neurons for \(U_{1}^{({\bf Y})}\) or U 5.

  3. 3.

    W defines the link connectedness in the perceptron as follows:

    1. (a)

      for \(u_{i} \in U_{1}^{(\user2{{\mathcal{X}}})}\bigcup U_{1}^{({\bf X})}\) and \(v_{j} \in U_{2}\), W(u i , v j ) = 1;

    2. (b)

      for \(u_{i} \in U_{1}^{({\bf Y})}\) and \(v_{j} \in U_{4}\), W(u i , v j ) = 1;

    3. (c)

      for \(u_{i} \in U_{2}\) and \(v_{j} \in U_{3}\), W(u i , v j ) = 1;

    4. (d)

      for \(u_{i} \in U_{3}\) and \(v_{j} \in U_{4}\) with i = j, W(u i , v j ) = 1;

    5. (e)

      for \(u_{i} \in U_{4}\) and \(v_{j} \in U_{5}\), W(u i , v j ) = 1;

    otherwise, W(u i , v j ) = 0.

  4. 4.

    \(P=\{{\bf P}_{1}, \ldots, {\bf P}_{m_{2}}\}\) defines the weight vectors (prototypes) on the connections \(W(U_{1}^{({\bf X})}, U_{2})\). The ith prototype can be given by \( {\bf P}_{i}=[p_{i_{1}},\ldots, p_{i_{m_{1}}}]^{{\rm T}}\) where p ij is the weight on the connection W(u j , v i ) with \(u \in U_{1}^{({\bf X})}\) and \(v \in U_{2}\). \(T=\{{\bf T}_{1}, \ldots, {\bf T}_{m_{2}}\}\) defines the weight vectors on the connections \(W(U_{1}^{({\bf Y})}, U_{4})\). \( {\bf T}_{i}=[t_{i_{1}},\ldots, t_{i_{m_{5}}}]^{{\rm T}}\) where t ij is the weight on the connection W(u j , v i ) with \(u \in U_{1}^{({\bf Y})}\) and \(v \in U_{4}\).

  5. 5.

    NET defines the propagation function for each unit \(u \in U\) to calculate the net input \({\rm net}_{u}\).

    1. (a)

      for \(u \in U_{1}\):

      • \({\rm NET}_{u_{i}}: {{\mathcal{R}}} \;\mapsto\; {{\mathcal{R}}}, \; {\rm net}_{u_{i}}=ex_{u_{i}};\)

    2. (b)

      for \(u \in U_{2}\):

      • \({\rm NET}_{u_{i}}: {{\mathcal{R}}}^{m_{1}} \;\mapsto\; {{\mathcal{R}}}^{m_{1}},\)

      • \(\begin{aligned}{}&{\rm net}_{u_{i}}= O(U_{1}^{(\user2{{\mathcal{X}}})})- [o_{v_{1}}p_{i_{1}}, o_{v_{2}}p_{i_{2}}, \ldots, o_{v_{m_{1}}}p_{i_{m_{1}}}]^{{\rm T}},\\&\quad v \in U_{1}^{({\bf X})};\end{aligned}\)

    3. (c)

      for \(u \in U_{3}\):

      • \({\rm NET}_{u_{i}}: {{\mathcal{R}}}^{m_{2}} \;\mapsto\; {{\mathcal{R}}}, \)

      • \({\rm net}_{u_{i}}=\frac{o_{v_{i}}}{\sum_{j=1}^{m_{2}}o_{v_{j}}},\;\;v \in U_{2};\)

    4. (d)

      for \(u \in U_{4}\):

      • \({\rm NET}_{u_{i}}: {{\mathcal{R}}} \times {{\mathcal{R}}}^{m_{5}} \;\mapsto\; {{\mathcal{R}}}^{m_{5}}, \)

      • \({\rm net}_{u_{i}}=[o_{v_{i}}t_{i_{1}}, o_{v_{i}}t_{i_{2}}, \ldots, o_{v_{i}}t_{i_{m_{5}}}]^{{\rm T}},\;\; v \in U_{3};\)

    5. (e)

      for \(u \in U_{5}\):

      • \({\rm NET}_{u_{i}}: {{\mathcal{R}}}^{m_{4}} \;\mapsto\; {{\mathcal{R}}},\)

      • \({\rm net}_{u_{i}}=\sum_{j=1}^{m_{4}}o_{v_{j}}, \;\; v \in U_{4}. \)

  6. 6.

    A defines the activation function for each \(u \in U\) to calculate the activation \(a_{u}\).

    1. (a)

      for \(u \in U_{1} \bigcup U_{3} \bigcup U_{4} \bigcup U_{5}\!:\)

      • \(A_{u_{i}}: a_{u_{i}}=A_{u_{i}}({\rm net}_{u_{i}})={\rm net}_{u_{i}}; \)

    2. (b)

      for \(u \in U_{2}\):

      • \(A_{u}: {{\mathcal{R}}}^{m_{1}} \;\mapsto\; {{\mathcal{R}}}, \quad a_{u}=A_{u}({\rm net}_{u}),\)

      where A u is the parameterized activation function.

  7. 7.

    O defines for each \(u \in U\) an output function \(O_{u}\) to calculate the output \(o_{u}\).

    1. (a)

      for \(u \in U_{1} \bigcup U_{2} \bigcup U_{3} \bigcup U_{4}\!:\)

      • \(o_{u_{i}}=O_{u_{i}}(a_{u_{i}})=a_{u_{i}};\)

    2. (b)

      for \(u \in U_{5}\):

      • \(O_{u_{i}}: {{\mathcal{R}}} \;\mapsto\; [0,1]\) (for fuzzy partitioning or function approximation),

      • \(o_{u} = O_{u}(a_{u}) = a_{u}\);

      • \(O_{u_{i}}: {{\mathcal{R}}} \;\mapsto\; \{0,1\}\) (for crisp partitioning),

      • \(o_{u} = O_{u}(a_{u}) = {\rm DF}(a_{u})\),

      where DF is a suitable defuzzification function.

  8. 8.

    ex defines for each input unit \(u \in U_{1}\) its external input \(ex(u) = ex_{u}\). For all other units ex is not defined.

  9. 9.

    Inc defines the policy for the perceptron increment. A spawning request will be carried out by taking the following actions:

    1. (a)

      One neuron is spawned for each hidden layer, while the input and output layers remain unchanged. Let u, v and w denote the newly added neurons for layers 2, 3, and 4, respectively.

      • \(U_{2}:=U_{2} \bigcup \{u\}, \;\;\; m_{2}:=m_{2}+1;\)

      • \(U_{3}:=U_{3}\bigcup \{v\}, \;\;\; m_{3}:=m_{3}+1;\)

      • \(U_{4}:=U_{4}\bigcup \{w\}, \;\;\; m_{4}:=m_{4}+1;\)

      • m 2 = m 3 = m 4 holds anytime.

    2. (b)

      for all U 1, U 2, U 3, U 4, and U 5, the structure is reconstructed following the definition of (U, W, P, T, NET, A, O, ex).

Given the definition of the generic perceptron, we can describe the SSNFS as follows.

Definition 2

A SSNFS system is a generic five-layer incremental perceptron employing the supervised self-spawning competitive learning (SSSCL) algorithm, with the following specifications:

  1. 1.

    In SSNFS, \(U_{1}^{(\user2{{\mathcal{X}}})}\) is the input layer for data classification; \(U_{1}^{({\bf X})}\) and \(U_{1}^{({\bf Y})}\) are the input layers for training; U 2 is the membership layer; U 3 is the normalized membership layer; U 4 is the rule layer and U 5 is the output layer.

  2. 2.

    SSNFS starts from a single neuron for each hidden layer, i.e., m 2 = m 3 = m 4 = 1 holds initially.

  3. 3.

    SSNFS assigns each prototype P i three property vectors (A i , C i , R i ), called the asymptotic property vector (APV), center property vector (CPV), and distant property vector (DPV), respectively. In a simulated neural network, this is achieved by assigning each p ij three additional weights, while for a real neural network we can simply set W(u i , v j ) = 4 where \(u \in U_{1}^{({\bf X})}\) and \(v \in U_{2}\).

  4. 4.

    SSNFS applies Gaussian MFs or triangular MFs as the activation functions for layer 2; P and its property vectors are the parameter vectors for MFs and T is the consequent output vector for a rule.

  5. 5.

    SSNFS switches between the training process and the classification task as follows:

    1. (a)

      for the training process, \(ex_{u}\) is undefined for \(u \in U_{1}^{(\user2{{\mathcal{X}}})}\); \(U_{1}^{({\bf X})} \bigcup U_{1}^{({\bf Y})}\) is the active input layer;

    2. (b)

      for the classification task, \(ex_{v}=1\) for \(v \in U_{1}^{({\bf X})}\bigcup U_{1}^{({\bf Y})}\); \(U_{1}^{(\user2{{\mathcal{X}}})}\) is the active input layer.

In the SSNFS model, we model fuzzy rule extraction as a task to spawn the structure and to establish the premise and consequent parameters in the training process.
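
To summarize Definitions 1 and 2 in code form, the sketch below (our own data-structure choices, not the authors' implementation) represents each hidden-layer triple as one rule record holding the prototype P, its property vectors (A, C, R), the consequent T, and the accumulated error E, together with the Inc spawning policy that appends one such record:

```python
import numpy as np

class SSNFSRule:
    """One rule = one neuron in each hidden layer (U2, U3, U4)."""
    def __init__(self, n_in, n_out, rng):
        self.P = rng.random(n_in)            # prototype / MF centre
        self.A = rng.random(n_in)            # asymptotic property vector (APV)
        self.C = rng.random(n_in)            # center property vector (CPV)
        self.R = self.P.copy()               # distant property vector (DPV), starts at P
        self.T = rng.random(n_out)           # consequent output vector
        self.E = 0.0                         # accumulated regression error
        self.n_win_A = self.n_win_C = 0      # winning counters

class SSNFS:
    """Incremental five-layer perceptron: starts with a single rule
    (m2 = m3 = m4 = 1) and grows via the Inc policy."""
    def __init__(self, n_in, n_out, seed=0):
        self.rng = np.random.default_rng(seed)
        self.n_in, self.n_out = n_in, n_out
        self.rules = [SSNFSRule(n_in, n_out, self.rng)]

    def spawn(self, init_P):
        """Inc policy: add one neuron per hidden layer, i.e. one new rule."""
        r = SSNFSRule(self.n_in, self.n_out, self.rng)
        r.P = np.array(init_P, dtype=float)
        r.R = r.P.copy()
        self.rules.append(r)
```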

3.2 Supervised self-spawning competitive learning

As defined above, SSNFS employs the SSSCL algorithm to tune its structure and parameters. In contrast to pruning techniques, it is an incremental learning process. Thus it has to face the one-prototype-takes-multiple-clusters (OPTMC) problem: when the number of initial prototypes is less than the number of actual clusters in the data set, conventional competitive learning ensures that at least one prototype wins patterns from more than one cluster after the learning process. Unfortunately, this behavior is not desirable in a neuro-fuzzy system, as the extracted rules will misrepresent the input space. As shown in Fig. 4a, suppose there are three rule patches in the input space and one prototype is initialized to detect them; under the conventional competitive learning paradigm, this single prototype moves to the center of the training patterns. As a result, the extracted fuzzy rule covers a wrong rule patch, leading to misclassification. In fuzzy systems, or for that matter any rule-based systems, it is desirable that the extracted fuzzy rules represent the true rule patches.

Fig. 4

Two learning behaviors: OPTMC versus OPTOC. a OPTMC: one prototype P 1 is trying to take all three patches {S 1, S 2, S 3}, resulting in oscillation phenomenon. b OPTOC: this prototype detects only one rule patch S 2 and ignores the other two

The SSSCL tackles this problem by designing a new learning scheme in which one prototype takes one cluster (OPTOC) and ignores the others when the number of prototypes is less than the number of natural clusters, or in this case, the true rule patches. Figure 4b shows the same example as in (a) but applying the OPTOC learning paradigm. In this case, one rule patch (cluster) is correctly detected, and the remaining rule patches can be detected by spawning new prototypes in subsequent learning rounds. In this way we are able to extract a correct set of rules from the data set.

In the neural-fuzzy network, the spawning of new prototypes is accomplished by expanding the SSNFS structure according to a spawning validity measure and the increment policy. The newly spawned prototype then extracts one more fuzzy rule in the next learning iteration. This growth process terminates when a stop criterion is satisfied. The process is supervised in that the split validity measure is based on the desired output patterns for the given training patterns.

Let P i (A i , C i , R i ) denote the ith prototype in terms of its three property vectors; let (X, Y) denote the pair of training patterns for the input layer, where X is an m 1-dimensional pattern in the input space and Y is the desired m 5-dimensional pattern in the output space; let E i denote the regression error and T i the desired output vector for the fuzzy rule extracted by P i . Each time a pair of patterns (X, Y) is randomly picked from the training set, competition occurs among all the current prototype vectors. The winning prototype is determined by the nearest-neighbor criterion.

3.2.1 The SSSCL OPTOC learning algorithm

  1. 1.

    SSNFS starts from one fuzzy rule; therefore, m 2 = m 3 = m 4 = 1 holds initially. The single prototype vector, P 1, is initialized randomly in the input space. Its APV (A 1) and CPV (C 1) are initialized at a random location far from P 1, whereas its DPV (R 1) is set at the same place as P 1. T 1 is randomly initialized in the output space and E 1 is initialized to 0.

  2. 2.

    For the input (X, Y), if P i is the winning prototype for X, its APV (A i ) is updated in the input space by

    $$ {\mathbf{A}}_{i}^{*}={\mathbf{A}}_{i}+\frac{1} {n_{{\mathbf{A}}_{i}}} \cdot \delta_{i} \cdot ({\mathbf{X}}-{\mathbf{A}}_{i}) \cdot \Uptheta({\mathbf{P}}_{i}, {\mathbf{A}}_{i}, {\mathbf{X}}). $$
    (11)

    where

    $$ \Uptheta(\varvec{\mu}, \varvec{\nu}, \varvec{\omega})= \left\{ \begin{aligned} 1 &\quad \hbox{if } |\varvec{\mu}\varvec{\nu}| \geq |\varvec{\mu}\varvec{\omega}|,\\ 0 &\quad \hbox{otherwise}. \end{aligned} \right.$$
    (12)

    and

    $$ \delta_{i}= \left(\frac{|{\mathbf{P}}_{i}{\mathbf{A}}_{i}|}{|{\mathbf{P}}_{i}{\mathbf{X}}|+ |{\mathbf{P}}_{i}{\mathbf{A}}_{i}|}\right)^{2}. $$
    (13)

    n A i is the winning counter and |u v| is the Euclidean distance between vectors u and v. δ i is an adaptively updated learning rate which satisfies 0 < δ i ≤ 1.

  3. 3.

    The DPV (R i ) always follows the farthest pattern for which P i has been the winner in the input space so far. For the input (X, Y), if P i is the winning prototype for X, R i is updated by

    $$ {\mathbf{R}}_{i}^{*}={\mathbf{R}}_{i} \cdot (1-\Uptheta({\mathbf{P}}_{i},{\mathbf{X}},{\mathbf{R}}_{i})) + {\mathbf{X}} \cdot \Uptheta({\mathbf{P}}_{i},{\mathbf{X}},{\mathbf{R}}_{i}). $$
    (14)
  4. 4.

    For the input (X, Y), if P i is the winning prototype for X, the following update scheme guarantees that P i clusters one rule patch and ignores the others in the input space.

    $$ {\mathbf{P}}_{i}^{*}={\mathbf{P}}_{i} + \alpha_{i} \cdot({\mathbf{X}}-{\mathbf{P}}_{i}), $$
    (15)

    where α i is computed with

    $$ \alpha_{i}=\left(\frac{|{\mathbf{P}}_{i}{\mathbf{A}}_{i}|} {|{\mathbf{P}}_{i}{\mathbf{X}}|+|{\mathbf{P}}_{i}{\mathbf{A}}_{i}|}\right)^{2} \quad (0 < \alpha_{i} \le 1). $$
    (16)

The OPTOC scheme enables each prototype to find only one natural rule patch in the input space when there are more patches than prototypes. This in itself is a major improvement over other competitive learning algorithms in the literature. However, at this stage we are still not sure whether there are other rule patches that have not yet been detected. For this we introduce a spawning validity measure to judge whether all the rule patches have been properly discovered. If not, SSNFS expands its structure by spawning and appending one neuron to each hidden layer and then reconstructs the architecture. The prototype vector of the newly spawned neuron in U 2 joins the competition in the next learning iteration and, based on the OPTOC learning scheme, is therefore able to extract one more fuzzy rule.
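
A minimal sketch of one OPTOC competition step (Eqs. 11–16), reusing the hypothetical rule record from the earlier sketch; the handling of the winning counter and the small numerical guard are our own assumptions, not specified in the paper:

```python
import numpy as np

def optoc_step(rules, X):
    """One OPTOC competition step for input X (Eqs. 11-16)."""
    # winning prototype by the nearest-neighbour criterion
    i = int(np.argmin([np.linalg.norm(X - r.P) for r in rules]))
    r = rules[i]
    d_PA = np.linalg.norm(r.P - r.A)
    d_PX = np.linalg.norm(r.P - X)
    delta = (d_PA / (d_PX + d_PA + 1e-12)) ** 2    # Eq. 13 (same form as alpha, Eq. 16)
    # Eq. 11: APV moves towards X only when |P A| >= |P X| (Theta = 1, Eq. 12)
    if d_PA >= d_PX:
        r.n_win_A += 1                             # winning counter n_A (our reading)
        r.A += (1.0 / r.n_win_A) * delta * (X - r.A)
    # Eq. 14: DPV tracks the farthest pattern won so far
    if d_PX >= np.linalg.norm(r.P - r.R):
        r.R = X.copy()
    # Eq. 15: prototype update with adaptive rate alpha
    r.P += delta * (X - r.P)
    return i
```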

Definition 3

(The SSSCL spawning validity measure and stop criteria)

  1. 1.

    For the input (X, Y), if P i is the winning prototype for X, C i and T i are updated on-line by the k-Means learning scheme (MacQueen 1967),

    $$ {\mathbf{C}}_{i}^{*}={\mathbf{C}}_{i}+\frac{1}{n_{{\mathbf{C}}_{i}}}({\mathbf{X}}- {\mathbf{C}}_{i}); \quad {\mathbf{T}}_{i}^{*}={\mathbf{T}}_{i}+\frac{1} {n_{{\mathbf{T}}_{i}}}({\mathbf{Y}}-{\mathbf{T}}_{i}). $$
    (17)

    Just as the name k-Means indicates, the CPV (C i ) indicates the arithmetic mean (centroid) of all the X’s and T i indicates the arithmetic centroid of all the Y’s, where the (X, Y)’s are the presented patterns for which P i has been the winner.

  2. 2.

    E i denotes the regression error for the fuzzy rule extracted by P i . Therefore, it is given by

    $$ E_{i}^{*}=E_{i} + (O_{i}({\mathbf{X}})-{\mathbf{Y}})^{2}, $$
    (18)

    where O i represents the output vector of the fuzzy rule extracted with P i as the IF-part and T i as the THEN-part.

  3. 3.

    For all i ∈ {1, ..., m 2}, if |P i A i | < ε and there is at least one prototype P j satisfying |P j C j | > ε, then SSNFS is ready for spawning. ε is a small positive constant which is theoretically 0. The self-spawning process is carried out by:

    1. (a)

      SSNFS expands its structure in terms of the policy Inc defined in the Definition 1.

    2. (b)

      Let \({\mathbf{P}}_{m_{2}}\) denote the prototype vector for the newly spawned neuron in U 2. It is initialized with

      $$ {\mathbf{P}}_{m_{2}}={\mathbf{R}}_{s} $$

      where the index s satisfies

      $$ s=\arg{{\rm min}}_{j}E_{j}|_{|{\mathbf{P}}_{j}{\mathbf{C}}_{j}| > \epsilon} $$

      set \(\mathbf{R}_{m_2} = \mathbf{P}_{m_2}\), initialize \(\mathbf{A}_{m_2}\) and \(\mathbf{C}_{m_2}\) at a random location in the input space, set \(E_{m_2} = 0\), and initialize \(\mathbf{T}_{m_2}\) at a random location in the output space.

  4. 4.

    For all i ∈ {1, ..., m 2}, if |P i A i | < ε and |P i C i | < ε, SSNFS finishes its structural and parametric learning and can be interpreted as a set of established fuzzy rules.

SSNFS adaptively updates its structure and weight parameters to extract fuzzy rules. In the next section, we present our experimental results to show the effectiveness of our rule-extracting neuro-fuzzy model for data classification.
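
The validity bookkeeping and the spawning/stop test of Definition 3 could be sketched as follows (again continuing the hypothetical rule record and SSNFS class from the earlier sketches; a single winning counter is used for both centroids, and eps corresponds to ε):

```python
import numpy as np

def update_validity(r, X, Y, output):
    """Eqs. 17-18: on-line centroids and accumulated error for the winner r."""
    r.n_win_C += 1
    r.C += (X - r.C) / r.n_win_C                 # CPV: centroid of the X's
    r.T += (Y - r.T) / r.n_win_C                 # consequent: centroid of the Y's (Eq. 17)
    r.E += float(np.sum((output - Y) ** 2))      # Eq. 18

def check_spawn_or_stop(net, eps):
    """Definition 3, items 3-4: spawn one rule, stop, or keep training."""
    converged = all(np.linalg.norm(r.P - r.A) < eps for r in net.rules)
    unresolved = [r for r in net.rules if np.linalg.norm(r.P - r.C) > eps]
    if converged and unresolved:
        s = min(unresolved, key=lambda r: r.E)   # arg min_j E_j over |P_j C_j| > eps
        net.spawn(init_P=s.R)                    # new prototype starts at the DPV R_s
        return "spawned"
    if converged and not unresolved:
        return "stop"                            # structural learning finished
    return "continue"
```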

4 Experimental results

To demonstrate our system and its applicability, in this section we present experimental results for three classical problems. First, we consider the reconstruction of a known rule base from data to verify the ground truth; the purpose is to identify the partition in the data induced by the rule base. Second, we describe an application in pattern recognition using a well-known benchmark data set, namely the Iris data set. Finally, we consider a chaotic time series given by the Mackey–Glass differential equation and use SSNFS to approximate this function.

4.1 Modeling a known rule-base

We consider an existing fuzzy system which partitions two-dimensional data into three classes: A, B, and C. Let Tr(x, a, b, c) denote the triangular MF specified by three parameters (a, b, c),

$$ {\rm Tr}(x,a,b,c)=\left \{ \begin{aligned} &\frac{x-a}{b-a}, \quad\ {\hbox{if } a \le x \le b} \\& \frac{c-x}{c-b}, \quad\ {\hbox{if } b < x \le c} \\& 0, \qquad\quad { \hbox{ if } x < a \hbox{ or } x > c.} \end{aligned} \right. $$
(19)

The rule base consists of seven fuzzy rules as shown in Table 1. The input variables take values in the domain [0, 10]. The output vector y indicates the class an input pattern belongs to. The three classes are labelled as A = [100], B = [010], and C = [001].
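
For reference, a minimal sketch of the triangular MF in Eq. 19 and of evaluating one data-generating rule with a product AND (the parameter values shown are placeholders on the [0, 10] domain, not the actual entries of Table 1):

```python
import numpy as np

def tr(x, a, b, c):
    """Triangular MF of Eq. 19."""
    x = np.asarray(x, dtype=float)
    up = (x - a) / (b - a)
    down = (c - x) / (c - b)
    return np.where((x >= a) & (x <= b), up,
                    np.where((x > b) & (x <= c), down, 0.0))

def rule_degree(x1, x2, params1, params2):
    """Degree of fulfilment of one rule: product AND of two triangular MFs."""
    return tr(x1, *params1) * tr(x2, *params2)

# placeholder parameters, not the Table 1 values
print(rule_degree(3.0, 6.0, (0.0, 2.5, 5.0), (5.0, 7.5, 10.0)))
```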

Table 1 Rule base consists of seven fuzzy rules for generating the data set

Figure 5a shows the training data set obtained from the data-generating rule base. It contains 200 data points partitioned into three classes. We know that the data-generating system consists of seven rules, but it is not straightforward to tell how many clusters are actually present in the input space (Setnes 2000). Given this data set, we apply the SSNFS model to extract fuzzy rules with two-dimensional Gaussian MFs. Our objective is to verify, from the given data set, whether the SSNFS is able to extract a set of rules that are similar to those in the original, known rule base. The network has two input neurons, one neuron for each hidden layer, and three output neurons. The constant parameter ε in the spawning validity measure was initialized to 2% of the domain scale. Initially, as expected, the single prototype extracted one fuzzy rule by detecting one rule patch. During the self-spawning learning process, the spawning action occurred seven times; consequently, the rule base finally contained eight rules. In Fig. 5b the learning trajectories and spawning actions are drawn in detail in the input space to show the effectiveness of the SSNFS. The number in each circle indicates the spawning order and the initial position of the newly spawned prototype; the thin curves are the learning trajectories of the prototype vectors. Figure 5c shows the partition of the data after the rules have been extracted by SSNFS. Table 2 lists the extracted rules with their parameters, where P i , R i , and T i are defined in Definition 3.

Fig. 5

a The data set contains 200 data points obtained by the data-generating rule base; they were partitioned into 3 classes: A (plus signs), B (filled circles), and C (hollow circles). b The OPTOC learning trajectories and spawning actions of SSNFS; the spawning occurred seven times and finally eight fuzzy rules were extracted. c The classification result obtained by our neural-fuzzy system

Table 2 The rule parameters extracted by SSNFS based on the data set in Fig. 5

Figure 6a shows the classification map with respect to class B data (filled circles) obtained by the original, known rule base. Figure 6b shows the corresponding classification map obtained by using the rules extracted by the SSNFS from the input data set (the 200 data points). Since the training data set does not reflect the global distribution of the data-generating rules, the extracted rules perform well mainly on and around these training data points. One could extract globally optimal fuzzy rules by uniformly sampling the input space with a small step; thus this is not a limitation of the technique but a matter of computational complexity.

Fig. 6

a The class B output surface of the data generating rule base. b The corresponding output surface of the simulated fuzzy rules based on a limited data set

From the known rule set in Table 1, the extracted rules in Table 2, and the classification maps in Fig. 6, we can see that the SSNFS is able to reconstruct the existing rule base very accurately.

4.2 Rule extraction from Iris data

The Iris data set is one of the best-known databases in pattern recognition. The data set contains three classes, where each class refers to a type of iris plant and contains 50 data points. The three types are Iris Setosa, Iris Versicolour, and Iris Virginica. For each data point, there are four real-valued numerical attributes: (1) sepal length in cm; (2) sepal width in cm; (3) petal length in cm; (4) petal width in cm. The objectives in this experiment are to: (1) extract a set of fuzzy rules; (2) classify the data set using the extracted rules; (3) compare the class centroids with the actual centroids.

We applied the SSNFS to the Iris data set to extract a set of fuzzy rules. The SSNFS network consists of four input neurons, initially one neuron for each hidden layer, and three output nodes. Throughout the test, we set ε = 0.02. During the learning process, the spawning action occurred four times; therefore, five rules were extracted by SSNFS. Table 3 lists the final rules with their parameters. To test the robustness of the proposed model, we carried out Monte Carlo tests and ran the SSNFS ten times. Each time, SSNFS was able to extract a very consistent set of rules; in fact, the rules were almost identical across the tests. Figure 7 shows a classification result for attribute 2 against attribute 4 obtained by using the extracted fuzzy rules to classify data points as Iris Setosa, Versicolour, or Virginica.

Table 3 The rule parameters extracted by SSNFS on the Iris data
Fig. 7

An example: attribute 2 versus attribute 4 for Iris types 1, 2, and 3 classified by the SSNFS rules

The spawning constant ε is used to tune the quality of rule extraction. The smaller ε is, the more rule patches are detected in the input space and the better the accuracy of the rule extraction. Typically it is sufficient that ε < 0.05. As shown in Table 4, by setting ε = 0.02 or ε = 0.05, we obtain class centroids that are very close to the actual centroids of the three classes.

Table 4 The class centroids obtained by ε = 0.02 and ε = 0.05

4.3 Function approximation

To show the learning capability of the SSNFS, in this section we consider a chaotic time series given by the Mackey–Glass differential equation (Mackey and Glass 1977). The prediction of future values of this time series is a benchmark problem which has been considered by a number of connectionist researchers (see Jang 1993, for more details). The equation can be described as

$$ \dot{x}(t)=\frac{0.2x(t-T)}{1+x^{10}(t-T)}-0.1x(t). $$
(20)

We use the values x(t−18), x(t−12), x(t−6), and x(t) to predict x(t + 6). The training data were created using the fourth-order Runge–Kutta procedure with a unity step size. Assuming initially x(0) = 1.2 and T = 17, x(t) can be derived for 0 ≤ t ≤ 2,000. We created 1,000 samples between t = 118 and t = 1,117, where the first 500 samples were used for training and the second half was used as a validation set. To give a sense of the four-dimensional data \([x(t-18), x(t-12), x(t-6), x(t)]^{{\rm T}}\), we display them as two-dimensional projections. Figure 8a shows a plot of x(t−18) against x(t−6), whereas Fig. 8b shows a plot of x(t−6) against x(t). The SSNFS has four input nodes, one output node, and one node for each hidden layer. We set ε to 0.07. During the learning process, the spawning action occurred 46 times; therefore, 47 fuzzy rules were extracted to approximate this function. The final positions of the rule patches are shown in Fig. 8c and d, where we mapped them into the space of x(t−18) against x(t−6) and the space of x(t−6) against x(t), respectively. It must be noted that some data points in the mapped space could come from other dimensions; therefore, not all data points are covered by the rule prototypes (Cherkassky and Mulier 1998; Vapnik 1998).
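
A sketch of this data-generation step is shown below (fourth-order Runge–Kutta with unit step size; holding the initial history constant at x0 and linearly interpolating the delayed term are our own simplifications, not stated in the paper):

```python
import numpy as np

def mackey_glass(n_steps=2000, tau=17, x0=1.2, h=1.0):
    """Integrate Eq. 20 with fourth-order Runge-Kutta and unit step size.
    The history is assumed constant (= x0) for t <= 0."""
    x = np.zeros(n_steps + 1)
    x[0] = x0

    def x_lag(t):                       # x(t - tau), linearly interpolated
        s = t - tau
        if s <= 0:
            return x0
        i = int(np.floor(s))
        frac = s - i
        return (1 - frac) * x[i] + frac * x[min(i + 1, n_steps)]

    def f(t, xt):
        xl = x_lag(t)
        return 0.2 * xl / (1.0 + xl ** 10) - 0.1 * xt

    for k in range(n_steps):
        t, xt = k * h, x[k]
        k1 = f(t, xt)
        k2 = f(t + h / 2, xt + h / 2 * k1)
        k3 = f(t + h / 2, xt + h / 2 * k2)
        k4 = f(t + h, xt + h * k3)
        x[k + 1] = xt + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return x

x = mackey_glass()
# training pairs: [x(t-18), x(t-12), x(t-6), x(t)] -> x(t+6), t = 118 .. 1117
data = np.array([[x[t - 18], x[t - 12], x[t - 6], x[t], x[t + 6]]
                 for t in range(118, 1118)])
train, valid = data[:500], data[500:]
```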

Fig. 8

a The view of four-dimensional data points in the two-dimensional space of x(t−18) versus x(t−6). b The view of four-dimensional data points in the two-dimensional space of x(t−6) versus x(t). c The SSNFS rule prototypes mapped in the space of x(t−18) versus x(t−6). d The SSNFS rule prototypes mapped in the space of x(t−6) versus x(t)

Figure 9 shows the original Mackey–Glass time series (solid curve) and the simulated predictions (dashed curve) based on the first 500 samples. Since this is a locally linear mapping problem that would naturally correspond to a first-order Sugeno fuzzy system, the zero-order SSNFS adaptively generates a large number of small local rules to approximate the given data set.

Fig. 9

Approximation of the Mackey–Glass time series by SSNFS. The solid curve shows the Mackey–Glass time series while the dashed one shows the prediction curve by SSNFS

5 Conclusions and discussions

In this paper, we first discussed the scatter-partitioning fuzzy system, based on which we then presented the SSNFS model, a neuro-fuzzy system for rule extraction. It is derived from a generic five-layer incremental perceptron model and employs the SSSCL algorithm. When initialized with a single rule prototype, the SSNFS is able to adapt its structure to reach a suitable number of rules by the self-spawning learning algorithm. The considered synthetic and real-world examples demonstrated the effectiveness and applicability of the new rule generating system.

In most real-world applications, we rarely have adequate prior knowledge to specify the shapes, locations, and number of rules. Therefore, extracting rules using the SSSCL algorithm is effective and robust. However, many challenging theoretical problems remain open to further research, for instance, rule generalization in SSNFS, convergence properties, and rule-patch validation in the context of soft computing. The error-based insertion strategy is a popular approach to guide the growth of a neuro-fuzzy system. It is effective in many ways; however, it has the disadvantage that a well-partitioned large cluster can still accumulate more distortion than small clusters do. Thus, it can always be tricked into generating new, redundant prototypes that share the large cluster while ignoring the small ones. To avoid this problem, the SSNFS model uses a new spawning validity measure in which the error is only one of the factors considered for rule growth. Rule induction and deduction strategies remain active areas of investigation.