1 Introduction

Classification of streaming data has become one of the most common problems in machine learning in recent years, since applications involving data streams keep increasing. Examples of these applications include phone conversations, ATM transactions, web log analysis, user profile learning, and traffic management (Gama 2011). The main goal of data stream classification, as in traditional classification (Fahmi et al. 2017), is to learn a model from the observed data and then use this model to predict the class labels of future samples. However, due to the structural and conceptual differences between data streams and traditional data, constructing such a model faces many challenges. Since streaming data are produced continuously and at potentially unbounded rates, it is not feasible to store the data or to process them more than once in order to train the model. Moreover, the statistical nature of the target concept in a data stream may change over time. When this change happens gradually, it is referred to as concept drift (Widmer and Kubat 1996); when the change is abrupt, concept shift (Gama 2011) occurs. Another issue is concept evolution, which signifies the emergence of a new target concept after a while. All these features make off-line learning impractical, because it requires huge memory, long computation time, and retraining of the constructed model to adapt to recent changes in the data.

Since concept change is one of the main limitations in the data stream context, most of the efficient and powerful algorithms focus on this problem (Lughofer and Angelov 2011). In this regard, two common approaches have been developed. In the first approach, known as trigger-based or streaming algorithms, the distribution of data is monitored at constant time intervals, and the model is updated if changes are detected. These methods, which mostly target sudden alterations, have to use a group of samples (a data chunk) for the detection process; thus, they cannot operate in a completely online manner. In the second approach, called evolving or online methods, the model is updated as soon as each sample arrives, whether or not a change is detected. Therefore, the model always remains up-to-date and is never left unchanged merely because no variation in the data distribution has been detected. These methods only need a suitable low-complexity mechanism in order to update the model quickly and in an online fashion (Shahparast et al. 2014).

Although online learning deals with future data arriving in a streaming or sequential order, its objective is very different from that of lifelong machine learning. Online learning still performs the same learning task over time; its objective is to learn more efficiently from data arriving incrementally. Lifelong learning, on the other hand, aims to learn from a sequence of different tasks, retain the knowledge learned so far, and use that knowledge to help future task learning. Online learning does none of these (Chen and Liu 2016).

Among all of the techniques mentioned above, only a recently emerged branch of machine learning, known as evolving intelligent systems (Angelov et al. 2010), allows modification of the structure as well as the parameters. These systems mostly apply fuzzy rule-based, neuro-fuzzy, or neural network frameworks for training the model; however, other techniques such as decision trees and probabilistic models, e.g., Markov models, could also be used (Angelov 2012). Fuzzy systems have the advantage of being designed automatically from data, which removes the time-consuming and subjective design phases performed by experts (Fahmi et al. 2018; Amin et al. 2018). Unlike neuro-fuzzy inference systems, where the structure of the model is pre-specified and only its parameters are adaptively adjusted (Esmaeilpour and Mohammadi 2016), an evolving fuzzy model modifies both its structure and its parameters, develops its components, and generates new covering regions according to the changing properties and new circumstances of the environment (Lughofer 2011).

In the last few years, several promising evolving classifiers with fuzzy rule-based structure have been developed (Angelov and Zhou 2008) that are capable of autonomously adjusting the model to concept drift online and in real time. These models have a flexible structure that can grow or shrink according to the properties of the data, and they are applicable to fast modeling of very large datasets (Hamzeloo and Zolghadri Jahromi 2017). In summary, evolving methods offer one-pass, online processing of each data sample without storing it, while efficiently creating and modifying the model parameters and structure starting from an empty knowledge base.

In this paper, an adaptive mechanism for the classification of data streams, called adaptive fuzzy classifier based on gradient descent (AFCGD), is presented. This method is capable of online learning of an evolving fuzzy model while respecting the specifications and limitations of this kind of data, and it can handle concept drift. In most evolving methods, the antecedent part is tuned according to the distribution of the data, and the consequent part is adjusted separately via an error minimization method (e.g., least squares). In our proposed approach, however, the antecedent is tuned with respect to the concept of the data (i.e., the class labels of instances). Hence, there is no need to adjust the consequent part separately, which simplifies the learning process. In AFCGD, the rules' antecedents are assumed to be hyper-Gaussian, for the sake of generality and coverage of the input space, and their centers and spreads are incrementally tuned based on alterations in the data concept. The proposed adjustment process, derived by solving a minimization problem through gradient descent, is computationally cheap and fast; hence, it can be carried out after the arrival of each new instance. The method updates the rules such that the misclassification rate of the classifier is minimized with respect to each data sample. Therefore, it always stays compatible with the characteristics of the new instances, including changing concepts.

The rest of the paper is organized as follows. In the next section, some related works are described. The structure of a fuzzy rule-based classifier for data streams is introduced in Sect. 3. In Sect. 4, the proposed approach is explained. The experiments are elaborated in Sect. 5. Finally, the paper is concluded in Sect. 6.

2 Related works

In past research, a wide variety of approaches have been introduced for building highly accurate fuzzy models in the context of data streams. Although some of them are designed for both regression and classification, in this section we describe them only from the classification point of view.

In Angelov and Zhou (2008), a powerful family of evolving fuzzy classifiers is introduced: eClass0 and eClass1. They are based on zero-order and first-order fuzzy rules of Takagi–Sugeno type, respectively, and adjust the antecedent and consequent parts via the evolving Takagi–Sugeno learning approach. These adjustments are done in two phases: unsupervised learning of the antecedent part and supervised specification of the consequent parameters. For the first task, fuzzy clustering is conducted on the feature space, and the recursive least squares method is used for parameter adjustment (Angelov et al. 2010). For fuzzy partitioning, the eTS online clustering approach, based on the potential concept, is employed. This potential reflects the global data density and serves as a measure for generating new rules. Indeed, a new rule is generated around a new data sample in two situations: if the potential of this new sample is higher than the potential of the centers of the existing rules, or if its class label has not been seen before. Hence, the number of classes does not need to be known from the beginning. Moreover, another parameter, named age, identifies disused rules that should be removed to reduce the rule-base complexity; concept drift can also be detected using this parameter. However, this parameter must be recomputed after every new sample arrives, which multiplies the computational time by the number of samples. Two modified versions of the eClass family, simple-eClass0 and simple-eClass1, are presented in Almaksour and Anquetil (2010) to simplify the structure adaptation of eClass while preserving its advantages.

FLEXFIS-Class (Lughofer 2008a, b) is very similar to eClass, except that it begins with an off-line procedure to construct the initial classifier; after that, a new rule is generated around any new data sample that is not covered by previously generated rules. For learning the fuzzy rule-base, this method utilizes evolving vector quantization (Lughofer 2008a, b). Unfortunately, FLEXFIS provides no additional mechanism to remove old rules, and it cannot handle noise.

Simple-eTS (Angelov and Filev 2005), a modified version of the pioneering evolving eTS (Angelov and Filev 2004), was developed to decrease the complexity of eTS by replacing the calculation of the information potential with the concept of scattering. Both eTS and Simple-eTS employ an incremental version of the subtractive clustering algorithm (Chiu 1994) to build the fuzzy rule-base. Another extension of eTS is a method called eTS+ (Angelov 2010), in which measures such as age, utility, local density, and zone of influence are applied to update the antecedent parameters and the model structure.

Evolving participatory learning (ePL) was introduced in Elton et al. (2006) and subsequently developed considerably in 2010 (Lima et al. 2010) and 2014 (Maciel et al. 2014). ePL combines the participatory learning concept with the evolving fuzzy modeling idea to learn and construct the antecedent parts of rules. Participatory learning, which can be regarded as an unsupervised clustering algorithm, is a natural way of generating the rule-base structure online. The processes of structure identification and self-adaptation in ePL are similar to those of eTS, except that it applies participatory learning clustering instead of scattering or potential. ePL uses a fuzzy similarity measure to determine the closeness of data samples to rules, so each sample can update the current model according to its degree of compatibility.

The Evolving Fuzzy Pattern Tree (eFPT) (Shaker et al. 2013), designed for binary classification, incrementally learns an ensemble of several fuzzy pattern trees. This ensemble comprises a current tree, which is used for prediction, and a set of neighbor trees, which represent anticipated adaptations. Once the performance of the current model decreases, it is replaced by the best neighbor tree. The first pattern tree is generated off-line from a small number of data and is set as the initial current model. Afterward, the neighbors are built by either expanding or pruning the current model. This process is repeated after every data sample arrives; therefore, the computational overhead grows with the ensemble size.

pClass (Parsimonious Classifier) (Pratama et al. 2015), like other evolving models, can start its learning engine with either an empty or an initially trained fuzzy rule-base. Datum significance and generalized adaptive resonance theory methods (Vigdor and Lerner 2007) are applied in this classifier to automatically extract fuzzy rules from streaming data. The antecedent parts are constructed as fully flexible ellipsoids that allow arbitrary rotation in any direction. This property makes it feasible to capture local correlations among input dimensions, in addition to providing better data/class representations. It also makes it possible to use generalized antecedents, which had not been applied in the context of evolving fuzzy classifiers until then. In summary, automatic knowledge building, rule-base simplification, a knowledge recall mechanism, and soft feature reduction are integrated to construct this method with limited expert knowledge and without prior assumptions about the underlying data distribution.

Neuro-fuzzy frameworks are also applied as evolving schemes. One of the first neuro-fuzzy methods for adaptive online learning is DENFIS (Dynamic Evolving Neural Fuzzy Inference System) (Kasabov and Song 2002). In some principles it works like another evolving neuro-fuzzy method, EFuNN (Evolving Fuzzy Neural Network) (Kasabov 2001), but it was developed to be more suitable for online adaptive systems. DENFIS, which evolves through either supervised or unsupervised incremental learning, can handle new data samples, new features, and new classes by local tuning of its elements. A specially designed evolving clustering method is employed in DENFIS to generate the fuzzy rules through scatter partitioning of the input space.

GENEFIS is a recently developed evolving approach based on radial basis function networks, which applies a generalized version of TS fuzzy systems for all incremental learning operations and evolving steps (Pratama et al. 2014). It employs generalized adaptive resonance theory to update the premise parameters according to new input data distributions. The output parameters are tuned by a fuzzily weighted generalized recursive least squares method, which causes a smooth weight decay effect. Moreover, a fast kernel-based algorithm is applied for detecting redundancy at the fuzzy set and rule levels. Other neuro-fuzzy models, generally known as fuzzy neural networks, have also been introduced that are suitable for online adaptation (Rubio 2010; Juang and Tsao 2008).

Considering the specifications and characteristics of our proposed method, we have selected some of the above methods as competitors in the experiments.

3 The structure of fuzzy rule-based classifiers for data streams

The purpose of data stream classification, as in traditional problems, is to construct a model based on the seen data and use it to classify incoming samples. For an m-class n-dimensional problem, this model is a mapping from the input feature space \( X \in R^{n} \), represented by \( X = [x_{1} ,x_{2} , \ldots ,x_{n} ] \), to the class label space \( T \in \{ 1, \ldots ,m\} \). The fuzzy rule-based structure used in this paper, which is of the zero-order Takagi–Sugeno type (Sugeno and Takagi 1983), follows the common form used for data streams (Angelov and Zhou 2008):

$$ \begin{aligned} & Rule\;R^{i} :IF(x_{1} \;is\;around\;x_{1}^{i} )AND \ldots AND \\ & (x_{n} \;is\;around\;x_{n}^{i} )\;THEN\;class\;L^{i} \;WITH\;A^{i} \\ \end{aligned} $$
(1)

where \( R^{i} \) is the \( i{\text{th}} \) rule, \( i = 1, \ldots ,N \), \( X^{i} = [x_{1}^{i} ,x_{2}^{i} , \ldots ,x_{n}^{i} ] \) is the focal point of the antecedent of \( R^{i} \), and \( x_{j} \;is\;around\;x_{j}^{i} \) denotes the membership degree of \( x_{j} \) in the membership function with focal point \( x_{j}^{i} \). Also, \( L^{i} \in \{ 1, \ldots ,m\} \) is the consequent class of \( R^{i} \), which is set to the class label of the focal point, and \( A^{i} \) is the rule's age.
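For illustration only (this is our own sketch in Python, not code from the paper), a rule of form (1) can be represented by its focal point (center), the per-dimension spreads used later in (2), a consequent class label, and an age counter:

```python
# Hypothetical representation of a zero-order TS rule of form (1).
from dataclasses import dataclass
import numpy as np

@dataclass
class FuzzyRule:
    center: np.ndarray   # focal point X^i = [x_1^i, ..., x_n^i]
    spread: np.ndarray   # per-dimension widths sigma^i, initialized when created
    label: int           # consequent class L^i
    age: int = 0         # rule age A^i
```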

For the purpose of fuzzy rule generation, different partitioning methods such as grid partitioning, tree partitioning, and scatter partitioning are employed. Although promising methods have been proposed for generating rules based on grid and tree partitioning schemes, as in Mansoori (2014), Shahparast and Mansoori (2017) and Mansoori et al. (2008), in much data-driven fuzzy modeling, scatter partitioning is more appropriate than the other methods. Since it generates the partitions based on data density, it tends to produce a compact rule-base. Using this partitioning scheme to identify the fuzzy sets in each dimension, a new fuzzy rule is generated whenever a new density forms in some part of the space. The first sample in this region is set as the focal point of the new fuzzy rule, and the rule is then tuned as more samples arrive, according to the proposed algorithm.

For each fuzzy rule, the focal point is the center of the membership functions in its dimensions. Gaussian membership functions are used in our algorithm because of their generality and power in covering the input space. These memberships are specified by two parameters \( \{ C^{i} ,\sum^{i} \} \) as:

$$ \mu_{j}^{i} (x_{j} ) = {\text{e}}^{{ - \frac{1}{2}\left( {\frac{{x_{j} - c_{j}^{i} }}{{\sigma_{j}^{i} }}} \right)^{2} }} ,\quad i = 1, \ldots ,N, \, j = 1, \ldots ,n $$
(2)

where \( C^{i} = [c_{1}^{i} ,c_{2}^{i} , \ldots ,c_{n}^{i} ] \) represents the center and \( \sum^{i} = [\sigma_{1}^{i} ,\sigma_{2}^{i} , \ldots ,\sigma_{n}^{i} ] \) the spread of rule \( R^{i} \), which determines the membership width. Note that in (2), the projection of the distance between \( x_{j} \) and \( c_{j}^{i} \) onto the \( j{\text{th}} \) dimension of the feature space is used. This allows different spreads in each dimension of the feature space, which, in turn, increases the power of the classifier. When a new fuzzy rule is generated, the spread parameter is initialized to a constant value (e.g., 1) and is then adjusted according to the new data densities.
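As a small illustration (a sketch under the notation above, not the authors' implementation), the per-dimension memberships in (2) can be evaluated in vectorized form:

```python
import numpy as np

def membership(x, center, spread):
    """Per-dimension Gaussian memberships mu_j^i(x_j) of Eq. (2);
    x, center and spread are length-n arrays."""
    return np.exp(-0.5 * ((x - center) / spread) ** 2)
```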

The reasoning mechanism of the fuzzy inference system uses the single-winner rule. In this scheme, an input sample \( X^{t} \) is classified according to the consequent class of the winner rule, \( R^{\text{w}} \), i.e., the rule with the maximum compatibility grade with \( X^{t} \) in the whole rule-base. That is:

$$ \mu^{\text{w}} (X^{t} ) = \hbox{max} \{ \mu^{i} (X^{t} ),\quad i = 1, \ldots ,N\} $$
(3)

where \( \mu^{i} (X^{t} ) \) denotes the compatibility grade of rule \( R^{i} \) with \( X^{t} \). Using product t-norm in the antecedent part of rules, this grade is computed by multiplying the membership degree \( \mu_{j}^{i} \) of the \( j{\text{th}} \) feature as:

$$ \mu^{i} (X^{t} ) = \prod\limits_{j = 1}^{n} {\mu_{j}^{i} (x_{j} )} $$
(4)

Note that any \( X^{t} \) not covered by any rule cannot be classified by the rule-base. Moreover, when two or more rules with different consequent classes have the same compatibility grade with \( X^{t} \), the rule-base is also unable to classify it.
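For concreteness, the single-winner inference of (3) and (4) can be sketched as follows, assuming the FuzzyRule representation given earlier; ties between rules of different classes and the notion of coverage are simplified here for brevity:

```python
import numpy as np

def classify(x, rules):
    """Single-winner-rule inference, Eqs. (3)-(4): each rule's compatibility
    grade is the product of its per-dimension memberships; the sample takes
    the class of the most compatible rule, or None if no rule covers it."""
    best_grade, best_label = 0.0, None
    for r in rules:
        grade = float(np.prod(np.exp(-0.5 * ((x - r.center) / r.spread) ** 2)))
        if grade > best_grade:
            best_grade, best_label = grade, r.label
    return best_label   # None when every grade is (numerically) zero
```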

4 The proposed method

In this section, we introduce the learning mechanism of the proposed method, an adaptive fuzzy classifier based on gradient descent (AFCGD). This method provides a fast way to update the model after the arrival of each incoming sample. In this way, it reacts to any alteration in the data distribution without a distinct detection mechanism. Therefore, it is unnecessary to store incoming samples in order to monitor their distribution and detect concept drift.

AFCGD consists of zero-order Takagi–Sugeno rules as in (1). Its algorithm is explained in detail later; at present, we focus on the learning mechanism developed to tune and modify the generated rules. We aim to tune the antecedent part of the fuzzy rules according to the class labels of the data instances. Since there is no need to tune the consequent parts separately, this simplifies the learning process. We use a Gaussian membership function for each attribute, so the position of each rule's center, \( C^{i} \), and its spread, \( \sum^{i} \), which represents the covering area of that rule, are adjusted. By changing these two factors, the location and influence area of each rule change; in fact, the space partitioning itself changes with the rules' centers and spreads.

To this end, the constructed rules are tuned so that the misclassification rate of the classifier always remains minimal with respect to any new sample \( X^{t} \) with label \( T^{t} \) (denoted \( \left\langle {X^{t} ,T^{t} } \right\rangle \)). In this regard, we try to minimize the misclassification rate of AFCGD with respect to \( X^{t} \) (Fakhrahmad and Zolghadri Jahromi 2009):

$$ J(X^{t} ) = {\text{step}}\left( {\frac{{\mu^{ \ne } (X^{t} )}}{{\mu^{ = } (X^{t} )}}} \right) $$
(5)

where the step function is defined as:

$$ {\text{step}}(z) = \left\{ \begin{aligned} 0\quad {\text{if}}\quad z \le 1 \hfill \\ 1\quad {\text{if}}\quad z > 1 \hfill \\ \end{aligned} \right. $$
(6)

Here, \( \mu^{ = } (X^{t} ) \) is the compatibility grade of the rule with the same class as \( X^{t} \), called \( R^{ = } \), that is most compatible with it:

$$ \mu^{ = } (X^{t} ) = \hbox{max} \left\{ {\mu^{i} (X^{t} ),\quad i = 1, \ldots ,N\;{\text{and}}\;L^{i} = T^{t} } \right\} $$
(7)

Similarly, \( \mu^{ \ne } (X^{t} ) \) is the compatibility grade of the rule with a different class, called \( R^{ \ne } \), that is most compatible with \( X^{t} \):

$$ \mu^{ \ne } (X^{t} ) = \hbox{max} \left\{ {\mu^{i} (X^{t} ),\quad i = 1, \ldots ,N\;{\text{and}}\;L^{i} \ne T^{t} } \right\} $$
(8)

In order to apply the gradient descent method to minimize \( J(.) \), we need its partial derivatives with respect to \( C^{i} \) and \( \sum^{i} \). Since the step function in \( J(.) \) is not differentiable, it is approximated by the sigmoid function defined as:

$$ {\text{step}}(z) \approx \psi_{\beta } (z) = \frac{1}{{1 + {\text{e}}^{\beta (1 - z)} }} $$
(9)

Hence, the error rate in (5) is approximated as:

$$ J(X^{t} ) \approx \psi_{\beta } (h(X^{t} )) $$
(10)

where \( h(X^{t} ) = \frac{{\mu^{ \ne } (X^{t} )}}{{\mu^{ = } (X^{t} )}} \) is defined for simplicity. More precisely, according to (4):

$$ \begin{aligned} h(X^{t} ) & = \frac{{\prod\limits_{j = 1}^{n} {\mu_{j}^{ \ne } (x_{j}^{t} )} }}{{\prod\limits_{j = 1}^{n} {\mu_{j}^{ = } (x_{j}^{t} )} }} = \frac{{\prod\limits_{j = 1}^{n} {{\text{e}}^{{ - \frac{1}{2}\left( {\frac{{x_{j}^{t} - c_{j}^{ \ne } }}{{\sigma_{j}^{ \ne } }}} \right)^{2} }} } }}{{\prod\limits_{j = 1}^{n} {{\text{e}}^{{ - \frac{1}{2}\left( {\frac{{x_{j}^{t} - c_{j}^{ = } }}{{\sigma_{j}^{ = } }}} \right)^{2} }} } }} \\ & \quad = \frac{{{\text{e}}^{{\sum\limits_{j = 1}^{n} { - \frac{1}{2}\left( {\frac{{x_{j}^{t} - c_{j}^{ \ne } }}{{\sigma_{j}^{ \ne } }}} \right)^{2} } }} }}{{{\text{e}}^{{\sum\limits_{j = 1}^{n} { - \frac{1}{2}\left( {\frac{{x_{j}^{t} - c_{j}^{ = } }}{{\sigma_{j}^{ = } }}} \right)^{2} } }} }} \\ \end{aligned} $$
(11)
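For concreteness, a minimal sketch of (5)–(11) follows (our illustration, reusing the rule representation sketched in the previous section; the value of \( \beta \) is an arbitrary placeholder). It computes the ratio h between the best wrong-class and best correct-class grades and passes it through the sigmoid \( \psi_{\beta } \):

```python
import numpy as np

def smoothed_loss(x, rules, true_label, beta=10.0):
    """Approximate per-sample error of Eqs. (5)-(11): h = mu^!= / mu^= pushed
    through the sigmoid psi_beta that replaces the step function. Assumes the
    rule-base already contains at least one rule of the true class and one of
    a different class."""
    def grade(r):
        return float(np.prod(np.exp(-0.5 * ((x - r.center) / r.spread) ** 2)))
    mu_same = max(grade(r) for r in rules if r.label == true_label)
    mu_diff = max(grade(r) for r in rules if r.label != true_label)
    h = mu_diff / mu_same
    return 1.0 / (1.0 + np.exp(beta * (1.0 - h)))
```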

To tune the center positions (\( C^{ = } , \, C^{ \ne } \)) and spreads (\( \sum^{ = } , \, \sum^{ \ne } \)) of the rules \( R^{ = } \) and \( R^{ \ne } \), the updating formulas are derived for each dimension separately. For instance, we calculate the partial derivative of \( J(.) \) with respect to \( c_{j}^{ \ne } \):

$$ \frac{\partial J}{{\partial c_{j}^{ \ne } }} = \frac{\partial J}{{\partial \psi_{\beta } }} \, \frac{{\partial \psi_{\beta } }}{\partial h} \, \frac{\partial h}{{\partial c_{j}^{ \ne } }} $$
(12)

Since from (10),

$$ J(X^{t} ) = \psi_{\beta } (h(X^{t} )) \Rightarrow \frac{\partial J}{{\partial \psi_{\beta } }} = 1 $$
(13)

and from (9),

$$ \frac{{\partial \psi_{\beta } }}{\partial h} = \frac{{\beta \, e^{\beta (1 - h)} }}{{(1 + {\text{e}}^{\beta (1 - h)} )^{2} }} $$
(14)

and according to (11),

$$ \begin{aligned} \frac{\partial h}{{\partial c_{j}^{ \ne } }} & = \frac{{{\text{e}}^{{\sum\limits_{j = 1}^{n} { - \frac{1}{2}\left( {\frac{{x_{j}^{t} - c_{j}^{ \ne } }}{{\sigma_{j}^{ \ne } }}} \right)^{2} } }} }}{{{\text{e}}^{{\sum\limits_{j = 1}^{n} { - \frac{1}{2}\left( {\frac{{x_{j}^{t} - c_{j}^{ = } }}{{\sigma_{j}^{ = } }}} \right)^{2} } }} }} \, \frac{1}{{\sigma_{j}^{ \ne } }} \, \frac{{x_{j}^{t} - c_{j}^{ \ne } }}{{\sigma_{j}^{ \ne } }} \\ & \quad = h(X^{t} ) \, \frac{{x_{j}^{t} - c_{j}^{ \ne } }}{{(\sigma_{j}^{ \ne } )^{2} }} \\ \end{aligned} $$
(15)

so Eq. (12) is derived as:

$$ \frac{\partial J}{{\partial c_{j}^{ \ne } }} = \frac{{\beta {\text{e}}^{\beta (1 - h)} }}{{(1 + {\text{e}}^{\beta (1 - h)} )^{2} }} \, h(X^{t} ) \, \frac{{x_{j}^{t} - c_{j}^{ \ne } }}{{(\sigma_{j}^{ \ne } )^{2} }} $$
(16)

By defining \( H = \frac{{\beta {\text{e}}^{\beta (1 - h)} }}{{(1 + {\text{e}}^{\beta (1 - h)} )^{2} }} \, h(X^{t} ) \), we simplify (16) to:

$$ \frac{\partial J}{{\partial c_{j}^{ \ne } }} = H \, \frac{{x_{j}^{t} - c_{j}^{ \ne } }}{{(\sigma_{j}^{ \ne } )^{2} }} $$
(17)

Thus, using the gradient descent procedure, the following updating formula for the jth dimension of \( C^{ \ne } \) is obtained:

$$ c_{j}^{ \ne } = c_{j}^{ \ne } - \eta_{1} \, H\frac{{x_{j}^{t} - c_{j}^{ \ne } }}{{(\sigma_{j}^{ \ne } )^{2} }} $$
(18)

where \( \eta_{1} \) is a learning rate. Similarly, for \( C^{ = } \):

$$ c_{j}^{ = } = c_{j}^{ = } + \eta_{1} \, H\frac{{x_{j}^{t} - c_{j}^{ = } }}{{(\sigma_{j}^{ = } )^{2} }} $$
(19)

In this manner, we obtain for \( \sum^{ \ne } \):

$$ \sigma_{j}^{ \ne } = \sigma_{j}^{ \ne } - \eta_{2} \, H\frac{{(x_{j}^{t} - c_{j}^{ \ne } )^{2} }}{{(\sigma_{j}^{ \ne } )^{3} }} $$
(20)

and for \( \sum^{ = } \):

$$ \sigma_{j}^{ = } = \sigma_{j}^{ = } + \eta_{2} \, H\frac{{(x_{j}^{t} - c_{j}^{ = } )^{2} }}{{(\sigma_{j}^{ = } )^{3} }} $$
(21)

where \( \eta_{2} \) is another learning rate. Using the update equations (18)–(21), we have developed the AFCGD algorithm. Before describing it, we explain the parameters used in this algorithm.

  • The compatibility threshold, \( \tau_{\text{c}} \), determines whether or not a new fuzzy rule should be generated for the incoming instance \( \left\langle {X^{t} ,T^{t} } \right\rangle \). Indeed, when the compatibility margin between \( R^{ \ne } \) and \( R^{ = } \) is small (according to \( \tau_{\text{c}} \)), \( X^{t} \) may be correctly classified after adjusting \( R^{ = } \) and \( R^{ \ne } \) with respect to \( \left\langle {X^{t} ,T^{t} } \right\rangle \). However, if this margin is wide enough, a new fuzzy rule is generated for \( \left\langle {X^{t} ,T^{t} } \right\rangle \) and the other rules are not updated. Thus, \( \tau_{\text{c}} \) directly affects the classifier's rule-base.

  • The age threshold, \( \tau_{\text{a}} \), controls the death time of rules (relative to the arrival speed of instances). This threshold is used to remove out-of-date rules, i.e., those that have not been fired/used for a long time. To detect these rules, the method in Bouchachia and Mittermeir (2007) is employed. Each time a new rule is generated, its age is set to zero. Upon the arrival of each new instance, the age of the fired rule is reset to zero while the ages of the other rules are incremented. Thus, after a while and according to \( \tau_{\text{a}} \), out-of-date rules are detected and removed. This process maintains the compactness of the rule-base.

  • The rule threshold, \( \tau_{\text{r}} \), specifies the maximum number of rules per class in the rule-base. Indeed, after updating \( R^{ = } \) and \( R^{ \ne } \), if \( X^{t} \) is still misclassified, the model is given another chance and a new rule is generated for \( \left\langle {X^{t} ,T^{t} } \right\rangle \), provided that the number of existing rules with consequent \( T^{t} \) is below \( \tau_{\text{r}} \). Thus, \( \tau_{\text{r}} \) controls the size of the rule-base.

In this algorithm, each time a new instance \( \left\langle {X^{t} ,T^{t} } \right\rangle \) arrives, if its label, \( T^{t} \), has not been seen before (i.e., it is a new label), a new fuzzy rule, \( R^{t} \), is generated by using \( X^{t} \) as the position of its center (its spread is set to 1) and \( T^{t} \) as its consequent \( L^{t} \). If label \( T^{t} \) is not new, a new rule \( R^{t} \) might still be generated (according to \( \tau_{\text{c}} \)); otherwise, the centers of \( R^{ = } \) and \( R^{ \ne } \) are updated. After that, \( X^{t} \) is classified by the rule-base. If it is still misclassified, a new fuzzy rule is generated using \( \left\langle {X^{t} ,T^{t} } \right\rangle \). The algorithm of AFCGD is given here:

figure a
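As a hedged companion to the listing above (this is our reading of the algorithm in Python, not the authors' code), the sketch below combines a gradient-descent update of \( R^{ = } \) and \( R^{ \ne } \) according to (18)–(21) with one processing step of the loop just described. The margin test against \( \tau_{\text{c}} \) is one plausible interpretation of the compatibility-margin criterion, and \( \beta \), \( \eta_{1} \), \( \eta_{2} \) and the threshold defaults are illustrative placeholders rather than values reported in the paper; FuzzyRule is the representation sketched in Sect. 3.

```python
import numpy as np

def grade(x, r):
    """Compatibility grade of rule r with sample x, Eqs. (2) and (4)."""
    return float(np.prod(np.exp(-0.5 * ((x - r.center) / r.spread) ** 2)))

def update_rules(x, r_same, r_diff, beta, eta1, eta2):
    """One gradient-descent step of Eqs. (18)-(21) on R^= and R^!=."""
    h = grade(x, r_diff) / grade(x, r_same)
    H = beta * np.exp(beta * (1 - h)) / (1 + np.exp(beta * (1 - h))) ** 2 * h
    d_same, d_diff = x - r_same.center, x - r_diff.center
    # Eqs. (18)-(19): push the wrong-class center away, pull the right one closer
    new_c_diff = r_diff.center - eta1 * H * d_diff / r_diff.spread ** 2
    new_c_same = r_same.center + eta1 * H * d_same / r_same.spread ** 2
    # Eqs. (20)-(21): shrink the wrong-class spread, widen the right one
    new_s_diff = r_diff.spread - eta2 * H * d_diff ** 2 / r_diff.spread ** 3
    new_s_same = r_same.spread + eta2 * H * d_same ** 2 / r_same.spread ** 3
    r_diff.center, r_diff.spread = new_c_diff, np.maximum(new_s_diff, 1e-6)
    r_same.center, r_same.spread = new_c_same, new_s_same
    # (the floor on the shrinking spread is our numerical safeguard only)

def afcgd_step(x, t, rules, tau_c=0.5, tau_a=1000, tau_r=5,
               beta=10.0, eta1=0.01, eta2=0.01):
    """Process one labelled instance <x, t> and return the predicted label."""
    x = np.asarray(x, dtype=float)

    def new_rule():
        rules.append(FuzzyRule(center=x.copy(),
                               spread=np.ones_like(x), label=t, age=0))

    same = [r for r in rules if r.label == t]
    diff = [r for r in rules if r.label != t]
    if not same:                                    # unseen label: concept evolution
        new_rule()
    else:
        r_same = max(same, key=lambda r: grade(x, r))
        r_diff = max(diff, key=lambda r: grade(x, r)) if diff else None
        wide_margin = (r_diff is not None and
                       grade(x, r_diff) - grade(x, r_same) > tau_c)
        if wide_margin:                             # too far off: add a new rule
            new_rule()
        elif r_diff is not None:                    # otherwise tune R^= and R^!=
            update_rules(x, r_same, r_diff, beta, eta1, eta2)

    winner = max(rules, key=lambda r: grade(x, r))
    if winner.label != t and sum(r.label == t for r in rules) < tau_r:
        new_rule()                                  # second chance for this class

    for r in rules:                                 # age handling and pruning
        r.age = 0 if r is winner else r.age + 1
    rules[:] = [r for r in rules if r.age <= tau_a]
    return winner.label
```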

5 Experimental results and discussion

The aim of this section is to assess the performance of the proposed AFCGD. The experiments are divided into two categories, described in the following subsections. First, AFCGD is compared with some non-fuzzy state-of-the-art algorithms designed for data stream mining; then, some online fuzzy and neuro-fuzzy methods are investigated.

5.1 Experiments on non-fuzzy methods

The first group of experiments is performed with the aid of the MOA (Massive Online Analysis) framework (Bifet et al. 2010), which is designed specifically for data stream mining. This framework provides different datasets with drift properties as well as several collections of machine learning algorithms. In addition, tools are provided for evaluating the performance of the methods. MOA categories include Active learning classifiers, Bayesian classifiers, Decision tree classifiers, Drift classifiers, Function classifiers, and Meta classifiers. The default parameters set in MOA are used when running each algorithm. Table 1 lists the competitor methods and their categories. These methods are described briefly here.

Table 1 Classifiers and their categories in MOA

Active Classifier (Zliobaite et al. 2011) is a framework for active learning on non-stationary data streams that tries to learn the final model with as few labels as possible, addressing the new challenges of data streams. Unlike conventional active learning strategies, this method does not focus only on instances around the decision boundary; therefore, it does not miss changes that occur away from the boundary.

The Naïve Bayes classifier is based on Bayes' theorem under the assumption that the input features are independent. This algorithm is known for its simplicity and low computational complexity, because no iterative parameter estimation is needed. Assuming each feature takes values from a finite set, naïve Bayes can be applied incrementally by storing a table of counts for each feature value and a table of counts for each class.
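As an aside, the count-table idea can be sketched as follows (a generic illustration, not MOA's implementation): per-class counts and per-(class, feature, value) counts suffice, so each arriving sample updates the model in O(n):

```python
from collections import defaultdict

class IncrementalNaiveBayes:
    """Toy incremental naive Bayes over discrete features; assumes at least
    one sample has been seen before predict() is called."""
    def __init__(self):
        self.class_counts = defaultdict(int)
        self.feature_counts = defaultdict(int)   # key: (label, feature_idx, value)

    def update(self, x, label):
        self.class_counts[label] += 1
        for j, v in enumerate(x):
            self.feature_counts[(label, j, v)] += 1

    def predict(self, x):
        total = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, cnt in self.class_counts.items():
            score = cnt / total
            for j, v in enumerate(x):
                # crude +1 smoothing, only to avoid zero probabilities here
                score *= (self.feature_counts[(label, j, v)] + 1) / (cnt + 1)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```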

Decision trees for online learning include HT (Hoeffding Tree) (Hulten et al. 2001) and Random HT. HT is an incremental decision tree capable of learning from data streams, using the Hoeffding bound for induction. Since it assumes that the distribution of samples does not change over time, this method is equipped with a DDM (Drift Detection Method) in the experiments.

Single Classifier Drift is designed to handle concept drift datasets with a wrapper on a classifier (Gama et al. 2004). It monitors the error rate of the learning model during prediction, comparing statistics over windows of data to detect concept drift.

Function classifiers are also used: Majority Class, which always predicts the most frequent class; Single Perceptron, an incremental version of the classic single-layer multiclass perceptron; SGD (stochastic gradient descent), which learns various linear models such as binary-class SVM, logistic regression, and linear regression; and SPEGASOS (Stochastic variant of Primal Estimated sub-GrAdient SOlver for SVM) (Shalev-Shwartz et al. 2011).

Finally, the meta classifiers include AWE (Accuracy Weighted Ensemble) (Wang et al. 2003) and OCBoost (Online Coordinate Boosting) (Pelossof et al. 2010). OCBoost is an online boosting method that adapts the weights of a boosted classifier using a new online algorithm derived from previous online boosting work.

In our experiments, we apply three generation mechanisms to produce three synthetic datasets, Led, Wave, and Sea, each with 1,000,000 samples. Three real-world datasets are also used: Elec (electricity market data) (Harries 1999), Ozone (ozone-level detection) (Zhang et al. 2006), and Spam (Spam Assassin corpus) (Katakis et al. 2009).

The experiments are carried out using the interleaved test-then-train method, in which each data sample is first used for testing and then used to train the model. Thus, there is no need to store a separate test set, and all available data samples can be exploited. Moreover, in this way the accuracy is updated incrementally on samples that have not yet been used to construct or update the model. Table 2 compares the accuracy of the MOA classifiers for the synthetic and real data streams. Moreover, these methods are compared in terms of average precision and recall in Table 3, where all results are obtained via the MOA framework.
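The protocol itself is easy to express; the following sketch (our illustration, with a hypothetical model object exposing predict/update methods) maintains the running prequential accuracy:

```python
def prequential_accuracy(stream, model):
    """Interleaved test-then-train evaluation: every (x, label) pair is first
    used to test the current model and only then used to train it, so the
    running accuracy is always measured on unseen samples."""
    correct, seen = 0, 0
    for x, label in stream:
        if model.predict(x) == label:   # test first ...
            correct += 1
        model.update(x, label)          # ... then train on the same sample
        seen += 1
        yield correct / seen            # running accuracy after each sample
```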

Table 2 Classification rates of AFCGD and MOA classifiers
Table 3 Average precision and recall of AFCGD and some of MOA classifiers

Although these methods are neither evolving nor fuzzy, the aim of this comparison is to show that AFCGD attains comparable or even better results than other non-fuzzy methods. Moreover, as seen in Table 2, the methods that obtain better results on real datasets have lower accuracy on synthetic ones and vice versa. AFCGD, in contrast, shows good and stable behavior on both real and synthetic datasets and achieves nearly the best result on average. It is worth noting that AFCGD is the only method in this experiment that starts learning from scratch and does not store any chunk of samples.

5.2 Experiments on fuzzy methods

In this subsection, AFCGD is compared with other evolving fuzzy methods. We utilize the results of several evolving fuzzy and neuro-fuzzy approaches from Pratama et al. (2015), comprising pClass (Pratama et al. 2015), eClass (Angelov and Zhou 2008), GENEFIS class (Pratama et al. 2014), and SRAN (Suresh et al. 2010). The results of two more algorithms are also considered: NN (Haykin 1999) as a classical batch learning classifier and OS-ELM (Liang et al. 2006) as an incremental learning method designed for streaming data.

These experiments are carried out using the tenfold periodic hold-out process (Pratama et al. 2015), in which the dataset is divided into ten parts and each part is further split into 70% for training and 30% for testing the model. The data samples arrive at the system in single-pass mode. Datasets exhibiting concept drift with constant, gradual, and abrupt changes are used: Sea (Street and Kim 2001), Electricity pricing, the hyperplane dataset (Bifet et al. 2010), and some synthetic datasets obtained from the diversity for dealing with drifts algorithm (Minku and Yao 2012; Minku et al. 2010), named Boolean, Sin, Circle, Line, and Sinh. Table 4 shows the characteristics of these datasets. Table 5 compares the test accuracy and the number of generated rules of the methods when applied to these datasets. In these experiments, we apply the same specifications and evaluation mechanism as in Pratama et al. (2015). The complexity of the algorithms, in terms of training time, is also reported; all methods are run on an Intel(R) Core(TM) i7-2600 CPU @ 3.4 GHz with 8 GB of memory, as in Pratama et al. (2015). All results are averages over the subproblems of the tenfold periodic hold-out process. Clearly, AFCGD attains better or comparable performance relative to the other fuzzy methods on almost all datasets. In Table 6, the average precision and recall of AFCGD are compared with those of some of its competitors.
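For clarity, a rough sketch of the tenfold periodic hold-out evaluation described at the start of this subsection (our own illustration with a hypothetical model interface, not the script of Pratama et al. 2015) is:

```python
import numpy as np

def periodic_holdout(X, y, model, n_folds=10, train_frac=0.7):
    """Split the stream into n_folds consecutive parts; within each part,
    train single-pass on the first 70% and test on the remaining 30%."""
    accuracies = []
    for part_x, part_y in zip(np.array_split(X, n_folds),
                              np.array_split(y, n_folds)):
        split = int(train_frac * len(part_x))
        for x, t in zip(part_x[:split], part_y[:split]):   # single-pass training
            model.update(x, t)
        correct = sum(model.predict(x) == t
                      for x, t in zip(part_x[split:], part_y[split:]))
        accuracies.append(correct / max(1, len(part_x) - split))
    return float(np.mean(accuracies))                       # average over folds
```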

Table 4 Statistics of the datasets
Table 5 Classification rate (accuracy), CPU time and the number of rules of compared methods
Table 6 Average precision and recall of compared methods

As described in the algorithm, AFCGD has three parameters: \( \tau_{\text{c}} \), \( \tau_{\text{a}} \), and \( \tau_{\text{r}} \), of which \( \tau_{\text{c}} \) and \( \tau_{\text{r}} \) are the most influential. To determine these data-dependent thresholds, we first relax the rule-generation condition on \( \tau_{\text{r}} \) and run the algorithm with different values of \( \tau_{\text{c}} \). The classification rate for some datasets is depicted in Fig. 1.

Fig. 1
figure 1

The classification rate of AFCGD, based on threshold \( \tau_{\text{c}} \)

According to Fig. 1, the best value of \( \tau_{\text{c}} \) for each dataset can be determined. For example, on the Hyperplane dataset, the accuracy increases for \( \tau_{\text{c}} < 0.5 \) and is almost constant for values above 0.5, so any \( 0.5 \le \tau_{\text{c}} < 1 \) is suitable for this dataset. On the Line dataset, although the accuracy is maximal at \( \tau_{\text{c}} = 0.1 \), the behavior is uniform for \( \tau_{\text{c}} > 0.5 \), as in Hyperplane. Applying this mechanism to the other datasets, we decided to set \( \tau_{\text{c}} = 0.5 \) for all datasets, although for some datasets, such as Sin, better results are obtained with \( \tau_{\text{c}} < 0.5 \). Indeed, the performance of AFCGD is acceptable at \( \tau_{\text{c}} = 0.5 \), and it improves further when the second part of the algorithm is applied.

The second parameter, \( \tau_{\text{r}} \), is determined in the same manner. This time, \( \tau_{\text{c}} \) is set to 0.5 and AFCGD is run for different values of \( \tau_{\text{r}} \): 2, 3, 5, 10, 15, 20, and 25. The classification results are plotted in Fig. 2. As shown, the classification accuracy is reasonable when \( \tau_{\text{r}} \) is about 5 on these datasets. Therefore, for the sake of simplicity and complexity reduction, we have set \( \tau_{\text{r}} = 5 \) in all experiments.

Fig. 2
figure 2

The classification rate of AFCGD, based on threshold \( \tau_{\text{r}} \) when \( \tau_{\text{c}} = 0.5 \)

5.3 Performance analysis of AFCGD

As mentioned before, the distribution of data in a streaming context is not fixed (i.e., concept drift may occur): the underlying distribution moves smoothly from one region to another. Hence, a constructed model may become outdated for new incoming samples, which, in turn, leads to performance decay. AFCGD copes with concept drift through its updating mechanism.

In this mechanism, there are situations where no new rule needs to be generated for incoming samples; instead, the related rules are updated. These are the nearest rules, \( R^{ = } \) and \( R^{ \ne } \), which have the same and a different class, respectively, as the incoming instance \( \left\langle {X^{t} ,T^{t} } \right\rangle \). Indeed, when \( X^{t} \) comes from the current distribution, the updating mechanism does not move the center of \( R^{ = } \) to another region; instead, \( R^{ = } \) is adjusted so that its center stays in the middle of the samples of its region. Besides, the spread of \( R^{ = } \) is tuned to cover \( X^{t} \) as well. On the other hand, when the distribution of samples drifts to another part of the space, the center of \( R^{ = } \) moves toward the new data density and its spread changes to cover the new region. For \( R^{ \ne } \), however, whether or not drift happens, its center moves away from \( X^{t} \) and its spread shrinks to reduce its negative influence. This mechanism for handling concept drift explains the good performance of AFCGD in the experiments, since almost all datasets exhibit this feature.

Apart from concept drift, concept evolution and/or concept shift may also occur in a data stream. Concept evolution means the emergence of a new class, i.e., the class label of the incoming instance \( \left\langle {X^{t} ,T^{t} } \right\rangle \) has not been seen before. Concept shift, on the other hand, occurs when the distribution of instances of a class suddenly changes.

Most state-of-the-art techniques ignore concept evolution. However, if this problem is not handled properly, the model may misclassify the new concept as an existing class, and its error rate increases. According to the AFCGD algorithm, once the model detects a new sample \( X^{t} \) with a new class label \( T^{t} \), it generates a new rule \( R^{t} \) using \( \left\langle {X^{t} ,T^{t} } \right\rangle \). Now, if class \( T^{t} \) is actually a new concept rather than noise, more samples will support \( R^{t} \) in the future and it will survive (its age \( A^{t} \) being reset by incoming instances). Otherwise, \( A^{t} \) increases regularly, and according to \( \tau_{\text{a}} \), \( R^{t} \) is recognized as an outdated/dead rule and is removed from the rule-base.

Regarding concept shift, suppose a data cloud of a known class appears in a totally different region, outside the current area of that class. As these samples arrive, AFCGD builds a new rule, since none of the existing rules can cover them. Consequently, the previously generated rules of that class become outdated and are removed after a while. The three types of concept alteration, together with the techniques for handling them, are depicted in Fig. 3.

Fig. 3
figure 3

Action of AFCGD in concept drift/shift/evolution. a Concept drift, b concept shift, and c concept evolution

The time complexity of AFCGD is lower than that of other fuzzy classification methods because of the simple structure of its rule-base. In contrast to other methods, which have multiple outputs, AFCGD uses a single output, which reduces the time needed to modify the rules. Moreover, since the antecedent parts are tuned according to the labels of the samples in addition to their distribution, the consequent part can be considered fixed, which significantly reduces the complexity. For each data sample, only two rules, \( R^{ = } \) and \( R^{ \ne } \), need to be updated in the modification phase of the algorithm; to find them, the compatibility grades of all rules must be computed. Hence, the time complexity of AFCGD in the modification step is \( O(nN) \) for N rules in an n-dimensional problem. In the rule generation step, the complexity is also \( O(nN) \), since in the worst case all N rules must be examined before the new rule is generated. For comparison, the complexity of pClass and eClass is \( O(n^{2} N) \) and \( O(mn^{2} N) \), respectively, where m is the number of class labels.

6 Conclusion

The main challenge in evolving fuzzy systems is the need for an efficient, low-complexity mechanism for updating the model. In this paper, we proposed a fast technique for updating the fuzzy classifier's rule-base. The updating formula, derived by solving an error minimization problem through gradient descent, keeps the model up-to-date at all times and therefore enables it to handle concept drift, concept shift, and concept evolution. Moreover, the formula is applied per incoming instance, so there is no need to store any chunk of data.

Several experiments on real-world and synthetic datasets were carried out with AFCGD, and its results were compared against a broad range of algorithms, including state-of-the-art data stream classification algorithms as well as evolving fuzzy and neuro-fuzzy classifiers. According to the results, AFCGD behaves well on both real and synthetic datasets and attains better or comparable performance relative to the other methods.