1 Introduction

Data streams have become one of the main means of collecting data, such as audio and video samples, sensor network signals or network monitoring logs [4]. Structured data are produced without limit, in real time and at widely varying rates. This fact raises problems of storage, processing time and changes in the data probability distributions over time [12]. For these reasons, applications based on classification need to adapt their training and prediction operations to obtain better performance [10].

Traditionally, classification assigns one class label from a set of possible labels to an unobserved data example. However, in some classification problems, more than one class label is assigned to an example. Additionally, different examples may be assigned different numbers of classes [18]. This type of classification is called multi-label classification (MLC) [23].

Formally, the input attributes are defined as a vector of input random variables \(\mathbf X =\{X_1,{\ldots },X_j,{\ldots },X_M\}\in \mathbb {R}^M \), where M is the number of variables. The output attributes are defined as random subsets of labels \(\mathbf Y \subseteq \{ \lambda _1, {\ldots },\lambda _k, {\ldots }, \lambda _L \}\) where L is the number of possible labels. The vectors \(\mathbf x _i=(x_{i,1},{\ldots },x_{i,j},{\ldots },x_{i,M})\in \mathbb {R}^M \) and \(\mathbf y _i\subseteq \{ \lambda _1, {\ldots },\lambda _k, {\ldots }, \lambda _L \}\), where \(i\in \{0,1,2,{\ldots }\}\), represent realizations of \(\mathbf X \) and \(\mathbf Y \), respectively. A stream is defined as the sequence of examples \(\mathbf e _i=(\mathbf x _i,\mathbf y _i)\) represented as \(\mathcal {S}=\{(\mathbf x _0,\mathbf y _0),(\mathbf x _1,\mathbf y _1),{\ldots },(\mathbf x _i,\mathbf y _i),\) \({\ldots }\}\). The objective of MLC methods is to learn a function \(f(\mathbf x _i)\rightarrow \mathbf y _i\) that maps the input values of \(\mathbf x _i\) into the output values of \(\mathbf y _i\).

This methodology is present in many domains such as biology (gene and protein function classification) [6], engineering (network monitoring and sensor applications) [1], economics (online stock market data) [16], social sciences (social networks evolution) [19], library and information science (text categorization) [16] and multimedia (image, video and music categorization and annotation) [18].

Structured output classifiers predict complex objects, in this particular case label sets [19]. Conventionally, these structured classifiers can be categorized into local and global methods [15]. Local methods decompose the structured output into elementary scalar outputs, and conventional learners predict each of these outputs. In turn, global methods train and predict over the complete structured output and consist of adaptations of conventional classification algorithms [8]. Output specialization is an intermediate concept where subsets of outputs are modelled.

The rule learning methodology develops methods that create partitions of the input attribute space and fit local models (e.g. linear models) in each partition. Therefore, the obtained models are more interpretable and highly modular (independent partitions) [9]. This property is the most prominent advantage of rule learning over decision trees. Modularity is exploited in this work to surpass the global and local methods through rule specialization on output variables [8].

Ensemble learning focuses on methods that explore model diversity to improve classification accuracy [5]. Multiple different models are employed to yield multiple predictions. These predictions are aggregated into one final prediction according to a criterion (e.g. majority vote). Diversity is created using bagging or boosting techniques [21].

In this paper, two solutions for the online MLC problem, based on rule and ensemble learning, are presented. These solutions are inspired by a multi-target regression approach. A performance evaluation was conducted in which the proposed solutions were compared to other methods found in the literature [8].

This paper is organized as follows. Section 2 describes related work, covering six online MLC classifiers found in the literature, and Sect. 3 explains the proposed algorithms. The performance evaluation is described in Sect. 4; the respective results are discussed in Sect. 5, and the conclusions are presented in Sect. 6.

2 Related work

This section provides a brief description of six existing online approaches in the literature. Most approaches are based on problem transformation [19]. The random subsets of nominal labels \(\mathbf Y \) are transformed into a vector of binary random variables \(\mathbf Y =\{Y_1,{\ldots },Y_k,{\ldots },Y_L\}\). The realizations \(\mathbf y _i\) are transformed into a vector of output variables \(\left[ y_{i,1} \cdots y_{i,k} \cdots y_{i,L}\right] \), where each \(y_{i,k} \in \{0,1\}\) is binary. If label \(\lambda _k\) is assigned to the ith example, then \(y_{i,k}=1\); otherwise, \(y_{i,k}=0\). Here, the output realizations are redefined as \(\mathbf y _i=\left[ y_{i,1} \cdots y_{i,k} \cdots y_{i,L}\right] \).
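The transformation above can be sketched as follows; this is a minimal illustration, and the function name `to_binary_vector` is ours, not from any library.

```python
def to_binary_vector(labels, num_labels):
    """Problem transformation: map a set of assigned label indices
    to a 0/1 indicator vector of length L."""
    return [1 if k in labels else 0 for k in range(num_labels)]

# Example: labels lambda_1 and lambda_3 (indices 0 and 2) out of L = 4 labels.
y_i = to_binary_vector({0, 2}, 4)  # [1, 0, 1, 0]
```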

Binary relevance (BR) is a simple classifier that uses the problem transformation directly. One online binary classifier trains and predicts for each kth output variable only. These predictions are concatenated to produce a final structured prediction. The final prediction is formally represented by \(\hat{\mathbf{y }}_{i}=[ f_1(\mathbf x _i), {\ldots },f_k(\mathbf x _i),{\ldots },\) \(f_L(\mathbf x _i) ]\), where \(f_k\) represents the classifier of the kth output variable. As advantages, the BR method is very simple and can use any online binary classifier. As limitations, the information related to the correlation between the single-label outputs is lost, and the method requires a vast amount of hardware resources [23]. This classifier is used as a baseline in the performance tests.
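As an illustration of BR's structure, the following sketch uses a hypothetical `MajorityVote` stand-in for the online binary base classifier; any online binary learner with the same `train`/`predict` interface would fit.

```python
class MajorityVote:
    """Hypothetical stand-in for an arbitrary online binary classifier."""
    def __init__(self):
        self.ones = 0
        self.total = 0

    def train(self, x, y):
        self.ones += y
        self.total += 1

    def predict(self, x):
        return 1 if self.total and 2 * self.ones >= self.total else 0


class BinaryRelevance:
    """One independent online binary classifier per output variable."""
    def __init__(self, base_factory, num_labels):
        self.classifiers = [base_factory() for _ in range(num_labels)]

    def train(self, x, y):
        for clf, y_k in zip(self.classifiers, y):
            clf.train(x, y_k)

    def predict(self, x):
        # Concatenate the individual predictions into the structured output.
        return [clf.predict(x) for clf in self.classifiers]
```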

The classifier chain (CC) is also based on the problem transformation [24]. The L output variable indexes are randomly reordered into a different sequence to avoid ordering preference. Then, classifier k models the kth output using the inputs and the first \((k-1)\) output variables. The prediction is defined by \(\hat{\mathbf{y }}_{i}=[ f_1(\mathbf x _i),{\ldots }, f_k(\mathbf x _i,\hat{y}_{i,1} ,{\ldots },\) \(\hat{y}_{i,k-1}),{\ldots },f_L(\mathbf x _i,\hat{y}_{i,1} ,{\ldots },\hat{y}_{i,L-1}) ]\). Finally, the outputs are reordered into the original sequence. Unlike the BR method, the CC method models the label dependency. However, CC does not guarantee the optimal order of output variables.
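A sketch of the chaining idea follows, omitting the random reordering step for brevity; `LastLabel` is a hypothetical toy base learner used only to keep the example self-contained.

```python
class LastLabel:
    """Hypothetical online binary learner: predicts the last observed label."""
    def __init__(self):
        self.last = 0

    def train(self, x, y):
        self.last = y

    def predict(self, x):
        return self.last


class ClassifierChain:
    """Classifier k sees the inputs plus the k-1 previous labels."""
    def __init__(self, base_factory, num_labels):
        self.classifiers = [base_factory() for _ in range(num_labels)]

    def train(self, x, y):
        extended = list(x)
        for clf, y_k in zip(self.classifiers, y):
            clf.train(extended, y_k)
            extended.append(y_k)  # true labels are appended during training

    def predict(self, x):
        extended = list(x)
        y_hat = []
        for clf in self.classifiers:
            y_hat.append(clf.predict(extended))
            extended.append(y_hat[-1])  # predicted labels are chained forward
        return y_hat
```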

The multi-label Hoeffding tree (MHT) is a structured classifier based on a decision tree that uses the Hoeffding bound splitting criterion during induction. This method uses the information gain as homogeneity heuristic and online classifiers (essentially BR based) at the tree leaves [23]. The prediction can be modelled as \(\hat{\mathbf{y }}_{i}=f_n(\mathbf x _i)\), where \(f_n\) is the basic online classifier at leaf n. The main advantage is the creation of local models by dynamically dividing the input space into partitions. The main disadvantages are the vulnerability to outliers and the creation of disjoint and dependent partitions.

i-SOUP Tree is another structured method that was originally developed for multi-target regression and was adapted to MLC through problem transformation. This algorithm is based on the FIMT-MT method, which creates model trees based on a split selection criterion (Hoeffding criterion) and a homogeneity measure (variance). It uses an enhanced top-down training procedure and MTR perceptrons (without activation function and using the delta rule) in the leaves of the tree [20].

3 AMRules and random rules for multi-label classification

This section describes the contributions of this work: the Multi-Label AMRules (ML-AMRules) and Multi-Label Random Rules (ML-Random Rules) methods. The main contribution is the output subset specialization, which is more efficient (in memory and processing) than several methods in the literature while remaining very competitive. This mode is also more efficient than the other modes of ML-AMRules, which are explained below. The underlying theory of rule learning and the fundamentals of the local and global operation modes are also clarified.

Fig. 1
figure 1

ML-AMRules in subset specialization mode. The rule set is composed of rules, each with an associated model that predicts a different output set

Fig. 2
figure 2

Instantiation of the ML-AMRules on local mode. One ML-AMRules classifier for each output

These methods are based on the adaptation of the AMRules and random rules multi-target regression (MTR) methods to the MLC problem through problem transformation [2, 8]. The adaptation also implied the modification of the heuristics that are used in the splitting functions.

3.1 Rule learning

A rule R is defined as an implication \(\mathcal {A} \Rightarrow \mathcal {C}\), where the antecedent \(\mathcal {A}\) is a conjunction of conditions (called literals) on the input variables \(\mathbf x _i\). The consequent \(\mathcal {C}\) is a predicting function, which, in this context, corresponds to a basic online classifier.

For numerical data, literals may present the forms \((X_j\le v)\) and \((X_j > v)\), where \(X_j\) represents the jth input variable, meaning that \(x_{i,j}\) must be less than or equal to v, or greater than v, respectively. Regarding nominal data, literals may present the form \((X_j = v)\), expressing that \(x_{i,j}\) must be equal to v, or \((X_j \ne v)\), indicating that \(x_{i,j}\) must be different from v.

R is said to cover \(\mathbf x _i\) if, and only if, \(\mathbf x _i\) satisfies all the literal conditions in \(\mathcal {A}\). The support \(S(\mathbf x _i)\) of the input variables of an example corresponds to the set of rules that cover \(\mathbf x _i\). The function (the basic multi-label classifier) in \(\mathcal {C}\) returns a prediction \(\hat{\mathbf{y }}_i\) if a rule \(R_r\) covers the example input variables \(\mathbf x _i\). The data structure \(\mathcal {L}_r\) is associated with the rule \(R_r\) and contains the statistics (about the rule and the examples) necessary for the model training and prediction operations (expanding the rule, detecting changes, identifying anomalies, etc.).
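The coverage and support notions can be sketched as follows; the literal and rule representations are our own illustrative choices, not the paper's data structures.

```python
# A literal is (attribute index, operator, value); a rule antecedent is a
# conjunction (list) of literals.
OPS = {'<=': lambda a, b: a <= b, '>': lambda a, b: a > b,
       '==': lambda a, b: a == b, '!=': lambda a, b: a != b}

def covers(antecedent, x):
    """True iff x satisfies every literal in the antecedent."""
    return all(OPS[op](x[j], v) for j, op, v in antecedent)

def support(rule_set, x):
    """S(x): the subset of rules whose antecedents cover x."""
    return [rule for rule in rule_set if covers(rule, x)]
```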

The default rule D is used in the initial steps and when none of the current rules covers the example (\(S(\mathbf x _i)=\emptyset \)). The antecedent of D and its statistics \(\mathcal {L}_D\) start as an empty set. The rule set is composed of U learned rules, represented as \(\mathcal {R}=\{R_1,\ldots ,R_r,\ldots ,R_U\}\), and a default rule D, as depicted in Figs. 1, 2 and 3.

3.2 Multi-Label Adaptive Model Rules

Algorithm 1 presents the model training procedure of the Multi-Label Adaptive Model Rules (ML-AMRules) method. In the initialization stage, the statistics \(\mathcal {L}_D\) of the default rule are initialized and the rule set \(\mathcal {R}\) starts out empty. In the processing stage, given an incoming example (\(\mathbf x _i,\mathbf y _i\)), the method finds the rules that cover the example's input variables \(\mathbf x _i\).

figure a

When a covering rule \(R_r \in S(\mathbf x _i)\) is found, the example is submitted to anomaly (\(\mathtt {isAnomaly}(\mathcal {L}_r,\mathbf x _i)\)) and change (\(\mathtt {changeDetected}(\mathcal {L}_r,\mathbf x _i)\)) detection. These operations prune the model by avoiding anomalous examples and changes on probabilistic distributions, respectively.

Page-Hinkley (PH) was used for change detection [22]. A method based on Cantelli's inequality was employed for anomaly detection [8]. In case of an anomaly, the example is simply rejected; in case of change detection, \(R_r\) is removed from the rule set (the rule is outdated). Otherwise, the statistics \(\mathcal {L}_r\) are updated (\(\mathtt {update}(\mathcal {L}_r)\)).

Rule expansion (the addition of a new literal) is attempted (\(\mathtt {expand}(R_r)\)); if it succeeds, the rule is specialized on an output subset and added to \(\mathcal {R}\). This specialization leads to more accurate predictions and increases the processing speed.

Fig. 3
figure 3

Global approach mode. All outputs are used in the learning and prediction process

The example input variables \(\mathbf x _i\) may not be covered by any rule. In that case, the statistics of the default rule \(\mathcal {L}_D\) are updated and an expansion is attempted. If an expansion occurs, the default rule D is added to the rule set \(\mathcal {R}\) and a new default rule is initialized. The training process also involves the computation of the online average of false positives and negatives. This measure is used in the prediction stage to compute a weight that reflects the reliability of the model prediction (Algorithm 2). The online average of false positives and negatives of rule r and output k is defined as \(e_{r,k}=\frac{T_{r,k}}{W_r}\), where \(T_{r,k}\) is the accumulated number of false positives and negatives, and \(W_r\) is the number of examples observed since the last expansion. These variables are updated as

$$\begin{aligned} T_{r,k}\leftarrow \alpha T_{r,k}+ |\hat{y}_{i,k}\oplus y_{i,k}|, \; W_r\leftarrow \alpha W_r + 1, \end{aligned}$$
(1)

where \(0<\alpha < 1\) is a fading factor, \(y_{i,k}\) is the true value, and \(\hat{y}_{i,k}\) is the post-training prediction. The exclusive-or operation identifies the false positives and negatives, and the norm transforms it into an integer (zero or one). Each output variable under the rule \(R_r\) is associated with a linear predictor and a majority-vote predictor. The majority-vote predictor finds the binary label \(\hat{y}_{i,k}^r\) that occurred most often in the last n examples since the last expansion of rule r. Its purpose is to provide fast training convergence for the linear predictor.
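Equation (1) and the resulting error average \(e_{r,k}\) can be sketched as follows; the variable names are ours and the fading factor default is illustrative.

```python
def update_error(T_rk, W_r, y_hat, y_true, alpha=0.99):
    """Fading-factor update of the false positive/negative statistics (Eq. 1)."""
    T_rk = alpha * T_rk + (1 if y_hat != y_true else 0)  # |y_hat XOR y_true|
    W_r = alpha * W_r + 1
    return T_rk, W_r

T, W = 0.0, 0.0
T, W = update_error(T, W, y_hat=1, y_true=0)  # a misclassification
T, W = update_error(T, W, y_hat=1, y_true=1)  # a correct prediction
error = T / W  # online average e_{r,k}
```

With a fading factor below one, older mistakes are progressively discounted, so the average tracks the recent behaviour of the rule.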

A rule expansion event is the addition of a new literal to the antecedent \(A_r\) of a rule or the addition of a new rule to the rule set. This event corresponds to the creation of new partitions in the input space. Composing a literal requires computing the splitting value v of \(X_j=v\) and the literal sign. The splitting value maximizes the uniformity of the two resulting groups of examples in the output space. Rule expansion uses the extended binary search tree (E-BST) method with limited depth to find these elements of the literals [14].

This method is an adaptation from the multi-target regression (MTR) setting to the MLC problem. The main adaptation consisted of changing the splitting heuristics. The MTR version uses the mean variance ratio (MVR), which is also a measure of uniformity [8]. Instead of MVR, the MLC version uses the mean information gain (MIG) as the uniformity maximizing function, which is defined as

$$\begin{aligned} \textit{MIG}(X_j, v)=\frac{1}{|\mathcal {O}_r|}\sum _{u \in \mathcal {O}_r}^{} IG_u(X_j, v), \end{aligned}$$
(2)

where \(IG_u(X_j, v)\) is the information gain of splitting \(X_j\) given v and considering the output variable \(Y_u\). \(\mathcal {O}_r\) is the set of output variable indexes currently being considered by the rule \(R_r\). The information gain (IG) is defined as

$$\begin{aligned} IG_u(X_j,v)=H_u(E)-\frac{|E_{L}|}{|E|}H_u(E_{L}) -\frac{|E_{R}|}{|E|}H_u(E_{R}), \end{aligned}$$
(3)

where

$$\begin{aligned} H_u(\mathcal {E})= -[p\log (p) + (1-p)\log (1-p)] \end{aligned}$$

is the entropy of \(Y_u\), and p is the probability \(P(Y_u=1)\) considering the set of examples \(\mathcal {E}\). If the input variables are numerical, \(E_{L}=\{\mathbf{x }_i \in E: x_{i,j} \le v\}\) and \(E_{R}=\{\mathbf{x }_i \in E: x_{i,j} > v\}\). Considering nominal input variables, \(E_{L}=\{\mathbf{x }_i \in E: x_{i,j} = v\}\) and \(E_{R}=\{\mathbf{x }_i \in E: x_{i,j} \ne v\}\).
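Assuming the standard information-gain definition over binary outputs, Eqs. (2) and (3) can be sketched as follows; the function names are ours.

```python
import math

def entropy(ys):
    """Binary entropy H_u of one output over a set of example labels."""
    if not ys:
        return 0.0
    p = sum(ys) / len(ys)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def information_gain(xs, ys, j, v):
    """IG of splitting numeric attribute j at value v, for one binary output."""
    left = [y for x, y in zip(xs, ys) if x[j] <= v]
    right = [y for x, y in zip(xs, ys) if x[j] > v]
    n = len(ys)
    return (entropy(ys)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))

def mean_information_gain(xs, Y, outputs, j, v):
    """MIG: average IG over the outputs currently learned by the rule (Eq. 2)."""
    return sum(information_gain(xs, [y[u] for y in Y], j, v)
               for u in outputs) / len(outputs)
```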

The rule expansion procedure uses the Hoeffding bound [13] to determine the minimum number of examples n required to expand. The Hoeffding bound states that the true mean of a random variable \(\beta \), with range P, will not differ from the sample mean by more than \(\epsilon \) with probability \(1-\delta \). The Hoeffding bound is defined as \(\epsilon = \sqrt{\frac{P^2\ln {({1}/ {\delta })}}{2n}}\).

This procedure suggests several candidate splits \([ \textit{MIG}(X_j, v_1),{\ldots }, \textit{MIG}( X_j, v_c) ]\), which are sorted in decreasing order. The two best splits are compared through the difference \(\beta =\textit{MIG}(X_j, v_1)-\textit{MIG}(X_j, v_2)\), with \(P=1\) because the range of \(\beta \) is [0, 1]. If \(\beta >\epsilon \), then \(\textit{MIG}(X_j, v_1)\) is the best split with probability \(1-\delta \). A threshold \(\tau \) is defined to limit \(\epsilon \), avoiding indefinitely long waits when the two best splits are very similar. If \(\epsilon < \tau \), the split with the highest \(\textit{MIG}\) is selected and the expansion takes place. The literal sign is determined from the \(H(E_L)\) and \(H(E_R)\) values: if \(H(E_L)\le H(E_R)\), the sign is \(\le \); otherwise, it is >.
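The expansion decision can be sketched as follows; the default values for \(\delta\) and \(\tau\) are illustrative assumptions, not the paper's settings.

```python
import math

def hoeffding_bound(P, delta, n):
    """epsilon = sqrt(P^2 ln(1/delta) / (2 n))."""
    return math.sqrt(P ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_expand(mig_best, mig_second, n, delta=1e-7, tau=0.05):
    """Expand when the best split wins by more than epsilon, or when
    epsilon has shrunk below the tie-breaking threshold tau."""
    eps = hoeffding_bound(1.0, delta, n)  # P = 1 since the MIG gap lies in [0, 1]
    return (mig_best - mig_second) > eps or eps < tau
```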

The specialization on output variable subsets is now described. Let \(\mathcal {O}_r\) be the set of output indexes currently learned by rule \(R_r\), and let \(E_\textit{best}\) be the set of examples with the lowest entropy (\(E_R\) or \(E_L\)) in the splitting. \(\mathcal {O}_r'\) denotes the new specialized set of output variable indexes whose entropy is reduced after the split:

$$\begin{aligned} \mathcal {O}_r'=\left\{ u: u \in \mathcal {O}_r \wedge \frac{H_u(E_\textit{best})}{H_u(E)} < 1 \right\} . \end{aligned}$$
(4)

These selected outputs participate significantly in the splitting. The outputs that were not selected are kept in a complementary rule \(R_c\), which is also added to the rule set. Its learning outputs are given by \(\mathcal {O}_c' = \mathcal {O}_r {\setminus } \mathcal {O}_r' \). Before the expansion, the antecedents of \(R_c\) and \(R_r\) are equal. In the learning process, \(R_c\) learns only with \(\mathcal {O}_c'\) and \(R_r\) learns only with \(\mathcal {O}_r'\). Figure 1 illustrates the subset specialization of the outputs (see Sect. 3.1 for notation clarification). Here, each rule is associated with a model that is trained for a non-disjoint subset of outputs (e.g. {\(y_1\),\(y_3\)}, {\(y_2\),\(y_3\)}, {\(y_1\)}, {\(y_1\),\(y_2\),\(y_3\)}, etc.). The default rule predicts all outputs. In the prediction stage, the available predictions are used to produce the final prediction for \(Y_1,Y_2,Y_3\).
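Equation (4) and the construction of the complementary output set can be sketched as follows; the entropy dictionaries are assumed precomputed from the split statistics, and the function name is ours.

```python
def specialize(outputs, H_all, H_best):
    """Split a rule's learning outputs after expansion (Eq. 4): keep the
    outputs whose entropy decreases on the best split side; the rest go
    to a complementary rule with the same (pre-expansion) antecedent."""
    kept = [u for u in outputs if H_all[u] > 0 and H_best[u] / H_all[u] < 1]
    complementary = [u for u in outputs if u not in kept]
    return kept, complementary
```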

In the prediction stage, a weight measure is computed from the averages of false positives and negatives that were computed in the training stage. Only the rules \(R_r\) that cover the example are considered, \(\Lambda =\{r:R_r \in S(\mathbf x _i)\}\). A predictor that yields lower errors should receive a higher weight \(\theta _{r,k}\) for output k. Intuitively, the weight of a rule is defined as the inverse of its average of false positives and negatives, normalized by the sum of these inverses over all covering rules.
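The weight computation can be sketched as follows, with a small \(\varepsilon\) guarding against division by zero, in the spirit of Algorithm 2; the function name is ours.

```python
def prediction_weights(errors, epsilon=1e-9):
    """Weight each covering rule by the inverse of its error average,
    normalized over all covering rules; epsilon avoids division by zero."""
    inverses = [1.0 / (e + epsilon) for e in errors]
    total = sum(inverses)
    return [w / total for w in inverses]
```

For instance, a rule with a 0.1 error average receives three times the weight of a rule with a 0.3 error average.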

Algorithm 2 presents the prediction stage of the ML-AMRules method which performs specialization on output variable subsets.

figure b

The variable \(\varepsilon \) is a small positive number used to prevent numerical instabilities. If the output attribute \(Y_u\) cannot be predicted by any rule, the prediction is given by the default rule D.

From lines 6–9, the average of false positives and negatives for each output is computed. From lines 10–14, the averages associated with the output k are summed up. From lines 15–20, the weights and the aggregation variables are computed. Finally, the aggregation variables are converted into binary variables.

Local and global modes are now presented. The local mode is based on one instantiation of ML-AMRules classifiers for each output variable. This operation mode resembles the BR approach. Each output presents an independent rule set in the model training. The final prediction is yielded by concatenation of the individual predictions. Formally, all rules of one rule set present the same learning outputs \(\mathcal {O}_r\).

Figure 2 depicts the local operation mode.

In this illustration, each of the outputs \(y_1\),\(y_2\) and \(y_3\) is predicted by one set of rules and the default rule also predicts one output.

In regard to global mode, the rules are learned and predicted for all output attributes using one instantiation, as depicted in Fig. 3.

Here, the outputs \(y_1\), \(y_2\) and \(y_3\) are predicted together by each rule of the rule set. The learning outputs of each rule of the rule set comprise all outputs \(\mathcal {O}_r=\{1,{\ldots },L\}\).

3.3 Multi-label random rules

This section presents the extension of ML-AMRules to ensemble methods. The approach resembles the random forest principle, with decision tree classifiers replaced by rule-based classifiers. The motivation is that creating several different models for the same problem increases the probability of producing more accurate predictions. This method is designated Multi-Label Random Rules (ML-Random Rules).

ML-Random Rules starts with the initialization of an ensemble \(\mathcal {F}\) of M models \(\hat{f}_m,~m \in \{1,{\ldots }, M\}\), which correspond to ML-AMRules classifier instantiations.

When an example is received, each ML-AMRules classifier is trained using the online bagging method [21]. For each classifier, an integer weight is randomly drawn from a Poisson distribution with \(\lambda =1\) \((w_m \sim \textit{Poisson}(1))\). If \(w_m=0\), the classifier is not trained with the current example. Similarly to the ML-AMRules training stage, ML-Random Rules also involves the computation of the online average of false positives and negatives for weight computation. This measure is defined as \(e_m=C_m/N_m\). The variables \(C_m\) (cumulative counter) and \(N_m\) (examples observed since the last expansion) are updated with a fading factor \(0<\alpha < 1\). The ML-Random Rules training process is illustrated in Algorithm 3.
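The online bagging step can be sketched as follows; the Poisson sampler uses Knuth's inversion method, and the `train` interface of the base models is an assumption of this sketch.

```python
import math
import random

def poisson_one(rng):
    """Sample w ~ Poisson(lambda = 1) by inversion (Knuth's method)."""
    limit, k, p = math.exp(-1.0), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def online_bagging_train(ensemble, x, y, rng):
    """Each base model sees the example w times, w ~ Poisson(1);
    when w = 0 the model skips the example entirely."""
    for model in ensemble:
        for _ in range(poisson_one(rng)):
            model.train(x, y)
```

This mimics bootstrap sampling in the online setting: over a long stream, each model is trained on roughly one copy of each example on average, but with different effective samples.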

figure c

The prediction of ML-Random Rules is obtained through weighting the predictions of all classifiers. The weights \(\theta _m\) computation follows the same principle that was employed in Algorithm 2. The pseudocode of the prediction stage is presented in Algorithm 4.

figure d

From lines 6 to 8, the average of false positives and negatives for each classifier is computed. From lines 9 to 11, the weights and aggregation variables are computed for each output. Finally, the aggregation variables are converted into binary variables.

4 Experimental set-up

This section presents the evaluation tests of the proposed methods described in Sect. 3. The classification performance of ML-AMRules and ML-Random Rules operation modes (global, local and subset) described in Sect. 3.2 is compared to each other and to three online classifiers mentioned in Sect. 2.

The CC method is incorporated in the open-source MEKA platform, which includes both batch and online methods; its algorithms are implemented in the JAVA programming language and are based on WEKA [23]. BR, MHT and the proposed methods ML-AMRules and ML-Random Rules were implemented in the Massive Online Analysis (MOA) platform, an open-source platform of machine learning and data mining algorithms applied to data streams, also implemented in JAVA [3].

Table 1 Data sets description
Table 2 Accuracy for 20NG, mediamill and OHSUMED data sets

The real-world data sets 20NG, mediamill, ENRON, OHSUMED, SLASHDOT and yeast were used to simulate data streams. These data sets are extensively described in the literature and were obtained from the UCI data set repository [19]. The main features of the data sets are presented in Table 1.

The example-based performance measures Exact Match, Accuracy, Precision, Recall and F-measure were used [18]. Prequential evaluation was applied, where the methods first predict the output values and then use the example for training [11].
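The prequential (test-then-train) loop can be sketched as follows; the `AlwaysZero` model is a hypothetical placeholder, and the example-based accuracy metric shown is one of the measures listed above.

```python
class AlwaysZero:
    """Hypothetical placeholder model used only to illustrate the loop."""
    def predict(self, x):
        return [0, 0]

    def train(self, x, y):
        pass

def example_accuracy(y, y_hat):
    """Example-based Accuracy: |intersection| / |union| of the label sets."""
    inter = sum(1 for a, b in zip(y, y_hat) if a == 1 and b == 1)
    union = sum(1 for a, b in zip(y, y_hat) if a == 1 or b == 1)
    return inter / union if union else 1.0

def prequential(stream, model, metric):
    """Test-then-train: predict on each example first, then train on it."""
    scores = []
    for x, y in stream:
        scores.append(metric(y, model.predict(x)))
        model.train(x, y)
    return sum(scores) / len(scores)
```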

The sequence of example predictions was divided into 100 windows, and the above-mentioned measures were computed for each window. Since all algorithms use random initializations, 10 runs were performed for consistency. As validation, for each run, 100 combinations of the algorithm parameter values were tested. For all methods, the parameters that led to the best F-measure value were kept. Finally, the mean and the standard deviation of the measures over all windows were computed. A perceptron with a logistic activation function was used as the linear predictor by all methods due to its simplicity, low computational cost and low error rates [17]. ML-Random Rules was configured with 10 ML-AMRules instantiations, and the fading factor was 0.99. Friedman and Nemenyi post hoc tests were applied to the result tables to find groups of methods that differ significantly [7]. Both tests were performed at a 0.05 significance level.

Finally, evolution graphs were computed to observe the dynamics of the model training. Sliding windows with a length of 2000 examples and steps of 50 examples were used to obtain high resolution. For each window, the F-measure was computed. The F-measure was chosen because it is widely employed for global evaluation.

5 Results

In this section, the evaluation results are presented, organized by performance measure. Tables 2, 3, 4, 5, 6, 7, 8, 9, 10 and 11 present the Accuracy, Exact Match, Precision, Recall and F-measure values of the methods for each data set, respectively. Higher values mean better performance. The values in bold point out the best value for each data set. Each table is followed by the critical diagram used to evaluate the distinct groups of methods. Implicitly, the Friedman test revealed that the method ranks differed across data sets for all performance measures.

ML-AMR(S), ML-AMR(G) and ML-AMR(L) correspond to the subset, global and local ML-AMRules operation modes, respectively. Similarly, ML-RR(S), ML-RR(G) and ML-RR(L) correspond to the subset, global and local ML-Random Rules operation modes. The BR method was chosen as the baseline because of its simplicity. The last part of this section shows the dynamics of the training: the F-measure variation is depicted for all data sets and methods. Tables 2 and 3 show that the ML-AMR approaches present competitive accuracy values in comparison with the methods from the literature. The ML-RR approaches present the lowest values among all methods. The MHT method stands out in the experiments with the OHSUMED data set.

Table 3 Accuracy for ENRON, SLASHDOT and yeast data sets

Figure 4 displays the critical diagram related to Tables 2 and 3. This diagram shows that the ML-AMR approaches are in the best group and above the baseline method. The ML-RR approaches are at lower ranks.

Fig. 4
figure 4

Critical diagram for Accuracy. Nemenyi post hoc test at 0.05 significance

Tables 4 and 5 present very low Exact Match values for the mediamill data set due to the high number of possible labels (101 outputs). The ML-AMRules and ML-RR approaches show the same pattern as for the Accuracy measure.

Table 4 Exact Match for 20NG, mediamill and OHSUMED data sets
Table 5 Exact Match for ENRON, SLASHDOT and yeast data sets
Fig. 5
figure 5

Critical diagram for Exact Match. Nemenyi post hoc test at 0.05 significance

Figure 5 provides the critical diagram associated with Tables 4 and 5. Similarly to the previous diagram, the ML-AMR approaches are in the best groups and the ML-RR approaches at lower ranks. MHT method stands out in mediamill and OHSUMED-F data set experiments.

Tables 6 and 7 display favourable precision values for ML-AMRules approaches. On the other hand, ML-RR approaches present the lowest values.

Table 6 Precision for 20NG, mediamill and OHSUMED data sets
Table 7 Precision for ENRON, SLASHDOT and yeast data sets

Figure 6 presents the critical diagram of Tables 6 and 7. This diagram also shows that the ML-AMR approaches are among the best methods.

Fig. 6
figure 6

Critical diagram for Precision. Nemenyi post hoc test at 0.05 significance

Tables 8 and 9 exhibit predominance of the ML-AMRules approaches. The global approaches present better performance. The ML-Random Rules methods obtained the best results for ENRON and yeast data sets.

Table 8 Recall for 20NG, mediamill and OHSUMED data sets
Table 9 Recall for ENRON, SLASHDOT and yeast data sets

Figure 7 presents the critical diagram of Tables 8 and 9. This diagram also shows that the ML-AMR approaches are in the best groups.

Fig. 7
figure 7

Critical diagram for Recall. Nemenyi post hoc test at 0.05 significance

Tables 10 and 11 reveal a general positive behaviour of the ML-AMRules approaches in terms of F-measure. The MHT method stands out in the experiments with OHSUMED data set.

Table 10 F-measure for 20NG, mediamill and OHSUMED data sets
Table 11 F-measure for ENRON, SLASHDOT and yeast data sets
Fig. 8
figure 8

Critical diagram for F-measure. Nemenyi post hoc test at 0.05 significance

Figure 8 presents the critical diagram of Tables 10 and 11. This diagram also shows that the ML-AMR approaches are in the best groups.

The main observations of this set of results are:

  • ML-AMRules approaches are among the best methods and above the baseline method (BR).

  • The MHT stands out in some scenarios.

  • ML-AMRules performs better in global mode.

  • ML-Random Rules approaches are at low ranks.

The MHT method presents better performance for Accuracy and Exact Match. ML-AMR(G) presents the best performance for Recall. For the remaining measures, the performance values of these two methods are quite similar. In general, the mean values are close because all methods use the same predictor.

Figures 9, 10 and 11 present graphs that reveal the dynamics of the models training. Each graph presents the F-measure variation. These figures show overlapped curves which means that groups of methods present the same dynamics of training.

Fig. 9
figure 9

F-measure evolution curves of all methods for the 20NG data set

Figure 9 shows that the methods BR, CC, ML-AMR(L), ML-AMR(G) and ML-AMR(S) belong to the first higher values group of curves. MHT is the second isolated curve. ML-RR(L), ML-RR(G) and ML-RR(S) correspond to the third group of curves.

In Fig. 10, the methods BR, MHT, CC, ML-AMR(L), ML-AMR(G) and ML-AMR(S) are in the first, higher-valued group. ML-RR(L) and ML-RR(G) are in the second group. The ML-RR(S) method is isolated in the third group.

Fig. 10
figure 10

F-measure evolution curves of all methods for the mediamill data set

Figure 11 shows that the MHT method is isolated in the first higher values group.

Fig. 11
figure 11

F-measure evolution curves of all methods for the OHSUMED data set

The methods BR, CC, ML-AMR(L), ML-AMR(G) and ML-AMR(S) are in the second group, and the last group comprises the ML-RR(L), ML-RR(G) and ML-RR(S) methods.

The most relevant remarks of this set of results are:

  • ML-AMR approaches present similar behaviour.

  • The curves are almost parallel.

  • The initial training conditions seem to influence significantly the overall performance.

  • The ML-RR approaches produced initial training that led to lower performance.

In addition, the higher-level curves present higher learning rates. The curves present an increasing tendency, which means that the methods could reach higher performance values with larger data sets. In general, the curves are parallel, which means that they present almost the same training dynamics.

6 Conclusions

This paper is the result of preliminary work that suggests an MTR algorithm adaptation to MLC problems using rule learning methods. It can be concluded that the ML-AMRules approaches are competitive with online MLC algorithms from the literature, within the experimental context of this work.

The ML-Random Rules approaches did not present favourable results. The initial training conditions of this ensemble method may be the cause of the low performance. Another possible cause is the aggregation function, which usually influences the performance of ensemble methods.

The ML-AMRules global mode has shown itself to be competitive with the local and subset modes. However, its performance values do not stand out from those of the other operation modes. This means that the subset operation mode retains the advantage of lower resource requirements.

As future work, data sets with a larger number of examples will be used. For the ML-AMRules improvement, new heuristics will be implemented. The Random Rules ensemble methods will be improved by creating better aggregation functions.