1 Introduction

Many present-day applications operate in dynamic, drastically changing environments. In such settings, systems that can adapt to new conditions and evolve have a decisive advantage over static, monolithic structures. More specifically, in the area of knowledge discovery and supervised learning, models that can continually take advantage of new data as they become available have a significant edge over conventional static approaches.

Active learning is a characteristic paradigm of such a dynamic approach, constructing learning models that can fully adapt to new data. As opposed to conventional supervised learning, it allows the model, in other words the classifier, to interactively ask for supervision from an oracle (usually a human). The motivation is twofold: first, when dealing with learning tasks in domains with few labeled and abundant unlabeled data, this approach can effectively bypass the expensive task of labeling, since the classifier, based on some strategy, will only request manual annotation for a few characteristic instances. Second, this approach allows classifiers that receive data in a stream-like fashion to dynamically choose which of the new data should be annotated and used for their training (Zliobaite et al. 2011).

Furthermore, evolving predictive systems usually operate in domains where data arrives in real time, continuously and changing over time. In such settings, it is often infeasible to store all the data and it is critical to constantly update the model with new training data. Obtaining such data, though, can be costly. Active learning is a technology for making the best of an annotation budget in such cases.

There has been a substantial body of work on active learning for single-label classification in the literature (McCallum and Nigam 1998; Tong and Koller 2001; Settles 2010). However, this is not the case for multi-label learning, where each object can be associated with multiple labels simultaneously (Tsoumakas et al. 2012).

In this work, we present a general framework for applying active learning to multi-label tasks, studying the main aspects of an active learning model and discussing the key issues that need to be taken into account in such a configuration. We focus on the pool-based active learning scenario (Settles 2010), in which all unlabeled data are first evaluated and the model subsequently chooses which instances to annotate. Such an approach is suitable for a large number of real-world problems, such as text classification, image classification and retrieval, video classification, speech recognition and cancer diagnosis (Settles 2010; Zhang et al. 2014; Ye et al. 2015; Huang et al. 2015). Given that in a stream-based scenario new data most often arrive in batches that can essentially be treated with pool-based approaches, it is reasonable to assume that our work is, with minor adjustments, applicable to stream-based active learning as well.

An earlier and significantly shorter version of this work has been previously presented in Cherman et al. (2016). We here extend this line of work by substantially expanding our experiments: we consider thirteen data sets instead of the two of the previous paper, we employ two additional algorithms, and we consider transductive in addition to inductive inference for the experiments that use the remaining examples in the query pool for testing.

Furthermore, in this work we propose a novel aggregation function that evaluates examples to be picked for active labeling. This approach considers the scores and the ranking of labels delivered by a given algorithm to assess whether an example should be picked for active labeling. The motivation behind this approach is to identify the certainty of the algorithm in differentiating positive from negative labels for a given example. Our results show a consistent advantage of our proposed method over passive supervised learning as well as over the rest of the methods.

To summarize, the contributions of this work are as follows:

  • we present the key issues that have to be considered when applying active learning on multi-label data and we thoroughly describe the existing approaches regarding these issues in the literature (Sect. 2)

  • we propose a novel aggregation method regarding the evaluation and subsequent choice of the unlabeled instances to be manually annotated (Sect. 2.4).

  • we conduct extensive experiments on 13 multi-label data sets, with two multi-label algorithms and for both inductive and transductive inference, studying the performance and behavior of the different methods and approaches on a variety of conditions and comparing them with conventional passive supervised learning (Sect. 3).

2 Active learning for multi-label data

In this section, we first briefly present the concepts of active learning and multi-label learning and then focus on the key issues that need to be considered when attempting to apply active learning on multi-label data.

2.1 Active learning

In conventional supervised learning, the learner is passively given a set of labeled data points to be trained on. Active learning, on the other hand, permits the learner to interactively request supervision, in other words labeling, for data points of its own choice.

There are mainly three active learning approaches (Settles 2010; Aggarwal et al. 2014):

  1. membership query synthesis;

  2. stream-based;

  3. pool-based.

In the first case, the learner may query any unlabeled instance in the input space, including queries generated by the learner de novo (synthesis). In the second setting, data points become available continuously in a stream-like fashion, and therefore decisions about whether an unlabeled instance should be labeled or not are made individually or in small batches. The pool-based scenario assumes that a pool of unlabeled data is available from the onset of training. All instances from this unlabeled pool are evaluated before selecting which of them are to be labeled.

2.2 Multi-label learning

Unlike single-label or multi-class learning, multi-label learning concerns supervised learning tasks in which there exist multiple target variables and a subset of them can be assigned to an instance simultaneously. Formally, let D be a training set composed of N examples \(E_i = (\mathbf x _i,Y_i)\), \(i=1 \ldots N.\) Each example \(E_i\) is associated with a feature vector \(\mathbf x _{i}=(x_{i1},x_{i2}, \ldots , x_{iM})\) described by M features \(X_j\), \(j=1 \ldots M,\) and a subset of labels \(Y_i \subseteq L\), where \(L=\{y_1,y_2,\ldots ,y_q\}\) is the set of q labels. A multi-label learning task consists of generating a classifier H, which given an unlabeled instance \(E =(\mathbf x ,?)\), is capable of accurately predicting its subset of labels Y, i.e., \(H(E) \rightarrow Y\).

In the general setting, a multi-label learning model can produce a ranking of labels, relevance scores or marginal probabilities per label, or even the full joint probability distribution of labels per instance.

Multi-label learning methods are divided into two broad classes: algorithm adaptation and problem transformation methods (Tsoumakas et al. 2009). Methods in the first category extend specific single-label learning algorithms to deal with multi-label data directly. Methods in the second category transform a multi-label problem into one or more single-label problems, to which any traditional single-label learning algorithm can be applied. Binary Relevance (BR) is one of the most widely employed problem transformation methods; it decomposes the multi-label problem into q binary single-label problems, one for each label in L.
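To make the BR transformation concrete, the following minimal Python sketch (not the paper's Mulan/Java implementation) trains one independent binary classifier per label and stacks the per-label predictions. It assumes a binary label-indicator matrix Y of shape N × q, and the helper names fit_br and predict_br are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def fit_br(X, Y, base_factory=lambda: LogisticRegression(max_iter=1000)):
    """Train one independent binary classifier per label (Binary Relevance).

    Y is an N x q binary label-indicator matrix; each column must contain
    both classes for the base learner to be trainable.
    """
    return [base_factory().fit(X, Y[:, j]) for j in range(Y.shape[1])]


def predict_br(models, X):
    """Stack the q binary predictions into an N x q label-indicator matrix."""
    return np.column_stack([m.predict(X) for m in models])
```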

2.3 Manual annotation

A first key issue concerning an active learning system relates to the manual annotation of the instances selected by the learner. Most often, instances are annotated in batches; e.g., ground truth acquisition for the ImageCLEF 2011 photo annotation and concept-based retrieval tasks was achieved via crowd-sourcing in batches of 10 and 24 images (Nowak et al. 2011). An annotator can accomplish this task either instance-wise (for each instance, the annotator determines the relevancy of each label) or label-wise (for each label, the annotator determines the relevancy of each instance).

Let us consider a request for the annotation of n instances with q labels. Let \(c_o\) be the average cost of understanding an instance, \(c_l\) be the average cost of understanding a label and \(c_{lo}\) be the average cost of deciding whether an instance should be annotated with a particular label or not. Setting aside the cognitive and psychological aspects of the annotation process, such as our short-term memory capacity, a rough estimate of the total cost of instance-wise annotation will be given by:

$$\begin{aligned} n[c_o+q(c_l+c_{lo})]=nc_o+nqc_l + nqc_{lo} \end{aligned}$$

Similarly, a rough estimate of the total cost of label-wise annotation will be:

$$\begin{aligned} q[c_l+n(c_o+c_{lo})]=qc_l+nqc_o + nqc_{lo} \end{aligned}$$

Requiring that the cost of label-wise annotation be smaller than that of instance-wise annotation, we have:

$$\begin{aligned} qc_{l} + nqc_{o} + nqc_{lo}&< nc_{o} + nqc_{l} + nqc_{lo} \\ qc_{l} + nqc_{o}&< nc_{o} + nqc_{l} \\ n(q - 1)c_{o}&< q(n - 1)c_{l} \\ c_{o}&< \frac{q(n - 1)}{n(q - 1)}c_{l} \approx \frac{qn}{nq}c_{l} = c_{l} \end{aligned}$$

In other words, the choice of annotation approach largely depends on the instance and label understanding costs: label-wise annotation pays off roughly when understanding an instance is cheaper than understanding a label.
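As a hedged numerical illustration of the two cost estimates above, the snippet below plugs in made-up values for \(c_o\), \(c_l\) and \(c_{lo}\); since understanding an instance is assumed cheaper than understanding a label, label-wise annotation comes out cheaper, in line with the derived condition.

```python
def instance_wise_cost(n, q, c_o, c_l, c_lo):
    """n [c_o + q (c_l + c_lo)] = n c_o + n q c_l + n q c_lo"""
    return n * (c_o + q * (c_l + c_lo))


def label_wise_cost(n, q, c_o, c_l, c_lo):
    """q [c_l + n (c_o + c_lo)] = q c_l + n q c_o + n q c_lo"""
    return q * (c_l + n * (c_o + c_lo))


# Illustrative (made-up) costs: understanding an instance (c_o) is cheaper
# than understanding a label (c_l), so label-wise annotation wins here.
n, q, c_o, c_l, c_lo = 24, 99, 1.0, 2.0, 0.2
print(instance_wise_cost(n, q, c_o, c_l, c_lo))  # 5251.2
print(label_wise_cost(n, q, c_o, c_l, c_lo))     # 3049.2
```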

2.4 Evaluation of unlabeled instances

The most fundamental part of an active learning algorithm concerns the way it evaluates the informativeness of unlabeled instances. In a multi-label setting, the evaluation function (query) comprises two important parts:

  1. a scoring function to evaluate instance-label pairs; and

  2. an aggregating function to aggregate these scores.

Algorithm 1 shows the general procedure for a batch size of t, i.e., t examples are annotated in each round. The evaluation function query calculates the evidence value of each example \(E_i \in D_u\) and returns the t most informative instances, according to the evidence value used. In each round, these t examples are labeled by the oracle and included in the set \(D_l\) of labeled examples.

[Algorithm 1: the general pool-based multi-label active learning procedure (figure)]
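Since the original pseudocode is only available as a figure, the following sketch reconstructs the loop described above; train, query and oracle_label are placeholders for the classifier's training routine, Algorithm 2 and the human annotator, respectively.

```python
def active_learning_loop(D_l, D_u, t, rounds, train, query, oracle_label):
    """Pool-based multi-label active learning with batch size t (cf. Algorithm 1)."""
    for _ in range(rounds):
        model = train(D_l)                    # (re)train the multi-label classifier
        batch = query(model, D_u, t)          # the t most informative unlabeled instances
        for x in batch:
            D_u.remove(x)                     # move each queried instance ...
            D_l.append((x, oracle_label(x)))  # ... to the labeled set with its oracle labels
    return train(D_l)
```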

Algorithm 2 shows the query function of a multi-label active learning procedure. The scoring function considers instance-label pairs \((E_i, y_j)\) and evaluates the participation (\(e_{i,j}\)) of label \(y_j\) in instance \(E_i\). It returns an evidence value \(e_{i,j}\) for every instance \(E_i \in D_u\) and each label \(y_j \in L = \{y_1, y_2, \ldots , y_q\}\). The aggregating function considers the q evidence values \(e_{i,1}, e_{i,2}, \ldots , e_{i,q}\) of each instance \(E_i\) given by the scoring function and combines them into a single evidence value \(e_i\).

[Algorithm 2: the query function, combining scoring and aggregation (figure)]
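Along the same lines, a minimal sketch of the query function is given below; scoring and aggregating are placeholders for the strategies of Sect. 2.4, and the select_smallest flag is our own device to express that some scoring strategies are minimized while others are maximized.

```python
def query(model, D_u, t, scoring, aggregating, select_smallest=True):
    """Return the t most informative unlabeled instances (cf. Algorithm 2)."""
    evidence = []
    for x in D_u:
        e_ij = scoring(model, x)      # one evidence value e_{i,j} per label
        e_i = aggregating(e_ij)       # collapse the q values into a single e_i
        evidence.append((e_i, x))
    # Whether small or large aggregated values are preferred depends on the
    # scoring strategy (e.g. conf is minimized, rank is maximized).
    evidence.sort(key=lambda pair: pair[0], reverse=not select_smallest)
    return [x for _, x in evidence[:t]]
```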

The following three families of measures have been proposed in the literature for evaluating instance-label pairs (scoring):

  1. Confidence-based score (Brinker 2006; Esuli and Sebastiani 2009; Singh et al. 2010). The distance of the prediction's confidence from the average (boundary) value is used. The nature of this value depends on the bias of the learner; it could be a margin-based value (distance from the hyper-plane), a probability-based value (distance from 0.5) or other. The value returned by this approach represents how far an example is from the decision boundary between positive and negative examples. We are interested in examples that minimize this score. In the following, we will denote this method as conf.

  2. Ranking-based score (Singh et al. 2010). This strategy works as a normalization of the values obtained from the confidence-based strategy. The confidences given by the classifier are used to rank the unlabeled examples for each label. We are interested in examples that maximize this score. This score will be denoted by rank in the rest of the paper. (A minimal sketch of the conf and rank scores is given after this list.)

  3. Disagreement-based score (Hung and Lin 2011; Yang et al. 2009). Unlike the other approaches, this strategy uses two base classifiers and measures the difference between their predictions. We are interested in maximizing this score. The intuitive idea is to query the examples on which the classifiers disagree the most, as these could be the most informative. Three ways of combining the confidence values of the two base classifiers have been proposed in the literature:

    I. The Maximum Margin Reduction (MMR) criterion uses a major classifier that outputs confidence values and an auxiliary classifier that outputs decisions (positive/negative). The auxiliary classifier is used to determine how conflicting the predictions are.

    II. The Hamming Loss Reduction (HLR) approach considers a stricter notion of disagreement, using the decisions output by both classifiers to decide whether there is disagreement or agreement on each label prediction of an example.

    III. The soft Hamming Loss Reduction (SHLR) method tries to strike a balance between MMR and HLR through a function that defines the influence of each approach on the final score.
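As promised in the list above, a rough sketch of the conf and rank scores follows, under the assumption that the base classifier exposes per-label confidences in [0, 1] so that the decision threshold is 0.5; the exact normalization used for rank is our own choice and only illustrates the idea.

```python
import numpy as np


def conf_scores(proba):
    """Confidence-based score: distance of each per-label confidence in [0, 1]
    from the 0.5 decision threshold (smaller = more uncertain = preferred)."""
    return np.abs(np.asarray(proba) - 0.5)


def rank_scores(proba):
    """Ranking-based score: per label, rank the unlabeled pool by conf
    (most uncertain first) and normalize so that the most uncertain
    instance gets the largest value (larger = preferred)."""
    conf = conf_scores(proba)
    n, q = conf.shape
    scores = np.empty_like(conf)
    for j in range(q):
        order = np.argsort(conf[:, j])       # most uncertain first
        ranks = np.empty(n)
        ranks[order] = np.arange(1, n + 1)   # rank 1 = most uncertain
        scores[:, j] = (n - ranks + 1) / n   # normalize to (0, 1]
    return scores
```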

In the experiments, we do not consider the disagreement-based strategies, due to the inferior results obtained in previous work (Cherman et al. 2016). After the instance-label scores have been obtained, there are two main aggregation strategies for combining them into an overall instance score:

  1. averaging the instance-label scores across all labels (avg). Thus, given the q instance-label scores \(e_{i,j}\) of instance \(E_i\), the overall score of instance \(E_i\) is given by:

    $$\begin{aligned} e_i = aggregating_{avg}\left( \left\{ e_{i,j}\right\} _{j=1}^{q}\right) = \frac{\sum _{j=1}^q{e_{i,j}}}{q} \end{aligned}$$
  2. considering the optimal (minimum or maximum) of the instance-label scores (min/max), given by:

    $$\begin{aligned} e_i = aggregating_{min/max}\left( \left\{ e_{i,j}\right\} _{j=1}^{q}\right) = {min/max}\left( \left\{ e_{i,j}\right\} _{j=1}^q\right) \end{aligned}$$

    Note that for HLR, only the average aggregation strategy makes sense, as taking the maximum would lead to a value of 1 for almost all unlabeled instances and would not help in discriminating among them. We here propose a new aggregation strategy, which we denote as dev.

  3. dev is based on the differences (deviations) between the evidence values \(e_{i,j}\) of each instance. The motivation behind this strategy is that small differences between the evidence values of the labels predicted as positive and those of the labels predicted as negative indicate uncertainty in the prediction for the instance, which makes it a potential candidate for oracle active labeling (a minimal sketch is given after this list). Equation 1 defines the dev strategy.

    $$\begin{aligned} e_i = aggregating_{dev}\left( \left\{ e_{i,j}\right\} _{j=1}^{q}\right) = avgpos\left( \left\{ e_{i,j}\right\} _{j=1}^{q}\right) - firstneg\left( \left\{ e_{i,j}\right\} _{j=1}^{q}\right) \end{aligned}$$
    (1)

    where the avgpos function returns the average evidence value of the labels classified as positive and the firstneg function returns the evidence value of the label that is closest to being classified as positive but is actually classified as negative.
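A minimal sketch of the avg, min and dev aggregation strategies is given below; it assumes raw scores where a positive value means the label is predicted as positive (the convention dev relies on), and the handling of instances with no positive or no negative prediction is our own assumption.

```python
import numpy as np


def aggregate_avg(e):
    """avg: mean of the q instance-label evidence values."""
    return float(np.mean(e))


def aggregate_min(e):
    """min: optimal (here, minimum) instance-label evidence value."""
    return float(np.min(e))


def aggregate_dev(e, threshold=0.0):
    """dev = avgpos - firstneg on raw scores: mean score of the labels
    predicted positive minus the largest score among the labels predicted
    negative; lower values mean higher uncertainty and higher priority."""
    e = np.asarray(e, dtype=float)
    pos, neg = e[e > threshold], e[e <= threshold]
    if pos.size == 0 or neg.size == 0:
        return float("inf")   # no positive/negative split: lowest priority (our assumption)
    return float(pos.mean() - neg.max())
```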

Table 1 Illustrative example of the evaluation method conf
Table 2 Illustrative example of the evaluation method score
Table 3 Illustrative example of the evaluation method rank
Table 4 Illustrative example of the evaluation method conf with application of avg and MIN aggregation functions
Table 5 Illustrative example of the score evaluation method with application of avg, MIN and dev aggregation functions
Table 6 Illustrative example of the evaluation method rank with application of aggregation functions avg and MIN

Thus, the lower the value of the \(aggregating_{dev}\) function, the higher the instance's priority to be selected for oracle active labeling. We should note that this strategy is appropriate when applied directly to the raw score produced by the given classifier (in the rest of the paper we denote this approach as score) and is therefore not applicable to confidence- or ranking-based scoring strategies that manipulate the raw scores. To illustrate how the methods proceed, Tables 1, 2, 3, 4, 5 and 6 depict characteristic examples of each of the scoring and aggregation methods.

2.5 Experimental protocol

Besides the multi-label active learning strategies themselves, the way they are evaluated is another important issue to consider. Aspects to be considered include the size of the initial labeled pool, the batch size, the set of examples used for testing, the sampling strategy and the evaluation approach. Next, these aspects are described with references to previous work in the literature.

Regarding the initial labeled pool, different papers have built it in different ways. In Singh et al. (2010), the examples are chosen such that there is at least one positive and one negative example for each label. In Yang et al. (2009), 100–500 examples were selected randomly to compose the initial labeled pool. In Esuli and Sebastiani (2009), the chronologically first 100 examples were selected. In Brinker (2006), the author randomly chose ten examples to compose the initial labeled pool. Gao et al. (2016) randomly sample \(5\%\) of the instances from the unlabeled pool as initial labeled training data.

The batch size defines how many examples are queried in each round of active learning. In Singh et al. (2010) and Brinker (2006), only one example was queried per round. Esuli and Sebastiani (2009) chose 50 examples in each round, while Yang et al. (2009) performed experiments with both 50 and 20 examples. Finally, Gao et al. (2016) perform five-fold cross-validation to choose, for each data set, a batch size of between five and ten instances per round.

There are basically two different ways to define the test set. The first is to use a completely separate test set. This was followed by Esuli and Sebastiani (2009) and, though not explicitly mentioned, it seems to have also been followed by Brinker (2006). The second is to use the remaining examples in the unlabeled pool for testing. This approach was used by Singh et al. (2010), Yang et al. (2009) and Gao et al. (2016).

It is worth noting that the quality of the model assessed using this second approach holds for the examples in the unlabeled pool, and does not necessarily hold for new unlabeled data. Although there is a lack of discussion of this topic in the active learning literature, the decision of which evaluation approach to use depends on the nature of the application. Most learning applications are interested in building a general model from a training set of examples in order to predict new, future examples, i.e., this kind of application uses inductive inference algorithms to make its predictions. An experimental protocol using a separate test set is the correct evaluation approach for performance assessment in the inductive inference setting. The remaining evaluation approach is biased by the active learner, and hence the evaluation on these remaining examples will not be representative of the actual distribution of new unseen examples, as required for inductive inference.

However, there are active learning applications that aim to predict the labels of an a priori known, specific set of examples. For example, in a real-world personal image annotation scenario, the user would like to annotate some images of his/her collection and, after a few rounds of active learning, the system would annotate the remaining images in the collection (Singh et al. 2010). For such an application, the learning assessment should use the remaining examples in the query pool.

The learning curve is the most common evaluation approach used to assess active learning techniques. A learning curve plots the evaluation measure considered as a function of the number of new instance queries that are labeled and added to \(D_l\). Thus, given the learning curves of two active learning algorithms, the algorithm that dominates the other for most or all points along the learning curve is considered better. Besides the learning curve, Singh et al. (2010), Yang et al. (2009) and Esuli and Sebastiani (2009) also used the value of the evaluation measure at the end of a specific number of rounds to assess the active learning techniques.

3 Experiments

We here describe the experiments performed, presenting the data sets, evaluation measures, experimental setup and the relevant results. The active learning algorithms described in Sect. 2.4, as well as the active learning evaluation framework, were implemented under Mulan (Tsoumakas et al. 2011), a Java package for multi-label learning based on Weka. Our implementation is publicly available at http://www.labic.icmc.usp.br/pub/mcmonard/Implementations/Multilabel/active-learning.zip.

3.1 Data sets

We employed 13 data sets from different domains. Specifically, bibtex, cal500, corel16k, corel5k, emotions, enron, medical, scene, tmc2007 and yeast were obtained from Mulan's website, while llog and slashdot were obtained from Meka's website. Finally, ohsumed is a widely used data set that is a subset of the MEDLINE database from the years 1987–1991, with a labelset of the 23 Medical Subject Headings (MeSH) tags of the cardiovascular diseases group.

In Table 7, we show the data set statistics, with Instances denoting the total number of instances, Features the number of features and #Dist the number of distinct label sets. Similarly, |L| stands for the number of labels, Cardinality for the average number of labels of the examples in D, and Density for the average number of labels of the examples in D divided by |L|, while Min, Med and Max refer to the minimum, median and maximum label frequencies respectively. Finally, the first and third quartiles of the label distributions are represented by 1Q and 3Q.

Table 7 Statistics of the data sets used throughout the experiments

3.2 Evaluation measures

For the evaluation of the multi-label classification models, we employed three measures in total: Micro-F, Macro-F and Ranking Loss. The first two essentially constitute two different averaging schemes of the F-measure used for single-label classification. Specifically, the F-measure is defined as

$$\begin{aligned} F\text {-}measure =\frac{2 T_P}{2 T_P + F_P + F_N} \end{aligned}$$
(2)

with \(T_P\) denoting the true positives, \(F_P\) the false positives and \(F_N\) the false negatives. The F-measure combines Precision and Recall, being their harmonic mean, and takes values between zero and one, with F-measure = 1 signifying perfect classification.

The Micro-F and Macro-F measures are defined as the micro- and macro- averages of the F-measure respectively:

$$\begin{aligned} Micro\text {-}F = \frac{2 \times \sum \nolimits _{l=1}^{|L|}{tp_l}}{2 \times \sum \nolimits _{l=1}^{|L|}{tp_l}+\sum \nolimits _{l=1}^{|L|}{fp_l}+\sum \nolimits _{l=1}^{|L|}{fn_l}} \end{aligned}$$
(3)
$$\begin{aligned} Macro\text {-}F = \frac{1}{|L|} \sum \limits _{l=1}^{|L|}{\frac{2 \times tp_l}{2 \times tp_l+fp_l+fn_l}} \end{aligned}$$
(4)

We note that, by definition, Micro-F typically favors frequent labels while Macro-F is more influenced by rare labels.
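For completeness, a small sketch of the two averaging schemes over binary label-indicator matrices Y (ground truth) and Y_hat (predictions), both of shape N × q; the small constant guarding against empty labels is our own addition.

```python
import numpy as np


def micro_macro_f(Y, Y_hat):
    """Micro-F and Macro-F over N x q binary matrices Y (truth) and Y_hat (prediction)."""
    Y, Y_hat = np.asarray(Y), np.asarray(Y_hat)
    tp = np.logical_and(Y == 1, Y_hat == 1).sum(axis=0).astype(float)
    fp = np.logical_and(Y == 0, Y_hat == 1).sum(axis=0).astype(float)
    fn = np.logical_and(Y == 1, Y_hat == 0).sum(axis=0).astype(float)
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    per_label = 2 * tp / np.maximum(2 * tp + fp + fn, 1e-12)  # guard against empty labels
    return micro, per_label.mean()
```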

Finally, Ranking Loss expresses the average fraction of label pairs in which an irrelevant label is ranked higher than a relevant label, and is defined as:

$$\begin{aligned} Ranking\text{-}Loss = \frac{1}{N}\sum _{d=1}^{N} \frac{1}{|Y_{d}|\,|\overline{Y_{d}}|} \left| \left\{ (y_{a},y_{b}) : r_{d}(y_{a}) > r_{d}(y_{b}),\ (y_{a},y_{b}) \in Y_{d} \times \overline{Y_{d}} \right\} \right| \end{aligned}$$
(5)

In the experiments, we consider the respective learning curves for each of the above measures and use the Final Value (FV) (Yang et al. 2009) and the Area Under the Learning Curve (AULC) (Settles and Craven 2008). Specifically, FV represents a measure's value at the last iteration of the learning curve for each active learning method, while AULC is calculated by summing over all points of the learning curve, since we are using a discrete curve whose points are equally spaced on the x-axis (the number of iterations) for all learning curves.
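Given a learning curve stored as a list of measure values at equally spaced iterations, FV and AULC reduce to the following trivial computations, matching the description above.

```python
def final_value(curve):
    """FV: the measure's value at the last iteration of the learning curve."""
    return curve[-1]


def area_under_learning_curve(curve):
    """AULC: plain sum over the equally spaced points of the discrete curve."""
    return sum(curve)
```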

3.3 Setup

As mentioned earlier, the multi-label active learning algorithms are instantiated with two functions:

  1. a scoring function to evaluate instance-label pairs; and

  2. an aggregating function to aggregate these scores.

Three strategies were considered for the scoring function:

  • Confidence-based score (conf)

  • raw score (score)

  • Ranking-based score (rank)

For the aggregation function, two strategies were considered:

  • Average (avg)

  • Deviation (dev)

The other strategies were not considered since they exhibited inferior results in Cherman et al. (2016). In Table 8, we present the combinations of the scoring and aggregating functions employed throughout our experiments.

Table 8 Different combinations of multi-label active learning strategies considered in this work

With respect to the multi-label learning algorithms used throughout the experiments, we employed both an inductive and a transductive algorithm. Specifically, the inductive algorithm, Binary Relevance with linear SVMs as binary classifiers (BR-SVMs), is used in the experiments with the separated protocol. For the remaining protocol, we chose to employ a multi-label classification algorithm with transductive inference, LPBHN with RCut. Even though both algorithms could be used for the remaining protocol, a transductive algorithm is better suited to the nature of that protocol, which calls for transductive inference.

BR-SVMs were implemented based on the LIBLINEAR library. This implementation is optimized to handle sparse data efficiently and effectively, a crucial feature for active learning experiments due to the time cost involved in training a large number of models: we need to train one model for each label and for each iteration of the active learning procedure. In addition, the library can output normalized probability values for the predictive confidence. In our experiments, we kept all SVM parameters at their default values, setting \(C=1\), \(e=0.01\) and employing the L2-loss SVC dual solver.

LPBHN was proposed by Rossi et al. (2013). The algorithm is based on graphs, more specifically on the Gaussian Fields and Harmonic Functions (GFHF) algorithm, and is optimized for sparse data. Since the algorithm originally outputs a ranking of labels for each new instance to be predicted, we employed the RCut ranking strategy (Yang 2001) in order to apply a threshold. This method chooses the t first labels of the ranking, with t being the closest integer to the training data set's cardinality (average number of labels of the examples). Table 9 presents the active learning settings used in our experiments and presented in this work.

Table 9 Active learning settings used in experiments
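A minimal sketch of the RCut thresholding step described above, assuming scores holds the per-label ranking scores of a single instance (higher means more relevant) and cardinality is the average number of labels per example in the current training set; forcing at least one label is our own assumption.

```python
import numpy as np


def rcut(scores, cardinality):
    """Assign the t top-ranked labels, t being the nearest integer to the
    training set's cardinality (average number of labels per example)."""
    t = max(1, int(round(cardinality)))       # at least one label (our assumption)
    top = np.argsort(scores)[::-1][:t]        # indices of the t highest-scoring labels
    assignment = np.zeros(len(scores), dtype=int)
    assignment[top] = 1
    return assignment
```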

All experiments were performed using tenfold cross validation. In the transductive context, where the remaining protocol is used, the independent test partitions of each of the ten folds are discarded, since the remaining examples of the query set are used to evaluate the predictive performance.

Finally, the initial labeled pool of examples was built by randomly choosing examples until \(N_{ini} \times q\) positive single labels had been gathered, i.e., until \(\sum _{i=1}^{|D_l|} {|Y_i|} \ge N_{ini} \times q\), where \(N_{ini}\) is user-defined. This strategy allows for a fairer comparison across the data sets. We used \(N_{ini}=1, 5\) in order to evaluate the influence of different sizes of the initial labeled pool.
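A minimal sketch of this construction of the initial labeled pool, assuming Y is the binary label matrix of the pool and examples are drawn without replacement; the stopping rule follows the condition above.

```python
import numpy as np


def initial_pool(Y, n_ini, seed=0):
    """Randomly add examples until at least n_ini * q positive labels are collected."""
    rng = np.random.default_rng(seed)
    n, q = Y.shape
    chosen, positives = [], 0
    for i in rng.permutation(n):
        chosen.append(i)
        positives += int(Y[i].sum())
        if positives >= n_ini * q:
            break
    return chosen
```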

3.4 Results

In this section we present the results of our experiments. We show the learning curves and the results in terms of Micro-F and Macro-F (for the separated protocol) and Ranking Loss (for the remaining protocol) for all data sets, using the FV and AULC measures. The random method represents the baseline (passive learning) strategy and the score dev method refers to our proposed approach, while the rest of the methods are the ones previously proposed in the literature.

3.4.1 Experiments with the separated test protocol: inductive inference

Figure 1 presents the learning curves for the algorithms considered in this work using the bibtex data set. The learning curves for the rest of the data sets are presented in an on-line appendix.

Fig. 1

Learning curves for BR-SVMs and the separated protocol on bibtex. The left plots are for \(N_{ini} = 1\) while those on the right are for \(N_{ini} = 5\)

From the plots, we can easily observe that our proposed method score dev consistently outperforms conventional supervised learning (random) in all four considered scenarios. The score avg method also shows a steady advantage over random in all scenarios, except for \(N_{ini}=1\) with the Micro-F measure, in which case it performs worse than all other methods. The other active learning methods show mixed results, failing to exhibit a steady advantage over the random method in any scenario.

Table 10 FV represents the performance value for the last iteration of the learning curve for each active learning method

Table 10 shows the results for all data sets and the separated protocol, in terms of Micro-F and Macro-F. As we can see, several of the active learning methods have FV values equal to or greater than those of the random method. The conf avg method, however, is inferior to random for \(N_{ini} = 1\) for both evaluation measures (Micro-F/Macro-F) and for \(N_{ini} = 5\) for Micro-F. The other active learning methods have average ranking values higher than that of random in all cases.

An important aspect to be evaluated in an active learning method is its consistency in outperforming the random method, since, unlike the evaluation of standard learning algorithms, one does not have the data labeled beforehand, and the purpose of active learning is to obtain good labeled examples. Thus, it is not possible to predict the effectiveness of the various active learning methods beforehand in a way similar to the one followed for standard supervised learning algorithms. Hence, in an effort to measure the stability of the active learning methods, the better or equal to random value is displayed in the last line of Table 10. This value refers to the percentage of data sets on which the active method was better than or equal to the random method.

score dev presents the best stability values considering Micro-F as the evaluation measure. This method obtained results better than or equal to random for 100 and 92% of the data sets for \(N_{ini} = 1\) and \(N_{ini} = 5\), respectively. score dev, along with rank avg and conf avg, also presented the best stability value for Macro-F and \(N_{ini} = 5\), with 92% of cases better than or equal to random. In the scenario with Macro-F and \(N_{ini} = 1\), rank avg and score avg presented the best stability values with 92%, followed by the score dev method with 85%.

In summary, conf avg did not perform satisfactorily regarding stability in any scenario. The rank avg method showed the best stability value in 2 out of 4 scenarios, the score avg method also had the best stability value in 2 out of 4 scenarios, and the score dev method, our proposed method, exhibited the best result in 3 out of 4 cases, that is, it presented the best stability ratio.

Figure 2 presents the average ranking of each method for the FV measure, together with a Friedman test and a Nemenyi post hoc test at a 95% significance level to identify statistically significant differences between the methods (Demšar 2006).

Fig. 2

Friedman ranking with Nemenyi post-test for the BR SVMs and FV measure. Experimental protocol: separated

Although there are consistent differences between the active learning methods and the random baseline, no active method showed a statistically significant improvement over the random method. The only difference observed is between score dev and conf avg for \(N_{ini}=1\) and Micro-F, which again indicates the difficulty of conf avg in obtaining satisfying results.

In Table 11, we present the results using AULC as the evaluation measure.

Table 11 AULC values for each active learning method using the separated experimental protocol

rank avg, score avg and score dev show better average rankings than random for all setups and evaluation measures, whereas conf avg has a worse average ranking than random for Micro-F for both \(N_{ini}\) parameter choices.

Regarding stability (equal to or better than random), we observe behavior similar to that for the FV measure. For \(N_{ini}=1\), the score dev method was the only one better than or equal to random on all data sets, for both Micro-F and Macro-F. For the scenario with \(N_{ini}=5\) and Micro-F, the score dev method showed the best stability (85%). The score avg method was best in the scenario with \(N_{ini}=5\) and Macro-F, where it was better than or equal to the random method on 92% of the data sets. The other three methods showed a stability of 85%.

Figure 3 shows the average ranking of each method for the AULC measure. Again, no active method showed a statistically significant improvement over random. Considering AULC as the evaluation measure, significant differences were found only between score dev and conf avg in terms of Micro-F for \(N_{ini}=1\), and between score avg and conf avg in terms of Macro-F, again for \(N_{ini}=1\).

Fig. 3

Friedman ranking with Nemenyi post-test for BR-SVMs and the AULC measure. The separated experimental protocol is followed

3.4.2 Experiments with the remaining test protocol: transductive inference

The experiments performed using the remaining protocol simulate applications that intend to annotate examples in the context of transductive inference, i.e., applications in which the test data is observed a priori. Inductive inference methods, such as BR-SVMs, could also be used in this context. However, inductive inference is intended to solve a more general problem than is necessary in that case. Also, we should note that there are cases in which transductive inference may be more effective, such as a scenario with extremely few labeled examples and with all unlabeled data available beforehand.

In this round of experiments we employed the LPBHN algorithm, a transductive multi-label inference algorithm that outputs a ranking of labels, additionally using the RCut method to apply a threshold on the ranking and obtain a hard assignment of labels for each test instance.

The Ranking Loss measure was used to evaluate the quality of predictions, since LPBHN outputs rankings. As mentioned earlier in the paper, we also used the RCut method in this experimental setup in order to obtain a hard assignment of labels from the rankings. We remind the reader that RCut selects the t first labels from the ranking, with t being the nearest integer to the training set cardinality. Since in the active learning setting the number of labeled examples is reduced, we expect that the estimation of the cardinality from the training set could be less precise in our case.

Figure 4 presents the learning curve for the corel5k data set. The figures for the learning curves of the other data sets are, as with the plots for the separated protocol, presented in an on-line appendix. From the plots we can observe that score dev is consistently superior to both random and score avg for the two considered scenarios. score avg, on the other hand, fails to consistently outperform the passive method, especially for \(N_{ini}=5\).

Fig. 4

Learning algorithm: LPBHN-RCut. Experimental protocol: remaining. Data set: corel5k

Tables 12 and 13 present the results for the remaining protocol for all data sets, for the FV and AULC evaluation measures, respectively. score dev shows the best average ranking and the best stability (better than or equal to random) in all scenarios (85%). Furthermore, score avg has lower stability under the remaining protocol than it exhibited under the separated protocol.

Table 12 Results for the FV measure in the last iteration of the learning curve for each active learning method with the remaining protocol
Table 13 Results for the AULC measure of the learning curve for each active learning method with the remaining protocol

Finally, Fig. 5 depicts the average ranking of each method together with the statistical significance tests performed. Again, score dev shows the best average ranking values in all scenarios, with a statistically significant difference from random when AULC is considered as the evaluation measure. Additionally, we can observe a statistically significant difference between score dev and score avg when considering AULC for \(N_{ini}=1\).

Fig. 5

Friedman test with a Nemenyi post-test for the LPBHN classifier and for the AULC and FV measures. The remaining protocol was used.

4 Conclusions

Dealing with learning tasks in constantly changing environments requires approaches that, rather than relying on static and passive learning models, enable the learning system to evolve effectively when prompted with new data. Active learning is such an approach, enabling a given classifier to actively choose which of the new data will be manually annotated for training. In this manner, apart from reducing annotation costs and requiring fewer training examples, the classifier is capable of evolving to better represent new data.

Although active learning for single-label learning has been a well-investigated topic of research, this is not the case for multi-label learning. In this work, we discussed key issues in pool-based multi-label active learning based on previous work in the literature. We presented the main approaches regarding the scoring and aggregation strategies of multi-label active learning and proposed a novel aggregation approach, called score dev. We implemented all previously existing approaches, as well as our method, in a common framework and performed extensive experimental comparisons with two different multi-label learning algorithms, on thirteen multi-label data sets and under two different application settings (inductive and transductive).

The results under two different evaluation protocols, an inductive and a transductive learning scenario with BR-SVMs and LPBHN as base classifiers respectively, show a consistent advantage of our aggregation method dev with score as the scoring approach, compared to the rest of the methods and to conventional passive learning, followed by the average aggregation strategy, again with score as the scoring approach. It should be noted that the raw score used by dev is advantageous since one does not need to normalize it or define thresholds to calculate it. The rank and conf strategies, on the other hand, obtain their results from scores manipulated or normalized by the base classifiers, which makes them more dependent on the base classifier.