
1 Introduction

Machine Learning (ML) has found applications across a wide range of domains and implicitly impacts nearly every aspect of today's life. Still, one of the most limiting factors for the successful application of ML is the absence of labels for a training set. Obtaining a labeled dataset usually requires domain experts, who are rare and costly. Reducing the effort of the manual labeling task is therefore a prime objective. For example, the average cost for the common labeling task of reliably segmenting a single image is 6.40 USD.

Reducing the amount of necessary human input into the process of generating labeled training sets is of utmost importance to make ML projects possible and scalable. A standard approach to reduce the number of required labels without compromising the quality of the trained ML model is to exploit Active Learning (AL). The approach consists of an iterative process of selecting exactly those unlabeled samples for labeling by the domain experts that benefit the to-be-trained model the most. Given a small initial labeled dataset \(\mathcal {L}= \{(x_i,y_i)\}_{i=1}^n\) of n samples \(x_i\) with the respective labels \(y_i\) and a large unlabeled pool \(\mathcal {U}= \{x_i\}, x_i \not \in \mathcal {L}\), an ML model called learner \(\theta \) is trained on the labeled set. A query strategy then subsequently chooses a batch of \(b\) unlabeled samples \(Q\), which will be labeled by the human experts and added to the set of labeled data \(\mathcal {L}\). This AL cycle repeats \(\tau \) times until a stopping criterion is met.
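To make the cycle concrete, the following is a minimal sketch of the generic AL loop described above; the learner, the query strategy, and the labeling oracle are placeholder callables chosen for illustration, not the concrete components of ImitAL.

```python
# Minimal sketch of the generic AL cycle: train the learner on L, query a
# batch Q of b unlabeled samples, have the oracle label it, and repeat.
# The query strategy and the oracle are placeholder callables (assumptions).
import numpy as np
from sklearn.neural_network import MLPClassifier

def active_learning_loop(X_labeled, y_labeled, X_unlabeled, oracle,
                         query_strategy, b=5, n_cycles=25):
    learner = MLPClassifier(hidden_layer_sizes=(100, 100), max_iter=500)
    for _ in range(n_cycles):                      # tau AL cycles
        if len(X_unlabeled) == 0:                  # stopping criterion: pool exhausted
            break
        learner.fit(X_labeled, y_labeled)          # train theta on L
        query_idx = query_strategy(learner, X_labeled, X_unlabeled, b)
        X_query = X_unlabeled[query_idx]           # batch Q of b samples
        y_query = oracle(X_query)                  # human expert labels Q
        X_labeled = np.vstack([X_labeled, X_query])        # L <- L union Q
        y_labeled = np.concatenate([y_labeled, y_query])
        X_unlabeled = np.delete(X_unlabeled, query_idx, axis=0)
    return learner
```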

The challenge of applying AL is an almost paradoxical problem: how to decide which samples are most beneficial to the ML model without knowing their labels, since predicting these labels is exactly the task the to-be-trained ML model is supposed to learn.

During the past years, many different AL query strategies have been proposed, but to our knowledge, none excels consistently across a large number of datasets from different application domains. By deliberately focusing on domain-independent AL strategies we aim to shed some light on this problem. Even though various extensive general survey papers [14, 15, 20] exist, no clearly superior AL strategy has been identified. The results of the individual evaluations in papers with newly proposed AL strategies suggest that current AL strategies highly depend on the underlying dataset domain. Even more interestingly, the naïve baseline of randomly selecting samples often achieves surprisingly competitive results [7, 9, 11, 13].

Fig. 1. General overview of the training procedure of ImitAL

At their core, the vast majority of AL strategies rely on the same two simple heuristics: informativeness and representativeness. The first favors samples that foremost improve the classification model, whereas the latter favors samples that represent the overall sample distribution in the feature vector space. Most recent AL strategies add more layers of complexity on top of the two heuristics in their purest form, often resulting in excessive runtimes. This renders many AL strategies unusable in large-scale and interactively operating labeling projects, which are exactly those projects that would benefit the most from "optimal" learning strategies.

We present ImitAL, a novel AL strategy, which at its core is a Neural Network (NN) trained on a very large number of simulated AL episodes with the goal of optimally combining the basic AL heuristics informativeness and representativeness. As it is not practically feasible to enumerate all possible real-world datasets as training data in the simulations, we approximate them by using synthetic datasets instead. The benefit of synthetic datasets is that we can leverage the knowledge of all labels to construct an optimal AL strategy, which then serves as the training basis for ImitAL. Our work therefore falls under the category of "learning AL strategies". To our knowledge, our approach is, in contrast to similar works [9, 11, 13], the first to solely utilize purely synthetic data to train the strategy. We present a pre-trained, ready-to-apply AL strategy which can be applied in any domain without any further transfer learning or fine-tuning.

We start in Sect. 2 by presenting our synthetic dataset simulation process, followed by our Imitation Learning (IL) procedure in Sect. 3. In Sect. 4, we compare ImitAL with 7 common AL strategies on 13 real-world datasets and conclude in Sect. 5.

2 Simulating AL on Synthetic Training Data

For the IL training procedure of ImitAL we need an expert AL strategy which the neural network behind ImitAL can learn to imitate. In order to capture the characteristics of "all" possible datasets, we pursue the idea of initially generating a nearly "infinite" number of synthetic datasets and computing an optimal AL strategy on them, leveraging the knowledge of the full labels of the synthetic datasets.

We construct the nearly-optimal strategy by selecting for labeling those samples that result in the highest accuracy (in the following called reward) when added to the set of labeled samples \(\mathcal {L}\). As this process is computationally heavy, we do not consider all possible batches, but perform a pre-selection based on a heuristic that selects a promising and diverse set of the top-\(k\) candidates. Details of the pre-selection are explained in Sect. 3.
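As an illustration, the following sketch shows how such a reward could be computed for a single candidate action; the function name and signature are illustrative assumptions, and the accuracy is measured on a held-out split of the synthetic dataset.

```python
# Sketch of the reward for one candidate action x: retrain a copy of the
# learner on L extended by the candidate (its label is known, since the data
# is synthetic) and measure the accuracy on a held-out test split.
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score

def reward_if_labeled(learner, X_lab, y_lab, X_cand, y_cand, X_test, y_test):
    model = clone(learner)                                  # fresh, unfitted copy of theta
    model.fit(np.vstack([X_lab, np.atleast_2d(X_cand)]),
              np.concatenate([y_lab, np.atleast_1d(y_cand)]))
    return accuracy_score(y_test, model.predict(X_test))   # the reward r_x
```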

The result of the AL simulation for each AL cycle \(t\) on a specific synthetic dataset is a state-action-reward triple. The state \(s\) is represented as a triple \(s=(\mathcal {U}^t, \mathcal {L}^t, \theta ^t)\), consisting of the set of unlabeled samples \(\mathcal {U}^t\), the set of labeled samples \(\mathcal {L}^t\), and the state of the learner model \(\theta ^t\) trained on \(\mathcal {L}^t\). The corresponding action set \(\boldsymbol{a}_{s}\) contains the pre-selected queries \(x\), and the respective reward set \(\boldsymbol{r}_{s}\) contains a reward \(r\) for each of these actions. The optimal choice \(Q_{s}^{t} \in \boldsymbol{a}_{s}\) for the AL cycle \(t\) can easily be computed from the given accuracies: it is the action with the highest future accuracy. This simulation is repeated \(\alpha \) times using different synthetic datasets. The accumulated state-action-reward triples, denoted as \((\mathcal {S}, \mathcal {A}, \mathcal {R})\), then form the input for the IL training procedure of the NN of ImitAL. The whole synthetic training data generation and the training of ImitAL have to be done only once; afterwards, ImitAL is applicable to real-world datasets without any further transfer learning or fine-tuning steps.
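The following sketch outlines how the \((\mathcal {S}, \mathcal {A}, \mathcal {R})\) triples could be accumulated over \(\alpha \) synthetic datasets and \(\tau \) AL cycles; make_classification merely stands in for a synthetic dataset generator, the candidate actions are drawn at random here (the distance-based pre-selection of Sect. 3.3 is omitted for brevity), and reward_if_labeled refers to the sketch above. All names are illustrative assumptions.

```python
# Sketch of accumulating the (S, A, R) triples over alpha synthetic datasets
# and tau AL cycles, reusing reward_if_labeled from above. Candidate actions
# are drawn at random here; the pre-selection of Sect. 3.3 is omitted.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def simulate(alpha=10, tau=10, k=20, b=5, seed=0):
    rng = np.random.default_rng(seed)
    states, actions, rewards = [], [], []
    for run in range(alpha):
        X, y = make_classification(n_samples=300, n_features=8, random_state=run)
        X_pool, X_test, y_pool, y_test = train_test_split(
            X, y, test_size=0.5, random_state=run)
        # start with one random labeled sample per class
        lab = [int(rng.choice(np.flatnonzero(y_pool == c))) for c in np.unique(y_pool)]
        unlab = [i for i in range(len(X_pool)) if i not in lab]
        learner = MLPClassifier(hidden_layer_sizes=(100, 100), max_iter=500)
        for _ in range(tau):
            learner.fit(X_pool[lab], y_pool[lab])
            cand = rng.choice(unlab, size=min(k, len(unlab)), replace=False)
            r = [reward_if_labeled(learner, X_pool[lab], y_pool[lab],
                                   X_pool[i], y_pool[i], X_test, y_test)
                 for i in cand]
            states.append((list(unlab), list(lab)))   # s ~ (U^t, L^t); theta^t follows from L^t
            actions.append(cand)
            rewards.append(r)
            best = cand[np.argsort(r)[-b:]]           # expert: the b highest-reward actions
            lab += best.tolist()
            unlab = [i for i in unlab if i not in set(best.tolist())]
    return states, actions, rewards
```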

3 Training a Neural Network by Imitation Learning

The final step of ImitAL is to use the generated state-action-reward triples \((\mathcal {S}, \mathcal {A}, \mathcal {R})\) to train an NN as the AL query strategy. To this end, we employ the ML technique of IL [12], in which demonstrated expert actions are replicated by the ML model. The training task for ImitAL is to find patterns in the presented actions.

In the following, we first explain the IL process (Sect. 3.1), followed by the details of the NN input and output encoding (Sect. 3.2), and lastly the necessary pre-selection process (Sect. 3.3).

3.1 Imitation Learning

For training ImitAL we use Imitation Learning (IL), where an expert demonstrates an optimal strategy which the neural network behind ImitAL learns to replicate. We use behavioral cloning [12] as the variant of IL, which reduces IL to a regular regression ML problem. The desired outcome is a trained strategy returning the optimal action for a given state. We use the state-action-reward set \((\mathcal {S}, \mathcal {A}, \mathcal {R})\) to extract an optimal strategy \(\hat{\pi }\), which we then demonstrate to the to-be-trained network: \( \hat{\pi }(s) = \mathop {\textrm{argmax}}\nolimits _{ x\in \boldsymbol{a}_{s}, r_x\in \boldsymbol{r}_{s} }{\left( r_x\right) } \). For a given state \(s\), the action set \(\mathcal {A}\) contains all pre-selected actions \(\boldsymbol{a}_{s}\) for this state; the reward set \(\mathcal {R}\) contains the respective rewards \(\boldsymbol{r}_{s}\). As the optimal strategy only contains the optimal actions, it can be used to construct the optimal batch by taking the \(b\) highest-rewarded actions. In other words: we train a network \(\widehat{\pi }\) that predicts, for a given state \(s\) and a possible action \(x\in \boldsymbol{a}_{s}\) – which equals labeling the sample represented by this action – the reward \(r_x\in \boldsymbol{r}_{s}\). The expected future accuracy is in our case demonstrated by the true reward function \(\dot{r}\), defined as \( \dot{r}(s, x) = r_{x} \).
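To illustrate the behavioral-cloning view, the sketch below extracts, for each recorded state, the demonstrated expert action and a per-action regression target from \((\mathcal {S}, \mathcal {A}, \mathcal {R})\); normalizing the rewards so that they sum to one, to match the softmax output layer of Sect. 3.2, is one possible design choice and should be read as an assumption.

```python
# Sketch of turning (S, A, R) into behavioral-cloning training data: the
# demonstrated expert action is the argmax over the rewards, and the per-state
# regression target is the reward of every pre-selected action.
import numpy as np

def expert_action(actions_s, rewards_s):
    """Action the expert pi-hat demonstrates for one state s."""
    return actions_s[int(np.argmax(rewards_s))]

def regression_targets(rewards):
    """One target vector per state: the rewards of the k pre-selected actions,
    normalized to sum to one so they match a softmax output layer (Sect. 3.2)."""
    targets = np.asarray(rewards, dtype=float)
    return targets / targets.sum(axis=1, keepdims=True)
```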

3.2 Neural Network Input and Output Encoding

Before using the state-action-reward set \((\mathcal {S}, \mathcal {A}, \mathcal {R})\) to train the network predicting the future accuracy, we first transform it into a fixed-size vector representation using feature encoding, thus making ImitAL dataset-agnostic. NNs are limited by the number of neurons to a fixed-size input; recurrent NNs, which circumvent this limitation, often suffer from memory loss, where the beginning of the input sequence is forgotten due to exploding or vanishing gradients [6]. This problem occurs more frequently the larger and the more length-varying the input is, which is the case for AL: depending on the number of unlabeled samples, the raw action set \(\mathcal {A}\) may contain many possible samples or just a few, again varying drastically. This underpins the already mentioned pre-selection method, which reduces the number of possible actions to a fixed size.

Fig. 2. Pre-selection process and action meaning for ImitAL, example for \(j\) = 4, \(k\) = 6, and \(b\) = 3, and encoding of a state-action-triple

The transformation of the state-action-reward set into a suitable form for the network is called input and output encoding. Figure 2 displays the general procedure of the encoding on the right. We chose a listwise input encoding, where we enter the \(k\) possible actions \(x\in \boldsymbol{a}_{s}, |\boldsymbol{a}_{s} |= k\) at once into the network, in contrast to a pointwise encoding, where a single action is entered at a time. This has the benefit of enabling the network to compare each possible action relative to the others, allowing ImitAL to make batch-aware AL query decisions. Batch-awareness is, as thoroughly explained in [8], a beneficial and desirable property of AL query strategies, meaning that the joint diversity of all samples of the final AL batch \(Q\) is taken into account. The input of the network is defined by the vector \(\mathbb {I}_{s}=\{E(x, s)|x\in \boldsymbol{a}_{s}\}\), with \(E\) being the encoding of the action \(x\). The output \(\mathbb {O}= (\widehat{r}_1, \ldots , \widehat{r}_k)\) of the network consists of exactly \(|\boldsymbol{r}_{s} |\) output neurons, one for each of the predicted accuracies \(\widehat{r}_{x}\) for the respective possible actions \(x\). The number of output neurons therefore equals the number of pre-selected actions: \(|\boldsymbol{r}_{s} |= |\boldsymbol{a}_{s} |= k\). We use a final softmax layer of \(k\) output neurons, one per possible action. The \(b\) highest output neurons indicate the samples for the unlabeled query \(Q\).
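A minimal sketch of such a listwise network is given below, assuming PyTorch; the hidden layer sizes are illustrative assumptions, while the input width of \(5 \cdot k\) and the \(k\) softmax outputs follow the encoding described here.

```python
# Sketch of the listwise network: the 5-tuple encodings of all k pre-selected
# actions enter the network at once, and a softmax layer with k outputs scores
# every action; the b highest-scoring actions form the query Q.
import torch
import torch.nn as nn

K = 20        # number of pre-selected actions k
ENC_DIM = 5   # size of the per-action encoding E(x, s)

listwise_net = nn.Sequential(
    nn.Linear(K * ENC_DIM, 128),   # hidden layer sizes are illustrative assumptions
    nn.ReLU(),
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Linear(128, K),
    nn.Softmax(dim=-1),            # one output neuron per pre-selected action
)

def select_query(encoded_state, b=5):
    """Indices of the b highest-scoring actions for one encoded state of shape (K, ENC_DIM)."""
    with torch.no_grad():
        scores = listwise_net(encoded_state.reshape(1, K * ENC_DIM))
    return torch.topk(scores.squeeze(0), k=b).indices
```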

A single action represents an unlabeled sample \(x\in \mathcal {U}\). The input encoding function \(E(x, s)\) defines on what basis the network can make the AL query decision. We use the state \(s=(\mathcal {U}, \mathcal {L}, \theta )\) to calculate the encoding, a 5-tuple whose individual components are explained in the following paragraphs: \( E(x, s) =\left( u_1(x, \theta ), u_2(x, \theta ), u_3(x, \theta ), dl(x, \mathcal {L}), du(x, \mathcal {U})\right) \). The complete network input vector \(\mathbb {I}_{s}\) thus consists of \(5 \cdot k\) values, an encoded 5-tuple for each unlabeled sample \(x\) out of the set of possible actions \(\boldsymbol{a}_s\): \( \mathbb {I}_{s} = \{E(x_1, s), \ldots , E(x_k, s)\}, x\in \boldsymbol{a}_s\). As mentioned in the beginning, a good AL query strategy takes informativeness as well as representativeness into account. Informativeness is derived by \(u_i(x, \theta )\), a function computing the uncertainty of the learner \(\theta \) for the i-th most probable class of the sample \(x\in \mathcal {U}\), given the probability of the learner \(P_{\theta }(y|x)\) of classifying \(x\) with the label y:

$$\begin{aligned} u_i(x, \theta ) = {\left\{ \begin{array}{ll} P_{\theta }\Bigl (\left( \mathop {\textrm{argmax}}\limits _{y,i} P_\theta (y|x)\right) \Bigm |x\Bigr ), &{} \text {if } i \le C\\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(1)

\(\mathop {\textrm{argmax}}\limits _{\_~,i}\) denotes the i-th maximum argument, and \(C\) the number of classification classes.

For representativeness we compute \(dl(x, \mathcal {L})\) and \(du(x, \mathcal {U})\), the first denoting the average distance to all labeled samples, the latter the average distance to all unlabeled samples: \( dl(x, \mathcal {L}) = \frac{1}{|\mathcal {L}|} \sum _{x_l \in \mathcal {L}} d(x,x_l), \quad du(x, \mathcal {U}) = \frac{1}{|\mathcal {U}|} \sum _{x_u \in \mathcal {U}} d(x,x_u), \) where \(d(x_1,x_2)\) is an arbitrary distance metric between the points \(x_1\) and \(x_2\). We use the Euclidean distance for small feature vector spaces and recommend the cosine distance for high-dimensional feature vector spaces. Both feature encoding functions represent the most raw, unpreprocessed forms of informativeness and representativeness, as the neural network should learn the necessary transformations of the feature vector space itself.
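A sketch of the resulting encoding function, assuming an sklearn-style learner that exposes predict_proba and using the Euclidean distance; the function name is an illustrative assumption.

```python
# Sketch of the per-action encoding E(x, s) = (u_1, u_2, u_3, dl, du), assuming
# an sklearn-style learner exposing predict_proba and the Euclidean distance.
import numpy as np

def encode_action(x, learner, X_labeled, X_unlabeled):
    proba = np.sort(learner.predict_proba(x.reshape(1, -1))[0])[::-1]
    u = np.zeros(3)
    u[:min(3, len(proba))] = proba[:3]                   # u_1, u_2, u_3 (zero-padded if C < 3)
    dl = np.linalg.norm(X_labeled - x, axis=1).mean()    # average distance to labeled samples
    du = np.linalg.norm(X_unlabeled - x, axis=1).mean()  # average distance to unlabeled samples
    return np.concatenate([u, [dl, du]])                 # the 5-tuple fed into the network
```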

3.3 Pre-selection

Instead of considering all possible actions, we pre-select promising actions whose samples have the largest diversity and are the furthest away from each other, similar to [8]. The pre-selection fulfills two objectives: first and foremost, we can present a fixed number of actions to the network, and secondly, it keeps the runtime of the simulations within a processable range. A positive side effect of the fixed-size input of the network is the low and static runtime of ImitAL, which is almost independent of the size of the dataset. This effect is especially apparent on very large datasets.

We start the pre-selection by randomly drawing \(j\) possible actions \(\{\widetilde{a}_1, \dots , \widetilde{a}_j\}\), with each \(\widetilde{a}\) being a subset of \(\mathcal {U}\). After that, we use a heuristic to select the top-\(k\) most promising actions \(\boldsymbol{a}\) out of the random ones. The pre-selection process is illustrated on the left side of Fig. 2.

We use a heuristic to filter out potentially uninteresting actions: we calculate, for each possible action set \(\widetilde{a}\), the average distance of its samples to the already labeled samples and select the action set \(\boldsymbol{a}\) with the highest accumulated distance: \( \boldsymbol{a}=\mathop {\textrm{argmax}}\nolimits _{\widetilde{a}} \sum _{x\in \widetilde{a}}dl(x, \mathcal {L})\), where \(\widetilde{a}\) contains \(k\) unlabeled samples. Thus, we ensure that we sample evenly from each region of the sample vector space during the training process. We compute the heuristic only for the \(j\) randomly drawn possible batches instead of all possible subsets of \(\mathcal {U}\).
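A sketch of this pre-selection heuristic; names and signatures are illustrative assumptions.

```python
# Sketch of the pre-selection: draw j random candidate sets of k unlabeled
# samples and keep the set with the largest summed average distance to L.
import numpy as np

def preselect(X_unlabeled, X_labeled, j=10, k=20, seed=0):
    rng = np.random.default_rng(seed)
    def avg_dist_to_labeled(x):
        return np.linalg.norm(X_labeled - x, axis=1).mean()     # dl(x, L)
    candidate_sets = [rng.choice(len(X_unlabeled), size=k, replace=False)
                      for _ in range(j)]
    scores = [sum(avg_dist_to_labeled(X_unlabeled[i]) for i in cand)
              for cand in candidate_sets]
    return candidate_sets[int(np.argmax(scores))]               # indices of the k chosen actions
```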

4 Evaluation

The goal of ImitAL is to learn a domain-independent AL strategy from synthetic datasets, which combines the strengths of the basic informativeness and representativeness heuristics. For evaluation, we therefore compare ImitAL with 7 AL strategies on 13 real-world datasets from varying domains.

4.1 Experiment Details

The datasets are from the UCI ML Repository [1] and vary in sample size, feature size, and application domain, similar to the evaluations of [7, 9, 13]. As an additional larger dataset, the table classification dataset DWTC [2] was also included. For the experiments we started with a single random sample per class and ran the AL loop with a batch size \(b\) of 5 for 25 cycles, or until all data was labeled. We repeated this 1,000 times with varying initial labeled samples to generate statistically stable results. As the learner model \(\theta \), a simple NN with 2 hidden layers of 100 neurons each was used. The datasets were split randomly into a 50% train and a 50% test set.

As evaluation metric we used the area under the learning curve (AUC), as was also done recently in the AL survey by Chan et al. [20]. This makes it easy to compute the mean over the 1,000 repeated experiments. Similarly to [5], we further normalize the AUC values by the maximum possible AUC value – a rectangle of 100% F1-scores for each time step – to additionally enable comparisons across datasets.
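For illustration, the normalization amounts to the following computation; whether the area is taken as a step-wise sum or a trapezoid is an implementation detail, and the step-wise sum is assumed here.

```python
# Sketch of the normalized F1-AUC metric: the area under the F1 learning curve
# divided by the maximum possible area (100% F1 at every AL iteration).
import numpy as np

def normalized_f1_auc(f1_per_iteration):
    f1 = np.asarray(f1_per_iteration, dtype=float)
    auc = f1.sum()                  # step-wise area under the learning curve
    max_auc = 1.0 * len(f1)         # rectangle of 100% F1 at every iteration
    return auc / max_auc            # comparable across datasets
```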

The training of ImitAL is highly parallelizable, as the generation of the synthetic datasets and the respective AL simulations can run completely in parallel. For a full training of ImitAL with the best parameters we needed 100,000 computation jobs, resulting in a set of 1,000,000 state-action pairs as training data. In total, \(\sim \)1M CPU-hours were needed for all experiments conducted for this paper, including testing different NN and IL configurations and training the final version of ImitAL. For the final version of ImitAL we set the number of simulated AL cycles \(\tau \) to 10, the pre-sampling parameter \(k\) to 20, and \(j\) to 10 during training and to 2 during application, as this suffices for a trained ImitAL. The batch size was fixed to a standard value of 5 for the used UCI datasets.

4.2 Comparison with Other Active Learning Strategies

Our evaluation compares 7 AL strategies against our AL strategy ImitAL. The results are shown in Table 1. Each displayed value is the mean of the F1-AUC values over the 1,000 repeated runs. As the percentages are often quite similar, we additionally include the ranks. The displayed percentages are rounded, but the ranks are computed on the exact numbers, which can lead to different ranks for otherwise equally rounded percentages.

Table 1. F1-AUC-scores (%) for different AL query strategies, mean for 1,000 repeated experiments each, including the ranks and the ranked mean. Empty cells indicate no calculable results within the maximum runtime window of seven days.

We included Least Confidence (LC) and Uncertainty Entropy (Ent) [17], the two most common and basic variants of the informativeness heuristic, which greedily select the most uncertain samples based on the classification probability of the learner model. The Graph Density (GD) strategy [3] was added as a purely representativeness-based strategy that solely focuses on sampling evenly from the vector space. BatchBALD [8] is a popular AL strategy which works well for computer vision deep neural networks. Querying Informative and Representative Examples (QUIRE) [7] is a computationally expensive combination of both heuristics, and Query-by-Committee (QBC) [16] a combination of the uncertainty of multiple learner models [10].

ImitAL learns a combination of the two heuristics informativeness and representativeness. For the datasets fertility, flag, german, and heart, GD is much better than LC. This is an indication that on these datasets a pure informativeness heuristic is challenged the most, whereas on the other datasets LC still seems to be the safest bet as a general-purpose AL strategy. ImitAL successfully learned to combine the best of both strategies, which can especially be seen in its superior performance on these datasets. QBC achieved quite competitive results, but at the cost of an almost twice as high runtime as ImitAL due to the expensive retraining of multiple learner models instead of a single one. We could not reproduce the good results from the original QUIRE and BatchBALD papers. Additionally, the runtime of QUIRE was so high that not even one AL experiment finished within seven days. Due to the pre-selection with our chosen parameters, ImitAL always considers a fixed amount of 40 unlabeled samples during each AL iteration, making it 10 times faster than even the second-fastest strategy, LC, which has to consider all unlabeled samples.

We also performed a significance test to show that ImitAL is not better merely by chance, but statistically significantly better than the competitors. We used a Wilcoxon signed-rank test [19] with a confidence level of 95% to calculate the proportional wins/ties/losses between ImitAL and each competing strategy. For each of the 1,000 starting points, we took the F1-values of all 25 AL iterations of the two strategies to compare. Our null hypothesis is that the mean of both learning curves is identical. If the null hypothesis holds, we count this experiment repetition as a tie, and otherwise as a win or loss depending on which strategy performed better according to the mean. Due to lack of space we omit the table with the results for all datasets, but overall, ImitAL won at least 35% more often against each strategy than it lost. It also has to be noted that the majority of the direct comparisons resulted in a tie, with a total amount of 55%.
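The win/tie/loss counting can be sketched as follows; scipy.stats.wilcoxon implements the signed-rank test, while the surrounding bookkeeping and the zero_method choice are illustrative assumptions.

```python
# Sketch of the win/tie/loss counting between two strategies over the repeated
# runs; each run contributes the paired per-iteration F1 values of both strategies.
import numpy as np
from scipy.stats import wilcoxon

def compare_strategies(curves_a, curves_b, alpha=0.05):
    """curves_a, curves_b: arrays of shape (n_repetitions, n_iterations)."""
    wins = ties = losses = 0
    for f1_a, f1_b in zip(curves_a, curves_b):
        _, p_value = wilcoxon(f1_a, f1_b, zero_method="zsplit")   # handles identical pairs
        if p_value >= alpha:                    # null hypothesis not rejected -> tie
            ties += 1
        elif np.mean(f1_a) > np.mean(f1_b):
            wins += 1
        else:
            losses += 1
    return wins, ties, losses
```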

5 Conclusion

We presented a novel approach of training a universally applicable AL query strategy on purely synthetic datasets by encoding AL as a listwise learning-to-rank problem. For training, we chose IL, as relying on synthetic datasets makes it cheap to generate a huge amount of training data. Our evaluation showed that ImitAL successfully learned to combine the two basic AL heuristics informativeness and representativeness by outperforming both heuristics and other AL strategies over multiple datasets of varying domains. In the future, we want to include more requirements of large ML projects into the state-encoding of ImitAL to make it even more broadly applicable.