1 Introduction

The artificial intelligence community has witnessed great progress owing to deep learning, whose success heavily relies on the quality and volume of accurately annotated datasets. To ease the pressure of such costly labeling work, numerous researchers have investigated active learning (AL) (Settles, 1995), which aims to achieve as high a performance gain as possible while labeling as few samples as possible. A popular setting in AL is pool-based AL (Settles, 1995), where a fixed number of samples selected by a selector are sent to an oracle for labeling iteratively until the sampling budget is exhausted. Pool-based AL has a wide range of applications, including but not limited to semantic segmentation (Cai et al., 2021) and object detection (Haussmann et al., 2020).

Most existing pool-based AL frameworks (Joshi et al., 2009; Luo et al., 2013; Yoo & Kweon, 2019; Kirsch et al., 2019; Parvaneh et al., 2022) assume that the oracle is perfect, i.e., that it always provides accurate labels for the selected samples. However, due to inherent label ambiguity and noise, we cannot expect such a “perfect” oracle to exist in real-world applications (Fang & Zhu, 2012). To apply AL in a more practical way, we turn to a new type of imperfect oracle, which provides the selected samples with a special but prevailing form of weak label, i.e., the partial label. A partial label of an instance, essentially a set of candidate labels that includes the true label, is intuitively adaptable to various real-world tasks, including image retrieval (Cour et al., 2011) and face recognition (Zeng et al., 2013). With the full potential of partial labels seen in these real-world scenarios, partial-label learning (PLL) has naturally emerged and boomed in the community (Feng & An, 2018; Wang et al., 2019; Zhang et al., 2022). Motivated by the industrial and academic value of PLL, we propose a new setting for AL, i.e., active learning with partial labels (ALPL). Formally, ALPL is built on a pool-based AL problem but with only one imperfect oracle that assigns partial labels to samples. Figure 1 illustrates the pipelines of AL and ALPL. Compared with AL, the oracle in ALPL provides noise-tolerant partial labels instead of the exact true label when annotating confusing objects, greatly improving labeling efficiency while easing the annotation pressure on the oracle. Such a relaxation of the annotation requirement can be valuable in real-world tasks. For instance, in a medical diagnosis problem, the experts may sometimes be uncertain about the disease pathogen in an image, but they can still provide a group of reliable options. Another example is face recognition, which aims to learn a recognition system from online images associated with text captions and video scripts. Here, a face image is often labeled with multiple names since a caption or script usually contains several of them. Given the great potential of ALPL, we believe it is a fascinating topic in this community that merits deep investigation.

Fig. 1 Comparison of pool-based AL (blue arrow) and our proposed ALPL framework (red arrow). The core difference between these two settings is the label form provided by the oracle (Color figure online)

To address ALPL, we first focus on building a group of promising baselines by adopting the RC loss (Feng et al., 2020), one of the state-of-the-art milestones in PLL (Lv et al., 2020; Wen et al., 2021; Zhang et al., 2022), to train the predictor with the given partial labels from the oracle. By doing so, we establish a robust baseline for ALPL that can be seamlessly integrated into various pool-based AL frameworks. Though encouraging and effective, ALPL with RC loss, like all AL frameworks, confronts the inevitable overfitting challenge (Chen et al., 2006; Perez & Wang, 2017; Shorten & Khoshgoftaar, 2019) since only a few annotated samples are available during training. Besides, this simple baseline also falls short in selecting representative samples with partial labels during the query process.

To move toward better prediction, we turn to an interesting concept from cognitive science named counter-examples (CEs). According to the mental models in cognitive science (De Neys et al., 2005; Verschueren et al., 2005; Johnson-Laird, 2010), humans are able to assess the deductive validity of an inference with the help of CEs, leading to accurate conclusions. Inspired by such an adversarial working mechanism, we aim to excavate useful knowledge from CEs to address ALPL by guiding the predictor to deduce in an explicit way. Firstly, we construct CEs for the predictor by directly inverting the partial labels of the selected samples. Building upon the proposed CEs, we propose a simple but effective WorseNet to learn in a way complementary to the predictor. To this end, we propose the Worse loss, which combines the inverse RC (IRC) loss and a Kullback–Leibler divergence (KLD) regularization, to guide WorseNet to learn from the inverse partial labels of CEs. Figure 2 illustrates the overall framework. Compared with the predictor, WorseNet possesses lower confidence toward the labels inside the partial label.

Based on the complementary learning pattern between WorseNet and the predictor, we propose to take advantage of the predicted probability gap between these two networks to separately improve the evaluating and selecting process (shown in Fig. 2). To improve the predicting accuracy, we treat the class with the maximum distribution gap, rather than the maximum predictor score, as the predicted true label during the evaluation. On the other hand, we propose to enhance the sample selector by focusing solely on labels with positive probability gaps, as these labels predominantly cover the true label. This narrows down the range for calculating the uncertainty score, thereby refining the selection process and reducing uncertainty. Consequently, we propose three new selectors in ALPL by adopting this selecting strategy. Experimental results on benchmark-simulated and real-world datasets validate the effectiveness and superiority of our proposed WorseNet in improving both the selector and the predictor in ALPL. Our main contributions are summarized here:

  • We propose, for the first time, a practical setting, i.e., active learning with partial labels (ALPL), to economically facilitate the annotation process for the experts. We also provide a solid baseline on top of any AL approach to address ALPL.

  • We turn to exploring and exploiting the learning pattern from counter-examples (CEs), and propose a simple but effective WorseNet to explicitly improve the predictor and the selector in ALPL in a complementary manner.

  • Experimental results on four benchmark datasets and five real-world datasets show that our proposed WorseNet achieves promising performance gains over the compared baseline methods, achieving state-of-the-art performance in ALPL.

2 Related work

2.1 Pool-based active learning

According to the query type between the oracle and the predictor, active learning (AL) can normally be divided into membership query synthesis, stream-based query, and pool-based query (Settles, 1995). Pool-based AL, where the selector decides which samples to annotate from a large pool of unlabeled data, has attracted many scholars from academia and industry because of its huge potential value in practical applications. With the development of deep learning, pool-based AL has correspondingly evolved from model-driven to data-driven approaches.

For the prevailing model-driven category, the selector heavily relies on handcrafted features or metrics to query the data. Uncertainty sampling, the most widely used metric for the selector, aims to pick out the samples with low confidence from the predictor. Such uncertainty is often modeled in the following three ways: the posterior probability of a predicted class (Lewis & Catlett, 1994), the margin between the posterior probabilities of the predicted class and the second most likely class (Roth & Small, 2006), or the entropy (Luo et al., 2013). Furthermore, all these uncertainty metrics can be improved, though at a higher computational cost, by using Monte Carlo Dropout with multiple forward passes based on Bayesian inference (Gal et al., 2017; Kirsch et al., 2019). Some methods also modeled the impact of a selected sample on the current model through Fisher information (Settles et al., 2007), mutual information (Gal et al., 2017; Kirsch et al., 2019), or expected gradient length (Ash et al., 2020). Specifically, Ash et al. (2020) proposed to select samples that are disparate and of high magnitude in a hallucinated gradient space constructed from the model parameters of the predictor. Another important metric for the selector is diversity sampling, which aims to select representative and diverse samples so that the predictor can better learn from the dataset. To this end, some methods using discrete optimization (Yang et al., 2015) focused on sample subset selection, while Nguyen and Smeulders (2004) aimed at mining out the center points of subsets by clustering.

In the data-driven category, the selector, often equipped with deep models, is trained to learn features or metrics automatically. To this end, some methods adopted a generative model-based selector, such as a VAE or GAN, to learn to distinguish unlabeled samples from labeled ones (Sinha et al., 2019; Kim et al., 2021). Moreover, some methods turned to adopting or designing data augmentation to help the selector better learn the input space (Parvaneh et al., 2022). Yoo and Kweon (2019) introduced an auxiliary deep network that predicts the “loss” of unlabeled samples, selecting the samples with large “loss” to help the query process.

2.2 Active learning with imperfect oracle

Most works in AL assumed that the oracle always yields the accurate label, overlooking the fact that the oracle may not be infallible in some real-world applications. Therefore, a few researchers have investigated AL with an imperfect oracle, where the oracle could provide a wrong (noisy) label to the selected sample (Donmez & Carbonell, 2008; Du & Ling, 2010; Yan et al., 2016; Chakraborty, 2020). Early works (Donmez & Carbonell, 2008) assumed that there were two oracles in the system, with one always returning the correct label while the other returned an incorrect label with a fixed probability. Du and Ling (2010) modeled a human-like oracle that provides noisy labels for the samples with low confidence from the predictor. Yan et al. (2016) studied a case where the oracle could choose to return incorrect labels or abstain from labeling. Some works (Chakraborty, 2020) focused on active learning with multiple noisy oracles and formulated the query process as a constrained optimization problem. In this paper, we work towards a new setting for active learning with only one imperfect oracle involved in the query process, which annotates the selected samples with partial labels.

2.3 Partial-label learning

In this part, we give a concise introduction to the two mainstream strategies for partial-label learning (PLL), i.e., the averaged-based strategy (ABS) and the identification-based strategy (IBS). The method proposed in this paper belongs to ABS.

ABS treats all candidate labels equally and then averages the model outputs of all candidate labels for evaluation. Some non-parametric methods (Hüllermeier & Beringer, 2006; Gong et al., 2017) focused on predicting the label by using the outputs of its neighbors. Moreover, some approaches (Cour et al., 2009; Yao et al., 2020) concentrated on leveraging the labels outside the candidate set to discriminate the potential true label. Some recent works (Feng et al., 2020; Lv et al., 2020; Wen et al., 2021) focused on the data generation process and proposed a classifier-consistent method based on a transition matrix. Wen et al. (2021) proposed a family of loss functions, introducing a leverage parameter to consider the trade-off between losses on partial labels and non-partial labels.

IBS focuses on identifying the most probable true label from the candidate label set to eliminate label ambiguity. Early works treated the potential true label as a latent variable, optimizing the objective function by the maximum likelihood criterion (Liu & Dietterich, 2014) or the maximum margin criterion (Yu & Zhang, 2016). Later, many researchers engaged in leveraging the representation information of the feature space to generate a score for each candidate label (Wang et al., 2019; Zhang et al., 2022). Zhang et al. (2022) proposed to use the class activation map, which discriminates the learning pattern of the classifier, to distinguish the potential true label from the candidate label set.

3 Preliminaries

3.1 Symbols and notations on pool-based AL

Pool-based AL depicts a learning process where the performance gain of the system is achieved through active interaction between the human and the target predictor. Formally, we are given a set of training samples \({\mathbb {X}}= \{{\varvec{x}}_i\}_{i=1}^n \in \mathbb {R}^{d}\) with a total number of n, which is initially split into a small set of labeled samples \({\mathbb {L}}= \{{\varvec{x}}_i\}_{i=1}^{l} \in \mathbb {R}^{d}\) and a large pool of unlabeled samples \({\mathbb {U}}= \{{\varvec{x}}_i\}_{i=1}^{u} \in \mathbb {R}^{d}\). Note that here d denotes the input dimension, and \({\mathbb {U}}\cup {\mathbb {L}}= {\mathbb {X}}, {\mathbb {U}}\cap {\mathbb {L}}= \varnothing \). Let \({\mathbb {Y}}= \{1,2,\ldots ,k\} \in \mathbb {R}\) denote the label space with k classes, and \(y_i \in {\mathbb {Y}}\) denote the ground truth for each \({\varvec{x}}_{i}\). A classifier (predictor) \(f: \mathbb {R}^{d} \rightarrow \mathbb {R}^{k}\) is then trained by using the original labeled samples \({\mathbb {L}}\). Afterwards, a specifically-designed selector \(\Psi ({\mathbb {L}},{\mathbb {U}},f)\) evaluates the samples in \({\mathbb {U}}\) and selects \(\triangle {\mathbb {U}}= \{{\varvec{x}}_i\}_{i=1}^{b} \in {\mathbb {U}}\) samples to be labeled by an oracle (human expert). The samples in \(\triangle {\mathbb {U}}\) with oracle-annotated true labels are then added to \({\mathbb {L}}\), leading to a new set of labeled samples (\({\mathbb {L}}= {\mathbb {L}}\cup \triangle {\mathbb {U}}\)), which is reused to train the classifier f. This cycle of predictor-oracle interaction is repeated until a target performance is achieved or the sampling budget is exhausted. The sampling budget restricts the total number of labeled samples for training the classifier; its overall size is denoted as B such that \(B \ll u\).
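To make this interaction concrete, the following minimal sketch instantiates the labeled/unlabeled split and the query cycle described above. All callables (`train`, `select`, `oracle_label`) and the budget defaults are illustrative placeholders, not prescriptions of our setup:

```python
import numpy as np

def pool_based_al(X, train, select, oracle_label, b0=20, b=100, B=1000):
    """Sketch of the pool-based AL cycle; `train`, `select`, and
    `oracle_label` are caller-supplied, hypothetical callables."""
    n = len(X)
    rng = np.random.default_rng(0)
    labeled = list(rng.choice(n, size=b0, replace=False))       # initial L
    unlabeled = [i for i in range(n) if i not in set(labeled)]  # pool U
    model = train(X, labeled)                                   # fit f on L
    while len(labeled) < B:                                     # budget B << u
        picked = select(model, X, unlabeled, b)                 # Psi picks b indices
        oracle_label(picked)                                    # oracle annotates ΔU
        labeled += picked                                       # L <- L ∪ ΔU
        unlabeled = [i for i in unlabeled if i not in set(picked)]
        model = train(X, labeled)                               # retrain f
    return model
```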

A well-suited selection metric \(\Psi \) can help elevate the performance of the model using as few labeled examples as possible, achieving a win-win situation for the human oracle and the predictor. Uncertainty is one of the most prevailing metrics in active learning, arguing that the samples handed to the oracle should be those that confound the model most. To mine out those “uncertain samples”, the selector first calculates an uncertainty score for each sample in \({\mathbb {U}}\). Typically, there are three simple ways to obtain the uncertainty scores from the model outputs: minimum confidence uncertainty (MCU), minimum margin uncertainty (MMU) and entropy uncertainty (EU). These three metrics can be sequentially expressed as follows:

$$\begin{aligned} {\varvec{x}}^{*}_{\text {MCU}} = \mathop {\arg \max }_{{\varvec{x}}_i \in {\mathbb {U}}} \{ 1 - \mathop {\max }_{y_i \in {\mathbb {Y}}} P(y_i|{\varvec{x}}_i)\}, \end{aligned}$$
(1)
$$\begin{aligned} {\varvec{x}}^{*}_{\text {MMU}} = \mathop {\arg \min }_{{\varvec{x}}_i \in {\mathbb {U}}} \{ \max ^{1}_{y_i \in {\mathbb {Y}}} P(y_i|{\varvec{x}}_i)- \max _{y_i \in {\mathbb {Y}}}^{2} P(y_i|{\varvec{x}}_i)\}, \end{aligned}$$
(2)
$$\begin{aligned} {\varvec{x}}^{*}_{\text {EU}} = \mathop {\arg \max }_{{\varvec{x}}_i \in {\mathbb {U}}} \{ -\sum \nolimits _{y_i \in {\mathbb {Y}}} P(y_i|{\varvec{x}}_i)\log (P(y_i|{\varvec{x}}_i))\}, \end{aligned}$$
(3)

where \(P(y_i|{\varvec{x}}_i)\) refers to the class-conditional probability and \({\varvec{x}}^{*}\) denotes the selected uncertain sample. Consequently, the uncertain samples handed over to the oracle can be picked by ranking the uncertainty score of each sample in \({\mathbb {U}}\) in descending order, resulting in a new labeled dataset to retrain the classifier.
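All three scores in Eqs. (1)–(3) depend only on the softmax outputs over the pool, so they can be computed in a few lines. The sketch below is a minimal PyTorch-style illustration; the function name and the convention that higher scores mean more uncertain (so MMU negates the margin) are our own choices:

```python
import torch

def uncertainty_scores(probs: torch.Tensor, kind: str) -> torch.Tensor:
    """probs: (n, k) softmax outputs P(y|x) over the unlabeled pool.
    Returns one score per sample; higher means more uncertain."""
    if kind == "MCU":                       # Eq. (1): 1 - max_y P(y|x)
        return 1.0 - probs.max(dim=1).values
    if kind == "MMU":                       # Eq. (2): top-1/top-2 margin,
        top2 = probs.topk(2, dim=1).values  # negated so small margin ranks high
        return -(top2[:, 0] - top2[:, 1])
    if kind == "EU":                        # Eq. (3): entropy -sum_y P log P
        return -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
    raise ValueError(f"unknown metric: {kind}")

# Query: take the b highest-scoring pool samples, e.g.
# picked = uncertainty_scores(pool_probs, "EU").topk(b).indices
```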

3.2 Symbols and notations on PLL

Formally, let us denote \({{\mathbb {C}}} = \{2^{{\mathbb {Y}}} \backslash \varnothing \backslash {{\mathbb {Y}}}\}\) as the candidate label space, where \(2^{{\mathbb {Y}}}\) is the power set of \({\mathbb {Y}}\), and \(|{\mathbb {C}}| = 2^k - 2\) indicates that the candidate label set is neither the empty set nor the whole label set. For each training instance \({\varvec{x}}_i\), let \({S_i} \in {{\mathbb {C}}}\) be its partial label. We denote \(P({\varvec{x}},y)\) and \(P({\varvec{x}}, S)\) as the probability densities of fully labeled examples and partially labeled examples, respectively. Building upon the critical assumption of PLL that the candidate label set of each instance must include the correct label, we have \({y_i} \in {S_i}\). PLL aims to learn a predictor f with training examples sampled from \(P({\varvec{x}}, S)\) that makes correct predictions for test examples. In practice, there are two common ways to generate the partial label sets: (I) the uniform sampling strategy (USS), which uniformly samples the partial label for each training instance from all possible candidate label sets (Feng et al., 2020; Zhang et al., 2022); (II) the flip probability strategy (FPS), which selects each false label as a candidate label independently with a flip probability q (Feng & An, 2019a; Yan & Guo, 2020; Lv et al., 2020; Wen et al., 2021). In this paper, we adopt both of them to generate partial labels. Refer to the Online Appendix file for more details.
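For concreteness, the two strategies can be simulated as below. This is a sketch under our reading of USS and FPS: representing candidate sets as boolean masks, rejecting the full label set under USS, and forcing the true label into every set are our implementation choices, not the exact generation code of the cited works:

```python
import torch

def generate_partial_labels(y: torch.Tensor, k: int,
                            strategy: str = "FPS", q: float = 0.5) -> torch.Tensor:
    """Simulate the imperfect oracle: returns a boolean (n, k) candidate
    mask S that always contains the true label y."""
    n = y.numel()
    S = torch.zeros(n, k, dtype=torch.bool)
    S[torch.arange(n), y] = True                 # y_i ∈ S_i always holds
    if strategy == "FPS":
        S |= torch.rand(n, k) < q                # flip each false label w.p. q
    elif strategy == "USS":                      # uniform over sets containing y,
        for i in range(n):                       # excluding the full set Y
            while True:
                cand = torch.rand(k) < 0.5       # each false label kept w.p. 1/2
                cand[y[i]] = True
                if not cand.all():               # reject S_i = Y (S_i ∈ C)
                    S[i] = cand
                    break
    return S
```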

3.3 Baseline for ALPL

In this paper, we introduce a new setting named active learning with partial labels (ALPL). Different from previous AL settings, ALPL stipulates that the oracle labels the selected samples with partial labels, easing the annotation pressure on the oracle when facing confusing samples. Note that the key difference between ALPL and AL is the label supervision, so it is intuitive to address ALPL by simply adopting a PLL-based loss function to train the predictor, relieving the negative effects caused by the false positive labels in the candidate label sets. In this case, we use the RC loss (Lv et al., 2020; Feng et al., 2020), one of the most prevailing state-of-the-art loss functions (Wen et al., 2021; Zhang et al., 2022), to address ALPL in a simple but effective manner. The empirical risk function \(\hat {{{\mathcal{R}}}}_{\textrm{rc}}\) is defined as

$$\begin{aligned} \hat {{{\mathcal{R}}}}_{\textrm{rc}} = \sum \nolimits _{i = 1}^{l}\sum \nolimits _{j \in S_i} \frac{P(y_i = j|{\varvec{x}}_i)}{\sum \nolimits _{z \in S_i} P(y_i = z|{\varvec{x}}_i) } {\mathcal {L}}(f({\varvec{x}}_i),j). \end{aligned}$$
(4)

Here \({\mathcal {L}}(f({\varvec{x}}),s), s \in S\) refers to the cross entropy loss. As shown in Eq. (4), the RC loss is essentially a weighted cross entropy over the labels in the candidate set, which is theoretically proven to be risk-consistent in PLL, i.e., to achieve performance comparable to fully supervised methods. Therefore, we train the predictor f with the RC loss to serve as the baseline of ALPL. In this way, we can seamlessly apply any AL-based framework to address ALPL (twelve approaches are implemented in our paper; see Sect. 5 for more details).
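Eq. (4) translates almost directly into code. The sketch below stores the partial labels as a boolean mask and, following common PLL implementations, treats the confidence weights as constants (detached); whether to detach the weights is an implementation choice that Eq. (4) itself does not fix:

```python
import torch
import torch.nn.functional as F

def rc_loss(logits: torch.Tensor, S: torch.Tensor) -> torch.Tensor:
    """RC loss of Eq. (4). logits: (n, k) predictor outputs f(x);
    S: boolean (n, k) candidate-label mask containing the true label."""
    mask = S.float()
    log_p = F.log_softmax(logits, dim=1)                     # log P(y|x)
    p = log_p.exp()
    w = (p * mask) / (p * mask).sum(1, keepdim=True).clamp_min(1e-12)
    w = w.detach()       # assumption: weights treated as constants per step
    return -(w * log_p * mask).sum(dim=1).mean()             # weighted CE over S
```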

4 WorseNet: learning from counter examples

In this section, we introduce our proposed method to address ALPL in detail. Figure 2 illustrates the overall framework of our proposed WorseNet. Section 4.1 introduces the training procedure of our WorseNet. Section 4.2 and Sect. 4.3 introduce how WorseNet could address ALPL in both prediction and selection processes.

4.1 Constructing counter-examples

Though effective, we observe two potential issues with the baseline method in ALPL. (1) The first is overfitting (Chen et al., 2006; Perez & Wang, 2017; Shorten & Khoshgoftaar, 2019), a common challenge in both AL and ALPL due to the utilization of a relatively small set of annotated samples. (2) The second is how to effectively find the representative samples that achieve the maximum benefit during each query round. Unlike conventional AL, where the true label is provided for each queried sample, the selection strategy for ALPL needs to be carefully considered to maximize what the RC loss can learn.

To address these two problems, we turn to an interesting concept in human reasoning. When humans perceive and learn the world, vision yields a mental model to help understand the things described in the scene, and builds a prior knowledge base for further reasoning. Specifically, when evaluating the deductive validity of an inference, humans search for counter-examples (CEs) to help disprove the conjecture (De Neys et al., 2005; Verschueren et al., 2005; Johnson-Laird, 2010). For instance, the fact that “John Smith is not a lazy student” is one CE to the inference “all students are lazy”. Therefore, we can tell that “all students are lazy” is a false conclusion because of “John Smith”. Intuitively, CEs occupy an important position in human reasoning. Inspired by this, we are driven to adopt this interesting concept to benefit the predictor. Consider an image of a wolfhound partially labeled with “dog” and “wolf”: a normal predictor may misclassify it with a predicted probability of (0.6, 0.7). However, if we have another model that can tell that this image does not belong to these two classes with (0.1, 0.4), then this model provides CEs to correct the falsified “inference” made by the predictor. In this way, we aim to explore and exploit CEs from the data, and then introduce a CE-teller that explicitly assists the predictor to improve its performance in ALPL.

The first question is how to construct CEs for the predictor. We emphasize that CEs rigorously refute an inference. Let us consider classifying an image of a dog with a one-hot label, and assume that the inference here is “The image has a dog”. This conjecture is rejected once the image is annotated with “0” at the “dog” index. Thus, a simple inversion of the true label intuitively yields a CE, which violates the original accurate inference and leads to a complementary conclusion. Motivated by this, we propose to build up CEs for the predictor by applying label inversion to the selected samples. Formally, we are given a set of data samples \({\mathbb {W}}= \{{\varvec{x}}_i\}_{i=1}^l \in \mathbb {R}^{d}\) such that \({\mathbb {W}}= {\mathbb {L}}\), and the assigned label of each sample in \({\mathbb {W}}\) is defined as follows:

$$\begin{aligned} {\overline{S}}_i = {\mathbb {Y}}- S_i, \end{aligned}$$
(5)

where \({\overline{S}}_i\) denotes the candidate label set for the instance in \({\mathbb {W}}\). Intuitively, \({\overline{S}}_i\) is complementary to \(S_i\), i.e., \({\overline{S}}_i = \complement _{\mathbb {Y}}S_i\), meaning that there is no true label within \({\mathbb {W}}\). For convenience, we name the candidate label set \({\overline{S}}\) as the inverse partial label (IPL). Note that IPL is different from the complementary label (Ishida et al., 2017). The former provides a wrong indicator to the samples while the latter aims to train a true-label predictor by specifying the classes that the example does not belong to.

There are two benefits to forming the IPL by following Eq. (5) in ALPL. Firstly, it is convenient and efficient to construct CEs with a purely label-based operation on the selected labeled samples \({\mathbb {L}}\). Secondly, the IPL considers that all false labels outside \(S_i\) become inverse knowledge for the instance \({\varvec{x}}_i\), enriching the label variety of CEs.
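Given a boolean candidate mask `S` as in the generation sketch of Sect. 3.2, Eq. (5) reduces to a single complement over the label space:

```python
# IPL of Eq. (5): S̄_i = Y - S_i, the complement of the candidate mask.
S_bar = ~S                    # boolean (n, k); contains no true label
assert not (S_bar & S).any()  # S̄ and S are disjoint by construction
```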

Fig. 2 The overall framework of our proposed method to address ALPL. A strong baseline for ALPL is achieved by directly using RC loss to train the predictor (red arrows). To further improve the performance, we propose WorseNet (blue arrows) to extract the useful knowledge from the constructed counter examples, individually learning in a complementary way to the predictor. With the help of the distribution gap between the predictor and WorseNet, the selecting and inference process (green arrows) in ALPL could be improved in an explicit way (Color figure online)

4.2 Predicting better with WorseNet

In this section, we introduce how to assist the predictor with the proposed CEs in ALPL. Firstly, an extra classifier apart from the predictor is needed to learn from the CEs obtained from \({\mathbb {W}}\) annotated with IPL. Formally, let us name such a classifier the WorseNet and denote it as \(w: \mathbb {R}^{d} \rightarrow \mathbb {R}^{k}\). Note that w shares the same input and output space as the predictor f since w is trained with samples from \(Q({\varvec{x}}, {\overline{S}})\), which denotes the probability density of samples with IPL. To help w extract the inverse knowledge from \(Q({\varvec{x}}, {\overline{S}})\), we formulate this learning process as a PLL problem by treating the IPL as normal partial labels, and propose the inverse RC (IRC) loss to address it as follows:

$$\begin{aligned} \hat {{{\mathcal{R}}}}_{\textrm{irc}} = \sum \nolimits _{i=1}^{l} \sum \nolimits _{j \in {\overline{S}}_i} \frac{Q(y_i = j|{\varvec{x}}_i)}{\sum _{z \in {\overline{S}}_i} Q(y_i = z|{\varvec{x}}_i) } {\mathcal {L}}(w({\varvec{x}}_i),j), \end{aligned}$$
(6)

where \({{{\hat{\mathcal R}}}}_{\textrm{irc}} ({\mathcal {L}}, w)\) denotes the empirical risk function for w, and \(Q(y|{\varvec{x}})\) denotes the class-conditional probability modeled by w. Clearly, IRC loss focuses on the labels outside the candidate label set in a way complementary to RC loss.

Supported by the IRC loss, WorseNet is able to latch on to a pattern that is complementary to the predictor. To improve the predictor with WorseNet, we leverage the output distribution gap between w and f to predict the true label during inference. Since the original true label only lies in the candidate label set S, we should intuitively aim at enlarging the gap between the output distributions of f and w on S. To this end, we further add a Kullback–Leibler divergence (KLD) regularization term for w, steering its learning process in a direction beneficial to the predictor. Specifically, the KLD term is expressed as

$$\begin{aligned} \textrm{KLD} = \sum \nolimits _{i=1}^{l} \sum \nolimits _{j \in {\overline{S}}_i} {P(y_i = j|{\varvec{x}}_i)} \log \frac{{P(y_i = j|{\varvec{x}}_i)}}{{Q(y_i = j|{\varvec{x}}_i)}}. \end{aligned}$$
(7)

Note that here we stop the gradient backpropagation of P when training w. As shown in Eq. (7), we calculate the KLD between the predictor and WorseNet by merely using their outputs inside \({\overline{S}}\); minimizing it implicitly enlarges the gap between the output distributions of f and w on the candidate set. Overall, the learning loss function for WorseNet, denoted as the Worse loss, can be expressed as follows:

$$\begin{aligned} {{{\hat{\mathcal R}}}}_{\textrm{worse}} = {{{\hat{\mathcal R}}}}_{\textrm{irc}} ({\mathcal {L}}, w) + \alpha \textrm{KLD} {,} \end{aligned}$$
(8)

where \(\alpha \) is a regularization parameter and we empirically set \(\alpha =1\). After training with Eq. (8), the predictor during inference can predict the potential true label by

$$\begin{aligned} y_i^{*} = \mathop {\arg \max }\nolimits _{y_i \in {\mathbb {Y}}} \{ P(y_i|{\varvec{x}}_i) + (1- Q(y_i|{\varvec{x}}_i))\} {,} \end{aligned}$$
(9)

where \(y_i^{*}\) denotes the predicted true label of \({\varvec{x}}_i\). Note that here we use \(1-Q\) to help the predictor recognize the true label. As WorseNet is trained independently of the predictor, it is able to benefit the predictor on top of any selector in ALPL. To better illustrate this, we provide the following theorem.

Theorem 1

Assume that the posterior probability of WorseNet satisfies \(Q(y = j | {\varvec{x}}_i) + P(y = j | {\varvec{x}}_i) = 1\) for any label \(j \in {{\mathbb {Y}}}\) of sample \({\varvec{x}}_i\), and that the loss function \({\mathcal {L}}\) is the standard cross entropy. Then the Worse loss \(\hat {{{\mathcal{R}}}}_{\textrm{worse}}\) satisfies

$$\begin{aligned} \hat {{{\mathcal{R}}}}_{\textrm{worse}} \propto {\sum \nolimits _{i=1}^{l}} \sum \nolimits _{j \in {\overline{S}}_i} -n(Q_{ij}) \log (Q_{ij}) {.} \end{aligned}$$
(10)

Here \(Q_{ij}\) represents \(Q(y = j | {\varvec{x}}_i)\) for simplicity and \(n(Q_{ij}) > 0, \forall {Q_{ij}} \in [0,1]\). The proof and analysis of Theorem 1 are in the Online Appendix file. Theorem 1 shows that WorseNet is learned to approximate the false labels in \({\overline{S}}\) in an entropy-based manner. As \(\hat {{{\mathcal{R}}}}_{\textrm{worse}}\) decreases and \(Q_{ij} \rightarrow 1\), the predictor is correspondingly pushed away from \({\overline{S}}\) (\(P_{ij} \rightarrow 0\)). Overall, the Worse loss serves as an auxiliary module to the predictor by providing extra supervision on the elements outside the partial labels. For convenience, we denote this improvement of WorseNet to the predictor during the evaluation as WorseNet-Predictor (WP), and its pseudo-code is given in Algorithm 1.
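As a companion to Algorithm 1, the sketch below implements Eqs. (6)–(9) under the boolean-mask convention used earlier. Detaching the predictor's probabilities in the KLD term realizes the stop-gradient noted after Eq. (7); detaching the IRC weights mirrors the RC sketch and is likewise an implementation choice:

```python
import torch
import torch.nn.functional as F

def worse_loss(w_logits: torch.Tensor, f_logits: torch.Tensor,
               S_bar: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Worse loss of Eq. (8): IRC term (Eq. 6) on the inverse partial
    labels S̄ plus the KLD regularizer (Eq. 7), restricted to S̄."""
    mask = S_bar.float()
    log_q = F.log_softmax(w_logits, dim=1)                   # log Q(y|x)
    q = log_q.exp()
    p = F.softmax(f_logits, dim=1).detach()                  # stop gradient on P
    wgt = (q * mask) / (q * mask).sum(1, keepdim=True).clamp_min(1e-12)
    irc = -(wgt.detach() * log_q * mask).sum(dim=1).mean()   # Eq. (6)
    kld = (p * mask * (p.clamp_min(1e-12).log() - log_q)).sum(dim=1).mean()  # Eq. (7)
    return irc + alpha * kld                                 # Eq. (8), alpha = 1

def wp_predict(f_logits: torch.Tensor, w_logits: torch.Tensor) -> torch.Tensor:
    """WP inference of Eq. (9): argmax_y { P(y|x) + (1 - Q(y|x)) }."""
    p = F.softmax(f_logits, dim=1)
    q = F.softmax(w_logits, dim=1)
    return (p + 1.0 - q).argmax(dim=1)
```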

4.3 Selecting better with WorseNet

In this section, we illustrate that the proposed WorseNet can also promote the sampling metric of some uncertainty-based selectors. As shown in Sect. 3.1, a selector \(\Psi ({\mathbb {L}},{\mathbb {U}},f)\) needs to calculate the uncertainty score of \({\varvec{x}}_i\) over the entire class space since it has no prior knowledge about the class of this sample. We argue that such a strategy can be further improved if the class space over which the uncertainty is computed is narrowed down, introducing a beneficial inductive bias to the selector.

Algorithm 1 Pseudo-code of WorseNet-Predictor (WP)

As shown in Eq. (9), we test our proposed framework during inference by measuring the gap between the output distributions of f and w. In particular, we assume that the true label is the class with the maximum probability distance between f and w. As f focuses on the candidate label set S while w learns from the CEs, the former shall have a higher response to the labels in S than the latter. Hence, the potential true label should satisfy \(P > Q\), since the true label always lies in S. Based on this, we construct a pseudo partial label candidate set \(S^{'}\) for each unlabeled sample in \({\mathbb {U}}\) as follows:

$$\begin{aligned} S_i^{'} = \{z | P(y_i = z|{\varvec{x}}_i) - Q(y_i = z | {\varvec{x}}_i) \ge 0, z \in {\mathbb {Y}}\}. \end{aligned}$$
(11)

Building upon \(S^{'}\), a selector can narrow the class range over which the uncertainty score is acquired in \({\mathbb {U}}\). To this end, we propose three sampling strategies based on MCU (Eq. 1), MMU (Eq. 2), and EU (Eq. 3) by directly substituting \({\mathbb {Y}}\) with \(S^{'}\). For convenience, we denote the improvement of WorseNet on the selector as WorseNet-Selector (WS), and denote these three methods as WS-MCU, WS-MMU, and WS-EU.
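Below is a minimal sketch of one such selector, WS-EU, under the same conventions as before; whether the probabilities are renormalized within \(S^{'}\) before taking the entropy is an implementation choice, and they are left un-renormalized here:

```python
import torch
import torch.nn.functional as F

def ws_eu_scores(f_logits: torch.Tensor, w_logits: torch.Tensor) -> torch.Tensor:
    """WS-EU: the entropy of Eq. (3) restricted to the pseudo candidate
    set S' of Eq. (11); higher scores mean more uncertain."""
    p = F.softmax(f_logits, dim=1)
    q = F.softmax(w_logits, dim=1)
    S_prime = ((p - q) >= 0).float()          # Eq. (11): keep labels with P >= Q
    return -(p * p.clamp_min(1e-12).log() * S_prime).sum(dim=1)

# Query: rank the pool descending and hand the top-b samples to the oracle,
# e.g. picked = ws_eu_scores(f_logits_pool, w_logits_pool).topk(b).indices
```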

5 Experiments

In this section, we evaluate our proposed WP, WS-MCU, WS-MMU, and WS-EU against several algorithms from the literature, and extensive experiments are conducted to verify the correctness and effectiveness of our proposed modules. More details can be found in the Online Appendix file.

5.1 Benchmark datasets comparisons

Datasets and backbones. Our proposed WorseNet-based modules are evaluated on four popular benchmark datasets: MNIST (LeCun et al., 1998), Fashion-MNIST (Xiao et al., 2017), SVHN (Netzer et al., 2011) and CIFAR-10 (Krizhevsky et al., 2009). Note that the oracle must manually generate the candidate label sets for these datasets, which were originally designed for single-label classification. Recall that we introduced two candidate label generation approaches, i.e., USS and FPS. For FPS, we set \(q \in \{0.3,0.5\}\) to represent different degrees of ambiguity. For MNIST and Fashion-MNIST, we adopt a 3-layer MLP and a simple CNN-based network denoted as C-Net (similar to the network used in Gal et al. (2017), Kirsch et al. (2019)) as the backbones for the predictor. For SVHN and CIFAR-10, we follow most works (Yoo & Kweon, 2019; Ash et al., 2020; Kim et al., 2021) and choose ResNet18 (He et al., 2016) and VGG11 (Simonyan & Zisserman, 2014) as the base models. Note that WorseNet w follows the identical architecture to the predictor f.

Compared methods and training settings. We compare our proposed modules with twelve approaches, comprising seven model-driven methods: 1) Random Sampling (RS), 2) MCU, 3) MMU, 4) EU, 5) Coreset (Sener & Savarese, 2018), 6) BALD (Kirsch et al., 2019), 7) BADGE (Ash et al., 2020), and five data-driven methods: 8) LL4AL (Yoo & Kweon, 2019), 9) VAAL (Sinha et al., 2019), 10) TA-VAAL (Kim et al., 2021), 11) ALFA-MIX (Parvaneh et al., 2022), 12) CAMPAL (Yang et al., xxxx). For the seven model-driven methods, we adopt the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 0.001 to train f. We take a mini-batch size of 256 images and train all seven methods for 200 epochs. For the data-driven methods, we strictly follow the training hyper-parameters reported in their papers (Yoo & Kweon, 2019; Sinha et al., 2019; Kim et al., 2021). Besides, we simply adopt ResNet18 as the backbone for f and w in these five data-driven methods. For the ALPL setting, we construct an initial labeled set \({\mathbb {L}}\) with size \(b_0 = 20\), and acquire \(b = 100\) instances (\(b = 1000\) for SVHN and CIFAR-10) from \({\mathbb {U}}\) in each query round, following prior works (Gal et al., 2017; Kirsch et al., 2019; Kim et al., 2021). We repeat the query process 10 times such that the overall budget size is \(B = 1000\) (\(B = 10000\) for SVHN and CIFAR-10). Note that we directly adopt the RC loss on these twelve methods to build the baselines (see Sect. 3.3 for more details). To guarantee comparison fairness, we conduct all experiments 5 times and report the average test accuracy using the model achieving the maximum performance on a validation set, which is constructed by randomly selecting 100 instances from the training datasets. Here the validation performance of w is measured by Eq. (9). All the implemented methods are trained on 2 RTX3090 GPUs, each with 24 GB memory.

Experiment results. As shown in Table 1, under the default settings, our proposed WorseNet shows its effectiveness and superiority in addressing ALPL on these four benchmark datasets. Firstly, WP brings a consistent gain to the classifier regardless of the backbone and the adopted AL method. Moreover, the improvement from WP is observed in both USS and FPS cases, validating that WP does not rely on any particular data generation assumption. Our approach also delivers promising performance with full access to the datasets, which means that WP is also an effective way to address PLL. Particularly, we would like to highlight a counter-intuitive phenomenon: RS may perform better than some methods in some cases. For example, RS (70.73%) performs far better than EU (64.58%) and Coreset (53.17%) on Fashion-MNIST. This counter-intuitive phenomenon is also reported in Kim et al. (2021), Yoo and Kweon (2019), Sinha et al. (2019), Ash et al. (2020), and can be attributed to the instability caused by a relatively small number of labeled samples.

Table 1 Test performance of the methods on benchmark datasets using label generation by FPS (\(q = 0.5\))
Fig. 3 Visualized tSNE results of the three proposed selectors and the baselines on MNIST with FPS (\(q=0.5\)). The red circles mark where more samples near the class boundary are selected, and the blue circles mark where more samples near the center of the class cluster are selected (Color figure online)

The three WS-based selectors, i.e., WS-MMU, WS-MCU, and WS-EU, are found to better elevate the performance of the classifier in ALPL when compared to their original versions. Additionally, these three improved uncertainty-based approaches show competitive performance compared with the other AL methods, and such performance can be further improved by reusing WP, reaching state-of-the-art performance in ALPL. As shown in Fig. 3, we select 6 classes and visualize the selected samples of EU and WS-EU. Compared to EU, our WS module enforces the selector to pick more representative and diverse samples. Specifically, our proposed selectors are able to select more samples that lie near the class boundary (marked by the red circles). Besides, more samples near the center of each class cluster are also selected to ensure accuracy (marked by the blue circles), illustrating that our WS helps ALPL select more representative samples with partial labels. Overall, the experimental results on four benchmark datasets reasonably verify the generalization and effectiveness of our method in addressing ALPL.

5.2 Real-world datasets comparisons

Datasets and backbones. In contrast to the benchmark datasets, whose candidate label sets must be self-generated, here we evaluate our proposed WorseNet-based modules on five real-world datasets widely used in PLL: Lost (Cour et al., 2011), MSRCv2 (Liu & Dietterich, 2012), BirdSong (Briggs et al., 2012), Soccer Player (Zeng et al., 2013) and Yahoo!News (Guillaumin et al., 2010). All five real-world datasets come with given candidate label sets, and, as in realistic scenarios, most samples are annotated with semantically similar labels; we therefore directly use them as the oracle annotation. For these five datasets, we adopt the same 3-layer MLP used in Sect. 5.1 as the sole backbone, since these real-world datasets take simple vector inputs rather than images, which also follows the conventions in Feng and An (2019a), Feng and An (2019b), Feng et al. (2020), Lv et al. (2020), Wen et al. (2021), Zhang et al. (2022).

Table 2 Test performance of compared methods on five real-world datasets. The underline points out improved accuracy by WP. \(\uparrow \) indicates the improved accuracy is beyond 3%. Note that three data-driven methods are not implemented here due to the framework incompatibility

Compared methods and training settings. Due to the simplicity of these five real-world datasets, we adopt a simple MLP as the backbone for both the predictor and WorseNet, so here we compare our methods with the seven model-driven methods, 1)–7), whose architectures do not necessarily build upon deep models. Based on the different data quantities, we design different settings for these five datasets. Specifically, we set the size of the initial labeled set \({\mathbb {L}}\) to 5 and repeat the query process 5 times. We conduct all experiments 10 times and record the average test accuracy using the model achieving maximum performance on a validation set built by randomly selecting 10 instances from the training datasets. Other settings follow Sect. 5.1.

Experiment results. The experimental results in Table 2 validate that our proposed WorseNet is also effective in dealing with ALPL on the five real-world datasets. Specifically, our WP is capable of delivering promising performance gains to the predictor with any baseline method. Furthermore, the three improved metrics (WS-MMU, WS-MCU, and WS-EU) in the selector also show competitive performance compared to the baselines.

5.3 Ablation studies on WorseNet

Anti-overfitting of WorseNet. To better understand WorseNet, we show the validation error of WorseNet and the baseline in one training round with 100 randomly selected samples. As shown in Fig. 4, we add WorseNet when the model starts to overfit. Clearly, the proposed WorseNet effectively addresses the overfitting, leading to a further decrease in validation error. Additionally, we also compare WorseNet with different data augmentations (please refer to the Online Appendix), and the results validate the superiority of WorseNet in improving the predictor. In conclusion, our WorseNet is a promising method for addressing overfitting.

Fig. 4 The average validation error of one training run on four benchmark datasets. Note that the settings are the same as in Table 1

Fig. 5 The average test accuracy over different numbers of query samples on four benchmark datasets during training. Note that the settings are the same as in Table 1

Number of selected samples. As shown in Fig. 5, with the increase of queried samples (100 samples in each round), all methods achieve steady performance enhancement throughout training. Clearly, all baseline methods (dashed lines) are consistently strengthened by our proposed WP (solid lines) in each query round. Besides, the three newly proposed selectors also achieve competitive performance. More relevant results can be found in the Online Appendix file.

6 Conclusion

We have proposed and investigated a new and practical setting, active learning with partial labels (ALPL), where the oracle is requested to provide partial labels for the selected samples during the query process. To address ALPL, we first adopt the RC loss on different prevailing AL frameworks to establish a strong and effective baseline. Motivated by the salutary effects of counter-examples (CEs) in human reasoning, we turn to such a human-inspired adversarial learning process to relieve overfitting and improve the partially-labeled sample selection process in ALPL. In this regard, we design CEs by inverting the original partially-labeled examples. Furthermore, we introduce WorseNet, which directly learns such complementary knowledge via the proposed Worse loss. By capitalizing on the probability gap between the predictor and WorseNet, our proposed WorseNet not only explicitly enhances the evaluation performance of the predictor but also improves the selector’s ability to query partially-labeled samples more precisely. Comprehensive experimental results on various datasets demonstrate that our WorseNet yields state-of-the-art performance in ALPL and validate the superiority of such an adversarial learning pattern. Additionally, PLL itself can also be well addressed by this method, which warrants further investigation in the future.