Learning sample-aware threshold for semi-supervised learning

Wei, Qi; Feng, Lei; Sun, Haoliang; Wang, Ren; He, Rundong; Yin, Yilong

doi:10.1007/s10994-023-06425-7

Published: 18 January 2024

Volume 113, pages 5423–5445, (2024)


Machine Learning


Qi Wei ORCID: orcid.org/0000-0002-4073-7598^1,2,
Lei Feng²,
Haoliang Sun¹,
Ren Wang¹,
Rundong He¹ &
…
Yilong Yin¹


A Correction to this article was published on 16 May 2024

This article has been updated

Abstract

Pseudo-labeling methods are popular in semi-supervised learning (SSL). Their performance heavily relies on a proper threshold to generate hard labels for unlabeled data. To this end, most existing studies resort to a manually pre-specified function to adjust the threshold, which requires prior knowledge and suffers from scalability issues. In this paper, we propose a novel method named Meta-Threshold, which learns a dynamic confidence threshold for each unlabeled instance and does not require extra hyperparameters except a learning rate. Specifically, the instance-level confidence threshold is automatically learned by an extra network in a meta-learning manner. Considering limited labeled data as meta-data, the overall training objective of the classifier network and the meta-net can be formulated as a nested optimization problem that can be solved by a bi-level optimization scheme. Furthermore, by replacing the indicator function used in pseudo-labeling with a surrogate function, we prove the convergence of our training procedure, discuss its training complexity, and propose a strategy to reduce its time cost. Extensive experiments and analyses demonstrate the effectiveness of our method on both typical and imbalanced SSL tasks.


1 Introduction

Semi-supervised learning (SSL) (Zhu and Goldberg 2009; Sohn et al. 2020) aims to improve model performance by leveraging both abundant unlabeled data and limited labeled data. SSL algorithms exploit the latent patterns underlying unlabeled data, which reduces the need for large amounts of annotation (Sohn et al. 2020). Most previous SSL studies rely heavily on the pseudo-labeling strategy (Lee 2013; Sohn et al. 2020), which generates a hard label for each unlabeled sample and trains the deep model on these pseudo-labels.

For pseudo-labeling methods (Lee 2013; Sohn et al. 2020; Zhang et al. 2021; Xu et al. 2021), it is essential to set a proper threshold for selecting reliable pseudo-labels for unlabeled data. For example, FixMatch (Sohn et al. 2020) selected high-confidence pseudo-labels via a fixed threshold (e.g., 0.95 for CIFAR (Krizhevsky and Hinton 2009) and 0.65 for ImageNet (Deng et al. 2009)). However, as reported in Xu et al. (2021), fixing the threshold throughout the entire training process can reduce learning efficiency and raise the error rate of pseudo-labels, especially in the early learning stage.

To address this issue, subsequent works (Xu et al. 2021; Guo and Li 2022; Zhang et al. 2021; Saito et al. 2021) that dynamically generate the threshold to enable more robust SSL have been proposed. For instance, Xu et al. (2021) translated the fixed threshold to a loss threshold and selected the unlabeled data whose loss values (evaluated on pseudo-labels) are smaller than the loss threshold. Then, these selected data are incorporated into the training set, while the loss threshold gradually decreases over training iterations. Zhang et al. (2021) leveraged the idea of curriculum learning (Bengio et al. 2009) to take into account the learning status of each class and flexibly adjusted thresholds for different classes at each time step via a preset function.

Despite the decent performance of the pseudo-labeling methods mentioned above, they share two common drawbacks. First, they (Xu et al. 2021; Guo and Li 2022; Zhang et al. 2021) always resort to manually pre-specified functions to adjust the threshold. This tends to be infeasible when we know little about the underlying datasets or when the label conditions are too complicated. Second, these methods (Xu et al. 2021; Guo and Li 2022; Zhang et al. 2021) usually involve at least two hyper-parameters, which requires a complex cross-validation phase and thus leads to scalability issues (Franceschi et al. 2018) when the methods are applied to real-world applications.

To address the two drawbacks mentioned above, this paper presents a simple yet effective strategy to automatically learn a sample-aware confidence threshold for each unlabeled instance. In contrast with previous works, our method does not rely on prior knowledge to pre-define a threshold-adjusting function, and it includes only one hyper-parameter. Besides, to the best of our knowledge, we are the first to introduce instance-level thresholds, motivated by the observation that a deep model has different learning capabilities for different categories and even for different examples. Figure 1a shows a practical example. Intuitively, setting instance-level thresholds is more reasonable and helps generate more accurate pseudo-labels for unlabeled instances, which further facilitates the deep model’s learning.

Specifically, we leverage the idea of meta-learning (Finn et al. 2017) to construct a lightweight meta-net (e.g., a three-layer MLP) that explicitly models the instance-level thresholds, yielding a set of thresholds for all unlabeled data. Thanks to the universal approximation theorem (Hornik et al. 1989) for multilayer feedforward networks, our meta-net can be considered a generalized version of the pre-defined functions mentioned above (Zhang et al. 2021; Xu et al. 2021). Our framework thus contains a classifier network and a meta-net, and training the two networks forms a nested optimization problem. This problem can be solved by a bi-level strategy: (1) Inner loop: generate instance-level thresholds for all unlabeled instances and use the resulting hard pseudo-labels to train the classifier network. (2) Outer loop: update all parameters of the meta-net on a small set of meta-data constructed from the labeled data.

An appealing feature of this formulation is that the inner loop can be viewed as a mapping from the sample threshold space into the meta-net parameter space, and the outer loop performs the optimization on thresholds. Since the non-differentiable indicator function $\mathbbm {1}(\cdot )$ appears explicitly in the pseudo-labeling framework, we approximate it with a surrogate function, which makes the bi-level optimization problem tractable. In Fig. 1b, we compare our method with vanilla FixMatch (Sohn et al. 2020) and two improved methods (Xu et al. 2021; Zhang et al. 2021); the comparison highlights the merits of our method, namely that it needs neither a preset function nor prior knowledge.

Our contributions can be summarized as follows:

We propose a simple yet effective training framework (named Meta-Threshold, Meta-T) based on bi-level optimization for threshold-based SSL, which enjoys the following benefits: 1) Meta-T learns the thresholds of unlabeled samples automatically through bi-level optimization, avoiding the reliance of conventional threshold-based methods on strong prior knowledge of the data. 2) Meta-T only includes one extra hyper-parameter, i.e., the learning rate of the meta-net, which is not sensitive and thus does not require complex cross-validation.
We introduce a surrogate function to replace the indicator function. Further, we prove the convergence of our framework and show that it reaches an $\epsilon$-stationary point in $\mathcal {O}(1/\epsilon ^2)$ steps.
We integrate the proposed Meta-T into a curriculum learning framework, dubbed Green Meta-T, which significantly reduces the training cost of our learning algorithm with only a slight loss of accuracy.
Our method can be applied to solve both the conventional and imbalanced SSL tasks, exhibiting great potential in real-world applications.

2 Related work

Deep Semi-Supervised Learning As a common learning paradigm, deep SSL exhibits remarkable performance in leveraging large amounts of unlabeled data to train deep models. Current deep SSL methods can be roughly divided into three categories: consistency-based methods, pseudo-labeling methods, and hybrid methods. The key idea of consistency-based methods is to force the model’s outputs on the original and the perturbed unlabeled data to be the same (Laine and Aila 2016; Tarvainen and Valpola 2017; Xie et al. 2020). Pseudo-labeling methods, also called self-learning in previous works, follow an iterative mechanism: the model is first trained on the limited labeled data and used to predict labels for unlabeled data, and the generated labels are then used to train the model (Lee 2013). Hybrid approaches (Sohn et al. 2020; Zhang et al. 2021; Xu et al. 2021) integrate the above two methods with strong augmentation strategies (e.g., RandAugment (Cubuk et al. 2020) and CTAugment (Berthelot et al. 2019)).

Imbalanced Semi-Supervised Learning To improve the universality of SSL algorithms, some works (Kim et al. 2020; Wei et al. 2021; Guo and Li 2022) turn their attention to more challenging settings such as SSL under a class-imbalanced label distribution. DARP (Kim et al. 2020) designed a distribution-alignment scheme that refines biased pseudo-labels to match the true class distribution. However, this method requires prior knowledge about the data distribution, which is hard to obtain in real applications. To this end, DARP estimates the class distribution via a confusion matrix between labeled and unlabeled data. CReST (Wei et al. 2021) is based on a typical self-training strategy that adaptively adds pseudo-labeled data to the training set according to the label frequency.

Meta-Learning Also known as “learning to learn”, meta-learning has been widely applied to several weakly-supervised tasks, such as learning with noisy labels (Shu et al. 2019; Sun et al. 2022), out-of-distribution learning (Guo et al. 2020), and semi-supervised learning (Wang et al. 2020; Xiao et al. 2021). In SSL, some works introduce the idea of meta-learning to learn a set of parameters. For example, Wang et al. (2020) proposed a framework to learn sample weights for all unlabeled data, which aims to give high weights to more reliable pseudo-labels. Xiao et al. (2021) proposed to learn soft labels for unlabeled data while designing a first-order update strategy for the bi-level framework.

Relations Two works, L2RW (Ren et al. 2018) and MW-Net (Shu et al. 2019), employed bi-level optimization to efficiently learn a set of hyper-parameters. Our work differs from them in three critical respects.

(1) Problem setting: Ren et al. (2018) and Shu et al. (2019) focus on improving the generalization performance of deep models when learning with noisy labels, while our work aims to enhance the quality of generated pseudo-labels for unlabeled data in semi-supervised learning.
(2) Methodology: Ren et al. (2018) and Shu et al. (2019) learn a set of sample weights for (label-corrupted) training samples and then minimize the product of the training loss and the corresponding weight, while our framework generates thresholds that are used to select high-reliability pseudo-labels instead of directly participating in the model’s training. Besides, our method follows the pseudo-labeling framework and thus faces the non-differentiability of the indicator function, which we resolve with a surrogate function. Finally, we combine the bi-level training framework with curriculum learning, significantly reducing the cost of the bi-level strategy.
(3) Theory: We introduce a surrogate function to replace the indicator function and provide a convergence guarantee for our learning algorithm when the upper bound of the surrogate function is given. Besides, we give a brief analysis of the training costs of both Meta-T and Green Meta-T.

3 Preliminaries

Problem setting. In a C-class classification task, we have a set of training data which contains N labeled examples $D^l = \{(\textbf{x}_1^l,\textbf{y}_1^l), \dots ,(\textbf{x}_N^l,\textbf{y}_N^l)\}$ and M unlabeled examples $D^u = \{\textbf{x}_1, \dots ,\textbf{x}_M\}$, where $\textbf{x} \in \mathcal {X} \subseteq \mathbb {R}^d$ denotes the d-dimensional input feature vector and $\textbf{y} \in \mathcal {Y}$ is a one-hot label. Given a deep model f with learnable parameters $\textbf{w}$ and a classification loss function $H(\cdot )$ (e.g., cross-entropy loss), the training objective in typical supervised learning is $L_s = \mathbb {E}_{(\textbf{x},\textbf{y}) \sim D^l}H(f(\textbf{x}),\textbf{y})$. To achieve higher performance, the training objective of SSL algorithms can be summarised as $L_s + \lambda _u L_u$, where $L_u$ is constructed on $D^u$ and the trade-off coefficient $\lambda _u$ satisfies $\lambda _u > 0$.

3.1 Confidence thresholds in semi-supervised learning

Due to its simplicity yet great performance, we select FixMatch (Sohn et al. 2020) as an example to illustrate the usage of confidence threshold in pseudo-labeling methods.

The core idea of FixMatch is the introduction of a confidence threshold and strong augmentation strategies. To train the classifier on unlabeled data, FixMatch first computes the pseudo-label on the weakly-augmented version of the image. For each unlabeled sample $\textbf{x}_m \in D^u$, the prediction of the classification network is $p_m = f(\mathcal {A}^w(\textbf{x}_m); \textbf{w})$, where $\mathcal {A}^w$ denotes weak augmentation strategies, and the pseudo-label can be written as $\hat{\textbf{y}}_m = \arg \max (p_m)$. Due to the property of the function $\arg \max$, $\hat{\textbf{y}}_m$ is a one-hot probability distribution. Then, the training loss of $\textbf{x}_m$ can be summarised as

$$\begin{aligned} \ell _{\textbf{x}_m} = \mathbbm {1}(\max (p_m) > \tau ) \cdot H(\hat{\textbf{y}}_m, f(\mathcal {A}^s(\textbf{x}_m); \textbf{w})), \end{aligned}$$

(1)

where $\mathbbm {1}(\cdot )$ is the indicator function that selects high-reliability pseudo-labels, $\mathcal {A}^s$ denotes strong augmentation strategies, and $\tau$ is a fixed constant. Finally, the training objective over all unlabeled data is $L_{u} = \frac{1}{M}\sum \nolimits _{\textbf{x}_m \in D^u} \ell _{\textbf{x}_m}$.
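The gating in Eq. (1) can be illustrated with a minimal pure-Python sketch. It assumes that the network outputs logits which are passed through a softmax, and it represents the pseudo-label by its class index; the function names and the example logits are hypothetical and only serve as illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_index, probs):
    """Cross-entropy H between a one-hot target and a probability vector."""
    return -math.log(probs[target_index])

def fixmatch_unlabeled_loss(weak_logits, strong_logits, tau=0.95):
    """Per-sample loss of Eq. (1): the pseudo-label comes from the weak view,
    the loss is evaluated on the strong view, and the indicator gates it."""
    p_weak = softmax(weak_logits)
    confidence = max(p_weak)
    pseudo_label = p_weak.index(confidence)  # arg max as a class index
    if confidence > tau:                     # indicator 1(max(p_m) > tau)
        return cross_entropy(pseudo_label, softmax(strong_logits))
    return 0.0

# Confident weak prediction: the sample contributes a positive loss.
confident = fixmatch_unlabeled_loss([6.0, 0.0, 0.0], [2.0, 1.0, 0.0])
# Uncertain weak prediction: the sample is masked out.
masked = fixmatch_unlabeled_loss([1.0, 0.8, 0.5], [2.0, 1.0, 0.0])
```

With a fixed $\tau$, every sample is judged against the same bar; the uncertain sample above contributes nothing regardless of how the model is currently learning its class.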

As mentioned before, many related works (Zhang et al. 2021; Xu et al. 2021) modified the fixed constant $\tau$ to improve the universality of pseudo-labeling algorithms. However, they always resort to prior knowledge and further design a task-specific function to adjust this value, limiting their application in practice. Thus, in the next section, we devise a framework that does not require pre-defined functions yet enables sample-aware confidence thresholds.

4 Proposed method

Overview. We construct a meta-net (threshold generation network, or TGN) to dynamically produce sample-level thresholds. First, we rewrite the learning objective for threshold-based SSL methods. Second, we introduce the architecture of TGN. Then, we solve this meta-optimization problem via a bi-level strategy that alternately trains the classifier and the TGN. Finally, we analyze the convergence of our algorithm and provide a green version of our method with lower training time.

4.1 Learning with sample-level thresholds

To alleviate the aforementioned issues of previous methods, we construct a meta-learning framework that generates a sample-level confidence threshold for every unlabeled sample at each training step. Specifically, given a meta-net $\mathcal {V}$ with parameters $\Theta$, the confidence threshold of unlabeled data $\textbf{x}_m$ can be written as $\tau _m \leftarrow \mathcal {V}_m(\textbf{w}, \Theta )$; the architecture and input of $\mathcal {V}$ are detailed in Sect. 4.2. Then, the fixed constant $\tau$ in Eq. (1) can be replaced with a sample-level threshold $\tau _{m}$, and the loss of unlabeled data $\textbf{x}_m$ is formulated as

$$\begin{aligned} \ell _{\textbf{x}_m}(\textbf{w}, \Theta ) = \mathbbm {1}(\max (p_m) > \mathcal {V}_m(\textbf{w}, \Theta )) \cdot H(\hat{\textbf{y}}_m, f(\mathcal {A}^s(\textbf{x}_m); \textbf{w})). \end{aligned}$$

(2)

However, because the indicator function $\mathbbm {1}(\cdot )$ is non-differentiable, computing the partial derivative with respect to $\Theta$ in Eq. (2) is infeasible. In the practical training phase, we replace it with a modified sigmoid function, $\mathcal {S}(x) = \frac{1}{1+\exp (-\beta x)}$, where the input is $x = \max (f(\mathcal {A}^w(\textbf{x}_m); \textbf{w})) - \mathcal {V}_m(\textbf{w}, \Theta )$ and $\beta$ is the slope parameter that controls the shape of the function.

Discussion of the approximate function $\mathcal {S}(\cdot )$. In Fig. 2, we compare the indicator function $\mathbbm {1}(\cdot )$ and the surrogate function $\mathcal {S}(\cdot )$. We can observe that the input of the function satisfies $\max (f(\mathcal {A}^w(\textbf{x}_m); \textbf{w})) - \mathcal {V}_m(\textbf{w}, \Theta ) \in [-1, 1]$. Meanwhile, the first-order and second-order derivatives of the sigmoid function exist everywhere, which makes backpropagation of the training loss in Eq. (2) possible.
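A small sketch makes the difference between the hard gate and its smooth surrogate concrete. The slope value `beta=50.0` is an assumption chosen for illustration (the text does not state the value used), and the function names are hypothetical.

```python
import math

def indicator(confidence, threshold):
    """Hard gate 1(confidence > threshold); its derivative is zero almost everywhere."""
    return 1.0 if confidence > threshold else 0.0

def surrogate(confidence, threshold, beta=50.0):
    """Smooth gate S(x) = 1 / (1 + exp(-beta * x)) with x = confidence - threshold.
    beta is an assumed illustrative value."""
    x = confidence - threshold
    return 1.0 / (1.0 + math.exp(-beta * x))

def surrogate_grad_threshold(confidence, threshold, beta=50.0):
    """Derivative of S with respect to the threshold: -beta * S * (1 - S)."""
    s = surrogate(confidence, threshold, beta)
    return -beta * s * (1.0 - s)

# Far from the threshold the two gates nearly agree ...
well_above = surrogate(0.99, 0.70)   # close to indicator(0.99, 0.70) == 1.0
well_below = surrogate(0.50, 0.90)   # close to indicator(0.50, 0.90) == 0.0
# ... while at the threshold the surrogate still passes a gradient to the meta-net.
grad_at_boundary = surrogate_grad_threshold(0.90, 0.90)
```

The derivative with respect to the threshold is always negative: raising the threshold can only close the gate, which is the signal the meta-net needs in order to be trained by backpropagation.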

Eventually, the optimal classifier parameters $\textbf{w}^*$ can be calculated by minimizing the loss

$$\begin{aligned} \textbf{w}^*({\Theta }) = \mathop {\arg \min }\limits _{\textbf{w}} L_u = \frac{1}{M} \sum \nolimits _{\textbf{x}_m \in D^u} {\ell _{\textbf{x}_m}}({\textbf{w}, \Theta }). \end{aligned}$$

(3)

4.2 Threshold generation network TGN

In this subsection, we design a threshold generation network (TGN), serving as a meta model. By summarizing previous works (Zhang et al. 2021; Guo and Li 2022), we found that considering average class confidence provides more valuable information for generating threshold and improves the applicability of methods on extreme data distribution. Thus, we construct the meta-net which learns from instance confidence and average class confidence simultaneously and outputs sample-aware threshold for unlabeled data.

Formally, given a weakly-augmented version of unlabeled data $\textbf{x}_m$, the classifier network $f_{\textbf{w}}$ gives the prediction (a soft label) $g(p_m^t)$ in the t-th iteration, where $g(\cdot )$ denotes the Softmax function. Further, the pseudo-label is ${\hat{{{\textbf{y}}}}}_m^t = \arg \max (g(p_m^t))$. Meanwhile, the average class confidence can be represented as $\overline{\textrm{p}}_c^t = \frac{1}{M}\sum \nolimits _{m=1}^{M} g(p_m^t | c=\hat{{{\textbf{y}}}}_m^t)$. Note that $\overline{\textrm{p}}_c^t$ can be regarded as the average soft label of class c at time t. Therefore, for unlabeled data $\textbf{x}_m$, the generated threshold in the t-th iteration is

$$\begin{aligned} \tau _m^t = \mathcal {V}(g(f(\textrm{x}_m; \textbf{w})), \, \overline{\textrm{p}}_c^t; \Theta ). \end{aligned}$$

(4)

Figure 3b illustrates the architecture of the proposed TGN, a lightweight network (e.g., three fully-connected layers). For $\textbf{x}_m$, we concatenate its prediction $g(p_m^t)$ (a C-dimensional soft label) with the average class confidence $\overline{\textrm{p}}_c^t$ (a C-dimensional vector). Therefore, the input layer of the TGN has 2C dimensions.
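The construction of the 2C-dimensional TGN input can be sketched as follows. This is one reading of the formula for the average class confidence: the soft labels of samples pseudo-labeled as class c are summed and divided by M, exactly as the $\frac{1}{M}$ factor is written, and the row for a sample's own pseudo-label is concatenated with its soft label. The function names and the toy soft labels are hypothetical.

```python
def average_class_confidence(soft_labels, num_classes):
    """For each class c, accumulate the soft labels of the samples whose
    pseudo-label is c and scale by 1/M, following the formula as written."""
    M = len(soft_labels)
    avg = [[0.0] * num_classes for _ in range(num_classes)]
    for probs in soft_labels:
        c = probs.index(max(probs))          # pseudo-label of this sample
        for k, p in enumerate(probs):
            avg[c][k] += p / M
    return avg

def tgn_input(probs, avg):
    """Concatenate a sample's soft label (C values) with the average class
    confidence of its pseudo-label (C values), giving a 2C-dimensional input."""
    c = probs.index(max(probs))
    return list(probs) + list(avg[c])

# Toy batch of three soft labels over C = 3 classes.
soft_labels = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8]]
avg = average_class_confidence(soft_labels, num_classes=3)
features = tgn_input(soft_labels[0], avg)   # length 2 * C = 6
```

Feeding the class-level statistic alongside the instance-level confidence lets a single network produce thresholds that depend on both how confident the sample is and how well its class is currently being learned.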

4.3 Meta-optimization problem

There are two networks in our training framework: a classification network $f_{\textbf{w}}$ and a meta-net $\mathcal {V}_{\Theta }$. The parameters $\textbf{w}$ and $\Theta$ can be optimized following the meta-learning idea (Andrychowicz et al. 2016; Shu et al. 2019). Specifically, we require a small meta-data set, which can be sampled from the labeled data in the SSL task. Since some works (Shu et al. 2019; Sun et al. 2022) showed that the generalization performance of the meta-model benefits greatly from a large meta-data set, we simply set $D^{\textrm{meta}} = D^l = \{(\textbf{x}^{l}_i, \textbf{y}^{l}_i)\}_{i=1}^N$ (i.e., we use all labeled data to construct the meta-data set). The optimal parameters $\Theta ^*$ can be obtained by minimizing the following loss

$$\begin{aligned} \Theta ^* = \mathop {\arg \min }\limits _{\Theta } L_{\textrm{meta}}(\textbf{w}^*({\Theta })) = \frac{1}{N} \sum \nolimits _{i=1}^N H_i(\textbf{w}^*({\Theta })). \end{aligned}$$

(5)

For clarity, we represent $H_i(\textbf{w})$ as $H(\textbf{y}_i^l, f(\textbf{x}_i^l; \textbf{w}))$.

Obtaining the optimal parameters $\textbf{w}^*$ in Eq. (3) and $\Theta ^*$ in Eq. (5) is a nested optimization problem. To solve it, we resort to a bi-level training strategy as in MAML (Finn et al. 2017) and update the parameters of the meta-net with an online strategy. Specifically, the training losses of the classifier network and the meta-net (Eq. (3) and Eq. (5)) can be optimized via the SGD optimizer. In each training iteration, given a mini-batch size n, we have a batch of meta-data and a batch of unlabeled data, represented as $\{(\textbf{x}_1^l, \textbf{y}_1^l),...,(\textbf{x}_n^l, \textbf{y}_n^l)\}$ and $\{\textbf{x}_1,...,\textbf{x}_{(\mu \times n)}\}$, respectively. Note that we can increase $\mu$ to expand the size of the unlabeled batch in one iteration. In the t-th iteration, we denote the parameters of the classifier network as $\textbf{w}^{(t)}$ and the parameters of the meta-net as $\Theta ^{(t)}$. The two networks are updated in the following three phases.

Formulating the learning manner of the classifier network. Given a step size $\alpha$, one gradient-descent step on the objective in Eq. (3) over a mini-batch of unlabeled data gives
$$\begin{aligned} {\hat{{{\textbf{w}}}}}^{(t)}(\Theta ) = \textbf{w}^{(t)} - \alpha \frac{1}{n\mu } \sum \nolimits _{i=1}^{n\mu } \nabla _{\textbf{w}} \ell _{\textbf{x}_i}(\textbf{w}^{(t)}, \Theta ^{(t)}), \end{aligned}$$
(6)
where $\ell _{\textbf{x}_i}$ is calculated by Eq. (2).
Updating parameters $\Theta$. As we obtain the parameter ${\hat{{{\textbf{w}}}}}^{(t)}(\Theta )$ with fixed $\Theta$ in Eq. (6), the update of our meta-net TGN can be achieved with a mini-batch of meta-data $\{(\textbf{x}_1^l, \textbf{y}_1^l),...,(\textbf{x}_n^l, \textbf{y}_n^l)\}$. Specifically, $\Theta ^{(t)}$ moves along the descent direction of the gradient of the objective in Eq. (5)
$$\begin{aligned} \Theta ^{(t+1)} = \Theta ^{(t)} - \psi \frac{1}{n} \sum \nolimits _{i=1}^{n} \nabla _{\Theta } H_i({\hat{{{\textbf{w}}}}}^{(t)}(\Theta )), \end{aligned}$$
(7)
where $\psi$ denotes the step size of the SGD optimizer. Note that $\Theta$ in this equation is a variable, which enables computation of the gradient $\frac{\partial {\hat{{{\textbf{w}}}}}^{(t)}(\Theta )}{\partial \Theta }$.
Updating parameters $\textbf{w}$ of classifier network. Finally, we use the updated TGN parameters $\Theta ^{(t+1)}$ to regenerate the confidence thresholds for unlabeled data and update the parameters $\textbf{w}$ of the classifier network
$$\begin{aligned} {\textbf{w}}^{(t+1)} = \textbf{w}^{(t)} - \alpha \frac{1}{n\mu } \sum \nolimits _{i=1}^{n\mu } \nabla _{\textbf{w}} \ell _{\textbf{x}_i}(\textbf{w}^{(t)}, \Theta ^{(t+1)}). \end{aligned}$$
(8)
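The three phases in Eqs. (6), (7) and (8) can be traced on a toy scalar problem where all gradients are available in closed form. This is only a sketch under strong simplifying assumptions, not the actual training code: the classifier is a single weight `w`, the meta-net is a single parameter `theta` used directly as the threshold, the confidence `conf` is held fixed, the unlabeled loss is $\mathcal {S}(\text{conf} - \theta )(w - a)^2$ and the meta loss is $(w - b)^2$. All names and constants are hypothetical.

```python
import math

BETA = 10.0  # assumed slope of the surrogate gate

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate(conf, theta):
    """Surrogate S(conf - theta) standing in for the indicator."""
    return sigmoid(BETA * (conf - theta))

def unlabeled_grad_w(w, theta, conf, a):
    """d/dw of the toy unlabeled loss S(conf - theta) * (w - a)^2."""
    return gate(conf, theta) * 2.0 * (w - a)

def virtual_step(w, theta, conf, a, alpha):
    """Eq. (6): look-ahead weight w_hat as a function of theta."""
    return w - alpha * unlabeled_grad_w(w, theta, conf, a)

def meta_loss(w, b):
    """Toy stand-in for the labeled (meta) loss: (w - b)^2."""
    return (w - b) ** 2

def meta_grad_theta(w, theta, conf, a, b, alpha):
    """Chain rule through the look-ahead step:
    d meta_loss(w_hat) / d theta = 2 (w_hat - b) * d w_hat / d theta."""
    w_hat = virtual_step(w, theta, conf, a, alpha)
    s = gate(conf, theta)
    ds_dtheta = -BETA * s * (1.0 - s)
    dwhat_dtheta = -alpha * ds_dtheta * 2.0 * (w - a)
    return 2.0 * (w_hat - b) * dwhat_dtheta

def meta_t_iteration(w, theta, conf, a, b, alpha=0.1, psi=0.05):
    """One iteration: look-ahead (Eq. 6), theta update (Eq. 7), w update (Eq. 8)."""
    theta_new = theta - psi * meta_grad_theta(w, theta, conf, a, b, alpha)
    w_new = w - alpha * unlabeled_grad_w(w, theta_new, conf, a)
    return w_new, theta_new

# Harmful pseudo-label: it pulls w toward a = 1 while the meta-data prefers b = -1.
w_bad, theta_bad = meta_t_iteration(w=0.0, theta=0.8, conf=0.8, a=1.0, b=-1.0)
# Helpful pseudo-label: pseudo target and meta target agree (a = b = 1).
w_good, theta_good = meta_t_iteration(w=0.0, theta=0.8, conf=0.8, a=1.0, b=1.0)
```

In this toy setting the threshold rises when the pseudo-label conflicts with the meta-data and falls when it agrees, so the classifier step in Eq. (8) is damped in the first case and amplified in the second. In practice these gradients are obtained by automatic differentiation rather than by hand.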

We illustrate the flowchart of our learning algorithm in Fig. 3a, where Steps 4, 5 and 6 represent Eqs. (6), (7) and (8), respectively. Meanwhile, we summarize the overall updating steps in Algorithm 1. Compared to current SSL methods, Meta-T does not rely on any prior knowledge to predefine a function for adjusting the threshold. We believe that this merit expands the applicability of our method to environments where we cannot model the data distribution.

4.4 Convergence analysis

We analyze the convergence of Meta-T and give a rigorous theoretical guarantee.

Lemma 1

(Smoothness). Suppose the loss function H is L-Lipschitz and smooth, the approximate function $\mathcal {S}$ is $\zeta$-Lipschitz, $\mathcal {V}(\cdot )$ is differentiable with $\delta$-bounded gradient and twice differentiable with $\mathcal {B}$-bounded Hessian, and the loss function H has $\rho$-bounded gradients w.r.t. training/meta data and is upper bounded by $\phi$. Replacing the indicator function with $\mathcal {S}$, the gradient of the meta loss w.r.t. $\Theta$ is Lipschitz continuous.

The proof is given in Appendix A.1. Lemma 1 implies that the meta loss is smooth with respect to the parameters of the meta-net.

Theorem 1

(Convergence). Based on Lemma 1, let the learning rate $\alpha _t$ satisfy $\alpha _t = \min \{1, \frac{k}{T}\}$ for some $k > 0$ such that $\frac{k}{T} < 1$, and let $\psi _t$, $1 \le t \le T$, be a monotonically decreasing sequence with $\psi _t = \min \{\frac{1}{L}, \frac{\mathrm{{c}}}{\sigma \sqrt{T}}\}$ for some ${\textrm{c}}>0$ such that $\frac{\sigma \sqrt{T}}{\textrm{c}} \ge {L}$ and $\sum \nolimits _{t=1}^{\infty } \psi _t \le \infty , \sum \nolimits _{t=1}^{\infty } \psi _t^2 \le \infty$. Then we have ${\frac{1}{T} \sum \nolimits _{t=1}^T \mathbb {E} \Big [ \left\| \nabla L_{\textrm{meta}} \Big ( \hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)}) \Big ) \right\| _2^2 \Big ] \le \mathcal {O}(\frac{1}{\sqrt{T}})}$.

The proof is given in Appendix A.2. Specifically, Theorem 1 means that our algorithm can achieve ${\mathbb {E} \Big [ \left\| \nabla L_\textrm{meta} \Big ( \hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)}) \Big ) \right\| _2^2 \Big ] \le \epsilon }$ in $\mathcal {O}(1/\epsilon ^2)$ steps, and eventually converges to a stationary point as the number of training iterations increases.
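The step from the averaged bound to the iteration count can be spelled out as follows, writing $K$ for the constant hidden in the $\mathcal {O}(\cdot )$ notation (the symbol $K$ is introduced here only for illustration). Since the smallest term of a sum is no larger than its average,

$$\begin{aligned} \min _{1 \le t \le T} \mathbb {E} \Big [ \left\| \nabla L_{\textrm{meta}} \Big ( \hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)}) \Big ) \right\| _2^2 \Big ] \le \frac{1}{T} \sum \nolimits _{t=1}^T \mathbb {E} \Big [ \left\| \nabla L_{\textrm{meta}} \Big ( \hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)}) \Big ) \right\| _2^2 \Big ] \le \frac{K}{\sqrt{T}}, \qquad \frac{K}{\sqrt{T}} \le \epsilon \iff T \ge \frac{K^2}{\epsilon ^2}. \end{aligned}$$

Hence, after $T = \mathcal {O}(1/\epsilon ^2)$ iterations, at least one iterate has expected squared gradient norm no larger than $\epsilon$.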

4.5 Green Meta-T: training with lower complexity

Training complexity analysis. Compared with the single-step training procedure, the training process of Meta-T can be divided into three parts: (1) forward and backward passes of the classifier network for computing $\hat{{{\textbf{w}}}}(\Theta )$; (2) forward and backward passes of TGN for updating $\Theta$; (3) forward and backward passes of the classifier network for updating $\textbf{w}$. Hence, compared with FixMatch, which only involves one forward and backward pass, Meta-T requires approximately three times the training time.

As summarized by Xu et al. (2021), the main cost of training time is caused by the backpropagation in updating the parameters $\Theta$ of the meta-net, since the meta-gradient in Eq. (7) needs to compute the similarity between each meta-data and unlabeled data. Therefore, reducing the computation of $\hat{{{\textbf{w}}}}(\Theta )$ would significantly decrease training time. To this end, we modify the training procedure by integrating our proposed Meta-T algorithm with curriculum learning and name the result Green Meta-T. Specifically, we conduct the bi-level strategy (i.e., Meta-T) once to learn the classifier network and TGN, and then continue with $\mathrm k$-step classifier learning. The training complexity of Green Meta-T is given as follows.

Proposition 1

For a fixed number of training iterations T, let the training time of FixMatch and Meta-T be $\mathcal {T}$ and $3\mathcal {T}$, respectively. Given a hyper-parameter $\mathrm k$, the training time of Green Meta-T is $\frac{\textrm{k}+2}{\textrm{k}} \mathcal {T}$.

Proposition 1 means that the training complexity of Green Meta-T gradually approaches $\mathcal {T}$ as the value of ${\textrm{k}}$ increases.
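A few values of the expression in Proposition 1 show how quickly the overhead shrinks. The helper below simply evaluates $\frac{\textrm{k}+2}{\textrm{k}}$ in units of the FixMatch training time; the function name and the chosen values of k are illustrative only.

```python
def relative_training_time(k):
    """Training time of Green Meta-T in units of the FixMatch time,
    using the (k + 2) / k expression from Proposition 1."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return (k + 2) / k

# k = 1 recovers the 3x cost of plain Meta-T; larger k approaches 1x.
costs = {k: relative_training_time(k) for k in (1, 2, 10, 100)}
```

Under this expression, k = 1 gives the threefold cost of Meta-T, k = 10 gives 1.2 times the FixMatch cost, and k = 100 gives 1.02 times.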

5 Experiments

Table 1 Details about five tested benchmarks


5.1 Experimental settings

Datasets. We select five image classification datasets and three text classification datasets to evaluate the effectiveness of Meta-T: the image benchmarks CIFAR-10 (Krizhevsky and Hinton 2009), CIFAR-100 (Krizhevsky and Hinton 2009), SVHN (Netzer et al. 2011), STL-10 (Coates et al. 2011), and ImageNet (Deng et al. 2009), and the text benchmarks IMDb (Maas et al. 2011), Amazon-5 (Zhang et al. 2015) and Yelp-5 (Zhang et al. 2015). Detailed statistics of these datasets are shown in Table 1.

Implementation Details. Our code is implemented in PyTorch 1.9.0 and run on RTX 3090 hardware. We use a PyTorch library called Higher (Grefenstette et al. 2019), which provides support for higher-order optimization, to implement our algorithm. All experiments are repeated five times with different random seeds. Other settings for the two networks are as follows.

For the classifier, more information about data preprocessing and training procedure can be found in Table 2.
For TGN, we set the size of meta-data to 32 and use the Adam optimizer with a learning rate of 1e-3 for all training epochs. We construct a three-layer fully-connected MLP for TGN, whose structure is $\{2 \mathcal {C}, h, 1\}$. Notably, h is set to 100 for all image datasets and 1000 for all text datasets, and $\mathcal {C}$ is the number of categories.
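To give a sense of how lightweight this network is, the sketch below counts the parameters of an MLP with layer widths $\{2 \mathcal {C}, h, 1\}$. It assumes that the three numbers are layer widths connected by two affine maps with bias terms, which the text does not state explicitly; the function names are hypothetical.

```python
def tgn_layer_sizes(num_classes, hidden):
    """Layer widths {2C, h, 1} of the threshold generation network."""
    return [2 * num_classes, hidden, 1]

def mlp_parameter_count(layer_sizes):
    """Weights plus biases of a fully-connected network (biases are an assumption)."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# CIFAR-10 has C = 10 classes and CIFAR-100 has C = 100; h = 100 for image datasets.
cifar10_params = mlp_parameter_count(tgn_layer_sizes(10, 100))
cifar100_params = mlp_parameter_count(tgn_layer_sizes(100, 100))
```

Under these assumptions the meta-net has a few thousand parameters on CIFAR-10 and about twenty thousand on CIFAR-100, which is negligible next to the classifier backbone.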

Table 2 Detailed settings about training procedure of the backbone (the classifier network)


Table 3 Error rates (%) for previous SOTA methods on CIFAR-10 and CIFAR-100 with varying size of labeled set


5.2 Results on typical SSL task

Baselines. We categorize the compared methods into two types. 1) Threshold-based methods, including Pseudo-Labeling (PL) (Lee 2013), FixMatch (Sohn et al. 2020), FlexMatch (Zhang et al. 2021) and Dash (Xu et al. 2021). 2) Others, including $\Pi$-Model (Sajjadi et al. 2016), MixMatch (Berthelot et al. 2019), UDA (Xie et al. 2020), CoMatch (Li et al. 2021) and SimMatch (Zheng et al. 2022).

Results on five image datasets. We conduct experiments on CIFAR-10, CIFAR-100, SVHN, STL-10 and ImageNet. The results are shown in Tables 3 and 4. On CIFAR-10 and CIFAR-100, Meta-T outperforms previous methods in the majority of settings. With an extremely small labeled set, the superiority of our method is significant. For example, we achieve a 1.64% top-1 accuracy improvement on CIFAR-100 with only 4 samples per class. Compared with threshold-based methods (Lee 2013; Sohn et al. 2020; Zhang et al. 2021; Xu et al. 2021), the improvement of our method is significant: Meta-T consistently outperforms them in all settings. Finally, our method also achieves SOTA performance on ImageNet. By leveraging only 1% labeled data, Meta-T attains 67.7% top-1 accuracy on the test set. Compared to the previous state-of-the-art method SimMatch, the obtained improvement of 0.5% is significant on ImageNet. The superiority of Meta-T on ImageNet demonstrates its effectiveness on real-world SSL tasks.

Results on three text datasets. For a fair comparison, we keep the same training procedure as SoftMatch. On two of the text benchmarks, IMDb and Yelp-5, our method consistently achieves the best top-1 accuracy. In particular, on the Yelp-5 dataset, Meta-T outperforms the second-best method, FlexMatch, by 0.57% accuracy, which is a large improvement on such a large-scale dataset.

Table 4 (Left) Error rates (%) for previous methods on SVHN and STL-10 with varying size of labeled set


Table 5 Top-1 test accuracy (%) on imbalanced CIFAR-10 under three imbalance ratios and two different sizes of labeled set. The backbone is Wide ResNet-28-2


5.3 Results on imbalanced SSL task

We categorized compared methods into two parts. 1) Threshold-based methods, FixMatch (Sohn et al. 2020, Dash Xu et al. 2021) and FlexMatch (Zhang et al. 2021). 2) Others, cRT (Kang et al. 2019), LDAM, MixMatch (Berthelot et al. 2019), ReMixMatch (Berthelot et al. 2019), DARP (Kim et al. 2020), CReST (Wei et al. 2021) and Adsh (Guo and Li 2022). For constructing imbalanced datasets, we refer to Guo and Li (2022). Specificlly, we write the size of two training sets as $N = \sum \nolimits _{c=1}^C N_c$ and $M = \sum \nolimits _{c=1}^C M_c$. To construct imbalanced datasets, two parameters (imbalance ratio) $\gamma _l, \gamma _u$ is introduced, i.e., $\gamma _l = \frac{N_l}{N_C}, \gamma _2 = \frac{M_1}{M_C}$. Once $\gamma _l,\gamma _u$ and $N_1, M_1$ are given, we set $N_c = N_1 \cdot \gamma _l^{-\frac{c-1}{C-1}}, M_c = M_1 \cdot \gamma _u^{-\frac{c-1}{C-1}}$ for $1 < c \le C$. We conduct experiments on two settings, i.e., $N_1=500, M_1=4000$ and $N_1=1500, M_1=3000$ with varying imbalanced ratios $\gamma _1, \gamma _2 \in [50, 100, 150]$.

In Table 5, we report comparison results under the setting $\gamma = \gamma _l = \gamma _u$. From the results, we can see that (1) our proposed Meta-T achieves state-of-the-art performance in most cases, showing its robustness under data imbalance; (2) as the imbalance ratio increases, the advantage of our algorithm becomes more pronounced. Compared to the second-best method (i.e., Adsh), we achieve a 1.43% top-1 accuracy improvement under $\gamma =100$ and a 2.42% improvement under $\gamma =150$. The performance of Meta-T is slightly lower than that of Adsh in the case $N_1=500, M_1=4000, \gamma =50$.

5.4 Effectiveness analysis

Pseudo-labels. We verify the quality of produced pseudo-labels on both typical and imbalanced SSL settings.

Typical SSL. In Fig. 4a, b (left), Meta-T generates more correct pseudo-labels, which benefits from the higher quality of the thresholds produced by TGN. In the early learning stage, the number of correct labels in our method is remarkably higher than that in FixMatch, reflecting the superiority of sample-level thresholds. In Fig. 4a, b (right), we show the number of wrong labels. Due to the poor performance of TGN in the early learning stage, some low-quality thresholds are produced, causing a greater number of wrong pseudo-labels compared with deterministic methods such as FlexMatch. Fortunately, the number of wrong labels decreases as learning proceeds and is eventually lower than that of FixMatch.
Imbalanced SSL. We analyse the confusion matrix on unlabeled data and show the results in Fig. 5. Thanks to the average class confidence, which is input into the TGN, we believe that TGN can learn the classifier's confidence scores for different categories under imbalanced settings and thus adaptively generate class-balanced confidence thresholds. Experimentally, FixMatch focuses on learning the majority categories and thus produces unreliable pseudo-labels for minority classes. In contrast, Meta-T achieves strong results on tail classes and attains more than 80% accuracy on all classes.

Sample-level thresholds. We show the learned thresholds from three aspects to demonstrate the effectiveness of Meta-T.

Accuracy. Figure 6a shows the learned confidence thresholds on CIFAR-10 and CIFAR-100. We can observe that (1) most learned sample-level thresholds lie in the interval [0.9, 1.0], supporting the prior knowledge that the confidence threshold should be set to 0.95 for CIFAR. The results verify that competitive sample-level thresholds can be learned by TGN; (2) some thresholds below 0.95 are learned by our algorithm, and the corresponding samples can be regarded as hard (or boundary) samples. It is therefore reasonable that TGN gives them relatively low thresholds, which helps the model learn from these samples.
Robustness. Figure 6b visualizes the produced thresholds and test accuracy (%) under long-tail semi-supervised learning. We can see that our proposed Meta-T learns lower thresholds for tail classes while keeping high thresholds for many-shot classes. Since tail classes contain only a small number of samples, the classifier has moderate or low confidence on them. Meta-T therefore produces relatively small thresholds (around 0.5) and thus enables the classifier to learn from more long-tailed unlabeled samples.
Stability. Figure 6c shows the comparison results from dynamic threshold generation. In the beginning, Meta-T tends to initialize the thresholds of all unlabeled data at 0.5 and then quickly raises them to 0.95, which matches the setting in FixMatch. This result suggests that the thresholds learned by Meta-T are close to the optimal thresholds.

5.5 Sensitivity analysis

We conduct experiments to analyse the sensitivity of Meta-T in three aspects.

The architecture of TGN. To examine the impact of the architecture of TGN, we try MLP settings with different depths and widths and show the results in Table 6 (left). It can be seen that the five MLP settings have a negligible effect on the final result. Therefore, we adopt the simple yet effective one, i.e., $\{2\mathcal {C}, 100, 1\}$, for all datasets. Meanwhile, we consider that TGN can attain great performance even with small-scale meta-data due to its tiny number of parameters.
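To make the "tiny number of parameters" concrete, the following sketch counts the parameters of a fully connected network with layer widths $\{2\mathcal{C}, 100, 1\}$. It assumes every layer has a bias term, which the text does not specify; the function names are hypothetical.

```python
def mlp_param_count(layer_widths: list[int]) -> int:
    """Weights plus biases of a fully connected network with the given widths."""
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(layer_widths, layer_widths[1:]))


def tgn_param_count(num_classes: int, hidden: int = 100) -> int:
    # Widths {2C, 100, 1} are taken from the text; bias terms are an assumption.
    return mlp_param_count([2 * num_classes, hidden, 1])


params_c10 = tgn_param_count(10)    # a 10-class task
params_c100 = tgn_param_count(100)  # a 100-class task
```

Under these assumptions the meta-net has 2,201 parameters for 10 classes and 20,201 for 100 classes, several orders of magnitude fewer than a typical image classifier backbone.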

The learning rate $\psi$ w.r.t. the meta-net. Compared with existing methods, our framework introduces one extra hyper-parameter (i.e., the learning rate of the meta-net $\psi$), which does not require a complex cross-validation process. Experimentally, we conduct ablation studies and show results with different optimization settings for TGN in Table 6 (right). We can conclude that our algorithm is insensitive to the hyperparameter $\psi$. Thus, we select a common setting, i.e., the Adam optimizer with a learning rate of 1e-3.

The slope parameter $\beta$. We conduct experiments with varying settings, $\beta \in \{1, 10, 50, 100, 1000\}$. As shown in Fig. 7a, b, the generalization performance initially improves as $\beta$ increases. When $\beta$ exceeds 100, further improvement is marginal. We thus set $\beta =100$ for all experiments.
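The role of the slope can be illustrated with a small sketch. It assumes the surrogate is a logistic sigmoid applied to $\beta$ times the margin between the confidence and the threshold; this functional form and the names `soft_mask` and `hard_mask` are assumptions for illustration, not the authors' implementation.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_mask(confidence: float, threshold: float, beta: float) -> float:
    """Smooth stand-in for the indicator 1(confidence > threshold)."""
    return sigmoid(beta * (confidence - threshold))


def hard_mask(confidence: float, threshold: float) -> float:
    """The original, non-differentiable indicator."""
    return 1.0 if confidence > threshold else 0.0


# A sample slightly above its threshold, for the slopes studied in the text.
masks = [soft_mask(0.97, 0.95, beta) for beta in (1, 10, 50, 100, 1000)]
```

With a small slope the mask stays close to 0.5 regardless of the margin, so confident and unconfident samples are weighted almost equally; with a large slope it approaches the hard indicator while remaining differentiable with respect to the threshold.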

Table 6 Ablation studies of different settings of the meta-net TGN

5.6 Efficiency analysis

The step number $\mathrm k$ of Green Meta-T. We conduct ablation studies on two SSL settings with $\textrm{k} \in \{1,2,...,10\}$. In Fig. 7c, d, we can observe that (1) as $\mathrm k$ increases, the error rate of Green Meta-T gradually increases compared to Meta-T. This is reasonable, since TGN is trained significantly less when more rounds of classifier learning are conducted in the outer loop of curriculum learning. (2) A relatively large $\mathrm k$ might not degrade the performance of Green Meta-T under a mild SSL setting.
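The trade-off controlled by $\mathrm k$ can be pictured with a toy schedule. The sketch below assumes that the meta-net is updated once every $\mathrm k$ classifier updates, so that $\mathrm k = 1$ corresponds to updating it at every step; this schedule and the function name are hypothetical and only illustrate how the number of costly meta-updates shrinks with $\mathrm k$.

```python
def meta_update_steps(total_steps: int, k: int) -> list[int]:
    """1-based iterations at which the meta-net is updated (assumed schedule)."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return [t for t in range(1, total_steps + 1) if t % k == 0]


# Number of meta-updates in 100 classifier iterations for several k.
counts = {k: len(meta_update_steps(100, k)) for k in (1, 2, 5, 10)}
```

Under this assumed schedule the number of second-order meta-gradient computations falls as $1/\mathrm k$, which is consistent with the observation that a larger $\mathrm k$ saves computation but gives the meta-net fewer updates.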

To demonstrate the efficiency of Green Meta-T, we plot learning curves (train loss, train accuracy, test loss and test accuracy) for the two learning algorithms in Fig. 8, with the cumulative number of floating point operations (FLOPs) on the abscissa. FLOPs are counted for both forward and backward propagation. Since the number of epochs for the two algorithms is identical, the learning process of Green Meta-T ends after approximately 240k FLOPs. We highlight that Green Meta-T converges faster than Meta-T at equal cumulative FLOPs and reduces the computational cost of the second-order derivative in the meta-learning phase.

6 Conclusion

In this paper, we consider sample-level thresholds for pseudo-labeling methods in semi-supervised learning and propose a simple yet effective framework, Meta-T. Compared with previous methods, Meta-T contains only one hyperparameter and does not rely on preset adjustment functions. A lightweight meta-net is constructed to generate the sample-aware thresholds automatically. The classifier network and the meta-network are updated via a bi-level optimization strategy. We also design a surrogate function to replace the indicator function in typical pseudo-labeling methods. Further, we theoretically analyze the convergence of Meta-T and provide a solution to reduce training complexity, called Green Meta-T. Extensive experiments on typical and imbalanced SSL demonstrate its effectiveness.

Availability of data and materials

Not applicable.

Change history

16 May 2024
A Correction to this paper has been published: https://doi.org/10.1007/s10994-024-06552-9

References

Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M.W., Pfau, D., Schaul, T., Shillingford, B., & De Freitas, N. (2016). Learning to learn by gradient descent by gradient descent. in NIPS 29
Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009) Curriculum learning. In: ICML, pp. 41–48
Berthelot, D., Carlini, N., Cubuk, E.D., Kurakin, A., Sohn, K., Zhang, H., & Raffel, C. (2019) Remixmatch: Semi-supervised learning with distribution alignment and augmentation anchoring. arXiv preprint arXiv:1911.09785
Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., & Raffel, C.A. (2019) Mixmatch: A holistic approach to semi-supervised learning. in NIPS 32
Coates, A., Ng, A., & Lee, H. (2011). An analysis of single-layer networks in unsupervised feature learning. In: AISTATS, pp. 215–223
Cubuk, E.D., Zoph, B., Shlens, J., & Le, Q.V. (2020) Randaugment: Practical automated data augmentation with a reduced search space. In: CVPRW, pp. 702–703.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In: CVPR, pp. 248–255 IEEE
Finn, C., Abbeel, P., & Levine, S. (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML (PMLR), pp. 1126–1135.
Franceschi, L., Frasconi, P., Salzo, S., Grazzi, R., & Pontil, M. (2018) Bilevel programming for hyperparameter optimization and meta-learning. In: ICML
Grefenstette, E., Amos, B., Yarats, D., Htut, P.M., Molchanov, A., Meier, F., Kiela, D., Cho, K., & Chintala, S. (2019) Generalized inner loop meta-learning. arXiv preprint arXiv:1910.01727
Guo, L.-Z., & Li, Y.-F. (2022) Class-imbalanced semi-supervised learning with adaptive thresholding. In: ICML, pp. 8082–8094
Guo, L.-Z., Zhang, Z.-Y., Jiang, Y., Li, Y.-F., & Zhou, Z.-H. (2020) Safe deep semi-supervised learning for unseen-class unlabeled data. In: ICML
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.
Kang, B., Xie, S., Rohrbach, M., Yan, Z., Gordo, A., Feng, J., & Kalantidis, Y. (2019) Decoupling representation and classifier for long-tailed recognition. arXiv preprint arXiv:1910.09217
Kim, J., Hur, Y., Park, S., Yang, E., Hwang, S. J., & Shin, J. (2020). Distribution aligning refinery of pseudo-label for imbalanced semi-supervised learning. NIPS, 33, 14567–14579.
Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images
Laine, S., & Aila, T. (2016) Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242
Lee, D.-H. (2013) Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In: ICML Workshop. vol 3, p. 896
Li, J., Xiong, C., & Hoi, S.C. (2021) Comatch: Semi-supervised learning with contrastive graph regularization. In: ICCV, pp. 9475–9484
Maas, A., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., & Potts, C. (2011) Learning word vectors for sentiment analysis. In: Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, pp. 142–150
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., & Ng, A.Y. (2011) Reading digits in natural images with unsupervised feature learning
Ren, M., Zeng, W., Yang, B., & Urtasun, R. (2018) Learning to reweight examples for robust deep learning. In: ICML, pp 4334–4343
Saito, K., Kim, D., & Saenko, K. (2021) Openmatch: Open-set consistency regularization for semi-supervised learning with outliers. in NIPS
Sajjadi, M., Javanmardi, M., & Tasdizen, T. (2016) Regularization with stochastic transformations and perturbations for deep semi-supervised learning. in NIPS 29
Shu, J., Xie, Q., Yi, L., Zhao, Q., Zhou, S., Xu, Z., & Meng, D. (2019). Meta-weight-net: Learning an explicit mapping for sample weighting. In NIPS, 32, 19.
Sohn, K., Berthelot, D., Carlini, N., Zhang, Z., Zhang, H., Raffel, C. A., Cubuk, E. D., Kurakin, A., & Li, C.-L. (2020). Fixmatch: Simplifying semi-supervised learning with consistency and confidence. NIPS, 33, 596–608.
Sun, H., Guo, C., Wei, Q., Han, Z., & Yin, Y. (2022). Learning to rectify for robust learning with noisy labels. Pattern Recognition, 124, 108467.
Tarvainen, A., & Valpola, H. (2017). Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NIPS, 30, 17.
Wang, Y., Guo, J., Song, S., & Huang, G. (2020). Meta-semi: A meta-learning approach for semi-supervised learning. arXiv preprint arXiv:2007.02394
Wei, C., Sohn, K., Mellina, C., Yuille, A., & Yang, F. (2021) Crest: A class-rebalancing self-training framework for imbalanced semi-supervised learning. In: CVPR, pp. 10857–10866
Xiao, T., Zhang, X.-Y., Jia, H., Cheng, M.-M., & Yang, M.-H. (2021). Semi-supervised learning with meta-gradient. In: International Conference on Artificial Intelligence and Statistics, pp. 73–81
Xie, Q., Dai, Z., Hovy, E., Luong, T., & Le, Q. (2020). Unsupervised data augmentation for consistency training. NIPS, 33, 6256–6268.
Xu, Y., Shang, L., Ye, J., Qian, Q., Li, Y.-F., Sun, B., Li, H., & Jin, R. (2021). Dash: Semi-supervised learning with dynamic thresholding. In: ICML, pp. 11525–11536
Xu, Y., Zhu, L., Jiang, L., & Yang, Y. (2021) Faster meta update strategy for noise-robust deep learning. In: CVPR, pp. 144–153
Zhang, X., Zhao, J., & LeCun, Y. (2015) Character-level convolutional networks for text classification. in NIPS 28
Zhang, B., Wang, Y., Hou, W., Wu, H., Wang, J., Okumura, M., & Shinozaki, T. (2021). Flexmatch: Boosting semi-supervised learning with curriculum pseudo labeling. NIPS, 34, 18408–18419.
Zheng, M., You, S., Huang, L., Wang, F., Qian, C., & Xu, C. (2022) Simmatch: Semi-supervised learning with similarity matching. In: CVPR, pp. 14471–14481
Zhu, X., & Goldberg, A. B. (2009). Introduction to semi-supervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 3(1), 1–130.

Funding

This research was supported by the Natural Science Foundation of China (No. 62106129, 62176139, 62106028), the Natural Science Foundation of Shandong Province (No. ZR2021QF053, ZR2021ZD15), the Chongqing Overseas Chinese Entrepreneurship and Innovation Support Program, and the CAAI-Huawei MindSpore Open Fund.

Author information

Authors and Affiliations

School of Software, Shandong University, Jinan, China
Qi Wei, Haoliang Sun, Ren Wang, Rundong He & Yilong Yin
School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
Qi Wei & Lei Feng

Contributions

Conceptualization: W-Q; Methodology: W-Q; Theoretical analysis: F-L; Writing-original draft preparation: W-Q, S-HL; Writing-review and editing: W-R, H-RD; Funding acquisition: S-HL, F-L, Y-YL.

Corresponding authors

Correspondence to Haoliang Sun or Yilong Yin.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Code availability

Not applicable.

Additional information

Editors: Vu Nguyen, Dani Yogatama.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Theoretical proof of our method

A.1 Proofs of smoothness

Suppose we are given a small meta dataset with $n$ samples $\{(\textbf{x}_1^l, \textbf{y}_1^l),...,(\textbf{x}_n^l, \textbf{y}_n^l)\}$ and an unlabeled dataset $\{\textbf{x}_1,...,\textbf{x}_{(\mu \times n)}\}$ of size $\mu \times n$. By replacing the indicator function with the approximate function, the meta loss is $L_\textrm{meta}(\textbf{w}^*({\Theta })) = \frac{1}{n} \sum \nolimits _{i=1}^n H(\textbf{y}_i^l, f(\textbf{x}_i^l; \textbf{w}^*({\Theta })))$ and the training loss is

$$\begin{aligned} L_{train}(\textbf{w},\Theta ) = \frac{1}{n\mu } \sum \nolimits _{i=1}^{n\mu } \mathcal {S}_i(\textbf{w}, \Theta ) \cdot H(\hat{\textbf{y}}_i, f(\mathcal {A}^s(\textbf{x}_i); \textbf{w})), \end{aligned}$$

(A1)

where $\mathcal {S}_i(\textbf{w}, \Theta ) = \mathcal {S}( \max (f(\mathcal {A}^w(\textbf{x}_i); \textbf{w})) - \mathcal {V}_i(\textbf{w}, \Theta ))$.

Firstly, we recall the update equation of the parameters of TGN as follows:

$$\begin{aligned} \Theta ^{(t+1)} = \Theta ^{(t)} - \psi \frac{1}{n} \sum \nolimits _{i=1}^{n} \nabla _{\Theta } H(\textbf{y}_i^l, f(\textbf{x}_i^l; {\hat{{{\textbf{w}}}}}^{(t)}(\Theta ))). \end{aligned}$$

(A2)

To be concise, we formulate $H(\textbf{y}_i^l, f(\textbf{x}_i^l; {\hat{{{\textbf{w}}}}}^{(t)}(\Theta )))$ as $H_i^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ))$. Then, the computation of backpropagation for the above equation can be written as

$$\begin{aligned} \begin{aligned}&\frac{1}{n} \sum \nolimits _{i=1}^{n} \nabla _{\Theta } H_i^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta )) \Big |_{\Theta ^{(t)}} = \frac{1}{n} \sum \nolimits _{i=1}^{n} \frac{\partial H_i^\textrm{meta}(\hat{{{\textbf{w}}}})}{\partial {\hat{{\textbf{w}}}}} \Big |_{\hat{{\textbf{w}}}^{(t)}} \sum \nolimits _{j=1}^{n\mu } \frac{\partial \hat{{{\textbf{w}}}}^{(t)}(\Theta )}{\partial \mathcal {S}_j(\textbf{w}^{(t)}; \Theta )} \, \frac{\partial \mathcal {S}_j(\textbf{w}^{(t)}; \Theta )}{\partial \mathcal {V}_j(\textbf{w}^{(t)}; \Theta )} \, \frac{\partial \mathcal {V}_j(\textbf{w}^{(t)}; \Theta )}{\partial \Theta } \Big |_{\Theta ^{(t)}} \\ =&\frac{-\alpha }{n^2\mu } \sum \nolimits _{i=1}^{n} \frac{\partial H_i^\textrm{meta}(\hat{{{\textbf{w}}}})}{\partial {\hat{{\textbf{w}}}}} \Big |_{\hat{{\textbf{w}}}^{(t)}} \sum \nolimits _{j=1}^{n\mu } \frac{\partial \ell _{\textbf{x}_j}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \, \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \Big |_{\textbf{w}^{(t)}} \frac{\partial \mathcal {V}_j(\textbf{w}^{(t)}; \Theta )}{\partial \Theta } \Big |_{\Theta ^{(t)}} \\ =&\frac{- \alpha }{n\mu } \sum \nolimits _{j=1}^{n\mu } \bigg {(} \frac{1}{n} \sum \nolimits _{i=1}^{n} \frac{\partial H_i^\textrm{meta}(\hat{{{\textbf{w}}}})}{\partial {\hat{{\textbf{w}}}}} \Big |_{\hat{{\textbf{w}}}^{(t)}}^T \frac{\partial \ell _{\textbf{x}_j}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \, \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \Big |_{\textbf{w}^{(t)}} \bigg {)} \frac{\partial \mathcal {V}_j(\textbf{w}^{(t)}; \Theta )}{\partial \Theta } \Big |_{\Theta ^{(t)}}. \end{aligned} \end{aligned}$$

(A3)

Let $G_{ij} = \frac{\partial H_i^\textrm{meta}(\hat{{{\textbf{w}}}})}{\partial {\hat{{\textbf{w}}}}} \big |_{\hat{{\textbf{w}}}^{(t)}}^T \frac{\partial \ell _{\textbf{x}_j}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \, \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \big |_{\textbf{w}^{(t)}}$ and substitute $G_{ij}$ into Eq. (A3), then

$$\begin{aligned} \Theta ^{(t+1)} = \Theta ^{(t)} + \frac{\alpha \psi }{n\mu } \sum \nolimits _{j=1}^{n\mu } \bigg {(} \frac{1}{n} \sum \nolimits _{i=1}^{n} G_{ij} \bigg {)} \frac{\partial \mathcal {V}_j(\textbf{w}^{(t)}; \Theta )}{\partial \Theta } \Big |_{\Theta ^{(t)}}. \end{aligned}$$

(A4)

Proof

The gradient of the meta loss w.r.t. $\Theta$ can be formulated as:

$$\begin{aligned} \begin{aligned}&\nabla _{\Theta } H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta )) \Big |_{\Theta ^{(t)}} = -\frac{\alpha }{n\mu } \sum \nolimits _{j=1}^{n\mu } \bigg {(} \frac{\partial H^\textrm{meta}(\hat{{{\textbf{w}}}})}{\partial {\hat{{\textbf{w}}}}} \Big |_{\hat{{\textbf{w}}}^{(t)}}^T \frac{\partial \ell _{\textbf{x}_j}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \, \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \Big |_{\textbf{w}^{(t)}} \bigg {)} \frac{\partial \mathcal {V}_j(\textbf{w}^{(t)}; \Theta )}{\partial \Theta } \Big |_{\Theta ^{(t)}}. \end{aligned} \end{aligned}$$

(A5)

Let $\mathcal {V}_j(\Theta ) = \mathcal {V}_j(\textbf{w}^{(t)}; \Theta )$ and recall $G_{ij}$ as defined before Eq. (A4). Taking the gradient w.r.t. $\Theta$ on both sides of Eq. (A5), we obtain

$$\begin{aligned} \begin{aligned} \nabla _{\Theta ^2}^2 H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta )) \Big |_{\Theta ^{(t)}} = -\frac{\alpha }{n\mu } \sum \nolimits _{j=1}^{n\mu } \bigg {[} \frac{\partial }{\partial \Theta } (G_{ij}) \Big |_{\Theta ^{(t)}} \frac{\partial \mathcal {V}_j(\Theta )}{\partial \Theta } \Big |_{\Theta ^{(t)}} + (G_{ij}) \frac{\partial ^2 \mathcal {V}_j(\Theta )}{\partial ^2 \Theta } \Big |_{\Theta ^{(t)}} \bigg {]}. \end{aligned} \end{aligned}$$

(A6)

The first term on the right-hand side of Eq. (A6) can be bounded as

$$\begin{aligned} \begin{aligned}&\left\| \frac{\partial }{\partial \Theta } (G_{ij}) \Big |_{\Theta ^{(t)}} \frac{\partial \mathcal {V}_j(\Theta )}{\partial \Theta } \Big |_{\Theta ^{(t)}} \right\| \le \, \delta \left\| \frac{\partial }{\partial \hat{{{\textbf{w}}}}} \bigg {(} \frac{\partial H^\textrm{meta} (\hat{{{\textbf{w}}}})}{\partial \Theta } \Big |_{\Theta ^{(t)}} \bigg {)} \Big |_{\hat{{{\textbf{w}}}}^{(t)}}^T \frac{\partial \ell _{\textbf{x}_j}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \, \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \Big |_{\textbf{w}^{(t)}} \right\| \\ =&\, \delta \left\| \frac{\partial }{\partial \hat{{{\textbf{w}}}}} \bigg {(} \frac{\partial H^\textrm{meta} (\hat{{{\textbf{w}}}})}{\partial \hat{{{\textbf{w}}}}} \Big |_{\hat{{{\textbf{w}}}}^{(t)}} \, \frac{-\alpha }{n\mu } \sum \nolimits _{k=1}^{n\mu } \frac{\partial \ell _{\textbf{x}_k}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \Big |_{\textbf{w}^{(t)}} \frac{\partial \mathcal {V}_k(\Theta )}{\partial \Theta } \Big |_{\Theta ^{(t)}} \bigg {)} \Big |_{\hat{{{\textbf{w}}}}^{(t)}}^T \frac{\partial \ell _{\textbf{x}_j}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \, \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \Big |_{\textbf{w}^{(t)}} \right\| \\ =&\, \delta \left\| \bigg {(} \frac{\partial ^2 H^\textrm{meta} (\hat{{{\textbf{w}}}})}{\partial \hat{{{\textbf{w}}}}^2} \Big |_{\hat{{{\textbf{w}}}}^{(t)}} \, \frac{-\alpha }{n\mu } \sum \nolimits _{k=1}^{n\mu } \frac{\partial \ell _{\textbf{x}_k}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \Big |_{\textbf{w}^{(t)}} \frac{\partial \mathcal {V}_k(\Theta )}{\partial \Theta } \Big |_{\Theta ^{(t)}} \bigg {)} \Big |_{\hat{{{\textbf{w}}}}^{(t)}}^T \frac{\partial \ell _{\textbf{x}_j}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \, \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \Big |_{\textbf{w}^{(t)}} \right\| \\ \le&\, \alpha L \delta ^2 \phi ^2 \zeta ^2, \end{aligned} \end{aligned}$$

(A7)

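since ${\left\| \frac{\partial H(\hat{{{\textbf{w}}}})}{\partial {\hat{{\textbf{w}}}}} \big |_{\hat{{\textbf{w}}}^{(t)}}^T \right\| \le \rho , \left\| \frac{\partial \ell _{\textbf{x}_j}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \right\| \le \phi , \left\| \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \big |_{\textbf{w}^{(t)}} \right\| \le \zeta , \left\| \frac{\partial ^2 \mathcal {V}_j(\Theta )}{\partial ^2 \Theta } \Big |_{\Theta ^{(t)}} \right\| \le \mathcal {B}}$. The bound in Eq. (A7) additionally uses ${\left\| \frac{\partial \mathcal {V}_j(\Theta )}{\partial \Theta } \Big |_{\Theta ^{(t)}} \right\| \le \delta }$ and ${\left\| \frac{\partial ^2 H^\textrm{meta} (\hat{{{\textbf{w}}}})}{\partial \hat{{{\textbf{w}}}}^2} \Big |_{\hat{{{\textbf{w}}}}^{(t)}} \right\| \le L}$.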

The second term on the right-hand side of Eq. (A6) can be bounded as

$$\begin{aligned} \begin{aligned} \left\| (G_{ij}) \frac{\partial ^2 \mathcal {V}_j(\Theta )}{\partial ^2 \Theta } \Big |_{\Theta ^{(t)}} \right\| = \left\| \frac{\partial H^\textrm{meta}(\hat{{{\textbf{w}}}})}{\partial {\hat{{\textbf{w}}}}} \big |_{\hat{{\textbf{w}}}^{(t)}}^T \frac{\partial \ell _{\textbf{x}_j}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \, \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \big |_{\textbf{w}^{(t)}} \frac{\partial ^2 \mathcal {V}_j(\Theta )}{\partial ^2 \Theta } \Big |_{\Theta ^{(t)}} \right\| \le \, \rho \phi \zeta \mathcal {B}. \end{aligned} \end{aligned}$$

(A8)

Combining the results in Eq. (A7) and Eq. (A8), we have $\left\| \nabla _{\Theta ^2}^2 H_i^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta )) \Big |_{\Theta ^{(t)}} \right\| \le \phi \zeta (\alpha L \delta ^2 \phi \zeta + \rho \mathcal {B}).$ Defining ${{{\hat{L}}}} = \phi \zeta (\alpha L \delta ^2 \phi \zeta + \rho \mathcal {B})$ and applying the Lagrange mean value theorem, we have:

$$\begin{aligned} \left\| \nabla {L_\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta _1))} - \nabla {L_\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta _2))} \right\| \le {{{\hat{L}}}} \left\| \Theta _1 - \Theta _2 \right\| , \, \text {for all} \, \Theta _1, \Theta _2, \end{aligned}$$

(A9)

where $\nabla {L_\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta _1))} = \nabla _\Theta {L_\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ))}\Big |_{\Theta _1}$. $\square$
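The step from the Hessian bound to Eq. (A9) can be spelled out as follows. This is a sketch that assumes the meta loss is twice continuously differentiable in $\Theta$ and that the bound ${\hat{L}}$ holds along the whole segment between $\Theta _2$ and $\Theta _1$; the shorthand $L_\textrm{meta}(\Theta )$ for $L_\textrm{meta}(\hat{\textbf{w}}^{(t)}(\Theta ))$ is introduced here only for brevity.

$$\begin{aligned} \left\| \nabla L_\textrm{meta}(\Theta _1) - \nabla L_\textrm{meta}(\Theta _2) \right\| = \left\| \int _0^1 \nabla _{\Theta }^2 L_\textrm{meta}\big ( \Theta _2 + s (\Theta _1 - \Theta _2) \big ) \, (\Theta _1 - \Theta _2) \, ds \right\| \le \int _0^1 {\hat{L}} \left\| \Theta _1 - \Theta _2 \right\| ds = {\hat{L}} \left\| \Theta _1 - \Theta _2 \right\| . \end{aligned}$$

Since $L_\textrm{meta}$ is the average of the $H_i^\textrm{meta}$, the per-sample Hessian bound carries over to $L_\textrm{meta}$ by the triangle inequality.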

A.2 Proofs of convergence

Proof

The update of the parameters $\Theta$ in the t-th iteration can be written as $\Theta ^{(t+1)} = \Theta ^{(t)} - \psi \frac{1}{n} \sum \nolimits _{i=1}^n \nabla _\Theta H_i^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta )) \Big |_{\Theta ^{(t)}}.$ Training with a mini-batch of meta-data $\textrm{B}_t$ that is uniformly drawn from the data set, we rewrite the equation above as:

$$\begin{aligned} \Theta ^{(t+1)} = \Theta ^{(t)} - \psi _t \Big [ \frac{1}{n} \sum \nolimits _{i=1}^n \nabla _\Theta H_i^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta )) + \varepsilon ^{(t)} \Big ], \end{aligned}$$

(A10)

where $\varepsilon ^{(t)} = \nabla _\Theta H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta )) \Big |_{\textrm{B}_t} - \nabla _\Theta H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ))$. Note that the expectation of $\varepsilon ^{(t)}$ obeys $\mathbbm {E}[\varepsilon ^{(t)}]=0$ and its variance is finite. Consider that

$$\begin{aligned} \begin{aligned}&H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t+1)}(\Theta ^{(t+1)})) - H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) \\ =&\, \underbrace{H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t+1)}(\Theta ^{(t+1)})) - H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t +1 )}))}_\textrm{term 1} + \underbrace{H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t+1)})) - H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)}))}_\textrm{term 2}. \end{aligned} \end{aligned}$$

(A11)

For $\textrm{term 1}$, by Lipschitz smoothness of the meta loss function for $\Theta$, we have

$$\begin{aligned} \begin{aligned}&H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t+1)}(\Theta ^{(t+1)})) - H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t+1)})) \\ \le&\, \left\langle \nabla H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t+1)})), \hat{{{\textbf{w}}}}^{(t+1)}(\Theta ^{(t+1)}) - \hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t+1)}) \right\rangle + \frac{L}{2} \left\| \hat{{{\textbf{w}}}}^{(t+1)}(\Theta ^{(t+1)}) - \hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t+1)}) \right\| _2^2. \end{aligned} \end{aligned}$$

According to Eqs. (6), (8) and (A1), we have

$$\begin{aligned} \left\| H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t+1)}(\Theta ^{(t+1)})) - H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t +1 )})) \right\| \le \alpha _t \rho ^2 + \frac{1}{2} L \alpha _t^2 \rho ^2 = \alpha _t \rho ^2 (1 + \frac{\alpha _t L}{2}) \end{aligned}$$

(A12)

since ${ \left\| \frac{\partial H_j(\textbf{w})}{\partial \textbf{w}} \big |_{\hat{{{\textbf{w}}}}^{(t)}}\right\| \le \rho , \left\| \frac{\partial H_i^\textrm{meta}(\textbf{w})}{\partial \hat{{{\textbf{w}}}}} \big |_{\hat{{{\textbf{w}}}}^{(t)}}^T \right\| \le \rho }$.

For $\textrm{term 2}$, considering the Lipschitz continuity of $\nabla H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ))$ demonstrated in Lemma 1, we can obtain the following:

$$\begin{aligned} \begin{aligned}&H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t+1)})) - H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) \\ \le&\, \left\langle \nabla H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})), \Theta ^{(t+1)} - \Theta ^{(t)} \right\rangle + \frac{L}{2} \left\| \Theta ^{(t+1)} - \Theta ^{(t)} \right\| _2^2 \\ =&-(\psi _t - \frac{L \psi _t^2}{2}) \left\| \nabla H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) \right\| _2^2 + \frac{L \psi _t^2}{2} \left\| \varepsilon ^{(t)} \right\| _2^2 - (\psi _t - L\psi _t^2) \left\langle \nabla H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})), \varepsilon ^{(t)} \right\rangle . \end{aligned} \end{aligned}$$

(A13)
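The equality in Eq. (A13) follows from a direct expansion. As a sketch, write $\textbf{g} = \nabla H^\textrm{meta}(\hat{\textbf{w}}^{(t)}(\Theta ^{(t)}))$ (a shorthand introduced here) and assume, as in Eq. (A10), that the update is $\Theta ^{(t+1)} - \Theta ^{(t)} = -\psi _t (\textbf{g} + \varepsilon ^{(t)})$. Then

$$\begin{aligned} \begin{aligned}&\left\langle \textbf{g}, -\psi _t (\textbf{g} + \varepsilon ^{(t)}) \right\rangle + \frac{L \psi _t^2}{2} \left\| \textbf{g} + \varepsilon ^{(t)} \right\| _2^2 \\ =&\, -\psi _t \left\| \textbf{g} \right\| _2^2 - \psi _t \left\langle \textbf{g}, \varepsilon ^{(t)} \right\rangle + \frac{L \psi _t^2}{2} \Big ( \left\| \textbf{g} \right\| _2^2 + 2 \left\langle \textbf{g}, \varepsilon ^{(t)} \right\rangle + \left\| \varepsilon ^{(t)} \right\| _2^2 \Big ) \\ =&\, -(\psi _t - \frac{L \psi _t^2}{2}) \left\| \textbf{g} \right\| _2^2 + \frac{L \psi _t^2}{2} \left\| \varepsilon ^{(t)} \right\| _2^2 - (\psi _t - L \psi _t^2) \left\langle \textbf{g}, \varepsilon ^{(t)} \right\rangle . \end{aligned} \end{aligned}$$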

Combining Eqs. (A12) and (A13), Eq. (A11) can be bounded as

$$\begin{aligned} \begin{aligned}&H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t+1)}(\Theta ^{(t+1)})) - H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) \\ \le&\, \alpha _t \rho ^2 (1 + \frac{\alpha _t L}{2}) - (\psi _t - \frac{L \psi _t^2}{2}) \left\| \nabla H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) \right\| _2^2 + \frac{L \psi _t^2}{2} \left\| \varepsilon ^{(t)} \right\| _2^2 - (\psi _t - L\psi _t^2) \left\langle \nabla H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})), \varepsilon ^{(t)} \right\rangle . \end{aligned} \end{aligned}$$

Rearranging the terms, we can obtain

$$\begin{aligned} \begin{aligned}&(\psi _t - \frac{L \psi _t^2}{2}) \left\| \nabla H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) \right\| _2^2 \\ \le&\, \alpha _t \rho ^2 (1 + \frac{\alpha _t L}{2}) + H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) - H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t+1)}(\Theta ^{(t+1)})) + \frac{L \psi _t^2}{2} \left\| \varepsilon ^{(t)} \right\| _2^2 \\&- (\psi _t - L\psi _t^2) \left\langle \nabla H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})), \varepsilon ^{(t)} \right\rangle . \end{aligned} \end{aligned}$$

Summing up the above inequalities and rearranging the terms, we can obtain

$$\begin{aligned} \begin{aligned}&\sum \nolimits _{t=1}^T (\psi _t - \frac{L \psi _t^2}{2}) \left\| \nabla H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) \right\| _2^2\\ \le&\, H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(1)}(\Theta ^{(1)})) - H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(T+1)}(\Theta ^{(T+1)})) + \\&\quad \quad \quad \sum \nolimits _{t=1}^T \alpha _t \rho ^2 (1 + \frac{\alpha _t L}{2}) - \sum \nolimits _{t=1}^T (\psi _t - L\psi _t^2) \left\langle \nabla H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})), \varepsilon ^{(t)} \right\rangle \, + \, \frac{L}{2} \sum \nolimits _{t=1}^T \psi _t^2 \left\| \varepsilon ^{(t)} \right\| _2^2 \\ \le&\, H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(1)}(\Theta ^{(1)})) + \sum \nolimits _{t=1}^T \alpha _t \rho ^2 (1 + \frac{\alpha _t L}{2})\\&- \sum \nolimits _{t=1}^T (\psi _t - L\psi _t^2) \left\langle \nabla H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})), \varepsilon ^{(t)} \right\rangle + \frac{L}{2} \sum \nolimits _{t=1}^T \psi _t^2 \left\| \varepsilon ^{(t)} \right\| _2^2. \end{aligned} \end{aligned}$$

(A14)

We take the expectation w.r.t. $\varepsilon ^{(N)}$ on both sides of Eq. (A14), and then we have:

$$\begin{aligned} \begin{aligned}&\sum \nolimits _{t=1}^T (\psi _t - \frac{L \psi _t^2}{2}) \mathop {\mathbbm {E}}\nolimits _{\varepsilon ^{(N)}} \left\| \nabla H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) \right\| _2^2 \le H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(1)}(\Theta ^{(1)}))\\&+ \sum \nolimits _{t=1}^T \alpha _t \rho ^2 (1 + \frac{\alpha _t L}{2}) + \frac{L\sigma ^2}{2} \sum \nolimits _{t=1}^T \psi _t^2, \end{aligned} \end{aligned}$$

since ${\mathop {\mathbbm {E}}\nolimits _{\varepsilon ^{(N)}} \left\langle \nabla H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})), \varepsilon ^{(t)} \right\rangle =0}$ and $\mathbbm {E} \Big [ \left\| \varepsilon ^{(t)} \right\| _2^2 \Big ] \le \sigma ^2$, where $\sigma ^2$ represents the variance of $\varepsilon ^{(t)}$. Finally, we deduce that

$$\begin{aligned} \mathop {\min }\nolimits _{t}&\, \mathbbm {E} \Big [ \left\| \nabla H^{\textrm{meta}}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) \right\| _2^2 \Big ] \le \frac{\sum \nolimits _{t=1}^T (\psi _t - \frac{L \psi _t^2}{2}) \mathop {\mathbbm {E}}\nolimits _{\varepsilon ^{(N)}} \left\| \nabla H^{\textrm{meta}}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) \right\| _2^2 }{\sum \nolimits _{t=1}^T (\psi _t - \frac{L \psi _t^2}{2})} \\ \le&\, \frac{1}{\sum \nolimits _{t=1}^T (2\psi _t - L\psi _t^2)} \Big [ 2H^{\textrm{meta}}(\hat{{{\textbf{w}}}}^{(1)}(\Theta ^{(1)})) + \sum \nolimits _{t=1}^T \alpha _t \rho ^2 (2 + \alpha _t L) + L\sigma ^2 \sum \nolimits _{t=1}^T \psi _t^2 \Big ] \\ \le&\, \frac{1}{\sum \nolimits _{t=1}^T \psi _t} \Big [ 2H^{\textrm{meta}} (\hat{{{\textbf{w}}}}^{(1)}(\Theta ^{(1)})) + \sum \nolimits _{t=1}^T \alpha _t \rho ^2 (2 + \alpha _t L) + L\sigma ^2 \sum \nolimits _{t=1}^T \psi _t^2 \Big ] \\ \le&\, \frac{1}{T \psi _t} \Big [ 2H^{\textrm{meta}}(\hat{{{\textbf{w}}}}^{(1)}(\Theta ^{(1)})) + \alpha _1 \rho ^2 T (2 + L) + L\sigma ^2 \sum \nolimits _{t=1}^T \psi _t^2 \Big ] \\ \le&\, \frac{2H^{\textrm{meta}}(\hat{{{\textbf{w}}}}^{(1)}(\Theta ^{(1)}))}{T} \, \frac{1}{\psi _t} + \frac{2 \alpha _1 \rho ^2 (2 + L)}{\psi _t} + L\sigma ^2 \psi _t \\ =&\, \frac{2H^{\textrm{meta}}(\hat{{{\textbf{w}}}}^{(1)}(\Theta ^{(1)}))}{T} \max \{L, \frac{\sigma \sqrt{T}}{\textrm{c}}\} + 2\min \{1, \frac{k}{T}\}\max \{L, \frac{\sigma \sqrt{T}}{\textrm{c}}\}\rho ^2(2+L) + L\sigma ^2 \min \{\frac{1}{L}, \frac{\textrm{c}}{\sigma \sqrt{T}}\} \\ \le&\, \frac{2\sigma H^{\textrm{meta}}(\hat{{{\textbf{w}}}}^{(1)}(\Theta ^{(1)}))}{{\textrm{c}} \sqrt{T}} + \frac{2k\sigma \rho ^2(2+L)}{{\textrm{c}} \sqrt{T}} + \frac{L\sigma {\textrm{c}}}{\sqrt{T}} = \mathcal {O}(\frac{1}{\sqrt{T}}). \end{aligned}$$

Therefore, under some mild conditions, our algorithm achieves $\min _{0 \le t \le T} \mathbbm {E} \Big [ \left\| \nabla H^\textrm{meta}(\Theta ^{(t)}) \right\| _2^2 \Big ] \le \mathcal {O}(\frac{1}{\sqrt{T}})$ in $T$ steps. $\square$
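As a rough numerical illustration of the rate, the sketch below evaluates the intermediate bound $\frac{2H^\textrm{meta}}{T\psi} + \frac{2\alpha \rho ^2(2+L)}{\psi } + L\sigma ^2\psi$ from the chain above, with the step sizes $\psi = \min \{\frac{1}{L}, \frac{\textrm{c}}{\sigma \sqrt{T}}\}$ and $\alpha = \min \{1, \frac{k}{T}\}$. The function names and all numeric constants are hypothetical and chosen only for illustration; they are not taken from the experiments. If the bound decays as $\mathcal {O}(1/\sqrt{T})$, then the bound multiplied by $\sqrt{T}$ should stay constant once $T$ is large enough that both minima are attained by their $T$-dependent arguments.

```python
import math


def step_sizes(T, L, sigma, c, k):
    """Step sizes appearing in the proof (constant across iterations for fixed T)."""
    psi = min(1.0 / L, c / (sigma * math.sqrt(T)))  # meta-net step size
    alpha = min(1.0, k / T)                         # classifier step size
    return psi, alpha


def gradient_bound(T, H1, L, sigma, rho, c, k):
    """Evaluate 2*H1/(T*psi) + 2*alpha*rho^2*(2+L)/psi + L*sigma^2*psi."""
    psi, alpha = step_sizes(T, L, sigma, c, k)
    init_term = 2.0 * H1 / (T * psi)                        # initial meta-loss term
    drift_term = 2.0 * alpha * rho ** 2 * (2.0 + L) / psi   # classifier-update term
    noise_term = L * sigma ** 2 * psi                       # gradient-noise term
    return init_term + drift_term + noise_term


# Hypothetical constants, for illustration only.
params = dict(H1=2.0, L=4.0, sigma=1.0, rho=1.0, c=0.5, k=10.0)

for T in (100, 10_000, 1_000_000):
    b = gradient_bound(T, **params)
    # The second printed column shrinks with T; the third stays flat.
    print(T, b, b * math.sqrt(T))
```

With these constants the scaled value equals $\frac{2\sigma H^\textrm{meta}}{\textrm{c}} + \frac{2k\sigma \rho ^2(2+L)}{\textrm{c}} + L\sigma \textrm{c}$ for every $T$ in the loop, which matches the last line of the derivation up to the factor $1/\sqrt{T}$.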

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wei, Q., Feng, L., Sun, H. et al. Learning sample-aware threshold for semi-supervised learning. Mach Learn 113, 5423–5445 (2024). https://doi.org/10.1007/s10994-023-06425-7

Received: 31 May 2023
Revised: 24 August 2023
Accepted: 03 October 2023
Published: 18 January 2024
Issue Date: August 2024
DOI: https://doi.org/10.1007/s10994-023-06425-7
