1 Introduction

Segmentation, which delineates anatomical structures, is a fundamental task in medical image analysis. Supervised learning has driven a series of advances in medical image segmentation [1]. However, the availability of fully and densely labeled data is a common bottleneck in supervised learning, especially in medical image segmentation, since annotating pixel-wise labels is tedious, time-consuming, and requires expert knowledge. Training a model with limited supervision from imperfectly labeled datasets is therefore essential.

Existing works have sought to exploit unlabeled data and weakly labeled data to train segmentation models [2] through semi-supervised learning (SSL) [3,4,5] and weakly-supervised learning [6,7,8]. Semi-supervised segmentation [9,10,11] is an effective paradigm for learning a model from scarce annotations by exploiting both labeled and unlabeled data. Weakly-supervised segmentation aims to reduce the need for densely labeled data by using sparse annotations, e.g., points and scribbles, as supervision signals [2]. In this study, in addition to semi-supervised segmentation, we focus on scribble-supervised segmentation, one of the most active topics in the family of weakly-supervised learning. A conceptual comparison of fully-supervised, semi-supervised, and scribble-supervised segmentation is shown in Fig. 1.

Fig. 1.

Conceptual comparison of fully-supervised (using fully and densely labeled data), semi-supervised (using a part of densely labeled data and unlabeled data), and scribble-supervised (using data with scribble annotations) segmentation.

Consistency regularization enforces agreement among predictions under different kinds of perturbations, e.g., input augmentation [3, 9], network diversity [11, 12], and feature perturbation [13]. Recent works [7, 8, 14,15,16,17,18,19,20] involving consistency regularization show strong performance in tackling limited supervision. Despite their success in learning from imperfect supervision, existing studies are task-specific for semi- or scribble-supervised segmentation. Driven by this limitation, a natural question arises: does a framework generic to semi- and scribble-supervised segmentation exist? Although the two tasks leverage different kinds of imperfect labels, they share the same intrinsic goal: extracting as much informative signal as possible from pixels without ground truth. Thus, such a framework should exist as long as it can learn strong representations from the unlabeled pixels.

Consistency regularization under a more rigorous perturbation empirically leads to better generalization [11]. However, lacking sufficient supervision, models may output inaccurate predictions and then learn from them under consistency enforcement. This vicious cycle would accumulate prediction mistakes and eventually degrade performance. The key to turning the vicious cycle into a virtuous one is therefore to improve the quality of model outputs when adopting a more challenging consistency regularization. From these perspectives, we hypothesize that an eligible framework should have two characteristics: (i) it should output more accurate predictions, and (ii) it should be trained with consistency regularization under a more challenging perturbation.

Based on the above hypothesis, we present a general and effective framework that, for the first time, serves both semi- and scribble-supervised segmentation. The method is simple: it jointly trains triple models and adopts a mix augmentation scheme, hence the name TriMix. To meet requirement (i), TriMix maintains triple networks with identical structures but different initializations to introduce model perturbation, and imposes consistency to minimize disagreement among the models, inspired by the original tri-training strategy [21]. Intuitively, more diverse models can extract more information from the dataset; each model receives valuable information from the other two through inter-model communication and thus generates more accurate predictions. To meet requirement (ii), the model diversity is further blended with the data perturbation introduced by the mix augmentation scheme, forming a more challenging hybrid perturbation. We hypothesize that the tri-training scheme within TriMix complements consistency regularization under this hybrid perturbation well. This self-complementary design enables TriMix to serve as a general learner under limited supervision with different kinds of imperfect labels. Our contributions are:

  • We propose a simple and effective method called TriMix and show, for the first time, that it offers a generic solution for both semi- and scribble-supervised segmentation.

  • We show that purely imposing consistency under a more challenging perturbation, i.e., combining data augmentation and model diversity, on the tri-training framework can serve as a general mechanism for learning with limited supervision.

  • We first validate TriMix on the semi-supervised task. TriMix delivers competitive performance against state-of-the-art (SOTA) methods and shows surprisingly strong potential under the one-shot setting, which is rarely attempted by existing semi-supervised segmentation methods.

  • We then evaluate TriMix on the scribble-supervised task. TriMix surpasses the mainstream methods by a large margin and sets new SOTA performance on the public benchmarks.

2 Related Work

Semi-supervised learning (SSL) trains a model using both labeled and unlabeled data. Existing SSL methods are generally based on pseudo-labeling (also called self-training) [5, 25,26,27] and consistency regularization [3, 4, 28, 29]. Pseudo-labeling takes the model’s class prediction as a label to train against, but performance depends heavily on label quality. Consistency regularization assumes predictions should be invariant under perturbations, such as input augmentation [3, 9], network diversity [11, 12], and feature perturbation [13]. Consistency regularization usually performs better than self-training and has been widely adopted for semi-supervised segmentation [14,15,16,17,18, 30, 31]. A more challenging perturbation empirically benefits model generalization, provided the model can sustainably generate accurate predictions [11]. In this work, we introduce a hybrid perturbation stronger than either of its components, i.e., data augmentation and model diversity.

Weakly-supervised segmentation learns a model from a dataset with weak annotations, e.g., bounding boxes, scribbles, sparse dots, and polygons [2]. In this work, we adopt scribbles as weak annotations; owing to their convenient format, scribbles are widely used in the computer vision community, from classical methods [32, 33] to current scribble-supervised methods [6,7,8, 19, 34,35,36,37]. To learn from scribble supervision, some methods [34,35,36] construct complete labels from the scribbles for training. Other works [37, 38] explore losses to regularize training from scribble annotations, and the scheme of [6] adds extra modules to improve segmentation accuracy. Recently, consistency regularization has been explored in several works [7, 8, 20, 39].

Data augmentation generates virtual training examples to improve model generalization. Auto-augmentation methods [40,41,42,43] automatically search for optimal augmentation policies and achieve higher accuracy than hand-crafted schemes, but at a relatively high search cost. In this study, we focus on mix augmentation [44,45,46,47,48,49,50], a type of strong data augmentation that is more efficient than auto-augmentation. Mix augmentation combines two inputs and their corresponding labels to create virtual samples for training. It has been widely applied in semi-supervised segmentation [9,10,11] as an effective way to introduce data perturbation and synthesize new samples during training. In [7], mix augmentation was first introduced to increase supervision for scribble-supervised segmentation.

Co-training and tri-training are two SSL approaches of a similar flavor: both maintain multiple models and regularize the disagreement among their outputs. The co-training framework [51, 52] assumes sufficient and different views of the training data, each of which can independently train a model. Maintaining view diversity is, in some sense, similar to data perturbation in SSL. Co-training has been extended to semi-supervised segmentation [18, 53]. Unlike co-training, tri-training [21] does not require view differences; instead, it introduces model diversity and minimizes the disagreement among the outputs. This strategy is similar to imposing consistency under model perturbation in SSL. There are several variants of tri-training [54,55,56,57], but none target semi- or scribble-supervised segmentation. In this work, we revisit tri-training and explore its potential as a general solution for limited supervision when combined with mix augmentation.

Fig. 2.

Overview of TriMix. TriMix maintains triple networks \(f_{1}\), \(f_{2}\), and \(f_{3}\), which have the same architecture but different weights. Three steps are taken when given a mini-batch containing images \({\textbf {x}}\) and ground truth \({\textbf {y}}\) at each training iteration. Step 1: first forward pass. For \(i\in \left\{ 1,2,3\right\} \), each network \(f_{i}\) outputs \({\textbf {p}}_{i}\) for \({\textbf {x}}\), under the supervision of \({\textbf {y}}\). Step 2: mix augmentation. Three batches \(\left\{ {\textbf {x}},{\textbf {y}},{\textbf {p}}_{1}\right\} \), \(\left\{ {\textbf {x}},{\textbf {y}},{\textbf {p}}_{2}\right\} \), and \(\left\{ {\textbf {x}},{\textbf {y}},{\textbf {p}}_{3}\right\} \) are randomly shuffled to obtain new batches \(\left\{ \tilde{{\textbf{x}}}_{1},\tilde{{\textbf{y}}}_{1},\tilde{{\textbf{p}}}_{1}\right\} \), \(\left\{ \tilde{{\textbf{x}}}_{2},\tilde{{\textbf{y}}}_{2},\tilde{{\textbf{p}}}_{2}\right\} \), and \(\left\{ \tilde{{\textbf{x}}}_{3},\tilde{{\textbf{y}}}_{3},\tilde{{\textbf{p}}}_{3}\right\} \). Each pair of these new batches is then mixed up to form batches \(\left\{ \bar{{\textbf{x}}}_{1},\bar{{\textbf{x}}}_{2},\bar{{\textbf{x}}}_{3}\right\} \), \(\left\{ \bar{{\textbf{y}}}_{1},\bar{{\textbf{y}}}_{2},\bar{{\textbf{y}}}_{3}\right\} \), and \(\left\{ \hat{{\textbf{y}}}_{1},\hat{{\textbf{y}}}_{2},\hat{{\textbf{y}}}_{3}\right\} \). Squares with mixed colors indicate mixed samples. Step 3: second forward pass. For \(i\in \left\{ 1,2,3\right\} \), each network \(f_{i}\) outputs \(\bar{{\textbf{p}}}_{i}\) for \(\bar{{\textbf{x}}}_{i}\), under the supervision of \(\bar{{\textbf{y}}}_{i}\). An unsupervised loss is calculated between \(\bar{{\textbf{p}}}_{i}\) and \(\hat{{\textbf{y}}}_{i}\). Note that \(\hat{{\textbf{y}}}_{i}\) can be soft (probability maps) or hard pseudo-labels (one-hot maps). (Color figure online)

3 Method

3.1 Overview

This paper proposes a simple and general framework, TriMix, to tackle semi- and scribble-supervised segmentation. The plain architecture of TriMix is illustrated in Fig. 2. TriMix adheres to the spirit of tri-training, simultaneously learning triple networks \(f_{1}\), \(f_{2}\), and \(f_{3}\), which have identical structures but different weights \(\textbf{w}_{1}\), \(\textbf{w}_{2}\), and \(\textbf{w}_{3}\), to introduce network perturbation. In addition, mix augmentation is adopted to introduce input data perturbation. Assume a mini-batch \(\textbf{b} = \left\{ \textbf{x},\textbf{y}\right\} \) is fetched at each training iteration, where \(\textbf{x}\) and \(\textbf{y}\) are the images and the corresponding ground truth. TriMix processes each batch in three steps.

Step 1: first forward pass. For \(i\in \left\{ 1,2,3\right\} \), each network \(f_{i}\) is fed with images \(\textbf{x}\) and outputs the prediction \({\textbf {p}}_{i}\). A supervised loss \(L_{sup} \left( {\textbf {p}}_{i}, \textbf{y}\right) \) is then imposed between \({\textbf {p}}_{i}\) and the ground truth \(\textbf{y}\).

Step 2: mix augmentation. After Step 1, we obtain three batches \({\textbf {b}}_{1} = \left\{ {\textbf {x}},{\textbf {y}},{\textbf {p}}_{1}\right\} \), \({\textbf {b}}_{2} =\left\{ {\textbf {x}},{\textbf {y}},{\textbf {p}}_{2}\right\} \), and \({\textbf {b}}_{3} =\left\{ {\textbf {x}},{\textbf {y}},{\textbf {p}}_{3}\right\} \). The goal is to mix up the pairs \(\left( {\textbf {b}}_{2}, {\textbf {b}}_{3}\right) \), \(\left( {\textbf {b}}_{1}, {\textbf {b}}_{3}\right) \), and \(\left( {\textbf {b}}_{1}, {\textbf {b}}_{2}\right) \) to generate new batches. Similar to the mixing operation described in the original papers [44, 46], we first randomly shuffle \({\textbf {b}}_{1}\), \({\textbf {b}}_{2}\), and \({\textbf {b}}_{3}\) to generate three new batches \(\tilde{{\textbf{b}}}_{1} = \left\{ \tilde{{\textbf{x}}}_{1},\tilde{{\textbf{y}}}_{1},\tilde{{\textbf{p}}}_{1}\right\} \), \(\tilde{{\textbf{b}}}_{2} = \left\{ \tilde{{\textbf{x}}}_{2},\tilde{{\textbf{y}}}_{2},\tilde{{\textbf{p}}}_{2}\right\} \), and \(\tilde{{\textbf{b}}}_{3} = \left\{ \tilde{{\textbf{x}}}_{3},\tilde{{\textbf{y}}}_{3},\tilde{{\textbf{p}}}_{3}\right\} \), in which \(\tilde{{\textbf{x}}}_{1}\), \(\tilde{{\textbf{x}}}_{2}\), and \(\tilde{{\textbf{x}}}_{3}\) have different image orders, and each \(\tilde{{\textbf{y}}}_{i}\) and \(\tilde{{\textbf{p}}}_{i}\) correspond to \(\tilde{{\textbf{x}}}_{i}\) for \(i\in \left\{ 1,2,3\right\} \). Afterward, we apply the mix augmentation to the pairs \(\left( \tilde{{\textbf{b}}}_{2}, \tilde{{\textbf{b}}}_{3}\right) \), \(\left( \tilde{{\textbf{b}}}_{1}, \tilde{{\textbf{b}}}_{3}\right) \), and \(\left( \tilde{{\textbf{b}}}_{1}, \tilde{{\textbf{b}}}_{2}\right) \) to generate new batches \(\bar{{\textbf{b}}}_{1} = \left\{ \bar{{\textbf{x}}}_{1},\bar{{\textbf{y}}}_{1},\hat{{\textbf{y}}}_{1}\right\} \), \(\bar{{\textbf{b}}}_{2} = \left\{ \bar{{\textbf{x}}}_{2},\bar{{\textbf{y}}}_{2},\hat{{\textbf{y}}}_{2}\right\} \), and \(\bar{{\textbf{b}}}_{3} = \left\{ \bar{{\textbf{x}}}_{3},\bar{{\textbf{y}}}_{3},\hat{{\textbf{y}}}_{3}\right\} \) with mixed samples. Take the pair \(\left( \tilde{{\textbf{b}}}_{2}, \tilde{{\textbf{b}}}_{3}\right) \) as an example: each image in \(\tilde{{\textbf{x}}}_{2}\) is mixed with the image at the same index in \(\tilde{{\textbf{x}}}_{3}\) to yield \(\bar{{\textbf{x}}}_{1}\); then \(\tilde{{\textbf{y}}}_{2}\) and \(\tilde{{\textbf{y}}}_{3}\), as well as \(\tilde{{\textbf{p}}}_{2}\) and \(\tilde{{\textbf{p}}}_{3}\), are mixed in the same proportions to obtain \(\bar{{\textbf{y}}}_{1}\) and \(\hat{{\textbf{y}}}_{1}\). Squares with mixed colors in Fig. 2 indicate mixed samples. A minimal sketch of this step is given below.
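To make Step 2 concrete, here is a minimal PyTorch-style sketch of the shuffle-and-mix operation for 2D batches shaped (B, C, H, W), using CutMix as the mix strategy (our default; see the settings below); all function and variable names are illustrative, not taken from a released implementation.

```python
import torch

def rand_box(h, w, ratio=0.2):
    """Sample one rectangular region covering `ratio` of the image area."""
    ch, cw = int(h * ratio ** 0.5), int(w * ratio ** 0.5)
    cy = torch.randint(0, h - ch + 1, (1,)).item()
    cx = torch.randint(0, w - cw + 1, (1,)).item()
    return cy, cx, ch, cw

def mix_pair(batch_a, batch_b, ratio=0.2):
    """CutMix each tensor in batch_a with its counterpart in batch_b,
    sharing one box so images, labels, and predictions stay aligned."""
    h, w = batch_a[0].shape[-2:]
    cy, cx, ch, cw = rand_box(h, w, ratio)
    out = []
    for ta, tb in zip(batch_a, batch_b):
        t = ta.clone()
        t[..., cy:cy + ch, cx:cx + cw] = tb[..., cy:cy + ch, cx:cx + cw]
        out.append(t)
    return out  # [x_bar, y_bar, y_hat] for a {x, y, p} triple

def trimix_step2(x, y, p1, p2, p3, ratio=0.2):
    """Shuffle {x, y, p_i} independently, then mix complementary pairs."""
    b = x.size(0)
    tilde = []
    for p in (p1, p2, p3):
        perm = torch.randperm(b)  # a different image order per batch
        tilde.append((x[perm], y[perm], p[perm]))
    bar_b1 = mix_pair(tilde[1], tilde[2], ratio)  # (b2, b3) -> bar_b1
    bar_b2 = mix_pair(tilde[0], tilde[2], ratio)  # (b1, b3) -> bar_b2
    bar_b3 = mix_pair(tilde[0], tilde[1], ratio)  # (b1, b2) -> bar_b3
    return bar_b1, bar_b2, bar_b3
```

Under the default pseudo supervision consistency described below, each \({\textbf {p}}_{i}\) passed to this routine would be the hard pseudo-label map obtained from the first forward pass.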

Step 3: second forward pass. For \(i\in \left\{ 1,2,3\right\} \), we feed each network \(f_{i}\) with the mixed images \(\bar{{\textbf{x}}}_{i}\) to obtain the individual prediction \(\bar{{\textbf{p}}}_{i}\). Each \(\bar{{\textbf{p}}}_{i}\) is optimized to be close to the mixed ground truth \(\bar{{\textbf{y}}}_{i}\) via a supervised loss \(L_{sup} \left( \bar{{\textbf{p}}}_{i}, \bar{{\textbf{y}}}_{i}\right) \). Besides, consistency is enforced between \(\bar{{\textbf{p}}}_{i}\) and the mixed pseudo-labels \(\hat{{\textbf{y}}}_{i}\) via an unsupervised loss \(L_{unsup} \left( \bar{{\textbf{p}}}_{i}, \hat{{\textbf{y}}}_{i}\right) \). Note that \(\hat{{\textbf{y}}}_{i}\) can be soft (probability maps) or hard pseudo-labels (one-hot maps). A typical choice, adopted by most methods [4, 14, 17], is the soft pseudo-label, where an unsupervised loss \(L_{unsup}^{p}\) measures probability consistency with the mean squared error (MSE). By contrast, several works, e.g., [8, 10], utilize the hard pseudo-label, where an unsupervised loss \(L_{unsup}^{s}\) measures pseudo supervision consistency. The two choices are contrasted in the sketch below.
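A hedged sketch of the two unsupervised-loss options, with plain MSE and CE standing in for the concrete instantiations (this paper uses Dice as \(L_{unsup}\) in Sect. 3.2 and CE in Sect. 3.3); the function names are ours.

```python
import torch
import torch.nn.functional as F

def unsup_prob_consistency(p_bar_logits, y_hat_soft):
    """L_unsup^p: match probabilities against the mixed soft
    pseudo-labels y_hat from Step 2 via MSE."""
    return F.mse_loss(torch.softmax(p_bar_logits, dim=1), y_hat_soft.detach())

def unsup_pseudo_supervision(p_bar_logits, y_hat_soft):
    """L_unsup^s: supervise with hard (argmax) pseudo-labels via CE."""
    hard = y_hat_soft.detach().argmax(dim=1)  # (B, H, W) class indices
    return F.cross_entropy(p_bar_logits, hard)
```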

To conclude, the total optimization objective of each network is

$$\begin{aligned} L_{i} = L_{sup} \left( {\textbf {p}}_{i}, \textbf{y}\right) + \lambda _{1} L_{sup} \left( \bar{{\textbf{p}}}_{i}, \bar{{\textbf{y}}}_{i}\right) + \lambda _{2} L_{unsup} \left( \bar{{\textbf{p}}}_{i}, \hat{{\textbf{y}}}_{i}\right) , \end{aligned}$$
(1)

where \(i\in \left\{ 1,2,3\right\} \) indexes the items corresponding to network \(f_{i}\), and \(\lambda _{1}\) and \(\lambda _{2}\) are hyperparameters balancing the terms.

Default Settings. In this study, we adopt pseudo supervision consistency; as Sect. 4.4 will show, TriMix achieves better accuracy with pseudo supervision consistency than with probability consistency. Besides, we utilize CutMix [46] as the mix strategy, similar to [9,10,11], though other kinds of mix augmentation should also fit our framework.

Inference Process. TriMix contains triple networks with different weights; for a test sample, each network outputs a prediction individually. We report both the average result over the three networks and their ensemble result obtained by soft voting.
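A small sketch of this inference procedure, assuming three trained PyTorch networks that output logits of shape (B, K, H, W):

```python
import torch

@torch.no_grad()
def trimix_infer(nets, x):
    """Return the three individual segmentations and the soft-voting ensemble."""
    probs = [torch.softmax(f(x), dim=1) for f in nets]
    individual = [p.argmax(dim=1) for p in probs]            # per-model results
    ensemble = torch.stack(probs).mean(dim=0).argmax(dim=1)  # averaged probabilities
    return individual, ensemble
```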

The following two sections show how TriMix applies to the semi- and scribble-supervised tasks, following the standard process from Step 1 to Step 3.

3.2 TriMix in Semi-Supervised Segmentation

Semi-supervised segmentation aims to learn a model by exploiting two given datasets: a labeled dataset \(\textbf{D}_{l} = \left\{ \textbf{X}_{l}, \textbf{Y}_{l}\right\} \) and an unlabeled dataset \(\textbf{D}_{u} = \left\{ \textbf{X}_{u}\right\} \), where \(\textbf{X}\) and \(\textbf{Y}\) denote images and the corresponding ground truth.

Assume a mini-batch of labeled data \(\textbf{b}_{l} = \left\{ \textbf{x}_{l},\textbf{y}_{l}\right\} \in {\textbf{D}_{l}}\) and a mini-batch of unlabeled data \(\textbf{b}_{u} = \left\{ \textbf{x}_{u}\right\} \in {\textbf{D}_{u}}\) are sampled at each training iteration. We illustrate the training details for \(\textbf{b}_{l}\) and \(\textbf{b}_{u}\) in the following.

First, the mini-batch \(\textbf{b}_{l}\) contains images and the corresponding ground truth, so TriMix could be optimized with \(\textbf{b}_{l}\) following the standard process illustrated in Fig. 2. However, existing SSL methods, e.g., [10, 11], rarely introduce perturbations to the labeled data, even though doing so can benefit performance. Following previous methods, we optimize TriMix with only Step 1 and skip Step 2 and Step 3 when using \(\textbf{b}_{l}\). Thus, for \(i\in \left\{ 1,2,3\right\} \), assuming each network \(f_{i}\) outputs prediction \(\textbf{p}_{l_{i}}\) for images \(\textbf{x}_{l}\), only a supervised loss \(L_{sup} \left( \textbf{p}_{l_{i}}, \textbf{y}_{l}\right) \) is calculated between \(\textbf{p}_{l_{i}}\) and the ground truth \(\textbf{y}_{l}\).

Second, the mini-batch \(\textbf{b}_{u}\) contains images \(\textbf{x}_{u}\) but no labels. TriMix can still be optimized with \(\textbf{b}_{u}\) following the standard process illustrated in Fig. 2, but without the supervised terms. Specifically, for \(i\in \left\{ 1,2,3\right\} \), each network \(f_{i}\) outputs an individual prediction \(\textbf{p}_{u_{i}}\) for \(\textbf{x}_{u}\) in the first forward pass at Step 1; no supervised term applies here due to the lack of ground truth. At Step 2, the three batches \({\textbf {b}}_{u_{1}} = \left\{ {\textbf {x}}_{u},{\textbf {p}}_{u_{1}}\right\} \), \({\textbf {b}}_{u_{2}} =\left\{ {\textbf {x}}_{u},{\textbf {p}}_{u_{2}}\right\} \), and \({\textbf {b}}_{u_{3}} =\left\{ {\textbf {x}}_{u},{\textbf {p}}_{u_{3}}\right\} \), which contain no ground truth, are mixed up to generate augmented batches \(\bar{{\textbf{b}}}_{u_{1}} = \left\{ \bar{{\textbf{x}}}_{u_{1}},\hat{{\textbf{y}}}_{u_{1}}\right\} \), \(\bar{{\textbf{b}}}_{u_{2}} = \left\{ \bar{{\textbf{x}}}_{u_{2}},\hat{{\textbf{y}}}_{u_{2}}\right\} \), and \(\bar{{\textbf{b}}}_{u_{3}} = \left\{ \bar{{\textbf{x}}}_{u_{3}},\hat{{\textbf{y}}}_{u_{3}}\right\} \), which carry no mixed ground truth. At Step 3, each network \(f_{i}\), fed with mixed images \(\bar{{\textbf{x}}}_{u_{i}}\), is expected to output a prediction \(\bar{\textbf{p}}_{u_{i}}\) close to \(\hat{{\textbf{y}}}_{u_{i}}\), enforced by an unsupervised loss \(L_{unsup} \left( \bar{\textbf{p}}_{u_{i}}, \hat{{\textbf{y}}}_{u_{i}}\right) \).

To conclude, the total training objective of each network in this task is

$$\begin{aligned} L_i = L_{sup} \left( \textbf{p}_{l_{i}}, \textbf{y}_{l}\right) + \lambda L_{unsup} \left( \bar{\textbf{p}}_{u_{i}}, \hat{{\textbf{y}}}_{u_{i}}\right) , \end{aligned}$$
(2)

where items with \(i\in \left\{ 1,2,3\right\} \) correspond to network \(f_{i}\), and \(\lambda \) is a trade-off hyperparameter. Moreover, we use the Dice loss [58] \(L_{dice}\) as both the supervised and unsupervised loss. Thus, Eq. (2) is re-written as

$$\begin{aligned} L_i = \underbrace{ L_{dice} \left( \textbf{p}_{l_{i}}, \textbf{y}_{l}\right) }_\mathrm{{sup}} + \underbrace{\lambda L_{dice} \left( \bar{\textbf{p}}_{u_{i}}, \hat{{\textbf{y}}}_{u_{i}}\right) }_\mathrm{{unsup}}. \end{aligned}$$
(3)
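For reference, here is a minimal sketch of a soft Dice loss usable for both terms of Eq. (3); the exact formulation of [58] may differ in details (e.g., squared denominators), and the pseudo-label target \(\hat{{\textbf{y}}}_{u_{i}}\) should be detached from the graph before the call.

```python
import torch

def dice_loss(logits, target, eps=1e-5):
    """Soft Dice loss. logits: (B, K, ...); target: (B, K, ...) one-hot
    or (detached) pseudo-label maps."""
    probs = torch.softmax(logits, dim=1)
    dims = tuple(range(2, probs.ndim))  # sum over spatial dimensions
    inter = (probs * target).sum(dims)
    union = probs.sum(dims) + target.sum(dims)
    return 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()
```

With this helper, the unsupervised term of Eq. (3) amounts to \(\lambda \) times dice_loss applied to \(\bar{\textbf{p}}_{u_{i}}\) and \(\hat{{\textbf{y}}}_{u_{i}}\).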

3.3 TriMix in Scribble-Supervised Segmentation

Scribble-supervised segmentation trains a model from a given dataset \(\textbf{D}_{s} = \left\{ \textbf{X}_{s}, \textbf{Y}_{s}\right\} \), where \(\textbf{X}_{s}\) and \(\textbf{Y}_{s}\) are images and the related scribble annotations.

Let \(\textbf{b}_{s} = \left\{ \textbf{x}_{s},\textbf{y}_{s}\right\} \in {\textbf{D}_{s}}\) denote a mini-batch fetched at each training iteration. Since \(\textbf{b}_{s}\) contains images and the corresponding ground truth in the form of scribbles, we follow the standard process illustrated in Fig. 2 to train TriMix with \(\textbf{b}_{s}\). Suppose, for \(i\in \left\{ 1,2,3\right\} \), each network \(f_{i}\) outputs its prediction \(\textbf{p}_{s_{i}}\) for \(\textbf{x}_{s}\) at Step 1; we obtain mixed batches \(\bar{{\textbf{b}}}_{s_{1}} = \left\{ \bar{{\textbf{x}}}_{s_{1}},\bar{{\textbf{y}}}_{s_{1}},\hat{{\textbf{y}}}_{s_{1}}\right\} \), \(\bar{{\textbf{b}}}_{s_{2}} = \left\{ \bar{{\textbf{x}}}_{s_{2}},\bar{{\textbf{y}}}_{s_{2}},\hat{{\textbf{y}}}_{s_{2}}\right\} \), and \(\bar{{\textbf{b}}}_{s_{3}} = \left\{ \bar{{\textbf{x}}}_{s_{3}},\bar{{\textbf{y}}}_{s_{3}},\hat{{\textbf{y}}}_{s_{3}}\right\} \) at Step 2; and each network outputs \(\bar{{\textbf{p}}}_{s_{i}}\) for \(\bar{{\textbf{x}}}_{s_{i}}\) at Step 3. Then, identical to Eq. (1), the training objective of each network \(f_{i}\) in scribble-supervised segmentation is

$$\begin{aligned} L_i = L_{sup} \left( \textbf{p}_{s_{i}}, \textbf{y}_{s}\right) + \lambda _{1} L_{sup} \left( \bar{{\textbf{p}}}_{s_{i}}, \bar{{\textbf{y}}}_{s_{i}}\right) + \lambda _{2} L_{unsup} \left( \bar{{\textbf{p}}}_{s_{i}}, \hat{{\textbf{y}}}_{s_{i}}\right) , \end{aligned}$$
(4)

where \(\lambda _{1}\) and \(\lambda _{2}\) are hyperparameters balancing each term.

Besides, since \(\textbf{y}_{s}\) and \(\bar{{\textbf{y}}}_{s_{i}}\) are scribble annotations, we apply the partial cross-entropy (pCE) loss [38] \(L_{pce}\), which is computed only over annotated pixels, as the supervised loss, following [7, 8, 38]. Formally, let \(\textbf{m}\) and \(\textbf{n}\) be the prediction and the scribble annotation; \(L_{pce} \left( \textbf{m}, \textbf{n}\right) \) is defined as

$$\begin{aligned} L_{pce} \left( \textbf{m}, \textbf{n}\right) = -\sum _{j\in J}\sum _{k=1}^{K} \textbf{n}^{jk} \log \textbf{m}^{jk}, \end{aligned}$$
(5)

in which \(J\) is the set of pixels with scribble annotations and \(K\) is the number of classes. \(\textbf{m}^{jk}\) denotes the predicted value of the \(k\)-th channel for the \(j\)-th pixel in \(\textbf{m}\), and \(\textbf{n}^{jk}\) is the corresponding ground truth of the \(k\)-th channel for the \(j\)-th annotated pixel in \(\textbf{n}\).
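Since Eq. (5) is a cross-entropy restricted to annotated pixels, it can be sketched compactly in PyTorch; encoding unannotated pixels with an ignore index is a choice of this sketch, not prescribed by the paper.

```python
import torch.nn.functional as F

def pce_loss(logits, scribble, ignore_index=255):
    """Partial cross-entropy. logits: (B, K, H, W); scribble: (B, H, W)
    class indices with unannotated pixels set to ignore_index.
    cross_entropy with ignore_index averages over annotated pixels only."""
    return F.cross_entropy(logits, scribble, ignore_index=ignore_index)
```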

Lastly, we use the cross-entropy (CE) loss \(L_{ce}\) as the unsupervised loss. Thus, Eq. (4) is re-written as

$$\begin{aligned} L_{i} = \underbrace{L_{pce}^{unmix} \left( \textbf{p}_{s_{i}}, \textbf{y}_{s}\right) + \lambda _{1} L_{pce}^{mix} \left( \bar{{\textbf{p}}}_{s_{i}}, \bar{{\textbf{y}}}_{s_{i}}\right) }_\mathrm{{sup}} + \underbrace{\lambda _{2} L_{ce}^{mix} \left( \bar{{\textbf{p}}}_{s_{i}}, \hat{{\textbf{y}}}_{s_{i}}\right) }_\mathrm{{unsup}}, \end{aligned}$$
(6)

where the superscript unmix denotes that the labels used are the original ones without mix augmentation, and the superscript mix denotes that the labels and pseudo-labels are generated by the mix augmentation.
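Reusing pce_loss from the sketch above, the per-network objective of Eq. (6) could be assembled as follows; a plain cross-entropy serves as the unsupervised term since \(\hat{{\textbf{y}}}_{s_{i}}\) is a hard pseudo-label map. This is a sketch under the same encoding assumptions, not the authors' released code.

```python
import torch.nn.functional as F

def scribble_objective(p, y_s, p_bar, y_bar, y_hat, lam1=1.0, lam2=1.0):
    """Eq. (6) for one network. p, p_bar: logits for original and mixed
    images; y_s, y_bar: scribble maps (ignore index marks unannotated
    pixels); y_hat: mixed hard pseudo-labels of shape (B, H, W)."""
    sup_unmix = pce_loss(p, y_s)               # L_pce^unmix
    sup_mix = pce_loss(p_bar, y_bar)           # L_pce^mix
    unsup_mix = F.cross_entropy(p_bar, y_hat)  # L_ce^mix (pseudo supervision)
    return sup_unmix + lam1 * sup_mix + lam2 * unsup_mix
```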

4 Experiments on Semi-Supervised Segmentation

4.1 Data and Evaluation Metric

ACDC Dataset [59] consists of 200 MRI volumes from 100 patients; each volume is manually delineated with ground truth for the left ventricle (LV), the right ventricle (RV), and the myocardium (Myo). The original volume sizes are \( (154-428)\times (154-512)\times (6-18)\) pixels. We resized all volumes to \(256\times 256\times 16\) pixels and normalized the intensities to zero mean and unit variance. We performed 4-fold cross-validation and validated our method under the 16/150 partition protocol: in each fold, we sampled 16 of the 150 training volumes as labeled data and treated the remaining ones as unlabeled data.

Hippocampus Dataset, collected by the Medical Segmentation Decathlon, comprises 390 MRI volumes of the hippocampus. We utilized its training set (260 volumes), which contains the corresponding ground truth for the anterior and posterior regions of the hippocampus, for validation. Volume sizes are \( (31-43)\times (40-59)\times (24-47)\) pixels, and we resized all volumes to \(32\times 48\times 32\) pixels. With this dataset, we tackled a tougher problem in which only one labeled sample is available for training, i.e., the one-shot setting. We conducted 4-fold cross-validation, sampling 1 of the 195 training volumes in each fold as labeled data and treating the rest as unlabeled data.

Evaluation Metric. The Dice score and the 95% Hausdorff Distance (95HD) were used to measure volume overlap and surface distance, respectively.

Table 1. Comparison with semi-supervised state-of-the-arts on ACDC dataset under 16/150 partition protocol. We report the average (standard deviation) results based on 4-fold cross-validation. \(_{}^{\dagger }\): method with ensemble strategy.

4.2 Experimental Setup

Implementation Details. We adopted V-Net [58] as the backbone architecture. To fit the volumetric data, we extended CutMix [46] to 3D and set the cropped volume ratio to 0.2. We empirically set \(\lambda \) to 0.5 in Eq. (3). We trained TriMix 300 epochs using SGD with a weight decay of 0.0001 and a momentum of 0.9. The initial learning rate was set to 0.01 and was divided by 10 every 100 epochs. At each training iteration, 4 labeled and 4 unlabeled samples were fetched for the ACDC dataset, and 1 labeled and 4 unlabeled samples were fetched for the Hippocampus dataset.
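One plausible way to sample the 3D CutMix box under these settings is sketched below; the paper specifies only the cropped-volume ratio of 0.2, so taking an equal fraction of each axis is our assumption.

```python
import torch

def rand_box_3d(d, h, w, ratio=0.2):
    """Sample a cuboid covering `ratio` of a (D, H, W) volume."""
    side = ratio ** (1 / 3)  # equal fraction per axis
    cd, ch, cw = int(d * side), int(h * side), int(w * side)
    z = torch.randint(0, d - cd + 1, (1,)).item()
    y = torch.randint(0, h - ch + 1, (1,)).item()
    x = torch.randint(0, w - cw + 1, (1,)).item()
    return z, y, x, cd, ch, cw
```

The sampled box is then pasted across images, labels, and pseudo-labels exactly as in the 2D sketch of Sect. 3.1.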

Baseline and Upper Bound. We provide baseline and upper-bound settings for reference. We trained the backbone V-Net only with the partitioned labeled data and treated the result as the baseline. The result trained with the complete labeled data was regarded as the upper bound accuracy.

Mainstream Approaches. We implemented several SSL algorithms: Mean Teacher (MT) [4], Uncertainty-Aware Mean Teacher (UA-MT) [14], CutMix-Seg [9], Spatial-Temporal Smoothing Mean Teacher (STS-MT) [28], Uncertainty-Aware Multi-View Co-Training (UMCT) [18], and Cross Pseudo Supervision (CPS) [10], and compared TriMix against them. CutMix-Seg and CPS were incorporated with the 3D CutMix augmentation, and UMCT was trained with three different views. We report the student-model results for MT, UA-MT, STS-MT, and CutMix-Seg. Since CPS and UMCT train more than one model, we report the average result over their trained models, plus the ensemble result for UMCT, the same as for TriMix.

4.3 Experiment Results

Improvement over the Baseline. We investigated TriMix’s effectiveness in exploiting the unlabeled data. As illustrated in Table 1 and Table 2, TriMix significantly improves the baseline: it gains +15.7% in Dice and -13.7 in 95HD on the ACDC dataset, and +54.7% in Dice and -6.8 in 95HD on the Hippocampus dataset, demonstrating that TriMix can effectively mine informative cues from the unlabeled data to improve generalization.

Comparison with SOTAs. For the ACDC dataset under the 16/150 partition protocol (see Table 1), CutMix-Seg achieves better average results than MT, confirming the effectiveness of strong input perturbation. STS-MT employs a spatial-temporal smoothing mechanism and outperforms CutMix-Seg. UMCT, in a co-training style, takes advantage of multi-view information; it yields higher accuracy than STS-MT but does not reach the performance of CPS. TriMix obtains the best results among all methods. For the Hippocampus dataset under the one-shot setting (see Table 2), the existing SSL methods generally improve upon the baseline, verifying that they can exploit the unlabeled data. TriMix greatly outperforms the other methods, producing meaningful accuracy even with a single labeled volume; notably, it surpasses the second-best method, CPS, by +12.5% in Dice and -2.1 in 95HD. Validation on these two datasets reveals that TriMix is competitive with SOTAs under typical partition protocols and has strong potential for learning from extremely scarce labeled data.

Table 2. Comparison with semi-supervised state-of-the-arts on Hippocampus dataset with one-shot setting. We report the average (standard deviation) results based on 4-fold cross-validation. \(_{}^{\dagger }\): method with ensemble strategy.
Fig. 3.

Empirical study on different types of consistency regularization and various partition protocols on the ACDC and Hippocampus datasets. \(L_{unsup}^{s}\): an unsupervised loss that compares pseudo supervision consistency. \(L_{unsup}^{p}\): an unsupervised loss that calculates probability consistency. \(_{}^{\dagger }\): method with ensemble strategy.

4.4 Empirical Study and Analysis

Pseudo Supervision Consistency vs. Probability Consistency. We compared pseudo supervision consistency (denoted \(L_{unsup}^{s}\)) and probability consistency (denoted \(L_{unsup}^{p}\)) on the ACDC and Hippocampus datasets under different partition protocols. Results are shown in Fig. 3. Overall, TriMix with \(L_{unsup}^{s}\) outperforms TriMix with \(L_{unsup}^{p}\) across all partition protocols on both datasets. In particular, under the one-shot setting on the Hippocampus dataset, \(L_{unsup}^{s}\) surpasses \(L_{unsup}^{p}\) by +54.2% in Dice and -5.9 in 95HD, indicating that a one-hot label map serves better than a probability map as the expanded ground truth for supervising the other models within TriMix. Previous works [5, 8, 10] have reported similar observations. Using hard pseudo-labels encourages models to make low-entropy/high-confidence predictions and is closely related to entropy minimization [60]. Based on this ablation, we use pseudo supervision consistency as the default setting for TriMix in both semi- and scribble-supervised segmentation.

Robustness to Different Partition Protocols. We studied TriMix’s robustness to various partition protocols on the ACDC and Hippocampus datasets. As shown in Fig. 3, TriMix consistently improves upon the baseline and outperforms UA-MT across all partition protocols, demonstrating the robustness and effectiveness of our method under different data settings. Moreover, TriMix surpasses the upper bound accuracy under the 72/150 partition protocol on the ACDC dataset and the 96/195 protocol on the Hippocampus dataset, revealing that TriMix can greatly reduce dependence on labeled data.

Relations to Existing Methods. Among the semi-supervised methods compared, UMCT and CPS are the two most closely related to TriMix. UMCT is a co-training-based strategy that introduces view differences; TriMix resembles UMCT in that both follow the spirit of multi-model joint training and encourage consistency among models, but TriMix adopts a stricter perturbation. Moreover, CPS can be regarded as a downgraded version of TriMix, in which two perturbed networks generate hard pseudo-labels to supervise each other. TriMix outperforms both UMCT and CPS on the ACDC and Hippocampus datasets, demonstrating the superiority of our strategy of applying consistency regularization under a more challenging perturbation within tri-training.

5 Experiments on Scribble-Supervised Segmentation

5.1 Data and Evaluation Metric

ACDC Dataset [59], introduced in Sect. 4.1, was reused in this task, now with the corresponding scribble annotations [6]. We resized all slices to 256\(\times \)256 pixels and normalized their intensities to [0,1], identical to [8].

MSCMRseg Dataset [61] comprises LGE-MRI images from 45 patients. We utilized the scribble annotations of LV, Myo, and RV released by [7] and used the same data partition as theirs: 25 images for training, 5 for validation, and 15 for testing. For data preprocessing, we re-sampled all images to a resolution of 1.37\(\times \)1.37 mm, cropped or padded them to 212\(\times \)212 pixels, and normalized each image to zero mean and unit variance.

Evaluation Metric. Dice score and 95HD were utilized.

5.2 Experimental Setup

Implementation Details. We adopted the 2D U-Net architecture [62] as the backbone for all experiments in this task. The cropped area ratio was set to 0.2 for the CutMix augmentation, and \(\lambda _{1}\) and \(\lambda _{2}\) in Eq. (6) were empirically set to 1. For the ACDC dataset, we used almost the same settings as [8]: we optimized TriMix with SGD (weight decay = 0.0001, momentum = 0.9) for a total of 60000 iterations under a poly learning rate schedule with an initial value of 0.03 and a batch size of 12, and performed 5-fold cross-validation. For the MSCMRseg dataset, we followed [7] to train TriMix for 1000 epochs with the Adam optimizer and a fixed learning rate of 0.0001, conducting 5 runs with seeds 1, 2, 3, 4, and 5.

Baseline and Upper Bound. A 2D U-Net trained with scribble annotations using the pCE loss [38] was regarded as the baseline. The upper bound accuracy was obtained by training with fully dense annotations.

Mainstream Approaches. We compared TriMix with several methods, including training with pseudo-labels generated by Random Walks (RW) [33], Scribble2Labels (S2L) [19], Uncertainty-Aware Self-Ensembling and Transformation Consistency Model (USTM) [39], Entropy Minimization (EM) [60], Mumford-Shah Loss (MLoss) [63], Regularized Loss (RLoss) [37], Dynamically Mixed Pseudo Labels Supervision (abbreviated to DMPLS in this paper) [8], CycleMix [7], and Shape-Constrained Positive-Unlabeled Learning (ShapePU) [20].

5.3 Experiment Results

Improvement over Baseline. As shown in Table 3 and Table 4, TriMix significantly improves the baseline on the ACDC and MSCMRseg datasets, gaining +20.2% and +49.6% in Dice, respectively, which shows that TriMix can learn good representations from sparse scribble annotations.

Comparison with SOTAs. For the ACDC dataset (see Table 3), TriMix achieves the highest average Dice and 95HD among all scribble-supervised methods and comes closest to the upper bound accuracy. It is worth noting that TriMix obtains a gain of +1.6% in Dice over DMPLS and a reduction of 1.0 in 95HD compared with RLoss. For the MSCMRseg dataset (see Table 4), TriMix surpasses all mix augmentation-based schemes, i.e., MixUp, CutOut, CutMix, PuzzleMix, CoMixUp, and CycleMix, as well as the two SOTAs, CycleMix and ShapePU. TriMix outperforms CycleMix by +7.4% and ShapePU by +2.2% in Dice, and even exceeds the upper bound accuracy by +11.9%. Evaluations on these two benchmarks reveal that TriMix generalizes better from sparse annotations than the SOTAs.

Table 3. Comparison with scribble-supervised state-of-the-arts on ACDC dataset. Other average (standard deviation) results are from [8]. Ours are based on 5-fold cross-validation. \(_{}^{\dagger }\): method with ensemble strategy.

5.4 Empirical Study and Analysis

Ablation on Different Loss Combinations. We investigated the effect of different loss combinations on accuracy, as illustrated in Fig. 4. Leveraging only the original scribble annotations, \(L_{pce}^{unmix}\) yields the lower bound accuracy. \(L_{pce}^{mix}\) boosts the lower bound by +2.8% in Dice, showing that mix augmentation effectively augments the scribble annotations and thus improves accuracy. \(L_{ce}^{mix}\) contributes far more, improving the lower bound by +41.0% in Dice, which reveals that pseudo supervision is essential for TriMix. Combining all losses yields the highest accuracy.

Relations to Existing Methods. TriMix is related to DMPLS and CycleMix. DMPLS utilizes co-labeled pseudo-labels from multiple diverse branches to supervise single-branch outputs via consistency regularization. CycleMix employs mix augmentation to augment scribble annotations and imposes consistency under input perturbation. TriMix sits in the middle ground: it imports mix augmentation similar to CycleMix and enforces consistency among multiple outputs with pseudo-label supervision, resembling DMPLS. TriMix thus incorporates the features beneficial for scribble-supervised segmentation and achieves new SOTA performance on two public benchmarks, i.e., the ACDC and MSCMRseg datasets.

Table 4. Comparison with scribble-supervised state-of-the-arts on MSCMRseg dataset. Other average (standard deviation) results in Dice score are from [7, 20]. Ours are based on 5 runs. \(_{}^{\dagger }\): method with ensemble strategy.
Fig. 4.

Ablation study on different loss combinations on the ACDC dataset with scribble annotations using Dice score.

6 Discussion and Conclusion

This paper addresses semi- and scribble-supervised segmentation in a general way. We hypothesize that a general learner under limited supervision should (i) output more accurate predictions and (ii) be trained with consistency regularization under a more challenging perturbation. We empirically verify this hypothesis with a simple framework, TriMix, which purely imposes consistency on a tri-training framework under a stricter perturbation, i.e., combining data augmentation and model diversity. Our method is competitive with task-specific mainstream methods; it shows strong potential when trained with extremely scarce labeled data and achieves new SOTA performance on two popular benchmarks when learning from sparse annotations. We also provide extra evaluations of our method in the appendix.

Moreover, as suggested by [64], deep ensembles provide a simple and scalable way to estimate uncertainty. TriMix maintains triple diverse networks, a nature that allows efficient uncertainty modeling; estimating and quantifying uncertainty for models learned from limited supervision is essential yet rarely explored. It is also interesting to investigate whether TriMix can handle other types of imperfect annotations, e.g., noisy labels [2, 65]. In addition, TriMix’s mechanism is similar to that of BYOL [66], which employs two networks and enforces representation consistency between them; TriMix may thus be applicable to self-supervised learning, though this requires further evaluation. Last but not least, similar to multi-view co-training [18], TriMix is inherently expensive in computation. To make TriMix more efficient, we may investigate strategies such as MIMO [67] in the future. We regard the above avenues as follow-up work.