1 Introduction

Unsupervised domain adaptation (UDA) aims to transfer knowledge learned from a label-rich source domain to an unlabeled target domain. It is a practical and crucial problem, as it can benefit various label-scarce real-world scenarios, e.g., simulated learning for robots [11] or autonomous driving [31]. In this paper, we focus on UDA for semantic segmentation, aiming to adapt a source segmentation model to a target domain without any labels.

The dominant paradigm in UDA is based on adversarial learning [5, 17, 19, 25, 36, 37]. In particular, it minimizes both a (source-domain) task-specific loss and a domain adversarial loss. The method thus retains good performance on the source-domain task and, at the same time, bridges the gap between the source and target feature distributions. While adversarial learning has achieved great success in UDA, another line of studies using self-training has recently emerged [39, 40]. Self-training generates a set of pseudo labels corresponding to high prediction scores in the target domain and then re-trains the network on the generated pseudo labels. Zou et al. have proposed two seminal works on CNN-based self-training: class-balanced self-training (CBST) [39] and confidence-regularized self-training (CRST) [40]. Unlike adversarial learning methods, which utilize two separate losses, CBST presents a single unified self-training loss. It allows learning domain-invariant features and classifiers in an end-to-end manner, both from labeled source data and pseudo-labeled target data. CRST further generalizes the feasible space of pseudo labels and adopts a regularizer. These self-training methods show state-of-the-art results in multiple UDA settings. However, we observe that their internal pseudo-label selection tends to excessively cut out predictions, which often leads to sparse pseudo labels. We argue that sparse pseudo labels miss significant, meaningful training signals, and thus the final model may eventually deviate from the optimal solution. A natural way to obtain dense pseudo labels is to lower the selection threshold. However, we observe that this naive approach brings in noisy, unconfident predictions at an early stage, which accumulates and propagates errors.

To effectively address this issue, we present a two-step, gradual pseudo label densification method. The overview is shown in Fig. 1. In the first phase, we use sliding window voting to propagate the confident predictions, utilizing the intrinsic spatial correlations in the images. In the second phase, we perform an easy-hard classification using a proposed image-level confidence score. Our intuition is simple: As the model improves over time, its predictions can be trusted more. Thus, if the model in the second stage is confident with their prediction, we now do not zero out them. Indeed, we empirically observe that the confident, easy samples are near to the ground truth and vice versa. This motivates us to utilize full pseudo labels for the easy samples, while for the hard samples, we enforce adversarial loss to learn hard-to-easy adaption. Meanwhile, to tackle noisy labels effectively for both first and second phase training, we introduce the bootstrapping mechanism into the self-training loss function. By connecting all together, we build a two-phase pseudo label densification framework called TPLD. Since our method is general, we can easily apply TPLD to the existing self-training based approaches. We show consistent improvements over the strong baselines. Finally, we achieve new state-of-the-art performances on two standard UDA benchmarks.

We summarize our contributions as follows:

  1. To the best of our knowledge, this is the first time that pseudo-label densification has been formally defined and explored in self-training based domain adaptation.

  2. We present a novel two-phase pseudo-label densification framework, called TPLD. For the first phase, we introduce a voting-based densification method. For the second phase, we propose an easy-hard classification based densification method. The two phases are complementary in constructing an accurate self-training model.

  3. We propose a new objective function to ease training. Specifically, we re-formulate the original self-training loss by incorporating a bootstrapping mechanism.

  4. We conduct extensive ablation studies to thoroughly investigate the impact of our proposals. We apply TPLD to various existing self-training approaches and achieve new state-of-the-art results on two standard UDA benchmarks.

2 Related Works

Domain Adaptation is a classic problem in computer vision and machine learning. It aims to alleviate the performance drop caused by the distribution mismatch across domains. It has mostly been investigated in image classification, by both conventional methods [8, 12, 13, 20, 22] and deep CNN-based methods [9, 10, 21, 24, 27, 29, 33]. Beyond image recognition, domain adaptation has recently been applied to other vision tasks such as object detection [4], depth estimation [1], and semantic segmentation [17]. In this work, we are particularly interested in unsupervised domain adaptation for semantic segmentation. The primary approach is to minimize the discrepancy between source and target feature distributions using adversarial learning. In practice, these approaches operate at three different levels: input-level alignment [5, 17, 28, 34], intermediate feature-level alignment [18, 19, 23, 25, 37], and output-level alignment [36]. Although these methods are proven to be effective, the potentially meaningful training signals from the target domain are under-utilized. Therefore, self-training based UDA approaches [39, 40], described next, emerged recently and quickly came to dominate the benchmarks.

Self-training was initially explored in semi-supervised learning [14, 38]. Recently, two seminal works [39, 40] have been presented for UDA semantic segmentation. Unlike adversarial learning approaches, these methods explicitly exploit supervision signals from the target domain. The key idea is to use the predictions of the source-trained model as pseudo labels for the unlabeled data and to re-train the current model in the target domain. CBST [39] extends this basic idea with a class-balancing strategy and spatial priors. CRST [40] further adds a regularization term to the loss function to prevent overconfident predictions. In this paper, we also investigate the self-training framework. However, different from previous studies, we see the sparse pseudo-label problem as a fundamental limitation of self-training. We empirically found that these sparse pseudo labels inhibit effective learning, so the model significantly deviates from the optimum. We therefore propose to gradually densify the sparse pseudo labels in two steps. We also present a new loss function that handles noisy pseudo labels and reduces optimization difficulties during training. We empirically confirm that our proposals improve the strong state-of-the-art baselines by healthy margins.

3 Preliminaries

3.1 Problem Setting

Following the common UDA setting, we have full access to the data and labels, \((\mathbf{x}_{s}, \mathbf{y}_{s})\), in the labeled source domain. In contrast, in the unlabeled target domain, we can only utilize the data, \(\mathbf{x}_{t}\). In self-training, we thus train the network to infer pseudo target labels, \(\hat{\mathbf{y}}_{t} = (\hat{y}_{t}^{(1)}, \ldots, \hat{y}_{t}^{(K)})\), where K denotes the total number of classes.

3.2 Self-training for UDA

We first revisit the general self-training loss function [40] below:

$$\begin{aligned} \min _{\mathbf{w}, \hat{\mathbf{Y}}_{T}} \mathcal {L}_{st} =&-\sum _{s \in S}\sum _{k=1}^{K} y_{s}^{(k)} \log p(k|\mathbf{x}_{s};\mathbf{w}) - \sum _{t \in T}\sum _{k=1}^{K} \hat{y}_{t}^{(k)} \log \frac{p(k|\mathbf{x}_{t};\mathbf{w})}{\lambda _{k}} + \alpha \, r_{c}(\mathbf{w}, \hat{\mathbf{Y}}_{T}) \\&\text {s.t. } \hat{\mathbf{y}}_{t} \in \Delta ^{K-1} \cup \{\mathbf{0}\}, \; \forall t \end{aligned}$$
(1)

\(\mathbf{x}_{s}\) denotes an image in the source domain indexed by \(s = 1, 2, \ldots, S\), and \(\mathbf{x}_{t}\) is an image in the target domain indexed by \(t = 1, 2, \ldots, T\). \(y_{s}^{(k)}\) is the ground-truth source label for class k, and \(\hat{y}_{t}^{(k)}\) is the generated pseudo target label. Note that the feasible set of a pseudo label is the union of \(\{\mathbf{0}\}\) and a probability simplex \(\Delta ^{K-1}\) (i.e., continuous). \(\mathbf{w}\) denotes the network weights, and \(p(k|\mathbf{x};\mathbf{w})\) indicates the classifier's softmax probability for class k. \(\lambda _{k}\) is a parameter controlling pseudo-label selection [39]. \(r_{c}(\mathbf{w}, \hat{\mathbf{Y}}_{T})\) is the confidence regularizer, and \(\alpha \ge 0\) is its weight coefficient.

We can better understand Eq. (1) by dividing it into three terms: the first term is model training on the source domain with source labels, \(y_{s}\). The second term is model re-training on the target domain with generated target pseudo labels, \(\hat{y}_{t}\). The last term is the confidence regularization, \(\alpha \, r_{c}(\mathbf{w}, \hat{\mathbf{Y}}_{T})\), which prevents over-confident predictions on target pseudo labels. The first two terms are equivalent to the CBST formula [39]; with the additional confidence regularization term, we arrive at the CRST formula [40]. In general, there are two types of regularization: label regularization (e.g., LRENT) and model regularization (e.g., MRKLD).

To minimize Eq. (1), the optimization algorithm alternately takes block coordinate descent steps on 1) pseudo-label generation and 2) network re-training. For step 1), the solver is formulated as:

$$\begin{aligned} \hat{y}_{t}^{(k)*}={\left\{ \begin{array}{ll} 1, &{} \text {if }\,k = \mathop {\mathrm {arg\,max}}\limits _{k}\big \{\frac{p(k|\mathbf{x}_{t};\mathbf{w})}{\lambda _{k}}\big \} \text { and } p(k|\mathbf{x}_{t};\mathbf{w}) > \lambda _{k} \\ 0, &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(2)

If a prediction is confident, i.e., \(p(k|\mathbf{x}_{t};\mathbf{w}) > \lambda _{k}\), it is selected and labeled as class \(k^{*} = \mathop {\mathrm {arg\,max}}\limits _{k}\{\frac{p(k|\mathbf{x}_{t};\mathbf{w})}{\lambda _{k}}\}\). Otherwise, the less confident prediction is set to the zero vector \(\mathbf{0}\). For each class k, we determine \(\lambda _{k}\) as the confidence value of the most confident p portion of class-k predictions over the entire target set [39]. To avoid selecting unconfident predictions at an early stage, the hyperparameter p is usually set to a low value (i.e., 0.2) and gradually increased in each additional round. To solve step 2), we use typical gradient-based methods (e.g., SGD). For more details, please refer to the original papers [39, 40].
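For concreteness, the class-wise threshold selection and the pseudo-label generation of Eq. (2) can be sketched as below. This is a minimal NumPy sketch rather than the original implementation: the array shapes, the function names, and the handling of empty classes are illustrative assumptions.

```python
import numpy as np

def determine_thresholds(max_probs, preds, num_classes, p=0.2):
    """Set lambda_k to the confidence of the p-th most confident class-k
    prediction over the entire target set (CBST-style selection [39])."""
    lambdas = np.ones(num_classes)
    for k in range(num_classes):
        conf_k = np.sort(max_probs[preds == k])[::-1]  # descending confidences
        n_keep = int(len(conf_k) * p)
        if n_keep > 0:
            lambdas[k] = conf_k[n_keep - 1]
    return lambdas

def generate_pseudo_labels(prob_map, lambdas):
    """Eq. (2): label a pixel with k* = argmax_k p(k|x;w)/lambda_k if
    p(k*|x;w) > lambda_{k*}; otherwise leave it unlabeled (-1 here stands
    in for the all-zero pseudo-label vector).

    prob_map: (K, H, W) softmax output for one target image
    """
    normed = prob_map / lambdas[:, None, None]          # p(k|x;w)/lambda_k
    k_star = normed.argmax(axis=0)                      # (H, W)
    conf = np.take_along_axis(prob_map, k_star[None], axis=0)[0]
    return np.where(conf > lambdas[k_star], k_star, -1)
```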

We see that the current self-training approach simply zeroes out less confident predictions and in turn generates sparse pseudo labels. We argue that this limits the representational power of the model and can produce a sub-optimal model. Motivated by our empirical observations, we attempt to densify the sparse pseudo labels gradually while avoiding noisy predictions. In this work, we propose TPLD, which successfully alleviates these fundamental issues. We show that TPLD can be applied to any type of existing self-training based framework and consistently boosts performance significantly.

3.3 Noisy Label Handling

To handle noisy predictions, Reed et al. [30] proposed the bootstrapping loss, a weighted sum of the standard cross-entropy loss and a (self-)entropy loss. In this work, we apply it to the self-training formula as:

$$\begin{aligned} -\sum _{t \in T}\sum _{k=1}^{K} \Big [\beta \hat{y}_{t}^{(k)} + (1-\beta )\frac{p(k|\mathbf{x}_{t};\mathbf{w})}{\lambda _{k}}\Big ]\log \frac{p(k|\mathbf{x}_{t};\mathbf{w})}{\lambda _{k}} \end{aligned}$$
(3)

Intuitively, it simultaneously encourages the model to predict the correct (pseudo) target label and to be highly confident in its prediction.
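A PyTorch sketch of how the bootstrapped target term of Eq. (3) could be computed is given below. The masking convention (-1 for unlabeled pixels), the default \(\beta\), and the choice to detach the soft target are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def bootstrapped_target_loss(logits, pseudo_labels, lambdas, beta=0.95):
    """Eq. (3): mix the hard pseudo label with the model's own
    lambda-normalized prediction, weighted by beta, on labeled pixels.

    logits:        (B, K, H, W) target-domain outputs
    pseudo_labels: (B, H, W) long tensor, -1 where no pseudo label exists
    lambdas:       (K,) tensor of class-wise thresholds
    """
    probs = F.softmax(logits, dim=1)
    normed = probs / lambdas.view(1, -1, 1, 1)            # p(k|x;w)/lambda_k
    log_normed = torch.log(normed.clamp(min=1e-8))
    one_hot = F.one_hot(pseudo_labels.clamp(min=0),
                        num_classes=logits.size(1)).permute(0, 3, 1, 2).float()
    # beta * y_hat + (1 - beta) * p/lambda; target detached (our assumption)
    target = beta * one_hot + (1.0 - beta) * normed.detach()
    pixel_loss = -(target * log_normed).sum(dim=1)        # (B, H, W)
    return pixel_loss[pseudo_labels >= 0].mean()
```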

Fig. 1. The overview of the proposed two-phase pseudo-label densification framework. (a) The first phase utilizes sliding-window based voting, which propagates neighboring confident predictions to fill in the unlabeled pixels. We use \(\mathcal {L}_{st_{1}}\) to train the model in the first phase. (b) The second phase employs confidence-based easy-hard classification (EH class.) along with hard-to-easy adversarial learning. This allows the model to utilize full pseudo labels for easy samples while pushing hard samples to be like easy ones. We use both \(\mathcal {L}_{st_{2}}\) and \(\mathcal {L}_{adv}\) to train the model in the second phase.

4 Method

The overview of our two-phase pseudo-label densification algorithm is shown in Fig. 1. In the first phase, we design a sliding-window based voting method to propagate confident predictions. After sufficient training, we enter the second phase, where we present confidence-based easy-hard classification and hard-to-easy adversarial learning. For both phases, we use the proposed bootstrapped self-training loss (Eq. (3)). We detail each phase below.

4.1 1\(^\mathrm{st}\) Phase: Voting Based Densification

As mentioned above, pseudo labels are generated only when a sample's prediction is confident (Eq. (2)). Specifically, the most confident p portion of predictions is selected class-wise. Because the hyperparameter p is set to a low value in practice, pseudo labels are inherently sparse during training. To overcome this issue, we present sliding-window based voting, which relaxes the current hard thresholding and propagates confident predictions based on the intrinsic spatial correlations in the image. We exploit the fact that neighboring pixels tend to be alike. To efficiently employ this local spatial regularity, we adopt a sliding-window approach. We detail the process in Fig. 2. Given a window with an unlabeled pixel at its center, we gather the neighboring confident prediction values (voting). To be more specific, for the unlabeled pixel, we first obtain the top two competing classes (i.e., the classes with the highest and second-highest prediction values, which would have caused ambiguity in deciding the correct label) (Fig. 2, step 1), and then pool the neighboring confident values for these classes (Fig. 2, step 2). The spatially-pooled prediction values are then combined with the original prediction values via a weighted sum (Fig. 2, step 3). Among the two resulting values, we choose the bigger one. Finally, if it is above the threshold, we select the corresponding class as the pseudo label. Note that we use normalized prediction values (i.e., \(\frac{p(k|\mathbf{x}_{t};\mathbf{w})}{\lambda _{k}}\)) during the voting process, so the thresholding criterion is \(\frac{p(k|\mathbf{x}_{t};\mathbf{w})}{\lambda _{k}} > 1\). Otherwise, the pixel remains a zero vector.

Fig. 2. The overall procedure of the voting-based densification, in three steps: 1) we find the top two competing classes for the unlabeled pixel, 2) we pool the neighboring confident values for these classes, 3) we combine the original prediction values and the pooled values (a weighted sum with hyperparameter \(\alpha \)). We pick the bigger one and assign the corresponding class if it passes the thresholding criterion. We repeat this process by sliding the window across the image.

We call this whole process voting-based densification, abbreviated as \(\mathbf{Voting}\). We iterate it a total of 3 times with a window size of \(57\times 57\); these hyperparameters are set through the parameter analysis (see Table 4b). Qualitative voting results are shown in Fig. 3. We can clearly see that the initially sparse pseudo label gradually becomes dense. The pseudo-label generation in the 1st phase can be summarized as:

$$\begin{aligned} \hat{y}^{(k)*}_{t}={\left\{ \begin{array}{ll} 1, &{} \text {if }\,k = \mathop {\mathrm {arg\,max}}\limits _{k}\big \{\frac{p(k|\mathbf{x}_{t};\mathbf{w})}{\lambda _{k}}\big \} \text { and } p(k|\mathbf{x}_{t};\mathbf{w}) > \lambda _{k} \\ \mathbf{Voting}\big (\frac{p(k|\mathbf{x}_{t};\mathbf{w})}{\lambda _{k}}\big ), &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(4)
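The \(\mathbf{Voting}\) operator of Eq. (4) can be sketched as below. This is a simplified, unvectorized sketch under stated assumptions: we pool the mean of already-confident (normalized value > 1) neighbors, and the per-pixel Python loop is for clarity only; a practical implementation would vectorize the window pooling.

```python
import numpy as np

def voting_densify(normed, labels, win=57, alpha=0.7, n_iter=3):
    """Eq. (4)'s Voting(.) on one image: for each unlabeled pixel, fuse its
    own normalized prediction with pooled confident neighbors for the top-2
    competing classes, and accept the best class if the fused value > 1.

    normed: (K, H, W) lambda-normalized predictions p(k|x;w)/lambda_k
    labels: (H, W) pseudo labels, -1 where unlabeled
    """
    K, H, W = normed.shape
    r = win // 2
    for _ in range(n_iter):
        new_labels = labels.copy()
        for y, x in zip(*np.where(labels < 0)):
            top2 = np.argsort(normed[:, y, x])[-2:]       # competing classes
            y0, y1 = max(0, y - r), min(H, y + r + 1)
            x0, x1 = max(0, x - r), min(W, x + r + 1)
            best_val, best_k = 0.0, -1
            for k in top2:
                patch = normed[k, y0:y1, x0:x1]
                confident = patch[patch > 1.0]            # confident neighbors
                pooled = confident.mean() if confident.size else 0.0
                fused = alpha * normed[k, y, x] + (1 - alpha) * pooled
                if fused > best_val:
                    best_val, best_k = fused, k
            if best_val > 1.0:                            # threshold criterion
                new_labels[y, x] = best_k
        labels = new_labels
    return labels
```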

Objective Function for the \(\mathbf{1}^\mathbf{st}\) Phase. To effectively train the model in the presence of noisy pseudo labels, we introduce bootstrapping (Eq. (3)) into our final objective function. The original self-training objective (Eq. (1)) can thus be re-formulated as:

$$\begin{aligned} \mathcal {L}_{st_{1}} =&-\sum _{s \in S}\sum _{k=1}^{K} y_{s}^{(k)} \log p(k|\mathbf{x}_{s};\mathbf{w}) \\&- \sum _{t \in T}\sum _{k=1}^{K} \Big [\beta \hat{y}_{t}^{(k)*} + (1-\beta )\frac{p(k|\mathbf{x}_{t};\mathbf{w})}{\lambda _{k}}\Big ] \log \frac{p(k|\mathbf{x}_{t};\mathbf{w})}{\lambda _{k}} + \alpha \, r_{c}(\mathbf{w}, \hat{\mathbf{Y}}_{T}) \end{aligned}$$
(5)

As a result, the target domain training benefits from both densified pseudo label and bootstrapped training.

Fig. 3. Voting-based densification results by iteration. The initially sparse pseudo label becomes denser as the iteration number increases, though this may also introduce some noisy predictions. We set the total iteration number to 3 after conducting the parameter analysis in Table 4.

4.2 \({2}^\mathrm{nd}\) Phase: Easy-Hard Classification Based Densification

As the model's predictions can be trusted more over time, we now attempt to use full pseudo labels. One may attempt to apply voting multiple times for full densification. However, the experimental evidence in Table 4b shows that voting alone can hardly generate fully densified pseudo labels. By construction, voting operates within a local window, which can only capture and process local predictions. Thus, iterating the voting process many times brings a certain smoothing effect and noisy predictions. We therefore present another phase that enables full pseudo-label training. Our key idea is to consider confidence at the image level and classify the images into two groups: easy and hard. For the easy, confident samples, we utilize their full predictions, while for the hard samples, we instead enforce hard-to-easy adaptation. Indeed, we observe that the easy samples are close to the ground truth and vice versa (see Fig. 4).

To reasonably categorize target samples into easy and hard, we present effective criteria. For a particular image t, we define a confidence score as \({conf}_{\mathrm {t}} = \frac{1}{K'}\sum _{k=1}^{K'} \frac{N_{\mathrm {t}}^{k*}}{N_{\mathrm {t}}^{k}} \cdot \frac{1}{\lambda _{k}}\), where \(N_{\mathrm {t}}^{k}\) is the total number of pixels predicted as class k. Among these \(N_{\mathrm {t}}^{k}\) pixels, we count those whose prediction values exceed the class-wise threshold \(\lambda _{k}\) [39] and denote this count \(N_{\mathrm {t}}^{k*}\). The ratio \(\frac{N_{\mathrm {t}}^{k*}}{N_{\mathrm {t}}^{k}}\) thus indicates how confidently the model predicts each class k. We average these values over \(K'\), the total number of (predicted) confident classes. The higher the score, the more confident the model is on that target image (i.e., the easier it is). Note that we multiply by \(\frac{1}{\lambda _{k}}\) to avoid sampling overly easy images and instead encourage sampling images with rare classes. We compute these confidence scores for every target image. In practice, we pick the top q portion as easy samples and treat the rest as hard samples during training. We initially set q to 30% and add 5% in each round.
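The confidence score \({conf}_{\mathrm {t}}\) can be computed per image as in the sketch below. This is a NumPy rendering for illustration; interpreting \(K'\) as the set of classes actually predicted in the image is an assumption.

```python
import numpy as np

def confidence_score(prob_map, lambdas):
    """conf_t = (1/K') * sum_k (N_t^{k*} / N_t^k) * (1 / lambda_k), over the
    K' classes predicted in this image (our reading of K').

    prob_map: (K, H, W) softmax output for one target image
    lambdas:  (K,) class-wise thresholds
    """
    preds = prob_map.argmax(axis=0)          # (H, W) predicted classes
    max_probs = prob_map.max(axis=0)         # (H, W) their probabilities
    score, n_classes = 0.0, 0
    for k in np.unique(preds):
        mask = preds == k
        n_k = mask.sum()                                  # N_t^k
        n_k_conf = (max_probs[mask] > lambdas[k]).sum()   # N_t^{k*}
        score += (n_k_conf / n_k) / lambdas[k]
        n_classes += 1
    return score / max(n_classes, 1)

# Easy-hard split: take the top q portion of images by conf_t as 'easy'.
# scores = [confidence_score(pm, lambdas) for pm in prob_maps]
# easy_idx = np.argsort(scores)[::-1][: int(0.3 * len(scores))]
```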

Fig. 4. Qualitative easy and hard samples. For illustration, we randomly selected three samples from each group. Note that easy samples are close to the ground truth with low entropy values, whereas hard samples are far from the ground truth with high entropy values. Therefore, in the second phase, we train easy samples with their full pseudo labels and push hard samples to be like easy ones using an adversarial loss.

Objective Function for the \(\mathbf{2}^\mathbf{nd}\) Phase. After classifying target images into easy and hard samples, we apply a different objective function to each. For the easy samples, we utilize full pseudo-label predictions and employ the bootstrapping loss for training (Eq. (3)). For the hard samples, we instead adopt adversarial learning to push hard samples to be like easy ones (i.e., feature alignment). We describe the details below.

Easy Sample Training. To effectively generate full pseudo labels, we calibrate the prediction values. Specifically, the full pseudo-label generation for easy samples is formulated as:

$$\begin{aligned} \hat{y}^{(k)*}_{t_{e}}={\left\{ \begin{array}{ll} 1, &{} \text {if }\,k = \mathop {\mathrm {arg\,max}}\limits _{k}\big \{\frac{p(k|\mathbf{x}_{t};\mathbf{w})}{\lambda _{k}}\big \} \text { and } p(k|\mathbf{x}_{t};\mathbf{w}) > \lambda _{k} \\ \big (\frac{p(k|\mathbf{x}_{t};\mathbf{w})}{\lambda _{k}}\big )^{\gamma }, &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(6)

Note that the prediction value is calibrated with the hyperparameter \(\gamma \), which is set to 2 empirically (see Table 4e). We then train the model using the following bootstrapping loss:

$$\begin{aligned} \mathcal {L}_{st_{2}} = -\sum _{t_{e} \in T_{e}}\sum _{k=1}^{K} \Big [\beta \hat{y}_{t_{e}}^{(k)*} + (1-\beta )\frac{p(k|\mathbf{x}_{t_{e}};\mathbf{w})}{\lambda _{k}}\Big ] \log \frac{p(k|\mathbf{x}_{t_{e}};\mathbf{w})}{\lambda _{k}} \end{aligned}$$
(7)

where \(T_{e}\) denotes the set of easy target samples.
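A sketch of the full pseudo-label generation of Eq. (6) for an easy sample follows: confident pixels keep their one-hot label, and the remaining pixels receive the \(\gamma \)-calibrated soft label \((p/\lambda )^{\gamma }\) instead of the zero vector. The shapes and the safety clipping are illustrative assumptions.

```python
import numpy as np

def full_pseudo_labels_easy(normed, gamma=2.0):
    """Eq. (6): one-hot where the winning class passes its threshold,
    (p/lambda)^gamma soft labels elsewhere.

    normed:  (K, H, W) lambda-normalized predictions p(k|x;w)/lambda_k
    returns: (K, H, W) full (dense) pseudo-label map
    """
    K, _, _ = normed.shape
    k_star = normed.argmax(axis=0)                            # (H, W)
    confident = np.take_along_axis(normed, k_star[None], 0)[0] > 1.0
    hard = np.eye(K)[k_star].transpose(2, 0, 1)               # one-hot (K, H, W)
    soft = np.clip(normed, 0.0, 1.0) ** gamma                 # calibrated labels
    return np.where(confident[None], hard, soft)
```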

Hard Sample Training. To minimize the gap between easy (e) and hard (h) samples in the target domain, we propose an intra-domain adversarial loss, \(\mathcal {L}_{adv}\). To align features from hard to easy, the discriminator \(D_{intra}\) is trained to discriminate whether the target weighted self-information map \(I_{t}\) [37] comes from easy or hard samples. The learning objective of the discriminator is:

$$\begin{aligned} \min _{\theta _{D_{intra}}}\frac{1}{\left| e\right| }\sum _{e}L_{D_{intra}}(I_{e}, 1) + \frac{1}{\left| h\right| }\sum _{h}L_{D_{intra}}(I_{h}, 0) \end{aligned}$$
(8)

and the adversarial objective to train the segmentation network is:

$$\begin{aligned} \min _{\theta _{seg}}\frac{1}{\left| h\right| }\sum _{h}L_{D_{intra}}(I_{h}, 1) \end{aligned}$$
(9)
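The intra-domain adversarial objectives of Eqs. (8) and (9) could be trained as sketched below. The self-information map follows ADVENT [37]; the discriminator architecture, the binary cross-entropy form of \(L_{D_{intra}}\), and the update ordering are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def self_information(logits):
    """Weighted self-information map I = -p * log(p), as in ADVENT [37]."""
    p = F.softmax(logits, dim=1)
    return -p * torch.log(p.clamp(min=1e-8))

def intra_domain_adv_step(seg_net, disc, x_easy, x_hard, opt_seg, opt_d):
    """One step of Eqs. (8)-(9): D_intra separates easy (1) from hard (0)
    maps; the segmenter is updated so that hard maps look easy."""
    bce = F.binary_cross_entropy_with_logits

    # Eq. (9): adversarial update of the segmentation network on hard samples
    d_hard = disc(self_information(seg_net(x_hard)))
    loss_adv = bce(d_hard, torch.ones_like(d_hard))
    opt_seg.zero_grad(); loss_adv.backward(); opt_seg.step()

    # Eq. (8): discriminator update (maps detached from the segmenter)
    with torch.no_grad():
        i_easy = self_information(seg_net(x_easy))
        i_hard = self_information(seg_net(x_hard))
    d_easy, d_hard = disc(i_easy), disc(i_hard)
    loss_d = bce(d_easy, torch.ones_like(d_easy)) \
           + bce(d_hard, torch.zeros_like(d_hard))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
```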

5 Experiments

5.1 Dataset

We evaluate our model on the two most common adaptation benchmarks: 1) GTA5 [31] \(\rightarrow \) Cityscapes [6] and 2) SYNTHIA [32] \(\rightarrow \) Cityscapes. GTA5 and SYNTHIA contain 24,966 and 9,400 synthetic images, respectively. Following the standard protocols, we adapt the model to the Cityscapes training set and evaluate its performance on the validation set.

5.2 Implementation Details

To push the state-of-the-art benchmark performances, we apply TPLD to the CRST-MRKLD framework [40]. For the backbones, we use VGG-16 [35] and ResNet-101 [15]. For the segmentation models, we adopt two versions of Deeplab: Deeplab-v2 [2] and Deeplab-v3 [3]. We pretrain the model on ImageNet [7] and fine-tune it on source-domain images using SGD. We train the model for a total of 9 rounds: 6 rounds of first-phase training and 3 rounds of second-phase training. The detailed training settings are as follows. For the source-domain pre-training, we use a learning rate of \(2.5\times 10^{-4}\), weight decay of \(5\times 10^{-4}\), momentum of 0.9, batch size of 2, patch size of \(512\times 1024\), multi-scale training augmentation (0.5–1.5), and horizontal flipping. For the self-training, we adopt SGD with a learning rate of \(5\times 10^{-5}\).
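For reference, the stated optimization settings translate to roughly the following sketch. The placeholder model and the momentum/weight-decay values reused for the self-training optimizer are assumptions, not settings confirmed by the paper.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 19, 1)  # placeholder for the Deeplab segmentation net

# Source-domain pre-training (values from Sect. 5.2)
pretrain_opt = torch.optim.SGD(model.parameters(), lr=2.5e-4,
                               momentum=0.9, weight_decay=5e-4)
# Self-training rounds (momentum/weight decay assumed unchanged)
selftrain_opt = torch.optim.SGD(model.parameters(), lr=5e-5,
                                momentum=0.9, weight_decay=5e-4)

ROUNDS_PHASE1, ROUNDS_PHASE2 = 6, 3       # 9 self-training rounds in total
BATCH_SIZE, PATCH_SIZE = 2, (512, 1024)   # scale jitter 0.5-1.5, h-flip
```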

5.3 Main Results

GTA5 \(\rightarrow \) Cityscapes: Table 1 summarizes the adaptation performance of TPLD and other state-of-the-art methods [25, 36, 37, 39, 40]. TPLD outperforms the state-of-the-art approaches in all cases. For example, with Deeplab-v2 and the ResNet-101 backbone, TPLD significantly outperforms CRST, by 4.2%. Moreover, to analyze the effect on rare classes, we also report the rare-class mIoU (R-mIoU), where the improvement is even larger: 4.8%. We provide qualitative results in Fig. 5; our final model clearly generates the most visually pleasing results.

SYNTHIA \(\rightarrow \) Cityscapes: Table 2 shows the adaptation results on SYNTHIA. Our approach again achieves the best performance among all methods. Specifically, with Deeplab-v3 and the ResNet-101 backbone, we greatly improve the baseline performance from 48.1% mIoU to 55.7% mIoU.

Table 1. Experimental results on GTA5 \(\rightarrow \) Cityscapes. “V” and “R” denote VGG-16 and ResNet-101, respectively. We highlight the rare classes [25] and also report the rare-class mIoU (R-mIoU).
Table 2. Experimental results on SYNTHIA \(\rightarrow \) Cityscapes. mIoU\(*\) is computed over 13 of the 16 classes, excluding those marked with \(*\).

Combining with Existing Self-training Methods. The proposed TPLD is general and can thus be easily applied to existing self-training based methods. In this experiment, we combine TPLD with three different self-training approaches: CBST [39], CRST with label regularization (LRENT) [40], and CRST with model regularization (MRKLD) [40]. The results are summarized in Table 3. We observe that TPLD consistently improves the performance of all the baselines. These positive results imply that sparse pseudo labels are indeed a fundamental problem in self-training, one that previous works notably overlooked, and that the proposed two-phase pseudo-label densification effectively addresses the issue.

Table 3. Performance improvements in mIoU of integrating our TPLD with existing self-training adaptation approaches. We use the Deeplabv2-R segmentation model.
Fig. 5. Qualitative results on GTA5 \(\rightarrow \) Cityscapes. Our full model clearly generates the most visually pleasing results.

5.4 Ablation Study

Lowering the Selection Threshold of CRST. A straightforward way to generate dense pseudo labels is to lower the selection threshold (i.e., increase p) of self-training models. We summarize the results in Table 4a. Since this scheme brings in unconfident predictions at an early stage, it yields either limited improvement (\(p=0.4\), 47.0 \(\rightarrow \) 47.1 mIoU) or worse performance (\(p=0.6\), 47.0 \(\rightarrow \) 45.7 mIoU). Compared to these naive baselines, our TPLD shows a significant improvement (47.0 \(\rightarrow \) 51.2 mIoU).

Framework Design Choices. The main components of our framework are the two pseudo-label densification phases. The ablation results are shown in Table 4a. If we drop the voting stage, the model is trained with the easy-hard classification stage alone. However, using full pseudo labels without proper early-stage training introduces overly noisy training signals (51.2 \(\rightarrow \) 38.1 mIoU). If we drop the easy-hard classification stage, the model misses the chance to receive rich training signals from the full pseudo labels (51.2 \(\rightarrow \) 49.5 mIoU). We also explore the effect of ordering: the voting-first method performs better than the easy-hard-classification-first method (51.2 vs. 49.1 mIoU). This implies that gradual densification is indeed important for stable model training.

Effect of \(\frac{1}{\lambda _{k}}\) in the Confidence Score \(conf_{\mathrm {t}}\). We multiply by \(\frac{1}{\lambda _{k}}\) when computing the confidence score \(conf_{\mathrm {t}}\). The rationale is to oversample images that include rare classes and thus prevent the learning from being biased toward images composed of obvious, frequent classes. The results without and with \(\frac{1}{\lambda _{k}}\) are 50.5 vs. 51.2 mIoU and 33.7 vs. 35.1 R-mIoU, demonstrating the efficacy of incorporating \(\frac{1}{\lambda _{k}}\).

Table 4. Results of ablation studies.
Table 5. Detailed analysis of the proposed objective functions. We note the corresponding equation for each proposal. Adv. denotes the adversarial loss term for hard sample training.
Fig. 6. A contrastive analysis with and without hard sample training (Eq. (8) + Eq. (9)). (a): target image, (b): ground truth, (c): prediction without hard sample training, (d): prediction with hard sample training. We map the high-dimensional features of (c) and (d) to the 2-D spaces of (e) and (f), respectively, using t-SNE [26].

5.5 Parameter Analysis

Here, we conduct experiments to decide the optimal hyperparameters of our framework. For the first phase, we have three hyperparameters: the voting field size, the voting iteration number, and \(\alpha \). In Table 4b, we conduct a grid search over the first two and obtain the best result with a voting field of 57 and a voting number of 3. The hyperparameter \(\alpha \) controls how much of the initial prediction value to maintain, and we observe that 0.7 produces the best result (see Table 4c). These results are in line with residual learning [16]: providing residual features (i.e., pooled neighboring confident prediction values) while securing the initial behavior (i.e., the initial prediction values) is important. For the second phase, we have two hyperparameters: q and \(\gamma \). The hyperparameter q controls the 'easy' portion of the target images; for example, increasing it causes more images to be used as easy samples during training. We observe that setting q to 0.3 provides the best result (see Table 4d). Note that if we set q to 1 (i.e., train all target images with full pseudo labels), we instead obtain degraded performance. This implies that a proper proportion of easy and hard samples needs to be set, and that both full pseudo-label training and hard-to-easy feature alignment are important. The hyperparameter \(\gamma \) controls the degree of calibration of the prediction values when generating full pseudo labels (see Eq. (6)). We obtain the best result when \(\gamma \) equals 2.

5.6 Loss Function Analysis

Finally, we explore the impact of the loss functions in Table 5. We begin with the standard self-training loss, \(\mathcal {L}_{st}\). Introducing the bootstrapping mechanism boosts the performance significantly, from 47.00 to 48.47 mIoU. This implies that explicitly handling noisy pseudo labels is crucial but missing from the original formulation. Using voting to densify the sparse pseudo labels further pushes the performance from 48.47 to 49.52 mIoU; the densified pseudo labels aid model learning through the increased training signals and are complementary to the bootstrapping effect. In the second phase, we investigate the impact of both easy sample training (EH Cls.) and hard sample training (Adv.). The easy sample training pushes the performance from 49.52 to 50.11 mIoU, and the hard sample training further increases it from 50.11 to 51.20 mIoU. The results demonstrate that full pseudo-label training is indeed important and that the hard-to-easy feature alignment further enhances model learning. For the hard sample training in particular, we conduct a contrastive analysis in Fig. 6. We observe that hard sample training improves category-level feature alignment (Fig. 6 (e) \(\rightarrow \) Fig. 6 (f)), and thus the predictions become more accurate and clean (Fig. 6 (c) \(\rightarrow \) Fig. 6 (d)).

6 Conclusions

In this paper, we pointed out that self-training methods for UDA suffer from sparse pseudo labels during training. We therefore presented a novel two-phase pseudo-label densification method, TPLD. Combined with the recently proposed CRST framework, we achieve new state-of-the-art results on standard UDA benchmarks.