1 Introduction

Unsupervised domain adaptation (UDA) aims to transfer knowledge learned from a label-rich source domain to an unlabeled target domain. It is a practical and crucial problem, as it can benefit various label-scarce real-world scenarios, e.g., simulated learning for robots [11] or autonomous driving [31]. In this paper, we focus on UDA for semantic segmentation, aiming to adapt a source segmentation model to a target domain without any labels.

The dominant paradigm in UDA is based on adversarial learning [5, 17, 19, 25, 36, 37]. In particular, it minimizes both a (source-domain) task-specific loss and a domain adversarial loss. The method thus retains good performance on the source-domain task and, at the same time, bridges the gap between the source and target feature distributions. While adversarial learning has achieved great success in UDA, another line of studies using self-training has recently emerged [39, 40]. Self-training generates a set of pseudo labels corresponding to high prediction scores in the target domain and then re-trains the network on the generated pseudo labels. Zou et al. have proposed two seminal works on CNN-based self-training: class-balanced self-training (CBST) [39] and confidence-regularized self-training (CRST) [40]. Unlike adversarial learning methods, which utilize two separate losses, CBST presents a single unified self-training loss. It allows learning domain-invariant features and classifiers in an end-to-end manner, both from labeled source data and pseudo-labeled target data. CRST further generalizes the feasible space of pseudo labels and adopts a regularizer. These self-training methods show state-of-the-art results in multiple UDA settings. However, we observe that their internal pseudo-label selection tends to excessively cut out predictions, which often leads to sparse pseudo labels. We argue that sparse pseudo labels miss significant, meaningful training signals, and thus the final model may eventually deviate from the optimal solution. A natural way to obtain dense pseudo labels is to lower the selection threshold. However, we observe that this naive approach brings in noisy, unconfident predictions at an early stage, which accumulates and propagates errors.

To effectively address this issue, we present a two-step, gradual pseudo label densification method. The overview is shown in Fig. 1. In the first phase, we use sliding window voting to propagate the confident predictions, utilizing the intrinsic spatial correlations in the images. In the second phase, we perform an easy-hard classification using a proposed image-level confidence score. Our intuition is simple: As the model improves over time, its predictions can be trusted more. Thus, if the model in the second stage is confident with their prediction, we now do not zero out them. Indeed, we empirically observe that the confident, easy samples are near to the ground truth and vice versa. This motivates us to utilize full pseudo labels for the easy samples, while for the hard samples, we enforce adversarial loss to learn hard-to-easy adaption. Meanwhile, to tackle noisy labels effectively for both first and second phase training, we introduce the bootstrapping mechanism into the self-training loss function. By connecting all together, we build a two-phase pseudo label densification framework called TPLD. Since our method is general, we can easily apply TPLD to the existing self-training based approaches. We show consistent improvements over the strong baselines. Finally, we achieve new state-of-the-art performances on two standard UDA benchmarks.

We summarize our contributions as follows:

  1. To the best of our knowledge, this is the first time that pseudo-label densification has been formally defined and explored in self-training based domain adaptation.

  2. We present a novel two-phase pseudo-label densification framework, called TPLD. For the first phase, we introduce a voting-based densification method. For the second phase, we propose an easy-hard classification based densification method. The two phases are complementary in constructing an accurate self-training model.

  3. We propose a new objective function to ease training. Specifically, we re-formulate the original self-training loss by incorporating a bootstrapping mechanism.

  4. We conduct extensive ablation studies to thoroughly investigate the impact of our proposals. We apply TPLD to various existing self-training approaches and achieve new state-of-the-art results on two standard UDA benchmarks.

2 Related Works

Domain Adaptation is a classic problem in computer vision and machine learning. It aims to alleviate the performance drop caused by the distribution mismatch across domains. It has mostly been investigated in image classification, by both conventional methods [8, 12, 13, 20, 22] and deep CNN-based methods [9, 10, 21, 24, 27, 29, 33]. Beyond image recognition, domain adaptation has recently been applied to other vision tasks such as object detection [4], depth estimation [1], and semantic segmentation [17]. In this work, we are particularly interested in unsupervised domain adaptation for semantic segmentation. The primary approach is to minimize the discrepancy between source and target feature distributions using adversarial learning. In practice, these approaches operate at three different levels: input-level alignment [5, 17, 28, 34], intermediate feature-level alignment [18, 19, 23, 25, 37], and output-level alignment [36]. Although these methods are proven to be effective, the potentially meaningful training signals from the target domain are under-utilized. Therefore, self-training based UDA approaches [39, 40], described next, emerged recently and quickly came to dominate the benchmarks.

Self-training was initially explored in semi-supervised learning [14, 38]. Recently, two seminal works [39, 40] have been presented for UDA semantic segmentation. Unlike adversarial learning approaches, these methods explicitly exploit supervision signals from the target domain. The key idea is to use the predictions of the source-trained model as pseudo labels for the unlabeled data and to re-train the current model in the target domain. CBST [39] extends this basic idea with a class-balancing strategy and spatial priors. CRST [40] further adds a regularization term to the loss function to prevent overconfident predictions. In this paper, we also investigate the self-training framework. However, different from previous studies, we see the sparse pseudo-label problem as a fundamental limitation of self-training. We empirically found that these sparse pseudo labels inhibit effective learning, so the model significantly deviates from the optimum. We therefore propose to gradually densify the sparse pseudo labels in two steps. We also present a new loss function that handles noisy pseudo labels and reduces optimization difficulties during training. We empirically confirm that our proposals improve the strong state-of-the-art baselines by healthy margins.

3 Preliminaries

3.1 Problem Setting

Following the common UDA setting, we have full access to the data and labels, \((\mathbf{x}_{s}, \mathbf{y}_{s})\), in the labeled source domain. In contrast, in the unlabeled target domain, we can only utilize the data, \(\mathbf{x}_{t}\). In self-training, we thus train the network to infer pseudo target labels, \(\hat{\mathbf{y}}_{t} = (\hat{y}_{t}^{(1)}, \ldots, \hat{y}_{t}^{(K)})\), where K denotes the total number of classes.

3.2 Self-training for UDA

We first revisit the general self-training loss function [40] below:

$$\begin{aligned} \min _{\mathbf{w}, \hat{\mathbf{Y}}_{T}} \mathcal {L}_{st} =&-\sum _{s \in S}\sum _{k=1}^{K} y_{s}^{(k)} \log p(k|\mathbf{x}_{s};\mathbf{w}) - \sum _{t \in T}\sum _{k=1}^{K} \hat{y}_{t}^{(k)} \log \frac{p(k|\mathbf{x}_{t};\mathbf{w})}{\lambda _{k}} + \alpha \, r_{c}(\mathbf{w}, \hat{\mathbf{Y}}_{T}) \\&\text {s.t. } \hat{\mathbf{y}}_{t} \in \Delta ^{K-1} \cup \{\mathbf{0}\}, \; \forall t \end{aligned}$$
(1)

\(\mathbf{x}_{s}\) denotes an image in the source domain indexed by \(s = 1, 2, \ldots, S\), and \(\mathbf{x}_{t}\) is an image in the target domain indexed by \(t = 1, 2, \ldots, T\). \(y_{s}^{(k)}\) is the ground-truth source label for class k, and \(\hat{y}_{t}^{(k)}\) is the generated pseudo target label. Note that the feasible set of a pseudo label is the union of \(\{\mathbf{0}\}\) and a probability simplex \(\Delta ^{K-1}\) (i.e., continuous). \(\mathbf{w}\) denotes the network weights, and \(p(k|\mathbf{x};\mathbf{w})\) indicates the classifier's softmax probability for class k. \(\lambda _{k}\) is a parameter controlling pseudo-label selection [39]. \(r_{c}(\mathbf{w}, \hat{\mathbf{Y}}_{T})\) is the confidence regularizer, and \(\alpha \ge 0\) is its weight coefficient.

We can better understand Eq. (1) by dividing it into three terms: the first term is model training on the source domain with source labels, \(y_{s}\). The second term is model re-training on the target domain with generated target pseudo labels, \(\hat{y}_{t}\). The last term is the confidence regularization, \(\alpha \, r_{c}(\mathbf{w}, \hat{\mathbf{Y}}_{T})\), which prevents over-confident predictions on target pseudo labels. The first two terms are equivalent to the CBST formula [39]; with the additional confidence regularization term, we arrive at the CRST formula [40]. In general, there are two types of regularization: label regularization (e.g., LRENT) and model regularization (e.g., MRKLD).

To minimize Eq. (1), the optimization algorithm alternately takes block coordinate descent steps on 1) pseudo-label generation and 2) network re-training. For step 1), the solver is formulated as:

$$\begin{aligned} \hat{y}_{t}^{(k)*}={\left\{ \begin{array}{ll} 1, &{} \text {if }\,k = \mathop {\mathrm {arg\,max}}\limits _{k}\big \{\frac{p(k|\mathbf{x}_{t};\mathbf{w})}{\lambda _{k}}\big \} \text { and } p(k|\mathbf{x}_{t};\mathbf{w}) > \lambda _{k} \\ 0, &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(2)

If a prediction is confident, i.e., \(p(k|\mathbf{x}_{t};\mathbf{w}) > \lambda _{k}\), it is selected and labeled as class \(k^{*} = \mathop {\mathrm {arg\,max}}\limits _{k}\{\frac{p(k|\mathbf{x}_{t};\mathbf{w})}{\lambda _{k}}\}\). Otherwise, the less confident prediction is set to the zero vector \(\mathbf{0}\). For each class k, we determine \(\lambda _{k}\) as the confidence value of the most confident p portion of class-k predictions over the entire target set [39]. To avoid selecting unconfident predictions at an early stage, the hyperparameter p is usually set to a low value (i.e., 0.2) and gradually increased in each additional round. To solve step 2), we use typical gradient-based methods (e.g., SGD). For more details, please refer to the original papers [39, 40].
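For concreteness, the class-wise threshold selection and the pseudo-label generation of Eq. (2) can be sketched as below. This is a minimal NumPy sketch rather than the original implementation: the array shapes, the function names, and the handling of empty classes are illustrative assumptions.

```python
import numpy as np

def determine_thresholds(max_probs, preds, num_classes, p=0.2):
    """Set lambda_k to the confidence of the p-th most confident class-k
    prediction over the entire target set (CBST-style selection [39])."""
    lambdas = np.ones(num_classes)
    for k in range(num_classes):
        conf_k = np.sort(max_probs[preds == k])[::-1]  # descending confidences
        n_keep = int(len(conf_k) * p)
        if n_keep > 0:
            lambdas[k] = conf_k[n_keep - 1]
    return lambdas

def generate_pseudo_labels(prob_map, lambdas):
    """Eq. (2): label a pixel with k* = argmax_k p(k|x;w)/lambda_k if
    p(k*|x;w) > lambda_{k*}; otherwise leave it unlabeled (-1 here stands
    in for the all-zero pseudo-label vector).

    prob_map: (K, H, W) softmax output for one target image
    """
    normed = prob_map / lambdas[:, None, None]          # p(k|x;w)/lambda_k
    k_star = normed.argmax(axis=0)                      # (H, W)
    conf = np.take_along_axis(prob_map, k_star[None], axis=0)[0]
    return np.where(conf > lambdas[k_star], k_star, -1)
```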

We see that the current self-training approach simply zeroes out less confident predictions and in turn generates sparse pseudo labels. We argue that this limits the representational power of the model and can produce a sub-optimal model. Motivated by our empirical observations, we attempt to densify the sparse pseudo labels gradually while avoiding noisy predictions. In this work, we propose TPLD, which successfully alleviates these fundamental issues. We show that TPLD can be applied to any type of existing self-training based framework and consistently boosts performance significantly.

3.3 Noisy Label Handling

To handle noisy predictions, Reed et al. [30] proposed the bootstrapping loss, a weighted sum of the standard cross-entropy loss and a (self-)entropy loss. In this work, we apply it to the self-training formula as:

$$\begin{aligned} -\sum _{t \in T}\sum _{k=1}^{K} \Big [\beta \hat{y}_{t}^{(k)} + (1-\beta )\frac{p(k|\mathbf{x}_{t};\mathbf{w})}{\lambda _{k}}\Big ]\log \frac{p(k|\mathbf{x}_{t};\mathbf{w})}{\lambda _{k}} \end{aligned}$$
(3)

Intuitively, it simultaneously encourages the model to predict the correct (pseudo) target label and to be highly confident in its prediction.
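A PyTorch sketch of how the bootstrapped target term of Eq. (3) could be computed is given below. The masking convention (-1 for unlabeled pixels), the default \(\beta\), and the choice to detach the soft target are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def bootstrapped_target_loss(logits, pseudo_labels, lambdas, beta=0.95):
    """Eq. (3): mix the hard pseudo label with the model's own
    lambda-normalized prediction, weighted by beta, on labeled pixels.

    logits:        (B, K, H, W) target-domain outputs
    pseudo_labels: (B, H, W) long tensor, -1 where no pseudo label exists
    lambdas:       (K,) tensor of class-wise thresholds
    """
    probs = F.softmax(logits, dim=1)
    normed = probs / lambdas.view(1, -1, 1, 1)            # p(k|x;w)/lambda_k
    log_normed = torch.log(normed.clamp(min=1e-8))
    one_hot = F.one_hot(pseudo_labels.clamp(min=0),
                        num_classes=logits.size(1)).permute(0, 3, 1, 2).float()
    # beta * y_hat + (1 - beta) * p/lambda; target detached (our assumption)
    target = beta * one_hot + (1.0 - beta) * normed.detach()
    pixel_loss = -(target * log_normed).sum(dim=1)        # (B, H, W)
    return pixel_loss[pseudo_labels >= 0].mean()
```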

Fig. 1. The overview of the proposed two-phase pseudo-label densification framework. (a) The first phase utilizes sliding-window based voting, which propagates neighboring confident predictions to fill in the unlabeled pixels. We use \(\mathcal {L}_{st_{1}}\) to train the model in the first phase. (b) The second phase employs confidence-based easy-hard classification (EH class.) along with hard-to-easy adversarial learning. This allows the model to utilize full pseudo labels for easy samples while pushing hard samples to be like easy ones. We use both \(\mathcal {L}_{st_{2}}\) and \(\mathcal {L}_{adv}\) to train the model in the second phase.

4 Method

The overview of our two-phase pseudo-label densification algorithm is shown in Fig. 1. In the first phase, we design a sliding-window based voting method to propagate confident predictions. After sufficient training, we enter the second phase, where we present confidence-based easy-hard classification and hard-to-easy adversarial learning. For both phases, we use the proposed bootstrapped self-training loss (Eq. (3)). We detail each phase below.

4.1 1\(^\mathrm{st}\) Phase: Voting Based Densification

As mentioned above, pseudo labels are generated only when a sample's prediction is confident (Eq. (2)). Specifically, the most confident p portion of predictions is selected class-wise. Because the hyperparameter p is set to a low value in practice, pseudo labels are inherently sparse during training. To overcome this issue, we present sliding-window based voting, which relaxes the current hard thresholding and propagates confident predictions based on the intrinsic spatial correlations in the image. We exploit the fact that neighboring pixels tend to be alike. To efficiently employ this local spatial regularity, we adopt a sliding-window approach. We detail the process in Fig. 2. Given a window with an unlabeled pixel at its center, we gather the neighboring confident prediction values (voting). To be more specific, for the unlabeled pixel, we first obtain the top two competing classes (i.e., the classes with the highest and second-highest prediction values, which would have caused ambiguity in deciding the correct label) (Fig. 2, step 1), and then pool the neighboring confident values for these classes (Fig. 2, step 2). The spatially-pooled prediction values are then combined with the original prediction values via a weighted sum (Fig. 2, step 3). Among the two resulting values, we choose the bigger one. Finally, if it is above the threshold, we select the corresponding class as the pseudo label. Note that we use normalized prediction values (i.e., \(\frac{p(k|\mathbf{x}_{t};\mathbf{w})}{\lambda _{k}}\)) during the voting process, so the thresholding criterion is \(\frac{p(k|\mathbf{x}_{t};\mathbf{w})}{\lambda _{k}} > 1\). Otherwise, the pixel remains a zero vector.

Fig. 2. The overall procedure of the voting-based densification, in three steps: 1) we find the top two competing classes for the unlabeled pixel, 2) we pool the neighboring confident values for these classes, 3) we combine the original prediction values and the pooled values (a weighted sum with hyperparameter \(\alpha \)). We pick the bigger one and assign the corresponding class if it passes the thresholding criterion. We repeat this process by sliding the window across the image.

We call this whole process voting-based densification, abbreviated as \(\mathbf{Voting}\). We iterate it a total of 3 times with a window size of \(57\times 57\); these hyperparameters are set through the parameter analysis (see Table 4b). Qualitative voting results are shown in Fig. 3. We can clearly see that the initially sparse pseudo label gradually becomes dense. The pseudo-label generation in the 1st phase can be summarized as:

$$\begin{aligned} \hat{y}^{(k)*}_{t}={\left\{ \begin{array}{ll} 1, &{} \text {if }\,k = \mathop {\mathrm {arg\,max}}\limits _{k}\big \{\frac{p(k|\mathbf{x}_{t};\mathbf{w})}{\lambda _{k}}\big \} \text { and } p(k|\mathbf{x}_{t};\mathbf{w}) > \lambda _{k} \\ \mathbf{Voting}\big (\frac{p(k|\mathbf{x}_{t};\mathbf{w})}{\lambda _{k}}\big ), &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(4)
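The \(\mathbf{Voting}\) operator of Eq. (4) can be sketched as below. This is a simplified, unvectorized sketch under stated assumptions: we pool the mean of already-confident (normalized value > 1) neighbors, and the per-pixel Python loop is for clarity only; a practical implementation would vectorize the window pooling.

```python
import numpy as np

def voting_densify(normed, labels, win=57, alpha=0.7, n_iter=3):
    """Eq. (4)'s Voting(.) on one image: for each unlabeled pixel, fuse its
    own normalized prediction with pooled confident neighbors for the top-2
    competing classes, and accept the best class if the fused value > 1.

    normed: (K, H, W) lambda-normalized predictions p(k|x;w)/lambda_k
    labels: (H, W) pseudo labels, -1 where unlabeled
    """
    K, H, W = normed.shape
    r = win // 2
    for _ in range(n_iter):
        new_labels = labels.copy()
        for y, x in zip(*np.where(labels < 0)):
            top2 = np.argsort(normed[:, y, x])[-2:]       # competing classes
            y0, y1 = max(0, y - r), min(H, y + r + 1)
            x0, x1 = max(0, x - r), min(W, x + r + 1)
            best_val, best_k = 0.0, -1
            for k in top2:
                patch = normed[k, y0:y1, x0:x1]
                confident = patch[patch > 1.0]            # confident neighbors
                pooled = confident.mean() if confident.size else 0.0
                fused = alpha * normed[k, y, x] + (1 - alpha) * pooled
                if fused > best_val:
                    best_val, best_k = fused, k
            if best_val > 1.0:                            # threshold criterion
                new_labels[y, x] = best_k
        labels = new_labels
    return labels
```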

Objective Function for the \(\mathbf{1}^\mathbf{st}\) Phase. To effectively train the model in the presence of noisy pseudo labels, we introduce bootstrapping (Eq. (3)) into our final objective function. The original self-training objective (Eq. (1)) can thus be re-formulated as:

$$\begin{aligned} \mathcal {L}_{st_{1}} =&-\sum _{s \in S}\sum _{k=1}^{K} y_{s}^{(k)} \log p(k|\mathbf{x}_{s};\mathbf{w}) \\&- \sum _{t \in T}\sum _{k=1}^{K} \Big [\beta \hat{y}_{t}^{(k)*} + (1-\beta )\frac{p(k|\mathbf{x}_{t};\mathbf{w})}{\lambda _{k}}\Big ] \log \frac{p(k|\mathbf{x}_{t};\mathbf{w})}{\lambda _{k}} + \alpha \, r_{c}(\mathbf{w}, \hat{\mathbf{Y}}_{T}) \end{aligned}$$
(5)

As a result, the target domain training benefits from both densified pseudo label and bootstrapped training.

Fig. 3. Voting-based densification results by iteration. The initially sparse pseudo label becomes denser as the iteration number increases, though this may also introduce some noisy predictions. We set the total iteration number to 3 after conducting the parameter analysis in Table 4.

4.2 \({2}^\mathrm{nd}\) Phase: Easy-Hard Classification Based Densification

As the model's predictions can be trusted more over time, we now attempt to use full pseudo labels. One may attempt to apply voting multiple times for full densification. However, the experimental evidence in Table 4b shows that voting alone can hardly generate fully densified pseudo labels. By construction, voting operates within a local window, which can only capture and process local predictions. Thus, iterating the voting process many times brings a certain smoothing effect and noisy predictions. We therefore present another phase that enables full pseudo-label training. Our key idea is to consider confidence at the image level and classify the images into two groups: easy and hard. For the easy, confident samples, we utilize their full predictions, while for the hard samples, we instead enforce hard-to-easy adaptation. Indeed, we observe that the easy samples are close to the ground truth and vice versa (see Fig. 4).

To reasonably categorize target samples into easy and hard, we present effective criteria. For a particular image t, we define a confidence score as \({conf}_{\mathrm {t}} = \frac{1}{K'}\sum _{k=1}^{K'} \frac{N_{\mathrm {t}}^{k*}}{N_{\mathrm {t}}^{k}} \cdot \frac{1}{\lambda _{k}}\), where \(N_{\mathrm {t}}^{k}\) is the total number of pixels predicted as class k. Among these \(N_{\mathrm {t}}^{k}\) pixels, we count those whose prediction values exceed the class-wise threshold \(\lambda _{k}\) [39] and denote this count \(N_{\mathrm {t}}^{k*}\). The ratio \(\frac{N_{\mathrm {t}}^{k*}}{N_{\mathrm {t}}^{k}}\) thus indicates how confidently the model predicts each class k. We average these values over \(K'\), the total number of (predicted) confident classes. The higher the score, the more confident the model is on that target image (i.e., the easier it is). Note that we multiply by \(\frac{1}{\lambda _{k}}\) to avoid sampling overly easy images and instead encourage sampling images with rare classes. We compute these confidence scores for every target image. In practice, we pick the top q portion as easy samples and treat the rest as hard samples during training. We initially set q to 30% and add 5% in each round.
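The confidence score \({conf}_{\mathrm {t}}\) can be computed per image as in the sketch below. This is a NumPy rendering for illustration; interpreting \(K'\) as the set of classes actually predicted in the image is an assumption.

```python
import numpy as np

def confidence_score(prob_map, lambdas):
    """conf_t = (1/K') * sum_k (N_t^{k*} / N_t^k) * (1 / lambda_k), over the
    K' classes predicted in this image (our reading of K').

    prob_map: (K, H, W) softmax output for one target image
    lambdas:  (K,) class-wise thresholds
    """
    preds = prob_map.argmax(axis=0)          # (H, W) predicted classes
    max_probs = prob_map.max(axis=0)         # (H, W) their probabilities
    score, n_classes = 0.0, 0
    for k in np.unique(preds):
        mask = preds == k
        n_k = mask.sum()                                  # N_t^k
        n_k_conf = (max_probs[mask] > lambdas[k]).sum()   # N_t^{k*}
        score += (n_k_conf / n_k) / lambdas[k]
        n_classes += 1
    return score / max(n_classes, 1)

# Easy-hard split: take the top q portion of images by conf_t as 'easy'.
# scores = [confidence_score(pm, lambdas) for pm in prob_maps]
# easy_idx = np.argsort(scores)[::-1][: int(0.3 * len(scores))]
```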

Fig. 4. Qualitative easy and hard samples. For illustration, we randomly selected three samples from each group. Note that easy samples are close to the ground truth with low entropy values, whereas hard samples are far from the ground truth with high entropy values. Therefore, in the second phase, we train easy samples with their full pseudo labels and push hard samples to be like easy ones using an adversarial loss.

Objective Function for the \(\mathbf{2}^\mathbf{nd}\) Phase. After classifying target images into easy and hard samples, we apply a different objective function to each. For the easy samples, we utilize full pseudo-label predictions and employ the bootstrapping loss for training (Eq. (3)). For the hard samples, we instead adopt adversarial learning to push hard samples to be like easy ones (i.e., feature alignment). We describe the details below.

Easy Sample Training. To effectively generate full pseudo labels, we calibrate the prediction values. Specifically, the full pseudo-label generation for easy samples is formulated as:

$$\begin{aligned} \hat{y}^{(k)*}_{t_{e}}={\left\{ \begin{array}{ll} 1, &{} \text {if }\,k = \mathop {\mathrm {arg\,max}}\limits _{k}\big \{\frac{p(k|\mathbf{x}_{t};\mathbf{w})}{\lambda _{k}}\big \} \text { and } p(k|\mathbf{x}_{t};\mathbf{w}) > \lambda _{k} \\ \big (\frac{p(k|\mathbf{x}_{t};\mathbf{w})}{\lambda _{k}}\big )^{\gamma }, &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(6)

Note that the prediction value is calibrated with the hyperparameter \(\gamma \), which is set to 2 empirically (see Table 4e). We then train the model using the following bootstrapping loss:

$$\begin{aligned} \mathcal {L}_{st_{2}} = -\sum _{t_{e} \in T_{e}}\sum _{k=1}^{K} \Big [\beta \hat{y}_{t_{e}}^{(k)*} + (1-\beta )\frac{p(k|\mathbf{x}_{t_{e}};\mathbf{w})}{\lambda _{k}}\Big ] \log \frac{p(k|\mathbf{x}_{t_{e}};\mathbf{w})}{\lambda _{k}} \end{aligned}$$
(7)

where \(T_{e}\) denotes the set of easy target samples.
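A sketch of the full pseudo-label generation of Eq. (6) for an easy sample follows: confident pixels keep their one-hot label, and the remaining pixels receive the \(\gamma \)-calibrated soft label \((p/\lambda )^{\gamma }\) instead of the zero vector. The shapes and the safety clipping are illustrative assumptions.

```python
import numpy as np

def full_pseudo_labels_easy(normed, gamma=2.0):
    """Eq. (6): one-hot where the winning class passes its threshold,
    (p/lambda)^gamma soft labels elsewhere.

    normed:  (K, H, W) lambda-normalized predictions p(k|x;w)/lambda_k
    returns: (K, H, W) full (dense) pseudo-label map
    """
    K, _, _ = normed.shape
    k_star = normed.argmax(axis=0)                            # (H, W)
    confident = np.take_along_axis(normed, k_star[None], 0)[0] > 1.0
    hard = np.eye(K)[k_star].transpose(2, 0, 1)               # one-hot (K, H, W)
    soft = np.clip(normed, 0.0, 1.0) ** gamma                 # calibrated labels
    return np.where(confident[None], hard, soft)
```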

Hard Sample Training. To minimize the gap between easy (e) and hard (h) samples in the target domain, we propose an intra-domain adversarial loss, \(\mathcal {L}_{adv}\). To align features from hard to easy, the discriminator \(D_{intra}\) is trained to discriminate whether the target weighted self-information map \(I_{t}\) [37] comes from easy or hard samples. The learning objective of the discriminator is:

$$\begin{aligned} \min _{\theta _{D_{intra}}}\frac{1}{\left| e\right| }\sum _{e}L_{D_{intra}}(I_{e}, 1) + \frac{1}{\left| h\right| }\sum _{h}L_{D_{intra}}(I_{h}, 0) \end{aligned}$$
(8)

and the adversarial objective to train the segmentation network is:

$$\begin{aligned} \min _{\theta _{seg}}\frac{1}{\left| h\right| }\sum _{h}L_{D_{intra}}(I_{h}, 1) \end{aligned}$$
(9)
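The intra-domain adversarial objectives of Eqs. (8) and (9) could be trained as sketched below. The self-information map follows ADVENT [37]; the discriminator architecture, the binary cross-entropy form of \(L_{D_{intra}}\), and the update ordering are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def self_information(logits):
    """Weighted self-information map I = -p * log(p), as in ADVENT [37]."""
    p = F.softmax(logits, dim=1)
    return -p * torch.log(p.clamp(min=1e-8))

def intra_domain_adv_step(seg_net, disc, x_easy, x_hard, opt_seg, opt_d):
    """One step of Eqs. (8)-(9): D_intra separates easy (1) from hard (0)
    maps; the segmenter is updated so that hard maps look easy."""
    bce = F.binary_cross_entropy_with_logits

    # Eq. (9): adversarial update of the segmentation network on hard samples
    d_hard = disc(self_information(seg_net(x_hard)))
    loss_adv = bce(d_hard, torch.ones_like(d_hard))
    opt_seg.zero_grad(); loss_adv.backward(); opt_seg.step()

    # Eq. (8): discriminator update (maps detached from the segmenter)
    with torch.no_grad():
        i_easy = self_information(seg_net(x_easy))
        i_hard = self_information(seg_net(x_hard))
    d_easy, d_hard = disc(i_easy), disc(i_hard)
    loss_d = bce(d_easy, torch.ones_like(d_easy)) \
           + bce(d_hard, torch.zeros_like(d_hard))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
```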

5 Experiments

5.1 Dataset

We evaluate our model on the two most common adaptation benchmarks: 1) GTA5 [31] \(\rightarrow \) Cityscapes [6] and 2) SYNTHIA [32] \(\rightarrow \) Cityscapes. GTA5 and SYNTHIA contain 24,966 and 9,400 synthetic images, respectively. Following the standard protocols, we adapt the model to the Cityscapes training set and evaluate its performance on the validation set.

5.2 Implementation Details

To push the state-of-the-art benchmark performances, we apply TPLD to the CRST-MRKLD framework [40]. For the backbones, we use VGG-16 [35] and ResNet-101 [15]. For the segmentation models, we adopt two versions of Deeplab: Deeplab-v2 [2] and Deeplab-v3 [3]. We pretrain the model on ImageNet [7] and fine-tune it on source-domain images using SGD. We train the model for a total of 9 rounds: 6 rounds of first-phase training and 3 rounds of second-phase training. The detailed training settings are as follows. For the source-domain pre-training, we use a learning rate of \(2.5\times 10^{-4}\), weight decay of \(5\times 10^{-4}\), momentum of 0.9, batch size of 2, patch size of \(512\times 1024\), multi-scale training augmentation (0.5–1.5), and horizontal flipping. For the self-training, we adopt SGD with a learning rate of \(5\times 10^{-5}\).
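For reference, the stated optimization settings translate to roughly the following sketch. The placeholder model and the momentum/weight-decay values reused for the self-training optimizer are assumptions, not settings confirmed by the paper.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 19, 1)  # placeholder for the Deeplab segmentation net

# Source-domain pre-training (values from Sect. 5.2)
pretrain_opt = torch.optim.SGD(model.parameters(), lr=2.5e-4,
                               momentum=0.9, weight_decay=5e-4)
# Self-training rounds (momentum/weight decay assumed unchanged)
selftrain_opt = torch.optim.SGD(model.parameters(), lr=5e-5,
                                momentum=0.9, weight_decay=5e-4)

ROUNDS_PHASE1, ROUNDS_PHASE2 = 6, 3       # 9 self-training rounds in total
BATCH_SIZE, PATCH_SIZE = 2, (512, 1024)   # scale jitter 0.5-1.5, h-flip
```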

5.3 Main Results

GTA5 \(\rightarrow \) Cityscapes: Table 1 summarizes the adaptation performance of TPLD and other state-of-the-art methods [25, 36, 37, 39, 40]. TPLD outperforms the state-of-the-art approaches in all cases. For example, with Deeplab-v2 and the ResNet-101 backbone, TPLD significantly outperforms CRST, by 4.2%. Moreover, to analyze the effect on rare classes, we also report the rare-class mIoU (R-mIoU), where the improvement is even larger: 4.8%. We provide qualitative results in Fig. 5; our final model clearly generates the most visually pleasing results.

SYNTHIA \(\rightarrow \) Cityscapes: Table 2 shows the adaptation results on SYNTHIA. Our approach again achieves the best performance among all methods. Specifically, with Deeplab-v3 and the ResNet-101 backbone, we greatly improve the baseline performance from 48.1% mIoU to 55.7% mIoU.

Table 1. Experimental results on GTA5 \(\rightarrow \) Cityscapes. “V” and “R” denote VGG-16 and ResNet-101, respectively. We highlight the rare classes [25] and also report the rare-class mIoU (R-mIoU).
Table 2. Experimental results on SYNTHIA \(\rightarrow \) Cityscapes. mIoU\(*\) is computed over 13 of the 16 classes, excluding those marked with \(*\).

Combining with Existing Self-training Methods. The proposed TPLD is general and can thus be easily applied to existing self-training based methods. In this experiment, we combine TPLD with three different self-training approaches: CBST [39], CRST with label regularization (LRENT) [40], and CRST with model regularization (MRKLD) [40]. The results are summarized in Table 3. We observe that TPLD consistently improves the performance of all the baselines. These positive results imply that sparse pseudo labels are indeed a fundamental problem in self-training, one that previous works notably overlooked, and that the proposed two-phase pseudo-label densification effectively addresses the issue.

Table 3. Performance improvements in mIoU of integrating our TPLD with existing self-training adaptation approaches. We use the Deeplabv2-R segmentation model.
Fig. 5. Qualitative results on GTA5 \(\rightarrow \) Cityscapes. Our full model clearly generates the most visually pleasing results.

5.4 Ablation Study

Lowering the Selection Threshold of CRST. A straightforward way to generate dense pseudo labels is to lower the selection threshold (i.e., increase p) of self-training models. We summarize the results in Table 4a. Since this scheme brings in unconfident predictions at an early stage, it yields either limited improvement (\(p=0.4\), 47.0 \(\rightarrow \) 47.1 mIoU) or worse performance (\(p=0.6\), 47.0 \(\rightarrow \) 45.7 mIoU). Compared to these naive baselines, our TPLD shows a significant improvement (47.0 \(\rightarrow \) 51.2 mIoU).

Framework Design Choices. The main components of our framework are the two pseudo-label densification phases. The ablation results are shown in Table 4a. If we drop the voting stage, the model is trained with the easy-hard classification stage alone. However, using full pseudo labels without proper early-stage training introduces overly noisy training signals (51.2 \(\rightarrow \) 38.1 mIoU). If we drop the easy-hard classification stage, the model misses the chance to receive rich training signals from the full pseudo labels (51.2 \(\rightarrow \) 49.5 mIoU). We also explore the effect of ordering: the voting-first method performs better than the easy-hard-classification-first method (51.2 vs. 49.1 mIoU). This implies that gradual densification is indeed important for stable model training.

Effect of \(\frac{1}{\lambda _{k}}\) in the Confidence Score \(conf_{\mathrm {t}}\). We multiply by \(\frac{1}{\lambda _{k}}\) when computing the confidence score \(conf_{\mathrm {t}}\). The rationale is to oversample images that include rare classes and thus prevent the learning from being biased toward images composed of obvious, frequent classes. The results without and with \(\frac{1}{\lambda _{k}}\) are 50.5 vs. 51.2 mIoU and 33.7 vs. 35.1 R-mIoU, demonstrating the efficacy of incorporating \(\frac{1}{\lambda _{k}}\).

Table 4. Results of ablation studies.
Table 5. Detailed analysis of the proposed objective functions. We note the corresponding equation for each proposal. Adv. denotes the adversarial loss term for hard sample training.
Fig. 6. A contrastive analysis with and without hard sample training (Eq. (8) + Eq. (9)). (a): target image, (b): ground truth, (c): prediction without hard sample training, (d): prediction with hard sample training. We map the high-dimensional features of (c) and (d) to the 2-D spaces of (e) and (f), respectively, using t-SNE [26].

5.5 Parameter Analysis

Here, we conduct experiments to decide the optimal hyperparameters of our framework. For the first phase, we have three hyperparameters: the voting field size, the voting iteration number, and \(\alpha \). In Table 4b, we conduct a grid search over the first two and obtain the best result with a voting field of 57 and a voting number of 3. The hyperparameter \(\alpha \) controls how much of the initial prediction value to maintain, and we observe that 0.7 produces the best result (see Table 4c). These results are in line with residual learning [16]: providing residual features (i.e., pooled neighboring confident prediction values) while securing the initial behavior (i.e., the initial prediction values) is important. For the second phase, we have two hyperparameters: q and \(\gamma \). The hyperparameter q controls the 'easy' portion of the target images; for example, increasing it causes more images to be used as easy samples during training. We observe that setting q to 0.3 provides the best result (see Table 4d). Note that if we set q to 1 (i.e., train all target images with full pseudo labels), we instead obtain degraded performance. This implies that a proper proportion of easy and hard samples needs to be set, and that both full pseudo-label training and hard-to-easy feature alignment are important. The hyperparameter \(\gamma \) controls the degree of calibration of the prediction values when generating full pseudo labels (see Eq. (6)). We obtain the best result when \(\gamma \) equals 2.

5.6 Loss Function Analysis

Finally, we explore the impact of the loss functions in Table 5. We begin with the standard self-training loss, \(\mathcal {L}_{st}\). Introducing the bootstrapping mechanism boosts the performance significantly, from 47.00 to 48.47 mIoU. This implies that explicitly handling noisy pseudo labels is crucial but missing from the original formulation. Using voting to densify the sparse pseudo labels further pushes the performance from 48.47 to 49.52 mIoU; the densified pseudo labels aid model learning through the increased training signals and are complementary to the bootstrapping effect. In the second phase, we investigate the impact of both easy sample training (EH Cls.) and hard sample training (Adv.). The easy sample training pushes the performance from 49.52 to 50.11 mIoU, and the hard sample training further increases it from 50.11 to 51.20 mIoU. The results demonstrate that full pseudo-label training is indeed important and that the hard-to-easy feature alignment further enhances model learning. For the hard sample training in particular, we conduct a contrastive analysis in Fig. 6. We observe that hard sample training improves category-level feature alignment (Fig. 6 (e) \(\rightarrow \) Fig. 6 (f)), and thus the predictions become more accurate and clean (Fig. 6 (c) \(\rightarrow \) Fig. 6 (d)).

6 Conclusions

In this paper, we pointed out that self-training methods for UDA suffer from sparse pseudo labels during training. We therefore presented a novel two-phase pseudo-label densification method, TPLD. Combined with the recently proposed CRST framework, we achieve new state-of-the-art results on standard UDA benchmarks.