1 Introduction

Semantic segmentation is one of the most challenging tasks in visual understanding. Compared with simpler problems such as image classification and object detection, it requires a deeper understanding of visual content: a label must be assigned to every pixel according to its semantic content, which is why it is also called a dense label classification task. Semantic segmentation is a pervasive research field, and the academic community has proposed many methods to address it.

In addition, with the development of computer graphics, generating large amounts of automatically labeled simulation data has become a feasible solution for many vision-related tasks. Using computer graphics to simulate a virtual environment and automatically generate images with segmentation annotations is an economical choice. For example, GTA5 [37] and SYNTHIA [38] are two popular urban street simulation datasets; they resemble real-world datasets such as CityScapes [5] and share classes with them. Although simulation images look similar to real images, there are still differences in texture, layout, color, and lighting conditions [13], which leads to domain shift between the two data distributions. As a result, a model trained on simulation data will fail when applied to a real scene. We therefore need to solve the domain shift between simulation data and real-world data in order to make full use of the labeled samples in the source (simulation) domain and the large number of unlabeled samples in the target (real) domain. Addressing this challenge requires unsupervised cross-domain transfer learning. Unlike cross-domain transfer learning for image classification, cross-domain transfer learning for unsupervised semantic segmentation has received less attention because of its difficulty.

One line of work learns transferable knowledge by addressing the visual domain shift between virtual simulation data and real-world data. Cross-domain transfer learning based on generative adversarial networks focuses on pixel-level transfer in the input space, i.e., image translation. This transfer method is not limited to specific downstream tasks and has become a powerful tool for cross-domain transfer learning. Among traditional unsupervised image-to-image translation methods, the cycle-consistency loss has emerged as the de facto gold standard [21, 47, 49]. The recent rise of contrastive learning has brought a powerful tool to self-supervised representation learning [3, 12, 34, 46]. CUT [35] demonstrates the effectiveness of contrastive learning in unsupervised image-to-image translation: it computes the contrastive loss in a multilayer, patch-based manner rather than on entire images, and samples positive and negative patches from a single image. To replace the cycle-consistency loss, it learns a cross-domain similarity function between input and output image patches that maximizes their mutual information and thus avoids mode collapse. As a result, CUT not only improves translation quality but also achieves efficient one-directional cross-domain translation.

In this paper, the contrastive loss of CUT [35] is used to replace the bidirectional cycle-consistency loss. At the same time, we reuse the feature extractor of the semantic segmentation network as the encoder of the image-to-image translation network. In addition, for fine-grained transfer, we reuse the classifier of the semantic segmentation network as the generator of the translation network and project the probability space of the segmentation output to RGB space. By maximizing the mutual information between the input image and the semantic projection map, we can extract features that are invariant to color and style. A discriminator applied to the semantic projection map guides the output of the target domain to transfer toward the source domain, so that the feature extractor learns domain-invariant features during unsupervised image-to-image translation. Specifically, we expect the semantic projection map of the target domain, produced by the feature extractor and classifier, to acquire some attributes of the source domain while retaining the content and structural attributes of the target-domain input image (i.e., the position and shape of objects).

Fig. 1. Cross-domain transfer based on mutual information maximization

Figure 1 illustrates our cross-domain transfer network for unsupervised semantic segmentation. In the figure, the image patch marked by the red box contains a car, so its semantic features can be regarded as different from those of the signs, roads, and leaves marked by the yellow boxes. Based on this assumption, we construct positive and negative samples and use the recently prevalent contrastive loss InfoNCE [34] to pull positive samples together and push negative samples apart in the feature space, thereby learning universal representations. By maximizing the mutual information between input and output (e.g., the representation similarity between the car in the input image and the car in the semantic projection map), the representations retain position and class information while ignoring attributes such as color and texture when distinguishing positive from negative samples. At the same time, through a generative adversarial network, the semantic projection map is transferred across domains, which further helps the feature extractor learn domain-invariant representations.

Finally, a remaining issue is how to perform adversarial training. One solution is output cross-domain transfer (i.e., the discriminator distinguishes the semantic projection map of the source domain from that of the target domain), with contrastive learning further extracting the common semantic features of the input and output spaces. Another solution is input-output cross-domain transfer (i.e., the discriminator distinguishes the input RGB image of the source domain from the semantic projection map of the target domain), so that the semantic projection space of the target domain is transferred toward the RGB color space of the source domain. The input-output scheme provides a reconstruction regularization term for the source-domain projection, but forcing this alignment may damage the capacity of contrastive learning to extract semantic features. We compare the two adversarial transfer schemes in the experimental section. After cross-domain adversarial training, we conduct self-supervised training for knowledge distillation to obtain the final unsupervised semantic segmentation model.

We summarize our contributions as follows:

  • To the best of our knowledge, we are the first to reuse the semantic segmentation network for image-to-image translation in unsupervised cross-domain transfer learning, based on mutual information maximization via contrastive learning. Through this reuse, the output space of semantic segmentation is projected to RGB space to construct positive and negative samples for contrastive learning, extracting features that are invariant to style yet sensitive to semantic information for semantic segmentation tasks.

  • This paper develops two adversarial transfer methods based on generative adversarial networks: output cross-domain transfer and input-output cross-domain transfer. Through contrastive learning, a connection between the input and output spaces is established to further extract domain-invariant representations in the feature space.

  • This paper designs a simple and effective three-stage training paradigm, including pre-training, cross-domain transfer, and self-supervised training. Extensive experimental evaluations on main benchmarks reveal that the proposed method outperforms various well-known counterparts. Comprehensive ablation studies and visual analyses are also conducted to further investigate and explain the underlying principles.

2 Related Work

Unsupervised Semantic Segmentation. The first work to introduce adversarial learning for semantic segmentation is FCNs in the wild [17], which aligns the global and local features of the two domains in feature space. Curriculum domain adaptation [48] estimates the global label distribution and the labels of superpixels and then learns a more refined semantic segmentation model. AdaptSegNet [41] uses multiple discriminators to perform multi-level cross-domain transfer of features at different levels. Another method transfers the foreground and background classes separately [45].

Another line of work learns transferable knowledge by addressing the visual domain shift between virtual simulation data and real-world data. Using cross-domain translated images as input, CyCADA [16] further aligns the feature distributions of the two domains in feature space. BDL [25] introduces a bidirectional learning framework in which the image-to-image translation and semantic segmentation models promote each other in a closed loop. In addition, much work is devoted to aligning other attributes between the two domains, such as entropy [43] and information [30].

Unsupervised Image-to-Image Translation. For unsupervised image-to-image translation with unpaired training data, CycleGAN [49], DiscoGAN [21], and DualGAN [47] preserve key attributes between the input and the translated image by using a cycle-consistency loss. Various extensions of CycleGAN have been proposed. The first enables multi-modal generation: MUNIT [18] and DRIT [24] decompose the latent space of images into a domain-invariant content space and a domain-specific style space to obtain diverse outputs. Another enhancement performs translation across multiple (more than two) domains simultaneously, such as StarGAN [4]. A further line of research focuses on translation between domains with larger differences: for example, CoupledGAN [28] and UNIT [27] use a shared latent space, and U-GAT-IT [20] resorts to attention modules for feature selection.

Contrastive Representation Learning. Recently, a family of methods based on contrastive learning has emerged to learn universal representations [3, 12, 15, 34, 46]. Contrastive losses measure the distance, or similarity, between representations in the latent space, which is one of the key differences between contrastive learning and other representation learning approaches [23]. CPC [34] first proves that minimizing this NCE-based loss is equivalent to maximizing a lower bound on the mutual information. SimCLR [3] further elaborates on its advantages over other losses. These methods use noise contrastive estimation [11] to learn an embedding in which associated samples are brought together, in contrast to other samples in the dataset. Associated samples can be an image with itself [12, 46], neighboring patches within an image [34], multiple views of the input image [40], or an image with a set of transformed versions of itself [3, 33]. Inspired by CUT [35], we are the first to use the InfoNCE loss for unsupervised semantic segmentation tasks.

3 Methods

This section presents the detailed formulation of our method. We first introduce the general idea and then follow it up by providing the details of each component. The three-stage training mechanism is specified as well.

3.1 General Formulation

Source and Target Domain Definition. We apply color-jitter data augmentation to the simulation images and call this augmented image domain the source domain \(\mathcal {X}\); the real-world image domain without data augmentation is the target domain \(\mathcal {Y}\). The semantic projection map of the source domain is denoted as \(x^l = f_{y\rightarrow x}(x) \in \mathcal {X}^l\), and that of the target domain as \(y^l = f_{y\rightarrow x}(y) \in \mathcal {Y}^l\).
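
As a concrete illustration, the source-domain color-jitter augmentation can be written with standard torchvision transforms. This is a minimal sketch; the jitter strengths below are hypothetical placeholders rather than values fixed by this paper.

```python
# Hedged sketch of the two data pipelines: the source domain gets color jitter,
# the target domain does not. Jitter strengths are illustrative placeholders.
import torchvision.transforms as T

source_transform = T.Compose([
    T.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.1),
    T.ToTensor(),
])
target_transform = T.ToTensor()  # real-world images are used without color jitter
```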

Unsupervised Image-to-Image Translation Definition. Unsupervised image-to-image translation is ill-posed: it learns \(f_{y\rightarrow x}\) with only the marginals \(p(\mathcal {X})\) and \(p(\mathcal {Y})\) provided, and infinitely many conditional distributions correspond to the same marginals. We follow CUT [35] and address this by maximizing the mutual information between input and output. We only need to learn the translation map in one direction, avoiding a reverse generator and discriminator, which yields a more compact and more effective architecture. Since the source-domain images are annotated, data augmentation can be performed during supervised semantic segmentation training to improve the generalization of the segmentation model. In this paper, we perform unsupervised image-to-image translation from the target domain \(\mathcal {Y}\) to the source domain \(\mathcal {X}\) and extract features that are invariant to style.

The image-to-image translation module \(f_{y\rightarrow x}\) is composed of an encoder \(E_y\) and a generator \(G_{y\rightarrow x}\). By composing the encoder and generator, we get \(y^l=f_{y\rightarrow x}(y)=G_{y\rightarrow x}(E_y(y))\). Adversarial training [10] is used to align the translated output with the source domain. Specifically, the multi-scale discriminator \(D_x\) discriminates between the source-domain semantic projection map \(x^l\) (or the source-domain input image x) and the target-domain semantic projection map \(y^l\), measuring the distance between the generated distribution and the source-domain distribution.

Unsupervised Semantic Segmentation Definition. Semantic segmentation aims to predict a unique label [36] for each pixel of the input image and minimize the cross-entropy \(H(\boldsymbol{l}, \boldsymbol{p})= \frac{1}{hw}\sum _{i=1}^{h}\sum _{j=1}^{w}\sum _{k=0}^{K-1}-l_{k}^{(i,j)} \log \left( p_{k}^{(i,j)}\right) \) between the ground-truth label \(l_{k}^{(i,j)} \in \mathcal {L}\) and the network output \(p_{k}^{(i,j)}\), where h and w are the height and width of the image, i and j index the pixel position, K is the total number of classes, and k indexes the class.

In the unsupervised semantic segmentation setting, we sample label-image pairs \((x,l)\in \mathcal {X} \times \mathcal {L}\) from the training set S of the source domain \(\mathcal {X}\) and images \(y \in \mathcal {Y}\) from the training set T of the target domain \(\mathcal {Y}\), where the source and target domains share K semantic classes \(k\in \{0, \cdots , K-1\}\). Regions belonging to non-shared classes are ignored during both training and testing. The training goal is to learn a semantic segmentation model that achieves the lowest prediction risk in the target domain. Generally, the semantic segmentation network can be divided into a feature extractor F and a classifier C. Traditional cross-domain transfer typically uses a binary domain discriminator \(D_x\) to perform a global transfer, but this coarse-grained transfer can easily lead to negative transfer. In this paper, by maximizing the mutual information between input and output, semantic information is indirectly taken into account for fine-grained transfer.

Other Definitions. As mentioned in the introduction, the model reuses the feature extractor F for encoding, reuses the classifier C for decoding, and uses a small two-layer \(1\times 1\) convolutional network as the projection network H to map the probability space to RGB space. Formally, the feature extractor F replaces the original encoder \(E_y\) in \(f_{y\rightarrow x}\), and the classifier C together with the projection network H replaces the original generator \(G_{y\rightarrow x}\), resulting in the new translation \(f_{y\rightarrow x}(y)=(H \circ C)(F(y))\). The cross-domain transfer process alternates between two steps that optimize \(f_{y\rightarrow x}\) and the multi-scale discriminator \(D_x\).
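
The composition \(f_{y\rightarrow x}=(H \circ C)\circ F\) can be sketched directly. The following minimal PyTorch-style module assumes F, C, and H are already built; the class name SegTranslator is our own and not from the paper.

```python
import torch.nn as nn

class SegTranslator(nn.Module):
    """Reuses the segmentation feature extractor F and classifier C, followed by
    the projection head H, as the translator f_{y->x} = (H o C) o F."""
    def __init__(self, F: nn.Module, C: nn.Module, H: nn.Module):
        super().__init__()
        self.F, self.C, self.H = F, C, H

    def forward(self, y):
        feat = self.F(y)      # shared features, also used by the segmentation branch
        prob = self.C(feat)   # K-channel segmentation output
        return self.H(prob)   # 3-channel semantic projection map in RGB space
```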

3.2 Architecture

Feature Extractor \(\boldsymbol{F}\). We use two backbone networks widely used in this field as the feature extractor F to demonstrate the generality of the method: ResNet-101 [14] and VGG-16 [39], with feature output dimensions of 2048 and 1024, respectively. Both are initialized from models pre-trained on ImageNet [7].

Classifier \({\boldsymbol{C}}\). We adopt the classifier C of Deeplab-V2 [1], which integrates multi-scale information through Atrous Spatial Pyramid Pooling (ASPP). The feature extractor F and classifier C used in the self-supervised training stage have the same network structure as in the pre-training and cross-domain transfer stages, but their weights are different.

Table 1. Projection network architecture

Projection Network \(\boldsymbol{H}\). Its composition and parameters are shown in Table 1, where h is the height, w is the width, CONV denotes a convolutional layer, N is the number of output channels, K in the CONV entries is the convolution kernel size, S is the stride, P is the padding size, K in the channel dimensions is the number of semantic classes, and LIN stands for layer-instance normalization.
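
Since Table 1 is not reproduced here, the sketch below only illustrates the idea of H as two \(1\times 1\) convolutions mapping the K-class probability space to RGB. The hidden width and the use of InstanceNorm2d as a stand-in for layer-instance normalization are assumptions of this sketch.

```python
import torch.nn as nn

# Hedged sketch of the projection head H: the hidden width (64) and the use of
# InstanceNorm2d (approximating LIN) are placeholders, not the paper's exact table.
def make_projection_head(num_classes: int, hidden: int = 64) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(num_classes, hidden, kernel_size=1, stride=1, padding=0),
        nn.InstanceNorm2d(hidden, affine=True),
        nn.ReLU(inplace=True),  # ReLU is the activation used in H (Sec. 4.2)
        nn.Conv2d(hidden, 3, kernel_size=1, stride=1, padding=0),
    )
```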

Multi-scale Discriminator \(\boldsymbol{D_x}\). Our multi-scale discriminator is inspired by previous work [6, 8, 19, 44], in which discriminators of different scales are applied to images of different sizes (small images obtained from the original by downsampling). In this paper, we consider a more effective approach that treats the feature maps at different layers of a single input as images of different scales and feeds each feature map into a classifier of the corresponding size, similar to the feature pyramid in object detection (e.g., SSD [29] and FPN [26]).

As mentioned above, the discriminator \(D_x\) contains two parts: the encoder \(E_x\) and the classifier \(C_x\). To achieve multi-scale processing, the classifier \(C_x\) is further divided into three sub-classifiers: \(C^0_{x}\) for the local scale (\(10\times 10\) receptive field), \(C^1_{x}\) for the medium scale (\(70\times 70\) receptive field), and \(C^2_{x}\) for the global scale (\(286\times 286\) receptive field). \(C^0_{x}\) is directly connected to the output of \(E_x\). Then, a down-sampling convolution layer is applied to the output of \(E_x\) to provide smaller-scale feature maps, which feed two branches: one is linked to \(C^1_{x}\), and the other is further down-sampled through convolution layers followed by \(C^2_{x}\). For a single input image, \(C^0_{x}\), \(C^1_{x}\), and \(C^2_{x}\) are all trained to predict whether the image is real or fake. In addition to the multi-scale design, we also design a residual attention mechanism to further promote the propagation of feature gradients in the discriminator [2].

Table 2. Discriminator network architecture

Leaky-ReLU with a negative slope of 0.2 is used in the discriminator, and spectral normalization is applied to all convolutional layers. Table 2 details the composition of the discriminator, where h is the height, w is the width, CONV denotes a convolutional layer, MLP a fully connected layer, N the number of output channels, K the kernel size, S the stride, P the padding size, and SN spectral normalization. In the residual attention module, the global average-pooled and max-pooled feature maps are concatenated, so the number of input channels of MLP-(N1) is 256.
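
Since Table 2 is not reproduced here, the following is only a structural sketch of \(D_x\): a shared encoder with three sub-classifiers attached at increasing depths, spectral normalization on every convolution, and Leaky-ReLU(0.2). Channel widths, kernel sizes, and the residual attention module are simplified placeholders and do not reproduce the exact 10/70/286 receptive fields.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

def sn_conv(in_c, out_c, k=4, s=2, p=1):
    # spectral-normalized convolution, as used throughout D_x
    return spectral_norm(nn.Conv2d(in_c, out_c, kernel_size=k, stride=s, padding=p))

class MultiScaleDiscriminator(nn.Module):
    """Shared encoder E_x with sub-classifiers C_x^0 (local), C_x^1 (medium),
    and C_x^2 (global) attached at increasing depths."""
    def __init__(self, in_channels=3, base=64):
        super().__init__()
        self.enc = nn.Sequential(sn_conv(in_channels, base), nn.LeakyReLU(0.2, True),
                                 sn_conv(base, base * 2), nn.LeakyReLU(0.2, True))
        self.head0 = sn_conv(base * 2, 1, k=1, s=1, p=0)   # local scale
        self.down1 = nn.Sequential(sn_conv(base * 2, base * 4), nn.LeakyReLU(0.2, True))
        self.head1 = sn_conv(base * 4, 1, k=1, s=1, p=0)   # medium scale
        self.down2 = nn.Sequential(sn_conv(base * 4, base * 8), nn.LeakyReLU(0.2, True),
                                   sn_conv(base * 8, base * 8), nn.LeakyReLU(0.2, True))
        self.head2 = sn_conv(base * 8, 1, k=1, s=1, p=0)   # global scale

    def forward(self, x):
        f0 = self.enc(x)
        f1 = self.down1(f0)
        f2 = self.down2(f1)
        return self.head0(f0), self.head1(f1), self.head2(f2)
```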

3.3 Three-Stage Training Paradigm

Fig. 2. Three-stage training paradigm

Figure 2 shows the three-stage training paradigm of pre-training, cross-domain transfer and self-supervised training.

Pre-training stage:

  • Cross-entropy loss: We load the ImageNet [7] pre-trained model and fine-tune it with the labeled source-domain data. The cross-entropy loss is calculated for each pair (x, l) to minimize the difference between the prediction and the ground-truth label, pre-training a semantic segmentation network \(C \circ F\) (a minimal code sketch of this loss follows the list):

    $$\begin{aligned} \min _{C \circ F}L_{seg} = \mathbb {E}_{x\sim \mathcal {X}} \frac{1}{hw}\sum _{i=1}^{h}\sum _{j=1}^{w}\sum _{k=0}^{K-1}-l_{k}^{(i,j)} \log \left( \operatorname{softmax}\left( C(F(x))\right) _{k}^{(i,j)}\right) , \end{aligned}$$
    (1)

    where h and w are the height and width of the image, i and j index the pixel position, K is the total number of classes, and k indexes the class.
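
A minimal sketch of Eq. (1) in PyTorch follows. It assumes logits of shape (B, K, h, w) from \(C(F(x))\), integer labels of shape (B, h, w), and an ignore index of 255 for the "ignore" class; the ignore value is an assumption.

```python
import torch
import torch.nn.functional as nnF

def segmentation_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # cross_entropy applies log-softmax internally and averages over valid pixels,
    # matching the per-pixel cross-entropy of Eq. (1).
    return nnF.cross_entropy(logits, labels, ignore_index=255)
```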

Cross-domain transfer stage:

  • Cross-entropy loss: In the source domain, the cross-entropy loss is calculated using the image-label pairs (x, l) to further fine-tune the feature extractor F and classifier C:

    $$\begin{aligned} \min _{C \circ F}L_{seg} = \mathbb {E}_{x\sim \mathcal {X}} \frac{1}{hw}\sum _{i=1}^{h}\sum _{j=1}^{w}\sum _{k=0}^{K-1}-l_{k}^{(i,j)} \log \left( \operatorname{softmax}\left( C(F(x))\right) _{k}^{(i,j)}\right) , \end{aligned}$$
    (2)

    where h and w are the height and width of the image, i and j index the pixel position, K is the total number of classes, and k indexes the class.

  • Adversarial loss: We use the least-squares adversarial loss [32] because it yields more stable training and better generation quality. The output cross-domain transfer objective is:

    $$\begin{aligned} \min _{F, H \circ C}\max _{D_x} L^{y\rightarrow x}_{gan} = \mathbb {E}_{x^l\sim \mathcal {X}^l}\left[ \left( D_x(x^l)\right) ^2\right] + \mathbb {E}_{y \sim \mathcal {Y}}\left[ \left( 1-D_x((H \circ C)(F(y)))\right) ^2\right] ; \end{aligned}$$
    (3)

    The input-output cross-domain transfer is as follows:

    $$\begin{aligned} \min _{F, H \circ C}\max _{D_x} L^{y\rightarrow x}_{gan} = \mathbb {E}_{x\sim \mathcal {X}}\left[ \left( D_x(x)\right) ^2\right] + \mathbb {E}_{y \sim \mathcal {Y}}\left[ \left( 1-D_x((H \circ C)(F(y)))\right) ^2\right] . \end{aligned}$$
    (4)
  • Mutual information maximization loss: We use InfoNCE, the prevalent loss of the noise contrastive estimation framework, to maximize the mutual information between input and output. Contrastive learning based on noise contrastive estimation pulls the anchor sample toward the positive samples in the feature space and pushes it away from the negative samples. We use the normalized version of InfoNCE with temperature factor \(\tau =0.07\) to scale the softmax output; this version is called NT-Xent in SimCLR [3] and adaptively adjusts the weights of negative examples. We sample one anchor sample, one positive sample, and N negative samples, which are mapped into hidden vectors \(\boldsymbol{v}\), \(\boldsymbol{v}^+\), and \(\boldsymbol{v}^-\) by different levels of the feature extractor, where \(\boldsymbol{v}_n^-\) denotes the n-th negative sample. The NT-Xent loss is:

    $$\begin{aligned} \min L^{y\rightarrow x}_{\text {NT-Xent}}(\boldsymbol{v}, \boldsymbol{v}^+, \boldsymbol{v}^-) = -\log { \Bigg [\frac{\exp (\boldsymbol{v}\cdot \boldsymbol{v}^+/\tau )}{\exp (\boldsymbol{v} \cdot \boldsymbol{v}^+/\tau ) + \sum _{n=1}^N \exp (\boldsymbol{v} \cdot \boldsymbol{v}^-_n/\tau )} \Bigg ] } \end{aligned}$$
    (5)
  • Multilayer-patchwise contrastive loss: We use a multilayer-patchwise contrastive learning strategy as in CUT [35]. Due to the large image size, we take a larger number of negative samples \(N=1024\) and use features from \(L=6\) layers to calculate the multilayer-patchwise contrastive loss. Each spatial position at each layer of the neural network represents a region of the input image, and positions in deeper layers correspond to larger regions. We select L layers of interest and map their features through the small two-layer fully-connected network \(\hat{H}\) used in SimCLR [3], generating the feature set of the target-domain image y, \(\{z_l\}_L=\{\hat{H}_l(F^l(y))\}_L\), where \(F^l(y)\) denotes the output of the l-th selected layer, \(l\in \{1, 2,...,L\}\). The sampled spatial positions of the l-th selected layer are denoted \(s\in \{1, ..., S_l\}\), where \(S_l\) is the number of spatial positions of that layer. The corresponding hidden vector is \(z_l^s \in \mathbb {R}^{C_l}\), and the other hidden vectors are \(z_l^{S \setminus s}\in \mathbb {R}^{(S_l-1)\times C_l}\), where \(C_l\) is the feature dimension. Similarly, the projection \(y^l\) of the target-domain image is encoded to obtain the feature set \(\{\hat{z}_l\}_L=\{\hat{H}_l(F^l((H \circ C)(F(y))))\}_L\). Our purpose is to compute a contrastive loss between corresponding image patches of the input and output images: the patch at the same spatial position serves as the positive sample, and the other patches in the image serve as negative samples. We then define the PatchNCE loss:

    $$\begin{aligned} \min _{F, H \circ C, \hat{H}} L^{y\rightarrow x}_{\text {PatchNCE}} = \mathbb {E}_{y\sim \mathcal {Y}} \sum _{l=1}^L \sum _{s=1}^{S_l} L^{y\rightarrow x}_{\text {NT-Xent}}(\hat{z}_l^s, z_l^s, z_l^{S\setminus s}), \end{aligned}$$
    (6)

    where gradients are computed only with respect to \(\hat{z}_l\) when updating the model parameters.

  • Full objective: The translated image \(y^l\) should be as close as possible to the input space or output space of the source domain \(\mathcal {X}\), while corresponding image patches before and after translation should retain as much mutual information as possible, as shown in Fig. 1. In addition, we apply the PatchNCE loss as a regularization constraint on the source domain \(\mathcal {X}\), guiding the feature extractor F and generator \((H \circ C)\) not to make unnecessary changes, in line with the reconstruction loss used in unsupervised cross-domain image-to-image translation. The full objective of the discriminator \(D_x\) is as follows (a code sketch of these losses follows this list):

    $$\begin{aligned} \max _{D_x} \lambda _1 L^{y\rightarrow x}_{gan}; \end{aligned}$$
    (7)

    The full objective of the feature extractor F, classifier C, projection H, and two-layer fully-connected network \(\hat{H}\) is as follows:

    $$\begin{aligned} \min _{F, C, H, \hat{H}} L_{seg}+ \lambda _4 \left( \lambda _1L^{y\rightarrow x}_{gan} + \lambda _2 L^{y\rightarrow x}_{\text {PatchNCE}} + \lambda _3 L^{x\rightarrow x}_{\text {PatchNCE}} \right) , \end{aligned}$$
    (8)

    where \(\lambda _1\), \(\lambda _2\), \(\lambda _3\), and \(\lambda _4\) are trade-off weights.
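
As referenced above, the following is a minimal sketch of the loss terms in Eqs. (3)-(8). It assumes the patch embeddings have already been sampled, treats the discriminator output as a single tensor (the multi-scale outputs would simply be summed), and uses the standard least-squares GAN convention (real toward 1, fake toward 0) as a stand-in for the labeling written in Eqs. (3)-(4); the weights follow \(\lambda _1=\lambda _2=\lambda _3=1\) and \(\lambda _4=10^{-3}\) from the setup section.

```python
import torch
import torch.nn.functional as nnF

def nt_xent(v, v_pos, v_neg, tau=0.07):
    """Eq. (5): v, v_pos of shape (B, C); v_neg of shape (B, N, C); all L2-normalized."""
    pos = (v * v_pos).sum(dim=1, keepdim=True) / tau            # (B, 1)
    neg = torch.bmm(v_neg, v.unsqueeze(2)).squeeze(2) / tau     # (B, N)
    logits = torch.cat([pos, neg], dim=1)
    target = torch.zeros(v.size(0), dtype=torch.long, device=v.device)
    return nnF.cross_entropy(logits, target)                    # -log softmax of the positive

def patch_nce(feats_in, feats_out, tau=0.07):
    """Eq. (6): lists over L layers of (S_l, C_l) patch embeddings sampled at the
    same spatial positions before (feats_in) and after (feats_out) translation."""
    loss = 0.0
    for z, z_hat in zip(feats_in, feats_out):
        z, z_hat = nnF.normalize(z, dim=1), nnF.normalize(z_hat, dim=1)
        logits = z_hat @ z.t() / tau                            # diagonal = positive pairs
        target = torch.arange(z.size(0), device=z.device)
        loss = loss + nnF.cross_entropy(logits, target)
    return loss

def lsgan_d(d_real, d_fake):
    """Discriminator side of the least-squares GAN loss."""
    return ((d_real - 1) ** 2).mean() + (d_fake ** 2).mean()

def lsgan_g(d_fake):
    """Generator side of the least-squares GAN loss."""
    return ((d_fake - 1) ** 2).mean()

# Generator-side objective of Eq. (8), with lambda_1 = lambda_2 = lambda_3 = 1
# and lambda_4 = 1e-3 (names such as seg_loss and d_out_fake are placeholders):
# total = seg_loss + 1e-3 * (lsgan_g(d_out_fake)
#                            + patch_nce(feats_y, feats_y_translated)
#                            + patch_nce(feats_x, feats_x_translated))
```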

Self-supervised training stage:

  • Cross-entropy loss: We load the ImageNet pre-trained model and use the image-pseudo-label pairs \((y,\hat{l})\) to calculate the cross-entropy loss on the target domain, minimizing the difference between the prediction and the pseudo-label and distilling a new semantic segmentation network \(C \circ F\) (a sketch of pseudo-label generation follows this list):

    $$\begin{aligned} \min _{C \circ F}L_{seg} = \mathbb {E}_{y\sim \mathcal {Y}} \frac{1}{hw}\sum _{i=1}^{h}\sum _{j=1}^{w}\sum _{k=0}^{K-1}-\hat{l}_{k}^{(i,j)} \log \left( \operatorname{softmax}\left( C(F(y))\right) _{k}^{(i,j)}\right) , \end{aligned}$$
    (9)

    where h and w are the height and width of the image, i and j index the pixel position, K is the total number of classes, and k indexes the class.
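
The pseudo-labeling rule is not specified in this section; a common confidence-thresholding scheme is sketched below as an assumption, with the threshold value (0.9) and ignore index (255) chosen purely for illustration.

```python
import torch

@torch.no_grad()
def make_pseudo_labels(model, image, threshold=0.9, ignore_index=255):
    # model(image) is assumed to return segmentation logits of shape (B, K, h, w)
    probs = torch.softmax(model(image), dim=1)
    conf, labels = probs.max(dim=1)            # per-pixel confidence and argmax class
    labels[conf < threshold] = ignore_index    # low-confidence pixels are ignored in Eq. (9)
    return labels
```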

4 Experiments

4.1 Dataset

GTA5: contains 24966 simulation images with an original resolution of \(1914 \times 1052\) pixels. During training, images are scaled to \(1280 \times 720\) pixels without random cropping, and color jitter is used for data augmentation in the cross-domain transfer stage. We use the 19 semantic classes shared with the real-world CityScapes dataset, mark all other classes as the "ignore" class, and ignore those regions during training.

Synthia-Rand-CityScapes: contains 9400 simulation images with an original resolution of \(1280 \times 760\) pixels. During training, images are scaled to \(1024 \times 512\) pixels without random cropping, and color jitter is used for data augmentation in the cross-domain transfer stage. Similarly, only the 16 semantic classes shared with the real-world CityScapes dataset are used for training; other classes are set to the "ignore" class, and those regions are ignored during training. For this dataset, two evaluation settings are commonly used: all 16 classes or a subset of 13 classes.

CityScapes: contains 2975 real images with an original resolution of \(2048 \times 1024\) pixels. Images are scaled to \(1024 \times 512\) pixels in the cross-domain transfer stage, without random cropping or color jitter. In the self-supervised training stage, images are scaled to between \(1024 \times 512\) and \(3072 \times 1536\) pixels, randomly cropped to \(1024 \times 512\) pixels, and augmented with color jitter and random flipping. The two cross-domain transfer tasks are GTA5 \(\rightarrow \) CityScapes and SYNTHIA \(\rightarrow \) CityScapes. Only the semantic classes shared with the simulation dataset are used for testing; other categories are set to the "ignore" class and ignored during testing. Following the standard protocol [16, 17, 42], we use the 2975 images of the training set as the unlabeled target-domain training set and evaluate the proposed cross-domain transfer model on the 500 images of the validation set.

4.2 Setup

For the feature extractor F and classifier C, we use the SGD optimizer (momentum 0.9, weight decay \(10^{-4}\)). The initial learning rate is \(2.5 \times 10^{-4}\) (multiplied by 10 for the classifier C), and it is decayed according to the "poly" schedule with an exponent of 0.9 (see the sketch below).
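
As referenced above, a minimal sketch of the "poly" schedule follows; max_iter and the organization of the optimizer's parameter groups are training-setup choices, not values fixed by this section.

```python
# "poly" learning-rate schedule with exponent 0.9; base LR 2.5e-4 for F and
# 10x that for the classifier C.
def poly_lr(base_lr: float, cur_iter: int, max_iter: int, power: float = 0.9) -> float:
    return base_lr * (1 - cur_iter / max_iter) ** power

# Hypothetical usage with a two-group optimizer (feature extractor first, classifier second):
# for group, scale in zip(optimizer.param_groups, (1.0, 10.0)):
#     group["lr"] = scale * poly_lr(2.5e-4, it, max_iter)
```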

The output of the classifier C is softened with a temperature factor \(\tau =1.8\). \(\lambda _4\) is set to \(10^{-3}\). ReLU is used as the activation function in the projection H and the two-layer fully-connected network \(\hat{H}\), and Leaky-ReLU with a negative slope of 0.2 is used in the discriminator \(D_x\). The Adam [22] optimizer is used with a learning rate of \(10^{-4}\) and \((\beta _1, \beta _2) = (0.5, 0.999)\). \(\lambda _1\), \(\lambda _2\), and \(\lambda _3\) are all set to 1.

When computing the mutual information maximization loss with InfoNCE, we select \(L=6\) feature layers of the feature extractor, sample \(S_l=1024\) random regions per layer, and use a two-layer fully-connected network to map the output features of each selected layer to 256-dimensional similarity features, with temperature factor \(\tau =0.07\). Specifically, when the feature extractor F is ResNet-101, the input layer, the first ReLU layer, layer1, layer2, layer3, and layer4 are selected to compute the PatchNCE loss; when the feature extractor F is VGG-16, the input layer and the 4th, 9th, 16th, 28th, and 32nd layers are selected.
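
A hedged sketch of the patch sampling and 256-dimensional projection follows; the class name, the MLP hidden width, and the detail of reusing the same sampled indices for the features before and after translation (as in CUT) are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as nnF

class PatchSampler(nn.Module):
    """Samples S_l random spatial positions from a feature map and projects them
    to 256-dim embeddings with a two-layer MLP."""
    def __init__(self, in_dim: int, out_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU(True),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, feat, num_patches=1024, idx=None):
        b, c, h, w = feat.shape
        flat = feat.permute(0, 2, 3, 1).reshape(b, h * w, c)   # (B, h*w, C)
        num_patches = min(num_patches, h * w)
        if idx is None:                                        # reuse idx so the translated
            idx = torch.randperm(h * w, device=feat.device)[:num_patches]  # image samples
        z = nnF.normalize(self.mlp(flat[:, idx, :]), dim=-1)   # the same positions
        return z, idx
```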

Since data augmentation such as flipping and scale changes is used in the self-supervised training stage, a multi-scale testing scheme is used in the final model test to integrate the predictions obtained under different augmentations (sketched below).
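
A minimal sketch of such multi-scale testing is given below; the scale set and the use of horizontal flips are illustrative assumptions.

```python
import torch
import torch.nn.functional as nnF

@torch.no_grad()
def multi_scale_predict(model, image, scales=(0.75, 1.0, 1.25), flip=True):
    # Average softmax predictions over several scales (and flips), then argmax.
    b, _, h, w = image.shape
    acc = 0
    for s in scales:
        x = nnF.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
        logits = nnF.interpolate(model(x), size=(h, w), mode="bilinear", align_corners=False)
        acc = acc + torch.softmax(logits, dim=1)
        if flip:
            logits_f = model(torch.flip(x, dims=[3]))
            logits_f = nnF.interpolate(torch.flip(logits_f, dims=[3]), size=(h, w),
                                       mode="bilinear", align_corners=False)
            acc = acc + torch.softmax(logits_f, dim=1)
    return acc.argmax(dim=1)
```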

4.3 Evaluation Metrics

The metrics used to evaluate the algorithm are consistent with ordinary semantic segmentation tasks. Specifically, we calculate the PASCAL VOC intersection-over-union (\(\text {IoU}\)) between the predicted output and the ground-truth label [9]: \(\text {IoU}=\frac{\text {TP}}{\text {TP}+\text {FP}+\text {FN}}\), where \(\text {TP}\), \(\text {FP}\), and \(\text {FN}\) are the numbers of true positives, false positives, and false negatives. In addition to the per-class \(\text {IoU}\), we also report the mean over all semantic classes, \(\text {mIoU}\).
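
For completeness, a small sketch of computing per-class IoU and mIoU from a confusion matrix; preds and labels are integer class maps, and the ignore index (255) is an assumption.

```python
import numpy as np

def compute_miou(preds, labels, num_classes, ignore_index=255):
    # Build a confusion matrix: conf[i, j] counts pixels with label i predicted as j.
    mask = labels != ignore_index
    conf = np.bincount(num_classes * labels[mask].astype(int) + preds[mask].astype(int),
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(conf)
    iou = tp / np.maximum(conf.sum(0) + conf.sum(1) - tp, 1)  # TP / (TP + FP + FN)
    return iou, iou.mean()
```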

4.4 Comparisons with State of the Arts

Table 3. GTA5 \(\rightarrow \) CityScapes

Table 3 and Table 4 show the semantic segmentation performance on the GTA5 \(\rightarrow \) CityScapes and SYNTHIA \(\rightarrow \) CityScapes tasks; our model achieves better results than the classic prevalent methods. We also compare the two training schemes, output transfer and input-output transfer. Although input-output transfer performs well on the GTA5 \(\rightarrow \) CityScapes task, it does not perform well on SYNTHIA \(\rightarrow \) CityScapes. A likely reason is that the gap between the source dataset SYNTHIA and the target dataset CityScapes is large: not only the texture style but also the viewpoint differs considerably, which makes transfer across the input and output spaces too difficult to learn effective transferable knowledge and can even cause serious negative transfer. This damages the capacity to extract the common semantic features of input and output (for example, with VGG-16 as the feature extractor the road class \(\text {IoU}\) is only 11.2%, and with ResNet-101 it is only 29.7%). Since the output transfer scheme only aligns the semantic output space, it can ignore information such as texture and viewpoint, so its semantic segmentation performance is more stable and robust. In summary, the input-output transfer scheme is suitable for slight domain shift, while the output transfer scheme should be used for significant domain shift.

Table 4. SYNTHIA \(\rightarrow \) CityScapes

Table 5 shows the results of the three-stage training paradigm with the ResNet-101 feature extractor on the GTA5 \(\rightarrow \) CityScapes task. Taking the input-output transfer scheme as an example, the \(\text {mIoU}\) of the proposed model on the target-domain semantic segmentation task increases from 38.4 in the pre-training stage to 45.3 in the cross-domain transfer stage, which verifies that effective knowledge transfer can be realized by reusing the semantic segmentation network for unsupervised image-to-image translation. After self-supervised training, \(\text {mIoU}\) further increases to 48.3, indicating that the pseudo-labels retain sufficient semantic information and that distilling them further improves segmentation performance.

Table 5. Training results of each stage

4.5 Ablation Study

Table 6. The ablation study of GTA5 \(\rightarrow \) CityScapes

This section conducts an ablation study in the cross-domain transfer stage with the ResNet-101 feature extractor on the GTA5 \(\rightarrow \) CityScapes task, focusing on the two cross-domain transfer schemes and the color-jitter augmentation. As shown in Table 6, without color-jitter augmentation on the source-domain data, the output transfer scheme increases \(\text {mIoU}\) from 38.4 to 42.5, whereas the input-output transfer scheme only reaches 40.4. One possible reason is that transfer across the input and output spaces is affected by the RGB color space of the input, and color jitter helps extract semantic information that is irrelevant to color. The output transfer scheme, in contrast, transfers in the output space, and the reconstruction of the source domain is not constrained by the input color space, so more color-invariant semantic information can be extracted through contrastive learning.

Furthermore, when only color-jitter augmentation is applied to the source domain without any explicit transfer scheme, \(\text {mIoU}\) increases from 38.4 to 44.3, which further shows that color-jitter augmentation helps extract color-invariant representations and that color-invariant semantic information is essential for unsupervised semantic segmentation. By combining color-jitter augmentation with the contrastive-learning-based cross-domain transfer schemes, the ability to extract color-invariant semantic information is further improved, and \(\text {mIoU}\) rises to 45.2 and 45.3, respectively.

Fig. 3. Visualization of results on GTA5 \(\rightarrow \) CityScapes. From top to bottom: the input images of the target domain CityScapes, the projection maps of the output transfer scheme, the projection maps of the input-output transfer scheme, and the semantic segmentation maps after pre-training, after input-output transfer training, and after self-supervised training.

4.6 Visualization of Results

Figure 3 shows the visualization results of the proposed model with the ResNet-101 feature extractor on the GTA5 \(\rightarrow \) CityScapes task. Observing the semantic projection maps of the output transfer scheme, we find that they are more abstract; they may therefore retain more abstract high-level semantic information while ignoring unimportant style details, reducing the burden of transfer learning and possibly avoiding negative transfer when the source and target domains differ greatly.

Observing the semantic projection maps of the input-output transfer scheme, we find that they seem to overlay semantic layers of different colors on the foreground and background objects; the translated images look like the original inputs with a semantic attention mechanism added. This phenomenon indirectly shows that reusing the semantic segmentation network as an unsupervised image-to-image translation network can integrate segmentation information into the translation process. Therefore, the input-output transfer scheme can be used when the difference between the source and target domains is small, effectively exploiting the segmentation labels to realize a positive feedback loop between image-to-image translation and semantic segmentation.

Further observation of the segmentation outputs of the three training stages shows that each stage produces more refined and accurate results than the previous one, which qualitatively verifies the steady improvement brought by the three-stage training paradigm.

5 Conclusion

We attempt to combine contrastive learning with generative adversarial networks to achieve a fusion of semantic priors and unknown "dark energy", and propose a transfer method for unsupervised semantic segmentation tasks. Specifically, the method uses an unsupervised image-to-image translation framework that maximizes mutual information through a generative adversarial network and contrastive learning, bringing the input and output distributions together and indirectly extracting domain-invariant features. First, the feature extractor and classifier of the semantic segmentation network are reused as the encoder and generator of the image-to-image translation framework. Then, the features at the same spatial location and the same level of the images before and after translation are pulled toward each other, while features at different spatial locations of the same level are pushed apart, so that the feature extractor learns features that are domain-invariant yet sensitive to spatial location and semantic class. We evaluate the method on benchmark datasets, compare it with other prevalent methods and between the two transfer schemes, and show that the proposed method is reasonable and practical. The input-output transfer scheme performs well under slight domain shift but fails under large domain shift, whereas the output transfer scheme is more robust. The ablation study further explains the importance of color-invariant semantic information for unsupervised semantic segmentation tasks.