1 Introduction

The domain shift problem has drawn increasing attention in recent years (Hoffman et al. 2018; Zhu et al. 2017; Tsai et al. 2018; Sankaranarayanan et al. 2017; Ghifary et al. 2015; Choi et al. 2018). In particular, two tasks are of interest in computer vision. One is the domain adaptation problem (Ganin and Lempitsky 2015; Tzeng et al. 2017; Saito et al. 2017), where the goal is to learn a model for a given task from a label-rich data domain (i.e., the source domain) that performs well in a label-scarce data domain (i.e., the target domain). The other is the image translation problem (Zhu et al. 2017; Huang et al. 2018), where the goal is to transfer images from the source domain to mimic the image style of the target domain. Domain adaptation and image translation share the same goal of reducing domain shift, but they also differ from each other. More specifically, domain adaptation is task-oriented: the domain shift is reduced through the guidance of different tasks such as image classification, semantic segmentation, and object detection. In contrast, image translation is agnostic to high-level tasks and focuses on adapting image styles at the pixel level. Image translation and domain adaptation are complementary, and can be combined by conducting domain adaptation on the samples produced by image translation (Hoffman et al. 2018; Sakaridis et al. 2020) to further reduce the domain shift.

Fig. 1 Illustration of domain flow generation. Traditional image translation methods directly map the image from the source domain to the target domain, while our DLOW model is able to produce a sequence of intermediate domains shifting from the source domain to the target domain

Generally, most existing works focus on the target domain only (Hoffman et al. 2018; Tsai et al. 2018; Zhu et al. 2017; Huang et al. 2018). They aim to learn models that fit the target data distribution well, e.g., achieving good classification accuracy in the target domain, or transferring source images into the target style. In this work, we are instead interested in the intermediate domains between the source and target domains, inspired by Gopalan et al. (2011), Gong et al. (2012), Cui et al. (2014). Instead of representing an intermediate domain as a subspace or covariance matrix (Gopalan et al. 2011; Gong et al. 2012; Cui et al. 2014), we represent it through the style of images at the pixel level, which offers more flexibility and makes it easy to combine with popular existing deep learning frameworks by directly using our translated images. Representing intermediate domains directly with images has demonstrated great potential in addressing domain adaptation tasks such as from clear weather to fog (Dai et al. 2019) and from daytime to nighttime (Sakaridis et al. 2020). While collecting real data for the intermediate domains is feasible in some cases (Dai et al. 2019; Sakaridis et al. 2020), it is challenging and even impossible in many more general situations. To address this, we present a new domain flow generation (DLOW) model, which is able to translate images from the source domain into an arbitrary intermediate domain between the source and target domains. As shown in Fig. 1, by translating a source image along the domain flow from the source domain to the target domain, we obtain a sequence of images that naturally characterize the distribution shift from the source domain to the target domain.

The benefits of our DLOW model are twofold. First, the intermediate domains are helpful for bridging the distribution gap between the two domains. By translating images into intermediate domains, the translated images can be used to ease the domain adaptation task. We show that traditional domain adaptation methods can be boosted to achieve better performance in the target domain with intermediate domain images. Moreover, the obtained models also exhibit good generalization ability on new datasets that are not seen in the training phase, benefiting from the diverse intermediate domain images. Second, our DLOW model can be used for style generalization. Traditional image-to-image translation works (Zhu et al. 2017; Isola et al. 2017; Kim et al. 2017; Liu et al. 2017) mainly focus on learning a deterministic one-to-one mapping that transfers a source image into the target style. In contrast, our DLOW model allows us to translate a source image into an intermediate domain that is related to multiple target domains. For example, when performing photo-to-painting translation, instead of obtaining a Monet or Van Gogh style, our DLOW model can produce a blended style of Van Gogh, Monet, etc. Such blending can be customized during the inference phase by simply adjusting an input vector that encodes the relatedness to the different domains.

We implement our DLOW model based on CycleGAN (Zhu et al. 2017), which is one of the state-of-the-art unpaired image-to-image translation methods. We augment CycleGAN to take an additional input, the domainness variable. On the one hand, the domainness variable is injected into the translation network through a conditional instance normalization layer to affect the style of the output images. On the other hand, it is also used to weight the discriminators, balancing the relatedness of the output images to the different domains. For multiple target domains, the domainness variable is extended to a vector containing the relatedness to all target domains.

Furthermore, we also demonstrate the advantage of our DLOW model for the foggy scene understanding task. In this task, a major challenge is the large variance of fog density. Such variance mainly comes from two factors: the inter-image fog density variance caused by the weather, and the inner-image fog density variance caused by depth (the more distant the scene is, the denser the fog effect is). To address this challenge, we first use our DLOW model to translate clear weather images into a sequence of images with gradually changing fog density. Then, we adopt a curriculum learning strategy to train the segmentation model, in which images with light fog are used to train the model first, and we then gradually increase the fog density to encourage the model to perform well on dense fog scenes.

A preliminary version of this work was presented in Gong et al. (2019). Compared to Gong et al. (2019), this paper makes the following additional contributions:

  • In Sect. 3.3, we provide more illustration and discussion of the manifold, the intermediate domains, and the domain distribution distance measurement.

  • In Sect. 3.6, we provide more details to illustrate our DLOW model. In particular, the network architecture with multiple target domains is added, where we expand the domainness variable to a domainness vector and add additional discriminators for the different target domains. Besides, the image cycle consistency loss is also weighted by the elements of the domainness vector.

  • In Sect. 3.7, we propose a pipeline for foggy scene understanding based on our DLOW model. In particular, we propose a “Fog Flow” model, which consists of simulated foggy images with different fog densities. We also present a way to use curriculum learning, based on the “Fog Flow”, to help the segmentation model generalize well to foggy scenes with various fog densities.

  • In Sect. 4.1.2, we provide additional experimental results for domain adaptation, such as experiments combining our approach with the more recent feature-level adaptation framework ADVENT, and an ablation and comparison experiment without our proposed domainness-based adaptive weight.

  • In Sect. 4.3, we provide the experimental results of our DLOW model for the semantic foggy scene understanding task. Specifically, we provide quantitative and qualitative comparisons between our “Fog Flow” (DLOW-based) and the popular simulated foggy image dataset “Foggy Cityscapes”. The results show that our “Fog Flow” outperforms “Foggy Cityscapes” not only in generating images with different fog densities but also when combined with curriculum learning.

Extensive results on benchmark datasets demonstrate the effectiveness of our proposed model for domain adaptation, style generalization, and foggy scene understanding.

2 Related Work

2.1 Image to Image Translation

Our work is related to the image-to-image translation works. The image-to-image translation task aims at translating images from one domain into another. Inspired by the success of Generative Adversarial Networks (GANs) (Goodfellow et al. 2014), many works have been proposed to address image-to-image translation based on GANs (Isola et al. 2017; Wang et al. 2018; Zhu et al. 2017; Liu et al. 2017; Lu et al. 2017; He et al. 2017; Zhu et al. 2017; Huang et al. 2018; Almahairi et al. 2018; Choi et al. 2018; Lee et al. 2018; Yi et al. 2017; Lin et al. 2018; Zheng et al. 2020). The early works (Isola et al. 2017; Wang et al. 2018) assume that paired images between two domains are available, while more recent works such as CycleGAN (Zhu et al. 2017), DiscoGAN (Kim et al. 2017) and UNIT (Liu et al. 2017) are able to train networks without using paired images. However, those works focus on learning deterministic image-to-image mappings. Once the model is learnt, a source image can only be transferred to a fixed target style.

A few recent works (Lu et al. 2017; He et al. 2017; Zhu et al. 2017; Huang et al. 2018; Almahairi et al. 2018; Choi et al. 2018; Lee et al. 2018; Yi et al. 2017; Lin et al. 2018; Lample et al. 2017) focus on learning a unified model to translate images into multiple styles. These works can be divided into two categories according to the controllability of the target styles. The first category, such as Huang et al. (2018), Almahairi et al. (2018), realizes multimodal translation by sampling different style codes, which are encoded from the target style images. However, those works focus on modelling intra-domain diversity, while our DLOW model aims at characterizing inter-domain diversity. Moreover, they cannot explicitly control the translated target style using the input codes.

The second category, such as Choi et al. (2018), Lample et al. (2017), assigns domain labels to different target domains, and the domain labels are proven to be effective in controlling the translation direction. Among those, Lample et al. (2017) show that interpolation between target domains can be achieved by continuously shifting the domain labels to change the extent of the contribution of each target domain. However, these methods only use discrete binary domain labels during training. Unlike the above works, the domainness variable proposed in this work is derived from the data distribution distance, and is used explicitly to regularize the style of the output images during training.

2.2 Domain Adaptation and Generalization

Our work is also related to the domain adaptation and generalization works. Domain adaptation aims to utilize a labeled source domain to learn a model that performs well on an unlabeled target domain (Ganin and Lempitsky 2015; Gopalan et al. 2011; Fernando et al. 2013; Tzeng et al. 2017; Jhuo et al. 2012; Baktashmotlagh et al. 2013; Kodirov et al. 2015; Gong et al. 2012; Chen et al. 2018; Zhang et al. 2018; Wulfmeier et al. 2018). Domain generalization is a similar problem, which aims to learn a model that can generalize to an unseen target domain by using multiple labeled source domains (Muandet et al. 2013; Ghifary et al. 2015; Niu et al. 2015b; Motiian et al. 2017; Niu et al. 2015a; Li et al. 2018a, c, b).

Our work is partially inspired by Gopalan et al. (2011), Gong et al. (2012), Cui et al. (2014), which have shown that the intermediate domains between the source and target domains are useful for addressing the domain adaptation problem. They represent each domain as a subspace or covariance matrix, and then connect them on the corresponding manifold to model intermediate domains. Different from those works, we model the intermediate domains by directly translating images at the pixel level. This allows us to easily improve existing deep domain adaptation models by using the translated images as training data. Moreover, our model can also be applied to image-level domain generalization by generating mixed-style images.

Recently, there is an increasing interest in applying domain adaptation techniques for semantic segmentation from synthetic data to real scenarios (Hoffman et al. 2016, 2018; Chen et al. 2018; Zou et al. 2018; Luo et al. 2018; Huang et al. 2018; Dundar et al. 2018; Pan et al. 2018; Saleh et al. 2018; Sankaranarayanan et al. 2018; Hong et al. 2018; Peng et al. 2018; Zhang et al. 2018; Tsai et al. 2018; Murez et al. 2017; Saito et al. 2017; Sankaranarayanan et al. 2017; Zhu et al. 2018; Chen et al. 2018). Most of those works conduct domain adaptation by adversarial training on the feature level with different priors. CyCADA (Hoffman et al. 2018) further shows that it is beneficial to first perform pixel-level domain adaptation by transferring source images into the target style using image-to-image translation methods like CycleGAN (Zhu et al. 2017). However, those methods address domain shift by adapting to the target domain only. In contrast, we aim to perform pixel-level adaptation by transferring source images to a flow of intermediate domains. Moreover, our model can also be used to further improve existing feature-level adaptation methods.

Besides, our work is also relevant to the curriculum domain adaptation (CDA) works. The CDA works (Zhang et al. 2017; Dai et al. 2019; Sakaridis et al. 2020) apply the curriculum training strategy to the domain adaptation problem. Their main idea is to first conduct adaptation on relatively easier tasks/samples and then on harder tasks/samples during training. Zhang et al. (2017) utilize the curriculum training strategy to first solve the easier tasks of estimating the global class distribution and labeling landmark superpixels, and then adapt to the harder task of pixel-wise semantic segmentation. More related to our work are Dai et al. (2019) and Sakaridis et al. (2020). They exploit the curriculum training strategy for semantic foggy scene understanding and semantic nighttime scene understanding, by first training the model on light fog images in Dai et al. (2019), or twilight images in Sakaridis et al. (2020), and then on dense fog images in Dai et al. (2019), or nighttime images in Sakaridis et al. (2020). In our semantic foggy scene understanding experiment, we adopt a similar light-to-dense fog curriculum training strategy. Benefiting from the continuous modeling of intermediate domains in our DLOW model, the domainness variable is able to represent the fog density effectively.

2.3 Semantic Foggy Scene Understanding

This work is also related to semantic foggy scene understanding, which aims to improve semantic segmentation performance under foggy weather conditions (Chen et al. 2018; Hahner et al. 2019; Dai et al. 2019; Sakaridis et al. 2018b, a). This is very important for autonomous driving systems, since it is necessary to guarantee the reliability of the system under all conditions, including adverse weather such as fog, snow or rain. We focus on the foggy image semantic segmentation task in this paper. Due to the difficulty of gathering and annotating foggy images, it is hard to build a sufficiently large dataset to train a robust model for foggy weather.

To this end, some works (He et al. 2010; Yang and Sun 2018) use dehazing to remove the fog in the image and then perform semantic segmentation on the dehazed images. More recently, Sakaridis et al. (2018b) propose to generate simulated foggy images by applying a physical fog model to available clear weather images. By taking the Cityscapes images (Cordts et al. 2016) as clear weather images, Sakaridis et al. (2018b) generate the “Foggy Cityscapes” dataset, which saves the cost of both annotation and data acquisition. Moreover, Sakaridis et al. (2018b), Sakaridis et al. (2018a) further propose the “Foggy Zurich” and “Foggy Driving” datasets, which contain real world foggy images; a small subset of them has coarse annotations and serves as a benchmark for testing segmentation performance under foggy weather. By applying the curriculum fine-tuning strategy (Sakaridis et al. 2018a; Dai et al. 2019) on the “Foggy Cityscapes” images, it has been shown that semantic segmentation performance under foggy weather is substantially improved.

However, a limitation of the physics-based simulation method is that it relies heavily on the estimated depth map, which is often sparse and strongly affected by the atmospheric light of the original image. As a result, the simulated foggy images often contain obvious artifacts in regions with inaccurate depth estimation, and they also cannot well imitate the weak light of real foggy scenes when the original clear weather image was captured under strong light. In contrast, our DLOW model does not rely on depth information, and is also able to produce more realistic foggy scene images.

3 Domain Flow Generation

3.1 Problem Statement

In the domain shift problem, we are given a source domain \({\mathcal {S}}\) and a target domain \({\mathcal {T}}\) containing samples from two different distributions \(P_S\) and \(P_T\), respectively. Denoting a source sample as \({\mathbf {x}}^s \in {\mathcal {S}}\) and a target sample as \({\mathbf {x}}^t \in {\mathcal {T}}\), we have \({\mathbf {x}}^s \sim P_S\), \({\mathbf {x}}^t \sim P_T\), and \(P_S \ne P_T\).

Such distribution mismatch usually leads to a significant performance drop when applying the model trained on \({\mathcal {S}}\) to \({\mathcal {T}}\). Many works have been proposed to address the domain shift for different vision applications. A group of recent works aim to reduce the distribution difference on the feature level by learning domain-invariant features (Ganin and Lempitsky 2015; Gopalan et al. 2011; Kodirov et al. 2015; Gong et al. 2012), while others work on the image level to transfer source images to mimic the target domain style (Zhu et al. 2017; Liu et al. 2017; Zhu et al. 2017; Huang et al. 2018; Almahairi et al. 2018; Choi et al. 2018).

In this work, we also propose to address the domain shift problem on image level. However, different from existing works that focus on transferring source images into the target domain only, we instead transfer them into all intermediate domains that connect source and target domains. This is partially motivated by the previous works (Gopalan et al. 2011; Gong et al. 2012; Cui et al. 2014; Dai et al. 2019; Sakaridis et al. 2020), which have shown that the intermediate domains between source and target domains are useful for addressing the domain adaptation problem.

In the following, we first briefly review the conventional image-to-image translation model CycleGAN. We then formulate the intermediate domain adaptation problem based on the data distribution distance. Next, we present our DLOW model based on the CycleGAN model. Then we show the benefits of our DLOW model with three applications: (1) improving existing domain adaptation models with the images generated by the DLOW model, (2) transferring images into arbitrarily mixed styles when there are multiple target domains, and (3) generating the “Fog Flow” to improve foggy image semantic segmentation performance.

3.2 The CycleGAN Model

We build our model based on the state-of-the-art CycleGAN model (Zhu et al. 2017) which is proposed for unpaired image-to-image translation. Formally, the CycleGAN model learns two mappings between \({\mathcal {S}}\) and \({\mathcal {T}}\), i.e., \(G_{ST}: {\mathcal {S}}\rightarrow {\mathcal {T}}\) which transfers the images in \({\mathcal {S}}\) into the style of \({\mathcal {T}}\), and \(G_{TS}: {\mathcal {T}}\rightarrow {\mathcal {S}}\) which acts in the inverse direction. We take the \({\mathcal {S}}\rightarrow {\mathcal {T}}\) direction as an example to explain CycleGAN.

To transfer source images into the target style and also preserve the semantics, CycleGAN employs an adversarial training module and a reconstruction module, respectively. In particular, the adversarial training module is used to align the image distributions of the two domains, such that the style of the mapped images matches the target domain. Let us denote the discriminator as \(D_T\), which attempts to distinguish the translated images from the target images. Then the objective function of the adversarial training module can be written as,

$$\begin{aligned}&\min _{G_{ST} }\max _{D_T} {\mathbb {E}}_{{\mathbf {x}}^t\sim {P_T}}\left[ \log (D_{T}({\mathbf {x}}^t))\right] \nonumber \\&\quad + {\mathbb {E}}_{{\mathbf {x}}^s\sim {P_S}}\left[ \log (1-D_{T}(G_{ST}({\mathbf {x}}^s)))\right] . \end{aligned}$$
(1)

Moreover, the reconstruction module ensures that the mapped image \(G_{ST}({\mathbf {x}}^s)\) preserves the semantic content of the original image \({\mathbf {x}}^s\). This is achieved by enforcing a cycle consistency loss such that \(G_{ST}({\mathbf {x}}^s)\) is able to recover \({\mathbf {x}}^s\) when being mapped back to the source style, i.e.,

$$\begin{aligned} \min _{G_{ST}} \quad {\mathbb {E}}_{{\mathbf {x}}^s\sim {P_S}}\left[ \Vert G_{TS}(G_{ST}({\mathbf {x}}^s))-{\mathbf {x}}^s\Vert _{1}\right] . \end{aligned}$$
(2)

Similar modules are applied to the \({\mathcal {T}}\rightarrow {\mathcal {S}}\) direction. By jointly optimizing all modules, the CycleGAN model is able to transfer source images into the target style and vice versa.
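As a minimal illustration, the PyTorch sketch below computes the losses in Eqs. (1) and (2) for the \({\mathcal {S}}\rightarrow {\mathcal {T}}\) direction. The module interfaces (G_ST, G_TS, D_T) are our own assumptions rather than the released implementation, and the generator term uses the non-saturating variant commonly adopted in practice.

import torch
import torch.nn.functional as F

def cyclegan_losses(G_ST, G_TS, D_T, x_s, x_t):
    """Sketch of the CycleGAN objectives for the S->T direction.
    G_ST, G_TS, D_T: torch.nn.Module networks; D_T outputs probabilities in (0, 1)."""
    fake_t = G_ST(x_s)  # source image rendered in the target style
    # Adversarial loss, Eq. (1): the discriminator separates real target images
    # from translated ones; the generator tries to fool it.
    loss_d = -(torch.log(D_T(x_t) + 1e-8).mean()
               + torch.log(1 - D_T(fake_t.detach()) + 1e-8).mean())
    loss_g_adv = -torch.log(D_T(fake_t) + 1e-8).mean()
    # Cycle consistency loss, Eq. (2): mapping back to the source style should recover x_s.
    loss_cyc = F.l1_loss(G_TS(fake_t), x_s)
    return loss_d, loss_g_adv, loss_cyc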

Fig. 2 Illustration of domain flow. Many possible paths (the green dashed lines) connect the source and target domain distributions on the statistical manifold, while the domain flow is the shortest geodesic path (the red line). There are multiple domain distributions (the blue dashed line) keeping the expected relative distances to the source and target domain distributions. An intermediate domain distribution (the blue dot) is the point on the domain flow that keeps the desired distances to the two domain distributions

3.3 Modeling Intermediate Domains

Intermediate domains have been shown to be helpful for domain adaptation (Gopalan et al. 2011; Gong et al. 2012; Cui et al. 2014), where they are modeled as a geodesic path on a Grassmannian or Riemannian manifold. Inspired by those works, we also characterize the domain shift using intermediate domains that connect the source and target domains. Different from those works, we directly operate at the image level, i.e., translating source images into different styles corresponding to intermediate domains. In this way, our method can be easily integrated with deep learning techniques for enhancing the cross-domain generalization ability of models.

Given the source domain \({\mathcal {S}}\) and the target domain \({\mathcal {T}}\), the source samples \({\mathbf {x}}^s\in {\mathcal {S}}\) obey the domain distribution \({\mathbf {x}}^s\sim P_S\), while the target samples \({\mathbf {x}}^t\in {\mathcal {T}}\) obey the domain distribution \({\mathbf {x}}^t\sim P_T\). From information geometry (Baktashmotlagh et al. 2014), the probability distributions \(P_S\) and \(P_T\) lie on a Riemannian manifold known as the statistical manifold. A statistical manifold is a Riemannian manifold whose points are probability distributions. As illustrated in Baktashmotlagh et al. (2014), given a non-empty set \({\mathcal {X}}\) and a family of probability distributions \(p(x|\theta )\) parameterized by \(\theta \) on \({\mathcal {X}}\), the space of probability distributions \(\{p(x|\theta )|\theta \in {\mathbb {R}}^{d}\}\) forms a statistical manifold, where d is the dimension of the parameter \(\theta \). On the statistical manifold, there are many possible routes from the source domain distribution \(P_S\) to the target domain distribution \(P_T\). Among these paths, the shortest geodesic path between \(P_S\) and \(P_T\) is defined as the Domain Flow. The points on the shortest geodesic path are the intermediate domain distributions \(P_{M^{(z)}}\), and the intermediate domain samples \({\mathbf {x}}_{M^{(z)}}\in {\mathcal {M}}^{(z)}\) follow the intermediate domain distributions \({\mathbf {x}}_{M^{(z)}}\sim P_{M^{(z)}}\). In particular, let us denote an intermediate domain as \({\mathcal {M}}^{(z)}\), where \(z \in [0, 1]\) is a continuous variable which models the relatedness to the source and target domains. We refer to z as the domainness of the intermediate domain. When \(z = 0\), the intermediate domain \({\mathcal {M}}^{(z)}\) is identical to the source domain \({\mathcal {S}}\); and when \(z = 1\), it is identical to the target domain \({\mathcal {T}}\). By varying z in the range of [0, 1], we thus obtain a sequence of intermediate domain distributions \(P_{M^{(z)}}\) that flow from the source domain distribution \(P_S\) to the target domain distribution \(P_T\).

As shown in Baktashmotlagh et al. (2014), on the statistical manifold, the Fisher-Rao Riemannian metric, \({\textit{dist}}(\cdot , \cdot )\), provides the distance measurement between different domain distributions and induces the geodesics. Moreover, as illustrated in Fig. 2, for any given z, the distance from \(P_S\) to \(P_{M^{(z)}}\) should be proportional to z relative to the distance from \(P_S\) to \(P_T\). With the Fisher-Rao Riemannian metric, \({\textit{dist}}(\cdot , \cdot )\), we expect that,

$$\begin{aligned} \begin{aligned} \frac{{\textit{dist}}\left( P_S, P_{M^{(z)}}\right) }{{\textit{dist}}\left( P_T, P_{M^{(z)}}\right) }=\frac{z}{1-z}. \end{aligned} \end{aligned}$$
(3)

Thus, the intermediate domain \({\mathcal {M}}^{(z)}\) for a given z can be obtained by finding the point on the shortest geodesic path that satisfies Eq. (3), which leads to minimizing the following loss,

$$\begin{aligned} {\mathcal {L}}= (1-z) \cdot {\textit{dist}}\left( P_S, P_{M^{(z)}}\right) + z\cdot {\textit{dist}}\left( P_T, P_{M^{(z)}}\right) . \end{aligned}$$
(4)
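As a concrete numeric illustration (our own example, not part of the original formulation), setting \(z = 0.25\) in Eq. (3) requires \({\textit{dist}}(P_S, P_{M^{(z)}})\) to be one third of \({\textit{dist}}(P_T, P_{M^{(z)}})\), and Eq. (4) becomes

$$\begin{aligned} {\mathcal {L}}= 0.75 \cdot {\textit{dist}}\left( P_S, P_{M^{(0.25)}}\right) + 0.25\cdot {\textit{dist}}\left( P_T, P_{M^{(0.25)}}\right) , \end{aligned}$$

so deviations from the source domain are penalized three times more heavily, and the optimal intermediate distribution stays correspondingly closer to \(P_S\).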

However, since the parametric form of the domain distributions is unavailable, the Fisher-Rao Riemannian metric is ill-suited for measuring the domain distribution distance in practice. As proven in Baktashmotlagh et al. (2014), the f-divergence is an important class of approximations of the Fisher-Rao Riemannian metric. In particular, the Jensen-Shannon divergence (Arjovsky et al. 2017; Ganin and Lempitsky 2015) is a special case of the f-divergence, and has proven to be effective for measuring the domain distribution distance in the image translation task. Thus, we adopt the Jensen-Shannon divergence as the approximation of the Fisher-Rao Riemannian metric, \({\textit{dist}}(\cdot ,\cdot )\). The adversarial loss in Eq. (1) can be seen as a lower bound of the Jensen–Shannon divergence.

3.4 The DLOW Model

We now present our DLOW model to generate intermediate domains. Given a source image \({\mathbf {x}}^s \sim P_S\) and a domainness variable \(z \in [0, 1]\), the task is to transfer \({\mathbf {x}}^s\) into the intermediate domain \({\mathcal {M}}^{(z)}\) with the distribution \(P_{M^{(z)}}\) that minimizes the objective in Eq. (4). We take the \({\mathcal {S}}\rightarrow {\mathcal {T}}\) direction as an example, and the other direction can be handled similarly.

In our DLOW model, the generator \(G_{ST}\) no longer aims to directly transfer \({\mathbf {x}}^s\) to the target domain \({\mathcal {T}}\), but to move \({\mathbf {x}}^s\) towards it. The extent of this shift is controlled by the domainness variable z. Let us denote \({\mathcal {Z}}= [0, 1]\) as the domain of z; then the generator in our DLOW model can be represented as \(G_{ST}({\mathbf {x}}^s, z): {\mathcal {S}}\times {\mathcal {Z}}\rightarrow {\mathcal {M}}^{(z)}\), where the input is the joint space of \({\mathcal {S}}\) and \({\mathcal {Z}}\).

3.4.1 Adversarial Loss

As discussed in Sect. 3.3, we deploy the adversarial loss as the distribution distance measurement to control the relatedness of an intermediate domain to the source and target domains. Specifically, we introduce two discriminators, \(D_{S}({\mathbf {x}})\) to distinguish \({\mathcal {M}}^{(z)}\) from \({\mathcal {S}}\), and \(D_{T}({\mathbf {x}})\) to distinguish \({\mathcal {M}}^{(z)}\) from \({\mathcal {T}}\), respectively. Then, the adversarial losses between \({\mathcal {M}}^{(z)}\) and \({\mathcal {S}}\), and between \({\mathcal {M}}^{(z)}\) and \({\mathcal {T}}\), can be written respectively as,

$$\begin{aligned} {\mathcal {L}}_{adv}(G_{ST}, D_S) = {\mathbb {E}}_{{\mathbf {x}}^s\sim {P_S}}\left[ \log (D_{S}({\mathbf {x}}^s))\right] + {\mathbb {E}}_{{\mathbf {x}}^s\sim {P_S}}\left[ \log (1-D_{S}(G_{ST}({\mathbf {x}}^s, z)))\right] \end{aligned}$$
(5)
$$\begin{aligned} {\mathcal {L}}_{adv}(G_{ST}, D_T) = {\mathbb {E}}_{{\mathbf {x}}^t\sim {P_T}}\left[ \log (D_{T}({\mathbf {x}}^t))\right] + {\mathbb {E}}_{{\mathbf {x}}^s\sim {P_S}}\left[ \log (1-D_{T}(G_{ST}({\mathbf {x}}^s, z)))\right] . \end{aligned}$$
(6)

By using the above losses to model \(dist(P_S, P_{M^{(z)}})\) and \(dist(P_T, P_{M^{(z)}})\) in Eq. (4), we arrive at the following loss,

$$\begin{aligned} {\mathcal {L}}_{adv} = (1-z) {\mathcal {L}}_{adv}(G_{ST}, D_S) + z {\mathcal {L}}_{adv}(G_{ST}, D_T). \end{aligned}$$
(7)
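As a minimal sketch (assuming PyTorch modules with hypothetical interfaces and the non-saturating generator loss that is common in practice), the generator side of the weighted objective in Eqs. (5)-(7) can be written as:

import torch

def dlow_generator_adv_loss(G_ST, D_S, D_T, x_s, z):
    """Blend the two adversarial terms by the domainness z (Eq. 7): the
    translated image should stay close to the source style with weight (1 - z)
    and approach the target style with weight z."""
    fake_m = G_ST(x_s, z)                           # sample in the intermediate domain M^(z)
    loss_s = -torch.log(D_S(fake_m) + 1e-8).mean()  # relatedness to the source domain
    loss_t = -torch.log(D_T(fake_m) + 1e-8).mean()  # relatedness to the target domain
    return (1 - z) * loss_s + z * loss_t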
Fig. 3 Overview of our DLOW model: the generator takes the domainness z as an additional input to control the image translation and to reconstruct the source image; the domainness z is also used to weight the two discriminators

3.4.2 Image Cycle Consistency Loss

As in CycleGAN, we also apply a cycle consistency loss to ensure that the semantic content is well-preserved in the translated image. Let us denote the generator in the other direction as \(G_{TS}({\mathbf {x}}^t, z): {\mathcal {T}}\times {\mathcal {Z}}\rightarrow {\mathcal {M}}^{(1-z)}\), which transfers a sample \({\mathbf {x}}^t\) from the target domain towards the source domain by an interval of z. Since \(G_{TS}\) acts in the inverse direction to \(G_{ST}\), we can use it to recover \({\mathbf {x}}^s\) from the translated version \(G_{ST}({\mathbf {x}}^s, z)\), which gives the following loss,

$$\begin{aligned} {\mathcal {L}}_{cyc} = {\mathbb {E}}_{{\mathbf {x}}^s\sim {P_S}}\left[ \Vert G_{TS}(G_{ST}({\mathbf {x}}^s,z),z)-{\mathbf {x}}^s\Vert _{1}\right] . \end{aligned}$$
(8)

3.4.3 Full Objective

We integrate the losses defined above, and the full objective can be written as,

$$\begin{aligned} {\mathcal {L}}={\mathcal {L}}_{adv} + \lambda _1{\mathcal {L}}_{cyc}, \end{aligned}$$
(9)

where \(\lambda _{1}\) is a hyper-parameter used to balance the two losses in the training process.

A similar loss can be defined for the other direction \({\mathcal {T}}\rightarrow {\mathcal {S}}\). Due to the usage of the adversarial loss \({\mathcal {L}}_{adv}\), the training is performed in an alternating manner. We first minimize the full objective with regard to the generators, and then maximize it with regard to the discriminators.

3.4.4 Implementation

We illustrate the network structure of our DLOW model in Fig. 3. First, the domainness variable z is taken as an input of the generator \(G_{ST}\). This is implemented with the Conditional Instance Normalization (CN) layer (Almahairi et al. 2018; Huang and Belongie 2017). We first use one deconvolution layer to map the domainness variable z to a vector with dimension (1, 16, 1, 1), and then use this vector as the input for the CN layer. Moreover, the domainness variable also plays the role of weighting the discriminators to balance the relatedness of the generated images to the different domains. It is also used as input in the image cycle consistency module. During the training phase, we randomly generate the domainness parameter z for each input image. Inspired by Zhang et al. (2018), we force the domainness variable z to obey a beta distribution, i.e. \(f(z,\alpha , \beta )=\frac{1}{B(\alpha ,\beta )}z^{\alpha -1}(1-z)^{\beta -1}\), where \(\beta \) is fixed as 1, and \(\alpha \) is a function of the training step \(\alpha =e^{\frac{t-0.5T}{0.25T}}\) with t being the current iteration and T being the total number of iterations. In this way, z tends to be sampled with small values at the beginning, and gradually shifts to larger values towards the end, which gives slightly more stable training than uniform sampling.
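The following sketch reflects our own reading of this paragraph: it shows how the domainness z can be injected through a conditional instance normalization layer and how z can be sampled from the annealed beta distribution. All layer and function names are hypothetical, and the 16-dim expansion is simplified to a linear layer instead of the deconvolution used in the paper.

import numpy as np
import torch
import torch.nn as nn

class DomainnessCIN(nn.Module):
    """Conditional instance normalization conditioned on the domainness z."""
    def __init__(self, num_features, z_dim=16):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_features, affine=False)
        self.embed = nn.Linear(1, z_dim)        # expand scalar z to a 16-dim code
        self.gamma = nn.Linear(z_dim, num_features)
        self.beta = nn.Linear(z_dim, num_features)

    def forward(self, x, z):
        code = torch.relu(self.embed(z.view(-1, 1)))
        g = self.gamma(code).unsqueeze(-1).unsqueeze(-1)
        b = self.beta(code).unsqueeze(-1).unsqueeze(-1)
        return (1 + g) * self.norm(x) + b        # z modulates the feature statistics

def sample_domainness(t, T):
    """Sample z ~ Beta(alpha, 1) with alpha = exp((t - 0.5T) / (0.25T)):
    small z is favored early in training, large z towards the end."""
    alpha = np.exp((t - 0.5 * T) / (0.25 * T))
    return np.random.beta(alpha, 1.0)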

3.5 Boosting Domain Adaptation Models

With the DLOW model, we are able to translate each source image \({\mathbf {x}}^s\) into an arbitrary intermediate domain \({\mathcal {M}}^{(z)}\). Let us denote the source dataset as \({\mathcal {S}}= \{({\mathbf {x}}^s_i, y_i)|_{i=1}^n\}\) where \(y_i\) is the label of \({\mathbf {x}}^s_i\). By feeding each image \({\mathbf {x}}^s_i\) together with a domainness \(z_{i}\) randomly sampled from the uniform distribution \({\mathcal {U}}(0,1)\), we then obtain a translated dataset \({\tilde{{\mathcal {S}}}} = \{({\tilde{\mathbf {x}}}^s_i, y_i)|_{i=1}^n\}\) where \({\tilde{\mathbf {x}}}^s_i = G_{ST}({\mathbf {x}}^s_i, z_i)\) is the translated version of \({\mathbf {x}}^s_i\). The images in \({\tilde{{\mathcal {S}}}}\) spread along the domain flow from the source to the target domain, and therefore become much more diverse. Using \({\tilde{{\mathcal {S}}}}\) as the training data is helpful to learn domain-invariant models for computer vision tasks. In Sect. 4.1, we demonstrate that a model trained on \({\tilde{{\mathcal {S}}}}\) achieves good performance for the cross-domain semantic segmentation problem.
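A minimal sketch of constructing \({\tilde{{\mathcal {S}}}}\) (the names G_ST and source_dataset are our own placeholders, not from the released code):

import torch

def build_translated_dataset(G_ST, source_dataset):
    """Push each labeled source image into a random intermediate domain M^(z),
    with z drawn uniformly from [0, 1]."""
    translated = []
    with torch.no_grad():
        for x_s, y in source_dataset:   # (image tensor, label) pairs
            z = torch.rand(1)           # z_i ~ U(0, 1)
            translated.append((G_ST(x_s, z), y))
    return translated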

Moreover, the translated dataset \({\tilde{{\mathcal {S}}}}\) can also be used to boost the existing adversarial training based domain adaptation approaches. Images in \({\tilde{{\mathcal {S}}}}\) fill the gap between the source and target domains, and thus ease the domain adaptation task. Taking semantic segmentation as an example, a typical way is to append a discriminator to the segmentation model, which is used to distinguish the source and target samples. Using the adversarial training strategy to optimize the discriminator and the segmentation model, the segmentation model is trained to be more domain-invariant.

As shown in Fig. 4, we replace the source dataset \({{\mathcal {S}}}\) with the translated version \({\tilde{{\mathcal {S}}}}\), and apply a weight \(\sqrt{1-z_i}\) to the adversarial loss. The motivation is as follows: for each sample \({\tilde{\mathbf {x}}}^s_i\), if the domainness \(z_i\) is higher, the sample is closer to the target domain, so the weight of the adversarial loss should be reduced; otherwise, we should increase the loss weight. Compared with treating all samples equally with a single hyper-parameter, our weight \(\sqrt{1-z_i}\) treats different samples adaptively according to their domainness, and helps further improve the domain adaptation performance, as verified in Sect. 4.1.2.
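A short sketch of this adaptive weighting (our own formulation; the discriminator is assumed to output per-pixel probabilities that a sample comes from the target domain):

import torch

def weighted_adversarial_loss(d_out, z):
    """d_out: discriminator outputs in (0, 1) for translated-source samples,
    shape (N, 1, H, W); z: per-sample domainness, shape (N,). Samples that
    already look like the target (large z) receive a smaller weight."""
    adv = -torch.log(d_out + 1e-8).mean(dim=(1, 2, 3))  # per-image adversarial term
    return (torch.sqrt(1.0 - z) * adv).mean()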

Fig. 4 Illustration of boosting a domain adaptation model for cross-domain semantic segmentation with the DLOW model. Intermediate domain images are used as the source dataset, and the adversarial loss is weighted by the domainness

Fig. 5 Network structure of DLOW model for style generalization with four target domains: a direction from \({\mathcal {S}}\rightarrow {\mathcal {T}}\); b direction from \({\mathcal {T}}\rightarrow {\mathcal {S}}\)

3.6 Style Generalization

Most existing image-to-image translation works learn a deterministic mapping between two domains. After the model is learnt, source images can only be translated to a fixed style. In contrast, our DLOW model takes a random z to translate images into various styles. When multiple target domains are provided, it is also able to transfer the source image into a mixture of different target styles. In other words, we are able to generalize to an unseen intermediate domain that is related to existing domains.

In particular, suppose we have K target domains, denoted as \({\mathcal {T}}_1, \ldots , {\mathcal {T}}_K\). Accordingly, the domainness variable z is expanded to a K-dim vector \({\mathbf {z}}= [z_1, \ldots , z_K]'\) with \(\sum _{k=1}^Kz_k = 1\). Each element \(z_k\) represents the relatedness to the k-th target domain. To map an image from the source domain to the intermediate domain defined by \({\mathbf {z}}\), we need to optimize the following objective,

$$\begin{aligned} {\mathcal {L}}= \sum _{k=1}^K z_{k} \cdot {\textit{dist}}(P_M, P_{T_k}), \quad \text{ s.t. }\quad \sum _{k=1}^K z_k = 1 \end{aligned}$$
(10)

where \(P_M\) is the distribution of the intermediate domain, and \(P_{T_k}\) is the distribution of \({\mathcal {T}}_k\). The network structure can be easily adjusted from our DLOW model to optimize the above objective. An example is shown in Fig. 5, where we have four target domains, each of which represents an image style. Accordingly, the domainness variable z is expanded to a 4-dim vector \({\mathbf {z}}= [z_1, \ldots , z_4]'\). For the direction \({\mathcal {S}}\rightarrow {\mathcal {T}}\), shown in Fig. 5a, the style generalization model consists of two modules, the adversarial module and the image reconstruction module. Similar to the DLOW model with one target domain described in Sect. 3.4, the adversarial module here is composed of the generator \(G_{ST}\) and the discriminators. The difference lies in that we have multiple discriminators here since multiple target domains are provided. For each target domain \({\mathcal {T}}_{i}\), there is one corresponding discriminator \(D_{T_{i}}\) measuring the distribution distance between the intermediate domain \({\mathcal {M}}^{(z)}\) and the target domain \({\mathcal {T}}_{i}\), which is then weighted by the corresponding element \(z_i\) of the domainness vector \({\mathbf {z}}\). Besides, the same image reconstruction module as in Sect. 3.4 is adopted to guarantee the preservation of semantic content. For the other direction \({\mathcal {T}}\rightarrow {\mathcal {S}}\), shown in Fig. 5b, the adversarial module is similar to that of the direction \({\mathcal {S}}\rightarrow {\mathcal {T}}\). However, the image reconstruction module is slightly different. Since multiple target domains \({\mathcal {T}}_{i}\) are provided, each target domain corresponds to an intermediate domain \({\mathcal {M}}^{(1-z)}_{i}\), which means \({\mathcal {M}}^{(1-z)} = \{{\mathcal {M}}^{(1-z)}_{i}, i = 1\ldots 4 \}\). There is an image reconstruction module corresponding to each intermediate domain \({\mathcal {M}}^{(1-z)}_{i}\), which is weighted by the corresponding element \(z_{i}\) of the domainness vector \({\mathbf {z}}\).
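A sketch of the weighted multi-target objective in Eq. (10) from the generator's perspective (hypothetical interfaces; each discriminator is assumed to output a probability in (0, 1)):

import torch

def multi_target_adv_loss(fake_m, discriminators, z_vec):
    """Weight each target-domain adversarial term by the corresponding entry
    of the domainness vector z, whose entries sum to 1."""
    loss = 0.0
    for D_k, z_k in zip(discriminators, z_vec):
        loss = loss + z_k * (-torch.log(D_k(fake_m) + 1e-8).mean())
    return loss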

3.7 Foggy Scene Understanding

The semantic foggy scene understanding task aims to improve the performance of semantic segmentation models under foggy weather, rather than the clear weather condition widely explored by most semantic segmentation works (Long et al. 2015; Yu et al. 2017; Zhao et al. 2017; Lin et al. 2017). However, real foggy scene images are difficult to acquire, and pixel-level dense annotation is also more time consuming and costly than for clear weather images. As a result, there is a recent trend to generate simulated foggy images based on clear weather images, and then use the simulated foggy images as training data to learn semantic segmentation models (Sakaridis et al. 2018b, a; Dai et al. 2019). Since the simulated foggy images share the same annotations as the original clear weather images, the cost of acquiring and annotating training images is greatly reduced.

Another good practice for foggy scene understanding is curriculum learning. Previous works (Sakaridis et al. 2018b, a; Dai et al. 2019) show that semantic segmentation performance can be further improved by progressively fine-tuning the segmentation model from light-fog simulated images to dense-fog simulated images. Interestingly, our DLOW model naturally fits this learning paradigm. As aforementioned, given a source domain (e.g., clear weather) and a target domain (e.g., foggy weather), our DLOW model is able to translate the source images into a sequence of intermediate domains that approach the target domain. Those intermediate domains are controlled by a domainness variable: when the domainness variable is low, the model produces light-fog images, and when it is high, the model produces dense-fog images. We further show that the DLOW model can generate denser-fog images that go beyond the target domain by simply setting a larger domainness variable. Such light-to-dense fog images contained in the domain flow can be readily used in the curriculum learning process, and help to deal with the large fog variance in real world scenarios.

Moreover, there are several other advantages of using the DLOW model to generate simulated foggy images compared with traditional physics-based simulation approaches (Sakaridis et al. 2018b, a; Dai et al. 2019). First, the physics-based simulation method requires a dense depth map corresponding to the real clear weather image, which is often noisy and degrades the quality of the simulated foggy image, while our DLOW model does not require any depth information and takes only the RGB images as input. Secondly, the DLOW model takes advantage of deep generative models, and demonstrates the ability to generate more natural foggy images with fewer artifacts, compared with traditional physics-based simulation approaches that use a hand-crafted pipeline (see Sect. 4.3 for details). We describe how to use the DLOW model for foggy scene understanding as follows.

Fig. 6 Illustration of “Fog Flow” generation and the curriculum training for semantic segmentation with the generated “Fog Flow”. The “Fog Flow” \({\hat{{\mathbf {x}}}}^{z_{1\ldots 5}}\), connected by the blue solid and dashed lines, represents the foggy images generated with different fog densities by feeding the clear weather image \({\hat{{\mathbf {x}}}}^{s}\) into our DLOW model. With the generated “Fog Flow”, the curriculum learning strategy is adopted: the semantic segmentation model pretrained on clear weather images is fine-tuned on the “Fog Flow” progressively from light fog to dense fog, as shown by the yellow arrows

3.7.1 Fog Flow Generation

Our target is to translate clear weather images into a set of images with gradually changing fog density, which is referred to as the “Fog Flow” in this work. More specifically, we use the method described in Sect. 3.4. A source domain (i.e., clear weather) and a target domain (i.e., foggy weather) are needed for training the DLOW model. As real foggy images are not easy to acquire, we use synthetic clear and foggy weather images generated in a virtual environment (Gaidon et al. 2016) as the training data for the DLOW model. In the inference stage, real clear weather images are used as input, and we vary the domainness variable to obtain foggy images with different fog densities.

Formally, let us denote by \({\mathcal {S}}\) the source domain consisting of synthetic rendered clear weather images \({\mathbf {x}}^s\), and by \({\mathcal {T}}\) the target domain consisting of synthetic rendered foggy images \({\mathbf {x}}^t\). Our DLOW model aims to learn the mapping \(G_{ST}({\mathbf {x}}^s, z): {\mathcal {S}}\times {\mathcal {Z}}\rightarrow {\mathcal {M}}^{(z)}\), where \({\mathcal {M}}^{(z)}\) is the intermediate domain consisting of images between the source clear weather images and the target foggy weather images. The larger the domainness variable z is, the denser the fog in \({\mathcal {M}}^{(z)}\) is.

During the inference phase, we take a real clear weather image \({\hat{{\mathbf {x}}}}^{s}\) as input, and then generate the simulated foggy images \([{\hat{{\mathbf {x}}}}^{z_{1}}, \ldots , {\hat{{\mathbf {x}}}}^{z_{k}}]\) with different fog densities \(z_{1}, \ldots , z_{k}\) by \(G_{ST}({\hat{{\mathbf {x}}}}^{s}, z_{i}), i=1\ldots k\). In order to generate denser-fog images, we also set the domainness variable \(z>1\), even though the model never sees such a domainness value during training. In other words, our DLOW model can push beyond the intermediate domains (\(0<z<1\)) to exceed the target domain (\(z>1\)). In this way, we translate the original clear weather image into a sequence of simulated foggy images with fog densities ranging from light fog to dense fog, which will be used in the curriculum training described below.
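A minimal inference-time sketch of generating the “Fog Flow” (G_ST and clear_image are hypothetical placeholders; the last value deliberately exceeds 1 to extrapolate beyond the target fog density):

import torch

def generate_fog_flow(G_ST, clear_image, fog_levels=(0.25, 0.5, 0.75, 1.0, 1.5)):
    """Translate one real clear weather image into a sequence of foggy images
    with increasing fog density; z > 1 goes beyond the target domain."""
    with torch.no_grad():
        return [G_ST(clear_image, torch.tensor([z])) for z in fog_levels]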

Fig. 7 Examples of intermediate domain images from GTA5 to Cityscapes. As the domainness variable z increases from 0 to 1, the styles of the translated images shift gradually from the synthetic GTA5 style to the realistic Cityscapes style. The images should be viewed from left to right and from top to bottom

Fig. 8 Examples comparing the images translated by DLOW with brightness-adjusted images. We adjust the brightness of the DLOW translated source image (\(z=0\)) to match the brightness of the corresponding DLOW translated target image (\(z=1\)). The lower images (c) and (d) in each group are the brightness-adjusted images, while the upper images (a) and (b) are the DLOW translated target images (\(z=1\)). Part of each image is enlarged and shown on the right to illustrate that our DLOW translation not only changes the brightness but also adjusts details such as the texture of the road and the style of the curb to mimic the appearance of the Cityscapes images

3.7.2 Curriculum Training with Fog Flow

We follow Sakaridis et al. (2018a), Dai et al. (2019) to perform curriculum learning for semantic foggy scene understanding. We employ a multi-stage fine-tuning strategy that first trains the segmentation model on the simulated light-fog images and then fine-tunes it on the simulated denser-fog images, and so on. Thanks to the controllable fog density and the diversity of our generated “Fog Flow”, it naturally lends itself to the above curriculum learning strategy. Following the curriculum learning strategy and assuming that the “Fog Flow” \([{\hat{{\mathbf {x}}}}^{z_{1}}, \ldots , {\hat{{\mathbf {x}}}}^{z_{k}}]\) generated by our DLOW model has ascending fog density \(z_{i-1}<z_{i}\), the semantic segmentation model is trained by progressively fine-tuning it from \({\hat{{\mathbf {x}}}}^{z_{1}}\) to \({\hat{{\mathbf {x}}}}^{z_{k}}\), i.e., the model fine-tuned on \({\hat{{\mathbf {x}}}}^{z_{i-1}}\) serves as the pretrained model for \({\hat{{\mathbf {x}}}}^{z_{i}}\). We illustrate the curriculum learning process in Fig. 6.
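The schedule can be summarized by the following sketch (the finetune routine and the per-stage datasets are hypothetical; this is not the authors' released training code):

def curriculum_finetune(seg_model, fog_flow_stages, finetune):
    """fog_flow_stages: datasets ordered by ascending fog density z_1 < ... < z_k.
    The model fine-tuned at stage i-1 initializes stage i."""
    for stage_data in fog_flow_stages:
        seg_model = finetune(seg_model, stage_data)  # fine-tune on the next denser fog level
    return seg_model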

4 Experiments

In this section, we demonstrate the benefits of our DLOW model with three tasks. In the first task, we address the domain adaptation problem, and train our DLOW model to generate the intermediate domain samples to boost the domain adaptation performance. In the second task, we consider the style generalization problem, and train our DLOW model to transfer images into new styles that are unseen in the training data. In the third task, we explore the semantic foggy scene understanding problem, and train our DLOW model to generate “Fog Flow” to improve the foggy image semantic segmentation performance.

4.1 Domain Adaptation and Generalization

4.1.1 Experiments Setup

For the domain adaptation problem, we follow Hoffman et al. (2016), Hoffman et al. (2018), Chen et al. (2018), Zou et al. (2018) to conduct experiments on urban scene semantic segmentation, learning from synthetic data to the real scenario. The GTA5 dataset (Richter et al. 2016) or the SYNTHIA dataset (Ros et al. 2016) is used as the source domain, while the Cityscapes dataset (Cordts et al. 2016) is the target domain. Moreover, we also evaluate the generalization ability of the learnt segmentation models to unseen domains, for which we take the KITTI (Geiger et al. 2012), WildDash (Zendel et al. 2018) and BDD100K (Yu et al. 2018) datasets as additional unseen datasets for evaluation.

Cityscapes is a dataset consisting of urban scene images taken from multiple European cities. We use the 2,975 training images without annotation as unlabeled target samples in the training phase, and 500 annotated validation images, which are densely labelled with 19 classes, for evaluation.

GTA5 is a dataset consisting of 24,966 densely labelled synthetic frames generated from the computer game. The annotations of the images are compatible with Cityscapes.

SYNTHIA-RAND-CITYSCAPES is a dataset comprising 9,400 photo-realistic images rendered from a virtual city, whose semantic labels are precise and compatible with the Cityscapes test set.

KITTI is a dataset consisting of images taken in the mid-size city of Karlsruhe. We use 200 validation images which are densely labeled and compatible with Cityscapes.

WildDash is a dataset covering images of driving scenarios from different sources, different environments (place, weather, time and so on) and different camera characteristics. We use 70 labeled validation images whose annotations are compatible with Cityscapes.

BDD100K is a driving dataset covering diverse images taken in the US, whose label maps use the training indices specified in Cityscapes. We use 1,000 densely labeled images for validation in our experiment.

We take the GTA5 dataset as an example to explain the experimental setup. We first train our proposed DLOW model using GTA5 as the source domain and Cityscapes as the target domain. Then, we generate a translated GTA5 dataset with the learnt DLOW model. Each source image is fed into DLOW with a random domainness variable z. The new translated GTA5 dataset contains exactly the same number of images as the original one, but the styles of the images randomly vary from the synthetic style to the real style. We then use the translated GTA5 dataset as the new source domain to train segmentation models.

Table 1 Results of semantic segmentation on the Cityscapes dataset based on DeepLab-v2 model with ResNet-101 backbone using the (a) GTA5 and (b) SYNTHIA images translated with different models
Table 2 Comparison of the performance of AdaptSegNet (Tsai et al. 2018) when using original source images and intermediate domain images translated with our DLOW model for semantic segmentation under domain adaptation (1st column) and domain generalization (2nd to 4th columns) scenarios
Table 3 Comparison of the cross-domain semantic segmentation performance between our methods and other methods, under the setting GTA5\(\rightarrow \)Cityscapes

We implement our model based on Augmented CycleGAN (Almahairi et al. 2018) and CyCADA (Hoffman et al. 2018). Following their setup, all images are resized to a width of 1024 while keeping the aspect ratio, and the crop size is set as \(400\times 400\) pixels. When training the DLOW model, the image cycle consistency loss weight is set as 10. The learning rate is fixed as 0.0002. For the segmentation network, we use the AdaptSegNet (Tsai et al. 2018) model, which is based on DeepLab-v2 (Chen et al. 2018) with ResNet-101 (He et al. 2016) as the backbone network. The training images are resized to \(1280\times 720\) pixels. We follow exactly the same training policy as AdaptSegNet.

The same training parameters and scheme as for GTA5 are applied to the SYNTHIA dataset; the only difference is that we resize the training images to \(1280\times 760\) pixels for the segmentation network.

4.1.2 Experimental Results

Intermediate Domain Images: To verify the ability of our DLOW model to generate intermediate domain images, in the inference phase, we fix the input source image and vary the domainness variable from 0 to 1. A few examples are shown in Fig. 7. It can be observed that the styles of the translated images gradually shift from the synthetic style of GTA5 to the real style of Cityscapes, which demonstrates that the DLOW model is capable of modeling the domain flow to bridge the source and target domains as expected.

At first glance, the main change in the generated intermediate domain images might appear to be the image brightness. We therefore provide an enlarged version of the intermediate images to show that not only the brightness but also the subtle textures are adjusted to mimic the Cityscapes style. For comparison, we adjust the brightness of the translated image with \(z=0\) to match the brightness of the corresponding translated image with \(z=1\). The enlarged translated image with \(z=1\) and the corresponding brightness-adjusted image (\(z=0\)) are shown in Fig. 8, from which we observe that the brightness-adjusted images (cf. Fig. 8c and d) still exhibit obvious features of the game style such as the high contrast textures of the road and the red curb, while our DLOW translated images (cf. Fig. 8a and b) well mimic the texture of the Cityscapes style.

Fig. 9 Qualitative results of the cross-domain semantic segmentation task, GTA5\(\rightarrow \)Cityscapes. For each a RGB image from the target domain, we show the corresponding b ground truth semantic segmentation map, c the semantic segmentation results of the AdaptSegNet model trained with source domain GTA5 images, and d the semantic segmentation results of the AdaptSegNet model trained with our DLOW translated images

Cross-Domain Semantic Segmentation: We further evaluate the usefulness of intermediate domain images in two settings. In the first setting, we compare with the CycleGAN model (Zhu et al. 2017), which is used in the CyCADA approach (Hoffman et al. 2018) for performing pixel-level domain adaptation. The difference between CycleGAN and our DLOW model is that CycleGAN transfers source images to mimic only the target style, while our DLOW model transfers source images into random styles flowing from the source domain to the target domain. We first obtain a translated version of the GTA5 dataset with each model. Then, we respectively use the two translated GTA5 datasets to train DeepLab-v2 models, which are evaluated on the Cityscapes dataset for semantic segmentation. We also include the “NonAdapt” baseline which uses the original GTA5 images as training data, as well as a special case of our approach, “DLOW (\(z=1\))”, where we set \(z = 1\) for all source images when performing image translation with the learnt DLOW model.

The results are shown in Table 1a. We observe that all pixel-level adaptation methods outperform the “NonAdapt” baseline, which verifies that image translation is helpful for training models for cross-domain semantic segmentation. Moreover, “DLOW(\(z=1\))” is a special case of our model that directly translates source images into the target domain, which unsurprisingly gives a result comparable to the CyCADA-pixel method (40.7% vs. 41.0%). By further using intermediate domain images, our DLOW model is able to improve the result from 40.7 to 42.3%, which demonstrates that intermediate domain images are helpful for learning a more robust and domain-invariant model.

Similar to GTA5, our DLOW model based on the SYNTHIA dataset also exhibits excellent performance in the first setting. Following Tsai et al. (2018), the segmentation performance based on the SYNTHIA dataset is tested on the Cityscapes validation dataset with 13 classes. As shown in Table 1b, all pixel-level adaptation methods outperform the “NonAdapt” baseline, which verifies the effectiveness of image translation for cross-domain segmentation. In particular, our “DLOW (\(z=1\))” model achieves 41.6%, gaining a 3% improvement compared with the “NonAdapt” baseline. After using the intermediate domain images, the adaptation performance can be further improved from 41.6 to 42.8%.

In the second setting, we further use intermediate domain images to improve the feature-level domain adaptation model. We conduct experiments based on the AdaptSegNet method (Tsai et al. 2018), which has proven effective for the cross-domain semantic segmentation task GTA5 \(\rightarrow \) Cityscapes. It consists of multiple levels of adversarial training, and we augment each level with the loss weight discussed in Sect. 3.5. The results are reported in Table 2a. The “Original” method denotes the AdaptSegNet model that is trained using GTA5 as the source domain, for which the results are obtained using their released pretrained model. The “DLOW” method is AdaptSegNet trained using the dataset translated with our DLOW model. From the first column, we observe that the intermediate domain images are able to improve the AdaptSegNet model by 2.4% points, from 42.4 to 44.8%. Table 2b also reports the adaptation performance of our DLOW model combined with the AdaptSegNet method under SYNTHIA \(\rightarrow \) Cityscapes. The \(\hbox {Original}^*\) in Table 2b denotes our retrained multi-level AdaptSegNet model (Tsai et al. 2018). Compared with the retrained AdaptSegNet model, our DLOW model improves the adaptation performance from 45.7 to 47.1%. The qualitative semantic segmentation results of the AdaptSegNet model trained with our DLOW translated GTA5 images are shown in Fig. 9.

To validate the effectiveness of our proposed adaptive weight \(\sqrt{1-z_i}\) for the adversarial loss, we conduct an ablation experiment without this adaptive weight under GTA5 \(\rightarrow \) Cityscapes. The adaptation performance without the adaptive weight is 43.7%, which is lower than the 44.8% obtained with our proposed weight \(\sqrt{1-z_i}\). We also conduct an experiment with the linearly decreasing weight \(1-z_i\), which reaches 43.8%, similar to the 43.7% obtained without any adaptive weight. This is because the weight \(1-z_i\) decreases too fast as the domainness \(z_i\) increases, and overly suppresses the adversarial learning for translated images with low and middle domainness. In contrast, our adopted weight \(\sqrt{1-z_i}\) slows down the weight decay, especially for low and middle domainness \(z_i\). It therefore aligns the domain distributions more strongly when the translated images are still far from the target domain, i.e., at low and middle domainness.
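A minimal sketch of the weighting schemes compared in this ablation is given below; it assumes a patch discriminator producing per-pixel logits and a per-sample domainness z, and is only meant to illustrate the weighting, not to reproduce the exact loss of Sect. 3.5.

```python
import torch
import torch.nn.functional as F

def weighted_adv_loss(disc_logits, z, scheme="sqrt"):
    """Generator-side adversarial loss on translated samples, scaled by domainness.

    disc_logits: (B, 1, H, W) discriminator outputs on the segmentation features.
    z:           (B,) domainness of each translated sample, in [0, 1].
    scheme="sqrt"   -> weight sqrt(1 - z)  (decays slowly at low/middle z; adopted)
    scheme="linear" -> weight (1 - z)      (decays quickly; ablation variant)
    scheme="none"   -> constant weight 1   (no adaptive weighting)
    """
    target = torch.ones_like(disc_logits)     # try to fool D into predicting "target"
    loss = F.binary_cross_entropy_with_logits(disc_logits, target, reduction="none")
    if scheme == "sqrt":
        w = torch.sqrt(1.0 - z)
    elif scheme == "linear":
        w = 1.0 - z
    else:
        w = torch.ones_like(z)
    return (w.view(-1, 1, 1, 1) * loss).mean()
```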

Moreover, to further compare our domain adaptation performance with other state-of-the-art methods, we replace AdaptSegNet with the more recent domain adaptation model ADVENT (Vu et al. 2019). As shown in Table 3, by training ADVENT on our DLOW-translated images, the domain adaptation performance is further improved from 44.8 to 47.0% under GTA5 \(\rightarrow \) Cityscapes.

More interestingly, under GTA5 \(\rightarrow \) Cityscapes, the AdaptSegNet model trained with DLOW-translated images also exhibits excellent domain generalization ability when applied to unseen domains, achieving significantly better results than the original AdaptSegNet model on the KITTI, WildDash and BDD100K datasets, as reported in the second to fourth columns of Table 2a, respectively. Table 2b reports the corresponding domain generalization performance on unseen domains under SYNTHIA \(\rightarrow \) Cityscapes. These results show that intermediate domain images are useful for improving the model’s cross-domain generalization ability.

4.2 Style Generalization

We conduct the style generalization experiment on the Photo to Artworks dataset (Zhu et al. 2017), which consists of real photographs (6853 images) and artworks from Monet (1074 images), Cezanne (584 images), Van Gogh (401 images) and Ukiyo-e (1433 images). We use the real photographs as the source domain and the artworks as four target domains. As discussed in Sect. 3.6, the domainness variable in this experiment is expanded to a 4-dim vector \([z_1, z_2, z_3, z_4]'\) satisfying \(\sum _{i=1}^{4}z_i=1\), where \(z_{1}, z_{2}, z_{3}\) and \(z_{4}\) correspond to Monet, Van Gogh, Ukiyo-e and Cezanne, respectively. Each element \(z_i\) indicates how much the corresponding style contributes to the final mixed style. In every 5 steps of training, we set the domainness vector to [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1] and a uniformly sampled random vector, respectively. The qualitative results of style generalization are shown in Fig. 10. They show that our DLOW model can translate a photo into artworks of the corresponding styles. When varying the values of the domainness vector, we can also produce new styles blending different painting styles, demonstrating the good generalization ability of our model to unseen domains. Note that, different from Zhang et al. (2018), Huang and Belongie (2017), we do not need any reference image in the test phase, and the domainness vector can be changed instantly to generate images of different new styles.
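A minimal sketch of this sampling schedule is shown below; the cycle length of 5 follows the text, while the variable names and the simplex-sampling choice (normalizing uniform samples) are illustrative assumptions of ours.

```python
import numpy as np

STYLES = ["Monet", "Van Gogh", "Ukiyo-e", "Cezanne"]   # order of z_1 .. z_4

def sample_domainness(step):
    """Return a 4-dim domainness vector summing to 1.

    Steps 0-3 of every 5-step cycle use the pure one-hot styles;
    the 5th step uses a random mixture on the probability simplex.
    """
    phase = step % 5
    if phase < 4:
        z = np.zeros(4)
        z[phase] = 1.0
    else:
        z = np.random.uniform(size=4)
        z = z / z.sum()          # normalize so the style weights sum to 1
    return z

# At test time the vector can be set freely, e.g. a Van Gogh / Ukiyo-e blend:
z_test = np.array([0.0, 0.5, 0.5, 0.0])
```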

Fig. 10
figure 10

Examples of style generalization. Results with red rectangles are images translated into the four target domains, and those with green rectangles in between are images translated into intermediate domains. The results show that our DLOW model generalizes well across styles and produces new image styles smoothly

4.2.1 Quantitative Results

To verify the effectiveness of our model for style generalization, we conduct a user study on Amazon Mechanical Turk (AMT) to compare with the existing methods FadNet (Lample et al. 2017) and MUNIT (Huang et al. 2018). Two cases are considered: style transfer to Van Gogh, and style generalization to a mixture of Van Gogh and Ukiyo-e. For FadNet, domain labels are treated as attributes. For MUNIT, we mix Van Gogh and Ukiyo-e as the target domain. The data for each trial are gathered from 10 participants, and there are 100 trials in total for each case. In the first case, participants are shown an example Van Gogh-style painting and are asked to choose the image whose style is more similar to the example. In the second case, participants are shown example Van Gogh and Ukiyo-e style paintings and are asked to choose the image whose style looks more like a mixture of the two example paintings. The user preferences are summarized in Table 4, which shows that DLOW outperforms FadNet and MUNIT on both tasks.

4.2.2 Qualitative Results

We further provide a qualitative comparison in Fig. 11. It can be observed that FadNet fails to translate the photo into a painting, while MUNIT and our DLOW model both produce reasonable results. For the Van Gogh style transfer result shown in Fig. 11, our DLOW model learns not only the typical colors of the paintings but also details such as the brushwork and lines, whereas MUNIT only learns the main colors. For the Van Gogh and Ukiyo-e style generalization results shown in Fig. 11, our DLOW model combines the colors and strokes of the two styles, whereas MUNIT merely changes the main colors from one style to the other. The qualitative comparison thus also demonstrates that our DLOW model performs better than FadNet and MUNIT on both the style transfer and the style generalization tasks.

Table 4 User preference for style transfer and generalization
Fig. 11
figure 11

Comparison of our model with existing methods on style transfer and style generalization. The left part (a) shows the given input photo and example images of the target styles. The translated results of FadNet (Lample et al. 2017), MUNIT (Huang et al. 2018) and our DLOW are shown in the right parts (b), (c) and (d), respectively. The first row of the right part is the Van Gogh style transfer result, while the second row is the style generalization result aiming at mixing the Van Gogh and Ukiyo-e styles

4.3 Semantic Foggy Scene Understanding

4.3.1 Experiments Setup

Following the setting of Sakaridis et al. (2018a), the Cityscapes (Cordts et al. 2016) dataset is adopted as the real clear-weather data for generating simulated foggy images. The Foggy Zurich (FZ) (Sakaridis et al. 2018a; Dai et al. 2019), Foggy Dense Driving (FDD) (Sakaridis et al. 2018a; Dai et al. 2019) and Foggy Driving (FD) (Sakaridis et al. 2018b) datasets are used as the real foggy images for evaluation. As stated in Sect. 3.7, the Virtual KITTI dataset (Gaidon et al. 2016) is used to train the DLOW model, where the synthetic clear-weather images are used as the source domain and the synthetic foggy images as the target domain.

Virtual KITTI is a dataset of 21,260 synthetic photo-realistic images imitating the scenes of the KITTI dataset (Geiger et al. 2012). The images are densely labeled and rendered under different lighting and weather conditions. The part relevant to our experiment is the paired groups of clear-weather and foggy-weather images, with 2126 images in each group.

Foggy Zurich is a dataset of real foggy images taken in the city of Zurich, 40 of which contain dense fog and have coarse annotations; these serve as the benchmark for measuring segmentation performance under real foggy weather. The image size is \(1920\times 1080\) pixels. All images except these 40 dense-fog images lack ground-truth annotations.

Fig. 12
figure 12

Examples of the comparison between our DLOW-based simulated foggy images and the Foggy Cityscapes images. The Foggy Cityscapes images in (c) are strongly affected by the lighting condition of the original Cityscapes images in (b), so the generated fog can contain very bright regions and fails to imitate the weak-light condition of the real foggy images in (a). In contrast, our DLOW-based images in (d) do not rely on the lighting of the original Cityscapes images and imitate the overall style of real foggy images

Foggy Driving is a dataset of 101 densely labeled real foggy images, whose semantic class annotations are compatible with those of Cityscapes. Foggy Dense Driving is a 21-image subset of Foggy Driving with dense fog.

This experiment includes two stages. In the first stage, we generate the “Fog Flow” for the Cityscapes dataset. As mentioned in Sect. 3.7, real foggy images are usually difficult to acquire, so we use synthetic images generated in a virtual environment to train our DLOW model. In particular, the Virtual KITTI dataset is used, where the synthetic clear-weather images form the source domain and the synthetic foggy images form the target domain. Since there is only a single target domain, the network structure follows the one described in Sect. 3.3. At inference time, the clear-weather Cityscapes images are fed into the DLOW model to obtain simulated foggy Cityscapes images. Note that we vary the domainness variable \(z = 1.0, 1.1, 1.2, 1.3\) and 1.4, and obtain one copy of the images for each z, denoted as “DLOW (\(z=1.0\))”, “DLOW (\(z=1.1\))”, “DLOW (\(z=1.2\))”, “DLOW (\(z=1.3\))” and “DLOW (\(z=1.4\))”, respectively. These images form the “Fog Flow” and will be used in curriculum learning for training a robust model for foggy scene understanding.
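A compact sketch of this inference-time sweep is shown below; as before, the generator interface G(image, z), a batch size of 1, and the output layout are illustrative assumptions.

```python
import os
import torch
from torchvision.utils import save_image

@torch.no_grad()
def generate_fog_flow(G, cityscapes_loader, out_root,
                      zs=(1.0, 1.1, 1.2, 1.3, 1.4), device="cuda"):
    """Create one simulated foggy copy of Cityscapes per domainness value z."""
    G.eval().to(device)
    for z_val in zs:
        out_dir = os.path.join(out_root, f"dlow_z{z_val:.1f}")
        os.makedirs(out_dir, exist_ok=True)
        for idx, (image, _) in enumerate(cityscapes_loader):  # assumes batch size 1
            z = torch.full((image.size(0), 1), z_val, device=device)
            foggy = G(image.to(device), z)    # fog becomes denser as z grows past 1.0
            save_image(foggy, os.path.join(out_dir, f"{idx:06d}.png"))
```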

Fig. 13
figure 13

Examples of the generated “Fog Flow”, which covers simulated foggy images with different fog densities obtained by taking different domainness values \(z=1.0, 1.1, 1.2, 1.3, 1.4\). Even though the model never sees domainness values \(z>1.0\) during training, the fog in the generated images still becomes denser as the domainness increases from 1.0 to 1.4, which demonstrates not only the controllability but also the generalization ability of the domainness variable

In the second stage, we use the foggy scene images to train the semantic segmentation model and evaluate it on real-world foggy scene images. In particular, two training paradigms are considered. In the first one, we follow Sakaridis et al. (2018b) and train a semantic segmentation model directly using the generated foggy images (\(z=1\)). In the second one, we follow Sakaridis et al. (2018a) and train the model with the curriculum learning strategy, in which the model is progressively fine-tuned using images with gradually increasing fog densities.

4.3.2 Qualitative Results

We first compare the simulated foggy images generated by DLOW with the physics-based simulated Foggy Cityscapes images (Sakaridis et al. 2018b) in Fig. 12. The physics-based simulated “Foggy Cityscapes” images (c.f., Fig. 12c) are often affected by the lighting condition of the original Cityscapes images. Specifically, when the images are captured under strong lighting, the physics-based simulated foggy Cityscapes images retain the regions with strong light, which looks quite unrealistic compared with real foggy images, which usually exhibit smooth and weak lighting (c.f., Fig. 12a). This is because the physical model adopted by “Foggy Cityscapes” introduces atmospheric light into the fog simulation, and the lighting condition of the generated foggy image is positively correlated with it. Since the atmospheric light is always estimated from the “sky” part of the original clear-weather image and directly scales the generated fog, the simulated foggy image also has strong light whenever the original clear-weather image does. In contrast, our DLOW-based simulation method performs pixel-level image translation, does not strictly rely on the lighting condition of the original Cityscapes image, and can imitate the style of real foggy images, as shown in Fig. 12d. Moreover, the change of fog density in the “Fog Flow” obtained by varying the domainness variable z from 1.0 to 1.4 is shown in Fig. 13. The fog density of the simulated foggy image can thus be controlled flexibly by simply varying the domainness variable during the inference phase.
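For reference, physics-based fog simulation typically follows the standard optical model (our paraphrase in our own notation; the exact formulation of Sakaridis et al. (2018b) may differ in details):
\[
I(x) \;=\; R(x)\,t(x) \;+\; L\,\bigl(1 - t(x)\bigr), \qquad t(x) \;=\; \exp\bigl(-\beta\, d(x)\bigr),
\]
where \(I(x)\) is the simulated foggy image, \(R(x)\) the clear-weather image, \(L\) the atmospheric light estimated from the sky region, \(\beta \) the attenuation coefficient, and \(d(x)\) the scene depth. Since the fog term \(L\,(1 - t(x))\) dominates at large distances and is scaled directly by \(L\), a bright atmospheric-light estimate produces a bright fog layer, which is consistent with the strong-light artifacts discussed above; the model also requires a dense depth map \(d(x)\), linking it to the depth-estimation errors discussed in Sect. 4.3.3.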

Table 5 Results of semantic segmentation on the Foggy Zurich dataset based on RefineNet model with ResNet-101 backbone using different simulated foggy images

4.3.3 Quantitative Results

We first evaluate our DLOW model following the paradigm of Sakaridis et al. (2018b). We use the foggy images generated by our DLOW model as training data for a semantic segmentation model and evaluate the model on real-world foggy scene images. Following the setting of Sakaridis et al. (2018b), RefineNet (Lin et al. 2017) with a ResNet-101 backbone is adopted as the semantic segmentation model. We also include the physics-based simulated images proposed in Sakaridis et al. (2018b) as a baseline.

The semantic segmentation results of all methods on the Foggy Zurich, Foggy Driving and Foggy Dense Driving datasets are reported in Tables 5, 6 and 7, respectively. From the results, we observe that all methods based on simulated foggy images (i.e., CycleGAN, Foggy Cityscapes, and DLOW) achieve better semantic segmentation performance on real foggy images than the original Cityscapes-pretrained model, which confirms the effectiveness of simulated foggy images for semantic foggy scene understanding. Moreover, the foggy images generated by the learning-based simulation methods “CycleGAN” and “DLOW” both outperform the “Foggy Cityscapes” images generated by the physics-based simulation method; e.g., comparing “CycleGAN” with “Foggy Cityscapes” gives 40.5% versus 36.1%, 47.7% versus 47.0% and 33.0% versus 32.4% on the Foggy Zurich, Foggy Driving and Foggy Dense Driving datasets, respectively. As mentioned in Sect. 3.7, this might be because the physics-based simulation method relies on a dense depth map to generate the fog, which may contain errors as it is often estimated from sparse points. Moreover, the physics-based simulation is a simple linear transformation and cannot convincingly transform the strong light of clear weather into the realistically weak light of foggy weather.

Regarding our DLOW model, we observe that with domainness variable \(z=1.0\) it performs similarly to but slightly better than CycleGAN: 41.0% versus 40.5%, 48.7% versus 47.7% and 33.6% versus 33.0% on the Foggy Zurich, Foggy Driving and Foggy Dense Driving datasets, respectively. Thanks to the controllability and flexibility of the domainness variable in our DLOW model, we can further generate the “Fog Flow” by varying the domainness variable as described in Sects. 3.7 and 4.3.1, which covers simulated foggy images with different fog densities. By choosing the images with appropriate fog density to fine-tune the Cityscapes-pretrained model, we reach better performance than setting the domainness variable to \(z=1.0\), i.e., 42.5% versus 41.0%, 50.7% versus 48.7% and 36.0% versus 33.6% on the Foggy Zurich, Foggy Driving and Foggy Dense Driving datasets, respectively.

4.3.4 Curriculum Learning

Second, we evaluate our DLOW model for foggy scene understanding under the curriculum learning scenario. We follow Sakaridis et al. (2018a) and train the model with a curriculum learning strategy, in which the model is fine-tuned on images with gradually increasing fog densities. Similarly to Sakaridis et al. (2018a), we adopt a two-step curriculum fine-tuning strategy, where the RefineNet model pretrained on Cityscapes is first fine-tuned on the “DLOW (\(z=1.1\))” dataset and then further fine-tuned on the “DLOW (\(z=1.2\))” dataset; we refer to this approach as “Fog Flow”.
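A compact sketch of this two-step schedule is given below; the optimizer, learning rate, number of epochs and data-loader names are placeholder assumptions, and only the order of the fine-tuning stages is taken from the text.

```python
import torch

def finetune(model, loader, epochs=1, lr=1e-5, device="cuda"):
    """One fine-tuning stage on a single simulated-fog dataset.

    Assumes model(image) returns per-pixel class logits of shape (B, C, H, W)
    and labels are (B, H, W) with 255 marking ignored pixels.
    """
    model.train().to(device)
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = torch.nn.CrossEntropyLoss(ignore_index=255)
    for _ in range(epochs):
        for image, label in loader:
            opt.zero_grad()
            loss = criterion(model(image.to(device)), label.to(device))
            loss.backward()
            opt.step()
    return model

# Curriculum: Cityscapes-pretrained RefineNet -> light fog (z=1.1) -> denser fog (z=1.2)
# model = finetune(model, dlow_z11_loader)
# model = finetune(model, dlow_z12_loader)
```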

Table 6 Results of semantic segmentation on the Foggy Driving dataset based on the RefineNet model with ResNet-101 backbone using different simulated foggy images
Table 7 Results of semantic segmentation on the Foggy Dense Driving dataset based on the RefineNet model with ResNet-101 backbone using different simulated foggy images

The segmentation results are shown in Table 8 and Fig. 14. By incorporating the curriculum training strategy, our “Fog Flow” approach further improves the semantic segmentation performance compared with purely fine-tuning the pretrained model on simulated foggy images of a single fog density. We also compare our “Fog Flow” with the state-of-the-art method “CMAda-7” (Sakaridis et al. 2018b); our “Fog Flow” outperforms “CMAda-7” on all three real-fog datasets, with 43.2% versus 41.4%, 36.1% versus 34.3%, and 51.4% versus 48.6%. “Fog Flow” adopts the same curriculum training strategy as “CMAda-7”; the difference is that “Fog Flow” fine-tunes on our generated fog flow, whereas “CMAda-7” fine-tunes on “Foggy Cityscapes”, which further demonstrates the effectiveness of our generated fog flow for semantic foggy scene understanding compared with “Foggy Cityscapes”.

Table 8 Comparison of our “Fog Flow” with other state-of-the-art methods for semantic foggy scene understanding

5 Conclusion

In this paper, we have presented the DLOW model, which generates intermediate domains for bridging different domains. The model takes a domainness variable z (or domainness vector \({\mathbf {z}}\)) as a conditional input and transfers images into the intermediate domain controlled by z or \({\mathbf {z}}\). We demonstrate the benefits of our DLOW model in three scenarios. First, for the cross-domain semantic segmentation task, our DLOW model improves the performance of pixel-level domain adaptation by using the translated images in intermediate domains as training data. Second, our DLOW model exhibits excellent style generalization ability for image translation and is able to transfer images into new styles that are unseen in the training data. Third, our DLOW model generates the “Fog Flow” when applied to the semantic foggy scene understanding task, covering simulated foggy images with different fog densities. Moreover, the generated “Fog Flow” naturally fits the curriculum learning paradigm and can be combined with it to further improve foggy-image semantic segmentation performance. Extensive experiments on benchmark datasets have verified the effectiveness of our proposed model.

Fig. 14
figure 14

Testing performance of the semantic segmentation model when fine-tuned with our “Fog Flow” under different fog densities (1.0, 1.1, 1.2, 1.3, 1.4), and comparison between “Fog Flow” and “Foggy Cityscapes”. Taking the Foggy Zurich dataset as an example, the blue solid line (“Fog Flow w/o CL”) shows that the segmentation performance gradually improves as the fog density is varied from 1.0 to 1.4. The red dashed line represents the performance of “Fog Flow”, which combines curriculum learning with our generated fog flow; it outperforms every model trained on a single fog density in “Fog Flow w/o CL”. The green dashed line represents the performance of “Foggy Cityscapes”, which is much lower than that of our “Fog Flow” and further demonstrates the effectiveness of our fog flow for semantic foggy scene understanding