1 Introduction

In real-world scenarios, generic vision tasks such as image recognition, object detection, and image translation face severe challenges from variations in viewpoint, background, object appearance, illumination, occlusion, scene change, etc. These unavoidable factors make these tasks under domain shift a challenging and newly rising research topic in recent years. Domain change is also a widely recognized, intractable problem that urgently needs a breakthrough in real-world applications such as video surveillance and autonomous driving. Consequently, a large-scale cross-domain benchmark is urgently needed to push this field forward.

The recently emerged large-scale cross-domain image datasets such as VisDA (Peng et al. 2017), Office-Home (Venkateswara et al. 2017), Syn2real (Peng et al. 2018) and DomainNet (Peng et al. 2019) mainly focus on the traditional classification or detection tasks, so they cannot be flexibly applied to newly raised tasks such as image-to-image translation, especially instance-level translation. The motivation of this work is to build a dataset with instance-level annotations (every instance has a bounding box coordinate and a semantic label) under large, unrestricted and real-world scenarios across different domains, in order to solve instance-level image-to-image translation and further extend to domain adaptive object detection.

Instance-level image-to-image translation Image-to-Image (I2I) translation has become increasingly important in computer vision, since many vision and graphics problems can be formulated as I2I translation, such as super-resolution, neural style transfer, colorization, etc. This technique has also been adapted to relevant fields such as medical image processing (Zhang et al. 2018) to further improve medical volume segmentation. In general, Pix2pix (Isola et al. 2017) is regarded as the first unified framework for I2I translation, which adopts conditional generative adversarial networks (Mirza and Osindero 2014) for image generation, but it requires paired examples during training. A more general and challenging setting is unpaired I2I translation, where paired data is unavailable.

Fig. 1 Illustration of the motivation of our method. (1) MUNIT (Huang et al. 2018)/DRIT (Lee et al. 2018) methods; (2) their limitation; and (3) our solution for instance-level translation. More details are given in the text

Several recent efforts (Zhu et al. 2017; Liu et al. 2017; Huang et al. 2018; Lee et al. 2018; Almahairi et al. 2018) have been made in this direction and have achieved very promising results. For instance, CycleGAN (Zhu et al. 2017) proposed the cycle consistency loss, which enforces that if an image is translated to the target domain by one mapping and translated back with the inverse mapping, the output should be the original image. CycleGAN further assumes that the latent spaces of the two mappings are separate. In contrast, UNIT (Liu et al. 2017) assumes that images from the two domains can be mapped onto a shared latent space. MUNIT (Huang et al. 2018) and DRIT (Lee et al. 2018) further postulate that the latent spaces can be disentangled into a shared content space and a domain-specific attribute space.

However, all of these methods thus far have focused on migrating styles or attributes onto entire images. As shown in Fig. 1 (1), they work well on unified-style scenes or relatively content-simple scenarios, owing to the consistent pattern across spatial areas in an image. This does not hold for images with complex structure and multiple objects, since the stylistic disparity between objects and background in an image is usually large or even completely different, as in Fig. 1 (2).

To address this limitation, in this work we present a method that translates objects and background/global areas separately with different style codes, as in Fig. 1 (3), while still training in an end-to-end manner. The motivation of our method is illustrated in Fig. 3. Instead of using the global style, we use instance-level style vectors that provide more accurate guidance for generating visually related objects in the target domain. We argue that styles should be diverse for different objects, backgrounds and global images, meaning that the style codes should not be identical across the entire image. More specifically, a car translated from the “sunny” to the “night” domain should have different style codes than the global image translation between these two domains. Our method achieves this goal by introducing instance-level styles. Given a pair of unaligned images and object locations, we first apply our encoders to obtain the intermediate global and instance-level content and style vectors separately. Then we utilize the cross-domain mapping to obtain the target domain images by swapping the style/attribute vectors. Our swapping strategy is introduced in more detail in Sect. 4. The main advantage of our method is the exploration and usage of object-level styles, which directly affect and guide the generation of target domain objects. We can also apply the global style to target objects to encourage the model to learn more diverse results.

Fig. 2 Illustration of domain-shift object detection in autonomous driving scenario. Images are from our CDTD dataset (Shen et al. 2019)

Domain adaptive object detection As illustrated in Fig. 2, unsupervised domain adaptive object detection aims to learn a robust detector under domain shift, where the training (source) domain is label-rich with bounding box annotations, while the testing (target) domain is label-agnostic and the feature distributions between training and testing domains are dissimilar or even totally different. Previous solutions to this problem usually design distribution alignments on global- and local-level images using an adversarial loss. Such alignments generally require additional components or sub-networks, which are cumbersome and poorly interpretable. In this work, we propose a simple training technique called gradient detach that prevents the flow of gradients from the context sub-network through the detection backbone path, so that the model can learn more discriminative representations between object and global/context images and focus more on the target areas. Combined with compatible stacked complementary losses that insert several auxiliary objectives at different network stages, our method can automatically and effectively align the distributions of source and target domains. We conduct experiments on the proposed dataset with two baseline methods, DA (Chen et al. 2018) and strong-weak alignment (Saito et al. 2019), and our results are consistently better than both.

In summary, our contributions are fourfold:

  • We introduce a large-scale, multimodal, highly varied and high-resolution cross domain dataset, containing \(\sim \)155k streetscape images across four domains. Our dataset not only includes the domain category labels, but also provides the detailed object bounding box annotations, which will benefit the instance-level I2I translation and domain adaptive object detection problems.

  • We advance the I2I translation problem to the instance level, such that constraints can be exploited on both instance- and global-level attributes through the proposed compound loss.

  • We conduct extensive qualitative and quantitative experiments to demonstrate that our approach surpasses the baseline I2I translation methods. Our synthetic images can even be beneficial to other vision tasks such as generic object detection, and further improve their performance.

  • We propose a novel training strategy, gradient detach, for the domain adaptive object detection task, which suppresses gradients flowing back to the detection backbone. To the best of our knowledge, this may be the first work to show that gradient detach can help learn better context representations for domain adaptive object detection. In addition, we propose to use multiple complementary losses to help gradient detach training for better optimization.

A preliminary version (Shen et al. 2019) of this manuscript was published at CVPR 2019. Compared to the conference paper, our major new contributions are that we extend our dataset to the domain adaptive object detection task and propose a gradient detach based stacked complementary losses approach that boosts the previous state-of-the-art methods and achieves fairly competitive performance. We also conduct additional experiments and visualizations on the original instance-level image-to-image translation task. Moreover, we include more description of the dataset, the method for domain adaptive object detection and more baseline results.

The rest of this work is organized as follows. In Sect. 2, we review the related work of our study. In Sect. 3, we introduce the construction of the CDTD dataset and its statistics, and provide a feature-by-feature comparison with other related datasets. In Sect. 4, we introduce the proposed INIT method for instance-level image-to-image translation, which applies fine-grained local (instance) and global styles to the target image spatially when translating the source images. In Sect. 5, we introduce a gradient detach method for the domain adaptive object detection task, which prevents the flow of gradients from the context sub-network through the detection backbone path, so that the model can learn more discriminative representations between object and global/context images and focus more on the target areas. In Sect. 6, we provide extensive experiments and ablation studies on our collected dataset for the image-to-image translation task, as well as baselines and our results for the domain adaptive object detection task. Section 7 concludes this work.

2 Related Work

2.1 Cross Domain Datasets for Translation and Object Detection

A variety of datasets have been collected for the purpose of cross-domain study. In the image-to-image translation field, the most commonly used ones are edge \( \leftrightarrow \) shoes (Isola et al. 2017), Yosemite (summer \( \leftrightarrow \) winter) (Zhu et al. 2017) and Cityscapes (Cordts et al. 2016). As shown in Table 1, these datasets are either low-resolution (e.g., edge \( \leftrightarrow \) shoes) or limited in scale, i.e., the number of images is too small (e.g., Cityscapes). In contrast, our dataset contains sufficient images to explore the potential of the proposed algorithms. As shown in Table 2, the central weakness of current domain adaptive object detection datasets is also the scale in terms of the number of images. In general, our dataset is about 15\(\sim \)20\(\times \) larger than these existing ones, with higher-quality and higher-resolution images.

Image-to-Image Translation The goal of I2I translation is to learn the mapping between two different domains. Pix2pix (Isola et al. 2017) first proposed to use conditional generative adversarial networks (Mirza and Osindero 2014) to model the mapping function from input to output images. Inspired by Pix2pix, some works further adapt it to a variety of relevant tasks, such as semantic layouts \(\rightarrow \) scenes (Karacan et al. 2016) and sketches \(\rightarrow \) photographs (Sangkloy et al. 2017). Despite their popularity, the major weaknesses of these methods are that they require paired training examples and the outputs are single-modal. To produce multimodal and more diverse images, BicycleGAN (Zhu et al. 2017) encourages a bijective consistency between the latent and target spaces to avoid the mode collapse problem: a generator learns to map the given source image, combined with a low-dimensional latent code, to the output during training. However, this method still needs paired training data.

Recently, CycleGAN (Zhu et al. 2017) was proposed to tackle the unpaired I2I translation problem by using the cycle consistency loss. UNIT (Liu et al. 2017) further makes a shared-latent-space assumption and adopts Coupled GAN in its method. To address the multimodal problem, MUNIT (Huang et al. 2018), DRIT (Lee et al. 2018), Augmented CycleGAN (Almahairi et al. 2018), etc. adopt disentangled representations to further learn diverse I2I translation from unpaired training data.

Fig. 3 A natural image example of our I2I translation

Fig. 4 Image samples from our benchmark grouped by their domain categories (sunny, night, cloudy and rainy). In each group, left are original images and right are images with corresponding bounding box annotations

Table 1 Feature-by-feature comparison of popular I2I translation datasets
Table 2 Comparison of popular domain adaptive object detection datasets
Table 3 Statistics (# images) of the entire dataset across four domains: sunny, night, rainy and cloudy

Instance-level Image-to-Image Translation To the best of our knowledge, there are so far very few efforts on the instance-level I2I translation problem. Perhaps the closest to our work is the recently proposed InstaGAN (Mo et al. 2019), which utilizes object segmentation masks to translate both an image and the corresponding set of instance attributes while maintaining the permutation invariance property of the instances. A context preserving loss is designed to encourage the model to learn the identity function outside the target instances. The main difference from ours is that InstaGAN cannot sufficiently translate an entire image across domains: it focuses on translating instances while maintaining the outside areas, whereas our method translates instances and outside areas simultaneously and makes the global images more realistic. Furthermore, InstaGAN is built on CycleGAN (Zhu et al. 2017), which is single-modal, while we build our INIT on MUNIT (Huang et al. 2018) and DRIT (Lee et al. 2018); our method thus inherits their multimodal and unsupervised properties while producing more diverse and higher-quality images.

Some other existing works (Ma et al. 2018; Li et al. 2018) are more or less related to this paper. For instance, DA-GAN (Ma et al. 2018) learns a deep attention encoder to enable instance-level translation, but it is unable to handle multi-instance and complex circumstances. BeautyGAN (Li et al. 2018) focuses on facial makeup transfer by employing a histogram loss with a face parsing mask. Mechrez et al. (2018) proposed a contextual loss based on the images' context and semantics, which compares regions with similar semantic information while considering the context of the entire image.

Domain Adaptive Object Detection. Unsupervised domain adaptation for recognition has been widely studied in a large body of previous literature (Ganin et al. 2016; Long et al. 2016; Tzeng et al. 2017; Panareda Busto and Gall 2017; Hoffman et al. 2018; Murez et al. 2018; Zhao et al. 2019; Wu et al. 2019), and our method more or less draws merits from it, such as aligning source and target distributions with adversarial learning (domain-invariant alignment). However, object detection is a technically different problem from classification, since we focus more on the objects of interest (regions).

Common approaches for tackling domain-shift object detection mainly follow two directions: (i) training a supervised model and then fine-tuning on the target domain; or (ii) unsupervised cross-domain representation learning. The former requires additional instance-level annotations on target data, which is laborious, expensive and time-consuming, so most approaches focus on the latter, which still poses challenges. The first challenge is that the representations of source and target domain data should be embedded into a common space for matching objects, such as the hidden feature space (Saito et al. 2019; Chen et al. 2018), the input space (Tzeng et al. 2018; Cai et al. 2019) or both (Kim et al. 2019). The second is that a feature alignment or matching mechanism across source/target domains must be further defined, such as subspace alignment (Raj et al. 2015), \({\mathcal {H}}\)-divergence and adversarial learning (Chen et al. 2018), MRL (Kim et al. 2019), Strong-Weak alignment (Saito et al. 2019), universal alignment (Wang et al. 2019), etc. Our proposed method targets these two challenges, and it is also a learning-based alignment method across domains in an end-to-end framework.

3 CDTD: A Cross-Domain Dataset with Instance Bounding-box Annotations

We introduce a large-scale street-scene-centric dataset CDTDFootnote 1 that addresses three core research problems in I2I translation: (1) the unsupervised learning paradigm, meaning that there is no specific one-to-one mapping in the data; (2) multi-domain incorporation. Most existing I2I translation datasets provide only two different domains, which limits the potential to explore more challenging settings such as multi-domain incorporation; our dataset contains four domains: sunny, night, cloudy and rainyFootnote 2 in a unified street scene; and (3) multi-granularity (global and instance-level) information. Our dataset provides instance-level bounding box annotations, which offer finer details for learning a translation model. Table 1 shows a feature-by-feature comparison among various I2I translation datasets. We also visualize some examples of the dataset in Fig. 4. For instance categories, we annotate three common objects in street scenes: car, person and traffic sign (speed-limit sign). Since our dataset covers multiple domains with shared categories, it is also suitable for the domain adaptive object detection task.

3.1 Dataset Summary

The CDTD dataset consists of 155,529 images, of which 132,201 are for training and 23,328 for testing. The dataset contains four relevant but visually different domains: sunny, night, cloudy and rainy. The detailed statistics (#images) of the entire dataset are shown in Table 3. All images were collected in Tokyo, Japan with a SEKONIX AR0231 camera, and the whole collection process lasted about 3 months.

4 Instance-aware Image-to-Image Translation

Unpaired image-to-image translation aims to learn a mapping between unaligned image pairs in different domains. Recent advances in this field such as MUNIT (Huang et al. 2018) and DRIT (Lee et al. 2018) mainly focus on first disentangling content and style/attribute from a given image, then directly adopting the global style to guide the model to synthesize new-domain images. However, this kind of approach breaks down when the target domain images are content-rich with multiple discrepant objects. In this paper, we present a simple yet effective instance-aware image-to-image translation approach (INIT), which applies fine-grained local (instance) and global styles to the target image spatially. The proposed INIT exhibits three important advantages: (1) the instance-level objective loss helps learn a more accurate reconstruction and incorporate diverse attributes of objects; (2) the styles used for the local/global areas of the target domain come from the corresponding spatial regions in the source domain, which is intuitively a more reasonable mapping; (3) the joint training process benefits both fine and coarse granularity and incorporates instance information to improve the quality of global translation. We observe that our synthetic images can even benefit real-world vision tasks such as generic object detection.

More precisely, our goal is to realize instance-aware I2I translation between two different domains without paired training examples. We build our framework by leveraging the MUNIT (Huang et al. 2018) and DRIT (Lee et al. 2018) methods; to avoid repetition, we omit some innocuous details. Similar to MUNIT and DRIT, our method is straightforward and simple to implement. As illustrated in Fig. 6, our translation model consists of two encoders \(E_g, E_o\) (g and o denote the global and instance image regions respectively) and two decoders \(G_g,G_o\) in each domain \({\mathcal {X}}\) or \({\mathcal {Y}}\). Since we have the object coordinates, we can crop the object areas and feed them into the instance-level encoder to extract the content/style vectors. An alternative method for obtaining object content vectors is to adopt RoI pooling (Girshick 2015) on the global image content features. Here we use image crops (object regions) and share the parameters between the two encoders, which is easier to implement.
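
As a concrete illustration of the instance branch, the following is a minimal PyTorch sketch of cropping object regions with their bounding boxes and passing them through parameter-shared content/style encoders. The encoder architecture, the helper names (ContentStyleEncoder, crop_objects) and the tensor sizes are simplifications and assumptions for exposition, not the exact INIT implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContentStyleEncoder(nn.Module):
    """Toy encoder that splits an input into a content map and a style vector.

    A simplified stand-in for the MUNIT-style encoders E^c and E^s; the real
    networks are deeper and use instance/adaptive normalization.
    """

    def __init__(self, in_ch=3, content_ch=64, style_dim=8):
        super().__init__()
        self.content = nn.Sequential(
            nn.Conv2d(in_ch, content_ch, 7, stride=1, padding=3),
            nn.ReLU(inplace=True),
            nn.Conv2d(content_ch, content_ch, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        self.style = nn.Sequential(
            nn.Conv2d(in_ch, 64, 7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(64, style_dim, 1),
        )

    def forward(self, x):
        c = self.content(x)           # spatial content code
        s = self.style(x).flatten(1)  # global style vector
        return c, s


def crop_objects(image, boxes, out_size=120):
    """Crop object regions given [x1, y1, x2, y2] boxes (pixels) and resize them."""
    crops = []
    for x1, y1, x2, y2 in boxes.round().long().tolist():
        patch = image[:, y1:y2, x1:x2].unsqueeze(0)
        crops.append(F.interpolate(patch, size=(out_size, out_size),
                                   mode="bilinear", align_corners=False))
    return torch.cat(crops, dim=0)


# Parameters are shared: the same encoder processes the global image and the crops.
encoder = ContentStyleEncoder()
image = torch.rand(3, 360, 640)
boxes = torch.tensor([[40., 80., 200., 240.], [300., 100., 420., 220.]])

c_g, s_g = encoder(image.unsqueeze(0))  # global content/style
objects = crop_objects(image, boxes)
c_o, s_o = encoder(objects)             # instance content/style
```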

Fig. 5 Our content-style pair association strategy. Only coarse styles can be applied to fine contents; the reverse processing flow is not allowed during training

Fig. 6 Overview of our instance-aware cross-domain I2I translation. The whole framework is based on the MUNIT method (Huang et al. 2018), which we further extend to realize instance-level translation. Note that after content-style association, the generated images lie in the target domain, so a translation-back process is employed before self-reconstruction, which is not illustrated here

Disentangle content and style on object and entire image. As in (Cheung et al. 2015; Mathieu et al. 2016; Huang et al. 2018; Lee et al. 2018), our method decomposes input images/objects into a shared content space and a domain-specific style space. Taking the global image as an example, each encoder \(E_g\) decomposes the input into a content code \(c_{g}\) and a style code \(s_{g}\), where \(E_{g}=(E_{g}^c,E_{g}^s)\), \(c_{g}=E_{g}^c(I)\), \(s_{g}=E_{g}^s(I)\), and I denotes the input image representation. \(c_g\) and \(s_g\) are the global-level content/style features.

Generate style code bank. We generate the style codes from objects, background and entire images, which form our style code bank for the subsequent swapping operation and translation. In contrast, MUNIT (Huang et al. 2018) and DRIT (Lee et al. 2018) use only the entire-image style or attribute, which struggles to model and cover the rich spatial representation of an image.

Associate content-style pairs for cyclic reconstruction. Our cross-cycle consistency is performed by swapping encoder-decoder pairs (dashed arc lines in Fig. 7). The cross-cycle includes two modes: cross-domain (\({\mathcal {X}} \leftrightarrow {\mathcal {Y}}\)) and cross-granularity (entire image \(\leftrightarrow \) object). We illustrate cross-granularity (image \(\leftrightarrow \) object) in Fig. 7; the cross-domain consistency (\({\mathcal {X}} \leftrightarrow {\mathcal {Y}}\)) is similar to MUNIT (Huang et al. 2018) and DRIT (Lee et al. 2018). As shown in Fig. 5, the swapping or content-style association strategy is a hierarchical structure across multi-granularity areas. Intuitively, the coarse (global) style can affect fine content and be adopted for local areas, while the reverse does not hold. Following (Huang et al. 2018), we also use AdaIN (Huang and Belongie 2017) to combine the content and style vectors, which can be formulated as:

$$\begin{aligned} {\text {AdaIN}}(c, s)=\sigma (s)\left( \frac{c-\mu (c)}{\sigma (c)}\right) +\mu (s) \end{aligned}$$
(1)

where c is the content input, s is a style input, and \(\mu (c)\), \(\sigma (c)\) are the channel-wise mean and standard deviation. AdaIN scales the normalized content input with \(\sigma (s)\) and shifts it with \(\mu (s)\).
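
For reference, Eq. (1) can be written in a few lines of PyTorch. The sketch below assumes both inputs are 4D feature maps whose channel-wise statistics are used directly; in MUNIT-style decoders the affine parameters are usually produced from the style code by an MLP instead:

```python
import torch


def adain(content, style, eps=1e-5):
    """Adaptive instance normalization (Eq. 1).

    content: (N, C, H, W) content features.
    style:   (N, C, H, W) style features whose channel-wise statistics
             provide the scale sigma(s) and shift mu(s).
    """
    # Channel-wise mean/std over spatial dimensions, kept for broadcasting.
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps

    normalized = (content - c_mean) / c_std
    return s_std * normalized + s_mean


# Example: transfer the statistics of a style feature map onto content features.
c = torch.randn(2, 64, 32, 32)
s = torch.randn(2, 64, 32, 32)
out = adain(c, s)  # same shape as c, with the style's channel statistics
```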

Incorporate Multi-Scale. It is technically easy to incorporate multi-scale information into the framework. We simply replace the object branch in Fig. 7 with resolution-reduced images. In our experiments, we use 1/2-scale and original-size images as pairs to perform scale-augmented training. Specifically, styles from the small-size and original-size images can be applied to each other, and the generator needs to learn multi-scale reconstruction for both of them, which leads to more accurate results.

Reconstruction loss. We use the self-reconstruction and cross-cycle consistency losses (Lee et al. 2018) for both the entire image and the objects to encourage their reconstruction. With the encoded c and s, the decoders should decode them back to the original inputs,

$$\begin{aligned} {{\hat{I}}} = {G_g}(E_g^c(I),\;E_g^s(I)),\;{{\hat{o}}} = {G_o}(E_o^c(o),\;E_o^s(o)) \end{aligned}$$
(2)

We can also reconstruct the latent distributions (i.e. content and style vectors) as in (Huang et al. 2018):

$$\begin{aligned} {{{\hat{c}}}}_o = E_o^c({G_o}({c_o},{s_g})),\;{{{{\hat{s}}}}_o} = E_o^s({G_o}({c_o},{s_g})) \end{aligned}$$
(3)

where \(c_o\) and \(s_g\) are the instance-level content and global-level style features. Then, we can use the following formulation to learn their reconstruction:

$$\begin{aligned} {\mathcal {L}}_{recon}^k = {{\mathbb {E}}_{k \sim p(k)}}\left[ {{{\left\| {{{\hat{k}}} - k} \right\| }_1}} \right] \end{aligned}$$
(4)

where k can be I, o, c or s, and p(k) denotes the distribution of data k. The formulation of the cross-cycle consistency is similar to this process; more details can be found in (Lee et al. 2018).
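
A hedged sketch of these L1 reconstruction terms (Eqs. (2)-(4)) is given below; the tensors are dummies standing in for the outputs of the encoders/decoders \(E_g, G_g, E_o, G_o\), and the weighting of the individual terms follows Eq. (5) rather than this snippet:

```python
import torch
import torch.nn.functional as F


def recon_loss(reconstruction, target):
    """L1 reconstruction loss of Eq. (4): E[|k_hat - k|_1], k in {I, o, c, s}."""
    return F.l1_loss(reconstruction, target)


# Dummy tensors standing in for the quantities in Eqs. (2)-(3); in the real
# model these come from the encoders/decoders (E_g, G_g, E_o, G_o).
I, I_hat = torch.rand(1, 3, 360, 360), torch.rand(1, 3, 360, 360)      # image
o, o_hat = torch.rand(2, 3, 120, 120), torch.rand(2, 3, 120, 120)      # objects
c_o, c_o_hat = torch.randn(2, 64, 30, 30), torch.randn(2, 64, 30, 30)  # content
s_o, s_o_hat = torch.randn(2, 8), torch.randn(2, 8)                    # style

total = (recon_loss(I_hat, I) + recon_loss(o_hat, o)
         + recon_loss(c_o_hat, c_o) + recon_loss(s_o_hat, s_o))
```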

Adversarial loss. Generative adversarial learning (Goodfellow et al. 2014) has been adapted to many visual tasks, e.g., detection (Nguyen et al. 2017; Bai et al. 2018), inpainting (Pathak et al. 2016; Iizuka et al. 2017; Yu et al. 2018), ensemble (Shen et al. 2019), etc. We adopt an adversarial loss \({\mathcal {L}}_{adv}\) where \(D^g_{\mathcal {X}}\), \(D^o_{\mathcal {X}}\), \(D^g_{\mathcal {Y}}\) and \(D^o_{\mathcal {Y}}\) attempt to discriminate between real and synthetic images/objects in each domain. We explore two designs for the discriminators: weight-sharing or weight-independent for global and instance images in each domain. The ablation results are shown in Tables 4 and 5; we observe that the shared discriminator is the better choice in our experiments.

Full objective function. The full objective function of our framework is:

$$\begin{aligned} \begin{gathered} \mathop {\min }\limits _{{E_{\mathcal {X}}},{E_{\mathcal {Y}}},{G_{\mathcal {X}}},{G_{\mathcal {Y}}}} \mathop {\max }\limits _{{D_{\mathcal {X}}},{D_{\mathcal {Y}}}} {\mathcal {L}}({E_{\mathcal {X}}},{E_{\mathcal {Y}}},{G_{\mathcal {X}}},{G_{\mathcal {Y}}},{D_{\mathcal {X}}},{D_{\mathcal {Y}}}) \\ = \underbrace{{\lambda _g}({\mathcal {L}}_{}^{{g_{\mathcal {X}}}} + {\mathcal {L}}_{}^{{g_{\mathcal {Y}}}}) + {\lambda _{{c_g}}}({\mathcal {L}}_g^{{c_{\mathcal {X}}}} + {\mathcal {L}}_g^{{c_{\mathcal {Y}}}}) + {\lambda _{{s_g}}}({\mathcal {L}}_g^{{s_{\mathcal {X}}}} + {\mathcal {L}}_g^{{s_{\mathcal {Y}}}})}_{global - level\;reconstruction\;loss} \\ \quad + \underbrace{{\lambda _o}({\mathcal {L}}_{}^{{o_{\mathcal {X}}}} + {\mathcal {L}}_{}^{{o_{\mathcal {Y}}}}) + {\lambda _{{c_o}}}({\mathcal {L}}_o^{{c_{\mathcal {X}}}} + {\mathcal {L}}_o^{{c_{\mathcal {Y}}}}) + {\lambda _{{s_o}}}({\mathcal {L}}_o^{{s_{\mathcal {X}}}} + {\mathcal {L}}_o^{{s_{\mathcal {Y}}}})}_{instance - level\;reconstruction\;loss} \\ \quad + \underbrace{{\mathcal {L}}_{adv}^{{{\mathcal {X}}_g}} + {\mathcal {L}}_{adv}^{{{\mathcal {Y}}_g}}}_{global - level\;GAN\;loss} + \underbrace{{\mathcal {L}}_{adv}^{{{\mathcal {X}}_o}} + {\mathcal {L}}_{adv}^{{{\mathcal {Y}}_o}}}_{instance - level\;GAN\;loss} \\ \end{gathered}\nonumber \\ \end{aligned}$$
(5)

where \(\lambda _g\), \(\lambda _o\), \(\lambda _{c_g}\), \(\lambda _{c_o}\), \(\lambda _{s_g}\), \(\lambda _{s_o}\) are weights that control the importance of the different reconstruction terms.

At inference time, we simply use the global branch to generate the target domain images (see the upper-right part of Fig. 6), so bounding box annotations are not needed at this stage; this strategy also guarantees that the generated images are harmonious.

Fig. 7 Illustration of our cross-cycle consistency process. We only show cross-granularity (image \(\leftrightarrow \) object), the cross-domain consistency (\({\mathcal {X}} \leftrightarrow {\mathcal {Y}}\)) is similar to the above paradigm

5 Domain Adaptive Object Detection

Unsupervised domain adaptive object detection aims to learn a robust detector in the domain shift circumstance, where the training (source) domain is label-rich with bounding box annotations, while the testing (target) domain is label-agnostic and the feature distributions between training and testing domains are dissimilar or even totally different.

Following the common formulation of domain adaptive object detection, we define a source domain \({\mathcal {X}}\) where annotated bounding boxes are available, and a target domain \({\mathcal {Y}}\) where only the images can be used during training, without any labels (bounding boxes or categories). Our purpose is to train a robust detector that adapts well to both source and target domain data, i.e., we aim to learn a domain-invariant feature representation that works well for detection across the two different domains.

5.1 Gradient Detach Updating

In this section, we first introduce the detach strategy and how it helps to prevent the flow of gradients from the context sub-network through the detection backbone path. Then we introduce the whole framework, in which we incorporate detach-based multi-objective learning into the domain adaptive object detection scenario.

Fig. 8 Gradient detach helps to amplify contrast between context and object areas in domain adaptation scenario

We define a sub-network to generate the context information from early layers of the detection backbone. Intuitively, instance and context focus on perceptually different parts of an image, so their representations should also be discrepant. However, under conventional joint training, the companion sub-network is updated simultaneously with the detection backbone, which may lead to learning indistinguishable representations/behaviors for the two parts. To this end, we propose to suppress gradients during backpropagation and force the representation of the context sub-network to be dissimilar to that of the detection network, as shown in Algorithm 1. We then apply an instance-context alignment module with the detach-generated context and the backbone object representations for joint adaptation, as elaborated in the following. We find that gradient detach helps to obtain more discriminative context and object representations (see Fig. 8), and we show empirical evidence that this path carries diverse information, hence suppressing gradients from this path is beneficial for this task.
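
In code, the detach strategy only requires stopping gradients where the context sub-network taps into the backbone. The following is a minimal sketch with hypothetical module names and feature sizes, not the authors' exact Algorithm 1:

```python
import torch
import torch.nn as nn


class DetachedContextBranch(nn.Module):
    """Context sub-network fed from an early backbone feature map.

    The `.detach()` call blocks gradients from the context branch flowing back
    into the detection backbone, so the two parts learn more discriminative
    (dissimilar) representations.
    """

    def __init__(self, in_ch=256, ctx_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 128, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(128, ctx_dim),
        )

    def forward(self, backbone_feat):
        # Gradient detach: the context branch is trained on these features,
        # but its gradients never reach the backbone that produced them.
        return self.net(backbone_feat.detach())


# Usage: `feat` stands in for an early-stage feature of the detection backbone.
feat = torch.randn(1, 256, 64, 128, requires_grad=True)
context = DetachedContextBranch()(feat)
context.sum().backward()
print(feat.grad)  # None: no gradient flows back through the detached path
```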

Fig. 9 Overview of our domain adaptive object detection framework. “RPN” is the region proposal network proposed in Faster RCNN (Ren et al. 2015) for generating object proposals. “GRL” is the gradient reverse layer (Ganin and Lempitsky 2015): the sign of the gradient is reversed when passing through the GRL layer to optimize the base network, while conventional gradient descent is applied for training the domain classifiers at different layers. Please refer to Sect. 5 for more details

Detach-Based Multi-Objective Learning. As shown in Fig. 9, we focus on detach-based complementary objective learning and let \({\mathcal {S}}=\{({\mathbf {x}}_i^{({\mathcal {X}})}, {\mathbf {y}}_i^{({\mathcal {X}})})\}\), where \({\mathbf {x}}_i^{({\mathcal {X}})} \in {\mathcal {R}}^n\) denotes an image, \({\mathbf {y}}^{({\mathcal {X}})}_i\) is the corresponding bounding box and category labels for sample \({\mathbf {x}}^{({\mathcal {X}})}_i\), and i is an index. Each label \({\mathbf {y}}^{({\mathcal {X}})}=(y_{\mathbf {c}}^{({\mathcal {X}})},y_{\mathbf {b}}^{({\mathcal {X}})})\) consists of a class label \(y_{\mathbf {c}}^{({\mathcal {X}})}\), where \({\mathbf {c}}\) is the category, and a 4-dimensional bounding-box coordinate \(y_{\mathbf {b}}^{({\mathcal {X}})} \in {\mathcal {R}}^4\). For the target domain we only use image data for training, so \({\mathcal {T}}=\{{\mathbf {x}}_i^{({\mathcal {Y}})}\}\). We define a recursive function for the layers \({\mathbf {k}}=1,2,\dots ,{\mathbf {K}}\) at which we insert complementary losses:

$$\begin{aligned} \begin{array}{l}{{\hat{\varTheta }}_{{\mathbf {k}}}={\mathcal {F}}\left( {\mathbf {Z}}_{{\mathbf {k}}}\right) , \text{ and } {\mathbf {Z}}_{0} \equiv {\mathbf {x}}} \end{array} \end{aligned}$$
(6)

where \({\hat{\varTheta }}_{{\mathbf {k}}}\) is the feature map produced at layer \({\mathbf {k}}\), \({\mathcal {F}}\) is the function that generates the features at layer \({\mathbf {k}}\), and \({\mathbf {Z}}_{{\mathbf {k}}}\) is the input at layer \({\mathbf {k}}\). We formulate the complementary loss of domain classifier \({\mathbf {k}}\) as follows:

$$\begin{aligned} \begin{gathered} {\mathcal {L}}_{{\mathbf {k}}}\left( \hat{\varTheta }^{({\mathcal {X}})}_{{\mathbf {k}}}, \hat{\varTheta }^{({\mathcal {Y}})}_{{\mathbf {k}}} ; {\mathbf {D}}_{{\mathbf {k}}}\right) =\mathcal{L}_{\mathbf {k}}^{({\mathcal {X}})}({{{\hat{\varTheta }} }^{({\mathcal {X}})}_{{\mathbf {k}}}};{{\mathbf {D}}_\mathbf{{k}}}) + \mathcal{L}_{\mathbf {k}}^{({\mathcal {Y}})}({{{\hat{\varTheta }} }^{({\mathcal {Y}})}_{{\mathbf {k}}}};{{\mathbf {D}}_\mathbf{{k}}}) \\ = {\mathbb {E}}\left[ \log \left( {\mathbf {D}}_{{\mathbf {k}}}\left( \hat{\varTheta }^{({\mathcal {X}})}_{{\mathbf {k}}}\right) \right) \right] + {\mathbb {E}}\left[ \log \left( 1-{\mathbf {D}}_{{\mathbf {k}}}\left( \hat{\varTheta }^{({\mathcal {Y}})}_{{\mathbf {k}}}\right) \right) \right] \end{gathered} \end{aligned}$$
(7)

where \({\mathbf {D}}_{\mathbf {k}}\) is the \({\mathbf {k}}\)-th domain classifier or discriminator, and \(\hat{\varTheta }^{({\mathcal {X}})}_{{\mathbf {k}}}\) and \(\hat{\varTheta }^{({\mathcal {Y}})}_{{\mathbf {k}}}\) denote feature maps from the source and target domains respectively. Following (Chen et al. 2018; Saito et al. 2019), we also adopt the gradient reverse layer (GRL) (Ganin and Lempitsky 2015) to enable adversarial training: a GRL layer is placed between each domain classifier and the detection backbone network, and during backpropagation the GRL reverses the gradient that passes from the domain classifier to the detection network.
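
A standard GRL can be implemented as a custom autograd function; the sketch below follows the usual formulation of Ganin and Lempitsky (2015) rather than the released code:

```python
import torch
from torch.autograd import Function


class GradReverse(Function):
    """Identity in the forward pass; multiplies gradients by -lambda backward."""

    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and optionally scale) the gradient flowing to the backbone.
        return -ctx.lambd * grad_output, None


def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)


# Example: the feature passes unchanged forward, but the domain-classifier
# gradient reaches the backbone with its sign flipped (adversarial update).
feat = torch.randn(2, 256, 8, 8, requires_grad=True)
reversed_feat = grad_reverse(feat, lambd=1.0)
reversed_feat.sum().backward()
print(feat.grad.unique())  # all -1: the gradients of sum (+1) were negated
```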

For our instance-context alignment loss \({{\mathcal {L}}_{{\mathbf {ILoss}}}}\), we take the instance-level representations and the context vector as inputs. The instance-level vectors come from the RoI layer, where each vector focuses on the representation of a local object only. The context vector comes from our proposed sub-network, which combines hierarchical global features. We concatenate the instance features with the same context vector. Since context information is fairly different from object information, jointly training the detection and context networks would mix the critical information from each part; hence we use the detach strategy described above to update the gradients. Aligning the instance and context representations simultaneously helps alleviate the variances of object appearance, part deformation, object size, etc. in the instance vectors and of illumination, scene, etc. in the context vector. We define \(d_i\) as the domain label of the i-th training image, where \(d_i=1\) for the source and \(d_i=0\) for the target, so the instance-context alignment loss can be further formulated as:

$$\begin{aligned} \begin{aligned} {{\mathcal {L}}_{{\mathbf {ILoss}}}} = - \frac{1}{{{N_{{\mathcal {X}}}}}}{\sum \limits _{i,j} {(1 - d_i)} \log {{{\mathbf {P}}}_{(i,j)}}} \\ \quad - \frac{1}{{{N_{{\mathcal {Y}}}}}}{\sum \limits _{i,j} {d_i\log \left( {1 - {{{\mathbf {P}}}_{(i,j)}}} \right) } } \end{aligned} \end{aligned}$$
(8)

where \(N_{{\mathcal {X}}}\) and \(N_{{\mathcal {Y}}}\) denote the numbers of source and target examples, and \({\mathbf {P}}_{(i,j)}\) is the output probability of the instance-context domain classifier for the j-th region proposal in the i-th image. Our total SCL (stacked complementary losses) objective \({\mathcal {L}}_{\mathbf {SCL}}\) can then be written as:

$$\begin{aligned} {{\mathcal {L}}_{\mathbf {SCL}}} = \sum \limits _{{\mathbf {k}} = 1}^{\mathbf {K}} {{{\mathcal {L}}_{{\mathbf {k}}}}} + {{\mathcal {L}}_{\mathbf {ILoss}}} \end{aligned}$$
(9)
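
Putting Eqs. (7)-(9) together, a simplified sketch of the stacked objective is shown below. For readability it uses binary cross-entropy with source label 1 and target label 0 at every stage (the negative of Eq. (7)), whereas Sect. 6.3.2 shows that the actual framework mixes least-squares, cross-entropy and focal losses across stages; the shapes and per-proposal labels are illustrative assumptions:

```python
import torch
import torch.nn.functional as F


def scl_loss(stage_logits_src, stage_logits_tgt, ic_probs, ic_domain):
    """Sketch of the stacked complementary losses in Eq. (9).

    stage_logits_src / stage_logits_tgt: lists of K per-stage domain-classifier
        outputs (logits) for source and target features (Eq. 7).
    ic_probs:  per-proposal probabilities P_(i,j) from the instance-context
               classifier; ic_domain: the corresponding domain labels (Eq. 8).
    """
    total = 0.0
    for logit_s, logit_t in zip(stage_logits_src, stage_logits_tgt):
        total = total + F.binary_cross_entropy_with_logits(
            logit_s, torch.ones_like(logit_s))
        total = total + F.binary_cross_entropy_with_logits(
            logit_t, torch.zeros_like(logit_t))
    # Instance-context alignment term.
    total = total + F.binary_cross_entropy(ic_probs, ic_domain)
    return total


# Dummy example with K = 3 stages and 8 region proposals.
src = [torch.randn(4, 1) for _ in range(3)]
tgt = [torch.randn(4, 1) for _ in range(3)]
probs = torch.rand(8)
labels = torch.randint(0, 2, (8,)).float()
loss = scl_loss(src, tgt, probs, labels)
```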

5.2 Overall Framework

Our detection part is based on Faster RCNN (Ren et al. 2015), including the Region Proposal Network (RPN) and other modules, which is a conventional practice in many adaptive detection works. The detection loss is summarized as:

$$\begin{aligned} {{\mathcal {L}}_{det}} = {{\mathcal {L}}_{rpn}} + {{\mathcal {L}}_{cls}} + {{\mathcal {L}}_{reg}} \end{aligned}$$
(10)

where \({{\mathcal {L}}_{cls}}\) is the classification loss and \({{\mathcal {L}}_{reg}}\) is the bounding-box regression loss. To train the whole model using SGD, the overall objective function in the model is:

$$\begin{aligned} \min _{{\mathcal {F}}, {\mathbf {R}}} \max _{{\mathbf {D}}} {\mathcal {L}}_{det}({\mathcal {F}}({\mathbf {Z}}), {\mathbf {R}})-\lambda {\mathcal {L}}_\mathbf {SCL}({\mathcal {F}}({\mathbf {Z}}), {\mathbf {D}}) \end{aligned}$$
(11)

where \(\lambda \) is the trade-off coefficient between detection loss and our complementary loss. \({\mathbf {R}}\) denotes the RPN and other modules in Faster RCNN.
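
Because the GRL already reverses the adversarial gradients with respect to the backbone, the min-max of Eq. (11) reduces to a single joint backward pass in practice. The toy example below (with a fabricated model and losses purely to make the step executable, not the released training code) illustrates one such update:

```python
import torch
import torch.nn as nn


class ToyAdaptiveDetector(nn.Module):
    """Stand-in for the full framework: exposes a detection loss on labeled
    source data and an SCL loss on source/target features. Both heads here are
    fabricated simplifications so the training step below actually runs."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(16, 8)
        self.det_head = nn.Linear(8, 4)
        self.dom_head = nn.Linear(8, 1)

    def detection_loss(self, src_x, src_y):
        return nn.functional.mse_loss(self.det_head(self.backbone(src_x)), src_y)

    def scl_loss(self, src_x, tgt_x):
        logits = torch.cat([self.dom_head(self.backbone(src_x)),
                            self.dom_head(self.backbone(tgt_x))])
        labels = torch.cat([torch.ones(len(src_x), 1), torch.zeros(len(tgt_x), 1)])
        return nn.functional.binary_cross_entropy_with_logits(logits, labels)


model = ToyAdaptiveDetector()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
lam = 1.0  # lambda in Eq. (11)

src_x, src_y, tgt_x = torch.randn(4, 16), torch.randn(4, 4), torch.randn(4, 16)
optimizer.zero_grad()
# The GRL inside the real domain classifiers reverses gradients w.r.t. the
# backbone, so minimizing (det + lambda * scl) realizes the min-max of Eq. (11).
loss = model.detection_loss(src_x, src_y) + lam * model.scl_loss(src_x, tgt_x)
loss.backward()
optimizer.step()
```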

Table 4 Diversity scores on our dataset. We use the average LPIPS distance (Zhang et al. 2018) to measure the diversity of generated images
Table 5 Comparison of Conditional Inception Score (CIS) and Inception Score (IS). To obtain high CIS and IS scores, a model is required to synthesize images that are realistic, diverse and of high quality
Table 6 Mask-RCNN with ResNet-50-FPN (Lin et al. 2017) detection and segmentation results on MS COCO 2017 val set

6 Experiments and Analysis

6.1 Instance-level Image-to-image Translation

We conduct experiments on our collected dataset (CDTD). We also use the COCO dataset (Lin et al. 2014) to verify the effectiveness of data augmentation.

Implementation Details. Our implementation is based on MUNITFootnote 3 with PyTorch (Paszke et al. 2017). For I2I translation, we resize the short side of the images to 360 pixels due to the limitation of GPU memory. For COCO image synthesis, since the training images (our CDTD dataset) and target images (COCO) follow different distributions, we keep the original size of our training images and crop 360\(\times \)360-pixel patches to train our model, in order to learn more details of images and objects while ignoring the global information. In this circumstance, we build our object part as an independent branch and each object is resized to 120\(\times \)120 pixels during training. We set the trade-off hyper-parameters to \(\lambda _g=10\), \(\lambda _o=10\), \(\lambda _{c_g}=1\), \(\lambda _{c_o}=1\), \(\lambda _{s_g}=1\), \(\lambda _{s_o}=1\) following MUNIT (Huang et al. 2018).

6.1.1 Baselines

We perform our evaluation against the following four recently proposed state-of-the-art unpaired I2I translation methods:

  • CycleGAN (Zhu et al. 2017): CycleGAN contains two translation functions (\({\mathcal {X}} \rightarrow {\mathcal {Y}}\) and \({\mathcal {X}} \leftarrow {\mathcal {Y}}\)), and the corresponding adversarial loss. It assumes that the input images can be translated to another domain and then can be mapped back with a cycle consistency loss.

  • UNIT (Liu et al. 2017): The UNIT method is an extension of CycleGAN (Zhu et al. 2017) that is based on the shared latent space assumption. It contains two VAE-GANs and also uses cycle-consistency loss (Zhu et al. 2017) for learning models.

  • MUNIT (Huang et al. 2018): MUNIT consists of an encoder and a decoder for each domain. It assumes that the image representation can be decomposed into a domain-invariant content space and a domain-specific style space. The latent vectors of each encoder are disentangled to a content vector and a style vector. I2I translation is performed by swapping content-style pairs.

  • DRIT (Lee et al. 2018): The motivation of DRIT is similar to MUNIT. It consists of content encoders, attribute encoders, generators and domain discriminators for both domains. The content encoder maps images into a shared content space and the attribute encoder maps images into a domain-specific attribute space. A cross-cycle consistency loss is adopted for performing I2I translation.

6.1.2 Evaluation

We adopt the same evaluation protocol from previous unsupervised I2I translation works and evaluate our method with the LPIPS Metric (Zhang et al. 2018), Inception Score (IS) (Salimans et al. 2016) and Conditional Inception Score (CIS) (Huang et al. 2018).

LPIPS Metric. Zhang et al. (2018) proposed the LPIPS distance to measure translation diversity, which has been shown to correlate well with human perceptual similarity. Following (Huang et al. 2018), we calculate the average LPIPS distance between 19 pairs of randomly sampled translation outputs from 100 input images of our test set. Following (Huang et al. 2018) and as recommended by Zhang et al. (2018), we use a pre-trained AlexNet (Krizhevsky et al. 2012) to extract deep features.
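
For reference, the diversity score can be computed with the publicly released lpips package; the sketch below is a simplified illustration and does not reproduce the exact sampling procedure (19 pairs per input over 100 test images) or preprocessing of Huang et al. (2018):

```python
import itertools
import torch
import lpips  # pip install lpips (official LPIPS implementation)

# AlexNet-based LPIPS, as recommended by Zhang et al. (2018).
loss_fn = lpips.LPIPS(net='alex')


def average_pairwise_lpips(outputs):
    """Average LPIPS distance over all pairs of translation outputs.

    `outputs` is a list of image tensors of shape (1, 3, H, W) scaled to [-1, 1].
    """
    dists = [loss_fn(a, b).item() for a, b in itertools.combinations(outputs, 2)]
    return sum(dists) / len(dists)


# Example with random tensors standing in for translated outputs.
samples = [torch.rand(1, 3, 256, 256) * 2 - 1 for _ in range(5)]
print(average_pairwise_lpips(samples))
```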

Results are summarized in Table 4. “INIT w/ D\(_s\)” denotes our model trained with a shared discriminator between the entire image and the objects, and “INIT w/o D\(_s\)” denotes separate discriminators for image and object. Thanks to the coarse and fine styles we use, our average INIT w/ D\(_s\) score outperforms MUNIT by a notable margin. We also observe that our dataset (real images) has a very large diversity score, which indicates that the dataset is diverse and challenging.

Fig. 10 Visualization of our synthetic images. The left group images are from COCO and the right are from Cityscapes

Fig. 11 Visualization of multimodal results. We use randomly sampled style codes to generate these images and the darkness differs slightly across them

Fig. 12 Qualitative comparison on randomly selected instance level results. The first row shows the input objects. The second row shows the self-reconstruction results. The third and fourth rows show outputs from MUNIT and ours, respectively

Inception Score (IS) and Conditional Inception Score (CIS). We use the Inception Score (IS) (Salimans et al. 2016) and Conditional Inception Score (CIS) (Huang et al. 2018) to evaluate our learned models. IS measures the diversity of all output images, while CIS measures the diversity of outputs conditioned on a single input image; it is a modified IS that is more suitable for evaluating the multimodal I2I translation task. The detailed definition of CIS can be found in Huang et al. (2018). We employ an Inception V3 model (Szegedy et al. 2016) and fine-tune the classification model on the four domain category labels of our dataset. Other settings are the same as in Huang et al. (2018). As shown in Table 5, our results are consistently better than the baselines MUNIT and DRIT.

Table 7 Improvement comparison on COCO detection with different image synthesis methods
Table 8 Performance decline when training and testing on real images, compared to results on synthetic images

Image Synthesis on Multiple Datasets The visualization of our synthetic images is shown in Fig. 10. The left group of images is from COCO and the right from Cityscapes. We observe that the most challenging problem for multi-dataset synthesis is the inter-class variance among them.

6.2 Data Augmentation for Detection & Segmentation on COCO

We use the Mask RCNN (He et al. 2017) framework for these experiments. A synthetic copy of the entire COCO dataset is generated by our sunny\(\rightarrow \)night model. We employ an open-source implementation of Mask RCNNFootnote 4 for training the COCO models. For training, we use the same number of training epochs and the other default settings, including the learning rate schedule, batch size, etc.

All results are summarized in Table 6. The first column (group) shows the training data we used, and the second group shows the validation data we tested on. The third and fourth groups are detection and segmentation results, respectively. We observe that our real-image-trained model obtains 30.4% mAP on synthetic validation images, which indicates that the distribution differences between original COCO and our synthetic images are not very large. It seems that our generation process acts more like photometric distortion or brightness adjustment of images, which can be regarded as a data augmentation technique whose effectiveness for object detection has been verified in Liu et al. (2016). From the last two rows we can see that not only do the synthetic images help improve real-image testing performance, but the real images also boost the results on synthetic images (both training and testing on synthetic images). We also compare the improvement with different generation methods in Table 7. The results show that our object branch brings more benefit to the detection task than the baseline. We also believe that the proposed data augmentation method can benefit scenarios with limited training data, such as learning detectors from scratch (Shen et al. 2017; Law and Deng 2018; He et al. 2019; Duan et al. 2019).

We further conduct scene parsing on Cityscapes (Cordts et al. 2016). However, we did not observe an obvious improvement in this experiment. Using PSPNet (Zhao et al. 2017) with ResNet-50 (He et al. 2016), we obtain mIoU: 76.6%, mAcc: 83.1% when training and testing on real images, and 74.6%/81.1% when training and testing on synthetic images. The gaps between real and synthetic images are quite small. We conjecture that this case (no gain) arises because the synthetic Cityscapes is too close to the original one. We compare the performance decline in Table 8. Since the metrics differ between COCO and Cityscapes, we use the relative percentage for comparison. The results indicate that the synthetic images may be more diverse for COCO, since the decline is much smaller on Cityscapes.

6.2.1 Analysis

Qualitative Comparison We qualitatively compare our method with baseline MUNIT (Huang et al. 2018). Fig. 13 shows example results on sunny\(\rightarrow \)night.

We randomly select one output for each method. Our results are clearly more realistic and diverse, with higher quality. If the object area is small, MUNIT (Huang et al. 2018) may fall into mode collapse and produce small artifacts around the object area; in contrast, our method overcomes this problem through instance-level reconstruction. We also visualize multimodal results in Fig. 11 with randomly sampled style vectors; various degrees of darkness are generated across these images.

Fig. 13 Case-by-case comparison on sunny\(\rightarrow \)night. The first row shows the input images. The second and third rows show random outputs from MUNIT (Huang et al. 2018) and ours, respectively

Instance Generation The results of generated instances are shown in Fig. 12. Our method generates more diverse objects (columns 1, 2, 6) and more details (columns 5, 6, 7), even including the reflection (column 7). MUNIT sometimes fails to generate the desired results if the global style is not suitable for the target object (column 2).

Comparison of Local (Object) and Global Style Code Distributions. To further verify our assumption that the object and global styles are distinguishable enough to disentangle, we visualize the embedded style vectors from our w/ D\(_s\) model using the t-SNE tool (Maaten and Hinton 2008). We randomly sample 100 images and objects from the test set of each domain; results are shown in Fig. 14. The same color groups represent the paired global images and objects in the same domain. We observe that the style vectors of the global and object images of the same domain are grouped and separated by a remarkable margin, while remaining neighbors in the embedded space. This is reasonable and demonstrates the effectiveness of our learning process.

Fig. 14 Visualization of style distribution by t-SNE (Maaten and Hinton 2008). The groups with the same color are paired object and global styles of same domain

Table 9 Adaptive detection results on our CDTD dataset
Table 10 More adaptive detection results on other translation of the CDTD dataset
Fig. 15 Parameter sensitivity for the value of \(\lambda \) (left) and \(\gamma \) (right) in adaptation from Cityscapes to FoggyCityscapes and from Sim10k (Johnson-Roberson et al. 2016) to Cityscapes

Table 11 Analysis of hyper-parameter \({\mathbf {K}}\) in stacked complementary losses
Table 12 Ablation study (%) on Cityscapes to FoggyCityscapes (we use 150 m visibility, the densest one) adaptation
Fig. 16 Visualizations of our synthetic images on different source-target domain pairs. In each group, the first row is the reconstructed source domain images, and the second row is the synthetic target domain images

Fig. 17 More examples of our synthetic images on sunny\(\rightarrow \)cloudy and sunny\(\rightarrow \)rainy. Note that as the rainy images in our dataset look more like overcast weather with wet road, our results capture the attributes of training data very well

6.3 Domain Adaptive Object Detection

Implementation Details. In all experiments, we resize the shorter side of each image to 600 following (Ren et al. 2015; Saito et al. 2019) with RoI-align (He et al. 2017). We train the model with the SGD optimizer; the initial learning rate is set to \(10^{-3}\) and divided by 10 after every 50,000 iterations. Unless otherwise stated, we set \(\lambda \) to 1.0 and \(\gamma \) to 5.0, and we use \({\mathbf {K}}=3\) in our experiments (the analysis of hyper-parameter \({\mathbf {K}}\) is shown in Table 11). We report mean average precision (mAP) with an IoU threshold of 0.5 for evaluation. Following (Chen et al. 2018; Saito et al. 2019), we feed one labeled source image and one unlabeled target image in each mini-batch during training. Our method is implemented on the PyTorch platform.
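
The optimizer setup described above corresponds to a standard step schedule; a minimal sketch with a placeholder model (the momentum value is an assumption, since it is not specified in the text) is:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder for the adaptive detector
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
# Divide the learning rate by 10 every 50,000 iterations.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50000, gamma=0.1)

for iteration in range(3):                   # the real training runs far longer
    optimizer.zero_grad()
    loss = model(torch.randn(2, 10)).sum()   # placeholder objective
    loss.backward()
    optimizer.step()
    scheduler.step()                         # step per iteration, not per epoch
```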

6.3.1 Baselines and Our Results

The baselines and our results are shown in Tables 9 and 10. Following the translation settings, we conduct experiments on three domain pairs: sunny\(\rightarrow \)night (s2n), sunny\(\rightarrow \)rainy (s2r) and sunny\(\rightarrow \)cloudy (s2c). Since the rainy domain contains far fewer training images than the sunny domain, for the s2r experiment we randomly sample training data from the sunny set to match the number of rainy images and then train the detector. Our method is consistently better than the baseline methods. We do not provide the results of s2c (faster), as we found that cloudy images are too similar to sunny ones in this dataset (nearly the same), so the non-adapted result is already very close to the adapted methods. Our code for domain adaptive object detection is available at: https://github.com/harsh-99/SCL.

6.3.2 Ablation Results of Gradient Detach

To thoroughly verify the effectiveness of each component of our proposed gradient detach method and its generalization ability to other benchmarks, we further investigate each component and design of our framework on adaptation from Cityscapes (Cordts et al. 2016) to FoggyCityscapes (Sakaridis et al. 2018). Both source and target datasets have 2975 images in the training set and 500 images in the validation set. We design several controlled experiments for this ablation study. A consistent setting is imposed on all experiments, except when specific components or structures are being examined. In this study, we train models with the ImageNet (Deng et al. 2009) pre-trained ResNet-101 as the main backbone; we also provide results with a pre-trained VGG16 model. We use four types of loss functions in SCL: LS: least-squares loss; CE: cross-entropy loss; FL: focal loss; ILoss: instance-context alignment loss.

Focal Loss (FL). Focal loss \({\mathcal {L}}_\mathbf {FL}\) (Lin et al. 2017) is adopted to ignore easy-to-classify examples and focus on those hard-to-classify ones during training:

$$\begin{aligned} {\mathcal {L}}_\mathbf {FL}\left( p_{\mathrm {t}}\right) =-f\left( p_{\mathrm {t}}\right) \log \left( p_{\mathrm {t}}\right) , f\left( p_{\mathrm {t}}\right) =\left( 1-p_{\mathrm {t}}\right) ^{\gamma } \end{aligned}$$
(12)

where \(p_{\mathrm {t}}=p \text{ if } d_i=1, \text {otherwise}, p_{\mathrm {t}}={1-p}\).
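
Eq. (12) in code form: a binary focal-loss sketch consistent with the equation and the domain labels \(d_i\) defined above (the released implementation may differ in reduction and numerical details):

```python
import torch


def binary_focal_loss(p, domain_label, gamma=5.0, eps=1e-7):
    """Focal loss of Eq. (12) for domain classification.

    p:            predicted probability of the source domain, in (0, 1).
    domain_label: d_i = 1 for source samples, 0 for target samples.
    gamma:        focusing parameter (gamma = 5.0 in the best reported setting).
    """
    p_t = torch.where(domain_label == 1, p, 1.0 - p)
    loss = -((1.0 - p_t) ** gamma) * torch.log(p_t + eps)
    return loss.mean()


# Example: well-classified samples (p_t near 1) are strongly down-weighted.
p = torch.tensor([0.95, 0.60, 0.10])
d = torch.tensor([1, 1, 0])
print(binary_focal_loss(p, d))
```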

The results are summarized in Table 12. We present several combinations of the four complementary objectives with their loss names and performance. We observe that “LS|CE|FL|FL” obtains the best accuracy with Context and Detach. This indicates that LS should only be placed on the low-level features (rich spatial information and poor semantic information) and FL should be placed at the high-level locations (weak spatial information and strong semantic information); for the middle location, CE is a good choice. If LS is used on the middle/high-level features or FL is used on the low-level features, the network is confused when learning hierarchical semantic outputs, and ILoss+detach loses its effectiveness in that circumstance. This verifies that domain adaptive object detection relies heavily on deep supervision; however, the diverse supervisions should be adopted in a controlled and correct manner. Furthermore, our proposed method performs much better than the baseline Strong-Weak (Saito et al. 2019) (37.9% vs. 34.3%) and other state-of-the-art methods.

Parameter Sensitivity on \(\lambda \) and \(\gamma \) . Figure 15 shows the results for parameter sensitivity of \(\lambda \) and \(\gamma \) in Eqs. 11 and 12. \(\lambda \) is the trade-off parameter between SCL and detection objectives and \(\gamma \) controls the strength of hard samples in Focal Loss. We conduct experiments on two adaptations: Cityscapes \(\rightarrow \) FoggyCityscapes (blue) and Sim10K (Johnson-Roberson et al. 2016) \(\rightarrow \) Cityscapes (red). On Cityscapes \(\rightarrow \) FoggyCityscapes, we achieve the best performance when \(\lambda =1.0\) and \(\gamma =5.0\) and the best accuracy is 37.9%. On Sim10K \(\rightarrow \) Cityscapes, the best result is obtained when \(\lambda =0.1\), \(\gamma =2.0\).

Hyper-parameter \({\mathbf {K}}\) Analysis. Table 11 shows the sensitivity of the hyper-parameter \({\mathbf {K}}\) in Fig. 9, which controls the number of SCL losses and context branches. The proposed method performs best with \({\mathbf {K}}=3\) on all three datasets.

7 Conclusion

In this work, we have introduced a large-scale cross-domain dataset for the instance-level image-to-image translation and domain adaptive object detection tasks. We presented the INIT method for instance-aware translation with unpaired training data. Extensive qualitative and quantitative results demonstrate that the proposed method can capture the details of objects and produce realistic and diverse images. We also addressed unsupervised domain adaptive object detection through a novel training strategy, gradient detach, for convolutional neural networks. Our future work will focus on exploring domain-shift tasks from scratch, i.e., without pre-trained models (Shen et al. 2017, 2019; He et al. 2019; Zhu et al. 2019), to avoid involving bias from the pre-training dataset.