1 Introduction

We hypothesize that solving multiple unsupervised tasks together enables one to improve upon the performance of the best methods that solve each task individually. The underlying motivation is that in unsupervised learning, the structure of the data is a key source of knowledge, and each task exposes a different aspect of it. We advocate solving the various tasks in phases, in which the easier tasks are addressed first and the remaining tasks are introduced gradually, while the solutions of the earlier tasks are continually updated. The method consists of multiple networks that are trained end-to-end and side-by-side to solve multiple tasks. It starts by learning background image synthesis and the generation of object images from a particular domain. It then advances to more complex tasks, such as clustering, semantic segmentation, and object removal. Finally, we show the model’s ability to perform image-to-image translation. The entire learning process is unsupervised, meaning that no annotated information is used. In particular, the method does not employ class labels, segmentation masks, bounding boxes, etc. However, it does require a separate set of clean background images, which are easy to obtain in many cases.

Contributions. Beyond the conceptual novelty of a method that single-handedly addresses multiple unsupervised tasks, which were previously solved by individual methods, the method displays a host of technical novelties, including: (i) a novel architecture that supports multiple paths addressing multiple tasks, (ii) bypass paths that allow a smooth transition between autoencoding and generation based on a random seed, (iii) backpropagation through three paths in each iteration, (iv) a mixup module, which interpolates between the latent representations of the generation and reconstruction paths, and more. Each of these novelties, as supported by the ablation studies, contributes to state-of-the-art results compared to the literature methods on each of the individual tasks.

2 Related Work

Since our work touches on many tasks, we focus the literature review on general concepts and on the most relevant work. Generative models are typically based on Generative Adversarial Networks  [11] or Variational Auto-Encoders  [18]; the two can also be combined  [21]. Conditional image generation conditions the output on an initial variable, most commonly the target class. CGAN  [23] and InfoGAN  [7] proposed different ways to apply the condition to the discriminator. Our work is more similar to InfoGAN, since we do not use labeled data: the label is not linked to any real image, and no conditional discriminator can be applied. The condition is maintained by a classifier that tries to predict the conditioned label and, as a result, forces the generator to condition the result on that label. Semantic segmentation deals with the classification of image pixels according to their class labels. In the supervised setting, Unet  [24], DeepLab  [4], DeepLabV3  [5], and HRNet  [30] have shown great performance leaps using a regression loss. In the unsupervised case, more creative solutions are considered: [3, 6, 8, 14, 16, 27, 31, 32] employ a variety of methods, including inpainting, feature representation learning, clustering, and video frame comparison. In clustering, deep learning methods are the current state of the art. JULE  [33] and DEPICT  [10] cluster based on a learned feature representation, while IIC  [14] trains a classifier directly.

The most similar approach to ours is FineGAN  [26], upon which our generators and discriminators are based. However, there are many significant differences and additions: (i) We add a set of encoders, which are trained to support new tasks. (ii) While FineGAN employs one-hot input, our generators use coded input, which is important for our autoencoding path. (iii) We add a skip connection, followed by a mixup module that combines the bypass tensor with the pre-image tensor. The mixup also allows passing only one of the tensors, making either the bypass or the pre-image optional in each flow. (iv) We employ a single foreground generator instead of FineGAN’s double hierarchical design, since we have found one generator to be dominant and the second redundant. (v) Our model uses layer normalization  [1] instead of batch normalization, which performs better for a large number of classes, a small batch size, and alternating paths. (vi) We define a new normalization method for the generators, in which GLU  [9] layers are used as non-linear activations. (vii) We add many losses and training techniques, many of which are, as far as we can ascertain, completely novel. As a result, our work outperforms FineGAN on all tasks and is capable of performing new tasks that its predecessor could not handle.

Mixup  [35] is a technique that applies a weighted sum between two latent variables in order to synthesize a new latent variable. We use it to merge different paths in the model by mixing latent variables that are partly encoded by the encoders and partly produced by the lower levels of the generators. As far as we can ascertain, this is the first use of mixup to merge information from different paths. Our mixup is applied to image reconstruction in four different locations.

3 Method

To solve the tasks of clustering, foreground segmentation, and conditional generation, our method trains multiple neural networks side-by-side. Each task is solved by applying the networks in a specific order. Similarly, the model is trained by applying the networks in two different paths, with a specific set of losses.

Architecture. The compound network consists of two generators, three encoders, and two discriminators. Figures 1 and 2 illustrate the two training paths. In the generation path, the generators produce a synthetic image conditioned on a selected code, and the encoders then retrieve the latent code from the generated images. In the reconstruction path, the encoders encode an input image into a latent code, which the generators use to reconstruct the image. The reconstruction path is applied twice in each iteration, once on real images and once on fake images from the generation path. Reconstructing fake images adds several forms of self-supervision, such as reconstructing the background and mask, which is not possible for real images without additional supervision.

Fig. 1.

Flow of the generation path. The generators decode the four priors (\(e_{bg},e_p,e_c,z\)) and produce three separate images (foreground, background, mask) that are combined into the final image. The generated image is then coded by three encoders to retrieve the latent variables and priors.

Fig. 2.

Flow of the reconstruction path. The same sub-networks are rearranged to perform image reconstruction. The image is encoded with the shape and style encoders and then decoded by the foreground generator to produce the foreground image and mask. The background encoder and generator then encode the masked image and produce a background image. The output image combines the foreground and background images. The mixup modules, placed in four different locations, merge the encoders’ predicted codes with intermediate stages of the generation, acting as a robust skip connection.

Generators. The generation is performed by merging the results of two separate generators that run in parallel to produce the output image. One generator is dedicated to generating the background and the other to the foreground. The generators are conditioned on a two-level hierarchical categorization. Each category has a unique child class \(\phi _c\) and a parent class \(\phi _p\) shared by multiple child classes. These classes are represented by the one-hot vectors \((e_c, e_p)\). An additional background one-hot vector \(e_{bg}\) affects the generation of the background images. Since there is a tight coupling between the class of the object (water bird, tropical bird, etc.) and the expected background, the typology of the background follows the coarse level of the hierarchy, i.e., the parent class. The generator architecture is influenced by FineGAN  [26].

$$\begin{aligned} e_{c}[i] = \delta _{i,\phi _c},\quad e_{p}[i] = \delta _{i,\phi _p},\quad e_{bg} = e_p \end{aligned}$$
(1)

The generation starts by converting the one-hot vectors into code vectors using learned embeddings; such an embedding is often used when working with categorical values. A fourth vector z is sampled from a multivariate Gaussian distribution to represent non-categorical features.

$$\begin{aligned} v_{bg} = V_{bg}(e_{bg}),\quad v_{p} = V_{p}(e_p),\quad v_{c} = V_{c}(e_c) ,\quad z \sim \mathcal {N}(0,1)^{d_z} \end{aligned}$$
(2)

The background generator \(G_{bg}\) receives the background vector \(v_{bg}\) and noise z and produces a background image \(I_{bg}\). The foreground generator \(G_{fg}\) receives the parent vector \(v_p\), the child vector \(v_c\), and the same z used in the background generation, and produces a foreground image \(I_{fg}\) and a foreground mask \(I_m\). The generator is optimized such that all foreground images with the same \(e_p\) have the same object shape and all images with the same \(e_c\) have a similar object appearance. The latent vector z is implicitly conditioned to represent all non-categorical information, such as pose, orientation, size, etc. It is used in both the background and foreground generation, so that the images produced by the two networks merge into a coherent image. Each generator is a composition of sub-modules applied back to back, with intermediate pre-images (\(A_{bg},A_{fg}\)):

$$\begin{aligned} I_{bg}&= G_{bg_1}(A_{bg}),\quad&A_{bg}&= G_{bg_0}(v_{bg}, z) \end{aligned}$$
(3)
$$\begin{aligned} (I_{fg}, I_m)&= G_{fg_2}(G_{fg_1}(A_{fg}, v_p), v_c),\quad&A_{fg}&= G_{fg_0}(v_p, z) \end{aligned}$$
(4)

The final generated image is (where \(\circ \) denotes element-wise multiplication):

$$\begin{aligned} I = I_{bg} \circ (1 - I_{m}) + I_{fg} \circ I_{m} \end{aligned}$$
(5)
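For concreteness, the composition of the two generator branches and the final compositing of Eq. 3–5 can be sketched as follows (a minimal PyTorch-style sketch; the sub-module names are placeholders for the actual networks, whose internals are not specified here):

```python
import torch

def generate(G_bg0, G_bg1, G_fg0, G_fg1, G_fg2, v_bg, v_p, v_c, z):
    # Background branch, Eq. (3): pre-image A_bg, then background image I_bg.
    A_bg = G_bg0(v_bg, z)
    I_bg = G_bg1(A_bg)

    # Foreground branch, Eq. (4): pre-image A_fg, then image I_fg and mask I_m.
    A_fg = G_fg0(v_p, z)
    I_fg, I_m = G_fg2(G_fg1(A_fg, v_p), v_c)

    # Eq. (5): composite the foreground over the background with the mask.
    I = I_bg * (1.0 - I_m) + I_fg * I_m
    return I, I_bg, I_fg, I_m, A_bg, A_fg
```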

Encoders. Unlike FineGAN, which performs only the generation task, our method requires the use of encoders. We introduce three encoders: a background encoder \(E_{bg}\), a shape encoder \(E_{p}\), and a style encoder \(E_{c}\). They run in semi-parallel to predict both the latent codes (\(v_{bg},v_p,v_c, z\)) of an input image and the underlying one-hot vectors (\(e_{bg},e_p,e_c\)). All encoders are fed with image I as input. The background encoder is additionally fed with the mask \(I_m\). During image reconstruction, there is no initial image mask; it therefore first has to be generated by encoding the shape and style features and applying the foreground generator. The lack of a ground-truth mask is why the encoders do not run fully in parallel. In addition, the background and shape encoders also produce bypass tensors (\(B_{bg},B_{fg}\)) to be used as skip connections between the encoders and the generators.

$$\begin{aligned} (B_{bg})&= E_{bg}(I, I_m) \end{aligned}$$
(6)
$$\begin{aligned} (\hat{e}_p ,\mu _p, \sigma _p, B_{fg}, \mu _z, \sigma _z)&= E_{p}(I) \end{aligned}$$
(7)
$$\begin{aligned} (\hat{e}_c ,\mu _c, \sigma _c)&= E_{c}(I) \end{aligned}$$
(8)

where (\(\mu , \sigma \)) are three pairs of vectors of sizes (\(d_{z}, d_{p}, d_{c}\)) defining the mean and variance used to sample each element of (\(\hat{z}, \hat{v}_{p}, \hat{v}_{c}\)) from a Gaussian distribution.
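A minimal sketch of this sampling step, assuming the standard reparameterization with \(\sigma \) taken as the per-element standard deviation (the exact parameterization is not fixed by the text):

```python
import torch

def sample_latents(mu_z, sigma_z, mu_p, sigma_p, mu_c, sigma_c):
    # Element-wise draws from N(mu, sigma^2), written with the
    # reparameterization trick so the sampling stays differentiable.
    z_hat  = mu_z + sigma_z * torch.randn_like(mu_z)
    vp_hat = mu_p + sigma_p * torch.randn_like(mu_p)
    vc_hat = mu_c + sigma_c * torch.randn_like(mu_c)
    return z_hat, vp_hat, vc_hat
```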

Mixup. At the intersection between the encoders and generators, we introduce a novel method to merge information coded by the encoders with information produced by the embeddings and the lower levels of the generators. The mixup module  [35] mixes two input variables with a weight parameter \(\beta \). The rationale behind this design is that during generation there is no data coming from the encoders, so the mixup is turned off and only information from the embeddings and the lower levels of the generators is passed forward. During reconstruction, we want our method to utilize the skip connections to improve performance and also to use the predicted embeddings \((\hat{v}_p,\hat{v}_c)\) to represent the object’s shape and style. The contrast between the two paths makes it difficult to optimize them simultaneously. The mixup simplifies this by keeping both paths active during the forward pass and back-propagation. In contrast to regular residual connections, the ever-changing \(\beta \) used in the mixup forces both inputs to be independent representations rather than complementing each other.

The mixup modules at the vector embeddings level (mixup1 and mixup2 in Fig. 2) mix the vectors (\(v_p, v_c\)) given by the embeddings (\(V_p, V_c\)), Eq. 2, with the predicted vectors (\(\hat{v}_p, \hat{v}_c\)) produced by the encoders (\(E_p, E_c\)), Eq. 7,8. The mixture of features leads to both the embeddings and the encoders being optimized for reconstructing the object. This has two benefits. First, it trains the encoders to properly code the images, which improves clustering and learns image-to-image translation implicitly. Second, it trains the embeddings to represent the real object classes, which improves the generation task.

The mixup modules at the skip connections (mixup3 and mixup4 in Fig. 2) mix the pre-image tensors (\(A_{bg}, A_{fg}\)), Eq. 3,4, with the bypass tensors (\(B_{bg},B_{fg}\)), Eq. 6,7. This creates the condition in which the reconstruction path depends simultaneously on the bypass and on the lower stages of the generators. This way, at any time we can choose any \(\beta \), or even pass only the bypass or only the pre-image, and obtain an almost identical image.

Given two inputs and a parameter \(\beta \), the mixup is defined as follows:

$$\begin{aligned} \begin{aligned} v_{p_{mix}} = v_p \circ (1 - \beta _1) + \hat{v}_p \circ \beta _1,&\quad v_{c_{mix}} = v_c \circ (1 - \beta _2) + \hat{v}_c \circ \beta _2 \\ A_{fg_{mix}} = A_{fg} \circ (1 - \beta _3) + B_{fg} \circ \beta _3,&\quad A_{bg_{mix}} = A_{bg} \circ (1 - \beta _4) + B_{bg} \circ \beta _4 \end{aligned} \end{aligned}$$
(9)

In our implementation, \(\beta _1,\beta _2 \in [0,1]\) and \(\beta _3,\beta _4 \in [0.5,1]\), are sampled in each iteration for each instance in the batch. At reconstruction, the mixed features (\(v_{p_{mix}}, v_{c_{mix}}, A_{fg_{mix}}, A_{bg_{mix}}\)) replace the features (\(v_{p}, v_{c}, A_{fg}, A_{bg}\)) in Eq. 3,4 as input to the generators. For illustration, please refer to Fig. 2.
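A minimal sketch of such a mixup module, following Eq. 9 and the \(\beta \) ranges stated above (\([0,1]\) for the embedding-level mixups, \([0.5,1]\) for the bypass mixups); the broadcasting details are an assumption:

```python
import torch

def mixup(a, b, low, high):
    # Blend a and b with a per-instance beta drawn uniformly from [low, high].
    beta = torch.empty(a.size(0), *([1] * (a.dim() - 1)), device=a.device)
    beta.uniform_(low, high)
    return a * (1.0 - beta) + b * beta

# v_p_mix  = mixup(v_p,  v_p_hat, 0.0, 1.0)  # mixup1
# v_c_mix  = mixup(v_c,  v_c_hat, 0.0, 1.0)  # mixup2
# A_fg_mix = mixup(A_fg, B_fg,    0.5, 1.0)  # mixup3
# A_bg_mix = mixup(A_bg, B_bg,    0.5, 1.0)  # mixup4
```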

GLU Layer Normalization. Following the StackGANv2 and FineGAN architectures, we apply GLU  [9] activations in the generators. Due to the multiple paths and the large scale and high complexity of our method, batch normalization was unstable at our low batch size, and increasing the batch size was not an option. As a solution, we switched to layer normalization, which is not affected by the batch size. We fused the normalization and activation into a single module termed “GLU Layer Normalization”. Given an input x with \(x_L,x_R\) representing an equal split along the channel axis (left/right): \( \text {GLU}(x_L,x_R) = x_L \circ \text {Sigmoid}(x_R)\), \(\text {GLU-LNorm}(x_L,x_R) = \text {GLU}(\text {LNorm}(x_L),x_R)\). In this formulation, the normalization is only applied to \(x_L\); the input to the sigmoid, \(x_R\), is not normalized. This is favorable, because \(x_R\) serves as a mask on \(x_L\), and normalizing it across the channels would contradict this goal.
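A minimal sketch of this fused module, assuming NCHW inputs and normalization of the left half over its channel and spatial dimensions (the exact normalized shape is an implementation assumption):

```python
import torch
import torch.nn as nn

class GLULayerNorm(nn.Module):
    def __init__(self, channels, spatial_shape):
        super().__init__()
        # LayerNorm over (C/2, H, W) of the left half only.
        self.norm = nn.LayerNorm([channels // 2, *spatial_shape])

    def forward(self, x):
        x_l, x_r = x.chunk(2, dim=1)                 # split along the channel axis
        return self.norm(x_l) * torch.sigmoid(x_r)   # x_r gates x_l and stays un-normalized
```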

Discriminators. Following FineGAN, the two discriminators are adversarial opponents on the outputs \(I_{bg}, I\). The background discriminator \(D_{bg}\) has two tasks, with a separate output for each: (i) patch-wise prediction of whether the input image is real or fake, when presented with either a real or a fake background image, denoted \(D_{bg_A}\); (ii) patch-wise prediction of whether the input image is a background image or not, when presented with either a real background image or a real object image, denoted \(D_{bg_B}\). The background generator is thereby optimized to generate images that look like real images and do not contain object features. In addition, when performing the reconstruction path on fake images, we extract a hidden-layer output and apply a perceptual loss between generated and reconstructed backgrounds, denoted \(D_{bg_C}\), to reduce the perceptual distance between the original and the reconstructed image.

The image discriminator \(D_c\) receives real images from \(X_c\) or generated fake images, and also has two tasks: (i) predict whether the input image is real or fake, denoted \(D_{c_A}\); (ii) predict the child class \(\phi _c\) of the image, denoted \(D_{c_B}\), as in all InfoGAN-influenced methods. This trains the foreground generator to generate images that look real and represent the conditioned child class. In addition, we also extract a hidden-layer output and apply a perceptual loss between generated and reconstructed foreground images, denoted \(D_{c_C}\).

4 Training

To train the model to solve the various tasks, in each iteration we perform two different paths through the model by connecting the various sub-networks in a specific order.

4.1 Generation Path

The generation path is described in Fig. 1 and Eq. 2–5. For illustrations, see Fig. 3. The inputs for this path are \(e_{bg},e_{p},e_{c},z\). During generation, the model learns to generate the image I by generating background \(I_{bg}\), foreground \(I_{fg}\), and mask \(I_m\) images. The discriminators are trained along with the generators and produce an adversarial training signal. In addition, the encoders are trained to retrieve the latent variables from the generated images, as a self-supervised task.

The losses in this path can be put into four groups: adversarial losses, classification losses, distance losses, and regularizations. For brevity, \(e\) represents the dependence on all prior codes (\(e_{bg}, e_p, e_c\)). Similarly, \(G(e, z)\) represents the full generation of the final image, Eq. 3–5.

Adversarial losses. These involve the two discriminators and are derived from the minimax equation: \(\min _G \max _D \mathbb {E}_x[\log (D(x))] + \mathbb {E}_z[\log (1 - D(G(z)))]\), for a generic generator G and discriminator D. The concrete GAN loss is the sum of the losses for the separation between real/fake background, the separation between background/object and the separation between real/fake object.

For the discriminators, where \(X_{bg}, X_c\) are the sets of real background images and real object images, the losses are:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{D_{bg_A}} =&\mathbb {E}_{x\sim X_{bg}}[\log (D_{bg_A}(x))] + \mathbb {E}_{e_{bg},z}[\log (1 - D_{bg_A}(G_{bg}(e_{bg}, z)))] \\ \mathcal {L}_{D_{bg_B}} =&\mathbb {E}_{x\sim X_{bg}}[\log (D_{bg_B}(x))] + \mathbb {E}_{x\sim X_c}[\log (1 - D_{bg_B}(x))] \\ \mathcal {L}_{D_{c_A}} =&\mathbb {E}_{x\sim X_c}[\log (D_{c_A}(x))] + \mathbb {E}_{e,z}[\log (1 - D_{c_A}(G(e, z)))] \\ \mathcal {L}_D =&10 \cdot \mathcal {L}_{D_{bg_A}} + \mathcal {L}_{D_{bg_B}} + \mathcal {L}_{D_{c_A}} \end{aligned} \end{aligned}$$
(10)

For the generators, the losses are:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{G_{bg_A}} =&\mathbb {E}_{e_{bg},z}[\log (D_{bg_A}(G_{bg}(e_{bg}, z)))], \mathcal {L}_{G_{bg_B}} = \mathbb {E}_{e_{bg},z}[\log (D_{bg_B}(G_{bg}(e_{bg}, z)))] \\ \mathcal {L}_{G_{c_A}} =&\mathbb {E}_{e,z}[\log (D_{c_A}(G(e, z)))], \mathcal {L}_G = 10 \cdot \mathcal {L}_{G_{bg_A}} + \mathcal {L}_{G_{bg_B}} + \mathcal {L}_{G_{c_A}} \end{aligned} \end{aligned}$$
(11)
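In implementation terms, the adversarial terms of Eq. 10–11 correspond to the usual binary cross-entropy formulation of the minimax objective; a minimal sketch, assuming the discriminators output raw (patch-wise) logits:

```python
import torch
import torch.nn.functional as F

def d_adv_loss(real_logits, fake_logits):
    # Discriminator side: real -> 1, fake -> 0.
    return (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
            + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))

def g_adv_loss(fake_logits):
    # Generator side: push fakes toward the "real" label.
    return F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))

# Weighted sums as in Eqs. (10)-(11):
# L_D = 10 * L_D_bgA + L_D_bgB + L_D_cA
# L_G = 10 * L_G_bgA + L_G_bgB + L_G_cA
```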

Classification losses. These losses optimize the generators to generate distinguishable images for each style and shape prior and optimize the encoders to retrieve the prior classes. We use the cross-entropy loss between the conditioned classes (\(\phi _p, \phi _c\)) and the encoders’ predictions (\(\hat{e}_p, \hat{e}_c\)) from Eq. 7,8. In addition, we use the auxiliary task \(D_{c_B}\).

$$\begin{aligned} \mathcal {L}_E = \text {CE}(\hat{e}_p, \phi _p) + \text {CE}(\hat{e}_c, \phi _c) + \text {CE}(D_{c_B}(I), \phi _c) \end{aligned}$$
(12)

Distance Losses. We train the encoders to minimize the mean squared error between the vectors in the latent space produced during generation and their predicted counterparts. These vectors are used in the reconstruction path, so this self-supervised task assists in that regard. We minimize the distance between the pre-images and bypasses (\(A_{bg}, A_{fg}, B_{bg}, B_{fg}\)), and between the latent vectors (\(v_p, v_c\)) and the mean vectors \(\mu _p,\mu _c\) used to sample the latent codes (\(\hat{v}_p, \hat{v}_c\)).

$$\begin{aligned} \mathcal {L}_{\text {MSE}} = \text {MSE}(v_c, \mu _c) + \text {MSE}(v_p, \mu _p) + \text {MSE}(A_{fg}, B_{fg}) + \text {MSE}(A_{bg}, B_{bg}) \end{aligned}$$
(13)

Regularization Losses. For regularization, a loss term is applied on the latent codes (\(v_{bg}, v_p, v_c\)), annotated as \(\mathcal {L}_{R_v}\), and on the foreground mask \(I_m\), annotated as \(\mathcal {L}_{R_M}\). They are detailed in the supplementary.

All the losses are summed together to the total loss:

$$\begin{aligned} \mathcal {L}_{\text {GEN}} = \mathcal {L}_G + \mathcal {L}_E + \mathcal {L}_{\text {MSE}} + 0.1 \cdot \mathcal {L}_{R_v} + 2 \cdot \mathcal {L}_{R_M} \end{aligned}$$
(14)

4.2 Reconstruction Path

The reconstruction path is described in Fig. 2. For illustrations, see Fig. 4. The input is an image I. The precise flow is: (1) encode the foreground through the shape and style encoders (\(E_p, E_c\)), Eq. 7,8, (2) generate a foreground image and mask with the foreground generator (\(G_{fg}\)), Eq. 4, (3) encode the image and mask through the background encoder (\(E_{bg}\)), Eq. 6, (4) generate the background image with the background generator (\(G_{bg}\)), Eq. 3, and (5) compose the final image from \(I_{fg}, I_{bg}, I_m\), Eq. 5. In addition, the mixup of Eq. 9 is applied between encoding and generation. This path optimizes the clustering and segmentation tasks directly and also implicitly optimizes the generation task by reconstructing real images. We perform the reconstruction path on both real images and generated images from the generation path. This fully utilizes the information available to learn all tasks with minimal supervision.

The losses in this path can be put in three groups: statistical losses, reconstruction losses, and perceptual losses.

Statistical Losses. As in Variational Auto-Encoders  [18], we apply a Kullback-Leibler divergence between the distributions of the latent variables encoded by the encoders (\(\hat{v}_p, \hat{v}_c, \hat{z}\)) and a multivariate Gaussian prior. For the pose vector z, we use the standard normal distribution with a covariance matrix equal to the identity matrix (\(\varSigma = I_{d_z}\)) and a zero mean vector (\(\mu = \mathbf{0} \)). For the shape and style vectors (\(\hat{v}_p, \hat{v}_c\)) we still use an identity \(\varSigma \), but since they should match their latent codes (\(v_p, v_c\)), we use these latent codes as the target means.

$$\begin{aligned} \begin{aligned} \mathcal {L}_{\text {VAE}_p}&= D_{\mathrm {KL}}(\mathcal {N}(\mu _{p}, \text {diag}(\sigma _{p})) \Vert \mathcal {N}(v_p, I_{d_p})) \\ \mathcal {L}_{\text {VAE}_c}&= D_{\mathrm {KL}}(\mathcal {N}(\mu _{c}, \text {diag}(\sigma _{c})) \Vert \mathcal {N}(v_c, I_{d_c})) \\ \mathcal {L}_{\text {VAE}_z}&= D_{\mathrm {KL}}(\mathcal {N}(\mu _{z}, \text {diag}(\sigma _{z})) \Vert \mathcal {N}(\mathbf{0} ,~~~ I_{d_z})) \\ \mathcal {L}_{\text {VAE}}&= \mathcal {L}_{\text {VAE}_p} + \mathcal {L}_{\text {VAE}_c} + \mathcal {L}_{\text {VAE}_z} \end{aligned} \end{aligned}$$
(15)
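Since the target covariance is the identity, each term in Eq. 15 has a simple closed form; a minimal sketch, assuming \(\sigma \) denotes the per-element standard deviation:

```python
import torch

def kl_to_unit_cov_gaussian(mu, sigma, target_mu):
    # Closed form of KL( N(mu, diag(sigma^2)) || N(target_mu, I) ),
    # summed over the latent dimension and averaged over the batch.
    var = sigma.pow(2)
    return 0.5 * torch.sum(var + (mu - target_mu).pow(2) - 1.0 - torch.log(var), dim=-1).mean()

# L_VAE = kl_to_unit_cov_gaussian(mu_p, sigma_p, v_p) \
#       + kl_to_unit_cov_gaussian(mu_c, sigma_c, v_c) \
#       + kl_to_unit_cov_gaussian(mu_z, sigma_z, torch.zeros_like(mu_z))
```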

Reconstruction Losses. The reconstruction losses are a set of L1 losses that compare the input image to the output. The network is trained to reconstruct both real and fake images. For fake images, we have the extra self-supervision of also comparing the reconstructions of the background image and the foreground mask.

$$\begin{aligned} \mathcal {L}_{\text {REC}} = {\left\{ \begin{array}{ll} \text {L1}(I, \hat{I}) &{}, \text {real}\\ \text {L1}(I, \hat{I}) + \text {L1}(I_{bg}, \hat{I}_{bg}) + \text {L1}(I_m, \hat{I}_m) &{}, \text {fake} \end{array}\right. } \end{aligned}$$
(16)
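A minimal sketch of Eq. 16; the background and mask terms are only available for fake inputs, where the targets exist by construction:

```python
import torch.nn.functional as F

def reconstruction_loss(I, I_rec, is_fake, I_bg=None, I_bg_rec=None, I_m=None, I_m_rec=None):
    loss = F.l1_loss(I_rec, I)
    if is_fake:
        # Self-supervised targets taken from the generation path.
        loss = loss + F.l1_loss(I_bg_rec, I_bg) + F.l1_loss(I_m_rec, I_m)
    return loss
```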

Perceptual Losses. Comparing images to their ground-truth counterparts is known to produce blurred images; a perceptual loss  [15] is known to aid in producing sharper images with more visible context  [36] by also comparing the images at the feature level. The perceptual loss is often used with a pre-trained network, but this relies on added supervision. In our case, we use the discriminators as feature extractors. We use the notation \(D_{bg_C}, D_{c_C}\) from Sect. 3 to describe the extraction of the hidden layers used for this comparison.

$$\begin{aligned} \mathcal {L}_{\text {PER}} = {\left\{ \begin{array}{ll} \text {L2}_{D_{c_C}}(I, \hat{I}) &{} , \text {real}\\ \text {L2}_{D_{c_C}}(I, \hat{I}) + \text {L2}_{D_{bg_C}}(I_{bg}, \hat{I}_{bg}) &{} , \text {fake} \end{array}\right. } \end{aligned}$$
(17)

All the losses are summed together to the total loss:

$$\begin{aligned} \mathcal {L}_{\text {AE}} = \mathcal {L}_{\text {GEN}} + \mathcal {L}_{\text {VAE}} + \mathcal {L}_{\text {REC}} + \mathcal {L}_{\text {PER}} \end{aligned}$$
(18)

4.3 Multi-phase Training

In order to simplify training, instead of training both paths at once, we schedule the training process in phases. The phases are designed to train the network for a gradually increasing subset of tasks, starting from image-level tasks (generating images) and moving to semantic tasks (semantic segmentation of the foreground and semantic clustering) that benefit from the capabilities obtained in the generation path. In the first phase we only perform the generation path (Sect. 4.1), and in the second phase we add the reconstruction path (Sect. 4.2).

Without multi-phase training, the networks would be trained to generate and reconstruct images simultaneously. While the generation flow encourages a separation between the background and foreground components, the reconstruction flow resists this separation due to the trivial solution of encoding and decoding the image in only one of the branches (foreground or background) and applying an all-zero or all-one mask. In the experiments (Tables 1, 2), we show that without multi-phase training the model is incapable of learning any task.

In this controlled environment, the generators are much more likely to converge to the required setting. After a sufficient number of iterations, determined in advance by a hyper-parameter, the second phase begins, in which the model is also trained to reconstruct images; this trains the encoders on top of the generators instead of breaking them.

When entering Phase II, the fake images for both discriminators can be the result of either (i) the generation path, (ii) fake image reconstruction, or (iii) real image reconstruction. We noticed that images from the reconstruction paths fail to converge to real-looking images when the discriminators are only trained on the generation path outputs. We hypothesize that this is because each path produces images from a different source domain; these paths can generate very different images during training, and the discriminators become overwhelmed by the different tasks and are unable to optimize them simultaneously. To solve this, upon entering Phase II, we clone each discriminator (\(D_c, D_{bg}\)) twice and associate a separate clone with each path, resulting in a total of three background discriminators and another three for the foreground. In this setting, each path receives an adversarial signal that is concentrated only on improving that path.
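A minimal sketch of this cloning step at the start of Phase II; the use of deep copies and the per-path dictionary are implementation assumptions:

```python
import copy

def clone_discriminators(D_c, D_bg):
    # One dedicated adversary per path: generation, fake reconstruction, real reconstruction.
    paths = ("generation", "fake_reconstruction", "real_reconstruction")
    D_c_per_path = {p: (D_c if p == "generation" else copy.deepcopy(D_c)) for p in paths}
    D_bg_per_path = {p: (D_bg if p == "generation" else copy.deepcopy(D_bg)) for p in paths}
    return D_c_per_path, D_bg_per_path
```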

5 Experiments

Fig. 3.

Image Generation for each dataset. From top to bottom: (i) final image, (ii) foreground, (iii) foreground mask, (iv) background.

Fig. 4.

Image reconstruction for each dataset. From top to bottom: (i) real image, (ii) reconstructed image, (iii) reconstructed foreground, (iv) reconstructed background, (v) ground-truth foreground mask, (vi) predicted foreground mask.

Fig. 5.

Conditional Generation. From left to right: (i) real images, (ii-vi) generation of images with the encoded parent and child codes and a different vector z per column, (vii) FineGAN  [26] + our encoders, (viii) StackGANv2  [34] + our encoders.

Fig. 6.

Style Transfer. From left to right: (i) real images. (ii-vi) reconstructed images when the child code \(e_c\) is switched with a code from a selected category, (vii) FineGAN  [26] + our encoders, (viii) StackGANv2  [34] + our encoders.

Table 1. Quantitative generation results. FID\(\downarrow \), IS\(\uparrow \), CFID\(\downarrow \), CIS\(\uparrow \)
Table 2. Segmentation and clustering results. \(^\mathsection \)Unfair upper-bound results, obtained by selecting the best result out of many. \(^\dag \)Provided by [26]. \(^*\)Model performed the task by using our encoders. Missing entries indicate that the model cannot perform the task. Higher is better in all scores.
Table 3. Ablation studies on CUB: (a) normalization methods, (b) modules’ behaviour, and (c) losses. Measuring FID and C-IS for generation and IOU for segmentation.

We train the network for 600,000 iterations with a batch size of 20. All sub-networks are optimized using Adam [19] with a learning rate of 2e-4. Phase I lasts 200,000 iterations and Phase II 400,000. Within Phase II, we start by training only on fake images; real image reconstruction starts after another 200,000 iterations.
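The schedule above can be summarized as a small configuration sketch (only the numbers stated in the text are used; the structure is an assumption):

```python
# Training schedule as stated in the text.
SCHEDULE = {
    "total_iterations": 600_000,
    "batch_size": 20,
    "optimizer": {"name": "Adam", "lr": 2e-4},
    "phase_1_iterations": 200_000,           # generation path only
    "phase_2_start": 200_000,                # add fake-image reconstruction
    "real_reconstruction_start": 400_000,    # add real-image reconstruction
}
```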

We evaluate our model on various tasks against the state of the art methods. Since no other model can solve all these tasks, we evaluate against different methods in each task. Depending on availability, some baselines were pre-trained models released by the authors and some were trained from scratch with the authors’ official code and instructions.

Datasets. We evaluate our model with three datasets of fine-grained categorization. Caltech-UCSD Birds-200–2011 (Birds)  [29]: This dataset consists of 11,788 images of 200 classes of birds, annotated with bounding boxes and segmentation masks. Stanford Dogs (Dogs)  [20]: This dataset consists of 20,580 images of 120 classes of dogs, annotated with bounding boxes. For evaluation, target segmentation masks were generated by a pre-trained DeepLabV3  [5] model on the COCO  [22] dataset. The pre-trained model was acquired from the gluoncv toolkit  [12]. Stanford Cars (Cars)  [17]: This dataset consists of 16,185 images of 196 classes of cars, annotated with bounding boxes. Segmentation masks were generated as above with the pre-trained DeepLabV3 model.

Similarly to FineGAN, before training the model we produced a background subset by cutting background patches using the bounding boxes. Beyond this, the bounding boxes were not used in any other way to train our method, and we made sure that no image was used for both foreground and background examples. This was done by splitting the dataset in an 80/20 ratio, using the larger subset as the foreground set \(X_c\) and only the smaller subset as the background set \(X_{bg}\).

Since the number of classes differs between the datasets, the number of child and parent classes in the design also differs per dataset. Birds: \(N_C=200, N_P=20\); Dogs: \(N_C=120, N_P=12\); Cars: \(N_C=196, N_P=14\).

Image Generation. We compare our image generation results to FineGAN  [26] and StackGANv2  [34], by relying on an InceptionV3 fine-tuned on each dataset. We evaluate our method in both IS  [25] and FID  [13]. In addition, we measure the conditional variants of these metrics (CIS, CFID), as presented in [2]. The conditional metrics measure the similarity between real and fake images within each class, which cannot be measured by the unconditional metrics.

Our results, reported in Table 1, show that OneGAN outperforms the baselines in both conditional and unconditional image generation. In unconditional generation, our method and the baselines perform roughly the same, since the generators are very similar. In conditional generation, our method improves on the baselines by a large margin. StackGANv2 is the worst performing model. This suggests that the mask-based generation, which FineGAN and our method rely on, yields a stronger conditioning on the object in the image. In addition, our multi-path training method improves conditional generation further, as shown in the ablation tests. For an illustration of conditional generation, see Fig. 5.

Unsupervised Foreground Segmentation. We compare the mask predicted by the reconstruction path to the real foreground mask and evaluate according to the IOU and DICE scores. We compare against four baselines: ReDO  [6], WNet  [32], and UISB  [16], which are trained on each dataset separately, and IIC-seg  [14], which was trained on coco-stuff and coco-stuff-3 (a subset). While coco-stuff is a different dataset than the ones we used, it contains all the relevant classes. ReDO and WNet produce a foreground mask, which we compare to the ground truth in the same way we evaluate our model. UISB is an iterative method that produces a final segmentation with a varying number of classes between 2 and 100. We iterated UISB on each image 50 times; the output usually contained 4–20 classes. Since there is no labeling of the foreground or background classes, this method cannot be used directly for this task. To obtain an evaluation, we look, for each image, for the class that has the highest IOU with the ground-truth foreground and merge the rest of the classes into a single background class. We then repeat with a single background class and the rest merged into the foreground. Finally, we take the best of the two options, each obtained by using an oracle to select out of many options, which provides a liberal upper bound on the performance of UISB. IIC also produces a multi-class segmentation map; we use it in the same way as UISB, by taking the best class for either background or foreground with respect to IOU. IIC has a 2-headed output, one for the main task and one for over-clustering. For coco-stuff-trained IIC, we look for the best mask in one of the 15 classes of the main head. For coco-stuff-3-trained IIC, the main head is trained to cluster sky/ground/plants, so we look for the best mask in one of the 15 classes of the over-clustering head. FineGAN cannot perform segmentation, since it does not have a reconstruction path, but we added an additional baseline by training FineGAN with our encoders to allow such a path. The results in Table 2 show that our method outperforms all the baselines. The ablation shows that the biggest contribution comes from the reconstruction path and the multi-phase scheduling.
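For clarity, a minimal sketch of this oracle evaluation for the multi-class baselines (UISB, IIC); the helper names are illustrative:

```python
import numpy as np

def iou(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 0.0

def oracle_foreground_iou(seg_map, gt_fg):
    # Best single class treated as foreground (rest merged into background),
    # and best single class treated as background (rest merged into foreground);
    # the better of the two gives a liberal upper bound.
    classes = np.unique(seg_map)
    best_as_fg = max(iou(seg_map == c, gt_fg) for c in classes)
    best_as_bg = max(iou(seg_map != c, gt_fg) for c in classes)
    return max(best_as_fg, best_as_bg)
```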

Unsupervised Clustering. We compare our method against JULE  [33], DEPICT  [10], and IIC-cluster  [14]. In addition, we add the baselines of StackGANv2 and FineGAN trained with our encoders. We evaluate how well the encoders cluster the real images. OneGAN outperforms the other methods on both Birds and Dogs. For Cars, our model is second after DEPICT. Looking at the generated images, this can be explained by the fact that our method clusters the cars based more on color and less on car model, which aligns with the lower conditional generation score for Cars than for the other datasets.

Image to Image Translation. To further evaluate our model, we show its capability to transfer an input image to a target category. The results can be seen in Fig. 6. Even though our model was never trained on this task, the disentanglement between the shape and the texture enables this translation simply by passing a different child code during reconstruction. In contrast, FineGAN and StackGANv2 are unable to perform this task correctly as there is no learned disentanglement in StackGANv2 and no bypass connection in FineGAN.

Object Removal and Inpainting. Our model is also capable of performing automatic object removal and background reconstruction, see Fig. 4. Due to the lack of a perfect ground-truth mask, our model does not merely fill in the missing pixels but fully reconstructs the background image. As a result, the background image is not identical to the original background, but it is semantically similar to it. We compare our method with previous work in the supplementary.

Ablation Study. In Tables 1 and 2, we provide multiple versions of our method for ablation. In the version without real reconstruction, we only add fake image reconstruction in Phase II, meaning that real images do not pass through the network during training. Another variant employs only the first phase of training. Finally, a third variant trains without multi-phase scheduling. These tests show the contribution of the multiple paths and the multi-phase scheduling. In Table 3, we provide an extensive ablation study of three aspects. In (a), we compare layer and instance normalization  [28] in the generators; our “GLU layer normalization” outperforms all other options. In (b), we turn off the intersection modules between the encoders and generators; the experiment shows that these modules strongly improve the CIS, which explains why our method outperforms FineGAN and StackGANv2 in conditional generation. In (c), we evaluate the contribution of selected novel losses, which affect all scores. Together, these experiments show the contribution of the proposed novelties in our method.

6 Conclusions

By building a single model to handle multiple unsupervised tasks at once, we convincingly demonstrate the power of co-training, surpassing the performance of the best-in-class methods for each task. This capability is enabled by a complex architecture with many sub-networks. Supporting this complexity during training is challenging, however. We introduce a mixup module that integrates multiple pathways in a homogenized manner and a multi-phase training scheme, which helps prevent some tasks from dominating the others.