1 Introduction

In recent years, conditional generation of medical images using conditional Generative Adversarial Networks (cGANs) has become a popular research area [20, 36]. One common pitfall of cGANs is that the conditioning codes are extremely high-level and do not cover the nuances of the data. This challenge is exacerbated in the medical imaging domain, where insufficient label granularity is a common occurrence. We refer to the factors of variation that depend on the conditioning vector as content. Another challenge in conditional image generation is that the image distribution also contains factors of variation that are agnostic to the conditioning code. These types of information are shared among different classes or different conditioning codes. In this work we refer to such information as style, which, depending on the task, could correspond to position, orientation, location, background information, etc. Learning disentangled representations of content and style allows us to control the detailed nuances of the generation process.

In this work, we consider two types of information that govern the image domain: content and style, which are by definition independent, and this independence criterion should be taken into account when training a model. By explicitly constraining the model to disentangle content and style, we ensure their independence and prevent information leakage between them. To achieve this goal, we introduce Dual Regularized Adversarial Inference (DRAI), a conditional generative model that leverages unsupervised learning and novel disentanglement constraints to learn disentangled representations of content and style, which in turn enables more control over the generation process.

We impose two novel disentanglement constraints to facilitate this process. First, we introduce a novel application of the Gradient Reversal Layer (GRL) [16] to minimize the information shared between the two variables. Second, we present a new type of self-supervised regularization to further enforce disentanglement; using content-preserving transformations, we attract matching content information while repelling the corresponding style information.

We compare the proposed method with multiple baselines on two datasets. We show the advantage of using two latent variables to represent style and content for conditional image generation. To quantify style-content disentanglement, we introduce a disentanglement measure and show the proposed regularizations can improve the separation of style and content information. The contributions of this work can be summarized as follows:

  • To the best of our knowledge, this is the first time disentanglement of content and style has been explored in the context of medical image generation.

  • We introduce a novel application of GRL that penalizes shared information between content and style in order to achieve better disentanglement.

  • We introduce a self-supervised regularization that encourages the model to learn independent information as content and style.

  • We introduce a quantitative content-style disentanglement measure that does not require any content or style labels. This is especially useful in real-world scenarios where the attributes contributing to content and style are not available.

2 Method

2.1 Overview

Let \(\boldsymbol{t}\) be the conditioning vector associated with image \(\boldsymbol{x}\). Using the pairs \(\{(\boldsymbol{t}_i,\boldsymbol{x}_i)\}_{i=1}^{N}\), where \(N\) denotes the size of the dataset, we train an inference model \(G_{c,z}\) and a generative model \(G_x\) such that (i) the inference model \(G_{c,z}\) infers content \(\boldsymbol{c}\) and style \(\boldsymbol{z}\) in such a way that they are disentangled from each other, and (ii) the generator \(G_x\) can generate realistic images that respect not only the conditioning vector \(\boldsymbol{t}\) but also the style/content disentanglement. An illustration of DRAI is provided in Fig. 1.
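For concreteness, the following is a minimal PyTorch sketch of the interfaces of the two modules; the names (\(G_{c,z}\) as `InferenceModel`, \(G_x\) as `Generator`), the deterministic heads, and the dimensions are illustrative assumptions rather than the exact DRAI architecture.

```python
# A minimal interface sketch, not the exact DRAI architecture.
import torch
import torch.nn as nn

class InferenceModel(nn.Module):
    """G_{c,z}: infers a content code c and a style code z from an image x."""
    def __init__(self, encoder, feat_dim, content_dim=64, style_dim=64):
        super().__init__()
        self.encoder = encoder                        # any image feature extractor
        self.to_c = nn.Linear(feat_dim, content_dim)  # content head
        self.to_z = nn.Linear(feat_dim, style_dim)    # style head

    def forward(self, x):
        h = self.encoder(x).flatten(1)
        return self.to_c(h), self.to_z(h)             # (c_hat, z_hat)

class Generator(nn.Module):
    """G_x: generates an image from a style code z and a content code c."""
    def __init__(self, decoder):
        super().__init__()
        self.decoder = decoder                        # maps [z; c] to an image

    def forward(self, z, c):
        return self.decoder(torch.cat([z, c], dim=1))
```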

It is worth noting that our generative module is not constrained to require a style image. Having a probabilistic generative model allows us to sample the style code from the style prior distribution and generate images with random style attributes. The framework also allows us to generate hybrid images by mixing style and content from various sources (details can be found in Sect. B.2).
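As a hedged illustration of these two modes of generation, the sketch below samples the style code from a standard normal prior for a fixed content code and mixes style and content codes inferred from two different source images; the interfaces of `G_cz` and `G_x` follow the sketch above and are assumptions.

```python
# A minimal sketch, assuming G_cz(x) returns (c_hat, z_hat) and G_x(z, c)
# returns an image batch; names and dimensions are illustrative.
import torch

@torch.no_grad()
def generate_with_random_style(G_x, c, num_samples=8, style_dim=64):
    """Generate images for a fixed content code c (shape (1, content_dim))
    with styles drawn from the prior N(0, I)."""
    z = torch.randn(num_samples, style_dim)   # style ~ prior
    c = c.expand(num_samples, -1)             # repeat the content code
    return G_x(z, c)

@torch.no_grad()
def generate_hybrid(G_cz, G_x, x_style, x_content):
    """Mix the style inferred from one image with the content inferred from another."""
    _, z_style = G_cz(x_style)       # keep style from the style reference
    c_content, _ = G_cz(x_content)   # keep content from the content reference
    return G_x(z_style, c_content)
```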

Fig. 1. Overview of DRAI. The dashed purple arrows mark the cycle consistency between features, implemented via the \(\ell _1\) norm, while the solid purple arrows show the imposed disentanglement constraints. On the right-hand side of the figure we show all the discriminators used for training. \(\hat{\boldsymbol{c}}\) represents the inferred content, \(\hat{\boldsymbol{z}}\) the inferred style, \(\hat{\boldsymbol{x}}\) the reconstructed input image, and \(\bar{\boldsymbol{x}}\) the image with mismatched conditioning.

2.2 Dual Adversarial Inference (DAI)

We follow the formulation of [30] for Dual Adversarial Inference (DAI), a conditional generative model that uses bidirectional adversarial inference [14, 15] to learn content and style variables from the image data. To impose alignment between the conditioning vector \(\boldsymbol{t}\) and the generated image \(\boldsymbol{\tilde{x}}\), we seek to match \(p(\boldsymbol{\tilde{x}},\boldsymbol{t})\) with \(p(\boldsymbol{x},\boldsymbol{t})\). To do so, we adopt the matching-aware discriminator proposed by [40]. For this discriminator, denoted as \(D_{x,t}\), the positive sample is the pair of a real image and its corresponding conditioning vector \((\boldsymbol{x}, \boldsymbol{t})\), whereas the negative sample pairs consist of two groups: the pair of a real image with mismatched conditioning \((\bar{\boldsymbol{x}}, \boldsymbol{t})\), and the pair of a synthetic image with its corresponding conditioning \((G_x(\boldsymbol{z}, \boldsymbol{c}), \boldsymbol{t})\). In order to retain the fidelity of the generated images, we also train a discriminator \(D_x\) that distinguishes between real and generated images. The loss function imposed by \(D_{x,t}\) and \(D_x\) is as follows:

$$\begin{aligned}&\mathop {\min }\limits _G \mathop {\max }\limits _D V_{\text {t2i}}(D_x, D_{x,t}, G_x) = \mathbb {E}_{p_\text {data}}[\log D_x(\boldsymbol{x})] + \mathbb {E}_{p(\boldsymbol{z}), q(\boldsymbol{c})}[\log (1 - D_x(G_x(\boldsymbol{z}, \boldsymbol{c})))] +\\&\mathbb {E}_{p_\text {data}}[\log D_{x,t}(\boldsymbol{x},\boldsymbol{t})] + \frac{1}{2} \big \{ \mathbb {E}_{p_\text {data}}[\log (1 - D_{x, t}(\bar{\boldsymbol{x}}, \boldsymbol{t}))] + \mathbb {E}_{p(\boldsymbol{z}), q(\boldsymbol{c}),p_\text {data}}[\log (1 - D_{x, t}(G_x(\boldsymbol{z}, \boldsymbol{c}), \boldsymbol{t}))] \big \}, \end{aligned}$$

where \(\boldsymbol{\tilde{x}} = G_x(\boldsymbol{z}, \boldsymbol{c})\) is the generated image and \((\bar{\boldsymbol{x}}, \boldsymbol{t})\) designates a mismatched pair.
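As an illustration, a minimal PyTorch sketch of the discriminator side of \(V_{\text {t2i}}\) follows; the assumption that \(D_x\) and \(D_{x,t}\) output raw logits, as well as the function and variable names, are ours and not the reference implementation.

```python
# A hedged sketch of the discriminator loss in V_t2i; names are illustrative.
import torch
import torch.nn.functional as F

def d_loss_t2i(D_x, D_xt, x_real, t_real, x_mismatch, x_fake):
    ones = torch.ones(x_real.size(0), 1, device=x_real.device)
    zeros = torch.zeros_like(ones)
    bce = F.binary_cross_entropy_with_logits

    # unconditional real/fake terms
    loss = bce(D_x(x_real), ones) + bce(D_x(x_fake.detach()), zeros)

    # matching-aware terms: the real matched pair vs. the two negative groups,
    # i.e. (mismatched real image, t) and (generated image, t)
    loss += bce(D_xt(x_real, t_real), ones)
    loss += 0.5 * (bce(D_xt(x_mismatch, t_real), zeros)
                   + bce(D_xt(x_fake.detach(), t_real), zeros))
    return loss
```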

We use adversarial inference to infer the style and content codes from the image. Using the adversarial inference framework, we are interested in matching the conditional \(q(\boldsymbol{z},\boldsymbol{c}|\boldsymbol{x})\) to the posterior \(p(\boldsymbol{z},\boldsymbol{c}|\boldsymbol{x})\). Given the independence assumption of \(\boldsymbol{c}\) and \(\boldsymbol{z}\), we can use the bidirectional adversarial inference formulation separately for style and content. The dual adversarial inference objective is thus formulated as:

$$\begin{aligned} \min _G \max _D V_{\text {dALI}}(D_{x,z},D_{x,c}, G_x, G_{c,z}) = {\mathbb {E}_{q(\boldsymbol{x}), q(\boldsymbol{z},\boldsymbol{c}|\boldsymbol{x})}[\log D_{x,z}(\boldsymbol{x}, \hat{\boldsymbol{z}}) + \log D_{x,c}(\boldsymbol{x}, \hat{\boldsymbol{c}})] + } \\ {\mathbb {E}_{p(\boldsymbol{x}|\boldsymbol{z},\boldsymbol{c}), p(\boldsymbol{z}), p(\boldsymbol{c})}[\log (1 - D_{x,z}(\tilde{\boldsymbol{x}}, \boldsymbol{z})) + \log (1 - D_{x,c}(\tilde{\boldsymbol{x}}, \boldsymbol{c}))]. } \end{aligned}$$
(1)

To improve the stability of training, we include image-cycle consistency (\(V_{\text {image-cycle}}\)) [51] and latent code cycle consistency (\(V_{\text {code-cycle}}\)) objectives [12].
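The sketch below illustrates, under the same interface assumptions as above, the discriminator side of the dual adversarial inference objective together with simple \(\ell_1\) image- and code-cycle terms; the exact form of the cycle losses used in DRAI may differ.

```python
# A minimal sketch of the discriminator side of V_dALI and simple l1 cycle
# terms, assuming logit-output joint discriminators D_xz(x, z) and D_xc(x, c);
# names, shapes, and the alternating-update details are assumptions.
import torch
import torch.nn.functional as F

def d_loss_dali(D_xz, D_xc, x_real, c_hat, z_hat, x_fake, c_prior, z_prior):
    ones = torch.ones(x_real.size(0), 1, device=x_real.device)
    zeros = torch.zeros_like(ones)
    bce = F.binary_cross_entropy_with_logits
    # (real image, inferred code) pairs are labeled real,
    # (generated image, prior code) pairs are labeled fake;
    # stop-gradients for the alternating G/D updates are omitted for brevity.
    loss = bce(D_xz(x_real, z_hat), ones) + bce(D_xc(x_real, c_hat), ones)
    loss += bce(D_xz(x_fake, z_prior), zeros) + bce(D_xc(x_fake, c_prior), zeros)
    return loss

def cycle_losses(G_cz, G_x, x_real, z_prior, c_prior):
    # image cycle: x -> (c, z) -> x_hat, penalized with an l1 norm
    c_hat, z_hat = G_cz(x_real)
    image_cycle = (G_x(z_hat, c_hat) - x_real).abs().mean()
    # code cycle: (z, c) -> x_tilde -> (c_tilde, z_tilde)
    c_tilde, z_tilde = G_cz(G_x(z_prior, c_prior))
    code_cycle = (z_tilde - z_prior).abs().mean() + (c_tilde - c_prior).abs().mean()
    return image_cycle, code_cycle
```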

2.3 Disentanglement Constraints

The dual adversarial inference (DAI) encourages disentanglement through the independence assumption of style and content. However, it does not explicitly penalize entanglement. We introduce two constraints to impose style-content disentanglement. Refer to the Appendix for details.

Content-Style Information Minimization: We propose a novel application of the Gradient Reversal Layer (GRL) strategy [16] to explicitly minimize the shared information between style and content. We train an encoder \(F_c\) to predict the content from style and use GRL to minimize the information between the two. The same process is done for predicting style from content through \(F_z\). This constrains the content feature generation to disregard style features and the style feature generation to disregard content features.
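A standard way to realize gradient reversal is a custom autograd function that flips the gradient sign on the backward pass; the sketch below, including the squared-error prediction losses and the predictor interfaces \(F_c\) and \(F_z\), is an assumption about how this constraint can be implemented rather than the exact DRAI code.

```python
# A gradient reversal layer sketch (following the GRL idea of [16]).
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # pass the gradient through with its sign flipped (and scaled)
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Usage sketch: F_c predicts content from a gradient-reversed style code, so
# the style encoder is pushed to discard content information (and vice versa).
def grl_loss(F_c, F_z, c_hat, z_hat):
    c_from_z = F_c(grad_reverse(z_hat))   # content predicted from style
    z_from_c = F_z(grad_reverse(c_hat))   # style predicted from content
    return ((c_from_z - c_hat.detach()) ** 2).mean() + \
           ((z_from_c - z_hat.detach()) ** 2).mean()
```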

Self-supervised Regularization: We incorporate a self-supervised regularization such that the content is invariant to content-preserving transformations (such as a rotation, horizontal or vertical flip) while the style is sensitive to such transformations. More formally, we maximize the similarity between the inferred contents of \(\boldsymbol{x}\) and the transformed \(\boldsymbol{x}'\) while minimizing the similarity between their inferred styles. This constrains the content feature generation to focus on the content of the image reflected in the conditioning vector and the style feature generation to focus on the transformation attributes.
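A hedged sketch of this regularization is given below: under a content-preserving transform, the inferred contents are pulled together and the inferred styles pushed apart. The cosine-similarity form of the loss and the particular set of transforms are assumptions.

```python
# A minimal sketch of the self-supervised regularization; not the exact loss.
import random
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF

def content_preserving_transform(x):
    # e.g. a 90-degree rotation, a horizontal flip, or a vertical flip
    ops = [lambda im: TF.rotate(im, 90), TF.hflip, TF.vflip]
    return random.choice(ops)(x)

def self_supervised_loss(G_cz, x):
    x_t = content_preserving_transform(x)
    c, z = G_cz(x)
    c_t, z_t = G_cz(x_t)
    # attract matching content codes, repel the corresponding style codes
    return (1 - F.cosine_similarity(c, c_t, dim=1).mean()) \
           + F.cosine_similarity(z, z_t, dim=1).mean()
```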

DRAI is a probabilistic model that requires the reparameterization trick to sample from the approximate posteriors \(q(\boldsymbol{z}|\boldsymbol{x})\), \(q(\boldsymbol{c}|\boldsymbol{x})\) and \(q(\boldsymbol{c}|\boldsymbol{t})\). We use the KL divergence to regularize these posteriors toward the standard normal distribution \(\mathcal {N}(\boldsymbol{0}, \boldsymbol{I})\). Taking this into account, the complete objective criterion for DRAI is:

$$\begin{aligned} \begin{aligned} \min _G \max _{D, F} V_{\text {t2i}}+ V_{\text {dALI}}+ V_{\text {image-cycle}} + V_{\text {code-cycle}} + V_{\text {GRL}} + V_{\text {self}} + \\ \lambda D_{KL}(q(\boldsymbol{z}|\boldsymbol{x}) \, || \, \mathcal {N}(\boldsymbol{0}, \boldsymbol{I})) + \lambda D_{KL}(q(\boldsymbol{c}|\boldsymbol{x}) \, || \, \mathcal {N}(\boldsymbol{0}, \boldsymbol{I})) + \lambda D_{KL}(q(\boldsymbol{c}|\boldsymbol{t}) \, || \, \mathcal {N}(\boldsymbol{0}, \boldsymbol{I})). \\ \end{aligned} \end{aligned}$$
(2)
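As a brief illustration of the reparameterization and the KL penalties in Eq. (2), a standard diagonal-Gaussian form is sketched below; the assumption that each encoder head outputs a mean and a log-variance is ours.

```python
# A minimal sketch of the reparameterization trick and a diagonal-Gaussian KL
# term; variable names are hypothetical.
import torch

def reparameterize(mu, logvar):
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + eps * std                     # differentiable sample from q(.|.)

def kl_to_standard_normal(mu, logvar):
    # D_KL( N(mu, diag(exp(logvar))) || N(0, I) ), averaged over the batch
    return (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)).mean()

# The three KL penalties of Eq. (2), weighted by lambda (hypothetical names):
# kl = lam * (kl_to_standard_normal(mu_z_x, lv_z_x)
#             + kl_to_standard_normal(mu_c_x, lv_c_x)
#             + kl_to_standard_normal(mu_c_t, lv_c_t))
```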

3 Experiments

We conduct experiments on two publicly available medical imaging datasets: LIDC [4] and HAM10000 [46] (see Appendix for details on these datasets). To evaluate the quality of generation, inference, and disentanglement, we consider two types of baselines. To show the effectiveness of dual variable inference, we compare our framework with single latent variable models. For this, we introduce a conditional adaptation of InfoGAN [12], referred to as cInfoGAN, and a conditional adversarial variational autoencoder (cAVAE). We also compare DRAI to Dual Adversarial Inference (DAI) [30] and show how using our proposed disentanglement constraints together with latent code cycle-consistency can significantly boost performance. See Appendix for more details on the various baselines. Finally, we conduct rigorous ablation studies to evaluate the impact of each component of DRAI.

3.1 Generation Evaluation

To evaluate the quality and diversity of the generated images, we measure FID and IS (see Appendix Sect. D.3) for the proposed DRAI model and the double and single latent variable baselines described in Appendix Sect. D. The results are reported in Table 1 for both the LIDC and HAM10000 datasets. For the LIDC dataset, we observe that all methods have comparable IS scores, while DRAI and DAI have significantly lower FID than the other baselines, with DRAI performing best. For the HAM10000 dataset, DRAI once again achieves the best FID score, while D-cInfoGAN achieves the best IS.

Table 1. Comparison of image generation metrics (FID, IS) and the disentanglement metric (CIFC) on the HAM10000 and LIDC datasets for single and double variable baselines. CIFC is only evaluated for double variable baselines.
Fig. 2. Conditional generations on LIDC and HAM10000. The images are generated by keeping the content code (\(\boldsymbol{c}\)) fixed and sampling only the style codes (\(\boldsymbol{z}\)).

We highlight that while FID and IS are the most common metrics for the evaluation of GAN-based models, they do not provide an optimal assessment [5], and thus qualitative assessment is also needed. We use the provided conditioning vector for the generation process and only sample the style variable \(\boldsymbol{z}\). The generated samples are visualized in Fig. 2. In every sub-figure, the first column shows the reference image corresponding to the conditioning vector used for the image generation, and the remaining columns show synthesized images.

By fixing the content and sampling the style variable, we can discover the types of information that are encoded as style and content for each dataset. We observe that the learned content information corresponds to color and lesion size for HAM10000 and nodule size for LIDC, while the learned style information corresponds to location, orientation, and lesion shape for HAM10000 and background for LIDC. We also observe that DRAI is very successful in preserving the content information when there is no stochasticity in the content variable (i.e., \(\boldsymbol{c}\) is fixed). For the other baselines, sampling the style changes the content information of the generated images, which indicates information leakage from the content variable to the style variable. The results show that, compared to DAI and the other baselines, DRAI achieves better separation of style and content.

3.2 Style-Content Disentanglement

Achieving good style-content disentanglement in both the inference and generation phases is the main focus of this work. We conduct multiple quantitative and qualitative experiments to assess the quality of disentanglement in DRAI (our proposed method) as well as the competing baselines.

As a quantitative metric, we introduce the disentanglement error CIFC (refer to Appendix for details). Table 1 shows results on this metric. As seen from this table, in both HAM10000 and LIDC datasets, DRAI improves over DAI by a notable margin, which demonstrates the advantage of the proposed disentanglement regularizations; on one hand, the information regularization objective through GRL minimizes the shared information between style and content variables, and on the other hand, the self-supervised regularization objective not only allows for better control of the learned features but also facilitates disentanglement. In the ablation studies (Sect. 3.3), we investigate the effect of the individual components of DRAI on disentanglement.

Fig. 3. Qualitative evaluation of style-content disentanglement through hybrid image generation on the LIDC dataset. In every sub-figure, images in the first row present style image references and those in the first column present content image references. Hybrid images are generated by using the style and content codes inferred from the style and content reference images, respectively.

Fig. 4. Qualitative evaluation of style-content disentanglement through hybrid image generation on the HAM10000 dataset. In every sub-figure, images in the first row represent style image references and those in the first column represent content image references. Hybrid images are generated by using the style and content codes inferred from the style and content reference images, respectively.

To provide a more interpretable evaluation, we qualitatively assess the style-content disentanglement by generating hybrid images that combine style and content information from different sources (see Appendix for details on hybrid images). We can then evaluate the extent to which the style and content of the generated images respect the corresponding style and content of the source images. Figure 3 and Fig. 4 show these results on the two datasets. For the LIDC dataset, DAI and DRAI learn the CT image background as style and the nodule as content. This is because nodule characteristics such as nodule size are included in the conditioning vector, and thus the content tends to focus on those attributes. Thanks to the added disentanglement regularizations, DRAI has the best content-style separation among all the baselines and demonstrates clear decoupling of the two variables. Because of the self-supervised regularization objective, DRAI places more emphasis on capturing nodule characteristics as part of the content and background as part of the style. Overall, it is evident from the qualitative experiments that the proposed disentanglement regularizations help to decouple the style and content variables.

3.3 Ablation Studies

In this section, we perform ablation studies to evaluate the effect of each component on disentanglement using the CIFC metric. All ablated models use the same architecture with the same number of parameters. The quantitative assessment is presented in Table 2. We observe that on both LIDC and HAM10000, each added component improves over DAI, while the best performance is achieved when these components are combined to form DRAI.

Table 2. Quantitative ablation study on LIDC and HAM10000 datasets

4 Conclusion

We introduce DRAI, a framework for generating synthetic medical images that allows control over the style and content of the generated images. DRAI uses adversarial inference together with conditional generation and disentanglement constraints to learn content and style variables from the dataset. We compare DRAI quantitatively and qualitatively with multiple baselines and show its superiority in image generation in terms of quality, diversity, and style-content disentanglement. Through ablation studies and comparisons with DAI [30], we show the impact of imposing the proposed disentanglement constraints on the content and style variables.