1 Introduction

Deep learning has immense potential in medical imaging, but images taken by different machines or under different settings follow different distributions [4]. As a result, models usually do not generalize well enough to be used across datasets. In ultrasound in particular, images captured with different imaging settings on different subjects can differ so much that a model trained on one dataset may fail completely on another. Moreover, since annotating medical images is laborious work that requires experts, the number of annotated images is limited, and it is extremely difficult to obtain a large dataset of annotated medical images drawn from different distributions. We therefore propose a continuous neural style transfer algorithm that is capable of generating ultrasound images with known content from an unknown style latent space.

In the first attempt at style transfer with convolutional neural networks (CNNs), content and style features were extracted with a pre-trained VGG [21] and used to iteratively optimize the output image on the fly [6]. Feed-forward frameworks were then proposed to eliminate the numerous iterations during inference [10, 23]. The Gram matrix, which describes the style information in an image, was further explained in [14]. To control the style of the output, ways to manipulate spatial location, color information, and spatial scale were introduced in [7]. Adaptive Instance Normalization (AdaIN) was proposed as a way to perform arbitrary style transfer in real time [8]. Further improvements were made by generating the image transformation network directly from a style image via a meta network [20]. Several works use style transfer as a data augmentation tool [17, 25] and show that style transfer on natural images can improve the performance of classification and segmentation.

In medical image analysis, [22] directly applied style transfer to fundus images for augmentation, while [3] analyzed the style of ultrasound images encoded by VGG encoders. [16] improved the segmentation of cardiovascular MR images through style transfer. [5] built their network upon StyleGAN [11] to generate high-resolution medical images. [24] showed that medical images generated with style transfer can improve semantic segmentation on CT scans. [15] proposed a method for arbitrary style transfer on ultrasound images.

However, current methods can only generate one result given a content and a style image, and few can sample the style from a latent space. Works in the medical domain are mostly tested on images with similar styles, which limits their generalization ability. Since the style should follow a certain distribution rather than take specific values, we intend to generate multiple plausible output images given a content and a style image. Furthermore, we also want to sample the style from a continuous latent space so that we can generate images with unseen styles. We propose a variational style transfer approach for medical images with the following contributions: (1) To the best of our knowledge, our method is the first variational style transfer approach to explicitly sample the style from a latent space without giving the network a style reference image. (2) Our approach can be used to augment data for ultrasound images, which results in better segmentation. (3) The proposed method is able to transform ultrasound images taken in one style into an unobserved style.

2 Methods

Our style transfer network consists of three parts: a style encoder \(E_{s}\), a content encoder \(E_{c}\), and a decoder D. The network structure is shown in Fig. 1, where \(I_s\) is the style image, \(I_c\) is the content image, and \(\hat{I}\) is the output image. During training, the decoder D learns to generate \(\hat{I}\) from the Gaussian latent variables \(\mathbf {z}\) conditioned on the content image \(I_c\), while the style encoder \(E_s\) learns the distribution of \(\mathbf {z}\) given the style image \(I_s\). Before assembling the three parts, we pre-train the style encoder and the content encoder separately to provide better training stability. When generating images, our method can either use a given style or sample the style from the latent space, as sketched below.
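For concreteness, the generation procedure can be sketched in PyTorch-style pseudocode as follows. This is only a minimal sketch, not the exact implementation: the callables `E_c`, `E_s`, and `D` mirror the notation above, `E_s` is assumed to return per-scale Gaussian parameters, and the latent variables are assumed to share the spatial shape of the content features at each scale.

```python
import torch

def generate(E_c, E_s, D, I_c, I_s=None):
    """Sketch: encode the content, obtain a style either from a reference
    image or from the N(0, I) prior, then decode the stylized image."""
    f_c = E_c(I_c)                                  # multi-scale content features
    if I_s is not None:
        stats = E_s(I_s)                            # per-scale (mu, logvar) of q_phi(z | I_s)
        z = [mu + torch.randn_like(mu) * (0.5 * logvar).exp()
             for mu, logvar in stats]
    else:
        z = [torch.randn_like(f) for f in f_c]      # sample the prior directly (shapes assumed)
    return D(f_c, z)                                # decoder fuses content and style via AdaIN
```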

Fig. 1. The network structure of the proposed method, which consists of two encoders that process the style image and the content image separately, and a decoder. The style encoder outputs distributions \(q_{\phi }(\mathbf {z}|I_s)\) over the latent variables given the style image, so that the style can be varied continuously.

2.1 Style Encoder

The style encoder \(E_s\) is the encoder part of a U-Net [19] based Variational Autoencoder (VAE) [13]. Unlike a traditional VAE, ours has latent variables at multiple scales in order to generate images at better resolutions. The style encoder \(E_s\) approximates the distribution of the latent variables at each scale. In other words, it learns distributions \(q_{\phi }(z_i|I)\) that approximate the intractable true posterior \(p_{\theta }(z_i|I)\), where \(\phi \) denotes the variational parameters and \(\theta \) the generative model parameters. The variational lower bound can therefore be written as:

$$\begin{aligned} \mathcal {L}(\theta ,\phi ;I_s)=-KL(q_\phi (\mathbf {z}|I_s)||p_\theta (\mathbf {z}))+\mathbb {E}_{q_\phi (\mathbf {z}|I_s)}\log p_\theta (I_s|\mathbf {z}) \end{aligned}$$
(1)

where \(KL(\cdot )\) is the Kullback–Leibler (KL) divergence between two distributions, and we further assume \(\mathbf {z}|\theta \sim \mathcal {N}(0,1)\).

The encoder is then incorporated into the style transfer network, while the decoder of this VAE is only used during the initial pre-training. The structure of the style encoder is shown in Fig. 1; the decoder part is a traditional U-Net decoder.
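A minimal sketch of the per-scale variational head and the reparameterization step is given below. The single 3 \(\times \) 3 convolution used here to predict the Gaussian parameters is a simplification of the additional ConvBlock described in Sect. 2.5, and is an assumption for illustration rather than the original implementation.

```python
import torch
import torch.nn as nn

class StyleHead(nn.Module):
    """Per-scale head that maps encoder features to q_phi(z_i | I_s)."""
    def __init__(self, channels):
        super().__init__()
        # predicts 2*C channels: the mean and the log-variance of z_i
        self.to_stats = nn.Conv2d(channels, 2 * channels, kernel_size=3, padding=1)

    def forward(self, feat):
        mu, logvar = self.to_stats(feat).chunk(2, dim=1)
        return mu, logvar

def reparameterize(mu, logvar):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    return mu + torch.randn_like(mu) * (0.5 * logvar).exp()
```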

2.2 Content Encoder

The structure of the content encoder \(E_c\) is shown in Fig. 1. Like the style encoder \(E_s\), it is the encoder of a U-Net autoencoder, whose decoder mirrors the structure of the encoder exactly and is only used during the initial pre-training.

2.3 Decoder

The decoder takes in the encoded style and content and generates a new image; it is also designed based on a traditional U-Net, and its structure is shown in Fig. 1. Denote the content features as \(f_c=E_c(I_c)\) and the style features as \(f_s=E_s(I_s)\), and denote the content and style features at scale i as \(f_c^{(i)}\) and \(f_s^{(i)}\), respectively. The features of the output image at scale i, \(\hat{f}^{(i)}\), are treated as the input to the decoder of a normal U-Net at that scale. Inspired by [8], we utilize AdaIN to perform style transfer in feature space at each scale, computing \(\hat{f}^{(i)}\) from \(f_c^{(i)}\) and \(f_s^{(i)}\) as:

$$\begin{aligned} \hat{f}^{(i)}=AdaIN(f_c^{(i)},f_s^{(i)})=\sigma (f_s^{(i)})\left( \frac{f_c^{(i)}-\mu (f_c^{(i)})}{\sigma (f_c^{(i)})}\right) +\mu (f_s^{(i)}) \end{aligned}$$
(2)
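A direct implementation of Eq. (2) can be sketched as follows; the sketch assumes feature maps of shape (batch, channels, height, width) and computes \(\mu \) and \(\sigma \) per channel over the spatial dimensions, as in [8].

```python
import torch

def adain(f_c, f_s, eps=1e-5):
    """Eq. (2): re-normalize content features with the channel-wise
    statistics (mean, std) of the style features."""
    mu_c = f_c.mean(dim=(2, 3), keepdim=True)
    std_c = f_c.std(dim=(2, 3), keepdim=True) + eps
    mu_s = f_s.mean(dim=(2, 3), keepdim=True)
    std_s = f_s.std(dim=(2, 3), keepdim=True)
    return std_s * (f_c - mu_c) / std_c + mu_s
```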

2.4 Loss Functions

The training objective of the network is to generate an output image \(\hat{I}\) containing the contents in the content image \(I_c\) and having the style of the style image \(I_s\), all while maximizing the variational lower bound. Therefore, the loss function is made up of three parts: perceptual loss, style loss, and KL divergence loss.

The perceptual and style losses are based on high-level features extracted by a pre-trained VGG [21] and were first utilized in [6]. The perceptual loss is the difference between two feature maps encoded by a CNN. Since it takes spatial correlation into account, the perceptual loss \(L_p\) is regarded as a measure of the content similarity between two images and can be expressed as follows:

$$\begin{aligned} L_p=\sum _{i=1}^{N_{VGG}}w^p_i||\psi _i(I_c)-\psi _i(\hat{I})|| \end{aligned}$$
(3)

where \(\psi _i(x)\) denotes the features of x extracted at layer i of the VGG, \(w^p_i\) is the perceptual-loss weight at layer i, and \(N_{VGG}\) is the total number of VGG layers used.
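As an illustration, the perceptual loss of Eq. (3) can be computed with a frozen VGG feature extractor as sketched below. The use of torchvision's VGG-19 and the listed layer indices (approximately the block*_conv1 activations referenced in Sect. 2.5) are assumptions for the sketch, not a description of the original implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

class VGGFeatures(nn.Module):
    """Frozen VGG-19 returning activations at selected layers."""
    def __init__(self, layer_ids=(1, 6, 11, 20, 29)):  # roughly block1_conv1 ... block5_conv1
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg, self.layer_ids = vgg, set(layer_ids)

    def forward(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(x)
        return feats

def perceptual_loss(feats_c, feats_out, weights):
    """Eq. (3): weighted distance between content and output feature maps."""
    return sum(w * torch.norm(fc - fo)
               for w, fc, fo in zip(weights, feats_c, feats_out))
```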

Denote the number of channels, height, and width of the feature map at layer i as \(C_i\), \(H_i\), and \(W_i\), respectively, and denote the \(C_i \times C_i\) Gram matrix of the feature map of image x at layer i as \(G_i(x)\). The Gram matrix is defined as:

$$\begin{aligned} G_i(x)(u,v)=\frac{\sum _{h=1}^{H_i}\sum _{w=1}^{W_i}\psi _i(x)(h,w,u)\psi _i(x)(h,w,v)}{C_iH_iW_i} \end{aligned}$$
(4)

Since the Gram matrix only records the relationships between channels rather than spatial correlations, it is considered a representation of the general textures and patterns of an image. Let \(w^s_i\) be the style-loss weight at layer i; the style loss \(L_s\) is then the distance between two Gram matrices:

$$\begin{aligned} L_s=\sum _{i=1}^{N_{VGG}}w^s_i||G_i(I_s)-G_i(\hat{I})|| \end{aligned}$$
(5)
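The Gram matrix of Eq. (4) and the style loss of Eq. (5) can then be computed from the same VGG features (a sketch under the same assumptions as above):

```python
import torch

def gram_matrix(feat):
    """Eq. (4): channel-by-channel correlations, normalized by C*H*W."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def style_loss(feats_s, feats_out, weights):
    """Eq. (5): weighted distance between Gram matrices of style and output."""
    return sum(w * torch.norm(gram_matrix(fs) - gram_matrix(fo))
               for w, fs, fo in zip(weights, feats_s, feats_out))
```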

To maximize the variational lower bound, we need to minimize the KL divergence between \(q_{\phi }(\mathbf {z}|I_s)\) and \(p_\theta (\mathbf {z})\). Under the assumption that all the latent variables are i.i.d., we calculate the KL divergence at each scale and sum over all scales to obtain the KL loss. Since we already assume that \(\mathbf {z}|\theta \sim \mathcal {N}(0,1)\), the KL divergence at each scale can be derived as:

$$\begin{aligned} KL(q_{\phi }(\mathbf {z}|I_s)||p_\theta (\mathbf {z}))=\frac{1}{2}(\mu ^2+\sigma ^2-\log \sigma ^2-1) \end{aligned}$$
(6)

where we assume \(\mathbf {z}|I_s;\phi \sim \mathcal {N}(\mu ,\sigma ^2)\).
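With the encoder predicting a mean and a log-variance at each scale, the summed KL loss of Eq. (6) can be sketched as follows (a minimal sketch; the reduction over spatial locations by averaging is an assumption):

```python
import torch

def kl_loss(stats):
    """Sum of per-scale KL(q_phi(z_i | I_s) || N(0, I)), following Eq. (6)."""
    total = 0.0
    for mu, logvar in stats:                 # one (mu, logvar) pair per scale
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0)
        total = total + kl.mean()
    return total
```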

2.5 Implementation Details

The encoders and decoders in the network follow the architecture below. Each ConvBlock consists of two convolutional layers, followed by batch normalization [9] and swish activation [18]. We use five ConvBlocks on the encoder side, each followed by a max-pooling layer except for the last one. The numbers of convolutional filters are 64, 128, 256, 512, and 1024, respectively. The decoders follow the inverse structure of the encoders. Note that in the style encoder, shown in Fig. 1, there is an additional ConvBlock after the normal U-Net [19] encoder at each scale that generates the distribution \(q_{\phi }(z_i|I_s)\), and another ConvBlock after sampling \(z_i\) from \(q_{\phi }(z_i|I_s)\) that computes \(f_s^{(i)}\). The weights for the perceptual and style losses are set to 0, 0, 0, 0, 0.01 and 0.1, 0.002, 0.001, 0.01, 10 for block1_conv1, block2_conv1, block3_conv1, block4_conv1, and block5_conv1 of the VGG, respectively. The network is optimized with the Adam optimizer [12] at a learning rate of \(5\times 10^{-5}\) and trained for 20 epochs on a single Nvidia Titan RTX GPU.
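For reference, a ConvBlock as described above could be sketched as follows. The 3 \(\times \) 3 kernel size, the padding, and the placement of normalization and activation after each convolution are assumptions, since they are not fully specified in the text.

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two convolutions, each followed by batch normalization and swish."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.SiLU(),                      # swish activation
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.SiLU(),
        )

    def forward(self, x):
        return self.body(x)
```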

3 Experiments

The images in the experiments are a combination of several datasets: (1) lung images of clinical patients acquired with a Sonosite ultrasound machine and an HFL38xp linear transducer, (2) chicken breast images acquired with a Fukuda Denshi UF-760AG using a linear transducer with 51 mm scanning width (FDL), (3) live-pig images acquired with the FDL, (4) blue-gel phantom images acquired with the FDL, (5) leg phantom images acquired with the FDL, (6) the Breast Ultrasound Images Dataset [2], (7) the Ultrasound Nerve Segmentation dataset from Kaggle [1], and (8) images of arteries and veins in human subjects acquired with a Visualsonics Vevo 2100 UHFUS machine with ultra-high-frequency scanners. In total, the combined dataset contains 18,308, 2,285, and 2,283 images for training, validation, and testing, respectively. During training, we randomly select a pair of content and style images from the combined dataset without any additional restrictions.

3.1 Qualitative Results

In the first experiment, we directly generate the outputs given the content and style images, transferring styles across datasets (1)–(8). As shown in Fig. 2, the visual results are convincing for every combination of content and style image. Moreover, all the results preserve the anatomy of the content images, including but not limited to vessels and ligaments, while taking on the appearance of the style images. In contrast, as shown in Fig. 3, the method proposed by Huang et al. [8] is not able to capture the fine details of the content or generate realistic textures.

Fig. 2. Results of our algorithm given style and content images. Each content image in the first column is transferred to the style of the corresponding image in the first row. The content images (top to bottom) and the style images (left to right) come from datasets (1)–(8).

Fig. 3. Results of the method by Huang et al. [8]. The leftmost image is the content image (the same as the last one in Fig. 2), whose style is transferred to the styles of datasets (1)–(8) (right eight images).

In another qualitative experiment, we directly sample the style from the latent space without giving the model a style image. Figure 4 shows the distribution of our training style images in the latent space. We then randomly sample two styles (the end points of the arrows in Fig. 4) from the latent space and interpolate between them; the results are shown in Fig. 5. We observe that, even without a style image, the model still generates visually plausible ultrasound images from the content image without losing significant detail, even when sampling from regions of the latent space that are not covered by the training data.

Fig. 4. The distribution (flattened to fit the paper) of styles in the latent space.

Fig. 5. Results of directly sampling from the latent space and interpolating. The first column is the content image. The second and last columns are sampled styles (the start and end points of the arrows in Fig. 4); the columns in between are obtained by interpolating in the latent space along the direction of the corresponding arrows.

Table 1. Comparison between style-transfer-based augmentation and traditional augmentation

3.2 Quantitative Results

To show that our approach is effective for augmentation, we assume that only limited live-pig data are available, while more leg-phantom data with annotated veins and arteries (405 images) are at hand. We denote variational-style-transferred (ours), deterministic-style-transferred [8], and original phantom images as vt-ph, dt-ph, and ph, respectively, and traditional augmentation (gamma transform, Gaussian blurring, and image flipping) as aug. We further evaluate the effect that the number of real images has on the final results. We transfer the leg-phantom images into the style of the pig images. For the live-pig data, we use 99 images for training and 11 images each for validation and testing. In the experiments, we train U-Nets with the same network implementation and training settings as [19] on 99, 77, 55, and 33 live-pig images (denoted pig-99, pig-77, etc.) to show that our method works with a very limited number of images. Note that we balance the phantom data and the pig data at a roughly 1:1 ratio in each epoch. As shown in Table 1, the networks trained on variational-style-transferred phantom images together with pig images generally segment better than those trained on the other combinations. Traditional augmentation on top of our variational-style-transfer augmentation sometimes improves the performance further, but our variational-style-transfer augmentation alone is already an upgrade over traditional augmentation and the deterministic style transfer of Huang et al. [8]. Moreover, when the amount of real live-pig data is severely limited, only style-transfer augmentation produces a decent result. In all cases, the style-transfer-based approach yields a significant improvement over traditional methods, and our variational approach is superior to the method of Huang et al. [8].

4 Conclusion

We demonstrated that our method is capable of transferring the style of one ultrasound image to another style, e.g., from the style of phantom data to that of pig data, or from the style of conventional ultrasound machines to the style of high-frequency ultrasound machines. It is also able to sample arbitrary styles continuously from a latent space. Our method can generate ultrasound images from both observed and unobserved domains, which helps address the insufficiency of data and labels in medical imaging.