Abstract
Like many other machine learning driven medical image analysis tasks, skin image analysis suffers from a chronic lack of labeled data and skewed class distributions, which poses problems for the training of robust, well-generalizing models. The ability to synthesize realistic-looking images of skin lesions could alleviate both of these problems. Generative Adversarial Networks (GANs) have been used successfully to synthesize realistic-looking medical images, however only at low resolution, whereas machine learning models for challenging tasks such as skin lesion segmentation or classification benefit from much higher resolution data. In this work, we successfully synthesize realistic-looking images of skin lesions with GANs at such high resolution. To this end, we utilize the concept of progressive growing, which we compare both quantitatively and qualitatively to other GAN architectures such as the DCGAN and the LAPGAN. Our results show that with the help of progressive growing, we can synthesize highly realistic dermoscopic images of skin lesions that even expert dermatologists find hard to distinguish from real ones.
1 Introduction
As in many other medical fields, the problems of data scarcity and class imbalance are also apparent in machine learning driven skin image analysis. In the ISIC2018 challenge, the provided dataset comprises only 10,000 labeled training samples, and the class distribution is heavily skewed among the seven categories of skin lesions, owing to the rare nature of some pathologies. In order to tackle the problem of limited training data, state-of-the-art approaches for skin lesion classification and segmentation rely on heavy data augmentation [9, 18] or webly supervised learning [11]. As an alternative, synthetic images could open up new ways to deal with these problems. Generative Adversarial Networks (GANs) [5] have shown outstanding results for this task. In the computer vision community, GANs have been successfully used for the generation of realistic-looking images of indoor and outdoor scenery [3, 13], faces [13] or handwritten digits [5]. Some conditional variants [10] have also set the new state of the art in the realms of super-resolution [8] and image-to-image translation [6]. A few of these successes have been translated to the medical domain, with applications in cross-modality image synthesis [16], CT image denoising [17] and the pure synthesis of biological images [12], PET images [2], and OCT patches [14]. First successful attempts at medical data augmentation using GANs have been made in [1, 4], however only at the level of small patches.
In contrast to many other medical classification problems, skin lesion segmentation and classification models often utilize ImageNet-pretrained models, meaning that these rely on input data with resolutions of \(224\times 224\,\)px or higher. For image synthesis, this implies that higher resolution images need to be generated without trading off realism. Thoroughly engineered, unconditional architectures such as the DCGAN [13] or the LAPGAN [3] have proven to work well for high quality image synthesis from noise, however only at fairly low resolution. Conditional approaches [15] have shown that both high quality and high resolution image synthesis up to \(2048\times 1024\,\)px is possible when mapping from semantic label maps to synthetic images with a hierarchy of conditional GANs; however, this setting requires well-structured input to the generator. Recently, progressive growing of GANs (PGAN) [7] has shown outstanding results for realistic image synthesis of faces at resolutions up to \(1024\times 1024\,\)px, without the need for any conditioning.
Contribution. In this work, we synthesize skin lesion images at sufficiently high resolution while ensuring high quality and realism. For our experiments, we utilize dermoscopic images of benign and malignant skin lesions provided by the ISIC2018 challenge. For data synthesis, we employ the PGAN and compare it to the DCGAN and the LAPGAN. As PGANs can natively only synthesize images whose size is a power of 2, we aim for a target resolution of \(256\times 256\,\)px, such that state-of-the-art classifiers could potentially leverage the samples. A quantitative comparison of the image statistics of the synthetic and real images shows that the PGAN matches the training dataset distribution very well, and visual exploration further corroborates its superiority over the other approaches in terms of sample diversity, sharpness and artifacts. Ultimately, we evaluate the quality of the PGAN samples in a user study involving 3 expert dermatologists as well as 5 deep learning experts, showing that the experts have a hard time distinguishing between real and fake images.
The remainder of this manuscript is organized as follows: We first briefly recapitulate the GAN framework as well as the different GAN concepts before we describe the experimental setup. Afterwards, we introduce the dataset, evaluation metrics, provide a quantitative comparison of the aforementioned concepts for skin lesion synthesis and the results of our user study. We conclude this paper with a discussion and an outlook on future work.
2 Skin Lesion Synthesis
2.1 Generative Adversarial Networks
The original GAN framework consists of a pair of adversarial networks: A generator network G tries to transform random noise \(z \sim p_z\) from a prior distribution \(p_z\) (usually a standard normal distribution) to realistic-looking images \(G(z) \sim p_{fake}\). At the same time, a discriminator network D aims to classify well between samples coming from the real training data distribution \(x \sim p_{real}\) and fake samples G(z) generated by the generator. By utilizing the feedback of the discriminator, the generator G can be adjusted such that its samples are more likely to fool the discriminator in its classification task, ultimately teaching the generator to approximate the training dataset distribution. Mathematically speaking, the networks play a two-player minimax game against each other:

\(\min _G \max _D V(D, G) = \mathbb {E}_{x \sim p_{real}}[\log D(x)] + \mathbb {E}_{z \sim p_z}[\log (1 - D(G(z)))]\)
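For illustration, the value of this two-player game can be estimated from discriminator scores alone. The following is a minimal NumPy sketch; the function name `gan_value` and the toy scores are our own, purely illustrative choices:

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Monte-Carlo estimate of the GAN value V(D, G):
    E_x[log D(x)] + E_z[log(1 - D(G(z)))],
    given discriminator outputs in (0, 1)."""
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# A perfect discriminator (D(x) -> 1 on real, D(G(z)) -> 0 on fake)
# drives V towards 0; a fully fooled one (D = 0.5 everywhere)
# yields V = 2 * log(0.5), the generator's optimum.
v_fooled = gan_value([0.5, 0.5], [0.5, 0.5])
```

Alternating updates of D (ascending in V) and G (descending in V) then drive the game towards this equilibrium.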
In consequence, as D and G are updated in an alternating fashion, the discriminator D becomes better in distinguishing between real and fake samples while the generator G learns to produce even more realistic samples.
In this work, we employ three different GAN concepts for the task of high resolution skin lesion synthesis, namely the DCGAN, the LAPGAN and finally the very recent PGAN. An overview of the setup is given in Fig. 2.
The DCGAN architecture is a popular, well engineered convolutional GAN that is fairly stable to train and has proven to yield high quality results at a resolution of \(64\times 64\,\)px. The architecture is carefully designed with concepts such as leaky ReLU activations to avoid sparse gradients and a specific weight initialization to allow for robust training.
The LAPGAN is a generative image synthesis framework inspired by the concept of Laplacian pyramids. In essence, it consists of a hierarchy of GANs, where the first generator \(G_0\) is trained to synthesize low-resolution images from noise. Each successive generator \(G_i\) is trained to map from the lower-resolution output of the previous generator \(G_{i-1}\) to a residual image, which is added to the upsampled input in order to obtain a compelling higher resolution image.
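The residual reconstruction scheme can be sketched as follows. This is an illustrative NumPy toy, not the LAPGAN implementation: nearest-neighbour upsampling stands in for the smoother interpolation a real Laplacian pyramid would use, and all names are our own:

```python
import numpy as np

def upsample2x(img):
    """Nearest-neighbour 2x upsampling (a stand-in for the smoother
    interpolation typically used in a Laplacian pyramid)."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def lapgan_reconstruct(low_res, residuals):
    """Rebuild a high-resolution image LAPGAN-style: at each level,
    upsample the current image and add the residual that the
    corresponding generator G_i would predict."""
    img = low_res
    for res in residuals:
        img = upsample2x(img) + res
    return img

# Toy example: a 2x2 base image plus one 4x4 residual gives a 4x4 output.
base = np.ones((2, 2))
residual = 0.1 * np.ones((4, 4))
out = lapgan_reconstruct(base, [residual])
```

In the actual LAPGAN, each residual comes from a separate conditional generator, and the hierarchy is trained level by level.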
The PGAN utilizes the idea of progressive growing [7] to facilitate high resolution image synthesis from noise at unprecedented levels of quality and realism. As opposed to the LAPGAN, the PGAN consists of only a single generator and a single discriminator, which both start as small networks that grow in depth and model complexity during training (see Fig. 2). Gradually, the output resolution of the generator and the input resolution of the discriminator are simultaneously ramped up, leading to very stable training behavior and highly realistic synthetic images at resolutions up to \(1024\times 1024\,\)px.
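The fade-in that accompanies each growth step in [7] can be sketched as follows; this is an illustrative NumPy toy of the blending rule, with names and shapes of our own choosing:

```python
import numpy as np

def upsample2x(img):
    """Nearest-neighbour 2x upsampling of the previous-resolution output."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def faded_output(prev_res_img, new_res_img, alpha):
    """During a resolution transition, the newly added layer is faded in
    smoothly: the upsampled previous-resolution output is blended with
    the new layer's output, with alpha ramping from 0 to 1 as training
    progresses."""
    return (1.0 - alpha) * upsample2x(prev_res_img) + alpha * new_res_img

low = np.zeros((2, 2))    # output of the already-trained, smaller network
high = np.ones((4, 4))    # output of the newly grown layer
mid = faded_output(low, high, alpha=0.5)  # halfway through the fade-in
```

The same blending is applied mirror-wise on the discriminator input, which is what makes the growth steps stable.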
3 Experiments and Results
In the first part of our experiments, we train a PGAN and, to demonstrate its superiority over other concepts, also a DCGAN and a LAPGAN for skin lesion synthesis at a resolution of \(256\times 256\) px. Subsequently, we investigate the properties of the synthetic samples both quantitatively and qualitatively. In the second part of our experiments, we conduct a user study to verify the realism of the generated images.
3.1 Dataset
For our experiments, we utilize the ISIC2018 dataset consisting of 10,000 dermoscopic images of both benign and malignant skin lesions (see Fig. 1a). The megapixel dermoscopic images are center cropped to square size and downsampled to \(256\times 256\) px. No data augmentation or pre-processing was applied.
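This preprocessing can be sketched as follows; a minimal NumPy sketch in which block averaging stands in for the unspecified downsampling interpolation, and the helper names are our own:

```python
import numpy as np

def center_crop_square(img):
    """Crop the largest centred square from an H x W image."""
    h, w = img.shape[:2]
    s = min(h, w)
    top, left = (h - s) // 2, (w - s) // 2
    return img[top:top + s, left:left + s]

def downsample(img, size):
    """Downsample a square 2-D image to size x size by block averaging
    (a stand-in for whatever interpolation was actually used); assumes
    the side length is a multiple of `size`."""
    f = img.shape[0] // size
    return img[:f * size, :f * size].reshape(size, f, size, f).mean(axis=(1, 3))

# Toy example: crop a 4x6 image to 4x4, then pool a 4x4 image down to 2x2.
square = center_crop_square(np.zeros((4, 6)))
small = downsample(np.arange(16, dtype=float).reshape(4, 4), 2)
```

For the actual megapixel dermoscopic images, the same two steps are applied per color channel with the target size set to 256.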
3.2 Evaluation Metrics
A variety of methods have been proposed for evaluating the performance of GANs in capturing data distributions and for judging the quality of synthesized images. In order to evaluate visual fidelity, numerous works utilized either crowdsourcing or expert user studies. We also conduct such a user study to rate the realism of our synthetic images. In addition, we discuss visual fidelity of the generated images with a focus on diversity, realism, sharpness and artifacts. For quantitatively judging sample realism, the Sliced Wasserstein Distance (SWD) has recently shown to be a reasonably good metric for approximately comparing image distributions [7], thus we also make use of it.
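A simplified version of the SWD can be sketched as follows. This is an illustrative NumPy implementation on raw, equally sized point clouds; note that [7] actually computes the SWD on multi-scale Laplacian pyramid patch descriptors rather than on raw samples:

```python
import numpy as np

def sliced_wasserstein(a, b, n_projections=128, seed=0):
    """Approximate the sliced Wasserstein distance between two point
    clouds a, b of shape (n, d): project both onto random unit
    directions, sort the 1-D projections, and average the resulting
    one-dimensional Wasserstein-1 distances."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, float), np.asarray(b, float)
    dirs = rng.normal(size=(n_projections, a.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # unit directions
    pa = np.sort(a @ dirs.T, axis=0)  # sorted 1-D projections of a
    pb = np.sort(b @ dirs.T, axis=0)  # sorted 1-D projections of b
    return np.abs(pa - pb).mean()
```

Identical distributions yield a distance of zero, and the real-vs-real SWD provides the lower bound used in Table 1.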
3.3 Image Synthesis
We trained a PGAN as described in [7] on all 10,000 images, as well as a DCGAN and a LAPGAN. The PGAN was trained for 3M iterations, until the SWD between the synthetic samples and the training dataset no longer decreased noticeably. For a valid comparison, the LAPGAN and the DCGAN were also trained for the same number of iterations.
Per model, we then generate 10,000 synthetic images and compare their distribution to the real data by means of the SWD (see Table 1). Since the SWD constitutes an approximation, we also compute the SWD between the real data and itself to obtain a lower bound. In comparison, the lowest SWD is clearly obtained with the PGAN samples, whereas the DCGAN and the LAPGAN perform considerably worse, and roughly equally so. This is also reflected by a visual exploration of the samples (see Fig. 1 for a comparison of samples generated with the different models). The DCGAN samples are prone to checkerboard artifacts (Fig. 3, left) and can thus easily be identified as fake. The LAPGAN samples (Fig. 3, middle) seem more realistic and diverse, but close inspection reveals a vast amount of high frequency artifacts, which again negatively impact the realism of these samples. The PGAN samples (Fig. 3, right) seem highly realistic; only filamentary structures such as hair raise suspicion.
Exploring the Visual Manifold. Since the PGAN samples look so compelling, the model might simply have memorized the training dataset. Therefore, we explore the manifold of synthetic samples. The smooth transitions among samples provide clear evidence that memorization did not occur (see Fig. 4).
3.4 Visual Turing Test
In order to judge the realism of the generated images, we conduct a so-called Visual Turing Test (VTT) involving 3 expert dermatologists (ED) and 5 deep learning experts (DLE). Each participant is asked to classify the same random mix of generated and real images as being either real (class 1) or fake (class 0). The DLEs are familiar with common GAN artifacts and are thus expected to be skilled at identifying implausible generated images, even though they have no experience in judging actual skin lesion images. The EDs, on the other hand, are not aware of these deep learning induced image artifacts, but instead know the gamut of possible skin lesion phenotypes.
Using the PGAN, we first generate 30 synthetic images, which are then mixed with 50 randomly chosen images from the real training dataset. In the VTT, we present each participant with these 80 images in random order and ask them to classify each one. The performance of each participant in terms of the TPR (how many real images have been identified as real), the FPR (how many fake images have been classified as real) and the accuracy is reported in Fig. 5a. Performance statistics among EDs and DLEs are provided in Fig. 5b, and the complete user study details can be found in Table 2. Interestingly, the classification accuracy is slightly lower for the EDs than for the DLEs. Overall, the accuracy is just slightly above 50%, implying that the experts can distinguish between real and fake only slightly better than chance. Notably, not all fakes have been mistaken for real (on average 56%), but on average 42% of the real images have also mistakenly been identified as fake. All in all, none of the participants is able to reliably distinguish the fake samples from the real ones, leading to the conclusion that these synthetic samples are in fact highly realistic.
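The reported metrics follow the usual definitions; a minimal sketch, where the function name and the toy labels are our own:

```python
def vtt_metrics(y_true, y_pred):
    """Visual-Turing-Test metrics with label 1 = real, 0 = fake.
    TPR = fraction of real images called real,
    FPR = fraction of fake images called real."""
    pairs = list(zip(y_true, y_pred))
    real_preds = [p for t, p in pairs if t == 1]
    fake_preds = [p for t, p in pairs if t == 0]
    tpr = sum(real_preds) / len(real_preds)
    fpr = sum(fake_preds) / len(fake_preds)
    acc = sum(t == p for t, p in pairs) / len(pairs)
    return tpr, fpr, acc

# Toy example with 4 real and 4 fake images:
tpr, fpr, acc = vtt_metrics([1, 1, 1, 1, 0, 0, 0, 0],
                            [1, 1, 0, 1, 1, 0, 0, 0])
# tpr = 0.75, fpr = 0.25, acc = 0.75
```

For an undetectable generator, TPR and FPR both approach the rate at which participants guess "real", and the accuracy approaches chance level.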
4 Discussion and Conclusion
We have shown that with the help of PGANs, we are able to generate extremely realistic dermoscopic images, which opens up new opportunities to tackle the problems of data scarcity and class imbalance. Yet, it is unclear to what extent these synthetic data provide additional information to supervised deep learning models. In fact, a variety of questions need to be answered, such as (i) whether there is an information gain in the synthetic samples over the actual training dataset, (ii) whether the gain is higher than with standard data augmentation and (iii) how many training images are in fact required to obtain reliable generative models. Notably, we trained the PGAN ignoring the presence of different classes. For generating images along with class information, one would need to leverage labeled data and effectively train a single model per class. Further, the synthetic images are not always perfect. In particular, the methodology has to be enhanced to account for filamentary structures. In future work, we aim to perform large scale experiments and strive to answer these questions.
Overall, we have shown that we can synthesize images of skin lesions at yet unprecedented levels of realism. In fact, the level of realism is so high that experts from both the medical and the deep learning fields were not able to reliably distinguish real images from generated ones. This leaves us confident that such synthetic data can be leveraged for new data augmentation approaches.
References
Antoniou, A., Storkey, A., Edwards, H.: Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340 (2017)
Bi, L., Kim, J., Kumar, A., Feng, D., Fulham, M.: Synthesis of Positron Emission Tomography (PET) images via multi-channel Generative Adversarial Networks (GANs). In: Cardoso, M.J., et al. (eds.) CMMI/SWITCH/RAMBO -2017. LNCS, vol. 10555, pp. 43–51. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-67564-0_5
Denton, E.L., Chintala, S., Fergus, R., et al.: Deep generative image models using a Laplacian pyramid of adversarial networks. In: NIPS, pp. 1486–1494 (2015)
Frid-Adar, M., Klang, E., Amitai, M., Goldberger, J., Greenspan, H.: Synthetic data augmentation using GAN for improved liver lesion classification. In: ISBI, pp. 289–293 (2018)
Goodfellow, I., et al.: Generative adversarial nets. In: NIPS, pp. 2672–2680 (2014)
Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: CVPR, pp. 5967–5976, July 2017
Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of GANs for improved quality, stability, and variation. In: ICLR (2018)
Ledig, C., et al.: Photo-realistic single image super-resolution using a generative adversarial network. In: CVPR, pp. 105–114 (2017)
Matsunaga, K., Hamada, A., Minagawa, A., Koga, H.: Image classification of melanoma, nevus and seborrheic keratosis by deep neural network ensemble. arXiv preprint arXiv:1703.03108 (2017)
Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014)
Navarro, F., Conjeti, S., Tombari, F., Navab, N.: Webly supervised learning for skin lesion classification. arXiv preprint arXiv:1804.00177 (2018)
Osokin, A., Chessel, A., Carazo-Salas, R.E., Vaggi, F.: GANs for biological image synthesis. In: ICCV, pp. 2252–2261 (2017)
Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)
Schlegl, T., Seeböck, P., Waldstein, S.M., Schmidt-Erfurth, U., Langs, G.: Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In: Niethammer, M., et al. (eds.) IPMI 2017. LNCS, vol. 10265, pp. 146–157. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-59050-9_12
Wang, T.C., Liu, M.Y., Zhu, J.Y., Tao, A., Kautz, J., Catanzaro, B.: High-resolution image synthesis and semantic manipulation with conditional GANs. In: CVPR, June 2018
Wolterink, J.M., Dinkla, A.M., Savenije, M.H.F., Seevinck, P.R., van den Berg, C.A.T., Išgum, I.: Deep MR to CT synthesis using unpaired data. In: Tsaftaris, S.A., Gooya, A., Frangi, A.F., Prince, J.L. (eds.) SASHIMI 2017. LNCS, vol. 10557, pp. 14–23. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68127-6_2
Yang, Q., et al.: Low-dose CT image denoising using a generative adversarial network with wasserstein distance and perceptual loss. IEEE Trans. Med. Imaging 37(6), 1348–1357 (2018)
Yu, L., Chen, H., Dou, Q., Qin, J., Heng, P.A.: Automated melanoma recognition in dermoscopy images via very deep residual networks. IEEE Trans. Med. Imaging 36(4), 994–1004 (2017)
Baur, C., Albarqouni, S., Navab, N. (2018). Generating Highly Realistic Images of Skin Lesions with GANs. In: Stoyanov, D., et al. (eds.) OR 2.0 Context-Aware Operating Theaters, Computer Assisted Robotic Endoscopy, Clinical Image-Based Procedures, and Skin Image Analysis. CARE/CLIP/OR 2.0/ISIC 2018. Lecture Notes in Computer Science, vol. 11041. Springer, Cham. https://doi.org/10.1007/978-3-030-01201-4_28