Abstract
With the development of generative models, image synthesis conditioned on the specific variable becomes an important research theme gradually. This paper presents a novel spectral normalization based Hybrid Attentional Generative Adversarial Networks (HAGAN) for text to image synthesis. The hybrid attentional mechanism is composed of text-image cross-modal attention and self-attention of image sub regions. Cross-modal attention mechanism contributes to synthesize more fine-grained and text-related image by introducing word-level semantic information in generative model. The self-attention solves the long distance reliance of image local-region features when generate image. With spectral normalization, the training of GANs become more stable than traditional GANs, which conduces to avoid model collapse and gradient vanishing or explosion. We conduct experiments on widely used Oxford-102 flower dataset and CUB bird dataset to validate our proposed method. During quantitative and non-quantitative experimental comparison, the results indicate that the proposed method achieves the best performance on Inception score (IS), Fréchet Inception Distance (FID) and visual effect.
Access provided by Autonomous University of Puebla. Download conference paper PDF
Similar content being viewed by others
Keywords
- Text to image synthesis
- Spectral normalization
- Self-attention
- Cross-modal attention
- Generative Adversarial Networks
1 Introduction
Recent years have witnessed the great progress of Deep Neural Networks (DNNs), especially various kinds of generative tasks and discriminative tasks. Particularly, Convolutional Neural Networks (CNNs) have shown excellent performance on the challenging multi-category classification [1]. Besides, another branch of research focus is generative task, which is inverse mapping of discriminative task. In particular, generative tasks based on Generative Adversarial Networks (GANs) have achieved promising results [2] in image synthesis. Recently, photo-realistic image synthesis gradually becomes an important research direction with many potential applications, such as, computer graphics and photo retouching. To be specific, methods for text-to-image synthesis need generate image that are highly similar to meanings embedded in texts. However, image synthesis, conditioned on the given text descriptions, is also a knotty problem because of the great gap between text modality and image modality.
Almost all existing text-to-image synthesis methods are based on GANs and some of them achieve remarkable performance. Generative Adversarial Networks (GANs) is proposed by Goodfellow in 2014 [3], which has made impressive performance in generative tasks. It is composed of two sub-networks, generator and discriminator, trained with a competing goal in an adversarial manner. From them on, GANs related work become a focused research direction. Meanwhile, adversarial learning mechanisms have shown great progress in many complex simulating problems [4].
Although excellent performance in many tasks, GANs are well known for difficulty in training and mode collapse. Many research works indicate that the instability in training is due to the disjoint of the generated data distribution and the real data distribution [5]. Besides, the mode collapse in GANs shows that the model will synthesize similar samples with uniform color and single texture. For addressing the knotty problem, many methods were proposed until now, such as WGAN [6], WGAN-GP [7] and SNGAN [8]. Some of those methods achieve excellent performance in stabilizing the training process and avoiding mode collapse.
Text-to-image synthesis is more challengeable than simply generate image from random noise or category condition. Text description contains more abundant and detailed image features, which should be drawn in synthesized image. Aiming at synthesizing photo-realistic image, there are two main branches of methods, VAE-based methods [9,10,11] and GAN-based methods [12,13,14,15,16]. Cai et al. [9] propose an image synthesis framework for fine-grained image in a multi-stage variational auto-encoder manner. Gulrajani et al. [10] present an improved PixelCNN-based model named PixelVAE, which introduces an autoregressive decoder for natural image synthesis. Deep Recurrent Attentive Writer (DRAW) [11] combines spatial attention mechanism with sequential VAE framework for constructing complex images.
Apart from VAE-based methods, GAN-based approaches also show great effectiveness in text to image synthesis. Specifically, Reed et al. [12] firstly introduce the traditional GAN into text to image synthesis in 2016. Following on the previous work, they propose a Generative Adversarial What-Where Network (GAWWN) [17] by using position box as additional supervision, which achieves better performance. However, the images synthesized by the first model are blurry and unclear. Inspired by the drawing step of human beings, multi-stage strategy is introduced into image synthesis in recent years, such as StackGAN [13, 14], AttnGAN [16] and CWPGGAN [15]. To be specific, StackGAN has two versions, StackGAN-v1 [13] and StackGAN-v2 [14]. StackGAN-v1 is based on two-stage GANs, while the StackGAN-v2 is an advanced three-sage model. Therefore, the images synthesized by the second model are more realistic and richly-textured than the first method. Progressive growing mechanism [18] is adopted in CWPGGAN [15], which can gradually improve the resolution and quality.
Attention mechanism shows effectiveness in many applications, especially in natural language process and computer vision. More specifically, self-attention mechanism is introduced in image generation [19]. Besides, attention mechanism is also adopted in text to image generative task, such as alignDRAW [20] and AttnGAN [16]. The alignDRAW [20] based on the mentioned DRAW introduces soft attention mechanism for attending to the relevant words of image feature. Xu et al. [16] propose a multi-stage Attentional Generative Adversarial Network (AttnGAN) for fine-grained image synthesis from text. Their methods not only use generator to generate high-resolution realistic image but also add word-level feature into generator, while others’ methods only adopt sentence feature.
Inspired by previous work, we propose a spectral normalization based Hybrid Attentional Generative Adversarial Networks (HAGAN) that combines the image self-attention and text-image cross-modal attention mechanism for fine-grained image synthesis in this paper. Firstly, the features are extracted by the pretrained model name DAMSM [16], which contains both text and image feature embedding. Then, we feed the encoded text feature into three-stage hybrid attentional generative adversarial networks for image synthesis. The self-attention mechanism is introduced in the first-stage network and cross-modal attention is adopted in second and third stage generators. We mainly use the publicly available Oxford-102 flowers dataset and the Caltech CUB-200 birds dataset to conduct the experimental analysis. During the evaluation metric and side-by-side comparison with the state-of-the-art methods, the results indicate that our proposed method can get better visual effect and competitive evaluation value.
Compared to existing works, the main contributions of our work are as follows.
-
(1)
By developing a hybrid attention mechanism for text to image synthesis, self-attention of image generation can solve long distance reliance between local features and cross-modal attention can add word-level features in generator for fine-grained image details.
-
(2)
Due to spectral normalization, the training of the model becomes more stable than traditional GANs. Therefore, the generator can synthesize more realistic image due to discriminator satisfied with K-Lipschitz constraint can provide useful and effective gradient information for model optimizing.
The rest of this paper is organized as follows. The second section presents our proposed HAGAN approach. The third section shows the experimental results and comparison, and the last section concludes this paper.
2 The Proposed Method
2.1 Background
A. Generative Adversarial Networks
The GANs consists of two sub-networks, a discriminator D and a generator G, that cooperate and compete in a minimax game until the game achieves zero-sum game. Such minimax game can be described as the following object function \( V(G,D) \).
where x is the real image and z is random noise. In the training process, the discriminator tries to maximize V, however, the generator wants to minimize the object function. In the last, the game of the two networks achieves the Nash Equilibrium that both can obtain the best performance.
B. Conditional Generative Adversarial Networks
Conditional GANs add conditional variable y to control the features of output image. The object function of conditional GAN can be described as follows.
where y is a conditional variable. The function of generator \( G(z|y) \) allows the generator G to generate images conditioned on the given conditioning variable. The discriminator \( D(x|y) \) evaluates whether the generated image is matched with conditioning variable y or not.
2.2 The Framework of Hybrid Attentional Generative Adversarial Networks
The HAGAN enables the generator to draw different sub-regions conditioned on related words and other long distance related image sub-regions. Meanwhile, the spectral normalization stabilize the training of the discriminator, which will contribute the optimization of the generator. The framework of the HAGAN, as shown in Fig. 1.
A. Hybrid Attentional Generative Adversarial Networks
Suppose the texts and images are stored in a N-pair document corpus \( (X^{T} ,X^{I} ) \). Here, \( X^{T} \) is text data and \( X^{I} \) is image data. The text feature and image feature are extracted by the well-trained embedding model DAMSM [16], which is based on the bi-directional Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN).
where \( {\varvec{\upvarphi }} \) indicates word feature matrix, \( \bar{\varphi } \) denotes sentence feature, \( \bar{\phi } \) is global image feature and \( {\varvec{\Phi}} \) presents the sub-region feature matrix.
The encoded sentence feature \( \bar{\varphi } \) will be pretreated before input into the multi-stage generative networks. As following,
where z is random noise vector, \( F^{ca} \) denotes the Conditioning Augmentation [14] which converts the sentence feature vector \( \bar{\varphi } \) to the conditioning vector, and \( F_{cat} \) is concatenate function. After several upsample operation, the hidden feature gradually denotes the image features. The self-attention mechanism acts on the hidden feature maps \( (\hat{h}_{0} ,\hat{h}_{1} ) \). As following,
Here, \( F_{i}^{self\_attn} \) is the self-attention mechanism. The first-stage generator synthesizes image conditioned on the output of self-attention block directly. The generative networks consists of three generators \( (G_{0} ,G_{1} ,G_{2} ) \), which use the previous hidden feature \( (h_{0} ,h_{1} ,h_{2} ) \) to generate different-scale images \( (\hat{x}_{0} ,\hat{x}_{1} ,\hat{x}_{2} ) \). To be specific, the process of multi-stage generator is defined as following.
The cross-modal mechanism is introduced in the second and the third networks, which can add more detailed attribute informations in the feature matrix. Specifically, the operation of cross-modal attention is defined as following.
Here, \( {\varvec{\upvarphi}} \) is the feature matrix of word features, and \( F_{i}^{{cro_{ - } attn}} \) is the cross-attention model of the i-th stage generator. All of these functions are modeled as neural networks.
-
(1)
Self-Attention mechanism for the first stage generator
The generator G and discriminator D of GAN models usually consist of convolutional neural networks. However, the convolutional filter only process the information in a local neighborhood, such as window size \( 3 \times 3 \). Hence, long-range dependencies cannot be considered in the convolutional process. By introducing the self-attention mechanism into the GANs model, the generator can use the long-distance relationships between widely separated sub-regions.
In the deep model, the feature map \( \hat{h} \in {\mathbb{R}}^{C \times N} \) of previous layer presents the hidden features of an image. We use two \( 1 \times 1 \) convolutional layer to convert the feature map into two space \( \bar{H},\hat{H} \), and then calculate the attention of the two feature maps.
$$ \beta_{j,i} = \frac{{\exp (s_{ij} )}}{{\sum\nolimits_{i = 1}^{N} {\exp (s_{ij} )} }}, $$(8)where \( s_{ij} = \bar{H}(x_{i} )^{T} \hat{H}(x_{j} ) \), and \( \beta_{j,i} \) indicates how much attention from the i-th location when generating the j-th region. The attention map is obtained by weighted sum of all the output, as following.
$$ \hat{C}_{i} = (\hat{c}_{1} , \cdots ,\hat{c}_{j} , \cdots ,\hat{c}_{N} ) \in {\mathbb{R}}^{C \times N} , $$(9)where,
$$ \hat{c}_{j} = \sum\nolimits_{i = 1}^{N} {\beta_{j,i} h(x_{i} )} ,h(x_{i} ) = {\mathbf{W}}_{h} \hat{h}(x_{i} ). $$(10)Then we apply a weight scale parameter \( \gamma \) on attention map. The final weighted output is given by,
$$ \hat{h}_{i} = \gamma \hat{C}_{i} + \hat{h}_{i - 1} , $$(11)where \( \gamma \) is initialized as 0.
In short, the self-attention mechanism can be denoted as
$$ \hat{h}_{i} = \hat{F}_{i} (h_{i - 1} ,F_{i}^{self\_attn} ({{\varvec{\upvarphi}}},\hat{h}_{i - 1} )). $$(12) -
(2)
Cross-modal Attention mechanism for the second and third stage generators
Cross-modal attention mechanism is adopted to add relevant word-level information to networks for producing fine-grained image. The input of the cross-modal attention mechanism is the previous hidden feature \( h \in {\mathbb{R}}^{{\hat{D} \times N}} \) of image and the word-level features \( {\varvec{\upvarphi}} \in {\mathbb{R}}^{D \times T} \), which is encoded by the optimized model. Then, the word features are converted to a common space by adding a perceptron layer. Specifically, word feature \( \widehat{{\varvec{\upvarphi}}} \in {\mathbb{R}}^{D \times T} \) is converted by \( \widehat{{\varvec{\upvarphi}}} = {\mathbf{U}}{\varvec{\upvarphi}} \), where \( {\mathbf{U}} \in {\mathbb{R}}^{{\hat{D} \times D}} \). Then, we calculate the word-context vector of the j-th sub-region by attention mechanism. Hidden feature \( h \) denotes the query, and the converted word features are the value. In detail, the word-context of the j-th sub-region is calculated as follows.
$$ c_{j} = \sum\limits_{i = 0}^{T - 1} {\beta_{j,i} \hat{\varphi }_{i} } , $$(13)where
$$ \beta_{j,i} = \frac{{\exp (s^{\prime}_{j,i} )}}{{\sum\nolimits_{k = 0}^{T - 1} {\exp (s^{\prime}_{j,k} )} }}. $$(14)Here, the similarity is computed by dot-product similarity
$$ s^{\prime}_{j,i} { = }h_{j}^{T} \hat{\varphi }_{i} . $$(15)In short, the word-context can be denotes as
$$ {\mathbf{C}} = F^{cro\_attn} ({\varvec{\upvarphi}},h) = (c_{0} ,c_{1} , \ldots ,c_{N - 1} ) \in {\mathbb{R}}^{{\hat{D} \times N}} $$(16)Then, the word-context and original image hidden feature is concatenated and feed in next layer.
-
(3)
Objective function of multi-stage GANs
In our work, we adopt three generators and three discriminators in text-image translation. Each stage of generator \( G_{i} \left( {i = 0,1,2} \right) \) has a corresponding discriminator \( D_{i} \). The same with the conditional GANs, the objective function of the i-th generator is defined as follows.
$$ L_{{G_{i} }} = - \frac{1}{2}E_{{\hat{x}_{i} \sim p_{{G_{i} }} }} [\log (D{}_{i}(\hat{x}_{i} ))] - \frac{1}{2}E_{{\hat{x}_{i} \sim p_{{G_{i} }} }} [\log (D{}_{i}(\hat{x}_{i} ,\bar{\varphi }))], $$(17)where the first part is unconditional loss and the second term is conditional loss. Meanwhile, in order to ensure the generated image is match with the text description, we introduce the DAMSM loss [16] into the objective function of the last-stage generator. As following,
$$ L = L_{{G_{2} }} + \lambda_{2} L_{DAMSM} , $$(18)where \( \lambda_{2} \) is a balance factor.
In the adversarial learning, the discriminators evaluate whether the synthesized image is realistic and matched with the text or not. The objective function of each stage discriminator is defined as follows.
$$ \begin{aligned} L_{{D_{i} }} = & - \frac{1}{2}E_{{x_{i} \sim p_{data} }} [\log (D{}_{i}(x_{i} ))] - \frac{1}{2}E_{{\hat{x}_{i} \sim p_{{G_{i} }} }} [\log (1 - D{}_{i}(\hat{x}_{i} ))] + \\ \, & - \frac{1}{2}E_{{x_{i} \sim p_{{data_{i} }} }} [\log (D{}_{i}(x_{i} ,\bar{e}))] - \frac{1}{2}E_{{\hat{x}_{i} \sim p_{{G_{i} }} }} [\log (1 - D{}_{i}(\hat{x}_{i} ,\bar{\varphi }))], \\ \end{aligned} $$(19)where \( x_{i} \) is from the real i-th scale image and \( \hat{x}_{i} \) the generated image from the i-th stage generator. By optimizing the discriminator and generator alternately, the network will achieve zero-sum game that the generators and discriminators obtain the best performance.
B. Spectral Normalization for Stabilizing Training
Model Collapse, gradient vanishing and gradient explosion are very popular phenomena in the training of GANs. Besides, the balance of training between generator and discriminator is hard to control, which leads to converge difficultly. In order to solve the problem, many methods were proposed to improve the stability of model, such as WGAN [6] and WGAN-GP [7]. The original WGAN introduces Wasserstein distance to measure the distance between the real data and the generated data and minimize it. The Wasserstein distance is calculated as follows,
Here, the formula \( \left\| f \right\|_{Lip} \le K \) indicates that the function \( f( \bullet ) \) is satisfied with K-Lipschitz constraint. The original WGAN presents a way of clipping the weights of discriminator in \( [ - c,c] \), which drops the fitting capacity of deep neural network. WGAN-GP adopts Gradient Penalty in discriminator to satisfy K-Lipschitz constraint, which increases computational effort. Therefore, those methods could not solve the problem absolutely. For stabilizing GAN-based model, the discriminator D should follow the Lipschitz continuity hypothesis. In other words, we need constrain the function of discriminator to satisfy the K-Lipschitz constraint.
where the \( ||f||_{Lip} \) is the smallest value of K such that \( |f(x_{1} ) - f(x_{2} )| \le K|x_{1} - x_{2} | \) for any \( x_{1} ,x_{2} \). Miyato et al. [8] propose a novel weight normalization named spectral normalization, which stabilize the training of discriminator by forcing the network to satisfy the Lipschitz constraint. Therefore, normalizing the weight parameters \( W \) of each layer can ensure the \( \left\| f \right\|_{Lip} \) is bounded from above by 1. As following
where \( \sigma (W) \) is spectral normalization and \( D_{N} \) is nonlinear activation function of the N-th layer. With spectral normalization, the discriminator provides useful gradient to generator for optimization so that the network optimize better and generate images that are more realistic.
3 Experimental Results and Evaluation
3.1 Datasets and Evaluation Metric
We conduct experiments for text to image synthesis on the widely used CUB dataset [21] and Oxford-102 dataset [22]. The statistics of each datasets as shown in Table 1. In order to verify the effectiveness fairly, Inception Score (IS) [23] and Fréchet Inception Distance [24] are adopted for quantitative evaluation of generative model.
Inception Score.
The Inception Score (IS) is current well-known metric for evaluating the generative performance of GANs. The motivation of Inception Score is that excellent generative models should generate realistic, various and meaningful images. The calculation of IS score as follows.
where x denotes sample of generated image, and y is image label predicted by the inception model. The Eq. (22) indicates that classes of generated image should be as diverse as possible and the label prediction probability should be as accurate as possible. Therefore, the higher KL divergence shows excellent generative ability of model.
Fréchet Inception Distance.
Assuming that both the real data and the generated data distribution following Gaussian distribution, so they have two major parameters, mean and covariance \( (m,C) \). The distance between the two data distribution is measured by Fréchet distance. The calculation is as following.
where \( (m,C) \) are mean and covariance of generated data, and \( (m_{r} ,C_{r} ) \) are mean and covariance of real data. The lower distance of the mentioned two distributions presents that the synthesized image are more similar to the original data.
3.2 Experimental Results and Comparison
(1) Evaluation Metric Comparison
In experiment, we make quantitative and non-quantitative comparison with many state-of-art methods. Tables 2 and 3 show the quantitative comparison details of IS and FID score on Oxford-102 dataset and CUB dataset. For fair comparison, we choose some IS and FID value from the published paper [14, 15]. On the Oxford-102 dataset, the proposed method achieves 3.95 of inception score and 47.32 of Fréchet Inception Distance, which outperforms the previous methods. Likewise, the proposed method obtains the highest IS value (from 4.36 to 4.43) and competitive FID value (44.64). Comparing to the Oxford dataset, the CUB dataset is more difficult for text to image generation. The bird dataset can better reflect the performance of different methods. Significantly, the results show that the proposed method is able to achieve better performance than other state-of-art text to image synthesis methods.
(2) Visual Effect Comparison
The comparisons of state-of-art text-to-image generative methods by side-by-side comparison are shown in Fig. 2. Life part of Fig. 2 is various images generated by different methods, which are conditioned on the same text description of the Oxford-102 dataset. By scrutinizing the image details and text description roughly, the results show that all images generated by different methods matches with the text, and all those images are realistic and natural. However, the detailed comparison indicates that the image generated by our method are more realistic. On the challengeable CUB dataset, we can find that some previous methods have difficulty in generating highly real and clear image conditioned on the given text, such as GAN-CLS, GAWWN and StackGAN_v1. On the contrary, our proposed method can generate photo-realistic and fine-grained image, especially the bird of the third column. Therefore, in conclusion, our proposed method generates more realistic, more fine-grained and more natural images than other methods in visual evaluation.
(3) Word-Level Attention Visualization
For better evaluating the performance of attention mechanism, we visualize the word-level attention results as shown in Fig. 3. The attention visualization are shown below the red box. The words belong to the paired text description, and the bright region is the corresponding attention area of the words. However, some words do not give attention to right area, such as articles and verbs, which make less contribution to image synthesis. The words describing object attributes, such as colours, shape, and parts of objects, can give attention to correct regions. With adding word-level semantic information in the latter two generators, the generators can redraw the word’s information in the corresponding region, which can saliently enhance the significant details of generated image as well as make it be suitable for the human system (Fig. 4).
4 Conclusion
This paper presents a hybrid attentional model to fulfill text-to-image synthesis. The hybrid attentional mechanism contributes to improve performance of generating fine-grained and realistic image. Meanwhile, the training of network become more stable by introducing spectral normalization in discriminator network. The conducted experiments show that our proposed method synthesizes realistic images in visual comparison, and outperforms the state-of-the-art approaches in FID and IS metric.
References
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
Huang, H., Yu, P.S., Wang, C.: An introduction to image synthesis with generative adversarial nets. arXiv preprint arXiv:1803.04469 (2018)
Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2014)
Wang, B., Yang, Y., Xu, X., Hanjalic, A., Shen, H.T.: Adversarial cross-modal retrieval. In: Proceedings of the 2017 on Multimedia Conference, pp. 154–162. ACM Press, California (2017)
Arjovsky, M., Bottou, L.: Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862 (2017)
Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein GAN. arXiv preprint arXiv:1701.07875 (2017)
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C.: Improved training of Wasserstein GANs. In: Advances in Neural Information Processing Systems, pp. 5767–5777 (2017)
Miyato, T., Kataoka, T., Koyama, M., Yoshida, Y.: Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957 (2018)
Cai, L., Gao, H., Ji, S.: Multi-stage variational auto-encoders for coarse-to-fine image generation. arXiv preprint arXiv:1705.07202 (2017)
Gulrajani, I., et al.: PixeLVAE: a latent variable model for natural images. arXiv preprint arXiv:1611.05013 (2016)
Gregor, K., Danihelka, I., Graves, A., Rezende, D.J., Wierstra, D.: Draw: a recurrent neural network for image generation. In: International Conference on Machine Learning, Lille, pp. 1462–1471 (2015)
Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396 (2016)
Zhang, H., et al.: StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5907–5915. IEEE Press, Venice (2017)
Han, Z., Tao, X., Hongsheng, L., Shaoting, Z., Xiaogang, W., Xiaolei, H.: StackGAN++: realistic image synthesis with stacked generative adversarial networks. IEEE Trans. Pattern Anal. Mach. Intell. 41, 1947–1962 (2018)
Bodnar, C.: Text to image synthesis using generative adversarial networks. arXiv preprint arXiv:1805.00676 (2018)
Xu, T., et al.: AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. arXiv preprint arXiv:1711.10485 (2017)
Reed, S.E., Akata, Z., Mohan, S., Tenka, S., Schiele, B., Lee, H.: Learning what and where to draw. In: Advances in Neural Information Processing Systems, pp. 217–225 (2016)
Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196 (2017)
Zhang, H., Goodfellow, I., Metaxas, D., Odena, A.: Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318 (2018)
Mansimov, E., Parisotto, E., Ba, J.L., Salakhutdinov, R.: Generating images from captions with attention. In: International Conference on Learning Representations, San Juan (2016)
Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The Caltech-UCSD Birds-200-2011 dataset (2011)
Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: IEEE Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729. IEEE Press, Bhubaneswar (2008)
Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. In: Advances in Neural Information Processing Systems, pp. 2234–2242 (2016)
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Klambauer, G., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a nash equilibrium. arXiv preprint arXiv:1706.08500 (2017)
Acknowledgments
This work is supported by National Natural Science Foundation of China under grant 61771145 and grant 61371148.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Cheng, Q., Gu, X. (2019). Hybrid Attention Driven Text-to-Image Synthesis via Generative Adversarial Networks. In: Tetko, I., Kůrková, V., Karpov, P., Theis, F. (eds) Artificial Neural Networks and Machine Learning – ICANN 2019: Workshop and Special Sessions. ICANN 2019. Lecture Notes in Computer Science(), vol 11731. Springer, Cham. https://doi.org/10.1007/978-3-030-30493-5_47
Download citation
DOI: https://doi.org/10.1007/978-3-030-30493-5_47
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30492-8
Online ISBN: 978-3-030-30493-5
eBook Packages: Computer ScienceComputer Science (R0)