
1 Introduction

Nowadays, text-to-image synthesis [1,2,3] is one of the most important applications of GANs, which have been among the most active research areas in recent years. Most early text-to-image methods use a single step to directly generate the final result. With the development of text-to-image synthesis, more recent approaches explore multi-stage generation from text descriptions, such as AttnGAN [4], StackGAN [5] and MirrorGAN [6]. Some researchers [4,5,6,7] take the encoding of the entire sentence as the basis and then modify the corresponding attribute for each word vector [8]. However, if the initial image is not realistic (that is, it lacks substance, loses form, and is far from a real image), the quality of the image in the next stage will not improve noticeably. Therefore, text-to-image generation not only needs multiple stages, but also needs each stage to fulfill a different function in order to generate more realistic images.

To tackle this, in this paper we propose a multi-stage text-to-image model for synthesizing images from text descriptions, called the Text-representation Generative Adversarial Network (TRGAN). Its main contributions are as follows. Firstly, each stage of TRGAN performs a different generation task with a different function. Secondly, in order to improve the quality of the low-quality images generated in the initial stage, a processing layer is designed in the second generation stage, in which the generated image is encoded into an image vector that serves as the condition for generating a text vector. The method then utilizes a discriminator to distinguish between the ground-truth text vector and the generated text vector (see Fig. 1).

Fig. 1. A discriminator to distinguish between the ground-truth text vector and the generated text vector.

2 Text-Representation Generative Adversarial Network

In order to effectively generate images from text descriptions, we propose the Text-representation Generative Adversarial Network (TRGAN). TRGAN is a complex structure with three stages. As shown in Fig. 2, the proposed TRGAN contains two modules: JASGM and TGOCM. The first two stages belong to the JASGM module and the last stage belongs to the TGOCM module. In the JASGM module, detailed feature information is captured from word-level information and images are generated based on global sentence attention. In the TGOCM module, a text description is generated in reverse from the generated images to improve the quality of the initial images by matching the word-level feature vectors. The details of the model are introduced in the following subsections.

Fig. 2. The architecture of the proposed TRGAN, including the JASGM and TGOCM modules, which realize different functions, respectively.

2.1 JASGM: Joint Attention Stacked Generation Module

In this section, we mainly focus on detail properties and embed the given text description into local word-level features. Specifically, we need to process the sentence word by word, so we choose a recurrent neural network (RNN) [9] to extract the word embeddings \(\left( w_{0}, w_{1}, \ldots , w_{L-1}\right) \) from the given text description T.

$$\begin{aligned} \begin{array}{l} f_{t}=g_1\left( v_{w_{t}}+w_{s_{t-1}}\right) ;\\ w_{t}=g_2\left( v_{h_{t}}\right) , \end{array} \end{aligned}$$
(1)

where \(w=\left\{ w^{l} \mid l=0, \ldots , L-1\right\} \) and f represents the output of the hidden layer.
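As a concrete illustration, the following is a minimal PyTorch sketch of such an RNN text encoder, not the authors' implementation; the bidirectional LSTM and the layer sizes are illustrative assumptions. It returns the word-level features together with a global sentence vector used later for the initial stage.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Minimal RNN text encoder: maps a tokenized caption T to
    word-level features (w_0, ..., w_{L-1}) and a sentence vector s."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Bidirectional LSTM; each direction uses hidden_dim // 2 units
        # so the concatenated word feature has size hidden_dim.
        self.rnn = nn.LSTM(embed_dim, hidden_dim // 2,
                           batch_first=True, bidirectional=True)

    def forward(self, tokens):
        x = self.embed(tokens)                 # (B, L, embed_dim)
        words, (h, _) = self.rnn(x)            # words: (B, L, hidden_dim)
        # Sentence vector: concatenate the final hidden states of both directions.
        sent = torch.cat([h[0], h[1]], dim=1)  # (B, hidden_dim)
        return words, sent

# Usage: encode a batch of two captions of length 12 over a toy vocabulary.
encoder = TextEncoder(vocab_size=5000)
tokens = torch.randint(0, 5000, (2, 12))
word_feats, sent_feat = encoder(tokens)
print(word_feats.shape, sent_feat.shape)  # torch.Size([2, 12, 256]) torch.Size([2, 256])
```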

In our module, a word-level attention context matrix \(Att_{i-1}^{w}\) is generated. The word-level weight matrix \(Att_{i-1}^{w}\) and the visual feature \(f_{i-1}\) are then fed into a perceptron, which transforms the word-level features into the common semantic space of the visual features. Finally, the visual feature \(f_{i}\) of the next stage is generated from the word-level weight matrix \(Att_{i-1}^{w}\) and the visual feature \(f_{i-1}\).

As shown in Fig. 2, the proposed TRGAN has three generators (\(G_0\), \(G_1\), \(G_2\)), which take the hidden states (\(h_0\), \(h_1\), \(h_2\)) as input, and three corresponding discriminators (\(D_0\), \(D_1\), \(D_2\)). The generators produce images (\(\hat{I}_0\), \(\hat{I}_1\), \(\hat{I}_2\)) from low resolution to high resolution. First, a feature is extracted from the global sentence vector together with a random noise vector, and it is then combined with the visual feature vector extracted by the perceptron to generate the image of the initial stage.

$$\begin{aligned} \begin{array}{l} f_{0}=F_{0}\left( z, F^{c a}({s})\right) ; \\ f_{i}=F_{i}\left( f_{i-1}, F_{a t t_{i}}\left( f_{i-1}, w\right) \right) , i \in \{1,2\};\\ \hat{I}_{i}=G_{i}\left( f_{i}\right) . \end{array} \end{aligned}$$
(2)

Herein, z is a noise vector sampled from a standard normal distribution, i.e., \(z \sim N(0,1)\), \(f_{i} \in \mathbb {R}^{M_{i} \times N_{i}}\), and \(\hat{I}_{i} \in \mathbb {R}^{q_{i} \times q_{i}}\). \(F_{att_{i}}\) is the proposed word-level attention model. Each word vector is then weighted for each region of the image based on the hidden features h (the query), and each part of the initial image is drawn according to the weight of each word for each region.
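The following is a schematic PyTorch sketch of the pipeline in Eq. (2), under the assumption that image features are kept as flattened region matrices. The module definitions (the linear layer standing in for \(F^{ca}\), the fusion convolution, and the layer sizes) are illustrative assumptions rather than the exact TRGAN architecture, and the upsampling generators \(G_i\) are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordAttention(nn.Module):
    """F_att in Eq. (2): attends image region features f (B, C, N)
    over word features w (B, L, C) and returns a word-context map."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)  # perceptron mapping words into the visual space

    def forward(self, f, w):
        w = self.proj(w)                            # (B, L, C)
        scores = torch.bmm(w, f)                    # (B, L, N): word/region similarity
        attn = F.softmax(scores, dim=1)             # weights over words for each region
        return torch.bmm(w.transpose(1, 2), attn)   # (B, C, N) word-context vectors

class Stage0(nn.Module):
    """F_0: fuses noise z with the conditioned sentence vector F_ca(s)."""
    def __init__(self, z_dim=100, s_dim=256, dim=256, n_regions=64):
        super().__init__()
        self.fca = nn.Linear(s_dim, s_dim)          # stand-in for conditioning augmentation
        self.fc = nn.Linear(z_dim + s_dim, dim * n_regions)
        self.dim, self.n = dim, n_regions

    def forward(self, z, s):
        f0 = self.fc(torch.cat([z, self.fca(s)], dim=1))
        return f0.view(-1, self.dim, self.n)        # (B, C, N)

class RefineStage(nn.Module):
    """F_i: refines f_{i-1} jointly with its word context F_att(f_{i-1}, w)."""
    def __init__(self, dim=256):
        super().__init__()
        self.att = WordAttention(dim)
        self.fuse = nn.Conv1d(2 * dim, dim, kernel_size=1)

    def forward(self, f_prev, w):
        c = self.att(f_prev, w)                     # (B, C, N)
        return self.fuse(torch.cat([f_prev, c], dim=1))

# Each G_i would decode f_i into an image \hat{I}_i via an upsampling CNN (omitted).
z, s, w = torch.randn(2, 100), torch.randn(2, 256), torch.randn(2, 12, 256)
f0 = Stage0()(z, s)
f1 = RefineStage()(f0, w)
print(f0.shape, f1.shape)  # torch.Size([2, 256, 64]) torch.Size([2, 256, 64])
```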

2.2 TGOCM: Text Generation in the Opposite Direction and Correction Module

The TGOCM is divided into four parts: generating text in the opposite direction, matching word-level attention, the joint attention mechanism, and correcting the image. Each part is described in detail below. We employ the widely used encoder-decoder architecture, implemented with CNN [10] and RNN [11] models, respectively. The model mainly consists of three parts: a) Feature Extractor: the extracted image features have a size of 2048, and a dense layer reduces them to 256 dimensions. b) Sequence Processor: an embedding layer handles the text input, followed by an LSTM layer [12]. c) Decoder: the outputs of the above two parts are combined and passed through dense layers to make the final prediction.

$$\begin{aligned} \begin{array}{l} x_{2}=\mathrm{CNN}\left( I_{2}\right) ; \\ x_{t}=W_{e} T_{t}, t \in \{0, \ldots , L-1\}; \\ p_{t+1}=\mathrm{LSTM}\left( x_{t}\right) , t \in \{0, \ldots , L-1\}, \end{array} \end{aligned}$$
(3)

in which \(x_{2} \in \mathbb {R}^{M_{m-1}}\) is the visual feature used as the input that informs the LSTM about the image content, \(W_{e} \in \mathbb {R}^{M_{m-1} \times D}\) is a word embedding matrix that maps word features into the visual feature space, and \(p_{t+1}\) is the predicted probability distribution over the words.
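A minimal PyTorch sketch of this encoder-decoder captioner, following Eq. (3); the backbone, the layer sizes beyond the stated 2048-to-256 reduction, and the way the image feature is fused with the LSTM states are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    """Sketch of the TGOCM captioner in Eq. (3): a 2048-d image feature
    (e.g. from a CNN backbone) is reduced to 256-d, word tokens are
    embedded and fed to an LSTM, and a dense layer predicts the next word."""

    def __init__(self, vocab_size, feat_dim=2048, hidden_dim=256):
        super().__init__()
        self.reduce = nn.Linear(feat_dim, hidden_dim)         # a) feature extractor head
        self.embed = nn.Embedding(vocab_size, hidden_dim)     # b) sequence processor
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.decoder = nn.Linear(2 * hidden_dim, vocab_size)  # c) decoder over fused features

    def forward(self, image_feat, tokens):
        x2 = torch.relu(self.reduce(image_feat))              # (B, 256)
        xt = self.embed(tokens)                               # (B, L, 256)
        h, _ = self.lstm(xt)                                  # (B, L, 256)
        # Broadcast the image feature over time and fuse it with the LSTM states.
        img = x2.unsqueeze(1).expand(-1, h.size(1), -1)       # (B, L, 256)
        logits = self.decoder(torch.cat([h, img], dim=2))     # (B, L, vocab)
        return logits                                         # softmax over vocab gives p_{t+1}

# Usage with toy inputs: a batch of two image features and captions of length 12.
model = CaptionModel(vocab_size=5000)
feat = torch.randn(2, 2048)
tokens = torch.randint(0, 5000, (2, 12))
print(model(feat, tokens).shape)  # torch.Size([2, 12, 5000])
```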

Here, we can compare the real semantics with the generated semantics. By calculating the similarity [13] between the two, each word is assigned a corresponding weight according to how well it is reproduced.

$$\begin{aligned} \cos (\theta )=\frac{\sum _{i=1}^{n}\left( x_{i} \times y_{i}\right) }{\sqrt{\sum _{i=1}^{n}\left( x_{i}\right) ^{2}} \times \sqrt{\sum _{i=1}^{n}\left( y_{i}\right) ^{2}}}, \end{aligned}$$
(4)

where \(x_{i}\) represents the actual text and \(y_{i}\) represents the generated text. The closer the cosine is to 1, the closer the angle between the two vectors is to 0\(^\circ \) and the more similar they are; when the angle equals 0, the two vectors are identical.
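For example, Eq. (4) can be computed per word as in the sketch below; mapping low similarity to a large weight (here simply \(1-\cos \theta \)) is an assumed weighting rule, since the exact formula is not given here.

```python
import torch
import torch.nn.functional as F

def word_similarity_weights(real_words, gen_words):
    """Eq. (4): cosine similarity between real and generated word vectors;
    words that are poorly reproduced (low similarity) receive larger weights.
    real_words, gen_words: (L, D) matrices of word features."""
    sim = F.cosine_similarity(real_words, gen_words, dim=1)  # (L,) values in [-1, 1]
    weights = 1.0 - sim                                      # assumed: emphasize mismatched words
    return sim, weights

# Toy usage with 12 word vectors of dimension 256.
real, gen = torch.randn(12, 256), torch.randn(12, 256)
sim, w = word_similarity_weights(real, gen)
print(sim.shape, w.shape)  # torch.Size([12]) torch.Size([12])
```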

Meanwhile, each column of h is a feature vector of a sub-region of the image. For the \(j^{t h}\) sub-region, its word-context vector is a dynamic representation of word vectors relevant to \(h_{j}\), which is calculated by

$$\begin{aligned} c_{j}=\sum _{i=0}^{T-1} \beta _{j, i} e_{i}^{\prime }, \text{ where } \beta _{j, i}=\frac{\exp \left( s_{j, i}^{\prime }\right) }{\sum _{k=0}^{T-1} \exp \left( s_{j, k}^{\prime }\right) }, \end{aligned}$$
(5)

where \(s_{j, i}^{\prime }=h_{j}^{T} e_{i}^{\prime }\), and \(\beta _{j, i}\) indicates the weight that the model attends to the \(i^{t h}\) word when generating the \(j^{t h}\) sub-region of the image.
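Equation (5) amounts to a softmax attention of image sub-regions over words, as in the following sketch (the tensor shapes are illustrative assumptions).

```python
import torch
import torch.nn.functional as F

def word_context(h, e):
    """Eq. (5): for each image sub-region h_j, attend over word vectors e'_i.
    h: (N, D) sub-region features; e: (T, D) word vectors.
    Returns c: (N, D) word-context vectors and beta: (N, T) attention weights."""
    s = h @ e.t()               # s'_{j,i} = h_j^T e'_i, shape (N, T)
    beta = F.softmax(s, dim=1)  # normalize over the T words
    c = beta @ e                # c_j = sum_i beta_{j,i} e'_i, shape (N, D)
    return c, beta

# Toy usage: 64 sub-regions attending over 12 words.
h, e = torch.randn(64, 256), torch.randn(12, 256)
c, beta = word_context(h, e)
print(c.shape, beta.shape)  # torch.Size([64, 256]) torch.Size([64, 12])
```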

Each word receives its corresponding weight from the matching and word-level attention modules. In this way, we can not only locate specific regions, but also focus on the word vectors with the greatest loss. Based on the above, we multiply the two matrices, which guides the final generation phase: the final stage corrects and optimizes the generated image according to the attention mechanism. Such targeted optimization noticeably improves the quality of the generated images.
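A minimal illustration of this combination; since the exact fusion rule is not specified here, element-wise scaling of the attention matrix by the per-word weights is an assumption.

```python
import torch

# beta: (N, T) region-to-word attention weights; w: (T,) per-word mismatch weights.
beta = torch.softmax(torch.randn(64, 12), dim=1)
w = torch.rand(12)
# Regions attending to poorly generated words receive larger guidance values.
guidance = beta * w.unsqueeze(0)
print(guidance.shape)  # torch.Size([64, 12])
```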

2.3 Objective Function

The whole model is divided into three generation stages, so we will describe the objective function in three stages. The generator losses can be defined as:

$$\begin{aligned} L_{G}=\sum _{i=0}^{2} \mathcal {L}_{G_{i}}+\alpha L_{G 1}+\beta L_{cap}+\lambda L_{w s}, \end{aligned}$$
(6)

Herein, \(L_{G1}\), \(L_{cap}\) and \(L_{ws}\) represent the losses of the three stages, respectively, and \(\alpha \), \(\beta \) and \(\lambda \) are the corresponding weighting factors. The discriminator works against the generator to determine whether the generated image is real; the calculation follows the conventional formulation.
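As a sketch, the combination in Eq. (6) can be written as follows; the weight values shown are illustrative defaults, not values reported here.

```python
import torch

def total_generator_loss(adv_losses, l_g1, l_cap, l_ws,
                         alpha=1.0, beta=1.0, lam=1.0):
    """Eq. (6): total generator objective, summing the per-stage adversarial
    losses L_{G_i} and the three weighted extra terms."""
    return sum(adv_losses) + alpha * l_g1 + beta * l_cap + lam * l_ws

# Toy usage with scalar tensors standing in for the individual loss terms.
loss = total_generator_loss(
    [torch.tensor(0.7), torch.tensor(0.5), torch.tensor(0.4)],
    l_g1=torch.tensor(0.3), l_cap=torch.tensor(0.2), l_ws=torch.tensor(0.1))
print(loss)  # tensor(2.2000)
```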

The adversarial loss for \(D_{i}\) [4] is defined as:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{D_{i}}=&-\frac{1}{2} \mathbb {E}_{x_{i} \sim p_{\text{ data } _{i}}}\left[ \log D_{i}\left( x_{i}\right) \right] \\&-\frac{1}{2} \mathbb {E}_{\hat{x}_{i} \sim p_{G_{i}}}\left[ \log \left( 1-D_{i}\left( \hat{x}_{i}\right) \right) \right] \\&-\frac{1}{2} \mathbb {E}_{x_{i} \sim p_{\text{ data}_{i}}}\left[ \log D_{i}\left( x_{i}, \bar{e}\right) \right] \\&-\frac{1}{2} \mathbb {E}_{\hat{x}_{i} \sim p_{G_{i}}}\left[ \log \left( 1-D_{i}\left( \hat{x}_{i}, \bar{e}\right) \right) \right] . \end{aligned} \end{aligned}$$
(7)
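A minimal sketch of Eq. (7), assuming the discriminator outputs sigmoid probabilities for both the unconditional branch and the branch conditioned on the sentence embedding \(\bar{e}\).

```python
import torch

def discriminator_loss(d_real_uncond, d_fake_uncond, d_real_cond, d_fake_cond):
    """Eq. (7): each D_i combines an unconditional term (is the image real?)
    with a conditional term (does it match the sentence embedding e_bar?).
    The arguments are the discriminator's sigmoid outputs in (0, 1)."""
    eps = 1e-8
    return -0.5 * (torch.log(d_real_uncond + eps).mean()
                   + torch.log(1 - d_fake_uncond + eps).mean()
                   + torch.log(d_real_cond + eps).mean()
                   + torch.log(1 - d_fake_cond + eps).mean())

# Toy usage: near-perfect discrimination drives the loss toward zero.
ones, zeros = torch.full((4,), 0.99), torch.full((4,), 0.01)
print(discriminator_loss(ones, zeros, ones, zeros))  # close to 0
```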

3 Experiments

In this section, we first introduce the datasets, training details, and evaluation metrics used in our experiments. We then carry out extensive experiments to evaluate the proposed model, comparing it with several state-of-the-art models (i.e., StackGAN [14], StackGAN++ [5], AttnGAN [4] and MirrorGAN [6]) using standard evaluation metrics.

3.1 Datasets

Most studies on text-to-image synthesis are based on the CUB dataset and the more complex COCO dataset. Each image has 10 text descriptions in the CUB dataset and 5 text descriptions in the COCO dataset.

3.2 Training Details

Firstly, we pre-train the three models for text encoding, image encoding and text reproduction. To simplify the training process, we directly load the pre-trained models and parameters into our overall model. We preprocess the COCO dataset and randomly select a quarter of the original training and test sets for training and testing. Training is performed for 300 epochs on the CUB birds dataset and 300 epochs on the COCO dataset.
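As an illustration of the subsampling step, the following is a minimal sketch assuming a standard PyTorch dataset object; the dataset class and the seed are illustrative, not part of this paper.

```python
import random
import torch
from torch.utils.data import TensorDataset, Subset

def quarter_subset(dataset, seed=0):
    """Randomly keep one quarter of a dataset, as done for COCO above."""
    indices = list(range(len(dataset)))
    random.Random(seed).shuffle(indices)
    return Subset(dataset, indices[: len(indices) // 4])

# Toy usage with a stand-in dataset of 1000 samples.
toy = TensorDataset(torch.arange(1000))
print(len(quarter_subset(toy)))  # 250
```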

Fig. 3. Examples of images generated by AttnGAN, MirrorGAN and TRGAN conditioned on text descriptions from the CUB and COCO test datasets, together with the corresponding ground truth.

3.3 Results

Quantitative Results: The proposed TRGAN is based on a multi-stage structure that generates images from low resolution to high resolution. GAN-INT-CLS [15], GAWWN [16], AttnGAN, StackGAN++ and MirrorGAN, proposed in previous studies, are also based on multi-stage generation from low resolution to high resolution, so we compare TRGAN with these previous models (AttnGAN, StackGAN++ and MirrorGAN). As shown in Table 1, compared with MirrorGAN, which employs a Siamese network to ensure text-image semantic consistency on the relatively simple CUB dataset, our TRGAN improves the IS from 4.56 to 4.66 and the R-Precision from 60.42 to 69.05. This is because our TRGAN generates a better initial image and optimizes it in the subsequent generation process. This demonstrates that our model produces higher-quality images for both single-entity and multi-entity scenes.

Table 1. IS scores and R-Precision of the six models on the CUB dataset.

Qualitative Results: For qualitative evaluation, Fig. 3 shows text-to-image synthesis examples generated by our TRGAN and the state-of-the-art models. In general, our TRGAN generates images with more vivid details as well as clearer backgrounds in most cases, compared to AttnGAN, MirrorGAN and the ground truth. The reason is that although StackGAN, AttnGAN and MirrorGAN use stacked architectures or cross-modal spatial attention, the problem of low-quality initial images is not completely solved, whereas our model first improves the quality of the initial image and then applies targeted optimization to the generated regions.

Ablation Study: We next conduct ablation studies on the proposed model and its variants. To validate the effectiveness of the reverse text generation module, we conduct several comparative experiments by excluding or including these components in TRGAN. We compare the baseline model with models that add the reverse text generation module in the second and the last stage, respectively. The IS score increases from 4.33 to 4.49 when adding the reverse text generation module, and then from 4.49 to 4.69 when adding the module at different stages (stage 2 and stage 3), as shown in Fig. 4. This confirms that improving the quality of the initial image ensures the quality of the final result.

Fig. 4. Comparison of the results of the baseline model and the model with the added reverse text generation module.

4 Conclusions

In this paper, we have proposed a new framework called the Text-representation Generative Adversarial Network (TRGAN). The whole framework consists of two modules, namely JASGM and TGOCM. The first module focuses on the generation of fine-grained features. In the second module, the image from the previous stage is repaired and corrected based on the attention mechanism. Extensive experimental results show that our proposed TRGAN significantly outperforms state-of-the-art models on the CUB and COCO datasets.