1 Introduction

People describe rich, detailed scenes through language, and the ability to generate images from such descriptions can facilitate creative applications in many contexts, including art design and multimedia content creation [1, 2]. This has inspired researchers to design text-to-image generation models that help people make decisions quickly in specific scenarios, such as presentation and advertising design [3, 4]. In recent years, diffusion models have attracted the attention of many scholars due to their promising performance in image generation. Within this framework, DALL-E 2 [5] and Imagen [6] have become successful image generation models.

Imagen is currently one of the strongest image generation models. Its most distinguishing feature is its scale, reflected in particular by its use of a large text encoder, T5 [7], which is pre-trained on a sizable plain-text corpus. T5 proves very effective for enhancing image fidelity and image-text alignment [6]. However, using T5 alone to obtain text embeddings cannot guarantee that the model learns important textual features, such as the semantic layout. Besides visual elements, the semantic layout is recognised as an important factor in guiding text-to-image synthesis [8]. Our experimental results provide evidence for this claim.

Furthermore, very few studies address the UNet issue in Imagen. The diffusion model of Imagen relies on the Efficient-UNet, which suffers from the limitations of CNN convolution operations. CNNs are good at extracting low-level features and elements of visual structure, such as colour, contour, texture and shape [9], and they focus on the consistency of these low-level features under transformations such as translation [10] and rotation [11]. This is also the main reason why CNNs are widely used in object detection [12]. In other words, while convolutional filters are good at detecting key points, object boundaries and other basic units that constitute visual elements, they are inefficient at extracting global and layout features. For text-to-image synthesis, it is crucial to extract the complex relationships between objects accurately from limited text. The Transformer handles this demand more naturally and efficiently than CNNs, mainly because its attention mechanism can effectively mine the relationships between text features, allowing the model to focus not only on local information but also to propagate representations from the local to the global layout [13, 14].

To address the aforementioned drawbacks of Imagen, in this paper we propose a diffusion text-to-image generation model called Swinv2-Imagen, which is based on a hierarchical visual Transformer and a scene graph incorporating layout information. Specifically, the semantic layout is generated via semantic scene graphs, enabling Swinv2-Imagen to parse the layout information in the text description effectively. We adopt the Stanford Scene Graph Parser [15] to obtain the scene graph from the text. Subsequently, the entity and relationship embeddings are extracted using a frozen Graph Convolution Network (GCN) [15]. The image generation process is conditioned on text, object and relationship embeddings. The layout representation with global semantic information ensures the realism of the generated images. In addition, the diffusion models are built on Swinv2-Unet, a variant of Swin Transformer v2 [16], which allows the model to learn features from local to global. Finally, we evaluate our model on the MSCOCO, CUB and MM-CelebA-HQ datasets. The results show that the proposed model outperforms the current best generative model, Imagen, on MSCOCO. The ablation experiments reveal that the addition of semantic layouts effectively improves the semantic understanding of the model.

The key contributions of this paper are summarised below.

  1. We leverage scene graphs to extract entity and relation embeddings, improving the local and layout representation of the text for a more accurate understanding of the text and more realistic image generation;

  2. We propose Swinv2-UNet as a novel diffusion model architecture. The model leverages attention to explore the relationships between features, allowing the diffusion model to focus on different granularities of features at different moments, from local to global;

  3. We fuse the scene graph with the diffusion model, and the experimental results demonstrate that the resulting images not only contain the objects specified in the text but also additional objects inferred from specific words (e.g. kitchen);

  4. We achieve a new state-of-the-art FID result (FID=7.21) on the MSCOCO dataset compared with the latest generative models. Better results are also obtained on the CUB (FID=9.78) and MM CelebA-HQ (FID=10.31) datasets.

The rest of the paper is organised as follows. In Sect. 2, related works are reviewed. In Sect. 3, we elaborate on the proposed Swinv2-Imagen model. In Sect. 4, we conduct extensive experiments to evaluate the performance of the proposed model and perform an ablation study to evaluate the contributions of each key component of our model. Finally, we conclude this paper in Sect. 5, and discuss future research directions.

2 Related work

2.1 Diffusion models

Text-to-image synthesis is a typical application of multimodal and cross-modal learning. In the field of image generation, most models fall into two categories: GAN-based generation models [17,18,19,20,21,22] and diffusion-based models [23,24,25,26,27]. The former have been developed over the last few years and are widely used in many scenarios, such as medical imaging and image restoration. The latter have demonstrated performance superior to GAN models and are acknowledged as state-of-the-art deep generative models [6, 28, 29].

Diffusion models and GAN generative models are essentially comparable, in that both generate images from noise. However, in contrast to GANs, diffusion models do not suffer from training instability or mode collapse. A diffusion model transforms the data distribution into random noise and then reconstructs data samples from the same distribution [6, 30]. Diffusion models demonstrate outstanding performance on a number of tasks, such as multimodal modelling. Many contemporary text-to-image synthesis models, e.g. DALL-E 2 [5], Imagen [6] and GLIDE [25], are constructed on diffusion models. They cascade multiple diffusion models to improve the image generation quality step by step. DALL-E 2 processes the text with a diffusion prior over CLIP latents. In contrast, Imagen discards the prior model and replaces it with a large pre-trained text encoder, T5. Although the T5 model leveraged in Imagen improves the understanding of the text, it does not ensure that the model understands the semantic layout of the text, especially in complex sentences containing multiple objects and relationships. As a result, the model may fail to reproduce some entities or may lose some entity relationships. Therefore, we attempt to model the global semantic layout by adding a scene graph to the text processing. Furthermore, Imagen builds its diffusion models on Efficient-Unet, which is not the best choice for image generation because its CNN blocks restrict the receptive field to the kernel window.

2.2 Scene graph and graph representation learning

A sentence is by nature a linear data structure in which one word follows another [15]. Usually, when a sentence is complex and contains multiple objects, it is time-consuming to analyse the sentence directly, and the accuracy of text-image alignment is not guaranteed. Complex sentences often incorporate rich scene information, and mapping this information into a scene graph can provide an intuitive understanding of the relationships between the objects in a sentence [31]. Previous studies reveal that the performance of multimodal models, such as text-to-image synthesis, depends significantly on mining visual relationships [32]. Scene graphs can provide a high-level understanding of scene information [15]. Therefore, the scene graph is recognised as a useful representation of images and text. Specifically, each node in a scene graph represents an object, such as a person or an event, and each object has multiple attributes, such as shape. The relationships between objects are denoted by the edges between nodes, which can be an action or a position [33]. Recently, scene graphs have been used extensively for tasks such as text-based image retrieval [34, 35], semantic segmentation [36, 37], visual question answering [38], image captioning [39,40,41,42] and image generation [15, 31, 43, 44]. The recently proposed dynamic scene graph generation also demonstrates the prospects of scene graphs in video monitoring, autonomous driving and other fields related to video processing and generation [45].

In addition, an image generation model cannot manipulate graph-structured data such as scene graphs directly, so scene graphs are usually used in conjunction with graph representation learning [46]. The main objective of graph representation learning is to extract node and edge contexts from the scene graph and map them to a set of embeddings. Existing methods fall into two types: machine learning methods based on random walks and deep learning methods based on graph convolutions [46]. Node2vec [47] is a typical representative of the former; it learns node embeddings on a graph based on the SkipGram [48] formulation and optimises the sampling strategy. Related studies [49] identify two main strategies for sampling neighbouring nodes in a graph: Breadth-First Search (BFS) and Depth-First Search (DFS). BFS requires each sampled node to be a direct neighbour of the current node, yielding a graph representation that is more concerned with local information. In contrast, DFS samples nodes that are as far as possible from the initial node, producing a representation that focuses more on global information. Random-walk-based representation learning [50] comprises multiple stages, each with its own optimisation goal, and is therefore a typical non-end-to-end approach. Graph convolution-based methods, e.g. graph convolutional neural networks [15], learn both node features and structural information in an end-to-end way, attending to local information as well as global structural features. Graph Convolutional Neural Networks (GCNs) have gained significant attention in recent studies as powerful tools for analysing graph-structured data, such as social networks and database tables [51]. For the text-to-image synthesis task, generation models leverage GCNs to capture semantic relationships between textual descriptions and visual features, enabling more accurate image generation. Recently, many studies have been conducted to enhance graph-based neural networks. Specifically, a novel augmentation method called GraphENS has been proposed to address overfitting to the neighbour sets of minor-class nodes [52, 53]. The authors propose a saliency-based node mixing method that leverages the abundant class-generic attributes of other nodes while preventing the injection of class-specific features, and they restrict message passing to the incoming edges of oversampled nodes to mitigate the negative impact of oversampling.

2.3 UNet

UNet is an encoder–decoder architecture with a scalable structure [54]. The encoding stage of UNet consists of four downsampling steps; symmetrically, the decoding stage upsamples four times, restoring the encoder output to the resolution of the original image. In contrast to Fully Convolutional Networks (FCN) [55], UNet upsamples four times and uses skip connections between the corresponding convolution blocks of the encoder and decoder. The skip connections ensure that the final recovered feature map incorporates more low-level semantic features and that features at different scales are well fused, allowing multi-scale prediction. The fourfold upsampling also enables the segmentation map to recover fine information such as edges. However, UNet also has shortcomings. For example, UNet++ [56, 57] argues that directly combining shallow encoder features with deeper decoder features, as UNet does, is inappropriate because it potentially leads to semantic gaps. Furthermore, UNet 3+ [58] maximises the scope of information fusion and circulation: each decoder layer fuses small-scale and same-scale feature maps from the encoder with larger-scale feature maps from the decoder, capturing both fine-grained and coarse-grained semantics at full scale.

Many researchers have developed UNet variants by improving and optimising the original UNet. For example, ResUNet [59] and DenseUNet [60] are inspired by residual and dense connections, respectively; each sub-module of the UNet is replaced with a form that has a residual or dense connection. Other variants include MultiResU-Net [61] and R2 UNet [62]; all of these models are constructed from multiple convolutional blocks. With the advent of the Transformer, researchers began to build UNets on the Transformer, such as Swin-UNet [63]. While Swin-UNet mitigates the limitations of CNN convolutional operations, it is likely to suffer from training instability due to its use of the Swin Transformer block. Swin Transformer v2 [16] improves on the Swin Transformer, effectively avoiding training instability and being easier to scale.

Inspired by these research works, we propose the Swinv2-Imagen model, which leverages scene graphs as auxiliary modules to help the model understand the text semantics more comprehensively. In addition, Swinv2-Unet is applied to build the diffusion models, so that our model is based on a full Transformer implementation. As a result, it effectively addresses the limitations of CNN convolution operations, theoretically enabling it to synthesise better images than the baselines.

3 Swinv2-Imagen

The overall architecture of the proposed Swinv2-Imagen model is shown in Fig. 1. It takes text descriptions as input and uses scene graphs to guide downstream image generation more accurately and efficiently. The upstream comprises two sub-modules: the text encoder, which maps the text input to a text embedding sequence, and the scene graph generator. The scene graph generator includes a scene graph parser and a frozen graph neural network, which aims to represent the objects and relationships in a text with a graph structure. The downstream consists of a set of conditional diffusion models that integrate the intermediate embeddings from the upstream and generate high-fidelity images step by step.

The input of the model is a text-image pair. Firstly, the text is tokenised by the T5 tokenizer and fed into the embedding layer to obtain the initial text embedding. Next, it passes through the T5 encoder (n-layer T5 blocks) to obtain the Text Embeddings.

Fig. 1

Overall architecture of Swinv2-Imagen. The text is passed through both a frozen T5 Encoder and a scene graph. The scene graph mines complex entity relationships explicitly, ensuring that the model understands the text semantics accurately

Meanwhile, the scene graph parser extracts the scene graph from the text, and the frozen GCN (m-layer Graph Triple Convolution) obtains the corresponding Object and Relation embeddings. Finally, the Conditional embeddings are obtained by concatenating the Text embeddings, Object embeddings and Relation embeddings in this order. The Conditional embeddings are used as conditional input for subsequent super-resolution image generation. In the following subsections, we describe the main components of Swinv2-Imagen in detail.
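
As a concrete illustration of this fusion step, the following sketch shows how the three embedding sequences could be concatenated into the conditional embeddings; the sequence lengths and the embedding width of 512 are illustrative assumptions, not the actual dimensions used in Swinv2-Imagen.

```python
# A minimal sketch of assembling the conditional embeddings; shapes are assumptions.
import torch

def build_conditional_embeddings(text_emb, obj_emb, rel_emb):
    """Concatenate text, object and relation embeddings along the sequence axis.

    text_emb: (B, L_text, D) -- output of the frozen T5 encoder
    obj_emb:  (B, L_obj, D)  -- object embeddings from the frozen GCN
    rel_emb:  (B, L_rel, D)  -- relation embeddings from the frozen GCN
    """
    return torch.cat([text_emb, obj_emb, rel_emb], dim=1)  # (B, L_text+L_obj+L_rel, D)

# Toy shapes only.
cond = build_conditional_embeddings(
    torch.randn(2, 77, 512), torch.randn(2, 6, 512), torch.randn(2, 4, 512))
print(cond.shape)  # torch.Size([2, 87, 512])
```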

3.1 Pre-trained frozen text encoders

It is widely acknowledged that a robust semantic text encoder is essential for text-to-image synthesis models and plays a crucial role in analysing the complexity and composition of textual input [6]. Previously, language models were mainly built on RNN architectures. However, since the emergence of the Transformer, a number of Transformer-based pre-trained language models have been developed, such as GPT [64,65,66], BERT [67] and T5 [7]. The original Imagen work compares popular text encoders, i.e. BERT, CLIP and T5-XXL, with their parameters frozen, and the results demonstrate the promising performance of T5-XXL in terms of both image-text alignment and image fidelity [6]. Therefore, we adopt the T5 large language model for text encoding in the proposed model.

3.2 Scene graph and frozen graph convolutional neural network

Fig. 2

Process of Object and Relation embeddings extraction. The input to the model is a sentence, which is first parsed by the scene graph parser into a graph structure (scene graph). Each node represents an object in the text and each edge represents a relationship between objects. Finally, all the nodes and edges are parsed by the graph neural network into an object embedding and a relation embedding, respectively

This sub-module aims to extract entity and relationship features from the text to enhance the text understanding of the model. We adopt a scene graph parser to represent the text as a scene graph, followed by a frozen GCN that extracts the entity and relation embeddings used for image generation with the diffusion models. Scene graphs with graph neural networks [15] have been proven to be highly effective in extracting object relationships from text. As shown in Fig. 2, Swinv2-Imagen constructs a scene graph for the text, which is then fed to a graph neural network to extract the entities and relationships from the scene graph. For any given text description, the corresponding scene graph is represented as (OE), where \(O=(o_1,o_2,o_3,\cdots ,o_n )\) denotes the objects in the sentence, i.e. subjects and objects, and E is a collection of edges of the form \((o_i, r, o_j)\), where \(r \in {\mathcal {R}} \) and \( {\mathcal {R}} \) refers to a collection of relationships. In the end, object and relation embeddings are constructed, which are used to assist the T5 model in analysing and understanding the text more comprehensively.
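
For illustration, a hypothetical parse of one of our example captions into this (OE) structure might look as follows; the exact triples produced by the Stanford Scene Graph Parser may differ.

```python
# A hypothetical scene graph for "Food cooks in a pot on a stove in a kitchen".
objects = ["food", "pot", "stove", "kitchen"]      # O = (o_1, ..., o_n)
edges = [                                          # E = {(o_i, r, o_j)}
    ("food", "cooks in", "pot"),
    ("pot", "on", "stove"),
    ("stove", "in", "kitchen"),
]
```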

The input to the graph convolution is a scene graph, with each node and edge represented as a vector of dimension \(D_{\text {in}}\), i.e. \({{\textbf {v}}}_{i}, {{\textbf {v}}}_{r} \in {\mathbb {R}}^{D_{\text {in}}}\). In the graph convolution sub-module, these vectors are used to compute output vectors of dimension \( D_{\text{out}} \) for each node and edge, i.e. \({{\textbf {v}}}_{i}^{\prime }, {{\textbf {v}}}_{r}^{\prime } \in {\mathbb {R}}^{D_{\text{out}}}\). Three functions, \(g_s\), \(g_o\) and \(g_p\), are used to calculate the output object feature vectors and relation vectors. They take a triplet as input, i.e. \( \left( {{\textbf {v}}}_{i}, {{\textbf {v}}}_{r}, {{\textbf {v}}}_{j} \right) \). In the scene graph, given an edge \( {{\textbf {v}}}_{r} \), the two associated objects, \( {{\textbf {v}}}_{i} \) and \({{\textbf {v}}}_{j} \), are determined. Thus, the output relationship vector \( {{\textbf {v}}}_{r}^{\prime }\) can be simply expressed as:

$$\begin{aligned} {{\textbf {v}}}_{r}^{\prime }=g_p\left( {{\textbf {v}}}_{i}, {{\textbf {v}}}_{r}, {{\textbf {v}}}_{j}\right) \end{aligned}$$
(1)

In contrast, the calculation of the output object vectors \( {{\textbf {v}}}_{i}^{\prime } \) is more complicated. Generally, an object is associated with two or more relations. Therefore, the output vector of an entity \(o_i\) is calculated by considering all the vectors directly connected to the object, i.e. \( {{\textbf {v}}}_{j} \), and the corresponding relationship vectors \( {{\textbf {v}}}_{r} \). The function \(g_s\) in Eq. (2) is used to compute vectors for all edges starting at node \(o_i\), and the function \(g_o\) in Eq. (3) is used to compute vectors for all edges ending at node \(o_i\). Afterwards, these vectors are collected into the lists \( V_i^{s} \) and \( V_i^{o} \).

$$\begin{aligned} V_{i}^{s} & ={} \lbrace g_s \left( {{\textbf {v}}}_{i}, {{\textbf {v}}}_{r}, {{\textbf {v}}}_{j} \right) : \left( o_i, r, o_j \right) \in E \rbrace \end{aligned}$$
(2)
$$\begin{aligned} V_{i}^{o} & ={} \lbrace g_o \left( {{\textbf {v}}}_{j}, {{\textbf {v}}}_{r}, {{\textbf {v}}}_{i} \right) : \left( o_j, r, o_i \right) \in E \rbrace \end{aligned}$$
(3)

Then, the output vector \( {{\textbf {v}}}_{i}^{\prime } \) for the entity \(o_i\) is expressed as follows:

$$\begin{aligned} {{\textbf {v}}}_{i}^{\prime } = h \left( V_{i}^{s} \cup V_{i}^{o} \right), \end{aligned}$$
(4)

where h denotes a function that pools all vectors in lists \( V_i^{s} \) and \( V_i^{o} \) to a single output vector [15].
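
A minimal PyTorch sketch of this graph triple convolution is given below, assuming, for simplicity, that \(g_s\), \(g_p\) and \(g_o\) share a single MLP over the concatenated triple and that the pooling function h is a mean; the actual network in [15] may differ in detail, and layer sizes are illustrative.

```python
# Sketch of Eqs. (1)-(4): one MLP produces (g_s, g_p, g_o) per triple; h is mean pooling.
import torch
import torch.nn as nn

class GraphTripleConv(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3 * d_in, 3 * d_out), nn.ReLU(),
                                 nn.Linear(3 * d_out, 3 * d_out))
        self.d_out = d_out

    def forward(self, obj_vecs, rel_vecs, edges):
        # obj_vecs: (O, d_in), rel_vecs: (E, d_in), edges: (E, 2) with (i, j) object indices.
        s_idx, o_idx = edges[:, 0], edges[:, 1]
        triples = torch.cat([obj_vecs[s_idx], rel_vecs, obj_vecs[o_idx]], dim=1)
        g_s, g_p, g_o = self.net(triples).chunk(3, dim=1)        # Eqs. (1)-(3)

        # h: mean-pool every candidate vector attached to each object (Eq. (4)).
        pooled = torch.zeros(obj_vecs.size(0), self.d_out)
        counts = torch.zeros(obj_vecs.size(0), 1)
        ones = torch.ones(len(s_idx), 1)
        pooled.index_add_(0, s_idx, g_s)
        pooled.index_add_(0, o_idx, g_o)
        counts.index_add_(0, s_idx, ones)
        counts.index_add_(0, o_idx, ones)
        return pooled / counts.clamp(min=1), g_p                 # (v_i', v_r')

# Toy usage: 4 objects, 3 relations.
conv = GraphTripleConv(d_in=128, d_out=128)
new_obj, new_rel = conv(torch.randn(4, 128), torch.randn(3, 128),
                        torch.tensor([[0, 1], [1, 2], [2, 3]]))
```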

3.3 Image generator

The image generator is composed of three diffusion models located downstream. In the base diffusion model, a latent variable is obtained by adding noise to the image T times; after forward and reverse diffusion, a base 64 \(\times \) 64 image is generated. The base image is fed into the first Swinv2-Unet to generate a 256 \(\times \) 256 image. Finally, the image is passed to the second Swinv2-Unet for super-resolution generation, producing a 1024 \(\times \) 1024 high-resolution image.
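
Schematically, the cascade can be sketched as follows; the three stages below are dummy stand-ins for the base diffusion model and the two Swinv2-Unet super-resolution models, each of which would in practice run a full reverse-diffusion sampling loop conditioned on the conditional embeddings.

```python
# Schematic cascade: 64 x 64 base image -> 256 x 256 -> 1024 x 1024 (dummy stages).
import torch
import torch.nn.functional as F

def base_model(cond):                       # stand-in: would run reverse diffusion at 64 x 64
    return torch.rand(cond.size(0), 3, 64, 64)

def sr_stage(cond, low_res, size):          # stand-in for a Swinv2-Unet SR diffusion model
    return F.interpolate(low_res, size=(size, size), mode="bilinear")

cond = torch.randn(1, 87, 512)              # conditional embeddings (see the earlier sketch)
x64 = base_model(cond)                      # base image
x256 = sr_stage(cond, x64, 256)             # first super-resolution stage
x1024 = sr_stage(cond, x256, 1024)          # second super-resolution stage
print(x1024.shape)                          # torch.Size([1, 3, 1024, 1024])
```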

The diffusion model can be described as an encoder–decoder architecture. It first adds Gaussian noise \((\epsilon )\) to the original image \((x_0 \sim q(x_0))\) in an iterative manner, where the number of iterations is T (the number of timesteps, usually \(T=1000\)). As T becomes very large, i.e. \((T \rightarrow \infty )\), the image approaches a random Gaussian noise distribution \(x_T\). This process is called forward diffusion and can be thought of as an encoder. The model then learns how to recover the original image \((x_0 \sim q(x_0))\) from the noise distribution \((x_T)\) by gradually removing the noise from \(x_T\). This process is called reverse diffusion and can be thought of as a decoder [28, 29].

In the forward diffusion, the result at timestep t is mainly determined by the result at timestep \(t-1\) and the added noise \(\epsilon _t\), i.e.

$$\begin{aligned} x_t= & {} \sqrt{1-\beta _t} x_{t-1} + \sqrt{\beta _t} \epsilon _t, \quad q(x_t \vert x_{t-1}) \sim {{\mathcal {N}} (\sqrt{1-\beta _t} x_{t-1}, \beta _t {{\textbf {I}}})}, \end{aligned}$$
(5)
$$\begin{aligned} q(x_{1:T} \vert x_0)= & {} \prod _{i=1}^T q(x_t \vert x_{t-1}) \end{aligned}$$
(6)

where \(\beta _t\) can be understood as a linear weighting coefficient. At different timesteps, \(x_{t-1}\) and \(\epsilon _t\) have different effects on the result. When t is small, e.g. \(t=1\), \(x_{t-1}\) has a greater impact on the result and little noise is added. Conversely, when t is large, e.g. \(t=900\), more noise is added and it contributes more to the result than \(x_{t-1}\).

The distribution of the noise added at each timestep in the forward process is identical, i.e. \(\epsilon _1, \epsilon _2, \ldots \sim {\mathcal {N}} (0, {{\textbf {I}}})\). Thus, we can compute the result at any timestep \(x_t\) directly from \(x_0\), i.e.

$$\begin{aligned} x_t = \sqrt{{\bar{\alpha }}_t} x_0 + \sqrt{1 - {\bar{\alpha }}_t} \epsilon _t \end{aligned}$$
(7)

where \(\alpha _t = 1 - \beta _t \), \({\bar{\alpha }}_t = \prod _{i=1}^t \alpha _i\).
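
The closed-form sampling of Eq. (7) can be sketched as follows, assuming a standard linear \(\beta \) schedule; the schedule endpoints and shapes are illustrative.

```python
# Sketch of the forward diffusion in Eq. (7) with an illustrative linear beta schedule.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # beta_t
alphas = 1.0 - betas                         # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)    # \bar{alpha}_t = prod_{i<=t} alpha_i

def q_sample(x0, t, noise):
    """Sample x_t directly from x_0 for a batch of timesteps t (Eq. (7))."""
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

x0 = torch.randn(2, 3, 64, 64)               # toy batch of "images"
t = torch.randint(0, T, (2,))
x_t = q_sample(x0, t, torch.randn_like(x0))
```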

Reverse diffusion is the image generation process. The Gaussian noise \(x_T \sim {\mathcal {N}}(0, {{\textbf {I}}})\) is taken as input, and the true sample is inferred and reconstructed by iteratively sampling from the distribution \(q(x_{t-1} \vert x_t)\), i.e.

$$\begin{aligned} x_{t-1} = \frac{1}{\sqrt{\alpha _t}}\left( x_t - \frac{1 - \alpha _t}{\sqrt{1 - {\bar{\alpha }}_t}} f_\theta (x_t, t)\right) \end{aligned}$$
(8)

where \(f_\theta (x_t, t)\) is a function used to predict the noise \(\epsilon \) added in the forward diffusion. This is mainly because it is difficult to infer the true image distribution directly from the random noise \(x_T\). In other words, the objective of the diffusion model is to minimise the difference between the predicted noise and the noise actually added, i.e.

$$\begin{aligned} {\mathcal {L}} = \vert \vert \epsilon - f_\theta (x_t, t) \vert \vert ^2 . \end{aligned}$$
(9)
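
A minimal sketch of one reverse step (Eq. (8)) and of the noise-prediction objective (Eq. (9)) is shown below; it reuses the schedule and `q_sample` from the previous sketch, `model` is a hypothetical noise-prediction network, and the sampling noise term \(\sigma _t z\) is omitted for brevity.

```python
# Sketch of reverse diffusion (Eq. (8)) and the noise-prediction loss (Eq. (9)).
import torch
import torch.nn.functional as F

@torch.no_grad()
def p_sample_mean(model, x_t, t):
    """Mean estimate of x_{t-1} from x_t for a scalar timestep t (Eq. (8))."""
    eps_pred = model(x_t, t)
    return (x_t - (1.0 - alphas[t]) / (1.0 - alpha_bars[t]).sqrt() * eps_pred) / alphas[t].sqrt()

def diffusion_loss(model, x0):
    """Noise-prediction objective corresponding to Eq. (9)."""
    t = torch.randint(0, T, (x0.size(0),))
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)
    return F.mse_loss(model(x_t, t), noise)
```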

In contrast to Imagen, we focus on improving the super-resolution diffusion models. We introduce a new UNet variant to our super-resolution diffusion models, called Swinv2-UNet. Starting from the original Swin-Unet [63], the Swin Transformer block is replaced with the Swin Transformer v2 block; the complete structure is shown in Fig. 3.

A distinctive feature of Swinv2-Unet compared to Swin-Unet is the replacement of the \(dot({{\textbf {K}}}, {{\textbf {Q}}})\) operation with cosine normalisation [68] in the attention part, which makes the attention output more stable. Given two vectors, Q and K, the cosine normalisation could be expressed as follows:

$$\begin{aligned} Cosine({{\textbf {Q}}}, {{\textbf {K}}}) = \frac{\sum _i (q^{i} k^{i})}{\sqrt{\sum _i (q^i)^2} \sqrt{\sum _i (k^i)^2}}. \end{aligned}$$
(10)

The DBlock and UBlock of Swinv2-UNet consist of Swin Transformer v2 blocks, each comprising LayerNorm (LN) layers, a multi-headed self-attention module, residual connections and a 2-layer MLP with GELU nonlinearity. The Swin Transformer v2 block can be represented as follows:

$$\begin{aligned} {\hat{z}}^{l+1}&= {\text {LN}}({\text {Attn}}(z^l)) + z^l \end{aligned}$$
(11)
$$\begin{aligned} z^{l+1}&= {\text {MLP}}({\text {LN}}({\hat{z}}^{l+1})) + {\hat{z}}^{l+1}, \end{aligned}$$
(12)

where \(z^l\) and \(z^{l+1}\) denote the input and output of the Transformer v2 block, respectively. \({\hat{z}} ^ {l+1}\) is an intermediate variable. + denotes the residual connection or skip connection.

The attention of Swinv2 is expressed as follows:

$$\begin{aligned} Attn({{\textbf {Q}}}, {{\textbf {K}}}, {{\textbf {V}}}) = SoftMax \left( \frac{Cosine({{\textbf {Q}}}, {{\textbf {K}}})}{\tau } + B\right) {{\textbf {V}}}, \end{aligned}$$
(13)

where \({{\textbf {Q}}}\), \({{\textbf {K}}}\), and \({{\textbf {V}}}\) denote the matrix of query, key and value, respectively. Cosine() refers to a function that calculates the scaled cosine similarity of \({{\textbf {Q}}}\) and \({{\textbf {K}}}\). \(\tau \) denotes a learnable scalar, usually greater than 0.01. B is a matrix of relative position bias.
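The scaled cosine attention of Eq. (13) and the block of Eqs. (11)-(12) can be sketched as follows; window partitioning, multi-head splitting and the relative position bias table of the real Swin Transformer v2 block are omitted, and layer sizes are illustrative.

```python
# Sketch of scaled cosine attention (Eq. (13)) and a simplified Swin v2 block (Eqs. (11)-(12)).
import torch
import torch.nn as nn
import torch.nn.functional as F

def cosine_attention(q, k, v, tau, bias):
    # q, k, v: (B, N, D); bias: (N, N) relative position bias.
    sim = F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1)  # Cosine(Q, K)
    attn = torch.softmax(sim / tau + bias, dim=-1)
    return attn @ v

class SwinV2Block(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.tau = nn.Parameter(torch.tensor(0.1))        # learnable scalar, kept > 0.01
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, z, bias):
        q, k, v = self.qkv(z).chunk(3, dim=-1)
        attn_out = cosine_attention(q, k, v, self.tau.clamp(min=0.01), bias)
        z_hat = self.norm1(attn_out) + z                  # Eq. (11), residual post-norm
        return self.norm2(self.mlp(z_hat)) + z_hat        # Eq. (12)

# Toy usage: 64 window tokens of width 96.
z = torch.randn(1, 64, 96)
out = SwinV2Block(96)(z, torch.zeros(64, 64))
```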

Fig. 3

UNet architecture of the super-resolution sub-module. The architecture includes an encoder (downsampling), bottleneck, and decoder (upsampling). Skip connections are used between the encoder and decoder. All components are built based on the Swin Transformer v2 block

Fig. 4

Swinv2-Unet DBlock

Figure 4 illustrates the network structure of the Swinv2-Unet DBlock, which is the basic component of the downsampling path in the encoder–decoder structure of the UNet. Firstly, the DBlock combines the pooled text embeddings, object embeddings and relation embeddings into a conditional embedding that is input to the cross-attention layer. Next, it is followed by Swin Transformer v2 blocks, which perform feature extraction (num_block-1) times.

Fig. 5

Swinv2-Unet UBlock

Figure 5 shows the network structure of the Swinv2-Unet UBlock, which is the basic component of the upsampling path in the UNet encoder–decoder. The inputs to the UBlock include the output of the previous UBlock and that of the corresponding DBlock; the DBlock and UBlock are connected using skip connections [57]. Subsequently, the conditional embeddings are also introduced through the cross-attention layer. Similar to the DBlock, this layer is followed by Swin Transformer v2 blocks, which perform feature extraction (num_block-1) times.
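
The conditioning mechanism shared by the DBlock and UBlock can be sketched as follows, reusing the `SwinV2Block` from the previous sketch; the real blocks additionally perform windowed attention, patch merging/expanding and skip-connection handling, so this is only a schematic of the cross-attention conditioning.

```python
# Sketch of the shared DBlock/UBlock skeleton: cross-attention on the conditional
# embeddings, followed by (num_block - 1) Swin Transformer v2 blocks.
import torch
import torch.nn as nn

class ConditionedBlock(nn.Module):
    def __init__(self, dim, num_block):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.blocks = nn.ModuleList(SwinV2Block(dim) for _ in range(num_block - 1))

    def forward(self, x, cond, bias):
        # x: (B, N, dim) image tokens; cond: (B, L, dim) text/object/relation embeddings.
        x = x + self.cross_attn(query=x, key=cond, value=cond, need_weights=False)[0]
        for blk in self.blocks:
            x = blk(x, bias)
        return x

# Toy usage with the shapes from the earlier sketches.
block = ConditionedBlock(dim=96, num_block=3)
out = block(torch.randn(1, 64, 96), torch.randn(1, 87, 96), torch.zeros(64, 64))
```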

The encoder is a stack of DBlocks and Patch Merging layers. In the encoder, images are fed into five consecutive DBlocks for learning, where the feature dimension and resolution are maintained. Meanwhile, Patch Merging merges tokens and increases the feature dimension to four times the original; a linear layer is then applied to reduce the feature dimension to twice the original. This process is repeated four times in the encoder.

Similar to UNet, skip connections are used to integrate the multi-scale features of the encoder with the upsampled features. We connect shallow and deep features to minimise the loss of spatial information caused by downsampling. A linear layer then keeps the dimensionality of the concatenated features the same as that of the upsampled features.

The decoder is symmetric to the encoder. For this reason, unlike the Patch Merging used in the encoder, we use Patch Expanding in the decoder to upsample the extracted features. Patch Expanding reshapes the feature map into a higher-resolution feature map (2\(\times \) upsampling) and accordingly reduces the feature dimension to half of the original.
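
A minimal sketch of Patch Expanding is shown below: a linear layer doubles the channel dimension and the result is reshaped into a 2\(\times \) upsampled map with half the original channels, mirroring the construction used in Swin-Unet [63]; the exact implementation in Swinv2-Unet may differ.

```python
# Sketch of Patch Expanding: 2x spatial upsampling with the channel dimension halved.
import torch
import torch.nn as nn

class PatchExpanding(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.expand = nn.Linear(dim, 2 * dim)

    def forward(self, x):
        # x: (B, H, W, C) token map.
        B, H, W, C = x.shape
        x = self.expand(x)                                # (B, H, W, 2C)
        x = x.view(B, H, W, 2, 2, C // 2)                 # split 2C into a 2x2 patch of C/2
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, 2 * H, 2 * W, C // 2)
        return x

x = torch.randn(1, 8, 8, 96)
print(PatchExpanding(96)(x).shape)  # torch.Size([1, 16, 16, 48])
```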

4 Experiments

In this section, we perform extensive experiments to evaluate the proposed Swinv2-Imagen model by using the MSCOCO, CUB and Multi-modalCelebA-HQ (MM CelebA-HQ) datasets. Firstly, a brief description of the datasets is given. Secondly, we compare the performance of the Swinv2-Imagen model with state-of-the-art generative models. Finally, we conduct ablation experiments to compare the contributions of each module.

4.1 Setup

4.1.1 Datasets

The Microsoft Common Objects in Context 2014 (MS COCO-2014) [69], the Caltech-UCSD Birds-200-2011 (CUB-200-2011) [70] and MM CelebA-HQ [20] datasets are utilised in this research. The three datasets cover both simple (CUB) and complex (MSCOCO) cases. The MM CelebA-HQ dataset is used mainly because most generative models, such as CogView and Craiyon, produce distorted and less realistic faces.

  • MSCOCO was released in 2014. It is a collection of 164K images, which have been partitioned into the training set (82K), validation set (41K) and testing set (41K). The dataset is complex because most of the images possess at least two objects.

  • CUB contains 12K bird images of 200 subcategories, 6K for training and 6K for testing. It is a simple dataset, having only one object per image.

  • MM CelebA-HQ is a large-scale face image dataset. It is a collection of 30K high-resolution face images. The dataset is used widely to train and evaluate algorithms for text-image generation and text-guided image manipulation.

4.1.2 Evaluation metrics

We adopt the Fréchet Inception Distance (FID) [71] and the Inception Score (IS) [8] as evaluation metrics. Both are acknowledged as standard metrics for evaluating image generation models. Specifically, IS examines both the clarity and the diversity of the resulting images; the higher the IS, the better the quality of the generated images. FID measures the distance between the feature distributions of the generated images and the real images; the smaller the distance, the better the generated images.
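
For reference, FID compares the statistics of Inception-v3 features extracted from the real and generated image sets:

$$\begin{aligned} \text {FID} = \Vert \mu _r - \mu _g \Vert _2^2 + {\text {Tr}}\left( \Sigma _r + \Sigma _g - 2\left( \Sigma _r \Sigma _g \right) ^{1/2}\right) , \end{aligned}$$

where \(\mu _r, \Sigma _r\) and \(\mu _g, \Sigma _g\) denote the mean and covariance of the Inception features of the real and generated images, respectively.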

4.1.3 Baselines

  • PCCM-GAN [72] (Photographic Text-to-Image Generation with Pyramid Contrastive Consistency Model) is a typical multi-stage generative model. Its main innovations include the introduction of stack attention and the lateral connection of the PCCM. The two modules enhance the generative model to simultaneously extract semantic information from both global and local aspects, ensuring that the generated images are semantically consistent.

  • DM-GAN [17] (Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis) is also a multi-stage generative model. It uses a memory module and a gate mechanism in the image refinement process. The aim is to re-extract important information from the image as an aid when the generated image is not as good as expected.

  • SDGAN [73] (Semantics Disentangling for Text-to-Image Generation) consists of two modules, i.e. Siamese and semantic conditioned batch normalisation, to extract high-level and low-level semantic features respectively.

  • CogView [74] is based on the Transformer architecture. Its input is a text-image pair. The text and image features are combined and passed to the GPT language model for autoregressive training.

  • GLIDE [25] is a large-scale image generation model based on diffusion models with 3.5 billion model parameters.

  • DALL-E 2 [5] is also based on diffusion models. One of its highlights is the use of a prior model built on diffusion models. Its inputs are also text and corresponding images. The text is first passed through the prior model to generate a corresponding image vector. The image is passed through the CLIP module, which also generates an image vector to supervise the output of the prior model.

  • LAFITE [75] is a variant of generative adversarial networks. It leverages the CLIP model to extract features from images and text, ensuring text-image consistency.

  • Imagen [6] is a text-to-image synthesis model based on diffusion models. It passes text through a large pre-trained T5 language model and generates high-fidelity images through cascaded diffusion model blocks.

Fig. 6

FID and IS on MSCOCO. Smaller FID is better, larger IS is better. A value of 0 indicates that the corresponding model does not use this evaluation metric

4.1.4 Training parameters

We apply an Imagen-like training strategy, i.e. training the base model first and then the two super-resolution models. The Adam optimiser is adopted with a learning rate of 1e−4. We use 10,000 linear warm-up steps, a batch size of 8 and 1000 training epochs. The loss function is the Mean Squared Error (MSE), formulated as follows.

$$\begin{aligned} MSE(I, K) = \frac{1}{M \times N} \sum _{i=0}^{M-1} \sum _{j=0}^{N-1} [I(i, j) - K(i, j)]^2, \end{aligned}$$
(14)

where M and N denote the height and width (in pixels) of the real image I and the generated image K. A smaller MSE implies that the generated image is closer to the real image.
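
A schematic sketch of this training configuration (Adam with a learning rate of 1e−4, 10,000 linear warm-up steps, batch size 8, MSE objective of Eq. (14)) is given below; the model and data are dummies standing in for Swinv2-Imagen and the real data loader.

```python
# Schematic training step: Adam, linear warm-up, and the pixel-wise MSE of Eq. (14).
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, kernel_size=3, padding=1)           # dummy stand-in network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-3, end_factor=1.0, total_iters=10_000)

for step in range(100):                                     # a few dummy steps
    real = torch.rand(8, 3, 64, 64)                         # batch size 8
    generated = model(torch.rand(8, 3, 64, 64))
    loss = torch.mean((real - generated) ** 2)              # Eq. (14)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    warmup.step()
```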

4.2 Experimental results

In this subsection, we evaluate the proposed model by comparing it against a few state-of-the-art generative models.

4.2.1 Performance evaluation

Table 1 presents the results of the quantitative comparison. The proposed model is compared against 10 popular generative models, including GAN-based and diffusion-based models. It is evident that the proposed Swinv2-Imagen model outperforms the baselines on all three datasets. In particular, on the MSCOCO dataset, Swinv2-Imagen significantly outperforms the GAN-based generative models and slightly surpasses Imagen, achieving an FID of 7.21. It can be seen from Fig. 6 that our model achieves the best result in terms of FID. However, in terms of IS, our model scores lower than SDGAN and LAFITE. One possible reason is that IS is not very robust when evaluating classes that differ significantly from those in ImageNet [78] and is more sensitive to data perturbations. This is also the main reason why this metric is not widely used by diffusion-based generation methods, such as DALL-E 2 and Imagen.

Table 1 Experimental results of varied models for Text-To-Image synthesis

4.2.2 Qualitative analysis

Figure 7 shows examples of images generated by our proposed model on MSCOCO, CUB and MM CelebA-HQ. It can be seen that our model understands the text very well. For example, given the text input ‘Food cooks in a pot on a stove in a kitchen’, the resulting picture not only contains the food, the stove and the pot but also places these objects in the correct locations. More importantly, based on the word ‘kitchen’, the model also generates other common kitchen objects, such as spoons and storage shelves. This shows that our model understands the text accurately and comprehensively.

Figure 8 illustrates the qualitative comparison between the proposed model and GAN-based and diffusion-based generative models, i.e. DM-GAN [17], DF-GAN [79] and VQ-Diffusion [80]. Compared to the diffusion-based models, the GAN-based models lose many detailed features in the generated results. For example, the bird’s eyes are very blurred in the third image in the first row and the second image in the second row. One possible reason is that the diffusion model improves generalisation by iterating over the image several times, with each iteration perturbing the image slightly (by adding random noise to the image), whereas GANs usually rely on continuous optimisation over large amounts of data in order to generate high-quality images. Compared to VQ-Diffusion, which is also a diffusion-based model, our results are more realistic and contain more fine-grained features. In particular, the blue birds in the third column generated by our method are better than those generated by VQ-Diffusion. One possible reason is that VQ-Diffusion is a typical two-stage generative model [80]. The first stage is vector quantisation, where the original image is represented as a set of high-dimensional vectors and mapped into a discretised space using a vector quantisation model. The second stage is the diffusion computation, where the discretised vector sequence is used as the initial state and iterated over several times by a diffusion model; the resulting discretised vector sequence is finally transformed into an image. The vectorisation results of the first stage directly affect the quality of the generated results. By comparison, our proposed model is designed as an end-to-end architecture that optimises the entire generation process holistically; it eliminates the need for intermediate stages, facilitating better optimisation and faster convergence. In addition, our model also outperforms other generation models in terms of text-image alignment. The text description of the first column requires the bird’s breast to be white, but this feature appears grey in the results of other models, especially DM-GAN. In summary, the comparison with other GAN-based and diffusion-based generation models shows that our model synthesises fine-grained and detailed images on CUB.

Figure 9 presents the qualitative comparison between our model and LAFITE [75] on MSCOCO. Intuitively, our results are more colourful and saturated. For example, in the first and fourth columns, our bus and city street include more colours and the images are brighter. Furthermore, our model is also better at text understanding. In the third column, the room should include two colours, white and beige; however, in the LAFITE result there are only white walls and a white cupboard, with no trace of beige. In contrast, our generated room contains the two colours required by the text, and the overall layout is more realistic. Finally, our model is also better in terms of image quality. The tops of the bus and the room generated by LAFITE are distorted, and the results are generally blurred. Our model has a significant advantage over LAFITE in generating objects such as buildings, buses and trees. Although the two models are very close in terms of FID and IS in the quantitative analysis in Table 1, our model is superior in terms of the quality of the generated images.

Fig. 7

Generated examples by the proposed model on COCO, CUB and MM CelebA-HQ. The resulting images contain not only the objects requested in the sentence but also additional objects based on special words (e.g. kitchen). For example, the first image includes food, a pot and a stove (required in the text) as well as some spoons and rice cookers (common kitchen items)

Fig. 8

Comparison with GAN-based and diffusion models on CUB-200 dataset. For each method, we present three captions and the corresponding generated images. Our resulting images are more detailed in colour and higher in quality than the popular GAN models

Fig. 9

Comparison with LAFITE on MSCOCO dataset

4.3 Ablation study

In order to improve the performance of the generation model, we introduce two new modules to Imagen, i.e. the scene graph and Swinv2-Unet. These are the main innovations of this article. In this subsection, two ablation experiments are conducted on MSCOCO to investigate the contributions of the scene graph module and Swinv2-Unet, respectively. The choice to experiment on MSCOCO is based on two considerations. Firstly, each image in MSCOCO contains multiple objects, making it more complex than the CUB dataset; theoretically, this allows a better evaluation of the effect of each module. Secondly, our main baseline, Imagen, reports results only on MSCOCO. Experiment 1 evaluates the contribution of the scene graph module: we add only the scene graph while the diffusion models are still built on Efficient-Unet; this variant is called Imagen_sg. Experiment 2 evaluates the performance of Swinv2-Unet: we construct a new diffusion model using our improved Swinv2-Unet and replace Imagen’s super-resolution diffusion models with it; this variant is called Swinv2-Imagen_su. The result of Experiment 1 supports our conjecture, stated in the introduction, that merely using a T5 encoder does not sufficiently learn the semantic information of the text. Experiment 2 shows that the diffusion model constructed with the Transformer outperforms the CNN-constructed diffusion model in the image generation task. It can also be seen from Table 2 that the FIDs of Imagen_sg and Swinv2-Imagen_su are very close, which intuitively reveals that the two submodules contribute almost equally to the FID.

Table 2 Ablation study of Swinv2-Imagen model

5 Conclusion and future work

In this paper, we propose a novel text-to-image synthesis model based on Imagen, called Swinv2-Imagen, which integrates the Transformer and the scene graph. The improved sliding window-based hierarchical visual Transformer (Swin Transformer v2) avoids the local view of CNN convolution operations and improves the efficiency and effectiveness of applying the Transformer to image generation. In addition, we introduce a scene graph in the text processing stage: feature vectors of entities and relationships are extracted from the scene graph and incorporated into the diffusion model. These additional feature vectors improve the quality of the generated images. With these novel components, Swinv2-Imagen produces 1024 \(\times \) 1024 samples with unprecedented fidelity.

Furthermore, it has recently been noted that autoregressive models can also produce diverse and high-quality images from text. Thus, we plan to explore combining autoregressive and diffusion models for image generation and to determine the best way to combine their strengths.