1 Introduction

In the current landscape of text-to-video generation with deep learning, GAN-based methods have garnered widespread recognition. These algorithms have shown impressive results in various tasks, including stock price prediction [1], intrusion detection [2], spam detection [3], data augmentation [4], and many more. The generator network of a GAN gives it the ability to generate visual content automatically. This network learns a probabilistic density estimate of the training data distribution, allowing it to generate samples within the boundaries of the data it was trained on. Through adversarial learning between the generator and discriminator networks, GANs can produce photo-realistic samples that are virtually indistinguishable from real ones. In computer vision, GANs find applications in diverse fields, such as image super-resolution [5, 6], image deblurring [7], human face synthesis [8, 9], high-resolution human faces [10, 11], face sketch synthesis [12], and image in-painting [13, 14]. Their versatility and capability to handle complex visual data make GANs a valuable asset in a wide range of tasks. Further advancements in this field have led to the expansion of unconditioned GANs towards conditional generative models. In the original GAN [15], there is no control over the type of samples to be generated: it uses the training data distribution to create data samples without regard to category. Furthermore, during sampling, the resulting data samples might not fully reflect all the variations present in the training set. In contrast, conditional GANs (CGANs) [16] introduced the concept of supplying additional conditioning information to the model to exert control over the generated output. This supplementary information can take the form of class labels, text, sketches, and more. By incorporating this conditioning, image generation becomes intention-based, allowing for the creation of varied data samples. Conditional image generation has become increasingly popular, particularly in scenarios that involve latent embeddings. Notable applications include text-to-image generation [17,18,19], image translation [20, 21], and manipulation based on linguistic instructions [22]. These techniques empower the model to produce images that align with specific conditions, adding a new dimension of control and flexibility to the generative process.

GAN-based methods have also demonstrated strong performance in video synthesis. Since videos are essentially collections of images, GAN-based video synthesis is a natural extension. However, the fact that a video contains many images rather than just one presents the main challenge in video synthesis. Video synthesis is a multi-modal data generation problem that includes elements of motion, speed, sound, and appearance. The temporal aspect and the interdependence between frames further increase the complexity of video generation. This connectedness between frames is crucial for producing high-quality videos: merely ensuring that each individual frame looks realistic is insufficient; the temporal dependency between consecutive frames must also be preserved. Despite the increased complexity of video GANs compared to image GANs, they have been utilized in various research studies involving video datasets [23]. In recent years, several generative adversarial networks have been developed specifically for video generation. The primary objective behind these advancements is to generate high-resolution, indistinguishably natural videos with extended temporal range. However, while there has been significant research on producing conditional output for various inputs in the image domain, conditional video generation, especially from diverse inputs such as audio signals, textual data, semantic maps, images, or videos, has received less attention and requires further investigation [24]. One of the most complex tasks in conditional video generation is generating videos based on textual descriptions, where semantic matching between the provided text and the synthesized video, visual frame quality, and frame coherence all need careful consideration. It is an extremely challenging yet crucial area of research with numerous potential applications, including multimedia special effects, synthetic data generation for reinforcement learning systems, and domain adaptation, among others. In this study, we focus on the less explored domain of generating videos from text and propose the Video-Text Matcher Generative Adversarial Network (VTM-GAN). Our approach aims to tackle the challenges posed by text-conditioned video synthesis and contributes to advancing this important research area.

2 Related work

GANs have proven to be highly effective in generating high-quality images that closely align with given text descriptions. Unlike traditional image generation methods that rely solely on noise, text-to-image (T2I) generation operates by conditioning on concatenated noise and text conditions as input. This approach allows text descriptions to play a guiding role in the generation process, resulting in visually appealing conditional image outputs. The use of text descriptions, as opposed to mere labels, allows for the incorporation of a wealth of semantic information about the depicted objects, their attributes, and spatial arrangements, enabling the portrayal of diverse and intricate scenarios with fine details.

Text-to-image GANs: Reed et al. [17] were the first to extend conditional GANs to produce convincing visuals corresponding to input texts. They demonstrated that their model could produce convincing visuals of birds and flowers from textual descriptions. Reed et al. [25] developed the “Generative Adversarial What-Where Network” (GAWWN), which generates images taking into consideration both “what” and “where” aspects: besides focusing on what is to be drawn, it also determines at which location the object is to be drawn. In order to improve the resolution of the produced samples and to retain the semantic relationship between the textual and visual data, Zhang et al. [18] introduced StackGAN, which consists of two stages, with stage 1 generating low-resolution images and stage 2 generating high-resolution images. Xu et al. [19] introduced a layered GAN referred to as the Attentional Generative Adversarial Network (AttnGAN) for generating fine-grained, higher-quality images using both word-level and sentence-level information. As the name suggests, it is an attention-driven approach in which more weight is given to the important words in a sentence. The demand for generating images at various scales, ensuring semantic consistency, and achieving high resolution has driven the adoption of complex stacked generative adversarial networks with multiple generators and discriminators. However, the increasing complexity of these networks has led to slower and less efficient GAN training. To address these challenges, Tao et al. [26] proposed the Deep Fusion Generative Adversarial Network (DF-GAN). Unlike other approaches, DF-GAN utilizes a single generator-discriminator model and introduces a novel regularization method called “Matching-Aware zero-centered Gradient Penalty” to generate images from text without the need for additional networks. This architecture aims to streamline the training process while still achieving high-quality image synthesis from textual descriptions.

Video-GANs: While both conditional and unconditional image generation are well-studied problems, different studies have also leveraged GANs for video generation. Vondrick et al. [27] were the pioneers in utilizing Generative Adversarial Networks (GANs) for video generation. The authors introduced video-GAN (VGAN), which emphasizes the importance of capturing scene dynamics by splitting the scene into dynamic and static components. Saito et al. [28] introduced Temporal Generative Adversarial Nets (TGAN), an approach capable of generating unlabeled videos by learning their semantic representations. Unlike its predecessor, VGAN, TGAN decouples temporal-level features from frame-level features rather than assuming a division into background and foreground streams. This model is equipped with a generator for image generation and a generator for temporal coherence. Tulyakov et al. [29] introduced a video generation model called Motion and Content decomposed Generative Adversarial Network (MoCoGAN). This model is designed to generate videos from a noise vector, similar to TGAN, and it also utilizes a generator for image generation and a generator for ensuring temporal coherence. However, what sets MoCoGAN apart is that its generator for temporal coherence is constructed using a Recurrent Neural Network (RNN). Saito et al. [30] further extended this line of work and introduced Temporal GAN-version 2 (TGAN-V2). TGAN-V2 employs several sub-discriminators within the discriminator, as well as multiple sub-generators within the generator, and through extensive training has demonstrated superior video quality.

Dual video discriminator GAN (DVD-GAN), introduced by Clark et al. [31], extends the capabilities of BigGAN for generating videos. It adopts an efficient spatial–temporal division in its discriminator. Unlike previous approaches, the generator of DVD-GAN does not depend on predefined assumptions about static, dynamic and temporal features. Ohnishi et al. [32] presented a hierarchical method for generating appearance- and motion-realistic videos, consisting of a distinct FlowGAN and TextureGAN. The first GAN is responsible for generating optical flow, capturing edge and motion information, whereas TextureGAN adds texture to the generated optical flow. While this model successfully produces motion-realistic videos, the output lacks the visual clarity and finer details seen in higher-resolution videos. To improve the understanding of scene dynamics, Nakahira et al. [33] introduced Depth Conditional Video GAN (DCVGAN), which incorporates three-dimensional geometrical information, emphasizing the importance of optical information along with 3D geometry and color details. This approach surpasses MoCoGAN in generating realistic and diverse samples. To address instability and non-convergence issues, especially when generating higher-resolution samples, Acharya et al. [34] introduced the idea of generating videos in a progressive fashion. This technique uses a coarse-to-fine approach, starting with smaller networks and gradually adding new layers to both the generator and the discriminator to produce samples of improved resolution. Munoz et al. [35] conceptualized Temporal Shift GAN, which replaces the three-dimensional generator network with a novel two-dimensional generator for video generation. This approach maintains temporal coherence between the frames as well as the relationship between different regions. For generating high-resolution videos with computational efficiency, Tian et al. [36] introduced MoCoGAN-High Definition video synthesis (MoCoGAN-HD). The challenge of generating videos is formulated as finding a motion trajectory, via a motion generator, in the latent space of a pre-trained image generator. The motion generator thus operates on latent codes and obtains representations that are used to generate temporally coherent frames. In a recent development, Hong et al. [37] introduced Arrow GAN, a novel approach that utilizes an arrow-of-time discriminator (Arrow-D) to impart a sense of time to the generated content. Arrow-D can autonomously discern the direction of time without the need for explicit supervision and serves as a guiding force for the generators to produce more realistic and temporally consistent results.

Text-to-video GANs: Like generating images based on conditions, conditional GANs have also been used in the video domain to produce videos depending on conditions. Compared to other forms of conditional video generation, research on text-conditioned video generation is scarce; it is a newer and more challenging problem. Pan et al. [38] were the first to aim at producing videos based on text descriptions and proposed Temporal GANs conditioned on Captions (TGANs-C). The generator network receives the concatenation of a latent noise vector and an embedded textual description, from which it creates a frame sequence. The discriminator in TGANs-C not only performs its primary task of distinguishing between real and generated data samples but also serves an additional role. Integrating a GAN and a Variational Autoencoder (VAE) network, Li et al. [24] used a hybrid approach for generating videos. By combining the strengths of GANs and VAEs, the authors designed a model that could effectively extract features from videos, taking into account both the static visual characteristics (color and structure) and the dynamic aspects of the content. This approach allowed for more comprehensive and informative representations of the videos, leading to improved video generation performance. Both [24] and [38] generated videos of fixed length and were trained on low-resolution datasets.

To strengthen the relationship between the input text and the generated video sample, Balaji et al. [39] developed the Text-Filter conditioning Generative Adversarial Network (TFGAN), which incorporates a multi-scale scheme and text conditioning to generate video frames based on given textual descriptions. One of the primary concerns they tackled was the restriction to videos of fixed length, which limits the ability to capture diverse and dynamic scenes. Deng et al. [40] introduced the “Introspective Recurrent Convolutional GAN (IRC-GAN)” approach, which incorporates a recurrent generator to produce high-quality frames. To ensure temporal coherence, this generator combines LSTM cells with 2D transposed-convolutional networks, enabling it to generate new frames based on previous ones. The proposed model also introduces a mutual-information introspection discriminator, which leverages information from generated samples to specifically evaluate the semantic relation between video samples and their corresponding descriptions. Li et al. [41] introduced StoryGAN, a model rooted in the sequential conditional GAN framework. The primary goal of StoryGAN is to visualize stories by generating a series of images: for each sentence in the story, the model produces one corresponding image, allowing the progression of the narrative to be visually represented through a sequence of generated images. Yu et al. [42] introduced a recurrent deconvolutional generative adversarial network (RD-GAN) for conditional video generation. In this model, skip-thoughts are employed to represent text as latent vectors, which serve as input for the generator to produce videos frame by frame. RD-GAN effectively addresses the problem of visual discontinuity, a common challenge faced by many video generation models that results in unrealistic output. However, one limitation of RD-GAN is its instability when trained with too many frames. Additionally, the model’s reliance solely on feature extraction restricts its ability to generate diverse videos, and it faces challenges in generating sharp and longer videos. In contrast, Kim et al. [43] proposed TiVGAN, a text-to-video generation model that produces full-length videos. Rather than seeking a mapping between the text and all video frames as a whole, TiVGAN is first trained with respect to a single frame and progressively evolves to generate a video clip of the desired length. Experimental results show that TiVGAN not only accurately generates videos based on the given descriptions but also produces sharper, higher-quality results, addressing some of the shortcomings of the RD-GAN model.

Attention-GANs: Nowadays, the attention mechanism has become a crucial component of effective sequence modeling and transduction models in many applications. It is widely used in both the natural language processing and computer vision domains. Alami et al. [44] used the attention mechanism in image-to-image translation and significantly enhanced image quality. Their proposed algorithm takes advantage of the discriminator’s capability to learn accurate attention maps without any additional supervision. Chen et al. [45] suggested an attention-GAN for the task of object transfiguration, which is a part of image-to-image translation. The attention network is responsible for predicting the regions of interest in the input image, identifying specific areas that are relevant and important for the transformation process. The transformation network, in turn, transforms the object from one class to another, taking the input image and focusing on the regions of interest indicated by the attention network. Zhang et al. [46] introduced a self-attention mechanism into convolutional GANs and proposed the Self-Attention Generative Adversarial Network (SAGAN). Attention GANs have been successfully introduced in other tasks including aerial scene classification [47], video game generation [48], data augmentation on medical images [49], generation of high-quality long time-series samples [50], and text-to-image generation [19, 51]. Recently, Chen et al. [52] proposed Bottom-Up GAN (BoGAN) for generating videos from text, which utilizes an attention mechanism and introduces a region-level loss, enabling it to focus on specific regions within the video and produce fine-grained details as specified in the input text. This approach results in the successful synthesis of videos that closely align with the provided textual descriptions, enhancing the realism and quality of the generated videos. Jiang et al. [53] succeeded in synthesizing images at 256 × 256 resolution, utilizing pure transformer-based architectures and a GAN entirely free of convolutions. Lee et al. [54] presented ViTGAN, which uses Vision Transformers [55] in GANs, and proposed critical strategies for ensuring training stability and enhancing convergence. STrans-GAN, which also employs Transformers in GANs, was proposed by Xu et al. [56], providing competitive results in both unconditional and conditional image generation. Other transformer-based GANs employed for high-resolution image synthesis include HiT [57] and Swin transformers [58]. The application of transformers, however, remains largely unexplored for tasks such as generating images or videos from textual descriptions. Recently, Naveen et al. [59] used a transformer to enhance AttnGAN for generating images from text data. To our knowledge, the proposed Video-Text Matcher Generative Adversarial Network (VTM-GAN) is the first to leverage a transformer within a text-to-video GAN, enabling it to generate fine-grained, high-quality videos that follow the text.

3 Proposed method

The proposed VTM-GAN architecture comprises two main components, as illustrated in Fig. 1: the Video-Text Matcher (VTM) and the Text-to-Video (T2V)-GAN model.

Fig. 1 The Proposed Model

3.1 Video-text matcher (VTM)

The Video-Text Matcher model is composed of two main components: a Transformer [60] and a ResNet-101 [61]. It is trained using the parameters from Contrastive Language-Image Pre-training (CLIP) [62]. CLIP is a training technique that jointly trains an image encoder and a text encoder through contrastive learning to obtain a compact embedding space between image and text pairs. Contrastive learning falls under the category of self-supervised learning, where augmentations are applied to the same input to create comparable representations. The contrastive objective in CLIP is inspired by previous studies on learning contrastive representations from images, which have shown its effectiveness compared to predictive objectives. Given a batch of \(n\) (image, text) pairs, CLIP predicts which of the \(n\times n\) potential (image, text) pairings actually occurred. To learn a multi-modal embedding space, the image encoder is trained concurrently with the text encoder, allowing CLIP to maximize the cosine similarity between the image and text embeddings of the \(n\) actual pairs in the batch while minimizing it for the \({n}^{2}-n\) incorrect pairings. CLIP uses two different image encoding architectures, ResNet-50 and the recently released Vision Transformer (ViT). The original ResNet version is modified with antialiased rect-2 blur pooling and the ResNet-D enhancements from [63]. Additionally, the global average pooling layer is replaced with an attention pooling mechanism, implemented as a single layer of multi-head Query-Key-Value (QKV) attention in which the query is conditioned on the global average-pooled representation of the image. CLIP also alters ViT by incorporating an extra layer normalization on the combined patch and position embedding prior to the transformer. The text encoder is a Transformer modified in accordance with the specification of [64]; it has 63 M parameters and is a 12-layer, 512-wide model containing 8 attention heads. The vocabulary size of the lower-case byte pair encoding (BPE) text representation employed by the transformer is 49,152 words. To enhance computational efficiency, the maximum sequence length is set to 76. The text sequence is delimited by start-of-sentence (SOS) and end-of-sentence (EOS) tokens, and the feature representation of the text is obtained from the activation of the transformer’s top layer at the EOS token. This text feature representation undergoes layer normalization and is then linearly projected into the multi-modal embedding space.
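To make the contrastive objective described above concrete, the following is a minimal PyTorch sketch of a CLIP-style symmetric contrastive loss over a batch of paired embeddings. It assumes the embeddings have already been projected into the shared multi-modal space; the function name and temperature value are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of n (image, text) pairs.

    image_emb, text_emb: (n, d) tensors in the shared multi-modal space; the
    i-th rows belong to the same pair, the other n^2 - n combinations act as
    negatives."""
    # Cosine similarity reduces to a dot product after L2 normalization.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (n, n) similarity matrix

    # The ground-truth pairing lies on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)       # match images to texts
    loss_t2i = F.cross_entropy(logits.t(), targets)   # match texts to images
    return (loss_i2t + loss_t2i) / 2
```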

The VTM architecture closely resembles that of CLIP in most aspects, but it diverges in a crucial way. While CLIP primarily captures image-level and sentence-level features, VTM goes a step further by capturing region features from videos and word-level vectors. This distinction allows VTM to compute a fine-granularity consistency loss, which CLIP cannot, since CLIP does not provide word features or image region features. To obtain the region features and word vectors, VTM introduces new neural network layers and is trained on a video-text matching task on our dataset. The video encoder and the text encoder of VTM are depicted in Figs. 2 and 3, respectively. The video encoder takes the output features of layer 3 as initial region features, which are then refined by a 1 × 1 convolutional layer to facilitate the word-region level loss calculation. In the text encoder, an MLP layer and Layer Normalization are used to calculate the word vectors from the token vectors produced by the Transformer. Using these components, VTM incorporates losses at two distinct levels: the word and region level, and the sentence and video level. The complete VTM block is presented in Fig. 4.
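As a rough sketch of how the two additional heads described above could be realized, the snippet below projects an intermediate ResNet-101 feature map to region features with a 1 × 1 convolution and maps transformer token outputs to word vectors with an MLP followed by Layer Normalization. The channel counts (1024 for the layer-3 output, a 512-wide token dimension, a 256-dimensional joint space) and the MLP shape are illustrative assumptions, not the paper’s verified configuration.

```python
import torch
import torch.nn as nn

class RegionFeatureHead(nn.Module):
    """Maps an intermediate ResNet-101 feature map (e.g. the layer-3 output)
    to region features in the joint space via a 1x1 convolution."""
    def __init__(self, in_channels=1024, embed_dim=256):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=1)

    def forward(self, feat_map):                 # (B, C, H, W)
        regions = self.proj(feat_map)            # (B, D, H, W)
        return regions.flatten(2)                # (B, D, N) with N = H*W sub-regions

class WordFeatureHead(nn.Module):
    """Maps per-token transformer outputs to word vectors via MLP + LayerNorm."""
    def __init__(self, token_dim=512, embed_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(token_dim, embed_dim), nn.GELU(),
                                 nn.Linear(embed_dim, embed_dim))
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, token_feats):              # (B, T, token_dim)
        return self.norm(self.mlp(token_feats))  # (B, T, D) word vectors
```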

Fig. 2 Video Encoder used in VTM

Fig. 3 Text Encoder used in VTM

Fig. 4 VTM Block

In this formulation, the word vectors extracted from the text encoder are represented by the matrix \(e\in {R}^{D\times T}\), where D is the dimension of each word feature and T is the total number of words; the feature vector of the \({i}^{th}\) word is the \({i}^{th}\) column, denoted by \({e}_{i}\). \(\overline{e}\in {R}^{D}\) stands for the global sentence vector. The local video feature \(\mathcal{F}\in {R}^{\overline{D}\times N}\) is taken from the ResNet-101 video encoder, where \(\overline{D}\) is the dimension of the local video features and N is the total number of sub-regions comprising the video; the \({i}^{th}\) column of \(\mathcal{F}\) is the feature vector of the \({i}^{th}\) sub-region. To bring the video features into a common semantic space with the text features, a perceptron layer \(P\) is included, which performs the mapping shown in Eq. (1).

$$S = P\mathcal{F}, \quad \overline{S} = P\overline{\mathcal{F}}$$
(1)

where \(S\in {R}^{D\times N}\), with \({S}_{i}\), the \({i}^{th}\) column, representing the feature vector of the \({i}^{th}\) sub-region of the video in the common space; \(\overline{S}\in {R}^{D}\) is the global feature vector of the entire video; and \(P\in {R}^{D\times \overline{D}}\), where D is the dimension of the common feature space of text embeddings and video features. To assess the similarity between each pair of words and video sub-regions, a similarity matrix is first calculated; it contains the similarity score for every possible pair of a word in the sentence and a video sub-region. This similarity matrix is computed using Eq. (2):

$$\upnu = {e}^{ T}\mathrm{ S}$$
(2)

where \(\upnu \in {R}^{T\times N}\), with \({\upnu }_{i,j}\) representing the dot-product similarity between the \({i}^{th}\) word of the sentence and the \({j}^{th}\) sub-region of the video. To make the similarity matrix more effective, it is normalized using Eq. (3):

$$ {\overline{\nu }}_{i,j} = \frac{{{\text{exp}}\left( {{\upnu }_{i,j} } \right)}}{{\sum\limits_{k = 0}^{T - 1} {{\text{exp}}\left( {{\upnu }_{k,j} } \right)} }} $$
(3)

Then, for each word (query), a region-context vector \({\mathrm{r}}_{i}\) is generated using an attention model. This vector represents the sub-regions of the video relative to the \({i}^{th}\) word of the sentence and is computed as a weighted sum of all regional visual vectors, as shown in Eq. (4):

$$ {\text{r}}_{i} = \mathop \sum \limits_{j = 0}^{N - 1} {\upmu }_{j} {\text{S}}_{j} \quad {\text{where}}\quad {\upmu }_{j} = \frac{{{\text{exp}}\left( {{\upbeta }_{1} {\overline{\nu }}_{i,j} } \right)}}{{\sum\limits_{k = 0}^{N - 1} {{\text{exp}}\left( {{\upbeta }_{1} {\overline{\nu }}_{i,k} } \right)} }} $$
(4)

where the factor \({\upbeta }_{1}\) determines how much attention is paid to the features of the relevant sub-regions when computing the region-context vector for a word. It acts as a weighting factor that controls the importance of the visual information from the video’s sub-regions in the region-context vector computation for each word in the sentence. After obtaining the region-context vector \({\mathrm{r}}_{i}\) for the \({i}^{th}\) word of the sentence, the correlation between the word and the video is established using the cosine similarity between \({\mathrm{r}}_{i}\) and \({\mathrm{e}}_{i}\), as shown in Eq. (5):

$$X\left({{\mathrm{r}}_{i},\mathrm{e}}_{i }\right)=\frac{{{\mathrm{r}}_{i}}^{T}{\mathrm{e}}_{i}}{\Vert {\mathrm{r}}_{i}\Vert \Vert {\mathrm{e}}_{i}\Vert }$$
(5)
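The word-region attention of Eqs. (2)–(5) can be summarized compactly in code. The sketch below is a minimal PyTorch rendering for a single video-sentence pair, assuming word features \(e\) and sub-region features \(S\) already live in the common space; the default \(\beta_1 = 4.0\) follows the setting reported in Sect. 4.2, and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def word_region_attention(e, S, beta1=4.0):
    """Word-to-region attention following Eqs. (2)-(5).

    e: (D, T) word features; S: (D, N) sub-region features in the common space.
    Returns region-context vectors r (D, T) and per-word cosine relevance (T,)."""
    nu = e.t() @ S                                # Eq. (2): (T, N) similarity matrix
    nu_bar = F.softmax(nu, dim=0)                 # Eq. (3): normalize over the T words
    mu = F.softmax(beta1 * nu_bar, dim=1)         # Eq. (4): attention over the N sub-regions
    r = S @ mu.t()                                # Eq. (4): weighted sum -> (D, T)
    relevance = F.cosine_similarity(r, e, dim=0)  # Eq. (5): one score per word
    return r, relevance
```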

Taking inspiration from the minimum classification error formulation in speech recognition, the attention-driven relational score between the whole video and the complete text is given by Eq. (6):

$$ X\left( {V,E} \right) = \log \left( {\mathop \sum \limits_{i = 1}^{T - 1} {\text{exp}}\left( {{\upbeta }_{2} X\left( {{\text{r}}_{i} ,{\text{e}}_{i} } \right)} \right)} \right)^{\frac{1}{{\upbeta }_{2} }} $$
(6)

where \({\upbeta }_{2}\) regulates the level of emphasis placed on the most relevant word-region pair. As \({\upbeta }_{2}\) approaches infinity (\({\upbeta }_{2}\to \infty\)), \(X\left(V,E\right)\) approaches \({\mathrm{max}}_{i=1}^{T-1} X\left({\mathrm{r}}_{i},{\mathrm{e}}_{i}\right)\). For a batch of video-sentence pairs \({\left\{\left({V}_{i},{E}_{i}\right) \right\}}_{i=1}^{M}\), the posterior probability of sentence \({\mathrm{E}}_{i}\) matching video \({\mathrm{V}}_{i}\) is determined by Eq. (7):

$$ M\left( {{\text{E}}_{i} {\text{|V}}_{i} } \right) = \frac{{{\text{exp}}\left( {{\upbeta }_{3} X\left( {{\text{V}}_{i} ,{\text{E}}_{i} } \right)} \right)}}{{\sum\limits_{j = 1}^{M} {{\text{exp}}\left( {{\upbeta }_{3} X\left( {{\text{V}}_{i} ,{\text{E}}_{j} } \right)} \right)} }} $$
(7)

where \({\upbeta }_{3}\) is a smoothing factor determined experimentally. In this batch of sentences, only sentence \({\mathrm{E}}_{i}\) corresponds to video \({\mathrm{V}}_{i}\); the other \(\mathrm{M}-1\) sentences are regarded as mismatched descriptions. The loss function is formulated as the negative logarithm of the posterior probability that the videos are matched with their associated text descriptions, as expressed in Eq. (8):

$$ L_{1}^{w} = - \mathop \sum \limits_{i = 1}^{M} \log M\left( {{\text{E}}_{i} {\text{|V}}_{i} } \right) $$
(8)

where ‘w’ denotes “word”. Symmetrically, another loss function is derived that mirrors the structure of Eq. (8). This loss addresses the complementary direction of the matching problem, ensuring a balanced evaluation of the correspondence between videos and their text descriptions. It is defined in Eq. (9):

$$ L_{2}^{w} = - \mathop \sum \limits_{i = 1}^{M} \log M\left( {{\text{V}}_{i} {\text{|E}}_{i} } \right) $$
(9)

where

$$ M\left( {{\text{V}}_{i} {\text{|E}}_{i} } \right) = \frac{{{\text{exp}}\left( {{\upbeta }_{3} X\left( {{\text{V}}_{i} ,{\text{E}}_{i} } \right)} \right)}}{{\sum\limits_{j = 1}^{M} {{\text{exp}}\left( {{\upbeta }_{3} X\left( {{\text{V}}_{j} ,{\text{E}}_{i} } \right)} \right)} }} $$
(10)

is the posterior probability that video \({\mathrm{V}}_{i}\) matches its corresponding sentence \({\mathrm{E}}_{i}\). Replacing Eq. (6) with Eq. (11):

$$X\left({V,\mathrm{E}}\right)=\frac{{\overline{S} }^{T}\overline{e} }{\Vert \overline{S }\Vert \Vert \overline{e}\Vert }$$
(11)

and substituting it into Eqs. (7), (8) and (9), two loss functions \({L}_{1}^{\mathcal{s}}\) and \({L}_{2}^{\mathcal{s}}\) are formulated using the global sentence vector \(\overline{e}\) and the global video vector \(\overline{S}\). Here \(\mathcal{s}\) denotes “sentence”. The VTM loss is finally described by Eqs. (12) and (13):

$${L}_{VTM}={{\lambda }_{1}L}_{VTM}^{w}+{\lambda }_{2}{L}_{VTM}^{\mathcal{s}}$$
(12)
$${L}_{VTM}={\lambda }_{1}\left({L}_{1}^{w}+{L}_{2}^{w}\right)+{\lambda }_{2}\left({L}_{1}^{\mathcal{s}}+{L}_{2}^{\mathcal{s}}\right)$$
(13)
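Putting Eqs. (6)–(13) together, the sketch below shows one plausible way to compute the bidirectional matching losses and combine them into the VTM loss. It assumes the pairwise score matrices (word/region level via Eq. (6), sentence/video level via Eq. (11)) have already been computed for a batch; \(\beta_2 = 5.0\), \(\beta_3 = 10.0\), \(\lambda_1 = 4.0\) and \(\lambda_2 = 1.0\) follow Sect. 4.2, while the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def sentence_video_score(r, e, beta2=5.0):
    """Eq. (6): pools per-word cosine relevances X(r_i, e_i) into one score;
    r, e are (D, T) region-context and word features for a single pair."""
    relevance = F.cosine_similarity(r, e, dim=0)            # Eq. (5), one value per word
    return torch.logsumexp(beta2 * relevance, dim=0) / beta2

def bidirectional_matching_loss(scores, beta3=10.0):
    """Eqs. (7)-(9): scores[i, j] = X(V_i, E_j) for a batch of M pairs;
    matched pairs sit on the diagonal."""
    logits = beta3 * scores
    targets = torch.arange(scores.size(0), device=scores.device)
    l1 = F.cross_entropy(logits, targets, reduction="sum")      # -sum_i log M(E_i | V_i)
    l2 = F.cross_entropy(logits.t(), targets, reduction="sum")  # -sum_i log M(V_i | E_i)
    return l1 + l2

def vtm_loss(word_scores, sent_scores, lambda1=4.0, lambda2=1.0):
    """Eq. (13): weighted sum of the word/region and sentence/video level losses."""
    return (lambda1 * bidirectional_matching_loss(word_scores)
            + lambda2 * bidirectional_matching_loss(sent_scores))
```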

3.2 Video-generator

Given as input a sentence \(\mathcal{s}\) concatenated with a noise vector \(z\) sampled from a normal distribution \((z\in {R}^{100})\), a generator network \(G\) is developed to produce a sequence of frames: \(\left\{{R}^{{d}_{\mathcal{s}},{d}_{z}}\right\}\to {R}^{{d}_{c}\times {d}_{l}\times {d}_{h}\times {d}_{w}}\), where \({d}_{c}\), \({d}_{l}\), \({d}_{h}\) and \({d}_{w}\) denote the number of channels, the length of the sequence, the height of the frames and the width of the frames, respectively. To capture both the spatial and the temporal structure of videos, 3D deconvolution (transposed convolution) filters are used, synthesizing the spatial content of each frame while providing temporal coherence across adjacent frames. Initially, a fully-connected layer is used to learn a unified embedding \(m\) by concatenating the text embedding \(\overline{e}\) and the noise variable. This unified embedding undergoes the feature transformation represented by Eq. (14):

$$m={W}_{\overline{e}}\left[z;\overline{e}\right]\in {R}^{{d}_{m}}$$
(14)

where \([z;\overline{e}]\) denotes concatenation and \({W}_{\overline{e}}\in {R}^{{d}_{m}\times ({d}_{z}+{d}_{\overline{e}})}\) is the transformation matrix. The generator \(G\) then takes this latent variable as input and generates the associated video, as shown in Eq. (15).

$$Q=G(m)\in {R}^{{{d}_{c}{\times d}_{l}{\times d}_{h}{\times d}_{w}}}$$
(15)

Here, \(Q=\left\{{F}_{1},{F}_{2},\dots ,{F}_{{d}_{l}}\right\}\) is the generated video and \({F}_{i}\in {R}^{{d}_{c}\times {d}_{h}\times {d}_{w}}\) is the \({i}^{th}\) generated frame.
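The sketch below illustrates this generator structure: the noise vector and sentence embedding are concatenated, projected by a fully-connected layer (Eq. (14)) and upsampled with 3D transposed convolutions into a (channels, length, height, width) video tensor (Eq. (15)). The exact layer widths and seed volume are not specified in the text, so they are assumptions chosen to reach the 1 × 16 × 48 × 48 output used in Sect. 4.2.

```python
import torch
import torch.nn as nn

class VideoGenerator(nn.Module):
    """Sketch of the text-conditioned video generator (layer widths illustrative)."""
    def __init__(self, z_dim=100, sent_dim=256, base=512,
                 channels=1, length=16, size=48):
        super().__init__()
        self.base = base
        self.fc = nn.Linear(z_dim + sent_dim, base * 1 * 3 * 3)
        self.deconv = nn.Sequential(   # (1, 3, 3) -> (2, 6, 6) -> (4, 12, 12) -> (8, 24, 24) -> (16, 48, 48)
            nn.ConvTranspose3d(base, base // 2, 4, 2, 1), nn.BatchNorm3d(base // 2), nn.ReLU(True),
            nn.ConvTranspose3d(base // 2, base // 4, 4, 2, 1), nn.BatchNorm3d(base // 4), nn.ReLU(True),
            nn.ConvTranspose3d(base // 4, base // 8, 4, 2, 1), nn.BatchNorm3d(base // 8), nn.ReLU(True),
            nn.ConvTranspose3d(base // 8, channels, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, z, sent_emb):
        m = self.fc(torch.cat([z, sent_emb], dim=1))   # Eq. (14): unified embedding
        m = m.view(-1, self.base, 1, 3, 3)             # seed spatio-temporal volume
        return self.deconv(m)                          # Eq. (15): (B, d_c, d_l, d_h, d_w)
```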

3.3 Discriminator

To ensure the generation of realistic videos, while maintaining temporal coherence across adjacent frames, a two-discriminator setup is employed: a 3D discriminator and a 2D discriminator.

3.3.1 3D discriminator \({{\varvec{D}}}_{1}({\varvec{\nu}}, \overline{{\varvec{e}} } )\)

The discriminator \({D}_{1}\) takes two inputs: the video tensor \(\nu\) and the text embedding \(\overline{e}\). It first processes the video input through 3D convolutional layers, producing a video-level tensor \({\mathcal{T}}_{\nu}\) that represents high-level features extracted from the video at different spatio-temporal locations. This video-level tensor is then augmented with the text embedding \(\overline{e}\), effectively incorporating the semantic information from the given description into the video representation. The augmented tensor is passed through a dense layer with a softmax activation function to discriminate whether the input video is sampled from the training data (real video) or generated data (synthetic video), and whether it is semantically consistent with the given description. This process is illustrated in Eq. (16):

$${D}_{1}(\nu , \overline{e})\to \left[0,1\right]$$
(16)

Unlike a conventional discriminator that merely distinguishes between real and generated videos, an additional requirement here is to maintain the semantic relationship between the generated video and its corresponding caption. Thus, a conditional discriminator is needed that not only judges the authenticity of videos but also evaluates whether a video aligns with the given text description. Early in training, the discriminator disregards the conditioning information and promptly rejects samples generated by the model \(G\) because they do not appear realistic. However, as the generator \(G\) becomes proficient at generating plausible data, it must also learn to align them with the provided conditioning data. Similarly, the discriminator \(D\) must develop the ability to determine whether the samples from \(G\) satisfy the conditioning constraint. To train such a discriminator, three types of inputs are provided: real video with semantically matched text \(\left(\widetilde{Q}\right)\), synthetic video with semantically matched text \(\left(Q\right)\), and real video with semantically mismatched text \(\left(\overline{Q}\right)\). The discriminator must score \(Q\) and \(\overline{Q}\) as fake and \(\widetilde{Q}\) as real. By introducing the third input, real video with semantically mismatched text \(\overline{Q}\), the discriminator learns to improve video-text matching in addition to judging realism, thereby providing an additional signal to the generator. Consequently, the loss function designed to optimize \({D}_{1}\) is expressed as Eq. (17).

$${L}_{{D}_{1}}=-\frac{1}{3}\left[\mathrm{log}\left({D}_{1}\left(\widetilde{Q},\overline{e }\right)\right)+\mathrm{log}\left(1-{D}_{1}\left(\overline{Q },\overline{e }\right)\right)+\mathrm{log}\left(1-{D}_{1}\left(Q,\overline{e }\right)\right)\right]$$
(17)
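A hedged sketch of Eq. (17) follows. Here `d1` is assumed to be a callable returning per-sample probabilities in [0, 1] for a (video, text embedding) pair; the mismatched pairing is passed as a separate text embedding for clarity, and `eps` is added only for numerical stability.

```python
import torch

def d1_loss(d1, real_video, fake_video, sent_emb, mismatched_sent_emb, eps=1e-8):
    """Eq. (17): the conditional 3D discriminator must accept (real video,
    matching text) and reject both (fake video, matching text) and
    (real video, mismatched text)."""
    p_real_match = d1(real_video, sent_emb)                 # tilde-Q: real + matched text
    p_real_mismatch = d1(real_video, mismatched_sent_emb)   # bar-Q: real + mismatched text
    p_fake_match = d1(fake_video, sent_emb)                 # Q: generated + matched text
    return -(torch.log(p_real_match + eps)
             + torch.log(1 - p_real_mismatch + eps)
             + torch.log(1 - p_fake_match + eps)).mean() / 3
```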

3.3.2 2D discriminator \({{\varvec{D}}}_{2}({\varvec{\nu}}, \overline{{\varvec{e}} } )\)

To further improve frame realism and semantic alignment with the given text, and to maintain temporal coherence, a 2D discriminator is used. This discriminator is a combination of two sub-networks: a frame discriminator (\({D}_{\mathrm{frame}}\)) and a motion discriminator (\({D}_{\mathrm{motion}}\)). The frame discriminator is responsible for determining the authenticity of individual frames in the video and assessing whether they are semantically consistent with the given textual description. The motion discriminator, on the other hand, examines the temporal coherence within the video, assessing the smoothness and natural flow of motion between adjacent frames.

Frame-Discriminator \({D}_{\mathrm{frame}}({F}_{i},\overline{e})\): First, a common 2D convolutional model is used to extract a frame-level tensor \({\mathcal{T}}_{F}\in {R}^{{d}_{c}\times {d}_{h}\times {d}_{w}}\) from every frame of the video. This frame-level tensor is then augmented with the text embedding \(\overline{e}\) and fed to \({D}_{\mathrm{frame}}\), which discriminates whether each frame of the video is both real and semantically consistent with the caption. This evaluation is carried out according to Eq. (18).

$${D}_{\mathrm{frame}}({F}_{i}, \overline{e})\to \left[0,1\right]$$
(18)

The frame-level loss for optimizing \({D}_{\mathrm{frame}}\) is designed as shown in Eq. (19).

$$ L_{{D_{{{\text{frame}}}} }} = - \frac{1}{{3d_{l} }}\left[ {\mathop \sum \limits_{i = 1}^{{d_{l} }} \log \left( {D_{{{\text{frame}}}} \left( {\widetilde{{F_{i} }},\overline{e}} \right)} \right) + \mathop \sum \limits_{i = 1}^{{d_{l} }} \log \left( {1 - D_{{{\text{frame}}}} \left( {\overline{{F_{i} }} ,\overline{e}} \right)} \right) + \mathop \sum \limits_{i = 1}^{{d_{l} }} \log \left( {1 - D_{{{\text{frame}}}} \left( {F_{i} ,\overline{e}} \right)} \right)} \right] $$
(19)

where \(\widetilde{{F}_{i}}\), \(\overline{{F}_{i}}\), and \({F}_{i}\) are the \({i}^{th}\) frames of \(\widetilde{Q}\), \(\overline{Q}\) and \(Q\), respectively.

Motion-Discriminator \({D}_{\mathrm{motion}}(\Delta {\mathcal{T}}_{{F}_{i}},\overline{e})\): To make adjacent frames temporally coherent, the similarity between two successive frames is determined using the Euclidean distance between their frame-level tensors. In other words, the magnitude of the motion tensor is calculated as depicted in Eq. (20):

$$Dist\left({F}_{i},{F}_{i-1}\right)={\Vert {\mathcal{T}}_{{F}_{i}}-{\mathcal{T}}_{{F}_{i-1}}\Vert }_{2}^{2}={\Vert \Delta {\mathcal{T}}_{{F}_{i}}\Vert }_{2}^{2}$$
(20)

where \(\Delta {\mathcal{T}}_{{F}_{i}}\) is the difference between the frame tensors of consecutive frames \({F}_{i}\) and \({F}_{i-1}\), indicating the magnitude of motion between them, and \(\Vert \cdot \Vert_{2}\) is the L2-norm. The temporal coherence adversarial loss that optimizes the motion discriminator is given by Eq. (21):

$$ L_{{D_{{{\text{motion}}}} }} = - \frac{1}{{3\left( {d_{l} - 1} \right)}}\left[ {\mathop \sum \limits_{i = 2}^{{d_{l} }} \log \left( {D_{{{\text{motion}}}} \left( {\Delta {\mathcal{T}}_{{\widetilde{{F_{i} }}}} ,\overline{e}} \right)} \right) + \mathop \sum \limits_{i = 2}^{{d_{l} }} \log \left( {1 - D_{{{\text{motion}}}} \left( {\Delta {\mathcal{T}}_{{\overline{{F_{i} }} }} ,\overline{e}} \right)} \right) + \mathop \sum \limits_{i = 2}^{{d_{l} }} \log \left( {1 - D_{{{\text{motion}}}} \left( {\Delta {\mathcal{T}}_{{F_{i} }} ,\overline{e}} \right)} \right)} \right]. $$
(21)

Here, \(\Delta {\mathcal{T}}_{\widetilde{{F}_{i}}}\), \(\Delta {\mathcal{T}}_{\overline{{F}_{i}}}\) and \(\Delta {\mathcal{T}}_{{F}_{i}}\) denote the motion features between the \({i}^{th}\) and \({(i-1)}^{th}\) frames of \(\widetilde{Q}\), \(\overline{Q}\) and \(Q\), respectively.
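The snippet below sketches how the motion features of Eq. (20) and the loss of Eq. (21) could be computed. `d_motion` is assumed to be a callable returning per-transition probabilities for a (motion tensor, text embedding) pair; the tensor layout is an assumption.

```python
import torch

def motion_tensors(frame_tensors):
    """Eq. (20): motion features as differences of consecutive frame-level
    tensors; frame_tensors has shape (B, d_l, C, H, W)."""
    return frame_tensors[:, 1:] - frame_tensors[:, :-1]    # (B, d_l - 1, C, H, W)

def d_motion_loss(d_motion, real_dt, mismatch_dt, fake_dt, sent_emb, eps=1e-8):
    """Eq. (21): temporal-coherence adversarial loss over the three input
    types (tilde-Q, bar-Q, Q); each *_dt holds that input's motion features."""
    loss = 0.0
    for dt, is_real in ((real_dt, True), (mismatch_dt, False), (fake_dt, False)):
        p = d_motion(dt, sent_emb)                          # scores in [0, 1] per transition
        term = torch.log(p + eps) if is_real else torch.log(1 - p + eps)
        loss = loss - term.mean()
    return loss / 3
```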

3.4 Optimization

The overall optimization of the discriminator is performed by minimizing the combined losses at the video level and the frame level, together with the temporal coherence loss. This is represented by Eq. (22):

$${L}_{Discriminator}={L}_{{D}_{1}}+{L}_{{D}_{2}}={L}_{{D}_{1}}+{L}_{{D}_{\mathrm{frame}}}+{L}_{{D}_{\mathrm{motion}}}$$
(22)

By minimizing Eq. (22), the discriminator D learns to classify videos and their frames as real or fake while also checking their alignment with semantically appropriate descriptions. Additionally, it learns to recognize the temporal changes between frames. For the generator network, the adversarial losses for optimizing G at the video level and at the frame level are defined by Eqs. (23) and (24):

$${L}_{{G}_{\mathrm{video}}}=-\frac{1}{3}\mathrm{log}\left(1-{D}_{1}\left(Q,\overline{e }\right)\right)$$
(23)
$$ L_{{G_{{{\text{frame}}}} }} = - \frac{1}{{3d_{l} }}\left[ {\frac{1}{{d_{l} }}\mathop \sum \limits_{i = 1}^{{d_{l} }} \log \left( {1 - D_{{{\text{frame}}}} \left( {F_{i} ,\overline{e}} \right)} \right) + \frac{1}{{d_{l} - 1}}\mathop \sum \limits_{i = 2}^{{d_{l} }} \log \left( {1 - D_{{{\text{motion}}}} \left( {\Delta {\mathcal{T}}_{{F_{i} }} ,\overline{e}} \right)} \right)} \right] $$
(24)

The losses \({L}_{{G}_{\mathrm{video}}}\) and \({L}_{{G}_{\mathrm{frame}}}\) train the generator to produce realistic and semantically aligned videos at both the frame level and the video level. Moreover, the \({L}_{{G}_{\mathrm{frame}}}\) loss also enforces temporal coherence across frames, thereby enhancing the realism of the generated videos. In addition to these two losses, the VTM loss \({L}_{VTM}\), elaborated in Sect. 3.1, is introduced to further enforce fine-granularity consistency between a video and its description. The VTM module provides losses at two levels, the sentence level and the word level. To produce convincing, realistic videos, the overall objective function of the generator G is given by Eq. (25):

$${L}_{Generator}={L}_{{G}_{\mathrm{video}}}+{L}_{{G}_{\mathrm{frame}}}+{L}_{VTM}$$
(25)

The entire methodology of training the VTM-GAN is presented in Algorithm 1.

Algorithm 1 Training procedure of VTM-GAN
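For orientation, the following is a condensed sketch of one training iteration in the spirit of Algorithm 1: the discriminators are updated on Eq. (22), then the generator on Eq. (25), which includes the VTM loss. `discriminator_losses` and `generator_adv_losses` are hypothetical wrappers around the loss terms defined above, and the batch layout is an assumption.

```python
import torch

def train_step(batch, G, discriminators, vtm, opt_d, opt_g, z_dim=100):
    real_video, sent_emb, mismatched_sent_emb, captions = batch

    # 1) Update the discriminators (Eq. (22)).
    z = torch.randn(real_video.size(0), z_dim, device=real_video.device)
    fake_video = G(z, sent_emb).detach()
    loss_d = discriminator_losses(discriminators, real_video, fake_video,
                                  sent_emb, mismatched_sent_emb)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # 2) Update the generator (Eq. (25)): adversarial terms plus the VTM loss.
    fake_video = G(z, sent_emb)
    loss_g = (generator_adv_losses(discriminators, fake_video, sent_emb)
              + vtm(fake_video, captions))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```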

4 Experiments and results

The proposed VTM-GAN, designed for text-to-video generation, is evaluated by comparing it with state-of-the-art techniques on the challenging task of generating videos from the SBMG dataset [65].

4.1 Dataset used

Single Digit Bouncing MNIST GIFs (SBMG) [65]: This dataset was originally developed for the purpose of generating videos from textual descriptions and was introduced in [65]. It is a synthetic dataset of GIFs, each depicting a handwritten digit moving within a \(64\times 64\) frame. Each GIF is \(16\) frames long and features a single \(28\times 28\) digit that moves either up and down or left and right. Additionally, each GIF is accompanied by a single sentence describing the digit and its direction of movement, as illustrated in Fig. 5.

Fig. 5 Frame sequence along with corresponding captions from the SBMG [65] dataset

4.2 Parameter setting

The dataset used for training and testing was resized to \(48\times 48\), and the training dataset comprised 478,961 images. For encoding the text, the input and hidden layers of the VTM encoder all had their dimension fixed to \(256\). The dimension of the sentence embedding in the generator is also \(256\), and it is concatenated with a noise vector of dimension 100. The number of frames in a video is the same as in the original dataset, \({d}_{l}=16\), and the number of channels is \({d}_{c}=1\); the height and width of the frames are set to \({d}_{h}={d}_{w}=48\). In the discriminator, the size of the video tensor \({\mathcal{T}}_{\nu }\) was set to \(512\times 1\times 3\times 3\) for the 3D discriminator and the size of the frame-level tensor \({\mathcal{T}}_{F}\) was set to \(512\times 3\times 3\) for the 2D discriminator. The Adam optimizer was used, with momentum terms set to \(0.9\) and \(0.999\). Other parameters were: \({\upbeta }_{1}=4.0\), \({\upbeta }_{2}=5.0\), \({\upbeta }_{3}=10.0\), \({\lambda }_{1}=4.0\) and \({\lambda }_{2}=1.0\). The learning rate of the encoder was fixed at \(0.002\), whereas the learning rates of both the discriminator and the generator were set to \(0.0002\). The VTM module was trained for \(599\) epochs with a batch size of \(96\), whereas the final model was trained for \(299\) epochs with a batch size of \(192\). Training took nearly \(20\) days for the VTM module and about \(35\) days for the final model. The model was trained and tested on a system with the following configuration: RTX 3090 24 GB GPU and a Core i9 CPU with 128 GB RAM.
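For reference, the hyperparameters listed above can be collected into a single configuration object; this is only a convenience summary of Sect. 4.2, and the key names are illustrative.

```python
# Hyperparameters from Sect. 4.2 gathered into one configuration dict.
config = {
    "frame_size": 48, "num_frames": 16, "num_channels": 1,
    "text_embed_dim": 256, "noise_dim": 100,
    "beta1": 4.0, "beta2": 5.0, "beta3": 10.0,
    "lambda1": 4.0, "lambda2": 1.0,
    "adam_betas": (0.9, 0.999),
    "lr_encoder": 0.002, "lr_generator": 0.0002, "lr_discriminator": 0.0002,
    "vtm_epochs": 599, "vtm_batch_size": 96,
    "gan_epochs": 299, "gan_batch_size": 192,
}
```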

4.3 Evaluation metrics used

To quantitatively evaluate the results and compare them with state-of-the-art methodologies, three metrics were used: Frechet Inception Distance at the image level (FID-image), Frechet Inception Distance at the video level (FID-video), and Inception Score (IS). FID-image assesses the quality of each individual frame by quantifying the Frechet distance between the features extracted from real frames and from synthesized frames; a trained Inception v3 network is used to extract these features. Realistic frames yield significantly lower FID values, indicating better quality. FID-video, a variant of FID-image, goes beyond single frames to measure both visual quality and temporal consistency at the video level. Lower FID-video values indicate improved overall results, reflecting higher visual fidelity and better temporal coherence in the generated videos. The Inception Score (IS) serves as an automated metric for evaluating the quality of image and video generative models; a higher IS value indicates superior performance.
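The FID values discussed above rest on the standard Frechet distance between Gaussian fits of real and generated feature sets. The sketch below is the generic formulation (with Inception v3 activations for FID-image and video-level features for FID-video), not the authors' exact evaluation code.

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2), the Gaussians
    fitted to real and generated feature sets."""
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)  # matrix square root
    if np.iscomplexobj(covmean):
        covmean = covmean.real        # discard tiny imaginary parts from numerics
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)
```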

4.4 Quantitative analysis

The proposed VTM-GAN demonstrates strong proficiency in generating moving digits from textual descriptions and shows more encouraging outcomes than state-of-the-art text-to-video GAN methods. The proposed technique is contrasted with recently introduced approaches for text-to-video generation, namely IRC-GAN [40] and BoGAN [52]. To assess the quality of videos generated from captions, the quantitative results of VTM-GAN on the Single Digit Bouncing MNIST GIFs (SBMG) dataset were compared with these state-of-the-art methods, as shown in Table 1. Notably, VTM-GAN outperforms both compared techniques in terms of the FID-image and IS metrics: it reduces FID-image from \(45.17\) to \(42.46\) and increases IS from \(3.76\) to \(4.52\), signifying the higher quality of the generated frames. In terms of FID-video, although BoGAN shows better results, VTM-GAN still achieves comparable performance on this metric.

Table 1 Comparative analysis of the performance of the proposed method with other text-to-video generation models

4.5 Qualitative analysis

In Fig. 6, the qualitative analysis on the SBMG dataset showcases the performance of the proposed VTM-GAN in comparison with existing GAN models. The results provide clear evidence that VTM-GAN outperforms current state-of-the-art techniques. The outcomes demonstrate that VTM-GAN can generate entire videos that correspond to the given textual descriptions; these generated videos not only possess visual and semantic consistency but also maintain temporal coherence throughout their duration. The VTM-based mechanism employed in the proposed model plays a pivotal role in this success: by considering similarity at both the sentence and video level, as well as at the word and region level, the model incorporates consistency losses that ensure the generated videos align well with the provided text, resulting in higher-quality and more coherent outputs.

Fig. 6 Samples generated from the textual descriptions by a IRC-GAN, b BoGAN, c the proposed model

4.6 Ablation analysis

An ablation study was conducted to investigate the impact of the different losses applied in the model; the quantitative results obtained with these losses are presented in Table 2. From Table 2, it is evident that employing the video-level loss yields better outcomes in terms of FID-image and IS than using the frame-level loss alone. Notable improvements are achieved by combining both losses, with all three analysis metrics improving. Additionally, including the CLIP loss alongside the video-level and frame-level losses reduces FID-image from \(56.62\) to \(43.86\) and FID-video from \(4.18\) to \(4.01\), while the Inception Score (IS) improves from \(3.77\) to \(4.43\). However, when the VTM loss is employed in conjunction with the frame-level and video-level losses, FID-image and FID-video drop from \(56.62\) to \(42.46\) and from \(4.18\) to \(3.43\), respectively, whereas IS increases from \(3.77\) to \(4.52\), indicating that the best results are obtained using the VTM loss alongside both the frame-level and video-level losses, as in the proposed method.

Table 2 Ablation study on using different losses

5 Conclusion and future work

This paper introduces a novel method called VTM-GAN for generating videos corresponding to textual descriptions. Its key component is the VTM module, which jointly trains a video encoder and a text encoder using contrastive learning to create a compact embedding space between paired video-text elements. VTM is an improved version of CLIP that addresses CLIP’s limitation of not providing word features and video region features, which prevents the calculation of a fine-granularity consistency loss. The architecture of VTM is similar to that of CLIP, but VTM stands out by effectively capturing region features of the video and word-level vectors in addition to video-level and sentence-level features. This is achieved through new neural network layers dedicated to calculating region features and word vectors; on the text encoder side, an MLP layer and Layer Normalization are utilized to compute the word vectors from the token vectors generated by the Transformer. The proposed VTM-GAN model shows significant improvement over earlier state-of-the-art GAN models. On the SBMG dataset, it achieves notable reductions in FID-image from \(45.17\) to \(42.46\) and FID-video from \(4.02\) to \(3.43\), and an increase in IS from \(3.76\) to \(4.52\). Extensive experimental findings provide ample evidence of the effectiveness of the proposed strategy for producing semantically aligned videos from textual descriptions, highlighting the superiority of VTM-GAN over previous GAN models. Future efforts will primarily concentrate on scaling the model to higher-resolution videos, enabling the synthesis of videos with greater visual clarity and quality. Another key focus will be enhancing the model’s ability to generate videos from longer texts, allowing it to understand and interpret more complex and diverse captions. Additionally, the existing framework will be extended to generate videos of longer duration, providing more comprehensive and immersive visual storytelling experiences.