1 Introduction

Generative Adversarial Networks (GANs) have been successful at learning several image processing tasks, beginning with image generation, from the first GAN [9] to recent work [31], and later expanding to other tasks such as image-to-image translation [13, 20, 38]. Super resolution can be handled by GANs in the same way. Super resolution (SR) uses low resolution (LR) images to reconstruct a high resolution (HR) image.

Image super-resolution is very challenging, and the smaller the resolution of the input image, the more difficult the task becomes. Various methods have been used to address this problem, from naive strategies such as bicubic and bilinear interpolation, which use a predefined mathematical function [24], or simple Nearest Neighbor (NN) interpolation [16], to complex deep learning methods [25, 26]. Recently, deep learning based approaches have shown outstanding results for single image super resolution (SISR). These approaches follow two main streams of evaluation: the first is the Peak Signal to Noise Ratio (PSNR) based approach [6], which mostly produces overly smooth and less detailed results; the second is perceptual image quality [14], which aims to enhance the visual appeal of the SR results.

SISR is a task that reconstructs a high-resolution (HR) image from a low-resolution (LR) image. Video super-resolution (VSR) is different from SISR: it generates high-resolution videos, which is more complex than SISR. The SISR problem can be handled very well with various super-resolution GAN methods [17, 33], while VSR has a sequence problem that cannot be handled by the same strategy. Using sequences directly without attending to the continuity constraints leads to inconsistent results and failures from one frame to another. The generator should not just learn spatial features from a single image, but should also learn the temporal features, i.e., motion continuity that is correlated over time. This continuity is especially clear in videos that highlight motion, such as fireworks videos, where the moving fire produces a trajectory in the form of a line of rays.

Many VSR methods use motion estimation to maintain the motion continuity of the video, while other methods use multiple frames as sequential input. The first type [27] relies heavily on motion estimation accuracy, resulting in sub-optimal output quality. The second type [32] focuses more on image quality by using multiple input frames, at the cost of motion inconsistency from frame to frame. Therefore, it is necessary to balance the two in order to obtain better results.

Some recent methods have used different upsampling approaches, including the front upsampling method [15] and the back upsampling method [29]. Most of these methods use a single direct scaling step from the LR input to the HR output. The bigger the scale factor, the larger the quality gap between input and output, which makes it more difficult for the super resolution task to achieve good results.

In this work, we propose a Multi-Hop Video Super Resolution GAN, where our novel long-term loss function maintains both the detail of the images and the smoothness of the motion. Our multi-hop scheme improves the resolution gradually, which keeps the gap between input and output small at each step, making super resolution easier and more likely to produce better output.

The contributions of the proposed method are summarized as follows: 1) we propose a multi-step scaling method called MVSRGAN, 2) the proposed method gradually improves the image resolution at each hop, and 3) we propose a novel long-term loss for video super-resolution.

This paper is divided into six sections. Section 1 provides a brief description of the background and scope of the research. Section 2 discusses several studies related to the proposed method. Section 3 introduces and explains the proposed method in detail. Section 4 describes the datasets we used and how we processed them to fit our model. Section 5 discusses the experiments that were carried out and presents the results. Finally, Section 6 gives the conclusions of our work.

2 Related works

Several SR methods have been proposed over the last few decades. In this section we will focus on the state-of-the-art deep learning based methods.

2.1 SISR

SISR is a technique commonly used to increase the resolution of images: it takes a lower resolution image and creates an image of higher resolution. Various methods have been developed to enhance image resolution. The first deep learning based method, the super-resolution convolutional neural network (SRCNN) proposed by [6], achieved results beyond traditional methods. The same authors also proposed a faster variant called the fast super-resolution convolutional neural network (FSRCNN) [7].

Recently, many SR methods [10, 30] have used PSNR and structural similarity (SSIM) for evaluation, but they still lack perceptual quality. The perceptual loss proposed by [14], later adopted by other researchers, helps to improve the perceived image quality. The GAN-based method for SISR [17] also applies this perceptual loss to generate photo-realistic texture from the LR image and thereby improve perceptual quality.

2.2 VSR

The video super-resolution task involves an upscaling process from an LR video to an HR video. It is similar to SISR but applies to multiple images. However, simply aligning frames as in multiple image super-resolution (MISR) ignores the temporal features between frames. The early VSR methods [22, 32] employed this approach to obtain high resolution video from low resolution input.

Most VSR methods obtain the temporal feature by concatenating multiple frames [28]. This approach has been shown to enhance VSR quality, but it lacks motion consistency and produces occasional artifacts. To overcome this problem, flow estimation is used as motion compensation to obtain motion consistency [3]. Flow estimation is the task of estimating flow vectors between consecutive frames: it attempts to find the motion by calculating the pixel-wise displacement between two images, taken from the current and previous frames. This feature is essential for VSR to capture the temporal relationship between frames, especially the motion attribute. Caballero et al. [2] used an optical flow of the LR input. Instead of using LR, [35] exploited the HR frames for the optical flow to improve the accuracy of the motion compensation.

3 Proposed method

Generative Adversarial Networks are widely used in image translation and image generation tasks [14, 15]. GANs have been extended to other tasks, one of which is the super resolution problem [5, 27]. SR aims to generate an HR image from an LR input image. The problem becomes more challenging when dealing with sequential input, as in VSR, where we need to consider the temporal feature between frames in addition to the spatial feature of each frame. In this section we introduce our proposed model in detail and explain how sequential LR inputs are transformed into an HR output. We first describe the input that we use, then show how the SR models are built, and finally present how the temporal feature is handled for VSR.

3.1 Multi-hop

Performing SR is like completing the missing features of the HR image that is generated from the LR image. The bigger the size gap between the LR and the HR image, the more features are missing. Directly mapping the LR image to a final HR image of a much larger size is difficult. To solve this problem, we propose an approach that performs the SR gradually, which makes the task easier. With the multi-hop scheme, each generator produces an HR image with twice the resolution of its input, and the last HR image has the target size. This scheme makes it possible to improve the quality of the SR output at every hop.

Our method is constructed with a multiple generator and discriminator scheme, as shown in Fig. 1. This multi-hop architecture is trained on all frames except the first and last frame of each video, because we use three frames as input. The first generator uses LR images as input, and each subsequent generator uses the intermediate output of the previous generator as input. The last generator produces the high resolution image as the final output. Every output is evaluated against a multi-size ground truth obtained by resizing the original ground truth. The datasets that we used have a 4× scaling factor and each generator (each hop) produces 2× scaling; therefore, we use two generators and discriminators (two hops).

Fig. 1
figure 1

MVSRGAN architecture using a multiple-hop GAN for video super resolution. The number of hops x depends on the scale factor between the input and target image, with each hop producing a 2× larger image
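To make the scheme concrete, the following is a minimal PyTorch-style sketch of the multi-hop forward pass; the class and variable names are illustrative and the framework choice is our assumption, not the paper's stated implementation.

```python
import torch.nn as nn

class MultiHopSR(nn.Module):
    """Chains several 2x generators; each hop refines the output of the previous one."""
    def __init__(self, generators):
        super().__init__()
        self.hops = nn.ModuleList(generators)   # e.g. two 2x generators for a 4x target

    def forward(self, lr_input):
        outputs = []
        x = lr_input
        for g in self.hops:
            x = g(x)                            # each hop doubles the resolution
            outputs.append(x)                   # every hop is supervised against a resized ground truth
        return outputs                          # the last element is the final HR frame
```

During training, each element of `outputs` would be paired with a ground truth downscaled to the matching resolution and scored by its own discriminator.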

3.2 Input

Maintaining the temporal feature of the video is very important for VSR, but not all methods consider this problem. Some methods use a recurrent neural network [12] and others concatenate several frames as input [28], but these need more training time and learn the motion only implicitly. Other methods use explicit motion features between the frames [22, 32]. We use multiple frames as input, as shown in Fig. 2: the current frame t is combined with the previous frame t-1 and the next frame t+1. To reduce the training time, we explicitly use the optical flow as the motion feature for the previous and next frames, so the input dimension is smaller than when using the original RGB features.

Fig. 2
figure 2

Multiple feature input using the previous frame (t - 1), current frame (t) and next frame (t + 1)
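A minimal sketch of how such an input tensor could be assembled, assuming OpenCV's Farnebäck estimator for the optical flow; the paper does not specify the estimator or the exact channel layout, so both are assumptions.

```python
import cv2
import numpy as np

def build_input(prev_bgr, curr_bgr, next_bgr):
    """Stack the current RGB frame with optical flow to the previous and next frames."""
    prev_g, curr_g, next_g = (cv2.cvtColor(f, cv2.COLOR_BGR2GRAY)
                              for f in (prev_bgr, curr_bgr, next_bgr))
    # Dense flow fields (H x W x 2) replace the full RGB neighbours, keeping the input small.
    flow_prev = cv2.calcOpticalFlowFarneback(prev_g, curr_g, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    flow_next = cv2.calcOpticalFlowFarneback(curr_g, next_g, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    curr_rgb = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    return np.concatenate([curr_rgb, flow_prev, flow_next], axis=2)  # H x W x 7
```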

3.3 Generator

In the generator, the first layer is an upsampling layer, followed by a U-Net architecture as the backbone, as shown in Fig. 3. It has an equal number of convolution layers in the downsampling path and deconvolution layers in the upsampling path. In this architecture the generator has mirrored connections between the downsampling and upsampling layers as in [1], which results in a U-shaped network. This design is useful because it allows low-level information to pass directly and implicitly assumes alignment between input and output. Without skip connections, the information at each level must pass through the bottleneck, which leads to losing substantial features [33].

Fig. 3
figure 3

Generator architecture using U-Net with an additional upsampling layer. The generator produces a 2× larger image with better image quality

The input size has a big impact on the network architecture. The depth of the U-Net depends on the input size: the bigger the input, the deeper the network. The last downsampling layer is reached when the width or height becomes the smallest odd value. For inputs with equal width and height, the last downsampling size is 1; if the width and height differ, we make the final size as small as possible. Our method has multiple generators, and each generator produces an output twice as large as its input, because we add a single upsampling layer before the U-Net.
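The sketch below illustrates one hop's generator under these constraints: an upsampling layer followed by a small U-Net with mirrored skip connections. The channel counts, depth, and activation choices are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class HopGenerator(nn.Module):
    """One hop: upsample the input by 2x, then refine it with a small U-Net
    whose downsampling and upsampling paths are linked by mirrored skip connections."""
    def __init__(self, in_ch=7, out_ch=3, base=32, depth=3):
        super().__init__()
        self.upsample = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.pool = nn.MaxPool2d(2)
        self.downs = nn.ModuleList()
        self.ups = nn.ModuleList()
        ch = in_ch
        for d in range(depth):                                   # encoder
            self.downs.append(conv_block(ch, base * 2 ** d))
            ch = base * 2 ** d
        for d in reversed(range(depth - 1)):                     # decoder with skips
            self.ups.append(conv_block(ch + base * 2 ** d, base * 2 ** d))
            ch = base * 2 ** d
        self.head = nn.Conv2d(ch, out_ch, 1)

    def forward(self, x):
        x = self.upsample(x)                                     # output will be 2x the input size
        skips = []
        for i, block in enumerate(self.downs):
            x = block(x)
            if i < len(self.downs) - 1:
                skips.append(x)                                  # save feature map for the mirror layer
                x = self.pool(x)
        for block in self.ups:
            x = F.interpolate(x, scale_factor=2, mode='nearest')
            x = block(torch.cat([x, skips.pop()], dim=1))        # concatenate the mirrored skip
        return torch.sigmoid(self.head(x))
```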

3.4 Discriminator

We use a discriminator similar to that of a traditional GAN. It employs the Markovian PatchGAN architecture as explored in [13]. The PatchGAN discriminator estimates the probability of an image being real or fake, but not as a single scalar output: instead of judging the whole input at once, it scores patches of the image and produces an N×N output map, whose size depends on the input size. Our method requires multiple discriminators, one for each generator, and the input size of each discriminator matches the output size of its generator.
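A minimal sketch of a PatchGAN-style discriminator; the layer count and channel widths are assumptions, the point being that the final convolution returns a patch-wise score map rather than one scalar.

```python
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """PatchGAN-style discriminator: returns an N x N map of real/fake scores,
    one per receptive-field patch, instead of a single scalar."""
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        layers, ch = [], in_ch
        for mult in (1, 2, 4, 8):                                   # strided convolution stack
            layers += [nn.Conv2d(ch, base * mult, 4, stride=2, padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch = base * mult
        layers.append(nn.Conv2d(ch, 1, 4, stride=1, padding=1))     # patch score map
        self.net = nn.Sequential(*layers)

    def forward(self, img):
        return self.net(img)                                        # shape: B x 1 x N x N
```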

3.5 Loss function

The success of a model depends on applying the appropriate objective function. VSR needs to consider both spatial and temporal features, so it requires a different objective function for each target. We therefore use two types of loss functions: a short-term loss, which maintains spatial quality, and a long-term loss, which maintains temporal quality.

3.5.1 Short-term loss

The spatial quality of VSR is determined by the quality of the super resolution result of each frame. We propose a triple loss function for the short-term loss, consisting of an adversarial loss, a perceptual loss, and an MSE loss. GANs generally use a log-based adversarial loss, with which D saturates very quickly. Inspired by [23], we instead use an L2-based adversarial loss to ensure that the discriminator D always provides a useful gradient for G. The corresponding loss function is shown in (1):

$$ \begin{array}{@{}rcl@{}} l_{adv} = \frac{1}{2}\left( D\left( G\left( z\right)\right)-1\right)^{2} \end{array} $$
(1)

where z is the LR images.

In addition, we use a loss that works at the feature map level, called perceptual loss [14], which is similar to the content loss in [18]. This loss promotes natural and perceptually pleasing results. The corresponding loss function lpercep is shown in (2), as follows:

$$ \begin{array}{@{}rcl@{}} l_{percep} = VGG\left( G\left( z\right)\right)-VGG\left( b\right) \end{array} $$
(2)

where z is the LR image, b is the ground truth image, and VGG denotes the features provided by the VGG19 model with weights pretrained on ImageNet.

We use MSE loss to control the image quality of the output of each frame, as described below:

$$ \begin{array}{@{}rcl@{}} l_{MSE} = \sum\limits_{x=0}^{W}\sum\limits_{y=0}^{H}\left( \left( b_{t}-G\left( z\right)_{t}\right)_{x,y}\right)^{2} \end{array} $$
(3)

where b is the high resolution target image at frame t, G(z) is the image generated from the low resolution input at frame t, and x, y are the pixel coordinates being compared.

The short-term loss lshort is the sum of lpercep, ladv, and lMSE, accumulated over the frames, as follows:

$$ \begin{array}{@{}rcl@{}} l_{short} = \sum\limits_{t=0}^{s}\left( l_{percep}+l_{adv}+l_{MSE}\right) \end{array} $$
(4)
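As a rough sketch, the three terms of (4) could be combined as below, assuming PyTorch and a pretrained VGG19 feature extractor; the VGG layer cut, the equal weighting of the terms, and the use of a mean squared feature distance for lpercep are our assumptions.

```python
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

vgg_features = vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features[:36].eval()  # frozen feature extractor
for p in vgg_features.parameters():
    p.requires_grad = False

def short_term_loss(discriminator, sr, hr):
    """sr: generated frame G(z); hr: ground truth frame b (both torch tensors, B x 3 x H x W)."""
    l_adv = 0.5 * (discriminator(sr) - 1.0).pow(2).mean()       # L2 (least-squares) adversarial term, eq. (1)
    l_percep = F.mse_loss(vgg_features(sr), vgg_features(hr))   # feature-space distance, eq. (2)
    l_mse = F.mse_loss(sr, hr, reduction='sum')                 # pixel-wise error summed over W x H, eq. (3)
    return l_adv + l_percep + l_mse
```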

3.5.2 Long-term loss

Maintaining temporal quality is important for VSR. We propose a long-term loss to handle this temporal problem. Our long-term loss consists of a long-term MSE term and a long-term optical flow term. We accumulate the loss from the first frame to the latest frame.

Similar to the short-term MSE, the long-term MSE loss (lLTMSE) also controls the image quality, but it maintains consistency of image quality over time. As shown in (5), lMSE is the MSE loss at the current frame, and we accumulate its value from t = 0. The lLTMSE is formulated as follows:

$$ \begin{array}{@{}rcl@{}} l_{LTMSE} = \sum\limits_{t=0}^{s}\left( l_{MSE}\right) \end{array} $$
(5)

Maintaining image quality is important, but for VSR we also have to consider smooth movement. Optical flow is the pattern of movement obtained by observing the differences from one frame to another in a sequential image [8]. Tao et al. [32] proposed a dense optical flow that observes the movement of an object between two frames: it compares the intensity i of the object after time dt and produces a new intensity after the movement. Using these functions, we compute the OF for every two consecutive frames. We calculate the distance between the OF of the target images and the OF of the generated images, as shown in (6), where b is the ground truth and g is the generated image, comparing frame t with the previous frame t-1. The long-term optical flow loss lLTOF is formulated as follows:

$$ \begin{array}{@{}rcl@{}} l_{LTOF} = \sum\limits_{t=0}^{s}\left( OF\left( b_{t-1},b_{t}\right)-OF\left( g_{t-1},g_{t}\right)\right) \end{array} $$
(6)

The long-term loss llong is the sum of lLTMSE and lLTOF accumulated from the beginning to the current frame, as follows:

$$ \begin{array}{@{}rcl@{}} l_{long} = \sum\limits_{t=0}^{s}\left( l_{LTMSE}+l_{LTOF}\right) \end{array} $$
(7)
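The following sketch shows how (5)-(7) could be accumulated over a clip, assuming OpenCV's Farnebäck estimator for OF and an L1 distance between the two flow fields; both choices are assumptions for illustration.

```python
import cv2
import numpy as np

def dense_flow(a, b):
    """Dense optical flow between two grayscale frames (Farneback, one possible estimator)."""
    return cv2.calcOpticalFlowFarneback(a, b, None, 0.5, 3, 15, 3, 5, 1.2, 0)

def long_term_loss(gt_frames, gen_frames):
    """gt_frames, gen_frames: equal-length lists of grayscale uint8 frames (b and g)."""
    l_ltmse, l_ltof = 0.0, 0.0
    for t in range(len(gt_frames)):
        b = gt_frames[t].astype(np.float32)
        g = gen_frames[t].astype(np.float32)
        l_ltmse += np.sum((b - g) ** 2)                        # accumulated per-frame MSE, eq. (5)
        if t > 0:                                              # OF needs a previous frame
            of_gt = dense_flow(gt_frames[t - 1], gt_frames[t])
            of_gen = dense_flow(gen_frames[t - 1], gen_frames[t])
            l_ltof += np.sum(np.abs(of_gt - of_gen))           # distance between flow fields, eq. (6)
    return l_ltmse + l_ltof                                    # eq. (7)
```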

4 Data

Datasets are an important component for training deep learning models, so it is necessary to ensure that their quality and quantity are adequate. The data must cover a variety of conditions, with a sufficient amount of data for each. To achieve good VSR results, the training data must contain videos with various real-world motions. We use several datasets to train and test our model: the Vimeo90K dataset [37], which contains short video footage, and the Vid4 dataset [21], which is commonly used to test VSR [4, 11, 28, 36]. We also created our own Fireworks dataset, which shows motion more clearly.

4.1 Data collection

The Fireworks dataset was created from high resolution firework videos from YouTube. These videos were trimmed into several clips of 10 to 20 frames each, giving 70 clips in total. The resolution of each frame was resized to match our model.

4.2 Data preprocessing

The proposed model requires a specific input size, so the first step is to preprocess our dataset to fit the model. We performed resizing and cropping as illustrated in Fig. 4. On the left we can see that the original high resolution data has a resolution of 768 × 576. We resized it to 768 × 512, the same as the target image of the proposed model. The input image is obtained by 4× downsampling of the target image, giving an input size of 192 × 128.

Fig. 4
figure 4

Resizing the input data: the left image is the original image, and the right image is the resized image adjusted to the proposed model
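A minimal sketch of this resizing step, assuming OpenCV; the interpolation kernel is an assumption.

```python
import cv2

def preprocess(frame_bgr, target_size=(768, 512), scale=4):
    """Resize a source frame to the HR target resolution and derive the 4x-smaller LR input."""
    hr = cv2.resize(frame_bgr, target_size, interpolation=cv2.INTER_AREA)   # 768 x 512 target
    lr = cv2.resize(hr, (target_size[0] // scale, target_size[1] // scale),
                    interpolation=cv2.INTER_AREA)                           # 192 x 128 input
    return lr, hr
```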

4.3 Data augmentation

Having appropriate data is important for training because it can prevent overfitting and help a model achieve good results. We augmented the dataset to this end: we perform random cropping with 10% padding of the original image and then resize the crop back to the original size, as shown in Fig. 5. Each sample is augmented 5 times, producing a larger and more diverse dataset.

Fig. 5
figure 5

Random cropping to augment the dataset, with 10% padding. The left image is the original image with a red box indicating the cropping area, the middle image is the cropping result, and the right image is the cropping result resized back to the original size
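A minimal sketch of this augmentation, assuming the same random crop offset is applied to every frame of a clip so that the motion stays consistent; that per-clip consistency is our assumption.

```python
import random
import cv2

def augment_clip(frames, pad_ratio=0.10, copies=5):
    """Random crop with 10% padding, resized back to the original size, 5 copies per clip."""
    h, w = frames[0].shape[:2]
    ph, pw = int(h * pad_ratio), int(w * pad_ratio)
    augmented = []
    for _ in range(copies):
        y, x = random.randint(0, ph), random.randint(0, pw)   # one offset per augmented copy
        augmented.append([cv2.resize(f[y:y + h - ph, x:x + w - pw], (w, h),
                                     interpolation=cv2.INTER_CUBIC) for f in frames])
    return augmented
```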

5 Experiment and result

5.1 Implementation detail

The VSR model is trained with the loss functions described in Section 3. We use the Adam optimizer with an initial learning rate that differs per dataset: 5 × 10−4 for the Vimeo90K dataset and 1 × 10−3 for the Fireworks dataset. For the Vimeo90K dataset, we decrease the learning rate after 70K iterations and then every 20K iterations, based on our experiments. For the Fireworks dataset, we halve the learning rate every 50K iterations. We train on the Vimeo90K and Fireworks datasets and test on the Vimeo90K-T, Vid4 and Fireworks datasets. All models were trained on NVidia RTX 3090 GPUs with 64 GB RAM.
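A minimal sketch of the Vimeo90K schedule described above, assuming PyTorch's Adam; the decay factor is an assumption, since the text only states when the rate is decreased and its final value.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)                     # stand-in for the generator
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

def adjust_lr(optimizer, iteration, base_lr=5e-4, first_decay=70_000, step=20_000, gamma=0.5):
    """Keep the base rate until 70K iterations, then decay every 20K iterations."""
    if iteration < first_decay:
        lr = base_lr
    else:
        lr = base_lr * gamma ** (1 + (iteration - first_decay) // step)
    for group in optimizer.param_groups:
        group['lr'] = lr
```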

5.2 Evaluation metrics

We evaluate model performance using two common metrics for image quality. The first is PSNR, which measures pixel-wise accuracy, and the second is SSIM, a perceptual metric. Using only PSNR can misrepresent the perceptual quality of an image, because some important information in the image is not captured. A perceptual measurement such as SSIM evaluates the deterioration of image quality and its results better reflect human vision. SSIM is computed in the YCbCr color space using only the Y (luminance) channel, not the Cb and Cr channels, which refer to the blue and red chrominance components respectively. The computational cost of the model is measured in floating point operations (FLOPs). Motion is evaluated using an optical flow (OF) calculation between two frames (the previous and the current frame); the total motion score is the summation of the OF scores, similar to the evaluation performed by [5], called temporal optical flow (tOF).
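A minimal sketch of the PSNR/SSIM evaluation, assuming scikit-image's metrics and OpenCV for the color conversion; computing PSNR on the full image is our assumption.

```python
import cv2
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(sr_bgr, hr_bgr):
    """PSNR on the full image, SSIM on the luminance (Y) channel only."""
    psnr = peak_signal_noise_ratio(hr_bgr, sr_bgr, data_range=255)
    y_sr = cv2.cvtColor(sr_bgr, cv2.COLOR_BGR2YCrCb)[..., 0]   # keep Y, drop the chrominance channels
    y_hr = cv2.cvtColor(hr_bgr, cv2.COLOR_BGR2YCrCb)[..., 0]
    ssim = structural_similarity(y_hr, y_sr, data_range=255)
    return psnr, ssim
```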

5.3 Experiments on generator configuration

We performed an experiment to find the best U-Net configuration for the generator. We compared two generator architectures: the first is a deeper model that uses an input with a width-to-height ratio of 3:2, and the second is a shallower model that uses an input with a width-to-height ratio of 12:9. The result is shown in Fig. 6; the deeper generator achieves a better result in terms of PSNR.

Fig. 6
figure 6

Comparison of two generator architectures using different input sizes. The left image is the result using the optimal input size, the middle image is the ground truth image, and the right image is the result using the larger-ratio input

5.4 Experiment on multi-hop model

To test whether the proposed method appropriately utilizes the temporal information of a video, we compared our proposed method for VSR against its use for SISR, which has no motion or temporal correction. Both models use the same multi-hop scheme; the difference is that the SISR variant does not apply the temporal loss. To test whether the proposed temporal loss provides a significant contribution, we also performed an ablation experiment over different configurations of the temporal loss. The comparison of the results is shown in Table 1.

Table 1 PSNR and SSIM evaluation of different model configuration methods using the Fireworks dataset. Bold font indicates best result

Moreover, to highlight the advantages of our VSR scheme with temporal loss, Fig. 7 shows that MVSR generates better SR images than SISR. The figure contains three columns and two rows: the left column is the MVSR result, the middle column is the ground truth, and the right column is the SISR result. Due to the fireworks' movement, the lights should form a line path, as shown in the first row of the MVSR result in Fig. 7. MVSR can create this line while SISR misses it. In the second row, we can see that MVSR reconstructs the small details of the fireworks better than the single image architecture.

Fig. 7
figure 7

Comparison of our multi-hop for VSR and multi-hop for SISR on the Fireworks dataset. The first column is the result using the proposed MVSR, the second column is the ground truth images, and the third column is the result of our proposed multi-hop using single frame input and without the temporal loss for SISR

5.5 Experiment on learning rate

The PSNR results per iteration show that the increase in PSNR is also affected by the learning rate. Fig. 8 shows that decreasing the learning rate at certain iterations gives a better result; otherwise, the result stalls or degrades. The initial learning rate is set to 5 × 10−4 and kept until iteration 70K. Continuing training with the same value leads to sub-optimal performance, shown by the red line in Fig. 8. The subsequent decays occur every additional 20K iterations, and the sub-optimal results for each learning rate are also shown in Fig. 8. The best result is shown by the orange line, obtained after several decays of the learning rate with a final value of 5 × 10−6.

Fig. 8
figure 8

The PSNR score at different iterations and learning rate decays. The initial learning rate is 0.0005; the decay starts at iteration 70K and is repeated every 20K iterations afterwards

5.6 Comparisons with state-of-the-art methods

We performed a quantitative comparison on the Vimeo90K, Vid4 and Fireworks datasets against other state-of-the-art VSR methods. The results of the quantitative comparison on the Vimeo90K dataset are shown in Table 2. For this dataset, our proposed method performed better than the other methods.

Table 2 PSNR, SSIM and FLOPs evaluation of state-of-the-art VSR methods using the Vimeo90K dataset. Bold font indicates best result

The results of the quantitative comparison on the Vid4 dataset with state-of-the-art VSR methods are shown in Table 3. For this dataset, our proposed method using the multi-hop scheme with the long-term loss performed better than the other methods. On the Vid4 dataset, we see a larger improvement than on the Vimeo90K dataset, because the Vid4 dataset has longer sequences than the Vimeo90K dataset.

Table 3 PSNR, SSIM and FLOPs evaluation of state-of-the-art VSR methods using the Vid4 dataset. Bold font indicates best result

Motion continuity is evaluated using OF as a motion estimation technique. Table 4 shows the comparison of motion quality between the proposed method and state-of-the-art methods on the Vid4 dataset, where a lower score indicates a better result. The "Walk" and "City" videos have the highest total scores across all methods due to their large amounts of motion. The "Calendar" and "Foliage" videos have lower total scores across all methods because parts of the videos contain static objects or white areas. In the "Foliage" video, the motion score of the proposed method is slightly higher than that of the other methods because of occlusion within the motion. However, the results show that our method is superior in terms of the average score.

Table 4 Motion evaluation (tOF) of state-of-the-art VSR methods using the Vid4 dataset. Bold font indicates best result

Furthermore, we performed a qualitative comparison on the Vid4 dataset, shown in Fig. 9, which illustrates how our method compares with the state-of-the-art methods. Our method reconstructs image details and textures better. We found that maintaining both motion and image detail can produce better output than focusing mainly on motion. This is in line with two other works that also paid attention to image detail through the addition of spatial image quality enhancement schemes and maintained motion features either explicitly [36] or implicitly [28]. Our result is still better because we preserve both spatial and temporal image quality; this result also demonstrates the advantage of using a multi-hop scheme.

Fig. 9
figure 9

Qualitative comparison on the Vid4 dataset. The left image is the original ground truth image with a red box indicating the magnified area. The right images compare the results of different methods for the magnified area. DUF [28] implementation https://github.com/yhjo09/VSR-DUF/ and TecoGAN [5] implementation https://github.com/thunil/TecoGAN

The results of the quantitative comparison on the Fireworks dataset with state-of-the-art VSR methods are shown in Table 5. For this dataset, our proposed model performed better than the previous methods.

Table 5 PSNR and SSIM evaluation of state-of-the-art VSR methods using the Fireworks dataset. Bold font indicates best result

Bicubic is a simple interpolation method in which each output pixel is computed from the values of the surrounding pixels; it produces smooth edges and blurry HR images. TecoGAN and Frame Recurrent Video Super Resolution (FRVSR) are VSR methods with a similar approach, using frame-recurrent input. TecoGAN has a temporal objective function that focuses more on motion quality, while FRVSR only uses the recurrent input for short temporal features. Meanwhile, our method uses the long-term optical flow loss (lLTOF) to deal with long-term motion and the long-term MSE loss (lLTMSE) to deal with long-term image quality.

The results show that the models which account for motion performed better on the Fireworks dataset. TecoGAN is better than FRVSR because it has a PP loss that maintains long-term temporal consistency. We also tested our method using only a single hop and lLTOF; it gives a result similar to TecoGAN because it focuses only on motion quality. SOF-VSR also gives a result similar to our multi-hop method using only lLTOF. Our proposed method with the multi-hop scheme and llong has a better result because it maintains both motion and image quality with long-term consistency.

Meanwhile, we performed a qualitative evaluation on our Fireworks dataset, shown in Fig. 10, comparing our result with the other state-of-the-art methods. In this dataset, the motion of the fireworks is explicit, and some of the methods have advantages in motion compensation. Some small details are reconstructed better by our method, while the results of the other methods have more missing pixels. Other methods sometimes create sharper lines, but their results differ from the ground truth.

Fig. 10
figure 10

Qualitative comparison using the Fireworks dataset. The first row is the ground truth image, the second row is the result of Bicubic, the third row is the result of FRVSR [13], and the fourth row is the result of TecoGAN [5], the fifth row is the result of SOF-VSR [37], and the sixth row is the result of our proposed method

6 Conclusion

In this paper, we have introduced a new deep learning based framework for VSR that includes a multiple scaling process. This multiple scaling helps the model to learn gradually to perform super resolution from LR input to HR output through multiple intermediate results. Our proposed long-term losses are effective at reconstructing detail while maintaining motion. Our proposed method outperforms the state of the art and recovers high quality HR frames with long-term consistency on both the Fireworks and Vid4 datasets. However, the proposed model is only suitable for a particular input size ratio. In the future, we will consider using a different network instead of U-Net to allow various input sizes without the need to resize the input to a specific resolution ratio.