1 Introduction

Generative Adversarial Networks (GANs) have been successful at learning several image processing tasks, beginning with image generation, from the first GAN [9] to recent work [31], and later expanding to other tasks such as image-to-image translation [13, 20, 38]. Super resolution can be handled by GANs in the same way. Super resolution (SR) uses low resolution (LR) images to reconstruct a high resolution (HR) image.

Image super-resolution is very challenging, and the smaller the resolution of the input image, the more difficult the task becomes. Various methods have been used to address this problem, from naive strategies such as bicubic and bilinear interpolation, which use a predefined mathematical function [24], or simple Nearest Neighbor (NN) interpolation [16], to complex deep learning methods [25, 26]. Recently, deep learning based approaches have shown outstanding results for single image super resolution (SISR). These approaches follow two main streams of evaluation: the first is the Peak Signal to Noise Ratio (PSNR) based approach [6], which mostly produces overly smooth and less detailed results; the second is perceptual image quality [14], which aims to enhance the visual appeal of the SR results.

SISR is a task that reconstructs a high-resolution (HR) image from a low-resolution (LR) image. Video super-resolution (VSR) is different from SISR: it generates high-resolution videos, which is more complex than SISR. The SISR problem can be handled very well with various super-resolution GAN methods [17, 33], while VSR has a sequence problem that cannot be handled by the same strategy. Using sequences directly without attending to the continuity constraints leads to inconsistent results and failures from one frame to another. The generator should not just learn spatial features from a single image, but should also learn the temporal features, i.e., motion continuity that is correlated over time. This continuity is especially clear in videos that highlight motion, such as fireworks videos, where the moving fire produces a trajectory in the form of a line of rays.

Many VSR methods use motion estimation to maintain the motion continuity of the video, while other methods use multiple frames as sequential input. The first type [27] relies heavily on motion estimation accuracy, resulting in sub-optimal output quality. The second type [32] focuses more on image quality by using multiple input frames, at the cost of motion inconsistency from frame to frame. Therefore, it is necessary to balance the two in order to obtain better results.

Some recent methods have used different upsampling approaches, including the front upsampling method [15] and the back upsampling method [29]. Most of these methods use a single direct scaling step from the LR input to the HR output. The bigger the scale factor, the larger the quality gap between input and output, which makes it more difficult for the super resolution task to achieve good results.

In this work, we propose a Multi-Hop Video Super Resolution GAN, where our novel long-term loss function maintains both the detail of the images and the smoothness of the motion. Our multi-hop scheme improves the resolution gradually, which keeps the gap between input and output small at each step, making super resolution easier and more likely to produce better output.

The contributions of the proposed method are summarized as follows: 1) we propose a multi-step scaling method called MVSRGAN, 2) the proposed method gradually improves the image resolution at each hop, and 3) we propose a novel long-term loss for video super-resolution.

This paper is divided into six sections. Section 1 provides a brief description of the background and scope of the research. Section 2 discusses several studies related to the proposed method. Section 3 introduces and explains the proposed method in detail. Section 4 describes the datasets we used and how we processed them to fit our model. Section 5 discusses the experiments that were carried out and presents the results. Finally, Section 6 gives the conclusions of our work.

2 Related works

Several SR methods have been proposed over the last few decades. In this section we will focus on the state-of-the-art deep learning based methods.

2.1 SISR

SISR is a technique commonly used to increase the resolution of images: it takes a lower resolution image and creates an image of higher resolution. Various methods have been developed to enhance image resolution. The first deep learning based method, the super-resolution convolutional neural network (SRCNN) proposed by [6], achieved results beyond traditional methods. The same authors also proposed a faster variant called the fast super-resolution convolutional neural network (FSRCNN) [7].

Recently, many SR methods [10, 30] have used PSNR and structural similarity (SSIM) for evaluation, but they still lack perceptual quality. The perceptual loss proposed by [14], later adopted by other researchers, helps to improve the perceived image quality. The GAN-based method for SISR [17] also applies this perceptual loss to generate photo-realistic texture from the LR image and thereby improve perceptual quality.

2.2 VSR

The video super-resolution task involves an upscaling process from an LR video to an HR video. It is similar to SISR but applies to multiple images. However, simply aligning frames as in multiple image super-resolution (MISR) ignores the temporal features between frames. The early VSR methods [22, 32] employed this approach to obtain high resolution video from low resolution input.

Most VSR methods obtain the temporal feature by concatenating multiple frames [28]. This approach has been shown to enhance VSR quality, but it lacks motion consistency and produces occasional artifacts. To overcome this problem, flow estimation is used as motion compensation to obtain motion consistency [3]. Flow estimation is the task of estimating flow vectors between consecutive frames: it attempts to find the motion by calculating the pixel-wise displacement between two images, taken from the current and previous frames. This feature is essential for VSR to capture the temporal relationship between frames, especially the motion attribute. Caballero et al. [2] used an optical flow of the LR input. Instead of using LR, [35] exploited the HR frames for the optical flow to improve the accuracy of the motion compensation.

3 Proposed method

Generative Adversarial Networks are widely used in image translation and image generation tasks [14, 15]. GANs have been extended to other tasks, one of which is the super resolution problem [5, 27]. SR aims to generate an HR image from an LR input image. The problem becomes more challenging when dealing with sequential input, as in VSR, where we need to consider the temporal feature between frames in addition to the spatial feature of each frame. In this section we introduce our proposed model in detail and explain how sequential LR inputs are transformed into an HR output. We first describe the input that we use, then show how the SR models are built, and finally present how the temporal feature is handled for VSR.

3.1 Multi-hop

Performing SR is like completing the missing features of the HR image that is generated from the LR image. The bigger the size gap between the LR and the HR image, the more features are missing. Directly mapping the LR image to a final HR image of a much larger size is difficult. To solve this problem, we propose an approach that performs the SR gradually, which makes the task easier. With the multi-hop scheme, each generator produces an HR image with twice the resolution of its input, and the last HR image has the target size. This scheme makes it possible to improve the quality of the SR output at every hop.

Our method is constructed with a multiple generator and discriminator scheme, as shown in Fig. 1. This multi-hop architecture is trained on all frames except the first and last frame of each video, because we use three frames as input. The first generator uses LR images as input, and each subsequent generator uses the intermediate output of the previous generator as input. The last generator produces the high resolution image as the final output. Every output is evaluated against a multi-size ground truth obtained by resizing the original ground truth. The datasets that we used have a 4× scaling factor and each generator (each hop) produces 2× scaling; therefore, we use two generators and discriminators (two hops).

Fig. 1
figure 1

MVSRGAN architecture using a multiple-hop GAN for video super resolution. The number of hops x depends on the scale factor between the input and target image, with each hop producing a 2× larger image
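To make the scheme concrete, the following is a minimal PyTorch-style sketch of the multi-hop forward pass; the class and variable names are illustrative and the framework choice is our assumption, not the paper's stated implementation.

```python
import torch.nn as nn

class MultiHopSR(nn.Module):
    """Chains several 2x generators; each hop refines the output of the previous one."""
    def __init__(self, generators):
        super().__init__()
        self.hops = nn.ModuleList(generators)   # e.g. two 2x generators for a 4x target

    def forward(self, lr_input):
        outputs = []
        x = lr_input
        for g in self.hops:
            x = g(x)                            # each hop doubles the resolution
            outputs.append(x)                   # every hop is supervised against a resized ground truth
        return outputs                          # the last element is the final HR frame
```

During training, each element of `outputs` would be paired with a ground truth downscaled to the matching resolution and scored by its own discriminator.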

3.2 Input

Maintaining the temporal feature of the video is very important for VSR, but not all methods consider this problem. Some methods use a recurrent neural network [12] and others concatenate several frames as input [28], but these need more training time and learn the motion only implicitly. Other methods use explicit motion features between the frames [22, 32]. We use multiple frames as input, as shown in Fig. 2: the current frame t is combined with the previous frame t-1 and the next frame t+1. To reduce the training time, we explicitly use the optical flow as the motion feature for the previous and next frames, so the input dimension is smaller than when using the original RGB features.

Fig. 2
figure 2

Multiple feature input using the previous frame (t - 1), current frame (t) and next frame (t + 1)
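A minimal sketch of how such an input tensor could be assembled, assuming OpenCV's Farnebäck estimator for the optical flow; the paper does not specify the estimator or the exact channel layout, so both are assumptions.

```python
import cv2
import numpy as np

def build_input(prev_bgr, curr_bgr, next_bgr):
    """Stack the current RGB frame with optical flow to the previous and next frames."""
    prev_g, curr_g, next_g = (cv2.cvtColor(f, cv2.COLOR_BGR2GRAY)
                              for f in (prev_bgr, curr_bgr, next_bgr))
    # Dense flow fields (H x W x 2) replace the full RGB neighbours, keeping the input small.
    flow_prev = cv2.calcOpticalFlowFarneback(prev_g, curr_g, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    flow_next = cv2.calcOpticalFlowFarneback(curr_g, next_g, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    curr_rgb = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    return np.concatenate([curr_rgb, flow_prev, flow_next], axis=2)  # H x W x 7
```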

3.3 Generator

In the generator, the first layer is an upsampling layer, followed by a U-Net architecture as the backbone, as shown in Fig. 3. It has an equal number of convolution layers in the downsampling path and deconvolution layers in the upsampling path. In this architecture the generator has mirrored connections between the downsampling and upsampling layers as in [1], which results in a U-shaped network. This design is useful because it allows low-level information to pass directly and implicitly assumes alignment between input and output. Without skip connections, the information at each level must pass through the bottleneck, which leads to losing substantial features [33].

Fig. 3
figure 3

Generator architecture using U-Net with an additional upsampling layer. The generator produces a 2× larger image with better image quality

The input size has a big impact on the network architecture. The depth of the U-Net depends on the input size: the bigger the input, the deeper the network. The last downsampling layer is reached when the width or height becomes the smallest odd value. For inputs with equal width and height, the last downsampling size is 1; if the width and height differ, we make the final size as small as possible. Our method has multiple generators, and each generator produces an output twice as large as its input, because we add a single upsampling layer before the U-Net.
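The sketch below illustrates one hop's generator under these constraints: an upsampling layer followed by a small U-Net with mirrored skip connections. The channel counts, depth, and activation choices are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class HopGenerator(nn.Module):
    """One hop: upsample the input by 2x, then refine it with a small U-Net
    whose downsampling and upsampling paths are linked by mirrored skip connections."""
    def __init__(self, in_ch=7, out_ch=3, base=32, depth=3):
        super().__init__()
        self.upsample = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.pool = nn.MaxPool2d(2)
        self.downs = nn.ModuleList()
        self.ups = nn.ModuleList()
        ch = in_ch
        for d in range(depth):                                   # encoder
            self.downs.append(conv_block(ch, base * 2 ** d))
            ch = base * 2 ** d
        for d in reversed(range(depth - 1)):                     # decoder with skips
            self.ups.append(conv_block(ch + base * 2 ** d, base * 2 ** d))
            ch = base * 2 ** d
        self.head = nn.Conv2d(ch, out_ch, 1)

    def forward(self, x):
        x = self.upsample(x)                                     # output will be 2x the input size
        skips = []
        for i, block in enumerate(self.downs):
            x = block(x)
            if i < len(self.downs) - 1:
                skips.append(x)                                  # save feature map for the mirror layer
                x = self.pool(x)
        for block in self.ups:
            x = F.interpolate(x, scale_factor=2, mode='nearest')
            x = block(torch.cat([x, skips.pop()], dim=1))        # concatenate the mirrored skip
        return torch.sigmoid(self.head(x))
```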

3.4 Discriminator

We use a discriminator similar to that of a traditional GAN. It employs the Markovian PatchGAN architecture as explored in [13]. The PatchGAN discriminator estimates the probability of an image being real or fake, but not as a single scalar output: instead of judging the whole input at once, it scores patches of the image and produces an N×N output map, whose size depends on the input size. Our method requires multiple discriminators, one for each generator, and the input size of each discriminator matches the output size of its generator.
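A minimal sketch of a PatchGAN-style discriminator; the layer count and channel widths are assumptions, the point being that the final convolution returns a patch-wise score map rather than one scalar.

```python
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """PatchGAN-style discriminator: returns an N x N map of real/fake scores,
    one per receptive-field patch, instead of a single scalar."""
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        layers, ch = [], in_ch
        for mult in (1, 2, 4, 8):                                   # strided convolution stack
            layers += [nn.Conv2d(ch, base * mult, 4, stride=2, padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch = base * mult
        layers.append(nn.Conv2d(ch, 1, 4, stride=1, padding=1))     # patch score map
        self.net = nn.Sequential(*layers)

    def forward(self, img):
        return self.net(img)                                        # shape: B x 1 x N x N
```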

3.5 Loss function

The success of a model depends on applying the appropriate objective function. VSR needs to consider both spatial and temporal features, so it requires a different objective function for each target. We therefore use two types of loss functions: a short-term loss, which maintains spatial quality, and a long-term loss, which maintains temporal quality.

3.5.1 Short-term loss

The spatial quality of VSR is determined by the quality of the super resolution result of each frame. We propose a triple loss function for the short-term loss, consisting of an adversarial loss, a perceptual loss, and an MSE loss. GANs generally use a log-based adversarial loss, with which D saturates very quickly. Inspired by [23], we instead use an L2-based adversarial loss to ensure that the discriminator D always provides a useful gradient for G. The corresponding loss function is shown in (1):

$$ \begin{array}{@{}rcl@{}} l_{adv} = \frac{1}{2}\left( D\left( G\left( z\right)\right)-1\right)^{2} \end{array} $$
(1)

where z is the LR images.

In addition, we use a loss that works at the feature map level, called perceptual loss [14], which is similar to the content loss in [18]. This loss promotes natural and perceptually pleasing results. The corresponding loss function lpercep is shown in (2), as follows:

$$ \begin{array}{@{}rcl@{}} l_{percep} = VGG\left( G\left( z\right)\right)-VGG\left( b\right) \end{array} $$
(2)

where z is the LR image, b is the ground truth image, and VGG denotes the features provided by the VGG19 model with weights pretrained on ImageNet.

We use MSE loss to control the image quality of the output of each frame, as described below:

$$ \begin{array}{@{}rcl@{}} l_{MSE} = \sum\limits_{x=0}^{W}\sum\limits_{y=0}^{H}\left( \left( b_{t}-G\left( z\right)_{t}\right)_{x,y}\right)^{2} \end{array} $$
(3)

where b is the high resolution target image at frame t, G(z) is the image generated from the low resolution input at frame t, and x, y are the pixel coordinates being compared.

The short-term loss lshort is the sum of lpercep, ladv, and lMSE, accumulated over the frames, as follows:

$$ \begin{array}{@{}rcl@{}} l_{short} = \sum\limits_{t=0}^{s}\left( l_{percep}+l_{adv}+l_{MSE}\right) \end{array} $$
(4)
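As a rough sketch, the three terms of (4) could be combined as below, assuming PyTorch and a pretrained VGG19 feature extractor; the VGG layer cut, the equal weighting of the terms, and the use of a mean squared feature distance for lpercep are our assumptions.

```python
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

vgg_features = vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features[:36].eval()  # frozen feature extractor
for p in vgg_features.parameters():
    p.requires_grad = False

def short_term_loss(discriminator, sr, hr):
    """sr: generated frame G(z); hr: ground truth frame b (both torch tensors, B x 3 x H x W)."""
    l_adv = 0.5 * (discriminator(sr) - 1.0).pow(2).mean()       # L2 (least-squares) adversarial term, eq. (1)
    l_percep = F.mse_loss(vgg_features(sr), vgg_features(hr))   # feature-space distance, eq. (2)
    l_mse = F.mse_loss(sr, hr, reduction='sum')                 # pixel-wise error summed over W x H, eq. (3)
    return l_adv + l_percep + l_mse
```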

3.5.2 Long-term loss

Maintaining temporal quality is important for VSR. We propose a long-term loss to handle this temporal problem. Our long-term loss consists of a long-term MSE term and a long-term optical flow term. We accumulate the loss from the first frame to the latest frame.

Similar to the short-term MSE, the long-term MSE loss (lLTMSE) also controls the image quality, but it maintains consistency of image quality over time. As shown in (5), lMSE is the MSE loss at the current frame, and we accumulate its value from t = 0. The lLTMSE is formulated as follows:

$$ \begin{array}{@{}rcl@{}} l_{LTMSE} = \sum\limits_{t=0}^{s}\left( l_{MSE}\right) \end{array} $$
(5)

Maintaining image quality is important, but for VSR we also have to consider smooth movement. Optical flow is the pattern of movement obtained by observing the differences from one frame to another in a sequential image [8]. Tao et al. [32] proposed a dense optical flow that observes the movement of an object between two frames: it compares the intensity i of the object after time dt and produces a new intensity after the movement. Using these functions, we compute the OF for every two consecutive frames. We calculate the distance between the OF of the target images and the OF of the generated images, as shown in (6), where b is the ground truth and g is the generated image, comparing frame t with the previous frame t-1. The long-term optical flow loss lLTOF is formulated as follows:

$$ \begin{array}{@{}rcl@{}} l_{LTOF} = \sum\limits_{t=0}^{s}\left( OF\left( b_{t-1},b_{t}\right)-OF\left( g_{t-1},g_{t}\right)\right) \end{array} $$
(6)

The long-term loss llong is the sum of lLTMSE and lLTOF accumulated from the beginning to the current frame, as follows:

$$ \begin{array}{@{}rcl@{}} l_{long} = \sum\limits_{t=0}^{s}\left( l_{LTMSE}+l_{LTOF}\right) \end{array} $$
(7)
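The following sketch shows how (5)-(7) could be accumulated over a clip, assuming OpenCV's Farnebäck estimator for OF and an L1 distance between the two flow fields; both choices are assumptions for illustration.

```python
import cv2
import numpy as np

def dense_flow(a, b):
    """Dense optical flow between two grayscale frames (Farneback, one possible estimator)."""
    return cv2.calcOpticalFlowFarneback(a, b, None, 0.5, 3, 15, 3, 5, 1.2, 0)

def long_term_loss(gt_frames, gen_frames):
    """gt_frames, gen_frames: equal-length lists of grayscale uint8 frames (b and g)."""
    l_ltmse, l_ltof = 0.0, 0.0
    for t in range(len(gt_frames)):
        b = gt_frames[t].astype(np.float32)
        g = gen_frames[t].astype(np.float32)
        l_ltmse += np.sum((b - g) ** 2)                        # accumulated per-frame MSE, eq. (5)
        if t > 0:                                              # OF needs a previous frame
            of_gt = dense_flow(gt_frames[t - 1], gt_frames[t])
            of_gen = dense_flow(gen_frames[t - 1], gen_frames[t])
            l_ltof += np.sum(np.abs(of_gt - of_gen))           # distance between flow fields, eq. (6)
    return l_ltmse + l_ltof                                    # eq. (7)
```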

4 Data

Datasets are an important component for training deep learning models, so it is necessary to ensure that their quality and quantity are adequate. The data must cover a variety of conditions, with a sufficient amount of data for each. To achieve good VSR results, the training data must contain videos with various real-world motions. We use several datasets to train and test our model: the Vimeo90K dataset [37], which contains short video footage, and the Vid4 dataset [21], which is commonly used to test VSR [4, 11, 28, 36]. We also created our own Fireworks dataset, which shows motion more clearly.

4.1 Data collection

The Fireworks dataset was created from high resolution firework videos from YouTube. These videos were trimmed into several clips of 10 to 20 frames each, giving 70 clips in total. The resolution of each frame was resized to match our model.

4.2 Data preprocessing

The proposed model requires a specific input size, so the first step is to preprocess our dataset to fit the model. We performed resizing and cropping as illustrated in Fig. 4. On the left we can see that the original high resolution data has a resolution of 768 × 576. We resized it to 768 × 512, the same as the target image of the proposed model. The input image is obtained by 4× downsampling of the target image, giving an input size of 192 × 128.

Fig. 4
figure 4

Resizing the input data: the left image is the original image, and the right image is the resized image adjusted to the proposed model
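A minimal sketch of this resizing step, assuming OpenCV; the interpolation kernel is an assumption.

```python
import cv2

def preprocess(frame_bgr, target_size=(768, 512), scale=4):
    """Resize a source frame to the HR target resolution and derive the 4x-smaller LR input."""
    hr = cv2.resize(frame_bgr, target_size, interpolation=cv2.INTER_AREA)   # 768 x 512 target
    lr = cv2.resize(hr, (target_size[0] // scale, target_size[1] // scale),
                    interpolation=cv2.INTER_AREA)                           # 192 x 128 input
    return lr, hr
```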

4.3 Data augmentation

Having appropriate data is important for training because it can prevent overfitting and help a model achieve good results. We augmented the dataset to this end: we perform random cropping with 10% padding of the original image and then resize the crop back to the original size, as shown in Fig. 5. Each sample is augmented 5 times, producing a larger and more diverse dataset.

Fig. 5
figure 5

Random cropping to augment the dataset, with 10% padding. The left image is the original image with a red box indicating the cropping area, the middle image is the cropping result, and the right image is the cropping result resized back to the original size
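A minimal sketch of this augmentation, assuming the same random crop offset is applied to every frame of a clip so that the motion stays consistent; that per-clip consistency is our assumption.

```python
import random
import cv2

def augment_clip(frames, pad_ratio=0.10, copies=5):
    """Random crop with 10% padding, resized back to the original size, 5 copies per clip."""
    h, w = frames[0].shape[:2]
    ph, pw = int(h * pad_ratio), int(w * pad_ratio)
    augmented = []
    for _ in range(copies):
        y, x = random.randint(0, ph), random.randint(0, pw)   # one offset per augmented copy
        augmented.append([cv2.resize(f[y:y + h - ph, x:x + w - pw], (w, h),
                                     interpolation=cv2.INTER_CUBIC) for f in frames])
    return augmented
```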

5 Experiment and result

5.1 Implementation detail

The VSR model is trained with the loss functions described in Section 3. We use the Adam optimizer with an initial learning rate that differs per dataset: 5 × 10−4 for the Vimeo90K dataset and 1 × 10−3 for the Fireworks dataset. For the Vimeo90K dataset, we decrease the learning rate after 70K iterations and then every 20K iterations, based on our experiments. For the Fireworks dataset, we halve the learning rate every 50K iterations. We train on the Vimeo90K and Fireworks datasets and test on the Vimeo90K-T, Vid4 and Fireworks datasets. All models were trained on NVidia RTX 3090 GPUs with 64 GB RAM.
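A minimal sketch of the Vimeo90K schedule described above, assuming PyTorch's Adam; the decay factor is an assumption, since the text only states when the rate is decreased and its final value.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)                     # stand-in for the generator
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

def adjust_lr(optimizer, iteration, base_lr=5e-4, first_decay=70_000, step=20_000, gamma=0.5):
    """Keep the base rate until 70K iterations, then decay every 20K iterations."""
    if iteration < first_decay:
        lr = base_lr
    else:
        lr = base_lr * gamma ** (1 + (iteration - first_decay) // step)
    for group in optimizer.param_groups:
        group['lr'] = lr
```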

5.2 Evaluation metrics

We evaluate model performance using two common metrics for image quality. The first is PSNR, which measures pixel-wise accuracy, and the second is SSIM, a perceptual metric. Using only PSNR can misrepresent the perceptual quality of an image, because some important information in the image is not captured. A perceptual measurement such as SSIM evaluates the deterioration of image quality and its results better reflect human vision. SSIM is computed in the YCbCr color space using only the Y (luminance) channel, not the Cb and Cr channels, which refer to the blue and red chrominance components respectively. The computational cost of the model is measured in floating point operations (FLOPs). Motion is evaluated using an optical flow (OF) calculation between two frames (the previous and the current frame); the total motion score is the summation of the OF scores, similar to the evaluation performed by [5], called temporal optical flow (tOF).
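A minimal sketch of the PSNR/SSIM evaluation, assuming scikit-image's metrics and OpenCV for the color conversion; computing PSNR on the full image is our assumption.

```python
import cv2
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(sr_bgr, hr_bgr):
    """PSNR on the full image, SSIM on the luminance (Y) channel only."""
    psnr = peak_signal_noise_ratio(hr_bgr, sr_bgr, data_range=255)
    y_sr = cv2.cvtColor(sr_bgr, cv2.COLOR_BGR2YCrCb)[..., 0]   # keep Y, drop the chrominance channels
    y_hr = cv2.cvtColor(hr_bgr, cv2.COLOR_BGR2YCrCb)[..., 0]
    ssim = structural_similarity(y_hr, y_sr, data_range=255)
    return psnr, ssim
```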

5.3 Experiments on generator configuration

We performed an experiment to find the best U-Net configuration for the generator. We compared two generator architectures: the first is a deeper model that uses an input with a width-to-height ratio of 3:2, and the second is a shallower model that uses an input with a width-to-height ratio of 12:9. The result is shown in Fig. 6; the deeper generator achieves a better result in terms of PSNR.

Fig. 6
figure 6

Comparison of two generator architectures using different input sizes. The left image is the result using the optimal input size, the middle image is the ground truth image, and the right image is the result using the larger-ratio input

5.4 Experiment on multi-hop model

To test whether the proposed method appropriately utilizes the temporal information of a video, we compared our proposed method for VSR against its use for SISR, which has no motion or temporal correction. Both models use the same multi-hop scheme; the difference is that the SISR variant does not apply the temporal loss. To test whether the proposed temporal loss provides a significant contribution, we also performed an ablation experiment over different configurations of the temporal loss. The comparison of the results is shown in Table 1.

Table 1 PSNR and SSIM evaluation of different model configuration methods using the Fireworks dataset. Bold font indicates best result

Moreover, to highlight the advantages of our VSR scheme with temporal loss, Fig. 7 shows that MVSR generates better SR images than SISR. The figure contains three columns and two rows: the left column is the MVSR result, the middle column is the ground truth, and the right column is the SISR result. Due to the fireworks' movement, the lights should form a line path, as shown in the first row of the MVSR result in Fig. 7. MVSR can create this line while SISR misses it. In the second row, we can see that MVSR reconstructs the small details of the fireworks better than the single image architecture.

Fig. 7
figure 7

Comparison of our multi-hop for VSR and multi-hop for SISR on the Fireworks dataset. The first column is the result using the proposed MVSR, the second column is the ground truth images, and the third column is the result of our proposed multi-hop using single frame input and without the temporal loss for SISR

5.5 Experiment on learning rate

The PSNR results per iteration show that the increase in PSNR is also affected by the learning rate. Fig. 8 shows that decreasing the learning rate at certain iterations gives a better result; otherwise, the result stalls or degrades. The initial learning rate is set to 5 × 10−4 and kept until iteration 70K. Continuing training with the same value leads to sub-optimal performance, shown by the red line in Fig. 8. The subsequent decays occur every additional 20K iterations, and the sub-optimal results for each learning rate are also shown in Fig. 8. The best result is shown by the orange line, obtained after several decays of the learning rate with a final value of 5 × 10−6.

Fig. 8
figure 8

The PSNR score at different iterations and learning rate decays. The initial learning rate is 0.0005; the decay starts at iteration 70K and is repeated every 20K iterations afterwards

5.6 Comparisons with state-of-the-art methods

We performed a quantitative comparison on the Vimeo90K, Vid4 and Fireworks datasets against other state-of-the-art VSR methods. The results of the quantitative comparison on the Vimeo90K dataset are shown in Table 2. For this dataset, our proposed method performed better than the other methods.

Table 2 PSNR, SSIM and FLOPs evaluation of state-of-the-art VSR methods using the Vimeo90K dataset. Bold font indicates best result

The results of the quantitative comparison on the Vid4 dataset with state-of-the-art VSR methods are shown in Table 3. For this dataset, our proposed method using the multi-hop scheme with the long-term loss performed better than the other methods. On the Vid4 dataset, we see a larger improvement than on the Vimeo90K dataset, because the Vid4 dataset has longer sequences than the Vimeo90K dataset.

Table 3 PSNR, SSIM and FLOPs evaluation of state-of-the-art VSR methods using the Vid4 dataset. Bold font indicates best result

Motion continuity is evaluated using OF as a motion estimation technique. Table 4 shows the comparison of motion quality between the proposed method and state-of-the-art methods on the Vid4 dataset, where a lower score indicates a better result. The "Walk" and "City" videos have the highest total scores across all methods due to their large amounts of motion. The "Calendar" and "Foliage" videos have lower total scores across all methods because parts of the videos contain static objects or white areas. In the "Foliage" video, the motion score of the proposed method is slightly higher than that of the other methods because of occlusion within the motion. However, the results show that our method is superior in terms of the average score.

Table 4 Motion evaluation (tOF) of state-of-the-art VSR methods using the Vid4 dataset. Bold font indicates best result

Furthermore, we performed a qualitative comparison on the Vid4 dataset, shown in Fig. 9, which illustrates how our method compares with the state-of-the-art methods. Our method reconstructs image details and textures better. We found that maintaining both motion and image detail can produce better output than focusing mainly on motion. This is in line with two other works that also paid attention to image detail through the addition of spatial image quality enhancement schemes and maintained motion features either explicitly [36] or implicitly [28]. Our result is still better because we preserve both spatial and temporal image quality; this result also demonstrates the advantage of using a multi-hop scheme.

Fig. 9
figure 9

Qualitative comparison on the Vid4 dataset. The left image is the original ground truth image with a red box indicating the magnified area. The right images compare the results of different methods for the magnified area. DUF [28] implementation https://github.com/yhjo09/VSR-DUF/ and TecoGAN [5] implementation https://github.com/thunil/TecoGAN

The results of the quantitative comparison on the Fireworks dataset with state-of-the-art VSR methods are shown in Table 5. For this dataset, our proposed model performed better than the previous methods.

Table 5 PSNR and SSIM evaluation of state-of-the-art VSR methods using the Fireworks dataset. Bold font indicates best result

Bicubic is a simple interpolation method in which each output pixel is computed from the values of the surrounding pixels; it produces smooth edges and blurry HR images. TecoGAN and Frame Recurrent Video Super Resolution (FRVSR) are VSR methods with a similar approach, using frame-recurrent input. TecoGAN has a temporal objective function that focuses more on motion quality, while FRVSR only uses the recurrent input for short temporal features. Meanwhile, our method uses the long-term optical flow loss (lLTOF) to deal with long-term motion and the long-term MSE loss (lLTMSE) to deal with long-term image quality.

The results show that the models which account for motion performed better on the Fireworks dataset. TecoGAN is better than FRVSR because it has a PP loss that maintains long-term temporal consistency. We also tested our method using only a single hop and lLTOF; it gives a result similar to TecoGAN because it focuses only on motion quality. SOF-VSR also gives a result similar to our multi-hop method using only lLTOF. Our proposed method with the multi-hop scheme and llong has a better result because it maintains both motion and image quality with long-term consistency.

Meanwhile, we performed a qualitative evaluation on our Fireworks dataset, shown in Fig. 10, comparing our result with the other state-of-the-art methods. In this dataset, the motion of the fireworks is explicit, and some of the methods have advantages in motion compensation. Some small details are reconstructed better by our method, while the results of the other methods have more missing pixels. Other methods sometimes create sharper lines, but their results differ from the ground truth.

Fig. 10
figure 10

Qualitative comparison using the Fireworks dataset. The first row is the ground truth image, the second row is the result of Bicubic, the third row is the result of FRVSR [13], and the fourth row is the result of TecoGAN [5], the fifth row is the result of SOF-VSR [37], and the sixth row is the result of our proposed method

6 Conclusion

In this paper, we have introduced a new deep learning based framework for VSR that includes a multiple scaling process. This multiple scaling helps the model to learn gradually to perform super resolution from LR input to HR output through multiple intermediate results. Our proposed long-term losses are effective at reconstructing detail while maintaining motion. Our proposed method outperforms the state of the art and recovers high quality HR frames with long-term consistency on both the Fireworks and Vid4 datasets. However, the proposed model is only suitable for a particular input size ratio. In the future, we will consider using a different network instead of U-Net to allow various input sizes without the need to resize the input to a specific resolution ratio.