1 Introduction

Digital super-resolution (DSR, or simply SR) is a set of image restoration techniques that aim to increase the spatial and/or temporal resolution of images and videos, typically through a controlled optimization procedure. For an imaging system, spatial resolution refers to the finest detail visually distinguishable in captured images. In contrast, temporal resolution defines the highest frequency of dynamic events perceivable in a video sequence. SR has been a very active research area in both academia and industry over the past three decades. It has found practical applications in many real-life problems, including the display industry, medical imaging, satellite and aerial photography, astronomy, surveillance, and remote sensing.

There are two categories of SR techniques in the literature: multi-frame SR (MFSR) and single-frame SR (SFSR), where a frame means a still image or a video frame. MFSR mainly refers to the traditional way of doing SR, which reconstructs a high-resolution (HR) frame by fusing multiple low-resolution (LR) frames [1,2,3]. Each LR frame must contain information not present in the other LR frames. This condition is fulfilled by the existence of subpixel motion (global or local) between the LR frames, which commonly occurs in most captured frame sequences due to the movement of objects in the scene or of the camera. Most MFSR techniques in the literature assume that the motion is global and the blur function is known a priori. However, a few allow for local motion in their model and estimate both motion and blur along with the HR frames [4, 5].

Learning-based SR (LBSR) techniques reconstruct an HR frame from a single LR frame. This category of SR techniques assumes that the relationship between the LR and HR frames can be learned from a training set that contains several LR frames and their corresponding HR frames. In a traditional SFSR approach called patch-based or dictionary-based SR [6,7,8], an LR input frame is segmented into small patches. Each patch is then compared against the LR patches in the training set to find its best match, and the input LR patch is replaced with the HR patch corresponding to that best match. SFSR has also been developed for videos, where an LR video is segmented into spatio-temporal patches [9]. Machine-learning (ML) and deep-learning (DL)-based SFSR methods [10,11,12,13,14,15,16,17,18,19,20], including Convolutional Neural Network (CNN) and Generative Adversarial Network (GAN) based techniques, have been of much interest in recent years. Learning-based MFSR methods have also been introduced for videos by leveraging the temporal correlation between video frames for more accurate reconstruction [9, 10, 21,22,23,24]. However, in this work, we refer to the traditional (non-ML) motion-based MFSR techniques simply as MFSR to differentiate them from the LBSR techniques.

The majority of SR publications aim to improve only the spatial resolution of images/videos. Space–time (or spatio-temporal) super-resolution (STSR), which improves both spatial and temporal resolutions, is considered in only a few publications. One approach to STSR is to use multiple LR video sequences having spatial (sub-pixel) as well as temporal (sub-frame) misalignments [25,26,27]. This means that the corresponding pixels in the input videos' frames are at different spatial locations due to the scene's movements, and the videos are captured at slightly different timestamps. Here, to simplify the motion model, the capturing cameras are kept close to each other compared to their distances from the scene. This constraint enables the motion between the videos to be globally modeled as a 2D homography transformation in space and a 1D affine transformation in time [28].

STSR from a single video has also been proposed in several works. The method in [9] is a patch-based approach assuming that, in a natural video, space–time patches recur many times inside the same video at different spatio-temporal scales. This method is effective on videos having a repeated act, such as a rotating turbine. The work in [29] proposes a 3D steering kernel regression method to fuse the frames without an explicit motion estimation. However, it employs a suboptimal imaging model that first estimates the upsampled output frames without deblurring and then applies deblurring to each output frame individually. A few DL-based STSR methods have also been proposed [10,11,12], but they mainly target frame interpolation to increase the video frame rate. A one-stage space–time video SR method is introduced in [11] to increase the spatial resolution and temporal frame rate using frame-feature temporal interpolation and a deformable ConvLSTM recurrent model.

In this paper, we propose an STSR method for a single video using an MFSR approach. It takes an LR video as input and reconstructs an HR video with a larger frame size and/or a higher number of frames. The proposed technique improves the input video's spatial and temporal resolutions by combining each video frame with its adjacent frames. For this purpose, we extend the sequential motion estimation approach introduced in [5] to support temporal upsampling. The optimization is based on a maximum a posteriori (MAP) statistical framework that applies the desired level of smoothness while restoring sharp edges in the estimated HR frames. We introduce a sharpening process embedded in the optimization framework to intensify the recovered edges. Furthermore, we improve temporal consistency by adding a temporal constraint between the current and the previously reconstructed frame. We compare our proposed STSR method's performance with several SR methods, including deep learning-based ones.

It should be noted that the proposed method can remove/reduce spatial blur, temporal blur, spatial aliasing, and noise in video sequences. It can also increase the video frame rate and perform view interpolation. However, it does not address the removal of temporal aliasing resulting from very fast dynamic events. Removing the temporal aliasing is mainly done by capturing multiple videos with temporal misalignments [25, 30].

The rest of this paper is organized as follows: Sect. 2 presents our problem formulation, with subsections that discuss the assumed STSR imaging model, the proposed optimization framework, the extended motion estimation method, the initial estimate of the HR frame, and our color-processing strategy. The experimental results are presented in Sect. 3, and finally, Sect. 4 concludes the paper.

2 Problem formulation

2.1 Imaging model

Although the forward imaging model represented in this section is similar to those used in other MFSR methods, it is extended to include both spatial and temporal resolution improvements. Here, the input in the image domain is a four-dimensional (4D) LR video \(g({x}_{l},{y}_{l},c,{t}_{l})\) of size \(W\times H\times C\times T\), where \({x}_{l}\in \left[0, W-1\right]\) and \({y}_{l}\in \left[0,H-1\right]\) are spatial pixel coordinates, \(c\in \left[0, C-1\right]\) is the color channel, and \({t}_{l}\in \left[0, T-1\right]\) is the frame number. Here, \(W\) is the frame width, \(H\) is the frame height, \(C\) is the number of color channels (1 for gray-scale and 3 for color videos), and \(T\) is the number of frames. The output would be a 4D HR video \(f\left({x}_{h},{y}_{h},c,{t}_{h}\right)\) of size \(rW\times rH\times C\times sT\), where \(r\) and \(s\) are the scaling (upsampling) factors for space and time domains. So the frame dimensions and the frame rate of the input video would increase by factors of \(r\) and \(s\), respectively.

For simplicity, we represent our formulation in vector–matrix notation, where the input and output videos are vectors in lexicographical order, of sizes \(WHCT\times 1\) and \({r}^{2}sWHCT\times 1\), shown in bold lower-case letters as \({\varvec{g}}\) and \({\varvec{f}}\), respectively. The \(i\)th frame of the output video \({{\varvec{f}}}_{i}\) (of size \({r}^{2}WHC\times 1\)) is estimated from the input frames \(\left\{{{\varvec{g}}}_{j-a}, \dots ,{{\varvec{g}}}_{j},\dots , {{\varvec{g}}}_{j+b}\right\}\), where \(j = \left\lfloor {i/s} \right\rfloor\). In other words, our proposed framework combines \(a+b+1\) adjacent frames of the input LR video around the center frame \({{\varvec{g}}}_{j}\) (of size \(WHC\times 1\)) to reconstruct the output frame \({{\varvec{f}}}_{i}\), where \(a\) and \(b\) are the numbers of adjacent LR frames in the backward and forward directions, respectively. We use the following linear imaging model to relate the \(i\)th HR frame to the \(k\)th LR frame:

$$ {\varvec{g}}_{k} = {\varvec{DB}}_{k} {\varvec{M}}_{j,k} {\varvec{f}}_{i} + {\varvec{n}}_{k} ,\;j = \left\lfloor {i/s} \right\rfloor,\;k \in \left[ {j - a,\;j + b} \right] $$
(1)

where \({{\varvec{M}}}_{j,k}\) is the motion matrix that models warping (registration) from \({{\varvec{f}}}_{i}\) to \({{\varvec{g}}}_{k}\), \({{\varvec{B}}}_{k}\) is the spatio-temporal blur matrix, \({\varvec{D}}\) is the spatio-temporal downsampling matrix, and \({{\varvec{n}}}_{k}\) is the noise vector. According to this model, an HR frame \({{\varvec{f}}}_{i}\) is warped, blurred, and downsampled in both space and time, and noise is then added to form an LR frame \({{\varvec{g}}}_{k}\). The motion matrix represents the movement of the scene's objects and the camera between two frames. The 3D blur kernel is the overall effect of the camera's and objects' movements, defocus, depth of field, optical and sensor blurs, and exposure time. Downsampling is the outcome of capturing the scene at discrete spatial positions (pixels) and temporal timestamps, as dictated by the camera's frame resolution and frame rate.
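
For concreteness, the following minimal NumPy/SciPy sketch simulates one LR observation according to (1). The dense-flow warp, the Gaussian blur, and the plain decimation are illustrative stand-ins for \({{\varvec{M}}}_{j,k}\), \({{\varvec{B}}}_{k}\), and \({\varvec{D}}\) (temporal blur is omitted), and the function and parameter names are ours, not the paper's.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def warp(frame, flow):
    """Backward-warp a 2D frame with a dense flow field flow = (dy, dx), bilinear sampling."""
    h, w = frame.shape
    yy, xx = np.mgrid[0:h, 0:w].astype(float)
    return map_coordinates(frame, [yy + flow[0], xx + flow[1]], order=1, mode='nearest')

def forward_model(f_hr, flow, r=2, blur_sigma=1.0, noise_std=0.01, rng=None):
    """Simulate one LR observation g_k = D B_k M_{j,k} f_i + n_k as in (1)."""
    rng = np.random.default_rng(0) if rng is None else rng
    warped = warp(f_hr, flow)                        # M_{j,k}: register f_i to frame k
    blurred = gaussian_filter(warped, blur_sigma)    # B_k: spatial blur (temporal blur omitted)
    decimated = blurred[::r, ::r]                    # D: spatial decimation by the factor r
    return decimated + rng.normal(0.0, noise_std, decimated.shape)   # n_k: white Gaussian noise
```

For example, calling forward_model with a zero flow field returns a blurred, decimated, and noisy copy of the HR frame.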

2.2 Proposed STSR framework

We use a maximum a posteriori (MAP) framework to estimate an HR frame given a few neighboring LR frames:

$$ {\varvec{f}}_{i} = \arg \mathop {\max }\limits_{{{\varvec{f}}_{i} }} \mathop \prod \limits_{k} \Pr \left( {{\varvec{f}}_{i} {|}{\varvec{g}}_{k} } \right),\;j = \left\lfloor {i/s} \right\rfloor,\;k \in \left[ {j - a,\;j + b} \right] $$
(2)

Using Bayes' rule, this can alternatively be written as:

$${{\varvec{f}}}_{i}=\mathrm{arg}\underset{{{\varvec{f}}}_{i}}{\mathrm{max}}\prod_{k}\frac{\mathrm{Pr}\left({{\varvec{g}}}_{k}|{{\varvec{f}}}_{i}\right)\mathrm{Pr}\left({{\varvec{f}}}_{i}\right)}{\mathrm{Pr}\left({{\varvec{g}}}_{k}\right)}$$
(3)

where \(\mathrm{Pr}\left({{\varvec{g}}}_{k}|{{\varvec{f}}}_{i}\right)\) is the likelihood (a.k.a. data fidelity or data fusion term), \(\mathrm{Pr}\left({{\varvec{f}}}_{i}\right)\) is the prior on the HR frame (a.k.a. regularization term), and \(\mathrm{Pr}\left({{\varvec{g}}}_{k}\right)\) is the evidence of the LR frame. The denominator in (3) can be ignored because it is not a function of \({{\varvec{f}}}_{i}\). Moreover, since the densities in the numerator have exponential forms, it is simpler to equivalently minimize the negative log of the functional in (3). This yields:

$${{\varvec{f}}}_{i}=\mathrm{arg}\underset{{{\varvec{f}}}_{i}}{\mathrm{min}}\left\{\sum_{k}-\mathrm{log}\left[\mathrm{Pr}\left({{\varvec{g}}}_{k}|{{\varvec{f}}}_{i}\right)\right]-\mathrm{log}\left[\mathrm{Pr}\left({{\varvec{f}}}_{i}\right)\right]\right\}$$
(4)

Assuming the noise to be white Gaussian, \(-\mathrm{log}\left[\mathrm{Pr}\left({{\varvec{g}}}_{k}|{{\varvec{f}}}_{i}\right)\right]\) in (4) is proportional to the energy of the noise, i.e. \({\Vert {\varvec{D}}{{\varvec{B}}}_{k}{{\varvec{M}}}_{j,k}{{\varvec{f}}}_{i}-{{\varvec{g}}}_{k}\Vert }_{2}^{2}\), which is the sum of squared differences (SSD) between the simulated and observed LR frames. The operator \({\Vert \cdot \Vert }_{2}^{2}\) denotes the square of norm-2, which is defined for a vector \({\varvec{A}}\) with elements \({a}_{i}\) as \({\Vert {\varvec{A}}\Vert }_{2}^{2}={{\varvec{A}}}^{T}{\varvec{A}}=\sum {a}_{i}^{2}\), where \({{\varvec{A}}}^{T}\) is the transpose of \({\varvec{A}}\).

Natural HR images are not globally smooth but mostly piecewise-smooth, as they consist of smooth regions surrounded by sharp edges. An appropriate regularization term based on this observation should penalize high-energy variations in the reconstructed frame much less than norm-2 while still suppressing smaller variations (noise) effectively, so that sharp edges can appear in the estimated frame. One example is \({\Vert {\varvec{H}}{{\varvec{f}}}_{{\varvec{i}}}\Vert }_{1}\) for a high-pass operator \({\varvec{H}}\), where \({\Vert \cdot \Vert }_{1}\) denotes norm-1 (defined as \({\Vert {\varvec{A}}\Vert }_{1}=\sum \left|{a}_{i}\right|\)). In our framework, we chose the following form for the regularization term: \({\Vert \nabla {{\varvec{f}}}_{i}\Vert }_{1}={\Vert {{\varvec{H}}}_{h}{{\varvec{f}}}_{i}\Vert }_{1}+{\Vert {{\varvec{H}}}_{v}{{\varvec{f}}}_{i}\Vert }_{1}\), where \(\nabla \) is the gradient operator, and \({{\varvec{H}}}_{h}\) and \({{\varvec{H}}}_{v}\) are first-order derivatives (FODs) in the horizontal and vertical directions, respectively. Therefore, the following optimization framework stems from (4):

$$ \varvec{f}_{i} = \arg \mathop {\min }\limits_{{\varvec{f}_{i} }} \left\{ {\mathop \sum \limits_{{\begin{array}{*{20}c} {k = j - a} \\ {j = \left\lfloor {i/s} \right\rfloor } \\ \end{array} }}^{{j + b}} \left\| {\varvec{DB}_{k} \varvec{M}_{{j,k}} \varvec{f}_{i} - \varvec{g}_{k} } \right\|_{2}^{2} + \lambda \left\| {~\varvec{H}_{h} \varvec{f}_{i} } \right\|_{1} + \lambda \left\| {\varvec{H}_{v} \varvec{f}_{i} } \right\|_{1} } \right\} $$
(5)
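
As a sanity check on the formulation, the value of the functional in (5) can be evaluated directly for a candidate frame. The sketch below reuses forward_model from the earlier sketch (with noise disabled) and implements \({{\varvec{H}}}_{h}\) and \({{\varvec{H}}}_{v}\) as plain forward differences, which is one possible (assumed) choice of FODs.

```python
def objective_eq5(f_i, lr_frames, flows, lam=0.01, r=2, blur_sigma=1.0):
    """Value of the functional in (5) for a candidate HR frame f_i."""
    # Data-fidelity term: SSD between simulated and observed LR frames, summed over k
    data = sum(np.sum((forward_model(f_i, flow, r, blur_sigma, noise_std=0.0) - g_k) ** 2)
               for g_k, flow in zip(lr_frames, flows))
    # Regularization term: norm-1 of horizontal and vertical first-order derivatives
    reg = np.abs(np.diff(f_i, axis=1)).sum() + np.abs(np.diff(f_i, axis=0)).sum()
    return data + lam * reg
```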

To improve the optimization framework, we apply a few modifications to the functional in (5).

Remark 1

The vulnerability of the functional in (5) to motion estimation errors can be reduced by applying the adaptive weighting operator \({{\varvec{O}}}_{k}\) defined in (6) to the norm-2 function in the fidelity term, i.e. \({\Vert {{\varvec{O}}}_{k}\left({{\varvec{D}}{\varvec{B}}}_{k}{{\varvec{M}}}_{j,k}{{\varvec{f}}}_{i}-{{\varvec{g}}}_{k}\right)\Vert }_{2}^{2}\). The operator \({{\varvec{O}}}_{k}\) is a diagonal matrix that assigns smaller weights to outlier pixels.

$${{\varvec{O}}}_{k}=\mathrm{diag}\left(\mathrm{exp}\left\{-\frac{{\Vert {{\varvec{D}}{\varvec{B}}}_{k}{{\varvec{M}}}_{j,k}{{\varvec{f}}}_{i}-{{\varvec{Y}}}_{k}\Vert }_{1}}{\sigma }\right\}\right)$$
(6)

The matrix \({{\varvec{O}}}_{k}\) assigns a lower weight to the pixels in the \(k\)th LR frame that have a higher deviation from the central LR frame, lessening the contribution of those pixels in estimating \({{\varvec{f}}}_{i}\). The scalar parameter \(\sigma \) in (6) determines the decay speed of the exponential function.
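
A possible implementation of the weighting in (6) is sketched below; it forms the per-pixel diagonal of \({{\varvec{O}}}_{k}\) from the norm-1 residual between the simulated LR frame \({{\varvec{D}}{\varvec{B}}}_{k}{{\varvec{M}}}_{j,k}{{\varvec{f}}}_{i}\) and an LR reference frame. Which frame plays the role of the reference in (6) is an assumption of this sketch, as is the default value of sigma.

```python
def adaptive_weights(simulated_lr, reference_lr, sigma=0.05):
    """Diagonal of O_k in (6): exp(-|residual| / sigma), so outlier pixels get smaller weights."""
    residual = np.abs(simulated_lr - reference_lr)   # element-wise norm-1 residual
    return np.exp(-residual / sigma)                 # same shape as the LR frame
```

The returned weights can then be multiplied element-wise into the fidelity residual before it is squared.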

Remark 2

Replacing the norm-2 function of the fidelity term in (5) with \({\Vert {{\varvec{D}}{\varvec{B}}}_{k}{{\varvec{M}}}_{j,k}\left({\varvec{I}}-\beta {\varvec{S}}\right){{\varvec{f}}}_{i}-{{\varvec{g}}}_{k}\Vert }_{2}^{2}\) boosts the sharpness of the estimated frame \({{\varvec{f}}}_{i}\), where \({\varvec{I}}\) is the identity matrix, \({\varvec{S}}\) is a high-pass filter operator, and \(\beta \) is a scalar that controls the amount of sharpening.

According to the unsharp masking technique [31], an edge-sharpened frame \({\widehat{{\varvec{f}}}}_{i}\) can be obtained from a frame \({{\varvec{f}}}_{i}\) by adding its high-pass filtered version to it, i.e. \({\widehat{{\varvec{f}}}}_{i}={{\varvec{f}}}_{i}+\beta {\varvec{S}}{{\varvec{f}}}_{i}\). Consequently, by replacing \({{\varvec{f}}}_{i}\) in the likelihood term with \({{\varvec{f}}}_{i}-\beta {\varvec{S}}{{\varvec{f}}}_{i}=\left({\varvec{I}}-\beta {\varvec{S}}\right){{\varvec{f}}}_{i}\), a sharper image is obtained. We do not need to apply this modification to \({{\varvec{f}}}_{i}\) in the regularization term, since a high-pass filtering operation already exists there. Our experiments show that the optimization problem modified using this technique converges as fast as the original one.
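
The sharpening operator of Remark 2 can be sketched as follows, taking \({\varvec{S}}\) to be a Gaussian high-pass filter (identity minus a Gaussian low-pass); this specific choice of \({\varvec{S}}\) and the default \(\beta \) are assumptions of the sketch.

```python
from scipy.ndimage import gaussian_filter

def apply_I_minus_beta_S(f_i, beta=0.2, sigma=1.0):
    """Apply (I - beta*S) to a frame, with S a Gaussian high-pass filter (Remark 2)."""
    s_f = f_i - gaussian_filter(f_i, sigma)   # S f_i: high-pass filtered frame
    return f_i - beta * s_f                   # (I - beta*S) f_i

def unsharp_mask(f_i, beta=0.2, sigma=1.0):
    """Classic unsharp masking [31]: f_hat = f_i + beta * S f_i."""
    return f_i + beta * (f_i - gaussian_filter(f_i, sigma))
```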

Remark 3

The temporal consistency of the estimated video is improved by adding the term \({\Vert {{\varvec{f}}}_{i}- {{\varvec{M}}}_{i-1,i}{{\varvec{f}}}_{i-1}\Vert }_{2}^{2}\) to the functional in (5), which minimizes the error between each estimated frame \({{\varvec{f}}}_{i}\) and its motion-compensated previous estimated frame \({{\varvec{f}}}_{i-1}\).
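
The temporal-consistency penalty of Remark 3 is straightforward to evaluate once the previous reconstructed frame has been motion-compensated; the sketch below reuses warp from the imaging-model sketch, and the flow argument is assumed to map frame \(i-1\) onto frame \(i\).

```python
def temporal_term(f_i, f_prev, flow_prev_to_cur, gamma=0.1):
    """gamma * || f_i - M_{i-1,i} f_{i-1} ||_2^2 from Remark 3."""
    compensated = warp(f_prev, flow_prev_to_cur)   # M_{i-1,i} f_{i-1}
    return gamma * np.sum((f_i - compensated) ** 2)
```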

The modified optimization framework using the above propositions is obtained as:

$$ \begin{gathered} \varvec{f}_{i} = \arg \mathop {\min }\limits_{{\varvec{f}_{i} }} \left\{ {\mathop \sum \limits_{{\begin{array}{*{20}c} {k = j - a} \\ {j = \left\lfloor {i/s} \right\rfloor } \\ \end{array} }}^{{j + b}} \left\| {\varvec{O}_{k} \left( {\varvec{DB}_{k} \varvec{M}_{{j,k}} \left( {\varvec{I} - \beta \varvec{S}} \right)\varvec{f}_{i} - \varvec{g}_{k} } \right)} \right\|_{2}^{2} } \right. \hfill \\ \quad \quad \quad \quad \quad \quad \left. { + \lambda \left\| {\varvec{H}_{h} \varvec{f}_{i} } \right\|_{1} + \lambda \left\| {\varvec{H}_{v} \varvec{f}_{i} } \right\|_{1} + \gamma \left\| {\varvec{f}_{i} - ~\varvec{M}_{{i - 1,i}} \varvec{f}_{{i - 1}} } \right\|_{2}^{2} } \right\} \hfill \\ \end{gathered} $$
(7)

The optimization problem in (7) is convex but non-quadratic and can be solved using the iteratively reweighted least-squares (IRLS) method [32]. IRLS solves (7) iteratively, where each step comprises solving a weighted least-squares problem. If \({{\varvec{f}}}_{i}^{\left(n\right)}\) is the \(i\)th HR frame to be estimated at the \(n\)th iteration of IRLS, then \({\Vert {\varvec{H}}{{\varvec{f}}}_{i}^{\left(n\right)}\Vert }_{1}\) [\({\varvec{H}}\) stands for \({{\varvec{H}}}_{h}\) or \({{\varvec{H}}}_{v}\) in (7)] can be replaced by \({\left({\varvec{H}}{{\varvec{f}}}_{i}^{\left(n\right)}\right)}^{T}{{\varvec{W}}}_{i}^{\left(n-1\right)}\left({\varvec{H}}{{\varvec{f}}}_{i}^{\left(n\right)}\right)\) where \({{\varvec{W}}}_{i}^{\left(n-1\right)}=\mathrm{diag}{\left(\left|{\varvec{H}}{{\varvec{f}}}_{i}^{\left(n-1\right)}\right|\right)}^{-1}\). To prevent division by zero, zero elements of \({\varvec{H}}{{\varvec{f}}}_{i}^{\left(n-1\right)}\) are replaced with a small number \(\epsilon \) (e.g. \(0.01\)).
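
The IRLS reweighting described above amounts to the following per-iteration update, again with forward differences as the assumed FODs; eps corresponds to the small constant \(\epsilon \) used to avoid division by zero.

```python
def irls_weights(f_prev_iter, eps=0.01):
    """Diagonals of W_{h,i} and W_{v,i}: 1 / |H f^(n-1)|, with the magnitude floored at eps."""
    w_h = 1.0 / np.maximum(np.abs(np.diff(f_prev_iter, axis=1)), eps)   # horizontal FOD weights
    w_v = 1.0 / np.maximum(np.abs(np.diff(f_prev_iter, axis=0)), eps)   # vertical FOD weights
    return w_h, w_v
```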

Remark 4

Using IRLS, the functional in (7) results in the following linear equation, where \({{\varvec{A}}}_{i,k}^{\left(n-1\right)}={{\varvec{O}}}_{k}^{\left(n-1\right)}{{\varvec{D}}{\varvec{B}}}_{k}{{\varvec{M}}}_{j,k}\left({\varvec{I}}-\beta {\varvec{S}}\right)\).

$$ \begin{aligned}&\left( {\mathop \sum \limits_{k} {\varvec{A}}_{i,k}^{{\left( {n - 1} \right)T}} {\varvec{A}}_{i,k}^{{\left( {n - 1} \right)}} + \lambda {\varvec{H}}_{h}^{ T} {\varvec{W}}_{h,i}^{{\left( {n - 1} \right)}} {\varvec{H}}_{h} + \lambda {\varvec{H}}_{v}^{ T} {\varvec{W}}_{v,i}^{{\left( {n - 1} \right)}} {\varvec{H}}_{v} + \gamma {\varvec{I}}} \right) \\ &\quad{\varvec{f}}_{i}^{\left( n \right)} = \mathop \sum \limits_{k} {\varvec{A}}_{i,k}^{{\left( {n - 1} \right)T}} {\varvec{g}}_{k} + \gamma {\varvec{M}}_{i - 1,i} {\varvec{f}}_{i - 1}\end{aligned} $$
(8)

The equation in (8) can easily be derived by replacing the norm-1 terms in (7) with their equivalent IRLS forms for the \(n\)th iteration, taking the derivative of (7) with respect to \({{\varvec{f}}}_{i}^{\left(n\right)}\), and setting the derivative to zero. IRLS alternates between solving the least-squares problem in (8) using an iterative method such as Conjugate Gradient [33] and estimating the \({{\varvec{A}}}_{i,k}\) and \({{\varvec{W}}}_{i}\) matrices based on the value of \({{\varvec{f}}}_{i}^{\left(n-1\right)}\). The advantage of using Conjugate Gradient to solve (8) at the \(n\)th iteration of IRLS is that the matrix on the left-hand side of (8) does not require explicit calculation, since it can be decomposed into a set of filtering and weighting operations.
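
This matrix-free property can be illustrated with SciPy's conjugate-gradient solver: the left-hand side of (8) is applied as a chain of warping, blurring, decimation, and weighting operations. The sketch reuses warp, forward_model, gaussian_filter, and apply_I_minus_beta_S from the earlier sketches; the adjoints are rough approximations (the Gaussian blur and sharpening operators are treated as self-adjoint, the decimation adjoint as zero insertion, and the warp adjoint as the inverse warp), so it shows the structure of the solver rather than the paper's exact operators.

```python
from scipy.sparse.linalg import LinearOperator, cg

def solve_eq8(f_shape, lr_frames, flows, o_weights, w_h, w_v, f_prev_warped,
              lam=0.01, gamma=0.1, beta=0.2, r=2, blur_sigma=1.0):
    """One IRLS step: solve the linear system (8) for f_i^(n) with matrix-free CG."""
    n = int(np.prod(f_shape))

    def A(f, flow):                                  # D B_k M_{j,k} (I - beta*S) f; O_k applied separately
        return forward_model(apply_I_minus_beta_S(f, beta), flow, r, blur_sigma, noise_std=0.0)

    def A_T(g_lr, flow):                             # approximate adjoint of A (without O_k)
        up = np.zeros(f_shape); up[::r, ::r] = g_lr  # D^T: zero-insertion upsampling
        b = gaussian_filter(up, blur_sigma)          # B^T approximated by B (symmetric kernel)
        m = warp(b, (-flow[0], -flow[1]))            # M^T approximated by the inverse warp
        return apply_I_minus_beta_S(m, beta)         # (I - beta*S)^T approximated by itself

    def grad_T_h(y):                                 # adjoint of the horizontal forward difference
        out = np.zeros(f_shape); out[:, :-1] -= y; out[:, 1:] += y; return out

    def grad_T_v(y):                                 # adjoint of the vertical forward difference
        out = np.zeros(f_shape); out[:-1, :] -= y; out[1:, :] += y; return out

    def matvec(x):                                   # left-hand side of (8) applied to a vectorized frame
        f = x.reshape(f_shape)
        out = gamma * f                              # gamma * I (temporal-consistency term)
        for flow, o in zip(flows, o_weights):
            out += A_T(o * (o * A(f, flow)), flow)   # A_{i,k}^T A_{i,k} f, with diagonal O_k
        out += lam * grad_T_h(w_h * np.diff(f, axis=1))   # lambda * H_h^T W_{h,i} H_h f
        out += lam * grad_T_v(w_v * np.diff(f, axis=0))   # lambda * H_v^T W_{v,i} H_v f
        return out.ravel()

    rhs = gamma * f_prev_warped                      # gamma * M_{i-1,i} f_{i-1}
    for g_k, flow, o in zip(lr_frames, flows, o_weights):
        rhs += A_T(o * g_k, flow)                    # A_{i,k}^T g_k
    x, _ = cg(LinearOperator((n, n), matvec=matvec), rhs.ravel(), maxiter=50)
    return x.reshape(f_shape)
```

Here o_weights holds the diagonals of \({{\varvec{O}}}_{k}\), while w_h and w_v are the IRLS weights computed from the previous iterate and kept fixed during the solve.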

2.3 Motion estimation

For spatial-only SR (no temporal upsampling), motion estimation can be performed using either a central or a sequential scheme [5]. In the central scheme, motion is estimated directly between the current frame and its adjacent frames (Fig. 1a). In the sequential scheme, the motion is first estimated between each adjacent frame and its previous frame (Fig. 1b); the central motion is then obtained from the sequentially estimated motions. While the former approach provides better accuracy, the latter has significantly lower computational complexity, since only one motion is estimated from/to each frame when estimating multiple HR frames using SR. However, for spatio-temporal SR (STSR), the central scheme cannot be used, since we do not have proper initial estimates for the frames missing in the input LR video (due to the different temporal resolutions) before the motion is estimated.

Fig. 1 Two schemes for estimating motion between the frames in SR. a Central scheme. b Sequential scheme

We extend the model employed in [5] to estimate motion for our proposed STSR method using a sequential scheme. In Fig. 2a, the solid circles correspond to the frame positions available in the input LR video, and the empty circles correspond to the frame positions missing from the input LR video due to its lower temporal resolution. For every frame position (solid or empty), we aim to estimate the motion from that position to all its neighboring positions using the following procedure:

1. Upsample each LR frame \({{\varvec{g}}}_{j}\) individually via interpolation (e.g. using Bilinear or Bicubic methods) to obtain the upsampled frame \({{\varvec{z}}}_{j}\).

2. Estimate motion sequentially from each upsampled LR frame to its previous upsampled LR frame, i.e. \({{\varvec{M}}}_{j,j-1}\) (Fig. 2b). Hence \({{\varvec{z}}}_{j-1}={{\varvec{M}}}_{j,j-1}{{\varvec{z}}}_{j}\).

3. Obtain the motion between adjacent HR frame positions (Fig. 2c) from the motion between adjacent LR frame positions as \({{\varvec{M}}}_{i,i-1}={{\varvec{M}}}_{i,i-s} /s\). This conversion assumes that the motion between two consecutive solid circles is distributed linearly over the intermediate empty circles.

4. Obtain the motion from the current frame position to all its adjacent frame positions using (9) (sequential-to-central motion conversion); a code sketch of this step is given after the list. For instance, in Fig. 2d, the current HR frame position is \(i=7\), and the numbers of forward and backward adjacent frames are chosen such that we get two neighboring LR frame positions (black dots) in each direction, which yields \(a=4\) and \(b=5\).

    $$ \varvec{M}_{{i,k}} = \left\{ {\begin{array}{*{20}l} {\mathop \sum \limits_{{j = k + 1}}^{i} \varvec{M}_{{j,j - 1}} } \hfill & {i - a \le k < i} \hfill \\ 0 \hfill & {k = i} \hfill \\ { - \mathop \sum \limits_{{j = i + 1}}^{k} \varvec{M}_{{j,j - 1}} ~} \hfill & {i < k \le i + b} \hfill \\ \end{array} } \right. $$
    (9)
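
A sketch of step 4 and of (9) is given below. Motions are represented here as dense displacement fields (arrays of shape (2, H, W)) so that composition reduces to addition, matching the summation in (9); step_flows is assumed to be indexed by HR frame position and to hold the per-HR-step motions \({{\varvec{M}}}_{j,j-1}\) obtained in step 3 by dividing each LR step motion by \(s\).

```python
def central_motions(step_flows, i, a, b):
    """Sequential-to-central conversion of (9): motions M_{i,k} from the current HR
    position i to every neighboring HR position k in [i-a, i+b]."""
    motions = {i: np.zeros_like(step_flows[i])}          # M_{i,i} = 0
    for k in range(i - 1, i - a - 1, -1):                # backward neighbors, i-a <= k < i
        motions[k] = motions[k + 1] + step_flows[k + 1]  # sum_{j=k+1}^{i} M_{j,j-1}
    for k in range(i + 1, i + b + 1):                    # forward neighbors, i < k <= i+b
        motions[k] = motions[k - 1] - step_flows[k]      # -sum_{j=i+1}^{k} M_{j,j-1}
    return motions
```

For the example of Fig. 2d, central_motions(step_flows, i=7, a=4, b=5) returns the motions from position 7 to positions 3 through 12.
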
Fig. 2 Steps to estimate motion for the current frame (\(i=7\))

2.4 Initial estimate and color handling

A reasonable initial estimate for the HR frames, i.e. \({{\varvec{f}}}_{i}^{\left(0\right)}\), is essential because it helps the SR algorithm reach the final solution in fewer iterations. Also, due to the SR problem's ill-posedness, multiple solutions may exist that minimize the optimization functional, so different initial estimates may result in different solutions. We use a multi-frame non-uniform interpolation method followed by a single-frame deblurring step to obtain an initial estimate for our MFSR problem. Rather than using the imaging model in (1), we use a suboptimal model obtained by swapping the motion and blur operators and assuming similar blur kernel and noise characteristics for all frames, which yields:

$$ \varvec{g}_{k} = \varvec{DM}_{{j,k}} \varvec{Bf}_{i}^{{\left( 0 \right)}} + \varvec{n}_{\varvec{k}} = \varvec{DM}_{{j,k}} \varvec{z}_{i} + \varvec{n}_{\varvec{k}} ,\;j = \left\lfloor {i/s} \right\rfloor ,\;k \in \left[ {j - a,\;j + b} \right] $$
(10)

where \({{\varvec{z}}}_{i}={\varvec{B}}{{\varvec{f}}}_{i}^{\left(0\right)}\). An intuitive way to estimate \({{\varvec{z}}}_{i}\) would be:

$${{\varvec{z}}}_{i}=\sum_{k}{{{\varvec{M}}}_{i,k}}^{-1}{{\varvec{D}}}^{-1}{{\varvec{g}}}_{k}=\sum_{k}{{\varvec{M}}}_{k,i}{{\varvec{D}}}^{T}{{\varvec{g}}}_{k}$$
(11)

According to (11), the LR frames are projected onto the HR grid through upsampling and warping. Since the motion vectors have arbitrary values, the projected points may not be uniformly distributed over the HR grid. Therefore, a non-uniform interpolation process is required to estimate the HR grid's pixel values from the projected points. Once \({{\varvec{z}}}_{i}\) is obtained, \({{\varvec{f}}}_{i}^{\left(0\right)}\) can be estimated by removing blur and noise. For this step, we run a few iterations of the proposed STSR framework of Sect. 2.2 with \({\varvec{D}}\) set to the identity matrix \({\varvec{I}}\) (i.e. no upsampling).
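
A crude sketch of the projection in (11) is shown below: each LR pixel is scattered onto the HR grid at its motion-compensated position and overlapping contributions are averaged. Nearest-grid-point accumulation stands in for a proper non-uniform interpolation, hole filling is left out, and the flow representation is the same assumption as in the earlier sketches; the resulting \({{\varvec{z}}}_{i}\) would then be deblurred as described above to obtain \({{\varvec{f}}}_{i}^{\left(0\right)}\).

```python
def project_to_hr_grid(lr_frames, flows_to_hr, r=2):
    """Rough z_i of (11): scatter LR pixels onto the HR grid and average overlaps."""
    h, w = lr_frames[0].shape
    acc = np.zeros((r * h, r * w))
    cnt = np.zeros((r * h, r * w))
    yy, xx = np.mgrid[0:h, 0:w]
    for g_k, flow in zip(lr_frames, flows_to_hr):        # flow: HR-grid displacement per LR pixel
        ty = np.clip(np.rint(r * yy + flow[0]).astype(int), 0, r * h - 1)
        tx = np.clip(np.rint(r * xx + flow[1]).astype(int), 0, r * w - 1)
        np.add.at(acc, (ty, tx), g_k)                    # M_{k,i} D^T g_k: project onto the HR grid
        np.add.at(cnt, (ty, tx), 1.0)
    return np.divide(acc, cnt, out=np.zeros_like(acc), where=cnt > 0)
```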

If the input video is in the RGB color space, SR must be applied to all three red, green, and blue color channels, since they need to have the same resolution. However, since the human visual system (HVS) is less sensitive to color than to luminance (gray level), a more efficient approach is to decorrelate luminance from color and apply SR only to the luminance channel. This is done in video coding using the YCbCr color space, where Y expresses luminance and Cb and Cr convey color information [5]. Following this approach, we apply SR only to the Y channel and upsample the Cb and Cr channels via Bicubic interpolation.
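
A sketch of this color-handling strategy is shown below, using scikit-image for the RGB/YCbCr conversion and cubic spline interpolation (scipy.ndimage.zoom with order 3) as a stand-in for Bicubic; super_resolve_y is a hypothetical placeholder for the proposed STSR applied to the luminance channel.

```python
from skimage.color import rgb2ycbcr, ycbcr2rgb
from scipy.ndimage import zoom

def color_sr(rgb_lr, super_resolve_y, r=2):
    """Super-resolve the Y channel only; upsample Cb and Cr with cubic interpolation."""
    ycbcr = rgb2ycbcr(rgb_lr)                    # decorrelate luminance from chrominance
    y_hr = super_resolve_y(ycbcr[..., 0])        # SR applied to the luminance channel
    cb_hr = zoom(ycbcr[..., 1], r, order=3)      # cubic upsampling of Cb
    cr_hr = zoom(ycbcr[..., 2], r, order=3)      # cubic upsampling of Cr
    return ycbcr2rgb(np.stack([y_hr, cb_hr, cr_hr], axis=-1))
```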

3 Experimental results

Unlike most state-of-the-art (SOTA) SR methods, our proposed method does not include a training step. As the first set of experiments to test our method, we use the Vid4 [4] and SPMCs-30 [34] benchmark datasets. Vid4, which is used in most publications, contains four video sequences (City, Calendar, Foliage, and Walk) of slightly different sizes close to \(720\times 576\), each of which has at least 34 frames. SPMCs-30 contains 30 video sequences of dynamic scenes, each of which has 31 frames of size \(960\times 540\). Our proposed method is compared with SOTA SFSR and MFSR methods, including VSRnet [13], VESCPN [14], DBPN [15], RDN [16], RCAN [17], TOFlow [12], and TDAN [18], for 4X spatial upsampling, similar to [18]. Table 1 shows the quantitative comparison using the PSNR (in dB) and SSIM [35] quality metrics. The results on SPMCs-30 are not reported for VSRnet [13] and VESCPN [14] since their source code or reconstructed frames are not publicly available. The visual comparisons of the different methods on the Vid4 and SPMCs-30 datasets are shown in Figs. 3 and 4, respectively. Our proposed method demonstrates similar or better results than those SOTA methods.
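
For reference, the quantitative comparison can be reproduced with the standard scikit-image implementations of the two metrics; whether the metrics are computed on the Y channel or on RGB frames follows the evaluation protocol of [18] and is not fixed by this sketch.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_sequence(sr_frames, gt_frames, data_range=1.0):
    """Average PSNR (dB) and SSIM over a reconstructed sequence."""
    psnr = np.mean([peak_signal_noise_ratio(gt, sr, data_range=data_range)
                    for gt, sr in zip(gt_frames, sr_frames)])
    ssim = np.mean([structural_similarity(gt, sr, data_range=data_range)
                    for gt, sr in zip(gt_frames, sr_frames)])
    return float(psnr), float(ssim)
```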

Table 1 Quantitative comparison of our proposed method with SOTA methods in terms of PSNR (dB) and SSIM quality metrics
Fig. 3 Comparison results of our proposed method with SOTA methods on the Vid4 dataset

Fig. 4 Comparison results of our proposed method with SOTA methods on the SPMCs-30 dataset

Figure 5 provides quantitative comparisons between our proposed method and Bicubic interpolation. Figure 5a shows the PSNR of our proposed SR method versus Bicubic with respect to the standard deviation (\(\sigma \)) of the Gaussian blur. The maximum PSNR values for both SR and Bicubic are obtained for \(\sigma \) in the range \(\left[0.6, 1\right]\). In this range, SR shows an average PSNR gain of \(7.4 \mathrm{dB}\) over Bicubic interpolation, which is a substantial improvement. A blur function with small support (\(\sigma \in \left[0.6, 1\right]\)) is effective in suppressing noise. However, as \(\sigma \) increases, the LR images become very blurry and SR becomes less effective, as reflected in its PSNR values approaching those of Bicubic.

Fig. 5 Variation of peak signal to noise ratio (PSNR) with respect to the variation of the following SR parameters: a standard deviation (\(\sigma \)) of Gaussian blur; b downsampling/upsampling ratio; c signal to noise ratio (SNR)

Figure 5b shows the variation in the PSNR values of our SR method and Bicubic for different downsampling ratios. An increase in the downsampling ratio results in lower-resolution LR images, making it harder for SR to recover the missing details. Despite a considerable drop in PSNR at a downsampling ratio of 4, our SR method still provides a \(4.4 \mathrm{dB}\) improvement over Bicubic. Figure 5c demonstrates the PSNR variation for different noise powers, i.e. SNR values. For higher noise levels, we increase the regularization parameter \(\lambda \) in (8) to increase the smoothness of the reconstructed frame. This figure shows that SR has a larger PSNR gain over Bicubic at higher SNR values.

Figure 6 shows another example of improving the spatial resolution, this time of traffic camera footage, using the proposed method compared to Bicubic. Due to the large distance between the scene and the camera, the target object is noisy and has a low resolution. Therefore, it is hard to read the plate number using Bicubic upsampling. Our proposed method, however, significantly improves the image quality.

Fig. 6 Improving the quality of a traffic camera video. a One video frame; b closeup view of the frame upsampled by Bicubic; c improved closeup using our proposed method

The next experiment, shown in Fig. 7, demonstrates the performance of our proposed SR method in removing temporal blur. Motion blur is inherently a temporal artifact, as it arises from fast movements of objects in the scene, or of the camera itself, during the exposure time [25, 36]. When the scene is roughly static and planar, the perceived motion blur is space-invariant (similar for all regions). However, when the scene is highly dynamic during the exposure time, or the camera is moving fast while filming a scene with far- and near-field objects, the perceived motion blur is space-variant. Removing a space-variant motion blur from a single image is highly challenging. It requires segmenting the scene into objects and background, applying different deblurring to different parts of the scene, and finally putting the deblurred objects back together in a coherent way. In contrast, our multi-frame SR method is inherently capable of removing motion blur by applying temporal deblurring, as described below.

Fig. 7 Removing space-variant motion blur for the Old Town Cross video. a Ground-truth frame; b LR frame; c result by [37]; d result by [38]; e result by our method through temporal deblurring; f close-up from images (b) and (e)

Figure 7a shows one frame of the Old Town Cross video, and Fig. 7b demonstrates the LR frame generated by applying a temporal rectangular blur of length 5 (so the exposure time is spread over five frames). Since the scene is not planar and the camera is not moving parallel to the scene, a more severe motion blur occurs on the right side of the frame than on the left. Figure 7c, d show the results of the motion deblurring method proposed in [37] and of the online GAN-based deblurring and upscaling tool [38], respectively. These methods fail to make any noticeable improvement due to the space-variant nature of the perceived motion blur. The result of our proposed method is presented in Fig. 7e, and closeups from the LR and restored frames are shown in Fig. 7f. Our method successfully removes the space-variant blur by improving the video’s temporal resolution.

Our proposed method can also be used to create high-quality slow-motion videos or interpolated views. We tested it on several videos, including building structures, fast-moving clouds, and candle fumes, with the frame rate set to increase by a factor of 15. These results are not shown here due to the difficulty of perceiving the temporal effect in a still manuscript. Please refer to the package accompanying this article for a few examples.

4 Conclusion

Improving the resolution of natural videos using super-resolution (SR) is a highly ill-posed problem. This paper presents a space–time super-resolution method that increases a single video's spatial and temporal resolutions while alleviating aliasing, blurring, and noise artifacts. A new motion estimation framework is proposed to first estimate motion sequentially and then derive the motion between the central and neighboring frames. An initial estimate of the output frame is obtained using a non-uniform interpolation technique to derive the upsampled frame, followed by a deblurring step. An optimization formulation is derived using the maximum a posteriori (MAP) estimator, which estimates a high-resolution (HR) video frame from a few neighboring low-resolution (LR) frames of the input video. We incorporate an edge-sharpening operation into the optimization problem to further enhance the edges. We also improve temporal consistency in the reconstructed video by minimizing the error between successive estimated frames. The results of the proposed method are compared with a few SOTA methods, including deep learning-based ones. The results confirm the effectiveness of the proposed SR method in improving the quality of natural videos.