1 Introduction

Image super-resolution (SR) is a flourishing branch of image processing concerned with the particular issue of image resolution enhancement. Given one or more low-resolution (LR) observations of the same scene, SR reconstructs a visually pleasing high-resolution (HR) output (i.e., one containing more image details) [30], which can be a crucial preprocessing step in a wide range of digital imaging applications such as medical diagnosis, remote sensing, and vehicle license plate recognition [53], to name but a few. This technique thus provides a far more economical and promising way to transcend the inherent limitations of LR optical imaging systems than upgrading sensor manufacturing technology, making it one of the most appealing research areas for image processing experts.

Numerous SR algorithms have been proposed over the last three decades or so. Depending on the domain employed, SR methods can be divided into two families: frequency domain approaches and spatial domain approaches. The initial research on SR [5, 24, 25, 42] belongs to the former. Although frequency domain approaches are theoretically and computationally simple, they apply only to a pure translational motion model and are, by construction, extremely susceptible to model errors, which severely limits their prevalence [33]. For this reason, more and more researchers reconsidered the SR issue in the spatial domain. The diverse spatial domain approaches can in turn be roughly classified into two categories according to the number of LR inputs required, i.e., multiple-image SR and single-image SR. For the multiple-input case, a variety of approaches have been developed from different points of view, such as iterative back projection (IBP) based methods [18, 36], projection onto convex sets (POCS) based methods [1, 37], maximum a posteriori (MAP) estimation based methods [23, 43], regularization based methods [20, 26, 31, 32, 52, 54, 58], etc. However, it has been pointed out that the performance of these methods degrades dramatically when (a) the number of LR inputs is inadequate; (b) the motion estimate is imprecise; or (c) the scale factor increases [2, 21, 27, 45, 46].

The aforementioned limitations can be overcome by exploring the other type of SR, i.e., single image SR (SISR). Clearly, SISR is the extreme case of SR in which there is only one LR observation. Because the input information is so insufficient, most of the proposed methods build on the basic concept of Freeman et al. [22]: the high frequency details lost in an LR image can be predicted, or hallucinated, by learning the co-occurrence relationship between LR patches and their corresponding HR patches extracted from a training image set; hence they are called example based or learning based methods. One group of them is the neighbor embedding (NE) based methods [10, 29], first explored by Chang et al. These methods rest on the machine learning assumption that small patches in LR and HR images form two manifolds lying in two distinct feature spaces but with similar local geometric structures. Thus, the SR output can be estimated by a weighted summation of the K nearest HR neighbors found in the corresponding HR training database. However, the heavy running time of this approach was largely ignored, despite being very important for real applications. Timofte et al. [39-41] incorporated the idea of learning and precomputing a set of sparse dictionaries [55] (projection matrices) into the basic NE framework to accelerate its computation.

Another group of learning based methods, the one most relevant to this paper, derives from the theory of sparse representation, which holds that most natural images are in fact sparse or compressible when represented in a proper basis [7]. The representative work on sparse representation based SISR was proposed by Yang et al. [49, 50]. In their papers, a universal pair of HR and LR dictionaries is first learned from raw patches extracted randomly from training images; then a sparse coding process is applied, in raster-scan order, to the overlapping patches sampled from the input LR image to obtain the sparse coefficients. Finally, the SR output is recovered by averaging all the overlapping HR patches produced by the product of the HR dictionary and the coefficients. This scheme was shown to achieve state-of-the-art results at the time, but obtaining two large dictionaries by random sampling is very time-consuming. Hence, Zeyde et al. [55] put forward an improved method in which dimensionality reduction is first applied to the raw patches to accelerate the subsequent dictionary learning. To exploit the priors of the training patches, clustering techniques were also introduced into the framework of sparse representation based SISR. For instance, Yang et al. [51] and Dong et al. [12] both use the K-means algorithm to cluster the raw training patches into dozens of groups, from which multiple dictionaries can be learned. The superiority of their multiple dictionaries over a universal dictionary is experimentally validated.

Even though dictionary learning is an important part of sparse representation based SISR, and the progress above was indeed made by studying it, more recent work [13, 14, 28, 34, 35, 47, 48] indicates that the accuracy of the sparse coefficients produced in the sparse coding process matters even more to SISR performance. However, due to the complexity of the image degeneration model, recovering the ideal coefficients as precisely as possible remains a challenging task. Several pioneering studies have already addressed this aspect; for example, Peleg et al. [34] suggested a statistical prediction model in which a more accurate set of HR coefficients is predicted from the corresponding LR ones via a minimum mean-square error estimator. Moreover, Dong et al. [14] proposed a nonlocally centralized sparse representation (NCSR) model in which the calculated coefficients are additionally centralized to a set of good estimates obtained by exploiting the nonlocal similarity within the observed image. Both models improve greatly as a result, and the latter has even provided the leading SISR performance to date.

However, in light of the discovery [57] that similarities exist not only among columns but also among rows when a cluster of similar image patches is arranged in matrix form, we believe the capacity of the NCSR model is limited, as it considers the column similarity only while ignoring that among rows. Thus, in this paper we propose an enhanced SISR model based on bidirectionally aligned sparse representation (BASR). In our model, a pair of regularization terms is first created by exploiting both the column and row similarities (i.e., bidirectional similarities). Then, after the sparse coding process, the roughly calculated sparse coefficients are simultaneously aligned to this pair of terms in order to compensate for the errors caused by image noise and degradation, and consequently to increase the accuracy of the sparse coefficients and the SISR performance. Furthermore, for more rapid convergence, the fast iterative shrinkage-thresholding algorithm (FISTA) [3] is adopted in this paper instead of the conventional iterative thresholding algorithm (ITA) [11]. Extensive experiments demonstrate that the proposed BASR model outperforms its recent counterparts in terms of both visual quality and numerical evaluations.

The rest of this paper is organized as follows: Section 2 formalizes the sparse representation based SISR problem. Section 3 presents the proposed BASR model for the SISR issue and its implementation in detail. Experimental results and analysis are given in Section 4, while Section 5 concludes the paper.

2 Problem formulation

The goal of SISR can be regarded as recovering the underlying HR image as precisely as possible from only one LR input. For a comprehensive analysis, the first step is to set up a suitable single image degradation model that relates the original HR image to the observed, degraded LR image.

Assume that X is an ideal HR image, while Y is the corresponding LR image of the same scene. Both are lexicographically rearranged into vector form, i.e., X ∈ ℝ^N, Y ∈ ℝ^M, where N > M, r² = N/M, and r is the scale factor. The degradation can be typically described as [38]

$$ \boldsymbol{Y}=\boldsymbol{SBX}+\boldsymbol{V}=\boldsymbol{HX}+\boldsymbol{V} $$
(1)

where S : ℝ^N → ℝ^M is the down-sampling operator, B : ℝ^N → ℝ^N is the blurring operator, V ∈ ℝ^M is the additive noise, and H : ℝ^N → ℝ^M is the degradation operator, which can be viewed as the composition of S and B.
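As a minimal illustration of model (1), the degradation can be sketched in NumPy. This is only a sketch: the Gaussian kernel width and noise level below are assumptions borrowed from the experimental settings of Section 4, and the boundary handling of the blur is left at the library default.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(X, scale=3, sigma=1.6, noise_std=0.0, rng=None):
    """Sketch of the degradation Y = SBX + V: blur, down-sample, add noise."""
    blurred = gaussian_filter(X, sigma=sigma)        # blurring operator B
    Y = blurred[::scale, ::scale]                    # down-sampling operator S
    if noise_std > 0:
        rng = np.random.default_rng() if rng is None else rng
        Y = Y + rng.normal(0.0, noise_std, Y.shape)  # additive noise V
    return Y
```

For instance, degrading a 12 × 12 image with scale factor r = 3 yields a 4 × 4 LR output, consistent with r² = N/M.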

Clearly, the fundamental constraint of SISR is that the recovered image should approximately reproduce the LR observation after the same degradation is imposed on it. Nevertheless, since too much information is discarded during the high-to-low acquisition process, the linear eq. (1) is seriously underdetermined, i.e., infinitely many solutions satisfy (1). In order to obtain an appropriate solution, the researchers in [17, 49, 50] set up the initial framework of sparse representation based SISR, which adds the sparsity prior and the local-to-global reconstruction concept to the basic constraint. To be specific, suppose that the operator R_i : ℝ^N → ℝ^n extracts the i-th patch of size \( \sqrt{n}\times \sqrt{n} \) from an image of length N and vectorizes it; the i-th patch of X can then be expressed as x_i = R_i X. Given the corresponding observed LR patch y_i, each patch x_i can be sparsely represented by the formula \( {\widehat{\boldsymbol{x}}}_i={\boldsymbol{D}\boldsymbol{\alpha}}_{\boldsymbol{y}, i} \), where α_{y,i} is its sparse coefficient (representation), calculated by a sparse coding operation with respect to a known and proper dictionary D

$$ {\widehat{\boldsymbol{\alpha}}}_{\boldsymbol{y}, i}= \arg \underset{{\boldsymbol{\alpha}}_i}{ \min}\left\{\ {\left\Vert {\boldsymbol{y}}_i-{\boldsymbol{HD}\boldsymbol{\alpha}}_i\right\Vert}_2^2\right.+\lambda \left.{\left\Vert {\boldsymbol{\alpha}}_i\right\Vert}_1\right\} $$
(2)

where λ is a trade-off parameter balancing the two terms. Note that the l1-norm term is a regularization term representing the sparsity prior; the original l0-norm has been relaxed to the l1-norm, which yields the same solution as long as the coefficients are sufficiently sparse, owing to the non-convex character of the l0-norm [8, 9, 16].
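The minimization in (2) can be approximated by a plain iterative shrinkage-thresholding loop. The sketch below is illustrative, not the paper's solver: the matrix A stands in for the composite operator HD, and the step size and iteration count are assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding, the proximal operator of the l1-norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(y, A, lam=0.1, n_iter=200):
    """Approximately solve min_a ||y - A a||_2^2 + lam ||a||_1 by ISTA."""
    L = 2.0 * np.linalg.norm(A, 2) ** 2   # Lipschitz constant of the gradient
    a = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ a - y)    # gradient of the data-fidelity term
        a = soft_threshold(a - grad / L, lam / L)
    return a
```

With an orthogonal A the iteration reduces to a single soft-thresholding of the analysis coefficients, which foreshadows the simplified coding used with the orthogonal sub-dictionaries of Section 3.3.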

Usually, we work with overlapping patches to suppress boundary artifacts. With maximal overlaps, a total of \( Q={\left(\sqrt{N}-\sqrt{n}+1\right)}^2 \) patches can be extracted. After imposing a global constraint on these patches, the optimal reconstruction of the whole HR image X can be straightforwardly computed by averaging all the obtained local patches according to [17]

$$ \widehat{\boldsymbol{X}}=\boldsymbol{D}\circ {\widehat{\boldsymbol{\alpha}}}_{\boldsymbol{Y}}\overset{\Delta}{=}{\left(\sum_{i=1}^Q{{\boldsymbol{R}}_i}^T{\boldsymbol{R}}_i\right)}^{-1}\left(\sum_{i=1}^Q{{\boldsymbol{R}}_i}^T\boldsymbol{D}{\widehat{\boldsymbol{\alpha}}}_{\boldsymbol{y}, i}\right) $$
(3)

where α_Y represents the concatenation of all sparse coefficients, and the shorthand notation "∘" is defined here for brevity in the following parts.
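The averaging in (3) amounts to pasting each HR patch back at its position, accumulating a per-pixel overlap count (the diagonal of Σ R_iᵀR_i), and dividing. A minimal sketch, assuming patches are supplied in raster-scan order:

```python
import numpy as np

def reconstruct(patches, img_shape, patch_size):
    """Average maximally overlapping patches back into an image, as in Eq. (3).

    patches: array of shape (Q, n*n) holding the HR patches D @ alpha_i
    in raster-scan order (an assumption of this sketch).
    """
    H, W = img_shape
    n = patch_size
    acc = np.zeros(img_shape)
    cnt = np.zeros(img_shape)                           # diag of sum_i R_i^T R_i
    k = 0
    for r in range(H - n + 1):
        for c in range(W - n + 1):
            acc[r:r+n, c:c+n] += patches[k].reshape(n, n)   # R_i^T (D alpha_i)
            cnt[r:r+n, c:c+n] += 1.0
            k += 1
    return acc / cnt
```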

Eqs. (2) and (3) can be combined into a more unified formulation that stands for the sparse coding process and the local-to-global reconstruction simultaneously

$$ {\widehat{\boldsymbol{\alpha}}}_{\boldsymbol{Y}}= \arg \underset{\boldsymbol{\alpha}}{ \min}\left\{{\left\Vert \boldsymbol{Y}-\boldsymbol{HD}\circ \boldsymbol{\alpha} \right\Vert}_2^2+\lambda {\left\Vert \boldsymbol{\alpha} \right\Vert}_1\right\} $$
(4)

where the first term corresponds to the data fidelity constraint, and the second one corresponds to the sparsity prior constraint.

In summary, under the basic framework of sparse representation, the ill-posed SISR problem is further regularized by the sparsity prior of patches in addition to the local and global data fidelity constraints, resulting in a proper and stable solution.

3 Proposed BASR model for SISR

In this section, we present the enhanced model, BASR, designed for handling the SISR issue in the particular circumstance where no external database is available for prior or dictionary learning. The presentation of the BASR model begins with the modeling of bidirectional similarities, which is the theoretical support of the subsequent content. On that basis, the key process, sparse coefficient alignment, can be established in both directions. Afterward, the dictionary learning method and the detailed implementation are specified in turn.

3.1 Modeling of bidirectional similarities

In this paper, the proposed bidirectional similarities consist of the row similarity and the column similarity. To construct them, the first step is to establish the similarity data matrix [57]. For each patch x_i, we search the whole HR image X for its P closest counterparts (including itself) in the sense of the Euclidean distance. By concatenating the patch and its counterparts, we obtain a matrix S_i ∈ ℝ^(n × P), namely the similarity data matrix of the i-th patch. As mentioned before, patch similarity can be found not only among the columns of the similarity data matrix but also among its rows. Therefore, the next step is to exploit the column similarity and the row similarity, respectively, by virtue of the similarity data matrix.
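Construction of S_i can be sketched as a brute-force nearest-neighbor search over the vectorized patches. This assumes the patches are already extracted into a Q × n array; the P = 12 default mirrors the experimental settings, and no search-window restriction is applied here.

```python
import numpy as np

def similarity_matrix(patches, i, P=12):
    """Build S_i: the P patches closest to patch i (including itself),
    stacked column-wise. `patches` is a Q x n array of vectorized patches."""
    d = np.sum((patches - patches[i]) ** 2, axis=1)  # squared Euclidean distances
    idx = np.argsort(d)[:P]                          # patch i itself has distance 0
    return patches[idx].T                            # n x P similarity data matrix
```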

Column similarity was first put forward in the non-local means (NLM) algorithm [6] and applied to image denoising. However, unlike the original NLM algorithm, which amounts to a weighted average of all similar patches, we use every column of S_i as a dictionary atom to approximately represent the corresponding patch. This process can be formulated as

$$ {\boldsymbol{\beta}}_i= \arg \underset{{\boldsymbol{\beta}}_i}{ \min}\left\{{\left\Vert {\boldsymbol{x}}_i-{\boldsymbol{S}}_i{\boldsymbol{\beta}}_i\right\Vert}_2^2+{\eta}_1{\left\Vert {\boldsymbol{\beta}}_i\right\Vert}_2^2\right\} $$
(5)

where β_i is called the column similarity coefficient of the i-th patch, and η_1 is its regularization parameter.

In (5), the l2-norm of the coefficient serves as the regularization term, included to alleviate singularity in the calculation and to avoid the trivial solution. The formula is in fact of the same form as classic Tikhonov regularization, also known as ridge regression in statistics. Its explicit solution is

$$ {\boldsymbol{\beta}}_i={\left({\boldsymbol{S}}_i^T{\boldsymbol{S}}_i+{\eta}_1\boldsymbol{I}\right)}^{-1}{\boldsymbol{S}}_i^T{\boldsymbol{x}}_i $$
(6)

where I represents the identity matrix.
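The closed form (6) is a single regularized linear solve. A minimal sketch, with the default η₁ taken from the experimental settings:

```python
import numpy as np

def column_similarity_coef(S_i, x_i, eta1=0.1):
    """Closed-form ridge solution of Eq. (6):
    beta_i = (S_i^T S_i + eta1 I)^{-1} S_i^T x_i."""
    P = S_i.shape[1]
    # Solve the P x P regularized normal equations rather than inverting.
    return np.linalg.solve(S_i.T @ S_i + eta1 * np.eye(P), S_i.T @ x_i)
```

Using `solve` on the P × P system instead of forming the explicit inverse is the standard numerically preferable choice.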

To exploit row similarity, we are motivated by the concept of the piecewise autoregressive (AR) model. As a classic but powerful statistical method, the AR model has been successfully employed in other image processing applications, such as image interpolation [56] and image denoising [57]. Its key point is that if a natural image is cut into small local parts, each part can be viewed as one stationary process. In other words, natural images are piecewise stationary and can be modeled by a set of AR models.

Therefore, in this work we assume that the central pixel of each patch can be linearly represented by its neighboring pixels with coefficients calculated by an AR model. Moreover, patches belonging to the same similarity data matrix should have identical AR coefficients, as they share the same similarity. Let C be the operator extracting the central pixels from the similarity data matrix (i.e., its central row). Analogously to the column similarity, this process can be modeled as

$$ {\boldsymbol{\gamma}}_i= \arg \underset{{\boldsymbol{\gamma}}_i}{ \min}\left\{{\left\Vert {\left({\boldsymbol{CS}}_i\right)}^T-{\boldsymbol{S}}_i^T{\boldsymbol{\gamma}}_i\right\Vert}_2^2+{\eta}_2{\left\Vert {\boldsymbol{\gamma}}_i\right\Vert}_2^2\right\} $$
(7)

where γ_i is called the i-th row similarity coefficient, and the closed-form solution is given by

$$ {\boldsymbol{\gamma}}_i={\left({\boldsymbol{S}}_i{\boldsymbol{S}}_i^T+{\eta}_2\boldsymbol{I}\right)}^{-1}{\boldsymbol{S}}_i{\left({\boldsymbol{CS}}_i\right)}^T $$
(8)

With the column and row coefficients, the similarities in both directions can be calculated as \( {\left\{{\boldsymbol{S}}_i{\boldsymbol{\beta}}_i\right\}}_{i=1}^Q \) and \( {\left\{{\boldsymbol{S}}_i^T{\boldsymbol{\gamma}}_i\right\}}_{i=1}^Q \), respectively. Note that each calculated row similarity is composed of the central pixels of the corresponding similar patches, so its ordering differs from the one we require. Therefore, we first spread it out over the whole HR image and then rearrange it in the same order as α. The rearranged column and row similarities are denoted as φ_i and ψ_i, respectively; these are the proposed bidirectional similarities of this paper. Taking advantage of the bidirectional similarities allows us to construct a much more accurate sparse representation model, introduced in the following subsection.
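The row-direction solve (8) mirrors (6) with the roles of rows and columns exchanged. A sketch, where the vector c holds the central-pixel row C S_i of the similarity data matrix and η₂ defaults to the experimental value:

```python
import numpy as np

def row_similarity_coef(S_i, c, eta2=0.3):
    """Closed-form solution of Eq. (8):
    gamma_i = (S_i S_i^T + eta2 I)^{-1} S_i (C S_i)^T.
    `c` is the length-P vector of central pixels (C S_i)^T."""
    n = S_i.shape[0]
    # n x n regularized system; S_i plays the role of the AR regressor matrix.
    return np.linalg.solve(S_i @ S_i.T + eta2 * np.eye(n), S_i @ c)
```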

3.2 Bidirectional sparse coefficient alignment

As mentioned before, the accuracy of the sparse coefficients is of great significance to sparse representation based SISR. Nevertheless, model (4), which uses the sparsity prior only, may not lead to a sufficiently precise output due to the complexity of image degradation, and a performance enhancement can be expected from suppressing the sparse coding error caused by degradation and noise. Thus, in this subsection we propose an enhanced model in which the roughly computed sparse coefficients are simultaneously aligned to the bidirectional similarities proposed above.

Before constructing our model, an important step is to transfer the bidirectional similarities from the pixel domain to the sparse coefficient domain through a sparse coding process. As will be introduced in the next subsection, all the sub-dictionaries adopted in this paper are orthogonal, so the coding process simplifies to multiplying the pixel values by the transpose of the corresponding sub-dictionary. Given the sub-dictionary \( {\boldsymbol{D}}_{t_i} \) of the i-th patch, this process is formulated as \( {\boldsymbol{\varphi}}_i={\boldsymbol{D}}_{t_i}^T{\boldsymbol{\varphi}}_i^{\boldsymbol{\prime}} \) and \( {\boldsymbol{\psi}}_i={\boldsymbol{D}}_{t_i}^T{\boldsymbol{\psi}}_i^{\boldsymbol{\prime}} \). By incorporating this pair of similarities into the basic model (4) as additional regularization terms, we obtain the objective function of the proposed BASR model

$$ \widehat{\boldsymbol{\alpha}}= \arg \underset{\boldsymbol{\alpha}}{ \min}\left\{{\left\Vert \boldsymbol{Y}-\boldsymbol{HD}\circ \boldsymbol{\alpha} \right\Vert}_2^2+\lambda {\left\Vert \boldsymbol{\alpha} \right\Vert}_1\right.\left.+{\mu}_1{\left\Vert \boldsymbol{\alpha} -\boldsymbol{\varphi} \right\Vert}_p^p+{\mu}_2{\left\Vert \boldsymbol{\alpha} -\boldsymbol{\psi} \right\Vert}_p^p\right\} $$
(9)

where φ and ψ are the concatenations of all φ_i and ψ_i, respectively, representing the bidirectional similarities of α.

In our BASR model, the output sparse coefficients are thus not only sparse but also bidirectionally aligned, so that the errors caused by degradation and noise can be efficiently suppressed. Furthermore, similar to [13], a more comprehensive analysis of the sparse coding error is conducted here to illustrate its statistical property and, consequently, to determine the type of norm to be used in (9). Specifically, the test image Lena is chosen as an HR sample from which four degraded versions are generated by applying the degradations specified in subsection 4.1; that is, all four SISR scenarios are taken into consideration. Using the given sub-dictionaries, it is straightforward to calculate the difference between the ideal and estimated sparse coefficients, namely the sparse coding error, by solving (4). Note that, to be fully convincing, the parameters used here are identical to those adopted in the experimental section. The probability density functions (PDFs) of the sparse coding error under the four scenarios are plotted in Fig. 1(a)-(d) with respect to the 5th, 10th, 15th, and 20th sub-dictionary, respectively. As shown in Fig. 1, the estimated PDFs do not conform to Gaussian distributions, but they all fit Laplace distributions well. Therefore, motivated by this analysis, the l1-norm is chosen to model the sparse coding error (i.e., p = 1).

Fig. 1

PDFs of sparse coding errors of image Lena in (a) scenario 1, (b) scenario 2, (c) scenario 3 and (d) scenario 4

3.3 Dictionary learning and adaptive selection

Clearly, two key procedures remain: dictionary learning and the adaptive selection of one dictionary for each local patch. The conventional approach to dictionary learning aims at learning a universal, over-complete dictionary to code many varieties of local structures [49, 50, 55]. However, it has recently been shown that sparse coding under such dictionaries is inherently time-consuming and potentially unstable [19]. We therefore turn to another promising strategy, namely adaptive sparse domain selection (ASDS) [12].

Originally, ASDS requires an extra database of raw image patches to train on, whereas in this paper we consider the more practical situation where no external information is available. To overcome this, an alternative training database is constructed for ASDS by sampling patches from the currently estimated HR image and its down-scaled versions instead. With this training database, the specific procedure goes as follows: first, the training patches are gathered and partitioned into K clusters via K-means clustering. After applying PCA to each cluster, we obtain K orthogonal and compact sub-dictionaries that compose the final dictionary of this paper, denoted by \( {\left\{{\boldsymbol{D}}_i\right\}}_{i=1}^K \). Then, for an input patch x_i to be coded, the sub-dictionary \( {\boldsymbol{D}}_{t_i} \) belonging to the nearest cluster is selected from the overall dictionary, and the sparse coding process is greatly simplified to a matrix multiplication of the form \( {\boldsymbol{\alpha}}_i={\boldsymbol{D}}_{t_i}^T{\boldsymbol{x}}_i \). Since each patch is better represented by the adaptively selected sub-dictionary, the whole recovered image is more accurate than with a universal dictionary. Moreover, this learning and coding strategy implicitly forces the coefficients of the given patch with respect to the other sub-dictionaries to zero. That is to say, our model guarantees the local sparsity of the coefficients spontaneously, so the regularization term in (9) enforcing local sparsity can be omitted. The objective function finally becomes
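The clustering-plus-PCA learning and the one-multiplication coding can be sketched as below. This is an illustrative simplification, not the paper's exact ASDS implementation: it uses plain Lloyd iterations with a fixed iteration count, keeps the full orthogonal PCA basis per cluster, and codes patches without mean subtraction, all of which are assumptions of the sketch.

```python
import numpy as np

def learn_subdictionaries(patches, K=50, seed=0, n_iter=20):
    """K-means clustering, then per-cluster PCA giving orthogonal sub-dictionaries."""
    rng = np.random.default_rng(seed)
    cent = patches[rng.choice(len(patches), K, replace=False)]
    for _ in range(n_iter):                                  # plain Lloyd iterations
        lab = np.argmin(((patches[:, None] - cent) ** 2).sum(-1), axis=1)
        for k in range(K):
            if np.any(lab == k):
                cent[k] = patches[lab == k].mean(0)
    n = patches.shape[1]
    dicts = []
    for k in range(K):
        grp = patches[lab == k] - cent[k]
        if len(grp) == 0:                                    # guard an empty cluster
            dicts.append(np.eye(n))
            continue
        _, _, Vt = np.linalg.svd(grp, full_matrices=True)    # PCA via SVD
        dicts.append(Vt.T)                                   # orthogonal n x n basis
    return cent, dicts

def code_patch(x, cent, dicts):
    """Select the nearest sub-dictionary, then code by one multiplication."""
    t = int(np.argmin(((cent - x) ** 2).sum(-1)))
    return t, dicts[t].T @ x                                 # alpha_i = D_t^T x_i
```

Because each sub-dictionary is orthogonal and square, coding and decoding are exact inverses of each other, which is what makes the simplified multiplication valid.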

$$ \widehat{\boldsymbol{\alpha}}= \arg \underset{\boldsymbol{\alpha}}{ \min}\left\{{\left\Vert \boldsymbol{Y}-\boldsymbol{HD}\circ \boldsymbol{\alpha} \right\Vert}_2^2\right.\left.+{\mu}_1{\left\Vert \boldsymbol{\alpha} -\boldsymbol{\varphi} \right\Vert}_1+{\mu}_2{\left\Vert \boldsymbol{\alpha} -\boldsymbol{\psi} \right\Vert}_1\right\} $$
(10)

3.4 Summary and mathematical implementation

The proposed BASR model (10) is a hybrid optimization problem mixing the l2-norm and the l1-norm, which makes it non-smooth and leaves it without a closed-form solution. Therefore, in our implementation, the model is solved iteratively in a patchwise manner. Without loss of generality, it can be rewritten in patchwise form

$$ {\left\{{\boldsymbol{\alpha}}_i\right\}}_{i=1}^Q= \arg \underset{{\boldsymbol{\alpha}}_i}{ \min}\left\{{\left\Vert {\boldsymbol{y}}_i-{\boldsymbol{HD}}_{t_i}{\boldsymbol{\alpha}}_i\right\Vert}_2^2\right.+{\mu}_1{\left\Vert {\boldsymbol{\alpha}}_i-{\boldsymbol{\varphi}}_i\right\Vert}_1\left.+{\mu}_2{\left\Vert {\boldsymbol{\alpha}}_i-{\boldsymbol{\psi}}_i\right\Vert}_1\right\}, for\ i=1,2,\dots, Q $$
(11)

On the basis of the fast iterative shrinkage-thresholding algorithm (FISTA) [3], the local-to-global and coarse-to-fine solving procedure can be concisely expressed as follows (for more details about the whole process, please refer to Appendix A):

$$ {\left\{{\boldsymbol{\alpha}}_i^{\left( l+1\right)}\right\}}_{i=1}^Q=\rho \left({\boldsymbol{\alpha}}_i^{t mp}, L\right)=\rho \left({\boldsymbol{D}}_{t_i}^T{\boldsymbol{R}}_i{\boldsymbol{X}}^{t mp}, L\right)=\rho \left({\boldsymbol{D}}_{t_i}^T{\boldsymbol{R}}_i\cdot h\left({\boldsymbol{X}}^{(l)},{\boldsymbol{X}}^{\left( l-1\right)}\right), L\right), for\ i=1,2,\dots, Q $$
(12)

Here L is a scalar controlling the step-size magnitude, whose value can be determined by a backtracking step-size rule, and the function ρ is the shrinkage operator defined in Appendix A. Moreover, to accelerate the iteration, the temporary variable before the shrinkage operation in (12) is not computed from the result of the previous iteration only (i.e., X^(l)), but rather from a specific linear combination of the previous two results (i.e., X^(l) and X^(l−1)), which is conveyed by the function h. In summary, the detailed implementation is outlined in Algorithm 1, while the block diagram of the proposed BASR method is illustrated in Fig. 2.
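The skeleton of the FISTA update in (12) can be sketched generically: a momentum combination of the two previous iterates (the function h), a gradient step on the smooth data term, and a shrinkage. This sketch fixes the step size instead of backtracking, and its function names are illustrative, not from the paper.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fista(grad_f, shrink_t, x0, L, n_iter=100):
    """Generic FISTA skeleton: momentum, gradient step, then shrinkage.
    grad_f: gradient of the smooth term; shrink_t: l1 weight; L: step scale."""
    x_prev = x0.copy()
    x = x0.copy()
    t = 1.0
    for _ in range(n_iter):
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = x + ((t - 1.0) / t_next) * (x - x_prev)   # h(X^(l), X^(l-1))
        x_prev = x
        x = soft_threshold(z - grad_f(z) / L, shrink_t / L)
        t = t_next
    return x
```

The extra bookkeeping (the scalar t and one stored iterate) is exactly the negligible overhead discussed below for steps (i) and (iv) of the inner loop.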

Fig. 2

Block diagram of the proposed BASR method. Note that parts B, A, and C in the above diagram correspond to the specific contents of subsections 3.1, 3.3, and 3.2 + 3.4, respectively

As shown in Algorithm 1, a nested iteration consisting of inner and outer loops is employed. The reason is that the main computational burden lies in the dictionary learning and similarity modeling procedures, and the variables involved there do not change drastically as the iteration proceeds. Therefore, these procedures are placed in the outer loop and executed only once every I_2 iterations to save computation. In addition, steps (i) and (iv) in the inner loop are the extra computation requested by FISTA; compared with the other steps, both are computationally negligible, so the algorithm retains almost the same computational burden but converges faster. The theoretical convergence of FISTA has been well established; please refer to [3, 11] for details.

4 Experimental results and analysis

In this section, a series of experiments on natural images is designed and conducted to verify the effectiveness and robustness of the proposed BASR method in comparison with eight recent state-of-the-art counterparts: SCSR [50], SLSR [55], SPSR [34], ASDS [12], NCSR [14], ANR [39], A+ [40] and SRCNN [15].

4.1 Experimental settings

First of all, thirteen generic test images of different types (see Table 1) are adopted as benchmarks; the first ten were presented in [12] while the remaining three appeared in [34]. The degradation model (1) is then applied to these test samples to generate the LR images used as inputs in the following experiments. Note that two prevalent but different degradation configurations were used among the aforementioned eight methods: the first blurs the original image with the bicubic filter and then down-samples it by a prespecified decimation factor in both the horizontal and vertical directions; the second employs a Gaussian filter instead, followed by the same down-sampling. The essential difference between the two configurations is therefore the choice of blur kernel. In this paper, both types of filter are taken into account in order to be fully convincing. To emphasize the performance influence of different blur kernels, we fix the decimation factor to 3 (the benchmark value) throughout our experiments so that the impact introduced by down-sampling is always the same. In addition, zero-mean additive Gaussian noise with standard deviation 5 can further be added to the degraded inputs to test the robustness of all methods to noise. In total, therefore, four different SISR scenarios are considered in this paper, specified as follows:

  • Scenario 1: HR images are blurred by the bicubic filter, followed by down-sampling.

  • Scenario 2: HR images are blurred by the Gaussian filter of size 7 × 7 and standard deviation 1.6, followed by down-sampling.

  • Scenario 3: The LR outputs of scenario 1, additionally contaminated by Gaussian noise.

  • Scenario 4: The LR outputs of scenario 2, additionally contaminated by Gaussian noise.

Table 1 Numerical results (PSNR and SSIM) of the reconstructed HR images in scenario 1

The other main parameter settings of this paper are chosen empirically as follows: P = 12, K = 50, η_1 = 0.1, η_2 = 0.3, I_1 = 9, I_2 = 40, and the local patches are extracted with maximal overlaps, with a patch size of 5 × 5 (i.e., n = 25). Note that, according to some preliminary tests, the performance of our BASR method is almost insensitive to the above parameters within a reasonable range, whereas the weights of the regularization terms (i.e., μ_1 and μ_2) play a much more important role. The determination of the regularization weights therefore deserves deeper consideration. To simplify it, we keep the weight of the column similarity (i.e., μ_1) fixed at the value employed in [14], and search for the optimal value of the other weight (i.e., μ_2). In this way, we can pay more attention to the complementary effect produced by the regularization terms of the BASR model rather than get stuck in parameter optimization. As a result, the weights μ_1 and μ_2 are set to 0.7 and 0.1 respectively for the noiseless experiments, while under noisy conditions they are 1 and 0.2, respectively.

Moreover, considering that the human visual system is more sensitive to changes in luminance than to those in chromaticity, the YCbCr color space is adopted for representing color images in place of RGB. All competing methods are applied to the luminance component only, while the chromaticity components are simply interpolated from the input LR image to the target HR size by bicubic interpolation. To evaluate performance qualitatively and quantitatively, the reconstructed images produced by the various methods are compared in terms of both visual quality and two widely used numerical indicators, i.e., PSNR and the SSIM index [44]. When computing both indicators, the image borders that were neglected in [15, 38, 55] are uniformly taken into consideration in this paper for a fair comparison. All experiments were performed in the MATLAB R2013b environment on a PC with a Core i5 3.2 GHz CPU and 4 GB RAM.
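For reference, the full-image PSNR used here (borders included) follows the standard definition; the sketch below assumes 8-bit images with peak value 255, and SSIM is omitted as it requires the full windowed computation of [44].

```python
import numpy as np

def psnr(ref, rec, peak=255.0):
    """PSNR in dB over the full image, borders included."""
    mse = np.mean((np.asarray(ref, dtype=np.float64) -
                   np.asarray(rec, dtype=np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```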

4.2 Experimental results

In this subsection, all competing methods are implemented and evaluated under the four SISR scenarios specified above. First, let us concentrate on the first two scenarios, which correspond to noiseless cases with two different blur kernels. For scenario 1, the numerical results are reported in Table 1. Note that for each image, the upper row shows the PSNR values (in dB) while the lower row provides the SSIM indexes (dimensionless); this layout is adopted in every table of this paper. From Table 1 we can see that, in terms of quantitative assessment, the conventional sparse representation based methods, SCSR and SLSR, always give the lowest results, while the other six methods outperform them to varying degrees, mainly because each makes particular progress in either dictionary learning or the sparse coding process. Meanwhile, our proposed BASR method consistently provides the highest numerical values among all competing methods, since it not only inherits the good properties of the sparse representation based SISR framework but also benefits considerably from the proposed bidirectional sparse coefficient alignment procedure. When the testing environment turns to the second scenario, the full numerical results are listed in Table 2, from which a consistent conclusion can readily be drawn: our BASR method remains superior to all the other counterparts, including the best competitor, NCSR. To be precise, it outperforms this second best method by an average gain of 0.63 dB in PSNR and 0.0086 in SSIM. Moreover, by comparing the numerical results of scenarios 1 and 2, we notice that the differences of our method between the two scenarios (−0.01 dB for PSNR and 0.0012 for SSIM) are negligible.
That is to say, our BASR model provides virtually equal performance when the blur operator is changed from a bicubic filter to a Gaussian one, showing strong robustness to variation in blur type. Next, to further evaluate visual quality, two sets of visual comparisons on the images Leaves and Parrot under the first two scenarios are illustrated in Figs. 3 and 4, respectively. As shown in the figures, the proposed BASR method produces the most visually pleasing outputs in terms of preserving fine details and sharp edges. For instance, the stems and leaves in Fig. 3(j) are the most consistent with those of the original image Leaves shown in Fig. 3(a), while the others are blurred, twisted, or distorted to some degree. Meanwhile, the stripes around the eye in image Parrot in Fig. 4(j) are the clearest and sharpest in comparison with the other methods.

Table 2 Numerical results (PSNR and SSIM) of the reconstructed HR images in scenario 2
Fig. 3

Visual comparison on Leaves in scenario 1 with scale factor 3. a Original image, HR images reconstructed by b SCSR [50], c SLSR [55], d SPSR [34], e ASDS [12], f NCSR [14], g ANR [39], h A+ [40], i SRCNN [15] and j proposed BASR

Fig. 4

Visual comparison on Parrot in scenario 2 with scale factor 3. a Original image, HR images reconstructed by b SCSR [50], c SLSR [55], d SPSR [34], e ASDS [12], f NCSR [14], g ANR [39], h A+ [40], i SRCNN [15] and j proposed BASR

Then, considering that in practice the LR inputs of SISR are often contaminated by noise, we move on to the next two SISR scenarios to test the robustness of these methods to noise. Gaussian white noise is added to the same LR images previously used in scenarios 1 and 2 to generate the new inputs for scenarios 3 and 4, respectively, making the task more challenging. Tables 3 and 4 and Figs. 5 and 6 give the full numerical results and another two sets of visual displays of the HR images reconstructed by the various methods. In contrast to its previous performance, the SPSR method is the most sensitive to the perturbation caused by noise (especially in scenario 4), which results not only in serious noise-induced artifacts in the recovered images but also in severe declines in the numerical indicators. The reason is that a particular set of parameters in the SPSR model is trained to work efficiently under one specific condition; if the circumstances change even slightly, all parameters need to be retrained or reassigned. In other words, the SPSR method lacks robustness to noise, which is its fatal drawback. Similarly, although both the SCSR and SLSR methods can indisputably enhance the resolution of LR images, they are still liable to generate noticeable jaggy artifacts along image edges, indicating that the sparse prior alone is not enough for reliable reconstruction in the presence of noise. As for ANR, A+ and SRCNN, they do much better at reducing jaggy artifacts, but they still magnify, rather than eliminate, the unpleasant influence of miscellaneous outliers to some degree. The ASDS and NCSR methods both show an outstanding capability of recovering high-frequency components and suppressing noise; nevertheless, they tend to produce unclear and blurred local details or even occasional incorrect parts. For instance, in Fig. 6(e) the shape of the left eye of Lena is somewhat deformed, while a spurious white spot can be observed in Fig. 6(f), both of which make the HR images reconstructed by ASDS and NCSR look unnatural. On the contrary, by taking advantage of bidirectional similarities, our BASR method produces more accurate sparse coding coefficients, so that it not only suppresses noise efficiently but also preserves more delicate structures than any other counterpart. Thus, its HR outputs are still of the best visual quality, showing strong robustness against noise.
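The degradation pipeline used to build the noisy scenarios (blur, then downsampling, then additive Gaussian white noise) can be sketched as below. The blur width and noise standard deviation here are illustrative placeholders, not the paper's exact settings, and a Gaussian kernel is used as in scenario 2.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(hr, scale=3, blur_sigma=1.6, noise_std=5.0, seed=0):
    """Simulate a noisy LR input from an HR image:
    Gaussian blur -> decimation by `scale` -> Gaussian white noise.
    blur_sigma and noise_std are assumed values for illustration."""
    blurred = gaussian_filter(hr.astype(np.float64), sigma=blur_sigma)
    lr = blurred[::scale, ::scale]               # keep every `scale`-th pixel
    rng = np.random.default_rng(seed)
    noisy = lr + rng.normal(0.0, noise_std, lr.shape)
    return np.clip(noisy, 0.0, 255.0)            # stay in the 8-bit range
```

With scale factor 3, a 255 × 255 HR image yields an 85 × 85 LR input, matching the image sizes reported in Section 4.3.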

Table 3 Numerical results (PSNR and SSIM) of the reconstructed HR images in scenario 3
Table 4 Numerical results (PSNR and SSIM) of the reconstructed HR images in scenario 4
Fig. 5

Visual comparison on Butterfly in scenario 3 with scale factor 3. a Original image, HR images reconstructed by b SCSR [50], c SLSR [55], d SPSR [34], e ASDS [12], f NCSR [14], g ANR [39], h A+ [40], i SRCNN [15] and j proposed BASR

Fig. 6

Visual comparison on Lena in scenario 4 with scale factor 3. a Original image, HR images reconstructed by b SCSR [50], c SLSR [55], d SPSR [34], e ASDS [12], f NCSR [14], g ANR [39], h A+ [40], i SRCNN [15] and j proposed BASR

4.3 Comparison and analysis on the rate of convergence

In this subsection, we investigate the convergence rate of our method. The second scenario is selected as the typical testing environment. Note that not all the competing methods are included in this comparison, since some of them are not solved iteratively; consequently, the ASDS and NCSR methods are picked as the baselines. In addition to these two methods, a variant of the NCSR model, in which the FISTA algorithm replaces the original solver, is also implemented, in order to test the effectiveness of FISTA and give a more comprehensive illustration. This variant is referred to as NCSR + FISTA hereinafter.
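For reference, the FISTA iteration (Beck and Teboulle's accelerated proximal gradient method) on a generic l1-regularized least-squares problem can be sketched as below. This is a minimal NumPy version of the standard algorithm, not the exact solver embedded in BASR or NCSR + FISTA.

```python
import numpy as np

def fista_l1(A, b, lam, n_iter=200, L=None):
    """FISTA for min_x 0.5*||Ax - b||^2 + lam*||x||_1."""
    if L is None:
        # Lipschitz constant of the gradient: spectral norm of A, squared
        L = np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    y = x.copy()
    t = 1.0
    for _ in range(n_iter):
        # Gradient step on the smooth term ...
        g = y - (A.T @ (A @ y - b)) / L
        # ... followed by soft-thresholding (the proximal step for the l1 term)
        x_new = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)
        # Momentum update that yields the accelerated O(1/k^2) rate
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return x
```

The momentum sequence t is what distinguishes FISTA from plain ISTA and is the source of the faster convergence observed for NCSR + FISTA.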

Then, the RMSE values of the first five images (Bike, Butterfly, Hat, Leaves and Parrot) as the iterations increase are plotted in Fig. 7(a)-(e), respectively. The findings here are twofold. First, our method consistently converges fastest among these methods, which demonstrates both the efficiency and effectiveness of the proposed BASR model. Second, the original NCSR method is inferior to its variant, NCSR + FISTA, in terms of convergence rate. This improvement can only be attributed to the use of the FISTA algorithm, since we preserved the whole framework of NCSR (even the original parameter settings) and changed nothing but the solver. And because we did not fine-tune the parameters of NCSR + FISTA, it is unsurprising that its curves are ladder-like rather than smooth.

Fig. 7

Convergence rates of ASDS [12], NCSR [14], NCSR + FISTA and BASR methods. RMSE values on (a) Bike, (b) Butterfly, (c) Hat, (d) Leaves, (e) Parrot and (f) Average of all test samples

Furthermore, to provide more complete information, we repeated the experiments on the remaining samples. The average RMSE curve over all images is plotted in Fig. 7(f). It can be seen that on average ASDS and NCSR have not converged after 500 iterations (in fact, not even after 900, which is beyond the range shown in the figure), whereas our method needs only 340–380 iterations to converge. This is also why we set the number of iterations to 360. As for running time, the ASDS, NCSR and BASR methods take about 298, 532 and 319 s, respectively, to reconstruct a super-resolved image from 85 × 85 to 255 × 255 pixels. Even though ASDS requires the least time, this advantage comes from its offline learning and its never updating the dictionary; a single run of its learning process costs another 10 min. By contrast, our method achieves running time comparable to ASDS while costing much less than NCSR, demonstrating its efficiency once again.
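A simple way to turn an RMSE curve like those in Fig. 7 into a stopping rule is to iterate until successive estimates stop changing. The sketch below is generic (the paper plots RMSE against the ground truth, whereas this uses the change between iterates as a practical criterion); the tolerance value is an assumption for illustration.

```python
import numpy as np

def rmse(a, b):
    """Root-mean-square error between two arrays."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.sqrt(np.mean((a - b) ** 2)))

def run_until_converged(step, x0, tol=1e-3, max_iter=500):
    """Apply the update `step` (any x -> x_next map) until the RMSE
    between successive iterates drops below `tol`.
    Returns the final estimate and the number of iterations used."""
    x = x0
    for k in range(1, max_iter + 1):
        x_next = step(x)
        if rmse(x, x_next) < tol:
            return x_next, k
        x = x_next
    return x, max_iter
```

With such a rule, a fixed iteration budget like the 360 used for BASR can be replaced by a data-driven stopping point.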

In conclusion, according to the conducted experiments, the FISTA algorithm can be acknowledged as a promising solver for BASR in the sense that it yields a higher rate of convergence, and the superiority of our method over similar ones is fully verified.

5 Conclusion

In this paper, a bidirectionally aligned sparse representation (BASR) model was proposed for single image super-resolution. Motivated by the recent discovery [57] that image patch similarities exist not only among the columns but also among the rows of the similarity data matrix, the modeling of bidirectional similarities was first presented in our model. On this basis, we constructed a pair of l1-norm regularization terms, to which the raw sparse coefficients are simultaneously aligned after the sparse coding process in order to compensate for the errors caused by image noise and degradation. Based on the FISTA algorithm, a local-to-global and coarse-to-fine solving scheme was developed to efficiently solve the proposed BASR model. Extensive experiments were performed to compare the BASR method with other leading methods, and the results indicate that our method is consistently superior to its counterparts in terms of both numerical assessment and visual perception. In future work, we will concentrate on improving computational efficiency and on the adaptive allocation of the similarity and regularization weights in order to achieve further improvement.