1 Introduction

Image super-resolution (SR) reconstruction refers to the process of reconstructing a high-resolution (HR) image of a scene from a low-resolution (LR) image of the same scene. The single-image super-resolution problem focuses on generating believable, visually pleasing HR output images from LR input images. Unlike related image enhancement algorithms, SR usually assumes that the input image, although containing fewer pixels, is sharp at its original scale; the goal is to increase the scale of the image while keeping it as clear as possible. The most direct way to improve image resolution is to improve the optical hardware of the acquisition system. However, because the manufacturing process is difficult to improve substantially and manufacturing costs are high, increasing image resolution physically is often too expensive. Therefore, image super-resolution reconstruction from the perspective of software and algorithms has become a research hotspot in many fields, such as image processing and computer vision. Traditional super-resolution reconstruction algorithms mainly rely on basic digital image processing techniques, such as interpolation-based, degradation-model-based, and learning-based super-resolution reconstruction.

Interpolation-based methods [1, 2] treat each pixel as a point on the image plane, so estimating the super-resolution image can be viewed as fitting the unknown pixel values on this plane from the known ones. Such algorithms are usually implemented with a predefined transformation function or interpolation kernel. Interpolation-based methods are computationally simple and easy to understand, but they have obvious defects. First, they assume that pixel gray values change continuously and smoothly, which does not fully hold in practice. Second, during reconstruction, the super-resolution image is computed only according to a predefined transfer function, without regard to the degradation model of the image, which often leads to blurring, jaggedness, and other artifacts in the restored image. Nearest-neighbor, bilinear, and bicubic interpolation are common interpolation-based methods.

Degradation-model-based super-resolution reconstruction [3] assumes that LR images arise from HR images through motion, blur, and noise, which reduce the image resolution. This approach constrains the reconstruction by extracting key information from the LR images and incorporating prior knowledge about the image to be reconstructed. Common methods include the iterative back-projection method [4], projection onto convex sets [5], and the maximum a posteriori method [6].

Learning-based methods assume that there are correspondences between LR and HR images and aim to learn these correspondences; a large amount of training data is therefore needed, from which mapping relationships are learned and then used to predict the HR image corresponding to an LR input. Common learning-based methods include manifold learning [7] and sparse coding [8]. Gradient priors [3, 9, 10], non-local self-similarity priors [11,12,13], and sparsity priors [14,15,16] are typical image priors. These methods can recover sharp edges and suppress aliasing artifacts. However, the prior imposed on the HR image strongly affects their performance, and the reconstruction quality is poor when the zoom scale is large.

Yang et al. [17] exploited the sparseness of natural images and introduced the theory of compressed sensing: a dictionary that sparsely represents image patches is built through dictionary learning, and the reconstructed image is then generated as a linear combination of the dictionary atoms and the sparse representation coefficients obtained by linear programming, restoring high-frequency details and achieving good results. Building on this, Zeyde et al. [18] used K-SVD to train the dictionary, which greatly accelerated dictionary training, improved learning efficiency, and significantly improved the reconstruction quality compared with the method of Yang et al. [17]. Timofte et al. [19] then performed manifold learning on the sparsely trained dictionaries, accelerating image reconstruction while preserving its quality, and subsequently proposed the improved A+ algorithm (Timofte et al. [20]), which yields faster training and better reconstruction quality. Perez-Pellitero [21] proposed an improved SR linear regressor training strategy and a reverse search method for accelerating regression-based SR methods. Zhao et al. [22] proposed a multi-resolution dictionary learning (MRDL) model that restores details well, reduces aliasing and noise, and greatly improves computational efficiency. Zhang et al. [23] proposed a joint super-resolution framework with structurally modulated sparse representations, which also improves super-resolution performance. However, all of this work relies on a single-layer trained dictionary for reconstruction, so a good reconstruction requires a relatively large dictionary, and the deep features of the image receive little attention in these methods.

Tariyal et al. [24] proposed deep dictionary learning (DDL). In single-layer dictionary learning, the learned dictionary is used to synthesize data from the coefficients; DDL extends this concept from a single layer to multiple layers, learning a multi-layer dictionary so that the deepest coefficients can synthesize the data. Mahdizadehaghdam et al. [25] proposed a model that learns a deep dictionary for classification tasks. Compared with traditional deep neural networks, deep dictionary learning methods usually first divide images into patches and extract features at the patch level, which is their most obvious difference from deep neural networks; the learned patch-based feature dictionary is then used to transform the input data into a global sparse feature representation. Song et al. [26] proposed multi-layer discriminative dictionary learning (MDDL) with local constraints for image classification: through multi-layer dictionary learning, a robust dictionary is learned in the last layer, the separability of encoded vectors belonging to different categories is improved over other methods, and classification accuracy is maintained. Tang et al. [27] proposed a new deep-to-point encoding network for image multi-classification problems. Thereafter, Montazeri et al. [28] proposed a multi-layer K-SVD method, also for classification. However, image super-resolution reconstruction using deep dictionary learning remains relatively rare. Recently, Huang et al. [29] proposed a deep-dictionary-learning model for image super-resolution; unlike other multi-layer dictionaries, their architecture contains L−1 analysis dictionaries and a synthesis dictionary for extracting high-level features, but its results fall slightly short of the method proposed in this paper. Singhal et al. [30] proposed deep coupled dictionary learning for solving image inverse problems. Inspired by these results, and to address the limitations of dictionary-learning-based super-resolution methods, this paper proposes an improved image super-resolution reconstruction method. In the training stage, the deep dictionary learning method is used to learn a deep dictionary; the A+ method then completes the super-resolution reconstruction of the image. Experimental results show that, for the same dictionary size, the proposed algorithm improves PSNR and SSIM, and it effectively reduces the dictionary size.

The main contributions of our work are summarized as follows:

  1. A new super-resolution method based on deep dictionary learning is proposed.

  2. While ensuring the reconstruction quality, the size of the dictionary is reduced and the reconstruction time is shortened.

  3. Compared with some classical methods, our method performs better for the same dictionary size; compared with some methods from recent years, our method is also quite competitive.

The rest of the paper is organized as follows. Section 2 outlines the relevant techniques and concepts used in our approach. Section 3 introduces the method in detail. In Sect. 4, we present an experimental evaluation of the proposed scheme, and in Sect. 5, we summarize our work.

2 Related work

This section details the main methods involved in the proposed scheme.

2.1 Deep dictionary learning

The shallow dictionary learning model is:

$$ {\varvec{X}} = {\varvec{DZ}} $$
(1)

where X is the training data, D is the dictionary, and Z is the matrix of sparse coefficients.

DDL draws on the idea of deep learning, and further extracts more abstract deep features by extending shallow dictionary learning to multiple layers. Figure 1 shows a schematic diagram of two-layer dictionary learning. Mathematically, it can be modeled as:

$$\begin{array}{c}X={{\varvec{D}}}_{1}\varphi \left({{\varvec{D}}}_{2}{{\varvec{Z}}}_{2}\right)\end{array}$$
(2)

where φ is a nonlinear activation function. Extending this idea, multi-level dictionary learning can be expressed as:

Fig. 1 Schematic diagram of two-layer deep dictionary learning

$$\begin{array}{c}X={{\varvec{D}}}_{1}\varphi \left({{\varvec{D}}}_{2}\varphi \left(\dots \varphi \left({{\varvec{D}}}_{{\varvec{N}}}{{\varvec{Z}}}_{{\varvec{N}}}\right)\right)\right)\end{array}$$
(3)
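To make the synthesis model concrete, the following is a minimal numpy sketch of Eqs. (2)–(3); the function name `synthesize` and the choice of tanh as the activation φ are illustrative assumptions, not details fixed by the paper.

```python
import numpy as np

def synthesize(dictionaries, z, phi=np.tanh):
    """Synthesize data from the deepest coefficients, as in Eq. (3):
    X = D1 @ phi(D2 @ phi(... phi(DN @ ZN)))."""
    x = z
    for d in reversed(dictionaries[1:]):
        x = phi(d @ x)
    return dictionaries[0] @ x

# Toy two-layer example, X = D1 @ phi(D2 @ Z2) as in Eq. (2)
rng = np.random.default_rng(0)
D1 = rng.standard_normal((64, 32))
D2 = rng.standard_normal((32, 16))
Z2 = rng.standard_normal((16, 100))   # deepest coefficients, one column per sample
X = synthesize([D1, D2], Z2)          # synthesized data, shape (64, 100)
```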

2.2 Anchored neighborhood regression

In the image super-resolution problem, most neighborhood embedding (NE) and sparse coding (SC) methods use the \({l}_{1}\) norm of the coefficients to constrain or regularize the least squares (LS) problem, which makes the algorithm computationally demanding. To alleviate this, the problem can be reformulated as least squares regression with \({l}_{2}\)-norm regularization of the coefficients. This paper therefore uses Anchored Neighborhood Regression (ANR), which transforms the problem into the following optimization:

$$\begin{array}{c}\underset{{\varvec{\beta}}}{{\text{min}}}{\Vert {\varvec{y}}-{{\varvec{N}}}_{{\varvec{l}}}{\varvec{\beta}}\Vert }_{2}^{2}+\lambda {\Vert {\varvec{\beta}}\Vert }_{2}\end{array}$$
(4)

where y represents the input LR patch feature, \({{\varvec{N}}}_{{\varvec{l}}}\) represents the neighborhood in the LR dictionary space, β is the coefficient vector, and λ is the regularization parameter. The closed-form solution of this optimization problem is:

$$\begin{array}{c}\beta ={\left({{\varvec{N}}}_{{\varvec{l}}}^{{\varvec{T}}}{{\varvec{N}}}_{{\varvec{l}}}+\lambda {\varvec{I}}\right)}^{-1}{{\varvec{N}}}_{{\varvec{l}}}^{{\varvec{T}}}y\end{array}$$
(5)

Since the LR dictionary and the HR dictionary share the same sparse coefficients, the reconstructed HR patch can be written as:

$$\begin{array}{c}x={{\varvec{N}}}_{{\varvec{h}}}\beta \end{array}$$
(6)

where x is the reconstructed HR patch, and \({{\varvec{N}}}_{{\varvec{h}}}\) is the neighborhood of the HR dictionary space corresponding to \({{\varvec{N}}}_{{\varvec{l}}}\).

For the atoms of the learned dictionary, each atom selects its K nearest neighbor atoms, which are defined as its neighborhood. Once the neighborhood is defined, a separate projection matrix \({{\varvec{P}}}_{{\varvec{j}}}\) can be computed for each dictionary atom \({{\varvec{d}}}_{{\varvec{j}}}\) from its neighborhood. The super-resolution problem is then solved by finding the nearest atom \({{\varvec{d}}}_{{\varvec{j}}}\) in the dictionary for each input patch feature \({\varvec{y}}\) and mapping it to the HR space with the stored projection matrix \({{\varvec{P}}}_{{\varvec{j}}}\):

$$\begin{array}{c}x={{\varvec{P}}}_{{\varvec{j}}}y\end{array}$$
(7)

This method is a low-complexity approximation of the NE method, so it greatly reduces the execution time and time complexity of the algorithm.
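As a sketch, the closed-form solutions of Eqs. (5)–(7) amount to a few lines of numpy; the function names and the regularization value are illustrative assumptions.

```python
import numpy as np

def anr_coefficients(y, N_l, lam=0.1):
    """Ridge-regression coefficients of Eq. (5)."""
    k = N_l.shape[1]
    return np.linalg.solve(N_l.T @ N_l + lam * np.eye(k), N_l.T @ y)

def projection_matrix(N_l, N_h, lam=0.1):
    """Per-atom projection matrix implied by Eqs. (5)-(7): x = N_h @ beta = P_j @ y."""
    k = N_l.shape[1]
    return N_h @ np.linalg.solve(N_l.T @ N_l + lam * np.eye(k), N_l.T)
```

Because \({{\varvec{P}}}_{{\varvec{j}}}\) depends only on the dictionary and its neighborhoods, it can be computed once offline, which is exactly what makes ANR fast at test time.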

2.3 Adjusted anchored neighborhood regression (A+)

In ANR and other sparse coding methods, training samples are used only in the dictionary construction (learning) stage. The adjusted anchored neighborhood regression (A+) algorithm differs: when the neighborhood of each atom is formed (computed in the training phase and used in the reconstruction phase), the neighborhood used for regression is drawn from the training sample set, so the training samples also matter in the reconstruction phase. Ridge regression is again used during training, as in ANR; the difference is that the neighborhood is defined by training samples instead of sparse dictionary atoms. The optimization problem therefore becomes:

$$\begin{array}{c}\underset{{\varvec{\delta}}}{{\text{min}}}{\Vert {\varvec{y}}-{{\varvec{S}}}_{{\varvec{l}}}{\varvec{\delta}}\Vert }_{2}^{2}+\lambda {\Vert {\varvec{\delta}}\Vert }_{2}\end{array}$$
(8)

Here, the matrix \({{\varvec{S}}}_{{\varvec{l}}}\) replaces \({{\varvec{N}}}_{{\varvec{l}}}\) as the neighborhood of the atom; it consists of the K training samples closest to the dictionary atom that matches the input patch y.

3 Proposed scheme

In this section, we introduce the details of the proposed scheme. The method learns the required high- and low-resolution dictionaries from the training set through deep dictionary learning and selects dictionary atom pairs to calculate the projection matrices. Finally, the test image is read and reconstructed using the dictionary and the projection matrices. The notation is summarized in Table 1.

Table 1 The notations used in this paper

3.1 Preprocessing

Image super-resolution algorithms usually divide the image into blocks in the spatial domain, perform super-resolution reconstruction on each patch, and fuse the reconstructed patches to complete the reconstruction. The raw spatial intensities of the image patches do not represent the image well; the common practice is to normalize the image contrast by subtracting the mean and dividing by the standard deviation. Commonly used image features are the first and second derivatives of the image patches. The two feature types perform similarly, and Bevilacqua et al. [31] showed that using mean subtraction alone performs slightly better than using only the first- and second-order derivatives. In this paper, the same features as Zeyde et al. [18] are used: first- and second-order derivative features are extracted from each image patch, and PCA is applied to reduce their dimensionality.

We subtract the bicubically interpolated LR image from the HR image to create normalized HR patches. The patches generated by the SR process are added back to the bicubic interpolation of the LR input image (overlapping pixel values are averaged) to reconstruct the output.
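The sketch below illustrates the kind of first- and second-order derivative features described above; the exact filters used by Zeyde et al. [18] may differ, so treat these kernels as plausible assumptions.

```python
import numpy as np
from scipy.ndimage import correlate1d

def derivative_features(img):
    """Four gradient feature maps (first and second derivatives, both axes)
    computed on the bicubically upscaled LR image."""
    f1 = np.array([-1.0, 0.0, 1.0])             # first-order kernel
    f2 = np.array([1.0, 0.0, -2.0, 0.0, 1.0])   # second-order kernel
    maps = [correlate1d(img, f1, axis=a) for a in (0, 1)]
    maps += [correlate1d(img, f2, axis=a) for a in (0, 1)]
    return np.stack(maps)  # patch vectors drawn from these maps are then PCA-reduced
```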

3.2 Dictionary construction

Dictionary construction is critical to the performance of any sparse-representation SR method. Generally speaking, performance is positively correlated with dictionary size: the larger the dictionary, the better the effect, but also the higher the computational cost. The LR input image itself can be used to build an "internal" dictionary [32]; however, many methods prefer to build an "external" dictionary from images other than the input query.

In the dictionary construction process of this paper, the same set of training samples is used as in Zeyde et al. [18] and Yang et al. [17]. For dictionary learning, this paper draws on the deep dictionary learning method proposed by Tariyal et al. [24].

For single-layer dictionaries, D and Z can be obtained by solving the following optimization problems:

$$\begin{array}{c}\underset{{\varvec{D}},{\varvec{Z}}}{{\text{min}}}{\Vert {\varvec{X}}-{\varvec{D}}{\varvec{Z}}\Vert }_{F}^{2}\end{array}$$
(9)

For the problem of sparse representation, the purpose is to learn a basis that can represent the samples sparsely, that is, Z is required to be sparse. The most commonly used algorithm for this is K-SVD, which fundamentally solves the following optimization problem:

$$\begin{array}{c}\underset{{\varvec{D}},{\varvec{Z}}}{{\text{min}}}{\Vert {\varvec{X}}-{\varvec{D}}{\varvec{Z}}\Vert }_{F}^{2}\quad \mathrm{s.t.}\quad {\Vert {\varvec{Z}}\Vert }_{0}\le \varepsilon \end{array}$$
(10)
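For illustration, the greedy sparse-coding step that K-SVD alternates with its atom-by-atom dictionary update can be sketched as a textbook orthogonal matching pursuit (a minimal version, not the paper's implementation):

```python
import numpy as np

def omp(D, x, n_nonzero):
    """Greedy sparse coding: approximately solve
    min ||x - D z||_2 subject to ||z||_0 <= n_nonzero, as in Eq. (10)."""
    residual, support = x.astype(float).copy(), []
    z = np.zeros(D.shape[1])
    for _ in range(n_nonzero):
        j = int(np.argmax(np.abs(D.T @ residual)))       # most correlated atom
        support.append(j)
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef              # residual orthogonal to support
    z[support] = coef
    return z
```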

For deep dictionary learning, as shown in Eq. (3), the proposed method uses greedy learning, one layer at a time: first, the dictionary and coefficients of the first layer are learned:

$$\begin{array}{c}\underset{{{\varvec{D}}}_{1},{{\varvec{Z}}}_{1}}{{\text{min}}}{\Vert {\varvec{X}}-{{\varvec{D}}}_{1}{{\varvec{Z}}}_{1}\Vert }_{F}^{2}\end{array}$$
(11)

\({{\varvec{Z}}}_{1}\) is not required to be sparse; sparsity is imposed only on the coefficients of the last layer.

The above optimization problem can be solved by alternately updating \({{\varvec{D}}}_{1}\) and \({{\varvec{Z}}}_{1}\), namely:

$$\begin{array}{c}{{\varvec{Z}}}_{1}={\left({{\varvec{D}}}_{1}^{{\varvec{T}}}{{\varvec{D}}}_{1}+\lambda {\varvec{I}}\right)}^{-1}{{\varvec{D}}}_{1}^{{\varvec{T}}}X\end{array}$$
(12)
$$\begin{array}{c}{{\varvec{D}}}_{1}={\varvec{X}}{{\varvec{Z}}}_{1}^{{\varvec{T}}}{\left({{\varvec{Z}}}_{1}{{\varvec{Z}}}_{1}^{{\varvec{T}}}+\lambda {\varvec{I}}\right)}^{-1}\end{array}$$
(13)

The two updates are alternated until convergence.
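A minimal numpy sketch of this alternating scheme (Eqs. (12)–(13)) follows; the random initialization, iteration count, and atom normalization are implementation choices assumed here, not specified by the paper.

```python
import numpy as np

def learn_layer(X, n_atoms, lam=0.1, n_iter=25, seed=0):
    """Alternate the updates of Eqs. (12) and (13) for one dictionary layer."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((X.shape[0], n_atoms))
    for _ in range(n_iter):
        Z = np.linalg.solve(D.T @ D + lam * np.eye(n_atoms), D.T @ X)      # Eq. (12)
        D = X @ Z.T @ np.linalg.inv(Z @ Z.T + lam * np.eye(n_atoms))       # Eq. (13)
        D /= np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1e-12)   # normalize atoms
    return D, Z
```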

For the second layer, we need to solve the following problems:

$$\begin{array}{c}\underset{{{\varvec{D}}}_{2},{{\varvec{Z}}}_{2}}{{\text{min}}}{\Vert {\varphi }^{-1}\left({{\varvec{Z}}}_{1}\right)-{{\varvec{D}}}_{2}{{\varvec{Z}}}_{2}\Vert }_{F}^{2}\end{array}$$
(14)

which can likewise be solved by alternating updates:

$$\begin{array}{c}{{\varvec{Z}}}_{2}={\left({{\varvec{D}}}_{2}^{{\varvec{T}}}{{\varvec{D}}}_{2}+\lambda {\varvec{I}}\right)}^{-1}{{\varvec{D}}}_{2}^{{\varvec{T}}}{\varphi }^{-1}\left({{\varvec{Z}}}_{1}\right)\end{array}$$
(15)
$$\begin{array}{c}{{\varvec{D}}}_{2}={\varphi }^{-1}\left({{\varvec{Z}}}_{1}\right){{\varvec{Z}}}_{2}^{{\varvec{T}}}{\left({{\varvec{Z}}}_{2}{{\varvec{Z}}}_{2}^{{\varvec{T}}}+\lambda {\varvec{I}}\right)}^{-1}\end{array}$$
(16)

This alternation proceeds layer by layer until the last layer:

$$\begin{array}{c}\underset{{{\varvec{D}}}_{{\varvec{N}}},{{\varvec{Z}}}_{{\varvec{N}}}}{{\text{min}}}{\Vert {\varphi }^{-1}\left({{\varvec{Z}}}_{{\varvec{N}}-1}\right)-{{\varvec{D}}}_{{\varvec{N}}}{{\varvec{Z}}}_{{\varvec{N}}}\Vert }_{F}^{2}+\lambda {\Vert {{\varvec{Z}}}_{{\varvec{N}}}\Vert }_{0}\end{array}$$
(17)

Here, the added regularization term requires the last-layer coefficients \({{\varvec{Z}}}_{{\varvec{N}}}\) to be sparse; the last layer can be solved with the K-SVD method.
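Since K-SVD itself is not part of the standard scientific Python stack, the sketch below uses scikit-learn's DictionaryLearning (alternating OMP sparse coding and dictionary updates) as a stand-in for the last layer; treat this substitution, and the sparsity level, as assumptions.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

def learn_last_layer(R, n_atoms, sparsity=3):
    """Sparse last layer of Eq. (17). R = phi^{-1}(Z_{N-1}), columns are samples."""
    dl = DictionaryLearning(n_components=n_atoms,
                            transform_algorithm='omp',
                            transform_n_nonzero_coefs=sparsity)
    Z_N = dl.fit_transform(R.T).T   # sparse codes, shape (n_atoms, n_samples)
    D_N = dl.components_.T          # last-layer dictionary, shape (dim, n_atoms)
    return D_N, Z_N
```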

First, deep dictionary learning is applied to the LR image patches; the resulting final LR dictionary can be expressed as:

$$\begin{array}{c}{{\varvec{D}}}_{{\varvec{l}}}={{\varvec{D}}}_{1}\varphi \left({{\varvec{D}}}_{2}\varphi \left(\dots \varphi \left({{\varvec{D}}}_{{\varvec{N}}}\right)\right)\right)\end{array}$$
(18)

where φ is the nonlinear activation function. Then, by forcing \({{\varvec{D}}}_{{\varvec{h}}}\) to share the same sparse coefficients as the last layer of \({{\varvec{D}}}_{{\varvec{l}}}\), \({{\varvec{D}}}_{{\varvec{h}}}\) can be built from the sparse coefficients and the HR patches:

$$\begin{array}{c}{{\varvec{D}}}_{{\varvec{h}}}=Y{\varphi \left({{\varvec{Z}}}_{{\varvec{N}}}\right)}^{T}{\left(\varphi \left({{\varvec{Z}}}_{{\varvec{N}}}\right){\varphi \left({{\varvec{Z}}}_{{\varvec{N}}}\right)}^{T}\right)}^{-1}\end{array}$$
(19)

where Y represents the HR patch matrix, and \({{\varvec{Z}}}_{{\varvec{N}}}\) are the sparse coefficients learned in the last layer of dictionary learning. The dictionary construction process is shown in Fig. 2.

Fig. 2 Dictionary building
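Putting Eqs. (18)–(19) into code, a minimal numpy sketch follows; tanh is assumed for φ and a tiny ridge term is added for invertibility, both of which are implementation choices rather than details from the paper.

```python
import numpy as np

def effective_lr_dictionary(dicts, phi=np.tanh):
    """Collapse the learned layers into D_l, Eq. (18)."""
    D = dicts[-1]
    for Dk in reversed(dicts[:-1]):
        D = Dk @ phi(D)
    return D

def hr_dictionary(Y, Z_N, phi=np.tanh, eps=1e-8):
    """HR dictionary from the shared last-layer codes, Eq. (19)."""
    A = phi(Z_N)
    G = A @ A.T + eps * np.eye(A.shape[0])   # small eps for numerical stability
    return Y @ A.T @ np.linalg.inv(G)
```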

3.3 Sampling anchor neighborhoods

For robust regressors to be anchored to dictionary atoms, each atom needs a sample neighborhood centered on it, in the sense of Euclidean distance or, after normalization to unit \(l_2\) norm, on the surface of the unit hypersphere. The local manifold on the unit hypersphere can be approximated by a set of adjacent samples. Moreover, since feature extraction establishes a consistent scale factor between an LR feature and its HR patch, when the LR patch features are brought onto the hypersphere by \(l_2\) normalization, the corresponding HR patches must be scaled by the same factor; this linear transformation preserves the relationship between the LR and HR spaces.

In the LR dictionary, each atom finds the several training samples closest to it to form the neighborhood \({{\varvec{S}}}_{{\varvec{l}}}\); these are drawn from the LR patches of the training set, and the distance metric for the nearest-neighbor search is the Euclidean distance between the anchor atom and the training samples. In the HR dictionary, the corresponding samples are extracted from the HR patches of the training set to form the neighborhood \({{\varvec{S}}}_{{\varvec{h}}}\). The projection matrix between each pair of LR and HR neighborhoods is then computed. After all dictionary atoms have been traversed, each atom has a corresponding group of neighbor samples and a projection matrix. The calculation of the projection matrix is described in detail in the next section.
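A compact sketch of this neighborhood sampling together with the per-atom projection matrices of Eq. (24); it assumes the LR features are \(l_2\)-normalized, so that correlation ordering matches Euclidean-distance ordering on the hypersphere.

```python
import numpy as np

def build_projections(D_l, train_lr, train_hr, K=2048, lam=0.1):
    """For each anchor atom, collect its K nearest training samples (S_l, S_h)
    and precompute the projection matrix of Eq. (24)."""
    P = []
    for j in range(D_l.shape[1]):
        sims = train_lr.T @ D_l[:, j]        # nearest neighbors on the unit sphere
        idx = np.argsort(-sims)[:K]
        S_l, S_h = train_lr[:, idx], train_hr[:, idx]
        P.append(S_h @ np.linalg.solve(S_l.T @ S_l + lam * np.eye(K), S_l.T))
    return P
```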

3.4 Super-resolution reconstruction

The LR image to be processed is input and segmented with the same patch size as in the dictionary training stage, and features are extracted from all LR patches. For each LR feature patch x, finding the anchor atom whose neighborhood it is closest to can be expressed as the following optimization problem:

$$\begin{array}{c}\underset{{\varvec{\beta}}}{{\text{min}}}{\Vert {\varvec{x}}-{{\varvec{S}}}_{{\varvec{l}}}{\varvec{\beta}}\Vert }_{2}^{2}+\lambda {\Vert {\varvec{\beta}}\Vert }_{2}\end{array}$$
(20)

where \({\varvec{x}}\) represents the input LR patch feature, and \({{\varvec{S}}}_{{\varvec{l}}}\) represents the neighborhood of the matched LR dictionary atom. Solving this optimization problem yields the coefficients \({\varvec{\beta}}\), with the closed-form solution:

$$\begin{array}{c}\beta ={\left({{\varvec{S}}}_{{\varvec{l}}}^{{\varvec{T}}}{{\varvec{S}}}_{{\varvec{l}}}+\lambda {\varvec{I}}\right)}^{-1}{{\varvec{S}}}_{{\varvec{l}}}^{{\varvec{T}}}x\end{array}$$
(21)

The projection matrix belonging to this anchor atom is then used directly to obtain the corresponding HR patch y, that is:

$$\begin{array}{c}y={{\varvec{S}}}_{{\varvec{h}}}\beta \end{array}$$
(22)

where \({\varvec{y}}\) represents the output HR image patch, and \({{\varvec{S}}}_{{\varvec{h}}}\) represents the neighborhood of the HR dictionary atom corresponding to \({{\varvec{S}}}_{{\varvec{l}}}\). Substituting Eq. (21) into Eq. (22) gives:

$$\begin{array}{c}y={{\varvec{S}}}_{{\varvec{h}}}{\left({{\varvec{S}}}_{{\varvec{l}}}^{{\varvec{T}}}{{\varvec{S}}}_{{\varvec{l}}}+\lambda {\varvec{I}}\right)}^{-1}{{\varvec{S}}}_{{\varvec{l}}}^{{\varvec{T}}}x\end{array}$$
(23)

According to Eq. (7), the projection matrix can be expressed as:

$$\begin{array}{c}{{\varvec{P}}}_{{\varvec{j}}}={{\varvec{S}}}_{{\varvec{h}}}{\left({{\varvec{S}}}_{{\varvec{l}}}^{{\varvec{T}}}{{\varvec{S}}}_{{\varvec{l}}}+\lambda {\varvec{I}}\right)}^{-1}{{\varvec{S}}}_{{\varvec{l}}}^{{\varvec{T}}}\end{array}$$
(24)

After the sampling of the LR and HR neighborhoods is completed, the projection matrices can be computed.

This computation is performed for all input patches until every low-resolution patch has a corresponding reconstructed high-resolution patch. All reconstructed high-resolution patches are placed back according to the original segmentation coordinates (values in overlapping areas are averaged) to obtain the final high-resolution image. The super-resolution reconstruction process is shown in Fig. 3.

Fig. 3 Super-resolution reconstruction
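A sketch of the reconstruction loop described above (nearest-anchor search and projection; patch re-assembly with overlap averaging is left as a comment), with hypothetical function and variable names:

```python
import numpy as np

def reconstruct_patches(features_lr, D_l, P):
    """Map each LR feature patch to an HR patch via Eqs. (20)-(23),
    using the projection matrix P[j] of its nearest anchor atom."""
    out = []
    for x in features_lr.T:                    # one column per patch feature
        j = int(np.argmax(np.abs(D_l.T @ x)))  # nearest anchor atom
        out.append(P[j] @ x)                   # Eq. (23) via precomputed P_j
    # the HR patches are then pasted back at their source coordinates,
    # overlapping pixels averaged, and the bicubic upscale added (Sect. 3.1)
    return np.stack(out, axis=1)
```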

4 Experimental results and analysis

This section compares the proposed method with several typical image super-resolution reconstruction methods: NE + LLE, Yang [17], Zeyde [18], ANR [19], GR [19], MCSR [33], A+ (with the same dictionary size) [20], and CDDL [30]. The proposed method is also compared with several methods from recent years: Huang's method [29], the recurrent residual regressor (RRR) [34], single image super-resolution (SISR) [35], deep-learning-based dilated convolution (DC) [36], and the multi-scale encoder decoder (MSED) [37]. Objective evaluation uses the two most common image quality metrics, PSNR (peak signal-to-noise ratio) and SSIM (structural similarity), defined as:

$$\begin{array}{c}PSNR=10{{\text{log}}}_{10}\left(\frac{{255}^{2}\cdot {\text{MN}}}{{\Vert \widehat{{\varvec{x}}}-{\varvec{x}}\Vert }^{2}}\right)\end{array}$$
(25)
$$\begin{array}{c}SSIM=\frac{\left(2{\mu }_{\widehat{x}}{\mu }_{x}+{C}_{1}\right)\left(2{\sigma }_{\widehat{x}x}+{C}_{2}\right)}{\left({\mu }_{\widehat{x}}^{2}+{\mu }_{x}^{2}+{C}_{1}\right)\left({\sigma }_{\widehat{x}}^{2}+{\sigma }_{x}^{2}+{C}_{2}\right)}\end{array}$$
(26)

where \(\widehat{{\varvec{x}}}\) represents the reconstructed image, \({\varvec{x}}\) represents the original image, M and N are the numbers of rows and columns of the image, \(\mu \) denotes the mean, \({\sigma }^{2}\) the variance, and \({\sigma }_{\widehat{x}x}\) the covariance; \({C}_{1}\) and \({C}_{2}\) are constants. The larger the PSNR value, the better the reconstructed image quality; the closer the SSIM value is to 1, the more similar the reconstructed image is to the original.
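PSNR as defined in Eq. (25) is a one-liner; for SSIM, skimage.metrics.structural_similarity provides a standard implementation. A minimal sketch for 8-bit images:

```python
import numpy as np

def psnr(x_hat, x):
    """Eq. (25): 10 log10(255^2 * M * N / ||x_hat - x||^2) = 10 log10(255^2 / MSE)."""
    mse = np.mean((x_hat.astype(np.float64) - x.astype(np.float64)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```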

The datasets used in the experiments are Set5, Set14, BSDS100, Manga109, and Urban100. Set5 and Set14 are low-complexity single-image super-resolution datasets originally used with non-negative neighborhood embedding; they are used for single-image super-resolution reconstruction, that is, reconstructing HR images from LR images to recover more detail. BSDS100 contains 100 natural images, Urban100 contains 100 urban landscape images, and Manga109 contains 109 comic images. These three datasets are very challenging because they contain diverse scenes and genres.

4.1 Experimental parameters

The experimental platform is an Intel Core i7-11800H@2.30 GHz with 64-bit Windows 10 Professional and Matlab R2019a. The dictionaries used in this experiment are two-layer dictionaries. When the magnification is 2, 3, and 4, the sizes of the deep dictionaries are 2048 × 1024 × 512, 4096 × 2048 × 1024 × 512, and 8192 × 4096 × 2048 × 1024, respectively. The neighborhood size is K = 2048.

4.2 Experimental results and analysis

Tables 2 and 3 give the quantitative experimental results on the Set5 dataset. Figure 4 shows the results of the proposed method at scale 3 on the Set14 dataset.

Table 2 PSNR/dB of each algorithm with different zoom scales on the Set5 dataset
Table 3 SSIM of each algorithm at different zoom scales on the Set5 dataset
Fig. 4 PSNR at scale 3 on the Set14 dataset

From the experimental results, it is not difficult to see that, compared with some classical methods, the proposed method clearly improves PSNR and SSIM. Table 4 compares the proposed method with recent methods at zoom scale 2: our method is very competitive in PSNR, and in terms of SSIM it achieves good results on most of the test images. Table 5 gives the quantitative results on the BSDS100, Manga109, and Urban100 datasets.

Table 4 Comparison with recent methods (zoom scale 2)
Table 5 PSNR and SSIM on the BSDS100 dataset, Manga109 dataset, and Urban100 dataset

Figures 5 and 6 show visual results for parts of the super-resolved images. Bicubic interpolation gives the worst reconstruction: it is quite blurry, misses much high-frequency detail, and compares poorly with the original image. The algorithm of Yang et al. performs better and retains most of the information of the original image, but the edges of the reconstructed image are not sharp enough, so the overall effect is still limited. Zeyde's algorithm recovers edge information better than Yang's, but its visual quality is slightly poor. The NE + LLE reconstruction has good detail but strong graininess, which harms the visual impression. The A+ method, with the same dictionary size as ours, is not ideal either, and some regions remain blurred. The proposed method combines A+ with deep dictionary learning, producing reconstructions with clear edges and better visual quality than the previous methods. We have enlarged one area in each result, shown as a box in the figure. The enlargement in Fig. 5 shows that the proposed method has fewer artifacts and a sharper reconstruction than the other methods; the enlargement in Fig. 6 shows that it better removes the jagged edges that appear in the other methods, again with better visual quality.

Fig. 5 The reconstruction effect of each algorithm on the butterfly image at zoom scale 3

Fig. 6 The reconstruction effect of each algorithm on the foreman image at zoom scale 3

4.3 Ablation experiment

We conducted several parameter studies on the proposed method, varying the number of layers and the dictionary sizes. Figure 7a shows the PSNR for different λ on the Set5 dataset at magnification factor 2 with a dictionary of size 1024 × 512 × 256; λ is sampled at intervals of 0.05. As λ increases, the PSNR first rises and then gradually decreases, peaking at around λ = 0.15. Figure 7b shows the PSNR on Set5 when the magnification factor is 2, the first and last dictionary sizes are 1024 and 256 respectively, λ is 0.1, and the number of dictionary layers varies. As the number of intermediate layers increases, the PSNR is flat at first and then trends downward; we conjecture that too many layers cause over-fitting. Figure 7c shows the PSNR on Set5 when the magnification is 2, \(\lambda \) is 0.1, the first-layer dictionary size is 2048, and both the number of layers and the last-layer dictionary size vary. The PSNR gradually decreases because the overall dictionary size shrinks, which reduces the reconstruction quality.

Fig. 7 a PSNR with different λ (top left); b PSNR with different numbers of layers (top right); c PSNR with different numbers of layers and different last-layer dictionary sizes (bottom)

5 Conclusion

To address the fact that existing image super-resolution reconstruction algorithms ignore deep image features, this paper proposed a deep dictionary model for image super-resolution, which extracts image features to train a deep dictionary and then performs super-resolution reconstruction with adjusted anchored neighborhood regression. The algorithm effectively reduces the size of the dictionary. Experimental results show that, for the same dictionary size, its reconstructions are better than those of several image super-resolution methods, with good improvements in the objective evaluation criteria PSNR and SSIM.