
1 Introduction

The single-image super-resolution (SR) problem has long been a topic of interest in image processing. The goal is to recover a high-resolution (HR) image from its low-resolution (LR) counterpart. However, this is an ill-posed inverse problem, so prior knowledge is needed to make the solution unique and stable. Many articles have proposed methods to address this problem, and they can be roughly divided into three categories: interpolation-based, reconstruction-based and learning-based. Among them, the third category has attracted the most research attention. For example, example-learning-based methods [1,2,3,4] employ a database of co-occurrence examples drawn from a training set of HR and LR image patches. Since they rely heavily on the similarity between the training set and the test set, they are not very practical in some situations. Another kind of efficient learning-based method [5,6,7,8] uses sparse-representation modeling to deal with this problem. Sparse-representation theory assumes a linear relationship between the high- and low-dimensional spaces, so that a high-dimensional signal can be restored accurately from its low-dimensional projection [6]. In addition, [7] found that image patches can be well represented as sparse linear combinations of elements from an appropriately chosen over-complete dictionary; they therefore built a compact representation of these patch pairs to capture the co-occurrence prior, which improved speed and robustness significantly and achieved much better performance. Later, [5] modified the above approach in several respects, including computational complexity and algorithm architecture, and proved to be more efficient and much faster than [7]. Because of the limitation in recovering high-frequency details and the wide gap between the frequency spectrum of the corresponding HR image and that of the initial interpolation, [8] put forward a dual-dictionary learning method via sparse representation for image super-resolution, which consists of two steps to bridge this gap. First, a learned main high-frequency dictionary is used to remove most of the gap in the frequency spectrum. Then, a residual high-frequency dictionary is trained to recover the remaining residual high-frequency signal. According to [8], this obtains better results than [5] in terms of PSNR.

However, the gap between the frequency spectrum of the corresponding HR image and that of the initial interpolation is so wide that a two-layer progressive estimation of high frequency is not enough to recover all of the high-frequency details. To alleviate this problem, a multi-stage dictionary learning method is proposed. First, multiple stages of dictionaries are trained offline, each containing both a high-resolution and a low-resolution part. Then, high-frequency details are compensated stage by stage using these dictionaries via sparse representation until the gap is small enough. This scheme can be viewed as a cascaded coarse-to-fine recovery process, and the results in the experimental section show that the proposed method is effective.

The remainder of this paper is organized as follows. Related methods and research were introduced above in this section. In Sect. 2, the proposed SR scheme is described in detail, including dictionary learning in Sect. 2.1 and image restoration in Sect. 2.2. Section 3 presents experimental results from different perspectives, and Sect. 4 draws conclusions.

2 Method

When an image is captured, it is easily affected by factors such as geometric deformation, blur, noise and down-scaling. Assuming that the original scene corresponds to an HR image, the actually obtained result is an LR image. This process can be described by formulation (1):

$${\text{y}} = {\text{GBDx}} + {\text{n}}$$
(1)

where x is the original HR image and y is the observed LR image; G denotes the geometric deformation operator, B the blurring operator, D the down-scaling operator, and n the additive Gaussian noise.
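
For concreteness, the degradation in (1) can be simulated as in the following sketch, which uses NumPy and OpenCV; the geometric deformation G is omitted, the blur and noise parameters are illustrative assumptions, and the operators are applied in the usual blur-then-downsample order:

```python
import cv2
import numpy as np

def degrade(x, scale=2, blur_sigma=1.0, noise_std=2.0):
    """Simulate y = GBDx + n of Eq. (1); G is omitted in this sketch.

    x: HR image as a float32 array.
    """
    # B: Gaussian blurring with a 5 x 5 kernel
    blurred = cv2.GaussianBlur(x, (5, 5), blur_sigma)
    # D: down-scaling by the given factor (cv2.resize expects (width, height))
    h, w = blurred.shape[:2]
    down = cv2.resize(blurred, (w // scale, h // scale),
                      interpolation=cv2.INTER_CUBIC)
    # n: additive Gaussian noise
    return down + np.random.normal(0.0, noise_std, down.shape).astype(np.float32)
```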

It can be seen that solving for x is an ill-posed inverse problem. As a learning-based method, sparse representation obtains the sparse coefficients linking LR and HR images via a trained over-complete dictionary pair, which avoids solving the equation directly. Both dictionary training and image generation are required. We describe the training process in Fig. 1 and the image generation process in Fig. 2.

Fig. 1 The framework of the dictionary learning stage

Fig. 2 The framework of the image synthesis stage

2.1 Offline Dictionary Learning

In this stage, multi-stage dictionaries \(D_{1} ,\, D_{2} ,\, D_{3} , \ldots\) are trained using sparse representation. Each dictionary, e.g. \(D_{1}\), has two parts: a low-frequency dictionary \(\left( {LD_{1} } \right)\) and a high-frequency dictionary \(\left( {HD_{1} } \right)\). Our training scheme is similar in spirit to that of [7].

As shown in Fig. 1, \({\text{H}}_{\text{l}}\) and \({\text{H}}_{\text{h}}\), which denote the HR low-frequency image and the HR high-frequency image, form the first input pair used to train the first-stage dictionary \(D_{1}\); some pre-processing of the given original HR image \({\text{H}}_{\text{o}}\) has to be done to obtain them before the actual training. First, we blur and down-sample \({\text{H}}_{\text{o}}\) to obtain the LR image \({\text{L}}_{l}\). Then, bi-cubic interpolation is applied to \({\text{L}}_{l}\) to construct the image \({\text{H}}_{l}\), which is of the same size as \({\text{H}}_{\text{o}}\). Finally, the image \({\text{H}}_{\text{h}}\) is generated by subtracting \({\text{H}}_{\text{l}}\) from \({\text{H}}_{\text{o}}\).
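
A minimal sketch of this pre-processing is given below; the function name is ours, and the blur and scale parameters follow the settings listed in Sect. 3:

```python
import cv2
import numpy as np

def build_training_pair(H_o, scale=2, blur_sigma=1.0):
    """Derive the HR low-frequency image H_l and high-frequency image H_h from H_o."""
    H_o = H_o.astype(np.float32)
    h, w = H_o.shape[:2]
    # Blur and down-sample H_o to obtain the LR image L_l
    L_l = cv2.resize(cv2.GaussianBlur(H_o, (5, 5), blur_sigma),
                     (w // scale, h // scale), interpolation=cv2.INTER_CUBIC)
    # Bi-cubic interpolation back to the size of H_o gives H_l
    H_l = cv2.resize(L_l, (w, h), interpolation=cv2.INTER_CUBIC)
    # The missing high-frequency image H_h = H_o - H_l
    return H_l, H_o - H_l
```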

Since each stage dictionary has two coupled sub-dictionaries \(\left( {LD_{1} ,HD_{1} } \right)\), we need to extract local patches from \({\text{H}}_{l}\) and \({\text{H}}_{\text{h}}\) to form the training data \(\left\{ {\text{pa}}_{\text{l}}^{n} ,{\text{pa}}_{\text{h}}^{n} \right\}\), where \({\text{pa}}_{\text{h}}^{\text{n}}\) is the set of patches extracted directly from the image \({\text{H}}_{\text{h}}\), while \({\text{pa}}_{\text{l}}^{\text{n}}\) is built in another way, explained in detail in [8].
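
The patch collection step might look like the following sketch; here `H_l_feat` stands for a stack of filter responses of \({\text{H}}_{l}\) (the high-pass filtering and PCA projection of [8] are not reproduced), and the 9 × 9 patch size with 1-pixel overlap follows Sect. 3:

```python
import numpy as np

def extract_patch_pairs(H_l_feat, H_h, patch=9, step=8):
    """Collect co-located training patches {pa_l^n, pa_h^n}.

    H_l_feat: (k, H, W) stack of feature maps of H_l (simplified stand-in
              for the high-pass/PCA features of [8]).
    H_h:      (H, W) high-frequency image.
    """
    pa_l, pa_h = [], []
    H, W = H_h.shape
    for i in range(0, H - patch + 1, step):          # step = patch - overlap
        for j in range(0, W - patch + 1, step):
            pa_h.append(H_h[i:i + patch, j:j + patch].ravel())
            pa_l.append(H_l_feat[:, i:i + patch, j:j + patch].ravel())
    return np.array(pa_l), np.array(pa_h)            # rows are samples
```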

The two sub-dictionaries \(LD_{1}\) and \(HD_{1}\) are generated using the following Eqs. (2) and (3). Formulation (2) is the K-SVD dictionary learning [9] procedure, and formulation (3) is based on the assumption that high-dimensional image patches can be accurately recovered from their low-dimensional projections.

$${\text{LD}},\left\{ {q^{n} } \right\} = {\text{argmin}}\sum\nolimits_{n} {\left\| {pa_{l}^{n} - LD \cdot q^{n} } \right\|_{2}^{2} } ,\,{\text{s}}.{\text{t}}. \left\| {q^{n} } \right\|_{0} \le L, \forall n$$
(2)

where \(\left\{ {{\text{q}}^{\text{n}} } \right\}_{n}\) are sparse representation vectors, and \(|| \cdot ||_{0}\) is the \({\text{l}}_{0}\) norm counting the nonzero entries of a vector.

$${\text{HD}} = {\text{argmin}}\sum\nolimits_{n} {\left\| {pa_{h}^{n} - HD \cdot q^{n} } \right\|_{2}^{2} } = {\text{argmin}}\left\| {P_{h} - HD \cdot {\text{Q}}} \right\|_{F}^{2}$$
(3)

where the matrices \({\text{P}}_{\text{h}} = \left\{ {{\text{pa}}_{\text{h}}^{n} } \right\}_{\text{n}}\) and \({\text{Q}} = \left\{ {{\text{q}}^{n} } \right\}_{\text{n}}\) collect the high-frequency patches and the sparse representation vectors as their columns, respectively.
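
A rough sketch of this two-step training is shown below. Scikit-learn's OMP-based dictionary learner is used as a stand-in for K-SVD [9], so this is only an approximation of Eq. (2), and the data are arranged row-wise (one patch per row) rather than column-wise as in Eqs. (2)–(3):

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

def train_stage_dictionary(P_l, P_h, n_atoms=500, L=3):
    """Learn one coupled pair (LD, HD) from paired training patches.

    P_l: (n_samples, d_l) low-frequency feature patches.
    P_h: (n_samples, d_h) high-frequency patches.
    """
    # Eq. (2): learn LD and sparse codes q^n with at most L atoms each
    # (stand-in for K-SVD).
    learner = MiniBatchDictionaryLearning(n_components=n_atoms,
                                          transform_algorithm='omp',
                                          transform_n_nonzero_coefs=L)
    Q = learner.fit(P_l).transform(P_l)   # sparse codes, (n_samples, n_atoms)
    LD = learner.components_              # (n_atoms, d_l)

    # Eq. (3): HD is the least-squares fit that maps the same codes to the
    # high-frequency patches, i.e. argmin || P_h - Q * HD ||_F^2 row-wise.
    HD, *_ = np.linalg.lstsq(Q, P_h, rcond=None)   # (n_atoms, d_h)
    return LD, HD, learner
```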

So far, the first-stage dictionary \({\text{D}}_{1}\) has been trained; we then set a stage number n and train the remaining stage dictionaries. The next stage dictionary is built with the same dictionary learning method as \({\text{D}}_{1}\). As the input training image of the next stage, \({\text{H}}_{\text{l}}^{1}\) is generated by adding \({\text{H}}_{\text{l}}\) and \(\widehat{\text{H}}_{h}^{1}\), so that it contains more details \(\left( {\widehat{\text{H}}_{h}^{1} } \right)\) than \({\text{H}}_{l}\). It is important to note that every subsequent stage dictionary \({\text{D}}_{\text{i}}\) also consists of two coupled sub-dictionaries: a low-frequency residual dictionary \(\left( {{\text{LD}}_{\text{i}} } \right)\) and a high-frequency residual dictionary \(\left( {{\text{HD}}_{\text{i}} } \right)\).

Finally, all the remaining stage dictionaries are trained in the same way as described above. Theoretically, each later-stage dictionary contains less high-frequency signal than the previous one while the dimensionality of the dictionary is higher, and at some point a further dictionary may be of little use in compensating the high-frequency signal.
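
Putting the pieces together, the cascade training could be organized as in the sketch below. The helpers `build_training_pair`, `extract_patch_pairs` and `train_stage_dictionary` are the ones sketched earlier, `synthesize_stage` is sketched in Sect. 2.2, `extract_features` stands for the high-pass/PCA features of [8], and the residual update of \({\text{H}}_{h}\) reflects our reading of the multi-stage scheme:

```python
def train_cascade(H_o, n_stages=9):
    """Train D_1, ..., D_n, each a coupled pair (LD_i, HD_i)."""
    dictionaries = []
    H_l, H_h = build_training_pair(H_o)       # stage-1 input pair (H_l, H_h)
    for _ in range(n_stages):
        P_l, P_h = extract_patch_pairs(extract_features(H_l), H_h)
        LD, HD, learner = train_stage_dictionary(P_l, P_h)
        dictionaries.append((LD, HD, learner))
        # Estimate the high frequency recoverable by this stage, then feed the
        # refined image H_l^i = H_l + H_hat_h^i into the next stage.
        H_hat_h = synthesize_stage(H_l, HD, learner)
        H_l = H_l + H_hat_h
        H_h = H_h - H_hat_h                   # residual high frequency still missing
    return dictionaries
```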

2.2 Online Image Generation

After the offline training stage, multiple stages of dictionaries have been generated. Each stage dictionary can be used to compensate some high-frequency component of the low-resolution image, and in theory more high-frequency details can be obtained via a cascaded compensation strategy. However, too much compensation is unnecessary and may even cause distortion. Generally speaking, it is hard to determine how many dictionary stages should be used to generate the final HR image, because there is no strict evaluation standard for the result. In this paper, we select the PSNR value as an indicator: when the PSNR value declines or stays the same, we stop before the next image synthesis stage.
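
A minimal sketch of this PSNR-based stopping rule follows; note that it assumes a reference HR image is available, as in the experiments of Sect. 3:

```python
import numpy as np

def psnr(ref, img, peak=255.0):
    """Peak signal-to-noise ratio in dB."""
    mse = np.mean((ref.astype(np.float64) - img.astype(np.float64)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def should_stop(psnr_prev, psnr_curr):
    """Stop when the PSNR declines or stays the same between stages."""
    return psnr_curr <= psnr_prev
```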

As shown in Fig. 2, \({\text{H}}_{h}^{\text{n}}\) is the final synthesized image. \({\text{H}}_{h}^{1} ,{\text{H}}_{h}^{2} , \ldots ,{\text{H}}_{h}^{\text{i}}\) are the intermediate synthesized images after each stage of dictionary representation, each of which also serves as the input to the next stage. \(\widehat{\text{H}}_{h}^{1} ,\widehat{\text{H}}_{h}^{2} , \ldots ,\widehat{\text{H}}_{h}^{\text{i}}\) are the estimated lost high-frequency components of the input image at each stage.

Each stage of image synthesis follows the same procedure to restore the loss. Taking the first stage as an example, suppose that an input LR image denoted by \({\text{L}}_{l}\) has undergone the same pre-processing as in Sect. 2.1 to obtain an HR-sized image; \({\text{H}}_{l}\) is then the first input of the super-resolution process. Using the dictionary \({\text{D}}_{1}\) and the method in [5], the first-stage high-frequency image \(\widehat{\text{H}}_{h}^{1}\) is generated, which contains only the lost high-frequency signal and is added to the input LR image \({\text{H}}_{l}\).

First, \({\text{H}}_{l}\) is filtered with the same high-pass filters and PCA projection as in the training stage, and is then decomposed into overlapped patches \(\{ {\text{pa}}_{\text{l}}^{n} \}_{n}\). Next, the traditional OMP method [8] is employed to code \(\{ {\text{pa}}_{\text{l}}^{n} \}_{n}\), computing the sparse representation vectors \(\left\{ {{\text{q}}^{n} } \right\}_{n}\) by allocating L atoms to their representation under \({\text{LD}}_{1}\). The HR image patches can then be reconstructed by the formula \(\left\{ {\widehat{\text{pa}}_{\text{h}}^{n} } \right\}_{n} = \left\{ {{\text{HD}}_{1} \cdot {\text{q}}^{n} } \right\}_{n}\). Finally, the first high-frequency loss \(\widehat{\text{H}}_{h}^{1}\) is generated by solving the following minimization problem (4):

$$\widehat{\text{H}}_{h}^{1} = {\text{argmin}}\sum\nolimits_{n} {\left\| {R_{n} \widehat{\text{H}}_{h}^{1} - \widehat{\text{pa}}_{\text{h}}^{n} } \right\|_{ 2}^{2} }$$
(4)

More details of the solution can be found in [8]. Then, the first HR temporary image \({\text{H}}_{l}^{1}\), which contains more details than \({\text{H}}_{l}\), is built by adding \(\widehat{\text{H}}_{h}^{1}\) to \({\text{H}}_{l}\).
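
One synthesis stage might be sketched as follows, in the same row-wise convention as the training sketch; `extract_features` again stands in for the high-pass/PCA features, \({\text{LD}}_{1}\) is carried inside `learner`, and the closed-form solution of (4) reduces to averaging the overlapping patch estimates:

```python
import numpy as np

def synthesize_stage(H_l, HD, learner, patch=9, step=8):
    """Estimate the missing high-frequency image for one stage."""
    feats = extract_features(H_l)              # same filters/PCA as in training
    H, W = H_l.shape
    acc = np.zeros((H, W))                     # sum of overlapping patch estimates
    cnt = np.zeros((H, W))                     # how many estimates cover each pixel
    for i in range(0, H - patch + 1, step):
        for j in range(0, W - patch + 1, step):
            pa_l = feats[:, i:i + patch, j:j + patch].ravel()[None, :]
            q = learner.transform(pa_l)        # OMP coding against LD_1 (L atoms)
            pa_h = (q @ HD).reshape(patch, patch)   # HD_1 * q^n
            acc[i:i + patch, j:j + patch] += pa_h
            cnt[i:i + patch, j:j + patch] += 1.0
    # Eq. (4): the least-squares aggregation is the per-pixel average of the
    # overlapping estimates (uncovered border pixels stay zero in this sketch).
    return acc / np.maximum(cnt, 1.0)
```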

In the same way, \({\text{H}}_{h}^{2}\) can be generated using \({\text{H}}_{l}^{1}\) and \({\text{D}}_{2}\), then \({\text{H}}_{l}^{3} , \ldots ,{\text{H}}_{l}^{\text{i}}\) and so on, until the stopping condition is reached. The last synthesized HR image \({\text{H}}_{l}^{\text{n}}\) contains many more details than the initial interpolated image \({\text{H}}_{l}\). The stopping condition has been explained earlier in this section. Some synthesized image results are shown in Fig. 3.
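
The full cascade, including the stopping rule, then reads roughly as below, reusing `synthesize_stage`, `psnr` and `should_stop` from the earlier sketches:

```python
def super_resolve(H_l, dictionaries, H_ref=None):
    """Cascade high-frequency compensation over the trained stages.

    H_ref: optional reference HR image used for the PSNR stopping rule
           (available in the experiments of Sect. 3).
    """
    current = H_l.copy()
    prev = psnr(H_ref, current) if H_ref is not None else None
    for LD, HD, learner in dictionaries:
        candidate = current + synthesize_stage(current, HD, learner)
        if H_ref is not None:
            curr = psnr(H_ref, candidate)
            if should_stop(prev, curr):
                break                      # PSNR declined or stayed the same
            prev = curr
        current = candidate
    return current
```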

Fig. 3 Visual comparison of different methods: a bicubic interpolation; b J. Zhang et al. [8]; c our method; d original images

3 Experiments

Extensive experiments on image super-resolution using our method are presented in this section. Bi-cubic interpolation is a relatively sophisticated interpolation method and is regarded as the best of the interpolation-based super-resolution methods; its overall performance is comparable with sparse representation, so we employ it as a baseline in this paper. In addition, we compare against the similar method in [8] to illustrate the advantages of our method.

First, we trained 9 stages of dictionaries as an offline library for the image synthesis step in Sect. 2.2. In order to compare our performance with the bi-cubic interpolation method and the dual-dictionary learning method [8], we use the same parameters as [8]: the Gaussian filter size and standard deviation of the blurring operator are set to 5 × 5 and 1, respectively; the down-sampling scale factor of the decimation operator is set to 2; and the size of each stage dictionary is set to 500. Besides, the number of atoms used to represent each image patch is fixed to 3, and the size of each image patch is 9 × 9 with an overlap of 1 pixel.
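
For reference, these settings can be collected in one place (values copied from the paragraph above; the key names are ours):

```python
# Experimental settings, following [8]
PARAMS = {
    "gaussian_kernel": (5, 5),   # blur filter size
    "gaussian_sigma": 1.0,       # blur standard deviation
    "scale_factor": 2,           # down-sampling factor
    "dict_size": 500,            # atoms per stage dictionary
    "n_nonzero_coefs": 3,        # atoms per patch representation (L)
    "patch_size": 9,             # 9 x 9 image patches
    "overlap": 1,                # patch overlap in pixels
    "n_stages": 9,               # trained dictionary stages
}
```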

Some experimental results are shown in Fig. 4, which separately shows the PSNR and FSIM results for different numbers of dictionary stages used in the image synthesis step. Each curve represents a test image, and each point on a curve is the evaluation result corresponding to the stage on the X axis. From the figures, we can see that PSNR and FSIM increase significantly over the first several stages and then remain almost the same or fluctuate slightly. Here, PSNR is the most widely used objective quality measurement, and FSIM indicates the similarity between the original image and the reconstructed image, ranging from 0 to 1. For both measures, larger values are better.

Fig. 4 PSNR and FSIM results on different test images

To show the performance of the proposed method intuitively, Table 1 compares the different methods in terms of PSNR. It can be seen that the proposed method gains much better PSNR results than the methods mentioned above, with improvements of 3.45 dB and 0.48 dB, respectively. The last column shows how much the proposed method gains over Zhang's method [8], which was itself reported to be better than the state-of-the-art method [5]. In conclusion, our method is effective.

Table 1 PSNR comparisons with different algorithms (dB)

4 Conclusions

This paper presents a novel image super-resolution approach via multi-stage dictionary learning based on sparse representation, which restores a high-resolution image from a low-resolution one by a series of progressive high-frequency compensations using multi-stage dictionaries. Experimental results show that the proposed method is able to narrow the gap between the frequency spectrum of the corresponding HR image and that of the initial interpolation, thus achieving better results in terms of both PSNR and FSIM. However, our method may be somewhat time-consuming because of the repeated high-frequency compensation. In future work, we will address this issue.