1 Introduction

Image inpainting [5, 6, 10, 24, 25, 29], deblurring [3, 8, 9, 13, 21, 31] and denoising [3, 4, 7, 12, 14, 38, 43] are widely studied ill-posed problems in machine vision and image processing. They remain far from solved because the missing content of an image is inherently difficult to estimate. Inpainting is the process of reconstructing lost or deteriorated parts of an image so that the result looks natural and visually plausible. Inpainting technology is widely used to restore damaged photographs, remove unwanted objects or text, and replace objects. Motion blur results from relative motion between the camera and the scene during the exposure time; blur may also arise from camera shake at imaging time or from noise introduced when saving images. Deblurring attempts to recover the original sharp content, remove the noise, and enhance image quality; it remains an active topic because of the many challenges it raises in regularization and optimization. Denoising algorithms seek to remove noise, errors, or perturbations from an image while preserving as many image details as possible. Previous research commonly assumes that image noise is additive white Gaussian noise [38], yet in many cases the noise is not stationary and its variance is difficult to estimate. Figure 1 shows typical examples of these problems.

Fig. 1

Examples of the three problems mentioned above. The upper-left image is missing its central part. The upper-middle image is blurred, so its visual effect is unclear. The upper-right image is noisy. The lower images are the desired corresponding results after inpainting, deblurring and denoising

In practice, these three problems often arise concurrently rather than in isolation. For example, severe noise will cause an image to appear blurred, and if the blurred area is concentrated in one region, the task becomes an inpainting problem. It is therefore necessary to consider these problems as a whole and address them jointly. Most current methods, however, treat the problems separately and focus on solving only one of them. In general, part of the content of an inpainted image is not sharp, such as the filled region and the distorted areas where obstructions were removed. To obtain a visually plausible image, deblurring and denoising are therefore necessary.

In this paper, we propose a deep cascade of neural networks that handles these multiple ill-posed image problems in a single unified step. The model learns to fill holes, deblur, and denoise the image at the same time. The key technical challenge is inferring details of an image that do not actually exist in the input data. Our approach is inspired by generative adversarial networks (GANs) [17], a powerful generative approach to probabilistic modeling. Although natural images are diverse, in most cases they are highly structured and coherent. These properties make it possible for a well-trained GAN to capture the structure and patterns of an image even when given a corrupted image as input. By establishing a specially designed cascade network structure and setting up a progressive training strategy, we integrate the three originally separate processes into one holistic procedure.

Our contributions can be summarized as follows:

  1.

    We propose a cascade of deep neural networks that handles inpainting, deblurring, and denoising through a unified process. In the training step, we train the whole network by pipelining the procedures, instead of training two separate models to accomplish these tasks, so the model learns multiple tasks after training. Besides, in the inference step, we obtain not only the final output image after inpainting, deblurring and denoising, but also the intermediate result after the inpainting step.

  2.

    We propose a gradual training strategy. In the first step, we train only the inpainting part of the network, which has a GAN-like architecture. Once the inpainting part is well trained, we begin training the deblurring and denoising part of the network. Note that at this step the parameters of the inpainting part are not frozen; they are jointly optimized with the latter part.

  3.

    We evaluate the proposed model on several datasets and demonstrate that its performance is competitive. The final output looks more natural, and the inpainted region blends more smoothly with its surroundings. This shows that the cascade of deep neural networks can learn to handle these inverse vision tasks.

The rest of this paper is organized as follows. Section 2 briefly reviews related work on current methods that handle these problems. Section 3 presents the proposed method in detail, and Section 4 gives experimental results verifying its effectiveness. Finally, we conclude the paper in Section 5.

2 Related work

A variety of techniques have been proposed to handle the tasks mentioned above. In the inpainting and denoising fields, traditional structure-based and texture-based methods have been studied for a long time; for the deblurring problem, the main methods are based on kernel estimation. In recent years, Convolutional Neural Networks (CNNs) have shown outstanding performance in many tasks, including classification [23, 26, 32], object detection [16], segmentation [27], NLP [33], behavior analysis [34, 44] and so on. Deep learning approaches have also been introduced for generative vision problems [17], such as image generation [30], image-to-image translation [20], and video prediction [39].

Existing methods addressing the inpainting problem can be divided into several categories, such as structural inpainting [6, 25], texture synthesis [5, 6], and example-based methods [10, 11]. Structural inpainting uses geometric approaches to fill in the missing information in a region; these algorithms focus on the consistency of the geometric structure. Liu et al. [25] propose a compression-oriented edge-based inpainting algorithm that focuses on visual quality rather than pixel-wise fidelity. Texture synthesis inpainting algorithms use similar-texture approaches under the constraint that image texture should be consistent. Bertalmio et al. [6] simultaneously utilize structure and texture to fill in regions of missing image information. These classes of techniques are less effective for large lost regions because global information is missing. Example-based image inpainting attempts to infer the missing region by retrieving similar patches or via a learning-based model. Hays and Efros [18] retrieve semantically similar patches from a large photograph dataset and then use these patches to fill in the missing pixels. Pathak et al. [29] propose an unsupervised learning algorithm named Context Encoders: a convolutional neural network trained to generate the contents of an arbitrary image region conditioned on its surroundings. Yang et al. [42] propose a multi-scale neural patch synthesis approach based on joint optimization of image content and texture constraints.

The task of image deblurring is to recover a clean image given only the blurry image. A number of approaches have been proposed to generate clear images via image processing. Shan et al. [31] use a unified probabilistic model of both blur kernel estimation and unblurred image restoration. Cho and Lee [9] introduce a novel prediction step to accelerate both latent image estimation and kernel estimation in an iterative deblurring process. Cai et al. [8] remove motion blur from a single image by formulating blind deblurring as a new joint optimization problem that simultaneously maximizes the sparsity of the blur kernel and the sparsity of the clear image under suitable redundant tight frame systems. Sun et al. [37] utilize a deep learning approach to predict the probabilistic distribution of motion blur. Xu et al. [41] establish a framework for robust deconvolution against artifacts by combining traditional optimization-based schemes with a neural network.

Denoising is the process of reconstructing the original image by removing unwanted noise from a corrupted image. Image denoising approaches can be categorized as spatial-domain, transform-domain, and learning-based methods [35]. Elad and Aharon [14] use K-SVD to obtain an over-complete dictionary that describes image content effectively. Most existing state-of-the-art image denoising algorithms are based on exploiting the similarities between patches; a prominent example is block-matching with 3D filtering (BM3D) [12], which performs effective filtering in a 3D transform domain by combining sliding-window transform processing with block matching. Burger et al. [7] apply a multi-layer perceptron (MLP) to image patches, directly learning the mapping from noisy to noise-free images. The stacked denoising auto-encoder [38], trained locally to denoise corrupted versions of its inputs, is a well-known deep neural network model for denoising; the auto-encoder tries to restore the raw input without noise. Zhang et al. [43] utilize residual learning to train a denoising convolutional neural network that handles Gaussian denoising with unknown noise level.

Some studies attempt to resolve more than one of these inverse vision problems. Dong et al. [13] add autoregressive models and a nonlocal self-similarity regularization term to a sparse-coding algorithm, achieving excellent results on both image deblurring and super-resolution. Meur et al. [24] introduce a framework that combines multiple inpainted versions of the input picture followed by a single-image super-resolution method. Gharbi et al. [15] train a deep neural network on a large corpus of images to jointly solve denoising and demosaicking. Unlike these methods, which still treat each task separately, our approach is more general, learning multiple tasks through a single neural network.

3 Method

In this section, we first introduce the overall architecture of the deep cascade of neural networks. Then we present the details of the inpainting GAN and the deblurring-denoising network. Finally, we introduce the gradual training strategy.

3.1 Framework overview

To deal with the multiple ill-posed vision tasks mentioned above, we build the deep cascade of neural networks illustrated in Fig. 2. The framework mainly contains two parts: the inpainting GAN and the deblurring-denoising network. The corrupted image serves as the input to our method, and the output of the deblurring-denoising network is the final resulting image.

Fig. 2

Architecture of the cascade neural networks. It consists of two parts, the inpainting GAN and the deblurring-denoising network. The inpainting GAN simultaneously trains two networks: a generator and a discriminator. The resulting image of the generator is further processed by the deblurring-denoising network in order to remove blur and noise

The first part, the inpainting network, is based on generative adversarial networks. The GAN generates meaningful visual blocks to fill vacancies or replace deteriorated parts through the competition between the generator and the discriminator. The output of the inpainting GAN is an image with complete content, but the filled area is blurred: although the generator can produce images that look natural, noise is inevitably mixed in. The generated image is fed directly into the deblurring-denoising network, whose purpose is to make the filled area clear. Inspired by deep residual networks [19] and stacked denoising auto-encoders [38], the deblurring-denoising network is a deep convolutional auto-encoder with skip connections.

We call our model a deep cascade of neural networks because the two parts are directly connected, and errors can be back-propagated from the deblurring-denoising network to the generator of the inpainting GAN; the two sub-networks are joined into one integrated model. We first train the inpainting GAN so that the generator learns to produce coarse blocks that fill the missing parts; the adversarial training enforces that the generated image is coherent and looks natural. Then we pre-train the deblurring-denoising network. After these steps, we jointly optimize the deblurring-denoising network and the inpainting GAN, so that the loss of the deblurring-denoising network also affects the parameters of the inpainting GAN. This stepwise training process is specially designed to optimize the complex model; a detailed description of the training steps is given in Section 3.4.
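For concreteness, a minimal PyTorch sketch of the cascade's forward pass is shown below. The module names InpaintingGenerator and DeblurDenoiseNet are hypothetical stand-ins for the two parts described above (detailed sketches follow in Sections 3.2 and 3.3); note that the intermediate inpainted image remains available at inference, as stated in our contributions.

```python
import torch.nn as nn

class DeepCascade(nn.Module):
    """Corrupted image -> inpainting generator -> deblurring-denoising net."""
    def __init__(self, generator: nn.Module, refiner: nn.Module):
        super().__init__()
        self.generator = generator  # generator of the inpainting GAN
        self.refiner = refiner      # deblurring-denoising network

    def forward(self, x):
        coarse = self.generator(x)   # inpainted but still blurred/noisy
        clean = self.refiner(coarse) # final deblurred and denoised image
        return coarse, clean         # both results are available at inference
```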

3.2 Inpainting GAN

Deep generative models attempt to capture the probability distribution of given data. Generative adversarial networks, proposed by Goodfellow et al. [17], estimate generative models via an adversarial process. GANs simultaneously train two networks: a generative network G, which aims to capture the input data distribution, and a discriminative network D, which aims to correctly determine whether a sample comes from the training data or from G. Unlike [30], which generates images from a noise prior, in our model the generative network produces the image G(x) given the input image x, where x is the corrupted input image and y denotes its corresponding ground-truth image. G(x) and y are presented as inputs to the discriminator. Through the adversarial process, the generator learns to create plausible patches to fill in the missing parts, which the discriminator finds hard to distinguish from real content. In order to improve training stability and avoid mode collapse, we adopt the Wasserstein GAN [1] instead of the traditional GAN. The structure of the inpainting GAN is shown in Fig. 3.

Fig. 3

Detail of the inpainting GAN. The generator contains seven residual blocks. Each residual block stacks two convolutional layers, each followed by a batch normalization layer, with a ReLU activation after the first batch normalization layer. An identity-mapping shortcut connects the input of the residual block to the output of its last layer. Mirrored skip connections are adopted between corresponding convolution layers. The structure of the discriminator is similar to the VGG network; we replace the max-pooling layers by setting the stride of the corresponding convolution layers to 2

Following the network architecture of [29], the generator is a simple encoder-decoder pipeline consisting of convolution and deconvolution layers. The generator extracts features through the first five convolution layers and recovers the details of the image content through five deconvolution layers. A batch normalization layer is used after every convolution layer, with leaky ReLU as the activation function; the decoder uses ReLU activations, unlike the encoder. To make training more effective, we adopt mirrored skip connections between the early convolution layers and their corresponding deconvolution layers. A further skip connection simply element-wise adds the input image to the generator's output. The feature maps at both ends of a skip connection must have the same size.
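As an illustration, here is a minimal PyTorch sketch of such an encoder-decoder generator with mirrored additive skip connections and the global input-to-output skip. It omits the residual blocks shown in Fig. 3, and the kernel sizes and channel widths are our assumptions, since the paper does not specify them.

```python
import torch.nn as nn

def enc_block(cin, cout):
    # convolution -> batch norm -> leaky ReLU (encoder side)
    return nn.Sequential(nn.Conv2d(cin, cout, 4, stride=2, padding=1),
                         nn.BatchNorm2d(cout),
                         nn.LeakyReLU(0.2, inplace=True))

def dec_block(cin, cout):
    # deconvolution -> batch norm -> ReLU (decoder side)
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1),
                         nn.BatchNorm2d(cout),
                         nn.ReLU(inplace=True))

class InpaintingGenerator(nn.Module):
    # Channel widths are an assumption; the paper does not list them.
    def __init__(self, widths=(64, 128, 256, 512, 512)):
        super().__init__()
        self.encoders = nn.ModuleList()
        cin = 3
        for w in widths:
            self.encoders.append(enc_block(cin, w))
            cin = w
        self.decoders = nn.ModuleList()
        for w in reversed(widths[:-1]):
            self.decoders.append(dec_block(cin, w))
            cin = w
        self.out = nn.ConvTranspose2d(cin, 3, 4, stride=2, padding=1)

    def forward(self, x):
        h, skips = x, []
        for enc in self.encoders:
            h = enc(h)
            skips.append(h)
        skips.pop()  # the innermost features have no mirrored counterpart
        for dec in self.decoders:
            h = dec(h) + skips.pop()  # mirrored additive skip connection
        return self.out(h) + x        # global skip: add the input image
```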

The discriminator is similar to the VGG-16 network proposed by Simonyan and Zisserman [36]. We use five groups of convolution blocks and remove the max-pooling layers of VGG-16; to reduce the size of the feature maps, the stride of the last convolutional layer in each block is 2, while the others are 1. The number of 3*3 filter kernels increases by a factor of 2 from 64 to 512, as in VGG-16. The last convolution layer is followed by two fully connected layers. Since the original GAN training process is unstable and at risk of mode collapse, we use the Wasserstein GAN instead of the original GAN. Note that, following the advice of [1], we remove the sigmoid layer from the output of the discriminative network and use RMSProp as the optimizer.
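A sketch of such a VGG-16-style critic under the stated design; the convolution widths follow VGG-16, while the fully connected layer sizes and the 128 * 128 input resolution are assumptions.

```python
import torch.nn as nn

class WGANCritic(nn.Module):
    """Five VGG-16-style conv blocks; the last conv of each block uses
    stride 2 instead of max pooling; no sigmoid on the output (WGAN)."""
    def __init__(self, in_ch=3):
        super().__init__()
        cfg = [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]  # (width, convs)
        layers, cin = [], in_ch
        for width, n in cfg:
            for i in range(n):
                stride = 2 if i == n - 1 else 1  # downsample at block end
                layers += [nn.Conv2d(cin, width, 3, stride=stride, padding=1),
                           nn.LeakyReLU(0.2, inplace=True)]
                cin = width
        self.features = nn.Sequential(*layers)
        # a 128x128 input yields a 4x4 map after five stride-2 reductions
        self.head = nn.Sequential(nn.Flatten(),
                                  nn.Linear(512 * 4 * 4, 1024),
                                  nn.LeakyReLU(0.2, inplace=True),
                                  nn.Linear(1024, 1))  # raw score, no sigmoid

    def forward(self, x):
        return self.head(self.features(x))
```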

The loss of the network consists of three parts: MSE loss, perceptual loss, and generative adversarial loss. The MSE loss is the pixel-wise mean squared error. Peak signal-to-noise ratio (PSNR) approximates human perception of reconstruction quality in many cases, and a lower MSE yields a higher PSNR; the MSE loss is therefore the most widely used optimization target for the image inpainting task. Johnson et al. [22] proposed perceptual loss functions based on high-level features extracted from pretrained networks, and their experiments demonstrate that the perceptual loss produces more realistic results in style transfer and super-resolution tasks. We adopt the Wasserstein GAN loss as the generative adversarial loss. Unlike the traditional GAN loss, the Wasserstein loss is differentiable almost everywhere, which yields a better discriminator; moreover, the Wasserstein distance provides a metric that correlates well with training progress.

Given a paired image (x, y) ∈ (I_input, I_groundtruth), the MSE loss is defined as:

$$ {L}_{MSE}=\frac{1}{WH}\sum \limits_{i=1}^W\sum \limits_{j=1}^H{\left[{y}_{i,j}-G{(x)}_{i,j}\right]}^2 $$
(1)

Here W is the width of the image, H is the height, and G(x) is the image generated by the generator.

We define the perceptual loss on the activation layers of VGG-16 [36]. Denote by ϕ_{i,j}(x) the feature map obtained after the ReLU activation of the j-th convolutional layer before the i-th pooling layer of VGG-16. If the shape of this feature map is (H_{i,j} × W_{i,j} × C_{i,j}), the normalized squared Euclidean distance between feature representations is denoted L_{i,j}, and the perceptual loss is the mean of the selected L_{i,j}. In this case we use the feature maps ϕ_{1,2}, ϕ_{2,2}, ϕ_{3,3} and ϕ_{4,3} (all taken immediately before a pooling layer). The perceptual loss is finally given by:

$$ {L}_{i,j}=\frac{1}{H_{i,j}{W}_{i,j}{C}_{i,j}}{\left\Vert {\phi}_{i,j}(y)-{\phi}_{i,j}\left(G(x)\right)\right\Vert}_2^2 $$
(2)
$$ {L}_{per}=\frac{1}{4}\left({L}_{1,2}+{L}_{2,2}+{L}_{3,3}+{L}_{4,3}\right) $$
(3)
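As a concrete sketch, the perceptual loss of Eqs. 2-3 could be computed with torchvision's pretrained VGG-16 as follows. Layer indices 3, 8, 15 and 22 select relu1_2, relu2_2, relu3_3 and relu4_3; normalization of the inputs to VGG's expected ImageNet statistics is omitted for brevity, and this is an illustration rather than the authors' exact code.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

class PerceptualLoss(torch.nn.Module):
    # Indices of relu1_2, relu2_2, relu3_3, relu4_3 in torchvision's VGG-16,
    # corresponding to phi_{1,2}, phi_{2,2}, phi_{3,3}, phi_{4,3} in Eq. 3.
    LAYERS = (3, 8, 15, 22)

    def __init__(self):
        super().__init__()
        self.vgg = vgg16(weights="IMAGENET1K_V1").features[:23].eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)  # fixed feature extractor

    def _feats(self, x):
        feats, h = [], x
        for i, layer in enumerate(self.vgg):
            h = layer(h)
            if i in self.LAYERS:
                feats.append(h)
        return feats

    def forward(self, fake, real):
        # Eq. 2 per layer (mse_loss already averages over C*H*W),
        # then Eq. 3 averages over the four selected layers.
        losses = [F.mse_loss(a, b)
                  for a, b in zip(self._feats(fake), self._feats(real))]
        return sum(losses) / len(losses)
```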

Arjovsky et al. [1] theoretically analyzed the drawbacks of the original GAN and advise using the Wasserstein distance W(f, g) to measure the difference between the input data distribution and the generator's distribution. The Wasserstein GAN solves the adversarial min-max problem:

$$ \underset{G}{\mathit{\min}}\underset{D}{\mathit{\max}}\underset{x\sim {\mathrm{\mathbb{P}}}_r}{\mathbb{E}}\left[D(x)\right]-\underset{\overset{\sim }{x}\sim {\mathrm{\mathbb{P}}}_g}{\mathbb{E}}\left[D\left(\overset{\sim }{x}\right)\right] $$
(4)

The discriminator loss is:

$$ {L}_D={\mathbb{E}}_{\overset{\sim }{x}\sim {\mathrm{\mathbb{P}}}_g}\left[D\left(\overset{\sim }{x}\right)\right]-{\mathbb{E}}_{x\sim {\mathrm{\mathbb{P}}}_r}\left[D(x)\right] $$
(5)

The generator loss is:

$$ {L}_G=-{\mathbb{E}}_{\overset{\sim }{x}\sim {\mathrm{\mathbb{P}}}_g}\left[D\left(\overset{\sim }{x}\right)\right] $$
(6)

We define the overall loss of the generator as:

$$ L={\lambda}_{MSE}{L}_{MSE}+{\lambda}_{per}{L}_{per}+{\lambda}_{adv}{L}_G $$
(7)
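Combining Eqs. 1-7, the two training objectives could be computed as in the following sketch. The λ weights are placeholders, since the paper does not report the values used, and PerceptualLoss refers to the hypothetical module sketched above.

```python
import torch.nn.functional as F

def critic_loss(D, real, fake):
    # Eq. 5: minimizing this raises D's scores on real samples
    # and lowers them on generated ones.
    return D(fake.detach()).mean() - D(real).mean()

def generator_loss(G, D, perceptual, x, y,
                   w_mse=1.0, w_per=0.1, w_adv=0.01):
    # Eq. 7 built from Eqs. 1, 3 and 6; the weights are illustrative only.
    fake = G(x)
    l_mse = F.mse_loss(fake, y)   # Eq. 1 (averaged over pixels)
    l_per = perceptual(fake, y)   # Eq. 3
    l_adv = -D(fake).mean()       # Eq. 6
    return w_mse * l_mse + w_per * l_per + w_adv * l_adv
```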

3.3 Deblurring-denoising network

The deblurring-denoising network is connected to the generator of the inpainting GAN. It takes the generator's output image as input and estimates the corresponding clean image. The structure of the deblurring-denoising network is a deep convolutional auto-encoder with skip connections, and the optimization objective is to minimize the mean squared error between the estimated image and the ground-truth image. Instead of learning a mapping that produces the clean image directly from the noisy input, we learn the residual between the noisy observation and the clean image through the skip connection between input and output. This residual learning alleviates the vanishing gradient problem by learning a mapping from the noisy input to the noise or blur itself. We obtain the clean image as y = x − v, where v is the residual predicted by the network.

The network structure is shown in Fig. 4. We use 4 convolutional layers to encode the input image into features, and 4 convolutional layers to decode these features and restore a full-detail image. Every convolutional layer is followed by a batch normalization layer, except the first and the last. LeakyReLU activation with negative slope 0.001 is applied after each batch normalization. We set the size of all convolutional filters to 3*3 with 96 channels. To preserve the dimensions of the feature maps, every convolutional layer uses zero-padding.
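A minimal PyTorch sketch of such a residual deblurring-denoising network under the stated hyper-parameters; the 3-channel input and output is our assumption.

```python
import torch.nn as nn

class DeblurDenoiseNet(nn.Module):
    """Eight 3*3 conv layers (4 encode, 4 decode), 96 channels, zero padding
    so feature maps keep the input size; batch norm after every layer except
    the first and last; LeakyReLU with negative slope 0.001."""
    def __init__(self, ch=96):
        super().__init__()
        layers = [nn.Conv2d(3, ch, 3, padding=1),      # first layer: no BN
                  nn.LeakyReLU(0.001, inplace=True)]
        for _ in range(6):                             # six middle layers
            layers += [nn.Conv2d(ch, ch, 3, padding=1),
                       nn.BatchNorm2d(ch),
                       nn.LeakyReLU(0.001, inplace=True)]
        layers.append(nn.Conv2d(ch, 3, 3, padding=1))  # last layer: no BN
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        v = self.body(x)  # predicted noise/blur residual
        return x - v      # residual skip connection: y = x - v
```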

Fig. 4

Detail of the deblurring-denoising network. Four convolution layers are used to obtain the encoded representation, and 4 convolution layers decode these features to reconstruct the clean image. The size of all convolutional filters is set to 3*3 with stride 1. All feature maps keep the same size as the input image

3.4 Gradual training strategy

The model is composed of two parts, the inpainting GAN and the deblurring-denoising network, with the deblurring-denoising network directly connected to the generator of the inpainting GAN. Because the network structure is complex and handles multiple tasks at the same time, training end-to-end from scratch makes it difficult for the network to converge. We therefore designed a gradual training strategy to acquire better results.

First, we pre-train the generator of the inpainting GAN using only the MSE loss. This process is similar to training an auto-encoder and gives the generator the basic ability to extract useful features and reconstruct images. Second, we add the discriminator and the perceptual (VGG) loss to the training process. Next, we pre-train the deblurring-denoising network independently; the image pairs used in this step are generated by adding noise and blur to clean images. This step can also be parallelized with training the inpainting GAN, because the two can be regarded as separate networks at this point. After both networks are well trained, we combine the generator of the inpainting GAN with the deblurring-denoising network and train them jointly. This final step conforms to an end-to-end mode: an image first enters the generator, the output comes from the deblurring-denoising network, and gradients are computed and weights updated by back-propagation from the output all the way to the generator's input. The whole schedule is sketched below.
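The following condensed sketch illustrates the four phases, reusing the hypothetical modules and loss helpers from the earlier sketches. The clipping value and learning rates follow the WGAN paper's defaults rather than values reported here, and only the final MSE term is shown in the joint phase, since the paper does not specify which losses remain active then.

```python
import torch
import torch.nn.functional as F

G, D, R = InpaintingGenerator(), WGANCritic(), DeblurDenoiseNet()
perceptual = PerceptualLoss()
opt_g = torch.optim.RMSprop(G.parameters(), lr=5e-5)  # RMSProp, as advised in [1]
opt_d = torch.optim.RMSprop(D.parameters(), lr=5e-5)
opt_r = torch.optim.RMSprop(R.parameters(), lr=5e-5)

def phase1_step(x, y):                    # 1) generator alone, MSE loss only
    opt_g.zero_grad()
    F.mse_loss(G(x), y).backward()
    opt_g.step()

def phase2_step(x, y, clip=0.01):         # 2) add critic and perceptual loss
    opt_d.zero_grad()
    critic_loss(D, y, G(x)).backward()
    opt_d.step()
    for p in D.parameters():
        p.data.clamp_(-clip, clip)        # WGAN weight clipping [1]
    opt_g.zero_grad()
    generator_loss(G, D, perceptual, x, y).backward()
    opt_g.step()

def phase3_step(noisy, clean):            # 3) pre-train refiner on synthetic pairs
    opt_r.zero_grad()
    F.mse_loss(R(noisy), clean).backward()
    opt_r.step()

def phase4_step(x, y):                    # 4) joint end-to-end fine-tuning;
    opt_g.zero_grad(); opt_r.zero_grad()  #    the refiner loss reaches G
    F.mse_loss(R(G(x)), y).backward()
    opt_g.step(); opt_r.step()
```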

4 Experiments and results

In this section, we present extensive experimental results to evaluate the performance of the deep cascade of neural networks. We first present the details of the datasets and experiment settings. Next, we conduct a series of experiments to evaluate the overall effectiveness of the learned model, and we systematically compare our model with recent state-of-the-art approaches to show its differences and advantages. Finally, we analyze the architecture of our model.

4.1 Data sets and evaluation metrics

We evaluate the proposed approach on ImageNet [23] and BSD300 [28]. ImageNet, which contains 1000 categories and 1.2 million images, is the authoritative dataset for evaluating the classification task. BSD300 is widely used for the segmentation task; since it contains only 300 images, BSD300 is used only for testing. The test set is randomly picked from the validation set. The dataset for deblurring-denoising network pre-training is randomly selected from the ImageNet training set: the input images are blurred with random blur kernels and then corrupted with additive Gaussian white noise. Generating the dataset for inpainting GAN pre-training is simple: we just remove the center part of each image.
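For illustration, training pairs could be synthesized along the following lines; all parameter values are assumptions, as the paper reports neither the blur kernel sizes nor the noise level.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def corrupt(img, mask_frac=0.25, sigma_blur=2.0, sigma_noise=0.05, rng=None):
    """Build a (corrupted, clean) pair from a float HxWx3 image in [0, 1]."""
    rng = rng or np.random.default_rng()
    h, w = img.shape[:2]
    out = gaussian_filter(img, sigma=(sigma_blur, sigma_blur, 0))  # blur
    out = out + rng.normal(0.0, sigma_noise, out.shape)            # Gaussian noise
    mh, mw = int(h * mask_frac), int(w * mask_frac)
    top, left = (h - mh) // 2, (w - mw) // 2
    out[top:top + mh, left:left + mw] = 0.0                        # drop center region
    return np.clip(out, 0.0, 1.0), img
```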

Following previous work, we adopt the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index (SSIM) [40] as evaluation metrics. PSNR measures the absolute errors in pixel values between two images, while SSIM is a perception-based model that estimates their structural similarity.
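Both metrics are available off the shelf; for instance, with scikit-image (assuming float RGB images in [0, 1]):

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(pred, gt):
    # channel_axis=2 marks the RGB axis of HxWx3 arrays
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, data_range=1.0, channel_axis=2)
    return psnr, ssim
```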

4.2 Comparisons with state-of-the-art methods

From various classic and recent state-of-the-art image inpainting approaches, four representative methods are selected as comparison baselines: k-Nearest Neighbor (k-NN), PatchMatch [2], Context Encoders [29], and Neural Patch Synthesis (NPS) [42]. The resolution of the test images is 128 * 128. We implemented k-NN ourselves; the PatchMatch algorithm and the results of Context Encoders are provided by their authors, and the results of Neural Patch Synthesis are generated by running the authors' model. We use the test dataset provided by the authors of Context Encoders to compare all the methods.

Judging from the images generated by the k-NN method, there is no denying that its performance is inferior: when the input images contain high-frequency content, the output is entirely unpredictable, with implausible filled patches centered on the missing region. PatchMatch performs exceptionally well in low-frequency regions, but fails to fill the lost region in images with relatively high-frequency content. We can conclude that the generative-model-based approaches are better at filling damaged images and generating legible output images. Compared with CE and NPS, the PSNR and SSIM scores of our model are slightly higher; we attribute this improvement to our cascade structure and gradual training strategy. The detailed quantitative comparison on ImageNet is listed in Table 1, and examples for visual comparison are presented in Fig. 5.

Table 1 Quantitative comparison on ImageNet and BSD300 between different methods
Fig. 5

Visual comparisons of different methods on ImageNet. From left to right: ground truth, input image, K-NN, PatchMatch, Context Encoder, Neural Patch Synthesis, and Ours

4.3 Architecture analysis

To further evaluate the effectiveness of our method, we run an additional set of experiments. In the first, we drop the gradual pre-training strategy, in order to verify its effect. In the second, the model contains only the inpainting GAN, without the deblurring-denoising network, in order to verify that network's contribution.

We present some of the results in Fig. 6 and a detailed quantitative comparison in Table 2. We find that without the pre-training procedure the PSNR score drops by about 1.2 dB on ImageNet and 1.4 dB on BSD300, and without the deblurring-denoising network it drops by about 0.8 dB on ImageNet and 1.1 dB on BSD300. The performance of our intact model clearly surpasses all the ablated variants, demonstrating that both the pre-training strategy and the deblurring-denoising network contribute conspicuously to image quality.

Fig. 6

Visual comparisons of results under different settings on ImageNet and BSD300. The top three rows come from ImageNet; the bottom three rows come from BSD300. From left to right: ground truth, input image, model without pre-training, model without the deblurring-denoising network, and the intact model

Table 2 Comparison between different architectures and training strategies

5 Conclusion

This paper presents a novel cascade of neural networks for multiple low-level vision problems. The model contains two parts: an inpainting GAN and a deblurring-denoising network. The inpainting GAN adopts a weighted combination of three loss functions as its training loss. Through effective joint optimization, the two parts are trained to generate a clean version of the image. Future work will focus on optimizing the structure of the generative network, so as to improve the generator's learning and representational ability.