1 Introduction

Mobile phones are arguably the most ubiquitous imaging devices in use today, and hence have taken on a fundamental role in capturing our favorite memories. Unfortunately, mobile phone cameras are equipped with a small aperture, which leads to high noise levels when imaging low-light scenes. To compensate for low light, the exposure time can be increased. However, longer exposures make the camera more sensitive to motion (of the camera or of the scene) and may result in visibly blurry pictures. Hence, in the past decade, the task of blur removal has become increasingly pressing.

In this work we are interested in removing motion blur with the help of computational methods and we consider only a single blurry photograph as input. For this purpose, we use the following blur model

$$\begin{aligned} f = k *u + n \end{aligned}$$
(1)

where k depends on the motion of the camera and is called the blur kernel (or point spread function), u is the sharp (or uncorrupted) image and n is the sensor noise. In this model the blur does not change across the image; this assumption does not hold in real scenes with depth variation and/or general camera motions. Given both the blurry image f and the blur k, the estimation of the sharp image is a (non-blind) deconvolution problem. When instead only the blurry image f is given, the problem of estimating both the sharp image and the blur is called blind deconvolution.
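As a minimal illustration of model (1), the following Python sketch synthesizes a blurry observation from a sharp image; the kernel, noise level and image size are arbitrary choices made only for this example.

```python
import numpy as np
from scipy.signal import fftconvolve

def synthesize_blurry(u, k, sigma_n=0.01, seed=0):
    """Generate f = k * u + n with a normalized kernel k and Gaussian noise n."""
    k = k / k.sum()                      # blur kernels are non-negative and sum to one
    f = fftconvolve(u, k, mode='same')   # spatially uniform (shift-invariant) blur
    rng = np.random.default_rng(seed)
    return f + sigma_n * rng.standard_normal(f.shape)

# example: a 9x9 horizontal motion blur applied to a random test image
u = np.random.default_rng(1).random((64, 64))
k = np.zeros((9, 9)); k[4, :] = 1.0
f = synthesize_blurry(u, k)
```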

Solving blind deconvolution is a challenging task because it is a non-convex and ill-posed problem that requires the estimation of both u and k. For this reason it is often tackled with the use of regularization by means of prior knowledge on the distribution of sharp images (Babacan et al. 2012; Cho and Lee 2009; Fergus et al. 2006; Krishnan et al. 2011, 2013; Levin et al. 2011a; Shan et al. 2008; Wipf and Zhang 2014; Xu and Jia 2010; Xu et al. 2013). Several methods also use a prior on the blur kernel (see for instance the recent works of Keuper et al. 2013; Kenig et al. 2010). A thorough analysis of convergence and further references can be found in Chaudhuri et al. (2014). In this paper, we consider only priors on the sharp image and use only the positivity and normalization constraints on the blur.

A common approach to choosing the image prior is to look for the one that best matches the statistics of natural images. This led to the use of Total Variation and its modifications, which produced some impressive results (Chan and Wong 1998; Shan et al. 2008). However, recent theoretical results have shown that such priors are not suitable for the blind deconvolution problem, because they prefer a blurry image over a sharp one (Levin et al. 2011b; Perrone and Favaro 2014).

Recently, Wipf and Zhang (2014) argued that a good image prior may not need to model realistic natural images. Indeed, they claim that an image obtained by removing gradients from a sharp image may be sufficient to estimate the correct blur kernel. Moreover, they show that their gradient sparsity principle is better than most natural image statistics in distinguishing a blurry image from a sharp one. Since one can then retrieve the sharp image given the estimated blur kernel, their conclusion is that an image prior that encourages strong sparsity in the gradients leads to a better performance in blind deconvolution.

Based on this principle, a natural choice is the \(L_0\) “norm”. This is the ideal sparsity-promoting prior, but it also leads to an intractable combinatorial problem. It is therefore common to use some approximation. The logarithm of the gradient norm of an image has been the most successful approximation and has achieved state-of-the-art performance (Babacan et al. 2012; Wipf and Zhang 2014). Unfortunately, these implementations employ elaborate formulations to deal with the non-convexity of the logarithmic prior.

In this work we introduce a parametric family of logarithmic priors and show that, despite being far from modeling natural image statistics, it is particularly well suited to blind deconvolution. We present two simple algorithms that effectively minimize the logarithmic prior and achieve state-of-the-art results. Finally, we study the behavior of the prior and show what makes it successful for blind deconvolution. The main differences from the work of Wipf and Zhang (2014) are that we provide a more thorough empirical analysis of the logarithmic prior and that our algorithms have a simpler form.

2 Prior Work

In the past decade, several high-performing blind deconvolution schemes using Bayesian principles have been proposed (Babacan et al. 2012; Cho and Lee 2009; Fergus et al. 2006; Krishnan et al. 2011, 2013; Levin et al. 2011a; Shan et al. 2008; Wipf and Zhang 2014; Xu and Jia 2010; Xu et al. 2013). The first step in the Bayesian framework is to devise a statistical distribution for both the gradients of the sharp image and the measurement noise or the model error. This joint distribution is used to pose a maximum a posteriori (MAP) problem

$$\begin{aligned} p(u,k|f) \propto p(f|u,k)p(u)p(k), \end{aligned}$$
(2)

where p(f|u, k) is a generative model of the noise, p(u) is a prior of the sharp image and p(k) is a prior of the blur. Commonly used sharp image priors approximate the heavy-tailed distribution of natural image gradients (Srivastava et al. 2003) via a sparsity-inducing norm of the gradients of u. The \(L_2\) norm of the gradients (isotropic total variation) and the \(L_1\) norm of the derivatives (anisotropic total variation) are classical choices (Chan and Wong 1998). In contrast to other sparsity-inducing norms, total variation (TV) (Rudin et al. 1992) has the desirable property of being convex. However, it also introduces a loss of contrast in the recovered sharp image (Perrone and Favaro 2014; Strong and Chan 2003). Other methods use heuristics to encourage sharp gradients (Cho and Lee 2009; Xu and Jia 2010; Shan et al. 2008), or some reweighting strategy of the norm of the gradients (Krishnan et al. 2011, 2013). The latter methods aim at approximating the \(L_0\) “pseudo-norm” of the gradients, as proposed also in Xu et al. (2013). In this paper we also encourage sparsity in the gradients. However, we use the logarithm of TV at each pixel, which yields a simple energy term while providing a good approximation to the number of nonzero gradients. Indeed, this prior has already demonstrated promising results in blind deconvolution (Babacan et al. 2012; Wipf and Zhang 2013) and denoising (Ochs et al. 2014).

MAP estimators are often criticized as theoretically less sound than the conditional mean (CM) estimator (Burger and Lucka 2014). In fact, the CM estimator is the Bayes estimator for the mean squared error, while the MAP estimator is only asymptotically the Bayes estimator for the uniform cost function. Nonetheless, Burger and Lucka (2014) present theoretical and experimental results that rehabilitate the MAP estimator and justify its successful use in several restoration problems.

For the blind deconvolution problem the MAP formulation has received further criticism. In fact, Levin et al. (2011b) and Perrone and Favaro (2014) show that a large class of commonly used image priors favors the blurry image over the sharp one.

Because of this limitation, Levin et al. (2011a) suggest marginalizing over all possible sharp images u and thus solving the reduced problem

$$\begin{aligned} \arg \max _k p(k|f)&= \arg \min _k - \log p(k|f) \nonumber \\&= \arg \min _k - \log \int _u p(u,f|k)p(k)\, du. \end{aligned}$$
(3)

In general, the integral in problem (3) is intractable. Therefore, one typically looks for an approximate solution. A common approach is to minimize an upper bound of \(- \log p(k|f)\) using a variational Bayes strategy (Babacan et al. 2012; Fergus et al. 2006; Levin et al. 2011a; Wipf and Zhang 2013). This class of methods has achieved performance comparable to the best methods that directly solve the MAP problem (2).

Despite the apparent performance of the variational Bayes strategy, Wipf and Zhang (2013) show that methods that solve problem (3) are equivalent to a MAP strategy as in problem (2). They experimentally show that with an \(L_p\) norm with \(p\ll 1\), MAP approaches are able to favor the right sharp solution. They also argue that a variational Bayes approach should be preferred because it is more robust when minimizing a highly non-convex function. Their conclusions are however in contrast with several MAP approaches that have demonstrated effective results in various non-convex problems (Strekalovskiy and Cremers 2014; Möllenhoff et al. 2014a, b; Ochs et al. 2014). The conclusions given in Wipf and Zhang (2013) suggest that minimizing a cost functional as in (2) is not limited per se, as long as one finds a minimization strategy that carefully avoids its local minima.

In this paper we extend our initial work (Perrone et al. 2014), where we proposed two MAP strategies to minimize a functional based on a non-convex logarithmic prior. We add an analysis of the logarithmic prior, a study of the distribution of its parameters, and additional experiments that illustrate the properties of the logarithmic prior and of the two algorithms.

3 A Logarithmic Image Prior

In this section we introduce our image prior. From a Bayesian perspective, natural images can be described as having a sparse collection of gradients (Srivastava et al. 2003). Hence, one could employ sparsity-inducing priors of the image gradients. However, another point of view is that blurring is the average of shifted and scaled replicas of the same image gradients. The likelihood that such replicas combine to cancel each other is statistically insignificant. Therefore, this averaging is more likely to multiply the number of gradients by the number of nonzero elements in the blur. Thus, a different perspective is that, in the context of deblurring, the role of an image prior is to favor solutions that have as few gradients as possible regardless of their magnitude. Both points of view lead to the same principle, i.e., one should choose as prior

$$\begin{aligned} \mathtt{Number}\, \mathtt{of}\, \mathtt{non}\, \mathtt{zero}\, \mathtt{elements}\, \mathtt{of }\,(|\nabla u|)\doteq \Vert \nabla u\Vert _0 \end{aligned}$$
(4)

where \(\Vert \cdot \Vert _0\) denotes the \(L_0\) “norm” (the Hamming distance to zero) and \(\nabla u\) is the 2-D gradient of u. Unfortunately, optimization with this prior is very challenging and, typically, smoother alternatives such as \(L_p\) norms \(\Vert \nabla u\Vert _p^p\), with \(0<p<1\), are used. In this work we also consider a prior with a similar behavior and simple form.

Let us consider the discrete setting. In the 2D discrete case, we have images with \(N\times M\) pixels. The (i, j)-th entry of the image u will be denoted by \(u_{i,j}\). We consider four possible first-order (discrete) derivatives of u according to whether forward or backward differences are used:

$$\begin{aligned} \nabla _{FF} u \doteq&[u_{i+1,j}-u_{i,j}~~u_{i,j+1}-u_{i,j}]^T\end{aligned}$$
(5)
$$\begin{aligned} \nabla _{FB} u \doteq&[u_{i+1,j}-u_{i,j}~~u_{i,j}-u_{i,j-1}]^T\end{aligned}$$
(6)
$$\begin{aligned} \nabla _{BF} u \doteq&[u_{i,j}-u_{i-1,j}~~u_{i,j+1}-u_{i,j}]^T\end{aligned}$$
(7)
$$\begin{aligned} \nabla _{BB} u \doteq&[u_{i,j}-u_{i-1,j}~~u_{i,j}-u_{i,j-1}]^T. \end{aligned}$$
(8)

As image prior we propose using the following logarithmic prior

$$\begin{aligned} \log \Vert \nabla u\Vert _{2,\epsilon }^p&\doteq \sum _{i=1}^N\sum _{j=1}^M \frac{1}{4}\sum _{D \in \mathcal D}\log \Vert \nabla _D u_{i,j}\Vert _{2,\epsilon }^p \nonumber \\&= \frac{p}{2}\sum _{i=1,j=1}^{N,M} \frac{1}{4} \sum _{D \in \mathcal D} \log \Vert \nabla _D u_{i,j}\Vert _{2,\epsilon }^2, \end{aligned}$$
(9)

with \(p>0\), \(\mathcal D = \{FF,FB,BF,BB\}\), and where

$$\begin{aligned} \Vert \nabla u_{i,j}\Vert _{2,\epsilon }^2\doteq (u_{i+1,j}-u_{i,j})^2+(u_{i,j+1}-u_{i,j})^2+\epsilon ^2 \end{aligned}$$
(10)

for \(\epsilon >0\) so that the argument of the logarithm is never 0. Since the following analysis and discussion can be applied to each gradient discretization independently, in the remainder of this work we will use the following simplified notation

$$\begin{aligned} \log \Vert \nabla u\Vert _{2,\epsilon }^p&\doteq \sum _{i=1}^N\sum _{j=1}^M \log \Vert \nabla u_{i,j}\Vert _{2,\epsilon }^p \nonumber \\&= \frac{p}{2}\sum _{i=1,j=1}^{N,M} \log \Vert \nabla u_{i,j}\Vert _{2,\epsilon }^2. \end{aligned}$$
(11)

In Eq. (11) the parameter \(\epsilon \) leads to a lower bound for this prior equal to \(MNp\log \epsilon \). We can formulate our blind deconvolution problem using the logarithmic prior in Eq. (11) as

$$\begin{aligned} u,k&= \arg \min _{u,k} \lambda \Vert k*u - f\Vert _2^2 + \log \Vert \nabla u\Vert _{2,\epsilon }^p\nonumber \\&\hbox {s.t. }\quad k\succcurlyeq 0,\quad \Vert k\Vert _1 = 1. \end{aligned}$$
(12)

Notice how the role of \(\epsilon \) is fundamental. If \(\epsilon =0\) then the optimal solution will always be \(u=0\) for any \(\lambda \).
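For concreteness, the following sketch evaluates the prior in Eq. (11) and the cost of problem (12); it uses a single forward-difference discretization (one of Eqs. (5)–(8)) with replicated boundary samples, which is an assumption of this example.

```python
import numpy as np
from scipy.signal import fftconvolve

def log_prior(u, eps=1e-3, p=1.0):
    """Logarithmic prior of Eq. (11): (p/2) * sum_ij log(||grad u_ij||^2 + eps^2)."""
    dx = np.diff(u, axis=1, append=u[:, -1:])  # forward differences (one of the four
    dy = np.diff(u, axis=0, append=u[-1:, :])  # discretizations in Eqs. (5)-(8))
    return 0.5 * p * np.sum(np.log(dx**2 + dy**2 + eps**2))

def blind_cost(u, k, f, lam, eps=1e-3, p=1.0):
    """Cost of problem (12), assuming k is already non-negative and sums to one."""
    residual = fftconvolve(u, k, mode='same') - f
    return lam * np.sum(residual**2) + log_prior(u, eps, p)
```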

Remark

The following limit shows how the log prior approximates the desired \(L_0\) “norm”

$$\begin{aligned} \lim _{\epsilon \rightarrow 0} \frac{p}{2}MN + \frac{1}{\log (1/\epsilon ^2) }\log \Vert \nabla u\Vert _{2,\epsilon }^p = \frac{p}{2}\Vert \nabla u\Vert _0. \end{aligned}$$
(13)

Now assume that \(0< \epsilon \le 1\) and substitute \(\lambda \) in problem (12) with \(-\lambda p\log \epsilon ^2\). Then, in the limit \(\epsilon \rightarrow 0\), we are solving

$$\begin{aligned} u,k&= \arg \min _{u,k} \lambda \Vert k*u - f\Vert _2^2 + \Vert \nabla u\Vert _{0}\nonumber \\&\hbox {s.t. }\quad k\succcurlyeq 0,\quad \Vert k\Vert _1 = 1. \end{aligned}$$
(14)
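A quick numerical check of the limit (13) can be written as follows; the test image, the \(\epsilon \) schedule and the boundary handling are arbitrary choices of this sketch.

```python
import numpy as np

def normalized_log_prior(u, eps, p=1.0):
    """(p/2)MN + log-prior / log(1/eps^2), which tends to (p/2)*||grad u||_0 as eps -> 0."""
    dx = np.diff(u, axis=1, append=u[:, -1:])
    dy = np.diff(u, axis=0, append=u[-1:, :])
    log_term = 0.5 * p * np.sum(np.log(dx**2 + dy**2 + eps**2))
    return 0.5 * p * u.size + log_term / np.log(1.0 / eps**2)

u = np.zeros((32, 32)); u[10:20, 10:20] = 1.0        # piecewise constant test image
dx = np.diff(u, axis=1, append=u[:, -1:]); dy = np.diff(u, axis=0, append=u[-1:, :])
l0_half = 0.5 * np.count_nonzero(dx**2 + dy**2)      # (p/2)*||grad u||_0 with p = 1
for eps in (1e-2, 1e-4, 1e-8):
    print(eps, normalized_log_prior(u, eps), l0_half)
```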

In the following sections we present two different strategies to minimize problem (12), and we study the prior in Eq. (11) to understand how it relates to other commonly used priors and why it is a good choice for blind deconvolution.

4 Algorithm

To solve problem (12) we use the alternating minimization scheme

$$\begin{aligned} u^{t+1}&= \arg \min _u \lambda \Vert k^t*u - f\Vert _2^2 + \log \Vert \nabla u\Vert _{2,\epsilon }^p\nonumber \\ k^{t+1}&= \arg \min _{k} \Vert k*u^{t+1} - f\Vert _2^2\quad \hbox {s.t. }\; k\succcurlyeq 0,\; \Vert k\Vert _1 = 1 \end{aligned}$$
(15)

While the update of the blur k entails solving a convex problem, which we solve as in Chan and Wong (1998), the update of the sharp image u is non-convex and requires more attention. For this purpose we introduce two solvers: one based on a primal-dual approach and one based on majorization–minimization.
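The outer loop can be sketched as follows; the two update callbacks are placeholders for the solvers described next and for the constrained kernel step, and the initialization with u = f and k = δ is an assumption of this sketch.

```python
import numpy as np

def alternating_minimization(f, k_size, lam, update_u, update_k, n_iters=50):
    """Sketch of the alternating scheme (15). The callbacks update_u and update_k
    are placeholders: update_u is one of the solvers of Sects. 4.1-4.2, and
    update_k solves the constrained convex problem as in Chan and Wong (1998)."""
    u = f.copy()                                 # initialize the sharp image with f
    k = np.zeros((k_size, k_size))
    k[k_size // 2, k_size // 2] = 1.0            # initialize the blur with a delta
    for _ in range(n_iters):
        u = update_u(u, k, f, lam)               # non-convex u-update
        k = update_k(u, k, f)                    # convex k-update, k >= 0, sum(k) = 1
    return u, k
```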

4.1 A Primal-Dual Solver

Recall the deblurring problem (given the blur \(k^t\)) in Algorithm (15); here we rewrite it as

$$\begin{aligned} u&= \arg \min _u \sum _{i=1,j=1}^{N,M} \left( (k^t*u)_{i,j} - f_{i,j}\right) ^2 + \frac{1}{\mu }\log \Vert \nabla u_{i,j}\Vert _{2,\epsilon }^2 \end{aligned}$$
(16)

where \(\mu = 2\lambda /p\). By using the primal-dual approach of Chambolle and Pock (2011) we obtain the following minimax problem

$$\begin{aligned} u= & {} \arg \min _u\max _{z_1,z_2} ~ \langle k^t*u, z_1 \rangle - F_1^*(z_1) \nonumber \\&+\, \langle \nabla u, z_2 \rangle - F_2^*(z_2) \end{aligned}$$
(17)

where \(F_{1}^*\) and \(F_{2}^*\) are conjugate functions of \(F_1\) and \(F_2\) respectively, and we have defined

$$\begin{aligned} F_1(x)\doteq & {} \sum _{i=1,j=1}^{N,M}\left( x_{i,j} - f_{i,j}\right) ^2,\nonumber \\ F_2(\xi )\doteq & {} \frac{1}{\mu } \sum _{i=1,j=1}^{N,M}\log \Vert \xi _{i,j}\Vert _{2,\epsilon }^2. \end{aligned}$$
(18)

The conjugate functions can be computed via the Legendre–Fenchel (LF) transform (Rockafellar 1970) and are convex by construction. Thus problem (17) is an approximation, in all the variables \(z_1\), \(z_2\) and u, of the original problem (16). This formulation has also been proposed for solving the Mumford–Shah problem (Strekalovskiy and Cremers 2014) and for solving problems with \(L_p\) norms (with \(0 < p < 1\)) (Möllenhoff et al. 2014a). Notice that the convex approximation provided by the primal-dual formulation may not lead to one of the minima of the original non-convex cost.

Our general primal-dual algorithm to solve problem (17) is

$$\begin{aligned} z_1^{n+1}&= \hbox {prox}_{\sigma F_1^*}\left( z_1^n + \sigma \, k^t*\bar{u}^n\right) \nonumber \\ z_2^{n+1}&= \hbox {prox}_{\sigma F_2^*}\left( z_2^n + \sigma \nabla \bar{u}^n\right) \nonumber \\ u^{n+1}&= u^n - \tau \left( k_{-}^t*z_1^{n+1} + \nabla ^T z_2^{n+1}\right) \nonumber \\ \bar{u}^{n+1}&= u^{n+1} + \theta \left( u^{n+1}-u^n\right) \end{aligned}$$
(19)

where \(k_{-}^t\) denotes the blur kernel \(k^t\) mirrored along both axes, n is the iteration index, \(\theta \in (0,1]\) and \(\tau \sigma \Vert K\Vert ^2<1\), with \(\tau ,\sigma >0\), where K is the matrix operator implementing both the blur k and the finite difference operator \(\nabla \). Two of the four updates in the above algorithm are defined through proximity operators. The proximity operator \(\hbox {prox}_{\sigma F_1^*}\) is computed via

$$\begin{aligned} \hbox {prox}_{\sigma F_1^*}(z)&= z-\sigma \hbox {prox}_{F_1/\sigma }(z/\sigma )\nonumber \\&=z-\sigma \arg \min _x \frac{1}{2}\left\| \frac{z}{\sigma }-x\right\| _2^2+\sigma F_1(x)\\&= \frac{1}{\sigma +1} \left( z-\sigma f\right) .\nonumber \end{aligned}$$
(20)

The proximity operator \(\hbox {prox}_{\sigma F_2^*}\) is instead computed via

$$\begin{aligned} \hbox {prox}_{\sigma F_2^*}(z)&= z-\sigma \arg \min _x \frac{1}{2}\left\| \frac{z}{\sigma }-x\right\| _2^2+\sigma F_2(x). \end{aligned}$$
(21)

In Eq. (21) we use Moreau’s identity (Rockafellar 1970) to express the proximity operator of \(F_2^*\) in terms of the proximity operator of \(F_2\). This strategy has also been used successfully for other priors and problems (Möllenhoff et al. 2014a; Strekalovskiy and Cremers 2014). While this algorithm was originally studied for convex problems (Chambolle and Pock 2011), its convergence has recently been studied also for non-convex problems (Möllenhoff et al. 2014b).

Since the minimization problem is separable, let us consider the solution obtained for a single element \(x_{i,j}\) and \(z_{i,j}\) of the variables x and z respectively. With a slight abuse of notation, instead of the cumbersome element-wise notation \(x_{i,j}\) and \(z_{i,j}\), we simply write x and z in the following equations. We use the representation \(x\doteq \rho w\), where \(\rho \ge 0\) and \(\Vert w\Vert _2 = \Vert z\Vert _2/\sigma \). Then, letting \(\xi = z/\sigma \), we have

$$\begin{aligned}&\arg \min _x \frac{1}{2}\left\| \xi -x\right\| _2^2+\sigma F_2(x) \nonumber \\&\quad =\arg \min _{\rho ,w} \frac{\rho ^2}{2}\left\| \frac{\xi }{\rho }- w\right\| _2^2 + \frac{\sigma }{\mu } \log \left( \rho ^2\frac{\Vert z\Vert _{2}^2}{\sigma ^2}+\epsilon ^2\right) . \end{aligned}$$
(22)

Notice that the logarithmic term now depends only on \(\rho \). Hence, we can first solve the minimization problem with respect to w. By simplifying the least squares term we obtain

$$\begin{aligned}&\arg \min _{w, \Vert w\Vert =\frac{\Vert z\Vert }{\sigma }} \frac{\rho ^2}{2}\left\| \frac{\xi }{\rho }- w\right\| ^2 \nonumber \\&\quad = \arg \min _{w, \Vert w\Vert =\frac{\Vert z\Vert }{\sigma }} \left\| \frac{\xi }{\rho }\right\| ^2 + \left\| w\right\| ^2 - 2\langle \xi /\rho , w\rangle \nonumber \\&\quad = \arg \min _{w, \Vert w\Vert =\frac{\Vert z\Vert }{\sigma }} \frac{\Vert z\Vert ^2}{\sigma ^2} - 2\langle \xi /\rho , w\rangle \\&\quad = \arg \max _{w, \Vert w\Vert =\frac{\Vert z\Vert }{\sigma }} \langle \xi , w\rangle \nonumber \end{aligned}$$
(23)

which immediately yields \(w = \frac{\Vert z\Vert _2}{\sigma \Vert \xi \Vert _2}\xi = z/\sigma \). By substituting the expression of w back into Eq. (22) and by using \(\xi = z/\sigma \) we finally have

$$\begin{aligned}&\arg \min _x \frac{1}{2}\left\| \xi -x\right\| ^2+\sigma F_2(x) \nonumber \\&\quad = \xi \cdot \arg \min _{\rho } \frac{1}{2}\left\| \xi - \rho \frac{z}{\sigma }\right\| ^2+ \frac{\sigma }{\mu }\log \left( \rho ^2\frac{\Vert z\Vert ^2}{\sigma ^2}+\epsilon ^2\right) \nonumber \\&\quad =\xi \cdot \arg \min _{\rho } \frac{\mu }{2\sigma }\left( 1- \rho \right) ^2\Vert \xi \Vert ^2+ \log \left( \rho ^2\Vert \xi \Vert ^2+\epsilon ^2\right) . \end{aligned}$$
(24)

We can define \(\mathcal{H}\) as the solution of the 1D problem

$$\begin{aligned} \mathcal{H}(\xi ,\mu ,\epsilon ,\sigma )= & {} \arg \min _\rho \frac{\mu }{2\sigma }(\rho -1)^2\Vert \xi \Vert _2^2 \nonumber \\&+\, \log (\rho ^2\Vert \xi \Vert _2^2+\epsilon ^2) \end{aligned}$$
(25)

and build it into a lookup table. The proximity operator \(\hbox {prox}_{\sigma F_2^*}\) can then be written as

$$\begin{aligned} \hbox {prox}_{\sigma F_2^*}(z) = \left( 1-\mathcal{H}\left( \frac{z}{\sigma },\mu ,\epsilon ,\sigma \right) \right) z. \end{aligned}$$
(26)

The final algorithm is summarized in Table 1. A similar approach was proposed for the minimization of \(L_p\) norms with \(0 < p < 1\) by Möllenhoff et al. (2014a). Notice how several operations are parallelizable, thus leading to a very efficient implementation.
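The following sketch illustrates the u-update of this section; it assumes periodic boundary conditions, replaces the lookup table for \(\mathcal{H}\) with a per-pixel grid search, and uses the step sizes reported in Sect. 7, so it should be read as an illustration under these assumptions rather than as the reference implementation.

```python
import numpy as np
from scipy.signal import fftconvolve

def grad(u):
    """Forward differences with periodic boundaries, stacked as (2, N, M)."""
    return np.stack([np.roll(u, -1, axis=1) - u, np.roll(u, -1, axis=0) - u])

def grad_adjoint(z):
    """Adjoint of grad, so that <grad u, z> = <u, grad_adjoint(z)>."""
    gx, gy = z
    return (np.roll(gx, 1, axis=1) - gx) + (np.roll(gy, 1, axis=0) - gy)

def prox_F1_conj(z, sigma, f):
    """Closed-form proximity operator of Eq. (20)."""
    return (z - sigma * f) / (sigma + 1.0)

def prox_F2_conj(z, sigma, mu, eps, n_rho=101):
    """Eqs. (25)-(26): H is found here by a per-pixel grid search over rho in
    [0, 1] (the minimizer always lies in this interval); a precomputed lookup
    table indexed by ||z/sigma|| would be the faster alternative."""
    rho = np.linspace(0.0, 1.0, n_rho)
    t = np.sum((z / sigma) ** 2, axis=0)                      # ||xi||^2 per pixel
    obj = (mu / (2.0 * sigma)) * (rho[:, None, None] - 1.0) ** 2 * t \
          + np.log(rho[:, None, None] ** 2 * t + eps ** 2)
    rho_star = rho[np.argmin(obj, axis=0)]                    # H(z/sigma, mu, eps, sigma)
    return (1.0 - rho_star) * z

def primal_dual_u_step(u0, k, f, lam, p=1.0, eps=1e-3,
                       tau=0.005, sigma=None, theta=1.0, n_iters=500):
    """Sketch of the primal-dual u-update of Sect. 4.1; tau*sigma*||K||^2 < 1
    must hold, and periodic boundaries are an assumption of this sketch."""
    mu = 2.0 * lam / p
    if sigma is None:
        sigma = 1.0 / (32.0 * tau)
    k_m = k[::-1, ::-1]                                       # mirrored kernel k_-
    u = u0.copy(); u_bar = u0.copy()
    z1 = np.zeros_like(f)
    z2 = np.zeros((2,) + f.shape)
    for _ in range(n_iters):
        z1 = prox_F1_conj(z1 + sigma * fftconvolve(u_bar, k, mode='same'), sigma, f)
        z2 = prox_F2_conj(z2 + sigma * grad(u_bar), sigma, mu, eps)
        u_new = u - tau * (fftconvolve(z1, k_m, mode='same') + grad_adjoint(z2))
        u_bar = u_new + theta * (u_new - u)
        u = u_new
    return u
```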

Table 1 The proposed primal-dual algorithm

4.2 A Majorization–Minimization Approach

As a more accurate alternative to the primal-dual algorithm, one can use a majorization–minimization (MM) approach (Hunter and Lange 2004), in a manner similar to that proposed by Candes et al. (2008). In the MM approach one defines an upper bound functional \(\psi (u|u^t)\) given the current estimate \(u^t\) at time t. This upper bound must satisfy the following properties

$$\begin{aligned} \begin{array}{rl} \psi (u|u^t) \ge &{}\displaystyle \sum _{i=1}^N\sum _{j=1}^M \log \Vert \nabla u_{i,j}\Vert _{2,\epsilon }^p\\ \psi (u^t|u^t) =&{}\displaystyle \sum _{i=1}^N\sum _{j=1}^M \log \Vert \nabla u_{i,j}^t\Vert _{2,\epsilon }^p. \end{array} \end{aligned}$$
(27)

Then, one can apply the following iterative scheme

$$\begin{aligned} u^{t+1}&= \arg \min _u \sum _{i=1,j=1}^{N,M} \lambda \left( (k*u)_{i,j} - f_{i,j}\right) ^2 + \psi (u|u^t) \end{aligned}$$
(28)

and provably reach a local minimum of the original function. To define the upper bound, we use the first-order Taylor expansion of the logarithm around the t-th estimate \(\Vert \nabla u^t\Vert _{2,\epsilon }^p\)

$$\begin{aligned} \psi (u|u^t)&= \sum _{i=1,j=1}^{N,M} \log \Vert \nabla u_{i,j}^t\Vert _{2,\epsilon }^p\nonumber \\&\quad + \frac{\Vert \nabla u_{i,j}\Vert _{2,\epsilon }^p-\Vert \nabla u_{i,j}^t\Vert _{2,\epsilon }^p}{\Vert \nabla u_{i,j}^t\Vert _{2,\epsilon }^p}. \end{aligned}$$
(29)

The properties (27) hold because of the concavity of the logarithm function. Finally, by plugging \(\psi \) in Eq. (28) we obtain the following update

$$\begin{aligned} u^{t+1}= & {} \arg \min _u \sum _{i=1,j=1}^{N,M} \lambda \left( (k*u)_{i,j} - f_{i,j}\right) ^2\\&+\, \log \Vert \nabla u_{i,j}^t\Vert _{2,\epsilon }^p + \frac{\Vert \nabla u_{i,j}\Vert _{2,\epsilon }^p-\Vert \nabla u_{i,j}^t\Vert _{2,\epsilon }^p}{\Vert \nabla u_{i,j}^t\Vert _{2,\epsilon }^p} \nonumber \\= & {} \arg \min _u\sum _{i=1,j=1}^{N,M} \lambda \left( (k*u)_{i,j} - f_{i,j}\right) ^2 + \frac{\Vert \nabla u_{i,j}\Vert _{2,\epsilon }^p}{\Vert \nabla u_{i,j}^t\Vert _{2,\epsilon }^p}.\nonumber \end{aligned}$$
(30)

The resulting majorization–minimization algorithm is summarized in Table 2. Notice the similarity with iteratively reweighted least-squares algorithms when \(p=2\).
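A sketch of the resulting u-update for the case \(p=2\), where each inner problem (30) is a reweighted least-squares problem, is given below; the conjugate-gradient inner solver and the periodic finite differences are assumptions of this sketch.

```python
import numpy as np
from scipy.signal import fftconvolve
from scipy.sparse.linalg import LinearOperator, cg

def mm_u_step(u0, k, f, lam, eps=1e-3, n_outer=20, cg_iters=50):
    """Sketch of the MM u-update of Sect. 4.2 for p = 2: each inner problem (30)
    becomes a reweighted least-squares problem, solved here with conjugate gradient."""
    shape = f.shape
    k_m = k[::-1, ::-1]                                 # mirrored kernel
    dx  = lambda v: np.roll(v, -1, axis=1) - v          # periodic forward differences
    dy  = lambda v: np.roll(v, -1, axis=0) - v
    dxT = lambda v: np.roll(v,  1, axis=1) - v          # their adjoints
    dyT = lambda v: np.roll(v,  1, axis=0) - v
    u = u0.copy()
    for _ in range(n_outer):
        w = 1.0 / (dx(u)**2 + dy(u)**2 + eps**2)        # weights from the estimate u^t
        def matvec(x):
            v = x.reshape(shape)
            Av = lam * fftconvolve(fftconvolve(v, k, mode='same'), k_m, mode='same') \
                 + dxT(w * dx(v)) + dyT(w * dy(v))
            return Av.ravel()
        A = LinearOperator((f.size, f.size), matvec=matvec, dtype=f.dtype)
        b = lam * fftconvolve(f, k_m, mode='same')      # normal equations right-hand side
        u_flat, _ = cg(A, b.ravel(), x0=u.ravel(), maxiter=cg_iters)
        u = u_flat.reshape(shape)
    return u
```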

Table 2 The proposed majorization-minimization algorithm
Fig. 1 Log probability of natural and blurry images compared to Cauchy and Hyper-Laplacian distributions

5 Analysis of the Logarithmic Prior

From a Bayesian perspective the logarithmic prior in Eq. (11) with \(p = \frac{2}{\pi \epsilon }\) is equivalent to the assumption that the gradient magnitudes of u are independent and identically distributed according to a Cauchy distribution. In fact, if we define the prior as

$$\begin{aligned} \frac{1}{\pi \epsilon } \prod _{i=1, j=1}^{N, M}\left( \frac{\epsilon ^2}{\Vert \nabla u_{i,j}\Vert _2^2 + \epsilon ^2} \right) , \end{aligned}$$
(31)

the minimization in Eq. (12) is equivalent to a maximum a posteriori (MAP) estimation in which Eq. (31) is used as the image prior. The Cauchy distribution does not fit natural image statistics particularly well, as we show in Fig. 1. Why, then, should it be a good choice for blind deconvolution?

5.1 Image Statistics or Blur “Reconstructability”?

Following the arguments made in Wipf and Zhang (2014), we argue that in blind deconvolution an image prior does not necessarily have to model natural image statistics. Rather, it could model another family of images, as long as it allows the estimation of the correct blur kernel k. In support of this claim we show that blur estimation is sufficiently accurate for a wide range of “unnatural” images, and hence we can use them in practice. Our analysis focuses on the image family modeled by the logarithmic prior, but similar results are obtained by using Total Variation, another typical cartooning prior successfully used in blind/non-blind deconvolution.

To substantiate this claim, we perform an empirical evaluation of the logarithmic prior. The family of “unnatural” images is generated by solving the denoising problem

$$\begin{aligned} u_\lambda = \arg \min _x \frac{\lambda }{2}\Vert x - u\Vert _2^2 + \log \Vert \nabla x\Vert _{2,\epsilon }^p, \end{aligned}$$
(32)

where u is a ground truth sharp image and the image \(u_\lambda \) is its approximation via the logarithmic prior (for a given \(\lambda \)). We consider several degrees of regularization by varying \(\lambda \). As sharp images we use the dataset introduced in Sun et al. (2013), composed of 80 images.

After computing the image \(u_\lambda \), which is essentially a “cartooned” version of the original image u, we estimate the blur \(k_\lambda \) by solving the problem

$$\begin{aligned} \min _{k}&\quad \Vert k*u_\lambda - f\Vert _2^2\nonumber \\&\hbox {s.t. }\quad k\succcurlyeq 0,\quad \Vert k\Vert _1 = 1 \end{aligned}$$
(33)

where the blurry image f is generated by convolving one of the 8 different blurs in Sun et al. (2013) with the ground truth data u. Thus, the model \(k*u_\lambda \), \(\lambda <\infty \), can never match f exactly. Because of the dependency on \(\lambda \) we denote the optimal blur by \(k_\lambda \). Finally, to evaluate the accuracy of the estimated blur, we compute the error between \(k_\lambda \) and the ground truth blur k by using the \(L_2\) norm of their difference. When \(\lambda \) is very high, the cartooning disappears and \(u_\lambda \) approaches u. In this case the recovered blur \(k_\lambda \) also approaches the true blur k. It is more interesting to observe \(k_\lambda \) when \(\lambda \) becomes small, as this corresponds to working with a family of less “natural” images \(u_\lambda \).
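The blur-estimation step (33) can be sketched as follows for small patches; \(u_\lambda \) is assumed to be precomputed by solving problem (32), e.g. with the solvers of Sect. 4 applied with \(k=\delta \), and the simplex constraint is approximated here by a final renormalization.

```python
import numpy as np
from scipy.signal import fftconvolve
from scipy.optimize import nnls

def estimate_kernel(u_lam, f, k_size):
    """Sketch of problem (33) for small patches: each column of A is u_lam shifted
    by one kernel tap, so that A @ vec(k) = vec(k * u_lam). Non-negativity is
    handled by NNLS; the constraint ||k||_1 = 1 is enforced by renormalization,
    a simplification of the fully constrained problem."""
    cols = []
    for m in range(k_size):
        for n in range(k_size):
            delta = np.zeros((k_size, k_size)); delta[m, n] = 1.0
            cols.append(fftconvolve(u_lam, delta, mode='same').ravel())
    A = np.stack(cols, axis=1)
    k_flat, _ = nnls(A, f.ravel())
    k = k_flat.reshape(k_size, k_size)
    return k / max(k.sum(), 1e-12)
```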

Fig. 2 Plot of the average error \(\Vert k_\lambda - k\Vert _2^2\) obtained over the dataset from Sun et al. (2013) for different values of \(\lambda \). The blue region denotes the area around the average bounded by one standard deviation

In Fig. 2 we show the average blur error for different values of \(\lambda \). When \(\lambda \) is decreased, \(u_\lambda \) becomes more “cartooned”, until, at \(\lambda = 0.1\), it becomes almost constant (see Fig. 5). This translates into an increasing error for the blur \(k_\lambda \) as \(\lambda \) becomes smaller. In the second column of Fig. 4 one can see how even a cartooned version of the sharp image can yield a blur with a relatively small error compared to the ground truth blur. The other columns in Figs. 4 and 5 show how images that look different still yield similar blur estimates.

As typically done in the literature, we also perform a final non-blind deconvolution step using the estimated blurs. This experiment allows us to establish a connection between the error in the blur reconstruction and the error in the final sharp image. For this evaluation we use the SSD ratio proposed in Levin et al. (2011b), computed as

$$\begin{aligned} r = \frac{ \sum _{i=1,j=1}^{N,M} (u_{i,j}^{k_\lambda } - u_{i,j}^g )^2 }{\sum _{i=1,j=1}^{N,M} (u_{i,j}^{k_g} - u_{i,j}^g )^2} \end{aligned}$$
(34)

where \(u^g\) is the ground truth sharp image, \(u^{k_g}\) is the image obtained by solving a non-blind deconvolution problem with the ground truth blur, and \(u^{k_\lambda }\) is the image obtained by solving a non-blind deconvolution problem with the blur \(k_\lambda \). This metric is commonly used to evaluate blind deconvolution algorithms, because it takes into account the intrinsic difficulty of each blur. In our experiment we evaluate the blurs estimated with images obtained with different values of \(\lambda \) using the non-blind deconvolution proposed in Zoran and Weiss (2011).
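For reference, Eq. (34) translates directly into code:

```python
import numpy as np

def ssd_ratio(u_est_kernel, u_gt_kernel, u_gt):
    """SSD ratio of Eq. (34): numerator uses the image deconvolved with the
    estimated blur k_lambda, denominator the image deconvolved with the ground
    truth blur; both are compared against the ground truth sharp image u_gt."""
    return np.sum((u_est_kernel - u_gt) ** 2) / np.sum((u_gt_kernel - u_gt) ** 2)
```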

In Fig. 3 we show the percentage of images with \(r \le 5\) for each value of \(\lambda \). We choose to consider error ratios smaller than 5 because it has been shown that reconstructed images with such error ratios are still visually pleasant (Michaeli and Irani 2014). Notice that for values of \(\lambda >60\), \(100\, \%\) of the images have an error ratio smaller than 5 (Fig. 4). This range also includes values for which a cartooned version of the sharp image is generated. Indeed, from Fig. 5 one can see that \(u_\lambda \) is evidently cartooned for \(\Vert k_\lambda - k\Vert \le 0.01\). In Fig. 2 such blur errors correspond to \(\lambda <200\), a range that largely overlaps with the range \(\lambda >60\) obtained from Fig. 3.

This set of experiments shows that a large family of reconstructed images allows the correct estimation of the PSF, and that this family includes the ground truth image as well as images where details (which do not affect the blur reconstruction) have been removed. In conclusion, it is limiting to look only for images that resemble the sharp ground truth as closely as possible; the logarithmic prior is able to generate different images that still allow a correct reconstruction of the blur.

Fig. 3 Percentage of images with an error ratio smaller than or equal to 5 for different values of \(\lambda \)

Fig. 4 Example of kernels obtained by solving Eq. (33) using different \(u_\lambda \)

Fig. 5 Enlarged regions of images obtained by solving Eq. (32) and corresponding to the kernels in Fig. 4

5.2 Favoring Cartooned Images

Another fundamental aspect of the image prior is whether it favors the blurry input or some “cartooned” version of the true sharp image. To understand how the logarithmic prior behaves, we compare it with the total variation (TV) prior and the Hyper-Laplacian prior. Denoting by \(\psi \) the value of a prior, we consider the difference \(d(\lambda ) = \psi (u_\lambda ) - \psi (f)\). If \(d(\lambda )\) is negative the prior favors \(u_\lambda \) over f; if it is positive, the prior favors f instead of \(u_\lambda \). In Fig. 6 we show the average value of \(d(\lambda )\) computed on the dataset in Sun et al. (2013) for the three priors and different values of \(\lambda \).
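A sketch of this comparison is given below; the hyper-Laplacian exponent and the finite-difference scheme are assumptions of the example.

```python
import numpy as np

def prior_value(u, kind, eps=1e-3, q=0.5):
    """psi(u) for three priors: isotropic TV, a hyper-Laplacian prior (L_q of the
    gradient magnitude with q < 1, an assumed choice), and the log prior (p = 1)."""
    dx = np.diff(u, axis=1, append=u[:, -1:])
    dy = np.diff(u, axis=0, append=u[-1:, :])
    g2 = dx**2 + dy**2
    if kind == 'tv':
        return np.sum(np.sqrt(g2))
    if kind == 'hyper-laplacian':
        return np.sum(g2 ** (q / 2.0))
    return 0.5 * np.sum(np.log(g2 + eps**2))

def d_lambda(u_lam, f, kind):
    """d(lambda) = psi(u_lambda) - psi(f); negative values mean the prior favors
    the cartooned image u_lambda over the blurry image f."""
    return prior_value(u_lam, kind) - prior_value(f, kind)
```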

Fig. 6 Difference between the prior computed on the image \(u_\lambda \) and the blurry image f

Since each prior has a different sensitivity to the parameter \(\lambda \), we normalize the plot so that each curve crosses zero (in d) at the same point.

A first interesting result is that all three priors have a range of \(\lambda \) for which the image \(u_\lambda \) is favored. This aspect was also highlighted in Wipf and Zhang (2014). However, the magnitude of \(d(\lambda )\) also matters: the more negative \(d(\lambda )\) is, the more strongly the prior discriminates between the sharp explanation and the blurry one. From this perspective, the logarithmic prior has a stronger preference towards the sharp image than the other priors.

6 The Role of \(\epsilon \)

In this section we study the behavior of the logarithmic prior introduced in Eq. (11) according to the choice of the parameter \(\epsilon \).

One relevant aspect to consider is how to avoid the degenerate constant solution. To this end we compare two candidate solutions: a constant image \(u=\hbox {constant}\), and the no-blur solution \(u=f\) with \(k=\delta \). The idea is to make sure that the cost function favors the no-blur solution over the constant one. Plugging the two solutions into the cost of problem (12) gives the inequality

$$\begin{aligned} \log \Vert \nabla f\Vert _{2,\epsilon }^p < \lambda \Vert \bar{f} - f\Vert _2^2 + \frac{p}{2} MN\log \epsilon ^2 \end{aligned}$$
(35)

or, alternatively,

$$\begin{aligned} \log \left\| \frac{1}{\epsilon }\nabla f\right\| _{2,1}^p < \lambda \Vert \bar{f} - f\Vert _2^2, \end{aligned}$$
(36)

where \(\bar{f}\) is the average value of f. Then, we use Jensen’s inequality and the fact that the logarithm is a concave function to obtain an upper bound of the left hand side of Eq. (36)

$$\begin{aligned}&p/2 \sum _{i=1,j=1}^{N,M} \log \left[ \left\| \frac{1}{\epsilon }\nabla f_{i,j}\right\| _{2,1}^2\right] \nonumber \\&\quad \le \frac{pMN}{2}\log \left[ \frac{1}{MN} \sum _{i=1,j=1}^{N,M} \left\| \frac{1}{\epsilon }\nabla f_{i,j}\right\| _{2,1}^2\right] . \end{aligned}$$
(37)

Then, if we choose \(\epsilon \) such that

$$\begin{aligned} \epsilon > \sqrt{\frac{\frac{1}{MN}\sum _{i=1,j=1}^{N,M} \left\| \nabla f_{i,j}\right\| _{2}^2}{e^{\frac{2\lambda }{pMN}\Vert f-\bar{f}\Vert _2^2}-1}} \end{aligned}$$
(38)

the degenerate constant solution will be avoided. Also, notice that \(\frac{2\lambda }{pMN}\Vert f-\bar{f}\Vert _2^2>0\) and \(\frac{1}{MN}\sum _{i=1,j=1}^{N,M} \left\| \nabla f_{i,j}\right\| _{2}^2>0\) unless f is constant (in which case a constant u is a plausible solution and should not be avoided). This means that an \(\epsilon \) satisfying Eq. (38) always exists.
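The bound in Eq. (38) can be computed directly from the blurry image, as in the following sketch; the finite-difference scheme is an assumption, and for large \(\lambda \) the bound saturates to zero.

```python
import numpy as np

def eps_lower_bound(f, lam, p=1.0):
    """Lower bound of Eq. (38) on eps that rules out the constant solution
    (assumes f is not constant)."""
    M, N = f.shape
    dx = np.diff(f, axis=1, append=f[:, -1:])
    dy = np.diff(f, axis=0, append=f[-1:, :])
    mean_grad2 = np.mean(dx**2 + dy**2)              # (1/MN) * sum ||grad f_ij||^2
    expo = 2.0 * lam * np.sum((f - f.mean())**2) / (p * M * N)
    if expo > 700.0:                                 # exp would overflow: bound ~ 0
        return 0.0
    return float(np.sqrt(mean_grad2 / np.expm1(expo)))
```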

Fig. 7 SSD reconstruction error for the non-blind deconvolution problem with different values of \(\lambda \) and \(\epsilon \)

Fig. 8 SSD reconstruction error for the blind deconvolution problem (12) with different values of \(\lambda \) and \(\epsilon \)

Fig. 9 Cumulative histogram of SSD ratio results on the dataset (Levin et al. 2011b)

Fig. 10 Cumulative histogram of SSD results per image of the dataset (Levin et al. 2011b)

Fig. 11 Cumulative histogram of SSD ratio results on the dataset (Sun et al. 2013)

Fig. 12 Cumulative histogram of SSD kernel error on the dataset (Levin et al. 2011b)

Fig. 13 Cumulative histogram of SSD kernel error on the dataset (Sun et al. 2013)

Fig. 14 Average cost per iteration for the Log-TV MM and Log-TV PD algorithms

To understand in more detail how the choice of \(\epsilon \) changes the behavior of the logarithmic prior, we performed an empirical evaluation on the blind and non-blind problems. We extracted patches of size \(201 \times 201\) pixels from the dataset proposed in Sun et al. (2013) and synthetically blurred them with blurs of size \(11 \times 11\). We then solved the non-blind deconvolution problem (using the ground truth blur) and the blind deconvolution problem with different values of \(\lambda \) and \(\epsilon \), and finally computed the Sum of Squared Differences (SSD) error between the reconstructed images and the ground truth sharp image. In Fig. 7 we show a visualization of the SSD error for the non-blind deconvolution problem. The blue color denotes smaller errors while the yellow color denotes larger errors. A small value of \(\lambda \) gives large errors because it removes details of the image, while a large value slightly increases the error because it is not able to remove the noise in the image. In the region surrounding the best value of \(\lambda \), the value of \(\epsilon \) appears to be less relevant. Nonetheless, large values of \(\epsilon \) increase the error.

Fig. 15 Examples of deblurred images from the dataset in Levin et al. (2011b)

Fig. 16 Deblurred images with the worst SSD ratio from the dataset in Sun et al. (2013)

In Fig. 8 we show a visualization of the SSD error for the blind deconvolution problem. In this case the estimation is more sensitive to the values of \(\lambda \) and \(\epsilon \), and the range of values that gives the smallest errors is narrower.

7 Experiments

We evaluate the proposed algorithms on the datasets proposed in Levin et al. (2011b) and Sun et al. (2013). The first dataset is made of 4 images of size \(255 \times 255\) pixels blurred with 8 different blurs, and it is provided with ground truth sharp images and blurs. We use the metric defined in Eq. (34) as a performance measure as done also in Levin et al. (2011b).

For each method the same parameters are used for all the 32 blurry images of the dataset. For all the tests we used the non-blind deconvolution algorithm from Levin et al. (2007), where for each method we carefully selected the regularization parameter in order to have the best SSD ratio.

In Fig. 9 we show the cumulative histogram of the SSD ratios for several methods in the literature and for our proposed algorithms (Log-TV MM and Log-TV PD). The MM algorithm achieves an error ratio equal to 1 for more than 50 % of the images, clearly outperforming the methods from Wipf and Zhang (2013) and Babacan et al. (2012), and, for most error ratios, the method of Sun et al. (2013). Our primal-dual method is on par with high-performing variational Bayesian algorithms such as the one from Levin et al. (2011a). In Fig. 10 we also show the cumulative histogram of the SSD errors, while in Fig. 15 we show some of the sharp images obtained on this dataset. In Fig. 12 we show the SSD errors of the blur kernels compared with the ground truth kernels. The Log-TV MM algorithm achieves a very good blur reconstruction.

For our methods we used the same regularization parameter \(\lambda = 30{,}000\), \(\epsilon = 0.001\), \(p = 1\) and 3500 iterations for each pyramid level. For the primal-dual algorithm we set \(N_0 = 1\), \(\tau = 0.005\) and \(\sigma = \frac{1}{32\tau }\). The parameter values were found experimentally. We used a pyramid scheme where the input image and the blur are downsampled at each level by a factor of \(\sqrt{2}\), and the parameter \(\lambda \) is divided by 2.1. The number of levels of the pyramid is computed such that at the top level the blur kernel has a support of 3 pixels. For the other methods we used the estimates provided by the authors, or we ran their algorithms using the tuning that gave the best results.
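A sketch of this coarse-to-fine schedule is given below; the rounding of the kernel support to odd sizes and the exact level count are assumptions of the example.

```python
import numpy as np

def pyramid_schedule(k_size, lam0=30000.0, scale=np.sqrt(2.0), lam_decay=2.1):
    """Sketch of the coarse-to-fine schedule described above: enough levels so
    that the blur support is about 3 pixels at the coarsest level, downsampling
    by sqrt(2) per level and dividing lambda by 2.1 per level."""
    n_levels = max(int(np.ceil(np.log(k_size / 3.0) / np.log(scale))) + 1, 1)
    levels = [{'downsample': scale ** l,                              # image/blur factor
               'k_size': max(int(round(k_size / scale ** l)) | 1, 3), # odd kernel support
               'lam': lam0 / lam_decay ** l} for l in range(n_levels)]
    return levels[::-1]          # process from coarsest (smallest image) to finest
```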

In Fig. 11 we show an evaluation on the dataset from Sun et al. (2013). In this case we have 80 images of average size \(1024 \times 768\) synthetically blurred with the blurs from Levin et al. (2009). The evaluation is made in a similar manner as in Fig. 9, but using the non-blind deconvolution from Zoran and Weiss (2011) as proposed in Sun et al. (2013). Michaeli and Irani (2014) have highlighted that images in this dataset with an error ratio smaller than 5 are visually pleasant (Fig. 12). Also in this case the MM algorithm outperforms the other methods, except for error ratios smaller than 2, where the algorithm in Sun et al. (2013) performs slightly better. The primal-dual method performs on par with the algorithm proposed in Michaeli and Irani (2014) up to an error ratio of 3.5, but then drops in performance. This suggests a larger instability of the algorithm for more difficult blurs. For this dataset we set the parameters to \(\lambda = 12{,}500\) and \(\epsilon = 0.0005\) for both algorithms. In Fig. 16 we show the images that have the worst error ratio in the dataset from Sun et al. (2013). In Fig. 13 we show the SSD errors of the blur kernels compared with the ground truth kernels. In this case the Log-TV MM algorithm is still a top performer, but not a clear winner as in the other experiments. Nonetheless, its reconstructed blurs lead to a better image reconstruction than those of the other methods. This means that the metric used to assess the blur reconstruction (the Euclidean norm) can only roughly predict the image reconstruction accuracy.

We also measured the performance of the two algorithms on a MacBook Pro with a 2.6 GHz Intel Core i7 processor and 16 GB of RAM with a bandwidth of 1600 MHz. We consider the average time required to process an image of size \(1024 \times 800\) for four different cases: an unoptimized MATLAB implementation of the MM algorithm (MatMM); an unoptimized MATLAB implementation of the PD algorithm (MatPD); a MATLAB implementation of the MM algorithm where the prior gradient is computed by a C routine (MatCMM); and a MATLAB implementation of the PD algorithm where the time of the proximity operators is computed by assuming an ideal parallelization in which the computation time of the parallelizable components is divided by the number of pixels (MatPPD). In our experiments the MatMM implementation took on average 194.56 min, the MatPD implementation 46.90 min, the MatCMM 46.13 min, and the MatPPD would take, in the ideal case, 37.41 min.

To understand the difference in convergence of the MatMM and MatPD algorithms, we show the cost functional (12) at each iteration in Fig. 14. As the cost evolution changes depending on the image and blur, we compute an average of all costs on the dataset of Levin et al. (2011b). Also, because both algorithms use a pyramid scheme, we show the cost evolution of only the last level and we use the same initialization. We use a logarithmic scale for both axes so that two facts are emphasized: (1) the MatPD algorithm lowers the cost more quickly than the MatMM algorithm; (2) the MatMM algorithm achieves a smaller cost. Notice that the logarithmic prior in the cost functional can make the cost negative; thus, we add a constant before converting the cost to the logarithmic scale.

From our experiments it can be concluded that the primal-dual method results in a faster implementation, but its convex approximation of the logarithmic prior is too coarse to achieve the same accuracy as the MM algorithm.

8 Conclusions

In this paper we presented solutions to blind deconvolution based on a logarithmic image prior. The chosen prior is as effective as \(L_p\) norms with \(p<1\) on the image gradients, while leading to simple optimization schemes despite its non-convexity. We presented empirical experiments that support the choice of a logarithmic prior for blind deconvolution. To solve blind deconvolution with this image prior we proposed a computationally efficient scheme based on a primal-dual approach and a high-accuracy scheme based on majorization–minimization. Both approaches perform well and converge robustly.