1 Introduction

In [31], the following duality-based bilevel optimization framework was proposed for the development of a monolithic variational (i.e., optimization) approach that simultaneously recovers an image \(u:\Omega \rightarrow \mathbb {R}\) and a spatially varying regularization weight \(\alpha :\Omega \rightarrow \mathbb {R}_+\) from measurement data \(f\in L^2(\Omega )\):

$$\begin{aligned} \left\{ \begin{aligned}&\text {minimize}\quad J(\mathbf {p},\alpha )\quad \text {over } \alpha \in {\mathcal {A}}_\text {ad},\\&\text {s.t. } \mathbf {p} \text { solves } (\mathrm {D}(\alpha )), \end{aligned}\right. \end{aligned}$$
(\(\mathbb {P}\))

where \(J (\cdot , \cdot )\) is defined in (\(\tilde{\mathbb {P}}\)) below along with the motivation for its choice. Specifically, it contains a term involving a localized variance estimator and an \(H^1\)-regularization term. We define \(\mathbf {K}(\alpha )\) as

$$\begin{aligned} \mathbf {K}(\alpha ):=\{\mathbf {q}\in H_0({\text {div}})~:~|\mathbf {q}(x)|_\infty \le \alpha (x)\text { f.a.a. }x\in \Omega \}, \end{aligned}$$

and the lower level problem \(\text {D}(\alpha )\) is given by

$$\begin{aligned} \text {minimize}\quad \frac{1}{2}\left| {\text {div}}\,\mathbf {p}+K^*f\right| _B^2\quad \text {over } \mathbf {p}\in \mathbf {K}(\alpha ), \end{aligned}$$
(\(\mathrm {D}(\alpha )\))

with \({\text {div}}(\cdot )=\sum _i\frac{\partial (\cdot )_i}{\partial x_i}\) the divergence operator, and K a linear and continuous transfer operator from \(L^2(\Omega )\) to \(L^2(\Omega )\), i.e., \(K\in \mathcal {L}(L^2(\Omega ))\), and \(K^*\) standing for its adjoint. Specific examples for K are the identity (denoising), convolution (deblurring), and Fourier or wavelet transforms. The image domain \(\Omega \subset \mathbb {R}^\ell \), where \(\ell =1\) or 2 (unless stated differently), is a bounded connected open set with Lipschitz boundary \(\partial \Omega \). The given datum satisfies \(f=Ku_{\text {true}}+\eta \in L^2(\Omega )\), where \(u_{\text {true}}\) denotes the original image and \(\eta \) additive “noise,” which has zero mean on \(\Omega \) and satisfies \(|\eta |_{L^2(\Omega )}^2\le \sigma ^2|\Omega |\) with \(\sigma ^2>0\) and \(|\Omega |\) the (Lebesgue) measure of \(\Omega \). Further, \(|w|_B^2:=(w,B^{-1}w)_{L^2(\Omega )}\) with \(B=K^*K\), which—for simplicity—is assumed invertible, and \(|\cdot |_\infty \) denotes the maximum norm on \(\mathbb {R}^\ell \). We use \((\cdot ,\cdot )_{L^2(\Omega )}\) to denote the \(L^2(\Omega )\)-inner product, for which we sometimes also write \((\cdot ,\cdot )_{L^2}\) or just \((\cdot ,\cdot )\). Note also that with inner products and pairings we do not distinguish notationwise between scalar functions and vector fields. The underlying function space is

$$\begin{aligned} H_0({\text {div}}):=\{\mathbf {v}\in L^2(\Omega )^\ell : {\text {div}}\,\mathbf {v}\in L^2(\Omega ) \, \text { and } \, \mathbf {v}\cdot \varvec{n}|_{\partial \Omega }=0\}, \end{aligned}$$
(1.1)

where \(\varvec{n}\) denotes the outer unit normal vector and the boundary condition is taken in the \(H^{-1/2}(\partial \Omega )\)-sense. Endowed with the inner product

$$\begin{aligned} (\mathbf {v},\mathbf {w})_{H_0({{\mathrm{\mathrm {div}}}})}:=(\mathbf {v},\mathbf {w})+({{\mathrm{\mathrm {div}}}}\mathbf {v},{{\mathrm{\mathrm {div}}}}\mathbf {w}), \end{aligned}$$

\(H_0({{\mathrm{\mathrm {div}}}})\) is a Hilbert space. Moreover,

$$\begin{aligned} {\mathcal {A}}_\text {ad}:=\{\alpha \in H^1(\Omega ): \underline{\alpha }\le \alpha \le \overline{\alpha }, \text { a.e. on } \Omega \}, \end{aligned}$$
(1.2)

with scalars \(0<\underline{\alpha }<\overline{\alpha }<+\infty \), denotes the set of admissible filtering weights. Further, we note already here that throughout this work vector-valued quantities are written in bold font, and “s.t.” and “f.a.a.” stand for “subject to” and “for almost all,” respectively. Moreover, we use standard notation for Lebesgue spaces (\(L^p(\Omega )\), \(p\in [1,+\infty ]\)) and Sobolev spaces (\(W^{s,p}(\Omega )\), \(s\in [1,+\infty )\), and \(H^s(\Omega )=W^{s,2}(\Omega )\)); see, e.g., [1] for more on this. We denote by \(\langle \cdot ,\cdot \rangle \) the duality pairings \(\langle \cdot ,\cdot \rangle _{H^{-1},H_0^1}\) and \(\langle \cdot ,\cdot \rangle _{H^1(\Omega )^*,H^1(\Omega )}\). For the sake of completeness, we also mention that \(H^{-1/2}(\partial \Omega )\) denotes the dual space of \(H^{1/2}(\partial \Omega )\).

Provided that \(\alpha \) is regular enough, in [31] (see also [32]) it is argued that (\(\mathrm {D}(\alpha )\)) is the Fenchel pre-dual problem of the following weighted total variation problem:

$$\begin{aligned} \text {minimize}\quad J_P(u,\alpha )\quad \text {over } u\in BV(\Omega ), \end{aligned}$$
(\(\mathrm {P}\))

with

$$\begin{aligned} J_P(u,\alpha ):=\frac{1}{2}\int _{\Omega } |Ku-f|^2\mathrm {d}x+\int _\Omega \alpha (x) |\mathcal {D}u|, \end{aligned}$$

where \(BV(\Omega ):=\{u\in L^1(\Omega ): \mathcal {D}u\in \mathbf {M}(\Omega , \mathbb {R}^\ell )\}\), and with \(\mathcal {D}u\) representing the distributional gradient of u. Further, by \(\mathbf {M}(\Omega , \mathbb {R}^\ell )\) we denote the space of \(\mathbb {R}^\ell \)-valued Borel measures, which is the dual of \(C_c(\Omega ; \mathbb {R}^\ell )\), the space of continuous \(\mathbb {R}^\ell \)-valued functions with compact support in \(\Omega \). The quantity \(|\mathcal {D}u|\) stands for the smallest nonnegative scalar Borel measure associated with the sum of the total variation norms of the component measures of \(\mathcal {D}u\).

The bilevel optimization problem (\(\mathbb {P}\)) falls into the realm of mathematical programs with equilibrium constraints (MPECs) (in function space); see, e.g., [41, 44] for an account of MPECs in \(\mathbb {R}^n\), [5, 29, 34] for infinite dimensional settings, and [35, 40, 47] for recent applications in mathematical image processing. This problem class suffers from notoriously degenerate constraints ruling out the application of the celebrated Karush–Kuhn–Tucker theory (compare, e.g., [52]) for deriving first-order optimality or stationarity conditions.

As a remedy, for scalar parameters \(\beta , \delta , \epsilon , \gamma ,\lambda >0\) the following regularized version of (\(\mathbb {P}\)) is studied in [31]:

$$\begin{aligned} \left\{ \begin{aligned}&\text {minimize}\quad J(\mathbf {p},\alpha ):=F(R({\text {div}}\,\mathbf {p}))+\frac{\lambda }{2}|\alpha |^2_{H^{1}(\Omega )}\quad \text {over } \alpha \in {\mathcal {A}}_\text {ad},\\&\text {s.t. } \mathbf {p} \text { solves } \min _{\mathbf {v}\in H_0^1(\Omega )^\ell }\ \frac{1}{2}\left| {\text {div}}\,\mathbf {v}+K^*f\right| _B^2+\frac{\beta }{2}|\nabla \mathbf {v}|_{L^2}^2+\frac{\gamma }{2}|\mathbf {v}|_{L^2}^2+\frac{1}{\epsilon }{\mathcal {P}}_\delta (\mathbf {v},\alpha ), \end{aligned}\right. \end{aligned}$$
(\(\tilde{\mathbb {P}}\))

where \(F:L^2(\Omega )\rightarrow \mathbb {R}_0^+\) with

$$\begin{aligned} F(v):=\frac{1}{2}\int _{\Omega }\max (v-\overline{\sigma }^2,0)^2\mathrm {d}x+\frac{1}{2}\int _{\Omega }\min (v-\underline{\sigma }^2,0)^2\mathrm {d}x, \end{aligned}$$

and the \(\max \)- and \(\min \)-operations are understood in the pointwise sense. The choice of the bounds \(0<\underline{\sigma }\le \overline{\sigma }<\infty \) is based on statistical properties related to the noise contained in the measurement f; see Sect. 4.2.1 below for details. Moreover, R is defined by

$$\begin{aligned} R(v)(x):=\int _\Omega w(x,y)\, (KB^{-1}v+(K B^{-1}K^*- I)f)^2(y)\mathrm {d}y \end{aligned}$$
(1.3)

where \(w\in L^\infty (\Omega \times \Omega )\) is a normalized weight, i.e., \(\int _\Omega \int _\Omega w(x,y)\, \mathrm {d}x\, \mathrm {d}y=1\). Note that if \(\mathbf {p}\) solves (\(\mathrm {D}(\alpha )\)), then we have \({\text {div}} \mathbf {p}=Bu-K^*f\), where u is the solution to (\(\mathrm {P}\)) (see [31, Theorem 3.4]). This implies that

$$\begin{aligned} R({\text {div}} \mathbf {p})(x)=\int _\Omega w(x,y)\, (K u-f)^2(y)\mathrm {d}y, \end{aligned}$$

where the right-hand side represents a convolved version of the image residual \(Ku-f\).
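To make the roles of R and F concrete, the following minimal Python sketch (our illustration, not code from [31]) evaluates the localized variance estimator and the penalty F in the denoising case \(K=I\), \(B=I\), with a normalized \(n_\text {(w)}\)-by-\(n_\text {(w)}\) averaging window playing the role of w; all parameter values are placeholders.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_variance(u, f, n_w=7):
    """R evaluated at div p = u - f (K = I): a windowed mean of |u - f|^2."""
    return uniform_filter((u - f)**2, size=n_w, mode="nearest")

def F_penalty(v, sig_lo2, sig_hi2):
    """F(v) from above, with the integrals replaced by pixel sums."""
    return 0.5 * np.sum(np.maximum(v - sig_hi2, 0.0)**2
                        + np.minimum(v - sig_lo2, 0.0)**2)

# On pure noise the estimator hovers around sigma^2; F quantifies the
# excursions of the local variances outside [sig_lo^2, sig_hi^2].
rng = np.random.default_rng(0)
sigma = 0.1
f = 0.5 + sigma * rng.standard_normal((64, 64))   # noisy constant image
u = np.full_like(f, 0.5)                          # pretend u = u_true
R = local_variance(u, f)
print(R.mean(), F_penalty(R, 0.8 * sigma**2, 1.2 * sigma**2))
```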

We now provide the motivation and reasoning behind the definition of \((\mathbf {p},\alpha )\mapsto J(\mathbf {p},\alpha )\). We start with the functional \(\mathbf {p}\mapsto F\circ R({\text {div}}\,\mathbf {p})\): since F penalizes values above \(\overline{\sigma }^2\) and below \(\underline{\sigma }^2\), it drives the residuals \(R({\text {div}}\,\mathbf {p})\) to satisfy \(\underline{\sigma }^2\le R({\text {div}}\,\mathbf {p})\le \overline{\sigma }^2\). The map \(R({\text {div}}\,\mathbf {p})(x)=\int _\Omega w(x,y) (Ku-f)^2(y)\mathrm {d}y\), \(x\in \Omega \), may be considered a local variance estimator (see [23]); recall that \(f=Ku_{\text {true}}+\eta \) with \(\int _{\Omega }|\eta |^2\mathrm {d}x=\sigma ^2|\Omega |\). Consequently, if we had \(u(\alpha ^*)=u_{\text {true}}\) for some \(\alpha ^*\), then we would expect \(R({\text {div}}\,\mathbf {p})\simeq \sigma ^2\). Thus, by choosing \(\underline{\sigma }<\sigma <\overline{\sigma }\) one gets \(F\circ R({\text {div}}\,\mathbf {p}^*)\simeq 0\), where \({\text {div}} \mathbf {p}^*=Bu(\alpha ^*)-K^*f\). Second, the \(H^1\)-regularity of \(\alpha \) induced by the term \(\frac{\lambda }{2}|\alpha |^2_{H^{1}(\Omega )}\) in the objective guarantees that (\(\mathrm {P}\)) and (\(\mathrm {D}(\alpha )\)) are dual to each other; this comes as a consequence of Theorem 3.1 below.

The map \(P_\delta :H_0^1(\Omega )^\ell \times L^2(\Omega )\rightarrow L^2(\Omega )^\ell \) is defined as

$$\begin{aligned} P_\delta (\mathbf {p},\alpha ):= (\mathbf {p}-\alpha \mathbf {1})_\delta ^+- (\mathbf {p}+\alpha \mathbf {1})_\delta ^-, \end{aligned}$$
(1.4)

where, for \(\delta >0\), \(\mathbb {R}\ni r\mapsto (r)_\delta ^+\in \mathbb {R}\) is given by

$$\begin{aligned} (r)_\delta ^+=\left\{ \begin{array}{ll} r-\delta /2 , &{} r\ge \delta ; \\ r^2/2\delta , &{} r\in (0,\delta ) ; \\ 0, &{} r\le 0 . \end{array} \right. \end{aligned}$$
(1.5)

The function \(r\mapsto (r)_\delta ^+\) is a differentiable approximation of the positive part \(r\mapsto (r)^+:=\max (r,0)\); analogously, \((r)_\delta ^-:=(-r)_\delta ^+\) approximates the negative part \((r)^-:=(-r)^+\). Additionally, for \(\delta =0\) we set \((r)_\delta ^+:=(r)^+\) and \((r)_\delta ^-:=(r)^-\). For \(\mathbf {r}\in \mathbb {R}^\ell \), \((\mathbf {r})_\delta ^+\) is defined componentwise, i.e., \((\mathbf {r})_\delta ^+=((r_1)_\delta ^+, (r_2)_\delta ^+,\ldots , (r_\ell )_\delta ^+)\), and \((\mathbf {r})_\delta ^-\) analogously. The functional \({\mathcal {P}}_\delta (\cdot ,\alpha ):H_0^1(\Omega )^\ell \rightarrow \mathbb {R}^+_0\) in (\(\tilde{\mathbb {P}}\)) penalizes violations of \(\mathbf {p}\in \mathbf {K}(\alpha )\) and is defined as

$$\begin{aligned} {\mathcal {P}}_\delta (\mathbf {p},\alpha ):=\int _\Omega \sum _{i=1}^\ell \big (G_\delta (-(p_i+\alpha ))+G_\delta (p_i-\alpha ) \big )\mathrm {d}x, \end{aligned}$$
(1.6)

with \(\mathbf {p}=(p_1, p_2,\ldots ,p_\ell )\) and \(G_\delta :\mathbb {R}\rightarrow \mathbb {R}\),

$$\begin{aligned} G_\delta (r)=\left\{ \begin{array}{ll} \frac{1}{2}r^2-\frac{\delta }{2}r+\frac{\delta ^2}{6} , &{} r\ge \delta ; \\ r^3/6\delta , &{} r\in (0,\delta ) ; \\ 0, &{} r\le 0 , \\ \end{array} \right. \end{aligned}$$
(1.7)

for \(\delta >0\). The function \(G_\delta \) is a primitive of \((\cdot )^+_{\delta }\) defined in (1.5), specifically \(G_{\delta }(r)=\int _{-\infty }^r (s)^+_{\delta } \mathrm {d}s\); hence, \(G_\delta \) is twice continuously differentiable. For \(\delta =0\), we use \(r\mapsto G_0(r):=r^2/2\) for \(r\ge 0\) and \(G_0(r):=0\) otherwise. Note that the derivative of the map \(\mathbf {p}\mapsto {\mathcal {P}}_\delta (\mathbf {p},\alpha )\) is given by \(P_\delta (\mathbf {p},\alpha )\) (see [31] for details).
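The following short Python sketch (our illustration) implements the smoothed positive part (1.5), its primitive \(G_\delta \) from (1.7) and the penalty derivative \(P_\delta \) from (1.4), and verifies \(G_\delta '=(\cdot )_\delta ^+\) by a finite-difference check.

```python
import numpy as np

def plus_delta(r, delta):
    """(r)_delta^+ from (1.5): a C^1 smoothing of max(r, 0)."""
    return np.where(r >= delta, r - delta / 2,
                    np.where(r > 0, r**2 / (2 * delta), 0.0))

def G_delta(r, delta):
    """Primitive of (.)_delta^+ as in (1.7); twice continuously differentiable."""
    return np.where(r >= delta, 0.5 * r**2 - 0.5 * delta * r + delta**2 / 6,
                    np.where(r > 0, r**3 / (6 * delta), 0.0))

def P_delta(p, alpha, delta):
    """P_delta(p, alpha) = (p - alpha)_delta^+ - (p + alpha)_delta^-, cf. (1.4)."""
    return plus_delta(p - alpha, delta) - plus_delta(-(p + alpha), delta)

delta, h = 1e-1, 1e-6
r = np.linspace(-1.0, 1.0, 11)
fd = (G_delta(r + h, delta) - G_delta(r - h, delta)) / (2 * h)
assert np.allclose(fd, plus_delta(r, delta), atol=1e-6)
```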

Utilizing [52], an optimal solution \((\mathbf {p}^*,\alpha ^*)\in H_0^1(\Omega )^\ell \times {\mathcal {A}}_{\text {ad}}\) of (\(\tilde{\mathbb {P}}\)) can be characterized by an adjoint state (a Lagrange multiplier) \(\mathbf {q}^*\in H_0^1(\Omega )^\ell \) such that

$$\begin{aligned} (J'_0({\text {div}}\,\mathbf {p}^*), {\text {div}}\,\mathbf {p})+\left\langle -\beta \varvec{\Delta } \mathbf {q}^*+\gamma \mathbf {q}^*+A\mathbf {q}^*+\frac{1}{\epsilon }D_1 P_\delta (\mathbf {p}^*,\alpha ^*)\mathbf {q}^*,\mathbf {p}\right\rangle =0, \end{aligned}$$
(1.8a)
$$\begin{aligned} \left\langle \lambda (-\Delta +I)\alpha ^*+\frac{1}{\epsilon }\left( D_2 P_\delta (\mathbf {p}^*,\alpha ^*)\right) ^\top \mathbf {q}^*,\alpha -\alpha ^*\right\rangle \ge 0, \end{aligned}$$
(1.8b)

for all \(\mathbf {p}\in H_0^1(\Omega )^\ell \) and all \(\alpha \in {\mathcal {A}}_{\mathrm {ad}}\), where \(J_0:=F\circ R\) and further

$$\begin{aligned} -\beta \varvec{\Delta }\mathbf {p}^*+\gamma \mathbf {p}^*+ A\mathbf {p}^*+\mathbf {f}+ \frac{1}{\epsilon }P_\delta (\mathbf {p}^*,\alpha ^*)&=0, \quad \text { in } H^{-1}(\Omega )^\ell , \end{aligned}$$
(1.8c)

where \(A:H_0({\text {div}})\rightarrow H_0({\text {div}})^*\) is defined as \(A \mathbf {p}:=-\nabla B^{-1}{\text {div}}\,\mathbf {p}\), with \(\mathbf {p}\in H_0({\text {div}})\), and \( \mathbf {f}=-\nabla B^{-1}K^*f\in H_0({\text {div}})^*\); see [31, Thm. 6.3]. Further, \(D_1 P_\delta (\mathbf {p},\alpha )\) and \(D_2 P_\delta (\mathbf {p},\alpha )\) denote the Fréchet derivatives of \(\mathbf {p}\mapsto P_\delta (\mathbf {p},\alpha )\) and \(\alpha \mapsto P_\delta (\mathbf {p},\alpha )\), respectively. These derivatives are given by

$$\begin{aligned} D_1 P_\delta (\mathbf {p},\alpha )\mathbf {r}_1&:=\big ( \varvec{G}''_\delta (\mathbf {p}-\alpha \mathbf {1})+\varvec{G}''_\delta (-\mathbf {p}-\alpha \mathbf {1})\big )\mathbf {r}_1, \end{aligned}$$
(1.9a)
$$\begin{aligned} D_2 P_\delta (\mathbf {p},\alpha ) r_2&:= \big ( \varvec{G}''_\delta (-\mathbf {p}-\alpha \mathbf {1}) -\varvec{G}''_\delta (\mathbf {p}-\alpha \mathbf {1})\big ) \mathbf {1} r_2, \end{aligned}$$
(1.9b)

with \(\varvec{G}'_\delta :L^{2+\xi }(\Omega )^\ell \rightarrow L^{2}(\Omega )^\ell \) (with \(\xi >0\)) given by \(\varvec{G}'_\delta (\mathbf {p})=(G'_\delta (p_1),\ldots ,G'_\delta (p_\ell ))\), where \(G'_\delta :L^{2+\xi }(\Omega )\rightarrow L^{2}(\Omega )\) is the Nemytskii (superposition) operator induced by the real-valued function \(r\mapsto (r)_\delta ^+\). In order to facilitate navigation through the text, we provide a glossary in Table 1.

Table 1 Glossary of functions and variables

Besides characterizing stationarity, another benefit of (1.8) is related to the reduced bilevel problem. In fact, the solution map \(\alpha \mapsto \mathbf {p}(\alpha )\) of the regularized lower-level problem allows one to reduce (\(\tilde{\mathbb {P}}\)) to

$$\begin{aligned} \text {minimize}\quad \hat{J}(\alpha ):=J(\mathbf {p}(\alpha ),\alpha )\quad \text {over } \alpha \in {\mathcal {A}}_\text {ad}. \end{aligned}$$
(\(\tilde{\mathbb {P}}_{\text {red}}\))

Then, the adjoint state \(\mathbf {q}\) allows one to compute the derivative \(\hat{J}'\) of the reduced objective at some \(\alpha \) in an amenable way. In fact, one has

$$\begin{aligned} \hat{J}'(\alpha )=\lambda (-\Delta +I)\alpha +\frac{1}{\epsilon }\left( D_2 P_\delta (\mathbf {p}(\alpha ),\alpha )\right) ^\top \mathbf {q}(\alpha ), \end{aligned}$$
(1.10)

where \(\alpha \mapsto \mathbf {q}(\alpha )\) solves (1.8a) for \(\mathbf {p}^*=\mathbf {p}(\alpha )\) and \(\alpha ^*=\alpha \). The expression for the derivative \(\hat{J}'(\alpha )\) follows from the optimality system (1.8) and the adjoint-state formalism (see, for example, [36]).

The starting point for the development in this paper is the reduced problem (\(\tilde{\mathbb {P}}_{\text {red}}\)). It is the basis for developing a projected gradient method for solving the problem algorithmically.

In order to study regularity properties of \(H^1\)-projections onto \({\mathcal {A}}_\text {ad}\), in the following Sect. 2 we provide higher-order regularity results for solutions of elliptic variational inequality problems. The projected gradient method is defined in Sect. 3, and global convergence results are established. Section 4 is devoted to the discrete version of our algorithm and the proper choice of the variance bounds \(\underline{\sigma }\) and \(\overline{\sigma }\). Moreover, it contains a report on numerical tests for image denoising and deblurring as well as Fourier and wavelet inpainting.

Before we commence with our analysis, we close this section by mentioning that total variation models of a generalized type can be found in [38] and [3]. Moreover, spatially adapted regularization or data weighting has been studied in [2, 6, 21, 22, 24, 33, 37]. For a brief discussion of these references, we refer to part I of this work; see [31]. The bilevel formulation approach for inverse problems seems to have been pioneered by Haber, Tenorio, and Ghattas (see [11, 28]). In the context of image reconstruction, the bilevel approach has also been studied by De Los Reyes, Schönlieb, Valkonen and collaborators (see [12, 18, 19] and references therein). In addition, splitting methods in image/signal processing involving statistical estimators for parameter selection and deconvolution have been successfully treated in [17, 20, 45] and the references therein, for instance. It should be noted that the present work does not deal with bilevel “learning” (as in many of the aforementioned references), but tackles image reconstruction via a bilevel optimization approach where the upper-level problem enforces local variances within a certain range and the reconstruction itself is obtained in the lower-level one.

2 An Obstacle Problem and Projection Results

Returning to (\(\tilde{\mathbb {P}}_{\text {red}}\)) we note that its associated first-order necessary conditions are given by the variational inequality

$$\begin{aligned}&\text { Find } \alpha ^*\in {\mathcal {A}}_{\text {ad}}:\langle \hat{J}'(\alpha ^*),\alpha -\alpha ^*\rangle \ge 0,\quad \forall \alpha \in {\mathcal {A}}_{\text {ad}}. \end{aligned}$$
(2.1)

Given the structure of the derivative \(\hat{J}'(\alpha ^*)\) in (1.10), (2.1) becomes a so-called double obstacle problem. Hence, the characterization of solutions to (\(\tilde{\mathbb {P}}_{\text {red}}\)) hinges on the study of (2.1). In addition, using a gradient descent method for solving problem (\(\tilde{\mathbb {P}}_{\text {red}}\)) yields the sequence \(\{\alpha _n\}\) of iterates defined as

$$\begin{aligned} \alpha _{n+1}=P_{{\mathcal {A}}_{\text {ad}}}(\alpha _{n}-\tau _n \nabla \hat{J}(\alpha _n)),\text { for }n=0,\ldots , \end{aligned}$$
(2.2)

with \(\alpha _0\) given. Here, \(P_{{\mathcal {A}}_{\text {ad}}}:H^1(\Omega )\rightarrow {\mathcal {A}}_{\text {ad}}\subset H^1(\Omega )\) is the minimum distance projector onto \({\mathcal {A}}_{\text {ad}}\) and \(\nabla \hat{J}(\alpha _n)\in H^1(\Omega )\) denotes the gradient of \(\hat{J}\) at \(\alpha _n\). From (2.2), it follows that \(\alpha _{n+1}\) solves: Find \(\alpha ^*\in \mathcal {A}_{\text {ad}}\) such that

$$\begin{aligned} \langle (-\Delta +I)\alpha ^*+M(\alpha _n,\tau _n),\alpha -\alpha ^*\rangle \ge 0, \quad \forall \alpha \in \mathcal {A}_{\text {ad}}; \end{aligned}$$

for some \(M(\alpha _n,\tau _n)\), yet another double obstacle problem. This motivates the following study of this type of problem.
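For later reference, \(M(\alpha _n,\tau _n)\) can be made explicit: since \(\alpha _{n+1}\) minimizes \(\frac{1}{2}|\alpha -(\alpha _{n}-\tau _n \nabla \hat{J}(\alpha _n))|^2_{H^1(\Omega )}\) over \({\mathcal {A}}_{\text {ad}}\), writing its optimality condition with the \(H^1\)-Riesz map \(-\Delta +I\) yields

$$\begin{aligned} M(\alpha _n,\tau _n)=(-\Delta +I)\bigl (\tau _n\nabla \hat{J}(\alpha _n)-\alpha _n\bigr )=\tau _n\hat{J}'(\alpha _n)-(-\Delta +I)\alpha _n, \end{aligned}$$

where the second equality uses \(\nabla \hat{J}(\alpha )=(-\Delta +I)^{-1}\hat{J}'(\alpha )\) (cf. Sect. 3).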

The subsequent result establishes the \(H^2(\Omega )\cap C^{0, r}(\overline{\Omega })\) regularity of the solution to the bilateral obstacle problem with Neumann boundary conditions. The \(H^2(\Omega )\)-regularity for a single obstacle and with a \(C^{\infty }\)-boundary was established by Brézis in [10]. Similar and related partial results can also be found in the classical texts by Rodrigues [46] and Kinderlehrer and Stampacchia [39]. For dimensions \(\ell =1, 2, 3\) (note \(\Omega \subset \mathbb {R}^\ell \)), the \(C^{0, r}(\overline{\Omega })\)-regularity is implied by Sobolev embedding results for \(H^2(\Omega )\) (see, for example, [1]). For dimensions \(\ell \ge 2\), we show that the \(C^{0, r}(\overline{\Omega })\)-regularity can also be obtained from estimates due to Serrin; see [48].

While this result may be considered of stand-alone importance in the regularity theory for solutions of elliptic variational inequalities, in our generalized total variation context it is of particular relevance to guarantee continuity of iterates \(\alpha _n\) of the regularization weight generated by some projection-based descent method.

Theorem 2.1

Let \(\Omega \subset \mathbb {R}^\ell \), with \(\ell =1,2,3\), be a bounded convex subset, and let \(\mathcal {A}=\{\alpha \in H^1(\Omega ): \underline{\alpha }\le \alpha \le \overline{\alpha } \,\,\text { a.e.~on } \Omega \}\) with \(\underline{\alpha }, \overline{\alpha }\in H^2(\Omega )\) such that

$$\begin{aligned} \underline{\alpha }\le \overline{\alpha }, \text { a.e. on } \Omega \quad \text { and }\quad \frac{\partial \underline{\alpha }}{ \partial \varvec{\nu }} =\frac{\partial \overline{\alpha }}{\partial \varvec{\nu }} =0 \text { in } H^{1/2}(\partial \Omega ). \end{aligned}$$

Then, for \(f\in L^2(\Omega )\), there exists a unique \(u^*\in H^2(\Omega ) \cap C^{0, r}(\overline{\Omega }) \cap \mathcal {A}\) for some \(r\in (0,1)\) that solves: Find \(u\in \mathcal {A}\) such that

$$\begin{aligned}&\int _{\Omega } \nabla u \cdot \nabla (v-u) + (u-f)(v-u) \mathrm {d}x \ge 0, \quad \forall v\in \mathcal {A}. \end{aligned}$$
(2.3)

In addition, \(u^*\) is the unique solution of: Find \(u\in \mathcal {A}\) with \(\frac{\partial u}{\partial \varvec{\nu }} =0\) on \(\partial \Omega \) such that

$$\begin{aligned} \langle Lu-f,v-u\rangle \ge 0, \quad \forall v\in \mathcal {A}, \end{aligned}$$
(2.4)

where \(L=-\Delta +I\). Furthermore, for some constant \(C>0\) the following estimates hold:

$$\begin{aligned}&\max (|u^*|_{C^{0,r}(\overline{\Omega })}, |u^*|_{H^2(\Omega )} )\nonumber \\&\quad \le C(|f|_{L^2(\Omega )}+|L\underline{\alpha }|_{L^2(\Omega )}+|L\overline{\alpha }|_{L^2(\Omega )}). \end{aligned}$$
(2.5)

Proof

For \(\rho >0\) consider the approximating problem: Find \(u\in H^1(\Omega )\) such that

$$\begin{aligned} a(u,w)+(F_\rho (u)-f,w)=0, \quad \forall w\in H^1(\Omega ), \end{aligned}$$
(2.6)

where, for any \(v,w \in H^1(\Omega )\), a and \(F_\rho \) are defined as

$$\begin{aligned} a(v,w):= & {} \int _{\Omega } \nabla v \cdot \nabla w+vw \,\mathrm {d}x, \\ (F_\rho (v),w):= & {} \int _{\Omega }\frac{1}{\rho }(v-\overline{\alpha })^+ w-\frac{1}{\rho }(v-\underline{\alpha })^-w \,\mathrm {d}x. \end{aligned}$$

Note that (2.6) is the first-order optimality condition for the problem:

$$\begin{aligned}&\text {minimize}\quad J(u):=\frac{1}{2} |u|^2_{H^1(\Omega )}+\frac{1}{2\rho } G(u)-(f,u)\\&\quad \quad \text {over }{u\in H^1(\Omega )}, \end{aligned}$$

with \(G(u):=|(u-\overline{\alpha })^+|_{L^2(\Omega )}^2+|(\underline{\alpha }-u)^+|_{L^2(\Omega )}^2\). The existence and uniqueness of a solution are guaranteed since \(J:H^1(\Omega )\rightarrow \mathbb {R}\) is bounded below, coercive, strictly convex and weakly lower semicontinuous (being convex and continuous).

Note that (2.6) is the variational form of a semilinear Neumann problem, i.e., the solution \(u^*_\rho \) to (2.6) satisfies

$$\begin{aligned} Lu^*_\rho +F_\rho (u^*_\rho )-f&=0 \text { in } \Omega ,&\text { and }&\frac{\partial u^*_\rho }{\partial \varvec{\nu }}&=0 \text { on } \partial \Omega ;&\end{aligned}$$

see [49, 50] or [4]. Let \(f_\rho :=f-F_\rho (u^*_\rho )\). Then \(f_\rho \in L^2(\Omega )\) and \(L u_\rho ^*=f_\rho \) in \(\Omega \) with \(\partial u^*_\rho /\partial \varvec{\nu } =0\) on \(\partial \Omega \). From Theorem 3.2.1.3 and its proof in [25], it follows that \(u_\rho ^*\in H^2(\Omega )\) and \(|u_\rho ^*|_{H^2(\Omega )}\le \tilde{C}_1 |f_\rho |_{L^2(\Omega )}\) for some \(\tilde{C}_1>0\) depending only on \(\ell \). Also, for \(\ell \ge 2\) we have \(u_\rho ^*\in C^{0,r}(\overline{\Omega })\) (see [43, 48] or Theorem 3.1.5 in [42]) for some \(r\in (0,1)\) depending only on \(\ell \) such that \(|u_\rho ^*|_{C^{0,r}(\overline{\Omega })}\le \tilde{C}_2(|u_\rho ^*|_{L^2(\Omega )} +|f_\rho |_{L^2(\Omega )})\) with \(\tilde{C}_2\) independent of \(f_\rho \). Therefore, we have

$$\begin{aligned} |u_\rho ^*|_{H^2(\Omega )}\le & {} \tilde{C}_1 \left( |f|_{L^2(\Omega )}+\left| \frac{1}{\rho }(u^*_\rho -\overline{\alpha })^+ \right| _{L^2(\Omega )}\right. \nonumber \\&\left. +\left| \frac{1}{\rho }(u^*_\rho -\underline{\alpha })^-\right| _{L^2(\Omega )}\right) , \end{aligned}$$
(2.7)

and

$$\begin{aligned} |u_\rho ^*|_{C^{0,r}(\overline{\Omega })}&\le \tilde{C}_2 \left( |u_\rho ^*|_{L^2(\Omega )}+|f|_{L^2(\Omega )}+\left| \frac{1}{\rho }(u^*_\rho -\overline{\alpha })^+ \right| _{L^2(\Omega )}\right. \nonumber \\&\quad \left. +\left| \frac{1}{\rho }(u^*_\rho -\underline{\alpha })^-\right| _{L^2(\Omega )}\right) \nonumber \\&\le 2\max (\tilde{C}_2,\tilde{C}_1) \left( |f|_{L^2(\Omega )}+\left| \frac{1}{\rho }(u^*_\rho -\overline{\alpha })^+ \right| _{L^2(\Omega )}\right. \nonumber \\&\quad \left. +\left| \frac{1}{\rho }(u^*_\rho -\underline{\alpha })^-\right| _{L^2(\Omega )}\right) . \end{aligned}$$
(2.8)

Note that by Green’s theorem, \(a(v,w)=(Lv,w)_{H^1(\Omega )^*,H^1(\Omega )}+\int _{\partial \Omega } (\frac{\partial v}{\partial \varvec{\nu }}) w \,\mathrm {d}S\) for \(v\in H^2(\Omega )\), and recall that \(L\overline{\alpha }\in L^2(\Omega )\), \(\partial \overline{\alpha }/\partial \varvec{\nu } =0\) and \(\underline{\alpha }\le \overline{\alpha }\). Then, taking \(w=\frac{1}{\rho }(u^*_\rho -\overline{\alpha })^+ \in H^1(\Omega )\) in (2.6) and adding and subtracting \((L\overline{\alpha },w)\), we observe that

$$\begin{aligned}&\frac{1}{\rho }a(u^*_\rho -\overline{\alpha },(u^*_\rho -\overline{\alpha })^+)+\left| \frac{1}{\rho }(u^*_\rho -\overline{\alpha })^+ \right| ^2_{L^2(\Omega )}\nonumber \\&\quad =(f-L\overline{\alpha },\frac{1}{\rho }(u^*_\rho -\overline{\alpha })^+), \end{aligned}$$
(2.9)

where we have used that \((F_\rho (u^*_\rho ),w)=|w|^2_{L^2(\Omega )}\). Furthermore,

$$\begin{aligned}&a(u^*_\rho -\overline{\alpha },(u^*_\rho -\overline{\alpha })^+)\\&\quad =\left| (u^*_\rho -\overline{\alpha })^+ \right| ^2_{L^2(\Omega )}+\left| \nabla (u^*_\rho -\overline{\alpha })^+ \right| ^2_{L^2(\Omega )^\ell }. \end{aligned}$$

Here we exploit that if \(v\in H^1(\Omega )\), then \(v^+\in H^1(\Omega )\) with \(\nabla v^+=\nabla v\) a.e. on \(\{v>0\}\) and \(\nabla v^+=0\) a.e. elsewhere. From this, we infer

$$\begin{aligned} \left| \frac{1}{\rho }(u^*_\rho -\overline{\alpha })^+ \right| _{L^2(\Omega )}\le |f-L\overline{\alpha }|_{L^2(\Omega )}. \end{aligned}$$

Analogously, for \(w=-\frac{1}{\rho }(u^*_\rho -\underline{\alpha })^- \) in (2.6), we obtain

$$\begin{aligned} \left| \frac{1}{\rho }(u^*_\rho -\underline{\alpha })^- \right| _{L^2(\Omega )}\le |f-L\underline{\alpha }|_{L^2(\Omega )}. \end{aligned}$$

Hence, it follows that (2.5) holds for \(u^*_\rho \) with \(C=6\max (\tilde{C}_1,\tilde{C}_2)\).

The boundedness of \(\{u^*_\rho \}_{\rho >0}\) in \(H^2(\Omega )\) implies that \(Lu_\rho ^*\rightharpoonup L\tilde{u}\), \(u_\rho ^*\rightarrow \tilde{u}\) in \(L^2(\Omega )\) and \(u_\rho ^*\rightharpoonup \tilde{u}\) in \(H^2(\Omega )\), along a subsequence that we again denote by \(\{u_\rho ^*\}\). The above two inequalities yield \(|(u^*_\rho -\overline{\alpha })^+|_{L^2(\Omega )}\le \rho |f-L\overline{\alpha }|_{L^2(\Omega )}\rightarrow 0\) and \(|(u^*_\rho -\underline{\alpha })^-|_{L^2(\Omega )}\le \rho |f-L\underline{\alpha }|_{L^2(\Omega )}\rightarrow 0\) as \(\rho \downarrow 0\), so that \(\tilde{u}\in {\mathcal {A}}\). Furthermore, since \(u\mapsto \frac{1}{\rho }(u-\overline{\alpha })^+ -\frac{1}{\rho }(u-\underline{\alpha })^-\) is a monotone mapping, using \(w=v-u_\rho ^*\) with an arbitrary \(v\in {\mathcal {A}}\) in (2.6) (note that \((v-\overline{\alpha })^+ +(v-\underline{\alpha })^-=0\)) we observe

$$\begin{aligned} a(u_\rho ^*,v-u_\rho ^*)\ge (f,v-u_\rho ^*). \end{aligned}$$

Since \(a(v-u_\rho ^*,v-u_\rho ^*)\ge 0\), it follows from the above inequality that \(a(v,v-u_\rho ^*)\ge (f,v-u_\rho ^*).\) Taking the limit as \(\rho \downarrow 0\), we get

$$\begin{aligned} a(v,v-\tilde{u})\ge (f,v-\tilde{u}), \qquad \forall v\in {\mathcal {A}}. \end{aligned}$$

Finally, since \(\tilde{u}\in {\mathcal {A}}\), Minty’s lemma [16, 46] implies that \(\tilde{u}\) solves (2.3) and uniqueness follows from standard results.

Additionally, the trace map \(H^2(\Omega )\ni u\mapsto \partial u/\partial \varvec{\nu } \in H^{1/2}(\partial \Omega )\) is a continuous linear map, and hence, it is weakly continuous. Moreover, since the norm is weakly lower semicontinuous, \( \left| \partial \tilde{u}/\partial \varvec{\nu }\right| _{ H^{1/2}(\partial \Omega )} \le \liminf _{\rho \rightarrow 0}\left| \partial u^*_\rho /\partial \varvec{\nu }\right| _{ H^{1/2}(\partial \Omega )} =0. \) From \(a(v,w)=(Lv,w)_{H^1(\Omega )^*,H^1(\Omega )}+\int _{\partial \Omega } (\frac{\partial v}{\partial \varvec{\nu }}) ( w) \mathrm {d}S\) for all \(v,w\in H^1(\Omega )\), it follows that \(\tilde{u}\) solves (2.4), as well. \(\square \)

Remark 2.2

The boundary conditions \( \partial \underline{\alpha }/ \partial \varvec{\nu }=0\) and \(\partial \overline{\alpha }/\partial \varvec{\nu } =0\) may be relaxed to \(\partial \overline{\alpha }/\partial \varvec{\nu } \ge 0\) and \(\partial \underline{\alpha }/ \partial \varvec{\nu }\le 0\), respectively.

An important application of the previous result concerns the preservation of regularity under the minimal distance projection in \(H^1(\Omega )\) onto \({\mathcal {A}}=\{\alpha \in H^1(\Omega ): \underline{\alpha }\le \alpha \le \overline{\alpha } \,\,\text { a.e. on } \Omega \}\).

Corollary 2.3

Let \(\Omega \) and \({\mathcal {A}}\) be as in Theorem 2.1. Let \(P_{{\mathcal {A}}}:H^1(\Omega )\rightarrow {\mathcal {A}}\subset H^1(\Omega )\) denote the minimal distance projection operator, i.e., for \(\omega \in H^1(\Omega )\),

$$\begin{aligned} P_{{\mathcal {A}}}(\omega ):={{\mathrm{\arg \min }}}_{\alpha \in {\mathcal {A}}}\frac{1}{2}|\alpha -\omega |^2_{H^1(\Omega )}. \end{aligned}$$
(2.10)

Let \(\omega ^*=P_{{\mathcal {A}}}(\omega )\). Then it holds that

$$\begin{aligned} \omega \in H^2(\Omega ) \text { and } \frac{\partial \omega }{\partial \varvec{\nu }}=0 \Longrightarrow \omega ^*\in H^2(\Omega ) \text { and } \frac{\partial \omega ^*}{\partial \varvec{\nu }}=0, \end{aligned}$$

and furthermore,

$$\begin{aligned}&\max (|\omega ^*|_{H^2(\Omega )},|\omega ^*|_{C^{0,r}(\overline{\Omega })})\\&\quad \le C(|L\omega |_{L^2(\Omega )}+|L\underline{\alpha }|_{L^2(\Omega )}+|L\overline{\alpha }|_{L^2(\Omega )}), \end{aligned}$$

for some \(r\in (0,1)\) and with \(L=-\Delta +I\).

Proof

The first-order optimality condition for (2.10) is equivalent to

$$\begin{aligned} \int _{\Omega } \nabla ( \omega ^*- \omega ) \cdot \nabla (v- \omega ^*) + ( \omega ^*- \omega )(v- \omega ^*) \mathrm {d}x \ge 0, \quad \forall v{\in } {\mathcal {A}}. \end{aligned}$$

Since \(\omega \in H^2(\Omega )\) and \(\partial \omega /\partial \varvec{\nu }=0\), by Green’s Theorem, the previous variational inequality is equivalent to

$$\begin{aligned} \int _{\Omega } \nabla \omega ^*\cdot \nabla (v- \omega ^*) + ( \omega ^*- f_\omega )(v- \omega ^*) \mathrm {d}x \ge 0, \quad \forall v\in {\mathcal {A}}, \end{aligned}$$

with \(f_\omega :=(-\Delta +I)\omega \in L^2(\Omega )\). The proof then follows from a direct application of Theorem 2.1. \(\square \)
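In computations this projection must itself be realized numerically; in Sect. 4 this is done by the semismooth Newton method of [30] (see Algorithm 4). The following 1-D Python sketch, our own minimal illustration rather than the paper's implementation, realizes the discrete counterpart of (2.10) by a primal-dual active set iteration; the matrix A discretizes \(L=-\Delta +I\) with homogeneous Neumann boundary conditions, and \(c>0\) is the usual active-set weight.

```python
import numpy as np

def h1_project(omega, lo, hi, h=1.0, c=1.0, max_it=50):
    """H^1-projection of omega onto {lo <= alpha <= hi} (double obstacle)."""
    n = omega.size
    main = np.full(n, 1.0 + 2.0 / h**2)
    main[[0, -1]] = 1.0 + 1.0 / h**2            # Neumann ends of -Delta
    A = (np.diag(main)
         + np.diag(np.full(n - 1, -1.0 / h**2), 1)
         + np.diag(np.full(n - 1, -1.0 / h**2), -1))
    alpha, mu, prev = np.clip(omega, lo, hi), np.zeros(n), None
    for _ in range(max_it):
        act_hi = mu + c * (alpha - hi) > 0      # predicted upper active set
        act_lo = mu + c * (alpha - lo) < 0      # predicted lower active set
        key = (act_hi.tobytes(), act_lo.tobytes())
        if key == prev:                         # active sets settled: KKT point
            break
        prev = key
        act, inact = act_hi | act_lo, ~(act_hi | act_lo)
        alpha = np.where(act_hi, hi, np.where(act_lo, lo, 0.0))
        # On the inactive set solve [A (alpha - omega)]_I = 0, alpha fixed elsewhere.
        rhs = (A @ omega)[inact] - A[np.ix_(inact, act)] @ alpha[act]
        alpha[inact] = np.linalg.solve(A[np.ix_(inact, inact)], rhs)
        mu = A @ (omega - alpha)                # multiplier update
    return alpha

omega = np.sin(np.linspace(0.0, 2.0 * np.pi, 64)) + 0.5
print(h1_project(omega, lo=0.0, hi=1.0).round(2))
```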

3 Descent Algorithm and Its Convergence

In this section, we study a basic projected gradient method for solving the regularized bilevel optimization problem (\(\tilde{\mathbb {P}}\)). We are in particular interested in its global convergence properties in the underlying function space setting, as this suggests an image resolution (or, from a discretization point of view, mesh) independent convergence behavior when solving discrete, finite dimensional instances of the problem. As a consequence of such a property, the number of iterations needed to compute an \(\epsilon \)-approximation of a solution (or stationary point) can be expected to behave stably on all sufficiently fine meshes, respectively image resolutions.

One of the main focus points of our analysis is to guarantee that the iterates \(\alpha _n\) remain in \(C(\bar{\Omega })\) for all \(n\in \mathbb {N}\). This property keeps the primal/dual relation between (\(\mathrm {P}\)) and (\(\mathrm {D}(\alpha )\)) intact. We recall here that for the study of (\(\mathrm {D}(\alpha )\)) alone, \(\alpha _n\in L^2(\Omega )\) suffices, but this regularity no longer allows one to link (\(\mathrm {D}(\alpha )\)) to (\(\mathrm {P}\)) through dualization; that is, given a dual solution \(\mathbf {p}\), one can no longer infer a primal solution (recovered image) u from primal-dual first-order optimality conditions. We also note that, of course, more elaborate techniques may be employed as long as the aforementioned primal/dual relation remains intact.

We employ the projected gradient method given in Algorithm 1, where the step sizes \(\{\tau _n\}\), \(\tau _n\ge 0\) for all \(n\in \mathbb {N}\), are chosen according to the Armijo rule with backtracking; compare step 1 of Algorithm 1 and see, e.g., [7, 9] for further details.

[Algorithm 1: Projected gradient method with Armijo backtracking line search (step 1: line search; step 3: update of the trial step \(\mu _{n+1}\)); iteration (3.1): \(\alpha _{n+1}=P_{{\mathcal {A}}_{\text {ad}}}(\alpha _{n}-\tau _n \nabla \hat{J}(\alpha _n))\) with \(\tau _n=\theta _-^{m_n}\mu _n\).]
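Since only the structure of Algorithm 1 matters in what follows, the Python skeleton below (our paraphrase, using the parameter names of Sect. 4.2) may help to fix ideas. Here J_hat, grad_J, project and h1_inner are placeholders for \(\hat{J}\), \(\nabla \hat{J}\), \(P_{{\mathcal {A}}_{\text {ad}}}\) and the \(H^1(\Omega )\)-inner product, and the sufficient-decrease test is one standard variant of the Armijo rule; the precise test in Algorithm 1 may differ.

```python
import numpy as np

def projected_gradient(alpha0, J_hat, grad_J, project, h1_inner,
                       tau0=1e-3, th_minus=0.25, th_plus=2.0,
                       c=1e-8, tol_b=5e-3, max_it=200):
    alpha, mu, kappa0 = alpha0, tau0, None
    for _ in range(max_it):
        g = grad_J(alpha)
        step = project(alpha - g) - alpha           # stationarity residual
        kappa = np.sqrt(h1_inner(step, step))       # proximity measure kappa^n
        kappa0 = kappa if kappa0 is None else kappa0
        if kappa / kappa0 < tol_b:                  # cf. the stopping rule in Sect. 4.1
            break
        tau = mu
        for _ in range(60):                         # Armijo rule with backtracking
            trial = project(alpha - tau * g)
            # sufficient decrease along the projected direction
            if J_hat(trial) <= J_hat(alpha) + c * h1_inner(g, trial - alpha):
                break
            tau *= th_minus                         # tau_n = theta_-^{m_n} mu_n
        alpha, mu = trial, th_plus * tau            # enlarged next trial step
    return alpha
```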

Recall that our duality result in [31, Thm. 3.4] requires \(C(\overline{\Omega })\)-regularity of the regularization weight. Below, \(\alpha _{n+1}\) represents a suitable approximation. Since it results from an \(H^1(\Omega )\)-projection, and since \(H^1(\Omega )\) does not embed into \(C(\overline{\Omega })\) unless \(\ell =1\), the required regularity for dualization seems in jeopardy. Under mild assumptions and in view of Theorem 2.1, our next result guarantees \(\alpha _{n+1}\in C^{0, r}(\overline{\Omega })\) for some \(r\in (0,1)\), and thus the required regularity property.

Theorem 3.1

Let \(\{\alpha _n\}\) be generated by Algorithm 1. Then, \(\alpha _n\in H^2(\Omega ) \cap C^{0, r}(\overline{\Omega })\) for all \(n\in \mathbb {N}\), every limit point \(\alpha ^*\) of \(\{\alpha _n\}\) is stationary for (\(\tilde{\mathbb {P}}_{\text {red}}\)), i.e., \(\alpha ^*=P_{{\mathcal {A}}_{\text {ad}}}(\alpha ^*- \nabla \hat{J}(\alpha ^*))\), and belongs to \(H^2(\Omega ) \cap C^{0, r}(\overline{\Omega })\). Furthermore, we have

$$\begin{aligned} \lim _{n\rightarrow \infty }\alpha _{n}-P_{{\mathcal {A}}_{\text {ad}}}(\alpha _{n}-\nabla \hat{J}(\alpha _n)) =0, \quad \text {in } H^1(\Omega ). \end{aligned}$$
(3.2)

Proof

We split the proof into several steps. Step 1: Regularity of \(\alpha ^*\) and \(\alpha _n\). Let \((\mathbf {p}^*,\alpha ^*)\in H_0^1(\Omega )^\ell \times {\mathcal {A}}_{\text {ad}}\) be a solution to problem (\(\tilde{\mathbb {P}}\)). Setting \(K(\mathbf {p}^*,\alpha ^*):= \frac{1}{\epsilon }D_2P_\delta (\mathbf {p}^*,\alpha ^*)\), by [31, Prop. 6.3] (compare (1.8)) there exists an adjoint state \(\mathbf {q}^*\in H^1_0(\Omega )^\ell \) satisfying

$$\begin{aligned} \int _{\Omega } \nabla \alpha ^* \cdot \nabla (\alpha -\alpha ^*) + \left( \alpha ^*-\frac{1}{\lambda }K(\mathbf {p}^*,\alpha ^*)^\top \mathbf {q}^*\right) (\alpha -\alpha ^*) \,\mathrm {d}x \ge 0, \qquad \forall \alpha \in {\mathcal {A}}_{\text {ad}}. \end{aligned}$$

Let \(\varvec{G}'_\delta \) be the Nemytskii operator induced (componentwise) by \(r\mapsto G'_\delta (r)=(r)_\delta ^+\), where \(G_\delta \) is defined in (1.7). Since \(G'_\delta \in C^1(\mathbb {R})\) and \(G''_\delta \) is Lipschitz with \(|G_\delta '' |_{L^\infty (\mathbb {R})},|G_\delta ''' |_{L^\infty (\mathbb {R})}\le \max (1,\delta ^{-1})\), it follows that \(K(\mathbf {p}^*,\alpha ^*)^{\mathrm {T}}\mathbf {q}^*\in W^{1,1}(\Omega )\cap L^2(\Omega )\) as \((\mathbf {p}^*,\alpha ^*)\in H_0^1(\Omega )^\ell \times H^1(\Omega )\). The application of Theorem 2.1 yields \(\alpha ^*\in H^2(\Omega ) \cap C^{0, r}(\overline{\Omega })\). Note also that \(L^2(\Omega )\ni \alpha \mapsto \mathbf {p}(\alpha )\in H_0^1(\Omega )^\ell \) is Lipschitz continuous, that, by composition with Lipschitz functions, the map \(H^1(\Omega )\ni \alpha \mapsto K(\mathbf {p}(\alpha ), \alpha )\in L^4(\Omega )^\ell \) for \(\ell \le 4\) is Lipschitz continuous too, and that \(G_\delta '':\mathbb {R}\rightarrow \mathbb {R}\) is uniformly bounded and Lipschitz continuous so that \(\varvec{G}_\delta '':L^4(\Omega )^\ell \rightarrow L^4(\Omega )^\ell \) is Lipschitz continuous (see Lemma 4.1 in [51] and the remark at the end of its proof).

Suppose that \(\alpha \in H^2(\Omega )\) and \(\frac{\partial \alpha }{\partial \varvec{\nu }} =0\) on \(\partial \Omega \). Then we have

$$\begin{aligned}&\langle \hat{J}'(\alpha ), \omega \rangle \nonumber \\&\quad = \int _{\Omega } (\lambda (-\Delta \alpha + \alpha )-K(\mathbf {p}(\alpha ),\alpha )^\top \mathbf {q}(\alpha ))\omega \mathrm {d}x, \end{aligned}$$
(3.3)

for \(\omega \in H^1(\Omega )\). Hence, \(\hat{J}'(\alpha )\in L^2(\Omega )\) and \(\nabla \hat{J}(\alpha )\in H^2(\Omega )\) with \(\frac{\partial \nabla \hat{J}(\alpha )}{\partial \varvec{\nu }}=0\) on \(\partial \Omega \). The application of Corollary 2.3 yields that \(P_{{\mathcal {A}}_{\text {ad}}}(\alpha -\tau \nabla \hat{J}(\alpha ))\in H^2(\Omega )\cap C^{0, r}(\overline{\Omega })\) and that it satisfies homogeneous Neumann boundary conditions. By induction, one shows \(\alpha _n\in H^2(\Omega )\cap C^{0, r}(\overline{\Omega })\) and \(\partial \alpha _n/\partial \varvec{\nu }=0\) on \(\partial \Omega \) for all \(n\in \mathbb {N}\).

Step 2: The limit in (3.2) holds. It is known that every cluster point of \(\{\alpha _n\}\) is stationary (see [9]) and that \(\alpha _{n}-P_{{\mathcal {A}}_{\text {ad}}}(\alpha _{n}-\tau _n \nabla \hat{J}(\alpha _n))\rightarrow 0\) as \(n\rightarrow \infty \) provided that \(H^1(\Omega )\ni \alpha \mapsto \nabla \hat{J}(\alpha )\in H^1(\Omega )\) is Lipschitz continuous (see Theorem 2.4 in [36]). We first prove the Lipschitz continuity of the map \(\alpha \mapsto \mathbf {q}(\alpha )\). Let \(\mathbf {p}_1, \mathbf {q}_1\) and \(\mathbf {p}_2,\mathbf {q}_2\) (satisfying the system in (1.8)) denote the states and adjoint states associated with \(\alpha _1\) and \(\alpha _2\) in \({\mathcal {A}}_{\text {ad}}\), respectively. Given the structure of \(J_0=F\circ R\), we observe that

$$\begin{aligned}&|(J'_0({{\mathrm{\mathrm {div}}}}\mathbf {p}_2)-J'_0({{\mathrm{\mathrm {div}}}}\mathbf {p}_1),{{\mathrm{\mathrm {div}}}}(\mathbf {q}_2-\mathbf {q}_1))|\\&\quad \le C_1|{{\mathrm{\mathrm {div}}}}(\mathbf {p}_2-\mathbf {p}_1)|_{L^2(\Omega )}|{{\mathrm{\mathrm {div}}}}(\mathbf {q}_2-\mathbf {q}_1)|_{L^2(\Omega )}, \end{aligned}$$

where \(C_1=C_1(\alpha _1,\alpha _2)\) is bounded by

$$\begin{aligned} C_1\le M_1 \left( |{\text {div}}\,\mathbf {p}_2|_{L^2(\Omega )}+ \int _{\Omega }|\max (R({\text {div}}\,\mathbf {p}_1)-\overline{\sigma }^2,0)| +|\min (R({\text {div}}\,\mathbf {p}_1)-\underline{\sigma }^2,0)|\,\mathrm {d}x\right) , \end{aligned}$$

with \(M_1\ge 0\) depending on the filter kernel w and f, so that \(C_1(\alpha _1,\alpha _2)\le M_2<\infty \) uniformly in \(\alpha _1, \alpha _2\). Additionally, as stated before, the map \(H^1(\Omega )\ni \alpha \mapsto \frac{1}{\epsilon }D_2 P_\delta (\mathbf {p}(\alpha ),\alpha )=K(\mathbf {p}(\alpha ),\alpha )\in L^4(\Omega )^\ell \) is Lipschitz continuous, \(D_1 P_\delta (\mathbf {p}(\alpha ),\alpha )\) is a monotone operator (this follows since \(H_0^1(\Omega )^\ell \ni \mathbf {p}\mapsto P_\delta (\mathbf {p},\alpha )\in H^{-1}(\Omega )^\ell \) is monotone and differentiable), and by composition of maps one shows that \(H^1(\Omega )\ni \alpha \mapsto \mathbf {q}(\alpha )\in H_0^1(\Omega )^\ell \) is Lipschitz continuous. This implies in turn that the map \(H^1(\Omega )\ni \alpha \mapsto K(\mathbf {p}(\alpha ),\alpha )^{\mathrm {T}}\mathbf {q}(\alpha )\in L^2(\Omega )\) is Lipschitz continuous, as well. Since \(\nabla \hat{J}(\alpha )=(-\Delta +I)^{-1}\hat{J}'(\alpha )\), we conclude that \(H^1(\Omega )\ni \alpha \mapsto \nabla \hat{J}(\alpha )\in H^1(\Omega )\) is Lipschitz continuous. This ends the proof. \(\square \)

The above convergence result can be strengthened. In fact, the following theorem shows that under suitable assumptions one has \(\alpha _n\rightarrow \alpha ^*\) in \(H^1(\Omega )\) at a q-linear rate. In particular, this requires that the sequence of step lengths \(\{\tau _n\}\) is non-increasing and bounded from below. We note that \(\{\tau _n\}\) can be made non-increasing by setting \(\mu _n:=\tau _{n-1}\) for all \(n\in \mathbb {N}\) in step 3 of Algorithm 1. Concerning the existence of a uniform lower bound on the step lengths, the Lipschitz continuity of the map \(H^1(\Omega )\ni \alpha \mapsto \nabla \hat{J}(\alpha )\in H^1(\Omega )\), as shown in the proof of Theorem 3.1, suffices. In fact, in finite dimensions and under simple constraints, the result can be found in [8], and the proof there is easily adapted to a Hilbert space setting with an arbitrary nonempty closed convex set. Further, we make use of the following result, which can be found in Theorem 5.1 and Remark 5.1 in [31] and which we state here as a lemma.

Lemma 3.2

Let \(f\in L^2(\Omega )\) be given and let \(\mathbf {p}(\alpha , f)\) be the solution to the lower-level problem in (\(\tilde{\mathbb {P}}\)). Then, \(\mathbf {p}(\alpha , f)\rightarrow 0\) in \(H_0^1(\Omega )^\ell \) as \(f\downarrow 0\) in \(L^2(\Omega )\).

Theorem 3.3

Let \(\{\alpha _n\}\) be generated by Algorithm 1. If the sequence of step lengths \(\{\tau _n\}=\{\theta _-^{m_n}\mu _n\}\) is non-increasing in the sense that \(\mu _n=\tau _{n-1}\), then \(\alpha _n\rightarrow \alpha ^*\) q-linearly in \(H^1(\Omega )\), provided that \(\lambda >0\) and the data \(f\in L^2(\Omega )\) are sufficiently small.

Proof

We first prove that the Lipschitz constant, denoted by L(f), of the map \(H^1(\Omega )\ni \alpha \mapsto K(\mathbf {p}(\alpha ),\alpha )^{\mathrm {T}}\mathbf {q}(\alpha )\in L^2(\Omega )\) satisfies \(L(f)\rightarrow 0\) as \(f\rightarrow 0\) in \(L^2(\Omega )\). Let \(\mathbf {p}_i:=\mathbf {p}(\alpha _i)\) and \(\mathbf {q}_i:=\mathbf {q}(\alpha _i)\). Then, by the triangle inequality,

$$\begin{aligned}&|K(\mathbf {p}_2,\alpha _2)^{\mathrm {T}}\mathbf {q}_2-K(\mathbf {p}_1,\alpha _1)^{\mathrm {T}}\mathbf {q}_1|_{L^2(\Omega )}\\&\quad \le |\mathbf {q}_1|_{L^4(\Omega )^\ell }C(|\mathbf {p}_2-\mathbf {p}_1|_{L^4(\Omega )^\ell }+|\alpha _2-\alpha _1|_{L^4(\Omega )})\\&\qquad + |K(\mathbf {p}_2,\alpha _2)|_{L^4(\Omega )^\ell }|\mathbf {q}_2-\mathbf {q}_1|_{L^4(\Omega )^\ell }, \end{aligned}$$

for some \(C>0\). We know that \(H^1(\Omega )\ni \alpha \mapsto \mathbf {q}(\alpha )\in H_0^1(\Omega )^\ell \) and \(L^2(\Omega )\ni \alpha \mapsto \mathbf {p}(\alpha )\in H_0^1(\Omega )^\ell \) are Lipschitz continuous. Furthermore, Lemma 3.2 implies \(\mathbf {p}(\alpha , f)\rightarrow 0\) in \(H_0^1(\Omega )^\ell \) as \(f\downarrow 0\) in \(L^2(\Omega )\) and analogously, one shows that \(\mathbf {q}(\alpha , f)\rightarrow 0\) in \(H_0^1(\Omega )^\ell \) as \(f\downarrow 0\) in \(L^2(\Omega )\) since \(K(\mathbf {p}(\alpha , f),\alpha )\rightarrow 0\) in \(L^4(\Omega )^\ell \) and \(-\nabla J_0'({{\mathrm{\mathrm {div}}}}\mathbf {p}(\alpha , f))\rightarrow 0\) in \(H^{-1}(\Omega )^\ell \) as \(f\downarrow 0\) in \(L^2(\Omega )\). Hence, since \(H^1(\Omega )\hookrightarrow L^4(\Omega )\) for \(\ell \le 4\), the map under investigation is Lipschitz continuous with constant L(f), and \(L(f)\rightarrow 0\) as \(f\rightarrow 0\) in \(L^2(\Omega )\).

Since \(H^1(\Omega )\ni \alpha \mapsto \nabla \hat{J}(\alpha )\in H^1(\Omega )\) is Lipschitz continuous (see the proof of Theorem 3.1), it follows that the step sizes \(\tau _n\) are bounded from below (see [8]). The sequence \(\{\tau _n\}\) is non-increasing by hypothesis, and since \(\tau _n=\theta _-^{m_n}\tau _{n-1}\) with \(m_n\in \mathbb {N}_0\), we have \(m_n=0\) for all \(n\ge \tilde{N}\) for some sufficiently large \(\tilde{N}\in \mathbb {N}\): Suppose there were no such \(\tilde{N}\). Then there would be a subsequence \(\{m_{n_j}\}\) with \(m_{n_j}\ge 1\) for \(j\in \mathbb {N}\), which implies \(\tau _{n_j}\le \theta _-^j \tau _0\). Hence, \(\tau _{n_j}\rightarrow 0\) as \(j\rightarrow \infty \), contradicting the uniform lower bound on \(\{\tau _n\}\).

It is thus enough to consider \(\{\alpha _n\}_{n>\tilde{N}}\), for which \(\tau _n=\tilde{\tau }\) with some fixed \(\tilde{\tau }>0\). Define \(Q(\alpha ):=K(\mathbf {p}(\alpha ),\alpha )^{\mathrm {T}}\mathbf {q}(\alpha )\), and let \(\Psi =P_{{\mathcal {A}}_{\text {ad}}}(\psi -\tilde{\tau } \nabla \hat{J}(\psi ))\) and \(\Theta =P_{{\mathcal {A}}_{\text {ad}}}(\theta -\tilde{\tau } \nabla \hat{J}(\theta ))\) for some \(\psi ,\theta \in {\mathcal {A}}_{\text {ad}}\). Then, using that the projection \(P_{{\mathcal {A}}_{\text {ad}}}\) is non-expansive, that \(\nabla \hat{J}(\alpha )=(-\Delta +I)^{-1}\hat{J}'(\alpha )=\mathcal {R}^{-1}\hat{J}'(\alpha )\) (where \(\mathcal {R}\) is the Riesz map for \(H^1(\Omega )\)), and (3.3), we have

$$\begin{aligned}&|\Psi -\Theta |_{H^1(\Omega )}^2 \\&\quad \le |(1-\tilde{\tau }\lambda )(\psi -\theta )+ \tilde{\tau }\mathcal {R}^{-1}(Q(\psi )-Q(\theta ))|_{H^1(\Omega )}^2. \end{aligned}$$

The structure of the norm in \(H^1(\Omega )\) implies

$$\begin{aligned}&|\Psi -\Theta |_{H^1(\Omega )}^2\\&\quad \le (1-\tilde{\tau }\lambda )^2|\psi -\theta |_{H^1(\Omega )}^2+ \tilde{\tau }^2|\mathcal {R}^{-1}(Q(\theta )-Q(\psi ))|_{H^1(\Omega )}^2\\&\qquad +2(1-\tilde{\tau }\lambda )\tilde{\tau }(\psi -\theta , \mathcal {R}^{-1}(Q(\theta )-Q(\psi )))_{H^1(\Omega )}\\&\quad \le (1-\tilde{\tau }\lambda )^2|\psi -\theta |_{H^1(\Omega )}^2+\tilde{\tau }^2L(f)^2|\psi -\theta |_{L^2(\Omega )}^2\\&\qquad +2|1-\tilde{\tau }\lambda |\tilde{\tau }L(f)|\psi -\theta |_{H^1(\Omega )} |\psi -\theta |_{L^2(\Omega )}\\&\quad \le \left( (1{-}\tilde{\tau }\lambda )^2+\tilde{\tau }^2L(f)^2{+}2(1-\tilde{\tau }\lambda )\tilde{\tau }L(f)\right) |(\psi -\theta )|_{H^1(\Omega )}^2. \end{aligned}$$

Here, we have used the Lipschitz properties of the map \(\alpha \mapsto Q(\alpha )\) described before. Finally, for \(\lambda >0\) and \(f\in L^2(\Omega )\) sufficiently small, the map \(H^1(\Omega )\ni \varphi \mapsto P_{{\mathcal {A}}_{\text {ad}}}(\varphi -\tilde{\tau } \nabla \hat{J}(\varphi ))\in H^1(\Omega )\) is contractive, and the iteration (3.1) converges q-linearly by the Banach fixed point theorem. \(\square \)
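To make the smallness condition explicit, note that the bracket in the last estimate is a perfect square; assuming \(0<\tilde{\tau }\lambda \le 1\) (our normalization of the step size), contraction holds precisely when

$$\begin{aligned} \bigl ((1-\tilde{\tau }\lambda )+\tilde{\tau }L(f)\bigr )^2<1 \quad \Longleftrightarrow \quad L(f)<\lambda , \end{aligned}$$

so that, for fixed \(\lambda >0\), it suffices to take f small enough to ensure \(L(f)<\lambda \).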

4 Numerical Experiments

In this section, we provide numerical results for image denoising, deblurring, and Fourier as well as wavelet inpainting.

4.1 Implementation

Utilizing a finite difference discretization of the regularized and penalized lower-level problem in (\(\tilde{\mathbb {P}}\)), we arrive at the discretized bilevel problem

$$\begin{aligned} \left\{ \begin{aligned}&\text {minimize}\quad J(\mathbf {p},\alpha ) \quad \text {over } \mathbf {p}\in (\mathbb {R}^{|\Omega _h|})^2,~\alpha \in {\mathcal {A}}_\text {ad},\\&\text {s.t. } g(\mathbf {p},\alpha )\!:=\!-\beta \varvec{\Delta }\mathbf {p}\!+\! \gamma \mathbf {p}\!+\!A\mathbf {p}\!+\!\mathbf {f}\!+\! \frac{1}{\epsilon }P_\delta (\mathbf {p},\alpha )\!=\!0, \end{aligned}\right. \nonumber \\ \end{aligned}$$
(4.1)

with \(A \mathbf {p}:=-\nabla B^{-1}{{\mathrm{\mathrm {div}}}}\mathbf {p},\) and \( \mathbf {f}=-\nabla B^{-1}K^*f\), and where we set \(\Omega _h:=\{1,2,...,n_1\}\times \{1,2,...,n_2\}\) and define the mesh size \(h:=\sqrt{1/(n_1n_2)}\). Assuming constant bounds in \({\mathcal {A}}_\text {ad}\), the discrete admissible set, again denoted by \({\mathcal {A}}_\text {ad}\), is given by

$$\begin{aligned} {\mathcal {A}}_\text {ad}:=\{\alpha \in \mathbb {R}^{|\Omega _h|}:\underline{\alpha }\le \alpha _j\le \overline{\alpha },~\forall j=(j_1,j_2)\in \Omega _h\}. \end{aligned}$$

The discrete objective reads

$$\begin{aligned} J(\mathbf {p},\alpha ):=&\,\frac{1}{2}\left| ({R}({{\mathrm{\mathrm {div}}}}\mathbf {p})-\overline{\sigma }^2)^+\right| ^2_{\ell ^2(\Omega _\omega )}\\&+\frac{1}{2}\left| (\underline{\sigma }^2-{R}({{\mathrm{\mathrm {div}}}}\mathbf {p}))^+\right| ^2_{\ell ^2(\Omega _\omega )}+\frac{\lambda }{2}\left| \alpha \right| ^2_{H^1(\Omega _h)},\\ {R}({{\mathrm{\mathrm {div}}}}\mathbf {p}) :=&\,w*|K(\mu I+K^*K)^{-1}({{\mathrm{\mathrm {div}}}}\mathbf {p}+K^*f)-f|^2, \end{aligned}$$

where \(\Omega _\omega \) is the (index) domain for the acquired data f (we use \(\Omega _\omega =\Omega _h\) for denoising and deblurring), and we define \(\left| f\right| ^2_{\ell ^2(\Omega _\omega )}:=(\sum _{j\in \Omega _\omega }|f_j|^2)/|\Omega _\omega |\). In our experiments, w is a (spatially invariant) averaging filter of size \(n_\text {(w)}\)-by-\(n_\text {(w)}\) (i.e., the local window comprises \(n_\text {(w)}^2\) pixels), and thus the computation of the local variance estimator \(R({\text {div}}\,\mathbf {p})\) becomes a discrete convolution, denoted by “\(*\)”. The term “\(\mu I\)” in the definition of \({R}({\text {div}}\,\mathbf {p})\), with \(0<\mu \ll 1\), serves as a regularization of \(K^*K\).

We discretize the divergence operator as

$$\begin{aligned} ({{\mathrm{\mathrm {div}}}}\mathbf {p})_{(j_1,j_2)}&=\frac{1}{h}\left( \mathbf {p}^1_{(j_1,j_2)}-\mathbf {p}^1_{(j_1-1,j_2)}\right. \\&\quad \left. +\mathbf {p}^2_{(j_1,j_2)}-\mathbf {p}^2_{(j_1,j_2-1)}\right) , \quad \forall (j_1,j_2)\in \Omega _h, \end{aligned}$$

with \(\mathbf {p}^1_{(\tilde{j}_1,\tilde{j}_2)}=\mathbf {p}^2_{(\tilde{j}_1,\tilde{j}_2)}=0\) whenever \((\tilde{j}_1,\tilde{j}_2)\notin \Omega _h\) in the above formula. Accordingly, the discrete gradient operator \(\nabla \) is defined by the adjoint relation, i.e., \(\nabla :=-{\text {div}}^\top \). The discrete vectorial Laplacian \(\mathbf {\Delta }\) is defined by \(\mathbf {\Delta }\mathbf {p}=(\Delta _\text {(D)}\mathbf {p}^1,\Delta _\text {(D)}\mathbf {p}^2)\) for each \(\mathbf {p}\in (\mathbb {R}^{|\Omega _h|})^2\), and \(\Delta _\text {(D)},\Delta _\text {(N)}\in \mathbb {R}^{|\Omega _h|\times |\Omega _h|}\) denote the discrete five-point-stencil Laplacians with homogeneous Dirichlet and Neumann boundary conditions, respectively. For generating \(\Delta _\text {(N)}\), the function value on a ghost grid point (outside the domain) is always set to the function value at the nearest grid point within the domain. For the discrete \(H^1\)-norm of \(\alpha \in \mathbb {R}^{|\Omega _h|}\) (satisfying homogeneous Neumann conditions) we use

$$\begin{aligned} \left| \alpha \right| _{H^1(\Omega _h)}:=h\sqrt{\alpha ^\top (I-\Delta _\text {(N)})\alpha }. \end{aligned}$$

By considering the discrete \(H^1(\Omega )\)-to-\(H^1(\Omega )^*\) Riesz map as \(\alpha \mapsto r=(I-\Delta _\text {(N)})\alpha \), we define the discrete dual \(H^1\)-norm as

$$\begin{aligned} \left| r\right| _{H^1(\Omega _h)^*}{:=}\left| (I{-}\Delta _\text {(N)})^{-1}r\right| _{H^1(\Omega _h)}=h\sqrt{r^\top (I{-}\Delta _\text {(N)})^{-1}r}. \end{aligned}$$
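As a concrete illustration of these discrete operators, the following NumPy sketch (ours; the paper's experiments were run in MATLAB) implements the backward-difference divergence, the gradient \(\nabla =-{\text {div}}^\top \) including the boundary terms induced by the zero padding, and the discrete \(H^1\)-norm with the nearest-ghost-point Neumann Laplacian (the \(1/h^2\) scaling of \(\Delta _\text {(N)}\) is our convention); a randomized test verifies the adjoint relation \(({\text {div}}\,\mathbf {p},u)=-(\mathbf {p},\nabla u)\).

```python
import numpy as np

def div_op(p1, p2, h):
    """Backward-difference divergence with p = 0 outside Omega_h."""
    d = p1.copy()
    d[1:, :] -= p1[:-1, :]
    d += p2
    d[:, 1:] -= p2[:, :-1]
    return d / h

def grad_op(u, h):
    """Gradient nabla := -div^T (forward differences plus boundary terms)."""
    g1, g2 = np.empty_like(u), np.empty_like(u)
    g1[:-1, :] = u[1:, :] - u[:-1, :]
    g1[-1, :] = -u[-1, :]               # boundary term of the exact adjoint
    g2[:, :-1] = u[:, 1:] - u[:, :-1]
    g2[:, -1] = -u[:, -1]
    return g1 / h, g2 / h

def h1_norm(alpha, h):
    """h*sqrt(alpha^T (I - Delta_N) alpha); Delta_N uses nearest ghost points."""
    up = np.vstack([alpha[:1, :], alpha[:-1, :]])
    dn = np.vstack([alpha[1:, :], alpha[-1:, :]])
    lt = np.hstack([alpha[:, :1], alpha[:, :-1]])
    rt = np.hstack([alpha[:, 1:], alpha[:, -1:]])
    lap = (up + dn + lt + rt - 4.0 * alpha) / h**2
    return h * np.sqrt(float((alpha * (alpha - lap)).sum()))

# Randomized adjoint check: (div p, u) = -(p, grad u).
rng = np.random.default_rng(1)
n1, n2, h = 8, 9, 0.1
p1, p2, u = rng.random((3, n1, n2))
g1, g2 = grad_op(u, h)
assert np.isclose((div_op(p1, p2, h) * u).sum(), -(p1 * g1 + p2 * g2).sum())
```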

The denoising problem is treated specially. In fact, we set \(\mu =0\) and discretize the operator \(\nabla \circ {{\mathrm{\mathrm {div}}}}\) jointly by

$$\begin{aligned}&(\nabla {{\mathrm{\mathrm {div}}}}\mathbf {p})_{(j_1,j_2)} \\&\quad =\frac{1}{h^2}\Big (\mathbf {p}^1_{(j_1+1,j_2)}-2\mathbf {p}^1_{(j_1,j_2)}+\mathbf {p}^1_{(j_1-1,j_2)}+\mathbf {p}^2_{(j_1+1,j_2)}\\&\qquad -\mathbf {p}^2_{(j_1+1,j_2-1)}-\mathbf {p}^2_{(j_1,j_2)}+\mathbf {p}^2_{(j_1,j_2-1)}, \mathbf {p}^2_{(j_1,j_2+1)}\\&\qquad -2\mathbf {p}^2_{(j_1,j_2)}+\mathbf {p}^2_{(j_1,j_2-1)}+\mathbf {p}^1_{(j_1,j_2+1)}-\mathbf {p}^1_{(j_1-1,j_2+1)}\\&\qquad -\mathbf {p}^1_{(j_1,j_2)}+\mathbf {p}^1_{(j_1-1,j_2)} \Big ) \end{aligned}$$

for all \((j_1,j_2)\in \Omega _h\), and \(\mathbf {p}^1_{(\tilde{j}_1,\tilde{j}_2)}=\mathbf {p}^2_{(\tilde{j}_1,\tilde{j}_2)}=0\) whenever \((\tilde{j}_1,\tilde{j}_2)\notin \Omega _h\) in the above formula. Further, this is used to compute the discrete dual \(H_0({{\mathrm{\mathrm {div}}}})\)-norm as

$$\begin{aligned} \left| \mathbf {v}\right| _{H_0({{\mathrm{\mathrm {div}}}})^*}&:=h\sqrt{\mathbf {v}^\top (I-\nabla \circ {{\mathrm{\mathrm {div}}}})^{-1}\mathbf {v}}, \qquad \text {for~}\mathbf {v}\in (\mathbb {R}^{|\Omega _h|})^2. \end{aligned}$$

In our numerical tests, we use the discrete version of Algorithm 1 as shown in Algorithm 2 below. For a given \(\alpha \), the solution of the lower-level problem \(g(\mathbf {p},\alpha )=0\) (compare step 4 of Algorithm 2) is computed by a path-following Newton technique; its numerical realization can be found in Algorithm 3. In addition, each projection onto \({\mathcal {A}}_\text {ad}\) requires the solution of an obstacle problem in \(H^1(\Omega )\), which is carried out by the semismooth Newton method of [30]. For the convenience of the reader, in Algorithm 4 we tailor this semismooth Newton method to the requirements of this paper. The overall algorithm is terminated once \(\kappa ^n/\kappa ^0<\text {tol}_\text {(b)}\), where

$$\begin{aligned} \kappa ^n:=\left| P_{{\mathcal {A}}_\text {ad}}(\alpha ^n-\nabla \hat{J}(\alpha ^n))-\alpha ^n\right| _{H^1(\Omega _h)} \end{aligned}$$

is our proximity measure and \(\text {tol}_\text {(b)}>0\) is the user-set tolerance parameter.

[Algorithm 2: discrete version of the projected gradient method (Algorithm 1).]
[Algorithm 3: path-following Newton method for the lower-level problem \(g(\mathbf {p},\alpha )=0\).]
[Algorithm 4: semismooth Newton method for the \(H^1\)-projection onto \({\mathcal {A}}_\text {ad}\).]

4.2 Parameter Settings

Unless otherwise specified, the following parameters are used throughout our numerical experiments: \(\lambda =10^{-6}\), \(\beta =\gamma =10^{-4}\), \(\epsilon =c=10^{-8}\), \(\delta =\tau ^0=10^{-3}\), \(\theta _-=0.25\), \(\theta _+=2\), \(n_\text {(w)}=7\), \(\text {tol}_\text {(b)}=0.005\). The choice of \(\lambda \) is taken from the range \([10^{-7},10^{-5}]\), within which the final results appear insensitive to \(\lambda \). The parameters \(\epsilon \) and \(\delta \) are chosen sufficiently small so that further reduction yields no significant improvement. The sensitivity with respect to the parameters is studied in Sect. 4.5, with results shown in Fig. 8. The bounds \(\underline{\alpha }=10^{-8}\) and \(\overline{\alpha }=10^{-2}\) are chosen so that the interval \([\underline{\alpha },\overline{\alpha }]\) is sufficiently large for a proper selection of the spatially variant \(\alpha \). The parameter \(\mu \) is set to zero for denoising and deblurring, while \(\mu =10^{-4}\) is used for Fourier and wavelet inpainting.

Table 2 Comparison with respect to PSNR and SSIM

Finally, concerning the initialization of \(\alpha \), the general guideline is to choose \(\alpha ^0\) sufficiently large, depending on the underlying problem, so that it yields a cartoon-like restoration \(u^0\). This is analogous to the spatially adaptive total variation method in [23]. The rationale behind this guideline is that a cartoon-like restoration typically injects meaningful information into the local variance estimator, which then transfers into the spatial adaptation of the regularization parameter. In our experiments, \(\alpha ^0=2.5\times 10^{-3}\) proved universally good for all examples. In particular, our choice of \(\alpha ^0\) will be illustrated for the denoising example in Fig. 3.

All experiments reported in this section were performed under MATLAB R2013b. The image intensity is scaled to the interval [0, 1] in our computations. The displayed images are quantitatively compared with respect to their peak signal-to-noise ratio (PSNR) and structural similarity measure (SSIM); see Table 2. In all examples, the “best” scalar regularization parameter \(\hat{\alpha }\) is selected via a bisection procedure, up to a relative error of 0.02, to maximize the following weighted sum of the PSNR and SSIM values of the resulting scalar-\(\alpha \) restoration

$$\begin{aligned} \frac{\text {PSNR}(\alpha )}{\max \{\text {PSNR}(\tilde{\alpha }):\tilde{\alpha }\in I\}} +\frac{\text {SSIM}(\alpha )}{\max \{\text {SSIM}(\tilde{\alpha }):\tilde{\alpha }\in I\}} \end{aligned}$$

over the interval \(I=[10^{-5},10^{-3}]\). The maximal PSNR and SSIM in the above formula are pre-computed up to a relative error of 0.001.

4.2.1 Choices of \(\overline{\sigma }\) and \(\underline{\sigma }\)

Assuming that the noise level \(\sigma \) is known or estimated beforehand, the local variance bounds \(\overline{\sigma }\) and \(\underline{\sigma }\) can be chosen as follows. Let \(\chi ^2({n_\text {(w)}^2})\) denote the Chi-squared distribution with \(n_\text {(w)}^2\) degrees of freedom. Ideally, if \(u=(\mu I+K^*K)^{-1}({{\mathrm{\mathrm {div}}}}\mathbf {p}+K^*f)\) is equal to the true image, then the local variance estimator \({R}({{\mathrm{\mathrm {div}}}}\mathbf {p})=w*|Ku-f|^2\) follows the (scaled) Chi-squared distribution componentwise (see [23]), i.e., for each \((i,j)\in \Omega _h\) we have

$$\begin{aligned} {R}({{\mathrm{\mathrm {div}}}}\mathbf {p})_{(i,j)}\sim \frac{\sigma ^2}{n_\text {(w)}^2}\chi ^2(n_\text {(w)}^2). \end{aligned}$$
(4.2)

This motivates our selection of the local variance bounds. In the following, we describe two variants of the local variance bounds based on Chi-squared statistics; both will be tested in our numerical experiments.

First choice of \(\overline{\sigma }\) and \(\underline{\sigma }\). Ignoring certain dependencies of the random variables, our first local variance bounds are based on extreme value estimation in the sense of Gumbel; see [26]. Gumbel's theory describes the statistical distribution of the maximum and minimum values of a finite number of random variables. Within the theory, the asymptotic distribution of the maximal and minimal values is determined; hence, the larger the sample, the more accurate the description becomes. In light of this approach, the upper bound \(\overline{\sigma }\) was previously established in [23]. Under conditions analogous to the ones in [23], here we derive the value of the lower bound \(\underline{\sigma }\) and argue that the choice of \(\overline{\sigma }\) remains proper in the setting where the localized residual is confined to the interval \([\underline{\sigma },\overline{\sigma }]\).

Let \(\mathfrak {f}\) be the probability density function of \(\chi ^2(n_\text {(w)}^2)\) and \(\mathfrak {F}\) denote its cumulative distribution function, i.e., \(\mathfrak {F}(T):=\int _{-\infty }^T \mathfrak {f}(z)\mathrm {d}z\). The maximum and minimum values of \(N:=n_1 n_2\) observations of independent and identically distributed \(\chi ^2(n_\text {(w)}^2)\)-random variables are, respectively, denoted by \(T_{\text {max}}\) and \(T_{\text {min}}\). Following Gumbel (see [26], eq. 31’ on p. 133 and eq. ’31 on p. 135 or [27]), the limiting distributions of the maximum and minimum value \(\mathfrak {f}_{\text {max}}\) and \(\mathfrak {f}_{\text {min}}\) are given by

$$\begin{aligned} \mathfrak {f}_{\text {max}}(y_\text {max}(T_\text {max}))&=N\mathfrak {f}(\tilde{T}_\text {max})e^{-y_\text {max}(T_\text {max})-e^{-y_\text {max}(T_\text {max})}}, \\ \mathfrak {f}_{\text {min}}(y_\text {min}(T_\text {min}))&=N\mathfrak {f}(\tilde{T}_\text {min})e^{y_\text {min}(T_\text {min})-e^{y_\text {min}(T_\text {min})}}, \end{aligned}$$

where \(\tilde{T}_\text {min}\) and \(\tilde{T}_\text {max}\) are the “dominant values” defined as \(\mathfrak {F}(\tilde{T}_\text {min}):=1/N\) and \(\mathfrak {F}(\tilde{T}_\text {max}) :=1-1/N\). Further, \(y_\text {max}(\cdot )\) and \(y_\text {min}(\cdot )\) represent the standardizations (of \(T_{\text {max}}\) and \(T_{\text {min}}\)) defined by

$$\begin{aligned}&y_\text {max}(T) :=N\mathfrak {f}(\tilde{T}_\text {max})(T-\tilde{T}_\text {max}),\\&y_\text {min}(T) :=N\mathfrak {f}(\tilde{T}_\text {min})(T-\tilde{T}_\text {min}). \end{aligned}$$

The cumulative distribution functions \(\mathfrak {F}_\text {max}(T):=\mathrm {P}\left( T_{\text {max}}\le T\right) \) and \(\mathfrak {F}_\text {min}(T):=\mathrm {P}\left( T_{\text {min}}\le T\right) \) satisfy

$$\begin{aligned} \mathrm {P}\left( T_{\text {max}}\le T\right) =e^{-e^{-y_\text {max}(T)}},\qquad \mathrm {P}\left( T_{\text {min}}\le T\right) =1-e^{-e^{y_\text {min}(T)}}, \end{aligned}$$

see eq. 32’ on p. 133 and eq. ’32 on p. 135 in [26] or [27]. The corresponding expectations (\(\mathfrak {E}\)) and standard deviations (\(\mathfrak {d}\)) for \(y_\text {max}(T_\text {max})\) and \(y_\text {min}(T_\text {min})\) are given by

$$\begin{aligned} \mathfrak {E}(y_\text {max}(T_\text {max}))&=\kappa ,&\mathfrak {d}(y_\text {max}(T_\text {max}))&=\frac{\pi }{\sqrt{6}}, \\ \mathfrak {E}(y_\text {min}(T_\text {min}))&=-\kappa ,&\mathfrak {d}(y_\text {min}(T_\text {min}))&=\frac{\pi }{\sqrt{6}}, \end{aligned}$$

where \(\kappa \simeq 0.577215\) is the Euler–Mascheroni constant (see [26], p. 141). It follows from the standardizations of \(T_\text {max}\) and \(T_\text {min}\) that

$$\begin{aligned} \mathfrak {E}(T_\text {max})&= \tilde{T}_\text {max}+\frac{\kappa }{N\mathfrak {f}(\tilde{T}_\text {max})},\quad \mathfrak {d}(T_\text {max}) = \frac{\pi }{\sqrt{6}N\mathfrak {f}(\tilde{T}_\text {max})}, \nonumber \\ \mathfrak {E}(T_\text {min})&= \tilde{T}_\text {min}-\frac{\kappa }{N\mathfrak {f}(\tilde{T}_\text {min})},\quad \mathfrak {d}(T_\text {min}) = \frac{\pi }{\sqrt{6}N\mathfrak {f}(\tilde{T}_\text {min})}. \nonumber \end{aligned}$$

It can be straightforwardly proven (see [23]) that

$$\begin{aligned} \mathrm {P}\left( T_\text {max}\le \mathfrak {E}(T_\text {max})+\mathfrak {d}(T_\text {max})\right) =e^{-e^{-\kappa -\frac{\pi }{\sqrt{6}}}}\simeq 0.86, \end{aligned}$$

and analogously, since \(y_\text {min}(\mathfrak {E}(T_{\text {min}})-\mathfrak {d}(T_{\text {min}}) )=-\kappa -\pi /\sqrt{6}\), we have that

$$\begin{aligned}&\mathrm {P}\left( T_{\text {min}}\ge \mathfrak {E}(T_{\text {min}})-\mathfrak {d}(T_{\text {min}})\right) \\&=1-\mathrm {P}\left( T_{\text {min}}\le \mathfrak {E}(T_{\text {min}})-\mathfrak {d}(T_{\text {min}})\right) \\&=1-(1-e^{-e^{-\kappa -\frac{\pi }{\sqrt{6}}}})\simeq 0.86. \end{aligned}$$

Furthermore, although closed-form expressions for \(\mathrm {P}\left( T_\text {max}\le \mathfrak {E}(T_{\text {min}})-\mathfrak {d}(T_{\text {min}})\right) \) and \(\mathrm {P}\left( T_\text {min}\ge \mathfrak {E}(T_\text {max})+\mathfrak {d}(T_\text {max})\right) \) are not available, we find numerically that both quantities are almost zero in the range given by \(N=16^2, 32^2,\ldots ,1024^2\) and \(n_\text {(w)}=3,4,\ldots , 11\). This implies that

$$\begin{aligned} \mathrm {P}\left( \mathfrak {E}(T_{\text {min}})-\mathfrak {d}(T_{\text {min}})\le T \le \mathfrak {E}(T_\text {max})+\mathfrak {d}(T_\text {max})\right) \simeq 0.86, \end{aligned}$$

for \(T=T_{\text {min}}\) or \(T=T_\text {max}\).

Fig. 1 “Cameraman” image. a True image. b Noisy blurry image. c Noisy image (\(\sigma =0.1\)). d Noisy image (\(\sigma =0.2\))

Fig. 2 Denoising: \(\sigma =0.1\). a Restor. via \(\hat{\alpha }=\,\)2.641e−4. b Restor. via bilevel-(#1). c \(\alpha \) via bilevel-(#1). d Restor. via SATV. e Restor. via bilevel-(#2). f \(\alpha \) via bilevel-(#2)

Based on the above derivation and (4.2), our first selection of the local variance bounds is given by

$$\begin{aligned} \overline{\sigma }^2_\text {(l)}:=\frac{\sigma ^2}{n_\text {(w)}^2}\left( \mathfrak {E}(T_\text {max})+\mathfrak {d}(T_\text {max})\right) ,\qquad \underline{\sigma }^2_\text {(l)}:=\frac{\sigma ^2}{n_\text {(w)}^2}\left( \mathfrak {E}(T_\text {min})-\mathfrak {d}(T_\text {min})\right) . \end{aligned}$$
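
These bounds can be evaluated numerically with the Statistics Toolbox; in the sketch below, \(N=256^2\) and \(n_\text {(w)}=7\) are assumptions chosen to be consistent with the bound values reported in Sect. 4.3.

```matlab
% First (Gumbel-based) choice of the local variance bounds.
sigma = 0.1; nw = 7; N = 256^2; k = nw^2;
kappa = 0.577215;                             % Euler-Mascheroni constant
Tmax = chi2inv(1 - 1/N, k);                   % dominant value of the maximum
Tmin = chi2inv(1/N, k);                       % dominant value of the minimum
ETmax = Tmax + kappa/(N*chi2pdf(Tmax, k));    % expectation of T_max
dTmax = pi/(sqrt(6)*N*chi2pdf(Tmax, k));      % standard deviation of T_max
ETmin = Tmin - kappa/(N*chi2pdf(Tmin, k));    % expectation of T_min
dTmin = pi/(sqrt(6)*N*chi2pdf(Tmin, k));      % standard deviation of T_min
sig2_up_l  = sigma^2/k*(ETmax + dTmax)        % approx 0.022, cf. Sect. 4.3
sig2_low_l = sigma^2/k*(ETmin - dTmin)        % approx 0.003, cf. Sect. 4.3
```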

Second choice of \(\overline{\sigma }\) and \(\underline{\sigma }\) Our second choice of the local variance bounds is based on mean and variance estimation. The mean and standard deviation of \(\chi ^2(n_\text {(w)}^2)\) are, respectively,

$$\begin{aligned} \mathfrak {E}(\chi ^2(n_\text {(w)}^2)) =n_\text {(w)}^2,\qquad \mathfrak {d}(\chi ^2(n_\text {(w)}^2)) =\sqrt{2}n_\text {(w)}. \end{aligned}$$

Based on this information, one can choose the local variance bounds as

$$\begin{aligned} \overline{\sigma }^2_\text {(t)}:=\frac{\sigma ^2}{n_\text {(w)}^2}\left( n_\text {(w)}^2+\sqrt{2}n_\text {(w)}\right) =\sigma ^2\left( 1+\frac{\sqrt{2}}{n_\text {(w)}}\right) ,\qquad \underline{\sigma }^2_\text {(t)}:=\sigma ^2\left( 1-\frac{\sqrt{2}}{n_\text {(w)}}\right) . \end{aligned}$$
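
For instance, with \(\sigma =0.1\) and the (assumed) window size \(n_\text {(w)}=7\), these expressions reproduce the bound values used in the denoising experiment of Sect. 4.3:

```matlab
% Second (mean/std-based) choice of the local variance bounds.
sigma = 0.1; nw = 7;                    % noise level and assumed window size
sig2_up_t  = sigma^2*(1 + sqrt(2)/nw)   % = 0.01202
sig2_low_t = sigma^2*(1 - sqrt(2)/nw)   % = 0.00798
```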

4.3 Experiments on Denoising

We first test our method on a denoising problem. The observed image is generated by adding Gaussian white noise of standard deviation 0.1 to the test image “Cameraman”; see subplots a and c in Fig. 1. We test our bilevel method with two different pairs of local variance bounds, namely (#1), i.e., \(\underline{\sigma }^2_\text {(l)}=0.00325\) and \(\overline{\sigma }^2_\text {(l)}=0.02211\), and (#2), i.e., \(\underline{\sigma }^2_\text {(t)}=0.00798\) and \(\overline{\sigma }^2_\text {(t)}=0.01202\); these variants are referred to as “bilevel-(#1)” and “bilevel-(#2)” in what follows. In Fig. 2, the corresponding restored images and spatially variant regularization parameters are displayed. These results are compared with the restoration via the best scalar \(\hat{\alpha }= 2.641~\times ~10^{-4}\) as well as with the restoration via the spatially adaptive total variation (SATV) approach [23].

Subplot a in Fig. 2 indicates that the scalar \(\hat{\alpha }\) cannot simultaneously recover, to visual satisfaction, the detail regions (e.g., where the camera and the tripod are placed) and the homogeneous regions (e.g., the background sky). The SATV restoration yields a significant improvement in this respect. Our bilevel restorations in subplots b and e are visually even better, especially in the homogeneous regions. Comparing b and e, we observe that the tighter bounds in (#2) tend to capture more information from the image and yield a slightly better restoration. According to the quantitative comparison in Table 2, the bilevel approaches are always superior to the best scalar \(\hat{\alpha }\) with respect to PSNR and SSIM. Compared with SATV, the bilevel approaches lose in PSNR but are better in SSIM.

We note that the \(\alpha \)-plots in c and f are displayed with a reversed scale for visualization purposes (i.e., a peak in the \(\alpha \)-plot indicates a small value of \(\alpha \) at that point), and similarly for all forthcoming \(\alpha \)-plots in Sect. 4. Notably, one can observe clear patterns in the spatial distribution of \(\alpha \) obtained by our bilevel approach. In both subplots c and f, \(\alpha \) tends to be small in the detailed regions and large in the homogeneous regions. This explains why the restorations in b and e are superior to the one via the best scalar-valued \(\hat{\alpha }\).

Fig. 3 Evolution of \(\alpha ^k\) and \(u^k\) in bilevel-(#2)

We also illustrate the evolution of \(\alpha ^k\) and \(u^k\) along the iterations of the projected gradient algorithm in Fig. 3. Following the guideline at the end of Sect. 4.1, the initial guess \(\alpha ^0\) produces a cartoon-like image \(u^0\). As the iterations proceed, \(\alpha ^k\) reveals an increasingly distinct spatial pattern and, correspondingly, the restoration becomes progressively sharper. The final \(\alpha ^k\) and \(u^k\) after 21 iterations are given by subplots f and e in Fig. 2, respectively.
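
For orientation, a single update of \(\alpha \) in such an iteration may be sketched as follows; all quantities are placeholders, and the pointwise clipping merely stands in for the projection onto \({\mathcal {A}}_\text {ad}\) (which additionally involves the \(H^1\)-structure).

```matlab
% One projected-gradient step on alpha (sketch); placeholder values throughout.
a_lower = 1e-5; a_upper = 1e-1; tau = 1e-5;   % assumed bounds and step size
alpha = 1e-2*ones(256);                       % current iterate (placeholder)
grad_alpha = zeros(256);                      % gradient of reduced objective (placeholder)
alpha = min(max(alpha - tau*grad_alpha, a_lower), a_upper);  % clip to [a_lower, a_upper]
```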

Fig. 4 Denoising: \(\sigma =0.2\). a Restor. via \(\hat{\alpha }=\,\)6.493e−4. b Restor. via bilevel-(#1). c \(\alpha \) via bilevel-(#1). d Restor. via SATV. e Restor. via bilevel-(#2). f \(\alpha \) via bilevel-(#2)

To conclude the denoising example, we increase the noise level to \(\sigma =0.2\) and repeat the above experiment. In this case, the local variance bounds from (#1) and (#2) are given by \(\underline{\sigma }^2_\text {(l)}=0.01302\), \(\overline{\sigma }^2_\text {(l)}=0.08843\), \(\underline{\sigma }^2_\text {(t)}=0.03192\), and \(\overline{\sigma }^2_\text {(t)}=0.04808\). The corresponding results are shown in Fig. 4. A general observation from these results is that detecting spatial patterns in \(\alpha \) becomes more challenging as the noise level increases. For relatively loose bounds such as \(\underline{\sigma }^2_\text {(l)}\) and \(\overline{\sigma }^2_\text {(l)}\), the pattern in the spatially variant \(\alpha \) becomes less pronounced. On the other hand, artifacts due to strong noise tend to appear in \(\alpha \) for relatively tight bounds such as \(\underline{\sigma }^2_\text {(t)}\) and \(\overline{\sigma }^2_\text {(t)}\). Nevertheless, the restorations via the bilevel approaches never seem worse than those via the scalar \(\hat{\alpha }\) or SATV, both visually and quantitatively.

Fig. 5 Deblurring. a Restor. via \(\hat{\alpha }=\,\)4.698e−5. b Restor. via bilevel-(#1). c \(\alpha \) via bilevel-(#1). d Restor. via SATV. e Restor. via bilevel-(#2). f \(\alpha \) via bilevel-(#2)

Fig. 6 Fourier inpainting: “Chest.” a “Chest” image. b Restor. via bilevel-(#1). c \(\alpha \) via bilevel-(#1). d Restor. via \(\hat{\alpha }=\,\)6.978e−5. e Restor. via bilevel-(#2). f \(\alpha \) via bilevel-(#2)

Fig. 7 “Chest”: zoomed views

4.4 Experiments on Deblurring

We continue our experiments by deblurring the “Cameraman” image. Here the image is blurred by a Gaussian kernel of standard deviation 1 and then degraded by Gaussian white noise of standard deviation 0.05; see Fig. 1b. Again, we implement both bilevel-(#1) and bilevel-(#2), with the local variance bounds given by \(\underline{\sigma }^2_\text {(l)}=0.000814\), \(\overline{\sigma }^2_\text {(l)}=0.005527\), \(\underline{\sigma }^2_\text {(t)}=0.001995\), and \(\overline{\sigma }^2_\text {(t)}=0.003005\). In Fig. 5, the resulting images and \(\alpha \)’s are displayed. These results are compared with the restorations via the best scalar \(\hat{\alpha }=4.698~\times ~10^{-5}\) and via SATV. In view of subplots c and f, the spatially variant regularization parameters obtained in deblurring share patterns similar to those in denoising, particularly in the regions of the camera and the tripod. Both bilevel-(#1) and bilevel-(#2) seem to outperform the best scalar \(\hat{\alpha }\) in PSNR and SSIM; see Table 2. Note that the blurring operator has a dampening effect on the artifacts contained in the image. In this circumstance, bilevel-(#2), with its tighter local variance bounds, is typically preferable to bilevel-(#1).
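
A minimal sketch of this data generation, assuming a 9-by-9 kernel support and circular boundary conditions (neither is specified above), reads:

```matlab
% Blurred, noisy data for the deblurring test (Image Processing Toolbox).
u_true = im2double(imread('cameraman.tif'));  % intensities scaled to [0,1]
h = fspecial('gaussian', [9 9], 1);           % Gaussian blur, std 1
Kfun = @(u) imfilter(u, h, 'circular');       % K: convolution with h
f = Kfun(u_true) + 0.05*randn(size(u_true));  % add Gaussian white noise
```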

4.5 Experiments on Fourier Inpainting

Now we consider Fourier inpainting (restoration with missing samples in the Fourier domain), as typically encountered in parallel magnetic resonance imaging. For the test image “Chest” in Fig. 6a, the corresponding data f are generated as \(f=K(u+\eta )\). Here K is defined by \(K=S\circ F\), where F is the 2D discrete Fourier transform and S is a subsampling operator that collects Fourier coefficients along 120 radial lines centered at zero frequency. Since the subsampled Fourier data are typically non-uniformly distributed, the local variance estimator \({R}({{\mathrm{\mathrm {div}}}}\mathbf {p})\) is computed as a 1D convolution: w is a 1D averaging filter of size \(n_\text {(w)}^2\), and \(|Ku-f|^2\in \mathbb {R}^{|\Omega _\omega |}\) is aligned lexicographically as a 1D vector and then convolved with w. Moreover, \(\eta \in \mathbb {R}^{|\Omega _h|}\) is Gaussian white noise of standard deviation 0.05. In contrast to denoising and deblurring, the acquired data f are here coded in the frequency domain rather than in the image domain. This renders the SATV method [23] inapplicable to Fourier inpainting.
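
The following sketch illustrates this setup; the mask geometry, the DFT normalization, and the test image (the Shepp-Logan phantom standing in for “Chest”) are our assumptions.

```matlab
% Fourier inpainting setup: K = S o F with a 120-radial-line sampling mask.
n = 256; L = 120; sigma = 0.05; nw = 7;
u_true = phantom(n);                        % stand-in for the "Chest" image
[X, Y] = meshgrid(-n/2:n/2-1);
S = false(n);
for t = (0:L-1)*pi/L                        % radial lines through zero frequency
    S = S | (abs(-sin(t)*X + cos(t)*Y) < 0.5);
end
S = ifftshift(S);                           % match fft2's frequency ordering
Kfun = @(u) S.*fft2(u)/n;                   % K = S o F (normalized 2D DFT)
f = Kfun(u_true + sigma*randn(n));          % f = K(u + eta)
% local variance estimator as a 1D convolution over the sampled coefficients
r = abs(Kfun(u_true) - f).^2; r = r(S);     % lexicographic 1D alignment of |Ku-f|^2
R = conv(r, ones(nw^2, 1)/nw^2, 'same');    % w: 1D averaging filter of size nw^2
```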

Fig. 8 Sensitivity tests on “Teeth.” a “Teeth” image. b Sensitivity on \(n_\text {(w)}\). c Sensitivity on \(\alpha ^0\). d Sensitivity on \(\epsilon \). e Sensitivity on \(\delta \)

The results via bilevel-(#1) and bilevel-(#2) are displayed in Fig. 6, where the corresponding local variance bounds are given by \(\underline{\sigma }^2_\text {(l)}=0.00077\), \(\overline{\sigma }^2_\text {(l)}=0.00570\), \(\underline{\sigma }^2_\text {(t)}=0.00199\), and \(\overline{\sigma }^2_\text {(t)}=0.00301\). It is observed that the spatially distributed \(\alpha \)’s tend to be small in the regions of interest and large in the background. For comparison, we also display the restoration via the scalar \(\hat{\alpha }\); see subplot d. To highlight the differences among the various restorations, we provide zoomed views of two framed regions in the “Chest” image; see Fig. 7 for a visual comparison. Favorably, the spatial distribution of \(\alpha \) allows both types of local features to be handled properly, i.e., the flat region is homogenized while the detailed region is preserved, which is attainable by neither backprojection nor a scalar-valued \(\hat{\alpha }\).

Fig. 9 Wavelet inpainting: “Pepper.” a “Pepper” image. b Restor. via bilevel-(#1). c \(\alpha \) via bilevel-(#1). d Restor. via \(\hat{\alpha }=\,\)1.334e−4. e Restor. via bilevel-(#2). f \(\alpha \) via bilevel-(#2)

We also test another medical image, “Teeth” (see Fig. 8a), under the same settings as for “Chest.” Similar conclusions can be drawn as before. In addition, we perform sensitivity tests on several parameters of bilevel-(#2) for the “Teeth” example, namely \(n_\text {(w)}\), \(\alpha ^0\), \(\epsilon \), \(\delta \), and \(\lambda \). Here \(n_\text {(w)}\) determines the window size of the local variance estimator, \(\alpha ^0\) is a scalar that initializes the search for a spatially distributed \(\alpha \), \(\epsilon \) controls the penalty term in the lower-level problem, \(\delta \) governs the smoothing of the max-function, and \(\lambda \) weights the \(H^1\)-regularization of \(\alpha \).

Figure 8 reports the sensitivity measured by PSNR and SSIM. We remark that, in general, the choice of the window size represents a tradeoff: small windows reduce the reliability of the local variance statistics, while large windows render the local variance less “localized.” As observed from subplot b, however, our bilevel approach appears quite stable with respect to the window size in terms of PSNR and SSIM. Concerning the initialization of \(\alpha \), as remarked at the end of Sect. 4.2, the bilevel approach benefits from a relatively large initial \(\alpha \), which yields a blocky initial restoration. This is consistent with the test results reported in subplot c. Moreover, we observe from the numerical tests that the bilevel approach is almost invariant, in terms of PSNR and SSIM, to \(\lambda \) in the range \([10^{-7},10^{-5}]\). In contrast, the parameters \(\epsilon \) and \(\delta \) may significantly affect the restoration if they are chosen too large; see subplots d and e. The present values \(\epsilon =10^{-8}\) and \(\delta =10^{-3}\) are chosen sufficiently small so that little marginal gain results from any further reduction of \(\epsilon \) or \(\delta \).
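
Such a sweep can be scripted as below; bilevel_restore is a hypothetical driver for bilevel-(#2), f and u_true denote the data and ground truth, and psnr/ssim are the Image Processing Toolbox metrics (or hand-rolled equivalents on R2013b).

```matlab
% Window-size sensitivity sweep (sketch); bilevel_restore is hypothetical.
alpha0 = 1e-2; eps_ = 1e-8; delta = 1e-3; lambda = 1e-6;  % alpha0 assumed
for nw = 3:11
    u_rec = bilevel_restore(f, nw, alpha0, eps_, delta, lambda);
    fprintf('nw = %2d: PSNR = %.2f, SSIM = %.3f\n', ...
        nw, psnr(u_rec, u_true), ssim(u_rec, u_true));
end
```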

4.6 Experiments on Wavelet Inpainting

We conclude this section with a wavelet inpainting (restoration with missing samples) problem on the “Pepper” image; see Fig. 9. Our task is to “inpaint” Haar wavelet coefficients missing due to lossy image transmission or communication; see [13, 14] for more background information. The given data are generated by \(f=K(u+\eta )\). Here \(\eta \) is Gaussian white noise of standard deviation 0.05, and K is defined by \(K=S\circ W\), with W the Haar wavelet transform and S an operator that randomly collects 80% of the wavelet coefficients. Note that the data f are coded in the (wavelet) transform domain rather than in the original image domain. Thus, analogous to Fourier inpainting, the local variance estimator \(R({{\mathrm{\mathrm {div}}}}\mathbf {p})\) is computed as a 1D convolution.
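
A minimal sketch of this data generation (Wavelet Toolbox), with MATLAB’s “peppers.png” standing in for the test image and the decomposition depth chosen by us, reads:

```matlab
% Wavelet inpainting data: K = S o W with the Haar transform W.
u_true = rgb2gray(im2double(imread('peppers.png')));  % grayscale, in [0,1]
sigma = 0.05;
lev = wmaxlev(size(u_true), 'haar');        % assumed decomposition depth
[c, s] = wavedec2(u_true + sigma*randn(size(u_true)), lev, 'haar');  % W(u+eta)
keep = rand(size(c)) < 0.8;                 % S: random 80% of the coefficients
f = keep.*c;                                % f = K(u+eta), zero-filled off the mask
u_bp = waverec2(f, s, 'haar');              % zero-filling backprojection W^* S^* f
```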

In this example, we set \(\tau ^0=10^{-5}\) for both bilevel-(#1) and bilevel-(#2). The local variance bounds in (#1) and (#2) are given by \(\underline{\sigma }^2_\text {(l)}=0.00081\), \(\overline{\sigma }^2_\text {(l)}=0.00553\), \(\underline{\sigma }^2_\text {(t)}=0.00199\), and \(\overline{\sigma }^2_\text {(t)}=0.00301\). The corresponding restorations, together with the restoration from the scalar \(\hat{\alpha }\), are reported in Fig. 9. The spatially adapted \(\alpha \)’s via bilevel-(#1) and bilevel-(#2) are shown in subplots c and f, respectively. Although the three restorations in b, d, and e are visually close to each other, the bilevel restorations are superior in PSNR but slightly inferior in SSIM according to Table 2.

5 Conclusion

The choice of the regularization parameter for total variation-based image restoration remains a challenging task. At the expense of solving a bilevel optimization problem, this paper generalizes and “robustifies” the classical TV-model by considering a spatially variant regularization parameter \(\alpha \). In particular, an upper-level objective based on local variance estimators is proposed. The overall bilevel model is solved by a projected gradient-type algorithm and yields competitive numerical results in comparison with existing methods. In fact, the reconstructions are almost always better in PSNR or SSIM than those obtained from scalar regularization. Moreover, visually, image details are better preserved and homogeneous regions better denoised with distributed regularization than with a scalar one.

Potential future research includes alternative choices of the upper-level objective, although the statistics-based variance corridors proposed in this work operate satisfactorily. From an analytical point of view, passing to the limit with the lower-level regularization parameter or employing tools from set-valued analysis would be of interest in order to obtain sharp stationarity conditions for the original bilevel formulation. Moreover, the framework may be generalized to other types of priors (such as total generalized variation (TGV)) and to other noise models (such as random-valued impulse noise). Finally, the local adaptation of the filter (e.g., adjusting the window size according to a confidence criterion) is also of interest.