1 Introduction

We consider numerical methods for solving bilevel optimization problems of the form:

$$\begin{aligned} \begin{aligned}&\min _\vartheta \ \mathcal {L}(x^*(\vartheta ), \vartheta ) \\&s.t.\ x^*(\vartheta ) \in \arg \min _{x\in \mathbb {R}^N} E(x,\vartheta ) , \end{aligned} \end{aligned}$$
(1)

where \(\mathcal {L}\) (called the loss function) penalizes the difference between the output of the lower level problem \(x^*(\vartheta )\) and some given ground truth data. In addition, \(\mathcal {L}\) can also contain a regularizer on the parameter vector \(\vartheta \), e.g., a sparsity prior. The mapping \(x^*(\vartheta )\) is the solution of an optimization problem (parametrized by \(\vartheta \)) that solves a specific task, e.g., multi-label segmentation.

In the general bilevel literature, (1) is often presented as a leader–follower problem. The leader (the upper level problem) tries to optimize its move (minimization of the upper level objective) while taking the reaction of an opponent, the follower, into account: passing the information \(\vartheta \) to the follower, the leader anticipates the follower’s response (minimization of the lower level problem).

In this paper, we focus on a class of problems that allows for non-smooth convex functions \(x\mapsto E(x,\vartheta )\) in the lower level problem, e.g., sparse models based on the \(\ell _1\)-norm. Such models have become very popular in the computer vision, image processing, and machine learning communities since they are robust with respect to noise and outliers in the input data.

Due to the possibly high dimensionality of the parameter vector, we pursue the minimization of the bilevel problem (1) using gradient-based methods. Hence, a descent direction of \(\mathcal {L}\) with respect to \(\vartheta \) must be determined. Its estimation involves the Jacobian of the solution map \(x^*(\vartheta )\) with respect to the parameter vector \(\vartheta \), which causes three kinds of problems:

  (i) The solution mapping \(x^*(\vartheta )\) is only defined implicitly (as a minimizer of the lower level problem).

  (ii) The lower level’s solution is not unique.

  (iii) The lower level problem is non-smooth.

(i) A reduction to a single-level problem by explicitly solving the lower level problem is not always possible. Nevertheless, if the lower level problem is sufficiently smooth, it can sometimes be replaced by its optimality condition, and the implicit function theorem (cf. Sect. 4.1) provides an explicit formula for the derivative of the solution map. This approach does not work for non-smooth lower level problems.

(ii) Consider the example

$$\begin{aligned} \begin{aligned}&\min _{\vartheta \in \mathbb {R}} (x^*(\vartheta )-1)^2 \\&s.t. x^*(\vartheta ) \in \arg \min _{x\in [0,1]} \vartheta x , \end{aligned} \end{aligned}$$
(2)

which reduces to minimization of a step function

$$\begin{aligned} \min _{\vartheta \in \mathbb {R}}\, {\mathcal {L}}(x^{*}(\vartheta )), \quad \mathcal {L}(x^{*}(\vartheta )) = {\left\{ \begin{array}{ll} 1, &{} \text {if } \vartheta {>} 0; \\ 0,&{} \text { if } \vartheta {<} 0; \\ {[}0,1],&{} \text {if } \vartheta = 0. \end{array}\right. } \end{aligned}$$

A gradient-based method will get stuck almost everywhere, as the derivative vanishes for all \(\vartheta \ne 0\). Similar situations arise for robust models in the lower level problem: by definition, their solution is not affected by small perturbations of the input data (or the parameter \(\vartheta \)). For instance, in the multi-label segmentation problem, small changes in the pixel likelihoods do not change the segmentation result; the energy landscape of the loss function has the form of a high-dimensional step function. This effect is easy to reproduce numerically, as the sketch below shows.
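The following minimal Python sketch (our own illustration, with function names chosen freely) evaluates the reduced loss of example (2) and its central finite-difference approximation:

```python
# Numerical check of example (2): the reduced loss is a step function of
# the parameter, so finite-difference gradients vanish almost everywhere.

def x_star(theta):
    # argmin_{x in [0,1]} of theta * x; any x in [0,1] is optimal at theta = 0
    return 0.0 if theta > 0 else 1.0

def loss(theta):
    return (x_star(theta) - 1.0) ** 2

h = 1e-6
for theta in (-0.5, -0.01, 0.01, 0.5):
    fd_grad = (loss(theta + h) - loss(theta - h)) / (2 * h)
    print(f"theta = {theta:+.2f}: loss = {loss(theta):.0f}, fd gradient = {fd_grad:.1f}")
```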

(iii) Due to the non-smoothness of the lower level problem, standard calculus cannot be applied. In variational (non-smooth) analysis, there are many generalizations of derivatives, such as the convex subdifferential, the Fréchet subdifferential, or the limiting subdifferential. However, they are often set-valued, and generalizations of the chain rule rely on constraint qualifications that are sometimes quite restrictive and often hard to verify.

In the conference version of this paper [30], we introduced an approach to overcome the smoothness restriction in some cases of practical interest. The idea is to replace the lower level problem by an iterative algorithm that is guaranteed to converge to a solution of the problem. If the update mapping of the algorithm is a smooth function, the chain rule can be applied to the composition of these update mappings recursively and the exact derivatives with respect to the parameter vector \(\vartheta \) can be computed. Algorithms based on Bregman distances are key for this development. The number of iterations of the iterative algorithm steers the approximation quality of the lower level problem.

The iterative algorithm that replaces the lower level is stopped after a small number of iterations. However, once the algorithm and the number of iterations are fixed, the resulting bilevel optimization problem seeks an optimal \(\vartheta \) for exactly this algorithm and this (fixed) number of iterations. Numerically, the derivative that is involved in gradient-based minimization is exact: the number of chain rule recursions is finite. This is in contrast to an approach based on the optimality condition of a smooth approximation of the lower level problem. In that case, the descent direction is based on the derivative of the optimality condition evaluated at the minimizer of the lower level problem, which is only determined approximately.

Beyond the analysis of the conference paper, we discuss approximations to the derivative evaluation that reduce the memory requirements and the computational cost significantly. We extend the class of problems that can be used in our framework and give some more details about the general implementation of our approach. Moreover, we consider the limiting case, i.e., the fixed point equation of an iterative algorithm in the lower level problem.

We point out several applications of our approach and evaluate it for a multi-label segmentation problem coupled with a convolutional neural network.

2 Related Work

We propose a simple approximation of the lower level problem that naturally addresses non-smoothness and non-uniqueness.

If the lower level problem has multiple solutions, the solution map is set-valued and (1) is not even well defined (cf. Remark 1). Dempe et al. [14] describe three possible options to cope with this problem. The optimistic bilevel optimization problem assumes a cooperative strategy of leader and follower, i.e., in case of multiple solutions, the follower tries to minimize the upper level objective. The pessimistic bilevel problem is the other extreme, where the leader tries to bound the damage that the follower could cause by its move. The selection function approach assumes that the leader can always predict the follower’s choice. Of course, these three approaches coincide for lower level problems with a unique solution.

Our approach does not fall into any of the three cases; the selection function approach is the closest, the difference being that our approximation changes the output also at (originally) unique points. Like the approaches mentioned above, our solution strategy renders the solution map single-valued.

Dempe et al. [14] classify the following optimality conditions. The primal Karush–Kuhn–Tucker (KKT) transformation replaces the lower level problem by the necessary and sufficient optimality condition of a convex function. The equivalence to the original problem is shown in [15]. The classical KKT transformation substitutes the lower level problem with the classical KKT conditions. Due to the extra variable, the problems are not fully equivalent anymore (see [14]). This approach, which leads to a non-smooth mathematical program with complementarity constraints (MPEC), is the most frequently used one. The third approach is the optimal value transformation, which introduces a constraint that bounds the lower level objective by the optimal value function.

In the limit, our approach is motivated by the first class, the primal KKT transformation. We consider the fixed point equation of an algorithm, which represents the optimality condition without introducing additional variables, and approximate this situation with finitely many iterations of the algorithm.

We focus on gradient-based methods, such as gradient descent, L-BFGS [25], non-linear conjugate gradient [1, 19], the Heavy-ball method [40], iPiano [29], and others, for solving the bilevel optimization problem. In particular, this paper focuses on the estimation of descent directions. As one option, the gradient can be approximated numerically with finite differences, as in [16]. We rather pursue what is known as algorithmic/automatic differentiation, which is based on the idea of decomposing the derivative evaluation into small parts by means of the chain rule, where the analytic derivative of each part is known. A whole branch of research deals with this technique [21]. Obviously, the idea of differentiating an algorithm in the lower level problem is not new [17, 37]. The difference is that our algorithm has a smooth update mapping while actually minimizing a non-smooth objective function. Another idea to approach a non-smooth problem with an iterative algorithm is presented in [12], where a chain rule for weak derivatives is used (cf. Sect. 4.4).

The special case of a lower level problem that depends linearly on the parameters is treated by structured output support vector machines [38]. The linear structure of the lower level problem allows the construction of an upper bound of the upper level objective function, which needs to be minimized. In general, this approach is only an approximation to the bilevel problem in (1), which can be solved using subgradient descent.

There are several practical examples of bilevel optimization in computer vision and machine learning. Bilevel optimization was considered for task-specific sparse analysis prior learning [32] and applied to signal restoration. In [10, 11, 23], a bilevel approach was used to learn a model of natural image statistics, which was then applied to various image restoration tasks. A variational formulation for learning a good noise model was addressed in [35] in a PDE-constrained optimization framework, with some follow-up works [6, 7, 34]. In machine learning, bilevel optimization was used to train an SVM [4] and other techniques [27]. Recently, it was used for the end-to-end training of a Convolutional Neural Network (CNN) and a graphical model for binary image segmentation [33] (cf. Sect. 8).

Finally, we refer to [13] for an annotated bibliography with many references regarding the theoretical and practical development in bilevel optimization.

3 Preliminaries

We work in a Euclidean vector space \(\mathbb {R}^N\) of dimension \(N\) equipped with the standard Euclidean norm \(\Vert \cdot \Vert _{}:= \sqrt{\left\langle \cdot ,\cdot \right\rangle }\) that is induced by the standard inner product. We use the notation \(\overline{\mathbb {R}}:=\mathbb {R}\cup \lbrace \infty \rbrace \) to denote the extended real numbers.

We use the notation \([x *a]\) for \(x,a\in \mathbb {R}^N\) to denote the set \(\lbrace x\in \mathbb {R}^N\vert \, \forall i:\, x_i *a_i \rbrace \), where \(*\in \lbrace <,\le , =, \ge , > \rbrace \) is a binary relation on \(\mathbb {R}\times \mathbb {R}\). For example, \([x\ge 0]\) denotes the non-negative orthant in \(\mathbb {R}^N\).

4 The Bilevel Problem

We consider bilevel optimization problems of the form:

$$\begin{aligned} \begin{aligned}&\min _{\vartheta \in \mathbb {R}^P}\ \mathcal {L}(x^*(\vartheta ), \vartheta ) + \ell (\vartheta )\\&s.t.\ x^*(\vartheta ) \in \arg \min _{x\in \mathbb {R}^N} E(x,\vartheta ). \end{aligned} \end{aligned}$$
(3)

The function \(\ell :\mathbb {R}^P \rightarrow \overline{\mathbb {R}}\) is assumed to be proper, lower semi-continuous, convex, and “prox-friendly”, and the function \(\mathcal {L}:\mathbb {R}^N\times \mathbb {R}^P \rightarrow \mathbb {R}\) to be continuously differentiable on \({\text {dom}}\ell \). The optimization variable is the (parameter) vector \(\vartheta \in \mathbb {R}^P\). It appears implicitly and explicitly in the upper level problem: implicitly via the solution mapping \(x^*(\vartheta )\in \mathbb {R}^N\) of the lower level problem, and explicitly in \(\ell \) and in the second argument of \(\mathcal {L}\). The lower level is a minimization problem in the first variable of a proper, lower semi-continuous function \(E:\mathbb {R}^N\times \mathbb {R}^P \rightarrow \overline{\mathbb {R}}\). For each \(\vartheta \in \mathbb {R}^P\), the objective function (energy) \(x \mapsto E(x,\vartheta )\) is assumed to be convex.

Note that our formulation includes constrained optimization problems in the upper and lower level problems. The functions \(\ell \) and E are defined as extended-valued (real) functions. Of course, in order to handle the constraints efficiently in the algorithm, the constraint sets should not be too complicated.

In order to solve the optimization problem in (3), we can apply iPiano [29], a gradient-based algorithm that can handle the non-smooth part \(\ell (\vartheta )\). The extension of iPiano in [28, Chapter 6] allows for a prox-bounded (non-convex, non-smooth) function \(\ell (\vartheta )\). Informally, the update step of this algorithm (for the parameter vector \(\vartheta \)) reads

$$\begin{aligned}&\vartheta ^{k+1} \in {\text {prox}}^{}_{\alpha _k \ell }\left( \vartheta ^{k} - \alpha _k \nabla _\vartheta \mathcal {L}(x^*(\vartheta ^k), \vartheta ^k) \right. \nonumber \\&\qquad \qquad \quad \left. +\, \beta _k(\vartheta ^{k} - \vartheta ^{k-1})\right) , \end{aligned}$$
(4)

where \({\text {prox}}^{}_{\alpha _k \ell }\) denotes the proximity operator of the function \(\ell \), \(\alpha _k\) is a step size parameter, and \(\beta _k\) steers the so-called inertial effect of the algorithm (usually \(\beta _k\in [0,1]\)). For details about \(\alpha _k\) and \(\beta _k\), we refer to [28, 29], where convergence to a stationary point (a zero in the limiting subdifferential) is proved under mild assumptions. We could also apply proximal gradient descent (forward–backward splitting) [2] (\(\beta _k=0\)). However, in our experiments, iPiano was usually faster and less sensitive to local optima. If the non-smooth term is not present, several gradient-based solvers can be used [1, 19, 25, 40].
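As an illustration, the following Python sketch implements one update step (4) under the assumption that \(\ell (\vartheta ) = \lambda \Vert \vartheta \Vert _1\), whose proximity operator is soft-thresholding; the oracle grad_loss is a placeholder for \(\nabla _\vartheta \mathcal {L}(x^*(\vartheta ), \vartheta )\), whose computation is the subject of the following sections.

```python
import numpy as np

def soft_threshold(v, t):
    # proximity operator of t * ||.||_1 (our assumed choice for l)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ipiano_step(theta, theta_prev, grad_loss, alpha, beta, lam):
    # inertial forward step followed by the proximal (backward) step, cf. (4)
    v = theta - alpha * grad_loss(theta) + beta * (theta - theta_prev)
    return soft_threshold(v, alpha * lam)
```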

The structure of the update step in (4) shows that the main task in applying such a gradient-based algorithm is the evaluation of the gradient \(\nabla _\vartheta \mathcal {L}(x^*(\vartheta ), \vartheta )\). The remainder of this paper deals with exactly this problem: compute \(\nabla _\vartheta \mathcal {L}(x^*(\vartheta ), \vartheta )\) for a solution mapping \(\vartheta \mapsto x^*(\vartheta )\) of a possibly non-smooth objective function in the lower level. Note that the following approximations naturally yield or require a unique solution of the lower level problem.

Remark 1

The formulation (3) of a bilevel optimization problem only makes sense when \(\arg \min _{x\in \mathbb {R}^N} E(x,\vartheta )\) yields a unique minimizer. In that case, optimality of the bilevel problem can be derived from standard optimality conditions in non-linear programming. If the lower level problem does not provide a unique solution, the loss function \(\mathcal {L}\) must be defined on the power set of \(\mathbb {R}^N\) and a different notion of optimality must be introduced. Since this results in problems beyond the scope of this paper, we refer to [14]. A common circumvention is to consider the corresponding optimistic bilevel problem.


For a given parameter value \(\vartheta \in \mathbb {R}^P\), we would like to compute a descent direction of \(\mathcal {L}\) in (3) with respect to \(\vartheta \) to find a numerical solution using some gradient-based method. Obviously, we need the derivative of the solution map \(x^*(\vartheta )\) with respect to \(\vartheta \). In the following, we present strategies to approximate the (possibly non-smooth) lower level problem and to compute a descent direction.

4.1 Derivative of a Smoothed Lower Level Problem

If the objective function of the lower level problem of (3) can be approximated well by a twice continuously differentiable function (again denoted E), we can make use of the implicit function theorem to find the derivative of the solution map with respect to \(\vartheta \). The optimality condition of the lower level problem is \(\nabla _x E(x,\vartheta ) = 0\), which under some conditions implicitly defines a function \(x^*(\vartheta )\). As we assume that the problem \(\min _x E(x,\vartheta )\) has a solution, there is \((x^*,\vartheta )\) such that \(\nabla _x E(x^*,\vartheta ) = 0\). Then, under the conditions that \(\nabla _x E\) is continuously differentiable around \((x^*,\vartheta )\) and \(({\partial (\nabla _x E)}/{\partial x})(x^*,\vartheta )\) is invertible, there exists an explicit function \(x^*(\vartheta )\) defined on an (open) neighborhood of \(\vartheta \). Moreover, the function \(x^*(\vartheta )\) is continuously differentiable at \(\vartheta \) and it holds that

$$\begin{aligned} \frac{\partial x^{*}}{\partial \vartheta } (\vartheta ) = \left( - \frac{\partial (\nabla _x E)}{\partial x}(x^*(\vartheta ),\vartheta ) \right) ^{-1} \frac{\partial (\nabla _x E)}{\partial \vartheta } (x^*(\vartheta ),\vartheta ). \end{aligned}$$

Using the Hessian \(H_E(x^*( \vartheta ), \vartheta ) := \frac{\partial ^2 E}{\partial x^2}(x^*( \vartheta ), \vartheta )\) yields

$$\begin{aligned} \frac{\partial x^*}{\partial \vartheta } ( \vartheta ) = - (H_E(x^*( \vartheta ), \vartheta ))^{-1} \frac{\partial ^2 E }{\partial \vartheta \partial x} (x^*( \vartheta ), \vartheta ). \end{aligned}$$
(5)

The requirement for using (5) from the implicit function theorem is the continuous differentiability of \({\partial E}/{\partial x}\) and the invertibility of \(H_E\). Application of the chain rule yields the total derivative of the loss function \(\mathcal {L}\) of (3) w.r.t. \(\vartheta \)

$$\begin{aligned} \frac{d\mathcal {L}}{d\vartheta } = - \Bigg [ \frac{\partial \mathcal {L}}{\partial x} H_E^{-1}\Bigg ] \frac{\partial ^2 E }{\partial \vartheta \partial x} + \frac{\partial \mathcal {L}}{\partial \vartheta }, \end{aligned}$$
(6)

where the function evaluation at \((x^*(\vartheta ),\vartheta )\) is dropped for brevity. A clever way of setting parentheses, as indicated by the square brackets, avoids the explicit inversion of the Hessian matrix; instead, a single linear system has to be solved, as in the sketch below. However, for large problems, iterative solvers are required.
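A minimal numerical sketch of (6): one linear system with \(H_E^\top \) is solved for the adjoint vector (the square bracket in (6)); all inputs are assumed to be evaluated at \((x^*(\vartheta ),\vartheta )\) and passed as NumPy arrays.

```python
import numpy as np

def total_derivative(hess, d2E_dtheta_dx, dL_dx, dL_dtheta):
    # w solves H_E^T w = (dL/dx)^T, i.e., w^T = (dL/dx) H_E^{-1}
    w = np.linalg.solve(hess.T, dL_dx)
    # dL/dtheta (total) = -[dL/dx H_E^{-1}] d^2E/(dtheta dx) + dL/dtheta
    return -w @ d2E_dtheta_dx + dL_dtheta
```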

4.2 Derivative of Iterative Algorithms

We can replace the minimization problem in the lower level of (3) by an algorithm that solves this problem, i.e., the lower level problem is replaced by an equality constraint. This approach has three advantages: (i) after approximating the lower level of (3) by an algorithm, the approach is exact; (ii) the update step of the algorithm can be smooth without the lower level problem being smooth; and (iii) the output is always unique (for a fixed initialization), which circumvents the critical issue of a non-unique lower level solution.

Let \(\mathcal {A}\text { and }\mathcal {A}^{(n)}:X\times \mathbb {R}^P \rightarrow X\) describe one or \(n\) iterations, respectively, of an algorithm \(\mathcal {A}\) for minimizing E in (3). For simplicity, we assume that the feasible set mapping \(\vartheta \mapsto \lbrace x\in \mathbb {R}^N\vert \, (x,\vartheta )\in {\text {dom}}E \rbrace \) is constant, i.e., the same X is assigned to all \(\vartheta \in \mathbb {R}^P\). Note that \(X=\mathbb {R}^N\) is permitted.

For fixed \(n\in \mathbb {N}\), we replace (3) by

$$\begin{aligned} \begin{aligned}&\min _\vartheta \ \mathcal {L}(x^*(\vartheta ), \vartheta ) + \ell (\vartheta ) \\&s.t.\ x^*(\vartheta ) = \mathcal {A}^{(n+1)} (x^{(0)}, \vartheta ), \end{aligned} \end{aligned}$$
(7)

where \(x^{(0)}\) is some initialization of the algorithm. The solution map of the lower level problem \(x^*(\vartheta )\) is the output of the algorithm \(\mathcal {A}\) after \(n+1\) iterations. If we write down one iteration of the algorithm, i.e., \(x^{(n+1)}(\vartheta ) = \mathcal {A}(x^{(n)}(\vartheta ),\vartheta )\), we have to account for the dependence of \(x^{(n)}\) on the choice of \(\vartheta \). Only for the first iterate, which emerges from the initialization, can this dependency be dropped.

A suitable algorithm has the properties that \(x^{(n)}(\vartheta )\) converges pointwise (i.e., for each \(\vartheta \)) to a solution of the lower level problem as \(n\) goes to infinity and \(E(x^{(n)},\vartheta ) = E(\mathcal {A}^{(n)}(x^{(0)}, \vartheta ),\vartheta ) \rightarrow \min _x E(x,\vartheta )\) for \(n\rightarrow \infty \). Note that for Bregman proximity functions in algorithm \(\mathcal {A}\), the solution for \(n\rightarrow \infty \) could lie on \({\text {bdry}}(X)\), despite \(x^{(n)}\in {\text {int}}(X)\) for all \(n\). However, this matters only for an asymptotic analysis.

If \(\mathcal {A}\) is (totally) differentiable with respect to \(\vartheta \), then, by the standard chain rule, \(\mathcal {A}^{(n)}\) is differentiable with respect to \(\vartheta \) as well. This way, we obtain a totally differentiable approximation to the lower level problem of (3), where the approximation quality can simply be controlled by the number of iterations. For the so-called descent algorithms, it holds that

$$\begin{aligned} E(x^{(n+1)},\vartheta ) - \min _x E(x,\vartheta ) \le E(x^{(n)},\vartheta ) - \min _x E(x,\vartheta ). \end{aligned}$$

A large number of iterations usually approximates the minimum of E better than a small number of iterations.

Nevertheless, a small number of iterations is also interesting for our approach. Once the number of iterations is fixed, the bilevel optimization problem seeks optimal performance for exactly this chosen number of iterations. Solving the bilevel optimization problem accurately with a small number of iterations \(n\) of the lower level algorithm can result in a better performance than a poorly solved bilevel problem with a large number of iterations in the lower level.

Our approach is well suited for minimizing the bilevel problem using gradient-based methods. The differentiation of \(\mathcal {L}\) with respect to \(\vartheta \) in (7) is exact; once an algorithm is selected, no additional approximation is required for computing the derivatives. In contrast, the smoothing approach from Sect. 4.1 requires the minimization of a smooth objective function, the solution of which can be found only approximately. Therefore, the descent direction, which is based on the optimality condition, is always erroneous.

The “smoothing parameter” in our approach is the number of iterations of the algorithm that replaces the lower level problem. Since the algorithm’s update mapping is assumed to be smooth, in particular, locally Lipschitz continuous, which formally means that

$$\begin{aligned} \Vert \mathcal {A}(x, \vartheta ) - \mathcal {A}(y, \vartheta ) \Vert _{} \le {\text {const.}} \Vert x-y \Vert _{} \end{aligned}$$

holds in a neighborhood of the initial point, the variation of the output after one iteration is limited. Therefore, intuitively, for a large number of iterations \(n\), less smoothness of \(\mathcal {A}^{(n)}\) can be expected.

In order to obtain the derivative of the lower level problem of (7), there are two prominent concepts: forward mode and backward mode. For any vector \(\xi \in \mathbb {R}^N\), the forward mode corresponds to evaluating the derivative as

$$\begin{aligned}&\xi ^\top \frac{dx^{(n+1)}}{d\vartheta }(\vartheta )\nonumber \\&\,\quad = \xi ^\top \left[ \frac{\partial \mathcal {A}}{\partial x}(x^{(n)},\vartheta ) \frac{dx^{(n)}}{d\vartheta }(\vartheta ) \right] + \xi ^\top \frac{\partial \mathcal {A}}{\partial \vartheta } (x^{(n)},\vartheta ), \end{aligned}$$
(8)

whereas the backward mode/reverse mode evaluates the derivative as

$$\begin{aligned}&\Bigg (\frac{dx^{(n+1)}}{d\vartheta }(\vartheta )\Bigg )^\top \xi \nonumber \\&\quad = \Bigg ( \frac{dx^{(n)}}{d\vartheta }(\vartheta )\Bigg )^\top \left[ \Bigg (\frac{\partial \mathcal {A}}{\partial x}(x^{(n)},\vartheta )\Bigg )^\top \xi \right] \nonumber \\&\qquad +\, \left( \Bigg (\frac{\partial \mathcal {A}}{\partial \vartheta } (x^{(n)},\vartheta )\Bigg )^\top \xi \right) , \end{aligned}$$
(9)

where the square brackets indicate the order in which the terms are evaluated. In both modes, the term \({dx^{(n)}}/{d\vartheta }\) is recursively replaced and evaluated using the preceding iterate \({(n-1)}\), in the respective order.

Mathematically, both concepts yield the same result. Numerically, however, the approaches are very different. The reverse mode is usually more efficient when the optimization variable \(\vartheta \) is high dimensional (i.e., \(P\) is large) and the range of the objective function \(\mathcal {L}\) is low dimensional (it is always 1 in our setting). This corresponds to \(\xi \) being a column vector instead of a derivative matrix. The forward mode is often easier to implement, since it is executed in the same order as the optimization algorithm itself and can be computed online, i.e., during the iterations of the algorithm. As a downside, each partial derivative must be initialized and propagated through the iterations; therefore, the memory requirements grow quickly with the dimension \(P\). We focus on the reverse mode for evaluating the derivatives, due to its computationally more appealing nature.

The backward mode is executed in the reverse order of the iterations of the algorithm and needs the optimum \(x^*\), which is \(x^{(n+1)}\) in our case, for executing the first matrix vector multiplication. All intermediate results toward the optimum must be available. The implementation of the backward mode (9) is shown in Algorithm 1.

[Algorithm 1: implementation of the backward mode (9)]

This approach is quite expensive, but for a reasonable number of iterations it is still practical, and it is still faster than the inversion of the Hessian matrix in Sect. 4.1; see (6). A sketch of the reverse-mode recursion is given below. In the following section, we present approximations that reduce the cost significantly.
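The following Python sketch mirrors the structure of Algorithm 1, under the assumption that the transposed partial derivatives are available as matrix-free oracles dA_dx_T and dA_dtheta_T (names of our choosing), and that \(\vartheta \) and the iterates are NumPy arrays.

```python
import numpy as np

def reverse_mode(xs, theta, xi, dA_dx_T, dA_dtheta_T):
    # xs stores the iterates x^(0), ..., x^(n) of the forward pass;
    # xi is (dL/dx)^T evaluated at the output x^(n+1).
    grad = np.zeros_like(theta)
    for x in reversed(xs):                 # k = n, n-1, ..., 0
        grad += dA_dtheta_T(x, theta, xi)  # (dA/dtheta)^T xi, cf. (9)
        xi = dA_dx_T(x, theta, xi)         # propagate xi one iteration back
    return grad
```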

4.3 Derivative of Fixed Point Equations

We generalize the result from Sect. 4.1, where the lower level problem of (3) is replaced by the first-order optimality condition of a smooth approximation. The idea is to consider a different optimality condition: a point is optimal if it satisfies the fixed point equation of an algorithm \(\mathcal {A}:X\times \mathbb {R}^P \rightarrow X\) solving the original lower level problem, i.e., we address the bilevel problem:

$$\begin{aligned} \begin{aligned}&\min _\vartheta \ \mathcal {L}(x^*(\vartheta ),\vartheta ) \\&s.t.\ x^*(\vartheta ) = \mathcal {A}(x^*(\vartheta ), \vartheta ) , \end{aligned} \end{aligned}$$
(10)

where \(X\subset \mathbb {R}^N\) is as in Sect. 4.2 and we have a fixed point \(x^*\). This approach is more general than the one in Sect. 4.1, since we could actually first smoothly approximate the lower level problem and then consider the fixed point equation. For many algorithms, both approaches are equivalent, because optimization algorithms are often derived from the first-order optimality condition.

Following the idea of Sect. 4.2, we can consider a differentiable fixed point equation without the lower level problem being differentiable. An algorithm that has a differentiable update rule yields a differentiable fixed point equation.

Assume that \((x^*,\vartheta )\) solves the fixed point equation. By differentiating the fixed point equation, we obtain

$$\begin{aligned} \frac{\mathrm{d} x}{\mathrm{d} \vartheta }(\vartheta ) = \frac{\partial \mathcal {A}}{\partial x}(x^*(\vartheta ),\vartheta ) \frac{d x}{d \vartheta }(\vartheta )+ \frac{\partial \mathcal {A}}{\partial \vartheta }(x^*(\vartheta ),\vartheta ), \end{aligned}$$

which can be rearranged to yield

$$\begin{aligned} \frac{\mathrm{d} x}{\mathrm{d} \vartheta }(\vartheta ) = \Bigg (I - \frac{\partial \mathcal {A}}{\partial x}(x^*(\vartheta ),\vartheta )\Bigg )^{-1} \frac{\partial \mathcal {A}}{\partial \vartheta }(x^*(\vartheta ),\vartheta ). \end{aligned}$$
(11)

Assuming that the spectral radius of \(({\partial \mathcal {A}}/{\partial x})(x^*(\vartheta ),\vartheta )\) is smaller than 1, we can approximate the inversion using the geometric series:

$$\begin{aligned} \frac{\mathrm{d} x}{\mathrm{d} \vartheta }(\vartheta ) = \sum _{n=0}^{\infty } \Bigg (\frac{\partial \mathcal {A}}{\partial x}(x^*(\vartheta ),\vartheta )\Bigg )^{n} \frac{\partial \mathcal {A}}{\partial \vartheta }(x^*(\vartheta ),\vartheta ), \end{aligned}$$

where \((({\partial \mathcal {A}}/{\partial x})(x^*(\vartheta ),\vartheta ))^n\) denotes the \(n\)-fold matrix product with itself. Let us approximate this series by the finite sum over \(n = 0,\ldots , n_0\). Then, by a simple rearrangement, for \(\xi \in \mathbb {R}^N\), we have (abbreviating \(({\partial \mathcal {A}}/{\partial x})(x^*(\vartheta ),\vartheta )\) by \({\partial \mathcal {A}}/{\partial x}\), and similarly \({\partial \mathcal {A}}/{\partial \vartheta }\))

$$\begin{aligned} \begin{aligned} \xi ^\top \frac{\mathrm{d} x}{\mathrm{d} \vartheta }(\vartheta )&\approx \xi ^\top \sum _{n=0}^{n_0} \Bigg (\frac{\partial \mathcal {A}}{\partial x}\Bigg )^{n} \frac{\partial \mathcal {A}}{\partial \vartheta } \\&=\ \xi ^\top \frac{\partial \mathcal {A}}{\partial \vartheta } + \xi ^\top \frac{\partial \mathcal {A}}{\partial x}\left( \frac{\partial \mathcal {A}}{\partial \vartheta } + \frac{\partial \mathcal {A}}{\partial x} \left( \frac{\partial \mathcal {A}}{\partial \vartheta } + \cdots \right) \right) \\&=\ \xi ^\top \left[ \frac{\partial \mathcal {A}}{\partial x} \frac{\mathrm{d}x^{(n_0)}}{\mathrm{d}\vartheta } \right] + \xi ^\top \frac{\partial \mathcal {A}}{\partial \vartheta }. \end{aligned} \end{aligned}$$

The difference between the last line in this equation and (8) and (9) is the evaluation point of the terms. While in (8) and (9) the terms for \({\mathrm{d}x^{(n+1)}}/{\mathrm{d}\vartheta }\) are evaluated at \((x^{(n)}(\vartheta ),\vartheta )\), here, all terms are evaluated at \((x^*(\vartheta ),\vartheta )\). Although the condition on the spectral radius is rarely met in practice, this approximation works well empirically and needs to store only the optimum of the algorithm. This leads to an immense reduction of the memory requirements.
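In code, this fixed point approximation is the same back-iteration as the reverse-mode sketch above, except that all derivatives are evaluated at the optimum \(x^*\), so only \(x^*\) has to be stored; a sketch with the same (assumed) derivative oracles:

```python
import numpy as np

def fixed_point_gradient(x_star, theta, xi, dA_dx_T, dA_dtheta_T, n0):
    # truncated geometric series: sum_{n=0}^{n0} (dA/dtheta)^T ((dA/dx)^T)^n xi
    grad = np.zeros_like(theta)
    for _ in range(n0 + 1):
        grad += dA_dtheta_T(x_star, theta, xi)
        xi = dA_dx_T(x_star, theta, xi)
    return grad
```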

4.4 Weak Differentiation of Iterative Algorithms

The approach in [12] also considers an algorithm replacing the non-smooth lower level problem. Their underlying methodology, however, is based on weak differentiability, which can be guaranteed for Lipschitz continuous mappings thanks to Rademacher’s theorem. If all iteration mappings are Lipschitz continuous with respect to the iteration variable and the parameter \(\vartheta \), weak differentiability follows from the chain rule for Lipschitz mappings [18, Theorem 4]. For details, we refer to [12], in particular Sect. 4.

5 Explicit Derivatives for Exemplary Algorithms

The framework of Bregman proximity functions is key to the idea of approximating a non-smooth optimization problem by an algorithm with smooth update mappings. In this section, we instantiate two such algorithms. Details and examples of Bregman proximity functions are postponed to Sect. 6.1. For understanding this section, it suffices to know that \(D_\psi (x,\bar{x})\) provides a distance measure between two points x and \(\bar{x}\), and that it can be used to define a Bregman proximity operator \({\text {prox}}^{\psi }\) which generalizes the common proximity operator based on the Euclidean distance.

5.1 Derivative of Forward–Backward Splitting

Let us consider forward–backward splitting [24, 31] with Bregman proximity function \(D_\psi \) (e.g., [3]). It applies to minimization problems of the form:

$$\begin{aligned} \min _{x\in \mathbb {R}^N}\, f(x) + g(x), \end{aligned}$$

where \(f:\mathbb {R}^N \rightarrow \mathbb {R}\) is a continuously differentiable, convex function with Lipschitz continuous gradient and \(g:\mathbb {R}^N \rightarrow \overline{\mathbb {R}}\) is a proper, lower semi-continuous, convex function with a (Bregman) proximity operator that is easy to evaluate. The update rule of the forward–backward splitting we consider is

$$\begin{aligned} \begin{aligned} x^{(n+1)} =&\arg \min _{x\in \mathbb {R}^N}\, g(x;\vartheta ) + f(x^{(n)};\vartheta ) \\&+ \left\langle \nabla f(x^{(n)};\vartheta ),x-x^{(n)} \right\rangle + \frac{1}{\alpha }D_\psi (x, x^{(n)}) \\ =:&{\text {prox}}^{\psi }_{\alpha g}\Big (\nabla \psi (x^{(n)}) - \alpha \nabla f(x^{(n)};\vartheta ) ;\vartheta \Big ) \\ =:&{\text {prox}}^{\psi }_{\alpha g}\Big (y^{(n)}(x^{(n)};\vartheta ) ;\vartheta \Big ), \end{aligned} \end{aligned}$$
(12)

where we denote \(y^{(n)}(x^{(n)};\vartheta ) := \nabla \psi (x^{(n)}) - \alpha \nabla f(x^{(n)};\vartheta )\), the intermediate result after the forward step. The implementation of the reverse mode for determining the derivative of the solution map of the lower level problem with respect to \(\vartheta \) is given in Algorithm 2.

[Algorithm 2: reverse mode for the forward–backward splitting (12)]

5.2 Derivative of Primal–Dual Splitting

Since the primal–dual algorithm with Bregman proximity functions from [9] provides us with a flexible tool, we specify the implementation of the reverse mode for this algorithm. It applies to the convex–concave saddle-point problem

$$\begin{aligned} \min _x\max _y \left\langle Kx,y \right\rangle + f(x) + g(x) - h^*(y), \end{aligned}$$

which is derived from \(\min _x f(x) + g(x) + h(Kx)\), where f is convex with a Lipschitz continuous gradient, and g, h are proper, lower semi-continuous convex functions with simple proximity operators for g and for the convex conjugate \(h^*\).

Let the forward iteration of the primal–dual algorithm with variables \(x^{(n)}=(u^{(n)},p^{(n)})\in \mathbb {R}^{N_u+N_p}\) be given as

$$\begin{aligned} \begin{aligned} u^{(n+1)} =&\ {\mathcal {PD}}_u(u^{(n)},p^{(n)},\vartheta ) \\ :=&\arg \min _u\, \left\langle \nabla f(u^{(n)}),u-u^{(n)} \right\rangle + g(u) \\&+ \left\langle Ku,p^{(n)} \right\rangle + \tfrac{1}{\tau }D_u(u,u^{(n)}) \\ p^{(n+1)} =&\ {\mathcal {PD}}_p(2u^{(n+1)}-u^{(n)}, p^{(n)}, \vartheta ) \\ :=&\ \arg \min _p\, h^*(p) - \left\langle K(2u^{(n+1)}-u^{(n)}),p \right\rangle \\&+ \tfrac{1}{\sigma }D_p(p,p^{(n)}), \end{aligned} \end{aligned}$$
(13)

where f, g, h, and K can depend on \(\vartheta \). The step size parameters \(\tau \) and \(\sigma \) must be chosen according to \((\tau ^{-1} - L_f) \sigma ^{-1} \ge L^2\), where \(L=\Vert K \Vert _{}\) is the operator norm of K and \(L_f\) is the Lipschitz constant of \(\nabla f\).
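For concreteness, one iteration of (13) with Euclidean proximity functions can be sketched as follows; prox_g(v, tau) and prox_hstar(v, sigma) stand for the (assumed simple) proximity operators of g and \(h^*\), and the variables are NumPy arrays.

```python
def pd_step(u, p, K, grad_f, prox_g, prox_hstar, tau, sigma):
    # primal update: forward step in f and <K., p>, backward step in g
    u_new = prox_g(u - tau * (grad_f(u) + K.T @ p), tau)
    # dual update with the extrapolated primal point 2*u_new - u
    p_new = prox_hstar(p + sigma * (K @ (2 * u_new - u)), sigma)
    return u_new, p_new
```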

To illustrate the application of the chain rule throughout the primal–dual algorithm, we show a graphical representation of the information flow in Fig. 1, where we use the following abbreviations (analogously for \({\mathcal {PD}}_p\)):

$$\begin{aligned} {\mathcal {PD}}_u^{(n)}&:= {\mathcal {PD}}_u(u^{(n)},p^{(n)},\vartheta ); \\ {\mathcal {PD}}_p^{(n)}&:= {\mathcal {PD}}_p(2u^{(n+1)}-u^{(n)},p^{(n)},\vartheta ); \\ \partial _u {\mathcal {PD}}_u&:= \frac{\partial {\mathcal {PD}}_u}{\partial u}; \quad \partial _p {\mathcal {PD}}_u:= \frac{\partial {\mathcal {PD}}_u}{\partial p}; \quad \partial _\vartheta {\mathcal {PD}}_u:= \frac{\partial {\mathcal {PD}}_u}{\partial \vartheta }. \end{aligned}$$

Remark 2

In Sect. 5.1, we evaluated the forward and the backward steps separately using the chain rule. Of course, this could be done here as well.

Fig. 1

The graph shows how the information is backpropagated to estimate the derivatives in Algorithm 3. The derivatives at the nodes show which derivative is to be evaluated from this point downwards through the graph. The edges represent multiplicative (transposed) factors. The final derivative is the sum over all leaf nodes

Based on this graphical representation, it is easy to derive Algorithm 3.

[Algorithm 3: reverse mode for the primal–dual algorithm (13)]

A running average is used to implement the ergodic primal–dual algorithm, whose output is the average of all iterates, i.e., \(u^* = \frac{1}{n+1}\sum _{i=0}^{n} u^{(i)}\): denoting \(s_u^{(n)}:= \frac{1}{n+1}\sum _{i=0}^{n} u^{(i)}\), we have \(s_u^{{(n+1)}} = \frac{1}{n+2} u^{(n+1)} + \frac{n+1}{n+2}s_u^{(n)}\). Since the derivative is a linear operator, we can estimate the derivative for the ergodic primal–dual sequence by averaging all \(w^{(n)}\); these can be computed as a running average in the loop of Algorithm 3, as sketched below.
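A one-line sketch of this recursion (the same update applies to the averaged derivative contributions \(w^{(n)}\)):

```python
def update_running_average(s, u_new, n):
    # s_u^(n+1) = u^(n+1)/(n+2) + (n+1)/(n+2) * s_u^(n)
    return u_new / (n + 2) + (n + 1) / (n + 2) * s
```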

6 “Smoothing” Using Bregman Proximity

Splitting-based techniques like those in Sect. 5 usually handle non-smooth terms in the objective function via a (non-linear/Bregman) proximal step. Convex conjugation makes terms in the objective amenable to simple and differentiable proximal mappings. Adding the possibility of considering a primal, primal–dual, or dual formulation yields many examples of practical interest.

In the following, we introduce the class of Bregman functions that can be used in combination with the algorithms in Sect. 5. Then, we discuss a few examples that allow the reformulation of several non-smooth terms arising in applications.

6.1 Bregman Proximity Functions

We consider Bregman proximity functions [5] with the following properties: Let \(\psi :\mathbb {R}^N \rightarrow \overline{\mathbb {R}}\) be a 1-convex function with respect to the Euclidean norm, i.e., it is strongly convex with modulus 1, and denote its domain by \(X:={\text {dom}}\psi \). We assume that \(\psi \) is continuously differentiable on the interior of its domain \({\text {int}}(X)\) and continuous on its closure \({\text {cl}}(X)\).

Then, \(\psi \) generates a Bregman proximity function \(D_\psi :X \times {\text {int}}(X) \rightarrow \mathbb {R}\) by

$$\begin{aligned} D_\psi (x,\bar{x}) := \psi (x) - \psi (\bar{x}) - \left\langle \nabla \psi (\bar{x}),x-\bar{x} \right\rangle . \end{aligned}$$
(14)

For a sequence \((x^n)_{n\in \mathbb {N}}\) converging to \(x\in X\), we require that \(\lim _{n\rightarrow \infty } D_\psi (x,x^n) = 0\). The 1-convexity of \(\psi \) implies that the Bregman function satisfies the inequality

$$\begin{aligned} D_\psi (x,\bar{x}) \ge \frac{1}{2} \Vert x-\bar{x} \Vert _{}^2, \quad \forall x\in X,\; \bar{x} \in {\text {int}}(X). \end{aligned}$$

These are the kind of Bregman proximity functions considered in [9]. Obviously, \(\psi (x)=\frac{1}{2} \Vert x \Vert _{}^2\) corresponds to \(D_\psi (x,\bar{x}) = \frac{1}{2} \Vert x-\bar{x} \Vert _{}^2\).
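Definition (14) translates directly into code; the following sketch (our own illustration) checks the Euclidean special case numerically.

```python
import numpy as np

def bregman_distance(psi, grad_psi, x, x_bar):
    # D_psi(x, x_bar) from (14)
    return psi(x) - psi(x_bar) - grad_psi(x_bar) @ (x - x_bar)

psi = lambda x: 0.5 * x @ x        # psi = 0.5 ||.||^2
grad_psi = lambda x: x
x, x_bar = np.array([1.0, 2.0]), np.array([0.5, 0.5])
print(bregman_distance(psi, grad_psi, x, x_bar))  # 1.25 == 0.5*||x - x_bar||^2
```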

In iterative algorithms, the Bregman proximity function is used via the proximity operator for a proper, lower semi-continuous, convex function \(g:X \rightarrow \overline{\mathbb {R}}\)

$$\begin{aligned} {\text {prox}}^{\psi }_{\alpha g} (\bar{x}) := \arg \min _{x\in X}\, \alpha g(x) + D_\psi (x,\bar{x}), \end{aligned}$$
(15)

where we define \({\text {prox}}^{}_{\alpha g}:= {\text {prox}}^{\frac{1}{2}\Vert \cdot \Vert _{}^2}_{\alpha g}\).

There are two kinds of Bregman proximity functions: (i) the function \(\nabla \psi \) can be continuously extended to X, i.e., \(D_\psi \) can be defined on \(X\times X\); and (ii) \(\psi \) is differentiable on \({\text {int}}(X)\), but \(\nabla \psi \) cannot necessarily be extended to \({\text {cl}}(X)\). In the second case, \(D_\psi (x,\bar{x})\) makes sense only on \(X\times {\text {int}}(X)\), and we must ensure that \({\text {prox}}^{\psi }_{\alpha g}(\bar{x})\in {\text {int}}(X)\) for any \(\bar{x}\in {\text {int}}(X)\). For this, we need to assume that \(\Vert \nabla \psi (x) \Vert _{}\rightarrow \infty \) whenever x approaches a boundary point in \({\text {bdry}}(X) := {\text {cl}}(X) \backslash {\text {int}}(X)\) (which is sometimes referred to as \(\psi \) being essentially smooth [36]).

While solutions of the proximity operator for the first class can lie on the boundary \({\text {bdry}}(X)\), this is not possible for the second class; boundary points can be reached only in the limit when the proximity operator is applied sequentially. Moreover, for \(\bar{x}\in {\text {bdry}}(X)\), (14) would imply that, unless \(x=\bar{x}\), the Bregman distance is \(+\infty \) for any x, which can be represented by \(\delta _{[x=\bar{x}]}(x)\). This means that \(\bar{x}\in {\text {bdry}}(X)\) is always a fixed point of this Bregman proximity operator. This precludes application of the fixed point approach from Sect. 4.3.

6.2 Examples of Bregman Functions

Since Bregman proximity functions play a key role in this paper, we consider a few examples.

Example 1

The squared Euclidean norm \(\psi (x) = \frac{1}{2}\Vert x \Vert _{2}^2\) is continuously differentiable on the whole space \(\mathbb {R}^N\) and, therefore, belongs to class (i) of Bregman proximity functions.

Example 2

The function \(\psi (x) = \frac{1}{2}((x+1)\log (x+1) + (1-x)\log (1-x))\) is defined on the interval \((-1,1)\), can be continuously extended to \([-1,1]\), and is continuously differentiable on \((-1,1)\) with \(\vert \psi ^\prime (x) \vert \rightarrow \infty \) as \(x\rightarrow \pm 1\). It is 1-strongly convex and, thus, generates a Bregman proximity function.

Example 3

The entropy function \(\psi (x) = x\log (x)\), which can be continuously extended to \([x\ge 0]\), is continuously differentiable on \([x>0]\) with derivative \(\psi ^\prime (x) = \log (x)+1\). The derivative cannot be continuously extended to \(x=0\); for \(x\rightarrow 0\), we have \(\vert \psi ^\prime (x) \vert \rightarrow +\infty \). Unfortunately, this function is not 1-strongly convex on all of \([x\ge 0]\). However, the function \(a x\log (x)\), \(a>0\), is 1-strongly convex when restricted to the bounded subset [0, a]. For \(a=1\), the Bregman function \(D_\psi (x,\bar{x})=x(\log (x)-\log (\bar{x}))-(x-\bar{x})\) is generated.

Example 4

The entropy function can also be used in higher dimensions. Unfortunately, it is then hard to ensure a simple evaluation of an associated proximity mapping. Consider a polyhedral set \(\emptyset \ne X\subset \mathbb {R}^N\) given by

$$\begin{aligned} \begin{aligned} X&= \lbrace x\in \mathbb {R}^N\vert \, \forall i=1,\ldots ,M:\left\langle a_i,x \right\rangle \le b_i \rbrace \\&= \bigcap _{i=1}^M \lbrace x \in \mathbb {R}^N\vert \, \left\langle a_i,x \right\rangle \le b_i \rbrace \end{aligned} \end{aligned}$$

for vectors \(0\ne a_i\in \mathbb {R}^{N}\) and scalars \(b_i\in \mathbb {R}\), \(i=1,\ldots ,M\). Then, the generating function

$$\begin{aligned} \psi (x) = \sum _{i=1}^M (b_i-\left\langle a_i,x \right\rangle )\log (b_i-\left\langle a_i,x \right\rangle ) \end{aligned}$$

is designed such that, for any point \(\bar{x} \in {\text {int}}(X)\), any other point \(x\not \in X\) is “moved infinitely far away” with respect to the Bregman distance \(D_\psi (x,\bar{x})\). Consequently, \(\Vert \nabla \psi (x) \Vert _{} \rightarrow \infty \) as x tends towards a point on the boundary \({\text {bdry}}(X)\). Nevertheless, \(\psi \) is continuous on X and strongly convex if X is bounded.

6.3 Examples of Smooth Bregman Proximity Operators

The Bregman proximity functions that we presented are particularly interesting if the evaluation of the proximal mapping (15) is a constrained minimization problem, i.e., the involved function g in \({\text {prox}}^{\psi }_{g}\) is extended-valued and \(+\infty \) outside the constraint (closed) convex set \(X\subset \mathbb {R}^N\). The Bregman function can replace or simplify the constraint set. In the following, we consider a few examples of practical interest. The class of functions that are amenable to our approach can be broadened significantly thanks to the concept of (convex) conjugation.

We consider a basic class of functions \(g(x) = \left\langle x,c \right\rangle + \delta _{X}(x)\) for some \(c\in \mathbb {R}^N\). The associated (non-linear) proximity operator from (15) is given by

$$\begin{aligned} {\text {prox}}^{\psi }_{\alpha g} (\bar{x}) = \arg \min _{x\in X}\, \alpha \left\langle x,c \right\rangle + D_\psi (x,\bar{x}) . \end{aligned}$$

The corresponding (necessary and sufficient) optimality condition, which has a unique solution, is

$$\begin{aligned} \begin{aligned}&\ 0 \in c + \nabla \psi (x) - \nabla \psi (\bar{x}) + \partial \delta _{X}(x) \\ \Leftrightarrow&\ \nabla \psi (\bar{x}) - c \in \nabla \psi (x) + \mathrm {N}_{X}(x), \end{aligned} \end{aligned}$$

where \(\mathrm {N}_{X}(x)\) denotes the normal cone at x of the set X. Suppose \(\bar{x} \in {\text {int}}(X)\). If \(\psi \) is chosen such that \(\Vert \nabla \psi (x) \Vert _{} \rightarrow +\infty \) for \(x \rightarrow \tilde{x}\in {\text {bdry}}(X)\), then the solution of the proximal mapping is in \({\text {int}}(X)\). Since \(\mathrm {N}_{X}(x)=0\) for \(x\in {\text {int}}(X)\), the optimality condition simplifies to

$$\begin{aligned} \nabla \psi (\bar{x}) - c = \nabla \psi (x), \end{aligned}$$
(16)

i.e., the constraint is implicitly taken care of by the Bregman proximity function. Summarizing, the goal of our approach (for this basic function g) consists of determining \(\psi \), respectively \(D_\psi \), such that

  • the constraint set can be handled implicitly,

  • (16) can be solved efficiently (possibly in closed form),

  • and the solution function of (16), which yields the solution of (16) for a given \(\bar{x}\), is required to be differentiable w.r.t. \(\bar{x}\) and \(\vartheta \), where possibly \(c = c(\vartheta )\).

Example 5

For a linear function \(g(x) = \left\langle c,x \right\rangle + \delta _{[x\ge 0]}(x)\), the entropy function from Example 3 can be summed up for each coordinate to remove the non-negativity constraint. The proximity operator reads

$$\begin{aligned} \Bigg ({\text {prox}}^{\sum _j x_j\log x_j}_{\alpha g}(\bar{x})\Bigg )_i = \bar{x}_i \exp (-\alpha c_i). \end{aligned}$$

A closer look at the iterations of the forward–backward splitting (FBS) algorithm (12) reveals that such a function g arises with \(c=\nabla f(\bar{x})\), i.e., in the iterations of FBS for the minimization of

$$\begin{aligned} \min _{x\in \mathbb {R}^N}\, f(x) + \delta _{[x\ge 0]}(x). \end{aligned}$$

A particular instance of this problem is the non-negative least squares problem, i.e., \(f(x) = \frac{1}{2} \Vert Ax - b \Vert _{2}^2\) with a matrix A and a vector b.
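A minimal sketch of forward–backward splitting (12) with the entropy prox of Example 5, applied to the non-negative least squares problem; the step size and problem data below are illustrative only.

```python
import numpy as np

def entropy_fbs(A, b, x0, alpha=0.01, iters=500):
    # multiplicative update (prox)_i = x_i * exp(-alpha c_i) with c = grad f(x);
    # the iterates remain strictly positive, so the constraint is implicit
    x = x0.copy()                      # requires x0 > 0
    for _ in range(iters):
        grad = A.T @ (A @ x - b)       # gradient of f(x) = 0.5 ||Ax - b||^2
        x = x * np.exp(-alpha * grad)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
print(entropy_fbs(A, b, x0=np.ones(5)))  # all entries stay positive
```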

Example 6

The most frequent application of the entropy-prox is for the minimization of a linear function \(g(x) = \left\langle c,x \right\rangle \) over the unit simplex in \(\mathbb {R}^N\). Since the entropy function restricts the solution of the proximity operator to the positive orthant, projecting a point \(\bar{x}\in \mathbb {R}_+^N\) onto the unit simplex \(\lbrace x\in \mathbb {R}^N\vert \, \sum _{i=1}^Nx_i = 1\text { and } x_i \ge 0 \rbrace \) reduces to the projection onto the affine subspace \(\lbrace x\in \mathbb {R}^N\vert \, \sum _{i=1}^Nx_i = 1 \rbrace \), which can be given in closed form, i.e.,

$$\begin{aligned} \Bigg ( {\text {prox}}^{\sum _j x_j\log x_j}_{\alpha g} (\bar{x})\Bigg )_i = \frac{\bar{x}_i\exp (-\alpha c_i)}{\sum _{j=1}^N\bar{x}_j\exp (-\alpha c_j)} . \end{aligned}$$

This proximal problem arises, for example, in the multi-label segmentation problem in Sect. 8.1 or in matrix games (see [9, Sect. 7.1]); a sketch is given below.
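In code, this prox is a multiplicative update followed by a normalization (a softmax-like step); a minimal sketch:

```python
import numpy as np

def simplex_entropy_prox(x_bar, c, alpha):
    # entropy prox of the linear function <c, .> over the unit simplex
    w = x_bar * np.exp(-alpha * c)
    return w / w.sum()

p = simplex_entropy_prox(np.full(3, 1/3), np.array([0.2, 0.5, 0.1]), alpha=1.0)
print(p, p.sum())  # a point on the simplex: non-negative, sums to 1
```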

Example 7

For the function \(g(x) = \left\langle c,x \right\rangle + \delta _{[-1\le x\le 1]}(x)\) the Bregman function from Example 2 reduces the minimization problem in the proximal mapping to an unconstrained problem. The proximal mapping with \(\psi (x) = \sum _i \frac{1}{2}((x_i+1)\log (x_i+1) + (1-x_i)\log (1-x_i))\) reads:

$$\begin{aligned} \Bigg ({\text {prox}}^{\psi }_{\alpha g}(\bar{x})\Bigg )_i = \frac{\exp (-2\alpha c_i) - \frac{1-\bar{x}_i}{1+\bar{x}_i}}{\exp (-2\alpha c_i) + \frac{1-\bar{x}_i}{1+\bar{x}_i}}. \end{aligned}$$

Obviously, this example can be adjusted to any Cartesian product of interval constraints. The importance of this exemplary function g becomes clear in the following.
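Since \(\psi ^\prime (x) = \operatorname {artanh}(x)\) here, the displayed formula can equivalently be evaluated as \(x = \tanh (\operatorname {artanh}(\bar{x}) - \alpha c)\), which is a convenient form for implementation; a sketch (requiring \(\vert \bar{x}_i \vert < 1\)):

```python
import numpy as np

def interval_entropy_prox(x_bar, c, alpha):
    # solves psi'(x) = psi'(x_bar) - alpha*c with psi'(x) = artanh(x);
    # the result stays inside the open interval (-1, 1)
    return np.tanh(np.arctanh(x_bar) - alpha * c)
```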

Functions that are linear on a constraint set also arise when conjugate functions are considered. For instance, the \(\ell _1\)-norm can be represented as

$$\begin{aligned} \Vert x \Vert _{1} = \max _{y}\, \left\langle x,y \right\rangle - \delta _{[-1\le y \le 1]}(y). \end{aligned}$$

In combination with the primal–dual (PD) algorithm (13), this representation results in subproblems of the type discussed in the preceding examples. From this perspective, optimization problems involving a linear operator \({\mathcal {D}}\) and the \(\ell _1\)-norm \(\Vert {\mathcal {D}}x \Vert _{1}\) are also easy to address.

Fig. 2

Visualization of the loss function \(\mathcal {L}(x(\vartheta ),\vartheta )\) for the 1D example (17) on the left side. The optimum is marked with a black star. On the right-hand side, the solution map of the lower level problem is shown

This idea of conjugation can be put into a slightly larger framework, as the convex conjugate of any positively one-homogeneous proper, lsc, convex function is an indicator function of a closed convex set. Unfortunately, it is required that projecting onto such a set is easy (“prox-friendliness”). Therefore, the following example is restricted to the (additively) separable case.

Example 8

Let g be an (additively) separable, positively one-homogeneous, proper, lsc, convex function \(g(x) = \sum _{i=1}^Ng_i(x_i)\). Thanks to these properties, g coincides with its bi-conjugate function \(g^{**}\), and we can consider

$$\begin{aligned} g(x) = g^{**}(x) = \sum _{i=1}^N\max _{y_i}\, x_i y_i - \delta _{Y_i}(y_i) , \end{aligned}$$

where \(Y_i = [a_i,b_i]\) is a closed interval in \(\mathbb {R}\). Again, the dual update step of (13) involves problems such as in Example 7 with \(h^*(y) = \sum _i \delta _{Y_i}(y_i)\).

7 Toy Example

The bilevel problem that we consider here is a parameter learning problem of a one-dimensional non-negative least squares problem:

$$\begin{aligned} \begin{aligned} \min _{\vartheta \in \mathbb {R}}&\; \frac{1}{2} (x^*(\vartheta ) - \mathfrak g)^2 \\&\; s.t.\ x^*(\vartheta ) = \arg \min _{x\in \mathbb {R}}\; \frac{\lambda }{2} (\vartheta x - b)^2 + \frac{1}{2} x^2 + \delta _{[x\ge 0]}(x), \end{aligned} \end{aligned}$$
(17)

where \(\vartheta \) is the optimization variable of the bilevel problem, \(b\in \mathbb {R}\) is the input of the least squares problem, and \(\lambda \) is a positive weighting parameter. Given \(\vartheta \) and b, the lower level problem solves the non-negative least squares problem. The squared Euclidean loss function in the upper level problem compares the output of the lower level problem for some \(\vartheta \) and b to the ground truth \(\mathfrak g:=x^*(\vartheta ^*)\), which is generated by solving the lower level problem with some predefined value \(\vartheta ^*\). The goal of the bilevel optimization problem is to find \(\vartheta ^*\) given b and \(\mathfrak g\).

The analytic solution of the lower level problem (the solution map) is

$$\begin{aligned} x^*(\vartheta ) = \max \Big (0, \frac{\lambda \vartheta b}{1+\lambda \vartheta ^2}\Big ) \end{aligned}$$

and is shown on the right-hand side of Fig. 2. It is obviously a non-smooth function with a non-differentiable point at \(\vartheta =0\). Plugging the solution map into the upper level problem shows the actual objective to be minimized; see the left-hand side of Fig. 2.

7.1 Experimental Setup

In the following experiments, we numerically explore the gradients computed with the proposed techniques. We do not consider the actual minimization of the bilevel problem. The computed gradients could be used by any first-order gradient-based method.

7.1.1 Analytic Subdifferential

For \(\vartheta \ne 0\), the standard chain rule from calculus can be applied and we can directly write down the derivative of the whole problem, namely

$$\begin{aligned} \frac{\mathrm{d}\mathcal {L}}{\mathrm{d} \vartheta } (x(\vartheta )) = \frac{\lambda b(1-\lambda \vartheta ^2)}{(1+\lambda \vartheta ^2)^2} (x(\vartheta )-\mathfrak {g}). \end{aligned}$$

For \(\vartheta = 0\), we consider the subdifferential

$$\begin{aligned} \frac{\mathrm{d}\mathcal {L}}{\mathrm{d} \vartheta } (x(\vartheta )) = [0,\lambda b (x(0)-\mathfrak {g})] , \end{aligned}$$

where the interval is to be read as \([\lambda b (x(0)-\mathfrak {g}), 0]\) if \(\lambda b (x(0)-\mathfrak {g})<0\).

7.1.2 Implicit Differentiation Approach (Sect. 4.1)

In order to apply this technique, we must smooth the lower level problem. Since we want to avoid solutions \(x^*(\vartheta )=0\), we introduce a log-barrier and replace the lower level problem by

$$\begin{aligned} f_\mu (x,\vartheta ) := \frac{\lambda }{2} (\vartheta x - b)^2 + \frac{1}{2} x^2 - \mu \log (x) \end{aligned}$$

for some small \(\mu >0\). Thus, we can drop the non-negativity constraint. To compute the gradient via the implicit differentiation formula (6), we minimize \(f_\mu \) with respect to x and compute the second derivatives (we abbreviate the x-derivative with \(f_\mu ^\prime \) and \(\vartheta \)-derivative with \(\partial _\vartheta f_\mu \))

$$\begin{aligned} \begin{aligned}&f_\mu ^\prime (x,\vartheta ) = \lambda \vartheta (\vartheta x- b) + x - \frac{\mu }{x}; \\&f_\mu ^{\prime \prime }(x,\vartheta ) = \lambda \vartheta ^2 + 1 + \frac{\mu }{x^2}; \\&\partial _\vartheta f^\prime _\mu (x,\vartheta ) = 2\lambda \vartheta x -\lambda b. \end{aligned} \end{aligned}$$
(18)

Then, (6) yields

$$\begin{aligned} \frac{\mathrm{d}\mathcal {L}}{\mathrm{d} \vartheta } (x^*(\vartheta )) = - (x^*(\vartheta )-\mathfrak {g})\, (f_\mu ^{\prime \prime }(x^*(\vartheta ),\vartheta ))^{-1}\, \partial _\vartheta f^\prime _\mu (x^*(\vartheta ),\vartheta ). \end{aligned}$$

This approach is denoted Smoothed-impl.

7.1.3 Algorithmic Differentiation Approach (Sect. 4.2)

We consider two algorithms: projected gradient descent and forward–backward splitting with Bregman proximity functions. Both algorithms are splitting methods that distribute the objective into a smooth function f and a non-smooth function g; for our example, this reads

$$\begin{aligned} f(x,\vartheta ) = \frac{\lambda }{2} (\vartheta x - b)^2 + \frac{1}{2} x^2\quad \text {and}\quad g(x) = \delta _{[x\ge 0]}(x). \end{aligned}$$

Projected gradient descent operates by a gradient descent step with respect to the smooth function f followed by a projection onto the (convex) set \([x\ge 0]\):

$$\begin{aligned} \begin{aligned} x^{(n+1)} =&{\text {proj}}_{[x\ge 0]} (x^{{(n)}} - \alpha f^\prime (x^{(n)},\vartheta ) ) \\ =&\max (0, x^{{(n)}} - \alpha f^\prime (x^{(n)},\vartheta ) ). \end{aligned} \end{aligned}$$
(19)

Note that the projection onto the convex set can also be interpreted as evaluating the proximity operator associated with the function g.

The second algorithm is obtained by changing the distance function for evaluating the proximity operator to the Bregman distance from Example 3. It results in

$$\begin{aligned} x^{(n+1)} = x^{(n)} \exp (-\alpha f^\prime (x^{(n)},\vartheta )). \end{aligned}$$
(20)

As we assume \(x^{(0)} \in [x> 0]\), the Bregman proximity function ensures that the iterates stay in the feasible set. Thus, the back-projection can be dropped.

To apply Algorithm 1 or 2, we need the derivatives of the update steps (19) and (20); they involve the second derivatives of \(f=f_\mu \) with \(\mu =0\), which are given in (18). Although (19) is not differentiable, it is differentiable almost everywhere, and in the experiment, we formally applied the chain rule and assigned an arbitrary subgradient wherever it is not unique, i.e.,

$$\begin{aligned} \frac{\partial {\text {proj}}_{[x\ge 0]}}{\partial x} (x,\vartheta ) = {\left\{ \begin{array}{ll} 0 ,&{}\quad \text {if } x{<} 0; \\ 1 ,&{}\quad \text {if } x{>}0; \\ {[}0,1] ,&{}\quad \text {if } x=0; \end{array}\right. } \end{aligned}$$

and \(\frac{\partial {\text {proj}}_{[x\ge 0]}}{\partial \vartheta } = 0\). This approach is denoted Proj.GD.

For (20), we use Algorithm 1 and obtain

$$\begin{aligned} \begin{aligned} \frac{\partial \mathcal {A}}{\partial \vartheta }(x^{(n)}, \vartheta ) =&-\alpha x^{(n)} \exp (-\alpha f^\prime (x^{(n)},\vartheta )) \frac{\partial f^{\prime }}{\partial \vartheta } (x^{(n)}, \vartheta )\\ \frac{\partial \mathcal {A}}{\partial x}(x^{(n)},\vartheta ) =&\exp (-\alpha f^\prime (x^{(n)},\vartheta )) \\&-\alpha x^{(n)}\exp (-\alpha f^\prime (x^{(n)},\vartheta ))f^{\prime \prime } (x^{(n)}, \vartheta ). \end{aligned} \end{aligned}$$

This approach is denoted Bregman-FB.
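The toy example is small enough to spell out completely. The following self-contained Python sketch runs the forward pass of the Bregman-FB update (20), accumulates the gradient by the reverse mode of Algorithm 1 with the derivatives above, and compares against the analytic gradient; the values of \(\lambda \), b, and the ground truth parameter are illustrative.

```python
import numpy as np

lam, b = 1.0, 2.0
theta_true = 0.7
g = max(0.0, lam * theta_true * b / (1 + lam * theta_true**2))  # ground truth

def fprime(x, theta):          # f'(x, theta), cf. (18) with mu = 0
    return lam * theta * (theta * x - b) + x

def fsecond(x, theta):         # f''(x, theta)
    return lam * theta**2 + 1.0

def dtheta_fprime(x, theta):   # d/dtheta of f'(x, theta)
    return 2 * lam * theta * x - lam * b

def bregman_fb_gradient(theta, alpha=0.1, iters=200):
    # forward pass: multiplicative update (20), storing all iterates
    xs = [1.0]
    for _ in range(iters):
        xs.append(xs[-1] * np.exp(-alpha * fprime(xs[-1], theta)))
    # reverse pass (Algorithm 1) with the partial derivatives of the update
    xi = xs[-1] - g            # dL/dx at the output x^(n+1)
    grad = 0.0
    for x in reversed(xs[:-1]):
        e = np.exp(-alpha * fprime(x, theta))
        grad += -alpha * x * e * dtheta_fprime(x, theta) * xi
        xi = (e - alpha * x * e * fsecond(x, theta)) * xi
    return grad

def analytic_gradient(theta):  # valid for theta != 0
    x = max(0.0, lam * theta * b / (1 + lam * theta**2))
    return lam * b * (1 - lam * theta**2) / (1 + lam * theta**2)**2 * (x - g)

for theta in (0.3, 1.5):
    print(theta, bregman_fb_gradient(theta), analytic_gradient(theta))
```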

7.1.4 Implicit Differentiation of the Fixed Point Equation (Sect. 4.3)

As explained above, differentiating the fixed point equation of an algorithm directly suggests two techniques. One is to apply Algorithm 1 to (20) but to evaluate all derivatives at the optimum (denoted Bregman-FB2). The other is to perform the numerical inversion as in (11) (denoted Bregman-FB-impl).

Fig. 3 Analytic tangents to the upper level objective function of (17) at \(\vartheta =0.3\) and \(\vartheta =0\). The function is non-smooth and, thus, at \(\vartheta =0\) there exist many tangent lines

8.2 Analysis of the 1D Example

In the experiments, we focus on the estimation of the gradient (cf. Fig. 3). Therefore, the step size parameters of the individual algorithms are chosen such that a comparable convergence of the lower level energy is achieved, where possible.

For Proj.GD, Bregman-FB, and Bregman-FB2, the chain rule must be applied recursively. We plot the change of the gradient accumulation along these back-iterations (of 200 forward iterations) in the bottom row of Fig. 4 and the energy evolution in the upper row of the same figure.

Fig. 4 The upper row shows the energy decrease along the forward iterations. The lower row shows the convergence to the respective gradient value along the back-iterations. On the left-hand side, the plot is generated with \(\vartheta =0.3\); on the right-hand side, with \(\vartheta =0\). The “-impl” methods do not appear in the bottom row, as no back-iterations are involved. For \(\vartheta =0\), due to the simple structure of the lower level problem, projected gradient descent converges exactly in one iteration and is therefore not shown. The gradient converges linearly to its final value, which means that often a few back-iterations are enough to achieve a gradient estimate of good quality

In this example, we can observe a linear decrease in the contribution to the respective final gradient value, which shows that back-iterations can be stopped after a few iterations without making large errors.

Interestingly, the approximations Bregman-FB2 and Proj.GD2 (the variant of Proj.GD with all derivatives evaluated at the optimum) work well: they show the same gradient accumulation as Bregman-FB and Proj.GD, respectively. This situation changes when the number of forward iterations is reduced. With about 15 forward iterations, a difference of order \(10^{-4}\) becomes visible (case \(\vartheta =0.3\)).

Fig. 5 Convergence of the numerical gradients towards the analytic gradient for \(\vartheta =0.3\). Row-wise, from left to right, the number of back-iterations is increased: 5, 10, 20, 50, 100, 200. More back-iterations lead to more accurate gradient estimates. The “-impl” methods always perform equally, as no back-iterations are required. Smoothed-impl performs worst due to the rough approximation. Our methods Bregman-FB, Bregman-FB2, and Bregman-FB-impl are the best and converge slightly better than Proj.GD and Proj.GD2

Fig. 6 Convergence of the numerical gradients towards the analytic gradient for \(\vartheta =0\). Row-wise, from left to right, the number of back-iterations is increased: 5, 10, 20, 50, 100, 200. All methods perform equally well, as they lie in the bright green area that indicates the range of the subdifferential (Color figure online)

Figures 5 and 6 address the convergence of the numerical gradient towards the analytic gradient for different approximation accuracies (varied via the number of back-iterations). Figure 5 shows the convergence for \(\vartheta =0.3\) and Fig. 6 for \(\vartheta =0\). Numerically, we observe convergence to the analytic gradients.

Surprisingly, all methods perform equally well in the case \(\vartheta =0\). The estimated gradient always lies in the subdifferential at this point. The range of the subdifferential is indicated in bright green in Fig. 6. While Proj.GD and Proj.GD2 estimate a gradient from the boundary of the subdifferential, the other methods estimate a subgradient from the interior. However, all of these values are feasible and belong to the analytic subdifferential.

9 Application to Multi-Label Segmentation

In this section, we show how the idea can be applied in practice. To this end, we introduce a multi-label segmentation model. We use a convolutional neural network (CNN) to parametrize the segmentation model. Alternatively, this construction can be thought of as having a segmentation model as the final stage of a deep neural network. In this setting, the bilevel problem amounts to finding the parameters of the CNN such that the loss on training data is minimized. The presented approach provides a generic way to train such systems in an end-to-end fashion.

9.1 Model

We are given a cost tensor \({\mathfrak {c}}\in {X}^{N_l}\), where \({X}=\mathbb {R}^{{N_x}{N_y}}\), which assigns to each pixel \((i,j)\) and each label \(k\), \(i=1,\ldots ,{N_x}\), \(j=1,\ldots ,{N_y}\), \(k=1,\ldots ,{N_l}\), a cost \({\mathfrak {c}}^k_{i,j}\) for the pixel taking label \(k\). We often identify \(\mathbb {R}^{{N_x}\times {N_y}}\) with \(\mathbb {R}^{{N_x}{N_y}}\) by \((i,j)\mapsto i+ (j-1){N_x}\) to simplify the notation. The sought segmentation \(u\in {X}_{[0,1]}^{N_l}\), where \({X}_{[0,1]}=[0,1]^{{N_x}{N_y}}\subset {X}\), is represented by a binary vector for each label. As a regularizer for a segment’s plausibility, we measure the boundary length using the total variation (TV). The discrete derivative operator \(\nabla :{X} \rightarrow {Y}\), where we use the shorthand \({Y}:={X}\times {X}\) (elements from \({Y}\) are considered as column vectors), is defined as

$$\begin{aligned} \begin{aligned} (\nabla u^k)_{i,j} :=&\begin{pmatrix} (\nabla u^k)_{i,j}^x \\ (\nabla u^k)_{i,j}^y \end{pmatrix} \in {Y}(= \mathbb {R}^{2{N_x}{N_y}}),\\ {\mathcal {D}}u:=&(\nabla u^1,\ldots , \nabla u^{N_l}), \\ (\nabla u^k)_{i,j}^x :=&{\left\{ \begin{array}{ll} u^k_{i+1,j} - u^k_{i,j}, &{} \text {if } 1\le i< {N_x}, 1\le j\le {N_y}\\ 0, &{} \text {if } i={N_x}, 1\le j\le {N_y}. \end{array}\right. } \end{aligned} \end{aligned}$$

\((\nabla u^k)_{i,j}^y\) is defined analogously. From now on, we work with the image as a vector indexed by \(\mathbf {i}=1,\ldots ,{N_x}{N_y}\). Let elements in \({Y}\) be indexed with \(\mathbf {j}=1,\ldots ,2{N_x}{N_y}\). Let the inner product in \({X}\) and \({Y}\) be given, for \(u^k,v^k\in {X}\) and \(p^k,q^k\in {Y}\), as

$$\begin{aligned} \begin{aligned} \left\langle u^k,v^k \right\rangle _{X}&:= \sum _{\mathbf {i}=1}^{{N_x}{N_y}} u^k_{\mathbf {i}} v^k_{\mathbf {i}}, \; \left\langle p^k,q^k \right\rangle _{Y}:= \sum _{\mathbf {j}=1}^{2{N_x}{N_y}} p^k_{\mathbf {j}} q^k_{\mathbf {j}},\\ \left\langle u,v \right\rangle _{{X}^{N_l}}&:= \sum _{k=1}^{N_l}\left\langle u^k,v^k \right\rangle _{X}, \; \left\langle p,q \right\rangle _{{Y}^{N_l}} := \sum _{k=1}^{N_l}\left\langle p^k,q^k \right\rangle _{Y}. \end{aligned} \end{aligned}$$

The (discrete, anisotropic) TV norm is given by

$$\begin{aligned} \Vert {\mathcal {D}}u \Vert _{1} := \sum _{k=1}^{N_l}\sum _{\mathbf {j}=1}^{2{N_x}{N_y}} \vert (\nabla u^k)_{\mathbf {j}} \vert , \end{aligned}$$

where \(\vert \cdot \vert \) is the absolute value. In the following, the variables \(\mathbf {i}=1,\ldots ,{N_x}{N_y}\) and \(\mathbf {j}=1,\ldots ,2{N_x}{N_y}\) always run over these index sets, thus we drop the specification; we adopt the same convention for \(k=1,\ldots ,{N_l}\). We define the pixel-wise non-negative unit simplex

$$\begin{aligned} \Delta ^{N_l} := \Big \{u\in {X}^{N_l} \,\Big \vert \, \forall (\mathbf {i},k):0\le u^k_{\mathbf {i}}\le 1 \text { and } \forall \mathbf {i}:\textstyle \sum _ku^k_\mathbf {i}= 1 \Big \}, \end{aligned}$$
(21)

and the pixel-wise (closed) \(\ell _\infty \)-unit ball around the origin

$$\begin{aligned} B^{\ell _\infty }_{1}(0) := \left\{ p\in {Y}^{N_l} \,\middle \vert \, \forall (\mathbf {j},k):\vert p^k_{\mathbf {j}} \vert \le 1 \right\} . \end{aligned}$$

Finally, the segmentation model reads

$$\begin{aligned} \min _{u\in {X}^{N_l}}\ \left\langle {\mathfrak {c}},u \right\rangle _{{X}^{N_l}} + \Vert {\mathcal {W}}{\mathcal {D}}u \Vert _{1},\quad s.t.\ u\in \Delta ^{N_l}, \end{aligned}$$
(22)

where we use a diagonal matrix \({\mathcal {W}}\) to support contrast-sensitive penalization of the boundary length.
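
To make the discrete setting concrete, the following NumPy sketch implements the forward-difference operator, the weighted anisotropic TV, and one common (here assumed, not necessarily the one used in our experiments) construction of contrast-sensitive weights; the array layout \((N_l, N_y, N_x)\) is an implementation choice:

```python
import numpy as np

def grad_op(u):
    # Forward differences; u: (Nl, Ny, Nx) -> (Nl, 2, Ny, Nx).
    gx = np.zeros_like(u)
    gy = np.zeros_like(u)
    gx[:, :, :-1] = u[:, :, 1:] - u[:, :, :-1]   # x-differences, zero at the border
    gy[:, :-1, :] = u[:, 1:, :] - u[:, :-1, :]   # y-differences, zero at the border
    return np.stack([gx, gy], axis=1)

def weighted_tv(u, W):
    # Weighted anisotropic TV ||W D u||_1; W: (2, Ny, Nx) broadcasts over labels.
    return np.abs(W * grad_op(u)).sum()

def contrast_weights(I, gamma=5.0):
    # Assumed contrast-sensitive choice: weights decay across strong image edges.
    g = grad_op(I[None])[0]                      # image gradient, I: (Ny, Nx)
    return np.exp(-gamma * np.abs(g))
```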

This model and the following reformulation as a saddle-point problem are well known (see, e.g., [8])

$$\begin{aligned} \min _{u\in {X}^{N_l}}&\max _{p\in {Y}^{N_l}}\ \left\langle {\mathcal {W}}{\mathcal {D}}u,p \right\rangle _{{Y}^{N_l}} + \left\langle u,{\mathfrak {c}} \right\rangle _{{X}^{N_l}}, \nonumber \\&s.t.\ u\in \Delta ^{N_l},\ p\in B^{\ell _\infty }_1(0). \end{aligned}$$
(23)

The saddle-point problem (23) can be solved using the ergodic primal–dual algorithm [9], which leads to an iterative algorithm with totally differentiable iterations. The primal update in (13) is discussed in Example 6 and the dual update of (13) is essentially Example 7. As a consequence, Algorithm 3 can be applied to estimate the derivatives. A detailed derivation of the individual steps of the algorithm can be found in [30].
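
Reusing grad_op from the previous sketch, one such iteration might look as follows. The entropy prox on the simplex yields the multiplicative primal update; for the dual update we use the Bregman function \(\varphi (p)=-\sqrt{1-p^2}\) on \([-1,1]\), which gives a smooth, interior update (our assumed analogue of Example 7); the step sizes tau, sigma and the over-relaxation follow the usual primal–dual pattern, and the ergodic variant additionally averages the iterates:

```python
def grad_op_T(p):
    # Adjoint of grad_op; p: (Nl, 2, Ny, Nx) -> (Nl, Ny, Nx).
    px, py = p[:, 0], p[:, 1]
    out = np.zeros_like(px)
    out[:, :, :-1] -= px[:, :, :-1]
    out[:, :, 1:] += px[:, :, :-1]
    out[:, :-1, :] -= py[:, :-1, :]
    out[:, 1:, :] += py[:, :-1, :]
    return out

def pd_iteration(u, p, u_bar, c, W, tau, sigma):
    # Dual: mirror step for the l_inf ball with phi(p) = -sqrt(1 - p^2).
    t = p / np.sqrt(1.0 - p ** 2) + sigma * W * grad_op(u_bar)
    p = t / np.sqrt(1.0 + t ** 2)               # smooth, stays in (-1, 1)
    # Primal: entropy prox step followed by normalization onto the simplex.
    u_new = u * np.exp(-tau * (c + grad_op_T(W * p)))
    u_new = u_new / u_new.sum(axis=0, keepdims=True)
    return u_new, p, 2.0 * u_new - u            # over-relaxation for u_bar
```

Every operation in this iteration is differentiable, which is precisely what makes the algorithmic differentiation of Algorithm 3 applicable.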

Fig. 7 Training error versus number of iterations of the algorithm solving the lower level problem. From left to right: average per-pixel loss, per-pixel accuracy, and time per outer iteration. The timing includes the forward pass as well as the gradient computations. Timings were taken on an NVIDIA GeForce Titan X GPU. A higher number of iterations clearly leads to a lower error but comes at the cost of a higher computational complexity

9.2 Parameter Learning

We consider (22) where the cost \({\mathfrak {c}}\) is given by the output of a CNN which takes as input an image \({\mathfrak {I}}\in X^{N_c}\) to be segmented and is defined via a set of weights \(\vartheta \). Formally, we have \({\mathfrak {c}}^k_{\mathbf {i}} = {\mathfrak {f}}^k_{\mathbf {i}}(\vartheta ,{\mathfrak {I}})\) with \({\mathfrak {f}}: \mathbb {R}^{N_\vartheta } \times X^{N_c}\rightarrow X^{N_l}\), where \({N_c}\) denotes the number of channels of the input image and \(N_\vartheta \) is the number of weights parametrizing the CNN.

The training set consists of \({N_T}\) images \({\mathfrak {I}}^{1},\ldots ,{\mathfrak {I}}^{{N_T}}\in {X}^{N_c}\) and their corresponding ground truth segmentations \({\mathfrak {g}}^1,\ldots ,{\mathfrak {g}}^{{N_T}} \in \{1,\ldots ,{N_l}\}^{{N_x}{N_y}}\).

Learning the parameters \(\vartheta \) of the CNN is cast as an instance of the general bilevel optimization problem (3):

$$\begin{aligned} \min _{\vartheta \in \mathbb {R}^{N_\vartheta }}&\sum _{t=1}^{N_T}\sum _{\mathbf {i}=1}^{{N_x}{N_y}}\log \Big (\sum _{k=1}^{N_l}\exp (u^k_\mathbf {i}(\vartheta ,{\mathfrak {I}}^t))\Big ) - u^{{\mathfrak {g}}^t_\mathbf {i}}_\mathbf {i}(\vartheta ,{\mathfrak {I}}^t)\nonumber \\&s.t.\ u(\vartheta ,{\mathfrak {I}}^t) = \arg \min _{u\in {X}^{N_l}} E(u, {\mathfrak {f}}(\vartheta ,{\mathfrak {I}}^t)), \end{aligned}$$
(24)

where the energy E in the lower level problem is given by (22) and the upper level objective is the softmax loss.
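
A sketch of the loss in (24), assuming 0-based integer labels and the vectorized layout:

```python
import numpy as np

def softmax_loss(u, gt):
    # u: (Nl, Npix) lower level output, gt: (Npix,) ground truth labels.
    m = u.max(axis=0)
    lse = m + np.log(np.exp(u - m).sum(axis=0))       # stable log-sum-exp
    return (lse - u[gt, np.arange(u.shape[1])]).sum()
```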

Remark 3

We could equivalently use a multinomial logistic loss, since \(u_\mathbf {i}(\vartheta , {\mathfrak {I}}^t)\) lies in the unit simplex by construction. We use this definition to allow for a simplified treatment of the case of training a CNN without the global segmentation model.

9.3 Experiments

We implemented our approach as a custom layer in the MatConvNet framework [39]. We used the Stanford Background dataset [20], which consists of 715 input images with pixel-accurate ground truth for the geometric classes sky, vertical, and horizontal. We used ADAM [22] for the minimization of the upper level problem. We found that plain stochastic gradient descent performs poorly in our setting, since the global segmentation model can lead to vanishing gradients.

In a first experiment, we used a small subset of 9 images from the dataset to show the influence of the number of iterations used to solve the lower level problem (22) on the training objective. We learned a small network consisting of four layers of alternating convolutions with a kernel width of 3 pixels and ReLU units followed by a fully connected layer. We added \(3\times 3\) max-pooling layers with a stride of two after the first and the second convolutional layers, which effectively downsamples the responses by a factor of 4. We added an up-convolutional layer to upsample the responses to the original image size. The penultimate layer of the CNN consists of a multiplicative scaling (analogous to a scalar smoothness parameter) of the CNN output followed by the global segmentation model (22). We ran ADAM with a learning rate of \(10^{-3}\) for a total of 1000 iterations with a mini-batch size of one image to learn the parameters of this network.
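
The experiments themselves were run in MatConvNet. Purely as an illustration of the described layer pattern, a hypothetical PyTorch re-creation could look as follows; the channel widths are assumptions, and only the layer sequence follows the text:

```python
import torch.nn as nn

N_l = 3  # geometric classes: sky, vertical, horizontal
cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),   # downsample x2
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),   # downsample x2
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, N_l, kernel_size=1),                  # "fully connected" head
    nn.ConvTranspose2d(N_l, N_l, kernel_size=8,
                       stride=4, padding=2),            # upsample back x4
)
# Its (scalar-scaled) output serves as the cost tensor c of the global
# segmentation model (22).
```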

Fig. 8 Example results from the test set. Row-wise, from left to right: input image, CNN, CNN+Global, ground truth. The global model is able to align results to edges and to correct spurious errors

Figure 7 shows the average per-pixel loss, the average pixel accuracy, and the time per ADAM iteration versus the number of iterations used to solve the lower level problem (inner iterations). The experiment shows that solving the lower level problem to higher accuracy enhances the overall capacity and thus the accuracy of the system. This comes at the price of a higher computational complexity, which increases linearly with the number of inner iterations.

Table 1 Accuracy on the Stanford Background dataset [20]

Finally, we performed a large-scale experiment on this dataset. We partitioned the images into a training set of 572 images and used the remaining 143 images for testing. We used the pre-trained Fully Convolutional Network FCN-32s [26] as a basis for this experiment. We adapted the geometry of the last two layers to this dataset and retrained the network. We then added a multiplicative scaling layer followed by the global segmentation model and refined the parameters. The number of inner iterations was set to 100, which provides a good trade-off between accuracy and computational complexity. We used a mini-batch size of 5 images and a learning rate of \(10^{-3}\).

The accuracy in terms of average pixel accuracy (Acc, in percent) and Intersection over Union (IoU) on both the training and test sets is shown in Table 1. We compare the plain Fully Convolutional Network FCN to the network with the additional global segmentation model FCN+Global. We observed an increase of 1.4 % in IoU on the test set when using the global model. This moderate gain can be attributed to the fact that the CNN alone already provides good but coarse segmentations, while the segmentation model uses only simple pairwise interactions; as such, it is unable to correct gross errors of the CNN.

Since the presented approach is applicable to a broad range of energies, training more expressive energies that include more complex interactions (cf. [41]) is a promising direction for future research. Example segmentations from the test set are shown in Fig. 8.

Remark 4

For a comparison to the smoothing approach from Sect. 4.1, we refer to the conference version [30].

10 Conclusion

We considered a class of bilevel optimization problems with non-smooth lower level problem. By an appropriate approximation, we can formulate an algorithm with a smooth update mapping that solves a non-smooth optimization problem in the lower level. This allows us to apply gradient-based methods for solving the bilevel optimization problem. A second approach directly considers the fixed point equation of the algorithm as an optimality condition for the lower level problem. Key for both ideas are Bregman proximity functions.

The idea of estimating gradients for an abstract algorithm was exemplified for a forward–backward splitting method and a primal–dual algorithm with Bregman proximity functions, and several potential applications were outlined. A toy example confirmed our results and provided additional intuition. The practical relevance of the approach was demonstrated by coupling a multi-label segmentation model with a convolutional neural network.

Several questions remain open, for example, the convergence of the sequence of gradients, or a full characterization of the class of optimization problems that admit algorithms with a smooth update mapping.