1 Introduction

We consider a constrained nonlinear system of equations

$$\begin{aligned} f(x) = 0 \qquad \text {s.t. } x\in C, \end{aligned}$$
(1)

where \(f:D \subset \mathbb {R}^d \rightarrow \mathbb {R}^n \) is a nonlinear differentiable function and \(C\subset D \subset \mathbb {R}^d\) is a nonempty closed convex set. Let \(S \subset C\) be the set of solutions of (1). Our aim is to design an iterative method which approximates a solution of (1) and in each step uses first-order information of just a single component function \(f_i\).

The idea of our method is as follows. Given an appropriate convex function \(\varphi :\mathbb {R}^d\rightarrow \mathbb {R}\cup \{+\infty \}\) with

$$\begin{aligned} \overline{\textrm{dom}\ \partial \varphi } = C, \end{aligned}$$
(2)

our method computes the Bregman projection w.r.t. \(\varphi \) onto the solution set of the local linearization of a component function \(f_i\) around the current iterate \(x_k\). Here, the underlying distance is the Bregman distance defined by

$$\begin{aligned} D_\varphi ^{x^*}(x,y) = \varphi (y) - \varphi (x) - \langle x^*,y-x\rangle , \end{aligned}$$

where \(x^*\) is a subgradient of \(\varphi \) at x. That is, the method we study is given by

$$\begin{aligned} x_{k+1} =&\arg \min _{x\in \mathbb {R}^d} D_\varphi ^{x_k^*}(x_k,x) \qquad \text {s.t. } x\in H_k, \end{aligned}$$
(3)

with

$$\begin{aligned} H_k:= \{x\in \mathbb {R}^d: f_{i_k}(x_k) + \langle \nabla f_{i_k}(x_k), x-x_k\rangle = 0\}, \end{aligned}$$
(4)

where \(i_k \in \{1,\ldots , n\}\) and \(x_k^*\) is an element of the subdifferential \(\partial \varphi (x_k)\). Since one can show that Bregman projections are always contained in \(\textrm{dom}\ \partial \varphi \), condition (2) guarantees that \(x_k\in C\) holds for all k and hence, if the \(x_k\) converge, their limit lies in C.

In order for the Bregman projection \(x_{k+1}\) to exist, we need that the hyperplanes \(H_{k}\) have nonempty intersection with \(\textrm{dom}\ \varphi \). Proposition 2.3 below will show that the slightly stronger condition

$$\begin{aligned} H_{k} \cap \textrm{ri}\ \textrm{dom}\ \varphi \ne \emptyset \end{aligned}$$
(5)

is sufficient for existence and uniqueness of the Bregman projection under regularity assumptions on \(\varphi \).

If (5) is violated, we propose to compute a relaxed projection, which is always defined and inspired by the recently proposed mSPS method [23].

1.1 Related work and our contributions

1.1.1 Nonlinear Kaczmarz method and Sparse Kaczmarz

In the pioneering work of Stefan Kaczmarz [35], the idea of solving a system of equations by cycling through its individual equations and solving them incrementally was first carried out for linear systems in finite-dimensional spaces; this approach has since become known as the Kaczmarz method. In this conceptually simple method, an update is computed by selecting one equation of the system according to a rule that may be random, cyclic or adaptive, and computing the orthogonal projection onto its solution set, which is a hyperplane.

Recently, two completely different extensions of the Kaczmarz method have been developed. One idea was to transfer the method to systems of nonlinear differentiable functions by considering their local linearizations: in each step k, an equation \(i_k\) is chosen and the update \(x_{k+1}\) is defined as the orthogonal projection

$$\begin{aligned} x_{k+1} = \mathop {\mathrm {\arg \!\min }}\limits \Vert x-x_k\Vert _2^2 \quad \text {s.t.} \quad f_{i_k}(x_k)+\langle \nabla f_{i_k}(x_k),x-x_k\rangle = 0. \end{aligned}$$

It is easy to check that this update can be computed by

$$\begin{aligned} x_{k+1} = x_k - \frac{f_{i_k}(x_k)}{\Vert \nabla f_{i_k}(x_k)\Vert _2^2}\nabla f_{i_k}(x_k). \end{aligned}$$

This method was studied under the names Sketched Newton-Raphson [62] or Nonlinear Kaczmarz method [58]. Convergence was shown for two kinds of mild nonlinearities, namely star convex functions [62] and functions which obey a local tangential cone condition [58].
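For illustration, a minimal Python sketch of this update is given below; the toy circle-and-line system and the cyclic index rule are our own illustrative choices and are not taken from [58, 62].

```python
import numpy as np

def nonlinear_kaczmarz(f, grad_f, x0, sweeps=100):
    """Iterate x_{k+1} = x_k - f_i(x_k)/||grad f_i(x_k)||_2^2 * grad f_i(x_k)."""
    x = np.array(x0, dtype=float)
    n = len(f)
    for k in range(sweeps * n):
        i = k % n                     # cyclic selection; a random choice also works
        g = grad_f[i](x)
        g2 = g @ g
        if g2 > 0:                    # skip components with vanishing gradient
            x -= (f[i](x) / g2) * g
    return x

# toy system: unit circle intersected with the diagonal line in R^2
f = [lambda x: x[0]**2 + x[1]**2 - 1.0, lambda x: x[0] - x[1]]
grad_f = [lambda x: np.array([2.0 * x[0], 2.0 * x[1]]),
          lambda x: np.array([1.0, -1.0])]
print(nonlinear_kaczmarz(f, grad_f, x0=[1.0, 0.2]))   # approx (0.7071, 0.7071)
```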

A different kind of extension of the Kaczmarz method has been proposed by [39]. Here, the notion of projection was replaced by the (more general) Bregman projection, giving rise to the ‘sparse’ Kaczmarz method, which can find sparse solutions of the system. The method has been further extended to inconsistent systems [54], accelerated by block averaging [57] and investigated as a regularization method in Banach spaces [33]. But so far only linear systems have been addressed.

The present article unifies these two generalizations: we study nonlinear systems together with general Bregman projections onto linearizations and derive convergence rates in the two aforementioned nonlinear settings. We also demonstrate that, beyond sparsity, the proposed method can handle simple constraints such as simplex constraints.

1.1.2 Stochastic Polyak step size (SPS)

One popular method for solving the finite-sum problem \(\min \frac{1}{n}\sum _{i=1}^n \ell _i(x)\) is stochastic gradient descent (SGD), which is defined by the update \(x_{k+1} = x_k - \gamma _k\nabla \ell _{i_k}(x_k)\). It remains a challenging question whether there exist good step size choices which are adaptive in the sense that no hyperparameter tuning is necessary. In this context, the stochastic Polyak step size (SPS)

$$\begin{aligned} \gamma _k = \frac{\ell _{i_k}(x_k)-\hat{\ell }_{i_k}}{c\cdot \Vert \nabla \ell _{i_k}(x_k)\Vert _2^2} \end{aligned}$$
(6)

was proposed in [38], where \(c>0\) is a fixed constant and \(\hat{\ell }_i = \inf \ell _i\). It was shown that the iterates of this method converge for convex, lower bounded functions \(\ell _i\) for which the interpolation condition holds, meaning that there exists \({\hat{x}}\in \mathbb {R}^d\) with \( \ell _i({\hat{x}}) = \hat{\ell }_i \) for all \(i=1,\ldots ,n\). This assumption is strong, but it is fulfilled, e.g., in modern machine learning applications such as non-parametric regression or over-parametrized deep neural networks [40, 64]. Our framework covers this setting as a special case by requiring that the functions \(f_i\) in (1) are nonnegative, which is achieved by setting \(f_i=\ell _i-\hat{\ell }_i\). The SPS method applied to \(\ell _1,...,\ell _n\) then coincides with the Nonlinear Kaczmarz method applied to \(f_1,...,f_n\).
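As a small illustration, the following Python sketch runs SGD with the step size (6) on interpolated least-squares losses \(\ell _i(x)=\tfrac{1}{2}(\langle a_i,x\rangle -b_i)^2\), for which \(\hat{\ell }_i=0\); the synthetic data and the choice \(c=1\) are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
x_true = rng.standard_normal(10)
b = A @ x_true                              # consistent data: interpolation holds

def sgd_sps(A, b, c=1.0, iters=3000):
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        i = rng.integers(len(b))
        r = A[i] @ x - b[i]
        loss, grad = 0.5 * r**2, r * A[i]   # l_i(x) and its gradient
        g2 = grad @ grad
        if g2 > 0:
            x -= loss / (c * g2) * grad     # step size (6) with hat{l}_i = 0
    return x

print(np.linalg.norm(sgd_sps(A, b) - x_true))   # should be close to zero
```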

1.1.3 Mirror descent and SPS

For incorporating additional constraints or attraction to sparse solutions into SGD, a well-known alternative to projected SGD is the stochastic mirror descent method (SMD) [3, 44, 65], which is defined by the update

$$\begin{aligned} x_{k+1} \in \mathop {\mathrm {\arg \!\min }}\limits _{x\in \mathbb {R}^d} \ \gamma _k \langle \nabla f_{i_k}(x_k),x-x_k\rangle + D_\varphi ^{x_k^*}(x_k,x). \end{aligned}$$

Here, \(\varphi \) is a convex function with additional properties to be specified later, called the distance generating function (DGF), \(x_k^*\) is a subgradient of \(\varphi \) at \(x_k\) and \(D_\varphi \) is the Bregman distance associated with \(\varphi \). We demonstrate that our proposed method can be reinterpreted as mirror descent with a novel adaptive step size in case the \(f_i\) are nonnegative. Moreover, for \(\varphi (x)=\frac{1}{2}\Vert x\Vert _2^2\), we recover the SGD method with the stochastic Polyak step size. For general \(\varphi \), computing the step size requires the solution of a one-dimensional convex minimization problem. This is similar to the situation in the update of the stochastic dual coordinate ascent method [56], a popular stochastic variance reduced method for minimizing regularized general linear models.
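For a concrete instance: with the negative entropy on the probability simplex as DGF (the setting revisited in Example 3.3 below), the SMD update has the closed multiplicative form sketched here; the fixed step size gamma is a generic placeholder and not the adaptive step size developed in this paper.

```python
import numpy as np

def smd_entropy_step(x, grad, gamma):
    """One SMD step with the negative-entropy DGF on the simplex:
    the minimizer of gamma*<grad, x - x_k> + D(x_k, x) is
    x_{k+1} proportional to x_k * exp(-gamma * grad)."""
    z = x * np.exp(-gamma * grad)
    return z / z.sum()

x = np.full(4, 0.25)                                   # uniform starting point
print(smd_entropy_step(x, np.array([1.0, 0.0, 0.0, 0.0]), gamma=0.5))
```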

The two recent independent works [23] and [60] propose to use the stochastic Polyak step size from SGD in mirror descent. This update has the advantage that it is relatively cheap to compute. However, we prove that for convex functions, our proposed method makes more progress towards the solutions of (1) in terms of Bregman distance.

We generalize the step size from [23] to the case in which the functions \(f_i\) are not necessarily nonnegative and employ this update as a relaxed projection whenever our iteration is not defined. In our convergence analysis and experiments, we compare the proposed method with the variant which always performs relaxed projections. As an additional contribution, we improve the analysis of the method from [23] for the case of smooth strongly convex functions \(f_i\) (Theorem 4.16).

Finally, our method is by definition scaling-invariant in the sense that a multiplicative change \(\varphi \mapsto \alpha \varphi \) of the DGF \(\varphi \) with a constant \(\alpha >0\) does not affect the method. To the best of our knowledge, this is the first mirror descent method which has this property.

1.1.4 Bregman projection methods

The idea of using Bregman projections algorithmically dates back to the seminal paper [11], which proposed to solve the feasibility problem

$$\begin{aligned} \text {find}\quad {\hat{x}} \in \bigcap _{i=1}^n C_i \end{aligned}$$

for convex sets \(C_i\) by iterated Bregman projections onto the sets \(C_i\). This idea initiated an active line of research [1, 4,5,6, 15,16,17,18, 36, 50] with applications in fields such as matrix theory [21], image processing [19, 20, 46] and optimal transport [9, 37]. We can view problem (1) as a feasibility problem by setting \(C_i=\{x\mid f_i(x)=0\}\). Our approach of computing Bregman projections onto linearizations has already been proposed for the case of convex inequalities \(C_i=\{x\mid f_i(x)\le 0\}\) under the name outer Bregman projections [13, 14]. Convergence of this method was studied in general Banach spaces. Obviously, the two problems coincide for convex nonnegative functions \(f_i\). However, to the best of our knowledge, convergence rates have only recently been given, and only for the case that the \(C_i\) are hyperplanes [36]. In this paper, we derive rates in the space \(\mathbb {R}^d\). Also, we extend our analysis to the nonconvex setting for equality constraints.

1.1.5 Bregman–Landweber methods

There are a couple of works in inverse problems, typically studied in Banach spaces, which already incorporate Bregman projections into first-order methods with the aim of finding sparse solutions. Bregman projections were combined with the nonlinear Landweber iteration for the first time in [55]. Later, [10] employed Bregman projections for \(L_1\)- and TV-regularization. A different nonlinear Landweber iteration with Bregman projections for sparse solutions of inverse problems was investigated in [41]. All of these methods use the full Jacobian Df(x) in each iteration. In [32, 34], a deterministic Kaczmarz method incorporating convex penalties was proposed which performs a mirror update similar to ours, but with a different step size which does not originate from a Bregman projection. The method that appears closest to ours was recently suggested in [28], where the step size is computed as the solution of a similar, yet different, optimization problem which is not motivated by a Bregman projection.

1.1.6 Sparse and Bregman–Newton methods

Finally, since our proposed method can be seen as a stochastic first-order Newton iteration, we briefly point out that a link of Newton’s method to topics like Bregman distances and sparsity has already been established in the literature. Iusem and Solodov [30] introduced a regularization of Newton’s method by a Bregman distance. Nesterov and Doikov [22] continued this work by introducing an additional nonsmooth convex regularizer. Polyak and Tremba [48] proposed a sparse Newton method which solves a minimum norm problem subject to the full Newton equation in each iteration, and presented an application in control theory [49].

1.2 Notation

For a set \(S\subset \mathbb {R}^d\), we write its interior as \(S^\circ \), its closure as \(\overline{S}\) and its relative interior as \(\textrm{ri}(S)\). The Cartesian product of sets \(S_i\subset \mathbb {R}^{d_i}\), \(i=1,...,m\), is written as \(S_1\times \dots \times S_m\). The set \(\textrm{span}(S)\) is the linear space generated by all elements of S. We denote by \(\mathbb {1}_d\) the vector in \(\mathbb {R}^d\) with constant entries 1. For two vectors \(x,y\in \mathbb {R}^d\), we express the componentwise (Hadamard) product as \(x\cdot y\) and the componentwise logarithm and exponential as \(\log (x)\) and \(\exp (x)\). For a given norm \(\Vert \cdot \Vert \) on \(\mathbb {R}^d\), by \(\Vert \cdot \Vert _*\) we denote the corresponding dual norm, which is given as

$$\begin{aligned} \Vert x\Vert _* = \sup _{\Vert y\Vert =1} \langle x,y\rangle , \qquad x\in \mathbb {R}^d. \end{aligned}$$

2 Basic notions and assumptions

We collect some basic notions and results as well as our standing assumptions for problem (1).

2.1 Convex analysis and standing assumptions

Let \(\varphi :\mathbb {R}^d\rightarrow \overline{\mathbb {R}}:=\mathbb {R}\cup \{+\infty \}\) be convex with

$$\begin{aligned} \textrm{dom}\ \varphi = \{x\in \mathbb {R}^d: \varphi (x)<\infty \} \ne \emptyset . \end{aligned}$$

We also assume that \(\varphi \) is lower semicontinuous, i.e. \(\varphi (x) \le \liminf _{y\rightarrow x} \varphi (y)\) holds for all \(x\in \mathbb {R}^d\), and supercoercive, meaning that

$$\begin{aligned} \lim _{\Vert x\Vert \rightarrow \infty } \frac{\varphi (x)}{\Vert x\Vert } = +\infty . \end{aligned}$$

The subdifferential at a point \(x\in \textrm{dom}\ \varphi \) is defined as

$$\begin{aligned} \partial \varphi (x) = \big \{ x^*\in \mathbb {R}^d: \varphi (x) + \langle x^*,y-x\rangle \le \varphi (y) \quad \text {for all } y\in \textrm{dom}\ \varphi \big \}. \end{aligned}$$

An element \(x^*\in \partial \varphi (x)\) is called a subgradient of \(\varphi \) at x. The set of all points x with \(\partial \varphi (x)\ne \emptyset \) is denoted by \(\textrm{dom}\ \partial \varphi \). Note that the relative interior of \(\textrm{dom}\ \varphi \) is a convex set, while \(\textrm{dom}\ \partial \varphi \) may not be convex; for a counterexample see [52, p. 218]. In general, convexity of \(\varphi \) guarantees the inclusions \(\textrm{ri}\ \textrm{dom}\ \varphi \subset \textrm{dom}\ \partial \varphi \subset \textrm{dom}\ \varphi \). For later purposes, we require that \( \textrm{dom}\ \partial \varphi = \textrm{ri}\ \textrm{dom}\ \varphi , \) which will be fulfilled in all our examples. We further assume that \(\varphi \) is essentially strictly convex, i.e. strictly convex on \(\textrm{ri}\ \textrm{dom}\ \varphi \). (In general, this property only means strict convexity on every convex subset of \(\textrm{dom}\ \partial \varphi \).) The convex conjugate (or Fenchel–Moreau conjugate) of \(\varphi \) is defined by

$$\begin{aligned} \varphi ^*(x^*) = \sup _{x\in \mathbb {R}^d} \langle x^*,x\rangle - \varphi (x), \qquad x^*\in \mathbb {R}^d. \end{aligned}$$

The function \(\varphi ^*\) is convex and lower semicontinuous. Moreover, since \(\varphi \) is essentially strictly convex and supercoercive, it holds that \(\textrm{dom}\ \varphi ^*=\mathbb {R}^d\) and \(\varphi ^*\) is differentiable, see [7, Proposition 14.15] and [52, Theorem 26.3].

The Bregman distance \(D_\varphi ^{x^*}(x,y)\) between \(x,y\in \textrm{dom}\ \varphi \) with respect to \(\varphi \) and a subgradient \(x^*\in \partial \varphi (x)\) is defined as

$$\begin{aligned} D_\varphi ^{x^*}(x,y) = \varphi (y) - \varphi (x) - \langle x^*,y-x\rangle . \end{aligned}$$

Using Fenchel’s equality \(\varphi ^*(x^*)=\langle x^*,x\rangle - \varphi (x)\) for \(x^*\in \partial \varphi (x)\), one can rewrite the Bregman distance with the conjugate function as

$$\begin{aligned} D_\varphi ^{x^*}(x,y) = \varphi ^*(x^*) - \langle x^*,y\rangle + \varphi (y). \end{aligned}$$
(7)

If \(\varphi \) is differentiable at x, then the subdifferential \(\partial \varphi (x)\) contains the single element \(\nabla \varphi (x)\) and we can write

$$\begin{aligned} D_\varphi (x,y):= D^{\nabla \varphi (x)}_\varphi (x,y) = \varphi (y) - \varphi (x) - \langle \nabla \varphi (x),y-x\rangle . \end{aligned}$$

The function \(\varphi \) is called \(\sigma \)-strongly convex w.r.t. a norm \(\Vert \cdot \Vert \) for some \(\sigma >0\), if for all \(x,y\in \textrm{dom}\ \partial \varphi \) and all \(x^*\in \partial \varphi (x)\) it holds that \(\frac{\sigma }{2}\Vert x-y\Vert ^2\le D_\varphi ^{x^*}(x,y)\).

In conclusion, we require the following standing assumptions for problem (1):

Assumption 1

  1. (i)

    The set C is nonempty, convex and closed.

  2. (ii)

    It holds that \(\varphi :\mathbb {R}^d\rightarrow \overline{\mathbb {R}}\) is essentially strictly convex, lower semicontinuous and supercoercive.

  3. (iii)

    The function \(\varphi \) fulfills that \(\overline{\textrm{dom}\ \partial \varphi } = C\) and \(\textrm{dom}\ \partial \varphi = \textrm{ri}\ \textrm{dom}\ \varphi \).

  4. (iv)

    For each \(x\in \textrm{dom}\ \varphi \) and each sequence \(x_k\in \textrm{dom}\ \partial \varphi \) with \(x_k^*\in \partial \varphi (x_k)\) and \(x_k\rightarrow x\) it holds that \(D_\varphi ^{x_k^*}(x_k,x)\rightarrow 0\).

  5. (v)

    The function \(f:D\rightarrow \mathbb {R}^n\) is continuously differentiable with \(D\supset C\).

  6. (vi)

    The set of solutions S of (1) is non-empty, that is \(S:= C\cap f^{-1}(0)\ne \emptyset \).

2.2 Bregman projections

Definition 2.1

Let \(E\subset \mathbb {R}^d\) be a nonempty convex set, \(x\in \textrm{dom}\ \partial \varphi \) and \(x^*\in \partial \varphi (x)\). Assume that \(E\cap \textrm{dom}\ \varphi \ne \emptyset \). The Bregman projection of x onto E with respect to \(\varphi \) and \(x^*\) is the point \(\Pi _{\varphi , E}^{x^*}(x)\in E\cap \textrm{dom}\ \varphi \) such that

$$\begin{aligned} D_\varphi ^{x^*}\big (x, \Pi _{\varphi , E}^{x^*}(x)\big ) = \min _{y\in E} D_\varphi ^{x^*}(x,y). \end{aligned}$$

If \(E\cap \textrm{dom}\ \varphi \ne \emptyset \), existence and uniqueness of the Bregman projection are guaranteed by our standing assumptions, since the function \(y\mapsto D_\varphi ^{x^*}\big (x, y\big )\) is bounded below by zero, coercive, lower semicontinuous and strictly convex. For the standard quadratic \(\varphi =\frac{1}{2}\Vert \cdot \Vert _2^2\), the Bregman projection is just the orthogonal projection. Note that if \(E\cap \textrm{dom}\ \varphi =\emptyset \), then for all \(y\in E\) it holds that \(D_\varphi ^{x^*}(x,y)=+\infty \).

The Bregman projection can be characterized by variational inequalities, as the following lemma shows.

Lemma 2.2

( [39]) A point \(z\in E\) is the Bregman projection of x onto E with respect to \(\varphi \) and \(x^*\in \partial \varphi (x)\) if and only if there exists \(z^*\in \partial \varphi (z)\) such that one of the following conditions is fulfilled:

  1. (i)

    \(\langle z^*-x^*,z-y\rangle \le 0 \qquad \text {for all } y\in E,\)

  2. (ii)

    \(D_\varphi ^{z^*}(z,y) \le D_\varphi ^{x^*}(x,y) - D_\varphi ^{x^*}(x,z) \qquad \text {for all } y\in E.\)

We consider Bregman projections onto hyperplanes

$$\begin{aligned} H(\alpha ,\beta ):= \{x\in \mathbb {R}^d: \langle \alpha ,x\rangle = \beta \}, \qquad \alpha \in \mathbb {R}^d, \ \beta \in \mathbb {R}, \end{aligned}$$

and halfspaces

$$\begin{aligned} H^\le (\alpha ,\beta ):= \{x\in \mathbb {R}^d: \langle \alpha ,x\rangle \le \beta \}, \qquad \alpha \in \mathbb {R}^d, \ \beta \in \mathbb {R}, \end{aligned}$$

and analogously we define \(H^\ge (\alpha ,\beta )\).

The following proposition shows that the Bregman projection onto a hyperplane can be computed by solving a one-dimensional dual problem under a constraint qualification. We formulate this one-dimensional dual problem under slightly more general assumptions than previous versions, e.g. we neither assume smoothness of \(\varphi \) (as e.g. [5, 6, 11, 17, 21]) nor strong convexity of \(\varphi \) (as in [39]).

Proposition 2.3

Let \(\varphi \) fulfill Assumption 1(ii). Let \(\alpha \in \mathbb {R}^d\setminus \{0\}\) and \(\beta \in \mathbb {R}\) such that

$$\begin{aligned} H(\alpha ,\beta )\cap \textrm{ri}\ \textrm{dom}\ \varphi \ne \emptyset . \end{aligned}$$

Then, for all \(x\in \textrm{dom}\ \partial \varphi \) and \(x^*\in \partial \varphi (x)\), the Bregman projection \(\Pi _{\varphi , H(\alpha ,\beta )}^{x^*}(x)\) exists and is unique. Moreover, the Bregman projection is given by

$$\begin{aligned} x_+:=\Pi _{\varphi , H(\alpha ,\beta )}^{x^*}(x) = \nabla \varphi ^*(x_+^*), \end{aligned}$$

where \(x_+^* = x^* - {\hat{t}}\alpha \in \partial \varphi (x_+)\) and \({\hat{t}}\) is a solution to

$$\begin{aligned} \min _{t\in \mathbb {R}} \varphi ^*(x^*-t\alpha ) + \beta t. \end{aligned}$$
(8)

Proof

The assumptions guarantee that \(\varphi ^*\) is finite and differentiable on the full space \(\mathbb {R}^d\). We already know that the Bregman projection \(x_+\) exists and is unique. Fermat’s condition applied to the projection problem \(\min _{y\in H(\alpha ,\beta )} D_\varphi ^{x^*}(x,y)\) states that

$$\begin{aligned} 0 \in \partial \big (D_\varphi ^{x^*}(x,\cdot ) + \iota _{H(\alpha ,\beta )} \big )(x_+), \end{aligned}$$

where the indicator function \(\iota _M:\mathbb {R}^d\rightarrow \overline{\mathbb {R}}\) is defined by

$$\begin{aligned} \iota _M(x) = {\left\{ \begin{array}{ll} 0, &{} x\in M, \\ +\infty , &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$

Applying subdifferential calculus [52, Theorem 23.8], where we make use of the fact that \(H(\alpha ,\beta )\) is a polyhedral set, we conclude that \(x_+\in \textrm{dom}\ \partial \varphi \) and

$$\begin{aligned} 0\in \partial \varphi (x_+) - x^* + \textrm{span}(\{\alpha \}), \end{aligned}$$

where we used that \(\partial \iota _{H(\alpha ,\beta )}(y) = \textrm{span}(\{\alpha \})\) holds for all \(y\in H(\alpha ,\beta )\). Using subgradient inversion \((\partial \varphi )^{-1} = \nabla \varphi ^*\), we arrive at the identity

$$\begin{aligned} x_+ = \nabla \varphi ^*(x^*-{\hat{t}}\alpha ) \end{aligned}$$

with some \({\hat{t}}\in \mathbb {R}\). Inserting this equation into the constraint \(\langle x_+,\alpha \rangle = \beta \), we conclude that \({\hat{t}}\) minimizes (8). \(\square \)
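Proposition 2.3 suggests a direct numerical recipe: minimize the one-dimensional dual objective (8) and map the result back through \(\nabla \varphi ^*\). The Python sketch below does this with a generic scalar minimizer and checks it against the closed-form orthogonal projection for \(\varphi =\frac{1}{2}\Vert \cdot \Vert _2^2\) (where \(\varphi ^*=\varphi \) and \(\nabla \varphi ^*\) is the identity); the choice of solver is an assumption of ours, not part of the proposition.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def bregman_project_hyperplane(conj, grad_conj, x_star, alpha, beta):
    """Bregman projection onto H(alpha, beta) via the dual problem (8):
    minimize t -> conj(x_star - t*alpha) + beta*t, then map back with grad_conj."""
    g = lambda t: conj(x_star - t * alpha) + beta * t
    t_hat = minimize_scalar(g).x
    return grad_conj(x_star - t_hat * alpha)

# check with phi = 0.5*||.||_2^2: the result must equal the orthogonal projection
conj = lambda p: 0.5 * p @ p
grad_conj = lambda p: p
x = np.array([1.0, 2.0, 3.0])              # here x_star = grad phi(x) = x
alpha, beta = np.ones(3), 1.0
x_plus = bregman_project_hyperplane(conj, grad_conj, x, alpha, beta)
print(x_plus, x - (alpha @ x - beta) / (alpha @ alpha) * alpha)   # should agree
```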

3 Realizations of the method

To solve problem (1), we propose the following method. In each step, we randomly pick a component equation \(f_{i_k}(x)=0\) and consider the set of zeros of its linearization around the current iterate \(x_k\). This set is just the hyperplane

$$\begin{aligned} H_k := \big \{x\in \mathbb {R}^d: f_{i_k}(x_k) + \langle \nabla f_{i_k}(x_k), x-x_k\rangle = 0\big \} = H(\nabla f_{i_k}(x_k), \beta _k), \end{aligned}$$

where

$$\begin{aligned} \beta _k = \langle \nabla f_{i_k}(x_k),x_k\rangle - f_{i_k}(x_k). \end{aligned}$$
(9)

For later purposes, we also consider the halfspace

$$\begin{aligned} H_k^\le := \{ x\in \mathbb {R}^d: f_{i_k}(x_k) + \langle \nabla f_{i_k}(x_k),x-x_k\rangle \le 0 \}. \end{aligned}$$

As the update \(x_{k+1}\), we now propose to take the Bregman projection of \(x_k\) onto the set \(H_k\) using Proposition 2.3, which is possible if

$$\begin{aligned} H_k \cap \textrm{dom}\ \partial \varphi \ne \emptyset . \end{aligned}$$
(10)

The update is then given by \(x_{k+1}^* = x_k^*-t_{k,\varphi }\nabla f_{i_k}(x_k)\) and \(x_{k+1}=\nabla \varphi ^*(x_{k+1}^*)\) with

$$\begin{aligned} t_{k,\varphi } \in \mathop {\mathrm {\arg \!\min }}\limits _{t\in \mathbb {R}} \varphi ^*(x_k^*-t\nabla f_{i_k}(x_k)) + \beta _k t. \end{aligned}$$
(11)

Note that, although the Bregman projection \(x_{k+1}\) is unique, \(t_{k,\varphi }\) might not be unique. If (10) is not fulfilled, we define an update inspired by [23] by setting \(x_{k+1} = \nabla \varphi ^*(x_k^*-t_{k,\sigma }\nabla f_{i_k}(x_k))\) with the Polyak-like step size

$$\begin{aligned} t_{k,\sigma } = \sigma \frac{f_{i_k}(x_k)}{\Vert \nabla f_{i_k}(x_k)\Vert _*^2} \end{aligned}$$
(12)

where \(\Vert \cdot \Vert _*\) is the dual norm of some norm \(\Vert \cdot \Vert \) and \(\sigma >0\) is a constant. We will refer to the resulting update as the relaxed projection. We note that it is always defined and gives a new point \(x_{k+1}\in \textrm{dom}\ \partial \varphi \). However, \(x_{k+1}\) does not lie in \(H_k\): indeed, if it were contained in \(H_k\cap \textrm{dom}\ \partial \varphi \), this would contradict the assumption that (10) is not fulfilled. In [23], the similar step size \(t = \sigma \frac{f_{i_k}(x_k) - \inf f_{i_k}}{c\Vert \nabla f_{i_k}(x_k)\Vert _*^2}\) with some constant \(c>0\) was proposed for minimization with mirror descent under the name mirror-stochastic Polyak step size (mSPS). Both the projection and the relaxed projection guarantee that \(x_{k+1}\in \textrm{dom}\ \varphi \) and deliver a new subgradient \(x_{k+1}^*\) for the next update. The steps are summarized in Algorithm 1 below.

Algorithm 1

Nonlinear Bregman–Kaczmarz (NBK) method

As an alternative method, we also consider the method which always chooses the step size \(t_{k,\sigma }\) from (12).

Algorithm 2

Relaxed Nonlinear Bregman–Kaczmarz (rNBK) method

Note that the problem (11) is convex and one-dimensional and can be solved with the bisection method, if \(\varphi ^*\) is a \(C^1\)-function, or (globalized) Newton methods, if \(\varphi ^*\) is a \(C^2\)-function, see Appendix A. In Examples 3.2, 3.3, 3.4 and 3.5, we show how to implement the steps of Algorithm 1 for typical constraints.
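To fix ideas, the following Python sketch performs one iteration of Algorithm 1 for a generic DGF, with a generic scalar minimizer standing in for the bisection or Newton solvers of Appendix A; the argument feasible is a user-supplied test for condition (10), and the whole function is a schematic rather than a reference implementation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nbk_step(x, x_star, i, f, grad_f, conj, grad_conj, feasible, sigma, dual_norm):
    """One NBK iteration: Bregman projection onto H_k if condition (10) holds,
    otherwise the relaxed step (12)."""
    fi, gi = f[i](x), grad_f[i](x)
    beta = gi @ x - fi                                          # offset (9)
    if feasible(gi, beta):                                      # condition (10)
        t = minimize_scalar(lambda s: conj(x_star - s * gi) + beta * s).x   # (11)
    else:
        t = sigma * fi / dual_norm(gi) ** 2                     # relaxed step (12)
    x_star = x_star - t * gi                                    # dual update
    return grad_conj(x_star), x_star                            # primal update
```

Algorithm 2 (rNBK) corresponds to always taking the relaxed branch; with \(\varphi =\frac{1}{2}\Vert \cdot \Vert _2^2\) and a feasibility test that returns true whenever the gradient is nonzero, this step reduces to the nonlinear Kaczmarz update of Example 3.2.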

Remark 3.1

(Choice of \(\sigma \) in Algorithm 1 and Algorithm 2) In this paper, we focus on the case that \(\varphi \) is a strongly convex function. In this setting, we propose to choose the parameter \(\sigma \) in Algorithm 1 and Algorithm 2 as the modulus of strong convexity, since in this case our theorems in Sect. 4 guarantee convergence.

Example 3.2

(Unconstrained case and sparse Kaczmarz) In the unconstrained case \(\textrm{dom}\ \varphi = \mathbb {R}^d\), condition (10) is always fulfilled whenever \(\nabla f_{i_k}(x_k)\ne 0\).

For \(\varphi (x) = \frac{1}{2}\Vert x\Vert _2^2\), we obtain back the nonlinear Kaczmarz method  [43, 58, 62]

$$\begin{aligned} x_{k+1} = x_k - \frac{f_{i_k}(x_k)}{\Vert \nabla f_{i_k}(x_k)\Vert _2^2}\nabla f_{i_k}(x_k). \end{aligned}$$

For the function

$$\begin{aligned} \varphi (x)=\lambda \Vert x\Vert _1 + \frac{1}{2} \Vert x\Vert _2^2, \end{aligned}$$
(13)

Assumptions 1(i-iv) are fulfilled and it holds that \(\varphi ^*(x) = \tfrac{1}{2} \Vert S_\lambda (x)\Vert _2^2\) with the soft-shrinkage function

$$\begin{aligned} S_\lambda (x) = {\left\{ \begin{array}{ll} x+\lambda , &{} x<-\lambda , \\ 0, &{} |x|\le \lambda , \\ x-\lambda , &{} x>\lambda \end{array}\right. } \end{aligned}$$

Hence, in this case lines 9, 11 and 12 of Algorithm 1 read

$$\begin{aligned}&\text {find } t_k\in \mathop {\mathrm {\arg \!\min }}\limits _{t\in \mathbb {R}} \beta _k t + \frac{1}{2} \Vert S_\lambda (x_k^*-t\nabla f_{i_k}(x_k))\Vert _2^2, \\&\text {update } x_{k+1}^* = x_k^* - t_k\nabla f_{i_k}(x_k), \\&\text {update } x_{k+1} = S_\lambda (x_{k+1}^*). \end{aligned}$$

For affine functions \(f_{i}(x) = \langle a^{(i)},x\rangle - b_{i}\) with \(a^{(i)}\in \mathbb {R}^d\), \(b_{i}\in \mathbb {R}\), Algorithm 1 has been studied under the name Sparse Kaczmarz method and converges to a sparse solution of the linear system \(f(x)=0\), see  [39, 53]. The update with \(t_k\) from (11) is also called the Exact step Sparse Kaczmarz method. The linesearch problem can be solved exactly with reasonable effort, as \(\varphi ^*\) is a continuous piecewise quadratic function with at most 2d discontinuities. The corresponding solver is based on a sorting procedure and has complexity \({\mathcal {O}}(d\log (d))\), see  [39] for details.
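A compact Python sketch of these three lines for affine \(f_i(x)=\langle a^{(i)},x\rangle -b_i\) is given below; for simplicity, the one-dimensional problem is minimized numerically instead of with the exact sorting-based procedure of [39], and the random test data are our own.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def soft(z, lam):
    """Componentwise soft-shrinkage S_lambda."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def sparse_kaczmarz(A, b, lam=1.0, sweeps=50, seed=0):
    rng = np.random.default_rng(seed)
    m, d = A.shape
    x_star = np.zeros(d)                 # x_0^* = 0, hence x_0 = S_lambda(0) = 0
    for _ in range(sweeps * m):
        i = rng.integers(m)
        a, beta = A[i], b[i]             # for affine f_i, beta_k = b_i
        obj = lambda t: beta * t + 0.5 * np.sum(soft(x_star - t * a, lam) ** 2)
        t = minimize_scalar(obj).x       # line 9
        x_star = x_star - t * a          # line 11
    return soft(x_star, lam)             # line 12: x_k = S_lambda(x_k^*)

A = np.random.default_rng(1).standard_normal((60, 100))
x_sparse = np.zeros(100); x_sparse[:5] = 5.0
x_rec = sparse_kaczmarz(A, A @ x_sparse)
print(np.linalg.norm(A @ x_rec - A @ x_sparse))   # residual of the linear system
```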

Example 3.3

(Simplex constraints) We consider the probability simplex

$$\begin{aligned} C = \Delta ^{d-1}:= \big \{x\in \mathbb {R}_{\ge 0}^d:\ \sum _{i=1}^d x_i=1\big \}. \end{aligned}$$

The restriction of the negative entropy function

$$\begin{aligned} {\varphi }(x) = {\left\{ \begin{array}{ll} \sum _{i=1}^d x_i \log (x_i), &{} x\in \Delta ^{d-1} , \\ +\infty , &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(14)

fulfills Assumption 1(i-iv) and is 1-strongly convex with respect to \(\Vert \cdot \Vert _1\) due to Pinsker’s inequality, see [25, 47] and [8, Example 5.27]. We have that

$$\begin{aligned} \textrm{dom}\ \partial \varphi = \textrm{ri}\ \Delta ^{d-1} = \big \{x\in \mathbb {R}_{> 0}^d:\ \sum _{i=1}^d x_i=1\big \} =: \Delta ^{d-1}_+. \end{aligned}$$

We can characterize condition (10) easily as follows: The hyperplane \(H(\alpha ,\beta )\) intersects \(\textrm{dom}\ \partial \varphi =\Delta ^{d-1}_+\) if and only if

  • \(\alpha = \beta \mathbb {1}_d\) or

  • there exist \(r,s\in \{1,..., d\}\) with \(\alpha _r<\beta <\alpha _s\).

This condition is quickly established by the intermediate value theorem and can be easily checked during the method. When verifying the condition in practice, in case of instabilities one may consider the restricted index set

$$\begin{aligned} \{i=1,...,d\mid |x_i|>\delta \} \end{aligned}$$

for some positive \(\delta \).

The Bregman distance induced by \(\varphi \) is the Kullback–Leibler divergence for probability vectors

$$\begin{aligned} D_{\varphi }(x,y) = \sum _{i=1}^d y_i\log \big (\frac{y_i}{x_i}\big ), \qquad x\in \Delta ^{d-1}_+, y\in \Delta ^{d-1}. \end{aligned}$$

The convex conjugate of \(\varphi \) is the log-sum-exp-function \( {\varphi }^*(p) = \log \big (\sum \limits _{i=1}^d e^{p_i}\big ). \) Since \(\varphi \) is differentiable, the steps of Algorithm 1 can be rewritten by substituting \(x_k^*\) by \(\nabla \varphi (x_k)=1+\log (x_k)\). Denoting the ith component of an iterate \(x_{l}\) by \(x_{l,i}\), lines 9, 11 and 12 of Algorithm 1 read

$$\begin{aligned} \text {find } t_k&\in \mathop {\mathrm {\arg \!\min }}\limits _{t\in \mathbb {R}} \beta _k t + \log \big (\sum _{i=1}^d x_{k,i} e^{-t\partial _i f_{i_k}(x_k)}\big ), \end{aligned}$$
(15)
$$\begin{aligned} x_{k+1}&= \frac{x_k \cdot e^{-t_k\nabla f_{i_k}(x_k)}}{\Vert x_k \cdot e^{-t_k\nabla f_{i_k}(x_k)}\Vert _1}, \end{aligned}$$
(16)

where multiplication and exponentiation of vectors are understood componentwise. The method (16) is known under the name exponentiated gradient descent or entropic mirror descent, provided that \(t_k\) is nonnegative. We claim that our proposed step size \(t_{k,\varphi }\) is new. Note that some convex polyhedra, such as \(\ell ^1\)-balls, can be transformed to \(\Delta ^{d'-1}\) for some \(d'\in \mathbb {N}\) by writing a point as a convex combination of certain extreme points [23].
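The following Python sketch carries out one such step for a single linear equation \(\langle a,x\rangle =b\) on the simplex; the data are illustrative, condition (10) holds for this choice of a and b, and the one-dimensional problem (15) is solved with a generic scalar minimizer.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def simplex_nbk_step(x, f_val, g):
    """One step of Algorithm 1 on the probability simplex: solve (15) for t_k,
    then apply the multiplicative update (16). Assumes condition (10) holds."""
    beta = g @ x - f_val
    obj = lambda t: beta * t + np.log(np.sum(x * np.exp(-t * g)))   # objective (15)
    t = minimize_scalar(obj).x
    z = x * np.exp(-t * g)               # a stable variant would shift the exponent
    return z / z.sum()

a, b = np.array([0.2, 1.0, 3.0]), 1.0    # equation <a, x> = b on the simplex
x = np.full(3, 1.0 / 3.0)
for _ in range(5):
    x = simplex_nbk_step(x, a @ x - b, a)
print(x, a @ x)                          # <a, x> should be close to b
```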

Example 3.4

(Cartesian products of constraints) For \(i\in \{1,...,m\}\), let \(\varphi _i\) be a DGF for \(C_i\subset D_i\subset \mathbb {R}^{d_i}\) fulfilling Assumption 1(i-iv) and let \(f:D_1\times \dots \times D_m\rightarrow \mathbb {R}^n\) fulfill Assumption 1(v-vi). Then

$$\begin{aligned} \varphi (x) = \sum _{i=1}^m \varphi _i(x_i), \quad x = (x_1, ..., x_m) \text { with } x_i\in \mathbb {R}^{d_i} \end{aligned}$$

is a DGF for \(C_1\times \dots \times C_m\) fulfilling Assumption 1(i-iv) with \(\varphi ^*(x^*) = \sum _{i=1}^m \varphi _i^*(x_i^*)\) and \(\textrm{dom}\ \partial \varphi = \textrm{dom}\ \partial \varphi _1\times \dots \times \textrm{dom}\ \partial \varphi _m\).

Denoting the ith component of an iterate \(x^{(*)}_l\) by \(x^{(*)}_{l,i}\), the lines 9, 11 and 12 of Algorithm 1 for \(i\in \{1,...,m\}\) read as follows:

$$\begin{aligned} \text {find } t_k&\in \mathop {\mathrm {\arg \!\min }}\limits _{t\in \mathbb {R}} \beta _k t + \sum _{i=1}^m \varphi _i^*\big (x_{k,i}^*-t\nabla _{i}f_{i_k}(x_k)\big ), \\ x_{k+1,i}^*&= x_{k,i}^* - t_k\nabla _i f_{i_k}(x_k) \qquad \text {for } i = 1,...,m, \\ x_{k+1,i}&= \nabla \varphi _i^*(x_{k+1,i}^*) \qquad \text {for } i = 1,...,m, \end{aligned}$$

where \(\nabla _i\) stands for the gradient w.r.t. the ith block of variables.

Finally, we give a suggestion which constant \(\sigma \) and which norm \(\Vert \cdot \Vert _*\) should be used in Algorithm 2 and in line 10 of Algorithm 1. Let us assume that \(\varphi _i\) is \(\sigma _i\)-strongly convex w.r.t. a norm \(\Vert \cdot \Vert _{(i)}\) on \(\mathbb {R}^{d_i}\). Then, the function \(\varphi \) is \(\sigma \)-strongly convex with \(\sigma =\min _{i=1,...,m}\sigma _i\) w.r.t. the mixed norm

$$\begin{aligned} \Vert u\Vert := \sqrt{\sum _{i=1}^m \Vert u_i\Vert _{(i)}^2}. \end{aligned}$$

Indeed, for all \(x,y\) with \(x_i,y_i\in \mathbb {R}^{d_i}\) it holds that

$$\begin{aligned} \frac{\sigma }{2}\Vert x-y\Vert ^2 = \frac{\sigma }{2}\sum _{i=1}^m \Vert x_i-y_i\Vert _{(i)}^2 \le \sum _{i=1}^m D_{\varphi _i}^{x_i^*}(x_i,y_i) = D_\varphi ^{x^*}(x,y). \end{aligned}$$

A quick calculation using the Cauchy–Schwarz inequality shows that the dual norm of \(\Vert \cdot \Vert \) is given by

$$\begin{aligned} \Vert u^*\Vert _* := \sqrt{\sum _{i=1}^m \Vert u_i^*\Vert _{(i,*)}^2}, \end{aligned}$$
(17)

where \(\Vert \cdot \Vert _{(i,*)}\) is the dual norm of \(\Vert \cdot \Vert _{(i)}\) on \(\mathbb {R}^{d_i}\). Hence, we recommend using Algorithm 2 with (17) and \(\sigma =\min \limits _{i=1,...,m}\sigma _i\), if \(\varphi _i\) is \(\sigma _i\)-strongly convex w.r.t. \(\Vert \cdot \Vert _{(i)}\).
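As a small helper, the mixed dual norm (17) and the recommended \(\sigma \) can be computed as follows; the two blocks and their norms in the example are hypothetical.

```python
import numpy as np

def mixed_dual_norm(blocks, block_dual_norms):
    """Mixed dual norm (17): square root of the sum of squared blockwise dual norms."""
    return np.sqrt(sum(dn(u) ** 2 for u, dn in zip(blocks, block_dual_norms)))

# hypothetical example: an entropy/simplex block (dual norm ||.||_inf, sigma_1 = 1)
# paired with a Euclidean block (dual norm ||.||_2, sigma_2 = 1)
u_star = [np.array([0.3, -0.7, 0.1]), np.array([1.0, 2.0])]
dual_norms = [lambda v: np.linalg.norm(v, np.inf), lambda v: np.linalg.norm(v, 2)]
sigma = min(1.0, 1.0)                    # sigma = min_i sigma_i
print(mixed_dual_norm(u_star, dual_norms), sigma)
```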

Example 3.5

(Two-fold Cartesian product of simplex constraints) As a particular instance of Example 3.4 we consider the 2-fold product of the probability simplex \(C_i=\Delta ^{d-1}\), \(i\in \{1,2\}\) with \(\varphi _i=\varphi \) from Example 3.3. The properties from Assumption 1 are inherited from the \(\varphi _i\). We denote the iterates of Algorithm 1 by \(x_k,y_k\in \Delta ^{d-1}\) and address their components by \(x_{k,i}, y_{k,i}\) for \(i=1,...,d\). Similar to Example 3.3, the steps of the method can be rewritten as

$$\begin{aligned} \text {find } t_k&\in \mathop {\mathrm {\arg \!\min }}\limits _{t\in \mathbb {R}} \beta _k t + \log \Big (\sum _{l=1}^d x_{k,l} e^{-t(\nabla _x f_{i_k}(x_k))_l}\Big ) + \log \Big (\sum _{l=1}^d y_{k,l} e^{-t(\nabla _y f_{i_k}(x_k))_l}\Big ), \nonumber \\ x_{k+1}&= \frac{x_k \cdot e^{-t_k\nabla _x f_{i_k}(x_k)}}{\Vert x_k \cdot e^{-t_k\nabla _x f_{i_k}(x_k)}\Vert _1}, \qquad y_{k+1} = \frac{y_k \cdot e^{-t_k\nabla _y f_{i_k}(y_k)}}{\Vert y_k \cdot e^{-t_k\nabla _y f_{i_k}(y_k)}\Vert _1}, \end{aligned}$$
(18)

where \(\nabla _x\) stands for the gradient w.r.t. x and \(\nabla _y\) for the gradient w.r.t. y. Also here, we can give a characterization of condition (10): for \(\alpha =(\alpha _1, \alpha _2)\) with \(\alpha _1, \alpha _2\in \mathbb {R}^d\) and \(\beta \in \mathbb {R}\), the hyperplane \(H(\alpha ,\beta )\) intersects \(\textrm{dom}\ \partial \varphi = \Delta _+^{d-1}\times \Delta _+^{d-1}\) if and only if for \((i,j)=(1,2)\) or \((i,j)=(2,1)\) one of the following conditions is fulfilled:

  • \(\alpha _i = c\mathbb {1}_d\) with some \(c\in \mathbb {R}\) and \(\alpha _j= (\beta -c)\mathbb {1}_d\),

  • \(\alpha _i = c\mathbb {1}_d\) with some \(c\in \mathbb {R}\) and there exist \(r, s\in \{1,...,d\}\) with \(\alpha _{j,r}< \beta - c < \alpha _{j,s}\) or

  • \(\big ] \min \alpha _i, \max \alpha _i \big [ \ \cap \ \big ] \beta - \max \alpha _j, \beta - \min \alpha _j \big [ \ \ne \ \emptyset \).

To prove this condition, we can invoke Proposition 2.3 which states that (10) is fulfilled if and only if the objective function g from (8) has a minimizer. Next, we note that, for each \(c\in \mathbb {R}\), the objective in (18) can be rewritten as

$$\begin{aligned} g(t)&= \log \Big (\sum _{l=1}^d x_{k,l} e^{-(\alpha _{1,l}-c)t}\Big ) + \log \Big (\sum _{l=1}^d y_{k,l} e^{(\beta -c-\alpha _{2,l})t}\Big ) \\&= \log \Big (\sum _{l=1}^d y_{k,l} e^{-(\alpha _{2,l}-c)t}\Big ) + \log \Big (\sum _{l=1}^d x_{k,l} e^{(\beta -c-\alpha _{1,l})t}\Big ) \end{aligned}$$

and a case-by-case analysis shows that g has a minimizer if and only if one of the above assertions is fulfilled. Note that this condition, too, can easily be checked during the method. We note again that, in case of instabilities, one may consider the restricted index set

$$\begin{aligned} \{i=1,...,d\mid |x_i|>\delta \ \text {and} \ |y_i|>\delta \} \end{aligned}$$

for some positive \(\delta \). As derived in Example 3.4, in Algorithm 2 and in line 10 of Algorithm 1 we use \(\sigma =1\) and \(\Vert u^*\Vert _* = \sqrt{\Vert u_1^*\Vert _\infty ^2 + \Vert u_2^*\Vert _\infty ^2}\).

4 Convergence

In this section, we carry out the convergence analysis of Algorithm 1.

First, we characterize the fixed points of Algorithm 1 and Algorithm 2 and provide the lemmas needed for the subsequent analysis. In Sect. 4.1, we prove that for nonnegative star-convex functions \(f_i\), condition (10) is always fulfilled and that the step size (11) is better than the relaxed step size (12) in the sense that it results in an iterate with a smaller Bregman distance to all solutions of (1). Finally, we present convergence results for Algorithm 1 in this setting. In Sect. 4.2, we prove convergence in a second setting, namely the case that the functions \(f_i\) fulfill a local tangential cone condition as in [34, 41, 58].

As a first result, we determine the fixed points of Algorithm 1. The proposition states in particular that, in the unconstrained case \(\textrm{dom}\ \varphi =\mathbb {R}^d\), fixed points are exactly the stationary points of the least-squares function \(\Vert f(x)\Vert _2^2\).

Proposition 4.1

Let Assumption 1 hold and let \(x\in \textrm{dom}\ \partial \varphi \) and \(x^*\in \partial \varphi (x)\). The pair \(\big (x,x^*\big )\) is a fixed point of Algorithm 1 if and only if for all \(i\in \{1,...,n\}\) it holds that \(f_i(x) = 0\) or \(\nabla f_i(x)=0\).

Proof

If \(f_i(x)=0\) or \(\nabla f_i(x)=0\) holds for all \(i\in \{1,...,n\}\), then \((x,x^*)\) is a fixed point by definition of the steps.

Next, we assume that \(x\in \textrm{dom}\ \partial \varphi \) is a fixed point of Algorithm 1 and \(\nabla f_i(x)\ne 0\). First, assume that condition (10) is not fulfilled, then the update for \(x^*\) shows that \(t_{k,\sigma }=0\), since \(\nabla f_i(x)\ne 0\), and hence, \(f_i(x)=0\). Finally, we assume that condition (10) holds. Then, from Proposition 2.3 we know that Algorithm 1 computes the Bregman projection \(x = \Pi ^{x^*}_{\varphi ,H}(x)\) with

$$\begin{aligned} H = \big \{ y\in \mathbb {R}^d: f_i(x) + \langle \nabla f_i(x),y-x\rangle = 0\big \}. \end{aligned}$$

But this means that \(x\in H\) and hence, \(f_i(x) = 0\) holds also in this case. \(\square \)

The following fact will be useful in the convergence analysis. It shows that Algorithm 1 performs a mirror descent step whenever \(f_i(x)>0\), and a mirror ascent step whenever \(f_i(x)<0\).

Lemma 4.2

Consider the kth iterate \(x_k\) of Algorithm 1 and consider the case that \(f_{i_k}(x_k)\ne 0\) and \(\nabla f_{i_k}(x_k)\ne 0\). Let Assumption 1 hold. Then, the step size \(t_k\) in Algorithm 1 fulfills

$$\begin{aligned} \textrm{sign}(t_k) = \textrm{sign}(f_{i_k}(x_k)). \end{aligned}$$

Proof

If condition (10) is not fulfilled, the assertion is clear by definition of the step size. Next, we assume that (10) holds. Then, the function

$$\begin{aligned} g_{i_k,x_k^*}(t) = \varphi ^*(x_k^*-t\nabla f_{i_k}(x_k)) + t\big (\langle \nabla f_{i_k}(x_k),x_k\rangle - f_{i_k}(x_k)\big ) \end{aligned}$$
(19)

is minimized by \(t_{k,\varphi }\). (Note that the expression is indeed fully determined by \(i_k\) and \(x_k^*\), since \(x_k=\nabla \varphi ^*(x_k^*)\).) We compute

$$\begin{aligned} g_{i_k,x_k^*}'(0) = -\langle \nabla f_{i_k}(x_k), \ \nabla \varphi ^*(x_k^*)\rangle + \langle \nabla f_{i_k}(x_k),x_k\rangle - f_{i_k}(x_k) = -f_{i_k}(x_k). \end{aligned}$$
(20)

Since \(g_{i_k,x_k^*}\) is convex, its derivative is monotonically increasing and it vanishes at \(t_{k,\varphi }\). Since it holds \(f_{i_k}(x_k)\ne 0\) by assumption, we conclude that \(t_{k,\varphi }\) and \(f_{i_k}(x_k)\) have the same sign. \(\square \)

To exploit strong convexity and smoothness, we will use the following.

Lemma 4.3

If \(\varphi :\mathbb {R}^d\rightarrow \mathbb {R}\) is proper, convex and lower semicontinuous, then the following statements are equivalent:

  1. (i)

    \(\varphi \) is \(\sigma \)-strongly convex w.r.t. \(\Vert \cdot \Vert \).

  2. (ii)

    For all \(x,y\in \mathbb {R}^d\) and \(x^*\in \partial \varphi (x)\), \(y^*\in \partial \varphi (y)\),

    $$\begin{aligned} \langle x^*-y^*, x-y\rangle \ge \sigma \Vert x-y\Vert ^2. \end{aligned}$$
  3. (iii)

    The function \(\varphi ^*\) is \(\tfrac{1}{\sigma }\)-smooth w.r.t. \(\Vert \cdot \Vert _*\).

Proof

See [63, Corollary 3.5.11 and Remark 3.5.3]. \(\square \)

Lemma 4.4

If \(\varphi :\mathbb {R}^d\rightarrow \mathbb {R}\) is convex and lower semicontinuous, then the following statements are equivalent:

  1. (i)

    \(\varphi \) is L-smooth w.r.t. a norm \(\Vert \cdot \Vert \),

  2. (ii)

    \(\varphi (y) \le \varphi (x) + \langle \nabla \varphi (x),y-x\rangle + \frac{L}{2} \Vert x-y\Vert ^2\) for all \(x,y\in \mathbb {R}^d\),

  3. (iii)

    \(\langle \nabla \varphi (y) - \nabla \varphi (x),y-x\rangle \le L\Vert x-y\Vert ^2\) for all \(x,y\in \mathbb {R}^d\).

Proof

See [63, Corollary 3.5.11 and Remark 3.5.3]. \(\square \)

4.1 Convergence for nonnegative star-convex functions

In this subsection we assume in addition that the functions \(f_i\) are either nonnegative and star-convex or affine.

Definition 4.5

([45]) Let \(f:D\rightarrow \mathbb {R}\) be differentiable. We say that f is star-convex if the set \(\mathop {\mathrm {\arg \!\min }}\limits f\) is nonempty and for all \(x\in D\) and \({\hat{x}}\in \mathop {\mathrm {\arg \!\min }}\limits f\) it holds that

$$\begin{aligned} f(x) + \langle \nabla f(x), {\hat{x}} - x \rangle \le f({\hat{x}}). \end{aligned}$$

Moreover, we call f strictly star-convex if the above inequality is strict, and \(\mu \)-strongly star-convex relative to \(\varphi \) if for all \(x\in D\), \(x^*\in \partial \varphi (x)\) and \({\hat{x}}\in \mathop {\mathrm {\arg \!\min }}\limits f\) it holds that

$$\begin{aligned} f(x) + \langle \nabla f(x), {\hat{x}} - x \rangle + \mu D_\varphi ^{x^*}(x,{\hat{x}}) \le f({\hat{x}}). \end{aligned}$$

We recall that the first assumption of nonnegativity and star-convexity covers two settings:

  • Minimizing a sum-of-terms

    $$\begin{aligned} \min \frac{1}{n} \sum _{i=1}^n f_i(x)\qquad \text {s.t. } x\in C, \end{aligned}$$
    (21)

    under the interpolation assumption

    $$\begin{aligned} \text { }\exists x: \quad x\in \bigcap _{i=1}^n \mathop {\mathrm {\arg \!\min }}\limits f_i|_C, \end{aligned}$$
    (22)

    where \(f_i\) is a star-convex function with known optimal value \({\hat{f}}_i\) on C. Under the interpolation assumption every point in the intersection on the right hand side of (22) is a solution to (21).

    Furthermore, we will construct a sequence which converges to this intersection point by applying Algorithm 1 to the nonnegative function \({\tilde{f}}\) where \({\tilde{f}}_i = f_i|_C - {\hat{f}}_i\). When \(n=1\), we cover the setting of mirror descent for the problem

    $$\begin{aligned} \min f(x) \quad \text {s.t. } x\in C \end{aligned}$$
    (23)

    with known optimal value \({\hat{f}}\).

  • Systems of nonlinear equations

    $$\begin{aligned} f(x) = 0 \qquad \text {s.t. } x\in C \end{aligned}$$

    with star-convex component functions \(f_i\), where we apply Algorithm 1 to \(f_i^+ = \max (f_i,0)\). Note that \(f_i^+\) fails to be differentiable only at points x with \(f_i(x)=0\), a case which is checked during the method anyway.

Precisely, we will use the following assumption.

Assumption 2

For each \(f_i\) one of the following conditions is fulfilled:

  1. (i)

    \(f_i\) is nonnegative and star-convex and it holds that \(f_i^{-1}(0) \cap \textrm{dom}\ \partial \varphi \ne \emptyset \),

  2. (ii)

    \(f_i\) is nonnegative and strictly star-convex or

  3. (iii)

    \(f_i\) is affine.

The first theorem states that Algorithm 1 always computes nonrelaxed Bregman projections under Assumption 2 outside of the fixed points.

Theorem 4.6

Let \(\big (x_k,x_k^*\big )\) be given by Algorithm 1 and consider the case that \(f_{i_k}(x_k)\ne 0\) and \(\nabla f_{i_k}(x_k)\ne 0\). Let Assumptions 1 and 2 hold true. Then, the hyperplane \(H_k\) separates \(x_k\) and \(f_{i_k}^{-1}(0)\), the condition

$$\begin{aligned} H_k \cap \textrm{dom}\ \partial \varphi \ne \emptyset \end{aligned}$$

holds and the Bregman projection of \(x_k\) onto \(H_k\) is defined, namely it holds that

$$\begin{aligned} x_{k+1}=\Pi _{\varphi ,H_k}^{x_k^*}(x_k). \end{aligned}$$

In particular, Algorithm 1 always chooses the step size \(t_k=t_{k,\varphi }\) from (11).

Proof

For \(x\in D\) we define the affine function

$$\begin{aligned} \ell _x(y) := f_{i_k}(x) + \langle \nabla f_{i_k}(x), y-x\rangle . \end{aligned}$$

We consider the cases (i) and (ii) from Assumption 2 first. Here we have that \(\ell _{x_k}(x_k)=f_{i_k}(x_k)> 0\) and for all \({\hat{x}}\in f_{i_k}^{-1}(0)\), star-convexity of \(f_{i_k}\) shows that \(\ell _{x_k}({\hat{x}}) \le 0\). This means that the hyperplane \(H_k\) separates \(x_k\) and \(f_{i_k}^{-1}(0)\). By the intermediate value theorem there exists \(x^{\lambda } = \lambda x_k + (1-\lambda ){\hat{x}}\) for some \(\lambda \in [0,1[\) such that \(\ell _{x_k}(x^{\lambda })=0\). Now let us assume that Assumption 2(i) holds, so we can choose \({\hat{x}}\in \textrm{dom}\ \partial \varphi \) with \(f_{i_k}({\hat{x}})=0\). By Assumption 1(iii) it holds that \(\textrm{dom}\ \partial \varphi =\textrm{ri}\ \textrm{dom}\ \varphi \), which is a convex set and hence, \(x^{\lambda }\in \textrm{dom}\ \partial \varphi \). This proves that the claimed condition (10) is fulfilled and the update in Algorithm 1 computes a Bregman projection onto \(H_k\) by Proposition 2.3. In case of Assumption 2(ii) we have that \(\ell _{x_k}({\hat{x}})<0\) and therefore it even holds that \(\lambda \in ]0,1[\). Assumption 1(iii) guarantees that \({\hat{x}} \in \overline{\textrm{dom}\ \varphi }\) and \(x_k\in \textrm{dom}\ \partial \varphi = \textrm{ri}\ \textrm{dom}\ \varphi \). Hence, we have \(x^{\lambda }\in \textrm{ri}\ \textrm{dom}\ \varphi \) by [52, Theorem 6.1], which again implies that \(x^{\lambda }\in \textrm{dom}\ \partial \varphi \) by Assumption 1(iii). This proves that condition (10) is fulfilled in this case, too.

Under Assumption 2(iii) it holds that \(\ell _x = f_{i_k}\) for all \(x\in D\). This already implies that \(H_k\) separates \(x_k\) and \(f_{i_k}^{-1}(0)\). Condition (10) is fulfilled by the assumption that \(f_{i_k}^{-1}(0)\ne \emptyset \) and so, the update computes the claimed Bregman projection by Proposition 2.3 also in this case. \(\square \)

As an immediate consequence, we see that Algorithm 1 is stable in terms of Bregman distance.

Corollary 4.7

If \({\hat{x}}\in S\) is a solution to (1) and the assumptions from Theorem 4.6 hold true, then it holds that

$$\begin{aligned} D_\varphi ^{x_{k+1}^*}(x_{k+1},{\hat{x}}) \le D_\varphi ^{x_k^*}(x_k,{\hat{x}}) - D_\varphi ^{x_k^*}(x_k,x_{k+1}). \end{aligned}$$

Proof

By Theorem 4.6, \(x_{k+1}\) is the Bregman projection of \(x_k\) onto \(H_{k}\) with respect to \(x_k^*\). If \(f_{i_k}(x_k)=0\), then \(x_k\) is a fixed point by Proposition 4.1 and the statement holds trivially. Next, assume that \(f_{i_k}(x_k)>0\). Then by Lemma 4.2, we have that \(t_k > 0\). As \(x_{k+1}\in H_k\), we have that \(\langle \nabla f_{i_k}(x_k), x_{k+1}-x_k\rangle =0\). We conclude for all \(y \in H_k^\le \) that

$$\begin{aligned} \langle x_{k+1}^*-x_k^*, x_{k+1}-y\rangle&= -t_k\langle \nabla f_{i_k}(x_k), x_{k+1}-y\rangle \\&= t_k\langle \nabla f_{i_k}(x_k), y-x_k\rangle -t_k\langle \nabla f_{i_k}(x_k), x_{k+1}-x_k\rangle \\&\le -t_k f_{i_k}(x_k) \\&\le 0, \end{aligned}$$

which by Lemma 2.2 shows that \(x_{k+1} = \Pi _{\varphi ,H^\le _k}^{x_k^*}(x_k)\). As \({\hat{x}}\in H^\le _k\), the claim follows from Lemma 2.2(ii). An analogous argument shows the claim in the case \(f_{i_k}(x_k)<0\). \(\square \)

Next, we prove that the exact Bregman projection moves the iterates closer to the solutions of (1) than the relaxed projection does, where closeness is measured in the sense of the used Bregman distance (see Theorem 4.11). To that end, for \((x_k,x_k^*)\) from Algorithm 1 we define an update with variable step size

$$\begin{aligned} x_{k+1}^*(t) = x_k^* - t\nabla f_{i_k}(x_k), \qquad x_{k+1}(t) = \nabla \varphi ^*(x_{k+1}^*(t)), \quad t\in \mathbb {R}. \end{aligned}$$
(24)

Lemma 4.8

Let \(\big (x_k,x_k^*\big )\) be given by Algorithm 1 and consider \((x_{k+1}(t),x_{k+1}^*(t))\) from (24) for some \(t\in \mathbb {R}\). Let Assumption 1 hold true. Then, for all \(x\in \mathbb {R}^d\) it holds that

$$\begin{aligned} D_\varphi ^{x_{k+1}^*(t)}(x_{k+1}(t),x)&= \varphi ^*(x_k^*-t\nabla f_{i_k}(x_k)) + t\beta _k + t\langle \nabla f_{i_k}(x_k), x-x_k\rangle \\&\hspace{1cm} + t f_{i_k}(x_k) - \langle x_k^*,x\rangle + \varphi (x). \end{aligned}$$

Proof

Rewriting the Bregman distance as in (7) shows that

$$\begin{aligned} D_\varphi ^{x_{k+1}^*(t)}(x_{k+1}(t),x)&= \varphi ^*(x_{k+1}^*(t)) - \langle x_{k+1}^*(t),x\rangle + \varphi (x) \\&= \varphi ^*(x_k^*-t\nabla f_{i_k}(x_k)) + t\langle \nabla f_{i_k}(x_k),x\rangle - \langle x_k^*,x\rangle + \varphi (x) \\&= \varphi ^*(x_k^*-t\nabla f_{i_k}(x_k)) + t\beta _k + t\langle \nabla f_{i_k}(x_k), x-x_k\rangle \\&\hspace{1cm} + t f_{i_k}(x_k) - \langle x_k^*,x\rangle + \varphi (x). \end{aligned}$$

\(\square \)

Proposition 4.9

Let \(\big (x_k,x_k^*\big )\) and \(t_k\) be given by Algorithm 1 and consider \(\big (x_{k+1}(t),x_{k+1}^*(t)\big )\) from (24) for some \(t\in \mathbb {R}\). Let Assumption 1 hold true. Then, for all \(x\in \mathbb {R}^d\) it holds that

$$\begin{aligned} D_\varphi ^{x_{k+1}^*}(x_{k+1},x) \le D_\varphi ^{x_{k+1}^*(t)}(x_{k+1}(t),x) + (t_k-t) \cdot \big ( f_{i_k}(x_k) + \langle \nabla f_{i_k}(x_k), x - x_k\rangle \big ). \end{aligned}$$

Proof

Note that \(t_k\) equals \(t_{k,\varphi }\) from (11), since condition (10) is fulfilled by Theorem 4.6. Hence, the optimality property (11) shows that for any t we have that

$$\begin{aligned} \varphi ^*(x_k^*-t_k\nabla f_{i_k}(x_k)) + t_k\beta _k \le \varphi ^*(x_k^*-t\nabla f_{i_k}(x_k)) + t\beta _k. \end{aligned}$$

Lemma 4.8 then shows that for any x it holds that

$$\begin{aligned} D_\varphi ^{x_{k+1}^*}(x_{k+1},x)&\le \varphi ^*(x_k^*-t\nabla f_{i_k}(x_k)) + t\beta _k + t_k\langle \nabla f_{i_k}(x_k), x-x_k\rangle + t_k f_{i_k}(x_k) \\&\hspace{1cm} - \langle x_k^*,x\rangle + \varphi (x). \end{aligned}$$

We use the definitions of \(\beta _k\), \(x_{k+1}(t)\) and \(x_{k+1}^*(t)\) and get

$$\begin{aligned} D_\varphi ^{x_{k+1}^*}(x_{k+1},x)&\le \varphi ^*(x_k^*-t\nabla f_{i_k}(x_k)) + t\beta _k + t_k\langle \nabla f_{i_k}(x_k), x-x_k\rangle + t_k f_{i_k}(x_k) \\&\hspace{1cm} - \langle x_k^*,x\rangle + \varphi (x) \\&= \varphi ^*(x_k^*-t\nabla f_{i_k}(x_k)) + t\langle \nabla f_{i_k}(x_k), x\rangle - \langle x_k^*,x\rangle + \varphi (x) \\&\hspace{1cm} + (t_k-t) \cdot \big ( f_{i_k}(x_k) + \langle \nabla f_{i_k}(x_k), x - x_k\rangle \big ) \\&= D_\varphi ^{x_{k+1}^*(t)}(x_{k+1}(t),x) + (t_k-t) \cdot \big ( f_{i_k}(x_k) + \langle \nabla f_{i_k}(x_k), x - x_k\rangle \big ). \end{aligned}$$

\(\square \)

In order to draw a conclusion from Proposition 4.9, we relate the step sizes \(t_{k,\varphi }\) from (11) and \(t_{k,\sigma }\) from (12). We already know by Lemma 4.2 that both step sizes have the same sign. The next lemma gives upper and lower bounds for \(t_{k,\varphi }\) with respect to \(t_{k,\sigma }\) under additional assumptions on \(\varphi \).

Lemma 4.10

Let \(\big (x_k,x_k^*\big )\) be the iterates from Algorithm 1 and let Assumption 1 hold true. We consider \(t_{k,\varphi }\) and \(t_{k,\sigma }\) from (11) and (12) and the function \(g_{i_k,x_k^*}\) from (19).

  1. (i)

    If \(\varphi \) is \(\sigma \)-strongly convex w.r.t. \(\Vert \cdot \Vert \), then \(g_{i_k,x_k^*}\) is \(\frac{\Vert \nabla f_{i_k}(x_k)\Vert _*^2}{\sigma }\)-smooth and

    $$\begin{aligned} |t_{k,\varphi }| \ge \sigma \frac{ |f_{i_k}(x_k)| }{\Vert \nabla f_{i_k}(x_k)\Vert _*^2} = |t_{k,\sigma }|. \end{aligned}$$
    (25)
  2. (ii)

    If \(\varphi \) is M-smooth w.r.t. \(\Vert \cdot \Vert \), then \(g_{i_k,x_k^*}\) is \(\frac{\Vert \nabla f_{i_k}(x_k)\Vert _*^2}{M}\)-strongly convex and

    $$\begin{aligned} |t_{k,\varphi }| \le M\frac{|f_{i_k}(x_k)|}{\Vert \nabla f_{i_k}(x_k)\Vert _*^2} = \frac{M}{\sigma } \cdot |t_{k,\sigma }|. \end{aligned}$$
    (26)

Proof

For \(s,t\in \mathbb {R}\) with \(s<t\) it holds that

$$\begin{aligned} g_{i_k,x_k^*}'(t)-g_{i_k,x_k^*}'(s)&= \langle \nabla \varphi ^*(x_k^*-t\nabla f_{i_k}(x_k)) -\nabla \varphi ^*(x_k^*-s\nabla f_{i_k}(x_k)), -\nabla f_{i_k}(x_k) \rangle \nonumber \\&= \frac{1}{(t-s)} \big \langle \nabla \varphi ^*(x_k^*-t\nabla f_{i_k}(x_k)) -\nabla \varphi ^*(x_k^*-s\nabla f_{i_k}(x_k)), \nonumber \\&\hspace{2.65cm} x_k^*-t\nabla f_{i_k}(x_k) - (x_k^*-s\nabla f_{i_k}(x_k)) \big \rangle . \end{aligned}$$
(27)
  1. (i)

    If \(\varphi \) is \(\sigma \)-strongly convex w.r.t. \(\Vert \cdot \Vert \), then \(\varphi ^*\) is \(\frac{1}{\sigma }\)-smooth w.r.t. \(\Vert \cdot \Vert _*\) by Lemma 4.3(iii). Hence, by Lemma 4.4(iii) we can estimate

    $$\begin{aligned} 0 \le g_{i_k,x_k^*}'(t)-g_{i_k,x_k^*}'(s) \le \frac{\Vert (t-s)\nabla f_{i_k}(x_k)\Vert _*^2}{\sigma \cdot (t-s)} = \frac{\Vert \nabla f_{i_k}(x_k)\Vert _*^2}{\sigma } \cdot (t-s), \end{aligned}$$

    which proves by the same lemma that \(g_{i_k,x_k^*}\) is \(\frac{\Vert \nabla f_{i_k}(x_k)\Vert _*^2}{\sigma }\)-smooth. Hence, (25) follows by choosing \(t=\max (t_{k,\varphi },0)\), \(s=\min (t_{k,\varphi },0)\) and inserting (20).

  2. (ii)

    Here, by Lemma 4.3 and the Fenchel-Moreau-identity \(\varphi =\varphi ^{**}\), the function \(\varphi ^*\) is \(\frac{1}{M}\)-strongly convex w.r.t. \(\Vert \cdot \Vert _*\). Using Lemma 4.3(ii) we can estimate

    $$\begin{aligned} g_{i_k,x_k^*}'(t)-g_{i_k,x_k^*}'(s) \ge \frac{\Vert (t-s)\nabla f_{i_k}(x_k)\Vert _*^2}{M\cdot (t-s)} = \frac{\Vert \nabla f_{i_k}(x_k)\Vert _*^2}{M} \cdot (t-s) \end{aligned}$$

    which shows that \(g_{i_k,x_k^*}\) is \(\frac{\Vert \nabla f_{i_k}(x_k)\Vert _*^2}{M}\)-strongly convex. Inequality (26) then follows as in (i).

\(\square \)

Theorem 4.11

Let \(\big (x_k,x_k^*\big )\) and \(t_k\) be given by Algorithm 1. Let \(t\in [0,t_k]\) and let \(\big (x_{k+1}(t),x_{k+1}^*(t)\big )\) be as in (24). Let Assumptions 1-2 hold true and assume that \(f_{i_k}(x_k)>0\). Then for every solution \({\hat{x}}\in S\) it holds that

$$\begin{aligned} D_\varphi ^{x_{k+1}^*}(x_{k+1}, {\hat{x}}) \le D_\varphi ^{x_{k+1}^*(t)}(x_{k+1}(t), {\hat{x}}). \end{aligned}$$

If \(\varphi \) is \(\sigma \)-strongly convex, the inequality holds as well for \(t=t_{k,\sigma }\).

Proof

We recall that \(t_k\) equals \(t_{k,\varphi }\) from (11), since condition (10) is fulfilled by Theorem 4.6. By Lemma 4.2 we have that \(t_k>0\). Since \(f_{i_k}\) is star-convex and \({\hat{x}}\in S\), it holds that

$$\begin{aligned} f_{i_k}(x_k) + \langle \nabla f_{i_k}(x_k),{\hat{x}}-x_k\rangle \le 0. \end{aligned}$$

The statement now follows from Proposition 4.9. The theorem applies in particular to \(t=t_{k,\sigma }\) if \(\varphi \) is \(\sigma \)-strongly convex, as Lemma 4.10(i) ensures that \(0<t_{k,\sigma }\le t_{k,\varphi }\). \(\square \)

For mirror descent or stochastic mirror descent under interpolation, Theorem 4.11 shows that choosing a step size smaller than \(t_{k,\varphi }\) results in an iterate which is at least as far from the solutions \({\hat{x}}\) of problem (1) in Bregman distance.

The following lemma is the key element of our convergence analysis.

Lemma 4.12

Let \(\big (x_k,x_k^*\big )\) be the iterates of either Algorithm 1 or Algorithm 2. Let Assumptions 1 and 2 hold true and assume that \(\varphi \) is \(\sigma \)-strongly convex w.r.t. a norm \(\Vert \cdot \Vert \). Then for every solution \({\hat{x}}\in S\) it holds that

$$\begin{aligned} D_\varphi ^{x_{k+1}^*}(x_{k+1},{\hat{x}}) \le D_\varphi ^{x_k^*}(x_k,{\hat{x}}) - \frac{\sigma }{2} \frac{\big (f_{i_k}(x_k)\big )^2}{\Vert \nabla f_{i_k}(x_k)\Vert _*^2}. \end{aligned}$$
(28)

Proof

We bound the right-hand side in Lemma 4.8 from above for \(t\in \{t_{k,\varphi },t_{k,\sigma }\}\). As \(\varphi \) is \(\sigma \)-strongly convex, \(\varphi ^*\) is \(\frac{1}{\sigma }\)-smooth by Lemma 4.3(iii). Hence, by Lemma 4.4(ii), for all \(t\in \mathbb {R}\) we can estimate that

$$\begin{aligned} \varphi ^*(&x_k^*-t\nabla f_{i_k}(x_k)) + t\beta _k \\&=\varphi ^*(x_k^*-t\nabla f_{i_k}(x_k)) + t\big (\langle \nabla f_{i_k}(x_k), x_k\rangle - f_{i_k}(x_k)\big ) \\&\le \varphi ^*(x_k^*) - t\langle \nabla \varphi ^*(x_k^*),\nabla f_{i_k}(x_k)\rangle + \frac{1}{2\sigma } t^2\Vert \nabla f_{i_k}(x_k)\Vert _*^2 \\&\hspace{0.5cm} + t\big (\langle \nabla f_{i_k}(x_k), x_k\rangle - f_{i_k}(x_k)\big ) \\&= \varphi ^*(x_k^*) - t f_{i_k}(x_k) + \frac{1}{2\sigma } t^2\Vert \nabla f_{i_k}(x_k)\Vert _*^2. \end{aligned}$$

Minimizing the right hand side over \(t\in \mathbb {R}\) gives \({\hat{t}} = \sigma \frac{ f_{i_k}(x_k)}{\Vert \nabla f_{i_k}(x_k)\Vert _*^2} = t_{k,\sigma }\) and

$$\begin{aligned} \varphi ^*(x_k^*) - {\hat{t}} f_{i_k}(x_k) + \frac{1}{2\sigma } {\hat{t}}^2\Vert \nabla f_{i_k}(x_k)\Vert _*^2 = \varphi ^*(x_k^*) - \frac{\sigma }{2} \frac{\big (f_{i_k}(x_k)\big )^2}{\Vert \nabla f_{i_k}(x_k)\Vert _*^2}. \end{aligned}$$
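Indeed, \({\hat{t}}\) can be read off from the first-order optimality condition of this quadratic function of t,

$$\begin{aligned} 0 = \frac{\textrm{d}}{\textrm{d}t}\Big ( \varphi ^*(x_k^*) - t f_{i_k}(x_k) + \frac{1}{2\sigma } t^2\Vert \nabla f_{i_k}(x_k)\Vert _*^2 \Big ) = -f_{i_k}(x_k) + \frac{t}{\sigma }\Vert \nabla f_{i_k}(x_k)\Vert _*^2. \end{aligned}$$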

By optimality of \(t_{k,\varphi }\) we have that

$$\begin{aligned} \varphi ^*(&x_k^*-t_{k,\varphi }\nabla f_{i_k}(x_k)) + t_{k,\varphi }\beta _k \le \varphi ^*(x_k^*-t\nabla f_{i_k}(x_k)) + t\beta _k \end{aligned}$$

for all \(t\in \mathbb {R}\). Hence, we have shown that

$$\begin{aligned} \varphi ^*(x_k^*-t\nabla f_{i_k}(x_k)) + t\beta _k \le \varphi ^*(x_k^*) - \frac{\sigma }{2} \frac{\big (f_{i_k}(x_k)\big )^2}{\Vert \nabla f_{i_k}(x_k)\Vert _*^2} \end{aligned}$$
(29)

holds for \(t\in \{t_{k,\sigma },t_{k,\varphi }\}\). If Assumption 2(i) or 2(ii) is fulfilled, Lemma 4.2 and star-convexity of \(f_{i_k}\) show that

$$\begin{aligned} t_k\big (f_{i_k}(x_k)+\langle \nabla f_{i_k}(x_k), {\hat{x}}-x_k\rangle \big ) \le 0. \end{aligned}$$
(30)

Under Assumption 2(iii), we have equality in (30). Inserting this inequality into Lemma 4.8, we obtain the claimed bound

$$\begin{aligned} D_\varphi ^{x_{k+1}^*}(x_{k+1},{\hat{x}})&\le \varphi ^*(x_k^*) - \frac{\sigma }{2} \frac{\big (f_{i_k}(x_k)\big )^2}{\Vert \nabla f_{i_k}(x_k)\Vert _*^2} - \langle x_k^*,{\hat{x}}\rangle + \varphi ({\hat{x}}) \\&= D_\varphi ^{x_k^*}(x_k,{\hat{x}}) - \frac{\sigma }{2} \frac{\big (f_{i_k}(x_k)\big )^2}{\Vert \nabla f_{i_k}(x_k)\Vert _*^2}, \end{aligned}$$

where we used (7) in the last step.

\(\square \)

We can now establish almost sure (a.s.) convergence of Algorithm 1. The expectations are always taken with respect to the random choice of the indices. Sometimes we also take conditional expectations conditioned on choices of indices in previous iterations, which we will indicate explicitly.

Theorem 4.13

Let Assumptions 1-2 hold true and assume that \(\varphi \) is \(\sigma \)-strongly convex w.r.t. a norm \(\Vert \cdot \Vert \). Then it holds that

$$\begin{aligned} \mathbb {E}\big [\sum _{i=1}^n p_i \big (f_i(x_k)\big )^2 \big ]\rightarrow 0 \quad \text {as } k\rightarrow \infty \end{aligned}$$

and we have the rate

$$\begin{aligned} \mathbb {E}\big [\min _{l=1,...,k} \sum _{i=1}^n p_i \big (f_i(x_l)\big )^2\big ] \le \frac{c}{\sigma k} \end{aligned}$$

with some constant \(c>0\). Moreover, the iterates \(x_k\) of Algorithm 1 converge a.s. to a random variable whose image is contained in the solution set S.

Proof

By \(\sigma \)-strong convexity of \(\varphi \) and Lemma 4.12, we have that

$$\begin{aligned} \frac{\sigma }{2}\Vert x_k-{\hat{x}}\Vert ^2 \le D_\varphi ^{x_k^*}(x_k,{\hat{x}}) \le D_\varphi ^{x_0^*}(x_0,{\hat{x}}) \end{aligned}$$

holds for all \(k\in \mathbb {N}\) and \({\hat{x}}\in S\). Hence, the sequence \(x_k\) is bounded and we have

$$\begin{aligned} \Vert \nabla f_{i_k}(x_k)\Vert _*^2 \le M \end{aligned}$$

with some constant \(M>0\). Inserting this into (28) gives that for all \(l\in \mathbb {N}\) it holds

$$\begin{aligned} D_\varphi ^{x_{l+1}^*}(x_{l+1},{\hat{x}})&\le D_\varphi ^{x_l^*}(x_l,{\hat{x}}) - \frac{\sigma }{2M} \big (f_{i_l}(x_l)\big )^2. \end{aligned}$$

Taking conditional expectation w.r.t. \(i_0,...,i_{l-1}\), we obtain

$$\begin{aligned} \mathbb {E}\big [D_\varphi ^{x_{l+1}^*}(x_{l+1},{\hat{x}}) \mid i_0,...,i_{l-1}\big ]&\le D_\varphi ^{x_l^*}(x_l,{\hat{x}})- \frac{\sigma }{2M} \sum _{i=1}^n p_i\big (f_{i}(x_l)\big )^2. \end{aligned}$$
(31)

By rearranging and using the tower property of conditional expectation, we conclude that

$$\begin{aligned} \mathbb {E}\big [ \sum _{i=1}^n p_i\big (f_{i}(x_l)\big )^2 \big ] \le \frac{2M}{\sigma } \big ( \mathbb {E}\big [ D_\varphi ^{x_l^*}(x_l,{\hat{x}}) \big ] - \mathbb {E}\big [ D_\varphi ^{x_{l+1}^*}(x_{l+1},{\hat{x}}) \big ] \big ). \end{aligned}$$

The convergence rate now follows with \(c=2\cdot M\cdot D_{\varphi }^{x_0^*}(x_0,{\hat{x}})\) for any \({\hat{x}}\in S\) by averaging over \(l=0,...,k\) and telescoping.

Next, we prove the a.s. iterate convergence. Using (31) gives that

$$\begin{aligned} \mathbb {E}\big [D_\varphi ^{x_{k+1}^*}(x_{k+1},{\hat{x}}) \mid i_0,...,i_{k-1}\big ] \le D_\varphi ^{x_k^*}(x_{k},{\hat{x}}) - \frac{\sigma \cdot \min _i p_i}{2M} \cdot \Vert f(x_k)\Vert _2^2. \end{aligned}$$

The Robbins-Siegmund lemma [51] then shows that \(f(x_k)\rightarrow 0\) holds with probability 1. Along any sample path in \(\{f(x_k)\rightarrow 0\}\), boundedness of the sequence \(x_k\) yields a subsequence \(x_{k_l}\) converging to some point x. By continuity, we have \(f(x)=0\) and hence \(x\in S\). Due to Assumption 1(iv), it holds that \(D_\varphi ^{x_{k_l}^*}(x_{k_l}, x)\rightarrow 0\) and, since \(D_\varphi ^{x_k^*}(x_k,{\hat{x}})\) is decreasing in k for every \({\hat{x}}\in S\) by Lemma 4.12, we conclude that \(D_\varphi ^{x_{k}^*}(x_{k}, x)\rightarrow 0.\) Finally, strong convexity of \(\varphi \) implies that \(x_k\rightarrow x\). \(\square \)

If the functions \(f_i\) have Lipschitz continuous gradients, we can derive a sublinear rate for the \(\ell _1\)-type loss \(\mathbb {E}\big [\min _{l=1,...,k} \sum _{i=1}^n p_i f_i(x_l)\big ]\), which coincides with the rate in [23, Theorem 4]. Note that without this assumption, Theorem 4.13 combined with Jensen’s inequality only gives the asymptotically slower rate

$$\begin{aligned} \mathbb {E}\big [\min _{l=1,...,k} \sum _{i=1}^n p_i f_i(x_l)\big ] \le \mathbb {E}\big [\frac{1}{k} \sum _{l=1}^k \sum _{i=1}^n p_i f_i(x_l)\big ] \le \frac{c}{\sqrt{k}} \end{aligned}$$

for some constant \(c>0\). We will need the following lemma.

Lemma 4.14

[38, Lemma 3] Let \(\varphi \) be \(\sigma \)-strongly convex w.r.t. \(\Vert \cdot \Vert \). Moreover, let the functions \(f_i\) be L-smooth w.r.t. \(\Vert \cdot \Vert \). Then it holds that

$$\begin{aligned} t_{k,\sigma } \ge \frac{\sigma }{2L}.\end{aligned}$$

Theorem 4.15

Let Assumptions 1-2 hold true and assume that \(\varphi \) is \(\sigma \)-strongly convex w.r.t. a norm \(\Vert \cdot \Vert \) and all functions \(f_i\) are L-smooth w.r.t. \(\Vert \cdot \Vert \). Then the iterates \(x_k\) of Algorithm 1 fulfill that

$$\begin{aligned} \mathbb {E}\big [\min _{l=1,...,k} \sum _{i=1}^n p_i f_i(x_l)\big ] \le \frac{4L}{\sigma k} \cdot \inf _{{\hat{x}}\in S} D^{x_0^*}_\varphi (x_0,{\hat{x}}). \end{aligned}$$

Proof

Combining Lemma 4.12 and Lemma 4.14 yields that

$$\begin{aligned} D_\varphi ^{x_{k+1}^*}(x_{k+1},{\hat{x}}) \le D_\varphi ^{x_k^*}(x_k,{\hat{x}}) - \frac{1}{2} f_{i_k}(x_k) \cdot t_{k,\sigma } \le D_\varphi ^{x_k^*}(x_k,{\hat{x}}) - \frac{\sigma }{4L} f_{i_k}(x_k). \end{aligned}$$

The assertion now follows as in the proof of Theorem 4.13 by taking expectation and telescoping. \(\square \)

For strongly star-convex functions \(f_i\), we can prove a linear convergence rate, where we recover the contraction factor from [23, Theorem 3]. Moreover, we can even improve this factor for smooth \(\varphi \).

Theorem 4.16

Let Assumptions 1-2 hold true and assume that \(\varphi \) is \(\sigma \)-strongly convex w.r.t. a norm \(\Vert \cdot \Vert \) and all functions \(f_i\) are L-smooth w.r.t. \(\Vert \cdot \Vert \). Moreover, assume that \(\overline{f}:=\sum _{i=1}^n p_if_i\) is \(\mu \)-strongly star-convex w.r.t. S relative to \(\varphi \). Then there exists an element \({\hat{x}} \in S\) such that the iterates \(x_k\) of Algorithm 1 converge to \({\hat{x}}\) at the rate

$$\begin{aligned} \mathbb {E}\big [D_\varphi ^{x_{k+1}^*}(x_{k+1},{\hat{x}})\big ] \le \big (1-\frac{\mu \sigma }{2L}\big ) \mathbb {E}\big [ D_\varphi ^{x_k^*}(x_k,{\hat{x}}) \big ] - \frac{\sigma }{4L}\mathbb {E}\big [\overline{f}(x_k)\big ]. \end{aligned}$$

Moreover, if \(\varphi \) is M-smooth, it holds that

$$\begin{aligned} \mathbb {E}\big [D_\varphi ^{x_{k+1}^*}(x_{k+1},{\hat{x}})\big ] \le \big (1-\frac{\mu \sigma }{2L}-\frac{\mu \sigma ^2}{4LM}\big ) \mathbb {E}\big [D_\varphi ^{x_k^*}(x_k,{\hat{x}})\big ]. \end{aligned}$$

Proof

By Lemma 4.14 we have that

$$\begin{aligned} t_k\big (\langle \nabla f_{i_k}(x_k),{\hat{x}}-x_k\rangle + f_{i_k}(x_k)\big ) \le \frac{\sigma }{2L}\big (\langle \nabla f_{i_k}(x_k),{\hat{x}}-x_k\rangle + f_{i_k}(x_k)\big ). \end{aligned}$$

Taking expectation and using the assumption of relative strong convexity, we obtain that

$$\begin{aligned} \mathbb {E}\big [ t_k\big (\langle \nabla f_{i_k}(x_k),{\hat{x}}-x_k\rangle + f_{i_k}(x_k)\big ) \big ]&\le - \frac{\sigma }{2L} \mathbb {E}\big [{\overline{f}}({\hat{x}})-{\overline{f}}(x_k)-\langle \nabla {\overline{f}}(x_k),{\hat{x}}-x_k\rangle \big ] \\&\le -\frac{\mu \sigma }{2L}\mathbb {E}\big [D_\varphi ^{x_k^*}(x_k,{\hat{x}})\big ]. \end{aligned}$$

The first convergence rate then follows by the steps in Lemma 4.12 and Theorem 4.15, replacing (30) by the above inequality. Finally, let \(\varphi \) be additionally M-smooth. Using that \(\nabla \overline{f}({\hat{x}})=0\), we can further bound

$$\begin{aligned} \overline{f}(x_k)&= \overline{f}(x_k) - \overline{f}({\hat{x}}) - \langle \nabla \overline{f}({\hat{x}}), x_k-{\hat{x}}\rangle \\&\ge \mu D_\varphi ({\hat{x}},x_k) \ge \frac{\mu \sigma }{2}\Vert x_k-{\hat{x}}\Vert ^2\ge \frac{\mu \sigma }{M}D_\varphi ^{x_k^*}(x_k,{\hat{x}}). \end{aligned}$$

\(\square \)

Since the proofs of Theorems 4.13, 4.15 and 4.16 rely on Lemma 4.12, which covers both algorithms, these theorems also hold for Algorithm 2.

4.2 Convergence under the local tangential cone condition

Inspired by [58], we consider functions fulfilling the so-called tangential cone condition, which was introduced in [29] as a sufficient condition for convergence of the Landweber iteration for solving ill-posed nonlinear problems.

Definition 4.17

A differentiable function \(f:D\rightarrow \mathbb {R}\) fulfills the local tangential cone condition (\(\eta \)-TCC) on \(U\subset D\) with constant \(0<\eta <1\), if for all \(x,y\in U\) it holds that

$$\begin{aligned} |f(x)+\langle \nabla f(x),y-x\rangle -f(y)| \le \eta |f(x)-f(y)|. \end{aligned}$$
(32)

Note that affine functions fulfill (32) for every \(\eta \), since the left-hand side vanishes; the condition thus quantifies how much \(f\) may deviate from its local linearizations. Under this condition, we are able to formulate a variant of Lemma 4.12 and derive corresponding convergence rates. More precisely, we will assume the following.

Assumption 3

There exist a point \({\hat{x}}\in S\) and constants \(\eta \in ]0,1[\) and \(r>0\) such that each function \(f_i\) fulfills the \(\eta \)-TCC (32) on

$$\begin{aligned} B_{r,\varphi }({\hat{x}}) := \big \{x\in C: D_\varphi ^{x^*}(x,{\hat{x}}) \le r \quad \text {for all } x^*\in \partial \varphi (x) \big \}. \end{aligned}$$

Lemma 4.18

Let Assumptions 1 and 3 hold true and assume that \(\varphi \) is \(\sigma \)-strongly convex w.r.t. a norm \(\Vert \cdot \Vert \). Let \({\hat{x}}\in S\). Then, the iterates of Algorithm 1 fulfill

$$\begin{aligned} D^{x_{k+1}^*}_\varphi (x_{k+1},{\hat{x}}) \le D^{x_k^*}_\varphi (x_k,{\hat{x}}) - \tau \frac{\big (f_{i_k}(x_k)\big )^2}{\Vert \nabla f_{i_k}(x_k)\Vert _*^2}, \end{aligned}$$

if one of the following conditions is fulfilled:

  1. (i)

    \(t_k=t_{k,\sigma }\), \(\eta <\frac{1}{2}\) and \(\tau = \sigma \big (\frac{1}{2}-\eta \big )\),

  2. (ii)

    \(t_k=t_{k,\varphi }\), \(\varphi \) is additionally M-smooth w.r.t. \(\Vert \cdot \Vert \), \(\eta < \frac{\sigma }{2\,M}\) and \(\tau = \sigma \big (\frac{1}{2}-\eta \frac{M}{\sigma }\big ).\)

In particular, if \(x_0\in B_{r,\varphi }({\hat{x}})\), then in both cases we have that \(x_k\in B_{r,\varphi }({\hat{x}})\) for all \(k\in \mathbb {N}\).

Proof

For (i), by definition of \(t_{k,\sigma }\) and \(\eta \)-TCC we have that

$$\begin{aligned} t_{k,\sigma } \big ( f_{i_k}(x_k) + \langle \nabla f_{i_k}(x_k),{\hat{x}}-x_k\rangle \big ) \le \eta \sigma \frac{\big (f_{i_k}(x_k)\big )^2}{\Vert \nabla f_{i_k}(x_k)\Vert _*^2}. \end{aligned}$$

The claimed inequality then follows by the steps in the proof of Lemma 4.12, replacing (30) with the above estimate. For (ii), using Lemma 4.10(ii) and the fact that \(f_{i_k}\) fulfills \(\eta \)-TCC, we estimate

$$\begin{aligned} t_{k,\varphi }\big (f_{i_k}(x_k)+\langle \nabla f_{i_k}(x_k), {\hat{x}}-x_k\rangle \big ) \le \eta M\frac{\big (f_{i_k}(x_k)\big )^2}{\Vert \nabla f_{i_k}(x_k)\Vert _*^2}, \end{aligned}$$

so that the assertion follows as in (i). \(\square \)

The condition on \(\eta \) in (i) is the classical condition for Landweber methods (see e.g. [29, Theorem 3.8]) and is also required in the work [58], which studies Algorithm 1 with \(\varphi (x)=\frac{1}{2}\Vert x\Vert _2^2\) under \(\eta \)-TCC. The constant \(\tau \) in (ii) cannot be greater than the constant \(\tau \) in (i), since \(\sigma \le M\).

Theorem 4.19

Let Assumption 1 hold and let \(\varphi \) be \(\sigma \)-strongly convex. Moreover, let Assumption 3 hold with some \(\eta >0\) and \({\hat{x}}\in S\) and let \(x_0\in B_{r,\varphi }({\hat{x}})\).

  1. (i)

    If \(\eta <\frac{1}{2}\), then the iterates \(x_k\) of Algorithm 2 converge a.s. to a random variable whose image is contained in the solution set \(S\cap B_{r,\varphi }({\hat{x}})\) and it holds that

    $$\begin{aligned} \mathbb {E}\big [\min _{l=1,...,k} \sum _{i=1}^n p_i \big (f_i(x_l)\big )^2\big ] \le \frac{C \cdot D_\varphi ^{x_0^*}(x_0,{\hat{x}})}{\sigma \big (\frac{1}{2}-\eta \big ) k}. \end{aligned}$$
  2. (ii)

    Let \(\varphi \) be additionally M-smooth and \(\eta < \frac{\sigma }{2\,M}\). Assume that \(x_k\) are the iterates of Algorithm 1 and the condition \(H_k\cap \textrm{dom}\ \varphi \ne \emptyset \) is fulfilled for all k. Then the \(x_k\) converge a.s. to a random variable whose image is contained in the solution set \(S\cap B_{r,\varphi }({\hat{x}})\) and it holds that

    $$\begin{aligned} \mathbb {E}\big [\min _{l=1,...,k} \sum _{i=1}^n p_i \big (f_i(x_l)\big )^2\big ] \le \frac{C \cdot D_\varphi ^{x_0^*}(x_0,{\hat{x}})}{\sigma \big (\frac{1}{2}-\eta \frac{M}{\sigma }\big ) k}. \end{aligned}$$

Proof

By Lemma 4.18, the \(x_k\) stay in \(B_{r,\varphi }({\hat{x}})\). The statements now follow as in Theorem 4.13 by invoking Lemma 4.18 instead of Lemma 4.12. \(\square \)

Finally, we can give a local linear convergence rate under the additional assumption that the Jacobian has full column rank. For \(\varphi (x)=\frac{1}{2}\Vert x\Vert _2^2\), part (i) of the theorem recovers the result from [58, Theorem 3.1] as a special case. In both Theorem 4.19 and Theorem 4.20, we unfortunately obtain a more pessimistic rate for Algorithm 1 than for Algorithm 2, since the \(\tau \) from case (ii) is upper bounded by the \(\tau \) from case (i).

Theorem 4.20

Let Assumption 1 hold true and let \(\varphi \) be \(\sigma \)-strongly convex and M-smooth. Let Assumption 3 hold with some \(\eta >0\) and \({\hat{x}}\in S\) and let \(x_0\in B_{r,\varphi }({\hat{x}})\). Moreover, assume that the Jacobian Df(x) has full column rank for all \(x\in B_{r,\varphi }({\hat{x}})\) and \(p_{\min }=\min _{i=1,...,n} p_i>0\). We set

$$\begin{aligned} \kappa _{\min } := \max _{x\in B_{r,\varphi }({\hat{x}})} \max _{\Vert y\Vert _2=1} \frac{\Vert Df(x)\Vert _F}{\Vert Df(x)y\Vert _2}. \end{aligned}$$
  1. (i)

    If \(\eta <\frac{1}{2}\), then the iterates \(x_k\) of Algorithm 2 fulfill that

    $$\begin{aligned} \frac{\sigma }{2}\mathbb {E}\big [\Vert x_k-{\hat{x}}\Vert _2^2] \le \mathbb {E}\big [D_\varphi (x_{k},{\hat{x}})\big ] \le \Big ( 1- \frac{ \sigma \big (\frac{1}{2}-\eta \big )p_{\min }}{M(1+\eta )^2\kappa _{\min }^2} \Big )^k \mathbb {E}\big [D_\varphi (x_0,{\hat{x}})\big ]. \end{aligned}$$
    (33)
  2. (ii)

    Let \(\varphi \) be additionally M-smooth and let \(\eta < \frac{\sigma }{2\,M}\). Assume that \(x_k\) are the iterates of Algorithm 1 and the condition \(H_k\cap \textrm{dom}\ \varphi \ne \emptyset \) is fulfilled for all k. Then it holds that

    $$\begin{aligned} \frac{\sigma }{2}\mathbb {E}\big [\Vert x_k-{\hat{x}}\Vert _2^2] \le \mathbb {E}\big [D_\varphi (x_{k},{\hat{x}})\big ] \le \Big ( 1- \frac{ \sigma \big (\frac{1}{2}-\eta \frac{M}{\sigma }\big )p_{\min }}{(1+\eta )^2\kappa _{\min }^2} \Big )^k \mathbb {E}\big [D_\varphi (x_0,{\hat{x}})\big ]. \end{aligned}$$
    (34)

For the proof, as in [58] we use the following auxiliary lemma.

Lemma 4.21

Let \(a_1,...,a_n\ge 0\) and \(b_1,...,b_n>0\). Then it holds that

$$\begin{aligned} \sum _{i=1}^n \frac{a_i}{b_i} \ge \frac{\sum _{i=1}^n a_i}{\sum _{i=1}^n b_i}. \end{aligned}$$

Proof

Since \(a_{i},b_{i}\ge 0\) it holds that

$$\begin{aligned} \Big ( \sum _{i=1}^n b_i \Big ) \Big ( \sum _{j=1}^n \frac{a_j}{b_j} \Big ) = \sum _{i,j=1}^n b_i \frac{a_j}{b_j} \ge \sum _{i=1}^n b_i \frac{a_i}{b_i} = \sum _{i=1}^n a_i. \end{aligned}$$

\(\square \)

Proof of Theorem 4.20

By Assumption 3 and the fact that \(f({\hat{x}})=0\), we can estimate

$$\begin{aligned} |\langle \nabla f_{i_k}(x_k), {\hat{x}} - x_k\rangle |&\le |f_{i_k}(x_k) + \langle \nabla f_{i_k}(x_k), {\hat{x}} - x_k\rangle - f_{i_k}({\hat{x}})| + |f_{i_k}(x_k)-f_{i_k}({\hat{x}})| \\&\le (1+\eta ) |f_{i_k}(x_k)-f_{i_k}({\hat{x}})| = (1+\eta ) |f_{i_k}(x_k)|. \end{aligned}$$

In both cases (i) and (ii) of Lemma 4.18, inserting the above estimate we conclude that

$$\begin{aligned} D_\varphi (x_{k+1},{\hat{x}}) \le D_\varphi (x_k,{\hat{x}}) - \frac{\tau }{(1+\eta )^2} \cdot \frac{|\langle \nabla f_{i_k}(x_k), {\hat{x}}-x_k\rangle |^2}{\Vert \nabla f_{i_k}(x_k)\Vert _2^2}. \end{aligned}$$

Taking expectation and using Lemma 4.21 as well as the definition of \(\kappa _{\min }\), we conclude that

$$\begin{aligned} \mathbb {E}\big [D_\varphi (x_{k+1},{\hat{x}})\big ]&\le \mathbb {E}\big [D_\varphi (x_{k},{\hat{x}})\big ] - \frac{\tau }{(1+\eta )^2} \cdot \mathbb {E}\Big [ \sum _{i=1}^n p_i \frac{|\langle \nabla f_i(x_k), {\hat{x}}-x_k\rangle |^2}{\Vert \nabla f_i(x_k)\Vert _2^2} \Big ] \\&\le \mathbb {E}\big [D_\varphi (x_{k},{\hat{x}})\big ] - \frac{\tau p_{\min }}{(1+\eta )^2} \cdot \mathbb {E}\Big [\frac{\Vert Df(x_k)({\hat{x}}-x_k)\Vert _2^2}{\Vert Df(x_k)\Vert _F^2} \Big ]\\&\le \mathbb {E}\big [D_\varphi (x_{k},{\hat{x}})\big ] - \mathbb {E}\Big [ \frac{\tau p_{\min }}{(1+\eta )^2 \kappa _{\min }^2} \cdot \Vert x_k-{\hat{x}}\Vert _2^2 \Big ] \\&\le \Big ( 1- \frac{ 2\tau p_{\min }}{(1+\eta )^2M \kappa _{\min }^2} \Big ) \mathbb {E}\big [D_\varphi (x_{k},{\hat{x}})\big ]. \end{aligned}$$

\(\square \)

5 Numerical experiments

In this section, we evaluate the performance of the NBK method. In the first experiment, we used NBK with the nonsmooth DGF \(\varphi (x) = \frac{1}{2}\Vert x\Vert _2^2+\lambda \Vert x\Vert _1\) to find sparse solutions of unconstrained quadratic equations, that is, with \(C=\mathbb {R}^d\). Next, we employed the negative entropy DGF over the probability simplex \(C=\Delta ^{d-1}\) to solve simplex-constrained linear equations as well as the left-stochastic decomposition problem, a quadratic problem over a product of probability simplices with applications in clustering [2]. All methods were implemented in MATLAB and run on a MacBook with a 1.2 GHz quad-core Intel Core i7 processor and 16 GB of memory. Code is available at https://github.com/MaxiWk/Bregman-Kaczmarz.

5.1 Sparse solutions of quadratic equations

As the first example, we considered multinomial quadratic equations

$$\begin{aligned} f_i(x) = \frac{1}{2}\langle x,A^{(i)}x\rangle + \langle b^{(i)},x\rangle + c^{(i)}=0 \end{aligned}$$

with \(A^{(i)}\in \mathbb {R}^{d\times d}\), \(b^{(i)}\in \mathbb {R}^d\), \(c^{(i)}\in \mathbb {R}\) and \(i=1,...,n\). We investigated whether Algorithm 1 (NBK method) and Algorithm 2 (rNBK method) are capable of finding a sparse solution \({\hat{x}}\in \mathbb {R}^d\) by using the DGF \(\varphi (x)=\lambda \Vert x\Vert _1+\frac{1}{2}\Vert x\Vert _2^2\), and tested both methods against the euclidean nonlinear Kaczmarz method (NK). Since \(\textrm{dom}\ \varphi = \mathbb {R}^d\), the step size \(t_{k,\varphi }\) from (11) can always be chosen in the NBK method. Moreover, this step size can be computed exactly by a sorting procedure, as \(\varphi ^*\) is a continuous piecewise quadratic function, see Example 3.2. In order to guarantee existence of a sparse solution, we chose a sparse vector \({\hat{x}}\in \mathbb {R}^d\), sampled the data \(A^{(i)}, b^{(i)}\) with entries from the standard normal distribution and set

$$\begin{aligned} c^{(i)} = - \Big ( \frac{1}{2}\langle {\hat{x}},A^{(i)}{\hat{x}}\rangle + \langle b^{(i)},{\hat{x}}\rangle \Big ). \end{aligned}$$

In all examples, the nonzero part of \({\hat{x}}\) and the initial subgradient \(x_0^*\) were sampled from the standard normal distribution. The initial vector \(x_0\) was computed by \(x_0=\nabla \varphi ^*(x_0^*) = S_\lambda (x_0^*)\).
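To make this setup concrete, the following MATLAB snippet sketches the data generation and one possible implementation of the relaxed update for this DGF. It is a minimal illustration and not the reference code from the repository: the problem sizes, \(\lambda \) and all variable names are our own choices, and it uses the simple relaxed step size \(t_{k,\sigma }\) (as in Algorithm 2) instead of the exact sorting-based line search for \(t_{k,\varphi }\).

% Sketch: sparse solutions of quadratic equations f_i(x) = 0.5*x'*A{i}*x + b{i}'*x + c(i)
% with DGF phi(x) = lambda*||x||_1 + 0.5*||x||_2^2, so that grad phi^*(x*) = S_lambda(x*).
rng(0);
n = 200; d = 50; lambda = 2; iters = 5e4;
xhat = zeros(d,1); xhat(randperm(d,5)) = randn(5,1);   % sparse solution
A = cell(n,1); b = cell(n,1); c = zeros(n,1);
for i = 1:n
    A{i} = randn(d); b{i} = randn(d,1);
    c(i) = -(0.5*xhat'*A{i}*xhat + b{i}'*xhat);        % ensures f_i(xhat) = 0
end
soft = @(z) sign(z).*max(abs(z)-lambda,0);             % soft thresholding S_lambda
xs = randn(d,1); x = soft(xs);                         % x_0^* and x_0 = grad phi^*(x_0^*)
for k = 1:iters
    i  = randi(n);                                     % pick one equation uniformly at random
    fi = 0.5*x'*A{i}*x + b{i}'*x + c(i);               % residual f_i(x_k)
    gi = 0.5*(A{i}+A{i}')*x + b{i};                    % gradient of f_i at x_k
    xs = xs - (fi/norm(gi)^2)*gi;                      % dual step with t = t_{k,sigma}, sigma = 1
    x  = soft(xs);                                     % primal iterate x_{k+1} = S_lambda(x_{k+1}^*)
end

Setting \(\lambda =0\) in this sketch recovers the euclidean NK update, since the soft thresholding then reduces to the identity.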

From the updates it is evident that the computational cost per iteration is lowest for the NK method, slightly higher for the rNBK method and highest for the NBK method. To examine the case \(d<n\), we chose \(A^{(i)}\sim {\mathcal {N}}(0,1)^{500\times 500}\) for \(i=1,...,1000\), \({\hat{x}}\) with 25 nonzero entries and \(\lambda =10\), and performed 20 random repeats. Figure 1 shows that the NBK method clearly outperforms the other two methods in this situation, despite its higher cost per iteration.

Figure 2 illustrates that in the case \(d>n\), both the NBK method and the rNBK method can fail to converge or converge very slowly.

Fig. 1
figure 1

Experiment with quadratic equations, \((n,d) = (1000,500)\), \({\hat{x}}\) with 50 nonzero entries, 20 random repeats. Left: plot of residual \(\Vert f(x_k)\Vert _2\), right: plot of distance to solution \({\hat{x}}\), both over computation time. Thick line shows median over all trials, light area is between min and max, darker area indicates 25th and 75th quantile

Fig. 2
figure 2

Experiment with quadratic equations, \((n,d) = (50,100)\), \({\hat{x}}\) with 5 nonzero entries, 50 random repeats, plot of residual \(\Vert f(x_k)\Vert _2\) against computation time. Left: \(\lambda =2\), right: \(\lambda =5\). Thick line shows median over all trials, light area is between min and max, darker area indicates 25th and 75th quantile

5.2 Linear systems on the probability simplex

We tested our method on linear systems constrained to the probability simplex

$$\begin{aligned} \text {find } x\in \Delta ^{d-1}: \qquad Ax=b. \end{aligned}$$
(35)

That is, in problem (1) we chose \(f_i(x)=\langle a_i,x\rangle - b_i\) with \(D=\mathbb {R}^d\) and viewed \(C=\Delta ^{d-1}\) as the additional constraint. For Algorithm 1, we used the simplex-restricted negative entropy function from Example 3.3, i.e. we set

$$\begin{aligned} \varphi (x) = {\left\{ \begin{array}{ll} \sum _{i=1}^d x_i \log (x_i), &{} x\in \Delta ^{d-1}, \\ +\infty , &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$

We know from Example 3.3 that \(\varphi \) is 1-strongly convex w.r.t. the 1-norm \(\Vert \cdot \Vert _1\). Therefore, as the second method we considered the rNBK iteration given by Algorithm 2 with \(\sigma =1\) and \(\Vert \cdot \Vert _*=\Vert \cdot \Vert _\infty \). As a benchmark, we considered a POCS (projection onto convex sets) method which computes an orthogonal projection onto a row equation, followed by an orthogonal projection onto the probability simplex, see Algorithm 3 below. We note that in [36, Theorem 3.3] it has been proved that the distance of the iterates of the POCS method and of the NBK method to the set of solutions in \(\Delta ^{d-1}_+\) decays at an expected linear rate, provided that a solution exists in \(\Delta ^{d-1}_+\). Theorem 4.13 shows at least a.s. convergence of the iterates towards a solution for all three methods.

We note that \(\nabla f_{i_k}(x)=a_{i_k}\) for all x and \(\beta _k = b_{i_k}\) in the NBK method. If problem (35) has a solution, then condition (10) is fulfilled in each step of the NBK method, so the method always takes the step size \(t_k=t_{k,\varphi }\) given by the exact Bregman projection. For the projection onto the simplex in Algorithm 3, we used the pseudocode from [59], see also [12, 24, 26].
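For illustration, the following MATLAB function sketches one such NBK step for a single linear equation \(\langle a,x\rangle =b\): the Bregman projection reduces to a multiplicative reweighting of the current iterate, and the step size is determined by a one-dimensional Newton iteration on the constraint residual. The function name, interface and stopping rule are our own choices and not the implementation from the repository; the tolerance tol plays the role of the accuracy \(\epsilon \) discussed below, and a numerically robust version would perform the reweighting in the log domain.

% Sketch: exact NBK step for <a,x> = b on the simplex with the negative entropy DGF.
% The input x is assumed to have positive entries summing to one.
function x = nbk_entropy_step(x, a, b, tol)
    t = 0;                                   % step size t_{k,phi} to be determined
    for it = 1:100
        w  = x.*exp(-t*a);                   % multiplicative reweighting
        xt = w/sum(w);                       % candidate x_{k+1}(t) on the simplex
        g  = a'*xt - b;                      % constraint residual g(t) = <a,x(t)> - b
        if abs(g) <= tol
            break;                           % Bregman projection found up to tolerance
        end
        gp = (a'*xt)^2 - (a.^2)'*xt;         % g'(t) = -Var_{x(t)}(a) <= 0
        if gp > -1e-14
            break;                           % safeguard against a vanishing derivative
        end
        t = t - g/gp;                        % Newton update for the step size
    end
    x = xt;
end

Within a Kaczmarz sweep over the rows of A, one would call, e.g., x = nbk_entropy_step(x, A(i,:)', b(i), 1e-9).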

Algorithm 3
figure c

Alternating euclidean projections (POCS method) for (35)
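For reference, a minimal MATLAB version of the sort-based euclidean projection onto the probability simplex used inside Algorithm 3 could look as follows; this is our own transcription of the standard algorithm and not necessarily identical to the pseudocode of [59].

% Sketch: euclidean projection of y onto the probability simplex.
function p = project_simplex(y)
    y   = y(:);                                         % work with a column vector
    d   = numel(y);
    u   = sort(y, 'descend');
    css = cumsum(u);
    rho = find(u + (1 - css)./(1:d)' > 0, 1, 'last');   % number of positive entries of the projection
    tau = (1 - css(rho))/rho;                           % uniform shift so the positive part sums to one
    p   = max(y + tau, 0);
end

The cost of this routine is dominated by the sort, i.e. \(O(d\log d)\) per projection.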

In our experiments we noticed that in large dimensions, such as \(d\ge 100\), solving for the Bregman projection step size (11) up to a tolerance of \(\epsilon =10^{-9}\) takes less than half the computation time of the simplex projection. As these are the dominant operations of the respective methods, the NBK updates are computationally cheaper than the POCS updates in the high-dimensional setting. However, the examples will show that the convergence behavior of the methods depends on the distribution of the entries of A. All methods were observed to converge linearly.

In the following experiments, we took different choices of A and set the right-hand side to \(b=A{\hat{x}}\) with a point \({\hat{x}}\) drawn from the uniform distribution on the probability simplex \(\Delta ^{d-1}\). All methods were initialized with the center point \(x_0=(\frac{1}{d},..., \frac{1}{d}).\)

For our first experiment, we chose standard normal entries \(A\sim {\mathcal {N}}(0,1)^{n\times d}\) with \((n,d) = (500,200)\) and \((n,d)=(200,500)\). Figure 3 shows that in this setting, the POCS method achieves much faster convergence than the NBK method in the overdetermined case \((n,d) = (500,200)\), whereas both methods perform roughly the same in the underdetermined case \((n,d) = (200,500)\). The rNBK method is considerably slower than the other two methods, which shows that computing the step size \(t_{k,\varphi }\) in the NBK method pays off.

In our second experiment, we built the matrix from uniformly distributed entries \(A\sim {\mathcal {U}}([0,1])^{n\times d}\) and \(A\sim {\mathcal {U}}([0.9,1])^{n\times d}\) with \((n,d)=(200,500)\). The results are summarized in Fig. 4. For the Kaczmarz method it has been observed in practice that so-called 'redundant' rows of the matrix A deteriorate the convergence of the method [31]. This effect can also occur with the POCS method, as it likewise relies on euclidean projections. Remarkably, this is not the case for the NBK method, which clearly outperforms the POCS method and the rNBK method. In particular, this shows that the multiplicative update used in both the rNBK method and the NBK method is not enough to overcome the difficulty of redundancy: to achieve fast convergence, it must be combined with the appropriate step size used by the proposed NBK method.

Finally, we illustrate the effect of the accuracy \(\epsilon \) in the step size computation of the NBK method. We chose \(A\sim {\mathcal {U}}([0,1])^{n\times d}\) and compared the tolerance \(\epsilon =10^{-9}\) with the larger tolerance \(\epsilon =10^{-5}\). Figure 5 shows that, with \(\epsilon =10^{-5}\), the residual plateaus at a certain threshold. In contrast, with \(\epsilon =10^{-9}\) the residual does not plateau, and despite the more costly computation of the step size, the NBK method remains competitive with respect to computation time. Hence, for linear equations over the probability simplex we recommend solving the step size problem to high precision.

Fig. 3
figure 3

Experiment with linear equations on the probability simplex, plot of relative residuals averaged over 50 random examples against iterations (k) and computation time. Left column: \(A\sim {\mathcal {N}}(0,1)^{500\times 200}\), right column: \(A\sim \mathcal N(0,1)^{200\times 500}\). Thick line shows median over all trials, light area is between min and max, darker area indicates 25th and 75th quantile

Fig. 4
figure 4

Experiment with linear equations on the probability simplex, plot of relative residuals averaged over 50 random examples against iterations (k) and computation time. Left column: \(A\sim {\mathcal {U}}([0,1])^{200\times 500}\), right column: \(A\sim {\mathcal {U}}([0.9,1])^{200\times 500}\). Thick line shows median over all trials, light area is between min and max, darker area indicates 25th and 75th quantile

Fig. 5
figure 5

Experiment with linear equations on the probability simplex, plot of relative residuals averaged over 50 random examples against computation time. In both examples, \(A\sim \mathcal U([0,1])^{200\times 500}\). Left: \(\epsilon =10^{-9}\), right: \(\epsilon =10^{-5}\) in NBK method. Thick line shows median over all trials, light area is between min and max, darker area indicates 25th and 75th quantile

5.3 Left stochastic decomposition

The left stochastic decomposition (LSD) problem can be formulated as follows:

$$\begin{aligned} \text {find } X\in L^{r\times m}: \qquad X^TX = A, \end{aligned}$$
(36)

where

$$\begin{aligned}L^{r\times m}:=\big \{P\in \mathbb {R}^{r\times m}_{\ge 0}: P^T\mathbb {1}_r = \mathbb {1}_m\big \}\end{aligned}$$

is the set of left stochastic matrices and \(A\in \mathbb {R}^{m\times m}\) is a given nonnegative matrix. The problem is equivalent to the so-called soft-K-means problem and hence has applications in clustering [2]. We can view (36) as an instance of problem (1) with component equations

$$\begin{aligned} f_{i,j}(X) = \langle X_{:,i}, X_{:,j} \rangle - A_{i,j} = 0 \qquad \text {for } i,j=1,...,m \end{aligned}$$

and \(C=L^{r\times m}\cong \left( \Delta ^{r-1}\right) ^m\), where \(X_{:,i}\) denotes the ith column of X. For Algorithm 1 we chose the DGF from Example 3.4 with the simplex-restricted negative entropy \(\varphi _i=\varphi \) from Example 3.3. Since \(f_{i,j}\) depends on at most two columns of X, Algorithm 1 acts on \(\Delta ^{r-1}\) or \(\Delta ^{r-1}\times \Delta ^{r-1}\) in each step. Therefore, we applied the steps from Example 3.3 in the first case, and from Example 3.5 in the second case.

We compared the performance of Algorithms 1 and 2 to a projected nonlinear Kaczmarz method given by Algorithm 4. Here, by \(X_{k,:,i}\) we refer to the ith column of the kth iterate matrix. In all examples, we set \(A={\hat{X}}^T {\hat{X}}\), where the columns of \({\hat{X}}\) were sampled according to the uniform distribution on \(\Delta ^{r-1}\).
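The test data can be generated as in the following MATLAB sketch, assuming that "uniform distribution on \(\Delta ^{r-1}\)" refers to the flat Dirichlet distribution, which can be sampled by normalizing independent exponential random variables; the repository may use a different sampler and different problem sizes.

% Sketch: test data for the LSD problem (36).
rng(1);
r = 50; m = 100;
E    = -log(rand(r, m));     % i.i.d. Exp(1) entries
Xhat = E ./ sum(E, 1);       % each column is uniform on the simplex Delta^{r-1}
A    = Xhat' * Xhat;         % m-by-m target matrix, so f_{i,j}(Xhat) = 0 for all i,j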

Algorithm 4
figure d

Projected nonlinear Kaczmarz method (PNK) for (36)

We observed that Algorithm 1 (NBK method) gives the fastest convergence if r is not much smaller than m; see Figs. 6 and 7. In both experiments, we noticed that condition (10) was in fact fulfilled in each step, although checking it did not make a notable difference in performance. The most interesting setting for clustering is the one where r is very small and m is large, since r is the number of clusters [2]. However, it remains unclear whether the NBK or the PNK method is the better choice for this problem size, as Fig. 8 shows. In this experiment, condition (10) was not always fulfilled in the NBK method and we needed to employ the globalized Newton method together with an additional condition on the step size approximation; see the Appendix for details. Finally, we can again see that Algorithm 2 is clearly outperformed by the other two methods in all experiments.

Fig. 6
figure 6

Experiment ’Left stochastic decomposition problem’ with \(r=100, m=50\). Residuals \(\Vert f(X^{(k)})\Vert _2\) averaged over 50 random examples against outer iterations k (left) and computation time (right). Thick line shows median over all trials, light area is between min and max, darker area indicates 25th and 75th quantile

Fig. 7
figure 7

Experiment ’Left stochastic decomposition problem’ with \(r=50, m=100\), plot of residuals \(\Vert f(X^{(k)})\Vert _2\) averaged over 50 random examples against iterations (left) and computation time (right). Thick line shows median over all trials, light area is between min and max, darker area indicates 25th and 75th quantile

Fig. 8
figure 8

Experiment ’Left stochastic decomposition problem’ with \(r=3, m=100\), residuals \(\Vert f(X_k)\Vert _2\) against computation time. Left and right: Two random examples with different convergence behavior. Thick line shows median over all trials, light area is between min and max, darker area indicates 25th and 75th quantile

6 Conclusions and further research

We provided a general Bregman projection method for solving nonlinear systems of equations, where each iteration only needs to sample a single equation to make progress towards the solution. As such, the cost of one iteration scales independently of the number of equations. Our method also generalizes the nonlinear Kaczmarz method and allows for additional simple constraints or sparsity-inducing regularizers. We provided two global convergence theorems under different settings and identified a number of relevant experimental settings in which instantiations of our method are efficient.

Convergence for non-strongly convex distance generating functions \(\varphi \), as well as a suitable replacement for \(\sigma \) in this setting, has so far not been explored.

Our work also opens up the possibility of incorporating more structure into SGD-type methods in the interpolation setting, as has been done in [33] for the linear case. In this setting, each \(f_i(x)\) is a positive loss function over the ith data point. If we knew in addition that some of the coordinates of x are meant to be positive, or that x is a discrete probability measure, then our nonlinear Bregman projection methods applied to the interpolation equations would provide new adaptive step sizes for stochastic mirror descent. Further avenues to explore would be relaxing the interpolation equations, say into inequalities [27], and applying analogous Bregman projections to incorporate more structure. We leave this to future work.