1 Introduction

Stochastic (sub)gradient methods form the algorithmic core of much of modern statistical and machine learning. Such algorithms are typically sensitive to their parameters and require extensive step size tuning to achieve adequate performance. Classical tuning strategies decay the step size polynomially and lead to optimal sublinear rates of convergence on convex and strongly convex problems

$$\begin{aligned} \min _{x\in \mathcal {X}}~ f(x)=\mathbb {E}_{z\sim \mathbb {P}}\left[ f(x,z)\right] , \end{aligned}$$

where the loss functions \(f(\cdot ,z)\) are convex and \(\mathcal {X}\) is a closed convex set [45]. An alternative schedule, called geometric step decay, decreases the step size geometrically by halving it after every few epochs. In recent work [66], geometric step decay was shown to improve exponentially upon classical sublinear rates under the sharpness assumption:

$$\begin{aligned} f(x)- \min _{\mathcal {X}} f\ge \mu \cdot \textrm{dist}(x,\mathcal {X}^*) \qquad \forall x\in \mathcal {X}, \end{aligned}$$
(1.1)

where \(\mu >0\) is some constant and \(\mathcal {X}^*=\hbox {argmin}_{x\in \mathcal {X}} f(x)\) is the solution set. This result complements early works of Goffin [27] and Shor [62], which show that deterministic subgradient methods converge linearly on sharp convex functions if their step sizes decay geometrically. The work [66] also reveals a departure from the smooth strongly convex setting, where deterministic linear rates degrade to sublinear rates when the gradient is corrupted by noise [45].

Beyond the convex setting, sharp problems appear often in nonconvex statistical recovery problems, for example, in robust matrix sensing [40], phase retrieval [18, 20], blind deconvolution [8], and quadratic/bilinear sensing and matrix completion [7]. For such problems, sharpness is surprisingly common and corresponds to strong identifiability of the statistical model. In such settings, sharpness endows standard deterministic first-order algorithms with rapid local convergence guarantees, enabling the recovery of signals at optimal or near-optimal sample complexity in a variety of nonconvex problems. Despite this, we do not know whether stochastic algorithms equipped with geometric step decay—or any other step size schedule—linearly converge on sharp nonconvex problems. Such algorithms, if available, could pave the way for new sample efficient strategies for these and other statistical recovery problems.

The main result of this work shows that for a large class of stochastic, sharp, nonsmooth, and nonconvex problems, a geometric step decay schedule endows well-known algorithms with a local nearly linear rate of convergence to minimizers.

This guarantee takes hold as soon as the algorithms are initialized in a small neighborhood around the set of minimizers and applies, for example, to the stochastic projected subgradient, proximal point, and prox-linear algorithms. Beyond sharp growth (1.1), we also analyze losses that grow sharply away from some closed set \(\mathcal {S}\), which is strictly larger than \(\mathcal {X}^*\). Such sets \(\mathcal {S}\) are akin to “active manifolds” in the sense of Lewis [38] and Wright [64]. For example, the loss \(f(x,y)=x^2+|y|\) is not sharp relative to its minimizer, but is sharp relative to the x-axis. For these problems, our algorithms converge nearly linearly to the set \(\mathcal {S}\). Finally, we illustrate the result with two statistical recovery problems: phase retrieval and blind deconvolution. For these recovery tasks, our results match the best known computational and sample complexity guarantees under Gaussian measurement models and establish new guarantees under heavy-tailed distributions.

1.1 Related work

Our paper is closely related to a number of influential techniques in stochastic, convex, and nonlinear optimization. We now survey these related topics.

Stochastic model-based methods. In this work, we use algorithms that iteratively sample and minimize simple stochastic convex models of the loss function. Throughout, we call these methods model-based algorithms. Such algorithms include the stochastic projected subgradient, prox-linear, and proximal point methods. Stochastic model-based algorithms are known to converge globally to stationary points at a sublinear rate on a large class of nonsmooth and nonconvex problems [11, 19]. Some model-based algorithms also possess superior stability properties and can be less sensitive to step size choice than traditional stochastic subgradient methods [3, 4].

Geometrically decaying learning rate in deterministic optimization. Polynomially decaying step sizes are common in stochastic optimization [35, 53, 55]. In contrast, we develop algorithms with step sizes that decay geometrically. Geometrically decaying step sizes were first analyzed in convex optimization by Shor [62, Thm 2.7, Sec. 2.3] and Goffin [27]. This step size schedule is also closely related to the step size rules of Eremin [21] and Polyak [52]. Similar schedules are known to accelerate convex subgradient methods under Hölder growth, as shown in [32, 67]. Geometrically decaying step sizes for deterministic subgradient methods under weak convexity were systematically studied in [12]. The method of this work succeeds under similar assumptions on the population objective as in [12]: sharpness, Lipschitz continuity, and weak convexity. In contrast to [12], which analyzes a deterministic subgradient method (based on linear approximations of the objective), the method of this work requires only stochastic estimates of the objective and converges under a wider family of nonlinear approximations of the objective (see Example 2.1). Surprisingly, the method only incurs an additional cost of \(\log ^2(1/\varepsilon )\) (stochastic) oracle evaluations, which arises due to our restart scheme. It is an intriguing open question whether one can remove this additional dependence on \(\log ^2(1/\varepsilon )\).

Geometric step decay in stochastic optimization. The geometric step decay schedule is common in practice: for example, see Krizhevsky et al. [33] and He et al. [29]. It is a standard option in popular deep learning libraries such as PyTorch [49] and TensorFlow [1]. Geometric step decay has been analyzed in a number of recent papers in stochastic convex optimization, including [5, 25, 26, 34, 66, 68]. Among these papers, the work [66] relates most to ours. There, the authors propose two geometric step decay strategies that converge linearly on convex functions that are sharp and have bounded stochastic subgradients. These algorithms either use a moving ball constraint or follow a proximal-point type procedure. We follow the latter strategy, too, at least to obtain high-probability guarantees. The paper [66] differs from our work in that they assume convexity and a uniform bound on the stochastic subgradients. In contrast, we do not assume convexity and only assume that stochastic subgradients have a finite second moment. We are aware of only one subgradient method for stochastic nonconvex problems that converges linearly [3] under favorable assumptions. In [3], the authors develop a “clipped” subgradient method, which resembles a safeguarded stochastic Polyak step. In contrast to our work, their algorithms converge linearly only under “perfect interpolation,” meaning that all terms in the expectation share a minimizer. Formally, the interpolation condition therein stipulates that, almost surely over z,

$$\begin{aligned} \inf _{x} f(x, z) = f(x_{\star }, z), \quad \text {for any }x_{\star } \in \arg \min _{x} \mathbb {E}[f(x, z)]. \end{aligned}$$

We do not make this assumption here. Indeed, the statistical recovery problems from Sect. 4 do not satisfy the interpolation condition in the presence of corruptions.

Restarts in deterministic optimization. Restart techniques have a long history in nonlinear programming, for example in conjugate gradient and limited memory quasi-Newton methods. They have also been used more recently to improve the complexity of algorithms in deterministic convex optimization. For example, restart schemes can accelerate sublinear convergence rates for convex problems that satisfy growth conditions, as shown by Nesterov [47], Nemirovskii and Nesterov [44], Renegar [54], O’Donoghue and Candes [48], Roulet and d’Aspremont [58], Freund and Lu [24], Fercoq and Qu [22, 23], and others. In the nonconvex setting, stochastic restart methods are challenging to analyze, since the region of linear convergence is local. To overcome this challenge, one must bound the probability that the iterates leave this region. One of our main technical contributions is a technique for bounding this probability.

Finite sums. For finite sums, stochastic algorithms that converge linearly are more common. For example, for finite sums that are sharp and convex, Bertsekas and Nedić [43] prove that an incremental Polyak-type algorithm converges linearly. For finite sums that are smooth and strongly convex, variance reduced methods, such as SAG [59], SAGA [14], SDCA [60], SVRG [31], MISO/Finito [15, 41], SMART [10] and their proximal extensions converge linearly. The algorithms we develop here do not assume a finite sum structure.

Verifying sharpness. Sharp growth is a central assumption in this work. This property is surprisingly common in statistical recovery problems. For example, sharp growth has been established for robust matrix sensing [40], phase retrieval [18, 20], blind deconvolution [8], quadratic and bilinear sensing and matrix completion [7] problems. Consequently, the results of this paper apply in these settings.

1.2 Notation

We will mostly follow standard notation used in convex analysis and stochastic optimization. Throughout, the symbol \(\mathbb {R}^d\) will denote a d-dimensional Euclidean space with the inner product \(\langle \cdot ,\cdot \rangle \) and the induced norm \(\Vert x\Vert =\sqrt{\langle x,x\rangle }\). We denote the open ball of radius \(\varepsilon >0\) around a point \(x\in \mathbb {R}^d\) by the symbol \(B_{\varepsilon }(x)\); moreover, we write \(\mathbb {S}^{d-1}\) for the unit sphere in \(\mathbb {R}^d\). Given an event A, we write \(A^c\) for its complement. We also let \(1_{A}\) denote the indicator function of A, meaning a function that evaluates to one on A and zero off it. For any set \(Q\subset \mathbb {R}^d\), the distance function and the projection map are defined by

$$\begin{aligned} \textrm{dist}(x,Q):=\inf _{y\in Q} \Vert y-x\Vert \qquad \text {and}\qquad \textrm{proj}_Q(x):=\hbox {argmin}_{y\in Q} \Vert y-x\Vert , \end{aligned}$$

respectively. Consider a function \(f:\mathbb {R}^d\rightarrow \mathbb {R}\cup \{\pm \infty \}\) and a point x, with f(x) finite. The Fréchet subdifferential of f at x, denoted by \(\partial f(x)\), consists of all vectors \(v\in \mathbb {R}^d\) satisfying

$$\begin{aligned} f(y)\ge f(x)+\langle v,y-x\rangle +o(\Vert y-x\Vert )\qquad \text {as }y\rightarrow x. \end{aligned}$$

A function f is called \(\rho \)-weakly convex on an open convex set U if the perturbed function \(f+\frac{\rho }{2}\Vert \cdot \Vert ^2\) is convex on U. The subgradients of such functions automatically satisfy the uniform approximation property (see [56]):

$$\begin{aligned} f(y)\ge f(x)+\langle v,y-x\rangle -\frac{\rho }{2}\Vert y-x\Vert ^2\qquad \text {for all }x,y\in U, v\in \partial f(x). \end{aligned}$$
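For instance (a quick check of this definition, using the function that appears in Fig. 1 below): the loss \(f(x)=|x^2-1|\) is 2-weakly convex on \(\mathbb {R}\), since

$$\begin{aligned} f(x)+\frac{2}{2}x^2=\max \{x^2-1,\,1-x^2\}+x^2=\max \{2x^2-1,\,1\}, \end{aligned}$$

which is convex as a pointwise maximum of convex functions.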

2 Algorithms, assumptions, and main results

In this section, we formalize our target problem and introduce algorithms to solve it. We then outline our main results. The complete theorem statements and proofs appear in Sect. 3. Throughout, we consider the minimization problem

$$\begin{aligned} \min _{x \in \mathcal {X}}~ f(x) \end{aligned}$$
(2.1)

for some function \(f:\mathbb {R}^d\rightarrow \mathbb {R}\) and a closed convex set \(\mathcal {X}\subseteq \mathbb {R}^d\). We define the set \(\mathcal {X}^*:=\hbox {argmin}_{x\in \mathcal {X}} f(x)\) and assume it to be nonempty. We also fix a probability space \((\Omega ,\mathcal {F},\mathbb {P})\), equip \(\mathbb {R}^d\) with the Borel \(\sigma \)-algebra, and make the following assumption.

  1. (A1)

    (Sampling) It is possible to generate i.i.d. realizations \(z_1,z_2, \ldots \sim \mathbb {P}\).

The algorithms we develop rely on a stochastic oracle model for problem (2.1), recently introduced in [11]. These algorithms assume access to a family of functions \({f_x(\cdot ,z)}\)—called stochastic models—indexed by basepoints \(x\in \mathbb {R}^d\) and random elements \(z\sim \mathbb {P}\). Given these models, the generic stochastic model-based algorithm of [11] iterates the steps:

$$\begin{aligned} \begin{aligned}&\text {Sample } z_k\sim \mathbb {P};\\&\text {Set } y_{k+1} = \hbox {argmin}_{y\in \mathcal {X}}~ \left\{ f_{y_k}(y,z_k) + \frac{1}{2\alpha } \Vert y - y_k\Vert ^2\right\} . \end{aligned} \end{aligned}$$
(2.2)

In this work, model-based algorithms form the core of the following restart strategy: given inner and outer loop sizes \(K\) and \(T\), respectively, as well as initial stepsize \(\alpha _0 > 0 \), perform:

$$\begin{aligned} \begin{aligned}&\text {For }t= 0, \ldots , T-1:\\&\hspace{20pt} {\text {Initialize }y_0 = x_t, \alpha = 2^{-t}\alpha _0,\text { and run }K\text { iterations of~(2.2);}}\\&\hspace{20pt} \text {Sample }x_{t+ 1}\text { uniformly from } y_0, \ldots , y_K. \end{aligned} \end{aligned}$$
(2.3)

This restart strategy is common in machine learning practice and is called geometric step decay. Restart schemes date back to the fundamental work of Nesterov [47] and Nemirovskii and Nesterov [44] and more recently appear in [22, 23, 25, 48, 54, 58, 66]. These strategies often improve the convergence guarantees of the algorithm they restart under growth assumptions, for example, by boosting an algorithm that converges sublinearly to one that converges linearly. In this work, we will show that restart scheme (2.3) similarly improves (2.2) for a large class of weakly convex stochastic optimization problems.
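To make the scheme concrete, here is a minimal Julia sketch (our own; the function names, the demo loss \(f(x)=\mathbb {E}|\langle a,x\rangle |\), and all parameter values are illustrative assumptions, not the tuned settings of our theorems) of the update (2.2) with a subgradient model, wrapped in the restart loop (2.3):

```julia
using LinearAlgebra, Random

# Inner loop: K steps of the model-based update (2.2) with the subgradient
# model, which reduces to a stochastic subgradient step when X = R^d.
function mba_subgradient(y0, α, K, sample, subgrad)
    y = copy(y0)
    iters = [copy(y)]
    for _ in 1:K
        z = sample()
        g = subgrad(y, z)                  # g ∈ ∂f_y(y, z)
        y = y .- α .* g                    # minimizer of the model plus proximal term
        push!(iters, copy(y))
    end
    return iters[rand(1:length(iters))]    # sample uniformly from y_0, ..., y_K
end

# Outer loop: the geometric step decay restart scheme (2.3).
function rmba(x0, α0, K, T, sample, subgrad)
    x = copy(x0)
    for t in 0:T-1
        x = mba_subgradient(x, α0 / 2^t, K, sample, subgrad)
    end
    return x
end

# Demo on the sharp convex loss f(x) = E|<a, x>|, a ~ N(0, I), whose unique
# minimizer is x* = 0; the loss and all parameter values are illustrative.
Random.seed!(0)
d = 10
sample() = randn(d)
subgrad(x, a) = sign(dot(a, x)) .* a       # stochastic subgradient of |<a, x>|
x = rmba(randn(d), 0.1, 500, 10, sample, subgrad)
@show norm(x)                              # distance to the minimizer x* = 0
```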

2.1 Assumptions

In this section, we formalize our assumptions on sharp growth of (2.1) as well as on accuracy, regularity, and Lipschitz continuity of the models.

2.1.1 Sharp growth

We assume that \(f(\cdot )\) grows sharply as x moves in the direction normal to a closed set \(\mathcal {S}\).

  1. (A2)

    (Sharpness) There exists a constant \(\mu > 0\) and a closed set \(\mathcal {S}\subseteq \mathcal {X}\) satisfying \( \mathcal {X}^*\subseteq \mathcal {S}\) such that the following bound holds:

    $$\begin{aligned} f(x) - \inf _{z \in \textrm{proj}_{\mathcal {S}}(x)}f(z) \ge \mu \cdot \textrm{dist}(x, \mathcal {S}) \qquad \forall x \in \mathcal {X}. \end{aligned}$$
    (2.4)

This property generalizes the classical sharp growth condition (1.1), where \(\mathcal {S}=\mathcal {X}^*\). The setting \(\mathcal {S}= \mathcal {X}^*\) is well-studied in nonlinear programming and often underlies rapid convergence guarantees for deterministic local search algorithms. Beyond the classical setting, \(\mathcal {S}\) could be a sublevel set of f or an “active manifold” in the sense of Lewis [38]; see Sect. D. When \(\mathcal {S}\ne \mathcal {X}^*\), we design stochastic algorithms that do not necessarily converge to global minimizers, but instead nearly linearly converge to \(\mathcal {S}\) with high probability. Finally, we note that when \(\mathcal {S}= \mathcal {X}^{*}\) the results of this paper remain valid if (2.4) only holds in a neighborhood of the solution set.

2.1.2 Accuracy and regularity

We assume that the models are convex and under-approximate f up to quadratic error.

  1. (A3)

    (One-sided accuracy) There exist \(\eta > 0\), an open convex set U containing \(\mathcal {X}\), and a measurable function \((x,y,z)\mapsto f_x(y,z)\), defined on \(U\times U\times \Omega \), satisfying

    $$\begin{aligned} \mathbb {E}_{z}\left[ f_x(x,z)\right] =f(x) \qquad \forall x\in U, \end{aligned}$$

    and

    $$\begin{aligned} \mathbb {E}_{z}\left[ f_x(y,z)-f(y)\right] \le \frac{\eta }{2}\Vert y-x\Vert ^2\qquad \forall x,y\in U. \end{aligned}$$
  2. (A4)

    (Convex models) The function \(f_x(\cdot ,z)\) is convex \(\forall x\in U\) and a.e. \(z\in \Omega \).

Models satisfying (A3) and (A4), and their algorithmic implications, were analyzed in [11, Assumption B]; a closely related family of models was investigated in [3, 4]. Assumptions (A3) and (A4) imply that f is \(\eta \)-weakly convex on U, meaning that the assignment \(x \mapsto f(x) + \frac{\eta }{2}\Vert x\Vert ^2\) is convex [11, Lemma 4.1]. While we assume (A3) throughout the paper, we show in Remark 2 that our results hold under an even weaker assumption. For certain losses, models that satisfy (A3) and (A4) are easy to construct, as the following example shows.

Example 2.1

(Convex Composite Class) Stochastic convex composite losses take the form

$$\begin{aligned} f(x)=\mathbb {E}_{z} f(x,z)\quad \text {with}\quad f(x,z)=h(c(x,z),z), \end{aligned}$$

where \(h(\cdot , z)\) are Lipschitz and convex and the nonlinear maps \(c(\cdot ,z)\) are \(C^1\)-smooth with Lipschitz Jacobian. Such losses appear often in data science and signal processing (see [7, 16, 19] and references therein). For this problem class, natural models include

  • (subgradient)        \(f_x(y,z)=f(x,z)+\langle \nabla c(x, z)^\top v,y-x\rangle \) for any \(v\in \partial h(c(x, z), z)\).

  • (prox-linear)         \(f_x(y,z)=h(c(x,z)+\nabla c(x,z)(y-x),z).\)

  • (proximal point)   \(f_x(y,z)=f(y,z) + \frac{\eta }{2} \Vert x - y \Vert ^2\),

where \(\eta \) is large enough to guarantee the proximal model is convex. If a lower bound \(\ell (z)\) on \(\inf f(\cdot ,z)\) is known, one can also choose a clipped model

  • (clipped)              \(\tilde{f}_x(y, z) = \max \{ f_x(y,z), \ell (z)\} \),

for any of the models \( f_x(\cdot ,z)\) above, as was suggested by [3]. Intuitively, models that better approximate \(f(x,z)\) are likely to perform better in practice; see [3] for theoretical evidence supporting this claim. Fig. 1 contains a graphical illustration of one-sided models. \(\square \)
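As a concrete illustration, the following minimal Julia sketch (our own; the helper names and the restriction to a scalar inner map \(c(\cdot ,z)\) are simplifying assumptions) builds the four models of this example as closures:

```julia
using LinearAlgebra

# Models for a composite loss f(x, z) = h(c(x, z), z) with scalar inner map.
# subgradient model, with v ∈ ∂h(c(x, z), z):
model_sub(h, c, ∇c, v, x, z) = y -> h(c(x, z), z) + v * dot(∇c(x, z), y - x)
# prox-linear model: linearize only the inner map c
model_pl(h, c, ∇c, x, z) = y -> h(c(x, z) + dot(∇c(x, z), y - x), z)
# proximal point model, with η large enough to convexify f(·, z) + (η/2)‖x − ·‖²
model_pp(f, η, x, z) = y -> f(y, z) + η / 2 * norm(x - y)^2
# clipped model, with ℓ(z) a known lower bound on inf f(·, z)
model_clip(model, ℓ, z) = y -> max(model(y), ℓ(z))

# usage on the robust phase retrieval loss of Sect. 4:
# h(t, z) = |t|, c(x, z) = <a, x>^2 - b, so f(x, z) = |<a, x>^2 - b|
a, b = randn(3), 1.0
h(t, z) = abs(t)
c(x, z) = dot(a, x)^2 - b
∇c(x, z) = 2 * dot(a, x) .* a
x0 = randn(3)
fpl = model_pl(h, c, ∇c, x0, nothing)   # prox-linear model anchored at x0
@show fpl(x0)                           # equals f(x0, z) at the basepoint
```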

Fig. 1: Illustration of one-sided models: \(f(x)=|x^2-1|\), \(f_{0.5}(y)=|1.25-y|\) (figure adapted from [11])

2.1.3 Lipschitz continuity

We assume that the models are Lipschitz on a tube surrounding \(\mathcal {S}\).

  1. (A5)

    (Lipschitz property) Define the tube

    $$\begin{aligned} \mathcal {T}_{\gamma }:= \left\{ x \in \mathcal {X}\mid \textrm{dist}(x, \mathcal {S}) \le \frac{\gamma \mu }{\eta }\right\} \qquad \forall \gamma > 0. \end{aligned}$$

    We assume that there exists a measurable function \(L :\mathbb {R}^d \times \Omega \rightarrow \mathbb {R}_+\) such that

    $$\begin{aligned} \min _{v\in \partial f_x(x,z)}\Vert v\Vert \le L(x,z) \end{aligned}$$
    (2.5)

    for all \(x\in \mathcal {T}_2\) and a.e. \(z\in \Omega \). Moreover, we assume there exists \(\textsf{L}> 0\) such that

    $$\begin{aligned} \sup _{x \in \mathcal {T}_2} \sqrt{\mathbb {E}_{z}\left[ L(x, z)^2\right] } \le \textsf{L}. \end{aligned}$$

This Lipschitz property is local and differs from the global assumption of [11, Assumption B4]. The property holds only in \(\mathcal {T}_2\) since our algorithms will be initialized in this tube and will never leave it with high probability. The local nature of this property is crucial to signal recovery applications, for example, blind deconvolution and phase retrieval. In these problems, global Lipschitz continuity does not hold; see Sect. 4.

Remark 1

(Finite-sums) While our results are stated in terms of the streaming model, they are also applicable to the finite-sum setting, where our loss function is in the form

$$\begin{aligned} f(x) = \frac{1}{m} \sum _{i=1}^m f(x, i). \end{aligned}$$

Indeed, setting \(\mathbb {P}= \textrm{Unif}(\{1, \dots , m\})\), so that \(\mathbb {P}(z_k = i) = \frac{1}{m}\) for all \(k \in \mathbb {N}\) and \(i \in [m]\), we recover

$$\begin{aligned} \mathbb {E}_{z}\left[ f(x, z)\right] = \frac{1}{m} \sum _{i=1}^m f(x, i) = f(x). \end{aligned}$$

2.2 Algorithms and results

Stochastic model-based algorithms (Algorithm 1) iteratively sample and minimize stochastic convex models of the loss function. When equipped with models satisfying (A3) and (A4) and a global Lipschitz condition, these algorithms converge to stationary points of (2.1) at a sublinear rate [11, 19]. In this section, we show that such sublinear rates can be improved to local linear (or nearly linear) rates by using the restart strategy outlined in (2.2)–(2.3). We introduce two such strategies that succeed with probability \(1-\delta \), for any \(\delta > 0\). The first (Algorithm 2) allows for arbitrary sets \(\mathcal {S}\) in Assumption (A2), but its sample complexity and initialization region scale poorly in \(\delta \). The second (Algorithm 5) assumes \(\mathcal {S}= \mathcal {X}^*\), but has much better dependence on \(\delta \).

Algorithm 1: \(\textrm{MBA}(y_0, \alpha , K, \texttt {is\_conv})\)

Algorithm 2: \(\textrm{RMBA}(x_0, \alpha _0, K, T, \texttt {is\_conv})\)

Given Assumptions (A1)–(A5), the following theorem shows that the first restart strategy (Algorithm 2—Restarted Model Based Algorithm (RMBA)) converges nearly linearly to \(\mathcal {S}\) with high probability.

Theorem 2.1

(Informal) Fix a target accuracy \(\varepsilon >0\), failure probability \(\delta \in (0,\frac{1}{4})\), and a point \(x_0\in \mathcal {T}_{\gamma \sqrt{\delta }}\) for some \(\gamma \in (0,1)\). Then with appropriate parameter settings, the point \(x=\textrm{RMBA}(x_0, \alpha _0, K, T, \texttt {false})\) will satisfy \(\textrm{dist}(x,\mathcal {S})<\varepsilon \) with probability at least \(1-4\delta \). Moreover, the number of samples \(z_i\sim \mathbb {P}\) generated by the algorithm is at most

$$\begin{aligned} \mathcal {O}\left( \left( \frac{ \textsf{L}}{\delta \mu }\right) ^2 \log ^3 \left( \frac{\gamma \mu /\eta }{\varepsilon }\right) \right) . \end{aligned}$$

Theorem 2.1 has interesting consequences not only for convergence to global minimizers, but also for “active manifold identification.” For example, when \(\mathcal {S}= \mathcal {X}^*\), Theorem 2.1 shows that with constant probability, Algorithm 2 converges nearly linearly to the true solution set. When \(\mathcal {S}\ne \mathcal {X}^*\) and is instead an “active manifold” in the sense of Lewis [38], Algorithm 2 nearly linearly converges to the active manifold. In our numerical evaluation, we illustrate this phenomenon for a sparse logistic regression problem. We empirically observe that the method converges linearly to the support of the solution, even though the overall convergence to the true solution may be sublinear.

In our numerical experiments, we find that Algorithm 2 succeeds for a wide variety of parameter settings, which may not necessarily satisfy the conditions set forth by the theory. Theorem 2.1, on the other hand, only guarantees Algorithm 2 succeeds with high probability when we greatly increase its sample complexity and initialize it close to \(\mathcal {S}\). We would like to boost Algorithm 2 into a new algorithm whose sample complexity and initialization requirements scale only polylogarithmically in \(1/\delta \). As a first attempt, we discuss the following two probabilistic techniques, both of which have limitations:

  • (Markov) One approach is to call Algorithm 2 multiple times for a moderately small value \(\delta \) and pick out the “best” iterate from the batch. This approach is flawed since, even in the convex setting, there is no procedure to test which iterate is “best” without increasing sample complexity.

  • (Ensemble) An alternative approach is based on a well-known resampling trick, which applies when \(\mathcal {S}=\{\bar{x}\}\) is a singleton set [45, p. 243], [30, 63, Algorithm 1]: Run m trials of Algorithm 2 with any fixed \(\delta <1/4\), and denote the returned points by \(\{ x_i\}_{i=1}^m\). Then with high probability, the majority of the points \(\{ x_i\}_{i=1}^m\) will be close to \(\bar{x}\). Finally, to find an estimate near \(\bar{x}\), choose any point that has at least m/2 other points close to it (a minimal implementation sketch of this selection rule appears below).
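The selection step of the ensemble approach admits a very short implementation. The following is a minimal Julia sketch (our own; the names are illustrative): given the outputs of m independent trials and a tolerance, it returns any point that at least half of the trials agree with.

```julia
using LinearAlgebra

# Ensemble selection: return any candidate that has at least m/2
# candidates (including itself) within distance ϵ of it.
function ensemble_select(xs::Vector{<:AbstractVector}, ϵ::Real)
    m = length(xs)
    for x in xs
        if count(y -> norm(y - x) <= ϵ, xs) >= m / 2
            return x
        end
    end
    return nothing   # no majority cluster: too many trials failed
end
```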

The ensemble technique is promising, but it requires \(\mathcal {S}\) to be a singleton. This limits its applicability since many low-rank recovery problems (e.g., blind deconvolution, matrix completion, robust PCA) have uncountably many solutions. We overcome this issue by embedding Algorithm 2 and the ensemble method within a proximal-point method. At each stage of this algorithm, we run multiple copies of a stochastic model-based method on a quadratically regularized problem that has a unique solution. Among those copies, we use the ensemble technique to pick out a “successful” iterate. We summarize the resulting nested procedure in Algorithms 3–5: Algorithm 3 (Proximal Model-Based Algorithm—PMBA) is a generic model-based algorithm applied to a quadratically regularized problem; Algorithm 4 (Ensemble PMBA—EPMBA) calls Algorithm 3 as suggested by the ensemble technique; finally, Algorithm 5 (Restarted PMBA—RPMBA) updates the regularization term, in the style of a proximal point method.

Algorithm 3: \(\textrm{PMBA}(y_0, \rho , \alpha , K)\)

Algorithm 4: \(\textrm{EPMBA}(y_0, \rho , \alpha , K, m, \epsilon )\)

Algorithm 5: \(\textrm{RPMBA}(x_0,\rho _0, \alpha _0, K,\epsilon _0 , M, T)\)

We will establish the following guarantee. In the theorem, we assume \(\mathcal {S}=\mathcal {X}^*\).

Theorem 2.2

(Informal) Fix a target accuracy \(\varepsilon >0\), failure probability \(\delta \in (0,1)\), and a point \(x_0\in \mathcal {T}_{\gamma }\) for some \(\gamma \in (0,\frac{1}{4})\). Then with appropriate parameter settings, the point \(x=\textrm{RPMBA}(x_0, \rho _0, \alpha _0, K,\varepsilon _0, M, T)\) will satisfy \(\textrm{dist}(x,\mathcal {X}^*)<\varepsilon \) with probability at least \(1-\delta \). Moreover the total number of samples \(z_i\sim \mathbb {P}\) generated by the algorithm is at most

$$\begin{aligned} \mathcal {O}\left( \left( \frac{ \textsf{L}}{ \mu }\right) ^2 \log \left( \tfrac{\gamma \mu /\eta }{\varepsilon }\right) \cdot \log \left( \frac{\log (\frac{\gamma \mu /\eta }{\varepsilon })}{\delta }\right) \right) . \end{aligned}$$

Theorem 2.2 resolves the initialization and sample complexity issues of Theorem 2.1. Incidentally, its claimed sample complexity also depends more favorably on \(\varepsilon \) and on the problem parameters \(\mu \) and \(\eta \). Theorem 2.2 is new in the weakly convex setting and also improves on prior work by Xu et al. [66] for convex problems. There, the results require stochastic subgradients to be almost surely bounded, hence, sub-Gaussian. In contrast, Theorem 2.2 guarantees that \(\textrm{dist}(x, \mathcal {X}^*) <\varepsilon \) with high probability assuming only the local second moment bound (A5).

3 Proofs of main results

In this section, we establish high-probability nearly linear convergence guarantees for Algorithm 2 and linear convergence guarantees for Algorithm 5—the main contributions of this work. Throughout this section, we assume that Assumptions (A1)–(A5) hold.

3.1 Warm-up: convex setting

We begin with a short proof of nearly linear convergence for Algorithm 2 in the convex setting. We use this simplified case to explain the general proof strategy and point out the difficulty of extending the argument to the weakly convex setting. Since we restrict ourselves to the convex setting, throughout this section (Sect. 3.1) we suppose:

  • Assumption (A3) holds with \(\eta = 0\) and Assumption (A2) holds with \(\mathcal {S}= \mathcal {X}^*\).

  • The models \(f_x(\cdot , z)\) are \(L(z)\)-Lipschitz on \(\mathbb {R}^d\) for all x, where \(L:\Omega \rightarrow \mathbb {R}\) is a measurable function satisfying \(\sqrt{\mathbb {E}_{z}\left[ L( z)^2\right] } \le \textsf{L}\).

In particular, the tube \(\mathcal {T}_2\) is the entire space \(\mathcal {T}_2 = \mathbb {R}^d\), which alleviates the main difficulty of the weakly convex setting. The proof of convergence relies on the following known sublinear convergence guarantee for Algorithm 1.

Theorem 3.1

([11, Theorem 4.1]). Fix an initial point \(y_0 \in \mathbb {R}^d\) and let \(\alpha = \frac{C}{\sqrt{K+1}}\) for some \(C>0\). Then for any index \(K\in \mathbb {N}\), the point \(y=\textrm{MBA}(y_0, \alpha , K, \texttt {true})\) satisfies

$$\begin{aligned} \mathbb {E}\left[ f\left( y\right) - \min _{\mathcal {X}} f\right] \le \frac{\tfrac{1}{2}\textrm{dist}^2(y_0, \mathcal {X}^*) + C^2\textsf{L}^2 }{C\sqrt{K+1}}. \end{aligned}$$
(3.1)

The proof of nearly linear convergence now follows by iteratively applying Theorem 3.1 with a carefully chosen parameter \(C > 0\). The key idea of the proof is much the same as in the deterministic setting [44, 47, 58]. The proof proceeds by induction on the outer iteration counter \(t\). At the start of each outer iteration, we choose C to minimize the ratio in Equation (3.1), taking into account an inductive estimate on the initial square error \(\textrm{dist}^2(y_0, \mathcal {X}^*)\). We then run the inner loop until the estimate decreases by a fixed fraction. This strategy differs from the deterministic setting in only one way: since the output of the inner loop is random, we extract a bound on the initial distance using Markov’s inequality.
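For concreteness, the minimization over C in the ratio from (3.1) is elementary: writing \(R:=\textrm{dist}(y_0, \mathcal {X}^*)\), we have

$$\begin{aligned} \min _{C>0}~ \frac{\tfrac{1}{2}R^2 + C^2\textsf{L}^2}{C\sqrt{K+1}} = \frac{\sqrt{2}R\textsf{L}}{\sqrt{K+1}}, \qquad \text {attained at}\quad C=\frac{R}{\sqrt{2}\textsf{L}}, \end{aligned}$$

which is exactly the choice \(C_t\) used in the proof below, with the inductive estimate \(R = 2^{-t}R_0\).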

Theorem 3.2

(Nearly linear convergence under convexity). Fix an initial point \(x_0 \in \mathbb {R}^d\), real \( \varepsilon >0\), \( \delta \in (0, 1)\), and an upper bound \(R_0 \ge \textrm{dist}(x_0, \mathcal {X}^*)\). Define parameters

$$\begin{aligned} T = \left\lceil \log _2\left( \frac{R_0}{\varepsilon }\right) \right\rceil , \qquad K = \left\lceil \frac{8T^2\textsf{L}^2}{\delta ^2\mu ^2}\right\rceil , \qquad \alpha _0 = \frac{R_0}{\textsf{L}\sqrt{2(K+1)}}. \end{aligned}$$

Then with probability at least \( 1 - \delta \), the point \(x_{T} = \textrm{RMBA}(x_0, \alpha _0, K, T, \texttt {true})\) satisfies \( \textrm{dist}(x_T, \mathcal {X}^*) \le \varepsilon . \) Moreover, the total number of samples \(z_k\sim \mathbb {P}\) generated by the algorithm is bounded by

$$\begin{aligned} TK = \mathcal {O}\left( \frac{\textsf{L}^2}{\delta ^2\mu ^2}\log ^3\left( \frac{R_0}{\varepsilon }\right) \right) . \end{aligned}$$
Proof

In what follows, set \(C_t=\frac{R_0}{\textsf{L}\cdot 2^{t+\frac{1}{2}} }\) and note the equality \(\alpha _t=\frac{C_t}{\sqrt{K+1}}\) for every index t in Algorithm 2. Let \( E_t\) denote the event that \(\textrm{dist}(x_t, \mathcal {X}^*) \le 2^{-t} \cdot R_0\). We wish to show the inequality

$$\begin{aligned} \mathbb {P}(E_{t+1}) \ge \mathbb {P}(E_{t})- \frac{ 2^{3/2} \textsf{L}}{ \mu \sqrt{K+1}}, \end{aligned}$$
(3.2)

for all \(t\in \{0, \ldots , T\}\). To that end, observe

$$\begin{aligned} \mathbb {P}\left( E_{t+1}\right) \ge \mathbb {P}(E_{t+ 1} \mid E_t) \mathbb {P}(E_{t}). \end{aligned}$$
(3.3)

To lower bound the right-hand-side, observe by Markov’s inequality the estimate

$$\begin{aligned} \mathbb {P}(E_{t+ 1}^c \mid E_t) \le \frac{\mathbb {E}\left[ \textrm{dist}(x_{t+1}, \mathcal {X}^*)\mid E_t\right] }{2^{-(t+1)} R_0} = \frac{\mathbb {E}\left[ \textrm{dist}(x_{t+1}, \mathcal {X}^*)1_{E_t}\right] }{2^{-(t+1)} R_0 \mathbb {P}(E_t)} . \end{aligned}$$
(3.4)

Combining assumption  and Theorem 3.1, we deduce

$$\begin{aligned} \begin{aligned} \mathbb {E}\left[ \textrm{dist}(x_{t+1}, \mathcal {X}^ *)1_{E_t}\right]&\le \mathbb {E}\left[ \mu ^{-1}(f(x_{t+1}) - \min _\mathcal {X}f)1_{E_t}\right] \le \frac{\frac{1}{2}\mathbb {E}\left[ \textrm{dist}^2(x_t,\mathcal {X}^*) 1_{E_t}\right] + C^2_t\textsf{L}^2 }{\mu C_t\sqrt{K+1}}\\&\le \frac{2^{-(1+2t)} R_0^2 + C^2_t\textsf{L}^2 }{\mu C_t\sqrt{K+1}} = \frac{2^{\frac{1}{2}-t} R_0 \textsf{L}}{\mu \sqrt{K+1}}. \end{aligned} \end{aligned}$$
(3.5)

Therefore, combining (3.3), (3.4), and (3.5) we arrive at the claimed estimate

$$\begin{aligned} \mathbb {P}\left( E_{t+1}\right) \ge \left( 1 - \frac{1}{\mathbb {P}(E_t)} \cdot \frac{2^{3/2} \textsf{L}}{\mu \sqrt{K+1}}\right) \mathbb {P}(E_{t}) \ge \mathbb {P}(E_t) - \frac{2^{3/2} \textsf{L}}{\mu \sqrt{K+1}}. \end{aligned}$$

Iterating (3.2) and using the definition of T and K, we conclude that with probability

$$\begin{aligned} \mathbb {P}(E_T) \ge 1 - \frac{ 2^{3/2} T\textsf{L}}{ \mu \sqrt{K+1}}\ge 1-\delta , \end{aligned}$$

the estimate

$$\begin{aligned} \textrm{dist}(x_T,\mathcal {X}^*)\le 2^{-T} R_0\le \varepsilon , \end{aligned}$$

holds as claimed. This completes the proof. \(\square \)

3.2 Weakly convex setting

We now present the convergence guarantees for Algorithm 2 in the weakly convex setting under Assumptions (A1)–(A5). The proof of nearly linear convergence proceeds by inductively applying the following lemma, which plays the role of Theorem 3.1 in this setting. Compared to the convex setting, the weakly convex setting presents a new challenge: the region of nearly linear convergence, \(\mathcal {T}_\gamma \), is local. The iterates of Algorithm 2 must therefore be shown to never leave \(\mathcal {T}_\gamma \) with high probability. We show this through a stopping time argument in the proof of the lemma (see Sect. A.1).

Lemma 3.3

Fix real numbers \(\delta \in (0, 1)\), \(\gamma \in (0,2)\), \(\alpha >0\), and an integer \(K \in \mathbb {N}\). Let \(y_0\) be a random vector and let B denote the event \(\{ y_0 \in \mathcal {T}_{\gamma \sqrt{\delta }}\}\). Define

$$\begin{aligned} y_{K^*} = \textrm{MBA}(y_0, \alpha , K, \texttt {false}). \end{aligned}$$

Then for any \(\varepsilon > 0\), the estimate \(\textrm{dist}(y_{K^*}, \mathcal {S}) \le \varepsilon \) holds with probability at least

$$\begin{aligned} \mathbb {P}(B) - \delta - \left( \frac{\eta }{\gamma \mu }\right) ^2 K\textsf{L}^2\alpha ^2 - \frac{1}{\varepsilon } \cdot \frac{ \delta \left( \frac{\gamma \mu }{\eta }\right) ^2 + (K+1)\textsf{L}^2\alpha ^2}{(2-\gamma ) \mu (K+1) \alpha }. \end{aligned}$$

The proof of nearly linear convergence of Algorithm 2 in the weakly convex setting now follows by inductively applying Lemma 3.3.

Theorem 3.4

(Nearly linear convergence under weak convexity) Fix real numbers \( \varepsilon >0\), \( \delta _2\in (0, 1)\), \(\gamma \in (0, 2)\). Let \(R_0\) denote the initial distance estimate satisfying \(\textrm{dist}(x_0, \mathcal {S}) \le R_0\le \frac{\gamma \mu }{\eta }\). Furthermore, define algorithm parameters

$$\begin{aligned} T = \left\lceil \log _2\left( \frac{R_0}{\varepsilon }\right) \right\rceil , \qquad K = \left\lceil \frac{16T^2\textsf{L}^2}{(2-\gamma )^2\delta _2^2\mu ^2}\right\rceil , \qquad \alpha _0 = \frac{R_0}{\textsf{L}\sqrt{K+1}}. \end{aligned}$$

Then with probability at least \( 1 - (8/3)R_0^2\left( \tfrac{\eta }{\gamma \mu }\right) ^2 - \delta _2\), the point \(x_{T} = \textrm{RMBA}(x_0, \alpha _0, K, T, \texttt {false})\) satisfies \( \textrm{dist}(x_T, \mathcal {S}) \le \varepsilon . \) Moreover, the total number of samples \(z_k\sim \mathbb {P}\) generated by the algorithm is bounded by

$$\begin{aligned} TK = \mathcal {O}\left( \left( \frac{\textsf{L}}{(2-\gamma )\delta _2\mu }\right) ^2\log ^3\left( \frac{R_0}{\varepsilon }\right) \right) . \end{aligned}$$
Proof

For all \(t\), let \(E_t\) be the event \(\{\textrm{dist}(x_t, \mathcal {S}) \le 2^{-t}\cdot R_0\}\). In addition, define \(\delta _1:= R_0^2\left( \tfrac{\eta }{\gamma \mu }\right) ^2\). We claim that the inequality

$$\begin{aligned} \mathbb {P}(E_{t+1})\ge \mathbb {P}\left( E_t\right) - 2\delta _12^{-2t} - \frac{4 \textsf{L}}{(2-\gamma )\mu \sqrt{K+1}}, \end{aligned}$$

holds for all \(t\in \{0,1\ldots , T\}\). To see this, apply Lemma 3.3 with \(y_0 = x_t\), \(\varepsilon =2^{-(t+1)}R_0\), \(\delta :=\delta _12^{-2t}\), \(\alpha = 2^{-t} \alpha _0\), thereby yielding

$$\begin{aligned} \mathbb {P}(E_{t+1})&\ge \mathbb {P}\left( E_t\right) - \delta _12^{-2t} - \left( \frac{\eta }{\gamma \mu }\right) ^2 K\textsf{L}^2\alpha ^2_t - \frac{1}{\varepsilon } \cdot \frac{ \delta \left( \frac{\gamma \mu }{\eta }\right) ^2 + (K+1)\textsf{L}^2\alpha ^2_t}{(2-\gamma ) \mu (K+1) \alpha _t} \\&\ge \mathbb {P}\left( E_t\right) - 2\delta _12^{-2t} - \frac{4 \textsf{L}}{(2-\gamma )\mu \sqrt{K+1}}, \end{aligned}$$

where the last inequality follows from the definitions of \(\alpha _0\) and \(R_0\). Iterating the inequality, we conclude

$$\begin{aligned} \mathbb {P}(E_T) \ge 1 - 2\delta _1\sum _{i = 0}^{T-1} 2^{-2i} - \frac{4 \textsf{L}T}{(2-\gamma )\mu \sqrt{K+1}}\ge 1-(8/3)\delta _1-\delta _2. \end{aligned}$$
(3.6)

This completes the proof. \(\square \)

Observe that the probability of success \( 1 - (8/3)R_0^2\left( \tfrac{\eta }{\gamma \mu }\right) ^2 - \delta _2\) in Theorem 3.4 depends both on the initialization quality \(R_0\) and on \(\delta _2\). Moreover, \(\delta _2\) also appears inversely in the sample complexity \(\widetilde{O}\left( \frac{ \textsf{L}^2}{\delta _2^2 \mu ^2} \right) \). In the next section, we introduce an algorithm with probability of success independent of \(R_0\) and with sample complexity that depends only logarithmically on its success probability.

We close this section with the following remark, which shows that the results of Theorem 3.4 extend beyond the weakly convex setting.

Remark 2

(Beyond weakly convex problems). Assumptions (A3) and (A4) imply that f is \(\eta \)-weakly convex on U, meaning that the assignment \(x \mapsto f(x) + \frac{\eta }{2}\Vert x\Vert ^2\) is convex [11, Lemma 4.1]. Revisiting the proof of Lemma 3.3, however, we see that (A3) may be replaced by the following weaker assumption:

\({\overline{\mathrm {(A3)}}}\):

(Two-point accuracy) There exist \(\eta > 0\), an open convex set U containing \(\mathcal {X}\), and a measurable function \((x,y,z)\mapsto f_x(y,z)\), defined on \(U\times U\times \Omega \), satisfying

$$\begin{aligned} \mathbb {E}_{z}\left[ f_x(x,z)\right] =f(x) \qquad \forall x\in U, \end{aligned}$$

and

$$\begin{aligned} \mathbb {E}_{z}\left[ f_x(y,z)-f(y)\right] \le \frac{\eta }{2}\Vert y-x\Vert ^2\qquad \forall x \in U, y \in \hbox {argmin}_{z \in \textrm{proj}_{\mathcal {S}}(x)} f(z). \end{aligned}$$

In the case \(\mathcal {S}= \mathcal {X}^*\), this assumption requires the model to touch the function at x and to lower bound it, up to quadratic error, at its nearest minimizer. This condition does not imply that f is weakly convex.

3.3 Convergence with high probability

In this section, we show that Algorithm 5 succeeds with high probability. Throughout this section (Sect. 3.3), we impose Assumptions (A1)–(A5) with \(\mathcal {S}=\mathcal {X}^*\).

The following lemma guarantees that, for an appropriate regularization parameter \(\rho \), the proximal point of problem (2.1) at any \(y\in \mathcal {T}_{\gamma }\) lies in \(\textrm{proj}_{\mathcal {X}^*}(y)\). We present the proof in Sect. B.2.

Lemma 3.5

Fix \(\gamma \in (0, 2)\), \(\rho >\eta \), and a point \(y \in \mathcal {T}_\gamma \). Then the proximal subproblem

$$\begin{aligned} \min _{x \in \mathcal {X}} \left\{ f(x) + \frac{\rho }{2}\Vert x - y\Vert ^2\right\} \end{aligned}$$
(3.7)

is strongly convex and therefore has a unique minimizer \(\bar{y}\). Moreover, if \( \rho < \left( \tfrac{2-\gamma }{2\gamma }\right) \eta \), then the inclusion \(\bar{y}\in \textrm{proj}_{\mathcal {X}^*} (y)\) holds.

Lemma 3.5 shows that, unlike Algorithm 1, we can expect the output of Algorithm 3 to be near the minimizer \(\bar{y}_0 \in \mathcal {X}^*\) of the proximal subproblem \(f(y) + \frac{\rho }{2} \Vert y - y_0\Vert ^2\), at least with constant probability. This lemma underlies the validity of Lemma 3.6. We present its proof in Sect. B.3.

Lemma 3.6

Fix real numbers \(\delta \in (0, 1)\), \(\gamma \in (0, 2)\), \(\alpha >0\), an integer \(K \in \mathbb {N}\), and \(\rho \) satisfying \(\eta< \rho < \left( \frac{2-\gamma \sqrt{\delta } }{2\gamma \sqrt{\delta }}\right) \eta \). Choose any point \(y_0 \in \mathcal {T}_{\gamma \sqrt{\delta }}\) and set \( \bar{y}_0:= \hbox {argmin}_{x \in \mathcal {X}} \left\{ f(x) + \frac{\rho }{2} \Vert x - y_0\Vert ^2\right\} . \) Define

$$\begin{aligned} y_{K^*} = \textrm{PMBA}(y_0, \rho , \alpha , K). \end{aligned}$$

Then for all \(\varepsilon > 0\), with probability at least

$$\begin{aligned} P(\varepsilon )&:= 1 - \delta - \left( \frac{\eta }{\gamma \mu }\right) ^2K\textsf{L}^2 \alpha ^2 - \frac{1}{\varepsilon ^2} \cdot \frac{(K+1) \textsf{L}^2 \alpha + (\alpha ^{-1} + \eta ) \cdot \delta \left( \frac{\gamma \mu }{\eta }\right) ^2}{(\rho - \eta )(K+ 1)} \end{aligned}$$

we have \(\Vert y_{K^*} - \bar{y}_0\Vert \le \varepsilon \).

The next lemma shows that we can boost the success probability of the inner loop arbitrarily high at only a logarithmic cost. This is an immediate application of Lemma 3.6 and the ensemble technique described in Sect. 2.2, which is formally stated in Lemma B.1.

Corollary 3.7

Assume the setting of Lemma 3.6 and suppose \(P(\varepsilon /3) \ge 2/3\). Then for any \(\delta ' > 0\), in the regime \(M> 48 \log (1/\delta ')\), the point

$$\begin{aligned} \hat{y} = \textrm{EPMBA}(y_0, \rho , \alpha , K, M, \varepsilon /3) \end{aligned}$$

satisfies

$$\begin{aligned} \mathbb {P}(\Vert \hat{y} - \bar{y}_0\Vert \le \varepsilon )\ge 1-\delta '. \end{aligned}$$

Finally, we are ready to establish high probability convergence guarantees of Algorithm 5.

Theorem 3.8

(Linear convergence with high probability) Fix constants \(\gamma \in (0, 2)\), \(\varepsilon >0\), and \(\delta ' \in (0, 1)\). Let \(R_0\) denote the initial distance estimate satisfying \(\textrm{dist}(x_0, \mathcal {X}^*) \le R_0\le \frac{\gamma \mu }{4\eta }\). Furthermore, define algorithm parameters

$$\begin{aligned} \rho _0 = \frac{\mu }{2R_0}, \qquad \qquad \epsilon _0 = \frac{ R_0}{3}, \qquad \qquad \alpha _0 = \sqrt{\frac{R_0^2}{\textsf{L}^2(K+1)}}, \end{aligned}$$

and

$$\begin{aligned} T = \left\lceil \log _2\left( \frac{R_0}{\varepsilon }\right) \right\rceil , \qquad M = \left\lceil 48\log \left( \frac{T}{\delta '}\right) \right\rceil , \qquad K = \left\lceil \frac{c\,\textsf{L}^2}{\mu ^2}\right\rceil , \end{aligned}$$

where \(c\) is a sufficiently large numerical constant.
Then with probability at least \( 1 - \delta '\), the point

$$\begin{aligned} x_{T} = \textrm{RPMBA}(x_0, \rho _0, \alpha _0, K,\epsilon _0, M, T) \end{aligned}$$

satisfies \( \textrm{dist}(x_T, \mathcal {X}^*) \le \varepsilon . \) Moreover, the total number of samples \(z_k\sim \mathbb {P}\) generated by the algorithm is bounded by

$$\begin{aligned} MKT = \mathcal {O}\left( \left( \frac{\textsf{L}}{\mu }\right) ^2 \log \left( \frac{R_0}{\varepsilon }\right) \cdot \log \left( \frac{\log (R_0/\varepsilon )}{\delta '}\right) \right) . \end{aligned}$$
Proof

For all \(t\), let \(E_t\) be the event \(\{\textrm{dist}(x_t, \mathcal {X}^*) \le 2^{-t}\cdot R_0\}\). Our goal is to show for all \(t\in \{0, \ldots , T\}\) the estimate \(\mathbb {P}(E_t) \ge 1 - t\delta '/T\) holds.

We proceed by induction. The base case follows since \(\mathbb {P}(E_0) = 1\) by definition of \(R_0\). Now suppose that the claimed estimate \(\mathbb {P}(E_t) \ge 1 - t\delta '/T\) is true for index \(t\). We will show it remains true with t replaced by \(t+1\). We will apply Lemma 3.6 conditionally with \(y_0 = x_t\) and error tolerance \(\epsilon _t\) on the event \(E_t\). To this end, we define \(\delta _1:= R_0^2\left( \tfrac{\eta }{\gamma \mu }\right) ^2\) and set \(\rho = \rho _t\), \(\delta :=\delta _12^{-2t}\), \(\alpha = \alpha _t\). Before we apply the lemma, we verify that \(\rho \) meets its conditions, namely that \(\eta< \rho < \left( \frac{2- \gamma \sqrt{\delta _1}2^{-t}}{2\gamma \sqrt{\delta _1} 2^{-t}}\right) \eta \). Indeed, given that \(\rho = \frac{2^t}{2\gamma \sqrt{\delta _1}}\cdot \eta \) (by definition of \(\delta _1\)), the bounds follow immediately from the restrictions \(\delta _1<1/16\) and \(\gamma \le 2\). In particular, it is straightforward to verify the bound

$$\begin{aligned} \rho - \eta \ge \frac{\eta 2^{t-2}}{\gamma \sqrt{\delta _1}} = \frac{2^{t}\mu }{4R_0}. \end{aligned}$$
(3.8)

Now Lemma 3.6 yields that the random vector \( y_{K^*} = \textrm{PMBA}(x_t, \rho _t, \alpha _t, K)\) satisfies

$$\begin{aligned}&\mathbb {P}\left( \Vert y_{K^*} - \bar{y}_0\Vert \le \epsilon _t \mid E_t, y_0 = x_t\right) \\&\quad \ge 1 - \delta _12^{-2t} - \left( \frac{\eta }{\gamma \mu }\right) ^2K\textsf{L}^2 \alpha ^2 - \frac{1}{\epsilon _t^2} \cdot \frac{(K+1) \textsf{L}^2 \alpha + (\alpha ^{-1} + \eta ) \cdot \delta \left( \frac{\gamma \mu }{\eta }\right) ^2}{(\rho - \eta )(K+ 1)}\\&\quad \ge 1 - 2\delta _12^{-2t} - \frac{9}{2^{-2(t+1)} R_0^2} \cdot \frac{\textsf{L}2^{-t} R_0}{(\rho - \eta )\sqrt{K+ 1}} - \frac{36\eta }{(\rho - \eta ) (K+1)}\\&\quad \ge 1 - 2\delta _12^{-2t} - \frac{144(\textsf{L}/\mu ) }{\sqrt{K+ 1}} - \frac{144\gamma \sqrt{\delta _12^{-2t}} }{K+1} \ge 2/3, \end{aligned}$$

where the second inequality follows from (3.8), while the third inequality uses the definition of K and the bound \(\delta _1 < 1/16\) and \(\textsf{L}\ge \mu \). Therefore, since \(M \ge 48\log (T/\delta ')\), we may apply Corollary 3.7 (conditionally) to deduce

$$\begin{aligned} \mathbb {P}\left( \Vert x_{t+ 1} - \bar{y}_0\Vert \le 3\epsilon _t \mid E_t, y_0 = x_t\right) \ge 1 - \delta '/T. \end{aligned}$$

Consequently,

$$\begin{aligned} \mathbb {P}\left( \textrm{dist}(x_{t+ 1}, \mathcal {X}^*) \le 2^{-(t+ 1)}R_0 \right)&\ge \mathbb {P}\left( \textrm{dist}(x_{t+ 1}, \mathcal {X}^*) \le 2^{-(t+ 1)}R_0\mid E_t\right) \mathbb {P}(E_t) \\&\ge \mathbb {E}_{y_0}\left[ \mathbb {P}\left( \Vert x_{t+ 1} - \bar{y}_0\Vert \le 2^{-(t+ 1)}R_0\mid E_t, y_0 = x_t \right) \right] \mathbb {P}(E_t) \\&\ge (1-\delta '/T)(1-t\delta '/T) \\&\ge 1 - (t+ 1)\delta '/T, \end{aligned}$$

as desired. This completes the proof. \(\square \)

4 Consequences for statistical recovery problems

Recent work has shown that a variety of statistical recovery problems are both sharp and weakly convex. Prominent examples include robust matrix sensing [40], phase retrieval [18], blind deconvolution [8], quadratic and bilinear sensing and matrix completion [7]. In this section, we briefly comment on how our current work leads to linearly convergent streaming algorithms for robust phase retrieval and blind deconvolution problems.

4.1 Robust phase retrieval

Phase retrieval is a common task in computational science, with numerous applications including imaging, X-ray crystallography, and speech processing. In this section, we consider the real counterpart of this problem. For details and a historical account of the phase retrieval problem, see for example [6, 18, 28, 61]. Throughout this section, we fix a signal \(\bar{x} \in \mathbb {R}^d\) and consider the following measurement model.

Assumption A

(Robust Phase Retrieval) Consider random \(a \in \mathbb {R}^d\), \(\xi \in \mathbb {R}\), and \(u\in \{0, 1\}\) and the measurement model

$$\begin{aligned} b = \langle a, \bar{x}\rangle ^2 + u\cdot \xi . \end{aligned}$$

We make the following assumptions on the random data.

  1. 1.

    The variable \(u\) is independent of \(\xi \) and a. The failure probability \(p_{\textrm{fail}}\) satisfies

    $$\begin{aligned} p_{\textrm{fail}}:= \mathbb {P}(u\ne 0) < 1/2. \end{aligned}$$
  2. 2.

    The first absolute moment of \(\xi \) is finite, \(\mathbb {E}\left[ |\xi | \right] < \infty \).

  3. 3.

    There exist constants \(\tilde{\eta }, \tilde{\mu }, \tilde{\textsf{L}} > 0\) such that for all \(v, w \in \mathbb {S}^{d-1}\), we have

    $$\begin{aligned} \tilde{\mu } \le \mathbb {E}\left[ |\langle a, v\rangle \langle a, w\rangle |\right] , \qquad \sqrt{\mathbb {E}\left[ \langle a, v\rangle ^2\Vert a\Vert ^2\right] } \le \tilde{\textsf{L}}, \qquad \mathbb {E}\left[ \langle a, v\rangle ^2\right] \le \tilde{\eta }. \end{aligned}$$

Based on the above assumptions, the following theorem develops three models for the robust phase retrieval problem. We defer the proof to Sect. C.1.

Theorem 4.1

(Phase retrieval parameters). Consider the population data \(z = (a, u, \xi )\) and form the optimization problem

$$\begin{aligned} \min _{x}~f(x) = \mathbb {E}_{z}\left[ f(x, z)\right] \quad \text { where } \quad f(x, z):= | \langle a, x\rangle ^2 - b|. \end{aligned}$$

Then the sharpness property (A2) holds with \(\mathcal {S}=\mathcal {X}^*=\{\pm \bar{x}\}\) and \(\mu =(1-2p_{\textrm{fail}})\tilde{\mu } \Vert \bar{x}\Vert \). Moreover, given a measurable selection \(G(x, z) \in \partial f(x, z)\), the models

  • (subgradient) \(f^s_x(y, z) = f(x, z) + \langle G(x, z), y - x\rangle \),

  • (clipped subgradient) \(f^{cl}_x(y, z) = \max \{ f(x, z) + \langle G(x, z), y - x\rangle , 0\}\),

  • (prox-linear) \(f^{pl}_x(y, z) = | \langle a, x\rangle ^2 - b + 2\langle a, x\rangle \langle a, y - x\rangle | \)

satisfy Assumptions (A3)–(A5) with

$$\begin{aligned} \eta = 2\tilde{\eta }, \quad L(x, z) = 2|\langle a, x\rangle |\Vert a\Vert , \quad \text {and} \quad \textsf{L}\le 2\tilde{\textsf{L}} \Vert \bar{x}\Vert \left( 1 + \frac{(1-2p_{\textrm{fail}})\tilde{\mu }}{\tilde{\eta }}\right) . \end{aligned}$$
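These oracles are direct to implement. The following is a minimal Julia sketch (our own; the names are illustrative and each measurement z is packaged as the pair (a, b)):

```julia
using LinearAlgebra

# Phase retrieval oracle quantities from Theorem 4.1.
f(x, a, b) = abs(dot(a, x)^2 - b)                        # loss f(x, z)
G(x, a, b) = 2 * sign(dot(a, x)^2 - b) * dot(a, x) .* a  # G(x, z) ∈ ∂f(x, z)
L(x, a)    = 2 * abs(dot(a, x)) * norm(a)                # Lipschitz bound L(x, z)

# subgradient, clipped subgradient, and prox-linear models at basepoint x
fs(y, x, a, b)  = f(x, a, b) + dot(G(x, a, b), y - x)
fcl(y, x, a, b) = max(fs(y, x, a, b), 0.0)
fpl(y, x, a, b) = abs(dot(a, x)^2 - b + 2 * dot(a, x) * dot(a, y - x))
```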

With this theorem in hand, we deduce that on the phase retrieval problem, Algorithm 5 with subgradient, clipped subgradient, and prox-linear models converges linearly to \(\mathcal {X}^*\) with high probability, whenever the method is initialized within constant relative error of the optimal solution.

Theorem 4.2

Fix constants \(\gamma \in (0,2)\), \(\varepsilon >0\), and \(\delta '\in (0, 1)\). Consider the subgradient, clipped subgradient, and prox-linear oracles developed in Theorem 4.1. Suppose we are given a point \(x_0\) satisfying

$$\begin{aligned} \textrm{dist}(x_0, \{\pm \bar{x}\}) \le \gamma \frac{(1-2p_{\textrm{fail}})\tilde{\mu } }{8\tilde{\eta }} \cdot \Vert \bar{x}\Vert . \end{aligned}$$

Set parameters \( \rho _0, \alpha _0, K,\epsilon _0, M, T\) as in Theorem 3.8. In addition, define the iterate \(x_T= \textrm{RPMBA}(x_0, \rho _0, \alpha _0, K,\epsilon _0, M, T)\). Then with probability \(1-\delta '\), we have \(\textrm{dist}(x_T, \{\pm \bar{x}\}) \le \varepsilon \Vert \bar{x}\Vert \) after

$$\begin{aligned} O\left( \left( \frac{\tilde{\textsf{L}} \left( 1 + \frac{(1-2p_{\textrm{fail}})\tilde{\mu }}{2\tilde{\eta }}\right) }{ \tilde{\mu }(1-2p_{\textrm{fail}}) }\right) ^2\log \left( \frac{(1-2p_{\textrm{fail}})\tilde{\mu } /\tilde{\eta }}{\varepsilon }\right) \log \left( \frac{\log \left( \frac{(1-2p_{\textrm{fail}})\tilde{\mu } /\tilde{\eta }}{\varepsilon }\right) }{\delta '}\right) \right) \end{aligned}$$

stochastic subgradient, stochastic clipped subgradient, or stochastic prox-linear iterations.

We now examine Theorem 4.2 in the setting where the measurement vectors a follow a Gaussian distribution. We note, however, that the results of this section extend far beyond the Gaussian setting to heavy tailed distributions.

Example 4.1

(Gaussian setting) Let us analyze the population setting where \(a \sim N(0, I_{d\times d})\). In this case, it is straightforward to show by direct computation that

$$\begin{aligned} \tilde{\mu } \gtrsim 1; \qquad \tilde{\eta } = 1; \qquad \textsf{L}\lesssim \sqrt{d}. \end{aligned}$$
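As a quick verification (our own) of these estimates: for unit vectors v, w and \(a \sim N(0, I_{d\times d})\),

$$\begin{aligned} \mathbb {E}\left[ \langle a, v\rangle ^2\right] = 1, \qquad \mathbb {E}\left[ \langle a, v\rangle ^2\Vert a\Vert ^2\right] = d + 2, \qquad \mathbb {E}\left[ |\langle a, v\rangle \langle a, w\rangle |\right] \ge \frac{2}{\pi }, \end{aligned}$$

where the last bound uses the identity \(\mathbb {E}|gh| = \frac{2}{\pi }\left( \sqrt{1-\rho ^2}+\rho \arcsin \rho \right) \) for jointly standard Gaussian (g, h) with correlation \(\rho \); hence \(\tilde{\eta } = 1\), \(\tilde{\textsf{L}}\le \sqrt{d+2}\), and \(\tilde{\mu }\ge 2/\pi \).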

Consequently, if \(x_0 \in \mathbb {R}^d\) has error \( \textrm{dist}(x_0, \{\pm \bar{x}\}) \le c(1-2p_{\textrm{fail}}) \cdot \Vert \bar{x}\Vert , \) for some numerical constant c, then with probability \(1-\delta \), Algorithm 5 will produce a point \(x_T\) satisfying \(\textrm{dist}(x_T, \{\pm \bar{x}\}) \le \varepsilon \Vert \bar{x}\Vert \) using only

$$\begin{aligned} O\left( \frac{d}{(1-2p_{\textrm{fail}})^2} \log \left( \frac{1}{\varepsilon }\right) \log \left( \frac{\log \left( \frac{1}{\varepsilon }\right) }{\delta }\right) \right) \end{aligned}$$

samples. We note that the spectral initialization of Duchi and Ruan [18, Proposition 3] produces such a point \(x_0\) with sample complexity \(O(d(1-2p_{\textrm{fail}})^{-2})\) with high probability. Combining this spectral initialization with Algorithm 5 therefore produces a point \(x_T\) satisfying \(\textrm{dist}(x_T, \{\pm \bar{x}\}) \le \varepsilon \Vert \bar{x}\Vert \) with \(O\left( \frac{d}{(1-2p_{\textrm{fail}})^2} \log \left( \frac{1}{\varepsilon }\right) \log \left( \log \left( \frac{1}{\varepsilon }\right) /\delta \right) \right) \) samples, which is the best known sample complexity for Gaussian robust phase retrieval, up to logarithmic factors. We note that by leveraging standard concentration results, it is possible to prove similar results for empirical average minimization \(\min _x \frac{1}{m}\sum _{i=1}^m f(x,z_i)\), provided \(z_i\) are i.i.d. samples of z and the number of samples satisfies \(m \gtrsim d (1-2p_{\textrm{fail}})^{-2}\).

4.2 Robust blind deconvolution

We next apply the proposed algorithms to the blind deconvolution problem. For a detailed discussion of the problem, see for example the papers [2, 39]. Henceforth, fix integers \(d_1, d_2 \in \mathbb {N}\) and an underlying signal \((\bar{x}, \bar{y}) \in \mathbb {R}^{d_1}\times \mathbb {R}^{d_2}\). Define the quantity

$$\begin{aligned} D:= \Vert \bar{x}\Vert \Vert \bar{y}\Vert . \end{aligned}$$

Without loss of generality, we will assume \(\Vert \bar{x}\Vert = \Vert \bar{y}\Vert \). We consider the following measurement model:

Assumption B

(Robust Blind Deconvolution). Consider random \(\ell \in \mathbb {R}^{d_1}\), \(r \in \mathbb {R}^{d_2}\), \(\xi \in \mathbb {R}\), and \(u\in \{0, 1\}\) and the measurement model

$$\begin{aligned} b = \langle \ell , \bar{x}\rangle \langle r, \bar{y}\rangle + u\cdot \xi . \end{aligned}$$

We make the following assumptions on the random data.

  1. 1.

    The variable \(u\) is independent of \(\xi \), \(\ell \), and r. The failure probability \(p_{\textrm{fail}}\) satisfies

    $$\begin{aligned} p_{\textrm{fail}}:= \mathbb {P}(u\ne 0) < 1/2. \end{aligned}$$
  2. 2.

    We have \(\mathbb {E}\left[ |\xi | \right] < \infty \).

  3. 3.

    There exist constants \(\tilde{\eta }, \tilde{\mu } > 0\) such that for all \(M\in \mathbb {R}^{d_1 \times d_2}\) with \(\Vert M\Vert _F = 1\) and \(\textrm{rank}(M) \le 2\), we have

    $$\begin{aligned} \tilde{\mu }\le \mathbb {E}\left[ |\langle \ell , M r\rangle | \right] \le \tilde{\eta }. \end{aligned}$$
  4. 4.

    There exists a constant \(\tilde{\textsf{L}} > 0\) such that for all \(v \in \mathbb {S}^{d_1-1}, w \in \mathbb {S}^{d_2-1}\), we have

    $$\begin{aligned} \sqrt{\mathbb {E}\left[ \left( |\langle \ell , v\rangle | \Vert r\Vert + |\langle r, w\rangle |\Vert \ell \Vert \right) ^2\right] } \le \tilde{\textsf{L}}. \end{aligned}$$

Based on the above assumptions, the following theorem develops three models for the robust blind deconvolution problem. We defer the proof to Sect. C.2.

Theorem 4.3

(Blind deconvolution parameters). Fix a real \(\nu > 1\) and define the set:

$$\begin{aligned} \mathcal {X}= \{ (x, y) \in \mathbb {R}^{d_1 + d_2} :\Vert x\Vert \le \nu D, \Vert y\Vert \le \nu D\}. \end{aligned}$$

Consider the population data \(z = (\ell , r, u, \xi )\) and form the optimization problem

$$\begin{aligned} \min _{x,y} f(x, y):= \mathbb {E}\left[ f((x,y), z)\right] \quad \text { where } \quad f((x,y), z):= | \langle \ell , x\rangle \langle r, y\rangle - b|. \end{aligned}$$

Then the optimal solution set is \(\mathcal {X}^*= \{(\alpha \bar{x}, (1/\alpha )\bar{y}) \mid (1/\nu ) \le |\alpha | \le \nu \}\) and f satisfies the sharpness assumption (A2) with \(\mathcal {S}=\mathcal {X}^*\) and

$$\begin{aligned} \mu = \frac{\tilde{\mu } (1-2p_{\textrm{fail}})\sqrt{D}}{2\sqrt{2}(\nu + 1)}. \end{aligned}$$

Moreover, given a measurable selection \(G((x,y), z) \in \partial f((x,y), z)\), the models

  • (subgradient) \(f^s_{(x,y)}((\hat{x}, \hat{y}), z) = f((x, y), z) + \langle G((x, y), z), (\hat{x}, \hat{y}) - (x, y)\rangle \),

  • (clipped subgradient)

    $$\begin{aligned} f^{cl}_{(x, y)}((\hat{x}, \hat{y}), z) = \max \{ f((x, y), z) + \langle G((x, y), z), (\hat{x}, \hat{y}) - (x, y)\rangle , 0\}, \end{aligned}$$
  • (prox-linear)

    $$\begin{aligned} f^{pl}_{(x, y)}((\hat{x}, \hat{y}), z)= & {} | \langle \ell , x\rangle \langle r, y\rangle - (\langle \ell , \bar{x}\rangle \langle r, \bar{y}\rangle + u\cdot \xi )\\{} & {} + \langle \ell , x\rangle \langle r, \hat{y} - y\rangle + \langle r, y\rangle \langle \ell , \hat{x} -x\rangle | \end{aligned}$$

satisfy Assumptions (A3)–(A5) with

$$\begin{aligned} \eta = \tilde{\eta }, \quad L((x, y), z) = |\langle \ell , x\rangle | \Vert r\Vert + |\langle r, y\rangle |\Vert \ell \Vert , \quad \text {and} \quad \textsf{L}= \nu \tilde{\textsf{L}} \sqrt{D}. \end{aligned}$$
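As with phase retrieval, these oracles are direct to implement. A minimal Julia sketch (our own; the names are illustrative and each measurement z is packaged as the triple (ℓ, r, b)):

```julia
using LinearAlgebra

# Blind deconvolution oracle quantities from Theorem 4.3.
f(x, y, ℓ, r, b) = abs(dot(ℓ, x) * dot(r, y) - b)        # loss f((x, y), z)
# chain rule on (x, y) ↦ <ℓ, x><r, y> gives the subgradient selection
function G(x, y, ℓ, r, b)
    s = sign(dot(ℓ, x) * dot(r, y) - b)
    return (s * dot(r, y) .* ℓ, s * dot(ℓ, x) .* r)      # block (x, y) gradient
end
L(x, y, ℓ, r) = abs(dot(ℓ, x)) * norm(r) + abs(dot(r, y)) * norm(ℓ)
```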

With this theorem in hand, we deduce that on the blind deconvolution problem, Algorithm 5 with subgradient, clipped subgradient, and prox-linear models converges linearly to \(\mathcal {X}^*\) with high probability, whenever the method is initialized within constant relative error of the solution set.

Theorem 4.4

Fix constants \(\gamma \in (0,2)\), \(\varepsilon >0\), and \(\delta '\in (0, 1)\). Consider the subgradient, clipped subgradient, and prox-linear oracles developed in Theorem 4.3. Suppose we are given a pair \((x_0,y_0) \in \mathbb {R}^{d_1 + d_2}\) satisfying

$$\begin{aligned} \textrm{dist}((x_0, y_0), \mathcal {X}^*) \le \gamma \frac{\tilde{\mu } (1-2p_{\textrm{fail}})\sqrt{D}}{8\sqrt{2}(\nu + 1)\tilde{\eta }}. \end{aligned}$$

Set parameters \(\rho _0, \alpha _0, K,\epsilon _0, M, T\) as in Theorem 3.8. In addition, we define the iterate \((x_T,y_T)= \textrm{RPMBA}((x_0, y_0), \rho _0, \alpha _0, K,\epsilon _0, M, T)\). Then with probability \(1-\delta '\), we have \(\textrm{dist}((x_T,y_T), \mathcal {X}^*) \le \varepsilon \sqrt{D}\) after

$$\begin{aligned} O\left( \left( \frac{\nu ^2\tilde{\textsf{L}}}{ \tilde{\mu } (1-2p_{\textrm{fail}}) }\right) ^2\log \left( \frac{(1-2p_{\textrm{fail}})\tilde{\mu }/\tilde{\eta }}{\varepsilon }\right) \log \left( \frac{\log \left( \frac{(1-2p_{\textrm{fail}})\tilde{\mu } /\tilde{\eta }}{\varepsilon }\right) }{\delta '}\right) \right) \end{aligned}$$

stochastic subgradient, stochastic clipped subgradient, or stochastic prox-linear iterations.

We now examine Theorem 4.4 in the setting where the measurement vectors \(\ell , r\) follow a Gaussian distribution. We note, however, that the results of this section extend far beyond the Gaussian setting to heavy tailed distributions.

Example 4.2

(Gaussian setting). Let us analyze the population setting where \((\ell , r) \sim N(0, I_{(d_1 + d_2) \times (d_1 + d_2)})\). In this case, one can show by direct computation that

$$\begin{aligned} \tilde{\mu } \gtrsim 1; \qquad \eta \lesssim 1; \qquad \textsf{L}\lesssim \sqrt{d_1 + d_2}. \end{aligned}$$

Consequently, if \((x_0, y_0) \in \mathbb {R}^{d_1 + d_2}\) has error \( \textrm{dist}((x_0, y_0), \mathcal {X}^*) \le c(1-2p_{\textrm{fail}}) \cdot \sqrt{D}/\nu , \) for some numerical constant \(c>0\), then with probability \(1-\delta \), Algorithm 5 will produce a pair \((x_T, y_T)\) satisfying \(\textrm{dist}((x_T, y_T), \mathcal {X}^*) \le \varepsilon \sqrt{D}\) using only

$$\begin{aligned} O\left( \frac{\nu ^2(d_1 + d_2)}{(1-2p_{\textrm{fail}})^2} \log \left( \frac{1}{\varepsilon }\right) \log \left( \frac{\log \left( \frac{1}{\varepsilon }\right) }{\delta }\right) \right) \end{aligned}$$

samples. We note that the spectral initialization of Charisopoulos et al. [8, Theorem 5.4 and Corollary 5.5] can produce such a pair \((x_0, y_0)\) with \(\nu \le \sqrt{3}\) using \(O(\nu ^2(d_1 + d_2) (1-2p_{\textrm{fail}})^{-2})\) samples, with high probability. Combining this spectral initialization with Algorithm 5 therefore produces a pair \((x_T, y_T)\) satisfying \(\textrm{dist}((x_T, y_T), \mathcal {X}^*) \le \varepsilon \Vert \bar{x}\Vert \) with \(O\left( \frac{d_1 + d_2}{(1-2p_{\textrm{fail}})^2} \log \left( \frac{1}{\varepsilon }\right) \log \left( \log \left( \frac{1}{\varepsilon }\right) /\delta \right) \right) \) samples, which matches the best known sample complexity for Gaussian robust blind deconvolution up to logarithmic factors. We also note that, by leveraging standard concentration results, it is possible to prove similar guarantees for empirical average minimization \(\min _{(x,y)\in \mathcal {X}} \frac{1}{m}\sum _{i=1}^m f((x,y),z_i)\), provided the \(z_i\) are i.i.d. samples of z and the number of samples satisfies \(m \gtrsim (d_1 + d_2) (1-2p_{\textrm{fail}})^{-2}\).

5 Numerical experiments

We now evaluate how Algorithm 2 performs both on the statistical recovery problems of Sect. 4 and on a sparse logistic regression problem. We test the convergence behavior, sensitivity to step size, and convergence to an active manifold. While testing the algorithms, we found that Algorithms 2 and 5 perform similarly, despite Algorithm 5 having superior theoretical guarantees. Thus, we do not evaluate Algorithm 5. The problems of Sect. 4 are both convex composite losses of the form in Example 2.1. For these problems, we therefore implement all four models from Example 2.1, using the closed-form solutions developed in [11, Section 5]. For the sparse logistic regression problem, we implement the stochastic proximal gradient method and measure convergence to the support set of the optimal solution. We provide a reference implementation [9] of the methods in Julia.

5.1 Convergence behavior

In this section, we demonstrate that Algorithm 2 converges nearly linearly on the Gaussian robust phase retrieval and blind deconvolution problems of Sect. 4 for a particular dimension, noise distribution, corruption frequency, and initialization quality. In phase retrieval, we set \(d = 100\), and in blind deconvolution, we set \(d_1 = d_2 = d\). The measurements are corrupted independently with probability \(p_{\textrm{fail}}\): for phase retrieval, the corruption obeys \(\xi = \left| g \right| , \; g \sim N(0, 100)\), while for blind deconvolution, it obeys \(\xi \sim N(0, 100)\). The algorithms are all randomly initialized at a fixed distance \(R_0 > 0\) from the ground truth, which is normalized in all cases. We use Examples 4.1 and 4.2 to estimate \(\textsf{L}\), \(\eta \), and \(\mu \), and we set \(\gamma = 1\), \(R_0 = 0.25\), \(\delta _2 = \frac{1}{\sqrt{10}}\), and target accuracy \(\varepsilon = 10^{-5}\) to obtain the parameters \(T\), \(K\), and \(\alpha _0\) as in Theorem 3.4.

Figures 2 and 3 depict the convergence behavior of Algorithm 2 on robust phase retrieval and blind deconvolution problems in finite sample and streaming settings, respectively. In these plots, solid lines with markers show the mean behavior over 10 runs, while the transparent overlays show one sample standard deviation above and below the mean. In the finite-sample instances, we use \(m = 8 \cdot d\) measurements and corrupt a fixed fraction \(p_{\textrm{fail}}\) with large magnitude sparse noise; see Fig. 2. In the streaming instances, we draw a new i.i.d. sample at each iteration and corrupt it independently with probability \(p_{\textrm{fail}}\); see Fig. 3. In both figures, we plot in red the rate guaranteed by Theorem 3.4 and observe that the algorithms behave consistently with these guarantees. In the presence of noise, the algorithms all converge nearly linearly at the rate predicted by Theorem 3.4, while in the noiseless case, all except the subgradient method converge to an exact solution (modulo numerical accuracy) within far fewer iterations.

Fig. 2 Convergence behavior for \(d = 100\) with finite sample size \(m=8\cdot d\). Phase Retrieval (left column), Blind Deconvolution (right column), \(p_{\textrm{fail}}= 0.0\) (top row), \(p_{\textrm{fail}}=0.2\) (bottom row). Average over 10 runs

Fig. 3 Convergence behavior for \(d = 100\) with streaming data. Phase Retrieval (left column), Blind Deconvolution (right column), \(p_{\textrm{fail}}= 0.0\) (top row), \(p_{\textrm{fail}}=0.2\) (bottom row). Average over 10 runs

5.1.1 Experiments on large-scale problems

We next demonstrate the performance of Algorithm 2 using the subgradient model on large-scale synthetic instances of phase retrieval and blind deconvolution in the finite sample setting with \(d \in \{128 \times 128, 512 \times 512\}\). At each step, we sample a random subset of measurements of size \(S \in \{32, \sqrt{d}, d\}\). The measurement matrices are sampled from the randomized Hadamard ensemble, which consists of k vertically stacked \(d \times d\) Hadamard matrices composed with random binary masks:

$$\begin{aligned} \begin{bmatrix} H_d S_1 \\ H_d S_2 \\ \vdots \\ H_d S_k \end{bmatrix}, \quad \text {where} \; S_i = \textbf{diag}\left( \zeta _{i,j} \right) _{j=1}^d, \; \zeta _{i,j} \sim \textrm{Unif}(\{\pm 1\}) \end{aligned}$$

and \(H_d\) is the \(d \times d\) Walsh-Hadamard matrix. We fix \(k = 8\), \(p_{\textrm{fail}}= 0\), \(\gamma = 1\), and \(\delta _2 = 1/3\), and initialize our algorithm randomly at distance \(R_0 = 0.25\) from the ground truth. We estimate \(\textsf{L}\), \(\eta \), and \(\mu \) using Examples 4.1 and 4.2, adjusted to take the batch size S into account.
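Since \(H_d\) admits a fast transform, each application of the stacked operator costs \(O(kd\log d)\) time. The sketch below is a minimal Julia illustration, assuming d is a power of two and an unnormalized Walsh-Hadamard transform; it is not taken from the reference implementation [9].

```julia
# In-place fast Walsh-Hadamard transform of a length-d vector (d a power of 2).
function fwht!(v::Vector{Float64})
    d, h = length(v), 1
    while h < d
        for i in 1:2h:d, j in i:i+h-1
            a, b = v[j], v[j+h]
            v[j], v[j+h] = a + b, a - b   # butterfly update
        end
        h *= 2
    end
    return v
end

# Apply [H_d S_1; ...; H_d S_k] to x, where S_i = diag of i.i.d. +-1 signs.
hadamard_measure(x, masks) = vcat((fwht!(m .* x) for m in masks)...)

d, k = 1024, 8
masks = [rand([-1.0, 1.0], d) for _ in 1:k]
Ax = hadamard_measure(randn(d), masks)   # length k*d measurement vector
```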

We plot the distance to the solution set as a function of the number of passes over the dataset in Fig. 4. The plots verify that Algorithm 2 converges nearly linearly to a solution at a dimension-independent rate. Moreover, they indicate that using a larger batch size S does not reduce the number of passes required to reach a fixed accuracy.

Fig. 4 Convergence behavior of Algorithm 2 using the subgradient model with \(d = 128 \times 128\) (top row) and \(d = 512 \times 512\) (bottom row) and finite sample size \(m = 8d\). Phase retrieval (left column) and blind deconvolution (right column)

5.2 Sensitivity to step size

We next explore how Algorithm 2 performs when \(\alpha _0\) is misspecified. Throughout, we scale \(\alpha _0\) by \(\lambda := 2^p\) for integers p between \(-10\) and 10. For each model and scalar \(\lambda \), we run 25 trials of the algorithm and report two metrics (a sketch of the sweep protocol appears after the list):

  • We report the sample mean and standard deviation of the number of “inner” loop iterations, or samples, needed to reach accuracy \(\varepsilon = 10^{-5}\). We use the parameters of Sect. 5.1 to cap the total number of iterations, as Theorem 3.4 prescribes; this cap is depicted as a dotted line in the figure.

  • We report the sample mean and standard deviation of the distance of the final iterate to the solution set.
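The sweep itself is a short loop; a sketch of the protocol, with a hypothetical run_trial standing in for a single run of Algorithm 2 at a scaled step size, reads:

```julia
using Statistics   # mean, std

# run_trial(alpha) is a hypothetical wrapper: it runs Algorithm 2 once with
# initial step size alpha and returns the number of inner iterations needed
# to reach accuracy 1e-5 (capped as above). alpha0 is the nominal step size.
results = Dict{Int,Tuple{Float64,Float64}}()
for p in -10:10
    iters = [run_trial(alpha0 * 2.0^p) for _ in 1:25]
    results[p] = (mean(iters), std(iters))
end
```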

Figures 5 and 6 show the results for phase retrieval and blind deconvolution problems with \(d = 100\) and \(p_{\textrm{fail}}\in \{0, 0.2\}\). In these plots, solid lines with markers show the mean behavior over 25 runs, while the transparent overlays show one sample standard deviation above and below the mean. The plots show that Algorithm 2 continues to perform as predicted by Theorem 3.4 even if \(\alpha _0\) is misspecified by a few orders of magnitude.

The prox-linear, proximal point, and clipped models perform similarly in all plots. As reported in [4], the prox-linear and clipped methods produce the same iterates. The iterates produced by the stochastic proximal point and stochastic prox-linear methods are not identical, but they are practically indistinguishable. This is due to two factors: the proximal and prox-linear models agree up to an error that grows quadratically as we move away from the basepoint, and the proximal subproblems force iterates to remain near the basepoint. Running the proximal point method with a much larger step size produces iterates that differ from those of the prox-linear method, but the method then fails to converge to the specified accuracy.

Fig. 5 Sensitivity to step size for the phase retrieval problem with \(d = 100\), \(p_{\textrm{fail}}= 0.2\) (top row), \(p_{\textrm{fail}}= 0\) (bottom row). Left: average number of iterations to achieve distance \(10^{-5}\). Right: average final distance with a fixed computational budget

Fig. 6 Sensitivity to step size for the blind deconvolution problem with \(d = 100\), \(p_{\textrm{fail}}= 0.2\) (top row), \(p_{\textrm{fail}}= 0\) (bottom row). Left: average number of iterations to achieve distance \(10^{-5}\). Right: average final distance with a fixed computational budget

5.3 Activity identification

In this section, we demonstrate that Algorithm 2 converges nearly linearly to the active set of nonzero components in a sparse logistic regression problem. We model our experiment on [37, Section 6.2]. There, the authors find a sparse classifier for distinguishing between the digits 6 and 7 from the MNIST dataset of handwritten digits [36]. In this problem, we are given a set of \(N = 12183\) samples \((x_i, y_i) \in \mathbb {R}^d \times \{-1, 1\}\), representing \(28\times 28\)-pixel images of digits and their labels, and we seek a target vector \(z:= (w, b) \in \mathbb {R}^{d+1}\) such that w is sparse and \(\textrm{sign}(\langle w, x_i \rangle + b) = \textrm{sign}(y_i)\) for most i. To find z, we solve the problem

$$\begin{aligned} \min _{z}~ \frac{1}{N}\sum _{i=1}^N f(z; i)+\tau \Vert w\Vert _1, \end{aligned}$$

where each component is a logistic loss:

$$\begin{aligned} f(z; i) \equiv f(w, b; i) := \log \left( 1 + \exp \left( -y_i(\langle w, x_i \rangle + b)\right) \right) . \end{aligned}$$
(5.1)
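In code, the component loss (5.1) and its gradient with respect to \(z = (w, b)\) take the following form; this is a minimal sketch with the usual numerically stable evaluation of \(\log (1+e^t)\).

```julia
using LinearAlgebra

# Component logistic loss (5.1): log(1 + exp(-y * (<w, x> + b))).
function logistic_loss(w, b, x, y)
    t = -y * (dot(w, x) + b)
    return t > 0 ? t + log1p(exp(-t)) : log1p(exp(t))   # stable log(1 + e^t)
end

# Gradient of the component loss with respect to (w, b).
function logistic_grad(w, b, x, y)
    t = -y * (dot(w, x) + b)
    s = 1 / (1 + exp(-t))            # sigmoid(t) = e^t / (1 + e^t)
    return (-y * s .* x, -y * s)     # (grad_w, grad_b)
end
```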

We let \((\tilde{w}, \tilde{b})\) denote the minimizer of the regularized problem, which we find using the standard (deterministic) proximal gradient algorithm. Given \(\tilde{w}\), we denote its support set by \(S:= \{ i \in [d]: \left| \tilde{w}_i \right| > \epsilon \}\), where \(\epsilon \) accounts for the numerical precision of the system.

Our goal is to converge nearly linearly to the set

$$\begin{aligned} \mathcal {S}= \{ (w, b) \in \mathbb {R}^{d+1} \mid w_{S^c} = 0\}. \end{aligned}$$

To that end, we will apply Algorithm 2 with the stochastic proximal gradient model:

$$\begin{aligned} f_{z}((y, a), i ) = f(z, i) + \langle \nabla f(z, i), (y, a) - z \rangle + \tau \left\| y \right\| _1. \end{aligned}$$

Algorithm 2 equipped with this model results in the standard stochastic proximal gradient method. To apply the algorithm, we set parameters using Theorem 3.4. We set \(\gamma = 1\), \(\delta _2 = 1/\sqrt{10}\), \(\varepsilon = 10^{-5}\). We initialize \(w_0 = 0\), \(b_0 = 0\) and set \(R_0 = \Vert (\tilde{w}, \tilde{b}) - (w_0, b_0)\Vert \). We estimate \(\textsf{L}\) by the formula \( \textsf{L}:= \sqrt{\frac{1}{m} \sum _{i=1}^m \left\| x_i \right\| _2^2}. \) Finally, we estimate \(\mu \) by grid search over \((\tau , p)\) using the formula: \(\mu = \tau \cdot \sqrt{d} \cdot 2^{-p}\).
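Assuming, as is standard for model-based methods, that each update of Algorithm 2 minimizes the sampled model plus the quadratic penalty \(\frac{1}{2\alpha }\Vert \cdot - z\Vert ^2\), the step reduces to a gradient step on the sampled logistic loss followed by soft-thresholding of w (the intercept b is unregularized). A minimal sketch, reusing logistic_grad from the previous sketch:

```julia
# Soft-thresholding operator: the proximal map of t * ||.||_1.
soft_threshold(v, t) = sign.(v) .* max.(abs.(v) .- t, 0)

# One stochastic proximal gradient step on sample (x_i, y_i) with step alpha.
function prox_grad_step(w, b, x, y, tau, alpha)
    gw, gb = logistic_grad(w, b, x, y)
    return (soft_threshold(w .- alpha .* gw, alpha * tau), b - alpha * gb)
end
```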

5.3.1 Evaluation

We compare the performance of Algorithm 2 with the Regularized Dual Averaging method (RDA) [46, 65], which was shown to have favorable manifold identification properties in [37]. In our setting, the latter method solves the following subproblem:

$$\begin{aligned} (w_{t+1}, b_{t+1}) \in \hbox {argmin}_{w, b} \left\{ \left\langle \bar{g}_t, \begin{pmatrix} w \\ b \end{pmatrix} \right\rangle + \tau \left\| w \right\| _1 + \frac{\gamma }{2 \sqrt{t}} \left\| \begin{pmatrix} w \\ b \end{pmatrix} \right\| _2^2 \right\} , \end{aligned}$$
(5.2)

where \(\bar{g}_t\) in (5.2) is the running average of all stochastic gradients \(g_k:= \nabla f(z; i_k)\) sampled up to step t, and \(\gamma \) is a tunable parameter, again determined by a simple grid search. Following the discussion in [37], RDA is initialized at \((w_0, b_0) = 0\); we therefore choose the same initial point \((w_0, b_0)\) for both methods. In addition to RDA, we also compared against the standard stochastic proximal gradient method, equipped with a range of polynomially decaying step sizes of the form

$$\begin{aligned} \lambda _k:= c k^{-p}, \; p \in \{1/2, 2/3, 3/4, 1\}. \end{aligned}$$

We found that the stochastic proximal gradient method performed comparably with RDA in all metrics, and therefore chose to omit it below.
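For completeness, the RDA subproblem (5.2) separates coordinatewise and is solved in closed form by the same soft-thresholding operator as above; a minimal sketch, reusing soft_threshold from the previous sketch:

```julia
# Closed-form solution of (5.2): minimizing <g, v> + tau*|v| + (c/2)*v^2
# componentwise gives v = -soft_threshold(g, tau) / c with c = gamma / sqrt(t);
# the intercept b carries no l1 penalty.
function rda_update(gbar_w, gbar_b, tau, gamma, t)
    c = gamma / sqrt(t)
    return (-soft_threshold(gbar_w, tau) ./ c, -gbar_b / c)
end
```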

The convergence plots in Fig. 7 confirm that the iterates of Algorithm 2 converge to the set \(\mathcal {S}\) at a nearly linear rate, while the function values converge at a sublinear rate. In contrast, the iterates generated by RDA converge sublinearly in both metrics.

Fig. 7 Performance of RMBA vs. RDA on the sparse logistic regression problem. Left: function value gap. Right: distance to the support set and the solution found by the proximal gradient method