1 Introduction

Stochastic (sub)gradient methods form the algorithmic core of much of modern statistical and machine learning. Such algorithms are typically sensitive to their parameters and require extensive step size tuning to achieve adequate performance. Classical tuning strategies decay the step size polynomially and lead to optimal sublinear rates of convergence on convex and strongly convex problems

$$\begin{aligned} \min _{x\in \mathcal {X}}~ f(x)=\mathbb {E}_{z\sim \mathbb {P}}\left[ f(x,z)\right] , \end{aligned}$$

where the loss functions \(f(\cdot ,z)\) are convex and \(\mathcal {X}\) is a closed convex set [45]. An alternative schedule, called geometric step decay, decreases the step size geometrically by halving it after every few epochs. In recent work [66], geometric step decay was shown to improve exponentially upon classical sublinear rates under the sharpness assumption:

$$\begin{aligned} f(x)- \min _{\mathcal {X}} f\ge \mu \cdot \textrm{dist}(x,\mathcal {X}^*) \qquad \forall x\in \mathcal {X}, \end{aligned}$$
(1.1)

where \(\mu >0\) is some constant and \(\mathcal {X}^*=\hbox {argmin}_{x\in \mathcal {X}} f(x)\) is the solution set. This result complements early works of Goffin [27] and Shor [62], which show that deterministic subgradient methods converge linearly on sharp convex functions if their step sizes decay geometrically. The work [66] also reveals a departure from the smooth strongly convex setting, where deterministic linear rates degrade to sublinear rates when the gradient is corrupted by noise [45].

Beyond the convex setting, sharp problems appear often in nonconvex statistical recovery problems, for example, in robust matrix sensing [40], phase retrieval [18, 20], blind deconvolution [8], and quadratic/bilinear sensing and matrix completion [7]. For such problems, sharpness is surprisingly common and corresponds to strong identifiability of the statistical model. In such settings, sharpness endows standard deterministic first-order algorithms with rapid local convergence guarantees, enabling the recovery of signals at optimal or near-optimal sample complexity in a variety of nonconvex problems. Despite this, we do not know whether stochastic algorithms equipped with geometric step decay—or any other step size schedule—linearly converge on sharp nonconvex problems. Such algorithms, if available, could pave the way for new sample efficient strategies for these and other statistical recovery problems.

The main result of this work shows that for a large class of stochastic, sharp, nonsmooth, and nonconvex problems, a geometric step decay schedule endows well-known algorithms with a local nearly linear rate of convergence to minimizers.

This guarantee takes hold as soon as the algorithms are initialized in a small neighborhood around the set of minimizers and applies, for example, to the stochastic projected subgradient, proximal point, and prox-linear algorithms. Beyond sharp growth (1.1), we also analyze losses that grow sharply away from some closed set \(\mathcal {S}\), which is strictly larger than \(\mathcal {X}^*\). Such sets \(\mathcal {S}\) are akin to “active manifolds” in the sense of Lewis [38] and Wright [64]. For example, the loss \(f(x,y)=x^2+|y|\) is not sharp relative to its minimizer, but is sharp relative to the x-axis. For these problems, our algorithms converge nearly linearly to the set \(\mathcal {S}\). Finally, we illustrate the result with two statistical recovery problems: phase retrieval and blind deconvolution. For these recovery tasks, our results match the best known computational and sample complexity guarantees under Gaussian measurement models and establish new guarantees under heavy-tailed distributions.

1.1 Related work

Our paper is closely related to a number of influential techniques in stochastic, convex, and nonlinear optimization. We now survey these related topics.

Stochastic model-based methods. In this work, we use algorithms that iteratively sample and minimize simple stochastic convex models of the loss function. Throughout, we call these methods model-based algorithms. Such algorithms include the stochastic projected subgradient, prox-linear, and proximal point methods. Stochastic model-based algorithms are known to converge globally to stationary points at a sublinear rate on a large class of nonsmooth and nonconvex problems [11, 19]. Some model-based algorithms also possess superior stability properties and can be less sensitive to step size choice than traditional stochastic subgradient methods [3, 4].

Geometrically decaying learning rate in deterministic optimization. Polynomially decaying step sizes are common in stochastic optimization [35, 53, 55]. In contrast, we develop algorithms with step sizes that decay geometrically. Geometrically decaying step sizes were first analyzed in convex optimization by Shor [62, Thm 2.7, Sec. 2.3] and Goffin [27]. This step size schedule is also closely related to the step size rules of Eremin [21] and Polyak [52]. Similar schedules are known to accelerate convex subgradient methods under Hölder growth, as shown in [32, 67]. Geometrically decaying step sizes for deterministic subgradient methods under weak convexity were systematically studied in [12]. The method of this work succeeds under similar assumptions on the population objective as in [12]: sharpness, Lipschitz continuity, and weak convexity. In contrast to [12], which analyzes a deterministic subgradient method (based on linear approximations of the objective), the method of this work requires only stochastic estimates of the objective and converges under a wider family of nonlinear approximations of the objective (see Example 2.1). Surprisingly, the method only incurs an additional cost of \(\log ^2(1/\varepsilon )\) (stochastic) oracle evaluations, which arises due to our restart scheme. It is an intriguing open question whether one can remove this additional dependence on \(\log ^2(1/\varepsilon )\).

Geometric step decay in stochastic optimization. The geometric step decay schedule is common in practice: for example, see Krizhevsky et al. [33] and He et al. [29]. It is a standard option in popular deep learning libraries such as PyTorch [49] and TensorFlow [1]. Geometric step decay has been analyzed in a number of recent papers in stochastic convex optimization, including [5, 25, 26, 34, 66, 68]. Among these papers, the work [66] relates most to ours. There, the authors propose two geometric step decay strategies that converge linearly on convex functions that are sharp and have bounded stochastic subgradients. These algorithms either use a moving ball constraint or follow a proximal-point type procedure. We follow the latter strategy, too, at least to obtain high-probability guarantees. The paper [66] differs from our work in that they assume convexity and a uniform bound on the stochastic subgradients. In contrast, we do not assume convexity and only assume that stochastic subgradients have a finite second moment. We are aware of only one subgradient method for stochastic nonconvex problems that converges linearly [3] under favorable assumptions. In [3], the authors develop a “clipped” subgradient method, which resembles a safeguarded stochastic Polyak step. In contrast to our work, their algorithms converge linearly only under “perfect interpolation,” meaning that all terms in the expectation share a minimizer. Formally, the interpolation condition therein stipulates that, almost surely over z,

$$\begin{aligned} \inf _{x} f(x, z) = f(x_{\star }, z), \quad \text {for any }x_{\star } \in \arg \min _{x} \mathbb {E}[f(x, z)]. \end{aligned}$$

We do not make this assumption here. Indeed, the statistical recovery problems from Sect. 4 do not satisfy the interpolation condition in the presence of corruptions.

Restarts in deterministic optimization. Restart techniques have a long history in nonlinear programming, for example in conjugate gradient and limited memory quasi-Newton methods. They have also been used more recently to improve the complexity of algorithms in deterministic convex optimization. For example, restart schemes can accelerate sublinear convergence rates for convex problems that satisfy growth conditions, as shown by Nesterov [47], Nemirovskii and Nesterov [44], Renegar [54], O’Donoghue and Candes [48], Roulet and d’Aspremont [58], Freund and Lu [24], Fercoq and Qu [22, 23], and others. In the nonconvex setting, stochastic restart methods are challenging to analyze, since the region of linear convergence is local. To overcome this challenge, one must bound the probability that the iterates leave this region. One of our main technical contributions is a technique for bounding this probability.

Finite sums. For finite sums, stochastic algorithms that converge linearly are more common. For example, for finite sums that are sharp and convex, Bertsekas and Nedić [43] prove that an incremental Polyak-type algorithm converges linearly. For finite sums that are smooth and strongly convex, variance reduced methods, such as SAG [59], SAGA [14], SDCA [60], SVRG [31], MISO/Finito [15, 41], SMART [10] and their proximal extensions converge linearly. The algorithms we develop here do not assume a finite sum structure.

Verifying sharpness. Sharp growth is a central assumption in this work. This property is surprisingly common in statistical recovery problems. For example, sharp growth has been established for robust matrix sensing [40], phase retrieval [18, 20], blind deconvolution [8], quadratic and bilinear sensing and matrix completion [7] problems. Consequently, the results of this paper apply in these settings.

1.2 Notation

We will mostly follow standard notation used in convex analysis and stochastic optimization. Throughout, the symbol \(\mathbb {R}^d\) will denote a d-dimensional Euclidean space with the inner product \(\langle \cdot ,\cdot \rangle \) and the induced norm \(\Vert x\Vert =\sqrt{\langle x,x\rangle }\). We denote the open ball of radius \(\varepsilon >0\) around a point \(x\in \mathbb {R}^d\) by the symbol \(B_{\varepsilon }(x)\); moreover, we write \(\mathbb {S}^{d-1}\) for the unit sphere in \(\mathbb {R}^d\). Given an event A, we write \(A^c\) for its complement. We also let \(1_{A}\) denote the indicator function of A, meaning a function that evaluates to one on A and zero off it. For any set \(Q\subset \mathbb {R}^d\), the distance function and the projection map are defined by

$$\begin{aligned} \textrm{dist}(x,Q):=\inf _{y\in Q} \Vert y-x\Vert \qquad \text {and}\qquad \textrm{proj}_Q(x):=\hbox {argmin}_{y\in Q} \Vert y-x\Vert , \end{aligned}$$

respectively. Consider a function \(f:\mathbb {R}^d\rightarrow \mathbb {R}\cup \{\pm \infty \}\) and a point x, with f(x) finite. The Fréchet subdifferential of f at x, denoted by \(\partial f(x)\), consists of all vectors \(v\in \mathbb {R}^d\) satisfying

$$\begin{aligned} f(y)\ge f(x)+\langle v,y-x\rangle +o(\Vert y-x\Vert )\qquad \text {as }y\rightarrow x. \end{aligned}$$

A function f is called \(\rho \)-weakly convex on an open convex set U if the perturbed function \(f+\frac{\rho }{2}\Vert \cdot \Vert ^2\) is convex on U. The subgradients of such functions automatically satisfy the uniform approximation property (see [56]):

$$\begin{aligned} f(y)\ge f(x)+\langle v,y-x\rangle -\frac{\rho }{2}\Vert y-x\Vert ^2\qquad \text {for all }x,y\in U, v\in \partial f(x). \end{aligned}$$
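For instance (a quick check of this definition, using the function that appears in Fig. 1 below): the loss \(f(x)=|x^2-1|\) is 2-weakly convex on \(\mathbb {R}\), since

$$\begin{aligned} f(x)+\frac{2}{2}x^2=\max \{x^2-1,\,1-x^2\}+x^2=\max \{2x^2-1,\,1\}, \end{aligned}$$

which is convex as a pointwise maximum of convex functions.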

2 Algorithms, assumptions, and main results

In this section, we formalize our target problem and introduce algorithms to solve it. We then outline our main results. The complete theorem statements and proofs appear in Sect. 3. Throughout, we consider the minimization problem

$$\begin{aligned} \min _{x \in \mathcal {X}}~ f(x) \end{aligned}$$
(2.1)

for some function \(f:\mathbb {R}^d\rightarrow \mathbb {R}\) and a closed convex set \(\mathcal {X}\subseteq \mathbb {R}^d\). We define the set \(\mathcal {X}^*:=\hbox {argmin}_{x\in \mathcal {X}} f(x)\) and assume it to be nonempty. We also fix a probability space \((\Omega ,\mathcal {F},\mathbb {P})\), equip \(\mathbb {R}^d\) with the Borel \(\sigma \)-algebra, and make the following assumption.

  1. (A1)

    (Sampling) It is possible to generate i.i.d. realizations \(z_1,z_2, \ldots \sim \mathbb {P}\).

The algorithms we develop rely on a stochastic oracle model for problem (2.1), recently introduced in [11]. These algorithms assume access to a family of functions \({f_x(\cdot ,z)}\)—called stochastic models—indexed by basepoints \(x\in \mathbb {R}^d\) and random elements \(z\sim \mathbb {P}\). Given these models, the generic stochastic model-based algorithm of [11] iterates the steps:

$$\begin{aligned} \begin{aligned}&\text {Sample } z_k\sim \mathbb {P};\\&\text {Set } y_{k+1} = \hbox {argmin}_{y\in \mathcal {X}}~ \left\{ f_{y_k}(y,z_k) + \frac{1}{2\alpha } \Vert y - y_k\Vert ^2\right\} . \end{aligned} \end{aligned}$$
(2.2)

In this work, model-based algorithms form the core of the following restart strategy: given inner and outer loop sizes \(K\) and \(T\), respectively, as well as initial stepsize \(\alpha _0 > 0 \), perform:

$$\begin{aligned} \begin{aligned}&\text {For }t= 0, \ldots , T-1:\\&\hspace{20pt} {\text {Initialize }y_0 = x_t, \alpha = 2^{-t}\alpha _0,\text { and run }K\text { iterations of~(2.2);}}\\&\hspace{20pt} \text {Sample }x_{t+ 1}\text { uniformly from } y_0, \ldots , y_K. \end{aligned} \end{aligned}$$
(2.3)

This restart strategy is common in machine learning practice and is called geometric step decay. Restart schemes date back to the fundamental work of Nesterov [47] and Nemirovskii and Nesterov [44] and more recently appear in [22, 23, 25, 48, 54, 58, 66]. These strategies often improve the convergence guarantees of the algorithm they restart under growth assumptions, for example, by boosting an algorithm that converges sublinearly to one that converges linearly. In this work, we will show that restart scheme (2.3) similarly improves (2.2) for a large class of weakly convex stochastic optimization problems.
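To make the scheme concrete, here is a minimal Julia sketch (our own; the function names, the demo loss \(f(x)=\mathbb {E}|\langle a,x\rangle |\), and all parameter values are illustrative assumptions, not the tuned settings of our theorems) of the update (2.2) with a subgradient model, wrapped in the restart loop (2.3):

```julia
using LinearAlgebra, Random

# Inner loop: K steps of the model-based update (2.2) with the subgradient
# model, which reduces to a stochastic subgradient step when X = R^d.
function mba_subgradient(y0, α, K, sample, subgrad)
    y = copy(y0)
    iters = [copy(y)]
    for _ in 1:K
        z = sample()
        g = subgrad(y, z)                  # g ∈ ∂f_y(y, z)
        y = y .- α .* g                    # minimizer of the model plus proximal term
        push!(iters, copy(y))
    end
    return iters[rand(1:length(iters))]    # sample uniformly from y_0, ..., y_K
end

# Outer loop: the geometric step decay restart scheme (2.3).
function rmba(x0, α0, K, T, sample, subgrad)
    x = copy(x0)
    for t in 0:T-1
        x = mba_subgradient(x, α0 / 2^t, K, sample, subgrad)
    end
    return x
end

# Demo on the sharp convex loss f(x) = E|<a, x>|, a ~ N(0, I), whose unique
# minimizer is x* = 0; the loss and all parameter values are illustrative.
Random.seed!(0)
d = 10
sample() = randn(d)
subgrad(x, a) = sign(dot(a, x)) .* a       # stochastic subgradient of |<a, x>|
x = rmba(randn(d), 0.1, 500, 10, sample, subgrad)
@show norm(x)                              # distance to the minimizer x* = 0
```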

2.1 Assumptions

In this section, we formalize our assumptions on sharp growth of (2.1) as well as on accuracy, regularity, and Lipschitz continuity of the models.

2.1.1 Sharp growth

We assume that \(f(\cdot )\) grows sharply as x moves in the direction normal to a closed set \(\mathcal {S}\).

  1. (A2)

    (Sharpness) There exists a constant \(\mu > 0\) and a closed set \(\mathcal {S}\subseteq \mathcal {X}\) satisfying \( \mathcal {X}^*\subseteq \mathcal {S}\) such that the following bound holds:

    $$\begin{aligned} f(x) - \inf _{z \in \textrm{proj}_{\mathcal {S}}(x)}f(z) \ge \mu \cdot \textrm{dist}(x, \mathcal {S}) \qquad \forall x \in \mathcal {X}. \end{aligned}$$
    (2.4)

This property generalizes the classical sharp growth condition (1.1), where \(\mathcal {S}=\mathcal {X}^*\). The setting \(\mathcal {S}= \mathcal {X}^*\) is well-studied in nonlinear programming and often underlies rapid convergence guarantees for deterministic local search algorithms. Beyond the classical setting, \(\mathcal {S}\) could be a sublevel set of f or an “active manifold” in the sense of Lewis [38]; see Sect. D. When \(\mathcal {S}\ne \mathcal {X}^*\), we design stochastic algorithms that do not necessarily converge to global minimizers, but instead nearly linearly converge to \(\mathcal {S}\) with high probability. Finally, we note that when \(\mathcal {S}= \mathcal {X}^{*}\) the results of this paper remain valid if (2.4) only holds in a neighborhood of the solution set.

2.1.2 Accuracy and regularity

We assume that the models are convex and under-approximate f up to quadratic error.

  1. (A3)

    (One-sided accuracy) There exist \(\eta > 0\), an open convex set U containing \(\mathcal {X}\), and a measurable function \((x,y,z)\mapsto f_x(y,z)\), defined on \(U\times U\times \Omega \), satisfying

    $$\begin{aligned} \mathbb {E}_{z}\left[ f_x(x,z)\right] =f(x) \qquad \forall x\in U, \end{aligned}$$

    and

    $$\begin{aligned} \mathbb {E}_{z}\left[ f_x(y,z)-f(y)\right] \le \frac{\eta }{2}\Vert y-x\Vert ^2\qquad \forall x,y\in U. \end{aligned}$$
  2. (A4)

    (Convex models) The function \(f_x(\cdot ,z)\) is convex \(\forall x\in U\) and a.e. \(z\in \Omega \).

Models satisfying (A3) and (A4), and their algorithmic implications, were analyzed in [11, Assumption B]; a closely related family of models was investigated in [3, 4]. Assumptions (A3) and (A4) imply that f is \(\eta \)-weakly convex on U, meaning that the assignment \(x \mapsto f(x) + \frac{\eta }{2}\Vert x\Vert ^2\) is convex [11, Lemma 4.1]. While we assume (A3) throughout the paper, we show in Remark 2 that our results hold under an even weaker assumption. For certain losses, models that satisfy (A3) and (A4) are easy to construct, as the following example shows.

Example 2.1

(Convex Composite Class) Stochastic convex composite losses take the form

$$\begin{aligned} f(x)=\mathbb {E}_{z} f(x,z)\quad \text {with}\quad f(x,z)=h(c(x,z),z), \end{aligned}$$

where \(h(\cdot , z)\) are Lipschitz and convex and the nonlinear maps \(c(\cdot ,z)\) are \(C^1\)-smooth with Lipschitz Jacobian. Such losses appear often in data science and signal processing (see [7, 16, 19] and references therein). For this problem class, natural models include

  • (subgradient)        \(f_x(y,z)=f(x,z)+\langle \nabla c(x, z)^\top v,y-x\rangle \) for any \(v\in \partial h(c(x, z), z)\).

  • (prox-linear)         \(f_x(y,z)=h(c(x,z)+\nabla c(x,z)(y-x),z).\)

  • (proximal point)   \(f_x(y,z)=f(y,z) + \frac{\eta }{2} \Vert x - y \Vert ^2\),

where \(\eta \) is large enough to guarantee the proximal model is convex. If a lower bound \(\ell (z)\) on \(\inf f(\cdot ,z)\) is known, one can also choose a clipped model

  • (clipped)              \(\tilde{f}_x(y, z) = \max \{ f_x(y,z), \ell (z)\} \),

for any of the models \( f_x(\cdot ,z)\) above, as was suggested by [3]. Intuitively, models that better approximate \(f(x,z)\) are likely to perform better in practice; see [3] for theoretical evidence supporting this claim. Fig. 1 contains a graphical illustration of one-sided models. \(\square \)
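As a concrete illustration, the following minimal Julia sketch (our own; the helper names and the restriction to a scalar inner map \(c(\cdot ,z)\) are simplifying assumptions) builds the four models of this example as closures:

```julia
using LinearAlgebra

# Models for a composite loss f(x, z) = h(c(x, z), z) with scalar inner map.
# subgradient model, with v ∈ ∂h(c(x, z), z):
model_sub(h, c, ∇c, v, x, z) = y -> h(c(x, z), z) + v * dot(∇c(x, z), y - x)
# prox-linear model: linearize only the inner map c
model_pl(h, c, ∇c, x, z) = y -> h(c(x, z) + dot(∇c(x, z), y - x), z)
# proximal point model, with η large enough to convexify f(·, z) + (η/2)‖x − ·‖²
model_pp(f, η, x, z) = y -> f(y, z) + η / 2 * norm(x - y)^2
# clipped model, with ℓ(z) a known lower bound on inf f(·, z)
model_clip(model, ℓ, z) = y -> max(model(y), ℓ(z))

# usage on the robust phase retrieval loss of Sect. 4:
# h(t, z) = |t|, c(x, z) = <a, x>^2 - b, so f(x, z) = |<a, x>^2 - b|
a, b = randn(3), 1.0
h(t, z) = abs(t)
c(x, z) = dot(a, x)^2 - b
∇c(x, z) = 2 * dot(a, x) .* a
x0 = randn(3)
fpl = model_pl(h, c, ∇c, x0, nothing)   # prox-linear model anchored at x0
@show fpl(x0)                           # equals f(x0, z) at the basepoint
```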

Fig. 1: Illustration of one-sided models: \(f(x)=|x^2-1|\), \(f_{0.5}(y)=|1.25-y|\) (figure adapted from [11])

2.1.3 Lipschitz continuity

We assume that the models are Lipschitz on a tube surrounding \(\mathcal {S}\).

  1. (A5)

    (Lipschitz property) Define the tube

    $$\begin{aligned} \mathcal {T}_{\gamma }:= \left\{ x \in \mathcal {X}\mid \textrm{dist}(x, \mathcal {S}) \le \frac{\gamma \mu }{\eta }\right\} \qquad \forall \gamma > 0. \end{aligned}$$

    We assume that there exists a measurable function \(L :\mathbb {R}^d \times \Omega \rightarrow \mathbb {R}_+\) such that

    $$\begin{aligned} \min _{v\in \partial f_x(x,z)}\Vert v\Vert \le L(x,z) \end{aligned}$$
    (2.5)

    for all \(x\in \mathcal {T}_2\) and a.e. \(z\in \Omega \). Moreover, we assume there exists \(\textsf{L}> 0\) such that

    $$\begin{aligned} \sup _{x \in \mathcal {T}_2} \sqrt{\mathbb {E}_{z}\left[ L(x, z)^2\right] } \le \textsf{L}. \end{aligned}$$

This Lipschitz property is local and differs from the global assumption of [11, Assumption B4]. The property holds only in \(\mathcal {T}_2\) since our algorithms will be initialized in this tube and will never leave it with high probability. The local nature of this property is crucial to signal recovery applications, for example, blind deconvolution and phase retrieval. In these problems, global Lipschitz continuity does not hold; see Sect. 4.

Remark 1

(Finite-sums) While our results are stated in terms of the streaming model, they are also applicable to the finite-sum setting, where our loss function is in the form

$$\begin{aligned} f(x) = \frac{1}{m} \sum _{i=1}^m f(x, i). \end{aligned}$$

Indeed, setting \(\mathbb {P}= \textrm{Unif}(\{1, \dots , m\})\), so that \(\mathbb {P}(z_k = i) = \frac{1}{m}\) for all \(k \in \mathbb {N}\) and \(i \in [m]\), we recover

$$\begin{aligned} \mathbb {E}_{z}\left[ f(x, z)\right] = \frac{1}{m} \sum _{i=1}^m f(x, i) = f(x). \end{aligned}$$

2.2 Algorithms and results

Stochastic model-based algorithms (Algorithm 1) iteratively sample and minimize stochastic convex models of the loss function. When equipped with models satisfying (A3) and (A4) and a global Lipschitz condition, these algorithms converge to stationary points of (2.1) at a sublinear rate [11, 19]. In this section, we show that such sublinear rates can be improved to local linear (or nearly linear) rates by using the restart strategy outlined in (2.2)–(2.3). We introduce two such strategies that succeed with probability \(1-\delta \), for any \(\delta > 0\). The first (Algorithm 2) allows for arbitrary sets \(\mathcal {S}\) in Assumption (A2), but its sample complexity and initialization region scale poorly in \(\delta \). The second (Algorithm 5) assumes \(\mathcal {S}= \mathcal {X}^*\), but has much better dependence on \(\delta \).

Algorithm 1: \(\textrm{MBA}(y_0, \alpha , K, \texttt {is\_conv})\)

Algorithm 2: \(\textrm{RMBA}(x_0, \alpha _0, K, T, \texttt {is\_conv})\)

Given Assumptions (A1)–(A5), the following theorem shows that the first restart strategy (Algorithm 2—Restarted Model Based Algorithm (RMBA)) converges nearly linearly to \(\mathcal {S}\) with high probability.

Theorem 2.1

(Informal) Fix a target accuracy \(\varepsilon >0\), failure probability \(\delta \in (0,\frac{1}{4})\), and a point \(x_0\in \mathcal {T}_{\gamma \sqrt{\delta }}\) for some \(\gamma \in (0,1)\). Then with appropriate parameter settings, the point \(x=\textrm{RMBA}(x_0, \alpha _0, K, T, \texttt {false})\) will satisfy \(\textrm{dist}(x,\mathcal {S})<\varepsilon \) with probability at least \(1-4\delta \). Moreover, the number of samples \(z_i\sim \mathbb {P}\) generated by the algorithm is at most

$$\begin{aligned} \mathcal {O}\left( \left( \frac{ \textsf{L}}{\delta \mu }\right) ^2 \log ^3 \left( \frac{\gamma \mu /\eta }{\varepsilon }\right) \right) . \end{aligned}$$

Theorem 2.1 has interesting consequences not only for convergence to global minimizers, but also for “active manifold identification.” For example, when \(\mathcal {S}= \mathcal {X}^*\), Theorem 2.1 shows that with constant probability, Algorithm 2 converges nearly linearly to the true solution set. When \(\mathcal {S}\ne \mathcal {X}^*\) and is instead an “active manifold” in the sense of Lewis [38], Algorithm 2 nearly linearly converges to the active manifold. In our numerical evaluation, we illustrate this phenomenon for a sparse logistic regression problem. We empirically observe that the method converges linearly to the support of the solution, even though the overall convergence to the true solution may be sublinear.

In our numerical experiments, we find that Algorithm 2 succeeds for a wide variety of parameter settings, which may not necessarily satisfy the conditions set forth by the theory. Theorem 2.1, on the other hand, only guarantees Algorithm 2 succeeds with high probability when we greatly increase its sample complexity and initialize it close to \(\mathcal {S}\). We would like to boost Algorithm 2 into a new algorithm whose sample complexity and initialization requirements scale only polylogarithmically in \(1/\delta \). As a first attempt, we discuss the following two probabilistic techniques, both of which have limitations:

  • (Markov) One approach is to call Algorithm 2 multiple times for a moderately small value \(\delta \) and pick out the “best” iterate from the batch. This approach is flawed since, even in the convex setting, there is no procedure to test which iterate is “best” without increasing sample complexity.

  • (Ensemble) An alternative approach is based on a well-known resampling trick, which applies when \(\mathcal {S}=\{\bar{x}\}\) is a singleton set [45, p. 243], [30, 63, Algorithm 1]: Run m trials of Algorithm 2 with any fixed \(\delta <1/4\), and denote the returned points by \(\{ x_i\}_{i=1}^m\). Then with high probability, the majority of the points \(\{ x_i\}_{i=1}^m\) will be close to \(\bar{x}\). Finally, to find an estimate near \(\bar{x}\), choose any point that has at least m/2 other points close to it (a minimal implementation sketch of this selection rule appears below).
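The selection step of the ensemble approach admits a very short implementation. The following is a minimal Julia sketch (our own; the names are illustrative): given the outputs of m independent trials and a tolerance, it returns any point that at least half of the trials agree with.

```julia
using LinearAlgebra

# Ensemble selection: return any candidate that has at least m/2
# candidates (including itself) within distance ϵ of it.
function ensemble_select(xs::Vector{<:AbstractVector}, ϵ::Real)
    m = length(xs)
    for x in xs
        if count(y -> norm(y - x) <= ϵ, xs) >= m / 2
            return x
        end
    end
    return nothing   # no majority cluster: too many trials failed
end
```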

The ensemble technique is promising, but it requires \(\mathcal {S}\) to be a singleton. This limits its applicability since many low-rank recovery problems (e.g., blind deconvolution, matrix completion, robust PCA) have uncountably many solutions. We overcome this issue by embedding Algorithm 2 and the ensemble method within a proximal-point method. At each stage of this algorithm, we run multiple copies of a stochastic model-based method on a quadratically regularized problem that has a unique solution. Among those copies, we use the ensemble technique to pick out a “successful” iterate. We summarize the resulting nested procedure in Algorithms 3–5: Algorithm 3 (Proximal Model-Based Algorithm—PMBA) is a generic model-based algorithm applied to a quadratically regularized problem; Algorithm 4 (Ensemble PMBA—EPMBA) calls Algorithm 3 as suggested by the ensemble technique; finally, Algorithm 5 (Restarted PMBA—RPMBA) updates the regularization term, in the style of a proximal point method.

Algorithm 3: \(\textrm{PMBA}(y_0, \rho , \alpha , K)\)

Algorithm 4: \(\textrm{EPMBA}(y_0, \rho , \alpha , K, m, \epsilon )\)

Algorithm 5: \(\textrm{RPMBA}(x_0,\rho _0, \alpha _0, K,\epsilon _0 , M, T)\)

We will establish the following guarantee. In the theorem, we assume \(\mathcal {S}=\mathcal {X}^*\).

Theorem 2.2

(Informal) Fix a target accuracy \(\varepsilon >0\), failure probability \(\delta \in (0,1)\), and a point \(x_0\in \mathcal {T}_{\gamma }\) for some \(\gamma \in (0,\frac{1}{4})\). Then with appropriate parameter settings, the point \(x=\textrm{RPMBA}(x_0, \rho _0, \alpha _0, K,\varepsilon _0, M, T)\) will satisfy \(\textrm{dist}(x,\mathcal {X}^*)<\varepsilon \) with probability at least \(1-\delta \). Moreover the total number of samples \(z_i\sim \mathbb {P}\) generated by the algorithm is at most

$$\begin{aligned} \mathcal {O}\left( \left( \frac{ \textsf{L}}{ \mu }\right) ^2 \log \left( \tfrac{\gamma \mu /\eta }{\varepsilon }\right) \cdot \log \left( \frac{\log (\frac{\gamma \mu /\eta }{\varepsilon })}{\delta }\right) \right) . \end{aligned}$$

Theorem 2.2 resolves the initialization and sample complexity issues of Theorem 2.1. Incidentally, its claimed sample complexity also depends more favorably on \(\varepsilon \) and on the problem parameters \(\mu \) and \(\eta \). Theorem 2.2 is new in the weakly convex setting and also improves on prior work by Xu et al. [66] for convex problems. There, the results require stochastic subgradients to be almost surely bounded, hence, sub-Gaussian. In contrast, Theorem 2.2 guarantees that \(\textrm{dist}(x, \mathcal {X}^*) <\varepsilon \) with high probability assuming only the local second moment bound (A5).

3 Proofs of main results

In this section, we establish high-probability nearly linear convergence guarantees for Algorithm 2 and linear convergence guarantees for Algorithm 5—the main contributions of this work. Throughout this section, we assume that Assumptions (A1)–(A5) hold.

3.1 Warm-up: convex setting

We begin with a short proof of nearly linear convergence for Algorithm 2 in the convex setting. We use this simplified case to explain the general proof strategy and point out the difficulty of extending the argument to the weakly convex setting. Since we restrict ourselves to the convex setting, throughout this section (Sect. 3.1) we suppose:

  • Assumption (A3) holds with \(\eta = 0\) and Assumption (A2) holds with \(\mathcal {S}= \mathcal {X}^*\).

  • The models \(f_x(\cdot , z)\) are \(L(z)\)-Lipschitz on \(\mathbb {R}^d\) for all x, where \(L:\Omega \rightarrow \mathbb {R}\) is a measurable function satisfying \(\sqrt{\mathbb {E}_{z}\left[ L( z)^2\right] } \le \textsf{L}\).

In particular, the tube \(\mathcal {T}_2\) is the entire space \(\mathcal {T}_2 = \mathbb {R}^d\), which alleviates the main difficulty of the weakly convex setting. The proof of convergence relies on the following known sublinear convergence guarantee for Algorithm 1.

Theorem 3.1

([11, Theorem 4.1]). Fix an initial point \(y_0 \in \mathbb {R}^d\) and let \(\alpha = \frac{C}{\sqrt{K+1}}\) for some \(C>0\). Then for any index \(K\in \mathbb {N}\), the point \(y=\textrm{MBA}(y_0, \alpha , K, \texttt {true})\) satisfies

$$\begin{aligned} \mathbb {E}\left[ f\left( y\right) - \min _{\mathcal {X}} f\right] \le \frac{\tfrac{1}{2}\textrm{dist}^2(y_0, \mathcal {X}^*) + C^2\textsf{L}^2 }{C\sqrt{K+1}}. \end{aligned}$$
(3.1)

The proof of nearly linear convergence now follows by iteratively applying Theorem 3.1 with a carefully chosen parameter \(C > 0\). The key idea of the proof is much the same as in the deterministic setting [44, 47, 58]. The proof proceeds by induction on the outer iteration counter \(t\). At the start of each outer iteration, we choose C to minimize the ratio in Equation (3.1), taking into account an inductive estimate on the initial square error \(\textrm{dist}^2(y_0, \mathcal {X}^*)\). We then run the inner loop until the estimate decreases by a fixed fraction. This strategy differs from the deterministic setting in only one way: since the output of the inner loop is random, we extract a bound on the initial distance using Markov’s inequality.
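For concreteness, the minimization over C in the ratio from (3.1) is elementary: writing \(R:=\textrm{dist}(y_0, \mathcal {X}^*)\), we have

$$\begin{aligned} \min _{C>0}~ \frac{\tfrac{1}{2}R^2 + C^2\textsf{L}^2}{C\sqrt{K+1}} = \frac{\sqrt{2}R\textsf{L}}{\sqrt{K+1}}, \qquad \text {attained at}\quad C=\frac{R}{\sqrt{2}\textsf{L}}, \end{aligned}$$

which is exactly the choice \(C_t\) used in the proof below, with the inductive estimate \(R = 2^{-t}R_0\).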

Theorem 3.2

(Nearly linear convergence under convexity). Fix an initial point \(x_0 \in \mathbb {R}^d\), real \( \varepsilon >0\), \( \delta \in (0, 1)\), and an upper bound \(R_0 \ge \textrm{dist}(x_0, \mathcal {X}^*)\). Define parameters

$$\begin{aligned} T = \left\lceil \log _2\left( \frac{R_0}{\varepsilon }\right) \right\rceil , \qquad K = \left\lceil \frac{8T^2\textsf{L}^2}{\delta ^2\mu ^2}\right\rceil , \qquad \alpha _0 = \frac{R_0}{\textsf{L}\sqrt{2(K+1)}}. \end{aligned}$$

Then with probability at least \( 1 - \delta \), the point \(x_{T} = \textrm{RMBA}(x_0, \alpha _0, K, T, \texttt {true})\) satisfies \( \textrm{dist}(x_T, \mathcal {X}^*) \le \varepsilon . \) Moreover, the total number of samples \(z_k\sim \mathbb {P}\) generated by the algorithm is bounded by

$$\begin{aligned} TK = \mathcal {O}\left( \frac{\textsf{L}^2}{\delta ^2\mu ^2}\log ^3\left( \frac{R_0}{\varepsilon }\right) \right) . \end{aligned}$$
Proof

In what follows, set \(C_t=\frac{R_0}{\textsf{L}\cdot 2^{t+\frac{1}{2}} }\) and note the equality \(\alpha _t=\frac{C_t}{\sqrt{K+1}}\) for every index t in Algorithm 2. Let \( E_t\) denote the event that \(\textrm{dist}(x_t, \mathcal {X}^*) \le 2^{-t} \cdot R_0\). We wish to show the inequality

$$\begin{aligned} \mathbb {P}(E_{t+1}) \ge \mathbb {P}(E_{t})- \frac{ 2^{3/2} \textsf{L}}{ \mu \sqrt{K+1}}, \end{aligned}$$
(3.2)

for all \(t\in \{0, \ldots , T\}\). To that end, observe

$$\begin{aligned} \mathbb {P}\left( E_{t+1}\right) \ge \mathbb {P}(E_{t+ 1} \mid E_t) \mathbb {P}(E_{t}). \end{aligned}$$
(3.3)

To lower bound the right-hand-side, observe by Markov’s inequality the estimate

$$\begin{aligned} \mathbb {P}(E_{t+ 1}^c \mid E_t) \le \frac{\mathbb {E}\left[ \textrm{dist}(x_{t+1}, \mathcal {X}^*)\mid E_t\right] }{2^{-(t+1)} R_0} = \frac{\mathbb {E}\left[ \textrm{dist}(x_{t+1}, \mathcal {X}^*)1_{E_t}\right] }{2^{-(t+1)} R_0 \mathbb {P}(E_t)} . \end{aligned}$$
(3.4)

Combining assumption  and Theorem 3.1, we deduce

$$\begin{aligned} \begin{aligned} \mathbb {E}\left[ \textrm{dist}(x_{t+1}, \mathcal {X}^ *)1_{E_t}\right]&\le \mathbb {E}\left[ \mu ^{-1}(f(x_{t+1}) - \min _\mathcal {X}f)1_{E_t}\right] \le \frac{\frac{1}{2}\mathbb {E}\left[ \textrm{dist}^2(x_t,\mathcal {X}^*) 1_{E_t}\right] + C^2_t\textsf{L}^2 }{\mu C_t\sqrt{K+1}}\\&\le \frac{2^{-(1+2t)} R_0^2 + C^2_t\textsf{L}^2 }{\mu C_t\sqrt{K+1}} = \frac{2^{\frac{1}{2}-t} R_0 \textsf{L}}{\mu \sqrt{K+1}}. \end{aligned} \end{aligned}$$
(3.5)

Therefore, combining (3.3), (3.4), and (3.5) we arrive at the claimed estimate

$$\begin{aligned} \mathbb {P}\left( E_{t+1}\right) \ge \left( 1 - \frac{1}{\mathbb {P}(E_t)} \cdot \frac{2^{3/2} \textsf{L}}{\mu \sqrt{K+1}}\right) \mathbb {P}(E_{t}) \ge \mathbb {P}(E_t) - \frac{2^{3/2} \textsf{L}}{\mu \sqrt{K+1}}. \end{aligned}$$

Iterating (3.2) and using the definition of T and K, we conclude that with probability

$$\begin{aligned} \mathbb {P}(E_T) \ge 1 - \frac{ 2^{3/2} T\textsf{L}}{ \mu \sqrt{K+1}}\ge 1-\delta , \end{aligned}$$

the estimate

$$\begin{aligned} \textrm{dist}(x_T,\mathcal {X}^*)\le 2^{-T} R_0\le \varepsilon , \end{aligned}$$

holds as claimed. This completes the proof. \(\square \)

3.2 Weakly convex setting

We now present the convergence guarantees for Algorithm 2 in the weakly convex setting under Assumptions (A1)–(A5). The proof of nearly linear convergence proceeds by inductively applying the following lemma, which plays the role of Theorem 3.1 in this setting. Compared to the convex setting, the weakly convex setting presents a new challenge: the region of nearly linear convergence, \(\mathcal {T}_\gamma \), is local. The iterates of Algorithm 2 must therefore be shown to never leave \(\mathcal {T}_\gamma \) with high probability. We show this through a stopping time argument in the proof of the lemma (see Sect. A.1).

Lemma 3.3

Fix real numbers \(\delta \in (0, 1)\), \(\gamma \in (0,2)\), \(\alpha >0\), and an integer \(K \in \mathbb {N}\). Let \(y_0\) be a random vector and let B denote the event \(\{ y_0 \in \mathcal {T}_{\gamma \sqrt{\delta }}\}\). Define

$$\begin{aligned} y_{K^*} = \textrm{MBA}(y_0, \alpha , K, \texttt {false}). \end{aligned}$$

Then for any \(\varepsilon > 0\), the estimate \(\textrm{dist}(y_{K^*}, \mathcal {S}) \le \varepsilon \) holds with probability at least

$$\begin{aligned} \mathbb {P}(B) - \delta - \left( \frac{\eta }{\gamma \mu }\right) ^2 K\textsf{L}^2\alpha ^2 - \frac{1}{\varepsilon } \cdot \frac{ \delta \left( \frac{\gamma \mu }{\eta }\right) ^2 + (K+1)\textsf{L}^2\alpha ^2}{(2-\gamma ) \mu (K+1) \alpha }. \end{aligned}$$

The proof of nearly linear convergence of Algorithm 2 in the weakly convex setting now follows by inductively applying Lemma 3.3.

Theorem 3.4

(Nearly linear convergence under weak convexity) Fix real numbers \( \varepsilon >0\), \( \delta _2\in (0, 1)\), \(\gamma \in (0, 2)\). Let \(R_0\) denote the initial distance estimate satisfying \(\textrm{dist}(x_0, \mathcal {S}) \le R_0\le \frac{\gamma \mu }{\eta }\). Furthermore, define algorithm parameters

$$\begin{aligned} T = \left\lceil \log _2\left( \frac{R_0}{\varepsilon }\right) \right\rceil , \qquad K = \left\lceil \frac{16T^2\textsf{L}^2}{(2-\gamma )^2\delta _2^2\mu ^2}\right\rceil , \qquad \alpha _0 = \frac{R_0}{\textsf{L}\sqrt{K+1}}. \end{aligned}$$

Then with probability at least \( 1 - (8/3)R_0^2\left( \tfrac{\eta }{\gamma \mu }\right) ^2 - \delta _2\), the point \(x_{T} = \textrm{RMBA}(x_0, \alpha _0, K, T, \texttt {false})\) satisfies \( \textrm{dist}(x_T, \mathcal {S}) \le \varepsilon . \) Moreover, the total number of samples \(z_k\sim \mathbb {P}\) generated by the algorithm is bounded by

$$\begin{aligned} TK = \mathcal {O}\left( \left( \frac{\textsf{L}}{(2-\gamma )\delta _2\mu }\right) ^2\log ^3\left( \frac{R_0}{\varepsilon }\right) \right) . \end{aligned}$$
Proof

For all \(t\), let \(E_t\) be the event \(\{\textrm{dist}(x_t, \mathcal {S}) \le 2^{-t}\cdot R_0\}\). In addition, define \(\delta _1:= R_0^2\left( \tfrac{\eta }{\gamma \mu }\right) ^2\). We claim that the inequality

$$\begin{aligned} \mathbb {P}(E_{t+1})\ge \mathbb {P}\left( E_t\right) - 2\delta _12^{-2t} - \frac{4 \textsf{L}}{(2-\gamma )\mu \sqrt{K+1}}, \end{aligned}$$

holds for all \(t\in \{0,1\ldots , T\}\). To see this, apply Lemma 3.3 with \(y_0 = x_t\), \(\varepsilon =2^{-(t+1)}R_0\), \(\delta :=\delta _12^{-2t}\), \(\alpha = 2^{-t} \alpha _0\), thereby yielding

$$\begin{aligned} \mathbb {P}(E_{t+1})&\ge \mathbb {P}\left( E_t\right) - \delta _12^{-2t} - \left( \frac{\eta }{\gamma \mu }\right) ^2 K\textsf{L}^2\alpha ^2_t - \frac{1}{\varepsilon } \cdot \frac{ \delta \left( \frac{\gamma \mu }{\eta }\right) ^2 + (K+1)\textsf{L}^2\alpha ^2_t}{(2-\gamma ) \mu (K+1) \alpha _t} \\&\ge \mathbb {P}\left( E_t\right) - 2\delta _12^{-2t} - \frac{4 \textsf{L}}{(2-\gamma )\mu \sqrt{K+1}}, \end{aligned}$$

where the last inequality follows from the definitions of \(\alpha _0\) and \(R_0\). Iterating the inequality, we conclude

$$\begin{aligned} \mathbb {P}(E_T) \ge 1 - 2\delta _1\sum _{i = 0}^{T-1} 2^{-2i} - \frac{4 \textsf{L}T}{(2-\gamma )\mu \sqrt{K+1}}\ge 1-(8/3)\delta _1-\delta _2. \end{aligned}$$
(3.6)

This completes the proof. \(\square \)

Observe that the probability of success \( 1 - (8/3)R_0^2\left( \tfrac{\eta }{\gamma \mu }\right) ^2 - \delta _2\) in Theorem 3.4 depends both on the initialization quality \(R_0\) and on \(\delta _2\). Moreover, \(\delta _2\) also appears inversely in the sample complexity \(\widetilde{O}\left( \frac{ \textsf{L}^2}{\delta _2^2 \mu ^2} \right) \). In the next section, we introduce an algorithm with probability of success independent of \(R_0\) and with sample complexity that depends only logarithmically on its success probability.

We close this section with the following remark, which shows that the results of Theorem 3.4 extend beyond the weakly convex setting.

Remark 2

(Beyond weakly convex problems). Assumptions (A3) and (A4) imply that f is \(\eta \)-weakly convex on U, meaning that the assignment \(x \mapsto f(x) + \frac{\eta }{2}\Vert x\Vert ^2\) is convex [11, Lemma 4.1]. Revisiting the proof of Lemma 3.3, however, we see that (A3) may be replaced by the following weaker assumption:

\({\overline{\mathrm {(A3)}}}\):

(Two-point accuracy) There exist \(\eta > 0\), an open convex set U containing \(\mathcal {X}\), and a measurable function \((x,y,z)\mapsto f_x(y,z)\), defined on \(U\times U\times \Omega \), satisfying

$$\begin{aligned} \mathbb {E}_{z}\left[ f_x(x,z)\right] =f(x) \qquad \forall x\in U, \end{aligned}$$

and

$$\begin{aligned} \mathbb {E}_{z}\left[ f_x(y,z)-f(y)\right] \le \frac{\eta }{2}\Vert y-x\Vert ^2\qquad \forall x \in U, y \in \hbox {argmin}_{z \in \textrm{proj}_{\mathcal {S}}(x)} f(z). \end{aligned}$$

In the case \(\mathcal {S}= \mathcal {X}^*\), this assumption requires the model to touch the function at x and to lower bound it, up to quadratic error, at its nearest minimizer. This condition does not imply that f is weakly convex.

3.3 Convergence with high probability

In this section, we show that Algorithm 5 succeeds with high probability. Throughout this section (Sect. 3.3), we impose Assumptions (A1)–(A5) with \(\mathcal {S}=\mathcal {X}^*\).

The following lemma guarantees that, for an appropriate regularization parameter \(\rho \), the proximal point of problem (2.1) at any \(y\in \mathcal {T}_{\gamma }\) lies in \(\textrm{proj}_{\mathcal {X}^*}(y)\). We present the proof in Sect. B.2.

Lemma 3.5

Fix \(\gamma \in (0, 2)\), \(\rho >\eta \), and a point \(y \in \mathcal {T}_\gamma \). Then the proximal subproblem

$$\begin{aligned} \min _{x \in \mathcal {X}} \left\{ f(x) + \frac{\rho }{2}\Vert x - y\Vert ^2\right\} \end{aligned}$$
(3.7)

is strongly convex and therefore has a unique minimizer \(\bar{y}\). Moreover, if \( \rho < \left( \tfrac{2-\gamma }{2\gamma }\right) \eta \), then the inclusion \(\bar{y}\in \textrm{proj}_{\mathcal {X}^*} (y)\) holds.

Lemma 3.5 shows that, unlike Algorithm 1, we can expect the output of Algorithm 3 to be near the minimizer \(\bar{y}_0 \in \mathcal {X}^*\) of the proximal subproblem \(f(y) + \frac{\rho }{2} \Vert y - y_0\Vert ^2\), at least with constant probability. This lemma underlies the validity of Lemma 3.6. We present its proof in Sect. B.3.

Lemma 3.6

Fix real numbers \(\delta \in (0, 1)\), \(\gamma \in (0, 2)\), \(\alpha >0\), an integer \(K \in \mathbb {N}\), and \(\rho \) satisfying \(\eta< \rho < \left( \frac{2-\gamma \sqrt{\delta } }{2\gamma \sqrt{\delta }}\right) \eta \). Choose any point \(y_0 \in \mathcal {T}_{\gamma \sqrt{\delta }}\) and set \( \bar{y}_0:= \hbox {argmin}_{x \in \mathcal {X}} \left\{ f(x) + \frac{\rho }{2} \Vert x - y_0\Vert ^2\right\} . \) Define

$$\begin{aligned} y_{K^*} = \textrm{PMBA}(y_0, \rho , \alpha , K). \end{aligned}$$

Then for all \(\varepsilon > 0\), with probability at least

$$\begin{aligned} P(\varepsilon )&:= 1 - \delta - \left( \frac{\eta }{\gamma \mu }\right) ^2K\textsf{L}^2 \alpha ^2 - \frac{1}{\varepsilon ^2} \cdot \frac{(K+1) \textsf{L}^2 \alpha + (\alpha ^{-1} + \eta ) \cdot \delta \left( \frac{\gamma \mu }{\eta }\right) ^2}{(\rho - \eta )(K+ 1)} \end{aligned}$$

we have \(\Vert y_{K^*} - \bar{y}_0\Vert \le \varepsilon \).

The next lemma shows that we can boost the success probability of the inner loop arbitrarily high at only a logarithmic cost. This is an immediate application of Lemma 3.6 and the ensemble technique described in Sect. 2.2, which is formally stated in Lemma B.1.

Corollary 3.7

Assume the setting of Lemma 3.6 and suppose \(P(\varepsilon /3) \ge 2/3\). Then for any \(\delta ' > 0\), in the regime \(M> 48 \log (1/\delta ')\), the point

$$\begin{aligned} \hat{y} = \textrm{EPMBA}(y_0, \rho , \alpha , K, M, \varepsilon /3) \end{aligned}$$

satisfies

$$\begin{aligned} \mathbb {P}(\Vert \hat{y} - \bar{y}_0\Vert \le \varepsilon )\ge 1-\delta '. \end{aligned}$$

Finally, we are ready to establish high probability convergence guarantees of Algorithm 5.

Theorem 3.8

(Linear convergence with high probability) Fix constants \(\gamma \in (0, 2)\), \(\varepsilon >0\), and \(\delta ' \in (0, 1)\). Let \(R_0\) denote the initial distance estimate satisfying \(\textrm{dist}(x_0, \mathcal {X}^*) \le R_0\le \frac{\gamma \mu }{4\eta }\). Furthermore, define algorithm parameters

$$\begin{aligned} \rho _0 = \frac{\mu }{2R_0}, \qquad \qquad \epsilon _0 = \frac{ R_0}{3}, \qquad \qquad \alpha _0 = \sqrt{\frac{R_0^2}{\textsf{L}^2(K+1)}}, \end{aligned}$$

and

$$\begin{aligned} T = \left\lceil \log _2\left( \frac{R_0}{\varepsilon }\right) \right\rceil , \qquad M = \left\lceil 48\log \left( \frac{T}{\delta '}\right) \right\rceil , \qquad K = \left\lceil \frac{c\,\textsf{L}^2}{\mu ^2}\right\rceil , \end{aligned}$$

where \(c\) is a sufficiently large numerical constant.
Then with probability at least \( 1 - \delta '\), the point

$$\begin{aligned} x_{T} = \textrm{RPMBA}(x_0, \rho _0, \alpha _0, K,\epsilon _0, M, T) \end{aligned}$$

satisfies \( \textrm{dist}(x_T, \mathcal {X}^*) \le \varepsilon . \) Moreover, the total number of samples \(z_k\sim \mathbb {P}\) generated by the algorithm is bounded by

$$\begin{aligned} MKT = \mathcal {O}\left( \left( \frac{\textsf{L}}{\mu }\right) ^2 \log \left( \frac{R_0}{\varepsilon }\right) \cdot \log \left( \frac{\log (R_0/\varepsilon )}{\delta '}\right) \right) . \end{aligned}$$
Proof

For all \(t\), let \(E_t\) be the event \(\{\textrm{dist}(x_t, \mathcal {X}^*) \le 2^{-t}\cdot R_0\}\). Our goal is to show for all \(t\in \{0, \ldots , T\}\) the estimate \(\mathbb {P}(E_t) \ge 1 - t\delta '/T\) holds.

We proceed by induction. The base case follows since \(\mathbb {P}(E_0) = 1\) by definition of \(R_0\). Now suppose that the claimed estimate \(\mathbb {P}(E_t) \ge 1 - t\delta '/T\) is true for index \(t\). We will show it remains true with t replaced by \(t+1\). We will apply Lemma 3.6 conditionally with \(y_0 = x_t\) and error tolerance \(\epsilon _t\) on the event \(E_t\). To this end, we define \(\delta _1:= R_0^2\left( \tfrac{\eta }{\gamma \mu }\right) ^2\) and set \(\rho = \rho _t\), \(\delta :=\delta _12^{-2t}\), \(\alpha = \alpha _t\). Before we apply the lemma, we verify that \(\rho \) meets its conditions, namely that \(\eta< \rho < \left( \frac{2- \gamma \sqrt{\delta _1}2^{-t}}{2\gamma \sqrt{\delta _1} 2^{-t}}\right) \eta \). Indeed, given that \(\rho = \frac{2^t}{2\gamma \sqrt{\delta _1}}\cdot \eta \) (by definition of \(\delta _1\)), the bounds follow immediately from the restrictions \(\delta _1<1/16\) and \(\gamma \le 2\). In particular, it is straightforward to verify the bound

$$\begin{aligned} \rho - \eta \ge \frac{\eta 2^{t-2}}{\gamma \sqrt{\delta _1}} = \frac{2^{t}\mu }{4R_0}. \end{aligned}$$
(3.8)

Now Lemma 3.6 yields that the random vector \( y_{K^*} = \textrm{PMBA}(x_t, \rho _t, \alpha _t, K)\) satisfies

$$\begin{aligned}&\mathbb {P}\left( \Vert y_{K^*} - \bar{y}_0\Vert \le \epsilon _t \mid E_t, y_0 = x_t\right) \\&\quad \ge 1 - \delta _12^{-2t} - \left( \frac{\eta }{\gamma \mu }\right) ^2K\textsf{L}^2 \alpha ^2 - \frac{1}{\epsilon _t^2} \cdot \frac{(K+1) \textsf{L}^2 \alpha + (\alpha ^{-1} + \eta ) \cdot \delta \left( \frac{\gamma \mu }{\eta }\right) ^2}{(\rho - \eta )(K+ 1)}\\&\quad \ge 1 - 2\delta _12^{-2t} - \frac{9}{2^{-2(t+1)} R_0^2} \cdot \frac{\textsf{L}2^{-t} R_0}{(\rho - \eta )\sqrt{K+ 1}} - \frac{36\eta }{(\rho - \eta ) (K+1)}\\&\quad \ge 1 - 2\delta _12^{-2t} - \frac{144(\textsf{L}/\mu ) }{\sqrt{K+ 1}} - \frac{144\gamma \sqrt{\delta _12^{-2t}} }{K+1} \ge 2/3, \end{aligned}$$

where the second inequality follows from (3.8), while the third inequality uses the definition of K and the bound \(\delta _1 < 1/16\) and \(\textsf{L}\ge \mu \). Therefore, since \(M \ge 48\log (T/\delta ')\), we may apply Corollary 3.7 (conditionally) to deduce

$$\begin{aligned} \mathbb {P}\left( \Vert x_{t+ 1} - \bar{y}_0\Vert \le 3\epsilon _t \mid E_t, y_0 = x_t\right) \ge 1 - \delta '/T. \end{aligned}$$

Consequently,

$$\begin{aligned} \mathbb {P}\left( \textrm{dist}(x_{t+ 1}, \mathcal {X}^*) \le 2^{-(t+ 1)}R_0 \right)&\ge \mathbb {P}\left( \textrm{dist}(x_{t+ 1}, \mathcal {X}^*) \le 2^{-(t+ 1)}R_0\mid E_t\right) \mathbb {P}(E_t) \\&\ge \mathbb {E}_{y_0}\left[ \mathbb {P}\left( \Vert x_{t+ 1} - \bar{y}_0\Vert \le 2^{-(t+ 1)}R_0\mid E_t, y_0 = x_t \right) \right] \mathbb {P}(E_t) \\&\ge (1-\delta '/T)(1-t\delta '/T) \\&\ge 1 - (t+ 1)\delta '/T, \end{aligned}$$

as desired. This completes the proof. \(\square \)

4 Consequences for statistical recovery problems

Recent work has shown that a variety of statistical recovery problems are both sharp and weakly convex. Prominent examples include robust matrix sensing [40], phase retrieval [18], blind deconvolution [8], quadratic and bilinear sensing and matrix completion [7]. In this section, we briefly comment on how our current work leads to linearly convergent streaming algorithms for robust phase retrieval and blind deconvolution problems.

4.1 Robust phase retrieval

Phase retrieval is a common task in computational science, with numerous applications including imaging, X-ray crystallography, and speech processing. In this section, we consider the real counterpart of this problem. For details and a historical account of the phase retrieval problem, see for example [6, 18, 28, 61]. Throughout this section, we fix a signal \(\bar{x} \in \mathbb {R}^d\) and consider the following measurement model.

Assumption A

(Robust Phase Retrieval) Consider random \(a \in \mathbb {R}^d\), \(\xi \in \mathbb {R}\), and \(u\in \{0, 1\}\) and the measurement model

$$\begin{aligned} b = \langle a, \bar{x}\rangle ^2 + u\cdot \xi . \end{aligned}$$

We make the following assumptions on the random data.

  1. 1.

    The variable \(u\) is independent of \(\xi \) and a. The failure probability \(p_{\textrm{fail}}\) satisfies

    $$\begin{aligned} p_{\textrm{fail}}:= \mathbb {P}(u\ne 0) < 1/2. \end{aligned}$$
  2. 2.

    The first absolute moment of \(\xi \) is finite, \(\mathbb {E}\left[ |\xi | \right] < \infty \).

  3. 3.

    There exist constants \(\tilde{\eta }, \tilde{\mu }, \tilde{\textsf{L}} > 0\) such that for all \(v, w \in \mathbb {S}^{d-1}\), we have

    $$\begin{aligned} \tilde{\mu } \le \mathbb {E}\left[ |\langle a, v\rangle \langle a, w\rangle |\right] , \qquad \sqrt{\mathbb {E}\left[ \langle a, v\rangle ^2\Vert a\Vert ^2\right] } \le \tilde{\textsf{L}}, \qquad \mathbb {E}\left[ \langle a, v\rangle ^2\right] \le \tilde{\eta }. \end{aligned}$$

Based on the above assumptions, the following theorem develops three models for the robust phase retrieval problem. We defer the proof to Sect. C.1.

Theorem 4.1

(Phase retrieval parameters). Consider the population data \(z = (a, u, \xi )\) and form the optimization problem

$$\begin{aligned} \min _{x}~f(x) = \mathbb {E}_{z}\left[ f(x, z)\right] \quad \text { where } \quad f(x, z):= | \langle a, x\rangle ^2 - b|. \end{aligned}$$

Then the sharpness property (A2) holds with \(\mathcal {S}=\mathcal {X}^*=\{\pm \bar{x}\}\) and \(\mu =(1-2p_{\textrm{fail}})\tilde{\mu } \Vert \bar{x}\Vert \). Moreover, given a measurable selection \(G(x, z) \in \partial f(x, z)\), the models

  • (subgradient) \(f^s_x(y, z) = f(x, z) + \langle G(x, z), y - x\rangle \),

  • (clipped subgradient) \(f^{cl}_x(y, z) = \max \{ f(x, z) + \langle G(x, z), y - x\rangle , 0\}\),

  • (prox-linear) \(f^{pl}_x(y, z) = | \langle a, x\rangle ^2 - b + 2\langle a, x\rangle \langle a, y - x\rangle | \)

satisfy Assumptions (A3)–(A5) with

$$\begin{aligned} \eta = 2\tilde{\eta }, \quad L(x, z) = 2|\langle a, x\rangle |\Vert a\Vert , \quad \text {and} \quad \textsf{L}\le 2\tilde{\textsf{L}} \Vert \bar{x}\Vert \left( 1 + \frac{(1-2p_{\textrm{fail}})\tilde{\mu }}{\tilde{\eta }}\right) . \end{aligned}$$
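These oracles are direct to implement. The following is a minimal Julia sketch (our own; the names are illustrative and each measurement z is packaged as the pair (a, b)):

```julia
using LinearAlgebra

# Phase retrieval oracle quantities from Theorem 4.1.
f(x, a, b) = abs(dot(a, x)^2 - b)                        # loss f(x, z)
G(x, a, b) = 2 * sign(dot(a, x)^2 - b) * dot(a, x) .* a  # G(x, z) ∈ ∂f(x, z)
L(x, a)    = 2 * abs(dot(a, x)) * norm(a)                # Lipschitz bound L(x, z)

# subgradient, clipped subgradient, and prox-linear models at basepoint x
fs(y, x, a, b)  = f(x, a, b) + dot(G(x, a, b), y - x)
fcl(y, x, a, b) = max(fs(y, x, a, b), 0.0)
fpl(y, x, a, b) = abs(dot(a, x)^2 - b + 2 * dot(a, x) * dot(a, y - x))
```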

With this theorem in hand, we deduce that on the phase retrieval problem, Algorithm 5 with subgradient, clipped subgradient, and prox-linear models converges linearly to \(\mathcal {X}^*\) with high probability, whenever the method is initialized within constant relative error of the optimal solution.

Theorem 4.2

Fix constants \(\gamma \in (0,2)\), \(\varepsilon >0\), and \(\delta '\in (0, 1)\). Consider the subgradient, clipped subgradient, and prox-linear oracles developed in Theorem 4.1. Suppose we are given a point \(x_0\) satisfying

$$\begin{aligned} \textrm{dist}(x_0, \{\pm \bar{x}\}) \le \gamma \frac{(1-2p_{\textrm{fail}})\tilde{\mu } }{8\tilde{\eta }} \cdot \Vert \bar{x}\Vert . \end{aligned}$$

Set parameters \( \rho _0, \alpha _0, K,\epsilon _0, M, T\) as in Theorem 3.8. In addition, define the iterate \(x_T= \textrm{RPMBA}(x_0, \rho _0, \alpha _0, K,\epsilon _0, M, T)\). Then with probability \(1-\delta '\), we have \(\textrm{dist}(x_T, \{\pm \bar{x}\}) \le \varepsilon \Vert \bar{x}\Vert \) after

$$\begin{aligned} O\left( \left( \frac{\tilde{\textsf{L}} \left( 1 + \frac{(1-2p_{\textrm{fail}})\tilde{\mu }}{2\tilde{\eta }}\right) }{ \tilde{\mu }(1-2p_{\textrm{fail}}) }\right) ^2\log \left( \frac{(1-2p_{\textrm{fail}})\tilde{\mu } /\tilde{\eta }}{\varepsilon }\right) \log \left( \frac{\log \left( \frac{(1-2p_{\textrm{fail}})\tilde{\mu } /\tilde{\eta }}{\varepsilon }\right) }{\delta '}\right) \right) \end{aligned}$$

stochastic subgradient, stochastic clipped subgradient, or stochastic prox-linear iterations.

We now examine Theorem 4.2 in the setting where the measurement vectors a follow a Gaussian distribution. We note, however, that the results of this section extend far beyond the Gaussian setting to heavy tailed distributions.

Example 4.1

(Gaussian setting) Let us analyze the population setting where \(a \sim N(0, I_{d\times d})\). In this case, it is straightforward to show by direct computation that

$$\begin{aligned} \tilde{\mu } \gtrsim 1; \qquad \tilde{\eta } = 1; \qquad \textsf{L}\lesssim \sqrt{d}. \end{aligned}$$
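As a quick verification (our own) of these estimates: for unit vectors v, w and \(a \sim N(0, I_{d\times d})\),

$$\begin{aligned} \mathbb {E}\left[ \langle a, v\rangle ^2\right] = 1, \qquad \mathbb {E}\left[ \langle a, v\rangle ^2\Vert a\Vert ^2\right] = d + 2, \qquad \mathbb {E}\left[ |\langle a, v\rangle \langle a, w\rangle |\right] \ge \frac{2}{\pi }, \end{aligned}$$

where the last bound uses the identity \(\mathbb {E}|gh| = \frac{2}{\pi }\left( \sqrt{1-\rho ^2}+\rho \arcsin \rho \right) \) for jointly standard Gaussian (g, h) with correlation \(\rho \); hence \(\tilde{\eta } = 1\), \(\tilde{\textsf{L}}\le \sqrt{d+2}\), and \(\tilde{\mu }\ge 2/\pi \).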

Consequently, if \(x_0 \in \mathbb {R}^d\) has error \( \textrm{dist}(x_0, \{\pm \bar{x}\}) \le c(1-2p_{\textrm{fail}}) \cdot \Vert \bar{x}\Vert , \) for some numerical constant c, then with probability \(1-\delta \), Algorithm 5 will produce a point \(x_T\) satisfying \(\textrm{dist}(x_T, \{\pm \bar{x}\}) \le \varepsilon \Vert \bar{x}\Vert \) using only

$$\begin{aligned} O\left( \frac{d}{(1-2p_{\textrm{fail}})^2} \log \left( \frac{1}{\varepsilon }\right) \log \left( \frac{\log \left( \frac{1}{\varepsilon }\right) }{\delta }\right) \right) \end{aligned}$$

samples. We note that the spectral initialization of Duchi and Ruan [18, Proposition 3] produces such a point \(x_0\) with sample complexity \(O(d(1-2p_{\textrm{fail}})^{-2})\) with high probability. Combining this spectral initialization with Algorithm 5 therefore produces a point \(x_T\) satisfying \(\textrm{dist}(x_T, \{\pm \bar{x}\}) \le \varepsilon \Vert \bar{x}\Vert \) with \(O\left( \frac{d}{(1-2p_{\textrm{fail}})^2} \log \left( \frac{1}{\varepsilon }\right) \log \left( \log \left( \frac{1}{\varepsilon }\right) /\delta \right) \right) \) samples, which is the best known sample complexity for Gaussian robust phase retrieval, up to logarithmic factors. We note that by leveraging standard concentration results, it is possible to prove similar results for empirical average minimization \(\min _x \frac{1}{m}\sum _{i=1}^m f(x,z_i)\), provided \(z_i\) are i.i.d. samples of z and the number of samples satisfies \(m \gtrsim d (1-2p_{\textrm{fail}})^{-2}\).

4.2 Robust blind deconvolution

We next apply the proposed algorithms to the blind deconvolution problem. For a detailed discussion of the problem, see for example the papers [2, 39]. Henceforth, fix integers \(d_1, d_2 \in \mathbb {N}\) and an underlying signal \((\bar{x}, \bar{y}) \in \mathbb {R}^{d_1}\times \mathbb {R}^{d_2}\). Define the quantity

$$\begin{aligned} D:= \Vert \bar{x}\Vert \Vert \bar{y}\Vert . \end{aligned}$$

Without loss of generality, we will assume \(\Vert \bar{x}\Vert = \Vert \bar{y}\Vert \). We consider the following measurement model:

Assumption B

(Robust Blind Deconvolution). Consider random \(\ell \in \mathbb {R}^{d_1}\), \(r \in \mathbb {R}^{d_2}\), \(\xi \in \mathbb {R}\), and \(u\in \{0, 1\}\) and the measurement model

$$\begin{aligned} b = \langle \ell , \bar{x}\rangle \langle r, \bar{y}\rangle + u\cdot \xi . \end{aligned}$$

We make the following assumptions on the random data.

  1. 1.

    The variable \(u\) is independent of \(\xi \), \(\ell \), and r. The failure probability \(p_{\textrm{fail}}\) satisfies

    $$\begin{aligned} p_{\textrm{fail}}:= \mathbb {P}(u\ne 0) < 1/2. \end{aligned}$$
  2. 2.

    We have \(\mathbb {E}\left[ |\xi | \right] < \infty \).

  3. 3.

    There exist constants \(\tilde{\eta }, \tilde{\mu } > 0\) such that for all \(M\in \mathbb {R}^{d_1 \times d_2}\) with \(\Vert M\Vert _F = 1\) and \(\textrm{rank}(M) \le 2\), we have

    $$\begin{aligned} \tilde{\mu }\le \mathbb {E}\left[ |\langle \ell , M r\rangle | \right] \le \tilde{\eta }. \end{aligned}$$
  4. 4.

    There exists a constant \(\tilde{\textsf{L}} > 0\) such that for all \(v \in \mathbb {S}^{d_1-1}, w \in \mathbb {S}^{d_2-1}\), we have

    $$\begin{aligned} \sqrt{\mathbb {E}\left[ \left( |\langle \ell , v\rangle | \Vert r\Vert + |\langle r, w\rangle |\Vert \ell \Vert \right) ^2\right] } \le \tilde{\textsf{L}}. \end{aligned}$$

Based on the above assumptions, the following theorem develops three models for the robust blind deconvolution problem. We defer the proof to Sect. C.2.

Theorem 4.3

(Blind deconvolution parameters). Fix a real \(\nu > 1\) and define the set:

$$\begin{aligned} \mathcal {X}= \{ (x, y) \in \mathbb {R}^{d_1 + d_2} :\Vert x\Vert \le \nu D, \Vert y\Vert \le \nu D\}. \end{aligned}$$

Consider the population data \(z = (\ell , r, u, \xi )\) and form the optimization problem

$$\begin{aligned} \min _{x,y} f(x, y):= \mathbb {E}\left[ f((x,y), z)\right] \quad \text { where } \quad f((x,y), z):= | \langle \ell , x\rangle \langle r, y\rangle - b|. \end{aligned}$$

Then the optimal solution set is \(\mathcal {X}^*= \{(\alpha \bar{x}, (1/\alpha )\bar{y}) \mid (1/\nu ) \le |\alpha | \le \nu \}\) and f satisfies the sharpness assumption (A2) with \(\mathcal {S}=\mathcal {X}^*\) and

$$\begin{aligned} \mu = \frac{\tilde{\mu } (1-2p_{\textrm{fail}})\sqrt{D}}{2\sqrt{2}(\nu + 1)}. \end{aligned}$$

Moreover, given a measurable selection \(G((x,y), z) \in \partial f((x,y), z)\), the models

  • (subgradient) \(f^s_{(x,y)}((\hat{x}, \hat{y}), z) = f((x, y), z) + \langle G((x, y), z), (\hat{x}, \hat{y}) - (x, y)\rangle \),

  • (clipped subgradient)

    $$\begin{aligned} f^{cl}_{(x, y)}((\hat{x}, \hat{y}), z) = \max \{ f((x, y), z) + \langle G((x, y), z), (\hat{x}, \hat{y}) - (x, y)\rangle , 0\}, \end{aligned}$$
  • (prox-linear)

    $$\begin{aligned} f^{pl}_{(x, y)}((\hat{x}, \hat{y}), z)= & {} | \langle \ell , x\rangle \langle r, y\rangle - (\langle \ell , \bar{x}\rangle \langle r, \bar{y}\rangle + u\cdot \xi )\\{} & {} + \langle \ell , x\rangle \langle r, \hat{y} - y\rangle + \langle r, y\rangle \langle \ell , \hat{x} -x\rangle | \end{aligned}$$

satisfy Assumptions (A3)–(A5) with

$$\begin{aligned} \eta = \tilde{\eta }, \quad L((x, y), z) = |\langle \ell , x\rangle | \Vert r\Vert + |\langle r, y\rangle |\Vert \ell \Vert , \quad \text {and} \quad \textsf{L}= \nu \tilde{\textsf{L}} \sqrt{D}. \end{aligned}$$
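As with phase retrieval, these oracles are direct to implement. A minimal Julia sketch (our own; the names are illustrative and each measurement z is packaged as the triple (ℓ, r, b)):

```julia
using LinearAlgebra

# Blind deconvolution oracle quantities from Theorem 4.3.
f(x, y, ℓ, r, b) = abs(dot(ℓ, x) * dot(r, y) - b)        # loss f((x, y), z)
# chain rule on (x, y) ↦ <ℓ, x><r, y> gives the subgradient selection
function G(x, y, ℓ, r, b)
    s = sign(dot(ℓ, x) * dot(r, y) - b)
    return (s * dot(r, y) .* ℓ, s * dot(ℓ, x) .* r)      # block (x, y) gradient
end
L(x, y, ℓ, r) = abs(dot(ℓ, x)) * norm(r) + abs(dot(r, y)) * norm(ℓ)
```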

With this theorem in hand, we deduce that on the blind deconvolution problem, Algorithm 5 with subgradient, clipped subgradient, and prox-linear models converges linearly to \(\mathcal {X}^*\) with high probability, whenever the method is initialized within constant relative error of the solution set.

Theorem 4.4

Fix constants \(\gamma \in (0,2)\), \(\varepsilon >0\), and \(\delta '\in (0, 1)\). Consider the subgradient, clipped subgradient, and prox-linear oracles developed in Theorem 4.3. Suppose we are given a pair \((x_0,y_0) \in \mathbb {R}^{d_1 + d_2}\) satisfying

$$\begin{aligned} \textrm{dist}((x_0, y_0), \mathcal {X}^*) \le \gamma \frac{\tilde{\mu } (1-2p_{\textrm{fail}})\sqrt{D}}{8\sqrt{2}(\nu + 1)\tilde{\eta }}. \end{aligned}$$

Set parameters \(\rho _0, \alpha _0, K,\epsilon _0, M, T\) as in Theorem 3.8. In addition, we define the iterate \((x_T,y_T)= \textrm{RPMBA}((x_0, y_0), \rho _0, \alpha _0, K,\epsilon _0, M, T)\). Then with probability \(1-\delta '\), we have \(\textrm{dist}((x_T,y_T), \mathcal {X}^*) \le \varepsilon \sqrt{D}\) after

$$\begin{aligned} O\left( \left( \frac{\nu ^2\tilde{\textsf{L}}}{ \tilde{\mu } (1-2p_{\textrm{fail}}) }\right) ^2\log \left( \frac{(1-2p_{\textrm{fail}})\tilde{\mu }/\tilde{\eta }}{\varepsilon }\right) \log \left( \frac{\log \left( \frac{(1-2p_{\textrm{fail}})\tilde{\mu } /\tilde{\eta }}{\varepsilon }\right) }{\delta '}\right) \right) \end{aligned}$$

stochastic subgradient, stochastic clipped subgradient, or stochastic prox-linear iterations.

We now examine Theorem 4.4 in the setting where the measurement vectors \(\ell , r\) follow a Gaussian distribution. We note, however, that the results of this section extend far beyond the Gaussian setting to heavy tailed distributions.

Example 4.2

(Gaussian setting). Let us analyze the population setting where \((\ell , r) \sim N(0, I_{(d_1 + d_2) \times (d_1 + d_2)})\). In this case, one can show by direct computation that

$$\begin{aligned} \tilde{\mu } \gtrsim 1; \qquad \eta \lesssim 1; \qquad \textsf{L}\lesssim \sqrt{d_1 + d_2}. \end{aligned}$$

Consequently, if \((x_0, y_0) \in \mathbb {R}^{d_1 + d_2}\) has error \( \textrm{dist}((x_0, y_0), \mathcal {X}^*) \le c(1-2p_{\textrm{fail}}) \cdot \sqrt{D}/\nu , \) for some numerical constant \(c>0\), then with probability \(1-\delta \), Algorithm 5 will produce a pair \((x_T, y_T)\) satisfying \(\textrm{dist}((x_T, y_T), \mathcal {X}^*) \le \varepsilon \sqrt{D}\) using only

$$\begin{aligned} O\left( \frac{\nu ^2(d_1 + d_2)}{(1-2p_{\textrm{fail}})^2} \log \left( \frac{1}{\varepsilon }\right) \log \left( \frac{\log \left( \frac{1}{\varepsilon }\right) }{\delta }\right) \right) \end{aligned}$$

samples. We note that the spectral initialization of Charisopoulos et al. [8, Theorem 5.4 and Corollary 5.5] can produce such a pair \((x_0, y_0)\) with \(\nu \le \sqrt{3}\) using \(O(\nu ^2(d_1 + d_2) (1-2p_{\textrm{fail}})^{-2})\) samples, with high probability. Combining this spectral initialization with Algorithm 5 therefore produces a pair \((x_T, y_T)\) satisfying \(\textrm{dist}((x_T, y_T), \mathcal {X}^*) \le \varepsilon \Vert \bar{x}\Vert \) with \(O\left( \frac{d_1 + d_2}{(1-2p_{\textrm{fail}})^2} \log \left( \frac{1}{\varepsilon }\right) \log \left( \log \left( \frac{1}{\varepsilon }\right) /\delta \right) \right) \) samples, which matches the best known sample complexity for Gaussian robust blind deconvolution up to logarithmic factors. We also note that, by leveraging standard concentration results, it is possible to prove similar guarantees for empirical average minimization \(\min _{(x,y)\in \mathcal {X}} \frac{1}{m}\sum _{i=1}^m f((x,y),z_i)\), provided the \(z_i\) are i.i.d. samples of z and the number of samples satisfies \(m \gtrsim (d_1 + d_2) (1-2p_{\textrm{fail}})^{-2}\).

5 Numerical experiments

We now evaluate how Algorithm 2 performs both on the statistical recovery problems of Sect. 4 and on a sparse logistic regression problem. We test the convergence behavior, sensitivity to step size, and convergence to an active manifold. While testing the algorithms, we found that Algorithms 2 and 5 perform similarly, despite Algorithm 5 having superior theoretical guarantees. Thus, we do not evaluate Algorithm 5. The problems of Sect. 4 are both convex composite losses of the form in Example 2.1. For these problems, we therefore implement all four models from Example 2.1, using the closed-form solutions developed in [11, Section 5]. For the sparse logistic regression problem, we implement the stochastic proximal gradient method and measure convergence to the support set of the optimal solution. We provide a reference implementation [9] of the methods in Julia.

5.1 Convergence behavior

In this section, we demonstrate that Algorithm 2 converges nearly linearly on the Gaussian robust phase retrieval and blind deconvolution problems of Sect. 4 for a particular dimension, noise distribution, corruption frequency, and initialization quality. In phase retrieval, we set \(d = 100\), and in blind deconvolution, we set \(d_1 = d_2 = d\). The measurements are corrupted independently with probability \(p_{\textrm{fail}}\): for phase retrieval, the corruption obeys \(\xi = \left| g \right| , \; g \sim N(0, 100)\), while for blind deconvolution, it obeys \(\xi \sim N(0, 100)\). The algorithms are all randomly initialized at a fixed distance \(R_0 > 0\) from the ground truth, which is normalized in all cases. We use Examples 4.1 and 4.2 to estimate \(\textsf{L}\), \(\eta \), and \(\mu \), and we set \(\gamma = 1\), \(R_0 = 0.25\), \(\delta _2 = \frac{1}{\sqrt{10}}\), and target accuracy \(\varepsilon = 10^{-5}\) to obtain the parameters \(T\), \(K\), and \(\alpha _0\) as in Theorem 3.4.

Figures 2 and 3 depict the convergence behavior of Algorithm 2 on robust phase retrieval and blind deconvolution problems in finite sample and streaming settings, respectively. In these plots, solid lines with markers show the mean behavior over 10 runs, while the transparent overlays show one sample standard deviation above and below the mean. In the finite-sample instances, we use \(m = 8 \cdot d\) measurements and corrupt a fixed fraction \(p_{\textrm{fail}}\) with large magnitude sparse noise; see Fig. 2. In the streaming instances, we draw a new i.i.d. sample at each iteration and corrupt it independently with probability \(p_{\textrm{fail}}\); see Fig. 3. In both figures, we plot in red the rate guaranteed by Theorem 3.4 and observe that the algorithms behave consistently with these guarantees. In the presence of noise, the algorithms all converge nearly linearly at the rate predicted by Theorem 3.4, while in the noiseless case, all except the subgradient method converge to an exact solution (modulo numerical accuracy) within far fewer iterations.

Fig. 2 Convergence behavior for \(d = 100\) with finite sample size \(m=8\cdot d\). Phase Retrieval (left column), Blind Deconvolution (right column), \(p_{\textrm{fail}}= 0.0\) (top row), \(p_{\textrm{fail}}=0.2\) (bottom row). Average over 10 runs

Fig. 3 Convergence behavior for \(d = 100\) with streaming data. Phase Retrieval (left column), Blind Deconvolution (right column), \(p_{\textrm{fail}}= 0.0\) (top row), \(p_{\textrm{fail}}=0.2\) (bottom row). Average over 10 runs

5.1.1 Experiments on large-scale problems

We next demonstrate the performance of Algorithm 2 using the subgradient model on large-scale synthetic instances of phase retrieval and blind deconvolution in the finite sample setting with \(d \in \{128 \times 128, 512 \times 512\}\). At each step, we sample a random subset of measurements of size \(S \in \{32, \sqrt{d}, d\}\). The measurement matrices are sampled from the randomized Hadamard ensemble, which consists of k vertically stacked \(d \times d\) Hadamard matrices composed with random binary masks:

$$\begin{aligned} \begin{bmatrix} H_d S_1 \\ H_d S_2 \\ \vdots \\ H_d S_k \end{bmatrix}, \quad \text {where} \; S_i = \textbf{diag}\left( \zeta _{i,j} \right) _{j=1}^d, \; \zeta _{i,j} \sim \textrm{Unif}(\{\pm 1\}) \end{aligned}$$

and \(H_d\) is the \(d \times d\) Walsh-Hadamard matrix. We fix \(k = 8\), \(p_{\textrm{fail}}= 0\), \(\gamma = 1\), and \(\delta _2 = 1/3\), and initialize our algorithm randomly at distance \(R_0 = 0.25\) from the ground truth. We estimate \(\textsf{L}\), \(\eta \), and \(\mu \) using Examples 4.1 and 4.2, adjusted to take the batch size S into account.
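Since \(H_d\) admits a fast transform, each application of the stacked operator costs \(O(kd\log d)\) time. The sketch below is a minimal Julia illustration, assuming d is a power of two and an unnormalized Walsh-Hadamard transform; it is not taken from the reference implementation [9].

```julia
# In-place fast Walsh-Hadamard transform of a length-d vector (d a power of 2).
function fwht!(v::Vector{Float64})
    d, h = length(v), 1
    while h < d
        for i in 1:2h:d, j in i:i+h-1
            a, b = v[j], v[j+h]
            v[j], v[j+h] = a + b, a - b   # butterfly update
        end
        h *= 2
    end
    return v
end

# Apply [H_d S_1; ...; H_d S_k] to x, where S_i = diag of i.i.d. +-1 signs.
hadamard_measure(x, masks) = vcat((fwht!(m .* x) for m in masks)...)

d, k = 1024, 8
masks = [rand([-1.0, 1.0], d) for _ in 1:k]
Ax = hadamard_measure(randn(d), masks)   # length k*d measurement vector
```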

We plot the distance to the solution set as a function of the number of passes over the dataset in Fig. 4. The plots verify that Algorithm 2 converges nearly linearly to a solution at a dimension-independent rate. Moreover, they indicate that using a larger batch size S does not reduce the number of passes required to reach a fixed accuracy.

Fig. 4 Convergence behavior of Algorithm 2 using the subgradient model with \(d = 128 \times 128\) (top row) and \(d = 512 \times 512\) (bottom row) and finite sample size \(m = 8d\). Phase retrieval (left column) and blind deconvolution (right column)

5.2 Sensitivity to step size

We next explore how Algorithm 2 performs when \(\alpha _0\) is misspecified. Throughout, we scale \(\alpha _0\) by \(\lambda := 2^p\) for integers p between \(-10\) and 10. For each model and scalar \(\lambda \), we run 25 trials of the algorithm and report two metrics (a sketch of the sweep protocol appears after the list):

  • We report the sample mean and standard deviation of the number of “inner” loop iterations, or samples, needed to reach accuracy \(\varepsilon = 10^{-5}\). We use the parameters of Sect. 5.1 to cap the total number of iterations, as Theorem 3.4 prescribes; this cap is depicted as a dotted line in the figure.

  • We report the sample mean and standard deviation of the distance of the final iterate to the solution set.
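The sweep itself is a short loop; a sketch of the protocol, with a hypothetical run_trial standing in for a single run of Algorithm 2 at a scaled step size, reads:

```julia
using Statistics   # mean, std

# run_trial(alpha) is a hypothetical wrapper: it runs Algorithm 2 once with
# initial step size alpha and returns the number of inner iterations needed
# to reach accuracy 1e-5 (capped as above). alpha0 is the nominal step size.
results = Dict{Int,Tuple{Float64,Float64}}()
for p in -10:10
    iters = [run_trial(alpha0 * 2.0^p) for _ in 1:25]
    results[p] = (mean(iters), std(iters))
end
```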

Figures 5 and 6 show the results for phase retrieval and blind deconvolution problems with \(d = 100\) and \(p_{\textrm{fail}}\in \{0, 0.2\}\). In these plots, solid lines with markers show the mean behavior over 25 runs, while the transparent overlays show one sample standard deviation above and below the mean. The plots show that Algorithm 2 continues to perform as predicted by Theorem 3.4 even if \(\alpha _0\) is misspecified by a few orders of magnitude.

The prox-linear, proximal point, and clipped models perform similarly in all plots. As reported in [4], the prox-linear and clipped methods produce the same iterates. The iterates produced by the stochastic proximal point and stochastic prox-linear methods are not identical, but they are practically indistinguishable. This is due to two factors: the proximal and prox-linear models agree up to an error that grows quadratically as we move away from the basepoint, and the proximal subproblems force iterates to remain near the basepoint. Running the proximal point method with a much larger step size produces iterates that differ from those of the prox-linear method, but the method then fails to converge to the specified accuracy.

Fig. 5 Sensitivity to step size for the phase retrieval problem with \(d = 100\), \(p_{\textrm{fail}}= 0.2\) (top row), \(p_{\textrm{fail}}= 0\) (bottom row). Left: average number of iterations to achieve distance \(10^{-5}\). Right: average final distance with a fixed computational budget

Fig. 6 Sensitivity to step size for the blind deconvolution problem with \(d = 100\), \(p_{\textrm{fail}}= 0.2\) (top row), \(p_{\textrm{fail}}= 0\) (bottom row). Left: average number of iterations to achieve distance \(10^{-5}\). Right: average final distance with a fixed computational budget

5.3 Activity identification

In this section, we demonstrate that Algorithm 2 converges nearly linearly to the active set of nonzero components in a sparse logistic regression problem. We model our experiment on [37, Section 6.2]. There, the authors find a sparse classifier for distinguishing between the digits 6 and 7 from the MNIST dataset of handwritten digits [36]. In this problem, we are given a set of \(N = 12183\) samples \((x_i, y_i) \in \mathbb {R}^d \times \{-1, 1\}\), representing \(28\times 28\)-pixel images of digits and their labels, and we seek a target vector \(z:= (w, b) \in \mathbb {R}^{d+1}\) such that w is sparse and \(\textrm{sign}(\langle w, x_i \rangle + b) = \textrm{sign}(y_i)\) for most i. To find z, we solve the problem

$$\begin{aligned} \min _{z}~ \frac{1}{N}\sum _{i=1}^N f(z; i)+\tau \Vert w\Vert _1, \end{aligned}$$

where each component is a logistic loss:

$$\begin{aligned} f(z; i) \equiv f(w, b; i) := \log \left( 1 + \exp \left( -y_i(\langle w, x_i \rangle + b)\right) \right) . \end{aligned}$$
(5.1)
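In code, the component loss (5.1) and its gradient with respect to \(z = (w, b)\) take the following form; this is a minimal sketch with the usual numerically stable evaluation of \(\log (1+e^t)\).

```julia
using LinearAlgebra

# Component logistic loss (5.1): log(1 + exp(-y * (<w, x> + b))).
function logistic_loss(w, b, x, y)
    t = -y * (dot(w, x) + b)
    return t > 0 ? t + log1p(exp(-t)) : log1p(exp(t))   # stable log(1 + e^t)
end

# Gradient of the component loss with respect to (w, b).
function logistic_grad(w, b, x, y)
    t = -y * (dot(w, x) + b)
    s = 1 / (1 + exp(-t))            # sigmoid(t) = e^t / (1 + e^t)
    return (-y * s .* x, -y * s)     # (grad_w, grad_b)
end
```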

We let \((\tilde{w}, \tilde{b})\) denote the minimizer of the regularized problem, which we find using the standard (deterministic) proximal gradient algorithm. Given \(\tilde{w}\), we denote its support set by \(S:= \{ i \in [d]: \left| \tilde{w}_i \right| > \epsilon \}\), where \(\epsilon \) accounts for the numerical precision of the system.

Our goal is to converge nearly linearly to the set

$$\begin{aligned} \mathcal {S}= \{ (w, b) \in \mathbb {R}^{d+1} \mid w_{S^c} = 0\}. \end{aligned}$$

To that end, we will apply Algorithm 2 with the stochastic proximal gradient model:

$$\begin{aligned} f_{z}((y, a), i ) = f(z, i) + \langle \nabla f(z, i), (y, a) - z \rangle + \tau \left\| y \right\| _1. \end{aligned}$$

Algorithm 2 equipped with this model results in the standard stochastic proximal gradient method. To apply the algorithm, we set parameters using Theorem 3.4. We set \(\gamma = 1\), \(\delta _2 = 1/\sqrt{10}\), \(\varepsilon = 10^{-5}\). We initialize \(w_0 = 0\), \(b_0 = 0\) and set \(R_0 = \Vert (\tilde{w}, \tilde{b}) - (w_0, b_0)\Vert \). We estimate \(\textsf{L}\) by the formula \( \textsf{L}:= \sqrt{\frac{1}{m} \sum _{i=1}^m \left\| x_i \right\| _2^2}. \) Finally, we estimate \(\mu \) by grid search over \((\tau , p)\) using the formula: \(\mu = \tau \cdot \sqrt{d} \cdot 2^{-p}\).
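Assuming, as is standard for model-based methods, that each update of Algorithm 2 minimizes the sampled model plus the quadratic penalty \(\frac{1}{2\alpha }\Vert \cdot - z\Vert ^2\), the step reduces to a gradient step on the sampled logistic loss followed by soft-thresholding of w (the intercept b is unregularized). A minimal sketch, reusing logistic_grad from the previous sketch:

```julia
# Soft-thresholding operator: the proximal map of t * ||.||_1.
soft_threshold(v, t) = sign.(v) .* max.(abs.(v) .- t, 0)

# One stochastic proximal gradient step on sample (x_i, y_i) with step alpha.
function prox_grad_step(w, b, x, y, tau, alpha)
    gw, gb = logistic_grad(w, b, x, y)
    return (soft_threshold(w .- alpha .* gw, alpha * tau), b - alpha * gb)
end
```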

5.3.1 Evaluation

We compare the performance of Algorithm 2 with the Regularized Dual Averaging method (RDA) [46, 65], which was shown to have favorable manifold identification properties in [37]. In our setting, the latter method solves the following subproblem:

$$\begin{aligned} (w_{t+1}, b_{t+1}) \in \hbox {argmin}_{w, b} \left\{ \left\langle \bar{g}_t, \begin{pmatrix} w \\ b \end{pmatrix} \right\rangle + \tau \left\| w \right\| _1 + \frac{\gamma }{2 \sqrt{t}} \left\| \begin{pmatrix} w \\ b \end{pmatrix} \right\| _2^2 \right\} , \end{aligned}$$
(5.2)

where \(\bar{g}_t\) in (5.2) is the running average of all stochastic gradients \(g_k:= \nabla f(z; i_k)\) sampled up to step t, and \(\gamma \) is a tunable parameter, again determined by a simple grid search. Following the discussion in [37], RDA is initialized at \((w_0, b_0) = 0\); we therefore choose the same initial point \((w_0, b_0)\) for both methods. In addition to RDA, we also compared against the standard stochastic proximal gradient method, equipped with a range of polynomially decaying step sizes of the form

$$\begin{aligned} \lambda _k:= c k^{-p}, \; p \in \{1/2, 2/3, 3/4, 1\}. \end{aligned}$$

We found that the stochastic proximal gradient method performed comparably with RDA in all metrics, and therefore chose to omit it below.
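For completeness, the RDA subproblem (5.2) separates coordinatewise and is solved in closed form by the same soft-thresholding operator as above; a minimal sketch, reusing soft_threshold from the previous sketch:

```julia
# Closed-form solution of (5.2): minimizing <g, v> + tau*|v| + (c/2)*v^2
# componentwise gives v = -soft_threshold(g, tau) / c with c = gamma / sqrt(t);
# the intercept b carries no l1 penalty.
function rda_update(gbar_w, gbar_b, tau, gamma, t)
    c = gamma / sqrt(t)
    return (-soft_threshold(gbar_w, tau) ./ c, -gbar_b / c)
end
```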

The convergence plots in Fig. 7 confirm that the iterates of Algorithm 2 converge to the set \(\mathcal {S}\) at a nearly linear rate, while the function values converge at a sublinear rate. In contrast, the iterates generated by RDA converge sublinearly in both metrics.

Fig. 7 Performance of RMBA vs. RDA on the sparse logistic regression problem. Left: function value gap. Right: distance to the support set and the solution found by the proximal gradient method