1 Introduction

Typical methods for statistics and signal processing tasks follow the two-step strategy: (i) find a moderately accurate solution at a low sample complexity cost (e.g., using spectral initialization), and (ii) refine the approximate solution by an iterative “local search algorithm” that converges rapidly under natural statistical assumptions. For smooth problem formulations, the term “local search” almost universally refers to gradient descent or a close variant thereof; see, for example, [1,2,3,4,5,6,7,8]. For nonsmooth and nonconvex problems, the meaning of local search is much less clear. In this work, we ask the following question.

Is there a generic gradient-based local search procedure for nonsmooth and nonconvex problems, which converges linearly under standard regularity conditions?

We answer this question positively for the class of weakly convex functions, i.e., those that become convex after a sufficiently large quadratic perturbation. We show that standard subgradient methods, originally designed for sharp convex problems, locally converge linearly when applied to sharp functions that are only weakly convex. We focus on three stepsize rules: Polyak stepsize [9, 10], geometrically decaying stepsize [11, 12], and constant stepsize [13,14,15]. As a proof of concept, we illustrate the resulting algorithms on phase retrieval and covariance estimation problems.

2 Discussion

Our approach is rooted in subgradient methods for convex optimization. To motivate the discussion, consider the constrained optimization problem

$$\begin{aligned} \min _{x\in {\mathcal {X}}}~ g(x), \end{aligned}$$
(1)

where g is an L-Lipschitz convex function on \({\mathbb {R}}^d\) and \({\mathcal {X}}\) is a closed and convex set. Given a current iterate \(x_k\), subgradient methods proceed as follows:

$$\begin{aligned} \left\{ \begin{aligned}&\text {Choose any }\zeta _k\in \partial g(x_k)\\&\text {Set } x_{k+1}=\mathrm{proj}_{{\mathcal {X}}}\left( x_k-\alpha _k\cdot \frac{\zeta _k}{\Vert \zeta _k\Vert }\right) \end{aligned}\right\} . \end{aligned}$$

Here, the symbol \(\mathrm{proj}_{{\mathcal {X}}}(y)\) denotes the nearest point of \({\mathcal {X}}\) to y and \(\{\alpha _k\}\) is a specified stepsize sequence. The choice of the sequence \(\{\alpha _k\}\) determines the behavior of the scheme and is the main distinguishing feature among subgradient methods. In this work, we will only be interested in subgradient methods that are linearly convergent. As usual, linear rates of convergence of iterative methods require some regularity conditions to hold. Here, the appropriate regularity condition is sharpness [16, 17] (or equivalently a global error bound): There exists a real \(\mu >0\) satisfying

$$\begin{aligned} g(z)-\min _{x\in {\mathcal {X}}} g(x)\ge \mu \cdot \mathrm{dist}(z;{\mathcal {X}}^*)\quad \text {for all } z\in {\mathcal {X}}, \end{aligned}$$

where \({\mathcal {X}}^*\) denotes the set of minimizers of (1), and \(\mathrm{dist}(z;{\mathcal {X}}^*)\) denotes the distance of z to \({\mathcal {X}}^*\). Assuming sharpness holds, subgradient methods with a judicious choice of \(\{\alpha _k\}\) produce iterates that converge to \({\mathcal {X}}^*\) at the linear rate \(\sqrt{1-(\mu /L)^2}\). Results of this type date back to the 1960s and 1970s [9,10,11,12, 18], while some more recent approaches have appeared in [13,14,15].

Various contemporary problems lead to formulations that are indeed sharp, but are only weakly convex and locally Lipschitz. Recall that a function g is \(\rho \)-weakly convex [19] if the perturbed function \(x\mapsto g(x)+\frac{\rho }{2}\Vert \cdot \Vert ^2\) is convex for some \(\rho \ge 0\).Footnote 1 Note that weakly convex functions need not be smooth nor convex. A quick computation (Lemma 3.1) shows that if g is \(\mu \)-sharp and \(\rho \)-weakly convex, then there is a tube around the solution set \({\mathcal {X}}^*\) that contains no extraneous stationary points:

$$\begin{aligned} {\mathcal {T}}:=\left\{ x\in {\mathcal {X}}: \mathrm{dist}(x;{\mathcal {X}}^*) < \frac{2\mu }{\rho }\right\} . \end{aligned}$$

In this work, we show that the standard linearly convergent subgradient methods, originally designed for convex problems, apply in this much greater generality, provided they are initialized within a slight contraction of the tube \({\mathcal {T}}\). The methods exhibit essentially the same linear rate of convergence as in the convex case, while the weak convexity constant \(\rho \) only determines the validity of the initialization. We focus on three stepsize rules: Polyak stepsize [9, 10], geometrically decaying stepsize [11, 12], and constant stepsize [13,14,15]. As a proof of concept, we illustrate the resulting algorithms on phase retrieval and covariance estimation problems.

Our current work sits within the broader scope of analyzing subgradient and proximal methods for weakly convex problems [19,20,21,22,23,24,25,26]; see also the recent survey [27]. In particular, the paper [21] proves a global sublinear rate of convergence, in terms of a natural stationarity measure, of a (stochastic) subgradient method on any weakly convex function. In contrast, here we are interested in subgradient methods that are locally linearly convergent under the additional sharpness assumption. The arguments we present are all quick modification of the proofs already available in the convex setting. Nonetheless, we believe that the drawn conclusions are interesting and powerful, opening the door to generic local search procedures for nonsmooth and nonconvex problems.

3 Notation

Throughout, we consider the Euclidean space \({\mathbb {R}}^d\), equipped with an inner product \(\langle \cdot ,\cdot \rangle \) and the induced norm \(\Vert x\Vert :=\sqrt{\langle x,x\rangle }\). The distance and the projection of any point \(y\in {\mathbb {R}}^d\) onto a set \({\mathcal {X}}\) are defined by

$$\begin{aligned} \mathrm{dist}(y;{\mathcal {X}}):=\inf _{x\in {\mathcal {X}}}~\Vert y-x\Vert \quad \text {and}\quad \mathrm{proj}_{{\mathcal {X}}}(y):=\mathop {\hbox {argmin}}\limits _{x\in {\mathcal {X}}}~\Vert y-x\Vert , \end{aligned}$$

respectively. Note that \(\mathrm{proj}_{{\mathcal {X}}}(y)\) is nonempty as long as \({\mathcal {X}}\) is a closed set. The indicator function of a set \({\mathcal {X}}\), denoted by \(\delta _{{\mathcal {X}}}\), is defined to be zero on \({\mathcal {X}}\) and \(+\infty \) off it.

3.1 Weakly Convex Functions

Our main focus is on those functions that are convex up to an additive quadratic perturbation. Namely, a function \(g:{\mathbb {R}}^d\rightarrow {\mathbb {R}}\cup \{+\infty \}\) is called \(\rho \)-weakly convex (with \(\rho \ge 0\)), if the assignment \(x\mapsto g(x)+\tfrac{\rho }{2}\Vert x\Vert ^2\) is a convex function. The algorithms we consider will all use generalized derivative constructions. Variational analytic literature highlights a number of distinct subdifferentials (e.g., [28,29,30,31]); for weakly convex functions, all of these constructions coincide. Consider a \(\rho \)-weakly convex function g. The subdifferential of g at x, denoted \({\partial } g( x)\), is the set of all vectors \(v\in {\mathbb {R}}^d\) satisfying

$$\begin{aligned} g(y)\ge g( x)+\langle v,y- x\rangle +o(\Vert y- x\Vert )\quad \text { as }y\rightarrow x. \end{aligned}$$
(2)

Though the condition (2) appears to lack uniformity with respect to the basepoint x, the subgradients of g automatically satisfy the much stronger property:

$$\begin{aligned} g(y)\ge g(x)+\langle v,y-x\rangle -\frac{\rho }{2}\Vert y-x\Vert ^2,\quad \forall x,y\in {\mathbb {R}}^d,~v\in \partial g(x). \end{aligned}$$
(3)

Indeed, this follows quickly by applying the subgradient inequality to the convex function \(g+\frac{\rho }{2}\Vert \cdot \Vert ^2\). Thus, we may use the two conditions, (2) and (3), interchangeably for weakly convex functions. We note in passing that localizing condition (3) leads to so-called prox-regular functions, introduced in [32].

Weakly convex functions are widespread in applications and are typically easy to recognize. One common source is the composite problem class:

$$\begin{aligned} \min _{x}~ F(x):=h(c(x)), \end{aligned}$$
(4)

where \(h:{\mathbb {R}}^m\rightarrow {\mathbb {R}}\) is convex and L-Lipschitz, and \(c:{\mathbb {R}}^d\rightarrow {\mathbb {R}}^m\) is a \(C^1\)-smooth map with \(\beta \)-Lipschitz gradient. An easy argument shows that F is \(L\beta \)-weakly convex. This is a worst case estimate. In concrete circumstances, the composite function F may have a much more favorable weak convexity constant \(\rho \). The elements of the subdifferential \(\partial F(x)\) are straightforward to compute through the chain rule [28, Theorem 10.6, Corollary 10.9]:

$$\begin{aligned} \partial F(x)=\nabla c(x)^{*}\partial h(c(x))\quad \text {for all }x\in {\mathbb {R}}^d. \end{aligned}$$

For a discussion of some recent uses of weakly convex functions in optimization, see the short survey [27]. Throughout the paper, we will use the following two running examples to illustrate our results.

Example 3.1

(Phase retrieval) Phase retrieval is a common computational problem, with applications in diverse areas such as imaging, X-ray crystallography, and speech processing. For simplicity, we will focus on the version of the problem over the reals. The (real) phase retrieval problem seeks to determine a point x satisfying the magnitude conditions,

$$\begin{aligned} |\langle a_i,x\rangle |^2\approx b_i\quad \text {for }i=1,\ldots ,m, \end{aligned}$$

where \(a_i\in {\mathbb {R}}^d\) and \(b_i\in {\mathbb {R}}\) are given. Note that we can only recover the optimal x up to a universal sign change, since \(|\langle a_i,x\rangle | = |\langle a_i,-x\rangle |\). In this work, we will focus on the following optimization formulation of the problem [24, 33, 34]:

$$\begin{aligned} \min _x ~\frac{1}{m}\sum _{i=1}^m |\langle a_i,x\rangle ^2-b_i|. \end{aligned}$$

Clearly, this is an instance of (4). Indeed, under mild statistical assumptions on the way \(a_i\) are generated, the formulation is \(\rho \)-weakly convex, for some numerical constant \(\rho \) independent of d and m [24, Corollary 3.2]. Moreover, under an appropriate model of the noise in the measurements, the problem is sharp [24, Proposition 3]. It is worthwhile to mention that numerous other approaches to phase retrieval exist, based on different problem formulations; see, for example, [3, 35,36,37].

Experiment setup All of the experiments on phase retrieval will be generated according to the following procedure. In the exact setup, we generate standard Gaussian measurements \(a_i\sim N(0,I_{d\times d})\), for \(i=1,\ldots , m\), and generate the target signal \({\bar{x}}\sim N(0,I_{d\times d})\). We then set \(b_i = \langle a_i, {\bar{x}}\rangle ^2\) for each \(i=1,\ldots , m\). In the corrupted setup, we generate \(a_i\) and \({\bar{x}}\) as in the exact case. We then corrupt a proportion of the measurements with outliers. Namely, we set \(b_i=(1-z_i) \langle a_i,{\bar{x}}\rangle ^2 +z_i |\zeta _i|\), where \(z_i \sim \text {Bernoulli}(0.1)\) and \(\zeta _i\sim {\mathcal {N}}(0,100)\).

Example 3.2

(Covariance matrix estimation) The problem of covariance estimation from quadratic measurements, introduced in [38], is a higher rank variant of phase retrieval. Let \(a_1, \ldots , a_m \in {\mathbb {R}}^d\) be measurement vectors. The goal is to recover a low-rank decomposition of a covariance matrix \({\bar{X}}{\bar{X}}^T\), with \({\bar{X}}\in {\mathbb {R}}^{d \times r}\) for a given \(0 \le r \le d\), from quadratic measurements

$$\begin{aligned} b_i \approx a_i^T {\bar{X}}{\bar{X}}^T a_i = \mathrm {Tr}({\bar{X}}{\bar{X}}^T a_ia_i^T). \end{aligned}$$

Note that we can only recover \({\bar{X}}\) up to multiplication by an orthogonal matrix. This problem arises in a variety of contexts, such as covariance sketching for data streams and spectrum estimation of stochastic processes. We refer the reader to [38] for details. In our examples, we will assume m is even and will focus on the potential functionFootnote 2

$$\begin{aligned} \min _{X\in {\mathbb {R}}^{d\times r}} ~\frac{1}{m}\sum _{i=1}^m \left| \left\langle XX^T,a_{2i}a_{2i}^T-a_{2i-1}a_{2i-1}^T\right\rangle -(b_{2i}-b_{2i-1})\right| . \end{aligned}$$
(5)

Under exact measurements, i.e., \(b_i = a_i^T {\bar{X}}{\bar{X}}^T a_i \) and under appropriate statistical assumptions on how \(a_i\) are generated, the formulation (5) is \(\rho \)-weakly convex for a numerical constant \(\rho \), independent of d or m, and is sharp. Indeed, this is a simple consequence of two results, namely [38, Corollary 1] and [2, Lemma 5.4]. It is possible to show that the objective is also sharp when the measurements are corrupted by gross outliers. This guarantee is beyond the scope of our current work and will appear in a different paper.

Experiment setup All of the experiments on covariance matrix estimation will be generated according to the following procedure. In the exact setup, we generate standard Gaussian measurements \(a_i\sim N(0,I_{d\times d})\) for \(i=1,\ldots , m\), and generate the target matrix \({\bar{X}}\in {\mathbb {R}}^{d\times r}\) as a standard Gaussian. We then set \(b_i = \Vert {{\bar{X}}}^T a_i\Vert ^2_F\) for each \(i=1,\ldots , m\). In the corrupted setup, we generate \(a_i\) and \({\bar{X}}\) as in the exact case. We then corrupt a proportion of the measurements with outliers. Namely, we set \(b_i=(1-z_i)\Vert {\bar{X}}^Ta_i \Vert _F^2+z_i |\zeta _i|\), where \(\zeta _i\sim {\mathcal {N}}(0,100)\) and \(z_i \sim \text {Bernoulli}(0.1)\). All plots will show iteration counter k versus the scaled Procrustes distance \(\displaystyle \mathrm{dist}(X_k, {\mathcal {X}}^*) / \Vert {\bar{X}}\Vert = \min _{\Omega ^T\Omega = I} \Vert \Omega X_k-{\bar{X}}\Vert _F/\Vert {\bar{X}}\Vert _F\).

3.2 Setting of the Paper

Throughout the manuscript, we make the following assumption.

Assumption A

Consider the optimization problem

$$\begin{aligned} \min _{x\in {\mathcal {X}}}~ g(x), \end{aligned}$$
(6)

satisfying the following properties for some real \(\rho \ge 0\) and \(\mu >0\).

  1. 1.

    (Weak convexity) The function \(g:{\mathbb {R}}^d\rightarrow {\mathbb {R}}\) is \(\rho \)-weakly convex, and the set \({\mathcal {X}}\subset {\mathbb {R}}^d\) is closed and convex. The set of minimizers \(\displaystyle {\mathcal {X}}^*:=\mathop {\hbox {argmin}}\limits _{x\in {\mathcal {X}}} g(x)\) is nonempty.

  2. 2.

    (Sharpness) The inequality

    $$\begin{aligned} g(z)-\min _{ x\in {\mathcal {X}}} g(x)\ge \mu \cdot \mathrm{dist}(z;{\mathcal {X}}^*)\quad \text { holds for all }z\in {\mathcal {X}}. \end{aligned}$$

We will say that a point \({\bar{x}}\in {\mathcal {X}}\) is stationary for the target problem (6) if

$$\begin{aligned} g(x)-g({\bar{x}})\ge o(\Vert x-{\bar{x}}\Vert )\quad \text { as }x\rightarrow {\bar{x}}\text { in }{{\mathcal {X}}}. \end{aligned}$$

That is, \({\bar{x}}\) is stationary precisely when the zero vector is a subgradient of \(g+\delta _{{\mathcal {X}}}\) at \({\bar{x}}\).

Shortly, we will discuss subgradient methods that converge linearly to \({\mathcal {X}}^*\) under appropriate initialization. As a first step, therefore, we must identify a neighborhood of \({\mathcal {X}}^*\) that is devoid of extraneous stationary points of (6). This is the content of the following lemma.

Lemma 3.1

(Neighborhood with no stationary points) The problem (6) has no stationary points x satisfying

$$\begin{aligned} 0<\mathrm{dist}(x;{\mathcal {X}}^*)< \frac{2\mu }{\rho }. \end{aligned}$$
(7)

Proof

Fix a stationary point \(x\in {\mathcal {X}}\setminus {\mathcal {X}}^*\) of (6). Choosing an arbitrary \({\bar{x}} \in \mathrm{proj}_{{\mathcal {X}}^*}(x)\), observe

$$\begin{aligned} \mu \cdot \mathrm{dist}(x; {\mathcal {X}}^*) \le g(x)-g({\bar{x}}) \le \frac{\rho }{2}\Vert x - {\bar{x}}\Vert ^2=\frac{\rho }{2}\cdot \mathrm{dist}^2(x; {\mathcal {X}}^*). \end{aligned}$$

Dividing through by \(\mathrm{dist}(x; {\mathcal {X}}^*)\), the result follows. \(\square \)

In light of Lemma 3.1, for any \(\gamma > 0\) we define the following tube

$$\begin{aligned} {\mathcal {T}}_{\gamma }:=\left\{ x\in {\mathcal {X}}:\mathrm{dist}(x;{\mathcal {X}}^*) < \gamma \cdot \frac{\mu }{\rho }\right\} \end{aligned}$$

and the constant

$$\begin{aligned} L:= \sup \left\{ \Vert \zeta \Vert : \zeta \in \partial g(x),\, x\in {\mathcal {T}}_{1}\right\} . \end{aligned}$$
(8)

Lemma 3.1 guarantees that the tubes \({\mathcal {T}}_{\gamma }\) contain no extraneous stationary points of the problem for any \(0<\gamma \le 2\). Moreover, observe that \(\mu \) and L play reciprocal roles; consequently, the ratio \(\tau :=\mu /L\) should serve as a measure of conditioning. The following lemma verifies the inclusion \(\tau \in [0,1]\).

Lemma 3.2

(Condition measure) The inclusion \(\tau \in [0,1]\) holds.

Proof

Consider an arbitrary point \(x\in {\mathcal {T}}_1\setminus {\mathcal {X}}^*\) and choose any \({\bar{x}}\in \mathrm{proj}_{{\mathcal {X}}^*}(x)\). By Lebourg’s mean value theorem [39, Theorem 2.3.7], there exists a point z in the open segment \((x,{\bar{x}})\) and a vector \(\zeta \in \partial g(z)\) satisfying

$$\begin{aligned} g(x)-g({\bar{x}})=\langle \zeta ,x-{\bar{x}}\rangle . \end{aligned}$$
(9)

Trivially z lies in \({\mathcal {T}}_1\), and therefore \(\Vert \zeta \Vert \le L\). Using this estimate and sharpness in (9) yields the guarantee

$$\begin{aligned} \mu \cdot \mathrm{dist}(x;{\mathcal {X}}^*)\le g(x)-g({\bar{x}})\le \Vert \zeta \Vert \cdot \Vert x-{\bar{x}}\Vert \le L\cdot \mathrm{dist}(x;{\mathcal {X}}^*). \end{aligned}$$

The result follows. \(\square \)

To summarize, we will use the following symbols to describe the parameters of the problem class (6): \(\rho \) is the weak convexity constant of g, \(\mu \) is the sharpness constant of g, L is the maximal subgradient norm at points in the tube \({\mathcal {T}}_1\), and \(\tau \) is the condition measure \(\tau =\mu /L\in [0,1]\).

4 Polyak Subgradient Method

In this section, we consider the Polyak subgradient method for problem (6). A preliminary version of this material applied to the phase retrieval problem appeared in [34]; we present the arguments here for the sake of completeness.

The Polyak subgradient method is summarized in Algorithm 1. The method requires knowing the optimal value \(\min _{x\in {\mathcal {X}}} g(x)\). In a number of circumstances, this indeed is reasonable (e.g., exact penalty approach for solving nonlinear equations). The latter sections explore subgradient methods that do not require a known optimal value.

figure a

The following theorem shows that Algorithm 1, originally proposed for convex problems enjoys the same linear convergence guarantees for functions that are only weakly convex, provided it is initialized within a certain tube of the optimal solution set.

Theorem 4.1

(Linear rate) Fix a real \(0<\gamma <1\). Then, Algorithm 1 initialized at any point \(x_0\in {\mathcal {T}}_{\gamma }\) produces iterates that converge Q-linearly to \({\mathcal {X}}^*\). That is, for each index \(k\ge 0\) we have

$$\begin{aligned} \mathrm{dist}^2(x_{k+1};{\mathcal {X}}^*) \le \left( 1-(1-\gamma ) \tau ^2\right) \mathrm{dist}^2(x_{k};{\mathcal {X}}^*). \end{aligned}$$
(10)

Proof

We proceed by induction. Suppose that the theorem holds up to iteration \(k-1\). We will prove inequality (10). To this end, observe that the inductive hypothesis implies, \(\mathrm{dist}(x_{k};{\mathcal {X}}^*)\le \mathrm{dist}(x_{0};{\mathcal {X}}^*)\), and therefore \(x_k\) lies in \({\mathcal {T}}_{\gamma }\). Observe that in the case \(\zeta _k=0\), we have \(x_{k+1}=x_k\) by the definition of the subgradient method and \(x_k\in {\mathcal {X}}^*\) by Lemma 3.1; hence, (10) holds trivially. Therefore, we may suppose \(\zeta _k\ne 0\) for the rest of the proof. Choose now a point \({\bar{x}} \in \mathrm{proj}_{{\mathcal {X}}^*}(x_k)\). Taking into account that \(\mathrm{proj}_{{\mathcal {X}}}\) is nonexpansive, we successively deduce

$$\begin{aligned} \Vert x_{k+1} - {\bar{x}}\Vert ^2&\le \left\| (x_{k} -{\bar{x}})- \tfrac{g(x_k)-g(\bar{x})}{\Vert \zeta _k\Vert ^2} \zeta _k\right\| ^2 \\&= \Vert x_{k} - {\bar{x}}\Vert ^2 + \frac{2(g(x_k) - g({\bar{x}}))}{\Vert \zeta _k\Vert ^2}\cdot \langle {\zeta _k,{\bar{x}} - x_{k}}\rangle + \frac{(g(x_k) - g({\bar{x}}))^2}{\Vert \zeta _k\Vert ^2}\\&\le \Vert x_{k} - {\bar{x}}\Vert ^2 + \frac{2(g(x_k) - g({\bar{x}}))}{\Vert \zeta _k\Vert ^2}\left( g({\bar{x}}) - g(x_k) + \frac{\rho }{2}\Vert x_k - {\bar{x}}\Vert ^2 \right) \ \\&\quad + \frac{(g(x_k) - g({\bar{x}}))^2}{\Vert \zeta _k\Vert ^2} \\&= \Vert x_{k} - {\bar{x}}\Vert ^2 + \frac{(g(x_k) - g({\bar{x}}))}{\Vert \zeta _k\Vert ^2}\left( \rho \Vert x_k - {\bar{x}}\Vert ^2 - (g(x_k) - g({\bar{x}})) \right) \\&\le \Vert x_{k} - {\bar{x}}\Vert ^2 + \frac{(g(x_k) - g({\bar{x}}))}{\Vert \zeta _k\Vert ^2}\left( \rho \Vert x_k - {\bar{x}}\Vert ^2 - \mu \Vert x_k - {\bar{x}}\Vert \right) \\&= \Vert x_{k} - {\bar{x}}\Vert ^2 + \frac{\rho (g(x_k) - g({\bar{x}}))}{\Vert \zeta _k\Vert ^2}\left( \Vert x_k - {\bar{x}}\Vert -\frac{\mu }{\rho }\right) \Vert x_k - {\bar{x}}\Vert . \end{aligned}$$

Combining the inclusion \(x_k\in {\mathcal {T}}_{\gamma }\) with sharpness, we therefore deduce

$$\begin{aligned} \mathrm{dist}^2(x_{k+1};{\mathcal {X}}^*)\le \Vert x_{k+1} - {\bar{x}}\Vert ^2\le \left( 1-\frac{(1-\gamma ) \mu ^2}{\Vert \zeta _k\Vert ^2}\right) \Vert x_k-{\bar{x}}\Vert ^2. \end{aligned}$$

The result follows. \(\square \)

As a numerical illustration, we apply the Polyak subgradient method (Fig. 1) to our two running examples, phase retrieval and covariance matrix estimation. Notice that a linear rate of convergence is observed in all experiments except for two, with the rate improving monotonically with an increasing number of measurements m. In the two exceptional experiments, the number of measurements m is too small to guarantee that the initial point \(x_0\) is within the basin of attraction, and the subgradient methods stagnate.

Fig. 1
figure 1

Polyak subgradient method. (Left) Phase retrieval with the exact setup; \(d=5000\) and \(m\in \{11{,}000,12{,}225, 13{,}500, 14{,}750, 16{,}000, 17{,}250,18{,}500\}\). (Right) Covariance matrix estimation with the exact setup; \(d=1000\), \(r=3\), \(m\in \{5000, 8000, 11{,}000, 14{,}000, 17{,}000, 20{,}000\}\). In both experiments, convergence rates uniformly improve with increasing m

5 Subgradient Method with Constant Stepsize

Recall that the Polyak subgradient method (Algorithm 1) crucially relies on knowing the minimal value of the optimization problem (6). Henceforth, all the subgradient methods we consider are agnostic to this value. That being said, they will require some estimates on the problem parameters \((\mu , \rho , L)\). We begin by analyzing a subgradient method with a constant stepsize (Algorithm 2). Constant stepsize schemes are often methods of choice in practice. We will show that when properly initialized, the subgradient method with constant stepsize generates iterates \(x_k\) such that \(\mathrm{dist}(x_k;{\mathcal {X}}^*)\) converges linearly up to a certain threshold.

figure b

The analysis we present fundamentally relies on the following estimate, often used in the analysis of subgradient methods. To simplify notation, for any point \(x\in {\mathbb {R}}^d\), we set

$$\begin{aligned} E(x):=\mathrm{dist}^2(x;{\mathcal {X}}^*). \end{aligned}$$

Whenever x has an index k as a subscript, we will set \(E_k:=E(x_k)\). The following lemma will feature in both the constant and geometrically decaying stepsize schemes.

Lemma 5.1

(Basic recurrence) Consider a point \(x\in {\mathcal {T}}_1\) and a nonzero subgradient \(\zeta \in \partial g(x)\), and define \(x^+:=\mathrm{proj}_{{\mathcal {X}}}\left( x-\alpha \frac{\zeta }{\Vert \zeta \Vert }\right) \) for some \(\alpha >0\). Then, the estimate holds:

$$\begin{aligned} E(x^+)\le \left( 1+\tfrac{\rho \alpha }{L}\right) E(x)-2 \alpha \tau \sqrt{E(x)} +\alpha ^2. \end{aligned}$$
(11)

Proof

Choose an arbitrary point \({\bar{x}} \in \mathrm{proj}_{{\mathcal {X}}^*}(x)\). Observe

$$\begin{aligned} \Vert x^+-{\bar{x}}\Vert ^2&\le \left\| (x-{\bar{x}})-\alpha \tfrac{\zeta }{\Vert \zeta \Vert }\right\| =\Vert x-{\bar{x}}\Vert ^2+\tfrac{2\alpha }{\Vert \zeta \Vert }\cdot \langle \zeta ,{\bar{x}}-x \rangle +\alpha ^2\\&\le \Vert x-{\bar{x}}\Vert ^2 +\tfrac{2\alpha }{\Vert \zeta \Vert }\cdot \left( g({\bar{x}})-g(x)+\frac{\rho }{2}\Vert x-{\bar{x}}\Vert ^2\right) +\alpha ^2\\&\le \left( 1+\frac{\alpha \rho }{\Vert \zeta \Vert }\right) \Vert x-{\bar{x}}\Vert ^2-\frac{2\alpha \mu }{\Vert \zeta \Vert }\cdot \Vert x-{\bar{x}}\Vert +\alpha ^2. \end{aligned}$$

Thus, the inequality holds:

$$\begin{aligned} E(x^+)\le \left( 1+\tfrac{\alpha \rho }{\Vert \zeta \Vert }\right) E(x)-\tfrac{2\alpha \mu }{\Vert \zeta \Vert }\cdot \sqrt{E(x)} +\alpha ^2, \end{aligned}$$

and consequently taking into account \(\alpha /\Vert \zeta \Vert \ge \alpha /L\) we have

$$\begin{aligned} E(x^+)\le \sup _{t\ge \alpha /L} \left\{ \left( 1+\rho t\right) E(x)-2\mu t\cdot \sqrt{E(x)} +\alpha ^2\right\} . \end{aligned}$$

Notice, the function inside the supremum is linear in t with slope given by \(s:=\rho E(x)-2\mu \sqrt{E(x)}\). The inclusion \(x\in {\mathcal {T}}_1\) directly implies \(s\le 0\). Therefore, the supremum on the right-hand side is attained at \(t=\tfrac{\alpha }{L}\), yielding the claimed estimate (11). \(\square \)

In light of Lemma 5.1, we can now prove that the quantities \(E_k = E(x_k)\) converge linearly below a certain fixed threshold. The proof is a modification of that in [15, Section 4].

Lemma 5.2

(Contraction inequality) Fix a constant \(0< \alpha <\frac{\tau \mu }{\rho }\), and let \(\{x_k\}_{k\ge 0}\) be the iterates generated by Algorithm 2. Define the quantity

$$\begin{aligned} E^* := \left( \frac{\alpha L}{\mu + \sqrt{\mu ^2-\alpha \rho L}} \right) ^2. \end{aligned}$$
(12)

Then, whenever an iterate \(x_k\) lies in \({\mathcal {T}}_1\), the estimate holds:

$$\begin{aligned} E_{k+1}-E^*\le q_k(E_k-E^*), \end{aligned}$$

where \(q_k:=1+\tfrac{\alpha }{L}\left( \rho -\tfrac{2 \mu }{\sqrt{E_k}+\sqrt{E^*}}\right) \) satisfies \(q_k< 1\).

Proof

Looking back at estimate (11), consider the following equation in the variable e:

$$\begin{aligned} e= \left( 1+\tfrac{\alpha \rho }{L}\right) e-2\alpha \tau \cdot \sqrt{e}+\alpha ^2. \end{aligned}$$
(13)

An easy computation shows that the minimal positive solution to (13) is exactly \(E^*\), defined in (12). Note that \(E^*\) is well defined by the inequality \(\alpha \le \tau \cdot \frac{\mu }{\rho }\).

Subtracting (13) from (11) yields the estimate

$$\begin{aligned} E_{k+1}-E^*&\le (1+\tfrac{\alpha \rho }{L})(E_k-E^*)-2\alpha \tau (\sqrt{E_k}-\sqrt{E^*})\\&= \left( 1+\tfrac{\alpha }{L}\left( \rho -\tfrac{2 \mu }{\sqrt{E_k}+\sqrt{E^*}}\right) \right) (E_k-E^*). \end{aligned}$$

Finally, noting

$$\begin{aligned} \sqrt{E^*}= \frac{\alpha L}{\mu + \sqrt{\mu ^2-\alpha \rho L}}=\frac{\mu -\sqrt{\mu ^2-\alpha \rho L}}{\rho }\le \frac{\mu }{\rho }, \end{aligned}$$

we obtain

$$\begin{aligned} \rho -\frac{2\mu }{\sqrt{E_k}+\sqrt{E^*}}< \rho -\frac{2\mu }{2\mu /\rho }= 0. \end{aligned}$$

This completes the proof of the lemma. \(\square \)

Iterating Lemma 5.2, we see that the quantities \(E_k\) decrease to a value at most \(E^*\) at a linear rate. Figure 2 illustrates this behavior on our two running examples. It is also clear from the figure that the linear rate of convergence improves as \(E_k\) tends to \(E^*\). An explanation is immediate from the expression for \(q_k\) in Lemma 5.2. Indeed, as \(E_k\) decreases, so do the contraction factors \(q_k\), and for \(E_k\approx E^*\), we have \(q_k\approx (1-\tau ^2)+\frac{\alpha \rho }{L}\). Thus, as the stepsize tends to zero, the limiting linear rate coincides with an ideal rate of \(\approx 1-\tau ^2\).

Even though \(E_k\) could be smaller than \(E^*\) and q could be negative, it is apparent from Fig. 2 that even after \(E_k\) becomes smaller than \(E^*\), all the following values \(E_k\) stay close to \(E^*\). This is the content of the following theorem. The convex version of this theorem appears in [15, Theorem 1].

Fig. 2
figure 2

Constant stepsize subgradient method. (Left) Phase retrieval with the corrupted setup; \(d=1000\), \(m=3000\), and \(\alpha \in \{1,1/3,1/9\}\). (Right) Covariance matrix estimation with the corrupted setup; \(d=1000\), \(r=3\), \(m=10{,}000\), and \(\alpha \in \{1,1/3,1/9\}\). The lower curves correspond to smaller stepsizes in both experiments

Theorem 5.1

(Convergence of fixed stepsize subgradient method) Fix constants \(0<\gamma <1\) and \(\alpha \) satisfying

$$\begin{aligned} 0< \alpha < \frac{\gamma \tau }{\sqrt{1+2\tau ^2}}\cdot \frac{\mu }{\rho }. \end{aligned}$$
(14)

Let \(x_k\) be the iterates generated by Algorithm 2 with stepsize \(\alpha \) and initial point \(x_0\in {\mathcal {T}}_{\gamma }\). Define the constants

$$\begin{aligned} E^* := \left( \frac{\alpha L}{\mu + \sqrt{\mu ^2-\alpha \rho L}} \right) ^2\quad \text {and}\quad D := \sqrt{ \max \{E_0, 2 \alpha ^2+ E^*\}}. \end{aligned}$$

Then, for each index k, the estimates hold:

$$\begin{aligned} \sqrt{E_k}\le D < \tfrac{\gamma \mu }{\rho }\quad \text {and}\quad E_{k}-E^*\le \max \left\{ q^k (E_0-E^*),2 \alpha ^2 \right\} , \end{aligned}$$

where the coefficient \(q := 1+\tfrac{\alpha }{L} \left( \rho -\frac{\mu }{D} \right) \) satisfies \(0<q<1\).

Proof

We first verify the claims that are independent of the iteration counter. To this end, observe that (14) directly implies \(\alpha < \tau \cdot \tfrac{\mu }{\rho }\), and therefore \(E^*\) is well defined. Next, we show \(D< \frac{\gamma \mu }{\rho }\). Indeed, noting \(\sqrt{E^*} < \alpha \tau ^{-1}\) and using (14), we deduce

$$\begin{aligned} D^2=\max \{E_0,2\alpha ^2+E^*\}\le \max \{E_0,\alpha ^2(2+\tau ^{-2})\} < \left( \frac{\gamma \mu }{\rho }\right) ^2. \end{aligned}$$

Next, we show the inequalities \(0<q<1\). To this end, observe

$$\begin{aligned} -1\le -\frac{\tau \alpha }{D}< \tfrac{\alpha }{L}\left( \rho -\tfrac{ \mu }{D}\right) \le (1-\gamma ^{-1})\cdot \frac{\rho \alpha }{L}<0, \end{aligned}$$

where the first inequality follows from the inequality, \(\alpha \le D\), and the third follows from the inequality, \(D<\frac{\gamma \mu }{\rho }\). Thus, we conclude \(0<q<1\), as claimed.

We now proceed by induction. Fix an index k and suppose as inductive hypothesis that for each index \(i=0,1,\ldots , k\), the estimates hold:

$$\begin{aligned} \sqrt{E_i}\le D\quad \text {and}\quad E_{i}-E^*\le \max \left\{ q^{i} (E_0-E^*),2\alpha ^2\right\} . \end{aligned}$$

Let us consider two cases. Suppose first \(E_k\ge E^*\). Then, by applying Lemma 5.2, we deduce

$$\begin{aligned} E_{k+1}-E^*&\le \left( 1+\tfrac{\alpha }{L}\left( \rho -\tfrac{2 \mu }{\sqrt{E_k}+\sqrt{E^*}}\right) \right) (E_k-E^*)\\&\le \left( 1+\tfrac{\alpha }{L}\left( \rho -\tfrac{\mu }{D}\right) \right) (E_k-E^*)\\&=q (E_k-E^*). \end{aligned}$$

Suppose now that the second case, \(E_k<E^*\), holds. Then, Lemma 5.2 implies

$$\begin{aligned} E_{k+1}&\le E_k+\tfrac{\alpha \rho }{L}(E_k-E^*)-2\alpha \tau (\sqrt{E_k}-\sqrt{E^*})\\&\le \max _{E\in [0,E^*]}\{E+\tfrac{\alpha \rho }{L}(E-E^*)-2\alpha \tau (\sqrt{E}-\sqrt{E^*})\}\\&=\max \{E^*,\tfrac{\alpha }{L}(2\mu \sqrt{E^*}-\rho E^*)\}. \end{aligned}$$

Subtracting \(E^*\), we conclude

$$\begin{aligned} E_{k+1}-E^* \le \max \{0, 2 \tau \alpha \sqrt{E^*}\} \le 2 \alpha ^2. \end{aligned}$$

Thus, in both cases, we have the estimate

$$\begin{aligned} E_{k+1}-E^*\le \max \left\{ q(E_k-E^*), 2 \alpha ^2 \right\} . \end{aligned}$$

In particular, we immediately deduce \(\sqrt{E_{k+1}} \le D\). Applying the inductive hypothesis, we conclude

$$\begin{aligned} E_{k+1}-E^*&\le \max \left\{ q\cdot \max \{ q^k(E_0-E^*),2\alpha ^2\}, 2 \alpha ^2 \right\} \\&= \max \{q^{k+1}(E_0-E^*),2\alpha ^2\}. \end{aligned}$$

The theorem is proved. \(\square \)

6 Geometrically Decaying Stepsize

In the previous section, we showed linear convergence of the constant stepsize scheme up to a fixed tolerance \(E^*\). To obtain a linearly convergent method to the true solution set, we will allow the stepsize to decrease geometrically. The analogous strategy in the convex setting goes back to Goffin [12], and our argument is a direct generalization of [12] to the weakly convex setting. The intuition for why one may expect linear convergence under such stepsizes may be gleaned from the Polyak method under the optimal stepsize

$$\begin{aligned} \displaystyle \alpha _k = \frac{g(x_k) - \min _{x\in {\mathcal {X}}} g(x)}{\Vert \zeta _k\Vert }. \end{aligned}$$

It is easy to verify that since the \(E_k\) tend to zero Q-linearly, the stepsizes \(\alpha _k\) tend to zero R-linearly. We implement such a geometrically decaying stepsize in Algorithm 3 and prove linear convergence of the method in Theorem 6.1. In the proof, we use the notation \(E_k:=\mathrm{dist}^2(x_k;{\mathcal {X}}^*)\) as in the previous section.

figure c

Theorem 6.1

Fix a real \(0<\gamma <1\) and suppose \(\tau \le \sqrt{ \frac{1}{2-\gamma } }\). Set

$$\begin{aligned} 0 < \lambda \le \frac{\gamma \mu ^2}{\rho L}\quad \text { and }\quad q:=\sqrt{1-(1-\gamma ) \tau ^2}, \end{aligned}$$

where \(\lambda \) is assumed finite. Then, the iterates \(x_k\) generated by Algorithm 3, initialized at some point \(x_0\) with \(\mathrm{dist}(x_0; {\mathcal {X}}^*) \le \frac{\lambda }{\tau - \sqrt{\tau ^2 -(1-\rho \lambda L^{-1} -q^2)}}\), satisfy:

$$\begin{aligned} \mathrm{dist}^2(x_k;{\mathcal {X}}^*) \le \max \left\{ \tfrac{\lambda ^2}{\tau ^2}, \mathrm{dist}(x_0; {\mathcal {X}}^*)^2 \right\} \left( 1-(1-\gamma )\tau ^2\right) ^{k}. \end{aligned}$$
(15)

Proof

We will prove the result by induction.

To this end, suppose the bound (15) holds for all \(i = 0, \ldots , k\). Appealing to Lemma 5.1 and using the relation \(\alpha _k=\lambda q^k\), we obtain

$$\begin{aligned} E_{k+1}\le \left( 1 + \frac{\rho \lambda q^k}{L} \right) E_k-2\lambda \tau q^k \sqrt{E_k}+\lambda ^2 q^{2k}. \end{aligned}$$
(16)

Define the constant \(M := \max \left\{ \tfrac{\lambda }{\tau }, \mathrm{dist}(x_0; {\mathcal {X}}^*) \right\} \). Recall the induction assumption guarantees \(\sqrt{E_k} \le Mq^k\). Let us therefore fix some value \(R\in [0,M]\) satisfying \(\sqrt{E_k} = R q^k\). Inequality (16) then implies

$$\begin{aligned} E_{k+1}\le \max _{R\in [0,M]} \left\{ R^2q^{2k}+\frac{\rho \lambda R^2}{L}q^{3k}-2\lambda \tau R q^{2k}+ \lambda ^2 q^{2k} \right\} . \end{aligned}$$

Note that the expression inside the maximum is a convex quadratic in R and therefore the maximum must occur either at \(R=0\) or \(R=M\). We therefore deduce

$$\begin{aligned} E_{k+1}\le q^{2k}\cdot \max \left\{ \lambda ^2 ,~ M^2+\frac{\rho \lambda }{L} M^2q^{k}-2\lambda \tau M + \lambda ^2 \right\} . \end{aligned}$$
(17)

To complete the induction, it is therefore sufficient to show

$$\begin{aligned} \lambda ^2 \le M^2 q^2 \quad \text { and }\quad M^2+\frac{\rho \lambda }{L} M^2q^{k}-2\lambda \tau M + \lambda ^2 \le M^2q^2. \end{aligned}$$
(18)

We begin by verifying the first property. Note \(M \ge \tfrac{\lambda }{\tau }\). Hence, it suffices to show that \(\tau \le q\). Observe that the assumption \(\tau \le \sqrt{\frac{1}{2-\gamma }}\) directly implies

$$\begin{aligned} \tau ^2 + (1-\gamma ) \tau ^2 \le 1. \end{aligned}$$

Rearranging yields \(\tau ^2 \le 1-(1-\gamma )\tau ^2 = q^2\). Hence, the first condition in (18) holds.

Next, we show that M satisfies the second property in (18). Rearranging the expression, we must establish

$$\begin{aligned} \left( 1+\frac{\rho \lambda }{L} q^{k}-q^2 \right) M^2-2\lambda \tau M + \lambda ^2 \le 0. \end{aligned}$$
(19)

We will show that the quadratic on the left-hand side in M has two real positive roots. To this end, a quick computation shows that the two roots are

$$\begin{aligned} \frac{\lambda \tau \pm \sqrt{\lambda ^2\tau ^2-\lambda ^2\left( 1+\frac{\rho \lambda }{L} q^{k}-q^2\right) }}{1+\frac{\rho \lambda }{L} q^{k}-q^2}&= \frac{\lambda }{\tau \mp \sqrt{\tau ^2 - \left( 1+ \frac{\rho \lambda }{L} q^k - q^2\right) }}. \end{aligned}$$

To see that the discriminant is nonnegative, observe

$$\begin{aligned} \tau ^2-\left( 1+ \frac{\rho \lambda }{L} q^k-q^2 \right)&\ge \tau ^2 - \left( 1+\frac{\rho \lambda }{L} -q^2 \right) \ge \tau ^2 -(1 + \gamma \tau ^2 -q^2) =0. \end{aligned}$$

Thus, the convex quadratic in (19) has two real roots. Observe \(M\ge \lambda /\tau \), and therefore M is greater than the minimal root. It suffices therefore to argue that M is smaller than the largest root. If \(M=\lambda /\tau \), then this is clearly true. Otherwise, it must be that \(M=\sqrt{E_0}\). Our assumption on the initial distance \(\sqrt{E_0}\) then guarantees that \(\sqrt{E_0}\) is smaller than the largest positive root of (19), as claimed. Hence, the condition (19) holds, and the induction is complete. \(\square \)

In the convex setting, \(\rho = 0\), Theorem 6.1 recovers the guarantees in Goffin [12]. Theorem 6.1 demonstrates a precise relationship between the initial distance, the constant \(\lambda \) which controls the rate, and the constants \(\tau \) and \(\rho \). In particular, the larger the initial distance, the larger \(\lambda \) needs to be to guarantee convergence. It is now straightforward to see how one should set the parameters q and \(\lambda \) to guarantee linear convergence within the tube \({\mathcal {T}}_{\gamma }\).

Fig. 3
figure 3

Geometrically decaying stepsize. (Left) Phase retrieval with the corrupted setup; \(d=1000\), \(m=3000\), \(q\in \{0.983,0.989,0.993,0.996,0.997\}\). (Right) Covariance matrix estimation with the corrupted setup; \(d=1000\), \(r=3\), \(m=10{,}000\), \(q\in \{0.986, 0.991, 0.994, 0.996, 0.998\}\). The depicted rates uniformly improve with lower values of q, in both figures

Corollary 6.1

Fix a real \(0<\gamma <1\) and suppose \(\tau \le \sqrt{ \frac{1}{2-\gamma } }\) and \(\rho >0\). Set

$$\begin{aligned} \lambda := \frac{\gamma \mu ^2}{\rho L}\quad \text { and }\quad q:=\sqrt{1-(1-\gamma ) \tau ^2}. \end{aligned}$$

Then, the iterates \(x_k\) generated by Algorithm 3, initialized at any point \(x_0 \in {\mathcal {T}}_{\gamma }\), satisfy:

$$\begin{aligned} \mathrm{dist}^2(x_k;{\mathcal {X}}^*) \le \frac{\gamma ^2 \mu ^2}{\rho ^2} \left( 1-(1-\gamma )\tau ^2\right) ^{k}. \end{aligned}$$
(20)

Proof

Looking back at Theorem 6.1, we see that \(\mathrm{dist}(x_0;{\mathcal {X}}^*)\) satisfies the assumed upper bound:

$$\begin{aligned} \mathrm{dist}(x_0; {\mathcal {X}}^*) < \frac{\gamma \mu }{\rho } =\frac{\lambda }{\tau } \le \frac{\lambda }{\tau - \sqrt{\tau ^2 -(1-\rho \lambda L^{-1} -q^2)}}. \end{aligned}$$

Since \(\frac{\lambda }{\tau } = \frac{\gamma \mu }{\rho }\), we conclude that \(\max \left\{ \mathrm{dist}(x_0; {\mathcal {X}}^*), \tfrac{\lambda }{\tau } \right\} \le \tfrac{\gamma \mu }{\rho }\). The result immediately follows after applying Theorem 6.1. \(\square \)

We now illustrate the performance of Algorithm 3 on our two running examples in Fig. 3. Empirically, we observed that \(\lambda > 0\) and \(0<q<1\) must be tuned for performance, which is what we did in the experiments. We observe linear convergence in all cases, and the convergence rate of the method improves monotonically as the chosen rate q is decreased. The Polyak scheme pictured in Fig. 1 clearly outperforms all other methods, while the geometrically decaying stepsize scheme performs much better than the constant stepsize scheme in Fig. 2. That being said, we should stress that the Polyak subgradient method requires knowledge of the optimal value of the problem, which is often unavailable. We also note that the work [15] provides schemes which can indeed automatically adapt to the unknown parameter \(\mu \) for convex problems. We leave extensions of such algorithms to weakly convex problems for future work.