1 Introduction

Consider the nonsmooth composite convex optimization problem

$$\begin{aligned} \min _ {x \in Q}\ \left\{ F(x) :=f(x)+h(x) \right\} , \end{aligned}$$
(1.1)

where \(Q \subseteq \mathbb {R}^n\) is a simple closed convex set, \(f,h:\mathbb {R}^n \rightarrow {\mathbb {R} \cup \{+\infty \}}\) are convex (not necessarily smooth) and \(F:\mathbb {R}^n \rightarrow \mathbb {R} \cup \{+\infty \}\) is nonsmooth. Moreover, h is assumed to be the summation of a quadratic convex function and a convex function (SQCC); see Definition 1.1. Problem (1.1) has received much attention due to its broad applications in several different areas such as signal processing, system identification, machine learning and image processing; see, for instance, [6, 7, 10] and the references therein.

Among the numerical algorithms for solving nonsmooth optimization problems of the form (1.1), such as splitting algorithms [9], cutting plane methods [21], ellipsoid methods [11], bundle methods [17], gradient sampling methods [4] and smoothing methods [19], subgradient methods [25] are fundamental; they have been extensively studied due to their applicability to a wide variety of problems and their low memory requirements [3, 8, 22, 23]. The iteration complexity of a subgradient method applied to a general nonsmooth convex minimization problem is \(O({1}/{\epsilon ^2})\), i.e., after \(O({1}/{\epsilon ^2})\) iterations the difference between the objective function value and the optimum is at most \(\epsilon \); see [21]. For problems equipped with additional structure, various approaches, such as smoothing schemes [19], the fast iterative shrinkage-thresholding algorithm [1] and bundle methods [17], have been proposed to improve the iteration complexity to \(O({1}/{\epsilon })\).

Note that for nonsmooth optimization problems it is usually not the case that the subgradient vanishes at the solution point, and as a consequence the stepsize in subgradient-based methods must approach zero. This vanishing stepsize slows down the convergence rate of the subgradient method [20]. To deal with this undesirable phenomenon, Nesterov proposed a dual averaging (DA) scheme [20]. Each iteration of the DA scheme takes the form

$$\begin{aligned} x_{k+1}=\mathop {\mathrm{argmin}\,}_{ x \in Q } \left\{ \sum _{i=0}^{k} \langle \lambda _i D_i , x-x_0 \rangle + {\beta _{k+1}}r(x) \right\} ,~ D_k \in \partial F(x_k),\ \forall k \ge 0, \end{aligned}$$
(1.2)

where the \(\lambda _k\), \(k \ge 0\), are stepsizes, \( \{\beta _{k}\}_{k=0}^{\infty } \) is a positive nondecreasing sequence and \( r(\cdot ) \) is an auxiliary strongly convex function. Following the DA scheme, Xiao [26] proposed a regularized dual averaging (RDA) scheme, which generates the iterates by minimizing a subproblem that involves all past subgradients of f and the whole function h,

$$\begin{aligned} x _ { k+1} = \mathop {\arg \min }_{ x \in Q } \left\{ \sum _{i=0}^{k}\left\langle g_{i} , x-x_0 \right\rangle + (k+1)h(x)+ {\beta _{k+1}}r(x) \right\} ,~ g_{i} \in \partial f(x_{i}),\ \forall k \ge 0, \end{aligned}$$
(1.3)

where \(x_0\) is the minimizer of h over Q. Setting the auxiliary function \(r(\cdot )\) to \(\frac{1}{2}\Vert \cdot -x_0\Vert ^2 \), the RDA scheme (1.3) becomes the so-called proximal subgradient-based (PSB) method

$$\begin{aligned} x _ { k+1} = \mathop {\arg \min }_{ x \in Q } \left\{ \sum _{i=0}^{k}\left\langle g_{i} , x-x_0 \right\rangle + (k+1)h(x)+ {\frac{\beta _{k+1}}{2}}\Vert x-x_0\Vert ^2 \right\} ,~ g_{i} \in \partial f(x_{i}),\ \forall k \ge 0. \end{aligned}$$
(1.4)

The regularization function \(r(\cdot )\) is crucial in RDA and PSB; it plays a role similar to the proximal term in the classical proximal point algorithm (PPA) [5, 18, 24]. On the one hand, it ensures the existence and uniqueness of the solution of each subproblem and makes the subproblems stable. On the other hand, it also influences the efficiency of the algorithms. Recently, much attention has been paid to relaxing the strong convexity requirements on the proximal term in PPA [13] and in related algorithms such as the augmented Lagrangian method [12] and the alternating direction method of multipliers [14, 16], and this strategy has achieved great success in numerical experiments. In this paper, we relax \(r(\cdot )\) in (1.3) to an indefinite one, yielding the following dynamic regularized dual averaging (DRDA) scheme

$$\begin{aligned} x _ { k+1} = \mathop {\arg \min }_{ x \in Q } \left\{ \sum _{i=0}^{k}\left\langle g_{i} , x-x_0 \right\rangle + (k+1)h(x)+ {\beta _{k+1}}r_k(x) \right\} ,~ g_{i} \in \partial f(x_{i}),\ \forall k \ge 0, \end{aligned}$$
(1.5)

where \((k+1)h(\cdot )+ {\beta _{k+1}}r_k(\cdot )\) is strongly convex for each k. Note that under this requirement, even if the function \(h(\cdot )\) is convex, \(r_k(\cdot )\) can be carefully chosen to be nonconvex. Specifically, we introduce an appropriate indefinite term in (1.4) and propose the indefinite proximal subgradient-based (IPSB) algorithm. A convergence rate is established under mild assumptions. We conduct numerical experiments on the regularized least squares problem and on elastic net regression; the results demonstrate the efficiency of IPSB in comparison with the existing algorithms SDA and PSB.

The rest of this paper is organized as follows. In the following subsection, we introduce some notations and preliminaries. Section 2 reviews the simple dual averaging algorithm and the proximal subgradient-based algorithm, and gives our new extensions. Section 3 presents the convergence analysis. Numerical experiments are reported in Sect. 4. Conclusions are drawn in Sect. 5.

1.1 Notations and preliminaries

In this subsection, we present some definitions and preliminary results that will be used in our analysis later. Let Q be a closed convex set in \(\mathbb {R}^n\). We use \(\langle s, x \rangle \) and \(s^Tx\) to denote the inner product of s and x, two real vectors with the same dimension. Let \(\mathbb {S}^{n}\) denote the set of symmetric matrices of order n, and I denote the identity matrix whose dimension is clear from the context. The Euclidean norm defined by \(\sqrt{\langle \cdot ,\cdot \rangle }\) is denoted by \(\Vert \cdot \Vert \). Let [m] denote the set \(\{1,2,\ldots ,m\}\). The ball with center x and radius r reads as

$$\begin{aligned} B_r (x) = \{ y \in \mathbb {R}^n : \Vert y - x \Vert \le r \}. \end{aligned}$$

The subdifferential of a convex function f at point \(x\in \mathrm {dom} f\) is given by

$$\begin{aligned} \partial f(x) := \{g \in \mathbb {R}^n: f(y) \ge f(x) + \langle g, y-x \rangle , \forall y \in \mathbb {R}^n\}, \end{aligned}$$

and any element in \(\partial f(x)\) is called a subgradient of f at x, where \(\mathrm {dom} f\) is the domain of f, i.e., the set of \(x\in \mathbb {R}^n\) such that f(x) is finite.
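For example (a standard fact, recalled here because the \(\ell _1\)-norm appears in the elastic net model of Sect. 4.2), the subdifferential of \(f(x)=\Vert x\Vert _1\) is

$$\begin{aligned} \partial \Vert x\Vert _1 = \left\{ g \in \mathbb {R}^n: g_i=\mathrm {sgn}(x_i) \ \text {if}\ x_i \ne 0,\ g_i \in [-1,1] \ \text {if}\ x_i = 0 \right\} , \end{aligned}$$

so the vector \(\mathrm {sgn}(x)\), with the convention \(\mathrm {sgn}(0)=0\), is always a valid subgradient; this is the subgradient used in the iteration formulas of Sect. 4.2.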

A function \(f:\mathbb {R}^n \rightarrow \mathbb {R}\cup \{+\infty \}\) is called strongly convex if there exists a constant \(\kappa > 0\) such that

$$\begin{aligned} f(x) \ge f(y) +\langle g, x-y \rangle + \frac{\kappa }{2}\Vert x-y\Vert ^2,~ \forall x,y \in \mathbb {R}^n,\ \forall g \in \partial f(y), \end{aligned}$$

where the constant \(\kappa \) is called the strong convexity parameter.

For \(M \in \mathbb {R}^{n \times n}\), we use the notation \(\Vert x\Vert _M^2\) to denote \(x^TMx\) even if M is not positive semidefinite. Denote by \(\,\mathrm{tr }\,(M)\) the trace of the matrix M.

Definition 1.1

(SQCC) A function \(h:\mathbb {R}^n \rightarrow {\mathbb {R} \cup \{\infty \}}\) is called the summation of quadratic convex and convex functions (SQCC) if there exists a (nonlinear) quadratic convex function \(q:\mathbb {R}^n \rightarrow {\mathbb {R} \cup \{\infty \}}\) and a convex function \({\tilde{h}}:\mathbb {R}^n \rightarrow {\mathbb {R} \cup \{\infty \}}\) such that

$$\begin{aligned} h(x)= q(x)+{\tilde{h}}(x),\ \forall x \in \mathbb {R}^n. \end{aligned}$$

Since h is SQCC, there exists a non-zero positive semidefinite matrix \(\Sigma _h \in \mathbb {S}^{n}\) such that for all x, \(y \in \mathbb {R}^n\),

$$\begin{aligned} h(y) \ge h(x)+\left\langle u, y-x \right\rangle + \frac{1}{2}\Vert y-x\Vert ^2_{\Sigma _h},~ \forall u \in \partial h(x), \end{aligned}$$
(1.6)

or equivalently,

$$\begin{aligned} \langle x-y, u-v \rangle \ge \Vert x-y\Vert ^2_{\Sigma _h},~ \forall u \in \partial h(x), v \in \partial h(y). \end{aligned}$$
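As a concrete example (anticipating Sect. 4.2, where this choice is used implicitly), the elastic net function \(h(\omega )=\Vert y-X\omega \Vert _2^2+\eta _2\Vert \omega \Vert _2^2\) is a convex quadratic, hence SQCC with \({\tilde{h}}\equiv 0\), and a direct expansion gives

$$\begin{aligned} h(\omega ') = h(\omega )+\left\langle \nabla h(\omega ), \omega '-\omega \right\rangle + \frac{1}{2}\Vert \omega '-\omega \Vert ^2_{2(X^TX+\eta _2 I)},\ \forall \omega , \omega ' \in \mathbb {R}^n, \end{aligned}$$

so one may take \(\Sigma _h=2(X^TX+\eta _2 I)\). Similarly, \(\Sigma _h=2B\) for the function h used in Sect. 4.1.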

2 A new proximal subgradient algorithm

In the first subsection, we briefly review two existing algorithms SDA and PSB. Then in the second subsection, we describe the indefinite proximal subgradient-based (IPSB) algorithm.

2.1 SDA and PSB

We start from the classical subgradient algorithm [3] for minimizing the problem (1.1)

$$\begin{aligned} x_{k+1}=P_Q(x_k-\lambda _k d_k),\ k \in \mathbb {N}, \end{aligned}$$
(2.1)

where \(P_Q\) denotes the projection onto Q, \(d_k\) is either a subgradient \(D_k \in \partial F(x_k)\) or the normalized subgradient \({ D_k}/{\Vert D_k\Vert } \), and the sequence of the stepsizes \(\{\lambda _k\}_{k=0}^{\infty }\) satisfies the divergent-series rule:

$$\begin{aligned} \lambda _k > 0,\ \lambda _k \rightarrow 0,\ \sum _{k=0}^{\infty }\lambda _k = \infty . \end{aligned}$$

In order to avoid taking decreasing stepsizes (i.e., \(\lambda _k \rightarrow 0\)) as in the classical subgradient algorithm, Nesterov [20] proposed the SDA algorithm,

$$\begin{aligned} x_{k+1}=\mathop {\mathrm{argmin}\,} _ { x \in Q } \left\{ \sum _{i=0}^{k} \langle \lambda _i D_i , x-x_0 \rangle + \frac{\beta _{k+1}}{2}\Vert x - x_0\Vert ^2 \right\} ,~ D_k \in \partial F(x_k),\ \forall k \ge 0, \end{aligned}$$
(2.2)

where \(\{\beta _{ k+1}\}_{k=0}^{\infty }\) is a positive nondecreasing sequence and \(x_0\) denotes the initial point. There are two simple strategies for choosing \(\{\lambda _{i}\}_{i=0}^{\infty }\): either \(\lambda _i \equiv 1\) or \(\lambda _i=1/\Vert d_i\Vert \). SDA applies to general nonsmooth convex optimization problems and has been proved to be optimal from the viewpoint of worst-case black-box lower complexity bounds [20]. By considering problems with the additive structure in (1.1), Xiao [26] proposed the RDA scheme. A concrete algorithm under the RDA scheme is PSB, which reads as follows

$$\begin{aligned} x_{ k+1 }&= \mathop {\mathrm{argmin}\,} _ { x \in Q } \left\{ \sum _{i=0}^{k}\left( \langle g _ { i } , x-x_0 \rangle + h ( x ) \right) + \frac{\beta _{k+1}}{2}\Vert x-x_0\Vert ^2 \right\} \nonumber \\&= \mathop {\mathrm{argmin}\,} _ { x \in Q } \left\{ \sum _{i=0}^{k} \langle g _ { i } , x-x_0 \rangle + (k+1) h ( x ) + \frac{\beta _{k+1}}{2}\Vert x-x_0\Vert ^2 \right\} , \end{aligned}$$
(2.3)

where \(g_i \in \partial f(x_i)\), \(\forall i\ge 0\), the stepsize \(\lambda _i\equiv 1\), \(\{\beta _{k+1}\}_{k=0}^{\infty }\) is nondecreasing, and \(x_0 \in \mathrm{argmin}\,_{x \in Q} h(x)\). The above iteration (2.3) reduces to (2.2) when \(h \equiv 0\).

2.2 Algorithm IPSB

Motivated by the aforementioned indefinite approaches, we extend RDA to the following dynamic regularized dual averaging (DRDA) scheme

$$\begin{aligned} x_{ k+1} = \mathop {\arg \min }_{ x \in Q } \left\{ \sum _{i=0}^{k}\left\langle g_{i} , x-x_0 \right\rangle + (k+1)h(x)+ {\beta _{k+1}}r_k(x) \right\} . \end{aligned}$$
(2.4)

We only assume that the sum \((k+1)h(x)+{\beta _{k+1}}r_k(x)\) is strongly convex. A simple choice of \(r_k(x)\) is

$$\begin{aligned} r_k(x)=\frac{1}{2}\Vert x-x_0\Vert ^2_{G_{k+1}}, \end{aligned}$$

where \(G_{k+1}=I-(k+1)\Sigma _h/\beta _{k+1}\). Algorithm 1 describes the algorithm in detail.

Algorithm 1 The indefinite proximal subgradient-based (IPSB) algorithm (displayed as a figure in the original)

In Algorithm 1, the choice of the indefinite matrix \(G_k\) in step 3 guarantees the strong convexity of the subproblem minimized in step 4. In some specially structured problems, the introduction of \(G_k\) can make the subproblem in step 4 much easier to solve.
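For concreteness, the following MATLAB sketch summarizes one pass of the scheme (2.4)-(2.5); since the algorithm box is not reproduced here, the handle names subprob and subgrad_f, the averaging of the iterates and the exact ordering of the updates are our assumptions.

```matlab
% A minimal sketch of Algorithm 1 (IPSB), reconstructed from (2.4)-(2.5);
% the handles, the iterate averaging and the update ordering are assumptions.
function x_hat = ipsb_sketch(subprob, subgrad_f, x0, Sigma_h, gamma, lambda_hat, K)
% subprob(s, kp1, beta, G): returns argmin_{x in Q} { <s, x - x0> + kp1*h(x)
%                           + (beta/2)*||x - x0||_G^2 }  (problem-dependent).
% subgrad_f(x): returns an element of the subdifferential of f at x.
n = numel(x0);
s = zeros(n, 1);                    % s_0 = 0, running sum of subgradients of f
x = x0;                             % x_0 minimizes h over Q
xsum = zeros(n, 1);
beta_t = lambda_hat;                % beta_tilde_1 = lambda_hat, cf. (2.5)
for k = 0:K-1
    xsum = xsum + x;
    s = s + subgrad_f(x);           % s_{k+1} = sum_{i=0}^{k} g_i
    beta = gamma * beta_t;          % beta_{k+1} = gamma * beta_tilde_{k+1}
    G = eye(n) - (k+1) * Sigma_h / beta;   % indefinite weight, cf. step 3
    x = subprob(s, k+1, beta, G);   % solve the subproblem, cf. step 4
    beta_t = beta_t + 1 / beta_t;   % next beta_tilde, cf. (2.5)
end
x_hat = xsum / K;                   % averaged iterate
end
```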

Remark 2.1

Note that, as the iteration progresses, the influence of the initial point \(x_0\) should vanish; in other words, the auxiliary quadratic term should be as small as possible. Comparing the auxiliary functions in the k-th step of the algorithms PSB and IPSB, we obtain

$$\begin{aligned} \frac{1}{2}\Vert x-x_0\Vert ^2_{G_{k+1}}&=\frac{1}{2}\Vert x-x_0\Vert ^2_{I-(k+1) \Sigma _h/{\beta _{ k+1 }}}\\&=\frac{1}{2}\Vert x-x_0\Vert ^2-\frac{k+1}{2}\Vert x-x_0\Vert ^2_{ \Sigma _h/{\beta _{ k+1 }}}\\&<\frac{1}{2}\Vert x-x_0\Vert ^2, \end{aligned}$$

which indicates that the indefinite term can reduce the impact of \(x_0\) on the k-th subproblem as k increases.

Remark 2.2

The following choice of the sequence \(\{{\tilde{\beta }}_{k+1}\}_{k=0}^{\infty }\) used in Algorithm 1 is due to Nesterov [20]:

$$\begin{aligned} {\tilde{\beta }}_1 = {\hat{\lambda }},\ {\tilde{\beta }}_{k+1} ={\tilde{\beta }}_k+\frac{1}{{\tilde{\beta }}_k},\ k \in \mathbb {N}, \end{aligned}$$
(2.5)

where \( {\hat{\lambda }}>0\) is an initial parameter.

For the sequence \(\{{\tilde{\beta }}_{k+1}\}_{k=0}^{\infty }\), we have the following estimation, which corrects the one given in [20, Lemma 3].

Lemma 2.1

Based on (2.5), we have

$$\begin{aligned} \sqrt{{\hat{\lambda }}^2+2k-2} \le {{\tilde{\beta }}}_k \le {\hat{\lambda }} + \frac{1}{{\hat{\lambda }}}+ \sqrt{{\hat{\lambda }}^2+2k-4}, ~ \forall k \ge 1. \end{aligned}$$
(2.6)

Proof

According to (2.5), we can obtain \({\tilde{\beta }}_1 = {\hat{\lambda }}\) and \({\tilde{\beta }}_{k}^2={\tilde{\beta }}_{k-1}^2+{{\tilde{\beta }}_{k-1}^{-2}}+2\). Consequently,

$$\begin{aligned} {\tilde{\beta }}_{k}^2 \ge {\tilde{\beta }}_{k-1}^2+2 \ge {\tilde{\beta }}_{1}^2+2(k-1) = {\hat{\lambda }}^2+2(k-1),\ \forall k \ge 2, \end{aligned}$$

which implies the left-hand side of the estimation (2.6). On the other hand, we can derive that

$$\begin{aligned} {\tilde{\beta }}_{ k }&= {\tilde{\beta }}_{ k-1 } +\frac{1}{{\tilde{\beta }}_{ k-1 }} \le {\tilde{\beta }}_{ k-1 } +\frac{1}{\sqrt{{\hat{\lambda }}^2+2(k-2)}} \le {\tilde{\beta }}_{1} +\sum _{t=0}^{k-2}\frac{1}{\sqrt{{\hat{\lambda }}^2+2t}}\nonumber \\&={\hat{\lambda }}+\sum _{t=0}^{k-2}\frac{1}{\sqrt{{\hat{\lambda }}^2+2t}}. \end{aligned}$$
(2.7)

From

$$\begin{aligned} \frac{1}{\sqrt{{\hat{\lambda }}^2+2t}} \le \frac{2}{\sqrt{{\hat{\lambda }}^2+2t}+\sqrt{{\hat{\lambda }}^2+2(t-1)}} = \sqrt{{\hat{\lambda }}^2+2t}-\sqrt{{\hat{\lambda }}^2+2(t-1)}, \end{aligned}$$

we have

$$\begin{aligned} \sum _{t=0}^{k-2} \frac{1}{\sqrt{{\hat{\lambda }}^2+2t}} = \frac{1}{{\hat{\lambda }}}+ \sum _{t=1}^{k-2} \frac{1}{\sqrt{{\hat{\lambda }}^2+2t}} \le \frac{1}{{\hat{\lambda }}}+ \sqrt{{\hat{\lambda }}^2+2(k-2)}-{\hat{\lambda }}. \end{aligned}$$
(2.8)

Finally, the right-hand side of the estimation (2.6) follows from substituting (2.8) into (2.7). \(\square \)
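The bounds (2.6) are easy to check numerically. The following MATLAB snippet is only a sanity check (the values of \({\hat{\lambda }}\), the number of terms and the tolerance are arbitrary choices), not part of the proof; the upper bound is tested for \(k \ge 2\), where the square root is real for any \({\hat{\lambda }}>0\).

```matlab
% Numerical sanity check of the bounds (2.6); parameter values are arbitrary.
lambda_hat = 1;  K = 1000;
bt = zeros(K, 1);
bt(1) = lambda_hat;                          % beta_tilde_1 = lambda_hat
for k = 2:K
    bt(k) = bt(k-1) + 1 / bt(k-1);           % recursion (2.5)
end
k_all = (1:K)';
lower = sqrt(lambda_hat^2 + 2 * k_all - 2);  % lower bound in (2.6)
k2 = (2:K)';
upper = lambda_hat + 1/lambda_hat + sqrt(lambda_hat^2 + 2 * k2 - 4);
assert(all(bt >= lower - 1e-12));            % lower bound, all k >= 1
assert(all(bt(2:K) <= upper + 1e-12));       % upper bound, k >= 2
```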

3 Convergence analysis

Following Nesterov's analysis [20], we now establish the convergence of the algorithm IPSB. First, let us define two auxiliary functions as follows

$$\begin{aligned}&U _ { k } ( s ) := \max _{x \in {\mathcal {F}}_D } \{ \left\langle s , x - x _ { 0 } \right\rangle - kh ( x ) \}, \end{aligned}$$
(3.1)
$$\begin{aligned}&V _ { k } ( s ) := \max _{x \in Q} \left\{ \left\langle s , x - x _ { 0 } \right\rangle - kh ( x )- \frac{\beta _{k}}{2} \Vert x-x_0\Vert ^2_{G_k} \right\} , \end{aligned}$$
(3.2)

where \({\mathcal {F}}_D=\{x \in Q: \frac{1}{2}\Vert x-x_0\Vert ^2 \le D\}\), \(D > 0\) and \( G_k=I-{k\Sigma _h}/{\beta _{ k }},\ \forall k \ge 1. \) Let \(x_0 \in \mathop {\arg \min }_{x \in Q} h(x)\). Since \(s_0=0\), we have

$$\begin{aligned} V _ { 1 } ( -s_{0} ) = \max _{x \in Q} \left\{ -h(x) -\frac{\beta _1}{2}\Vert x-x_0\Vert ^2_{G_1} \right\} =\max _{x \in Q} \left\{ -h(x) -\frac{1}{2}\Vert x-x_0\Vert ^2_{\beta _1I-\Sigma _h} \right\} . \end{aligned}$$
(3.3)

Notice that (3.3) is a concave maximization problem whose objective is strongly concave (by the choice of \(G_1\)), and hence it has a unique optimal solution in the closed convex set Q. According to Danskin's theorem [2, Proposition B.25], we obtain that both \(V _ { 1 } ( -s_{0} ) \) and \(\nabla V_1(-s_0)\) are well defined. Let

$$\begin{aligned} T:=V_1(-s_0) + \langle -g_0, \nabla V_1(-s_0) \rangle + h(x_1). \end{aligned}$$
(3.4)

In the following, the first lemma studies the relation between \(U _ { k } ( s )\) and \(V _ {k}(s)\), and the second lemma studies the smoothness of function \(V _ {k}(s)\).

Lemma 3.1

For any \(s \in \mathbb {R}^n\) and \(k \in \mathbb {N}\), we have

$$\begin{aligned} U _ { k } ( s ) \le \beta _k D + V _ { k } ( s ). \end{aligned}$$
(3.5)

Proof

According to the definitions (3.1), (3.2) and \({\mathcal {F}}_D=\{x \in Q: \frac{1}{2}\Vert x-x_0\Vert ^2 \le D\}\), we have

$$\begin{aligned} U _ { k } ( s )&= \max _{x \in {\mathcal {F}}_D} \{ \left\langle s , x - x _ { 0 } \right\rangle - kh ( x ) \}\\&\le \min _{\beta \ge 0} \max _{x \in Q} \left\{ \left\langle s , x - x _ { 0 } \right\rangle - kh ( x ) - \beta ( \frac{1}{2} \Vert x-x_0\Vert ^2 - D) \right\} \\&\le \max _{x \in Q} \left\{ \left\langle s , x - x _ { 0 } \right\rangle - kh ( x ) - \beta _k ( \frac{1}{2} \Vert x-x_0\Vert ^2 - D) \right\} \\&\le \max _{x \in Q} \left\{ \left\langle s , x - x _ { 0 } \right\rangle - kh ( x ) - \frac{\beta _k}{2} \Vert x-x_0\Vert ^2 + \beta _k D +\frac{k}{2}\Vert x-x_0\Vert ^2_{\Sigma _h} \right\} \\&= \beta _k D + V _ { k } ( s ), \end{aligned}$$

where the first inequality corresponds to the partial Lagrangian relaxation, and the last inequality holds as \({\Sigma _h} \succeq 0\). \(\square \)

Lemma 3.2

The function \(V _ {k}(s)\) is well defined, convex and continuously differentiable, with

$$\begin{aligned} \nabla V_{k} (s) = x_k(s) - x_0,\ \forall k \ge 1, \end{aligned}$$
(3.6)

where \(x_k(s)\) is the unique maximizer of the inner problem in (3.2). In addition, \(\nabla V_{k} (s)\) is \({1}/{\beta _k}\)-Lipschitz continuous, i.e.,

$$\begin{aligned} \Vert \nabla V_{k} (s)-\nabla V_{k} (t)\Vert \le \frac{1}{\beta _k}\Vert s-t\Vert ,~ \forall s, t \in \mathbb {R}^n. \end{aligned}$$

Proof

Since the objective function of the maximization problem in (3.2) is \(\beta _k-\)strongly concave with respect to x, its maximizer \(x_k(s)\) exists and is unique. Then (3.6) follows from Danskin's theorem [2, Proposition B.25].

For any \( s_1, s_2 \in \mathbb {R}^n \), the first-order optimality conditions yield the existence of \(l(x_k(s_1))\in \partial h(x_k(s_1))\) and \(l(x_k(s_2))\in \partial h(x_k(s_2))\) such that

$$\begin{aligned}&\langle -s_1+kl(x_k(s_1))+ \beta _k G_k(x_k(s_1) - x_0), x_k(s_2) - x_k(s_1) \rangle \ge 0, \\&\langle -s_2+kl(x_k(s_2))+ \beta _k G_k(x_k(s_2) - x_0), x_k(s_1) - x_k(s_2) \rangle \ge 0. \end{aligned}$$

Adding these two inequalities together, we can get

$$\begin{aligned} \langle s_2 - s_1, x_k(s_1) - x_k(s_2) \rangle \le&\ k\langle l(x_k(s_2)) - l(x_k(s_1)), x_k(s_1) - x_k(s_2) \rangle \\&+ \langle \beta _k G_k(x_k(s_2) - x_k(s_1)), x_k(s_1) - x_k(s_2) \rangle \\ \le&\ -k\Vert {x_k(s_1) - x_k(s_2)} \Vert ^2_{\Sigma _h} - \beta _k \Vert {x_k(s_1) - x_k(s_2)} \Vert ^2_{G_k}\\ \le&\ - \beta _k \Vert {x_k(s_1) - x_k(s_2)} \Vert ^{2}, \ \forall k \ge 1, \end{aligned}$$

where the last inequality follows from \(G_k=I-\frac{k}{\beta _{ k }}\Sigma _h\). Thus, we have

$$\begin{aligned} {\Vert {x_k(s_1) - x_k(s_2)} \Vert ^2}\le & {} -\frac{1}{\beta _k} \langle s_2 - s_1, x_k(s_1) - x_k(s_2) \rangle \\\le & {} \frac{1}{\beta _k} \Vert s_2 - s_1\Vert \Vert x_k(s_1) - x_k(s_2) \Vert , \forall k \ge 1, \end{aligned}$$

which is equivalent to

$$\begin{aligned} \Vert \nabla V_{k} (s_1)-\nabla V_{k} (s_2) \Vert \le \frac{1}{\beta _k} \Vert s_2 - s_1\Vert , \ \forall k \ge 1. \end{aligned}$$

\(\square \)

Let \(F^{*}_{D}=\min _{x \in {\mathcal {F}}_D} F(x)\) and let \({\hat{x}}_{k+1}:=\frac{1}{k+1}\sum _{i=0}^{k}x_i\) denote the averaged iterate. According to the convexity of the objective function, we have

$$\begin{aligned} F({\hat{x}}_{k+1})-F^{*}_{D}&\le \frac{1}{k+1} \sum _{i=0}^{k} [{f(x_i)+h(x_i)}]- \min _{x \in {\mathcal {F}}_D} [f(x)+h(x)]\nonumber \\&=\frac{1}{k+1} \max _{x \in {\mathcal {F}}_D}\sum _{i=0}^{k} [f(x_i)-f(x)+h(x_i)-h(x)]\nonumber \\&\le \frac{1}{k+1} \max _{x \in {\mathcal {F}}_D} \sum _{i=0}^{k} [\langle g_i, x_i-x \rangle +h(x_i)-h(x)]. \end{aligned}$$
(3.7)

Consequently, we define the gap function as

$$\begin{aligned} \delta _{k+1} := \max _{x \in {\mathcal {F}}_D} \sum _{i=0}^{k} [\langle g_i, x_i-x \rangle +h(x_i)-h(x)]. \end{aligned}$$

It follows from the inequality (3.5) that

$$\begin{aligned} \delta _{k+1}&= \sum _{i=0}^{k} [\langle g_i, x_i-x_0 \rangle + h(x_i)]+ U _ { k+1 } ( -{s}_{k+1} ) \end{aligned}$$
(3.8)
$$\begin{aligned}&\le \sum _{i=0}^{k} [\langle g_i, x_i-x_0 \rangle + h(x_i)]+ \beta _{k+1} D + V_{k+1} ( -{s}_{k+1})\nonumber \\&:=\Delta _{k+1}. \end{aligned}$$
(3.9)

Remark 3.1

For any fixed k, there exists a constant P that satisfies \(\max \nolimits _{i\in [k+1]} \frac{1}{2}\Vert x_i-x_0\Vert ^2\le P\). Thus we have

$$\begin{aligned} \frac{1}{2}\sum _{i=1}^{k}\Vert x_{i+1}-x_0\Vert ^2_{\Sigma _h} \le \lambda _{max}kP, \end{aligned}$$
(3.10)

where \(\lambda _{max}\) is the maximum eigenvalue of \(\Sigma _h\).

Now we present the upper bounds as follows.

Theorem 3.1

Let the sequences \({\{x_i\}}^{k}_{i=0} \subset Q\) and \({\{g_i\}}^{k}_{i=0} \subset \mathbb {R}^n\) be generated by Algorithm 1, and let the sequence \(\{{\beta }_i\}_{i=0}^{k}\) satisfy \(\beta _{ i }=\gamma {\tilde{\beta }}_i\), where \(\{{\tilde{\beta }}_i\}_{i=1}^{k}\) is defined in (2.5), \({\tilde{\beta }}_{ 0 }={\tilde{\beta }}_{1}\) and \(\gamma >0\). Then

  1.

    For any \(k\in \mathbb {N}\), we have

    $$\begin{aligned} \delta _k \le \Delta _k \le \beta _{k}D + T + \frac{1}{2}\sum _{i=0}^{k-1}\frac{1}{\beta _i} \Vert g_i\Vert ^2 + \lambda _{max}(k-1)P. \end{aligned}$$
    (3.11)
  2.

    Assume that

  (1)

    the sequence \(\{g_k\}_{k \ge 0}\) is bounded, which means that

    $$\begin{aligned} \exists \, L>0 \ \text { such that } \ \Vert g_k\Vert \le L,\ \forall k \ge 0, \end{aligned}$$
    (3.12)
  (2)

    there exists a solution \(x^*\) satisfying

    $$\begin{aligned} \langle g,x-x^* \rangle \ge 0,~ g \in \partial f(x),\ \forall x \in Q. \end{aligned}$$
    (3.13)

Then it holds that

$$\begin{aligned} \Vert x_{k}-x^*\Vert ^2 \le \frac{2T+2\lambda _{max}(k-1)P}{\beta _k} +\Vert x^*-x_0\Vert ^2_{G_k}+L^2. \end{aligned}$$
(3.14)
  3.

    Let \(x^*\) be an interior solution, i.e., there exist \(r, D >0\) satisfying \(B_r(x^*) \subseteq {\mathcal {F}}_D\). Assume there is a \(\Gamma _h >0\) such that

    $$\begin{aligned} \max _{\begin{array}{c} z \in \partial h(y) \\ y \in B_r(x^*) \end{array}} \Vert z\Vert \le \Gamma _h. \end{aligned}$$

Then we have

$$\begin{aligned} \Vert \bar{{s}}_{k+1}\Vert \le \frac{1}{r(k+1)} \left[ \beta _{k+1}D + T + \frac{1}{2} \sum _{i=0}^{k} \frac{1}{\beta _{i}} \Vert g_i\Vert ^2+\lambda _{max}kP \right] +\Gamma _h, \end{aligned}$$
(3.15)

where \( \bar{{s}}_{k+1}=\frac{1}{k+1}\sum _{i=0}^{k}g_i \).

Proof

  1.

    According to the definitions of \(V_k(s)\) and \(G_k\), for any integer \(k\ge 1\), we have

    $$\begin{aligned} V_{k-1}(-s_k)&=\max _{x \in Q}\left\{ \langle -s_k, x-x_0 \rangle -(k-1)h(x)-\frac{\beta _{k-1}}{2}\Vert x-x_0\Vert ^2_{G_{k-1}}\right\} \\&\ge {\langle -s_k, x_k-x_0 \rangle -(k-1)h(x_k)-\frac{\beta _{k-1}}{2}\Vert x_k-x_0\Vert ^2_{G_{k-1}}}\\&= V_{k}(-s_k)+h(x_k)+\frac{\beta _k }{2}\Vert x_k-x_0\Vert ^2_{G_k}-\frac{\beta _{k-1} }{2}\Vert x_k-x_0\Vert ^2_{G_{k-1}}\\&\ge V_k(-s_k)+h(x_k)+\frac{\beta _{ k }-\beta _{k-1} }{2}\Vert x_k-x_0\Vert ^2-\frac{1}{2}\Vert x_k-x_0\Vert ^2_{\Sigma _h}\\&\ge V_k(-s_k)+h(x_k)-\frac{1}{2}\Vert x_k-x_0\Vert ^2_{\Sigma _h}. \end{aligned}$$

    According to Lemma 3.2, we have

    $$\begin{aligned} V_k(s+\sigma ) \le V_k(s) + \langle \sigma , \nabla V_k(s) \rangle +\frac{1}{2\beta _k}\Vert \sigma \Vert ^2,~\forall s, \sigma \in \mathbb {R}^n. \end{aligned}$$
    (3.16)

    Applying (3.16) with \(k-1\) in place of k, \(s=-s_{k-1}\) and \(\sigma =-g_{k-1}\) yields that

    $$\begin{aligned}&V_k(-s_k)+h(x_k) - \frac{1}{2}\Vert x_k-x_0\Vert ^2_{\Sigma _h}\\&\quad \le V_{k-1}(-s_k) = V_{k-1} (-s_{k-1}-g_{k-1}) \\&\quad \le V_{k-1}(-s_{k-1}) + \langle -g_{k-1}, \nabla V_{k-1}(-s_{k-1}) \rangle +\frac{1}{2\beta _{k-1}}\Vert g_{k-1}\Vert ^2\\&\quad = V_{k-1}(-s_{k-1}) + \langle -g_{k-1}, x_{k-1}-x_0 \rangle +\frac{1}{2\beta _{k-1}}\Vert g_{k-1}\Vert ^2,~ \forall k \ge 1, \end{aligned}$$

    which further implies that

    $$\begin{aligned}&\langle g_{k- 1}, x_{k-1}-x_0 \rangle + h(x_k)\\&\quad \le V_{k-1}(-s_{k-1}) - V_k(-s_k)+\frac{1}{2\beta _{k-1}}\Vert g_{k-1}\Vert ^2 + \frac{1}{2}\Vert x_k-x_0\Vert ^2_{\Sigma _h}, ~ \forall k \ge 1. \end{aligned}$$

    Shifting the index by one and summing the resulting inequality from 1 to k, we obtain

    $$\begin{aligned}&\sum \limits _{i=1}^{k} [\langle g_i, x_i-x_0 \rangle + h(x_{i+1})]\\&\quad \le V_{1} (-s_1)- V_{k+1} (-s_{k+1})+ \frac{1}{2}\sum \limits _{i=1}^{k} \left[ \frac{1}{\beta _i} \Vert g_i\Vert ^2+\Vert x_{i+1}-x_0\Vert ^2_{\Sigma _h}\right] , \end{aligned}$$

    which is equivalent to

    $$\begin{aligned}&\sum _{i=0}^{k} [\langle g_i, x_i-x_0 \rangle + h(x_{i})] + V_{k+1} (-s_{k+1}) \le V_{1} (-s_1) + h(x_0)+ h(x_1) - h(x_{k+1})\nonumber \\&\quad +\frac{1}{2}\sum _{i=1}^{k} \left[ \frac{1}{\beta _i} \Vert g_i\Vert ^2+\Vert x_{i+1}-x_0\Vert ^2_{\Sigma _h}\right] . \end{aligned}$$
    (3.17)

    By combining with (3.16) and \(s_1=s_0+g_0\), we have

    $$\begin{aligned} V_1(-s_1)&=V_1(-s_0-g_0) \le V_1(-s_0) +\langle -g_0, \nabla V_1(-s_0)\rangle + \frac{1}{2\beta _{1}}\Vert g_0\Vert ^2\\&= T - h(x_1)+ \frac{1}{2\beta _{0}} \Vert g_0\Vert ^2, \end{aligned}$$

    where the equality follows from (3.4) and \(\beta _{ 0 }=\beta _{1}\). By noting that \(x_0 = \arg \min _{ x \in Q } h(x)\), we can obtain that

    $$\begin{aligned} h(x_0) \le h(x_{k+1}). \end{aligned}$$

    Finally, combining (3.9), (3.17) and the above inequalities, we conclude that

    $$\begin{aligned} \Delta _{k+1} \le \beta _{k+1}D + T + \frac{1}{2}\sum _{i=0}^{k}\frac{1}{\beta _i} \Vert g_i\Vert ^2 + \frac{1}{2}\sum _{i=1}^{k}\Vert x_{i+1}- x_0\Vert ^2_{\Sigma _h}. \end{aligned}$$
  2.

    Notice that \(x_k = \mathop {\arg \min }_{ x \in Q } \left\{ \left\langle s_{k} , x - x _ { 0 } \right\rangle +kh ( x )+ \frac{\beta _{k}}{2} \Vert x-x_0\Vert ^2_{G_k} \right\} \). By the first-order optimality conditions, there exists \(l_k \in \partial h(x_k)\) such that

    $$\begin{aligned} \left\langle s_k + k l_k+\beta _{ k } G_k (x_k-x_0), x-x_k \right\rangle \ge 0,\ \forall x \in Q. \end{aligned}$$
    (3.18)

    Notice that \(G_k=I-{k\Sigma _h}/{\beta _{ k }}\). Then we can define the following \(\beta _{ k }-\)strongly convex function

    $$\begin{aligned} \phi _k(x):=kh(x)+\frac{\beta _{ k }}{2}\Vert x-x_0\Vert ^2_{G_k},\ k \in \mathbb {N}, \end{aligned}$$

    which implies that

    $$\begin{aligned} \phi _k(x) \ge \phi _k(x_k) + \left\langle kl_k+\beta _{ k } G_k (x_k-x_0), x-x_k \right\rangle +\frac{\beta _{ k }}{2}\Vert x_k-x\Vert ^2. \end{aligned}$$
    (3.19)

    Rearranging the inequality (3.19) and using the definition of \(\phi _k\), we can get

    $$\begin{aligned}&\left\langle kl_k+\beta _{ k } G_k (x_k-x_0), x-x_k \right\rangle +\frac{\beta _{k }}{2}\Vert x_k-x\Vert ^2\\&\le k[h(x)-h(x_k)] + \frac{\beta _{ k }}{2}\Vert x-x_0\Vert ^2_{G_k} -\frac{\beta _{ k }}{2}\Vert x_k-x\Vert ^2_{G_k}. \end{aligned}$$

    Combining with (3.18) yields that

    $$\begin{aligned} \frac{\beta _{ k }}{2}\Vert x_k-x\Vert ^2 \le&~ kh(x)-kh(x_k)+ \frac{\beta _{ k }}{2}\Vert x-x_0\Vert ^2 _{G_k} - \frac{\beta _{ k }}{2}\Vert x_k-x_0\Vert ^2_{G{_k}} \nonumber \\&+ \left\langle kl_k+\beta _{ k }G_k(x_k-x_0), x_k -x \right\rangle \nonumber \\ \le&~ kh(x)-kh(x_k)+\frac{\beta _{ k }}{2}\Vert x-x_0\Vert ^2_ {G_k}- \frac{\beta _{ k }}{2}\Vert x_k-x_0\Vert ^2_{G{_k}} - \left\langle {s_k}, x_k -x \right\rangle \nonumber \\ =&V_k(-s_k) + kh(x) + \frac{\beta _{ k }}{2} \Vert x-x_0\Vert ^2_{G_k} + \left\langle {s_k}, x-x_0 \right\rangle \nonumber \\ =&~ V_k(-s_k)+\sum _{i=0}^{k-1}\left\langle g_{i}, x_i-x_0 \right\rangle + \sum _{i=0}^{k-1}h(x_i) \nonumber \\&+ \frac{\beta _{ k }}{2} \Vert x-x_0\Vert ^2_{G_k} + \sum _{i=0}^{k-1}\left\langle g_{i}, x-x_i \right\rangle + kh(x) -\sum _{i=0}^{k-1}h(x_i). \end{aligned}$$
    (3.20)

    Furthermore, combining (3.17) with the two inequalities above yields

    $$\begin{aligned} V_{k+1} (-s_{k+1})+\sum _{i=0}^{k} [\langle g_i, x_i-x_0 \rangle + h(x_{i})] \le T + \frac{1}{2}\sum _{i=0}^{k}\frac{1}{\beta _i} \Vert g_i\Vert ^2 + \frac{1}{2}\sum _{i=1}^{k}\Vert x_{i+1}- x_0\Vert ^2_{\Sigma _h}. \end{aligned}$$
    (3.21)

    By combining (3.20) with (3.21), where k is replaced by \(k-1\), we can get

    $$\begin{aligned} \frac{\beta _{ k}}{2}\Vert x_k-x\Vert ^2 \le T&+\frac{1}{2}\sum \limits _{i=0}^{k-1} \frac{1}{\beta _i} \Vert g_i\Vert ^2 + \frac{1}{2}\sum \limits _{i=1}^{k-1}\Vert x_{i+1}-x_0\Vert ^2_{\Sigma _h}\nonumber \\&+ \frac{\beta _{ k }}{2} \Vert x-x_0\Vert ^2_{G_k} + \left\{ \sum \limits _{i=0}^{k-1} {[f(x)+h(x)]-[f(x_i)+h(x_i)]}\right\} . \end{aligned}$$

    Finally, we set \(x=x^*:=\mathop {\arg \min }_{x \in {\mathcal {F}}_D} f(x)+h(x)\). Then it holds that

    $$\begin{aligned} \frac{\beta _{k }}{2}\Vert x_k-x^*\Vert ^2 \le T + \frac{1}{2}\sum _{i=0}^{k-1} \frac{1}{\beta _i} \Vert g_i\Vert ^2 +\frac{1}{2}\sum _{i=1}^{k-1}\Vert x_{i+1}-x_0\Vert ^2_{\Sigma _h} +\frac{\beta _{ k }}{2} \Vert x^*- x_0\Vert ^2_{G_k}. \end{aligned}$$

    According to the conditions (2.5), (3.10) and (3.12), we obtain the inequality (3.14).

  3.

    Based on (3.8), we can obtain

    $$\begin{aligned}&\delta _{k+1} = \sum _{i=0}^{k} [\langle g_i, x_i-x^* \rangle + h(x_i)]+ \max _{x \in {\mathcal {F}}_D } \{ \left\langle {s}_{k+1} , x^* - x \right\rangle - (k+1)h ( x ) \}\\&\quad = \sum _{i=0}^{k} [\langle g_i, x_i-x^* \rangle + h(x_i)-h ( x^* )]+ \max _{x \in {\mathcal {F}}_D } \{ \left\langle {s}_{k+1} , x^* - x \right\rangle + (k+1)h ( x^* ) \\&\qquad - (k+1)h ( x ) \}\\&\quad \ge \sum _{i=0}^{k}\{f(x_i)+h(x_i)-f(x^*)-h ( x^* )\} + \max _{x \in {\mathcal {F}}_D } \{ \left\langle {s}_{k+1} , x^* - x \right\rangle + (k+1)h ( x^* ) \\&\qquad - (k+1)h ( x ) \}\\&\quad \ge \max _{x \in B_r(x^*) } \{ \left\langle {s}_{k+1} , x^* - x \right\rangle + (k+1)h ( x^* ) - (k+1)h ( x ) \}. \end{aligned}$$

    Let

    $$\begin{aligned} {\bar{x}} =\arg \max _{x \in B_r(x^*) } \left\langle {s}_{k+1} , x^* - x \right\rangle . \end{aligned}$$

    Then we have \(\Vert x^*-{\bar{x}}\Vert =r\) and

    $$\begin{aligned} \left\langle {s}_{k+1} , x^* - {\bar{x}} \right\rangle = \Vert {s}_{k+1}\Vert \Vert x^* - {\bar{x}} \Vert = r\Vert {s}_{k+1}\Vert . \end{aligned}$$

    Thus, it holds that

    $$\begin{aligned} \delta _{k+1}&\ge \max _{x \in B_r(x^*) } \{ \left\langle {s}_{k+1} , x^* - x \right\rangle + (k+1)h ( x^* ) - (k+1)h ( x ) \}\\&\ge \left\langle {s}_{k+1} , x^* - {\bar{x}} \right\rangle + (k+1){h(x^*)-(k+1)h({\bar{x}})} \\&\ge r\Vert {s}_{k+1}\Vert + (k+1){ \left\langle l({\bar{x}}), x^*-{\bar{x}} \right\rangle } \\&\ge r\Vert {s}_{k+1}\Vert - (k+1){ \Vert l({\bar{x}})\Vert \Vert x^*-{\bar{x}}\Vert }\\&=r\Vert {s}_{k+1}\Vert - (k+1) r \Vert l({\bar{x}})\Vert , \end{aligned}$$

    which implies

    $$\begin{aligned} \frac{1}{k+1}\Vert {s}_{k+1}\Vert&\le \frac{1}{r(k+1)}\delta _{k+1}+ \Vert l({\bar{x}})\Vert \le \frac{1}{r(k+1)}\delta _{k+1}+ \Gamma _h. \end{aligned}$$

    Then (3.15) follows from (3.11).\(\square \)

As a main result, we now estimate the upper bound on the complexity of IPSB.

Theorem 3.2

Assume there exists a constant \(L>0\) such that \(\Vert g_k\Vert \le L\), \(\forall k \ge 0\). Denote by \(\{x_i\}_{i=0}^{k}\) the sequence generated by Algorithm 1. Let \( F_D^*= \min _ { x \in {\mathcal {F}}_D } F(x)\). Then we have

$$\begin{aligned} F(\hat{x}_{k+1})-F_D^* -\lambda _{max}P \le \frac{\tilde{\beta }_{k+1}}{k+1}\left( \gamma D+ \frac{L^2}{2\gamma }\right) + \frac{T- \lambda _{max}P}{k+1}. \end{aligned}$$
(3.22)

Proof

By combining (2.5) with the inequalities (3.7) and (3.11), we have

$$\begin{aligned} F(\hat{x}_{k+1})-F_D^*&\le \frac{1}{k+1} \delta _{k+1} \le \frac{1}{k+1} \left[ \beta _{k+1}D + T+ \frac{1}{2} \sum _{i=0}^{k} \frac{1}{\beta _{i}} \Vert g_i\Vert ^2 + \lambda _{max}kP\right] \\&\le \frac{\tilde{\beta }_{k+1}}{k+1}\left( \gamma D+ \frac{L^2}{2\gamma }\right) + \frac{T- \lambda _{max}P}{k+1}+\lambda _{max}P, \end{aligned}$$

which finishes the proof of the inequality (3.22). \(\square \)

Remark 3.2

According to Lemma 2.1, the sequence \(\{{\tilde{\beta }}_{k}\}_{k=0}^{\infty }\) balances the terms appearing on the right-hand side of inequality (3.11). It follows from Theorem 3.2 that IPSB converges to a neighborhood of the optimal value with rate \(O({1}/{\sqrt{k}} )\), as the estimate below makes explicit.
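Indeed, by Lemma 2.1 we have \({\tilde{\beta }}_{k+1} \le {\hat{\lambda }}+{1}/{{\hat{\lambda }}}+\sqrt{{\hat{\lambda }}^2+2k-2}\), so that

$$\begin{aligned} \frac{\tilde{\beta }_{k+1}}{k+1}\left( \gamma D+ \frac{L^2}{2\gamma }\right) + \frac{T- \lambda _{max}P}{k+1} = O\left( \frac{1}{\sqrt{k}}\right) , \end{aligned}$$

i.e., the right-hand side of (3.22) vanishes at rate \(O({1}/{\sqrt{k}})\), and hence the gap \(F({\hat{x}}_{k+1})-F_D^*\) is at most \(\lambda _{max}P+O({1}/{\sqrt{k}})\).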

4 Numerical experiments

In this section, we perform numerical experiments to compare the algorithms IPSB, SDA and PSB on two kinds of test problems. All experiments were implemented in MATLAB 2018b and run on a laptop with a dual-core (\(1.6+1.8\) GHz) processor and 8 GB RAM.

4.1 Regularized least squares problem

In this subsection, we test the regularized least squares problem

$$\begin{aligned} \min _ {x \in \mathbb {R}^{n_2}} \left\{ \Vert Ax-b\Vert _2 + {\bar{\rho }} \max _{i \in [m]} {f_i(x)} \right\} , \end{aligned}$$
(4.1)

where \(A \in \mathbb {R}^{n_1\times n_2}\), \(b\in \mathbb {R}^{n_1}\) and \(f_i(x),i\in [m]\) are all positive and strongly convex. Notice that (4.1) is a special case of (1.1) with \(f(x)=\Vert Ax-b\Vert _2\) and \(h(x)={\bar{\rho }} \max _{i\in [m]} {f_i(x)}\).

In our first test, we set \(m=2\), \(f_1(x)=\Vert x\Vert _{B}^2\) and \(f_2(x)=\Vert x-c\Vert _{B}^2\), where \(B \in \mathbb {R}^{n_2\times n_2}\) and \(c \in \mathbb {R}^{n_2}\). We set \(B\ne 0\) to be positive semidefinite but singular so that the function h is SQCC. In fact, we have \(\Sigma _h=2B\). Applying Algorithm 1 to (4.1), the iteration reduces to

$$\begin{aligned} {\left\{ \begin{array}{lr} \displaystyle s_{k+1}=s_k+ g_k, \\ \displaystyle x_{k+1}=\arg \min _{x }\left\{ \left\langle s_{k+1},x\right\rangle + (k+1)\max \{\Vert x\Vert _{B}^2, \Vert x-c\Vert _{B}^2\}+ \frac{\beta _{ k+1 }}{2}\Vert x\!-\!x_0\Vert ^2_{I-\frac{2(k+1)}{\beta _{ k+1 }} B}\right\} , \end{array} \right. } \end{aligned}$$

where \(g_k \in \partial f(x_k)\) and

$$\begin{aligned} \partial f(x)= \left\{ \begin{array}{lr} \displaystyle \frac{A^T(Ax-b)}{\Vert Ax-b\Vert }, &{}if\ Ax-b \ne 0,\\ \displaystyle \{A^Tv: v\in \mathbb {R}^{n_1},\ \Vert v\Vert \le 1 \},&{}if\ Ax-b = 0. \end{array} \right. \end{aligned}$$
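A possible MATLAB implementation of this subgradient oracle is sketched below; returning the zero vector in the degenerate case \(Ax=b\) is one valid (but arbitrary) choice.

```matlab
% One subgradient of f(x) = ||Ax - b||; the zero vector is returned when Ax = b.
function g = subgrad_f(A, b, x)
r = A * x - b;
nr = norm(r);
if nr > 0
    g = A' * (r / nr);      % the gradient A'(Ax - b)/||Ax - b||
else
    g = zeros(size(x));     % 0 = A'*v with v = 0, ||v|| <= 1
end
end
```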

The three different algorithms in comparison for solving (4.1) are explicitly reformulated as

$$\begin{aligned} \displaystyle&SDA: x_{k+1}=x_0- \frac{1}{\beta _{ k+1}}{\sum _{i=0}^{k} {(g_i+l_i)}} ,\\ \displaystyle&PSB: x_{k+1}= \left\{ \begin{array}{lr} \displaystyle (2(k+1)B+\beta _{ k+1} I)^{-1}(\beta _{ k+1} x_0-s_{k+1}), &{} \hbox { if }\ \Vert x\Vert _B^2> \Vert x-c\Vert _B^2,\\ \displaystyle (2(k+1)B+\beta _{ k+1} I)^{-1}(\beta _{ k+1} x_0-s_{k+1}+2(k+1)Bc), &{} \hbox { if }\ \Vert x\Vert _B^2< \Vert x-c\Vert _B^2,\\ \displaystyle \left( 2(k+1)B + \beta _{ k+1} {\bar{G}} \right) ^{-1} \left( \beta _{ k+1} x_0 -s_{k+1}+ {\bar{\rho }}_1 Bc \right) , &{} \hbox {otherwise}, \end{array} \right. \\ \displaystyle&IPSB: x_{k+1}= \left\{ \begin{array}{lr} \displaystyle x_0-\frac{1}{\beta _{ k+1}}(2(k+1)Bx_0+s_{k+1}), &{} \hbox { if }\ \Vert x\Vert _B^2 > \Vert x-c\Vert _B^2,\\ \displaystyle x_0-\frac{1}{\beta _{ k+1}}(2(k+1)Bx_0+s_{k+1}-2(k+1)Bc) , &{} \hbox { if }\ \Vert x\Vert _B^2 < \Vert x-c\Vert _B^2,\\ \displaystyle x_0-\frac{1}{\beta _{k+1}}(2(k+1)Bx_0+s_{k+1}+{2{\bar{\rho }}_2}Bc), &{} \hbox { otherwise }, \end{array} \right. \end{aligned}$$

where the expressions for \({\bar{\rho }}_1\), \({\bar{\rho }}_2\) and \({\bar{G}}\) are as follows

$$\begin{aligned}&{\bar{\rho }}_1 = \frac{(k+1)c^T ( B c+ s_{k+1}-\beta _{k+1} x_0)}{c^T B c},\\&{\bar{\rho }}_2 = \frac{1}{4\Vert Bc\Vert ^2}\left( \beta _{ k+1} c^T B (c-x_0)+2c^T B s_{k+1} + 4(k+1)c^T B \cdot B x_0\right) ,\\&{\bar{G}} = I - \frac{B c\cdot c^T}{c^T B c}. \end{aligned}$$

In addition, \(l_k \in \partial h(x_k)\) and

$$\begin{aligned} \partial h(x)\subseteq \{l \in \mathbb {R}^{n_2}:l=2 B x-2 \alpha B c,\ \alpha \in [0,1]\}. \end{aligned}$$

In our experiments, we choose \({\bar{\rho }} =1\) and \(n_1 \times n_2 \in \{ 400 \times 900, 800 \times 2000, 1500 \times 3000\}\). In Algorithm 1, we set \(\gamma =20\), \({\hat{\lambda }}=10^{-3}\), and the termination criterion is that either \(|F( {{\hat{x}}}_{k})-F({\hat{x}}_{k-1})| \le 10^{-3}\) or the number of iterations reaches 300. Starting from a fixed seed, we set \(x^* = (10, \ldots ,10)\in \mathbb {R}^{n_2}\), independently generate the entries of \(c \in \mathbb {R}^{n_2} \) from the normal distribution \({\mathcal {N}}(0,0.25)\), and then generate each element of A from \({\mathcal {N}}(0,20^2)\). We set \(b\in \mathbb {R}^{n_1}\) as follows

$$\begin{aligned} b_i = \sum _{j=1}^{n_2}A_{ij}x^*_j,~i\in [n_1]. \end{aligned}$$

The matrix B is constructed by randomly generating its eigenvalues and eigenvectors. The first ten eigenvalues of B are random positive numbers and the rest are zero. The eigenvectors are taken from a random orthogonal matrix generated from a matrix with uniformly distributed random entries. When \(n_1=400\), \(n_2=900\), the MATLAB code to generate the above data is as follows; the other cases are similar.

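The original listing is displayed as a figure and is not reproduced here; the following is a hedged reconstruction from the description above, in which the seed value, the distribution of the ten positive eigenvalues and the use of a QR factorization to obtain the orthogonal eigenvector matrix are assumptions.

```matlab
% Hedged reconstruction of the data generation for n1 = 400, n2 = 900.
rng(0);                                   % fixed seed (value assumed)
n1 = 400;  n2 = 900;
x_op = 10 * ones(n2, 1);                  % x^* = (10, ..., 10)
c    = 0.5 * randn(n2, 1);                % entries of c ~ N(0, 0.25)
A    = 20  * randn(n1, n2);               % entries of A ~ N(0, 20^2)
b    = A * x_op;                          % b_i = sum_j A_ij x^*_j
[U, ~] = qr(rand(n2));                    % orthogonal matrix from uniform entries
d = [rand(10, 1); zeros(n2 - 10, 1)];     % ten random positive eigenvalues
B = U * diag(d) * U';                     % B is positive semidefinite and singular
```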

where x_op corresponds to the variable \(x^*\).

We plot \(\log (\log F({{{\hat{x}}}}_k))\) against the iteration number k and against the CPU runtime in Figs. 1 and 2, respectively.

Fig. 1 Numerical comparison between algorithms SDA and IPSB

As shown in Fig. 1, IPSB attains a better objective value than SDA throughout the iterations. Zooming in on the figures shows that the values generated by IPSB are decreasing rather than constant.

Fig. 2 Numerical comparison between algorithms PSB and IPSB

PSB runs much slower than IPSB because of the heavy computational cost of the matrix inversions. Figure 2 shows that IPSB is much more efficient than PSB.

4.2 Elastic net regression

The elastic net is a regularized regression model [27] obtained by linearly combining LASSO and ridge regression. It is formulated as

$$\begin{aligned} \min _{\omega \in \mathbb {R}^{n}} \Vert y-X\omega \Vert _2^2+\eta _1\Vert \omega \Vert _1+\eta _2\Vert \omega \Vert ^2_2, \end{aligned}$$
(4.2)

where p is the number of samples, n is the number of features, \(y\in \mathbb {R}^p\) is the response vector, \(X \in \mathbb {R}^{p \times n}\) is the design matrix, and \(\eta _1, \eta _2 > 0\) are regularization parameters. It corresponds to setting \(f(\omega )=\eta _1\Vert \omega \Vert _1\) and \(h(\omega )=\Vert y-X\omega \Vert _2^2+\eta _2\Vert \omega \Vert ^2_2\) in (1.1). The iteration schemes of the three algorithms under comparison for solving (4.2) are as follows

$$\begin{aligned} \displaystyle&SDA: \omega _{k+1}=\omega _0+\frac{1}{\beta _{ k+1}}\sum _{i=0}^{k}\left( 2X^T(X\omega _i-y)+\eta _1sgn(\omega _i)+2\eta _2\omega _i \right) ,\\ \displaystyle&PSB: \omega _{k+1}=((k+1)(2X^TX+2\eta _2 I) + \beta _{ k+1 } I)^{-1}\left( \beta _{ k+1 }\omega _0+2X^Ty-\eta _1\sum _{i=0}^{k}sgn(\omega _i)\right) ,\\ \displaystyle&IPSB: \omega _{k+1}=\omega _0-\frac{1}{\beta _{ k+1 }}\left( (k+1)(2X^T(X\omega _0-y)+2\eta _2\omega _0)+\eta _1\sum _{i=0}^{k}sgn(\omega _i)\right) , \end{aligned}$$

where the initial point \({\omega _{0}}\) is given by \((X^TX+\eta _2 I)^{-1}X^Ty = \arg \min _{\omega \in \mathbb {R}^{n}} h(\omega )\), \(sgn(\cdot )\) is the sign function, and the sequence \(\{\beta _{ k }\}_{k\ge 0}\) is chosen according to (2.5). We list in Table 1 the computational complexity of each iteration. It shows that PSB has the highest per-iteration cost when n is much larger than k, while SDA has the highest cost when k is much larger than p and n.

Table 1 Computational cost in each iteration
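As an illustration, a minimal MATLAB sketch of the IPSB iteration displayed above is given below; the fixed iteration budget, the averaging of the iterates and the convention \(sgn(0)=0\) are our assumptions, and the stopping rule described next is omitted.

```matlab
% A minimal sketch of the IPSB iteration for (4.2), following the closed-form
% update displayed above; iteration budget, averaging and sign(0)=0 are assumed.
function w_bar = ipsb_elastic_net(X, y, eta1, eta2, gamma, lambda_hat, K)
n = size(X, 2);
w0 = (X' * X + eta2 * eye(n)) \ (X' * y);   % w_0 = argmin_w h(w)
w = w0;
sgn_sum = zeros(n, 1);                      % running sum of sgn(w_i)
w_sum = zeros(n, 1);
beta_t = lambda_hat;                        % beta_tilde_1, cf. (2.5)
for k = 0:K-1
    w_sum = w_sum + w;
    sgn_sum = sgn_sum + sign(w);            % sum_{i=0}^{k} sgn(w_i)
    beta = gamma * beta_t;                  % beta_{k+1} = gamma * beta_tilde_{k+1}
    w = w0 - ((k+1) * (2 * X' * (X * w0 - y) + 2 * eta2 * w0) ...
              + eta1 * sgn_sum) / beta;     % IPSB update above
    beta_t = beta_t + 1 / beta_t;           % next beta_tilde
end
w_bar = w_sum / K;                          % averaged iterate
end
```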

We set the termination criterion as

$$\begin{aligned} \frac{|f({\bar{\omega }})-{\bar{f}}|}{{\bar{f}}} \le \epsilon ^{rel}, \end{aligned}$$

where \({\bar{f}}\) is an approximation of the optimal value obtained by running 500 iterations of SDA in advance, \({\bar{\omega }}= \frac{1}{t}\sum _{i=1}^t \omega _i\) and t is the actual number of iterations at termination.

We conduct the experiments with the following synthetic data and real data, respectively.

\( \textit{Synthetic data:} \) Starting from a fixed seed, we independently generate \(X_{ij} \sim {\mathcal {N}}(0, 0.01)\), \(\omega ^*_j \sim {\mathcal {N}}(0,1)\), \(\epsilon _i \sim {\mathcal {N}}(0,0.04)\), and then set \(y_i=\sum _{j=1}^{n} X_{ij} \omega ^*_j + \epsilon _i\), \(i\in [p]\). We choose \(p\times n \in \{300 \times 1000, 500 \times 2000, 700 \times 3000, 1000 \times 5000\}\). The hyperparameters used for the synthetic data are set as

$$\begin{aligned} \gamma =10^2,\ \beta _{ 0 }=3 \times 10^{-2},\ \epsilon ^{rel}=10^{-1}. \end{aligned}$$

When \(p=300\), \(n=1000\), the MATLAB code to generate the data is as follows. The others are similar.

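The original listing is displayed as a figure and is not reproduced here; the following is a hedged reconstruction from the description above, with the seed value being an assumption.

```matlab
% Hedged reconstruction of the synthetic data generation for p = 300, n = 1000.
rng(0);                       % fixed seed (value assumed)
p = 300;  n = 1000;
X    = 0.1 * randn(p, n);     % X_ij ~ N(0, 0.01)
x_op = randn(n, 1);           % omega^*_j ~ N(0, 1)
ep   = 0.2 * randn(p, 1);     % epsilon_i ~ N(0, 0.04)
y    = X * x_op + ep;         % y_i = sum_j X_ij omega^*_j + epsilon_i
```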

where x_op and ep correspond to the variables \(\omega ^*\) and \(\epsilon \) respectively.

Table 2 Numerical results for synthetic data

MNIST data [15]: The MNIST data set contains 70,000 samples of images of the 10 digits, each given as a \(28 \times 28\) gray-scale pixel map, i.e., 784 features. We take the digits 8 and 9, so that \(p=13783\) and \(n=784\). Moreover, let \(y\in \{+1,-1\}^p\) be the binary label vector. The hyperparameters used for the MNIST data are as follows

$$\begin{aligned} \gamma =10^3,\ \beta _{ 0 }=10^{-3},\ \epsilon ^{rel}=10^{-2}. \end{aligned}$$
Table 3 Numerical results for MNIST data

Tables 2 and 3 present the experimental results for the synthetic data and the MNIST data, respectively. In both tables, we report the number of iterations (Iter.), the running time in seconds, and the accuracy (Accu.) defined as \(1-|f({\bar{\omega }})-{\bar{f}}|/{{\bar{f}}}\).

On the synthetic data, SDA takes the largest number of iterations among the three algorithms, IPSB requires less CPU time than the other two, and PSB is the least efficient. On the MNIST data, the three algorithms take almost the same number of iterations, and IPSB takes the least CPU time.

5 Conclusions

Nesterov's dual averaging scheme avoids the vanishing stepsizes of classical subgradient methods for nonsmooth convex minimization problems. It was later extended to problems with an additional regularizer, yielding the regularized dual averaging (RDA) scheme.

In this paper, we propose the dynamic regularized dual averaging scheme by relaxing the positive definite regularization term in RDA, which not only reduces the impact of the initial point on the subproblems in later iterations but also makes the subproblem in each iteration easier to solve. Under this new scheme, we propose the indefinite proximal subgradient-based (IPSB) algorithm and analyze its convergence rate, which is \(O({1}/{\sqrt{k}})\), where k is the number of iterations; IPSB converges to a neighborhood of the optimal value. Numerical experiments on the regularized least squares problem and on elastic net regression show that IPSB is more efficient than the existing algorithms SDA and PSB. Future work includes more real applications of IPSB and further improvements of IPSB by, for example, relaxing the condition on the initial point.