An indefinite proximal subgradient-based algorithm for nonsmooth composite optimization

Liu, Rui; Han, Deren; Xia, Yong

doi:10.1007/s10898-022-01173-9


Published: 16 September 2022

Volume 87, pages 533–550, (2023)

Journal of Global Optimization


Abstract

We propose an indefinite proximal subgradient-based algorithm (IPSB) for solving nonsmooth composite optimization problems. IPSB is a generalization of Nesterov’s dual averaging algorithm in which an indefinite proximal term is added to the subproblems; when the proximal term is chosen judiciously, the subproblems become easier to solve and the algorithm becomes more efficient. Under mild assumptions, we establish sublinear convergence of IPSB to a region of the optimal value. We also report numerical results demonstrating the efficiency of IPSB in comparison with classical dual averaging-type algorithms.


1 Introduction

Consider the nonsmooth composite convex optimization problem

$$\begin{aligned} \min _ {x \in Q}\ \left\{ F(x) :=f(x)+h(x) \right\} , \end{aligned}$$

(1.1)

where $Q \subseteq \mathbb {R}^n$ is a simple closed convex set, $f,h:\mathbb {R}^n \rightarrow {\mathbb {R} \cup \{+\infty \}}$ are convex (not necessarily smooth) and $F:\mathbb {R}^n \rightarrow \mathbb {R} \cup \{+\infty \}$ is nonsmooth. Moreover, h is assumed to be the summation of a quadratic convex and a convex function (SQCC). Problem (1.1) has received much attention due to its broad applications in several different areas such as signal processing, system identification, machine learning and image processing; see, for instance, [6, 7, 10] and references therein.

Among the numerical algorithms for solving the nonsmooth optimization problem (1.1), such as splitting algorithms [9], cutting plane methods [21], ellipsoid methods [11], bundle methods [17], gradient sampling methods [4] and smoothing methods [19], subgradient methods [25] are fundamental and have been extensively studied due to their applicability to a wide variety of problems and their low memory requirements [3, 8, 22, 23]. The iteration complexity of a subgradient method for the general nonsmooth convex minimization problem is $O({1}/{\epsilon ^2})$, i.e., after $O({1}/{\epsilon ^2})$ iterations, the gap between the objective function value and the optimal value is of order $\epsilon $; see [21]. For problems equipped with additional structure, various approaches have been proposed, such as smoothing schemes [19], the fast iterative shrinkage-thresholding algorithm [1] and bundle methods [17], to improve the iteration complexity to $O({1}/{\epsilon })$.

Note that for nonsmooth optimization problems, the subgradient usually does not vanish at the solution point; as a consequence, the stepsize in a subgradient-based method must approach zero. This vanishing stepsize slows down the convergence of the subgradient method [20]. To deal with this undesirable phenomenon, Nesterov proposed a dual averaging (DA) scheme [20]. Each iteration of the DA scheme takes the form

$$\begin{aligned} x_{k+1}=\mathop {\mathrm{argmin}\,}_{ x \in Q } \left\{ \sum _{i=0}^{k} \langle \lambda _i D_i , x-x_0 \rangle + {\beta _{k+1}}r(x) \right\} ,~ D_k \in \partial F(x_k),\ \forall k \ge 0, \end{aligned}$$

(1.2)

where $\lambda _k$, $\forall k \ge 0$ are stepsizes, $ \{\beta _{k}\}_{k=0}^{\infty } $ is a positive nondecreasing sequence and $ r(\cdot ) $ is an auxiliary strongly convex function. Following the DA scheme, Xiao [26] proposed a regularized dual averaging (RDA) scheme, which generates the iterate by minimizing a problem that involves all the past subgradients of f and the whole function h,

$$\begin{aligned} x _ { k+1} = \mathop {\arg \min }_{ x \in Q } \left\{ \sum _{i=0}^{k}\left\langle g_{i} , x-x_0 \right\rangle + (k+1)h(x)+ {\beta _{k+1}}r(x) \right\} ,~ g_{i} \in \partial f(x_{i}),\ \forall k \ge 0, \end{aligned}$$

(1.3)

where $x_0$ is the minimizer of h over Q. Setting the auxiliary function $r(\cdot )$ to $\frac{1}{2}\Vert \cdot -x_0\Vert ^2 $ in the RDA scheme (1.3) yields the so-called proximal subgradient-based (PSB) method

$$\begin{aligned} x _ { k+1} = \mathop {\arg \min }_{ x \in Q } \left\{ \sum _{i=0}^{k}\left\langle g_{i} , x-x_0 \right\rangle + (k+1)h(x)+ {\frac{\beta _{k+1}}{2}}\Vert x-x_0\Vert ^2 \right\} ,~ g_{i} \in \partial f(x_{i}),\ \forall k \ge 0. \end{aligned}$$

(1.4)

The regularization function $r(\cdot )$ is crucial in RDA and PSB; it plays a role similar to that of the proximal term in the classical proximal point algorithm (PPA) [5, 18, 24]. On the one hand, it ensures the existence and uniqueness of the solution of the subproblems and makes the subproblems stable. On the other hand, it also influences the efficiency of the algorithms. Recently, much attention has been paid to relaxing the strong convexity requirement on the proximal term in PPA [13] and in related algorithms such as the augmented Lagrangian method [12] and the alternating direction method of multipliers [14, 16], and this strategy has proved very successful in numerical experiments. In this paper, we relax $r(\cdot )$ in (1.3) to an indefinite one, yielding the following dynamic regularized dual averaging (DRDA) scheme

$$\begin{aligned} x _ { k+1} = \mathop {\arg \min }_{ x \in Q } \left\{ \sum _{i=0}^{k}\left\langle g_{i} , x-x_0 \right\rangle + (k+1)h(x)+ {\beta _{k+1}}r_k(x) \right\} ,~ g_{i} \in \partial f(x_{i}),\ \forall k \ge 0, \end{aligned}$$

(1.5)

where $(k+1)h(\cdot )+ {\beta _{k+1}}r_k(\cdot )$ is strongly convex for each k. Note that under this requirement, $r_k(\cdot )$ can be chosen to be nonconvex even though $h(\cdot )$ is merely convex. Specifically, we introduce an appropriate indefinite term in (1.4) and propose the indefinite proximal subgradient-based (IPSB) algorithm. A convergence rate is established under mild assumptions. We carry out numerical experiments on the regularized least squares problem and elastic net regression. The numerical results demonstrate the efficiency of IPSB in comparison with the existing algorithms SDA and PSB.

The rest of this paper is organized as follows. In the following subsection, we introduce some notations and preliminaries. Section 2 reviews the simple dual averaging algorithm, the proximal subgradient-based algorithm and gives our new extensions. Section 3 presents the convergence analysis. Numerical experiments are performed in Sect. 4. We make conclusions in Sect. 5.

1.1 Notations and preliminaries

In this subsection, we present some definitions and preliminary results that will be used in our analysis later. Let Q be a closed convex set in $\mathbb {R}^n$. We use $\langle s, x \rangle $ and $s^Tx$ to denote the inner product of s and x, two real vectors with the same dimension. Let $\mathbb {S}^{n}$ denote the set of symmetric matrices of order n, and I denote the identity matrix whose dimension is clear from the context. The Euclidean norm defined by $\sqrt{\langle \cdot ,\cdot \rangle }$ is denoted by $\Vert \cdot \Vert $. Let [m] denote the set $\{1,2,\ldots ,m\}$. The ball with center x and radius r reads as

$$\begin{aligned} B_r (x) = \{ y \in \mathbb {R}^n : \Vert y - x \Vert \le r \}. \end{aligned}$$

The subdifferential of a convex function f at point $x\in \mathrm {dom} f$ is given by

$$\begin{aligned} \partial f(x) := \{g \in \mathbb {R}^n: f(y) \ge f(x) + \langle g, y-x \rangle , \forall y \in \mathbb {R}^n\}, \end{aligned}$$

and any element in $\partial f(x)$ is called a subgradient of f at x, where $\mathrm {dom} f$ is the domain of f, i.e., the set of $x\in \mathbb {R}^n$ such that f(x) is finite.

A function $f:\mathbb {R}^n \rightarrow \mathbb {R}\cup \{+\infty \}$ is called strongly convex if there exists a constant $\kappa > 0$ such that

$$\begin{aligned} f(x) \ge f(y) +\langle g, x-y \rangle + \frac{\kappa }{2}\Vert x-y\Vert ^2,~ \forall x,y \in \mathbb {R}^n,\ \forall g \in \partial f(y), \end{aligned}$$

where the constant $\kappa $ is called the strong convexity parameter.

For $M \in \mathbb {R}^{n \times n}$, we use the notation $\Vert x\Vert _M^2$ to denote $x^TMx$ even if M is not positive semidefinite. Denote by $\,\mathrm{tr }\,(M)$ the trace of the matrix M.

Definition 1.1

(SQCC) A function $h:\mathbb {R}^n \rightarrow {\mathbb {R} \cup \{\infty \}}$ is called the summation of quadratic convex and convex functions (SQCC) if there exists a (nonlinear) quadratic convex function $q:\mathbb {R}^n \rightarrow {\mathbb {R} \cup \{\infty \}}$ and a convex function ${\tilde{h}}:\mathbb {R}^n \rightarrow {\mathbb {R} \cup \{\infty \}}$ such that

$$\begin{aligned} h(x)= q(x)+{\tilde{h}}(x),\ \forall x \in \mathbb {R}^n. \end{aligned}$$

Since h is SQCC, there exists a non-zero positive semidefinite matrix $\Sigma _h \in \mathbb {S}^{n}$ such that for all x, $y \in \mathbb {R}^n$,

$$\begin{aligned} h(y) \ge h(x)+\left\langle u, y-x \right\rangle + \frac{1}{2}\Vert y-x\Vert ^2_{\Sigma _h},~ \forall u \in \partial h(x), \end{aligned}$$

(1.6)

or equivalently,

$$\begin{aligned} \langle x-y, u-v \rangle \ge \Vert x-y\Vert ^2_{\Sigma _h},~ \forall u \in \partial h(x), v \in \partial h(y). \end{aligned}$$
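As a concrete illustration of Definition 1.1 (an example chosen here for exposition, with hypothetical parameters $\mu >0$ and $\rho \ge 0$), consider the elastic-net-type function

$$\begin{aligned} h(x)=\underbrace{\frac{\mu }{2}\Vert x\Vert ^2}_{q(x)}+\underbrace{\rho \Vert x\Vert _1}_{{\tilde{h}}(x)}. \end{aligned}$$

Every $u \in \partial h(x)$ can be written as $u=\mu x+\rho w$ with $w \in \partial \Vert x\Vert _1$, so that

$$\begin{aligned} h(y)-h(x)-\langle u, y-x \rangle&=\frac{\mu }{2}\Vert y-x\Vert ^2+\rho \left( \Vert y\Vert _1-\Vert x\Vert _1-\langle w, y-x \rangle \right) \\&\ge \frac{\mu }{2}\Vert y-x\Vert ^2, \end{aligned}$$

where the inequality uses the convexity of $\Vert \cdot \Vert _1$. Hence (1.6) holds for this example with $\Sigma _h=\mu I$.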

2 A new proximal subgradient algorithm

In the first subsection, we briefly review two existing algorithms SDA and PSB. Then in the second subsection, we describe the indefinite proximal subgradient-based (IPSB) algorithm.

2.1 SDA and PSB

We start from the classical subgradient algorithm [3] for solving problem (1.1),

$$\begin{aligned} x_{k+1}=P_Q(x_k-\lambda _k d_k),\ k \in \mathbb {N}, \end{aligned}$$

(2.1)

where $P_Q$ denotes the projection onto Q, $d_k$ is either a subgradient $D_k \in \partial F(x_k)$ or the normalized subgradient ${ D_k}/{\Vert D_k\Vert } $, and the sequence of the stepsizes $\{\lambda _k\}_{k=0}^{\infty }$ satisfies the divergent-series rule:

$$\begin{aligned} \lambda _k > 0,\ \lambda _k \rightarrow 0,\ \sum _{k=0}^{\infty }\lambda _k = \infty . \end{aligned}$$
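The iteration (2.1) with a divergent-series stepsize can be sketched in a few lines of Python. The sketch below is illustrative only: the stepsize $\lambda _k = 1/\sqrt{k+1}$, the toy objective $F(x)=\Vert x-c\Vert _1$, the box $Q=[0,1]^2$ and all function names are assumptions made here, not choices taken from the text.

```python
import numpy as np

def projected_subgradient(subgrad, project, x0, num_iters=2000):
    """Run x_{k+1} = P_Q(x_k - lambda_k d_k) with normalized subgradients.

    Assumed stepsize: lambda_k = 1/sqrt(k+1), which is positive, tends to
    zero, and has a divergent sum (the divergent-series rule).
    """
    x = np.asarray(x0, dtype=float)
    for k in range(num_iters):
        D = subgrad(x)                    # D_k in the subdifferential of F at x_k
        norm = np.linalg.norm(D)
        if norm == 0.0:                   # zero subgradient: x_k is optimal
            break
        d = D / norm                      # normalized direction d_k
        lam = 1.0 / np.sqrt(k + 1.0)
        x = project(x - lam * d)          # projection onto Q
    return x

# Hypothetical toy instance: F(x) = ||x - c||_1 over the box Q = [0, 1]^2.
c = np.array([0.3, 2.0])
x_final = projected_subgradient(
    subgrad=lambda x: np.sign(x - c),
    project=lambda x: np.clip(x, 0.0, 1.0),
    x0=np.zeros(2),
)
```

On this toy instance the constrained minimizer is $(0.3, 1)$; the first coordinate keeps oscillating around 0.3 with an amplitude that shrinks only as fast as the stepsize, which is the slow-down that motivates dual averaging.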

In order to avoid taking decreasing stepsizes (i.e., $\lambda _k \rightarrow 0$) as in the classical subgradient algorithm, Nesterov [20] proposed the SDA algorithm,

$$\begin{aligned} x_{k+1}=\mathop {\mathrm{argmin}\,} _ { x \in Q } \left\{ \sum _{i=0}^{k} \langle \lambda _i D_i , x-x_0 \rangle + \frac{\beta _{k+1}}{2}\Vert x - x_0\Vert ^2 \right\} ,~ D_k \in \partial F(x_k),\ \forall k \ge 0, \end{aligned}$$

(2.2)

where $\{\beta _{ k+1}\}_{k=0}^{\infty }$ is a positive nondecreasing sequence and $x_0$ denotes the initial point. There are two simple strategies for choosing $\{\lambda _{i}\}_{i=0}^{\infty }$: either $\lambda _i \equiv 1$ or $\lambda _i=1/\Vert d_i\Vert $. SDA can solve general nonsmooth convex optimization problems, and it has been proved to be optimal from the viewpoint of worst-case black-box lower complexity bounds [20]. By considering problems with additive structure as in (1.1), Xiao [26] proposed the RDA scheme. A concrete algorithm under the RDA scheme is PSB, which reads as follows

$$\begin{aligned} x_{ k+1 }&= \mathop {\mathrm{argmin}\,} _ { x \in Q } \left\{ \sum _{i=0}^{k}\left( \langle g _ { i } , x-x_0 \rangle + h ( x ) \right) + \frac{\beta _{k+1}}{2}\Vert x-x_0\Vert ^2 \right\} \nonumber \\&= \mathop {\mathrm{argmin}\,} _ { x \in Q } \left\{ \sum _{i=0}^{k} \langle g _ { i } , x-x_0 \rangle + (k+1) h ( x ) + \frac{\beta _{k+1}}{2}\Vert x-x_0\Vert ^2 \right\} , \end{aligned}$$

(2.3)

where $g_i \in \partial f(x_i)$, $\forall i\ge 0$, the stepsize $\lambda _i\equiv 1$, $\{\beta _{k+1}\}_{k=0}^{\infty }$ is nondecreasing, and $x_0 \in \mathrm{argmin}\,_{x \in Q} h(x)$. The above iteration (2.3) reduces to (2.2) when $h \equiv 0$.
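When $Q=\mathbb {R}^n$ and $h \equiv 0$, the subproblem in (2.2) has the closed-form solution $x_{k+1}=x_0-s_{k+1}/\beta _{k+1}$ with $s_{k+1}=\sum _{i=0}^{k}\lambda _i D_i$. The following Python sketch implements this special case with $\lambda _i \equiv 1$. The choice $\beta _{k+1}=\gamma \sqrt{k+1}$, the returned running average of the iterates, and the toy objective are assumptions made for illustration.

```python
import numpy as np

def sda(subgrad, x0, num_iters=5000, gamma=1.0):
    """Simple dual averaging (2.2) with Q = R^n and lambda_i = 1.

    Assumed (not from the text): beta_{k+1} = gamma * sqrt(k + 1), and the
    returned point is the average of the iterates x_0, ..., x_{K-1}.
    """
    x0 = np.asarray(x0, dtype=float)
    x = x0.copy()
    s = np.zeros_like(x0)                 # running sum of subgradients
    x_sum = np.zeros_like(x0)
    for k in range(num_iters):
        x_sum += x
        s += subgrad(x)                   # s_{k+1} = s_k + D_k
        beta = gamma * np.sqrt(k + 1.0)   # beta_{k+1}, nondecreasing
        x = x0 - s / beta                 # closed-form minimizer over R^n
    return x_sum / num_iters

# Hypothetical toy instance: F(x) = ||x - c||_1 on R^2.
c = np.array([0.3, 2.0])
F = lambda x: np.abs(x - c).sum()
x_avg = sda(lambda x: np.sign(x - c), x0=np.zeros(2))
```

Note that no stepsize tends to zero here; the damping comes entirely from the growing weight $\beta _{k+1}$ on the proximal term.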

2.2 Algorithm IPSB

Motivated by indefinite approaches, we extend RDA to the following dynamic regularized dual averaging (DRDA) scheme

$$\begin{aligned} x_{ k+1} = \mathop {\arg \min }_{ x \in Q } \left\{ \sum _{i=0}^{k}\left\langle g_{i} , x-x_0 \right\rangle + (k+1)h(x)+ {\beta _{k+1}}r_k(x) \right\} . \end{aligned}$$

(2.4)

We only assume that the sum $(k+1)h(x)+{\beta _{k+1}}r_k(x)$ is strongly convex. A simple choice of $r_k(x)$ is

$$\begin{aligned} r_k(x)=\frac{1}{2}\Vert x-x_0\Vert ^2_{G_{k+1}}, \end{aligned}$$

where $G_{k+1}=I-(k+1)\Sigma _h/\beta _{k+1}$. Algorithm 1 describes the algorithm in detail.

In Algorithm 1, the choice of the indefinite matrix $G_k$ in step 3 guarantees the strong convexity of the subproblem minimized in step 4. In some specially structured problems, the introduction of $G_k$ can make the subproblem in step 4 much easier to solve.
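The listing of Algorithm 1 is not reproduced in this text, so the following Python code is only a sketch inferred from (2.4), the choice $r_k(x)=\frac{1}{2}\Vert x-x_0\Vert ^2_{G_{k+1}}$ and the recursion (2.5). It assumes $Q=\mathbb {R}^n$, the hypothetical regularizer $h(x)=\frac{\mu }{2}\Vert x\Vert ^2+\rho \Vert x\Vert _1$ (so $\Sigma _h=\mu I$ and $x_0=0$), and the test function $f(x)=\Vert Ax-b\Vert _1$; none of these choices, nor the function names, come from the source. In this setting the quadratic part of $(k+1)h$ cancels against the indefinite term, and the subproblem reduces to a soft-thresholding step.

```python
import numpy as np

def soft_threshold(v, tau):
    """Componentwise soft-thresholding, the prox of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def subproblem_objective(x, s, k, beta, mu, rho):
    """Objective of (2.4) for h(x) = mu/2 ||x||^2 + rho ||x||_1 and x_0 = 0."""
    n = x.size
    G = np.eye(n) - (k + 1) * mu / beta * np.eye(n)   # G_{k+1}, possibly indefinite
    h = 0.5 * mu * (x @ x) + rho * np.abs(x).sum()
    return s @ x + (k + 1) * h + 0.5 * beta * (x @ G @ x)

def ipsb_step(s, k, beta, rho):
    """Closed-form minimizer: the mu-terms cancel, leaving an l1 prox."""
    return soft_threshold(-s, (k + 1) * rho) / beta

def ipsb(A, b, rho, num_iters=200, gamma=1.0, lam_hat=1.0):
    """Dual-averaging loop; returns the averaged and the last iterate."""
    n = A.shape[1]
    x = np.zeros(n)                       # x_0 minimizes h
    s = np.zeros(n)
    beta_tilde = lam_hat                  # beta_tilde_1 in (2.5)
    x_sum = np.zeros(n)
    for k in range(num_iters):
        x_sum += x
        s += A.T @ np.sign(A @ x - b)     # g_k, a subgradient of f at x_k
        x = ipsb_step(s, k, gamma * beta_tilde, rho)
        beta_tilde += 1.0 / beta_tilde    # recursion (2.5)
    return x_sum / num_iters, x

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
x_avg, x_last = ipsb(A, b, rho=0.1)
```

The parameter $\mu $ does not appear in `ipsb_step` at all, which is one way to see why the analysis in Section 3 yields convergence to a region of the optimal value rather than to the optimal value itself.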

Remark 2.1

Note that as the iteration progresses, the influence of the initial point $x_0$ should vanish. In other words, the auxiliary quadratic term should be as small as possible. Comparing the auxiliary functions in the k-th step of the algorithms PSB and IPSB, we obtain

$$\begin{aligned} \frac{1}{2}\Vert x-x_0\Vert ^2_{G_{k+1}}&=\frac{1}{2}\Vert x-x_0\Vert ^2_{I-(k+1) \Sigma _h/{\beta _{ k+1 }}}\\&=\frac{1}{2}\Vert x-x_0\Vert ^2-\frac{k+1}{2}\Vert x-x_0\Vert ^2_{ \Sigma _h/{\beta _{ k+1 }}}\\&\le \frac{1}{2}\Vert x-x_0\Vert ^2, \end{aligned}$$

which indicates that the indefinite term can reduce the impact of $x_0$ on the k-th subproblem as k increases.

Remark 2.2

The following choice of the sequence $\{{\tilde{\beta }}_{k+1}\}_{k=0}^{\infty }$ initialized in Algorithm 1 is due to Nesterov [20]:

$$\begin{aligned} {\tilde{\beta }}_1 = {\hat{\lambda }},\ {\tilde{\beta }}_{k+1} ={\tilde{\beta }}_k+\frac{1}{{\tilde{\beta }}_k},\ k \in \mathbb {N}, \end{aligned}$$

(2.5)

where $ {\hat{\lambda }}>0$ is an initial parameter.

For the sequence $\{{\tilde{\beta }}_{k+1}\}_{k=0}^{\infty }$, we have the following estimate, which corrects the one in [20, Lemma 3].

Lemma 2.1

Based on (2.5), we have

$$\begin{aligned} \sqrt{{\hat{\lambda }}^2+2k-2} \le {{\tilde{\beta }}}_k \le {\hat{\lambda }} + \frac{1}{{\hat{\lambda }}}+ \sqrt{{\hat{\lambda }}^2+2k-4}, ~ \forall k \ge 1. \end{aligned}$$

(2.6)

Proof

According to (2.5), we can obtain ${\tilde{\beta }}_1 = {\hat{\lambda }}$ and ${\tilde{\beta }}_{k}^2={\tilde{\beta }}_{k-1}^2+{{\tilde{\beta }}_{k-1}^{-2}}+2$. Consequently,

$$\begin{aligned} {\tilde{\beta }}_{k}^2 \ge {\tilde{\beta }}_{k-1}^2+2 \ge {\tilde{\beta }}_{1}^2+2(k-1) = {\hat{\lambda }}^2+2(k-1),\ \forall k \ge 2, \end{aligned}$$

which implies the left-hand inequality of (2.6). For the other direction, we can derive that

$$\begin{aligned} {\tilde{\beta }}_{ k }&= {\tilde{\beta }}_{ k-1 } +\frac{1}{{\tilde{\beta }}_{ k-1 }} \le {\tilde{\beta }}_{ k-1 } +\frac{1}{\sqrt{{\hat{\lambda }}^2+2(k-2)}} \le {\tilde{\beta }}_{1} +\sum _{t=0}^{k-2}\frac{1}{\sqrt{{\hat{\lambda }}^2+2t}}\nonumber \\&={\hat{\lambda }}+\sum _{t=0}^{k-2}\frac{1}{\sqrt{{\hat{\lambda }}^2+2t}}. \end{aligned}$$

(2.7)

From

$$\begin{aligned} \frac{1}{\sqrt{{\hat{\lambda }}^2+2t}} \le \frac{2}{\sqrt{{\hat{\lambda }}^2+2t}+\sqrt{{\hat{\lambda }}^2+2(t-1)}} = \sqrt{{\hat{\lambda }}^2+2t}-\sqrt{{\hat{\lambda }}^2+2(t-1)}, \end{aligned}$$

we have

$$\begin{aligned} \sum _{t=0}^{k-2} \frac{1}{\sqrt{{\hat{\lambda }}^2+2t}} = \frac{1}{{\hat{\lambda }}}+ \sum _{t=1}^{k-2} \frac{1}{\sqrt{{\hat{\lambda }}^2+2t}} \le \frac{1}{{\hat{\lambda }}}+ \sqrt{{\hat{\lambda }}^2+2(k-2)}-{\hat{\lambda }}. \end{aligned}$$

(2.8)

Finally, the right-hand side of the estimation (2.6) follows from substituting (2.8) into (2.7). $\square $
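The two-sided estimate (2.6) is easy to check numerically. The Python sketch below generates the sequence from (2.5) and tests both bounds. The check starts at $k=2$, because for $k=1$ the radicand ${\hat{\lambda }}^2+2k-4$ is negative whenever ${\hat{\lambda }}^2<2$; the function names and the small relative tolerance are choices made here.

```python
import math

def beta_tilde_sequence(lam_hat, num_terms):
    """Return [beta_tilde_1, ..., beta_tilde_num_terms] from recursion (2.5)."""
    seq = [lam_hat]
    while len(seq) < num_terms:
        seq.append(seq[-1] + 1.0 / seq[-1])
    return seq

def check_bounds(lam_hat, num_terms=1000, rel_tol=1e-12):
    """Check both inequalities of (2.6) for k = 2, ..., num_terms."""
    seq = beta_tilde_sequence(lam_hat, num_terms)
    for k in range(2, num_terms + 1):
        b = seq[k - 1]                                   # beta_tilde_k
        lower = math.sqrt(lam_hat**2 + 2 * k - 2)
        upper = lam_hat + 1.0 / lam_hat + math.sqrt(lam_hat**2 + 2 * k - 4)
        if lower > b * (1 + rel_tol) or b > upper * (1 + rel_tol):
            return False
    return True

results = {lam: check_bounds(lam) for lam in (0.5, 1.0, 3.0)}
```

Both bounds grow like $\sqrt{2k}$, which is the source of the $O(1/\sqrt{k})$ rate obtained in Section 3.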

3 Convergence analysis

Following Nesterov’s analysis [20], we now establish the convergence of the algorithm IPSB. We first define two auxiliary functions:

$$\begin{aligned}&U _ { k } ( s ) := \max _{x \in {\mathcal {F}}_D } \{ \left\langle s , x - x _ { 0 } \right\rangle - kh ( x ) \}, \end{aligned}$$

(3.1)

$$\begin{aligned}&V _ { k } ( s ) := \max _{x \in Q} \left\{ \left\langle s , x - x _ { 0 } \right\rangle - kh ( x )- \frac{\beta _{k}}{2} \Vert x-x_0\Vert ^2_{G_k} \right\} , \end{aligned}$$

(3.2)

where ${\mathcal {F}}_D=\{x \in Q: \frac{1}{2}\Vert x-x_0\Vert ^2 \le D\}$, $D > 0$ and $ G_k=I-{k\Sigma _h}/{\beta _{ k }},\ \forall k \ge 1. $ Let $x_0 \in \mathop {\arg \min }_{x \in Q} h(x)$. Since $s_0=0$, we have

$$\begin{aligned} V _ { 1 } ( -s_{0} ) = \max _{x \in Q} \left\{ -h(x) -\frac{\beta _1}{2}\Vert x-x_0\Vert ^2_{G_1} \right\} =\max _{x \in Q} \left\{ -h(x) -\frac{1}{2}\Vert x-x_0\Vert ^2_{\beta _1I-\Sigma _h} \right\} , \end{aligned}$$

(3.3)

Notice that (3.3) is a strongly concave maximization problem and hence has a unique optimal solution in the closed convex set Q. According to Danskin’s theorem [2, Proposition B.25], both $V _ { 1 } ( -s_{0} ) $ and $\nabla V_1(-s_0)$ are well defined. Let

$$\begin{aligned} T:=V_1(-s_0) + \langle -g_0, \nabla V_1(-s_0) \rangle + h(x_1). \end{aligned}$$

(3.4)

In the following, the first lemma studies the relation between $U _ { k } ( s )$ and $V _ {k}(s)$, and the second lemma studies the smoothness of function $V _ {k}(s)$.

Lemma 3.1

For any $s \in \mathbb {R}^n$ and $k \in \mathbb {N}$, we have

$$\begin{aligned} U _ { k } ( s ) \le \beta _k D + V _ { k } ( s ) \end{aligned}$$

(3.5)

Proof

According to the definitions (3.1), (3.2) and ${\mathcal {F}}_D=\{x \in Q: \frac{1}{2}\Vert x-x_0\Vert ^2 \le D\}$, we have

$$\begin{aligned} U _ { k } ( s )&= \max _{x \in {\mathcal {F}}_D} \{ \left\langle s , x - x _ { 0 } \right\rangle - kh ( x ) \}\\&\le \min _{\beta \ge 0} \max _{x \in Q} \left\{ \left\langle s , x - x _ { 0 } \right\rangle - kh ( x ) - \beta ( \frac{1}{2} \Vert x-x_0\Vert ^2 - D) \right\} \\&\le \max _{x \in Q} \left\{ \left\langle s , x - x _ { 0 } \right\rangle - kh ( x ) - \beta _k ( \frac{1}{2} \Vert x-x_0\Vert ^2 - D) \right\} \\&\le \max _{x \in Q} \left\{ \left\langle s , x - x _ { 0 } \right\rangle - kh ( x ) - \frac{\beta _k}{2} \Vert x-x_0\Vert ^2 + \beta _k D +\frac{k}{2}\Vert x-x_0\Vert ^2_{\Sigma _h} \right\} \\&= \beta _k D + V _ { k } ( s ), \end{aligned}$$

where the first inequality corresponds to the partial Lagrangian relaxation, and the last inequality holds as ${\Sigma _h} \succeq 0$. $\square $

Lemma 3.2

The function $V _ {k}(s)$ is well defined, convex and continuously differentiable. Moreover,

$$\begin{aligned} \nabla V_{k} (s) = x_k(s) - x_0,\ \forall k \ge 1, \end{aligned}$$

(3.6)

where $x_k(s)$ is the maximizer of the problem (3.2) that defines $V_{k}(s)$. In addition, $\nabla V_{k} (s)$ is ${1}/{\beta _k}$-Lipschitz continuous, i.e.,

$$\begin{aligned} \Vert \nabla V_{k} (s)-\nabla V_{k} (t)\Vert \le \frac{1}{\beta _k}\Vert s-t\Vert ,~ \forall s, t \in \mathbb {R}^n. \end{aligned}$$

Proof

Since the objective function of problem (3.2) is $\beta _k$-strongly concave with respect to x, the maximizer $x_k(s)$ in (3.2) is unique. Then (3.6) follows from Danskin’s theorem [2, Proposition B.25].

For any $l(x)\in \partial h(x)$, $ s_1, s_2 \in \mathbb {R}^n $, according to the first-order optimality conditions, we have

$$\begin{aligned}&\langle -s_1+kl(x_k(s_1))+ \beta _k G_k(x_k(s_1) - x_0), x_k(s_2) - x_k(s_1) \rangle \ge 0, \\&\langle -s_2+kl(x_k(s_2))+ \beta _k G_k(x_k(s_2) - x_0), x_k(s_1) - x_k(s_2) \rangle \ge 0. \end{aligned}$$

Adding these two inequalities together, we can get

$$\begin{aligned} \langle s_2 - s_1, x_k(s_1) - x_k(s_2) \rangle \le&\ k\langle l(x_k(s_2)) - l(x_k(s_1)), x_k(s_1) - x_k(s_2) \rangle \\&+ \langle \beta _k G_k(x_k(s_2) - x_k(s_1)), x_k(s_1) - x_k(s_2) \rangle \\ \le&\ -k\Vert {x_k(s_1) - x_k(s_2)} \Vert ^2_{\Sigma _h} - \beta _k \Vert {x_k(s_1) - x_k(s_2)} \Vert ^2_{G_k}\\ \le&\ - \beta _k \Vert {x_k(s_1) - x_k(s_2)} \Vert ^{2}, \ \forall k \ge 1, \end{aligned}$$

where the last inequality follows from $G_k=I-\frac{k}{\beta _{ k }}\Sigma _h$. Thus, we have

$$\begin{aligned} {\Vert {x_k(s_1) - x_k(s_2)} \Vert ^2}\le & {} -\frac{1}{\beta _k} \langle s_2 - s_1, x_k(s_1) - x_k(s_2) \rangle \\\le & {} \frac{1}{\beta _k} \Vert s_2 - s_1\Vert \Vert x_k(s_1) - x_k(s_2) \Vert , \forall k \ge 1, \end{aligned}$$

which is equivalent to

$$\begin{aligned} \Vert \nabla V_{k} (s_1)-\nabla V_{k} (s_2) \Vert \le \frac{1}{\beta _k} \Vert s_2 - s_1\Vert , \ \forall k \ge 1. \end{aligned}$$

$\square $
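Lemma 3.2 can be illustrated on a case where $x_k(s)$ is available in closed form. Assume, as an example not taken from the text, that $Q=\mathbb {R}^n$, $x_0=0$ and $h(x)=\frac{\mu }{2}\Vert x\Vert ^2+\rho \Vert x\Vert _1$ with $\Sigma _h=\mu I$. Then the objective in (3.2) reduces to $\langle s,x\rangle -k\rho \Vert x\Vert _1-\frac{\beta _k}{2}\Vert x\Vert ^2$, whose maximizer is a scaled soft-thresholding of $s$. The Python sketch below samples random pairs and records the largest observed ratio $\Vert \nabla V_k(s)-\nabla V_k(t)\Vert /\Vert s-t\Vert $.

```python
import numpy as np

def soft_threshold(v, tau):
    """Componentwise soft-thresholding, the prox of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def grad_V(s, k, beta, rho):
    """nabla V_k(s) = x_k(s) - x_0 for the assumed h, with x_0 = 0."""
    return soft_threshold(s, k * rho) / beta

rng = np.random.default_rng(1)
k, beta, rho = 5, 3.0, 0.2            # hypothetical values of k, beta_k, rho
worst_ratio = 0.0
for _ in range(1000):
    s = 3.0 * rng.standard_normal(4)
    t = 3.0 * rng.standard_normal(4)
    num = np.linalg.norm(grad_V(s, k, beta, rho) - grad_V(t, k, beta, rho))
    den = np.linalg.norm(s - t)
    worst_ratio = max(worst_ratio, num / den)
```

Because soft-thresholding is nonexpansive, the observed ratio never exceeds $1/\beta _k$, in agreement with the lemma.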

Let $F^{*}_{D}=\min _{x \in {\mathcal {F}}_D} F(x)$. According to the convexity of the objective function, we have

$$\begin{aligned} F({\hat{x}}_{k+1})-F^{*}_{D}&\le \frac{1}{k+1} \sum _{i=0}^{k} [{f(x_i)+h(x_i)}]- \min _{x \in {\mathcal {F}}_D} [f(x)+h(x)]\nonumber \\&=\frac{1}{k+1} \max _{x \in {\mathcal {F}}_D}\sum _{i=0}^{k} [f(x_i)-f(x)+h(x_i)-h(x)]\nonumber \\&\le \frac{1}{k+1} \max _{x \in {\mathcal {F}}_D} \sum _{i=0}^{k} [\langle g_i, x_i-x \rangle +h(x_i)-h(x)]. \end{aligned}$$

(3.7)

Consequently, we define the gap function as

$$\begin{aligned} \delta _{k+1} := \max _{x \in {\mathcal {F}}_D} \sum _{i=0}^{k} [\langle g_i, x_i-x \rangle +h(x_i)-h(x)]. \end{aligned}$$

It follows from the inequality (3.5) that

$$\begin{aligned} \delta _{k+1}&= \sum _{i=0}^{k} [\langle g_i, x_i-x_0 \rangle + h(x_i)]+ U _ { k+1 } ( -{s}_{k+1} ) \end{aligned}$$

(3.8)

$$\begin{aligned}&\le \sum _{i=0}^{k} [\langle g_i, x_i-x_0 \rangle + h(x_i)]+ \beta _{k+1} D + V_{k+1} ( -{s}_{k+1})\nonumber \\&:=\Delta _{k+1}. \end{aligned}$$

(3.9)

Remark 3.1

For any fixed k, there exists a constant P that satisfies $\max \nolimits _{i\in [k]} \frac{1}{2}\Vert x_i-x_0\Vert ^2\le P$. Thus we have

$$\begin{aligned} \frac{1}{2}\sum _{i=1}^{k}\Vert x_{i+1}-x_0\Vert ^2_{\Sigma _h} \le \lambda _{max}kP, \end{aligned}$$

(3.10)

where $\lambda _{max}$ is the maximum eigenvalue of $\Sigma _h$.

Now we present the upper bounds as follows.

Theorem 3.1

Let the sequences ${\{x_i\}}^{k}_{i=0} \subset Q$ and ${\{g_i\}}^{k}_{i=0} \subset \mathbb {R}^n$ be generated by Algorithm 1. Let the sequence $\{{\beta }_i\}_{i=0}^{k}$ satisfy $\beta _{ k }=\gamma {\tilde{\beta }}_k$, where $\{{\tilde{\beta }}_i\}_{i=1}^{k}$ is defined in (2.5), ${\tilde{\beta }}_{ 0 }={\tilde{\beta }}_{1}$ and $\gamma >0$. Then

1.
For any $k\in \mathbb {N}$, we have
$$\begin{aligned} \delta _k \le \Delta _k \le \beta _{k}D + T + \frac{1}{2}\sum _{i=0}^{k-1}\frac{1}{\beta _i} \Vert g_i\Vert ^2 + \lambda _{max}(k-1)P. \end{aligned}$$
(3.11)
2.
Assume that

(1)
the sequence $\{g_k\}_{k \ge 0}$ is bounded, which means that
$$\begin{aligned} \exists L >0 \ \text {such that}\ \Vert g_k\Vert \le L,\ \forall k \ge 0, \end{aligned}$$
(3.12)
(2)
there exists a solution $x^*$ satisfying
$$\begin{aligned} \langle g,x-x^* \rangle \ge 0,~ g \in \partial f(x),\ \forall x \in Q. \end{aligned}$$
(3.13)

Then it holds that

$$\begin{aligned} \Vert x_{k}-x^*\Vert ^2 \le \frac{2T+2\lambda _{max}(k-1)P}{\beta _k} +\Vert x^*-x_0\Vert ^2_{G_k}+L^2. \end{aligned}$$

(3.14)

3.
Let $x^*$ be an interior solution, i.e., there exist $r, D >0$ satisfying $B_r(x^*) \subseteq {\mathcal {F}}_D$. Assume there is a $\Gamma _h >0$ such that
$$\begin{aligned} \max _{\begin{array}{c} z \in \partial h(y) \\ y \in B_r(x^*) \end{array}} \Vert z\Vert \le \Gamma _h. \end{aligned}$$

Then we have

$$\begin{aligned} \Vert \bar{{s}}_{k+1}\Vert \le \frac{1}{r(k+1)} \left[ \beta _{k+1}D + T + \frac{1}{2} \sum _{i=0}^{k} \frac{1}{\beta _{i}} \Vert g_i\Vert ^2+\lambda _{max}kP \right] +\Gamma _h, \end{aligned}$$

(3.15)

where $ \bar{{s}}_{k+1}=\frac{1}{k+1}\sum _{i=0}^{k}g_i $.

Proof

1.
According to the definitions of $V_k(s)$ and $G_k$, for any integer $k\ge 1$, we have
$$\begin{aligned} V_{k-1}(-s_k)&=\max _{x \in Q}\left\{ \langle -s_k, x-x_0 \rangle -(k-1)h(x)-\frac{\beta _{k-1}}{2}\Vert x-x_0\Vert ^2_{G_{k-1}}\right\} \\&\ge {\langle -s_k, x_k-x_0 \rangle -(k-1)h(x_k)-\frac{\beta _{k-1}}{2}\Vert x_k-x_0\Vert ^2_{G_{k-1}}}\\&= V_{k}(-s_k)+h(x_k)+\frac{\beta _k }{2}\Vert x_k-x_0\Vert ^2_{G_k}-\frac{\beta _{k-1} }{2}\Vert x_k-x_0\Vert ^2_{G_{k-1}}\\&\ge V_k(-s_k)+h(x_k)+\frac{\beta _{ k }-\beta _{k-1} }{2}\Vert x_k-x_0\Vert ^2-\frac{1}{2}\Vert x_k-x_0\Vert ^2_{\Sigma _h}\\&\ge V_k(-s_k)+h(x_k)-\frac{1}{2}\Vert x_k-x_0\Vert ^2_{\Sigma _h}. \end{aligned}$$
According to Lemma 3.2, we have
$$\begin{aligned} V_k(s+\sigma ) \le V_k(s) + \langle \sigma , \nabla V_k(s) \rangle +\frac{1}{2\beta _k}\Vert \sigma \Vert ^2,~\forall s, \sigma \in \mathbb {R}^n. \end{aligned}$$
(3.16)
Substituting $s_k$ into (3.16) yields that
$$\begin{aligned}&V_k(-s_k)+h(x_k) - \frac{1}{2}\Vert x_k-x_0\Vert ^2_{\Sigma _h}\\&\quad \le V_{k-1}(-s_k) = V_{k-1} (-s_{k-1}-g_{k-1}) \\&\quad \le V_{k-1}(-s_{k-1}) + \langle -g_{k-1}, \nabla V_{k-1}(-s_{k-1}) \rangle +\frac{1}{2\beta _{k-1}}\Vert g_{k-1}\Vert ^2\\&\quad = V_{k-1}(-s_{k-1}) + \langle -g_{k-1}, x_{k-1}-x_0 \rangle +\frac{1}{2\beta _{k-1}}\Vert g_{k-1}\Vert ^2,~ \forall k \ge 1, \end{aligned}$$
which further implies that
$$\begin{aligned}&\langle g_{k- 1}, x_{k-1}-x_0 \rangle + h(x_k)\\&\quad \le V_{k-1}(-s_{k-1}) - V_k(-s_k)+\frac{1}{2\beta _{k-1}}\Vert g_{k-1}\Vert ^2 + \frac{1}{2}\Vert x_k-x_0\Vert ^2_{\Sigma _h}, ~ \forall k \ge 1. \end{aligned}$$
By summing the above inequality from 1 to k, we obtain
$$\begin{aligned}&\sum \limits _{i=1}^{k} [\langle g_i, x_i-x_0 \rangle + h(x_{i+1})]\\&\quad \le V_{1} (-s_1)- V_{k+1} (-s_{k+1})+ \frac{1}{2}\sum \limits _{i=1}^{k} \left[ \frac{1}{\beta _i} \Vert g_i\Vert ^2+\Vert x_{i+1}-x_0\Vert ^2_{\Sigma _h}\right] , \end{aligned}$$
which is equivalent to
$$\begin{aligned}&\sum _{i=0}^{k} [\langle g_i, x_i-x_0 \rangle + h(x_{i})] + V_{k+1} (-s_{k+1}) \le V_{1} (-s_1) + h(x_0)+ h(x_1) - h(x_{k+1})\nonumber \\&\quad +\frac{1}{2}\sum _{i=1}^{k} \left[ \frac{1}{\beta _i} \Vert g_i\Vert ^2+\Vert x_{i+1}-x_0\Vert ^2_{\Sigma _h}\right] . \end{aligned}$$
(3.17)
By combining with (3.16) and $s_1=s_0+g_0$, we have
$$\begin{aligned} V_1(-s_1)&=V_1(-s_0-g_0) \le V_1(-s_0) +\langle -g_0, \nabla V_1(-s_0)\rangle + \frac{1}{2\beta _{1}}\Vert g_0\Vert ^2\\&= T - h(x_1)+ \frac{1}{2\beta _{0}} \Vert g_0\Vert ^2, \end{aligned}$$
where the equality follows from (3.4) and $\beta _{ 0 }=\beta _{1}$. By noting that $x_0 = \arg \min _{ x \in Q } h(x)$, we can obtain that
$$\begin{aligned} h(x_0) \le h(x_{k+1}). \end{aligned}$$
Finally, combining (3.9), (3.17) and the above inequalities, we conclude that
$$\begin{aligned} \Delta _{k+1} \le \beta _{k+1}D + T + \frac{1}{2}\sum _{i=0}^{k}\frac{1}{\beta _i} \Vert g_i\Vert ^2 + \frac{1}{2}\sum _{i=1}^{k}\Vert x_{i+1}- x_0\Vert ^2_{\Sigma _h}. \end{aligned}$$
2.
Notice that $x_k = \mathop {\arg \min }_{ x \in Q } \left\langle s_{k} , x - x _ { 0 } \right\rangle +kh ( x )+ \frac{\beta _{k}}{2} \Vert x-x_0\Vert ^2_{G_k} $. By the convexity of the objective function, we have
$$\begin{aligned} \left\langle s_k + k l_k+\beta _{ k } G_k (x_k-x_0), x-x_k \right\rangle \ge 0,\ \forall x \in Q. \end{aligned}$$
(3.18)
Notice that $G_k=I-{k\Sigma _h}/{\beta _{ k }}$. Then we can define the following $\beta _{ k }$-strongly convex function
$$\begin{aligned} \phi _k(x):=kh(x)+\frac{\beta _{ k }}{2}\Vert x-x_0\Vert ^2_{G_k},\ k \in \mathbb {N}, \end{aligned}$$
which implies that
$$\begin{aligned} \phi _k(x) \ge \phi _k(x_k) + \left\langle kl_k+\beta _{ k } G_k (x_k-x_0), x-x_k \right\rangle +\frac{\beta _{ k }}{2}\Vert x_k-x\Vert ^2. \end{aligned}$$
(3.19)
By taking $\phi _k(x_k)$ from the right-hand side of the inequality (3.19) to the left-hand side, we can get
$$\begin{aligned}&\left\langle kl_k+\beta _{ k } G_k (x_k-x_0), x-x_k \right\rangle +\frac{\beta _{k }}{2}\Vert x_k-x\Vert ^2\\&\le k[h(x)-h(x_k)] + \frac{\beta _{ k }}{2}\Vert x-x_0\Vert ^2_{G_k} -\frac{\beta _{ k }}{2}\Vert x_k-x_0\Vert ^2_{G_k}. \end{aligned}$$
Combining with (3.18) yields that
$$\begin{aligned} \frac{\beta _{ k }}{2}\Vert x_k-x\Vert ^2 \le&~ kh(x)-kh(x_k)+ \frac{\beta _{ k }}{2}\Vert x-x_0\Vert ^2 _{G_k} - \frac{\beta _{ k }}{2}\Vert x_k-x_0\Vert ^2_{G{_k}} \nonumber \\&+ \left\langle kl_k+\beta _{ k }G_k(x_k-x_0), x_k -x \right\rangle \nonumber \\ \le&~ kh(x)-kh(x_k)+\frac{\beta _{ k }}{2}\Vert x-x_0\Vert ^2_ {G_k}- \frac{\beta _{ k }}{2}\Vert x_k-x_0\Vert ^2_{G{_k}} - \left\langle {s_k}, x_k -x \right\rangle \nonumber \\ =&V_k(s_k) + kh(x) + \frac{\beta _{ k }}{2} \Vert x-x_0\Vert ^2_{G_k} + \left\langle {s_k}, x-x_0 \right\rangle \nonumber \\ =&~ V_k(s_k)+\sum _{i=0}^{k-1}\left\langle g_{i}, x_i-x_0 \right\rangle + \sum _{i=0}^{k-1}h(x_i) \nonumber \\&+ \frac{\beta _{ k }}{2} \Vert x-x_0\Vert ^2_{G_k} + \sum _{i=0}^{k-1}\left\langle g_{i}, x-x_i \right\rangle + kh(x) -\sum _{i=0}^{k-1}h(x_i). \end{aligned}$$
(3.20)
Furthermore, (3.17) can be written in the following form
$$\begin{aligned} V_{k+1} (-s_{k+1})+\sum _{i=0}^{k} [\langle g_i, x_i-x_0 \rangle + h(x_{i})] \le T + \frac{1}{2}\sum _{i=0}^{k}\frac{1}{\beta _i} \Vert g_i\Vert ^2 + \frac{1}{2}\sum _{i=1}^{k}\Vert x_{i+1}- x_0\Vert ^2_{\Sigma _h}. \end{aligned}$$
(3.21)
By substituting (3.20) into (3.21), we can get
$$\begin{aligned} \frac{\beta _{ k}}{2}\Vert x_k-x\Vert ^2 \le T&+\frac{1}{2}\sum \limits _{i=0}^{k-1} \frac{1}{\beta _i} \Vert g_i\Vert ^2 + \frac{1}{2}\sum \limits _{i=1}^{k-1}\Vert x_{i+1}-x_0\Vert ^2_{\Sigma _h}\nonumber \\&+ \frac{\beta _{ k }}{2} \Vert x-x_0\Vert ^2_{G_k} + \left\{ \sum \limits _{i=0}^{k-1} {[f(x)+h(x)]-[f(x_i)+h(x_i)]}\right\} . \end{aligned}$$
Finally, we set $x=x^*:=\mathop {\arg \min }_{x \in {\mathcal {F}}_D} f(x)+h(x)$. Then it holds that
$$\begin{aligned} \frac{\beta _{k }}{2}\Vert x_k-x^*\Vert ^2 \le T + \frac{1}{2}\sum _{i=0}^{k-1} \frac{1}{\beta _i} \Vert g_i\Vert ^2 +\frac{1}{2}\sum _{i=1}^{k-1}\Vert x_{i+1}-x_0\Vert ^2_{\Sigma _h} +\frac{\beta _{ k }}{2} \Vert x^*- x_0\Vert ^2_{G_k}. \end{aligned}$$
According to the conditions (2.5) and (3.12), we obtain the inequality (3.14).
3.
Based on (3.8), we can obtain
$$\begin{aligned}&\delta _{k+1} = \sum _{i=0}^{k} [\langle g_i, x_i-x^* \rangle + h(x_i)]+ \max _{x \in {\mathcal {F}}_D } \{ \left\langle {s}_{k+1} , x^* - x \right\rangle - (k+1)h ( x ) \}\\&\quad = \sum _{i=0}^{k} [\langle g_i, x_i-x^* \rangle + h(x_i)-h ( x^* )]+ \max _{x \in {\mathcal {F}}_D } \{ \left\langle {s}_{k+1} , x^* - x \right\rangle + (k+1)h ( x^* ) \\&\qquad - (k+1)h ( x ) \}\\&\quad \ge \sum _{i=0}^{k}\{f(x_i)+h(x_i)-f(x^*)-h ( x^* )\} + \max _{x \in {\mathcal {F}}_D } \{ \left\langle {s}_{k+1} , x^* - x \right\rangle + (k+1)h ( x^* ) \\&\qquad - (k+1)h ( x ) \}\\&\quad \ge \max _{x \in B_r(x^*) } \{ \left\langle {s}_{k+1} , x^* - x \right\rangle + (k+1)h ( x^* ) - (k+1)h ( x ) \}. \end{aligned}$$
Notice that
$$\begin{aligned} {\bar{x}} =\arg \max _{x \in B_r(x^*) } \left\langle {s}_{k+1} , x^* - x \right\rangle . \end{aligned}$$
Then we have $\Vert x^*-{\bar{x}}\Vert =r$ and
$$\begin{aligned} \left\langle {s}_{k+1} , x^* - {\bar{x}} \right\rangle = \Vert {s}_{k+1}\Vert \Vert x^* - {\bar{x}} \Vert = r\Vert {s}_{k+1}\Vert . \end{aligned}$$
Thus, it holds that
$$\begin{aligned} \delta _{k+1}&\ge \max _{x \in B_r(x^*) } \{ \left\langle {s}_{k+1} , x^* - x \right\rangle + (k+1)h ( x^* ) - (k+1)h ( x ) \}\\&\ge \left\langle {s}_{k+1} , x^* - {\bar{x}} \right\rangle + (k+1){h(x^*)-(k+1)h({\bar{x}})} \\&\ge r\Vert {s}_{k+1}\Vert + (k+1){ \left\langle l({\bar{x}}), x^*-{\bar{x}} \right\rangle } \\&\ge r\Vert {s}_{k+1}\Vert - (k+1){ \Vert l({\bar{x}})\Vert \Vert x^*-{\bar{x}}\Vert }\\&=r\Vert {s}_{k+1}\Vert - (k+1) r \Vert l({\bar{x}})\Vert , \end{aligned}$$
which implies
$$\begin{aligned} \frac{1}{k+1}\Vert {s}_{k+1}\Vert&\le \frac{1}{r(k+1)}\delta _{k+1}+ \Vert l({\bar{x}})\Vert \le \frac{1}{r(k+1)}\delta _{k+1}+ \Gamma _h. \end{aligned}$$
Then (3.15) follows from (3.11).$\square $

As our main result, we now give an upper bound on the complexity of IPSB.

Theorem 3.2

Assume there exists a constant $L>0$ such that $\Vert g_k\Vert \le L$, $\forall k \ge 0$. Denote by $\{x_i\}_{i=0}^{k}$ the sequence generated by Algorithm 1. Let $ F_D^*= \min _ { x \in {\mathcal {F}}_D } F(x)$. Then we have

$$\begin{aligned} F(\hat{x}_{k+1})-F_D^* -\lambda _{max}P \le \frac{\tilde{\beta }_{k+1}}{k+1}\left( \gamma D+ \frac{L^2}{2\gamma }\right) + \frac{T- \lambda _{max}P}{k+1}. \end{aligned}$$

(3.22)

Proof

By combining (2.5) with the inequalities (3.7) and (3.11), we have

$$\begin{aligned} F(\hat{x}_{k+1})-F_D^*&\le \frac{1}{k+1} \delta _{k+1}(D) \le \frac{1}{k+1} \left[ \beta _{k+1}D + T+ \frac{1}{2} \sum _{i=0}^{k} \frac{1}{\beta _{i}} \Vert g_i\Vert ^2 + \lambda _{max}kP\right] \\&\le \frac{\tilde{\beta }_{k+1}}{k+1}\left( \gamma D+ \frac{L^2}{2\gamma }\right) + \frac{T- \lambda _{max}P}{k+1}+\lambda _{max}P, \end{aligned}$$

which finishes the proof of the inequality (3.22). $\square $

Remark 3.2

According to Lemma 2.1, we know that the sequence $\{{\tilde{\beta }}_{k}\}_{k=0}^{\infty }$ can be used for balancing the terms appearing in the right-hand side of inequality (3.11). It follows from Theorem 3.2 that IPSB converges to the region of the optimal value with rate $O({1}/{\sqrt{k}} )$.
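To illustrate the rate stated in Remark 3.2, the right-hand side of (3.22) can be evaluated numerically. The following Python sketch is only illustrative. It assumes that $\tilde{\beta }_k$ follows Nesterov's recursion $\tilde{\beta }_0=\tilde{\beta }_1=1$, $\tilde{\beta }_{k+1}=\tilde{\beta }_k+1/\tilde{\beta }_k$ used for dual averaging in [20]; the exact definition is the one given in Lemma 2.1 and (2.5). The function names and the values chosen for $\gamma $, D, L, T and $\lambda _{max}P$ are hypothetical placeholders.

```python
import math


def beta_hat(n):
    """Return [b_0, ..., b_n] with b_0 = b_1 = 1 and b_{k+1} = b_k + 1/b_k.

    Assumption: this is Nesterov's sequence; the paper's own definition
    is the one in Lemma 2.1 / (2.5).
    """
    seq = [1.0, 1.0]
    while len(seq) <= n:
        seq.append(seq[-1] + 1.0 / seq[-1])
    return seq[: n + 1]


def rhs_bound(k, gamma, D, L, T, lam_max_P):
    """Right-hand side of (3.22) for a given iteration counter k."""
    b = beta_hat(k + 1)[k + 1]
    return b / (k + 1) * (gamma * D + L ** 2 / (2.0 * gamma)) + (T - lam_max_P) / (k + 1)


# Hypothetical constants, chosen only to show the trend in k.
for k in (10, 100, 1000, 10000):
    print(k, rhs_bound(k, gamma=20.0, D=1.0, L=1.0, T=0.0, lam_max_P=0.0))
```

Under this assumption $\tilde{\beta }_{k+1}$ grows like $\sqrt{2k}$, so the printed values shrink roughly by a factor $\sqrt{10}$ each time k is multiplied by ten. The factor $\gamma D+L^2/(2\gamma )$ is smallest at $\gamma =L/\sqrt{2D}$.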

4 Numerical experiments

In this section, we perform numerical experiments to compare the algorithms IPSB, SDA and PSB on two kinds of test problems. All experiments were implemented in MATLAB 2018b and run on a laptop with a dual core ($1.6+1.8$ GHz) processor and 8 GB RAM.

4.1 Regularized least squares problem

In this subsection, we test the regularized least squares problem

$$\begin{aligned} \min _ {x \in \mathbb {R}^{n}} \left\{ \Vert Ax-b\Vert _2 + {\bar{\rho }} \max _{i \in [m]} {f_i(x)} \right\} , \end{aligned}$$

(4.1)

where $A \in \mathbb {R}^{n_1\times n_2}$, $b\in \mathbb {R}^{n_1}$ and $f_i(x),i\in [m]$ are all positive and strongly convex. Notice that (4.1) is a special case of (1.1) with $f(x)=\Vert Ax-b\Vert _2$ and $h(x)={\bar{\rho }} \max _{i\in [m]} {f_i(x)}$.

In our first test, we set $m=2$, $f_1(x)=\Vert x\Vert _{B}^2$ and $f_2(x)=\Vert x-c\Vert _{B}^2$, where $B \in \mathbb {R}^{n_2\times n_2}$ and $c \in \mathbb {R}^{n_2}$. We set $B\ne 0$ to be positive semidefinite but singular so that the function h is SQCC. In fact, we have $\Sigma _h=2B$. Applying Algorithm 1 to solve (4.1) reduces to

$$\begin{aligned} {\left\{ \begin{array}{lr} \displaystyle s_{k+1}=s_k+ g_k, \\ \displaystyle x_{k+1}=\arg \min _{x }\left\{ \left\langle s_{k+1},x\right\rangle + (k+1)\max \{\Vert x\Vert _{B}^2, \Vert x-c\Vert _{B}^2\}+ \frac{\beta _{ k+1 }}{2}\Vert x\!-\!x_0\Vert ^2_{I-\frac{2(k+1)}{\beta _{ k+1 }} B}\right\} , \end{array} \right. } \end{aligned}$$

where $g_k \in \partial f(x_k)$ and

$$\begin{aligned} \partial f(x)= \left\{ \begin{array}{lr} \displaystyle \frac{A^T(Ax-b)}{\Vert Ax-b\Vert }, &{}\hbox {if}\ Ax-b \ne 0,\\ \displaystyle \{A^Tu: u\in \mathbb {R}^{n_1},\ \Vert u\Vert \le 1 \},&{}\hbox {if}\ Ax-b = 0. \end{array} \right. \end{aligned}$$
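As an illustration of how one element $g_k \in \partial f(x_k)$ can be selected, the following Python sketch evaluates $f(x)=\Vert Ax-b\Vert _2$ and returns a subgradient. It is a minimal sketch with hypothetical helper names; when $Ax=b$ it picks the zero vector, which corresponds to the choice $u=0$ in the set $\{A^Tu: \Vert u\Vert \le 1\}$, and any other admissible $u$ would be equally valid.

```python
import math


def matvec(A, x):
    """Compute A x for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def rmatvec(A, u):
    """Compute A^T u."""
    return [sum(A[i][j] * u[i] for i in range(len(A))) for j in range(len(A[0]))]


def f(A, b, x):
    """f(x) = ||A x - b||_2."""
    return math.sqrt(sum((v - bi) ** 2 for v, bi in zip(matvec(A, x), b)))


def subgrad_f(A, b, x, tol=1e-12):
    """Return one subgradient of f at x."""
    r = [v - bi for v, bi in zip(matvec(A, x), b)]
    nrm = math.sqrt(sum(v * v for v in r))
    if nrm > tol:
        # f is differentiable here; the subdifferential is a singleton.
        return [v / nrm for v in rmatvec(A, r)]
    # A x = b: any A^T u with ||u|| <= 1 works; u = 0 is the simplest choice.
    return [0.0] * len(x)
```

The tolerance `tol` is an implementation detail of this sketch that guards against division by a residual norm close to zero.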

The iterations of the three algorithms under comparison for solving (4.1) can be written explicitly as

$$\begin{aligned} \displaystyle&SDA: x_{k+1}=x_0- \frac{1}{\beta _{ k+1}}{\sum _{i=0}^{k} {(g_i+l_i)}} ,\\ \displaystyle&PSB: x_{k+1}= \left\{ \begin{array}{lr} \displaystyle (2(k+1)B+\beta _{ k+1} I)^{-1}(\beta _{ k+1} x_0-s_{k+1}), &{} \hbox { if }\ \Vert x\Vert _B^2> \Vert x-c\Vert _B^2,\\ \displaystyle (2(k+1)B+\beta _{ k+1} I)^{-1}(\beta _{ k+1} x_0-s_{k+1}+2(k+1)Bc), &{} \hbox { if }\ \Vert x\Vert _B^2< \Vert x-c\Vert _B^2,\\ \displaystyle \left( 2(k+1)B + \beta _{ k+1} {\bar{G}} \right) ^{-1} \left( \beta _{ k+1} x_0 -s_{k+1}+ {\bar{\rho }}_1 Bc \right) , &{} \hbox { otherwise}, \end{array} \right. \\ \displaystyle&IPSB: x_{k+1}= \left\{ \begin{array}{lr} \displaystyle x_0-\frac{1}{\beta _{ k+1}}(2(k+1)Bx_0+s_{k+1}), &{} \hbox { if }\ \Vert x\Vert _B^2 > \Vert x-c\Vert _B^2,\\ \displaystyle x_0-\frac{1}{\beta _{ k+1}}(2(k+1)Bx_0+s_{k+1}-2(k+1)Bc) , &{} \hbox { if }\ \Vert x\Vert _B^2 < \Vert x-c\Vert _B^2,\\ \displaystyle x_0-\frac{1}{\beta _{k+1}}(2(k+1)Bx_0+s_{k+1}+{2{\bar{\rho }}_2}Bc), &{} \hbox { otherwise}, \end{array} \right. \end{aligned}$$

where ${\bar{\rho }}_1$, ${\bar{\rho }}_2$ and ${\bar{G}}$ are given by

$$\begin{aligned}&{\bar{\rho }}_1 = \frac{(k+1)c^T ( B c+ s_{k+1}-\beta _{k+1} x_0)}{c^T B c},\\&{\bar{\rho }}_2 = \frac{1}{4\Vert Bc\Vert ^2}\left( \beta _{ k+1} c^T B (c-x_0)+2c^T B s_{k+1} + 4(k+1)c^T B \cdot B x_0\right) ,\\&{\bar{G}} = I - \frac{B c\cdot c^T}{c^T B c}. \end{aligned}$$

In addition, $l_k \in \partial h(x_k)$ and

$$\begin{aligned} \partial h(x)=\{l \in \mathbb {R}^{n_2}:l=2 B x-2 \alpha B c,\ \alpha \in [0,1]\}. \end{aligned}$$
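The first branch of the IPSB update can be checked directly. On the region where $\Vert x\Vert _B^2$ is the active term, the subproblem objective is $\langle s_{k+1},x\rangle +(k+1)\Vert x\Vert _B^2+\frac{\beta _{k+1}}{2}\Vert x-x_0\Vert ^2_{I-\frac{2(k+1)}{\beta _{k+1}}B}$, whose gradient is $s_{k+1}+2(k+1)Bx_0+\beta _{k+1}(x-x_0)$ and whose Hessian is $\beta _{k+1}I$, so no matrix inversion is needed. The Python sketch below verifies this stationarity condition on a small example. It assumes the usual convention $\Vert x\Vert _B^2=x^TBx$, and the function names and numerical values are hypothetical.

```python
def matvec(M, v):
    """Compute M v for a matrix stored as a list of rows."""
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]


def ipsb_branch_update(B, x0, s, beta, k):
    """Closed form x0 - (2 (k+1) B x0 + s) / beta of the first IPSB branch."""
    Bx0 = matvec(B, x0)
    return [x0j - (2.0 * (k + 1) * bj + sj) / beta for x0j, bj, sj in zip(x0, Bx0, s)]


def branch_gradient(B, x0, s, beta, k, x):
    """Gradient of <s,x> + (k+1) x^T B x + (beta/2) ||x - x0||^2_{I - 2(k+1)B/beta}."""
    d = [xj - x0j for xj, x0j in zip(x, x0)]
    Bx = matvec(B, x)
    Bd = matvec(B, d)
    return [
        sj + 2.0 * (k + 1) * bxj + beta * dj - 2.0 * (k + 1) * bdj
        for sj, bxj, dj, bdj in zip(s, Bx, d, Bd)
    ]


# Hypothetical data: B is positive semidefinite and singular, as in the test problem.
B = [[1.0, 0.0], [0.0, 0.0]]
x0, s, beta, k = [1.0, 2.0], [0.5, -1.0], 4.0, 1
x_new = ipsb_branch_update(B, x0, s, beta, k)
print(x_new, branch_gradient(B, x0, s, beta, k, x_new))
```

The gradient evaluated at the closed-form point vanishes, which confirms that the update costs only one matrix-vector product with B.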

In our experiments, we choose ${\bar{\rho }} =1$ and $n_1 \times n_2 \in \{ 400 \times 900, 800 \times 2000, 1500 \times 3000\}$. In Algorithm 1, we set $\gamma =20$ and ${\hat{\lambda }}=10^{-3}$, and we terminate when either $|F({\hat{x}}_{k})-F({\hat{x}}_{k-1})| \le 10^{-3}$ or the number of iterations reaches 300. Starting from a fixed seed, we set $x^* = (10, \ldots ,10)\in \mathbb {R}^{n_2}$, independently generate the entries of $c \in \mathbb {R}^{n_2} $ from the normal distribution ${\mathcal {N}}(0,0.25)$, and then generate each element of A from ${\mathcal {N}}(0,20^2)$. We set $b\in \mathbb {R}^{n_1}$ as follows

$$\begin{aligned} b_i = \sum _{j=1}^{n_2}A_{ij}x^*_j,~i\in [n_1]. \end{aligned}$$

The matrix B is constructed by randomly generating eigenvalues and eigenvectors. The first ten eigenvalues of B are random positive numbers and the rest are zero. We construct the eigenvectors by randomly generating an orthogonal matrix with uniformly distributed random elements. When $n_1=400$, $n_2=900$, MATLAB code to generate the above data is as follows. The others are similar.

where x_op corresponds to the variable $x^*$.
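The construction of B described above can be sketched in a few lines. The following Python code is a minimal illustration and not the authors' MATLAB listing. It assumes that the orthogonal matrix is obtained by Gram-Schmidt orthogonalization of vectors with entries uniform on $[-1,1]$, that the positive eigenvalues are uniform on $[0.1,1]$, and it uses a small dimension for readability; all names are hypothetical.

```python
import random


def random_orthonormal(n, rng):
    """Return n orthonormal vectors via Gram-Schmidt on uniform random vectors."""
    vecs = []
    while len(vecs) < n:
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for q in vecs:
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-3:  # reject nearly dependent draws
            vecs.append([a / norm for a in v])
    return vecs


def make_B(n, rank, rng):
    """B = sum_t eig[t] q_t q_t^T with `rank` positive eigenvalues and the rest zero."""
    eig = [rng.uniform(0.1, 1.0) for _ in range(rank)] + [0.0] * (n - rank)
    Q = random_orthonormal(n, rng)
    B = [
        [sum(eig[t] * Q[t][i] * Q[t][j] for t in range(n)) for j in range(n)]
        for i in range(n)
    ]
    return B, Q, eig


def matvec(M, v):
    """Compute M v for a matrix stored as a list of rows."""
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]


rng = random.Random(0)             # fixed seed, as in the experiments
B, Q, eig = make_B(6, 2, rng)      # the paper uses n_2 = 900 with ten positive eigenvalues
```

By construction B is symmetric positive semidefinite and singular, so $\Sigma _h=2B$ is not positive definite, which is the setting the indefinite proximal term is designed for.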

We plot the variation of $\log (\log F({{{\hat{x}}}}_k))$ versus the iteration number k and the CPU runtime in Figs. 1 and 2, respectively.

As shown in Fig. 1, IPSB attains a lower function value than SDA throughout the iterations. By zooming in on the details of the figures, it can be seen that the value generated by IPSB is decreasing rather than constant.

PSB runs much slower than IPSB because of the high computational cost of matrix inversion. Figure 2 shows that IPSB is much more efficient than PSB.

4.2 Elastic net regression

The elastic net [27] is a regularized regression model that linearly combines the LASSO and ridge regression penalties. It is formulated as

$$\begin{aligned} \min _{\omega \in \mathbb {R}^{n}} \Vert y-X\omega \Vert _2^2+\eta _1\Vert \omega \Vert _1+\eta _2\Vert \omega \Vert ^2_2, \end{aligned}$$

(4.2)

where p is the number of samples, n is the number of features, $y\in \mathbb {R}^p$ is the response vector, $X \in \mathbb {R}^{p \times n}$ is the design matrix, and $\eta _1, \eta _2 > 0$ are regularization parameters. It corresponds to setting $f(\omega )=\eta _1\Vert \omega \Vert _1$ and $h(\omega )=\Vert y-X\omega \Vert _2^2+\eta _2\Vert \omega \Vert ^2_2$ in (1.1). The iteration schemes of three different algorithms in comparison for solving (4.2) are reformulated as

$$\begin{aligned} \displaystyle&SDA: \omega _{k+1}=\omega _0-\frac{1}{\beta _{ k+1}}\sum _{i=0}^{k}\left( 2X^T(X\omega _i-y)+\eta _1sgn(\omega _i)+2\eta _2\omega _i \right) ,\\ \displaystyle&PSB: \omega _{k+1}=((k+1)(2X^TX+2\eta _2 I) + \beta _{ k+1 } I)^{-1}\left( \beta _{ k+1 }\omega _0+2(k+1)X^Ty-\eta _1\sum _{i=0}^{k}sgn(\omega _i)\right) ,\\ \displaystyle&IPSB: \omega _{k+1}=\omega _0-\frac{1}{\beta _{ k+1 }}\left( (k+1)(2X^T(X\omega _0-y)+2\eta _2\omega _0)+\eta _1\sum _{i=0}^{k}sgn(\omega _i)\right) , \end{aligned}$$

where the initial point ${\omega _{0}}$ is given by $(X^TX+\eta _2 I)^{-1}X^Ty = \arg \min _{\omega \in \mathbb {R}^{n}} h(\omega )$, $sgn(\cdot )$ is the sign function, and the sequence $\{\beta _{ k }\}_{k\ge 0}$ takes the form (2.5). We list in Table 1 the computational cost of each iteration. It shows that PSB has the highest cost per iteration when n is much larger than k, and SDA has the highest cost when k is much larger than p and n.
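The IPSB iteration for (4.2) can be sketched as follows. Since the gradient of h enters only through its value at $\omega _0$, it is computed once, and each iteration then uses only vector operations. This Python sketch is a minimal illustration with hypothetical function names; the list `betas` stands in for $\beta _1,\beta _2,\ldots $ and must be supplied by the caller according to (2.5), which is not reproduced here.

```python
def sgn(v):
    """Sign function with sgn(0) = 0."""
    return (v > 0) - (v < 0)


def grad_h(X, y, eta2, w):
    """Gradient 2 X^T (X w - y) + 2 eta2 w of h(w) = ||y - X w||^2 + eta2 ||w||^2."""
    r = [sum(xij * wj for xij, wj in zip(row, w)) - yi for row, yi in zip(X, y)]
    return [
        2.0 * sum(X[i][j] * r[i] for i in range(len(X))) + 2.0 * eta2 * w[j]
        for j in range(len(w))
    ]


def run_ipsb(X, y, eta1, eta2, w0, betas):
    """Run len(betas) IPSB iterations; betas[k] plays the role of beta_{k+1}."""
    g0 = grad_h(X, y, eta2, w0)        # evaluated once, at the initial point
    sign_sum = [sgn(v) for v in w0]    # sum_{i=0}^{k} sgn(w_i), here for k = 0
    w = list(w0)
    for k, beta_next in enumerate(betas):
        w = [
            w0j - ((k + 1) * gj + eta1 * sj) / beta_next
            for w0j, gj, sj in zip(w0, g0, sign_sum)
        ]
        sign_sum = [s + sgn(v) for s, v in zip(sign_sum, w)]
    return w
```

When $\omega _0$ is the exact minimizer of h, the stored gradient is zero and the update reduces to $\omega _{k+1}=\omega _0-\frac{\eta _1}{\beta _{k+1}}\sum _{i=0}^{k}sgn(\omega _i)$.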

Table 1 Computational cost in each iteration


We set the termination criterion as

$$\begin{aligned} \frac{|f({\bar{\omega }})-{\bar{f}}|}{{\bar{f}}} \le \epsilon ^{rel}, \end{aligned}$$

where ${\bar{f}}$ is an approximation of the optimal value obtained by running 500 iterations of SDA in advance, ${\bar{\omega }}= \sum _{i=1}^t \omega _i/t$ and t is the actual number of iterations until termination.

We conduct the experiments with the following synthetic data and real data, respectively.

$ \textit{Synthetic data:} $ Starting from a fixed seed, we independently and randomly generate $X_{ij} \sim {\mathcal {N}}(0, 0.01)$, $\omega ^*_j \sim {\mathcal {N}}(0,1)$, $\epsilon _i \sim {\mathcal {N}}(0,0.04)$, and then set $y_i=\sum _{j=1}^{n} X_{ij} \omega ^*_j + \epsilon _i$, $i\in [p]$, $j\in [n]$. We choose $p\times n \in \{300 \times 1000, 500 \times 2000, 700 \times 3000, 1000 \times 5000\}$. The hyperparameters used for the synthetic data are set as

$$\begin{aligned} \gamma =10^2,\ \beta _{ 0 }=3 \times 10^{-2},\ \epsilon ^{rel}=10^{-1}. \end{aligned}$$

When $p=300$, $n=1000$, the MATLAB code to generate the data is as follows. The others are similar.

where x_op and ep correspond to the variables $\omega ^*$ and $\epsilon $ respectively.

Table 2 Numerical results for synthetic data


MNIST data [15]: There are 70,000 samples from the images of 10 digits in the MNIST data set, each with a $28 \times 28$ gray-scale pixel-map, for a total of 784 features. We take the digits 8 and 9. Thus we have $p=13783$ and $n=784$. Moreover, let $y\in \{+1,-1\}^p$ be the binary label. The hyperparameters used for MNIST data are as follows

$$\begin{aligned} \gamma =10^3,\ \beta _{ 0 }=10^{-3},\ \epsilon ^{rel}=10^{-2}. \end{aligned}$$

Table 3 Numerical results for MNIST data


Tables 2 and 3 present the experimental results for synthetic data and MNIST data, respectively. In both tables, we report the number of iterations (Iter.), the running time in seconds and the accuracy (Accu.) defined as $1-|f({\bar{\omega }})-{\bar{f}}|/{{\bar{f}}}$.

On the synthetic data, SDA takes the largest number of iterations among the three, IPSB runs in less CPU time than the other two algorithms, and PSB is the least efficient one. On the MNIST data, the three algorithms take almost the same number of iterations, so IPSB takes the least CPU time.

5 Conclusions

Nesterov’s dual averaging scheme avoids the decreasing stepsizes used in subgradient methods for nonsmooth convex minimization problems. It has since been extended to problems with an additional regularization term; this extension is denoted by RDA.

In this paper, we propose a dynamic regularized dual averaging scheme by relaxing the positive definite regularization term in RDA, which not only reduces the impact of the initial point on the subproblems in later iterations but also makes the subproblem in each iteration easy to solve. Under this new scheme, we propose the indefinite proximal subgradient-based (IPSB) algorithm. We show that IPSB converges to a region of the optimal value at the rate $O({1}/{\sqrt{k}})$, where k is the number of iterations. Numerical experiments on a regularized least squares problem and on elastic net regression show that IPSB is more efficient than the existing algorithms SDA and PSB. Future work includes further real applications of IPSB and improvements of IPSB by, for example, relaxing the condition on the initial point.

Data availability statements

The authors confirm that all data generated or analysed during this study are included in the paper.

References

1. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
2. Bertsekas, D.P.: Nonlinear Programming. Taylor & Francis, Milton Park (1997)
3. Boyd, S., Xiao, L., Mutapcic, A.: Subgradient methods. Lecture Notes of EE392o, Stanford University, Autumn Quarter, 2004:2004–2005 (2003)
4. Burke, J.V., Curtis, F.E., Lewis, A.S., Overton, M.L., Simões, L.E.: Gradient sampling methods for nonsmooth optimization. In: Numerical Nonsmooth Optimization, pp. 201–225. Springer (2020)
5. Cai, X.-J., Guo, K., Jiang, F., Wang, K., Wu, Z.-M., Han, D.-R.: The developments of proximal point algorithms. J. Oper. Res. Soc. China 1–43 (2022)
6. Combettes, P.L., Pesquet, J.C.: Proximal splitting methods in signal processing. In: Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pp. 185–212. Springer (2011)
7. Combettes, P.L., Wajs, V.R.: Signal recovery by proximal forward–backward splitting. Multiscale Model. Simul. 4(4), 1168–1200 (2005)
8. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12(7), 2021–2059 (2011)
9. Eckstein, J., Bertsekas, D.P.: On the Douglas–Rachford splitting method and the proximal point algorithm for maximal monotone operators. Math. Program. 55(1), 293–318 (1992)
10. Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6(6), 721–741 (1984)
11. Grötschel, M., Lovász, L., Schrijver, A.: The ellipsoid method. In: Geometric Algorithms and Combinatorial Optimization, pp. 64–101. Springer (1993)
12. He, B., Ma, F., Yuan, X.: Optimal proximal augmented Lagrangian method and its application to full Jacobian splitting for multi-block separable convex minimization problems. IMA J. Numer. Anal. 40(2), 1188–1216 (2020)
13. Jiang, F., Cai, X., Han, D.: The indefinite proximal point algorithms for maximal monotone operators. Optimization 70(8), 1759–1790 (2021)
14. Jiang, F., Wu, Z., Cai, X.: Generalized ADMM with optimal indefinite proximal term for linearly constrained convex optimization. J. Ind. Manag. Optim. 16(2), 835–856 (2020)
15. LeCun, Y., Cortes, C., Burges, C.J.C.: The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/ (2017)
16. Li, M., Sun, D., Toh, K.C.: A majorized ADMM with indefinite proximal terms for linearly constrained convex composite optimization. SIAM J. Optim. 26(2), 922–950 (2016)
17. Mäkelä, M.: Survey of bundle methods for nonsmooth optimization. Optim. Methods Softw. 17(1), 1–29 (2002)
18. Martinet, B.: Regularization d’inequations variationelles par approximations successives. Revue Francaise d’Informatique et de Recherche Opérationelle 4, 154–159 (1970)
19. Nesterov, Y.: Smooth minimization of nonsmooth functions. Math. Program. 103(1), 127–152 (2005)
20. Nesterov, Y.: Primal-dual subgradient methods for convex problems. Math. Program. 120(1), 221–259 (2009)
21. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer, Berlin (2013)
22. Ram, S.S., Nedić, A., Veeravalli, V.V.: Incremental stochastic subgradient algorithms for convex optimization. SIAM J. Optim. 20(2), 691–717 (2009)
23. Ram, S.S., Nedić, A., Veeravalli, V.V.: Distributed stochastic subgradient projection algorithms for convex optimization. J. Optim. Theory Appl. 147(3), 516–545 (2010)
24. Rockafellar, R.T.: Monotone operators and the proximal point algorithm. SIAM J. Control Optim. 14(5), 877–898 (1976)
25. Shor, N.Z.: Minimization Methods for Non-differentiable Functions. Springer Series in Computational Mathematics, Springer, Berlin (1985)
26. Xiao, L.: Dual averaging methods for regularized stochastic learning and online optimization. J. Mach. Learn. Res. 11(10), 2543–2596 (2010)
27. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B 67, 301–320 (2005)

Acknowledgements

The authors thank the editor and the referees for their valuable comments and suggestions, which helped us greatly improve the paper. The research of the second author was partially supported by NSFC with Nos. 12131004 and 12126603; and the research of the third author was partially supported by NSFC with No. 12171021 and by Beijing NSF with No. Z180005.

Author information

Authors and Affiliations

LMIB, School of Mathematical Sciences, Beihang University, Beijing, 100191, China
Rui Liu, Deren Han & Yong Xia


Corresponding author

Correspondence to Deren Han.


About this article

Cite this article

Liu, R., Han, D. & Xia, Y. An indefinite proximal subgradient-based algorithm for nonsmooth composite optimization. J Glob Optim 87, 533–550 (2023). https://doi.org/10.1007/s10898-022-01173-9


Received: 21 August 2021
Accepted: 19 April 2022
Published: 16 September 2022
Issue Date: November 2023
DOI: https://doi.org/10.1007/s10898-022-01173-9
