1 Introduction

Let \(X\subset {\mathbb {R}}^n\) and \(Y\subset {\mathbb {R}}^m\) be nonempty open convex sets whose closures are denoted by \({\overline{X}}\), \({\overline{Y}}\) respectively. Let \(F:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\cup \{+\infty \}\) and \(G:{\mathbb {R}}^m\rightarrow {\mathbb {R}}\cup \{+\infty \}\) be proper lower semicontinuous nonconvex functions; and let \(H:{\mathbb {R}}^n\times {\mathbb {R}}^m\rightarrow {\mathbb {R}}\cup \{+\infty \}\) be a proper lower semicontinuous biconvex function. In this paper, we consider the optimization problem

$$\begin{aligned} \min _{x\in {\overline{X}},~y\in {\overline{Y}}}\Phi (x,y)=F(x)+G(y)+H(x,y), \end{aligned}$$
(1)

which captures numerous applications, such as Poisson inverse problems (see, for instance, [6, 8] and references therein), nonnegative matrix factorization [1, 2], parallel magnetic resonance imaging [7], and multi-modal learning for image classification [29].

An intuitive method for solving problem (1) is the alternating minimization method (also known as the Gauss-Seidel method or the block coordinate descent method), which was originally proposed for the case \(X={\mathbb {R}}^n,~Y={\mathbb {R}}^m\),

$$\begin{aligned} \left\{ \begin{array}{l} x^{k+1}\in \arg \min ~\{\Phi (x,y^k):x\in {\mathbb {R}}^n\},\\ y^{k+1}\in \arg \min ~\{\Phi (x^{k+1},y):y\in {\mathbb {R}}^m\}. \end{array} \right. \end{aligned}$$
(2)

If \(\Phi \) is continuously differentiable and strictly convex with respect to one argument while the other is fixed, every limit point of the sequence \(\left\{ (x^k,y^k)\right\} \) generated by scheme (2) minimizes \(\Phi \) [9, 10]. A classical technique for relaxing the strict convexity condition is the proximal alternating minimization method [3, 4]. However, minimizing \(\Phi (\cdot , y)\) or \(\Phi (x, \cdot )\), each of which is the sum of two functions, at every iteration is not an easy task. To alleviate the computational cost, Bolte et al. [11] considered linearizing H and proposed the proximal alternating linearized minimization (PALM) algorithm,

$$\begin{aligned} \left\{ \begin{array}{l} x^{k+1}\in \arg \min ~\left\{ F(x)+\langle \nabla _xH(x^k,y^k),x-x^k\rangle +\frac{1}{2\varsigma _k}\Vert x-x^k\Vert ^2:x\in {\mathbb {R}}^n\right\} ,\\ y^{k+1}\in \arg \min ~\left\{ G(y)+\langle \nabla _yH(x^{k+1},y^k),y-y^k\rangle +\frac{1}{2\varrho _k}\Vert y-y^k\Vert ^2:y\in {\mathbb {R}}^m\right\} . \end{array} \right. \end{aligned}$$
(3)

The step sizes \(\varsigma _k\) and \(\varrho _k\) are computed according to

$$\begin{aligned} \varsigma _k\in (0,({\textrm{Lip}}(\nabla _x H(\cdot ,y^{k})))^{-1}),~~\varrho _k\in (0,({\textrm{Lip}}(\nabla _y H(x^{k+1},\cdot )))^{-1}), \end{aligned}$$

where “Lip(\(\cdot \))” denotes the Lipschitz constant. Under the assumption that \(\Phi \) satisfies the Kurdyka–Łojasiewicz (KL) property (see Definition 2.14) and a few other conditions, PALM globally converges to a critical point of \(\Phi \). This scheme is particularly advantageous when the functions F and G are simple, in the sense that their proximal operators have closed-form expressions or can be evaluated by a fast scheme, although the stepsizes must be updated along the iteration sequence.
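For illustration, the following minimal sketch performs the PALM updates (3) on a purely hypothetical toy instance: \(H(x,y)=\frac{1}{2}\Vert Ax-y\Vert ^2\), \(F=\lambda _1\Vert \cdot \Vert _1\) and \(G=\lambda _2\Vert \cdot \Vert _1\), so that \({\textrm{Lip}}(\nabla _x H(\cdot ,y))=\Vert A^\top A\Vert \), \({\textrm{Lip}}(\nabla _y H(x,\cdot ))=1\), and each update is a soft-thresholding (proximal) step.

```python
import numpy as np

def soft_threshold(v, t):
    # proximal operator of u -> t*||u||_1 (componentwise soft-thresholding)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# hypothetical toy data: H(x, y) = 0.5*||A x - y||^2, F = lam1*||.||_1, G = lam2*||.||_1
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
x, y = rng.standard_normal(10), rng.standard_normal(20)
lam1, lam2 = 0.1, 0.1

Lx = np.linalg.norm(A.T @ A, 2)   # Lip(grad_x H(., y)) = ||A^T A||
Ly = 1.0                          # Lip(grad_y H(x, .)) = 1
sig, rho = 0.9 / Lx, 0.9 / Ly     # stepsizes chosen in (0, 1/Lip)

for k in range(200):
    # one PALM sweep, cf. (3): linearize H and take proximal steps on F and G
    x = soft_threshold(x - sig * (A.T @ (A @ x - y)), sig * lam1)
    y = soft_threshold(y - rho * (y - A @ x), rho * lam2)
```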

When the coupling term H has a good structure, e.g., it is a quadratic function [11, 21, 27], the difficulties in solving the minimization problems of (2) and (3) may be caused by the component functions F and G. For this case, Nikolova and Tan [20] proposed the alternating structure-adapted proximal (ASAP) gradient descent algorithm,

$$\begin{aligned} \left\{ \begin{array}{l} x^{k+1}\in \arg \min ~\left\{ H(x,y^k)+\langle \nabla F(x^k),x-x^k\rangle +\frac{1}{2{\bar{\varsigma }}}\Vert x-x^k\Vert ^2:x\in {\mathbb {R}}^n\right\} ,\\ y^{k+1}\in \arg \min ~\left\{ H(x^{k+1},y)+\langle \nabla G(y^k),y-y^k\rangle +\frac{1}{2{\bar{\varrho }}}\Vert y-y^k\Vert ^2:y\in {\mathbb {R}}^m\right\} , \end{array} \right. \end{aligned}$$
(4)

where

$$\begin{aligned} {\bar{\varsigma }}\in (0,({\textrm{Lip}}(\nabla F))^{-1}),~~{\bar{\varrho }}\in (0,({\textrm{Lip}}(\nabla G))^{-1}). \end{aligned}$$

ASAP (4) is well suited to the case where the component function H has a good structure, in the sense of being simple with respect to each variable x and y, so that the proximal operators of \(H(\cdot ,y)\) and \(H(x,\cdot )\) have closed-form expressions.

ASAP (4) is a good alternative to PALM (3). Nevertheless, both algorithms are restricted to the case where there are no abstract sets, while in many applications abstract sets do appear in model (1), i.e., \(X\ne {\mathbb {R}}^n\), or \(Y\ne {\mathbb {R}}^m\), or both [6, 8]. Certainly, one can use the indicator functions \(\delta _{{\overline{X}}}(x)\) and \(\delta _{{\overline{Y}}}(y)\) to equivalently convert problem (1) to

$$\begin{aligned} \min _{x\in {\mathbb {R}}^n,~y\in {\mathbb {R}}^m}F(x)+G(y)+{{\overline{H}}}(x,y), \end{aligned}$$

where

$$\begin{aligned} {\overline{H}}(x,y)=H(x,y)+\delta _{{\overline{X}}}(x)+\delta _{{\overline{Y}}}(y). \end{aligned}$$

However, in this case the subproblems in the ASAP algorithm (4) become constrained minimization problems, which unavoidably degrades its efficiency. Another key point of the ASAP algorithm is that F and G are assumed to be globally Lipschitz smooth on the entire spaces, which is very restrictive and excludes many applications. Furthermore, the proximal parameters \({\bar{\varsigma }}\) and \({\bar{\varrho }}\) in (4), which play essential roles in numerical implementations, depend on estimates of the Lipschitz constants of \(\nabla F\) and \(\nabla G\), which are not easy to obtain.

Several techniques can be utilized to handle some of these difficulties [6, 12, 13, 15, 25]. For example, [13, 15] chose Legendre or Bregman functions to absorb the abstract sets and hence automatically kept the iterates inside them; Bauschke et al. [6] introduced the NoLips algorithm, which avoids the dependence on Lipschitz smoothness and was later extended to the nonconvex case [5, 12, 17, 18, 25, 26, 28]. However, the convergence of this powerful algorithm is limited to the convex case, or to the nonconvex case under the assumption that \(X={\mathbb {R}}^n, Y={\mathbb {R}}^m\). Moreover, for the nonconvex case, Lipschitz smoothness is still required, which runs counter to the original intention of the NoLips algorithm.

In this paper, we propose an alternating structure-adapted Bregman proximal (ASABP) gradient descent algorithm,

$$\begin{aligned} \left\{ \begin{array}{l} x^{k+1}\in \arg \min ~\left\{ H(x, y^{k})+\langle \nabla F(x^{k}), x-x^k\rangle +\frac{1}{ \tau _k}D_h(x,x^k): x \in {\overline{X}}\right\} ,\\ y^{k+1}\in \arg \min ~\left\{ H(x^{k+1}, y)+\langle \nabla G(y^{k}),y-y^k \rangle + \frac{1}{ \sigma _k}D_\psi (y,y^k):y \in {\overline{Y}}\right\} , \end{array} \right. \end{aligned}$$
(5)

where we choose the generalized Bregman functions h and \(\psi \) such that the pairs (F, h) and \((G,\psi )\) are GL-smad on X and Y (see Definition 2.6 and Definition 2.12). By associating the generalized Bregman functions h and \(\psi \) with the objective functions F and G in a suitable way, ASABP absorbs the abstract sets and renders the subproblems unconstrained. Merely assuming that the underlying function satisfies the Kurdyka–Łojasiewicz (KL) property, and without any Lipschitz smoothness, we establish global convergence to a critical point. This result paves a new way for analyzing NoLips-type algorithms for nonsmooth nonconvex optimization problems with abstract sets.

Numerically, the vectors \(x^k\) and \(y^k\) and the parameters \(\tau _k\) and \(\sigma _k\) greatly influence the performance of the algorithm. To improve the numerical performance, we merge an inertial strategy [22] into the proposed ASABP algorithm, leading to the inertial alternating structure-adapted Bregman proximal (IASABP) gradient descent algorithm (see Algorithm 2). The inertial stepsizes are judiciously selected by a backtracking line search strategy. Under mild conditions, we obtain an O(1/K) convergence rate, measured by the Bregman distance, for the sequence generated by IASABP.

The rest of this paper is organized as follows. In Sect. 2, we list some basic notations and preliminary materials. In Sect. 3, we present the new alternating structure-adapted Bregman proximal (ASABP) gradient descent algorithm and its inertial version IASABP in detail, and give the convergence analysis. To show the ability of ASABP and IASABP to deal with non-Lipschitz problems with abstract sets, we apply them to the Poisson linear inverse problem and report numerical experiments in Sect. 4. Finally, we draw some conclusions in Sect. 5.

2 Preliminaries

In this section, we summarize some notations and elementary facts to be used for further analysis.

The Euclidean scalar product in \({\mathbb {R}}^d\) and its corresponding norm are respectively denoted by \(\langle \cdot ,\cdot \rangle \) and \(\Vert \cdot \Vert \). If \(F:{\mathbb {R}}^n\rightrightarrows {\mathbb {R}}^m\) is a point-to-set mapping, its graph is defined by

$$\begin{aligned} \text {Graph} ~F:=\{(x,y) \in \mathbb {R} ^n\times \mathbb {R} ^m:~y\in F(x)\}. \end{aligned}$$

Similarly the graph of a real-extended-valued function \(f:{\mathbb {R}}^d\rightarrow {\mathbb {R}}\cup \{+\infty \}\) is defined by

$$\begin{aligned} \text {Graph} ~f:=\{(x,s)\in {\mathbb {R}}^d\times {\mathbb {R}}:~s= f(x)\}. \end{aligned}$$

Given a function \(f:~{\mathbb {R}}^d\rightarrow {\mathbb {R}}\cup \{+\infty \}\), its domain is

$$\begin{aligned} \text {dom}f:=\{x\in {\mathbb {R}}^d:~f(x)<+\infty \}, \end{aligned}$$

and f is proper if and only if \(\text {dom}f\ne \emptyset \). The proximal operator associated with f is given by

$$\begin{aligned} \text{ prox}_{f}(x)=\mathop {\arg \min }\limits _{u\in {\mathbb {R}}^d}\left\{ f(u)+\frac{1}{2}\Vert u-x\Vert ^{2}\right\} ~~\text{ for } \text{ any }~x\in {\mathbb {R}}^d. \end{aligned}$$
(6)
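For example, for \(f(u)=\lambda \Vert u\Vert _1\) with \(\lambda >0\), (6) reduces to componentwise soft-thresholding, \(\left( \text{ prox}_{f}(x)\right) _i={\textrm{sign}}(x_i)\max \{|x_i|-\lambda ,0\}\); and for \(f=\delta _C\) with C nonempty closed convex, \(\text{ prox}_{f}\) is the orthogonal projection onto C.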

The indicator function of a nonempty set C is defined by

$$\begin{aligned} \delta _C(x) = {\left\{ \begin{array}{ll} 0 &{}\hbox {if} ~~ x\in C,\\ +\infty &{}\hbox {otherwise}. \end{array}\right. } \end{aligned}$$

The normal cone of a nonempty closed convex set \(C\subset {\mathbb {R}}^d\) at \(x\in C\) is defined by

$$\begin{aligned} N_{C}(x)=\{d\in {\mathbb {R}}^d: \langle d, y-x\rangle \le 0,~\forall y\in C\}; \end{aligned}$$

and if \(x\not \in C\), we set \( N_{C}(x)=\emptyset \). For any subset \(C\subset {\mathbb {R}}^d\) and any \(x\in {\mathbb {R}}^d\), the distance from x to C, denoted by \(\hbox {dist}(x,C)\), is defined as

$$\begin{aligned} \hbox {dist}(x,C):=\inf _{y\in C}\Vert y-x\Vert . \end{aligned}$$

When \(C=\emptyset \), we set \(\hbox {dist}(x,C)=+\infty \) for all \(x\in {\mathbb {R}}^d\).

2.1 Subdifferentials and partial subdifferentials

Let us first recall a few definitions and results concerning subdifferentials of a given function.

Definition 2.1

[3] Let \(f:~{\mathbb {R}}^d\rightarrow {\mathbb {R}}\cup \{+\infty \}\) be a proper lower semicontinuous function.

  1. (i)

    For each \(x\in \text {dom}f\), the Fréchet subdifferential of f at x, written as \({\hat{\partial }} f(x)\), is the set of vectors \(u\in {\mathbb {R}}^d\) which satisfy

    $$\begin{aligned} \liminf _{y\ne x,~y\rightarrow x}\frac{f(y)-f(x)-\langle u,y-x\rangle }{\Vert y-x\Vert }\ge 0. \end{aligned}$$

    If \(x\notin \text {dom}f\), we set \({\hat{\partial }}f(x)=\emptyset \).

  2. (ii)

    The set

    $$\begin{aligned} \partial f(x):=\left\{ u\in {\mathbb {R}}^d:\exists ~ x^k\rightarrow x,~f(x^k)\rightarrow f(x),~u^k\in {\hat{\partial }}f(x^k)\rightarrow u\right\} , \end{aligned}$$

    is the limiting subdifferential, or simply the subdifferential for short, of f at \(x\in \text {dom}f\).

Remark 2.2

[24] From Definition 2.1, we have the following conclusions.

  1. (i)

    \({\hat{\partial }}f(x) \subset \partial f(x)\) for each \(x\in {\mathbb {R}}^d\); both sets are closed; and \({\hat{\partial }}f(x)\) is convex while \(\partial f(x)\) is not necessarily convex.

  2. (ii)

    Let \((x^k,u^k)\in \text {Graph}~\partial f\) be a sequence that converges to \((x,u)\). If \(f(x^k)\) converges to f(x) as \(k\rightarrow +\infty \), then \((x,u)\in \text {Graph}~\partial f\).

  3. (iii)

    (Fermat’s rule)  A necessary but not sufficient condition for \(x\in {\mathbb {R}}^d\) to be a local minimizer of f is that x is a critical point, that is,

    $$\begin{aligned} 0\in \partial f(x). \end{aligned}$$

    The set of critical points of f is denoted by crit f.

  4. (iv)

    If \(f:{\mathbb {R}}^d\rightarrow {\mathbb {R}}\cup \{+\infty \}\) is proper lower semicontinuous and \(h:{\mathbb {R}}^d\rightarrow {\mathbb {R}}\) is continuously differentiable, then for any \(x\in \text {dom}f\), \(\partial (f+h)(x)=\partial f(x)+\nabla h(x)\).

In first-order alternating methods (e.g., scheme (5)), each iteration amounts to finding a minimizer of an auxiliary function, which is characterized by Fermat's rule. Since Fermat's rule is stated in terms of subgradients, applying alternating schemes leads us to consider partial subgradients, defined as follows.

Definition 2.3

[20] Given \(H:{\mathbb {R}}^n\times {\mathbb {R}}^m\rightarrow {\mathbb {R}}\cup \{+\infty \}\), the subdifferential of H at \(\left( x^{+},y^{+}\right) \) is denoted by \(\partial H\left( x^{+},y^{+}\right) \). In addition, \(\partial _{x}H\left( x^{+},y^{+}\right) \) and \(\partial _{y}H\left( x^{+},y^{+}\right) \) are the partial subdifferentials of \(H(\cdot ,y^{+})\) at \(x^{+}\) and \(H(x^{+},\cdot )\) at \(y^{+}\) respectively.

Remark 2.4

Note that the desired critical point is characterized by its whole subgradient, namely \((0,0)\in \partial H(x^*,y^*)\). Let \(H:{\mathbb {R}}^n\times {\mathbb {R}}^m\rightarrow {\mathbb {R}}\cup \{+\infty \}\) be a proper function. Then

$$\begin{aligned} \partial H(x,y)\subset \partial _x H(x,y)\times \partial _y H(x,y), ~~~~\forall \left( x,y\right) \in {\textrm{dom}}H, \end{aligned}$$

holds naturally [24]. If the function H also satisfies

$$\begin{aligned} \partial _x H(x,y)\times \partial _y H(x,y)\subset \partial H(x,y), ~~~~\forall \left( x,y\right) \in {\textrm{dom}}H, \end{aligned}$$
(7)

at least at the point \((x,y)=(x^*,y^*)\), then \((0,0)\in \partial _x H(x^*,y^*)\times \partial _y H(x^*,y^*)\) means that \((x^*,y^*)\) is a critical point. Condition (7) is automatically satisfied for functions of the form shown in the following example, which thus provides a sufficient condition.

Example 1

Let \(H:~{\mathbb {R}}^n\times {\mathbb {R}}^m\rightarrow {\mathbb {R}}\cup \{+\infty \}\) be of the form

$$\begin{aligned} H(x,y)=h (x,y)+\xi (x)+\eta (y), \end{aligned}$$

where \(h:~{\mathbb {R}}^n\times {\mathbb {R}}^m\rightarrow {\mathbb {R}}\) is continuously differentiable; \(\xi :~{\mathbb {R}}^n\rightarrow {\mathbb {R}}\cup \{+\infty \}\) and \(\eta :~ {\mathbb {R}}^m\rightarrow {\mathbb {R}}\cup \{+\infty \}\) are proper lower semicontinuous functions. Then, for any \((x, y) \in \textrm{dom}H=\textrm{dom}\xi \times \textrm{dom}\eta \), H satisfies (7).

More details and examples can be found in [20]. Besides condition (7), another property, closely linked to the closedness of the partial subdifferentials, is crucial.

Definition 2.5

[20] (Parametric closedness of the partial subdifferentials)  Let \(H:~{\mathbb {R}}^n\times {\mathbb {R}}^m\rightarrow {\mathbb {R}}\cup \{+\infty \}\) be a function, and let \(\{(x^k,y^k)\}_{k\in {\mathbb {N}}}\) be a sequence and \((x^+,y^+)\in {\textrm{dom}}H\) a point such that, as \(k\rightarrow +\infty \),

$$\begin{aligned} (x^k,y^k)\rightarrow (x^+,y^+),~~H(x^k,y^k)\rightarrow H(x^+,y^+). \end{aligned}$$
(8)

The x-partial subdifferential of H is said to be parametrically closed at \((x^+,y^+)\) with respect to the sequence \(\{(x^k,y^k)\}_{k\in {\mathbb {N}}}\) if, for any sequence \(\{p_x^k\}_{k\in {\mathbb {N}}}\) satisfying

$$\begin{aligned} \partial _x H(x^k,y^k)\ni p_x^k\rightarrow p_x^+,~k\rightarrow +\infty , \end{aligned}$$

it holds that \(p_x^+\in \partial _x H(x^+,y^+)\). Similarly, we can define the parametric closedness of the y-partial subdifferential of H at \((x^+,y^+)\) with respect to the sequence \(\{(x^k,y^k)\}_{k\in {\mathbb {N}}}\). The x-partial and y-partial subdifferentials of H are said to be parametrically closed if they are parametrically closed at any point \((x^+,y^+)\in {\textrm{dom}}H\) with respect to any sequence satisfying (8).

Example 2

[20] Let \(H:~{\mathbb {R}}^n\times {\mathbb {R}}^m\rightarrow {\mathbb {R}}\cup \{+\infty \}\) be a biconvex function continuous on its domain. Then its partial subdifferentials are parametrically closed at any point \((x^+,y^+)\in {\textrm{dom}}H\).

2.2 Generalized Bregman function and Bregman distance

To define the Bregman distance, we first give the definition of generalized Bregman function.

Definition 2.6

(Generalized Bregman function)  Let \(\Omega \) be a nonempty convex open subset of \({\mathbb {R}}^d\), and \({\overline{\Omega }}\) denotes its closure. Associated with \(\Omega \), a function \(h:{\mathbb {R}}^d\rightarrow {\mathbb {R}}\cup \{+\infty \}\) is called a generalized Bregman function if it satisfies the following:

  1. (i)

    h is proper lower semicontinuous and convex with \(\textrm{dom} h\subset {\overline{\Omega }}\).

  2. (ii)

    \({\textrm{dom}}\partial h=\Omega =int~{\textrm{dom}}h\).

  3. (iii)

    h is strictly convex on \(int~{\textrm{dom}}h\).

  4. (iv)

    If \(\{x^k\}\in \Omega \) converges to \({\bar{x}}\in \partial \Omega \) (the boundary of \(\Omega \)), then \(\displaystyle \lim _{k\rightarrow +\infty }\langle \nabla h(x^k), u-x^k\rangle =-\infty \) for all \(u\in \Omega \).

We denote the class of generalized Bregman functions by \({\mathcal {G}}(\Omega )\).

Definition 2.7

(Bregman distance)  Given \(h\in {\mathcal {G}}(\Omega )\), the Bregman distance associated to h, denoted by \(D_h\): \({\textrm{dom}}h\times int~{\textrm{dom}}h\rightarrow {\mathbb {R}}_{+}\), is defined by

$$\begin{aligned} D_{h}(x, y)=h(x)-h(y)-\langle \nabla h(y), x-y\rangle . \end{aligned}$$

Remark 2.8

We give some remarks on Definition 2.6 and Definition 2.7.

  1. (i)

    The class of functions satisfying (i) and (ii) in Definition 2.6 are the kernel generating distances [12], and the class of functions satisfying (i)-(iii) are Legendre functions [6].

  2. (ii)

    Definition 2.6 (ii) is equivalent to the statement that h is essentially smooth [23, 25], which means that \(\partial h(x)=\{\nabla h(x)\}\) when \(x\in int~\textrm{dom}h\), while \(\partial h(x)=\emptyset \) when \(x\notin int~\textrm{dom}h\).

  3. (iii)

    Given \(h\in {\mathcal {G}}(\Omega )\), if h is continuous on \({\overline{\Omega }}\), then Definition 2.6 (iv) holds if and only if \(\displaystyle \lim _{k\rightarrow +\infty }D_{h}(u, x^k)=+\infty \) for all \(u\in \Omega \) and every sequence \(\{x^k\}\in \Omega \) that converges to a point \({\bar{x}}\in \partial \Omega \).

  4. (iv)

    Note that the structural form of \(D_h\) is also useful when h is not convex, and still enjoys two simple but remarkable properties,

  • The three point identity: For any \(y,z\in int~\textrm{dom}h\) and \(x\in {\textrm{dom}}h\), we have

    $$\begin{aligned} D_h(x,z)-D_h(x,y)-D_h(y,z)=\langle \nabla h(y)-\nabla h(z), x-y \rangle . \end{aligned}$$
  • Linear additivity: For any \(\alpha ,\beta \in {\mathbb {R}}\), and any functions \(h_1\) and \(h_2\), we have

    $$\begin{aligned} D_{\alpha h_1+\beta h_2}(x,y)=\alpha D_{h_1}(x,y)+\beta D_{h_2}(x,y), \end{aligned}$$

    for any \((x,y)\in ({\textrm{dom}}h_1\bigcap {\textrm{dom}}h_2)^2\) such that both \(h_1\) and \(h_2\) are differentiable at y.
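For concreteness, the following minimal numerical sketch (not part of the formal development, with all data hypothetical) evaluates \(D_h\) for two standard kernels, the Euclidean kernel \(h=\frac{1}{2}\Vert \cdot \Vert ^2\) and Burg's entropy \(h(x)=-\sum _i\log x_i\) on the positive orthant, and checks the three point identity at randomly drawn points.

```python
import numpy as np

def bregman(h, grad_h, x, y):
    # D_h(x, y) = h(x) - h(y) - <grad h(y), x - y>
    return h(x) - h(y) - np.dot(grad_h(y), x - y)

# Euclidean kernel: h = 0.5*||.||^2, so D_h(x, y) = 0.5*||x - y||^2
h_euc,  g_euc  = lambda z: 0.5 * np.dot(z, z), lambda z: z
# Burg's entropy on the positive orthant: h(z) = -sum(log z_i)
h_burg, g_burg = lambda z: -np.sum(np.log(z)), lambda z: -1.0 / z

rng = np.random.default_rng(1)
x, y, z = rng.uniform(0.5, 2.0, size=(3, 4))   # hypothetical points in int dom h

for h, g in [(h_euc, g_euc), (h_burg, g_burg)]:
    lhs = bregman(h, g, x, z) - bregman(h, g, x, y) - bregman(h, g, y, z)
    rhs = np.dot(g(y) - g(z), x - y)            # three point identity
    assert np.isclose(lhs, rhs)
```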

It is obvious that the Bregman distance is, in general, not symmetric. It is thus natural to introduce a measure of the symmetry of \(D_h\).

Definition 2.9

[6] (Symmetry coefficient)  Given \(h\in {\mathcal {G}}(\Omega )\), its symmetry coefficient is defined by

$$\begin{aligned} \alpha (h) = \inf \{D_h(x,y)/D_h (y,x)|(x,y)\in int~\textrm{dom}h \times int~ \textrm{dom}h,~x\ne y \}\in [0,1]. \end{aligned}$$

Remark 2.10

The symmetry coefficient has the following properties:

  1. (i)

    Clearly, the closer \(\alpha (h)\) gets to 1, the more symmetric \(D_h\) is.

  2. (ii)

    For any \(x,y \in int~{\textrm{dom}}h\),

    $$\begin{aligned} \alpha (h) D_h(x,y)\le D_h(y,x) \le \frac{1}{\alpha (h)} D_h(x,y), \end{aligned}$$

    where we have adopted the convention that \(\frac{1}{0}=+\infty \) and \(+\infty \times r = +\infty \) for any \(r\ge 0\).
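For instance, for the Euclidean kernel \(h=\frac{1}{2}\Vert \cdot \Vert ^2\) the Bregman distance is symmetric, so \(\alpha (h)=1\); whereas for Burg's entropy \(h(x)=-\log x\) on \((0,+\infty )\), taking \(y=1\) and letting \(x\rightarrow 0^{+}\) gives \(D_h(x,1)/D_h(1,x)=(x-\log x-1)/(1/x+\log x-1)\rightarrow 0\), so \(\alpha (h)=0\).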

Lemma 2.11

Given \(h\in {\mathcal {G}}(\Omega )\), for any proper lower semicontinuous convex function \(\Gamma :{\mathbb {R}}^d\rightarrow {\mathbb {R}}\cup \{+\infty \}\) and any \(z \in int~{\textrm{dom}}h\), if

$$\begin{aligned} z_{+}=\arg \min _{x\in {\overline{\Omega }}}\{ \Gamma (x)+D_h(x,z)\}, \end{aligned}$$

then for any \(x\in {\textrm{dom}}h\),

$$\begin{aligned} \Gamma (z_+)+D_h(z_+,z)\le \Gamma (x)+D_h(x,z)-D_h(x,z_+). \end{aligned}$$

2.3 Generalized L-smooth adaptable and extended descent lemma

We denote by \({\mathcal {G}}(f,h)\) the set of pairs of functions (f, h) satisfying

  1. (i)

    \(h\in {\mathcal {G}}(\Omega )\),

  2. (ii)

    \(f:{\mathbb {R}}^d\rightarrow {\mathbb {R}}\cup \{+\infty \}\) is proper lower semicontinuous nonconvex with \(\textrm{dom}h\subset {\textrm{dom}}f\), and f is continuously differentiable on \({\overline{\Omega }}\).

Definition 2.12

(Generalized L-smooth adaptable)  A pair of functions \((f,h)\in {\mathcal {G}}(f,h)\) is called generalized L-smooth adaptable (GL-smad) on \(\Omega \) if there exists \(L>0\) such that \(Lh+f\) and \(Lh-f\) are convex on \(\Omega \).

From the above definition we immediately obtain the two-sided descent lemma, which complements and extends the NoLips descent lemma derived in [6].

Lemma 2.13

(Extended descent lemma)  The pair of functions \((f,h)\in {\mathcal {G}}(f,h)\) is GL-smad on \(\Omega \) if and only if

$$\begin{aligned} \left| f(x)- f(y)-\langle \nabla f(y), x-y\rangle \right| \le LD_h(x,y), ~~~\forall x \in \textrm{dom}h,~\forall y \in int~{\textrm{dom}}h. \end{aligned}$$

Due to the structural form of \(D_f\), the extended descent lemma reads equivalently as

$$\begin{aligned} \left| f(x)- f(z)-\langle \nabla f(y), x-z\rangle +D_f(z,y)\right| \le LD_h(x,y), ~~~\forall x, z \in \textrm{dom}h,~\forall y \in int~{\textrm{dom}}h. \end{aligned}$$

When the function f is convex, the convexity of \(Lh+f\) holds automatically, and in this case the NoLips descent lemma given in [6] is recovered. When \(h(z)=\frac{1}{2}\Vert z\Vert ^2\), and consequently \(D_h(x,y)=\frac{1}{2}\Vert x-y\Vert ^2\), Lemma 2.13 reduces to the classical descent lemma [19],

$$\begin{aligned} |f(x)-f(y)-\langle \nabla f(y), x-y\rangle |\le \frac{L}{2}\Vert x-y\Vert ^2,~~~\forall x \in {\textrm{dom}}h,~\forall y \in int~{\textrm{dom}}h. \end{aligned}$$
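As a quick sanity check of Definition 2.12 and Lemma 2.13, consider the hypothetical scalar pair \(f(x)=\frac{1}{4}x^4-\frac{1}{2}x^2\) (nonconvex) and \(h(x)=\frac{1}{2}x^2+\frac{1}{4}x^4\) on \(\Omega ={\mathbb {R}}\): both \(h+f=\frac{1}{2}x^4\) and \(h-f=x^2\) are convex, so \((f,h)\) is GL-smad with \(L=1\), and the inequality of Lemma 2.13 can be verified numerically by the following minimal sketch.

```python
import numpy as np

# hypothetical scalar pair: f(x) = x^4/4 - x^2/2 (nonconvex), h(x) = x^2/2 + x^4/4;
# h + f and h - f are convex on R, so (f, h) is GL-smad with L = 1 (Definition 2.12)
f,  df = lambda x: 0.25 * x**4 - 0.5 * x**2, lambda x: x**3 - x
h,  dh = lambda x: 0.5 * x**2 + 0.25 * x**4, lambda x: x + x**3
D_h = lambda x, y: h(x) - h(y) - dh(y) * (x - y)

L = 1.0
rng = np.random.default_rng(2)
for _ in range(1000):
    x, y = rng.uniform(-3.0, 3.0, size=2)
    # extended descent lemma (Lemma 2.13), up to floating-point tolerance
    assert abs(f(x) - f(y) - df(y) * (x - y)) <= L * D_h(x, y) + 1e-12
```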

2.4 Kurdyka–Łojasiewicz property

Let \(f:{\mathbb {R}}^d\rightarrow {\mathbb {R}}\cup \{+\infty \}\) be a proper lower semicontinuous function. For \(-\infty<\eta _1<\eta _2\le +\infty \), define

$$\begin{aligned}{}[\eta _1<f<\eta _2]:=\{x\in {\mathbb {R}}^d:\eta _1<f(x)<\eta _2\}. \end{aligned}$$

Definition 2.14

[4] (Kurdyka–Łojasiewicz property)  The function f is said to have the Kurdyka–Łojasiewicz (KL) property at \({\bar{x}}\in \text {dom}\partial f\) if there exist \(\eta \in (0,+\infty ]\), a neighborhood U of \({\bar{x}}\) and a continuous concave function \(\varphi :[0,\eta )\rightarrow {\mathbb {R}}_+\) such that

  1. (i)

    \(\varphi (0)=0\);

  2. (ii)

    \(\varphi \) is \(C^1\) on \((0,\eta )\);

  3. (iii)

    for all \(s\in (0,\eta )\), \({\varphi }'(s)>0\);

  4. (iv)

    for all x in \(U\cap [f({\bar{x}})<f<f({\bar{x}})+\eta ]\), the Kurdyka–Łojasiewicz inequality holds, i.e.,

    $$\begin{aligned} {\varphi }'(f(x)-f({\bar{x}}))\hbox {dist}(0,\partial f(x))\ge 1. \end{aligned}$$

Remark 2.15

  1. (i)

    If f satisfies the KL property at each point of \(\text {dom}\partial f\), then f is called a KL function;

  2. (ii)

    For convenience, denote the class of functions which satisfy (i), (ii) and (iii) in Definition 2.14 as \(\Phi _\eta \).

The following lemma gives the important uniformized KL property. More properties and examples can be found in [3, 11, 14].

Lemma 2.16

[11] (Uniformized KL property)  Let C be a compact set and let \(f:{\mathbb {R}}^d\rightarrow {\mathbb {R}}\cup \{+\infty \}\) be a proper lower semicontinuous function. Assume that f is constant on C and satisfies the KL property at each point of C. Then, there exist \(\varepsilon >0\), \(\eta >0\) and \(\varphi \in \Phi _\eta \) such that for any \(\bar{x} \in C\) and all x in the following intersection,

$$\begin{aligned} \{x\in {\mathbb {R}}^d:~\hbox {dist}(x,C)<\varepsilon \}\cap [f({\bar{x}})<f(x)<f({\bar{x}})+\eta ], \end{aligned}$$

one has,

$$\begin{aligned} {\varphi }'(f(x)-f({\bar{x}}))\hbox {dist}(0,\partial f(x))\ge 1. \end{aligned}$$

3 Algorithms and convergence

In this section, we first describe our alternating structure-adapted Bregman proximal (ASABP) gradient descent algorithm and its inertial version (IASABP algorithm) for solving

$$\begin{aligned} \min _{x\in {\overline{X}},~y\in {\overline{Y}}}\Phi (x,y)=F(x)+G(y)+H(x,y), \end{aligned}$$

where \(X\subset {\mathbb {R}}^n\) and \(Y\subset {\mathbb {R}}^m\) are nonempty open convex sets, and \({\overline{X}}\) and \({\overline{Y}}\) are their closures, respectively. Then we present the convergence rate results for both algorithms and establish the global convergence result for the ASABP algorithm. Throughout the rest of this section, we make the following assumption, which guarantees that the algorithms are well defined.

Assumption 3.1

  1. (i)

    \(\inf \{\Phi (x,y): (x,y)\in {\overline{X}}\times {\overline{Y}}\}>-\infty \) and \(\Phi \) is coercive.

  2. (ii)

    \(h\in {\mathcal {G}}(X)\) and \(\psi \in {\mathcal {G}}(Y)\) such that the pairs \((F,h)\in {\mathcal {G}}(F,h)\) and \((G,\psi )\in {\mathcal {G}}(G,\psi )\) are GL-smad on X and Y with coefficients \(L_h\) and \(L_{\psi }\) respectively.

  3. (iii)

    For any sequence \(\{x^k\}_{k\in {\mathbb {N}}}\in int~{\textrm{dom}}h\) and \(x\in {\textrm{dom}}h\), \(\Vert x^k-x\Vert \rightarrow 0 \Longleftrightarrow D_{h}(x,x^k)\rightarrow 0\). Similarly, \(\Vert y^k-y\Vert \rightarrow 0 \Longleftrightarrow D_{\psi }(y,y^k)\rightarrow 0\) for any sequence \(\{y^k\}_{k\in {\mathbb {N}}}\in int~{\textrm{dom}}\psi \) and \(y\in \textrm{dom}\psi \).

  4. (iv)

    \(H:{\mathbb {R}}^n\times {\mathbb {R}}^m\rightarrow {\mathbb {R}}\cup \{+\infty \}\) is a proper lower semicontinuous biconvex function.

3.1 The ASABP and IASABP algorithm

We now first introduce the alternating structure-adapted Bregman proximal (ASABP) gradient descent algorithm.

(Algorithm 1: The ASABP gradient descent algorithm, whose x- and y-subproblems are (9) and (10).)

Under Assumption 3.1, \(x^{k+1}\) automatically lies in X and \(y^{k+1}\) automatically lies in Y. Hence, the subproblems (9) and (10) are essentially unconstrained optimization problems, which in many cases allows them to possess closed-form solutions.
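For illustration, the following minimal sketch runs the ASABP iteration (scheme (5)) on a purely hypothetical toy instance of (1) in which every subproblem has a closed form: \(F(x)=\frac{1}{2}\Vert x-a\Vert ^2\), \(G(y)=\frac{1}{2}\Vert y-b\Vert ^2\), \(H(x,y)=\frac{\mu }{2}\Vert x-y\Vert ^2\) and \(h=\psi =\frac{1}{2}\Vert \cdot \Vert ^2\), so that \(L_h=L_\psi =1\). The sketch only shows the alternating structure; it is not the non-Lipschitz setting targeted by the convergence theory.

```python
import numpy as np

# hypothetical toy instance of (1) with closed-form subproblems:
# F(x) = 0.5*||x - a||^2, G(y) = 0.5*||y - b||^2, H(x, y) = 0.5*mu*||x - y||^2,
# X = Y = R^n, h = psi = 0.5*||.||^2, hence L_h = L_psi = 1
n, mu = 5, 2.0
rng = np.random.default_rng(3)
a, b = rng.standard_normal(n), rng.standard_normal(n)
x, y = np.zeros(n), np.zeros(n)
tau, sigma = 0.5, 0.5          # stepsizes in (0, 1/L_h) and (0, 1/L_psi)

for k in range(200):
    # x-step of (5): argmin_x H(x, y^k) + <grad F(x^k), x - x^k> + D_h(x, x^k)/tau
    x = (mu * y + a - x + x / tau) / (mu + 1.0 / tau)
    # y-step of (5): argmin_y H(x^{k+1}, y) + <grad G(y^k), y - y^k> + D_psi(y, y^k)/sigma
    y = (mu * x + b - y + y / sigma) / (mu + 1.0 / sigma)
# (x, y) now approximates the minimizer of F(x) + G(y) + H(x, y)
```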

Note that the efficiency of ASABP depends on the vectors \(x^k\) and \(y^k\) and the parameters \(\tau _k\) and \(\sigma _k\). For first-order algorithms, it has been found that extrapolation can accelerate convergence and that an adaptive line search scheme can alleviate the difficulty of choosing the parameters. Hence, we give the inertial variant of the ASABP algorithm, i.e., the inertial alternating structure-adapted Bregman proximal gradient descent algorithm, which we denote by IASABP for short.

(Algorithm 2: The IASABP gradient descent algorithm, whose extrapolation, line search and update steps are (11)-(14).)

Remark 3.1

  1. (i)

    Notice that (11) and (13) hold trivially when taking \(\alpha _k^0 = 0\) and \(\beta _k^0=0\) for all \(k\ge 0\). In this case, the inertial strategy makes no difference, i.e., \({\bar{x}}^k=x^k,~{\bar{y}}^k=y^k\), and the IASABP algorithm reduces to the ASABP algorithm.

  2. (ii)

    (Finite termination of line search) There exists \(J_{k}\in {\mathbb {N}}\) in step (11) such that for any \(j\ge J_{k}\), \(\alpha _k=\eta ^j\alpha ^0_k\) satisfies

    $$\begin{aligned} D_h(x^k,x^{k}+\alpha _k(x^{k}-x^{k-1}))\le \theta C_k^1 D_h(x^k,x^{k-1}). \end{aligned}$$

    Similarly, \(\beta _k\) can be obtained after finitely many executions of step (13). A generic sketch of this backtracking procedure is given after this remark.
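The following minimal sketch illustrates the backtracking step of Remark 3.1 (ii); here the Bregman distance \(D_h\), the shrinking factor \(\eta \in (0,1)\), the constants \(\theta \) and \(C_k^1\), and the initial trial value \(\alpha _k^0\) are supplied by Algorithm 2 and are treated as given inputs (the concrete values in the usage line are hypothetical).

```python
import numpy as np

def backtrack_inertia(D_h, x_k, x_km1, alpha0, eta, theta, C1):
    """Shrink the trial inertial parameter alpha = eta^j * alpha0 until
    D_h(x^k, x^k + alpha*(x^k - x^{k-1})) <= theta * C1 * D_h(x^k, x^{k-1});
    finite termination is guaranteed by Remark 3.1 (ii)."""
    alpha = alpha0
    bound = theta * C1 * D_h(x_k, x_km1)
    while D_h(x_k, x_k + alpha * (x_k - x_km1)) > bound:
        alpha *= eta                             # eta in (0, 1): geometric backtracking
    return alpha, x_k + alpha * (x_k - x_km1)    # alpha_k and the extrapolated point

# usage with the Euclidean kernel and hypothetical data
D_euc = lambda u, v: 0.5 * np.sum((u - v) ** 2)
alpha_k, x_bar = backtrack_inertia(D_euc, np.array([1.0, 2.0]), np.array([0.0, 0.0]),
                                   alpha0=1.0, eta=0.5, theta=0.4, C1=1.0)
```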

We now verify that the ASABP and IASABP algorithms are well defined. To this end, choose an arbitrary \(u\in int~{\textrm{dom}}h\) and a stepsize \(0<\tau < \frac{1}{L_h}\). For fixed \(y\in int~{\textrm{dom}}\psi \), we define the Bregman proximal mapping,

$$\begin{aligned} \begin{aligned} {\mathcal {B}}_{\tau }(u; y)&=\arg \min ~\left\{ H(x,y)+\langle \nabla F(u),x-u \rangle +\frac{1}{\tau }D_{h}(x,u):x\in {\overline{X}}\right\} \\&=\arg \min ~\left\{ H(x,y)+\langle \nabla F(u),x-u \rangle +\frac{1}{\tau }D_{h}(x,u):x\in {\mathbb {R}}^n\right\} , \end{aligned} \end{aligned}$$

where the second equality follows from the fact that \(\textrm{dom}h\subset {\overline{X}}\). Observe that \(x^{k+1}\in {\mathcal {B}}_{\tau _k}(u^k; y^k)\), where \(u^k=x^k\) in the ASABP algorithm and \(u^k={\bar{x}}^k\) in the IASABP algorithm. Hence, by Proposition 3.2, (9) and (12) are well defined, and so are (10) and (14).

Proposition 3.2

Suppose that Assumption 3.1 holds. Let \(y\in int~\textrm{dom}\psi \), \(u\in int~{\textrm{dom}}h\) and \(0<\tau < \frac{1}{L_h}\). Then the set \({\mathcal {B}}_{\tau }(u; y)\) is a nonempty compact set consisting of a single point; that is, \({\mathcal {B}}_{\tau }(\cdot ; y)\) is a single-valued mapping from \(\hbox {int}~{\textrm{dom}}h\) to \(\hbox {int}~{\textrm{dom}}h\).

Proof

Fix any \(y\in int~{\textrm{dom}}\psi \), \(u\in int~{\textrm{dom}}h\) and \(0<\tau <\frac{1}{L_h}\). For any \(x\in {\mathbb {R}}^n\), we define

$$\begin{aligned} \Phi _{h}(x, u; y)=H(x,y)+F(u)+G(y)+\langle \nabla F(u),x-u \rangle +\frac{1}{\tau }D_{h}(x,u), \end{aligned}$$

so that \({\mathcal {B}}_{\tau }(u; y)=\arg \min _{x\in {\mathbb {R}}^n}\Phi _{h}(x, u; y)\). Since h is strictly convex and H is biconvex by Assumption 3.1, the objective \(\Phi _{h}(\cdot , u; y)\) has at most one minimizer. Taking into account that (F, h) is GL-smad with coefficient \(L_h\) and that \(0<\tau <\frac{1}{L_h}\),

$$\begin{aligned} \begin{aligned} \Phi _{h}(x, u; y)&=\Phi (x, y)-\left( F(x)-F(u)-\langle \nabla F(u),x-u \rangle -\frac{1}{\tau }D_{h}(x,u)\right) \\&\ge \Phi (x, y). \end{aligned} \end{aligned}$$

According to Assumption 3.1 (i), it holds that

$$\begin{aligned} \lim _{\Vert x\Vert \rightarrow +\infty }\Phi _h(x, u; y)\ge \lim _{\Vert x\Vert \rightarrow +\infty }\Phi (x,y)=+\infty . \end{aligned}$$

Since \(\Phi _h\) is also proper and lower semicontinuous, invoking Weierstrass’s theorem, it follows that the set \({\mathcal {B}}_{\tau }(u; y)\) is nonempty and compact. Furthermore, the optimality condition for \({\mathcal {B}}_{\tau }(u; y)\) implies that \(\partial h({\mathcal {B}}_{\tau }(u; y))\) must be nonempty, and Definition 2.6 indicates that \({\mathcal {B}}_{\tau }(u; y)\) belongs to \(\hbox {int}~{\textrm{dom}}h\). \(\square \)

3.2 Rate of convergence

In this subsection, we establish the global sublinear rate of convergence for ASABP and IASABP, measured by Bregman distance. Using the extended descent lemma, we obtain the following key estimation for the objective function \(\Phi \), which will play an important role in deriving our main convergence results.

Lemma 3.3

Suppose that Assumption 3.1 holds. Let \(\left\{ (x^k,y^k)\right\} _{k\in {\mathbb {N}}}\) be the sequence generated by IASABP, then we have

$$\begin{aligned} \begin{aligned}&\Phi (x^{k+1}, y^{k+1})+\frac{\alpha (h)}{\tau _k} D_{h}( x^{k+1},x^k)+\frac{\alpha (\psi )}{\sigma _k} D_{\psi }(y^{k+1},y^k)\\&\quad \le \Phi (x^{k}, y^{k})+\frac{\theta }{\tau _k} D_h(x^k,x^{k-1})+\frac{\theta }{\sigma _k} D_\psi (y^k,y^{k-1}), \end{aligned} \end{aligned}$$

where \(\alpha (h)\) and \(\alpha (\psi )\) are the symmetry coefficients of h and \(\psi \), respectively.

Proof

Applying Lemma 2.11 with \(\Gamma (x)=\tau _k\left( H(x, y^{k})+\langle \nabla F({\bar{x}}^{k}), x-{\bar{x}}^k\rangle \right) \) yields that, for any \(x\in \textrm{dom}h\),

$$\begin{aligned} \begin{aligned}&H(x^{k+1}, y^{k})-H(x, y^{k})\\&\quad \le \langle \nabla F({\bar{x}}^{k}), x-x^{k+1}\rangle +\frac{1}{\tau _k} D_{h}(x, {\bar{x}}^{k})-\frac{1}{\tau _k} D_{h}(x, x^{k+1})-\frac{1}{\tau _k} D_{h}(x^{k+1}, {\bar{x}}^{k})\\&\quad \le F(x)-F(x^{k+1})+L_hD_h(x,{\bar{x}}^k)+D_F(x^{k+1},{\bar{x}}^k)\\&\qquad +\frac{1}{\tau _k} D_{h}(x, {\bar{x}}^{k})-\frac{1}{\tau _k} D_{h}(x, x^{k+1})- \frac{1}{\tau _k}D_{h}(x^{k+1}, {\bar{x}}^{k})\\&\quad \le F(x)-F(x^{k+1})+\left( \frac{1}{\tau _k}+L_h\right) D_{h}(x, {\bar{x}}^{k})\\&\quad \quad -\frac{1}{\tau _k} D_{h}(x, x^{k+1})-\left( \frac{1}{\tau _k}-L_h\right) D_{h}(x^{k+1}, {\bar{x}}^{k}), \end{aligned} \end{aligned}$$

where the second inequality follows from Lemma 2.13 and the last inequality follows from the GL-smad property of (Fh). Particularly, taking \(x=x^k\), together with \(0<\tau _k<\frac{1}{L_h}\) and the nonnegativity of Bregman distance \(D_h\), we obtain

$$\begin{aligned} \begin{aligned}&H(x^{k+1}, y^{k})+F(x^{k+1})-\left( H(x^k, y^{k})+F(x^k)\right) \\&\quad \le \left( \frac{1}{\tau _k}+L_h\right) D_{h}(x^k, {\bar{x}}^{k})-\frac{1}{\tau _k} D_{h}(x^k, x^{k+1}). \end{aligned} \end{aligned}$$

If \({\bar{x}}^k=x^k\), then \(D_h(x^k,{\bar{x}}^k)=0\). Hence,

$$\begin{aligned} H(x^{k+1}, y^{k})+F(x^{k+1})+\frac{1}{\tau _k} D_{h}(x^k, x^{k+1})\le H(x^k, y^{k})+F(x^k). \end{aligned}$$

If \({\bar{x}}^k\ne x^k\), according to (11), we have \(D_{h}(x^k, {\bar{x}}^{k})\le \theta C_k^1 D_h(x^k,x^{k-1})\). Recalling the definition of \(C_k^1\), we have

$$\begin{aligned} \begin{aligned}&H(x^{k+1}, y^{k})+F(x^{k+1})+\frac{1}{\tau _k} D_{h}(x^k, x^{k+1})\\&\quad \le H(x^k, y^{k})+F(x^k)+\left( \frac{1}{\tau _k}+L_h\right) \theta C_k^1 D_h(x^k,x^{k-1})\\&\quad \le H(x^k, y^{k})+F(x^k)+\frac{\theta }{\tau _k} D_h(x^k,x^{k-1}). \end{aligned} \end{aligned}$$

Consequently, for any case we conclude that for any \(k\in {\mathbb {N}}\),

$$\begin{aligned}{} & {} H(x^{k+1}, y^{k})+F(x^{k+1})+\frac{1}{\tau _k} D_{h}(x^k, x^{k+1})\nonumber \\{} & {} \quad \le H(x^k, y^{k})+F(x^k)+\frac{\theta }{\tau _k} D_h(x^k,x^{k-1}). \end{aligned}$$
(15)

We can obtain the following inequality in a similar way,

$$\begin{aligned}{} & {} H(x^{k+1}, y^{k+1})+G(y^{k+1})+\frac{1}{\sigma _k} D_{\psi }(y^k, y^{k+1})\nonumber \\{} & {} \quad \le H(x^{k+1}, y^{k})+G(y^k)+\frac{\theta }{\sigma _k}D_\psi (y^k,y^{k-1}). \end{aligned}$$
(16)

Adding (15) and (16), we obtain that for any \(k\in {\mathbb {N}}\),

$$\begin{aligned} \begin{aligned}&\Phi (x^{k+1}, y^{k+1})+\frac{1}{\tau _k} D_{h}(x^k, x^{k+1})+\frac{1}{\sigma _k} D_{\psi }(y^k, y^{k+1})\\&\quad \le \Phi (x^{k}, y^{k})+\frac{\theta }{\tau _k} D_h(x^k,x^{k-1})+\frac{\theta }{\sigma _k} D_\psi (y^k,y^{k-1}). \end{aligned} \end{aligned}$$

Combining with Remark 2.10 (ii) completes the proof. \(\square \)

Equipped with Lemma 3.3, we introduce the following useful notations,

$$\begin{aligned} u^k_{{\mathfrak {I}}}:=(x^k, y^k, x^{k-1}, y^{k-1}),~~~{\underline{\tau }}\le \inf \{\tau _k\} \le \sup \{\tau _k\} \le {\bar{\tau }},~~~{\underline{\sigma }}\le \inf \{\sigma _k\} \le \sup \{\sigma _k\} \le {\bar{\sigma }},~~~\forall k\in {\mathbb {N}}. \end{aligned}$$

And we define an auxiliary sequence by

$$\begin{aligned} J(u^k_{{\mathfrak {I}}})=\Phi (x^k,y^k)+M_1D_h(x^k,x^{k-1})+M_2D_\psi (y^k,y^{k-1}), \end{aligned}$$

where \(M_1=\frac{1}{2}\left( \frac{\theta }{{\underline{\tau }}}+\frac{\alpha (h)}{{\bar{\tau }}}\right) \) and \(M_2=\frac{1}{2}\left( \frac{\theta }{{\underline{\sigma }}}+\frac{\alpha (\psi )}{{\bar{\sigma }}}\right) \). Choosing \(\theta < \min \left\{ \frac{{\underline{\tau }}\alpha (h)}{{\bar{\tau }}},\frac{{\underline{\sigma }}\alpha (\psi )}{{\bar{\sigma }}}\right\} \), we have

$$\begin{aligned} \delta _1\!=\!\min \left\{ M_1\!-\!\frac{\theta }{{\underline{\tau }}},~\frac{\alpha (h)}{{\bar{\tau }}}-M_1\right\}>0 \;\;\!\hbox {and}\;\;\! \delta _2\!=\!\min \left\{ M_2\!-\!\frac{\theta }{{\underline{\sigma }}},~\frac{\alpha (\psi )}{{\bar{\sigma }}}-M_2\right\} >0.\nonumber \\ \end{aligned}$$
(17)

Lemma 3.4

Suppose that Assumption 3.1 holds. Let \(\left\{ z^k_{\mathfrak {I}}= (x^k,y^k)\right\} _{k\in \mathbb {N}}, \left\{ \bar{z}^k_{\mathfrak {I}}=(\bar{x}^k,\bar{y}^k) \right\} _{k\in \mathbb {N}}\) be the sequences generated by IASABP.

  1. (i)

    The sequence \(\{J(u_{\small {{\mathfrak {I}}}}^k)\}_{k\in {\mathbb {N}}}\) is monotonically nonincreasing and in particular for all \(k\ge 0\),

    $$\begin{aligned} \begin{aligned}&\delta _{{\mathfrak {I}}}\left( D_h(x^k,x^{k-1})+D_\psi (y^k,y^{k-1})+D_{h}( x^{k+1},x^k)+D_{\psi }( y^{k+1},y^k)\right) \\&\quad \le J(u_{\small {{\mathfrak {I}}}}^k)-J(u_{\small {{\mathfrak {I}}}}^{k+1}), \end{aligned} \end{aligned}$$
    (18)

    where \(\delta _{{\mathfrak {I}}}=\min \{\delta _1,~\delta _2\}\) and \(\delta _1, \delta _2\) are defined as (17). Moreover, \(\left\{ J(u_{\small {{\mathfrak {I}}}}^k)\right\} _{k\in {\mathbb {N}}}\) converges to some finite value, denoted by \(J^*\).

  2. (ii)

    We have

    $$\begin{aligned} \sum _{k=0}^{+\infty }\left( D_{h}( x^{k+1},x^k)+D_{\psi }( y^{k+1},y^k)\right) <+\infty , \end{aligned}$$

    and hence \(\displaystyle \lim _{k\rightarrow +\infty }D_{h}( x^{k+1},x^k)=0\), \(\displaystyle \lim _{k\rightarrow +\infty }D_{\psi }( y^{k+1},y^k)=0\). Moreover, \(\displaystyle \lim _{k\rightarrow +\infty }\Vert z^{k+1}_{{\mathfrak {I}}}-z^k_{{\mathfrak {I}}}\Vert =0\) and \(\displaystyle \lim _{k\rightarrow +\infty }\Vert z^{k}_{{\mathfrak {I}}}-{\bar{z}}^k_{{\mathfrak {I}}}\Vert =0\).

  3. (iii)

    (Convergence rate) For any \(K\ge 0\), it holds that

    $$\begin{aligned} \min _{0\le k\le K}\left\{ D_{h}( x^{k+1},x^k)+D_{\psi }( y^{k+1},y^k)\right\} \le \frac{1}{\delta _{{\mathfrak {I}}}(K+1)}\left( J(u_{\small {{\mathfrak {I}}}}^0)-J^*\right) . \end{aligned}$$

Proof

(i) Choose \(\theta < \min \left\{ \frac{{\underline{\tau }}\alpha (h)}{{\bar{\tau }}},\frac{{\underline{\sigma }}\alpha (\psi )}{{\bar{\sigma }}}\right\} \) and \(\delta _1,~\delta _2\) as in (17); then (18) follows immediately from Lemma 3.3. Meanwhile, since \(\Phi \) is assumed to be bounded below, J is also bounded below. Hence \(\left\{ J(u_{\small {{\mathfrak {I}}}}^k)\right\} _{k\in {\mathbb {N}}}\) converges to some real number \(J^*\).

(ii) Using the inequality (18), we have for all \(k\ge 0\),

$$\begin{aligned} D_{h}( x^{k+1},x^k)+D_{\psi }( y^{k+1},y^k)\le \frac{1}{\delta _{{\mathfrak {I}}}}\left( J(u_{\small {{\mathfrak {I}}}}^k)-J(u_{\small {{\mathfrak {I}}}}^{k+1})\right) . \end{aligned}$$

For \(K\ge 0\), summing from \(k=0\) to K and using the statement (i) yields

$$\begin{aligned} \begin{aligned} \sum _{k=0}^{K}\left( D_{h}( x^{k+1},x^k)+D_{\psi }( y^{k+1},y^k)\right)&\le \frac{1}{\delta _{{\mathfrak {I}}}}\left( J(u_{\small {{\mathfrak {I}}}}^0)-J(u_{\small {{\mathfrak {I}}}}^{K+1})\right) \\&\le \frac{1}{\delta _{{\mathfrak {I}}}}\left( J(u_{\small {{\mathfrak {I}}}}^0)-J^*\right) <+\infty . \end{aligned} \end{aligned}$$
(19)

Taking the limit as \(K\rightarrow +\infty \) leads to

$$\begin{aligned} \begin{aligned} \sum _{k=0}^{+\infty }\left( D_{h}( x^{k+1},x^k)+D_{\psi }( y^{k+1},y^k)\right) <+\infty , \end{aligned} \end{aligned}$$

which implies that \(\lim _{k\rightarrow +\infty }\Vert z^{k}_{{\mathfrak {I}}}-z^{k+1}_{{\mathfrak {I}}}\Vert =0\). Moreover, from the IASABP algorithm, we have \(\displaystyle \lim _{k\rightarrow +\infty }\Vert z^{k}_{{\mathfrak {I}}}-{\bar{z}}^k_{{\mathfrak {I}}}\Vert =0\).

(iii) Using (19) yields

$$\begin{aligned} \begin{aligned} \sum _{k=0}^{K}\left( D_{h}( x^{k+1},x^k)+D_{\psi }( y^{k+1},y^k)\right) \le \frac{1}{\delta _{{\mathfrak {I}}}}\left( J(u_{\small {{\mathfrak {I}}}}^0)-J^*\right) . \end{aligned} \end{aligned}$$

Hence, we obtain

$$\begin{aligned} \min _{0\le k\le K}\left\{ D_{h}( x^{k+1},x^k)+D_{\psi }( y^{k+1},y^k)\right\} \le \frac{1}{\delta _{{\mathfrak {I}}}(K+1)}\left( J(u_{\small {{\mathfrak {I}}}}^0)-J^*\right) . \end{aligned}$$

\(\square \)

Similar to Lemma 3.3 and Lemma 3.4, we have the following sublinear rate of convergence for ASABP.

Lemma 3.5

Suppose that Assumption 3.1 holds. Let \(\left\{ (x^k,y^k)\right\} _{k\in {\mathbb {N}}}\) be the sequence generated by ASABP. Then

  1. (i)

    The sequence \(\left\{ \Phi (x^k,y^k)\right\} _{k\in {\mathbb {N}}}\) is monotonically nonincreasing and in particular for all \(k\ge 0\),

    $$\begin{aligned} \begin{aligned} \Phi (x^{k+1},y^{k+1})&\le \Phi (x^{k},y^{k})-\frac{1}{\tau _k} D_{h}(x^k,x^{k+1})-\frac{1}{\sigma _k}D_{\psi } (y^k,y^{k+1})\\&\le \Phi (x^{k},y^{k} )-\frac{1}{{\bar{\tau }}} D_{h}(x^k,x^{k+1})-\frac{1}{{\bar{\sigma }}} D_{\psi }(y^k,y^{k+1})\\&\le \Phi (x^{k},y^{k})-\delta \left( D_{h}( x^k,x^{k+1})+D_{\psi }(y^k,y^{k+1})\right) , \end{aligned} \end{aligned}$$
    (20)

    where \(\delta =\min \{\frac{1}{{\bar{\tau }}},\frac{1}{{\bar{\sigma }}}\}\). And \(\left\{ \Phi (x^k,y^k)\right\} _{k\in {\mathbb {N}}}\) converges to some finite value, denoted by \(\Phi ^*\).

  2. (ii)

    (Convergence rate) For any \(K\ge 0\), it holds that

    $$\begin{aligned} \min _{0\le k\le K}\left\{ D_{h}( x^k, x^{k+1})+D_{\psi }( y^k, y^{k+1})\right\} \le \frac{1}{\delta (K+1)}\left( \Phi (x^0,y^0)-\Phi ^*\right) . \end{aligned}$$

3.3 Global convergence of ASABP

The analysis in Sect. 3.2 only establishes the monotone nonincreasing property of the potential functions of the ASABP and IASABP algorithms. In this subsection, without requiring Lipschitz continuity of the gradients of F and G, we aim to prove that the sequence generated by the ASABP algorithm is a Cauchy sequence, and hence converges to a critical point of model (1). It is worth noticing that, for all the NoLips algorithms for the nonconvex case [5, 12, 18, 25], the global convergence results were limited to the case where \(X= {\mathbb {R}}^n,~Y={\mathbb {R}}^m\) and the Lipschitz smoothness condition holds.

We start our analysis by specializing the assumptions on the coupling term H.

Assumption 3.2

  1. (i)

    The subdifferential of H obeys

    $$\begin{aligned} \partial _xH(x,y)\times \partial _yH(x,y)\subset \partial H(x,y),~~~\forall (x,y)\in {\textrm{dom}}H. \end{aligned}$$
  2. (ii)

    The domain of H is closed.

    Moreover, H satisfies one of the following conditions.

  3. (iii)

    H is continuous on its domain.

  4. (iv)

    \(H:~{\mathbb {R}}^n\times {\mathbb {R}}^m\rightarrow {\mathbb {R}}\cup \{+\infty \}\) has the form

    $$\begin{aligned} H(x,y)=\phi (x,y)+w(x), \end{aligned}$$

    where \(w:~{\mathbb {R}}^n\rightarrow {\mathbb {R}}\cup \{+\infty \}\) is continuously differentiable on its domain; \(\phi :~{\mathbb {R}}^n\times {\mathbb {R}}^m\rightarrow {\mathbb {R}}\cup \{+\infty \}\) is continuous on \({\textrm{dom}}H\), such that \(\phi (\cdot ,y)\) is differentiable on \({\mathbb {R}}^n\) with gradient \(\nabla _x\phi \) for any \( y\in {\mathcal {D}}_y=\{y\in {\mathbb {R}}^m|(x,y)\in \textrm{dom}H\}\ne \emptyset \). For each bounded subset \({\mathcal {B}}_x\times {\mathcal {B}}_y\subset {\textrm{dom}}H\), there exists \(\xi >0\) such that, for any \({\bar{x}}\in {\mathcal {B}}_x\) and any \((y,{\bar{y}})\in {\mathcal {B}}_y^2\),

    $$\begin{aligned} \Vert \nabla _x\phi ({\bar{x}},y)-\nabla _x\phi ({\bar{x}},{\bar{y}})\Vert \le \xi \Vert y-{\bar{y}}\Vert . \end{aligned}$$

Remark 3.6

We give some remarks on Assumption 3.2.

  1. (i)

    Since F and G are continuously differentiable on \({\overline{X}}\) and \({\overline{Y}}\) respectively, Assumption 3.2 (i) is equivalent to having \(\partial _x\Phi (x,y)\times \partial _y\Phi (x,y)\subset \partial \Phi (x,y)\) for any \((x,y)\in {\overline{X}}\times {\overline{Y}}\cap \textrm{dom}H\), which is nonempty and closed.

  2. (ii)

    Assumption 3.2 (iii) and (iv) aim at guaranteeing that the cluster points of the sequence generated by the algorithm are critical points. However, they are sufficient but not necessary, and the results presented in this paper can be established for a broader class of functions H.

  3. (iii)

    Note that for Assumption 3.2 (iv), the regularity assumptions on H are not symmetric in x and y. In particular, \(\phi (x,\cdot )\) does not need to be differentiable.

Theorem 3.7

Suppose that Assumptions 3.1 and 3.2 hold. If the sequence \(\left\{ z^k=(x^k,y^k)\right\} _{k\in {\mathbb {N}}}\) generated by ASABP converges to \({\hat{z}}=({\hat{x}},{\hat{y}})\), then \({\hat{z}}\) is a critical point of problem (1) with the abstract sets \({\overline{X}},~{\overline{Y}}\), i.e.,

$$\begin{aligned} \exists ~{\hat{H}}\in \partial H({\hat{z}}),~~~~0\in {\hat{H}}+\nabla F({\hat{x}})\times \nabla G({\hat{y}})+N_{{\overline{X}}\times {\overline{Y}}}({\hat{z}}), \end{aligned}$$

where \(N_{{\overline{X}}\times {\overline{Y}}}({\hat{z}})\) denotes the normal cone of \({\overline{X}}\times {\overline{Y}}\) at \({\hat{z}}\).

Proof

The first order optimality condition of the subproblem (9) shows that there exists \({\hat{H}}_{x}^{k+1}\in \partial _xH(x^{k+1},y^k)\) such that

$$\begin{aligned} \left\langle x-x^{k+1}, {\hat{H}}_{x}^{k+1}+ \nabla F(x^{k})+\frac{1}{\tau _k}\left( \nabla h(x^{k+1})-\nabla h(x^k) \right) \right\rangle \ge 0,~~~~~~\forall x\in {\overline{X}}, \end{aligned}$$
(21)

or equivalently,

$$\begin{aligned} 0\in \partial _xH(x^{k+1},y^k)+ \nabla F(x^{k})+\frac{1}{\tau _k}\left( \nabla h(x^{k+1})-\nabla h(x^k) \right) +N_{{\overline{X}}}(x^{k+1}). \end{aligned}$$

We first investigate the behavior of the following term,

$$\begin{aligned} \langle x-x^{k+1}, \nabla h(x^{k+1})-\nabla h(x^k) \rangle . \end{aligned}$$

Fix some arbitrary \(x\in {\overline{X}}\). We consider the following two possible cases.

The first case is that the sequence \(\{D_{h}(x,x^k)\}\) converges. In this case, the three point identity (see Remark 2.8) yields

$$\begin{aligned} \begin{aligned}&\lim _{k\rightarrow +\infty }\langle \nabla h(x^{k+1})-\nabla h(x^{k}),x-x^{k+1}\rangle \\&\quad =\lim _{k\rightarrow +\infty } \left( D_{h}(x,x^{k})-D_{h}(x,x^{k+1})\right) -\lim _{k\rightarrow +\infty }D_{h}(x^{k+1},x^k)\\&\quad =0. \end{aligned} \end{aligned}$$

Combining with the facts that H is biconvex and F is continuously differentiable on \({\overline{X}}\), passing to the limit in the optimality condition (21), we obtain

$$\begin{aligned} \left\langle x-{\hat{x}}, \partial _{x}H({\hat{z}})+ \nabla F({\hat{x}}) \right\rangle \ge 0,~~~~~~\forall x\in {\overline{X}}, \end{aligned}$$

i.e.,

$$\begin{aligned} 0\in \partial _xH({\hat{z}})+ \nabla F({\hat{x}})+N_{{\overline{X}}}({\hat{x}}). \end{aligned}$$
(22)

In particular, if \(x^k\rightarrow {\hat{x}}\in X\), one has that \(\lim _{k\rightarrow +\infty }\Vert \nabla h(x^{k+1})-\nabla h(x^k)\Vert =0\) because of the continuity of \(\nabla h\) on X. The problem (1) in this case is equivalent to an unconstrained problem and (22) holds naturally. Certainly, the sequence \(\{D_{h}(x,x^k)\}\) must be convergent for this situation.

The second case is that the sequence \(\{D_{h}(x,x^k)\}\) fails to converge. For this case, since \(\{D_{h}(x,x^k)\}\) is nonnegative, it cannot be monotonically decreasing. Thus, there are infinitely many \(k\in {\mathbb {N}}\) such that \(D_h(x,x^{k+1})\ge D_{h}(x,x^k)\). Let \(\{k_{l}\}\subset {\mathbb {N}}\) be a subsequence with \(D_h(x,x^{k_{l}+1})\ge D_{h}(x,x^{k_{l}})\) for arbitrary l. Then for all \(x\in {\overline{X}}\),

$$\begin{aligned} \begin{aligned} 0&\ge \limsup _{l\rightarrow +\infty } \left( D_{h}(x,x^{k_{l}})-D_{h}(x,x^{k_{l}+1})\right) \\&= \limsup _{l\rightarrow +\infty }\left\langle \nabla h(x^{k_{l}+1})-\nabla h(x^{k_{l}}),x-x^{k_{l}+1}\right\rangle +\lim _{l\rightarrow +\infty }D_{h}(x^{k_{l}+1},x^{k_{l}})\\&= \limsup _{l\rightarrow +\infty }\left\langle \nabla h(x^{k_{l}+1})-\nabla h(x^{k_{l}}),x-x^{k_{l}+1}\right\rangle . \end{aligned} \end{aligned}$$

This means that

$$\begin{aligned} \nabla h(x^{k_{l}+1})- \nabla h(x^{k_{l}})\in N_{{\overline{X}}}(x^{k_{l}+1}). \end{aligned}$$

Hence,

$$\begin{aligned} \left( \nabla h(x^{k_{l}+1})- \nabla h(x^{k_{l}})\right) +N_{{\overline{X}}}(x^{k_{l}+1})\subset N_{{\overline{X}}}(x^{k_{l}+1}). \end{aligned}$$

Since the corresponding subsequence also converges to \({\hat{x}}\), passing to the limit \(l\rightarrow +\infty \) in the optimality condition, \(0\in \partial _xH({\hat{z}})+ \nabla F({\hat{x}})+N_{{\overline{X}}}({\hat{x}})\) holds. In particular, if \(x^k\rightarrow {\hat{x}}\in \partial X\) (the boundary of X), then for any \(x\in X\) the sequence \(\{D_{h}(x,x^k)\}\) fails to converge, by Remark 2.8 (iii); this provides a sufficient condition for this case.

Similarly, we have

$$\begin{aligned} 0\in \partial _yH({\hat{z}})+ \nabla G({\hat{y}})+N_{{\overline{Y}}}({\hat{y}}). \end{aligned}$$

Combining with Assumption 3.2, we complete the proof. \(\square \)

Denote the set of cluster points of the sequence \(\left\{ z^k=(x^k,y^k)\right\} _{k\in {\mathbb {N}}}\) generated by ASABP by \(\Omega ^*\), i.e.,

$$\begin{aligned} \begin{aligned} \Omega ^*=&\{z^*=(x^*,y^*): \exists ~\text {an~increasing~sequence~of~ integer}~ \{k_l\}_{l\in {\mathbb {N}}}, \\&\text {such~that}~ z^{k_l}\rightarrow z^*~ \text {as}~ l\rightarrow +\infty \}. \end{aligned} \end{aligned}$$

Lemma 3.8

Suppose that Assumptions 3.1 and 3.2 hold. Let \(\left\{ z^k=(x^k,y^k)\right\} _{k\in {\mathbb {N}}}\) be the sequence generated by ASABP which is assumed to be bounded. Then the following assertions hold.

  1. (i)

    \(\Omega ^*\) is a nonempty compact set, and

    $$\begin{aligned} \lim _{k\rightarrow +\infty }{\textrm{dist}}(z^k,\Omega ^*)=0. \end{aligned}$$
    (23)
  2. (ii)

    \(\lim _{k\rightarrow +\infty }\Phi (z^k)=\Phi (z^*)\) and \(\Phi \) is constant on \(\Omega ^*\).

Proof

(i) \(\Omega ^*\) is nonempty because the sequence \(\left\{ z^k\right\} _{k\in {\mathbb {N}}}\) generated by ASABP is bounded. By definition, it is trivial that the set \(\Omega ^*\) is compact. We now prove (23) by contradiction. Suppose that \(\{{\textrm{dist}}(z^k,\Omega ^*)\}_{k\in {\mathbb {N}}}\) does not converge to zero. Then there exist \(C>0\) and an increasing sequence \(\{k_j\}_{j\in {\mathbb {N}}}\) such that, for any \(j\in {\mathbb {N}}\), \({\textrm{dist}}(z^{k_j},\Omega ^*)>C\). However, since \(\left\{ z^{k_j}\right\} _{j\in \mathbb {N}}\) is a subsequence of the bounded sequence \(\left\{ z^{k}\right\} _{k\in {\mathbb {N}}}\), it has a convergent subsequence \(\left\{ z^{k_{j_n}} \right\} _{n\in \mathbb {N}}\) with limit \(z^*\in \Omega ^*\). Then,

$$\begin{aligned} C<{\textrm{dist}}(z^{k_{j_n}},\Omega ^*)\le \Vert z^{k_{j_n}}-z^*\Vert \rightarrow 0, \end{aligned}$$

which leads to a contradiction.

(ii) Taking an arbitrary point \(z^*\in \Omega ^*\), then there exists a subsequence \(\left\{ z^{k_j}\right\} _{j\in {\mathbb {N}}}\) of \(\left\{ z^{k}\right\} _{k\in {\mathbb {N}}}\) converging to \(z^*\). Notice that \((x^{k_j},y^{k_j})\in {\overline{X}}\times {\overline{Y}}\cap {\textrm{dom}}H\), which is assumed to be closed. Hence, \((x^*,y^*)\in {\overline{X}}\times {\overline{Y}}\cap {\textrm{dom}}H\), and by continuity one has

$$\begin{aligned} \lim _{j\rightarrow +\infty }\Phi (z^{k_j})=\Phi (z^*). \end{aligned}$$

Since \(\left\{ \Phi (z^{k})\right\} _{k\in {\mathbb {N}}}\) is a convergent sequence by Lemma 3.5, this proves assertion (ii). \(\square \)

To establish the convergence of the sequence generated by ASABP, we need some extra assumptions on the Bregman functions h and \(\psi \), which are also required in [5, 12, 18, 25].

Assumption 3.3

  1. (i)

    \(h,~\psi \) are strongly convex functions with coefficients \(\mu _h,~\mu _{\psi }\) respectively.

  2. (ii)

    \(\nabla h,~\nabla \psi \) are Lipschitz continuous with coefficients \(L_1,~L_2\) respectively on any bounded subsets.

Remark 3.9

  1. (i)

    It is worth noticing that Assumption 3.3 (i) is not actually stringent. For example, Burg's entropy \(h(x)=-\log x\) with \({\textrm{dom}}h=(0,+\infty )\) is a generalized Bregman function which is not strongly convex, whereas the regularized Burg's entropy \({\hat{h}}(x)=-\mu \log x+\frac{\sigma }{2}x^2\) with \(\sigma , \mu >0\) is strongly convex with coefficient \(\sigma \) on \(\textrm{dom}{\hat{h}}=(0,+\infty )\); its Bregman distance is displayed after this remark.

  2. (ii)

    Combining linear additivity of Bregman distance (see Remark 2.8), Assumption 3.1 (ii) and Assumption 3.3 (ii), \(\nabla F\) is Lipschitz continuous with coefficient \(L_{h}L_{1}\) on any bounded subset in \({\overline{X}}\). Similarly, \(\nabla G\) is Lipschitz continuous with coefficient \(L_{\psi }L_2\) on any bounded subset in \({\overline{Y}}\).

Proposition 3.10

Suppose that Assumptions 3.1,  3.2 with (iv) and Assumption 3.3 hold. Let \(\left\{ z^k=(x^k,y^k)\right\} _{k\in {\mathbb {N}}}\) be the sequence generated by ASABP which is assumed to be bounded. For all \(k\ge 0\), define

$$\begin{aligned} A^{k+1}=(A_x^{k+1},A_y^{k+1}), \end{aligned}$$

where

$$\begin{aligned} \left\{ \begin{aligned} A_x^{k+1}=&\nabla F(x^{k+1})-\nabla F(x^{k})-\frac{1}{\tau _k}\left( \nabla h(x^{k+1})-\nabla h(x^{k})\right) +\nabla _x \phi (x^{k+1},y^{k+1})-\nabla _x \phi (x^{k+1},y^{k})\\ \in&\nabla F(x^{k+1})-\nabla F(x^{k})-\frac{1}{\tau _k}\left( \nabla h(x^{k+1})-\nabla h(x^{k})\right) +\partial _x H(x^{k+1},y^{k+1})-\partial _x H(x^{k+1},y^{k}),\\ A_y^{k+1}=&\nabla G(y^{k+1})-\nabla G(y^{k})-\frac{1}{\sigma _k}\left( \nabla \psi (y^{k+1})-\nabla \psi (y^{k})\right) . \end{aligned} \right. \end{aligned}$$

Then \(A^{k+1}\in \partial \Phi (z^{k+1})\) and there exists \({\hat{\delta }}>0\) such that

$$\begin{aligned} \Vert A^{k+1}\Vert \le {\hat{\delta }}\Vert z^k-z^{k+1}\Vert , \end{aligned}$$

where \({\hat{\delta }}=\sqrt{2}\max \left\{ L_{h}L_1+\frac{L_1}{{\underline{\tau }}},~L_{\psi }L_2+\frac{L_2}{{\underline{\sigma }}}+\xi \right\} \).

Proof

The first order optimality conditions of (9) and (10) yield

$$\begin{aligned} 0\in \partial _xH(x^{k+1},y^k)+ \nabla F(x^{k})+\frac{1}{\tau _k}\left( \nabla h(x^{k+1})-\nabla h(x^k)\right) , \end{aligned}$$

and

$$\begin{aligned} 0\in \partial _yH(x^{k+1},y^{k+1})+ \nabla G(y^{k})+\frac{1}{\sigma _k}\left( \nabla \psi (y^{k+1})-\nabla \psi (y^k)\right) , \end{aligned}$$

respectively. Following from these relations and Assumption 3.2, it is clear that \(A^{k+1}\in \partial \Phi (z^{k+1})\). Combining with Assumption 3.3, we get that

$$\begin{aligned} \begin{aligned} \Vert A_x^{k+1}\Vert \le&\left\| \nabla F(x^{k+1})-\nabla F(x^{k})-\frac{1}{\tau _k}\left( \nabla h(x^{k+1})-\nabla h(x^{k})\right) \right\| +\xi \Vert y^{k+1}-y^k\Vert \\ \le&\left( L_{h}L_1+\frac{L_1}{{\underline{\tau }}}\right) \Vert x^{k+1}-x^k\Vert +\xi \Vert y^{k+1}-y^k\Vert . \end{aligned} \end{aligned}$$

Similarly, we can get

$$\begin{aligned} \begin{aligned} \Vert A_y^{k+1}\Vert =&\left\| \nabla G(y^{k+1})-\nabla G(y^{k})-\frac{1}{\sigma _k}\left( \nabla \psi (y^{k+1})-\nabla \psi (y^{k})\right) \right\| \\ \le&\left( L_{\psi }L_2+\frac{L_2}{{\underline{\sigma }}}\right) \Vert y^{k+1}-y^k\Vert . \end{aligned} \end{aligned}$$

Rearranging the terms completes the proof. \(\square \)

We are now ready for proving the main result of this section.

Theorem 3.11

Let the objective function \(\Phi \) be a KL function. Suppose that Assumptions 3.1,  3.2 with (iv) and Assumption 3.3 hold. Let \(\left\{ z^k=(x^k,y^k)\right\} _{k\in {\mathbb {N}}}\) be the sequence generated by ASABP which is assumed to be bounded. Then

$$\begin{aligned} \sum _{k=0}^{+\infty }\Vert z^{k+1}-z^k\Vert <+\infty , \end{aligned}$$

and, as a consequence, the sequence \(\{z^k\}_{k\in {\mathbb {N}}}\) converges to a critical point of model (1) with the abstract sets \({\overline{X}},~{\overline{Y}}\).

Proof

From Lemma 3.8, we get that

$$\begin{aligned} \lim _{k\rightarrow +\infty }\Phi (z^k)=\Phi (z^*). \end{aligned}$$

Then we consider the following two cases.

First, if there exists an integer \(k'\) for which \(\Phi (z^{k'})=\Phi (z^*)\), then by the strong convexity of h and \(\psi \) and (20), we have that for any \(k>k'\),

$$\begin{aligned} \begin{aligned} {\bar{\delta }}\Vert z^{k}-z^{k+1}\Vert ^2&\le \delta \left( D_{h}( x^{k+1},x^k)+D_{\psi }( y^{k+1},y^k)\right) \\&\le \Phi (z^k)-\Phi (z^{k+1}), \end{aligned} \end{aligned}$$
(24)

where \({\bar{\delta }}=\delta \min \left\{ \frac{\mu _h}{2},~\frac{\mu _{\psi }}{2}\right\} \). Combining this with the monotone nonincreasing property of \(\Phi (z^{k})\), one obtains that \(z^{k+1}=z^k\) for any \(k\ge k'\), and the assertion holds.

Then we consider the case that \(\Phi (z^k)>\Phi (z^*)\) for all \(k\ge 0\). Since \(\lim _{k\rightarrow +\infty }\Phi (z^k)=\Phi (z^*)\), it follows that for any \(\eta >0\), there exists a nonnegative integer \(k_0\) such that \(\Phi (z^k)<\Phi (z^*)+\eta \) for all \(k\ge k_0\). From (23), we know that \(\lim _{k\rightarrow +\infty }\textrm{dist}(z^k,\Omega ^*)=0\). This means that for any \(\varepsilon >0\), there exists a positive integer \(k_1\) such that \(\textrm{dist}(z^k,\Omega ^*)<\varepsilon \) for all \(k\ge k_1\). Consequently, for all \(k>l:=\max \{k_0,k_1\}\),

$$\begin{aligned} \textrm{dist}(z^k,\Omega ^*)<\varepsilon ,~\Phi (z^*)<\Phi (z^k)<\Phi (z^*)+\eta . \end{aligned}$$

Applying Lemma 2.16 and Lemma 3.8, we deduce that for any \(k>l\),

$$\begin{aligned} \varphi '\left( \Phi (z^k)-\Phi (z^*)\right) \textrm{dist}(0,\partial \Phi (z^k))\ge 1. \end{aligned}$$

From Proposition 3.10,

$$\begin{aligned} \varphi '\left( \Phi (z^k)-\Phi (z^*)\right) \ge \frac{1}{{\hat{\delta }}\Vert z^k-z^{k-1}\Vert }. \end{aligned}$$
(25)

On the other hand, from the concavity of \(\varphi \), we get that

$$\begin{aligned} \varphi \left( \Phi (z^k)-\Phi (z^*)\right) -\varphi \left( \Phi (z^{k+1})-\Phi (z^*)\right) \nonumber \\ \ge \varphi '\left( \Phi (z^k)-\Phi (z^*)\right) \left( \Phi (z^k)-\Phi (z^{k+1})\right) . \end{aligned}$$
(26)

For convenience, we define

$$\begin{aligned} \triangle _k=\varphi \left( \Phi (z^k)-\Phi (z^*)\right) -\varphi \left( \Phi (z^{k+1})-\Phi (z^*)\right) . \end{aligned}$$

Combining (24) with (25) and (26) yields that, for any \(k>l\),

$$\begin{aligned} \triangle _k\ge \frac{{\bar{\delta }}\left\| z^{k}-z^{k+1}\right\| ^2}{{\hat{\delta }} \Vert z^k-z^{k-1}\Vert }, \end{aligned}$$

or equivalently,

$$\begin{aligned} \Vert z^{k}-z^{k+1}\Vert \le \sqrt{ \frac{{\hat{\delta }}}{{\bar{\delta }}}\triangle _k}\Vert z^k-z^{k-1}\Vert ^{\frac{1}{2}}. \end{aligned}$$

Using the fact that \(2\sqrt{\alpha \beta }\le \alpha +\beta \) for all \(\alpha , \beta > 0\), we have

$$\begin{aligned} 2\Vert z^k-z^{k+1}\Vert \le \Vert z^k-z^{k-1}\Vert +\frac{{\hat{\delta }}}{{\bar{\delta }}}\triangle _k. \end{aligned}$$
(27)

Summing up (27) for \(k=l+1,\cdots ,m\) yields

$$\begin{aligned} \begin{aligned} 2\sum _{k=l+1}^m\Vert z^{k}-z^{k+1}\Vert \le&\sum _{k=l+1}^m\Vert z^{k-1}-z^{k}\Vert + \sum _{k=l+1}^m\frac{{\hat{\delta }}}{{\bar{\delta }}}\triangle _k \\ =&\sum _{k=l+1}^m\Vert z^{k-1}-z^{k}\Vert + \frac{{\hat{\delta }}}{{\bar{\delta }}}\left( \varphi \left( \Phi (z^{l+1})-\Phi (z^*)\right) -\varphi \left( \Phi (z^{m+1})-\Phi (z^*)\right) \right) \\ \le&\sum _{k=l+1}^m\Vert z^{k-1}-z^{k}\Vert + \frac{{\hat{\delta }}}{{\bar{\delta }}}\varphi \left( \Phi (z^{l+1})-\Phi (z^*)\right) , \end{aligned} \end{aligned}$$

where the last inequality is due to \(\varphi (\Phi (z^{m+1})-\Phi (z^*))\ge 0\). Letting \(m\rightarrow +\infty \) yields

$$\begin{aligned} \sum _{k=l+1}^{+\infty }\Vert z^{k+1}-z^k\Vert \le \Vert z^{l+1}-z^{l}\Vert + \frac{{\hat{\delta }}}{{\bar{\delta }}}\varphi \left( \Phi (z^{l+1})-\Phi (z^*)\right) . \end{aligned}$$

Since \({\bar{\delta }}>0\), the right-hand side of the above inequality is finite; adding the finitely many terms with \(k\le l\), we have

$$\begin{aligned} \sum _{k=0}^{+\infty }\Vert z^{k+1}-z^k\Vert < +\infty , \end{aligned}$$

and the sequence \(\{z^k\}_{k\in {\mathbb {N}}}\) is a Cauchy sequence. Combining this with Theorem 3.7, \(\{z^k\}_{k\in {\mathbb {N}}}\) converges to a critical point of model (1) with the abstract sets \({\overline{X}},~{\overline{Y}}\). \(\square \)

4 Numerical experiment

In this section, we numerically verify the performance of the ASABP and IASABP algorithms analyzed in the previous sections. All experiments are performed in MATLAB R2014a on a 64-bit PC with an Intel Core i7-8550U CPU (1.80 GHz) and 16 GB of RAM.

We consider Poisson linear inverse problems [6], which can be conveniently described as follows. Given a matrix \(A\in {\mathbb {R}}_{+}^{m\times n}\) modeling the experimental protocol and a vector of measurements \(b\in {\mathbb {R}}^{m}_{+}\), the goal is to reconstruct the signal or image \(x\in {\mathbb {R}}_{+}^{n}\) from the noisy measurement b such that \(Ax\simeq b\). These problems are often formulated as a minimization problem of the form

$$\begin{aligned} \min \{d(b,Ax)+\lambda g(x):~x\in {\mathbb {R}}^{n}_{+}\}, \end{aligned}$$
(28)

where \(\lambda >0\) controls the trade-off between the data fidelity term and the regularizer, and \(d(\cdot ,\cdot )\) denotes a convex proximity measure between two vectors. Poisson linear inverse problems arise in many fields, such as astronomy, nuclear medicine, electron microscopy, statistics, and imaging science [6, 8, 16]. Therefore, the design of methods for such problems has been studied intensively.

By introducing an auxiliary variable \(y\in {\mathbb {R}}^{n}\), we can approximately solve (28) via the following optimization problem

$$\begin{aligned} \min \limits _{x\in {\mathbb {R}}^{n}_{+},~y\in {\mathbb {R}}^{n}} F(x)+G(y)+H(x,y), \end{aligned}$$
(29)

by defining

$$\begin{aligned} F(x)=d(b,Ax), ~~~G(y)=\lambda g(y), ~~~H(x,y)=\frac{\mu }{2}\Vert x-y\Vert ^2, \end{aligned}$$

where \(\mu \) is a positive penalization parameter.

Adopting the above model (29), we take the first component of the objective function to be the data fidelity based on the Boltzmann-Shannon entropy,

$$\begin{aligned} d(b,Ax)=\sum _{i=1}^{m}\{(Ax)_{i}-b_{i}\log (Ax)_{i}\}. \end{aligned}$$

It is easy to verify that \(F(x)=d(b,Ax)\) does not have a globally Lipschitz continuous gradient [6], but it satisfies the GL-samd condition with a generalized Bregman function given by Burg’s entropy,

$$\begin{aligned} h(x)=-\sum ^{n}_{j=1}\log x_{j}, ~~ \textrm{dom}h=\mathbb {R}^{n}_{++}, \end{aligned}$$

so that the corresponding Bregman distance is given by

$$\begin{aligned} D_h(x,z)=\sum _{j=1}^{n}\left\{ \frac{x_j}{z_{j}}-\log \left( \frac{x_j}{z_{j}}\right) -1 \right\} , \end{aligned}$$

and for any \(L_h\) satisfying \(L_h\ge \Vert b\Vert _{1}=\sum _{i=1}^{m}b_{i}\), the function \(L_h h-F\) is convex on \({\mathbb {R}}^{n}_{+}\). In this case, subproblems (9) and (12) reduce to solving n one-dimensional convex problems, as sketched below. Note that Assumption 3.3 does not hold in this case.
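To illustrate this reduction, the following MATLAB sketch performs one x-update, assuming that subproblem (9) linearizes F at \(x^k\), keeps the coupling term \(H(x,y^k)=\frac{\mu }{2}\Vert x-y^k\Vert ^2\), and adds the Bregman proximal term \(\frac{1}{\tau _k}D_h(x,x^k)\), consistent with the optimality condition used in the proof of Proposition 3.10. The function name and variables are illustrative and not part of the paper's implementation.

```matlab
% One x-update for model (29) with h given by Burg's entropy.
% Coordinatewise optimality: mu*(x_j - y_j) + g_j + (1/tau)*(1/xk_j - 1/x_j) = 0,
% i.e. mu*x_j^2 + c_j*x_j - 1/tau = 0 with c_j = g_j - mu*y_j + 1/(tau*xk_j);
% each quadratic has a unique positive root since 4*mu/tau > 0.
function xnew = x_update_burg(A, b, xk, yk, mu, tau)
    g = A' * (1 - b ./ (A * xk));        % gradient of F(x) = d(b,Ax) at x^k
    c = g - mu * yk + 1 ./ (tau * xk);   % linear coefficients, coordinatewise
    xnew = (-c + sqrt(c.^2 + 4 * mu / tau)) / (2 * mu);
end
```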

We consider Tikhonov regularization in model (29), i.e., the regularizer is \(G(y)=\lambda g(y)=\frac{\lambda }{2}\Vert y\Vert ^2\). Take the energy kernel \(\psi (y)=\frac{1}{2}\Vert y\Vert ^2\), whose corresponding Bregman distance is \(D_{\psi }(y,z)=\frac{1}{2}\Vert y-z\Vert ^2\); then for any \(L_{\psi }\) satisfying \(L_{\psi }\ge \lambda \), the function \(L_{\psi }\psi -G\) is convex on \({\mathbb {R}}^{n}\). It is worth noting that, in this case, the objective function in model (29) is a KL function according to [3, 11].
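For this regularizer and kernel, the y-update admits a closed form. Below is a minimal sketch, again assuming the linearized form of subproblem (10) implied by the optimality condition in Proposition 3.10; the variable names are illustrative.

```matlab
% One y-update: G(y) = (lambda/2)*||y||^2 linearized at y^k, psi(y) = (1/2)*||y||^2.
% Optimality: mu*(y - xnew) + lambda*yk + (1/sigma)*(y - yk) = 0.
function ynew = y_update_energy(xnew, yk, mu, sigma, lambda)
    ynew = (mu * xnew + (1/sigma - lambda) * yk) / (mu + 1/sigma);
end
```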

We now describe the experimental setting. For model (29), the entries of \(A\in {\mathbb {R}}_{+}^{m\times n}\) and \(b\in {\mathbb {R}}^{m}_{+}\) are generated from independent uniform distributions over the interval [0, 1]; we set \(\lambda =1\) and choose the penalization parameter \(\mu \) as large as the conditions permit. We initialize all the algorithms at a vector whose entries are generated from the uniform distribution over the interval (0, 1), and terminate them when \(Error:=\frac{\Vert x^k-x^{k-1}\Vert }{\max \{1,\Vert x^k\Vert \}} \le 10^{-6}\). We also terminate the algorithms when the number of iterations reaches 100.
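As a rough guide to how the pieces fit together, the following MATLAB sketch assembles the setup and a plain ASABP-style loop using the two update functions above; the constant step sizes \(\tau \), \(\sigma \) and the value of \(\mu \) are illustrative assumptions and not the exact choices used in our experiments.

```matlab
m = 1000; n = 500;                       % an overdetermined instance
A = rand(m, n); b = rand(m, 1);          % entries uniform on [0,1]
lambda = 1; mu = 10;                     % lambda as in the text; mu illustrative
L_h = norm(b, 1); L_psi = lambda;        % constants from the convexity conditions
tau = 1 / L_h; sigma = 1 / L_psi;        % illustrative constant step sizes
x = rand(n, 1); y = rand(n, 1);          % initialization on (0,1)
for k = 1:100                            % at most 100 iterations
    xold = x;
    x = x_update_burg(A, b, x, y, mu, tau);
    y = y_update_energy(x, y, mu, sigma, lambda);
    Err = norm(x - xold) / max(1, norm(x));
    if Err <= 1e-6, break; end           % stopping criterion from the text
end
```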

As these methods can be applied to both overdetermined (\(m>n\)) and underdetermined (\(m<n\)) problems, we have performed numerical tests on both cases. As commented before, the main parameters in IASABP are set as follows: \(\eta \) is a random number between 0.90 and 0.95 and \(\theta = 0.99\); \(\alpha _k^0=\beta _{k}^{0}=1\) for all k; \(L_h=\Vert b\Vert _1\) and \(L_{\psi }=\lambda \).

Fig. 1 Poisson linear inverse problem. First row: \(m=1000\); second row: \(n=1000\)

For this Poisson linear inverse problem, Fig. 1 records the evolution of the residual functions of ASABP and IASABP in the overdetermined and underdetermined cases. We can see that IASABP converges faster than ASABP, which shows that the IASABP algorithm has an advantage over ASABP and makes it an attractive option for practical problems.

5 Conclusions

We have proposed an alternating structure-adapted Bregman proximal (ASABP) gradient descent algorithm for solving nonconvex nonsmooth minimization problems with abstract sets. An improved version with an inertial acceleration scheme and a line search strategy (IASABP) was also presented. Under mild conditions, we proved that both ASABP and IASABP possess an O(1/K) sublinear rate of convergence, measured by Bregman distance. Furthermore, if the objective function has the Kurdyka–Łojasiewicz property, we proved that the sequence generated by ASABP converges to a critical point of the constrained problem. Theoretically, our results extend the recent NoLips algorithms, which are limited either to the convex case or to the nonconvex case without abstract sets. Numerically, our methods avoid the estimation of Lipschitz constants, an annoying task in implementation. We applied ASABP and IASABP to the Poisson linear inverse problem, and their performance is very promising.