1 Introduction

Over the last few years, in many modern applications ranging from machine learning and statistics to image/signal processing (see, e.g., [15, 20, 27] and references therein), it has been observed that nonconvex models are often more faithful than their convex relaxed/approximate counterparts and provide better-quality solutions, despite the obvious difficulties in deriving meaningful convergence guarantees for nonconvex problems. This has led to a strong renewed interest in nonconvex optimization methods and their theoretical properties.

This study addresses a broad class of nonsmooth nonconvex minimization problems with nonlinear functional equality constraints via the fundamental (augmented) Lagrangian-based optimization methodology. The augmented Lagrangian (AL) approach is a centerpiece of the algorithmic optimization literature, with a long history that can be traced back to the classical works of [18] and [24]. To this day, it provides fertile ground for algorithmic frameworks, with prominent fundamental algorithms such as the Proximal Method of Multipliers (PMM) [25] and the Alternating Direction Method of Multipliers (ADMM) [16, 17]. We refer the reader to the monographs [3, 5] for a comprehensive survey.

The recent literature on Lagrangian-based methods is quite voluminous and is almost entirely devoted to the convex setting; see, e.g., [3,4,5, 11, 17, 29] and references therein. Unlike the convex setup, the situation in the nonconvex setting is far from being well understood; in particular, the global analysis of Lagrangian-based methods for general nonlinear models remains scarce and very challenging.

The genuinely nonconvex optimization model we study has barely been touched in the literature. In fact, only very recently has some progress been made for certain types of nonconvex models with linear constraints. Even in this simpler linearly constrained setup, the analysis is challenging and is forced to rely on various types of assumptions; see, e.g., [19] and [10] for recent results, as well as relevant references therein.

To our knowledge, the class of nonsmooth nonconvex optimization models with nonlinear equality constraints studied here, within the Lagrangian-based scheme we propose, has not been addressed in the literature. The only closely related work we are aware of is the universal Lagrangian framework studied in [9], in which the authors introduce a novel methodology for the global analysis of a nonsmooth nonlinear composite optimization problem, within the framework of a broad Lagrangian-based scheme with global convergence guarantees. The results developed in [9] have revealed the fundamental theoretical difficulties involved in the global analysis of Lagrangian-based methods when studying such an all-embracing nonsmooth nonconvex composite minimization model; see, e.g., [28] for a brief summary of this approach and of the nontrivial assumptions needed in the analysis of [9]. Besides the theoretical front, another central issue lies on the tractability side, namely, how to handle the minimization step efficiently while preserving the convergence properties. In this context, the approach suggested by [9] reduces in many scenarios to solving very difficult composite nonlinear subproblems.

Motivated by the recent general framework developed in [9], we seek to bypass the limitations mentioned above by considering a more specific class of problems, for which one can refine the theoretical analysis and design tractable minimization steps by further exploiting structure and specific data information. Indeed, here we focus our efforts on a class of nonconvex nonsmooth optimization models with data assumptions which, on the one hand, are sufficiently broad to cover many applications, and, on the other hand, allow for the global convergence analysis of more tractable schemes.

The paper is structured as follows. In the next section, we introduce the class of nonconvex nonsmooth problems with nonlinear constraints as well as some mathematical preliminaries on our approach. Section 3 introduces the dynamic alternating directions scheme (described by Algorithm 1), and Sect. 4 analyzes its global and subsequence convergence guarantees. In Sect. 5, we provide an explicit and tractable backtracking procedure (cf. Algorithm 2) to generate the proximal parameter used in Algorithm 1 and prove that it satisfies all the properties required by our analysis. Finally, as an interesting byproduct of our analysis, in Sect. 6, we derive new results on global convergence for the proximal gradient method with a dynamic proximal parameter when applied to the nonconvex sum composite minimization model where the smooth part is locally Lipschitz. Section 7 concludes this work, including a brief discussion on additional algorithmic variants that could be obtained within our approach.

Notation. Unless otherwise specified, our notations are standard and can be found in the classical monograph of [26].

2 Problem Formulation and Mathematical Preliminaries

We consider the nonconvex nonsmooth model under the following blanket assumptions on the problem’s data:

$$\begin{aligned} \min _{u\in {\mathbb {R}}^{n},v\in {\mathbb {R}}^{m}}\left\{ \varPhi (u,v){:=}f(u)+g(u)+h(v): F(u)=Gv\right\} \text {,} \end{aligned}$$
(M)

where

  • \(f: {\mathbb {R}}^{n} \rightarrow (-\infty , \infty ]\) is a proper and lower-semicontinuous (lsc) function and \(\inf _{u\in {\mathbb {R}}^{n}}f(u)>-\infty \);

  • \(g: {\mathbb {R}}^{n} \rightarrow {\mathbb {R}}\) is a continuously differentiable (\(C^{1}\)) function with a locally Lipschitz continuous gradient;

  • \(h: {\mathbb {R}}^{m} \rightarrow {\mathbb {R}}\) is a \(C^{1}\) function with an \(L_{h}\)-Lipschitz continuous gradient, \(L_{h}>0\);

  • \(F: {\mathbb {R}}^{n} \rightarrow {\mathbb {R}}^{q}\) is a \(C^{1}\) mapping with a locally Lipschitz continuous Jacobian;

  • \(G\in {\mathbb {R}}^{q\times m }\) has full row rank, i.e., the minimal eigenvalue \(\lambda _{\min }(GG^T)>0.\)

Model (M) encompasses a wide variety of problems arising in disparate statistical and engineering applications (see, e.g., [15, 20, 21, 27]), as well as problems ranging from nonlinear least squares to bi-level optimization. We briefly describe two specific examples.

Example 2.1

(Nonconvex Stackelberg competition) In a Stackelberg competition (see, e.g., [31]), firms optimize their utility function in a predetermined order. In the common scenario, there is a leader firm which optimizes first and a follower firm that optimizes second based on the choices of the leader firm. Consequently, the leader firm takes into account the implications of its decisions on the follower firm. Model (M) captures Stackelberg competitions having the general form: \(H_{\text {leader}} (u;v) = f(u)+g(u)+h(v)\), and \(H_{\text {follower}} (v;u) = \frac{1}{2} v^T G v + F(u)^T v \); our model is obtained by applying the first-order optimality conditions to the follower firm’s problem. Note that this model can be considered as a bi-level optimization problem.
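To make the reduction to (M) explicit, consider the additional simplifying assumption (not imposed in the example above) that G is symmetric. The follower firm's first-order optimality condition then reads

$$\begin{aligned} 0=\nabla _{v}H_{\text {follower}}(v;u)=Gv+F(u), \end{aligned}$$

which, after absorbing the sign into F (i.e., replacing F by \(-F\)), is an equality constraint of the form \(F(u)=Gv\), as required in (M).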

Example 2.2

(Sparsity-related approximations) Model (M) also captures the interesting problem obtained from setting h to be a continuously differentiable sparsity-inducing function, such as the famous Smoothly Clipped Absolute Deviation (SCAD) function proposed by [14]. This leads to a variety of possible sparsity-related problems, as well as nonlinear regression mimicking the \(\ell _1\)-norm regression. Moreover, noting that G can be chosen with a small number of rows (while retaining full row rank), compressed sensing-type instances are also captured by the model.
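For concreteness, the following minimal NumPy sketch evaluates one standard closed form of the SCAD penalty of [14] (the name scad_penalty and the default values lam=1.0 and a=3.7 are illustrative choices, not prescribed by the model):

```python
import numpy as np

def scad_penalty(t, lam=1.0, a=3.7):
    """Elementwise SCAD penalty (cf. [14]); lam > 0 and a > 2 are tuning parameters."""
    t = np.abs(np.asarray(t, dtype=float))
    small = lam * t                                             # |t| <= lam
    middle = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))  # lam < |t| <= a*lam
    large = lam**2 * (a + 1) / 2                                # |t| > a*lam
    return np.where(t <= lam, small, np.where(t <= a * lam, middle, large))
```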

Model (M) also includes the so-called sum composite minimization model.

Example 2.3

(Nonlinear composite model) Letting \(g=0\), and G be the identity map \(I_{q\times m}\), the model (M) reduces to the nonconvex sum composite minimization problem:

$$\begin{aligned} \min \left\{ f(u)+ h(F(u)): \; u\in {\mathbb {R}}^{n} \right\} . \end{aligned}$$

This model was studied in [9], but with a more general h that is nonsmooth and extended-valued, through a general Lagrangian-based framework. Here, we assume that h is a \(C^{1}\) function with an \(L_{h}\)-Lipschitz continuous gradient. As we shall see, as a byproduct of our analysis, this allows us to tackle the sum composite problem directly, namely via the so-called proximal gradient scheme [1], yet under mild locally Lipschitz assumptions and with a dynamic proximal parameter, and to obtain global convergence guarantees; see Sect. 6 for details.
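To fix ideas, in this composite setting a proximal gradient step with a dynamic proximal parameter \(\sigma _{k+1}\) (a sketch anticipating Sect. 6) takes the form

$$\begin{aligned} u^{k+1}\in \mathop {{\text {prox}}}\nolimits _{\sigma _{k+1}^{-1}f}^{}\left( u^{k}-\sigma _{k+1}^{-1}\nabla F(u^{k})^{T}\nabla h(F(u^{k}))\right) , \end{aligned}$$

where \(\nabla F(u^{k})^{T}\nabla h(F(u^{k}))\) is simply the chain-rule gradient of the smooth part \(h\circ F\) at \(u^{k}\).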

Based on a variational analysis of the problem (see [26, Chapters 8 and 10]), we define the following constraint qualification conditions for (M). We denote by \(\nabla F(\cdot )\in {\mathbb {R}}^{q\times n} \) the Jacobian of F.

Definition 2.1

(Constraint qualification) A point \(u\in {\mathbb {R}}^{n}\) is said to satisfy the constraint qualification [CQ] conditions for problem (M) if the following hold true:

  1. (i)

    The function f is subdifferential regular at u.

  2. (ii)

    \(\partial ^{\infty }f(u)\cap \mathrm {ran}\,{\nabla F(u)^{T}}=\{0\}\).

For more details regarding regularity, see Definition 7.25 and Corollaries 8.11 and 8.19 in [26]. In Appendix A, we also give a more detailed account on subdifferentials of nonconvex functions which are used here.

The [CQ] conditions provide both smoothness and regularity of the constraint set, and the necessary subdifferential calculus rules which allow us to derive first-order necessary optimality conditions for (M).

Lemma 2.1

(First-order optimality conditions) Let \((u^{*},v^{*})\in {\mathbb {R}}^{n}\times {\mathbb {R}}^{m}\) be a local minimizer of (M), where \(u^{*}\) satisfies the [CQ] conditions of Definition 2.1. Then,

  1. (i)

    \((u^{*},v^{*})\) is a feasible point, i.e., \(0=F(u^{*})-Gv^{*}\), and

  2. (ii)

    there exists \(y^{*}\in {\mathbb {R}}^{q}\) such that

    $$\begin{aligned} 0\in \partial f(u^{*}) + \nabla g(u^{*}) + \nabla F(u^{*})^{T}y^{*}\quad \text {and}\quad 0=\nabla h(v^{*}) -G^{T}y^{*}. \end{aligned}$$
    (2.1)

The Lagrangian \({\mathcal {L}}:{\mathbb {R}}^{n}\times {\mathbb {R}}^{m}\times {\mathbb {R}}^{q}\rightarrow (-\infty , \infty ]\) associated with problem (M) is defined by

$$\begin{aligned} {\mathcal {L}}(u, v, y){:=}f(u) + g(u) + h(v) +\langle y, F(u)-Gv\rangle . \end{aligned}$$
(2.2)

We note that

$$\begin{aligned} \partial {\mathcal {L}}(u,v,y)&= \left( \partial f(u) + \{\nabla g(u) +\nabla F(u)^{T}y\}\right) \nonumber \\&\times \{\nabla h(v) -G^{T}y\} \times \{F(u)-Gv\}, \end{aligned}$$
(2.3)

where the subdifferential calculus rules which justify (2.3) always hold and do not depend on the [CQ] conditions (see [26, Exercise 8.8]). Thus, it can be easily verified that the first-order optimality conditions of Lemma 2.1 are equivalent to requiring the point \((u^{*}, v^{*}, y^{*})\) to be a critical point of the Lagrangian, i.e., \((u^{*}, v^{*}, y^{*})\in {{\,\mathrm{crit}\,}}{\mathcal {L}}:=\{ (u^{*}, v^{*}, y^{*}): 0 \in \partial {\mathcal {L}}(u^{*}, v^{*}, y^{*})\}\). The main purpose of this work is to devise an algorithm that converges to such a point.

To achieve our goal, we rely on an augmented Lagrangian approach (see, e.g., [2, Chapter 4.2]), in which the Lagrangian associated with (M) (cf. (2.2)) is penalized by a quadratic term, with a penalty parameter \(\rho >0\), to obtain the augmented Lagrangian function

$$\begin{aligned} {\mathcal {L}}_{\rho }(u, v, y) {:=}f(u) + g(u) + h(v) + \langle y, F(u)-Gv\rangle +\frac{\rho }{2}\left\| {F(u)-Gv} \right\| ^{2},\qquad \end{aligned}$$
(2.4)

where \(\left\| {\cdot } \right\| \) denotes the usual Euclidean \(l_{2}\)-norm. It will be convenient in the sequel to distinguish between the smooth and nonsmooth elements of \({\mathcal {L}}_{\rho }\). For this purpose, let \(\varphi :{\mathbb {R}}^{n}\times {\mathbb {R}}^{m}\times {\mathbb {R}}^{q}\rightarrow {\mathbb {R}}\) be defined by

$$\begin{aligned} \begin{aligned} \varphi (u,v,y)&{:=}g(u) + \langle y, F(u)-Gv\rangle +\frac{\rho }{2}\left\| {F(u)-Gv} \right\| ^{2}\\&=g(u)+\frac{1}{2\rho }\left\| {y+\rho (F(u)-Gv)} \right\| ^{2}- \frac{1}{2\rho }\left\| {y} \right\| ^{2}, \end{aligned} \end{aligned}$$
(2.5)

so that

$$\begin{aligned} {\mathcal {L}}_{\rho }(u, v, y) = f(u) + h(v) + \varphi (u,v,y). \end{aligned}$$

A key motivation behind the augmented Lagrangian approach is that any critical point \((u^*,v^*,y^*)\) of \({\mathcal {L}}_\rho \) is also a critical point of the Lagrangian \({\mathcal {L}}\), and hence \((u^*,v^*)\), together with the multiplier \(y^*\), satisfies the first-order optimality conditions for (M) (cf. Lemma 2.1); this can easily be established via the characterization of the critical set of the Lagrangian function (2.2); see also Proposition 4.1 in Sect. 4.3.

3 A Dynamic Alternating Direction Method of Multipliers

For the purpose of obtaining a critical point of \({\mathcal {L}}_{\rho }\), we propose the Dynamic linearized Alternating direction method of Multipliers (DAM) described by Algorithm 1. As its name suggests, this method relies on an ADMM-based scheme in which we use a specific linearization of the augmented Lagrangian together with a proximal regularization.

[Algorithm 1: The Dynamic linearized Alternating direction method of Multipliers (DAM), with u-update (3.1), v-update (3.2), and multiplier update (3.3).]

Several comments regarding Algorithm 1 are in order.

Remark 3.1

(On the proximal parameter \(\sigma _{k}\)) The proximal parameter \(\sigma _{k+1}\) is generated dynamically according to the information obtained at the current iterate; hence the dynamic nature of Algorithm 1. The analysis of Algorithm 1 requires that the sequence \(\{\sigma _{k}\}_{k\ge 1}\) satisfies two conditions: (i) it facilitates a descent property with respect to the u-update step; and (ii) it is bounded. In Sect. 5, we provide a detailed backtracking procedure to generate \(\sigma _{k+1}\) in a tractable manner (cf. Algorithm 2), so that both of these conditions are met under our model's assumptions and the boundedness of the generated sequence of iterates. However, we emphasize that our convergence guarantees (stated in Sect. 4) are independent of the explicit manner in which the proximal parameter is generated.

Remark 3.2

(On the u-update) The u-update step (3.1) is in fact a proximal gradient step:

$$\begin{aligned} \begin{aligned} u^{k+1}&\in \mathrm {argmin}_{u\in {\mathbb {R}}^{n}}\left\{ f(u) +\langle \nabla _{u}\varphi (u^{k},v^{k},y^{k}), u\rangle + \frac{\sigma _{k+1}}{2}\left\| {u-u^{k}} \right\| ^{2}\right\} \\&=\mathop {{\text {prox}}}\nolimits _{\sigma _{k+1}^{-1}f}^{}\left( u^{k}-\sigma _{k+1}^{-1}\nabla _{u}\varphi (u^{k},v^{k},y^{k})\right) . \end{aligned} \end{aligned}$$

A preliminary requirement for the validity of the u-update step is that the solution set of the minimization problem in (3.1) is nonempty. The results in [26, p. 21], concerning the Moreau envelopes and proximal mappings of proper and lsc (but not necessarily convex) functions, together with the fact that \(\inf _{u\in {\mathbb {R}}^{n}}f(u)>-\infty \), guarantee that the proximal mapping is a nonempty and compact set; this is stated formally below in Lemma 4.2.
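For illustration, here is a minimal NumPy sketch of one u-update of the form (3.1) under the illustrative (nonconvex) choice \(f=\lambda \left\| {\cdot } \right\| _{0}\), whose proximal map is hard-thresholding; the names prox_l0, u_update, and lam are ours and not part of Algorithm 1:

```python
import numpy as np

def prox_l0(x, lam, sigma):
    # prox_{f/sigma} for f = lam*||.||_0: keep x_i only if lam < (sigma/2) * x_i**2.
    keep = np.abs(x) > np.sqrt(2.0 * lam / sigma)
    return np.where(keep, x, 0.0)

def u_update(u, grad_u_phi, sigma, lam=0.1):
    # One proximal gradient step (3.1): prox of f evaluated at u - grad_u_phi/sigma,
    # where grad_u_phi is the vector nabla_u phi(u^k, v^k, y^k) of (4.1).
    return prox_l0(u - grad_u_phi / sigma, lam, sigma)
```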

Remark 3.3

(On the v-update) Noting that \(\varPsi _{k}\) is a strongly convex function, due to the fact that \(\rho G^{T}G +\theta I\succ 0\) for any \(\theta >0\), we can derive from the first-order optimality conditions of (3.2) an explicit expression for the v-update step:

$$\begin{aligned} v^{k+1} = (\rho G^{T}G +\theta I)^{-1}\left( \theta v^{k} +G^{T}(y^{k}+\rho F(u^{k+1})) -\nabla h(v^{k})\right) . \end{aligned}$$
(3.4)

Since the matrix \(\rho G^{T}G +\theta I\) remains unchanged throughout the execution of Algorithm 1, its inversion (or factorization) needs to be computed only once.
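A minimal NumPy/SciPy sketch of the v-update (3.4), factorizing \(\rho G^{T}G+\theta I\) once and reusing the factorization at every iteration (F and grad_h below are assumed callables for \(F\) and \(\nabla h\)):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def make_v_update(G, rho, theta, F, grad_h):
    # rho*G^T G + theta*I is symmetric positive definite and fixed, so factor it once.
    m = G.shape[1]
    factor = cho_factor(rho * G.T @ G + theta * np.eye(m))

    def v_update(v, u_next, y):
        # Closed-form v-update (3.4).
        rhs = theta * v + G.T @ (y + rho * F(u_next)) - grad_h(v)
        return cho_solve(factor, rhs)

    return v_update
```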

In the next section we establish the subsequence and global convergence guarantees of Algorithm 1.

4 Convergence Analysis of DAM

4.1 Preliminaries

Our analysis will revolve around the function \(\varphi \) and its derivatives; these are recorded below for ease of use.

$$\begin{aligned} \nabla _{u}\varphi (u, v,y)&=\nabla g(u) + \nabla F^{T}(u)(y +\rho (F(u)-Gv)), \end{aligned}$$
(4.1)
$$\begin{aligned} \nabla _{v}\varphi (u, v,y)&=-G^{T}(y +\rho (F(u)-Gv)), \end{aligned}$$
(4.2)
$$\begin{aligned} \nabla _{y}\varphi (u, v,y)&=F(u)-Gv. \end{aligned}$$
(4.3)
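As a sanity check, here is a minimal NumPy sketch of the partial gradients (4.1)-(4.3) (grad_g, F, and jac_F are assumed callables for \(\nabla g\), \(F\), and the Jacobian \(\nabla F\)):

```python
import numpy as np

def grad_phi(u, v, y, rho, G, grad_g, F, jac_F):
    # Recurring residual term y + rho*(F(u) - G v).
    r = y + rho * (F(u) - G @ v)
    grad_u = grad_g(u) + jac_F(u).T @ r   # (4.1)
    grad_v = -G.T @ r                     # (4.2)
    grad_y = F(u) - G @ v                 # (4.3)
    return grad_u, grad_v, grad_y
```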

We mainly utilize the local Lipschitz continuity of the gradients above; the proof of the following lemma is rather technical and is therefore deferred to Appendix C.

Lemma 4.1

(Local Lipschitz continuity of \(\nabla \varphi \)) The function

$$\begin{aligned} \varphi (u,v,y) = g(u) + \langle y, F(u)-Gv\rangle +\frac{\rho }{2}\left\| {F(u)-Gv} \right\| ^{2} \end{aligned}$$

satisfies that for every nonempty and compact set \(C\subseteq {\mathbb {R}}^n \times {\mathbb {R}}^m \times {\mathbb {R}}^q\) there exists \(L_{C}\ge 0\) such that

$$\begin{aligned} \left\| {\nabla \varphi (x) - \nabla \varphi (z)} \right\| \le L_{C}\left\| {x-z} \right\| ,\quad \forall x,z\in C. \end{aligned}$$
(4.4)

That is, \(\nabla \varphi \) is locally Lipschitz continuous.

Let us recall the well-definedness of the proximal mapping operator with respect to a nonconvex function (cf. [26, page 21]).

Lemma 4.2

(Properties of the proximal mapping) Let \(\zeta :{\mathbb {R}}^d\rightarrow (-\infty ,\infty ]\) be a proper lower semi-continuous function with \(\inf _{{\mathbb {R}}^d}\zeta >-\infty \), let \(x\in {\mathbb {R}}^d\), and let \(t>0\). The proximal map associated with \(\zeta / t\), given by

$$\begin{aligned} \mathrm {prox}_{\zeta /t} (x) = \mathrm {argmin}\left\{ \zeta (z) + \dfrac{t}{2} \Vert z - x\Vert ^2 : z\in {\mathbb {R}}^d \right\} , \end{aligned}$$

is a nonempty and compact set.

The infrastructure for the forthcoming analysis of Algorithm 1 comprises the following assumptions related to the proximal parameter.

Assumption A

(Proximal parameter implies descent) The proximal parameters \(\{\sigma _{k}\}_{k\ge 1}\) are chosen so that there exists \(\nu >0\) such that for every \(k\ge 1\),

$$\begin{aligned}&\varphi (u^{k+1}, v^{k}, y^{k})-\varphi (u^{k}, v^{k}, y^{k}) - \langle \nabla _{u}\varphi (u^{k}, v^{k}, y^{k}),u^{k+1}-u^{k}\rangle \nonumber \\&\le \frac{\sigma _{k+1}-\nu }{2}\left\| {u^{k+1}-u^{k}} \right\| ^{2}. \end{aligned}$$
(4.5)

Assumption B

(Proximal parameter sequence is bounded) The proximal parameter sequence \(\{\sigma _{k}\}_{k\ge 1}\) is bounded, i.e.,

$$\begin{aligned} {\bar{\sigma }}{:=}\sup _{k\ge 1}\sigma _{k} <\infty . \end{aligned}$$
(4.6)

We will also utilize the celebrated descent lemma.

Lemma 4.3

(Descent lemma [2]) Let \(\phi : {\mathbb {R}}^{n}\rightarrow {\mathbb {R}}\) be a function with an \(L_{S}\)-Lipschitz continuous gradient \(\nabla \phi \) over a given convex set S, with \(L_{S}\in {\mathbb {R}}_{+}\). Then,

$$\begin{aligned} \left| \phi (x) - \phi (z)-\langle \nabla \phi (z), x-z\rangle \right| \le \frac{L_{S}}{2}\left\| {x-z} \right\| ^{2},\quad \forall x,z\in S\text {.} \end{aligned}$$
(4.7)

Our analysis is based on the methodology and terminology introduced by [7], and refined in the more recent [8]; see [30] for a summary of these results. Accordingly, we begin by defining the notion of gradient-like descent sequence [8, Definition 6.1].

Definition 4.1

(Gradient-like descent sequence) A sequence \(\{x^{k}\}_{k\ge 1}\subseteq {\mathbb {R}}^{d}\) is called a gradient-like descent sequence with respect to a proper and lsc function \(\varGamma :{\mathbb {R}}^{d}\rightarrow (-\infty ,\infty ]\) if it satisfies the following three conditions:

  1. (C1)

    Sufficient decrease property. There exists \(c_{1}>0\) such that

    $$\begin{aligned} \varGamma (x^{k+1})-\varGamma (x^{k})\le -c_{1}\left\| {x^{k+1}-x^{k}}\right\| ^{2},\quad \forall k\ge 1\text{. } \end{aligned}$$
  2. (C2)

    A subgradient lower bound for the iterates gap. There exists \(c_{2}>0\) such that, for every \(k\ge 1\), there exists \(\xi ^{k+1}\in \partial \varGamma (x^{k+1})\) which satisfies

    $$\begin{aligned} \left\| {\xi ^{k+1}} \right\| \le c_{2}\left\| {x^{k+1}-x^{k}} \right\| \text {.} \end{aligned}$$
  3. (C3)

    Continuity condition. Let \(\{x^{k}\}_{k\in {\mathcal {K}}}\) be a subsequence converging to a point \({\bar{x}}\). Then,

    $$\begin{aligned} \limsup _{k\in {\mathcal {K}}\subseteq {\mathbb {N}}}\varGamma (x^{k})\le \varGamma ({\bar{x}})\text {.} \end{aligned}$$

A bounded gradient-like descent sequence has the following convergence guarantees (cf. [7] and [8]).

Lemma 4.4

(Subsequence convergence) Let \(\{x^{k}\}_{k\ge 1}\) be a bounded gradient-like descent sequence with respect to the proper and lsc function \(\varGamma \). Denote by \(\varOmega \) the set of cluster points of \(\{x^{k}\}_{k\ge 1}\). Then, \(\varOmega \) is a nonempty and compact subset of \({{\,\mathrm{crit}\,}}\varGamma \); \(\lim _{k\rightarrow \infty }{\text {dist}}(x^{k}, \varOmega )=0\); and \(\varGamma \) is finite and constant on \(\varOmega \).

In order to obtain a global convergence result, we assume that \(\varGamma \) satisfies the nonsmooth Kurdyka-Łojasiewicz (KL) property [6], which is in particular true when \(\varGamma \) is a semi-algebraic function. For further details, see [7] and references therein.

Theorem 4.1

(Global convergence) [8, Theorem 6.2] Let \(\{x^{k}\}_{k\ge 1}\) be a bounded gradient-like descent sequence with respect to the proper and lsc function \(\varGamma \). If \(\varGamma \) satisfies the KL property, then the sequence \(\{x^{k}\}_{k\ge 1}\) has a finite length, i.e., \(\sum _{k=1}^{\infty }\left\| {x^{k+1}-x^{k}} \right\| <\infty \), and it converges to a critical point \(x^{*}\in {{\,\mathrm{crit}\,}}\varGamma \).

In light of Lemma 4.4 and Theorem 4.1, our goal boils down to establishing that the sequence generated by Algorithm 1 is a gradient-like descent sequence.

4.2 The Descent-ascent and v-y Relations

Before we establish that the sequence generated by Algorithm 1 is indeed a gradient-like descent sequence, two results are required: (i) links between the primal variable v and the multiplier y (Lemma 4.5 ); (ii) a descent-ascent relation (Lemma 4.6).

Lemma 4.5

(v and y relation) It holds that

$$\begin{aligned} G^{T}y^{k+1}&=\nabla h(v^{k}) + \theta (v^{k+1}-v^{k}),&\text {for all } k\ge 0 \end{aligned}$$
(4.8)
$$\begin{aligned} \left\| {y^{k+1}-y^{k}} \right\| ^{2}&\le \frac{2\theta ^{2}}{\lambda _{\min }(GG^{T})}\left\| {v^{k+1}-v^{k}} \right\| ^{2} + \frac{2(\theta +L_{h})^{2}}{\lambda _{\min }(GG^{T})}\left\| {v^{k}-v^{k-1}} \right\| ^{2},&\text {for all } k\ge 1. \end{aligned}$$
(4.9)

Proof

We begin by proving relation (4.8). Recall that (cf. (4.2))

$$\begin{aligned} \begin{aligned} \nabla _{v}\varphi (u^{k+1}, v^{k+1},y^{k})&=-G^{T}(y^{k} +\rho (F(u^{k+1})-Gv^{k+1}))=-G^{T}y^{k+1}, \end{aligned} \end{aligned}$$

where the last equality is due to multiplier update step (3.3). Applying the first-order optimality condition to the minimization problem of v-update step (3.2), we obtain that

$$\begin{aligned}&0=\nabla _{v}\varphi (u^{k+1}, v^{k+1}, y^{k}) + \nabla h(v^{k}) +\theta (v^{k+1}-v^{k})\\&=-G^{T}y^{k+1}+\nabla h(v^{k})+\theta (v^{k+1}-v^{k}), \end{aligned}$$

meaning that relation (4.8) holds true for any \(k\ge 0\).

We continue with proving relation (4.9). The matrix \(GG^{T}\) is positive definite and hence \(\lambda {:=}\lambda _{\min }(GG^{T})>0\). Additionally, for every \(y\in {\mathbb {R}}^{q}\), we have

$$\begin{aligned} \lambda ^{-1}\left\| {G^{T}y} \right\| ^{2}=\lambda ^{-1}\langle G^{T}y, G^{T}y\rangle =\lambda ^{-1}\langle y, GG^{T}y\rangle \ge \lambda ^{-1}\langle y, \lambda y\rangle =\left\| {y} \right\| ^{2}. \end{aligned}$$

Thus, for every \(k\ge 1\),

$$\begin{aligned} \left\| {y^{k+1}-y^{k}} \right\| \le \lambda ^{-1/2}\left\| {G^{T}(y^{k+1}-y^{k})} \right\| . \end{aligned}$$
(4.10)

Combining (4.10) with (4.8), we obtain for every \(k\ge 1\) that

$$\begin{aligned} \begin{aligned} \left\| {y^{k+1}-y^{k}} \right\| ^{2}&\le \frac{1}{\lambda }\left\| {\nabla h(v^{k}) - \nabla h(v^{k-1}) + \theta (v^{k+1}-v^{k}) - \theta (v^{k}-v^{k-1})} \right\| ^{2}\\&\le \frac{1}{\lambda }\left( \left\| {\nabla h(v^{k}) - \nabla h(v^{k-1})} \right\| +\theta \left\| {v^{k+1}-v^{k}} \right\| + \theta \left\| {v^{k}-v^{k-1}} \right\| \right) ^{2}\\&\le \frac{1}{\lambda }\left( \theta \left\| {v^{k+1}-v^{k}} \right\| + (\theta +L_{h})\left\| {v^{k} - v^{k-1}} \right\| \right) ^{2}, \end{aligned} \end{aligned}$$

where for the second inequality we apply the triangle inequality, and the subsequent inequality is due to the \(L_{h}\)-Lipschitz continuity of \(\nabla h\) . We finally complete the proof by utilizing the relation \((s+t)^{2}\le 2s^{2}+2t^{2}\). \(\square \)

Lemma 4.6

(Descent-ascent relation) Suppose that Assumption A holds. Then, for every \(k\ge 1\), it holds that

$$\begin{aligned} \begin{aligned} \varDelta _{{\mathcal {L}}_{\rho },k}&{:=}{\mathcal {L}}_{\rho }(u^{k+1}, v^{k+1}, y^{k+1})-{\mathcal {L}}_{\rho }(u^{k}, v^{k}, y^{k})\\&\le -\frac{\nu }{2}\left\| {u^{k+1}-u^{k}} \right\| ^{2}-\frac{(2\theta -L_{h})}{2}\left\| {v^{k+1}-v^{k}} \right\| ^{2} + \frac{1}{\rho }\left\| {y^{k+1}-y^{k}} \right\| ^{2}. \end{aligned}\qquad \end{aligned}$$
(4.11)

Proof

We prove the required inequality by establishing bounds following from the three update steps in Algorithm 1, starting with the y-update step (3.3). From (3.3), we trivially have that

$$\begin{aligned} \begin{aligned} \delta _{y,k}&{:=}{\mathcal {L}}_{\rho }(u^{k+1}, v^{k+1}, y^{k+1})-{\mathcal {L}}_{\rho }(u^{k+1}, v^{k+1}, y^{k})\\&=\langle y^{k+1}-y^{k}, F(u^{k+1})-Gv^{k+1}\rangle =\frac{1}{\rho }\left\| {y^{k+1}-y^{k}} \right\| ^{2}. \end{aligned} \end{aligned}$$
(4.12)

By the u-update step (3.1),

$$\begin{aligned} u^{k+1} \in \mathrm {argmin}_{u\in {\mathbb {R}}^{n}} f(u) +\langle \nabla _{u}\varphi (u^{k},v^{k},y^{k}), u\rangle + \frac{\sigma _{k+1}}{2}\left\| {u-u^{k}} \right\| ^{2}, \end{aligned}$$

we have that

$$\begin{aligned}&f(u^{k+1})+\langle \nabla _{u}\varphi (u^{k}, v^{k}, y^{k}), u^{k+1}-u^{k}\rangle \nonumber \\&+\frac{\sigma _{k+1}}{2}\left\| {u^{k+1}-u^{k}} \right\| ^{2}\le f(u^{k}). \end{aligned}$$
(4.13)

Assumption A then ensures that decrease property (4.5) holds true, which combined with (4.13) results in

$$\begin{aligned} f(u^{k+1})+\varphi (u^{k+1}, v^{k}, y^{k})-f(u^{k})-\varphi (u^{k}, v^{k}, y^{k}) \le -\frac{\nu }{2}\left\| {u^{k+1}-u^{k}} \right\| ^{2}. \end{aligned}$$

Hence, setting \(\delta _{u,k}{:=}{\mathcal {L}}_{\rho }(u^{k+1}, v^{k}, y^{k})-{\mathcal {L}}_{\rho }(u^{k}, v^{k}, y^{k})\), we obtain

$$\begin{aligned} \delta _{u,k} \le -\frac{\nu }{2}\left\| {u^{k+1}-u^{k}} \right\| ^{2}. \end{aligned}$$
(4.14)

Now, let us consider the implications of the v-update (3.2). For any \(k\ge 1\), we have from the y-update step (3.3) and the definition of \(\lambda _{\min }(\cdot )\) that

$$\begin{aligned} \begin{aligned} \delta _{v,k}&{:=}{\mathcal {L}}_{\rho }(u^{k+1}, v^{k+1}, y^{k})-{\mathcal {L}}_{\rho }(u^{k+1}, v^{k}, y^{k})\\&=h(v^{k+1})-h(v^{k}) -\langle G^{T}y^{k}, v^{k+1}-v^{k}\rangle \\&\quad +\frac{\rho }{2}\left( \left\| {Gv^{k+1}-F(u^{k+1})} \right\| ^{2}-\left\| {Gv^{k}-F(u^{k+1})} \right\| ^{2}\right) \\&=h(v^{k+1})-h(v^{k}) -\left\langle y^{k} + \rho (F(u^{k+1})-Gv^{k+1}) + \frac{\rho }{2}G(v^{k+1}-v^{k}), G(v^{k+1}-v^{k})\right\rangle \\&\le h(v^{k+1})-h(v^{k})-\langle G^{T}y^{k+1},v^{k+1}-v^{k}\rangle -\frac{\rho \lambda _{\min }(G^{T}G)}{2}\left\| {v^{k+1}-v^{k}} \right\| ^{2}. \end{aligned} \end{aligned}$$

Applying identity (4.8) from Lemma 4.5, we obtain that

$$\begin{aligned} \begin{aligned} \delta _{v,k}&\le h(v^{k+1})-h(v^{k})-\langle \nabla h(v^{k}),v^{k+1}-v^{k}\rangle \\&-\frac{2\theta +\rho \lambda _{\min }(G^{T}G)}{2}\left\| {v^{k+1}-v^{k}} \right\| ^{2}. \end{aligned} \end{aligned}$$
(4.15)

Since h has an \(L_{h}\)-Lipschitz continuous gradient, it holds that (cf. Lemma 4.3)

$$\begin{aligned} h(v^{k+1})-h(v^{k})-\langle \nabla h(v^{k}),v^{k+1}-v^{k}\rangle \le \frac{L_{h}}{2}\left\| {v^{k+1}-v^{k}} \right\| ^{2}. \end{aligned}$$
(4.16)

Thus, by summing (4.15) and (4.16), and noting that \(\lambda _{\min }(G^{T}G)\ge 0\), we obtain that

$$\begin{aligned}&\delta _{v,k}\le -\frac{ \rho \lambda _{\min }(G^{T}G)+2\theta -L_{h}}{2}\left\| {v^{k+1}-v^{k}} \right\| ^{2}\nonumber \\&\le -\frac{ 2\theta -L_{h}}{2}\left\| {v^{k+1}-v^{k}} \right\| ^{2}. \end{aligned}$$
(4.17)

Finally, by summing our three bounds (4.12), (4.14), and (4.17), and using the relation

$$\begin{aligned} \varDelta _{{\mathcal {L}}_{\rho },k}=\delta _{u,k} + \delta _{v,k} + \delta _{y,k}, \end{aligned}$$

we obtain the stated inequality (4.11). \(\square \)

Remark 4.1

(On the descent-ascent of the scheme) A prominent difficulty in the Lagrangian-based approach is the ascent property that is induced by the multiplier update, as expressed by Lemma 4.6. In the nonconvex setup, this nullifies a fundamental tool of descent methods for nonconvex optimization—the sufficient decrease property. We overcome this challenge in the forthcoming section.

By combining Lemma 4.5 and Lemma 4.6, we can derive a bound on \(\varDelta _{{\mathcal {L}}_{\rho }, k}\) given in terms of the primal variables u and v.

Corollary 4.1

Suppose that Assumption A holds. Then, for every \(k\ge 1\),

$$\begin{aligned} \varDelta _{{\mathcal {L}}_{\rho }, k}\le & {} -\frac{\nu }{2}\left\| {u^{k+1}-u^{k}} \right\| ^{2}-\left( \frac{2\theta -L_{h}}{2}-\frac{2\theta ^{2}}{\rho \lambda _{\min }(GG^T)}\right) \left\| {v^{k+1}-v^{k}} \right\| ^{2}\nonumber \\&+\frac{2(\theta +L_{h})^{2}}{\rho \lambda _{\min }(GG^T)}\left\| {v^{k}-v^{k-1}} \right\| ^{2}. \end{aligned}$$
(4.18)

The bound in Corollary 4.1 is utilized in the next section to establish a sufficient decrease property for a Lyapunov function.

4.3 The Lyapunov Function and Convergence Guarantees

To reach our goal of convergence to a critical point, we adopt the approach formulated by [9] and introduce the Lyapunov function \({\mathcal {E}}_{\beta }:{\mathbb {R}}^{n}\times {\mathbb {R}}^{m}\times {\mathbb {R}}^{q}\times {\mathbb {R}}^{m}\rightarrow (-\infty ,\infty ]\), with \(\beta >0\), defined by

$$\begin{aligned} {\mathcal {E}}_{\beta }(u,v,y,w){:=}{\mathcal {L}}_{\rho }(u,v,y) +\frac{\beta }{2}\left\| {v-w} \right\| ^{2}. \end{aligned}$$
(4.19)

The following relationship between the critical points of \({\mathcal {E}}_{\beta }\), \({\mathcal {L}}_{\rho }\), and \({\mathcal {L}}\) justifies referring to \({\mathcal {E}}_{\beta }\) as a Lyapunov function for (M).

Proposition 4.1

(Critical points relationships) For any \(u\in {\mathbb {R}}^{n}\), \(v,w\in {\mathbb {R}}^{m}\), and \(y\in {\mathbb {R}}^{q}\), it holds that

$$\begin{aligned} (u, v, y, w)\in {{\,\mathrm{crit}\,}}{\mathcal {E}}_{\beta }&\Longrightarrow (u, v, y)\in {{\,\mathrm{crit}\,}}{\mathcal {L}}_{\rho }\ \text {and}\ {\mathcal {E}}_{\beta }(u, v, y, w) ={\mathcal {L}}_{\rho }(u,v,y),\\ (u, v, y)\in {{\,\mathrm{crit}\,}}{\mathcal {L}}_{\rho }&\Longrightarrow (u, v, y)\in {{\,\mathrm{crit}\,}}{\mathcal {L}}\ \text {and}\ {\mathcal {L}}_{\rho }(u, v, y) ={\mathcal {L}}(u,v,y). \end{aligned}$$

Proof

Let \((u,v,y,w)\in {{\,\mathrm{crit}\,}}{\mathcal {E}}_{\beta }\). Then, by the definitions of \({\mathcal {E}}_{\beta }\) and \({\mathcal {L}}_{\rho }\), we have

$$\begin{aligned} 0&\in \partial _{u}{\mathcal {E}}_{\beta }(u,v,y,w)=\partial _{u}{\mathcal {L}}_{\rho }(u,v,y), \end{aligned}$$
(4.20)
$$\begin{aligned} 0&=\nabla _{v}{\mathcal {E}}_{\beta }(u,v,y,w) =\nabla _{v}{\mathcal {L}}_{\rho }(u,v,y) +\beta (v-w), \end{aligned}$$
(4.21)
$$\begin{aligned} 0&=\nabla _{y}{\mathcal {E}}_{\beta }(u,v,y,w)=\nabla _{y}{\mathcal {L}}_{\rho }(u,v,y), \end{aligned}$$
(4.22)
$$\begin{aligned} 0&=\nabla _{w}{\mathcal {E}}_{\beta }(u,v,y,w)=\beta (w-v). \end{aligned}$$
(4.23)

Thus, we obtain that \(w=v\) by (4.23), and together with (4.21), it follows that \(\nabla _{v}{\mathcal {L}}_{\rho }(u,v,y)=0\). Combined with (4.20) and (4.22), we obtain that \((u,v,y)\in {{\,\mathrm{crit}\,}}{\mathcal {L}}_{\rho }\). The identity \(w=v\) and the definition of \({\mathcal {E}}_{\beta }\) also imply that \({\mathcal {E}}_{\beta }(u,v,y,w)={\mathcal {L}}_{\rho }(u,v,y)\).

Next, assume that \((u,v,y)\in {{\,\mathrm{crit}\,}}{\mathcal {L}}_{\rho }\). Then, by the definitions of \({\mathcal {L}}_{\rho }\) and \({\mathcal {L}}\), we have

$$\begin{aligned} 0&\in \partial _{u}{\mathcal {L}}_{\rho }(u,v,y)=\partial _{u}{\mathcal {L}}(u,v,y)+\rho \nabla F(u)^{T}(F(u)-Gv) \end{aligned}$$
(4.24)
$$\begin{aligned} 0&=\nabla _{v}{\mathcal {L}}_{\rho }(u,v,y)=\nabla _{v}{\mathcal {L}}(u,v,y)-\rho G{^T}(F(u)-Gv), \end{aligned}$$
(4.25)
$$\begin{aligned} 0&=\nabla _{y}{\mathcal {L}}_{\rho }(u,v,y)=\nabla _{y}{\mathcal {L}}(u,v,y)=F(u)-Gv. \end{aligned}$$
(4.26)

So, by (4.26), we obtain that \(\nabla _{y}{\mathcal {L}}(u,v,y)=0\) and that \(F(u)-Gv=0\). Together with (4.24) and (4.25), we also obtain that \(0\in \partial _{u}{\mathcal {L}}(u,v,y)\) and that \(\nabla _{v}{\mathcal {L}}(u,v,y)=0\), and it follows that \((u,v,y)\in {{\,\mathrm{crit}\,}}{\mathcal {L}}\). The identity \(F(u)=Gv\) and the definition of \({\mathcal {L}}_{\rho }\) also imply that \({\mathcal {L}}_{\rho }(u,v,y)={\mathcal {L}}(u,v,y)\). \(\square \)

At this point, we are ready to prove that the sequence \(\{z^{k}{:=}(u^{k}, v^{k}, y^{k}, v^{k-1})\}_{k\ge 1}\) is a gradient-like descent sequence with respect to the Lyapunov function \({\mathcal {E}}_{\beta }\), for appropriate choices of \(\rho \), \(\theta \), and \(\beta \). In the following three lemmas we prove that the conditions of Definition 4.1 are satisfied. We begin by proving the sufficient decrease property of the sequence \(\{z^k\}_{k\ge 1}\) with respect to \({\mathcal {E}}_{\beta }\).

Lemma 4.7

(Sufficient decrease for \({\mathcal {E}}_{\beta }\)) Suppose that Assumption A holds and that

$$\begin{aligned} \rho = \frac{4\eta L_{h}}{\lambda _{\min }(GG^{T})}\quad \text {and}\quad \beta =\theta =\frac{(\eta - 1)L_{h}}{2}, \end{aligned}$$
(4.27)

for some \(\eta >2+\sqrt{5}\). Then, there exists \(c_{1}>0\) such that, for every \(k\ge 1\), we have

$$\begin{aligned} {\mathcal {E}}_{\beta }(z^{k+1})-{\mathcal {E}}_{\beta }(z^{k})\le -c_{1}\left\| {z^{k+1}-z^{k}} \right\| ^{2}. \end{aligned}$$
(4.28)

Proof

Let \(k\ge 1\), and let \(\nu \) be as defined in Assumption A. Recalling the bound on \(\varDelta _{{\mathcal {L}}_{\rho }, k}\) (cf. (4.18)), we observe that

$$\begin{aligned} \begin{aligned} \varDelta _{{\mathcal {E}}_{\beta },k}&{:=}{\mathcal {E}}_{\beta }(u^{k+1}, v^{k+1}, y^{k+1},v^{k})-{\mathcal {E}}_{\beta }(u^{k}, v^{k}, y^{k},v^{k-1})\\&=\varDelta _{{\mathcal {L}}_{\rho }, k} + \frac{\beta }{2}\left( \left\| {v^{k+1}-v^{k}} \right\| ^{2} - \left\| {v^{k}-v^{k-1}} \right\| ^{2}\right) \\&\le -\frac{\nu }{2}\left\| {u^{k+1}-u^{k}} \right\| ^{2}-\alpha _{1}\left\| {v^{k+1}-v^{k}} \right\| ^{2} - \alpha _{2}\left\| {v^{k}-v^{k-1}} \right\| ^{2}, \end{aligned} \end{aligned}$$

with

$$\begin{aligned} \lambda = \lambda _{\min }(GG^T),\; \alpha _{1}=\frac{2\theta -L_{h}-\beta }{2}-\frac{2\theta ^{2}}{\rho \lambda }\quad \text {and}\quad \alpha _{2}=\frac{\beta }{2}-\frac{2(\theta +L_{h})^{2}}{\rho \lambda }. \end{aligned}$$

Substituting \(\rho \), \(\theta \), and \(\beta \) according to (4.27) followed by simple algebra, we obtain that

$$\begin{aligned} \alpha _{1}=\alpha _{2}=\frac{L_{h}}{8\eta }(\eta ^{2}-4\eta -1). \end{aligned}$$
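For completeness, the algebra behind the identity \(\alpha _{1}=\alpha _{2}\) is the following: substituting \(\theta =\beta =(\eta -1)L_{h}/2\) and \(\rho \lambda =4\eta L_{h}\),

$$\begin{aligned} \alpha _{1}&=\frac{(\eta -3)L_{h}}{4}-\frac{(\eta -1)^{2}L_{h}}{8\eta }=\frac{L_{h}}{8\eta }\left( 2\eta (\eta -3)-(\eta -1)^{2}\right) =\frac{L_{h}}{8\eta }(\eta ^{2}-4\eta -1),\\ \alpha _{2}&=\frac{(\eta -1)L_{h}}{4}-\frac{(\eta +1)^{2}L_{h}}{8\eta }=\frac{L_{h}}{8\eta }\left( 2\eta (\eta -1)-(\eta +1)^{2}\right) =\frac{L_{h}}{8\eta }(\eta ^{2}-4\eta -1). \end{aligned}$$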

As the solutions for the quadratic equation \(\eta ^{2}-4\eta -1=0\) are \(\eta _{1,2}=2\pm \sqrt{5}\), it follows that \(\alpha _{1}>0\) for every \(\eta >2+\sqrt{5}\).

Following Lemma 4.5 (cf. (4.9)), we have

$$\begin{aligned} \left\| {y^{k+1}-y^{k}} \right\| ^{2}\le \frac{2(\theta +L_{h})^{2}}{\lambda }\left( \left\| {v^{k+1}-v^{k}} \right\| ^{2}+\left\| {v^{k}-v^{k-1}} \right\| ^{2}\right) . \end{aligned}$$

Adding \(\left\| {v^{k+1}-v^{k}} \right\| ^{2}+\left\| {v^{k}-v^{k-1}} \right\| ^{2}\) to both sides and setting

$$\begin{aligned} \alpha _{3}= 1+2(\theta +L_{h})^{2}/\lambda =1 + \frac{1}{2}\left( (\eta +1)L_{h}\right) ^2/\lambda , \end{aligned}$$

we obtain

$$\begin{aligned}&\left\| {v^{k+1}-v^{k}} \right\| ^{2}+\left\| {v^{k}-v^{k-1}} \right\| ^{2}+\left\| {y^{k+1}-y^{k}} \right\| ^{2}\\&\le \alpha _{3}\left( \left\| {v^{k+1}-v^{k}} \right\| ^{2}+\left\| {v^{k}-v^{k-1}} \right\| ^{2}\right) . \end{aligned}$$

Therefore, with \(c_{1}=\min \{\nu /2, \alpha _{1}/\alpha _{3}\}\), we have

$$\begin{aligned} \varDelta _{{\mathcal {E}}_{\beta },k}\le -c_{1}\left( \left\| {u^{k+1}-u^{k}} \right\| ^{2} +\left\| {v^{k+1}-v^{k}} \right\| ^{2} + \left\| {y^{k+1}-y^{k}} \right\| ^{2}+\left\| {v^{k}-v^{k-1}} \right\| ^{2}\right) . \end{aligned}$$

\(\square \)

We continue by proving the subgradient lower bound and the continuity condition (cf. conditions (C2) and (C3) of Definition 4.1). This requires the boundedness assumption on the proximal parameter sequence stated in Assumption B.

Remark 4.2

The proximal parameter backtracking procedure we devise in Sect. 5 satisfies Assumption A, and it also satisfies Assumption B under a standard boundedness assumption with respect to the sequence generated by DAM.

Lemma 4.8

(Subgradient bound for \({\mathcal {E}}_{\beta }\)) Suppose that Assumption B holds and assume that the sequence \(\{\omega ^{k}{:=}(u^{k},v^{k},y^{k})\}_{k\ge 1}\) is bounded. Then, there exists \(c_{2}>0\), such that, for every \(k\ge 1\), there exists \(\xi ^{k+1}\in \partial {\mathcal {E}}_{\beta }(z^{k+1})\) satisfying

$$\begin{aligned} \left\| {\xi ^{k+1}} \right\| \le c_{2}\left\| {z^{k+1}-z^{k}} \right\| . \end{aligned}$$
(4.29)

Proof

Let \(k\ge 1\). We note that for every \(\xi ^{k+1}=(\xi _{u}^{k+1},\xi _{v}^{k+1}, \xi _{y}^{k+1}, \xi _{w}^{k+1})\in \partial {\mathcal {E}}_{\beta }(z^{k+1}),\) we have

$$\begin{aligned} \xi ^{k+1}_{u}&\in \partial _{u}{\mathcal {E}}_{\beta }(u^{k+1},v^{k+1},y^{k+1},v^{k})=\partial _{u}f(u^{k+1}) +\nabla _{u}\varphi (u^{k+1},v^{k+1},y^{k+1}), \end{aligned}$$
(4.30)
$$\begin{aligned} \xi ^{k+1}_{v}&=\nabla _{v}{\mathcal {E}}_{\beta }(u^{k+1},v^{k+1},y^{k+1},v^{k})\nonumber \\&=\nabla h(v^{k+1}) +\nabla _{v}\varphi (u^{k+1},v^{k+1},y^{k+1}) + \beta (v^{k+1}-v^{k}), \end{aligned}$$
(4.31)
$$\begin{aligned} \xi ^{k+1}_{y}&=\nabla _{y}{\mathcal {E}}_{\beta }(u^{k+1},v^{k+1},y^{k+1},v^{k})=F(u^{k+1})-Gv^{k+1}, \end{aligned}$$
(4.32)
$$\begin{aligned} \xi ^{k+1}_{w}&=\nabla _{w}{\mathcal {E}}_{\beta }(u^{k+1},v^{k+1},y^{k+1},v^{k})=-\beta (v^{k+1}-v^{k}). \end{aligned}$$
(4.33)

Furthermore, we recall that the u-update step (3.1) is stated as the following minimization problem:

$$\begin{aligned} u^{k+1}\in \mathrm {argmin}_{u\in {\mathbb {R}}^{n}}\left\{ f(u)+\langle \nabla _{u}\varphi (u^{k},v^{k},y^{k}), u\rangle + \frac{\sigma _{k+1}}{2}\left\| {u-u^{k}} \right\| ^{2} \right\} . \end{aligned}$$

Thus, we apply the first-order optimality condition and obtain that there exists \(\zeta ^{k+1}\in \partial f(u^{k+1})\) such that

$$\begin{aligned} \zeta ^{k+1} = -\nabla _{u}\varphi (u^{k}, v^{k},y^{k}) -\sigma _{k+1}(u^{k+1}-u^{k}). \end{aligned}$$

It follows that there exists \(\xi ^{k+1}_{u}\) as in (4.30), such that

$$\begin{aligned} \begin{aligned} \left\| {\xi ^{k+1}_{u}} \right\|&= \left\| {\nabla _{u}\varphi (u^{k+1},v^{k+1},y^{k+1})-\nabla _{u}\varphi (u^{k},v^{k},y^{k})-\sigma _{k+1}(u^{k+1}-u^{k})} \right\| \\&\le \left\| {\nabla _{u}\varphi (u^{k+1},v^{k+1},y^{k+1})-\nabla _{u}\varphi (u^{k},v^{k},y^{k})} \right\| + \sigma _{k+1}\left\| {u^{k+1}-u^{k}} \right\| \\&\le \left\| {\nabla _{u}\varphi (u^{k+1},v^{k+1},y^{k+1})-\nabla _{u}\varphi (u^{k},v^{k},y^{k})} \right\| + \sigma _{k+1}\left\| {z^{k+1}-z^{k}} \right\| . \end{aligned} \end{aligned}$$

Recalling that \(\nabla _{u}\varphi \) is locally Lipschitz continuous (cf. Lemma 4.1) and noting that the sequence \(\{\omega ^{k}{:=}(u^{k},v^{k},y^{k})\}_{k\ge 1}\) is bounded, it follows (cf. Proposition B.1) that there exists \(L_{\varphi }\in {\mathbb {R}}_{+}\) such that, for every \(k\ge 1\), we have

$$\begin{aligned} \left\| {\nabla _{u}\varphi (\omega ^{k+1})-\nabla _{u}\varphi (\omega ^{k})} \right\| \le L_{\varphi }\left\| {\omega ^{k+1}-\omega ^{k}} \right\| \le L_{\varphi }\left\| {z^{k+1}-z^{k}} \right\| . \end{aligned}$$

Together with Assumption B, it follows that

$$\begin{aligned} \left\| {\xi ^{k+1}_{u}} \right\| \le (L_{\varphi }+{\bar{\sigma }})\left\| {z^{k+1}-z^{k}} \right\| , \end{aligned}$$

with \({\bar{\sigma }}{:=}\sup _{k\ge 1}\sigma _{k} <\infty \).

Next, we note that

$$\begin{aligned}&\nabla _{v}\varphi (u^{k+1},v^{k+1},y^{k+1})\nonumber \\&=-G^{T}\left( y^{k+1}+\rho (F(u^{k+1})-Gv^{k+1})\right) =-G^{T}(2y^{k+1}-y^{k}), \end{aligned}$$
(4.34)

where the second equality is due to multiplier update step (3.3). Together with identity (4.8) proven in Lemma 4.5, we obtain that,

$$\begin{aligned} \nabla _{v}\varphi (u^{k+1},v^{k+1},y^{k+1})=-G^{T}(y^{k+1}-y^{k})-\nabla h(v^{k})-\theta (v^{k+1}-v^{k}). \end{aligned}$$
(4.35)

So together with (4.31), we have

$$\begin{aligned} \begin{aligned} \left\| {\xi ^{k+1}_{v}} \right\|&=\left\| {\nabla h(v^{k+1}) +\nabla _{v}\varphi (u^{k+1},v^{k+1},y^{k+1}) + \beta (v^{k+1}-v^{k})} \right\| \\&=\left\| {\nabla h(v^{k+1})-\nabla h(v^{k})-G^{T}(y^{k+1}-y^{k})+(\beta -\theta )(v^{k+1}-v^{k})} \right\| \\&\le \left\| {\nabla h(v^{k+1})-\nabla h(v^{k})} \right\| +\left\| {G} \right\| \left\| {y^{k+1}-y^{k}} \right\| +|\beta -\theta |\left\| {v^{k+1}-v^{k}} \right\| \\&\le (L_{h}+ |\beta -\theta |)\left\| {v^{k+1}-v^{k}} \right\| + \left\| {G} \right\| \left\| {y^{k+1}-y^{k}} \right\| \\&\le (L_{h}+ |\beta -\theta | + \left\| {G} \right\| )\left\| {z^{k+1}-z^{k}} \right\| , \end{aligned} \end{aligned}$$
(4.36)

where the second inequality is due to the \(L_{h}\)-Lipschitz continuity of \(\nabla h\).

Finally, we note that

$$\begin{aligned} \left\| {\xi ^{k+1}_{y}} \right\|&=\left\| {F(u^{k+1})-Gv^{k+1}} \right\| =\rho ^{-1}\left\| {y^{k+1}-y^{k}} \right\| \le \rho ^{-1}\left\| {z^{k+1}-z^{k}} \right\| \end{aligned}$$
(4.37)

and

$$\begin{aligned} \left\| {\xi ^{k+1}_{w}} \right\|&=\beta \left\| {v^{k+1}-v^{k}} \right\| \le \beta \left\| {z^{k+1}-z^{k}} \right\| , \end{aligned}$$
(4.38)

where the second equality in (4.37) is due to multiplier update step (3.3). Thus, by summing the bounds for \(\left\| {\xi _{u}^{k+1}} \right\| \), \(\left\| {\xi _{v}^{k+1}} \right\| \), \(\left\| {\xi _{y}^{k+1}} \right\| \), and \(\left\| {\xi _{w}^{k+1}} \right\| \), we obtain that

$$\begin{aligned} \left\| {\xi ^{k+1}} \right\| \le \left\| {\xi _{u}^{k+1}} \right\| +\left\| {\xi _{v}^{k+1}} \right\| +\left\| {\xi _{y}^{k+1}} \right\| +\left\| {\xi _{w}^{k+1}} \right\| \le c_{2}\left\| {z^{k+1}-z^{k}} \right\| , \end{aligned}$$

with \(c_{2}=L_{\varphi } +{\bar{\sigma }}+ L_{h}+ |\beta -\theta |+ \left\| {G} \right\| +\rho ^{-1}+\beta >0\). \(\square \)

Lemma 4.9

(Continuity condition for \({\mathcal {E}}_{\beta }\)) Suppose that Assumptions A and B hold and that \(\rho \), \(\theta \), and \(\beta \) are as in Lemma 4.7. Let \(\{z^{k+1}\}_{k\in {\mathcal {K}}}\) be a subsequence of \(\{z^{k}\}_{k\ge 1}\) which converges to a point \({\bar{z}}=({\bar{u}},{\bar{v}},{\bar{y}},{\bar{w}})\). Then,

$$\begin{aligned} \limsup _{\underset{k\in {\mathcal {K}}}{k\rightarrow \infty }} {\mathcal {E}}_{\beta }(z^{k+1})\le {\mathcal {E}}_{\beta }({\bar{z}}). \end{aligned}$$
(4.39)

Proof

The Lyapunov function \({\mathcal {E}}_{\beta }\) can be written as \({\mathcal {E}}_{\beta }(u,v,y,w)=f(u) +\varPsi (u,v,y,w)\), where the function \(\varPsi \) is defined by

$$\begin{aligned} \varPsi (u,v,y,w){:=}g(u) + h(v) + \langle y, F(u)-Gv\rangle +\frac{\rho }{2}\left\| {F(u)-Gv} \right\| ^{2}+\frac{\beta }{2}\left\| {v-w} \right\| ^{2}. \end{aligned}$$

Obviously, \(\varPsi \) is a continuous function and, as f is proper and lsc, it follows that \({\mathcal {E}}_{\beta }\) is also proper and lsc. Furthermore, we have \(\lim _{\underset{k\in {\mathcal {K}}}{k\rightarrow \infty }}\varPsi (u^{k+1},v^{k+1},y^{k+1},v^{k})=\varPsi ({\bar{u}},{\bar{v}},{\bar{y}},{\bar{w}})\), and hence inequality (4.39) can be stated as

$$\begin{aligned} \begin{aligned} \limsup _{\underset{k\in {\mathcal {K}}}{k\rightarrow \infty }}{\mathcal {E}}_{\beta }(u^{k+1},v^{k+1},y^{k+1},v^{k})&=\limsup _{\underset{k\in {\mathcal {K}}}{k\rightarrow \infty }}f(u^{k+1})+\lim _{\underset{k\in {\mathcal {K}}}{k\rightarrow \infty }}\varPsi (u^{k+1},v^{k+1},y^{k+1},v^{k})\\ {}&= \limsup _{\underset{k\in {\mathcal {K}}}{k\rightarrow \infty }}f(u^{k+1})+\varPsi ({\bar{u}},{\bar{v}},{\bar{y}},{\bar{w}})\\&\le f({\bar{u}})+\varPsi ({\bar{u}},{\bar{v}},{\bar{y}},{\bar{w}})={\mathcal {E}}_{\beta }({\bar{u}},{\bar{v}},{\bar{y}},{\bar{w}}). \end{aligned} \end{aligned}$$

Therefore, it remains to prove the following inequality

$$\begin{aligned} \limsup _{\underset{k\in {\mathcal {K}}}{k\rightarrow \infty }}f(u^{k+1})\le f({\bar{u}}). \end{aligned}$$
(4.40)

Let \(k\in {\mathcal {K}}\). We recall that u-step (3.1) is given by the following minimization problem:

$$\begin{aligned} u^{k+1}\in \mathrm {argmin}_{u\in {\mathbb {R}}^{n}}\left\{ f(u)+\langle \nabla _{u}\varphi (u^{k},v^{k},y^{k}), u\rangle + \frac{\sigma _{k+1}}{2}\left\| {u-u^{k}} \right\| ^{2} \right\} , \end{aligned}$$

which immediately implies that

$$\begin{aligned}&f(u^{k+1}) + \langle \nabla _{u}\varphi (u^{k},v^{k},y^{k}), u^{k+1}\rangle + \frac{\sigma _{k+1}}{2}\left\| {u^{k+1}-u^{k}} \right\| ^{2}\\&\le f({\bar{u}}) + \langle \nabla _{u}\varphi (u^{k},v^{k},y^{k}), {\bar{u}}\rangle + \frac{\sigma _{k+1}}{2}\left\| {{\bar{u}}-u^{k}} \right\| ^{2}. \end{aligned}$$

After some algebra we can state the above inequality as

$$\begin{aligned} \begin{aligned} f(u^{k+1})&\le f(\bar{u }) +\left\langle \nabla _{u}\varphi (u^{k},v^{k},y^{k})+\frac{\sigma _{k+1}}{2}({\bar{u}}-u^{k}+u^{k+1}-u^{k}), {\bar{u}}-u^{k+1}\right\rangle \\ {}&\le f({\bar{u}})+\left\| {\nabla _{u}\varphi (u^{k},v^{k},y^{k})+\frac{\sigma _{k+1}}{2}({\bar{u}}-u^{k}+u^{k+1}-u^{k})}\right\| \left\| {{\bar{u}}-u^{k+1}} \right\| \\ {}&\le f({\bar{u}})+\alpha _{k}\left\| {{\bar{u}}-u^{k+1}} \right\| , \end{aligned}\qquad \end{aligned}$$
(4.41)

with

$$\begin{aligned} \alpha _{k}{:=}\left\| {\nabla _{u}\varphi (u^{k},v^{k},y^{k})} \right\| +\frac{\sigma _{k+1}}{2}\left( \left\| {{\bar{u}}-u^{k}} \right\| +\left\| {u^{k+1}-u^{k}} \right\| \right) \ge 0. \end{aligned}$$
(4.42)

The second and third inequalities in (4.41) are due to the Cauchy–Schwarz and triangle inequalities, respectively.

Taking \(\limsup \) over \({\mathcal {K}}\), we obtain that

$$\begin{aligned} \limsup _{\underset{k\in {\mathcal {K}}}{k\rightarrow \infty }}f(u^{k+1})\le f({\bar{u}}) + \limsup _{\underset{k\in {\mathcal {K}}}{k\rightarrow \infty }}\left( \alpha _{k}\left\| {{\bar{u}}-u^{k+1}} \right\| \right) . \end{aligned}$$

Thus, proving inequality (4.40) reduces to proving that \(\lim _{\underset{k\in {\mathcal {K}}}{k\rightarrow \infty }}\left( \alpha _{k}\left\| {{\bar{u}}-u^{k+1}} \right\| \right) =0\). Recalling that \(\lim _{\underset{k\in {\mathcal {K}}}{k\rightarrow \infty }}u^{k+1}={\bar{u}}\), it is sufficient to prove that

$$\begin{aligned} \limsup _{\underset{k\in {\mathcal {K}}}{k\rightarrow \infty }}\alpha _{k}<\infty , \end{aligned}$$

which requires that we analyze the subsequence \(\{(u^{k},v^{k}, y^{k})\}_{k\in {\mathcal {K}}}\). To this end, we recall the sufficient descent result of Lemma 4.7 (cf. (4.28)), i.e., that there exists \(c_{1}> 0\) such that

$$\begin{aligned} {\mathcal {E}}_{\beta }(z^{k+1})-{\mathcal {E}}_{\beta }(z^{k})\le -c_{1}\left\| {z^{k+1}-z^{k}} \right\| ^{2},\quad \forall k\ge 1. \end{aligned}$$

Thus, summing \(\left\| {z^{k+1}-z^{k}} \right\| ^{2}\) over \({\mathcal {K}}\), we obtain that

$$\begin{aligned} \begin{aligned} \sum _{k\in {\mathcal {K}}}\left\| {z^{k+1}-z^{k}} \right\| ^{2}&\le \lim _{\underset{N\in {\mathcal {K}}}{N\rightarrow \infty }}\sum _{k=1}^{N}\left\| {z^{k+1}-z^{k}} \right\| ^{2}\le \limsup _{\underset{N\in {\mathcal {K}}}{N\rightarrow \infty }} c_{1}^{-1}\sum _{k=1}^{N}\left( {\mathcal {E}}_{\beta }(z^{k})-{\mathcal {E}}_{\beta }(z^{k+1}))\right) \\&=c_{1}^{-1}\left( {\mathcal {E}}_{\beta }(z^{1}) - \liminf _{\underset{N\in {\mathcal {K}}}{N\rightarrow \infty }}{\mathcal {E}}_{\beta }(z^{N+1})\right) \le c_{1}^{-1}\left( {\mathcal {E}}_{\beta }(z^{1})-{\mathcal {E}}_{\beta }({\bar{z}})\right) <\infty \text {,} \end{aligned} \end{aligned}$$
(4.43)

where the first inequality is due to the sufficient descent and the second inequality is justified by the fact that \(\lim _{\underset{k\in {\mathcal {K}}}{k\rightarrow \infty }}z^{k+1}={\bar{z}}\) and that \({\mathcal {E}}_{\beta }\) is proper and lsc, which implies that

$$\begin{aligned} \liminf _{\underset{k\in {\mathcal {K}}}{k\rightarrow \infty }}{\mathcal {E}}_{\beta }(z^{k+1}) \ge {\mathcal {E}}_{\beta }({\bar{z}})>-\infty \text {.} \end{aligned}$$

As the nonnegative series \(\sum _{k\in {\mathcal {K}}}\left\| {z^{k+1}-z^{k}} \right\| ^{2}\) is finite, it follows that \(\lim _{\underset{k\in {\mathcal {K}}}{k\rightarrow \infty }}\left\| {z^{k+1}-z^{k}} \right\| =0,\) which implies that the subsequence \(\{z^{k}\}_{k\in {\mathcal {K}}}\) also converges to \({\bar{z}}\). This result has several consequences. First, as \(\varphi \) is \(C^{1}\), it follows that \(\nabla _{u}\varphi \) is continuous over \({\mathbb {R}}^{n}\times {\mathbb {R}}^{m}\times {\mathbb {R}}^{q}\) and therefore

$$\begin{aligned} \lim _{\underset{k\in {\mathcal {K}}}{k\rightarrow \infty }}\left\| {\nabla _{u}\varphi (u^{k},v^{k},y^{k})} \right\| =\left\| {\nabla _{u}\varphi ({\bar{u}},{\bar{v}},{\bar{y}})} \right\| <\infty . \end{aligned}$$
(4.44)

Second, noting that \(\lim _{\underset{k\in {\mathcal {K}}}{k\rightarrow \infty }}u^{k}=\lim _{\underset{k\in {\mathcal {K}}}{k\rightarrow \infty }}u^{k+1}={\bar{u}}\), we have

$$\begin{aligned} \lim _{\underset{k\in {\mathcal {K}}}{k\rightarrow \infty }}\left\| {u^{k+1}-u^{k}} \right\| =0\quad \text {and}\quad \lim _{\underset{k\in {\mathcal {K}}}{k\rightarrow \infty }}\left\| {{\bar{u}}-u^{k}} \right\| =0. \end{aligned}$$

Together with Assumption B which states that \(\sup _{k\ge 1}\sigma _{k}<\infty \), it follows that

$$\begin{aligned} \lim _{\underset{k\in {\mathcal {K}}}{k\rightarrow \infty }}\frac{\sigma _{k+1}}{2}\left( \left\| {{\bar{u}}-u^{k}} \right\| +\left\| {u^{k+1}-u^{k}} \right\| \right) =0. \end{aligned}$$
(4.45)

Summing (4.44) and (4.45) and recalling the definition of \(\alpha _{k}\) (cf. (4.42)) we obtain that

$$\begin{aligned} \limsup _{\underset{k\in {\mathcal {K}}}{k\rightarrow \infty }}\alpha _{k}=\lim _{\underset{k\in {\mathcal {K}}}{k\rightarrow \infty }}\alpha _{k}<\infty . \end{aligned}$$

\(\square \)

Equipped with Lemma 4.7, Lemma 4.8, and Lemma 4.9, we can now apply Lemma 4.4 and Theorem 4.1, followed by Proposition 4.1, to establish that Algorithm 1 achieves criticality, given that the parameters \(\rho \) and \(\theta \) are chosen according to (4.27).

Theorem 4.2

(Subsequence and global convergence) Suppose that Assumption A and Assumption B hold true, and that the sequence \(\{\omega ^{k}{:=}(u^{k},v^{k},y^{k})\}_{k\ge 1}\) is bounded. Let \(\rho \) and \(\theta \) be chosen such that

$$\begin{aligned} \rho = (4\eta L_{h})/\lambda _{\min }(GG^{T})\;\text {and}\; \theta = (\eta - 1)L_h/2\;\text {for some}\; \eta >2+\sqrt{5}. \end{aligned}$$

Then the following assertions hold:

  1. (i)

    Let \(\varOmega \) be the set of cluster points of the sequence \(\{\omega ^{k}\}_{k\ge 1}\). Then, \(\varOmega \) is a nonempty and compact set; \(\varOmega \subset {{\,\mathrm{crit}\,}}{\mathcal {L}}\); \(\lim _{k\rightarrow \infty }{\text {dist}}(\omega ^{k}, \varOmega )=0\); and \({\mathcal {L}}\) is finite and constant on \(\varOmega \).

  2. (ii)

    Assume, in addition, that the functions f, g, and h, and the mapping F are semi-algebraic. Then, \(\{\omega ^{k}\}_{k\ge 1}\) has a finite length, i.e., \(\sum _{k=1}^{\infty }\left\| {\omega ^{k+1}-\omega ^{k}} \right\| <\infty \), and it converges to a critical point \(\omega ^{*}\in {{\,\mathrm{crit}\,}}{\mathcal {L}}\).

Remark 4.3

Recall that, as proved in [6], a semi-algebraic function satisfies the KL property, and that the semi-algebraic property is preserved under most useful operations, e.g., finite sums and products of semi-algebraic functions, and their composition (see, e.g., [7] for examples and references). The class of problems modeled by semi-algebraic functions is ubiquitous in applications, and hence the global convergence result in Theorem 4.2 is applicable to a wide variety of models.

Remark 4.4

Asymptotic convergence rates of the sequence \(\{\omega ^{k}\}_{k\ge 1}\) could also be established; see e.g., [8, Theorem 6.3].

5 The Proximal Parameter Backtracking Procedure

In this section we propose a tractable backtracking procedure to generate proximal parameters that meet the requirements posed by our analysis in Sect. 4.1 (note that (R2) is an implicit requirement):

  1. (R1)

    The sequence \(\{\sigma _{k}\}_{k\ge 1}\) satisfies the descent condition of Assumption A.

  2. (R2)

    The backtracking procedure completes in a finite time.

  3. (R3)

    The sequence \(\{\sigma _{k}\}_{k\ge 1}\) is guaranteed to be bounded, i.e., Assumption B holds true whenever the sequence \(\{ (u^k,v^k,y^k) \}_{k\ge 1} \) is bounded.

The backtracking procedure is described by Algorithm 2, which comprises a loop that increases the candidate for the proximal parameter until a sufficient decrease property is met.

[Algorithm 2: The proximal parameter backtracking procedure, with initialization steps (5.1)-(5.2), trial points \(u^{k,j}\), and termination condition (5.4).]
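A minimal Python sketch of the backtracking loop described above (phi and grad_u_phi are assumed callables for \(\varphi (\cdot ,v^{k},y^{k})\) and \(\nabla _{u}\varphi (\cdot ,v^{k},y^{k})\); prox(x, sigma) is assumed to return a point of \(\mathop {{\text {prox}}}\nolimits _{\sigma ^{-1}f}^{}(x)\); the constants nu, mu, sigma0 play the roles of \(\nu \), \(\mu \), \(\sigma _{0}\), and the initialization is only indicative of steps (5.1)-(5.2) of Algorithm 2):

```python
import numpy as np

def backtrack_sigma(u, sigma_prev, phi, grad_u_phi, prox, nu=1e-3, mu=2.0, sigma0=1e-3):
    # Increase the candidate proximal parameter by the factor mu until the
    # sufficient decrease test (cf. (4.5)/(5.4)) is met; return (sigma_{k+1}, u^{k+1}).
    g = grad_u_phi(u)
    sigma = max(sigma_prev, sigma0)           # indicative initialization, cf. (5.1)-(5.2)
    while True:
        u_trial = prox(u - g / sigma, sigma)  # trial point u^{k,j}: a proximal gradient step
        gap = phi(u_trial) - phi(u) - g @ (u_trial - u)
        if gap <= 0.5 * (sigma - nu) * np.linalg.norm(u_trial - u) ** 2:
            return sigma, u_trial
        sigma *= mu
```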

The generated proximal parameter satisfies (R1) by definition. We now focus on proving that it also meets requirements (R2) and (R3), starting with (R2). Our derivations rely on the theory of Lipschitz functions, mainly on the fact that \(\nabla \varphi \) is locally Lipschitz continuous (cf. Lemma 4.1); we refer the reader to some useful review/results on Lipschitz and local Lipschitz continuity in Appendix B.

Lemma 5.1

(Finite termination) The backtracking procedure in Algorithm 2 terminates in a finite number of iterations.

Proof

Let \(k\ge 1\). For every \(j\ge 0\), define

$$\begin{aligned} \begin{aligned} S_{k,j}&{:=}\mathop {{\text {prox}}}\nolimits _{\sigma _{k,j}^{-1}f}^{}(u^{k}-\sigma _{k,j}^{-1}\nabla _{u}\varphi (u^{k},v^{k},y^{k}))\\&=\mathrm {argmin}_{u\in {\mathbb {R}}^{n}}\left\{ f(u)+ \langle \nabla _{u}\varphi (u^{k},v^{k},y^{k}), u\rangle +\frac{\sigma _{k,j}}{2}\left\| {u-u^{k}} \right\| ^{2}\right\} . \end{aligned} \end{aligned}$$

Note that \(\sigma _{k,j}\ge \sigma _{0}>0\), as can be easily verified by steps (5.1) and (5.2). Thus, \(S_{k,j}\) is nonempty and compact (cf. Remark 3.2); specifically, with \(r_{k,j}{:=}\sup \{\left\| {x-u^{k}} \right\| : x\in S_{k,j}\}\), we have \(r_{k,j}<\infty \). We will now show that as we increase the proximal parameter \(\sigma _{k,j} \rightarrow \sigma _{k,j+1}\), the distance from the solution set \(S_{k,j}\) to \(u^{k}\) does not increase beyond \(r_{k,1}\), i.e., \(r_{k,j}\le r_{k,1}\) for any \(j\ge 1\).

For any \(j>1\), let \(x^{j}\in S_{k,j}\) and \(x^{1}\in S_{k,1}\). Then

$$\begin{aligned}&f(x^{j})+ \langle \nabla _{u}\varphi (u^{k},v^{k},y^{k}), x^{j}\rangle +\frac{\sigma _{k,j}}{2}\left\| {x^{j}-u^{k}} \right\| ^{2}\\&\le f(x^{1})+ \langle \nabla _{u}\varphi (u^{k},v^{k},y^{k}), x^{1}\rangle +\frac{\sigma _{k,j}}{2}\left\| {x^{1}-u^{k}} \right\| ^{2} \end{aligned}$$

and

$$\begin{aligned}&f(x^{1})+ \langle \nabla _{u}\varphi (u^{k},v^{k},y^{k}), x^{1}\rangle +\frac{\sigma _{k,1}}{2}\left\| {x^{1}-u^{k}} \right\| ^{2}\\&\le f(x^{j})+ \langle \nabla _{u}\varphi (u^{k},v^{k},y^{k}), x^{j}\rangle +\frac{\sigma _{k,1}}{2}\left\| {x^{j}-u^{k}} \right\| ^{2}. \end{aligned}$$

Consequently, noting that \(\sigma _{k,j}>\sigma _{k,1}\), we can sum the two inequalities above, divide by \((\sigma _{k,j}-\sigma _{k,1})/2\), and obtain that

$$\begin{aligned} \left\| {x^{j}-u^{k}} \right\| \le \left\| {x^{1}-u^{k}} \right\| \le r_{k,1}. \end{aligned}$$

Thus,

$$\begin{aligned} \{(u^{k,j}, v^{k}, y^{k}) \}_{j\ge 0} \subseteq {\mathcal {B}}[(u^{k}, v^{k}, y^{k}), r_{k,1}], \end{aligned}$$

where \({\mathcal {B}}[(u^{k}, v^{k}, y^{k}), r_{k,1}]\) is the Euclidean norm ball centered at \((u^{k}, v^{k}, y^{k})\) with radius \(r_{k,1}\). Since \(\nabla \varphi \) is locally Lipschitz continuous (cf. Lemma 4.1), we in particular have that for the nonempty, convex, and compact set \({\mathcal {B}}[(u^{k}, v^{k}, y^{k}), r_{k,1}]\), there exists \(L_{k}\ge 0\) such that

$$\begin{aligned} \left\| {\nabla _u \varphi (x) - \nabla _u \varphi (z)} \right\| \le L_k \left\| {x-z} \right\| ,\quad \forall x,z\in {\mathcal {B}}[(u^{k}, v^{k}, y^{k}), r_{k,1}]. \end{aligned}$$

Subsequently, the descent lemma (cf. Lemma 4.3) holds true over \({\mathcal {B}}[(u^{k}, v^{k}, y^{k}), r_{k,1}]\), meaning that the procedure terminates after at most \(1 +\max \{0,\lceil (\log (L_{k}+\nu ) -\log \sigma _{k,0})/\log \mu \rceil \}\) iterations, with \(\sigma _{k+1}\le \max \{\sigma _{k}, \mu (L_{k}+\nu )\}\), as the iteration prior to the last must satisfy that \(\sigma _{k,J_{k}}/\mu - \nu =\sigma _{k,J_{k}-1}-\nu <L_{k}\). \(\square \)

Next we establish that the proximal parameters generated via Algorithm 2 satisfy (R3) whenever the sequence \(\{(u^k, v^{k}, y^{k})\}_{k\ge 1}\) is bounded.

Lemma 5.2

(Boundedness of the proximal parameter) Suppose that the sequence \(\{(u^k, v^{k}, y^{k})\}_{k\ge 1}\) is bounded. Additionally, for \({\mathcal {U}}_{k}{:=}\{u^{k,j}: j=1,2,\ldots ,J_{k}\}\), assume that \(\bigcup _{k=0}^{\infty }{\mathcal {U}}_{k}\) is bounded. Then, the proximal parameter sequence \(\{\sigma _{k}\}_{k\ge 1}\) is bounded.

Proof

It is clear that \(\inf _{k\ge 1}\sigma _{k}\ge \sigma _{0}>0\) (cf. (5.1) and (5.2)), and therefore it is enough to prove that

$$\begin{aligned} \sup _{k\ge 1}\sigma _{k} <\infty . \end{aligned}$$
(5.5)

First note that \(u^{k+1}\in {\mathcal {U}}_{k}\), since \(u^{k+1}=u^{k,J_{k}}\), for every \(k\ge 1\). Next, set \(C_{u}{:=}{{\,\mathrm{cl}\,}}{{\,\mathrm{conv}\,}}\left( \{u^{0}\}\cup \bigcup _{k=0}^{\infty }{\mathcal {U}}_{k}\right) \), \(C_{v}{:=}{{\,\mathrm{cl}\,}}{\{v^{k}\}_{k\ge 0}}\), and \(C_{y}{:=}{{\,\mathrm{cl}\,}}{\{y^{k}\}_{k\ge 0}}\). Then, the set \(C{:=}C_{u}\times C_{v}\times C_{y}\) is nonempty and compact, and since \(\nabla _{u}\varphi \) is locally Lipschitz continuous (cf. Lemma 4.1), it follows (cf. Proposition B.1) that there exists \(L_{C}\in {\mathbb {R}}_{+}\) such that \(\nabla _{u}\varphi \) is \(L_{C}\)-Lipschitz continuous over C; in particular, for every \(k\ge 0\), \(\nabla _{u}\varphi (\cdot , v^{k},y^{k})\) is \(L_{C}\)-Lipschitz continuous over the convex and compact set \(C_{u}\). Hence, we can apply the descent lemma (cf. Lemma 4.3) and obtain that, for every \(\sigma \ge L_{C}+\nu \), every \(k\ge 1\), and every \(j\in \{1,2,\ldots ,J_{k}\}\), we have

$$\begin{aligned}&\varphi (u^{k,j}, v^{k}, y^{k})- \varphi (u^{k}, v^{k}, y^{k}) -\langle \nabla _{u}\varphi (u^{k}, v^{k}, y^{k}),u^{k,j}-u^{k}\rangle \nonumber \\&\le \frac{\sigma -\nu }{2}\left\| {u^{k,j}-u^{k}} \right\| ^{2}. \end{aligned}$$
(5.6)

To obtain (5.5), we prove by induction that, for every \(k\ge 0\), we have

$$\begin{aligned} \sigma _{k}\le \max \{\sigma _{0}, \mu (L_{C}+\nu )\}. \end{aligned}$$
(5.7)

For \(k=0\), we have \(\sigma _{k}=\sigma _{0}\), and so (5.7) trivially holds. For the induction step, consider the call to the backtracking procedure made by DAM at iteration k, and assume, toward a contradiction, that

$$\begin{aligned} \sigma _{k+1}>\max \{\sigma _{0}, \mu (L_{C}+\nu )\}. \end{aligned}$$
(5.8)

Steps (5.1) and (5.2) together with the induction hypothesis imply that

$$\begin{aligned} \sigma _{k,1}\le \sigma _{k}\le \max \{\sigma _{0}, \mu (L_{C}+\nu )\}. \end{aligned}$$
(5.9)

Since \(\sigma _{k+1}=\sigma _{k,J_{k}}\), (5.8) and (5.9) imply that \(J_{k}\ge 2\). In addition, as \(\sigma _{k,J_{k}-1}=\sigma _{k,J_{k}}/\mu \) (cf. (5.2)), inequality (5.8) implies that \(\sigma _{k,J_{k}-1}>L_{C}+\nu \). Together with (5.6), it then follows that the procedure’s termination condition (5.4) already holds for \(j=J_{k}-1\), a contradiction, since by the definition of \(J_{k}\) it first holds at \(j=J_{k}\). \(\square \)

We conclude this section with a remark on the case where global Lipschitz constants are available.

Remark 5.1

(Global Lipschitz continuity) Suppose that the function g has an \(L_{g}\)-Lipschitz continuous gradient, and that the mapping F is \(l_{F}\)-Lipschitz continuous with an \(L_{F}\)-Lipschitz continuous Jacobian. Then the backtracking procedure in Algorithm 2 can be replaced with the explicit update

$$\begin{aligned} \sigma _{k+1} = \nu + L_{g} + L_{F}\left\| {y^{k}+\rho (F(u^{k})-Gv^{k})} \right\| +\rho l_{F}^{2}. \end{aligned}$$

It can easily be verified that this proximal parameter generator satisfies the descent condition of Assumption A and that, when the sequence \(\{(u^{k},v^{k},y^{k})\}_{k\ge 1}\) is bounded, Assumption B also holds, i.e., \(\{\sigma _{k}\}_{k\ge 1}\) is bounded.
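As an illustration only, the explicit generator of Remark 5.1 can be coded in a few lines; the argument names below (the constants \(L_{g}\), \(L_{F}\), \(l_{F}\), \(\rho \), \(\nu \), the callable F, and the matrix G) are placeholders for the corresponding data of the model.

```python
import numpy as np

def sigma_next(u_k, v_k, y_k, F, G, L_g, L_F, l_F, rho, nu):
    """Closed-form proximal parameter of Remark 5.1 (global Lipschitz constants known).

    Implements sigma_{k+1} = nu + L_g + L_F*||y^k + rho*(F(u^k) - G v^k)|| + rho*l_F**2,
    where F is a callable returning F(u^k) and G is the matrix of the model.
    """
    return nu + L_g + L_F * np.linalg.norm(y_k + rho * (F(u_k) - G @ v_k)) + rho * l_F ** 2
```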

6 Implication to the Proximal Gradient Method Under Locally Lipschitz Data

As an interesting byproduct of our analysis, we can immediately obtain convergence results for the proximal gradient scheme with locally Lipschitz data. More precisely, in this section we briefly examine the special case \(h \equiv 0, F \equiv 0\), in which the model (M) reduces to the nonconvex sum composite minimization problem

$$\begin{aligned} \min _{u\in {\mathbb {R}}^{n}} f(u) + g(u), \end{aligned}$$

where \(f: {\mathbb {R}}^{n} \rightarrow (-\infty , \infty ]\) is a proper and lsc function, and \(g: {\mathbb {R}}^{n} \rightarrow {\mathbb {R}}\) is a continuously differentiable function with a locally Lipschitz continuous gradient. Alternatively, by setting G to be the identity map, the model (M) reduces to the nonlinear composite model \(\min \{ f(u) + g(u)+ h(F(u)): \; u\in {\mathbb {R}}^{n}\}\). In the latter case, one can verify that, under our local Lipschitz assumptions for (M), the gradient of the composite term \(h\circ F\) is also locally Lipschitz. Thus, in both special cases, we minimize the sum composite model:

$$\begin{aligned} \min \{ f(u) + \psi (u): \; u\in {\mathbb {R}}^{n}\}, \end{aligned}$$

where \(\psi (u):= g(u)\; \text {or}\; \psi (u):= g(u)+ h(F(u))\), both having a locally Lipschitz continuous gradient.
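For completeness, and recalling that in this reduction the outer function h enters smoothly (otherwise the gradient of the composite term would not be defined), the structure of \(\nabla \psi \) in the second case can be made explicit via the chain rule,

$$\begin{aligned} \nabla \psi (u)=\nabla g(u)+J_{F}(u)^{T}\nabla h(F(u)), \end{aligned}$$

where \(J_{F}(u)\) denotes the Jacobian of F at u; its local Lipschitz continuity on bounded sets then follows from the corresponding local Lipschitz properties assumed on the data of (M).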

In this case, Algorithm 1 is effectively a proximal gradient scheme with a dynamic proximal parameter, described below.

Prox-Grad with locally Lipschitz gradient (PG-Loc).

  1. Generate a proximal parameter \(\sigma _{k+1}>0\);

  2. Set \(u^{k+1} \in \mathrm {argmin}_{u\in {\mathbb {R}}^{n}}\left\{ f(u) +\langle \nabla \psi (u^{k}), u\rangle + \frac{\sigma _{k+1}}{2}\left\| {u-u^{k}} \right\| ^{2}\right\} \).

According to our convergence result in Theorem 4.2, this scheme has global convergence guarantees even though \(\psi \) has only a locally Lipschitz continuous gradient.

Theorem 6.1

(Subsequence and global convergence of PG-Loc) Suppose that Assumption A and Assumption B hold true, and that the sequence \(\{u^{k}\}_{k\ge 1}\) generated by PG-Loc is bounded. Then the following implications hold.

  (i) Let \(\varOmega \) be the set of cluster points of the sequence \(\{u^{k}\}_{k\ge 1}\). Then, \(\varOmega \) is a nonempty, compact, and connected set; \(\varOmega \subset {{\,\mathrm{crit}\,}}(f + \psi )\); \(\lim _{k\rightarrow \infty }{\text {dist}}(u^{k}, \varOmega )=0\); and \((f + \psi )\) is finite and constant on \(\varOmega \).

  (ii) Assume, in addition, that the functions f and \(\psi \) are semi-algebraic. Then, \(\{u^{k}\}_{k\ge 1}\) has finite length, i.e., \(\sum _{k=1}^{\infty }\left\| {u^{k+1}- u^{k}} \right\| <\infty \), and it converges to a critical point \(u^{*}\in {{\,\mathrm{crit}\,}}(f + \psi )\) of the objective function.

In this case, we can invoke Algorithm 2 as the proximal parameter generator to derive a tractable, globally convergent proximal gradient method with backtracking for composite problems with a locally Lipschitz continuous gradient.
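For illustration, the following self-contained Python sketch combines PG-Loc with a backtracking proximal-parameter generator on a toy \(\ell _{1}\)-regularized problem whose smooth part has a locally, but not globally, Lipschitz continuous gradient; the problem instance, constants, and function names are illustrative choices and are not prescribed by the analysis above.

```python
import numpy as np

def soft_threshold(z, tau):
    # prox of tau*||.||_1; a convex f is chosen here only to keep the prox explicit
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def pg_loc(u0, psi, grad_psi, lam, sigma0=1.0, mu=2.0, nu=1e-6, iters=200):
    """A minimal sketch of PG-Loc with a backtracking proximal-parameter generator.
    The descent test mirrors the inequality used in Sect. 5; all constants are
    illustrative and not prescribed by the paper."""
    u, sigma = u0.copy(), sigma0
    for _ in range(iters):
        g, psi_u = grad_psi(u), psi(u)
        while True:
            u_new = soft_threshold(u - g / sigma, lam / sigma)  # prox-gradient step
            d = u_new - u
            if psi(u_new) <= psi_u + np.dot(g, d) + 0.5 * (sigma - nu) * np.dot(d, d):
                break                                           # descent test passed
            sigma *= mu                                         # backtrack on sigma
        u = u_new
    return u

# toy instance: psi(u) = 0.25*||A u - b||^4 has a locally (not globally) Lipschitz gradient
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 50)), rng.standard_normal(20)
psi = lambda u: 0.25 * np.linalg.norm(A @ u - b) ** 4
grad_psi = lambda u: np.linalg.norm(A @ u - b) ** 2 * (A.T @ (A @ u - b))
u_star = pg_loc(np.zeros(50), psi, grad_psi, lam=0.1)
```

A convex f is used here only so that the proximal step admits a closed form; the scheme itself, as discussed above, allows any proper lsc (possibly nonconvex) f.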

7 Concluding Remarks

Under mild, yet specific, data assumptions and structure, we have established subsequence and global convergence results by using a dynamic backtracking mechanism to determine the proximal parameter. The suggested algorithm DAM, together with a minor refinement of the analysis, can also be applied to a model where G has full column rank but not full row rank, i.e., \(G^{T}G\) is positive definite while \(GG^{T}\) is rank deficient. In this case, we must explicitly require that the range of the mapping F is contained in the range of the matrix G. One may also consider other alternatives for the v-update step: (i) linearize \(\varphi (u^{k+1}, \cdot , y^{k})\) and leave h intact; (ii) linearize \(\varPsi _{k}\), i.e., linearize both h and \(\varphi (u^{k+1}, \cdot , y^{k})\). However, these alternatives are less attractive since they impose substantial additional restrictions on the model at hand, and specifically on the matrix G. Finally, let us mention that if we do not linearize h in the optimization process, then h must be ‘prox-friendly’, a requirement that limits the cases where the step’s minimization subproblem is tractable. In fact, in that scenario, it can be easily verified that the convergence analysis we conduct in Sect. 4 can be further simplified.