1 Introduction

We consider the solution of the problem

$$\begin{aligned} \begin{array}{cl} \underset{{\mathbf {x}}}{\text{ minimize }} &{} U(\mathbf {x}) \\ \text{ s.t. } &{} {\mathbf {x}}\in {\mathcal X}, \end{array} \end{aligned}$$
(P)

where \({\mathcal {X}} \subseteq \mathbb {R}^n\) is a nonempty closed set and \(U:O \rightarrow \mathbb {R}\) is locally Lipschitz on O, an open set containing \(\mathcal X\). We are interested in a method for finding stationary points of the nonconvex problem (P) with the following properties: (i) feasibility is maintained throughout iterates, (ii) nonsmooth (objective or constraint) functions can be handled, and (iii) the algorithm is possibly amenable to parallel/distributed implementations.

Properties (i) and (iii) are of well-recognized interest and have been extensively studied in the literature, somewhat independently, both by optimizers and by different engineering communities. Among these results, we list below the main approaches, paying attention to the ability to handle nonsmooth problems and with an emphasis on those methods that can be applied to nonconvex problems and, in particular, to problems with nonconvex constraints.

Standard feasible approaches include the classical logarithmic barrier Interior-Point (IP) method by Fiacco and McCormick [18], see also [10, 13, 21] for modern developments that try to overcome the numerical pitfalls of the original algorithm, and the Feasible Sequential Quadratic Programming (FSQP) method of Tits, see e.g. [27, 35–37] and [11, 14, 25, 38, 50] for further developments. Both IP and FSQP methods require smoothness and are not suited to parallel/distributed implementations. Parallel Variable Distribution (PVD) schemes, proposed in [17, 40, 46] for smooth problems, are suitable for implementation over parallel architectures. However, PVD methods require an amount of information exchange/knowledge that is often not compatible with a distributed architecture, and they call for the solution of possibly difficult nonconvex subproblems; moreover, convergence has been established only for convex constraints [17, 46] or nonconvex constraints with a Cartesian product structure [40]. In [34] methods for DC programs are discussed. Among them, Algorithm II generates feasible iterates; parallel/distributed schemes are also considered, but only for problems with a convex feasible set. While the methods proposed in [34] enjoy very strong convergence properties, in that they converge to B-stationary points, they are potentially computationally very intensive and only apply to a specific subclass of DC programs. In [8] new results are proved on a proximal alternating linearized minimization method for nonconvex and nonsmooth problems. This method maintains feasibility but requires at each iteration the solution of (generally) nontrivial nonsmooth and nonconvex subproblems.

The paper [48] is the rigorous culmination of a series of works in the structural engineering community, such as the CONvex LINearization (CONLIN) method [19, 20] and the Method of Moving Asymptotes (MMA) [47]. It introduces a large class of feasible schemes for smooth problems, potentially suited to parallel/distributed implementations, called Conservative Convex Separable Approximation (CCSA). A serious drawback of CCSA is that, besides requiring the difficult tuning of some unknown parameters, it finds stationary points of a reformulation that are not necessarily stationary for the original problem. To some extent similar to CCSA methods is the INner Convex Approximation (INCA) algorithm, which was first proposed in 1978 by Marks and Wright [30] for problems with a smooth convex objective function and smooth nonconvex constraints. In the INCA approach a sequence of convex subproblems is solved in which the original nonconvex constraints are replaced by upper convex approximations, so that the algorithm generates only feasible iterates. INCA has recently been studied more rigorously in [3], in the case of a strongly convex objective function.
In [1], the authors, unaware of [30], use a specific inner convex approximation of the feasible set and, in the spirit of majorization/minimization techniques, also approximate the possibly nonconvex objective function with an upper convex quadratic, so as to solve at each iteration a surrogate problem with a convex quadratic objective function and convex quadratic constraints. The resulting algorithm, which is potentially suited to parallel/distributed implementations, cannot deal with nonsmooth problems; moreover, the proposed approximation of the objective function, based on the global Lipschitz property of its gradient, is possibly hard to obtain and, more importantly, can lead to poor numerical performance (see Sect. 6).

As a major departure from the above approaches, in this work we consider general nonconvex problems and quite an ample class of nonsmooth functions, and we provide easily implementable solution methods for (P) that, in many cases, also lend themselves naturally to parallel and distributed implementations. Building on the INCA paradigm, along the lines put forward in [43, 44], our method is based on the (inexact) solution of a sequence of strongly convex optimization subproblems, followed by a step-size procedure. At each iteration \({\mathbf {x}}^\nu \), a subproblem of the following type is formed:

$$\begin{aligned} \begin{array}{cl} \underset{{\mathbf {x}}}{\text{ minimize }} &{} {\widetilde{U}}({\mathbf {x}}; {\mathbf {x}}^\nu ) \\ \text{ s.t. } &{} {\mathbf {x}}\in \widetilde{\mathcal {X}}({\mathbf {x}}^\nu ), \end{array} \end{aligned}$$
(\({\hbox {P}_{{\mathbf {x}}^\nu }}\))
where \({\widetilde{U}} (\bullet ; {\mathbf {x}}^\nu )\) is a strongly convex “approximation” of U and \(\widetilde{\mathcal {X}}({\mathbf {x}}^\nu )\) is a closed, convex, inner approximation of \(\mathcal X\), meaning that \(\widetilde{\mathcal {X}}({\mathbf {x}}^\nu ) \subseteq \mathcal X\). Let \({\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu )\) be the unique solution of problem (\({\hbox {P}_{{\mathbf {x}}^\nu }}\)); the underlying idea of the approach is that the subproblems (\({\hbox {P}_{{\mathbf {x}}^\nu }}\)) are chosen so that \({\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu )\) can be computed efficiently. The new iteration \({\mathbf {x}}^{\nu +1}\) is then given by moving from \({\mathbf {x}}^{\nu }\) towards (an approximation of) \({\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu )\) according to a step-size \(\gamma ^\nu \). We prove that, under appropriate assumptions, every limit point of the sequence \(\{{\mathbf {x}}^{\nu }\}\) is stationary for the original problem (P).
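In compact form (in the exact case, i.e. when the solution of (\({\hbox {P}_{{\mathbf {x}}^\nu }}\)) is computed exactly), the update reads

$$\begin{aligned} {\mathbf {x}}^{\nu +1} = {\mathbf {x}}^{\nu } + \gamma ^\nu \left( {\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu ) - {\mathbf {x}}^{\nu }\right) , \qquad \gamma ^\nu \in (0,1], \end{aligned}$$

so that, since \(\widetilde{\mathcal {X}}({\mathbf {x}}^\nu )\) is convex and contains both \({\mathbf {x}}^\nu \) and \({\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu )\), the whole segment between them lies in \(\widetilde{\mathcal {X}}({\mathbf {x}}^\nu ) \subseteq \mathcal X\) and feasibility is preserved at every iteration (see Sect. 3 for the details).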

Our main contributions are: (a) the definition and analysis of a general algorithmic framework enjoying properties (i), (ii), and (iii) and vastly improving over all the approaches described above; (b) the first practical feasible method for ample classes of nonsmooth, nonconvex problems requiring at each iteration only the solution of strongly convex subproblems (as opposed to [8, 34]); (c) a systematic approach to the development of parallel and distributed solution methods that can be applied to classes of problems for which parallel and distributed methods were not available; (d) the analysis of several step-size rules (Armijo-type, diminishing, and fixed), and of corresponding inexactness criteria for the solution of problem (\({\hbox {P}_{{\mathbf {x}}^\nu }}\)); (e) the first numerically efficient solution method for an important open problem in green communications, paving the way to the application of our algorithmic framework to a host of unsolved resource allocation problems in interference networks. During the reviewing process, a referee brought to our attention the interesting, recent paper [7]. Also in [7] a sequence of strongly convex approximating subproblems is solved; in particular, following the INCA approach, the authors resort to inner convex approximations of the feasible set. The analysis in [7] is very elegant and shows full convergence to a single limit point; moreover, convergence rate aspects are investigated. There are, however, several differences with respect to our approach: (i) [7] follows the classical majorization/minimization idea, as in [1], i.e., at each iteration the nonconvex objective function is approximated by an upper convex estimate; this feature, on the one hand, simplifies the convergence analysis but, on the other hand, may be numerically impractical, see the eloquent numerical results in Sect. 6; (ii) the analysis is geared towards smooth problems and the assumptions made are strong, so that only simple and well-behaved forms of nonsmoothness, under (Clarke’s) regularity, can actually be treated, see Sect. 5; (iii) no consideration is given to practical aspects like inexactness and parallel/distributed implementations.

The paper is organized as follows. In the next section we recall some basic background material on nonsmooth analysis and optimality conditions. In Sect. 3 we give an overall description of our approach and present some preliminary results, while in Sect. 4 we provide an in-depth convergence analysis. Section 5 is devoted to a detailed presentation of approximations that can be used in our framework and to their use in the development of parallel/distributed versions of our algorithm. Section 6 is dedicated to the application of our results to an open problem in green communications and shows how the flexibility of our approach makes it easy to tailor the method to specific problems and obtain some impressive numerical results.

2 Background material

In this section we briefly recall some basic nonsmooth analysis notions, but we refer the reader to [31, 39] for more details and other results that, in some cases, will be freely invoked. We assume that all sets C and functions f considered in this section are closed and locally Lipschitz continuous, respectively.

2.1 Continuity properties of set-valued mappings

Let \(M: \mathbb R^n \rightrightarrows \mathbb R^m\) be a set-valued mapping and C a subset of \(\mathbb R^n\).

Definition 1

([39, Definition 5.4]) The set-valued mapping M is outer semicontinuous relative to C at \({\bar{{\mathbf {x}}}}\in C\) if \( \limsup _{{\mathbf {x}}\underset{C}{\rightarrow } {\bar{{\mathbf {x}}}}} M({\mathbf {x}}) \subseteq M({\bar{{\mathbf {x}}}}), \) and inner semicontinuous at \({\bar{{\mathbf {x}}}}\) if \( \liminf _{{\mathbf {x}}\underset{C}{\rightarrow } {\bar{{\mathbf {x}}}}} M({\mathbf {x}}) \supseteq M({\bar{{\mathbf {x}}}}). \) M is called continuous (relative to C) at \({\bar{{\mathbf {x}}}}\) if both conditions hold.
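For instance (a standard one-dimensional illustration of ours): the mapping \(M: \mathbb R \rightrightarrows \mathbb R\) defined by \(M(x) \triangleq \{0\}\) for \(x \ne 0\) and \(M(0) \triangleq [-1,1]\) is outer semicontinuous at \({\bar{x}}=0\), since every limit of points \(v^\nu \in M(x^\nu )\) with \(x^\nu \rightarrow 0\) belongs to \(M(0)\), but it is not inner semicontinuous there, because \(\liminf _{x \rightarrow 0} M(x) = \{0\} \not \supseteq M(0) = [-1,1]\).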

The following definition extends to set-valued mappings the notion of locally bounded function.

Definition 2

([39, Definition 5.14]) The set-valued mapping M is locally bounded at \({\bar{{\mathbf {x}}}} \in \mathbb R^n\) if, for some neighborhood \({\mathcal {V}}_{{\bar{{\mathbf {x}}}}}\) of \({\bar{{\mathbf {x}}}}\), the set \(M({\mathcal {V}}_{{\bar{{\mathbf {x}}}}}) \subseteq \mathbb R^m\) is bounded.

Finally, we recall the Aubin property, which is a Lipschitz-type condition that makes reference to specific points in the graph of M.

Definition 3

([39, Definition 9.36]) The set-valued mapping M has the Aubin property relative to C at \({\bar{{\mathbf {x}}}} \in C\) for \({\bar{{\mathbf {u}}}} \in M({\bar{{\mathbf {x}}}})\), if the graph of M is locally closed at \(({\bar{{\mathbf {x}}}}, {\bar{{\mathbf {u}}}})\) and there are neighborhoods \({\mathcal {V}}_{{\bar{{\mathbf {x}}}}}\) of \({\bar{{\mathbf {x}}}}\), \({\mathcal {W}}_{{\bar{{\mathbf {u}}}}}\) of \({\bar{{\mathbf {u}}}}\), and a constant \(\kappa \in \mathbb R_+\triangleq \{ t\in \mathbb {R}: t\ge 0\}\) such that, denoting by \(\mathbb B\) the closed unit ball,

$$\begin{aligned} M({\mathbf {y}}) \cap {\mathcal {W}}_{{\bar{{\mathbf {u}}}}} \subseteq M({\mathbf {x}}) + \kappa \, \Vert {\mathbf {y}}- {\mathbf {x}}\Vert \mathbb B \quad \quad \text {for all} \quad {\mathbf {x}}, \, {\mathbf {y}}\in C \cap {\mathcal {V}}_{{\bar{{\mathbf {x}}}}}. \end{aligned}$$

2.2 Normal cones and subgradients

Definition 4

([39, Definition 6.3, 6(19)]) Let a closed set \(C\subseteq \mathbb {R}^n\) and a point \({\bar{{\mathbf {x}}}} \in C\) be given. A vector \({\mathbf {v}}\) is a regular normal to C at \({\bar{{\mathbf {x}}}}\) if \({\mathbf {v}}^{\scriptscriptstyle T}({\mathbf {y}}-{\bar{{\mathbf {x}}}}) \le o(\Vert {\mathbf {y}}-{\bar{{\mathbf {x}}}}\Vert )\), for \({\mathbf {y}}\in C.\) The set of regular normals to C at \({\bar{{\mathbf {x}}}}\) is denoted by \(\widehat{N}_C({\bar{{\mathbf {x}}}})\). Moreover, \({\mathbf {v}}\) is a normal to C at \({\bar{{\mathbf {x}}}}\) if there are sequences \({\mathbf {x}}^\nu \underset{C}{\rightarrow } {\bar{{\mathbf {x}}}}\) and \({\mathbf {v}}^\nu \rightarrow {\mathbf {v}}\) with \({\mathbf {v}}^\nu \in \widehat{N}_C({\mathbf {x}}^\nu )\). The set of normals to C at \({\bar{{\mathbf {x}}}}\) is denoted by \(N_C({\bar{{\mathbf {x}}}})\). Lastly, the convexified normal cone to C at \({\bar{{\mathbf {x}}}}\) is defined by \(\overline{N}_C({\bar{{\mathbf {x}}}}) = \text {cl co} \, N_C({\bar{{\mathbf {x}}}})\), where cl and co denote the closure and the convex hull of a set, respectively.

The set C is termed regular at \({\bar{{\mathbf {x}}}} \in C\) if \(\widehat{N}_C({\bar{{\mathbf {x}}}}) =N_C({\bar{{\mathbf {x}}}})\); convex sets are regular at each of their points. With the aid of normal cones we can give the following definition of subgradients, see [31, Definition 1.77, 1.83 and (2.73)] or [39, Theorem 8.9 and 8(32)].

Definition 5

The set of regular subgradients of f at \(\bar{{\mathbf {x}}} \in C\) is \(\widehat{\partial } f(\bar{{\mathbf {x}}}) \triangleq \{ {\mathbf {v}}| ({\mathbf {v}},-1) \in \widehat{N}_{\mathrm{epi}f}(\bar{{\mathbf {x}}}, f(\bar{{\mathbf {x}}}))\}\). The set of subgradients of f at \(\bar{{\mathbf {x}}}\in C\) is \(\partial f(\bar{{\mathbf {x}}}) \triangleq \{ {\mathbf {v}}| ({\mathbf {v}},-1) \in N_{\mathrm{epi}f}(\bar{{\mathbf {x}}}, f(\bar{{\mathbf {x}}}))\}\). The set of Clarke subgradients of f at \(\bar{{\mathbf {x}}} \in C\) is \({\bar{\partial }} f(\bar{{\mathbf {x}}}) \triangleq \{ {\mathbf {v}}| ({\mathbf {v}},-1) \in \overline{N}_{\mathrm{epi}f}(\bar{{\mathbf {x}}}, f(\bar{{\mathbf {x}}}))\}.\)

We recall [39, Theorems 8.6, 8.49] that, if \({\bar{{\mathbf {x}}}} \in C\), \(\widehat{\partial }f({\bar{{\mathbf {x}}}})\), \(\partial f({\bar{{\mathbf {x}}}})\) and \({\bar{\partial }} f({\bar{{\mathbf {x}}}})\) are closed with \(\widehat{\partial }f({\bar{{\mathbf {x}}}})\) convex, \(\widehat{\partial }f({\bar{{\mathbf {x}}}}) \subseteq \partial f({\bar{{\mathbf {x}}}})\) and, by [39, Theorems 8.49 and 9.13], \({\bar{\partial }} f({\bar{{\mathbf {x}}}}) = \text {co} \, \partial f({\bar{{\mathbf {x}}}})\). Letting \({\bar{{\mathbf {x}}}}\) be a point in C, by [39, Theorem 8.7],

  • \(\partial f\) is outer semicontinuous relative to C,

and the following properties of the subgradient set characterize locally Lipschitz continuous functions, see [39, Theorem 9.13]:

  • \(\partial f({\bar{{\mathbf {x}}}})\) is nonempty and compact;

  • both \(\widehat{\partial }f\) and \(\partial f\) are locally bounded as set-valued mappings.

The locally Lipschitz function f is said to be regular at \({\bar{{\mathbf {x}}}} \in C\) if \(\mathrm{epi}f\) is regular at \(({\bar{{\mathbf {x}}}}, f({\bar{{\mathbf {x}}}}))\). Furthermore, f is regular at \({\bar{{\mathbf {x}}}} \in C\) if and only if \(\widehat{\partial }f({\bar{{\mathbf {x}}}}) = \partial f({\bar{{\mathbf {x}}}})\) and, thus, the three subgradient sets coincide. In particular, if f is convex on C or continuously differentiable on an open set containing C, then it is regular at each point of C. Besides, in the former case, the set of subgradients \(\partial f({\bar{{\mathbf {x}}}})\) coincides with the classical subdifferential from convex analysis while, in the latter, the set of subgradients collapses to the gradient.
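A standard one-dimensional example (ours, for illustration) shows the difference among these sets and the role of regularity: for \(f(x) = -|x|\) at \({\bar{x}}=0\), one has \(\widehat{\partial } f(0) = \emptyset \), \(\partial f(0) = \{-1, 1\}\) and \({\bar{\partial }} f(0) = \text {co} \, \partial f(0) = [-1,1]\), so f is not regular at 0; by contrast, \(f(x)=|x|\), being convex, is regular at 0 and \(\widehat{\partial } f(0) = \partial f(0) = {\bar{\partial }} f(0) = [-1,1]\), the classical subdifferential of convex analysis.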

2.3 Optimality conditions and constraint qualifications

Consider the constrained optimization problem

$$\begin{aligned} \begin{array}{cl} \underset{{\mathbf {x}}}{\text{ minimize }} &{} f(\mathbf {x}) \\ \text{ s.t. } &{} {\mathbf {x}}\in {C}, \end{array} \end{aligned}$$
(1)

where \(C\subseteq \mathbb {R}^n\) is a nonempty, closed set and f is locally Lipschitz on an open set containing C. The basic necessary optimality condition for a point \({\bar{{\mathbf {x}}}} \in C\) to be (locally) optimal for this problem is, see [39, Theorems 8.15 and 9.13],

$$\begin{aligned} {\mathbf 0} \in \partial f({\bar{{\mathbf {x}}}}) + N_C({\bar{{\mathbf {x}}}}). \end{aligned}$$
(2)

When f and C are convex, the previous condition is also sufficient for \({\bar{{\mathbf {x}}}}\) to be globally optimal. A key issue in applying this condition is the capability to estimate the normal cone \(N_C({\bar{{\mathbf {x}}}})\). To this end, as usual, we suppose that \(C \triangleq \{{\mathbf {x}}\in S: h_j({\mathbf {x}}) \le 0, \, j=1, \ldots , m\}\), with \(S \subseteq \mathbb R^n\) closed and, for every j, \(h_j\) locally Lipschitz continuous on an open set containing S. In this setting, if we suppose that a standard constraint qualification holds, we can obtain a “dual” description of normal vectors.

Definition 6

The (nonsmooth) Mangasarian–Fromovitz Constraint Qualification (MFCQ) holds at \(\bar{{\mathbf {x}}} \in C\) if

$$\begin{aligned} \mathbf {0}\in \sum _{j=1}^m \mu _j \partial h_j({\bar{{\mathbf {x}}}})+N_S(\bar{\mathbf {x}}), \; \varvec{\mu }\in N_{\mathbb R^m_-}(\mathbf {h}({\bar{{\mathbf {x}}}})) \;\Rightarrow \;\varvec{\mu }=\mathbf {0}, \end{aligned}$$
(3)

where \(\mathbf {h}({\mathbf {x}}) \triangleq (h_1({\mathbf {x}}), \ldots , h_m({\mathbf {x}}))^{\scriptscriptstyle T}\) and \(\mathbb R^m_- \triangleq \{\varvec{\mu }\in \mathbb {R}^m | \varvec{\mu }\le 0\}.\)

A somewhat weaker condition can be obtained by replacing \(\sum _{j=1}^m \mu _j \partial h_j({\bar{{\mathbf {x}}}})\) by \(\partial (\sum _{j=1}^m \mu _j h_j({\bar{{\mathbf {x}}}}))\); we refrain from doing this, since, for our purpose, the MFCQ (3) is what we can really expect to check and use in practice. The condition \(\varvec{\mu }\in N_{\mathbb {R}^m_-}(\mathbf {h}({\bar{{\mathbf {x}}}}))\) can be rewritten, taking into account \({\bar{{\mathbf {x}}}}\in C\), as \( \mu _j \ge 0, \) \(\mu _j h_j({\bar{{\mathbf {x}}}}) =0 \) for all j. It is worth pointing out that, if \(S=\mathbb {R}^n\) and all functions involved are smooth, (3) reduces to the classical MFCQ.
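Indeed, in that case \(\partial h_j({\bar{{\mathbf {x}}}}) = \{\nabla h_j({\bar{{\mathbf {x}}}})\}\) and \(N_S({\bar{{\mathbf {x}}}}) = \{\mathbf {0}\}\), so that (3) becomes

$$\begin{aligned} \sum _{j=1}^m \mu _j \nabla h_j({\bar{{\mathbf {x}}}}) = \mathbf {0}, \quad \mu _j \ge 0, \quad \mu _j h_j({\bar{{\mathbf {x}}}}) = 0 \;\; \forall j \quad \Rightarrow \quad \varvec{\mu }= \mathbf {0}, \end{aligned}$$

i.e., positive linear independence of the gradients of the active constraints, a condition equivalent to the classical MFCQ.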

In view of the description of normal vectors for Lipschitzian constraints given in [39, Corollary 10.50], if (3) holds at \({\bar{{\mathbf {x}}}} \in C\), thanks to [39, Theorem 9.13, Corollary 10.9] and by [39, Theorem 6.42], we have

$$\begin{aligned} N_{C}({\bar{{\mathbf {x}}}}) \subseteq \bigcup \left\{ \sum _{j=1}^m \mu _j \partial h_j({\bar{{\mathbf {x}}}}) | \varvec{\mu }\in N_{\mathbb R^m_-}(\mathbf {h}(\bar{{\mathbf {x}}}))\right\} + N_{S}({\bar{{\mathbf {x}}}}), \end{aligned}$$
(4)

with equality holding if all functions \(h_j\)s and the set S are regular at \({\bar{{\mathbf {x}}}}\). A particularly important case is the convex one: whenever all functions \(h_j\)s and the set S are convex, and hence regular, equality holds in (4) and we get an exact representation of the normal cone. Combining relations (2) and (4), it is straightforward to obtain KKT-like optimality conditions. Consider problem (1), with C having the constraint structure just discussed. Let \({\bar{{\mathbf {x}}}}\in C\) be a minimum point at which the MFCQ holds, then

$$\begin{aligned} {\mathbf 0} \in \partial f(\bar{{\mathbf {x}}}) + \bigcup \left\{ \sum _{j=1}^m \mu _j \partial h_j({\bar{{\mathbf {x}}}}) | \varvec{\mu }\in N_{\mathbb R^m_-} (\mathbf {h}(\bar{{\mathbf {x}}}))\right\} + N_{S}({\bar{{\mathbf {x}}}}). \end{aligned}$$
(5)

The KKT-like conditions (5), although less stringent than the necessary condition for local optimality (2) (but the two are equivalent for regular data), are the stationarity conditions generally used when studying algorithms.
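As a simple worked illustration of (5) (a toy example of ours): take \(f({\mathbf {x}}) = -x_1\), \(S = \mathbb {R}^2\) and the single constraint \(h_1({\mathbf {x}}) = \Vert {\mathbf {x}}\Vert ^2 - 1 \le 0\). At the minimizer \({\bar{{\mathbf {x}}}} = (1,0)\) the MFCQ holds, since the only active gradient \(\nabla h_1({\bar{{\mathbf {x}}}}) = (2,0)\) is nonzero, and (5) is satisfied with \(\mu _1 = 1/2\), because \(\mathbf {0} = \nabla f({\bar{{\mathbf {x}}}}) + \mu _1 \nabla h_1({\bar{{\mathbf {x}}}}) = (-1,0) + \tfrac{1}{2}(2,0)\).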

Remark 1

In this paper we express all optimality conditions and constraint qualifications by resorting to subgradients, not to Clarke subgradients. This choice allows us to get sharper optimality conditions and weaker constraint qualifications. The price we pay is a more complex technical analysis for some of our results. Nevertheless, all the conclusions in this paper still hold if Clarke subgradients are employed.

3 Outline of the method

Our aim is to compute stationary solutions of (P) while preserving feasibility of the iterates, by solving a sequence of “easier” convex subproblems. In this section we give a description of the overall method and present some preliminary technical results; the complete convergence properties are investigated in the next section.

Given the current feasible iterate \({\mathbf {x}}^\nu \), we consider a strongly convex “approximation” to the original nonconvex problem (P). The convex feasible set \(\widetilde{\mathcal {X}}({\mathbf {x}}^\nu )\) of this subproblem is contained in the feasible region \({\mathcal {X}}\) of (P) and is such that \({\mathbf {x}}^\nu \in \widetilde{\mathcal {X}}({\mathbf {x}}^\nu )\). We then compute an (approximate) solution \({\mathbf {v}}^\nu \) of the subproblem and generate the new iteration \({\mathbf {x}}^{\nu +1}\) as a suitably chosen point on the segment \([{\mathbf {x}}^\nu , {\mathbf {v}}^\nu ]\). By construction, \({\mathbf {x}}^{\nu +1}\) still belongs to \({\mathcal {X}}\) since the whole segment \([{\mathbf {x}}^\nu , {\mathbf {v}}^\nu ]\) is contained in \(\widetilde{\mathcal {X}}({\mathbf {x}}^\nu )\subseteq {\mathcal {X}}\). Clearly, in order to prove convergence we need to make proper assumptions: these conditions are discussed in this and in the next section, while in Sect. 5 we present many examples of approximations that satisfy the required assumptions. In this section we only make minimal assumptions and, accordingly, prove a weak convergence result, see Proposition 1. Although Proposition 1 is of little interest per se, it constitutes the basis for the developments in the next section.

We introduce the following approximation of (P) at a feasible point \({\mathbf {y}}\in \mathcal {X}\):

$$\begin{aligned} \begin{array}{cl} \underset{{\mathbf {x}}}{\text{ minimize }} &{} {\widetilde{U}}({\mathbf {x}}; {\mathbf {y}}) \\ \text{ s.t. } &{} {\mathbf {x}}\in \widetilde{\mathcal {X}}({\mathbf {y}}), \end{array} \end{aligned}$$
(\({\hbox {P}_{{\mathbf {y}}}}\))

where \({\widetilde{U}} ({\mathbf {x}}; {\mathbf {y}})\) and \(\widetilde{\mathcal {X}}({\mathbf {y}})\) are convex approximations at \({\mathbf {y}}\) of U and \({\mathcal X}\), respectively. The basic requirement on \({\widetilde{U}}\) is that it capture the first-order behavior of U at \({\mathbf {y}}\) while being strongly convex. In order to make subproblem (\({\hbox {P}_{{\mathbf {y}}}}\)) convex, we also need \(\widetilde{\mathcal {X}}({\mathbf {y}})\) to be convex; in addition, we require \(\widetilde{\mathcal{X}}({\mathbf {y}})\) to contain \({\mathbf {y}}\) and to be an inner approximation of \(\mathcal X\). In what follows, \(O_{\mathbf {x}}\) and \(O_{\mathbf {y}}\) are an open convex set and an open set containing \(\mathcal X\), respectively.

Assumption U

Let \(\widetilde{U}:O_{\mathbf {x}}\times O_{\mathbf {y}}\rightarrow \mathbb {R}\) satisfy the following properties:

(U1) :

\(\widetilde{U}(\bullet ;{\mathbf {y}})\) is a finite strongly convex function on \(O_{\mathbf {x}}\) for every \({\mathbf {y}}\in {\mathcal {X}}\) with modulus of strong convexity independent of \({\mathbf {y}}\);

(U2) :

\(\partial _1 {\widetilde{U}}(\bullet ; \bullet )\) is locally bounded relative to \({\mathcal {X}}\times {\mathcal {X}}\) at every \(({\mathbf {y}}, {\mathbf {y}})\) with \({\mathbf {y}}\in {\mathcal {X}}\);

(U3) :

\(\limsup _{({\mathbf {u}},\mathbf {w}) \underset{\mathcal X\times \mathcal X}{\longrightarrow } ({\mathbf {y}}, {\mathbf {y}})} \partial _1 {\widetilde{U}}({\mathbf {u}}; \mathbf {w}) \subseteq \partial U({\mathbf {y}})\) for every \( {\mathbf {y}}\in {\mathcal {X}}\);

where \(\partial _{1}\widetilde{U}({\mathbf {u}};\mathbf {w})\) denotes the set of subgradients of \(\widetilde{U}(\bullet ;\mathbf {w})\) evaluated at \({\mathbf {u}}\).

Assumption X

(X1) :

\(\widetilde{\mathcal {X}}({\mathbf {y}})\) is closed and convex for every \({\mathbf {y}}\in \mathcal X\);

(X2) :

\({\mathbf {y}}\in \widetilde{\mathcal {X}}({\mathbf {y}})\) and \( \widetilde{\mathcal {X}}({\mathbf {y}})\subseteq \mathcal{X}\) for every \({\mathbf {y}}\in \mathcal X\).

U1 and X1, together with X2, which ensures that the feasible set \(\widetilde{\mathcal {X}}({\mathbf {y}})\) is nonempty, guarantee that all subproblems (\({\hbox {P}_{{\mathbf {y}}}}\)) are strongly convex and have a unique solution, which we denote by \(\hat{{\mathbf {x}}}({\mathbf {y}})\):

$$\begin{aligned} \hat{{\mathbf {x}}}({\mathbf {y}})\triangleq {\mathop {\mathrm{argmin}}\limits _{{{\mathbf {x}}\in \widetilde{\mathcal {X}}({\mathbf {y}})}}}\, \widetilde{U}({\mathbf {x}};{\mathbf {y}}). \end{aligned}$$

Our Algorithmic Framework is based on the successive (approximate) solution of subproblems (\({\hbox {P}_{{\mathbf {y}}}}\)) followed by some kind of line-search procedure.

Algorithmic Framework 1 (AF 1): given a feasible starting point \({\mathbf {x}}^0 \in {\mathcal X}\), step-sizes \(\gamma ^\nu \in (0,1]\) and error terms \(\varepsilon ^\nu \ge 0\), at each iteration \(\nu \) compute a point \({\mathbf {v}}^\nu \in \widetilde{\mathcal {X}}({\mathbf {x}}^\nu )\) such that \(\Vert {\mathbf {v}}^\nu - {\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu )\Vert \le \varepsilon ^\nu \), and then set \({\mathbf {x}}^{\nu +1} = {\mathbf {x}}^\nu + \gamma ^\nu ({\mathbf {v}}^\nu - {\mathbf {x}}^\nu )\) (step (S.3)).
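A minimal Python sketch of the AF 1 iteration is reported below. It assumes, purely for illustration, a smooth objective (so \(H \equiv 0\) in the notation of Sect. 4), a box feasible set (so that one may take \(\widetilde{\mathcal {X}}({\mathbf {y}}) = \mathcal X\)), the proximal-linearization surrogate \({\widetilde{U}}({\mathbf {x}};{\mathbf {y}}) = \nabla F({\mathbf {y}})^{\scriptscriptstyle T}({\mathbf {x}}-{\mathbf {y}}) + \frac{\tau }{2}\Vert {\mathbf {x}}-{\mathbf {y}}\Vert ^2\), exact subproblem solutions (\(\varepsilon ^\nu = 0\)), and a diminishing step-size rule of the kind analyzed in Sect. 4; none of these specific choices is prescribed by the framework, and all function names are ours.

```python
import numpy as np

def F(x):
    # Toy nonconvex smooth objective (illustrative only).
    return np.sum(x ** 2) + np.sum(np.sin(3.0 * x))

def grad_F(x):
    return 2.0 * x + 3.0 * np.cos(3.0 * x)

def solve_subproblem(x, tau, lo, hi):
    # Unique minimizer of grad_F(x)^T (z - x) + (tau/2)||z - x||^2 over the box
    # [lo, hi]: for this simple surrogate it is a projected gradient step.
    return np.clip(x - grad_F(x) / tau, lo, hi)

def af1(x0, lo, hi, tau=10.0, max_iter=500, tol=1e-8):
    x = np.clip(np.asarray(x0, dtype=float), lo, hi)   # feasible starting point
    for nu in range(max_iter):
        v = solve_subproblem(x, tau, lo, hi)           # here eps^nu = 0 (exact solution)
        if np.linalg.norm(v - x) <= tol:               # approximate stationarity (cf. Prop. 1(iv))
            break
        gamma = 2.0 / (nu + 2.0)                       # diminishing step-size rule
        x = x + gamma * (v - x)                        # step (S.3): the iterate stays feasible
    return x

x_star = af1(x0=np.array([0.9, -0.7]), lo=-1.0, hi=1.0)
print(x_star, F(x_star))
```

In this simple instance the subproblem solution is available in closed form; in general, any solver for the strongly convex subproblem (\({\hbox {P}_{{\mathbf {x}}^\nu }}\)) can be used in place of the closed-form step.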

Assumptions U and X alone are not enough to establish the convergence of AF 1 to stationary points of problem (P), and yet some instructive aspects of AF 1 can be investigated under Assumptions U and X alone. U2 and U3 establish a link between the first-order properties of U and those of \({\widetilde{U}}\). To illustrate their meaning, let us consider first the simple case of continuously differentiable U and \({\widetilde{U}}\): U2 is then automatically satisfied, while U3 simply postulates that \(\nabla _1 {\widetilde{U}}({\mathbf {y}}; {\mathbf {y}}) = \nabla U({\mathbf {y}})\) for all \({\mathbf {y}}\in {\mathcal {X}}\), i.e. that \({\widetilde{U}}(\bullet ; {\mathbf {y}})\) and U have the same first-order behavior at the “base point” \({\mathbf {y}}\). Assume now that \({\widetilde{U}}\) is locally Lipschitz continuous: Lemma 1, in view of the local Lipschitz continuity of U, shows that, whenever \({\widetilde{U}}\) is (jointly) locally Lipschitz in both its variables and \({\widetilde{U}}(\bullet ; {\mathbf {y}})\) is convex for every \({\mathbf {y}}\in {\mathcal {X}}\) (as it must be by U1), U2 is automatically satisfied and, by virtue of the outer semicontinuity property of the partial subgradient set-valued mapping, U3 boils down to the simple condition \(\partial _{1}\widetilde{U}({\mathbf {y}}; {\mathbf {y}})\subseteq \partial U({\mathbf {y}})\) for all \({\mathbf {y}}\in \mathcal {X}\).
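As a simple illustration (one admissible choice among many, not mandated by the framework): if U is continuously differentiable, then, for any \(\tau > 0\),

$$\begin{aligned} {\widetilde{U}}({\mathbf {x}};{\mathbf {y}}) \triangleq U({\mathbf {y}}) + \nabla U({\mathbf {y}})^{\scriptscriptstyle T}({\mathbf {x}}-{\mathbf {y}}) + \frac{\tau }{2}\Vert {\mathbf {x}}-{\mathbf {y}}\Vert ^{2} \end{aligned}$$

satisfies U1 with modulus \(\tau \), U2 because \(\partial _1 {\widetilde{U}}({\mathbf {u}}; \mathbf {w}) = \{\nabla U(\mathbf {w}) + \tau ({\mathbf {u}}-\mathbf {w})\}\) is single-valued and continuous, and U3 because this partial gradient converges to \(\nabla U({\mathbf {y}}) \in \partial U({\mathbf {y}})\) as \(({\mathbf {u}},\mathbf {w}) \rightarrow ({\mathbf {y}},{\mathbf {y}})\).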

Lemma 1

Let \({\widetilde{U}}: O_{\mathbf {x}}\times O_{\mathbf {y}}\rightarrow \mathbb R\) be locally Lipschitz at \(({\mathbf {y}}, {\mathbf {y}}) \in {\mathcal {X}} \times {\mathcal {X}}\) and such that \({\widetilde{U}}(\bullet ; \mathbf {w})\) is convex on \(O_{\mathbf {x}}\) for every \(\mathbf {w}\in {\mathcal {X}}\) in a neighborhood of \({\mathbf {y}}\). Then, the set-valued mapping \(\partial _1 {\widetilde{U}}(\bullet ; \bullet )\) is locally bounded and outer semicontinuous (relative to \({\mathcal {X}} \times {\mathcal {X}}\)) at \(({\mathbf {y}}, {\mathbf {y}})\).

Proof

Without loss of generality, suppose \(({\mathbf {u}}^\nu , \mathbf {w}^\nu ) \underset{{\mathcal X}\times {\mathcal X}}{\longrightarrow } ({\mathbf {y}}, {\mathbf {y}})\) and let \(\{{\varvec{\xi }}^\nu \}\), with \(\varvec{\xi }^\nu \in \partial _1 {\widetilde{U}}({\mathbf {u}}^\nu ; \mathbf {w}^\nu )\), be a sequence of partial subgradients of \({\widetilde{U}}(\bullet ;\mathbf {w}^\nu )\) evaluated at \({\mathbf {u}}^\nu \). By [39, Corollary 10.11], we have \({\varvec{\xi }}^\nu \in \pi _{\mathbf {x}}\partial {\widetilde{U}}({\mathbf {u}}^\nu ;\mathbf {w}^\nu )\), where \(\pi _{\mathbf {x}}\) denotes the projection on the subspace of the first argument: therefore, there exists a sequence \(\{ {\varvec{\zeta }}^\nu \}\) such that \(({\varvec{\xi }}^\nu , {\varvec{\zeta }}^\nu ) \in \partial {\widetilde{U}}({\mathbf {u}}^\nu ;\mathbf {w}^\nu )\) for every \(\nu \). By the local boundedness and outer semicontinuity (relative to \({\mathcal {X}} \times {\mathcal {X}}\)) of the subgradient set of the locally Lipschitz function \({\widetilde{U}}\), renumbering if necessary, we have \(({\varvec{\xi }}^\nu , {\varvec{\zeta }}^\nu ) \rightarrow ({\varvec{\xi }}, {\varvec{\zeta }}) \in \partial {\widetilde{U}}({\mathbf {y}}; {\mathbf {y}})\). Since \(\partial {\widetilde{U}}({\mathbf {y}}; {\mathbf {y}}) \subseteq {\bar{\partial }} {\widetilde{U}}({\mathbf {y}}; {\mathbf {y}})\), where \({\bar{\partial }} {\widetilde{U}}({\mathbf {y}}; {\mathbf {y}})\) is the set of Clarke subgradients of \({\widetilde{U}}\) at \(({\mathbf {y}}, {\mathbf {y}})\), by [12, Proposition 2.5.3] and thanks to the regularity of \({\widetilde{U}}(\bullet ; {\mathbf {y}})\), we have \({\varvec{\xi }} \in {\bar{\partial }}_1 {\widetilde{U}}({\mathbf {y}}; {\mathbf {y}}) = \partial _1 {\widetilde{U}}({\mathbf {y}}; {\mathbf {y}})\); then, in view of [39, Proposition 5.15] and by Definition 1, the thesis follows.\(\square \)

In the general case, U2 and U3 suitably extend the previous considerations by requiring some consistency property between the subgradient set of U and the partial subgradient set of \({\widetilde{U}}\) at the “base point”.

Proposition 1 below essentially states that, if the sequence \(\{{\mathbf {x}}^\nu \}\) produced by AF 1 is such that \(\Vert {\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu ) - {\mathbf {x}}^\nu \Vert \rightarrow 0\), then every limit point of \(\{ {\mathbf {x}}^\nu \}\) satisfies the “stationarity-looking” condition (7), which is formally similar to (2) applied to problem (P), but with the normal cone \(N_\mathcal{X}({\bar{{\mathbf {x}}}})\) replaced by the cone \(\widetilde{N}({\bar{{\mathbf {x}}}})\), where the set \(\widetilde{N}({\bar{{\mathbf {x}}}})\) is defined as

$$\begin{aligned} \begin{array}{rcl} \widetilde{N}({\mathbf {y}}) &{}\triangleq &{} { \limsup \nolimits _{\begin{array}{c} {\mathbf {u}}\rightarrow {\mathbf {y}}\\ \mathbf {w}\underset{\mathcal {X}}{\rightarrow } {\mathbf {y}} \end{array}}} N_{\widetilde{\mathcal X} (\mathbf {w})}({\mathbf {u}})\\ &{}=&{} \big \{\varvec{\eta } \, | \, \exists \, {\varvec{\eta }}^\nu \rightarrow \varvec{\eta }, {\mathbf {u}}^\nu \rightarrow {\mathbf {y}}, {\mathcal X} \ni \mathbf {w}^\nu \rightarrow {\mathbf {y}}: {\varvec{\eta }}^\nu \in N_{\widetilde{\mathcal X}(\mathbf {w}^\nu )}({\mathbf {u}}^\nu )\big \}. \end{array} \end{aligned}$$

Notice that, if \({\mathbf {y}}\in {\mathcal X}\), since \({\mathbf {y}}\in \widetilde{\mathcal {X}}({\mathbf {y}})\), then \(\emptyset \ne N_{\widetilde{\mathcal X}({\mathbf {y}})}({\mathbf {y}}) \subseteq \widetilde{N}({\mathbf {y}})\).

Proposition 1

Let \(\{{{\mathbf {x}}}^{\nu }\}\) be the sequence generated by AF 1 under Assumptions U and X.

  1. (i)

    All the iterates \({\mathbf {x}}^\nu \) are feasible, i.e. they belong to \(\mathcal X\).

  2. (ii)

    Suppose that \(\{{\mathbf {x}}^\nu \}\) is bounded and such that

    $$\begin{aligned} \underset{_{\nu \rightarrow \infty }}{\liminf }\,\Vert \hat{{\mathbf {x}}}({{\mathbf {x}}}^{\nu })-{{\mathbf {x}}}^{\nu }\Vert =0. \end{aligned}$$
    (6)

    Then, at least one limit point \({\bar{{\mathbf {x}}}}\) of \(\{{{\mathbf {x}}}^{\nu }\}\) satisfies the condition

    $$\begin{aligned} 0\, \in \, \partial U({\bar{{\mathbf {x}}}}) + \widetilde{N}({\bar{{\mathbf {x}}}}). \end{aligned}$$
    (7)
  3. (iii)

    Suppose that \(\{{\mathbf {x}}^\nu \}\) is bounded and such that

    $$\begin{aligned} \underset{_{\nu \rightarrow \infty }}{\lim }\Vert \hat{{\mathbf {x}}}({{\mathbf {x}}}^{\nu })-{{\mathbf {x}}}^{\nu }\Vert =0. \end{aligned}$$
    (8)

    Then, every limit point of \(\{{{\mathbf {x}}}^{\nu }\}\) satisfies condition (7).

  4. (iv)

    Whenever \({\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu ) = {\mathbf {x}}^\nu ,\) the point \({\mathbf {x}}^\nu \) satisfies condition (7).

Proof

(i) The thesis follows from Assumption X by induction and considering that \({{\mathbf {x}}}^{\nu +1}\) is a convex combination of \({\mathbf {v}}^\nu \in {\widetilde{\mathcal{X}}}({\mathbf {x}}^{\nu })\) and \({\mathbf {x}}^\nu \in {\widetilde{\mathcal{X}}}({\mathbf {x}}^{\nu })\), which is a convex set. We now prove (ii); (iii) follows applying (ii) to every convergent subsequence of \(\{{{\mathbf {x}}}^{\nu }\}\). If (6) holds, by passing to subsequences, we have \(({\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu ) - {\mathbf {x}}^\nu ) \rightarrow 0\) and \({\mathbf {x}}^\nu \rightarrow {\bar{{\mathbf {x}}}}\) for some \({\bar{{\mathbf {x}}}} \in {\mathcal X}\). By U1 and X1, \({\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu )\) is the unique global optimal solution of the strongly convex optimization problem (P\(_{{\mathbf {x}}^\nu }\)) for every \(\nu \); this fact, in turn, is equivalent to the following condition:

$$\begin{aligned} 0 \in \partial _1 {\widetilde{U}}({\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu );{\mathbf {x}}^\nu ) + N_{\widetilde{\mathcal X}({\mathbf {x}}^\nu )}(\hat{{\mathbf {x}}}({\mathbf {x}}^\nu )). \end{aligned}$$
(9)

Thus, we can write \(0 = {\varvec{\xi }}^\nu + {\varvec{\eta }}^\nu ,\) for some \({\varvec{\xi }}^\nu \in \partial _1 {\widetilde{U}}({\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu );{\mathbf {x}}^\nu )\) and \({\varvec{\eta }}^\nu \in N_{\widetilde{\mathcal X}({\mathbf {x}}^\nu )}(\hat{{\mathbf {x}}}({\mathbf {x}}^\nu ))\). By this and U2, we deduce that the sequence \(\{{\varvec{\eta }}^\nu \}\) is bounded and we can assume, without loss of generality, that it converges to an element \(\bar{\varvec{\eta }}\) belonging, by definition, to \(\widetilde{N}({\bar{{\mathbf {x}}}})\). Thanks to U3, the thesis follows passing to the limit in \(0 = {\varvec{\xi }}^\nu + {\varvec{\eta }}^\nu .\) (iv) By (9) and U3, considering that \(N_{\widetilde{\mathcal X}({\mathbf {x}}^\nu )}({\mathbf {x}}^\nu ) \subseteq \widetilde{N}({\mathbf {x}}^\nu )\), we have \(0 \in \partial _1 {\widetilde{U}}({\mathbf {x}}^\nu ;{\mathbf {x}}^\nu ) + N_{\widetilde{\mathcal X}({\mathbf {x}}^\nu )}({\mathbf {x}}^\nu ) \subseteq \partial U({\mathbf {x}}^\nu ) + \widetilde{N}({\mathbf {x}}^\nu )\). \(\square \)

Proposition 1 sets the following agenda for the further study of AF 1.

  1. 1.

What is the exact meaning of inclusion (7) and how does this condition relate to the classical “stationarity” condition (2)? The answer to this question revolves around the nature of the cone \(\widetilde{N}({\bar{{\mathbf {x}}}})\). As things stand now, since, except for X2, we made no assumptions linking \(\mathcal X\) and \(\widetilde{\mathcal X}({\bar{{\mathbf {x}}}})\), (7) is not very informative. It is thus important to study concrete realizations of the inner approximation \(\widetilde{\mathcal{X}}\) that make (7) meaningful.

  2. 2.

We will see that the validity of the key conditions (6) and (8) depends critically on the choice of \(\gamma ^\nu \) (as well as on the characteristics of \({\widetilde{U}}\) and \(\widetilde{\mathcal{X}}\)). We will consider fixed and diminishing step-size rules, as well as step-sizes determined by an Armijo-type condition.

  3. 3.

    In the subsequent analysis of the first two issues we will resort to additional conditions, beyond Assumptions U and X, on \({\widetilde{U}}\) and \(\widetilde{\mathcal {X}}\) (see Sect. 4.1). It is then essential to show that, in relevant cases, one can easily define suitable approximations \({\widetilde{U}}\) and \(\widetilde{\mathcal{X}}\) satisfying these assumptions, thus leading to practical solution methods.

Points 1 and 2 above will be discussed in the next section, while point 3 will be dealt with in Sect. 5.

4 The method in detail

In this section we present an in-depth analysis aimed at answering the questions of points 1-2 listed at the end of the previous section. In order to perform our analysis, we consider a more concrete realization of problem (P), that is,

$$\begin{aligned} \begin{array}{cl} \underset{{\mathbf {x}}}{\text{ minimize }} &{} F({\mathbf {x}}) + H({\mathbf {x}}) \\ \text{ s.t. } &{} g_j({\mathbf {x}}) \le 0, \quad j = 1, \ldots , m\\ &{} {\mathbf {x}}\in {\mathcal K}, \end{array} \end{aligned}$$
(P1)

where \({\mathcal X} \triangleq \{ {\mathbf {x}}\in \mathcal{K} : g_j({\mathbf {x}}) \le 0, \, j = 1, \ldots , m\}\) is nonempty, with \(\mathcal{K}\subseteq \mathbb {R}^n\) closed and convex and, letting \(O \supseteq \mathcal{X}\) be open, \(g_j: O \rightarrow \mathbb {R}\) locally Lipschitz continuous on O for every j; moreover, \(U \triangleq F+H\) is coercive on \({\mathcal X}\), with F continuously differentiable on O, its gradient Lipschitz continuous on O with constant \(L_{\nabla F}\) and H locally Lipschitz continuous on O. We note that the above conditions guarantee the existence of a solution of problem (P1).

In this formulation the feasible set \({\mathcal X}\) is described by the convex, closed set \(\mathcal K\) and m inequality constraints. The set \(\mathcal K\) is intended to represent the “easy” part of the constraints and is possibly defined by further convex inequalities. On the other hand, the explicit and “difficult” inequality constraints \(g_j\)s, which are possibly nonsmooth and nonconvex, need to be suitably approximated/simplified in order to build, starting from (P1) and at a generic feasible point \({\mathbf {y}}\), the better behaved subproblem (\({\hbox {P}1_{{\mathbf {y}}}}\)) below.

The objective function U is given by the sum \(F + H\), with F smooth and H locally Lipschitz. Of course, this description of U is somewhat redundant since we could simply take \(U=H\); however, it is useful to highlight, if present, a smooth component F, because, on this term, we can relax the assumptions that we need, instead, for the more general nonsmooth part H.
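For instance (a common illustration, not a restriction imposed by our setting), F could be a smooth, possibly nonconvex, data-fitting term and H a convex nonsmooth regularizer such as \(\lambda \Vert {\mathbf {x}}\Vert _1\).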

Taking into account that the general abstract problem (P) is actually represented by (P1), we are naturally led to consider the following approximation of (P1) at a feasible point \({\mathbf {y}}\):

$$\begin{aligned} \begin{array}{cl} \underset{{\mathbf {x}}}{\text{ minimize }} &{} {\widetilde{F}}({\mathbf {x}}; {\mathbf {y}}) + \widetilde{H}({\mathbf {x}}; {\mathbf {y}}) \\ \text{ s.t. } &{} {\tilde{g}}_j({\mathbf {x}}; {\mathbf {y}}) \le 0, \quad j = 1, \ldots , m\\ &{} {\mathbf {x}}\in {\mathcal K}, \end{array} \end{aligned}$$
(\({\hbox {P}1_{{\mathbf {y}}}}\))
where \({\widetilde{U}} = {\widetilde{F}} + \widetilde{H}\) and \( \widetilde{\mathcal X}({\mathbf {y}}) \triangleq \{ {\mathbf {x}}\in \mathcal{K} : {\tilde{g}}_j({\mathbf {x}}; {\mathbf {y}}) \le 0, \, j = 1, \ldots , m\}\).

In the next subsection we provide a detailed description of the assumptions that we make on problem (\({\hbox {P}1_{{\mathbf {y}}}}\)), while in the remaining part of the section we analyze the convergence properties of AF 1.

4.1 Assumptions

In order to strengthen the results of Proposition 1 and to take into account the specific structure of problems (P1) and (\({\hbox {P}1_{{\mathbf {y}}}}\)), below we state three assumptions about the approximations \({\widetilde{F}}\), \(\widetilde{H}\), and \({\tilde{g}}_j\) that complement Assumptions U and X. These conditions are somewhat technical, but in Sect. 5 we give many examples of approximations satisfying the assumptions, thus showing the breadth and wide practical applicability of our approach. The first two assumptions concern the objective function.

Assumption F

\(\widetilde{F}:O_{\mathbf {x}}\times O_{\mathbf {y}}\rightarrow \mathbb {R}\) satisfies the following properties:

(F1) :

\(\widetilde{F}(\bullet ;{\mathbf {y}})\) is a finite strongly convex function on \(O_{\mathbf {x}}\) for every \({\mathbf {y}}\in {\mathcal {X}}\) with modulus of strong convexity \(c_{\widetilde{F}}\!\!>\!0\) independent of \({\mathbf {y}}\);

(F2) :

\(\nabla _{1}\widetilde{F}(\bullet ;\bullet )\) exists and is continuous on \({O_{\mathbf {x}}\times O_{\mathbf {y}}}\);

(F3) :

\(\nabla _{1}\widetilde{F}({\mathbf {y}};{\mathbf {y}})=\nabla F({\mathbf {y}}),\) for every \({\mathbf {y}}\in \mathcal {X}\);

where \(\nabla _{1}\widetilde{F}({\mathbf {u}};\mathbf {w})\) denotes the partial gradient of \(\widetilde{F}(\bullet ;\mathbf {w})\) evaluated at \({\mathbf {u}}\).

Assumption H

\(\widetilde{H}:O_{\mathbf {x}}\times O_{\mathbf {y}}\rightarrow \mathbb {R}\) satisfies the following properties:

(H1) :

\(\widetilde{H}(\bullet ;{\mathbf {y}})\) is a finite convex function on \(O_{\mathbf {x}}\) for every \({\mathbf {y}}\in \mathcal {X}\);

(H2) :

\(\widetilde{H}({\mathbf {y}};{\mathbf {y}})=H({\mathbf {y}})\), for every \({\mathbf {y}}\in \mathcal {X}\);

(H3) :

\(H({\mathbf {x}})\le \widetilde{H}({\mathbf {x}};{\mathbf {y}})\) for every \({\mathbf {x}}\in {\mathcal {X}}\) and \({\mathbf {y}}\in \mathcal {X}\);

(H4) :

\(\partial _1 \widetilde{H}(\bullet ; \bullet )\) is locally bounded relative to \({\mathcal {X}}\times {\mathcal {X}}\) at every \(({\mathbf {y}}, {\mathbf {y}})\) with \({\mathbf {y}}\in {\mathcal {X}}\);

(H5) :

\(\limsup _{({\mathbf {u}},\mathbf {w}) \underset{\mathcal X\times \mathcal X}{\longrightarrow } ({\mathbf {y}}, {\mathbf {y}})} \partial _1 \widetilde{H}({\mathbf {u}}; \mathbf {w}) \subseteq \partial H({\mathbf {y}})\) for every \({\mathbf {y}}\in {\mathcal {X}}\).

Assumption F on the smooth part of the objective function is just a verbatim repetition of Assumption U, see the discussion just before Lemma 1. Assumptions H1, H4 and H5 are again a plain restatement of Assumption U, the only difference being that condition H1 only requires convexity: indeed, in view of F1, it is enough to have \(\widetilde{H}\) convex in order to make the overall approximation \({\widetilde{U}} = {\widetilde{F}} + \widetilde{H}\) strongly convex for fixed \({\mathbf {y}}\). Conditions H2 and H3 are instead new. H2 is only made for convenience and can always be satisfied by a translation. The key assumption H3 requires \(\widetilde{H}({\mathbf {x}}; {\mathbf {y}})\) to be an upper approximation of \(H({\mathbf {x}})\), making up, in some sense, for the lack of differentiability of H. Note that this condition is typical of majorization/minimization schemes, see, for example, [1, 7]. An important feature of AF 1 is that we do not require this majorization condition on \({\widetilde{F}}\) and therefore on \({\widetilde{U}}\). This is very important from the practical point of view, since majorization schemes can be very slow, see for example Sect. 6, and also because, in general, it is easier to build an approximation that is not an upper estimate, see Sect. 5.
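As a simple illustration (one admissible choice, not the only one compatible with Assumption H): if H itself is convex, one can take \(\widetilde{H}({\mathbf {x}};{\mathbf {y}}) \triangleq H({\mathbf {x}})\) for every \({\mathbf {y}}\). Then H1–H3 hold trivially and, since \(\partial _1 \widetilde{H}({\mathbf {u}};\mathbf {w}) = \partial H({\mathbf {u}})\), H4 and H5 follow from the local boundedness and outer semicontinuity of \(\partial H\) recalled in Sect. 2.2.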

When considering the constraints, we assume that, at each given point \({\mathbf {y}}\in {\mathcal X}\) and for each function \(g_j\), we can define a suitable approximation \({\tilde{g}}_j\) that satisfies the following conditions.

Assumption G

Each \(\tilde{g}_j:O_{\mathbf {x}}\times O_{\mathbf {y}}\rightarrow \mathbb {R}\) satisfies the following properties:

(G1) :

\(\tilde{g}_j(\bullet ;{\mathbf {y}})\) is a finite convex function on \(O_{\mathbf {x}}\) for every \({\mathbf {y}}\in \mathcal {X}\);

(G2) :

\(\tilde{g}_j({\mathbf {y}};{\mathbf {y}})=g_j({\mathbf {y}})\), for every \({\mathbf {y}}\in \mathcal {X}\);

(G3) :

\(g_j({\mathbf {x}})\le \tilde{g}_j({\mathbf {x}};{\mathbf {y}})\) for every \({\mathbf {x}}\in {\mathcal {X}}\) and \({\mathbf {y}}\in \mathcal {X}\);

(G4) :

\(\tilde{g}_j(\bullet ;\bullet )\) is continuous relative to \({\mathcal {X}}\times {\mathcal {X}}\) at every \(({\mathbf {y}}, {\mathbf {y}})\) with \({\mathbf {y}}\in {\mathcal {X}}\);

(G5) :

\(\partial _1 {\tilde{g}}_j(\bullet ; \bullet )\) is locally bounded relative to \({\mathcal {X}}\times {\mathcal {X}}\) at every \(({\mathbf {y}}, {\mathbf {y}})\) with \({\mathbf {y}}\in {\mathcal {X}}\);

(G6) :

\(\limsup _{({\mathbf {u}},\mathbf {w}) \underset{\mathcal X\times \mathcal X}{\longrightarrow } ({\mathbf {y}}, {\mathbf {y}})} \partial _1 \tilde{g}_j({\mathbf {u}}; \mathbf {w}) \subseteq \partial g_j({\mathbf {y}})\) for every \({\mathbf {y}}\in {\mathcal {X}}\).

Conditions G1–G3 are just the translation of Assumption X in the setting of problems (P1) and (\({\hbox {P}1_{{\mathbf {y}}}}\)). While G4 is a fairly mild continuity assumption, the consistency condition G6, along with G5, is instrumental for building a bridge between the sets \(\widetilde{N}\) and \(N_{\mathcal X}\) at \({\mathbf {y}}\in {\mathcal {X}}\). In any case, G5 and G6 are identical to H4 and H5 and, if we are able to define suitable approximations \({\tilde{g}}_j\) satisfying these conditions for the \(g_j\)s, we can do the same for H.
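As a concrete illustration (a standard construction under additional structure, not required in general): suppose that \(g_j = g_j^{+} - g_j^{-}\) with \(g_j^{+}\) convex and \(g_j^{-}\) convex and continuously differentiable on O. Then the choice

$$\begin{aligned} {\tilde{g}}_j({\mathbf {x}};{\mathbf {y}}) \triangleq g_j^{+}({\mathbf {x}}) - g_j^{-}({\mathbf {y}}) - \nabla g_j^{-}({\mathbf {y}})^{\scriptscriptstyle T}({\mathbf {x}}-{\mathbf {y}}) \end{aligned}$$

satisfies G1 and G2 by construction, and G3 because the gradient inequality for the convex function \(g_j^{-}\) gives \(g_j^{-}({\mathbf {x}}) \ge g_j^{-}({\mathbf {y}}) + \nabla g_j^{-}({\mathbf {y}})^{\scriptscriptstyle T}({\mathbf {x}}-{\mathbf {y}})\), and hence \(g_j({\mathbf {x}}) \le {\tilde{g}}_j({\mathbf {x}};{\mathbf {y}})\); G4–G6 can also be checked, since \(\partial _1 {\tilde{g}}_j({\mathbf {u}};\mathbf {w}) = \partial g_j^{+}({\mathbf {u}}) - \nabla g_j^{-}(\mathbf {w})\) and \(\partial g_j({\mathbf {y}}) = \partial g_j^{+}({\mathbf {y}}) - \nabla g_j^{-}({\mathbf {y}})\).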

For future reference, it may be useful to explicitly state the following fact that, in the light of the discussion so far, does not need any proof.

Proposition 2

Assumptions F, H, and G on problem (P1) imply that problem (P1\(_{{\mathbf {y}}}\)) satisfies Assumptions U and X.

4.2 Main convergence result

In this subsection we study the main convergence properties of AF 1. We still have to specify how to choose the step-size \(\gamma ^\nu \) at each iteration. We consider three options: (i) a sufficiently small fixed step-size; (ii) a diminishing step-size rule; (iii) a step-size chosen according to an Armijo-type line-search procedure. Each of these options has its pros and cons. Indeed, (i) is conceptually the simplest (provided that some Lipschitz and strong convexity constants are known), but it is expected to perform poorly in practice; (ii) is the easiest-to-implement rule and does not require any prior knowledge of possibly-hard-to-estimate constants; (iii) is the most complex option and is expected to lead to a better practical behavior, since, in this case, the step-size is chosen taking into account the problem structure; on the other hand, rule (iii) is less suited to distributed environments, since performing a line-search is inherently a centralized task. The following proposition shows that, by choosing \(\gamma ^\nu \) according to these three methods, we can ensure that (6) or (8) hold.

Proposition 3

Given the nonconvex problem (P1) under F1-F3, H1-H3 and G1-G3, let \(\{{\mathbf {x}}^{\nu }\}\) be the sequence generated by AF 1. The following hold:

  1. (i)

    if the step-size \(\gamma ^{\nu }\) and the error term \(\varepsilon ^\nu \) are chosen so that

    $$\begin{aligned}&0 < \inf _\nu \gamma ^\nu \le \sup _\nu \gamma ^\nu \le \gamma ^{\max } \le \min \{1,c_{{\widetilde{F}}}/L_{\nabla F} \} \end{aligned}$$
    (10)
    $$\begin{aligned}&\varepsilon ^\nu \le a_1 \min \{b^\nu /[\Vert \nabla F({\mathbf {x}}^\nu ) + {\varvec{\xi }}^\nu \Vert ], c^\nu \}, \end{aligned}$$
    (11)

    for some \({\varvec{\xi }}^\nu \in \partial _1 \widetilde{H}({\mathbf {v}}^\nu ; {\mathbf {x}}^\nu )\) and for some nonnegative \(a_1\), \(b^\nu \) and \(c^\nu \) with \(\sum _\nu b^\nu < + \infty \), \(\sum _\nu (c^\nu )^2 < + \infty \), then \(\{{\mathbf {x}}^{\nu }\}\) is a bounded sequence such that

    $$\begin{aligned} \underset{_{\nu \rightarrow \infty }}{\lim }\,\,\Vert {\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu )-{\mathbf {x}}^\nu \Vert =0; \end{aligned}$$
    (12)
  2. (ii)

    if the step-size \(\gamma ^{\nu }\) and the error term \(\varepsilon ^\nu \) are chosen so that

    $$\begin{aligned} \gamma ^{\nu }\in (0,1],\quad \underset{_{\nu \rightarrow \infty }}{\lim }\gamma ^{\nu }=0,\quad \sum _{\nu }\gamma ^{\nu }=+\infty , \quad \text{ and }\quad \sum _{\nu }(\gamma ^{\nu })^2<+\infty , \end{aligned}$$
    $$\begin{aligned} \varepsilon ^\nu \le \gamma ^\nu a_1 \min \{a_2, 1/[\Vert \nabla F({\mathbf {x}}^\nu ) + {\varvec{\xi }}^\nu \Vert ]\}, \end{aligned}$$
    (13)

    for some \({\varvec{\xi }}^\nu \in \partial _1 \widetilde{H}({\mathbf {v}}^\nu ; {\mathbf {x}}^\nu )\) and for some nonnegative \(a_1\) and \(a_2\), then \(\{{\mathbf {x}}^{\nu }\}\) is a bounded sequence such that

    $$\begin{aligned} \underset{_{\nu \rightarrow \infty }}{\liminf }\,\,\Vert {\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu )-{\mathbf {x}}^\nu \Vert =0; \end{aligned}$$
    (14)
  3. (iii)

    if the step-size \(\gamma ^{\nu }\in (0,1]\) is chosen by means of a backtracking rule so that \(\gamma ^{\nu } = 0.5^{i^\nu }\), where \(i^\nu \) is the smallest number in \(0,1, 2, \ldots \) such that \({\mathbf {x}}(i^\nu ) \triangleq {\mathbf {x}}^\nu + 0.5^{i^\nu }({\mathbf {v}}^\nu - {\mathbf {x}}^\nu )\) satisfies the Armijo-type condition

    $$\begin{aligned} \begin{array}{rcl} F({\mathbf {x}}(i^\nu )) + H({\mathbf {x}}(i^\nu )) &{} \le &{} F({\mathbf {x}}^{\nu }) + H({\mathbf {x}}^{\nu }) + \alpha \gamma ^\nu [\nabla F({\mathbf {x}}^\nu )^{\scriptscriptstyle T}({\mathbf {v}}^\nu - {\mathbf {x}}^\nu )\\ &{}&{}+\, \widetilde{H}({\mathbf {v}}^\nu ;{\mathbf {x}}^\nu ) - \widetilde{H}({\mathbf {x}}^\nu ;{\mathbf {x}}^\nu )], \end{array} \end{aligned}$$
    (15)

    for some \(\alpha \in (0,1)\), the error \(\varepsilon ^\nu \) is such that \( \lim _{\nu \rightarrow \infty } \varepsilon ^\nu = 0, \) and

    $$\begin{aligned} {\widetilde{F}}({\mathbf {v}}^\nu ;{\mathbf {x}}^\nu ) + \widetilde{H}({\mathbf {v}}^\nu ;{\mathbf {x}}^\nu ) \le {\widetilde{F}}({\mathbf {x}}^\nu ;{\mathbf {x}}^\nu ) + \widetilde{H}({\mathbf {x}}^\nu ;{\mathbf {x}}^\nu ), \end{aligned}$$
    (16)

    then \(\{{\mathbf {x}}^{\nu }\}\) is well-defined, bounded and condition (12) holds.

As stated above, according to AF 1, we consider the possibility of an inaccurate computation of \({\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu )\). The error must obey rather standard rules, with possibly the exception of (16), which, however, is a natural and easily enforced condition: indeed, (16) states that the approximation \({\mathbf {v}}^\nu \) of the optimal solution \({\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu )\) of problem (P1\(_{{\mathbf {x}}^\nu }\)) has to at least improve on the current iteration \({\mathbf {x}}^\nu \). Rules (11) and (13) would require us to estimate \( \Vert {\mathbf {v}}^\nu -{\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu )\Vert \). In principle, this is possible by resorting to appropriate error bounds, which are available for the strongly convex problem (P1\(_{{\mathbf {x}}^\nu }\)) [15]. However, from a practical point of view, taking into account that constants \(a_1\), \(a_2\) and sequences \(b^\nu \) and \(c^\nu \) are arbitrary, the meaning of (11) and (13) is simply that the error \(\varepsilon ^\nu \) should go to zero, i.e. the same condition needed for the line-search option (iii).
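For concreteness, the backtracking rule of point (iii) can be sketched as follows (a minimal illustration in the same spirit as the sketch in Sect. 3; F, H, H_tilde and grad_F are user-supplied callables and the names are ours):

```python
def armijo_step(F, H, H_tilde, grad_F, x, v, alpha=0.3, max_halvings=50):
    # Armijo-type backtracking of Proposition 3(iii): gamma = 0.5**i, with i the
    # smallest nonnegative integer such that the sufficient decrease condition (15)
    # holds; x, v are numpy arrays and H_tilde(z, y) is the convex surrogate of H at y.
    descent = grad_F(x) @ (v - x) + H_tilde(v, x) - H_tilde(x, x)   # bracketed term in (15)
    base = F(x) + H(x)
    gamma = 1.0
    for _ in range(max_halvings):
        x_new = x + gamma * (v - x)
        if F(x_new) + H(x_new) <= base + alpha * gamma * descent:
            return gamma, x_new
        gamma *= 0.5
    # A step satisfying (15) exists for gamma small enough (see (27)); the cap
    # max_halvings is only a numerical safeguard.
    return gamma, x + gamma * (v - x)
```

By (27), the number of halvings performed by such a procedure is bounded independently of \(\nu \).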

To prove Proposition 3 we need two preliminary lemmas.

Lemma 2

Under Assumption F1-F3, H1 and G1-G3, for every \({\mathbf {x}}^\nu \),

$$\begin{aligned} \nabla F({\mathbf {x}}^\nu )^{\scriptscriptstyle T}(\hat{{\mathbf {x}}}({\mathbf {x}}^\nu )-{\mathbf {x}}^\nu ) + \widetilde{H}({\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu ); {\mathbf {x}}^\nu ) - \widetilde{H}({\mathbf {x}}^\nu ; {\mathbf {x}}^\nu ) \le - c_{\widetilde{F}}\Vert \hat{{\mathbf {x}}}({\mathbf {x}}^\nu )-{\mathbf {x}}^\nu \Vert ^{2}. \end{aligned}$$
(17)

Proof

For any given \({\mathbf {x}}^\nu \in \mathcal{\mathcal{X}}\), by referring to problem (\({\hbox {P}1_{{\mathbf {y}}}}\)), with \({\mathbf {y}}= {\mathbf {x}}^\nu \), thanks to F2, G1, and H1 we have

$$\begin{aligned} ({\mathbf {z}}-\hat{{\mathbf {x}}}({\mathbf {x}}^\nu ))^{\scriptscriptstyle T}\nabla _{1}\widetilde{F}\left( \hat{{\mathbf {x}}}({\mathbf {x}}^\nu );{\mathbf {x}}^\nu \right) +\widetilde{H}(\mathbf {z};{\mathbf {x}}^\nu )-\widetilde{H}({\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu );{\mathbf {x}}^\nu )\ge 0,\quad \forall {\mathbf {z}}\in \widetilde{\mathcal{X}}({\mathbf {x}}^\nu ). \end{aligned}$$
(18)

By adding and subtracting \(\nabla _{1}\widetilde{F}({\mathbf {x}}^\nu ;{\mathbf {x}}^\nu )\) in the LHS of (18), and choosing \({\mathbf {z}}={\mathbf {x}}^\nu \in {\widetilde{\mathcal{X}}}({\mathbf {x}}^\nu )\), we get

$$\begin{aligned} \begin{array}{rcl} 0 &{} \le &{} \left( {\mathbf {x}}^\nu -\hat{{\mathbf {x}}}({\mathbf {x}}^\nu )\right) ^{\scriptscriptstyle T}\left( \nabla _{1}\widetilde{F}(\hat{{\mathbf {x}}}({\mathbf {x}}^\nu );{\mathbf {x}}^\nu )-\nabla _{1}\widetilde{F}({\mathbf {x}}^\nu ; {\mathbf {x}}^\nu )+\nabla _{1}\widetilde{F}({\mathbf {x}}^\nu ;{\mathbf {x}}^\nu )\right) \\ &{} &{} +\widetilde{H}({\mathbf {x}}^\nu ;{\mathbf {x}}^\nu )-\widetilde{H}({\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu );{\mathbf {x}}^\nu ), \end{array} \end{aligned}$$

which, using F1 and F3, gives (17). \(\square \)

Lemma 3

([5, Lemma 3.4]) Let \(\{a^\nu \}\), \(\{b^\nu \}\), and \(\{c^\nu \}\) be three sequences of numbers such that \(\{b^\nu \} \ge 0\) for all \(\nu \). Suppose that \( a^{\nu +1} \le a^\nu - b^\nu + c^\nu ,\) \(\forall \nu = 0, 1, \ldots \) and \(\, \sum _{\nu =0}^\infty c^\nu < \infty \). Then, either \(a^\nu \rightarrow -\infty \) or else \(\{a^\nu \}\) converges to a finite value and \(\sum _{\nu =0}^\infty b^\nu < \infty \).

Proof of Proposition 3

First of all, we note that, by the descent lemma [4, Proposition A.24] and step (S.3) of AF 1, we get

$$\begin{aligned} F({{\mathbf {x}}}^{\nu +1})\le F({{\mathbf {x}}}^{\nu })+\gamma ^{\nu }\nabla F({{\mathbf {x}}}^{\nu })^{\scriptscriptstyle T}({\mathbf {v}}^{\nu }-{{\mathbf {x}}}^{\nu })+\frac{(\gamma ^{\nu })^{2}L_{\nabla F}}{2}\Vert {\mathbf {v}}^{\nu }-{{\mathbf {x}}}^{\nu }\Vert ^{2}. \end{aligned}$$
(19)

Furthermore,

$$\begin{aligned}&H({\mathbf {x}}^{\nu + 1}) \overset{(a)}{\le } \widetilde{H}({\mathbf {x}}^{\nu +1};{\mathbf {x}}^\nu ) \overset{(b)}{\le } \widetilde{H}({\mathbf {x}}^\nu ;{\mathbf {x}}^\nu ) + \gamma ^\nu (\widetilde{H}({\mathbf {v}}^\nu ;{\mathbf {x}}^\nu ) - \widetilde{H}({\mathbf {x}}^\nu ;{\mathbf {x}}^\nu )) \nonumber \\&\quad =\widetilde{H}({\mathbf {x}}^\nu ;{\mathbf {x}}^\nu ) + \gamma ^\nu \Big (\widetilde{H}({\mathbf {v}}^\nu ;{\mathbf {x}}^\nu ) - \widetilde{H}({\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu ); {\mathbf {x}}^\nu ) + \widetilde{H}({\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu ); {\mathbf {x}}^\nu ) - \widetilde{H}({\mathbf {x}}^\nu ;{\mathbf {x}}^\nu )\Big )\nonumber \\&\quad \overset{(c)}{\le } H({\mathbf {x}}^\nu ) + \gamma ^\nu ({\varvec{\xi }}^\nu )^{\scriptscriptstyle T}({\mathbf {v}}^\nu - {\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu )) + \gamma ^\nu (\widetilde{H}(\hat{\mathbf {x}}({\mathbf {x}}^\nu ); {\mathbf {x}}^\nu ) - \widetilde{H}({\mathbf {x}}^\nu ; {\mathbf {x}}^\nu )), \end{aligned}$$
(20)

where the last relation holds for any \({\varvec{\xi }}^\nu \in \partial _1 \widetilde{H}({\mathbf {v}}^\nu ; {\mathbf {x}}^\nu )\). The inequality (a) follows from H3, while (b) and (c) are due to H2 and to the convexity of \(\widetilde{H}(\bullet ;{\mathbf {y}})\). Using

$$\begin{aligned}&\nabla F({\mathbf {x}}^\nu )^{\scriptscriptstyle T}({\mathbf {v}}^\nu -{\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu ) + {\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu ) - {\mathbf {x}}^\nu ) \le \nabla F({\mathbf {x}}^\nu )^{\scriptscriptstyle T}({\mathbf {v}}^\nu - {\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu ))\\&\quad - c_{{\widetilde{F}}} \Vert {\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu ) - {\mathbf {x}}^\nu \Vert ^2 + \widetilde{H}({\mathbf {x}}^\nu ;{\mathbf {x}}^\nu ) - \widetilde{H}({\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu );{\mathbf {x}}^\nu ), \end{aligned}$$

which follows from Lemma 2, and combining (19) and (20), we have

$$\begin{aligned} \begin{array}{rcl} F({\mathbf {x}}^{\nu + 1}) + H({\mathbf {x}}^{\nu + 1}) &{} \le &{} F({\mathbf {x}}^\nu ) + H({\mathbf {x}}^\nu ) - \gamma ^\nu c_{\widetilde{F}} \Vert {\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu ) - {\mathbf {x}}^\nu \Vert ^2 \\ &{} &{} + \gamma ^\nu \varepsilon ^\nu \Vert \nabla F({\mathbf {x}}^\nu ) + {\varvec{\xi }}^\nu \Vert + \frac{(\gamma ^{\nu })^{2}L_{\nabla F}}{2}\Vert {\mathbf {v}}^{\nu }-{{\mathbf {x}}}^{\nu }\Vert ^{2}. \end{array} \end{aligned}$$
(21)

Since \(\Vert {\mathbf {v}}^\nu - {\mathbf {x}}^\nu \Vert ^2 \le 2\Vert {\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu ) - {\mathbf {x}}^\nu \Vert ^2 + 2 \Vert {\mathbf {v}}^\nu - {\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu )\Vert ^2 \le 2\Vert {\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu ) - {\mathbf {x}}^\nu \Vert ^2 + 2 (\varepsilon ^\nu )^2\), from (21) we obtain

$$\begin{aligned} F({{\mathbf {x}}}^{\nu +1}) + H({\mathbf {x}}^{\nu +1})\le F({{\mathbf {x}}}^{\nu }) + H({{\mathbf {x}}}^{\nu }) - \gamma ^{\nu }\left( c_{\widetilde{F}}-\gamma ^{\nu }L_{\nabla F}\right) \Vert \hat{{\mathbf {x}}}({{\mathbf {x}}}^{\nu })-{{\mathbf {x}}}^{\nu }\Vert ^{2} + \, T^\nu , \end{aligned}$$
(22)

where \(T^\nu \triangleq \gamma ^\nu \varepsilon ^\nu [\Vert \nabla F({\mathbf {x}}^\nu ) + {\varvec{\xi }}^\nu \Vert ] + L_{\nabla F} (\gamma ^\nu \varepsilon ^\nu )^2\). Note that in cases (i) and (ii), under the assumptions of the proposition, we have \(\sum _{\nu =0}^\infty T^\nu <\infty \).

We now prove the thesis for cases (i), (ii) and (iii) separately.

(i) In view of relation (22), since \(F+H\) is coercive, Lemma 3 implies that the sequence \(\{F({\mathbf {x}}^\nu )+ H({\mathbf {x}}^\nu )\}\) is convergent and, thus, that \(\{{\mathbf {x}}^\nu \}\) is bounded. Furthermore, again by Lemma 3, \( \lim \nolimits _{\nu \rightarrow \infty } \sum _{t=\bar{\nu }}^{\nu }\gamma ^{t}\Vert \hat{{\mathbf {x}}}({{\mathbf {x}}}^{t})-{{\mathbf {x}}}^{t}\Vert ^{2}<+\infty ,\) which, taking into account that \(\gamma ^\nu \) is bounded away from zero, readily entails (12).

(ii) Since \(\lim \nolimits _{\nu \rightarrow \infty } \gamma ^{\nu }=0\), there exist a positive constant \(\omega \) and an index \(\bar{\nu }\) such that, by (22), for every \(\nu \ge \bar{\nu }\),

$$\begin{aligned} F({{\mathbf {x}}}^{\nu +1}) + H({{\mathbf {x}}}^{\nu +1})\le F({{\mathbf {x}}}^{\nu }) + H({{\mathbf {x}}}^{\nu })-\gamma ^{\nu }\omega \Vert \hat{{\mathbf {x}}}({{\mathbf {x}}}^{\nu })-{{\mathbf {x}}}^{\nu }\Vert ^{2} + T^\nu . \end{aligned}$$
(23)

Since \(F+H\) is coercive, Lemma 3 implies that the sequence \(\{F({\mathbf {x}}^\nu )+ H({\mathbf {x}}^\nu )\}\) is convergent and, thus, that \(\{{\mathbf {x}}^\nu \}\) is bounded. Furthermore, again by Lemma 3, we have \( \lim \nolimits _{\nu \rightarrow \infty } \sum _{t=\bar{\nu }}^{\nu }\gamma ^{t}\Vert \hat{{\mathbf {x}}}({{\mathbf {x}}}^{t})-{{\mathbf {x}}}^{t}\Vert ^{2}<+\infty , \) from which, taking into account \(\sum _{\nu =0}^{\infty }\gamma ^{\nu }=+\infty \), (14) follows.

(iii) By (19) and relation (b) in (20), we have

$$\begin{aligned} F({\mathbf {x}}^{\nu +1})&+ H({\mathbf {x}}^{\nu +1}) - [F({\mathbf {x}}^{\nu }) + H({\mathbf {x}}^{\nu })] \le \gamma ^\nu \Big [\nabla F({\mathbf {x}}^\nu )^{\scriptscriptstyle T}({\mathbf {v}}^\nu - {\mathbf {x}}^\nu )\nonumber \\&+ \widetilde{H}({\mathbf {v}}^\nu ;{\mathbf {x}}^\nu ) - \widetilde{H}({\mathbf {x}}^\nu ;{\mathbf {x}}^\nu )\Big ] + \frac{(\gamma ^{\nu })^{2}L_{\nabla F}}{2}\Vert {\mathbf {v}}^{\nu }-{{\mathbf {x}}}^{\nu }\Vert ^{2}. \end{aligned}$$
(24)

We also observe that, by F1, F3 and (16),

$$\begin{aligned} \nabla F({\mathbf {x}}^\nu )^{\scriptscriptstyle T}({\mathbf {v}}^\nu - {\mathbf {x}}^\nu ) + \widetilde{H}({\mathbf {v}}^\nu ;{\mathbf {x}}^\nu ) - \widetilde{H}({\mathbf {x}}^\nu ;{\mathbf {x}}^\nu ) \le -\frac{c_{{\widetilde{F}}}}{2} \Vert {\mathbf {v}}^\nu - {\mathbf {x}}^\nu \Vert ^2. \end{aligned}$$
(25)

Now, we are able to prove that there exist suitable values of the step-size \(\gamma ^\nu \) such that (15) holds. Indeed, by (24), the sufficient decrease condition is satisfied if

$$\begin{aligned} \gamma ^\nu [\nabla F({\mathbf {x}}^\nu )^{\scriptscriptstyle T}({\mathbf {v}}^\nu&- {\mathbf {x}}^\nu )+ \widetilde{H}({\mathbf {v}}^\nu ;{\mathbf {x}}^\nu ) - \widetilde{H}({\mathbf {x}}^\nu ;{\mathbf {x}}^\nu )] + \frac{(\gamma ^{\nu })^{2}L_{\nabla F}}{2}\Vert {\mathbf {v}}^{\nu }-{{\mathbf {x}}}^{\nu }\Vert ^{2} \nonumber \\&\le \, \alpha \gamma ^\nu [\nabla F({\mathbf {x}}^\nu )^{\scriptscriptstyle T}({\mathbf {v}}^\nu - {\mathbf {x}}^\nu ) + \widetilde{H}({\mathbf {v}}^\nu ;{\mathbf {x}}^\nu ) - \widetilde{H}({\mathbf {x}}^\nu ;{\mathbf {x}}^\nu )], \end{aligned}$$
(26)

and thus, thanks to (25) and some straightforward computations (spelled out after (27) below), for any \(\gamma ^\nu \) such that \( \gamma ^\nu \le \min \{1, (1 - \alpha ) \frac{c_{{\widetilde{F}}}}{L_{\nabla F}}\}. \) Therefore, we can always find a finite \(i^\nu \) for which the Armijo condition (15) holds. By the same token, the step-size \(\gamma ^\nu \) is bounded away from zero; more precisely, for all \(\nu \),

$$\begin{aligned} \gamma ^\nu \ge \min \left\{ 1, \frac{1 - \alpha }{2} \frac{c_{{\widetilde{F}}}}{L_{\nabla F}}\right\} . \end{aligned}$$
(27)
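For the reader's convenience, we spell out the computation behind the bounds just established (this is our elaboration of the “straightforward computations” referred to above). Dividing (26) by \(\gamma ^\nu > 0\) and rearranging terms, (26) is equivalent to

$$\begin{aligned} \frac{\gamma ^{\nu } L_{\nabla F}}{2}\Vert {\mathbf {v}}^{\nu }-{{\mathbf {x}}}^{\nu }\Vert ^{2} \le (1 - \alpha ) \left[ \widetilde{H}({\mathbf {x}}^\nu ;{\mathbf {x}}^\nu ) - \widetilde{H}({\mathbf {v}}^\nu ;{\mathbf {x}}^\nu ) - \nabla F({\mathbf {x}}^\nu )^{\scriptscriptstyle T}({\mathbf {v}}^\nu - {\mathbf {x}}^\nu )\right] . \end{aligned}$$

By (25), the bracketed quantity is at least \(\frac{c_{{\widetilde{F}}}}{2} \Vert {\mathbf {v}}^\nu - {\mathbf {x}}^\nu \Vert ^2\), so the inequality above, and hence (26), is guaranteed whenever \(\gamma ^\nu L_{\nabla F} \le (1 - \alpha ) c_{{\widetilde{F}}}\). If, for instance, the trial step-size starts at 1 and is halved at each backtracking iteration (consistently with the factor \(\tfrac{1}{2}\) appearing in (27)), the accepted \(\gamma ^\nu \) can fall below this threshold by at most a factor of two, which yields (27).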

Besides, (15) and (25) also imply

$$\begin{aligned} F({\mathbf {x}}^{\nu +1}) + H({\mathbf {x}}^{\nu +1}) - [F({\mathbf {x}}^\nu ) + H({\mathbf {x}}^{\nu })] \le - \alpha \gamma ^\nu \frac{c_{{\widetilde{F}}}}{2} \Vert {\mathbf {v}}^{\nu }-{{\mathbf {x}}}^{\nu }\Vert ^{2} \end{aligned}$$

and in turn, thanks to (27) and the coercivity of \(F+H\), we deduce that \(\{F({\mathbf {x}}^\nu )+H({\mathbf {x}}^\nu )\}\) is convergent, \(\{{\mathbf {x}}^\nu \}\) is bounded, and \( \lim \nolimits _{\nu \rightarrow \infty } \Vert {\mathbf {v}}^\nu - {\mathbf {x}}^\nu \Vert = 0.\) By observing that \(\Vert {\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu ) - {\mathbf {x}}^\nu \Vert \le \Vert {\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu ) - {\mathbf {v}}^\nu \Vert + \Vert {\mathbf {v}}^\nu - {\mathbf {x}}^\nu \Vert \), we get (12). \(\square \)

Point (iii) in Proposition 3 and, in particular, (27) show that, if we choose \(c_{{\widetilde{F}}}\) large enough (the next section shows how this can easily be done), we can always take \(\gamma ^\nu =1\) and get (12). However, if the original objective function \(U=F + H\) is itself strongly convex, then, without any additional requirement, a unit step-size leads to condition (12). The following proposition is a generalization of a similar result in [3] (see the first part of the proof of Proposition 3.2 therein), which is valid for smooth problems only and only when the exact solution \({\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu )\) is used at each iteration (i.e. when \(\varepsilon ^\nu =0\) for every \(\nu \)).

Proposition 4

Given the nonconvex problem (P1) under G1-G3, suppose that \(U=F + H\) is strongly convex with constant \(c_U\) and that we set \({\widetilde{U}}({\mathbf {x}}; {\mathbf {y}}) = U({\mathbf {x}})\). Letting \(\{{\mathbf {x}}^{\nu }\}\) be the sequence generated by AF 1,

  1. (iv)

    if the step-size \(\gamma ^\nu \) and the error term \(\varepsilon ^\nu \) are chosen so that

    $$\begin{aligned} \gamma ^\nu = 1, \quad \lim _{\nu \rightarrow \infty } \varepsilon ^\nu = 0, \quad U({\mathbf {v}}^\nu ) \le U({\mathbf {x}}^\nu ), \end{aligned}$$

    then \(\{{\mathbf {x}}^\nu \}\) is a bounded sequence such that (12) holds.

Proof

Preliminarily, we observe that, by choosing \(\gamma ^\nu = 1\), we have \({\mathbf {x}}^{\nu +1} = {\mathbf {v}}^\nu \). Then, since \(U({\mathbf {v}}^\nu ) \le U({\mathbf {x}}^\nu )\) and \(U = F + H\) is coercive, the sequence \(\{U({\mathbf {x}}^\nu )\}\) converges and \(\{{\mathbf {x}}^\nu \}\) is bounded. Moreover, since \(\Vert {\mathbf {v}}^\nu - {\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu )\Vert \le \varepsilon ^\nu \rightarrow 0\), the sequences \(\{{\mathbf {v}}^\nu \}\) and \(\{{\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu )\}\) are bounded and U is locally Lipschitz, we get \(U({\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu )) - U({\mathbf {x}}^{\nu +1}) \rightarrow 0\); recalling that \(\{U({\mathbf {x}}^\nu )\}\) converges, we have

$$\begin{aligned} \lim _{\nu \rightarrow \infty } \vert U({\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu )) - U({\mathbf {x}}^\nu )\vert = 0. \end{aligned}$$
(28)

On the other hand, by the strong convexity of U, for every \({\varvec{\xi }}^\nu \in \partial U({\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu ))\),

$$\begin{aligned} U({\mathbf {x}}^\nu ) - U({\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu )) \ge ({\varvec{\xi }}^\nu )^{\scriptscriptstyle T}({\mathbf {x}}^\nu - {\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu )) + \frac{c_U}{2} \Vert {\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu ) - {\mathbf {x}}^\nu \Vert ^2. \end{aligned}$$
(29)

Furthermore, since \({\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu )\) is the solution of problem (P1\(_{{\mathbf {x}}^\nu }\)) and \({\mathbf {x}}^\nu \in \widetilde{\mathcal X}({\mathbf {x}}^\nu )\),

$$\begin{aligned} (\hat{\varvec{\xi }}^\nu )^{\scriptscriptstyle T}({\mathbf {x}}^\nu - {\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu )) \ge 0 \end{aligned}$$
(30)

for some \(\hat{\varvec{\xi }}^\nu \in \partial U({\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu ))\). Thus, by (29) and (30), we get \( U({\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu )) - U({\mathbf {x}}^\nu ) \le -\frac{c_U}{2} \Vert {\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu ) - {\mathbf {x}}^\nu \Vert ^2 \) for every \(\nu \) and, in turn, by (28), \(\lim \nolimits _{\nu \rightarrow \infty } \Vert {\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu ) - {\mathbf {x}}^\nu \Vert = 0.\) \(\square \)

Combining Propositions 1, 3, and 4, we easily get convergence to points satisfying (7). By showing that, under the MFCQ, this condition is actually equivalent to the stationarity condition (5), the following theorem provides the main convergence properties of AF 1.

Theorem 1

Let \(\{{\mathbf {x}}^\nu \}\) be the sequence generated by AF 1. Under Assumptions F, H, and G, \(\{{\mathbf {x}}^\nu \}\) is bounded and

  1. (a)

    if the step-size rule in (i) or (iii) of Proposition 3 is adopted, or if U is strongly convex and the step-size in (iv) of Proposition 4 is employed, then every limit point of \(\{{\mathbf {x}}^\nu \}\) satisfies (7);

  2. (b)

    if the step-size rule in (ii) of Proposition 3 is employed, at least one limit point of \(\{{\mathbf {x}}^\nu \}\) satisfies (7).

Furthermore, any limit point of \(\{{\mathbf {x}}^\nu \}\) that satisfies (7) and the MFCQ with respect to the set \(\mathcal{X}\) is also stationary for problem (P1) according to (5).

Proof

In view of the previous comments and taking into account Proposition 2, (a) and (b) need no further proof. Then, we show that a limit point \({\bar{{\mathbf {x}}}}\) that satisfies (7) and the MFCQ also satisfies (5). It is enough to prove that if (3) holds at \({\bar{{\mathbf {x}}}} \in {\mathcal X}\), then

$$\begin{aligned} \widetilde{N}({\bar{{\mathbf {x}}}}) \subseteq \bigcup \left\{ \sum _{j=1}^m \mu _j \partial g_j({\bar{{\mathbf {x}}}}) | \varvec{\mu }\in N_{\mathbb R^m_-} (\mathbf {g}(\bar{{\mathbf {x}}}))\right\} + N_{\mathcal K}({\bar{{\mathbf {x}}}}) \triangleq S({\bar{{\mathbf {x}}}}). \end{aligned}$$
(31)

We claim that (3) must hold for \(\widetilde{\mathcal X}({\mathbf {y}})\) for all \({\mathbf {y}}\in \mathcal{X}\) and \({\mathbf {u}}\in \widetilde{\mathcal X}({\mathbf {y}})\) sufficiently close to \({\bar{{\mathbf {x}}}}\). If this were not true, sequences \(\mathcal{X} \ni {\mathbf {y}}^k \rightarrow {\bar{{\mathbf {x}}}}\) and \({\mathbf {u}}^k \rightarrow {\bar{{\mathbf {x}}}}\) with \({\mathbf {u}}^k \in \widetilde{\mathcal {X}}({\mathbf {y}}^k)\) would exist such that

$$\begin{aligned} \mathbf {0}\in \sum _{j=1}^m \mu _j^{k} \partial _1 {\tilde{g}}_j({\mathbf {u}}^k;{\mathbf {y}}^k)+N_{\mathcal {K}}({\mathbf {u}}^k), \, \varvec{\mu }^k \in N_{\mathbb {R}^m_-}(\tilde{\mathbf {g}}({\mathbf {u}}^k;{\mathbf {y}}^k)), \end{aligned}$$

with \(\Vert \varvec{\mu }^k\Vert =1\). Note that we use the iteration index k instead of \(\nu \) not to create confusion with the algorithm iterates. Thus, we must have, for every k, \(-\sum _{j=1}^m \mu _j^k {\varvec{\xi }}_{j}^k \in N_{\mathcal{K}}\left( {\mathbf {u}}^k\right) \), for some \({\varvec{\xi }}_{j}^k \in \partial _1 {\tilde{g}}_j({\mathbf {u}}^k;{\mathbf {y}}^k)\) and \(\varvec{\mu }^k \in N_{\mathbb R^m_-}({\tilde{\mathbf {g}}}({\mathbf {u}}^k;{\mathbf {y}}^k))\). Taking the limit and renumbering if necessary, by G5 and the outer semicontinuity (relative to \({\mathcal {K}}\)) of the mapping \(N_{\mathcal{K}}\left( \bullet \right) \) (see e.g. [39, Proposition 6.6]), we get \(-\sum _{j=1}^m {\bar{\mu }}_j \bar{\varvec{\xi }}_{j} \in N_{\mathcal{K}}\left( \bar{{\mathbf {x}}}\right) ,\) where \(\bar{\varvec{\xi }}_{j} \in \partial g_j(\bar{{\mathbf {x}}})\) in view of G6, and \(\bar{\varvec{\mu }} \in N_{\mathbb R^m_-}(\mathbf {g}(\bar{{\mathbf {x}}}))\) with \(\Vert \bar{\varvec{\mu }}\Vert =1\), by G4, G2 and the outer semicontinuity (relative to \(\mathbb R^m_-\)) of the mapping \(N_{\mathbb R^m_-}(\bullet )\). This contradicts the assumed MFCQ at \({\bar{{\mathbf {x}}}} \in \mathcal{X}\). Therefore, for all \({\mathbf {y}}\in \mathcal{X}\) and \({\mathbf {u}}\in \widetilde{\mathcal X}({\mathbf {y}})\) sufficiently close to \({\bar{{\mathbf {x}}}}\),

$$\begin{aligned} \mathbf {0}\in \sum _{j=1}^m \mu _j \partial _1 {\tilde{g}}_j ({\mathbf {u}};{\mathbf {y}})+N_{\mathcal {K}}({\mathbf {u}}), \, \varvec{\mu }\in N_{\mathbb R^m_-}({\tilde{\mathbf {g}}}({\mathbf {u}};{\mathbf {y}})) \Rightarrow \varvec{\mu }=\mathbf {0}. \end{aligned}$$
(32)

In order to prove relation (31), take any \(\bar{\varvec{\eta }}\) in \(\widetilde{N}({\bar{{\mathbf {x}}}})\); we show that \(\bar{\varvec{\eta }}\) belongs to \(S({\bar{{\mathbf {x}}}})\). By definition of \(\widetilde{N}({\bar{{\mathbf {x}}}})\), there exist sequences \(\mathcal{X} \ni {\mathbf {y}}^k \rightarrow {\bar{{\mathbf {x}}}}\) and \({\mathbf {u}}^k \rightarrow {\bar{{\mathbf {x}}}}\) such that \(N_{\widetilde{\mathcal X}({\mathbf {y}}^k)}({\mathbf {u}}^k) \ni {\varvec{\eta }}^k \rightarrow \bar{\varvec{\eta }}\). We also have

$$\begin{aligned} N_{\widetilde{\mathcal X}({\mathbf {y}}^k)}({\mathbf {u}}^k) \!=\! \bigcup \left\{ \sum _{j=1}^m \mu _j^k \partial _1 \tilde{g}_j({\mathbf {u}}^k; {\mathbf {y}}^k) | \varvec{\mu }^k \in N_{\mathbb R^m_-}(\tilde{\mathbf {g}}({\mathbf {u}}^k; {\mathbf {y}}^k))\right\} + N_{\mathcal K}({\mathbf {u}}^k), \end{aligned}$$
(33)

where (33) comes from G1, the constraint qualification (32), and rather standard facts in convex analysis (alternatively, see [39, Corollary 10.50] and [39, Theorem 9.13 along with Corollary 10.9 and Theorem 6.42]). Therefore

$$\begin{aligned} {\varvec{\eta }}^k = \sum _{j=1}^m \mu _j^k {\varvec{\xi }}_j^k + {\varvec{\vartheta }}^k, \end{aligned}$$
(34)

for some \({\varvec{\xi }}_j^k \in \partial _1 \tilde{g}_j({\mathbf {u}}^k; {\mathbf {y}}^k)\), \(\varvec{\mu }^k \in N_{\mathbb R^m_-}(\tilde{\mathbf {g}} ({\mathbf {u}}^k; {\mathbf {y}}^k))\) and \({\varvec{\vartheta }}^k \in N_{\mathcal K}({\mathbf {u}}^k)\). By passing to subsequences, we can distinguish two cases: either \(({\varvec{\mu }}^k, \, {\varvec{\vartheta }}^k) \rightarrow (\bar{\varvec{\mu }}, \, \bar{\varvec{\vartheta }})\) or \(\lambda ^k ({\varvec{\mu }}^k, \, {\varvec{\vartheta }}^k) \rightarrow (\bar{\varvec{\mu }}, \, \bar{\varvec{\vartheta }}) \ne (\mathbf {0}, \, \mathbf {0})\) for some sequence \(\lambda ^k \downarrow 0\). The latter case cannot actually occur; in fact, one would obtain \( \lambda ^k {\varvec{\eta }}^k = \sum _{j=1}^m \lambda ^k \mu _j^k {\varvec{\xi }}_j^k + \lambda ^k {\varvec{\vartheta }}^k, \) which, since \(\lambda ^k {\varvec{\eta }}^k \rightarrow \mathbf {0}\) (recall that \(\{{\varvec{\eta }}^k\}\) converges), would entail \({\mathbf 0} = \sum _{j=1}^m {\bar{\mu }}_j \bar{\varvec{\xi }}_j + \bar{\varvec{\vartheta }}\), with \(\bar{\varvec{\xi }}_j \in \partial g_j({\bar{{\mathbf {x}}}})\) (by G6), \(\bar{\varvec{\mu }} \in N_{\mathbb R_-^m} (\mathbf {g}(\bar{{\mathbf {x}}}))\) (by G4, G2, and the outer semicontinuity relative to \(\mathbb R_-^m\) of \(N_{\mathbb R_-^m}(\bullet )\)), and \(\bar{\varvec{\vartheta }} \in N_\mathcal{K}(\bar{{\mathbf {x}}})\) (by the outer semicontinuity relative to \({\mathcal {K}}\) of \(N_{\mathcal K}(\bullet )\)). This contradicts the assumed MFCQ (3) at \(\bar{{\mathbf {x}}}\). Hence, we are left with \(({\varvec{\mu }}^k, \, {\varvec{\vartheta }}^k) \rightarrow (\bar{\varvec{\mu }}, \, \bar{\varvec{\vartheta }})\). Recalling G5 and taking the limit in (34), we have, without loss of generality, \( \bar{\varvec{\eta }} = \sum _{j=1}^m {\bar{\mu }}_j \bar{\varvec{\xi }}_j + \bar{\varvec{\vartheta }}, \) where, as before, \(\bar{\varvec{\xi }}_j \in \partial g_j({\bar{{\mathbf {x}}}})\), \(\bar{\varvec{\mu }} \in N_{\mathbb R_-^m} (\mathbf {g}(\bar{{\mathbf {x}}}))\), and \(\bar{\varvec{\vartheta }} \in N_\mathcal{K}(\bar{{\mathbf {x}}})\), thus concluding the proof. \(\square \)

4.3 From \(\liminf \) to \(\lim \) when a diminishing step-size is adopted

When a diminishing step-size (case (ii) in Proposition 3) is employed, in view of Theorem 1, one can only establish weaker convergence properties than for cases (i) and (iii) in Proposition 3 and case (iv) in Proposition 4: indeed, we were only able to show (14) instead of (12). Taking into account that the diminishing step-size rule is arguably the best option for parallel/distributed implementations in the nonconvex setting (see Sect. 5.3), it is natural to ask whether the results for case (ii) can be improved. It turns out that the Hölder behavior of the best-response mapping \({\hat{{\mathbf {x}}}}(\bullet )\) plays a key role.

Proposition 5

Under the assumptions of Proposition 3, if, in addition,

$$\begin{aligned} \Vert \hat{{\mathbf {x}}}({\mathbf {y}})-\hat{{\mathbf {x}}}(\mathbf {z})\Vert \le \theta \Vert {\mathbf {y}}-\mathbf {z}\Vert ^\beta , \end{aligned}$$
(35)

for every \({\mathbf {y}}, \, \mathbf {z}\in \mathcal X\) and for positive scalars \(\beta \) and \(\theta \), then, also in case (ii) of Proposition 3, (12) holds, i.e. \(\lim _{\nu \rightarrow \infty }\,\Vert {\hat{{\mathbf {x}}}}({\mathbf {x}}^\nu )-{\mathbf {x}}^\nu \Vert =0.\)

Proof

Denoting \(\varDelta \hat{{\mathbf {x}}}({{\mathbf {x}}}^{\nu })\triangleq \hat{{\mathbf {x}}}({{\mathbf {x}}}^{\nu })-{{\mathbf {x}}}^{\nu }\), by the statement (ii) of Proposition 3, we have \(\liminf _{\nu \rightarrow \infty } \Vert \varDelta \hat{{\mathbf {x}}}({{\mathbf {x}}}^{\nu })\Vert =0\). Suppose by contradiction that \(\limsup _{\nu \rightarrow \infty }\Vert \varDelta \hat{{\mathbf {x}}}({{\mathbf {x}}}^{\nu })\Vert >0\). Then, there exists \(\delta >0\) such that \( \Vert \varDelta \hat{{\mathbf {x}}}({{\mathbf {x}}}^{\nu })\Vert >\delta \) and \(\Vert \varDelta \hat{{\mathbf {x}}}({{\mathbf {x}}}^{\nu })\Vert <\delta /2 \) for infinitely many \(\nu \)s. Therefore, there is an infinite subset of indices \(\mathcal{N}\) such that, for each \(\nu \in \mathcal{N}\), and some \(i_{\nu }>\nu \), the following relations hold:

$$\begin{aligned} \Vert \varDelta \hat{{\mathbf {x}}}({{\mathbf {x}}}^{\nu })\Vert <\delta /2,\quad \Vert \varDelta \hat{{\mathbf {x}}}({{\mathbf {x}}}^{i_{\nu }})\Vert >\delta \end{aligned}$$
(36)

and, if \(i_\nu > \nu + 1\),

$$\begin{aligned} \delta /2\le \Vert \varDelta \hat{{\mathbf {x}}}({{\mathbf {x}}}^{j})\Vert \le \delta ,\quad \nu<j<i_{\nu }. \end{aligned}$$
(37)

Hence, for all \(\nu \in \mathcal{N}\), we can write

$$\begin{aligned} \delta /2&\textstyle < \Vert \varDelta \hat{{\mathbf {x}}}({{\mathbf {x}}}^{i_{\nu }})\Vert -\Vert \varDelta \hat{{\mathbf {x}}}({{\mathbf {x}}}^{\nu })\Vert \, \le \, \Vert \hat{{\mathbf {x}}}({{\mathbf {x}}}^{i_{\nu }})-\hat{{\mathbf {x}}}({{\mathbf {x}}}^{\nu })\Vert +\Vert {{\mathbf {x}}}^{i_{\nu }}-{{\mathbf {x}}}^{\nu }\Vert \nonumber \\&\textstyle \overset{(a)}{\le } \Vert {{\mathbf {x}}}^{i_{\nu }}-{{\mathbf {x}}}^{\nu }\Vert +\theta \Vert {{\mathbf {x}}}^{i_{\nu }}-{{\mathbf {x}}}^{\nu }\Vert ^\beta \nonumber \\&\textstyle \overset{(b)}{\le } \sum \limits _{t=\nu }^{i_{\nu }-1}\gamma ^{t} \left( \Vert \varDelta \hat{\mathbf {x}}({\mathbf {x}}^t)\Vert + \Vert {\mathbf {v}}^t - {\hat{{\mathbf {x}}}}({\mathbf {x}}^t)\Vert \right) \nonumber \\&\textstyle \quad +\theta \left[ \sum \limits _{t=\nu }^{i_{\nu }-1}\gamma ^{t} \left( \Vert \varDelta \hat{\mathbf {x}}({\mathbf {x}}^t)\Vert + \Vert {\mathbf {v}}^t - {\hat{{\mathbf {x}}}}({\mathbf {x}}^t)\Vert \right) \right] ^\beta \\&\textstyle \overset{(c)}{\le } \left( \delta +\varepsilon ^{\max }\right) \, \sum \limits _{t=\nu }^{i_{\nu }-1}\gamma ^{t}\nonumber +\theta \left( \delta +\varepsilon ^{\max }\right) ^\beta \left( \sum \limits _{t=\nu }^{i_{\nu }-1}\gamma ^{t}\right) ^\beta , \end{aligned}$$
(38)

where (a) is due to (35), (b) comes from the triangle inequality and the updating rule of the algorithm, and in (c) we used (36), (37), \(\Vert {\mathbf {v}}^t-{\hat{{\mathbf {x}}}}({\mathbf {x}}^t)\Vert \le \varepsilon ^t\), and set \(\varepsilon ^{\max } \triangleq \max _\nu \varepsilon ^\nu \), which is well defined because, by (13), \(\varepsilon ^\nu \rightarrow 0\). By (38) we have

$$\begin{aligned} \underset{_{\nu \rightarrow \infty }}{\liminf }\left[ \left( \delta +\varepsilon ^{\max }\right) \sum _{t=\nu }^{i_{\nu }-1}\gamma ^{t} + \theta \left( \delta +\varepsilon ^{\max }\right) ^\beta \left( \sum _{t=\nu }^{i_{\nu }-1}\gamma ^{t}\right) ^\beta \right] >0. \end{aligned}$$
(39)

We prove next that (39) is in contradiction with the convergence of \(\{F({\mathbf {x}}^\nu ) + H({\mathbf {x}}^\nu )\}\), which is established in Proposition 3. To this end, we first show that \(\Vert \varDelta \hat{{\mathbf {x}}}({{\mathbf {x}}}^{\nu })\Vert \ge \delta /4\), for sufficiently large \(\nu \in \mathcal{N}\). Reasoning as in (38), we have

$$\begin{aligned} \begin{array}{rcl} \Vert \varDelta \hat{{\mathbf {x}}}({{\mathbf {x}}}^{\nu +1})\Vert &{}-&{} \Vert \varDelta \hat{{\mathbf {x}}}({{\mathbf {x}}}^{\nu })\Vert \, \le \, \Vert {{\mathbf {x}}}^{\nu +1}-{{\mathbf {x}}}^{\nu }\Vert +\theta \Vert {{\mathbf {x}}}^{\nu +1}-{{\mathbf {x}}}^{\nu }\Vert ^\beta \\ &{} \le &{} \gamma ^{\nu }(\Vert \varDelta \hat{{\mathbf {x}}}({{\mathbf {x}}}^{\nu })\Vert +\varepsilon ^{\max }) +\theta (\gamma ^\nu )^\beta (\Vert \varDelta \hat{{\mathbf {x}}}({{\mathbf {x}}}^{\nu })\Vert +\varepsilon ^{\max })^\beta , \end{array} \end{aligned}$$

for any given \(\nu \). For \(\nu \in \mathcal{N}\) large enough so that \(\gamma ^{\nu }(\delta /4+\varepsilon ^{\max })+\theta (\gamma ^\nu )^\beta (\delta /4+\varepsilon ^{\max })^\beta <\delta /4\), suppose by contradiction that \(\Vert \varDelta \hat{{\mathbf {x}}}({{\mathbf {x}}}^{\nu })\Vert <\delta /4\); this would give \(\Vert \varDelta \hat{{\mathbf {x}}}({{\mathbf {x}}}^{\nu +1})\Vert <\delta /2\) and, thus, condition (37) (or (36)) would be violated. Hence, we must have \( \Vert \varDelta \hat{{\mathbf {x}}}({{\mathbf {x}}}^{\nu })\Vert \ge \delta /4. \) From this and (23) we have, for sufficiently large \(\nu \in \mathcal{N}\),

$$\begin{aligned} \begin{array}{rcl} F({{\mathbf {x}}}^{i_{\nu }}) + H({{\mathbf {x}}}^{i_{\nu }}) &{} \le &{} F({{\mathbf {x}}}^{\nu }) + H({{\mathbf {x}}}^{\nu }) -\omega \sum \limits _{t=\nu }^{i_{\nu }-1}\gamma ^{t}\Vert \varDelta \hat{{\mathbf {x}}} ({{\mathbf {x}}}^{t})\Vert ^{2}+\sum \limits _{t=\nu }^{i_{\nu }-1} T^t\\ &{} \le &{} F({{\mathbf {x}}}^{\nu }) + H({{\mathbf {x}}}^{\nu }) -\omega \frac{\delta ^{2}}{16}\sum \limits _{t=\nu }^{i_{\nu }-1}\gamma ^{t}+\sum \limits _{t=\nu }^{i_{\nu }-1} T^t. \end{array} \end{aligned}$$
(40)

Since \(T^\nu \ge 0\) for every \(\nu \) and \(\sum _{\nu =0}^\infty T^\nu <\infty \), the Cauchy convergence criterion implies \(\sum _{t=\nu }^{i_{\nu }-1} T^t \rightarrow 0\). Therefore, recalling that \(\{F({\mathbf {x}}^\nu ) + H({\mathbf {x}}^\nu )\}\) converges, renumbering if necessary, (40) implies \(\sum _{t=\nu }^{i_{\nu }-1}\gamma ^{t} \rightarrow 0\), in contradiction with (39). \(\square \)

In view of Proposition 5, we aim at deriving suitable sufficient conditions for (35) to hold. Of course, one can draw on the vast literature on stability in optimization (see e.g. [9, 28], the recent [32, 33] and references therein). In particular, in [33] a second-order subdifferential characterization of Hölderian full stability (via the regular coderivative of the subdifferential) is established. However, these conditions are hard to verify in practice and, hence, below we prove the desired Hölderian behavior (35) under assumptions that, while not minimal, are easy to check and, in our context, not difficult to satisfy. We therefore require the following additional conditions; in the next section we show that they are easily met in many practical cases.

Assumption B

(F4) :

\(\nabla _1 {\widetilde{F}}({\mathbf {x}};\bullet )\) is Lipschitz continuous on \(O_{\mathbf {y}}\) for every \({\mathbf {x}}\in {\mathcal {X}}\) with modulus of Lipschitz continuity independent of \({\mathbf {x}}\);

(F5) :

\(\nabla _1 {\widetilde{F}}(\bullet ;{\mathbf {y}})\) is Lipschitz continuous on \(O_{\mathbf {x}}\) for every \({\mathbf {y}}\in {\mathcal {X}}\) with modulus of Lipschitz continuity independent of \({\mathbf {y}}\);

(G7) :

each \(\tilde{g}_j(\bullet ;\bullet )\) is locally Lipschitz continuous on \(O_{\mathbf {x}}\times O_{\mathbf {y}}\).

The following proposition, which builds on the results in [51], allows us to provide suitable sufficient conditions that guarantee the Hölder behavior (35) of \({\hat{{\mathbf {x}}}}(\bullet )\) in the case of a smooth objective function.

Proposition 6

Assume that F1, F2, F4, F5, G1 and G7 hold and that \(H \equiv 0\). Suppose further that \(\mathcal {X}\) is compact and the MFCQ holds at \(\hat{{\mathbf {x}}}({\bar{{\mathbf {x}}}}) \in \widetilde{\mathcal{X}}({\bar{{\mathbf {x}}}})\) for every \({\bar{{\mathbf {x}}}} \in \mathcal{X}\). Then, there exists \(\theta > 0\) such that for every \({{\mathbf {y}}},{\mathbf {z}}\in \mathcal{X}\):

$$\begin{aligned} \Vert \hat{{\mathbf {x}}}({{\mathbf {y}}})-\hat{{\mathbf {x}}}({\mathbf {z}})\Vert \le \theta \Vert {{\mathbf {y}}}-{\mathbf {z}}\Vert ^{\frac{1}{2}}. \end{aligned}$$
(41)

Proof

Observing that, in view of G1 and G7, by [12, Propositions 2.3.16 and 2.5.3], we have \(\partial _{1} \tilde{g}_j({\hat{{\mathbf {x}}}}({\bar{{\mathbf {x}}}}); {\bar{{\mathbf {x}}}}) = \bar{\partial }_{1} \tilde{g}_j({\hat{{\mathbf {x}}}}({\bar{{\mathbf {x}}}}); {\bar{{\mathbf {x}}}})= \pi _{{\mathbf {x}}} {\bar{\partial }} \tilde{g}_j({\hat{{\mathbf {x}}}}({\bar{{\mathbf {x}}}}); {\bar{{\mathbf {x}}}})\), the MFCQ at \({\hat{{\mathbf {x}}}}({\bar{{\mathbf {x}}}}) \in \widetilde{\mathcal X}({\bar{{\mathbf {x}}}})\), for every \({\bar{{\mathbf {x}}}} \in \mathcal{X}\), implies by [51, Lemma 3.1] that the set-valued mapping \(\widetilde{\mathcal X}\) has the Aubin property relative to \({\mathcal X}\) at \({\bar{{\mathbf {x}}}}\) for \({\hat{{\mathbf {x}}}}({\bar{{\mathbf {x}}}})\). Therefore, in view of [51, Theorem 2.1], for every \({\bar{{\mathbf {x}}}} \in \mathcal{X}\), there exist \(\eta _{{\bar{{\mathbf {x}}}}} > 0\), \(\theta _{{\bar{{\mathbf {x}}}}} > 0\) and a neighborhood \({\mathcal {V}}_{{\bar{{\mathbf {x}}}}}\) of \({\bar{{\mathbf {x}}}}\) such that, for every \({\mathbf {y}}, \, \mathbf {z}\in {\mathcal {V}}_{{\bar{{\mathbf {x}}}}} \cap {\mathcal {X}}\),

$$\begin{aligned} \Vert \hat{{\mathbf {x}}}({{\mathbf {y}}})-\hat{{\mathbf {x}}}({\mathbf {z}})\Vert \le \eta _{{\bar{{\mathbf {x}}}}}\Vert {{\mathbf {y}}}-{\mathbf {z}}\Vert +\theta _{{\bar{{\mathbf {x}}}}}\Vert {{\mathbf {y}}}-{\mathbf {z}}\Vert ^{\frac{1}{2}}. \end{aligned}$$

By the previous relation and the compactness of the set \({\mathcal X}\), (41) holds. \(\square \)
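For completeness, we sketch the standard covering argument behind the last step (this is our elaboration). Extract from \(\{{\mathcal {V}}_{{\bar{{\mathbf {x}}}}}\}_{{\bar{{\mathbf {x}}}} \in {\mathcal {X}}}\) a finite subcover of the compact set \({\mathcal {X}}\), let \(r > 0\) be a corresponding Lebesgue number (so that any \({\mathbf {y}}, \, \mathbf {z}\in {\mathcal {X}}\) with \(\Vert {{\mathbf {y}}}-{\mathbf {z}}\Vert < r\) lie in a common element of the subcover), denote by \(\eta \) and \(\vartheta \) the largest of the associated constants \(\eta _{{\bar{{\mathbf {x}}}}}\) and \(\theta _{{\bar{{\mathbf {x}}}}}\), and set \(M \triangleq \sup _{{\mathbf {y}}, \mathbf {z}\in {\mathcal {X}}} \Vert \hat{{\mathbf {x}}}({{\mathbf {y}}})-\hat{{\mathbf {x}}}({\mathbf {z}})\Vert \), which is finite because, by the local estimate itself, \(\hat{{\mathbf {x}}}(\bullet )\) is continuous on the compact set \({\mathcal {X}}\). Then

$$\begin{aligned} \Vert \hat{{\mathbf {x}}}({{\mathbf {y}}})-\hat{{\mathbf {x}}}({\mathbf {z}})\Vert \le \left\{ \begin{array}{ll} \left( \eta \, r^{\frac{1}{2}} + \vartheta \right) \Vert {{\mathbf {y}}}-{\mathbf {z}}\Vert ^{\frac{1}{2}}, \; &{}\quad \Vert {{\mathbf {y}}}-{\mathbf {z}}\Vert < r,\\ M \, r^{-\frac{1}{2}}\, \Vert {{\mathbf {y}}}-{\mathbf {z}}\Vert ^{\frac{1}{2}}, \; &{}\quad \Vert {{\mathbf {y}}}-{\mathbf {z}}\Vert \ge r, \end{array}\right. \end{aligned}$$

so that (41) holds with \(\theta \triangleq \max \{\eta \, r^{1/2} + \vartheta , \, M r^{-1/2}\}\).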

5 Approximations and distributed implementation

In this section we show the wide applicability of our approach by defining functions \({\widetilde{F}}\), \(\widetilde{H}\), and \(\tilde{g}\) satisfying Assumptions F, H, and G in an array of cases of practical interest. We also briefly discuss how a suitable selection of \({\widetilde{F}}\), \(\widetilde{H}\), and \(\tilde{g}\) can lead to parallel and distributed versions of our scheme. We remark that, given a specific application, it is often possible to find ad hoc approximations that exploit the structure of the problem at hand; Sect. 6 provides examples in this sense. While the functions we consider are “common” and reflect what can be encountered in applications, the fact that these approximations, even if natural-looking, satisfy our assumptions is by no means obvious; on the contrary, it is evidence of the care taken in the choice of Assumptions F, H, and G. It is precisely because of these technical, but carefully selected, assumptions that we are able to deal, for the first time in INCA-type methods, with nonsmooth functions that are not subdifferentially regular, such as those considered in Examples 2, 3, 4 and 7 in Sect. 5.2 below, which, for example, cannot be handled by the method proposed in [7]. We believe that our treatment of problems involving this kind of function provides, in this context, what is probably the most efficient solution method currently available.

5.1 Approximating F

Finding suitable approximations for F is in general easy, since F is differentiable and we do not need to impose any majorization property. Below we list a few possible choices for \({\widetilde{F}}\). The verification of Assumption F, as well as of F4 and F5 in Assumption B, is straightforward and, therefore, omitted. For the sake of simplicity, in this and the next subsection we do not specify the sets \(O_{\mathbf {x}}\) and \(O_{\mathbf {y}}\) every time, as they can easily be determined by the reader.

  1. 1.

    If F is convex, as assumed in [3, 30], we can set

    $$\begin{aligned} \widetilde{F}({\mathbf {x}};{\mathbf {y}})\triangleq F({\mathbf {x}}) +\frac{\tau }{2}\left\| \mathbf {x}-\mathbf {y}\right\| ^{2}, \end{aligned}$$

    where \(\tau \) is a positive constant, which can be taken equal to zero if F is already strongly convex. This \({\widetilde{F}}\) is just a regularization of the original F.

  2. 2.

    In general, one can always resort to the first order approximation of F,

    $$\begin{aligned} \widetilde{F}({\mathbf {x}};{\mathbf {y}})\triangleq \nabla F(\mathbf {y})^{T}(\mathbf {x}-\mathbf {y})+\frac{\tau }{2}\left\| \mathbf {x}-\mathbf {y}\right\| ^{2}, \end{aligned}$$

    where \(\tau \) is a positive constant. Classical (proximal) gradient methods [6] use this approximation but they are not applicable to (P1) because of the nonconvexity of the feasible set. Note that if \(F\equiv 0\), the above approximation boils down to \( {\widetilde{F}}({\mathbf {x}}; {\mathbf {y}}) \triangleq \frac{\tau }{2} \Vert {\mathbf {x}}- {\mathbf {y}}\Vert ^2. \) This means that even if \(U= H\) we still have a “contribution” from F. Approximating \(F\equiv 0\) with a proximal term is simply a way of requiring the overall approximation \({\widetilde{U}} \triangleq {\widetilde{F}} + \widetilde{H}\) to be strongly convex without making this assumption also on \(\widetilde{H}\). A minimal numerical sketch of this approximation is given at the end of this subsection.

  3. 3.

    In some cases, a certain degree of convexity may be present in F and it is useful to preserve this feature as much as possible. Suppose that the vector of variables \({\mathbf {x}}\) is partitioned in blocks \({\mathbf {x}}=({\mathbf {x}}_{i})_{i=1}^{I}\) and the function F is convex in each \({\mathbf {x}}_{i}\) but not jointly in \({\mathbf {x}}\). A natural approximation is then

    $$\begin{aligned} \widetilde{F}({\mathbf {x}};{\mathbf {y}})=\sum _{i=1}^{I}\widetilde{F}_{i}({\mathbf {x}}_{i};{\mathbf {y}}), \quad \text{ with }\quad \widetilde{F}_{i}({\mathbf {x}}_{i};{\mathbf {y}})\triangleq F({\mathbf {x}}_{i},{\mathbf {y}}_{-i})+\dfrac{\tau _{i}}{2}\Vert {\mathbf {x}}_{i}-{\mathbf {y}}_{i}\Vert ^2, \end{aligned}$$
    (42)

    where \({\mathbf {y}}\triangleq ({\mathbf {y}}_{i})_{i=1}^{I}\), \({\mathbf {y}}_{-i}\triangleq ({\mathbf {y}}_{j})_{j\ne i}\) and \(\tau _i > 0.\) If \(F(\bullet ,{\mathbf {y}}_{-i})\) is already uniformly strongly convex for all feasible \({\mathbf {y}}_{-i}\), the quadratic term in (42) can be dropped.

  4. 4.

    In some practical problems, for example in multi-agent scenarios, the objective function F is given by the sum of the utilities \(f_{i} ({\mathbf {x}}_{1},\ldots ,{\mathbf {x}}_{I})\) of I agents, each of them controlling the variable block \({\mathbf {x}}_{i}\): thus, \(F({\mathbf {x}})\triangleq \sum _{i=1}^{I}f_{i} ({\mathbf {x}}_{1},\ldots ,\) \({\mathbf {x}}_{I})\). Typically, it may happen that some of the \(f_{i}\)s are convex in some agents’ variables: our purpose, following [16, 45], is to employ an approximation such that, for each agent i, the convex part of F w.r.t. \({\mathbf {x}}_i\) is kept unaltered, while the nonconvex part is linearized. Let

    $$\begin{aligned} \mathcal {S}_{i}\triangleq \left\{ j=1,\ldots ,I:f_{j}(\bullet ,{\mathbf {x}}_{-i})\,\text { is convex, }\,\forall ({\mathbf {x}}_{i},{\mathbf {x}}_{-i})\in \mathcal {K}\right\} \end{aligned}$$

    be the set of indices of all the functions \(f_{j}({\mathbf {x}}_{i},{\mathbf {x}}_{-i})\) that are convex in \({\mathbf {x}}_{i}\), for any feasible \({\mathbf {x}}_{-i}\), and \({\mathcal {C}}_{i}\) be any subset of \({\mathcal {S}}_{i}\); one can set

    $$\begin{aligned} \widetilde{F}({\mathbf {x}};{\mathbf {y}})=\sum _{i=1}^{I}\widetilde{F}_{\mathcal {C}_{i}}({\mathbf {x}}_{i};{\mathbf {y}}), \end{aligned}$$

    with

    $$\begin{aligned} \widetilde{F}_{\mathcal {C}_{i}}({\mathbf {x}}_{i};{\mathbf {y}})\triangleq {\displaystyle {\sum _{j\in {\mathcal {C}}_{i}}}}f_{j}({\mathbf {x}}_{i},{\mathbf {y}}_{-i}) + {\displaystyle {\sum _{k\notin {\mathcal {C}}_{i}}}} \left[ f_k({\mathbf {y}}) + \nabla _{i}f_{k}({\mathbf {y}})^{\scriptscriptstyle T}({\mathbf {x}}_{i}-{\mathbf {y}}_{i})\right] +\dfrac{{\tau _{i}}}{2}\,\Vert \mathbf {x}_{i}-\mathbf {y}_{i}\Vert ^2, \end{aligned}$$

    where \(\tau _i\) is a positive constant.

  5. 5.

    Assume that F is a saddle function, i.e., F depends on two groups of variables, \({\mathbf {x}}_1\) and \({\mathbf {x}}_2\), with \(F(\bullet , {\mathbf {x}}_2)\) convex for every fixed \({\mathbf {x}}_2\) and \(F({\mathbf {x}}_1, \bullet )\) concave for every fixed \({\mathbf {x}}_1\). It is easy to check that

    $$\begin{aligned} {\widetilde{F}}({\mathbf {x}}_1,{\mathbf {x}}_2;{\mathbf {y}}_1,{\mathbf {y}}_2)\triangleq & {} F({\mathbf {x}}_1, {\mathbf {y}}_2) + \nabla _{2} F({\mathbf {y}}_1, {\mathbf {y}}_2)^{\scriptscriptstyle T}({\mathbf {x}}_2 - {\mathbf {y}}_2)\\&+ \frac{\tau }{2} \Vert ({\mathbf {x}}_1,{\mathbf {x}}_2)-({\mathbf {y}}_1,{\mathbf {y}}_2)\Vert ^2 \end{aligned}$$

    satisfies Assumption F for any positive \(\tau \). Needless to say, if \(F(\bullet , {\mathbf {x}}_2)\) is strongly convex in \({\mathbf {x}}_1\), uniformly relative to \({\mathbf {x}}_2\), we can employ \(\Vert {\mathbf {x}}_2-{\mathbf {y}}_2\Vert ^2\) instead of \(\Vert ({\mathbf {x}}_1,{\mathbf {x}}_2)-({\mathbf {y}}_1,{\mathbf {y}}_2)\Vert ^2\).

  6. 6.

    If the set \(\mathcal{K}\) is contained in \(\mathbb {R}^n_{++}\triangleq \{x\in \mathbb {R}^n: x_i >0, \forall i\}\), we can set

    $$\begin{aligned} {\widetilde{F}}({\mathbf {x}}; {\mathbf {y}})&\triangleq F({\mathbf {y}}) + \textstyle {\sum \limits _{i:\nabla _{i}F({\mathbf {y}})\ge 0}} \nabla _{i}F({\mathbf {y}}) (x_i -y_i)\\&\quad \textstyle - \sum \limits _{i:\nabla _{i}F({\mathbf {y}})<0} \nabla _{i}F({\mathbf {y}}) \left( \frac{y_i^2}{x_i} - y_i \right) +\frac{\tau }{2}\left\| \mathbf {x}-\mathbf {y}\right\| ^{2} , \end{aligned}$$

    where \(\tau \) is a positive constant. This kind of approximation is used in structural optimization problems [19, 47, 48] for which it proves to be extremely effective.
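As anticipated in case 2 above, the following minimal Python sketch (our illustration, not part of the paper's framework) builds the first-order surrogate \(\widetilde{F}({\mathbf {x}};{\mathbf {y}}) = \nabla F({\mathbf {y}})^{T}({\mathbf {x}}-{\mathbf {y}}) + \frac{\tau }{2}\Vert {\mathbf {x}}-{\mathbf {y}}\Vert ^2\) for a generic smooth F (the particular F below is a toy choice) and checks numerically two properties that hold for it by construction: strong convexity in \({\mathbf {x}}\) with modulus \(\tau \), and gradient consistency \(\nabla _1 \widetilde{F}({\mathbf {y}};{\mathbf {y}}) = \nabla F({\mathbf {y}})\).

```python
import numpy as np

def make_first_order_surrogate(gradF, tau=1.0):
    """Surrogate of case 2, Sect. 5.1: F_tilde(x; y) = gradF(y)^T (x - y) + (tau/2) ||x - y||^2."""
    def F_tilde(x, y):
        d = x - y
        return gradF(y) @ d + 0.5 * tau * (d @ d)

    def grad1_F_tilde(x, y):
        # gradient of F_tilde(.; y) evaluated at x
        return gradF(y) + tau * (x - y)

    return F_tilde, grad1_F_tilde

# gradient of some smooth (possibly nonconvex) F; this particular F is a toy choice
gradF = lambda x: np.cos(x) + 0.2 * x

F_tilde, grad1 = make_first_order_surrogate(gradF, tau=2.0)
rng = np.random.default_rng(0)
y = rng.standard_normal(5)

# gradient consistency at x = y: grad_1 F_tilde(y; y) = gradF(y)
print(np.allclose(grad1(y, y), gradF(y)))                      # True

# strong convexity (modulus tau) of F_tilde(.; y), via monotonicity of its gradient
x1, x2 = rng.standard_normal(5), rng.standard_normal(5)
gap = (grad1(x1, y) - grad1(x2, y)) @ (x1 - x2)
print(gap >= 2.0 * np.linalg.norm(x1 - x2) ** 2 - 1e-12)       # True for tau = 2
```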

5.2 Approximating H or \(g_j\)

The approximations of the nondifferentiable component H of the objective function and of the constraints \(g_j\) must obey almost the same assumptions. In fact, Assumptions H and G differ only in the additional condition G4. Since all our examples do satisfy G4, we deal with the approximations for H and \(g_j\) together, referring to a generic locally Lipschitz function h. Assumption B also includes condition G7, which is slightly harder to satisfy. We do not require the approximations of H or \(g_j\) to be differentiable; in fact, what matters is that the resulting subproblem (\({\hbox {P}1_{{\mathbf {y}}}}\)) can be efficiently solved, and this does not necessarily imply that \(\widetilde{H}\) or \(\tilde{g}_j\) should be differentiable, see e.g. the discussion after (46). Below we provide several examples, discussing only those properties (among G1-G6 and G7) that are not immediately obvious.

  1. 1.

    If h is convex, we can simply set \({\tilde{h}}({\mathbf {x}};{\mathbf {y}}) \triangleq h({\mathbf {x}})\). In particular, if h is a constraint function, this means that such constraint can be incorporated in the convex set \(\mathcal K\).

  2. 2.

    If h is concave, the linear approximation

    $$\begin{aligned} {\tilde{h}}({\mathbf {x}}; {\mathbf {y}}) \triangleq h({\mathbf {y}}) + \varvec{\xi }^{\scriptscriptstyle T}({\mathbf {x}}- {\mathbf {y}}), \end{aligned}$$

    for any \(\varvec{\xi } \in \partial h({\mathbf {y}})\), satisfies G4-G6 thanks to the local boundedness and outer semicontinuity of the subgradient set of the locally Lipschitz function h. In order for G7 to hold, we must assume h to be continuously differentiable with locally Lipschitz gradient.

  3. 3.

    Consider \(h({\mathbf {x}}) = \min _i \{h_i({\mathbf {x}}) \}\), with \(h_i\) smooth and convex for every \(i=1, \ldots ,p\). The constraint \( \min _i \{h_i({\mathbf {x}})\} \le 0\) is a difficult disjunctive-type constraint requiring at least one of the inequalities \(h_i({\mathbf {x}}) \le 0\) to be satisfied. In order to define a suitable approximation, we first derive an exact formula for the subdifferential of the \(\min \) function. To this end, let \(I({\mathbf {y}}) \triangleq \{i \in \{1, \ldots , p\} | h({\mathbf {y}}) = h_i({\mathbf {y}})\}\) be the set of active indices at \({\mathbf {y}}\) and \(I^e({\mathbf {y}}) \triangleq \{i \in \{1, \ldots , p\} | {\mathbf {y}}\in \text {cl} \, \text {int} \, \{{\mathbf {x}}| h({\mathbf {x}}) = h_i({\mathbf {x}})\}\}\subseteq I({\mathbf {y}})\) be the set of essentially active indices at \({\mathbf {y}}\) (\(\text {cl}\) and \(\text {int}\) indicate the closure and the interior of a set).

Proposition 7

Let all \(h_i\) be continuously differentiable (not necessarily convex). Then,

$$\begin{aligned} \partial h({\mathbf {y}}) = \bigcup _{i \in I^e({\mathbf {y}})} \{\nabla h_i({\mathbf {y}})\}. \end{aligned}$$
(43)

Proof

By [31, Proposition 1.113] and [41, Proposition 4.1.1], we have:

$$\begin{aligned} \partial h({\mathbf {y}}) \subseteq \bigcup _{i \in I^e({\mathbf {y}})} \{\nabla h_i({\mathbf {y}})\}. \end{aligned}$$

We now prove that \(\nabla h_i({\mathbf {y}}) \in \partial h({\mathbf {y}})\) for every \(i \in I^e({\mathbf {y}})\). Indeed, for any \(i \in I^e({\mathbf {y}})\), we have \({\mathbf {y}}\in \text {cl} \, \text {int} \, \{{\mathbf {x}}| h({\mathbf {x}}) = h_i({\mathbf {x}})\}\). Thus, there exists a sequence \(\{\mathbf {w}^\nu \}\) converging to \({\mathbf {y}}\) such that \(\mathbf {w}^\nu \in \text {int} \, \{{\mathbf {x}}| h({\mathbf {x}}) = h_i({\mathbf {x}})\}\) and \(\partial h(\mathbf {w}^\nu ) = \{\nabla h(\mathbf {w}^\nu )\} = \{\nabla h_i(\mathbf {w}^\nu )\}\) for every \(\nu \). Therefore, thanks to [39, Theorem 9.61], \(\nabla h_i({\mathbf {y}}) \in \partial h({\mathbf {y}})\) for every \(i \in I^e({\mathbf {y}})\) and the thesis follows. \(\square \)

In view of the previous considerations, we propose the approximation

$$\begin{aligned} {\tilde{h}}({\mathbf {x}}; {\mathbf {y}}) \triangleq h_{i_{\mathbf {y}}}({\mathbf {x}}), \quad {\text {for any}} \; i_{\mathbf {y}}\in I^e({\mathbf {y}}). \end{aligned}$$

As regards G4, it is enough to recall that \(I(\mathbf {w}) \subseteq I({\mathbf {y}})\) for every \(\mathbf {w}\) in some neighborhood of \({\mathbf {y}}\). Observing that \(\partial _1 {\tilde{h}}({\mathbf {x}}; {\mathbf {y}}) = \{\nabla h_{i_{\mathbf {y}}}({\mathbf {x}})\}\) with \(i_{\mathbf {y}}\in I^e({\mathbf {y}})\), G5 holds trivially. Turning to G6, we preliminarily check that \(I^e(\mathbf {w}) \subseteq I^e({\mathbf {y}})\) for every \(\mathbf {w}\) sufficiently close to \({\mathbf {y}}\). We show that if \(i \not \in I^e({\mathbf {y}})\), then, for every \(\mathbf {w}\) in some neighborhood of \({\mathbf {y}}\), we have \(i \not \in I^e(\mathbf {w})\). Let \(i \not \in I^e({\mathbf {y}})\). Then, there exists a neighborhood \({\mathcal {V}}\) of \({\mathbf {y}}\) with \( {\mathcal {V}} \cap \text {int} \, \{{\mathbf {x}}| h({\mathbf {x}}) = h_i({\mathbf {x}})\} = \emptyset . \) If there were \(\mathbf {w}\in \text {int} \, {\mathcal {V}}\) such that \(i \in I^e(\mathbf {w})\), we would also have \(\mathbf {w}\in \text {cl} \, \text {int} \, \{{\mathbf {x}}| h({\mathbf {x}}) = h_i({\mathbf {x}})\}\). Hence, there would be a sequence \(\{{\mathbf {v}}^\nu \}\) converging to \(\mathbf {w}\) with \({\mathbf {v}}^\nu \in \text {int} \, \{{\mathbf {x}}| h({\mathbf {x}}) = h_i({\mathbf {x}})\}\) and we would get \({\mathbf {v}}^\nu \not \in {\mathcal {V}}\) for every \(\nu \), contradicting the fact that \({\mathbf {v}}^\nu \rightarrow \mathbf {w}\in \text {int} \, {\mathcal {V}}\). Therefore, \(I^e(\mathbf {w}) \subseteq I^e({\mathbf {y}})\) and this fact, together with Proposition 7, shows that also G6 holds. G7 does not hold for this approximation.

More generally, if each \(h_i\) is only smooth (i.e., not necessarily convex) and, for every \(h_i\), there exists a differentiable convex approximation \({\tilde{h}}_i\) such that Assumption G holds, the convex function given by

$$\begin{aligned} {\tilde{h}}({\mathbf {x}}; {\mathbf {y}}) \triangleq {\tilde{h}}_{i_{\mathbf {y}}}({\mathbf {x}};{\mathbf {y}}), \quad {\text {for any}} \; i_{\mathbf {y}}\in I^e({\mathbf {y}}), \end{aligned}$$

satisfies Assumption G.

We conclude the analysis of this case by describing a very simple procedure that, in most practical cases, returns an index in \(I^e({\mathbf {y}})\). Given \({\mathbf {y}}\), we can easily compute \(I({\mathbf {y}})\). Now, generate a random nonzero direction d and set \(i_{\mathbf {y}}\triangleq \mathrm{argmin}_{i \in I({\mathbf {y}})} \{ \nabla h_i({\mathbf {y}})^{\scriptscriptstyle T}d\}\). By using standard continuity arguments on Taylor expansions, it is easy to see that if the minimum in the definition of \(i_{\mathbf {y}}\) is reached for a single index, then \(i_{\mathbf {y}}\in I^e({\mathbf {y}})\). If the minimum is not unique, we can simply generate a different direction or use a second-order expansion if the \(h_i\)s are twice continuously differentiable; we omit the details.
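A minimal Python sketch of this index-selection procedure is reported below (our illustration; the functions in h_list and grad_list, the tolerance, and the toy usage are placeholders). It computes the active set \(I({\mathbf {y}})\) up to a small tolerance, draws a random direction d, and returns the active index minimizing \(\nabla h_i({\mathbf {y}})^{\scriptscriptstyle T}d\), redrawing d in the (unlikely) event of a tie.

```python
import numpy as np

def essentially_active_index(h_list, grad_list, y, tol=1e-10, rng=None, max_tries=20):
    """Heuristic of Sect. 5.2, case 3: return argmin_{i in I(y)} grad_h_i(y)^T d for a
    random direction d; when the minimizer is unique, it is an essentially active index."""
    rng = np.random.default_rng() if rng is None else rng
    values = np.array([h(y) for h in h_list])
    active = np.flatnonzero(values <= values.min() + tol)       # active set I(y), up to tol
    if active.size == 1:
        return int(active[0])
    grads = np.array([grad_list[i](y) for i in active])
    for _ in range(max_tries):
        d = rng.standard_normal(y.shape)                        # random nonzero direction
        slopes = grads @ d
        order = np.argsort(slopes)
        if slopes[order[1]] - slopes[order[0]] > tol:           # unique minimizer found
            return int(active[order[0]])
    return int(active[order[0]])                                # fallback: any minimizer

# toy usage: h(x) = min{ ||x||^2, 1 + x_1 }, evaluated at a point where both pieces are active
h_list = [lambda x: x @ x, lambda x: 1.0 + x[0]]
grad_list = [lambda x: 2.0 * x, lambda x: np.eye(len(x))[0]]
phi = (1.0 + np.sqrt(5.0)) / 2.0                                # phi^2 = 1 + phi
print(essentially_active_index(h_list, grad_list, np.array([phi, 0.0]),
                               rng=np.random.default_rng(0)))
```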

  1. 4.

    Suppose that h is the difference of a smooth convex function \(h^{+}\) and a possibly nonsmooth, convex function \(h^{-}\), having thus the following nonsmooth DC structure: \( h({\mathbf {x}})=h^{+}({\mathbf {x}})-h^{-}({\mathbf {x}}). \) By linearizing the concave part \(-h^{-}\) and keeping the convex (smooth) part \(h^{+}\) unchanged, we obtain the following convex upper approximation of h:

    $$\begin{aligned} {\tilde{h}}({\mathbf {x}};{\mathbf {y}})\triangleq h^{+}({\mathbf {x}})-h^{-}({\mathbf {y}})+\varvec{\xi }^{\scriptscriptstyle T}({\mathbf {x}}-{\mathbf {y}}), \end{aligned}$$

    for any \(\varvec{\xi } \in \partial (-h^-({\mathbf {y}}))\). Note that, by [39, Exercise 10.10], we have \(\partial h({\mathbf {y}}) = \nabla h^+({\mathbf {y}}) + \partial (-h^-({\mathbf {y}}))\) and \(\partial _1 {\tilde{h}}({\mathbf {x}}; {\mathbf {y}}) = \nabla h^+({\mathbf {x}}) + \varvec{\xi }\), with \(\varvec{\xi } \in \partial (-h^-({\mathbf {y}}))\). In view of the local boundedness and outer semicontinuity of the subgradient set of the locally Lipschitz function \(-h^-\), G4-G6 are valid. In order for G7 to hold, we must assume \(h^-\) to be continuously differentiable with locally Lipschitz gradient. As a concrete example, suppose that \(h^{-}({\mathbf {x}}) \triangleq \max _i\{ h^-_i({\mathbf {x}})\}\), with \(h^-_i({\mathbf {x}})\) convex and smooth for every \(i =1, \ldots , p\). We can then write \(h({\mathbf {x}})=h^{+}({\mathbf {x}}) + \min _i \{-h^-_i({\mathbf {x}})\}\), and thus we can resort to Proposition 7 to compute \(\partial \min _i \{-h^-_i({\mathbf {x}})\}\).

  2. 5.

    On the other hand, when h is the difference of a convex nonsmooth function \(h^{+}\) and a convex smooth function \(h^{-}\), by linearizing the concave part \(h^{-}\) and keeping the convex (nonsmooth) part \(h^{+}\) unchanged, we obtain

    $$\begin{aligned} {\tilde{h}}({\mathbf {x}};{\mathbf {y}})\triangleq h^{+}({\mathbf {x}})-h^{-}({\mathbf {y}})-\nabla h^-({\mathbf {y}})^{\scriptscriptstyle T}({\mathbf {x}}-{\mathbf {y}}). \end{aligned}$$

    We remark, again by [39, Exercise 10.10], that \(\partial h({\mathbf {y}}) = \partial h^+({\mathbf {y}}) - \nabla h^-({\mathbf {y}})\) and \(\partial _1 {\tilde{h}}({\mathbf {x}}; {\mathbf {y}}) = \partial h^+({\mathbf {x}}) - \nabla h^-({\mathbf {y}})\). Therefore, G4-G6 hold. In this case G7 holds if the gradient of \(h^-\) is locally Lipschitz.

  3. 6.

    Let \(h({\mathbf {x}}) = s(\mathbf {f}({\mathbf {x}}))\), where \(s: \mathbb R^m \rightarrow \mathbb R\) is a finite convex function such that \(s(u_1, \ldots , u_m)\) is nondecreasing in each \(u_j\), and \(\mathbf {f}: \mathbb R^n \rightarrow \mathbb R^m\) is a smooth mapping, with \(\mathbf {f}({\mathbf {x}}) = (f_1({\mathbf {x}}), \ldots , f_m({\mathbf {x}}))^{\scriptscriptstyle T}\) and \(f_i\) concave for every i. Consider the convex approximation

    $$\begin{aligned} {\tilde{h}}({\mathbf {x}}; {\mathbf {y}}) \triangleq s(\mathbf {f}({\mathbf {y}}) + \nabla \mathbf {f}({\mathbf {y}}) ({\mathbf {x}}- {\mathbf {y}})), \end{aligned}$$
    (44)

    where \(\nabla \mathbf {f}({\mathbf {y}})\) is the Jacobian of \(\mathbf {f}\) at \({\mathbf {y}}\). G1 holds by, e.g., [39, Exercise 2.20], and so obviously do G2-G4. Furthermore, since, in view of [39, Theorem 10.6], \(\partial h({\mathbf {y}}) = \nabla \mathbf {f}({\mathbf {y}})^{\scriptscriptstyle T}\partial s(\mathbf {f}({\mathbf {y}}))\) and \(\partial _1 \tilde{h}({\mathbf {x}}; {\mathbf {y}}) = \nabla \mathbf {f}({\mathbf {y}})^{\scriptscriptstyle T}\partial s(\mathbf {f}({\mathbf {y}}) + \nabla \mathbf {f}({\mathbf {y}}) ({\mathbf {x}}- {\mathbf {y}}))\), also G5 and G6 are valid. Finally, G7 is clearly satisfied. This example can be expanded assuming that the \(f_i\) are smooth mappings, not necessarily concave, for which approximations \(\tilde{f}_i({\mathbf {x}};{\mathbf {y}})\) satisfying Assumption G exist and such that each \(\tilde{f}_i(\bullet ,{\mathbf {y}})\) is \(C^1\). In this case, following the same steps above, it is easy to see that the approximation \(\tilde{h}({\mathbf {x}};{\mathbf {y}}) \triangleq s(\tilde{f}_1({\mathbf {x}};{\mathbf {y}}),\ldots , \tilde{f}_m({\mathbf {x}};{\mathbf {y}}))\) satisfies Assumption G. If in addition all \(\tilde{f}_i\) satisfy G7, also \(\tilde{h}({\mathbf {x}};{\mathbf {y}})\) satisfies G7.

  4. 7.

    Consider the widely used soft thresholding function \(h(x) = \min \{x+a, \, \max \{0, x-a\}\}\) with \(a>0\). We can resort to the following approximation

    $$\begin{aligned} \tilde{h}(x; y) \triangleq \left\{ \begin{array}{ll} \max \{0, x-a\}, \; &{}\quad y\ge -a\\ x+a, \; &{}\quad y<-a. \end{array}\right. \end{aligned}$$

    Conditions G4-G6 trivially hold at any point different from \((-a, -a)\). The same happens at this “problematic” point as well: indeed, \(\limsup _{x, y \rightarrow -a} \partial _1 \tilde{h}(x; y) = \{0, 1\} = \partial h(-a)\). G7 also clearly holds. Note that, as in the previous case, we can easily extend this approximation to the composition of the soft thresholding function with a smooth function f for which a suitable approximation \(\tilde{f}\) is available; we leave the details to the reader. A short numerical check of this approximation is given at the end of this subsection.

  5. 8.

    In the case of a nonconvex function h with a Lipschitz continuous (with constant \(L_{\nabla h}\)) gradient, the following convex approximation is such that conditions G1-G7 hold:

    $$\begin{aligned} \tilde{h}({\mathbf {x}};{\mathbf {y}})\triangleq h({\mathbf {y}})+\nabla h({\mathbf {y}})^{\scriptscriptstyle T}({\mathbf {x}}-{\mathbf {y}})+\frac{L_{\nabla h}}{2}\Vert {\mathbf {x}}-{\mathbf {y}}\Vert ^{2}. \end{aligned}$$
  6. 9.

    Whenever \(h: \mathbb R \rightarrow \mathbb R\) is a polynomial, that is,

    $$\begin{aligned} h(x)= a_n x^n + a_{n-1} x^{n-1} + \ldots + a_1 x + a_0, \end{aligned}$$

    relying on its Taylor series expansion around y, the approximation

    $$\begin{aligned} \tilde{h}(x;y)\triangleq & {} h(y) + h'(y) (x - y) + (n - 1) \max _{i=2, \ldots , n} \left\{ \frac{|h^{(i)}(y)|}{i!}\right\} \\&\times \,[(x- y)^2 + (x - y)^{{\bar{n}}}], \end{aligned}$$

    where \({\bar{n}} = n\) if n is even and \({\bar{n}} = n + 1\) otherwise, satisfies G1-G7. The procedure adopted here for unidimensional polynomials can easily be extended to the multidimensional case, but the required computations can become very expensive for dimensions greater than two.
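As anticipated in case 7 above, the following short script (our illustration) implements the soft-thresholding function h and the surrogate \(\tilde{h}\) of case 7 and checks numerically, on a grid, that \(\tilde{h}(\bullet ; y)\) is a convex majorant of h that coincides with h at \(x = y\), for several values of y; both facts can also be verified directly from the definitions, since h is the minimum of the two convex pieces out of which \(\tilde{h}(\bullet ;y)\) picks one.

```python
import numpy as np

a = 0.5

def h(x):
    # soft thresholding: h(x) = min{ x + a, max(0, x - a) }
    return np.minimum(x + a, np.maximum(0.0, x - a))

def h_tilde(x, y):
    # convex surrogate of case 7, Sect. 5.2
    return np.maximum(0.0, x - a) if y >= -a else x + a

xs = np.linspace(-3.0, 3.0, 2001)
for y in [-2.0, -a, 0.0, 1.5]:
    vals = np.array([h_tilde(x, y) for x in xs])
    majorizes = np.all(vals >= h(xs) - 1e-12)                   # h_tilde(.; y) >= h on the grid
    touches = abs(h_tilde(y, y) - h(y)) < 1e-12                 # h_tilde(y; y) = h(y)
    print(y, bool(majorizes), bool(touches))                    # True, True for every y
```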

5.3 Parallel/distributed implementations

AF 1 can be viewed as a double-loop iterative scheme. Given an outer iterate \({\mathbf {x}}^\nu \), to generate \({\mathbf {x}}^{\nu +1}\) one needs to (a) (approximately) solve problem (P1\(_{{\mathbf {x}}^\nu }\)) and (b) choose a suitable \(\gamma ^\nu \). In this subsection, we make some high-level considerations on how to perform operations (a) and (b) above in a parallel/distributed fashion, leading to a parallel/distributed instance of AF 1; a more detailed discussion could only be made with reference to specific problems.

As far as task (b) is concerned, one can choose any of the rules (i)-(iii) in Proposition 3 or (iv) in Proposition 4. Options (i), (ii) and (iv) are more suitable for a fully parallel/distributed implementation than the line-search procedure (iii), which instead requires a high degree of coordination among cores/nodes in a parallel/distributed environment. Also, experience shows that, although theoretically attractive, option (i) could perform poorly, since practical estimates of \(c_{{\widetilde{F}}}/L_{\nabla F} \) are usually very small, leading at each iteration to little progress towards stationarity. On the other hand, option (iv) can be used only in the presence of strongly convex objective functions. Therefore, the diminishing step-size rule (ii) seems to be the most appropriate choice for parallel/distributed environments. This is the main reason why in Sect. 4 special attention was devoted to the study of (ii).
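To fix ideas, the following schematic Python loop (our sketch; it is not the authors' code, and both the inner solver and the specific sequences are placeholders) shows the structure of AF 1 under the diminishing rule (ii): an \(\varepsilon ^\nu \)-inexact best response \({\mathbf {v}}^\nu \) is computed for the current convexified subproblem and the iterate is moved along \({\mathbf {v}}^\nu - {\mathbf {x}}^\nu \) with step \(\gamma ^\nu \). The particular choices below (\(\gamma ^\nu \propto (1+\nu )^{-0.6}\), \(\varepsilon ^\nu = \gamma ^\nu \)) are only one admissible example, with \(\gamma ^\nu \rightarrow 0\), \(\sum _\nu \gamma ^\nu = \infty \) and \(\sum _\nu \gamma ^\nu \varepsilon ^\nu < \infty \), consistently with the summability of \(T^\nu \) used in Proposition 3.

```python
import numpy as np

def af1_diminishing(x0, solve_subproblem, num_iter=200, gamma0=1.0):
    """Schematic outer loop of AF 1 under the diminishing step-size rule (ii).

    solve_subproblem(x, eps) is a placeholder: it must return a point v with
    ||v - xhat(x)|| <= eps, i.e. an eps-inexact solution of the convexified
    subproblem (P1_x), computed by whatever (possibly parallel) inner solver.
    """
    x = np.asarray(x0, dtype=float)
    for nu in range(num_iter):
        gamma = gamma0 / (1.0 + nu) ** 0.6      # gamma -> 0 and sum gamma = +inf
        eps = gamma                             # so that sum gamma * eps < +inf
        v = solve_subproblem(x, eps)            # inexact best response v^nu
        x = x + gamma * (v - x)                 # x^{nu+1} = x^nu + gamma^nu (v^nu - x^nu)
    return x
```

In a parallel/distributed setting, solve_subproblem would itself be implemented along the lines discussed next for task (a).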

Referring to task (a), the design of parallel/distributed solution methods for (P1\(_{{\mathbf {x}}^\nu }\)) depends on the specific structure of the original problem (P1) as well as on the choice of the approximations \({\widetilde{F}}\), \(\widetilde{H}\), and \(\tilde{g}\). The flexibility offered by our approach in the definition of (P1\(_{{\mathbf {x}}^\nu }\)) is of great help in obtaining alternative distributed solution methods. Suppose that the feasible set of (P1) is block-separable, i.e., \(\mathcal K\triangleq \mathcal K_1\times \cdots \times \mathcal K_p\), with each \(\mathcal {K}_i\subseteq \mathbb {R}^{n_i}\), and \(\mathcal X \triangleq \{\mathbf {x}\triangleq (\mathbf {x}_i)_{i=1}^p, \,\mathbf {x}_i\in \mathcal K_i, \, i=1,\ldots ,p: \,\sum _{i=1}^p{g}_j^i(\mathbf {x}_i)\le 0,\,j=1,\ldots , m\}\). Then, problem (P1) becomes

$$\begin{aligned} \begin{array}{cl} \underset{{\mathbf {x}}}{\text{ minimize }} &{} F({\mathbf {x}})+ H(\mathbf {x}) \\ \text{ s.t. } &{} \sum \limits _{i=1}^p g_j^i({\mathbf {x}}_i) \le 0, \quad j = 1, \ldots , m\\ &{} {\mathbf {x}}_i \in {\mathcal K}_i,\quad i = 1, \ldots , p, \end{array} \end{aligned}$$
(45)

The structure of (45) naturally suggests approximations \(\widetilde{F}\), \(\widetilde{H}\), and \(\tilde{g}_j^i\) that are additively block-separable, like the examples given in cases 2 and 3 of Sect. 5.1 and in cases 2 and 8 of Sect. 5.2. This results in convex subproblems of the form

$$\begin{aligned} \begin{array}{cl} \underset{{\mathbf {x}}}{\text{ minimize }} &{} \sum \limits _{i=1}^p{\widetilde{F}}_i({\mathbf {x}}_i; {\mathbf {x}}^\nu )+ \sum \limits _{i=1}^p\widetilde{H}_i({\mathbf {x}}_i; {\mathbf {x}}^\nu )\\ \text{ s.t. } &{} \sum \limits _{i=1}^p \tilde{g}_j^i({\mathbf {x}}_i; {\mathbf {x}}^\nu ) \le 0, \quad j = 1, \ldots , m\\ &{} {\mathbf {x}}_i \in {\mathcal K}_i,\quad i = 1, \ldots , p, \end{array} \end{aligned}$$
(46)

that can be solved in a parallel/distributed way by standard techniques, see, e.g., [6]. It is worth mentioning a special instance of (45), where the feasible set has a Cartesian structure, i.e.,

$$\begin{aligned} \begin{array}{cl} \underset{{\mathbf {x}}}{\text{ minimize }} &{} F({\mathbf {x}})+ H(\mathbf {x}) \\ \text{ s.t. } &{} \mathbf {g}_i({\mathbf {x}}_i) \le \mathbf {0}, \quad i = 1, \ldots , p\\ &{} {\mathbf {x}}_i \in {\mathcal K}_i,\quad i = 1, \ldots , p. \end{array} \end{aligned}$$
(47)

Choosing separable approximations for F and H, of the form \({\widetilde{F}}({\mathbf {x}}; {\mathbf {x}}^\nu )\triangleq \sum _{i=1}^p{\widetilde{F}}_i({\mathbf {x}}_i; {\mathbf {x}}^\nu )\) and \(\widetilde{H}({\mathbf {x}}; {\mathbf {x}}^\nu )\triangleq \sum _{i=1}^p\widetilde{H}_i({\mathbf {x}}_i; {\mathbf {x}}^\nu )\), leads to totally separable convex subproblems that can be solved by tackling p strongly convex problems in parallel. We point out that, even though the constraint functions \(g_j\)s in (P1) are not additively block-separable as in (45), it might still be possible to choose appropriate approximations so that the resulting subproblems (P1\(_{{\mathbf {x}}^\nu }\)) can be solved in a parallel/distributed fashion, see for example cases 2 and 8 in Sect. 5.2. It is also interesting to observe that, when (P1) involves nondifferentiable functions, our approach provides a systematic way to possibly develop parallel and distributed solution methods for classes of problems for which such an option is not currently available.
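As a concrete, deliberately simplified illustration of the fully separable case (47), the Python sketch below (ours) dispatches the p independent block subproblems to a pool of processes. The block subproblem used here (a strongly convex quadratic over a box, solved by projected-gradient steps) is only a stand-in for the actual convexified block problem, constraints \(\tilde{\mathbf {g}}_i\) included; what matters is the parallel dispatch pattern.

```python
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def solve_block(args):
    """Placeholder for the i-th strongly convex block subproblem of (46)/(47).

    For illustration only, the block problem is  min_x 0.5 x^T A x - b^T x
    s.t. lo <= x <= hi,  solved by projected gradient steps; in practice it is
    replaced by the actual convexified block problem (constraints included)."""
    A, b, lo, hi, x = args
    step = 1.0 / np.linalg.norm(A, 2)                   # 1 / Lipschitz constant of the gradient
    for _ in range(200):
        x = np.clip(x - step * (A @ x - b), lo, hi)     # projected gradient step
    return x

def parallel_best_response(blocks):
    # one independent subproblem per block, dispatched to a pool of processes
    with ProcessPoolExecutor() as pool:
        return list(pool.map(solve_block, blocks))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    blocks = []
    for _ in range(4):                                  # p = 4 blocks
        M = rng.standard_normal((3, 3))
        A = M @ M.T + np.eye(3)                         # strongly convex quadratic term
        blocks.append((A, rng.standard_normal(3), -np.ones(3), np.ones(3), np.zeros(3)))
    print(parallel_best_response(blocks))
```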

6 Green communications: energy minimization in wireless systems

In this section we present an application of our algorithmic framework to an important open problem in green communications, resulting in an efficient ad-hoc solution method for this class of problems. Our numerical results show superior performance with respect to recent proposals.

Energy efficiency and Quality-of-Service (QoS) have been two key aspects in the design of modern multiuser communication systems. Energy efficiency is a growing concern due to energy costs and associated environmental issues. It has been reported that ICT infrastructures account for more than 3\(\%\) of the world energy consumption [22].

Moreover, given the rate at which wireless connected devices are increasing, as well as the mass deployment of 5G systems, mobile communications are expected to consume significantly more energy, if no countermeasures are taken. On the other hand, the explosive growth of data and multimedia services in wireless networks (video conferencing, streaming, mobile TV, 3D services) entails strict guarantees on QoS, such as delay and data rates, which conflict with the need for energy saving. These issues have motivated in the past few years a growing interest in energy-aware optimization design of wireless communication networks, under general QoS constraints [24, 29, 53]. This emerging research area is called green communications; see [52] for a recent overview of the state of the art. Although there are substantial differences between the many, often specialized, formulations of problems in this field, all approaches define energy efficiency by means of fractional functions and leverage fractional programming as a key tool for the analysis and design of solution methods.

In this section, we consider a very general and challenging energy-efficient design in green communications, namely: the minimization of the sum-energy in MIMO multiuser interference networks, subject to rate profile constraints. The resulting optimization problem is nonconvex, with nonconvexities occurring both in the objective function and in the constraints. The multiuser nature of the system, along with concurrent interfering communications, prevents one from applying fractional-programming-based methods, leaving the design of efficient solution methods an open problem [52]. Furthermore, the proposed formulation uses complex variables and calls for a solution method that preserves feasibility of the iterates, thus ruling out off-the-shelf solvers.

We remark that, although in this section we focus on a specific setup (MIMO ad-hoc interference networks) and formulation (the energy minimization subject to QoS constraints), the proposed methodology and framework open the way to the solution of a variety of currently open resource allocation problems in (green) communications [52]. Examples include (i) energy efficient designs in relay MIMO interference networks, multi-cell cellular systems, or densely deployed small cells; (ii) max-min multicast (capacity) multi-stream beamforming-like problems; and (iii) joint optimization of precoders, receive filters, and power allocation in the scenarios described in (i)-(ii).

6.1 Problem formulation

In this subsection, we freely use standard facts in communications (see, e.g., [49]). The reader interested mainly in the mathematical formulation of the problem can take (50) as starting point of the subsequent discussion.

Consider a wireless ad-hoc network composed of I transmitter-receiver pairs, equipped with multiple antennas and communicating over MIMO channels; we denote by \(\mathcal I\triangleq \{1,\ldots , I\}\) the set of active pairs (also termed users) in the system. Each transmitter and receiver i is equipped with \(T_{i}\) and \(R_{i}\) antennas, respectively. The optimization variable of each transceiver i is the transmit covariance matrix \(\mathbf {Q}_i\in \mathbb {C}^{T_i\times T_i}\), which is positive semi-definite. Each transmitter i is subject to the power constraint \(\text {tr}(\mathbf {Q}_i)\le P_i\), where \(P_i\) is the maximum average transmit power, and \(\text {tr}(\mathbf {Q}_i)\) denotes the trace of \(\mathbf {Q}_i\). Since no a-priori multiple access scheme is assumed for the users (like, e.g., OFDMA, TDMA, or CDMA), multiuser interference is experienced at each receiver. Under standard information theoretical assumptions, the maximum achievable rate \(r_i\) on each MIMO link i (measured in bit/s/Hz) can be written as follows [49]: given \(\mathbf {Q}\triangleq (\mathbf {Q}_1, \ldots ,\mathbf {Q}_I)\), with each \(\mathbf {Q}_i\succeq \mathbf {0}\),

$$\begin{aligned} r_i(\mathbf {Q})\triangleq \log _2\det \left( \mathbf {I}+\mathbf {H}_{ii}\mathbf {Q}_{i}\mathbf {H}_{ii}^{H}{\mathbf {R}}_{i}\left( \mathbf {Q}_{- i}\right) ^{-1}\right) , \end{aligned}$$
(48)

where \(\mathbf {H}_{ij}\in \mathbb {C}^{R_{i}\times T_{j} }\) is the channel matrix between the transmitter of user j and the receiver of link i; \({\mathbf {R}}_{i}\left( \mathbf {Q}_{- i}\right) \triangleq \sigma _i^2\cdot \mathbf {I}+\sum _{j\ne i}\mathbf {H}_{ij}\mathbf {Q}_{j}\mathbf {H}_{ij}^{H}\) is the covariance matrix of the Gaussian noise \(\sigma _i^2\cdot \mathbf {I}\) (assumed to be proportional to the identity matrix, without loss of generality, otherwise one can always pre-whiten the channel matrices) plus the interference at the receiver i due to the other transmitters; and \(\mathbf {Q}_{-i}\triangleq (\mathbf {Q}_{1}, \ldots , \mathbf {Q}_{i-1},\mathbf {Q}_{i+1},\ldots ,\mathbf {Q}_{I})\) is the tuple of the covariance matrices of all the transmitters except the i-th one. If the transmission of each user i takes \(T_i\) seconds to complete (thus implying \(r_i(\mathbf {Q})>0\)), the energy consumed by user i per bit that can be reliably transmitted is

$$\begin{aligned} E_i(\mathbf {Q})\triangleq \dfrac{T_i \left( \mu _i \text {tr}(\mathbf {Q}_i) + P_{c,i}\right) }{T_i \, W_i \, r_i(\mathbf {Q})}= \dfrac{\left( \mu _i \text {tr}(\mathbf {Q}_i) + P_{c,i}\right) }{W_i \, r_i(\mathbf {Q})}\quad [\text {J/bit}], \end{aligned}$$
(49)

where \(W_i\) is the communication bandwidth of user i, \(1/\mu _i\) is the efficiency of the power amplifier at transmitter i, and \(P_{c,i}\) accounts for the power dissipated in all the other circuit blocks of transmitter and receiver i to operate the terminals (assumed to be constant [2]). The QoS of each communication is measured in terms of a given minimum achievable rate \(\bar{r}_i>0\), leading to constraints of the type \(r_i(\mathbf {Q})\ge \bar{r}_i\). The energy-efficient design of the wireless system then consists in minimizing the sum of the users' energy consumption while guaranteeing the rate QoS:

$$\begin{aligned} \underset{\mathbf {Q}}{\text {minimize}} \quad&E(\mathbf {Q})\triangleq \textstyle \sum \limits _{i=1}^I E_i(\mathbf {Q}) \nonumber \\ \text {s.t.} \qquad&g_i(\mathbf {Q})\triangleq \bar{r}_i-r_i(\mathbf {Q})\le 0, \quad \forall i=1,\ldots ,I,\nonumber \\&\mathbf {Q}_i\succeq \mathbf {0},\quad \text {tr}(\mathbf {Q}_i)\le P_i, \quad \forall i=1,\ldots ,I. \end{aligned}$$
(50)

Note that in the above formulation the \(g_i\)s define the nonconvex constraints, whereas \(\mathbf {Q}_i\succeq \mathbf {0}\) and \(\text {tr}(\mathbf {Q}_i)\le P_i\) are convex constraints and define the set \(\mathcal K\) in (P1). Hereafter, we denote by \(\mathcal Q\) the feasible set of problem (50). We also assume that the \(P_i\)s and \(\bar{r}_i\)s are set so that \(\mathcal Q\) is nonempty.
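To make the formulation concrete, the following NumPy sketch evaluates the rate (48), the per-user energy (49), and whether a candidate point satisfies the QoS and power constraints in (50). All data (channels, powers, noise levels, rate targets) and function names are illustrative placeholders and are not part of the original formulation; this is a sanity-check sketch, not the implementation used in Sect. 6.3.

```python
# Illustrative NumPy sketch: evaluate r_i(Q) in (48), E_i(Q) in (49) and the constraints of (50).
# All problem data below are randomly generated placeholders.
import numpy as np

rng = np.random.default_rng(0)
I, T, R = 3, 2, 2                                  # users, transmit/receive antennas
H = [[(rng.standard_normal((R, T)) + 1j * rng.standard_normal((R, T))) / np.sqrt(2)
      for _ in range(I)] for _ in range(I)]        # H[i][j]: channel from transmitter j to receiver i
sigma2 = np.ones(I)                                # noise variances sigma_i^2
P, W, mu = np.ones(I), np.ones(I), np.ones(I)      # power budgets P_i, bandwidths W_i, amplifier factors mu_i
Pc, rbar = 0.1 * np.ones(I), 0.1 * np.ones(I)      # circuit powers P_{c,i}, rate targets rbar_i

def noise_plus_interference(Q, i):
    """R_i(Q_{-i}) = sigma_i^2 I + sum_{j != i} H_ij Q_j H_ij^H."""
    Ri = sigma2[i] * np.eye(R, dtype=complex)
    for j in range(I):
        if j != i:
            Ri += H[i][j] @ Q[j] @ H[i][j].conj().T
    return Ri

def rate(Q, i):
    """r_i(Q) = log2 det(I + H_ii Q_i H_ii^H R_i(Q_{-i})^{-1}), cf. (48)."""
    M = np.eye(R) + H[i][i] @ Q[i] @ H[i][i].conj().T @ np.linalg.inv(noise_plus_interference(Q, i))
    return float(np.log2(np.linalg.det(M).real))

def energy(Q, i):
    """E_i(Q) = (mu_i tr(Q_i) + P_{c,i}) / (W_i r_i(Q)), cf. (49)."""
    return (mu[i] * np.real(np.trace(Q[i])) + Pc[i]) / (W[i] * rate(Q, i))

# A simple candidate point: scaled identities saturating the power budgets.
Q = [P[i] / T * np.eye(T, dtype=complex) for i in range(I)]
print("rates     :", [round(rate(Q, i), 3) for i in range(I)])
print("sum-energy:", round(sum(energy(Q, i) for i in range(I)), 3))
print("QoS met   :", all(rate(Q, i) >= rbar[i] for i in range(I)))
```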

Remark 2

Problem (50) contains complex variables. One could reformulate the problem in the real domain by using separate variables for the real and imaginary parts of the complex variables, but this would lead to awkward reformulations in which all the desirable structure of the original functions is lost. Following a well-established path in the signal processing community, we work directly with complex variables by means of “Wirtinger derivatives”. The main advantage of this approach is that “Wirtinger calculus” allows us to easily compute, in practice, derivatives of the rate (and energy) functions in (50) directly in the complex domain. It can be shown that all results in this paper extend to the complex domain when Wirtinger derivatives are used in place of classical gradients. In what follows we freely use the Wirtinger calculus and refer the reader to [23, 26, 42] for more information on this topic.
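As a further illustration of Remark 2 (purely a sanity check, not part of the original development), the snippet below, which continues the previous one, verifies by finite differences that a Wirtinger-type conjugate gradient of a log-det function predicts first-order variations along Hermitian perturbations through the real inner product \(\text {Re}\{\text {tr}(\mathbf {A}^H \mathbf {B})\}\); the specific function and data are arbitrary examples.

```python
# Finite-difference check of a Wirtinger conjugate gradient (continues the previous snippet).
# For f(X) = ln det(Rc + A X A^H) with X Hermitian, the conjugate gradient is A^H (Rc + A X A^H)^{-1} A.
A  = H[0][0]
Rc = noise_plus_interference(Q, 0)                          # a fixed positive definite matrix
f  = lambda X: float(np.log(np.linalg.det(Rc + A @ X @ A.conj().T).real))
G  = A.conj().T @ np.linalg.inv(Rc + A @ Q[0] @ A.conj().T) @ A   # conjugate gradient at Q[0]

D = rng.standard_normal((T, T)) + 1j * rng.standard_normal((T, T))
D = (D + D.conj().T) / 2                                    # Hermitian perturbation direction
t = 1e-6
fd = (f(Q[0] + t * D) - f(Q[0])) / t                        # finite-difference directional derivative
fo = float(np.real(np.trace(G.conj().T @ D)))               # prediction via Re tr(G^H D)
print("finite difference:", fd, "  first-order prediction:", fo)
```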

Problem (50) represents an open problem in the wireless communication community, see [52]. Furthermore, the constraints \( \mathbf {Q}_i\succeq \mathbf {0}\) and \( g_i(\mathbf {Q})\le 0\) cannot be violated, since the objective function may not be defined outside the feasible set \(\mathcal Q\), thus calling for a feasible method. Since (50) is a special case of (P1), one can readily resort to our approach to efficiently compute stationary solutions. We propose next a novel, nontrivial ad-hoc approximation of the objective function, opening the way to an efficient INCA-based solution method that exhibits very good numerical performance.

6.2 Ad-hoc approximations

In this section we define suitable approximations \(\tilde{E}\) and \(\tilde{g}_i\) for (50). We remark that, while the proposed approximations satisfy Assumptions F, G and B, the approximation of the objective function is substantially different from those in Sect. 5 and exploits the partial convexity of E in each of the users' variables \(\mathbf {Q}_i\)s.

Given \(\mathbf {Q}^\nu \triangleq (\mathbf {Q}^\nu _i)_{i=1}^I\) such that each \(r_i(\mathbf {Q}^\nu )>0\), observe that: (i) setting \(\mathbf {Q}_{-i}=\mathbf {Q}_{-i}^\nu \), the term \(E_i(\mathbf {Q}_{i},\mathbf {Q}_{-i}^\nu )\) of the sum in \(E(\mathbf {Q})\) is the product of a linear and a convex function of \(\mathbf {Q}_i\), namely \(\mu _i \text {tr}(\mathbf {Q}_i) + P_{c,i}\) and \(1/(W_i \, r_i(\mathbf {Q}_i,\mathbf {Q}_{-i}^\nu ))\); and (ii) the other summands, \(\sum _{j\ne i} E_j(\mathbf {Q}_i,\mathbf {Q}_{-i}^\nu )\), are nonconvex in \(\mathbf {Q}_i\). Exploiting this structure, a convex approximation of E can be obtained for each user i by convexifying \(\left( \mu _i \text {tr}(\mathbf {Q}_i) + P_{c,i}\right) /(W_i \, r_i(\mathbf {Q}_i,\mathbf {Q}_{-i}^\nu ))\) while retaining the partial convexity in \(\mathbf {Q}_i\), and by linearizing the nonconvex part \(\sum _{j\ne i} E_j(\mathbf {Q}_i,\mathbf {Q}_{-i}^\nu )\). More formally, let us define

$$\begin{aligned} \begin{array}{ll} \tilde{E}_i(\mathbf {Q}_i;\mathbf {Q}^\nu ) \triangleq &{}\dfrac{\mu _i \text {tr}(\mathbf {Q}_i) + P_{c,i}}{ W_i\, r_i(\mathbf {Q}^\nu )} + \dfrac{\mu _i \text {tr}(\mathbf {Q}_i^\nu ) + P_{c,i}}{W_i\, r_i(\mathbf {Q}_i,\mathbf {Q}^\nu _{-i})}\\ &{}+ {\displaystyle {\sum _{j\ne i}}} \left\langle \nabla _{\mathbf {Q}_{i}^{*}}E_{j}(\mathbf {Q}^{\nu }),\mathbf {Q}_{i}-\mathbf {Q}_{i}^{\nu }\right\rangle + \tau _i \Vert \mathbf {Q}_{i}-\mathbf {Q}_{i}^{\nu }\Vert ^2_F, \end{array} \end{aligned}$$
(51)

where the first two terms on the right-hand side are the convexification of \(\left( \mu _i \text {tr}(\mathbf {Q}_i) + P_{c,i}\right) /(W_i \, r_i(\mathbf {Q}_i,\mathbf {Q}_{-i}^\nu ))\); the third term comes from the linearization of \(\sum _{j\ne i} E_j(\mathbf {Q}_i,\mathbf {Q}_{-i}^\nu )\), with \( \left\langle \mathbf {A}, \mathbf {B}\right\rangle \triangleq \text {Re}\{\text {tr}( \mathbf {A}^H \mathbf {B})\}\) and \(\nabla _{\mathbf {Q}_{i}^{*}}E_{j}(\mathbf {Q}^\nu )\) denoting the conjugate gradient of \(E_j\) w.r.t. \(\mathbf {Q}_i\) evaluated at \(\mathbf {Q}^\nu \) (see [42]), given by

$$\begin{aligned} \begin{array}{ll} \dfrac{\left( \mu _{j}\text {tr}(\mathbf {Q}_{j})+P_{c,j}\right) }{W_{j}\,r_{j}^{2}(\mathbf {Q}^{\nu })}\,\,\mathbf {H}_{ji}^{H}\,\left( \mathbf {R}_{j}(\mathbf {Q}_{-j}^{\nu })^{-1}-\left( \mathbf {R}_{j}(\mathbf {Q}_{-j}^{\nu })+\mathbf {H}_{jj}\mathbf {Q}_{j}^{\nu }\mathbf {H}_{jj}^{H}\right) ^{-1}\right) \mathbf {H}_{ji};\end{array} \end{aligned}$$

and the fourth term is added to make \(\tilde{E}_i(\mathbf {Q}_i;\mathbf {Q}^\nu )\) uniformly strongly convex in \(\mathbf {Q}_i\). Thus, the sum-energy surrogate function \(\tilde{E}(\mathbf {Q};\mathbf {Q}^\nu )\) is defined as:

$$\begin{aligned} \tilde{E}(\mathbf {Q};\mathbf {Q}^\nu )={\displaystyle {\sum _{i=1}^I}}\tilde{E}_i(\mathbf {Q}_i;\mathbf {Q}^\nu ). \end{aligned}$$
(52)
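The construction (51)-(52) can be transcribed almost verbatim in code. The NumPy sketch below (continuing the previous snippets) evaluates the conjugate gradient displayed after (51) and the surrogate \(\tilde{E}_i(\mathbf {Q}_i;\mathbf {Q}^\nu )\) at a trial point; the trial point and the value of \(\tau _i\) are arbitrary choices for illustration only.

```python
# Surrogate \tilde E_i(Q_i; Q^nu) of (51) (continues the previous snippets).
def grad_conj_E(Qnu, j, i):
    """Conjugate gradient of E_j w.r.t. Q_i (i != j) at Q^nu, as displayed after (51)."""
    Rj   = noise_plus_interference(Qnu, j)
    Rjp  = Rj + H[j][j] @ Qnu[j] @ H[j][j].conj().T
    coef = (mu[j] * np.real(np.trace(Qnu[j])) + Pc[j]) / (W[j] * rate(Qnu, j) ** 2)
    return coef * H[j][i].conj().T @ (np.linalg.inv(Rj) - np.linalg.inv(Rjp)) @ H[j][i]

def E_tilde_i(Qi, Qnu, i, tau=1e-2):
    """\tilde E_i(Q_i; Q^nu) in (51): partial convexity kept, nonconvex part linearized."""
    Qmix = [Qi if k == i else Qnu[k] for k in range(I)]           # (Q_i, Q_{-i}^nu)
    val  = (mu[i] * np.real(np.trace(Qi)) + Pc[i]) / (W[i] * rate(Qnu, i))
    val += (mu[i] * np.real(np.trace(Qnu[i])) + Pc[i]) / (W[i] * rate(Qmix, i))
    for j in range(I):
        if j != i:
            G = grad_conj_E(Qnu, j, i)
            val += np.real(np.trace(G.conj().T @ (Qi - Qnu[i])))  # <A,B> = Re tr(A^H B)
    return val + tau * np.linalg.norm(Qi - Qnu[i], 'fro') ** 2    # proximal (strong convexity) term

Qnu = Q                                       # current feasible point
print("E_tilde_1 at a trial point:", E_tilde_i(0.5 * Qnu[0], Qnu, 0))
```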

In order to build an upper convex approximation of each (nonconvex) \(g_i\), one can readily exploit the DC structure of the rate function \(r_i(\mathbf {Q})\), that is,

$$\begin{aligned}&r_{i}(\mathbf {Q})=r_{i}^{+}(\mathbf {Q})-r_{i}^{-}(\mathbf {Q}_{-i}),\nonumber \\&r_{i}^{+}(\mathbf {Q}) \triangleq \log _2\det \left( {\mathbf {R}}_{i}\left( \mathbf {Q}_{- i}\right) +\mathbf {H}_{ii}\mathbf {Q}_{i}\mathbf {H}_{ii}^{H}\right) \;\text {and}\; r_{i}^{-}(\mathbf {Q}_{-i})\triangleq \log _2\det ({\mathbf {R}}_{i}\left( \mathbf {Q}_{-i}\right) ). \end{aligned}$$
(53)

Following Example 4 in Sec. 5.2 and adapting it to the complex case, a tight lower bound of \(r_{i}(\mathbf {Q})\) [and thus upper bound of \(g_i(\mathbf {Q})\)] can be obtained by retaining the concave part in (53) and linearizing the convex function \(-r_{i}^{-}\), which leads to the following rate approximation function:

$$\begin{aligned}&{r}_{i}(\mathbf {Q})\ge \tilde{r}_{i}(\mathbf {Q};\mathbf {Q}^{\nu })\triangleq {r_{i}^{+}}(\mathbf {Q})-\tilde{r}_{i}^{-}(\mathbf {Q}_{-i};\mathbf {Q}^{\nu }),\end{aligned}$$
(54)
$$\begin{aligned}&\tilde{r}_{i}^{-}(\mathbf {Q}_{-i};\mathbf {Q}^{\nu })\!\triangleq \!{r_{i}^{-}}(\mathbf {Q}_{-i}^{\nu })+ {\textstyle \sum _{j\ne i}}\left<\nabla _{\mathbf {Q}_{j}^{*}}r_{i}^{-}(\mathbf {Q}_{-i}^{\nu }),\mathbf {Q}_{j}-\mathbf {Q}_{j}^{\nu }\right>, \end{aligned}$$
(55)

where \(\nabla _{\mathbf {Q}_{j}^{*}}r_{i}^{-}(\mathbf {Q}_{-i}^{\nu })=\mathbf {H}_{ij}^{H}{\mathbf {R}}_{i} (\mathbf {Q}_{-i}^{\nu })^{-1} \mathbf {H}_{ij}\) is the conjugate gradient of \(r_i^{-}\) w.r.t. \(\mathbf {Q}_j\). Using (52) and (54), the convex approximation of the nonconvex problem (50) reads: given the feasible point \(\mathbf {Q}^{\nu }\), let

$$\begin{aligned} \begin{array}{lll} \underset{\mathbf {Q}}{\text {minimize}} \quad &{} \tilde{E}(\mathbf {Q};\mathbf {Q}^\nu ) \\ \,\,\quad \text {s.t.} &{} \tilde{g}_i(\mathbf {Q};\mathbf {Q}^\nu )\triangleq \bar{r}_i-\tilde{r}_i(\mathbf {Q};\mathbf {Q}^\nu )\le 0, \quad \forall i=1,\ldots ,I,\\ &{} \mathbf {Q}_i\succeq \mathbf {0},\quad \text {tr}(\mathbf {Q}_i)\le P_i, \quad \forall i=1,\ldots ,I. \end{array} \end{aligned}$$
(56)
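Before analyzing (56), it is instructive to verify numerically the key inequality behind the constraints \(\tilde{g}_i\), namely the lower bound (54). The sketch below (continuing the previous snippets) does so; note that the \(1/\ln 2\) factor is included explicitly in the gradient of \(r_i^{-}\) to account for the base-2 logarithm in (48), a constant that the compact gradient notation above leaves implicit.

```python
# Numerical check of the DC bound (53)-(55): r_i(Q) >= \tilde r_i(Q; Q^nu), with equality at Q = Q^nu.
def r_plus(Qlist, i):
    Ri = noise_plus_interference(Qlist, i)
    return float(np.log2(np.linalg.det(Ri + H[i][i] @ Qlist[i] @ H[i][i].conj().T).real))

def r_minus(Qlist, i):
    return float(np.log2(np.linalg.det(noise_plus_interference(Qlist, i)).real))

def r_tilde(Qlist, Qnu, i):
    """\tilde r_i(Q; Q^nu) in (54)-(55): keep r_i^+, linearize r_i^- around Q^nu."""
    Ri_inv = np.linalg.inv(noise_plus_interference(Qnu, i))
    lin = r_minus(Qnu, i)
    for j in range(I):
        if j != i:
            G = H[i][j].conj().T @ Ri_inv @ H[i][j] / np.log(2)   # gradient of r_i^- w.r.t. Q_j (base-2 log)
            lin += np.real(np.trace(G.conj().T @ (Qlist[j] - Qnu[j])))
    return r_plus(Qlist, i) - lin

Qtest = [0.3 * P[k] / T * np.eye(T, dtype=complex) for k in range(I)]   # any point with Q_k >= 0, tr(Q_k) <= P_k
for i in range(I):
    assert rate(Qtest, i) >= r_tilde(Qtest, Qnu, i) - 1e-9              # lower bound
    assert abs(rate(Qnu, i) - r_tilde(Qnu, Qnu, i)) <= 1e-9             # tightness at Q^nu
print("DC lower bound (54) verified at a test point.")
```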

The following proposition, whose proof is omitted because of space limitations, summarizes the main properties of problems (50) and (56).

Proposition 8

Given problems (50) and (56), the following hold:

(i) any feasible point of (50) satisfies the MFCQ;

(ii) the convex surrogates \(\tilde{E}\) and \(\tilde{g}_i\)s in (56) satisfy Assumptions F, G and B.

Hence, (56) is an instance of problem (P1\(_\mathbf {y}\)) and AF 1 can be readily applied.
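To illustrate how a single iteration of AF 1 might look in practice, the following CVXPY transcription of subproblem (56) is provided as a sketch only (it continues the previous snippets). It assumes that CVXPY's Hermitian variables and complex log-det atom cover all the expressions involved; minor DCP adjustments may be needed, and the actual implementation reported in Sect. 6.3 uses cvx in MATLAB.

```python
# Schematic CVXPY model of the strongly convex subproblem (56); a sketch, not the authors' code.
import cvxpy as cp

def solve_subproblem(Qnu, tau=1e-2):
    Qv  = [cp.Variable((T, T), hermitian=True) for _ in range(I)]
    ln2 = np.log(2)
    herm = lambda X: (X + X.H) / 2                       # explicit Hermitian-ization for DCP analysis
    obj, cons = 0, []
    for i in range(I):
        Rnu_i = noise_plus_interference(Qnu, i)
        # convexification of (mu_i tr(Q_i)+P_{c,i})/(W_i r_i(Q_i, Q_{-i}^nu)), cf. (51)
        r_part = (cp.log_det(herm(H[i][i] @ Qv[i] @ H[i][i].conj().T + Rnu_i))
                  - np.log(np.linalg.det(Rnu_i).real)) / ln2
        obj += (mu[i] * cp.real(cp.trace(Qv[i])) + Pc[i]) / (W[i] * rate(Qnu, i))
        obj += (mu[i] * np.real(np.trace(Qnu[i])) + Pc[i]) / W[i] * cp.inv_pos(r_part)
        for j in range(I):                               # linearization of sum_{j != i} E_j
            if j != i:
                G = grad_conj_E(Qnu, j, i)
                obj += cp.real(cp.trace(G.conj().T @ (Qv[i] - Qnu[i])))
        obj += tau * cp.sum_squares(Qv[i] - Qnu[i])      # proximal term
        # inner rate constraint \tilde r_i(Q; Q^nu) >= rbar_i, cf. (54)-(56)
        R_aff  = sum(H[i][j] @ Qv[j] @ H[i][j].conj().T for j in range(I) if j != i) + sigma2[i] * np.eye(R)
        rp     = cp.log_det(herm(R_aff + H[i][i] @ Qv[i] @ H[i][i].conj().T)) / ln2
        Ri_inv = np.linalg.inv(Rnu_i)
        lin    = r_minus(Qnu, i)
        for j in range(I):
            if j != i:
                Gr  = H[i][j].conj().T @ Ri_inv @ H[i][j] / ln2   # 1/ln 2 as in the DC-bound check above
                lin = lin + cp.real(cp.trace(Gr.conj().T @ (Qv[j] - Qnu[j])))
        cons += [rp - lin >= rbar[i], Qv[i] >> 0, cp.real(cp.trace(Qv[i])) <= P[i]]
    cp.Problem(cp.Minimize(obj), cons).solve()
    return [Qv[i].value for i in range(I)]

# Qhat = solve_subproblem(Q)   # uncomment to solve one subproblem (needs an SDP/exponential-cone solver)
```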

6.3 Numerical results

In this section we present numerical results to assess the effectiveness of our solution method, which we term Partial Linearization INCA (PL-INCA). We implemented PL-INCA, based on AF 1 and the convex subproblems (56), in MATLAB. We used the diminishing step-size rule \(\gamma ^{\nu +1}=\gamma ^{\nu } (1-\alpha \gamma ^{\nu })\), for all \(\nu \ge 0\), with \(\gamma ^0=1\) and \(\alpha =10^{-3}\). The proximal coefficients \(\tau _i\) in (51) are set equal to \(10^{-2}\). As mentioned earlier, standard methods cannot be used to solve problem (50), since feasibility must be maintained throughout the iterations. This fact drastically restricts the possible choices for a comparison; we settled on the recent approach proposed in [1, 7], which we adapted to deal with complex variables. The method in [7] is based on the solution of a sequence of convex subproblems wherein the nonconvex objective function and constraints are replaced by global convex upper approximations. While the nonconvex constraints \(g_i(\mathbf {Q})\le 0\) can be approximated as in our method [see \(\tilde{g}_i\) in (56)], obtaining a convex upper bound of E is less immediate. The only available option seems to be to set

$$\begin{aligned} \tilde{E}^{\text {up}}(\mathbf {Q};\mathbf {Q}^\nu )\triangleq {\displaystyle {\sum _{i=1}^I}}{\displaystyle {\sum _{j=1}^I}} \left\langle \nabla _{\mathbf {Q}_j^{*}}E_{i}(\mathbf {Q}^{\nu }),\mathbf {Q}_{j}-\mathbf {Q}_{j}^{\nu }\right\rangle + {\displaystyle {\sum _{i=1}^I}}L_i \Vert \mathbf {Q}_{i}-\mathbf {Q}_{i}^{\nu }\Vert ^2_F, \end{aligned}$$
(57)

where \(L_i\) is the Lipschitz constant of \(\nabla _{\mathbf {Q}^{*}}E_{i}(\bullet )\) on \(\mathcal Q\). An exact expression of \(L_i\) is not easy to compute. The following is an upper bound of \(L_i\) over \(\mathcal Q\) (the proof is quite tedious and is omitted):

$$\begin{aligned} L_i^{\text {up}}=\left( \sum _{i=1}^I \frac{\mu _i\,P_i + P_{c,i}}{W_i \bar{r}_i}\right) \cdot \lambda _{\max }\left( \displaystyle {\sum _{j=1}^I}\left( \dfrac{4}{\bar{r}_j^2}+\dfrac{1}{\bar{r}_j}\right) \mathbf {H}_{ji}^H\mathbf {H}_{ji} \otimes \mathbf {H}_{ji}^H\mathbf {H}_{ji} \right) , \end{aligned}$$
(58)

where \(\lambda _{\max }(\mathbf {A})\) denotes the maximum eigenvalue of the matrix \(\mathbf {A}\), and \(\otimes \) denotes the Kronecker product. We refer to the scheme in [7], tailored to the complex problem (50) and based on (57)-(58), as Linearized Upper INCA (L-Up-INCA). In L-Up-INCA, given the current iterate \(\mathbf {Q}^\nu \), the new iterate is the solution of the subproblem

$$\begin{aligned} \begin{array}{lll} \underset{\mathbf {Q}}{\text {minimize}} \quad &{} \tilde{E}^\text {up}(\mathbf {Q};\mathbf {Q}^\nu ) \\ \,\,\quad \text {s.t.} &{} \tilde{g}_i(\mathbf {Q};\mathbf {Q}^\nu )\le 0, \quad \forall i=1,\ldots ,I,\\ &{} \mathbf {Q}_i\succeq \mathbf {0},\quad \text {tr}(\mathbf {Q}_i)\le P_i, \quad \forall i=1,\ldots ,I, \end{array} \end{aligned}$$
(59)

where \(\tilde{g}_i(\mathbf {Q};\mathbf {Q}^\nu )\) is defined as in (56).
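For completeness, the bound (58) is straightforward to evaluate numerically; the short sketch below (same placeholder data as in the previous snippets) computes \(L_i^{\text {up}}\) via the Kronecker product and a maximum-eigenvalue computation.

```python
# Evaluation of the Lipschitz-type upper bound L_i^up in (58) (continues the previous snippets).
def L_up(i):
    c = sum((mu[k] * P[k] + Pc[k]) / (W[k] * rbar[k]) for k in range(I))
    M = sum((4.0 / rbar[j] ** 2 + 1.0 / rbar[j])
            * np.kron(H[j][i].conj().T @ H[j][i], H[j][i].conj().T @ H[j][i])
            for j in range(I))
    return c * float(np.linalg.eigvalsh(M)[-1])          # largest eigenvalue of the Hermitian matrix M

print("Lipschitz upper bounds L_i^up:", [round(L_up(i), 3) for i in range(I)])
```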

Both methods solve a convex optimization problem at each iteration: PL-INCA solves (56) and then takes a step of size \(\gamma ^\nu \) along the direction given by the solution of the subproblem, while L-Up-INCA solves (59) and takes its solution as the new iterate (or, equivalently, takes a step-size of 1). In our implementations, the two algorithms thus differ in (i) the choice of the objective function in the subproblems and (ii) the step-size. While setting \( L_i= L_i^{\text {up}}\) guarantees theoretical convergence of L-Up-INCA, \(L_i^{\text {up}}\) might be a loose bound on the true Lipschitz constant, a fact that could hamper the progress of L-Up-INCA; and indeed L-Up-INCA with \( L_i= L_i^{\text {up}}\) performed very poorly in all problem instances we tested, as discussed shortly. Therefore, we also considered other instances of L-Up-INCA, obtained by progressively reducing the value of \(L_i\). Note that if \(L_i\) falls below the Lipschitz constant of \(\nabla _{\mathbf {Q}^{*}}E_{i}(\bullet )\), theoretical convergence of L-Up-INCA is no longer guaranteed.

We simulated a network composed of \(I= 10\) randomly deployed users (pairs), whose transceivers are equipped with \(T_i=R_i=2\) antennas each. The MIMO channels are generated according to the Rayleigh channel model [49]; the transmit powers \(P_i\)s and the rate thresholds \(\bar{r}_i\)s are chosen so that the feasible set of (50) is nonempty. Both algorithms, PL-INCA and L-Up-INCA, are initialized from the same randomly chosen feasible point. In this setting we generated 10 independent channel realizations and initial feasible points \(\mathbf {Q}^0\), and ran both algorithms on the resulting 10 problems. We used cvx in MATLAB to solve the strongly convex subproblems at each iteration. Since the original problem is nonconvex, we measured the progress of the algorithms toward convergence using the “stationarity measure” \(\Vert \hat{\mathbf {Q}}(\mathbf {Q}^\nu )-\mathbf {Q}^\nu \Vert _{\infty }\), where \(\hat{\mathbf {Q}}(\mathbf {Q}^\nu )\) is the unique solution of subproblem (56) and \(\Vert \bullet \Vert _{\infty }\) denotes the infinity norm. Note that \(\Vert \hat{\mathbf {Q}} (\mathbf {Q}^\nu )-\mathbf {Q}^\nu \Vert _{\infty }\) is a continuous function of \(\mathbf {Q}^\nu \), which is zero if and only if \(\mathbf {Q}^\nu \) is a stationary point.
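For the reader's convenience, the overall PL-INCA loop just described can be sketched as follows (continuing the previous snippets): at each iteration subproblem (56) is solved, a diminishing step is taken along the resulting direction, and progress is monitored through the stationarity measure. The routine solve_subproblem is the CVXPY sketch given in Sect. 6.2; the step-size and proximal parameters follow the values reported above, while the number of iterations is a placeholder.

```python
# Sketch of the PL-INCA loop: subproblem solution, diminishing step-size, stationarity measure.
def pl_inca(Q0, num_iter=50, gamma0=1.0, alpha=1e-3, tau=1e-2):
    Qcur, gamma = [Qi.copy() for Qi in Q0], gamma0
    for nu in range(num_iter):
        Qhat = solve_subproblem(Qcur, tau=tau)                               # best response \hat Q(Q^nu)
        stat = max(np.max(np.abs(Qhat[i] - Qcur[i])) for i in range(I))      # ||Qhat(Q^nu) - Q^nu||_inf
        Qcur = [Qcur[i] + gamma * (Qhat[i] - Qcur[i]) for i in range(I)]     # feasible: convex combination
        gamma = gamma * (1.0 - alpha * gamma)                                # gamma^{nu+1} = gamma^nu (1 - alpha gamma^nu)
        print(f"iter {nu:3d}   stationarity {stat:.3e}   sum-energy {sum(energy(Qcur, i) for i in range(I)):.4f}")
    return Qcur

# Qfinal = pl_inca(Q, num_iter=10)   # uncomment to run (each iteration solves one instance of (56))
```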

Fig. 1 PL-INCA and two instances of L-Up-INCA (the x-axis is in log scale). a Normalized average stationarity measure \(\Vert \hat{\mathbf {Q}}(\mathbf {Q}^\nu )-\mathbf {Q}^\nu \Vert _{\infty }/S^{\max }\) versus iterations \(\nu \). b Normalized sum-energy \(E(\mathbf {Q}^\nu )/E^{\max }\) versus iterations \(\nu \)

The results of our experiments are summarized in Fig. 1. In Fig. 1a, b we plot the average (over the 10 realizations) of the normalized stationarity measure \(\Vert \hat{\mathbf {Q}}(\mathbf {Q}^\nu )-\mathbf {Q}^\nu \Vert _{\infty }/S^{\max }\) and of the normalized sum-energy \(E(\mathbf {Q}^\nu )/E^{\max }\), respectively, achieved by PL-INCA and L-Up-INCA versus the number of iterations (note the log scale on the x-axis), where \(S^{\max }\) and \(E^{\max }\) are the maxima of the stationarity measure and of E over all the randomly chosen initial points, respectively. The normalization is performed only for the purpose of a simpler graphical representation. We did not terminate the algorithms but let them run for 5000 iterations. The figures clearly show that the performance of PL-INCA is vastly superior to that of L-Up-INCA: the stationarity measure (as well as the objective function) associated with L-Up-INCA does not decrease significantly over 5000 iterations, remaining thus much higher than that of PL-INCA. Note also the behavior of L-Up-INCA when \(L_i\) is set to \(L_i^{\min }\), the largest value we empirically found to make the algorithm diverge on all 10 instances: for this value of \(L_i\), the stationarity measure and the objective function progressively increase, see Fig. 1. We also tested L-Up-INCA for intermediate values of \(L_i\), between \(L_i^{\text {up}}\) and \(L_i^{\min }\); in all cases where clear divergence did not occur, the resulting behavior is still very poor and is described by lines very close to those of L-Up-INCA (\(L_i = L_i^{\text {up}}\)) in Fig. 1a, b, with a slightly better decreasing slope. While we report only averages, we remark that the average curves are representative of the behavior of the single realizations; in no case did we observe large deviations from the average behavior.

With regard to CPU times, the main computational burden per iteration is given, for both algorithms, by the solution of the respective convex subproblems: (56) for PL-INCA and (59) for L-Up-INCA. Although (59) is a marginally simpler problem than (56), we found that, on average, the time needed by cvx to solve an instance of (56) is less than twice the time needed to solve one of (59). In practice, this means that the huge gap between PL-INCA and L-Up-INCA in terms of iterations translates into a similar superiority in terms of computational time.