Abstract
Composite minimization involves a collection of functions which are aggregated in a nonsmooth manner. It covers, as a particular case, smooth approximation of minimax games, minimization of max-type functions, and simple composite minimization problems, where the objective function has a nonsmooth component. We design a higher-order majorization algorithmic framework for fully composite problems (possibly nonconvex). Our framework replaces each component with a higher-order surrogate such that the corresponding error function has a higher-order Lipschitz continuous derivative. We present convergence guarantees for our method for composite optimization problems with (non)convex and (non)smooth objective function. In particular, we prove stationary point convergence guarantees for general nonconvex (possibly nonsmooth) problems and under Kurdyka–Lojasiewicz (KL) property of the objective function we derive improved rates depending on the KL parameter. For convex (possibly nonsmooth) problems we also provide sublinear convergence rates.
1 Introduction
In this work, we consider the class of general composite optimization problems:
$$\begin{aligned} \min _{x\in {\mathbb {E}}}\; f(x) := g\big (F(x)\big ) + h(x), \end{aligned}$$(1)
where \(h:{\mathbb {E}}\rightarrow {\bar{{\mathbb {R}}}}\), with \({\bar{{\mathbb {R}}}} = {\mathbb {R}}\cup \{+ \infty \}\), and \(F: {\mathbb {E}} \rightarrow \bar{{\mathbb {R}}}^{m}\) are general proper lower semicontinuous functions on their closed domains and \(g:{\mathbb {R}}^{m} \rightarrow {\mathbb {R}}\) is a proper closed convex function defined everywhere. Here, \({\mathbb {E}}\) is a finite-dimensional real vector space and \(F=(F_1, \cdots , F_m)\). Note that \(\text {dom}\;f =g(\text {dom}F)\cap \text {dom}\;h \). This formulation unifies many particular cases, such as smooth approximation of minimax games, max-type minimization problems or exact penalty formulations of nonlinear programs, while recent instances include robust phase retrieval and matrix factorization problems [5, 8, 9, 15]. Note that the setting where g is the identity function was intensively investigated in large-scale optimization [1, 16, 19, 26]. In this paper, we call this formulation simple composite optimization. When g is restricted to be a Lipschitz convex function and F smooth, a natural approach to this problem consists in linearizing the smooth part, leaving the nonsmooth term unchanged and adding an appropriate quadratic regularization term. This is the approach considered e.g., in [8, 27], leading to a proximal Gauss-Newton method, i.e. given the current point \({\bar{x}}\) and a regularization parameter \(t>0\), solve at each iteration the subproblem:
$$\begin{aligned} \min _{x\in {\mathbb {E}}}\; g\big (F({\bar{x}}) + \nabla F({\bar{x}})(x - {\bar{x}})\big ) + h(x) + \frac{1}{2t}\Vert x-{\bar{x}}\Vert ^{2}. \end{aligned}$$
For such a method it was proved in [8] that \(\text {dist}(0,\partial f(x))\) converges to 0 at a sublinear rate of order \({\mathcal {O}}(1/k^{\frac{1}{2}})\), where k is the iteration counter, while convergence of the iterates under the KL inequality was recently shown in [27]. Note that the case where g is (Lipschitz) convex, F is smooth and \(h = 0\) has also been analysed in [10, 29].
In [5] a flexible method is proposed, where the smooth part F is replaced by its quadratic approximation, i.e., given \({\bar{x}}\), solve:
where \(L=(L_1,\cdots ,L_m)^T\), with \(L_i\) being the Lipschitz constant of the gradient of \(F_i\), for \(i=1:m\). Assuming F, g and h are convex functions, and g additionally is componentwise nondecreasing and Lipschitz, [5] derives sublinear convergence rate of order \({\mathcal {O}}(1/k)\) in function values. Finally, in the recent paper [9], a general composite minimization problem of the form:
is considered, where \(F=(F_{1},\cdots ,F_{m})\), with \(F_{i}\)’s being convex and p-smooth functions on \({\mathbb {E}}\) and having the p-derivative Lipschitz, with \(p\ge 1\) an integer constant. Under these settings, [9] replaces the smooth part by its Taylor approximation of order p plus a proper regularization term, i.e., given \({\bar{x}}\), solve the following subproblem:
where \(L=(L_1,\cdots ,L_m)^T\), with \(L_i\) being related to the Lipschitz constant of the p-derivative of \(F_i\) and \(T^{F}_{p}(x;\!\!{\bar{x}}) \) is the p-Taylor approximation of F around the current point \({\bar{x}}\). For such a higher-order method, in the convex settings and assuming that g has full domain in the second argument, [9] derives a sublinear convergence rate in function values of order \({\mathcal {O}}\left( 1/k^p\right) \).
Note that the optimization scheme in [9] belongs to the class of higher-order methods. Such methods are popular due to their performance in dealing with ill conditioning and their fast rates of convergence, see e.g., [2, 7, 13, 20, 24, 25]. For example, first-order methods achieve convergence rates of order \({\mathcal {O}}(1/k)\) for smooth convex optimization [16, 26], while higher-order methods of order \(p >1\) have convergence rates \({\mathcal {O}}(1/k^p)\) for minimizing p-smooth convex objective functions [13, 20, 24, 25]. Accelerated variants of higher-order methods were also developed e.g., in [11, 24, 25]. Recently, [20] provided a unified framework for the convergence analysis of higher-order optimization algorithms for solving simple composite optimization problems using the majorization-minimization approach. This is a technique that approximates an objective function by a majorizing function, which can be minimized in closed form or whose minimizer can be computed fast, yielding a solution or some acceptable improvement. Note that papers such as [5, 16] use a first-order majorization-minimization approach to build a model (i.e., use only gradient information), while [20] uses higher-order derivatives to build such a model. However, global complexity bounds for higher-order methods based on the majorization-minimization principle for solving the composite problem (1) have not yet been given. This is the goal of this work.
Contributions In this paper, we provide an algorithmic framework based on the notion of higher-order upper bound approximation of the composite problem (1). Note that in this optimization formulation we consider general properties for our objects, e.g., the functions F and h can be smooth or nonsmooth, convex or nonconvex, and g is convex, nondecreasing and has full domain. Our framework consists of replacing F by a higher-order surrogate, leading to a General Composite Higher-Order minimization algorithm, which we call GCHO. This majorization-minimization approach is relevant as it yields an array of algorithms, each of which is associated with the specific properties of F and the corresponding surrogate, and it provides a unified convergence analysis. Note that most of our variants of GCHO for \(p> 1\) were not explicitly considered in the literature before (at least in the nonconvex settings). Moreover, our new first-, second-, and third-order methods can be implemented in practice using existing efficient techniques from e.g., [25, 28].
We derive convergence guarantees for the GCHO algorithm when the surrogate upper bounds F up to an error that is \(p \ge 1\) times differentiable and has a Lipschitz continuous pth derivative; we call such upper bounds composite higher-order surrogate functions. More precisely, on composite (possibly nonsmooth) nonconvex problems we prove for GCHO, with the help of a new auxiliary sequence, convergence rates \({\mathcal {O}}\left( \frac{1}{k^{p/(p+1)}}\right) \) in terms of first-order optimality conditions. We also characterize the convergence rate of the GCHO algorithm locally, in terms of function values, under the Kurdyka–Lojasiewicz (KL) property. Our results show that the convergence behavior of GCHO ranges from sublinear to linear depending on the parameter of the underlying KL geometry. Moreover, on general (possibly nonsmooth) composite convex problems (i.e., f is a convex function) our algorithm achieves a global sublinear convergence rate of order \({\mathcal {O}}\left( 1/k^{p}\right) \) in function values. We summarize our convergence results in Table 1. Finally, for \(p=2\), \(g(\cdot ) = \max (\cdot )\) and \(h = 0\), we show that the subproblem, even in the nonconvex case, is equivalent to minimizing an explicitly written convex function over a convex set, which can be solved using efficient convex optimization tools.
Besides providing a general framework for the design and analysis of composite higher-order methods, in special cases, where complexity bounds are known for some particular algorithms, our convergence results recover the existing bounds. For example, from our convergence analysis one can easily recover the convergence bounds of higher-order algorithms from [24] for unconstrained minimization and from [20, 24, 25] for simple composite minimization. Furthermore, in the composite convex case we recover the convergence bounds from [5] for \(p=1\) and particular choices of g and from [9] for \(p \ge 1\). To the best of our knowledge, this is the first complete work dealing with composite problems in the nonconvex and nonsmooth settings, and explicitly deriving convergence bounds for higher-order majorization-minimization algorithms (including convergence under KL).
2 Notations and preliminaries
We use the standard notations from [24, 25]. We denote a finite-dimensional real vector space with \({\mathbb {E}}\) and \({\mathbb {E}}^{*}\) its dual space composed of linear functions on \({\mathbb {E}}\). For any linear function \(s\in {\mathbb {E}}^{*}\), the value of s at point \(x\in {\mathbb {E}}\) is denoted by \(\langle s,x\rangle \). Using a self-adjoint positive-definite operator \({B}:{\mathbb {E}}\rightarrow {\mathbb {E}}^{*}\), we endow these spaces with conjugate Euclidean norms:
$$\begin{aligned} \Vert x\Vert = \langle Bx,x\rangle ^{1/2} \quad \forall x\in {\mathbb {E}}, \qquad \Vert s\Vert _{*} = \langle s,B^{-1}s\rangle ^{1/2} \quad \forall s\in {\mathbb {E}}^{*}. \end{aligned}$$
For a twice differentiable function \(\phi \) on a convex and open domain \(\text {dom}\;\phi \subseteq {\mathbb {E}}\), we denote by \(\nabla \phi (x)\) and \(\nabla ^{2} \phi (x)\) its gradient and Hessian evaluated at \(x\in \text {dom}\;\phi \), respectively. Then, \(\nabla \phi (x)\in {\mathbb {E}}^{*}\) and \(\nabla ^{2}\phi (x)h\in {\mathbb {E}}^{*}\) for all \(x\in \text {dom}\;\phi \), \(h\in {\mathbb {E}}\). Throughout the paper, we consider p a positive integer. In what follows, we often work with pth-order directional derivatives of the function \(\phi \) at x along directions \(h_{i}\in {\mathbb {E}}\), \(i=1:p\), denoted \({D^{p}} \phi (x)[h_{1},\cdots ,h_{p}]\). If all the directions \(h_{1},\cdots ,h_{p}\) are the same, we use the notation \( {D^{p}} \phi (x)[h]\), for \(h\in {\mathbb {E}}.\) Note that \({D^{p}} \phi (x)\) is a symmetric p-linear form. Then, its norm is defined as:
$$\begin{aligned} \Vert {D^{p}} \phi (x)\Vert = \max _{\Vert h_{1}\Vert \le 1,\cdots ,\Vert h_{p}\Vert \le 1}\; \big \vert {D^{p}} \phi (x)[h_{1},\cdots ,h_{p}]\big \vert . \end{aligned}$$
Further, the Taylor approximation of order p of \(\phi \) at \(x\in \text {dom}\;\phi \) is denoted:
$$\begin{aligned} T_{p}^{\phi }(y;\!x) = \phi (x) + \sum _{i=1}^{p}\frac{1}{i!}{D^{i}} \phi (x)[y-x]^{i} \quad \forall y\in {\mathbb {E}}. \end{aligned}$$
If \(\phi : {\mathbb {E}}\rightarrow {\bar{{\mathbb {R}}}}\) is a p-times differentiable function on \(\text {dom}\;\phi \), then its pth derivative is Lipschitz continuous if there exists a constant \(L_p^{\phi } > 0\) such that:
$$\begin{aligned} \Vert {D^{p}} \phi (x) - {D^{p}} \phi (y)\Vert \le L_p^{\phi }\Vert x-y\Vert \quad \forall x,y\in \text {dom}\;\phi . \end{aligned}$$(2)
It is known that if (2) holds, then the residual between the function and its Taylor approximation can be bounded [24]:
$$\begin{aligned} \vert \phi (y) - T_{p}^{\phi }(y;\!x)\vert \le \frac{L_p^{\phi }}{(p+1)!}\Vert y-x\Vert ^{p+1} \quad \forall x,y\in \text {dom}\;\phi . \end{aligned}$$(3)
If \(p \ge 2\), we also have the following inequalities valid for all \( x,y \in \text {dom}\;\phi \):
$$\begin{aligned} \Vert \nabla \phi (y) - \nabla T_{p}^{\phi }(y;\!x)\Vert _{*}\le \frac{L_p^{\phi }}{p!}\Vert y-x\Vert ^{p}, \end{aligned}$$(4)
$$\begin{aligned} \Vert \nabla ^{2}\phi (y) - \nabla ^{2}T_{p}^{\phi }(y;\!x)\Vert \le \frac{L_p^{\phi }}{(p-1)!}\Vert y-x\Vert ^{p-1}. \end{aligned}$$(5)
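The Taylor residual bound above can be checked numerically. The sketch below verifies the p = 2 case on \(\phi = \sin \), whose third derivative is bounded by 1 (so \(L_{2}^{\phi } = 1\)); the test function and the grid are our own choices.

```python
import math

# Numerical check of the residual bound for p = 2: if the pth derivative of
# phi is L_p-Lipschitz, then |phi(y) - T_p(y; x)| <= L_p/(p+1)! * |y - x|^(p+1).
# We use phi = sin, whose third derivative is bounded by 1, so L_2 = 1.

def T2(y, x):
    d = y - x
    return math.sin(x) + math.cos(x) * d - 0.5 * math.sin(x) * d * d

L2 = 1.0
for i in range(200):
    x = -3.0 + 6.0 * i / 199
    for j in range(200):
        y = -3.0 + 6.0 * j / 199
        lhs = abs(math.sin(y) - T2(y, x))
        rhs = L2 / 6.0 * abs(y - x) ** 3
        assert lhs <= rhs + 1e-12
print("Taylor residual bound verified on the grid")
```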
For the Hessian, the norm defined in (5) corresponds to the spectral norm of a self-adjoint linear operator (maximal modulus of all eigenvalues computed w.r.t. B). A function \(g:{\mathbb {R}}^{m} \rightarrow {\mathbb {R}}\) is said to be nondecreasing if for all \(i=1:m\), g is nondecreasing in its ith argument, i.e., the univariate function:
$$\begin{aligned} t \mapsto g(x_{1},\cdots ,x_{i-1},t,x_{i+1},\cdots ,x_{m}) \end{aligned}$$
is nondecreasing. In what follows, if x and y are in \({\mathbb {R}}^{m}\), then \(x\ge y\) means that \(x_{i}\ge y_{i}\) for all \(i=1:m\). Similarly, we define \(x>y\). Since g is nondecreasing, then for all \(x,y\in {\mathbb {R}}^m\) such that \(x\le y\) we have \(g(x)\le g(y)\). Next, we provide a few definitions and properties concerning subdifferential calculus (see [17]).
Definition 1
(Subdifferential): Let \(f: {\mathbb {E}} \rightarrow \bar{{\mathbb {R}}}\) be a proper lower semicontinuous function. For a given \(x \in \text {dom} \; f\), the Fréchet subdifferential of f at x, written \({\widehat{\partial }}f(x)\), is the set of all vectors \(g_{x}\in {\mathbb {E}}^{*}\) satisfying:
$$\begin{aligned} \liminf _{y\rightarrow x}\; \frac{f(y)-f(x)-\langle g_{x},y-x\rangle }{\Vert y-x\Vert }\ge 0. \end{aligned}$$
When \(x \notin \text {dom} \; f\), we set \({\widehat{\partial }} f(x) = \emptyset \). The limiting-subdifferential, or simply the subdifferential, of f at \(x\in \text {dom} \, f\), written \(\partial f(x)\), is defined as [17]:
$$\begin{aligned} \partial f(x) = \left\{ g_{x}\in {\mathbb {E}}^{*}:\; \exists \, x_{k}\rightarrow x,\; f(x_{k})\rightarrow f(x),\; g_{x_{k}}\in {\widehat{\partial }}f(x_{k}),\; g_{x_{k}}\rightarrow g_{x}\right\} . \end{aligned}$$
Note that we have \({\widehat{\partial }}f(x)\subseteq \partial f(x)\) for each \(x\in \text {dom}\,f\). In the previous inclusion, the first set is closed and convex while the second one is closed, see e.g., [17]. Let us recall the following chain rule for the composite problem g(F):
$$\begin{aligned} \partial \big (g(F)\big )(x) = \Big \{ \sum _{i=1}^{m} u_{i}\, \partial F_{i}(x):\; u\in \partial g\big (F(x)\big )\Big \} , \end{aligned}$$(6)
where F can be nondifferentiable and \(u=(u_{1},\cdots ,u_{m})\), which is valid provided that, e.g., \(F=(F_{1},\cdots ,F_{m})\) and g are locally Lipschitz, \(F_{i}\)’s are regular at x, g is regular at F(x) and \( \partial g(F(x))\subseteq {\mathbb {R}}^{m}_{+}\) (see Theorem 6 in [14] for more details). As a consequence, if g is the identity function and \(m=2\), then:
$$\begin{aligned} \partial \big (F_{1} + F_{2}\big )(x) = \partial F_{1}(x) + \partial F_{2}(x). \end{aligned}$$
For any \(x\in \text {dom} \; f\) let us define:
$$\begin{aligned} S_{f}(x) := \text {dist}\big (0,\partial f(x)\big ) = \inf _{g_{x}\in \partial f(x)}\Vert g_{x}\Vert _{*}. \end{aligned}$$
If \(\partial f(x) = \emptyset \), we set \(S_f(x) = \infty \).
3 General composite higher-order algorithm
In this section, we propose a higher-order algorithm for solving the general composite problem (1) and analyse its convergence.
Assumption 1
We consider the following assumptions for optimization problem (1):
-
1.
The functions \(F_{i}\), with \(i=1\!:\!m\), g and h are proper lower semicontinuous on their domains, satisfy the chain rule (6) and \(\text {dom}\; h \subseteq g(\text {dom}\;F)\).
-
2.
Additionally, g is convex nondecreasing and has full domain, satisfying the following subhomogeneity property:
$$\begin{aligned} g(\alpha x)\le \alpha g(x)\quad \forall x\in {\mathbb {R}}^m \quad \forall \alpha \ge 1. \end{aligned}$$(7) -
3.
Problem (1) has a solution and thus \(f^{*}:=\!\inf _{x\in \text {dom}\;f}\! f(x)\!>\!-\infty \).
From Assumption 1.1, it follows that \(\text {dom}\;f = \text {dom}\;h\). Moreover, if Assumption 1.2 holds, then from [9] (Theorem 4) it follows that:
Next, we provide several examples of optimization problems that can be written as (1) and satisfy our Assumption 1.
Example 1
(Minimax strategies for nonlinear games) Let us consider the problem:
where \(\triangle _n\), \(\triangle _m\) are the standard simplexes in \({\mathbb {R}}^n\) and \({\mathbb {R}}^m\), respectively. The smooth approximation for this problem using the entropy distance is as follows [22]:
for some \(\mu >0\). Using Lemma 4 in [22], we get:
Hence, considering \(g(y) = \mu \ln \left( \sum _{j=1}^{m}e^{\frac{y_j}{\mu }}\right) \), the original minimax problem can be approximated, for sufficiently small \(\mu \), with \(\min _{x\in \Delta _n}f_\mu (x):= g(F(x))\). Note that g satisfies Assumption 1.2.
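A quick numerical sanity check of this smoothing (with our own sample vector): the log-sum-exp function satisfies \(\max (y) \le g(y) \le \max (y) + \mu \ln m\), so it converges uniformly to the max function as \(\mu \rightarrow 0\).

```python
import math

# Entropic smoothing from Example 1: g_mu(y) = mu * ln(sum_j exp(y_j / mu))
# satisfies max(y) <= g_mu(y) <= max(y) + mu * ln(m). Sample data is ours.

def g_mu(y, mu):
    ymax = max(y)                       # shift for numerical stability
    return ymax + mu * math.log(sum(math.exp((yi - ymax) / mu) for yi in y))

y = [0.3, -1.2, 0.9, 0.1]
m = len(y)
for mu in [1.0, 0.1, 0.01]:
    val = g_mu(y, mu)
    assert max(y) <= val <= max(y) + mu * math.log(m) + 1e-12
print(round(g_mu(y, 0.01), 6), max(y))
```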
Example 2
(Min-max problems) Let us consider the following min-max problem:
This type of problem is classical in optimization. Note that if we define \(g(y)=\max _{i=1:m}y_{i}\) and \(h={{\textbf {1}}}_{Q}\), then, the previous min-max problem can be written as problem (1) and g satisfies Assumption 1.2.
Example 3
(Simple composite problems) Let us consider the following simple composite minimization problem:
Taking \(F(x)=F_{0}(x)\) and g the identity function, we can clearly see that \(g\big (F(x)\big ) + h(x)=F_{0}(x)+h(x)\).
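For this simple composite setting with p = 1, a natural surrogate (linearization of \(F_{0}\) plus a quadratic regularization, with h kept exact) reduces to the classical proximal-gradient step. The sketch below instantiates it for a one-dimensional quadratic \(F_{0}\) and \(h = \lambda \vert \cdot \vert \); all problem data are our own toy choices.

```python
# With g the identity and p = 1, minimizing the surrogate
# s(y; x) = F0(x) + F0'(x)(y - x) + (M/2)(y - x)^2 + h(y)
# is the proximal-gradient step; for h(y) = lam * |y| the step has the
# closed-form soft-threshold solution. Problem data below is illustrative.

def soft_threshold(z, t):
    # argmin_y  (1/2)(y - z)^2 + t * |y|
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def F0(x):
    return 0.5 * (x - 2.0) ** 2       # smooth part, gradient 1-Lipschitz

def dF0(x):
    return x - 2.0

lam, M = 0.5, 1.0
x = -4.0
for _ in range(100):
    x = soft_threshold(x - dF0(x) / M, lam / M)

# First-order optimality of min 0.5(x-2)^2 + 0.5|x| gives x* = 1.5.
assert abs(x - 1.5) < 1e-8
print(x)
```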
Further, let us introduce the notion of a higher-order surrogate, see also [20].
Definition 2
Let \(\phi :{\mathbb {E}}\rightarrow \bar{{\mathbb {R}}}\) be a proper lower semicontinuous function and \(x\in \text {dom}\;\phi \). We call the function \(s(\cdot \;;\!x): {\mathbb {E}} \rightarrow \bar{{\mathbb {R}}}\), with \(\text {dom}\; s(\cdot \;;\!x)=\text {dom}\;\phi \), a p higher-order surrogate of \(\phi \) at x if it has the following properties:
-
(i)
the error function
$$\begin{aligned} e(y;\!\!x)=s(y;\!\!x) - \phi (y), \end{aligned}$$(9) with \(\text {dom}\;e\) open such that \(\text {dom}\;\phi \subseteq \text {dom}\;e\), is p-times differentiable and has the pth derivative Lipschitz continuous with constant \(L_{p}^{e}\).
-
(ii)
the derivatives of the error function e satisfy
$$\begin{aligned} \nabla ^{i}e(x;\!\!x)=0 \quad \forall i=0:p, \; x\in \text {dom}\;\phi , \end{aligned}$$(10) and there exists a positive constant \(R_{p}^{e}>0\) such that
$$\begin{aligned} e(y;\!\!x)\ge \frac{R_{p}^{e}}{(p+1)!}\Vert y-x\Vert ^{p+1} \quad \forall x, y \in \text {dom}\;\phi . \end{aligned}$$(11)
Note that \(\text {dom}\;\phi \) may not be equal to \(\text {dom}\;e\) (see examples below) and from (11) we have \(s(y;x) \ge \phi (y)\) for all \(x,y\in \text {dom}\;\phi \). Next, we give two nontrivial examples of higher-order surrogate functions, see [20] for more examples.
Example 4
(Composite functions) Let \(F_1:{\mathbb {E}}\rightarrow {\mathbb {R}}\) be p times differentiable and the p derivative be Lipschitz with constant \(L_p^{{F_1}}\) and let \(F_2:{\mathbb {E}} \rightarrow \bar{{\mathbb {R}}}\) be a proper closed function. Then, for the composite function \(F=F_{1} + F_{2}\), where \(\text {dom}\;F= \text {dom}\;F_2 \), one can consider the following p higher-order surrogate function:
where \(M_{p}>L_{p}^{F_{1}}\). Indeed, from the definition of the error function, we get:
Thus \(e(\cdot ;\!x)\), with \(\text {dom}\;e = {\mathbb {E}}\) and \(\text {dom}\;F\subseteq \text {dom}\;e\), has the p derivative Lipschitz with constant \(L_p^{F_1} + M_p\). Further, from the definition of the error function e, we have:
Moreover, since \(F_1\) has the p derivative Lipschitz, it follows from (3) that:
Combining this inequality with (12), we get:
Hence, the error function e has \(L_{p}^{e}=M_{p} + L^{F_1}_{p}\) and \(R_{p}^{e}=M_{p}-L^{F_1}_{p}\).
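The bounds derived above can be verified numerically. The sketch below checks the p = 1 case of this surrogate with \(F_{1} = \sin \) (so \(L_{1}^{F_{1}} = 1\)), \(F_{2} = 0\) and \(M_{1} = 2\): the error function must lie between \(\frac{M_{1}-L_{1}^{F_{1}}}{2}(y-x)^{2}\) and \(\frac{M_{1}+L_{1}^{F_{1}}}{2}(y-x)^{2}\). The test function and grid are our own choices.

```python
import math

# Check of Example 4 for p = 1: F1 = sin (gradient 1-Lipschitz, so L_1 = 1),
# F2 = 0, M_1 = 2 > L_1. For s(y; x) = F1(x) + F1'(x)(y-x) + (M_1/2)(y-x)^2,
# the error e = s - F1 should satisfy
#   (M_1 - L_1)/2 * (y-x)^2 <= e(y; x) <= (M_1 + L_1)/2 * (y-x)^2.

L1, M1 = 1.0, 2.0

def s(y, x):
    d = y - x
    return math.sin(x) + math.cos(x) * d + 0.5 * M1 * d * d

for i in range(100):
    x = -3.0 + 6.0 * i / 99
    for j in range(100):
        y = -3.0 + 6.0 * j / 99
        e = s(y, x) - math.sin(y)      # error function e(y; x)
        d2 = (y - x) ** 2
        assert 0.5 * (M1 - L1) * d2 - 1e-12 <= e <= 0.5 * (M1 + L1) * d2 + 1e-12
print("surrogate bounds verified")
```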
Example 5
(Proximal higher-order) Let \(F:{\mathbb {E}} \rightarrow \bar{{\mathbb {R}}}\) be a proper lower semicontinuous function. Then, we can consider the following higher-order surrogate function:
where r is a positive integer. Indeed, the error function is:
where \(\text {dom}\;F\subseteq \text {dom}\;e = {\mathbb {E}}\). In this case, the error function e has the r derivative Lipschitz with \(L_{r}^{e}= M_{r}\) and \(R_{r}^{e}=M_{r}\).
In the following, we assume for problem (1) that each function \(F_{i}\), with \(i=1:m\), admits a p higher-order surrogate as in Definition 2. Then, we propose the following General Composite Higher-Order algorithm, called GCHO.
Although our algorithm requires only that the next iterate \(x_{k+1}\) satisfy the descent condition (14), we usually generate \(x_{k+1}\) by solving the following subproblem:
$$\begin{aligned} x_{k+1}\in \arg \min _{y\in {\mathbb {E}}}\; g\big (s(y;\!x_{k})\big ) + h(y), \end{aligned}$$(15)
where \(s(y;\!x_{k})=\big (s_{1}(y;\!x_{k}),\cdots ,s_{m}(y;\!x_{k})\big )\).
If F and h are convex functions, then the subproblem (15) can also be convex. Indeed, for Example 4, if \(M_p \ge pL_p^{F_1}\) and \(F_2\) is convex, then the surrogate function s is convex and hence the problem (15) is convex (see Theorem 1 in [24]), while for Example 5, the surrogate is convex if \(M_p\ge 0\). Hence, in the convex case we assume that \(x_{k+1}\) is the global optimum of the subproblem (15). However, in the nonconvex case, we cannot guarantee the convexity of the subproblem. In this case, we either assume that we can compute a stationary point of the subproblem (15) if g is the identity function, or we can compute an inexact solution as defined in (25) if g is a general function. Note that our algorithmic framework is quite general and yields an array of algorithms, each of which is associated with the specific properties of F and the corresponding surrogate. For example, if F is a sum of a smooth term and a nonsmooth one, we can use a surrogate as in Example 4; if F is fully nonsmooth, we can use a surrogate as in Example 5. This is the first time such an analysis is performed, and most of our variants of GCHO were not explicitly considered in the literature before (especially in the nonconvex settings). Note that in both Examples 4 and 5, \(x_{k+1}\) can be computed inexactly, as detailed in the next sections.
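Schematically, one GCHO iteration with the subproblem above can be sketched as follows for a toy instance with \(m = 2\), \(g = \max \), \(h = 0\) and first-order surrogates (linearization plus quadratic regularization). The crude grid search stands in for the exact or inexact subproblem solvers discussed above; all problem data are our own illustrative choices.

```python
# Schematic GCHO iteration for m = 2, g = max, h = 0 in one dimension, with a
# p = 1 surrogate for each component. The subproblem is minimized by a grid
# search purely for illustration. All problem data is our own toy choice.

def make_problem():
    F = [lambda x: (x - 1.0) ** 2, lambda x: (x + 1.0) ** 2]
    dF = [lambda x: 2.0 * (x - 1.0), lambda x: 2.0 * (x + 1.0)]
    return F, dF

def gcho_max(F, dF, x0, M=4.0, iters=60):
    x = x0
    for _ in range(iters):
        # surrogate of each component: linearization + quadratic regularization
        def model(y):
            return max(Fi(x) + dFi(x) * (y - x) + 0.5 * M * (y - x) ** 2
                       for Fi, dFi in zip(F, dF))
        grid = [x - 1.0 + 2.0 * i / 4000 for i in range(4001)]
        x_next = min(grid, key=model)
        # descent: the surrogate majorizes f and touches it at x
        assert max(Fi(x_next) for Fi in F) <= max(Fi(x) for Fi in F) + 1e-9
        x = x_next
    return x

F, dF = make_problem()
x_star = gcho_max(F, dF, x0=3.0)
# min_x max((x-1)^2, (x+1)^2) is attained at x = 0
assert abs(x_star) < 1e-2
print(round(x_star, 4))
```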
3.1 Nonconvex convergence analysis
In this section we consider that each \(F_{i}\), with \(i=1:m\), and h are nonconvex functions (possibly nonsmooth). Then, problem (1) becomes a pure nonconvex optimization problem. Now we are ready to analyze the convergence behavior of the GCHO algorithm under these general settings. In the sequel, we assume that \(g(-R_{p}^{e})<0\). Note that since the vector \(R_{p}^{e}>0\), then for all the optimization problems considered in Examples 1, 2 and 3 this assumption holds provided that \(M_p\) is large enough.
Theorem 1
Let F, g and h satisfy Assumption 1 and additionally each \(F_{i}\) admits a p higher-order surrogate \(s_{i}\) as in Definition 2 with the constants \(L^{e}_{p}(i)\) and \(R_{p}^{e}(i)\), for \(i=1:m\). Let \(\left( x_{k}\right) _{k\ge 0}\) be the sequence generated by Algorithm GCHO, \(R_{p}^{e}=\big (R_{p}^{e}(1),\cdots ,R_{p}^{e}(m)\big )\) and \(L^{e}_{p}=\big (L^{e}_{p}(1),\cdots ,L^{e}_{p}(m)\big )\). Then, the sequence \(\left( f(x_{k})\right) _{k\ge 0}\) is nonincreasing and satisfies the following descent relation:
Proof
Denote \(e(x_{k+1};\!\! x_{k})=\big (e_{1}(x_{k+1};\!\!x_{k}),\cdots ,e_{m}(x_{k+1};\!\!x_{k})\big )\). Then, from the definition of the error function e and (11), we have:
This implies that:
Since g is nondecreasing, we get:
Finally, we obtain that:
which yields our statement. \(\square \)
Summing (16) from \(j=0\) to k, we get:
Taking the limit as \(k\rightarrow +\infty \), we obtain:
Hence \(\lim _{k\rightarrow +\infty }\Vert x_{k}-x_{k+1}\Vert =0\). In our convergence analysis, we also consider the following additional assumption, which requires the existence of an auxiliary sequence that remains close to the sequence generated by the GCHO algorithm and along which a first-order optimality relation holds:
Assumption 2
Given the sequence \(\big (x_{k}\big )_{k\ge 0}\) generated by the GCHO algorithm, there exist two constants \(L^{1}_{p}, L^{2}_{p}>0\) and a sequence \(\left( y_{k}\right) _{k\ge 0}\) such that:
$$\begin{aligned} \Vert y_{k+1}-x_{k}\Vert \le L^{1}_{p}\Vert x_{k+1}-x_{k}\Vert , \end{aligned}$$(17)
$$\begin{aligned} S_{f}(y_{k+1})\le L^{2}_{p}\Vert x_{k+1}-x_{k}\Vert ^{p} \quad \forall k\ge 0. \end{aligned}$$(18)
In the next section, we provide concrete examples for the sequence \((y_{k})_{k\ge 0}\) satisfying Assumption 2, and the corresponding expressions for \(L_{p}^1\) and \(L_{p}^2\).
3.2 Approaching the set of stationary points
Before continuing with the convergence analysis of GCHO algorithm, let us analyze the relation between \(\Vert x_{k+1}-x_{k}\Vert ^{p}\) and \(S_{f}(x_{k+1})\) and also give examples when Assumption 2 is satisfied. For simplicity, consider the following simple composite minimization problem:
where F is a p-times differentiable function with the pth derivative \(L_{p}^{F}\)-Lipschitz and h is a proper lower semicontinuous function. In this case g is the identity function and we can take as a surrogate \(s(y;\!x)=T_{p}^{F}(y;\!x) + \frac{M_{p}}{(p+1)!}\Vert x-y\Vert ^{p+1} + h(y)\), with the positive constant \(M_{p}\) satisfying \(M_{p} > L_{p}^{F}\) and \(g(-R_p^e)<0\). The following lemma gives an example when Assumption 2 holds.
Lemma 1
Assume g is the identity function, F has the p derivative Lipschitz and \(x_{k+1}\) is a stationary point of the following subproblem:
Then, Assumption 2 holds with \(y_{k+1}=x_{k+1}\), \(L^{1}_{p}=1\) and \(L^{2}_{p}=\frac{M_{p} + L_{p}^{F}}{p!}\).
Proof
Since \(x_{k+1}\) is a stationary point of subproblem (19), using (6), we get:
or equivalently
Taking into account that F is p-smooth, we further get:
Hence, Assumption 2 holds with \(y_{k+1} = x_{k+1}\), \(L^{1}_{p}=1\) and \(L^{2}_{p}=\frac{M_{p} + L_{p}^{F}}{p!}\) . \(\square \)
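For illustration, the following sketch runs the p = 2 instance of this step (cubic-regularized Taylor model, g identity, h = 0) on the one-dimensional function \(f = \cos \), for which \(L_{2}^{F} = 1\), and checks the near-stationarity bound \(\vert f'(x_{k+1})\vert \le \frac{M_{2}+L_{2}^{F}}{2}\vert x_{k+1}-x_{k}\vert ^{2}\) from the lemma at every iteration. The grid minimization of the model and all constants are our own choices, standing in for an exact subproblem solver.

```python
import math

# Near-stationarity check for the simple composite case (g identity, h = 0)
# with p = 2: one step minimizes the cubic-regularized Taylor model
#   m(y) = f(x) + f'(x)(y-x) + f''(x)(y-x)^2/2 + M/6 * |y-x|^3,
# and the lemma then gives |f'(x_next)| <= (M + L)/2 * |x_next - x|^2.
# We take f = cos (third derivative bounded by 1, so L_2 = 1) and M = 2.

L, M = 1.0, 2.0

def model(x, d):
    return math.cos(x) - math.sin(x) * d - 0.5 * math.cos(x) * d * d \
        + M / 6.0 * abs(d) ** 3

x = 2.0
for _ in range(5):
    grid = [-2.0 + 4.0 * i / 40000 for i in range(40001)]
    d = min(grid, key=lambda t: model(x, t))   # stand-in subproblem solver
    x_next = x + d
    bound = 0.5 * (M + L) * d * d
    assert abs(-math.sin(x_next)) <= bound + 1e-3   # bound up to grid tolerance
    x = x_next
print(round(x, 4), round(-math.sin(x), 6))
```

The iterates approach the minimizer of cos at \(\pi \), and at every step the gradient at the new point is controlled by the squared step length, exactly the mechanism used to derive the nonconvex rate.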
The GCHO algorithm generating a sequence \((x_k)_{k \ge 0}\) that satisfies the descent (14) and the stationarity condition (19) has also been considered e.g., in the recent papers [20, 25], with h assumed to be a convex function. Here we remove this assumption on h.
Combining (20) and (16), we further obtain:
where \(C_{M_{p},L_{p}^{F}}=\left( \dfrac{M_{p}+L_{p}^{F}}{p!}\right) ^{\frac{p+1}{p}}\dfrac{(p+1)!}{M_{p}-L_{p}^{F}}\). Summing the last inequality from \(j=0:k-1\), and using that f is bounded from below by \(f^{*}\), we get:
Hence:
Thus, we have proved convergence for the simple composite problem under slightly more general assumptions than in [20, 25], i.e., F and h are possibly nonconvex functions. Finally, if we have \( \Vert x_{k+1}-x_{k}\Vert ^{p} \le \frac{p!}{L^F_{p}+M_{p}}\epsilon \), then from (20) it follows that \(S_f (x_{k+1})\le \epsilon \), i.e., \(x_{k+1}\) is nearly stationary for f. Note that in the previous Lemma 1, we assume \(x_{k+1}\) to be a stationary point of the following subproblem (see (19)):
However, our stationary condition for \(x_{k+1}\) can be relaxed to the following inexact optimality criterion (see also [2]):
where \(g_{x_{k+1}}\in \partial s(x_{k+1};\!x_{k}) \) and \(\theta >0\). For simplicity of the exposition, in our convergence analysis below for this particular case (i.e., g identity function) we assume however that \(x_{k+1}\) satisfies the exact stationary condition (21), although our results can be extended to the inexact stationary condition from above. The situation is dramatically different for the general composite problem (1). When g is nonsmooth, the distance \(\text {dist}\big (0,\partial f(x_{k+1})\big )\) will typically not even tend to zero in the limit, although we have seen that \(\Vert x_{k+1}-x_{k}\Vert ^{p}\) converges to zero. Indeed, consider the minimization of the following function:
For \(p=1\), we have \(L_{1}^{F}(1)=L_{1}^{F}(2)=2\). Taking \(x_{0}>1\) and \(M_{1}=M_{2}=4\), GCHO algorithm becomes:
where \(Q_{1}(x,x_{k})=2x^{2} - 2xx_{k}+ x_{k}^{2}-1 \). Let us prove by induction that \(x_{k}> 1\) for all \(k\ge 0\). Assume that \(x_{k}>1\) for some \(k\ge 0\). We notice that the polynomials \(Q_{2}(x,x_{k}):=Q_{1}(x,x_{k}) - 4xx_{k}+2x_{k}^2 + 2\) and \(Q_{1}(x,x_{k})\) are 2-strongly convex functions and they intersect in a unique point \({\bar{x}}=\frac{x_{k}^2+1}{2x_{k}}\). Also, the minimum of \(Q_{2}\) is attained at \({{\bar{x}}}_{2}=\frac{3}{2} x_{k}\) and the minimum of \(Q_{1}\) at \({{\bar{x}}}_{1}:=\frac{1}{2}x_{k}\), satisfying \({{\bar{x}}}_{1}\le {\bar{x}}\le {{\bar{x}}}_{2}\). Let us prove that \(x_{k+1}={\bar{x}}\). Indeed, if \(x\le {\bar{x}}\), then \(Q(x,x_{k}) = Q_{2}(x,x_{k})\), which is nonincreasing on \((-\infty ,{\bar{x}}]\); hence \(Q(x,x_{k})\ge Q({\bar{x}},x_{k})\) for all \(x\le {\bar{x}}\). Further, if \(x\ge {\bar{x}}\), then \(Q(x,x_{k}) = Q_{1}(x,x_{k})\), which is nondecreasing on \( [{\bar{x}},+\infty )\); hence \(Q(x,x_{k})\ge Q({\bar{x}},x_{k})\) for all \(x\ge {\bar{x}}\). Combining the two cases, we have \(Q(x,x_{k})\ge Q({\bar{x}},x_{k})\) for all \(x\in {\mathbb {R}}\). Since \(x_{k}>1\), we also get that \(x_{k+1}=\frac{x_{k}^2+1}{2x_{k}} > 1\). Moreover, since \(x_{k}>1\), we have \(\partial f(x_{k})=\{2x_{k}\}\) with \(2x_{k} > 2 \), and thus \(S_{f}(x_{k})\ge 2 > 0\). On the other hand, \(x_{k+1} < x_{k}\) and the sequence is bounded below by 1; thus \((x_{k})_{k\ge 0}\) is convergent and its limit is 1. Indeed, if \(x_{k}\rightarrow {\hat{x}}\) as \(k \rightarrow \infty \), then \( {\hat{x}} = \frac{{\hat{x}}^{2} + 1}{2{\hat{x}}}\) and thus \({\hat{x}} = 1\) (recall that \({\hat{x}}\ge 1\)). Consequently, \(\Vert x_{k+1} - x_{k}\Vert \) also converges to 0. Therefore, we must look elsewhere for a connection between \(S_{f}(\cdot )\) and \(\Vert x_{k+1} - x_{k}\Vert ^{p}\).
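The behavior just described is easy to reproduce numerically: the iterates of the recursion \(x_{k+1} = (x_{k}^{2}+1)/(2x_{k})\) decrease to 1 and the steps vanish, while the subgradient \(2x_{k}\) stays bounded away from zero. We run only a few steps, since in double precision the iterates round to 1 very quickly.

```python
# Iterating x_{k+1} = (x_k^2 + 1) / (2 x_k) from the paper's example
# f(x) = max(x^2 - 1, 1 - x^2), starting from x_0 = 3 > 1.

x = 3.0
steps = []
for _ in range(5):
    x_next = (x * x + 1.0) / (2.0 * x)
    steps.append(abs(x_next - x))
    assert 1.0 < x_next < x          # iterates decrease but never cross 1
    assert 2.0 * x_next > 2.0        # so dist(0, subdifferential) = 2 x_k > 2
    x = x_next

assert abs(x - 1.0) < 1e-8 and steps[-1] < 1e-4
print(x, steps[-1])
```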
Let us now consider the following subproblem:
where \(\mu _{p}>g(L^{e}_{p})\). Since f is assumed bounded from below, then for any fixed x, the function \(y\mapsto {\mathcal {M}}_{p}(y,x)\) is coercive and hence the optimal value \({\mathcal {M}}_{p}^* =\inf \limits _{y}{\mathcal {M}}_{p}(y,x)\) is finite. Then, the subproblem (23) is equivalent to:
for some compact set \({\mathcal {B}}_k\). Since \({\mathcal {M}}_{p}\) is a proper lower semicontinuous function in the first argument and \({\mathcal {B}}_k\) is a compact set, then from the Weierstrass theorem we have that the infimum \({\mathcal {M}}_{p}^*\) is attained, i.e., there exists \({\bar{y}}_{k+1} \in {\mathcal {P}}(x_{k})\) such that \({\mathcal {M}}_{p}({\bar{y}}_{k+1},x_{k}) = {\mathcal {M}}_{p}^*\). Since the level sets of \(y \mapsto {\mathcal {M}}_{p}(y,x)\) are compact, then \( {\mathcal {P}}(x_{k})\) is nonempty and compact and one can consider the following point:
Let us assume that \(F_{i}\) admits a higher-order surrogate as in Definition 2, where the error functions \(e_{i}\) are p-smooth with Lipschitz constants \(L_{p}^{e}(i)\) for all \(i=1:m\). Denote \(L_{p}^{e}=\big (L_{p}^{e}(1),\cdots ,L_{p}^{e}(m)\big )\) and define the following positive constant \(C_{L^{e}_{p}}^{\mu _{p}}= \dfrac{\mu _{p}}{\mu _{p} - g(L^{e}_{p})}\) (recall that \(\mu _{p}\) is chosen such that \(\mu _{p}>g(L^{e}_{p})\)). The next lemma shows that Assumption 2 holds provided that we compute \(x_{k+1}\) as an approximate local solution of subproblem (15) (hence, \(x_{k+1}\) does not need to be a global optimum) and \(y_{k+1}\) as in (24).
Lemma 2
Let the assumptions of Theorem 1 hold and additionally there exists \(\delta >0\) such that \(x_{k+1}\) satisfies the following inexact optimality condition:
where \(D_k:= \left( \frac{(p+1)!}{\mu _p}(f(x_k) - f^*)\right) ^{\frac{1}{p+1}}\). Then, Assumption 2 holds with \(y_{k+1}\) given in (24), \(L^{1}_{p}=\left( C_{L^{e}_{p}}^{\mu _{p}} + \frac{{\delta (p+1)!}}{\mu _p - g(L^{e}_p)}\right) ^{1/(p+1)}\) and \(L^{2}_{p}=\frac{\mu _{p}}{p!}\).
Proof
From the definition of \(y_{k+1}\) in (24), we have:
Further, taking \(y=x_k\) in (26) we also have:
which implies that:
Note that since the error functions \(e_{i}\)’s have the p derivative Lipschitz with constants \(L_{p}^{e}(i)\)’s, then using (3), we get:
From (10), the Taylor approximations of \(e_{i}\)’s of order p at \(x_{k}\), \(T_p^{e}(y;\!\!x_{k})\), are zero. Hence we get:
Further, since \(F(x_{k+1})\le s(x_{k+1};\!\!x_{k})\) (see (11)) and g is a nondecreasing function, we have:
Then, combining the last inequality with (26), we get:
which is the first statement of Assumption 2. Further, using (6) and optimality conditions for \(y_{k+1}\), we obtain:
It follows that:
Hence, Assumption 2 holds with \(y_{k+1}\) given in (24), \(L^{1}_{p}=\left( C_{L^{e}_{p}}^{\mu _{p}} + \frac{\delta (p+1)!}{\mu _p - g(L^{e}_p)}\right) ^{1/(p+1)}\) and \(L^{2}_{p}=\frac{\mu _{p}}{p!}\). \(\square \)
Finally, we provide a third (practical) example satisfying Assumption 2 when \(p=2\), \(h(\cdot ) = 0\) and \(g(\cdot ) = \max (\cdot )\).
Lemma 3
Let the assumptions of Theorem 1 hold and additionally assume that \(p=2\), \(g(\cdot ) = \max (\cdot )\) and the surrogate function \(s(\cdot ;\cdot )\) is given in Example 4 with \(F_2 = 0\). Then, the global solution of the subproblem (15) with \(h=0\), denoted \(x_{k+1}\), can be computed efficiently and consequently Assumption 2 holds with \(y_{k+1}\) given in (24), \(L^{1}_{p}=\left( C_{L^{e}_{p}}^{\mu _{p}}\right) ^{1/3}\) and \(L^{2}_{p}=\frac{\mu _{p}}{2}\).
Proof
See appendix. \(\square \)
From the proof of Lemma 3 one can see that the global minimum of subproblem (15) can be computed as:
where \(H_k(u,w) = \sum _{i=1}^{m} u_i \nabla ^2 F_i(x_k) + \frac{w}{2} I\), \(g_k(u) = \sum _{i=1}^{m}u_i \nabla F_i(x_k)\) and \(l_k(u) = \sum _{i=1}^{m} u_i F_i(x_k)\), with (u, w) the solution of the following dual problem:
with \(D = \left\{ (u,w)\in \Delta _m\times {\mathbb {R}}_+:\; \; H_k(u,w) \succ 0 \right\} \), i.e., a maximization of a concave function over a convex set D. Hence, if m is not too large, this convex dual problem can be solved efficiently by interior point methods [21]. In conclusion, the GCHO algorithm is implementable for \(p=2\) even for nonconvex problems, since we can effectively compute the global minimum \(x_{k+1}\) of subproblem (15) using powerful tools from convex optimization.
Define the following constant: \(D_{R_{p}^{e},L^{1,2}_{p}}=\frac{\left( L_{p}^{1}\left( L_{p}^{2}\right) ^{p}\right) ^{\frac{p+1}{p}}(p+1)!}{-g(-R_{p}^{e})}\). Then, we derive the following convergence result for GCHO algorithm in the nonconvex case.
Theorem 2
Let the assumptions of Theorem 1 and Assumption 2 hold. Then, for the sequence \(\left( x_{k}\right) _{k\ge 0}\) generated by Algorithm GCHO we have the following sublinear convergence rate:
Proof
From Assumption 2, we have:
Using the descent (16), we get:
Summing the last inequality from \(j=0:k-1\) and taking the minimum, we get:
which proves the statement of the theorem. \(\square \)
Theorem 2 requires that \(x_{k+1}\) satisfies the descent (14) and Assumption 2. However, according to Lemmas 1 and 2, Assumption 2 holds if \(x_{k+1}\) is an (inexact) stationary point or an inexact solution of the subproblem (15), respectively.
Remark 1
In essence, Assumption 2 requires an auxiliary sequence \(y_{k+1}\) satisfying:
If \(\Vert x_{k+1}-x_{k}\Vert \) is small, the point \(x_{k}\) is near \(y_{k+1}\), which is nearly stationary for f (recall that \(\Vert x_{k+1}-x_{k}\Vert \) converges to 0). Hence, we do not have approximate stationarity for the original sequence \(x_k\), but for the auxiliary sequence \(y_k\), which is close to the original sequence. Note that in practice \(y_{k+1}\) does not need to be computed; its purpose is to certify that \(x_{k}\) is approximately stationary in the sense of (30). For \(p=1\) a similar conclusion was derived in [8]. For a better understanding of the behavior of the sequence \(y_{k+1}\), let us return to our example \(f(x)=\max \big (x^{2}-1,1-x^{2}\big )\) with \(p=1\). Recall that we have proved \(x_{k}>1\); choosing \(\mu _{p}=4\), \(y_{k+1}\) is the solution of the following subproblem:
Then, it follows immediately that:
Since we have already proved that \(x_k \rightarrow 1\), we conclude that \(\vert y_{k+1} - x_{k} \vert \rightarrow 0\) and consequently \(\text {dist}(0,{\partial f(y_{k+1})})\rightarrow 0\) for \(k \rightarrow \infty \), as predicted by our theory.
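This behavior is easy to reproduce numerically. The sketch below runs the \(p=1\) step for this example, minimizing the one-dimensional model \(\max _i\{F_i(x_k)+F_i'(x_k)d\} + \frac{M}{2}d^2\) by ternary search (`gcho_step` is our own name; for \(x_k>1\) the minimizer sits at the kink of the max, which works out to \(x_{k+1}=(x_k^2+1)/(2x_k)\), so \(x_k \rightarrow 1\) as claimed):

```python
def gcho_step(x, M=4.0):
    """One GCHO iteration (p = 1) for f(x) = max(x**2 - 1, 1 - x**2):
    minimize d -> max(linearizations of both pieces at x) + (M/2) d**2."""
    def model(d):
        l1 = (x * x - 1.0) + 2.0 * x * d    # linearized x^2 - 1
        l2 = (1.0 - x * x) - 2.0 * x * d    # linearized 1 - x^2
        return max(l1, l2) + 0.5 * M * d * d
    lo, hi = -2.0, 2.0                      # bracket containing the minimizer
    for _ in range(200):                    # ternary search (model is convex in d)
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if model(m1) < model(m2):
            hi = m2
        else:
            lo = m1
    return x + 0.5 * (lo + hi)

x = 2.0
for _ in range(30):
    x = gcho_step(x)                        # x_k decreases monotonically to 1
```

Starting from \(x_0=2\), the iterates are \(1.25,\ 1.025,\ \ldots\), i.e., the Babylonian iteration for \(\sqrt{1}\), converging quadratically to the stationary value 1.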
3.3 Better rates for GCHO under KL
In this section, we show that improved rates can be derived for the GCHO algorithm if the objective function satisfies the KL property. This is the first time such a convergence analysis has been derived for the GCHO algorithm on the composite problem (1). We believe that this lack of analysis comes from the fact that, when g is nonsmooth and different from the identity function, one cannot directly bound the distance \(S_{f}(x_{k+1})\) by \(\Vert x_{k+1} - x_{k}\Vert \). However, using the newly introduced (artificial) point \(y_{k+1}\), we can overcome this difficulty. Let us recall the definition of a function satisfying the Kurdyka–Łojasiewicz (KL) property (see [4] for more details).
Definition 3
A proper lower semicontinuous function \(f: {\mathbb {E}}\rightarrow \bar{{\mathbb {R}}}\) satisfies the Kurdyka–Łojasiewicz (KL) property on the compact set \(\Omega \subseteq \text {dom} \; f\), on which f takes the constant value \({f_*}\), if there exist \(\delta , \epsilon >0\) such that:
where \(\kappa : [0,\epsilon ] \rightarrow {\mathbb {R}}\) is a concave differentiable function satisfying \(\kappa (0) = 0\) and \(\kappa '>0\).
When \( \kappa \) takes the form \(\kappa (t) = \sigma _q^{\frac{1}{q}} \frac{q}{q-1} t^{\frac{q-1}{q}}\), with \(q >1\) and \(\sigma _q>0\) (which is our interest here), the KL property establishes the following local geometry of the nonconvex function f around a compact set \(\Omega \):
Note that the relevant case of the KL property is when \(\Omega \) is a subset of stationary points of f, i.e. \(\Omega \subseteq \{x: 0 \in \partial f(x) \}\), since it is easy to establish the KL property when \(\Omega \) is not related to stationary points. The KL property holds for a large class of functions, including semi-algebraic functions (e.g., real polynomial functions), vector or matrix (semi)norms (e.g., \(\Vert \cdot \Vert _p\) with rational \(p \ge 0\)), logarithm functions, exponential functions and uniformly convex functions; see [4] for a comprehensive list. In particular, the max (sup) of semi-algebraic functions is semi-algebraic, see [6] (Example 2). Let us show that if \((x_{k})_{k\ge 0}\) is bounded, then \(( y_{k})_{k\ge 0}\) is also bounded and the two sequences have the same limit points.
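For the particular \(\kappa \) above we have \(\kappa '(t)=\sigma _q^{1/q} t^{-1/q}\), so the KL inequality \(\kappa '(f(x)-f_*)\,\mathrm{dist}(0,\partial f(x))\ge 1\) rearranges into an explicit error-bound form; a reconstruction of the displayed geometry (31), consistent with how it is used in the proof of Theorem 3, reads:

```latex
f(x) - f_* \;\le\; \sigma_q\, \mathrm{dist}\big(0,\partial f(x)\big)^{q},
\qquad \text{for all } x \text{ with } \mathrm{dist}(x,\Omega)\le \delta
\ \text{ and }\ f_* < f(x) < f_* + \epsilon .
```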
Lemma 4
Let \(\left( x_{k}\right) _{k\ge 0}\), generated by Algorithm GCHO, be bounded and let \((y_{k})_{k\ge 0}\) satisfy Assumption 2. Then, \((y_{k})_{k\ge 0}\) is bounded and the set of limit points of the sequence \(\left( y_{k}\right) _{k\ge 0}\) coincides with the set of limit points of \(\left( x_{k}\right) _{k\ge 0}\).
Proof
Indeed, since \(\left( x_{k}\right) _{k\ge 0}\) is bounded, then it has limit points. Let \(x_{*}\) be a limit point of the sequence \(\left( x_{k}\right) _{k\ge 0}\). Then, there exists a subsequence \(( x_{k_{t}})_{t\ge 0}\) such that \(x_{k_{t}}\rightarrow x_{*}\) for \(t\rightarrow \infty \). We have:
Since \((x_k)_{k\ge 0}\) is bounded and \(\Vert x_{k+1} - x_{k}\Vert \rightarrow 0\), the sequence \((y_{k})_{k\ge 0}\) is also bounded. This implies that \(y_{k_{t}}\rightarrow x_{*}\). Hence, \(x_{*}\) is also a limit point of the sequence \(\left( y_{k}\right) _{k\ge 0}\). Conversely, let \(y_{*}\) be a limit point of the bounded sequence \(\left( y_{k}\right) _{k\ge 0}\). Then, there exists a subsequence \(( y_{{\bar{k}}_{t}})_{t\ge 0}\) such that \(y_{{\bar{k}}_{t}}\rightarrow y_{*}\) for \(t\rightarrow \infty \). Taking \(t\rightarrow \infty \) in an inequality similar to (32) and using \(\lim _{t\rightarrow \infty } \Vert x_{{\bar{k}}_t} - x_{{\bar{k}}_t - 1}\Vert = 0 \) and the boundedness of \((x_k)_{k\ge 0}\), we get that \(x_{{\bar{k}}_{t}}\rightarrow y_{*}\), i.e., \(y_{*}\) is also a limit point of the sequence \(\left( x_{k}\right) _{k\ge 0}\). \(\square \)
Note that usually for deriving convergence rates under KL condition, we need to assume that the sequence generated by the algorithm is bounded (see e.g., Theorem 1 in [6]). Let us denote the set of limit points of \((x_{k})_{k\ge 0}\) by:
and the set of stationary points of problem (1) by \(\text {crit}f\).
Lemma 5
Let the assumptions of Theorem 1 hold. Additionally, assume that \(\left( x_{k}\right) _{k\ge 0}\) is bounded, \((y_{k})_{k\ge 0}\) satisfies Assumption 2 and f is continuous. Then, we have: \(\emptyset \ne \Omega (x_{0}) \subseteq \text {crit} f \), \( \Omega (x_{0})\) is a compact and connected set, and f is constant on \( \Omega (x_{0})\), i.e., \(f(\Omega (x_{0})) = f_*\).
Proof
First, let us show that f is constant on \(\Omega (x_{0})\). From (16) we have that \(\left( f(x_{k})\right) _{k\ge 0}\) is monotonically decreasing, and since f is assumed bounded from below, it converges, say to \({f_*}>-\infty \), i.e., \(f(x_{k})\rightarrow {f_*}\) as \(k\rightarrow \infty \). On the other hand, let \(x_{*}\) be a limit point of the sequence \(\left( x_{k}\right) _{k\ge 0}\). This means that there exists a subsequence \(\left( x_{k_{t}}\right) _{t\ge 0}\) such that \(x_{k_{t}}\rightarrow x_{*} \). Since f is continuous, \(f(x_{k_{t}})\rightarrow f(x_{*}) = {f_*}\). In conclusion, we have \(f(\Omega (x_{0}))={f_*}\). The closedness property of \(\partial f\) implies that \(S_{f}(x_{*})=0\), and thus \(0\in \partial f(x_{*})\). This proves that \(x_{*}\) is a stationary point of f and thus \(\Omega (x_{0})\) is nonempty. Finally, \(\Omega (x_{0})\) can be viewed as an intersection of compact sets:
so it is also compact. This completes our proof. \(\square \)
Note that \(f_*\) from Lemma 5 is usually different from \(f^* = \inf _{x\in \text {dom}f } f(x)\) defined in Assumption 1. In addition, let us consider the following assumption:
Assumption 3
For the sequence \(\big (x_{k}\big )_{k\ge 0}\) generated by GCHO algorithm, there exist positive constants \(\theta _{1,p},\theta _{2,p}>0 \) such that:
Remark 2
Note that Assumption 3 holds when e.g., g is the identity function or when \((y_{k})_{k\ge 0}\) is given in (24) and \(x_{k+1}\) satisfies (25) (see Lemmas 2 and 3). For completeness, we provide a proof of this statement in the Appendix.
Let us also recall the following lemma, whose proof is similar to the one in [1] (Theorem 2). For completeness, we give the proof in the Appendix.
Lemma 6
Let \(\theta >0\), \(C_{1},C_{2}\ge 0\) and \((\lambda _{k})_{k\ge 0}\) be a nonnegative, nonincreasing sequence, satisfying the following recurrence:
If \(\theta \le 1\), then there exists an integer \(k_{0}\) such that:
If \(\theta >1\), then there exists \(\alpha > 0\) and integer \(k_{0}\) such that:
From the previous lemmas, all the conditions of the KL property from Definition 3 are satisfied. Then, we can derive the following convergence rates depending on the KL parameter.
Theorem 3
Let the assumptions of Lemma 5 hold. Additionally, assume that f satisfies the KL property (31) on \(\Omega (x_0)\) and Assumption 3 is valid. Then, the following convergence rates hold for the sequence \((x_{k})_{k\ge 0}\) generated by GCHO algorithm:
-
If \(q\ge \frac{p+1}{p}\), then \(f(x_{k})\) converges to \({f_*}\) linearly for k sufficiently large.
-
If \(q < \frac{p+1}{p} \), then \(f(x_{k})\) converges to \({f_*}\) at sublinear rate of order \({\mathcal {O}}\left( \frac{1}{k^{\frac{pq}{p+1-pq}}}\right) \) for k sufficiently large.
Proof
Since \((x_k)_{k\ge 0}\) and \((y_k)_{k \ge 0}\) have the same limit points, we get:
If we define \(\Delta _{k}=f(x_{k})-{f_*}\), then combining the last inequality with (16), we get the following recurrence:
where \( C_{1}= \sigma _{q}(L_{p}^{2}(L_p^1)^p)^{q}\left( \frac{(p+1)!}{-g(-R_{p}^{e})}\right) ^{\frac{pq}{p+1}}\) and \( C_{2}= \left( \theta _{1,p}(L_{p}^{1})^{p+1} + \theta _{2,p}\right) \frac{(p+1)!}{-g(-R_{p}^{e})}.\)
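Given these constants and the invocation of Lemma 6 with \(\theta =\frac{p+1}{pq}\) that follows, the recurrence in question is presumably of the Lemma 6 template (our hedged reconstruction, matching \(1/\theta = \frac{pq}{p+1}\)):

```latex
\Delta_{k+1} \;\le\; C_{1}\left( \Delta_{k}-\Delta_{k+1}\right)^{\frac{pq}{p+1}}
\;+\; C_{2}\left( \Delta_{k}-\Delta_{k+1}\right).
```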
Using Lemma 6, with \(\theta =\frac{p+1}{pq}\) we get our statements. \(\square \)
Remark 3
Contrary to Theorem 2, under the KL property we prove in Theorem 3 that the original sequence \((x_k)_{k\ge 0}\) converges in function values. When the objective function f is uniformly convex of order \(p+1\) and g does not necessarily have full domain, [9] proves linear convergence in function values for their algorithm. Our results are different, i.e., we provide convergence rates for the GCHO algorithm for a possibly nonconvex objective f.
3.4 Convex convergence analysis
In this section, we assume that the objective function f in (1) is convex. Since problem (1) is convex, we assume that \(x_{k+1}\) is a global minimum of the subproblem (15), which is convex provided that \(M_p\) is sufficiently large (see Theorem 1 in [24]). Below, we also assume that the level sets of f are bounded. Since GCHO is a descent method, this implies that there exists a constant \(R_0>0\) such that \(\Vert x_{k}-x^{*}\Vert \le R_0\) for all \(k\ge 0\), where \(x^*\) is an optimal solution of (1). Then, we get the following sublinear rate for the GCHO algorithm.
Theorem 4
Let F, g and h satisfy Assumption 1 and additionally each \(F_{i}\) admits a p higher-order surrogate \(s_{i}\) as in Definition 2 with the constants \(L^{e}_{p}(i)\) and \(R_{p}^{e}(i)\), for \(i=1:m\). Additionally, f is a convex function and has bounded level sets. Let \(\left( x_{k}\right) _{k\ge 0}\) be the sequence generated by Algorithm GCHO, \(R_{p}^{e}=\big (R_{p}^{e}(1),\cdots ,R_{p}^{e}(m)\big )\) and \(L^{e}_{p}=\big (L^{e}_{p}(1),\cdots ,L^{e}_{p}(m)\big )\). Then, we have the following convergence rate:
Proof
Since \(F(x_{k+1})\le S(x_{k+1};x_k)\) (see (11)) and g is nondecreasing, we have:
Hence we get:
where the last inequality follows from the convexity of f and the boundedness of the level sets of f. The minimum in \(\alpha \ge 0\) is achieved at:
We have \(0\le \alpha ^{*}<1 \). Indeed, since \(\big (f(x_{k})\big )_{k\ge 0}\) is decreasing, we have:
Hence:
Thus, we conclude:
Denoting \(\delta _{k}=f(x_{k})-f(x^{*})\), we get the following estimate:
where \({C=\frac{p}{p+1}\left( \dfrac{p!}{g(L^{e}_{p})R_0^{p+1}} \right) ^{\frac{1}{p}}}\). Thus, for \(\mu _{k} = C^{p}\delta _{k}\) we get the following recurrence:
Following the same proof as in [24] (Theorem 4), we get:
Since
then
This proves the statement of the theorem. \(\square \)
Note that in the convex case the convergence results from [5, 8, 9] assume Lipschitz continuity of the p derivative of the objective function F, which may be too restrictive. However, Theorem 4 assumes Lipschitz continuity of the p derivative of the error function \(e(\cdot )\) (note that the error function \(e(\cdot )\) may be p times differentiable with a Lipschitz continuous p derivative, while the objective function F may not even be differentiable, see Examples 4 and 5). Hence, our proof is different from and more general than [5, 8, 9]. Moreover, our convergence rate from the previous theorem covers the usual convergence rates \({\mathcal {O}}(\frac{1}{k^{p}})\) of higher-order Taylor-based methods in the convex unconstrained case [24], the simple composite case [24, 25] and the composite case for \(p\ge 1\) [5, 9]. Therefore, Theorem 4 provides a unified convergence analysis for general composite higher-order algorithms, covering, in particular, minimax strategies for nonlinear games, min-max problems and simple composite problems, under possibly more general assumptions.
3.5 Adaptive GCHO algorithm
In this section, we propose an adaptive variant of the GCHO algorithm. Since the surrogate functions in all the examples given in this paper depend on a given constant M (see Examples 4 and 5, where \(M = M_p\)), below we use the notation \(s(\cdot ;\cdot ):= s_{M}(\cdot ;\cdot )\). Note that the convergence results from Theorems 1, 2 and 3 are derived provided that Assumptions 2 and 3 and the following properties of the sequence \((x_k)_{k\ge 0}\) generated by GCHO hold:
where \(C_p^e: = -g(-R_p^e)\) is a given constant depending on the choice of the surrogate \(s_M(x_{k+1};x_k)\), which may be difficult to find in practice. Hence, in the following we propose an adaptive general composite higher-order algorithm, called (A-GCHO):
For a better understanding of this process, let us consider Example 4, where \(F = F_1 + F_2\), with the p derivative of \(F_1\) being \(L_p^{F_1}\)-Lipschitz and \(F_2\) a proper closed convex function. Then, in this case the surrogate is \(s_M(y;x) = T^{F_1}_p(y;x) + \frac{M}{(p+1)!}\Vert y - x\Vert ^{p+1} + F_2(y)\). Let \(R_p,M_0 > 0\) be fixed. Then, step 1 in the A-GCHO algorithm can be seen as a line search procedure (see for example [13]): that is, at each step \(k \ge 0\) we choose \(M_k\ge M_0\), build \(s_{M_k}(y;x_k) = T^{F_1}_p(y;x_k) + \frac{M_k}{(p+1)!}\Vert y - x_k\Vert ^{p+1} + F_2(y)\) and compute \(x_{k+1}\) satisfying (35). If (36) does not hold, then we increase \(M_k \leftarrow 2\cdot M_k\), recompute \(s_{M_k}(y;x_k)\) using the new \(M_k\) and go to step 2. We repeat this process until condition (36) is satisfied. Note that this line search procedure finishes in a finite number of steps. Indeed, if \(M_k \ge R_p + L_p^{F_1}\), then from inequality (13) we get \(s_{M_k}(y;x_k) - F(y) \ge \frac{R_p}{(p+1)!}\Vert y - x_k\Vert ^{p+1}\) for all y; thus, for \(y = x_{k+1}\) and g an increasing function, (36) holds. Note also that in this case the error function e satisfies Definition 2 (i) with \(L_p^e = 2(R_p + L_p^{F_1})\). Hence, using the same convergence analysis as in the previous sections, we can derive convergence rates similar to Theorems 1, 2 and 3 for the A-GCHO algorithm under Assumptions 2 and 3, since the sequence \((x_k)_{k\ge 0}\) generated by A-GCHO automatically satisfies (35) and (36). For the convex case, as in Sect. 3.4, in the A-GCHO algorithm we require that \(x_{k+1}\) is the global solution of the corresponding subproblem, and consequently convergence results similar to Theorem 4 can be derived for this adaptive algorithm.
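The doubling loop described above can be sketched as follows (a schematic only: `model_min` and `gap_ok` are hypothetical callables standing in for the subproblem solver producing an iterate satisfying (35) and the test of condition (36), respectively):

```python
def adaptive_M_step(x_k, model_min, gap_ok, M0=1.0):
    """Step 1 of A-GCHO as a doubling line search on M (a sketch):
      model_min(x, M) -> a minimizer x_{k+1} of the surrogate s_M(.; x),
      gap_ok(x_next, x, M) -> True iff condition (36) holds at x_{k+1}."""
    M = M0
    while True:
        x_next = model_min(x_k, M)      # candidate satisfying (35)
        if gap_ok(x_next, x_k, M):      # accept once (36) holds
            return x_next, M
        M *= 2.0                        # otherwise double M and rebuild s_M
```

Since (36) is guaranteed once \(M \ge R_p + L_p^{F_1}\), the loop terminates after at most \(\lceil \log _2\big ((R_p+L_p^{F_1})/M_0\big )\rceil + 1\) doublings.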
4 Numerical simulations
In this section we present some preliminary numerical results for the GCHO algorithm. For the simulations, we consider the test set from [18], which contains systems of nonlinear equations, where one searches for \(x^{*}\) such that \(F_{i}(x^{*}) = 0\) for all \(i = 1,\cdots ,m\). For solving these problems, we implement our GCHO algorithm for \(p=1,2\). We consider two formulations: min-max and least-squares, respectively. The min-max formulation has the form:
Similarly, the least-squares formulation can be written as a simple composite minimization problem:
Note that both formulations fit into our general problem (1). We consider the following two implementations. First, for problem (37) we compare the GCHO algorithm for \(p=1,2\) with IPOPT (the results are given in Table 2). Secondly, for problem (38) we compare the GCHO algorithm for \(p=1,2\) with IPOPT and the method proposed in [12] (the results are given in Table 3). At each iteration of the GCHO algorithm we replace each function \(F_{i}\) by its Taylor approximation of order p, with \(p=1,2\), and a quadratic/cubic regularization, and solve the corresponding subproblem (15) using IPOPT [28]. In the numerical simulations we have noticed that for \(p=2\) IPOPT was able to detect a global minimizer of the subproblem at each iteration, i.e., the solution of IPOPT coincided with the solution obtained by solving the dual problem as described in the proof of Lemma 3 given in the appendix. Since it is difficult to compute the corresponding Lipschitz constants for the gradient/hessian, we use the line search procedure described in Sect. 3.5. Note that since in practice it is difficult to compute the sequence \((y_k)_{k\ge 1}\), we cannot use \(\text {dist}(0,\partial f(y_k))\le \epsilon \) as a stopping criterion for the proposed algorithm. Thus, the stopping criterion considered in this paper is the same as in [3]:
where \(f_{\text {best}} = f^* \approx 0\), but positive, and the starting point \(x_0\) are taken from [18]. In Tables 2 and 3, we summarize our numerical results in terms of CPU time and number of iterations for the GCHO algorithm with \(p = 1,2\), IPOPT and [12]. Note that the test functions we consider in the two tables are nonconvex and most of them satisfy the KL condition (as semi-algebraic functions). From the tables, we observe that the GCHO algorithm (\(p=1\) or \(p=2\)) applied to the min-max formulation performs better than the GCHO algorithm (\(p=1\) or \(p=2\)) applied to the least-squares problem, both in CPU time and number of iterations. This is due to the fact that the regularization constants for the min-max problem (37), \(M_{p}^{\text {max}} = \big (M_{p}^{\text {max}}(1), \cdots , M_{p}^{\text {max}}(m)\big )\), are much smaller than the one for the least-squares formulation (38), \(M_{p}^\text {ls}\), i.e., \(M_{p}^\text {ls} \approx \sum _{i=1}^m M_{p}^{\text {max}} (i)\). Moreover, from the tables we observe that increasing the order of the Taylor approximation is beneficial for the GCHO algorithm: e.g., in the min-max formulation, GCHO with \(p=2\) is at least twice as fast as GCHO with \(p=1\). We also observe from Table 3 that the GCHO algorithm applied to the min-max formulation with \(p=2\) performs better (in both CPU time and number of iterations) than the method proposed in [12] for the least-squares formulation. Finally, the GCHO algorithm (\(p=1, 2\)) for both formulations is able to identify the global optimal points/values given in [18], while IPOPT directly applied to formulations (38) or (37) may fail to identify the global optimal points/values (see Tables 2 and 3).
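To make the least-squares formulation (38) concrete, here is a minimal sketch of a \(p=1\)-style iteration: linearize each \(F_i\), keep \(g(\cdot )=\frac{1}{2}\Vert \cdot \Vert ^2\) and add a quadratic regularization, which reduces to a regularized Gauss-Newton step with a closed-form update. `gauss_newton_reg`, the fixed constant M and the toy system are our own illustrative choices, not the paper's test set or exact method:

```python
import numpy as np

def gauss_newton_reg(F, J, x0, M=1.0, iters=100):
    """Prox-linear step for min 0.5*||F(x)||^2: linearize F at x_k and add
    (M/2)*||d||^2, giving d = -(J^T J + M I)^{-1} J^T F(x_k) in closed form."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        r, Jk = F(x), J(x)
        d = -np.linalg.solve(Jk.T @ Jk + M * np.eye(x.size), Jk.T @ r)
        x = x + d
    return x

# Toy nonlinear system: x1^2 + x2^2 = 1 and x1 = x2 (roots at +-1/sqrt(2)).
F = lambda x: np.array([x[0] ** 2 + x[1] ** 2 - 1.0, x[0] - x[1]])
J = lambda x: np.array([[2.0 * x[0], 2.0 * x[1]], [1.0, -1.0]])
x_star = gauss_newton_reg(F, J, [1.0, 0.5])
```

In the paper's experiments M is instead chosen adaptively by the line search of Sect. 3.5 and the subproblems are solved by IPOPT; the fixed-M sketch only illustrates the structure of one iteration.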
References
Attouch, H., Bolte, J.: On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Math. Program. 116(1–2), 5–16 (2009)
Birgin, E.G., Gardenghi, J.L., Martínez, J.M., Santos, S.A., Toint, P.L.: Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models. Math. Program. 163(1), 359–368 (2017)
Birgin, E.G., Gardenghi, J.L., Martínez, J.M., Santos, S.A.: On the use of third-order models with fourth-order regularization for unconstrained optimization. Optim. Lett. 14, 815–838 (2020)
Bolte, J., Daniilidis, A., Lewis, A., Shiota, M.: Clarke subgradients of stratifiable functions. SIAM J. Optim. 18(2), 556–572 (2007)
Bolte, J., Chen, Z., Pauwels, E.: The multiproximal linearization method for convex composite problems. Math. Prog. 182, 1–36 (2020)
Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146, 459–494 (2014)
Cartis, C., Gould, N., Toint, P.L.: A concise second-order complexity analysis for unconstrained optimization using high-order regularized models. Optim. Methods Softw. 35, 243–256 (2020)
Drusvyatskiy, D., Paquette, C.: Efficiency of minimizing compositions of convex functions and smooth maps. Math. Program. 178(1–2), 503–558 (2019)
Doikov, N., Nesterov, Yu.: Optimization methods for fully composite problems. SIAM J. Optim. 32(3), 2402–2427 (2022)
Fletcher, R.: A model algorithm for composite NDO problems. Math. Program. Stud. 17, 67–76 (1982)
Gasnikov, A., Dvurechensky, P., Gorbunov, E., Vorontsova, E., Selikhanovych, D., Uribe, C., Jiang, B., Wang, H., Zhang, S., Bubeck, S., Jiang, Q.: Near optimal methods for minimizing convex functions with Lipschitz \(p\)th derivatives. In: Conference on Learning Theory, pp. 1392–1393 (2019)
Gould, N.I.M., Rees, T., Scott, J.: Convergence and evaluation-complexity analysis of a regularized tensor-Newton method for solving nonlinear least-squares problems. Comput. Optim. Appl. 73(1), 1–35 (2019)
Grapiglia, G., Nesterov, Yu.: Tensor methods for minimizing convex functions with Hölder continuous higher-order derivatives. SIAM J. Optim. 30(4), 2750–2779 (2020)
Hiriart-Urruty, J.-B.: New concepts in nondifferentiable programming. Memoires de la Societe Mathematique de France 60, 57–85 (1979)
Li, C., Ng, K.F.: Majorizing functions and convergence of the Gauss-Newton method for convex composite optimization. SIAM J. Optim. 18(2), 613–642 (2007)
Mairal, J.: Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM J. Optim. 25(2), 829–855 (2015)
Mordukhovich, B.: Variational Analysis and Generalized Differentiation. Basic Theory. Springer, Berlin (2006)
Moré, J.J., Garbow, B.S., Hillstrom, K.E.: Testing unconstrained optimization software. ACM Trans. Math. Softw. 7(1), 17–41 (1981)
Necoara, I., Nesterov, Yu., Glineur, F.: Linear convergence of first-order methods for non-strongly convex optimization. Math. Program. 175, 69–107 (2019)
Necoara, I., Lupu, D.: General higher-order majorization-minimization algorithms for (non) convex optimization (2020). arXiv preprint: arXiv:2010.13893
Nesterov, Yu., Nemirovskii, A.: Interior-Point Polynomial Algorithms in Convex Programming. SIAM, Philadelphia (1994)
Nesterov, Yu.: Smooth minimization of non-smooth functions. Math. Program. 103, 127–152 (2005)
Nesterov, Yu., Polyak, B.T.: Cubic regularization of Newton method and its global performance. Math. Program. 108, 177–205 (2006)
Nesterov, Yu.: Implementable tensor methods in unconstrained convex optimization. Math. Program. 186, 157–183 (2021)
Nesterov, Yu.: Inexact basic tensor methods for some classes of convex optimization problems. Optim. Methods Softw. 37, 878–906 (2022)
Nesterov, Yu.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)
Pauwels, E.: The value function approach to convergence analysis in composite optimization. Oper. Res. Lett. 44, 790–795 (2016)
Wächter, A., Biegler, L.T.: On the implementation of a primal-dual interior point filter line search algorithm for large-scale nonlinear programming. Math. Program. 106(1), 25–57 (2006)
Yuan, Y.: Conditions for convergence of trust-region algorithms for nonsmooth optimization. Math. Program. 31, 220–228 (1985)
Acknowledgements
The research leading to these results has received funding from: ITN-ETN project TraDE-OPT funded by the EU, H2020 Research and Innovation Programme under the Marie Skłodowska-Curie grant agreement No. 861137; NO Grants 2014-2021, under project ELO-Hyp, contract no. 24/2020; UEFISCDI PN-III-P4-PCE-2021-0720, under project L2O-MOC, nr. 70/2022.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Appendix
Proof of Lemma 3
Let us first prove that for \(p=2\), \(g(\cdot )=\max (\cdot )\) and \(h(\cdot ) = 0\), one can efficiently compute the global solution \(x_{k+1}\) of the subproblem (15). Indeed, in this particular case (15) is equivalent to the following subproblem:
Further, this is equivalent to:
where \(u=(u_1,\cdots ,u_m)\) and \(\Delta _m:= \left\{ u\ge 0: \sum _{i=1}^{m} u_i = 1 \right\} \) is the standard simplex in \({\mathbb {R}}^m\). Further, this \(\min -\max \) problem can be written as follows:
Denote for simplicity \(H_k(u,w) = \sum _{i=1}^{m} u_i \nabla ^2 F_i(x_k) + \frac{w}{2} I\), \(g_k(u) = \sum _{i=1}^{m}u_i \nabla F_i(x_k)\), \(l_k(u) = \sum _{i=1}^{m} u_i F_i(x_k)\) and \(\tilde{M}(u) = \sum _{i=1}^{m}u_i M_i\). Then, the dual formulation of this problem takes the form:
Consider the following notations:
Below, we prove that if there exists an \(M_i >0\), for some \(i = 1:m\), then we have the following relation:
Additionally, for any \((u,w)\in D\) the direction \(x_{k+1} = x_k -H_k(u,w)^{-1}g_k(u)\) satisfies:
where \(r_k:= \Vert x_{k+1} - x_k\Vert \). Indeed, let us first show that \(\theta ^*\ge \beta ^*\). Using a similar reasoning as in [23], we have:
Let \((u,w) \in D\). Then, we have \(g_k(u)= - H_k(u,w)(x_{k+1} - x_k)\) and thus:
which proves (40). Note that we have [23]:
Therefore, if \(\beta ^*\) is attained at some \((u^*,w^*) \in D\), then we have \(\nabla \beta (u^*,w^*) = 0\). This implies \(\frac{w^*}{\tilde{M}(u^*)} = r_k\) and by (40) we conclude that \(\theta ^* = \beta ^*\).
Finally, if \(x_{k+1}\) is a global solution of the subproblem (15) (or equivalently (39)), then it satisfies the inexact condition (25) with \(\delta = 0\). Hence, using the proof of Lemma 2 with \(\delta = 0\) we can conclude that Assumption 2 holds with \(y_{k+1}\) given in (24), \(L^{1}_{p}=\left( C_{L^{e}_{p}}^{\mu _{p}}\right) ^{1/3}\) and \(L^{2}_{p}=\frac{\mu _{p}}{2}\). \(\square \)
Proof of Remark 2
If g is the identity function, then taking \(y_{k+1}=x_{k+1}\) one can see that Assumption 3 holds for any nonnegative constants \(\theta _{1,p}\) and \(\theta _{2,p}\). If g is a general function, then Assumption 3 holds, provided that \(x_{k+1}\) satisfies the inexact optimality condition (25). Indeed, in this case, we have:
where the last inequality follows taking \(y = y_{k+1}\). Hence, Assumption 3 holds in this case for \(\theta _{1,p} = \dfrac{g(L^{e}_{p})}{(p+1)!}\) and \(\theta _{2,p} = \delta \). Finally, if \(p=2\) and \(g(\cdot ) = \max (\cdot )\), then \(x_{k+1}\) is the global solution of the subproblem (15) and hence, using similar arguments as above, we can prove that Assumption 3 also holds in this case. \(\square \)
Proof of Lemma 6
Note that the sequence \(\lambda _{k}\) is nonincreasing and nonnegative, thus it is convergent. Let us first consider \(\theta \le 1\). Since \(\lambda _{k}-\lambda _{k+1}\) converges to 0, there exists \(k_{0}\) such that \(\lambda _{k}-\lambda _{k+1} \le 1\) and \(\lambda _{k+1}\le (C_{1}+C_{2})\left( \lambda _{k}-\lambda _{k+1}\right) \) for all \(k \ge k_{0}\). It follows that:
which proves the first statement. If \(1<\theta \le 2\), then there also exists an integer \(k_{0}\) such that \(\lambda _{k}-\lambda _{k+1} \le 1\) for all \(k\ge k_{0}\). Then, we have:
Since \(1<\theta \le 2\), then taking \(0 < \beta = \theta -1 \le 1\), we have:
for all \(k\ge k_{0}\). From Lemma 11 in [25], we further have:
for all \(k\ge k_{0}\) and for some \(\sigma >0\). Finally, if \(\theta > 2\), then define \(h(s)=s^{-\theta }\) and let \(R>1\) be fixed. Since \(1/\theta < 1\), there exists \(k_{0}\) such that \(\lambda _{k}-\lambda _{k+1} \le 1\) for all \(k \ge k_{0}\). Then, we have \(\lambda _{k+1}\le (C_{1}+C_{2})\left( \lambda _{k}-\lambda _{k+1}\right) ^{\frac{1}{\theta }}\), or equivalently:
If we assume that \(h(\lambda _{k+1})\le R h(\lambda _{k}) \), then:
Denote \(\mu =\frac{-R(C_{1}+C_{2})^{\theta }}{-\theta +1}\). Then:
If we assume that \(h(\lambda _{k+1})> R h(\lambda _{k}) \) and set \(\gamma = R^{-\frac{1}{\theta }}\), then it follows immediately that \(\lambda _{k+1}\le \gamma \lambda _{k}\). Since \(1-\theta \) is negative, we get:
Since \(1- \theta <0\), \(\gamma ^{1-\theta } > 1\) and \(\lambda _{k}\) has a nonnegative limit, then there exists \({\bar{\mu }} > 0\) such that \((\gamma ^{1-\theta } - 1) \lambda _{k}^{1-\theta } > {\bar{\mu }}\) for all \(k \ge k_0\). Therefore, in this case we also obtain:
If we set \({\hat{\mu }}=\min (\mu ^{-1},{\bar{\mu }})\) and combine (41) and (42), we obtain:
Summing the last inequality from \(k_{0}\) to k, we obtain \(\lambda _{k}^{1-\theta }-\lambda _{k_{0}}^{1-\theta }\ge {\hat{\mu }}(k-k_{0})\), i.e.:
for all \(k \ge k_0\). This concludes our proof. \(\square \)
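The dichotomy established in this proof can be sanity-checked numerically. The sketch below iterates the equality case of the recurrence, \(\lambda _{k+1} = (C_1+C_2)\left( \lambda _k-\lambda _{k+1}\right) ^{1/\theta }\) (a simplification of Lemma 6; `next_lambda` and `decay` are our own names), and exhibits the predicted linear decay for \(\theta \le 1\) versus the sublinear \(\lambda _k = {\mathcal {O}}(k^{-1/(\theta -1)})\) decay for \(\theta > 1\):

```python
def next_lambda(lam, C, theta, iters=100):
    """Solve t = C * (lam - t)**(1/theta) for t in [0, lam] by bisection;
    the left side increases and the right side decreases in t."""
    lo, hi = 0.0, lam
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid < C * (lam - mid) ** (1.0 / theta):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def decay(theta, C=1.0, lam0=1.0, k=200):
    """Iterate the equality recurrence k times starting from lam0."""
    seq = [lam0]
    for _ in range(k):
        seq.append(next_lambda(seq[-1], C, theta))
    return seq
```

For \(\theta =1\) and \(C=1\) this gives \(\lambda _{k+1}=\lambda _k/2\) (a linear rate); for \(\theta =2\) one observes \(\lambda _k \approx 1/k\), matching the \(k^{1/(1-\theta )}\) bound above.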
Nabou, Y., Necoara, I. Efficiency of higher-order algorithms for minimizing composite functions. Comput Optim Appl 87, 441–473 (2024). https://doi.org/10.1007/s10589-023-00533-9