1 Introduction

In this work, we consider the class of general composite optimization problems:

$$\begin{aligned} \min _{x \in \text {dom}f} f(x):= g\big (F(x)\big )+h(x), \end{aligned}$$
(1)

where \(h:{\mathbb {E}}\rightarrow {\bar{{\mathbb {R}}}}\), with \({\bar{{\mathbb {R}}}} = {\mathbb {R}}\cup \{+ \infty \}\), and \(F: {\mathbb {E}} \rightarrow \bar{{\mathbb {R}}}^{m}\) are general proper lower semicontinuous functions on their closed domains and \(g:{\mathbb {R}}^{m} \rightarrow {\mathbb {R}}\) is a proper closed convex function defined everywhere. Here, \({\mathbb {E}}\) is a finite-dimensional real vector space and \(F=(F_1, \cdots , F_m)\). Note that \(\text {dom}\;f =\text {dom}\;F\cap \text {dom}\;h \). This formulation unifies many particular cases, such as smooth approximation of minimax games, max-type minimization problems or exact penalty formulations of nonlinear programs, while recent instances include robust phase retrieval and matrix factorization problems [5, 8, 9, 15]. Note that the setting where g is the identity function was intensively investigated in large-scale optimization [1, 16, 19, 26]. In this paper, we call this formulation simple composite optimization. When g is restricted to be a Lipschitz convex function and F smooth, a natural approach to this problem consists in linearizing the smooth part, leaving the nonsmooth term unchanged and adding an appropriate quadratic regularization term. This is the approach considered e.g., in [8, 27], leading to a proximal Gauss-Newton method, i.e., given the current point \({\bar{x}}\) and a regularization parameter \(t>0\), solve at each iteration the subproblem:

$$\begin{aligned} x^{+}=\mathop {{\mathrm{arg\,min}}}\limits \limits _{x} g\Big (F({\bar{x}}) + \nabla F({\bar{x}})(x-{\bar{x}})\Big ) + \frac{1}{2t}\Vert x-{\bar{x}}\Vert ^{2} + h(x). \end{aligned}$$

For such a method, it was proved in [8] that \(\text {dist}(0,\partial f(x))\) converges to 0 at a sublinear rate of order \({\mathcal {O}}(1/k^{\frac{1}{2}})\), where k is the iteration counter, while convergence of the iterates under the KL inequality was recently shown in [27]. Note that the case where g is (Lipschitz) convex, F is smooth and \(h = 0\) has also been analysed in [10, 29].
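To make the prox-linear (proximal Gauss-Newton) step above concrete, here is a minimal sketch, not taken from [8, 27]: it assumes a synthetic robust-phase-retrieval instance with \(F_i(x)=(a_i^Tx)^2-b_i\), \(g=\tfrac{1}{m}\Vert \cdot \Vert _1\), \(h=0\), an illustrative step size t, and uses cvxpy to solve each convex subproblem.

```python
import numpy as np
import cvxpy as cp

# Synthetic robust-phase-retrieval data (illustrative only): F_i(x) = (a_i^T x)^2 - b_i.
rng = np.random.default_rng(0)
n, m, t = 5, 20, 0.1
A = rng.standard_normal((m, n))
b = (A @ rng.standard_normal(n)) ** 2

def F(x):                              # inner map F(x) in R^m
    return (A @ x) ** 2 - b

def JF(x):                             # Jacobian of F: row i equals 2*(a_i^T x)*a_i^T
    return 2.0 * (A @ x)[:, None] * A

xbar = rng.standard_normal(n)
for _ in range(25):                    # proximal Gauss-Newton iterations
    x = cp.Variable(n)
    model = cp.norm1(F(xbar) + JF(xbar) @ (x - xbar)) / m \
            + cp.sum_squares(x - xbar) / (2.0 * t)
    cp.Problem(cp.Minimize(model)).solve()
    xbar = x.value
print("mean residual |F(x)|:", np.abs(F(xbar)).mean())
```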

In [5] a flexible method is proposed, where the smooth part F is replaced by its quadratic approximation, i.e., given \({\bar{x}}\), solve:

$$\begin{aligned} x^{+}=\mathop {{\mathrm{arg\,min}}}\limits \limits _{x} g\left( F({\bar{x}}) +\nabla F({\bar{x}})(x-{\bar{x}}) + \frac{L}{2}\Vert x-{\bar{x}}\Vert ^{2} \right) + h(x), \end{aligned}$$

where \(L=(L_1,\cdots ,L_m)^T\), with \(L_i\) being the Lipschitz constant of the gradient of \(F_i\), for \(i=1:m\). Assuming F, g and h are convex functions, and that g is additionally componentwise nondecreasing and Lipschitz, [5] derives a sublinear convergence rate of order \({\mathcal {O}}(1/k)\) in function values. Finally, in the recent paper [9], a general composite minimization problem of the form:

$$\begin{aligned} \min \limits _{x\in {\mathbb {E}}}g\big (x,F(x)\big ), \end{aligned}$$

is considered, where \(F=(F_{1},\cdots ,F_{m})\), with \(F_{i}\)’s being convex and p-smooth functions on \({\mathbb {E}}\) and having the p-derivative Lipschitz, with \(p\ge 1\) an integer constant. Under these settings, [9] replaces the smooth part by its Taylor approximation of order p plus a proper regularization term, i.e., given \({\bar{x}}\), solve the following subproblem:

$$\begin{aligned} x^{+}=\mathop {{\mathrm{arg\,min}}}\limits \limits _{x}g\left( x, T^{F}_{p}(x;\!\!{\bar{x}}) + \frac{L}{(p+1)!}\Vert x-{\bar{x}}\Vert ^{p+1}\right) , \end{aligned}$$

where \(L=(L_1,\cdots ,L_m)^T\), with \(L_i\) being related to the Lipschitz constant of the p-derivative of \(F_i\) and \(T^{F}_{p}(x;\!\!{\bar{x}}) \) is the p-Taylor approximation of F around the current point \({\bar{x}}\). For such a higher-order method, in the convex settings and assuming that g has full domain in the second argument, [9] derives a sublinear convergence rate in function values of order \({\mathcal {O}}\left( 1/k^p\right) \).

Note that the optimization scheme in [9] belongs to the class of higher-order methods. Such methods are popular due to their performance in dealing with ill conditioning and fast rates of convergence, see e.g., [2, 7, 13, 20, 24, 25]. For example, first-order methods achieve convergence rates of order \({\mathcal {O}}(1/k)\) for smooth convex optimization [16, 26], while higher-order methods of order \(p >1\) have convergence rates \({\mathcal {O}}(1/k^p)\) for minimizing p-smooth convex objective functions [13, 20, 24, 25]. Accelerated variants of higher-order methods were also developed e.g., in [11, 24, 25]. Recently, [20] provided a unified framework for the convergence analysis of higher-order optimization algorithms for solving simple composite optimization problems using the majorization-minimization approach. This is a technique that approximates an objective function by a majorizing function, which can be minimized in closed form or whose minimizer can be computed fast, yielding a solution or some acceptable improvement. Note that papers such as [5, 16] use a first-order majorization-minimization approach to build a model (i.e., use only gradient information), while [20] uses higher-order derivatives to build such a model. However, global complexity bounds for higher-order methods based on the majorization-minimization principle for solving the composite problem (1) have not yet been given. This is the goal of this work.

Contributions In this paper, we provide an algorithmic framework based on the notion of higher-order upper bound approximation of the composite problem (1). Note that in this optimization formulation we impose only general properties on the functions involved, e.g., the functions F and h can be smooth or nonsmooth, convex or nonconvex, while g is convex, nondecreasing and has full domain. Our framework consists of replacing F by a higher-order surrogate, leading to a General Composite Higher-Order minimization algorithm, which we call GCHO. This majorization-minimization approach is relevant as it yields an array of algorithms, each of which is associated with the specific properties of F and the corresponding surrogate, and it provides a unified convergence analysis. Note that most of our variants of GCHO for \(p> 1\) were not explicitly considered in the literature before (at least in the nonconvex settings). Moreover, our new first-, second-, and third-order methods can be implemented in practice using existing efficient techniques from e.g., [25, 28].

We derive convergence guarantees for the GCHO algorithm when the upper bounds approximate F in the objective function up to an error that is \(p \ge 1\) times differentiable and has a Lipschitz continuous p derivative; we call such upper bounds composite higher-order surrogate functions. More precisely, on composite (possibly nonsmooth) nonconvex problems we prove for GCHO, with the help of a new auxiliary sequence, convergence rates \({\mathcal {O}}\left( \frac{1}{k^{p/(p+1)}}\right) \) in terms of first-order optimality conditions. We also characterize the convergence rate of GCHO algorithm locally, in terms of function values, under the Kurdyka–Lojasiewicz (KL) property. Our results show that the convergence behavior of GCHO ranges from sublinear to linear depending on the parameter of the underlying KL geometry. Moreover, on general (possibly nonsmooth) composite convex problems (i.e., f is a convex function) our algorithm achieves a global sublinear convergence rate of order \({\mathcal {O}}\left( 1/k^{p}\right) \) in function values. We summarize our convergence results in Table 1. Finally, for \(p=2\), \(g(\cdot ) = \max (\cdot )\) and \(h = 0\), we show that the subproblem, even in the nonconvex case, is equivalent to minimizing an explicitly written convex function over a convex set, which can be solved using efficient convex optimization tools.

Besides providing a general framework for the design and analysis of composite higher-order methods, in special cases, where complexity bounds are known for some particular algorithms, our convergence results recover the existing bounds. For example, from our convergence analysis one can easily recover the convergence bounds of higher-order algorithms from [24] for unconstrained minimization and from [20, 24, 25] for simple composite minimization. Furthermore, in the composite convex case we recover the convergence bounds from [5] for \(p=1\) and particular choices of g and from [9] for \(p \ge 1\). To the best of our knowledge, this is the first complete work dealing with composite problems in the nonconvex and nonsmooth settings, and explicitly deriving convergence bounds for higher-order majorization-minimization algorithms (including convergence under KL).

Table 1 Convergence results for GCHO algorithm

2 Notations and preliminaries

We use the standard notations from [24, 25]. We denote by \({\mathbb {E}}\) a finite-dimensional real vector space and by \({\mathbb {E}}^{*}\) its dual space, composed of linear functions on \({\mathbb {E}}\). For any linear function \(s\in {\mathbb {E}}^{*}\), the value of s at a point \(x\in {\mathbb {E}}\) is denoted by \(\langle s,x\rangle \). Using a self-adjoint positive-definite operator \({B}:{\mathbb {E}}\rightarrow {\mathbb {E}}^{*}\), we endow these spaces with conjugate Euclidean norms:

$$\begin{aligned} \Vert x\Vert =\langle {B}x,x\rangle ^{{\frac{1}{2}}},\quad x\in {\mathbb {E}},\qquad \Vert g\Vert _{*}=\langle g,{B}^{-1}g\rangle ^{\frac{1}{2}},\quad g\in {\mathbb {E}}^{*}. \end{aligned}$$

For a twice differentiable function \(\phi \) on a convex and open domain \(\text {dom}\;\phi \subseteq {\mathbb {E}}\), we denote by \(\nabla \phi (x)\) and \(\nabla ^{2} \phi (x)\) its gradient and Hessian evaluated at \(x\in \text {dom}\;\phi \), respectively. Then, \(\nabla \phi (x)\in {\mathbb {E}}^{*}\) and \(\nabla ^{2}\phi (x)h\in {\mathbb {E}}^{*}\) for all \(x\in \text {dom}\;\phi \), \(h\in {\mathbb {E}}\). Throughout the paper, we consider p a positive integer. In what follows, we often work with directional derivatives of a function \(\phi \) at x along directions \(h_{i}\in {\mathbb {E}}\) of order p, \({D^{p}} \phi (x)[h_{1},\cdots ,h_{p}]\), with \(i=1:p\). If all the directions \(h_{1},\cdots ,h_{p}\) are the same, we use the notation \( {D^{p}} \phi (x)[h]\), for \(h\in {\mathbb {E}}.\) Note that \({D^{p}} \phi (x)\) is a symmetric p-linear form. Then, its norm is defined as:

$$\begin{aligned} \Vert {D^{p}} \phi (x)\Vert = \max \limits _{h\in {\mathbb {E}}} \left\{ {D^{p}} \phi (x)[h]^{p} :\Vert h\Vert \le 1 \right\} . \end{aligned}$$

Further, the Taylor approximation of order p of \(\phi \) at \(x\in \text {dom}\;\phi \) is denoted:

$$\begin{aligned} T_p^{\phi }(y;\!\! x)= \phi (x) + \sum _{i=1}^{p} \frac{1}{i !} {D^{i}} \phi (x)[y-x]^{i} \quad \forall y \in {\mathbb {E}}. \end{aligned}$$

If \(\phi : {\mathbb {E}}\rightarrow {\bar{{\mathbb {R}}}}\) is a p times differentiable function on \(\text {dom}\;\phi \), then its pth derivative is Lipschitz continuous if there exists a constant \(L_p^{\phi } > 0\) such that:

$$\begin{aligned} \Vert {D^p} \phi (x) - {D^p} \phi (y) \Vert \le L_p^{\phi } \Vert x-y \Vert \quad \forall x,y \in \text {dom}\;\phi . \end{aligned}$$
(2)

It is known that if (2) holds, then the residual between the function and its Taylor approximation can be bounded [24]:

$$\begin{aligned} \vert \phi (y) - T_p^{\phi }(y;\!\!x) \vert \le \frac{L_p^{\phi }}{(p+1)!} \Vert y-x\Vert ^{p+1} \quad \forall x,y \in \text {dom}\;\phi . \end{aligned}$$
(3)
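As a quick sanity check of (3), the following sketch (ours, not from the paper) takes \(\phi =\sin \) and \(p=2\), for which \(L_2^{\phi }=1\) is a valid Lipschitz constant of the second derivative, and verifies the bound at random points.

```python
import numpy as np

rng = np.random.default_rng(0)

def T2(y, x):
    # Taylor approximation of order 2 of sin at x
    return np.sin(x) + np.cos(x) * (y - x) - 0.5 * np.sin(x) * (y - x) ** 2

for _ in range(1000):
    x, y = rng.uniform(-10.0, 10.0, size=2)
    # bound (3): |phi(y) - T2(y;x)| <= L_2/3! * |y - x|^3 with L_2 = 1
    assert abs(np.sin(y) - T2(y, x)) <= abs(y - x) ** 3 / 6.0 + 1e-12
```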

If \(p \ge 2\), we also have the following inequalities valid for all \( x,y \in \text {dom}\;\phi \):

$$\begin{aligned}&\Vert \nabla \phi (y) - \nabla T_p^{\phi }(y;\!\!x) \Vert _* \le \frac{L_p^{\phi }}{p!} \Vert y-x \Vert ^p, \end{aligned}$$
(4)
$$\begin{aligned}&\Vert \nabla ^2 \phi (y) - \nabla ^2 T_p^{\phi }(y;\!\!x) \Vert \le \frac{L_p^{\phi }}{(p-1)!} \Vert y-x\Vert ^{p-1}. \end{aligned}$$
(5)

For the Hessian, the norm defined in (5) corresponds to the spectral norm of a self-adjoint linear operator (the maximal modulus of its eigenvalues, computed w.r.t. B). A function \(g:{\mathbb {R}}^{m} \rightarrow {\mathbb {R}}\) is said to be nondecreasing if for all \(i=1:m\), g is nondecreasing in its ith argument, i.e., the univariate function:

$$\begin{aligned} z \mapsto g(z_{1},\cdots ,z_{i-1},z,z_{i+1},\cdots ,z_{m}), \end{aligned}$$

is nondecreasing. In what follows, if x and y are in \({\mathbb {R}}^{m}\), then \(x\ge y\) means that \(x_{i}\ge y_{i}\) for all \(i=1:m\). Similarly, we define \(x>y\). Since g is nondecreasing, for all \(x,y\in {\mathbb {R}}^m\) such that \(x\le y\) we have \(g(x)\le g(y)\). Next, we provide a few definitions and properties concerning subdifferential calculus (see [17]).

Definition 1

(Subdifferential): Let \(f: {\mathbb {E}} \rightarrow \bar{{\mathbb {R}}}\) be a proper lower semicontinuous function. For a given \(x \in \text {dom} \; f\), the Fréchet subdifferential of f at x, written \({\widehat{\partial }}f(x)\), is the set of all vectors \(g_{x}\in {\mathbb {E}}^{*}\) satisfying:

$$\begin{aligned} \liminf _{y\rightarrow x,\; y\ne x}\frac{f(y) - f(x) - \langle g_{x}, y - x\rangle }{\Vert x-y\Vert }\ge 0. \end{aligned}$$

When \(x \notin \text {dom} \; f\), we set \({\widehat{\partial }} f(x) = \emptyset \). The limiting-subdifferential, or simply the subdifferential, of f at \(x\in \text {dom} \, f\), written \(\partial f(x)\), is defined as [17]:

$$\begin{aligned} \partial f(x)\!:=\! \left\{ g_{x}\!\in \! {\mathbb {E}}^{*}\!\!: \exists x^{k}\!\rightarrow \! x, f(x^{k})\!\rightarrow \! f(x) \; \text {and} \; \exists g_{x}^{k}\!\in \!{\widehat{\partial }} f(x^{k}) \;\; \text {such that} \;\; g_{x}^{k} \!\rightarrow \! g_{x}\right\} . \end{aligned}$$

Note that we have \({\widehat{\partial }}f(x)\subseteq \partial f(x)\) for each \(x\in \text {dom}\,f\). In the previous inclusion, the first set is closed and convex while the second one is closed, see e.g., [17]. Let us recall the following chain rule for the composite problem g(F):

$$\begin{aligned} \partial (g\circ F)(x) = co\left\{ \sum _{i=1}^{m} u_{i}v_{i} \mid u\in \partial g\big (F(x)\big ),\; v_{i}\in \partial F_{i}(x),\; i=1:m\right\} , \end{aligned}$$
(6)

where F can be nondifferentiable and \(u=(u_{1},\cdots ,u_{m})\), which is valid provided that, e.g., \(F=(F_{1},\cdots ,F_{m})\) and g are locally Lipschitz, \(F_{i}\)’s are regular at x, g is regular at F(x) and \( \partial g(F(x))\subseteq {\mathbb {R}}^{m}_{+}\) (see Theorem 6 in [14] for more details). As a consequence, if \(m=2\) and \(g(y_{1},y_{2})=y_{1}+y_{2}\), then:

$$\begin{aligned} \partial (F_1+F_2)(x) = \partial F_{1}(x) + \partial F_{2}(x). \end{aligned}$$

For any \(x\in \text {dom} \; f\) let us define:

$$\begin{aligned} S_{f}(x)=\text {dist}\big (0,\partial f(x)\big ):=\inf \limits _{g_{x}\in \partial f(x)}\Vert g_{x}\Vert . \end{aligned}$$

If \(\partial f(x) = \emptyset \), we set \(S_f(x) = \infty \).

3 General composite higher-order algorithm

In this section, we propose a higher-order algorithm for solving the general composite problem (1) and analyse its convergence.

Assumption 1

We consider the following assumptions for optimization problem (1):

  1.

    The functions \(F_{i}\), with \(i=1\!:\!m\), g and h are proper lower semicontinuous on their domains, satisfy the chain rule (6) and \(\text {dom}\; h \subseteq \text {dom}\;F\).

  2.

    Additionally, g is convex nondecreasing and has full domain, satisfying the following subhomogeneity property:

    $$\begin{aligned} g(\alpha x)\le \alpha g(x)\quad \forall x\in {\mathbb {R}}^m \quad \forall \alpha \ge 1. \end{aligned}$$
    (7)
  3.

    Problem (1) has a solution and thus \(f^{*}:=\!\inf _{x\in \text {dom}\;f}\! f(x)\!>\!-\infty \).

From Assumption 1.1, it follows that \(\text {dom}\;f = \text {dom}\;h\). Moreover, if Assumption 1.2 holds, then from [9](Theorem 4) it follows that:

$$\begin{aligned} g(x+ty)\le g(x) + tg(y) \quad \forall x,y\in {\mathbb {R}}^m, \; t\ge 0. \end{aligned}$$
(8)

Next, we provide several examples of optimization problems that can be written as (1) and satisfy our Assumption 1.

Example 1

(Minimax strategies for nonlinear games) Let us consider the problem:

$$\begin{aligned} \min _{x\in \triangle _n} \left\{ f(x): = \max _{u\in \triangle _m}\langle F(x),u\rangle \right\} , \end{aligned}$$

where \(\triangle _n\), \(\triangle _m\) are the standard simplexes in \({\mathbb {R}}^n\) and \({\mathbb {R}}^m\), respectively. The smooth approximation for this problem using the entropy distance is as follows [22]:

$$\begin{aligned} \min _{x\in \triangle _n} \left\{ f_\mu (x) := \max _{u\in \triangle _m}\Big \lbrace \langle F(x),u\rangle - \mu \sum _{j=1}^{m}u_j\ln (u_j) - \mu \ln (m)\Big \rbrace \right\} , \end{aligned}$$

for some \(\mu >0\). Using Lemma 4 in [22], we get:

$$\begin{aligned} f_\mu (x) = \mu \ln \left( \sum _{j=1}^{m}e^{\frac{F_j(x)}{\mu }}\right) . \end{aligned}$$

Hence, considering \(g(y) = \mu \ln \left( \sum _{j=1}^{m}e^{\frac{y_j}{\mu }}\right) \), the original minimax problem can be approximated, for sufficiently small \(\mu \), with \(\min _{x\in \triangle _n}f_\mu (x):= g(F(x))\). Note that g satisfies Assumption 1.2.
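For illustration, a numerically stable way to evaluate the smoothed objective \(f_\mu \) above is the standard shifted log-sum-exp trick; the sketch below (ours, with a made-up affine map F) also shows that \(f_\mu \) approaches \(\max _j F_j\) as \(\mu \rightarrow 0\).

```python
import numpy as np

def f_mu(Fx, mu):
    # mu * log(sum_j exp(F_j(x)/mu)), evaluated stably by shifting the maximum
    z = Fx / mu
    zmax = z.max()
    return mu * (zmax + np.log(np.exp(z - zmax).sum()))

# Made-up affine map F(x) = A x + c, evaluated at a point of the simplex in R^2.
A = np.array([[1.0, -2.0], [0.5, 3.0], [-1.0, 1.0]])
c = np.array([0.3, -0.1, 0.2])
x = np.array([0.4, 0.6])
for mu in (1.0, 0.1, 1e-3):
    print(mu, f_mu(A @ x + c, mu), np.max(A @ x + c))   # f_mu -> max_j F_j(x)
```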

Example 2

(Min-max problems) Let us consider the following min-max problem:

$$\begin{aligned} \min \limits _{x\in Q} \max \limits _{i=1:m}F_{i}(x). \end{aligned}$$

This type of problem is classical in optimization. Note that if we define \(g(y)=\max _{i=1:m}y_{i}\) and \(h={{\textbf {1}}}_{Q}\), then, the previous min-max problem can be written as problem (1) and g satisfies Assumption 1.2.

Example 3

(Simple composite problems) Let us consider the following simple composite minimization problem:

$$\begin{aligned} \min \limits _{x\in {\mathbb {R}}^{n}} F_{0}(x)+h(x). \end{aligned}$$

Taking \(F(x)=F_{0}(x)\) and g the identity function, we can clearly see that \(g\big (F(x)\big ) + h(x)=F_{0}(x)+h(x)\).

Further, let us introduce the notion of a higher-order surrogate, see also [20].

Definition 2

Let \(\phi :{\mathbb {E}}\rightarrow \bar{{\mathbb {R}}}\) be a proper lower semicontinuous function and \(x\in \text {dom}\;\phi \). We call the function \(s(\cdot \;;\!x): {\mathbb {E}} \rightarrow \bar{{\mathbb {R}}}\), with \(\text {dom}\; s(\cdot \;;\!x)=\text {dom}\;\phi \), a p higher-order surrogate of \(\phi \) at x if it has the following properties:

  (i)

    the error function

    $$\begin{aligned} e(y;\!\!x)=s(y;\!\!x) - \phi (y), \end{aligned}$$
    (9)

    with \(\text {dom}\;e\) open such that \(\text {dom}\;\phi \subseteq \text {dom}\;e\), is p times differentiable and its p derivative is Lipschitz continuous with constant \(L_{p}^{e}\).

  (ii)

    the derivatives of the error function e satisfy

    $$\begin{aligned} \nabla ^{i}e(x;\!\!x)=0 \quad \forall i=0:p, \; x\in \text {dom}\;\phi , \end{aligned}$$
    (10)

    and there exists a positive constant \(R_{p}^{e}>0\) such that

    $$\begin{aligned} e(y;\!\!x)\ge \frac{R_{p}^{e}}{(p+1)!}\Vert y-x\Vert ^{p+1} \quad \forall x, y \in \text {dom}\;\phi . \end{aligned}$$
    (11)

Note that \(\text {dom}\;\phi \) may not be equal to \(\text {dom}\;e\) (see examples below) and from (11) we have \(s(y;x) \ge \phi (y)\) for all \(x,y\in \text {dom}\;\phi \). Next, we give two nontrivial examples of higher-order surrogate functions, see [20] for more examples.

Example 4

(Composite functions) Let \(F_1:{\mathbb {E}}\rightarrow {\mathbb {R}}\) be p times differentiable and the p derivative be Lipschitz with constant \(L_p^{{F_1}}\) and let \(F_2:{\mathbb {E}} \rightarrow \bar{{\mathbb {R}}}\) be a proper closed function. Then, for the composite function \(F=F_{1} + F_{2}\), where \(\text {dom}\;F= \text {dom}\;F_2 \), one can consider the following p higher-order surrogate function:

$$\begin{aligned} {s(y;\!\!x)}=T^{F_{1}}_{p}(y;\!\!x) + \frac{M_{p}}{(p+1)!}\Vert x-y\Vert ^{p+1} + F_2(y)\;\;\forall \;x,y\in \text {dom}\;F, \end{aligned}$$

where \(M_{p}>L_{p}^{F_{1}}\). Indeed, from the definition of the error function, we get:

$$\begin{aligned} {e(y; x)} =T^{F_{1}}_{p}(y;x) - F_{1}(y) + \frac{M_{p}}{(p+1)!}\Vert x-y\Vert ^{p+1}. \end{aligned}$$
(12)

Thus \(e(\cdot ;\!x)\), with \(\text {dom}\;e = {\mathbb {E}}\) and \(\text {dom}\;F\subseteq \text {dom}\;e\), has the p derivative Lipschitz with constant \(L_p^{F_1} + M_p\). Further, from the definition of the error function e, we have:

$$\begin{aligned} \nabla ^{i} e(x;\!\!x)&= \nabla ^{i} T^{F_1}_p(x;\!\!x) - \nabla ^i F_1(x) = \nabla ^i F_1(x) - \nabla ^i F_1(x) = 0\;\; \forall i=1:p. \end{aligned}$$

Moreover, since \(F_1\) has the p derivative Lipschitz, it follows from (3) that:

$$\begin{aligned} T^{F_1}_{p}(y;\!\!x) - F_1 (y)\ge \frac{-L_{p}^{F_1}}{(p+1)!}\Vert x-y\Vert ^{p+1}. \end{aligned}$$

Combining this inequality with (12), we get:

$$\begin{aligned} {e(y;\!\!x)}\ge \frac{M_{p} - L_{p}^{F_1}}{(p+1)!}\Vert x-y\Vert ^{p+1}. \end{aligned}$$
(13)

Hence, the error function e has \(L_{p}^{e}=M_{p} + L^{F_1}_{p}\) and \(R_{p}^{e}=M_{p}-L^{F_1}_{p}\).
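The surrogate of Example 4 can also be checked numerically. The sketch below (ours) takes \(p=2\), \(F_1(x)=\ln (1+e^{x})\) in one dimension, for which \(L_2^{F_1}\le 0.1\) is a valid Lipschitz constant of the second derivative, \(F_2=0\) and \(M_2=0.5>L_2^{F_1}\), and verifies the lower bound (13) on the error function at random pairs of points.

```python
import numpy as np

rng = np.random.default_rng(1)
L2, M2 = 0.1, 0.5                                # L2 >= Lipschitz constant of F1''

F1   = lambda x: np.logaddexp(0.0, x)            # softplus
dF1  = lambda x: 1.0 / (1.0 + np.exp(-x))        # sigmoid
d2F1 = lambda x: dF1(x) * (1.0 - dF1(x))

def surrogate(y, x):
    # s(y;x) = T_2^{F1}(y;x) + M2/3! * |y - x|^3  (with F2 = 0)
    T2 = F1(x) + dF1(x) * (y - x) + 0.5 * d2F1(x) * (y - x) ** 2
    return T2 + M2 / 6.0 * abs(y - x) ** 3

for _ in range(1000):
    x, y = rng.uniform(-5.0, 5.0, size=2)
    err = surrogate(y, x) - F1(y)                # error function e(y;x) of (9)
    assert err >= (M2 - L2) / 6.0 * abs(y - x) ** 3 - 1e-12   # bound (13)
```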

Example 5

(Proximal higher-order) Let \(F:{\mathbb {E}} \rightarrow \bar{{\mathbb {R}}}\) be a proper lower semicontinuous function. Then, we can consider the following higher-order surrogate function:

$$\begin{aligned} s(y;\!\!x)=F(y) + \frac{M_{r}}{(r+1)!}\Vert y-x\Vert ^{r+1}, \end{aligned}$$

where r is a positive integer. Indeed, the error function is:

$$\begin{aligned} e(y;\!x)=s(y;\!\!x)-F(y)=\frac{M_{r}}{(r+1)!}\Vert y-x\Vert ^{r+1}, \end{aligned}$$

where \(\text {dom}\;F\subseteq \text {dom}\;e = {\mathbb {E}}\). In this case, the error function e has the r derivative Lipschitz with \(L_{r}^{e}= M_{r}\) and \(R_{r}^{e}=M_{r}\).

In the following, we assume for problem (1) that each function \(F_{i}\), with \(i=1:m\), admits a p higher-order surrogate as in Definition 2. Then, we propose the following General Composite Higher-Order algorithm, called GCHO.

[Algorithm GCHO (pseudocode box): given \(x_{0}\in \text {dom}\;f\), for \(k\ge 0\) build p higher-order surrogates \(s_{i}(\cdot ;\!x_{k})\) of \(F_{i}\), \(i=1:m\), and choose \(x_{k+1}\) satisfying the descent condition \(g\big (s(x_{k+1};\!x_{k})\big )+h(x_{k+1})\le f(x_{k})\), referred to below as (14).]

Although our algorithm requires the next iterate \(x_{k+1}\) only to satisfy the descent condition (14), we usually generate \(x_{k+1}\) by solving the following subproblem:

$$\begin{aligned} \min _{x} g\big (s(x;\!x_{k})\big ) + h(x). \end{aligned}$$
(15)

If F and h are convex functions, then the subproblem (15) can also be convex. Indeed, for Example 4, if \(M_p \ge pL_p^{F_1}\) and \(F_2\) is convex, then the surrogate function s is convex and hence the problem (15) is convex (see Theorem 1 in [24]), while for Example 5, the surrogate is convex if \(M_p\ge 0\). Hence, in the convex case we assume that \(x_{k+1}\) is the global optimum of the subproblem (15). However, in the nonconvex case, we cannot guarantee the convexity of the subproblem. In this case, we either assume that we can compute a stationary point of the subproblem (15) if g is the identity function, or we can compute an inexact solution as defined in (25) if g is a general function. Note that our algorithmic framework is quite general and yields an array of algorithms, each of which is associated with the specific properties of F and the corresponding surrogate. For example, if F is a sum of a smooth term and a nonsmooth one we can use a surrogate as in Example 4; if F is fully nonsmooth we can use a surrogate as in Example 5. This is the first time such an analysis is performed, and most of our variants of GCHO were not explicitly considered in the literature before (especially in the nonconvex settings). Note that in both Examples 4 and 5, \(x_{k+1}\) can be computed inexactly, as detailed in the next sections.
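As a simple runnable instance of GCHO (ours, not the general implementation), take \(p=1\), g the identity function, F a convex quadratic, h the indicator of a box and the surrogate of Example 4 with \(F_2=0\); then subproblem (15) has the closed-form solution \(x_{k+1}=\text {proj}_{[lo,hi]^n}\big (x_{k}-\nabla F(x_{k})/M_{1}\big )\), and the per-iteration decrease of f can be verified along the run.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
A = rng.standard_normal((n, n))
Q = A.T @ A                                            # F(x) = 0.5 x^T Q x
L1 = np.linalg.norm(Q, 2)                              # L_1^F = ||Q||_2
M1 = 1.5 * L1                                          # M_1 > L_1^F (Example 4)
lo, hi = -1.0, 1.0                                     # h = indicator of [lo, hi]^n

F = lambda x: 0.5 * x @ Q @ x
gradF = lambda x: Q @ x

x = rng.uniform(lo, hi, n)                             # feasible starting point
for k in range(100):
    x_new = np.clip(x - gradF(x) / M1, lo, hi)         # minimizer of subproblem (15)
    # sufficient decrease: F drops by at least (M1 - L1)/2 * ||x_new - x||^2
    assert F(x_new) <= F(x) - 0.5 * (M1 - L1) * np.linalg.norm(x_new - x) ** 2 + 1e-9
    x = x_new
print("final objective value:", F(x))
```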

3.1 Nonconvex convergence analysis

In this section we consider that each \(F_{i}\), with \(i=1:m\), and h are nonconvex functions (possibly nonsmooth). Then, problem (1) becomes a pure nonconvex optimization problem. Now we are ready to analyze the convergence behavior of GCHO algorithm under these general settings. In the sequel, we assume that \(g(-R_{p}^{e})<0\). Note that since the vector \(R_{p}^{e}>0\), for all the optimization problems considered in Examples 1, 2 and 3 this assumption holds provided that \(M_p\) is large enough.

Theorem 1

Let F, g and h satisfy Assumption 1 and additionally each \(F_{i}\) admits a p higher-order surrogate \(s_{i}\) as in Definition 2 with the constants \(L^{e}_{p}(i)\) and \(R_{p}^{e}(i)\), for \(i=1:m\). Let \(\left( x_{k}\right) _{k\ge 0}\) be the sequence generated by Algorithm GCHO, \(R_{p}^{e}=\big (R_{p}^{e}(1),\cdots ,R_{p}^{e}(m)\big )\) and \(L^{e}_{p}=\big (L^{e}_{p}(1),\cdots ,L^{e}_{p}(m)\big )\). Then, the sequence \(\left( f(x_{k})\right) _{k\ge 0}\) is nonincreasing and satisfies the following descent relation:

$$\begin{aligned} f(x_{k+1})\le f(x_{k})+\frac{g(-R_{p}^{e})}{(p+1)!} \Vert x_{k+1}-x_{k}\Vert ^{p+1}\qquad \forall k\ge 0. \end{aligned}$$
(16)

Proof

Denote \(e(x_{k+1};\!\! x_{k})=\big (e_{1}(x_{k+1};\!\!x_{k}),\cdots ,e_{m}(x_{k+1};\!\!x_{k})\big )\). Then, from the definition of the error function e and (11), we have:

$$\begin{aligned} \frac{{R_p^{e}}}{(p+1)!}\Vert x_{k+1} - x_k\Vert ^{p+1}\le e(x_{k+1};x_k) = s(x_{k+1};x_k) - F(x_{k+1}). \end{aligned}$$

This implies that:

$$\begin{aligned} F(x_{k+1}) \le s(x_{k+1};x_k) - \frac{{R_p^{e}}}{(p+1)!}\Vert x_{k+1} - x_k\Vert ^{p+1}. \end{aligned}$$

Since g is nondecreasing, we get:

$$\begin{aligned} g(F(x_{k+1}))&\le g\left( s(x_{k+1};x_k) - \frac{{R_p^{e}}}{(p+1)!}\Vert x_{k+1} - x_k\Vert ^{p+1}\right) \\&{\mathop {\le }\limits ^{(8)}} g\big (s(x_{k+1};x_k)\big ) +\frac{g(-{R_p^{e}})}{(p+1)!}\Vert x_{k+1} - x_k\Vert ^{p+1}. \end{aligned}$$

Finally, we obtain that:

$$\begin{aligned} f(x_{k+1})&\le g\big (s(x_{k+1};x_k)) + h(x_{k+1}) + \frac{g(-{R_p^{e}})}{(p+1)!}\Vert x_{k+1} - x_k\Vert ^{p+1}\\&{\mathop {\le }\limits ^{(14)}} f(x_k) + \frac{g(-{R_p^{e}})}{(p+1)!}\Vert x_{k+1} - x_k\Vert ^{p+1}, \end{aligned}$$

which yields our statement. \(\square \)

Summing (16) from \(j=0\) to k, we get:

$$\begin{aligned} \sum _{j=0}^{k}-\frac{g(-R_{p}^{e})}{(p+1)!}\Vert x_{j+1}-x_{j}\Vert ^{p+1}&\le \sum _{j=0}^{k} f(x_{j})-f(x_{j+1})\\&= f(x_{0})-f(x_{k+1})\le f(x_{0})-f^{*}. \end{aligned}$$

Taking the limit as \(k\rightarrow +\infty \), we obtain:

$$\begin{aligned} \sum \limits _{k=0}^{+\infty }\Vert x_{k}-x_{k+1}\Vert ^{p+1}< +\infty . \end{aligned}$$
(17)

Hence \(\lim _{k\rightarrow +\infty }\Vert x_{k}-x_{k+1}\Vert =0\). In our convergence analysis, we also consider the following additional assumption, which requires the existence of an auxiliary sequence that must be close to the sequence generated by GCHO algorithm and satisfy a first-order relation:

Assumption 2

Given the sequence \(\big (x_{k}\big )_{k\ge 0}\) generated by GCHO algorithm, there exist two constants \(L^{1}_{p}, L^{2}_{p}>0\) and a sequence \(\left( y_{k}\right) _{k\ge 0}\) such that:

$$\begin{aligned} \Vert y_{k+1} - x_{k}\Vert \!\le \!L^{1}_{p}\Vert x_{k+1} - x_{k}\Vert \;\text {and}\;\; S_{f}(y_{k+1})\!\le \! L^{2}_{p}\Vert y_{k+1} - x_{k}\Vert ^{p} \;\;\; \forall k\ge 0. \end{aligned}$$
(18)

In the next section, we provide concrete examples for the sequence \((y_{k})_{k\ge 0}\) satisfying Assumption 2, and the corresponding expressions for \(L_{p}^1\) and \(L_{p}^2\).

3.2 Approaching the set of stationary points

Before continuing with the convergence analysis of GCHO algorithm, let us analyze the relation between \(\Vert x_{k+1}-x_{k}\Vert ^{p}\) and \(S_{f}(x_{k+1})\) and also give examples when Assumption 2 is satisfied. For simplicity, consider the following simple composite minimization problem:

$$\begin{aligned} \min \limits _{x} f(x):=F(x)+h(x), \end{aligned}$$

where F is a p times differentiable function, having the p derivative \(L_{p}^{F}\)-Lipschitz, and h is a proper lower semicontinuous function. In this case g is the identity function and we can take as a surrogate \(s(y;\!x)=T_{p}^{F}(y;\!x) + \frac{M_{p}}{(p+1)!}\Vert x-y\Vert ^{p+1} + h(y)\), with the positive constant \(M_{p}\) satisfying \(M_{p} > L_{p}^{F}\) and \(g(-R_p^e)<0\). The following lemma gives an example when Assumption 2 holds.

Lemma 1

Assume g is the identity function, F has the p derivative Lipschitz and \(x_{k+1}\) is a stationary point of the following subproblem:

$$\begin{aligned} x_{k+1}\in \mathop {{\mathrm{arg\,min}}}\limits \limits _{x} T_{p}^{F}(x;\!\!x_{k})+\dfrac{M_{p}}{(p+1)!}\Vert x-x_{k}\Vert ^{p+1} + h(x). \end{aligned}$$
(19)

Then, Assumption 2 holds with \(y_{k+1}=x_{k+1}\), \(L^{1}_{p}=1\) and \(L^{2}_{p}=\frac{M_{p} + L_{p}^{F}}{p!}\).

Proof

Since \(x_{k+1}\) is a stationary point of subproblem (19), using (6), we get:

$$\begin{aligned} {\dfrac{M_{p}}{p!}\Vert x_{k+1}-x_{k}\Vert ^{p-1}B (x_{k}-x_{k+1})- \nabla T_{p}^{F}(x_{k+1};\!x_{k})\in \partial h(x_{k+1}),} \end{aligned}$$

or equivalently

$$\begin{aligned}&\dfrac{M_{p}}{p!}\Vert x_{k+1}-x_{k}\Vert ^{p-1}B(x_{k}-x_{k+1}) +\Big (\nabla F(x_{k+1})-\nabla T_{p}^{F}(x_{k+1};\!x_{k})\Big )\\&\quad \in \nabla F(x_{k+1})+\partial h(x_{k+1})=\partial f(x_{k+1}). \end{aligned}$$

Taking into account that F is p-smooth, we further get:

$$\begin{aligned} S_{f}(x_{k+1})&\le \dfrac{M_{p}}{p!}\Vert x_{k+1}-x_{k}\Vert ^{p} + \Vert \nabla F(x_{k+1})-\nabla T_{p}^{F}(x_{k+1};\!x_{k})\Vert _{*}\\&{\mathop {\le }\limits ^{(4)}} \dfrac{M_{p}+L_{p}^{F}}{p!}\Vert x_{k+1}-x_{k}\Vert ^{p}.\nonumber \end{aligned}$$
(20)

Hence, Assumption 2 holds with \(y_{k+1} = x_{k+1}\), \(L^{1}_{p}=1\) and \(L^{2}_{p}=\frac{M_{p} + L_{p}^{F}}{p!}\) . \(\square \)

The algorithm GCHO, which generates a sequence \((x_k)_{k \ge 0}\) satisfying the descent (14) and the stationarity condition (19), has also been considered e.g., in the recent papers [20, 25], with h assumed to be a convex function. Here we remove this assumption on h.

Combining (20) and (16), we further obtain:

$$\begin{aligned} S_{f}(x_{k+1})^{\frac{p+1}{p}}&\le \left( \dfrac{M_{p}+L_{p}^{F}}{p!}\right) ^{\frac{p+1}{p}} \dfrac{(p+1)!}{M_{p}-L_{p}^{F}}\Big (f(x_{k})-f(x_{k+1})\Big )\\&=C_{M_{p},L_{p}^{F}}\Big (f(x_{k})-f(x_{k+1})\Big ), \end{aligned}$$

where \(C_{M_{p},L_{p}^{F}}=\left( \dfrac{M_{p}+L_{p}^{F}}{p!}\right) ^{\frac{p+1}{p}}\dfrac{(p+1)!}{M_{p}-L_{p}^{F}}\). Summing the last inequality from \(j=0:k-1\), and using that f is bounded from below by \(f^{*}\), we get:

$$\begin{aligned} \sum ^{k-1}_{j=0}S_{f}(x_{j})^{\frac{p+1}{p}}&\le C_{M_{p},L_{p}^{F}}\Big (f(x_{0})-f(x_{k})\Big ) \le C_{M_{p},L_{p}^{F}}\Big (f(x_{0})-f^{*}\Big ). \end{aligned}$$

Hence:

$$\begin{aligned} \min _{j=0:k-1}S_{f}(x_{j})\le \dfrac{\left( C_{M_{p},L_{p}^{F}}(f(x_{0})-f^{*})\right) ^{\frac{p}{p+1}}}{k^{\frac{p}{p+1}}}. \end{aligned}$$

Thus, we have proved convergence for the simple composite problem under slightly more general assumptions than in [20, 25], i.e., F and h are possibly nonconvex functions. Finally, if we have \( \Vert x_{k+1}-x_{k}\Vert ^{p} \le \frac{p!}{L^F_{p}+M_{p}}\epsilon \), then from (20) it follows that \(S_f (x_{k+1})\le \epsilon \), i.e., \(x_{k+1}\) is nearly stationary for f. Note that in the previous Lemma 1, we assume \(x_{k+1}\) to be a stationary point of the following subproblem (see (19)):

$$\begin{aligned} x_{k+1} \in \mathop {{\mathrm{arg\,min}}}\limits _{x} s(x;\!x_{k}). \end{aligned}$$
(21)

However, our stationary condition for \(x_{k+1}\) can be relaxed to the following inexact optimality criterion (see also [2]):

$$\begin{aligned} \Vert g_{x_{k+1}}\Vert \le \theta \Vert x_{k+1}-x_{k}\Vert ^{p}, \end{aligned}$$
(22)

where \(g_{x_{k+1}}\in \partial s(x_{k+1};\!x_{k}) \) and \(\theta >0\). For simplicity of the exposition, in our convergence analysis below for this particular case (i.e., g identity function) we assume however that \(x_{k+1}\) satisfies the exact stationary condition (21), although our results can be extended to the inexact stationary condition from above. The situation is dramatically different for the general composite problem (1). When g is nonsmooth, the distance \(\text {dist}\big (0,\partial f(x_{k+1})\big )\) will typically not even tend to zero in the limit, although we have seen that \(\Vert x_{k+1}-x_{k}\Vert ^{p}\) converges to zero. Indeed, consider the minimization of the following function:

$$\begin{aligned} f(x)=\max \big (x^{2}-1,1-x^{2}\big ). \end{aligned}$$

For \(p=1\), we have \(L_{1}^{F}(1)=L_{1}^{F}(2)=2\). Taking \(x_{0}>1\) and \(M_{1}=M_{2}=4\), GCHO algorithm becomes:

$$\begin{aligned} x_{k+1}=\mathop {{\mathrm{arg\,min}}}\limits \limits _{x} Q(x,x_{k})\;\Big (:=\max \big ( Q_{1}(x,x_{k}) , Q_{1}(x,x_{k}) - 4xx_{k}+2x_{k}^2 + 2\big )\Big ), \end{aligned}$$

where \(Q_{1}(x,x_{k})=2x^{2} - 2xx_{k}+ x_{k}^{2}-1 \). Let us prove by induction that \(x_{k}> 1\) for all \(k\ge 0\). Assume that \(x_{k}>1\) for some \(k\ge 0\). We notice that the polynomials \(Q_{2}(x,x_{k}):=Q_{1}(x,x_{k}) - 4xx_{k}+2x_{k}^2 + 2\) and \(Q_{1}(x,x_{k})\) are 2-strongly convex functions and they intersect in a unique point \({\bar{x}}=\frac{x_{k}^2+1}{2x_{k}}\). Also, the minimizer of \(Q_{2}\) is \({{\bar{x}}}_{2}=\frac{3}{2} x_{k}\) and the minimizer of \(Q_{1}\) is \({{\bar{x}}}_{1}:=\frac{1}{2}x_{k}\), satisfying \({{\bar{x}}}_{1}\le {\bar{x}}\le {{\bar{x}}}_{2}\). Let us prove that \(x_{k+1}={\bar{x}}\). Indeed, if \(x\le {\bar{x}}\), then \(Q(x,x_{k}) = Q_{2}(x,x_{k})\) and it is nonincreasing on \((-\infty ,{\bar{x}}]\). Hence, \(Q(x,x_{k})\ge Q({\bar{x}},x_{k})\) for all \(x\le {\bar{x}}\). Further, if \(x\ge {\bar{x}}\), then \(Q(x,x_{k}) = Q_{1}(x,x_{k})\) and it is nondecreasing on \( [{\bar{x}},+\infty )\). Hence, \(Q(x,x_{k})\ge Q({\bar{x}},x_{k})\) also for all \(x\ge {\bar{x}}\). Finally, we have that \(Q(x,x_{k})\ge Q({\bar{x}},x_{k})\) for all \(x\in {\mathbb {R}}\), i.e., \(x_{k+1}={\bar{x}}\). Since \(x_{k}>1\), we also get that \(x_{k+1}=\frac{x_{k}^2+1}{2x_{k}} > 1\). Moreover, since \(x_{k}>1\), we have \(\partial f(x_{k})=\{2x_{k}\}\), with \(2x_{k} > 2\), and thus \(S_{f}(x_{k})\ge 2 > 0\). Further, \(x_{k+1} < x_{k}\) and the sequence is bounded below by 1, thus \((x_{k})_{k\ge 0}\) is convergent and its limit is 1. Indeed, assume that \(x_{k}\rightarrow {\hat{x}}\) as \(k \rightarrow \infty \). Then, we get \( {\hat{x}} = \frac{{\hat{x}}^{2} + 1}{2{\hat{x}}}\) and thus \({\hat{x}} = 1\) (recall that \({\hat{x}}\ge 1\)). Consequently, \(\Vert x_{k+1} - x_{k}\Vert \) also converges to 0. Therefore, we must look elsewhere for a connection between \(S_{f}(\cdot )\) and \(\Vert x_{k+1} - x_{k}\Vert ^{p}\).
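The behavior described above is easy to reproduce numerically; the short sketch below (ours) iterates \(x_{k+1}=\frac{x_{k}^2+1}{2x_{k}}\) and prints the vanishing step lengths together with \(\text {dist}(0,\partial f(x_{k}))=2x_{k}\), which stays above 2.

```python
x = 10.0                                    # any starting point x_0 > 1
for k in range(25):
    x_new = (x ** 2 + 1.0) / (2.0 * x)      # GCHO step for p = 1, M_1 = M_2 = 4
    if k % 5 == 0:
        # step length -> 0, while dist(0, df(x_k)) = 2*x_k stays >= 2
        print(k, abs(x_new - x), 2.0 * x)
    x = x_new
```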

Let us now consider the following subproblem:

$$\begin{aligned} {\mathcal {P}}(x_{k})=\mathop {{\mathrm{arg\,min}}}\limits _{y} {\mathcal {M}}_{p}(y,x_{k}), \quad \text {where} \;\; {\mathcal {M}}_{p}(y,x):= f(y)+\dfrac{\mu _{p}}{(p+1)!}\Vert y-x\Vert ^{p+1}, \end{aligned}$$
(23)

where \(\mu _{p}>g(L^{e}_{p})\). Since f is assumed bounded from below, for any fixed x the function \(y\mapsto {\mathcal {M}}_{p}(y,x)\) is coercive and hence the optimal value \({\mathcal {M}}_{p}^* =\inf \limits _{y}{\mathcal {M}}_{p}(y,x_{k})\) is finite. Then, the subproblem (23) is equivalent to:

$$\begin{aligned} \inf \limits _{y\in {\mathcal {B}}_k} f(y)+\dfrac{\mu _{p}}{(p+1)!}\Vert y-x_{k}\Vert ^{p+1}, \end{aligned}$$

for some compact set \({\mathcal {B}}_k\). Since \({\mathcal {M}}_{p}\) is a proper lower semicontinuous function in the first argument and \({\mathcal {B}}_k\) is a compact set, from the Weierstrass theorem the infimum \({\mathcal {M}}_{p}^*\) is attained, i.e., there exists \({\bar{y}}_{k+1} \in {\mathcal {P}}(x_{k})\) such that \({\mathcal {M}}_{p}({\bar{y}}_{k+1},x_{k}) = {\mathcal {M}}_{p}^*\). Since the level sets of \(y \mapsto {\mathcal {M}}_{p}(y,x_{k})\) are compact, \( {\mathcal {P}}(x_{k})\) is nonempty and compact and one can consider the following point:

$$\begin{aligned} y_{k+1} =\mathop {{\mathrm{arg\,min}}}\limits _{y\in {\mathcal {P}}(x_{k}) } \Vert y-x_{k}\Vert . \end{aligned}$$
(24)

Let us assume that each \(F_{i}\) admits a higher-order surrogate as in Definition 2, where the error functions \(e_{i}\) are p times differentiable with the p derivative Lipschitz with constants \(L_{p}^{e}(i)\), for all \(i=1:m\). Denote \(L_{p}^{e}=\big (L_{p}^{e}(1),\cdots ,L_{p}^{e}(m)\big )\) and define the following positive constant \(C_{L^{e}_{p}}^{\mu _{p}}= \dfrac{\mu _{p}}{\mu _{p} - g(L^{e}_{p})}\) (recall that \(\mu _{p}\) is chosen such that \(\mu _{p}>g(L^{e}_{p})\)). The next lemma shows that Assumption 2 holds provided that we compute \(x_{k+1}\) as an approximate local solution of subproblem (15) (hence, \(x_{k+1}\) does not need to be a global optimum) and \(y_{k+1}\) as in (24).

Lemma 2

Let the assumptions of Theorem 1 hold and additionally there exists \(\delta >0\) such that \(x_{k+1}\) satisfies the following inexact optimality condition:

$$\begin{aligned}&g\big (s(x_{k+1};\!\!x_k)\big ) + h(x_{k+1}) - \!\!\!\min _{{ {x: \; \Vert x - x_k\Vert \le D_k }}} \left( g\big (s(x;\!\!x_k)\big ) + h(x) \right) \\&\quad \le \delta \Vert x_{k+1}-x_k\Vert ^{p+1}, \nonumber \end{aligned}$$
(25)

where \(D_k:= \left( \frac{(p+1)!}{\mu _p}(f(x_k) - f^*)\right) ^{\frac{1}{p+1}}\). Then, Assumption 2 holds with \(y_{k+1}\) given in (24), \(L^{1}_{p}=\left( C_{L^{e}_{p}}^{\mu _{p}} + \frac{{\delta (p+1)!}}{\mu _p - g(L^{e}_p)}\right) ^{1/(p+1)}\) and \(L^{2}_{p}=\frac{\mu _{p}}{p!}\).

Proof

From the definition of \(y_{k+1}\) in (24), we have:

$$\begin{aligned} f(y_{k+1})+\frac{\mu _{p}}{(p+1)!}\Vert y_{k+1}-x_{k}\Vert ^{p+1}&= \min \limits _{y} f(y) + \frac{\mu _{p}}{(p+1)!}\Vert y-x_{k}\Vert ^{p+1}\\&\le f(x_{k+1}) + \frac{\mu _{p}}{(p+1)!}\Vert x_{k+1}-x_{k}\Vert ^{p+1}.\nonumber \end{aligned}$$
(26)

Further, taking \(y=x_k\) in (26) we also have:

$$\begin{aligned} f(y_{k+1}) + \frac{\mu _{p}}{(p+1)!}\Vert y_{k+1}-x_{k}\Vert ^{p+1} \le f(x_k), \end{aligned}$$

which implies that:

$$\begin{aligned} \Vert y_{k+1} - x_k\Vert \le \left( \frac{(p+1)!}{\mu _p}(f(x_k) - f^*)\right) ^{\frac{1}{p+1}} = D_k. \end{aligned}$$
(27)

Note that since the error functions \(e_{i}\)’s have the p derivative Lipschitz with constants \(L_{p}^{e}(i)\)’s, then using (3), we get:

$$\begin{aligned} \vert e_{i}(y;\!\!x_{k}) - T_p^{e_{i}}(y;\!\!x_{k}) \vert \le \frac{L_p^{e}(i)}{(p+1)!} \Vert y-x_{k}\Vert ^{p+1} \quad \forall i=1:m,\quad \forall y\in \text {dom}\;e_i. \end{aligned}$$

From (10), the Taylor approximations of \(e_{i}\)’s of order p at \(x_{k}\), \(T_p^{e}(y;\!\!x_{k})\), are zero. Hence we get:

$$\begin{aligned} \vert s_{i}(y;\!\!x_{k}) - F_{i}(y) \vert =\vert e_{i}(y;\!\!x_{k}) \vert \le \frac{L_p^{e}(i)}{(p+1)!} \Vert y-x_{k}\Vert ^{p+1} \quad \forall i=1:m. \end{aligned}$$
(28)

Further, since \(F(x_{k+1})\le s(x_{k+1};\!\!x_{k})\) (see (11)) and g is a nondecreasing function, we have:

$$\begin{aligned} f(x_{k+1})&\le g \Big (s(x_{k+1};\!\!x_{k})\Big ) + h(x_{k+1})\\&{\mathop {\le }\limits ^{(25)}} \min _{{ y:\;\Vert y - x_k\Vert \le D_k }} g \Big (s(y;\!\!x_{k})\Big ) + h(y) + \delta \Vert x_{k+1} - x_k\Vert ^{p+1}\\&{\mathop {\le }\limits ^{(28)}} \min _{{y:\;\Vert y - x_k\Vert \le D_k}} g\left( F(y) + \frac{L_p^{e}}{(p+1)!}\Vert y - x_k\Vert ^{p+1}\right) \\&\qquad + h(y) + \delta \Vert x_{k+1} - x_k\Vert ^{p+1}\\&{\mathop {\le }\limits ^{(8)}} \min _{{y:\;\Vert y - x_k\Vert \le D_k}} f(y) + \frac{g(L^{e}_p)}{(p+1)!}\Vert y - x_k\Vert ^{p+1} + \delta \Vert x_{k+1} - x_k\Vert ^{p+1}\\&{\mathop {\le }\limits ^{(27)}} f(y_{k+1}) + \dfrac{g(L^{e}_{p})}{(p+1)!}\Vert y_{k+1}-x_{k}\Vert ^{p+1} + \delta \Vert x_{k+1} - x_k\Vert ^{p+1}. \end{aligned}$$

Then, combining the last inequality with (26), we get:

$$\begin{aligned} \Vert y_{k+1}-x_{k}\Vert ^{p+1} \le \dfrac{\mu _{p} + {\delta (p+1)!}}{\mu _{p} - g(L^{e}_{p})} \Vert x_{k+1}-x_{k}\Vert ^{p+1}, \end{aligned}$$

which is the first statement of Assumption 2. Further, using (6) and optimality conditions for \(y_{k+1}\), we obtain:

$$\begin{aligned} 0 \in \partial f(y_{k+1}) + \frac{\mu _{p}}{p!}\Vert y_{k+1}-x_{k}\Vert ^{p-1} B(y_{k+1}-x_{k}). \end{aligned}$$

It follows that:

$$\begin{aligned} S_{f}(y_{k+1})\le \frac{\mu _{p}}{p!}\Vert y_{k+1}-x_{k}\Vert ^{p}. \end{aligned}$$

Hence, Assumption 2 holds with \(y_{k+1}\) given in (24), \(L^{1}_{p}=\left( C_{L^{e}_{p}}^{\mu _{p}} + \frac{\delta (p+1)!}{\mu _p - g(L^{e}_p)}\right) ^{1/(p+1)}\) and \(L^{2}_{p}=\frac{\mu _{p}}{p!}\). \(\square \)

Finally, we provide a third (practical) example satisfying Assumption 2 when \(p=2\), \(h(\cdot ) = 0\) and g is the \(\max (\cdot )\) function.

Lemma 3

Let the assumptions of Theorem 1 hold and additionally assume that \(p=2\), \(g(\cdot ) = \max (\cdot )\) and the surrogate function \(s(\cdot ;\cdot )\) is given in Example 4 with \(F_2 = 0\). Then, the global solution of the subproblem (15) with \(h=0\), denoted \(x_{k+1}\), can be computed efficiently and consequently Assumption 2 holds with \(y_{k+1}\) given in (24), \(L^{1}_{p}=\left( C_{L^{e}_{p}}^{\mu _{p}}\right) ^{1/3}\) and \(L^{2}_{p}=\frac{\mu _{p}}{2}\).

Proof

See appendix. \(\square \)

From the proof of Lemma 3 one can see that the global minimum of subproblem (15) can be computed as:

$$\begin{aligned} x_{k+1} =x_k - H_k(u,w)^{-1}g_k(u), \end{aligned}$$

where \(H_k(u,w) = \sum _{i=1}^{m} u_i \nabla ^2 F_i(x_k) + \frac{w}{2} I\), \(g_k(u) = \sum _{i=1}^{m}u_i \nabla F_i(x_k)\) and \(l_k(u) = \sum _{i=1}^{m} u_i F_i(x_k)\), with \((u,w)\) the solution of the following dual problem:

$$\begin{aligned}&\max _{(u,w)\in D}\; l_k(u) -\frac{1}{2} \left\langle H_k(u,w)^{-1} g_k(u) , g_k(u)\right\rangle - \frac{1}{12(\sum _{i=1}^{m}u_i M_i)^2} w^3, \end{aligned}$$
(29)

with \(D = \left\{ (u,w)\in \Delta _m\times {\mathbb {R}}_+:\; H_k(u,w) \succ 0 \right\} \), i.e., a maximization of a concave function over a convex set D. Hence, if m is not too large, this convex dual problem can be solved efficiently by interior point methods [21]. In conclusion, GCHO algorithm can be implemented for \(p=2\) even for nonconvex problems, since we can effectively compute the global minimum \(x_{k+1}\) of subproblem (15) using powerful tools from convex optimization.

Define the following constant: \(D_{R_{p}^{e},L^{1,2}_{p}}=\frac{\left( L_{p}^{2}\left( L_{p}^{1}\right) ^{p}\right) ^{\frac{p+1}{p}}(p+1)!}{-g(-R_{p}^{e})}\). Then, we derive the following convergence result for GCHO algorithm in the nonconvex case.

Theorem 2

Let the assumptions of Theorem 1 hold and, additionally, let Assumption 2 hold. Then, for the sequence \(\left( x_{k}\right) _{k\ge 0}\) generated by Algorithm GCHO we have the following sublinear convergence rate:

$$\begin{aligned} \min \limits _{j=0:k-1} S_{f}(y_{j})\le \frac{\left( D_{R_{p}^{e},L^{1,2}_{p}}(f(x_{0})-f^{*}) \right) ^{\frac{p}{p+1}}}{k^{\frac{p}{p+1}}}. \end{aligned}$$

Proof

From Assumption 2, we have:

$$\begin{aligned} S_{f}(y_{k+1})&\le L_{p}^{2}\Vert y_{k+1}-x_{k}\Vert ^{p}\le L_{p}^{2}\left( L_{p}^{1}\right) ^{p}\Vert x_{k+1}-x_{k}\Vert ^{p}. \end{aligned}$$

Using the descent (16), we get:

$$\begin{aligned} S_{f}(y_{k+1})^{\frac{p+1}{p}}\le \frac{\left( L_{p}^{2}\left( L_{p}^{1}\right) ^{p} \right) ^{\frac{p+1}{p}}(p+1)!}{-g(-R_{p}^{e})}\left( f(x_{k})-f(x_{k+1})\right) . \end{aligned}$$

Summing the last inequality from \(j=0:k-1\) and taking the minimum, we get:

$$\begin{aligned} \min \limits _{j=0:k-1} S_{f}(y_{j})\le \frac{\left( D_{R_{p}^{e},L^{1,2}_{p}}(f(x_{0})-f^{*}) \right) ^{\frac{p}{p+1}}}{k^{\frac{p}{p+1}}}, \end{aligned}$$

which proves the statement of the theorem. \(\square \)

Theorem 2 requires that \(x_{k+1}\) satisfies the descent (14) and Assumption 2. However, Assumption 2, according to Lemmas 1 and 2, holds if \(x_{k+1}\) is an (inexact) stationary point or an inexact solution of the subproblem (15), respectively.

Remark 1

In summary, Assumption 2 requires an auxiliary sequence \(y_{k+1}\) satisfying:

$$\begin{aligned} \left\{ \begin{array}{ll} \Vert y_{k+1}-x_{k}\Vert \le L_{p}^{1}\Vert x_{k+1}-x_{k}\Vert \\ S_{f}(y_{k+1})\le L_{p}^{2} \Vert x_{k+1}-x_{k}\Vert ^{p}. \end{array} \right. \end{aligned}$$
(30)

If \(\Vert x_{k+1}-x_{k}\Vert \) is small, the point \(x_{k}\) is near \(y_{k+1}\), which is nearly stationary for f (recall that \(\Vert x_{k+1}-x_{k}\Vert \) converges to 0). Hence, we do not have approximate stationarity for the original sequence \(x_k\) but for the auxiliary sequence \(y_k\), which is close to the original sequence. Note that in practice, \(y_{k+1} \) does not need to be computed. The purpose of \(y_{k+1}\) is to certify that \(x_{k}\) is approximately stationary in the sense of (30). For \(p=1\) a similar conclusion was derived in [8]. For a better understanding of the behavior of the sequence \(y_{k+1}\), let us come back to our example \(f(x)=\max \big (x^{2}-1,1-x^{2}\big )\) and \(p=1\). Recall that we have proved that \(x_{k}>1\); choosing \(\mu _{p}=4\), \(y_{k+1}\) is the solution of the following subproblem:

$$\begin{aligned} y_{k+1}&= \mathop {{\mathrm{arg\,min}}}\limits \limits _{y} \max \big (y^{2}-1,1-y^{2}\big ) + 2(y-x_{k})^2. \end{aligned}$$

Then, it follows immediately that:

$$\begin{aligned} y_{k+1} = \left\{ \begin{array}{ll} \frac{2}{3}x_{k}\quad &{} \text {if } x_k > \frac{3}{2} \\ 1\quad &{} \text {if } 1\le x_{k} \le \frac{3}{2}. \end{array} \right. \end{aligned}$$

Since we have already proved that \(x_k \rightarrow 1\), we conclude that \(\vert y_{k+1} - x_{k} \vert \rightarrow 0\) and consequently \(\text {dist}(0,{\partial f(y_{k+1})})\rightarrow 0\) for \(k \rightarrow \infty \), as predicted by our theory.
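The following short sketch (ours) simulates the same example and evaluates the closed-form auxiliary points \(y_{k+1}\) above: \(\vert y_{k+1}-x_{k}\vert \) and \(\text {dist}(0,\partial f(y_{k+1}))\) both vanish, while \(\text {dist}(0,\partial f(x_{k}))\) never drops below 2.

```python
import numpy as np

def dist_subdiff(y):
    # dist(0, df(y)) for f(y) = max(y^2 - 1, 1 - y^2); 0 is in df(y) at |y| = 1
    return 0.0 if np.isclose(abs(y), 1.0) else 2.0 * abs(y)

x = 10.0                                        # any x_0 > 1
for k in range(8):
    y = 2.0 * x / 3.0 if x > 1.5 else 1.0       # closed form of y_{k+1}
    print(k, abs(y - x), dist_subdiff(y), dist_subdiff(x))
    x = (x ** 2 + 1.0) / (2.0 * x)              # GCHO iterate (p = 1, M = 4)
```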

3.3 Better rates for GCHO under KL

In this section, we show that improved rates can be derived for GCHO algorithm if the objective function satisfies the KL property. This is the first time such a convergence analysis is derived for the GCHO algorithm on the composite problem (1). We believe that this lack of analysis comes from the fact that, when g is nonsmooth and different from the identity function, one cannot directly bound the distance \(S_{f}(x_{k+1})\) by \(\Vert x_{k+1} - x_{k}\Vert \). However, using the newly introduced (artificial) point \(y_{k+1}\), we can now overcome this difficulty. Let us recall the definition of a function satisfying the Kurdyka–Lojasiewicz (KL) property (see [4] for more details).

Definition 3

A proper lower semicontinuous function \(f: {\mathbb {E}}\rightarrow \bar{{\mathbb {R}}}\) satisfies the Kurdyka–Lojasiewicz (KL) property on a compact set \(\Omega \subseteq \text {dom} \; f\) on which f takes the constant value \({f_*}\) if there exist \(\delta , \epsilon >0\) such that one has:

$$\begin{aligned} \kappa ' (f(x) - {f_*}) \cdot S_{f}(x) \ge 1 \quad \forall x\!: \text {dist}(x, \Omega ) \le \delta , \; {f_*}< f(x) < {f_*} + \epsilon , \end{aligned}$$

where \(\kappa : [0,\epsilon ] \rightarrow {\mathbb {R}}\) is a concave differentiable function satisfying \(\kappa (0) = 0\) and \(\kappa '>0\).

When \( \kappa \) takes the form \(\kappa (t) = \sigma _q^{\frac{1}{q}} \frac{q}{q-1} t^{\frac{q-1}{q}}\), with \(q >1\) and \(\sigma _q>0\) (which is our interest here), the KL property establishes the following local geometry of the nonconvex function f around a compact set \(\Omega \):

$$\begin{aligned} f(x) - {f_*} \le \sigma _q S_{f}(x)^q \quad \forall x\!: \; \text {dist}(x, \Omega ) \le \delta , \; {f_*}< f(x) < {f_*} + \epsilon . \end{aligned}$$
(31)

Note that the relevant aspect of the KL property is when \(\Omega \) is a subset of stationary points for f, i.e. \(\Omega \subseteq \{x: 0 \in \partial f(x) \}\), since it is easy to establish the KL property when \(\Omega \) is not related to stationary points. The KL property holds for a large class of functions including semi-algebraic functions (e.g., real polynomial functions), vector or matrix (semi)norms (e.g., \(\Vert \cdot \Vert _p\) with \(p \ge 0\) rational number), logarithm functions, exponential functions and uniformly convex functions, see [4] for a comprehensive list. In particular, the max (sup) of semi-algebraic functions is a semi-algebraic function, see [6] (Example 2). Let us show that if \((x_{k})_{k\ge 0}\) is bounded, then also \(( y_{k})_{k\ge 0}\) is bounded and they have the same limit points.

Lemma 4

Let \(\left( x_{k}\right) _{k\ge 0}\) generated by Algorithm GCHO be bounded and let \((y_{k})_{k\ge 0}\) satisfy Assumption 2. Then, \((y_{k})_{k\ge 0}\) is bounded and the set of limit points of the sequence \(\left( y_{k}\right) _{k\ge 0}\) coincides with the set of limit points of \(\left( x_{k}\right) _{k\ge 0}\).

Proof

Indeed, since \(\left( x_{k}\right) _{k\ge 0}\) is bounded, then it has limit points. Let \(x_{*}\) be a limit point of the sequence \(\left( x_{k}\right) _{k\ge 0}\). Then, there exists a subsequence \(( x_{k_{t}})_{t\ge 0}\) such that \(x_{k_{t}}\rightarrow x_{*}\) for \(t\rightarrow \infty \). We have:

$$\begin{aligned} \Vert y_{k_{t}}-x_{k_{t}}\Vert&\le \Vert y_{k_{t}}-x_{k_{t}-1}\Vert +\Vert x_{k_{t}}-x_{k_{t}-1}\Vert \\&{\mathop {\le }\limits ^{(18)}} \left( L^{1}_{p} + 1\right) \Vert x_{k_{t}}-x_{k_{t}-1}\Vert \qquad \forall k\ge 0,\nonumber \end{aligned}$$
(32)

Since \((x_k)_{k\ge 0}\) is bounded and \(\Vert x_{k+1} - x_{k}\Vert \rightarrow 0\), then \((y_{k})_{k\ge 0}\) is also bounded. This implies that \(y_{k_{t}}\rightarrow x_{*}\). Hence, \(x_{*}\) is also a limit point of the sequence \(\left( y_{k}\right) _{k\ge 0}\). Further, let \(y_{*}\) be a limit point of the bounded sequence \(\left( y_{k}\right) _{k\ge 0}\). Then, there exists a subsequence \(( y_{{\bar{k}}_{t}})_{t\ge 0}\) such that \(y_{{\bar{k}}_{t}}\rightarrow y_{*}\) for \(t\rightarrow \infty \). Taking \(t\rightarrow \infty \) in an inequality similar to (32) and using \(\lim _{t\rightarrow \infty } \Vert x_{{\bar{k}}_t} - x_{{\bar{k}}_t - 1}\Vert = 0 \) and boundedness of \((x_k)_{k\ge 0}\), we get that \(x_{{\bar{k}}_{t}}\rightarrow y_{*}\), i.e., \(y_{*}\) is also a limit point of the sequence \(\left( x_{k}\right) _{k\ge 0}\). \(\square \)

Note that usually for deriving convergence rates under KL condition, we need to assume that the sequence generated by the algorithm is bounded (see e.g., Theorem 1 in [6]). Let us denote the set of limit points of \((x_{k})_{k\ge 0}\) by:

$$\begin{aligned} \Omega (x_{0})&=\lbrace {\bar{x}}\in {\mathbb {E}}: \exists \text { an increasing sequence of integers } (k_{t})_{t\ge 0},\\&\quad \text { such that } x_{k_{t}}\rightarrow {\bar{x}} \text { as } t\rightarrow \infty \rbrace , \end{aligned}$$

and the set of stationary points of problem (1) by \(\text {crit}f\).

Lemma 5

Let the assumptions of Theorem 1 hold. Additionally, assume that \(\left( x_{k}\right) _{k\ge 0}\) is bounded, \((y_{k})_{k\ge 0}\) satisfies Assumption 2 and f is continuous. Then, we have: \(\emptyset \ne \Omega (x_{0}) \subseteq \text {crit} f \), \( \Omega (x_{0})\) is a compact and connected set, and f is constant on \( \Omega (x_{0})\), i.e., \(f(\Omega (x_{0})) = f_*\).

Proof

First let us show that \(f(\Omega (x_{0}))\) is constant. From (16) we have that \(\left( f(x_{k})\right) _{k\ge 0}\) is monotonically decreasing and since f is assumed bounded from below, it converges, let us say to \({f_*}>-\infty \), i.e. \(f(x_{k})\rightarrow {f_*}\) as \(k\rightarrow \infty \). On the other hand, let \(x_{*}\) be a limit point of the sequence \(\left( x_{k}\right) _{k\ge 0}\). This means that there exists a subsequence \(\left( x_{k_{t}}\right) _{t\ge 0}\) such that \(x_{k_{t}}\rightarrow x_{*} \). Since f is continuous, \(f(x_{k_{t}})\rightarrow f(x_{*}) = {f_*}\). In conclusion, we have \(f(\Omega (x_{0}))={f_*}\). The closedness property of \(\partial f\) implies that \(S_{f}(x_{*})=0\), and thus \(0\in \partial f(x_{*})\). This proves that \(x_{*}\) is a stationary point of f and thus \(\Omega (x_{0})\) is nonempty. Finally, \(\Omega (x_{0})\) can be viewed as an intersection of compact sets:

$$\begin{aligned} \Omega (x_{0})=\cap _{q\ge 0} \overline{\cup _{k\ge q}\lbrace x_{k}\rbrace }, \end{aligned}$$

so it is also compact. This completes our proof. \(\square \)

Note that \(f_*\) from Lemma 5 is usually different from \(f^* = \inf _{x\in \text {dom}f } f(x)\) defined in Assumption 1. In addition, let us consider the following assumption:

Assumption 3

For the sequence \(\big (x_{k}\big )_{k\ge 0}\) generated by GCHO algorithm, there exist positive constants \(\theta _{1,p},\theta _{2,p}>0 \) such that:

$$\begin{aligned} f(x_{k+1})\le f(y_{k+1}) + \theta _{1,p}\Vert y_{k+1}-x_{k}\Vert ^{p+1} + \theta _{2,p}\Vert x_{k+1}-x_{k}\Vert ^{p+1} \quad \forall k\ge 0. \end{aligned}$$
(33)

Remark 2

Note that Assumption 3 holds when e.g., g is the identity function or when \((y_{k})_{k\ge 0}\) is given in (24) and \(x_{k+1}\) satisfies (25) (see Lemmas 2 and 3). For completeness, we provide a proof for this statement in Appendix.

Let us also recall the following lemma, whose proof is similar to the one in [1](Theorem 2). For completeness, we give the proof in Appendix.

Lemma 6

Let \(\theta >0\), \(C_{1},C_{2}\ge 0\) and \((\lambda _{k})_{k\ge 0}\) be a nonnegative, nonincreasing sequence, satisfying the following recurrence:

$$\begin{aligned} \lambda _{k+1}\le C_{1}\left( \lambda _{k}-\lambda _{k+1}\right) ^{\frac{1}{\theta }} +C_{2}\left( \lambda _{k}-\lambda _{k+1}\right) . \end{aligned}$$
(34)

If \(\theta \le 1\), then there exists an integer \(k_{0}\) such that:

$$\begin{aligned} \lambda _{k}\le \left( \frac{C_{1}+C_{2}}{1+C_{1}+C_{2}}\right) ^{k-k_{0}}\lambda _{0} \qquad \forall k\ge k_{0}. \end{aligned}$$

If \(\theta >1\), then there exists \(\alpha > 0\) and integer \(k_{0}\) such that:

$$\begin{aligned} \lambda _{k}\le \frac{\alpha }{(k-k_{0})^{\frac{1}{\theta -1}}}\qquad \forall k\ge k_{0}. \end{aligned}$$
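The two regimes of Lemma 6 can be illustrated numerically; the sketch below (ours) takes equality in the recurrence (34) with \(C_{1}=C_{2}=1\), solves for \(\lambda _{k+1}\) by bisection at each step, and prints the decay for \(\theta \le 1\) (linear) and \(\theta >1\) (sublinear).

```python
def step(lam, C1, C2, theta):
    # solve z = C1*(lam - z)**(1/theta) + C2*(lam - z) for z in [0, lam];
    # the right-hand side is decreasing in z, so plain bisection works
    lo, hi = 0.0, lam
    for _ in range(60):
        z = 0.5 * (lo + hi)
        rhs = C1 * (lam - z) ** (1.0 / theta) + C2 * (lam - z)
        lo, hi = (z, hi) if rhs > z else (lo, z)
    return 0.5 * (lo + hi)

for theta in (0.8, 2.0):                 # theta <= 1 vs. theta > 1
    lam = 1.0
    for k in range(1, 201):
        lam = step(lam, 1.0, 1.0, theta)
        if k in (50, 100, 200):
            print(theta, k, lam)
```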

From previous lemmas, all the conditions of the KL property from Definition 3 are satisfied. Then, we can derive the following convergence rates depending on the KL parameter.

Theorem 3

Let the assumptions of Lemma 5 hold. Additionally, assume that f satisfies the KL property (31) on \(\Omega (x_0)\) and Assumption 3 is valid. Then, the following convergence rates hold for the sequence \((x_{k})_{k\ge 0}\) generated by GCHO algorithm:

  • If \(q\ge \frac{p+1}{p}\), then \(f(x_{k})\) converges to \({f_*}\) linearly for k sufficiently large.

  • If \(q < \frac{p+1}{p} \), then \(f(x_{k})\) converges to \({f_*}\) at sublinear rate of order \({\mathcal {O}}\left( \frac{1}{k^{\frac{pq}{p+1-pq}}}\right) \) for k sufficiently large.

Proof

Since \((x_k)_{k\ge 0}\) and \((y_k)_{k \ge 0}\) have the same limit points, we get:

$$\begin{aligned} f(x_{k+1})-{f_*}&{\mathop {\le }\limits ^{(33)}} f(y_{k+1})-{f_*} + \theta _{1,p}\Vert y_{k+1}-x_{k}\Vert ^{p+1}+ \theta _{2,p}\Vert x_{k+1} - x_k\Vert ^{p+1}\\&{\mathop {\le }\limits ^{(31) + (18)}}\sigma _{q} S_f(y_{k+1})^{q} + \left( \theta _{1,p}(L_{p}^{1})^{p+1} + \theta _{2,p}\right) \Vert x_{k+1}-x_{k}\Vert ^{p+1}\\&{\mathop {\le }\limits ^{(18)}} \sigma _{q}\left( L_{p}^{2} (L_p^1)^p \right) ^{q}\Vert x_{k+1}-x_{k}\Vert ^{qp} \\&\qquad + \left( \theta _{1,p}(L_{p}^{1})^{p+1} + \theta _{2,p}\right) \Vert x_{k+1}-x_{k}\Vert ^{p+1}. \end{aligned}$$

If we define \(\Delta _{k}=f(x_{k})-{f_*}\), then combining the last inequality with (16), we get the following recurrence:

$$\begin{aligned} \Delta _{k+1} \le C_{1}\left( \Delta _{k}-\Delta _{k+1}\right) ^{\frac{qp}{p+1}} + C_{2}\left( \Delta _{k}-\Delta _{k+1}\right) , \end{aligned}$$

where \( C_{1}= \sigma _{q}(L_{p}^{2}(L_p^1)^p)^{q}\left( \frac{(p+1)!}{-g(-R_{p}^{e})}\right) ^{\frac{pq}{p+1}}\) and \( C_{2}= \left( \theta _{1,p}(L_{p}^{1})^{p+1} + \theta _{2,p}\right) \frac{(p+1)!}{-g(-R_{p}^{e})}.\)

Using Lemma 6, with \(\theta =\frac{p+1}{pq}\) we get our statements. \(\square \)

Remark 3

Contrary to Theorem 2, under KL we prove in Theorem 3 that the original sequence \((x_k)_{k\ge 0}\) converges in function values. When the objective function f is uniformly convex of order \(p+1\) and g does not necessarily have full domain, [9] proves linear convergence for their algorithm in function values. Our results are different, i.e., we provide convergence rates for GCHO algorithm for a possibly nonconvex objective f.

3.4 Convex convergence analysis

In this section, we assume that the objective function f in (1) is convex. Since the problem (1) is convex, we assume that \(x_{k+1}\) is a global minimum of the subproblem (15), which is convex provided that \(M_p\) is sufficiently large (see Theorem 1 in [24]). Below, we also assume that the level sets of f are bounded. Since GCHO algorithm is a descent method, this implies that there exists a constant \(R_0>0\) such that \(\Vert x_{k}-x^{*}\Vert \le R_0\) for all \(k\ge 0\), where \(x^*\) is an optimal solution of (1). Then, we get the following sublinear rate for GCHO algorithm.

Theorem 4

Let F, g and h satisfy Assumption 1 and additionally each \(F_{i}\) admits a p higher-order surrogate \(s_{i}\) as in Definition 2 with the constants \(L^{e}_{p}(i)\) and \(R_{p}^{e}(i)\), for \(i=1:m\). Additionally, f is a convex function and has bounded level sets. Let \(\left( x_{k}\right) _{k\ge 0}\) be the sequence generated by Algorithm GCHO, \(R_{p}^{e}=\big (R_{p}^{e}(1),\cdots ,R_{p}^{e}(m)\big )\) and \(L^{e}_{p}=\big (L^{e}_{p}(1),\cdots ,L^{e}_{p}(m)\big )\). Then, we have the following convergence rate:

$$\begin{aligned} f(x_{k})-f(x^{*})\le \dfrac{g(L^{e}_{p})R_0^{p+1}(p+1)^{p}}{p!k^{p}}. \end{aligned}$$

Proof

Since \(F(x_{k+1})\le s(x_{k+1};\!x_k)\) (see (11)) and g is nondecreasing, we have:

$$\begin{aligned} f(x_{k+1})&\le g\big (s(x_{k+1};\!x_{k})\big )+h(x_{k+1})\\&{\mathop {=}\limits ^{(21)}}\min \limits _{x} g\big (s(x;\!\!x_{k})\big )+h(x)\\&{\mathop {\le }\limits ^{(28)}} \min \limits _{x} g\left( F(x)+\dfrac{L^{e}_{p}}{(p+1)!}\Vert x-x_{k}\Vert ^{p+1}\right) +h(x). \end{aligned}$$

Hence we get:

$$\begin{aligned} f(x_{k+1})&{\mathop {\le }\limits ^{(8)}} \min \limits _{x} g\big (F(x)\big )+ \dfrac{g(L^{e}_p)}{(p+1)!}\Vert x-x_{k}\Vert ^{p+1} + h(x) \\&= \min \limits _{x} f(x) + \frac{g(L_p^{e})}{(p+1)!}\Vert x - x_k\Vert ^{p+1}\\&\le \min \limits _{\alpha \in [0,1]} f(x_{k})+\alpha \big [f(x^{*})-f(x_{k})\big ]+\alpha ^{p+1} \dfrac{R_0^{p+1}}{(p+1)!}g\left( L^{e}_{p}\right) , \end{aligned}$$

where the last inequality follows by restricting the minimization to the segment \(x = x_k + \alpha (x^{*}-x_k)\), \(\alpha \in [0,1]\), and using the convexity of f together with the boundedness of its level sets (which gives \(\Vert x^{*}-x_{k}\Vert \le R_0\)). The minimum over \(\alpha \ge 0\) is attained at:

$$\begin{aligned} \alpha ^{*}=\left( \dfrac{\big (f(x_{k})-f(x^{*})\big )p!}{g(L^{e}_{p}) R_0^{p+1}}\right) ^{\frac{1}{p}}. \end{aligned}$$
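Indeed, \(\alpha ^{*}\) is obtained by setting to zero the derivative, with respect to \(\alpha \), of the expression inside the last minimum above:

$$\begin{aligned} \frac{d}{d\alpha }\left( f(x_{k})+\alpha \big [f(x^{*})-f(x_{k})\big ]+\alpha ^{p+1} \dfrac{R_0^{p+1}}{(p+1)!}g\left( L^{e}_{p}\right) \right) = -\big (f(x_{k})-f(x^{*})\big ) + \alpha ^{p} \dfrac{R_0^{p+1}}{p!}g\left( L^{e}_{p}\right) = 0. \end{aligned}$$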

We have \(0\le \alpha ^{*}<1 \). Indeed, since \(\big (f(x_{k})\big )_{k\ge 0}\) is decreasing, we have:

$$\begin{aligned} f(x_{k})\le f(x_{1})&\le g\big ( s(x_{1};\!\!x_{0})\big )+h(x_{1})\\&=\min \limits _{x} g\big (s(x;\!\!x_{0})\big )+h(x)\\&{\mathop {\le }\limits ^{(28)}} \min \limits _{x} g\left( F(x)+\frac{L^{e}_{p}}{(p+1)!}\Vert x-x_{0}\Vert ^{p+1}\right) +h(x)\\&\le g\left( F(x^{*})+\frac{L^{e}_{p}}{(p+1)!}\Vert x^{*}-x_{0}\Vert ^{p+1}\right) +h(x^{*})\\&\le f(x^{*})+\dfrac{g(L^{e}_{p})R_0^{p+1}}{(p+1)!}. \end{aligned}$$

Hence:

$$\begin{aligned} 0\le \alpha ^{*}&\le \left( \dfrac{\big (f(x_{1})-f(x^{*})\big )p!}{g(L^{e}_{p} )R_0^{p+1}}\right) ^{\frac{1}{p}}\le \left( \dfrac{g(L^{e}_{p})R_0^{p+1}p!}{g(L^{e}_{p} )R_0^{p+1}(p+1)!}\right) ^{\frac{1}{p}}\\&= \left( \dfrac{p!}{(p+1)!}\right) ^{\frac{1}{p}}= \left( \dfrac{{1}}{p+1}\right) ^{\frac{1}{p}}<1. \end{aligned}$$

Thus, we conclude:

$$\begin{aligned} f(x_{k+1})&\le f(x_{k})-\alpha ^{*}\left( f(x_{k})-f(x^{*}) -\frac{g(L^{e}_{p})R_0^{p+1}}{(p+1)!}(\alpha ^{*})^{p} \right) \\&= f(x_{k})-\dfrac{p\alpha ^*}{p+1} \big [f(x_{k})-f(x^{*})\big ] . \end{aligned}$$

Denoting \(\delta _{k}=f(x_{k})-f(x^{*})\), we get the following estimate:

$$\begin{aligned} \delta _{k}-\delta _{k+1}\ge C \delta _{k}^{\frac{p+1}{p}}, \end{aligned}$$

where \({C=\frac{p}{p+1}\left( \dfrac{p!}{g(L^{e}_{p})R_0^{p+1}} \right) ^{\frac{1}{p}}}\). Thus, for \(\mu _{k} = C^{p}\delta _{k}\) we get the following recurrence:

$$\begin{aligned} \mu _{k}-\mu _{k+1}\ge \mu _{k}^{\frac{p+1}{p}}. \end{aligned}$$

Following the same proof as in [24, Theorem 4], we get:

$$\begin{aligned} \dfrac{1}{\mu _{k}}\ge \left( \dfrac{1}{\mu _{1}^{\frac{1}{p}}} +\dfrac{k-1}{p}\right) ^{p}. \end{aligned}$$

Since

$$\begin{aligned} \dfrac{1}{\mu _{1}^{\frac{1}{p}}}=\dfrac{1}{C{ \delta _{1}}^{\frac{1}{p}}}=\frac{p+1}{p}\left( \dfrac{g(L^{e}_{p}) R_0^{p+1}}{p!(f(x_{1})-f^{*})} \right) ^{\frac{1}{p}} \ge \frac{1}{p}(p+1)^{\dfrac{p+1}{p}}, \end{aligned}$$

then

$$\begin{aligned} \delta _{k}=C^{-p}\mu _{k}&=\left( \dfrac{p+1}{p}\right) ^{p} \dfrac{g(L^{e}_{p})R_0^{p+1}}{p!}\mu _{k}\\&\le \left( \dfrac{p+1}{p}\right) ^{p}\dfrac{g(L^{e}_{p}) R_0^{p+1}}{p!}\left( \frac{1}{p}(p+1)^{\frac{p+1}{p}} +\dfrac{k-1}{p}\right) ^{-p} \\&=\dfrac{g(L^{e}_{p})R_0^{p+1}}{p!}\left( (p+1)^{\frac{1}{p}} +\dfrac{k-1}{p+1}\right) ^{-p} \le \dfrac{(p+1)^{p}g(L^{e}_{p})R_0^{p+1}}{p!k^{p}}. \end{aligned}$$

This proves the statement of the theorem. \(\square \)
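As a small sanity check of the rate derived above (and of the estimate borrowed from [24, Theorem 4]), the following Python sketch (illustrative only) iterates the worst-case, equality version of the recurrence \(\mu _{k+1}=\mu _{k}-\mu _{k}^{\frac{p+1}{p}}\), starting from the largest \(\mu _1\) compatible with the bound on \(1/\mu _{1}^{1/p}\) used in the proof, and verifies numerically that \(1/\mu _{k}\ge \big (1/\mu _{1}^{1/p}+(k-1)/p\big )^{p}\):

```python
# Illustrative numerical check of the recurrence used in the proof of Theorem 4.

def check(p, K=1000):
    mu = p**p / float((p + 1)**(p + 1))      # largest mu_1 allowed by the proof's estimate
    inv_mu1_root = 1.0 / mu**(1.0 / p)       # 1/mu_1^{1/p} = (p+1)^{(p+1)/p}/p
    ok = True
    for k in range(1, K + 1):
        bound = (inv_mu1_root + (k - 1) / float(p))**p
        ok = ok and (1.0 / mu >= bound * (1.0 - 1e-12))   # tolerance for rounding
        mu = mu - mu**((p + 1.0) / p)        # worst-case (equality) update
    return ok

print(check(p=1), check(p=2), check(p=3))    # expected output: True True True
```

Since \(\mu _{k}=C^{p}\delta _{k}\), this is exactly the \({\mathcal {O}}(1/k^{p})\) decay of the function gap \(\delta _{k}\).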

Note that in the convex case the convergence results from [5, 8, 9] assume Lipschitz continuity of the p derivative of the function F, which may be too restrictive. In contrast, Theorem 4 assumes Lipschitz continuity of the p derivative of the error function \(e(\cdot )\) (note that the error function \(e(\cdot )\) may be p times differentiable with a Lipschitz p derivative, while F may not even be differentiable, see Examples 4 and 5). Hence, our proof is different from and more general than those in [5, 8, 9]. Moreover, the convergence rate from the previous theorem covers the usual convergence rates \({\mathcal {O}}(\frac{1}{k^{p}})\) of higher-order Taylor-based methods in the convex unconstrained case [24], the simple composite case [24, 25] and the composite case for \(p\ge 1\) [5, 9]. Therefore, Theorem 4 provides a unified convergence analysis for general composite higher-order algorithms that covers, in particular, minimax strategies for nonlinear games, min-max problems and simple composite problems, under possibly more general assumptions.

3.5 Adaptive GCHO algorithm

In this section, we propose an adaptive variant of the GCHO algorithm. Since the surrogate functions in all the examples given in this paper depend on a given constant M (see Examples 4 and 5, where \(M = M_p\)), below we write \(s_{M}(\cdot ;\cdot )\) for \(s(\cdot ;\cdot )\) to make this dependence explicit. Note that the convergence results from Theorems 1, 2 and 3 are derived provided that Assumptions 2 and 3 and the following properties of the sequence \((x_k)_{k\ge 0}\) generated by GCHO hold:

$$\begin{aligned}&g\big (s_{M}(x_{k+1};\!x_k)\big ) + h(x_{k+1}) \le f(x_k), \end{aligned}$$
(35)
$$\begin{aligned}&g(s_{M}(x_{k+1};\!x_k)) - g(F(x_{k+1}))\ge \frac{C_p^e}{(p+1)!}\Vert x_{k+1} - x_k\Vert ^{p+1}, \end{aligned}$$
(36)

where \(C_p^e: = -g(-R_p^e)\) is a given constant depending on the choice of the surrogate \(s_M(x_{k+1};\!x_k)\), which may be difficult to determine in practice. Hence, in the following we propose an adaptive general composite higher-order algorithm, called A-GCHO:

[Algorithm A-GCHO (figure b)]

For a better understanding of this process, let us consider Example 4, where \(F = F_1 + F_2\), with the p derivative of \(F_1\) being \(L_p^{F_1}\)-Lipschitz and \(F_2\) a proper closed convex function. In this case the surrogate is \(s_M(y;\!x) = T^{F_1}_p(y;\!x) + \frac{M}{(p+1)!}\Vert y - x\Vert ^{p+1} + F_2(y)\). Let \(R_p, M_0 > 0\) be fixed. Then, step 1 of the A-GCHO algorithm can be seen as a line search procedure (see, for example, [13]): at each step \(k \ge 0\) we choose \(M_k\ge M_0\), build \(s_{M_k}(y;\!x_k) = T^{F_1}_p(y;\!x_k) + \frac{M_k}{(p+1)!}\Vert y - x_k\Vert ^{p+1} + F_2(y)\) and compute \(x_{k+1}\) satisfying (35). If (36) does not hold, we increase \(M_k \leftarrow 2\cdot M_k\), recompute \(s_{M_k}(y;\!x_k)\) with the new \(M_k\) and go to step 2. We repeat this process until condition (36) is satisfied; a Python sketch of this step is given below. Note that this line search procedure finishes in a finite number of steps. Indeed, if \(M_k \ge R_p + L_p^{F_1}\), then from inequality (13) we get \(s_{M_k}(y;\!x_k) - F(y) \ge \frac{R_p}{(p+1)!}\Vert y - x_k\Vert ^{p+1}\) for all y; thus, taking \(y = x_{k+1}\) and using that g is increasing, (36) holds. Note also that in this case the error function e satisfies Definition 2 (i) with \(L_p^e = 2(R_p + L_p^{F_1})\). Hence, using the same convergence analysis as in the previous sections, we can derive convergence rates similar to those in Theorems 1, 2 and 3 for the A-GCHO algorithm under Assumptions 2 and 3, since the sequence \((x_k)_{k\ge 0}\) generated by A-GCHO automatically satisfies (35) and (36). For the convex case, as in Sect. 3.4, in the A-GCHO algorithm we require \(x_{k+1}\) to be a global solution of the corresponding subproblem, and consequently convergence results similar to Theorem 4 can be derived for this adaptive algorithm.
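A minimal Python sketch of this line-search step (for the surrogate of Example 4) follows. All callables (`taylor_F1`, `F2`, `g`, `F`, `norm`, `solve_subproblem`) are hypothetical placeholders and not part of the paper's implementation; the constant `C_pe` plays the role of \(-g(-R_p^e)\) in test (36).

```python
# Sketch of one adaptive step of A-GCHO: double M_k until test (36) holds.
from math import factorial

def make_surrogate(p, M, x_k, taylor_F1, F2, norm):
    # s_M(y; x_k) = T_p^{F_1}(y; x_k) + M/(p+1)! * ||y - x_k||^{p+1} + F_2(y)
    def s(y):
        return taylor_F1(y, x_k) + M / factorial(p + 1) * norm(y - x_k)**(p + 1) + F2(y)
    return s

def a_gcho_step(p, x_k, M_k, C_pe, g, F, norm, taylor_F1, F2, solve_subproblem):
    while True:
        s_M = make_surrogate(p, M_k, x_k, taylor_F1, F2, norm)
        x_next = solve_subproblem(s_M, x_k)       # returns x_{k+1} satisfying (35)
        decrease = g(s_M(x_next)) - g(F(x_next))
        if decrease >= C_pe / factorial(p + 1) * norm(x_next - x_k)**(p + 1):  # test (36)
            return x_next, M_k
        M_k *= 2.0                                # (36) failed: double M_k and retry
```

The loop terminates after finitely many doublings since, as argued above, (36) is guaranteed to hold once \(M_k \ge R_p + L_p^{F_1}\).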

4 Numerical simulations

In this section we present some preliminary numerical results for the GCHO algorithm. For the simulations, we consider the test set from [18], which consists of systems of nonlinear equations, where one searches for \(x^{*}\) such that \(F_{i}(x^{*}) = 0\) for all \(i = 1,\cdots ,m\). For solving these problems, we implement our GCHO algorithm for \(p=1,2\). We consider two formulations: a min-max problem and a least-squares problem. The min-max formulation has the form:

$$\begin{aligned} \min \limits _{x \in {\mathbb {R}}^n} f(x) := \max (F_{1}^{2}(x),\cdots ,F_{m}^{2}(x)). \end{aligned}$$
(37)

Similarly, the least-squares formulation can be written as a simple composite minimization problem:

$$\begin{aligned} \min \limits _{x \in {\mathbb {R}}^n} f(x) := \sum ^{m}_{i=1} F_{i}^{2}(x). \end{aligned}$$
(38)
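For concreteness, the two formulations can be coded as follows; the residual map below is a made-up two-dimensional example, not one of the problems from [18]:

```python
import numpy as np

def residuals(x):
    # Hypothetical system F: R^2 -> R^2 with root x* = (1, 2).
    return np.array([x[0]**2 - 1.0, x[0] * x[1] - 2.0])

def f_minmax(x):   # min-max formulation (37): max_i F_i(x)^2
    return np.max(residuals(x)**2)

def f_ls(x):       # least-squares formulation (38): sum_i F_i(x)^2
    return np.sum(residuals(x)**2)

x0 = np.array([2.0, 0.5])
print(f_minmax(x0), f_ls(x0))   # both objectives vanish only at a root of F
```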

Note that both formulations fit into our general problem (1). We consider the following two implementations. First, for problem (37) we compare the GCHO algorithm for \(p=1,2\) with IPOPT (the results are given in Table 2). Secondly, for problem (38) we compare the GCHO algorithm for \(p=1,2\) with IPOPT and the method proposed in [12] (the results are given in Table 3). At each iteration of the GCHO algorithm we replace each function \(F_{i}\) by its Taylor approximation of order p, with \(p=1,2\), plus a quadratic/cubic regularization, and solve the corresponding subproblem (15) using IPOPT [28]. In the numerical simulations we have noticed that for \(p=2\) IPOPT was able to detect a global minimizer of the subproblem at each iteration, i.e., the solution of IPOPT coincided with the solution obtained by solving the dual problem as described in the proof of Lemma 3 given in the appendix. Since it is difficult to compute the corresponding Lipschitz constants for the gradient/Hessian, we use the line search procedure described in Sect. 3.5. Note that, since in practice it is difficult to compute the sequence \((y_k)_{k\ge 1}\), we cannot use \(\text {dist}(0,\partial f(y_k))\le \epsilon \) as a stopping criterion for the proposed algorithm. Thus, the stopping criterion considered in this paper is the same as in [3]:

$$\begin{aligned} \frac{f(x_k) - f_{\text {best}}}{\max (1,f_\text {best})} \le 10^{-4}, \end{aligned}$$
Table 2 Behaviour of GCHO for \(p=1,2\) and IPOPT for the min-max formulation (37). Here "*" means that IPOPT didn’t find \(x^*\)/\(f^*\) reported in [18]
Table 3 Behaviour of GCHO algorithm for \(p=1,2\), algorithm [12] and IPOPT for the least-squares problem (38). Here "*" means that IPOPT didn’t find \(x^*\)/\(f^*\) reported in [18]

where \(f_{\text {best}} = f^*\) (approximately 0, but positive) and the starting point \(x_0\) are taken from [18]. In Tables 2 and 3, we summarize our numerical results in terms of cpu time and number of iterations for the GCHO algorithm with \(p = 1,2\), IPOPT and the method from [12]. Note that the test functions we consider in the two tables are nonconvex and most of them satisfy the KL condition (as semi-algebraic functions). From the tables, we observe that the GCHO algorithm (\(p=1\) or \(p=2\)) applied to the min-max formulation performs better than the GCHO algorithm (\(p=1\) or \(p=2\)) applied to the least-squares problem, both in cpu time and number of iterations. This is due to the fact that the regularization constants for the min-max problem (37), \(M_{p}^{\text {max}} = \big (M_{p}^{\text {max}}(1), \cdots , M_{p}^{\text {max}}(m)\big )\), are much smaller than the one for the least-squares formulation (38), \(M_{p}^\text {ls}\), i.e., \(M_{p}^\text {ls} \approx \sum _{i=1}^m M_{p}^{\text {max}} (i)\). Moreover, from the tables we observe that increasing the order of the Taylor approximation is beneficial for the GCHO algorithm: e.g., in the min-max formulation, GCHO with \(p=2\) is at least twice as fast as GCHO with \(p=1\). We also observe from Tables 2 and 3 that the GCHO algorithm applied to the min-max formulation with \(p=2\) has a better behavior (in both cpu time and number of iterations) than the method proposed in [12] applied to the least-squares formulation. Finally, the GCHO algorithm (\(p=1, 2\)) for both formulations is able to identify the global optimal points/values given in [18], while IPOPT applied directly to the formulations (37) or (38) may fail to identify them (see Tables 2 and 3).
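For completeness, a schematic Python version of the outer loop with this stopping test is given below; `gcho_step` stands for one (adaptive) GCHO iteration, e.g. the sketch from Sect. 3.5, and is a placeholder rather than the actual implementation used for the tables:

```python
def solve(f, x0, gcho_step, f_best, tol=1e-4, max_iter=500):
    # f_best = f^* (taken from [18] in our experiments, assumed positive).
    x = x0
    for k in range(max_iter):
        if (f(x) - f_best) / max(1.0, f_best) <= tol:   # stopping criterion from [3]
            return x, k
        x = gcho_step(x)
    return x, max_iter
```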