Abstract
In this paper we develop new tensor methods for unconstrained convex optimization, which solve at each iteration an auxiliary problem of minimizing a convex multivariate polynomial. We analyze the simplest scheme, based on minimization of a regularized local model of the objective function, and its accelerated version, obtained in the framework of estimating sequences. Their rates of convergence are compared with the worst-case lower complexity bounds for the corresponding problem classes. Finally, for the third-order methods, we suggest an efficient technique for solving the auxiliary problem, which is based on the recently developed relative smoothness condition (Bauschke et al. in Math Oper Res 42:330–348, 2017; Lu et al. in SIOPT 28(1):333–354, 2018). With this elaboration, the third-order methods become implementable and very fast. The rate of convergence in terms of the function value for the accelerated third-order scheme reaches the level \(O\left( {1 \over k^4}\right) \), where k is the number of iterations. This is very close to the lower bound of the order \(O\left( {1 \over k^5}\right) \), which is also justified in this paper. At the same time, in many important cases the computational cost of one iteration of this method remains on the level typical for the second-order methods.
1 Introduction
Motivation Over the last decade, we have observed increasing interest in the complexity analysis of high-order methods. Starting with the paper [31], which contained the first global rate of convergence of the Cubic Regularization of Newton's Method, it has become more and more common to provide second-order methods with worst-case complexity bounds on different problem classes (see, for example, [5, 11, 12]). New efficiency measurements in this field naturally generated a new spectrum of questions, ranging from the possibilities of accelerating second-order methods (see [27]) to lower complexity bounds (see [1, 2, 13, 18]) and attempts at constructing optimal methods [24].
Another possibility for accelerating minimization processes consists in increasing the power of the oracle. The idea of using high-order approximations in Optimization is not new. Initially, such approximations were employed in optimality conditions (see, for example, [22]). However, it seems that the majority of attempts to use high-order tensors in optimization methods were blocked by the standard obstacle related to the enormous complexity of minimizing nonconvex multivariate polynomials. To the best of our knowledge, the only theoretical analysis of such schemes for convex problems can be found in an unpublished preprint [3], which concludes with a pessimistic comment on the practical applicability of these methods. For nonconvex problems, several recent papers [8,9,10, 14] contain the complexity analysis of high-order methods designed for generating points with a small norm of the gradient. For the auxiliary nonconvex optimization problem, these methods need to guarantee a sufficient level of the first-order optimality condition and a local decrease of the objective function. However, for nonconvex functions even this moderate goal is difficult to achieve.
The key observation, which underlies all results of this paper, is that an appropriately regularized Taylor approximation of a convex function is a convex multivariate polynomial. This is indeed a very natural property, since this regularized approximation usually lies in the epigraph of the convex function. Thus, the auxiliary optimization problem in the high-order (or tensor) methods becomes generally solvable by many powerful methods of Convex Optimization. This fact explains our interest in the complexity analysis of the simplest tensor scheme (Sect. 2), based on the convex regularized Taylor approximation, and of its accelerated version (Sect. 3). The latter method is obtained by the technique of estimating functions (see [25,26,27]). Therefore it is similar to Algorithm 4.2 in [3]. The main difference consists in the correct choice of parameters ensuring convexity of the auxiliary problem. We show that this algorithm converges with the rate \(O(\left( {1 \over k}\right) ^{p+1})\), where k is the number of iterations and p is the degree of the tensor.
In Sect. 4, we derive lower complexity bounds for the tensor methods. We show that the lower bound for the rate of convergence is of the order \(O\left( \left( {1 \over k} \right) ^{3p+1 \over 2}\right) \). This result is better than the bound in [1] and coincides with the bound in [2]. However, it seems that our justification is simpler.
For practical implementations, the most important results are included in Sect. 5, where we discuss an efficient scheme for minimizing the regularized Taylor approximation of degree three. This auxiliary convex problem can be treated in the framework of the relative smoothness condition. The first element of this approach was introduced in [4], generalizing the Lipschitz condition for the norm of the gradient. In [21] it was shown that the same extension can be applied to the condition of strong convexity. This second step is important since it leads to linearly convergent methods for functions with nonstandard growth properties. The auxiliary problem with the third-order tensor is a good application of this technique. We show that the corresponding method converges linearly, with a rate depending on an absolute constant. At the end of the section, we argue that the complexity of one iteration of the resulting third-order scheme is often of the same order as that of the second-order methods.
In the final Sect. 6, we discuss the presented results and mention open problems.
Notations and generalities In what follows, we denote by \(\mathbb {E}\) a finite-dimensional real vector space, and by \(\mathbb {E}^*\) its dual space, composed of linear functions on \(\mathbb {E}\). For such a function \(s \in \mathbb {E}^*\), we denote by \(\langle s, x \rangle \) its value at \(x \in \mathbb {E}\). Using a self-adjoint positive-definite operator \(B: \mathbb {E}\rightarrow \mathbb {E}^*\) (notation \(B = B^* \succ 0\)), we can endow these spaces with conjugate Euclidean norms:
Sometimes, in the formulas involving products of linear operators, it will be convenient to treat \(x \in \mathbb {E}\) as a linear operator from \({\mathbb {R}}\) to \(\mathbb {E}\), and \(x^*\) as a linear operator from \(\mathbb {E}^*\) to \({\mathbb {R}}\). In this case, \(xx^*\) is a linear operator from \(\mathbb {E}^*\) to \(\mathbb {E}\), acting as follows:
For a smooth function \(f: {\mathrm{dom }}\,f \rightarrow {\mathbb {R}}\) with convex and open domain \({\mathrm{dom }}\,f \subseteq \mathbb {E}\), denote by \(\nabla f(x)\) its gradient, and by \(\nabla ^2 f(x)\) its Hessian evaluated at point \(x \in {\mathrm{dom }}\,f \subseteq \mathbb {E}\). Note that
In what follows, we often work with directional derivatives. For \(p \ge 1\), denote by
the directional derivative of function f at x along directions \(h_i \in \mathbb {E}\), \(i = 1, \dots , p\). Note that \(D^p f(x)[ \cdot ]\) is a symmetric p-linear form. Its norm is defined in the standard way:
For example, for any \(x \in {\mathrm{dom }}\,f\) and \(h_1, h_2 \in \mathbb {E}\), we have
Thus, for the Hessian, our definition corresponds to the spectral norm of a self-adjoint linear operator (the maximal modulus of its eigenvalues, computed with respect to the operator B).
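For illustration, this norm can be computed from the generalized eigenvalues of the pair \((\nabla ^2 f(x), B)\). A minimal numerical sketch (the matrices below are illustrative, not taken from the paper):

```python
import numpy as np
from scipy.linalg import eigh

# Illustrative data: H plays the role of the Hessian, B is the scaling operator.
H = np.array([[2.0, 1.0], [1.0, 3.0]])
B = np.array([[4.0, 0.0], [0.0, 1.0]])

# ||H|| = max over the B-unit sphere of |<H h, h>| equals the maximal
# modulus among the generalized eigenvalues of H h = lambda * B h.
lam = eigh(H, B, eigvals_only=True)
norm_H = np.max(np.abs(lam))

# Direct check: maximize |<H h, h>| over a fine grid of the B-unit sphere.
L = np.linalg.cholesky(B)                       # ||h||_B = ||L^T h||_2
angles = np.linspace(0.0, 2.0 * np.pi, 20001)
U = np.stack([np.cos(angles), np.sin(angles)])  # Euclidean unit vectors
Hs = np.linalg.solve(L.T, U)                    # columns h with ||h||_B = 1
vals = np.abs(np.einsum('im,ij,jm->m', Hs, H, Hs))
assert abs(norm_H - vals.max()) < 1e-3
print(norm_H)
```

For \(B = I\), this reduces to the usual spectral norm of the Hessian.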
If all directions \(h_1, \dots , h_p\) are the same, we apply notation
Then, Taylor approximation of function \(f(\cdot )\) at \(x \in {\mathrm{dom }}\,f\) can be written as follows:
Note that in general, we have (see, for example, Appendix 1 in [30])
Similarly, since for \(x, y \in {\mathrm{dom }}\,f\) being fixed, the form \(D^pf(x)[\cdot , \dots , \cdot ] - D^pf(y)[\cdot , \dots , \cdot ]\) is p-linear and symmetric, we also have
In this paper, we consider functions from the problem classes \(\mathcal{F}_p\), which are convex and p times differentiable on \(\mathbb {E}\). Denote by \(L_p\) the uniform bound for the Lipschitz constant of their pth derivative:
Sometimes, if an ambiguity can arise, we use the notation \(L_p(f)\).
Assuming that \(f \in \mathcal{F}_{p}\) and \(L_{p} < +\infty \), by the standard integration arguments we can bound the residual between function value and its Taylor approximation:
If \(p \ge 2\), then applying the same reasoning for functions \(\langle \nabla f(\cdot ), h \rangle \) and \(\langle \nabla ^2 f(\cdot ) h, h \rangle \) with direction \(h \in \mathbb {E}\) being fixed, we get the following guarantees:
which are valid for all \(x, y \in {\mathrm{dom }}\,f\).
2 Convex tensor approximations
In our methods, we use the following power prox function
Note that
All results of this paper are based on the following observation. From now on, for the sake of simplicity, we assume that \({\mathrm{dom }}\,f = \mathbb {E}\).
Theorem 1
Let \(f \in \mathcal{F}_p\) with \(p \ge 2\) and \(L_p < +\infty \). Then for any \(x, y \in \mathbb {E}\) we have
Moreover, for any \(M \ge L_p\) and any \(x \in \mathbb {E}\), the function (see footnote 1)
is convex and
Proof
Let us fix arbitrary x and y from \({\mathrm{dom }}\,f\). Then for any direction \(h \in \mathbb {E}\) we have
and this is (2.3). Further,
Thus, function \(\Omega _{x,p,M}(\cdot )\) is convex. Finally,
\(\square \)
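To illustrate the convexification effect of Theorem 1 numerically in the simplest setting, consider a one-dimensional convex function and its third-order Taylor model. The quartic coefficient below is chosen from the elementary one-dimensional convexity threshold, not from the theorem's \(L_p\)-based constant, and the test function is illustrative:

```python
import numpy as np

# f(t) = log(1 + e^t) is convex, but its third-order Taylor model at x is a
# cubic polynomial and hence nonconvex.  Adding a quartic term (sigma/4) h^4
# with sigma above the elementary 1-D threshold f'''(x)^2 / (12 f''(x))
# restores convexity of the model.
x = 1.0
s = 1.0 / (1.0 + np.exp(-x))            # sigmoid = f'(x)
f2 = s * (1.0 - s)                      # f''(x) > 0
f3 = s * (1.0 - s) * (1.0 - 2.0 * s)    # f'''(x)

h = np.linspace(-50.0, 50.0, 10001)

# Second derivative (in h) of the pure Taylor model: f2 + f3 * h.
taylor_dd = f2 + f3 * h
assert taylor_dd.min() < 0.0            # the unregularized model is nonconvex

# The regularized model has second derivative f2 + f3*h + 3*sigma*h^2,
# nonnegative for all h iff f3^2 <= 12 * sigma * f2.
sigma = f3 ** 2 / (12.0 * f2) * 1.1     # 10% above the threshold
model_dd = f2 + f3 * h + 3.0 * sigma * h ** 2
assert model_dd.min() >= 0.0
print(sigma)
```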
The statements of Theorem 1 explain our interest in the following point:
with \(M \ge L_p\). We are going to use such points for solving the problem
starting from some point \(x_0 \in \mathbb {E}\), where \(f \in \mathcal{F}_p\) and \(L_p < +\infty \). We assume that there exists at least one solution \(x_*\) of this problem and that the level sets of f are bounded:
In this case, in view of (2.5), the level sets of function \(\Omega _{x,p,M}(\cdot )\) are also bounded. Therefore, point \(T = T_{p,M}(x)\) is well defined. It satisfies the following first-order optimality condition:
Multiplying this equality by \(T-x\), we get
Denote \(r_{p,M}(x) = \Vert x - T_{p,M}(x) \Vert \).
Lemma 1
For any \(x \in \mathbb {E}\) and \(M \ge L_p\) we have
The first two inequalities of this lemma are already known (see, for example, [8]). We provide them with a simple proof for the reader's convenience.
Proof
Denote \(T = T_{p,M}(x)\) and \(r = \Vert x - T \Vert \). Then,
and this is inequality (2.11). Further,
Therefore, we get the following bound:
which leads to (2.12). Finally,
Squaring both sides of this bound, we get:
and this is (2.13). \(\square \)
Corollary 1
For any \(x \in \mathbb {E}\) and \(M \ge L_p\), we have
where \(c(p) = {p \over p-1} \left[ {p-1 \over p+1} \right] ^{1-p \over 2p} [(p-1)!]^{1 \over p}\).
Proof
Indeed, in view of inequality (2.13), we have the following bound:
where \(a = {(p-1)! \over 2 M} \Vert \nabla f(T_{p,M}(x)) \Vert _*^2\), \(b = {M^2 - L^2_p \over 2M (p-1)!}\), \(\tau = r_{p,M}^{p-1}(x)\), and \(\alpha = {p+1 \over p-1}\). Note that
It remains to substitute in this bound the values of our parameters a, b, and \(\alpha \). \(\square \)
Let us now estimate the rate of convergence of the following process:
where \(M \ge L_p\). Thus, in view of Theorem 1, point \(x_{t+1}\) is a solution to the auxiliary convex problem (2.6).
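For \(p = 2\), one step of process (2.16) corresponds to the Cubic Regularization of Newton's Method. A minimal numerical sketch, assuming the standard power prox \(d_3(h) = {1 \over 3}\Vert h \Vert ^3\): the data are synthetic, the inner convex problem is solved by a generic solver, and the regularization coefficient is a safe overestimate rather than the exact Lipschitz constant:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic data: regularized logistic loss (convex, Lipschitz Hessian).
n, N, mu = 5, 30, 0.1
A = rng.standard_normal((N, n))
b = rng.choice([-1.0, 1.0], size=N)

def f(x):
    return np.sum(np.log1p(np.exp(-b * (A @ x)))) + 0.5 * mu * x @ x

def grad(x):
    s = 1.0 / (1.0 + np.exp(b * (A @ x)))
    return -(A.T @ (b * s)) + mu * x

def hess(x):
    s = 1.0 / (1.0 + np.exp(b * (A @ x)))
    return (A * (s * (1.0 - s))[:, None]).T @ A + mu * np.eye(n)

# Safe overestimate of the Lipschitz constant of the Hessian
# (|phi'''| <= 0.1 for the logistic loss).
M = 0.1 * np.sum(np.linalg.norm(A, axis=1) ** 3)

def tensor_step(x):
    # One step of (2.16) for p = 2: minimize the quadratic model plus the
    # cubic regularizer M * d_3(h) = (M/3) ||h||^3 (a convex problem).
    g, H = grad(x), hess(x)
    model = lambda h: g @ h + 0.5 * h @ H @ h + (M / 3.0) * np.linalg.norm(h) ** 3
    return x + minimize(model, np.zeros(n), method='BFGS').x

x = np.ones(n)
values = [f(x)]
for _ in range(20):
    x = tensor_step(x)
    values.append(f(x))

# Monotone decrease, as guaranteed by the upper-model property.
assert all(v1 <= v0 + 1e-10 for v0, v1 in zip(values, values[1:]))
print(values[0], values[-1])
```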
Theorem 2
Let sequence \(\{ x_t \}_{t \ge 0}\) be generated by method (2.16) as applied to problem (2.7). Then for all \(t \ge 0\) we have \(f(x_{t+1}) \le f(x_t)\). At the same time,
Proof
In view of inequality (2.5), method (2.16) is monotone. Hence, for all \(t \ge 0\) we have
Let us prove the first inequality in (2.17). First of all, let us estimate the difference \(f(x_1) - f_*\). We have
and this is (2.17) for \(t = 1\).
Further, for any \(t \ge 1\), we have
The minimum of the above objective in \(\alpha \ge 0\) is achieved for
Thus, we conclude that
Denoting \(\delta _t = f(x_t) - f_*\), we get the following estimate:
where \(C = {p \over p+1}\left( {p! \over (p M + L_p)D^{p+1} } \right) ^{1 \over p}\). Thus, for \(\mu _t = C^p \delta _t\), the recursive inequality is as follows:
Then,
This means that \({1 \over \mu _t} \ge \left( {1 \over \mu _1^{1/p}} + {t-1 \over p} \right) ^p\). Note that
Therefore,
and this is (2.17). \(\square \)
3 Accelerated tensor methods
In order to accelerate method (2.16), we apply a variant of the estimating sequences technique, which has become a standard tool for accelerating the usual Gradient and Second-Order Methods (see, for example, [25,26,27]). In our situation, this idea can be applied to tensor methods in the following way.
For solving the problem (2.7), we choose a constant \(M \ge L_p\) and recursively update the following sequences.
-
Sequence of estimating functions
$$\begin{aligned} \begin{array}{rcl} \psi _k(x)= & {} \ell _k(x) + {C \over p!} d_{p+1}(x-x_0), \quad k = 1, 2, \dots , \end{array} \end{aligned}$$(3.1)where \(\ell _k(x)\) are linear functions in \(x \in \mathbb {E}\), and C is a positive parameter.
-
Minimizing sequence \(\{ x_k \}_{k=1}^{\infty }\).
-
Sequence of scaling parameters \(\{A_k \}_{k=1}^{\infty }\):
$$\begin{aligned} A_{k+1} \; {\mathop {=}\limits ^{{\mathrm {def}}}}\; A_k + a_k, \quad k = 1,2, \dots . \end{aligned}$$
For these objects, we are going to maintain the following relations:
Let us ensure that relations (3.2) hold for \(k = 1\). We choose
Then \(\psi _1^* = f(x_1)\), so \(\mathcal{R}_1^1\) holds. On the other hand, in view of definition (3.1), we get
and \(\mathcal{R}_1^2\) follows.
Assume now that relations (3.2) hold for some \(k \ge 1\). Denote
Let us choose some \(a_k > 0\) and \(M \ge L_p\). Define
In view of \(\mathcal{R}_k^2\) and convexity of f, for any \(x \in \mathbb {E}\) we have
and this is \(\mathcal{R}_{k+1}^2\). Let us show now that, for the appropriate choices of \(a_k\), C and M, relation \(\mathcal{R}_{k+1}^1\) is also valid.
Indeed, in view of \(\mathcal{R}_k^1\) and Lemma 4 in [27], for any \(x \in \mathbb {E}\), we have
Denote \(\gamma _p = {C \over p!} \cdot ({1 \over 2})^{p-1}\). Then,
Further, if we choose \(M \ge L_p\), then by inequality (2.15) we have
Hence, our choice of parameters must ensure the following inequality:
for all \(x \in \mathbb {E}\). Minimizing this expression in \(x \in \mathbb {E}\), we come to the following condition:
Substituting in this inequality expressions for corresponding constants, we obtain
After cancellations, we get
For simplicity of notation, let us choose
Then, inequality (3.6) becomes simpler:
Let us choose some \(\alpha > 0\) and define
Note that for any \(k \ge 0\), using trivial inequality \((1-\tau )^p \ge 1 - p \tau \), \(0 \le \tau \le 1\), we get
Thus, \(A_{k+1} \ge \alpha ^{1 \over p} \left( {1 \over p+1} \right) ^{p+1 \over p} a_k^{p+1 \over p}\). Now, if we choose
then \(\alpha ^{-{1 \over p+1}} (1 + p) = \left[ { (p-1)(M^2 - L_p^2) \over 4(p+1)M^2} \right] ^{p \over 2(p+1)}\), and inequality (3.8) is satisfied for all \(k \ge 0\).
Now we are ready to write down the accelerated tensor method. Define
The above discussion proves the following theorem.
Theorem 3
Let the sequence \(\{x_k \}_{k=1}^{\infty }\) be generated by method (3.12) as applied to the problem (2.7). Then for any \(k \ge 1\) we have:
Proof
Indeed, we have shown that
Thus, (3.13) follows from (3.11). \(\square \)
Note that the point \(v_k\) can be found in (3.12) by a closed-form expression. Consider
Since function \(\ell _k(\cdot )\) is linear, this vector is the same for all \(x \in \mathbb {E}\). Therefore, the point \(v_k\) is a solution of the following minimization problem:
The first-order optimality condition for this problem is as follows:
Thus, we get the following closed-form expression for its solution:
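Assuming \(B = I\) and the standard power prox \(d_{p+1}(h) = {1 \over p+1} \Vert h \Vert ^{p+1}\), this closed-form solution can be verified numerically; the data below are illustrative:

```python
import math
import numpy as np

# Stationarity of psi_k:  g + (C/p!) ||v - x0||^{p-1} (v - x0) = 0,
# where g is the (constant) gradient of the linear part l_k.
p, C = 3, 2.0
x0 = np.array([1.0, -2.0, 0.5])
g = np.array([0.3, -0.1, 0.7])

# Taking norms gives ||v - x0||^p = (p!/C) ||g||, hence the closed form:
r = (math.factorial(p) / C * np.linalg.norm(g)) ** (1.0 / p)  # ||v - x0||
h = -(math.factorial(p) / C) * r ** (1 - p) * g
v = x0 + h

# Check the first-order optimality condition.
residual = g + (C / math.factorial(p)) * np.linalg.norm(h) ** (p - 1) * h
assert np.linalg.norm(residual) < 1e-12
assert abs(np.linalg.norm(h) - r) < 1e-12
print(v)
```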
4 Lower complexity bounds for tensor methods
For constructing functions that are difficult for all tensor methods, it is convenient to assume that \(\mathbb {E}= \mathbb {E}^* = {\mathbb {R}}^n\), and \(B = I_n\), the identity \(n \times n\)-matrix. Thus, in this section we work with the standard Euclidean norm
For an integer parameter \(p \ge 1\), define the following function:
Clearly, \(\eta _{p+1} \in \mathcal{F}_p\). On the other hand, for any x and \(h \in {\mathbb {R}}^n\), we have
Therefore, for all x, y, and h from \({\mathbb {R}}^n\), by the Cauchy–Schwarz inequality we have
Thus, \(L_p(\eta _{p+1}) = p!\).
For integer parameter k, \(2 \le k \le n\), let us define the following \(k \times k\) upper triangular matrix with two nonzero diagonals:
Now we can introduce \(n \times n\) upper triangular matrix \(A_k\) with the following structure:
Note that
Our parametric family of difficult functions is defined in the following way:
where \(e_1 = (1, 0, \dots , 0)^T\). Let us compute its optimal solution from the first-order optimality condition
Thus, \(A_k x^*_k = y^*_k\), where \(y^*_k\) satisfies the equation \(\nabla \eta _{p+1}(y^*_k) = A_k^{-T} e_1 = {\hat{e}}_k {\mathop {=}\limits ^{{\mathrm {def}}}}\left( \begin{array}{l} {\bar{e}}_k \\ 0_{n-k} \end{array} \right) \), with \({\bar{e}}_k \in {\mathbb {R}}^k\) being the vector of all ones, and \(0_{n-k}\) being the origin in \({\mathbb {R}}^{n-k}\).
Thus, in coordinate form, vector \(y^*_k\) can be found from the following equations:
In other words, \(y^*_k = {\hat{e}}_k\) and vector \(x^*_k = A_k^{-1} {\hat{e}}_k\) has the following coordinates:
where \((\tau )_+ = \max \{ 0, \tau \}\). Consequently,
Let us now describe the abilities of tensor methods of degree \(p \ge 2\) in generating new test points. We assume that the response of the oracle at a point \({\bar{x}} \in {\mathbb {R}}^n\) consists of the following collection of multi-linear forms:
Therefore, we assume that the method is able to compute stationary points of the following polynomial functions
with coefficients \(a \in {\mathbb {R}}^p\), \(\gamma > 0\) and \(m > 1\). Denote by \(\Gamma _{{\bar{x}}, f}(a,\gamma ,m)\) the set of all stationary points of this function. Then we can define the linear subspace
Our assumption about the rules of tensor methods is as follows.
Assumption 1
The method generates a sequence of test points \(\{ x_k \}_{k \ge 0}\) satisfying recursive condition
Note that for the vast majority of first-order, second-order, and tensor methods this assumption is satisfied.
Let us look at the consequences of Assumption 1 in the case of minimization of function \(f_k(\cdot )\). Denote
Lemma 2
Any tensor method satisfying Assumption 1 and minimizing function \(f_t(\cdot )\) starting from the point \(x_0 = 0\) generates test points \(\{ x_k \}_{k \ge 0}\) satisfying the condition
Proof
Let us prove first that inclusion \(x \in {\mathbb {R}}^n_k\) with \(k \ge 1\) implies \(\mathcal{S}_{f_t}(x) \subseteq {\mathbb {R}}^n_{k+1}\). Indeed, since matrix \(A_t\) is upper triangular, this inclusion ensures that \(y {\mathop {=}\limits ^{{\mathrm {def}}}}A_t x \in {\mathbb {R}}^n_k\). Therefore, all derivatives of function \(f_t(\cdot )\) along direction \(h \in {\mathbb {R}}^n\) have the following form:
with certain coefficients \(d_{i,k}\), \(i = 1, \dots , n\), \(k = 1, \dots , p\). This means that the gradients of these derivatives in h are as follows
Thus \(\nabla D^k f_t(x)[h]^k \in {\mathbb {R}}^n_{k+1}\) for all k, \(1 \le k \le p\). Hence, since the regularization term in definition (4.6) is formed by the standard Euclidean norm, all stationary points of this function belong to \({\mathbb {R}}^n_{k+1}\).
Now we can prove the statement of the lemma by induction. For \(k=0\), we have \(x_0 = 0\), and therefore
for all \(h \in {\mathbb {R}}^n\). Consequently, stationary points of all possible functions \(\phi _{a,\gamma ,m}(\cdot )\) belong to \({\mathbb {R}}^n_1\) implying \(\mathcal{S}_{f_t}(x_0) = {\mathbb {R}}^n_1\). Thus, \(x_1\) belongs to \({\mathbb {R}}^n_1\) by Assumption 1.
Assume now that all \(x_i \in {\mathbb {R}}^n_k\), \(i = 1, \dots , k\), for some \(k \ge 1\). Then, as we have already seen, \(\mathcal{S}_{f_t}(x_i) \subseteq {\mathbb {R}}^n_{k+1}\). Hence, the inclusion (4.8) follows from Assumption 1. \(\square \)
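The zero-chain mechanism of this proof can be checked numerically. The sketch below assumes the natural instantiations \(\eta _{p+1}(y) = {1 \over p+1} \sum _i |y_i|^{p+1}\) and an upper bidiagonal \(U_t\) with ones on the diagonal and minus ones above it; both are assumptions consistent with the stated properties of these objects, not quotations of the paper's displays:

```python
import numpy as np

n, t, p = 10, 7, 3
U = np.eye(t) - np.diag(np.ones(t - 1), k=1)  # assumed upper bidiagonal U_t
A = np.eye(n)
A[:t, :t] = U                                 # A_t = diag(U_t, I_{n-t})
e1 = np.zeros(n); e1[0] = 1.0

def grad_f(x):
    # f_t(x) = eta_{p+1}(A_t x) - <e_1, x> with the assumed eta.
    y = A @ x
    return A.T @ (np.abs(y) ** p * np.sign(y)) - e1

rng = np.random.default_rng(1)
for k in range(1, t):
    x = np.zeros(n)
    x[:k] = rng.standard_normal(k)            # x in R^n_k
    g = grad_f(x)
    # The gradient never reaches past coordinate k+1: each oracle call
    # can "discover" at most one new coordinate.
    assert np.allclose(g[k + 1:], 0.0)
print("zero-chain property verified")
```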
Now we can prove the main statement of this section.
Theorem 4
Let some tensor method \(\mathcal{M}\) of degree p satisfy Assumption 1. Assume that this method ensures for any function \(f \in \mathcal{F}_p\) with \(L_p(f) < + \infty \) the following rate of convergence:
where \(\{ x_k \}_{k \ge 0}\) is the sequence of test points, generated by method \(\mathcal{M}\) and \(x^*\) is the solution of the problem (2.7). Then for all \(t \ge 1\) such that \(2t +1 \le n\) we have
Proof
Let us use method \(\mathcal{M}\) for minimizing function \(f(x) = f_{2t+1}(x)\). In view of Lemma 2, we have \(x_i \in {\mathbb {R}}^n_t\) for all i, \(0 \le i \le t\). However,
At the same time, for all \(x, y, h \in {\mathbb {R}}^n\) we have
Therefore, \(L_p(f_{2t+1}) \le 2^p p!\), and we have
\(\square \)
5 Third-order methods: implementation details
Tensor optimization methods, presented in Sects. 2 and 3, are based on the solution of the auxiliary optimization problem (2.6). In the existing literature on tensor methods [6, 7, 20, 32], it was solved by standard local techniques of Nonconvex Optimization. However, now we know that, by Theorem 1, this problem is convex. Hence, it is solvable by the standard and very efficient methods of Convex Optimization.
Since we need to solve this problem at each iteration of the methods, its complexity significantly affects the total computational time. Since the objective function in problem (2.6) is a convex multivariate polynomial, there could exist special efficient algorithms for finding its solution. Unfortunately, at this moment the authors have failed to find such methods in the literature. Therefore, we present in this section a special approach for solving problem (2.6) with the third-degree Taylor approximation, which is based on the recently developed optimization framework of relatively smooth functions (see [4, 21]).
Let us fix an arbitrary \(x \in \mathbb {E}\). Consider the following multivariate polynomial of degree three (see footnote 2):
Since in this section we work only with third-order approximations, we drop the corresponding index.
Recall that \(D^3f(x)[h_1,h_2,h_3]\) is a symmetric trilinear form. Hence, \(D^3 f(x) [h_1,h_2] \in \mathbb {E}^*\) is a symmetric bilinear vector function, and \(D^3f(x)[h]\) is a linear function of \(h \in \mathbb {E}\), whose values are self-adjoint linear operators from \(\mathbb {E}\) to \(\mathbb {E}^*\) (as Hessians).
Denote \(p(h) = {1 \over 6} D^3 f(x) [h]^3\). Then we can define its gradient and Hessian as follows:
In this section, our main class of functions is \(\mathcal{F}_3\), composed of convex functions which are three times continuously differentiable, and for which the constant \(L_3\) is finite. As we have shown in Theorem 1, our assumptions imply interesting relations between derivatives. Let us derive a consequence of the matrix inequality (2.3).
Lemma 3
For any \(h \in \mathbb {E}\) and \(\tau > 0\), we have
Consequently, for any \(h, u \in \mathbb {E}\), we get
Proof
Let us fix arbitrary directions \(u, h \in \mathbb {E}\). Then, in view of relation (2.3) we have
Thus, replacing h by \(\tau h\) with \(\tau > 0\) and dividing the resulting inequality by \(\tau \), we get
And this is equivalent to the left-hand side of matrix inequality (5.2). Its right-hand side can be obtained by replacing h by \(-h\), which gives
Minimizing the right-hand side of this inequality in \(\tau \), we get (5.3). \(\square \)
Let us look now at our auxiliary minimization problem:
where \(d_4(h) = {1 \over 4} \Vert h \Vert ^4\). In view of Theorem 1, function \(\Omega _{x,M}(\cdot )\) is convex for
with \(\tau \ge 1\). For any \(h \in \mathbb {E}\), we have
Let \(\rho _x(h) = {1 \over 2}(1 - {1 \over \tau }) \langle \nabla ^2 f(x) h,h \rangle + {M -\tau L_3 \over 2} d_4(h)\). Then, we have proved that
On the other hand,
Thus, we have proved the following lemma.
Lemma 4
Let \(M = \tau ^2L_3\) with \(\tau > 1\). Then function \(\Omega _{x,M}(\cdot )\) satisfies the strong relative smoothness condition
with respect to function \(\rho _x(\cdot )\), where \(\kappa (\tau )={\tau +1 \over \tau -1}\).
As shown in [21], condition (5.7) allows us to solve problem (5.4) very efficiently by a kind of primal gradient method. In accordance with this approach, we need to define the Bregman distance of the function \(\rho _x(\cdot )\):
and iterate the process
In our case, this method has the following form:
According to Theorem 3.1 in [21], the rate of convergence of this method is as follows:
where \(h_*\) is the unique optimal solution to problem (5.4).
As we can see, algorithm (5.8) is very fast. Its linear rate of convergence depends only on the absolute constant \(\tau > 1\), which can be chosen reasonably close to one, allowing faster convergence of the main tensor methods (2.16) and (3.12). Let us discuss two potentially expensive operations in the implementation of method (5.8).
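A compact numerical sketch of method (5.8), using the constants from the text: \(M = \tau ^2 L_3\), \(\kappa (\tau ) = {\tau +1 \over \tau -1}\), and the function \(\rho _x(\cdot )\) from Lemma 4. We assume \(d_4(h) = {1 \over 4}\Vert h \Vert ^4\) and the quartic term \({M \over 2} d_4(h)\) in \(\Omega _{x,M}\); the test function (a sum of exponentials of linear forms), the constant playing the role of \(L_3\) (an overestimate, which is safe here), and the generic inner solver are all illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)

# Illustrative convex function f(x) = sum_i exp(a_i^T x); its derivatives
# at a fixed x are available in closed form.
n, N = 4, 12
A = 0.5 * rng.standard_normal((N, n))
x = 0.1 * rng.standard_normal(n)

w = np.exp(A @ x)
g = A.T @ w                          # gradient of f at x
H = (A * w[:, None]).T @ A           # Hessian of f at x

def D3(h):                           # D^3 f(x)[h]^2 as a vector
    return A.T @ (w * (A @ h) ** 2)

L3 = float(np.sum(w * np.linalg.norm(A, axis=1) ** 4))  # crude safe constant
tau = 2.0
M = tau ** 2 * L3
kappa = (tau + 1.0) / (tau - 1.0)

def omega(h):                        # third-order model with quartic term
    return g @ h + 0.5 * h @ H @ h + (A @ h) ** 3 @ w / 6.0 \
        + (M / 8.0) * np.linalg.norm(h) ** 4

def grad_omega(h):
    return g + H @ h + 0.5 * D3(h) + (M / 2.0) * (h @ h) * h

def rho(h):                          # scaling function from Lemma 4
    return 0.5 * (1.0 - 1.0 / tau) * h @ H @ h \
        + (M - tau * L3) / 8.0 * np.linalg.norm(h) ** 4

def grad_rho(h):
    return (1.0 - 1.0 / tau) * H @ h + (M - tau * L3) / 2.0 * (h @ h) * h

h = np.zeros(n)
vals = [omega(h)]
for _ in range(30):
    gk, rk = grad_omega(h), grad_rho(h)
    # Bregman step: minimize <grad Omega(h_k), y> + kappa * beta_rho(h_k, y).
    obj = lambda y: (gk - kappa * rk) @ y + kappa * rho(y)
    h = minimize(obj, h, method='BFGS').x
    vals.append(omega(h))

assert all(v1 <= v0 + 1e-9 for v0, v1 in zip(vals, vals[1:]))
print(vals[0], vals[-1])
```

With exact inner solves, each Bregman step is guaranteed not to increase \(\Omega _{x,M}\), which the assertions above confirm empirically.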
-
1.
Computation of the gradient \(\nabla \Omega _{x,M}(h)\). Note that
$$\begin{aligned} \begin{array}{rcl} \nabla \Omega _{x,M}(h)= & {} \nabla f(x) + \nabla ^2 f(x) h + {1 \over 2}D^3f(x) [h]^2. \end{array} \end{aligned}$$In this formula, only the computation of the third derivative may be expensive. However, this difficulty can be resolved using the technique of automatic differentiation (see, for example, [19]). Indeed, assume we have a sequence of operations for computing the function value f(x) with computational complexity T. Let us fix a direction \(h \in \mathbb {E}\). Then by forward differentiation, we can automatically generate a sequence of operations for computing the value
$$\begin{aligned} \begin{array}{rcl} g_h(x)= & {} \langle \nabla ^2 f(x) h, h \rangle \end{array} \end{aligned}$$with computational complexity O(T). Now, by backward differentiation in x, we can compute the gradient of this function:
$$\begin{aligned} \begin{array}{rcl} \nabla g(x)= & {} D^3f(x)[h,h] \end{array} \end{aligned}$$with computational complexity O(T). Thus, the oracle complexity of method (5.8) is proportional to the complexity of computing the function value f(x). Another example of simple computation of the third derivative is provided by a separable objective function. Assume that \(\mathbb {E}= {\mathbb {R}}^n\) and
$$\begin{aligned} \begin{array}{rcl} f(x)= & {} \sum \limits _{i=1}^N f_i(b_i - \langle a_i, x \rangle ), \end{array} \end{aligned}$$where \(a_i \in {\mathbb {R}}^n\) and univariate functions \(f_i(\cdot )\) are three times continuously differentiable, \(i = 1, \dots , N\). Then vector \(D^3 f[h]^2\) has the following representation:
$$\begin{aligned} \begin{array}{rcl} D^3 f(x)[h]^2= & {} - \sum \limits _{i=1}^N a_i \, f'''_i(b_i - \langle a_i, x \rangle ) \langle a_i, h \rangle ^2. \end{array} \end{aligned}$$Thus, for solving the problem (5.4), we need to compute in advance all values
$$\begin{aligned} \begin{array}{c} f'''_i(b_i - \langle a_i, x \rangle ), \quad i = 1, \dots , N \end{array} \end{aligned}$$(this needs O(nN) operations). After that, each computation of the vector \(D^3 f(x)[h]^2 \in {\mathbb {R}}^n\) also needs O(nN) operations. This computation will be cheaper for sparse data.
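The separable formula for \(D^3f(x)[h]^2\) can be verified against finite differences of \(g_h(x) = \langle \nabla ^2 f(x) h, h\rangle \); here \(f_i(t) = e^t\) and the data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n, N = 5, 8
A = rng.standard_normal((N, n))
b = rng.standard_normal(N)
x = 0.1 * rng.standard_normal(n)
h = rng.standard_normal(n)

# Separable formula: D^3 f(x)[h]^2 = -sum_i a_i f_i'''(b_i - <a_i,x>) <a_i,h>^2.
# With f_i = exp, we also have f_i''' = exp.
f3 = np.exp(b - A @ x)
d3_formula = -A.T @ (f3 * (A @ h) ** 2)

# Finite-difference check via g_h(x) = <nabla^2 f(x) h, h> = sum_i f_i''(.) <a_i,h>^2.
def g_h(x):
    return (A @ h) ** 2 @ np.exp(b - A @ x)

eps = 1e-6
d3_fd = np.array([(g_h(x + eps * e) - g_h(x - eps * e)) / (2.0 * eps)
                  for e in np.eye(n)])
assert np.allclose(d3_formula, d3_fd, rtol=1e-4, atol=1e-6)
print(d3_formula)
```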
-
2.
Solution of the auxiliary problem. At all iterations of method (5.8), we need to solve an auxiliary problem of the following form:
$$\begin{aligned} \begin{array}{rcl} \min \limits _{h \in \mathbb {E}} \left\{ \langle c, h \rangle + {1 \over 2}\langle A h, h \rangle + {\gamma \over 4} \Vert h \Vert ^4 \right\} , \end{array} \end{aligned}$$(5.10)where \(A \succeq 0\) and \(\gamma > 0\). Note that at all these iterations only the vector c and the coefficient \(\gamma \) change, while the matrix \(A = \nabla ^2 f(x)\) remains the same. Therefore, before algorithm (5.8) starts working, it is reasonable to transform this matrix into tridiagonal form:
$$\begin{aligned} \begin{array}{rcl} A= & {} U T U^T, \end{array} \end{aligned}$$where \(U \in {\mathbb {R}}^{n \times n}\) is an orthogonal matrix: \(UU^T = I\), and \(T \in {\mathbb {R}}^{n \times n}\) is tri-diagonal.
Denoting now \({\tilde{c}} = U^T c\), we have:
Thus, the solution of the primal problem can be retrieved from a solution to the univariate dual problem. The complexity of computing its function value and derivative is linear in n. Moreover, since its objective function is strongly convex and infinitely differentiable, all reasonable one-dimensional methods have a global linear rate of convergence and quadratic convergence in the final stage.
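A numerical sketch of this reduction (the data are illustrative): after the tridiagonalization \(A = U T U^T\), a stationary point of (5.10) has the form \(h = -U(T + \gamma \tau I)^{-1}{\tilde{c}}\), where the scalar \(\tau \), playing the role of \(\Vert h \Vert ^2\), solves a monotone univariate equation. Here a simple bisection stands in for the one-dimensional solver:

```python
import numpy as np
from scipy.linalg import hessenberg

rng = np.random.default_rng(4)
n, gamma = 6, 0.5
R = rng.standard_normal((n, n))
A = R.T @ R                         # A >= 0
c = rng.standard_normal(n)

# Tridiagonalization A = U T U^T (for symmetric A the Hessenberg form is
# tridiagonal); done once, before the univariate solves.
T, U = hessenberg(A, calc_q=True)
c_t = U.T @ c

def h_of(tau):                      # candidate solution for a given tau;
    return -np.linalg.solve(T + gamma * tau * np.eye(n), c_t)
    # (with a banded solver this costs O(n) per evaluation)

# phi(tau) = ||h(tau)||^2 - tau is strictly decreasing: bisection.
lo, hi = 0.0, 1.0
while np.linalg.norm(h_of(hi)) ** 2 > hi:
    hi *= 2.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if np.linalg.norm(h_of(mid)) ** 2 > mid:
        lo = mid
    else:
        hi = mid
tau = 0.5 * (lo + hi)
h = U @ h_of(tau)                   # back to the original coordinates

# Check the first-order optimality condition of (5.10).
residual = c + A @ h + gamma * (h @ h) * h
assert np.linalg.norm(residual) < 1e-8
print(tau)
```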
Let us estimate the total computational complexity of method (5.8), assuming that the time needed to compute the value of the objective function is \(T_f\). Assume also that its gradient, the product of its Hessian with a vector, and the value of its third derivative on two identical vectors can be computed by fast backward differentiation (then the complexity of all these operations is \(O(T_f)\)). Then, the most expensive operations in this method are as follows.
-
Computation of the Hessian \(\nabla ^2 f(x)\) and its tri-diagonal factorization: \(O(nT_f + n^3)\) operations.
-
We need \(O(\ln {1 \over \epsilon })\) iterations of method (5.8) in order to get an \(\epsilon \)-solution of the auxiliary problem. At each iteration of this method we need to:
-
Compute the gradient \(\nabla \Omega _{x,M}(h_k)\): \(O(T_f)\) operations.
-
Compute the vector \({\tilde{c}}\) for the univariate problem in (5.11): \(O(n^2)\) operations.
-
Solve the dual problem in (5.11) up to accuracy \(\delta \): \(O(n \ln {1 \over \delta })\) operations.
-
Compute an approximate solution \(h = - U(\gamma \tau I +T)^{-1} {\tilde{c}}\) of the problem (5.10), using an approximate solution \(\tau \) of the dual problem: \(O(n^2)\) operations.
-
Thus, we come to the following estimate:
This is the same order of complexity as that of one iteration in Trust Region Methods [15] and usual Cubic Regularization [27, 31]. However, we can expect that the third-order methods converge much faster.
For readers who are not interested in all these computational details, we just mention that the GALAHAD Optimization Library [16] has special subroutines for solving auxiliary problems of the form (5.10).
6 Discussion
In this paper, we have taken an important step towards the practical implementation of tensor methods in unconstrained convex optimization. We have shown that the auxiliary optimization problems in these schemes can be reduced to the minimization of a convex multivariate polynomial. In the important case of the third-order tensor, we have proved that this problem can be efficiently solved by a special optimization scheme derived from the relative smoothness condition.
Our results highlight several interesting questions. One of the direct consequences of our approach is a systematic way of generating convex multivariate polynomials. Is it possible to minimize them by tools of Algebraic Geometry (see [23] for related techniques like sums of squares, etc.), or do we need to treat them using appropriate techniques from Convex Optimization? The results of Sect. 5 demonstrate a probably unbeatable superiority of the optimization technique for third-order polynomials. But what happens with polynomials of higher degree?
One of the difficult unsolved problems in our approach is related to dynamic adjustment of the Lipschitz constant for the highest derivative. This dynamic estimate should not be much bigger than the actual Lipschitz constant. On the other hand, it must ensure convexity of the auxiliary problem solved at each iteration of the tensor methods. This question is clearly crucial for the practical efficiency of the high-order schemes.
Simple comparison of the complexity bounds in Sects. 3 and 4 shows that we failed to develop an optimal tensor scheme. The missing factor in the complexity estimates is of the order of \(O\left( \left( {1 \over \epsilon }\right) ^{{1 \over p+1} - {2 \over 3p + 1}}\right) = O\left( \left( {1 \over \epsilon }\right) ^{p-1 \over (p+1)(3p+1)}\right) \). For \(p = 3\), this factor is of the order of \(O\left( \left( {1 \over \epsilon }\right) ^{{1 \over 20}}\right) \). This means that from the viewpoint of practical efficiency, the cost of one iteration of the hypothetical optimal scheme must be of the same order as that of the accelerated tensor method (3.12). Any additional logarithmic factors in the complexity bound of this “optimal” method will definitely kill its tiny superiority in the convergence rate.
In recent years, we have seen an increasing interest in universal methods, which adjust themselves to the best possible Hölder condition, instead of the Lipschitz one, during the optimization process (see [14, 17, 29]). Of course, it would be very interesting to extend this philosophy to tensor minimization schemes. Another important extension could be the treatment of constraints, either in functional form or within the framework of composite minimization [28]. The main difficulty here is related to the complexity of the auxiliary optimization problems.
One of the main restrictions on the practical implementation of our results is the need to know the Lipschitz constant of the corresponding derivative. If our estimate is too small, then the auxiliary problem (2.6) may lose convexity. Consequently, we lose the fast convergence of the auxiliary process (5.8). However, this observation gives us a clue for tuning this constant: if we see that this process is too slow, then our estimate is too small. But of course, it would be very interesting to find a recipe with a better theoretical justification.
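The tuning idea just described, doubling the estimate whenever the current one proves too small, has a familiar first-order analogue: backtracking on the gradient-smoothness constant. The sketch below illustrates only this doubling mechanism on a toy one-dimensional objective; it is not the paper's third-order procedure, and the decrease test used here is the classical first-order one.

```python
def adaptive_step(f, grad, x, L):
    # One gradient step with backtracking on the smoothness estimate L:
    # if the trial point violates the upper-model decrease condition
    #   f(y) <= f(x) - |grad|^2 / (2L),
    # the estimate was too optimistic, so L is doubled and the step retried.
    g = grad(x)
    while True:
        y = x - g / L
        if f(y) <= f(x) - g * g / (2.0 * L) + 1e-12:
            return y, L
        L *= 2.0

f = lambda t: t ** 4 / 4.0       # toy objective without a global Lipschitz gradient
grad = lambda t: t ** 3
x, L = 2.0, 0.1                  # deliberately small initial estimate
for _ in range(100):
    x, L = adaptive_step(f, grad, x, L)
```

Starting from the far-too-small estimate \(L = 0.1\), the test fails repeatedly on the first iteration and the estimate is doubled until the step is accepted; afterwards it stabilizes and the iterates decrease monotonically towards the minimizer.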
Notes
In our notation, the approximation function in [3] was chosen as \(\Phi _{x,p}(y) + {M \over p!} d_{p+1}(y-x)\). Thus, we cannot guarantee that this polynomial is convex.
We drop the index in this notation since from now on we always have \(p = 3\).
References
Agarwal, N., Hazan, E.: Lower Bounds for Higher-Order Convex Optimization (2017). arXiv:1710.10329v1 [math.OC]
Arjevani, Y., Shamir, O., Shiff, R.: Oracle Complexity of Second-Order Methods for Smooth Convex Optimization (2017). arXiv:1705.07260 [math.OC]
Baes, M.: Estimate sequence methods: extensions and approximations. Optimization Online (2009)
Bauschke, H.H., Bolte, J., Teboulle, M.: A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Math. Oper. Res. 42, 330–348 (2017)
Bian, W., Chen, X., Ye, Y.: Complexity analysis of interior-point algorithms for non-Lipschitz and non-convex minimization. Math. Program. 139, 301–327 (2015)
Birgin, E.G., Gardenghi, J.L., Martínez, J.M., Santos, S.A.: Remark on Algorithm 566: Modern Fortran Routines for Testing Unconstrained Optimization Software with Derivatives up to Third-Order. Technical report, Department of Computer Sciences, University of São Paulo, Brazil (2018)
Birgin, E.G., Gardenghi, J.L., Martínez, J.M., Santos, S.A.: On the Use of Third-Order Models with Fourth-Order Regularization for Unconstrained Optimization. Technical report, Department of Computer Sciences, University of São Paulo, Brazil (2018)
Birgin, E.G., Gardenghi, J.L., Martínez, J.M., Santos, S.A., Toint, PhL: Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularization models. Math. Program. 163, 359–368 (2017)
Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Lower Bounds for Finding Stationary Points I (2017). arXiv:1710.11606 [math.OC]
Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Lower Bounds for Finding Stationary Points II (2017). arXiv:1711.00841 [math.OC]
Cartis, C., Gould, N.I.M., Toint, PhL: Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Math. Program. 127(2), 245–295 (2011)
Cartis, C., Gould, N.I.M., Toint, PhL: Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function- and derivative-evaluation complexity. Math. Program. 130(2), 295–319 (2011)
Cartis, C., Gould, N.I.M., Toint, PhL: Evaluation complexity of adaptive cubic regularization methods for convex unconstrained optimization. Optim. Methods Softw. 27(2), 197–219 (2012)
Cartis, C., Gould, N.I.M., Toint, PhL: Universal regularization methods—varying the power, the smoothness and the accuracy. SIAM J. Optim. 29(1), 595–615 (2019)
Conn, A.R., Gould, N.I.M., Toint, PhL: Trust-Region Methods. MPS-SIAM Series on Optimization. SIAM, Philadelphia (2000)
Gould, N.I.M., Orban, D., Toint, PhL: GALAHAD, a library of thread-safe Fortran 90 packages for large-scale nonlinear optimization. ACM Trans. Math. Softw. 29(4), 353–372 (2003)
Grapiglia, G.N., Nesterov, Yu.: Regularized Newton methods for minimizing functions with Hölder continuous Hessians. SIAM J. Optim. 27(1), 478–506 (2017)
Grapiglia, G.N., Yuan, J., Yuan, Y.: On the convergence and worst-case complexity of trust-region and regularization methods for unconstrained optimization. Math. Program. 152, 491–520 (2015)
Griewank, A., Walther, A.: Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Applied Mathematics, vol. 105, 2nd edn. SIAM, Philadelphia (2008)
Gundersen, G., Steihaug, T.: On large-scale unconstrained optimization problems and higher order methods. Optim. Methods. Softw. 25(3), 337–358 (2010)
Lu, H., Freund, R., Nesterov, Yu.: Relatively smooth convex optimization by first-order methods, and applications. SIAM J. Optim. 28(1), 333–354 (2018)
Hoffmann, K.H., Kornstaedt, H.J.: Higher-order necessary conditions in abstract mathematical programming. J. Optim. Theory Appl. 26, 533–568 (1978)
Lasserre, J.B.: Moments, Positive Polynomials and Their Applications. Imperial College Press, London (2010)
Monteiro, R.D.C., Svaiter, B.F.: An accelerated hybrid proximal extragradient method for convex optimization and its implications to second-order methods. SIAM J. Optim. 23(2), 1092–1125 (2013)
Nesterov, Yu.: Introductory Lectures on Convex Optimization. Kluwer, Boston (2004)
Nesterov, Yu.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)
Nesterov, Yu.: Accelerating the cubic regularization of Newton’s method on convex problems. Math. Program. 112(1), 159–181 (2008)
Nesterov, Yu.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)
Nesterov, Yu.: Universal gradient methods for convex optimization problems. Math. Program. 152, 381–404 (2015)
Nesterov, Yu., Nemirovskii, A.: Interior Point Polynomial Methods in Convex Programming: Theory and Applications. SIAM, Philadelphia (1994)
Nesterov, Yu., Polyak, B.: Cubic regularization of Newton’s method and its global performance. Math. Program. 108(1), 177–205 (2006)
Schnabel, R.B., Chow, T.T.: Tensor methods for unconstrained optimization using second derivatives. SIAM J. Optim. 1(3), 293–315 (1991)
Acknowledgements
The author is very thankful to Geovani Grapiglia for interesting discussions of the results. The comments of two anonymous referees were extremely useful.
Research results presented in this paper were obtained in the framework of ERC Advanced Grant 788368 and Russian Science Foundation (Grant 17-11-01027).
Nesterov, Y. Implementable tensor methods in unconstrained convex optimization. Math. Program. 186, 157–183 (2021). https://doi.org/10.1007/s10107-019-01449-1
Keywords
- High-order methods
- Tensor methods
- Convex optimization
- Worst-case complexity bounds
- Lower complexity bounds