
1 Introduction

In recent years, higher-order (tensor) methods for convex optimization problems have been actively developed. The primary impulse was the work of Yu. Nesterov [23] on implementable tensor methods. He proposed a clever regularization of the Taylor approximation that makes the subproblem convex and hence implementable. Yu. Nesterov also proposed accelerated tensor methods [22, 23]; later A. Gasnikov et al. [4, 11, 12, 18] proposed a near-optimal tensor method via the Monteiro–Svaiter envelope [21] with line-search and obtained a near-optimal convergence rate up to a logarithmic factor. Since 2018–2019 interest in this topic has been rising. There are many developments in tensor methods, such as tensor methods for Hölder-continuous higher-order derivatives [15, 28], proximal methods [6], tensor methods for minimizing the gradient norm of a convex function [9, 15], inexact tensor methods [14, 19, 24], and a near-optimal composition of tensor methods for the sum of two functions [19]. There are also results on local convergence and on convergence for strongly convex functions [7, 10, 11]. See [10] for more references on applications of tensor methods.

At the very beginning of 2020, Yurii Nesterov proposed a Superfast Second-Order Method [25] that converges with the rate \(O(N^{-4})\) for convex functions with Lipschitz third-order derivative. This method uses only second-order information at each iteration, but assumes additional smoothness via the Lipschitz third-order derivative. Here we should note that for first-order methods the worst-case example cannot be improved by additional smoothness, because it is a specific quadratic function that has all higher-order derivatives bounded [24]. But for second-order methods one can see that the worst-case example does not have a Lipschitz third-order derivative. This means that under the additional assumption the classical rate \(O(N^{-7/2})\), which matches the lower bound for second-order methods, can be beaten, and Nesterov proposed such a method that converges with the rate \(O(N^{-4})\) up to a logarithmic factor. The main idea of this method is to run a third-order method in which the Taylor approximation subproblem is solved inexactly by a linearly convergent method with inexact gradients. Using inexact gradients makes it possible to replace the direct computation of the third derivative with an inexact model that uses only first-order information. Note that for non-convex problems it was previously proved that additional smoothness can speed up algorithms [1, 3, 14, 26, 29].

In this paper, we propose a Hyperfast Second-Order Method for convex functions with Lipschitz third-order derivative with the convergence rate \(O(N^{-5})\) up to a logarithmic factor. To this end, we first introduce an Inexact Near-Optimal Accelerated Tensor Method, based on the methods from [4, 19], and prove its convergence. Next, we apply the Bregman-Distance Gradient Method from [14, 25] to solve the Taylor approximation subproblem up to the desired accuracy. This leads us to the Hyperfast Second-Order Method, and we prove its convergence rate. This method has a near-optimal convergence rate for convex functions with Lipschitz third-order derivative, the best known at the moment.

The paper is organized as follows. In Sect. 2 we formulate the problem and introduce basic facts and notation. In Sect. 3 we propose the Inexact Near-Optimal Accelerated Tensor Method and prove its convergence rate. In Sect. 4 we propose the Hyperfast Second-Order Method and derive its convergence rate.

2 Problem Statement and Preliminaries

In what follows, we work in a finite-dimensional linear vector space \(E=\mathbb {R}^n\), equipped with the Euclidean norm \(\Vert \,\cdot \,\Vert =\Vert \,\cdot \,\Vert _2\).

We consider the following convex optimization problem:

$$\begin{aligned} \min \limits _{x} f(x), \end{aligned}$$
(1)

where f(x) is a convex function with Lipschitz p-th derivative, i.e.

$$\begin{aligned} \Vert D^p f(x)- D^p f(y)\Vert \le L_{p}\Vert x-y\Vert . \end{aligned}$$
(2)

Then the Taylor approximation of the function f(x) can be written as follows:

$$\begin{aligned} \varOmega _{p}(f,x;y)=f(x)+\sum _{k=1}^{p}\frac{1}{k!}D^{k}f(x)\left[ y-x \right] ^k, \, y\in \mathbb {R}^n. \end{aligned}$$
(3)

By (2) and standard integration we get the following two inequalities:

$$\begin{aligned} |f(y)-\varOmega _{p}(f,x;y)|\le \frac{L_{p}}{(p+1)!}\Vert y-x\Vert ^{p+1}, \end{aligned}$$
(4)
$$\begin{aligned} \Vert \nabla f(y)- \nabla \varOmega _{p}(f,x;y)\Vert \le \frac{L_{p}}{p!}\Vert y-x\Vert ^{p}. \end{aligned}$$
(5)
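
As a quick numerical illustration (a toy sketch, not part of the analysis), the following snippet builds the Taylor model (3) for \(p=3\) on the separable quartic \(f(x)=\tfrac{1}{4}\sum _i x_i^4\), whose third derivative is Lipschitz with \(L_3=6\), and checks the bound (4).

```python
# Numerical sanity check of the Taylor model (3) and bound (4) for p = 3
# on the toy function f(x) = 0.25 * sum(x_i^4), for which L_3 = 6.
import numpy as np

rng = np.random.default_rng(0)

def f(x):         return 0.25 * np.sum(x ** 4)
def grad(x):      return x ** 3                    # D^1 f(x)
def hess(x):      return np.diag(3.0 * x ** 2)     # D^2 f(x)
def d3_hhh(x, h): return 6.0 * np.sum(x * h ** 3)  # D^3 f(x)[h]^3

L3 = 6.0

def taylor_model(x, y):
    """Omega_3(f, x; y) from (3), written out for p = 3."""
    h = y - x
    return f(x) + grad(x) @ h + 0.5 * h @ hess(x) @ h + d3_hhh(x, h) / 6.0

x = rng.standard_normal(5)
y = x + 0.1 * rng.standard_normal(5)

lhs = abs(f(y) - taylor_model(x, y))                  # |f(y) - Omega_3(f,x;y)|
rhs = L3 / 24.0 * np.linalg.norm(y - x) ** 4          # L_3/(p+1)! * ||y-x||^{p+1}
print(lhs, "<=", rhs, lhs <= rhs + 1e-12)
```

For this particular f the bound (4) is tight in the one-dimensional case, since \(f(y)-\varOmega _3(f,x;y)=\tfrac{1}{4}(y-x)^4\).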

3 Inexact Near-Optimal Accelerated Tensor Method

Problem (1) can be solved by the tensor method [23] or its accelerated versions [4, 12, 18, 22]. These methods are based on the following basic step:

$$\begin{aligned} T_{H_p}(x) = \mathop {\mathrm {argmin}}\limits _{y} \left\{ \tilde{\varOmega }_{p,H_p}(f,x;y) \right\} , \end{aligned}$$

where

$$\begin{aligned} \tilde{\varOmega }_{p,H_p}(f,x;y) = \varOmega _{p}(f,x;y) + \frac{H_p}{p!}\Vert y - x \Vert ^{p+1}. \end{aligned}$$
(6)

For \(H_p\ge L_p\) this subproblem is convex and hence implementable.
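
For illustration, here is a minimal sketch of the basic step for \(p=3\) on the same toy quartic as above, where the regularized model (6) is minimized by a generic off-the-shelf solver; this is an assumption made for the sake of a runnable example, while in practice specialized subsolvers such as the BDGM of Sect. 4 are used.

```python
# A minimal sketch of the basic tensor step (6) for p = 3, solved numerically
# with a generic solver; the toy objective f(x) = 0.25 * sum(x_i^4) has L_3 = 6.
import numpy as np
from scipy.optimize import minimize

def f(x):         return 0.25 * np.sum(x ** 4)
def grad(x):      return x ** 3
def hess(x):      return np.diag(3.0 * x ** 2)
def d3_hhh(x, h): return 6.0 * np.sum(x * h ** 3)

L3, H3 = 6.0, 6.0                      # H_p >= L_p, so the model is convex

def regularized_model(y, x):
    """tilde{Omega}_{3,H_3}(f, x; y) = Omega_3(f, x; y) + H_3/3! * ||y - x||^4."""
    h = y - x
    omega3 = f(x) + grad(x) @ h + 0.5 * h @ hess(x) @ h + d3_hhh(x, h) / 6.0
    return omega3 + (H3 / 6.0) * np.linalg.norm(h) ** 4

def tensor_step(x):
    """T_{H_3}(x): (approximate) minimizer of the regularized Taylor model."""
    res = minimize(regularized_model, x, args=(x,), method="BFGS")
    return res.x

x0 = np.array([1.0, -2.0, 0.5])
print("model minimizer:", tensor_step(x0))
```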

But what if we cannot solve this subproblem exactly? In [25], the Inexact pth-Order Basic Tensor Method (BTMI\(_p\)) and the Inexact pth-Order Accelerated Tensor Method (ATMI\(_p\)) were introduced, with convergence rates \(O(k^{-p})\) and \(O(k^{-(p+1)})\), respectively. In this section, we introduce the Inexact pth-Order Near-Optimal Accelerated Tensor Method (NATMI\({_p}\)) with the improved convergence rate \(\tilde{O}(k^{-\frac{3p+1}{2}})\), where \(\tilde{O}(\cdot )\) hides a logarithmic factor. It is an improvement of the Accelerated Taylor Descent from [4] and a generalization of the Inexact Accelerated Taylor Descent from [19].

First, we introduce the definition of an inexact subproblem solution. Any point from the set

$$\begin{aligned} \mathcal {N}_{p,H_p}^{\gamma }(x) = \left\{ T \in \mathbb {R}^n \, : \, \Vert \nabla \tilde{\varOmega }_{p,H_p}(f,x;T) \Vert \le \gamma \Vert \nabla f(T)\Vert \right\} \end{aligned}$$
(7)

is called an inexact subproblem solution, where \(\gamma \in [0, 1]\) is an accuracy parameter; \(\mathcal {N}_{p,H_p}^{0}(x)\) corresponds to the exact solution of the subproblem.
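
In code, the acceptance test (7) can be checked as follows for \(p=3\) (a sketch on the toy quartic again; the candidate point T is arbitrary here).

```python
# A sketch of the inexactness test (7): accept T if the gradient of the
# regularized model at T is at most gamma * ||grad f(T)||.
import numpy as np

def grad(x):     return x ** 3                 # f(x) = 0.25 * sum(x_i^4)
def hess(x):     return np.diag(3.0 * x ** 2)
def d3_hh(x, h): return 6.0 * x * h ** 2       # vector D^3 f(x)[h]^2

H3 = 6.0

def grad_model(x, T):
    """Gradient of tilde{Omega}_{3,H_3}(f, x; .) at T, cf. (13)."""
    h = T - x
    grad_omega3 = grad(x) + hess(x) @ h + 0.5 * d3_hh(x, h)
    return grad_omega3 + (2.0 / 3.0) * H3 * np.linalg.norm(h) ** 2 * h

def is_inexact_solution(x, T, gamma):
    """Check T in N^gamma_{3,H_3}(x) as defined in (7)."""
    return np.linalg.norm(grad_model(x, T)) <= gamma * np.linalg.norm(grad(T))

x = np.array([1.0, -2.0, 0.5])
T = 0.8 * x                                    # some candidate point
print(is_inexact_solution(x, T, gamma=1.0 / 6.0))
```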

Next we propose Algorithm 1.

[Algorithm 1: Inexact \(p\)th-Order Near-Optimal Accelerated Tensor Method (NATMI\(_p\))]
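
For orientation, the following schematic sketch shows a Monteiro–Svaiter-type outer loop in the spirit of Accelerated Taylor Descent [4], using the acceptance condition (10); it is an illustrative assumption rather than a verbatim reproduction of Algorithm 1. The callables grad_f and inexact_tensor_step (which must return a point of \(\mathcal {N}_{p,H_p}^{\gamma }(\tilde{x}_k)\)) are assumed to be provided by the user.

```python
# Schematic sketch of a Monteiro-Svaiter-type accelerated tensor loop (an
# assumption for illustration, not the exact Algorithm 1): lambda_{k+1} is found
# by bisection so that condition (10) holds, and y_{k+1} is any inexact tensor
# step, i.e. a point of N^gamma_{p,H_p}(x_tilde).
import math
import numpy as np

def natmi_sketch(x0, grad_f, inexact_tensor_step, Hp, p, n_iters):
    A, x, y = 0.0, x0.copy(), x0.copy()
    for _ in range(n_iters):
        lo, hi = 1e-12, 1e12                       # bracket for the bisection in lambda
        for _ in range(100):
            lam = math.sqrt(lo * hi)
            a = (lam + math.sqrt(lam ** 2 + 4.0 * lam * A)) / 2.0  # a^2 = lam * (A + a)
            A_next = A + a
            x_tilde = (A / A_next) * y + (a / A_next) * x
            y_next = inexact_tensor_step(x_tilde)  # any point of N^gamma(x_tilde), cf. (7)
            r = np.linalg.norm(y_next - x_tilde)
            val = lam * Hp * r ** (p - 1) / math.factorial(p - 1)
            if val < 0.5:                          # condition (10) violated from below
                lo = lam
            elif val > p / (p + 1.0):              # condition (10) violated from above
                hi = lam
            else:
                break
        x = x - a * grad_f(y_next)                 # update of the "dual" sequence, as in [4]
        y, A = y_next, A_next
    return y
```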

To derive the convergence rate of Algorithm 1, we prove additional lemmas. The first lemma provides an intermediate inequality connecting the inexactness of the subproblem solution with the analysis of the method.

Lemma 1

If \(y_{k+1} \in \mathcal {N}_{p,H_p}^{\gamma }(\tilde{x}_k) \), then

$$\begin{aligned} \Vert \nabla \tilde{\varOmega }_{p,H_p}(f,\tilde{x}_k;y_{k+1}) \Vert \le \frac{\gamma }{1-\gamma } \cdot \frac{(p+1)H_p+L_p}{p!}\Vert y_{k+1}-\tilde{x}_k\Vert ^p. \end{aligned}$$
(9)

Proof

From the triangle inequality we get

$$\begin{aligned} \Vert \nabla f(y_{k+1}) \Vert&\le \Vert \nabla f(y_{k+1}) - \nabla \varOmega _{p}(f,\tilde{x}_k;y_{k+1}) \Vert \\&+ \Vert \nabla \varOmega _{p}(f,\tilde{x}_k;y_{k+1})-\nabla \tilde{\varOmega }_{p,H_p}(f,\tilde{x}_k;y_{k+1}) \Vert + \Vert \nabla \tilde{\varOmega }_{p,H_p}(f,\tilde{x}_k;y_{k+1}) \Vert \\&\overset{(5),(6),(7)}{\le } \frac{L_p}{p!}\Vert y_{k+1}-\tilde{x}_k \Vert ^{p}+ \frac{(p+1)H_p}{p!}\Vert y_{k+1} - \tilde{x}_k \Vert ^{p} + \gamma \Vert \nabla f(y_{k+1}) \Vert . \end{aligned}$$

Hence,

$$\begin{aligned} (1-\gamma )\Vert \nabla f(y_{k+1}) \Vert \le \frac{(p+1)H_p+L_p}{p!}\Vert y_{k+1} - \tilde{x}_k \Vert ^{p}. \end{aligned}$$

Finally, from (7) we get

$$\begin{aligned} \Vert \nabla \tilde{\varOmega }_{p,H_p}(f,\tilde{x}_k;y_{k+1}) \Vert \le \frac{\gamma }{1-\gamma } \cdot \frac{(p+1)H_p+L_p}{p!}\Vert y_{k+1} - \tilde{x}_k \Vert ^{p}. \end{aligned}$$

The next lemma plays a crucial role in the proof of convergence of Algorithm 1. It is a generalization of Lemma 3.1 from [4] to the inexact subproblem.

Lemma 2

If \(y_{k+1} \in \mathcal {N}_{p,H_p}^{\gamma }(\tilde{x}_k) \) and \(H_p=\xi L_p\), where \(\xi \) and \(\gamma \) are such that \(1 \ge 2\gamma +\frac{1}{\xi (p+1)}\), and

$$\begin{aligned} \frac{1}{2} \le \lambda _{k+1} \frac{H_p \cdot \Vert y_{k+1} - \tilde{x}_k\Vert ^{p-1}}{(p-1)!} \le \frac{p}{p+1} \, , \end{aligned}$$
(10)

then

$$\begin{aligned} \Vert y_{k+1} - (\tilde{x}_k - \lambda _{k+1} \nabla f(y_{k+1})) \Vert \le \sigma \cdot \Vert y_{k+1}-\tilde{x}_k\Vert \, , \end{aligned}$$
(11)

with

$$\begin{aligned} \sigma \ge \frac{p \xi + 1 -\xi +2\gamma \xi }{(1-\gamma )2p \xi }, \end{aligned}$$
(12)

where \(\sigma \le 1\).

Proof

Note that, by definition,

$$\begin{aligned} \begin{aligned} \nabla \tilde{\varOmega }_{p,H_p}(f,\tilde{x}_k;y_{k+1})&= \nabla \varOmega _{p}(f,\tilde{x}_k;y_{k+1})\\&+ \frac{H_p(p+1)}{p!}\Vert y_{k+1} - \tilde{x}_k \Vert ^{p-1} (y_{k+1}-\tilde{x}_k). \end{aligned} \end{aligned}$$
(13)

Hence,

$$\begin{aligned} \begin{aligned} y_{k+1}-\tilde{x}_k&= \frac{p!}{H_p(p+1)\Vert y_{k+1} - \tilde{x}_k \Vert ^{p-1}} \\&\cdot \left( \nabla \tilde{\varOmega }_{p,H_p}(f,\tilde{x}_k;y_{k+1}) - \nabla \varOmega _{p}(f,\tilde{x}_k;y_{k+1})\right) . \end{aligned} \end{aligned}$$
(14)

Then, by the triangle inequality, we get

$$\begin{aligned}&\Vert y_{k+1} - (\tilde{x}_k - \lambda _{k+1} \nabla f(y_{k+1})) \Vert = \Vert \lambda _{k+1} (\nabla f(y_{k+1})- \nabla \varOmega _{p}(f,\tilde{x}_k;y_{k+1}))\\&+\lambda _{k+1}\nabla \tilde{\varOmega }_{p,H_p}(f,\tilde{x}_k;y_{k+1})\\&+ \left. \left( y_{k+1} - \tilde{x}_k + \lambda _{k+1}(\nabla \varOmega _{p}(f,\tilde{x}_k;y_{k+1})-\nabla \tilde{\varOmega }_{p,H_p}(f,\tilde{x}_k;y_{k+1}))\right) \right\| \\&\overset{(5),(14)}{\le } \lambda _{k+1} \frac{L_p}{p!} \Vert y_{k+1} - \tilde{x}_k\Vert ^p + \lambda _{k+1}\Vert \nabla \tilde{\varOmega }_{p,H_p}(f,\tilde{x}_k;y_{k+1})\Vert \\&+ \left| \lambda _{k+1} - \frac{p!}{H_p \cdot (p+1) \cdot \Vert y_{k+1} - \tilde{x}_k\Vert ^{p-1}} \right| \\&\cdot \Vert \nabla \tilde{\varOmega }_{p,H_p}(f,\tilde{x}_k;y_{k+1})-\nabla \varOmega _{p}(f,\tilde{x}_k;y_{k+1})\Vert \end{aligned}$$
$$\begin{aligned}&\overset{(9),(13)}{\le } \Vert y_{k+1} - \tilde{x}_k\Vert \left( \lambda _{k+1} \frac{L_p}{p!} \Vert y_{k+1} - \tilde{x}_k\Vert ^{p-1} \right. \\&+\left. \lambda _{k+1}\frac{\gamma }{1-\gamma } \cdot \frac{(p+1)H_p+L_p}{p!}\Vert y_{k+1}-\tilde{x}_k\Vert ^{p-1}\right) \\&+ \left| \lambda _{k+1} - \frac{p!}{H_p \cdot (p+1) \cdot \Vert y_{k+1} - \tilde{x}_k\Vert ^{p-1}} \right| \cdot \frac{(p+1)H_p}{p!} \Vert y_{k+1} - \tilde{x}_k\Vert ^{p} \end{aligned}$$
$$\begin{aligned}&=\Vert y_{k+1} - \tilde{x}_k\Vert \left( \frac{\lambda _{k+1}}{p!} \left( L_p + \frac{\gamma }{1-\gamma } ((p+1)H_p+L_p) \right) \Vert y_{k+1} - \tilde{x}_k\Vert ^{p-1} \right) \\&+ \Vert y_{k+1} - \tilde{x}_k\Vert \left| \frac{\lambda _{k+1}(p+1)H_p}{p!} \Vert y_{k+1} - \tilde{x}_k\Vert ^{p-1} - 1\right| \end{aligned}$$
$$\begin{aligned}&\overset{(10)}{\le }\Vert y_{k+1} - \tilde{x}_k\Vert \left( \frac{\lambda _{k+1}}{p!} \left( L_p + \frac{\gamma }{1-\gamma } ((p+1)H_p+L_p) \right) \Vert y_{k+1} - \tilde{x}_k\Vert ^{p-1} \right) \\&+ \Vert y_{k+1} - \tilde{x}_k\Vert \left( 1-\frac{\lambda _{k+1}(p+1)H_p}{p!} \Vert y_{k+1} - \tilde{x}_k\Vert ^{p-1} \right) \\&=\Vert y_{k+1} - \tilde{x}_k\Vert \left( 1 + \frac{\lambda _{k+1}}{p!} \Vert y_{k+1} - \tilde{x}_k\Vert ^{p-1} \right. \\&\cdot \left. \left( L_p - (p+1)H_p + \frac{\gamma }{1-\gamma } ((p+1)H_p+L_p) \right) \right) . \end{aligned}$$

Hence, by (10) and simple calculations we get

$$\begin{aligned} \sigma&\ge 1 + \frac{1}{2p H_p} \left( L_p - (p+1)H_p + \frac{\gamma }{1-\gamma } ((p+1)H_p+L_p) \right) \\&= 1 + \frac{1}{2p \xi } \left( 1 - (p+1)\xi + \frac{\gamma }{1-\gamma } ((p+1)\xi +1) \right) \\&= 1 + \frac{1}{2p \xi } \left( 1 - p\xi -\xi + \frac{\gamma p\xi +\gamma \xi +\gamma }{1-\gamma } \right) \\&= 1 + \frac{1}{2p \xi } \left( \frac{1 - p\xi -\xi - \gamma + \gamma p\xi +\gamma \xi +\gamma p\xi +\gamma \xi +\gamma }{1-\gamma } \right) \\&= 1 + \left( \frac{1 - p\xi -\xi + 2\gamma p\xi +2\gamma \xi }{(1-\gamma )2p \xi } \right) \\&= \frac{p \xi + 1 -\xi +2\gamma \xi }{(1-\gamma )2p \xi }. \end{aligned}$$

Lastly, we prove that \(\sigma \le 1\). For that we need

$$\begin{aligned} (1-\gamma )2p \xi&\ge p \xi + 1 -\xi +2\gamma \xi \\ (p+1) \xi&\ge 1 +2\gamma \xi (1+p)\\ \frac{1}{2}-\frac{1}{2\xi (p+1)}&\ge \gamma . \end{aligned}$$
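For example, for the parameters used in Sect. 4, \(p=3\) and \(\xi =\tfrac{3}{2}\), this condition reads

$$\begin{aligned} \gamma \le \frac{1}{2}-\frac{1}{2\cdot \frac{3}{2}\cdot 4}=\frac{1}{2}-\frac{1}{12}=\frac{5}{12}, \end{aligned}$$

so the choice \(\gamma =1/6\) made there is admissible.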

We have proved the main lemma needed for the convergence rate theorem; the other parts of the proof are the same as in [4]. As a result, we get the following theorem.

Theorem 1

Let f be a convex function whose \(p\)-th derivative is \(L_p\)-Lipschitz, and let \(x_{*}\) denote a minimizer of f. Then Algorithm 1 converges with the rate

$$\begin{aligned} f(y_k) - f(x_{*}) \le \tilde{O}\left( \frac{H_p R^{p+1}}{k^{\frac{ 3p +1}{2}}}\right) \,, \end{aligned}$$
(15)

where

$$\begin{aligned} R=\Vert x_0 - x_{*}\Vert \end{aligned}$$
(16)

is the distance from the starting point to the solution.

4 Hyperfast Second-Order Method

In the recent work [25] it was mentioned that for the convex optimization problem (1) with a first-order oracle (returning the gradient) the well-known complexity bound \(\left( L_{1}R^2/\varepsilon \right) ^{1/2}\) cannot be beaten even if we assume that all \(L_{p} < \infty \). This is because of the structure of the worst-case function

$$f_p(x) = |x_1|^{p+1} + |x_2 - x_1|^{p+1} + ... + |x_n - x_{n-1}|^{p+1},$$

where \(p = 1\) for first-order methods. It is obvious that \(f_p(x)\) satisfies the condition \(L_{p} < \infty \) for all natural p, so additional smoothness assumptions do not allow any further acceleration. The same holds, for example, for \(p=3\): in this case we also have \(L_{p} < \infty \) for all natural p. But what about \(p=2\)? In this case \(L_3 = \infty \). It means that \(f_2(x)\) cannot be the proper worst-case function for second-order methods under additional smoothness assumptions. So the following question arises: Is it possible to improve the bound \(\left( L_{2}R^3/\varepsilon \right) ^{2/7}\)? At the very beginning of 2020 Yu. Nesterov gave a positive answer. For this purpose, he proposed to implement the accelerated third-order method [23], which requires \(\tilde{O}\left( (L_{3}R^4/\varepsilon )^{1/4}\right) \) iterations, by using only a second-order oracle [25]. All this means that if \(L_3 < \infty \), then there are methods that can be much faster than \(\tilde{O}\left( \left( L_{2}R^3/\varepsilon \right) ^{2/7}\right) \).
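
The following toy snippet (an illustration only, using a finite-difference probe of the third derivative) makes this concrete: for \(p=2\) the one-dimensional term \(|t|^{3}\) has third derivative \(6\,\mathrm {sign}(t)\), which jumps at zero, so \(L_3=\infty \), whereas for \(p=3\) the term \(t^4\) has third derivative \(24t\), which is 24-Lipschitz.

```python
# The worst-case family f_p(x) = |x_1|^{p+1} + sum_i |x_{i+1} - x_i|^{p+1} and a
# finite-difference probe of the third derivative of a single term |t|^{p+1}.
import numpy as np

def f_p(x, p):
    diffs = np.diff(x, prepend=0.0)              # x_1, x_2 - x_1, ..., x_n - x_{n-1}
    return np.sum(np.abs(diffs) ** (p + 1))

def third_derivative_1d(t, p, h=1e-3):
    """Central finite difference of d^3/dt^3 |t|^{p+1}."""
    g = lambda s: np.abs(s) ** (p + 1)
    return (g(t + 2*h) - 2*g(t + h) + 2*g(t - h) - g(t - 2*h)) / (2 * h ** 3)

print("f_2 at the first basis vector:", f_p(np.array([1.0, 0.0, 0.0]), p=2))
for p in (2, 3):
    left, right = third_derivative_1d(-0.05, p), third_derivative_1d(0.05, p)
    print(f"p={p}: D^3|t|^{p+1} near 0-: {left:.2f}, near 0+: {right:.2f}")
# For p=2 the probe prints values near -6 and +6 (a jump of size 12 across an
# arbitrarily small interval), while for p=3 it prints about -1.2 and +1.2,
# consistent with the 24-Lipschitz third derivative 24*t.
```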

In this section, we improve the convergence rate and reach a near-optimal rate up to a logarithmic factor. We consider problem (1) with \(p=3\), hence \(L_3<\infty \). In the previous section, we proved that Algorithm 1 converges. Now we fix the parameters of this method:

$$\begin{aligned} p=3,\quad \gamma =\frac{1}{2p}=\frac{1}{6}, \quad \xi = \frac{2p}{p+1}=\frac{3}{2}. \end{aligned}$$
(17)

By (12) we get \(\sigma = 0.6\), which is rather close to the exact-solution value \(\sigma _{0}=0.5\). For these parameters, the number of iterations of Algorithm 1 needed to reach accuracy \(\varepsilon \) is

$$\begin{aligned} N_{out}= \tilde{O}\left( \left( \frac{L_3 R^{4}}{\varepsilon }\right) ^{\frac{1}{5}}\right) . \end{aligned}$$
(18)
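
The value \(\sigma =0.6\) above follows by substituting (17) into the right-hand side of (12):

$$\begin{aligned} \sigma = \frac{p \xi + 1 -\xi +2\gamma \xi }{(1-\gamma )2p \xi } = \frac{3\cdot \frac{3}{2} + 1 - \frac{3}{2} + 2\cdot \frac{1}{6}\cdot \frac{3}{2}}{\left( 1-\frac{1}{6}\right) \cdot 2\cdot 3\cdot \frac{3}{2}} = \frac{4.5}{7.5}=0.6 . \end{aligned}$$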

Note that at every step of Algorithm 1 we need to solve the following subproblem with accuracy \(\gamma =1/6\):

$$\begin{aligned} \begin{aligned} \mathop {\mathrm {argmin}}\limits _{y}&\left\{ \left\langle \nabla f(x_i),y-x_i\right\rangle +\frac{1}{2}\nabla ^2 f(x_i)[y-x_i]^2\right. \\&+ \left. \frac{1}{6}D^3f(x_i)[y-x_i]^3 + \frac{L_3}{4}\Vert y - x_i \Vert ^{4} \right\} . \end{aligned} \end{aligned}$$
(19)
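
Note that the regularization coefficient in (19) agrees with (6): for \(p=3\) and \(H_3=\xi L_3=\tfrac{3}{2}L_3\) from (17),

$$\begin{aligned} \frac{H_3}{3!}\Vert y - x_i \Vert ^{4} = \frac{3L_3}{2\cdot 6}\Vert y - x_i \Vert ^{4} = \frac{L_3}{4}\Vert y - x_i \Vert ^{4}. \end{aligned}$$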

In [14] it was proved that problem (19) can be solved by the Bregman-Distance Gradient Method (BDGM) with a linear convergence rate. According to [25], BDGM can be adapted to work with inexact gradients of the objective. This makes it possible to approximate \(D^3 f(x)\) by gradients and to avoid computing \(D^3 f(x)\) at each step. As a result, in [25] it was proved that subproblem (19) can be solved up to accuracy \(\gamma = 1/6\) with one computation of the Hessian and \(O\left( \log \left( \frac{\Vert \nabla f(x_i)\Vert +\Vert \nabla ^2 f(x_i)\Vert }{\varepsilon }\right) \right) \) computations of the gradient.

We use BDGM to solve the subproblem from Algorithm 1 and, as a result, we get the following Hyperfast Second-Order Method as a merging of NATMI and BDGM.

[Algorithm 2: Hyperfast Second-Order Method]

In Algorithm 3, \(\beta _{\rho _k}(z_i,z)\) is the Bregman distance generated by \(\rho _k(z)\):

$$\begin{aligned} \beta _{\rho _k}(z_i,z)=\rho _k(z) - \rho _k(z_i) -\left\langle \nabla \rho _k(z_i), z-z_i \right\rangle . \end{aligned}$$

By \(g_{\varphi _k,\tau }(z)\) we denote an inexact gradient of subproblem (19):

$$\begin{aligned} g_{\varphi _k,\tau }(z)= \nabla f(\tilde{x}_k) +\nabla ^2 f(\tilde{x}_k)[z-\tilde{x}_k]+ \frac{1}{2} g^{\tau }_{\tilde{x}_k}(z) + L_3\Vert z - \tilde{x}_k \Vert ^{2} (z - \tilde{x}_k) \end{aligned}$$
(22)

and \(g^{\tau }_{\tilde{x}_k}(z)\) is an inexact (finite-difference) approximation of \(D^3f(\tilde{x}_k)[z-\tilde{x}_k]^2\):

$$\begin{aligned} g^{\tau }_{\tilde{x}_k}(z)= \frac{1}{\tau ^2}\left( \nabla f(\tilde{x}_k+\tau (z-\tilde{x}_k))+ \nabla f(\tilde{x}_k-\tau (z-\tilde{x}_k))-2\nabla f(\tilde{x}_k)\right) . \end{aligned}$$
(23)
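
As a toy sanity check (an illustration on \(f(x)=\tfrac{1}{4}\sum _i x_i^4\), for which the i-th component of \(D^3f(x)[h]^2\) equals \(6x_ih_i^2\) and \(L_3=6\)), the snippet below verifies that the finite-difference term (23) reproduces \(D^3 f(\tilde{x}_k)[z-\tilde{x}_k]^2\), so that (22) matches the exact gradient of the objective in (19) up to rounding.

```python
# Toy check that the finite-difference term (23) reproduces D^3 f(x)[z-x]^2,
# and hence that (22) matches the exact gradient of the subproblem objective (19).
import numpy as np

def grad(x): return x ** 3                         # f(x) = 0.25 * sum(x_i^4)
def hess(x): return np.diag(3.0 * x ** 2)
L3 = 6.0

def g_tau(x, z, tau):
    """Finite-difference approximation (23) of D^3 f(x)[z - x]^2."""
    h = z - x
    return (grad(x + tau * h) + grad(x - tau * h) - 2.0 * grad(x)) / tau ** 2

def g_inexact(x, z, tau):
    """Inexact gradient (22) of the subproblem (19) at z."""
    h = z - x
    return grad(x) + hess(x) @ h + 0.5 * g_tau(x, z, tau) + L3 * np.linalg.norm(h) ** 2 * h

def g_exact(x, z):
    h = z - x
    d3_hh = 6.0 * x * h ** 2                       # exact D^3 f(x)[h]^2 for the quartic
    return grad(x) + hess(x) @ h + 0.5 * d3_hh + L3 * np.linalg.norm(h) ** 2 * h

x = np.array([1.0, -2.0, 0.5])
z = x + np.array([0.3, -0.1, 0.2])
print(np.linalg.norm(g_inexact(x, z, tau=1e-3) - g_exact(x, z)))  # ~1e-9 (rounding only)
```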

In [25] it is proved that we can choose

$$\delta =O\left( \frac{\varepsilon ^{\frac{3}{2}}}{\Vert \nabla f(\tilde{x}_k)\Vert ^{\frac{1}{2}}_{*}+\Vert \nabla ^2 f(\tilde{x}_k)\Vert ^{\frac{3}{2}}/L_3^{\frac{1}{2}}} \right) ,$$
[Algorithm 3: Bregman-Distance Gradient Method (BDGM) for subproblem (19)]

so that the total number of inner iterations is equal to

$$\begin{aligned} T_k(\delta )=O\left( \ln {\frac{G+H}{\varepsilon }}\right) , \end{aligned}$$
(24)

where G and H are uniform upper bounds for the norms of the gradients and Hessians computed at the points generated by the main algorithm. Finally, we get the following theorem.

Theorem 2

Let f be a convex function whose third derivative is \(L_3\)-Lipschitz, and let \(x_{*}\) denote a minimizer of f. Then, to reach accuracy \(\varepsilon \), Algorithm 2 with Algorithm 3 for solving the subproblem computes

$$\begin{aligned} N_{1}=\tilde{O}\left( \left( \frac{L_3 R^{4}}{\varepsilon }\right) ^{\frac{1}{5}}\right) \end{aligned}$$
(25)

Hessians and

$$\begin{aligned} N_{2}=\tilde{O}\left( \left( \frac{L_3 R^{4}}{\varepsilon }\right) ^{\frac{1}{5}}\log \left( \frac{G+H}{\varepsilon }\right) \right) \end{aligned}$$
(26)

gradients, where G and H are the uniform upper bounds for the norms of the gradients and Hessians computed at the points generated by the main algorithm.

One can generalize this result to uniformly (strongly) convex functions by using the inverse restart-regularization trick from [13].

So, the main observation of this section is as follows: if \(L_3 < \infty \), then one can use this hyperfast second-order algorithm instead of the optimal second-order one considered above to obtain faster methods (in the convex and uniformly convex cases).

5 Conclusion

In this paper, we present the Inexact Near-Optimal Accelerated Tensor Method and improve its convergence rate. This improvement makes it possible to solve the Taylor approximation subproblem by other methods. Next, we propose the Hyperfast Second-Order Method and obtain its convergence rate \(O(N^{-5})\) up to a logarithmic factor. This method is a combination of the Inexact Third-Order Near-Optimal Accelerated Tensor Method with the Bregman-Distance Gradient Method for solving the inner subproblem. As a result, we prove that our method has a near-optimal convergence rate for the given problem class, the best known at the moment.

In this paper, we have developed a near-optimal Hyperfast Second-Order Method for sufficiently smooth convex problems in terms of convergence in function value. Based on the technique from [9], one can also develop a near-optimal Hyperfast Second-Order Method for sufficiently smooth convex problems in terms of convergence in the norm of the gradient. In particular, based on [16] one may show that the complexity of this approach applied to the dual problem of the entropy-regularized optimal transport problem is \(\tilde{O}\left( \left( (\sqrt{n})^{4}/\varepsilon \right) ^{1/5}\right) \cdot O(n^{2.5}) = O(n^{2.9}\varepsilon ^{-1/5})\) a.o., where n is the linear dimension of the transport plan matrix. This could be better than the complexity of the accelerated gradient method and the accelerated Sinkhorn algorithm, \(O(n^{2.5}\varepsilon ^{-1/2})\) a.o. [8, 16]. Note that the best theoretical bounds for this problem are still far from being practical [2, 17, 20, 27].