Abstract
In a Hilbert space setting \({{\mathcal {H}}}\), we study the fast convergence properties as \(t \rightarrow + \infty \) of the trajectories of the second-order differential equation
where \(\nabla \Phi \) is the gradient of a convex continuously differentiable function \(\Phi : {{\mathcal {H}}} \rightarrow {{\mathbb {R}}}\), \(\alpha \) is a positive parameter, and \(g: [t_0, + \infty [ \rightarrow {{\mathcal {H}}}\) is a small perturbation term. In this inertial system, the viscous damping coefficient \(\frac{\alpha }{t}\) vanishes asymptotically, but not too rapidly. For \(\alpha \ge 3\) and \(\int _{t_0}^{+\infty } t \Vert g(t)\Vert dt < + \infty \), assuming only that \({{\mathrm{argmin\,}}}\Phi \ne \emptyset \), we show that any trajectory of the above system satisfies the fast convergence property
Moreover, for \(\alpha > 3\), any trajectory converges weakly to a minimizer of \(\Phi \). The strong convergence is established in various practical situations. These results complement the \({{\mathcal {O}}}(t^{-2})\) rate of convergence for the values obtained by Su, Boyd and Candès in the unperturbed case \(g=0\). Time discretization of this system, and some of its variants, provides new fast converging algorithms, expanding the field of rapid methods for structured convex minimization introduced by Nesterov, and further developed by Beck and Teboulle with FISTA. This study also complements recent advances due to Chambolle and Dossal.
1 Introduction
Let \({{\mathcal {H}}}\) be a real Hilbert space endowed with the scalar product \(\langle \cdot ,\cdot \rangle \) and norm \(\Vert \cdot \Vert \), and let \(\Phi : {{\mathcal {H}}} \rightarrow {{\mathbb {R}}}\) be a convex differentiable function. In this paper, we first study the solution trajectories of the second-order differential equation
with \(\alpha >0\), in terms of their asymptotic behavior as \(t \rightarrow + \infty \). This will serve as a guideline for the study of the corresponding algorithms.
We take for granted the existence and uniqueness of a global solution to the Cauchy problem associated with (1). Although this is not our main concern, we point out that, given \(t_0>0\), for any \(x_0 \in {{\mathcal {H}}}, \ v_0 \in {{\mathcal {H}}} \), the existence of a unique global solution on \([t_0,+\infty [\) for the Cauchy problem with initial condition \(x(t_0)=x_0\) and \(\dot{x}(t_0)=v_0\) can be guaranteed, for instance, if \(\nabla \Phi \) is Lipschitz-continuous on bounded sets, and \(\Phi \) is bounded from below.
Throughout the paper, unless otherwise indicated, we simply assume that
As we shall see, most of the convergence properties of the trajectories are valid under this general assumption. This approach paves the way for the extension of our results to non-smooth convex potential functions (replacing the gradient by the subdifferential).
In preparation for a stability study of this system and the associated algorithms, we will also consider the following perturbed version of (1):
where the right-hand side \(g:[t_0, + \infty [ \rightarrow {{\mathcal {H}}}\) is an integrable source term, such that g(t) is small for large t.
The importance of the evolution system (1) is threefold:
1. Mechanical interpretation: It describes the motion of a particle, with mass equal to one, subject to a potential energy function \(\Phi \), and an isotropic linear damping with a viscosity parameter that vanishes asymptotically. This provides a simple model for a progressive reduction of the friction, possibly due to material fatigue.
2. Fast minimization of function \(\Phi {:}\) Equation (1) is a particular case of the inertial gradient-like system
with asymptotic vanishing damping, studied by Cabot et al. [17, 18]. As shown in [17, Corollary 3.1] (under some additional conditions on \(\Phi \)), every solution \(x(\cdot )\) of (3) satisfies \(\lim _{t\rightarrow +\infty }\Phi (x(t))=\min \Phi \), provided \(\int _0^{\infty }a(t)dt = + \infty \). The specific case (1) was studied by Su, Boyd and Candès in [35] in terms of the rate of convergence of the values. More precisely, [35, Theorem 4.1] establishes that \(\Phi (x(t))-\min \Phi =\mathcal O(t^{-2})\), whenever \(\alpha \ge 3\). Unfortunately, their analysis does not entail the convergence of the trajectory itself.
3. Relationship with fast numerical optimization methods: As pointed out in [35, Section 2], for \(\alpha =3\), (1) can be seen as a continuous version of the fast convergent method of Nesterov (see [26–29]), and its widely used successors, such as the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA), studied in [14]. These methods have a convergence rate of \(\Phi (x_k)-\min \Phi ={{\mathcal {O}}}(k^{-2})\), where k is the number of iterations. As for the continuous-time system (1), convergence of the sequences generated by FISTA and related methods has not been established so far. This is a central and long-standing question in the study of numerical optimization methods.
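As a minimal illustration of this connection (our own sketch, not the precise algorithm studied in Sect. 5), the following Python fragment runs an inertial gradient iteration with extrapolation coefficient \((k-1)/(k+\alpha -1)\), which reduces to the classical Nesterov/FISTA choice for \(\alpha =3\), on a least-squares test problem of our own choosing:

```python
import numpy as np

def accelerated_gradient(grad, x0, step, alpha=3.0, iters=2000):
    """Inertial gradient iteration with extrapolation factor (k-1)/(k+alpha-1)."""
    x_prev = x0.copy()
    x = x0.copy()
    for k in range(1, iters + 1):
        y = x + (k - 1) / (k + alpha - 1) * (x - x_prev)  # inertial extrapolation
        x_prev, x = x, y - step * grad(y)                 # gradient step at y
    return x

# Illustrative test problem: Phi(x) = 0.5 ||A x - b||^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
b = rng.standard_normal(20)
L = np.linalg.norm(A.T @ A, 2)              # Lipschitz constant of grad Phi
grad = lambda x: A.T @ (A @ x - b)
Phi = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
x_star = np.linalg.lstsq(A, b, rcond=None)[0]

x = accelerated_gradient(grad, np.zeros(10), 1.0 / L, alpha=3.0)
print(Phi(x) - Phi(x_star))  # small: the values decay like O(1/k^2)
```

For \(\alpha >3\), the same iteration is the setting in which one can also expect convergence of the iterates themselves, mirroring the continuous-time results established below.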
The purpose of this research is to establish the convergence of the trajectories satisfying (1), as well as the sequences generated by the corresponding numerical methods with Nesterov-type acceleration. We also complete the study with several stability properties concerning both the continuous-time system and the algorithms.
The main contributions of this work are the following:
In Sect. 2, we first establish the minimizing property in the general case where \(\alpha >0\), and \(\inf \Phi \) is not necessarily attained. As a consequence, every weak limit point of the trajectory must be a minimizer of \(\Phi \), and so, the existence of a bounded trajectory characterizes the existence of minimizers. Assuming \({{\mathrm{argmin\,}}}\Phi \ne \emptyset \) and \(\alpha \ge 3\), we recover the \({{\mathcal {O}}}(t^{-2})\) convergence rates, and give several examples and counterexamples concerning the optimality of these results. Next, we show that every solution of (1) converges weakly to a minimizer of \(\Phi \) provided \(\alpha >3\) and \({{\mathrm{argmin\,}}}\Phi \ne \emptyset \). For the limiting case \(\alpha =3\), which corresponds exactly to Nesterov’s method, the convergence of the trajectories is still a puzzling open question. We finish this section by providing an ergodic convergence result for the acceleration of the system in case \(\nabla \Phi \) is Lipschitz-continuous on sublevel sets of \(\Phi \).
In Sect. 3, strong convergence is established in various practical situations enjoying further geometric features, such as strong convexity, symmetry, or nonemptiness of the interior of the solution set. In the strongly convex case, we obtain a surprising result: convergence of the values occurs at a \(\mathcal O(t^{-\frac{2\alpha }{3}})\) rate.
In Sect. 4, we analyze the asymptotic behavior, as \(t\rightarrow +\infty \), of the solutions of the perturbed differential system (2), and obtain similar convergence results under integrability assumptions on the perturbation term g.
Section 5 contains the analogous results for the associated Nesterov-type algorithms (also for \(\alpha >3\)). To avoid repeating similar arguments, we state the results and develop the proofs directly for the perturbed version. As a guideline, we follow the proof of the convergence of the continuous dynamic. We provide discrete versions of the differential inequalities that we used in the Lyapunov convergence analysis. The convergence results are parallel to those for the continuous case, under summability conditions on the errors.
As we were preparing the final version of this manuscript, we discovered the preprint [19] by Chambolle and Dossal, where the weak convergence result of the algorithm in the unperturbed case is obtained by a similar, but different argument (see [19, Theorem 3]). This approach has been further developed by Aujol–Dossal [10] in the perturbed case.
2 Minimizing property, convergence rates and weak convergence of the trajectories
We begin this section by providing some preliminary estimations concerning the global energy of the system (1) and the distance to the minimizers of \(\Phi \). These allow us to show the minimizing property of the trajectories under minimal assumptions. Next, we recover the convergence rates for the values originally given in [35], and obtain further decay estimates that ultimately imply the convergence of the solutions of (1). We finish the study by proving an ergodic convergence result for the acceleration. Several examples and counterexamples are given throughout the section.
2.1 Preliminary remarks and estimations
The existence of global solutions to (1) has been examined, for instance, in [17, Proposition 2.2] in the case of a general asymptotic vanishing damping coefficient. In our setting, for any \(t_0 >0, \alpha >0\), and \((x_0,v_0)\in \mathcal H\times {{\mathcal {H}}}\), there exists a unique global solution \(x : [t_0, +\infty [ \rightarrow {{\mathcal {H}}}\) of (1), satisfying the initial condition \(x(t_0)= x_0, \dot{x}(t_0)= v_0\), under the sole assumption that \(\nabla \Phi \) is Lipschitz-continuous on bounded sets, and \(\inf \Phi > - \infty \). The restriction \(t_0>0\) comes from the singularity of the damping coefficient \(a(t) = \frac{\alpha }{t}\) at zero. Since we are only concerned with the asymptotic behavior of the trajectories, the choice of the origin of time is unimportant. If one insists on starting from \(t_0 =0\), then all the results remain valid with \(a(t)=\frac{\alpha }{t+1}\).
At different points, we shall use the global energy of the system, given by \(W:[t_0,+\infty [\rightarrow {\mathbb {R}}\)
Using (1), we immediately obtain
Lemma 2.1
Let W be defined by (4). For each \(t>t_0\), we have
Hence, W is nonincreasing, and \(W_\infty =\lim _{t\rightarrow +\infty }W(t)\) exists in \({\mathbb {R}}\cup \{-\infty \}\). If \(\Phi \) is bounded from below, \(W_\infty \) is finite.
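On a concrete instance of our own choosing (\(\Phi (x)=\frac{1}{2}x^2\) on \({\mathbb {R}}\), \(\alpha =3\)), this monotonicity can be checked numerically:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy instance: Phi(x) = 0.5 x^2, alpha = 3, so (1) reads
#   x'' + (3/t) x' + x = 0.
# Along any solution, W(t) = 0.5 x'(t)^2 + Phi(x(t)) satisfies
# W'(t) = -(alpha/t) ||x'(t)||^2 <= 0, hence W should be nonincreasing.
alpha, t0 = 3.0, 1.0
rhs = lambda t, y: [y[1], -(alpha / t) * y[1] - y[0]]  # y = (x, x')
sol = solve_ivp(rhs, (t0, 100.0), [1.0, 0.5], rtol=1e-10, atol=1e-12,
                dense_output=True)
ts = np.linspace(t0, 100.0, 400)
xs, vs = sol.sol(ts)
W = 0.5 * vs**2 + 0.5 * xs**2
print(np.all(np.diff(W) <= 1e-8))  # True: W is numerically nonincreasing
```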
Now, given \(z\in {{\mathcal {H}}}\), we define \(h_z:[t_0,+\infty [ \rightarrow {\mathbb {R}}\) by
By the Chain Rule, we have
Using (1), we obtain
The convexity of \(\Phi \) implies
and we deduce that
We have the following relationship between \(h_z\) and W:
Lemma 2.2
Take \(z\in {{\mathcal {H}}}\), and let W and \(h_z\) be defined by (4) and (5), respectively. There is a constant C such that
Proof
Divide (7) by t, and use the definition of W given in (4), to obtain
Integrate this expression from \(t_0\) to \(t>t_0\) (use integration by parts for the first term), to obtain
On the one hand, Lemma 2.1 gives
On the other hand, another integration by parts yields
Combining these inequalities with (8), we get
where C collects the constant terms. \(\square \)
2.2 Minimizing property
It turns out that the trajectories of (1) minimize \(\Phi \) in the completely general setting, where \(\alpha >0\), \({{\mathrm{argmin\,}}}\Phi \) is possibly empty, and \(\Phi \) is not necessarily bounded from below (recall that we assume the existence and uniqueness of a global solution to the Cauchy problem associated with (1), which is not guaranteed by these general assumptions). This property was obtained by Alvarez in [2, Theorem 2.1] for the heavy ball with friction (where the damping is constant). Similar results can be found in [17].
We have the following:
Theorem 2.3
Let \(\alpha >0\), and suppose \(x:[t_0,+\infty [\rightarrow {{\mathcal {H}}}\) is a solution of (1). Then
-
(i)
\(W_\infty =\lim _{t \rightarrow +\infty }W(t)=\lim _{t \rightarrow +\infty }\Phi (x(t))=\inf \Phi \in {\mathbb {R}}\cup \{-\infty \}\).
-
(ii)
As \(t\rightarrow +\infty \), every weak limit point of x(t) lies in \({{\mathrm{argmin\,}}}\Phi \).
-
(iii)
If \({{\mathrm{argmin\,}}}\Phi =\emptyset \), then \(\lim _{t\rightarrow +\infty }\Vert x(t)\Vert =+\infty \).
-
(iv)
If x is bounded, then \({{\mathrm{argmin\,}}}\Phi \ne \emptyset \).
-
(v)
If \(\Phi \) is bounded from below, then \(\lim _{t\rightarrow +\infty }\Vert {{\dot{x}}}(t)\Vert =0\).
-
(vi)
If \(\Phi \) is bounded from below and x is bounded, then \(\lim _{t\rightarrow +\infty }{{\dot{h}}}_z(t)=0\) for each \(z\in {{\mathcal {H}}}\). Moreover,
$$\begin{aligned} \int _{t_0}^\infty \frac{1}{t}(\Phi (x(t))-\min \Phi )dt<+\infty . \end{aligned}$$
Proof
To prove (i), first set \(z\in {{\mathcal {H}}}\) and \(\tau \ge t>t_0\). By Lemma 2.1, W is nonincreasing. Hence, Lemma 2.2 gives
which we rewrite as
and then
Integrate from \(t=t_0\) to \(t=\tau \) to obtain
But
Hence,
for suitable constants \(A, B, \widetilde{C}\) and D. This immediately yields \(W_\infty \le \Phi (z)\), and hence \(W_\infty \le \inf \Phi \). It suffices to observe that
to obtain (i).
Next, (ii) follows from (i) by the lower-semicontinuity of \(\Phi \) for the weak topology. Clearly, (iii) and (iv) are immediate consequences of (ii). We obtain (v) by using (i) and the definition of W given in (4). For (vi), since \({{\dot{h}}}_z(t)=\langle x(t) - z, \dot{x}(t) \rangle \) and x is bounded, (v) implies \(\lim \nolimits _{t\rightarrow +\infty }{{\dot{h}}}_z(t)=0\). Finally, using the definition of W together with Lemma 2.2 with \(z\in {{\mathrm{argmin\,}}}\Phi \), we get
which completes the proof. \(\square \)
Remark 2.4
We shall see in Theorem 2.14 that, for \(\alpha \ge 3\), the existence of minimizers implies that every solution of (1) is bounded. This gives a converse to part (iv) of Theorem 2.3.
If \(\Phi \) is not bounded from below, it may be the case that \(\Vert \dot{x}(t)\Vert \) does not tend to zero, as shown in the following example:
Example 2.5
Let \({{\mathcal {H}}}={\mathbb {R}}\) and \(\alpha >0\). The function \(x(t)=t^2\) satisfies (1) with \(\Phi (x)=-2(\alpha +1)x\). Then \(\lim _{t\rightarrow +\infty }\Phi (x(t))=-\infty =\inf \Phi \), and \(\lim _{t\rightarrow +\infty }\Vert {{\dot{x}}}(t)\Vert =+\infty \).
2.3 Two “anchored” energy functions
We begin by introducing two important auxiliary functions, and showing their basic properties. From now on, we assume \({{\mathrm{argmin\,}}}\Phi \ne \emptyset \). Fix \(\lambda \ge 0, \xi \ge 0, p\ge 0\) and \(x^*\in {{\mathrm{argmin\,}}}\Phi \). Let \(x:[t_0,+\infty [\rightarrow {{\mathcal {H}}}\) be a solution of (1). For \(t\ge t_0\) define
and notice that \({{\mathcal {E}}}_{\lambda ,\xi }\) and \({{\mathcal {E}}}_\lambda ^p\) are sums of nonnegative terms. These generalize the energy functions \({{\mathcal {E}}}\) and \(\tilde{{{\mathcal {E}}}}\) introduced in [35]. More precisely, \({{\mathcal {E}}}={{\mathcal {E}}}_{\alpha -1,0}\) and \(\tilde{\mathcal E}={{\mathcal {E}}}_{(2\alpha -3)/3}^1\).
We need some preparatory calculations prior to differentiating \({{\mathcal {E}}}_{\lambda ,\xi }\) and \({{\mathcal {E}}}_\lambda ^p\). For simplicity of notation, we do not make the dependence of x or \({{\dot{x}}}\) on t explicit. Notice that we use (1) in the second line to dispose of \({{\ddot{x}}}\).
Whence, we deduce
Remark 2.6
If \(y\in {{\mathcal {H}}}\) and \(x^*\in {{\mathrm{argmin\,}}}\Phi \), the convexity of \(\Phi \) gives \(\min \Phi =\Phi (x^*)\ge \Phi (y)+\langle \nabla \Phi (y),x^*-y\rangle \). Using this in (9) with \(y=x(t)\), we obtain
If \(\alpha \ge 3\), \(2\le \lambda \le \alpha -1\), and one chooses \(\xi ^*=\lambda (\alpha -\lambda -1)\) (which is nonnegative), then
Hence \({{\mathcal {E}}}_{\lambda ,\xi ^*}\) is nonincreasing. The extreme cases \(\lambda =2\) and \(\lambda =\alpha -1\) are of special importance, as we shall see shortly.
2.4 Rate of convergence for the values
We now recover convergence rate results for the value of \(\Phi \) along a trajectory, already established in [35, Theorem 4.1]:
Theorem 2.7
Let \(x:[t_0,+\infty [ \rightarrow {{\mathcal {H}}}\) be a solution of (1), and assume \({{\mathrm{argmin\,}}}\Phi \ne \emptyset \). If \(\alpha \ge 3\), then
If \(\alpha >3\), then
Proof
Suppose \(\alpha \ge 3\). Choose \(\lambda =\alpha -1\) and \(\xi =0\), so that \(\xi -\lambda (\alpha -\lambda -1)=\alpha -\lambda -1=0\) and \(\lambda -2=\alpha -3\). Remark 2.6 gives
and \({{\mathcal {E}}}_{\alpha -1,0}\) is nonincreasing. Since \(t^2(\Phi (x)-\min \Phi )\le {{\mathcal {E}}}_{\alpha -1,0}(t)\) (recall the definition of \({{\mathcal {E}}}_{\lambda ,\xi }\)), we obtain
If \(\alpha >3\), integrating (11) from \(t_0\) to t we obtain
which allows us to conclude. \(\square \)
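The rate can also be observed numerically. The following sketch (a toy instance of ours: \(\Phi (x)=\frac{1}{4}x^4\) on \({\mathbb {R}}\), \(\alpha =3\), \(x(t_0)=1\), \(\dot{x}(t_0)=0\)) integrates (1) and monitors \(t^2(\Phi (x(t))-\min \Phi )\), which should remain bounded:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy instance: Phi(x) = 0.25 x^4 (so grad Phi(x) = x^3), alpha = 3.
# For this data, the Lyapunov function of the proof gives the a priori
# bound t^2 (Phi(x(t)) - min Phi) <= E(t0) = 0.25 + 0.5 * 2^2 = 2.25.
alpha, t0 = 3.0, 1.0
rhs = lambda t, y: [y[1], -(alpha / t) * y[1] - y[0]**3]  # y = (x, x')
sol = solve_ivp(rhs, (t0, 500.0), [1.0, 0.0], rtol=1e-10, atol=1e-12,
                dense_output=True)
ts = np.linspace(t0, 500.0, 1000)
vals = ts**2 * 0.25 * sol.sol(ts)[0]**4   # t^2 (Phi(x(t)) - min Phi)
print(vals.max())  # stays bounded, consistent with the O(1/t^2) rate
```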
Remark 2.8
It would be interesting to know whether \(\alpha =3\) is critical for the convergence rate given above.
Remark 2.9
For the (first-order) steepest descent dynamical system, the typical rate of convergence is \({{\mathcal {O}}}(1/t)\) (see, for instance, [33, Section 3.1]). For the second-order system (1), we have obtained a rate of \({{\mathcal {O}}}(1/t^2)\). It would be interesting to know whether higher-order systems give the corresponding rates of convergence. Another challenging question is the convergence rate of the trajectories defined by differential equations involving fractional time derivatives, as well as integro-differential equations.
Remark 2.10
The constant in the order of convergence given by Theorem 2.7 is
where \(x_0=x(t_0)\) and \(v_0={{\dot{x}}}(t_0)\). This quantity is minimized when \(x_0^*\in {{\mathrm{argmin\,}}}\Phi \) and \(v_0^*=\frac{(\alpha -1)}{t_0}(x^*-x_0^*)\), with \(\min K=0\). If \(x_0^*\ne x^*\), the trajectory will not be stationary, but the value \(\Phi (x(t))\) will be constantly equal to \(\min \Phi \). Of course, selecting \(x_0^*\in {{\mathrm{argmin\,}}}\Phi \) is not realistic, and the point \(x^*\) is unknown. Keeping \(\hat{x}_0\) fixed, the function \(v_0\mapsto K(\hat{x}_0,v_0)\) is minimized at \(\hat{v}_0=\frac{(\alpha -1)}{t_0}(x^*-\hat{x}_0)\). This suggests taking the initial velocity as a multiple of an approximation of \(x^*-\hat{x}_0\), such as the negative gradient direction \(\hat{v}_0=-\nabla \Phi (\hat{x}_0)\), the Newton or Levenberg-Marquardt direction \(\hat{v}_0=-[\varepsilon I+\nabla ^2\Phi (\hat{x}_0)]^{-1}\nabla \Phi (\hat{x}_0)\) (\(\varepsilon \ge 0\)), or the proximal point direction \(\hat{v}_0=\left[ (I+\gamma \nabla \Phi )^{-1}(\hat{x}_0)-\hat{x}_0\right] \) (\(\gamma >0\), large).
2.5 Some examples and counterexamples
A convergence rate of \({{\mathcal {O}}}(1/t^2)\) may be attained, even if \({{\mathrm{argmin\,}}}\Phi = \emptyset \) and \(\alpha <3\). This is illustrated in the following example:
Example 2.11
Let \({{\mathcal {H}}} = {{\mathbb {R}}}\) and take \(\Phi (x) = \frac{\alpha -1}{2}e^{-2x}\) with \(\alpha \ge 1\). Let us verify that \(x(t)= \ln t\) is a solution of (1). On the one hand,
On the other hand, \(\nabla \Phi (x)= -(\alpha -1)e^{-2x}\) which gives
Thus, \(x(t)= \ln t\) is a solution of (1). Let us examine the minimizing property. We have \(\inf \Phi = 0\), and \(\Phi (x(t)) = \frac{\alpha -1}{2t^{2}}\), so the values converge at rate \({{\mathcal {O}}}(1/t^2)\), even though the infimum of \(\Phi \) is not attained.
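The computations of Example 2.11 can be confirmed symbolically, for instance with sympy:

```python
import sympy as sp

# Check of Example 2.11: with Phi(u) = ((alpha-1)/2) e^{-2u},
# x(t) = ln t solves x'' + (alpha/t) x' + Phi'(x) = 0, and
# Phi(x(t)) = (alpha - 1)/(2 t^2), an O(1/t^2) decay of the values.
t, u = sp.symbols('t u', positive=True)
alpha = sp.symbols('alpha', positive=True)
Phi = (alpha - 1) / 2 * sp.exp(-2 * u)
x = sp.log(t)
residual = sp.diff(x, t, 2) + (alpha / t) * sp.diff(x, t) + Phi.diff(u).subs(u, x)
print(sp.simplify(residual))                                   # 0
print(sp.simplify(Phi.subs(u, x) - (alpha - 1) / (2 * t**2)))  # 0
```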
Therefore, one may wonder whether the rapid convergence of the values is true in general. The following example shows that this is not the case:
Example 2.12
Let \({{\mathcal {H}}} = {{\mathbb {R}}}\) and take \(\Phi (x) = \frac{c}{x^{\theta }}\), with \(\theta >0, \alpha \ge \frac{\theta }{(2+\theta )}\) and \(c= \frac{2( 2\alpha + \theta (\alpha -1))}{\theta ( 2+ \theta )^2}\). Let us verify that \(x(t)= t^{\frac{2}{2 + \theta }}\) is a solution of (1). On the one hand,
On the other hand, \(\nabla \Phi (x)= -c \theta x^{-\theta -1} \) which gives
Thus, \(x(t)= t^{\frac{2}{2 + \theta }}\) is a solution of (1). Let us examine the minimizing property. We have \(\inf \Phi = 0\), and \(\Phi (x(t)) = c\, t^{-\frac{2\theta }{2+\theta }}\).
We conclude that the order of convergence may be strictly slower than \({{\mathcal {O}}}(1/t^2)\) when \({{\mathrm{argmin\,}}}\Phi =\emptyset \). In Example 2.12, this occurs no matter how large \(\alpha \) is. The speed of convergence of \(\Phi (x(t))\) to \(\inf \Phi \) depends on the behavior of \(\Phi (x)\) as \(\Vert x\Vert \rightarrow +\infty \). The above examples suggest that, when \(\Phi (x)\) decreases rapidly toward its infimal value as \(\Vert x\Vert \rightarrow \infty \), we can expect fast convergence of \(\Phi (x(t))\).
Even when \({{\mathrm{argmin\,}}}\Phi \ne \emptyset \), the rate \({{\mathcal {O}}} (1/t^2)\) is essentially the worst possible one; it is attained as a limit in the following example:
Example 2.13
Take \({{\mathcal {H}}} = {{\mathbb {R}}}\) and \(\Phi (x) = c|x|^{\gamma }\), where c and \(\gamma \) are positive parameters. Let us look for nonnegative solutions of (1) of the form \(x(t)= \frac{1}{t^{\theta }}\), with \(\theta >0\). This means that the trajectory is not oscillating; it is completely damped. We begin by determining the values of \(c, \gamma \) and \(\theta \) that provide such solutions. On the one hand,
On the other hand, \(\nabla \Phi (x)= c \gamma |x|^{\gamma -2}x \), which gives
Thus, \(x(t)= \frac{1}{t^{\theta }}\) is a solution of (1) if, and only if,
-
(i)
\(\theta + 2 = \theta (\gamma -1)\), which is equivalent to \(\gamma >2\) and \(\theta = \frac{2}{\gamma -2}\); and
-
(ii)
\(c \gamma = \theta (\alpha -\theta -1)\), which is equivalent to \(\alpha > \frac{\gamma }{\gamma -2}\) and \(c= \frac{2}{\gamma (\gamma -2)}( \alpha - \frac{\gamma }{\gamma -2})\).
We have \(\min \Phi = 0\) and \(\Phi (x(t)) = c\, t^{-\frac{2 \gamma }{\gamma -2}}\).
The speed of convergence of \(\Phi (x(t))\) to 0 depends on the parameter \(\gamma \). As \(\gamma \) tends to infinity, the exponent \(\frac{2 \gamma }{\gamma -2 }\) tends to 2. This limiting situation is obtained by taking a function \(\Phi \) that becomes very flat around the set of its minimizers. Therefore, without other geometric assumptions on \(\Phi \), we cannot expect a convergence rate better than \({{\mathcal {O}}} (1/t^2)\). By contrast, in Sect. 3, we will show better rates of convergence under some geometrical assumptions, like strong convexity of \(\Phi \).
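The particular instance \(\gamma =4\) of Example 2.13, for which \(\theta =1\) and \(c=\frac{\alpha -2}{4}\) (assuming \(\alpha >2\)), can be confirmed symbolically:

```python
import sympy as sp

# Example 2.13 with gamma = 4: theta = 2/(gamma - 2) = 1,
# c = (2/(gamma(gamma-2))) (alpha - gamma/(gamma-2)) = (alpha - 2)/4,
# Phi(u) = c u^4 (for u > 0), and x(t) = 1/t.
t, u = sp.symbols('t u', positive=True)
alpha = sp.symbols('alpha', positive=True)
c = (alpha - 2) / 4
Phi = c * u**4
x = 1 / t
residual = sp.diff(x, t, 2) + (alpha / t) * sp.diff(x, t) + Phi.diff(u).subs(u, x)
print(sp.simplify(residual))  # 0
# The values then decay like Phi(x(t)) = c / t^4 = c / t^{2 gamma/(gamma-2)}.
```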
2.6 Weak convergence of the trajectories
In this subsection, we show the convergence of the solutions of (1), provided \(\alpha >3\). We begin by establishing some preliminary estimations that cannot be derived from the analysis carried out in [35]. The first statement improves part (v) of Theorem 2.3, while the second one is the key to proving the convergence of the trajectories of (1):
Theorem 2.14
Let \(x:[t_0,+\infty [ \rightarrow {{\mathcal {H}}}\) be a solution of (1) with \({{\mathrm{argmin\,}}}\Phi \ne \emptyset \).
-
(i)
If \(\alpha \ge 3\) and x is bounded, then \(\Vert {{\dot{x}}}(t)\Vert ={{\mathcal {O}}}(1/t)\). More precisely,
$$\begin{aligned} \Vert {{\dot{x}}}(t)\Vert \le \frac{1}{t} \left( \sqrt{2\mathcal E_{\alpha -1,0}(t_0)} + (\alpha -1)\sup _{t\ge t_0}\Vert x(t)-x^*\Vert \right) . \end{aligned}$$(12)
-
(ii)
If \(\alpha >3\), then x is bounded and
$$\begin{aligned} \int _{t_0}^{+\infty }t\Vert {{\dot{x}}}(t)\Vert ^2\,dt\le \frac{\mathcal E_{2,2(\alpha -3)}(t_0)}{\alpha -3}<+\infty . \end{aligned}$$(13)
Proof
To prove (i), assume \(\alpha \ge 3\) and x is bounded. From the definition of \({{\mathcal {E}}}_{\lambda ,\xi }\), we have \(\frac{1}{2}\Vert \lambda (x-x^*)+t{{\dot{x}}}\Vert ^2\le {{\mathcal {E}}}_{\lambda ,\xi }(t)\), and so \(\Vert t{{\dot{x}}}\Vert \le \sqrt{2{{\mathcal {E}}}_{\lambda ,\xi }(t)}+\lambda \Vert x-x^*\Vert \). By Remark 2.6, \({{\mathcal {E}}}_{\alpha -1,0}\) is nonincreasing, and we immediately obtain (12).
In order to show (ii), suppose now that \(\alpha >3\). Choose \(\lambda =2\) and \(\xi ^*=2(\alpha -3)\). By Remark 2.6, we have
and \({{\mathcal {E}}}_{\lambda ,\xi ^*}\) is nonincreasing. From the definition of \({{\mathcal {E}}}_{\lambda ,\xi }\), we deduce that \(\Vert x(t)-x^*\Vert ^2 \le \frac{2}{\xi } {{\mathcal {E}}}_{\lambda ,\xi }(t)\), which gives
and establishes the boundedness of x. Integrating (14) from \(t_0\) to t, and recalling that \({{\mathcal {E}}}_{\lambda ,\xi ^*}\) is nonnegative, we obtain
as required. \(\square \)
Remark 2.15
In view of (12) and (15), when \(\alpha >3\), we obtain the following explicit bound for \(\Vert {{\dot{x}}}\Vert \), namely
Since \(\lim _{t\rightarrow +\infty }\Vert {{\dot{x}}}(t)\Vert =0\) by Theorem 2.3, we also have \(\lim _{t\rightarrow +\infty }t\,\Vert \dot{x}(t)\Vert ^2=0\).
We are now in a position to prove the weak convergence of the trajectories of (1), which is the main result of this section. The proof relies on a Lyapunov analysis, which was first used by Alvarez [2] in the context of the heavy ball with friction.
Theorem 2.16
Let \(\Phi : {{\mathcal {H}}} \rightarrow {{\mathbb {R}}}\) be a continuously differentiable convex function such that \({{\mathrm{argmin\,}}}\Phi \ne \emptyset \), and let \(x:[t_0,+\infty [\rightarrow {{\mathcal {H}}}\) be a solution of (1) with \(\alpha >3\). Then x(t) converges weakly, as \(t\rightarrow +\infty \), to a point in \({{\mathrm{argmin\,}}}\Phi \).
Proof
We shall use Opial’s Lemma 5.7. To this end, let \(x^*\in {{\mathrm{argmin\,}}}\Phi \) and recall from (7) that
where \(h_z\) is given by (5). This yields
In view of Theorem 2.14, part (ii), the right-hand side is integrable on \([t_0,+\infty [\). Lemma 5.9 then implies \(\lim _{t\rightarrow +\infty }h_{x^*}(t)\) exists. This gives the first hypothesis in Opial’s Lemma. The second one was established in part (ii) of Theorem 2.3. \(\square \)
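In finite dimensions, weak and strong convergence coincide, so the theorem can be illustrated numerically on a toy problem of our own with a continuum of minimizers: \(\Phi (x_1,x_2)=\frac{1}{2}x_1^2\) on \({\mathbb {R}}^2\), whose minimizers form the \(x_2\)-axis, with \(\alpha =4\):

```python
import numpy as np
from scipy.integrate import solve_ivp

# Phi(x1, x2) = 0.5 x1^2, grad Phi = (x1, 0), argmin Phi = the x2-axis.
alpha, t0 = 4.0, 1.0

def rhs(t, y):
    x1, x2, v1, v2 = y
    return [v1, v2, -(alpha / t) * v1 - x1, -(alpha / t) * v2]

sol = solve_ivp(rhs, (t0, 400.0), [1.0, 1.0, 0.0, 1.0], rtol=1e-10,
                atol=1e-12, dense_output=True)
x_end = sol.sol(400.0)[:2]
x_mid = sol.sol(200.0)[:2]
# The trajectory settles at a point of the x2-axis.  For this instance the
# limit is computable by hand: v2(t) = t^{-alpha}, so x2(t) -> 1 + 1/3 = 4/3.
print(x_end)
```

Note that the limit \((0,4/3)\) is a minimizer, but not the projection of the initial point onto \({{\mathrm{argmin\,}}}\Phi \); in general the limit point depends on the whole initial data.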
Remark 2.17
A puzzling question concerns the convergence of the trajectories for \(\alpha =3\), a question which is directly related to the convergence of the sequences generated by Nesterov’s method [26] (see [35, Section 3]).
2.7 Further stabilization results
Let us complement the study of Eq. (1) by examining the asymptotic behavior of the acceleration \(\ddot{x}\). To this end, we shall use an additional regularity assumption on the gradient of \(\Phi \).
Proposition 2.18
Let \(\alpha >3\) and let \(x:[t_0,+\infty [ \rightarrow {{\mathcal {H}}}\) be a solution of (1) with \({{\mathrm{argmin\,}}}\Phi \ne \emptyset \). Assume that \(\nabla \Phi \) is Lipschitz-continuous on bounded sets. Then \({{\ddot{x}}}\) is bounded, globally Lipschitz-continuous on \([t_0,+\infty [\), and satisfies
Proof
First recall that x and \({{\dot{x}}}\) are bounded, by virtue of Theorems 2.14 and 2.3, respectively. By (1), we have
Since \(\nabla \Phi \) is Lipschitz-continuous on bounded sets, it follows from (16), and the boundedness of x and \({{\dot{x}}}\), that \(\ddot{x}\) is bounded on \([t_0, +\infty [\). As a consequence, \(\dot{x}\) is Lipschitz-continuous on \([t_0, +\infty [\). Returning to (16), we deduce that \(\ddot{x}\) is Lipschitz-continuous on \([t_0, +\infty [\).
Pick \(x^*\in {{\mathrm{argmin\,}}}\Phi \), set \(h=h_{x^*}\) (to simplify the notation) and use (6) to obtain
Let L be a Lipschitz constant for \(\nabla \Phi \) on some ball containing the minimizer \(x^*\) and the trajectory x. By virtue of the Baillon–Haddad Theorem (see, for instance, [12], [32, Theorem 3.13] or [27, Theorem 2.1.5]), \(\nabla \Phi \) is \(\frac{1}{L}\)-cocoercive on that ball, which means that
Substituting this inequality in (17), and using the fact that \(\nabla \Phi (x^{*}) =0\), we obtain
In view of (16), this gives
Developing the square on the left-hand side, and neglecting the nonnegative term \((\alpha \Vert {{\dot{x}}}(t)\Vert /t)^2/L\), we obtain
We multiply this inequality by \(t^\alpha \) to obtain
Integration from \(t_0\) to t yields
Neglecting the nonnegative term \(\alpha t^{\alpha -1}\Vert \dot{x}(t)\Vert ^2/L\), we obtain
where \(C=t_0^\alpha {{\dot{h}}}(t_0)+\alpha t_0^{\alpha -1}\Vert \dot{x}(t_0)\Vert ^2/L\).
If \(t_0<1\), we have
for all \(t\ge 1\). Since the first term on the right-hand side tends to 0 as \(t\rightarrow +\infty \), we may assume, without loss of generality, that \(t_0\ge 1\).
Observe now that \(s^{\alpha -2} \le s^{\alpha }\), whenever \(s\ge 1\). Whence, inequality (18) simplifies to
Dividing by \(t^\alpha \) and integrating again, we obtain
Setting \(C'=h(t_0)+C t_0^{-\alpha +1}/(\alpha -1)\), and neglecting the nonnegative term h(t) of the left-hand side and the nonpositive term \(-C t^{-\alpha +1}/(\alpha -1)\) of the right-hand side, we get
Set \(\displaystyle g(\tau )= \tau ^{-\alpha }\left( \int _{t_0}^\tau s^\alpha \Vert {{\ddot{x}}}(s)\Vert ^2 ds\right) \) and use Fubini’s Theorem on the second integral to get
By part (ii) of Theorem 2.14, the integral on the right-hand side is finite. We have
The derivative of g is
Let \(C''\) be an upper bound for \(\Vert {{\ddot{x}}}\Vert ^2\). We have
From (19) and (20) we deduce that \(\lim _{\tau \rightarrow +\infty }g(\tau )=0\) by virtue of Lemma 5.6. \(\square \)
Remark 2.19
Since \(\int _{t_0}^ts^\alpha ds= \frac{1}{\alpha +1}\left( t^{\alpha +1}-t_0^{\alpha +1}\right) \), Proposition 2.18 expresses a fast ergodic convergence of \(\Vert {{\ddot{x}}}(s)\Vert ^2\) to 0 with respect to the weight \(s^\alpha \) as \(t\rightarrow +\infty \), namely
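This ergodic decay can be observed numerically; the sketch below (a toy instance of ours: \(\Phi (x)=\frac{1}{2}x^2\), \(\alpha =4\)) recovers \(\ddot{x}\) from (1) and computes the weighted means:

```python
import numpy as np
from scipy.integrate import solve_ivp, cumulative_trapezoid

# Toy instance: Phi(x) = 0.5 x^2, alpha = 4.  Along a solution of (1),
# x''(t) = -(alpha/t) x'(t) - x(t), and we monitor the weighted means
#   M(t) = (int_{t0}^t s^alpha ||x''(s)||^2 ds) / (int_{t0}^t s^alpha ds),
# which should tend to 0 as t -> +infinity.
alpha, t0, T = 4.0, 1.0, 400.0
rhs = lambda t, y: [y[1], -(alpha / t) * y[1] - y[0]]
sol = solve_ivp(rhs, (t0, T), [1.0, 0.0], rtol=1e-10, atol=1e-12,
                dense_output=True)
ts = np.linspace(t0, T, 20000)
xs, vs = sol.sol(ts)
acc2 = (-(alpha / ts) * vs - xs) ** 2                    # ||x''(t)||^2
num = cumulative_trapezoid(ts**alpha * acc2, ts, initial=0.0)
den = (ts**(alpha + 1) - t0**(alpha + 1)) / (alpha + 1)  # int of s^alpha
M = num[1:] / den[1:]                                    # weighted mean
print(M[-1] < M[len(M) // 4])  # True: the weighted mean decays toward 0
```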
3 Strong convergence results
A counterexample due to Baillon [11] shows that the trajectories of the steepest descent dynamical system may converge weakly but not strongly. Nevertheless, under some additional geometrical or topological assumptions on \(\Phi \), the steepest descent trajectories do converge strongly. This has been proved in the case where the function \(\Phi \) is either even or strongly convex (see [16]), or when \(\mathrm{int}({{\mathrm{argmin\,}}}\Phi ) \ne \emptyset \) (see [15, Theorem 3.13]). Some of these results have been extended to inertial dynamics, see [2] for the heavy ball with friction, and [4] for an inertial version of Newton’s method. This suggests that convexity alone may not be sufficient for the trajectories of (1) to converge strongly, but one can reasonably expect it to be the case under some additional conditions. The purpose of this section is to establish this fact. The different types of hypotheses will be studied in independent subsections since different techniques are required.
3.1 Set of minimizers with nonempty interior
Let us begin by studying the case where \(\mathrm{int}({{\mathrm{argmin\,}}}\Phi )\ne \emptyset \).
Theorem 3.1
Let \(\Phi : {{\mathcal {H}}} \rightarrow {{\mathbb {R}}}\) be a continuously differentiable convex function. Let \(\mathrm{int}({{\mathrm{argmin\,}}}\Phi )\ne \emptyset \), and let \(x:[t_0,+\infty [\rightarrow {{\mathcal {H}}}\) be a solution of (1) with \(\alpha >3\). Then x(t) converges strongly, as \(t\rightarrow +\infty \), to a point in \({{\mathrm{argmin\,}}}\Phi \). Moreover,
Proof
Since \(\mathrm{int}({{\mathrm{argmin\,}}}\Phi )\ne \emptyset \), there exist \(x^*\in {{\mathrm{argmin\,}}}\Phi \) and some \(\rho >0\) such that \(\nabla \Phi (z)=0\) for all \(z\in {{\mathcal {H}}}\) such that \(\Vert z-x^*\Vert <\rho \). By the monotonicity of \(\nabla \Phi \), for all \(y\in {{\mathcal {H}}}\), we have
Hence,
Taking the supremum with respect to \(z \in {{\mathcal {H}}}\) such that \(\Vert z-x^*\Vert < \rho \), we infer that
for all \( y \in {{\mathcal {H}}} \). In particular,
By using this inequality in (9) with \(\lambda =\alpha -1\) and \(\xi =0\), we obtain
whence we derive, by integrating from \(t_0\) to t
Since \({{\mathcal {E}}}_{\alpha -1,0}(t)\) is nonnegative, part (ii) of Theorem 2.7 gives
Finally, rewrite (1) as
Since the right-hand side is integrable, we conclude by applying Lemma 5.12 and Theorem 2.16. \(\square \)
3.2 Even functions
Let us recall that \(\Phi : {{\mathcal {H}}} \rightarrow {{\mathbb {R}}}\) is even if \(\Phi (-x)= \Phi (x) \) for every \(x\in {{\mathcal {H}}}\). In this case, the set \({{\mathrm{argmin\,}}}\Phi \) is nonempty, and contains the origin: by convexity and evenness, \(\Phi (0)\le \frac{1}{2}\left( \Phi (x)+\Phi (-x)\right) =\Phi (x)\) for every \(x\in {{\mathcal {H}}}\).
Theorem 3.2
Let \(\Phi :{{\mathcal {H}}}\rightarrow {{\mathbb {R}}}\) be a continuously differentiable convex even function, and let \(x:[t_0,+\infty [\rightarrow {{\mathcal {H}}}\) be a solution of (1) with \(\alpha >3\). Then x(t) converges strongly, as \(t\rightarrow +\infty \), to a point in \({{\mathrm{argmin\,}}}\Phi \).
Proof
For \(t_0 \le \tau \le s\), set
We have
Combining these two equalities and using (1), we obtain
Recall that the energy \(W(\tau )= \frac{1}{2} \Vert \dot{x}(\tau ) \Vert ^2 + \Phi (x(\tau ))\) is nonincreasing. Therefore,
by convexity. After simplification, we obtain
Combining (22) and (23), we obtain
As in the proof of Lemma 5.9, we have
where \(C=2\Vert {{\dot{x}}}(t_0)\Vert \,\Vert x\Vert _{\infty }\). The function k does not depend on s. Moreover, using Fubini’s Theorem, we deduce that
by part (ii) of Theorem 2.14. Integrating \(\dot{q}(\tau ) \le k(\tau )\) from t to s, we obtain
Since \(\Phi \) is even, we have \(0 \in {{\mathrm{argmin\,}}}\Phi \). Hence \(\lim _{t\rightarrow +\infty }\Vert x(t) \Vert ^2\) exists (see the proof of Theorem 2.16). As a consequence, x(t) has the Cauchy property as \(t \rightarrow + \infty \), and hence converges. \(\square \)
3.3 Uniformly convex functions
Following [13], a function \(\Phi : {{\mathcal {H}}} \rightarrow {{\mathbb {R}}}\) is uniformly convex on bounded sets if, for each \(r>0\), there is an increasing function \(\omega _r:[0,+\infty [\rightarrow [0,+\infty [\) vanishing only at 0, and such that
for all \(x, y \in {{\mathcal {H}}}\) such that \(\Vert x \Vert \le r\) and \(\Vert y \Vert \le r\). Uniformly convex functions are strictly convex, so their set of minimizers contains at most one point.
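For orientation, the uniform convexity inequality takes, in one standard form (our transcription; the displayed inequality (24) may differ in normalization, but this form is consistent with the substitution \(y=x(t)\), \(x=x^*\) made in the proof of Theorem 3.3 below):

```latex
\Phi(y) \;\ge\; \Phi(x) + \langle \nabla \Phi(x),\, y-x \rangle + \omega_r\bigl( \Vert x-y \Vert \bigr),
\qquad \Vert x\Vert \le r,\ \Vert y\Vert \le r.
```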
Theorem 3.3
Let \(\Phi \) be uniformly convex on bounded sets with \({{\mathrm{argmin\,}}}\Phi \ne \emptyset \), and let \(x:[t_0,+\infty [\rightarrow {{\mathcal {H}}}\) be a solution of (1) with \(\alpha >3\). Then x(t) converges strongly, as \(t\rightarrow +\infty \), to the unique \(x^*\in {{\mathrm{argmin\,}}}\Phi \).
Proof
Recall that the trajectory \(x(\cdot )\) is bounded, by part (ii) in Theorem 2.14. Let \(r >0\) be such that x is contained in the ball of radius r centered at the origin. This ball also contains \(x^*\), which is the weak limit of the trajectory in view of the weak lower-semicontinuity of the norm and Theorem 2.16. Writing \(y=x(t)\) and \(x=x^*\) in (24), we obtain
The right-hand side tends to 0 as \(t\rightarrow +\infty \) by virtue of Theorem 2.3. It follows that x(t) converges strongly to \(x^*\) as \(t\rightarrow +\infty \). \(\square \)
Let us recall that a function \(\Phi : {{\mathcal {H}}} \rightarrow \mathbb R\) is strongly convex if there exists a positive constant \(\mu \) such that
for all \(x, y \in {{\mathcal {H}}}\). Clearly, strongly convex functions are uniformly convex on bounded sets. Interestingly, for strongly convex functions, the convergence rate improves indefinitely as \(\alpha \) grows.
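For reference, the strong convexity inequality can be written (a standard form for continuously differentiable \(\Phi \); our transcription) as:

```latex
\Phi(y) \;\ge\; \Phi(x) + \langle \nabla \Phi(x),\, y-x \rangle + \frac{\mu}{2}\, \Vert x-y \Vert^2
\qquad \text{for all } x, y \in \mathcal{H}.
```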
Theorem 3.4
Let \(\Phi :{{\mathcal {H}}}\rightarrow {\mathbb {R}}\) be strongly convex, and let \(x:[t_0,+\infty [\rightarrow {{\mathcal {H}}}\) be a solution of (1) with \(\alpha >3\). Then x(t) converges strongly, as \(t\rightarrow +\infty \), to the unique element \(x^*\in {{\mathrm{argmin\,}}}\Phi \). Moreover
Proof
Strong convergence follows from Theorem 3.3 because strongly convex functions are uniformly convex on bounded sets. From (10) and the strong convexity of \(\Phi \), we deduce that
for any \(\lambda \ge 0\) and any \(p\ge 0\). Now fix \(p=\frac{2}{3}(\alpha -3)\) and \(\lambda =\frac{2}{3}\alpha \), so that \(p+2-\lambda =\alpha -\lambda -1-p/2=0\) and \(\alpha -\lambda -1-p=-p/2\). The above inequality becomes
Define \(t_1=\max \left\{ t_0,\sqrt{\frac{p\lambda }{\mu }}\right\} \), so that
for all \(t\ge t_1\). Integrate this inequality from \(t_1\) to t (use integration by parts on the right-hand side) to get
Hence,
in view of the strong convexity of \(\Phi \). By the definition of \({{\mathcal {E}}}_\lambda ^p\), we have
Dividing by \(t^{p+2}\) and using the definition of \(t_1\), along with the fact that \(t\ge t_1\), we obtain
Recalling that \(p=\frac{2}{3}(\alpha -3)\) and \(\lambda =\frac{2}{3}\alpha \), we deduce that
The strong convexity of \(\Phi \) then gives
Inequalities (27) and (28) settle the first two points in (25).
Now, using (26) and (27), we derive
The definition of \({{\mathcal {E}}}_\lambda ^p\) then gives
Hence
and
But using (28), we deduce that
The last two inequalities together give
Taking squares, and rearranging the terms, we obtain
which shows the last point in (25) and completes the proof. \(\square \)
The preceding theorem extends [35, Theorem 4.2], which states that if \(\alpha >9/2\), then \(\Phi (x(t))-\min \Phi =\mathcal O(1/t^{3})\).
4 Asymptotic behavior of the trajectory under perturbations
In this section, we analyze the asymptotic behavior, as \(t\rightarrow +\infty \), of the solutions of the differential equation
From the Cauchy–Lipschitz–Picard Theorem (see, for instance, [7, Theorem 17.1.2b)]), if \(\nabla \Phi \) is Lipschitz continuous on bounded sets and g is locally integrable, then for any initial condition \(x(t_0)=x_0 \in {{\mathcal {H}}}\), \(\dot{x}(t_0)= v_0 \in {{\mathcal {H}}}\), there exists a unique maximal local solution x of (29), with \({{\dot{x}}}\) locally absolutely continuous. If \(\Phi \) is bounded from below, global existence follows from the energy estimate proved in Lemma 4.1, in the next subsection.
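To make the reduction to a first-order system concrete, here is a small numerical sketch (entirely our own illustration: the choices \(\Phi (x)=x^2/2\), \(g(t)=1/t^3\), \(\alpha =4\), and all numerical values are hypothetical, not data from the text). It integrates (29) with a classical Runge–Kutta scheme and checks, in the spirit of Lemma 4.1, that the speed stays bounded and the trajectory approaches \({{\mathrm{argmin\,}}}\Phi =\{0\}\):

```python
alpha, t0 = 4.0, 1.0

def grad_phi(x):                # illustrative potential Phi(x) = x^2 / 2
    return x

def g(t):                       # illustrative perturbation, integrable on [t0, inf)
    return 1.0 / t ** 3

def rhs(t, x, v):               # first-order reduction: state (x, v), with v = dx/dt
    return v, -(alpha / t) * v - grad_phi(x) + g(t)

t, x, v, h = t0, 2.0, 0.0, 0.01
max_speed = 0.0
while t < 200.0:                # classical fourth-order Runge-Kutta integration
    k1 = rhs(t, x, v)
    k2 = rhs(t + h / 2, x + h / 2 * k1[0], v + h / 2 * k1[1])
    k3 = rhs(t + h / 2, x + h / 2 * k2[0], v + h / 2 * k2[1])
    k4 = rhs(t + h, x + h * k3[0], v + h * k3[1])
    x += h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
    v += h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
    t += h
    max_speed = max(max_speed, abs(v))

print(x, max_speed)             # x(t) approaches argmin Phi; the speed stays bounded
```

With this choice of g, both \(\int _{t_0}^{\infty }\Vert g(t)\Vert dt\) and \(\int _{t_0}^{\infty } t\Vert g(t)\Vert dt\) are finite, so the hypotheses used below are met.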
This being said, our main concern here is to obtain sufficient conditions on the perturbation g in order to guarantee that the convergence properties established in the preceding sections are preserved. The analysis follows very closely the arguments given in Sects. 2 and 3. It is developed in the same general setting, just assuming that \(\Phi : {{\mathcal {H}}} \rightarrow {{\mathbb {R}}}\) is a continuously differentiable convex function. Therefore, we shall state the main results and sketch the proofs, underlining the parts where additional techniques are required.
4.1 Energy estimates and minimization property
The following result is obtained by considering the global energy of the system, and showing that it is a strict Lyapunov function:
Lemma 4.1
Let \(\Phi \) be bounded from below, and let \(x: [t_0, +\infty [ \rightarrow {{\mathcal {H}}}\) be a solution of (29) with \(\alpha >0\) and \({\int _{t_0}^{\infty } \Vert g(t)\Vert \,dt < + \infty }\). Then, \(\sup \nolimits _{t>t_0} \Vert \dot{x}(t) \Vert < + \infty \) and \({\int _{t_0}^{\infty } \frac{1}{\tau } \Vert \dot{x}(\tau ) \Vert ^2 d\tau <+\infty }\). Moreover, \(\lim \nolimits _{t\rightarrow +\infty }\Phi (x(t))=\inf \nolimits _{{{\mathcal {H}}}}\Phi \).
Proof
Set \(T >t_0\). For \(t_0 \le t \le T\), define the energy function
Since \(\dot{x}\) is continuous and g is integrable, the function \(W_T\) is well defined. Differentiating \(W_T\) with respect to time, and using (29), we obtain
Hence \(W_T(\cdot )\) is a decreasing function. In particular, \( W_T(t) \le W_T(t_0)\), that is
As a consequence,
Applying Lemma 5.13, we obtain
It follows that
As a consequence, we may define a function \(W : \,]t_0,+\infty [\rightarrow {\mathbb {R}}\) by
by (31). From the definition of \(W_T\) and W, we have
Integrating from \(t_0\) to t, and using (31), we obtain
which gives a bound for the second improper integral.
For the minimization property, consider the function \(h: \,]t_0,+\infty [\rightarrow {\mathbb {R}}\), defined by \(h(t)=\frac{1}{2}\Vert x(t)-z\Vert ^{2}\), where z is an arbitrary element of \({{\mathcal {H}}}\). We can easily verify that
By convexity of \(\Phi \), we obtain
Recall that the function W defined above is nonincreasing and bounded from below. Hence, W(t) converges, as \(t\rightarrow +\infty \), to some \(W_{\infty } \in {{\mathbb {R}}}\). Moreover, using the definition of W in (33), we deduce that
Setting \(B_{\infty }=W_{\infty } +\inf \Phi -\Phi (z) \), we may write
Multiplying this last equation by \(\frac{1}{t}\), and integrating between \(t_0\) and \(\theta >t_0\), we get
Let us estimate the integrals on the right-hand side of the last inequality:
(1) The first term is finite, by the first part of the proof.
(2) The second term is also finite, since the relation \(\Vert x(t)-z\Vert \le \Vert x(t_0)-z\Vert +\int _{t_0}^{t}\Vert \dot{x}(s)\Vert ds\) implies
$$\begin{aligned} \int _{t_{0}}^{\theta }\dfrac{\Vert g(t)\Vert \Vert x(t)-z\Vert }{t} dt\le \left( \frac{\Vert x_{0}-z\Vert }{t_{0}}+ \sup _{t\ge t_0}\Vert \dot{x}(t)\Vert \right) \int _{t_{0}}^{+\infty }\Vert g(t)\Vert dt< +\infty .\end{aligned}$$
(3) For the third term, integration by parts gives
$$\begin{aligned} \int _{t_{0}}^{\theta } \left( \frac{1}{t}\int _{t}^{\infty }\Vert g(s)\Vert ds\right) dt= & {} \ln \theta \int _{\theta }^{\infty }\Vert g(s)\Vert ds - \ln t_0 \int _{t_0}^{\infty }\Vert g(s)\Vert ds \\&+ \int _{t_{0}}^{\theta } \Vert g(t)\Vert \ln t \ dt. \end{aligned}$$
(4) For the fourth term, set \(I=\int _{t_{0}}^{\theta }\frac{1}{t^{\alpha +1}}\frac{d}{dt}(t^{\alpha }\dot{h}(t))dt\), and integrate by parts twice to obtain
$$\begin{aligned} I= & {} \left[ \frac{1}{t}\dot{h}(t)\right] _{t_{0}}^{\theta } +(\alpha +1)\int _{t_{0}}^{\theta }\frac{1}{t^{2}}\dot{h}(t)dt\\= & {} C_0 +\frac{1}{\theta }\dot{h}(\theta ) +\frac{(1+\alpha )}{\theta ^{2}}h(\theta )\\&+\,2(1+\alpha )\int _{t_{0}}^{\theta } \frac{1}{t^{3}}h(t)dt\ge C_0 +\frac{1}{\theta }\dot{h}(\theta ), \end{aligned}$$
for some constant \(C_0\), because \(h\ge 0\). Finally, notice that
$$\begin{aligned} \vert \dot{h}(\theta )\vert =\vert \langle \dot{x}(\theta ),x(\theta )-z\rangle \vert \le \sup _{t\ge t_0}\Vert \dot{x}(t)\Vert \left( \Vert x(t_0)-z\Vert +\theta \sup _{t\ge t_0}\Vert \dot{x}(t)\Vert \right) .\end{aligned}$$
Collecting the above results, we deduce that
for some other constant \(C_1\). Dividing by \(\ln (\frac{\theta }{t_{0}})\), and letting \(\theta \rightarrow +\infty \), we conclude that \(B_{\infty } \le 0\), by using Lemma 5.11 with \(\psi (t) =\ln t\). This implies that \(W_{\infty }\le \Phi (z)-\inf \Phi \) for every \(z\in {{\mathcal {H}}}\), which leads to \(W_{\infty } \le 0\).
On the other hand, it is easy to see that
Passing to the limit, as \(t\rightarrow +\infty \), we deduce that
Since we always have \(\inf \Phi \le \liminf \Phi (x(t))\), we conclude that \(\lim _{t\rightarrow +\infty }\Phi (x(t))=\inf \Phi \). \(\square \)
4.2 Fast convergence of the values
We are now in a position to prove the following:
Theorem 4.2
Assume \({{\mathrm{argmin\,}}}\Phi \ne \emptyset \), and let \(x: [t_0, +\infty [ \rightarrow {{\mathcal {H}}}\) be a solution of (29) with \(\alpha \ge 3\) and \({\int _{t_0}^{\infty } t\,\Vert g(t)\Vert \, dt < + \infty }\). Then \(\Phi (x(t))- \min _{{{\mathcal {H}}}}\Phi = \mathcal {O} \left( \frac{1}{t^2}\right) \).
Proof
The proof follows the arguments used for Theorem 2.7. Take \(x^{*} \in S={{\mathrm{argmin\,}}}\Phi \). For \(t_0 \le t \le T\), define the energy function
Let us show that
Differentiating \({{\mathcal {E}}}_{\alpha , g,T}\) gives
Using the subdifferential inequality for \(\Phi \), and rearranging the terms, we obtain
As a consequence, for \(\alpha \ge 3\), the function \({\mathcal {E}}_{\alpha ,g,T}\) is nonincreasing. In particular, \({\mathcal {E}}_{\alpha ,g,T} (t) \le {{{\mathcal {E}}}}_{\alpha ,g,T} (t_0)\), which gives
with
From (36), we infer that
Applying Lemma 5.13, we obtain
and so
Using (37) in (36), we conclude that
and the result follows. \(\square \)
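The \({{\mathcal {O}}}(1/t^2)\) estimate can be observed numerically. The sketch below is entirely our own illustration (the choices \(\Phi (x)=\log \cosh x\), \(g(t)=1/t^{3}\), \(\alpha =3\), and the step size are hypothetical): it integrates (29) and records \(\sup _t t^2\left( \Phi (x(t))-\min \Phi \right) \), which remains bounded, in agreement with Theorem 4.2:

```python
import math

alpha, t0 = 3.0, 1.0            # Theorem 4.2 allows alpha >= 3

def grad_phi(x):                # Phi(x) = log(cosh x): convex, min Phi = 0 at x = 0
    return math.tanh(x)

def phi(x):
    return math.log(math.cosh(x))

def g(t):                       # perturbation with int t*||g(t)|| dt < +infinity
    return 1.0 / t ** 3

def rhs(t, x, v):               # first-order reduction of (29)
    return v, -(alpha / t) * v - grad_phi(x) + g(t)

t, x, v, h = t0, 1.0, 0.0, 0.01
sup_scaled = 0.0
while t < 300.0:                # classical fourth-order Runge-Kutta steps
    k1 = rhs(t, x, v)
    k2 = rhs(t + h / 2, x + h / 2 * k1[0], v + h / 2 * k1[1])
    k3 = rhs(t + h / 2, x + h / 2 * k2[0], v + h / 2 * k2[1])
    k4 = rhs(t + h, x + h * k3[0], v + h * k3[1])
    x += h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
    v += h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
    t += h
    sup_scaled = max(sup_scaled, t * t * phi(x))

print(sup_scaled)               # sup of t^2 (Phi(x(t)) - min Phi) stays bounded
```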
Remark 4.3
As a consequence, the energy function
is well defined on \([t_0,+\infty [\), and is a Lyapunov function for the dynamical system (29).
4.3 Convergence of the trajectories
In the case \(\alpha >3\), provided the right-hand side g(t) is sufficiently small for large t, we are going to show that the trajectories of (29) converge, as is the case for the unperturbed system studied in the previous sections.
Theorem 4.4
Assume \({{\mathrm{argmin\,}}}\Phi \ne \emptyset \), and let \(x: [t_0, +\infty [ \rightarrow {{\mathcal {H}}}\) be a solution of (29) with \(\alpha >3\) and \({\int _{t_0}^{\infty } t\,\Vert g(t)\Vert \, dt < + \infty }\). Then, x(t) converges weakly, as \(t\rightarrow +\infty \), to a point in \({{\mathrm{argmin\,}}}\Phi \).
Proof
Step 1 Recall, from the proof of Theorem 4.2, that the energy function \({\mathcal E}_{\alpha ,g}\) defined in Remark 4.3 satisfies
Integrating this inequality, we obtain
By the definition of \({{{\mathcal {E}}}}_{\alpha ,g}\), and neglecting its nonnegative terms, we infer that
and so
by (38). Since \(\alpha >3\), we deduce that
Step 2 Let us show that
Taking the scalar product of (29) with \(t^2 \dot{x}(t)\), we get
Using the Chain Rule and the Cauchy–Schwarz inequality, we obtain
Integration by parts yields
As a consequence,
for some constant \(C_0\) depending only on the Cauchy data. Since \(\int _{t_0}^{\infty } s (\Phi (x(s))- \inf _{{{\mathcal {H}}}}\Phi )ds <+ \infty \) by (39), and \(\alpha >1\), we deduce that
for some other constant \(C_1\), which we may assume to be nonnegative. Applying Lemma 5.13, we obtain
and so
Returning to (40), we deduce that
which gives
Moreover, combining (38) and (42), we deduce that
and the trajectory x is bounded.
Step 3 As before, we prove the weak convergence by means of Opial’s Lemma 5.7. Take \(x^{*}\in {{\mathrm{argmin\,}}}\Phi \), and define \(h: [t_0, +\infty [ \rightarrow {{\mathbb {R}}}^+\) by
By the Chain Rule,
Combining these two equations, and using (29), we obtain
By the monotonicity of \(\nabla \Phi \) and the fact that \(\nabla \Phi (x^{*}) = 0\), we have
and we infer that
Equivalently
with
because the trajectory x is bounded, by (44). Recall that \(\int _{t_0}^{+\infty } t \Vert g(t)\Vert dt < + \infty \) by assumption, and \(\int _{t_0}^{\infty } t\Vert \dot{x}(t) \Vert ^2 dt < + \infty \) by (43). Hence, the function \(t \mapsto tk(t)\) belongs to \(L^1 (t_0, +\infty )\). Applying Lemma 5.9, with \(w(t)= \dot{h}(t)\), we deduce that \(\dot{h}^+(t) \in L^1 (t_0, +\infty )\), which implies that the limit of h(t) exists, as \(t \rightarrow + \infty \). This proves item (i) of Opial’s Lemma 5.7. For item (ii), observe that every weak limit point of x(t) as \(t\rightarrow +\infty \) must minimize \(\Phi \), since \(\lim _{t\rightarrow +\infty }\Phi (x(t))=\inf \Phi \). \(\square \)
Remark 4.5
Throughout the proof of Theorem 4.4, we proved that
We also proved that \(\sup \nolimits _{t\ge t_0}t\, \Vert \dot{x}(t) \Vert <+\infty \), and hence \(\lim \nolimits _{t\rightarrow \infty } \Vert \dot{x}(t) \Vert = 0\).
4.4 Strong convergence results
For strong convergence, we have the following:
Theorem 4.6
Let \(\Phi :{{\mathcal {H}}}\rightarrow {{\mathbb {R}}}\) be a continuously differentiable convex function, and let \(x:[t_0,+\infty [\rightarrow \mathcal {H}\) be a solution of (29) with \(\alpha >3\) and \({\int _{t_0}^\infty t\,\Vert g(t)\Vert \,dt<+\infty }\). Then x(t) converges strongly, as \(t\rightarrow +\infty \), in any of the following cases:
(i) The set \({{\mathrm{argmin\,}}}\Phi \) has nonempty interior;
(ii) The function \(\Phi \) is even; or
(iii) The function \(\Phi \) is uniformly convex.
In order to prove this result, it suffices to adapt the arguments given in Sect. 3 for the unperturbed case. Since the adaptation is relatively straightforward, we leave the details to the reader.
5 Convergence of the associated algorithms
In this section, we analyze the fast convergence properties of the associated Nesterov-type algorithms. To avoid repeating similar arguments, we state the results and develop the proofs directly for the perturbed version.
5.1 A dynamical introduction of the algorithm
Time discretization of dissipative gradient-based dynamical systems leads naturally to algorithms, which, under appropriate assumptions, have similar convergence properties. This approach has been followed successfully in a variety of situations. For a general abstract discussion see [5, 6]. For dynamics with inertial features, see [2–4, 9]. To cover practical situations involving constraints or non-smooth data, we need to broaden our scope. This leads us to consider the non-smooth structured convex minimization problem
where \(\Psi : {{\mathcal {H}}} \rightarrow {{\mathbb {R}}} \cup \lbrace + \infty \rbrace \) is a proper lower semicontinuous convex function; and \(\Phi : {{\mathcal {H}}} \rightarrow {{\mathbb {R}}}\) is a continuously differentiable convex function, whose gradient is Lipschitz continuous.
The optimal solutions of (47) satisfy
where \(\partial \Psi \) is the subdifferential of \(\Psi \), in the sense of convex analysis. In order to adapt our dynamic to this non-smooth situation, we will consider the corresponding differential inclusion
This dynamical system is within the following framework
where \(\Theta : {{\mathcal {H}}} \rightarrow {{\mathbb {R}}} \cup \lbrace + \infty \rbrace \) is a proper lower semicontinuous convex function, and \(a(\cdot )\) is a positive damping parameter.
It is interesting to establish the asymptotic properties, as \(t\rightarrow +\infty \), of the solutions of the differential inclusion (48). Beyond global existence issues, one must check that the Lyapunov analysis is still valid. In view of the validity of the subdifferential inequality for convex functions, the (generalized) chain rule for derivatives over curves (see [15]), most results presented in the previous sections can be transposed to this more general context, except for the stabilization of the acceleration, which relies on the Lipschitz character of the gradient. However, a detailed study of this differential inclusion goes far beyond the scope of the present article. See [8] for some results in the case of a fixed positive damping parameter, i.e., \(a(t)= \gamma >0\) fixed, and \(g=0\). Thus, setting \(\Theta (x)= \Psi (x) + \Phi (x)\), we can reasonably assume that, for \(\alpha >3\), and \(\int _{t_0}^{+\infty } t \Vert g(t)\Vert dt < + \infty \), for each trajectory of (48), there is rapid convergence of the values
and weak convergence of the trajectory to an optimal solution.
We shall use these ideas as a guideline, in order to introduce corresponding fast converging algorithms, making the link with Nesterov [26–29] and Beck–Teboulle [14]; and so, extending the recent works of Chambolle–Dossal [19] and Su–Boyd–Candès [35] to the perturbed case.
In order to preserve the fast convergence properties of the dynamical system (48), we are going to discretize it implicitly with respect to the nonsmooth function \(\Psi \), and explicitly with respect to the smooth function \(\Phi \).
Taking a fixed time step size \(h>0\), and setting \(t_k= kh\), \(x_k = x(t_k)\), the implicit/explicit finite difference scheme for (48) gives
where \(y_k\) is a linear combination of \(x_k\) and \(x_{k-1}\), which will be made precise later on. Expanding (50), we obtain
A natural choice for \(y_k\) leading to a simple formulation of the algorithm (other choices are possible, offering new directions of research for the future) is
Using the classical proximal operator (equivalently, the resolvent of the maximal monotone operator \(\partial \Psi \))
and setting \(s=h^2\), the algorithm can be written as
For practical purposes, and in order to fit with the existing literature on the subject, it is convenient to work with the following equivalent formulation
Indeed, we have \(\frac{k -1}{k + \alpha -1} = 1- \frac{\alpha }{k + \alpha -1}\). When \(\alpha \) is an integer, we obtain the same sequences \((x_k)\) and \((y_k)\), up to the reindexation \(k\mapsto k + \alpha -1\). For general \(\alpha >0\), we can easily verify that the algorithm (55) is still associated with the dynamical system (48).
This algorithm is within the scope of the proximal-based inertial algorithms [3, 24, 25], and forward-backward methods. In the unperturbed case, \(g_k =0\), it has been recently considered by Chambolle–Dossal [19] and Su–Boyd–Candès [35]. It enjoys fast convergence properties which are very similar to that of the continuous dynamic.
For \(\alpha = 3\) and \(g_k =0\), we recover the classical FISTA algorithm developed by Beck–Teboulle [14], based on the acceleration method introduced by Nesterov [26] in the smooth case and by Güler [21] in the proximal setting:
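For the reader's convenience, the iteration can be written (our transcription, consistent with taking \(\alpha =3\) and \(g_k=0\) in (55); the original FISTA papers use an equivalent formulation based on the sequence \(t_{k+1}=\frac{1+\sqrt{1+4t_k^2}}{2}\)) as:

```latex
\begin{aligned}
y_k &= x_k + \frac{k-1}{k+2}\,\bigl(x_k - x_{k-1}\bigr),\\
x_{k+1} &= \mathrm{prox}_{s\Psi}\bigl(y_k - s\,\nabla\Phi(y_k)\bigr).
\end{aligned}
```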
An important open question regarding the FISTA method, as described in (56), is the convergence of the sequences \((x_k)\) and \((y_k)\). A major advantage of considering the broader class of algorithms (55) is that, for \(\alpha >3\), these sequences converge, even when inexactly computed, provided the errors or perturbations are sufficiently small.
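As a hedged illustration of this class of algorithms, consider the following toy instance (entirely our own: the choices \(\Psi =\lambda \Vert \cdot \Vert _1\), \(\Phi (x)=\frac{1}{2}\Vert x-b\Vert ^2\), and all data and parameter values are hypothetical). For this \(\Phi \), the gradient is 1-Lipschitz, the proximal operator of \(s\Psi \) is componentwise soft-thresholding, and the minimizer is known in closed form, which lets us monitor \(k^2\big (\Theta (x_k)-\min \Theta \big )\):

```python
import math

alpha, s, lam = 4.0, 0.9, 0.3       # alpha > 3 and 0 < s < 1/L, with L = 1 here
b = [1.5, -0.2, 0.7, 0.05]          # hypothetical data

def grad_phi(x):                    # Phi(x) = (1/2) ||x - b||^2
    return [xi - bi for xi, bi in zip(x, b)]

def prox(u, tau):                   # prox of tau*||.||_1: soft-thresholding
    return [math.copysign(max(abs(ui) - tau, 0.0), ui) for ui in u]

def theta(x):                       # objective Theta = Psi + Phi
    return (0.5 * sum((xi - bi) ** 2 for xi, bi in zip(x, b))
            + lam * sum(abs(xi) for xi in x))

x_star = prox(b, lam)               # closed-form minimizer for this particular Phi
theta_star = theta(x_star)

x_prev = x = [0.0] * len(b)
sup_scaled = 0.0
for k in range(1, 2001):
    gk = [1.0 / k ** 3, 0.0, 0.0, 0.0]   # errors with sum_k k*||g_k|| < +infinity
    c = (k - 1.0) / (k + alpha - 1.0)    # extrapolation coefficient (k-1)/(k+alpha-1)
    y = [xi + c * (xi - pi) for xi, pi in zip(x, x_prev)]
    step = [yi - s * (gi - ei) for yi, gi, ei in zip(y, grad_phi(y), gk)]
    x_prev, x = x, prox(step, s * lam)
    sup_scaled = max(sup_scaled, k * k * (theta(x) - theta_star))

print(sup_scaled, theta(x) - theta_star)
```

The recorded supremum of \(k^2\left( \Theta (x_k)-\min \Theta \right) \) stays bounded, even though the iteration is perturbed by the errors \(g_k\), in agreement with the results below.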
5.2 Fast convergence of the values
We will see that the fast convergence properties of algorithm (54) can be obtained in parallel with the convergence analysis of the continuous case in Theorem 4.2.
Theorem 5.1
Let \(\Psi : {{\mathcal {H}}} \rightarrow {{\mathbb {R}}} \cup \lbrace + \infty \rbrace \) be proper, lower-semicontinuous and convex, and let \(\Phi : {{\mathcal {H}}} \rightarrow {{\mathbb {R}}}\) be convex and continuously differentiable with L-Lipschitz continuous gradient. Suppose that \(S={{\mathrm{argmin\,}}}(\Psi + \Phi )\ne \emptyset \), and let \((x_k)\) be a sequence generated by algorithm (55) with \(\alpha \ge 3, 0< s < \frac{1}{L} \), and \( \sum _{k \in {{\mathbb {N}}}} k \Vert g_k\Vert < + \infty \). Then,
where
Proof
To simplify notation, we set \(\Theta = \Psi + \Phi \), and take \(x^{*} \in {{\mathrm{argmin\,}}}\Theta \). As in the continuous case, we shall prove that the energy sequence \(({{\mathcal {E}}} (k))\) given by
with
is non-increasing (we shall justify further that it is well defined). Note that \({{\mathcal {E}}} (k)\) equals the Lyapunov function considered by Su–Boyd–Candès in [35, Theorem 4.3], plus a perturbation term. For each \(y \in {{\mathcal {H}}}\), we set \(\Phi _k (y):= \Phi (y) - \left\langle g_k, y \right\rangle \), and \(\Theta _k(y) = \Psi (y) + \Phi _k(y)\). Since \(\nabla \Phi _k (y)= \nabla \Phi (y) -g_k\), we deduce that \(\nabla \Phi _k\) is still L-Lipschitz continuous. By introducing the operator \(G_{s,k}: {{\mathcal {H}}} \rightarrow {{\mathcal {H}}}\), defined by
for each \(y \in {{\mathcal {H}}}\), we can write
and rewrite algorithm (55) as
The variable \(z_k\), defined in (59), will play an important role. It comes naturally into play as a discrete version of the term \( \frac{t}{\alpha -1} \dot{x}(t) + x(t) - x^{*} \) which enters \({{{\mathcal {E}}}}_{\alpha ,g}(t)\). Simple algebraic manipulations give
and also
The operator \(G_{s,k}\) satisfies
for all \(x, y\in {{\mathcal {H}}}\) (see [14, 19, 31, 35]), since \(s \le \frac{1}{L}\), and \(\nabla \Phi _k\) is L-Lipschitz continuous. Let us write this formula successively at \(y=y_k\) and \(x= x_k\), then at \(y=y_k\) and \(x= x^{*}\). We obtain
respectively. Multiplying the first inequality by \(\frac{k}{k + \alpha -1}\), and the second one by \(\frac{\alpha -1}{k + \alpha -1}\), then adding the two resulting inequalities, and using the fact that \(x_{k+1} = y_k - s G_{s,k} (y_k)\), we obtain
We rewrite the scalar product above as
We obtain
We shall obtain a recursion from (64). To this end, observe that (62) gives
After expanding
and multiplying the above expression by \(\frac{(\alpha -1)^2}{2s\left( k + \alpha -1\right) ^2}\), we obtain
Replacing this expression in (64), we obtain
Equivalently,
Recalling that \(\Theta (y)= \Theta _k (y) + \left\langle g_k, y \right\rangle \), we obtain
Multiplying by \( \frac{2s}{\alpha -1}\left( k + \alpha -1\right) ^2\), we obtain
which implies
in view of
Setting
we can reformulate (65) as
Equivalently,
Using (61), we deduce that
Fix an integer K, and set
so that (66) is equivalent to
and we deduce that the sequence \(({{\mathcal {E}}}_K (k))_k\) is nonincreasing. In particular, \({{\mathcal {E}}}_K (k) \le {{\mathcal {E}}}_K (0)\), which gives
As a consequence,
By the definition of \({{\mathcal {G}}} (k)\), neglecting some positive terms, and using the Cauchy–Schwarz inequality, we infer that
Applying Lemma 5.14 with \(a_k = \Vert z_{k} -x^{*} \Vert \), we deduce that
Note that M is finite, because \(\sum _{k \in {{\mathbb {N}}}} k \Vert g_k \Vert < + \infty \). Returning to (67) we obtain
By the definition of \({{\mathcal {G}}} (k)\), we finally obtain
which gives (57) and completes the proof. \(\square \)
Remark 5.2
In [22], Kim–Fessler introduce an extra inertial term in the FISTA method that allows them to reduce the constant in the complexity estimate by a factor of 2. It would be interesting to know whether this variant can be obtained from a different time discretization of our inertial dynamic, or of a related one.
5.3 Convergence of the sequence \((x_k)\)
Let us now study the convergence of the sequence \((x_k)\).
Theorem 5.3
Let \(\Psi : {{\mathcal {H}}} \rightarrow {{\mathbb {R}}} \cup \lbrace + \infty \rbrace \) be proper, lower-semicontinuous and convex, and let \(\Phi : {{\mathcal {H}}} \rightarrow {{\mathbb {R}}}\) be convex and continuously differentiable with L-Lipschitz continuous gradient. Suppose that \(S={{\mathrm{argmin\,}}}(\Psi + \Phi )\ne \emptyset \), and let \((x_k)\) be a sequence generated by algorithm (55) with \(\alpha > 3, 0< s < \frac{1}{L} \), and \( \sum _{k \in {{\mathbb {N}}}} k \Vert g_k\Vert < + \infty \). Then,
(i) \( \sum _k k \Big ((\Psi + \Phi )(x_k) - \inf (\Psi + \Phi ) \Big ) < + \infty \);
(ii) \(\sum _k k\Vert x_{k+1}-x_k\Vert ^2<+\infty \); and
(iii) \(x_k\) converges weakly, as \(k\rightarrow +\infty \), to some \(x^*\in {{\mathrm{argmin\,}}}(\Psi +\Phi )\).
Proof
We follow the same steps as those of Theorem 4.4:
Step 1 Recall from (66) that
By (68), the sequence \((z_{k})\) is bounded. Summing the above inequalities, and using \(\alpha >3\), we obtain item (i).
Step 2 Rewrite inequality (63) as
Take \(y= y_k\), and \(x=x_k\). Since \(x_{k+1} = y_k - s G_{s,k} (y_k)\), and \(y_k - x_{k} = \frac{k -1}{k + \alpha -1} ( x_{k} - x_{k-1})\), we obtain
By the definition of \( \Theta _k \), this is
Set \(\theta _k = \Theta (x_k) -\Theta (x^{*}), d_k = \frac{1}{2}\Vert x_{k} - x_{k-1} \Vert ^2, a=\alpha -1\). By the Cauchy–Schwarz inequality, (69) gives
Multiply by \((k + a)^2\) to get
Summing for \(k=1,\ldots ,K\), we obtain
Performing a computation similar to that in Chambolle–Dossal [19, Corollary 2], we can write
By item (i), we have \(\sum _k \left( 2k +2a -1 \right) \theta _k < + \infty \). Hence there exists some constant C such that
for all \( K \in {{\mathbb {N}}}\). We now proceed as in the proof of Theorem 4.4. To this end, write (71) as
where \(a_j:= (j + a)\Vert x_{j+1}-x_j \Vert \). Recalling that \(\sum _k k \Vert g_k\Vert <+\infty \), apply Lemma 5.14 with \(\beta _j = (j + a)\Vert g_j\Vert \) to deduce that
Injecting this information in (70), we obtain
Since \(a = \alpha -1 \ge 2\), item (i) and the definition of \(d_k\), together give
which is (ii).
Step 3 We finish by applying Opial’s Lemma 5.8 with \(S={{\mathrm{argmin\,}}}(\Psi + \Phi )\). By Theorem 5.1, we have \((\Psi + \Phi )(x_k) \rightarrow \min (\Psi + \Phi )\). The weak lower-semicontinuity of \(\Psi + \Phi \) gives item (ii) of Opial’s Lemma. Thus, the only point to verify is that \(\lim \Vert x_k-x^{*}\Vert \) exists for each \(x^{*} \in {{\mathrm{argmin\,}}}(\Psi + \Phi )\). Take any such \(x^*\). We shall show that \(\lim _{k\rightarrow \infty }h_k\) exists, where \(h_k := \frac{1}{2} \Vert x_k - x^{*}\Vert ^2\).
The beginning of the proof is similar to [3] or [19], and consists in establishing a discrete version of the second-order differential inequality (46). We use the identity
which holds for any \(a,b,c \in {{\mathcal {H}}}\). Taking \(b= x^{*}, a= x_{k+1}, c=x_k\), we obtain
which is equivalent to
By the definition of \(y_k\), we have
Therefore,
We now use the monotonicity of \(\partial \Psi \). Since \(- s\nabla \Phi (x^{*}) \in s\partial \Psi ( x^{*}) \), and \(y_k - x_{k+1}- s \nabla \Phi (y_k) +sg_k\in s\partial \Psi (x_{k+1})\), we have
Equivalently,
Replacing in (74), we obtain
On the other hand, the co-coercivity of \(\nabla \Phi \) gives
(vertex of the parabola). Combining (75) and (76), we deduce that
Replace k by \(k-1\) in (73) to obtain
Combine (77) with (78) to deduce that
By the definition of \(y_k = x_{k} + \frac{k -1}{k + \alpha -1} ( x_{k} - x_{k-1})\), we have \(x_{k+1}- y_k = x_{k+1} - x_{k} - \frac{k -1}{k + \alpha -1} ( x_{k} - x_{k-1})\). Hence,
Substituting this in (79), we obtain
where \(\gamma _k= \frac{k -1}{k + \alpha -1} \). Since \(0< s < \frac{1}{L} \), we have \((1- \frac{sL}{2}) >0\). On the other hand, since \(\gamma _k <1\), we have \(\gamma _k + {\gamma _k}^2 < 2 \gamma _k\). Therefore,
By (68), we know that the sequence \((z_k)\) is bounded. By (72), we know that \(\sup _k k \Vert x_{k+1}-x_k \Vert < + \infty \). Since \(x_k = z_k - \frac{k + \alpha -1}{ \alpha -1} ( x_{k+1}-x_k )\), we deduce that the sequence \((x_k)\) is bounded. Returning to (80), we have
for some constant C. Now, item (ii), combined with the assumption \(\sum _{k } k \Vert g_k\Vert < + \infty \), together give
for some nonnegative sequence \((\omega _k) \) such that \( \sum _{k \in {{\mathbb {N}}}} k\omega _k < + \infty \). Taking positive parts and applying Lemma 5.10 with \(a_k = \left( h_{k} -h_{k-1}\right) ^+\), we obtain
Since \((h_k)\) is nonnegative, and the positive parts of its increments are summable, it must converge. This completes the proof. \(\square \)
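The conclusion of Theorem 5.3 can be probed numerically on a toy problem (entirely our own illustration, with hypothetical data: \(\Psi \) is the indicator of the box \([-1,1]^3\), whose proximal operator is the projection onto the box, \(\Phi (x)=\frac{1}{2}\Vert x-b\Vert ^2\), and the errors \(g_k=k^{-3}e_1\) satisfy \(\sum _k k\Vert g_k\Vert <+\infty \)). The iterates converge to the minimizer, and \(\sup _k k\Vert x_{k+1}-x_k\Vert \) stays bounded, as in (72):

```python
alpha, s = 5.0, 0.9                 # alpha > 3 and 0 < s < 1/L, with L = 1
b = [2.0, 0.4, -3.0]                # hypothetical data

def proj(u):                        # prox of the indicator of the box [-1, 1]^3
    return [min(1.0, max(-1.0, ui)) for ui in u]

def grad_phi(x):                    # Phi(x) = (1/2) ||x - b||^2
    return [xi - bi for xi, bi in zip(x, b)]

x_star = proj(b)                    # exact minimizer: the projection of b
x_prev = x = [0.0] * 3
sup_kdx = 0.0
for k in range(1, 3001):
    gk = [1.0 / k ** 3, 0.0, 0.0]   # computational errors, sum_k k*||g_k|| < +inf
    c = (k - 1.0) / (k + alpha - 1.0)
    y = [xi + c * (xi - pi) for xi, pi in zip(x, x_prev)]
    step = [yi - s * (gi - ei) for yi, gi, ei in zip(y, grad_phi(y), gk)]
    x_prev, x = x, proj(step)
    dx = sum((ai - pi) ** 2 for ai, pi in zip(x, x_prev)) ** 0.5
    sup_kdx = max(sup_kdx, k * dx)

dist = sum((ai - qi) ** 2 for ai, qi in zip(x, x_star)) ** 0.5
print(dist, sup_kdx)                # iterates converge; sup_k k*||x_{k+1}-x_k|| bounded
```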
Remark 5.4
A reasonable conjecture is that strong convergence results can be obtained for algorithm (54), by transposing the results of Sect. 3 from the continuous to the discrete setting. We do not explore this direction here.
Remark 5.5
The analysis carried out in this section for the inertial forward-backward algorithm (55) is a reinterpretation of the proof of the corresponding results in the continuous case. In other words, we built a complete proof using the continuous setting as a guideline. It would be interesting to know whether the results in [5, 6] can be applied in order to deduce the asymptotic properties without repeating the proofs.
5.4 Final comments
In the particular case \(\alpha = 3\), for a perturbed version of the classical FISTA algorithm, Schmidt–Le Roux–Bach proved in [34] a result which is similar to Theorem 5.1, concerning the fast convergence of the values. In [36], Villa–Salzo–Baldassarre–Verri use the notion of \(\epsilon \)-subdifferential in order to compute inexact proximal points in the algorithm. In a recent article [10], Aujol–Dossal extend this study by introducing additional perturbation terms to analyze the stability of the FISTA method, and prove convergence of the sequences generated by the algorithm (in the spirit of Chambolle–Dossal [19]). Although the results obtained in these papers have some similarities to those of Theorems 5.1 and 5.3, our dynamic approach is original and opens the door to new developments. One example is the following: in the dynamical system studied here, second-order information with respect to time ultimately induces fast convergence properties. On the other hand, in Newton-type methods, second-order information with respect to space has a similar consequence. In a forthcoming paper, and based on the previous works [4] and [9], we study the solutions of the second-order (both in time and space) evolution equation
where \(\Phi \) is a smooth convex function, and \(\alpha , \beta \) are positive parameters. This inertial system combines an isotropic viscous damping which vanishes asymptotically, and a geometrical Hessian-driven damping, which makes it naturally related to Newton and Levenberg-Marquardt methods.
The rich literature on the optimal damping of oscillating systems offers interesting insight (see, for example, the recent paper by Ghisi–Gobbino–Haraux [20], which deals with periodic damping).
Notes
In fact, W decreases strictly, as long as the trajectory is not stationary.
References
Abbas, B., Attouch, H., Svaiter, B.F.: Newton-like dynamics and forward–backward methods for structured monotone inclusions in Hilbert spaces. J. Optim. Theory Appl. 161(2), 331–360 (2014)
Alvarez, F.: On the minimizing property of a second-order dissipative system in Hilbert spaces. SIAM J. Control Optim. 38(4), 1102–1119 (2000)
Alvarez, F., Attouch, H.: An inertial proximal method for maximal monotone operators via discretization of a nonlinear oscillator with damping. Set Valued Anal. 9(1–2), 3–11 (2001)
Alvarez, F., Attouch, H., Bolte, J., Redont, P.: A second-order gradient-like dissipative dynamical system with Hessian-driven damping. Application to optimization and mechanics. J. Math. Pures Appl. 81(8), 747–779 (2002)
Alvarez, F., Peypouquet, J.: Asymptotic almost-equivalence of Lipschitz evolution systems in Banach spaces. Nonlinear Anal. 73(9), 3018–3033 (2010)
Alvarez, F., Peypouquet, J.: A unified approach to the asymptotic almost-equivalence of evolution systems without Lipschitz conditions. Nonlinear Anal. 74(11), 3440–3444 (2011)
Attouch, H., Buttazzo, G., Michaille, G.: Variational analysis in Sobolev and BV spaces. Applications to PDEs and optimization, 2nd edn. MOS-SIAM Series on Optimization. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA; Mathematical Optimization Society, Philadelphia, PA, xii+793 pp (2014)
Attouch, H., Cabot, A., Redont, P.: The dynamics of elastic shocks via epigraphical regularization of a differential inclusion. Adv. Math. Sci. Appl. 12(1), 273–306 (2002)
Attouch, H., Peypouquet, J., Redont, P.: A dynamical approach to an inertial forward-backward algorithm for convex minimization. SIAM J. Optim. 24(1), 232–256 (2014)
Aujol, J.-F., Dossal, C.: Stability of over-relaxations for the forward–backward algorithm, application to FISTA. SIAM J. Optim. (2015). HAL hal-01163432
Baillon, J.-B.: Un exemple concernant le comportement asymptotique de la solution du problème \(\frac{du}{dt} + \partial \phi (u) \ni 0\). J. Funct. Anal. 28, 369–376 (1978)
Baillon, J.-B., Haddad, G.: Quelques propriétés des opérateurs angle-bornés et n-cycliquement monotones. Isr. J. Math. 26, 137–150 (1977)
Bauschke, H., Combettes, P.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. CMS Books in Mathematics. Springer (2011)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
Brézis, H.: Opérateurs maximaux monotones dans les espaces de Hilbert et équations d’évolution, Lecture Notes 5, North Holland (1972)
Bruck, R.E.: Asymptotic convergence of nonlinear contraction semigroups in Hilbert spaces. J. Funct. Anal. 18, 15–26 (1975)
Cabot, A., Engler, H., Gadat, S.: On the long time behavior of second order differential equations with asymptotically small dissipation. Trans. Am. Math. Soc. 361, 5983–6017 (2009)
Cabot, A., Engler, H., Gadat, S.: Second order differential equations with asymptotically small dissipation and piecewise flat potentials. Electron. J. Differ. Equ. 17, 33–38 (2009)
Chambolle, A., Dossal, Ch.: On the convergence of the iterates of FISTA. Preprint, HAL hal-01060130v3, https://hal.inria.fr/hal-01060130v3 (2014)
Ghisi, M., Gobbino, M., Haraux, A.: The remarkable effectiveness of time-dependent damping terms for second order evolution equations (2015). arXiv:1506.06915v1 [math.AP]
Güler, O.: New proximal point algorithms for convex minimization. SIAM J. Optim. 2(4), 649–664 (1992)
Kim, D., Fessler, J.A.: Optimized first-order methods for smooth convex minimization (2015). arXiv:1406.5468v2 [math.OC]
Knopp, K.: Theory and Application of Infinite Series. Blackie & Son, Glasgow (1951)
Lorenz, D.A., Pock, T.: An inertial forward–backward algorithm for monotone inclusions. J. Math. Imaging Vis. 51(2), 1–15 (2015)
Moudafi, A., Oliny, M.: Convergence of a splitting inertial proximal method for monotone operators. J. Comput. Appl. Math. 155(2), 447–454 (2003)
Nesterov, Y.: A method of solving a convex programming problem with convergence rate \(O(1/k^2)\). Sov. Math. Dokl. 27, 372–376 (1983)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, Volume 87 of Applied Optimization. Kluwer Academic Publishers, Boston (2004)
Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)
Nesterov, Y.: Gradient methods for minimizing composite objective function. CORE Discussion Papers (2007)
Opial, Z.: Weak convergence of the sequence of successive approximations for nonexpansive mappings. Bull. Am. Math. Soc. 73, 591–597 (1967)
Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 1, 123–231 (2013)
Peypouquet, J.: Convex Optimization in Normed Spaces: Theory, Methods and Examples. Springer, Berlin (2015)
Peypouquet, J., Sorin, S.: Evolution equations for maximal monotone operators: asymptotic analysis in continuous and discrete time. J. Convex Anal. 17(3–4), 1113–1163 (2010)
Schmidt, M., Le Roux, N., Bach, F.: Convergence rates of inexact proximal-gradient methods for convex optimization. In: NIPS'11—25th Annual Conference on Neural Information Processing Systems, Granada, Spain (2011). HAL inria-00618152v3
Su, W., Boyd, S., Candès, E.J.: A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. Neural Inf. Process. Syst. 27, 2510–2518 (2014)
Villa, S., Salzo, S., Baldassarre, L., Verri, A.: Accelerated and inexact forward–backward algorithms. SIAM J. Optim. 23(3), 1607–1633 (2013)
Additional information
Effort sponsored by the Air Force Office of Scientific Research, Air Force Material Command, USAF, under Grant Number FA9550-14-1-0056. Also supported by Fondecyt Grant 1140829, Conicyt Anillo ACT-1106, ECOS-Conicyt Project C13E03, Millenium Nucleus ICM/FIC RC130003, Conicyt Project MATHAMSUD 15MATH-02, Conicyt Redes 140183, and Basal Project CMM Universidad de Chile.
Appendix: Some auxiliary results
In this section, we collect some auxiliary lemmas used in the body of the paper. The following result can be found in [1]:
Lemma 5.6
Let \(\delta >0, 1\le p<\infty \) and \(1\le r\le \infty \). Suppose \(F\in L^{p}([\delta ,\infty [)\) is a locally absolutely continuous nonnegative function, \(G\in L^{r}([\delta ,\infty [)\) and
for almost every \(t>\delta \). Then \(\lim _{t\rightarrow \infty }F(t)=0\).
To establish the weak convergence of the solutions of (1), we will use Opial's Lemma [30], which we recall in its continuous form. This argument was first used in [16] to establish the convergence of nonlinear contraction semigroups.
Lemma 5.7
Let S be a nonempty subset of \({{\mathcal {H}}}\) and let \(x:[0,+\infty )\rightarrow {{\mathcal {H}}}\). Assume that
- (i) for every \(z\in S\), \(\lim _{t\rightarrow \infty }\Vert x(t)-z\Vert \) exists;
- (ii) every weak sequential limit point of x(t), as \(t\rightarrow \infty \), belongs to S.
Then x(t) converges weakly as \(t\rightarrow \infty \) to a point in S.
Its discrete version is
Lemma 5.8
Let S be a nonempty subset of \({{\mathcal {H}}}\), and let \((x_k)\) be a sequence of elements of \({{\mathcal {H}}}\). Assume that
- (i) for every \(z\in S\), \(\lim _{k\rightarrow +\infty }\Vert x_k-z\Vert \) exists;
- (ii) every weak sequential limit point of \((x_k)\), as \(k\rightarrow \infty \), belongs to S.
Then \(x_k\) converges weakly as \(k\rightarrow \infty \) to a point in S.
The following allows us to establish the existence of a limit for a real-valued function, as \(t\rightarrow +\infty \):
Lemma 5.9
Let \(\delta >0\), and let \(w: [\delta , +\infty [ \rightarrow \mathbb R\) be a continuously differentiable function which is bounded from below. Assume
for some \(\alpha > 1\), almost every \(t>\delta \), and some nonnegative function \(g\in L^1 (\delta , +\infty )\). Then, the positive part \([{{\dot{w}}}]_+\) of \({{\dot{w}}}\) belongs to \(L^1(\delta ,+\infty )\) and \(\lim _{t\rightarrow +\infty }w(t)\) exists.
Proof
Multiply (81) by \(t^{\alpha -1}\) to obtain
By integration, we obtain
Hence,
and so,
Applying Fubini’s Theorem, we deduce that
As a consequence,
Finally, the function \(\theta :[\delta ,+\infty )\rightarrow {\mathbb {R}}\), defined by
is nonincreasing and bounded from below. It follows that
exists. \(\square \)
In the study of the corresponding algorithms, we use the following result, which is a discrete version of Lemma 5.9:
Lemma 5.10
Let \(\alpha \ge 3\), and let \((a_k)\) and \((\omega _k)\) be two sequences of nonnegative numbers such that
for all \(k\ge 1\). If \(\sum _{k\in {{\mathbb {N}}}} k\omega _k <+\infty \), then \(\sum _{k \in {{\mathbb {N}}}} a_k < +\infty \).
Proof
Since \(\alpha \ge 3\) we have \(\alpha -1\ge 2\), and hence
Multiplying this expression by \((k+1)^2\), we obtain
Summing this inequality for \(j=1,2,\ldots ,k\), we obtain
Dividing by \(k^2\), and summing for \(k=2,\ldots ,K\), we obtain
Applying Fubini’s Theorem to this last sum, and observing that \(\sum _{k=j+1}^{\infty }\frac{1}{k^2} \le \int _{j}^{\infty }\frac{1}{t^2}dt = \frac{1}{j}\), we obtain
and the result follows. \(\square \)
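The series–integral comparison invoked in the proof can be checked numerically. The following snippet (an illustration we add here, not part of the original proof) verifies the bound for a few values of j; truncating the series can only decrease the left-hand side, so the check is conservative:

```python
# Illustrative numerical check (added here, not part of the proof) of the
# series-integral comparison used above:
#     sum_{k=j+1}^{oo} 1/k^2  <=  int_j^oo dt/t^2  =  1/j.
# Truncating the series at N only decreases the left-hand side, so the
# truncated check is conservative.
N = 100_000
checks = {j: sum(1.0 / k**2 for k in range(j + 1, N + 1)) for j in (1, 2, 10, 100)}
for j, tail in checks.items():
    print(j, tail, "<=", 1.0 / j)
```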
The following is a continuous version of Kronecker’s Theorem for series (see, for example, [23, page 129]):
Lemma 5.11
Take \(\delta >0\), and let \(f \in L^1 (\delta , +\infty )\) be nonnegative and continuous. Consider a nondecreasing function \(\psi :(\delta ,+\infty )\rightarrow (0,+\infty )\) such that \(\lim \limits _{t\rightarrow +\infty }\psi (t)=+\infty \). Then,
Proof
Given \(\epsilon >0\), fix \(t_\epsilon \) sufficiently large so that
Then, for \(t \ge t_\epsilon \), split the integral \(\int _{\delta }^t \psi (s)f(s) ds\) into two parts to obtain
Now let \(t\rightarrow +\infty \) to deduce that
Since this is true for any \(\epsilon >0\), the result follows. \(\square \)
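A concrete instance of Lemma 5.11 (an example of our own choosing, added for illustration) is \(f(s)=1/s^2\) and \(\psi (s)=s\) on \((1,+\infty )\), for which the weighted average \(\frac{1}{\psi (t)}\int _1^t \psi (s)f(s)\,ds = \frac{\ln t}{t}\) indeed tends to 0:

```python
import math

# A concrete instance of the lemma (our own example): take f(s) = 1/s**2,
# which is nonnegative, continuous and integrable on (1, +oo), and the
# nondecreasing weight psi(s) = s.  Then
#     (1/psi(t)) * int_1^t psi(s) f(s) ds = (1/t) * int_1^t ds/s = ln(t)/t,
# which tends to 0 as t -> +oo.
values = [math.log(t) / t for t in (10.0, 100.0, 1000.0, 1_000_000.0)]
print(values)
```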
Using the previous result, we also obtain the following vector-valued version of Lemma 5.9:
Lemma 5.12
Take \(\delta >0\), and let \(F \in L^1 (\delta , +\infty ; {{\mathcal {H}}})\) be continuous. Let \(x:[\delta , +\infty [ \rightarrow {{\mathcal {H}}}\) be a solution of
with \(\alpha >1\). Then, x(t) converges strongly in \({{\mathcal {H}}}\) as \(t \rightarrow + \infty \).
Proof
As in the proof of Lemma 5.9, multiply (82) by \(t^{\alpha -1}\) and integrate to obtain
Integrate again to deduce that
Fubini’s Theorem applied to the last integral gives
Finally, apply Lemma 5.11 to the last integral with \(\psi (s)=s^{\alpha -1}\) and \(f(s)=\Vert F(s)\Vert \) to conclude that all the terms in the right-hand side of (83) have a limit as \(t \rightarrow + \infty \). \(\square \)
Finally, in the analysis of the solutions of the perturbed system, we shall use the following Gronwall–Bellman Lemma (see [15, Lemme A.5]):
Lemma 5.13
Let \(m:[\delta ,T]\rightarrow [0,+\infty [\) be integrable, and let \(c\ge 0\). Suppose \(w:[\delta ,T]\rightarrow {\mathbb {R}}\) is continuous and
for all \(t\in [\delta ,T]\). Then, \(\displaystyle {|w(t)| \le c + \int _{\delta }^t m(\tau ) d\tau }\) for all \(t\in [\delta ,T]\).
We shall also make use of the following discrete version of the preceding result:
Lemma 5.14
Let \((a_k)\) be a sequence of nonnegative numbers such that
for all \(k\in {{\mathbb {N}}}\), where \((\beta _j )\) is a summable sequence of nonnegative numbers, and \(c\ge 0\). Then, \(\displaystyle a_k \le c + \sum \nolimits _{j=1}^{\infty } \beta _j\) for all \(k\in {{\mathbb {N}}}\).
Proof
For \(k\in {{\mathbb {N}}}\), set \(A_k := \max _{1\le m \le k} a_m \). Then, for \(1\le m\le k\), we have
Taking the maximum over \(1\le m\le k\), we obtain
Bounding \(A_k\) by the largest root of the corresponding quadratic equation, we obtain the result. \(\square \)
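The discrete Gronwall bound can also be checked numerically. The displayed hypothesis of Lemma 5.14 is not reproduced in this extraction; in line with the quadratic-roots argument of the proof, the sketch below ASSUMES it has the form \(a_k^2 \le c^2 + \sum _{j=1}^{k} \beta _j a_j\), builds the extremal sequence that saturates this inequality, and checks the stated conclusion:

```python
# Numerical sanity check of the discrete Gronwall bound.  The displayed
# hypothesis is not reproduced in this extraction; in line with the
# quadratic-roots argument of the proof, we ASSUME it reads
#     a_k**2 <= c**2 + sum_{j=1}^{k} beta_j * a_j.
# We build the extremal sequence saturating this inequality and check the
# conclusion  a_k <= c + sum_j beta_j  for every k.
c = 2.0
beta = [1.0 / (j + 1) ** 2 for j in range(50)]   # summable, nonnegative
a = []
for k in range(50):
    # largest a_k allowed:  a_k**2 - beta_k * a_k - s <= 0,
    # where s = c**2 + sum_{j<k} beta_j * a_j
    s = c ** 2 + sum(beta[j] * a[j] for j in range(k))
    a.append((beta[k] + (beta[k] ** 2 + 4.0 * s) ** 0.5) / 2.0)
bound = c + sum(beta)
print(max(a), "<=", bound)
```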
Cite this article
Attouch, H., Chbani, Z., Peypouquet, J. et al. Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity. Math. Program. 168, 123–175 (2018). https://doi.org/10.1007/s10107-016-0992-8
Keywords
- Convex optimization
- Fast convergent methods
- Dynamical systems
- Gradient flows
- Inertial dynamics
- Vanishing viscosity
- Nesterov method