Abstract
In this paper, we propose, in a Hilbertian setting, a second-order continuous-time dynamical system with fast convergence guarantees to solve structured convex minimization problems with an affine constraint. The system is associated with the augmented Lagrangian formulation of the minimization problem. The corresponding dynamics involves three general time-varying parameters, each with specific properties, associated, respectively, with viscous damping, extrapolation and temporal scaling. By appropriately adjusting these parameters, we develop a Lyapunov analysis which provides fast convergence properties of the values and of the feasibility gap. These results naturally pave the way for developing corresponding accelerated ADMM algorithms, obtained by temporal discretization.
1 Introduction
Our paper is part of the active research stream that studies the relationship between continuous-time dissipative dynamical systems and optimization algorithms. From this perspective, damped inertial dynamics offer a natural way to accelerate the corresponding algorithms. An abundant literature has been devoted to the design of the damping terms, which are the basic ingredient of the optimization properties of these dynamics. In line with the seminal work of Polyak on the heavy ball method with friction [45, 46], the first studies have focused on the case of a fixed viscous damping coefficient [1, 2, 17]. A decisive step was taken in [52] where the authors considered inertial dynamics with an asymptotically vanishing viscous damping coefficient. In doing so, they made the link with the accelerated gradient method of Nesterov [25, 41, 42] for unconstrained convex minimization. This has resulted in a flurry of research activity; see e.g. [3, 7,8,9, 11,12,13, 18, 20, 21, 23, 27, 31, 39, 51, 53].
In this paper, we consider the case of affinely constrained convex structured minimization problems. To bring back the problem to the unconstrained case, there are two main ways: either penalize the constraint (by external penalization or an internal barrier method), or use (augmented) Lagrangian multiplier methods.
Accounting for approximation/penalization terms within dynamical systems has been considered in a series of papers; see [16, 29] and the references therein. It is a flexible approach which can be applied to non-convex problems and/or ill-posed problems, making it a valuable tool for inverse problems. Its major drawback is that in general, it requires a subtle tuning of the approximation/penalization parameter.
Here, we will consider the augmented Lagrangian approach and study the convergence properties of a second-order inertial dynamic with damping, which is attached to the augmented Lagrangian formulation of the affinely constrained convex minimization problem. The proposed dynamical system can be viewed as an inertial continuous-time counterpart of the ADMM method originally proposed in the mid-1970s and which has gained considerable interest in recent years, in particular for solving large-scale composite optimization problems arising in data science. Among the novelties of our work, the dynamics we propose involves three parameters which vary in time. These are associated with viscous damping, extrapolation, and temporal scaling. By properly adjusting these parameters, we will provide fast convergence rates both for the values and the feasibility gap. The balance between the viscosity parameter (which tends towards zero) and the extrapolation parameter (which tends towards infinity) has already been developed in [36, 54] and [14], though for different problems. Temporal scaling techniques were considered in [6] for the case of convex minimization without affine constraint; see also [10, 12, 15]. Thus, another key contribution of this paper is to show that temporal scaling and extrapolation can be extended to the class of ADMM-type methods with improved convergence rates. Working with general coefficients and in general Hilbert spaces allows us to encompass the results obtained in the above-mentioned papers and to broaden their scope.
It has been known for a long time that the optimality conditions of the (augmented) Lagrangian formulation of convex structured minimization problems with an affine constraint can be equivalently formulated as a monotone inclusion problem; see [48,49,50]. In turn, the problem can be converted into finding the zeros of a maximally monotone operator and can therefore be attacked using inertial methods for solving monotone inclusions. In this regard, let us mention the following recent works concerning the acceleration of ADMM methods via continuous-time inertial dynamics:
-
In [26], the authors proposed an inertial ADMM by making use of the inertial version of the Douglas–Rachford splitting method for monotone inclusion problems recently introduced in [28], in the context of concomitantly solving a convex minimization problem and its Fenchel dual; see also [34, 43, 44, 47] in the purely discrete setting.
-
Attouch [5] uses the maximally monotone operator associated with the augmented Lagrangian formulation of the problem, and specializes to this operator the inertial proximal point algorithm recently developed in [19] to solve general monotone inclusions. This gives rise to an inertial proximal ADMM algorithm in which an appropriate adjustment of the viscosity and proximal parameters yields provably fast convergence properties, as well as convergence of the iterates to saddle points of the Lagrangian function. This approach is in line with [22], which considered the case without inertia. However, it fails to yield a fully split inertial ADMM algorithm.
Contents In Sect. 2, we introduce the inertial second-order dynamical system with damping (coined (TRIALS)) which is attached to the augmented Lagrangian formulation. In Sect. 3, which is the main part of the paper, we develop a Lyapunov analysis to establish the asymptotic convergence properties of (TRIALS). This gives rise to a system of inequalities-equalities which must be satisfied by the parameters of the dynamics. From the energy estimates thus obtained, we show in Sect. 4 that the Cauchy problem attached to (TRIALS) is well posed, i.e. existence and, under additional assumptions, uniqueness of a global solution. In Sect. 5, we examine the case of uniformly convex objectives. In Sect. 6, we provide specific choices of the system parameters that satisfy our assumptions and achieve fast convergence rates. This is then supplemented by preliminary numerical illustrations. Some conclusions and perspectives are finally outlined in Sect. 7.
2 Problem Statement
Consider the structured convex optimization problem:
where, throughout the paper, we make the following standing assumptions:
Throughout, we denote by \(\langle \cdot ,\,\cdot \rangle \) and \(\left\| {\cdot }\right\| \) the scalar product and corresponding norm associated with any of \({{\mathcal {X}}},{{\mathcal {Y}}},{\mathcal {Z}}\), and the underlying space is to be understood from the context.
2.1 Augmented Lagrangian Formulation
Classically, (\({{\mathcal {P}}}\)) can be equivalently reformulated as the saddle point problem
where \({{\mathcal {L}}}: {{\mathcal {X}}}\times {{\mathcal {Y}}}\times {\mathcal {Z}}\rightarrow {{\mathbb {R}}}\) is the Lagrangian associated with (1)
Under our standing assumption \(({{\mathcal {H}}}_{{{\mathcal {P}}}})\), \({{\mathcal {L}}}\) is convex with respect to \((x,y)\in {{\mathcal {X}}}\times {{\mathcal {Y}}}\), and affine (and hence concave) with respect to \(\lambda \in {\mathcal {Z}}\). A pair \((x^\star ,y^\star )\) is optimal for (\({{\mathcal {P}}}\)), and \(\lambda ^\star \) is a corresponding Lagrange multiplier if and only if \((x^\star , y^\star , \lambda ^\star )\) is a saddle point of the Lagrangian function \({{\mathcal {L}}}\), i.e. for every \((x,y,\lambda ) \in {{\mathcal {X}}}\times {{\mathcal {Y}}}\times {\mathcal {Z}}\),
We denote by \({{\mathscr {S}}}\) the set of saddle points of \({{\mathcal {L}}}\). The corresponding optimality conditions read
where we use the classical notations: \(\nabla f\) and \(\nabla g\) are the gradients of f and g, \(A^*\) is the adjoint operator of A, and similarly for B. The operator \(\nabla _z\) is the gradient of the corresponding multivariable function with respect to variable z. Given \(\mu > 0\), the augmented Lagrangian \({{\mathcal {L}}}_\mu : {{\mathcal {X}}}\times {{\mathcal {Y}}}\times {\mathcal {Z}}\rightarrow {{\mathbb {R}}}\) associated with the problem (\({{\mathcal {P}}}\)), is defined by
Observe that one still has \( (x^\star , y^\star , \lambda ^\star )\in {{\mathscr {S}}}\Longleftrightarrow {\left\{ \begin{array}{ll} \nabla _x {{\mathcal {L}}}_\mu (x^\star ,y^\star ,\lambda ^\star )=0, \\ \nabla _y {{\mathcal {L}}}_\mu (x^\star ,y^\star ,\lambda ^\star )=0, \\ \nabla _\lambda {{\mathcal {L}}}_\mu (x^\star ,y^\star ,\lambda ^\star )=0. \end{array}\right. }\)
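To illustrate this equivalence, consider a hypothetical scalar instance (ours, not the paper's): \(f(x)=x^2/2\), \(g(y)=y^2/2\), \(A=B=1\), \(c=1\), whose unique saddle point is \((\tfrac{1}{2}, \tfrac{1}{2}, -\tfrac{1}{2})\). A minimal sketch checks that the partial gradients of \({{\mathcal {L}}}\) and \({{\mathcal {L}}}_\mu \) vanish simultaneously at the saddle point for any \(\mu > 0\), since the gradient of the penalty term vanishes on the feasible set:

```python
def grad_L(x, y, lam):
    """Partial gradients of the toy Lagrangian
    L(x, y, lam) = x**2/2 + y**2/2 + lam*(x + y - 1)."""
    r = x + y - 1.0                    # feasibility residual Ax + By - c
    return (x + lam, y + lam, r)

def grad_L_mu(x, y, lam, mu):
    """Partial gradients of the augmented Lagrangian L_mu, which adds
    (mu/2)*(x + y - 1)**2; only the x- and y-gradients change."""
    gx, gy, gl = grad_L(x, y, lam)
    r = x + y - 1.0
    return (gx + mu * r, gy + mu * r, gl)
```

At the saddle point both systems vanish for every \(\mu \), while at an infeasible point the two gradients differ, which is the content of the equivalence above.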
2.2 The Inertial System (TRIALS)
We will study the asymptotic behaviour, as \(t \rightarrow +\infty \), of the inertial system:
for \(t \in [t_0,+\infty [\) with initial conditions \((x(t_0),y(t_0),\lambda (t_0))\) and \((\dot{x}(t_0),\dot{y}(t_0), \dot{\lambda }(t_0))\). The parameters of (TRIALS) play the following roles:
-
\(\gamma (t)\) is a viscous damping parameter,
-
\(\alpha (t)\) is an extrapolation parameter,
-
b(t) is attached to the temporal scaling of the dynamics.
In the sequel, we make the following standing assumption on these parameters:
Plugging the expression of the partial gradients of \({{\mathcal {L}}}_\mu \) into the above system, the Cauchy problem associated with (TRIALS) reads as follows, where, without ambiguity, we drop the dependence of \((x,y,\lambda )\) on t to lighten the formulas:
If in addition to \(({{\mathcal {H}}}_{{{\mathcal {P}}}})\), the gradients of f and g are Lipschitz continuous on bounded sets, we will show later in Sect. 4 that the Cauchy problem associated with (TRIALS) has a unique global solution on \([t_0,+\infty [\). Indeed, although the existence and uniqueness of a local solution follow from the standard non-autonomous Cauchy–Lipschitz theorem, the global existence necessitates the energy estimates derived from the Lyapunov analysis in the next section. The centrality of these estimates is the reason why the proof of well-posedness is deferred to Sect. 4. Thus, for the moment, we take for granted the existence of classical solutions to (TRIALS).
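Taking existence for granted, one can already probe the dynamics numerically. The sketch below integrates a hypothetical scalar instance (\(f(x)=x^2/2\), \(g(y)=y^2/2\), \(A=B=1\), \(c=1\), so \(x^\star =y^\star =\tfrac{1}{2}\)) with a semi-implicit Euler scheme. The right-hand side is our reading of the constitutive form described above (damped second-order dynamics driven by the partial gradients of \({{\mathcal {L}}}_\mu \), descent in \((x,y)\) and ascent in \(\lambda \), evaluated at the extrapolated point), and the parameter choices \(\gamma (t)=\frac{\eta +\alpha _0}{\alpha _0 t}\), \(\alpha (t)=\alpha _0 t\), \(b\equiv 1\) follow the discussion of Sect. 2.3 with \(\alpha _0=\tfrac{1}{2}\); none of this is a verbatim copy of the paper's display:

```python
def trials_trajectory(t0=1.0, T=50.0, h=1e-3, mu=1.0, eta=1.5, alpha0=0.5):
    """Integrate a hypothetical scalar instance of (TRIALS):
    f(x) = x**2/2, g(y) = y**2/2, A = B = 1, c = 1, so x* = y* = 1/2.

    Assumed constitutive form (our reading of the system described above):
        z'' + gamma(t) z' = -b(t) * grad_z L_mu  at the extrapolated point
    for z in {x, y}, with gradient *ascent* in lambda."""
    gamma = lambda t: (eta + alpha0) / (alpha0 * t)   # vanishing viscous damping
    alpha = lambda t: alpha0 * t                      # growing extrapolation
    b = lambda t: 1.0                                 # no time scaling (alpha0 = 1/2)

    x = y = lam = 0.0
    vx = vy = vl = 0.0
    t = t0
    while t < T:
        # extrapolated point: (x, y, lambda) + alpha(t) * velocities
        xe, ye, le = x + alpha(t) * vx, y + alpha(t) * vy, lam + alpha(t) * vl
        r = xe + ye - 1.0                             # feasibility residual there
        gx, gy, gl = xe + le + mu * r, ye + le + mu * r, r
        # semi-implicit Euler: update velocities first, then positions
        vx += h * (-gamma(t) * vx - b(t) * gx)
        vy += h * (-gamma(t) * vy - b(t) * gy)
        vl += h * (-gamma(t) * vl + b(t) * gl)
        x, y, lam = x + h * vx, y + h * vy, lam + h * vl
        t += h
    return x, y, lam
```

On this instance one observes the trajectory remaining bounded and the feasibility gap \(|x(T)+y(T)-1|\) decaying as T grows, consistent with the rates announced in Sect. 2.3.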
2.3 A Fast Convergence Result
Our Lyapunov analysis will allow us to establish convergence results and rates under very general conditions on the parameters of (TRIALS); see Sect. 3. In fact, there are many situations of practical interest where such conditions are easily verified, and these will be discussed in detail in Sect. 6. Thus, for the sake of illustration and the reader's convenience, we describe here an important situation where convergence occurs with the fast rate \({{\mathcal {O}}}(1/t^2)\).
Theorem 1
Suppose that the coefficients of (TRIALS) satisfy
where \(\eta > 1\). Suppose that the set of saddle points \({{\mathscr {S}}}\) is non-empty and let \((x^\star ,y^\star ,\lambda ^\star ) \in {{\mathscr {S}}}\). Then, for any solution trajectory \((x(\cdot ),y(\cdot ),\lambda (\cdot ))\) of (TRIALS), the trajectory remains bounded, and we have the following convergence rates:
where \(C_1\) and \(C_2\) are positive constants.
In particular, for \(\alpha _0 = \frac{1}{2}\), i.e. without time scaling (\(b \equiv 1\)), we have
For the ADMM algorithm (thus in discrete time \(t=k h\), \(k \in {{\mathbb {N}}}, h > 0\)), it has been shown in [32, 33] that the convergence rate of the (squared) feasibility gap is \({{\mathcal {O}}}\left( {\frac{1}{k}}\right) \) and that the rate on \(|F(x_k,y_k)-F(x^\star ,y^\star )|\) is \({{\mathcal {O}}}\left( {\frac{1}{k^{1/2}}}\right) \). These rates were shown to be essentially tight in [32]. Our results then suggest that for \(\alpha _0=\frac{1}{2}\), a proper discretization of (TRIALS) would lead to an accelerated ADMM algorithm with provably faster convergence rates (see [37, 38] in this direction on specific problem instances and algorithms). These discrete algorithmic issues of (TRIALS) will be investigated in future work.
Again, for \(\alpha _0 = \frac{1}{2}\), the \({{\mathcal {O}}}\left( {\frac{1}{t^2}}\right) \) rate obtained on the Lagrangian is reminiscent of the fast convergence obtained with the continuous-time dynamical version of the Nesterov accelerated gradient method in which the viscous damping coefficient is of the form \(\gamma (t) = \frac{\gamma _0}{t}\) and the fast rate is obtained for \(\gamma _0 \ge 3\); see [11, 52]. With our notations, this corresponds to \(\gamma _0 = \frac{\eta +\alpha _0}{\alpha _0}\), and our choice \(\alpha _0 = \frac{1}{2}\) entails \(\gamma _0=2\eta +1 > 3\). This corresponds to the same critical value as Nesterov’s, but the inequality here is strict. This is not that surprising in our context since one has to handle the dual multiplier and there is an intricate interplay between \(\gamma \) and the extrapolation coefficient \(\alpha \).
2.4 The Role of Extrapolation
One of the key and distinctive features of (TRIALS) is that the partial gradients (with the appropriate sign) of the augmented Lagrangian function are not evaluated at \((x(t),y(t),\lambda (t))\), as would be the case in a classical continuous-time system associated with ADMM-type methods, but rather at extrapolated points. This new property will be instrumental in allowing for faster convergence rates, and it can be interpreted from different standpoints: optimization, game theory, or control:
-
Optimization standpoint: in this field, this type of extrapolation was recently studied in [14, 36, 54]. It will play a key role in the development of our Lyapunov analysis. Observe that \(\alpha (t) \dot{x}(t)\) and \(\alpha (t) \dot{\lambda }(t)\) point in the direction of future movement of x(t) and \(\lambda (t)\). Thus, (TRIALS) involves the estimated future positions \(x(t) + \alpha (t) \dot{x}(t)\) and \(\lambda (t) + \alpha (t) \dot{\lambda }(t)\). The explicit discretization \(x_k + \alpha _k (x_{k}-x_{k-1})\) and \(\lambda _k + \alpha _k (\lambda _{k}-\lambda _{k-1})\) gives an extrapolation similar to that of Nesterov's accelerated method. The implicit discretization reads \(x_k + \alpha _k (x_{k+1}-x_k)\) and \(\lambda _k + \alpha _k (\lambda _{k+1}-\lambda _k)\). For \(\alpha _k=1\), this gives \(x_{k+1}\) and \(\lambda _{k+1}\), which would yield implicit algorithms with the associated stability properties.
-
Game theoretic standpoint: let us think about (x, y) and \(\lambda \) as two players playing against each other, and shortly speaking, we identify the players with their actions. We can then see that in (TRIALS), each player anticipates the movement of its opponent. In the coupling term, the player (x, y) takes account of the anticipated position of the player \(\lambda \), which is \(\lambda (t) + \alpha (t) \dot{\lambda }(t)\), and vice versa.
-
Control theoretic standpoint: the structure of (TRIALS) is also related to control theory and state derivative feedback. By defining \(w(t)= (x(t), y(t), \lambda (t))\) the equation can be written in an equivalent way
$$\begin{aligned} \ddot{w}(t) + \gamma (t) \dot{w}(t) = K(t,w(t), \dot{w}(t)), \end{aligned}$$

for an operator K appropriately identified from (TRIALS) in terms of the partial gradients of \({{\mathcal {L}}}_\mu \), \(\alpha \) and b. In this system, the feedback control term K, which takes the constraint into account, is a function not only of the state w(t) but also of its derivative. One can consult [40] for a comprehensive treatment of state derivative feedback. Indeed, we will use \(\alpha (\cdot )\) as a control variable, which will turn out to play an important role in our subsequent developments.
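To make the first (optimization) standpoint concrete: the explicit discretization \(x_k + \alpha _k (x_k - x_{k-1})\) is precisely the look-ahead point of Nesterov's accelerated gradient method. A minimal sketch on an unconstrained toy objective, with the classical schedule \(\alpha _k = \frac{k-1}{k+2}\) (our own illustration, not the paper's algorithm):

```python
def nesterov_gradient(grad, x0, step, n_iter):
    """Gradient descent with Nesterov-style extrapolation:
        z_k     = x_k + alpha_k * (x_k - x_{k-1})   # look-ahead point
        x_{k+1} = z_k - step * grad(z_k)
    with the classical choice alpha_k = (k - 1) / (k + 2)."""
    x_prev = x = x0
    for k in range(1, n_iter + 1):
        alpha_k = (k - 1) / (k + 2)
        z = x + alpha_k * (x - x_prev)   # gradient evaluated here, not at x
        x_prev, x = x, z - step * grad(z)
    return x
```

The defining feature, as in (TRIALS), is that the gradient is taken at the extrapolated point z rather than at the current state x.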
2.5 Associated Monotone Inclusion Problem
The optimality system (4) can be written equivalently as:
where \(T_{{{\mathcal {L}}}}: {{\mathcal {X}}}\times {{\mathcal {Y}}}\times {\mathcal {Z}}\rightarrow {{\mathcal {X}}}\times {{\mathcal {Y}}}\times {\mathcal {Z}}\) is the maximally monotone operator associated with the convex-concave function \({{\mathcal {L}}}\), and which is defined by
Indeed, it is immediate to verify that \(T_{{{\mathcal {L}}}}\) is monotone using \(({{\mathcal {H}}}_{{{\mathcal {P}}}})\). Since it is continuous, it is a maximally monotone operator. Another way of seeing it is to use the standard splitting of \(T_{{{\mathcal {L}}}}\) as \( T_{{{\mathcal {L}}}}= T_1 + T_2 \) where
The operator \(T_1 = \partial \Phi \) is nothing but the gradient of the convex function \(\Phi (x,y,\lambda ) = f(x) + g(y)\), and therefore is maximally monotone owing to \(({{\mathcal {H}}}_{{{\mathcal {P}}}})\) (recall that convexity of a differentiable function implies maximal monotonicity of its gradient [50]). The operator \(T_2\) is obtained by translating a linear continuous and skew-symmetric operator, and therefore, it is also maximally monotone. This immediately implies that \(T_{{{\mathcal {L}}}}\) is maximally monotone as the sum of two maximally monotone operators, one of them being Lipschitz continuous ( [30, Lemma 2.4, page 34]). In turn, \({{\mathscr {S}}}\) can be interpreted as the set of zeros of the maximally monotone operator \(T_{{{\mathcal {L}}}}\). As such, it is a closed convex subset of \({{\mathcal {X}}}\times {{\mathcal {Y}}}\times {\mathcal {Z}}\).
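As a quick sanity check of the skew-symmetry argument, one can verify numerically that the affine operator \(T_2(x,y,\lambda ) = (A^*\lambda , B^*\lambda , c - Ax - By)\) (our reading of the standard splitting; the translation by c plays no role) satisfies \(\langle T_2 u - T_2 v,\, u - v \rangle = 0\), and hence is trivially monotone. A dependency-free sketch with random matrices, all helper names being ours:

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, x):
    return [dot(row, x) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def T2(A, B, c, x, y, lam):
    """Affine part of T_L (assumed standard form):
    T2(x, y, lam) = (A^T lam, B^T lam, c - A x - B y)."""
    Ax, By = matvec(A, x), matvec(B, y)
    return (matvec(transpose(A), lam),
            matvec(transpose(B), lam),
            [ci - a - b for ci, a, b in zip(c, Ax, By)])

def monotonicity_gap(A, B, c, u, v):
    """<T2(u) - T2(v), u - v>: skew-symmetry of the linear part forces
    this inner product to vanish identically."""
    Tu, Tv = T2(A, B, c, *u), T2(A, B, c, *v)
    return sum(dot([p - q for p, q in zip(tu, tv)],
                   [p - q for p, q in zip(wu, wv)])
               for tu, tv, wu, wv in zip(Tu, Tv, u, v))
```

The gap is zero in exact arithmetic: the cross terms \(\langle A^*\Delta \lambda , \Delta x\rangle + \langle B^*\Delta \lambda , \Delta y\rangle \) cancel against \(-\langle A\Delta x + B\Delta y, \Delta \lambda \rangle \), and the translation by c drops out of the difference.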
The evolution equation associated with \(T_{{{\mathcal {L}}}}\) is written
Following [30], the Cauchy problem (8) is well-posed, and the solution trajectories of (8), which define a semi-group of contractions generated by \(T_{{{\mathcal {L}}}}\), converge weakly in an ergodic sense to equilibria, which are the zeros of the operator \(T_{{{\mathcal {L}}}}\). Moreover, appropriate implicit discretization of (8) yields the proximal ADMM algorithm.
The situation is more complicated if we consider the corresponding inertial dynamics. Indeed, the convergence theory for the heavy ball method can be naturally extended to the case of maximally monotone cocoercive operators. Unfortunately, because of the skew-symmetric component \(T_2\) in \(T_{{{\mathcal {L}}}}\) (when \(c=0\)), the operator \(T_{{{\mathcal {L}}}}\) is not cocoercive. To overcome this difficulty, recent studies consider inertial dynamics where the operator \(T_{{{\mathcal {L}}}}\) is replaced by its Yosida approximation, with an appropriate adjustment of the Yosida parameter; see [19] and [5] in the case of the Nesterov accelerated method. However, such an approach does not lead to fully split algorithms, and hence requires an additional inner loop.
3 Lyapunov Analysis
Let \((x^\star ,y^\star ) \in {{\mathcal {X}}}\times {{\mathcal {Y}}}\) be a solution of (\({{\mathcal {P}}}\)), and denote by \(F^\star :=F(x^\star ,y^\star )\) the optimal value of (\({{\mathcal {P}}}\)). For the moment, the variable \(\lambda ^\star \) is chosen arbitrarily in \({\mathcal {Z}}\); we will be led to specialize it later. Let \(t \mapsto (x(t),y(t),\lambda (t))\) be a solution trajectory of (TRIALS) defined for \(t\ge t_0\), assumed to be a classical solution, i.e. of class \({{\mathcal {C}}}^2\). We are now in a position to introduce the function \(t \in [t_0, +\infty [\; \mapsto {{\mathcal {E}}}(t)\in {{\mathbb {R}}}\) that will serve as a Lyapunov function:
The coefficient \(\sigma (t)\) is non-negative and will be adjusted later, while \(\delta (t), \xi (t)\) are explicitly defined by the following formulas:
This choice will become clear from our Lyapunov analysis. To guarantee that \({{\mathcal {E}}}\) is a Lyapunov function for the dynamical system (TRIALS), the following conditions on the coefficients \(\gamma , \, \alpha , \, b, \, \sigma \) will naturally arise from our analysis:
Lyapunov system of inequalities/equalities on the parameters:

\((\mathcal {G}_{1})\,\,\,\sigma (t)\Big ({\gamma (t)\alpha (t)-{\dot{\alpha }} (t) -1}\Big ) -2 \alpha (t) {\dot{\sigma }}(t) \ge 0\),

\((\mathcal {G}_{2})\,\,\,\sigma (t)\Big ({\gamma (t)\alpha (t)- {\dot{\alpha }} (t) - 1}\Big ) - \alpha (t)\dot{\sigma } (t) \ge 0\),

\((\mathcal {G}_{3})\,\,\,-\frac{d}{dt}\left[ {\sigma (t) \left( {\sigma (t)\Big ({\gamma (t)\alpha (t) - {\dot{\alpha }} (t)}\Big ) -2 \alpha (t){\dot{\sigma }} (t)}\right) }\right] \ge 0\),

\((\mathcal {G}_{4})\,\,\,\alpha (t)\sigma (t)^2 b(t) -\frac{d}{dt}\left( \alpha ^2 \sigma ^2 b\right) (t)= 0\).
Observe that condition \((\mathcal {G}_{1})\) automatically ensures that \(\xi (t)\) is a non-negative function. In most practical situations (see Sect. 6), we will take \(\sigma \) as a non-negative constant, in which case \((\mathcal {G}_{1})\) and \((\mathcal {G}_{2})\) coincide, and thus, conditions \((\mathcal {G}_{1})\)–\((\mathcal {G}_{4})\) reduce to a system of three differential inequalities/equalities involving only the coefficients \((\gamma ,\alpha ,b)\) of the dynamical system (TRIALS).
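For a concrete check, consider the parameter family suggested by Sect. 2.3: \(\gamma (t) = \frac{\eta + \alpha _0}{\alpha _0 t}\), \(\alpha (t) = \alpha _0 t\), constant \(\sigma \equiv 1\), and \(b(t) = t^{1/\alpha _0 - 2}\). The closed form for b is our reading of \((\mathcal {G}_{4})\) for this family (it is consistent with \(b \equiv 1\) when \(\alpha _0 = \tfrac{1}{2}\)); the sketch below verifies \((\mathcal {G}_{1})\)–\((\mathcal {G}_{4})\) numerically by central finite differences:

```python
def check_growth_conditions(alpha0, eta, ts, h=1e-6, tol=1e-4):
    """Check (G1)-(G4) numerically for the assumed parameter family
    gamma(t) = (eta + alpha0)/(alpha0*t), alpha(t) = alpha0*t,
    sigma(t) = 1, b(t) = t**(1/alpha0 - 2)."""
    gamma = lambda t: (eta + alpha0) / (alpha0 * t)
    alpha = lambda t: alpha0 * t
    sigma = lambda t: 1.0
    b = lambda t: t ** (1.0 / alpha0 - 2.0)

    def d(f, t):                       # central finite difference
        return (f(t + h) - f(t - h)) / (2.0 * h)

    # quantities appearing inside (G3) and (G4)
    phi = lambda t: sigma(t) * (sigma(t) * (gamma(t) * alpha(t) - d(alpha, t))
                                - 2.0 * alpha(t) * d(sigma, t))
    psi = lambda t: alpha(t) ** 2 * sigma(t) ** 2 * b(t)

    for t in ts:
        g1 = sigma(t) * (gamma(t) * alpha(t) - d(alpha, t) - 1.0) - 2.0 * alpha(t) * d(sigma, t)
        g2 = sigma(t) * (gamma(t) * alpha(t) - d(alpha, t) - 1.0) - alpha(t) * d(sigma, t)
        g3 = -d(phi, t)
        g4 = alpha(t) * sigma(t) ** 2 * b(t) - d(psi, t)
        if not (g1 >= -tol and g2 >= -tol and g3 >= -tol and abs(g4) <= tol):
            return False
    return True
```

Since \(\gamma (t)\alpha (t) - \dot{\alpha }(t) - 1 = \eta - 1\) for this family, \((\mathcal {G}_{1})\)–\((\mathcal {G}_{3})\) hold precisely when \(\eta \ge 1\), which matches the condition \(\eta > 1\) of Theorem 1.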
3.1 Convergence Rate of the Values
By relying on a Lyapunov analysis with the function \({{\mathcal {E}}}\), we are now ready to state our first main result.
Theorem 2
Assume that \(({{\mathcal {H}}}_{{{\mathcal {P}}}})\) and \(({{\mathcal {H}}}_{{{\mathcal {D}}}})\) hold. Suppose that the growth conditions \((\mathcal {G}_{1})\)–\((\mathcal {G}_{4})\) on the parameters \((\gamma , \alpha , \sigma , b)\) of (TRIALS) are satisfied for all \(t\ge t_0\). Let \(t\in [t_0, +\infty [ \mapsto (x(t),y(t),\lambda (t))\) be a solution trajectory of (TRIALS). Let \({{\mathcal {E}}}\) be the function defined in (9)-(10). Then, the following holds:
-
(1)
\({{\mathcal {E}}}\) is a non-increasing function, and for all \(t\ge t_0\)
$$\begin{aligned} F(x(t),y(t))-F^\star = {{\mathcal {O}}}\left( {\frac{1}{\alpha (t)^2\sigma (t)^2 b(t)}}\right) . \end{aligned}$$ -
(2)
Suppose moreover that \({{\mathscr {S}}}\), the set of saddle points of \({{\mathcal {L}}}\) in (1) is non-empty, and let \((x^\star ,y^\star ,\lambda ^\star ) \in {{\mathscr {S}}}\). Then, for all \(t\ge t_0\), the following rates and integrability properties are satisfied:
-
(i)
\( 0 \le {{\mathcal {L}}}(x(t),y(t),\lambda ^\star ) - {{\mathcal {L}}}(x^\star ,y^\star ,\lambda ^\star )={{\mathcal {O}}}\left( {\frac{1}{\alpha (t)^2\sigma (t)^2 b(t)}}\right) ; \)
-
(ii)
\( \Vert {Ax(t)+By(t)-c}\Vert ^2={{\mathcal {O}}}\left( {\frac{1}{\alpha (t)^2\sigma (t)^2 b(t)}}\right) ; \)
-
(iii)
there exist positive constants \(C_1\) and \(C_2\) such that
$$\begin{aligned} -\frac{C_1}{\alpha (t)\sigma (t) \sqrt{b(t)}} \le F(x(t),y(t))-F^\star \le \frac{C_2}{\alpha (t)^2\sigma (t)^2 b(t)}; \end{aligned}$$ -
(iv)
\( \displaystyle {\int _{t_0}^{+\infty } \alpha (t)\sigma (t)^2b(t)\Vert {Ax(t)+By(t)-c}\Vert ^2 dt <+\infty } ; \)
-
(v)
\( \displaystyle {\int _{t_0}^{+\infty } k(t) \Vert {(\dot{x}(t), \dot{y}(t), \dot{\lambda }(t))}\Vert ^2 dt <+\infty }, \) where
$$\begin{aligned} k(t)= \alpha (t)\sigma (t) \Big ({\sigma (t)\left( {\gamma (t)\alpha (t) - {\dot{\alpha }}(t) - 1}\right) - \alpha (t) {\dot{\sigma }} (t)}\Big ). \end{aligned}$$
-
(i)
Proof
To lighten notation, we drop the dependence on the time variable t. Recall that \((x^\star ,y^\star )\) is a solution of (\({{\mathcal {P}}}\)) and \(\lambda ^\star \) is an arbitrary vector in \({\mathcal {Z}}\). Let us define
With these notations, we have (recall (9) and (10))
Differentiating \({{\mathcal {E}}}\) gives
Using the constitutive equation in (TRIALS), we have
where the operator \(K_{\mu ,\alpha }: {{\mathcal {X}}}\times {{\mathcal {Y}}}\times {\mathcal {Z}}\rightarrow {{\mathcal {X}}}\times {{\mathcal {Y}}}\times {\mathcal {Z}}\) is defined by
Elementary computation gives
According to the above formulas for v, \(\dot{v}\) and \(K_{\mu ,\alpha }\), we get
Let us insert this expression in (12). We first observe that the term \(\langle \nabla {{\mathcal {F}}}_\mu (w),\,\dot{w} \rangle \) appears twice but with opposite signs, and therefore cancels out. Moreover, the coefficient of \(\langle \dot{w},w-w^\star \rangle \) becomes \(\xi + \delta {\dot{\sigma }} -\sigma (\gamma \delta -{\dot{\delta }}- \sigma )\). Thanks to the choice of \(\delta \) and \(\xi \) devised in (11), the term \(\langle \dot{w},\,w-w^\star \rangle \) also disappears. We recall that by virtue of \((\mathcal {G}_{1})\), \(\xi \) is non-negative, and thus so is the last term in \({{\mathcal {E}}}\). Overall, the formula (12) simplifies to
where
Since \((x^\star ,y^\star ) \in {{\mathcal {X}}}\times {{\mathcal {Y}}}\) is a solution of (\({{\mathcal {P}}}\)), we obviously have \(A x^\star + B y^\star =c\). Thus, \({{\mathcal {W}}}\) reduces to
Since it is difficult to control the sign of the above expression, the choice of \(\delta \) in (11) appears natural, as it entails \({{\mathcal {W}}}=0\).
On the other hand, by convexity of \({{\mathcal {L}}}(\cdot ,\cdot ,\lambda ^\star )\), strong convexity of \(\frac{\mu }{2}\Vert {\cdot - c}\Vert ^2\), the fact that \(Ax^\star +By^\star = c\) and \({{\mathcal {F}}}_\mu (w^\star )=0\), it is straightforward to see that
Collecting the above results, (13) becomes
Since \(\delta \) is non-negative (\(\sigma \) and \(\alpha \) are), and in view of \((\mathcal {G}_{2})\), the coefficient of the second term on the right-hand side of (14) is non-positive. The same conclusion holds for the coefficient of the first term, since its non-positivity is equivalent to \((\mathcal {G}_{3})\). Therefore, inequality (14) implies
The sign of \({{\mathcal {F}}}_\mu (w)\) is unknown for an arbitrary \(\lambda ^\star \). This is precisely where we invoke \((\mathcal {G}_{4})\), which is equivalent to
-
(1)
Altogether, we have shown so far that (15) eventually reads, for any \(t\ge t_0\),
$$\begin{aligned} \dfrac{d}{dt}{{\mathcal {E}}}(t) \le 0 , \end{aligned}$$

(16)

i.e. \({{\mathcal {E}}}\) is non-increasing as claimed. Let us now turn to the rates.
\({{\mathcal {E}}}\) being non-increasing entails that for all \(t\ge t_0\)
$$\begin{aligned} {{\mathcal {E}}}(t) \le {{\mathcal {E}}}(t_0) . \end{aligned}$$

(17)

Dropping the non-negative terms \(\frac{1}{2}\left\| {v(t)}\right\| ^{2}\) and \(\frac{1}{2}\xi (t)\Vert w(t)-w^\star \Vert ^2\) entering \({{\mathcal {E}}}\), and according to the definition of \({{\mathcal {L}}}_\mu \), we obtain that, for all \(t\ge t_0\)
$$\begin{aligned}&\delta (t)^2b(t)\Big ({{{\mathcal {L}}}_\mu (x(t), y(t), \lambda ^\star )- {{\mathcal {L}}}_\mu (x^\star , y^\star , \lambda ^\star )}\Big ) \nonumber \\&= \delta (t)^2b(t) \Big ({{{\mathcal {L}}}(x(t), y(t), \lambda ^\star )- {{\mathcal {L}}}(x^\star , y^\star , \lambda ^\star ) + \frac{\mu }{2}\left\| {Ax(t)+By(t) - c}\right\| ^2}\Big ) \le {{\mathcal {E}}}(t_0).\nonumber \\ \end{aligned}$$

(18)

Dropping again the quadratic term in (18), we obtain
$$\begin{aligned}&\delta (t)^2b(t) \Big ({F(x(t),y(t))-F^\star + \langle \lambda ^\star ,\,Ax(t)+By(t)-c \rangle }\Big ) \\&\quad \le \delta ^2(t_0)b(t_0)\Big (F(x(t_0),y(t_0))-F^\star + \langle \lambda ^\star ,\,Ax(t_0)+By(t_0)-c \rangle \\&\quad \quad +\frac{\mu }{2}\Vert {Ax(t_0)+By(t_0)-c}\Vert ^2\Big )+ \frac{1}{2}\Vert {v(t_0)}\Vert ^{2}\\&\quad \quad +\frac{1}{2}\xi (t_0)\Vert {(x(t_0), y(t_0), \lambda (t_0))-(x^\star , y^\star , \lambda ^\star )}\Vert ^2\\&\quad \le \delta ^2(t_0)b(t_0)\Vert {\lambda ^\star }\Vert \Vert {Ax(t_0)+By(t_0)-c}\Vert + C_0, \end{aligned}$$

where \(C_0\) is the non-negative constant
$$\begin{aligned}&C_0 = \delta ^2(t_0)b(t_0)\Big ({\left| {F(x(t_0),y(t_0))-F^\star }\right| +\frac{\mu }{2}\Vert {Ax(t_0)+By(t_0)-c}\Vert ^2}\Big ) \nonumber \\&\quad + \frac{1}{2}\Vert {v(t_0)}\Vert ^{2} + \frac{1}{2}\xi (t_0)\Vert {(x(t_0), y(t_0), \lambda (t_0))-(x^\star , y^\star , \lambda ^\star )}\Vert ^2 . \end{aligned}$$

(19)

When \(Ax(t)+By(t)-c = 0\), we are done by taking, e.g. \(\lambda ^\star = 0\) and \(C > C_0\). Assume now that \(Ax(t)+By(t)-c \ne 0\). Since \( \lambda ^\star \) can be freely chosen in \({\mathcal {Z}}\), we take it as the unit-norm vector
$$\begin{aligned} \lambda ^\star = \frac{Ax(t)+By(t)-c}{\Vert {Ax(t)+By(t)-c}\Vert } . \end{aligned}$$

(20)

We therefore obtain
$$\begin{aligned} \delta (t)^2b(t) \Big ({F(x(t),y(t))-F^\star + \Vert {Ax(t)+By(t)-c}\Vert }\Big ) \le C , \end{aligned}$$

(21)

where \(C > \delta ^2(t_0)b(t_0)\left\| {Ax(t_0)+By(t_0)-c}\right\| + C_0\). Since the second term on the left-hand side is non-negative, the claimed rate in (1) follows immediately.
-
(2)
Embarking from (18) and using (3) since \((x^\star ,y^\star ,\lambda ^\star ) \in {{\mathscr {S}}}\), we have the rates stated in (i) and (ii).
To show the lower bound in (iii), observe that the upper bound in (3) entails that
$$\begin{aligned} F(x(t), y(t)) \ge F(x^\star , y^\star ) - \langle Ax(t)+ By(t) - c,\,\lambda ^\star \rangle . \end{aligned}$$

(22)

Applying the Cauchy–Schwarz inequality, we infer
$$\begin{aligned} F(x(t), y(t)) \ge F(x^\star , y^\star ) - \Vert {\lambda ^\star }\Vert \Vert {Ax(t)+ By(t)-c}\Vert . \end{aligned}$$

We now use the estimate (ii) to conclude. Finally, the integral estimates of the feasibility (iv) and velocity (v) are obtained by integrating (14).
\(\square \)
3.2 Boundedness of the Trajectory and Rate of the Velocity
We will further exploit the Lyapunov analysis developed in the previous section to establish additional properties of the trajectory and its velocity.
Theorem 3
Suppose the assumptions of Theorem 2 hold. Assume also that \({{\mathscr {S}}}\), the set of saddle points of \({{\mathcal {L}}}\) in (1) is non-empty, and let \((x^\star , y^\star , \lambda ^\star ) \in {{\mathscr {S}}}\). Then, each solution trajectory \(t\in [t_0, +\infty [ \mapsto (x(t),y(t),\lambda (t))\) of (TRIALS) satisfies the following properties:
-
(1)
There exists a positive constant C such that for all \(t \ge t_0\)
$$\begin{aligned}&\Vert {(x(t), y(t), \lambda (t))-(x^\star , y^\star , \lambda ^\star )}\Vert ^2 \\&\quad \le \frac{C}{\sigma (t)^2\Big (\gamma (t)\alpha (t)-{\dot{\alpha }} (t)-1 \Big ) -2 \alpha (t) \sigma (t) {\dot{\sigma }}(t)} \\&\left\| {(\dot{x}(t), \dot{y}(t), \dot{\lambda }(t))}\right\| \\&\quad \le \frac{C}{\alpha (t)\sigma (t)}\left( {1+\sqrt{\frac{\sigma (t)}{\sigma (t)\left( {\gamma (t)\alpha (t)-{\dot{\alpha }} (t)-1}\right) -2 \alpha (t) \dot{\sigma }(t)}}}\right) . \end{aligned}$$ -
(2)
If \(\sup _{t \ge t_0} \sigma (t) < +\infty \) and \((\mathcal {G}_{1})\) is strengthened to
-
\((\mathcal {G}_{1}^{+})\) \(\inf _{t\ge t_0} \sigma (t)\Big ({\sigma (t)\left( {\gamma (t)\alpha (t)-{\dot{\alpha }} (t)-1}\right) - 2 \alpha (t) {\dot{\sigma }}(t)}\Big ) >0\),
then
$$\begin{aligned} \sup _{t\ge t_0} \left\| {(x(t), y(t), \lambda (t))}\right\| < +\infty \quad \text {and} \quad \left\| {(\dot{x}(t), \dot{y}(t), \dot{\lambda }(t))}\right\| = {{\mathcal {O}}}\left( {\frac{1}{\alpha (t)\sigma (t)}}\right) . \end{aligned}$$

If moreover,
-
\((\mathcal {G}_{5})\) \(\inf _{t \ge t_0} \alpha (t) > 0\),
then
$$\begin{aligned} \sup _{t\ge t_0} \left\| {(\dot{x}(t), \dot{y}(t), \dot{\lambda }(t))}\right\| < +\infty . \end{aligned}$$ -
Proof
We start from (17) in the proof of Theorem 2, which can be equivalently written
Since \((x^\star ,y^\star ,\lambda ^\star ) \in {{\mathscr {S}}}\), the first term is non-negative by (3), and thus
Choosing a positive constant \(C \ge \sqrt{2{{\mathcal {E}}}(t_0)}\), we immediately deduce that for all \(t\ge t_0\)
Set \(z(t)= (x(t), y(t), \lambda (t))-(x^\star , y^\star , \lambda ^\star )\). By definition of v(t), we have
From the triangle inequality and the bound (23), we get
According to the definition (11) of \(\delta (t)\) and \(\xi (t)\), we get
which ends the proof. \(\square \)
3.3 The Role of \(\alpha \) and Time Scaling
The time scaling parameter b enters the conditions on the parameters only via \((\mathcal {G}_{4})\), which therefore plays a central role in our analysis. Now consider relaxing \((\mathcal {G}_{4})\) to the inequality
This assumption is weaker, and in this case the corresponding term in (15) no longer vanishes. However, the resulting inequality can still be integrated to yield meaningful convergence rates. This is what we now prove.
Theorem 4
Suppose the assumptions of Theorem 2(2) hold, where condition \((\mathcal {G}_{4})\) is replaced with \((\mathcal {G}_{4}^{+})\). Let \((x^\star ,y^\star ,\lambda ^\star ) \in {{\mathscr {S}}}\ne \emptyset \). Assume also that \(\inf F(x,y) > -\infty \). Then, for all \(t \ge t_0\)
where \(C_1\) and \(C_2\) are positive constants.
Proof
We start from (15) in the proof of Theorem 2. In view of \((\mathcal {G}_{4}^{+})\) and (11), (15) becomes
Since \((x^\star ,y^\star ,\lambda ^\star ) \in {{\mathscr {S}}}\), \({{\mathcal {F}}}_\mu \) is non-negative and so is the Lyapunov function \({{\mathcal {E}}}\). Integrating (27), we obtain the existence of a positive constant C such that, for all \(t\ge t_0\)
which entails, after dropping the positive terms in \({{\mathcal {E}}}\),
Estimates (24) and (25) then follow immediately from (28) and the definition of \({{\mathcal {F}}}_\mu \).
Let us now turn to (26). Arguing as in the proof of Theorem 2(2), we have
Plugging (25) in this inequality yields the lower bound of (26).
For the upper bound, we argue as in the proof of Theorem 2(1) by considering \(\lambda ^\star \) as a free variable in \({\mathcal {Z}}\). By assumption, F is bounded from below. Together with (25), this implies that \({{\mathcal {E}}}\) is also bounded from below, and we denote by \(\underline{{{\mathcal {E}}}}\) this lower bound. Define \(\tilde{{{\mathcal {E}}}}(t) = {{\mathcal {E}}}(t) - \underline{{{\mathcal {E}}}}\) if \(\underline{{{\mathcal {E}}}}\) is negative and \(\tilde{{{\mathcal {E}}}}(t) = {{\mathcal {E}}}(t)\) otherwise. Thus, from (27), it is easy to see that \(\tilde{{{\mathcal {E}}}}\) satisfies
Integrating (29) and arguing with the sign of \(\underline{{{\mathcal {E}}}}\), we get the existence of a positive constant C such that, for all \(t\ge t_0\)
Dropping the quadratic terms in \({{\mathcal {E}}}\), this yields
When \(Ax(t)+By(t)-c = 0\), we are done by taking, e.g. \(\lambda ^\star = 0\). Assume now that \(Ax(t)+By(t)-c \ne 0\) and choose
We arrive at
which completes the proof. \(\square \)
Remark 1
Though the rates in Theorems 2 and 4 look different at first glance, it turns out that, as expected, those of Theorem 2 are actually a specialization of those in Theorem 4 when \((\mathcal {G}_{4}^{+})\) holds as an equality, i.e. when \((\mathcal {G}_{4})\) is verified. To see this, it is sufficient to note that, with the notation \(a(t) :=\alpha (t)^2 \sigma (t)^2 b(t)\), \((\mathcal {G}_{4})\) is equivalent to \(\dot{a}(t) = \frac{1}{\alpha (t)}a(t)\). Upon integration, we obtain \(a(t) = a(t_0)\exp \left( \int _{t_0}^t\frac{1}{\alpha (s)} ds\right) \), or equivalently
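As a sanity check, the equivalence between \((\mathcal {G}_{4})\) and \(\dot{a} = a/\alpha \), together with the exponential closed form of its solution, can be verified symbolically. The two schedules below (a constant \(\alpha \) and \(\alpha (t) = t\)) are purely illustrative choices, not the only ones covered by the analysis.

```python
# Symbolic sanity check of Remark 1: with a(t) := alpha(t)^2 sigma(t)^2 b(t),
# condition (G4) reads da/dt = a(t)/alpha(t), whose solution is
# a(t) = a(t0) * exp( int_{t0}^t ds / alpha(s) ).
import sympy as sp

t, t0, s = sp.symbols("t t0 s", positive=True)

# Case alpha(s) = s (linear growth): a(t)/a(t0) = exp(log t - log t0) = t/t0
a_lin = sp.exp(sp.integrate(1 / s, (s, t0, t)))
assert sp.simplify(sp.diff(a_lin, t) - a_lin / t) == 0

# Case alpha constant: a(t)/a(t0) = exp((t - t0)/alpha0)
alpha0 = sp.Symbol("alpha0", positive=True)
a_cst = sp.exp(sp.integrate(1 / alpha0, (s, t0, t)))
assert sp.simplify(sp.diff(a_cst, t) - a_cst / alpha0) == 0
```

In the linear case the check recovers \(a(t) = a(t_0)\, t/t_0\), which is exactly the growth that underlies the \(b(t) = t^{1/\alpha _0 - 2}\) scaling appearing later in Sect. 6.2.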
4 Well-Posedness of (TRIALS)
In this section, we will show existence and uniqueness of a strong global solution to the Cauchy problem associated with (TRIALS). The main idea is to formulate (TRIALS) in the phase space as a non-autonomous first-order system. In the smooth case, we will invoke the non-autonomous Cauchy–Lipschitz theorem [35, Proposition 6.2.1]. In the non-smooth case, we will use a standard Moreau–Yosida smoothing argument.
4.1 Case of Globally Lipschitz Continuous Gradients
We consider first the case where the gradients of f and g are globally Lipschitz continuous over \({{\mathcal {X}}}\) and \({{\mathcal {Y}}}\). Let us start by recalling the notion of strong solution.
Definition 1
Denote \({{\mathcal {H}}}:={{\mathcal {X}}}\times {{\mathcal {Y}}}\times {\mathcal {Z}}\) equipped with the corresponding product space structure, and \(w: t \in [t_0,+\infty [ \mapsto (x(t),y(t),\lambda (t)) \in {{\mathcal {H}}}\). The function w is a strong global solution of the dynamical system (TRIALS) if it satisfies the following properties:
- w is in \({{\mathcal {C}}}^1([t_0,+\infty [;{{\mathcal {H}}})\);
- w and \(\dot{w}\) are absolutely continuous on every compact subset of the interior of \([t_0,+\infty [\) (hence almost everywhere differentiable);
- for almost all \(t \in [t_0,+\infty [\), (TRIALS) holds with \(w(t_0) = (x_0,y_0,\lambda _0)\) and \(\dot{w}(t_0) = (u_0,v_0,\nu _0)\).
Theorem 5
Suppose that \(({{\mathcal {H}}}_{{{\mathcal {P}}}})\) holds (see Footnote 1) and, moreover, that \(\nabla f\) and \(\nabla g\) are Lipschitz continuous, respectively, over \({{\mathcal {X}}}\) and \({{\mathcal {Y}}}\). Assume that \(\gamma , \, \alpha , \, b: [t_0, +\infty [ \rightarrow {{\mathbb {R}}}^+\) are non-negative continuous functions. Then, for any given initial condition \((x(t_0), \dot{x}(t_0))=(x_0,\dot{x}_0)\in {{\mathcal {X}}}\times {{\mathcal {X}}}\), \((y(t_0), \dot{y}(t_0))=(y_0,\dot{y}_0)\in {{\mathcal {Y}}}\times {{\mathcal {Y}}}\), \((\lambda (t_0), \dot{\lambda }(t_0))=(\lambda _0,{\dot{\lambda }}_0)\in {\mathcal {Z}}\times {\mathcal {Z}}\), the evolution system (TRIALS) has a unique strong global solution.
Proof
Recall the notations of Definition 1. Let \(I = [t_0,+\infty [\) and let \(Z: t \in I \mapsto (w(t),\dot{w}(t)) \in {{\mathcal {H}}}^2\). (TRIALS) can be equivalently written as the Cauchy problem on \({{\mathcal {H}}}^2\)
where \(Z_0=(x_0,y_0,\lambda _0,u_0,v_0,\nu _0)\), and \(G: I \times {{\mathcal {H}}}^2 \rightarrow {{\mathcal {H}}}^2\) is the operator
To invoke [35, Proposition 6.2.1], it is sufficient to check that for a.e. \(t \in I\), \(G(t,\cdot )\) is \(\beta (t)\)-Lipschitz continuous with \(\beta (\cdot ) \in L^1_{loc}(I)\), and for a.e. \(t \in I\), \(\Vert {G(t,Z)}\Vert \le P(t)\left( {1+\Vert {Z}\Vert }\right) \), \(\forall Z \in {{\mathcal {H}}}^2\), with \(P(\cdot ) \in L^1_{loc}(I)\). Since \(\nabla f\), \(\nabla g\) are globally Lipschitz continuous, and A and B are bounded linear, an elementary computation shows that there exists a constant \(C > 0\) such that
Owing to the continuity of the parameters \(\gamma (\cdot )\), \(\alpha (\cdot )\), \(b(\cdot )\), the function \(\beta (\cdot )\) is integrable on \([t_0,T]\) for all \(t_0< T < +\infty \). A similar calculation shows that
and we conclude similarly. It then follows from [35, Proposition 6.2.1] that there exists a unique global solution \(Z(\cdot ) \in W^{1,1}_{loc}(I;{{\mathcal {H}}}^2)\) of (30) satisfying the initial condition \(Z(t_0)=Z_0\), and thus, by [30, Corollary A.2] that \(Z(\cdot )\) is a strong global solution to (30). This in turn leads to the existence and uniqueness of a strong solution \((x(\cdot ),y(\cdot ),\lambda (\cdot ))\) of (TRIALS). \(\square \)
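The phase-space reduction at the heart of this proof is easy to reproduce numerically. The sketch below is a toy stand-in, not (TRIALS) itself: it rewrites a simplified damped second-order system \(\ddot{w} + \gamma (t)\dot{w} + b(t)\nabla f(w) = 0\), with hypothetical coefficient choices, as a first-order Cauchy problem \(\dot{Z} = G(t,Z)\) on the phase space and integrates it with an off-the-shelf solver.

```python
# Sketch of the phase-space reduction Z = (w, dw/dt): a simplified damped
# second-order system w'' + gamma(t) w' + b(t) grad f(w) = 0 (a toy stand-in
# for (TRIALS), not the full dynamics) becomes dZ/dt = G(t, Z).
import numpy as np
from scipy.integrate import solve_ivp

def gamma(t): return 3.0 / t           # illustrative vanishing viscous damping
def b(t):     return 1.0               # illustrative (trivial) time scaling
def grad_f(w): return w                # toy objective f(w) = ||w||^2 / 2

def G(t, Z):
    w, v = Z[:2], Z[2:]                # positions and velocities
    return np.concatenate([v, -gamma(t) * v - b(t) * grad_f(w)])

Z0 = np.array([1.0, -0.5, 0.0, 0.0])   # (w(t0), w'(t0)) at t0 = 1
sol = solve_ivp(G, (1.0, 50.0), Z0, rtol=1e-8, atol=1e-10)
final_dist = np.linalg.norm(sol.y[:2, -1])  # distance to the minimizer 0
```

With the vanishing damping \(\gamma (t) = 3/t\), the toy trajectory approaches the minimizer, mirroring the Nesterov-type behaviour recalled in the introduction.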
Remark 2
One sees from the proof that for the above result to hold, it is only sufficient to assume that the parameters \(\gamma \), \(\alpha \), b are locally integrable instead of continuous. In addition, in the above results, we even have existence and uniqueness of a classical solution.
4.2 Case of Locally Lipschitz Continuous Gradients
Under local Lipschitz continuity assumptions on the gradients \(\nabla f\) and \(\nabla g\), the operator \(Z \mapsto G(t,Z)\) defined in (31) is only Lipschitz continuous over the bounded subsets of \({{\mathcal {H}}}^2\). As a consequence, the Cauchy–Lipschitz theorem provides the existence and uniqueness of a local solution. To pass from a local solution to a global solution, we will rely on the estimates established in Theorem 3.
Theorem 6
Suppose that \(({{\mathcal {H}}}_{{{\mathcal {P}}}})\) holds (see Footnote 2) and, moreover, that \(\nabla f\) and \(\nabla g\) are Lipschitz continuous over the bounded subsets of, respectively, \({{\mathcal {X}}}\) and \({{\mathcal {Y}}}\). Assume that \(\gamma , \, \alpha , \, b: [t_0, +\infty [ \rightarrow {{\mathbb {R}}}^+\) are non-negative continuous functions such that the conditions \((\mathcal {G}_{1}^{+})\), \((\mathcal {G}_{2})\), \((\mathcal {G}_{3})\), \((\mathcal {G}_{4})\) and \((\mathcal {G}_{5})\) are satisfied, and that \(\sup _{t \ge t_0} \sigma (t) < +\infty \). Then, for any initial condition \((x(t_0), \dot{x}(t_0))=(x_0,\dot{x}_0)\in {{\mathcal {X}}}\times {{\mathcal {X}}}\), \((y(t_0), \dot{y}(t_0))=(y_0,\dot{y}_0)\in {{\mathcal {Y}}}\times {{\mathcal {Y}}}\), \((\lambda (t_0), \dot{\lambda }(t_0))=(\lambda _0,{\dot{\lambda }}_0)\in {\mathcal {Z}}\times {\mathcal {Z}}\), the evolution system (TRIALS) has a unique strong global solution.
Proof
We use the same notation as in the proof of Theorem 5. Let us consider the maximal solution of the Cauchy problem (30), say \(Z : [t_0, T[ \rightarrow {{\mathcal {H}}}^2\). We have to prove that \(T=+\infty \). Following a classical argument, we argue by contradiction, and suppose that \(T < +\infty \). It is then sufficient to prove that the limit of Z(t) exists as \(t \rightarrow T\), so that it will be possible to extend Z locally to the right of T, thus getting a contradiction. According to the Cauchy criterion, and the constitutive equation \(\dot{Z}(t) = G(t,Z(t))\), it is sufficient to prove that Z(t) is bounded over \([t_0,T[\). At this point, we use the estimates provided by Theorem 3, which gives precisely this result under the conditions imposed on the parameters. \(\square \)
4.3 The Non-Smooth Case
For a large number of applications (e.g. data processing, machine learning, statistics), non-smooth functions are ubiquitous. To cover these practical situations, we need to consider the case where the functions f and g are non-smooth. In order to adapt the dynamic (TRIALS) to this non-smooth situation, we will consider the corresponding differential inclusion
where \(\partial f\) and \(\partial g\) are the subdifferentials of f and g, respectively. Beyond global existence issues that we will address shortly, one may wonder whether our Lyapunov analysis of the previous sections is still valid in this case. The answer is affirmative, provided one takes some care with two steps that are central in our analysis. First, when taking the time derivative of the Lyapunov function, one now has to invoke the (generalized) chain rule for derivatives over curves (see [30]). The second ingredient is the validity of the subdifferential inequality for convex functions. In turn, all the results and estimates presented in the previous sections carry over to this more general non-smooth context. Indeed, the approximation scheme that we present shortly turns out to be monotonically increasing. This gives a variational convergence (epi-convergence) which allows one to simply pass to the limit in the estimates established in the smooth case.
Let us now turn to the existence of a global solution to (32). We will again consider strong solutions to this problem, i.e. solutions that are \({{\mathcal {C}}}^1([t_0,+\infty [;{{\mathcal {H}}})\), locally absolutely continuous, and (32) holds almost everywhere on \([t_0,+\infty [\). A natural idea is to use the Moreau–Yosida regularization in order to bring the problem to the smooth case before passing to an appropriate limit. Recall that, for any \(\theta > 0\), the Moreau envelopes \(f_{\theta }\) and \(g_{\theta }\) of f and g are defined, respectively, by
As a classical result, \(f_{\theta }\) and \(g_{\theta }\) are continuously differentiable and their gradients are \(\frac{1}{\theta }\)-Lipschitz continuous. We are then led to consider, for each \(\theta >0\), the dynamical system
The system (33) falls within the scope of our previous study, which provides existence and uniqueness of a strong global solution. In doing so, we generate a family \((x_{\theta },y_{\theta },\lambda _{\theta })_{\theta }\) of trajectories indexed by \(\theta \).
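To make the smoothing concrete, here is a small sketch for the scalar function \(f = |\cdot |\) (an illustrative choice): its Moreau envelope is the Huber function, computed through the proximal mapping (soft-thresholding), its gradient is \(\frac{1}{\theta }\)-Lipschitz as recalled above, and \(f_{\theta }\) increases monotonically to f as \(\theta \rightarrow 0^+\).

```python
# Moreau envelope of f = |.| via its proximal mapping:
#   f_theta(x) = min_u |u| + (x - u)^2 / (2 theta),
#   grad f_theta(x) = (x - prox_{theta f}(x)) / theta   (1/theta-Lipschitz).
import numpy as np

def prox_abs(x, theta):
    # prox of theta * |.| : soft-thresholding
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def envelope(x, theta):
    u = prox_abs(x, theta)
    return np.abs(u) + (x - u) ** 2 / (2.0 * theta)

def grad_envelope(x, theta):
    return (x - prox_abs(x, theta)) / theta

x = np.linspace(-3.0, 3.0, 601)
# f_theta increases to f = |x| as theta decreases to 0 (monotone convergence)
monotone = np.all(envelope(x, 0.5) <= envelope(x, 0.1) + 1e-12)
gap_small_theta = np.max(np.abs(envelope(x, 1e-4) - np.abs(x)))
# the gradient saturates at +-1 (Huber gradient) with slope at most 1/theta
lip_const = np.max(np.abs(np.diff(grad_envelope(x, 0.5)) / np.diff(x)))
```

This monotone (increasing) convergence of \(f_{\theta }\) to f is exactly the property invoked above to pass to the limit over the estimates of the smooth case.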
The challenging question is now to pass to the limit in the system above as \( \theta \rightarrow 0^+\). This is a non-trivial problem, and to answer it, we have to assume that the spaces \({{\mathcal {X}}}\), \({{\mathcal {Y}}}\) and \({\mathcal {Z}}\) are finite dimensional, and that f and g are convex real-valued (i.e. \(\mathrm {dom}(f) = {{\mathcal {X}}}\) and \(\mathrm {dom}(g)={{\mathcal {Y}}}\) in which case f and g are continuous). Recall that \(\partial F(x,y) = \partial f(x) \times \partial g(y)\), and denote \(\left[ {\partial F(x,y)}\right] ^0\) the minimal norm selection of \(\partial F(x,y)\).
Theorem 7
Suppose that \({{\mathcal {X}}}\), \({{\mathcal {Y}}}\), \({\mathcal {Z}}\) are finite dimensional Hilbert spaces, and that the functions \(f: {{\mathcal {X}}}\rightarrow {{\mathbb {R}}}\) and \(g: {{\mathcal {Y}}}\rightarrow {{\mathbb {R}}}\) are convex. Assume that
- (i) F is coercive on the affine feasibility set;
- (ii) \(\beta _F :=\sup _{(x,y) \in {{\mathcal {X}}}\times {{\mathcal {Y}}}} \Vert {\left[ {\partial F(x,y)}\right] ^0}\Vert < +\infty \);
- (iii) the linear operator \(L = [A ~ B]\) is surjective.
Suppose also that \(\gamma , \, \alpha , \, b: [t_0, +\infty [ \rightarrow {{\mathbb {R}}}^+\) are non-negative continuous functions such that the conditions \((\mathcal {G}_{1}^{+})\), \((\mathcal {G}_{2})\), \((\mathcal {G}_{3})\), \((\mathcal {G}_{4})\) and \((\mathcal {G}_{5})\) are satisfied, and that \(\sup _{t \ge t_0} \sigma (t) < +\infty \). Then, for any initial condition \((x(t_0), \dot{x}(t_0))=(x_0,\dot{x}_0)\in {{\mathcal {X}}}\times {{\mathcal {X}}}\), \((y(t_0), \dot{y}(t_0))=(y_0,\dot{y}_0)\in {{\mathcal {Y}}}\times {{\mathcal {Y}}}\), \((\lambda (t_0), \dot{\lambda }(t_0))=(\lambda _0,{\dot{\lambda }}_0)\in {\mathcal {Z}}\times {\mathcal {Z}}\), the evolution system (32) admits a strong global solution.
Condition (i) is natural and ensures, for instance, that the solution set of (\({{\mathcal {P}}}\)) is non-empty. Condition (iii) is also very mild. A simple case where (ii) holds is when f and g are Lipschitz continuous.
Proof
The key property is that the estimates obtained in Theorems 2 and 3, when applied to (33), have a favourable dependence on \(\theta \). Indeed, a careful examination of the estimates shows that \(\theta \) enters them through the Lyapunov function at \(t_0\) only via \(|F_{\theta }(x_0,y_0) - F_{\theta }(x_{\theta }^\star ,y_{\theta }^\star )|\) and \(\Vert {(x_{\theta }^\star ,y_{\theta }^\star ,\lambda _{\theta }^\star )}\Vert \), where \(F_{\theta }(x,y)=f_{\theta }(x)+g_{\theta }(y)\), \((x_{\theta }^\star ,y_{\theta }^\star ) \in {{\,\mathrm{argmin}\,}}_{Ax+By=c} F_{\theta }(x,y)\) and \(\lambda _{\theta }^\star \) is an associated dual multiplier; see (19). With standard properties of the Moreau envelope, see [4, Chapter 3] and [24, Chapter 12], one can show that for all \((x,y) \in {{\mathcal {X}}}\times {{\mathcal {Y}}}\)
This together with the fact that \((x_{\theta }^\star ,y_{\theta }^\star ) \in {{\,\mathrm{argmin}\,}}_{Ax+By=c} F_{\theta }(x,y)\) and \((x^\star ,y^\star ) \in {{\,\mathrm{argmin}\,}}_{Ax+By=c} F(x,y)\) yields
Thus,
This entails, owing to (ii), that
and thus, since we are interested in the limit as \(\theta \rightarrow 0^+\),
On the other hand,
Thus, in view of (i), \(\exists a > 0\) and \(b \in {{\mathbb {R}}}\) such that
which shows that
Let us turn to \(\lambda _{\theta }^\star \). When \(\lambda _{\theta }^\star \) is chosen as in (20), then we are done. When \(\lambda _{\theta }^\star \) is the optimal dual multiplier satisfying (4), then it is a solution to the Fenchel–Rockafellar dual problem
where \(F_\theta ^*\) is the Legendre–Fenchel conjugate of \(F_\theta \). Without loss of generality, we assume \(c = 0\). Classical conjugacy results give
Since f and g are convex and real-valued, the domain of F is full. This is equivalent to coercivity of \(F^*\). This together with injectivity of \(L^*\) (see (iii)) implies that there exists \(a > 0\) and \(b \in {{\mathbb {R}}}\) (potentially different from those above) such that
Altogether, this shows that
Combining the above with Theorem 3, we conclude that for all \(T > t_0\), the trajectories \((x_{\theta }(\cdot ),y_{\theta }(\cdot ),\lambda _{\theta }(\cdot ))\) and the velocities \((\dot{x}_{\theta }(\cdot ),\dot{y}_{\theta }(\cdot ),\dot{\lambda }_{\theta }(\cdot ))\) are bounded in \(L^2(t_0,T;{{\mathcal {X}}}\times {{\mathcal {Y}}}\times {\mathcal {Z}})\) uniformly in \(\theta \). Since \({{\mathcal {X}}}\) and \({{\mathcal {Y}}}\) are finite-dimensional spaces, we deduce by the Ascoli–Arzelà theorem that the trajectories are relatively compact for the uniform convergence over bounded time intervals. By properties of the Moreau envelope, we also have, for all \((x,y) \in {{\mathcal {X}}}\times {{\mathcal {Y}}}\),
and thus
Using this and the boundedness assertions of the trajectories and velocities proved above in the constitutive equations (33), the acceleration remains also bounded on the bounded time intervals. Passing to the limit as \(\theta \rightarrow 0^+\) in (33) is therefore relevant by a classical maximal monotonicity argument. Indeed, we work with the canonical extension of the maximally monotone operators \(\nabla F_\theta \) and \(\partial F\) to \(L^2(t_0,T, {{\mathcal {X}}}\times {{\mathcal {Y}}})\), and, in this functional setting, we use that \(\nabla F_\theta \) graph converges to \(\partial F\) in the strong-weak topology. \(\square \)
We conclude this section by noting that at this stage, uniqueness of the solution to (32) is a difficult open problem. In fact, even existence in infinite dimension and/or with any proper lower semicontinuous convex functions f and g is not clear. This goes far beyond the scope of the present paper, and we leave it to a future work.
5 The Uniformly Convex Case
We now turn to examine the convergence properties of the trajectories generated by (TRIALS), when the objective F in (\({{\mathcal {P}}}\)) is uniformly convex on bounded sets. Recall, see e.g. [24], that \(F: {{\mathcal {X}}}\times {{\mathcal {Y}}}\rightarrow {{\mathbb {R}}}\) is uniformly convex on bounded sets if, for each \(r > 0\), there is an increasing function \(\psi _r: [0,+\infty [ \rightarrow [0,+\infty [\) vanishing only at the origin, such that
for all \((v,w) \in ({{\mathcal {X}}}\times {{\mathcal {Y}}})^2\) such that \(\Vert {v}\Vert \le r\) and \(\Vert {w}\Vert \le r\). The strongly convex case corresponds to \(\psi _r(t) = c_F t^2/2\) for some \(c_F > 0\). In finite dimension, strict convexity of F entails uniform convexity on any non-empty bounded closed convex subset of \({{\mathcal {X}}}\times {{\mathcal {Y}}}\), see [24, Corollary 10.18].
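For concreteness, in the strongly convex case \(\psi _r(t) = c_F t^2/2\), the way the uniform convexity inequality is exploited in the proof of Theorem 8 below can be sketched as follows. The display is an illustration assuming the standard subgradient form of uniform convexity, not a restatement of (34):

```latex
% Strongly convex case: combining uniform convexity at v = (x(t), y(t)) and
% w = (x^\star, y^\star) with the saddle-point optimality conditions gives
\frac{c_F}{2}\,\bigl\Vert (x(t),y(t)) - (x^\star ,y^\star )\bigr\Vert ^2
\;\le\;
{\mathcal {L}}\bigl(x(t),y(t),\lambda ^\star \bigr)
- {\mathcal {L}}\bigl(x^\star ,y^\star ,\lambda ^\star \bigr).
```

Any decay rate established for the Lagrangian gap on the right-hand side thus transfers directly to the squared distance to the minimizer, which is the mechanism behind the strong convergence statement of Theorem 8.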
Theorem 8
Suppose that F is uniformly convex on bounded sets, and let \((x^\star , y^\star )\) be the unique solution of the minimization problem (\({{\mathcal {P}}}\)). Assume also that \({{\mathscr {S}}}\), the set of saddle points of \({{\mathcal {L}}}\) in (1) is non-empty. Suppose that the conditions \((\mathcal {G}_{1}^{+})\)–\((\mathcal {G}_{4})\) on the coefficients of (TRIALS) are satisfied for all \(t \ge t_0\). Then, each solution trajectory \(t\in [t_0, +\infty [ \mapsto (x(t),y(t),\lambda (t))\) of (TRIALS) satisfies, \(\forall t \ge t_0\),
As a consequence, assuming that \(\lim _{t\rightarrow +\infty } \alpha (t)^2 \sigma (t)^2 b(t) = +\infty \), we have that the trajectory \(t\mapsto (x(t), y(t))\) converges strongly to \((x^\star , y^\star )\) as \(t\rightarrow +\infty \).
Proof
Uniformly convex functions are strictly convex and coercive, and thus \((x^\star ,y^\star )\) is unique. From Theorem 3(2), there exists \(r_1 > 0\) such that
Taking \(r \ge r_1 + \left\| {(x^\star ,y^\star )}\right\| \), we have that the trajectories \((x(\cdot ),y(\cdot ))\) and \((x^\star ,y^\star )\) are both contained in the ball of radius r centred at the origin. Let \(\lambda ^\star \) be a Lagrange multiplier of problem (\({{\mathcal {P}}}\)), i.e. \((x^\star ,y^\star ,\lambda ^\star ) \in {{\mathscr {S}}}\). On the one hand, applying the uniform convexity inequality (34) at \(v=(x(t),y(t))\) and \(w=(x^\star ,y^\star )\), we have
On the other hand, the optimality conditions (4) tell us that
and obviously
Thus,
Invoking the estimate in Theorem 2(2)(i) yields the claim. \(\square \)
Remark 3
The assumption \(\lim _{t\rightarrow +\infty } \alpha (t)^2 \sigma (t)^2 b(t) = +\infty \) made in the above theorem is very mild. It holds, in particular, in all the situations discussed in Sect. 6. For instance, for \(\alpha (t) = t^r\), \(0\le r<1\), \(\sigma \) constant, and \(b(t)= \frac{1}{t^{2r}} \exp \Big ({\frac{1}{1-r}t^{1-r}}\Big )\), one has \(\alpha (t)\sqrt{b(t)}= \exp \Big ({\frac{1}{2(1-r)}t^{1-r}}\Big )\). Thus, if F is strongly convex, the trajectory \(t \mapsto (x(t),y(t))\) converges exponentially fast to the unique minimizer \((x^\star ,y^\star )\).
6 Parameters Choice for Fast Convergence Rates
In this section, we suppose that the solution set \({{\mathscr {S}}}\) of the saddle value problem (4) is non-empty, so as to invoke Theorem 2(2), Theorem 3 and Theorem 4. The set of conditions \((\mathcal {G}_{1}^{+})\), \((\mathcal {G}_{2})\), \((\mathcal {G}_{3})\), \((\mathcal {G}_{4})\) and \((\mathcal {G}_{5})\) imposes sufficient assumptions on the coefficients \(\gamma , \, \alpha , \, b\) of the dynamical system (TRIALS), and on the coefficients \(\sigma , \delta , \xi \) of the function \({{\mathcal {E}}}\) defined in (9), which guarantee that \({{\mathcal {E}}}\) is a Lyapunov function for the dynamical system (TRIALS).
Let us show that this system admits many solutions of practical interest which in turn will entail fast convergence rates. For this, we will organize our discussion around the coefficient \(\alpha \) as dictated by Theorem 4. Indeed, the latter shows that the convergence rate of the Lagrangian values and feasibility is \(\displaystyle {{{\mathcal {O}}}\left( {\exp \left( {-\int _{t_0}^t\frac{1}{\alpha (s)} ds}\right) }\right) }\). Therefore, to obtain a meaningful convergence result, we need to assume that
This means that the critical growth is \(\alpha (t) = a t\) for \(a > 0\). If \(\alpha (t)\) grows faster, our analysis does not provide an instructive convergence rate. It is thus an essential ingredient of our approach that \(\alpha (t)\) remains positive, but not too large, as \(t\rightarrow +\infty \). In fact, the set of conditions \((\mathcal {G}_{1}^{+})\), \((\mathcal {G}_{2})\), \((\mathcal {G}_{3})\), \((\mathcal {G}_{4})\) and \((\mathcal {G}_{5})\) simplifies considerably by taking \(\sigma \) to be a positive constant and \(\gamma \alpha -{\dot{\alpha }}\) a constant strictly greater than one. This is made precise in the following statement, whose proof is immediate.
Corollary 1
Suppose that \(\sigma \equiv \sigma _0\) is a positive constant, and \(\gamma \alpha -{\dot{\alpha }} \equiv \eta >1\). Then, the set of conditions \((\mathcal {G}_{1}^{+})\), \((\mathcal {G}_{2})\), \((\mathcal {G}_{3})\), \((\mathcal {G}_{4})\) and \((\mathcal {G}_{5})\) reduces to
Following the above discussion, we are led to consider the following three cases.
6.1 Constant Parameter \(\alpha \)
Consider the simple situation where \(\sigma \equiv \sigma _0 > 0\) and \(\eta > 1\) in which case \((\mathcal {G}_{1}^{+})\) reads \(\gamma \alpha - \dot{\alpha } - 1 = \eta - 1 > 0\). Taking \(\alpha \equiv \alpha _0\), a positive constant, yields \(\gamma \equiv \frac{\eta }{\alpha _0}\) and (35) amounts to solving
that is, \(b(t)= \exp \left( {{\frac{t}{\alpha _0}}}\right) \). Capitalizing on Corollary 1, and specializing the results of Theorem 2(2) (or equivalently Theorem 4 according to Remark 1) and Theorem 3 to the current choice of parameters yields the following statement.
Proposition 1
Suppose that \(\sigma \equiv \sigma _0 > 0\), \(\eta > 1\), and that the coefficients of (TRIALS) satisfy: the functions \(\alpha , \gamma \) are constant with
Suppose that \({{\mathscr {S}}}\) is non-empty. Then, for any solution trajectory \((x(\cdot ),y(\cdot ),\lambda (\cdot ))\) of (TRIALS), the trajectory and its velocity remain bounded, and we have
where \(C_1\) and \(C_2\) are positive constants.
6.2 Linearly Increasing Parameter \(\alpha \)
We now take \(\sigma \equiv \sigma _0 > 0\), \(\eta > 1\) and \(\alpha (t)= \alpha _{0} t\) with \(\alpha _{0} > 0\). Then, \((\mathcal {G}_{1}^{+})\) is satisfied, and we have \(\gamma (t) = \frac{\eta +\alpha _0}{\alpha _0 t}\). Condition (35) then becomes
which admits \(b(t)= t^{\frac{1}{\alpha _0}-2}\) as a solution. We then have \(b \equiv 1\) for \(\alpha _0=1/2\), while one can distinguish two regimes for its limiting behaviour with
In view of Theorem 2(2) and Theorem 3, we obtain the following result.
Proposition 2
Suppose that \(\sigma \equiv \sigma _0 > 0\), \(\eta > 1\), and that the coefficients of (TRIALS) satisfy
Suppose that \({{\mathscr {S}}}\) is non-empty. Then, for any solution trajectory \((x(\cdot ),y(\cdot ),\lambda (\cdot ))\) of (TRIALS), the trajectory remains bounded, and we have the following convergence rates:
where \(C_1\) and \(C_2\) are positive constants.
6.3 Power-type Parameter \(\alpha \)
Let us now take \(\sigma \equiv \sigma _0 > 0\), \(\eta > 1\) and consider the intermediate case between the two previous situations, where \(\alpha (t) = t^r\), \(0<r<1\). Thus, \((\mathcal {G}_{1}^{+})\) is satisfied and we have \(\gamma (t) = \frac{\eta }{t^r} + \frac{r}{t}\). Condition (35) is then equivalent to
which, after integration, shows that \(b(t)= \frac{1}{t^{2r}} \exp \left( { \frac{1}{1-r}t^{1-r}}\right) \) is a solution. Appealing again to Theorem 2(2) and Theorem 3, we obtain the following claim.
Proposition 3
Take \(\sigma \equiv \sigma _0 > 0\) and \(\eta > 1\). Suppose that the coefficients of (TRIALS) satisfy
Suppose that \({{\mathscr {S}}}\) is non-empty. Then, for any solution trajectory \((x(\cdot ),y(\cdot ),\lambda (\cdot ))\) of (TRIALS), the trajectory remains bounded, and we have the convergence rates:
where \(C_1\) and \(C_2\) are positive constants.
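The three parameter schedules of Propositions 1–3 can be checked numerically against condition \((\mathcal {G}_{4})\), written as \(\dot{a} = a/\alpha \) with \(a = \alpha ^2\sigma _0^2 b\) (cf. Remark 1). The particular constants below (\(\alpha _0 = 2\), \(\alpha _0 = 1/2\), \(r = 1/2\)) are illustrative choices.

```python
# Finite-difference check that the schedules of Sects. 6.1-6.3 satisfy (G4),
# i.e. da/dt = a/alpha with a(t) = alpha(t)^2 * sigma0^2 * b(t).
import numpy as np

sigma0 = 1.0
t = np.linspace(2.0, 10.0, 100001)

cases = {
    # constant alpha (Sect. 6.1): alpha0 = 2, b(t) = exp(t / alpha0)
    "constant": (lambda t: 2.0 + 0.0 * t, lambda t: np.exp(t / 2.0)),
    # linear alpha (Sect. 6.2): alpha0 = 1/2, b(t) = t^(1/alpha0 - 2) = 1
    "linear":   (lambda t: 0.5 * t,       lambda t: t ** 0.0),
    # power alpha (Sect. 6.3): r = 1/2, b(t) = t^(-2r) exp(t^(1-r)/(1-r))
    "power":    (lambda t: np.sqrt(t),
                 lambda t: t ** -1.0 * np.exp(2.0 * np.sqrt(t))),
}

errors = {}
for name, (alpha, b) in cases.items():
    a = alpha(t) ** 2 * sigma0 ** 2 * b(t)
    da = np.gradient(a, t)                  # numerical derivative of a
    errors[name] = np.max(np.abs(da - a / alpha(t)) / np.abs(a / alpha(t)))
```

In all three cases the relative residual of \(\dot{a} - a/\alpha \) is at the level of the finite-difference error, confirming that each pair \((\alpha , b)\) solves the scaling equation exactly.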
6.4 Numerical Experiments
To support our theoretical claims, we consider in this section two numerical examples with \({{\mathcal {X}}}= {{\mathcal {Y}}}= {\mathcal {Z}}= {{\mathbb {R}}}^2\): one with a strongly convex objective F and one where F is convex but not strongly so.
Example 1
We consider the quadratic programming problem
whose objective is strongly convex and verifies all required assumptions.
Example 2
We consider the minimization problem
The objective is convex (but not strongly so) and smooth, as required. This problem is reminiscent of (regularized) logistic regression, which is very popular in machine learning.
In all our numerical experiments, we consider the continuous time dynamical system (TRIALS), solved numerically with a Runge–Kutta adaptive method (ode45 in MATLAB) on the time interval [1, 20].
For the solely convex (resp. strongly convex) objective, Figure 1 (resp. Figure 2) displays the objective error \(\left| {F(x(t),y(t))-F^\star }\right| \) on the left, the feasibility gap \(\Vert {Ax(t)+By(t)-c}\Vert \) in the middle, and the velocity \(\Vert {(\dot{x}(t), \dot{y}(t), \dot{\lambda }(t))}\Vert \) on the right. In each figure, the first row shows the results for \(\alpha (t) \equiv \alpha _0\) with \(\alpha _0 \in \left\{ {1,2,4}\right\} \), the second row corresponds to \(\alpha (t)=\alpha _0 t\) with \(\alpha _0 \in \left\{ {0.25,0.5,1}\right\} \) and the third row to \(\alpha (t)=t^r\) with \(r \in \left\{ {0.01,0.1,0.5}\right\} \). In all our experiments, we set \(\mu =10\) (recall that \(\mu \) is the parameter associated with the augmented Lagrangian formulation). All these choices of the parameters comply with the requirements of Propositions 1, 2 and 3. The numerical results are in excellent agreement with the theory: the values, the velocities and the feasibility gap all converge at the predicted rates.
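For readers without MATLAB, an analogous experiment can be run with SciPy's RK45 integrator (the counterpart of ode45). The sketch below monitors the same quantities (objective error and feasibility gap) on a toy equality-constrained quadratic, but along a plain first-order augmented-Lagrangian saddle-point flow, a simplified stand-in for (TRIALS), integrated over a longer horizon than [1, 20] since this non-inertial flow converges more slowly.

```python
# Toy counterpart of the experiments: monitor objective error and feasibility
# gap along a trajectory. The dynamics is a first-order augmented-Lagrangian
# saddle-point flow (NOT (TRIALS) itself) on
#   min 0.5 x^2 + 0.5 y^2   s.t.   x + y = 1,   solution (x*, y*) = (1/2, 1/2).
import numpy as np
from scipy.integrate import solve_ivp

mu = 10.0  # augmented Lagrangian parameter, as in the experiments

def flow(t, z):
    x, y, lam = z
    r = x + y - 1.0                       # feasibility residual Ax + By - c
    return [-(x + lam + mu * r),          # -grad_x L_mu
            -(y + lam + mu * r),          # -grad_y L_mu
            r]                            # +grad_lam L_mu (dual ascent)

sol = solve_ivp(flow, (1.0, 200.0), [0.0, 0.0, 0.0],
                method="RK45", rtol=1e-8, atol=1e-10)
x, y, _ = sol.y[:, -1]
obj_err = abs(0.5 * x**2 + 0.5 * y**2 - 0.25)   # F(x*, y*) = 1/4
feas_gap = abs(x + y - 1.0)
```

Both monitored quantities decay to zero along the trajectory; the inertial dynamics (TRIALS) with the schedules of Propositions 1–3 accelerates this decay to the rates reported in the figures.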
7 Conclusion, Perspectives
In this paper, we adopted a dynamical system perspective and proposed a second-order inertial system with provably fast convergence rates for solving structured convex optimization problems with an affine constraint. One of the most original aspects of our study is the introduction of a damped inertial dynamic involving several time-dependent parameters with specific properties. They make it possible to consider a variable viscosity coefficient (possibly vanishing, thus making the link with the Nesterov accelerated gradient method), as well as variable extrapolation parameters (possibly large) and time scaling. The analysis of the subtle and intricate interplay between these objects has been made possible through a Lyapunov analysis. It would have been quite difficult to undertake such an analysis directly on the discrete algorithmic form. On the other hand, having now gained a deeper understanding within this powerful continuous-time framework, we believe it will serve as a guide to design and analyse a class of inertial ADMM algorithms which can be naturally obtained by appropriate discretization of the dynamics (TRIALS). Their full study would go beyond the scope of this paper and will be the subject of future work. Besides, several other open questions remain, such as the introduction of geometric damping controlled by the Hessian, and the convergence of the trajectories in the general convex constrained case.
Notes
Actually, convexity is not needed here.
Again, convexity is superfluous here.
References
Alvarez, F.: On the minimizing property of a second order dissipative system in Hilbert spaces. SIAM J. Control Optim. 38(4), 1102–1119 (2000). https://doi.org/10.1137/S0363012998335802
Alvarez, F., Attouch, H., Bolte, J., Redont, P.: A second-order gradient-like dissipative dynamical system with Hessian-driven damping. Application to optimization and mechanics. J. Math. Pures Appl. (9) 81(8), 747–779 (2002). https://doi.org/10.1016/S0021-7824(01)01253-3
Apidopoulos, V., Aujol, J.F., Dossal, C.: Convergence rate of inertial forward-backward algorithm beyond Nesterov's rule. Math. Program. 180(1–2, Ser. A), 137–156 (2020). https://doi.org/10.1007/s10107-018-1350-9
Attouch, H.: Variational convergence for functions and operators. Applicable mathematics series. Pitman Advanced Publishing Program (1984). https://books.google.fr/books?id=oxGoAAAAIAAJ
Attouch, H.: Fast inertial proximal ADMM algorithms for convex structured optimization with linear constraint. Minimax Theory Appl. 6(1), 1–24 (2021)
Attouch, H., Balhag, A., Chbani, Z., Riahi, H.: Fast convex optimization via inertial dynamics combining viscous and hessian-driven damping with time rescaling. Evolution Equations & Control Theory (2021). http://aimsciences.org//article/id/92767e45-ae6f-4ff2-9c62-d7ead2d77844
Attouch, H., Cabot, A.: Asymptotic stabilization of inertial gradient dynamics with time-dependent viscosity. J. Differ. Equ. 263(9), 5412–5458 (2017). https://doi.org/10.1016/j.jde.2017.06.024
Attouch, H., Cabot, A.: Convergence rates of inertial forward-backward algorithms. SIAM J. Optim. 28(1), 849–874 (2018). https://doi.org/10.1137/17M1114739
Attouch, H., Cabot, A., Chbani, Z., Riahi, H.: Rate of convergence of inertial gradient dynamics with time-dependent viscous damping coefficient. Evol. Equ. Control Theory 7(3), 353–371 (2018). https://doi.org/10.3934/eect.2018018
Attouch, H., Chbani, Z., Fadili, J., Riahi, H.: First-order optimization algorithms via inertial systems with hessian driven damping. Math. Program. (2020). https://doi.org/10.1007/s10107-020-01591-1
Attouch, H., Chbani, Z., Peypouquet, J., Redont, P.: Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity. Math. Program. 168(1–2, Ser. B), 123–175 (2018). https://doi.org/10.1007/s10107-016-0992-8
Attouch, H., Chbani, Z., Riahi, H.: Fast proximal methods via time scaling of damped inertial dynamics. SIAM J. Optim. 29(3), 2227–2256 (2019). https://doi.org/10.1137/18M1230207
Attouch, H., Chbani, Z., Riahi, H.: Rate of convergence of the Nesterov accelerated gradient method in the subcritical case \(\alpha \le 3\). ESAIM Control Optim. Calc. Var. 25(2), 34 (2019). https://doi.org/10.1051/cocv/2017083
Attouch, H., Chbani, Z., Riahi, H.: Fast convex optimization via a third-order in time evolution equation. Optimization (2020). Preprint available at hal-02432351. https://doi.org/10.1080/02331934.2020.1764953
Attouch, H., Chbani, Z., Riahi, H.: Fast convex optimization via time scaling of damped inertial gradient dynamics. Pure and Applied Functional Analysis (2020). To appear
Attouch, H., Czarnecki, M.O., Peypouquet, J.: Coupling forward-backward with penalty schemes and parallel splitting for constrained variational inequalities. SIAM J. Optim. 21(4), 1251–1274 (2011). https://doi.org/10.1137/110820300
Attouch, H., Goudou, X., Redont, P.: The heavy ball with friction method. I. The continuous dynamical system: global exploration of the local minima of a real-valued function by asymptotic analysis of a dissipative dynamical system. Commun. Contemp. Math. 2(1), 1–34 (2000). https://doi.org/10.1142/S0219199700000025
Attouch, H., Peypouquet, J.: The rate of convergence of Nesterov’s accelerated forward-backward method is actually faster than \(1/k^2\). SIAM J. Optim. 26(3), 1824–1834 (2016). https://doi.org/10.1137/15M1046095
Attouch, H., Peypouquet, J.: Convergence of inertial dynamics and proximal algorithms governed by maximally monotone operators. Math. Program. 174(1–2, Ser. B), 391–432 (2019). https://doi.org/10.1007/s10107-018-1252-x
Attouch, H., Peypouquet, J., Redont, P.: A dynamical approach to an inertial forward-backward algorithm for convex minimization. SIAM J. Optim. 24(1), 232–256 (2014). https://doi.org/10.1137/130910294
Attouch, H., Peypouquet, J., Redont, P.: Fast convex optimization via inertial dynamics with Hessian driven damping. J. Differ. Equ. 261(10), 5734–5783 (2016). https://doi.org/10.1016/j.jde.2016.08.020
Attouch, H., Soueycatt, M.: Augmented Lagrangian and proximal alternating direction methods of multipliers in Hilbert spaces. Applications to game, PDE’s and control. Pac. J. Optim. 5(1), 17–37 (2009)
Aujol, J.F., Dossal, C.: Stability of over-relaxations for the forward-backward algorithm, application to FISTA. SIAM J. Optim. 25(4), 2408–2433 (2015). https://doi.org/10.1137/140994964
Bauschke, H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2nd edn. CMS Books in Mathematics. Springer, New York (2017)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009). https://doi.org/10.1137/080716542
Boţ, R.I., Csetnek, E.R.: An inertial alternating direction method of multipliers. Minimax Theory Appl. 1(1), 29–49 (2016)
Boţ, R.I., Csetnek, E.R.: An inertial Tseng’s type proximal algorithm for nonsmooth and nonconvex optimization problems. J. Optim. Theory Appl. 171(2), 600–616 (2016). https://doi.org/10.1007/s10957-015-0730-z
Boţ, R.I., Csetnek, E.R., Hendrich, C.: Inertial Douglas-Rachford splitting for monotone inclusion problems. Appl. Math. Comput. 256, 472–487 (2015). https://doi.org/10.1016/j.amc.2015.01.017
Boţ, R.I., Csetnek, E.R., László, S.C.: Second-order dynamical systems with penalty terms associated to monotone inclusions. Anal. Appl. (Singap.) 16(5), 601–622 (2018). https://doi.org/10.1142/S0219530518500021
Brézis, H.: Opérateurs maximaux monotones et semi-groupes de contractions dans les espaces de Hilbert. North-Holland Publishing Co., Amsterdam-London; American Elsevier Publishing Co., Inc., New York (1973). North-Holland Mathematics Studies, No. 5. Notas de Matemática (50)
Chambolle, A., Dossal, C.: On the convergence of the iterates of the “fast iterative shrinkage/thresholding algorithm”. J. Optim. Theory Appl. 166(3), 968–982 (2015). https://doi.org/10.1007/s10957-015-0746-4
Davis, D., Yin, W.: Convergence rate analysis of several splitting schemes. In: Splitting methods in communication, imaging, science, and engineering, Sci. Comput., pp. 115–163. Springer, Cham (2016)
Davis, D., Yin, W.: Faster convergence rates of relaxed Peaceman-Rachford and ADMM under regularity assumptions. Math. Oper. Res. 42(3), 783–805 (2017). https://doi.org/10.1287/moor.2016.0827
Goldstein, T., O’Donoghue, B., Setzer, S., Baraniuk, R.: Fast alternating direction optimization methods. SIAM J. Imaging Sci. 7(3), 1588–1623 (2014). https://doi.org/10.1137/120896219
Haraux, A.: Systèmes dynamiques dissipatifs et applications, Recherches en Mathématiques Appliquées [Research in Applied Mathematics], vol. 17. Masson, Paris (1991)
He, X., Hu, R., Fang, Y.: Convergence rates of inertial primal-dual dynamical methods for separable convex optimization problems. arXiv:2007.12428 (2020)
Kang, M., Kang, M., Jung, M.: Inexact accelerated augmented Lagrangian methods. Comput. Optim. Appl. 62(2), 373–404 (2015). https://doi.org/10.1007/s10589-015-9742-8
Kang, M., Yun, S., Woo, H., Kang, M.: Accelerated Bregman method for linearly constrained \(\ell _1\)-\(\ell _2\) minimization. J. Sci. Comput. 56(3), 515–534 (2013). https://doi.org/10.1007/s10915-013-9686-z
May, R.: Asymptotic for a second-order evolution equation with convex potential and vanishing damping term. Turkish J. Math. 41(3), 681–685 (2017). https://doi.org/10.3906/mat-1512-28
Michiels, W., Vyhlídal, T., Huijberts, H., Nijmeijer, H.: Stabilizability and stability robustness of state derivative feedback controllers. SIAM J. Control Optim. 47(6), 3100–3117 (2009). https://doi.org/10.1137/070697136
Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1, Ser. B), 125–161 (2013). https://doi.org/10.1007/s10107-012-0629-5
Nesterov, Y.E.: A method for solving the convex programming problem with convergence rate \(O(1/k^{2})\). Dokl. Akad. Nauk SSSR 269(3), 543–547 (1983)
Patrinos, P., Stella, L., Bemporad, A.: Douglas-Rachford splitting: Complexity estimates and accelerated variants. In: 53rd IEEE Conference on Decision and Control, pp. 4234–4239 (2014). https://doi.org/10.1109/CDC.2014.7040049
Pejcic, I., Jones, C.N.: Accelerated ADMM based on accelerated Douglas-Rachford splitting. In: European Control Conference (ECC), pp. 1952–1957. Aalborg, Denmark (2016)
Polyak, B.T.: Some methods of speeding up the convergence of iterative methods. Ž. Vyčisl. Mat i Mat. Fiz. 4, 791–803 (1964)
Polyak, B.T.: Introduction to Optimization. Translations Series in Mathematics and Engineering. Optimization Software Inc., Publications Division, New York (1987). Translated from the Russian, with a foreword by Dimitri P. Bertsekas
Poon, C., Liang, J.: Trajectory of alternating direction method of multipliers and adaptive acceleration. In: 33rd Conference on Neural Information Processing Systems (NeurIPS). Vancouver, Canada (2019)
Rockafellar, R.T.: Monotone operators associated with saddle-functions and minimax problems. In: Nonlinear Functional Analysis (Proc. Sympos. Pure Math., Vol. XVIII, Part 1, Chicago, Ill., 1968), pp. 241–250. Amer. Math. Soc., Providence, R.I. (1970)
Rockafellar, R.T.: Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Math. Oper. Res. 1(2), 97–116 (1976). https://doi.org/10.1287/moor.1.2.97
Rockafellar, R.T.: Monotone operators and the proximal point algorithm. SIAM J. Control Optim. 14(5), 877–898 (1976). https://doi.org/10.1137/0314056
Shi, B., Du, S.S., Jordan, M.I., Su, W.J.: Understanding the acceleration phenomenon via high-resolution differential equations. arXiv:1810.08907 (2018)
Su, W., Boyd, S., Candès, E.J.: A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. J. Mach. Learn. Res. 17(153), 43 (2016)
Wilson, A.C., Recht, B., Jordan, M.I.: A Lyapunov analysis of momentum methods in optimization. arXiv:1611.02635 (2016)
Zeng, X., Lei, J., Chen, J.: Dynamical primal-dual accelerated method with applications to network optimization. arXiv:1912.03690 (2019)
Attouch, H., Chbani, Z., Fadili, J. et al. Fast Convergence of Dynamical ADMM via Time Scaling of Damped Inertial Dynamics. J Optim Theory Appl 193, 704–736 (2022). https://doi.org/10.1007/s10957-021-01859-2
Keywords
- Augmented Lagrangian
- ADMM
- Damped inertial dynamics
- Convex constrained minimization
- Convergence rates
- Lyapunov analysis
- Nesterov accelerated gradient method
- Temporal scaling